If you expect your database to be reliable, you need the disks used to have Reliable Writestoo. The PostgreSQL documentation recommends a program named diskchecker.pl for this job. It’s the most sensitive single test I’ve found for that. It’s possible to have a disk pass its test but still hit an unclean write in the real world. But if the diskchecker test fails, ever, that drive is useless for a Postgres database. That will corrupt your data one day, only question is how long you have.
diskchecker requires a second system to listen to data written on your potential server drive. That system can be anything that’s capable of running a Perl script and talking to a network. I usually use my laptop to run the listener. First confirm you know your address using a program like ifconfig, and then duplicate the diskchecker.pl repo and run the listener like this:
$ git clone http://gist.github.com/3172656.git $ cd 3172656 $ chmod +x ./diskchecker.pl $ ./diskchecker.pl -l [server] diskchecker.pl: waiting for connection...
My listening server has IP address 192.168.1.110 in the examples here. The examples also assume you’ve put the resulting program in your home directory (~) on the server being tested. To start testing, on the system to be tested switch to the disk you want to test and tell the program to create a moderately sized test file:
$ cd /ssd/test/ $ ~/diskchecker.pl -s 192.168.1.110 create test_file 500 diskchecker: running 1 sec, 0.74% coverage of 500 MB (237 writes; 237/s) diskchecker: running 2 sec, 2.22% coverage of 500 MB (714 writes; 357/s) diskchecker: running 3 sec, 3.22% coverage of 500 MB (1042 writes; 347/s) diskchecker: running 4 sec, 4.64% coverage of 500 MB (1521 writes; 380/s) diskchecker: running 5 sec, 6.04% coverage of 500 MB (2000 writes; 400/s) diskchecker: running 6 sec, 7.48% coverage of 500 MB (2482 writes; 413/s) diskchecker: running 7 sec, 8.79% coverage of 500 MB (2936 writes; 419/s) diskchecker: running 8 sec, 9.72% coverage of 500 MB (3260 writes; 407/s) diskchecker: running 9 sec, 11.07% coverage of 500 MB (3737 writes; 415/s) diskchecker: running 10 sec, 12.39% coverage of 500 MB (4207 writes; 420/s) diskchecker: running 11 sec, 12.90% coverage of 500 MB (4385 writes; 398/s) diskchecker: running 12 sec, 13.55% coverage of 500 MB (4621 writes; 385/s) diskchecker: running 13 sec, 14.42% coverage of 500 MB (4941 writes; 380/s) diskchecker: running 14 sec, 15.55% coverage of 500 MB (5377 writes; 384/s) diskchecker: running 15 sec, 16.72% coverage of 500 MB (5830 writes; 388/s) diskchecker: running 16 sec, 17.45% coverage of 500 MB (6114 writes; 382/s) diskchecker: running 17 sec, 18.19% coverage of 500 MB (6397 writes; 376/s) diskchecker: running 18 sec, 19.30% coverage of 500 MB (6828 writes; 379/s)
These write rates are actually a useful performance test. diskchecker.pl uses a fairly large block size, and not too many storage devices can manage hundreds of those per second. This test was against a SSD that legitimately can support hundreds of writes per second. Here’s the sort of diskchecker write speed you’ll see on a regular 7200RPM disk that’s configured correctly:
$ ~/diskchecker.pl -s 192.168.1.110 create test_file 500 diskchecker: running 1 sec, 0.01% coverage of 500 MB (2 writes; 2/s) diskchecker: running 2 sec, 0.10% coverage of 500 MB (33 writes; 16/s) diskchecker: running 3 sec, 0.21% coverage of 500 MB (68 writes; 22/s) diskchecker: running 4 sec, 0.31% coverage of 500 MB (100 writes; 25/s) diskchecker: running 5 sec, 0.42% coverage of 500 MB (134 writes; 26/s)
If you are testing a traditional disk drive, and you see >100 writes/s here, you can tell already that there’s something wrong. Your write cache must not be configured correctly. Excessively high speeds are one sign that’s happening, and the resulting test verification is expected to fail.
Now remove the power cord(s) from the back of the test server. If the power supply has a switch on it, you might flip that instead, that’s just as good. Don’t shut the system down normally, kill a process, or otherwise do a less destructive shutdown. That defeats the purpose of the test. The monitoring server will look like this after you’ve pulled the plug:
[server] diskchecker.pl: waiting for connection... [192.168.1.2] New connection [192.168.1.2] 1000 writes [192.168.1.2] 2000 writes [192.168.1.2] 3000 writes [192.168.1.2] 4000 writes [192.168.1.2] 5000 writes [192.168.1.2] 6000 writes [192.168.1.2] 7000 writes [192.168.1.2] 8000 writes [192.168.1.2] 9000 writes [192.168.1.2] 10000 writes [192.168.1.2] 11000 writes [192.168.1.2] 12000 writes [192.168.1.2] 13000 writes [192.168.1.2] Writing state file... [192.168.1.2] Done.
Now start the test server up again. Once it’s finished booting, check the file it was writing. Here’s an example showing a passing test:
$ cd /ssd/test $ ~/diskchecker.pl -s 192.168.1.110 verify test_file verifying: 0.01% verifying: 20.61% verifying: 65.27% verifying: 100.00% Total errors: 0
What does a failing test look like? Here’s a 7200RPM drive with a write cache that loses data when you pull the plug:
$ ~/diskchecker.pl -s 192.168.1.2 create test_file 500 diskchecker: running 1 sec, 0.34% coverage of 500 MB (108 writes; 108/s) diskchecker: running 2 sec, 0.93% coverage of 500 MB (300 writes; 150/s) diskchecker: running 3 sec, 1.69% coverage of 500 MB (545 writes; 181/s) diskchecker: running 4 sec, 2.27% coverage of 500 MB (730 writes; 182/s) diskchecker: running 5 sec, 2.77% coverage of 500 MB (893 writes; 178/s) diskchecker: running 6 sec, 3.18% coverage of 500 MB (1029 writes; 171/s) diskchecker: running 7 sec, 3.58% coverage of 500 MB (1159 writes; 165/s) diskchecker: running 8 sec, 4.32% coverage of 500 MB (1404 writes; 175/s) diskchecker: running 9 sec, 4.58% coverage of 500 MB (1494 writes; 166/s) [Plug pull, reboot] $ cd /data/test $ ~/diskchecker.pl -s 192.168.1.110 verify test_file verifying: 2200.00% Error at page 22, 7 seconds before end. Error at page 37, 9 seconds before end. Error at page 38, 1 seconds before end. ... Error at page 31942, 7 seconds before end. Error at page 31982, 6 seconds before end. Error at page 31988, 1 seconds before end. verifying: 3198800.00% Total errors: 1689 Histogram of seconds before end: 0 31 1 219 2 163 3 128 4 122 5 229 6 221 7 112 8 103 9 302 10 59
It’s much harder now than it used to be to find a lying storage stack, which is good news. On Linux, distributions released with the 2.6.32 kernel or later (Debian Squeeze, RedHat Enterprise 6, Ubuntu 10.04) have a reliable fsync function when running against a regular disk drive. See ext4 finally doing the right thing for details.
When newer Linux kernels write data, if the OS is told to make sure it’s durable on disk, it actually flushes the disk’s write cache. That’s good enough to be durable for database use on most regular disks. There are still some SSDs that don’t honor the write cache flushing call. It’s an economic issue; flushing small blocks of data all the time lowers the drive’s lifetime, and some common applications do that now. As an example, when fsync calls were added to Firefox 3, there were many complaints about how often that pushed data to disk expensively. Typically a SSD can nend to erase a relatively large block, often 128KB, even if you only write a single byte of new data and then force it to storage. Drives accordingly prefer to cache small writes and write the data in larger chunks. Every time you erase a block of flash it burns up a little bit of the drive’s total lifetime.
The git clone of diskchecker.pl works on some of my systems, but it does fail on my one RHEL6 test server like this:
$ git clone http://gist.github.com/3172656.git Initialized empty Git repository in /home/gsmith/3172656/.git/ error: RPC failed; result=22, HTTP code = 400
Followed by the command not completing. If that happens to you, it’s not your mistake. You can try cloning the repository on another machine, or use the "view raw" link below the source code on the livejournal site. At the time I’m writing this, here’s a direct URL
If your listening server doesn’t have an open firewall, you might see this error when trying to connect to it:
$ ~/3172656/diskchecker.pl -s 192.168.1.2 create test_file 500 Couldn't connect to host:port of '192.168.1.2:5400'
You’ll need to poke a hole in your firewall to make it work. On the RHEL6 system I saw that error on, I turned the firewall off temporarily like this as root:
# sudo /sbin/service iptables stop
Another easy to make mistake is to start the diskchecker.pl monitoring server listen process in a terminal session from the server you’re testing. Since you’re on that server anyway, you might start there, then ssh out to another system to create the listener program. This probably will not work. When you disconnect power, the terminal session will close, and that can close any processes that are running at that point. The verify step will then fail because there’s nothing left for it to talk to. To fix this, you can start the listener program as a background process on most UNIX-ish systems like this::
$ nohup ./diskchecker.pl -l &
That will keep the program running even after you logout or disconnect. You’ll have to manually kill it to make it go away. Another possibility is to use the screen program to create a persistent terminal session.