Tuning Disk Timeouts on Virtual Machines

Dedicated servers are important for some databases, and I write a lot more about those difficult cases than the easy ones. But virtual machines have big management advantages. Recently I started moving a lot of my personal dedicated servers onto one larger shared box. It went terribly (my personal tech infrastructure lives out Murphy's Law every day), and this blog has been down for much of the last month. But I'm stubborn and always learn something from these battles, and this time the lesson was all about disk timeouts on virtual machines and similar cloud deployments. If you'd like to skip ahead to the solution, I resolved the problems with advice from a blog entry on VM disk timeouts. The way things fell apart has its own story.

My project seemed simple enough: start with a Windows box (originally dual core/2GB RAM) and three Linux boxes handling small hosting jobs (database/git/web/files). Turn them all into VMs, move them onto a bigger and more reliable server (8 cores, 16GB of RAM, RAID1), and run them all there. Disk throughput wasn't going to be great, but all the real work these systems do fits in RAM. How bad could it be?

Well, really bad is the answer. Every system crashed intermittently, each failure taking a form unique to its operating system. But intermittent problems get much easier to find when they happen more frequently. When one of the Linux systems started crashing constantly, I dropped everything else to look at it. First there were some read and write errors:

Sep 11 13:09:48 wish kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Sep 11 13:09:48 wish kernel: ata3.00: cmd 61/08:00:48:12:65/00:00:02:00:00/40 tag 0 ncq 4096 out
Sep 11 13:09:48 wish kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 11 13:09:48 wish kernel: ata3.00: status: { DRDY }
Sep 11 13:09:48 wish kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 11 13:09:48 wish kernel: ata3.00: cmd 60/40:28:40:9f:68/00:00:02:00:00/40 tag 5 ncq 32768 in
Sep 11 13:09:48 wish kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 11 13:09:48 wish kernel: ata3.00: status: { DRDY }
Sep 11 13:09:48 wish kernel: ata3: hard resetting link
Sep 11 13:09:48 wish kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 11 13:09:48 wish kernel: ata3.00: qc timeout (cmd 0xec)
Sep 11 13:09:48 wish kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Sep 11 13:09:48 wish kernel: ata3.00: revalidation failed (errno=-5)
Sep 11 13:09:48 wish kernel: ata3: limiting SATA link speed to 1.5 Gbps
Sep 11 13:09:48 wish kernel: ata3: hard resetting link
Sep 11 13:09:48 wish kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Sep 11 13:09:48 wish kernel: ata3.00: configured for UDMA/133
Sep 11 13:09:48 wish kernel: ata3.00: device reported invalid CHS sector 0
Sep 11 13:09:48 wish kernel: ata3: EH complete

Most of the time when a SATA device gives an error, the operating system will reset the whole SATA bus it's on to try to regain normal operation. In this example that happens, and Linux drops the link speed (from 3.0Gbps to 1.5Gbps) too. That of course makes the disk I/O problem worse, because transfers are now less efficient. Awesome.

To follow this all the way to the crash, more errors start popping up. Next Linux disables NCQ, further pruning the feature set it relies on in hopes the disk works better that way. PC hardware has a long history of device bugs in advanced features, so falling back to a smaller feature set is a common strategy when things start failing:

Sep 11 13:11:36 wish kernel: ata3: hard resetting link
Sep 11 13:11:36 wish kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Sep 11 13:11:36 wish kernel: ata3.00: configured for UDMA/133
Sep 11 13:11:36 wish kernel: ata3: EH complete
Sep 11 13:11:36 wish kernel: ata3: illegal qc_active transition (00000003->000003f8)
Sep 11 13:11:36 wish kernel: ata3.00: NCQ disabled due to excessive errors
Sep 11 13:11:36 wish kernel: ata3.00: exception Emask 0x2 SAct 0x3 SErr 0x0 action 0x6 frozen
Sep 11 13:11:36 wish kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 11 13:11:36 wish kernel: ata3.00: cmd 60/28:00:e8:e4:95/00:00:00:00:00/40 tag 0 ncq 20480 in
Sep 11 13:11:36 wish kernel: res 00/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
Sep 11 13:11:36 wish kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 11 13:11:36 wish kernel: ata3.00: cmd 60/08:08:58:e2:90/00:00:01:00:00/40 tag 1 ncq 4096 in
Sep 11 13:11:36 wish kernel: res 00/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)

Finally, the system tries to write something to swap. When that fails too it hits a kernel panic, and then the server crashes. I can't show you the text because the actual crash never made it into the server log file; that write failed too, and then the server was done. Here's a console snapshot showing the end:

VM Disk Timeout Crash

Now, when this happens once or twice, you might write it off as a fluke. But it started happening every few days to that one VM. And the Windows VM was going through its own mysterious crashes. It would be running along fine, then the screen went black and Windows was simply dead: no blue screen, no error in the logs. I could still reach the VM with VirtualBox's RDP server, but the screen stayed black and it didn't respond to input. Whatever the mysterious issue was, it was impacting all of my VM guests.

On the day the one server was crashing all of the time, I looked at what was different. I noticed that the VM host itself was running a heavy cron job at the time. Many Linux systems run a nightly updatedb task, the program that maintains the database used by the locate command. When the VM host is busy, I/O to all of the guest VMs slows down too. And I had moved a 2TB drive full of data into that server the previous day, so reindexing the whole thing in updatedb was taking a long time. That was what had changed: the updatedb job was hammering the disk with reads.

What was happening here was an I/O timeout. Operating systems give disks a relatively small amount of time to answer requests. When those timeouts, typically 30 seconds long, expire, the OS does all of this crazy bus reset and renegotiation work. With the dueling VMs I had, one of the two systems sharing a disk could easily take longer than 30 seconds to respond while the other was pounding the drive. Since I/O contention the kernel can't see is rare on dedicated hardware, the timeout path is poorly exercised kernel code, and hitting it can easily turn into a crash. Thus my Linux kernel panics and mysterious Windows black screens.
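On Linux you can see the timeout the kernel is currently enforcing, because it's exposed per disk in sysfs. Here's a small sketch; the paths assume the usual /sys layout, and the helper function plus its optional root argument are my own additions so it can be rehearsed against a mock directory tree:

```shell
# Print the kernel's I/O timeout (in seconds) for every SCSI/SATA disk.
# Takes an optional sysfs root so the logic can be tried on a mock tree.
show_disk_timeouts() {
    root="${1:-/sys}"
    for t in "$root"/block/sd*/device/timeout; do
        if [ -e "$t" ]; then
            echo "$t: $(cat "$t")"
        fi
    done
}

show_disk_timeouts

# Raising one disk's timeout until the next reboot (run as root):
# echo 180 > /sys/block/sda/device/timeout
```

The echo-into-sysfs form is handy for testing a value before making it permanent, since it doesn't survive a reboot.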

Once I realized the problem, the fix was easy. There's a fantastic article on VM disk timeouts that covers how to do this sort of tuning for every operating system I was worried about. I followed its udev-based approach for my recent Linux systems, changing them to 180 seconds. (I even added a comment at the end suggesting a different way to test that it is working.) Then I hit regedit on Windows to increase its timeout from the 60 seconds it was set to:
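For reference, a udev rule for this sort of thing can look roughly like the following. This is a sketch rather than the exact rule from the linked article: the file name is arbitrary (udev processes rule files in lexical order), and the match pattern assumes disks named sdX:

```
# /etc/udev/rules.d/99-vm-disk-timeout.rules
# Set a 180-second I/O timeout on each SCSI disk as udev sees it appear.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
  RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"
```

The advantage of doing it through udev instead of a boot script is that the timeout is also applied to disks hotplugged later.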

Windows Disk Timeout Registry Entry

You'll find similar advice from VM vendors too, like in VMware's Linux timeout guide. Some VM programs will tweak the timeouts for you when you install their guest tools. But the value I settled on that made all my problems go away was 180 seconds, far beyond what even VM setup software normally sets by default.
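On the Windows side, the knob behind that registry screenshot is the disk class driver's TimeOutValue, a DWORD holding the timeout in seconds. If you'd rather script it than click through regedit, something like this should set the same value (run from an elevated prompt; 180 matches the number I settled on):

```
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" ^
    /v TimeOutValue /t REG_DWORD /d 180 /f
```

A reboot is the safe way to make sure the new value is picked up.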

You can also find advice about this from NAS manufacturers like NetApp, although I wasn't able to find a recommendation of theirs I could link to here (they all required a support account to access). NAS hardware can be heavily overloaded at times. And even with an expensive, redundant model, there can be a long delay when a component dies and its replacement takes over. When a disk dies, even inside a NAS, some of this renegotiation work can happen. Tuning individual disk timeouts in a RAID volume is its own long topic.

The other lesson I learned here is that swap is a lot more expensive on a VM than I had estimated, because the odds and the impact of shared disk contention are both high. For example, even though the working memory of the server hosting this blog usually fit into the 512MB I initially gave it, the minute something heavy hit the server, performance tanked. And many of the performance events were correlated between servers: the Linux guest VM was running updatedb at the exact same time each night as the Linux host system, because both used the same default crontab schedule. No wonder they were clashing and hitting timeouts during that period on many nights. I improved that whole area by just giving the one Linux VM 2GB of RAM, so swapping was much less likely even during things like updatedb. That change made the server feel much more responsive during concurrent busy periods on other hosts. You can't just tune servers for average performance; you have to think about their worst-case periods too.
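The crontab collision above also has a direct fix: stagger the guests' nightly schedule so they stop lining up with the host's. Here's a sketch for a Debian-style /etc/crontab, where the stock daily slot is 6:25; the 3:40 replacement time is an arbitrary choice of mine, and the function takes the file path as an argument so it can be rehearsed on a copy first:

```shell
# Move the run-parts /etc/cron.daily entry from the distribution default
# ("25 6", i.e. 06:25, tab-separated fields) to 03:40.
stagger_daily() {
    sed -i 's/^25 6\t/40 3\t/' "$1"
}

# On the real system (as root): stagger_daily /etc/crontab
```

Spreading each guest's heavy nightly jobs across different hours is cheap insurance against exactly the kind of correlated I/O storm that was tripping my timeouts.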
