Seeking Revisited: Intel 320 Series and NCQ

Running accurate database benchmark tests is hard.  I’ve managed to publish a good number of them without being embarrassed by errors in procedure or results, but today I have a retraction to make.  Last year I did a conference talk called “Seeking PostgreSQL” that focused on worst-case situations for storage.  And that, it turns out, had a giant error.  The results for the Intel 320 Series SSD were much lower in some cases than they should have been, because the drive’s NCQ feature wasn’t working properly.  When presenting this talk I had a few people push back that the results looked weird, and I was suspicious too.  I have a correction to publish now, and I think the way this slipped by me is itself interesting.  The full updated “Seeking PostgreSQL” talk is also available, with all of the original graphs followed by an “Oops!” section showing the new data.

Native Command Queuing (NCQ) is an important optimization for seek-heavy workloads.  When trying to optimize work for a mechanical disk drive, it’s very important to know where the drive head currently is when deciding where to go next.  If you have a read for that same area of the drive in the queue, you want to service that one now, get the I/O out of the way while you’re nearby, and then move to another physical area of the disk.
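
To make the reordering idea concrete, here is a toy sketch in Python.  It treats the disk as a simple line of block addresses and always services the closest pending request next.  The addresses and function names are made up for illustration, and real drive firmware also has to account for rotational position, so treat this as a model of the idea rather than how any particular drive implements NCQ.

```python
def total_seek_distance(start, requests):
    """Total head travel when servicing requests in the given order."""
    position, travelled = start, 0
    for target in requests:
        travelled += abs(target - position)
        position = target
    return travelled


def nearest_first(start, requests):
    """Greedy reordering: always service the closest pending request next."""
    pending = list(requests)
    position, ordered = start, []
    while pending:
        closest = min(pending, key=lambda target: abs(target - position))
        pending.remove(closest)
        ordered.append(closest)
        position = closest
    return ordered


# Hypothetical block addresses, in the order the requests arrived.
queue = [900, 110, 850, 120, 400]
head = 100

fifo_cost = total_seek_distance(head, queue)
reordered = nearest_first(head, queue)
reordered_cost = total_seek_distance(head, reordered)

print("arrival order travel:", fifo_cost)        # 3340
print("reordered queue:", reordered)             # [110, 120, 400, 850, 900]
print("reordered travel:", reordered_cost)       # 800
```

With only one request visible at a time the drive has no choice but to follow arrival order, which is why a queue that never fills up costs so much on seek-heavy work.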

However, on an SSD, you might think that re-ordering commands isn’t that important.  If reads are always inexpensive, taking a constant and small period of time on a flash device, their order doesn’t matter, right?  Well, that’s wrong on a few counts.  The idea that reads always take the same amount of time on an SSD is a popular misconception.  There’s a bit of uncertainty around what else is happening in the drive.  Flash is organized into blocks larger than a single database read.  What happens if you are reading 8K of a block that is being rewritten right now, because someone is updating another 8K section?  Coordinating that is likely to pause your read for a moment.  It doesn’t take much lag at SSD speeds to result in a noticeable slowdown.  Partially due to contention concerns, and partially due to the nature of I/O, keeping the command queue full is still very important to keeping the drive usefully busy all of the time.

On the 120GB Intel 320 Series drive I used for testing, the drive tops out at around 28MB/s of transfers if you’re not pipelining requests via NCQ.  It goes a whole lot faster than that once the queue is full:

You might think such a huge difference would be immediately obvious in all test results, right?  It’s not though, and that’s how the error slipped by me.  Normally all of my tests are done on two similar machines, and then I validate that they match.  I did that for some of the Seeking PostgreSQL results, such as the write-heavy tests.  For comparison, here are results from the database’s pgbench tool executing its standard, TPC-B-like write test:

The write rate test is barely impacted by whether NCQ is turned on or off, so it wasn’t obvious that one drive had the feature enabled while the other didn’t.  I was using this to validate that my test server was operating similarly to a second system with one of these drives.  But I picked the one test here where NCQ doesn’t really matter.
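
A cheaper way to catch this kind of mismatch than comparing benchmark numbers is to compare the queue depth each machine reports.  The sketch below assumes a Linux system that exposes the value at `/sys/block/<device>/device/queue_depth`, and it assumes that a depth of 1 means commands are not being queued.  The device names and helper functions are placeholders of mine, not part of the original test setup.

```python
from pathlib import Path


def read_queue_depth(device, sysfs_root="/sys/block"):
    """Return the queue depth reported for a device, or None if unavailable."""
    path = Path(sysfs_root) / device / "device" / "queue_depth"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Missing device, unreadable file, or unexpected contents.
        return None


def ncq_looks_active(depth):
    """Assumption: a depth above 1 means the drive is accepting queued commands."""
    return depth is not None and depth > 1


if __name__ == "__main__":
    # Placeholder device names; substitute the drives under test.
    for device in ("sda", "sdb"):
        depth = read_queue_depth(device)
        status = "queueing" if ncq_looks_active(depth) else "NOT queueing"
        print(f"{device}: queue_depth={depth} ({status})")
```

Running the same check on both test machines before trusting that they match would have flagged the difference here, regardless of which benchmark was used for validation.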

The general conclusion of the original presentation is that the Intel SSDs are much faster than regular disks, but still a good bit slower than the more expensive FusionIO flash.  That I knew to be true from real-world workloads, so I’d have been surprised if things didn’t turn out that way.  But it turns out that is true whether or not NCQ is working.  The Intel 320 line in these results is better with NCQ than without, but the relative ranking isn’t any different now.  It’s just the case that the Intel SSD is more competitive in some tests than I gave it credit for.

The seeking read results show a much larger gap with NCQ enabled:

You might notice a small drop in TPS on that brown line at low scales.  That’s a test error I can’t correct for at this point.  The original server I used for these tests was gone before I figured out what was wrong.  The replacement has the same type of CPU chip, but it’s clocked a bit slower.  (It was an Intel i7 870; now it is an Intel i7 860.)  That’s why the CPU-limited results at low scales dropped.  On any of the I/O-limited tests, that original CPU and the slower new one are almost identical, so I still think I’m being fair here.

Finally, I turned the random seek throughput into a business-oriented question by asking how long it would take to refill all of RAM after something like a server reboot.  My original test placed the Intel drive as taking 5 minutes to read 16GB of random data with 32 clients reading.  This is exactly what NCQ helps with, and the correctly working drive only takes 1 minute to refill cache:

Thankfully, I don’t have to say I was completely wrong before.  The relative ranking of the various storage options is still the same.  The FusionIO drive I tested was and still is at the top of the heap, especially if you need high write throughput.  But the worst case for reading on the Intel 320 series drives (and the very similar 710 series) is much closer to specifications than my tests showed.
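
The cache refill times quoted above can be turned back into rough throughput numbers.  This is back-of-the-envelope arithmetic only: it assumes 16GB means 16 GiB, that every read is one 8 KiB database page, and that the 5 minute and 1 minute figures are rounded, so the results are estimates rather than measurements.

```python
PAGE_KIB = 8  # assumed size of one random database read


def refill_throughput_mib_s(cache_gib, minutes):
    """Average MiB/s needed to read cache_gib in the given number of minutes."""
    return cache_gib * 1024 / (minutes * 60)


def refill_reads_per_s(cache_gib, minutes):
    """Average page-sized random reads per second over the same interval."""
    pages = cache_gib * 1024 * 1024 / PAGE_KIB
    return pages / (minutes * 60)


before = refill_throughput_mib_s(16, 5)   # original, NCQ not working
after = refill_throughput_mib_s(16, 1)    # corrected, NCQ working

print(f"without NCQ: {before:.1f} MiB/s, {refill_reads_per_s(16, 5):.0f} reads/s")
print(f"with NCQ:    {after:.1f} MiB/s, {refill_reads_per_s(16, 1):.0f} reads/s")
print(f"speedup:     {after / before:.1f}x")
```

Under those assumptions the drive goes from roughly 55 MiB/s to roughly 273 MiB/s of random reads, a 5x difference that the write-heavy validation test never exposed.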

With this old territory sorted out, next up I’m testing Intel’s latest enterprise drive, the DC S3700, which replaces the 710 drives in their lineup.  Initial test results look great so far; detailed ones are coming soon.