bonnie++-1.03e Direct I/O Compiler Errors

bonnie++ is the simplest tool that can validate whether the disks in a server are working at their expected speed. It's not always accurate, and some of its tests are meaningless nowadays. But the sequential read and write rates it computes are usually right for smaller servers, and it automates details like sizing the test file to exceed RAM caching. The main thing to watch out for is that large servers with many disks (or very fast SSDs) may max out the CPU bonnie++ is running on before the true maximum rate is found. bonnie++ has been split into a stable 1.0 and a development 2.0 version for a while; the 2.0 branch can use more CPUs to improve on that issue. I still see enough quirkiness in the new branch that I settle for the 1.0 series on any server without fast storage.
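
For reference, the whole test is normally a one-liner. Here's a minimal sketch rather than an exact recipe; the directory and user below are placeholders, -d points at the filesystem you want to test, and -u is only needed when starting bonnie++ as root:

$ bonnie++ -d /mnt/testvol -u nobody

By default it sizes the test file at double the server's RAM, which is the detail that keeps caching from inflating the results.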

Except today, when I was trying to get bonnie++-1.03e running on a Mac laptop. The latest update to the program added support for Direct I/O. That looks like it was tested on Linux, and it works fine on some systems. But on several other UNIX-like systems, including Mac OS X, NetBSD, and Solaris, bonnie++-1.03e won't compile. You get this error instead:

$ make
g++ -O2  -DNDEBUG -Wall -W -Wshadow -Wpointer-arith -Wwrite-strings -pedantic -ffor-scope   -c bon_io.cpp
bon_io.cpp: In member function ‘int CFileOp::m_open(const char*, int, bool)’:
bon_io.cpp:398: error: ‘O_DIRECT’ was not declared in this scope
make: *** [bon_io.o] Error 1

There are two workarounds for this problem. The easier one is to roll back to version 1.03d. The only thing added by 1.03e is the Direct I/O feature that doesn't compile.

Since this sort of thing ticks me off, I wanted to fix it in a way I can roll forward into newer versions too. NetBSD packaging added a commit by Jaromir Dolecek to fix the problem. The relevant code is in patch 1 and patch 2. Since the fixes are so small, I combined them both into a single git-created context diff, and it's easier to paste it here than store it somewhere:

diff --git a/bon_io.cpp b/bon_io.cpp
new file mode 100644
index a9eab20..ac8d6e6
*** a/bon_io.cpp
--- b/bon_io.cpp
*************** CFileOp::CFileOp(BonTimer &timer, int fi
*** 318,324 ****
--- 318,326 ----
   , m_isopen(false)
   , m_name(NULL)
   , m_sync(use_sync)
+ #ifdef O_DIRECT
   , m_use_direct_io(use_direct_io)
+ #endif
   , m_chunk_bits(chunk_bits)
   , m_chunk_size(1 << m_chunk_bits)
   , m_chunks_per_file(Unit / m_chunk_size * IOFileSize)
*************** int CFileOp::m_open(CPCCHAR base_name, i
*** 393,403 ****
--- 395,407 ----
      createFlag = OPEN_ACTION_CREATE_IF_NEW | OPEN_ACTION_REPLACE_IF_EXISTS;
  #else
      flags = O_RDWR | O_CREAT | O_EXCL;
+ #ifdef O_DIRECT
      if(m_use_direct_io)
      {
        flags |= O_DIRECT;
      }
  #endif
+ #endif
    }
    else
    {
diff --git a/bon_io.h b/bon_io.h
new file mode 100644
index cbce2b1..3772140
*** a/bon_io.h
--- b/bon_io.h
*************** private:
*** 33,39 ****
--- 33,41 ----
    bool m_isopen;
    char *m_name;
    bool m_sync;
+ #ifdef O_DIRECT
    bool m_use_direct_io;
+ #endif
    const int m_chunk_bits, m_chunk_size;
    int m_chunks_per_file, m_total_chunks;
    int m_last_file_chunks;
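
To try it out, save the combined diff to a file (the name direct-io.patch below is just my choice) at the top of the bonnie++-1.03e source tree and feed it to patch:

$ cd bonnie++-1.03e
$ patch -p1 < direct-io.patch    # -p1 strips the a/ and b/ prefixes from the file names
$ make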

Apply that patch, and 1.03e will compile on my Mac. I don't really need the 1.03e features yet, but one day there might be something in a later 1.03f or 1.04 that I do want. Then it will be nice to have a patch like this one ready to fix the problem.

Making make make PostgreSQL 9.3 on FreeBSD 8.2

Building PostgreSQL on UNIX-like systems from source code can be hard the first time. But often, once you extract the code it only takes three steps common to many C programs:

./configure
make
make install

With PostgreSQL 9.3.0 officially released, I've been testing the final source code on some systems I'd considered it a bit too raw for until now. And I ran into two problems: one a real issue that I found a simple fix for, the other self-induced (but still painful).

This was on a FreeBSD 8.2 based system. The main lesson here is pretty simple: when building PostgreSQL on FreeBSD, always prefer gmake to regular make. There is sometimes a BSD make compatible makefile in a directory, but that's not the case for all of the contrib module source code. If you're installing with FreeBSD packages or their Ports system, all of this is taken care of for you; this is mainly an issue for early adopters and PostgreSQL code developers.
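
In practice that means the usual three steps look like this on FreeBSD instead (the --prefix setting is only an example install location, not a requirement):

$ ./configure --prefix=/usr/local/pgsql
$ gmake
$ gmake install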

Checkout vs. release source code files

When you build a checkout of PostgreSQL from source code, it expects a few more development tools than the officially packaged versions do. A random git checkout needs programs like bison and flex installed; those are used to rebuild the database's SQL parser. One step of the release build process does that for you and publishes the results. That means a release source code archive, sometimes called a “tarball”, requires fewer development tools.

Partly due to this issue (which I’ll return to again at the end), when new versions of PostgreSQL come out–like the just released version 9.3–I will test them on systems that don’t have a full development tool set. You still need a lot of things to build the database from source: a C compiler, perl, the make program, auto-configuration tools, and some other things. On this FreeBSD system I installed all of those one at a time until I had a minimal working set, and the main database code built and installed fine.
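
A system of that vintage still installs packages with the old pkg_add tool, so the pieces that aren't in the base system go in roughly like this; treat the package names as a sketch from memory, since they can vary between FreeBSD releases:

$ pkg_add -r gmake    # GNU make, installed under the name gmake
$ pkg_add -r perl     # exact package name may differ on your release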

contrib modules

Once the main database is installed, there is also a set of additional programs called the contrib modules that you get source code for. The idea is that these programs aren't required for the core database, but they can add extra features you might want. Today I needed the pgbench contrib program. Normally you just go into its directory and do the usual make / make install combo. But not today:

$ cd contrib/pgbench
$ ls
Makefile pgbench.c
$ make
"Makefile", line 12: Need an operator
"Makefile", line 15: Could not find
"Makefile", line 16: Need an operator
"../../src/Makefile.global", line 45: Missing dependency operator
"../../src/Makefile.global", line 48: Need an operator

make is one of the lowest level C programming utilities. It lets you compile individual programs and specify how they fit together. If you use a server's default make program, the build instructions go in a file named “Makefile” or “makefile”, as done here. To understand what's gone wrong with it, you need some UNIX history.

make versions

FreeBSD traces its style back to the older BSD UNIX distribution, and some of the tools it provides are still based on expectations set by BSD UNIX. FreeBSD's make is one of those tools. It has its own specific input format too: naming a file “BSDmakefile” says you're expecting the feature set BSD make supports. But here it's a generic Makefile that BSD make is choking on.

When you're on a Linux system, you don't get a BSD make utility. Instead you get GNU Make. That has some common heritage with BSD make, but the two programs accept different command line parameters and different syntax inside their build instructions. There are a lot of utility pairs like this. As another example, FreeBSD has its own tar utility for making file archives. There is also a GNU tar, which some prefer to the BSD one, but you won't find it on most FreeBSD systems.

GNU Make is also installed on most FreeBSD systems, because it's a building block for so many popular GNU packages. But on FreeBSD the program is called gmake, to tell it apart from the default, regular BSD make. Because the programs are a little different, GNU make also looks for an input file named “GNUmakefile”; that's the name to use for build instructions that only work with GNU make.
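
If you're ever unsure which one a given command is, checking where each program lives and asking the GNU one to identify itself settles it quickly:

$ which make gmake    # base BSD make lives in /usr/bin, gmake from ports in /usr/local/bin
$ gmake --version     # GNU Make announces itself with a version banner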

But PostgreSQL needs GNU Make to build correctly. In the root of the source code tree, it has a Makefile for compatibility with other system make programs, but it’s just a wrapper. Its comments read like this:

$ more Makefile
# The PostgreSQL make files exploit features of GNU make that other
# makes do not have. Because it is a common mistake for users to try
# to build Postgres with a different make, we have this make file
# that, as a service, will look for a GNU make and invoke it, or show
# an error message if none could be found.

If you look inside of contrib/pgbench, there is a Makefile there. That means the BSD make program will try to read it. But the build instructions in there are actually written for GNU make. That’s why BSD make is choking on them.

Compiling with GNU make

It's taken a long explanation to get here, but the fix is simple. Just run gmake directly instead:
$ cd contrib/pgbench
$ gmake
$ gmake install

gmake includes Makefile in the list of file names it looks for, so it then builds everything fine. If you wanted to build and install every contrib module, you'd do that like this:
$ cd contrib
$ gmake
$ gmake install

Parser source rebuilds

One more related note while I'm on the subject of build trivia. I said above that flex and bison aren't needed with official release source code archives. Building without them gives warnings like this:

configure: WARNING:
*** Without Bison you will not be able to build PostgreSQL from Git nor
*** change any of the parser definition files. You can obtain Bison from
*** a GNU mirror site. (If you are using the official distribution of
*** PostgreSQL then you do not need to worry about this, because the Bison
*** output is pre-generated.)
checking for flex... configure: WARNING:
*** The installed version of Flex, /usr/bin/flex, is too old to use with PostgreSQL.
*** Flex version 2.5.31 or later is required, but this is /usr/bin/flex version 2.5.4.
configure: WARNING:
*** The installed version of Flex, /usr/bin/lex, is too old to use with PostgreSQL.
*** Flex version 2.5.31 or later is required, but this is /usr/bin/lex version 2.5.4.
no
configure: WARNING:
*** Without Flex you will not be able to build PostgreSQL from Git nor
*** change any of the scanner definition files. You can obtain Flex from
*** a GNU mirror site. (If you are using the official distribution of
*** PostgreSQL then you do not need to worry about this because the Flex
*** output is pre-generated.)

But that's fine; they are just warnings. And if you use make clean to tidy up that source code directory tree, it won't delete the pre-generated parser files, and compiles will still be fine.
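
If you want to confirm a release tarball really does ship that pre-generated output, the C files built from the grammar and scanner definitions are the ones to look for. gram.c and scan.c are the main two, based on my reading of how the parser directory is laid out:

$ ls src/backend/parser/gram.c src/backend/parser/scan.c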

Aside: if you read these warnings carefully, you’ll see why I didn’t even try to compile PostgreSQL 9.3 here from a source code checkout, one without the parser files. The problem with this server isn’t that bison and flex are missing. It has versions of those tools that are too old. Missing packages are easy to add if you’re root. Packages that need a new version can be a much harder problem. I’ve done that before to update flex on RHEL, and that is no fun at all.

There are two other options available beyond this simplest clean though, using names similar to those available for gcc builds. There's make distclean, which eliminates everything that configure generated. This is the right type of cleanup to use if you have added new software to the server and now want to re-run configure so it finds the new tools. It also works for cleaning up before changing the options passed to configure.

There’s a third clean option too:

make maintainer-clean

The goal of maintainer-clean is to eliminate everything generated by the source code build process, and I'd gotten into the habit of always using it when I needed to wipe out a now obsolete configure session. But this is a bad idea if you're trying to build PostgreSQL on a system without a full development toolset. maintainer-clean wipes out all of the parser files, and if those are gone, you're back to needing bison and flex again! Moral of this part: if you're using a source code release archive and hoped to avoid bison/flex, don't ever run maintainer-clean on that source tree. It blows away enough that what's left is more like a git checkout of the code, which won't compile unless you have the parser generator tools. You may have to re-extract the original release archive to get the parser code back again.
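
Here's the cheat sheet I keep in mind for the three cleanup levels, from safest to most destructive:

$ gmake clean              # removes built objects; pre-generated parser files survive
$ gmake distclean          # also removes everything configure generated; re-run ./configure afterward
$ gmake maintainer-clean   # removes the generated parser files too; now bison and flex are required again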

Tuning Disk Timeouts on Virtual Machines

Dedicated servers are important for some databases, and I write a lot more about those difficult cases than the easy ones. But virtual machines have big management advantages. Recently I started moving a lot of my personal dedicated servers onto one larger shared box. It went terribly–my personal tech infrastructure lives out Murphy’s Law every day–and this blog has been down a whole lot of the last month. But I’m stubborn and always learn something from these battles, and this time the lesson was all about disk timeouts on virtual machines and similar cloud deployments. If you’d like to peek at the solution, I resolved the problems with advice from a blog entry on VM disk timeouts. The way things fell apart has its own story.

My project seemed simple enough: start with a Windows box (originally dual core/2GB RAM) and three Linux boxes (small hosting jobs: database/git/web/files). Turn them all into VMs, move them onto a bigger and more reliable server (8 cores, 16GB of RAM, RAID1), and run them all there. Disk throughput wasn't going to be great, but all the real work these systems do fits in RAM. How bad could it be?

Well, really bad is the answer. Every system crashed intermittently, each in a form unique to its operating system. But intermittent problems get much easier to find when they happen more frequently. When one of the Linux systems started crashing constantly, I dropped everything else to look at it. First there were some read and write errors:

Sep 11 13:09:48 wish kernel: ata3.00: failed command: WRITE FPDMA QUEUED
Sep 11 13:09:48 wish kernel: ata3.00: cmd 61/08:00:48:12:65/00:00:02:00:00/40 tag 0 ncq 4096 out
Sep 11 13:09:48 wish kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 11 13:09:48 wish kernel: ata3.00: status: { DRDY }
Sep 11 13:09:48 wish kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 11 13:09:48 wish kernel: ata3.00: cmd 60/40:28:40:9f:68/00:00:02:00:00/40 tag 5 ncq 32768 in
Sep 11 13:09:48 wish kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 11 13:09:48 wish kernel: ata3.00: status: { DRDY }
Sep 11 13:09:48 wish kernel: ata3: hard resetting link
Sep 11 13:09:48 wish kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 11 13:09:48 wish kernel: ata3.00: qc timeout (cmd 0xec)
Sep 11 13:09:48 wish kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Sep 11 13:09:48 wish kernel: ata3.00: revalidation failed (errno=-5)
Sep 11 13:09:48 wish kernel: ata3: limiting SATA link speed to 1.5 Gbps
Sep 11 13:09:48 wish kernel: ata3: hard resetting link
Sep 11 13:09:48 wish kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Sep 11 13:09:48 wish kernel: ata3.00: configured for UDMA/133
Sep 11 13:09:48 wish kernel: ata3.00: device reported invalid CHS sector 0
Sep 11 13:09:48 wish kernel: ata3: EH complete

Most of the time when a SATA device gives an error, the operating system will reset the whole SATA bus it’s on to try and regain normal operation. In this example that happens, and Linux drops the link speed (from 3.0Gbps to 1.5Gbps) too. That of course makes the disk I/O problem worse, because now transfers are less efficient. Awesome.

To follow this all the way to the crash, more errors keep popping up. Next Linux disables NCQ, further pruning the feature set it relies on in hopes the disk works better that way. PC hardware has a long history of device bugs when using advanced features, so this sort of fallback to a smaller feature set happens often when things start failing:

Sep 11 13:11:36 wish kernel: ata3: hard resetting link
Sep 11 13:11:36 wish kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Sep 11 13:11:36 wish kernel: ata3.00: configured for UDMA/133
Sep 11 13:11:36 wish kernel: ata3: EH complete
Sep 11 13:11:36 wish kernel: ata3: illegal qc_active transition (00000003->000003f8)
Sep 11 13:11:36 wish kernel: ata3.00: NCQ disabled due to excessive errors
Sep 11 13:11:36 wish kernel: ata3.00: exception Emask 0x2 SAct 0x3 SErr 0x0 action 0x6 frozen
Sep 11 13:11:36 wish kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 11 13:11:36 wish kernel: ata3.00: cmd 60/28:00:e8:e4:95/00:00:00:00:00/40 tag 0 ncq 20480 in
Sep 11 13:11:36 wish kernel: res 00/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)
Sep 11 13:11:36 wish kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 11 13:11:36 wish kernel: ata3.00: cmd 60/08:08:58:e2:90/00:00:01:00:00/40 tag 1 ncq 4096 in
Sep 11 13:11:36 wish kernel: res 00/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x2 (HSM violation)

Finally, the system tries to write something to swap. When that fails too it hits a kernel panic, and then the server crashes. I can’t show you text because the actual crash didn’t ever hit the server log file–that write failed too and then the server was done. Here’s a console snapshot showing the end:

VM Disk Timeout Crash

Now, when this happens once or twice, you might write it off as a fluke. But this started happening every few days to the one VM. And the Windows VM was going through its own mysterious crashes: it would be running along fine, then the screen went black and Windows was just dead. No blue screen or anything, no error in the logs either. I could still reach the VM with VirtualBox's RDP server, but the screen stayed black and it didn't respond to input. Whatever mysterious issue was going on here, it was impacting all of my VM guests.

On the day when the one server was crashing all of the time, I looked at what was different. I noticed that the VM host itself was running a heavy cron job at the time. Many Linux systems run a nightly updatedb task, the program that maintains the database used by the locate command. When the VM host is busy, I/O to all of the guest VMs gets slow too. And I had moved a 2TB drive full of data into that server the previous day; reindexing the whole thing in updatedb was taking a long time. That was the thing that had changed: the updatedb job was doing a huge number of disk reads.

What was happening here was an I/O timeout. Operating systems give disks a relatively small amount of time to answer requests. When those timeouts, typically 30 seconds long, expire, the OS does all of this crazy bus reset and renegotiation work. With the dueling VMs I had, one of the two systems sharing a disk could easily take longer than 30 seconds to respond when the other was pounding data. Since I/O delays the kernel can't account for are rare on dedicated hardware, an expired timeout easily turns into a crash or lands in some other poorly tested kernel error handling path. Thus my Linux kernel panics and mysterious Windows black screens.
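
You can see that default on a Linux guest by reading the per-device value out of sysfs; sda below is just an example device name:

$ cat /sys/block/sda/device/timeout    # typically reports 30 (seconds) out of the box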

Once I realized the problem, the fix was easy. There's a fantastic article on VM disk timeouts that covers how to do this sort of tuning for every operating system I was worried about. I followed his udev-based approach for my recent Linux systems, changing the timeout to 180 seconds. (I even added a comment at the end suggesting a different way to test that it is working.) Then I hit regedit on Windows to increase its timeout from the 60 seconds it was set to:

Windows Disk Timeout Registry Entry

You'll find similar advice from VM vendors too, like VMware's Linux timeout guide. Some VM programs will tweak the timeouts for you when you install the guest tools for the VM. But the value I settled on that made all my problems go away was 180 seconds, far beyond what even VM setup software normally sets by default.
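
If you want to try a larger value by hand before wiring it into a udev rule, you can write it straight into the same sysfs file. A manual change like this only lasts until reboot, and sda is again just an example device:

$ echo 180 > /sys/block/sda/device/timeout    # as root; the udev rule is what makes it permanent
$ cat /sys/block/sda/device/timeout           # confirm the new value took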

You can find advice about this from NAS manufacturers like NetApp too, although I wasn't able to find one of their recommendations I could link to here (they all needed a support account to access). NAS hardware can be heavily overloaded at times. And even when you have an expensive and redundant model, there can be a long delay when a component dies and its replacement takes over. When a disk dies, even inside a NAS, some amount of this renegotiation work can happen. Tuning individual disk timeouts in a RAID volume is its own long topic.

The other lesson I learned here is that swap is a lot more expensive on a VM than I was estimating, because both the odds and the impact of shared disk contention are very high. For example, even though the working memory of the server hosting this blog usually fit into the 512MB I initially gave it, the minute something heavy hit the server its performance tanked. And many of the performance events were correlated between servers. The Linux guest VM was running updatedb at the exact same time each night as the Linux host system! Both were using the same default crontab schedule. No wonder they were clashing and hitting timeouts during that period on many nights. I improved that whole area by just giving the one Linux VM 2GB of RAM, so swapping was much less likely even during things like updatedb. That change made the server feel much more responsive during busy periods on the other VMs. You can't just tune servers for average performance; you have to think about their worst case periods too.
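
One way to spot this sort of schedule collision, assuming a distribution that drives cron.daily from /etc/crontab (some use anacron or systemd timers instead), is to compare that line on the host and inside each guest; staggering the start times is then a one-line edit:

$ grep cron.daily /etc/crontab    # run on the VM host and on each Linux guest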

FDW Scaling at the Washington DC PostgreSQL Users’ Group

Last night the new-ish Washington DC PostgreSQL Users Group re-launched with that name. We had a talk from Stephen Frost about the new scale-out possibilities introduced by combining writeable Foreign Data Wrappers in PostgreSQL 9.3 with the new Postgres FDW. Stephen is giving a more formal talk on FDW architecture at this year's Postgres Open next month, when that conference comes to Chicago on September 16. The schedule for the conference went up recently. If you were waiting to see that before buying a ticket, jump on it!

For more on FDW architecture you can read right now, see articles on that subject by Josh Berkus, Depesz, Michael Paquier, and What’s New in PostgreSQL 9.3.

As for why I'm describing the PUG as new-ish, that's a story. For a few years now there has been a Baltimore/Washington PostgreSQL Users' Group, acronymed as the BWPUG, meeting at a point midway between those two cities. The idea was that a halfway point could draw people from both places. In practice, though, people in both central Baltimore and Washington DC thought it was too far to travel. There's a lesson there that new PUGs should pay attention to when planning their meeting spots.

Recently the PUG has been reorganized with the help of PostgreSQL advocates at Resonate Insights in Reston, Virginia. It’s on a regular schedule now: 1st Monday of each month. And to make it clear that this is strictly a Washington, DC location, the PUG has been renamed to be the DCPUG. We might go back to the office of longtime BWPUG host OmniTI for special occasions, but on average it’s been a lot easier to get people to show up in Reston instead.

Help with getting reliable database writes

First today is a PostgreSQL community blogger note. Those of you who publish to the Planet PostgreSQL blog feed should take a look at the updated Planet PostgreSQL Policy. There’s a new clause there clarifying when it’s appropriate to mention promotions of commercial products like books. We’re trying to keep every blog post to the Planet feed focused on providing useful information, and just informing people of things like product giveaways doesn’t meet that standard. Several of these have gone by recently, but moving forward that will be considered a violation of the rules.

Speaking of what it's safe to write about…

Continue reading

Disks for Databases: 3rd Gen Seagate Hybrid Drive

Seagate's new third generation hybrid drive combines 8GB of MLC NAND SSD with a 1TB mechanical drive spinning at 5400RPM. They're using the term solid state hybrid drive, or SSHD, for the product line. The big advance in this third generation of hybrid drives is that the cache can be used for writes, and that write cache is protected by capacitors. It is supposed to shut down cleanly when the power dies.

Continue reading

Preparing for backup and replication failures

Recently a few new PostgreSQL guides have come out, and it’s always good to see more documentation available. We even have new, not yet syndicated bloggers like “Mr. Muskrat” showing up to review “Instant PostgreSQL Starter”. But I like to rant about disasters and Murphy’s Law rather than write happy reviews. Today I’m going to talk mainly about backup paranoia, with a side course of looking at the “Instant PostgreSQL Backup and Restore How-to” by Shaun M. Thomas. I hit some ugly problems during the same week I came across the book, and that crystallized my thinking about some common mistakes I keep seeing.
Continue reading

Seeking Revisited: Intel 320 Series and NCQ

Running accurate database benchmark tests is hard.  I’ve managed to publish a good number of them without being embarrassed by errors in procedure or results, but today I have a retraction to make.  Last year I did a conference talk called “Seeking PostgreSQL” that focused on worst case situations for storage.  And that, it turns out, had a giant error.  The results for the Intel 320 Series SSD were much lower in some cases than they should have been, because the drive’s NCQ feature wasn’t working properly.  When presenting this talk I had a few people push back that the results looked weird, and I was suspicious too.  I have a correction to publish now, and I think the way this slipped by me is itself interesting.  The full updated SeekingPostgres talk is also available, with all of the original graphs followed by an “Oops!” section showing the new data.

Continue reading

Launching a career in bikeshed painting

In March I’ll pass six years since I first submitted a feature change to PostgreSQL.  Numbers and years shouldn’t mean that much to people, but try and tell that to anyone who’s ever forgotten about their anniversary.  The fifth anniversary is traditionally celebrated with gifts made of wood.  I think I’ll use mine to build a bikeshed, a tradition in software development going back to at least 1956.  My guess is that minutes after the first person built storage that held one bit of information, someone questioned not the logic they used to flip its state, but instead whether the soldering technique used to wire it on would last.

Continue reading