Solaris: A cautionary tale • Gated Logic • nevali.net

One of our hosting providers had a major power outage this evening, and once the generator ran dry, four of our servers lost power unexpectedly and died. Three of them came back fine, but the fourth (the oldest, incidentally) didn’t. No hardware problems, thankfully, but fairly heavy filesystem corruption.

Now, this being an old server—older than my tenure, in fact—I didn’t have any hand in the filesystem setup and I’m not in the habit of repartitioning servers in active deployment. Chinese whispers tell me that the filesystem layout was recommended by a Sun engineer (it being a SPARC box running Solaris 9), although in retrospect I actually have trouble believing that.

The issue was that /usr was on a different device to the root filesystem. Nothing unusual in that, you might think. But wait, this is Solaris 9. Solaris 9 conveniently keeps fsck in /usr/sbin.

Guess which filesystem had a hairy hissy-fit and wouldn’t mount without some serious coaxing?

So here’s the thing: if you’re dealing with Solaris, it’s probably a good idea to ensure that /usr is part of the root filesystem. Obviously you’ll want to make sure it’s large enough for the various amounts of rubbish that Solaris puts in there (keeping /var separate isn’t a shockingly bad idea). Alternatively, of course, you could copy the useful bits from /usr/sbin to /sbin, but you’d have to remember to copy them all again whenever they got patched, and I guarantee you’ll forget right before a nasty crash.
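Both options can be checked ahead of time, while the box is still healthy. What follows is only a rough sketch, assuming Python happens to be installed on the machine (it will not help once /usr is already unmountable). The function names `same_filesystem` and `stale_copies` are made up for illustration. The first compares device IDs to see whether two paths share a filesystem; the second reports which copied tools are missing or no longer match their originals.

```python
import filecmp
import os


def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def stale_copies(source_dir, copy_dir, names):
    """Return the names whose copy in copy_dir is missing or differs
    byte-for-byte from the file of the same name in source_dir."""
    stale = []
    for name in names:
        source = os.path.join(source_dir, name)
        copy = os.path.join(copy_dir, name)
        # shallow=False forces a content comparison rather than
        # trusting size and modification time alone.
        if not os.path.exists(copy) or not filecmp.cmp(source, copy, shallow=False):
            stale.append(name)
    return stale
```

Something like `same_filesystem("/", "/usr")` answers the first question, and `stale_copies("/usr/sbin", "/sbin", ["fsck"])` could be run after each round of patching to catch the copies you forgot about. The list of tools worth copying is left to the reader; `fsck` is simply the one that bit us.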

(For the handful of readers in the know: it was sun1 which didn’t come back up properly. No prizes for guessing which two sites are still on it.)