You may have noticed that the site went down around noon today, and stayed dead for a few solid hours. This was perhaps my longest downtime since the good-old Dreamhost days. Or at least the longest period between the time I became aware of the downtime, and when the site went back up.
What was the reason for the downtime? I’m still not sure. Around 2pm Chris shot me an email notifying me the site was unresponsive. Fortunately, today was my day off so I could devote my full attention to the issue. Unfortunately, at that point I was out and about running some errands and away from a computer. Still, being the crafty technocrat that I am, I logged into my Linode app and rebooted the VM. Because, hey – golden rule of troubleshooting: turn it off and on again.
That could have actually been what did the site in. When it came back, I was treated to the dreaded
Error establishing a database connection message. This is a typical WordPress catch-all message that can be caused by a host of underlying issues. So I started going down the list and addressing all of them.
First, I made sure MySQL was running and restarted it. This did absolutely nothing.
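As an aside, the "is MySQL even up?" check can be scripted. The sketch below is only an illustration: the function name `mysql_alive` is made up, and it assumes the `mysqladmin` client is installed and on the PATH. Note that a successful ping only says the server is accepting connections, not that its tables are healthy.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): succeeds only if a
# MySQL server answers a ping.
mysql_alive() {
    # mysqladmin exits 0 when the server is running and non-zero when it
    # cannot be reached. --silent suppresses the connection error message.
    mysqladmin --silent ping >/dev/null 2>&1
}

# Possible use (the service name "mysql" is an assumption and varies by distro):
#   mysql_alive || sudo service mysql restart
```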
Next, I opened my wp-config.php and added the following magic line to the end:
define('WP_ALLOW_REPAIR', true);
This allows you to use the built-in repair routine by navigating to /wp-admin/maint/repair.php. For about a minute there it seemed like everything was going to be all right, until I got a lovely message claiming that
wp_options table is marked as crashed and last repair failed. So the built-in repair was a dud.
I figured maybe the MySQL client tools would succeed where WordPress failed, so I ran:
mysqlcheck --repair --all-databases
The results were about the same as above. The wp_options table was broken and beyond repair.
Unwilling to give up, I rolled out heavy artillery:
sudo myisamchk -r /var/lib/mysql/my_db/wp_options
If you have to run it with sudo you know it’s serious business. Unfortunately, this too failed. This time I got an even more meaningful error message, something like:
Error: not enough memory for blob at..
I kinda got preoccupied with the memory message and spent some time trying to get around it. I found out that you can increase the size of the buffers that myisamchk uses for repair functions using command line switches like --sort_buffer_size so I tried things like:
myisamchk -o -f /var/lib/mysql/my_db/wp_options --sort_buffer_size=4G
Eventually it occurred to me that wp_options should not have any blobs in it, and whatever data these tools were trying to load was not actual WordPress data but random binary garbage. In other words, chances were that the table was fucked beyond repair.
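One precaution worth sketching for anyone attempting the same kind of repair: copy the table's files somewhere safe before letting myisamchk rewrite them, and do it with the MySQL server stopped. A MyISAM table lives in files named after the table inside the database directory (the .MYD data file, the .MYI index file, and on older MySQL versions a .frm definition file). The helper below is a hypothetical illustration, not something from the original troubleshooting session; the function name and the backup location are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: copy every file belonging to one MyISAM table
# (e.g. wp_options.MYD, wp_options.MYI) into a backup directory.
backup_table_files() {
    db_dir=$1    # database directory, e.g. /var/lib/mysql/my_db
    table=$2     # table name, e.g. wp_options
    dest=$3      # backup directory, created if it does not exist

    mkdir -p "$dest" || return 1
    found=0
    for f in "$db_dir/$table".*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        cp -p "$f" "$dest"/ || return 1
        found=1
    done
    [ "$found" -eq 1 ]               # fail if the table had no files at all
}
```

Run as root with the server stopped, a call such as `backup_table_files /var/lib/mysql/my_db wp_options /root/wp_options.bak` (the destination path is made up) would leave untouched copies to fall back on if a repair attempt made things worse.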
So after many hours of troubleshooting I gave up and restored my site from backup. As you may know, my host is Linode, and they offer a really neat backup program. For like $5 per month they take nightly snapshots of your VM. About a year ago I wisely added that feature because I was too lazy to set up backups on my own. I can now say with full confidence that it was worth every penny. I literally clicked one button, and my site was rolled back to the last backup, which coincidentally was a mere 7 hours old. The restore took about 10 minutes, and as you can see the site is back in business.
Some data was lost in the process. While my latest blog post is up, the comments on it were lost, for which I apologize. Still, losing only 7 hours of data is really not bad. Actually, it’s only like 3-4 hours of real data, because no comments were posted while the site was down, since noon. So all in all I think I made out ok.
Moral of the story: BACKUPS ARE GOOD! Seriously. Without backups I would still be down, and probably curled up in a fetal position under my desk, crying.
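For anyone who would rather roll their own than pay for snapshots, a nightly database dump could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the backup directory, the retention count, the function names, and the idea that mysqldump can find its credentials on its own (for example from ~/.my.cnf). It covers only the database, not the whole VM the way a snapshot does.

```shell
#!/bin/sh
# Sketch of a do-it-yourself nightly dump. All names are placeholders.
BACKUP_DIR=${BACKUP_DIR:-/var/backups/mysql}   # assumed location
KEEP=${KEEP:-7}                                # how many dumps to keep

dump_all() {
    mkdir -p "$BACKUP_DIR" || return 1
    out="$BACKUP_DIR/all-$(date +%Y%m%d-%H%M%S).sql"
    # Write the plain dump first so a mysqldump failure is not hidden by a pipe.
    if mysqldump --all-databases > "$out"; then
        gzip -f "$out"
    else
        rm -f "$out"
        return 1
    fi
}

prune_old() {
    # File names embed the timestamp, so a reverse name sort is newest first.
    # Everything after the first $KEEP entries gets deleted.
    ls -1 "$BACKUP_DIR"/all-*.sql.gz 2>/dev/null | sort -r |
        tail -n +"$((KEEP + 1))" |
        while IFS= read -r old; do
            rm -f -- "$old"
        done
}
```

A cron job could then source the file and run `dump_all && prune_old` once a night. Copying the resulting files off the server is left out here, but a dump that lives only on the machine it protects is not much of a backup.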
Also, WordPress is a piece of shit. Oh, don’t get me wrong – it’s amazing when it works. In a way I love it, I’m used to it, and I don’t want another platform. But I am always shocked and amazed at how easy it is to completely wreck it by doing absolutely nothing. A silly DB error, and the entire site is out of business. If you use WordPress, backups are doubly important because of issues like this one.