Explosive Log Failure

Terminally Incoherent went down this Friday. And when I say down, I mean all the way down – terminally shut as it were. Fortunately, no one noticed. I got no angry emails.Hell, I didn’t even get a friendly “hey, your site just went down”… On second thought this is actually kinda sad. One of you should have noticed, damn it!

Initially I thought it was a traffic spike so I just kept rebooting the server. Initially this would temporarily solve the problem. The blog would come back, only to crash and burn few minutes later. Eventually rebooting stopped working. Intriguing thing was that Apache seemed to keep on chugging. When you visited the front page it would display the familiar WordPress error: “Could not establish database connection”. This usually does not happen when I get DDOS‘ed due to Reddit or Hacker News link.

Whenever I get a huge traffic spike, Apache is usually the first thing to go. It’s actually a condition that is rather easy to spot: as soon as you ssh to the server you get a scrolling list of error notifications about processes being killed because your system is out of memory. This time however there was no such thing happening. I ran top and was surprised how few processes were actually running. I’m used to seeing at least a dozen lines for Apache sub-processes but there was only like one or two present.

I did some more poking around and ascertained that the culprit was definitely MySQL. I tried manually restart the server but it wouldn’t stay up. I tried running it in a non-deamon mode with the –debug parameter but it would silently close right away without displaying any error messages. The MySQL error log was empty giving me nothing to go on.

In an act of desperation, I decided to reinstall MySQL. I figured out that maybe a recent update screwed something up (especially after I rebooted the machine so many times). Typically aptitude does not delete the contents of /var/lib/mysql (ie. where your physical database files live) but I was not going to take any chances. So I attempted to make a backup copy of all of the files in my home directory…

Unfortunately (or maybe fortunately, considering the circumstances) this failed. Initially I figured out it was a permission thing (wonky sudoers file maybe), so I tried this again as root with the same result. The machine simply refused to copy files. I think it might have mentioned something about being out of storage space but I initially ignored it. I mean, how could that machine be out of space seeing how it only hosts this blog which is not very large at all. Last time I checked it there was only about 2GB of used space on the root partition. This included both the OS, software and the site itself including all the auxiliary files.

After few minutes of muddling I ran df -h to make sure I was not out of space. I was wrong. The Used% value was at 100%. Somehow I managed to fill up my entire 20GB disk to the brim. My first thought? “HACKERS!”

I’m actually quite embarrassed at this and I blame Hollywood for infecting me with this memmetic garbage. I guess it is easier to externalize personal failures and throw blame at imaginary boogeymen than take responsibility. So I slapped myself, and decided to act like an adult. If your first reaction to a computer issue is “HACKERS” then you are either a child, my grandmother, a schizophrenic suffering from delusions of grandeur or a pointy haired middle-management individual who failed upwards into a position of power. Chances were that whatever happened there was my own fuck-up and not the work of some elusive and mysterious hax0r.

How do we track down where the disk space is going? It’s actually not that difficult, especially considering I wrote an extensive article on this very subject not so long ago. To make a long story short, the magic command is:

du -sBG /* | sort -nr | head

This gave me a list of the first ten top level folders in / ordered by size of their contents. Guess what was the #1 spot on that list? It was the /srv folder where my site lives. Or, as I found out after drilling down a bit it was the /srv/www/terminally-incoherent.com/logs directory. There were only two files in that folder: error.log which was 47MB and access.log which was over 17GB.

How did this happen? Why were these logs not rotated? Well, because I fucked up. When I moved the site to a new host back in June of 2011 I decided to put the logs in this directory rather than in the default /var/log/apache2. I guess the reason for this was that I was setting it up using Apache Virtual Hosts feature allowing me to run more than one site from this server. So if I ever decided to set up another domain, I didn’t want separate logging for both.

What I forgot to do back then was to set up a logrotate rule for that directory. By default, Apache automatically rotates it’s logs on a weekly basis keeping approximately a year’s worth of archival logs for reference. But whenever you set up custom Virtual Hosts and specify new log locations for them, said rule does not apply.

To set up log rotation for my site I simply copied over and tweaked the default Apache rules, creating a new file in /etc/logrotate.d with the following contents

/srv/www/terminally-incoherent.com/logs/*.log {
        weekly
        missingok
        rotate 12
        compress
        delaycompress
        notifempty
        create 640 root adm
        sharedscripts
        postrotate
                if [ -f "`. /etc/apache2/envvars ; echo ${APACHE_PID_FILE:-/var/run/apache2.pid}`" ]; then
                        /etc/init.d/apache2 reload > /dev/null
                fi
        endscript
}

I figured that 12 weeks ought to be enough backlog for now, considering that I haven’t really looked at the access.log file at all since 2011.

To be honest, I really did not expect this to ever be a problem. These log files are just pure text, and they only grow when someone views the site. Granted, I probably get multiple log entries per hit because of all the little bits and pieces, images and scripts that have to load for each page. Still, 17GB in a little under two years is quite a respectable amount of traffic.

Thanks for reading my ramblings all these years guys!

9 Responses to Explosive Log Failure

Jason *StDoodle* Wood says:

April 8, 2013 at 10:29 am

Thanks for having it here to read!

But don’t expect me to start reading it regularly on Fridays. :P

Reply | Quote
FX says:

April 8, 2013 at 11:50 am

Didn’t notice there was an error message in my RSS feeds!

Well, anyway, I’m pretty sure you’re not alone in that small “mistake”, I’ve heard quite a number of people had a case of gluttonous-log-file-itis on their website ;)

Reply | Quote
MrJones2015 says:

April 8, 2013 at 3:34 pm

Dont worry, im with you every time. Its just that some of your posts are way to tecchy for me or i am just too tired to write a few lines. As a mechanical engineer I just cant see “the beauty of html5” when somehow throws a few lines consisting of (completely unknown) dev, head, href and div at me. I know how to create a loop in Visual Basic, isnt that enough ? :P

HTML5 more or less makes that sort of messiness go away. Observe the same skeleton written in proper HTML5. Can you spot the difference:

No, it looks almost the same to me, except you removed all of the divs ;)

What i really enjoyed was the “Wisdom of the Young and Reckless”. This really was an astonishing and impressive log entry. Thanks for that!

Reply | Quote
IceBrain says:

April 8, 2013 at 4:38 pm

Same here; I blame Tiny Tiny RSS for not warning properly! It shall be punished severely.

I’ve had a similar situation, but it filled 9GB in two weeks(!), due to a misconfigured uptime monitor which was constantly pounding nginx, coupled with a daily reboot which prevented logrotate from running.

And people actually pay me in part as a sysadmin…

Reply | Quote
Mitlik says:

April 8, 2013 at 5:52 pm

I must have hit a cached version of your site. I was waiting for a new post to pass some time. I think I even went to some of your links (e.g. /dev/random, twenty sided) last Friday.

Reply | Quote
Luke Maciak says:

April 8, 2013 at 10:03 pm

@ Jason *StDoodle* Wood:

Oh, come on! Fridays are legitimate work days. I don’t expect anyone to read my blog on their weekends but you should totally read me on company time ;)

@ FX:

Ah, yes – I forget RSS feeds create that safety buffer. Good to know. :)

@ MrJones2015:

But… But the divs were what was making it ugly. lol Also, VisualBasic gives me nightmare. I’ve been coding a lot in Ruby lately and it is probably my favorite language now.

@ IceBrain:

Heh, I won’t tell anyone about your sysadmin mishaps if you keep quiet about mine. No one needs to know most of our expertise boils down to being really good at googling error messages. :P

@ Mitlik:

Well, it was still chugging along in the morning, and I fixed it in the early afternoon so you probably got here either before the downtime, after it was fixed or during one of the temporary uptimes when I was rebooting it a lot. :)

Reply | Quote
Jed says:

April 9, 2013 at 4:48 am

I will confess, I actually noticed, I was using google to find one of your old posts that I remembered had an answer to something I needed to know (don’t remember what now) and so used the google’s cached version to read it.
But, it’s the kind of thing I’ve got used to on the internet, finding a site I go to temp down, so didn’t really occur to me to contact you.
I’ll keep that in mind in the future.

Reply | Quote
Kim Johnsson says:

April 9, 2013 at 5:35 am

Totally would’ve noticed, but I was actually on vacation last week.

And yeah, been there. Logs can be a bitch to manage. And we’ve had enough issues with disk space on our 70+ servers at work that I always look at disk usage first when something unexpectedly fails. Especially on database servers.

Reply | Quote
Travis McCrea says:

April 27, 2013 at 5:00 pm

Why not switch to lighttpd or nginx?

Reply | Quote