I found Dennis the Menace, he now has a job as system administrator for a hosting company. Scenario: client has a problem with a server becoming unavailable (cause unknown) and has it restarted. MySQL had some page corruption in the InnoDB tablespace.
The hosting provider, being really helpful, goes in as root and first deletes ib_logfile* then ib* in /var/lib/mysql. He later says “I am sorry if I deleted it. I thought I deleted the log only. Sorry again.” Now this may appear nice, but people who know what they’re doing with MySQL will realise that deleting the iblogfiles actually destroys data also. MySQL of course screams loudly that while it has FRM files it can’t find the tables. No kidding!
Then, while he’s been told to not touch anything any more, and I’m trying to see if I can recover the deleted files on ext3 filesystem (yes there are tools for that), he goes in again and puts an ibdata1 file back. No, not the logfiles – but he had those somewhere else too. The files get restored and turn out to be two months old (no info on how they were made in the first place but that’s minor detail in this grand scheme). All the extra write activity on the partition would’ve also made potential deleted file recovery more difficult or impossible.
This story will still get a “happy” ending, using a recent mysqldump to load a new server at a different hosting provider. Really – some helpfulness is not what you want. Secondary lesson: pick your hosting provider with care. Feel free to ask us for recommendations as we know some excellent providers and have encountered plenty of poor ones.
We just had a booboo in one of our internal systems, causing it to not come up properly on reboot. The actual mishap occurred several weeks ago (simple case of human error) and was in itself a valid change so monitoring didn’t raise any concerns. So, as always, it’s interesting and useful to think about such events and see what we can learn.
Years ago, but for some now still, one objective is to see long uptime for a server, sometimes years. It means the sysadmin is doing everything right, and thus some serious pride is attached to this number. As described only last week in Modern Uptime on the Standalone Sysadmin blog, security patches are a serious issue these days, and so (except if you’re using ksplice 😉 sysadmin quality has become more a question of when you last did security updates (which might have involved a reboot), rather than the uptime number.
But I think the aforementioned booboo illustrates an additional aspect, I think it might be quite sensible to reboot a system every so many weeks (we can debate the interval and it may differ per system and situation) since in the end it will be rebooted some time, and that may show trouble at an inconvenient time. Better to test and fine out when you’re there.
Of course this also has consequences for either your external uptime (scheduled maintenance slots with outages), or thinking about your architecture differently. Can you take out any individual system in your infrastructure without some service getting interrupted? It’s doable, but not necessarily with some traditional approaches or equipment that carries the “enterprise” label.
Food for thought! As always, comments welcome.
Peter Lieverdink (also known as cafuego on IRC/identi.ca, engineer on OurDelta builds and for Open Query) has co-authored a book that’s available since Monday. The title is Pro Linux System Administration published by Apress.
These days some people don’t want to bother with system administration, and either hire or outsource. Others want to find out more and do things themselves (home and small office use), and that’s the intended audience for this book.