We just had a booboo in one of our internal systems, causing it to not come up properly on reboot. The actual mishap occurred several weeks ago (simple case of human error) and was in itself a valid change so monitoring didn’t raise any concerns. So, as always, it’s interesting and useful to think about such events and see what we can learn.
Years ago, and for some people still today, one objective was to see a long uptime for a server, sometimes measured in years. It was taken to mean the sysadmin was doing everything right, and so some serious pride was attached to this number. As described only last week in Modern Uptime on the Standalone Sysadmin blog, security patches are a serious issue these days, and so (unless you’re using ksplice) sysadmin quality has become more a question of when you last applied security updates (which might have involved a reboot) than of the uptime number.
But I think the aforementioned booboo illustrates an additional aspect: it might be quite sensible to reboot a system every so many weeks (we can debate the interval, and it may differ per system and situation). In the end every system will be rebooted at some point, and that reboot may reveal trouble at an inconvenient time. Better to test and find out while you’re there.
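To make the "every so many weeks" idea concrete, here is a minimal sketch of a check that flags a machine whose uptime exceeds a chosen interval. It assumes a Linux host, where the first field of `/proc/uptime` is the number of seconds since boot. The function names and the 42-day default are made up for illustration, not a recommendation.

```python
SECONDS_PER_DAY = 86400


def parse_uptime_seconds(text):
    """Parse the contents of /proc/uptime; the first field is seconds since boot."""
    return float(text.split()[0])


def read_uptime_seconds(path="/proc/uptime"):
    """Read the current uptime in seconds (Linux only)."""
    with open(path) as handle:
        return parse_uptime_seconds(handle.read())


def reboot_overdue(uptime_seconds, max_days=42):
    """Return True if the system has been up longer than max_days.

    The 42-day default is an arbitrary illustrative value.
    """
    return uptime_seconds > max_days * SECONDS_PER_DAY
```

Something like `reboot_overdue(read_uptime_seconds())` could then be wired into whatever monitoring you already have, so a long uptime raises a reminder to schedule a test reboot rather than being a point of pride.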
Of course this also has consequences: either for your external uptime (scheduled maintenance slots with outages), or for how you think about your architecture. Can you take out any individual system in your infrastructure without some service getting interrupted? It’s doable, but not necessarily with some traditional approaches or with equipment that carries the “enterprise” label.
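The "can you take out any individual system" question can be asked of a simple inventory. The sketch below assumes a hypothetical mapping of service names to the hosts that can serve them (the host and service names are invented) and reports which hosts would take a service down with them if rebooted. It ignores capacity, failover time, and dependencies between services, so treat it as a starting point only.

```python
def services_lost_without(host, inventory):
    """Return the services left with no host at all if `host` goes down."""
    return sorted(
        service
        for service, hosts in inventory.items()
        if host in hosts and not (set(hosts) - {host})
    )


def single_points_of_failure(inventory):
    """Map each host to the services that depend on it alone."""
    all_hosts = {h for hosts in inventory.values() for h in hosts}
    result = {}
    for host in sorted(all_hosts):
        lost = services_lost_without(host, inventory)
        if lost:
            result[host] = lost
    return result


# Hypothetical inventory: service name -> hosts able to serve it.
example_inventory = {
    "web": ["app1", "app2"],
    "db": ["db1"],
    "mail": ["app1"],
}
```

With this example, `single_points_of_failure(example_inventory)` reports `app1` (for mail) and `db1` (for db), while `app2` can be rebooted without interrupting anything.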
Food for thought! As always, comments welcome.