Posted on

The Flipside of Uptime

We just had a booboo in one of our internal systems, causing it to not come up properly on reboot. The actual mishap occurred several weeks ago (simple case of human error) and was in itself a valid change so monitoring didn’t raise any concerns. So, as always, it’s interesting and useful to think about such events and see what we can learn.

Years ago (and for some people still today), one objective was to see a long uptime for a server, sometimes years. It meant the sysadmin was doing everything right, and thus some serious pride was attached to this number. As described only last week in Modern Uptime on the Standalone Sysadmin blog, security patches are a serious issue these days, and so (except if you’re using ksplice 😉) sysadmin quality has become more a question of when you last did security updates (which might have involved a reboot), rather than the uptime number.

But I think the aforementioned booboo illustrates an additional aspect: it might be quite sensible to reboot a system every so many weeks (we can debate the interval, and it may differ per system and situation), since in the end it will be rebooted some time, and that may reveal trouble at an inconvenient moment. Better to test and find out when you’re there.

Of course this also has consequences for either your external uptime (scheduled maintenance slots with outages), or thinking about your architecture differently. Can you take out any individual system in your infrastructure without some service getting interrupted? It’s doable, but not necessarily with some traditional approaches or equipment that carries the “enterprise” label.
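As a rough illustration of the "reboot every so many weeks" idea, here is a minimal shell sketch that flags a Linux host once its uptime passes a chosen limit. The function name `reboot_test_due` and the 28-day limit in the usage line are my own assumptions for the example, not a recommendation; it reads `/proc/uptime`, so it is Linux-only.

```shell
#!/usr/bin/env bash
# Sketch only: warn when a host has been up longer than a chosen number of days.
# reboot_test_due is a hypothetical helper name; the limit is whatever policy you pick.

reboot_test_due() {
    local max_days="$1"
    # Optional second argument: uptime in seconds (handy for testing).
    # By default, use the first field of /proc/uptime (seconds since boot, Linux-specific).
    local seconds="${2:-$(cut -d' ' -f1 /proc/uptime)}"
    # Strip the fractional part, then convert seconds to whole days.
    local days=$(( ${seconds%.*} / 86400 ))

    if [ "$days" -ge "$max_days" ]; then
        echo "DUE: up ${days} days (limit ${max_days})"
        return 1
    fi
    echo "OK: up ${days} days (limit ${max_days})"
}
```

Calling `reboot_test_due 28` from cron or a monitoring check would then nag you when a box is overdue for a supervised restart, rather than leaving the discovery to an unplanned one.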

Food for thought! As always, comments welcome.

Posted on

BarCamp Melbourne 12-13 September 2009


Open Query is pleased to sponsor BarCamp Melbourne, a rocking unconference held at UrbanCamp, Royal Park, Melbourne VIC (Australia). If you’re anywhere nearby this coming weekend (12-13 September 2009), you really really want to be there and participate, learn, and enjoy! Open Query‘s own Peter Lieverdink (cafuego) will be there.

Barcamps are run at low cost, but of course there are still costs, so it’s very important that lots of businesses and people toss in something to help cover that. If you would like to contribute, just follow the links and Ben or Donna will be happy to help.

Posted on

Market share vs market impact

This is very relevant in the context of the EU probe of the Oracle-Sun takeover. MySQL’s share of the database market, which is usually measured by revenue, is of course peanuts, with estimates ranging from half a percent to something slightly more. Peanuts.

This is not surprising, considering an estimated 999 out of every 1000 MySQL users do not pay Sun/MySQL anything (although some might be Open Query clients 😉), and while MySQL has been targeting higher-end clients and correspondingly higher revenue, its pricing is still far lower than the premium cost of Oracle, DB2 and the like.

All this proves very clearly something which I’ve been saying for years (do scan back in my blog ;-), the definition of market share is borked when it comes to Open Source and low-end disruptors (MySQL has been both although it might no longer be a low-end disruptor, having overshot the needs of a significant chunk of its users). The market impact (usage and influence) of such products is much greater than their revenue. So we have to consider, what matters most? I think the usage and influence matters most, but usage is difficult to measure for OSS, and influence is a subjective issue. Analysts go for solid numbers, and therefore revenue is a sensible -and traditionally reasonably accurate- way to see how things are, including in terms of influence and usage.

So, what is interesting about the EU probe is that it appears to acknowledge that little MySQL actually is a big force in the database market, and that is spot on. As to whether it makes sense to stall the takeover while meanwhile Sun is continuing its freefall and vultures IBM, HP and MS are circling around…. that’s a different matter. Having a philosophical debate while the patient is bleeding to death and getting pecked by scavengers… you get the idea. And I believe that Oracle has, all things considered, done a very decent job with InnoDB since its acquisition. With the takeover I’m not entirely convinced either way; it’s definitely interesting stuff playing out, but it shouldn’t be dragged on too much, that doesn’t help anybody.

Posted on

Will your production MySQL server survive a restart?

Do you know if your production MySQL servers will come back up when restarted? A recent support episode illustrates a number of best practices. The task looked trivial: Update a production MySQL server (replication master) with a configuration tuned and tested on a development server. Clean shutdown, change configuration, restart. Unfortunately, the MySQL daemon did not just ‘come back’, leaving 2 sites offline. Thus begins an illuminating debugging story.

First place to look is the daemon error log, which revealed that the server was segfaulting, seemingly at the end of or just after InnoDB recovery. Reverting to the previous configuration did not help, nor did changing the InnoDB recovery mode. Working with the client, we performed a failover to a replication slave, while I got a second opinion from a fellow engineer to work out what had gone wrong on the server.

Since debug symbols weren’t shown in the stack trace, we needed to generate a symbol file (binary was unstripped) to use with the resolve_stack_dump utility. The procedure for obtaining this is detailed in the MySQL manual. With a good stack trace in hand, we were able (with assistance from an old friend, thanks Dean!) to narrow the crash down to bug 38856 (also see 37027). A little further investigation showed that the right conditions did exist to trigger this bug:
  • expire_logs_days = 14 # had been set in the my.cnf
  • the binlog.index file did not match the actual state of log files (i.e. some had been manually deleted, or deleted by a script)
So with this knowledge, it was possible to bring the MySQL server back up. It turned out that the expire_logs_days had perhaps been added to the configuration but not tested at the time (the server had not been restarted for 3 months). This had placed the system in a state, unbeknownst to the administrators, where it would not come back up after a restart. It was an interesting (if a tad stressful) incident as it shows the reasons for many best practices – which most of us know and follow – but worth re-capping here.
  • even seemingly trivial maintenance can potentially trigger downtime
  • plan any production maintenance in the quiet zone, and be sure to allow enough time to deal with the unforeseen
  • don’t assume your live server will ‘just restart’
  • put my.cnf under revision control (check out “etckeeper”, a standard Ubuntu package; it can keep track of everything in /etc using bzr, svn or git)
  • do not make un-tested changes to config, test immediately, preferably on dev or staging system
  • be ready to failover (test regularly like a fire drill); this is another reason why master-master setups are more convenient than mere master-slave
  • replication alone is NOT a backup
  • don’t remove binlogs or otherwise touch anything in data dir behind mysql’s back
  • have only 1 admin per server so you don’t step on each other’s toes (but share credentials with 2IC for emergencies only)
  • use a trusted origin for your binary packages; just building and passing the basic test suite is not always sufficient
  • know how to get a good stack trace with symbols, to help find bug reports
  • be familiar with, but it still helps to ask others as they might have just seen something similar and can help you quickly find what you’re looking for!
  • and last but very important: it really pays to find the root cause of a problem (and prevention requires it!), so a “postmortem” on a dead server is very important… if we had just wiped that server, the problem might have recurred on another server later.
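
One of the trigger conditions above, a binlog index that no longer matches the files on disk, is easy to check for ahead of a restart. Below is a minimal sketch, assuming the index file lists one binary log path per line and that relative paths are relative to the directory holding the index. The function name `check_binlog_index` is made up for this example, and the path in the usage note will differ per installation.

```shell
#!/usr/bin/env bash
# Sketch only: list binary logs named in the index file that are missing on disk.
# Assumption: one log path per line; relative paths are relative to the
# directory containing the index file (normally the data directory).

check_binlog_index() {
    local index_file="$1"
    local dir entry path missing=0
    dir="$(dirname "$index_file")"

    while IFS= read -r entry || [ -n "$entry" ]; do
        # Skip blank lines.
        if [ -z "$entry" ]; then
            continue
        fi
        # Absolute entries are used as-is, relative ones are resolved against the index directory.
        case "$entry" in
            /*) path="$entry" ;;
            *)  path="$dir/$entry" ;;
        esac
        if [ ! -f "$path" ]; then
            echo "MISSING: $entry"
            missing=1
        fi
    done < "$index_file"

    # Non-zero exit status when at least one listed log is missing.
    return "$missing"
}
```

Usage would be something like `check_binlog_index /var/lib/mysql/mysql-bin.index` (an assumed path). Any MISSING line means someone removed logs behind MySQL's back; removing them with PURGE BINARY LOGS instead keeps the index in step.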
Posted on

Getting ready for FrOScon 2009

I arrived yesterday in St. Augustin, near Bonn in Germany. After a good day of hitchhiking (weather is beautiful here) I stayed with my Pakistani Couchsurfing host and we had an extremely interesting evening talking about the gigantic cultural differences between western civilization and Pakistani civilization. It beats staying in a hotel by about a million points 🙂

This morning I headed to the FrOScon HQ at the Fachhochschule to help out with whatever was needed. Turns out that was a bit premature (misunderstanding on my part), so I have had some time to catch up on mail and give some more attention to my talk on Saturday. I’ll be helping out with things for the rest of today and the whole day tomorrow, though.

I’ll be talking about MySQL MMM, a project that I have invested quite a bit of time in getting to know. My talk will outline what MMM is, what it’s not and an example of our setup at Open Query. It’s a full hour long, so it should be very interesting to be able to go into that much detail.

If you are near St. Augustin, make sure to come by for FrOScon, as its schedule has some very interesting talks and you’ll also have a good chance to meet fellow MySQL geeks in the OpenSQLCamp dev-room.

Posted on – doing it differently

I’ve found that exchanging ideas and asking questions, even with people some might consider to be direct competitors, is more valuable than risky. Enter… logo is the home of a group of people who run, or are interested in running, their business according to a set of Principles that make them more people friendly (both to clients and self), resilient to recessions, (potentially) better for the environment, and more. A buzzword compliant mission statement could be something like “Business strategy incubation through co-mentoring”.

While being particularly suited to on-line, ICT and Open Source related endeavours, it is by no means limited to that. In addition, the guidelines also apply well to non-profits and other organisations.

The group’s monthly membership fee is currently set at a nominal AUD 5, sufficient to cover costs and create a sense of commitment that a gratis service would not have. When you think about it, that’s cheaper per year than many static books! Members actively participate by contributing on the wiki (like a dynamic book) and mailing list, mentoring fellow members, and (where possible) attending live meetings for face-to-face interaction.

Posted on

Tool of the Day: screen

Only the other day I was talking with someone who does a lot of work on the shell command line, but hadn’t used the GNU screen tool, so I’d better scribble a post about it as I regard it as an absolute must-have for any remote work, for multiple reasons.

First of all, what screen does. You start screen inside a terminal session (local or SSH remote), and then you can create additional screens through Ctrl-A C. The initial screen is number 0, the next one 1, and so on. You can switch between screens with Ctrl-A # where # is the screen number. This way, you can have multiple things going within a single ssh connection, very handy. But that’s not all!

If you get disconnected (it happens 😉) and you reconnect, your screen sessions will still be there, and running too. You can reattach with screen -r. To do a nice disconnect, you can do Ctrl-A D (detach) before closing your ssh connection.

You can also have multiple screen sessions by name, screens within screens (that confuses me for the control keys so I tend not to use that), and an absolute supertrick is that you can actually share a screen session with someone else. That’s sometimes mighty handy with two engineers to look at something, and also for showing things to clients.
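
To make the named-session and sharing tricks concrete, here is a small sketch. The options used (`-S` to name a new session, `-ls` to list sessions, `-d -r` to detach a session elsewhere and reattach it here, `-x` to join a session that is still attached) are standard screen options; the helper names `attach_or_create` and `share_session` are just my own inventions for the example.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, the screen options are standard.

# Reattach to a named session if it exists, otherwise create it.
attach_or_create() {
    local name="$1"
    # screen -ls prints lines such as "<tab>12345.name<tab>(Detached)".
    if screen -ls | grep -q "[.]${name}[[:space:]]"; then
        # -d -r: detach the session from any other terminal, then reattach here.
        screen -d -r "$name"
    else
        # -S: start a new session with this name.
        screen -S "$name"
    fi
}

# Join a session that someone else is already attached to (shared view).
share_session() {
    local name="$1"
    # -x: attach without detaching the other terminal.
    screen -x "$name"
}
```

With this, `attach_or_create upgrade` gets you back to your work after a dropped connection, and a colleague logged in to the same user account can run `share_session upgrade` to watch and type along.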

The tool itself is absolutely ancient (aka rock solid, in maintenance mode); I did a quick check and I see references as far back as 1987. I remember using it long long ago, might’ve been on a XENIX box. I reckon screen’s authors deserve a prize for creating one of the most useful tools ever!

Default Linux installs often don’t have it, but rectifying that is as simple as sudo apt-get install screen or sudo yum install screen. Then, man screen is your friend, but there are also quite a few decent tutorials on the web.

Posted on

Time-share computing is back!

I kid you not. Let’s quote

“Time-sharing is sharing a computing resource among many users by multitasking. Its introduction in the 1960s, and emergence as the prominent model of computing in the 1970s, represents a major historical shift in the history of computing. By allowing a large number of users to interact simultaneously on a single computer, time-sharing dramatically lowered the cost of providing computing, while at the same time making the computing experience much more interactive.”

Virtualisation and Cloud computing are merely the new form of this, and not actually a new concept as such 😉

Both virtualisation and cloud architecture (and combinations thereof) have their place; they’re not the new solution to everything, as any architecture is situation dependent. Looking at the various cloud providers now, they have quite distinct deployment and pricing models which means that different providers will be suitable and economical for different applications. That’s quite interesting. It may well mean that a smart company might deploy independent aspects of their operation on different environments.

Architecting infrastructure is about way more than a properly specced box and a bit of tuning, and that’s why the above is also of interest to Open Query. There are serious technical aspects to this, but also financial and business factors. Haha, and you thought we just did some database stuff 😉

Posted on

What to do with the Falcon engine?

Keep it. Make sure it gets correctly positioned in the coming months.

It appears that with the Oracle acquisition, Falcon’s reason to exist (a non-Oracle-owned InnoDB replacement, previously seen as a strategic imperative, though much delayed) is regarded as gone.

But look, each engine has unique architectural aspects and thus a niche where it does particularly well. Given that Falcon exists, I’d suggest not just “ditching it” but having it live on as one of the pluggables. What Oracle will do to it is unknown, but Sun/MySQL can secure this positioning by making sure in the coming months that Falcon works in 5.1 as a pluggable engine, perhaps also creating a separate bzr project/tree for it on Launchpad.

Then the good work can find its way into the real world, now.

Posted on

MySQL User Groups on – sponsorship – ask Open Query

This is about the ending of the sponsorship of the user groups by Sun/MySQL and their suggested move to Facebook.

If people want to move to Facebook, that’s fine. For those who want to stay but don’t have the local funding, I have an offer for you. Contact Open Query, and we’ll sponsor your group for the coming months. This is not open-ended, I think a more permanent solution is important (moving, sponsorship, whatever) but I want to make the effort for the community to prevent any groups from disappearing now just because of this.

I was the one who originally set up the agreement with when they first started charging for meetups (and it turned out to be a very good business model for them!). When that change was announced, quite a few meetup organisers were going to quit. The sponsorship made it easy to stay, although we did lose a few groups still.

I expect the same will happen this time, moving across groups and users to another service will cause some groups to just fold, and others to lose quite a few members. People lead busy lives, and organising or even just showing up at a usergroup involves a little bit of effort. Extra hassle can just be the trigger which ends that involvement.

The mailing list is quite popular in my own group in Brisbane (does Facebook have mailing lists?) and there are plenty of people who are not on Facebook nor care to be. Another aspect is that many related groups (PHP, etc) also are on and people will be members of multiple groups. Having people keep track of their groups through multiple sites is awkward, so chances are they’ll drop out of some. An unnecessary waste.

It’s just that people dislike having something taken away that was previously theirs, or having to pay for something that was previously free. It’s simple social psychology… people will fight any such instance quite vigorously (or protest by going elsewhere) even if it’s a very small issue. I do appreciate that sentiment, but it can hurt the community.