
MySQL data backup: going beyond mysqldump

A user on a Linux user group mailing list asked about this, and I was one of the people replying. Re-posting here as I reckon it’s of wider interest.

> […] tens of gigs of data in MySQL databases.
> Some in memory tables, some MyISAM, a fair bit InnoDB. According to my
> understanding, when one doesn’t have several hours to take a DB
> offline and do dbbackup, there was/is ibbackup from InnoBase.. but now
> that MySQL and InnoBase have both been ‘Oracle Enterprised’, said
> product is now restricted to MySQL Enterprise customers..
>
> Some quick searching has suggested Percona XtraBackup as a potential
> FOSS alternative.
> What backup techniques do people employ around these parts for backups
> of large mixed MySQL data sets where downtime *must* be minimised?
>
> Has your backup plan ever been put to the test?

You should put it to the test regularly, not just when it’s needed.
An untested backup is not really a backup, I think.

At Open Query we tend to use dual master setups with MMM, additional replication slaves, mysqldump, and XtraBackup or LVM snapshots. It’s not just about having backups, but also about general resilience, maintenance options, and scalability. I’ll clarify:

  • XtraBackup and LVM snapshots give you physical backups. That’s nice if you want to recover or clone a complete instance as-is. But if anything is wrong, it’ll all be stuffed (that is, you can sometimes recover InnoDB tablespaces and there are tools for it, but time may not be on your side). Note that LVM cannot snapshot across multiple volumes consistently, so if you have your InnoDB ibdata/.ibd files and iblog files on separate spindles, LVM is not suitable. (See the sketch after this list for the coordination typically done around a snapshot.)
  • mysqldump for logical (SQL) backups. Most if not all setups should have this. Even if the file(s) were to be corrupted, they’re still readable since it’s plain SQL. You can do partial restores, which is handy in some cases. It’ll be slower to load so having *only* an SQL dump of a larger dataset is not a good idea.
  • some of the above backups can and should *also* be copied off-site. That’s for extra safety, but in terms of recovery speed it may not be optimal and should not be relied upon.
  • having dual masters is for easier maintenance without scheduled outages, as well as resilience when for instance hardware breaks (and it does).
  • slaves. You can even delay a slave (Maatkit has a tool for this), so that would give you a live correct image even in case of a user error, provided you get to it in time. Also, you want enough slack in your infra to be able to initialise a new slave off an existing one. Scaling up at a time when high load is already occurring can become painful if your infra is not prepared for it.
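On the LVM snapshot point above, here is a minimal sketch of the coordination typically done around a filesystem snapshot, assuming all the InnoDB files live on the one volume (the snapshot command in the comment is only an example):

  FLUSH TABLES WITH READ LOCK;   -- quiesce writes; InnoDB files on disk still need
                                 -- crash recovery on restore, which is fine
  SHOW MASTER STATUS;            -- note binlog file + position, so the snapshot can
                                 -- also seed a new slave or do point-in-time recovery
  -- take the filesystem snapshot from another session/shell here
  -- (e.g. lvcreate --snapshot ...), keeping the lock as short as possible
  UNLOCK TABLES;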

A key issue to consider is this… if the dataset is sufficiently large and the online requirements high enough, you can’t afford to just have backups. Why? Because you have to ask: how quickly can you deploy suitable new hardware, install the OS, restore the data, validate it, and put it all back online?

In many cases one or more aspects of the above list simply take too long, so my summary would be “then you don’t really have a backup”. Clients tend to argue with me on that, but only fairly briefly, until they see the point: if a restore takes longer than you can afford, that backup mechanism is unsuitable.

So, we use a combination of tools and approaches depending on needs, but in general terms we aim for keeping the overall environment online (individual machines can and will fail! relying on a magic box or SAN to not fail *will* get you bitten) to vastly reduce the instances where an actual restore is required.
Into that picture also comes using separate test/staging servers to not have developers stuff around on live servers (human error is an important cause of hassles).

In our training modules, we’ve combined the backups, recovery and replication topics as it’s clearly all intertwined and overlapping. Discussing backup techniques separate from replication and dual master setups makes no sense to us. It needs to be put in place with an overall vision.

Note that a SAN is not a backup strategy. And neither is replication on its own.


Challenge: identify this pattern in datadir

You take a look at someone’s MySQL (or MariaDB) data directory, and see

mysql
foo
bar -> foo

  1. What’s the issue? Identify pattern.
  2. What does it mean?  Consequences.
  3. Is there any way it can be safe and useful/usable? Describe.

Good luck!


Unqualified COUNT(*) speed PBXT vs InnoDB

So this is about a SELECT COUNT(*) FROM tblname without a WHERE clause. MyISAM has an optimisation for that since it maintains a rowcount for each table. InnoDB and PBXT can’t do that (at least not easily) because of their multi-versioned nature… different transactions may see a different number of rows for the same table!

So, it’s kinda known but nevertheless often ignored that this operation on InnoDB is costly in terms of time; what InnoDB has to do to figure out the exact number of rows is scan the primary key and just tally. Of course it’s faster if it doesn’t have to read a lot of the blocks from disk (i.e. smaller dataset or a large enough buffer pool).

I was curious about PBXT’s performance on this, and behold it appears to be quite a bit faster! For a table with 50 million rows, PBXT took about 20 minutes whereas the same table in InnoDB took 30 minutes. Interesting!

From those numbers [addendum: yes I do realise there’s something else wrong on that server to take that long, but it’d be slow regardless] you can tell that doing the query at all is not an efficient thing to do, and definitely not something a frontend web page should be doing. Usually you just need a ballpark figure so running the query in a cron job and putting the value into memcached (or just an include file) will work well in such cases.
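As a rough sketch of that cron-job approach (the row_counts table and its name are just an illustration, not a prescription):

  CREATE TABLE row_counts (
    tbl_name  VARCHAR(64) NOT NULL PRIMARY KEY,
    row_count BIGINT UNSIGNED NOT NULL,
    updated   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
  );

  -- run this from cron (or an event), not from the web page:
  REPLACE INTO row_counts (tbl_name, row_count)
    SELECT 'tblname', COUNT(*) FROM tblname;

  -- the frontend then reads the cached figure:
  SELECT row_count FROM row_counts WHERE tbl_name = 'tblname';

Whether you then push that value into memcached, an include file, or simply read it from this table is up to the application.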

If you do use a WHERE clause, all engines (including MyISAM) are in the same boat… they might be able to use an index to filter on the conditions – but the bigger the table, the more work it is for the engine. PBXT being faster than InnoDB for this task makes it potentially interesting for reporting purposes as well, where otherwise you might consider using MyISAM – we generally recommend using a separate reporting slave with particular settings anyway (fewer connections but larger session-specific buffers), but it’s good to have extra choices for the task.

(In case you didn’t know, it’s ok for a slave to use a different engine from a master – so you can really make use of that ability for specialised tasks such as reporting.)
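For example, on a dedicated reporting slave you might convert particular tables (table name here is hypothetical) and let replication carry on as usual:

  -- on the reporting slave only:
  ALTER TABLE order_history ENGINE=MyISAM;
  -- replicated statements keep applying regardless; the storage engine is a
  -- per-table, per-server choice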


The actual range and storage size of an INT

What’s the difference between INT(2) and INT(20)? Not a lot. The number is only a display width hint for output formatting (it really only becomes visible with ZEROFILL), which you’ll never encounter when talking with the server through an API (as you do from most app languages).

The confusion stems from the fact that with CHAR(n) and VARCHAR(n), the (n) signifies the length or maximum length of that field. But for integers, the range and storage size are set by choosing a different data type altogether: TINYINT, SMALLINT, MEDIUMINT, INT (aka INTEGER) or BIGINT.
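A quick demonstration with a hypothetical table of how little the number in the brackets matters:

  CREATE TABLE int_width_demo (
    a INT(2),
    b INT(20),
    c INT(2) ZEROFILL
  );
  INSERT INTO int_width_demo VALUES (1234567, 1234567, 1234567);
  SELECT * FROM int_width_demo;
  -- all three columns store the full value 1234567; only the ZEROFILL column
  -- pads its output (and note that ZEROFILL implies UNSIGNED)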

At Open Query we tend to pick on things like INT(2) when reviewing a client’s schema, because chances are that the developers/DBAs are working under a mistaken assumption and this could cause trouble somewhere – even if not in the exact spot where we pick on it. So it’s a case of pattern recognition.

A very practical example of this comes from a client I worked with last week. I first spotted some harmless ones, we talked about it, and then we hit the jackpot: INT(22) or so, which in fact was storing a unix timestamp converted to an int by the application – for the purpose of, wait for it, the user’s birth date. There are a number of things wrong with this, and the result is something that doesn’t work properly.

Currently, the unix epoch/timestamp is, when stored in binary, a 32-bit integer. With the usual signed interpretation its range runs from late 1901 to early 2038 (the well-known Y2038 limit); stored unsigned it extends to 2106, but then nothing before 1970-01-01 can be represented at all.

  • dates of birth before 1970 are a problem either way: stored unsigned they simply cannot be represented, and stored signed they become negative values that MySQL’s own timestamp functions and most application code won’t handle. You may be “lucky” that it hasn’t visibly blown up yet, but any real user base includes people born before 1970.
  • using a timestamp for date-of-birth tells me that the developers are young 😉 well, that’s relative, but in this case: younger than 40. I was born in 1969, so I am very aware that it’s impossible to represent my birthdate in a unix timestamp! What dates do you test with? Your own, and people around you. ‘nuf said. (See the quick check after this list.)
  • finally, INT(22) is still an INT, which for MySQL means 32 bits (4 bytes), and this one happened to be signed as well.
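A quick check in the mysql client shows how the server itself treats out-of-range timestamps (behaviour as per the MySQL versions current at the time of writing):

  SELECT UNIX_TIMESTAMP('1969-06-21');   -- out of range: returns 0, not a negative value
  SELECT FROM_UNIXTIME(-1);              -- out of range: returns NULL
  SELECT FROM_UNIXTIME(2147483647);      -- 2038-01-19 (UTC): the end of the signed 32-bit road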

So, all in all, this wasn’t going to work. Exactly what would fail where would be highly app code (and date) dependent, but you can tell it needs a quick redesign anyway.

I actually suggested checking the requirements to see whether just a year would suffice for the intended use (it can be stored in a YEAR(4) field); this reduces the amount of personal data stored and thus reduces the privacy concerns. Otherwise, a DATE field that is optionally allowed to have no day-of-month (i.e. only ask for year and month) can again be sufficient for the purpose.
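As a hedged sketch only – the table and column names here are made up – the redesign could look something like:

  ALTER TABLE users
    DROP COLUMN birth_timestamp,          -- the INT(22) unix-timestamp column
    ADD COLUMN birth_year YEAR(4) NULL;   -- year only: least personal data stored
  -- Alternatively, if year/month (or a full date) is genuinely needed, use a DATE
  -- column; MySQL will store a zero day-of-month such as '1969-03-00' as long as
  -- sql_mode does not include NO_ZERO_IN_DATE.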


relay-log-space-limit

We don’t often see this option configured (default: unlimited) but it might be a good idea to set it. What it does is limit the amount of disk space the combined relay logs are allowed to take up.

A slave’s IO_Thread reads from the master and puts the events into the relay log; the slave’s SQL_Thread reads from the relay log and executes the query. If/when replication “breaks”, unless it’s connection related it tends to be during execution of a query. In that case the IO_Thread will keep running (receiving master events and storing in the relay log). Beyond some point, that doesn’t make sense.

The reason for having two separate replication threads (introduced in MySQL 4.0) is that long-running queries don’t delay receiving more data. That’s good. But receiving data is generally pretty fast, so as long as that basic issue is handled, it’s not necessary (for performance) to have the IO_Thread run ahead that far.

So you can set something like relay-log-space-limit=256M. This prevents slave disk space from getting gobbled up in some replication failure scenarios. The data will still be available in the logs on the master (provided of course the log expiration there isn’t too short – replication monitoring is still important!).
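To see where a slave currently stands, something like the following will do; note that relay_log_space_limit is not a dynamic variable, so the limit itself belongs in my.cnf and needs a restart to take effect:

  SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit';
  SHOW SLAVE STATUS\G
  -- the Relay_Log_Space field shows the combined size of the current relay logs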

Conclusion: treat the relay log as a cache. Don’t leave it at “unlimited”; that’s an inefficient (and potentially problematic) use of resources. And if you do run out of disk space, the relay log can get corrupted – then you have to reposition, which re-reads the data from the master anyway.
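For reference, repositioning is the usual STOP SLAVE / CHANGE MASTER TO / START SLAVE dance; the file name and position below are placeholders you would take from SHOW SLAVE STATUS:

  STOP SLAVE;
  -- use Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS:
  CHANGE MASTER TO
    MASTER_LOG_FILE = 'mysql-bin.000123',
    MASTER_LOG_POS  = 98765;
  START SLAVE;
  -- CHANGE MASTER TO discards the existing relay logs and re-fetches from the master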


Know your data – and your numeric types.


In this  “Good Practice/Bad Practice” post I hope to give some guidelines to choosing between MySQL’s numeric types, using longitude and latitude as a modelling example. (Disclaimer: I am not a mathematician, and the generalisations here are meant to help with practical modelling questions rather than be rigorously theoretical.)

Numeric types in MySQL fall into two main varieties:

  • “precise” types such as INTEGER and DECIMAL;
  • the IEEE-standard floating point types FLOAT and DOUBLE.

As a rule of thumb, the first group are for exact, or “counted” quantities. The INTEGER types represent whole numbers, and DECIMAL represents “fixed point” decimal, with a preset number of places after the decimal point.

Various widths of INTEGER are available in MySQL, from 8-bit TINYINT to 64-bit BIGINT. Calculations with integer types are fast, as they usually correspond to hardware register sizes.

DECIMAL is commonly used for quantities like decimal currency where the number of digits of precision is known and fixed. For example, exactly counting pennies in two decimal digits. Computation with DECIMAL is slower than other types, but this is unlikely to impact most applications.

In the other category are FLOAT and DOUBLE, which are the 32 and 64-bit IEEE standard types, which are usually supported in hardware and are therefore fast and convenient for arithmetic. These are generally good choices for “measurements” – values with limited precision.

It is important to understand what is meant by “precision” (Wikipedia has a full discussion). For example, I can measure my height at 185.8 centimetres. Because of the way I make the measurement, which we know to be approximate, this figure is understood to have only one meaningful digit after the decimal point.

Precision is a property of all inexact real world “measurements” – such as position, length, weight, brightness; it is usually expressed as the number of “significant figures” or significant digits. (My height measurement has four significant digits.) This should be considered when values are displayed. It is somewhat misleading to represent my height as 185.8000 – common sense tells us that the value is not this accurate.

Serious problems can occur when we do not know the actual precision of measurements, if we wrongly assume a greater precision than exists. A typical example might be a GPS map display which uses measured position to locate the user relative to features such as rivers, roads, and railway tracks. The map display is high resolution and implies a great deal of precision to the viewer. Let us say that based on incoming data, it places our vehicle three metres East of a river. If the measurement has a true precision of, say, 20 metres, we cannot even know which side of the river we are on! So to allow users to draw safe conclusions, the presentation of data needs to take precision into account.

(While very frequently cited as examples of “precise” figures, not all monetary values are. For example, the value of USA’s national deficit was estimated today at $11,983,250,643,192.95. I am no economist, but rather few of these digits can actually be significant!)

Throwing away precision with inappropriate modelling is also a potential problem. Ideally measurements arrive with an implied or explicit precision. But even without that, we need to assure ourselves that we are safely storing them.

Keep in mind that we are always dealing with two kinds of precision:

  • Machine precision, which is defined by the chosen type; and
  • Data precision, which is a property of the values we are storing.

If I am storing my height as cm in a FLOAT column, the data precision is only 4 significant digits as discussed, but the machine precision of this column is always about 7 significant digits, no matter what we try to store. Clearly to avoid throwing away significant parts of your input, the machine precision should exceed your data precision.

Consider a CSV file of latitude and longitude, and notice that every value has 10 digits after the point:

 Afton Station,NS,45.6051050000,-61.6974950000
 Agassiz,BC,49.2421627750,-121.7496169988

This does not mean that every value has 13 digits of precision – if that were so, then these coordinates would be accurate to within about a hundredth of a millimetre on the ground! This is clearly not possible.

Let’s say we merely want to stick markers on a web page showing a national map. One hundred metre resolution would be more than adequate. Given an Earth circumference of 40,075,020 metres, 100 metres is approximately 1/1113 of a degree. While three digits (0.001) can represent 1/1000ths of a degree, this is not quite precise enough, so let’s keep four fractional digits after the point. Therefore we are looking to represent 7 significant figures, for example:

 Afton Station,NS,45.60511,-61.69750

Which of FLOAT or DOUBLE is the right type to use for such values? Let’s investigate.

A key difference between floating point types and fixed point representations (such as DECIMAL) is that while the overall binary precision is fixed, the precision of the fractional part of the value can vary! The available precision depends on magnitude of the value. The highest precision is available for values closest to zero, and precision gets worse as numbers increase in magnitude (by every power of two).

To understand the effect of this, it is necessary to examine floating point representation in more detail. (I am going to handwave more esoteric features of the IEEE standard and try to give a general picture. In particular I am not going to talk about rounding, biased exponents, or denormalised numbers.)

Floating point values have three parts:

  • a sign (+/-)
  • an “exponent” (binary scale factor)
  • the value itself (known as the fraction or “mantissa”).

These are analogous to decimal “scientific” or “exponential” notation that you may already be familiar with (e.g. 76.4935 = 7.64935 x 10^1 or 7.64935E+1, where 1 is the decimal exponent).

The combination of these fields precisely defines a rational number, whose value is desirably “close enough” to the value you need to store (which was imprecise to begin with, so the approximation involved in converting to floating point representation is normally not a problem).

The overall precision available is determined by the bits allowed to store the fraction. For reference,

  • FLOAT allows 23 physical bits for the fraction (with hidden bit, effectively 24 bits of fraction)
  • DOUBLE allows 52 physical bits for the fraction (with hidden bit, effectively 53)

Respectively, this is 7 and 15 precise decimal digits in the whole figure. So FLOAT probably meets our 7 digit requirement.
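If you would rather verify this empirically than trust the bit-counting, a throwaway test table (the names are mine, not from the original data) makes the rounding visible:

  CREATE TABLE coord_demo (
    lat_f FLOAT,
    lat_d DOUBLE
  );
  INSERT INTO coord_demo VALUES (76.4935, 76.4935), (45.6051050000, 45.6051050000);
  SELECT lat_f, lat_d, lat_d - lat_f AS rounding_error FROM coord_demo;
  -- the FLOAT column is off by a few millionths of a degree at most,
  -- well inside the 1/10000th of a degree we decided we need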

To make sure, let’s work backwards and confirm just how precisely a FLOAT value can represent a latitude. To do this I will show how an example value is converted into the FLOAT representation. None of the values in our table will have latitudes greater than 77 degrees, so I will pick 76.4935. (Higher values have larger exponents and hence the least available precision for the fractional part, so are the safest test.)

First we need to determine the correct exponent for the value. Then we can work out the real-world “resolution” of the number, i.e. how much the actual value changes if we change the fraction by the smallest possible amount (i.e., in its least significant bit).

The exponent is the power of 2 we need to divide by to “normalise” the value into the range between 0 and 1 – in other words, the smallest power of 2 that is larger than the value. Since our starting value is more than 1, we look through the powers of 2: 1, 2, 4, 8, 16, 32, 64, 128, 256… A glance at this list shows us that for 76, a divisor of 128 (2^7) is the right one; i.e., the floating point exponent is 7. And the resulting fraction part is 76.4935 / 128, or in decimal, 0.59760546875.

(Normalisation works similarly for starting values less than one, but in the other direction. We multiply by the largest power of 2 that leaves the value less than one, and store the negative exponent. Negative values are simply dealt with by converting to positive before normalising, and noting a negative sign.)

Since all non-zero positive numbers begin with binary ‘1’, IEEE representation cleverly implies this “hidden” 1 bit, and doesn’t physically store it. This frees up one more bit for the fraction, i.e. giving a total of 24 bits for FLOAT precision. (Because of this, and “exponent biasing”, the binary sequence shown isn’t the actual IEEE bit pattern representing this number.)

Assuming a fraction of 24 bits, let’s examine its “binary expansion”. This is effectively just the result of 0.59760546875 x 2^24, written out in base 2:

1001100 . 01111110010101100  (total 24 bits including "hidden" bit)
^^^^^^^   ^^^^^^^^^^^^^^^^^
whole #   fraction part
=    76 . 4935 (approx)

Because the exponent is 7, the first 7 bits in my binary sequence above are the whole number part (= binary 1001100 = 76). I’ve put a gap where the “binary point” belongs. Written out like this, we can see that we have 17 bits to the right of this point. 2^17 = 131072; so, around this magnitude of 76 degrees (and up to 128, as at 128 the binary exponent increases to 8), we can resolve no worse than 1/131072th of a degree. This is enough bits for five decimal digits (which only requires 1/100000th).

So how precise on the ground will this be?

1/131072th of a degree of latitude is about 0.85 metres: much better than our hoped for resolution of 100 metres. So we have shown that FLOAT is more than adequate for the job. This is not to say that FLOAT is the correct choice for all geographic uses; only that it is adequate for this use, where we decided 100m resolution was enough. On the other hand, source data for geocoding may need more precision than FLOAT can deliver. (Naturally the extra precision must be present in the source data. Simply using a more precise type cannot add any precision to the original measurement, of course 🙂)
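If you prefer to let the server do that arithmetic, using the circumference figure from earlier:

  SELECT 40075020 / 360 / 131072 AS metres_per_lsb_step;
  -- roughly 0.85: one least-significant-bit step at this magnitude moves us about 0.85 metres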

The analysis above can be done for DOUBLE, of course. For interest’s sake, the equivalent binary expansion is:

1001100 . 0111111001010110000001000001100010010011011101  (total 53 bits including "hidden" bit)

We have 29 more bits to play with, or a total of 46 fraction bits after the whole number part, at this magnitude. This is ridiculously precise, and can resolve no worse than 1/70368744177664th of a degree; or 0.0000015mm on the surface of the globe. (This is enough to represent 13 decimal digits after the point.)

This example has shown how knowing a little of how floating point works can help you be confident about issues of precision, when choosing types to represent approximate values, or measurements – rather than automatically falling back on DOUBLE or even DECIMAL as a “paranoid default”. The key is to know your data, and understand how much precision you have, and how much your application needs.


Trivia: identify this replication failure

We got good responses to the “identify this query profile” question. Indeed it indicates an SQL injection attack. Obviously a code problem, but you must also think about “what can we do right now to stop this”. See the responses and my last note on it below the original post.

Got a new one for you!

You find a system with broken replication; it could be a slave or one in a dual master setup. The IO thread is still running, but the SQL thread is not, and the last error is (yes, the error string really is this long – sorry I did not paste it into the original post; updated later):

“Could not parse relay log event entry. The possible reasons are: the master’s binary log is corrupted (you can check this by running ‘mysqlbinlog’ on the binary log), the slave’s relay log is corrupted (you can check this by running ‘mysqlbinlog’ on the relay log), a network problem, or a bug in the master’s or slave’s MySQL code. If you want to check the master’s binary log or slave’s relay log, you will be able to know their names by issuing ‘SHOW SLAVE STATUS’ on this slave.”

In other similar cases the error message is about something else, but the query shown with it makes no sense. To me, that essentially says the same thing as the above.

The server appears to have been restarted recently.

What’s wrong, and what’s your quickest way to get replication going again given this state?


Dogfood: making our systems more resilient

This is a “dogfood” type story (see below for explanation of the term)… Open Query has ideas on resilient architecture which it teaches (training) and recommends (consulting, support) to clients and the general public (blog, conferences, user group talks). Like many other businesses, when we first started we set up our infrastructure quickly and on the cheap, and it’s grown since. That’s how things grow naturally, and is as always a trade-off between keeping your business running and developing while also improving infrastructure (business processes and technical).

Quite a few months ago we also started investing (mostly time) in the technical infrastructure, slowly moving the various systems across to new servers and splitting things up along the way. Around the same time, the main webserver frequently became unresponsive. I’ll spare you the details; we knew what the problem was and it was predictable, but since it wasn’t our system there was only so much we could do. However, systems gather dependencies over time and thus it was actually quite complicated to move. In fact, apart from our mail, the public website was the last thing we moved, and that was through necessity, not desire.

Of course it’s best for a company when its public website works, and it’s quite likely you have noticed some glitches in ours over time. Now that we’re running on the new infra, I happened to take a quick peek at our Google Analytics data, and noticed an increase in average traffic numbers of about 40%. Great big ouch.

And I’m telling this because I think it’s educational, and the world is generally not served by companies keeping problems and mishaps secret. Nasties grow organically and without malicious intent, improvements are a step-wise process, all that… but in the end, the net result of improvements can be more amazing than just general peace of mind! And of course it’s very important to not just watch things happen, but to actively and continually work on those incremental improvements.

Our new infra has dual master MySQL servers (no surprise there 😉 but based in separate data centres so that makes the setup a bit more complicated (MMM doesn’t deal with that setup). Other “new” components we use are lighttpd, haproxy, and Zimbra (new in the sense that our old external infra used different tech). Most systems (not all, yet) are redundant/expendable and run on a mix of Linode instances and our own machines. Doing these things for your own infra is particularly educational, it provides extra perspective. The result is, I believe, pretty decent. Failures generally won’t cause major disruption any more, if at all. Of course, it’s still work in progress.

Running costs of this “farm”? I’ll tell later, as I think it’s a good topic for a poll and I’m curious: how much do you spend on server infrastructure per month?

Background for non-Anglophones: “eating your own dogfood” refers to a company doing themselves what they’re recommending to their clients and in general. Also known as “leading by example”, but I think it’s also about trust and credibility. On the other hand, there’s the “dentist’s tooth-ache” which refers to the fact that doctors are their own worst patients 😉


Will your production MySQL server survive a restart?

Do you know if your production MySQL servers will come back up when restarted? A recent support episode illustrates a number of best practices. The task looked trivial: Update a production MySQL server (replication master) with a configuration tuned and tested on a development server. Clean shutdown, change configuration, restart. Unfortunately, the MySQL daemon did not just ‘come back’, leaving 2 sites offline. Thus begins an illuminating debugging story.
First place to look is the daemon error log, which revealed that the server was segfaulting, seemingly at the end of or just after InnoDB recovery. Reverting to the previous configuration did not help, nor did changing the InnoDB recovery mode. Working with the client, we performed a failover to a replication slave, while I got a second opinion from a fellow engineer to work out what had gone wrong on the server.
Since debug symbols weren’t shown in the stack trace, we needed to generate a symbol file (binary was unstripped) to use with the resolve_stack_dump utility. The procedure for obtaining this is detailed in the MySQL manual. With a good stack trace in hand, we were able (with assistance from an old friend, thanks Dean!) to narrow the crash down to bug 38856 (also see 37027). A little further investigation showed that the right conditions did exist to trigger this bug:
  • expire_logs_days = 14 # had been set in the my.cnf
  • the binlog.index file did not match the actual state of log files (i.e. some had been manually deleted, or deleted by a script)
So with this knowledge, it was possible to bring the MySQL server back up. It turned out that the expire_logs_days setting had probably been added to the configuration but not tested at the time (the server had not been restarted for 3 months). This had placed the system in a state, unbeknownst to the administrators, where it would not come back up after a restart. It was an interesting (if a tad stressful) incident, as it illustrates the reasons for many best practices – which most of us know and follow – and which are worth re-capping here:
  • even seemingly trivial maintenance can potentially trigger downtime
  • plan any production maintenance in the quiet zone, and be sure to allow enough time to deal with the unforeseen
  • don’t assume your live server will ‘just restart’
  • put my.cnf under revision control (check out “etckeeper”, a standard Ubuntu package; it can keep track of everything in /etc using bzr, svn or git)
  • do not make un-tested changes to config, test immediately, preferably on dev or staging system
  • be ready to failover (test regularly like a fire drill); this is another reason why master-master setups are more convenient than mere master-slave
  • replication alone is NOT a backup
  • don’t remove binlogs or otherwise touch anything in the data dir behind mysql’s back (see the sketch below this list for the safe way)
  • have only 1 admin per server so you don’t step on each other’s toes (but share credentials with 2IC for emergencies only)
  • use a trusted origin for your binary packages; just building and passing the basic test-suite is not always sufficient
  • know how to get a good stack trace with symbols, to help find bug reports
  • be familiar with bugs.mysql.com, but it still helps to ask others as they might have just seen something similar and can help you quickly find what you’re looking for!
  • and last but very important: it really pays to find the root cause of a problem (and prevention requires it!), so a “postmortem” on a dead server is very important… if we had just wiped that server, the problem might have recurred on another server later.
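On that binary log point: the safe way to get rid of old binlogs is to let the server do it, roughly along these lines (the 14 days simply mirrors the expire_logs_days value from this incident):

  SHOW BINARY LOGS;
  -- make sure no slave still needs the logs you are about to remove
  -- (check each slave's SHOW SLAVE STATUS first), then:
  PURGE BINARY LOGS BEFORE NOW() - INTERVAL 14 DAY;
  -- or let the server expire them automatically:
  SET GLOBAL expire_logs_days = 14;   -- and put the same setting in my.cnf, then test a restart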