
HDlatency – now with quick option

I’ve done a minor update to the hdlatency tool (get it from Launchpad): it now has a --quick option that makes it run its tests only with 16KB blocks rather than a whole range of sizes. This is much quicker, and since 16KB is the InnoDB page size it’s the most relevant size for MySQL/MariaDB deployments.

However, I didn’t just remove the other stuff, because it can be very helpful in tracking down problems and putting misconceptions to rest. On SANs (and local RAID of course) you have things like block sizes and stripe sizes, and opinions on what might be faster. Interestingly, the real world doesn’t always agree with the opinions.

As Mark Callaghan correctly pointed out when I first published it, hdlatency does not provide anything new in terms of functionality; the database I/O tests of sysbench cover it all. A key advantage of hdlatency is that it has no dependencies: it’s a small single piece of C code that will compile and run in very minimalistic environments. We often don’t control the base environment we have to work in, and that’s why hdlatency was originally written. It’s just a quick little tool that does the job.

We find hdlatency particularly useful for comparing environments, primarily at the same client. For instance, the client might consider moving from one storage solution to another – well, in that case it’s useful to know whether we can expect an actual performance benefit.

The burst data rate (big sequential read or write) that often gets quoted for a SAN or even an individual disk is of little interest for database use, since the key performance bottleneck there lies in random access I/O: the disk head(s) will need to move. So it’s important to get some real, relevant numbers, rather than just go with magic vendor numbers that are not really relevant to you. Also, you can have a fast storage system attached via a slow interface, and consequently the performance will not be at all what you’d want to see. It can be quite bad.

To get an absolute baseline of what sane numbers look like, also run hdlatency on a local desktop HD. This may seem odd, but you might well encounter storage systems that show lower performance than that. ‘nuf said.

If you’re willing to share, I’d be quite interested in seeing some (--quick) output data from you – just make sure you mention what storage it is: type of interface, etc. Simply drop it in a comment to this post, so it can benefit more people. Thanks!


Slides from DrupalDownUnder2011 on Tuning for Drupal

By popular request, here’s the PDF of the slides of this talk as presented in January 2011 in Brisbane; it’s fairly self-explanatory. Note that it’s not really extensive “tuning”, it just fixes up a few things that are usually “wrong” in default installs, creating a more sane baseline. If you want to get to optimal correctness and more performance, other things do need to be done as well.


Open Query, new on Fifth Ave

Some of you already know, since you helped us move: we recently shifted Open Query’s main office to Fifth Avenue, next door to Elizabeth’s. The new place is comfortable, I really like it so far. Anna is also happy with her new admin space, and cat Figaro has found an empty spot on a bookshelf to stretch out on!

The lease costs are a bit steep, as is common these days… chances are we’ll just buy our next place.


Follow-up: yes, this was an April 1st post. But everything in the above post is the truth; it’s just phrased to be very open to a bit of mis-interpretation 😉

I find that the real world provides plenty of fun and unbelievable yet true tidbits, so why bother making up nonsense!


MySQL data backup: going beyond mysqldump

A user on a Linux user group mailing list asked about this, and I was one of the people replying. Re-posting here as I reckon it’s of wider interest.

> […] tens of gigs of data in MySQL databases.
> Some in memory tables, some MyISAM, a fair bit InnoDB. According to my
> understanding, when one doesn’t have several hours to take a DB
> offline and do dbbackup, there was/is ibbackup from InnoBase.. but now
> that MySQL and InnoBase have both been ‘Oracle Enterprised’, said
> product is now restricted to MySQL Enterprise customers..
>
> Some quick searching has suggested Percona XtraBackup as a potential
> FOSS alternative.
> What backup techniques do people employ around these parts for backups
> of large mixed MySQL data sets where downtime *must* be minimised?
>
> Has your backup plan ever been put to the test?

You should put it to the test regularly, not just when it’s needed.
An untested backup is not really a backup, I think.

At Open Query we tend to use dual master setups with MMM, other replication slaves, mysqldump, and XtraBackup or LVM snapshots. It’s not just about having backups, but also about general resilience, maintenance options, and scalability. I’ll clarify:

  • XtraBackup and LVM give you physical backups. That’s nice if you want to recover or clone a complete instance as-is. But if anything is wrong, it’ll be all stuffed (that is, you can sometimes recover InnoDB tablespaces and there are tools for it, but time may not be on your side). Note that LVM cannot snapshot across multiple volumes consistently, so if you have your InnoDB ibdata/IBD files and iblog files on separate spindles, using LVM is not suitable.
  • mysqldump for logical (SQL) backups. Most if not all setups should have this. Even if the file(s) were to be corrupted, they’re still readable since it’s plain SQL. You can do partial restores, which is handy in some cases. It’ll be slower to load, so having *only* an SQL dump of a larger dataset is not a good idea.
  • Some of the above backups can and should *also* be copied off-site. That’s for extra safety, but in terms of recovery speed it may not be optimal and should not be relied upon.
  • having dual masters is for easier maintenance without scheduled outages, as well as resilience when for instance hardware breaks (and it does).
  • slaves. You can even delay a slave (Maatkit has a tool for this), which would give you a live, correct image even in case of a user error, provided you get to it in time. Also, you want enough slack in your infra to be able to initialise a new slave off an existing one (see the sketch below). Scaling up at a time when high load is already occurring can become painful if your infra is not prepared for it.
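As a small illustration of that last point: once the data has been copied across from an existing slave, pointing the new slave at the master is only a couple of statements. This is just a hedged sketch – host, user, password, binlog file and position are made-up placeholders you would take from the existing slave’s coordinates:

CHANGE MASTER TO
  MASTER_HOST='master1.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='...',
  MASTER_LOG_FILE='mysql-bin.000123',
  MASTER_LOG_POS=98;
START SLAVE;
SHOW SLAVE STATUS\G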

A key issue to consider is this… if the dataset is sufficiently large, and the online requirements high enough, you can’t afford to just have backups. Why? Because, how quickly can you deploy new suitable hardware, install OS, do restore, validate, put back online?

In many cases one or more aspects of the above list simply take too long, so my summary would be “then you don’t really have a backup”. Clients tend to argue with me on that, but only fairly briefly, until they see the point: if a restore takes longer than you can afford, that backup mechanism is unsuitable.

So, we use a combination of tools and approaches depending on needs, but in general terms we aim for keeping the overall environment online (individual machines can and will fail! relying on a magic box or SAN to not fail *will* get you bitten) to vastly reduce the instances where an actual restore is required.
Into that picture also comes using separate test/staging servers to not have developers stuff around on live servers (human error is an important cause of hassles).

In our training modules, we’ve combined the backups, recovery and replication topics as it’s clearly all intertwined and overlapping. Discussing backup techniques separate from replication and dual master setups makes no sense to us. It needs to be put in place with an overall vision.

Note that a SAN is not a backup strategy. And neither is replication on its own.


Cache pre-loading on mysqld startup

The following quirky dynamic SQL will scan each index of each table so that they’re loaded into the key_buffer (MyISAM) or innodb_buffer_pool (InnoDB). If you also use the PBXT engine, which does have a row cache but no clustered primary key, you could also incorporate some full table scans.

To make mysqld execute this on startup, create /var/lib/mysql/initfile.sql and make it owned by mysql:mysql:

SET SESSION group_concat_max_len=100*1024*1024;
SELECT GROUP_CONCAT(CONCAT('SELECT COUNT(`',column_name,'`) FROM `',table_schema,'`.`',table_name,'` FORCE INDEX (`',index_name,'`)') SEPARATOR ' UNION ALL ') INTO @sql FROM information_schema.statistics WHERE table_schema NOT IN ('information_schema','mysql') AND seq_in_index = 1;
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
SET SESSION group_concat_max_len=@@GLOBAL.group_concat_max_len;

and in my.cnf, add a line to the [mysqld] block:

init-file = /var/lib/mysql/initfile.sql

That’s all. mysqld reads that file on startup and executes each line. Since we can do the whole select in a single (admittedly quirky) query and then use dynamic SQL to execute the result, we don’t need to create a stored procedure.

Of course this kind of simplistic “get everything” only really makes sense if the entire dataset+indexes fit in memory, otherwise you’ll want to be more selective. Still, you could use the above as a basis, perhaps using another table to provide a list of tables/indexes to be excluded – or if the schema is really stable, simply have a list of tables/indexes to be included instead of dynamically using information_schema.
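As a rough sketch of that idea: suppose you maintain a (hypothetical) helper table mysql.cache_warmup_exclude listing the tables to skip; only the WHERE clause of the original query needs to change. Something along these lines – untested, names made up, and reformatted across lines for readability (remember that in the actual init file each statement goes on a single line):

-- hypothetical list of tables to skip; kept in the mysql schema so it is excluded from the scan itself
CREATE TABLE mysql.cache_warmup_exclude (
  schema_name VARCHAR(64) NOT NULL,
  table_name VARCHAR(64) NOT NULL,
  PRIMARY KEY (schema_name, table_name)
);

SET SESSION group_concat_max_len=100*1024*1024;
SELECT GROUP_CONCAT(CONCAT('SELECT COUNT(`',column_name,'`) FROM `',table_schema,'`.`',table_name,'` FORCE INDEX (`',index_name,'`)') SEPARATOR ' UNION ALL ')
  INTO @sql
  FROM information_schema.statistics s
  WHERE s.table_schema NOT IN ('information_schema','mysql')
    AND s.seq_in_index = 1
    AND NOT EXISTS (SELECT 1 FROM mysql.cache_warmup_exclude e
                    WHERE e.schema_name = s.table_schema AND e.table_name = s.table_name);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;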

Practical (albeit niche) application:

In a system with multiple slaves, adding a new slave makes it start with cold caches, but since with load balancing it will pick up only some of the load, it often works out ok. However, some environments have dual masters but the application is not able to do read/write splits to utilise slaves. In that case all the reads also go to the active master, so the passive master will have relatively cold caches (only rows/indexes that have been updated will be in memory). In case of a failover, the amount of disk reads for the many concurrent SELECT queries will go through the roof, temporarily slowing the effective performance to a dismal crawl: each query takes longer because of the required additional disk access, so depending on the setup the server may even run out of connections, which in turn upsets the application servers. It’d sort itself out, but a) it looks very bad on the frontend and b) it may take a number of minutes.

The above construct prevents that scenario, and as mentioned it can be used as a basis to deal with other situations. Not many people know about the init-file option, so this is a nice example.

If you want to know how the SQL works, read on. The original line is very long so I’ll reprint it below with some reformatting:

SELECT GROUP_CONCAT(CONCAT(
  'SELECT COUNT(`',column_name,'`)
          FROM `',table_schema,'`.`',table_name,
          '` FORCE INDEX (`',index_name,'`)'
       ) SEPARATOR ' UNION ALL ')
  INTO @sql
  FROM information_schema.statistics
  WHERE table_schema NOT IN ('information_schema','mysql')
  AND seq_in_index = 1;

The outer query grabs each regular db/table/index/firstcol name that exists in the server, writing out a SELECT query that counts all not-NULL values of the indexed column (so it must scan the index), forcing that specific index. We then abuse the versatile and flexible GROUP_CONCAT() function to glue all those SELECTs together, with “UNION ALL” in between. The result is a single very long string, so we need to raise the maximum allowed GROUP_CONCAT() output beforehand to prevent truncation.
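To make that concrete: for a hypothetical table test.t1 with a primary key on id and a secondary index idx_name on name, the generated @sql would look roughly like this (split across two lines here for readability; it is really one long string):

SELECT COUNT(`id`) FROM `test`.`t1` FORCE INDEX (`PRIMARY`) UNION ALL
SELECT COUNT(`name`) FROM `test`.`t1` FORCE INDEX (`idx_name`)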


Oracle Blamed for Laws of Nature

A catchy headline, and I believe more accurate than Oracle Puts the Squeeze on SMBs with MySQL Price Hike (Network World) and MySQL price hikes reveal depth of Oracle’s wallet love [MySQL Jacking up MySQL Prices] (The Register). Slightly more realistic is Oracle kills low-priced MySQL support (again The Register).

First, let’s review what Oracle has actually done: they ditched the MySQL Enterprise Basic and Silver offerings. For Oracle, that makes sense. Their intended client base is “enterprise” (high end, think big corporates) and their MySQL sales and cost structure reflects this. It’s not a new thing that came with MySQL at Oracle, because MySQL at Sun Microsystems and MySQL AB before it had the same approach.

A company simply cannot operate below its market – that is not simply a matter of choice, instead it is dictated by their processes and cost structure. Smart people like Clayton Christensen at Harvard Business School have done ample research on this, here I’ll just give one simple example:

If you hire a sales person on commission and their quarterly quota is $100k, then they have to talk with clients that have at least a $10k-$20k potential (qualified leads), and they need to close (sign contract) with at least 10 within the period. They simply cannot spend any time on talking with potential $1k customers.

We may lament this state of affairs, but you can see how, given the choices made (sales person hired, commission system, quota), it’s as inevitable as an apple falling when you drop it. The way I describe this at Upstarta: if a company wants different results, they need to make sure that their business processes and cost structure lead them in that direction. But the simple fact is that most companies don’t have an internal feedback cycle that keeps an eye on these things, so they just go with the flow of consequences of common choices: aim for large(r) clients, grow turnover, get higher operational costs along the way – that in itself is a cycle and the only direction this particular one can go is up. As a natural consequence, over time old low-end offerings and clients need to be jettisoned – one way or another.

I say hooray for Oracle finally acknowledging this, since Sun Microsystems and MySQL AB before it did not (for whatever reason). This is years overdue. Whether the original MySQL company should also have aimed to serve smaller clients is an entirely separate topic – and one which I covered at length previously (including internally in my time at MySQL AB), but it’s very much a station long passed. Once you float upward in the market, you can’t operate or move downward.

Now, are SMBs using MySQL actually getting squeezed by Oracle? They are not. There is no lock-in. This is about service contracts, not licensing. As we all know, MySQL is GPL licensed and internal use (even on a website or SaaS offering) is well within GPL parameters. There are a number of different companies offering services for MySQL, with different types of service and delivery models and a corresponding wide range of pricing. So SMBs, and anyone else, have a choice; each can pick the type of service most suited to their needs. Let us celebrate and promote that freedom within the MySQL ecosystem, rather than being outraged about dropped apples falling!


Open Query on Twitter/Identi.ca

Open Query now has its own @openquery account on Twitter and Identi.ca so you can conveniently follow us there for announcements and tips – and also ask us questions! All OQ engineers can post/reply. The OQ site front page also tracks this feed.

Previously I was posting from my personal @arjenlentz account with #openquery hashtag, but that’s obviously less practical.


Challenge: identify this pattern in datadir

You take a look at someone’s MySQL (or MariaDB) data directory, and see

mysql
foo
bar -> foo

  1. What’s the issue? Identify pattern.
  2. What does it mean?  Consequences.
  3. Is there any way it can be safe and useful/usable? Describe.

Good luck!


Unqualified COUNT(*) speed PBXT vs InnoDB

So this is about a SELECT COUNT(*) FROM tblname without a WHERE clause. MyISAM has an optimisation for that since it maintains a rowcount for each table. InnoDB and PBXT can’t do that (at least not easily) because of their multi-versioned nature: different transactions may see a different number of rows for the same table!

So, it’s kinda known but nevertheless often ignored that this operation on InnoDB is costly in terms of time; what InnoDB has to do to figure out the exact number of rows is scan the primary key and just tally. Of course it’s faster if it doesn’t have to read a lot of the blocks from disk (i.e. smaller dataset or a large enough buffer pool).

I was curious about PBXT’s performance on this, and behold it appears to be quite a bit faster! For a table with 50 million rows, PBXT took about 20 minutes whereas the same table in InnoDB took 30 minutes. Interesting!

From those numbers [addendum: yes, I do realise there’s something else wrong on that server for it to take that long, but it’d be slow regardless] you can tell that running this query at all is not an efficient thing to do, and definitely not something a frontend web page should be doing. Usually you just need a ballpark figure, so running the query in a cron job and putting the value into memcached (or just an include file) will work well in such cases.
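A minimal sketch of that approach, using a plain summary table instead of memcached (the stats schema and table names are made up):

-- hypothetical summary table that the frontend reads with a cheap primary key lookup
CREATE TABLE stats.rowcount_cache (
  table_name VARCHAR(64) NOT NULL PRIMARY KEY,
  row_count BIGINT UNSIGNED NOT NULL,
  refreshed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- run from cron (say hourly); the expensive COUNT(*) happens here, not on page views
REPLACE INTO stats.rowcount_cache (table_name, row_count)
  SELECT 'tblname', COUNT(*) FROM tblname;

The frontend then just does a quick SELECT row_count FROM stats.rowcount_cache WHERE table_name='tblname'.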

If you do use a WHERE clause, all engines (including MyISAM) are in the same boat… they might be able to use an index to filter on the conditions – but the bigger the table, the more work it is for the engine. PBXT being faster than InnoDB for this task makes it potentially interesting for reporting purposes as well, where otherwise you might consider using MyISAM – we generally recommend using a separate reporting slave with particular settings anyway (fewer connections but larger session-specific buffers), but it’s good to have extra choices for the task.

(In case you didn’t know, it’s ok for a slave to use a different engine from a master – so you can really make use of that ability for specialised tasks such as reporting.)
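For instance, converting a single reporting table on the slave only could look like this – a hedged sketch, with a made-up table name; the SET sql_log_bin only matters if the slave also writes a binary log of its own:

-- run on the reporting slave only
SET SESSION sql_log_bin = 0;  -- keep this ALTER out of the slave's own binary log
ALTER TABLE reports.fact_sales ENGINE = PBXT;  -- or MyISAM, depending on the reporting workload
SET SESSION sql_log_bin = 1;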


PBXT early impressions in production use

With Paul McCullagh’s PBXT storage engine getting integrated into MariaDB 5.1, it’s never been easier to try it out. So we have, on a slave off one of our own production systems, which gets lots of inserts from our Zabbix monitoring system.

That’s possibly an ideal usage profile, since PBXT is a log based engine (simplistically stated, it indexes its transaction logs, rather than rewriting data from log into index and indexing that), so it should require less disk I/O than, say, InnoDB. That means it should be particularly suited to workloads such as logging, which have lots of inserts on a sustained basis. Note that for a short insert burst you may not see a difference with InnoDB because of caching, but sustain it and then you will notice.

Because PBXT has such a different/distinct architecture, there’s a lot of learning involved. Together with Paul, and with help from Roland Bouman, we also created a stored procedure that can calculate the optimal average row size for PBXT, and even generate the ALTER TABLE statements you can paste to convert tables. The AVG_ROW_LENGTH option is quite critical with PBXT: if set too big (or if you let PBXT guess and it gets it wrong) it’ll eat heaps more diskspace as well as being much slower, and if too small it’ll be slower also; thus, it needs to be in the right ballpark. For existing datasets it can be calculated, so that’s what we’ve worked on. The procs will be published shortly, and Paul will also put them in with the rest of the PBXT files.
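The procs themselves aren’t in this post, but to give the basic idea, here’s a hedged sketch of the manual equivalent. The zabbix.history table is just an example, and the avg_row_length reported by information_schema is only an estimate (particularly on InnoDB), so treat the value as a ballpark:

-- ballpark average row size of the existing table
SELECT avg_row_length FROM information_schema.tables
  WHERE table_schema = 'zabbix' AND table_name = 'history';

-- convert, plugging in a value in the right ballpark (88 here is just an example)
ALTER TABLE zabbix.history ENGINE = PBXT AVG_ROW_LENGTH = 88;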

Another important aspect for PBXT is having sufficient cache memory allocated, otherwise operations can take much much longer. While the exact “cause” is different, one would notice similar performance aspects when using InnoDB on larger datasets and buffers that are too small for the purpose.

So, while using or converting some tables to PBXT takes a bit of consideration, effort and learning, it appears to be dealing with the real world very well so far – and that’s a testament to Paul’s experience. Paul is also very responsive to questions. As we gain more experience, it is our intent to try PBXT for some of our clients that have operational needs that might be a particularly good fit for PBXT.

I should also mention that it is possible to have a consistent transaction between PBXT, InnoDB and the binary log, because of the 2-phase commit (XA) infrastructure. This means that you should even be able to do a mysqldump with --single-transaction if you have both PBXT and InnoDB tables, and acquire a consistent snapshot!
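As a trivial sketch of what that means in practice (table names are made up; say orders is InnoDB and order_log is PBXT), both inserts commit atomically and end up as one unit in the binary log:

START TRANSACTION;
INSERT INTO shop.orders (id, total) VALUES (42, 19.95);
INSERT INTO shop.order_log (order_id, note) VALUES (42, 'created');
COMMIT;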

More experiences and details to come.
