
TEXT and VARCHAR inefficiencies in your db schema

The TEXT and VARCHAR definitions in many db schemas are based on old information – that is, they presume restrictions and behaviour from MySQL versions long gone. This has consequences for performance. To us, the use of VARCHAR(255), for instance, is a key indicator of this. Yep, an anti-pattern.

VARCHAR

In MySQL 4.0, VARCHAR was restricted to a maximum of 255. MySQL 4.1 introduced character sets such as UTF8, and since MySQL 5.0 a VARCHAR can be up to 64K-1 in byte length. Thus, any occurrence of VARCHAR(255) indicates old-style logic that needs to be reviewed.

Why not just set the maximum length possible? Well…

A VARCHAR is subject to its character set: for UTF8 this means up to 3 bytes per character, or up to 4 for utf8mb4. So if one specifies VARCHAR(50) CHARSET utf8mb4, the actual byte length of the stored string can be up to 200 bytes. In the stored row format, MySQL uses a 1-byte length prefix for a VARCHAR when possible (depending on the column definition), and 2 bytes if necessary. So, specifying VARCHAR(255) unnecessarily means that the server has to use a 2-byte length prefix in the stored row.
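
As an illustrative sketch (the table and column names are made up), the two columns below differ only in their declared length, yet the second one forces the 2-byte length prefix:

CREATE TABLE varchar_demo (
  short_name VARCHAR(50)  CHARACTER SET utf8mb4,  -- at most 200 bytes: 1-byte length prefix
  long_name  VARCHAR(255) CHARACTER SET utf8mb4   -- up to 1020 bytes: 2-byte length prefix
);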

This may be viewed as nitpicking, however storage efficiency affects the number of rows that fit on a data page, and thus the amount of I/O required to manage a given number of rows. It all adds up, so even small unnecessary inefficiencies have a cost – particularly for larger sites.

VARCHAR best practice

Best practice is to set a VARCHAR to the maximum necessary, not the maximum possible – otherwise, as per the above, the maximum possible is about 16000 for utf8mb4, not 255 – and nobody would propose setting it to 16000, would they? Yet in stored row space it's not much different: a VARCHAR(255) requires a 2-byte length prefix just as a VARCHAR(16000) would.

So please review your VARCHAR columns and set their definition to the maximum actually necessary; this is very unlikely to come out as exactly 255. If 255, why not 300? Or rather 200? Or 60? Setting a specific number shows that thought and data analysis have gone into the design. 255 looks sloppy.
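
As a sketch of such a review (table and column names are hypothetical), check the longest value actually stored and then right-size the column:

SELECT MAX(CHAR_LENGTH(username)) FROM members;
ALTER TABLE members MODIFY username VARCHAR(60) CHARACTER SET utf8mb4;  -- repeat any NOT NULL/DEFAULT attributes in the MODIFY clause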

TEXT

TEXT (and LONGTEXT) columns are handled differently in MySQL/MariaDB. First, a recap of some facts related to TEXT columns.

The db server often needs to create a temporary table while processing a query. MEMORY tables cannot contain TEXT-type columns, so the temporary table created will be a disk-based one. Admittedly it will likely remain in the disk cache and never actually touch a disk, but it still goes through the file I/O functions and thus causes overhead – unnecessarily. Queries will be slower.

InnoDB can store a TEXT column on a separate page, and only retrieve it when necessary. (This also means that using SELECT * is needlessly inefficient – it's almost always better to specify only the columns that are required. It also makes code maintenance easier: you can scan the source code for referenced column names and actually find all relevant code and queries.)

TEXT best practice

A TEXT column can contain up to 64K-1 in byte length (4G for LONGTEXT). So essentially a TEXT column can store the same amount of data as a VARCHAR column (since MySQL 5.0), and we know that VARCHAR offers us benefits in terms of server behaviour. Thus, any instance of TEXT should be carefully reviewed, and generally the outcome is to change it to an appropriately sized VARCHAR.
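
As a sketch (hypothetical table and column names), such a review might end with verifying the longest stored value and then converting:

SELECT MAX(LENGTH(summary)) FROM articles;  -- longest current value, in bytes
ALTER TABLE articles MODIFY summary VARCHAR(2000) CHARACTER SET utf8mb4;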

Using LONGTEXT is OK if genuinely necessary. But if the amount of data is not going to exceed, say, 16K characters, LONGTEXT is not warranted, and again VARCHAR (not TEXT) is the most suitable column type.

Summary

Particularly when combined with the best practice of not using SELECT *, using appropriately defined VARCHAR columns (rather than VARCHAR(255) or TEXT) can have a measurable and even significant performance impact on application environments.

Applications don’t need to care, so the db definition can be altered without any application impact.

It is a worthwhile effort.


Contributing to popular frameworks for scalability

Right now we're volunteering some engineering time to assist the WordPress and WooCommerce people with scalability issues. In the past we've put similar efforts into Drupal.

There are many opinions on these systems out there, most of them sadly negative. We take a different view. Each of these frameworks obviously has its advantages and disadvantages, but the key question is why people use them. When we understand that, we can assess that reasoning and assist further.

Obviously, writing all your own code can potentially produce the best result for your site: a custom tool will be less code and better tuned to your situation. However, it also requires quite a bit of development effort both to create and to maintain that system, including dealing with security issues. When you're big enough (as a site/company) this can be worthwhile, but more and more organisations appear to be moving towards a CMS which they then extend to their needs using the usual module/plugin/extension model.

This also means that bigger and bigger sites use these systems, and thus we need to look at their scalability. CMS frameworks tend to be "db heavy": they are highly configurable, and their modular/abstraction architecture sometimes goes for "code architectural correctness" without taking scaling into account. Naturally most such code works fine on a dev box and even on a modest live server. But add sufficient content, and things rapidly go downhill. Just as with other database-related tuning issues, there is no gradual degradation in performance – when certain boundaries are reached, performance plummets, causing nasty page load times or even complete page load failures.

Are these systems inherently dreadful? Actually, no. They do a darn good job and even in security terms they’re not bad at all. Naturally one always has to be careful with modules/plugins and whether they are (still) maintained.

So, with that assessment out of the way – we can look at the actual issues. It makes sense for us to contribute to these systems as it directly benefits our clients, ourselves, and many many others.


Just now the options structure in WordPress has drawn our interest. It's abstracted, so a plugin requests the value of an individual option item (by name), and typically this causes a db query. WordPress has an 'autoload' mechanism, particularly for its core settings, which it loads in one bigger query and caches – that makes sense.

We've just commented on an existing issue regarding the indexing of the autoload column. People argue that indexing a boolean field is not beneficial (not necessarily true, and easily tested for a specific case – the outcome here is that an index IS beneficial) and that having more indexes slows down writes (true of course, but that's never an argument against proper indexing – and this is mostly a read table, so a bit of overhead on writes is fairly immaterial). Lastly, there were comments about MyISAM, which has a slightly different performance profile to InnoDB when testing this. But InnoDB has been the default engine for quite a few years now – any installation from recent years, and any installation whose owner cares about performance and other factors, will already be using InnoDB. Taking MyISAM into account is not beneficial. We hope the index will shortly be added. In the meantime you can add it yourself:

ALTER TABLE wp_options ADD INDEX (autoload);
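
To quickly check that the new index is actually used (a sanity check, not a benchmark; the WHERE clause assumes the standard 'yes'/'no' values WordPress stores in the autoload column):

EXPLAIN SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes';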

Using autoload for everything would not be efficient: there can be tens of thousands of options on a decently sized site. So we see plugins retrieve half a dozen options, one by one, using the appropriate framework function – and each call triggers a database query. Again, we have to appreciate why this architecture is the way it is, and not waste time arguing with that. There are sensible reasons for having such an option architecture. What we need to address is the inefficiency of triggering many queries.

Our proposal is to extend the get_option() function with an extra parameter for a plugin name (call it a domain); the function can then issue a single db query to retrieve all the options for that plugin and cache them in a single object. When another option is requested, it can be served from there rather than triggering another query. This reduces the number of db queries from N per plugin to 1, which is significant. Mind you, each of these queries is tiny and doesn't require much lookup time. But running any query has intrinsic overhead, so we want to reduce the number of queries wherever possible – not running superfluous queries, combining queries, and so on.
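
Conceptually, the single query per plugin could look like the sketch below (the option-name prefix is hypothetical; the real implementation would live inside get_option() and key off the new domain parameter, caching the result set for subsequent calls):

SELECT option_name, option_value FROM wp_options WHERE option_name LIKE 'myplugin\_%';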

There are variations on the proposal as there are many aspects to consider, but it’s important to not leave it “as is” as currently it affects performance and scalability.

So how do we catch potential items of interest? On the database end, the slow query log is still very useful, with the proper settings: log_queries_not_using_indexes, min_examined_row_limit=1000 (tune as appropriate for the environment), log_slow_filter and log_slow_verbosity to give as much info as possible, and long_query_time set to something sub-second. Naturally, this presumes you've already taken care of the real nasties, otherwise you'll get very big slow log files very quickly. You can analyse the slow query log using Percona Toolkit, but the 'mysqldumpslow' tool which is part of the server package is pretty useful also. You're looking for queries that either take a long time, or that happen often, or that just look odd/inefficient/unnecessary. Then hunt down where they come from in the code, and figure out the why. Often the reasons aren't necessarily good or current (historical scenarios linger), but there will be reasons, and understanding them will help you talk with developers more effectively – seeing where they come from.
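
As a starting point, the settings could be applied at runtime roughly as follows (a sketch only – log_slow_verbosity and log_slow_filter exist in MariaDB and Percona Server but not stock MySQL, permitted values vary by version, log_slow_filter usually already defaults to all query classes, and the thresholds should be tuned for your environment):

SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0.5;                      -- sub-second, per the above
SET GLOBAL log_queries_not_using_indexes = 1;
SET GLOBAL min_examined_row_limit = 1000;              -- tune for the environment
SET GLOBAL log_slow_verbosity = 'query_plan,explain';  -- MariaDB/Percona Server only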

WordPress has a nice plugin called Query Monitor. As admin you get details from a page load in your top admin bar, and you can delve into specific aspects. Things to look out for are an excessive number of queries for a page (a hundred is sadly common, but when you see into the thousands it’s worthwhile investigating – of course, start with the biggest problems first and work your way down), duplicated queries (indicating framework inefficiencies) and things like that. Have a good browse around, there’s much there that’s useful.

If you want, feel free to bring suspected issues to our attention. We’ll look at them, perhaps write about them here, and in any case follow up with the developers in their bug tracking system. It’s worthwhile.


Open Query training at Drupal DownUnder 2012

DrupalDownUnder 2012 will be held in Melbourne, Australia, 13-15 January. A great event; I've been to several of its predecessors. People there don't care an awful lot for databases, but they do realise that sometimes it's important to either learn more about them or talk to someone who specialises in that field. And when discussing general infrastructure, resilience is quite relevant: clients want a site to remain up, but keep costs low.

I will teach two half-day pre-conference training sessions on the Friday at DDU.

The material is tailored to Drupal developers and users. The query design skills, for instance, will help you with module development and with designing Drupal Views. The two half-days can also be booked together as a MySQL Training Pack for $395.

On Saturday afternoon in the main conference, I have a session, Scaling out your Drupal and Database Infrastructure, Affordably, covering the topics of resilience, ease of maintenance, and scaling.

I'm honoured to have been selected to do these sessions; I know there were plenty of submissions from excellent speakers. As with all Drupal conferences, attendees also vote on which submissions they would like to see.

After DDU I’m travelling on to Ballarat for LinuxConfAU 2012, where I’m not speaking in the main program this year, but will have sessions in the “High Availability and Storage” and “Business of Open Source” miniconfs. I’ll do another post on the former – the latter is not related to Open Query.


Slides from DrupalDownUnder2011 on Tuning for Drupal

By popular request, here's the PDF of the slides of this talk as presented in January 2011 in Brisbane; it's fairly self-explanatory. Note that it's not really extensive "tuning"; it just fixes up a few things that are usually "wrong" in default installs, creating a saner baseline. If you want to get to optimal correctness and more performance, other things need to be done as well.


Fast paging in the real world

This blog was originally posted at http://cafuego.net/2010/05/26/fast-paging-real-world

Some time ago I attended the “Optimisation by Design” course from Open Query¹. In it, Arjen teaches how writing better queries and schemas can make your database access much faster (and more reliable). One such way of optimising things is by adding appropriate query hints or flags. These hints are magic strings that control how a server executes a query or how it returns results.

An example of such a hint is SQL_CALC_FOUND_ROWS. You use it in a SELECT query with a LIMIT clause. It instructs the server to select a limited number of rows, but also to calculate the total number of rows that would have been returned without the LIMIT clause in place. That total is stored in a session variable, which can be retrieved via SELECT FOUND_ROWS(). That simply reads the variable and clears it on the server; it doesn't actually have to look at any table or index data, so it's very fast.

This is useful when queries are used to generate pages of data where a user can click a specific page number or click previous/next page. In this case you need the total number of rows to determine how many pages you need to generate links for.
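
For example (a sketch, assuming 50 rows per page), the page count can be computed straight from that stored total:

SELECT CEIL(FOUND_ROWS() / 50) AS total_pages;  -- e.g. 1,234 matching rows gives 25 pages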

The traditional way is to first run a SELECT COUNT(*) query and then select the rows you want, with LIMIT. If you don’t use a WHERE clause in your query, this can be pretty fast on MyISAM, as it has a magic variable that contains the number of rows in a table. On InnoDB however, which is my storage engine of choice, there is no such variable and consequently it’s not pretty fast.

Paging Drupal

At DrupalConSF earlier this year I’d floated the idea of making Drupal 7 use SQL_CALC_FOUND_ROWS in its pager queries. These are queries generated specifically to display paginated lists of content and the API to do this is pretty straightforward. To do it I needed to add query hint support to the MySQL driver. When it turned out that PostgreSQL and Oracle also support query hints though, the aim became adding hint support for all database drivers.

That's now done, though the patch only implements hints on the pager under MySQL at the moment.

One issue keeps cropping up though: a blog post by Alexey Kovyrin from 2007 that states SELECT COUNT(*) is faster than using SQL_CALC_FOUND_ROWS. It's all very well for the patch not to be accepted if that statement is correct, but in my experience the stats are in fact the other way around: SQL_CALC_FOUND_ROWS is nearly always faster than SELECT COUNT(*).

To back up my claims I thought I should run some benchmarks.

I picked the Drupal pager query that lists content (nodes) on the content administration page. It selects node IDs from the node table with a WHERE clause which filters by the content language. Or, in plain SQL, what currently happens is:

SELECT COUNT(*) FROM node WHERE language = 'und';
SELECT nid FROM node WHERE language = 'und' LIMIT 0,50;

and what I’d like to happen is:

SELECT SQL_CALC_FOUND_ROWS nid FROM node WHERE language = 'und' LIMIT 0,50;
SELECT FOUND_ROWS();

Methodology

I ran two sets of tests. One on a node table with 5,000 rows and one with 200,000 rows. For each of these table sizes I ran a pager with 10, 20, 50, 100 and 200 loops, each time increasing the offset by 50; effectively paging through the table. I ran all these using both MyISAM and InnoDB as the storage engine for the node table, and I ran them on two machines. One was my desktop, a dual core Athlon X2 5600 with 4GB of RAM, and the other a single core Xen virtual machine with 512MB of RAM.

I was hoping to also run tests with 10,000,000 rows, but the virtual machine did not complete any of the queries. So I ran these on my desktop machine only. Again for 10, 20, 50, 100 and 200 queries per run. First with an offset of 50, then with an offset of 10,000. I restarted the MySQL server between each run. To discount query cache advantages, I ran all tests with the query cache disabled. The script I used is attached at the bottom of this post. The calculated times do include the latency of client/server communication, though all tests ran via the local socket connection.

My desktop runs an OurDelta MySQL 5.0.87 (-d10-ourdelta-sail66, to be exact). The virtual machine runs 5.0.87 (-d10-ourdelta65). Before you complain that not running vanilla MySQL invalidates the results: I run these builds because they allow me to tweak InnoDB a bit more, so the I/O write load on the virtual machine is somewhat reduced compared to vanilla MySQL.

Results

Query time graphs: NEW (SQL_CALC_FOUND_ROWS) is faster than OLD (COUNT(*)), and InnoDB is not slower than MyISAM

The graphs show that using SQL_CALC_FOUND_ROWS is virtually always faster than running two queries that each need to look at actual data. Even when using MyISAM. As the database gets bigger, the speed advantage of SQL_CALC_FOUND_ROWS increases. At the 10,000,000 row mark, it’s consistently about twice as fast.

Also interesting is that InnoDB seems significantly slower than MyISAM on the shorter runs. I say seems, because (especially with the 10,000,000 row table) the delay is caused by InnoDB first loading the table from disk into its buffer pool. In the spreadsheet you can see the first query takes up to 40 seconds, whilst subsequent ones are much faster. The MyISAM data is still in the OS file cache, so it doesn’t have that delay on the first query. Because I use innodb_flush_method=O_DIRECT, the InnoDB data is not kept in the OS file cache.

Conclusion

So, it’s official. COUNT(*) is dead, long live SQL_CALC_FOUND_ROWS!  🙂

I’ve attached my raw results as a Gnumeric document, so feel free to peruse them. The test script I’ve used is also attached, so you can re-run the benchmark on your own systems if you wish.

Conclusion Addendum

As pointed out in the Drupal pager issue that caused me to run these tests, the query I'm benchmarking uses the language column, which is not indexed, and the test also doesn't allow the server to cache the COUNT(*) query. I've rerun the tests with 10 million rows after adding an index, and I no longer get a significant speed difference between the two ways of getting the total number of rows.
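
For reference, adding that index amounts to something like this (table and column names as in the Drupal queries above):

ALTER TABLE node ADD INDEX (language);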

So I suppose that at least SQL_CALC_FOUND_ROWS will cause your non-indexed pager queries to suck a lot less than they might otherwise and it won’t hurt if they are properly indexed 🙂

¹ I now work for Open Query as a consultant.


Friendlist Graph Module for Drupal

At DrupalSouth 2010 (Wellington) after LCA2010, Peter and I implemented a Drupal module as a practical example of how the OQGRAPH engine can be used to enable social networking trickery in any website. The friendlist_graph module (available from GitHub) extends friendlist, which implements basic functionality of friends (2-way) and fans (1-way) for Drupal users.

The friendlist_graph module transposes the friendlist data using an OQGRAPH table, allowing you to query it in new and interesting ways. By adding some extra Drupal Views, it allows you to play Six Degrees of Kevin Bacon with your Drupal users or find out how two arbitrary users are connected. It can find a path of arbitrary length near-instantly. Previously, you’d just avoid doing any such thing as it’s somewhere between impossible/limited/slow/painful in a regular relational schema.

Now think beyond: retrieve/share connections using Open Social, FOAF, Twitter/Identi.ca, logins with OpenID, and you “instantly” get a very functional social networking enabled site that does not rely on localised critical mass!

We tested with about a million users in Drupal (and approx 3.5 million random connections), which worked fine – the later demo at DrupalSouth stuffed up because I hadn't given the demo VM sufficient memory.

Naturally, you could do the same in Joomla! or another CMS or any site for that matter, we just happened to be at DrupalSouth so a Drupal module was the obvious choice. Take a peek at the code, it’s pretty trivial. Just make sure you run a version of MySQL that has the OQGRAPH engine, for instance 5.0.87-d10 (Sail edition!) from OurDelta.


Ladies and gentlemen, check your assumptions

I spent some time earlier this week trying to debug a permissions problem in Drupal.

After a lot of head-scratching, it turned out that Drupal assumes that when you run INSERT queries sequentially on a table with an auto_increment integer column, the values assigned to this column will also be sequential, i.e. 1, 2, 3, …

This might be a valid assumption when you are the only user doing inserts on a single MySQL server, but unfortunately that is not always the situation in which an application runs.

I run MySQL in a dual-master setup, which means that two sequential INSERT statements will never return sequential integers. The value will always be determined by the auto_increment_increment and auto_increment_offset settings in the configuration file.

In my case, one master will only assign even numbers, the other only odd ones.
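
As a sketch of such a configuration (these variables normally live in each server's my.cnf, but can also be set at runtime):

-- on master A: assigns 1, 3, 5, …
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 1;
-- on master B: assigns 2, 4, 6, …
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 2;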

My patch was accepted, so this problem is now fixed in the Drupal 7 (and hopefully soon in 6 as well) codebase.

The moral of the story is that your application should never make such assumptions about auto_increment columns.  A user may run the application on a completely different architecture, and it may break in interesting and subtle ways.

If you want to use defined integers like Drupal does, make sure you explicitly insert them. Otherwise, you can retrieve the assigned number via the mysql_insert_id() function in PHP or via SELECT LAST_INSERT_ID() in MySQL itself.
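
For instance (hypothetical table, assuming it has an auto_increment id column), to pick up whatever number the server actually assigned:

INSERT INTO role (name) VALUES ('editor');
SELECT LAST_INSERT_ID();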

Have you checked your code today?
