Posted on

The Optimiser Conundrum

We’ve been helping a long-term client who runs some fairly complex queries (covering lots of tables and logic on a respectably big but mainly volatile dataset). We tend to look first at query structure and table design, as fixing problems there tends to have the most impact. This contrary to just tossing more hardware at the problem, which is just expensive.

As subqueries are used (and necessary in this case), MariaDB 5.3 was already a great help with its subquery optimisations. Once again thanks, Monty and the Monty Program optimiser team (Igor, Sergey, Timour, and possibly others) – all former colleagues and they’re absolutely awesome. Together, they know the MySQL optimiser like no other.

Because the queries are generated indirectly from an exposed API (just for paying clients, but still), the load is more unpredictable than having merely a local front-end. Maintaining spare capacity with slaves addresses this, but naturally it’s still important to ensure that each query takes as little time as possible. Queries use appropriate indexes and each bit is fairly optimal by now.

Then the client reported that while overall things were going really well, there was this one (type of) query that was really misbehaving. The report was that it appeared to be “locking up the server”. Now I generally don’t believe in that, because I know how the server works: it doesn’t just hang, but it might just take a long time to do exactly what it was told to do. Computers are like that.

Typically that kind of thing would happen with for instance a really bad join, but in this case there was nothing like that. Rather than running the query which was sure to not go well (or quick) regardless of what it was doing, the proper thing to try was an EXPLAIN (on a slave) – after 10 minutes I got fed up and killed the query. So, the server was spending all that time in the query analysis/optimisation phase to figure out the optimal order of the tables and which access method for each table it should use.

I know that pattern too, we used to see MySQL 4 do this. Obviously, spending ages on finding the optimal execution plan for a tiny query makes no sense – so back then the optimiser team (same people as noted above!) implemented a limitation and pruning algorithm, with configurable settings: optimizer_search_depth (default: 62) and optimizer_prune_level (default: 1). After this algorithm was implemented, we didn’t see the problem with clients so adjusting the search depth was not even required.

There are exceptions. Another former colleague, Max Mether (now at SkySQL) wrote about this a year ago: Setting optimizer search depth in MySQL, based on an experience with one of their clients. While staying with EXPLAIN rather than actually trying to run the query, I did SET SESSION optimizer_search_depth=1 and validated the hypothesis that indeed this had been where the server was spending its time. Then I played with the setting a bit, and found that for my 28-table query the curve was much steeper than what Max had found. At depth=15 I already had to wait 15 seconds just for the analysis (Max had 5 seconds at that point). While any of us old hands can look at a query and figure out reasonably well what the optimiser would do with it, that exercise becomes a bit tedious when dealing with 28 tables and subqueries involved.

As a side-note, running the actual query (with a low depth setting) takes hardly any extra time, which is typical – it’s the analysis/optimisation phase that’s eating all the time.

For now, we have a workaround… the application can set the option before running queries like this.

SET @save_optimizer_search_depth = @@optimizer_search_depth;
SET SESSION optimizer_search_depth=1;
SET SESSION optimizer_search_depth=@save_optimizer_search_depth;

But I’m not satisfied: I don’t believe the server should be acting in this way. The issue appears similar to the MySQL 4 scenario, and just like then we should come up with a way for the optimiser to decide when to call it quits and just execute. Perhaps we need to set a cap on the total amount of time allowed (configurable), but that seems rather crude. Other things have changed in the optimiser since that time, so it could even be some kind of regression – hopefully the optimiser team will figure this out when they look at it.

I’m working with the client to isolate a dataset sufficient for reproducing this issue, so that we can give it to the team at Monty Program. I’ll file a bug report when I have that set. If you happen to have something similar, please contribute your insights (and data/query) also! For now, if you can, please comment to this post. I’m interested to learn if it’s a more common occurrence now. thanks!