Posted on 3 Comments

A bit of traffic-analysis

From the public #mysql channel… someone was analysing the browser user-agent strings from his site traffic.
SELECT COUNT(*) AS cnt FROM hits WHERE useragent LIKE "%Mac%"
There were of course a number of problems with this, some more obvious than others.
For educational purposes, let’s go through this as the issues are also relevant for many other situations.

The user said that the query took quite a few seconds, and was using 100% CPU time.
The table was MyISAM and had a few million rows. Nothing special, not particularly big even.

Of course, a LIKE starting with a wildcard can’t use an index. So the server has to do a full table scan. That’s not a problem, but merely a fact we need to take into account.
(Could we use, say, a FULLTEXT index for this? Probably not, as browser user-agent strings are very much free-form, so the words might not be separated the way FULLTEXT likes. We could tune the server to use different delimiters, but let’s try without all that first.)
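To illustrate the index point, compare a leading-wildcard pattern with a prefix pattern (the idx_ua index here is hypothetical, purely for illustration):

  -- Hypothetical prefix index, just for illustration
  ALTER TABLE hits ADD INDEX idx_ua (useragent(100));

  -- Leading wildcard: no index can help, so this is a full table scan
  EXPLAIN SELECT COUNT(*) FROM hits WHERE useragent LIKE "%Mac%";

  -- A prefix pattern, by contrast, could use that index
  EXPLAIN SELECT COUNT(*) FROM hits WHERE useragent LIKE "Mozilla%";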

If you’re doing a full table scan, you want to reduce disk I/O to minimize overhead. So, 1) you want your rows to be as short as possible, and 2) you want to read big chunks of data for each OS disk read. For MyISAM, you can do the latter with
SET SESSION read_buffer_size=26214400 (that’s 25MB)
We just do this inside this connection, because this buffer would be allocated for each connection… you wouldn’t have enough memory to do this globally.

That actually brings us to the 100% CPU use. Generally, even if a query takes a while, it won’t use 100% CPU, because much of the work is disk-bound so the CPU is mostly just waiting for I/O.
100% CPU use often happens with bad builds (bad threading libraries), but in this case the answer was simpler… the user had set read_buffer_size at the GLOBAL level. So what happened? The operating system was swapping like crazy! That’s what happens when you allocate more memory than you physically have. You really want to make sure that your server never swaps. If you have an Apache and a MySQL server on the same machine, this can get quite tricky, as Apache 1.3 is a memory hog and MySQL of course likes to be tuned for optimal performance by increasing various buffers/caches. So anyway, this problem was easily resolved.
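In SQL, the difference between the two settings is simply:

  -- Wrong here: every new connection would allocate a 25MB read buffer
  SET GLOBAL read_buffer_size=26214400;

  -- Right: only this session gets the big buffer
  SET SESSION read_buffer_size=26214400;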

At this point the query ran reasonably well, but the user would repeat it about 7 times… one for Mac, one for Linux, and so on. How to optimize that? It would be nice to do it all in a single query:

  SELECT SUM(IF(useragent LIKE "%Mac%",1,0)) AS Mac,
         SUM(IF(useragent LIKE "%Linux%",1,0)) AS Linux, ....
    FROM hits

What this does may not be immediately obvious to you, but I’ll leave you to ponder (and try!) that as homework 😉

And, no WHERE clause? Well, I figured the table scan is happening anyway, with each row checked for the specified patterns. A WHERE clause only makes sense if it reduces the number of rows the server needs to look at, and here we can expect most if not all rows to have a useragent pattern we want. So a WHERE would actually slow us down.

Someone then asked… would it make sense to copy the data to a MEMORY table and do the analysis there? Answer… unless you really do have to do multiple queries on the same dataset you’d just be doing more data shuffling rather than less, so: no.
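For reference, that proposed shuffle would look something like this (hits_mem is a made-up name; note also that MEMORY tables can’t store TEXT/BLOB columns, so useragent would have to be a VARCHAR):

  -- Copy the column into a MEMORY table, then analyse there
  CREATE TABLE hits_mem ENGINE=MEMORY
    SELECT useragent FROM hits;

  SELECT SUM(IF(useragent LIKE "%Mac%",1,0)) AS Mac,
         SUM(IF(useragent LIKE "%Linux%",1,0)) AS Linux
    FROM hits_mem;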

And how would this work in InnoDB? Well, you’d make sure that innodb_buffer_pool_size was large enough rather than tuning read_buffer_size, and again you’d want the rows in this table to be as short as possible.
But MyISAM is very well suited to data warehousing; that’s what it was actually first used for. Of course, the above case is a very minimalistic data warehouse, but you get the idea.

Now, the one thing you can debate in this particular case is whether it might be better to just do the parsing when the data comes in with a browser request, and store the wanted information fully structured. You can also keep the original string in a logfile.
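A minimal sketch of that structured approach (the schema and the platform values are my assumptions, not from the discussion):

  CREATE TABLE hits2 (
    hit_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    hit_time DATETIME NOT NULL,
    -- parsed from the user-agent string at insert time, in the application
    platform ENUM('Mac','Linux','Windows','Other') NOT NULL,
    KEY (platform)
  ) ENGINE=MyISAM;

  -- The per-platform counts then become a cheap indexed GROUP BY
  SELECT platform, COUNT(*) AS cnt FROM hits2 GROUP BY platform;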


3 thoughts on “A bit of traffic-analysis”

  1. Being the “user” from the article, a little insight on the memory tables issue.
    Following the line of thought discussed on IRC, I selected the user agents grouped, which gave me the set of unique user agents (something in the line of 65% of the original volume of data). Placing that into a memory table allowed me to run the SUM(IF()) query on that result set, with the added bonus of getting the count of each unique user agent.

    It could be done in two separate steps without the memory table, but the 35% reduction in total volume makes for a quicker second query…

  2. Grouped by what? I think you’re heading towards the “preprocessing before storing” scenario. Why not just go all the way…

  3. Once again, indisputable arguments…
    I took the time to normalize things, pre-processed the details I needed to get from each item, separated the index values into a new table, and pretty much did everything that I should have done in the first place. Besides considerably reducing the amount of space used by the table (something in the line of 30%), I was able to actually increase overall performance by an astounding 700%.

Comments are closed.