Posted on

Browsing genomes with MySQL, long text cols, Boyer-Moore optimization

Take a peek at http://www.ensembl.org/, the Ensembl Genome Browser.
There’s the human genome, and a variety of other animals as well as diseases.
What’s the link? Well, the data is stored on and searched with MySQL, and you can directly download a MySQL format SQL dump.

As I understand it, some genomes have DNA sequences longer than 1GB (that need to stay as a single entity for analysis), and MySQL is one of the few RDBMS that can a) handle such large TEXT/BLOB fields and b) has the ability to search for substrings fairly efficiently.

To give you some background info on the latter, MySQL version 4.0 and above implements the Boyer-Moore optimization for LIKE “%pattern%” searches. As you know, a LIKE pattern that starts with a wildcard can never use an index (except FULLTEXT indexes, see below). It’s like searching the phonebook for people whose second letter of their name is “e”, you need to go through all the pages anyway…
So MySQL needs to do a table scan, looking at the relevant column in each row of the table. With Boyer-Moore, this becomes less tedious. For example, if the pattern to search for is “bla”, MySQL only has to check each 3rd character in the stored data, checking if it is a “b”, an “l”, or an “a” (it does that through a reverse lookup using a binary search, but don’t worry about that). If it is none of those, it can safely skip another 3 characters.

Now, you may forget all of this techie detail if you want, the important thing to remember is that when you need to do a search on a pattern starting with a wildcard, MySQL cannot use an index, but will use the abovementioned optimization. And the consequence of the Boyer-Moore algorithm is that the long the pattern, the bigger the hops that MySQL can take through the column…
(if you’re not yet using at least a 4.x version of MySQL, all the more reason to upgrade!)

Of course, if you’re searching for whole words (i.e. in a text) rather than part of a contiguous string (like a DNA sequence), FULLTEXT indexes are a better choice. But that’s another story, perhaps for another day. The manual info for FULLTEXT is at http://dev.mysql.com/doc/mysql/en/fulltext-search.html

Posted on