Tag Archives: open query

Walking the Tree of Life in simple SQL

Antony and I are busy getting the Open Query GRAPH Engine code ready so you all can play with it, but we needed to test with a larger dataset to make sure all was fundamentally well with the system.

We have some interesting suitable dataset sources, but the first we tried in earnest, because it was easy to get in (thanks to Roland Bouman for both the idea and providing XSLT stylesheets to transform the set), was the Tree of Life, a hierarchy of 89052 entries showing how biological species on Earth are related to each other.

The GRAPH engine operates in a directed fashion, so I inserted each connection both ways (A->B as well as B->A), resulting in 178102 entries. So we now have a real graph, not just a simple tree.
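As a minimal sketch of that doubling step, the shell function below turns a list of parent/child id pairs into INSERT statements for both directions. The table and column names (tol_graph, origid, destid) are the ones used in the queries in this post; the input format of one whitespace-separated pair per line, the function name both_ways, and inserting with only those two columns set are assumptions for illustration, not the actual import procedure.

```shell
#!/bin/sh
# Read "parent child" id pairs on stdin and print an INSERT for each direction.
# Assumption: edges are inserted with only origid and destid set.
both_ways() {
    awk 'NF == 2 {
        printf "INSERT INTO tol_graph (origid, destid) VALUES (%d, %d);\n", $1, $2
        printf "INSERT INTO tol_graph (origid, destid) VALUES (%d, %d);\n", $2, $1
    }'
}

# A two-edge chain 1-2-3 yields four directed edges.
printf '1 2\n2 3\n' | both_ways
```

The entry count is consistent with this: a tree of 89052 entries has 89051 connections, and doubling them gives 178102.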

Just like with my previous post, we have a separate table that contains the names of the species. For query simplicity, I looked up the ids of the start and end names separately. By the way, latch=1 indicates we use Dijkstra’s shortest-path algorithm for our search.

# with all that explained, let’s find ourselves in the tree of life!
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ' -> ') AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=1 AND destid=16421 \G
*************************** 1. row ***************************
path: Life on Earth -> Eukaryotes -> Unikonts -> Opisthokonts -> Animals -> Bilateria -> Deuterostomia -> Chordata -> Craniata -> Vertebrata -> Gnathostomata -> Teleostomi -> Osteichthyes -> Sarcopterygii -> Terrestrial Vertebrates -> Tetrapoda -> Reptiliomorpha -> Amniota -> Synapsida -> Eupelycosauria -> Sphenacodontia -> Sphenacodontoidea -> Therapsida -> Theriodontia -> Cynodontia -> Mammalia -> Eutheria -> Primates -> Catarrhini -> Hominidae -> Homo -> Homo sapiens
1 row in set (0.13 sec)

# how are we related to the family of plants containing the banana
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ' -> ') AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=21506 \G
*************************** 1. row ***************************
path: Homo sapiens -> Homo -> Hominidae -> Catarrhini -> Primates -> Eutheria -> Mammalia -> Cynodontia -> Theriodontia -> Therapsida -> Sphenacodontoidea -> Sphenacodontia -> Eupelycosauria -> Synapsida -> Amniota -> Reptiliomorpha -> Tetrapoda -> Terrestrial Vertebrates -> Sarcopterygii -> Osteichthyes -> Teleostomi -> Gnathostomata -> Vertebrata -> Craniata -> Chordata -> Deuterostomia -> Bilateria -> Animals -> Opisthokonts -> Unikonts -> Eukaryotes -> Archaeplastida (Plantae) -> Green plants -> Streptophyta -> Embryophytes -> Spermatopsida -> Angiosperms -> Monocotyledons -> Zingiberanae -> Musaceae
1 row in set (0.06 sec)

Obviously, this search needs to find its way up the tree then find the appropriate other branch.

# finally, our connection to retroviruses
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ' -> ') AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=57380 \G
*************************** 1. row ***************************
path: Homo sapiens -> Homo -> Hominidae -> Catarrhini -> Primates -> Eutheria -> Mammalia -> Cynodontia -> Theriodontia -> Therapsida -> Sphenacodontoidea -> Sphenacodontia -> Eupelycosauria -> Synapsida -> Amniota -> Reptiliomorpha -> Tetrapoda -> Terrestrial Vertebrates -> Sarcopterygii -> Osteichthyes -> Teleostomi -> Gnathostomata -> Vertebrata -> Craniata -> Chordata -> Deuterostomia -> Bilateria -> Animals -> Opisthokonts -> Unikonts -> Eukaryotes -> Life on Earth -> Viruses -> DNA-RNA Reverse Transcribing Viruses -> Retroviridae
1 row in set (0.06 sec)

As you can see, this one has to walk all the way back to “Life on Earth”; we’re really not related at all.
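The three searches above differ only in the two ids, so the query text can be generated. A minimal sketch, assuming a hypothetical helper name path_sql and that you pipe its output into the mysql client yourself; the SQL itself is copied from the queries above.

```shell
#!/bin/sh
# Print the shortest-path query for a given start id and end id.
# latch=1 selects Dijkstra's algorithm, as described above.
path_sql() {
    printf "SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ' -> ') AS path "
    printf "FROM tol_graph JOIN tol ON (linkid=id) "
    printf "WHERE latch=1 AND origid=%d AND destid=%d;\n" "$1" "$2"
}

# Homo sapiens (16421) to Musaceae (21506), the banana family.
path_sql 16421 21506
```

Something like path_sql 16421 57380 | mysql your_database (the database name is a placeholder) would then run the retrovirus search.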

I left in the lines that show the amount of time taken. In earlier queries it took a few seconds, and I thought that was just some slowness in the graph engine, until I found out that the join was un-indexed so MySQL was table-scanning the tol table for each item found. Quickly corrected, the numbers are as you see.
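A sketch of how that kind of problem can be spotted and fixed. It assumes the join column is tol.id (as in the queries above) and that a plain secondary index is acceptable; the index name idx_tol_id and the helper name index_fix_sql are made up for illustration, and the fix actually applied here may well have been a primary key instead.

```shell
#!/bin/sh
# Print SQL that (1) shows the query plan and (2) adds an index on the join column.
index_fix_sql() {
    cat <<'SQL'
-- Before: a join type of ALL on tol in the EXPLAIN output means a full table scan.
EXPLAIN SELECT name FROM tol_graph JOIN tol ON (linkid=id)
 WHERE latch=1 AND origid=16421 AND destid=57380;
-- Index the column used by the join.
ALTER TABLE tol ADD INDEX idx_tol_id (id);
SQL
}

index_fix_sql
```

Running the EXPLAIN again after the ALTER should show tol being accessed through the index rather than scanned.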

I was still curious though, and since the SELECT returns a single item (a string in this case) it was really easy to use the BENCHMARK(N,func) function. That standard MySQL function runs func N times. Simple.

# so, we do
SELECT benchmark(1000000,(SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ' -> ') AS path FROM tol_tree JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=57380));

1 row in set (1.86 sec)

As it turns out, we were really just measuring latency before, as this shows we can do a million of these path searches through a graph in less than 2 seconds. To me, that’s not just “not bad” (the usual opinion a Dutch person would express ;-) but freaking awesome. And that is just what I wanted to tell.

New Open Query training days in Australia

We are offering the favourite Open Query course modules as well as reworked and brand new ones, with November/December 2009 dates for Brisbane, Sydney, Canberra and Melbourne listed below. You can register for days/modules individually, to suit your time, budget and current needs. Your trainers are Sean, Ray and Arjen (see OQ people).

For the Canberra and Melbourne days which are DBA/HA, registrations for all of the modules in a series before 15 October will receive a copy of the “High Performance MySQL” book (normal bookstore price is AUD 105).

Canberra

Sydney

Brisbane

  • Thu 19 Nov: MySQL Query Performance Optimisation and Tuning
  • Fri 20 Nov: MySQL Server Performance Optimisation and Tuning

Melbourne

Dogfood: making our systems more resilient

This is a “dogfood” type story (see below for explanation of the term)… Open Query has ideas on resilient architecture which it teaches (training) and recommends (consulting, support) to clients and the general public (blog, conferences, user group talks). Like many other businesses, when we first started we set up our infrastructure quickly and on the cheap, and it’s grown since. That’s how things grow naturally, and is as always a trade-off between keeping your business running and developing while also improving infrastructure (business processes and technical).

Quite a few months ago we also started investing (mostly time) in the technical infrastructure, and slowly moving the various systems across to new servers and splitting things up along the way. Around the same time, the main webserver frequently became unresponsive. I’ll spare you the details, we know what the problem was and it was predictable, but since it wasn’t our system there was only so much we could do. However, systems get dependencies over time and thus it was actually quite complicated to move. In fact, apart from our mail, the public website was the last thing we moved, and that was through necessity not desire.

Of course it’s best for a company when its public website works, and it’s quite likely you have noticed some glitches in ours over time. Now running on the new infra, I happened to take a quick peek at our Google Analytics data, and noticed an increase in average traffic numbers of about 40%. Great big ouch.

And I’m telling this, because I think it’s educational and the world is generally not served by companies keeping problems and mishaps secret. Nasties grow organically and without malicious intent, improvements are a step-wise process, all that… but in the end, the net results of improvements can be more amazing than just general peace of mind! And of course it’s very important to not just see things happen, but to actively work on those incremental improvements, ongoing.

Our new infra has dual master MySQL servers (no surprise there ;-) but based in separate data centres so that makes the setup a bit more complicated (MMM doesn’t deal with that setup). Other “new” components we use are lighttpd, haproxy, and Zimbra (new in the sense that our old external infra used different tech). Most systems (not all, yet) are redundant/expendable and run on a mix of Linode instances and our own machines. Doing these things for your own infra is particularly educational, it provides extra perspective. The result is, I believe, pretty decent. Failures generally won’t cause major disruption any more, if at all. Of course, it’s still work in progress.

Running costs of this “farm”? I’ll tell later, as I think it’s a good topic for a poll and I’m curious: how much do you spend on server infrastructure per month?

Background for non-Anglophones: “eating your own dogfood” refers to a company doing themselves what they’re recommending to their clients and in general. Also known as “leading by example”, but I think it’s also about trust and credibility. On the other hand, there’s the “dentist’s tooth-ache” which refers to the fact that doctors are their own worst patients ;-)

Tool of the Day: rsnapshot

rsnapshot is a filesystem snapshot utility for making backups of local and remote systems, based on rsync. Rather than just doing a complete copy every time, it uses hardlinks to create incrementals (which are from a local perspective a full backup also). You can specify how long to keep old backups, and all the other usual jazz. You’d generally have it connect over ssh. You’ll want/need to run it on a filesystem that supports hardlinks, so that precludes FAT, for instance.
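To see why a hardlinked incremental is also a full backup, here is a small self-contained sketch. It only mimics the effect with ln and cp in a scratch directory; it is not how rsnapshot is implemented, and the daily.0/daily.1 names are borrowed from its usual layout purely for illustration.

```shell
#!/bin/sh
set -eu
# Scratch area; the directory names mimic rsnapshot's daily.N layout.
work=$(mktemp -d)
mkdir "$work/source" "$work/daily.0" "$work/daily.1"
echo "unchanged" > "$work/source/a.txt"
echo "version 1" > "$work/source/b.txt"

# Older snapshot: a plain full copy.
cp "$work/source/a.txt" "$work/source/b.txt" "$work/daily.1/"

# The source changes before the next run.
echo "version 2" > "$work/source/b.txt"

# Newer snapshot: hardlink unchanged files to the older snapshot, copy changed ones.
for f in "$work"/source/*; do
    name=$(basename "$f")
    if cmp -s "$f" "$work/daily.1/$name"; then
        ln "$work/daily.1/$name" "$work/daily.0/$name"
    else
        cp "$f" "$work/daily.0/$name"
    fi
done

# Print the inode number of a file.
inode() { ls -i "$1" | awk '{print $1}'; }
echo "a.txt inodes: $(inode "$work/daily.0/a.txt") $(inode "$work/daily.1/a.txt")"
echo "b.txt inodes: $(inode "$work/daily.0/b.txt") $(inode "$work/daily.1/b.txt")"
```

Both snapshots list every file, but the unchanged a.txt shares one inode (stored once on disk), while b.txt exists in two versions.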

In the context of MySQL, you can’t just do a filesystem copy of the data/logs of a running server; that would be inconsistent and broken. (Amazingly, I still see people insisting/arguing on this – but heck, it’s your business/data to gamble with, right?)

Anyway, if you do a local mysqldump also, or for instance use XtraBackup to take a binary backup of your InnoDB tablespace/logs, then rsnapshot can be used to automate the transfer of those files to a different geographical location.
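One way to produce such a local dump for rsnapshot to pick up is sketched below. The output directory /var/backups/mysql and the function name dump_for_rsnapshot are hypothetical; --single-transaction gives a consistent dump for InnoDB tables only, and credentials are assumed to come from an option file such as ~/.my.cnf.

```shell
#!/bin/sh
# Dump all databases into a directory that rsnapshot is configured to back up.
# The dump is written under a temporary name first, so the final file name
# only ever holds a complete dump.
dump_for_rsnapshot() {
    outdir=${1:-/var/backups/mysql}   # hypothetical default location
    mkdir -p "$outdir" &&
    mysqldump --single-transaction --all-databases > "$outdir/all-databases.sql.tmp" &&
    mv "$outdir/all-databases.sql.tmp" "$outdir/all-databases.sql"
}
```

You would call dump_for_rsnapshot (from cron, say) before the rsnapshot run, and point rsnapshot at the output directory.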

Two extra things you need to do:

  • Regularly test your backups. They can fail, and that can be fatal. For XtraBackup, run the prepare command and essentially start up a MySQL instance on it to make sure it’s all happy. Having this already done also saves time if you need to restore.
  • For restore time, you need to include the time needed to transfer files back to the target server.
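To put a rough number on that transfer time, a back-of-the-envelope sketch. The helper name transfer_seconds and the example figures are hypothetical, and the estimate ignores protocol overhead and compression.

```shell
#!/bin/sh
# Estimate the seconds needed to copy a backup back over the network.
# Usage: transfer_seconds SIZE_GB LINK_MBIT_PER_S
# Uses decimal units: 1 GB = 8000 megabits.
transfer_seconds() {
    awk -v gb="$1" -v mbit="$2" 'BEGIN { printf "%d\n", gb * 8000 / mbit }'
}

# Hypothetical example: a 50 GB backup over a 100 Mbit/s link.
transfer_seconds 50 100   # prints 4000
```

In that example, more than an hour passes before the restore itself can even begin.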

Tool of the Day: Firefox Tab Kit extension

We often need many tabs open in a browser, and horizontal tabs become unmanageable. Tab Kit allows you to have them vertically on the left, with various additional configuration choices.

I opted for the tree structure, so when I open a tab from another one it’ll show up as a child to the original. I can “lock” tabs so they cannot be closed by an accidental click or keypress. They get a “read” marker so if you open a few tabs and leave them till later you can still tell which ones you’ve actually already looked at. And there’s colour coding also.

In short, a great help. Just click the Tools/Add-Ons menu in Firefox and find Tab Kit in the extensions. Install, configure, and enjoy!

Tool of the day: inotify

I was actually exploring inotify-tools for something else, but they can also be handy for seeing what goes on below a mysqld process. inotify hooks into the filesystem handlers, and sees which files are accessed. You can then set triggers, or just display a tally over a certain period.

It has been a standard part of the Linux kernel since 2.6.13 (2005, wow that’s a long time ago already) and can be used through system calls or the inotify-tools (command line). So with the instrumentation already in the kernel, apt-get install inotify-tools is all you need to get started.

 # inotifywatch -v -t 20 -r /var/lib/mysql/* /var/lib/mysql/zabbix/*
Establishing watches...
Setting up watch(es) on /var/lib/mysql/mysql/user.frm
OK, /var/lib/mysql/mysql/user.frm is now being watched.
[...]
Total of 212 watches.
Finished establishing watches, now collecting statistics.
Will listen for events for 60 seconds.
total  modify  filename
2371   2371    /var/lib/mysql/relay-log.info
2148   2148    /var/lib/mysql/master.info
1157   1157    /var/lib/mysql/ib_logfile0
24     24      /var/lib/mysql/zabbix/
24     24      /var/lib/mysql/zabbix/history.ibd
8      8       /var/lib/mysql/zabbix/trends_uint.ibd
6      6       /var/lib/mysql/zabbix/items.ibd
5      5       /var/lib/mysql/ibdata1

This is just a limited example from a dev box, but you can see the benefit. You can see which files have been accessed, in what way, and how many times over the specified period. Consequently this provides the most insight if you’re using innodb-file-per-table (or MyISAM) rather than a single InnoDB tablespace. But of course it depends a bit on what you’re looking for.
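If you find yourself repeating this, a small wrapper helps. A minimal sketch: the function name watch_datadir and its defaults are my own choices, it counts only modify events, and it watches the whole data directory recursively instead of listing files with a glob.

```shell
#!/bin/sh
# Tally modify events under a MySQL data directory for a fixed period.
# Usage: watch_datadir [DATADIR] [SECONDS]
watch_datadir() {
    datadir=${1:-/var/lib/mysql}
    seconds=${2:-60}
    # -r: watch recursively, -t: stop after this many seconds,
    # -e: only count the given event type.
    inotifywatch -r -t "$seconds" -e modify "$datadir"
}
```

For example, watch_datadir /var/lib/mysql 20 gives a 20-second tally; you’ll need enough privileges to read the data directory.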

Book: Pro Linux System Administration

Peter Lieverdink (also known as cafuego on IRC/identi.ca, engineer on OurDelta builds and for Open Query) has co-authored a book that has been available since Monday. The title is Pro Linux System Administration, published by Apress.

These days some people don’t want to bother with system administration, and either hire or outsource. Others want to find out more and do things themselves (home and small office use), and that’s the intended audience for this book.

100% subscription renewal

I’m happy to note (this is internal Open Query happiness but I’m pleased to share) that so far we have a 100% renewal rate for our Proactive Services for MySQL subscriptions. Some of the early clients have grown in the initial period and have now moved to a higher # of hours (this can also be changed upward during a term), which is of course excellent both for the clients and for us.

I was in eager anticipation of this time since the introduction of the concept late last year, as it is of course the essential proof of whether a subscription service actually works over time. Ideally, you’d want renewal to be a simple straightforward process, with the client having experienced the value of the service. This is relatively straightforward in this case, since it’s not an insurance, emergency or retainer type arrangement – the client actually gets benefits each and every month, so there’s both technical progression as well as ongoing human contact. Seems like a winner!

Along the way we also see a steady influx of new clients. I haven’t been specifically chasing this, as all new concepts take a while to mature, and we also had new people internally. The really cool thing is that our business structure for this service is scalable – I won’t say linearly because at some point the # of internal people involved would require adapting some processes, but it’ll scale a fair way still from where we are now.

Elspeth, our Special Projects Operative, who apart from being an ace coder & geek is also organisationally organised, has been a great help with some of the admin aspects of the company. We’re paper-less, but that doesn’t mean there’s no paper. We tend to not produce more, but we do get it from others ;-)