Posted on 3 Comments

Walking the Tree of Life in simple SQL

Antony and I are busy getting the Open Query GRAPH Engine code ready so you all can play with it, but we needed to test with a larger dataset to make sure all was fundamentally well with the system.

We have some intersting suitable dataset sources, but the first we tried in ernest because it was easy to get in (thanks to Roland Bouman for both the idea and providing xslt stylesheets to transform the set), was the Tree of Life which is a hierarchy of 89052 entries showing how biological species on earth are related to eachother.

GRAPH engine operates in a directed fashion, so I inserted the connections both ways resulting in 178102 entries. So, I inserted A->B as well as B->A for each connection. So we now have a real graph, not just a simple tree.

Just like with my previous post, we have a separate table that contains the name of the species. For query simplicity, I looked up the id the start/end name separately. By the way, latch=1 indicates we use Dijkstra’s shortest-path algorithm for our search.

# with all that explained, let’s find ourselves in the tree of life!
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ‘ -> ‘) AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=1 AND destid=16421 \G
*************************** 1. row ***************************
path: Life on Earth -> Eukaryotes -> Unikonts -> Opisthokonts -> Animals -> Bilateria -> Deuterostomia -> Chordata -> Craniata -> Vertebrata -> Gnathostomata -> Teleostomi -> Osteichthyes -> Sarcopterygii -> Terrestrial Vertebrates -> Tetrapoda -> Reptiliomorpha -> Amniota -> Synapsida -> Eupelycosauria -> Sphenacodontia -> Sphenacodontoidea -> Therapsida -> Theriodontia -> Cynodontia -> Mammalia -> Eutheria -> Primates -> Catarrhini -> Hominidae -> Homo -> Homo sapiens
1 row in set (0.13 sec)

# how are we related to the family of plants containing the banana
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ‘ -> ‘) AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=21506 \G
*************************** 1. row ***************************
path: Homo sapiens -> Homo -> Hominidae -> Catarrhini -> Primates -> Eutheria -> Mammalia -> Cynodontia -> Theriodontia -> Therapsida -> Sphenacodontoidea -> Sphenacodontia -> Eupelycosauria -> Synapsida -> Amniota -> Reptiliomorpha -> Tetrapoda -> Terrestrial Vertebrates -> Sarcopterygii -> Osteichthyes -> Teleostomi -> Gnathostomata -> Vertebrata -> Craniata -> Chordata -> Deuterostomia -> Bilateria -> Animals -> Opisthokonts -> Unikonts -> Eukaryotes -> Archaeplastida (Plantae) -> Green plants -> Streptophyta -> Embryophytes -> Spermatopsida -> Angiosperms -> Monocotyledons -> Zingiberanae -> Musaceae
1 row in set (0.06 sec)

Obviously, this search needs to find its way up the tree then find the appropriate other branch.

# finally, our connection retro-viruses
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ‘ -> ‘) AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=57380 \G
*************************** 1. row ***************************
path: Homo sapiens -> Homo -> Hominidae -> Catarrhini -> Primates -> Eutheria -> Mammalia -> Cynodontia -> Theriodontia -> Therapsida -> Sphenacodontoidea -> Sphenacodontia -> Eupelycosauria -> Synapsida -> Amniota -> Reptiliomorpha -> Tetrapoda -> Terrestrial Vertebrates -> Sarcopterygii -> Osteichthyes -> Teleostomi -> Gnathostomata -> Vertebrata -> Craniata -> Chordata -> Deuterostomia -> Bilateria -> Animals -> Opisthokonts -> Unikonts -> Eukaryotes -> Life on Earth -> Viruses -> DNA-RNA Reverse Transcribing Viruses -> Retroviridae
1 row in set (0.06 sec)

As you can see this one has to walk all the way back to “life on earth”, we’re really not related at all.

I left in the lines that show the amount of time taken. In earlier queries it took a few seconds, and I thought that was just some slowness in the graph engine, until I found out that the join was un-indexed so MySQL was table-scanning the tol table for each item found. Quickly corrected, the numbers are as you see.

I was still curious though, and since the SELECT returns a single item (a string in this case) it was really easy to use the BENCHMARK(N,func) function. That standard MySQL function runs func N times. Simple.

# so, we do
SELECT benchmark(1000000,(SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ‘ -> ‘) AS path FROM tol_tree JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=57380));

1 row in set (1.86 sec)

As it turns out, we were really just measuring latency before, as this shows we can do a million of these path searches through a graph in less than 2 seconds. To me, that’s not just “not bad” (the usual opinion a Dutch person would express 😉 but freaking awesome. And that is just what I wanted to tell.

Posted on 3 Comments

3 thoughts on “Walking the Tree of Life in simple SQL

  1. Nice work! Glad I could provide the pointer.

    kind regards,

    Roland

  2. Hi,

    Is this dataset that you’re using available somewhere?
    Maybe even the SQL load script?

    Thanks,

    BL

Comments are closed.