LEVENSHTEIN MySQL stored function

At Open Query we steer clear of code development for clients. We sometimes advise on code, but as a company we don’t want to be in the programmer role. Naturally we do write scripts and other necessities to do our job.

Assisting with an Open Source project, I encountered three old UDFs. User Defined Functions are native functions that are compiled and then loaded by the server similar to a plugin. As with plugins, compiling can be a pest as it requires some of the server MySQL header files and matching build switches to the server it’s going to be loaded in. Consequentially, binaries cannot be considered safely portable and that means that you don’t really want to have a project rely on UDFs as it can hinder adoption quite severely.

Since MySQL 5.0 we can also use SQL stored functions and procedures. Slower, of course, but functional and portable. By the way, there’s one thing you can do with UDFs that you (at least currently) can’t do with stored functions, and that’s create a new aggregate function (like SUM or COUNT).

The other two functions were very specific to the app, but the one was a basic levenshtein implementation. A quick google showed that there were existing SQL and even MySQL stored function implementations, most derived from a single origin which was actually broken (and the link is now dead, as well). I grabbed one that appeared functional, and reformatted it for readability then cleaned it up a bit as it was doing some things in a convoluted way. Given that the stored function is going to be much slower than a native function anyway, doing things inefficiently inside loops can really hurt.

The result is below. Feel free to use, and if you spot a bug or can improve the code further, please let me know!
Given the speed issue, I’m actually thinking this should perhaps be added as a native function in MariaDB. What do you think?

-- core levenshtein function adapted from
-- function by Jason Rust (http://sushiduy.plesk3.freepgs.com/levenshtein.sql)
-- originally from http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/
-- rewritten by Arjen Lentz for utf8, code/logic cleanup and removing HEX()/UNHEX() in favour of ORD()/CHAR()
-- Levenshtein reference: http://en.wikipedia.org/wiki/Levenshtein_distance

-- Arjen note: because the levenshtein value is encoded in a byte array, distance cannot exceed 255;
-- thus the maximum string length this implementation can handle is also limited to 255 characters.

DELIMITER $$
DROP FUNCTION IF EXISTS LEVENSHTEIN $$
CREATE FUNCTION LEVENSHTEIN(s1 VARCHAR(255) CHARACTER SET utf8, s2 VARCHAR(255) CHARACTER SET utf8)
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR CHARACTER SET utf8;
    -- max strlen=255 for this function
    DECLARE cv0, cv1 VARBINARY(256);

    SET s1_len = CHAR_LENGTH(s1),
        s2_len = CHAR_LENGTH(s2),
        cv1 = 0x00,
        j = 1,
        i = 1,
        c = 0;

    IF (s1 = s2) THEN
      RETURN (0);
    ELSEIF (s1_len = 0) THEN
      RETURN (s2_len);
    ELSEIF (s2_len = 0) THEN
      RETURN (s1_len);
    END IF;

    WHILE (j <= s2_len) DO
      SET cv1 = CONCAT(cv1, CHAR(j)),
          j = j + 1;
    END WHILE;

    WHILE (i <= s1_len) DO
      SET s1_char = SUBSTRING(s1, i, 1),
          c = i,
          cv0 = CHAR(i),
          j = 1;

      WHILE (j <= s2_len) DO
        SET c = c + 1,
            cost = IF(s1_char = SUBSTRING(s2, j, 1), 0, 1);

        SET c_temp = ORD(SUBSTRING(cv1, j, 1)) + cost;
        IF (c > c_temp) THEN
          SET c = c_temp;
        END IF;

        SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
        IF (c > c_temp) THEN
          SET c = c_temp;
        END IF;

        SET cv0 = CONCAT(cv0, CHAR(c)),
            j = j + 1;
      END WHILE;

      SET cv1 = cv0,
          i = i + 1;
    END WHILE;

    RETURN (c);
  END $$

DELIMITER ;

InnoDB without PRIMARY KEY

Having an InnoDB table without a PRIMARY KEY is not good. Many have known this for years, but exact opinions as to why have differed. From observation, it was clear to me that it impacted performance.

InnoDB stores its row data in the leaf nodes of the primary key B+tree structure, that means that it can’t work without… so if you don’t specify a PK, it makes one up. Seems pretty innocuous and shouldn’t actually perform any worse than an auto-inc field. Except that in reality the performance can be much much worse. Annoying. Naturally we recommend clients to always have a PK (auto-inc, a composite of foreign keys, or if need be a natural key) but production systems cannot always be quickly changed, depending on the app code adding a column is not something you can just do at the DBA level.

Recently my good friend and former colleague Jeremy Cole, who has been delving into the depths of InnoDB, asked me if I had any open questions on the topic. So I mentioned the above, and after a brief look at the relevant code caught he was definitely interested in exploring the issue. The result is this blog post: http://blog.jcole.us/2013/05/02/how-does-innodb-behave-without-a-primary-key/ for your enlightenment and enjoyment.

Now we know what goes on internally. And it’s clear that performance is negatively affected, and why. Useful.

Thanks Jeremy!

Tool of the Day: MOSH (Mobile Shell)

Today I nominate MOSH (Mobile Shell) from MIT in our “tool of the day” category.

With people working remote, we sometimes encounter connectivity issues. But even when working from a stable connection, it’s sometimes just a pest when you close your laptop even though you hadn’t quite finished looking at something on that SSH connection…

We tend to use a jumping box for connections to clients, so connections come from a known IP (for the firewalls) and where we have our end of VPN  and such. On that box we now also have a MOSH server. It doesn’t replace the authentication part of SSH, but rather takes over afterwards and maintains an (encrypted) UDP path (it being UDP you can’t really call it a connection).

Now you can change IPs, close your laptop lid, see your ADSL connection retrain, and all will be well anyway. MOSH will warn you when a connection is (temporarily) gone but it’ll automatically sort out the reconnection for you. And depending on what you’re doing, you can actually keep typing locally. Even roaming between wifi and mobile will not break things.

There’s more to it, but the important thing is that now you know it exists! The MOSH site at MIT is simple but clear, and fairly complete including instructions how to install and use on pretty much any platform. Have you tried MOSH already? Send us your thoughts.

MariaDB 5.5 LIMIT ROWS EXAMINED

SELECT … LIMIT has always been very useful, particularly for web applications, restricting the number of rows in the result set to the amount that’s immediately required. To have web apps performing well, it’s always important to only retrieve as many rows as you need and no more.

The SQL_CALC_FOUND_ROWS option was added later, so that an application would be able to figure out (by using the FOUND_ROWS() function) how many more rows – and thus pages – would be available that can then be retrieved with the appropriate LIMIT … OFFSET … calls.

The problem with that construct was that while it kept the restriction of the number of rows in the result set, it required the server to keep retrieving rows even if the limit was already reached. Now, if the ORDER BY column is not the same as the indexes used for initial retrieval of table rows, the server naturally needs to first have all the matching rows, then order by, and then limit. There is no other way.

But from the above you can see that there is an interesting edge case: if the ORDER BY happens to be on the same column as the WHERE condition (which actually does happen quite a bit in the real world) and there is an index on that column, the server doesn’t necessarily have to do all the extra work, provided we get a way of  restricting that execution path. MariaDB 5.5 offers exactly that by adding a ROWS EXAMINED parameter to the LIMIT clause. For full syntax details, see https://kb.askmonty.org/en/limit-rows-examined/

Typically, what you’d do use use LIMIT, ROWS EXAMINED and SQL_CALC_FOUND_ROWS in an initial search or overview query, limiting to a maximum of a handful of pages. This way you can still indicate that there is more data available, and should the user select page 6, you just run a new query with a similar restriction but with a new LIMIT OFFSET boundary. This way you can vastly reduce the amount of work the server is required to do for paginated results.

We often see performance problems with search functionality on sites, and this is one of the ways that can be mitigated. Naturally that’s not the only thing, but it can really help.

 

Storage caching options in Linux 3.9 kernel

dm-cache is (albeit still classified “experimental”) is in the just released Linux 3.9 kernel. It deals with generic block devices and uses the device mapper framework. While there have been a few other similar tools flying around, since this one has been adopted into the kernel it looks like this will be the one that you’ll be seeing the most in to the future. It saves sysadmins the hassle of compiling extra stuff for a system.

A typical use is for an SSD to cache a HDD. Similar to a battery backed RAID controller, the objective is to insulate the application from latency caused by the mechanical device, the most laggy part of which is seek time (measured in milliseconds). Giventhe  relatively high storage capacity of an SSD (in the hundreds of GBs), this allows you to mostly disregard the mechanical latency for writes and that’s very useful for database systems such as MariaDB.

That covers writes (for the moment), but what about reads? Can MariaDB benefit from the read-caching? For the MyISAM storage engine, yes (as it relies on filesystem caching for speeding up row data access). For InnoDB, much less so. But let’s explore this, because it’s not quite a yes/no story – it depends. For typical systems with a correctly dimensioned system and InnoDB buffer pool, most of the active dataset will reside in RAM. For a system using a cached RAID controller that means that an actual disk read is not likely to be in the cache. With an SSD cache you might get lucky as it’s bigger – so stuff that has been read or written in some recent past may still be there. What we have found from testing with hdlatency (on actual client/hosting infra) is that SANs typically don’t have enough cache to pull that off – they too may have SSD caches now, but remember they get accessed by many more users with different data needs as well. The result of SSD filesystem caching for reads is actually similar to InnoDB tweaks that implement a secondary buffer pool on SSD storage, it creates a relatively large and cheap space for “lukewarm” pages (ones that haven’t been recently accessed).

So why does it depend? Because your active dataset might be too large, and/or your combined reads/writes are still more than the physical disks can handle. It’s very important to consider the latter: write caching insulates you from the seeks and allows an intermediate layer to re-order writes to optimise the head movement, but the writes still need to be done and thus ultimately you remain bound by an upper end physical limit. Insulation is not complete separation. If your active dataset is larger than RAM+SSD, then the reads also also need to be taken into account for seek capacity.

So right now you could say that at decent prices, if your active dataset is in the range of a few hundred GB to even a few TB, RAM with the optional addition of SSD caching can all work out nicely – what can still make it go sour is the rate of writes. Conclusion: this type of setup provides you with more headroom than a battery backed RAID controller, should you need that.

Separating reporting to distinct database servers (typically slaves, configured for relatively few connections and large queries) actually still helps quite a bit as it really changes what’s in the buffer pool and other caches. Or, differently put, looking at the access patterns of the different parts of your application is important – there are numerous variation on this basic pattern. It’s a form of functional sharding.

You’ll have noticed I didn’t mention any benchmarks when discussing all this (and most other topics). Many if not most benchmarks have artificial aspects to them, which makes them problematic when dealing with the real world. As shown above, applying background knowledge of the systems and structures, logic, and maths gets you a very long way (either independently or in consultation with us). It can get you through important decision processes quicker. Testing can still play an important part, but then it’s either part of or very close to your real world environment, not a lab activity. It will be specific to you. Don’t get trapped having to deliver on numbers from benchmarks.

SPDY protocol available in nginx 1.4

Nginx 1.4 can now do SPDY (draft 2). It’s hiding away in a separate file http://nginx.org/en/docs/http/ngx_http_spdy_module.html

So what is SPDY? In a nutshell, it does multiplexing, prioritization and compression of HTTP/HTTPS requests over a single TCP/IP connection. It also enables the server to push data before requested. These enable a browser or web services client to obtain multiple responses quickly by opening and authenticating a single connection to a web server and then issue multiple requests in parallel (well, whenever it wants, but in any case not requiring additional the completion of one request before the next request and not requiring, though still possible, multiple TCP/IP concurrent connections). For more info on SPDY, see http://www.chromium.org/spdy/spdy-protocol.

There is also an Apache 2.2 module for SPDY (https://code.google.com/p/mod-spdy/). Browser support for SPDY is present in Firefox, Chrome, the default Android web browser, and Opera.

If you have production experience with SPDY, good or bad, we’d like to hear about it! Particularly since SPDY is still relatively new and not yet used everywhere, the more information is published, the better.

edit: official docs are up.

Fedora 19 – MariaDB Test Day 2013-04-30

From https://fedoraproject.org/wiki/Test_Day:2013-04-30_MariaDB, this installment of Fedora’s Test Day focuses on the replacement of MySQL with MariaDB. If you’re a Fedora (or RHEL or CentOS user), do take a peek at the page and see if you can pitch in – it might be a little bit of work for you, but with great benefits in terms of getting the MariaDB performance and features, and specifically on the day the Fedora crowd have extra people on the case to track and address issues you might find, so it’s an ideal opportunity to upgrade on a development or test-prod environment!

Galera pre-deployment check

One of the first things we do when preparing a client’s infrastructure for Galera deployment is see whether their schema is suitable.

  • Avoiding quirks and edge cases, we can say that Galera simply requires all tables to be InnoDB and also have a PRIMARY KEY (obviously having a PK in InnoDB is important anyway, for InnoDB-internal reasons).
  • We want to know about FULLTEXT indexes. With recent InnoDB versions also supporting FULLTEXT we need to check not just whether a table has such an index, but actually which engine it is.
  • Spatial indexes. While both InnoDB and MyISAM can deal with spatial datatypes (POINT, GEOMETRY, etc), only MyISAM has the spatial indexes.

Naturally, checking a schema in the server is more effective than going through other sources and possibly missing bits. On the downside, the only viable way to get this info out of MariaDB is INFORMATION_SCHEMA, but because of the way it’s implemented queries tend to be slow and resource intensive. So essentially we do need to ask I_S, but do it as efficiently as possible (we’re dealing with production systems). We have multiple separate questions to ask, which normally we’d ask in separate queries, but in case of I_S that’s really something to avoid. So that’s why it’s all integrated into the single query below, catching every permutation of “not InnoDB”, “lacks primary key”, “has fulltext or spatial index”. We skip the system databases and any VIEWs.

We use the lesser known mysql client command ‘tee’ to output the data into a file, and close it after the query.

We publish the query not as a work of art – I don’t think it’s that pretty! We’d like you to see because we don’t care for secrets and also because if there is any way you can reach the same objective using a less resource intensive approach, we’d love to hear about it! This is one of the very few cases where we care only about efficiency, not how pretty the query looks. That said, of course I’d prefer it to be easily readable.

If you regard it purely as a query to be used for Galera, then you can presume it’ll be run on MariaDB 5.5 or later – since 5.3 and above has optimised subqueries, perhaps you can do something with that.

If you spot any other flaw or missing bit, please comment on that too. thanks!

-- snip
tee galeracheck.txt

SELECT DISTINCT
       CONCAT(t.table_schema,'.',t.table_name) as tbl,
       t.engine,
       IF(ISNULL(c.constraint_name),'NOPK','') AS nopk,
       IF(s.index_type = 'FULLTEXT','FULLTEXT','') as ftidx,
       IF(s.index_type = 'SPATIAL','SPATIAL','') as gisidx
  FROM information_schema.tables AS t
  LEFT JOIN information_schema.key_column_usage AS c
    ON (t.table_schema = c.constraint_schema AND t.table_name = c.table_name
        AND c.constraint_name = 'PRIMARY')
  LEFT JOIN information_schema.statistics AS s
    ON (t.table_schema = s.table_schema AND t.table_name = s.table_name
        AND s.index_type IN ('FULLTEXT','SPATIAL'))
  WHERE t.table_schema NOT IN ('information_schema','performance_schema','mysql')
    AND t.table_type = 'BASE TABLE'
    AND (t.engine <> 'InnoDB' OR c.constraint_name IS NULL OR s.index_type IN ('FULLTEXT','SPATIAL'))
  ORDER BY t.table_schema,t.table_name;

notee
-- snap

Credit: the basis of the “find tables without a PK” is based on SQL by Sheeri Cabral and Giuseppe Maxia.

Query pattern: OR across different tables

When a query uses a construct like

SELECT ... FROM a JOIN b ON (...) WHERE a.c1 = X OR b.c2 = Y

execution will inevitably degrade as the dataset grows.

The optimiser can choose to use an index merge when dealing with two relevant indexes over a single table, but that’s obviously of no use in this scenario as the optimiser has to choose which table to access first. And regardless of which table is accessed first, the other one might yield a result. Thus the query will never be efficient.

The real answer is that the query construct is wrong, a JOIN is used inappropriately. The correct approach for this type of query is using a UNION:

SELECT ... FROM a WHERE a.c1 = X
UNION [ALL]
SELECT ... FROM b WHERE b.c2 = Y

This mistake occurs relatively often, because while on the one hand people try to reduce the number of queries necessary to achieve their objective, they are (somewhat) familiar with JOINs and either don’t know about UNION or use it so rarely that they don’t think of it when the question calls for its use. So, this is your reminder ;-)

Fatal Half-measures in Incident Response

CSO Online writes about a rather sad list of security breaches at http://www.csoonline.com/article/721151/fatal-half-measures-in-incident-response, and the half-hearted approach companies take in dealing with the security on their networks and websites.

What I find most embarrassing is that it appears (judging by the actions) that many companies have their lawyers do some kind of borked risk assessment , and decide that they can just leave things as-is and yell foul when there’s a breach. After all, particularly in the US prosecutors are very heavy handed with breaches, even when the company has been totally negligent. That’s weird, because an insurance company wouldn’t pay out for a break-in when you’ve left your front door wide open! The problem is of course that the damage will have been done, generally data (such as personal details or credit card info) taken. The damage that does might be hidden and not even get tracked back to this cause. But it hurts individuals, potentially badly (and not just financially).

One example I know of… years ago, Commonwealth Bank Australia had an open mail relay. This means that outsiders could pass mail through CBA mail systems, thus send emails pretending to come from CBA addresses and looking 100% legit. When CBA was notified about this, somehow they decided to not do anything (lawyers again?). If it had been passed to a tech, it would have been about 10 minutes of work to rectify.

At Open Query we do security reviews for clients, naturally focusing on the externally facing sites with the back-end infrastructure including MySQL/MariaDB. We specifically added this offering because we happen to have the skillset in-house, our clients often have e-commerce or privacy sensitive data, and we regard this as very important.

We’re heartened by the introduction of more strict legislation in Australia that requires disclosure of breaches – that means that companies no longer have the option of not fessing up about an incident. Of course they could try and hide it, but these things tend to come out and apart from the public nature of that there are now legal consequences. It’s not perfect, I’d hope companies are smarter than even try to walk that line. Fessing up to a problem and dealing with it is much better, and that’s what we advise companies to do. But that’s about incident response policy, and while important in the overall picture that’s not our main focus.

Similar to our approach with reliability of infrastructure, we take a precautionary approach with security. We want to help prevent problems, rather than doing remedial work later. Of course there’s always a trade-off (law of diminished returns applies), but even small budgets can accommodate a decent level of security. And really, it’s not an optional extra. If you have a website or other publicly facing system as part of your business, you take on this responsibility. You can outsource the work, but not the responsibility.