Posted on

innodb_flush_logs_on_trx_commit and Galera Cluster

We deploy Galera Cluster (in MariaDB) for some clients, and innodb_flush_logs_on_trx_commit is one of the settings we’ve been playing with. The options according to the manual:

  • =0 don’t write or flush at commit, write and flush once per second
  • =1 write and flush at trx commit
  • =2 write log, but only flush once per second

The flush (fsync) refers to the mechanism the filesystem uses to try and guarantee that written data is actually on the physical medium/device and not just in a buffer (of course cached RAID controllers, SANs and other devices use some different logic there, but it’s definitely written beyond the OS space).

In a non-cluster setup, you’d always want it to be =1 in order to be ACID compliant and that’s also InnoDB’s default. So far so good. For cluster setups, you could be more lenient with this as you require ACID on the cluster as a whole, not each individual machine – after all, if one machine drops out at any point, you don’t lose any data.

Codership docu recommended =2, so that’s what Open Query engineer Peter Lock initially used for some tests that he was conducting. However, performance wasn’t particularly shiny – actually not much higher than =1. That in itself is interesting, because typically we regard the # of fsyncs/second a storage system can deal with as a key indicator of performance capacity. That is, as our HD Latency tool shows when you run it on a storage device (even your local laptop harddisk), the most prominent aspect of what limits the # of writes you can do per second appears to be the fsyncs.

I then happened to chat with Oli Sennhauser (former colleague from MySQL AB) who now runs the FromDual MySQL/MariaDB consulting firm in Switzerland, and he’s been working with Galera for quite a long time. He recognised the pattern and said that he too had that experience, and he thought =0 might be the better option.

I delved into the InnoDB source code to see what was actually happening, and the code indeed concurs with what’s described in the manual (that hasn’t always been the case ;-). I also verified this with Jeremy Cole whom we may happily regard as guru on “how InnoDB actually works”. The once-per-second flush (and optional preceding write) is performed by the InnoDB master thread. Take a peek in log/log0log.c and trx/trx0trx.c, specifically trx_commit_off_kernel() and srv_sync_log_buffer_in_background().

In conclusion:

  1. Even with =0, the log does get written and flushed once per second. This is done in the background so connection threads don’t have to wait for it.
  2. There is no setting where there is never a flush/fsync.
  3. With =2, the writing of the log takes place in the connection thread and this appears to incur a significant overhead, at least relative to =0. Aside from the writing of the log at transaction commit, there doesn’t appear to be a difference.
  4. Based on the preceding points, I would say that if you don’t want =1, you might as well set =0 in order to get the performance you’re after. There is of course a slight practical difference between =0 and =2. With =2 the log is immediately written. If the mysqld process were to crash within a second after that, the OS would close the file and have that log write stored. With =0 that log data wouldn’t have been written. If the OS or machine fails, that log write is lost either way.

In production environments, we tend to mainly want to mitigate trouble from system failures, so =0 appears to be a suitable/appropriate option – for a Galera cluster environment.

What remains is the question of why the log write operation appears to reduce transaction commit performance so much, in a way more so than the flush/fsync. Something to investigate further!
Your thoughts?

Posted on
Posted on

Mixing databases usually not optimal

Dan McKinley (Etsy) wrote an [IMHO] insightful article Why MongoDB Never Worked at Etsy.

First off, it’s important to realise that it’s not a snipe at MongoDB – it’s a fine tool.

The lessons are related to mixing multiple databases in a deployment (administration and monitoring overhead) and the acknowledgement that issues of schema design, scalability and maintenance need attention regardless of which brand or technology you pick for your database. That comes back to the old insight that migrations are rarely worth it (regardless of what you migrate to what).

I think these are indeed important considerations as they have a major impact on the ongoing costs of your entire environment (production as well as development and testing) – these days we encounter the “we’re doing this part of our application using MongoDB” approach quite often, so it’s useful to read about and learn from other people’s experience.

With MongoDB there is a particular extra issue to consider, and Dan McKinley also mentions it in his post. NoSQL databases are often also schema-less. However, to keep your data manageable when it grows to significance, you do need to structure it somehow – that is, you need to make sure that (and I’ll just use generic terminology here) in a specific set of records each record contains the required fields. If you don’t, at some point things become unmanageable (or your data ends up as a pile of unusable bits).

Thus, you’re dealing with some form of schema, whether you call it that or not. And you might deal with it in application logic or through some toolkit, rather than in the database itself, but it can’t just be ignored or disregarded. And that’s critical, as often going to a schema-less database is presented as a “then you don’t need to worry about that” change. You do need to “worry” about it: you can pick where the most suitable place is for your needs. If you look at it in that way, you can make an appropriate choice for the particular application at hand.

Posted on
Posted on

Luxbet, MariaDB and Melbourne Cup

Yesterday was Melbourne Cup day in Australia – the biggest annual horse race event in the country, and in the state of Victoria it’s even a public holiday.

Open Query does work for Luxbet (part of Tabcorp), and Melbourne Cup day is by far their biggest day of the year in terms of traffic. It’s not just a big spike, there’s orders of magnitude difference so you can really say that the rest of the year is downright quiet (in relative terms). So, a very interesting load pattern.

Since last year Luxbet has upgraded from stock MySQL to MariaDB, and with our input made some other infrastructure modifications including moving to a pure solid state storage (FusionIO) solution as a SAN just won’t deliver the resilience and performance required. This may seem odd, but remember that a) a SAN is also a single point of failure (so when the SAN fails, multiple db servers will be “out” – not desirable even though a failover to another datacenter is possible), and b) MariaDB/XtraDB (InnoDB) already have all recent data and indexes in RAM, so whatever I/O is required won’t benefit from a SAN cache. Thus, the SAN will have to actually do a physical disk seek and read to get what is needed, and we all know seeks are slow. A write or fsync also incurs some latency, regardless of the storage array speed.

So those are the reasons for the local storage solution. While there are aspects of RAID and other redundancy in that setup, the main resilience in the infrastructure comes from having more machines, rather than necessarily having more redundancy in each machine.

Grant is working on a more comprehensive version of this story.

Posted on
Posted on 11 Comments

Hint of the day: noatime and relatime in fstab

It’s been written about everywhere, but since we keep spotting installations in the wild where people don’t know about it, it probably deserves another mention.

By default, Linux uses the atime option on a disk mount, which means it writes a timestamp (e.g. a write to the drive) every time it reads anything. So in this case, reads cause writes – and also disk seeks, because a read from a file will then trigger having to write to the directory that contains the file. This even occurs if a file is read from the file system’s page cache (reading from the machine’s memory rather than the drive).

Unless you require an audit trail of users reading files, you generally you don’t want this. Thus, you want to add the noatime option to the disk mount in /etc/fstab. If you have just the defaults in there, you just make it defaults,noatime. It’ll doesn’t necesarily require a reboot as you can use umount/mount, but that gets tricky when dealing with the root filesystem so a reboot is generally easier. Setting these options is one of the first things we do when configuring a server.

Some user applications, such as Mutt (mail reader) do use the read access time. In that case, you can use the relatime option instead, which only writes a timestamp when a file or directory is written to. This is just for completeness of this story, as it’s still sub-optimal for a database server.

If you require read details for auditing (security) of the operating system, make sure all database-related files (database directories, InnoDB log files, binary logs, etc) are on a separate mount where you can use noatime.

Using noatime also makes a lot of sense on a web server, as it does a lot of reads. Remember, the fact that most files are in the filesystem cache doesn’t make a difference. As a general guide, it makes sense to set on most server installations. Quick win.

Posted on 11 Comments
Posted on 1 Comment

Temporary Tables and Replication

I recently wrote about non-deterministic queries in the replication stream. That’s resolved by using either MIXED or ROW based replication rather than STATEMENT based.

Another thing that’s not fully handled by STATEMENT based replication is temporary tables. Imagine the following:

  1. Master: CREATE TEMPORARY TABLE rpltmpbreak (i INT);
  2. Wait for slave to replicate this statement, then stop and start mysqld (not just STOP/START SLAVE)
  3. Master: INSERT INTO rpltmpbreak VALUES (1);
  4. Slave: SHOW SLAVE STATUS \G

If for any reason a slave server shuts down and restarts after the temp table creation, replication will break because the temporary table will no longer exist on the restarted slave server. It’s obvious when you think about it, but nevertheless it’s quite annoying.

A long time ago (early 2007, when I was still working at MySQL AB) I filed a bug report on this. It’s important to realise that back then, row based replication did exist but was so buggy that you wouldn’t recommend it, so the topic was quite relevant. For some reason the bug has remained open for over 6 years until some recent activity.

It is not an issue with determinism and most temporary table constructs are technically regarded as “safe” to replicate via statement based replication, so if you use MIXED you will still find replication broken with the above scenario. Important to realise!

http://dev.mysql.com/doc/refman/5.5/en/replication-features-temptables.html (the obvious place to look) doesn’t really explain this well, but http://dev.mysql.com/doc/refman/5.5/en/replication-rbr-usage.html correctly states that ROW based replication doesn’t suffer from this problem as it replicates the values from the temporary table on the master rather than the statement, thus the slave doesn’t have to deal with the temporary table at all. I’ve suggested that the bug be changed to a documentation issue, updating the page on replication and temporary tables to properly explain the issue and point clearly and explicitly to the solution.

So, why would you ever use STATEMENT or MIXED rather than ROW based replication?

  • Well, as I mentioned, earlier row based wasn’t particularly reliable. At that time, for non-deterministic scenarios we recommended mixed as a compromise (that only uses row based information in the replication stream when it’s necessary, and statements the rest of the time). Many issues have been fixed over time and now we can generally say that row based replication is ok in recent versions of MySQL and MariaDB (5.5 or above, just to be sure). So if you’re replicating from an older master, STATEMENT or MIXED might still be preferable, as long as you know that the limitations are.
  • Non-local replication (outside the datacenter) is vastly more efficient with STATEMENT based replication: if you’re updating 100,000 rows, it’s a single statement whereas it’s a 100,000 row updates. So depending on bandwidth/cost and such, that might also be a relevant.

If none of those considerations apply, ROW based replication might be the way to go now. But the really important thing to realise is that for each of the choices of STATEMENT, MIXED and ROW, there are advantages and consequences.

Do you have any other reasons for using STATEMENT or MIXED in your environment?

Posted on 1 Comment
Posted on

Non-Deterministic Query in Replication Stream

You might find a warning like the below in your error log:

130522 17:54:18 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statements writing to a table with an auto-increment column after selecting from another table are unsafe because the order in which rows are retrieved determines what (if any) rows will be written. This order cannot be predicted and may differ on master and the slave.
Statement: INSERT INTO tbl2 SELECT * FROM tbl1 WHERE col IN (417,523)

What do MariaDB and MySQL mean with this warning? The server can’t guarantee that this exact query, with STATEMENT based replication, will always yield identical results on the slave.

Does that mean that you have to use ROW based (or MIXED) replication? Possibly, but not necessarily.

For this type of query, it primarily refers to the fact that without ORDER BY, rows have no order and thus a result set may show up in any order the server decides. Sometimes it’s predictable (depending on storage engine and index use), but that’s not something you want to rely on. You don’t have to ponder that, as an ORDER BY is never harmful.

Would ORDER BY col solve the problem? That depends!
If col is unique, yes. If col is not unique, then multiple rows could result and they’d still have a non-deterministic order. So in that case you’d need to ORDER BY col,anothercol to make it absolutely deterministic. The same of course applies if the WHERE clause only referred to a single col value: if multiple rows can match, then it’s not unique and it will require an additional column for the sort.

There are other query constructs where going to row based or mixed replication is the only way. But, just because the server tells you it can’t safely replicate a query with statement based replication, that doesn’t mean you can’t use statement based replication at all… there might be another way.

Posted on
Posted on 5 Comments

LEVENSHTEIN MySQL stored function

At Open Query we steer clear of code development for clients. We sometimes advise on code, but as a company we don’t want to be in the programmer role. Naturally we do write scripts and other necessities to do our job.

Assisting with an Open Source project, I encountered three old UDFs. User Defined Functions are native functions that are compiled and then loaded by the server similar to a plugin. As with plugins, compiling can be a pest as it requires some of the server MySQL header files and matching build switches to the server it’s going to be loaded in. Consequentially, binaries cannot be considered safely portable and that means that you don’t really want to have a project rely on UDFs as it can hinder adoption quite severely.

Since MySQL 5.0 we can also use SQL stored functions and procedures. Slower, of course, but functional and portable. By the way, there’s one thing you can do with UDFs that you (at least currently) can’t do with stored functions, and that’s create a new aggregate function (like SUM or COUNT).

The other two functions were very specific to the app, but the one was a basic levenshtein implementation. A quick google showed that there were existing SQL and even MySQL stored function implementations, most derived from a single origin which was actually broken (and the link is now dead, as well). I grabbed one that appeared functional, and reformatted it for readability then cleaned it up a bit as it was doing some things in a convoluted way. Given that the stored function is going to be much slower than a native function anyway, doing things inefficiently inside loops can really hurt.

The result is below. Feel free to use, and if you spot a bug or can improve the code further, please let me know!
Given the speed issue, I’m actually thinking this should perhaps be added as a native function in MariaDB. What do you think?

-- core levenshtein function adapted from
-- function by Jason Rust (http://sushiduy.plesk3.freepgs.com/levenshtein.sql)
-- originally from http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/
-- rewritten by Arjen Lentz for utf8, code/logic cleanup and removing HEX()/UNHEX() in favour of ORD()/CHAR()
-- Levenshtein reference: http://en.wikipedia.org/wiki/Levenshtein_distance

-- Arjen note: because the levenshtein value is encoded in a byte array, distance cannot exceed 255;
-- thus the maximum string length this implementation can handle is also limited to 255 characters.

DELIMITER $$
DROP FUNCTION IF EXISTS LEVENSHTEIN $$
CREATE FUNCTION LEVENSHTEIN(s1 VARCHAR(255) CHARACTER SET utf8, s2 VARCHAR(255) CHARACTER SET utf8)
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR CHARACTER SET utf8;
    -- max strlen=255 for this function
    DECLARE cv0, cv1 VARBINARY(256);

    SET s1_len = CHAR_LENGTH(s1),
        s2_len = CHAR_LENGTH(s2),
        cv1 = 0x00,
        j = 1,
        i = 1,
        c = 0;

    IF (s1 = s2) THEN
      RETURN (0);
    ELSEIF (s1_len = 0) THEN
      RETURN (s2_len);
    ELSEIF (s2_len = 0) THEN
      RETURN (s1_len);
    END IF;

    WHILE (j <= s2_len) DO
      SET cv1 = CONCAT(cv1, CHAR(j)),
          j = j + 1;
    END WHILE;

    WHILE (i <= s1_len) DO
      SET s1_char = SUBSTRING(s1, i, 1),
          c = i,
          cv0 = CHAR(i),
          j = 1;

      WHILE (j <= s2_len) DO
        SET c = c + 1,
            cost = IF(s1_char = SUBSTRING(s2, j, 1), 0, 1);

        SET c_temp = ORD(SUBSTRING(cv1, j, 1)) + cost;
        IF (c > c_temp) THEN
          SET c = c_temp;
        END IF;

        SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
        IF (c > c_temp) THEN
          SET c = c_temp;
        END IF;

        SET cv0 = CONCAT(cv0, CHAR(c)),
            j = j + 1;
      END WHILE;

      SET cv1 = cv0,
          i = i + 1;
    END WHILE;

    RETURN (c);
  END $$

DELIMITER ;
Posted on 5 Comments
Posted on

Fedora 19 – MariaDB Test Day 2013-04-30

From https://fedoraproject.org/wiki/Test_Day:2013-04-30_MariaDB, this installment of Fedora’s Test Day focuses on the replacement of MySQL with MariaDB. If you’re a Fedora (or RHEL or CentOS user), do take a peek at the page and see if you can pitch in – it might be a little bit of work for you, but with great benefits in terms of getting the MariaDB performance and features, and specifically on the day the Fedora crowd have extra people on the case to track and address issues you might find, so it’s an ideal opportunity to upgrade on a development or test-prod environment!

Posted on
Posted on 2 Comments

Galera pre-deployment check

One of the first things we do when preparing a client’s infrastructure for Galera deployment is see whether their schema is suitable.

  • Avoiding quirks and edge cases, we can say that Galera simply requires all tables to be InnoDB and also have a PRIMARY KEY (obviously having a PK in InnoDB is important anyway, for InnoDB-internal reasons).
  • We want to know about FULLTEXT indexes. With recent InnoDB versions also supporting FULLTEXT we need to check not just whether a table has such an index, but actually which engine it is.
  • Spatial indexes. While both InnoDB and MyISAM can deal with spatial datatypes (POINT, GEOMETRY, etc), only MyISAM has the spatial indexes.

Naturally, checking a schema in the server is more effective than going through other sources and possibly missing bits. On the downside, the only viable way to get this info out of MariaDB is INFORMATION_SCHEMA, but because of the way it’s implemented queries tend to be slow and resource intensive. So essentially we do need to ask I_S, but do it as efficiently as possible (we’re dealing with production systems). We have multiple separate questions to ask, which normally we’d ask in separate queries, but in case of I_S that’s really something to avoid. So that’s why it’s all integrated into the single query below, catching every permutation of “not InnoDB”, “lacks primary key”, “has fulltext or spatial index”. We skip the system databases and any VIEWs.

We use the lesser known mysql client command ‘tee’ to output the data into a file, and close it after the query.

We publish the query not as a work of art – I don’t think it’s that pretty! We’d like you to see because we don’t care for secrets and also because if there is any way you can reach the same objective using a less resource intensive approach, we’d love to hear about it! This is one of the very few cases where we care only about efficiency, not how pretty the query looks. That said, of course I’d prefer it to be easily readable.

If you regard it purely as a query to be used for Galera, then you can presume it’ll be run on MariaDB 5.5 or later – since 5.3 and above has optimised subqueries, perhaps you can do something with that.

If you spot any other flaw or missing bit, please comment on that too. thanks!

-- snip
tee galeracheck.txt

SELECT DISTINCT
       CONCAT(t.table_schema,'.',t.table_name) as tbl,
       t.engine,
       IF(ISNULL(c.constraint_name),'NOPK','') AS nopk,
       IF(s.index_type = 'FULLTEXT','FULLTEXT','') as ftidx,
       IF(s.index_type = 'SPATIAL','SPATIAL','') as gisidx
  FROM information_schema.tables AS t
  LEFT JOIN information_schema.key_column_usage AS c
    ON (t.table_schema = c.constraint_schema AND t.table_name = c.table_name
        AND c.constraint_name = 'PRIMARY')
  LEFT JOIN information_schema.statistics AS s
    ON (t.table_schema = s.table_schema AND t.table_name = s.table_name
        AND s.index_type IN ('FULLTEXT','SPATIAL'))
  WHERE t.table_schema NOT IN ('information_schema','performance_schema','mysql')
    AND t.table_type = 'BASE TABLE'
    AND (t.engine <> 'InnoDB' OR c.constraint_name IS NULL OR s.index_type IN ('FULLTEXT','SPATIAL'))
  ORDER BY t.table_schema,t.table_name;

notee
-- snap

Credit: the basis of the “find tables without a PK” is based on SQL by Sheeri Cabral and Giuseppe Maxia.

Posted on 2 Comments
Posted on

Fatal Half-measures in Incident Response

CSO Online writes about a rather sad list of security breaches at http://www.csoonline.com/article/721151/fatal-half-measures-in-incident-response, and the half-hearted approach companies take in dealing with the security on their networks and websites.

What I find most embarrassing is that it appears (judging by the actions) that many companies have their lawyers do some kind of borked risk assessment , and decide that they can just leave things as-is and yell foul when there’s a breach. After all, particularly in the US prosecutors are very heavy handed with breaches, even when the company has been totally negligent. That’s weird, because an insurance company wouldn’t pay out for a break-in when you’ve left your front door wide open! The problem is of course that the damage will have been done, generally data (such as personal details or credit card info) taken. The damage that does might be hidden and not even get tracked back to this cause. But it hurts individuals, potentially badly (and not just financially).

One example I know of… years ago, Commonwealth Bank Australia had an open mail relay. This means that outsiders could pass mail through CBA mail systems, thus send emails pretending to come from CBA addresses and looking 100% legit. When CBA was notified about this, somehow they decided to not do anything (lawyers again?). If it had been passed to a tech, it would have been about 10 minutes of work to rectify.

At Open Query we do security reviews for clients, naturally focusing on the externally facing sites with the back-end infrastructure including MySQL/MariaDB. We specifically added this offering because we happen to have the skillset in-house, our clients often have e-commerce or privacy sensitive data, and we regard this as very important.

We’re heartened by the introduction of more strict legislation in Australia that requires disclosure of breaches – that means that companies no longer have the option of not fessing up about an incident. Of course they could try and hide it, but these things tend to come out and apart from the public nature of that there are now legal consequences. It’s not perfect, I’d hope companies are smarter than even try to walk that line. Fessing up to a problem and dealing with it is much better, and that’s what we advise companies to do. But that’s about incident response policy, and while important in the overall picture that’s not our main focus.

Similar to our approach with reliability of infrastructure, we take a precautionary approach with security. We want to help prevent problems, rather than doing remedial work later. Of course there’s always a trade-off (law of diminished returns applies), but even small budgets can accommodate a decent level of security. And really, it’s not an optional extra. If you have a website or other publicly facing system as part of your business, you take on this responsibility. You can outsource the work, but not the responsibility.

Posted on