
Today’s up-time requirements

When asking about uptime requirements set down in SLAs (Service Level Agreements) between our clients and their clients, we hear anything from a few hours of allowed downtime to the familiar five nines, and these days also simply 100%, with penalties applying otherwise. From my perspective, there’s not much difference between five nines and 100%: 99.999% uptime over a year allows for a little over 5 minutes of outage at most. In many cases, that includes scheduled outages!
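
To make that arithmetic concrete, here’s the quick sum as you’d run it in any MySQL client:

SELECT (1 - 0.99999) * 365.25 * 24 * 60 AS minutes_down_per_year;
-- roughly 5.26 minutes: the entire yearly outage budget for five nines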

So, we simply cannot have any outages, scheduled or otherwise. Emergency support is not going to help here: however fast and good it is, by the time it kicks in you’re already in serious penalty time, or well on your way to not having a business any more. Most emergency services respond within, say, 30 minutes, but then need up to a few hours to resolve the issue. That won’t really help you, will it? And in any case, how are you going to do your maintenance? The answer is that you need to architect things differently.

I do appreciate how hard it is to transition away from the corporate tradition of outsourcing liability along with emergency support, i.e. having someone to call and, if need be, sue. It takes time, in business processes as well as in actual architecture, to make things resilient. But really, if those are the SLAs you agree on with your clients, that’s what has to be done.

Anyway, aiming for resilience (expecting things to break, but building the infrastructure so it can cope) rather than purchasing many nines is, I think, a better focus. Making an individual component ever more reliable becomes prohibitively expensive, whereas having more servers is relatively cheap. That’s simple economics.


Quest for Resilience: Multi-DC Masters

This is a Request for Input. Dual MySQL masters with MMM in a single datacentre are in common use, and other setups like DRBD and of course VM/SAN based failover solutions are also conceptually straightforward. Thus, achieving various forms of resilience within a single datacentre is doable and not costly.

Doing the same across multiple datacentres (let’s for simplicity’s sake limit it to two) is another matter. MySQL replication works well across longer links, and it can use MySQL’s in-built SSL or tools like stunnel. Of course it needs to be monitored, as usual, but since it’s asynchronous, the latency between the datacentres is not a big issue (apart from the second server being up to date a little later).
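
As a rough sketch (host name, account and certificate path below are placeholders, not from any particular setup), enabling MySQL’s in-built SSL for such a replication link looks something like this:

-- on the master: require SSL for the replication account (placeholder names)
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'secret' REQUIRE SSL;

-- on the slave: point at the remote master and switch on SSL for the IO thread
CHANGE MASTER TO
  MASTER_HOST = 'master.dc2.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_SSL = 1,
  MASTER_SSL_CA = '/etc/mysql/ssl/ca-cert.pem';
START SLAVE;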

But as those who have tried will know, having a client (application server) connect to a MySQL instance in a remote datacentre is a whole other matter: latency becomes a big issue and is generally very noticeable on the front end. One solution is to have application servers connect only to their “local” MySQL server.

So the question to you is: do you now have (or have you had in the past) a setup with MySQL masters in different datacentres? What did that setup look like (which additional tools and infrastructure did you use), and what were your experiences (good and bad, solutions to issues, and so on)? I’m trying to gather expertise that may already be out there, which can help us all. Please add your input, thanks!


Business insight from the MySQL Conference 2010

At this year’s conference, I was pleasantly surprised by the high level of interest in Open Query’s proactive services for MySQL and MariaDB, and specifically our focus on preventing problems while explicitly not offering emergency services.

I’ll describe what this is about first, and why I reckon it’s interesting. When you think about it, most IT-related support that includes emergency (24×7) cover operates something like this:

You have a house that has the front and back doors wide open with no locks, and you take out an insurance policy for the house contents. After a short time you call the insurance company “guess what, the most terrible thing happened, my TV got stolen.” Insurance company responds “that’s dreadful, you poor soul, let us fix it all up for you with getting a new TV and installing it. It’ll be our pleasure to serve you.” A few weeks later you call the insurance company again “guess what …” and they help you in the same fabulous way.

You get the idea: it’s rather silly, because it’s utterly predictable. If you leave your doors open, you’re very close to being the cause of the problem yourself, and insurance companies tend not to cover you under such circumstances – yet most IT support arrangements do. If IT support were actually run like insurance, premiums would be based on a risk assessment, and consequently most companies would have to pay much higher premiums.

Much of this is about company processes as much as about the technical setup. Depending on how you arrange things in your business, you can actually be very “emergency prone”. Since company processes are notoriously hard to change, many businesses operate in a way that is fundamentally not suitable for Open Query to do business with. That’s a fact and we’re fine with it; the market is big enough.

We have clients all around the world, but so far very few from Silicon Valley. My presumption was that this was due to the way those businesses are often set up, making them simply incompatible with our services. But a significant number of companies we spoke with at and around the conference were very interested in our services exactly because of the way we work, and that to me was interesting news. A good lesson, and one that made attending the conference extra worthwhile. It’s also a good vote of confidence in the way we’ve set up our service offering.


Crosstabs cell-shifting

A crosstab query is a query used to create aggregate reports across two or more fields; it’s a handy way to display summary information. At Open Query we have customers using this technique to display production schedules.

A summary table is first generated from the database, extracting the manufacturing date (mand), unit number (unitn), product area (pro_area), and ranking (rnk). We then use that summary table as the basis for our crosstab. The summary table looks like this:

mand unitn pro_area rnk
2009-12-15 587-MWI2-PP49 1 11
2009-12-15 587-MWI2-PP50 1 10
2009-12-16 670-W1101 1 11
2009-12-17 670-W1102 1 7
2009-12-16 670-W1103 1 10
2009-12-16 683-BR107-S2-01 12 9
2009-12-18 683-BR107-S4-05a 12 11
2009-12-18 683-BR107-S5-06 12 10
2009-12-15 691-02-105 1 9
2009-12-16 691-02-107 1 8
2009-12-17 691-03-101 1 6
2009-12-17 691-03-105 1 5
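
For completeness, such a summary table can be produced with a straightforward extraction query; the source table and column names below are purely hypothetical, since they obviously differ per schema:

-- hypothetical source table and columns, for illustration only
CREATE TABLE pro_sched_sorted AS
SELECT manufacture_date AS mand,
       unit_number      AS unitn,
       product_area     AS pro_area,
       ranking          AS rnk
FROM   production_orders
WHERE  manufacture_date BETWEEN '2009-12-15' AND '2009-12-18'
ORDER  BY pro_area, rnk;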

Starting from that table we are going to generate the final report, which will display the production schedule by factory and by production order (rank). Our crosstab query looks like this:

select
pro_area as 'factory',
max( case when mand = '2009-12-15' then unitn else '' end ) as '15Dec09',
max( case when mand = '2009-12-16' then unitn else '' end ) as '16Dec09',
max( case when mand = '2009-12-17' then unitn else '' end ) as '17Dec09',
max( case when mand = '2009-12-18' then unitn else '' end ) as '18Dec09'
from pro_sched_sorted
group by pro_area, rnk
order by pro_area, rnk desc

And the result of our query is the following report:

factory 15Dec09 16Dec09 17Dec09 18Dec09
1 587-MWI2-PP49 670-W1101
1 587-MWI2-PP50 670-W1103
1 691-02-105 691-03-106
1 691-02-107 691-04-103
1 670-W1102
1 691-03-101
1 691-03-105
5 698-078
5 698-079
5 698-080
5 698-081
5 698-082 698-087
5 698-083 698-088
5 698-084 698-089
5 698-085 698-090
5 698-086 698-091
9 692-1930-073 692-1930-081
9 692-1930-074 692-1930-082
9 692-1930-075 692-1930-083
9 692-1930-076 692-1930-084
10 692-1930-077 692-1930-085
10 692-1930-078 692-1930-086
10 692-1930-079 692-1930-087
10 692-1930-080 692-1930-088
12 683-BR107-S4-05a
12 683-BR107-S5-06
12 683-BR107-S2-01

Now the content of our crosstab result is correct, but the layout is not satisfying, because of the empty cells that are generated whenever a given factory does not have a production order for a specific date.

I have been looking into a solution for removing those empty cells on the server side. That would involve either filtering, or doing cell-shifting in SQL.

The filtering solution is not really an option since the date columns are aggregated. You cannot just add a WHERE … != '' clause because of the GROUP BY process.

For the second solution, I have thought of writing a stored procedure with a cursor over the result set, which would reorder the report by replacing empty cells with the value from the next row. However, the MySQL stored procedure language is probably somewhat limited for this kind of previous-row/next-row operation.
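
For what it’s worth, here is a rough sketch of a different angle (not a cursor, just MySQL user variables, relying on their evaluation order, which works on current 5.x but isn’t formally guaranteed): number the rows per factory and per date first, then pivot on that row number instead of rnk, so that each date column compacts independently and the empty cells disappear.

select
pro_area as 'factory',
max( case when mand = '2009-12-15' then unitn else '' end ) as '15Dec09',
max( case when mand = '2009-12-16' then unitn else '' end ) as '16Dec09',
max( case when mand = '2009-12-17' then unitn else '' end ) as '17Dec09',
max( case when mand = '2009-12-18' then unitn else '' end ) as '18Dec09'
from (
  -- assign 1,2,3,... within each (factory, date) combination
  select pro_area, mand, unitn,
         @row  := if(@area = pro_area and @date = mand, @row + 1, 1) as rowno,
         @area := pro_area,
         @date := mand
  from ( select pro_area, mand, unitn, rnk
         from pro_sched_sorted
         order by pro_area, mand, rnk desc ) as sorted,
       ( select @row := 0, @area := -1, @date := '0000-00-00' ) as init
) as numbered
group by pro_area, rowno
order by pro_area, rowno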

Any thoughts on doing cell-shifting in SQL?


Tokutek’s Fractal Tree Indexes

Tokutek’s Bradley did a session on their Fractal Tree Index technology at the MySQL Conference (and an OpenSQL Camp before that – but I wasn’t at that one), and my first thought was: great, now we get to see what and where the magic is. On second thought, I realised you may not want to know.

I know I’m going to be a party pooper here, but I do feel it’s important for people to be aware of the consequences of looking at this stuff (there are slide PDFs online as well as video), and of software patents in general. I reckon Tokutek has done some cool things, but the patents are a serious problem.

Tokutek’s technology has patents pending, and is thus patent encumbered. What does this mean for you? It means that if you look at their “how they did it” info and you happen to code something that later ends up in a related patent lawsuit, you and the company you work for can be liable for triple damages. That’s basic US patent law: if you knowingly infringe, you pay thrice. If you were at either session and are involved in database development work, you may wish to talk with your boss and legal counsel.

I made the assessment for myself (although I’m in Australia, there’s the Free Trade Agreement with its patent-related provisions, so I am exposed) and decided that since Open Query’s activities are well within my control, it’s a manageable risk. So yep, I’ve looked at the details. I’ll review some broad aspects below. I am not a lawyer, but if the above worries you, now is the time to stop reading rather than see the rest of this post.


The insertion methodology is an interesting and nifty trick. It’s more CPU intensive but reduces disk I/O, and is thus faster for high volume inserts (the exact spot where B-trees and derivatives tend to be slower).

First of all, it’s important to appreciate why the B-tree family of indexing algorithms exists. They acknowledge that disk I/O is a) relatively expensive and b) operates in blocks (that is, grabbing or writing a larger chunk costs little extra once you’re touching the disk anyway). So B-trees store groups of keys together and thus try to minimise disk I/O, particularly on lookup. Balanced B-trees (the B+tree algorithm and friends) go wide rather than deep, so even for billions of entries you only have perhaps 6-8 disk blocks to fetch. Inserts (and deletes) can be more costly, particularly with page splits (merges for deletes) and rebalancing operations. Blocks are also not kept full, which is technically wasteful on your storage – it’s a tradeoff.

If you have an index purely in memory, algorithms that don’t work with blocks are more efficient: MySQL (NDB) Cluster uses T-trees, and MySQL’s MEMORY tables use red-black trees, which are a form of balanced binary tree. If you’re interested in the structure and basic logic of each of these algorithms, Wikipedia tends to have good descriptions and diagrams, and there are many other resources on the web, including neatly animated demos of how inserts work.

So, Tokutek’s method is basically an enhancement of B-trees; it’s relevant as long as we deal with block devices that operate in large(r) read/write chunks, not just spinning disks. For spinning disks, seek time is an important factor. For SSDs it is not, but SSDs still work with relatively large blocks of data: you can’t just write 3 bytes; if you try, the SSD actually reads the rest of the block and rewrites it (with your new 3 bytes) elsewhere, marking the old block for re-use (since SSDs require an erase cycle before they can write again). These technologies will be with us for a while yet, so such enhancements are useful.

Monetisation models (and patents) aside, I reckon it’d be best to see enhancements such as these added to existing storage engines and indexing implementations (think text indexers and many other applications – it’s by no means limited to plain RDBMS or databases in general). Then they would quickly benefit a large group of users.

Building a basic storage engine is not that hard for an experienced database coder, but it takes time to mature, and there are many aspects and trade-offs to it. It has taken years for InnoDB to mature and for people to understand how to use it optimally. Putting a new, separate storage engine on the market to monetise a new indexing scheme makes sense, to me, only in that monetisation context. It makes no sense at all when looking at the technical aspects or the needs of the users.

For companies using MySQL/MariaDB because the code is available and they’re not locked into a single vendor for bugfixing and enhancements (just look at what Percona has done with InnoDB!), buying and using proprietary extensions makes no sense. I by no means wish to diminish the accomplishments of the innovative minds at Tokutek, and I appreciate their tough predicament in finding a way to monetise their innovation, but what we have now is problematic.

In a nutshell, my excitement on behalf of my clients is hindered by the proprietary and patent aspects. Which is a great pity! We need to think seriously about alternative ways for smart people to benefit from their innovations without effectively hindering broad adoption. Using different monetisation means may mean less money is made; however, also consider the cost (in both time and money) of the patenting and product development process (in this case, for a complete storage engine). That’s all overhead, and it significantly burdens future profitability. You need to consider these things the moment you create something, before going down the road of patenting or setting up a business, as those are in fact defining decisions: they define how you approach the market and how much money you need to make before you see any profit at all. There are methods to cheaply explore what might be a right way for you (and quickly eliminate wrong ways), but some “wrong ways” are permanent: you can’t backtrack, and you definitely lose any time advantage (which is of course more relevant if you don’t patent).


Open Query @ DrupalConSF

Peter and Arjen will be at DrupalCon SF 2010. Peter specifically for the event, Arjen staying around the SF area after the MySQL Conference last week.

Specifically, we’ll be talking with people about using the OQGRAPH engine to help with social graphs and other similar problems, easily inside Drupal. You may recall that Peter already created the friendlist_graph extension for the friendlist Drupal module.

From the MySQL Conf and other earlier feedback, OQGRAPH is proving to be a real enabler. And since it’s free (GPLv2) and integrated into MariaDB 5.2, there’s generally no barrier to starting to use it.


Open Query @ MySQL Conf & Expo 2010

Walter and I are giving a tutorial on Monday morning, MySQL (and MariaDB) Dual Master Setups with MMM. I believe there are still some seats available – tutorials cost a bit extra when you register for the conference, so you do need to sign up if you want to be there! It’s a hands-on tutorial/workshop: we’ll be setting up multiple clusters with dual masters and the rest of the MMM fun, using VMs on your laptops and a separate wired network. Nothing beats messing with something live, breaking it, and seeing what happens!

Then on Tuesday afternoon (5:15pm, Ballroom F), Antony and I will do a session on the OQGRAPH engine: hierarchies/graphs inside the database made easy. If you’ve been struggling with trees in SQL, would really like to effectively use social networking in your applications, need to work with RDF datasets, or have been exploring neo4j but otherwise have everything in MySQL or MariaDB, this session is for you.
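
To give a flavour of what that looks like in practice, here is a minimal sketch using the OQGRAPH v2 interface as bundled with MariaDB 5.2; the table name and edge values are made up for illustration.

CREATE TABLE friends (
  latch   SMALLINT UNSIGNED NULL,
  origid  BIGINT UNSIGNED NULL,
  destid  BIGINT UNSIGNED NULL,
  weight  DOUBLE NULL,
  seq     BIGINT UNSIGNED NULL,
  linkid  BIGINT UNSIGNED NULL,
  KEY (latch, origid, destid) USING HASH,
  KEY (latch, destid, origid) USING HASH
) ENGINE=OQGRAPH;

-- edges of the social graph (made-up example data)
INSERT INTO friends (origid, destid) VALUES (1,2), (2,3), (2,4), (4,5);

-- shortest path from node 1 to node 5; latch=1 selects Dijkstra
SELECT * FROM friends WHERE latch = 1 AND origid = 1 AND destid = 5;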

We (and a few others from OQ) will be around for the entire conference, the community dinner (Monday evening) and other social events, and are happy to answer any questions you might have. You’ll be able to recognise us easily in the crowd by our distinctive, friendly Open Query olive green shirts (green stands out, because most companies mainly use blue/grey or orange/red).

Naturally we would love to do business with you (proactive support services, OQGRAPH development), but we don’t push ourselves onto unsuitable scenarios. In fact, we’re known to refer, and even actively introduce, clients to other competent vendors where appropriate. In any case, it’s our pleasure and privilege to meet you!

See you all in Santa Clara in a few days.


Visiting Monty HQ

On this big trip, I made a particular effort to finally visit Monty at his home near Helsinki. Somehow, in all my years at MySQL AB, this never happened – a sad omission. So, I spent the Easter days with Monty, Anna, and now-5-year-old Maria.

I’m not a fan of most meetings, and in many cases in-person meetings are not actually necessary to get things organised or done, but I think this one was both enjoyable and productive for our respective businesses and joint interests. Good company, discussion, food, drink, sauna… fabulous.

It’s a great pity we live on opposite sides of the planet, as we do get along very well together. We definitely don’t agree on everything, but we’re always absolutely direct with each other, and try to provide good arguments whenever we disagree, to explore things further.


The actual range and storage size of an INT

What’s the difference between INT(2) and INT(20)? Not a lot. The number is a display width: it’s about output formatting (and only really visible with ZEROFILL), which you’ll never encounter when talking to the server through an API (as you do from most application languages).

The confusion stems from the fact that with CHAR(n) and VARCHAR(n), the (n) signifies the length or maximum length of that field. But for an integer column, the range and storage size are determined by the data type itself: TINYINT, SMALLINT, MEDIUMINT, INT (aka INTEGER), or BIGINT.
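
A quick sketch to illustrate (the table name is just for demonstration): the number in parentheses does not change what the column can store.

CREATE TABLE t_int_demo (a INT(2), b INT(20));
INSERT INTO t_int_demo VALUES (123456789, 123456789);
SELECT * FROM t_int_demo;
-- both columns happily store and return 123456789;
-- the display width only shows up as padding when ZEROFILL is used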

At Open Query we tend to pick on things like INT(2) when reviewing a client’s schema, because chances are that the developers/DBAs are working under a mistaken assumption and this could cause trouble somewhere – even if not in the exact spot where we pick on it. So it’s a case of pattern recognition.

A very practical example of this comes from a client I worked with last week. I first spotted some harmless ones, we talked about it, and then we hit the jackpot: an INT(22) or so, which was in fact storing a unix timestamp, converted to an int by the application, for the purpose of, wait for it, the user’s birth date. There are a number of things wrong with this, and the result is something that doesn’t work properly.

Currently, the unix epoch timestamp, when stored in binary, is a 32-bit integer. Treated as unsigned it runs from 1970-01-01 to well past the year 2100; as the usual signed integer it runs out early in 2038, and dates before 1970 have to be represented as negative values, which a lot of code (and any UNSIGNED column) simply won’t handle.

  • if the timestamp is treated as starting at zero (or the column is UNSIGNED), anyone born before 1970 cannot be represented at all. You may be “lucky” enough not to have any such users, but that’s not an assumption a schema should rely on.
  • using a timestamp for date-of-birth tells me that the developers are young 😉 well, that’s relative, but in this case: younger than 40. I was born in 1969, so I am very aware that it’s effectively impossible to represent my birthdate in a unix timestamp! What dates do you test with? Your own, and those of people around you. ’nuf said.
  • finally, INT(22) is still just an INT, which for MySQL means 32 bits (4 bytes); and in this case it happened to be signed as well.

So, all in all, this wasn’t going to work. Exactly what would fail, and where, would depend heavily on the app code (and the dates involved), but you can tell it needs a quick redesign anyway.

I actually suggested checking the requirements to see whether just a year would suffice for the intended use (it can be stored in a YEAR(4) field); this reduces the amount of personal data stored and thus also reduces privacy concerns. Otherwise, use a DATE field, optionally allowing it to have no day-of-month (i.e. only ask for year and month), as that again can be sufficient for the intended purpose.
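
In schema terms, a hypothetical fix (table and column names invented for illustration) could look like this:

-- store only what’s actually needed for the purpose
ALTER TABLE users ADD COLUMN birth_year YEAR(4);

-- or, if year/month (or a full date) is genuinely required:
ALTER TABLE users ADD COLUMN birth_date DATE NULL;
-- a missing day-of-month can be stored as 00 (e.g. '1969-05-00'),
-- provided sql_mode doesn't include NO_ZERO_IN_DATE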


relay-log-space-limit

We don’t often see this option configured (the default is unlimited), but it might be a good idea to set it. What it does is limit the amount of disk space the combined relay logs on a slave are allowed to take up.

A slave’s IO_Thread reads from the master and puts the events into the relay log; the slave’s SQL_Thread reads from the relay log and executes the queries. If/when replication “breaks”, unless it’s connection-related, it tends to be during the execution of a query. In that case the IO_Thread will keep running (receiving events from the master and storing them in the relay log). Beyond a certain point, that doesn’t make sense.

The reason for having two separate replication threads (introduced in MySQL 4.0) is so that a long-running query doesn’t delay the receiving of more data. That’s good. But receiving data is generally pretty fast, so as long as that basic issue is handled, it’s not necessary (for performance) to have the IO_Thread run that far ahead.

So you can set something like relay-log-space-limit=256M. This prevents slave disk space from getting gobbled up in some replication failure scenarios. The data will still be available in the logs on the master (provided of course the log expiration there isn’t too short – replication monitoring is still important!).
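
The option isn’t dynamic, so it goes into the server config (my.cnf) and takes effect on restart. Once set, you can sanity-check it and watch the combined relay log size, which the slave reports itself:

SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit';

SHOW SLAVE STATUS\G
-- the Relay_Log_Space field shows the current combined size of all relay logs;
-- with the limit set, the IO_Thread simply pauses when it would exceed it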

Conclusion: treat the relay log as a cache. Don’t leave the limit at “unlimited”; that’s an inefficient (and potentially problematic) use of resources. If you do run out of disk space, the relay log can get corrupted, and then you have to reposition, which will re-read the data from the master anyway.
