Cleaning up data: OpenRefine

Always a problem: imported data tends to be messy.  So, you want to clean it – and preferably before it gets into your database!

OpenRefine has existed for some years already, and I particularly like that it runs locally (on Linux, Mac, Windows) rather than being a server “elsewhere”. It does use a web interface, but you can run the (Java-based) backend on your laptop or another machine on your local network.  Have a look at the videos on the site to see how it works and what different tricks OpenRefine can do for you.

And another thing I like – it’s possible to call it programmatically.  Once you work out that you need to do certain operations on a particular dataset to sanitise it, you should be able to automate the process for when you grab more of the same data later.

Attached Storage Hindering Commit Performance

Even though SSDs are becoming more prevalent for local storage, including on cloud servers, we still encounter attached storage (SAN) quite frequently.  This can be fine, but we often find that SAN performance is quite dismal.

  • SANs are very efficient with bulk sequential reads or writes, but that is not what a database server’s workload looks like;
  • SANs often have a large memory cache (sometimes with intermediate SSD), optimising frequent reads – again a database server won’t benefit from this, because it will already have any recent data in its own caches (for MySQL/MariaDB, that’d typically be the InnoDB Buffer Pool).  So usually any database disk read needs to access physical storage = slow;
  • SANs will cache data writes in (battery backed) memory.  That can work fine, provided they respond quickly enough.

Slow response on that last aspect can really damage performance. But we need a reference point.

So let’s compare with a fairly fast spinning HDD, 15000 RPM.  That sounds like a lot, but 15000 RPM / 60 seconds = only 250 rotations per second.  It gets slightly better, but not that much: even presuming the disk head is already at the correct cylinder (let’s hope it is, because disk seeks are measured in milliseconds, which is relatively very slow) to write to the InnoDB log file, you will on average still incur an overhead of half a rotation before the right point passes under the head so you can write.  So you’ll get at most 500 writes per second.

Anyway, that’s spinning disk performance in an optimal situation. You’d expect an expensive SAN to deliver higher write/fsync performance than that, right?  You may well expect that, but prepare to be disappointed.  We often see performance in the range of 500–600 writes per second.

InnoDB tries to combine commits when it writes to and fsyncs the InnoDB log file, so that helps a bit.  We also see some SANs that do wonderfully well with numbers in the thousands, and of course local SSDs that easily do close to ten thousand and higher.
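
If you want a rough idea of what your own setup is delivering under the current workload, you can sample the InnoDB log fsync counter; a minimal sketch using a standard MySQL/MariaDB status variable, sampled over 10 seconds:

SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs';
SELECT SLEEP(10);
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs';
-- (second value - first value) / 10 = InnoDB log fsyncs per second right now;
-- compare that against what the storage should be capable of.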

Just be aware that expensive “enterprise” hardware does not necessarily help database performance.

If you need assistance with recommendations for infrastructure design, or measuring on existing systems, contact Open Query.

On “The Github threat” and distributed development

“The Github Threat” is a great write-up by Carl Chenet, reviewing the problems created by this centralised system.

Github is very convenient, but that’s not really the point…

The greatest irony in the setup is that key advantages of using distributed revision control are undermined by using a centralised repository for bugs and other key aspects of the development process.

It’s most unfortunate, but indeed ubiquity comes with lots of side-effects. People join without considering the implications, and many who join won’t have the background or information to be able to consider them in the first place.

For an example of a distributed version control system that has its bug tracking (and other aspects) built-in: Fossil, by the author of SQLite, Richard Hipp.
The approach has specific merits that we should consider, and they can “easily” be applied with Git also.

Many of the GitHub alternatives are in themselves centralised – yes you can run your own instance, but they still split the code from the bugs and other info. Why?

There are documented cases of Github projects (ref. Gadgetbridge) being blocked due to DMCA take-down notices.  Imagine your company relying on a centralised service, and that service (even temporarily) being unavailable to your employees.  How well will your company cope?  Yes, with Git you can share changes in a different way, but your business processes will need to adjust, and that can be quite hard.  How will the equivalent of pull-requests be managed, and where is your bug tracking?

Finally, it should not be necessary to have a centralised user-base at all. It would be good to have/use a distributed notification system (Mastodon might qualify) for distributed repos, using signed messages. That way even “politically endangered” projects would be able to exist effectively without an intrinsic risk of being taken out. Secondary hosts can automatically clone and broadcast availability.

As part-fixes, also see options like this idea for Gitlab (and others): Implement cross-server (federated) merge requests.

Fax no more

Please note Open Query no longer maintains a fax number.

It has always been virtual (that is, any fax received would just arrive here as a PDF attachment) and it’s sometimes been useful for contracts and the like. However, companies have got used to scanning and emailing, and we haven’t had a need for our fax number in quite a while. Since the service was up for renewal, we decided to cancel.

Of course, more and more companies no longer have physical phone lines either, instead only using VoIP.  While it’s technically possible to do a fax transmission over VoIP (it even has a special mode for that), that really gets quite silly when you think about the protocol stack involved!

Did you know that the Japanese invented the consumer fax (a step up from the facsimile machines used by the press to transmit photos) to assist them in transmitting documents in their scripts, back before electronic transmissions were mainstream?  It makes sense in that context. But it “got out” to the rest of the world. Sigh. We don’t think anybody will miss it as we move on from this technology.

SSL and trust

We can all agree on this: security is important, as is trust.

Does a pretty seal from an SSL certificate provider create trust? Doubtful. The provider’s own claims aside, it’s marketing fluff.
Oh, it used to provide them with some extra Google juice (one more link to them), but Google’s algorithms don’t care for that any more. Good!

What Google (and others) do care about is security: all sites should use SSL. For everything.
Expensive? Not really. Let’s Encrypt is free, and updates can be fully automated (scripted). Quite shiny really.

Let’s Encrypt only does domain validation, so a user sees the green lock and a “Secure” indicator. If you want company validation, you need to use another provider and pay their fees. Do you need that? That’s up to you. We reckon that in many (if not most) cases, you don’t really. If your users know enough and care to ensure that your site is really owned by your company, then yes, a certificate with company validation makes sense.

It might depend on whether your clients are informed enough to care about SSL, and then whether they know (and care) enough to discern which indicators actually have real security meaning and which are just fluff. Tech geeks aside, few people do. We’re not saying that is brilliant, but it is reality.

Do people care for pretty seals, and do we want to feed that realm of misinformation and false security? We hope you don’t go down that path, because if we really care about security, this just distracts without solving the real issues. Doing things you technically don’t believe in won’t create real trust, as it’s not genuine. And whatever marketing/sales types tell you, you can’t fake genuine. Increasingly, people see right through it. Which is awesome!

Actionable task

If your publicly facing web or API servers aren’t using SSL for everything yet, you’ll want to spend some time to fix this. Real security aside, it affects your search engine ranking. If web pages pull in logos, javascript or even stylesheets from third parties, make sure those too use https, as otherwise browsers will produce “mixed content” warnings.

TEXT and VARCHAR inefficiencies in your db schema

The TEXT and VARCHAR definitions in many db schemas are based on old information – that is, they appear to presume restrictions and behaviour from MySQL versions long ago. This has consequences for performance. To us, use of (for instance) VARCHAR(255) is a key indicator of this. Yep, an anti-pattern.

VARCHAR

In MySQL 4.0, VARCHAR was restricted to a maximum of 255. MySQL 4.1 introduced character sets such as UTF8, and since MySQL 5.0 a VARCHAR can be up to 64K-1 in byte length. Thus, any occurrence of VARCHAR(255) indicates some old style logic that needs to be reviewed.

Why not just set the maximum length possible? Well…

A VARCHAR is subject to the character set it’s in; for UTF8 this means either 3 or 4 (utf8mb4) bytes per character can be used. So if one specifies VARCHAR(50) CHARSET utf8mb4, the actual byte length of the stored string can be up to 200 bytes. In stored row format, MySQL uses 1 byte for the VARCHAR length when the column’s maximum possible byte length fits in a single byte, and 2 bytes otherwise. So, specifying VARCHAR(255) unnecessarily in a multi-byte character set means that the server has to use a 2 byte length in the stored row.
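
As a small illustration (the table and column names below are purely hypothetical): both columns can hold the same application data if values never exceed 50 characters, but the definitions differ in on-disk overhead under utf8mb4:

CREATE TABLE length_demo (
  name_tight VARCHAR(50)  CHARACTER SET utf8mb4,   -- at most 200 bytes stored: 1-byte length prefix
  name_loose VARCHAR(255) CHARACTER SET utf8mb4    -- up to 1020 bytes possible: 2-byte length prefix
);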

This may be viewed as nitpicking, however storage efficiency affects the number of rows that can fit on a data page and thus the amount of I/O required to manage a certain number of rows. It all adds up, so even small unnecessary inefficiencies will cost – particularly for larger sites.

VARCHAR best practice

Best practice is to set VARCHAR to the maximum necessary, not the maximum possible – otherwise, as per the above, the maximum possible is about 16000 for utf8mb4, not 255 – and nobody would propose setting it to 16000, would they? But it’s not much different: in stored row space a VARCHAR(255) requires a 2 byte length indicator just like VARCHAR(16000) would.

So please review VARCHAR columns and set their definition to the maximum actually necessary; this is very unlikely to come out as 255. If 255, why not 300? Or rather 200? Or 60? Setting a proper number indicates that thought and data analysis has gone into the design. 255 looks sloppy.
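
A minimal sketch of such a review (the customer/email names are hypothetical; carry over the real column’s character set, NULL and DEFAULT attributes when altering):

-- Find VARCHAR(255) columns in the current schema; these are the review candidates.
SELECT table_name, column_name
  FROM information_schema.columns
 WHERE table_schema = DATABASE()
   AND data_type = 'varchar'
   AND character_maximum_length = 255;

-- For each candidate, check what the data actually needs, then size with sensible headroom:
SELECT MAX(CHAR_LENGTH(email)) FROM customer;
ALTER TABLE customer MODIFY email VARCHAR(60);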

TEXT

TEXT (and LONGTEXT) columns are handled differently in MySQL/MariaDB. First, a recap of some facts related to TEXT columns.

The db server often needs to create a temporary table while processing a query. MEMORY tables cannot contain TEXT type columns, thus the temporary table created will be a disk-based one. Admittedly this will likely remain in the disk cache and never actually touch a disk, however it goes through file I/O functions and thus causes overhead – unnecessarily. Queries will be slower.
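
You can check how often this happens on a running server; a quick look at the relevant counters (standard MySQL/MariaDB status variables):

-- Compare on-disk temporary tables with the total number of implicit temporary tables;
-- a high ratio often points at TEXT/BLOB columns (or very large results) in grouping/sorting queries.
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';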

InnoDB can store a TEXT column on a separate page, and only retrieve it when necessary. This also means that using SELECT * is needlessly inefficient – it’s almost always better to specify only the columns that are required. As a bonus, that makes code maintenance easier: you can scan the source code for referenced column names and actually find all relevant code and queries.

TEXT best practice

A TEXT column can contain up to 64k-1 in byte length (4G for LONGTEXT). So essentially a TEXT column can store the same amount of data as a VARCHAR column (since MySQL 5.0), and we know that VARCHAR offers us benefits in terms of server behaviour. Thus, any instance of TEXT should be carefully reviewed and generally the outcome is to change to an appropriate VARCHAR.

Using LONGTEXT is ok, if necessary. If the amount of data is not going to exceed, say, 16K characters in length, using LONGTEXT is not warranted, and again VARCHAR (not TEXT) is the most suitable column type.
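
A minimal conversion sketch under those assumptions (the product/description names are hypothetical; check the real maximum first, and keep the column’s character set and NULL attributes consistent in the ALTER):

-- Would the existing TEXT data fit comfortably in a VARCHAR?
SELECT MAX(LENGTH(description)) AS max_bytes,
       MAX(CHAR_LENGTH(description)) AS max_chars
  FROM product;

-- If the longest value is, say, around 1500 characters:
ALTER TABLE product MODIFY description VARCHAR(2000);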

Summary

Particularly when combined with the best practice of not using SELECT *, using appropriately defined VARCHAR columns (rather than VARCHAR(255) or TEXT) can have a measurable and even significant performance impact on application environments.

Applications don’t need to care, so the db definition can be altered without any application impact.

It is a worthwhile effort.

Contributing to popular frameworks for scalability

Right now we’re volunteering some engineering time to assist the WordPress and WooCommerce people with scalability issues. In the past we’ve put similar efforts into Drupal.

There are many opinions on these systems out there, most of them sadly negative. We take a different view. Each of these frameworks obviously has their advantages and disadvantages, but the key question is why people use them. When we understand that, we can assess that reasoning, and assist further.

Obviously, writing all your own code is potentially going to produce the best result for your site. A custom tool will be less code and better tuned to your situation. However, it also requires you to put in quite a bit of development effort, both to create and to maintain that system, including dealing with security issues. When you’re big enough (as a site/company) this can be worthwhile, but more and more organisations actually appear to be moving towards utilising a CMS and then extending that to their needs using the usual module/plugin/extension model.

This also means that bigger and bigger sites use these systems, and thus we need to look at the scalability. CMS frameworks tend to be “db heavy”, caused by being highly configurable combined with a modular/abstraction architecture that sometimes goes for “code architectural correctness” without taking scaling into account. Naturally most such code works fine on a dev box and even on a modest live server. But add sufficient content, and things rapidly go downhill. Just like with other database related tuning issues, there is no gradual degradation in performance – when certain boundaries are reached, performance plummets to cause nasty page load times or even complete page load failures.

Are these systems inherently dreadful? Actually, no. They do a darn good job and even in security terms they’re not bad at all. Naturally one always has to be careful with modules/plugins and whether they are (still) maintained.

So, with that assessment out of the way – we can look at the actual issues. It makes sense for us to contribute to these systems as it directly benefits our clients, ourselves, and many many others.

Just now the option structure in WordPress has drawn our interest. It’s abstracted, so a plugin will request the value of an individual option item (by name). Typically this causes a db query. WordPress has an ‘autoload’ mechanism, particularly for its core settings, which it loads in one bigger query and caches – that makes sense.

We’ve just commented on an existing issue regarding the indexing of the autoload column. Some people argue that indexing a boolean field is not beneficial (not necessarily true, and easily tested for a specific case – the outcome is that in this case an index IS beneficial), and that having more indexes slows down writes (true of course, but that’s never an argument against proper indexing – also, this is mostly a read table, so a bit of overhead on writes is fairly immaterial). Lastly there were comments re MyISAM, which has a slightly different performance profile to InnoDB when testing this. But InnoDB has been the default engine for quite a few years now – any installation in recent years, and any installation where the owner needs to care about performance and other factors, will already be using InnoDB. Taking MyISAM into account is not beneficial. We hope the index will shortly be added. In the meantime you can add it yourself:

ALTER TABLE wp_options ADD INDEX (autoload)
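
If you want to verify the benefit on your own installation, a quick check could look like the below. The 'yes' value is how WordPress stores the autoload flag, and the SELECT mirrors the kind of query WordPress issues for autoloaded options at the start of a page load:

-- How is the column distributed? A skewed distribution is exactly where the index pays off.
SELECT autoload, COUNT(*) AS num_options, SUM(LENGTH(option_value)) AS total_bytes
  FROM wp_options
 GROUP BY autoload;

-- Does the autoload query actually use the new index?
EXPLAIN SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes';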

Using autoload for everything would not be efficient; there can be tens of thousands of options in a decently sized site. So we see plugins retrieve half a dozen options, one by one, using the appropriate framework function, and each retrieval triggers a database query. Again we have to appreciate why this architecture is the way it is, and not waste time arguing with that. There are sensible reasons for having such an option architecture. What we need to address is the inefficiency of triggering many queries.

Our proposal is to extend the get_option() function with an extra parameter for a plugin name (call it a domain); then the function can issue a single db query to retrieve all the options for that plugin and cache them in a single object. When another option is requested, it can be retrieved from there rather than triggering another query. This reduces the number of db queries from N per plugin to 1, which is significant. Mind you, each of these queries is tiny and doesn’t require much lookup time. But running any query has intrinsic overhead, so we want to reduce the number of queries wherever possible – not running superfluous queries, combining queries, and so on.
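
At the SQL level the change is simple; something along these lines, assuming (purely for illustration) that a plugin’s options share a common name prefix:

-- Instead of one round-trip per option...
SELECT option_value FROM wp_options WHERE option_name = 'myplugin_api_key';
SELECT option_value FROM wp_options WHERE option_name = 'myplugin_cache_ttl';

-- ...fetch the plugin's options in one go and cache the result set in the application:
SELECT option_name, option_value
  FROM wp_options
 WHERE option_name LIKE 'myplugin\_%';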

There are variations on the proposal as there are many aspects to consider, but it’s important to not leave it “as is” as currently it affects performance and scalability.

So how do we catch potential items of interest? On the database end, the slow query log is still very useful with the proper settings: log_queries_not_using_indexes, min_examined_row_limit=1000 (tune as appropriate for the environment), log_slow_filter and log_slow_verbosity to give as much info as possible, and long_query_time set to something sub-second. Naturally, this presumes you’ve already taken care of the real nasties, otherwise you’ll get very big slow log files very quickly. You can analyse the slow query log using Percona Toolkit, but the ‘mysqldumpslow’ tool which is part of the server package is pretty useful also. You’re looking for queries that either take a long time, or that happen often, or that just look odd/inefficient/unnecessary. Then hunt down where they come from in the code, and figure out the why. Often the reasons aren’t necessarily good or current (historical scenarios linger), but there will be reasons, and understanding them will help you talk with developers more effectively – seeing where they come from.
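
As a starting point, settings along these lines can be applied at runtime; a sketch, with thresholds you should tune for your environment (log_slow_verbosity and log_slow_filter exist in MariaDB and Percona Server, not stock MySQL):

SET GLOBAL slow_query_log = ON;
SET GLOBAL log_queries_not_using_indexes = ON;
SET GLOBAL min_examined_row_limit = 1000;   -- skip trivial lookups; tune for the environment
SET GLOBAL long_query_time = 0.5;           -- sub-second threshold; applies to new connections
-- MariaDB / Percona Server only:
SET GLOBAL log_slow_verbosity = 'query_plan';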

WordPress has a nice plugin called Query Monitor. As admin you get details from a page load in your top admin bar, and you can delve into specific aspects. Things to look out for are an excessive number of queries for a page (a hundred is sadly common, but when you see into the thousands it’s worthwhile investigating – of course, start with the biggest problems first and work your way down), duplicated queries (indicating framework inefficiencies) and things like that. Have a good browse around, there’s much there that’s useful.

If you want, feel free to bring suspected issues to our attention. We’ll look at them, perhaps write about them here, and in any case follow up with the developers in their bug tracking system. It’s worthwhile.

Amazon vs Linode in case of hardware failure

One of our clients received an email like this:

Dear Amazon EC2 Customer,

We have important news about your account (…). EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: …) in the […] region. Due to this degradation, your instance could already be unreachable. After […] UTC your instance, which has an EBS volume as the root device, will be stopped.

You can see more information on your instances that are scheduled for retirement in the AWS Management Console

  • How does this affect you?
    Your instance’s root device is an EBS volume and the instance will be stopped after the specified retirement date. You can start it again at any time. Note that if you have EC2 instance store volumes attached to the instance, any data on these volumes will be lost when the instance is stopped or terminated as these volumes are physically attached to the host computer
  • What do you need to do?
    You may still be able to access the instance. We recommend that you replace the instance by creating an AMI of your instance and launch a new instance from the AMI.
  • Why retirement?
    AWS may schedule instances for retirement in cases where there is an unrecoverable issue with the underlying hardware.

Great to be notified of such things. That’s not the problem. What I find curious is that Amazon tosses the problem resolution entirely at their clients. Time and effort (cost) is required, even just to create an AMI (if you don’t already have one) and restart elsewhere from it.
Could it be done differently? I think so, because it has been done for years. At Linode (for example). If something like this happens on Linode, they’ll migrate the entire VM+data to another host – quickly and at no cost. Just a bit of downtime (often <30mins). They’ll even do this on request if you (the client) suspect there is some problem on the host and would just like to make sure by going to another host.
So… considering how much automation and user-convenience Amazon produces, I would just expect better.

Of course it’s nice to have everything scripted so that new nodes can be spun up quickly. In that case you can just destroy an old instance and start a new one, which might then be very low effort. But some systems and individual instances are (for whatever reason) not set up like that, and then a migration like the one that Linode does is eminently sensible and very convenient.

Tom Eastman on File Uploads

The awesome Tom Eastman presented a session at PyCon Australia (Melbourne) 2016 entitled

“The dangerous, exquisite art of safely handling user-uploaded files”.

Every web application has an attack surface — the exposed points of interaction where a malicious or mischievous user can commit malice, or mischief (respectively). Possibly nowhere, however, is more vulnerable than places a user is allowed to upload arbitrary files.
The scope for abuse is eye-widening: The contents of the file, the type of the file, the size and encoding of the file, even the *name* of the file can be a potent vector for attacking your system.
The scariest part? Even the best and most secure web-frameworks can’t protect you from all of it.

In this talk, Tom shows you every scary thing he knows about that can be done with a file upload, and how to protect yourself from — hopefully — most of them.

Do watch it and pick up any hints you can.  This is important stuff.

How do your web applications handle file uploads?

On Open Source and Business Choices

Open Source is a whole-of-process approach to development that can produce high-quality products better tailored to users’ real world needs.  A key reason for this is the early feedback cycle built into that complete process.

Simply publishing something under an Open Source license (while not applying Open Source development processes) does not yield the same quality and other benefits.  So, not all Open Source is the same.

Publishing source of a product “later” (for instance when the monetary benefit has diminished for the company) is meaningless.  In this scenario, there is no “Open Source benefit” to users whatsoever; it’s simply a proprietary product. There is no opportunity for the client to make custom modifications or improvements, or ask a third party to work on such matters – neither is there any third party opportunity to verify and validate either code quality or security.

Open Source is not a marketing gimmick.  Labels such as “Open Source”, or “Enterprise”, on their own, do not have any more positive outcome than a greasy hamburger labeled with “healthy”.  If a company “believes” in Open Source software, they’ll use the open source development model for their software development.

And now we see things like this: Uproar: MariaDB Corp. veers away from open source (by Simon Phipps, InfoWorld, August 2016)

So what does it mean when a company publishes some of their software under an open source license, and releases some related products under a proprietary license?  To me, it’s generally a strong indication that the company either doesn’t believe in that model, or doesn’t understand it.  And we’ve seen it before.

It also reminds me of an interaction I had many years ago.  A Marketing VP asked me “How can we leverage our [Open Source] community?”  I answered the only possible way: “One does not ‘leverage’ the community, that’s not how it works.”  Of course that wasn’t the answer the VP wanted to hear, but that doesn’t make it less true.  They saw the community as an asset to use, rather than work with.  People don’t like getting used, and in the Open Source space that’s even more true.

Companies that have turned their back on their earlier Open Source work and who have devised some other model to (arguably) make more money, have all discovered that this fundamentally changes their market.  They’ll lose some of their users, customers and supporters, and gain some new different clients.  It’s a different market.  Whether and how that pans out in terms of commercial success is never certain.  Given that we know that the Open Source development process yields benefits in terms of quality and features users want, we can say that non-OSS products lack (some of) those benefits, so to put it bluntly, it’ll be a different product of possibly less quality and the feature set is likely to differ as well.

Naturally we cannot ascertain code quality directly, as we can’t review closed code, bug systems of proprietary software tend to be closed, and changelogs are condensed for marketing purposes. But as far back as a decade and a half ago there have been independent studies that worked out “lines of code per software flaw”, and the results came out significantly in favour of Open Source software, which has proportionally far fewer bugs.  Bugs also tend to get fixed quicker in Open Source software.  None of this is new(s); see for instance Open-source vs. proprietary software bugs: Which get squashed fastest? (CNET, 2007)

For complete products (libraries are a slightly different beast) with a relatively large market scope, source code being available does not in any way diminish a company’s ability to make money.  Having the core developers, tech writers and support people gives them a significant edge in the open market, and that’s a business asset you can leverage.  You do that by focusing on those aspects in your communications – that’s basic marketing: you draw attention to the positive aspects that make your company/product stand out from the rest.  Clearly, this objective cannot be achieved by force, as you don’t make a (potential) client like or trust you by denying them choice or transparency.

There is one other known option aside from not believing or not understanding, and that’s fear. But fear is an awkward business driver; it makes for very bad decisions.

MariaDB Corp in part uses the Open Source development model, in part they’re an Open Source publisher (in-house work that’s only made available at a later stage in the development process), and now some proprietary product has been added to the mix (actually new versions of an existing product).  Looking at this I am rather unclear about what they believe in.  Of course companies can make business choices as they see fit – but they never operate in a vacuum.  In the end it doesn’t matter much what I believe personally, the market will do what it will – historically, it responds in the various ways as described above.  We’ll see how it pans out.

Open Query does not recommend (or re-sell at all) proprietary tools, as it just doesn’t make sense for us or our clients.  We often do bugfixes and improvements which we contribute upstream – for proprietary tools we can’t do that, and thus they become a hindrance for us and our clients.  On a practical level, we’ve actually never used MaxScale (the product that MariaDB Corp will now sell under different conditions for future versions), and this stems from our experience with its effective predecessor, MySQL Proxy.  Having a complex set of scripted logic in a proxy slows down applications and introduces a rather large extra (single) point of potential failure into the infrastructure.  So, while Simon refers to MaxScale as an essential tool for scalable environments, we know from experience that there are other ways of achieving that desired objective, and without the downsides.

Rather than promoting a single tool for many wildly different jobs, we utilise a few different tools depending on the needs of particular client infrastructure.  We still have a couple of (now legacy) MySQL-MMM deployments, but also quite a few Galera clusters, and other setups as suit our clients’ needs.  Key is to not only make the infrastructure convenient to use for applications, but also to not introduce any more single points of failure.  We build resilience into the client’s server infrastructure, without adding significant overhead in either performance or maintenance requirements.

We believe that that’s what clients want, and since potential clients come to us asking exactly for that (and note our approach with relief) we think that we’re doing the right thing by our clients.  We’ve used this approach for over 9 years, and we’ll just keep on doing that – our basic approach doesn’t change even when our tools do.  If you’d like to talk with us about helping you with your infra, using our approach and way of working, contact us today!
