The Australian Online Census 2016 Example of How-Not-To

One of the key problems with the 2016 online census was the architecture, but also how the whole thing was organised and who was contracted for the job.

IBM, for the $9.6mln it got paid for the job, built something very clunky. They used Java, which is not bad per se, but the system also required Java on the client (browser) side, which is just daft. The number of systems that either don’t have or can’t run client-side Java is huge, and for the rest you get into version conflict mayhem. And it’s clunky: that’s a lot of code and heaviness to shuffle around, which is not a great way to build a scalable site.

If you think of the census form, the total amount of data gathered is not actually that big. It doesn’t require any particularly complicated database or storage setup.
Serving forms to clients is very light on web servers. If you then use JavaScript logic to control the flow through the forms, you can run most of the work on the client side, including intermediate local saving for the “just in case”. Then you produce a single submit with confirmation, and one transaction with a number of inserts into the database. The language used on the server end is not that important, as its job is minimal. Most of the content served can be static, and might even be handled through a CDN.
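
The single-submit step described above could be sketched roughly as follows. This is only a minimal illustration, not what the census system did or should have done in detail: the table layout, the column names and the functions `create_schema` and `submit_census_form` are all made up for the example, and SQLite stands in for whatever database would really be used.

```python
import sqlite3
import uuid


def create_schema(conn):
    """Create two hypothetical tables: one row per household, one per person."""
    conn.execute(
        "CREATE TABLE household ("
        "id INTEGER PRIMARY KEY, receipt TEXT UNIQUE, address TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE person ("
        "id INTEGER PRIMARY KEY, "
        "household_id INTEGER NOT NULL REFERENCES household(id), "
        "name TEXT NOT NULL, age INTEGER NOT NULL)"
    )


def submit_census_form(conn, household, persons):
    """Store one completed form in a single transaction and return a receipt."""
    receipt = uuid.uuid4().hex  # confirmation number shown to the user
    # "with conn" commits if every insert succeeds and rolls back otherwise,
    # so a half-saved form never ends up in the database.
    with conn:
        cur = conn.execute(
            "INSERT INTO household (receipt, address) VALUES (?, ?)",
            (receipt, household["address"]),
        )
        household_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO person (household_id, name, age) VALUES (?, ?, ?)",
            [(household_id, p["name"], p["age"]) for p in persons],
        )
    return receipt


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    receipt = submit_census_form(
        conn,
        {"address": "1 Example St"},
        [{"name": "Alex", "age": 41}, {"name": "Sam", "age": 9}],
    )
    print("confirmation:", receipt)
```

The point of the sketch is how little the server has to do: the browser holds the form until it is complete, and the server’s whole job is one short transaction per household.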

The scale of the online census task is quite small relative to many websites. Not only Twitter, Facebook and the like, but also many e-commerce sites deal with a vastly more complicated situation: they have to serve many different pages, many of them dynamic, handle lots of writes and shopping carts that get updated in chunks, and then the whole checkout process. All of that can work fine too. So the census is not a big or complicated problem, really. It just needs to be done right.

The fact that IBM, for $9.6mln, completely stuffed it is a very serious indicator of where the relevant skills and innovation capability lie. For this type of job, not with IBM. Going with a big company does not guarantee good results. If you reckon this is a one-off, ask Queensland Health about their payroll debacle (SAP implemented by… IBM). Similarly, very expensive is not necessarily better. It can be just very costly, in so many respects.

ABS/IBM also declined the NextDC offer for datacenter-level firewalling and DoS protection. Another serious mistake. But application architecture affects security too. When I googled for Census 2016 on census night, the first link that came up was a Census staff login. That’s just beyond astonishing. That should not be public at all. It doesn’t need to be on a public domain, and probably should only be accessible via a VPN.

The company that did the online Census 2016 load testing for another half a million dollars, and bragged before census night about how well their team worked together with the ABS and IBM people, should also be seriously embarrassed about the shoddy job they delivered. From their own site:

“Revolution IT worked in a highly collaborative manner, and their subject knowledge, expertise and advice were key to achieve our project goals and objectives. We were impressed with how well they engaged with our e-Census solution provider (another private company). [IBM]”

Success is not defined by how well your team worked; it’s very simply proven by how well the system deals with the real world. In this case, it didn’t. At all. So, total process fail. It would have been very wise to hold off on the bragging until after census night. If the system holds up well, you can brag. Otherwise you keep quiet, and at least avoid public embarrassment on that front. PR fail.

Their public statement (after census night) is at http://revolutionit.com.au/revolution-it-q-a-australian-bureau-of-statistics-abs-2016-census-website/ where they explain that the Census site was taken offline due to security concerns, and since security was not part of their brief, their performance was all OK and successful. But come on now, how is security not part of any practical testing? It is by nature an integral part of how things work online! Implementation of security may impact performance, and obviously security aspects always impact availability – and without availability you have no performance at all.

All in all, Census 2016 is a brilliant example of “how not to” in modern online architecture.

And to prove all this again, two students at QUT in Brisbane just built the same in a few days and for about $500 which I understand was mostly pizza costs.

Read that story at http://eftm.com.au/2016/08/how-two-uni-students-built-a-better-census-site-in-just-54-hours-for-500-30752 (that write-up is rather populist and simplistic, but the point that a few students can very well design a site like this, and do it properly, is absolutely correct).

Motivation to Migrate RDBMS

http://www.itnews.com/article/3004953/use-oracles-database-watch-out-for-this-dec-1-deadline.html

Companies that use a standard edition of Oracle’s database software should be aware that a rapidly approaching deadline could mean increased licensing costs.

Speaking from experience (at both MySQL AB and Open Query), typically, licensing/pricing changes such as these act as a motivator for migrations.

Migrations are a nuisance (doesn’t matter from/to what platform) and are best avoided as they’re intrinsically painful, costly and time-consuming. Smart companies know this.

When asked in generic terms, we generally recommend against migrations (even to MySQL/MariaDB) for the above-mentioned practical and business reasons. There are also technical reasons. I’ll list a few:

  • application, query and schema design tend to be tuned to a particular RDBMS, usually the one the main developer(s) are familiar with. Features are used in a certain way, and the original target platform (even if it was not chosen deliberately) is likely the one on which they execute most efficiently;
  • RDBMS choice drives hardware/network architecture. A migration should also include a re-think of this, to make optimal use of the database platform;
  • it’s quite rare (but not unheard of!) for an application to perform better on another platform, without putting a lot of extra work in. If extra work is on the table, then the original DB platform should also be considered as a valid option;
  • related to other points: a desire to migrate might be based on employees’ expertise with a particular platform rather than this particular application’s intrinsic suitability to that platform. While that can be a valid reason, it should be recognised as the actual reason, as there are obvious cost/effort implications in a migration, and other options such as training can be considered.

Nevertheless, a company that’s really annoyed by a vendor’s attitude can opt for the migration route, as they may decide it’s the path of less pain (and lower cost) in the long(er) term.

We do occasionally guide and assist with migrations, if after review it looks like a viable and sensible direction to take.

Serving Clients Rather than Falling Over

Dawnstar Australis (yes, a nickname, but I know him personally and he speaks with knowledge and authority) follows up on his earlier post from a few months ago with The Real Victims Of The Click Frenzy Fail: The Australian Consumer.

Colourful language aside, I believe he rightfully points out the failings of the organising company and the big Australian retailers. From the Open Query perspective we can just review the situation where sites fall over under load. Contrary to what they say, that’s not a cool indication of popularity. Let’s compare with the real world:

  1. Brick & Mortar store does something that turns out popular and we see a huge queue outside, people need to wait for hours. The people in the queue can chat, and overall the situation can be regarded as positive: it shows passers-by that there’s something special going on, and that’s cool. If you don’t want to be in the crowd, you’ll come back later.
  2. Website is unresponsive/inaccessible. There’s nothing cool or positive about this, as the cause is not only unknown, but in fact irrelevant in the context. Each potential client is on their own. Things fail, so they go elsewhere (if there are substitutes) or potentially away completely (concert, it’ll sell out). The bad taste sticks, so if there are alternatives they will not only move there, but be quite vocal about it so others move also.

So you see, you really don’t want your site to go down because of popularity, or for any other reason. Slashdot years ago created a “degrade gracefully” mechanism, where parts of the site would go static. So where normally users would be able to comment and rate posts, they’d just be able to read. In the worst case, only the front page would remain active. On Sept 11 2001, Slashdot was one of the few big sites that actually remained accessible and provided regular news that people could then read even though the topic was not really in its normal scope. The point is, they proved the approach multiple times.
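
To make the “degrade gracefully” idea a bit more concrete, here is a small sketch of how such a mechanism might be structured. It is not Slashdot’s actual implementation: the mode names, the load thresholds and the functions `choose_mode` and `is_allowed` are invented for illustration, and “load” is assumed to be a number between 0 and 1 supplied by some monitoring system.

```python
from enum import Enum


class Mode(Enum):
    FULL = "full"              # reading, commenting and rating all work
    READ_ONLY = "read_only"    # pages are served statically, no writes
    FRONT_PAGE = "front_page"  # worst case: only the front page stays up


# Which (hypothetical) actions each mode still permits.
ALLOWED = {
    Mode.FULL: {"view_front_page", "view_story", "comment", "rate"},
    Mode.READ_ONLY: {"view_front_page", "view_story"},
    Mode.FRONT_PAGE: {"view_front_page"},
}


def choose_mode(load, read_only_at=0.7, front_page_at=0.9):
    """Pick a service mode from the current load (0.0 to 1.0).

    The thresholds are assumptions for the example; real values would come
    from load testing the actual site.
    """
    if load >= front_page_at:
        return Mode.FRONT_PAGE
    if load >= read_only_at:
        return Mode.READ_ONLY
    return Mode.FULL


def is_allowed(mode, action):
    """Return True if the action is still served in the given mode."""
    return action in ALLOWED[mode]


if __name__ == "__main__":
    for load in (0.3, 0.8, 0.95):
        mode = choose_mode(load)
        print(load, mode.value, "comment allowed:", is_allowed(mode, "comment"))
```

The design choice worth noting is that the expensive, write-heavy features are switched off first, so the site keeps serving the cheap, cacheable content to everyone instead of failing for everyone.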

In contrast, companies like Ticketek have surely got Enterprise Design architecture, yet their site has been seen to fall over with events such as The Wiggles. They might be able to get away with this since they’re essentially a monopoly provider: if you want a ticket for this particular event, you need to go to them. But it’s not good. Generally they acted surprised, even though the huge load was entirely predictable. Is that just naive, or a hope to mislead the public, or negligent? You decide.

It’s really a failure in design of sorts. As to where exactly, only an architectural review would show, and it’ll be different for different sites. However, the real lesson is that it’s not about “Enterprise Design” at all, nor about using any particular high-profile hosting provider or involvement of other buzzwords. It’s about proper architecture and deployment, and the database is only one aspect of this. It doesn’t have to end up particularly expensive either; it just has to be done right, and there’s no single magical approach, as each case is unique. Looking at this is best done early on (it tends to also work out better and cheaper), but we’ve helped clients out at much later stages also. Ideally, we do like to help before there’s a raging fire.

When Clever Goes Wrong & How Etsy Overcame – Ars Technica

In 2007, Etsy made a big bet on homegrown middleware to help with the site’s scalability. A half-year after it was taken live, the company decided to abandon it. As a senior software engineer at Etsy put it, “if you’re doing something ‘clever,’ you’re probably doing it wrong.”

Read the full article at Arstechnica.com

I want to focus on the important lessons from this article: using middleware and stored procedures in this fashion for a public web application creates unscalable design complexity (smart and “proper” according to the old enterprise design teachings…) and causes infrastructure, development and maintenance hassles.

In the process they did replace PostgreSQL with MySQL, but that’s not the critical change that made all the difference. PostgreSQL is a fine database system too.