Posted on

Tom Eastman on File Uploads

The awesome Tom Eastman presented a session at PyCon Australia (Melbourne) 2016 entitled

“The dangerous, exquisite art of safely handing user-uploaded files”.

Every web application has an attack surface — the exposed points of interaction where a malicious or mischievous user can commit malice, or mischief (respectively). Possibly nowhere, however, is more vulnerable than places a user is allowed to upload arbitrary files.
The scope for abuse is eye-widening: The contents of the file, the type of the file, the size and encoding of the file, even the *name* of the file can be a potent vector for attacking your system.
The scariest part? Even the best and most secure web-frameworks can’t protect you from all of it.

In this talk, Tom shows you every scary thing he knows about that can be done with a file upload, and how to protect yourself from — hopefully — most of them.

Do watch it and pick up any hints you can.  This is important stuff.

How do your web applications handle file uploads?

Posted on

Optimising Web Servers

I was lucky enough to attend PyCon-AU recently and one talk in particular highlighted the process of web server optimisation.

The talk was Graham Dumpleton’s add-in talk, Web Server Bottlenecks And Performance Tuning, which is available on YouTube (along with the majority of PyCon-AU talks).

The first big note at the beginning is that the majority of the delay in a user’s perception of a website is caused by the browser rendering the page. Though not covered in the talk, for those who haven’t used them, YSlow (for Firefox and Chrome) and Google’s Developer Tools (Ctrl-Shift-I in Chrome) will give you pretty much identical recommendations on how to configure the generated application page and the server caching/compression settings to make it as easy as possible for a web browser to render the page. These recommendations will also minimise the second most dominant effect on how quickly web pages display: network latency and bandwidth. Once you have completed this, making pages faster on the web server begins to have a measurable effect for the end user.

The majority of the talk, however, is about web server configuration. Memory, CPU and I/O are the constraints that you may hit at the web server, depending on your application.

You can get an idea of how much memory is needed by multiplying an application’s memory use by the number of concurrently running processes. Remember always that spare memory is disk cache on Linux-based systems, and this is significant in reducing I/O read time for things like static content serving. Memory use can be reduced by front-end proxying, as described in the answer to the question at offset 19:40, which relates it to the earlier description of threads and processes. In short, the buffering provided by Nginx in a single process on the input side ensures that the application code isn’t running until a large amount of input is ready, and output is buffered in Nginx so that the application process can finish sooner while Nginx trickles the web page out to the client at whatever rate the network allows. This reduction in the running time of the application enables the server to support better concurrency and hence better memory usage. This is why we at Open Query like to use Nginx as the web server for the larger websites of our clients.
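As a rough illustration of the memory arithmetic, here is a minimal Python sketch. It is not from the talk; the function names and all the numbers (150 MB per process, a 4 GB server, 1 GB held back for disk cache) are made-up assumptions, so substitute figures measured on your own system.

```python
def estimate_memory_mb(per_process_mb, processes, reserve_for_cache_mb=0):
    """Per-process footprint times concurrent processes, plus a cache reserve."""
    return per_process_mb * processes + reserve_for_cache_mb


def max_processes(total_mb, per_process_mb, reserve_for_cache_mb=0):
    """How many worker processes fit once the cache reserve is set aside."""
    return max(0, (total_mb - reserve_for_cache_mb) // per_process_mb)


# Hypothetical figures: 8 workers of 150 MB each, 1 GB kept free for disk cache.
needed = estimate_memory_mb(per_process_mb=150, processes=8, reserve_for_cache_mb=1024)
# Hypothetical 4 GB server with the same worker size and cache reserve.
fits = max_processes(total_mb=4096, per_process_mb=150, reserve_for_cache_mb=1024)
print(needed, fits)
```

Holding back an explicit reserve is just one way of respecting the point about spare memory acting as disk cache; how much to reserve depends on how much static content you serve.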

The database is effectively an I/O constraint from the web server’s perspective, as it should be on a different server if you run something more than a simple blog or an application where database utilisation is very low. A database query that depends on input from the web request and takes a long time to run will add significantly to the time taken to render the page. Taking note of which queries are slow, for instance by enabling the slow query log, is the first step in identifying problems. Significant gains can usually be made by using indexes and by using the database rather than the application to do joins, iterations and sorting. Of course much more optimisation of server and queries is possible, and Open Query is happy to help.
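To make the point about letting the database do the joins and sorting concrete, here is a small sketch. It uses Python’s built-in sqlite3 module purely as a stand-in for MySQL or MariaDB, and the tables, column names and data are all hypothetical. Both functions return the same result, but the second one lets the database use the index and hand back only the rows that are needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    -- An index on the join column saves the database from scanning all posts.
    CREATE INDEX idx_posts_author ON posts (author_id);
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, 1, "Tuning"), (2, 2, "Replication"), (3, 1, "Locking")],
)


def titles_in_application(conn, name):
    # Anti-pattern: fetch both tables in full, then join and sort in the app.
    authors = conn.execute("SELECT id, name FROM authors").fetchall()
    posts = conn.execute("SELECT author_id, title FROM posts").fetchall()
    ids = {author_id for author_id, author_name in authors if author_name == name}
    return sorted(title for author_id, title in posts if author_id in ids)


def titles_in_database(conn, name):
    # Preferred: one query; the database joins, filters and sorts.
    sql = """
        SELECT p.title
        FROM posts p
        JOIN authors a ON a.id = p.author_id
        WHERE a.name = ?
        ORDER BY p.title
    """
    return [row[0] for row in conn.execute(sql, (name,))]


print(titles_in_application(conn, "alice"))
print(titles_in_database(conn, "alice"))
```

With three rows the difference is invisible; it starts to matter when the tables are large and the application would otherwise pull every row across the network for each page view.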

Thanks again to PyCon speakers, organisers, sponsors and delegates. I had a great time.

Posted on

Open Query training at Drupal DownUnder 2012

DrupalDownUnder 2012 will be held in Melbourne, Australia, 13-15 January. It’s a great event; I’ve been to several of its predecessors. People there don’t care an awful lot for databases, but they do realise that sometimes it’s important to either learn more about them or talk to someone specialised in that field. And when discussing general infrastructure, resilience is quite relevant. Clients want a site to remain up, but keep costs low.

I will teach pre-conference training sessions on the Friday at DDU:

The material is made specific to Drupal developers and users. The query design skills, for instance, will help you with module development and designing Drupal Views. The two half-days can also be booked as a MySQL Training Pack for $395.

On Saturday afternoon in the main conference, I have a session Scaling out your Drupal and Database Infrastructure, Affordably covering the topics of resilience, ease of maintenance, and scaling.

I’m honoured to have been selected to do these sessions, I know there were plenty of submissions from excellent speakers. As with all Drupal conferences, attendees also vote on which submissions they would like to see.

After DDU I’m travelling on to Ballarat for LinuxConfAU 2012, where I’m not speaking in the main program this year, but will have sessions in the “High Availability and Storage” and “Business of Open Source” miniconfs. I’ll do another post on the former – the latter is not related to Open Query.

Posted on

SQL Locking and Transactions – OSDC 2011 video

This recent session at OSDC 2011 Canberra is based on part of an Open Query training day, but (due to time constraints) goes without much of the usual interactivity, exercises and further MySQL-specific detail. People liked it anyway, which is nice! The info as presented is not MySQL-specific; it provides general insight into how databases implement concurrency and what trade-offs they make.

See for the talk abstract.

Posted on

Slides from DrupalDownUnder2011 on Tuning for Drupal

By popular request, here’s the PDF of the slides of this talk as presented in January 2011 in Brisbane; it’s fairly self-explanatory. Note that it’s not really extensive “tuning”, it just fixes up a few things that are usually “wrong” in default installs, creating a more sane baseline. If you want to get to optimal correctness and more performance, other things do need to be done as well.

Posted on

Report from Barcamp Johor Bahru

This weekend, I decided to attend BarcampJB pretty last minute. Lucky for me, barcamps are made for chaotics like me, so it was no problem at all. I found some friends that live here in Kuala Lumpur who I drove down to JB with (JB is around a 5 hour drive from KL, we did it in 3.5 🙂 ).

The camp was very interesting. Because JB is on the border with Singapore, there’s a good crossover between Malaysian and Singaporean techies.

I decided to go all out and give three talks on Saturday: First up was the MMM talk I’ve given at a few conferences before. All went well, and later on in the day some people approached me with more in-depth questions. It still seems that people have this idea in their head that they somehow need MySQL Cluster when there is more than one machine involved. When I explain to them that that is very rarely the case and that they can achieve what they want with MMM as well, they are often happy to hear it.

My next talk was more of a personal development one. People keep asking me here where I am from. When I explain to them that I’ve been location independent for the last 3 years, they are usually very eager to find out how I pull that off. I decided to summarise my experiences and put them in a talk. This talk was very well attended and I loved giving it. Most of the attendees were young techies, who are usually in a perfect position to do something very similar to what I’m doing.

The last talk was a lightning talk on Zabbix, the Open Source monitoring system we use at Open Query. Quick, and dirty, but effective.

Other interesting talks I attended were on breeze, an online banking application made for Standard Chartered bank that looks very slick and usable (If anyone from my bank is reading this: get with the program and fix our banking application to enter the 21st century please 😉 ).

Conary and Foresight Linux were interesting as well. Conary (the package management system in Foresight Linux) is not quite mature yet, but definitely a very interesting technology. I was interested to hear about it and hope to see it become more mainstream in the future.

Daniel Cerventus gave a good lightning talk on what not to do as a startup. The main message was to just do it, and not wait for grant money or VCs. Some solid tips as well, one of them being to run your potential name through Namechk, a handy username checker for many services.

There was obviously also a lot of networking and we went for a foot massage at the end of the day. Funny fact: I was the only one to stay awake through the massage (Even though I am narcoleptic), while two of my  friends (who I won’t name here 😉 ) snored all the way through it 🙂

All in all another successful tech event in Malaysia. Definitely one of the many reasons I love living here!

Posted on

Business insight from the MySQL Conference 2010

At this year’s conference, I was pleasantly surprised with the high level of interest in Open Query’s proactive services for MySQL and MariaDB, and specifically our focus on preventing problems, while explicitly not offering emergency services.

I’ll describe what this is about first, and why I reckon it’s interesting. When you think about it, most IT related support that includes emergency (24×7) operates similar to this:

You have a house that has the front and back doors wide open with no locks, and you take out an insurance policy for the house contents. After a short time you call the insurance company “guess what, the most terrible thing happened, my TV got stolen.” Insurance company responds “that’s dreadful, you poor soul, let us fix it all up for you with getting a new TV and installing it. It’ll be our pleasure to serve you.” A few weeks later you call the insurance company again “guess what …” and they help you in the same fabulous way.

You get the idea, it’s rather silly because it’s very predictable. If you leave your doors open, you’re very close to actually being the cause of the problem yourself and insurance companies tend to not cover you under such circumstances – yet most IT support arrangements do. If IT support were actually run like insurance, premiums would be based on a risk assessment, and consequently most companies would have to pay much higher premiums.

Much of this is actually about company processes as much as the technical setup. Depending on how you arrange things in your business, you can actually be very “emergency prone”. Since company processes are notoriously hard to change, many businesses operate in a way that is fundamentally not suitable for Open Query to do business with. That’s a fact and we’re fine with it, the market is big enough. We have clients all around the world, but so far very few from Silicon Valley. My presumption was that this was due to the way those businesses are often set up, making them simply incompatible with our services. But a significant number of companies we spoke with at and around the conference were very interested in our services exactly because of the way we work, and so that to me was interesting news. A good lesson, making attending the conference extra worthwhile. It’s also a good vote of confidence in the way we’ve set up our service offering.

Posted on

Tokutek’s Fractal Tree Indexes

Tokutek’s Bradley did a session on their Fractal Tree Index technology at the MySQL Conference (and an OpenSQL Camp before that – but I wasn’t at that one), and my first thought was: great, now we get to see what and where the magic is. On second thought, I realised you may not want to know.

I know I’m going to be a party pooper here, but I do feel it’s important for people to be aware of the consequences of looking at this stuff (there’s slide PDFs online as well as video), and software patents in general. I reckon Tokutek has done some cool things, but the patents are a serious problem.

Tokutek’s technology has patents pending, and is thus patent encumbered. What does this mean for you? It means that if you look at their “how they did it” info and you happen to code something that later ends up in a related patent lawsuit, you and the company you work for will be liable for triple damages. That’s basic US patent law: if you knowingly infringe you pay thrice. If you were at either session and are involved in database development work, you may wish to talk with your boss and legal counsel.

I made the assessment for myself (although I’m in Australia, there’s the Free Trade Agreement with patent-related provisions, so I am exposed) and decided that since Open Query’s activities are well within my control, it’s a manageable risk. So yep I’ve looked at the details. I’ll review some broad aspects below – I am not a lawyer but if the above worries you, to be sure, now is the time to stop reading and not see the rest of this post.

The insertion methodology is an interesting and nifty trick. It’s more CPU intensive but reduces disk I/O, and is thus faster for high volume inserts (the exact spot where B-trees and derivatives tend to be slower).

First of all, it’s important to appreciate why the B-tree family of indexing algorithms exists. They acknowledge that disk I/O a) is relatively expensive and b) operates in blocks (that is, writing/grabbing a larger chunk is more efficient when you’re reading from disk anyway). So B-trees store groups of keys together and thus try to minimise disk I/O, particularly on lookup. Balanced B-trees (B+tree algorithm etc) go wide rather than deep, so for billions of entries you could still have a max of 6-8 disk blocks to fetch. Inserts (and deletes) can be more costly, particularly with page splits (merges for deletes) and rebalancing operations. Blocks are also not full, which is technically wasteful on your storage – it’s a tradeoff.
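The “wide rather than deep” point is easy to check with some back-of-the-envelope arithmetic. The sketch below is a simplification of my own: it assumes every block holds the same number of keys and that each level of the tree costs one block fetch. The fan-out figures (32 and 100 keys per block) are made-up examples, not numbers from any particular storage engine.

```python
def btree_levels(entries, keys_per_block):
    """Smallest number of levels such that keys_per_block ** levels >= entries."""
    levels = 1
    capacity = keys_per_block
    while capacity < entries:
        capacity *= keys_per_block
        levels += 1
    return levels


one_billion = 1_000_000_000

# A binary tree (2 children per node) versus block-based trees.
print(btree_levels(one_billion, 2))
print(btree_levels(one_billion, 32))
print(btree_levels(one_billion, 100))
```

Under these assumptions a billion entries need 30 levels in a binary tree, but only 6 levels at 32 keys per block and 5 at 100 keys per block, which is why a lookup touches just a handful of disk blocks.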

If you have an index purely in memory, algorithms that don’t work with blocks are more efficient: MySQL (NDB)Cluster uses T-trees and MySQL’s MEMORY tables have red/black trees, which are a balanced (weighted) binary tree. If you’re interested in the structure and basic logic for each of the algorithms involved, Wikipedia tends to have good descriptions and diagrams, and there are many resources on the web including neatly animated demos of how inserts work, and so on.

So, Tokutek’s method is basically an enhancement on B-trees; it’s relevant as long as we deal with not just spinning disks but block devices that operate in large(r) read/write chunks. For spinning disks, seek time is an important factor. For SSD it is not, but SSD still works with relatively large blocks of data: you can’t just write 3 bytes. If you do, the SSD actually reads the rest of the block and rewrites it (with your new 3 bytes) elsewhere, marking the old block for re-use (since SSD requires an erase cycle before it can write again). These technologies will be with us for a while yet, so enhancements are useful.
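To put a number on the “you can’t just write 3 bytes” point, here is a tiny sketch of the arithmetic. The 4096-byte block size is an assumption for illustration only (real devices differ), and the model only counts whole blocks rewritten for a write that starts on a block boundary.

```python
def physical_bytes_written(payload_bytes, block_bytes):
    """Bytes the device rewrites for a block-aligned write of payload_bytes."""
    blocks_touched = -(-payload_bytes // block_bytes)  # ceiling division
    return blocks_touched * block_bytes


def write_amplification(payload_bytes, block_bytes):
    """Ratio of bytes physically rewritten to bytes the caller asked to write."""
    return physical_bytes_written(payload_bytes, block_bytes) / payload_bytes


# Hypothetical 4096-byte blocks: a 3-byte write still rewrites a whole block.
print(physical_bytes_written(3, 4096))
print(physical_bytes_written(4096, 4096))
```

Writing a full block costs exactly one block, so batching many small changes into block-sized writes is where the savings come from.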

Monetisation models (and patents) aside, I reckon it’d be best to see enhancements such as these added to existing storage engines and indexing implementations (think text indexers and many other applications – it’s by no means limited to plain RDBMS or databases in general). Then it would quickly benefit a large group of users.

Building a basic storage engine is not that hard for an experienced database coder, but it takes time to mature and there are many aspects and trade-offs to it. It’s taken years for InnoDB to mature and for people to understand how to optimally use it. Planting a new/separate storage engine on the market to monetise a new indexing scheme makes sense, to me, only in the monetisation context. It makes absolutely no sense when looking at the technical aspects or the needs of the users.

For companies using MySQL/MariaDB because the code is available and they’re not locked into a single vendor for bugfixing and enhancements (just look at what Percona has done with InnoDB!), buying/using proprietary extensions makes no sense. I do by no means wish to diminish the accomplishments of the innovative minds at Tokutek, and I appreciate their tough predicament in terms of finding a way to monetise on their innovation, but what we have now is problematic.

In a nutshell, my excitement on behalf of my clients is hindered by the proprietary and patent aspects. Which is a great pity! We need to seriously think about alternative ways for smart people to benefit from their innovation, without effectively hindering broad adoption. Using different monetisation means may mean less money is made – however, do also consider the cost (both in time and money) of the patenting and product development process (in this case for a complete storage engine). That’s all overhead and significantly burdens future profitability. You need to consider these things the moment you create something, before going down the road of patenting or setting up a business as those things are in fact defining decisions – they define how you approach the market and how much money you need to make to make any profit at all. There are methods to cheaply explore what might be a right way for you (and quickly eliminate wrong ways), but some “wrong ways” are permanent, you can’t backtrack and you definitely lose any time advantage (which is of course more relevant if you don’t patent).

Posted on

Open Query @ DrupalConSF

Peter and Arjen will be at DrupalCon SF 2010. Peter specifically for the event, Arjen staying around the SF area after the MySQL Conference last week.

Specifically, we’ll be talking with people about using the OQGRAPH engine to help with social graphs and other similar problems, easily inside Drupal. You may recall that Peter already created the friendlist_graph extension for the friendlist Drupal module.

From the MySQL Conf and other earlier feedback, OQGRAPH is proving to be a real enabler. And since it’s free/GPLv2 and integrated in MariaDB 5.2, there’s generally no hindrance in starting to use it.

Posted on

Open Query @ MySQL Conf & Expo 2010

Walter and I are giving a tutorial on Monday morning, MySQL (and MariaDB) Dual Master Setups with MMM. I believe there are still some seats available – tutorials are a bit extra when you register for the conference, so you do need to sign up if you want to be there! It’s a hands-on tutorial/workshop: we’ll be setting up multiple clusters with dual master and the whole rest of the MMM fun, using VMs on your laptops and a separate wired network. Nothing beats messing with something live, breaking it, and seeing what happens!

Then on Tuesday afternoon (5:15pm, Ballroom F), Antony and I will do a session on the OQGRAPH engine: hierarchies/graphs inside the database made easy. If you’ve been struggling with trees in SQL, would really like to effectively use social networking in your applications, need to work with RDF datasets, or have been exploring neo4j but otherwise have everything in MySQL or MariaDB, this session is for you.

We (and a few others from OQ) will be around for the entire conference, the community dinner (Monday evening) and other social events, and are happy to answer any questions you might have. You’ll be able to easily recognise us in the crowds by our distinct friendly Open Query olive green shirts (green stands out because most companies mainly use blue/grey and orange/red).

Naturally we would love to do business with you (proactive support services, OQGRAPH development), but we don’t push ourselves on to unsuitable scenarios. In fact, we’re known to refer and even actively introduce clients to competent other vendors where appropriate. In any case, it’s our pleasure and privilege to meet you!

See you all in Santa Clara in a few days.