MySQL Performance Blog

Percona's MySQL & InnoDB performance and scalability blog

Percona University @Montevideo, FISL & São Paulo MySQL Meetup

Tue, 22/04/2014 - 07:20

Peter Zaitsev at last year’s Percona University event in Montevideo

Later this week I’m excited to depart on a trip to South America. First I will stop in Montevideo, Uruguay, to meet with Percona’s team out there as well as to participate in our next Percona University event on April 29.

For those who do not know, Percona University events are free to attend and packed with technical presentations about MySQL and surrounding technologies, delivered by members of the Percona team and community speakers. Even though attendance is free, because space is limited you will need to register now to ensure that we have space for everyone.

Next stop will be Brazil, where I’ll be speaking at the local São Paulo MySQL Meetup group on May 6, and then on to Porto Alegre, where I’ll speak at FISL (the International Free Software Forum) on May 9 at 14:00 about the “Practical Optimization of MySQL.”

If you’re also attending FISL and would like to connect at the show, drop me a note at ceo AT percona.com.

The post Percona University @Montevideo, FISL & São Paulo MySQL Meetup appeared first on MySQL Performance Blog.

Using Apache Hadoop and Impala together with MySQL for data analysis

Mon, 21/04/2014 - 23:43

Apache Hadoop is commonly used for data analysis. It is fast for data loads and it scales well. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format) and run reporting on top of that. For the examples below I will use the “ontime flight performance” data from my previous post (Increasing MySQL performance with parallel query execution). I’ve used Cloudera Manager v.4 to install Apache Hadoop and Impala. For this test I intentionally used old hardware (servers from 2006) to show that Hadoop can utilize old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:

  • Namenode, Hive metastore, etc. + Datanodes: 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16G RAM, RAID 10 with 8 SAS drives
  • Datanodes only: 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8G RAM, single 4TB drive

As you can see those are pretty old servers; the only thing I’ve changed is adding a 4TB drive to be able to store more data. Hadoop provides redundancy at the server level (it writes 3 copies of the same block across datanodes), so we do not need RAID on the datanodes (we do still need redundancy for the namenode, though).

Data export

There are a couple of ways to export data from MySQL to Hadoop. For the purpose of this test I have simply exported the ontime table into a text file with:

select * into outfile '/tmp/ontime.psv'  FIELDS TERMINATED BY ',' from ontime;

(you can use “|” or any other symbol as a delimiter). Alternatively, you can download the data directly from the www.transtats.bts.gov site using this simple script:

for y in {1988..2013}
do
  for i in {1..12}
  do
    u="http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${y}_${i}.zip"
    wget $u -o ontime.log
    unzip On_Time_On_Time_Performance_${y}_${i}.zip
  done
done
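Another common way to export (not used for this test) is Apache Sqoop, which writes the table straight into HDFS. A minimal sketch, assuming Sqoop is installed on the cluster and the MySQL JDBC driver is available (the host, credentials and mapper count here are placeholders):

sqoop import --connect jdbc:mysql://mysql-host/ontime --username sqoop --password secret \
  --table ontime --target-dir /data/ontime/ -m 1   # -m 1: single mapper, since the table has no obvious split column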

Load into Hadoop HDFS

The first thing we need to do is load the data into HDFS as a set of files. Hive and Impala work with a directory: they will read and concatenate all files inside the directory to which you have imported your data. In our case it is easiest to simply copy all of our files into a directory inside HDFS:

$ hdfs dfs -mkdir /data/ontime/
$ hdfs -v dfs -copyFromLocal On_Time_On_Time_Performance_*.csv /data/ontime/
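A quick sanity check that the files landed where we expect (the exact output format varies by Hadoop version):

$ hdfs dfs -ls /data/ontime/      # list the loaded files
$ hdfs dfs -du -s /data/ontime/   # total size of the directory, in bytes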

Create external table in Impala

Now, when we have all data files loaded we can create an external table:

CREATE EXTERNAL TABLE ontime_csv (
  YearD int,
  Quarter tinyint,
  MonthD tinyint,
  DayofMonth tinyint,
  DayOfWeek tinyint,
  FlightDate string,
  UniqueCarrier string,
  AirlineID int,
  Carrier string,
  TailNum string,
  FlightNum string,
  OriginAirportID int,
  OriginAirportSeqID int,
  OriginCityMarketID int,
  Origin string,
  OriginCityName string,
  OriginState string,
  OriginStateFips string,
  OriginStateName string,
  OriginWac int,
  DestAirportID int,
  DestAirportSeqID int,
  DestCityMarketID int,
  Dest string,
  ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/ontime';

Note the “EXTERNAL” keyword and the LOCATION clause (LOCATION points to a directory inside HDFS, not a file). Impala will only create the metadata (it will not move or modify the underlying data). We can query this table right away; however, Impala will need to scan all files (a full scan) for every query.

Example:

[d30.local:21000] > select yeard, count(*) from ontime_csv group by yeard;
Query: select yeard, count(*) from ontime_csv group by yeard
+-------+----------+
| yeard | count(*) |
+-------+----------+
| 2010  | 6450117  |
| 2013  | 5349447  |
| 2009  | 6450285  |
| 2002  | 5271359  |
| 2004  | 7129270  |
| 1997  | 5411843  |
| 2012  | 6096762  |
| 2005  | 7140596  |
| 1999  | 5527884  |
| 2007  | 7455458  |
| 1994  | 5180048  |
| 2008  | 7009726  |
| 1988  | 5202096  |
| 2003  | 6488540  |
| 1996  | 5351983  |
| 1989  | 5041200  |
| 2011  | 6085281  |
| 1998  | 5384721  |
| 1991  | 5076925  |
| 2006  | 7141922  |
| 1993  | 5070501  |
| 2001  | 5967780  |
| 1995  | 5327435  |
| 1990  | 5270893  |
| 1992  | 5092157  |
| 2000  | 5683047  |
+-------+----------+
Returned 26 row(s) in 131.38s

(Note that “group by” will not sort the rows, unlike in MySQL. To sort, we will need to add “ORDER BY yeard”.)
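For example, the sorted variant of the query above is simply:

[d30.local:21000] > select yeard, count(*) from ontime_csv group by yeard order by yeard;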

Explain plan:

Query: explain select yeard, count(*) from ontime_csv group by yeard

PLAN FRAGMENT 0
  PARTITION: UNPARTITIONED

  4:EXCHANGE

PLAN FRAGMENT 1
  PARTITION: HASH_PARTITIONED: yeard

  STREAM DATA SINK
    EXCHANGE ID: 4
    UNPARTITIONED

  3:AGGREGATE (merge finalize)
  |  output: SUM(COUNT(*))
  |  group by: yeard
  |
  2:EXCHANGE

PLAN FRAGMENT 2
  PARTITION: RANDOM

  STREAM DATA SINK
    EXCHANGE ID: 2
    HASH_PARTITIONED: yeard

  1:AGGREGATE
  |  output: COUNT(*)
  |  group by: yeard
  |
  0:SCAN HDFS
     table=ontime.ontime_csv #partitions=1/1 size=45.68GB

Returned 31 row(s) in 0.13s

As we can see, it will scan 45GB of data.

Impala with columnar format and compression

A great benefit of Impala is that it supports columnar formats and compression. I’ve tried the new “parquet” format with the “snappy” compression codec. As our table is very wide (and de-normalized), a columnar format will help a lot. To take advantage of the “parquet” format we will need to load data into it, which is easy to do when we already have a table inside Impala and the files inside HDFS:

[d30.local:21000] > set PARQUET_COMPRESSION_CODEC=snappy;
[d30.local:21000] > create table ontime_parquet_snappy LIKE ontime_csv STORED AS PARQUET;
[d30.local:21000] > insert into ontime_parquet_snappy select * from ontime_csv;
Query: insert into ontime_parquet_snappy select * from ontime_csv
Inserted 152657276 rows in 729.76s

Then we can test our query against the new table:

Query: explain select yeard, count(*) from ontime_parquet_snappy group by yeard

PLAN FRAGMENT 0
  PARTITION: UNPARTITIONED

  4:EXCHANGE

PLAN FRAGMENT 1
  PARTITION: HASH_PARTITIONED: yeard

  STREAM DATA SINK
    EXCHANGE ID: 4
    UNPARTITIONED

  3:AGGREGATE (merge finalize)
  |  output: SUM(COUNT(*))
  |  group by: yeard
  |
  2:EXCHANGE

PLAN FRAGMENT 2
  PARTITION: RANDOM

  STREAM DATA SINK
    EXCHANGE ID: 2
    HASH_PARTITIONED: yeard

  1:AGGREGATE
  |  output: COUNT(*)
  |  group by: yeard
  |
  0:SCAN HDFS
     table=ontime.ontime_parquet_snappy #partitions=1/1 size=3.95GB

Returned 31 row(s) in 0.02s

As we can see, it will scan a much smaller amount of data: 3.95GB (with compression) compared to 45GB.

Results:

Query: select yeard, count(*) from ontime_parquet_snappy group by yeard
+-------+----------+
| yeard | count(*) |
+-------+----------+
| 2010  | 6450117  |
| 2013  | 5349447  |
| 2009  | 6450285  |
...
Returned 26 row(s) in 4.17s

And the response time is much better as well.

Impala complex query example

I’ve used the complex query from my previous post. I had to adapt it for use with Impala: it does not support the “sum(ArrDelayMinutes>30)” notation, but “sum(if(ArrDelayMinutes>30, 1, 0))” works fine.

select min(yeard), max(yeard), Carrier, count(*) as cnt,
       sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
       round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime_parquet_snappy
WHERE DayOfWeek not in (6,7)
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and DestState not in ('AK', 'HI', 'PR', 'VI')
  and flightdate < '2010-01-01'
GROUP by carrier
HAVING cnt > 100000 and max(yeard) > 1990
ORDER by rate DESC
LIMIT 1000;

The query is intentionally designed so that it does not take advantage of indexes: most of the conditions filter out less than 30% of the data.

Impala results:

+------------+------------+---------+----------+-----------------+------+
| min(yeard) | max(yeard) | carrier | cnt      | flights_delayed | rate |
+------------+------------+---------+----------+-----------------+------+
| 2003       | 2009       | EV      | 1454777  | 237698          | 0.16 |
| 2003       | 2009       | FL      | 1082489  | 158748          | 0.15 |
| 2006       | 2009       | XE      | 1016010  | 152431          | 0.15 |
| 2003       | 2009       | B6      | 683874   | 103677          | 0.15 |
| 2006       | 2009       | YV      | 740608   | 110389          | 0.15 |
| 2003       | 2005       | DH      | 501056   | 69833           | 0.14 |
| 2001       | 2009       | MQ      | 3238137  | 448037          | 0.14 |
| 2004       | 2009       | OH      | 1195868  | 160071          | 0.13 |
| 2003       | 2006       | RU      | 1007248  | 126733          | 0.13 |
| 2003       | 2006       | TZ      | 136735   | 16496           | 0.12 |
| 1988       | 2009       | UA      | 9593284  | 1197053         | 0.12 |
| 1988       | 2009       | AA      | 10600509 | 1185343         | 0.11 |
| 1988       | 2001       | TW      | 2659963  | 280741          | 0.11 |
| 1988       | 2009       | CO      | 6029149  | 673863          | 0.11 |
| 2007       | 2009       | 9E      | 577244   | 59440           | 0.10 |
| 1988       | 2009       | US      | 10276941 | 991016          | 0.10 |
| 2003       | 2009       | OO      | 2654259  | 257069          | 0.10 |
| 1988       | 2009       | NW      | 7601727  | 725460          | 0.10 |
| 1988       | 2009       | DL      | 11869471 | 1156267         | 0.10 |
| 1988       | 2009       | AS      | 1506003  | 146920          | 0.10 |
| 1988       | 2005       | HP      | 2607603  | 235675          | 0.09 |
| 2005       | 2009       | F9      | 307569   | 28679           | 0.09 |
| 1988       | 1991       | PA      | 206841   | 19465           | 0.09 |
| 1988       | 2009       | WN      | 12722174 | 1107840         | 0.09 |
+------------+------------+---------+----------+-----------------+------+
Returned 24 row(s) in 15.28s

15.28 seconds is significantly faster than the original MySQL results (15 min 56.40 sec without parallel execution and 5 min 47 sec with parallel execution). However, this is not an “apples-to-apples” comparison:

  • MySQL will scan 45GB of data, while Impala with Parquet will only scan 3.95GB
  • MySQL will run on a single server, while Hadoop + Impala will run in parallel on 6 servers.

Nevertheless, Hadoop + Impala shows impressive performance and the ability to scale out of the box, which can help a lot with large-volume data analysis.

Conclusion

Hadoop + Impala gives us an easy way to analyze large datasets using SQL, with the ability to scale even on old hardware.

In my next posts I plan to explore this further.

As always, please share your thoughts in the comments.

The post Using Apache Hadoop and Impala together with MySQL for data analysis appeared first on MySQL Performance Blog.

Percona Software in Ubuntu 14.04 LTS (Trusty Tahr) release

Fri, 18/04/2014 - 23:22

I’d like to congratulate Canonical on the new Ubuntu 14.04 LTS (Trusty Tahr) release – it really looks like a great release, and I say that having my own agenda: it looks even better because it comes with a full line of Percona software.
If you install Ubuntu 14.04 and run an aptitude search, you will find the Percona packages in the repositories.
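A quick check from a terminal (assuming the universe repository is enabled; apt-cache search works the same way):

$ aptitude search percona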

Percona Toolkit and Percona XtraBackup are at the latest versions, but Percona Server and Percona XtraDB Cluster come in their 5.5 versions, which is in line with the default MySQL version, which again is 5.5.

I expect this release will make it much easier for users to get familiar with our software, so you can go and try this today!

The post Percona Software in Ubuntu 14.04 LTS (Trusty Tahr) release appeared first on MySQL Performance Blog.

How to find bugs in MySQL

Thu, 17/04/2014 - 15:00

Finding bugs in MySQL is not only fun, it’s also something I have been doing the last four years of my life.

Whether you want to become the next Shane Bester (who is generally considered the most skilled MySQL bug hunter worldwide), or just want to prove you can outsmart some of the world’s best programmers, finding bugs in MySQL is a skill no longer reserved for top QA engineers armed with loads of scripts, expensive flash storage and top-range server hardware. Of course, for professionals that’s still the way to go, but now anyone with an average laptop and a standard HDD can have a lot of fun trying to find that elusive crash…

If you follow this post carefully, you may well be able to find a nice crashing bug (or two) running RQG (an excellent database QA tool). Linux would be the preferred testing OS, but if you are using Windows as your main OS, I would recommend getting VirtualBox and running a Linux guest in a suitably sized (i.e. large) VM. The acronym “RQG” stands for “Random Query Generator,” also named “randgen.”

If you’re not just after finding any bug out there (“bug hunting”), you can tune the RQG grammars (files that define what sort of SQL RQG executes) to more or less match your “issue area.” For example, if you are always running into a situation where the server crashes on a DELETE query (as seen at the end of the mysqld error log, for example), you would want an SQL grammar that definitely has a variety of DELETE queries in it. These queries should closely match the actual crashing query – crashes usually happen due to exactly the same, or similar, statements with the same clauses, conditions etc.

But, taking a step back, to get started with RQG you can either use the setup_server.sh script in percona-qa (more on how to obtain percona-qa from Launchpad using bazaar below), or you can use yum to manually install the required set of packages on your Linux machine:

$ sudo yum install kernel-devel wget patch make cmake automake autoconf libtool bzr gtest zlib-static \
gcc gcc-c++ ncurses-devel libaio libaio-devel bison valgrind perl-DBD-mysql cpan zlib-devel \
bzip2 valgrind-devel svn pam-devel openssl openssl-dev screen strace sysbench

(Note that I have included a few extra items needed for Percona Server and a few handy tools like screen and sysbench. Note also that you can change yum to apt-get, though this may require a few package name changes – search the web for info if you want to use apt-get.)

After these modules are installed, you can:

1. Pull the tree from Launchpad:
$ bzr branch lp:randgen

2. Make sure to have the many (Perl modules etc.) dependencies installed. Follow:
https://github.com/RQG/RQG-Documentation/wiki/RandomQueryGeneratorQuickStart#wiki-Prerequisites

3. Do a quick test to see if RQG is working fine:
https://github.com/RQG/RQG-Documentation/wiki/RandomQueryGeneratorQuickStart#wiki-Running_your_first_test

If all that worked, then congratulations, you now have simple RQG runs working. Let’s now look into how RQG is structured.

The way you can think about RQG “execution” as a hierarchical tree is like this: combinations.pl starts runall.pl, which starts gentest.pl, which may start gendata.pl. You don’t have to use combinations.pl (or even runall.pl, as you will have learned by following the ‘running your first test’ example above; i.e. mysqld can be started manually and gentest.pl can then be used for testing the already started server), but runall.pl and especially combinations.pl add a lot more power to our testing, as we will see soon.

In terms of the various Perl scripts (.pl) listed in the tree, it is:

- combinations.pl, which generates the many different ways (read ‘trials’) in which RQG is started (using runall.pl) with various options to mysqld etc.
- runall.pl, which starts/stops mysqld, does various high-level checking etc. (in other words, ‘a basic RQG run’)
- gentest.pl, which is the executor component (in other words, ‘the actual basic RQG test’)
- gendata.pl, which sets up the data (tables + data)

If you know the performance testing tool SysBench, you may compare gentest.pl with an actual SysBench run and gendata.pl with a “sysbench prepare.”
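To make this concrete, a bare-bones runall.pl invocation looks something like the sketch below; the paths and option values are placeholders rather than settings taken from the grammars discussed here:

perl runall.pl \
  --basedir=/path/to/your/Percona-Server-installation \
  --grammar=conf/percona_qa/percona_qa.yy \
  --gendata=conf/percona_qa/percona_qa.zz \
  --threads=8 --queries=100000 --duration=300 \
  --vardir=/tmp/rqg_vardir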

To get into real bug hunting territory, we will use combinations.pl to do an extensive random testing run against a server. You never know what you may find. A small warning: before you log your newly discovered bug, make sure that it is not already logged on bugs.mysql.com (for MySQL Server bugs) or on bugs.launchpad.net/percona-server (for Percona Server bugs).

In this example, we will be testing Percona Server, as the combinations.pl (.cc) grammar we use is optimized for Percona Server. If you would like to test the MySQL server, you can build your own grammars, or use one of the many grammars available in RQG (though there are not many combinations.pl grammars in RQG yet; there are plenty of less powerful runall.pl grammars, however). For a MySQL-compatible combinations.pl grammar which tests the optimizer, see randgen/conf/optimizer/starfish.cc – a grammar which I developed whilst working for Oracle.

Another very extensive grammar set, usable with Percona Server 5.6 (we call this our ‘base grammar’ as it tests many features developed for Percona Server), can be found in randgen/conf/percona_qa/5.6/*. To get started, edit and then use 5.6.sh, the startup script (in this one, set WORKDIR and RQG_DIR), and 5.6.cc, the combinations file (in this one, change the path names for the optimized/debug and Valgrind-compiled servers to match your system). More on this below.

An earlier and more limited version of this base grammar can be found in your randgen tree; go to randgen/conf/percona_qa/ and review the files there. This more limited base grammar can be used for testing any version of Percona Server, or you can follow along and use the 5.6 grammar mentioned above and test Percona Server 5.6 – the same basic steps (detailed below) apply.

In the percona_qa directory, the percona_qa.sh script is the start script (like 5.6.sh), percona_qa.yy file contains the SQL (like 5.6.yy etc.), the .zz file contains the data definitions, and finally the .cc file is a combinations.pl setup which “combines” various options from various blocks. Combinations.pl has great bug-hunting power.

Side note: you can read more about how the option blocks work at:
https://github.com/RQG/RQG-Documentation/wiki/RandomQueryGeneratorCombinations

All you need to do to get an exhaustive test run started is edit some options (assuming you have Percona Server installed on your test machine already) and start the percona_qa.sh script:

1. Edit the “percona_qa.sh” script and set the WORKDIR and RQG_DIR variables. (Regarding RQG_DIR, the script will normally assume that randgen is stored under WORKDIR, but you can change RQG_DIR to point to your randgen download path instead, for example RQG_DIR=/randgen.)

2. Edit the “percona_qa.cc” file and point it to the location of your server in the --basedir= setting (i.e. replace “/Percona-Server” with “/path_to_your_Percona_Server_installation”).

For the moment, you can just use a standard Percona Server installation and remove the Valgrind line directly under the one we just edited (use “dd” in vim), but once you get a bit more professional, compiling from source (“building”) is the way to go.

The reason for building yourself is that if you use a debug-compiled server (i.e. execute ./build/build-binary.sh --debug in a Percona Server source download) or a Valgrind-instrumented compiled server (i.e. execute ./build/build-binary.sh --debug --valgrind in a Percona Server source download) you will find more bugs (the debug server contains more developer debug asserts etc.).

Note that you can use “build_percona.sh” in the percona-qa Launchpad project (more on this below) to quickly build an optimized, debug and Valgrind server from source. build_mysql.sh does the same for MySQL Server.

3. Now you’re ready to go; execute ./percona_qa.sh and watch the screen carefully. You’ll likely immediately see some STATUS_ENVIRONMENT_FAILURE runs. This is quite common and means you have made a little error somewhere along the way. Stop the run (ctrl+z, then kill -9 all relevant pids, then execute “fg”). Now edit the files as needed (check all the logs, starting with the failed trials’ ‘trial<no>.log’, etc.). Then start the run again. If your machine is used for testing only (i.e. no production programs running), you can use the following quick command to kill all relevant running mysqld, perl and Valgrind processes:

ps -ef | grep `whoami` | egrep "mysql|perl|valgrind" | grep -v "grep" | awk '{print $2}' | xargs sudo kill -9;

4. Once your run is going, leave it going for a few hours, or a few days (we regularly test with runs that go for 2-5 days or more), and then start checking logs (trial<nr>.log is the one you want to study first; use “:$” in vim to jump to the end of the file, or “:1” to jump back to the first line). A quick grep like the one below can help triage many trials at once.
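For example, a rough first pass over many trials (the pattern is only a suggestion; tune it to the kinds of failures you care about) could be:

$ egrep -li "crash|assert|valgrind" trial*.log   # list the trial logs worth a closer look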

5. Once you get a bit more professional, use the percona-qa scripts (bzr branch lp:percona-qa) to quickly handle trials of interest. You may want to initially check out rqg_results.sh, analyze_crash.sh, startup.sh, build_percona.sh, delete_single_trial.sh and keep_single_trial.sh – simply execute them without parameters to get an initial idea of how to use them. These scripts greatly reduce the effort required when analyzing multiple trials.

6. Finally, for those of you needing to reduce long SQL testcases (from any source) quickly, see reducer.sh in randgen/util/reducer/reducer.sh – it’s a multi-threaded high-performance simplification script I developed whilst working at Oracle. They kindly open sourced it some time ago. You may then also want to check out parse_general_log.pl in the percona-qa scripts listed in point 5 above. This script parses a general log created by mysqld (--general_log option) into a ready-to-use SQL trace.

If you happen to find a bug, share the joy! If you happen to run into issues, post your questions below so others who run into the same can find answers quickly. Also feel free to share any tips you find while playing around with this.

Enjoy!

The post How to find bugs in MySQL appeared first on MySQL Performance Blog.

‘Open Source Appreciation Day’ draws OpenStack, MySQL and CentOS faithful

Wed, 16/04/2014 - 20:00

210 people registered for the inaugural “Open Source Appreciation Day” March 31 in Santa Clara, Calif. The event will be held each year at Percona Live henceforth.

To kick off the Percona Live MySQL Conference & Expo 2014, Percona held the first “Open Source Appreciation Day” on Monday, March 31st. Over 210 people registered and the day’s two free events focused on CentOS and OpenStack.

The OpenStack Today event brought together members of the OpenStack community and MySQL experts in an afternoon of talks and sharing of best practices for both technologies. After a brief welcome message from Peter Zaitsev, co-founder and CEO of Percona, Florian Haas shared an introduction to OpenStack including its history and the basics of how it works.

Jay Pipes delivered lessons from the field based on his years of OpenStack experience at AT&T, at Mirantis, and as a frequent code contributor to the project. Jay Janssen, a Percona managing consultant, complemented Jay Pipes’ talk with a MySQL expert’s perspective of OpenStack. He also shared ways to achieve High Availability using the latest version of Galera (Galera 3) and other new features found in the open source Percona XtraDB Cluster 5.6.

Amrith Kumar’s presentation focused on the latest happenings in project Trove, OpenStack’s evolving DBaaS component, and Tesora’s growing involvement. Amrith also won quote of the day for his response to a question about the difference between “elastic” and “scalable.” Amrith: “The waistband on my trousers is elastic. It is not scalable.” Sandro Mazziotta wrapped up the event by sharing the challenges and opportunities of OpenStack from both an integrator as well as operator point of view based on the customer experiences of eNovance.

OpenStack Today was made possible with the support of our sponsors, Tesora and hastexo. Here are links to presentations from the OpenStack Today event. Any missing presentations will soon be added to the OpenStack Today event page.

Speakers in the CentOS Dojo Santa Clara event shared information about the current status of CentOS, the exciting road ahead, and best practices in key areas such as system administration, running MySQL, and administration tools. Here’s a rundown of topics and presentations from the event. Any missing presentations will soon be added to the CentOS Dojo Santa Clara event page.

  • Welcome and Housekeeping
    Karsten Wade, CentOS Engineering Manager, Red Hat
  • The New CentOS Project
    Karsten Wade, CentOS Engineering Manager, Red Hat
  • Systems Automation and Metrics at Pinterest
    Jeremy Carroll, Operations Engineer, Pinterest
  • Software Collections on CentOS
    Joe Brockmeier, Open Source & Standards, Red Hat
  • Two Years Living Your Future
    Joe Miller, Lead Systems Engineer, Pantheon
  • Running MySQL on CentOS Linux
    Peter Zaitsev, CEO and Co-Founder, Percona
  • Notes on MariaDB 10
    Michael Widenius, Founder and CTO, MariaDB Foundation
  • Happy Tools
    Jordan Sissel, Systems Engineer, DreamHost

Thank you to all of the presenters at the Open Source Appreciation Day events and to all of the attendees for joining.

I hope to see you all again this November 3-4  at Percona Live London. The Percona Live MySQL Conference and Expo 2015 will also return to the Hyatt Santa Clara and Santa Clara Convention Center from April 13-16, 2015 – watch for more details in the coming months!

The post ‘Open Source Appreciation Day’ draws OpenStack, MySQL and CentOS faithful appeared first on MySQL Performance Blog.

percona-millipede – Sub-second replication monitor

Tue, 15/04/2014 - 23:00

I recently helped a client implement a custom replication delay monitor and wanted to share the experience and discuss some of the iterations and decisions that were made. percona-millipede was developed in conjunction with Vimeo with the following high-level goal in mind: implement a millisecond level replication delay monitor and graph the results.  Please visit http://making.vimeo.com for more information and thanks to Vimeo for sharing this tool!

Here is the rough list of iterations we worked through in developing this tool/process:

  1. Standard pt-heartbeat update/monitor
  2. Asynchronous, threaded update/monitor tool
  3. Synchronized (via zeroMQ), threaded version of the tool
pt-heartbeat

Initially, we had been running pt-heartbeat (with the default interval of 1.0) to monitor real replication delay. This was fine for general alerting, but didn’t allow us to gain deeper insight into the running slaves. Even when pt-heartbeat says it is “0 seconds behind”, that can actually mean the slave is up to .99 seconds behind (which in SQL time can be a lot). Given the sensitivity to replication delay and the high-end infrastructure in place (Percona Server 5.6, 384G RAM, Virident and FusionIO PCI-Flash cards), we decided it was important to absorb the extra traffic in an effort to gain further insight into the precise points of any slave lag.

There had been discussion about tweaking the use of pt-heartbeat (--interval=0.01 with reduced skew) to look at the sub-second delay, but there were some other considerations:

  • The tool needed to update/monitor multiple hosts from a single process (i.e. we didn’t want multiple pt-heartbeat processes to track)
  • Native integration with existing statsd/graphite system

We likely could’ve achieved the above using pt-heartbeat, but it would’ve required a custom wrapper application to handle the other pieces (threads/subprocesses/inter-process communication).  As the main gist of pt-heartbeat is fairly straightforward (select ts from heartbeat table, calculate delay), it was decided to mimic that logic in a custom application that was configurable and met the other goals.
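For reference, the sub-second pt-heartbeat variant discussed above would look roughly like this; the host names, credentials and schema are placeholders, and such a low interval does add noticeable query load:

# on the master: write a heartbeat row every 10ms
pt-heartbeat --update --interval=0.01 --create-table --database=percona --table=heartbeat h=master1,u=monitor,p=secret
# on each slave: report the measured delay every 10ms
pt-heartbeat --monitor --interval=0.01 --database=percona --table=heartbeat h=slave1,u=monitor,p=secret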

First Iteration – Async update/monitor

The first iteration was to spin up several threads within a main process (I chose Python for this, but it could be anything really) and set each in a loop with a set delay (.01 seconds, for example). One thread sends update statements with a timestamp, and the other threads simply select that timestamp and calculate how long ago the row was updated (i.e. current time – row timestamp).

This iteration was better (we could see replication delay of 1-999 ms and isolate some sub-optimal processes), but there was still some concern. When testing the solution, we noticed that when pointing the monitor at a single box (for both the update and the select), we were still seeing replication delay. After some discussion, it became apparent that using the current CPU time as the baseline was introducing the time needed to calculate the delay into the delay itself. Further, since these threads weren’t synchronized, there was no way to determine how long after the update statement the select was even run.

Final Iteration – ZeroMQ update/monitor

Based on this analysis, we opted to tweak the process and use a broadcast model to keep the monitor threads in sync.  For this, I chose to use ZeroMQ for the following reasons:

  • Built in PUB/SUB model with filtering – allows for grouping masters with slaves
  • Flexibility in terms of synchronization (across threads, processes, or servers – just a config tweak to the sockets)

After the update, here was the final architecture:

In this model, the update thread publishes the timestamp that was set on the master, and each monitor thread simply waits as a consumer and then checks the timestamp on the slave against the published timestamp. Synchronization is built in using this model and we saw much more accurate results. As opposed to sitting at 5-10ms all the time with spikes up to 50ms, we found 0ms in most cases (keep in mind that means 0ms from an application standpoint, including network latency, time to process the query, etc.) with spikes up to 40ms. A sample graph (from Graph Explorer on top of statsd, courtesy of Vimeo) showed some micro spikes in delay that pointed us to a process that was now noticeable and able to be optimized.

While this process was what the client needed, there are some things that I’d like to point out about this approach that may not apply (or even hinder) other production workloads:

  • The frequency of update/select statements added several hundred queries per second
    • You could configure less frequent update/selects, but then you may see less accuracy
    • The longer the delay between updates, the less chance you will see delay
  • For replication delay monitoring (i.e. Nagios), 1 second granularity is plenty
    • Typically, you would only alert after several seconds of delay were noticed

Naturally, there are some other factors that can impact the delay/accuracy of this system (pub/sub time, time to issue select, etc), but for the purpose of isolating some sub-optimal processes at the millisecond level, this approach was extremely helpful.

Stay tuned for a followup post where I’ll share the tool and go over its installation, configuration, and other details!

The post percona-millipede – Sub-second replication monitor appeared first on MySQL Performance Blog.

Advisory on Heartbleed (CVE-2014-0160) for Percona’s customers and users

Tue, 15/04/2014 - 01:03

Over the last few days, the Percona team has spent a lot of time evaluating the impact of the Heartbleed bug (CVE-2014-0160) for our customers and for the users of our software. We published a formal disclosure a few days ago. However, I thought a quick summary and some additional information would be good to provide for our MySQL Performance Blog readers.

First, I want to point out that “Heartbleed” is an issue in a commonly used third-party library which typically comes with your operating system, so there is a lot of software which is impacted. An openly exposed service, which is typically a website or some form of API, can potentially cause the biggest impact for anyone. Even though we talk a lot about MySQL Server (and its variants), it will not be the primary concern for organizations following best practices and not exposing their MySQL server to the open Internet.

Second, if you take care of patching your operating system, this will take care of Percona Server, MariaDB or MySQL Server (see note below) as well as other software which uses the OpenSSL library, as long as it is linked dynamically. It is highly recommended to link OpenSSL dynamically exactly so that such security issues can be taken care of with a single library update, rather than waiting for separate security updates for multiple software packages. Note that updating the library is not enough – you need to restart the service in order for the new library to be loaded. In most cases, I recommend a full system restart as the simplest way to guarantee that all processes using the library have been restarted.

Third, it is worth noting that not all MySQL variants have been impacted and not in all cases. Obviously, your MySQL Server is not impacted if you’re running an operating system which is not vulnerable. You will also not be vulnerable if the MySQL Server or variant you’re using uses yaSSL instead of OpenSSL. In addition, in many cases SSL support is disabled on the server side by default, which might not be the best thing from a security standpoint but can save us from this bug. Finally, in many configurations the SSL/TLS connection setup will take place after initial handshake which does not allow this vulnerability in all cases. I do not have hard numbers but I would guess no more than 10-20% of MySQL (and variants) installations would be vulnerable, even before you look at when they are exposed to the Internet.

To find out whether MySQL is dynamically linked with OpenSSL or yaSSL you can use this command:

[root@localhost vagrant]# ldd /usr/sbin/mysqld | grep ssl
        libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007fb7f4cbc000)

It will show “libssl” for a server linked dynamically with OpenSSL, and no matches for a server compiled with yaSSL.

It is worth noting, as Lenz Grimmer pointed out in a blog post comment, that Heartbleed impacts not only vulnerable servers: vulnerable clients can be at risk as well if they connect to a compromised server which implements code specifically targeting clients. This means you want to make sure to update your client machines as well, especially if you’re connecting to a non-trusted MySQL Server.

But enough on Percona Software. There is an impact to Percona web systems as well. The majority of our systems have not been impacted directly because they were running an OpenSSL version which did not have the Heartbleed vulnerability. However, because of how our systems are integrated, there was a small probability that some customer accounts could be exposed through impacted services: https://rdba.percona.com and https://cloud.percona.com. We promptly patched these services last week, regenerated keys, and reset passwords for all accounts which had even a small chance of being impacted.

We believe our teams have acted quickly and appropriately to secure our systems and minimize the chance of information leaks. We will continue to monitor the situation closely and update you via our advisory document if there is any new information needing your attention.

The post Advisory on Heartbleed (CVE-2014-0160) for Percona’s customers and users appeared first on MySQL Performance Blog.

Percona Live MySQL Conference Highlights

Thu, 10/04/2014 - 21:59

The Percona Live MySQL Conference and Expo 2014 was March 31st through April 4th in Santa Clara, California. I heard numerous positive comments from attendees and saw even more on social media. Our conference team, led by Kortney Runyan, pulled together a smooth, enjoyable event which made it easy for attendees to focus on learning and networking. Some of the Percona Live MySQL Conference highlights from this year follow.

Percona Live MySQL Conference Highlights

A few stats for the conference this year versus last year:

  • Total registrations were up nearly 15%
  • Attendees represented 40 countries, up from 36 in 2013
  • 34 companies sponsored the conference this year, up from 33 last year
  • This year’s conference covered 5 days including the Open Source Appreciation Day, up from 4 days last year

The conference could not grow without the support of attendees and sponsors. It’s great to see continued growth in attendees which is driven in large part by the word of mouth endorsements of those who attended in previous years.

Keynotes & SiliconANGLE Interviews Video Recordings

The Percona Live MySQL Conference 2014 featured some great keynote presentations. I would like to thank our great keynote speakers who covered a wide variety of topics including MySQL 5.7, integrating MySQL and Hadoop, SSD storage, OpenStack, and the future of running MySQL at scale. If you couldn’t attend this year, you can watch the keynote recordings by visiting the conference website. I was particularly interested in the presentation on MySQL 5.7 from Tomas Ulin, VP of MySQL Engineering for Oracle, which demonstrated the ongoing improvements and commitment Oracle has made to MySQL. I also found the talk by Boris Renski of Mirantis especially helpful in increasing my understanding of the OpenStack ecosystem and the organizations involved in that space. The talk by Robert Hodges of Continuent on integrating MySQL and Hadoop and the talk by Nisha Talagala from Fusion-io on advances in SSD storage are also definitely worth a look.

We were fortunate to have SiliconANGLE at the conference this year, recording video interviews with a wide range of exhibitors and attendees. Whether you attended the conference or not, I think you’ll find the interviews entertaining and enlightening. You can find the videos on the Percona Live conference website or on the SiliconANGLE website or YouTube channel.

Conference Committee

Thanks go first to the nearly 130 speakers who were chosen based on over 300 submitted proposals. The Conference Committee, led by Shlomi Noach and representing a wide array of ecosystem participants, created a solid schedule of tutorials and breakout sessions which covered a wide variety of MySQL-related topics at a variety of levels. Thanks go out to all of the Committee members:

  • Shlomi Noach, Engineer, Outbrain
  • Roland Bouman, Software Engineer, Pentaho
  • Tim Callaghan, VP of Engineering, Tokutek
  • Laine Campbell, CEO and Co-Founder, Blackbird
  • Jeremy Cole, Sr. Systems Engineer, Google
  • Todd Farmer, Director, Oracle
  • Jay Janssen, Consulting Lead, Percona
  • Giuseppe Maxia, QA Director, Continuent
  • Cedric Peintre, DBA, Dailymotion
  • Ivan Zoratti, CTO, SkySQL
  • Liz van Dijk, Head of Technical Account Management, Percona

Visit the Percona Live MySQL Conference 2014 website and click the “Session Slides” to download the slides from the sessions. Check back periodically for new sets of slides added by the conference speakers.

Sponsors

We cannot put on this conference without the support of our sponsors. Special thanks go out to our Diamond Plus sponsors, Fusion-io and Continuent, for their support and their knowledgeable keynote speakers. The full list of sponsors is:

Diamond Plus: Continuent, Fusion-io
Platinum: Booking.com
Gold: Machine Zone, Micron, Pythian, SkySQL
Silver: AppDynamics, Box, InfiniDB, Diablo, Galera/Codership, Google, Rackspace, Tesora, Twitter, Yelp
Exhibitors: Attunity, Blackbird (formerly PalominoDB), Dropbox, Etsy, FoundationDB, HGST, Rocket Fuel, RSSBus, ScaleArc, ScaleBase, Severalnines, Sphinx, Tokutek, VividCortex
Other: Devart, Facebook, Webyog, MailChimp
Media: ADMIN Magazine, Datanami, DBTA, Linux Journal, Linux Pro Magazine, O’Reilly, Software Developer’s Journal
Open Source Appreciation Day: Tesora, hastexo, CentOS

Open Source Appreciation Day

Special thanks go to our partners who helped put together the OpenStack Today and CentOS Dojo Santa Clara events on Monday during the inaugural Open Source Appreciation Day. Attendees for these free events were treated to great talks from noteworthy speakers from the OpenStack, CentOS, and MySQL worlds on a wide variety of topics. Watch for links to the presentations in a future blog post.

WebScaleSQL

One of the biggest announcements this year wasn’t made by a sponsor. The announcement of the WebScaleSQL project, which is being driven by team members from Facebook, Google, Twitter, and LinkedIn, got a lot of attention from attendees and from the media. My takeaway was that the project is both a strong vote of confidence in MySQL (versus other MySQL variants or alternatives) and a promise of continued improvement in MySQL for large scale web applications in the future.

Percona Live London 2014 and Percona Live MySQL Conference 2015 Dates

Our next event is Percona Live London which is November 3-4, 2014. The Percona Live MySQL Conference and Expo 2015 will be April 13-16, 2015 at the Hyatt Santa Clara and Santa Clara Convention Center – watch for more details in the coming months!

The post Percona Live MySQL Conference Highlights appeared first on MySQL Performance Blog.

Heartbleed: Separating FAQ From FUD

Thu, 10/04/2014 - 04:52

If you’ve been following this blog (my colleague, David Busby, posted about it yesterday) or any tech news outlet in the past few days, you’ve probably seen some mention of the “Heartbleed” vulnerability in certain versions of the OpenSSL library.

So what is ‘Heartbleed’, really?

In short, Heartbleed is an information-leak issue. An attacker can exploit this bug to retrieve the contents of a server’s memory without any need for local access. According to the researchers that discovered it, this can be done without leaving any trace of compromise on the system. In other words, if you’re vulnerable, they can steal your keys and you won’t even notice that they’ve gone missing. I use the word “keys” literally here; by being able to access the contents of the impacted service’s memory, the attacker is able to retrieve, among other things, private encryption keys for SSL/TLS-based services, which means that the attacker could be able to decrypt communications, impersonate other users (see here, for example, for a session hijacking attack based on this bug), and generally gain access to data which is otherwise believed to be secure. This is a big deal. It isn’t often that bugs have their own dedicated websites and domain names, but this one does: http://www.heartbleed.com

Why is it such a big deal?

One, because it has apparently been in existence since at least 2012. Two, because SSL encryption is widespread across the Internet. And three, because there’s no way to know if your keys have been compromised. The best detection that currently exists for this are some Snort rules, but if you’re not using Snort or some other IDS, then you’re basically in the dark.

What kinds of services are impacted?

Any software that uses the SSL/TLS features of a vulnerable version of OpenSSL. This means Apache, NGINX, Percona Server, MariaDB, the commercial version of MySQL 5.6.6+, Dovecot, Postfix, SSL/TLS VPN software (OpenVPN, for example), instant-messaging clients, and many more. Also, software packages that bundle their own vulnerable version of SSL rather than relying on the system’s version, which might be patched. In other words, it’s probably easier to explain what isn’t affected.

What’s NOT impacted?

SSH does not use SSL/TLS, so you’re OK there. If you downloaded a binary installation of MySQL community from Oracle, you’re also OK, because the community builds use yaSSL, which is not known to be vulnerable to this bug. In fact, anything that’s been built against yaSSL (or some other SSL library that isn’t OpenSSL) should be fine. Obviously, any service which doesn’t use SSL/TLS is not going to be vulnerable, either, since the salient code paths aren’t going to be executed. So, for example, if you don’t use SSL for your MySQL connections, then this bug isn’t going to affect your database server, although it probably still impacts you in other ways (e.g., your web servers).

What about Amazon cloud services?

According to Amazon’s security bulletin on the issue, all vulnerable services have been patched, but they still recommend that you rotate your SSL certificates.

Do I need to upgrade Percona Server, MySQL, NGINX, Apache, or other server software?

No, not unless you built any of these yourself and statically linked them with a vulnerable version of OpenSSL. This is not common. 99% of affected users can fix this issue by upgrading their OpenSSL libraries and cycling their keys/certificates.

What about client-level tools, like Percona Toolkit or XtraBackup?

Again, no. The client sides of Percona Toolkit, Percona XtraBackup, and Percona Cloud Tools (PCT) are not impacted. Moreover, the PCT website has already been patched. Encrypted backups are still secure.

There are some conflicting reports out there about exactly how much information leakage this bug allows. What’s the deal?

Some of the news articles and blogs claim that with this bug, any piece of the server’s memory can be accessed. Others have stated that the disclosure is limited to memory space owned by processes using a vulnerable version of OpenSSL. As far as we are aware, and as reported in CERT’s Vulnerability Notes Database, the impact of the bug is the latter; i.e., it is NOT possible for an attacker to use this exploit to retrieve arbitrary bits of your server’s memory, only bits of memory from your vulnerable services. That said, your vulnerable services could still leak information that attackers could exploit in other ways.

How do I know if I’m affected?

You can test your web server at http://filippo.io/Heartbleed/ – enter your site’s URL and see what it says. If you have Go installed, you can also get a command-line tool from Github and test from the privacy of your own workstation. There’s also a Python implementation. You can also check the version of OpenSSL that’s installed on your servers. If you’re running OpenSSL 1.0.1 through 1.0.1f or 1.0.2-beta, you’re vulnerable. (Side note here: some distributions, such as RHEL/CentOS, have patched the issue without actually updating the OpenSSL version number; on RPM-based systems you can view the changelog to see if it’s been fixed, for example):

rpm -q --changelog openssl | head -2
* Mon Apr 07 2014 Tomáš Mráz <tmraz@redhat.com> 1.0.1e-16.7
- fix CVE-2014-0160 - information disclosure in TLS heartbeat extension
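You can also check the base library version directly (keeping the backporting caveat above in mind):

$ openssl version   # prints something like "OpenSSL 1.0.1e-fips ..." on RHEL/CentOS 6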

Also, note that versions of OpenSSL prior to 1.0.1 are not known to be vulnerable. If you’re still unsure, we would be happy to assist you in determining the extent of the issue in your environment and taking any required corrective action. Just give us a call.

How can I know if I’ve been compromised?

If you’ve already been compromised, you have no way of knowing. However, if you use Snort as an IDS, you can use some rules developed by Fox-IT to detect and defer new attacks; other IDS/IPS vendors may have similar rule updates available. Without some sort of IDS in place, however, attacks can and will go unnoticed.

Are there any exploits for this currently out there?

Currently, yes, there are some proofs-of-concept floating around out there, although before this week, that was uncertain. But given that this is likely a 2-year-old bug, I would be quite surprised if someone, somewhere (you came to my talk at Percona Live last week, didn’t you? Remember what I said about assuming that you’re already owned? There are No Secrets Anymore.) didn’t have a solid, viable exploit.

So, then, what should I do?

Ubuntu, RedHat/CentOS, Amazon, and Fedora have already released patched versions of the OpenSSL library. Upgrade now. Do it now, as in right now. If you’ve compiled your own version of OpenSSL from source, upgrade to 1.0.1g or rebuild your existing source with the -DOPENSSL_NO_HEARTBEATS flag.

Once that’s been done, stop any certificate-using services, regenerate the private keys for any services that use SSL (Apache, MySQL, whatever), and then obtain a new SSL certificate from whatever certificate authority you’re using. Once the new certificates are installed, restart your services. You can also, of course, do the private key regeneration and certificate cycling on a separate machine, and only bring the service down long enough to install the new keys and certificates. Yes, you will need to restart MySQL. Or Apache. Or whatever. Full stop, full start, no SIGHUP (or equivalent).

Unfortunately, that’s still not all. Keeping in mind the nature of this bug, you should also change / reset any user passwords and/or user sessions that were in effect prior to patching your system and recycling your keys. See, for example, the session hijacking exploit referenced above. Also note that Google services were, prior to their patching of the bug, vulnerable to this issue, which means that it’s entirely possible that your Gmail login (or any other Google-related login) could have been compromised.

Can I get away with just upgrading OpenSSL?

NO. At a bare minimum, you will need to restart your services, but in order to really be sure you’ve fixed the problem, you also need to cycle your keys and certificates (and revoke your old ones, if possible). This is the time-consuming part, but since you have no way of knowing whether or not someone has previously compromised your private keys (and you can bet that now that the bug is public, there will be a lot of would-be miscreants out there looking for servers to abuse), the only safe thing to do is cycle them. You wouldn’t leave your locks unchanged after being burgled, would you?

Also note that once you do upgrade OpenSSL, you can get a quick list of the services that need to be restarted by running the following:

sudo lsof | grep ssl | grep DEL

Where can I get help and/or more information?

In addition to the assorted links already mentioned, you can read up on the nuts and bolts of this bug, or various news articles such as this or this. There are a lot of articles out there right now, most of which are characterizing this as a major issue. It is. Test your vulnerability, upgrade your OpenSSL and rotate your private keys and certificates, and then change your passwords.

The post Heartbleed: Separating FAQ From FUD appeared first on MySQL Performance Blog.

OpenSSL heartbleed CVE-2014-0160 – Data leaks make my heart bleed

Wed, 09/04/2014 - 01:04

The heartbleed bug was introduced in OpenSSL 1.0.1 and is present in

  • 1.0.1
  • 1.0.1a
  • 1.0.1b
  • 1.0.1c
  • 1.0.1d
  • 1.0.1e
  • 1.0.1f

The bug is not present in 1.0.1g, nor is it present in the 1.0.0 branch nor the 0.9.8 branch of OpenSSL. Some sources report 1.0.2-beta is also affected by this bug at the time of writing; however, it is a beta product and I would really recommend not using beta-quality releases for something as fundamentally important as OpenSSL in production.

The bug itself is within the heartbeat extension of OpenSSL (RFC 6520). The bug allows an attacker to leak memory in up to 64k chunks; this is not to say the data being leaked is limited to 64k, as the attacker can continually abuse this bug to leak data until they are satisfied with what has been recovered.

At worst the attacker can retrieve the private keys, the implication being that the attacker now has the keys to decrypt the encrypted data. As such, the only way to be 100% certain of protection against this bug is to first update OpenSSL (>= 1.0.1g) and then revoke and regenerate new keys and certificates. Expect to see a tirade of revocations and re-issuing of CA certs and the like in the coming days.

So how does this affect you as a MySQL user?

Taking Percona Server as an example, it is linked against OpenSSL, meaning that if you want to use TLS for your client connections and/or your replication connections, you’re going to need to have OpenSSL installed.

You can find your version easily via your package manager for example:

  • rpm -q openssl
  • dpkg-query -W openssl

If you’re running a vulnerable installation of OpenSSL an update will be required.

  • update OpenSSL >= 1.0.1g
  1. 1.0.1e-2+deb7u5 is reported as patched on debian,
  2. 1.0.1e-16.el6_5.7 is reported as patched in RedHat
  3. 1.0.1e-16.el6_5.4.0.1.centos is reported as being an interim patch for CentOS from the updates repository this is superseded by the RedHat package where available.
  4. 1.0.1e-37.66 changelogs note this has been patched on Amazon AMI
  • shutdown mysqld
  • regenerate the keys and certs used by MySQL for TLS connections, revoking the old certs if possible (a sketch of this step follows the list)
  • start mysqld
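As an illustration of the key/cert regeneration step for a simple self-signed setup (file names and validity period here are arbitrary; if you use a real CA, the signing happens there instead):

# new CA key and certificate
openssl genrsa 2048 > ca-key.pem
openssl req -new -x509 -nodes -days 3650 -key ca-key.pem -out ca-cert.pem
# new server key, signing request and certificate signed by that CA
openssl req -newkey rsa:2048 -days 3650 -nodes -keyout server-key.pem -out server-req.pem
openssl rsa -in server-key.pem -out server-key.pem
openssl x509 -req -in server-req.pem -days 3650 -CA ca-cert.pem -CAkey ca-key.pem -set_serial 01 -out server-cert.pem
# then point ssl-ca, ssl-cert and ssl-key in my.cnf at the new files before restarting mysqld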

You can read more about the heartbleed bug at heartbleed.com, in the Red Hat Bugzilla, in the MITRE CVE filing and in the Ubuntu Security Notice.

UPDATE 2014-04-10: This video provides a fantastic description of heartbleed.

 

The post OpenSSL heartbleed CVE-2014-0160 – Data leaks make my heart bleed appeared first on MySQL Performance Blog.

Optimizing MySQL Performance: Choosing the Right Tool for the Job

Tue, 08/04/2014 - 07:00

Next Wednesday, I will present a webinar about MySQL performance profiling tools that every MySQL DBA should know.

Application performance is a key aspect of ensuring a good experience for your end users. But finding and fixing performance bottlenecks is difficult in the complex systems that define today’s web applications. Having a method and knowing how to use the tools available can significantly reduce the amount of time between problems manifesting and fixes being deployed.

In the webinar, titled “Optimizing MySQL Performance: Choosing the Right Tool for the Job,” we’ll start with the basics – top, iostat, and vmstat – then move on to advanced tools like GDB, Oprofile, and strace.
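As a small taste of the “basic” end of that toolbox, here is a minimal sketch (not taken from the webinar itself) of capturing a baseline while a problem is happening:

iostat -x 1 30 > iostat.out &   # extended per-device I/O statistics, 1-second samples
vmstat 1 30 > vmstat.out &      # memory, swap and CPU overview
top -b -n 30 > top.out &        # per-process snapshots in batch mode
wait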

I’m looking forward to this webinar and invite you to join us April 16th at 10 a.m. Pacific time. You can learn more and also register here to reserve your spot. I also invite you to submit questions ahead of time by leaving them in the comments section below. Thanks for reading and see you next Wednesday!

The post Optimizing MySQL Performance: Choosing the Right Tool for the Job appeared first on MySQL Performance Blog.

Facebook’s Yoshinori Matsunobu on MySQL, WebScaleSQL & Percona Live

Sat, 05/04/2014 - 03:41

Facebook’s Yoshinori Matsunobu

I spoke with Facebook database engineer Yoshinori Matsunobu here at Percona Live 2014 today about a range of topics, including MySQL at Facebook, the company’s recent move to WebScaleSQL, new MySQL flash storage technologies – and why attending the Percona Live MySQL Conference and Expo each year is very important to him.

Facebook engineers are hosting several sessions at this year’s conference and all have been standing room only. That’s not too surprising considering that Facebook, the undisputed king of online social networks, has 1.23 billion monthly active users collectively contributing to an ocean of data-intensive tasks – making the company one of the world’s top MySQL users. And they are sharing what they’ve learned in doing so this week.

You can read my previous post where I interviewed five Facebook MySQL experts (Steaphan Greene, Evan Elias, Shlomo Priymak, Yoshinori Matsunobu and Mark Callaghan) and below is my short video interview with Yoshi, who will lead his third and final session of the conference today at 12:50 p.m. Pacific time titled, “Global Transaction ID at Facebook.”

The post Facebook’s Yoshinori Matsunobu on MySQL, WebScaleSQL & Percona Live appeared first on MySQL Performance Blog.

How to add an existing Percona XtraDB Cluster to Percona ClusterControl

Tue, 01/04/2014 - 20:00

In my last blog post I explained how to use Percona ClusterControl to create a new Percona XtraDB Cluster from scratch. That’s a good option when you want to create a testing environment in just a few mouse clicks. This time I’m going to show you how to add your existing cluster to Percona ClusterControl so you can manage and monitor it in the web interface.

The environment will be pretty similar: we will have the UI, CMON and 3 XtraDB Cluster nodes. The cluster should already be running and Percona ClusterControl already installed.

Adding an existing Cluster

The ClusterControl web interface is empty; there are no clusters on it yet. To add an existing one we need to click on “Add existing Galera Cluster.”


A new form will be shown, pretty similar to the one we saw last time when we were creating a new cluster. We can divide the form in two parts. First we need to give information about our cluster. The info required is the Linux distribution and version, the IPs of the PXC nodes and the MySQL root password. Pretty easy:


In the second part we have the SSH configuration. There is one prerequisite: the UI server should be able to connect to all servers using an SSH key. Therefore, our first step is to create an SSH key pair on our UI server and copy the public key to all other servers.
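
A minimal sketch of that step, run on the UI server with placeholder IP addresses for the other nodes:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in 192.168.1.11 192.168.1.12 192.168.1.13; do    # placeholder IPs of the PXC/CMON nodes
    ssh-copy-id -i ~/.ssh/id_rsa.pub root@$host
done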


It is also necessary to add the private key to the web interface. You can do it using the form shown after clicking on “Add Key Pair”:

Once the key is added, we can verify the access:

As we can see here, everything works as expected and all servers are reachable by SSH. The parameter “Create shared SSH key” also needs to be enabled. That option makes ClusterControl create a new SSH key pair on the CMON node so that it can also connect to the PXC nodes with passwordless SSH.

Now everything is prepared. We can proceed with the deployment. Just click on “Add cluster” and the installation process will start. While the installation is in progress you will see this notification:


Clicking on it we can see the progress of the deployment:

After some minutes our PXC is shown in the Percona ClusterControl UI:

Now we can monitor it, get alerts, clone, run backups and everything from the web interface. You can also add multiple clusters and create new ones.

The post How to add an existing Percona XtraDB Cluster to Percona ClusterControl appeared first on MySQL Performance Blog.

ScaleArc: Benchmarking with sysbench

Tue, 01/04/2014 - 00:11

ScaleArc recently hired Percona to perform various tests on its database traffic management product. This post is the outcome of the benchmarks carried out by Uday Sawant (ScaleArc) and myself. You can also download the report directly as a PDF here.

The goal of these benchmarks is to identify the potential overhead of the ScaleArc software itself and the potential benefits of caching. The benchmarks were carried out with the trunk version of sysbench. For this reason, we used a very small set of data, so the measurements will be fast, and it’s known that caching has huge benefits when the queries themselves are rather expensive. We decided that we would rather show that benefit with a real-world application, which is coming later in this series. And if you’re in the Silicon Valley area, be sure to join us this evening at the first-ever Open Source Appreciation Day – I’d be happy to discuss the findings presented in this post. Admission is free but due to limited space you should register now. I’ll also be available throughout the Percona Live MySQL Conference and Expo all this week.

In this summary graph it’s visible that in terms of throughput (read-only benchmark, which is relevant for read mostly applications), ScaleArc doesn’t have any significant overhead, while caching can have potentially huge benefits.

The situation is pretty similar with response times. ScaleArc doesn’t add any significant overhead, and caching can mean huge benefit in terms of response time as well.

In case of this particular workload (which is read only sysbench), using caching means a roughly 3x increase in throughput and a roughly 80% drop in response time.

Overall, ScaleArc is a good product in terms of performance and features as well. I would definitely recommend it.

About ScaleArc for MySQL
ScaleArc for MySQL is a software appliance that drops in transparently between applications and databases to improve application availability and performance. It requires no changes to applications or databases and delivers:

  • Instant scale up – transparent connection pooling and multiplexing, TTL-based transparent caching, surge protection
  • Transparent scale out – read/write split, load balancing, query routing, sharding
  • Automatic high availability – automatic failover
  • Real-time actionable analytics

Benchmarking setup
The client machines run the benchmarking software – sysbench in the case of these benchmarks.

CPU: 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

We used 2 clients. The results of the 2 clients are graphed separately, so it’s visible that they put the same amount of workload on the database or ScaleArc software.

Database machines
CPU: 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

Running MySQL Community Edition 5.6.15

MySQL configuration

[mysqld]
max_allowed_packet = 64M
thread_cache = 256
query_cache_size = 0
query_cache_type = 0
max_connections = 20020
max_user_connections = 20000
max_connect_errors = 99999999
wait_timeout = 28800
interactive_timeout = 28800
log-error = /var/lib/mysql/mysql.err
back_log = 60000
innodb_buffer_pool_size = 3G
innodb_additional_mem_pool_size = 16M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 0
innodb_flush_method = O_DIRECT
innodb_open_files = 2000
innodb_file_per_table
innodb_log_file_size = 2G
innodb_log_files_in_group = 2
innodb_purge_threads = 1
innodb_max_purge_lag = 0
innodb_support_xa = 0
innodb_locks_unsafe_for_binlog = 1
innodb_buffer_pool_instances = 8
sql_mode = NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

The buffer pool of the database is intentionally small, so it’s easy to generate a disk-bound workload.

Please note that the following settings are not recommended in production.

innodb_support_xa=0 innodb_locks_unsafe_for_binlog = 1

 

We used these settings to drive the node to its peak performance, avoiding any possible overhead which might be required on a production system. In typical production settings, these are not set, and binary logging is enabled, which potentially reduces ScaleArc’s overhead further.

ScaleArc software appliances
CPU: 1 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

The machines were running ScaleArc for MySQL 3.0.

Network
The machines were connected using 10G connections.

Measurements
All of the measurements were done with a very small database that completely fits in memory.

--oltp-table-size=10000 --oltp-tables-count=64
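
To make the setup above concrete, a hypothetical sysbench 0.5 invocation for preparing such a data set could look like the following; the host name, credentials and the path to the Lua scripts are placeholders and not part of the original benchmark description:

sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua \
  --mysql-host=db-or-scalearc-host --mysql-user=sbtest --mysql-password=sbtest \
  --oltp-table-size=10000 --oltp-tables-count=64 prepare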

 

In these benchmarks, we expected both the database and ScaleArc to be CPU bound. In case of a disk-bound workload, ScaleArc would shine even more than in this benchmark. If the queries are more expensive (they have to hit storage), the overhead in % is smaller, and in case of caching the query-by-query benefit is bigger.

We measured 3 different setups, both on read-only and read-write cases. These are the following.

  • Direct connection to the database.
  • Connection to the database through ScaleArc, where ScaleArc only acts as a pass-through filter (since it’s a load balancer that speaks the MySQL wire protocol, all the mechanics for that are still in place). Please note that this setup doesn’t make sense in real life; its purpose is to show the potential overhead of using ScaleArc and to uncover potential limitations of the ScaleArc software itself.
  • Connection to the database through ScaleArc, where ScaleArc is allowed to cache. Caching in ScaleArc is TTL (Time To Live)-based caching, meaning that a read query’s results are cached in ScaleArc. If that read query is seen again before the entry expires, the query is not run again on the database server but rather served from the cache. Once the timer for the cached query expires, the query will be issued on the database again. Caching of course only works for reads which are not in an explicit transaction (autocommit is on and no START TRANSACTION is issued). Because of that, we used --oltp-skip-trx during the cached benchmarks (read-only case). In the case of these benchmarks, the TTL was 1 hour, because we wanted to saturate the ScaleArc software while serving cached queries. A 1-hour TTL might be unrealistic for some applications, while for other applications even a 1-day TTL is something they can live with for some queries. In this case, we wanted to measure the cache’s performance, so we wanted the queries to be cached during the entire benchmark run to show the potential gain even in the case of very small queries (a sample sysbench invocation follows this list).
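
As referenced above, a hypothetical read-only run with transactions skipped might look like this; the host, thread count, duration and script path are placeholders:

sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua \
  --mysql-host=scalearc-vip --mysql-user=sbtest --mysql-password=sbtest \
  --oltp-tables-count=64 --oltp-table-size=10000 \
  --oltp-read-only=on --oltp-skip-trx=on \
  --num-threads=64 --max-requests=0 --max-time=300 --report-interval=1 run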

TTL-based caching
It’s important to note that the cache’s expiration is controlled by a TTL value – there is no other invalidation, so it’s possible to read stale data when the query result has changed but the cache entry has not yet expired. Reading stale data alone is OK for most applications; it can happen with a regular, asynchronous slave if it’s lagging behind the master (and it is always lagging behind somewhat). Otherwise, the cache is pretty similar to MySQL’s query cache, which doesn’t suffer from the stale-read problem but has coarse invalidation (if a table is written, the cache entries belonging to the given table are flushed). While the cache is flushed, the query cache mutex is held, which even blocks reads. Because of the mutex, the built-in query cache is a very common performance bottleneck. ScaleArc’s cache doesn’t suffer from this.

It’s important to note that ScaleArc caches nothing by default. Also, there are other ways to invalidate cache entries apart from waiting for the TTL to expire (hypothetical examples follow this list):

  • API Call based invalidation (you can clear the cache for an entire query pattern rule with one API call)
  • Query comment based invalidation (you can put a comment /*wipe*/ before a query and wipe and refresh the cache)
  • Cache Bypass (you can send a comment /*nocache*/ and bypass the cache for that specific query)
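
Hypothetical examples of those comment-based controls, issued through the mysql client against a placeholder ScaleArc address and a sysbench-style table:

mysql -h scalearc-vip -u sbtest -p -e "SELECT /*wipe*/ c FROM sbtest1 WHERE id = 42"     # wipe and refresh the cached entry
mysql -h scalearc-vip -u sbtest -p -e "SELECT /*nocache*/ c FROM sbtest1 WHERE id = 42"  # bypass the cache for this query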

Read-only
Sysbench throughput

In the lower region of threads (up to 32), we see that the TPS value drops significantly when going through ScaleArc. That’s nothing to be surprised about: the reason is network roundtrips. Because ScaleArc is a software appliance, it adds a hop between the database and the application, which introduces latency. If the number of threads is higher (32 and up), this starts to matter less and less, and performance is almost identical, which is very impressive. It means that around the optimal degree of parallelism for these machines, ScaleArc introduces very little (barely measurable) overhead.

Sysbench response time

This graph contains the response times belonging to the previous benchmarks. It is really hard to read because at 4096 threads the system is overloaded and the response time is much higher than in the maximum-throughput region. Because it’s multiple orders of magnitude higher, the interesting response times are not readable from this graph.

The following graph is the same as above, except that the y axis is limited to 250 ms, so the region which is not visible on the graph above is visible here. What we see there regarding the overhead is pretty much the same as what we saw on the throughput graph: ScaleArc by itself introduces barely measurable latency (which explains the difference in cases when parallelism is low). Usually, applications utilizing the database server use significantly more than one thread (in MySQL a single query always uses a single thread; in other words, there is no intra-query parallelism). From 32 threads and above, the latency is actually somewhat lower when going through ScaleArc (the exact tipping point can differ based on the number of CPUs). The reason is that ScaleArc itself uses an event loop to connect to MySQL, so at high concurrency it can schedule sending the traffic to MySQL differently. This only matters when the MySQL server would otherwise be saturated CPU-wise.

CPU utilization

Last but not least, this graph contains the CPU utilization of the different setups. The left-hand side shows the CPU utilization when connecting directly to the database, and the right-hand side shows connecting through ScaleArc. In both cases, the database server’s CPU is the bottleneck. It’s visible that the client node’s CPU is more than 75% idle (only client1 is graphed to improve readability; client2 is practically the same). From 32 threads and up, the blue bar (CPU user%) is relatively high on the database servers, as is the green (CPU sys%). From 64 threads, the idle time is practically 0 until the systems are overloaded. On the right-hand side, we can see that ScaleArc at this load still had 50% idle CPU, which means that we could practically run the same benchmark on another set of boxes through the very same ScaleArc, and only then would it be fully utilized. We are talking about 3000 sysbench tps here. One more interesting thing to note is the relatively high system time on the database side. This is also because of the way ScaleArc connects to the database (see the previous paragraph).

[ 17s] threads: 64, tps: 3001.98, reads/s: 41962.70, writes/s: 0.00, response time: 35.22ms (95%)

 

These threads are from a single client, which means that ScaleArc could keep up with parsing roughly 84,000 statements per second while utilizing half of its CPU, which is impressive. Please note that the ScaleArc software in this case was tuned towards this type of workload, which means we had more query processing threads. In the case of caching, we will have more cache handler threads.

Effects of caching on read-only workload
Sysbench throughput

The next set of graphs will compare the cases when cache is used and not used.

The preceding TPS graph contains reads per second (because we measured with --oltp-skip-trx), so roughly 42,000 reads correspond to roughly 3,000 transactions in the earlier setup (14 reads in a transaction). On the left-hand side of the graph, the cached throughput is shown in green; on the right-hand side, the non-cached throughput is shown in red (direct access) and blue (access through ScaleArc as a pass-through filter). It’s visible that caching improves the speed drastically, but when ScaleArc becomes overloaded (8192 client threads, 4096 from each client), the performance becomes somewhat inconsistent, which is understandable considering how few cores ScaleArc was running on. On the graph, the dots are translucent, which means the colors are brighter in the areas that have more samples. Even in the overloaded case, the majority of the samples are in the region of 100k+ reads per second across two clients, which means that the performance degrades very gracefully even under heavy load.

Sysbench response time

Like in the case of a non-cached workload, the response times are not too readable because of the very high response times when the systems are overloaded. But from the overloaded response times visible, it seems like using caching doesn’t make response times worse.

Like in the case of the non-cached workload, this graph is the zoomed version of the previous one; here the maximum of the y axis is 100 ms. From this graph, it’s visible that at lower concurrency and at the optimal throughput, caching actually helps response time. This is understandable, since in the case of a cache hit ScaleArc can serve the results itself and the client (in our case sysbench) doesn’t have to go to the database, so a roundtrip and the database processing time are spared. It’s also worth mentioning that the data “comes from memory” either way – it doesn’t matter whether we hit the ScaleArc cache or the database. When the ScaleArc cache is used, the response time is lower because the additional roundtrip to the database and the potential database work (like parsing SQL) are avoided. This means that caching can have benefits even if the database fits in the buffer pool. The improvement is always subject to the workload – caching helps the most when it can cache relatively expensive queries like aggregations and queries hitting storage.

CPU utilization

Similarly to the previous case, the preceding graph shows the CPU utilization of the various components. In the case of the cached workload, the client itself is much more utilized (since it gets responses sooner, it has to generate the traffic faster). With this kind of workload, when using only one client, we would hit the client’s CPU as the performance bottleneck. The database is interesting too: with caching, its CPU is barely used. This is because if a query is served from the cache, it never gets to the database, so the database’s CPU utilization will be lower. In other words, using the cache helps to offload the database. The offloading is also visible on ScaleArc’s graphs: when caching is used, the CPU on the server hosting ScaleArc is much more utilized. For this benchmark, the ScaleArc software was tuned to handle a cached workload, which means more cache handler threads.

Read-write
For the read-write benchmarks, we had to create oltp_nontran.lua, which is the same sysbench benchmark as oltp.lua except that it does the reads outside of the transaction and only the writes inside the transaction, so caching can have an effect on the reads. The rest of the benchmarking setup is the same as in the read-only case.

Sysbench throughput

Similarly to the read-only case, at a low concurrency, the overhead of ScaleArc is coming from the additional network roundtrip. At the optimal concurrency, the overhead is barely measurable (the dots are plotted practically on top of each other).

Sysbench response time

The case is pretty similar with the response times as in the read-only case. Similarly, the second graph is a zoomed version of the first one, with a 250 ms maximum.

CPU utilization

The CPU utilization graph shows that in this case, the database server’s CPU is the bottleneck. What is interesting is that ScaleArc is using less CPU than in the read-only case. This is understandable, since a transaction now contains writes as well, which are expensive on the database side, but they are still just statements to route on the ScaleArc side.

Effects of caching on read-write workload
Measuring caching here is interesting because the workload is no longer read-only or mostly reads. We have a very significant amount of writes.

[ 16s] threads: 64, tps: 2151.99, reads/s: 30012.88, writes/s: 8575.97, response time: 45.20ms (95%)

For 30k reads, we get 8.5k writes. It’s expected that caching won’t help as much as in the previous case, because writes can’t be cached, and while they are in progress the benchmarking threads can’t proceed with reads. Please note that this means roughly 25% of the traffic is writes – a typical application scaling out with additional slaves for reads doesn’t have this kind of read-to-write ratio.

Sysbench throughput

The first graph shows that in terms of total throughput, caching still helps.

Sysbench response time

Similarly to the read-only case, caching also helps response time, because it reduces the time needed for the read part of the workload.

CPU utilization

This test really stresses the database server’s CPU when not caching. With caching on, similarly to the read-only case, the client’s workload increases somewhat (but not as much), and the database server’s CPU usage decreases significantly. In the last row, the CPU utilization of ScaleArc shows that although it’s somewhat higher with caching, it’s still not that much higher.

From these tests it’s visible that caching can still be beneficial even if the write ratio is as high as in this test.

Conclusion
Engineering is always about making the right tradeoffs. If one wants features that need a protocol-level load balancer like ScaleArc, the price to be paid is the overhead of Layer 7 parsing and decision making. ScaleArc’s engineering team did a great job minimizing this overhead. ScaleArc itself is very tunable for different workload types (if caching is important, ScaleArc can be tuned for caching; if query rewriting is, ScaleArc can be tuned for that).

The post ScaleArc: Benchmarking with sysbench appeared first on MySQL Performance Blog.

Percona XtraBackup 2.2.1 alpha release is now available

Sat, 29/03/2014 - 00:37

Percona is glad to announce the release of Percona XtraBackup 2.2.1-alpha1 on March 28th, 2014. Downloads are available from our download site here. This ALPHA release will be available in our Debian experimental and CentOS testing repositories.

This is an ALPHA quality release and it is not intended for production. If you want a high quality, Generally Available release, the current Stable version should be used (Percona XtraBackup 2.1.8 in the 2.1 series at the time of writing).

This release contains all of the features and bug fixes in Percona XtraBackup 2.1.8, plus the following:

New Features:

  • Percona XtraBackup has removed the multiple binaries (xtrabackup_56, xtrabackup_55, xtrabackup) and now uses a single xtrabackup binary for handling backups. The single-binary implementation removed the requirement to download server source tarballs and removed the different patches, which resulted in cleaner code.
  • Percona XtraBackup source layout has been changed to implement the single binary. Percona XtraBackup code can now be found in storage/innobase/xtrabackup.
  • Percona XtraBackup implemented support for Backup Locks.
  • Percona XtraBackup can now store backup history on the server itself in a special table created for that purpose.
  • The innobackupex-1.5.1 symlink has been removed; the innobackupex binary should be used instead.
  • Percona XtraBackup has removed the build.sh script and is now built with CMake.
  • Percona XtraBackup has been rebased on MySQL 5.6.16.

Bugs Fixed:

  • Information about the tool version used to take the backup was added by implementing the backup history feature. Bug fixed #1133017.
  • If an XtraDB-based binary was used to back up an InnoDB database, it would convert it to XtraDB by adding the XTRADB_1 marker in the dictionary header page and by adding the SYS_STATS table. Bug fixed #988310.

Release notes with all the bugfixes for Percona XtraBackup 2.2.1 are available in our online documentation. All of Percona‘s software is open source and free, all the details of the release can be found in the 2.2.1 milestone at Launchpad. Bugs can be reported on the launchpad bug tracker.

The post Percona XtraBackup 2.2.1 alpha release is now available appeared first on MySQL Performance Blog.

Innodb redo log archiving

Fri, 28/03/2014 - 23:00

Percona Server 5.6.11-60.3 introduces a new “log archiving” feature, and Percona XtraBackup 2.1.5 supports “apply archived logs.” What does this mean and how can it be used?

Percona products offer three kinds of incremental backups. The first is a full scan of the data files, comparing the data with the backup data to find the delta. This approach provides a history of changes and saves disk space by storing only data deltas. But the disadvantage is the full data-file scan, which adds load to the disk subsystem. The second kind of incremental backup avoids the extra disk load of data file scans.

The idea is to read only changed data pages. The information about which specific pages were changed is provided by the server itself, which writes files with this information as it works. It’s a good alternative, but changed-page tracking adds a small load, and Percona XtraBackup’s delta reading leads to non-sequential disk I/O. There is one more option.

The InnoDB engine has a data log. It writes all operations which modify database pages to log files. This log is used to recover data in the case of an unexpected server termination. The InnoDB log consists of several log files which are filled sequentially in a circular fashion. The idea is to save those files somewhere and apply all modifications from the archived logs to the backup data files. The disadvantage of this approach is the extra disk space used. The advantage is that there is no need to do an “explicit” backup on the host server. A simple script could sit and wait for logs to appear, then scp/netcat them over to another machine.
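
As a rough illustration, such a shipping script could be as simple as the following sketch; the archive directory and destination are assumptions, not part of the feature itself:

# Naive example: periodically copy whatever has appeared in the archive directory to a backup host
ARCH_DIR=/var/lib/mysql-log-archive        # placeholder path
DEST=backup-host:/backups/archived-logs/   # placeholder destination
while true; do
    rsync -a "$ARCH_DIR"/ "$DEST"          # rsync skips files that were already transferred
    sleep 60
done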

But why not use good old replication? Replication may not perform as well as log recovery, but it is more controlled and well known. Archived logs allow you to do any number of things with them, from simply storing them to doing periodic log applying. You cannot recover from a ‘DROP TABLE’, etc. with replication, but with this framework one could maintain the idea of “point in time” backups.

So the “archived logs” feature is one more option for organizing incremental backups. It is not widely used yet, as it was released only recently and there is not a good understanding of how it works and how it can be used. We are open to any suggestions about improvements and use cases. The subject of this post is to describe how it works in depth. As log archiving is closely tied to the InnoDB redo logs, the internals of the redo logs will be covered too. This post should be useful not only for DBAs but also for software engineers, because not only the common principles but also the specific code is considered, and the knowledge from this post can be used for further MySQL code exploration and patching.

What is the InnoDB log and how is it written?

Let’s recall what the InnoDB logs are, why they are written and what they are used for.

The InnoDB engine has a buffer pool, which is a cache of database pages. Any change is made to a page in the buffer pool; the page is then considered “dirty,” which means it must be flushed, and is pushed to the flush list, which is processed periodically by a special thread. If pages are not flushed to disk and the server is terminated unexpectedly, the changes will be lost. To avoid this, InnoDB writes changes to the redo log and recovers data from the redo log during startup. This technique allows buffer pool page flushing to be delayed. It can increase performance because several changes to one page can be accumulated in memory and then flushed with one I/O. In addition, flushed pages can be grouped to decrease the number of non-sequential I/Os. But the downside of this approach is the time needed for data recovery. Let’s consider how this log is stored, generated and used for data recovery.

Log files

The redo log consists of several log files which are treated as a circular buffer. The number and the size of the log files can be configured. Each log file has a header. The description of this header can be found in “storage/innobase/include/log0log.h” by the “LOG_GROUP_ID” keyword.

Each log file contains log records. Redo log records are written sequentially in log blocks of OS_FILE_LOG_BLOCK_SIZE, which is equal to 512 bytes by default and can be changed with an InnoDB option. Each record has its LSN. The LSN is the “Log Sequence Number” – the number of bytes written to the log from the log’s creation up to a certain log record. Each block consists of a header, a trailer and log records.

Log blocks

Let’s consider the log block header. The first 4 bytes of the header are the log block number. The block number is very similar to the LSN, but the LSN is measured in bytes while the block number is measured in units of OS_FILE_LOG_BLOCK_SIZE. Here is the simple formula for how an LSN is converted to a block number:

return(((ulint) (lsn / OS_FILE_LOG_BLOCK_SIZE) & 0x3FFFFFFFUL) + 1);

This formula can be found in the log_block_convert_lsn_to_no() function. The next two bytes are the number of bytes used in the block. The next two bytes are the offset of the first MTR log record in this block. What an MTR is will be described below; for now it can be considered a bunch of log records gathered together as the description of some logical operation, for example a group of log records for inserting a new row into some table. This field is used when there are records of several MTRs in one block. The next four bytes are a checkpoint number. The trailer is four bytes of log block checksum. The above description can be found in “storage/innobase/include/log0log.h” by the “LOG_BLOCK_HDR_NO” keyword.

Before writing to disk, log blocks must somehow be formed and stored. And the question is:

How are log blocks stored in memory and on disk?

Where are log blocks stored before flushing to disk, and how are they written and flushed?

Global log object and log buffer

The answer to the first part of the question is the log buffer. The server holds a very important global object, log_sys, in memory. It contains a lot of useful information about the logging state. The log buffer is pointed to by the log_sys->buf pointer, which is initialized in log_init(). I would highlight the following log_sys fields that are used for working with the log buffer and flushing:

log_sys->buf_size – the size of log buffer, can be set with innodb-log-buffer-size variable, the default value is 8M;
log_sys->buf_free – the offset from which the next log record will be written;
log_sys->max_buf_free – if log_sys->buf_free is greater than this value, the log buffer must be flushed, see log_free_check();
log_sys->buf_next_to_write – the offset of the next log record to write to disk;
log_sys->write_lsn – the LSN up to which log is written;
log_sys->flushed_to_disk_lsn – the LSN up to which log is written and flushed;
log_sys->lsn – the last LSN in log buffer;

So log_sys->buf_next_to_write is between 0 and log_sys->buf_free, log_sys->write_lsn is less than or equal to log_sys->lsn, and log_sys->flushed_to_disk_lsn is less than or equal to log_sys->write_lsn.

The relationships between those fields can easily be traced with a debugger by setting up watchpoints.
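
For instance, one could attach gdb to a running mysqld built with debug symbols and watch these fields; this is only a sketch, and symbol availability depends on how the server was built:

gdb -p "$(pidof mysqld)"
# inside gdb:
#   (gdb) watch log_sys->buf_free
#   (gdb) watch log_sys->write_lsn
#   (gdb) continue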

OK, we have the log buffer, but how do log records get into this buffer?

Where do log records come from?

InnoDB has special objects that allow you to gather the redo log records for some operation into one bunch before writing them to the log buffer. These objects are called “mini-transactions,” and the corresponding functions and data types have the “mtr” prefix in the code. The objects themselves are described by the mtr_t C structure. The most interesting fields of this structure are the following:

mtr_t::log – contains log records for the mini-transaction,
mtr_t::memo – contains pointers to the pages which are changed or locked by the mini-transaction; it is used to push pages to the flush list and release locks after the log records are copied to the log buffer in mtr_commit() (see mtr_memo_pop_all(), called in mtr_commit()).

mtr_start() function initializes an object of mtr_t type and mtr_commit() writes log records from mtr_t::log to log_sys->buf + log_sys->buf_free. So the typical sequence of any operation which changes data is the following:


mtr_start(); // initialize mtr object
some_ops... // operations on data which are logged in mtr_t::log
mtr_commit(); // write logged operations from mtr_t::log to log buffer log_sys->buf

page_cur_insert_rec_write_log() is a good example of how mtr records can be written and mtr::memo can be filled. The low-level function which writes data to the log buffer is log_write_low(). This function is invoked inside mtr_commit() and not only copies the log records from the mtr_t object to the log buffer log_sys->buf but also creates new log blocks inside log_sys->buf, filling their headers and trailers and calculating checksums.

So the log buffer contains log blocks which are sequentially filled with log records, and the records are grouped into “mini-transactions,” each of which can be treated as a single logical operation over the data consisting of a sequence of mini-operations (log records).

As log records are written sequentially in the log buffer, one mini-transaction and even one log record can be written across two neighbouring blocks. That is why the header field which contains the offset of the first MTR in the block is necessary: it marks the point from which log record parsing can be started. This field was described above in the log block header section.

So we have a buffer of log blocks in memory. How is data from this buffer written to disk? The MySQL documentation says that this depends on the innodb_flush_log_at_trx_commit option. There are three cases depending on the value of this option. Let’s consider each of them.
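
As a quick reference, innodb_flush_log_at_trx_commit is a dynamic global variable, so its current value can be inspected and changed at runtime, for example:

mysql -e "SELECT @@GLOBAL.innodb_flush_log_at_trx_commit"
mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2"   # write at commit, flush roughly once a second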

Writing log buffer to disk: innodb_flush_log_at_trx_commit is 1 or 2.

The first two cases are when innodb_flush_log_at_trx_commit is 1 or 2. In these cases, log records are written to the log files on each transaction commit; with a value of 1 they are also flushed, while with a value of 2 the flushing is done periodically by a special thread which will be considered later. The low-level function which writes log records from the buffer to the file is log_group_write_buf(). In most cases, though, it is not called directly but from the higher-level log_write_up_to(). For the current case the calling stack is the following:


(trx_commit_in_memory() or
trx_commit_complete_for_mysql() or
trx_prepare() e.t.c)->
trx_flush_log_if_needed()->
trx_flush_log_if_needed_low()->
log_write_up_to()->
log_group_write_buf().

It is quite easy to find the higher levels of the calling stack: just set a breakpoint on log_group_write_buf() and execute any SQL query that modifies InnoDB data. For example, for a simple “insert” SQL query the higher levels of the calling stack are the following:


mysql_execute_command()->
trans_commit_stmt()->
ha_commit_trans()->
TC_LOG_DUMMY::commit()->
ha_commit_low()->
innobase_commit()->
trx_commit_complete_for_mysql()->
trx_flush_log_if_needed()-> ... .

The log_io_complete() callback is invoked when I/O is finished for the log files (see fil_aio_wait()). log_io_complete() flushes the log files if this is not forbidden by the innodb_flush_method or innodb_flush_log_at_trx_commit options.

Writing log buffer to disk: innodb_flush_log_at_trx_commit is equal to 0

The third case is when innodb_flush_log_at_trx_commit is equal to 0. In this case the log buffer is NOT written to disk on transaction commit; it is written and flushed periodically by the separate thread “srv_master_thread”. With innodb_flush_log_at_trx_commit = 0 the log files are flushed in the same thread by the same calls. The calling stack is the following:


srv_master_thread()->
(srv_master_do_active_tasks() or srv_master_do_idle_tasks() or srv_master_do_shutdown_tasks())->
srv_sync_log_buffer_in_background()->
log_buffer_sync_in_background()->log_write_up_to()->... .

Special cases for logs flushing

While log_io_complete() does the flushing depending on the innodb_flush_log_at_trx_commit value (among other things), log_write_up_to() has its own flushing criterion: the flush_to_disk function argument. So it is possible to force log file flushing even if innodb_flush_log_at_trx_commit = 0. Here are examples of such cases:

1) buf_flush_write_block_low()
Each page contains information about the last applied LSN (buf_flush_write_block_low::newest_modification), and each log record is a description of a change on a certain page. Imagine we flushed some changed pages but the log records for these pages were not flushed and the server goes down. After restarting the server, some pages will have the newest modifications, but other pages were not flushed and the corresponding log records are lost too. We would have an inconsistent database in this case. That is why log records must be flushed before the pages they refer to.

2) srv_sync_log_buffer_in_background()
As it was described above this function is called periodically by special thread and forces flushing.

3) log_checkpoint()
When checkpoint is made log files must be reliably flushed.

4) The special handlerton innobase_flush_logs() which can be called through ha_flush_logs() from mysql server.
For example ha_flush_logs() is called from MYSQL_BIN_LOG::reset_logs() when “RESET MASTER” or “RESET SLAVE” are executed.

5) srv_master_do_shutdown_tasks() – on shutdown, ha_innobase::create() – on table creation, ha_innobase::delete_table() – on table removal, innobase_drop_database() – when all the tables of a database are removed, innobase_rename_table() – on table rename, etc.

If log files are treated as a circular buffer, what happens when the buffer overflows?

Briefly: InnoDB has a mechanism which allows it to avoid overflowing. It is called “checkpoints.” A checkpoint is a state in which the log files are synchronized with the data files. In this case there is no need to keep the history of changes before the checkpoint, because all pages whose last modification LSN is less than or equal to the checkpoint LSN are flushed, and the log file space from the last written LSN up to the last checkpoint LSN can be reused. We will not describe the checkpoint process here because it is a separate interesting subject. The only thing we need to know is that when a checkpoint happens, all pages with a modification LSN less than or equal to the checkpoint LSN are reliably flushed.

How archived logs are written by server.

So the log contains information about page changes. But as we said, the log files are a circular buffer. This means that they occupy a fixed amount of disk space and the oldest records can be overwritten by the newest ones, because there are points when the data files are synchronized with the log files, called checkpoints, and there is no need to store the earlier history of log records to guarantee database consistency. The idea is to save all log records somewhere so they can be applied to backed-up data, giving some kind of incremental backup – in other words, to have an archive of log records. As the log consists of log files, it is reasonable to store the archived log records in such files too, and these files are called “archived logs.”

Archived log files are written to a directory which can be set with a special InnoDB option. Each file has the same size as the InnoDB log file, and the suffix of each archived file is the LSN at which it starts.
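
A hypothetical my.cnf fragment for enabling the feature is shown below; the variable names are based on my reading of the Percona Server 5.6 documentation and should be verified against the docs for your exact version:

[mysqld]
# Percona Server 5.6 redo log archiving (verify option names for your version)
innodb_log_archive = ON
innodb_log_arch_dir = /var/lib/mysql-log-archive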

Like the log writing system, the log archiving system stores its data in the global log_sys object. Here are the most valuable log_sys fields from my point of view:

log_sys->archive_buf, log_sys->archive_buf_size – logs archive buffer and its size, log records are copied from log buffer log_sys->buf to this buffer before writing to disk;
log_sys->archiving_phase – the current phase of log archiving: LOG_ARCHIVE_READ when log records are being copied from log_sys->buf to log_sys->archive_buf, LOG_ARCHIVE_WRITE when log_sys->archive_buf is being written to disk;
log_sys->archived_lsn – the LSN to which log files are written;
log_sys->next_archived_lsn – the LSN to which write operations was invoked but not yet finished;
log_sys->max_archived_lsn_age – the maximum allowed difference between log_sys->lsn and log_sys->archived_lsn; if this difference is exceeded, the logs are archived synchronously, i.e. the difference is decreased;
log_sys->archive_lock – this is rw-lock which is used for synchronizing LOG_ARCHIVE_WRITE and LOG_ARCHIVE_READ phases, it is x-locked on LOG_ARCHIVE_WRITE phase.

So how is data copied from log_sys->buf to log_sys->archive_buf? log_archive_do() is used for this. It not only sets the proper state of the archived log fields in log_sys but also invokes log_group_read_log_seg() with the corresponding arguments, which not only copies data from the log buffer to the archive buffer but also starts an asynchronous write operation for the archive buffer. log_archive_do() can wait until the I/O operations are finished, using log_sys->archive_lock, if the corresponding function parameter is set.

The main question is under what circumstances log_archive_do() is invoked, i.e. when log records are written to the archived log files. The first call stack is the following:


log_free_check()->
log_check_margins()->
log_archive_margin()->
log_archive_do().

Here is text of log_free_check() with comments:


/*********************************************************************//
Checks if there is need for a log buffer flush or a new checkpoint, and does
this if yes. Any database operation should call this when it has modified
more than about 4 pages. NOTE that this function may only be called when the
OS thread owns no synchronization objects except the dictionary mutex. */
UNIV_INLINE
void
log_free_check(void)
/*================*/
{
#ifdef UNIV_SYNC_DEBUG
        ut_ad(sync_thread_levels_empty_except_dict());
#endif /* UNIV_SYNC_DEBUG */

        if (log_sys->check_flush_or_checkpoint) {

                log_check_margins();
        }
}

log_sys->check_flush_or_checkpoint is set when there is not enough free space in the log buffer, when it is time to do a checkpoint, or in other borderline cases. log_archive_margin() is invoked only if the limit on the difference between log_sys->lsn and log_sys->archived_lsn is exceeded. Let’s refer to this difference as the archived LSN age.

One more call to log_archive_do() is from log_open(), when the archived LSN age exceeds some limit. log_open() is called on each mtr_commit(), and in this case the archived logs are written synchronously.

The next synchronous call is from log_archive_all() during shutdown.

Summarizing all of the above: archived logs begin to be written when the log buffer is full enough to be written, when a checkpoint happens, or when the server is shutting down. And there is no delay between writing to the archive log buffer and writing to disk – there is no way to say that archived logs must be written once a second, as is possible for redo logs with innodb_flush_log_at_trx_commit = 0. As soon as data is copied to the buffer, the write operation is invoked immediately for this buffer. The archive log buffer is not filled on each mtr_commit(), so it does not slow down the usual logging process. The exception is when there are a lot of I/O operations, which can cause the archive log age to grow too big; the result of a big archive log age is synchronous archived log writing during mtr_commit(). Memory-to-memory copying is quite a fast operation, which is why the data is copied to the archive log buffer and written to disk asynchronously, minimizing the delays that can be caused by log archiving.

PS: Here is another call stack for writing archived log buffer to archived log files:

log_io_complete()->log_io_complete_archive()->log_archive_check_completion_low()->log_archive_groups().

I propose that you explore this stack yourself.

The log recovery process: how it is started and how it works inside. Applying archived logs.

So we discovered how innodb redo logging works, and how redo logs are archived. And the last uncovered thing is how recovery works and how archived logs are applied. These two processes are very similar – that is why they are discussed in one section of this post.

The story begins with innobase_start_or_create_for_mysql() which is invoked from innobase_init(). The following trident in innobase_start_or_create_for_mysql() can be used to search the relevant code:


if (create_new_db) {
...
} else if (srv_archive_recovery) {
...
} else {
...
}

The second branch and the last one are the places from which archived log applying and InnoDB log recovery, respectively, start. These two blocks wrap two pairs of functions:


recv_recovery_from_archive_start()
recv_recovery_from_archive_finish()

and

recv_recovery_from_checkpoint_start()
recv_recovery_from_checkpoint_finish()

And all the magic happens in these pairs. Just as there is a global log_sys object for redo logging, there is a global recv_sys object for InnoDB recovery and archived log applying. It is created and initialized in the recv_sys_create() and recv_sys_init() functions respectively. The following fields of the recv_sys object are the most important from my point of view:

recv_sys->limit_lsn – the LSN up to which recovery should be made; this value is initialized with the maximum value of uint64_t (see #define LSN_MAX) for the recovery process, and with a certain value which is passed as an argument to the recv_recovery_from_archive_start() function and can be set via an xtrabackup option for log applying;
recv_sys->parse_start_lsn – the LSN from which log parsing is started; for log recovery this value equals the last checkpoint LSN, for log applying it is the last applied LSN;
recv_sys->scanned_lsn – the LSN up to which log files are scanned;
recv_sys->recovered_lsn – the LSN up to which log records are applied; this value is <= recv_sys->scanned_lsn;

The first thing that must be done to start the recovery process is to find out the point in the log files where recovery must be started from. This is the last checkpoint LSN. recv_find_max_checkpoint() does exactly this. As we can see in log_group_checkpoint(), the following code writes the checkpoint info into one of two places in the first log file, depending on the checkpoint number:


/* We alternate the physical place of the checkpoint info in the first
log file */

 

if ((log_sys->next_checkpoint_no & 1) == 0) {
write_offset = LOG_CHECKPOINT_1;
} else {
write_offset = LOG_CHECKPOINT_2;
}

So recv_find_max_checkpoint() reads checkpoint info from both places and selects the latest checkpoint.

The same idea is applied for log applying too, but the last applied LSN must be found instead of the last checkpoint LSN. Here is the call stack for reading the last applied LSN:


innobase_start_or_create_for_mysql()->
open_or_create_data_files()->
fil_read_first_page().

The last applied LSN is stored in the first page of the data files, in the (min|max)_flushed_lsn fields (see the FIL_PAGE_FILE_FLUSH_LSN offset). These values are written by the fil_write_flushed_lsn_to_data_files() function on server shutdown.

So the main difference between log applying and the recovery process at this stage is the manner of calculating the LSN from which log records will be read. For log applying the last flushed LSN is used, but for the recovery process it is the last checkpoint LSN. Why the difference? Logs can be applied periodically. Assume we gather archived logs and apply them once an hour to have a fresh backup. After applying the previous bunch of log files there can be unfinished transactions. For the recovery process, any unfinished transactions are rolled back to have a consistent DB state at server start. But for the log applying process there is no need to roll them back, because any unfinished transactions can be finished during the next log applying run.

After calculating the start LSN, the sequence of actions is the same for both recovery and applying. The next step is reading and parsing log records. See recv_group_scan_log_recs(), which is invoked from recv_recovery_from_checkpoint_start_func() for log recovery and from recv_recovery_from_archive_start()->log_group_recover_from_archive_file() for log applying.

First we read log records into a buffer and then invoke recv_scan_log_recs() to parse them. recv_scan_log_recs() checks each log block for consistency (checksum + comparing the log block number written in the log block with the log block number calculated from the log block LSN) and other edge cases, and copies it to the parsing buffer recv_sys->buf with the recv_sys_add_to_parsing_buf() function. The parsing buffer is then parsed by recv_parse_log_recs(). Log records are stored in the hash table recv_sys->addr_hash. The key for this hash table is calculated based on the (space id, page number) pair. This pair refers to the page to which the log records must be applied. The value in the hash table is an object of type recv_addr_t. recv_addr_t contains the rec_list field, which is the list of log records to apply to the (space id, page number) page (see recv_add_to_hash_table()).

After the log records are parsed and stored in the recv_sys->addr_hash hash table, they are applied. The function responsible for applying log records is recv_apply_hashed_log_recs(). It is invoked from recv_scan_log_recs() if there is not enough memory to store the log records, and at the end of the recovery/applying process. For each element of recv_sys->addr_hash, i.e. for each DB page which must be changed with log records, recv_recover_page() is invoked. It can be invoked either from recv_apply_hashed_log_recs(), if the page is already in the buffer pool, or from buf_page_io_complete() on I/O completion, i.e. just after the page was read from storage. Applying log records on page read completion is necessary and very convenient. Assume the log records have not yet been applied, as we had enough memory to store all of the recovery log records, but we want, for example, to boot the DB dictionary. In this case any records that concern the pages of the dictionary will be applied to those pages just after they are read from storage into the buffer pool.

The function which applies log records to a certain page is recv_recover_page_func(). It gets the list of log records for the given page from the recv_sys->addr_hash hash table, and for each element of this list it compares the LSN of the last page change with the LSN of the record; the log record is applied to the page only if the record’s LSN is newer than the page’s.

After applying all log records from the archived logs, xtrabackup writes the last applied LSN to the (min|max)_flushed_lsn fields of each data file and finishes execution. The log recovery process rolls back all unfinished transactions unless this is forbidden with the innodb-force-recovery parameter.

Conclusion

We have covered the processes of redo log writing and recovery in depth. These are very important processes, as they provide data consistency on crashes. These two processes became the basis for the log archiving and applying features. As log records can describe any data change, the idea is to store these records somewhere and then apply them to backups to organize some kind of incremental backup.

These features were implemented a short time ago and are not yet widely used. So if you have something to say about them, you are welcome to comment and start a discussion.

The post Innodb redo log archiving appeared first on MySQL Performance Blog.

Percona Server 5.6.16-64.2 with TokuDB engine Beta is now available

Fri, 28/03/2014 - 02:58

Percona Server version 5.6.16-64.2 with TokuDB engine

Percona is glad to announce the first BETA release of Percona Server 5.6.16-64.2 with TokuDB engine on March 27th, 2014. Downloads are available here and from the Percona Software Repositories.

Based on Percona Server 5.6.16-64.2 including all the features and bug fixes in it, and on TokuDB 7.1.5-rc.4, Percona Server 5.6.16-64.2-tokudb is the first BETA release in the Percona Server 5.6 with TokuDB engine series. All of Percona‘s software is open-source and free, all the details of the release can be found in the 5.6.16-64.2 milestone at Launchpad.

New Feature

  • Installer will now automatically install and enable TokuDB engine on new installations.

More information on how to install and use TokuDB can be found in the documentation. This feature is currently considered BETA quality.

Release notes for Percona Server 5.6.16-64.2 with the TokuDB storage engine are available in our online documentation. We did our best to eliminate bugs and problems during the testing of this release, but this is software, so bugs are expected. If you encounter them, please report them to our bug tracking system.

The post Percona Server 5.6.16-64.2 with TokuDB engine Beta is now available appeared first on MySQL Performance Blog.

A conversation with 5 Facebook MySQL gurus

Thu, 27/03/2014 - 18:00

Facebook, the undisputed king of online social networks, has 1.23 billion monthly active users collectively contributing to an ocean of data-intensive tasks – making the company one of the world’s top MySQL users.

A small army of Facebook MySQL experts will be converging on Santa Clara, Calif. next week, where several of them are leading sessions at the Percona Live MySQL Conference and Expo. I had the chance to chat virtually with four of them about their sessions: Steaphan Greene, Evan Elias, Shlomo Priymak and Yoshinori Matsunobu. Mark Callaghan, who spoke at Percona Live last year, also joined our conversation, which included a discussion of Facebook’s use of MySQL and other open source technologies.

Tom: What’s Facebook’s view of Open Source?

Steaphan: Facebook was built on open source software, and we still invest heavily in open source today. We understand the power these communities have to drive innovation – they allow us to focus on new challenges, as opposed to reinventing the wheel over and over again. And contributing as much as possible back to open projects is in everyone’s best interest.

Tom: Why MySQL? Wouldn’t NoSQL databases, for example, be better suited for the massive workloads seen at Facebook?

Mark: MySQL is great for many of our important workloads. We make it even better with our expertise in MySQL operations and engineering, and by working with the community and learning from their experience.

Yoshinori: I have not been able to find a transactional NoSQL database better than InnoDB. And it’s easy to understand how MySQL replication works, which makes it much easier to fix problems in production.

 Tom: How does Facebook make MySQL scale?

Steaphan: Sharding, automation, monitoring, and heavy investment in operations and performance engineering.

 Tom: What other things help Facebook run smoothly?

Steaphan: Our completely open culture, and the freedom all engineers here have to try any idea they have.

Tom: What is the top scaling challenge(s) Facebook faces in 2014 – and beyond?

Mark: Our biggest challenge is to make things better (performance, efficiency, availability) in the future at the rate we made things better in the past.

Yoshinori: Availability has improved a lot so far for us. Come to my session at Percona Live to hear about that. For me personally, efficiency is the biggest challenge for 2014 and 2015. This includes reducing space and optimizing for newer-generation hardware.

Tom: Facebook deployed MySQL 5.6 last year – including on critical environments – long before many other large organizations. What prompted such a move so soon? And where there any major concerns?

Steaphan: The same thing that prompts most efforts on the Facebook infra team: We will consider any technology that will help us improve performance, efficiency, or reliability, and we’re willing to accept the risk that sometimes comes along with adopting things like 5.6 very early on. But that’s only half the story here. The other half is that Facebook encourages its engineers to go after big bets like these — in this case it was just one engineer who made this happen. And we had the MySQL engineering talent we needed to work with the Oracle team to get 5.6 ready for production at our scale.

Yoshinori: At Facebook, we have three MySQL teams — Operations, Performance and Engineering. Facebook is one of the very few MySQL users that has internal MySQL developers. We all worked hard to adapt 5.6 to our scale and ensure that it would be production-ready. We found some issues after production deployment, but in many cases we could fix the problem and deploy a new MySQL binary within one or two days. When deploying in production, we expected that we would encounter MySQL 5.6-specific issues, which is typical when releasing new software. We were simply confident that we could fix issues immediately.

Our 5.6 deployment was not done all at once. At the first rollout, we disabled most major 5.6 features, such as GTID and binlog checksums. We gradually enabled these features in production.

Tom: Where there any significant issues in that move to MySQL 5.6? Any lessons learned you like to share – along with best practices you’d like to share?

Yoshinori: Performance regression of CPU-intensive replication was the main blocker for some of our applications. I wrote a blog post about this last year. We have several design plans to fix the problem on the MySQL side. One of the most effective is grouping multiple transactions into one, since the most expensive part is writing to the InnoDB system table at transaction commit. This optimization would be done when writing the binary log, or by the SQL thread. It may take a longer time to test and deploy in production. For existing applications, we grouped multiple transactions on the application side to mitigate the problem.

Tom: Performance monitoring is usually challenging at any organization. How do you do that at Facebook, which has tens of thousands of MySQL instances?

Yoshinori: Top-N monitoring is very important for managing a huge number of instances. Average statistics (for example: average innodb_rows_read across all instances) are not always useful, since ~1% of problematic instances won’t noticeably change the average numbers. p99 gives better indicators, but in our environment we typically have fewer than 0.1% of instances causing problems, in which case p99 is not helpful either. We have several graphical and command-line tools to efficiently list the top-N badly behaving hosts. After listing the bad instances, the way to investigate the root cause is pretty straightforward, like what MySQL consultants usually do. Server failure is something we expect and plan for at Facebook. For example, a typical MySQL DBA at a small company may never encounter a master instance failure during their employment, because recent mysqld and hardware are stable enough. At Facebook, master failure is the norm and something the system can accommodate.

Tom: Evan, you and Yoshinori will present on Global Transaction ID (GTID) at Facebook. GTID is very tricky to deploy in an existing large-scale environment – how, and why, did you decide to adopt it?

Evan: Our primary motivations for adopting GTID all relate to either failover or binlog backups. When a master fails, getting replicas in sync with GTID is substantially simpler, faster, and less error-prone than previous methods of diffing binlogs. For backups, GTID is a cornerstone in building cross-datacenter point-in-time recovery, without needing redundant binlog streams from every region.
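
For readers unfamiliar with the mechanics, the gain comes from GTID auto-positioning: in stock MySQL 5.6, repointing a replica at a new master reduces to roughly the following (a sketch with a placeholder hostname, not Facebook’s actual failover tooling):

-- Sketch only: GTID-based repointing, no binlog file/position diffing required
STOP SLAVE;
CHANGE MASTER TO
    MASTER_HOST = 'new-master.example.com',   -- placeholder host
    MASTER_AUTO_POSITION = 1;
START SLAVE;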

The “how” question is a bit more involved, and we’ll be covering this in detail during the session. The GTID project was a joint effort between three of Facebook’s MySQL teams. Santosh added new functionality to the MySQL server to make online rollout possible, and Yoshinori improved MHA to seamlessly support GTID-based failover. I added GTID support to all of our other in-house automation, and also scripted the rollout procedure across our many thousands of replica sets. A lot of validation logic and monitoring functionality was involved to ensure the safety of the rollout.

Tom: Shlomo, your session is titled “Under the Hood – MySQL Pool Scanner (MPS).” As you point out in your talk, Facebook has one of the largest MySQL database clusters in the world, comprising thousands of servers across multiple data centers. You must have an army of DBAs – or is there some secret you’d like to share? 

Shlomo: We do have an army, yes — it’s an “army of one.” We have one person on call on the MySQL Operations team at a given time, and they don’t even need to do all that much most days. We built “robots” to do our day to day jobs. The largest and most complex robot we have is MPS, an automated system to do most of the work a DBA might otherwise spend time on, such as replacement of broken or overflowing servers. Among other things, MPS also allows a human to initiate complex bulk operations with a few keystrokes, and it will follow up and complete the operations over the course of days or weeks.

I’ll be describing some of the complex MySQL automation systems we have at Facebook, and how they fit together during my talk.

Tom: Shlomo, what does a typical day look like for you there at Facebook?

Shlomo: The team’s work mostly focuses on maintaining those robots I’ve mentioned, as well as developing new ways to improve the reliability of our databases for Facebook’s users. This year the team also spent a lot of time making sure new MySQL features such as GTIDs and semi-sync replication are deeply integrated in our automation. Every day we work hard to make ourselves obsolete, but we haven’t gotten there just yet!

On a typical day, I probably spend much of the time coding, mainly in Python. I also spend a significant amount of time working on capacity-related projects, such as thinking of ways to optimize the way we distribute the data across our fleet of servers.
Even after 2.5 years at Facebook, I am still in awe of the number of servers we manage. The typical small-scale maintenance operation at Facebook probably involves more servers than all the companies I’ve previously worked for had, combined. It really is pretty amazing!

Tom:  What are you looking forward to the most at this year’s conference?

Evan: There are plenty of fascinating sessions this year. Just to mention a handful: Jeremy Cole and Davi Arnaut’s session on innodb_ruby, since it’s a very unique way to interactively learn about InnoDB’s internals. Baron Schwartz’s session on using Go with MySQL, as VividCortex is blazing the trail here. Peter Boros and Kenny Gryp’s talk on scalability and benchmarking, which I’m hoping will include recent developments of Percona Playback. Tom Christ’s session on my former project Jetpants, to see how it has evolved over the past year at Tumblr. And several talks by Oracle engineers about upcoming functionality in MySQL 5.7.

Steaphan: In addition to the conference sessions, I look forward to the birds of a feather session with the MySQL team.  Last year, it proved to be a valuable opportunity to engage with those upstream developers who make the changes we care about, and I expect the same this year.

Tom: If you could talk to a DBA or developer on the fence about attending this year’s conference, what would be your top 3-5 reasons for making it over to Santa Clara for this event?

Evan: I’m based in NYC, so I’m traveling a bit further than many of my colleagues, but I can still confidently say that Percona Live is well worth the trip. The MySQL ecosystem is very healthy and constantly evolving, and the conference is the best place to learn about ongoing developments across a wide spectrum of companies and contributors. It’s also a perfect opportunity to personally connect with all of the amazing engineers, DBAs, users, and vendors that make MySQL so unique and compelling.

 The Percona Live MySQL Conference and Expo 2014 runs April 1-4 in Santa Clara, Calif. Use the code “SeeMeSpeak” when registering and save 10 percent. The inaugural Open Source Appreciation Day is on March 31 – this full-day event is free but because space is limited I suggest registering now to reserve your spot.

The post A conversation with 5 Facebook MySQL gurus appeared first on MySQL Performance Blog.

Explosive OpenStack growth at the Percona Live MySQL Conference & Expo

Thu, 27/03/2014 - 00:34

The Percona Live MySQL Conference and Expo 2014 runs April 1-4  at the Santa Clara Convention Center and Santa Clara Hyatt – but be sure to get there a day early for the inaugural Open Source Appreciation Day on March 31 – this event is free but because space is limited I suggest registering now to reserve your spot.

Interest is particularly high in OpenStack and its intersection with MySQL, and the Open Source Appreciation Day and conference sessions and events reflect this. Last year, the Percona Live MySQL Conference included 4 talks with Trove (“Red Dwarf” at the time) as the principal topic. This year there are 14 OpenStack-related talks and events:

  • OpenStack Today on Monday, March 31 features 6 talks devoted to discussing the intersection of MySQL and OpenStack
  • 2 keynote talks
  • 4 breakout sessions
  • 1 Birds of a Feather session
  • 1 tutorial

In addition to Trove, sessions will focus on operating OpenStack clouds, the MySQL core, and hybrid deployments.

Beyond the OpenStack sessions, the Percona Live conference features a total of 9 keynotes, 13 tutorials, 112 breakout sessions, 8 Birds of a Feather sessions, and two great networking receptions. There is also the Open Source Appreciation Day on Monday featuring the OpenStack Today and CentOS Dojo Santa Clara events.

The OpenStack Today event is free but as I mentioned above preregistration is required and space is limited. Visit the event page to learn more and to register.

The full conference schedule is available by visiting the Percona Live site. Take advantage of the full conference slate of activities by registering now. Use discount code “OpenStack-Today” to receive $200 off your registration.

The full list of OpenStack events at the Percona Live MySQL Conference and Expo 2014 is:

Monday, March 31

1:00 pm to 6:00 pm – As part of Open Source Appreciation Day, Percona is holding OpenStack Today. This is a chance for members of both MySQL and OpenStack communities to interact and learn. This event is sponsored by Tesora and hastexo.

Speakers and topics include:

  • Florian Haas, CEO, Principal Consultant, hastexo: “What is OpenStack, and what’s in it for DBAs?”
  • Jay Pipes, Principal Technical Architect, Mirantis: “Tales from the Field: Backend Data Storage in OpenStack Clouds”
  • Jay Janssen, Principal Consultant, Percona: “Percona XtraDB Cluster in OpenStack”
  • Amrith Kumar, Founder and CTO, Tesora: “An Elastic Parallel Data-as-a-Service (DaaS) platform with OpenStack Trove and Tesora”
  • Sandro Mazziotta, Senior Director, Product Management, eNovance: “Challenges and lessons learned with OpenStack deployments and MySQL”
  • Sebastian Stadil, Founder, Scalr: “Autoscaling, replication, and failover: cloud native MySQL on OpenStack”

Tuesday, April 1

1:30 pm to 4:30 pm
Hands On Trove: Database as a Service in OpenStack (for MySQL)
Florian Haas, CEO, Principal Consultant, hastexo

Wednesday, April 2

3:50 pm to 4:40 pm
Virtually Available MySQL, or How to Stop Worrying and Love the Cloud
Robert Hodges, CEO, Continuent

6:00 pm to 7:00 pm
BOF: Running A Successful Trove’d OpenStack Installation
Michael Basnight, Sr. Software Developer, Rackspace
Amrith Kumar, Founder & CTO, Tesora (formerly ParElastic)

Thursday, April 3

9:35 am to 10:00 am
Keynote: MySQL, Private Cloud Infrastructure and OpenStack
Sean Chighizola, Senior Director Database Administration, Big Fish Games

Friday, April 4

9:35 am to 10:10 am
Keynote: OpenStack Coopetition, A View From Within
Boris Renski, Co-Founder and CMO, Mirantis, Member of OpenStack Board of Directors

11:00 am to 11:50 am
OpenStack Trove – Database as a Service
Vipul Sabhaya, Lead Software Developer, HP
Michael Basnight, Sr. Software Developer, Rackspace
Patrick Galbraith, Senior Systems Engineer, Hewlett Packard

12:50 pm to 1:40 pm
Avoiding Pain When Running MySQL in the Cloud
Neil Armitage, Deployment and Support Engineer, Continuent

1:50 pm to 2:40 pm
Building Globally Available Storage Layers
Art van Scheppingen, Head of database engineering, Spil Games

I look forward to seeing you next week in Santa Clara!

The post Explosive OpenStack growth at the Percona Live MySQL Conference & Expo appeared first on MySQL Performance Blog.

Database security: Why should you review yours?

Wed, 26/03/2014 - 03:27

Ah, database security… the black sheep of topics, and something you would really rather not have to deal with, right?

I mean, surely all the fanfare and paranoia is reserved for the neckbeards in tinfoil hats who live in their own D.I.Y. Faraday cages… that must be it… it just has to be?

No, the hard reality is that the world is not rose-tinted and “they” are out to get you, be it for fun or for profit; from defacement to theft, compromising your applications, and more importantly your data, is big business. For some, these acts are nothing short of sheer entertainment for an otherwise boring evening. (I’ll be speaking about this topic next week in much more detail at the Percona Live MySQL Conference and Expo in Santa Clara, California. My session, “Security and why you need to review yours,” will go into much more detail on April 2. Use the code “SeeMeSpeak” on the registration page and save 10 percent.)

Note that I’m avoiding any discussion of corporate espionage and government spying, as this reinforces the image of it all being cloak and dagger; admittedly some of it is… but the part that’s going to affect you, your business and your livelihood isn’t.

It’s time for that wake-up caffeine infusion: drink the “Kool-Aid.” This is not something you can shun and pretend is a monster under the bed or in the closet that doesn’t really exist. Unlike the “bogey man,” these threats are real and have real impacts on people’s livelihoods.

F.U.D.? The fear part I certainly do want to convey here; a level of fear is healthy: it keeps you alert. The uncertainty and doubt? No, those should and will be removed, so please allow me to continue.

Removing Uncertainty
As with anything that requires research, I’m sure you, the reader, would carry out proper “due diligence.” You wouldn’t want anything to adversely affect your product or business; taking hosting as an example, you’ll look at the provider’s reputation, their S.L.A. terms, and so on.

What about their security credentials? PCI / SOX / HIPAA… there are numerous classifications.

“But I don’t need PCI / SOX / HIPAA!” To this I say: you need a compromise of your application or business even less… What’s the harm in asking the provider whether they have been through any regulatory compliance audits? I just don’t get the stigma some people seem to feel about asking a security-related question. Remember, when deploying your application you’re building on your hosting provider’s infrastructure.

“A foolish man built his house on sand”: in short, if your foundations are not sound, you’re opening yourself up to failure and compromise. For example, PayPal and GoDaddy were socially engineered, resulting in the loss of a $50K Twitter username, and Barclays suffered a £1.37M theft; both come down to the same kind of failure, the term for which is “social engineering”… which really is just a new term for conning someone into doing what you want them to do.

“The art of the con” is hardly anything new and has been around for centuries; take Victor Lustig, whose infamous feat of con artistry was to sell the Eiffel Tower for scrap… twice.

Dispelling Doubt
“By failing to prepare you are preparing to fail” - Benjamin Franklin

Let’s look at this a little more with some common misconceptions.

“I don’t really need to look at security; my project/business is small and will not be attacked.” I’d liken this statement to saying you don’t need seat belts and airbags because you’re a careful driver, or that you can drive at night with no lights on because “I have good night vision.” You have safety and security measures in your everyday life which, because they are part of the routine, are not thought about: locks on doors; car, home and business alarms; CCTV; GPS locators for phones, cars, tablets and laptops… We need to eliminate the thinking that information security is anything other than a requirement which should form part of our everyday “norms.”

“Security is too expensive.” Have you looked at the cost of a compromise of your system? How much would the potential loss of your entire customer base cost you? …Not looking quite so expensive now, is it? Liken an investment in security to an investment in high availability: you want your application to be “always on,” so why do many think “secure” is prohibitively expensive to achieve?

“We simply don’t have the resources to implement security measures.” Yet you have the resources for development, DBAs and sysadmins? One of the best ways to introduce security into your application is “from the ground up,” so that it becomes part of your general practice – this requires a security-oriented state of mind.

What many fail to realize is that, assuming your business is successful, you already have a mindset oriented toward best practices that work for your business and produce an application or service of value; minor tweaks to this could also introduce a mindset of security.

Remediation - “the action of remedying something, in particular of reversing or stopping environmental damage.”

It’s not going to be as painful or expensive as you may think; the first and most powerful step is a minor change in development and sysadmin attitudes so that the security implications of code, services and configurations are considered. Let’s bring back the healthy attitude of asking questions.

Do I really need to disable SELinux to get this to work? The answer, of course, is no, you shouldn’t; this is the same vein of thought as “do I really need to chmod this 777 to make it work?”

Does this service really need to be installed? For example, bluetoothd doesn’t need to be running on your production machines.

We’re adding a user input form, so we should really sanitize the input. This seems obvious to most now, though it was met with just as much “resistance to change” before it became a best-practice standard.
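
As one hedged illustration of the principle (parameterization instead of string concatenation; the table and column names here are invented for the example), even at the SQL layer a prepared statement keeps user input bound as data rather than spliced into the query text:

-- Illustration only: the value in @email is treated as data, never as SQL
PREPARE find_user FROM 'SELECT id FROM users WHERE email = ?';
SET @email = 'visitor@example.com';   -- imagine this arrived from the form
EXECUTE find_user USING @email;
DEALLOCATE PREPARE find_user;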

Does MySQL really need to be accessible from everywhere on the internet? Again, this may seem obvious to most now, though it was, and sometimes still is, met with resistance of the form “I may need to run queries from home, off my mobile, from the open wifi at the local coffee shop…” (to those of a security-oriented nature, I apologize for this statement; I can sense the cringing in progress as this is read…).
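
A minimal sketch of tightening that up (the subnet, schema and account names below are placeholders, not recommendations for any particular network): bind mysqld to a private interface with bind_address in my.cnf rather than listening everywhere, and scope application accounts to the networks that actually need them instead of '%':

-- Sketch only: placeholder subnet, schema and account names
CREATE USER 'app'@'10.0.0.%' IDENTIFIED BY 'use_a_strong_password_here';
GRANT SELECT, INSERT, UPDATE, DELETE ON appdb.* TO 'app'@'10.0.0.%';
-- and audit which accounts are already wide open:
SELECT user, host FROM mysql.user WHERE host = '%';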

The above are just small examples of reducing your attack surface – the potential entry points into your system, network or application which could be attacked.

The thinking that you need to invest a lot of money in some expensive appliance or magic solution is for the most part misguided; throwing a “security blanket” over or in front of your application isn’t going to be as effective as making sure security is considered at every layer of your business.

Over the coming months I will be working on more related blog posts and webinars – and a reminder that I’ll also be giving a talk on “Security and why you need to review yours” at Percona Live in Santa Clara on April 2.

The post Database security: Why should you review yours? appeared first on MySQL Performance Blog.