
Trivia: identify this replication failure

We got good responses to the “identify this query profile” question. Indeed, it indicates an SQL injection attack. That is obviously a code problem, but you must also think about “what can we do right now to stop this”. See the responses and my last note on it below the original post.

Got a new one for you!

You find a system with broken replication; it could be a slave or one server in a dual-master setup. The IO thread is still running, but the SQL thread is not, and the last error is the following (yes, the error string is exactly this, very long; sorry I did not paste this string into the original post, it was added in a later update):

“Could not parse relay log event entry. The possible reasons are: the master’s binary log is corrupted (you can check this by running ‘mysqlbinlog’ on the binary log), the slave’s relay log is corrupted (you can check this by running ‘mysqlbinlog’ on the relay log), a network problem, or a bug in the master’s or slave’s MySQL code. If you want to check the master’s binary log or slave’s relay log, you will be able to know their names by issuing ‘SHOW SLAVE STATUS’ on this slave.”

In other similar cases the error message refers to something else, but the query shown with it makes no sense. To me, that essentially indicates the same thing as the error above.
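
For reference, the log names the error message refers to can be pulled straight from the slave; the columns below are regular SHOW SLAVE STATUS fields, and the values you get back will of course differ per server:

    SLAVE> SHOW SLAVE STATUS\G
    -- Relay_Log_File / Relay_Log_Pos: the relay log you would inspect with mysqlbinlog
    -- Master_Log_File / Read_Master_Log_Pos: where the IO thread is reading on the master
    -- Relay_Master_Log_File / Exec_Master_Log_Pos: the last master binlog position the SQL thread executed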

The server appears to have been restarted recently.

What’s wrong, and what’s your quickest way to get replication going again given this state?


5 thoughts on “Trivia: identify this replication failure”

  1. What’s wrong: the relay log file is corrupted, e.g. the server was restarted while it was being written.
    Possible solution: check out the master log file/position in “show slave status” (not the relay log one), and try “stop slave; change master to …; start slave”, thus reading from that same position again, as sketched below. The “change master to” will discard the existing relay log(s).
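
    A minimal sketch of that, assuming the file name and position come from the Relay_Master_Log_File and Exec_Master_Log_Pos columns of SHOW SLAVE STATUS (the values below are placeholders):

    SLAVE> STOP SLAVE;
    SLAVE> CHANGE MASTER TO master_log_file='mysql-bin.000042', master_log_pos=107;
    SLAVE> START SLAVE;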

  2. Without knowing the exact error it’s not possible to make a real decision; however, it’s highly likely the master DB has crashed and its binlog has become corrupted. The slave DB has then become stuck reading to the end of the master’s binlog file, with an “impossible position” type error.

    Thus the quickest way to get the slave working again is to get the master DB’s new binlog name from the column “Master_Log_File”, using the query:

    SLAVE> SHOW SLAVE STATUS;

    Then change the replication position with:

    SLAVE> STOP SLAVE; -- to stop the IO thread
    SLAVE> CHANGE MASTER TO master_host='', master_user='', master_password='', master_log_file='', master_log_pos=4;
    SLAVE> START SLAVE;

    This doesn’t ensure data integrity, but as mentioned above you need to know the nature of the error to start making decisions that ensure data integrity.

  3. replication event checksums anyone? 🙁
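
    For context: replication event checksums (the binlog_checksum, master_verify_checksum and slave_sql_verify_checksum variables) only appeared in MySQL 5.6, so depending on the server version this may be a wish rather than an option. A sketch of enabling them on a 5.6+ setup:

    MASTER> SET GLOBAL binlog_checksum = 'CRC32';
    MASTER> SET GLOBAL master_verify_checksum = 1;   -- master verifies events it reads back from the binlog
    SLAVE> SET GLOBAL slave_sql_verify_checksum = 1; -- SQL thread verifies events read from the relay log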

  4. Writing the binlogs and the actual table data is not atomic. With InnoDB, the binlog is only written after the transaction has been committed. When the server crashes (or is made to crash) while it is writing the binlogs, it might leave them unfinished, i.e. corrupt, while InnoDB rolls back the transaction.

    The sync_binlog and innodb_support_xa settings should prevent this; however, there can still be problems with binlogs that end up too short.
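
    A minimal sketch of those settings, assuming a server version where both variables exist (they can be set at runtime as shown, but would normally live in my.cnf so they survive a restart):

    MASTER> SET GLOBAL sync_binlog = 1;        -- sync the binlog to disk at every commit
    MASTER> SET GLOBAL innodb_support_xa = 1;  -- two-phase commit keeps InnoDB and the binlog consistent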

    To repair this there are several options. If there is only a small to moderate amount of data, using a fresh dump might be a solution. If this takes too long or is otherwise impractical, you need to find out up to where the slave executed correctly – the SHOW SLAVE STATUS command helps here. Then look at the master’s corresponding binlog and either set sql_slave_skip_counter appropriately (if the statement was actually executed correctly despite the crash) or, better, reset the replication with CHANGE MASTER TO and point it to the new binlog the master started when it was brought back up again.
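
    A minimal sketch of the skip-counter path, assuming you have verified (e.g. with mysqlbinlog) that the broken trailing event was in fact applied and can safely be skipped (the counter value is a placeholder):

    SLAVE> STOP SLAVE;
    SLAVE> SET GLOBAL sql_slave_skip_counter = 1;  -- skip the next event group from the relay log
    SLAVE> START SLAVE;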

    Anyway, maybe an mk-table-checksum based verification would be advisable to see if you actually got it right 🙂

  5. I agree with Shlomi’s example as the most likely cause. However, I’ve also seen the same behaviour when a router was silently garbling packets, with enough frequency that some would pass through the crude TCP checksum and produce invalid SQL. The solution was the same, though.
