
RAM flakier than expected

Ref: Google: Computer memory flakier than expected (CNET Deep Tech, Stephen Shankland)

Summary: according to tests at Google, today’s RAM modules appear to suffer several thousand errors a year. Those errors would be correctable, if it weren’t for the fact that most of us aren’t using ECC RAM.

Previous research, such as some data from a 300-computer cluster, showed that memory modules had correctable error rates of 200 to 5,000 failures per billion hours of operation. Google, though, found the rate much higher: 25,000 to 75,000 failures per billion hours.
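To put those rates in perspective, here’s a back-of-the-envelope calculation. It’s a minimal sketch assuming the quoted figures are normalised per megabit of DRAM (as in the underlying study) and a 1 GB module running around the clock; neither assumption is stated in the article itself.

```python
# Rough expected error count per memory module per year.
# Assumptions (mine): the rates are failures per billion hours *per megabit*,
# and the module is a 1 GB DIMM (8192 Mbit) powered on 24x7.

HOURS_PER_YEAR = 24 * 365            # ~8760 hours
MODULE_MBIT = 1 * 1024 * 8           # 1 GB expressed in megabits

for rate in (25_000, 75_000):        # failures per billion hours per Mbit
    per_year = rate / 1e9 * MODULE_MBIT * HOURS_PER_YEAR
    print(f"{rate:>6} -> ~{per_year:,.0f} errors per module per year")
```

That works out to very roughly 1,800 to 5,400 errors per module per year, which is where the “several thousand errors a year” in the summary comes from.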

This is quite relevant for database servers, because they write a lot rather than mostly read (as is typical of desktop use). In the MySQL context, if a bit gets flipped in RAM, your data could get corrupted on its way to disk, or the data on disk may be fine and you’re simply reading back a corrupted copy. While using more RAM is good for performance, it also means a bigger RAM footprint for your data and thus more exposure to the issue.
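To make that concrete, here’s a tiny illustration (a made-up sketch, nothing MySQL-specific): one flipped bit in an in-memory buffer silently changes a value before it ever reaches disk, and without ECC nothing notices.

```python
import struct

# A made-up "row" sitting in a write buffer: an id plus a balance in cents.
row = bytearray(struct.pack("<Iq", 42, 1_000_000))   # id=42, balance=$10,000.00

# Simulate a single-bit DRAM error: flip one bit inside the balance field
# (byte 6 of the row, i.e. bit 20 of the 64-bit balance).
row[6] ^= 0x10

row_id, balance = struct.unpack("<Iq", row)
print(row_id, balance)   # 42 2048576 -- the balance is now silently wrong
```

Whether that buffer is then flushed to disk or handed straight back to a client, the wrong value looks perfectly valid to everything downstream.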

In MySQL 5.0 and the current 5.1 GA releases, the binary and relay logs do not have checksums on log events. If something gets corrupted anywhere on disk or on its way to disk, garbage comes out the other end, and we have seen instances of exactly that happening. There are patches that add a checksum to the binlog event structure (Google has worked on this), and we’ll be pushing urgently for this to be ported into MariaDB 5.1; it’s no use having it only in later versions. It does change the on-disk format, but so be it. This is very, very important stuff.
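To show what an event checksum buys you, here’s a minimal sketch of the general idea; the framing below is made up for illustration and is not the on-disk format used by the actual patches. Each event gets a CRC32 appended when written and verified when read back, so corruption raises an error instead of being replayed as garbage.

```python
import struct
import zlib

def write_event(payload: bytes) -> bytes:
    """Frame an event as length + payload + CRC32 (illustrative format only)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("<I", len(payload)) + payload + struct.pack("<I", crc)

def read_event(frame: bytes) -> bytes:
    """Unframe an event and verify its checksum; fail loudly on mismatch."""
    (length,) = struct.unpack_from("<I", frame, 0)
    payload = frame[4:4 + length]
    (stored_crc,) = struct.unpack_from("<I", frame, 4 + length)
    if zlib.crc32(payload) & 0xFFFFFFFF != stored_crc:
        raise IOError("binlog event checksum mismatch -- corruption detected")
    return payload

event = write_event(b"UPDATE accounts SET balance = balance - 100 WHERE id = 42")

damaged = bytearray(event)
damaged[10] ^= 0x01            # simulate a bit flip on the way to disk
read_event(bytes(damaged))     # raises IOError rather than returning garbage
```

Without the checksum, the damaged event would simply be applied (or replicated) as-is.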

FYI, InnoDB does use page checksums, which are also stored on disk. There is an option to turn them off, but our general recommendation would be not to do that 😉 What about the InnoDB log files (ib_logfile*), though? Normally they just refer to pages which at some stage get flushed, but a) if, through a glitch, they end up referring to a different page, that could lose some committed data, and b) on recovery, a corrupted log record could directly affect data. Mind you, I’m conjecturing here; more research is necessary!

Naturally this doesn’t just affect database systems; file systems too can easily suffer from RAM glitches, probably with the exception of ZFS, since it has checksums everywhere and keeps them separate from the data.

The same goes for anything that keeps data around in RAM and/or is write-intensive. Memcached, for instance! How do other database systems fare in this respect?

Note: this post is not intended to be alarmist; I just think it’s good to be aware of these things so they can be taken into account when designing systems. If you look closely at any system, there are things that can potentially be cause for concern. That doesn’t mean we shouldn’t use them, per se.
