MySQL Replication Lag
Monday, September 29, 2008
MySQL replication is completely asynchronous, meaning that the front-end application (often a web site) does not have to wait for replication to happen on each transaction and this keeps the master database operating efficiently.
However, having a completely asynchronous replication solution means that the slave database is never 100% up to date. There is an even more serious issue with MySQL replication because it is single-threaded in nature to ensure transactions are applied to the slave database in the same order that they were applied to the master. This means that if the master database is under load with 100 concurrent transactions for a period of time and the slave is only able to process 1 concurrent transaction due to the single-threaded implementation then the slave database will very quickly fall behind the master database and this could mean thousands or tens of thousands of rows that are not replicated and if the master database fails then these transactions could be lost. I'm sure most companies would find this unacceptable.
I'm curious how developers and DBAs view this issue. It seems unlikely that anyone is entirely comfortable with an unreliable "best effort" replication system for failover although it might be acceptable for scalability if the application is intelligent enough to work around any lag issues optimistically by querying a slave first and then going to the master if the slave is not up to date.
Google have contributed open source code for SemiSyncReplication which provides a partial solution to the problem. Their approach is to have the master sychronously replicate the transaction to the slave mysql instance when the transaction is committed on the master but does not wait for the slave to actually write the transaction to disk, therefore maintaining a sensible balance between performance and reliability.
How do you handle the issue of MySQL replication lag in your organization? Do you use Google SemiSyncReplication or have you found other products/tools to mitigate against replication lag?