Two-phase Commit in MySQL
The two-phase commit in MySQL is not a two-phase commit of a distributed transaction, but a two-phase commit between redo and Binlog after the Binlog is turned on.
Overall, two-phase commit:
- prepare phase - call
MYSQL_BIN_LOG::prepare
- redo log
prepare
-innobase_xa_prepare
, i.e., the transaction is fully recorded in the redo log and syncs the redo log file to disk (as necessary)
- redo log
- commit phase
- innodb释放锁,释放回滚段,设置提交状态
- write Binlog- call
MYSQL_BIN_LOG::commit
- redo log
commit
-innobase_commit
Prepare phase
- When MySQL gets a request to
COMMIT
, it acts as a coordinator for the transaction (a “transaction coordinator”). The first step is to callMYSQL_BIN_LOG::prepare
, which tells each storage engine to commit resources to the transaction and ensure that the transaction will succeed. In InnoDB, this call is eventually resolved toinnobase_xa_prepare
where in InnoDB ensures that the transaction is fully recorded in the redo log and syncs the redo log file to disk (as necessary). At this point, InnoDB considers the transaction “prepared”, but not yet committed. Should MySQL crash now, InnoDB would roll-back the in-flight transaction on recovery.
Commit phase
- Then, once MySQL has received acknowledgement from all storage engines involved in the transaction (i.e. they’ve all returned a “yes” vote),
MYSQL_BIN_LOG::commit
will be called and the transaction will be recorded to the binary log. Next, the storage engines will be instructed to commit their changes. In InnoDB, this call is eventually resolved toinnobase_commit
wherein InnoDB lazily logs that the previously-prepared transaction is now “committed”. At this point, the transaction is effectively committed and present in both the binary log and the InnoDB redo log.
So, you may be wondering: what happens if MySQL crashes after writing its own binary log events but before calling innobase_commit
? Well, in this case, when MySQL starts back up, in checks whether or not its binary log was safely closed. In this case, it wouldn’t have been, so MySQL will collect a list of XA transaction ids (Xid
s) from the binary log and tell each storage engine to commit them—if they exist and have not yet been committed. Remember, this is safe because, by this time, all storage engines have already recorded the transaction before voting “yes” during the prepare phase.
Logs
If after redo log is prepare
, binlog before the outage, rollback the transaction, the log is as follows:
2015-07-29 17:03:1821957 [note starting crash recovery ...
2015-07-29 17:03:18 7ffff7fe4780 innodb:starting recovery for XA transactions ...
2015-07-29 17:03:18 7ffff7fe4780 innodb:transaction 35077 in prepared state after recovery
2015-07-29 17: 03:18 7ffff7fe4780 innodb:transaction contains changes to 1 rows
2015-07-29 17:03:18 7ffff7fe4780 InnoDB: 1 transactions in prepared state after recovery
2015-07-29 17:03:18 21957 [note Found 1 prepared Transaction (s) in innodb
2015-07-29 17:03:18 21957 [note] rollback XID ' mysqlxid\1\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0 '
If writing the binlog is done and then the outage occurs, the transaction will be recovered by commiting the transaction.
2015-07-29 17:06:23 7ffff7fe4780 innodb:starting Recoveryfor XA transactions ...
2015-07-29 17:06:23 7ffff7fe4780 innodb:transaction 35590 in prepared state after recovery
2015-07-29 17: 06:23 7ffff7fe4780 innodb:transaction contains changes to 1 rows
2015-07-29 17:06:23 7ffff7fe4780 InnoDB : 1 transactions in prepared state after recovery
2015-07-29 17:06:23 22040 [note" Found 1 prepared Transaction (s) in innodb
2015-07-29 17:06:23 22040 [note] commit XID Mysqlxid\1\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0 '
2015-07-29 17:06:23 22040 [Note Crash recovery finished.
Failover
Principle
- 如果redo log里面的事务是完整的,也就是已经有了commit标识,则直接提交;
- 如果redo log里面的事务只有完整的prepare,则判断对应的事务binlog是否存在并完整:
- 如果是,则提交事务;
- 否则,回滚事务。
Situation
时刻A:写入redo log处于prepare阶段之后,写binlong之前,放生了crash。由于此时binlog还没写,redo log 也还没提交,所以崩溃恢复的时候,事务会回滚。这时候binlog还没写,所以也不会传到备库。
时刻B:对应原则2.1。
MySQL引入了binlog-checksum参数用来验证binlog的正确性以此验证事务binlog的正确性。而redo log 和 binlog 的数据格式有一个共同的数据字段 - XID。 崩溃恢复的时候,会按顺序扫描redo log:
- 如果碰到既有prepare, 又有commit的redo log,就直接提交;
- 如果碰到只有prepare,而没有commit的redo log, 就拿着这个redo log的XID 去binlog 找对应的事务。以此关联起redo log和binlog。
- 如果对应的事务binlog存在并完整,则commit
- 如果没有,则rollback
两阶段提交是经典 的分布式系统问题。举例来说,对于InnoDB引擎,如果redo log提交完了,事务就不能回滚。而如果redo log直接提交,然后binlog 写入失败,则InnoDB又回滚不了,数据和binlog日志又不一致了,因为存在binlog和redo log两种日志且存在主从环境,就需要两阶段提交。两阶段提交就是为了给所有人一个机会,当每个人都说“我OK”的时候,再一起提交。
那如果只用binlog做崩溃恢复,避免费事的两阶段提交可以吗?
答案是binlog是不能做崩溃恢复的。原因1 - binlog 没有能力恢复“数据页”。如下图因为binlog 只记录了逻辑操作,而非像redo log记录数据页磁盘的数据变更。当mysql crash时候,例如binlog1 记录的事件,如果没有WAL 刷盘,binlog是会丢数据的。也就是binlog没有能力恢复数据页。 [
Reference
- https://dev.mysql.com/doc/refman/8.0/en/xa.html
- MySQL 实战 45 讲 - https://time.geekbang.org/column/intro/100020801?tab=catalog
- https://smartkeyerror.oss-cn-shenzhen.aliyuncs.com/Psyduck/distributed-system/2PC.pdf
- https://www.burnison.ca/notes/fun-mysql-fact-of-the-day-everything-is-two-phase
- https://gsmtoday.github.io/2019/02/15/mysql-data-relable/