In my previous posts I introduced two new conflict detection functions, NDB$EPOCH and NDB$EPOCH_TRANS without explaining how these functions actually detect conflicts? To simplify the explanation I'll initially consider two circularly replicating MySQL Servers, A and B, rather than two replicating Clusters, but the principles are the same.
Commit ordering
Avoiding conflicts requires that data is only modified on one Server at a time. This can be done by defining Master/Slave roles or Active/Passive partitions etc. Where this is not done, and data can be modified anywhere, there can be conflicts. A conflict occurs when the same data is modified at both Servers concurrently, but what does concurrently mean? On a single server, modifications to the same data are serialised by locking or MVCC mechanisms, so that there is a defined order between them. e.g. two modifications MX and MY are committed either in order {MX, MY} or {MY, MX}.
For the purposes of replication, two modifications MX and MY on the same data are concurrent if the order of commit is different at different servers in the system. Each server will choose one order, but if they don't all choose the same order then there is a conflict. Having a different order means that the last modification on each server is different, and therefore the final state of the data can be different on different servers.
One way to avoid conflicts is to get all servers to agree on a commit order before processing an operation - this ensures that all replicas process operations in the same order, waiting if necessary for missing operations to arrive to ensure no commit-order variance.
Note that commit-ordering is only important between modifications affecting the same data - modifications which do not overlap in their data footprint are unrelated and can be committed in any order. A system which totally orders commits may be less efficient than one which only orders conflicting commits.
Happened before
For the NDB$EPOCH asynchronous conflict detection functions, commit orders are monitored to detect when two modifications to the same data have been committed in different orders.
Given two modifications MX and MY to the same data, each server will decide a happened before (denoted ->) relationship between them :
- MX -> MY (MX happened before MY)
or - MY -> MX (MY happened before MX)
If all servers agree on order 1, or all servers agree on order 2 then there is no conflict. If there is any disagreement then there is a conflict.
In practice, disagreement arises because the same data is modified at both Server A and Server B before the Server A modification is replicated to B and/or vice-versa.
Sometimes when reading about commit ordering, the reason why commit orders should not diverge is lost - the only reason to care about commit ordering is because it is related to conflicting modifications and the potential for data divergence.
Determining happened before from the Binlog
We assume a steady start state, where both Server A and Server B agree about the state of their data, and no modifications are in-flight. If a client of Server A then commits modification MA1 to row X, then from Server A's point of view, MA1 happened before any future modification to row X.
MA1 -> M*
If a client of Server B commits modification MB1 to row X around the same time (before, or after, or thereabouts), from Server B's point of view, MB1 happened before any future modification to row X.
MB1 -> M*
Both Servers are correct, and content with their world view. Note that in general, when committing a modification Mj, a server naturally asserts that from its point of view the modification happened before any as-yet-unseen modification Mk.
Some time will pass and the replication mechanisms will pull Binlogged changes across and apply them. When Server B pulls and applies Server A's Binlogged changes, modification MA1 will be applied to row X. Server B will then naturally be of the opinion that :
MB1 -> MA1
Independently, Server A will pull Server B's binlogged changes and apply modification MB1 to row X, and will come to the certain opinion that :
MA1 -> MB1
These happened before relationships are contradictory so there is a conflict. If nothing is done then A and B will have diverged, with Server A storing the outcome of MB1, and Server B storing the outcome of MA1.
Note that if the --log-slave-updates server option were on, then Server A's Binlog would have recorded {...MA1...MB1...}, whereas Server B's Binlog would have recorded {...MB1...MA1...}. By recording when the Slave applies replicated updates in the Binlog, we record the commit order of the replicated updates relative to other local updates, and encode the happened before relationship in the relative positions of events in the Binlog.
The Binlog is of course transferred between servers, so in a circular replication setup, Server A can become aware of the happened before information from Server B and vice-versa by examining the received Binlogs. The Slave SQL thread examines Binlogs as it applies them, so can be extended to extract happened before information, and use it to detect conflicts.
Recall that Server A asserts that its committed modification to row X (MA1) happened before any as-yet-unseen replicated modification :
MA1 -> M*
Therefore, to detect a conflict, Server A only needs to detect the case where the incoming Binlog from Server B infers that some modification MB* to row X happened before server A's already committed modification MA1.
If Server B Binlog implies MB* -> MA1 then there has been a conflict
This is in essence how the NDB$EPOCH functions work - the Binlog is used to capture happened before relationships which are checked to determine whether conflicting concurrent modifications have occurred.
Conflict Windows
In the previous example, Server A commits MA1 modifying row X, and Server B commits MB1 also modifying row X. From Server A's point of view, as soon as it commits MA1, there is potential for a replicated modification from B such as MB1 to be found in-conflict with MA1. We say that from Server A's point of view a window of potential conflict on row X has opened when MA1 was committed. Server A monitors Server B's Binlog as it is applied and when it reaches the point where the commit of MA1 at Server B is recorded, Server A knows that any further MB* recorded in Server B's Binlog after this cannot have happened before MA1, therefore the window of potential conflict on row X has closed.
We define the window of potential conflict on a row X as the time between the commit of a modification M1, and the Slave processing of an event in a replicated Binlog indicating that modification M1 has been applied on the other server(s) in the replication loop.
Any incoming replicated modification M2 also affecting row X while it has an open conflict window is in conflict with M1, as it must appear to have happened-before M1 to the server which committed it.
Observations about the window of potential conflict :
- It is defined per committed modification per disjoint data set
- It can be extended by further modifications to the same data from the same server
The window does not close all further modifications have been fully replicated - Window duration is dependent on the replication round-trip delay
Which can vary greatly - Once it closes, further modifications to the same data from anywhere are safe, but will each open their own window of potential conflict.
- From the point of view of one Server, conflicts can occur at any time until the conflict window is closed
- From the point of view of one Server, the duration of the window of potential conflict is similar to
Replication Propagation Delay A to B + Replication Propagation Delay B to A
These delays may not be symmetric. - From the point of view of an external observer/actor, the system will detect two modifications MA1 and MB1 committed at times tMA1 and tMA2 as in-conflict if
tMB1 - tMA1 < Replication Propagation Delay A to B
( A before B, but not by enough to avoid conflict )
or
tMA1 - tMB1 < Replication Propagation Delay B to A
( B before A, but not by enough to avoid conflict ) - The window of potential conflict can only be as short as the replication propagation delay between systems, which can tend towards, but never reach zero.
Tracking conflict windows with a logical clock
A row's conflict window opens when a modification is committed to it, and closes when the Slave processes an event indicating that the modification was committed on the other server(s). How can we track all of these independent conflict windows? If only we had a database :)
This is solved by maintaining a per-server logical clock, which increments periodically. Each modification to a row sets a hidden metacolumn of the row to the current value of the server's logical clock. This gives each row a kind of coarse logical timestamp. When the logical clock increments, an event is included in the Binlog to record the transition. Further, all row events for modifications with logical clock value X are stored in the Binlog before any row events for modifications with logical clock value X+1.
Server A Binlog events ClockVal stored in DB
by Modification
...
MA1 39
MA2 39
MA3 39
ClockVal_A = 40
MA4 40
MA5 40
ClockVal_A = 41
MA6 41
When a Slave applies the Binlog, the ClockVal events are passed through into its Binlog, and are then made available to the original server in a circular configuration.
Server B Binlog events
...
MB1
MB2
ClockVal_A = 40
MB3
MB4
ClockVal_B = 234
MB5
MB6
ClockVal_A = 41
MB7
...
Using the Binlog ordering, we can see that ClockVal_A = 40 happened before MB3 and MB4 at Server B. This implies that MA1, MA2 and MA3 happened before MB3 and MB4 at server B.
When applying Server B's Binlog to Server A, the Slave at Server A maintains a maximum replicated clock value, which increases as it observes its ClockVal_A events returned. When applying a row event originating from Server B, the affected row's stored clock value is first compared to the maximum replicated clock value to determine whether the row event from B conflicts with the latest committed change to the row at Server A.
The two modifications are in conflict if the stored row's clock value is greater than or equal to the maximum replicated clock value.
in_conflict = row_clockval >= maximum_replicated_clockval
Using a logical clock to track conflict windows has the following benefits :
- Automatic update on commit of row modification, opening conflict window
- Automatic extension of conflict window on further modification on row with open conflict window.
- Automatic closure of conflict window on maximum replicated clock value exceeding row's stored value
- Efficient storage cost per row - one clock value.
- Efficient runtime processing cost - inequality comparison between maximum replicated clock value and row's stored clock value.
As you might have guessed, NDB$EPOCH uses the MySQL Cluster epoch values as a logical clock to detect conflicts. The details of this will have to wait for yet another post. In my first two posts on this subject I thought, 'one more post and I can finish describing this', but here I am at three posts and still not finished. Hopefully the next will get more concrete and finally describe the mysterious workings of NDB$EPOCH. We're getting closer, honest.
Edit 23/12/11 : Added index