Monday, 3 October 2011

Eventual consistency with MySQL

tl;dr : New 'automatic' optimistic conflict detection functions available giving the best of both optimistic and pessimistic replication on the same data

MySQL replication supports a number of topologies, and one of the most interesting is an active-active, or master-master topology, where two or more Servers accept read and write traffic, with asynchronous replication between them.

This topology has a number of attractions, including :
  • Potentially higher availability
  • Potentially low impact on read/write latency
  • Service availability insensitive to replication failures
  • Conceptually simple

However, data consistency is hard to maintain in this environment. Data, and access to it, must usually be partitioned or otherwise controlled, so that the consistency of reads is acceptable, and to avoid lost writes or badly merged concurrent writes. Implementing a distributed data access partitioning scheme which can safely handle communication failures is not simple.

Relaxed read consistency

Relaxed read consistency is a fairly well understood concept, with many Master-Slave topologies deployed where some read traffic is routed to the Slave to offload the Master and get 'read scaling'.
Generally this is acceptable as :
  1. A Read-only Slave's state is self-consistent. It is a state which, at least logically, existed on the Master at some time in the past.
  2. The reading application can tolerate some level of read-staleness w.r.t. the most recently committed writes to the Master

A surprisingly large number of applications can manage with a stale view as long as it is self-consistent.


Applications requiring 'read your writes' consistency (or session consistency) must either read from the Master, or wait until the Slave has replicated up to at least the point in time where the application's last write committed on the Master before reading from it. It is simpler and less delay-prone to just read from the Master, but this increases the load on the Master, reducing the ability of a system to read-scale. When the Master is unavailable, some sort of failover is required, and therefore, some sort of recovery process is also required.

Partitioned Active-Active/ Balanced Master-Slave

Rather than treating a whole replica as either Master or Slave, we can have each replica be both a Master and a Slave. The partitioning could be on database level, table level, or some function of the rows contained in tables, perhaps some key prefix which maps to application entities. Balancing the Master/Slave role in this way allows the request load to be balanced, reducing issues with a single system providing the Master 'role' having to do more work.

In this configuration, rather than talking about Master and Slave, it makes more sense to talk about some partition of data being 'Active' on one replica, and 'Backup' on the others. Read requests routed to the Active replica will be guaranteed to get the latest state, whereas the Backup replicas can potentially return stale states. Write requests should always be routed to the Active replica to avoid potential races between concurrent writes.

Implementing a partitioned replicated system like this generally requires application knowledge to choose a partitioning scheme where expected transaction footprints align with the partitioning scheme, and cross-partition transactions are rare/non-existant. Additionally, it requires application modification, or a front-end routing mechanism to ensure that requests are correctly routed. The routing system must also be designed to re-route in cases of communication or system failure, to ensure availability, and avoid data divergence. After a failure, recovery must take care to ensure replicas are resynchronised before restoring Active status to partitions in a recovered replica.

Implementing a partitioned replicated system with request routing, failover and recovery is a complex undertaking. Additionally, it can be considered a pessimistic system. For embarassingly parallel applications, with constrained behaviours, most transactions are non-overlapping in their data footprint in space and (reasonable lengths of) time. Enforced routing of requests to a primary replica adds cost and complexity that is most often unnecessary. Is it possible to take a more optimistic approach?

Optimistic Active-Active replication

An optimistic active-active replication system assumes that conflicting operations are rare, and prefers to handle conflicts after they happen, than to make conflicts impossible, by mapping them to delays or overheads all of the time. The one-time cost of recovering from a conflict after it happens may be higher than the one-time cost of preventing a conflict, but this can be a win if conflicts are rare enough.

Practically, optimistic active-active replication involves allowing transactions to execute and commit at all replicas, and asynchronously propagating their effects between replicas. When applying replicated changes to a replica, checks are made to determine whether any conflicts have occurred.

Benefits of optimism include :
  • Local reads - low latency, higher availability
  • Local writes - low latency, higher availability
  • No need to route requests, failover, recover
    Recovery from network failure is the same as for normal async replication - catch up the backlog.

A pessimist is never disappointed, as they always expect the worst, but an optimist is occasionally (often?) disappointed. With active-active replication, this disappointment can include reading stale data, as with relaxed read consistency, or having committed writes later rejected due to a conflict. This is the price of optimism. Not all applications are suited to the slings and arrows inherent in optimism. Some prefer the safety of a pessimistic outlook.

Benefits of pessimism include :
  • Only durable data returned by reads
  • Committed writes are durable
MySQL Cluster replication has supported symmetric optimistic conflict detection functions since the 6.3 release. These provide detection of conflicts for optimistic active-active replication, allowing data to be written on any cluster, and write-write conflicts to be detected for handling. The functions use an application defined comparison value to determine when a conflict has occurred, and optionally, which change should 'win'. This is very flexible, but can be difficult to understand, and requires application and schema changes to be made use of.

When presented with an either-or decision, why not ask for both? Is it possible to have the benefits of both optimistic and pessimistic replication? Can we have them both on the same data at the same time?

Asymmetric optimistic Active-Active replication

I have recently been working on new asymmetric conflict detection functions for MySQL Cluster replication. These functions do not require schema or application modifications. They are asymmetric in that one data replica is regarded as the Active replica. However, unlike a pessimistic partitioned replicated system, writes can be made at Active or Backup replicas - they do not have to be routed to the Active replica. Writes made at the Backup replica will asynchronously propagate to the Active replica and be applied, but only if they do not conflict with writes made concurrently at the Active replica.

Having a first class Active replica and a second class Backup replica may seem like a weakness. However, it allows optimistic and pessimistic replication to be mixed, on the same data for different use-cases.

Where a pessimistic approach is required, requests can be routed to the Active replica. At the Active replica, they will be guaranteed to read durable data, and once committed, writes will not be rejected later.

Where an optimistic approach is acceptable, requests can also be routed to the Backup replica. At the Backup replica, committed writes may later be rejected, and reads may return data which will later be rejected. The potential for disappointment is there, and applications must be able to cope with that, but in return, they can read and write locally, with latency and availability independent of network conditions between replicas.

A well understood application and schema can use pessimistic replication, with request routing, where appropriate, and write-anywhere active-active where the application and schema can cope with the relaxed consistency.

New conflict functions - NDB$EPOCH, NDB$EPOCH_TRANS

The new NDB$EPOCH function implements asymmetic conflict detection, on a row basis. One replica of a table is considered Active (or Primary), and the other(s) are Backup (or Secondary). Writes originating from the Backup replica are checked at the Active replica to ensure that they don't conflict with concurrent writes originating at the Active replica. If they do conflict, then they are rejected, and the Backup is realigned to the Active replica's state. In this way, data divergence is avoided, and the replicated system eventually becomes consistent.

The conflict detection, and realignment to give eventual consistency all occur asynchronously as part of the normal MySQL replication mechanisms.

As with the existing conflict detection functions, an exceptions table can be defined which will be populated with the primary keys of rows which have experienced a conflict. This can be used to take application specific actions when a conflict is detected.

Unlike the existing conflict detection functions, no schema changes or application changes are required. However, as with any optimistic replication system, applications must be able to cope with the relaxed consistency on offer. Applications which cannot cope, can still access the data, but should route their requests to Active replicas only, as with a more traditional pessimistic system.

As these functions build on the existing MySQL Cluster asynchronous replication, the existing features are all available :
  • Slave batching performance optimisations
  • High availability - redundant replication channels
  • Transactional replication and progress tracking
  • Normal MySQL replication features : DDL replication, Binlog, replicate to other engines etc..
Ok, that's long enough for one post - I'll describe NDB$EPOCH_TRANS and its motivations in a follow-up. If you're interested in trying this out, then download the latest versions of MySQL Cluster. If you're interested in the optimistic replication concept in general, I recommend reading Saito and Shapiro's survey.

Edit 23/12/11 : Added index


Ivan said...

Frazer - great stuff!

Thanks for sharing in such a clear way!

Frazer Clement said...

Thanks Ivan!

The next post describing NDB$EPOCH_TRANS is available now :

Baron said...

This is great. Now we need some marketing to raise awareness among the audiences who are trying to decide what data store to use. People who are evaluating Cassandra and Riak and HBase and so on haven't heard of NDB in many cases, and "MySQL anything" can be an automatic turn-off to them. It might be hard to turn that ship

Frazer Clement said...

Thanks Baron. Agree with your points completely.

Mark Callaghan said...

Wow. Multi-master with conflict detection & resolution is a big deal for internet apps.

Frazer Clement said...

Thanks Mark, I hope so.

dengcheng he said...

Frazer, excellent series, i have read for more than two times!
I have a little question about NDB$GCI64, hidden column in every row. On transaction commit, how this column be updated? Should we traverse back all the rows modified in the transaction and update this value on transaction commit?

Frazer Clement said...

Thanks dengcheng he, well done for getting through it twice!

The NDB$GCI64 metacolumn is automatically maintained by the Ndb software. You do not need to do anything to set it at commit time. If you attempt to modify it manually then your operation will fail.

At commit time, the Ndb software internally iterates the DML operations,setting the NDB$GCI64 value for each modified row to the current epoch value.

Unknown said...

Good stuff... How do I run it?? I try to run:
but to no avail.

My problem is I get an inconsistency error whenever I try to create a datetime or time or timestamp field in a table.

Frazer Clement said...

Hi Justin,
NDB$EPOCH() et al are not SQL callable functions, they are internal functions applied by a replication slave when applying transactions.
To configure the slave to use them, you must add entries to the mysql.ndb_replication table before creating the tables that conflict handling is needed for.
Then when applying replicated transactions, the Slave knows to use this function to detect + handle conflicts.
I'm not sure about your other problem, probably you need to be more specific?