Wednesday, 7 March 2012

The CAP theorem and MySQL Cluster

tldr; A single MySQL Cluster prioritises Consistency in Network partition events. Asynchronously replicating MySQL Clusters prioritise Availability in Network partition events.


I was recently asked about the relationship between MySQL Cluster and the CAP theorem. The CAP theorem is often described as a pick two out of three problem, such as choosing from good, cheap, fast. You can have any two, but you can't have all three. For CAP the three qualities are 'Consistency', 'Availability' and 'Partition tolerance'. CAP states that in a system with data replicated over a network only two of these three qualities can be maintained at once, so which two does MySQL Cluster provide?

Standard 'my interpretation of CAP' section

Everyone who discusses CAP likes to rehash it, and I'm no exception. Daniel Abadi has the best CAP write-up that I've read so far, which reframes CAP as a decision about whether to ultimately prioritise availability or data consistency in the event of a network partition. This is how I think of CAP. He also discusses related system behaviour in normal operation which I'll return to later.

While this reframing clarifies CAP, the terms network partition, availability and consistency also need some definition.

Network replicated database

CAP is only really relevant in the context of a network replicated database (or filesystem or state machine). A network replicated database stores copies of data in multiple different systems (database nodes), connected by a network. Data can be read and updated. Updates are propagated to all nodes with replicas via the network. Database clients connect to database nodes via the network to read data and make updates. Replication may occur to improve availability, to improve request latency, or to improve read bandwidth.

Availability

The network replicated database exists to provide services such as Read and Write on the data it stores. Its availability can be measured as the ability of any client to perform any service on any data item.

This Service Availability can be compromised by :
  • Failure of client nodes
  • Network failures between clients and database nodes
  • Network failures between database nodes
  • Failure of database nodes
Client node and networking failures cannot really be considered a property within the control of a database system, so I consider their effects out of the scope of CAP. However, where clients connect to a database node, and that database node is isolated from other database nodes, whether or not those clients are given service is within the scope of CAP.

Service Availability is not binary, it can partially degrade, perhaps by affecting :
  • A subset of all clients
  • A subset of all stored data
  • A subset of request types

The shades of grey within the definition of availability are responsible for most of the arguments around CAP. If we take a strict view - either all services available on all data for all clients, or nothing, then availability is fragile and hard to maintain. If we take a more flexible approach then some service availability can be preserved even with a completely decimated network. In the loosest definition, if any client receives any service on any data, then the system is still available. Rather than choose one position, I regard availability as a range from 100% down to 0% for a full outage. Anything in the middle is reduced availability, but it does not mean that the system is not serving its purpose adequately.

Consistency

For consistency to be satisfied, the multiple replicas of data in a network replicated database should behave as though there were only one copy of the data. Simultaneous reads of the same data item from clients connected to different database nodes must always return the same result. Where two or more updates to the same data item are submitted simultaneously, they must be serialised, or one must be rejected, or they must be merged so that a single value results. This one-copy model makes it simple for database clients to use the network replicated database as if it were a single database system with one atomically read/written copy of their data.

If one copy consistency is relaxed, then different database nodes may observably have different values for the same data item simultaneously. Over time the data copies may be aligned, but clients accessing the data must beware that reads may not return the results of the most recently accepted writes. This behaviour may be described as eventual consistency. Providing eventual consistency allows a network replicated database to maximise availability, but pushes the problem of dealing with transient inconsistencies up the stack to user applications. Furthermore there are varying qualities of eventual consistency, with varying guarantees and levels of application support available.

Network Partitions

Network partitions isolate subsets of the nodes of a network replicated database. The interesting property of a network partition is that each node subset cannot tell whether the other node subset(s) are :
  1. dead
  2. alive but isolated from clients
  3. alive and reachable by clients but isolated from us
Not knowing the state of the other subset(s) is what forces a system to decide between maximising service availability and maximising consistency. The interesting case is 3) where some database nodes (potentially containing all or some of the data) are alive elsewhere and have clients connected to them. If those clients are allowed to make writes on data copies stored on those database nodes, then we must lose one copy consistency as we cannot supply those new values in response to a read of our local copy. If those clients are not allowed to make writes then we have degraded service availability for them. Which is it to be? This is the unavoidable choice at the centre of the CAP theorem. Stated this way it seems less of a theorem and more of a fact.

Back to MySQL Cluster - which does it provide?

A single MySQL Cluster prioritises data consistency over availability when network partitions occur.

A pair of asynchronously replicating MySQL Clusters prioritise service availability over data consistency when network partitions occur.

So you can have it both ways with MySQL Cluster - Great!

Single MySQL Cluster - CP

Within a single MySQL Cluster, data is synchronously replicated between database nodes using two-phase commit. Nodes are monitored using heartbeats, and failed or silent nodes are promptly isolated by live and responsive nodes. Where a network partition occurs, live nodes in each partition regroup and decide what to do next :
  • If there are not enough live nodes to serve all of the data stored - shutdown
    Serving a subset of user data (and risking data consistency) is not an option
  • If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service
    No other subset of nodes can be isolated from us and serving clients
  • If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.
    There could be another subset of nodes regrouped into a viable cluster out there.
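The three rules above can be sketched in a few lines of Python. This is only an illustration, not the actual data node logic: it assumes that data is divided between node groups, that a subset of nodes can serve all of the data when it contains at least one node from every node group, and the names decide, covers_all_data and Action are made up for the example.

```python
from enum import Enum


class Action(Enum):
    SHUTDOWN = "shutdown"
    CONTINUE = "continue"
    ARBITRATE = "arbitrate"


def covers_all_data(nodes, node_groups):
    """True if `nodes` contains at least one member of every node group.

    Assumption for this sketch: each node group holds a distinct part of
    the data, fully replicated on every node in that group.
    """
    return all(group & nodes for group in node_groups)


def decide(live, node_groups):
    """Decide what a regrouped subset of live nodes should do next."""
    live = set(live)
    node_groups = [set(group) for group in node_groups]
    unreachable = set().union(*node_groups) - live

    if not covers_all_data(live, node_groups):
        # We cannot serve all of the data, so we must not serve any of it.
        return Action.SHUTDOWN
    if not covers_all_data(unreachable, node_groups):
        # The nodes we cannot see could not form a viable cluster.
        return Action.CONTINUE
    # The nodes we cannot see might be a viable cluster: ask the arbitrator.
    return Action.ARBITRATE
```

With two node groups of two nodes each, decide({1, 3}, [{1, 2}, {3, 4}]) gives ARBITRATE, since nodes 2 and 4 could be alive and serving clients elsewhere, while decide({1, 2, 3}, [{1, 2}, {3, 4}]) gives CONTINUE, since node 4 alone cannot serve all of the data.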

Arbitration occurs to avoid the split brain scenario where a cluster could theoretically split in two (or more), with each half (or third, or quarter) accepting writes and diverging from the others. In other words, arbitration occurs to preserve consistency.

Arbitration involves :
  • Database nodes agree on an arbitrator in advance
  • During node or network failure handling, no data writes are committed.
  • When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service.
  • Each request to the arbitrator will result in either : Yes, No or timeout
  • Anything other than Yes results in node shutdown.
  • The arbitrator only says Yes once per election round (First come first served). Therefore the arbitrator only says yes to one potential cluster in a partitioned network.
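As a rough sketch of the first-come-first-served rule, assuming each election round can be identified by some round id (a hypothetical parameter introduced for this example), the arbitrator and the node-side handling of its reply might look like this. The class and function names are illustrative only.

```python
class Arbitrator:
    """Toy arbitrator: says Yes at most once per election round."""

    def __init__(self):
        self._answered_rounds = set()

    def request(self, round_id):
        """Return True (Yes) for the first request in a round, else False (No)."""
        if round_id in self._answered_rounds:
            return False
        self._answered_rounds.add(round_id)
        return True


def node_action(reply):
    """Map an arbitrator reply to a node action.

    `reply` is True (Yes), False (No) or None (timeout).
    Anything other than Yes results in shutdown.
    """
    return "provide service" if reply is True else "shutdown"
```

Two potential clusters racing for the same round get different answers, so at most one of them continues. A subset which cannot reach the arbitrator at all sees a timeout and shuts down.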

Note that arbitration is not the same as achieving a quorum. A cluster with three replicas and an arbitrator node can survive the loss of two data nodes as long as the arbitrator remains reachable to the last survivor. The arbitrator role is lightweight as it is not involved in normal traffic. I am surprised that the lightweight arbitrator pattern is not more common.

How does a single MySQL Cluster degrade service availability as a result of network partitions?

Where some subset of data nodes are isolated and shut-down :
  • Those nodes are 100% out of service, until they restart and can rejoin the cluster
    They will attempt to do so automatically
  • Any clients connected only to those nodes are out of service
    By default clients attempt to connect to all data nodes, so partial connectivity issues needn't degrade client availability.
  • The remaining live nodes are 100% in-service
  • Clients connected to the remaining live nodes are 100% in service
Where no subset of data nodes is live :
  • All clients experience 100% service loss, until the data nodes restart and can rejoin the cluster
    They will attempt to do so automatically.

A single MySQL Cluster does not degrade to partial data access, or read only modes as a result of network partitions. It does not sacrifice consistency.

How can MySQL Cluster be described as highly available if it sacrifices availability for consistency in the event of a network partition?

Availability is not binary - many types of network partition can erode availability, for some clients, but do not extinguish it. Some set of clients continue to receive 100% service. Only double failures in the network can cause a network partition resulting in full service loss.
Furthermore, network partitions are not the only risk to availability : software errors, power failures, upgrades and overloads are other potential sources of downtime which Cluster is designed to overcome.

Asynchronously replicating clusters - AP


Where two Clusters are asynchronously replicating via normal MySQL Replication, in a circular configuration, reads and writes can be performed locally at both clusters. Data consistency within each cluster is guaranteed as normal, but data consistency across the two clusters is not. On the other hand, availability is not compromised by network partitioning of the two clusters. Each cluster can continue to accept read and write requests to all of the data from any connected client.

Eventual consistency between the clusters is possible when using conflict resolution functions such as NDB$EPOCH_TRANS, NDB$EPOCH, NDB$MAX etc.

How does consistency degrade between replicating MySQL Clusters during a network partition?

This depends on the conflict resolution function chosen, and how detected conflicts are handled. Some details of consistency guarantees provided by NDB$EPOCH et al are described here.

What about normal operation?

Abadi's post introduced his PACELC acronym, standing for something like :

if (network Partition)
{
    trade-off Availability vs Consistency;
}
else
{
    trade-off Latency vs Consistency;
}


My first comment has to be that it's bad form to put the common case in an else branch!
However, it is certainly true that the properties during normal operation are usually more important than what happens during a network partition. The ELC section is stating that while all database nodes are present, a network replicated database can choose between minimising request Latency, or maintaining Consistency. In theory this normal operation latency-vs-consistency tradeoff could be completely independent of the Network Partitioning availability-vs-consistency tradeoff, e.g. you could have any of :
  1. PA EL (Partition - Availability, Else - Latency minimisation)
  2. PA EC (Partition - Availability, Else - Consistency)
  3. PC EL (Partition - Consistency, Else - Latency minimisation)
  4. PC EC (Partition - Consistency, Else - Consistency)

The common cases are 1 + 4, where we choose either consistency at all times, or Maximum Availability and Minimum Latency. Case 2 is a system which aims for consistency, but when a network partition occurs, aims for Availability. Case 3 is a system which aims for minimal request Latency, and when a partition occurs aims for consistency.

Examples of systems of each type :
  1. Any eventually consistent system, especially with local-database-node updates + reads
  2. Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication)
  3. ???
  4. Always consistent systems (e.g. single database instance, single MySQL Cluster)

I am not aware of systems meeting case 3 where normally they minimise latency over consistency, but start choosing consistency after a network partition. Maybe this category should be called 'repentant systems'?

The problem for systems in Cases 1 or 2 - anywhere where Latency minimisation or Availability is chosen over consistency - is the need for user applications to deal with potential inconsistencies. It is not enough to say that things will 'eventually' be consistent. It's important to describe how inconsistent they can be, whether the temporary inconsistencies are values which were once valid, how those values relate to other, connected values etc.

There are certainly applications which can operate correctly with practical eventually consistent databases, but it's not well known how to design applications and schemas to cope with the transient states of an eventually consistent database. The first ORM framework to opaquely support an underlying eventually consistent database may actually be worth the effort to use! A reasonable approach is to design schemas with associated read/modification 'protocols' as if they were abstract data types (ADTs). These ADTs can then have strengths and weaknesses, properties and limitations which make sense in some parts of an application schema where the need to support eventual consistency overcomes the inherent effort and limitations.

Stonebraker and others have commented on network partitions being a minor concern for a well designed datacentre-local network, where redundancy can be reliably implemented. Also the latency cost of maintaining consistency is lower as physical distances are smaller and hop counts are lower. This results in 'CP' systems being attractive at the data centre scale as the need to sacrifice availability due to network partition is rarely dominant, and the latency implications during normal operation are bearable. Perhaps this highlights the need in these theoretical discussions to illustrate theoretically problematic latencies and availabilities with real numbers.

At a wider network scale, latencies are naturally higher, implying that bandwidth is lower. The probability of network partitions of some sort may also increase, due to the larger number of components (and organisations) involved. The factors combine to make 'AP' systems more palatable. The everyday latency cost of consistency is higher, and losing availability due to potentially more frequent network partitions may not be acceptable. Again, real numbers are required to illuminate whether the achievable latencies and probable availability impacts are serious enough to warrant changing applications to deal with eventually consistent data. For a particular application there may or may not be a point at which an AP system would meet its requirements better.

Consistent systems can be scaled across many nodes and high latency links, but the observed operation latency, and the necessary impacts to availability implied by link failure set a natural ceiling on the desirable scale of a consistent system. Paraphrasing John Mashey, "Bandwidth improves, latency is forever". Applications that find the latency and availability constraints of a single consistent system unacceptable, must subdivide their datasets into smaller independent consistency zones and manage potential consistency shear between them.

Finally (another excessively long post), I think the technical and actual merits of widely distributed 'CP' systems are not well known as they have not been commonly available. Many different database systems support some form of asynchronous replication, but few offer synchronous replication, fewer still offer to support it over wide areas with higher latency and fluctuating links. As this changes, the true potential and weaknesses of these technologies, backed by real numbers, will start to appear.

Edit 7/3/12 : Fix bad link

Tuesday, 21 February 2012

One billion

As always, I am a little late, but I want to jump on the bandwagon and mention the recent MySQL Cluster milestone of passing 1 billion queries per minute. Apart from echoing the arbitrarily large ransom demand of Dr Evil, what does this mean?

Obviously 1 billion is only of interest to us humans as we generally happen to have 10 fingers, and seem to name multiples in steps of 10^3 for some reason. Each processor involved in this benchmark is clocked at several billion cycles per second, so a single billion is not so vast or fast.

Measuring over a minute also feels unnatural for a computer performance benchmark - we are used to lots of things happening every second! A minute is a long time in silicon.

What's more, these reads are served from tables stored entirely in memory - and everyone knows that main memory is infinitely fast and scalable and always getting cheaper, right?

If we convert to seconds we are left with only 17 million reads per second! Hardly worth getting out of bed for?

On the contrary, I think that achieving 17 million independent random reads per second, each read returning 100 bytes across a network, from a database that also supports arbitrary SQL, row locking, transactions, high availability and all sorts of other stuff, is pretty cool. I doubt that (m)any other similar databases can match this raw performance, though I look forward to being proved wrong.

(Also, don't forget to meet + beat 1.9 million random updates/s, synchronously replicated)

Raw performance is good, but not everyone just needs horsepower. The parallel, independent work on improving join performance (also known as SPJ/AQL) and query optimisation helps more applications harness this power, by improving the efficiency of joins.

I wrote a post about SPJ/AQL at the start of last year, when it was still in the early stages. Since then much has improved, to the extent that the performance improvement factors have become embarrassingly high on real user queries. A further post on the technical details of SPJ/AQL is long overdue... Perhaps the most interesting details are on the integration between the parallel, streaming linked operations and the essentially serialised MySQL Nested Loops join executor. A linked scan and lookup operation can be considered to be a form of parallel hash join, which the normal MySQL NLJ executor can invoke as part of executing a query. Who says Nested Loop joins can't scale?

Friday, 17 February 2012

Transactional memory in 2012

I've been observing the appearance of hardware transactional memory (HTM) systems in the wild with interest. Keen readers might recall earlier posts describing my work on software for the XA-Core HTM system at Nortel.

The transactional memory concept is unusual in that it seems to have proponents both at the chip level (Azul, Sun's failed Rock SPARC CPU, IBM, Intel) and in the functional language community, most notably Haskell. I suspect the functional language community is motivated by the simplicity of the concurrency abstraction, and the chip community are motivated by the transistor use case. There doesn't seem to be the same demand for it from the vast middle ground of OSs, middleware, applications etc. Does this signify something?

Two things have seemed rather opaque in most coverage of transactional memory systems. The first is when and why it is better than using explicit locking / atomic operations. The second is the time+space properties of actual implementations. Too many expositions are still bound up in the simplicity of the interface to discuss the real benefits and drawbacks. Everybody likes a simplifying abstraction, but not if it is slower or has unpredictable side-effects.

So it was refreshing to read a blog post by Greg Pfister (formerly of IBM), describing an HTM implementation in relative layman's terms. I first read Greg's book 'In Search of Clusters' around the same time I was working on XA-Core (~2000), so Greg has quite some background context. He does not pretend to fully comprehend the implementation, but he asks the right questions. Searching more widely, I came across a discussion of the forthcoming Intel 'Haswell' chip at LWN. The comments here give some insight into the implementation and implications. Hopefully we'll start to hear more about the physical properties of these mechanisms, their sweet spots and limitations.

Thursday, 22 December 2011

Eventual Consistency in MySQL Cluster - implementation part 3




As promised, this is the final post in a series looking at eventual consistency with MySQL Cluster asynchronous replication. This time I'll describe the transaction dependency tracking used with NDB$EPOCH_TRANS and review some of the implementation properties.

Transaction based conflict handling with NDB$EPOCH_TRANS

NDB$EPOCH_TRANS is almost exactly the same as NDB$EPOCH, except that when a conflict is detected on a row, the whole user transaction which made the conflicting row change is marked as conflicting, along with any dependent transactions. All of these rejected row operations are then handled using inserts to an exceptions table and realignment operations. This helps avoid the row-shear problems described here.

Including user transaction ids in the Binlog

Ndb Binlog epoch transactions contain row events from all the user transactions which committed in an epoch. However there is no information in the Binlog indicating which user transaction caused each row event. To allow detected conflicts to 'rollback' the other rows modified in the same user transaction, the Slave applying an epoch transaction needs to know which user transaction was responsible for each of the row events in the epoch transaction. This information can now be recorded in the Binlog by using the --ndb-log-transaction-id MySQLD option. Logging Ndb user transaction ids against rows in turn requires a v2 format RBR Binlog, enabled with the --log-bin-use-v1-row-events=0 option. The mysqlbinlog --verbose tool can be used to see per-row transaction information in the Binlog.

User transaction ids in the Binlog are useful for NDB$EPOCH_TRANS and more. One interesting possibility is to use the user transaction ids and same-row operation dependencies to sort the row events inside an epoch into a partial order. This could enable recovery to a consistent point other than an epoch boundary. A project for a rainy day perhaps?

NDB$EPOCH_TRANS multiple slave passes

Initially, NDB$EPOCH_TRANS proceeds in the same way as NDB$EPOCH, attempting to apply replicated row changes, with interpreted code attached to detect conflicts. If no row conflicts are detected, the epoch transaction is committed as normal with the same minimal overhead as NDB$EPOCH. However if a row conflict is detected, the epoch transaction is rolled back, and reapplied. This is where NDB$EPOCH_TRANS starts to diverge from NDB$EPOCH.

In this second pass, the user transaction ids of rows with detected conflicts are tracked, along with any inter-transaction dependencies detectable from the Binlog. At the end of the second pass, prior to commit, the set of conflicting user transactions is combined with the user transaction dependency data to get a complete set of conflicting user transactions. The epoch transaction initiated in the second pass is then rolled-back and a third pass begins.

In the third pass, only row events for non-conflicting transactions are applied, though these are still applied with conflict detecting interpreted programs attached in case a further conflict has arisen since the second pass. Conflict handling for row events belonging to conflicting transactions is performed in the same way as NDB$EPOCH. Prior to commit, the applied row events are checked for further conflicts. If further conflicts have occurred then the epoch transaction is rolled back again and we return to the second pass. If no further conflicts have occurred then the epoch transaction is committed.

These three passes, and associated rollbacks are only externally visible via new counters added to the MySQLD server. From an external observer's point of view, only non-conflicting transactions are committed, and all row events associated with conflicting transactions are handled as conflicts. As an optimisation, when transactional conflicts have been detected, further epochs are handled with just two passes (second and third) to improve efficiency. Once an epoch transaction with no conflicts has been applied, further epochs are initially handled with the more optimistic and efficient first pass.
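The control flow of the passes can be sketched as below. This is a simplified model under stated assumptions, not the slave's actual code: an epoch is a list of (transaction id, row key) pairs, row_conflicts stands in for the interpreted conflict detection program, conflicting_txns stands in for the second pass (conflict plus dependency tracking), and the max_iterations guard exists only to keep the example from looping forever. All of these names are invented for the illustration.

```python
def apply_epoch(events, row_conflicts, conflicting_txns,
                start_optimistic=True, max_iterations=10):
    """Apply one epoch transaction using the multi-pass scheme.

    events           : list of (txn_id, row_key) row events
    row_conflicts    : callable(row_key) -> bool, the per-row conflict check
    conflicting_txns : callable(events) -> set of txn ids in conflict,
                       including dependent transactions (second pass)
    Returns (committed_events, rejected_events, iterations).
    """
    if start_optimistic:
        # First pass: apply everything, hoping that nothing conflicts.
        if not any(row_conflicts(key) for _, key in events):
            return list(events), [], 0   # commit as normal
        # A row conflict was detected: roll back and fall through.

    for iteration in range(1, max_iterations + 1):
        # Second pass: find the complete set of conflicting transactions,
        # then roll back.
        bad = conflicting_txns(events)

        # Third pass: apply only row events of non-conflicting transactions,
        # the rest go to conflict handling.
        committed = [e for e in events if e[0] not in bad]
        rejected = [e for e in events if e[0] in bad]

        # Check again prior to commit, in case a new conflict has arisen.
        if not any(row_conflicts(key) for _, key in committed):
            return committed, rejected, iteration
        # Further conflicts: roll back and return to the second pass.

    raise RuntimeError("conflicts still arising after max_iterations")
```

A caller handling a stream of epochs could pass start_optimistic=False for the epoch following one where rejected was non-empty, and go back to start_optimistic=True after an epoch which applied with no conflicts, which mirrors the optimisation described above.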

Dependency tracking implementation

To build the set of inter-transaction dependencies and conflicts, two hash tables are used. The first is a unique hashmap mapping row event tables and primary keys to transaction ids. If two events for the same table and primary key are found in a single epoch transaction then there is a dependency between those events, specifically the second event depends on the first. If the events belong to different user transactions then there is a dependency between the transactions.

Transaction dependency detection hash :
{Table, Primary keys} -> {Transaction id}

The second hash table is a hashmap of transaction id to an in-conflict marker and a list of dependent user transactions. When transaction dependencies are discovered using the first dependency detection hash, the second hash is modified to reflect the dependency. By the end of processing the epoch transaction, all dependencies detectable from the Binlog are described.

Transaction dependency tracking and conflict marking hash :
{Transaction id} -> {in_conflict, List}

As epoch operations are applied and row conflicts are detected, the operation's user transaction id is marked in the dependency hash as in-conflict. When marking a transaction as in-conflict, all of its dependent transactions must also be transitively marked as in-conflict. This is done by traversing the dependency tree of the in-conflict transaction. Due to slave batching, the addition of new dependencies and the marking of conflicting transactions is interleaved, so adding a dependency can result in a sub-tree being marked as in-conflict.

After the second pass is complete, the transaction dependency hash is used as a simple hash for looking up whether a particular transaction id is in conflict or not :

Transaction in-conflict lookup hash :
{Transaction id} -> {in_conflict}

This is used in the third pass to determine whether to apply each row event, or to proceed straight to conflict handling.
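A minimal Python sketch of the two hashes and the transitive marking follows. It is an illustration of the idea rather than the real implementation: the class and method names are made up, and it assumes that the first hash remembers the most recent transaction to touch each row, which is enough for chains of dependencies to be followed transitively.

```python
class DependencyTracker:
    """Toy model of the dependency detection and conflict marking hashes."""

    def __init__(self):
        # {table, primary key} -> transaction id (most recent writer)
        self.last_writer = {}
        # transaction id -> {in_conflict, dependents}
        self.txns = {}

    def _entry(self, txn):
        return self.txns.setdefault(
            txn, {"in_conflict": False, "dependents": set()})

    def add_event(self, table, primary_key, txn):
        """Record a row event, adding a dependency if the row was seen before."""
        self._entry(txn)
        prev = self.last_writer.get((table, primary_key))
        if prev is not None and prev != txn:
            prev_entry = self._entry(prev)
            prev_entry["dependents"].add(txn)
            if prev_entry["in_conflict"]:
                # Adding a dependency on a conflicting transaction marks
                # the whole sub-tree below the new dependent.
                self.mark_conflict(txn)
        self.last_writer[(table, primary_key)] = txn

    def mark_conflict(self, txn):
        """Mark txn and, transitively, all of its dependents as in-conflict."""
        stack = [txn]
        while stack:
            entry = self._entry(stack.pop())
            if not entry["in_conflict"]:
                entry["in_conflict"] = True
                stack.extend(entry["dependents"])

    def in_conflict(self, txn):
        """Third pass lookup: is this transaction id in conflict?"""
        return self.txns.get(txn, {}).get("in_conflict", False)
```

For example, if transaction A writes row 1, B writes rows 1 and 2, and C writes row 2, then marking A as in-conflict also marks B and C, while a transaction D which only touched row 3 is unaffected.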

The size of these hashes, and the complexity of the dependency graph is bounded by the size of the epoch transaction. There is no need to track dependencies across the boundary of two epoch transactions, as any dependencies will be discovered via conflicts on the data committed by the first epoch transaction when attempting to apply the second epoch transaction.

Event counters

Like the existing conflict detection functions, NDB$EPOCH_TRANS has a row-conflict detection counter called ndb_conflict_epoch_trans.

Additional counters have been added which specifically track the different events associated with transactional conflict detection. These can be seen with the usual SHOW GLOBAL STATUS LIKE syntax, or via the INFORMATION_SCHEMA tables.

  • ndb_conflict_trans_row_conflict_count
    This is essentially the same as ndb_conflict_epoch_trans - the number of row events with conflict detected.
  • ndb_conflict_trans_row_reject_count
    The number of row events which were handled as in-conflict. It will be at least as large as ndb_conflict_trans_row_conflict_count, and will be higher if other rows are implicated by being in a conflicting transaction, or being dependent on a row in a conflicting transaction.
    A separate ndb_conflict_trans_row_implicated_count could be constructed as ndb_conflict_trans_row_reject_count - ndb_conflict_trans_row_conflict_count
  • ndb_conflict_trans_reject_count
    The number of discrete user transactions detected as in-conflict.
  • ndb_conflict_trans_conflict_commit_count
    The number of epoch transactions which had transactional conflicts detected during application.
  • ndb_conflict_trans_detect_iter_count
    The number of iterations of the three-pass algorithm that have occurred. Each set of passes counts as one. Normally this would be the same as ndb_conflict_trans_conflict_commit_count. Where further conflicts are found on the third pass, another iteration may be required, which would increase this count. So if this count is larger than ndb_conflict_trans_conflict_commit_count then there have been some conflicts generated concurrently with conflict detection, perhaps suggesting a high conflict rate.
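The derived values mentioned above are simple subtractions. As a small sketch, assuming the counters have already been fetched into a dictionary keyed by the names used in this post (how they are fetched is left out, and the function name is made up for the example):

```python
def derived_conflict_metrics(status):
    """Compute derived values from the transactional conflict counters.

    `status` maps counter names to values. Values may be ints or the
    strings returned by a status query, so they are converted with int().
    """
    row_conflicts = int(status["ndb_conflict_trans_row_conflict_count"])
    row_rejects = int(status["ndb_conflict_trans_row_reject_count"])
    conflict_commits = int(status["ndb_conflict_trans_conflict_commit_count"])
    detect_iters = int(status["ndb_conflict_trans_detect_iter_count"])
    return {
        # Rows rejected only because of their transaction or a dependency.
        "row_implicated_count": row_rejects - row_conflicts,
        # Extra iterations caused by conflicts arising during detection.
        "extra_iterations": detect_iters - conflict_commits,
    }
```

A persistently non-zero extra_iterations value would be a hint that conflicts are being generated concurrently with conflict detection.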


Performance properties of NDB$EPOCH and NDB$EPOCH_TRANS

I have tried to avoid getting involved in an explanation of Ndb replication in general which would probably fill a terabyte of posts. Comparing replication using NDB$EPOCH and NDB$EPOCH_TRANS relative to Ndb replication with no conflict detection, what can we say?

  • Conflict detection logic is pushed down to data nodes for execution
    Minimising extra data transfer + locking
  • Slave operation batching is preserved
    Multiple row events are applied together, saving MySQLD <-> data node round trips, using data node parallelism
    For both algorithms, one extra MySQLD <-> data node round-trip is required in the no-conflicts case (best case)
  • NDB$EPOCH : One extra MySQLD <-> data node round-trip is required per *batch* in the all-conflicts case (worst case)
  • NDB$EPOCH : Minimal impact to Binlog sizes - one extra row event per epoch.
  • NDB$EPOCH : Minimal overhead to Slave SQL CPU consumption
  • NDB$EPOCH_TRANS : One extra MySQLD <-> data node round-trip is required per *batch* per *pass* in the all-conflicts case (worst case)
  • NDB$EPOCH_TRANS : One round of two passes is required for each conflict newly created since the previous pass.
  • NDB$EPOCH_TRANS : Small impact to Binlog sizes - one extra row event per epoch plus one user transaction id per row event.
  • NDB$EPOCH_TRANS : Small overhead to Slave SQL CPU consumption in no-conflict case

Current and intrinsic limitations

These functions support automatic conflict detection and handling without schema or application changes, but there are a number of limitations. Some limitations are due to the current implementation, some are just intrinsic in the asynchronous distributed consistency problem itself.

Intrinsic limitations
  • Reads from the Secondary are tentative
    Data committed on the secondary may later be rolled back. The window of potential rollback is limited, after which Secondary data can be considered stable. This is described in more detail here.
  • Writes to the Secondary may be rolled back
    If this occurs, the fact will be recorded on the Primary. Once a committed write is stable it will not be rolled back.
  • Out-of-band dependencies between transactions are out-of-scope
    For example direct communication between two clients creating a dependency between their committed transactions, not observable from their database footprints.

Current implementation limitations

  • Detected transaction dependencies are limited to dependencies between binlogged writes (Insert, Update, Delete)
    Reads are not currently included.
  • Delete vs Delete+Insert conflicts risk data divergence
    Delete vs Delete conflicts are detected, but currently do not result in conflict handling, so that Delete vs Delete + Insert can result in data divergence.
  • With NDB$EPOCH_TRANS, unplanned Primary outages may require manual steps to restore Secondary consistency
    With pending multiple, time spaced, non-overlapping transactional conflicts, an unexpected failure may need some Binlog processing to ensure consistency.

Want to try it out?

Andrew Morgan has written a great post showing how to set up NDB$EPOCH_TRANS. He's even included non-ascii art. This is probably the easiest way to get started. NDB$EPOCH is slightly easier to get started with as the --ndb-log-transaction-id (and Binlog v2) options are not required.

Edit 23/12/11 : Added index