Showing posts with label cluster. Show all posts

Wednesday, 11 November 2020

MySQL Cluster Backup and Restore

MySQL Ndb Cluster provides durability for data by default via logging and checkpointing.

In addition, users can take backups at any time, which allows for disaster recovery, replication synchronisation, data portability and other use cases.

This post looks at the backup and restore mechanisms in MySQL Ndb Cluster.

MySQL Ndb Cluster architecture recap

MySQL Ndb Cluster is a distributed SQL relational database :

  • Designed for low overhead read + write scale out, high availability, high throughput and low latency.
  • Providing distributed parallel joins, transactions, row locks, foreign keys.
  • Data is primarily stored and managed by a set of independent data node processes.
  • Data is accessed via distributed MySQL servers and other clients.
  • Performance :
    • Scale out read + write : Tables and indices are distributed across data nodes.
    • Scale up read + write : Tables and indices are distributed across Local Data Manager (LDM) components within each data node, with parallel checkpointing and redo logs.
    • Scale out + up of read + write : Fully parallel and distributed transaction prepare + commit.
  • High availability :
    • Tables and indices transactionally replicated across data nodes for high availability, locality, read scaling.
    • Automatic logging, checkpointing, failure detection, failover, recovery, resynchronisation
    • Online operations : Backup, schema changes, upgrade.
  • MySQL compatibility, async replication between clusters and InnoDB, connectors, 3rd party support.

MySQL Ndb Cluster is a superset of the MySQL Server functionality, and users may use a mix of Ndb and InnoDB tables and features.
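As a sketch of what that mix looks like in practice, a MySQL server attached to a cluster can hold both table types side by side and join across them (table and schema names here are placeholders):

```shell
# Hypothetical example: Ndb and InnoDB tables coexisting in one schema.
# The NDB table is stored in the cluster data nodes and visible from
# every attached MySQL server; the InnoDB table is local to this server.
mysql test -e "
  CREATE TABLE t_ndb  (id INT PRIMARY KEY, v INT) ENGINE=NDB;
  CREATE TABLE t_inno (id INT PRIMARY KEY, v INT) ENGINE=InnoDB;
  SELECT t_ndb.id, t_ndb.v, t_inno.v
    FROM t_ndb JOIN t_inno USING (id);"
```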

Ndb backup + ndb_restore

MySQL Ndb Cluster has a built-in backup mechanism called Ndb backup.  Ndb backup captures the state of all of the tables and indexes stored in the cluster data nodes, allowing them to be restored to a transactionally consistent point in time.  As this data is partitioned and distributed across the cluster data nodes, the backup is also partitioned and distributed, with each data node producing a standalone set of backup filesets containing a subset of the cluster's data.  Restoring all of the backup filesets results in restoring all of the cluster's data.

Parallel backup across and within data nodes

Ndb backup filesets are restored using a tool called ndb_restore.  This tool can read the filesets generated by Ndb backup, and uses NdbApi to efficiently restore the data to a target cluster.  The backup files contain a logical rather than physical format - restoration is effectively re-insertion and modification of rows.  

Parallel restore across and within data nodes

This design allows ndb_restore to be flexible about restoring data, supporting :
 - Restore to a different target schema
 - Restore of a subset of tables and indices
 - Restore to a different table or index distribution
 - Restore to a different cluster size or topology
 - Restore to a different cluster software version
 - Restore to a different hardware or OS platform
 - Remote restore across a network

 

Flexible restore - different target schema, cluster, parallelism etc

As the backup is partitioned into logical shared-nothing filesets, the restore is also naturally partitioned and parallelised for minimal restoration time.

Note that Ndb Backup captures only the data stored in the data nodes, e.g. tables with ENGINE=NDB[CLUSTER].  InnoDB tables and indices, and other MySQL Server local objects such as Triggers, Stored procedures and Views must be backed up and restored using different mechanisms such as mysqldump.
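For those MySQL-Server-local objects, a mysqldump invocation along these lines can capture what Ndb backup does not (host and schema names are placeholders):

```shell
# Hypothetical example: dump triggers, stored procedures, events and
# views, plus any InnoDB table data, for one schema on one MySQL server.
mysqldump --host=mysqld1 --routines --triggers --events \
  my_schema > my_schema_local_objects.sql
```

Note that such a dump also includes any Ndb tables in the schema unless they are explicitly excluded, so in practice the two backup mechanisms are scoped to avoid overlap.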

Binary logs

Each Ndb Backup captures the state of a cluster at a transactionally consistent point in time, represented in Ndb as an epoch number.  Epoch numbers are also used in MySQL Ndb Cluster as transactional Binary log markers, dividing the parallel flow of committed transactions into consistent units.  This transactional alignment between backups and binary logs makes it easy to setup and manage async replication between clusters, and implement point in time recovery solutions.  The binlog position corresponding to a backup can be precisely determined, and replication can be established from that point onwards.
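The link-up is typically done via the standard Ndb replication system tables; a sketch (host names are placeholders):

```shell
# 1. On the restored replica, find the epoch restored by
#    ndb_restore --restore-epoch
mysql --host=replica -e \
  "SELECT MAX(epoch) AS latest FROM mysql.ndb_apply_status;"

# 2. On the source, map the next epoch after <latest> to a binlog
#    file name and position
mysql --host=source -e \
  "SELECT SUBSTRING_INDEX(File, '/', -1) AS binlog_file, Position
     FROM mysql.ndb_binlog_index
    WHERE epoch > <latest>
    ORDER BY epoch ASC LIMIT 1;"

# 3. On the replica, start replication from that point with
#    CHANGE MASTER TO MASTER_LOG_FILE=..., MASTER_LOG_POS=...
```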


Indexes and Constraints

MySQL Ndb Cluster supports secondary unique and ordered indexes, and transactional foreign key constraints.  Unique and foreign key constraints are not enforceable until backup restoration is complete, and maintaining secondary indexes during data restoration consumes resources.  ndb_restore has options to temporarily disable and then rebuild secondary indexes and constraints around the bulk data loading phase of restore allowing :
 - Optimised restoration without the need to maintain indexes, check constraints
 - Avoidance of dependency problems with e.g. circular foreign key references, self references etc
 - Optimised parallel index build

All constraints are naturally checked as part of the constraint rebuild, so integrity is enforced.

Schema changes

Ndb backup filesets capture the schema of tables and indices.  ndb_restore uses this information to recreate tables and indexes prior to restoring data.  

When restoring data, ndb_restore compares the schema of tables and indexes in the target cluster to those captured in the backup.  ndb_restore optionally allows a number of differences, e.g. in numbers of columns, type widths etc, so that data captured from an old schema version can be restored into a new schema without intermediate transformation steps.

Recent versions have added support for column remapping, where data can be transformed, and changes to the set of primary key columns, allowing distribution optimisations.
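As a sketch, some of the relevant ndb_restore schema-flexibility options look like this (backup id, paths and the remapped column are placeholders):

```shell
# --promote-attributes : allow widening type changes (e.g. INT to BIGINT)
# --lossy-conversions  : allow narrowing changes where the values fit
# --remap-column       : transform column values, here adding an offset of 1000
ndb_restore --nodeid=1 --backupid=100 --backup-path=/backups/BACKUP-100 \
  --restore-data --promote-attributes --lossy-conversions \
  --remap-column=mydb.mytable.id:offset:1000
```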

These features can be used alongside the schema flexibility features of MySQL asynchronous replication, allowing replication to be established and synchronised from an old schema cluster to a new schema cluster, and vice-versa.

Compression

Ndb backup filesets can optionally be compressed using a parameter in the cluster configuration.  Compression is performed by background IO threads in the data node processes, prior to writing data to disk.  ndb_restore handles compressed backups automatically during restoration.
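The relevant setting is a single data node parameter; a minimal config.ini fragment might look like this :

```ini
[ndbd default]
# Compress backup files written by all data nodes
CompressedBackup=1
```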

Encryption

From the 8.0.22 release, Ndb backup filesets can be encrypted.  Backup encryption provides protection for data which must leave the cluster security perimeter, and combines with e.g. MySQL Binlog encryption to help implement a secure async replication mechanism.  Backup encryption occurs in background IO threads in the data node processes.

Backups are encrypted and decrypted using a user supplied password.  For each file in each backup fileset, a unique random salt value is generated, and combined with the password via a key derivation function to generate a symmetric encryption key for that file.  The key is then used to encrypt the backup data using AES 256 CBC inline.  For restoration, ndb_restore combines the user supplied password with the per-file salt values to determine the unique decryption key for each file.

Key = KDF(RandomSalt, Password)

EncryptedData = AES_256_CBC(Key, PlainData)


Using an external password in this way allows users to manage backup encryption secrets in alignment with existing policies and systems.
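The scheme can be illustrated with OpenSSL, which supports the same password-plus-salt key derivation and AES-256-CBC cipher.  This is only an illustration of the concept - the actual Ndb backup file format and key derivation function are internal to the product :

```shell
PASSWORD='backup-secret'
PLAIN='some backup data'

# Encrypt : openssl generates a random salt, derives a key from
# (salt, password) via a KDF (PBKDF2 here), then applies AES-256-CBC
ENC=$(printf '%s' "$PLAIN" | \
  openssl enc -aes-256-cbc -pbkdf2 -salt -pass pass:"$PASSWORD" -base64 -A)

# Decrypt : the salt stored alongside the data, combined with the
# user supplied password, regenerates the same key
DEC=$(printf '%s' "$ENC" | \
  openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:"$PASSWORD" -base64 -A)

[ "$DEC" = "$PLAIN" ] && echo "roundtrip ok"
```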


Backup and restore parallelism

Each Ndb backup runs in parallel across all data nodes, with each node producing independent filesets.  At restore time, each ndb_restore instance restores batches of rows concurrently, and multiple ndb_restore instances can be run in parallel, each operating on independent backup filesets.

From the 8.0.16 release onwards, each data node can produce multiple backup filesets - one per Local Data Manager (LDM) component - and ndb_restore restores multiple backup filesets in parallel by default.


In the 8.0.20 release, the ability to use multiple ndb_restore instances to restore slices of a single backup fileset was added.
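As a sketch, slicing a single node's fileset across four concurrent ndb_restore processes looks like this (ids and paths are placeholders):

```shell
# Each process restores one quarter of node 1's fileset
for SLICE in 0 1 2 3; do
  ndb_restore --nodeid=1 --backupid=100 --backup-path=/backups/BACKUP-100 \
    --restore-data --num-slices=4 --slice-id=$SLICE &
done
wait   # all four slices restore concurrently
```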


Triggering a backup

Users trigger a new Ndb backup using the management API (ndb_mgm / mgmapi).  The command in the ndb_mgm shell (version 8.0.22) looks like this :

ndb_mgm > START BACKUP [<backup id>] [ENCRYPT PASSWORD='<password>']
              [SNAPSHOTSTART | SNAPSHOTEND] [NOWAIT | WAIT STARTED | WAIT COMPLETED]


The backup id affects the name of the backup files on disk.  If no backup id is supplied, then the cluster generates one internally.

If an encryption password is supplied, then the generated backup files will be encrypted using this password.  If the RequireEncryptedBackup parameter is set in the cluster configuration, then a password must always be supplied.

SnapshotStart and SnapshotEnd give some control over the choice of the transactionally consistent point in time that the backup represents - either at the start of the backup, before the backup is STARTED, or at the end of the backup, before the backup is COMPLETED.  The default is SnapshotEnd, but SnapshotStart can be useful for coordinating the backup of an Ndb cluster with other clusters and external systems.

The NoWait, Wait Started and Wait Completed options affect when the START BACKUP command itself will return control to the user, with Wait Completed as the default.
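For example, a backup can be triggered from the command line via ndb_mgm's -e option (the backup id and password here are placeholders):

```shell
# Trigger backup id 100, encrypted, transactionally consistent at the
# start of the backup, returning control once the backup has started
ndb_mgm -e "START BACKUP 100 ENCRYPT PASSWORD='Passw0rd' SNAPSHOTSTART WAIT STARTED"
```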

Restoring a backup

The ndb_restore tool operates on one or more backup filesets generated by a node at backup time.  The main options are :

  • --restore-meta : Read the schema metadata from the backup and create the same schema in the target cluster.
  • --restore-data : Read the backup data and log files, and use NdbApi to insert, update and delete data to restore to the backup consistency point.

Options can be combined, so that a single ndb_restore invocation performs multiple actions.

All backup filesets redundantly contain the schema metadata, but the --restore-meta step need only be performed once.  This step will recreate the schema as it was at backup time, though in cases where the table distribution was not explicitly set, it will be automatically adjusted to suit the target cluster.

For the --restore-data step, the rows in the backup fileset's DATA file are first inserted into the target cluster, then the entries from the backup fileset's LOG file are idempotently applied to bring the target cluster to a transactionally consistent state.

For a SNAPSHOTEND backup, the backup log contains actions to redo, which are applied in the order they were logged.
For a SNAPSHOTSTART backup, the backup log contains actions to undo, which are applied in the reverse of the order that they were logged.

Where there are schema changes, or transformations applied, these are performed as data is restored.  One exception is where type changes convert to/from/between BLOB types - in this case ndb_restore automatically uses internal staging tables to perform the restore in two steps.

ndb_restore can be used to restore data from backup filesets while a cluster is in operation, with some caveats :

  • Data being restored is only consistent when the restore is complete, before which :
    • Other clients can observe partially restored data
    • Constraints may be temporarily violated


ndb_restore has a number of other relevant options, covered in the reference documentation.

The ndb_restore tool can also be used to show the content of backup filesets in various ways.
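For example, the --print family of options dumps fileset content to stdout without touching a target cluster (ids and paths are placeholders):

```shell
ARGS="--nodeid=1 --backupid=100 --backup-path=/backups/BACKUP-100"
ndb_restore $ARGS --print-meta   # schema metadata captured in the backup
ndb_restore $ARGS --print-data   # row data from the DATA file
ndb_restore $ARGS --print-log    # entries from the LOG file
```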


Restoration sequences 

ndb_restore is a flexible tool.  Full cluster restoration normally follows a sequence like this.

  1. Restore/create schema - serial

    ndb_restore --restore-meta ...

    or some other means of creating a schema :
    • Apply a mysqldump schema dump
    • Execute a SQL script
    • Use ndb_restore --restore-meta, then ALTER the tables
    •  ...
     
    Tables or databases can be included or excluded.



  2. [ Drop indices and constraints - serial ]

    ndb_restore --disable-indexes ...

    This step is only required if there are unique or foreign key constraints on the tables being restored.

    Note that if the target cluster schema does not reflect the schema in the backup, then extra steps may be needed here to ensure all indices and constraints are disabled and rebuilt.



  3. Restore data - parallel/serial

    ndb_restore --restore-data ...

    This step must be performed for each node which generated backup filesets, but these steps can be done serially or in parallel as they are independent of each other.

    Tables or databases can be included or excluded.

    If the schema in the target cluster is different, then schema flexibility options may be required.
     


  4. [ Restore epoch - serial ]

    ndb_restore --restore-epoch ...

    This step is only necessary when restoring a backup from another cluster which will act as an async replication source.


     
  5. [ Rebuild indices and constraints - serial ]

    ndb_restore --rebuild-indexes ...

    This step is only necessary when indices and constraints have been disabled (step 2).
    While the step itself must be run serially, internally it makes use of the parallelism of the cluster for fast index and constraint rebuilds.


For restoring a backup from a single data node cluster, with no schema changes, a single command can perform all of these steps in one pass :


ndb_restore --restore-meta --disable-indexes --restore-data --restore-epoch --rebuild-indexes ...

However, it is more common that a backup has come from a multi node cluster, and requires a number of invocations of ndb_restore to fully restore.  This enables efficient parallel restore independent of cluster scale.
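Putting the sequence together for a hypothetical two data node cluster (node ids 1 and 2, backup id 100; paths are placeholders, and in practice each node's filesets may sit in per-node directories):

```shell
B="--backupid=100 --backup-path=/backups/BACKUP-100"

ndb_restore --nodeid=1 $B --restore-meta --disable-indexes   # steps 1+2 : once, serial
ndb_restore --nodeid=1 $B --restore-data &                   # step 3    : per node...
ndb_restore --nodeid=2 $B --restore-data &                   #             ...in parallel
wait
ndb_restore --nodeid=1 $B --restore-epoch                    # step 4    : once, if replicating
ndb_restore --nodeid=1 $B --rebuild-indexes                  # step 5    : once, serial
```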

Thursday, 29 October 2020

State of the DolphiNDB

Software development is always moving forward, and the latest version is always the best ... until the next one arrives.  When you visit the MySQL Ndb Cluster downloads page, you are naturally recommended to pull the latest MySQL Cluster 8.0.22 (announcement), but sometimes it is good to look back at the journey taken to reach this point.

7.x release series

Prior to the 8.0 releases, MySQL Ndb Cluster had a sequence of 7.x (x=0..6) releases based on MySQL Server versions 5.1, 5.5, 5.6 and 5.7.  In each 7.x release series, MySQL Ndb Cluster was under feature development for some time, producing a number of minor releases, until one minor release was validated as acceptable to be the first generally available (GA) version.  Once a release series 7.x was generally available, further changes in that series were confined to bug fixes and small improvements, with new feature development only becoming available in the next release series 7.x+1.

Each of the 7.x releases significantly improved and extended MySQL Ndb Cluster, improving SQL execution, threading, performance, async replication, checkpointing, recovery, functionality etc, as well as picking up a number of fixes and minor improvements discovered by customers, users and our own QA.

MySQL Ndb Cluster includes all of the functionality of MySQL Server, plus the unique Ndb Cluster software.  As a result, each MySQL Ndb Cluster release series has also benefited from improvements in the related MySQL Server version - new features, optimisations and bug fixes.

7.5 release series

The 7.5 release series was based on the MySQL Server 5.7 release, becoming GA (announcement) with version 7.5.4 in October 2016.

The new features included :

  • Support for JSON columns.
  • Stored or virtual generated columns.
  • Read + Join performance optimisations : read from any replica, read from any node.
  • Improved performance, memory management, usability.
  • Use of InnoDB for replication state management.

Since then we have released a further sixteen minor versions of the 7.5 series, one every quarter.  Each minor version includes small bug fixes and improvements to both the Cluster and MySQL Server software, with the current latest version being 7.5.20 released in October 2020.

7.6 release series

The 7.6 release series was also based on the MySQL Server 5.7 release series, becoming GA (announcement) with version 7.6.6 in May 2018.  

The new features included :

  • Optimised checkpoints giving reduced disk space usage, reduced write amplification, reduced recovery time - increased availability, increased practical data limits.
  • Improved distributed SQL join support.
  • Unified memory management.
  • Optimised node-local communication.
  • Optimised distribution awareness.
  • New parallel import utility.
  • Many performance improvements.

Since the 7.6.6 release of MySQL Cluster 7.6, we have released a further ten minor versions, with small bug fixes and improvements.  The current latest version is 7.6.16 released in October 2020.

8.0 release series

The MySQL Cluster 8.0 series is based on the MySQL Server 8.0 series, which first became generally available (announcement) as release 8.0.11 in April 2018.   

The MySQL Server 8.0 release included many features in this first version, and has been continuously extended in each of the eleven releases since.  MySQL Ndb Cluster is following this release model with its 8.0 releases, with new features arriving in each minor release.

The first generally available release of MySQL Cluster 8.0 was 8.0.19 (announcement), released in January 2020.

MySQL Cluster 8.0.19 (January 2020)

Features included :

  • Simplified configuration, elastic memory management.
  • Better support for 3 and 4 replica clusters.
  • Support for up to 144 data nodes in a cluster.
  • Improved multithreading for parallel backup and restore.
  • Increased row size limit.
  • Improved distributed SQL execution performance and efficiency.
  • Disk stored column performance and stability improvements.
  • Integration with MySQL Server 8.0.19 (Transactional MySQL Data Dictionary, many other MySQL Server features).
  • Usability improvements.

Since 8.0.19 there have been three further releases of MySQL Ndb Cluster 8.0 : 

MySQL Cluster 8.0.20 (April 2020)

Features included :

  • Further backup restoration process parallelism
  • Parallel intra-nodegroup links for high write volume applications
  • Improved distributed SQL processing of outer and semi joins
  • Performance improvements
  • Data integrity tooling improvements

MySQL Cluster 8.0.21 (July 2020)

Features included :

  • Support for up to 16TB of in-memory data per Data node
  • Read any replica tables by default
  • Further disk stored column performance and monitoring improvements
  • Improved distributed schema synchronisation

MySQL Cluster 8.0.22 (October 2020)

Features included :

  • Support for generating and restoring encrypted backups
  • Support for IPv6 addressing between all cluster node types
  • Improved performance for Blob column Insert, Update, Delete
  • Improved replication applier performance for Blob columns

As you can see, MySQL Ndb Cluster is constantly evolving and improving with several new releases every quarter.  The 8.0 releases continue to bring new capabilities on a regular basis, and there is no shortage of ideas in the pipeline.

For more information about MySQL Ndb Cluster, you can check the official documentation, see the getting started guide, follow a tutorial, read blogs, watch a video, check out recent Ndb 8.0 101, architecture or performance slides, or even read a book or two.


Wednesday, 5 February 2020

FOSDEM 2020

My post-FOSDEM detox has started - despite preparing by reading some survival guides, I hadn't really fathomed the variety and quantity (and quality) of beer that would flow over four days.  On reflection however, the beer flow has been far exceeded by the flow of tech content and conversation.

On Thursday and Friday I attended the pre-FOSDEM MySQL Days fringe event, where there were two tracks of talks and tutorials on MySQL including sessions on :
 - MySQL Server simplification
 - MySQL replication tooling improvements
 - Configuring group replication
 - Troubleshooting group replication
 - Using DNS for loadbalancing and failover
 - Upgrading to MySQL 8.0
 - New hash join implementation
 - Indexing JSON arrays
 - Datetime types
 - Check constraints
 - New VALUES() syntax
 - Security-hardening MySQL
 - Document store
 - MySQL Analytics
 - MySQL Replication performance modelling
 - Machine learning for MySQL service automation
 - Using reduced durability modes
 - Benchmarking
 - Using Vitess
 - Using InnoDB Cluster

There were sessions from Oracle, Percona, Planetscale, Facebook, TicketSolve, DBdeployer and Uber.

Naturally the highlights for me were the Friday sessions focused on MySQL Ndb Cluster including :


https://twitter.com/lefred/status/1223274289744547840


  • MySQL NDB 8.0 clusters in your laptop with DBdeployer - Giuseppe Maxia

    This was a great session from the Datacharmer Giuseppe, where he talked about DBdeployer, which is a tool for trying out different database software versions, configurations and topologies.  Giuseppe demonstrated how to use DBDeployer to setup two MySQL Ndb Clusters very simply, starting them in under 30 seconds and testing both the synchronous internal replication within each cluster and the asynchronous replication between them.

    If you want to experiment with MySQL Cluster then I think DBDeployer looks like the easiest and quickest way to get started.
https://twitter.com/lefred/status/1223204988148690946

 
  • SQL with MySQL Ndb 8.0 faster than your NoSQL allow - Bernd Ocklin

    This session was focused on one of Ndb's traditional strengths - performance.  Bernd presented an array of benchmarks showing that MySQL Ndb Cluster continues to be faster than the NoSQL products invented in response to the claim that 'SQL cannot scale'. 

    The presentation had lots of graphs and numbers showing linear scalability, hundreds of millions of reads per second from a single cluster, winning efficiency + performance comparisons on YCSB, scalability of parallel disk I/O and scalability of Ndb as a Hadoop metadata and small file store.

    Bernd then gave some insights into the unique architecture of Ndb, which allows it to achieve these numbers.

    Finally there were some more graphs and numbers :

    First looking at TPC-H query performance as data node internal parallelism scales, and showing how the query processing parallelism in Ndb allows a 2-node distributed (HA) Ndb setup to outperform a single node InnoDB on some queries.

    Next looking at DBT2 (TPC-C) scalability as # replicas and # NodeGroups were increased, and also looking at the behaviour of DBT2 when run using Optane memory for system scale.
https://twitter.com/lefred/status/1223222430623326208


  • Boosting MySQL Ndb Cluster & MySQL InnoDB Cluster with ProxySQL v2 - Marco Tusa

    Marco from Percona showed how having a layer of proxies in front of a cluster can simplify providing an HA MySQL service, smooth operations and allow new features not available in MySQL itself.  He laid out architectures for InnoDB Cluster and Ndb Cluster, and noted that Ndb cluster was the only cluster that could scale out writes. 

    He enthusiastically followed this up with numbers by showing how adding MySQL Servers to a cluster can allow it to scale beyond the limits imposed by a single MySQLD.

  • MySQL Ndb 8.0 Cluster tutorial - Frazer Clement

    In my session we stepped through installing, configuring, starting and using a MySQL Cluster, including looking at high availability features, push down joins, foreign keys, JSON, Binlogging, synchronized privileges, and online scale-out.

    The audience were generous with feedback and questions, complimenting my mid-90's style use of a text file, and generally surprised at how easily they had a cluster running on their laptop, even when manually performing all the steps required.
https://twitter.com/lefred/status/1223255796802314240


After my tutorial, I was interested in locating and consuming a drink, and luckily the MySQL Community dinner was just starting in the event venue.  This was a very agreeable event, well attended by the wider community including colleagues from 'old MySQL' that I had not seen for over 10 years.  It is great to see that while there are differences of opinion, and competition, we still have more in common than not.

https://twitter.com/lefred/status/1223340364683251712





On Saturday the action moved to the FOSDEM conference itself, running in a University campus.  Despite looking at the program I had not quite appreciated the volume of tracks, sessions and experts that would be present and talking in the rooms and the corridors.  The MySQL developer room alone had 17 talks from across the community, one every half hour for eight and a half hours !

I spent most of my time here, but also attended some talks in the other tracks, including some interesting performance talks in the Software Defined Networking room including a nice one on vectorising packet processing in DPDK.  Unfortunately I was unable to attend the run a mainframe on your laptop session in the Retrocomputing room, but luckily everything was being streamed, and the videos are being uploaded here.  While wandering around the University campus I stumbled upon a bar and tried some of the (open-source?) FOSDEM beer.

I will not bore you by enumerating all of the sessions, following beers and conversations, but I can definitely recommend reviewing the schedule and viewing some of the captured videos to get a bit of the experience for yourself.  Even better, if this sounds like your sort of thing, why not see if you can find a way to attend yourself, in 2021?

Many thanks to lefred for organising, running, presenting, photographing and tweeting across all the events.

Tuesday, 28 January 2020

MySQL Cluster at FOSDEM 2020


Tomorrow I depart for Brussels, where FOSDEM 2020 is taking place this weekend.

This will be my first FOSDEM and I have heard many good things about it - there are quite a few FOSDEM survival guides online which give a sense of what it is about.

As part of the main FOSDEM event there is a MySQL, MariaDB and friends developer room, with sessions from across the MySQL ecosystem.  Additionally there is also a Databases main track and a PostgreSQL developer room as well as a wide variety of other tracks and developer rooms.

As if that were not enough, there is also a FOSDEM Fringe, before, during and after FOSDEM itself.  This year MySQL are holding a two day fringe event called pre-FOSDEM MySQL Days on Thursday and Friday, where there are two tracks, with over thirty sessions from the MySQL community team, engineers, product managers, customers, contributors and others.

MySQL Cluster is well represented on Friday with four talks from Bernd Ocklin, Giuseppe Maxia and Marco Tusa about using the new MySQL Cluster 8.0.  Later on Friday afternoon I will be running a tutorial on MySQL Cluster 8.0, covering configuration, setup, and looking at some of the existing and new features.

If you are going to be there then maybe we will see each other.  If not then maybe you should try to get there next year?



Wednesday, 14 October 2015

MySQL Cluster at Oracle OpenWorld 2015

It's Oracle OpenWorld time, and MySQL Cluster will be in San Francisco again along with the rest of the MySQL team. 

The session agenda is online, but can be tricky to navigate given the breadth of the conference, but it's possible to narrow down to those sessions in the MySQL track.

From a MySQL Cluster perspective, there are two conference sessions from the Cluster development team, a Hands-On Lab from our Cluster Support team and, most exciting, a conference session from some real-world users at NEC.

200 Million QPS on Commodity Hardware—Getting Started with MySQL Cluster 7.4
[CON2178]

Frazer Clement, MySQL Cluster Technical lead, Oracle

Bernhard Ocklin, Director MySQL Cluster Engineering, Oracle

Do you have performance demands that a database can’t scale to meet — especially one as simple as MySQL? And if it were even possible, would it require top-of-the-range servers and storage? We previously demonstrated that MySQL Cluster could scale to 1 billion writes per minute, and MySQL Cluster 7.4 exceeded 200 million queries per second (QPS), all with open source software running on commodity servers. Discover how MySQL Cluster achieves this scalability while also delivering in-memory performance, 99.999 percent availability, active-active update-anywhere geographic redundancy, ACID transactions, and both SQL and NoSQL access. Finally, this session offers some tips on getting MySQL Cluster up and running so that you can try it out for yourself.
Conference Session

Monday, Oct 26, 2:45 p.m. | Moscone South—262

Fully Elastic Real-Time Services with MySQL Cluster
[CON4772]
Bernhard Ocklin, Director MySQL Cluster Engineering, Oracle
In this session, learn how the MySQL Cluster in-memory real-time engine brings together the best of both worlds—SQL and NoSQL. It scales from a single Raspberry Pi to systems on hundreds of servers.  MySQL Cluster powers networks for more than a billion mobile phone users worldwide and serves massive multiplayer online gaming back-ends. The MySQL Cluster architecture allows the addition or removal of nodes in seconds without interrupting service. It adapts to capacity demands, making resources available instantaneously when needed. Its native Node.js platform and connectors for Java make it easy to write real-time web applications in the cloud. This session guides you through common architectures and a use case for elastic MySQL Cluster cloud deployments.
Conference Session
Tuesday, Oct 27, 11:00 a.m. | Moscone South—262

MySQL Server and MySQL Cluster at India’s Financial Inclusion Gateway Service
[CON3846]
Shrestha Anishman, Engineer, NEC Corporation
Keiji Endo, Staff, NEC Corporation
Taichi Iwata, Manager, NEC Corporation
NEC is developing India’s financial inclusion gateway service utilizing MySQL Server and MySQL Cluster. To handle the massive, business-critical financial transactions involved in this project, NEC has conducted deep, detailed research on internal architectures and behaviors of MySQL products. This session explores use cases of MySQL Server and MySQL Cluster in India’s financial inclusion gateway service, and explains how to select appropriate software and optimize systems architecture with methods of software architecture analysis at NEC.
Conference Session
Tuesday, Oct 27, 5:15 p.m. | Moscone South—250

Get Started with MySQL Cluster 
[HOL3348]
Benedita Vasconcelos, Principal Technical Support Engineer, Oracle

Attend this hands-on lab to learn the basics of MySQL Cluster—when to use it, when not to use it, and how to install, configure, administer, and access it. MySQL Cluster is a write-scalable, real-time, ACID (atomicity, consistency, isolation, and durability)-compliant transactional database combining 99.999 percent availability with the low TCO of open source. Developers and DBAs attending this session have the chance to familiarize themselves with MySQL Cluster and get a better understanding of how to use it to meet the database challenges of next-generation web, cloud, and communications services with uncompromising scalability, uptime, and agility.
HOL (Hands-on Lab) Session
Thursday, Oct 29, 9:30 a.m. | Hotel Nikko—Peninsula (25th Floor)


There are many other relevant talks from the MySQL team covering Server improvements, Replication improvements, OpenStack integration, Customer success stories etc.

Outside the MySQL track, there are sessions covering the breadth of Oracle products and services.  Despite developing the world's most popular Open Source Database, Oracle still puts some effort into developing older DBMS products :).  I will be interested to hear of the latest advances there.

However the main attraction of a conference like this is meeting people and getting to talk face to face.  If you don't have an OOW pass, but want to mix with the MySQL crowd then please come to the MySQL Community Reception on October 27th.  You can pre-register to avoid any queueing.

See you there!

Thursday, 6 March 2014

Reports exaggerated

I've been letting the blog rest recently, and not so recently as well.  The problem is not a lack of subjects, but a lack of time to do them any justice.  However it is quite sad to see that my last entry was in September 2012, so it is time to post again.

Of late I have been pondering what I have to say about :
  • Distributed MVCC and write-scaling
  • Different approaches to eventual consistency with replicated RDBMS
  • Various MySQL Cluster related topics
  • Various general rambling and unstructured topics
However, these will take some time to percolate and calcify.

In the meantime here are some things I have found interesting recently :
  • Learn You Some Erlang for Great Good
    I actually rediscovered this online book after watching some Joe Armstrong + Erlang videos, themselves found via a spoof video about bringing Erlang up to date.  All recommended.  Erlang and Ndb Cluster share some Plex heritage, which can still be seen in their architectures today.  Since Plex, Erlang has mated with Prolog, and Ndb Cluster was involved in a car crash with C++.
  • HyperDex + Hyperdex Warp
    Something I discovered last year from Emin Gün Sirer's blog and have returned to since.  There are a number of nice ideas combined here (chain replication, value dependent chaining, hyperspace hashing, subspaces).  My favourites are the concept of 'spurious coordination' and their solution w.r.t. transaction consistency : ordering the route of the optimistic 'distributed commit' based on the affected keys.  I guess we need more independent analysis and evaluation to understand the strengths and weaknesses of these techniques.
  • Kronos
    This is a distributed HA 'event ordering system' from the same Cornell HyperDex team.  Thinking about distributed MVCC led me to thinking about efficiently maintaining a distributed partial ordering of events while avoiding 'spurious coordination'.  Kronos is an attempt to solve part of that problem in a kind of abstracted SOA way.  There is some nice detail in the paper about their dependency graph traversal optimisations, and how dependencies are immutable once discovered, so can be cached, replicated for read scale-out etc.  This could be a great systems building block.
  • Systems Performance book
    I am slowly reading this doorstop book from Brendan Gregg, an ex Solaris kernel engineer at Sun, now at Joyent.  It contains a great amount of recent practical information about Linux + Solaris performance analysis and optimisation.  Unix performance tools have always been a little opaque to me, with very little of how-to-approach a performance problem ever being documented.  This book covers many old and new tools, but also includes rare information on how to analyse problems with these tools, rather than just syntax and units of values returned.  Perhaps even better is his supreme confidence about tackling and solving any performance problem that foolishly catches his eye.  I guess that comes from experience, but maybe a little can be conveyed to his readers by this book.
  • Google Spanner, Galera Cluster, MoSQL, RAMCloud, NuoDB, OpenReplica
    Different approaches, ideas, hidden tradeoffs, strengths and weaknesses!
One of my favourite discoveries was this quote attributed to Charles Babbage :
"On two occasions I have been asked 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?'
I am not able rightly to comprehend the kind of confusion of ideas that could provoke such a question." 
Strange how often this response has been on my lips since ! 

Sunday, 30 September 2012

Saturday at MySQL Connect

The first day of the first MySQL Connect conference is done.  It's been a busy day!  Many attendees are interested in the new MySQL Server 5.6 release, but of course MySQL Cluster is the main draw here.  After a session from Oracle on the new features in 7.2, and early access features in 7.3, I attended Santo Leto's MySQL Cluster 'hands on lab'.  Despite having started more clusters than most, I found it a new and exciting experience to install and run my cluster alongside everyone else.  The lab machines had some networking issues, but with Santo's help we seamlessly failed over to some downloads he'd prepared earlier - very professional!

Afterwards it was my turn to talk on the subject of MySQL Cluster performance.  The quality of the questions was impressive - there seems to be a very capable crowd in San Francisco this weekend.  I was flattered that some audience members were taking pictures of my slides, but there's no need, you can download them here, and probably from Oracle somewhere too.

Finally we had a 'Birds of a Feather' session where a number of users asked questions about replication, backups, failover + arbitration, system recovery etc...  Great to get the chance to explain some things, hear about user experiences, and get some feedback.

There's been a lot of good content so far, and interesting discussions, but the highlight for me has been meeting with colleagues old and new, from MySQL, Sun and Oracle times.  More of the same tomorrow, I'd better get some sleep!

Wednesday, 15 August 2012

Session at MySQL Connect


I will double my all-time total of public speaking engagements this September at the MySQL Connect conference in San Francisco.

The title of my session is "Delivering Breakthrough Performance with MySQL Cluster", and it's between 17:30 and 18:30 on Saturday 29th September.

The content is not finalised yet, so if there's something you would like to hear about which fits with the abstract, then comment below.  If it doesn't fit in with the abstract then we will just have to talk about it afterwards over some drinks.

This is the first year of MySQL Connect, but there's some great content.  I'm looking forward to meeting users, customers and other MySQL engineers, and hopefully getting the chance to learn something!

See you there.

Wednesday, 7 March 2012

The CAP theorem and MySQL Cluster

tldr; A single MySQL Cluster prioritises Consistency in Network partition events. Asynchronously replicating MySQL Clusters prioritise Availability in Network partition events.


I was recently asked about the relationship between MySQL Cluster and the CAP theorem. The CAP theorem is often described as a pick two out of three problem, such as choosing from good, cheap, fast. You can have any two, but you can't have all three. For CAP the three qualities are 'Consistency', 'Availability' and 'Partition tolerance'. CAP states that in a system with data replicated over a network only two of these three qualities can be maintained at once, so which two does MySQL Cluster provide?

Standard 'my interpretation of CAP' section

Everyone who discusses CAP likes to rehash it, and I'm no exception. Daniel Abadi has the best CAP write-up that I've read so far, which reframes CAP as a decision about whether to ultimately prioritise availability or data consistency in the event of a network partition. This is how I think of CAP. He also discusses related system behaviour in normal operation which I'll return to later.

While this reframing clarifies CAP, the terms network partition, availability and consistency also need some definition.

Network replicated database

CAP is only really relevant in the context of a network replicated database (or filesystem or state machine). A network replicated database stores copies of data in multiple different systems (database nodes), connected by a network. Data can be read and updated. Updates are propagated to all nodes with replicas via the network. Database clients connect to database nodes via the network to read data and make updates. Replication may occur to improve availability, to improve request latency, or to improve read bandwidth.

Availability

The network replicated database exists to provide services such as Read and Write on the data it stores. Its availability can be measured as the ability of any client to perform any service on any data item.

This Service Availability can be compromised by :
  • Failure of client nodes
  • Network failures between clients and database nodes
  • Network failures between database nodes
  • Failure of database nodes
Client node and networking failures cannot really be considered a property within the control of a database system, so I consider their effects out of the scope of CAP. However, where clients connect to a database node, and that database node is isolated from other database nodes, whether or not those clients are given service is within the scope of CAP.

Service Availability is not binary, it can partially degrade, perhaps by affecting :
  • A subset of all clients
  • A subset of all stored data
  • A subset of request types

The shades of grey within the definition of availability are responsible for most of the arguments around CAP. If we take a strict view - either all services available on all data for all clients, or nothing, then availability is fragile and hard to maintain. If we take a more flexible approach then some service availability can be preserved even with a completely decimated network. In the loosest definition, if any client receives any service on any data, then the system is still available. Rather than choose one position, I regard availability as a range from 100% down to 0% for a full outage. Anything in the middle is reduced availability, but it does not mean that the system is not serving its purpose adequately.

Consistency

For consistency to be satisfied, the multiple replicas of data in a network replicated database should behave as though there were only one copy of the data. Simultaneous reads of the same data item from clients connected to different database nodes must always return the same result. Where two or more updates to the same data item are submitted simultaneously, they must be serialised, or one must be rejected, or they must be merged so that a single value results. This one-copy model makes it simple for database clients to use the network replicated database as if it were a single database system with one atomically read/written copy of their data.

If one copy consistency is relaxed, then different database nodes may observably have different values for the same data item simultaneously. Over time the data copies may be aligned, but clients accessing the data must beware that reads may not return the results of the most recently accepted writes. This behaviour may be described as eventual consistency. Providing eventual consistency allows a network replicated database to maximise availability, but pushes the problem of dealing with transient inconsistencies up the stack to user applications. Furthermore there are varying qualities of eventual consistency, with varying guarantees and levels of application support available.

Network Partitions

Network partitions isolate subsets of the nodes of a network replicated database. The interesting property of a network partition is that each node subset cannot tell whether the other node subset(s) are :
  1. dead
  2. alive but isolated from clients
  3. alive and reachable by clients but isolated from us
Not knowing the state of the other subset(s) is what forces a system to decide between maximising service availability and maximising consistency. The interesting case is 3) where some database nodes (potentially containing all or some of the data) are alive elsewhere and have clients connected to them. If those clients are allowed to make writes on data copies stored on those database nodes, then we must lose one copy consistency as we cannot supply those new values in response to a read of our local copy. If those clients are not allowed to make writes then we have degraded service availability for them. Which is it to be? This is the unavoidable choice at the centre of the CAP theorem. Stated this way it seems less of a theorem and more of a fact.

Back to MySQL Cluster - which does it provide?

A single MySQL Cluster prioritises data consistency over availability when network partitions occur.

A pair of asynchronously replicating MySQL Clusters prioritise service availability over data consistency when network partitions occur.

So you can have it both ways with MySQL Cluster - Great!

Single MySQL Cluster - CP

Within a single MySQL Cluster, data is synchronously replicated between database nodes using two-phase commit. Nodes are monitored using heartbeats, and failed or silent nodes are promptly isolated by live and responsive nodes. Where a network partition occurs, live nodes in each partition regroup and decide what to do next :
  • If there are not enough live nodes to serve all of the data stored - shutdown
    Serving a subset of user data (and risking data consistency) is not an option
  • If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service
    No other subset of nodes can be isolated from us and serving clients
  • If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.
    There could be another subset of nodes regrouped into a viable cluster out there.
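The regroup decision above can be sketched as a simple function. This is a minimal illustration with boolean inputs; the names are mine, not NDB internals:

```python
# Hedged sketch of the regroup decision; inputs and names are illustrative.
# Assumes the dependents of the decision (data coverage of live vs missing
# nodes) have already been computed during regrouping.
def regroup_decision(live_nodes_cover_all_data, missing_nodes_cover_all_data):
    if not live_nodes_cover_all_data:
        # Serving a subset of the data would risk consistency - shut down.
        return "shutdown"
    if not missing_nodes_cover_all_data:
        # The missing nodes cannot form a viable cluster - safe to continue.
        return "continue"
    # A rival viable cluster might exist elsewhere - ask the arbitrator.
    return "arbitrate"
```

Note that only the third branch needs any external coordination; the first two can be decided locally by the surviving node subset.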

Arbitration occurs to avoid the split brain scenario where a cluster could theoretically split in two (or more), with each half (or third, or quarter) accepting writes and diverging from the others. In other words, arbitration occurs to preserve consistency.

Arbitration involves :
  • Database nodes agree on an arbitrator in advance
  • During node or network failure handling, no data writes are committed.
  • When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service.
  • Each request to the arbitrator will result in either : Yes, No or timeout
  • Anything other than Yes results in node shutdown.
  • The arbitrator only says Yes once per election round (First come first served). Therefore the arbitrator only says yes to one potential cluster in a partitioned network.

Note that arbitration is not the same as achieving a quorum. A cluster with three replicas and an arbitrator node can survive the loss of two data nodes as long as the arbitrator remains reachable to the last survivor. The arbitrator role is lightweight as it is not involved in normal traffic. I am surprised that the lightweight arbitrator pattern is not more common.
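The first-come-first-served behaviour of the arbitrator can be illustrated with a small sketch. The `Arbitrator` class and its method are hypothetical stand-ins, not the real management node implementation:

```python
# Minimal sketch of first-come-first-served arbitration; all names are
# illustrative, not taken from the NDB source.

class Arbitrator:
    """Grants a 'Yes' to at most one node subset per election round."""

    def __init__(self):
        self.granted_round = None  # round id that has already been won

    def request_service(self, round_id, subset_id):
        # First viable subset to ask in a round wins; later askers are refused.
        if self.granted_round == round_id:
            return "No"
        self.granted_round = round_id
        return "Yes"


arb = Arbitrator()
# Two node subsets, split by a network partition, race to the arbitrator.
print(arb.request_service(round_id=1, subset_id="A"))  # Yes - survives
print(arb.request_service(round_id=1, subset_id="B"))  # No  - shuts down
```

A real arbitrator must also handle the timeout case, which the requesting nodes treat the same as a "No".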

How does a single MySQL Cluster degrade service availability as a result of network partitions?

Where some subset of data nodes is isolated and shut down :
  • Those nodes are 100% out of service, until they restart and can rejoin the cluster
    They will attempt to do so automatically
  • Any clients connected only to those nodes are out of service
    By default clients attempt to connect to all data nodes, so partial connectivity issues needn't degrade client availability.
  • The remaining live nodes are 100% in-service
  • Clients connected to the remaining live nodes are 100% in service
Where no subset of data nodes is live
  • All clients experience 100% service loss, until the data nodes restart and can rejoin the cluster
    They will attempt to do so automatically.

A single MySQL Cluster does not degrade to partial data access, or read only modes as a result of network partitions. It does not sacrifice consistency.

How can MySQL Cluster be described as highly available if it sacrifices availability for consistency in the event of a network partition?

Availability is not binary - many types of network partition can erode availability for some clients, but do not extinguish it. Some set of clients continue to receive 100% service. Only double failures in the network can cause a network partition resulting in full service loss.
Furthermore, network partitions are not the only risks to availability : software errors, power failures, upgrades and overloads are other potential sources of downtime which Cluster is designed to overcome.

Asynchronously replicating clusters - AP


Where two Clusters are asynchronously replicating via normal MySQL Replication, in a circular configuration, reads and writes can be performed locally at both clusters. Data consistency within each cluster is guaranteed as normal, but data consistency across the two clusters is not. On the other hand, availability is not compromised by network partitioning of the two clusters. Each cluster can continue to accept read and write requests to all of the data from any connected client.

Eventual consistency between the clusters is possible when using conflict resolution functions such as NDB$EPOCH_TRANS, NDB$EPOCH, NDB$MAX etc.

How does consistency degrade between replicating MySQL Clusters during a network partition?

This depends on the conflict resolution function chosen, and how detected conflicts are handled. Some details of consistency guarantees provided by NDB$EPOCH et al are described here.

What about normal operation?

Abadi's post introduced his PACELC acronym, standing for something like :

if (network Partition)
{
    trade-off Availability vs Consistency;
}
else
{
    trade-off Latency vs Consistency;
}


My first comment has to be that it's bad form to put the common case in an else branch!
However, it is certainly true that the properties during normal operation are usually more important than what happens during a network partition. The ELC section is stating that while all database nodes are present, a network replicated database can choose between minimising request Latency, or maintaining Consistency. In theory this normal operation latency-vs-consistency tradeoff could be completely independent of the Network Partitioning availability-vs-consistency tradeoff, e.g. you could have any of :
  1. PA EL (Partition - Availability, Else - Latency minimisation)
  2. PA EC (Partition - Availability, Else - Consistency)
  3. PC EL (Partition - Consistency, Else - Latency minimisation)
  4. PC EC (Partition - Consistency, Else - Consistency)

The common cases are 1 + 4, where we choose either consistency at all times, or Maximum Availability and Minimum Latency. Case 2 is a system which aims for consistency, but when a network partition occurs, aims for Availability. Case 3 is a system which aims for minimal request Latency, and when a partition occurs aims for consistency.

Examples of systems of each type :
  1. Any eventually consistent system, especially with local-database-node updates + reads
  2. Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication)
  3. ???
  4. Always consistent systems (e.g. single database instance, single MySQL Cluster)

I am not aware of systems meeting case 3 where normally they minimise latency over consistency, but start choosing consistency after a network partition. Maybe this category should be called 'repentant systems'?

The problem for systems in Cases 1 or 2 - anywhere where Latency minimisation or Availability is chosen over consistency - is the need for user applications to deal with potential inconsistencies. It is not enough to say that things will 'eventually' be consistent. It's important to describe how inconsistent they can be, whether the temporary inconsistencies are values which were once valid, how those values relate to other, connected values etc.

There are certainly applications which can operate correctly with practical eventually consistent databases, but it's not well known how to design applications and schemas to cope with the transient states of an eventually consistent database. The first ORM framework to transparently support an underlying eventually consistent database may actually be worth the effort to use! A reasonable approach is to design schemas with associated read/modification 'protocols' as if they were abstract data types (ADTs). These ADTs can then have strengths and weaknesses, properties and limitations which make sense in some parts of an application schema where the need to support eventual consistency overcomes the inherent effort and limitations.

Stonebraker and others have commented on network partitions being a minor concern for a well designed datacentre-local network, where redundancy can be reliably implemented. Also the latency cost of maintaining consistency is lower as physical distances are smaller and hop counts are lower. This results in 'CP' systems being attractive at the data centre scale as the need to sacrifice availability due to network partition is rarely dominant, and the latency implications during normal operation are bearable. Perhaps this highlights the need in these theoretical discussions to illustrate theoretically problematic latencies and availabilities with real numbers.

At a wider network scale, latencies are naturally higher, implying that bandwidth is lower. The probability of network partitions of some sort may also increase, due to the larger number of components (and organisations) involved. These factors combine to make 'AP' systems more palatable. The everyday latency cost of consistency is higher, and losing availability due to potentially more frequent network partitions may not be acceptable. Again, real numbers are required to illuminate whether the achievable latencies and probable availability impacts are serious enough to warrant changing applications to deal with eventually consistent data. For a particular application there may or may not be a point at which an AP system would meet its requirements better.

Consistent systems can be scaled across many nodes and high latency links, but the observed operation latency, and the necessary impacts to availability implied by link failure set a natural ceiling on the desirable scale of a consistent system. Paraphrasing John Mashey, "Bandwidth improves, latency is forever". Applications that find the latency and availability constraints of a single consistent system unacceptable, must subdivide their datasets into smaller independent consistency zones and manage potential consistency shear between them.

Finally (another excessively long post), I think the technical and actual merits of widely distributed 'CP' systems are not well known as they have not been commonly available. Many different database systems support some form of asynchronous replication, but few offer synchronous replication, fewer still offer to support it over wide areas with higher latency and fluctuating links. As this changes, the true potential and weaknesses of these technologies, backed by real numbers, will start to appear.

Edit 7/3/12 : Fix bad link

Tuesday, 21 February 2012

One billion

As always, I am a little late, but I want to jump on the bandwagon and mention the recent MySQL Cluster milestone of passing 1 billion queries per minute. Apart from echoing the arbitrarily large ransom demand of Dr Evil, what does this mean?

Obviously 1 billion is only of interest to us humans as we generally happen to have 10 fingers, and seem to name multiples in steps of 10^3 for some reason. Each processor involved in this benchmark is clocked at several billion cycles per second, so a single billion is not so vast or fast.

Measuring over a minute also feels unnatural for a computer performance benchmark - we are used to lots of things happening every second! A minute is a long time in silicon.

What's more, these reads are served from tables stored entirely in memory - and everyone knows that main memory is infinitely fast and scalable and always getting cheaper, right?

If we convert to seconds we are left with only 17 million reads per second! Hardly worth getting out of bed for?

On the contrary, I think that achieving 17 million independent random reads per second, each read returning 100 bytes across a network, from a database that also supports arbitrary SQL, row locking, transactions, high availability and all sorts of other stuff, is pretty cool. I doubt that (m)any other similar databases can match this raw performance, though I look forward to being proved wrong.

(Also, don't forget to meet + beat 1.9 million random updates/s, synchronously replicated)

Raw performance is good, but not everyone just needs horsepower. The parallel, independent work on improving join performance (also known as SPJ/AQL) and query optimisation helps more applications harness this power, by improving the efficiency of joins.

I wrote a post about SPJ/AQL at the start of last year, when it was still in the early stages. Since then much has improved, to the extent that the performance improvement factors have become embarrassingly high on real user queries. A further post on the technical details of SPJ/AQL is long overdue... Perhaps the most interesting details are on the integration between the parallel, streaming linked operations and the essentially serialised MySQL Nested Loops join executor. A linked scan and lookup operation can be considered to be a form of parallel hash join, which the normal MySQL NLJ executor can invoke as part of executing a query. Who says Nested Loop joins can't scale?

Thursday, 22 December 2011

Eventual Consistency in MySQL Cluster - implementation part 3




As promised, this is the final post in a series looking at eventual consistency with MySQL Cluster asynchronous replication. This time I'll describe the transaction dependency tracking used with NDB$EPOCH_TRANS and review some of the implementation properties.

Transaction based conflict handling with NDB$EPOCH_TRANS

NDB$EPOCH_TRANS is almost exactly the same as NDB$EPOCH, except that when a conflict is detected on a row, the whole user transaction which made the conflicting row change is marked as conflicting, along with any dependent transactions. All of these rejected row operations are then handled using inserts to an exceptions table and realignment operations. This helps avoid the row-shear problems described here.

Including user transaction ids in the Binlog

Ndb Binlog epoch transactions contain row events from all the user transactions which committed in an epoch. However there is no information in the Binlog indicating which user transaction caused each row event. To allow detected conflicts to 'rollback' the other rows modified in the same user transaction, the Slave applying an epoch transaction needs to know which user transaction was responsible for each of the row events in the epoch transaction. This information can now be recorded in the Binlog by using the --ndb-log-transaction-id MySQLD option. Logging Ndb user transaction ids against rows in turn requires a v2 format RBR Binlog, enabled with the --log-bin-use-v1-row-events=0 option. The mysqlbinlog --verbose tool can be used to see per-row transaction information in the Binlog.

User transaction ids in the Binlog are useful for NDB$EPOCH_TRANS and more. One interesting possibility is to use the user transaction ids and same-row operation dependencies to sort the row events inside an epoch into a partial order. This could enable recovery to a consistent point other than an epoch boundary. A project for a rainy day perhaps?

NDB$EPOCH_TRANS multiple slave passes

Initially, NDB$EPOCH_TRANS proceeds in the same way as NDB$EPOCH, attempting to apply replicated row changes, with interpreted code attached to detect conflicts. If no row conflicts are detected, the epoch transaction is committed as normal with the same minimal overhead as NDB$EPOCH. However if a row conflict is detected, the epoch transaction is rolled back, and reapplied. This is where NDB$EPOCH_TRANS starts to diverge from NDB$EPOCH.

In this second pass, the user transaction ids of rows with detected conflicts are tracked, along with any inter-transaction dependencies detectable from the Binlog. At the end of the second pass, prior to commit, the set of conflicting user transactions is combined with the user transaction dependency data to get a complete set of conflicting user transactions. The epoch transaction initiated in the second pass is then rolled-back and a third pass begins.

In the third pass, only row events for non-conflicting transactions are applied, though these are still applied with conflict detecting interpreted programs attached in case a further conflict has arisen since the second pass. Conflict handling for row events belonging to conflicting transactions is performed in the same way as NDB$EPOCH. Prior to commit, the applied row events are checked for further conflicts. If further conflicts have occurred then the epoch transaction is rolled back again and we return to the second pass. If no further conflicts have occurred then the epoch transaction is committed.

These three passes, and associated rollbacks are only externally visible via new counters added to the MySQLD server. From an external observer's point of view, only non-conflicting transactions are committed, and all row events associated with conflicting transactions are handled as conflicts. As an optimisation, when transactional conflicts have been detected, further epochs are handled with just two passes (second and third) to improve efficiency. Once an epoch transaction with no conflicts has been applied, further epochs are initially handled with the more optimistic and efficient first pass.
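The pass structure above can be caricatured in a few lines. This is a toy, self-contained sketch: all names are mine, the conflict predicate stands in for the interpreted programs run on the data nodes, and a real applier works on epoch transactions inside the cluster, not Python lists:

```python
# Toy model of the multi-pass apply described above.
# events: list of (trans_id, row); dependents: trans_id -> set of trans_ids
# that (transitively) depend on it; is_conflicting: stand-in for the
# pushed-down conflict detection. Returns (committed, rejected) rows.
def apply_epoch(events, dependents, is_conflicting):
    # Pass 1: optimistic apply; if nothing conflicts, commit as NDB$EPOCH would.
    conflicted = {t for t, row in events if is_conflicting(row)}
    if not conflicted:
        return [row for _, row in events], []

    # Pass 2 (after rollback): close the conflict set over dependencies.
    bad = set(conflicted)
    for t in conflicted:
        bad |= dependents.get(t, set())

    # Pass 3: commit only events from non-conflicting transactions; the rest
    # are handled via the exceptions table and realignment operations.
    committed = [row for t, row in events if t not in bad]
    rejected = [row for t, row in events if t in bad]
    return committed, rejected


events = [(1, "r1"), (1, "r2"), (2, "r3"), (3, "r4")]
dependents = {1: {2}}   # trans 2 touched rows already touched by trans 1
committed, rejected = apply_epoch(events, dependents, lambda r: r == "r1")
# trans 1 conflicts on r1, dragging dependent trans 2 with it; trans 3 commits
```

The sketch omits the "further conflicts found on pass 3, go round again" loop, which in the real implementation is what the iteration counter below tracks.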

Dependency tracking implementation

To build the set of inter-transaction dependencies and conflicts, two hash tables are used. The first is a unique hashmap mapping row event tables and primary keys to transaction ids. If two events for the same table and primary key are found in a single epoch transaction then there is a dependency between those events, specifically the second event depends on the first. If the events belong to different user transactions then there is a dependency between the transactions.

Transaction dependency detection hash :
{Table, Primary keys} -> {Transaction id}

The second hash table is a hashmap of transaction id to an in-conflict marker and a list of dependent user transactions. When transaction dependencies are discovered using the first dependency detection hash, the second hash is modified to reflect the dependency. By the end of processing the epoch transaction, all dependencies detectable from the Binlog are described.

Transaction dependency tracking and conflict marking hash :
{Transaction id} -> {in_conflict, List}

As epoch operations are applied and row conflicts are detected, the operation's user transaction id is marked in the dependency hash as in-conflict. When marking a transaction as in-conflict, all of its dependent transactions must also be transitively marked as in-conflict. This is done by traversing the dependency tree of the in-conflict transaction. Due to slave batching, the addition of new dependencies and the marking of conflicting transactions is interleaved, so adding a dependency can result in a sub-tree being marked as in-conflict.

After the second pass is complete, the transaction dependency hash is used as a simple hash for looking up whether a particular transaction id is in conflict or not :

Transaction in-conflict lookup hash :
{Transaction id} -> {in_conflict}

This is used in the third pass to determine whether to apply each row event, or to proceed straight to conflict handling.

The size of these hashes, and the complexity of the dependency graph is bounded by the size of the epoch transaction. There is no need to track dependencies across the boundary of two epoch transactions, as any dependencies will be discovered via conflicts on the data committed by the first epoch transaction when attempting to apply the second epoch transaction.
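The two hash tables and the transitive marking can be sketched with plain Python dicts. Function and variable names here are mine, not the actual slave applier code:

```python
# Sketch of the dependency detection and conflict marking hashes described
# above; illustrative only, not the real implementation.

conflict_map = {}   # {(table, pk): trans_id} - detects same-row dependencies
trans_state = {}    # {trans_id: {"in_conflict": bool, "dependents": set}}

def record_event(table, pk, trans_id):
    """Called per row event: a later event on a row depends on the earlier one."""
    trans_state.setdefault(trans_id,
                           {"in_conflict": False, "dependents": set()})
    prev = conflict_map.get((table, pk))
    if prev is not None and prev != trans_id:
        # trans_id depends on prev; if prev is already in conflict,
        # the new dependent is marked immediately (interleaved marking).
        trans_state[prev]["dependents"].add(trans_id)
        if trans_state[prev]["in_conflict"]:
            mark_in_conflict(trans_id)
    conflict_map[(table, pk)] = trans_id

def mark_in_conflict(trans_id):
    """Transitively mark a transaction and its dependents as in-conflict."""
    state = trans_state.setdefault(trans_id,
                                   {"in_conflict": False, "dependents": set()})
    if state["in_conflict"]:
        return
    state["in_conflict"] = True
    for dep in state["dependents"]:
        mark_in_conflict(dep)

# Epoch contains two transactions touching the same row of table t1.
record_event("t1", (1,), trans_id=100)
record_event("t1", (1,), trans_id=200)   # 200 now depends on 100
mark_in_conflict(100)                    # conflict detected on trans 100
# Third-pass lookup: both 100 and its dependent 200 are now in conflict.
```

After the second pass, only the `in_conflict` flag in `trans_state` is consulted, which corresponds to the simple in-conflict lookup hash above.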

Event counters

Like the existing conflict detection functions, NDB$EPOCH_TRANS has a row-conflict detection counter called ndb_conflict_epoch_trans.

Additional counters have been added which specifically track the different events associated with transactional conflict detection. These can be seen with the usual SHOW GLOBAL STATUS LIKE syntax, or via the INFORMATION_SCHEMA tables.

  • ndb_conflict_trans_row_conflict_count
    This is essentially the same as ndb_conflict_epoch_trans - the number of row events with conflict detected.
  • ndb_conflict_trans_row_reject_count
    The number of row events which were handled as in-conflict. It will be at least as large as ndb_conflict_trans_row_conflict_count, and will be higher if other rows are implicated by being part of a conflicting transaction, or by being dependent on a row in a conflicting transaction.
    A separate ndb_conflict_trans_row_implicated_count could be constructed as ndb_conflict_trans_row_reject_count - ndb_conflict_trans_row_conflict_count.
  • ndb_conflict_trans_reject_count
    The number of discrete user transactions detected as in-conflict.
  • ndb_conflict_trans_conflict_commit_count
    The number of epoch transactions which had transactional conflicts detected during application.
  • ndb_conflict_trans_detect_iter_count
    The number of iterations of the three-pass algorithm that have occurred. Each set of passes counts as one. Normally this would be the same as ndb_conflict_trans_conflict_commit_count. Where further conflicts are found on the third pass, another iteration may be required, which would increase this count. So if this count is larger than ndb_conflict_trans_conflict_commit_count then there have been some conflicts generated concurrently with conflict detection, perhaps suggesting a high conflict rate.
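The derived implicated-row count mentioned above is simple arithmetic over the status counters. A minimal sketch, assuming the counters have been read (e.g. via SHOW GLOBAL STATUS) into a plain dict:

```python
# Derive the implicated-row count from the conflict status counters:
# rows rejected only because their transaction was implicated, rather
# than because they were directly detected as in conflict.

def implicated_rows(status):
    return (status["ndb_conflict_trans_row_reject_count"]
            - status["ndb_conflict_trans_row_conflict_count"])

counters = {"ndb_conflict_trans_row_reject_count": 10,
            "ndb_conflict_trans_row_conflict_count": 4}
print(implicated_rows(counters))   # 6 rows rejected by implication only
```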


Performance properties of NDB$EPOCH and NDB$EPOCH_TRANS

I have tried to avoid getting drawn into an explanation of Ndb replication in general, which would probably fill a terabyte of posts. Comparing replication using NDB$EPOCH and NDB$EPOCH_TRANS relative to Ndb replication with no conflict detection, what can we say?

  • Conflict detection logic is pushed down to data nodes for execution
    Minimising extra data transfer + locking
  • Slave operation batching is preserved
    Multiple row events are applied together, saving MySQLD <-> data node round trips, using data node parallelism
    For both algorithms, one extra MySQLD <-> data node round-trip is required in the no-conflicts case (best case)
  • NDB$EPOCH : One extra MySQLD <-> data node round-trip is required per *batch* in the all-conflicts case (worst case)
  • NDB$EPOCH : Minimal impact to Binlog sizes - one extra row event per epoch.
  • NDB$EPOCH : Minimal overhead to Slave SQL CPU consumption
  • NDB$EPOCH_TRANS : One extra MySQLD <-> data node round-trip is required per *batch* per *pass* in the all-conflicts case (worst case)
  • NDB$EPOCH_TRANS : One round of two passes is required for each conflict newly created since the previous pass.
  • NDB$EPOCH_TRANS : Small impact to Binlog sizes - one extra row event per epoch plus one user transaction id per row event.
  • NDB$EPOCH_TRANS : Small overhead to Slave SQL CPU consumption in no-conflict case

Current and intrinsic limitations

These functions support automatic conflict detection and handling without schema or application changes, but there are a number of limitations. Some limitations are due to the current implementation, some are just intrinsic in the asynchronous distributed consistency problem itself.

Intrinsic limitations
  • Reads from the Secondary are tentative
    Data committed on the secondary may later be rolled back. The window of potential rollback is limited, after which Secondary data can be considered stable. This is described in more detail here.
  • Writes to the Secondary may be rolled back
    If this occurs, the fact will be recorded on the Primary. Once a committed write is stable it will not be rolled back.
  • Out-of-band dependencies between transactions are out-of-scope
    For example direct communication between two clients creating a dependency between their committed transactions, not observable from their database footprints.

Current implementation limitations

  • Detected transaction dependencies are limited to dependencies between binlogged writes (Insert, Update, Delete)
    Reads are not currently included.
  • Delete vs Delete+Insert conflicts risk data divergence
    Delete vs Delete conflicts are detected, but currently do not trigger conflict handling, so a Delete vs Delete + Insert sequence can result in data divergence.
  • With NDB$EPOCH_TRANS, unplanned Primary outages may require manual steps to restore Secondary consistency
    If multiple, time-spaced, non-overlapping transactional conflicts are pending when an unexpected failure occurs, some manual Binlog processing may be needed to restore consistency.

Want to try it out?

Andrew Morgan has written a great post showing how to setup NDB$EPOCH_TRANS. He's even included non-ascii art. This is probably the easiest way to get started. NDB$EPOCH is slightly easier to get started with as the --ndb-log-transaction-id (and Binlog v2) options are not required.

Edit 23/12/11 : Added index