Saturday, 26 March 2011

Data distribution in MySQL Cluster

MySQL Cluster distributes rows amongst the data nodes in a cluster, and also provides data replication. How does this work? What are the trade offs?

Table fragments

Tables are 'horizontally fragmented' into table fragments each containing a disjoint subset of the rows of the table. The union of rows in all table fragments is the set of rows in the table. Rows are always identified by their primary key. Tables with no primary key are given a hidden primary key by MySQLD.

By default, one table fragment is created for each data node in the cluster at the time the table is created.

Node groups and Fragment replicas

The data nodes in a cluster are logically divided into Node groups. The size of each Node group is controlled by the NoOfReplicas parameter. All data nodes in a Node group store the same data. In other words, where the NoOfReplicas parameter is two or greater, each table fragment has a number of replicas, stored on multiple separate data nodes in the same nodegroup for availability.

One replica of each fragment is considered primary, and the other(s) are considered backup replicas. Normally, each node contains a mix of primary and backup fragments for every table, which encourages system balance.

Which replica to use?

The primary fragment replica is used to serialise locking between transactions concurrently accessing the same row. Write operations update all fragment replicas synchronously, ensuring no committed data loss on node failure. Read operations normally access the primary fragment replica, ensuring consistency. Reads with a special lock mode can access the backup fragment replicas.

Primary key read protocol

When an NdbApi client (for example a MySQLD process) wants to read a row by primary key, it sends a read request to a data node acting as a Transaction Coordinator (TC).
The TC node will determine which fragment the row would be stored in from the primary key, decide which replica to access (usually the primary), and send a read request to the data node containing that fragment replica. The data node containing the fragment replica then sends the row's data (if present) directly back to the requesting NdbApi client, and also sends a read acknowledgement or failure notification back to the TC node, which also propagates it back to the NdbApi client.

Minimising inter data node hops

The 'critical path' for this protocol in terms of potential inter-data-node hops is four hops :

Client -> TC -> Fragment -> TC -> Client

To minimise remote client experienced latency, ideally two inter-node hops can be avoided by having the TC node and the Fragment replica(s) on the same node. This requires controlling the choice of node for TC based on the primary key of the data which will be read. Where a transaction only reads rows stored on the same node as its TC, this can improve latency and system efficiency.

Distribution awareness

From NdbApi, users can specify a table and key when starting a transaction. The transaction will then choose a TC data node based on where the corresponding row's primary fragment replica is located in the system. This mechanism is sometimes referred to as 'transaction hinting'.

The Ndb handler in MySQLD generally waits for the first primary key lookup in a user session before starting an NdbApi transaction, so that it can choose a TC node based on this. This is a best-effort attempt at having the data node acting as TC colocated with the accessed data. This feature is usually referred to as 'Distribution Awareness'.

Write operations also benefit from distribution awareness, but not to the same extent in systems with NoOfReplicas > 1. Write operations must update all fragment replicas, which must be stored on different nodes, in the same nodegroup, so for NoOfReplicas > 1, distribution awareness avoids inter-node-group communication, and some intra-node-group communication, but some inter-data-node communication is always required. In a system with good data partitioning and distribution awareness, most read transactions will access only one data node, and write transactions will result in messaging between the data nodes of a single node group. Messaging between node groups will be minimal.

Distribution keys

By default, the whole of a table's primary key is used to determine which fragment replica will store a row. However, any subset of the columns in the primary key can be used. The key columns used to determine the row distribution are called the 'distribution key'.

Where a table's primary key contains only one column, the distribution key must be the full primary key. Where the primary key has more than one column, the distribution key can be different to (a subset of) the primary key.

From MySQLD, a distribution key can be set using the normal PARTITION BY KEY() syntax. The effect of using a distribution key which is a subset of the primary key is that rows with different primary key values, but the same distribution key values are guaranteed to be stored in the same table fragment.

For example, if we create a table :

CREATE TABLE user_accounts (user_id BIGINT,
account_type VARCHAR(255),
username VARCHAR(60),
state INT,
PRIMARY KEY (user_id, account_type))
engine = ndb partition by key (user_id);

Then insert some rows :

INSERT INTO user_accounts VALUES (22, "Twitter", "Bader", 2),
(22, "Facebook", "Bd77", 2),
(22, "Flickr", "BadB", 3),
(23, "Facebook", "JJ", 2);

Then we know that all rows with the same value(s) for the distribution key (user_id), will be stored on the same fragment. If we know that individual transactions are likely to access rows with the same distribution key value then this will increase the effectiveness of distribution awareness. Many schemas are 'partitionable' like this, though not all.

Note that partitioning is a performance hint in Ndb - correctness is not affected in any way, and transactions can always span table fragments on the same or different data nodes. This allows applications to take advantage of the performance advantages of distribution awareness without requiring that all transactions affect only one node etc as required by simpler 'sharding' mechanisms.

Correlated distribution keys across tables

A further guarantee from Ndb is that two tables with the same number of fragments, and the same number and type of distribution keys will have rows distributed in the same way.

For example, if we add another table :

CREATE TABLE user_prefs (user_id BIGINT,
type VARCHAR(60),
value VARCHAR(255),
PRIMARY KEY (user_id, type))
engine = ndb partition by key (user_id);

Then insert some rows :

INSERT INTO user_prefs VALUES (22, "Coffee", "Milk + 6 sugars"),
(22, "Eggs", "Over easy"),
(23, "Custard", "With skin");

Then we know that the rows with the same user_id in the user_prefs and user_accounts tables will be stored on the same data node. Again, this helps with distribution awareness. In this example, we are ensuring that rows related to a single user, as identified by a common user_id, will be located on one data node, maximising system efficiency, and minimising latency.

Ordered index scan pruning

MySQL Cluster supports arbitrary ordered indexes. Ordered indexes are defined on one or more columns and support range scan operations. Range scans are defined by supplying optional lower and upper bounds. All rows between these bounds are returned.

Each Ndb ordered index is implemented as a number of in memory tree structures (index fragments), distributed with the fragments of the indexed table. Each index fragment contains the index entries for the local table fragment. Having ordered indexes local to the table fragments makes index maintenance more efficient, but means that there may not be much locality to exploit when scanning as rows in a range may be spread across all index fragments of an index.

The only case where an ordered index scan does not require to scan all index fragments is where it is known that all rows in the range will be found in one table fragment.
This is the case where both :
  1. The ordered index has all of the table's distribution keys as a prefix
  2. The range is contained within one value of the table's distribution keys

NdbApi detects this case when a range scan is defined, and 'prunes' the scan to one index fragment (and therefore one data node). For all other cases, all index fragments must be scanned.

Continuing the example above, assuming an ordered index on the primary key, the following ordered index scans can be pruned :

SELECT * FROM user_accounts WHERE user_id = 22;
SELECT * FROM user_accounts WHERE user_id = 22 AND account_type LIKE 'F%';

However, the following ordered index scans cannot be pruned, as matching rows are not guaranteed to be stored in one table fragment :

SELECT * FROM user_accounts WHERE account_type = "Facebook";
SELECT * FROM user_accounts WHERE user_id > 20 AND user_id < 30;

MySQLD partitioning variants and manually controlling distribution

Since MySQL 5.1, table partitioning has been supported. Tables can be partitioned based on functions of the distribution keys such as :

  • KEY
  • HASH
  • LIST

For engines other than Ndb, partitioning is implemented in the Server, with each partition implemented as a separate table in the Storage engine. Ndb implements these partition functions natively, using them to control data distribution across table fragments in a single table.

From Ndb's point of view, KEY and LINEAR KEY are native partitioning functions. Ndb knows how to determine which table fragment to use for a row from a table's distribution key, based on an MD5 hash of the distribution key.

HASH, RANGE and LIST are not natively supported by Ndb. When accessing tables defined using these functions, MySQLD must supply information to NdbApi to indicate which fragments to access. For example before primary key insert, update, delete and read operations, the table fragment to perform the operation on must be supplied. From MySQLD, the partitioning layer supplies this information.

Any NdbApi application can use the same mechanisms to manually control data distribution across table fragments. At the NdbApi level this is referred to as 'User Defined' partitioning. This feature is rarely used. One downside of using User Defined partitioning is that online data redistribution is not supported. I'll discuss Online data redistribution in a future post here.

Edited on 12/10/11 to fix formatting imbalance


zoby said...

Hi Frazer, thanks for the great post ! About hinting, it seems to be possible to specify a partition id when starting a transaction :

NdbTransaction* startTransaction
const NdbDictionary::Table* table,
Uint32 partitionId

I'm not sure how that's usable ?

Frazer said...

Hi zoby,
When using 'UserDefined' partitioning at the NdbApi level, the distribution of rows to fragments is fully under the control of the user. This is the underlying mechanism used for MySQL's LIST, RANGE and HASH partitioning on Ndb tables. The MySQL Server calculates which partition the row should be insert/updated/deleted/read into/from and explicitly sets the partition id on NdbApi insert/update/delete/read operations.
The method signature you show is used to hint transactions to the correct node for the fragment given. Where UserDefined partitioning is in use for a table, the partitionId (or fragment id) is given explicitly.
For non User-Defined partitioned tables (Native partitioned, e.g. PARTITION BY [LINEAR] KEY), Ndb knows how to map the distribution key to the partition (fragment) id, so the distribution keys can be supplied rather than an explicit partition id.
Hope this makes sense,

Umesh Shastry said...

Hello Frazer,

Thanks a lot for the wonderful explanation.


Anonymous said...

Thank you a lot for this post! I am currently looking at NDB and I was wondering about data locality and your post is much more clean than the official docs! Thanks again!