Monday 28 September 2009

Ndb software architecture

I'm sure that someone else can describe the actual history of Ndb development much better, but here's my limited and vague understanding.

  • Ndb is developed in an environment (Ericsson AXE telecoms switch) where Ericsson's PLEX is the language of choice
    PLEX supports multiple state machines (known as blocks) sending messages (known as signals) between them with some system-level conventions for starting up, restart and message classes. Blocks maintain internal state and define signal handling routines for different signal types. Very little abstraction within a block beyond subroutines is supported. (I'd love to hear some more detail on PLEX and how it has evolved). This architecture maps directly to the AXE processor design (APZ) which is unusual in having signal buffers implemented directly in silicon rather than software. This hard-coding drove Ndb's initial max supported signal size of 25 x 32-bit words.
  • An emulated PLEX environment (VM) is made available on Unix systems, written in C++
    The VM runs as a Unix process. PLEX code for blocks is interpreted. Signals are routed between blocks by the VM. This allows development and deployment of PLEX-based systems on standard Unix systems. It also allows PLEX-based systems to interact easily with Unix software. Each VM instance is a single-threaded process routing incoming signals to the signal handling functions in each block class.
  • A PLEX to C++ translation system is designed
    Blocks are mapped to large C++ classes with signal handling methods and per-block global state mapped to member variables. The limited labelling and abstraction encoded in the PLEX source are mapped to C style code within C++ classes.
  • The VM environment is 'branched' from the original PLEX/AXE environment and starts to evolve independently as a base for Ndb.
    It offers access to more OS services such as communication, disk IO etc. PLEX interpretation functionality is removed as all relevant PLEX code has been mapped to native C++. VM instances can communicate with each other over various channels and form a distributed system.
  • At some point in the timeline around here the Ndb team and product leave Ericsson
  • Over time, common block functionality is abstracted into base and utility classes.
    Hardware and system-convention sourced constraints are eased and the level of abstraction is raised. New blocks are designed and implemented without a PLEX heritage, making use of C++ abstraction facilities. Existing blocks are refactored.
  • Multi-threaded Ndbd (ndbmtd) is introduced, with groups of block instances running on different threads.
    Rather than being a radical design, it's a move back towards the original PLEX design point of 1 block instance per processor.

Today, Ndb still executes a 'blocks communicating via signals' model. Signals are no longer limited to 25 words. In single-threaded Ndb (ndbd), all blocks share a single thread, with separate threads used for inter-VM communication setup and disk IO. In multi-threaded Ndb (ndbmtd), block instances are grouped, and different functional groups share threads. In all cases, each block instance remains single-threaded, although its thread may be shared with other blocks.
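
To make the model a little more concrete, here is a very rough sketch of the idea in C++. The class and member names are invented for illustration - this is not Ndb's actual code - but the shape is similar: each block registers a handler per signal type, and the VM's job loop routes queued signals to the owning block.

    // Very rough sketch of the blocks-and-signals idea. The names and types
    // are invented for illustration - this is not Ndb's actual code.
    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <map>
    #include <vector>

    struct Signal {
        uint32_t receiverBlock;      // block instance which should handle it
        uint32_t signalType;         // identifies the kind of message
        std::vector<uint32_t> data;  // payload words
    };

    class Block {
    public:
        virtual ~Block() = default;
        // A block registers one handler per signal type it understands.
        void addHandler(uint32_t signalType, std::function<void(const Signal&)> fn) {
            handlers[signalType] = std::move(fn);
        }
        void execute(const Signal& sig) {
            auto it = handlers.find(sig.signalType);
            if (it != handlers.end()) it->second(sig);
        }
    private:
        std::map<uint32_t, std::function<void(const Signal&)>> handlers;
    };

    class Vm {
    public:
        void registerBlock(uint32_t blockNo, Block* block) { blocks[blockNo] = block; }
        void sendSignal(Signal sig) { jobQueue.push_back(std::move(sig)); }
        // Single-threaded job loop: route each queued signal to its block.
        void run() {
            while (!jobQueue.empty()) {
                Signal sig = std::move(jobQueue.front());
                jobQueue.pop_front();
                blocks[sig.receiverBlock]->execute(sig);
            }
        }
    private:
        std::map<uint32_t, Block*> blocks;
        std::deque<Signal> jobQueue;
    };

In the real source, the common block machinery (handler registration via addRecSignal, signal sending and so on) lives in storage/ndb/src/kernel/vm/SimulatedBlock.(h|c)pp.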

The blocks and signals model is reminiscent of Erlang and Hoare's CSP – where concurrency is modelled as serial (or sequential) processes communicating with explicit messages, as opposed to a shared-memory model where communication occurs via memory with correctness controlled by locks, memory barriers and atomic instructions. It can also be considered similar to MPI and the Active object / Actor model.

Using explicit messaging for synchronisation/communication has costs – at runtime a given algorithm may require more data copying. At design time, potential concurrency must be explicitly designed-in with messaging and state changes. Mapping sequential algorithms to message passing state machines may require a bigger code transformation than mapping to a naive multithread safe shared-memory and locks implementation.

However I believe that these costs are generally paid off by the benefit of improved code clarity. Inter state-machine synchronisation becomes clearly visible, making synchronisation costs easier to visualise and understand. With explicit messaging as the main mechanism for inter-thread and inter-process communication, there is only a small kernel of multithreaded code to be implemented, proved correct and optimised. The bulk of the code can be implemented in single threaded style. There is no need for diverse libraries of multithread-optimised data structures. Processor and system architecture specific code and tradeoffs are minimised.

Internally, Ndb's VM supports only asynchronous messages between blocks. Using an asynchronous message passing style has many benefits. As the sending thread does not block awaiting a response to a message sent, it can work on other jobs, perhaps even processing the message it just sent. This allows it to make the best use of warm instruction and data caches, reduces voluntary context switches and can reduce the likelihood of deadlock. Blocking IO (network, disk) is outsourced to a pool of threads. The signal processing thread(s) never block, except when no signals are available to process. The responsiveness of the system can be ensured by using prioritised job queues to determine the job to execute next and by minimising the time spent processing individual jobs. From a formal point of view, the number of possible multithreaded interactions is vastly reduced as thread-interleaving is only significant at signal processing boundaries. These limitations can make it easier to reason about the correctness and timing properties of the system.
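
As a rough illustration of the prioritised job queue idea (a minimal sketch, not Ndb's actual job buffer implementation), the consuming thread always prefers higher-priority work and only waits when there is nothing at all to do:

    // Minimal sketch of a prioritised job queue - illustrative only, not
    // Ndb's job buffer implementation. Higher-priority jobs are always
    // dispatched first; the consumer only waits when both queues are empty.
    #include <condition_variable>
    #include <deque>
    #include <mutex>

    template <typename Job>
    class PrioritisedJobQueue {
    public:
        void push(Job job, bool highPriority) {
            std::lock_guard<std::mutex> guard(lock);
            (highPriority ? prioA : prioB).push_back(std::move(job));
            available.notify_one();
        }
        // Returns the next job to execute, preferring the high-priority queue.
        Job pop() {
            std::unique_lock<std::mutex> guard(lock);
            available.wait(guard, [this] { return !prioA.empty() || !prioB.empty(); });
            std::deque<Job>& queue = prioA.empty() ? prioB : prioA;
            Job job = std::move(queue.front());
            queue.pop_front();
            return job;
        }
    private:
        std::mutex lock;
        std::condition_variable available;
        std::deque<Job> prioA;  // higher priority work
        std::deque<Job> prioB;  // normal priority work
    };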

However, coding in this asynchronous, event-driven style can be demanding. Any blocking operation (disk access, blocking communications, requests to other threads or processes etc.) must be implemented as an asynchronous request and response pair. This style can have an abstraction-dissolving property, as many published data structures and algorithms assume a synchronous model and make heavy use of the caller's stack for state storage and control flow. It can be difficult to design abstractions for the asynchronous style which don't leak so much messy detail as to be pointless. Additionally, the asynchronous style tends to flatten a system, as the need to return control to the lowest-level call point whenever concurrency is possible acts as a force against deep layers of abstraction. Side effects of this can include a tendency for error handling code to be non-localised to the source of the error. However, that is part of the charm of working on the system. The C++ environment gives a wide set of tools for designing such abstractions, and each improvement made simplifies future work.
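
To show the shape of that transformation, here is a minimal, self-contained sketch (the block, signals and names are invented, not Ndb's). A read-modify-write which would be a few lines of synchronous code becomes three handler phases, with the intermediate state parked in an explicit operation record instead of on the caller's stack:

    // Self-contained sketch of the synchronous-to-asynchronous transformation.
    // The block, signals and names are invented - this is not Ndb code.
    // The synchronous equivalent would be roughly:
    //   value = read(key); value++; write(key, value);
    #include <cstdint>
    #include <iostream>
    #include <map>

    struct OpRecord {      // replaces the local variables of the synchronous version
        uint32_t key;
        uint32_t value;
    };

    class UpdaterBlock {
    public:
        // Phase 1: receive the client request, park state, fire an async read.
        void execUPDATE_REQ(uint32_t key) {
            uint32_t opId = nextOpId++;
            inflight[opId] = OpRecord{key, 0};
            sendREAD_REQ(opId, key);     // returns immediately; no blocking wait
        }
        // Phase 2: the read response arrives later; resume from the saved record.
        void execREAD_CONF(uint32_t opId, uint32_t value) {
            OpRecord& op = inflight[opId];
            op.value = value + 1;        // the actual 'work' of the operation
            sendWRITE_REQ(opId, op.key, op.value);
        }
        // Phase 3: the write response completes the operation.
        void execWRITE_CONF(uint32_t opId) {
            std::cout << "op " << opId << " done, value=" << inflight[opId].value << "\n";
            inflight.erase(opId);
        }
    private:
        // Stubs standing in for signals sent to other blocks / threads.
        void sendREAD_REQ(uint32_t opId, uint32_t) { execREAD_CONF(opId, 41); }
        void sendWRITE_REQ(uint32_t opId, uint32_t, uint32_t) { execWRITE_CONF(opId); }

        std::map<uint32_t, OpRecord> inflight;  // state that would have lived on the stack
        uint32_t nextOpId = 0;
    };

    int main() {
        UpdaterBlock block;
        block.execUPDATE_REQ(7);  // prints "op 0 done, value=42"
    }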

Comments, corrections?

14 comments:

Steinway Wu said...

Thanks for the blog. It is pretty clear. I am reading the code but have no clue which block handles the real dirty stuff of writing records or metadata to disk/memory. The VM part makes sense and I think I need to dig more into blocks. Do you have any good source or reference about your contents?

Frazer Clement said...

Hi Steinway,
Good question, there is an 'Internals' part of the user manual which contains some detail on the roles of blocks :

http://dev.mysql.com/doc/ndbapi/en/ndb-internals.html

Mikael Ronstrom's blog is good for details on the internals of MySQL Cluster :

http://mikaelronstrom.blogspot.co.uk/

Regarding writing to disk, the following blocks are involved :

NDBFS : Implements an async filesystem abstraction for the other blocks, including optional compression.

DBDICT : Stores table schema metadata on disk

DBDIH : Stores table partitioning / fragmentation and redo recovery metadata on disk

DBLQH : Stores operation redo logs on disk

BACKUP : Stores Local Checkpoint and online backup files on disk

RESTORE : Reads and applies Local Checkpoint files during restarts.

TSMAN : Stores 'Disk Data' pages on disk

Most disk read/write is sequential (apart from disk-data access). For metadata files, the normal pattern is to sequentially write to two copies of each file, so that at least one should be retrievable intact after a failure.

Ndb code can be difficult to get started with - is there something specific you are trying to understand?

Frazer

Steinway Wu said...

Thanks Frazer, that's an awesome answer :). I am actually trying to understand how NDB storage handles partitioning and replication. But to do that, I would also like to understand the big picture.

I have read the Partition Plugin by Mikael before. And I think that's for a single node in MySQL server. In a cluster environment, it seems that NDB handles it natively for KEY/LINEAR hash. I did a little trace down to the sendReq() and then I don't know who receives it or dispatches it. Now according to you, that should happen in the VM, and maybe the VM dispatches it to some handler. I will take a further look at them today.

Also, will the partition plugin play a role in NDB storage engine?

Thanks again. This blog, and Mikael's, are really helpful. I appreciate your efforts on putting your experiences and knowledge on the blog.

Frazer Clement said...

Thanks Steinway.

You are correct that the Partition plugin in the MySQL Server is mostly concerned with partitioning of tables within a single MySQL Server. For MySQL Cluster, most of the work is done in the cluster software.

I described how data is distributed (partitioned) in a later blog (http://messagepassing.blogspot.co.uk/2011/03/data-distribution-in-mysql-cluster.html). This concentrates on PARTITION BY [LINEAR] KEY, which is our default and most commonly used scheme. MySQL Cluster also supports partition by HASH, RANGE and LIST, which at the MySQL Cluster software level is sometimes referred to as 'user defined partitioning'. These alternative schemes make more use of the MySQL Server partitioning plugin.

At a high level, all the partitioning schemes rely on the result of a function applied to some [sub]set of the column[s] in the primary key.

For PARTITION BY KEY, the basics are :

- All row access is via Primary key (Indexes and scans also internally go via primary key)

- Where the fragment (partition) of the row is unknown, the transaction coordinator (TC) will apply the MD5 hash function to the relevant columns of the primary key (known as the distribution key, or partition key), to get a hash value. (storage/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp :: execTCKEYREQ() / hash() / execSCANTABREQ())

- Then the TC will ask the Distribution Handler block (DIH) to give it an ordered list of the fragment replicas which store the hash value.

- For hashmap partitioning, DIH will use the hash to look up a fragment id in a table (hashmap). For non-hashmap partitioning, DIH will take the hash modulo the number of fragments to get a fragment id (see the sketch after this list). DIH will return details of the fragment's replicas, with the current Primary fragment first in the list. (storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp execDIGETNODESREQ() / execDIH_SCAN_GET_NODES_REQ())

- TC then uses the fragment information to forward the request to the LQH block on the correct node and instance - usually the current Primary replica. This signal will then be handled by e.g. (storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp::execLQHKEYREQ() / execSCAN_FRAGREQ()).
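
To make the DIH step a little more concrete, the fragment selection is roughly as follows (illustrative sketch only, with invented function names - not the actual DIH code; 'hash' is the MD5-based value TC computed over the distribution key columns) :

    // Illustrative sketch only - invented function names, not the actual DIH
    // code. 'hash' is the MD5-based hash TC computed over the distribution key.
    #include <cstdint>
    #include <vector>

    // Hashmap partitioning: the hash indexes a table of fragment ids, so the
    // mapping (rather than the fragment count) is what changes during an
    // online table reorganisation.
    uint32_t fragmentForHashmap(uint32_t hash, const std::vector<uint32_t>& hashmap) {
        return hashmap[hash % hashmap.size()];
    }

    // Non-hashmap partitioning: fragment id is the hash modulo the number of
    // fragments.
    uint32_t fragmentForModulo(uint32_t hash, uint32_t numFragments) {
        return hash % numFragments;
    }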

Features of this implementation :

- The cluster itself takes care of locating rows on nodes - clients need not care, though if they do they can improve performance.

- Reducing the number of inter-node signals improves efficiency, so MySQL Cluster has features to allow correlated data to be co-located on the same node, and also to allow a co-located transaction coordinator instance to be chosen, based on the data to be accessed.

Frazer Clement said...

In terms of 'replication', it's built in as part of 2 phase commit. As mentioned above, for writing operations, the DIH block returns a list of fragment replicas. The prepare and commit (and complete) phases occur serially for each operation (but in parallel between operations in a transaction). Assuming for example a single row transaction with TC, LQHPrimary and LQHBackup participants (NoOfReplicas = 2), we would see :

Update request, with commit exec mode
(TCKEYREQ)
API -> TC

Prepare op request to LQH
(LQHKEYREQ)
TC -> LQHPrimary
LQHPrimary -> LQHBackup
(LQHKEYCONF)
LQHBackup -> TC

TC immediately requests commit
(COMMIT)
TC -> LQHBackup (note different order to prepare)
LQHBackup -> LQHPrimary (lock release on primary, other transactions can read changes)
(COMMITTED)
LQHPrimary -> TC

TC sends COMMIT_CONF or just TCKEYCONF to client
(COMMIT_CONF, TCKEYCONF)
TC -> API (Api now considers transaction committed)

TC sends COMPLETE_REQ to LQHPrimary (note prepare order)
(COMPLETE_REQ)
TC -> LQHPrimary
LQHPrimary -> LQHBackup (lock release on backup)
(COMPLETE_CONF)
LQHBackup -> TC

Perhaps more interesting are the techniques used to resynchronise fragments when nodes recover etc, or to re-partition fragments online?

In terms of understanding the flow, one problem is finding the recipient of a signal - grep can be handy here! Also the Block classes have a lot of common implementation code in storage/ndb/src/kernel/vm/SimulatedBlock.(h|c)pp.

Recent releases add some complexity due to multithreading of blocks, the older 6.3 release does not include this and may be an easier on-ramp to understanding the protocols and general flow.

Steinway Wu said...

Thanks Frazer, that's a great explanation.

I didn't realize that TC should be a starting point until you pointed it out. It really contains a lot of important information I would like to see, including those data structures involved in a transaction. I also found the ThreadConfig::ipControlLoop where all things start. SimulatedBlock helps a lot, where I found how each block registers its handlers through macros and functions like addRecSignal. Some signal info is defined in /storage/ndb/include/signaldata/..., which I also found important. (It's really difficult to read because of message-passing/macros/global variables ..., even an xref tool like opengrok can't help too much.)

And you're right, I'm interested in online re-partition. I could see DIH handling alter table requests, probably maintaining metadata through FS_XXXXX requests (to and from NDBFS?), but that's only a portion of it. If you could sketch the high level flow, that would be very helpful.

Thanks!

Frazer Clement said...

Hi, sounds like you're making good progress hacking through the undergrowth!

I've written a blog about online table reorg before (http://messagepassing.blogspot.co.uk/2011/03/mysql-cluster-online-scaling.html), but it's at quite a high level.

The re-org is quite a large operation, and involves many different components : DICT, DIH, TC, TRIX and DBUTIL, DBTUP, SUMA,...

The re-org operation itself is triggered via DICT as part of a schema transaction (an alter table, where the table's hashmap is changed).
Rows which must be moved from existing to new fragments are detected by looking up their distribution key hash in the 'old' and 'new' hashmaps. Where these differ, the row needs to be moved.
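
As an illustrative sketch (invented name, not the actual code), the test is roughly :

    // Illustrative sketch only - not the actual code. A row needs to move if
    // its distribution key hash maps to different fragments in the 'old' and
    // 'new' hashmaps.
    #include <cstdint>
    #include <vector>

    bool rowMustMove(uint32_t hash,
                     const std::vector<uint32_t>& oldHashmap,
                     const std::vector<uint32_t>& newHashmap) {
        uint32_t oldFragment = oldHashmap[hash % oldHashmap.size()];
        uint32_t newFragment = newHashmap[hash % newHashmap.size()];
        return oldFragment != newFragment;
    }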

The TRIX and DBUTIL blocks are used to scan the fragments of the table, and where a row which will be moved is found, it is inserted into the new fragment.

In the meantime, Cluster-internal row change triggers (DBTUP) are used to detect when user DML activity changes a row which will be moved. The trigger then causes the change to be propagated to the new fragment as well.

When the scan completes, the data-to-be-moved exists in both the new and old fragments. At this point, the schema transaction commit begins :
- The 'new' hashmap becomes the official hashmap - used for new requests
- 'Moved' rows in the old fragment become invisible to scans.
- The system waits for the open operation refcount on the old hashmap to drop to zero - this requires open transactions to complete.
- Once all operations using the old hashmap are complete, TRIX and DBUTIL begin a delete-scan of the existing fragment, removing the copied rows.
- When that delete-scan completes, the re-org is done.

As with most databases, many things are accomplished with internal triggers. storage/ndb/src/kernel/blocks/dbtup/DbtupTrigger.cpp is a good place to look. Generally triggers have either a node-local or cluster-wide effect. Node-local triggers are used e.g. for online backup, ordered index maintenance etc. Cluster-wide (Global) triggers are used for Unique indexes, Foreign keys (in 7.3+), and online re-org.
Where a trigger is global, a transaction's TC is informed when the trigger is fired, and the TC is able to undertake further distributed work based on the trigger firing.

In terms of understanding the source, it can be difficult. Many protocols are similar however. Many have PREPARE and COMMIT phases, and signals are mostly named to indicate their purpose (REQuest, CONFirm, REPort, ORDer).

One difficulty with following signal flows is getting lost in the details. After a while it is easier to skip over the sub-protocol state machines (e.g. the work + signalling for reading / writing files) to get on to the next state of some higher-level state machine. I'd recommend that approach - don't get too bogged down in one place, but keep looking from different angles and trying to connect them together.

Good luck!

Steinway Wu said...

Thanks Frazer.

I never knew that triggers could be so important inside an engine! I just read the post you mentioned about online scaling, and that's also very informative.

Thanks to your pointers, I saw the hashMapObjectId comparison in alterTable_parse, and this function seems to be very important (although the name "parse" confused me at first glance; it does much more than just parse).

And in DbtcMain, executeReorgTrigger() sends TCKEYREQ, and then Dbtc itself handles it in tckeyreq050Lab (interesting name), which asks DIH for node info, uses two loops to choose the owning node (TownNode) and backup node, and finally sends LQHKEYREQ(s) to LQH for the dirty job (I guess).

BTW, it seems that XXX_KEY_REQ is one of the most important messages passed around.

I've got some idea now. Thanks for helping me out with this message-passing style code. I will keep reading to get a better understanding.

Would you mind me putting these valuable comments on my blog as part of a code-reading note? My blog is steinwaywhw.ghost.io (temp domain name, might change in the following months)

Thanks again for your answers.

Frazer Clement said...

Hi Steinway,
You're a quick study! Seems you're on the right tracks. Some of the strangeness in Ndb code is due to its being auto-translated from the Plex language, and some of it is just....strangeness :)

Feel free to put the content and your thoughts on your blog.
Frazer

Frazer Clement said...

P.S. For a thorough look at the uses of triggers in a DBMS, I recommend this paper :
Practical Applications of Triggers and Constraints
http://www.vldb.org/conf/2000/P254.pdf