Monday, 28 September 2009

Ndb software architecture

I'm sure that someone else can describe the actual history of Ndb development much better, but here's my limited and vague understanding.

  • Ndb is developed in an environment (Ericsson AXE telecoms switch) where Ericsson's PLEX is the language of choice
    PLEX supports multiple state machines (known as blocks) sending messages (known as signals) between them with some system-level conventions for starting up, restart and message classes. Blocks maintain internal state and define signal handling routines for different signal types. Very little abstraction within a block beyond subroutines is supported. (I'd love to hear some more detail on PLEX and how it has evolved). This architecture maps directly to the AXE processor design (APZ) which is unusual in having signal buffers implemented directly in silicon rather than software. This hard-coding drove Ndb's initial max supported signal size of 25 x 32-bit words.
  • An emulated PLEX environment (VM) is made available on Unix systems, written in C++
    The VM runs as a Unix process. PLEX code for blocks is interpreted. Signals are routed between blocks by the VM. This allows development and deployment of PLEX based systems on standard Unix systems. It also allows Plex based systems to easily interact with Unix software. Each VM instance is a single threaded process routing incoming signals to the signal handling functions in each block class.
  • A PLEX to C++ translation system is designed
    Blocks are mapped to large C++ classes with signal handling methods and per-block global state mapped to member variables. The limited labelling and abstraction encoded in the PLEX source are mapped to C style code within C++ classes.
  • The VM environment is 'branched' from the original PLEX/AXE environment and starts to evolve independently as a base for Ndb.
    It offers access to more OS services such as communication, disk IO etc. Plex interpretation functionality is removed as all relevant Plex code has been mapped to native C++. VM instances can communicate with each other over various channels and form a distributed system.
  • At some point in the timeline around here the Ndb team and product leave Ericsson
  • Over time, common block functionality is abstracted into base and utility classes.
    Hardware and system-convention sourced constraints are eased, the level of abstraction is raised. New blocks are designed and implemented without a Plex heritage making use of C++ abstraction facilities. Existing blocks are refactored.
  • Multi-threaded Ndbd (ndbmtd) is introduced, with groups of block instances running on different threads.
    Rather than being a radical design, it's a move back towards the original PLEX design point of 1 block instance per processor.

Today, Ndb executes a blocks communicating via signals model. Signals are no longer limited to 25 words. In single threaded Ndb (ndbd), all blocks share a single thread, with separate threads used for inter-VM communication setup and disk IO. In multi threaded Ndb (ndbmtd), block instances are grouped, and different functional groups share threads. In all cases, each block instance remains single-threaded, although the thread may be shared with other blocks.

The blocks and signals model is reminiscent of Erlang and Hoare's CSP – where concurrency is modelled as serial (or sequential) processes communicating with explicit messages, as opposed to a shared-memory model where communication occurs via memory with correctness controlled by locks, memory barriers and atomic instructions. It can also be considered similar to MPI and the Active object / Actor model.

Using explicit messaging for synchronisation/communication has costs – at runtime a given algorithm may require more data copying. At design time, potential concurrency must be explicitly designed-in with messaging and state changes. Mapping sequential algorithms to message passing state machines may require a bigger code transformation than mapping to a naive multithread safe shared-memory and locks implementation.

However I believe that these costs are generally paid off by the benefit of improved code clarity. Inter state-machine synchronisation becomes clearly visible, making synchronisation costs easier to visualise and understand. With explicit messaging as the main mechanism for inter-thread and inter-process communication, there is only a small kernel of multithreaded code to be implemented, proved correct and optimised. The bulk of the code can be implemented in single threaded style. There is no need for diverse libraries of multithread-optimised data structures. Processor and system architecture specific code and tradeoffs are minimised.

Internally, Ndb's VM supports only asynchronous messages between blocks. Using an asynchonous message passing style has many benefits. As the sending thread does not block awaiting a response to a message sent, it can work on other jobs, perhaps including the message just sent. This allows it to make the best use of warm instruction and data caches, reduces voluntary context switches and can reduce the likelihood of deadlock. Blocking IO (network, disk) is outsourced to a pool of threads. The signal processing thread(s) never block, except when no signals are available to process. The responsiveness of the system can be ensured by using prioritised job queues to determine the job to execute next and minimising the time spent processing individual jobs. From a formal point of view the number of possible multithreaded interactions is vastly reduced as thread-interleaving is only significant at signal processing boundaries. These limitations can make it easier to reason about the correctness and timing properties of the system.

However, coding in this asynchronous, event-driven style can be demanding. Any blocking operations (disk access, blocking communications, requests to other threads or processes etc.) must be implemented as an asynchronous request and response pair. This style can have an abstraction-dissolving property as many published data structures and algorithms are implemented assuming a synchronous model and making much use of the caller's stack for state storage and managing control flow. It can be difficult to design abstractions for the asynchronous style which don't leak so much messy detail as to be pointless. Additionally, the asynchronous style tends to flatten a system – as the need to return control to the lowest-level call point whenever concurrency is possible acts as a force against deep layers of abstraction. Side effects of this can include a tendency for error handling code to be non-localised to the source of the error. However, that is part of the charm of working on the system. The C++ environment gives a wide set of tools for designing such abstractions, and each improvement made simplifies future work.

Comments, corrections?

Thursday, 10 September 2009

MySQL Cluster development

MySQL Cluster is the name given to one or more MySQL Server processes, connected to an Ndb Cluster database. From the point of view of the MySQL Server processes, the Ndb Cluster is a Storage Engine, implementing transactional storage of tables containing rows. From the point of view of the Ndb Cluster database, the MySQL Server processes are API nodes, performing DDL and DML transactions on tables stored in the cluster. Both exist independently – Ndb Cluster can be used without attached MySQL Server processes, but almost all users of Ndb Cluster connect at least one MySQL Server for DDL and administration.

Ndb stands for Network DataBase. This is a telecoms phrase where Network usually refers to a fixed or wireless telephone network, rather than the database topology definition of the term. Ndb was originally designed as a platform for implementing databases required to operate telecoms networks - HLR, VLR, Number Portability, Fraud Detection etc. At the time Ndb was first designed, Network Databases were generally implemented in-house on exotic 'switch' hardware by telecoms equipment vendors, often with hard-coded schemas and very inflexible query capabilities. These databases were expensive to develop and maintain, but had superb reliability and exceptional performance on minimal spec. hardware. The aim of the original Ndb design was to couple these desirable properties with more general purpose database functionality and deliver the result on a more standard hardware and OS stack.

I first discovered Ndb Cluster around 2001, when looking at potential designs for the next generation of an existing HLR database. I read the paper by Mikael Ronström in Ericsson Review (No 4,1997) which gives a good overview of the Ndb functionality. This paper describes functionality in the current tense when in fact some of the features described are yet to be implemented in 2009! This sort of optimism and vision has helped Ndb to survive and thrive over the years. The Ericsson Review paper was written while Ndb was one of multiple telecoms-database projects at Ericsson. Since then the Ndb product and team were spun out as a separate company, before being sold to MySQL AB in 2003 as a result of the dot com affair.

Ndb was originally designed for :
  • High throughput – sustaining tens to hundreds of thousands of transactions per second
  • Low latency – bounded transactions latencies which can be reliably factored into end-to-end latency budgets, implying main-memory storage
  • High update to read ratio – 50/50 as the norm
  • Transactional properties : Atomicity, Consistency, Isolation, Durability
  • Fault tolerance + HA – No single point of failure, automatic failover and recovery with minimal user or application involvement. Online upgrade. N-way synchronous and asynchronous replication. Fail-fast fault isolation.
  • Persistence – disk checkpointing and logging with automated recovery
  • Scalability – Parallel query execution. Distributed system can utilise > 1 system's resources. Capacity can be expanded horizontally with extra systems.

In the original Ndb design, high volume low latency transactions are submitted directly to the cluster using simple access primitives on the client. More complex queries are submitted to a separate query processor which itself uses combinations of the simpler primitives to access the cluster. An early example of a higher-level query processor was created by Martin Sköld who extended an Object Oriented query processor to create 'QDB' which could perform queries against data stored in Ndb. Numerous high level front-end processors have been implemented since.

Using MySQLD as a higher-level query processing front end we come to the architecture of MySQL Cluster, with MySQLD providing SQL based access to data stored in the cluster. In this sense MySQLD and Ndb cluster are a perfect fit and were designed for each other before they first met! Despite MySQLD being the default and most prominent front end to Ndb cluster, a number of others exist including several open and closed-source LDAP servers (OpenLDAP, OpenDS), several Java APIs and an Apache module giving HTTP access to data stored in Ndb.

The separation of low level, simple, fast access and higher level, more flexible access allows MySQL Cluster to offer many benefits of a full RDBMS without always incurring the drawback of over-generality. This fits well with many large transaction processing systems, where most heavy transaction processing does not require the full flexibility of the RDBMS, but some less frequent analysis does. Separating the central database engine (which in Ndb is referred to as the kernel ) from the query processing layer can also help with workload management – even the most complex queries are subdivided into manageable components and resources can be shared fairly.

The original Ndb design was not aimed at :
  • Disk resident storage
    Where data larger-than-aggregate-system-memory-capacity can be stored on disk. This functionality was later added in the MySQL 5.1 timeframe
  • Complex query processing
    Where multiple tables are joined. This was always possible, but not always efficient. Improving the efficiency of MySQL and Ndb on complex query processing is ongoing work - as it is in all actively developed RDBMS, for some definition of complex :).
  • Storing large rows
    Ndb currently has a per-row size limit of around 8kB, ignoring Blob and Text column types.
  • One size fits all
    Being a drop-in replacement for an existing MySQL engine such as MyISAM or InnoDB
    Many initial users were not aware of the history of Ndb, and expected it to be (MySQL + InnoDB/MyISAM) + 'Clustering'. Issuing 'ALTER TABLE xxx ENGINE=ndbcluster;' appeared to be all that was required to gain fault tolerance, but the performance of queries on the resulting tables was not always as expected!

Since the initial integration of Ndb Cluster with MySQLD in 2003+, there have been many improvements to bring Ndb closer in behaviour to the most popular MySQL engines, and to optimise MySQLD for Ndb's strengths, including :
  • Support for Autoincrement and primary key-less tables
  • Synchronisation of schemas across connected MySQLD instances
  • Support for MySQL character sets and collations
  • Storage and retrieval of Blob and Text columns
  • Support for pushed-down filter conditions
  • Support for batching of operations
  • Integration with MySQL asynchronous replication
  • 'Distribution awareness' in MySQLD for efficiency

These improvements have required work in the Ndb table handler - the code which maps MySQL storage engine API calls from the generic SQL layer to the underlying storage engine. Some improvements have also required enhancements in the storage engine API and Server, for example a new API to expose conditions (WHERE or HAVING clause predicates) to the storage engine, enabling it to perform more efficient filtering. These changes add complexity to MySQLD and the storage engine API, but as they are implemented generically, they can be reused by other engines. The pushed conditions API is now being used by the Spider engine for similar reasons to Ndb – e.g. to push filtering functionality as close to the data as possible. The Batched Key Access (BKA) improvements made to the MySQLD join executor benefit Ndb, but also benefit MyISAM and InnoDB to a lesser extent. This Functionality push-down pattern – increasing the granularity and complexity of work items which can be passed to the storage engine - will continue and benefit all storage engines.

The next large step to be taken by the MySQL Server team in this direction is referred to as Query Fragment Pushdown, where MySQLD can pass parts of queries to a storage engine for execution. Storage engines which support SQL natively could perhaps use their own implementation-aware optimisation and execution engines to efficiently evaluate query fragments. For Ndb, we are designing composite primitives at the NdbApi level for evaluating query fragments more efficiently - in parallel and closer to the data. This work will increase the number of query types that Ndb can handle efficiently, increasing the number of applications where Ndb is a good fit.

For an in-depth description of the original Ndb requirements, design approach and some specific design solutions, Mikael's phD thesis is the place to go. This is probably the best source of information on the design philosophy of Ndb Cluster. However as it is a frozen document it does not reflect the current state of the system, and as it is an academic paper, it does not describe the lower level, more software engineering oriented aspects of the system implementation.

I hope to cover some of these aspects in a future post.