message passing: 2009

Monday, 28 September 2009

Ndb software architecture

I'm sure that someone else can describe the actual history of Ndb development much better, but here's my limited and vague understanding.

Ndb is developed in an environment (Ericsson AXE telecoms switch) where Ericsson's PLEX is the language of choice
PLEX supports multiple state machines (known as blocks) sending messages (known as signals) between them with some system-level conventions for starting up, restart and message classes. Blocks maintain internal state and define signal handling routines for different signal types. Very little abstraction within a block beyond subroutines is supported. (I'd love to hear some more detail on PLEX and how it has evolved). This architecture maps directly to the AXE processor design (APZ) which is unusual in having signal buffers implemented directly in silicon rather than software. This hard-coding drove Ndb's initial max supported signal size of 25 x 32-bit words.

An emulated PLEX environment (VM) is made available on Unix systems, written in C++
The VM runs as a Unix process. PLEX code for blocks is interpreted. Signals are routed between blocks by the VM. This allows development and deployment of PLEX based systems on standard Unix systems. It also allows Plex based systems to easily interact with Unix software. Each VM instance is a single threaded process routing incoming signals to the signal handling functions in each block class.

A PLEX to C++ translation system is designed
Blocks are mapped to large C++ classes with signal handling methods and per-block global state mapped to member variables. The limited labelling and abstraction encoded in the PLEX source are mapped to C style code within C++ classes.

The VM environment is 'branched' from the original PLEX/AXE environment and starts to evolve independently as a base for Ndb.
It offers access to more OS services such as communication, disk IO etc. Plex interpretation functionality is removed as all relevant Plex code has been mapped to native C++. VM instances can communicate with each other over various channels and form a distributed system.

At some point in the timeline around here the Ndb team and product leave Ericsson

Over time, common block functionality is abstracted into base and utility classes.
Hardware and system-convention sourced constraints are eased, the level of abstraction is raised. New blocks are designed and implemented without a Plex heritage making use of C++ abstraction facilities. Existing blocks are refactored.

Multi-threaded Ndbd (ndbmtd) is introduced, with groups of block instances running on different threads.
Rather than being a radical design, it's a move back towards the original PLEX design point of 1 block instance per processor.

Today, Ndb executes a blocks communicating via signals model. Signals are no longer limited to 25 words. In single threaded Ndb (ndbd), all blocks share a single thread, with separate threads used for inter-VM communication setup and disk IO. In multi threaded Ndb (ndbmtd), block instances are grouped, and different functional groups share threads. In all cases, each block instance remains single-threaded, although the thread may be shared with other blocks.

The blocks and signals model is reminiscent of Erlang and Hoare's CSP – where concurrency is modelled as serial (or sequential) processes communicating with explicit messages, as opposed to a shared-memory model where communication occurs via memory with correctness controlled by locks, memory barriers and atomic instructions. It can also be considered similar to MPI and the Active object / Actor model.

Using explicit messaging for synchronisation/communication has costs – at runtime a given algorithm may require more data copying. At design time, potential concurrency must be explicitly designed-in with messaging and state changes. Mapping sequential algorithms to message passing state machines may require a bigger code transformation than mapping to a naive multithread safe shared-memory and locks implementation.

However I believe that these costs are generally paid off by the benefit of improved code clarity. Inter state-machine synchronisation becomes clearly visible, making synchronisation costs easier to visualise and understand. With explicit messaging as the main mechanism for inter-thread and inter-process communication, there is only a small kernel of multithreaded code to be implemented, proved correct and optimised. The bulk of the code can be implemented in single threaded style. There is no need for diverse libraries of multithread-optimised data structures. Processor and system architecture specific code and tradeoffs are minimised.

Internally, Ndb's VM supports only asynchronous messages between blocks. Using an asynchonous message passing style has many benefits. As the sending thread does not block awaiting a response to a message sent, it can work on other jobs, perhaps including the message just sent. This allows it to make the best use of warm instruction and data caches, reduces voluntary context switches and can reduce the likelihood of deadlock. Blocking IO (network, disk) is outsourced to a pool of threads. The signal processing thread(s) never block, except when no signals are available to process. The responsiveness of the system can be ensured by using prioritised job queues to determine the job to execute next and minimising the time spent processing individual jobs. From a formal point of view the number of possible multithreaded interactions is vastly reduced as thread-interleaving is only significant at signal processing boundaries. These limitations can make it easier to reason about the correctness and timing properties of the system.

However, coding in this asynchronous, event-driven style can be demanding. Any blocking operations (disk access, blocking communications, requests to other threads or processes etc.) must be implemented as an asynchronous request and response pair. This style can have an abstraction-dissolving property as many published data structures and algorithms are implemented assuming a synchronous model and making much use of the caller's stack for state storage and managing control flow. It can be difficult to design abstractions for the asynchronous style which don't leak so much messy detail as to be pointless. Additionally, the asynchronous style tends to flatten a system – as the need to return control to the lowest-level call point whenever concurrency is possible acts as a force against deep layers of abstraction. Side effects of this can include a tendency for error handling code to be non-localised to the source of the error. However, that is part of the charm of working on the system. The C++ environment gives a wide set of tools for designing such abstractions, and each improvement made simplifies future work.

Comments, corrections?

Thursday, 10 September 2009

MySQL Cluster development

MySQL Cluster is the name given to one or more MySQL Server processes, connected to an Ndb Cluster database. From the point of view of the MySQL Server processes, the Ndb Cluster is a Storage Engine, implementing transactional storage of tables containing rows. From the point of view of the Ndb Cluster database, the MySQL Server processes are API nodes, performing DDL and DML transactions on tables stored in the cluster. Both exist independently – Ndb Cluster can be used without attached MySQL Server processes, but almost all users of Ndb Cluster connect at least one MySQL Server for DDL and administration.

Ndb stands for Network DataBase. This is a telecoms phrase where Network usually refers to a fixed or wireless telephone network, rather than the database topology definition of the term. Ndb was originally designed as a platform for implementing databases required to operate telecoms networks - HLR, VLR, Number Portability, Fraud Detection etc. At the time Ndb was first designed, Network Databases were generally implemented in-house on exotic 'switch' hardware by telecoms equipment vendors, often with hard-coded schemas and very inflexible query capabilities. These databases were expensive to develop and maintain, but had superb reliability and exceptional performance on minimal spec. hardware. The aim of the original Ndb design was to couple these desirable properties with more general purpose database functionality and deliver the result on a more standard hardware and OS stack.

I first discovered Ndb Cluster around 2001, when looking at potential designs for the next generation of an existing HLR database. I read the paper by Mikael Ronström in Ericsson Review (No 4,1997) which gives a good overview of the Ndb functionality. This paper describes functionality in the current tense when in fact some of the features described are yet to be implemented in 2009! This sort of optimism and vision has helped Ndb to survive and thrive over the years. The Ericsson Review paper was written while Ndb was one of multiple telecoms-database projects at Ericsson. Since then the Ndb product and team were spun out as a separate company, before being sold to MySQL AB in 2003 as a result of the dot com affair.

Ndb was originally designed for :

High throughput – sustaining tens to hundreds of thousands of transactions per second
Low latency – bounded transactions latencies which can be reliably factored into end-to-end latency budgets, implying main-memory storage
High update to read ratio – 50/50 as the norm
Transactional properties : Atomicity, Consistency, Isolation, Durability
Fault tolerance + HA – No single point of failure, automatic failover and recovery with minimal user or application involvement. Online upgrade. N-way synchronous and asynchronous replication. Fail-fast fault isolation.
Persistence – disk checkpointing and logging with automated recovery
Scalability – Parallel query execution. Distributed system can utilise > 1 system's resources. Capacity can be expanded horizontally with extra systems.

In the original Ndb design, high volume low latency transactions are submitted directly to the cluster using simple access primitives on the client. More complex queries are submitted to a separate query processor which itself uses combinations of the simpler primitives to access the cluster. An early example of a higher-level query processor was created by Martin Sköld who extended an Object Oriented query processor to create 'QDB' which could perform queries against data stored in Ndb. Numerous high level front-end processors have been implemented since.

Using MySQLD as a higher-level query processing front end we come to the architecture of MySQL Cluster, with MySQLD providing SQL based access to data stored in the cluster. In this sense MySQLD and Ndb cluster are a perfect fit and were designed for each other before they first met! Despite MySQLD being the default and most prominent front end to Ndb cluster, a number of others exist including several open and closed-source LDAP servers (OpenLDAP, OpenDS), several Java APIs and an Apache module giving HTTP access to data stored in Ndb.

The separation of low level, simple, fast access and higher level, more flexible access allows MySQL Cluster to offer many benefits of a full RDBMS without always incurring the drawback of over-generality. This fits well with many large transaction processing systems, where most heavy transaction processing does not require the full flexibility of the RDBMS, but some less frequent analysis does. Separating the central database engine (which in Ndb is referred to as the kernel ) from the query processing layer can also help with workload management – even the most complex queries are subdivided into manageable components and resources can be shared fairly.

The original Ndb design was not aimed at :

Disk resident storage
Where data larger-than-aggregate-system-memory-capacity can be stored on disk. This functionality was later added in the MySQL 5.1 timeframe
Complex query processing
Where multiple tables are joined. This was always possible, but not always efficient. Improving the efficiency of MySQL and Ndb on complex query processing is ongoing work - as it is in all actively developed RDBMS, for some definition of complex :).
Storing large rows
Ndb currently has a per-row size limit of around 8kB, ignoring Blob and Text column types.
One size fits all
Being a drop-in replacement for an existing MySQL engine such as MyISAM or InnoDB
Many initial users were not aware of the history of Ndb, and expected it to be (MySQL + InnoDB/MyISAM) + 'Clustering'. Issuing 'ALTER TABLE xxx ENGINE=ndbcluster;' appeared to be all that was required to gain fault tolerance, but the performance of queries on the resulting tables was not always as expected!

Since the initial integration of Ndb Cluster with MySQLD in 2003+, there have been many improvements to bring Ndb closer in behaviour to the most popular MySQL engines, and to optimise MySQLD for Ndb's strengths, including :

Support for Autoincrement and primary key-less tables
Synchronisation of schemas across connected MySQLD instances
Support for MySQL character sets and collations
Storage and retrieval of Blob and Text columns
Support for pushed-down filter conditions
Support for batching of operations
Integration with MySQL asynchronous replication
'Distribution awareness' in MySQLD for efficiency

These improvements have required work in the Ndb table handler - the code which maps MySQL storage engine API calls from the generic SQL layer to the underlying storage engine. Some improvements have also required enhancements in the storage engine API and Server, for example a new API to expose conditions (WHERE or HAVING clause predicates) to the storage engine, enabling it to perform more efficient filtering. These changes add complexity to MySQLD and the storage engine API, but as they are implemented generically, they can be reused by other engines. The pushed conditions API is now being used by the Spider engine for similar reasons to Ndb – e.g. to push filtering functionality as close to the data as possible. The Batched Key Access (BKA) improvements made to the MySQLD join executor benefit Ndb, but also benefit MyISAM and InnoDB to a lesser extent. This Functionality push-down pattern – increasing the granularity and complexity of work items which can be passed to the storage engine - will continue and benefit all storage engines.

The next large step to be taken by the MySQL Server team in this direction is referred to as Query Fragment Pushdown, where MySQLD can pass parts of queries to a storage engine for execution. Storage engines which support SQL natively could perhaps use their own implementation-aware optimisation and execution engines to efficiently evaluate query fragments. For Ndb, we are designing composite primitives at the NdbApi level for evaluating query fragments more efficiently - in parallel and closer to the data. This work will increase the number of query types that Ndb can handle efficiently, increasing the number of applications where Ndb is a good fit.

For an in-depth description of the original Ndb requirements, design approach and some specific design solutions, Mikael's phD thesis is the place to go. This is probably the best source of information on the design philosophy of Ndb Cluster. However as it is a frozen document it does not reflect the current state of the system, and as it is an academic paper, it does not describe the lower level, more software engineering oriented aspects of the system implementation.

I hope to cover some of these aspects in a future post.

Tuesday, 28 April 2009

Latency hiding patterns in CPUs

Latency is a major factor in CPU design. Most general purpose CPUs execute a sequence of instructions with the logical convention that one instruction completes before the next starts. However memory access latency and even the latency of pure computation bottleneck the achievable processing throughput. To circumvent this a number of techniques are used to parallelise computation and communication.

Conductors have capacitance and resistance and therefore take time to switch between voltage levels. This manifests as a propagation delay proportional to the length of the conductor. As CPU clock speeds increase, the time between each rising clock edge reduces and the maximum length of conductor that can be charged in that time shrinks. CPUs and other data-path designs must be designed to ensure that no signal propagation path is near the tolerance for propagation delay at the designed maximum clock speed.

In practice this means that there is a trade-off between clock frequency and circuit size. Large circuits can accomplish more per clock cycle, but cannot be clocked as high as smaller circuits. To continue increasing clock rates with a fixed feature size, a circuit must be broken up into smaller and smaller sub-circuits with synchronous boundaries. Even this is not enough in some cases as the clock signals themselves are skewed by the propagation latencies across the chip. In some cases this is solved by having islands of synchronous logic which communicate asynchronously as no synchronous clock can span the whole circuit.

So even within a chip, there can be a large, and growing difference between local and remote communication latency. Ignoring this simplifies designs but forces lowest common denominator performance. This trade-off between complexity and latency awareness and tolerance is repeated at many layers of the system stack.

Even within a synchronous logic island on a chip, there is latency to be hidden. Instruction fetch, decode and execute takes time and pipelining used in CPU designs to maximise throughput despite irreducible latency. The various steps of instruction decode, register selection, ALU activation etc. are split out with latches between so that the maximum propagation delay in any stage of the pipleine can be minimised. For example, a datapath with a 400 gate worst-case propagation delay could theoretically be split into 4 stages, each with a 100 gate worst case propagation delay, allowing a 4 times faster clock speed.

With pipelining in a CPU, the first instruction to be processed takes at least the same amount of time to be completed, but after that, a new instruction can be completed on every clock cycle - theoretically allowing 4 times as many instructions to be completed per unit time. In practice memory latency and bandwidth limits and dependencies between instructions mean that the pipeline can stall, or bubbles can form, reducing the achieved throughput. However, these problems can often be alleviated and pipelines have been very successful in CPU design with instruction fetch-execute-retire pipelines comprising as many as 40 or more stages with multiple instances of stages in super-scalar designs

So pipelining in general allows repetitive serial task processing to be parallelised with potential benefits :

Increased parallelism between instances of serial tasks
Allowing greater throughput
Benefits of task-specificity (instruction and data caching benefits)
Potentially improving efficiency

and the following caveats :

Dependencies between tasks must be understood and honoured
Sometimes this reduces or eliminates parallelism
Pipeline stalls or bubbles due to job starvation or dependencies will reduce throughput
The pipeline must be kept fed.
Achievable throughput is bottlenecked by the slowest stage
As with a production line, the slowest worker sets the pace.
As with a production line, slow workers can be duplicated to improve throughput.
Individual task latency can increase

Pipelining as a general technique can apply to any system that processes tasks of a regular form which can be split into stages. The SEDA system from Harvard is a general purpose server framework for implementing server processes as pipelined stages. The software setting allows more flexible and dynamic tradeoffs to be made between pipeline lengths and widths. It also offers a flexible way to separate asynchronous and synchronous steps.

Other interesting latency hiding techniques seen in CPUs are mostly associated with hiding memory access latency.

Latency hiding patterns

I want to write an entry about latency hiding patterns. Unfortunately my previous attempts became too long and boring even for me. This time I'm going to try something smaller and less rich to get the blog-flow more regular.

I'm interested in latency hiding patterns as I repeatedly see them being implemented at all levels of systems from the silicon to the top of the application stack. Often latency hiding techniques are what deliver better-than-Moore's law performance improvements and can have great impacts to usability as well as throughput and system efficiency.

I would describe latency hiding techniques as things that :

Maximise the value of communication
Maximise the concurrency of communication and computation

Communication value is maximised by avoiding communication, and when it cannot be avoided, minimising the overheads.
Communication and computation concurrency is maximised by maximising the independence of communication and computation. This requires understanding the dependencies between computation and communication.

Right, all very abstract. Let's hope some examples are more interesting.

Thursday, 16 April 2009

Protel II

In the first post I described the basic Protel language. In the mid 1990s it was extended with object oriented capabilities. This was done in a number of phases and was tied into the development of a project at Nortel called Generic Services Framework (GSF). This was an object oriented reimplementation of 'call processing' on the DMS with wide scope and a huge development team. Nortel even created an 'Object Center' somewhere in Canada with a helpline that confused designers could call to get OO Protel advice. I suspect that today it would be a 'Center of Object Excellence'. I heard that the GSF project was not an unqualified success, but it did drive the evolution of Protel-2 which was used later in other products. In retrospect it seems that GSF and Protel-2 were more motivated by the Objects-with-everything zeitgeist than any particularly compelling benefits.

Protel-2 supports single inheritance with a common root class called $OBJECT. Methods can be explicitly declared to be overridable in base classes, similarly to C++'s virtual keyword. It does not support operator overloading, or overloading method names with different signatures. It supports fully abstract classes.
Like Eiffel, Protel-2 allows parameterless methods which return a value to be called without parentheses. This allows data members to be refactored to be read directly or via an accessor function method without changing the callers. I'm not sure how valuable this is in practice, especially as it does not affect assignment. Perhaps some Eiffel practitioners have experience of finding this useful? Protel-2 methods can be declared to be read-only with respect to the object instances they operate on, in a similar way to specifying const-ness in C++.

The GENERIC keyword in Protel-2 allows class definitions to be parameterised by type. This allows the creation of type-safe generic collection classes and datastructures. The type parameterisation is similar to the Generics mechanism in Java, in that it is effectively a compile-time-only mechanism. The compiler generates a single underlying class implementation and checks type-correctness at compile time. A side-effect is that all access to the parameterised class must be made via a pointer. This fits well with the requirement for online load-replacing implementation etc, but it offers much reduced power compared to C++'s code-generation style templating mechanisms.

As with most C++ implementations, Protel-2 objects contain a vtbl pointer in the first word of their data followed by data members of superclasses and the class instance. On XA-Core, this sometimes presented a problem in that the normal transactional memory ownership mechanism could create a bottleneck when used for the vtbl ptr, but often data in the rest of the class should use the transactional memory mechanism. To deal with this, special libraries were created to allow the object header to be stored in WRITEBLOCKING memory and the rest of the object to be stored in BLOCKING memory.

That's scraping the bottom of the barrel on Protel-2 information in my head. I thought there was more there (or maybe just something more interesting). At least I've written it down.

Sunday, 12 April 2009

Protel I

Protel is the PRocedure Oriented Type Enforcing Language used for most DMS software. Currently there's not much information available about it online. An early paper describing the language is referenced here, but is hidden behind a subscription-only portal. Wikipedia offers a very minimal definition of the term but little else. I think there's some good stuff that should be recorded so I'll attempt to describe what I found interesting.

So what is Protel like? I'm told that it's similar to Modula-2 and even Modula-3 and it's true that it shares explicit BEGIN / END block syntax with Pascal, and all code is divided into modules.
One of the most basic differences between Protel and these languages is its use of the composite symbol '->' or 'Gozinta' (Goes into) for assignment. This eliminates any confusion between assignment and equality testing. This 'removal of ambiguity' is a key pattern in the design of Protel. Similar manifestations of the pattern are the rules :

No operator precedence rules
Unlike C, Protel assigns no built-in relative precedence to various operators. All expressions are evaluated left-to-right, and the programmer must use brackets explicitly to specify non l2r evaluation.
No preprocessor
There is no standard preprocessor and therefore no macro language. The compiler supports some limited compile time expression evaluation including sizeof() for types etc. This avoids context specific semantics for source code and non-visible code expansion
No support for pointer arithmetic
Where a pointer is to be treated as referring to some element of an array, the descriptior mechanism, which includes bounds checking, is to be used rather than pointer arithmetic.
Control structure specific end-of-block keywords
Rather than a single END keyword, Protel employs ENDBLOCK, ENDPROC, ENDWHILE, etc. to aid code readability.
No 'keystroke reduction mechanisms'
C's ++, += etc.

These rules can make life harder when writing code and increase verbosity, but can aid readability and reduce the amount of non-local knowledge needed to understand code.

Protel modules contain one or more source code files which can export definitions for use by other modules. Different source files within a module can be arranged into trees which control compile order within a module. Modules often have multiple levels of interface source files - most public and general APIs in the top level, more private and specific APIs in the lower levels. Access to definitions in each interface file can be controlled independently if required.

Basic Protel supports built-in and user defined types, pointers, arrays, an 'array slice' descriptor mechanism, and a novel extensible fixed-size type called an Area as well as some type-reflection capabilities.

A Protel DESCriptor is used to refer to a range of elements in an array of . It is used in the same way an array - with a subsript as an lvalue or rvalue to an expression (though the usual meaning of those terms is confused by the Gozinta operator!). The compiler is aware of the of the slice being DESCribed, and by inference the size of the elements. In the storage allocated to the DESC itself, it stores a pointer to the zeroth element and an upperbound in terms of elements. In this way it can provide bounds checking on accesses through the DESC. When an out-of-bounds exception is hit, the actual upperbound and the supplied subscript are available in the exception report, often allowing debugging straight from the trace. The array slice abstraction can be a nice way to deal with zero-copy in a protocol stack.

Protel offers the BIND keyword which can be used to define a local-scope alias to some variable instance. It's use is encouraged to reduce keystrokes, and it is also useful for indicating to the reader and the compiler that some dereferencing operations need only be performed once even though the referenced value is used multiple times. A side effect of its use is that it reduces the tension between short, easily typed and long, descriptive variable names, allowing long descriptive names to be shortened in use when necessary. Of course this can add to ambiguity and confusion.

Protel supports typed procedure pointers with an explicit type classifier - PROCVAR. I suspect that this type classifier exists to improve code readability rather than for any syntactic necessity as PTR to PROC can be used similarly. PROCVARs are used heavily in SOS code to allow applications to override behaviours and extend OS behaviour. SOS has unique terms for the use of procedure pointers :

GATE
SOS name for a Procedure Variable that is expected to be set by some other module. It is a 'gate' to some other implementation. Usually gates are defined in lower-level modules.
TARGET
SOS name for a procedure implementation referenced from a GATE. This is the 'target' of a call to a 'gate'. Usually targets are defined in higher-level modules
ASPECT
SOS name for a structure containing a number of procedure variables. This can be thought of as an 'interface' in the Java sense - a set of method signatures. Often a lower-level module would provide an API for a higher-level API to add some functionality including some data and perhaps an 'aspect' of procedure variables.

As well as supporting standard composite structures similar to a C struct, Protel supports the concept of an Area which can be used to implement a form of type-inheritance. An AREA is similar to a struct, but contains as part of its definition a storage size in bits, as well as zero or more members. At compile time, it is checked that the declared member's types fit within the size given, and instances will be created with the size given. Other modules can declare further areas which REFINE this area, and add definitions to the original AREA. The compiler will check that the complete set of member definitions continue to fit in the bit size of the original AREA. This mechanism can be used to create trees of hierarchically related data types which is very useful for code modularity and extensibility as well as more basic optionality similar to a C union. Putting procedure pointers into the Area gives a rather rough extensible virtual-method mechanism in-language. However, most DMS software designers used AREA refinements for hierarchically varying data rather than allowing control flow overrides.

Protel offers some reflection capability via the TYPEDESC operator. It is applied to a type and returns a structure which can be used to determine type names, bit offsets etc. SOS supports an online extensible data dictionary which uses TYPEDESC to track types and their relationships. It is version aware and is used by Table Control and for data reformatting between software releases.

Combining PROCVARs and REFINEd AREAs allowed extensible systems to be built fairly easily without OO techniques built into the language. However, the explicit nature of the PROCVARs, the requirement to define up-front bitsizes for refinable areas and the general micromanagement required to define, initialise and use 'object' hierarchies made from these components discouraged most designers from using them in this way. Providing tools at this atomic level encouraged each designer to try their own combination of hard coding, ProcVars, extensible areas, pointers-to-extension-structs, pointers-to-data-structs-with-procvars etc. More manual visualisation effort was required to grasp these mechanisms than would be required for an equivalent language with OO extensions.

I think PROTEL was fairly state-of-the-art when it was introduced for DMS software. Especially considering its planned use for telecoms switching equipment, it is a very general purpose language, not visibly oriented towards telecoms. It has a fairly clean split between language features and runtime libraries. Perhaps if it had been more widely known of beyond the confines of Nortel/BNR then it could have enjoyed some life of its own? I believe it is still being actively used - these days SOS images run in virtualised environments with code written by outsourced employees paid by a broken company, but I imagine there must still be patches getting written. However the outlook looks bleak. With Nortel on the rocks and apparently no interesting information about Protel available on the web (except this blog of course :) ), it looks like it could vanish after 30 years.

In the mid to late nineties, Nortel added object orientation to Protel, I'll talk briefly about that in a future post.

Saturday, 28 March 2009

What is SOS? Part III

Part 1 Part 2

Online code patching and extension

A SOS system is comprised of modules, which contain code and various data segments and are vaguely similar to shared libraries or DLLs. Each function in a module has a function pointer stored at some known offset in a code header segment. The actual machine code for the function is stored in a different segment of the module. This indirection requires that all procedure calls involve a pointer dereference, but gives the flexibility to change the implementation of any procedure at any time. The code and data segments in a module also include limited 'spare' space, so that a number of extra global data variables and functions can be added to a module online. This, coupled with the ability to load completely new modules with arbitrary code and data makes a SOS system completly runtime-patchable, with all behaviours modifiable online. Online upgrade of a running module is referred to as load-replacement. In development it is used to test and debug code, and in deployment it is used to patch code, and to add small new features.

Run-time code modification is made managable by the cooperation of the standard source code control system (PLS), the Protel language compiler and linker, and SOS. At compile and link time, metadata about the header contents and sizes, and the version of source compiled is included in the module file. When SOS is asked to load-replace the module, it compares the new module with the existing module and will only allow the load-replace if it can be done safely. When a module is replaced, SOS updates its module metadata with the new module's version etc.

SOS also includes a patch management system which tracks the state of applied patches. Patches use the basic module load-replacement system in a controlled and automated way to :

Load-replace existing modules to add hooks into existing code
Load new modules to contain modified functionality and state storage space
Execute patch application and removal steps

SOS tracks patch dependencies and can generally unapply and reapply patches at runtime. It tracks inter-patch dependencies, and allows different deployments to run different sets of patches. This is especially useful when patches are used to implement features and functionality specific to a single user.

Writing SOS patches is quite an art, and whole teams that write nothing but patches existed in Nortel's good times. Often the patch specialists were very technically capable and innovative, being aware of the innards of SOS and able to deal with the extra dimensions of design visualisation required to consider patch application and later removal. However, extended exposure to writing patches in the convoluted style required for safe application and removal tends to corrode a designer's sense of elegance and abstraction.

'Relational' data access system

SOS includes a table based data access layer called 'Table control'. This layer supports interactive and batch access to tables. Tables include key columns and non-key columns with a flexible type system, including inheritance. Tables are statically defined with a good deal of flexibility in the implementation of the mapping onto the underlying data source. Table control supports separate data representations at 'External', 'Logical', 'Data' and 'Physical' layers. These abstractions give great freedom to decouple the external, user visible view of the data from the internal constraint optimised storage of the data. Table control was initially designed to give a standard way to store and retrieve DMS configuration information, but over time in different products is used to give standardised access to huge databases of mobile subscriber information etc. The Table Control API was later built on to implement external data provisioning and management systems, and is a large part of the online software upgrade process.

Despite being table and column oriented, table control is only 'relational' in a limited sense. There is no SQL-style declarative language for querying data stored in tables, and no standard way to 'join' tables. However, foreign key constraints can be enforced, and DMS supports a basic scripting language which can be used to write scripts to automate cross-table analysis and maintenance.

User model

SOS supports multiple interactive user sessions, connected via telnet or older technologies. Users can have various permissions with respect to commands, table access etc, and all user activity can be logged.

Online software upgrade

Theoretically, module load-replacement can be used to upgrade software, but in practice it is only used for bug fixes and small features with economic or time-pressure reasons for in-release delivery. Writing all code to be online replacable against old code adds an excessive burden to the design and test cycles.

SOS supports online upgrade of duplex systems via the ONP process (One Night(mare) Process). What happens is :

Hardware sync-match is dropped, splitting the system into two separate systems, one active, the other inactive.
The inactive side is rebooted with the new software
Bulk personality and state data from the active side is transferred across to the inactive side, potentially involving data reformats for changed or extended schemas.
Once all bulk data is transferred, up to date state data transfer starts
Once all components agree that state transfer is reasonably up-to-date
- New Active side activity is stopped
- All remaining state is transferred
- Inactive side becomes active side (SWitch of ACTivity). IO systems are reconnected to new active side.
Users perform acceptance testing on the new system for a limited period
If users decide to revert :
- Modified state is transferred back to newly inactive side
- SWACT is performed in reverse
If users decide to continue :
- Newly inactive side is dropped and hardware sync-match is restored.

This upgrade mechanism is complex and error prone, but it offers online upgrade with minimal service outage (of the order of 4 seconds) at the cost of a temporary loss of redundancy.

From the application designer's point of view, they need to consider :

Data reformats
Table control can automate some conversions which map into type-promotions. More complex conversions can be performed in user-code callbacks.
State transfer
Essential state can be transferred around SWACT using user-code callbacks
Protocol compatibility
Newer software versions must support old protocol versions until all parties can deal with newer versions.
Upgrade-abort implications
If upgrade is reverted then data and states which only exist in the new version must be avoided or dealt with.

SOS applications generally support direct upgrade over 3 versions. In a DMS system comprising a number of smaller SOS based systems, generally the upgrades start at the leaves of the tree of systems (peripherals), and work back towards the Computing Module (CM). This implies that each system must be willing to accept old-version protocol interactions from systems higher in the tree than it, but need not worry about protocol versions for systems lower in the tree (Assuming the usual computer-science leaves-at-the-bottom tree layout). Given that individual system complexity increases as you go 'up' the tree to the root, this is a good arrangement.

Online multithreaded breakpoint capable debugger

SOS supports interactive use and one application users can use is an interactive breakpoint and tracepoint capable debugger. This tool allows a running system to be inspected, and code and data to be modified on the fly. Break and tracepoints can be made data-conditional and can use thread (process) ids to be thread conditional. The debugger also has some knowledge of symbols and offsets within modules.
Full debugging access with breakpoints is not usually made available for deployed systems as the risk of accidental damage is too great.

Well that's been a quick tour of SOS. It's an interesting system with very little documentation outside of Nortel. My own memories of it are fading fast, so please excuse mistakes and the lack of detail here. I don't intend to blog about the system in-general any more, but may cover some specific details that are of interest.

(Well of interest to me, as no-one else seems interested so far :) )

Tuesday, 6 January 2009

What is SOS? part II

Part 1 Part 3

Resource Ownership Model

SOS implements an extensible resource ownership module. System resources such as memory, IPC 'mailboxes', semaphores and any user-defined resource are owned by either a loaded module or a running process. Process death results in process owned resources being freed. Module unload results in module owned resources being freed. Processes are owned by Modules.

Since all processes in SOS share memory by default (more like threads in other OS), a mechanism for cleaning up resources on process death is very useful.

Applications can arrange to have their own resources owned by a process or a module using a mechanism confusingly called 'Events'. This mechanism guarantees a callback on process death or module unload allowing arbitrary cleanup. This is essential as processes can exit at any point with no stack cleanup due to exceptions (divide by zero, bus error etc.).

Prioritised proportional scheduling

SOS processes are each placed in a process class. Each class has a guaranteed share of CPU where all process class shares sum to 100% of available CPU. Additionally, all process classes have a relative priority.

The scheduler chooses which process to run next based on answering the question "What is the highest priority class with both a runnable process and time left in it's CPU share?".

The guaranteed share mechanism is used to ensure that all process classes regularly get some share of CPU time even in a heavily loaded system, while still providing priority to high urgency tasks. This allows the system to avoid throughput or latency degradation as load increases.

The placement of processes into process classes is generally fixed at design time. The set of process class guaranteed shares (called a scheduler template) is also generally fixed, but is changed between restarts and normal operation.

Application control of scheduler pre-emption

The SOS scheduler implements pre-emptive multitasking on a single CPU core, changing the running process periodically. Processes can temporarily stop pre-emption occurring via the rather wordy setunpreemptable() call. This can be used to demarcate critical regions in code, where atomic actions are performed. All other processes can only observe the state of memory before the call to setunpreemptable(), or after a call to setpreemptable().

Since all other processes are excluded from running while the current process is unpreemptable, it is important that only bounded amounts of computation are performed while unpreemptable. To enforce this, SOS implements an unpreemptable timer, of the order of tens of milliseconds which stops the running process with an exception if it remains unpreemptable for longer than this. For this reason, activities like blocking IO and dynamic memory allocation cannot reliably be performed while unpreemptable.

Most applications are collections of message driven state machines, and run unpre-emptably in bursts, processing messages, updating state and sending responses. As well as giving low-overhead mutual exclusion, making applications non-blocking with control over pre-emption maximises the instruction cache hit rate and improves data cache locality.

That's enough for now, I'm boring myself !

message passing