Friday, 12 December 2008

What is SOS?

SOS is either the Switch, Support or Service Operating System which runs on a number of the components making up a Digital Multiplex Switch (DMS).
Work began on the system around 1979. It is mostly written in and highly coupled to the PROTEL language and the PLS (Product Library System) SCCM tool.
It is a pre-emptive multitasking operating system with some bullet-pointable features:
  • Comprised entirely of runtime reloadable modules
  • Multiple memory pools with different durability + protection characteristics
  • Multiple levels of system restart with restart escalation
  • Strong and extensible resource ownership model
  • Prioritised proportional scheduling
  • Online code patching and extension
  • Built-in relational style database system
  • Support for multiuser interactive use
  • Support for online upgrade to new version
  • Contains online multithreaded trace/breakpoint debugger
One of the most retro features is having no per-process memory protection. All processes run in a shared address space, which makes them similar to modern-day threads within a single process. Chaos is somewhat contained by the support for write-protected memory. One advantage of not having per-process address spaces is that processor caches and TLBs do not need to be flushed when context switching.

Module based
The PROTEL language allows a system to be split into modules, each with multiple source code files. All definitions in the files are contained in the scope of the module. Each source file can be marked as either public interface, private interface or implementation. Modules import definitions from other modules' public and permitted private interfaces. The module concept provides component-level encapsulation, independent of OO or other abstraction mechanisms used in the code itself.

SOS allows modules to be loaded at runtime. SOS also allows modules to be 'replaced' at runtime. This involves overwriting the object code of the module while only making safe modifications to the module's exported procedure entry points and global data. This is the basis of the online code patch system which allows any object code to be replaced while processes execute over it.

Each module can define an entry procedure. This is called when the system is performing a restart and allows the module to take different initialisation actions depending on the restart type.

A SOS system is comprised of a set of modules and an initialisation order. At the various restarts, the SOS system iterates through the modules in initialisation order, calling their entry procedures.

To allow different types of systems sharing the same source modules to be easily defined, sets of modules and their dependencies can be grouped together to form larger components. A system can then be specified in terms of these larger components. The inter-component and inter-module dependencies are then used together with some hints to compute the module initialisation order and build a system image.
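As a rough illustration of computing an initialisation order from dependencies, here is a minimal Python sketch using a topological sort. It is not how the SOS build tooling actually works: the module names and dependencies are invented, and the 'hints' mentioned above are not modelled.

```python
from graphlib import TopologicalSorter

# Hypothetical modules: each key must be initialised after the modules in its set.
DEPENDENCIES = {
    "kernel": set(),
    "memory": {"kernel"},
    "scheduler": {"kernel", "memory"},
    "database": {"memory"},
    "callp": {"scheduler", "database"},
}


def initialisation_order(dependencies):
    """Return one order in which every module comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = initialisation_order(DEPENDENCIES)
print(order)
```

Several orders are usually valid for the same dependency graph, which is presumably where hints would come in to pick between them.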

Multiple memory types
The DMS model is unusual in that memory is expected to provide sufficient persistence for most data, with disk based recovery only occasionally required. This is a reasonable assumption given fault tolerant redundant memory with redundant power supplies, arrays of lead acid batteries etc.
A number of basic memory types are defined by SOS, with a number of variants for special purposes.
  • PSPROT
    Program Store, protected. Used for object code. Write protected. Loaded from and saved to a SOS image on disk/tape.
  • DSSAVE
    Data Store. Not initialised by operating system reboot or restart. Not part of a SOS image.
  • DSPROT
    Data Store, write protected. Used for configuration or otherwise slow changing data. Loaded from and saved to a SOS image on disk/tape.
  • DSPERM
    Data Store, permanently allocated, wiped by some restarts. Not part of a SOS image.
  • DSTEMP
    Data Store, temporarily allocated, wiped by most restarts. Not part of a SOS image.
The DSSAVE memory is limited in size but is useful for tracking system debugging state across multiple OS reboots. Most applications have no need for it.
DSPROT is written to by transiently removing write protection during the write. If a write is attempted while write protection is active, the writing process gets an exception. Special handling is required while DSPROT is being backed up to ensure a consistent snapshot is taken.
DSPERM remains allocated across all restart types but is reset on some (see below). This gives it the interesting property that a pointer to allocated DSPERM must be stored in DSPROT memory to ensure that the allocated memory can be 'found' again after a restart.
DSTEMP is deallocated and reset across all restart types.
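The DSPROT write path described above (protection lifted only for the duration of a write, an exception otherwise) can be sketched in Python as below. This is only an analogy: `ProtectedStore`, `WriteProtectionError` and `unprotected` are names I have made up, and real DSPROT protection is enforced on memory, not by a flag in an object.

```python
from contextlib import contextmanager


class WriteProtectionError(Exception):
    """Stand-in for the exception a process gets on a protected write."""


class ProtectedStore:
    """Toy model of write-protected data store (hypothetical API)."""

    def __init__(self):
        self._data = {}
        self._writable = False

    def read(self, key):
        return self._data[key]

    def write(self, key, value):
        if not self._writable:
            raise WriteProtectionError(f"write to {key!r} while protected")
        self._data[key] = value

    @contextmanager
    def unprotected(self):
        """Lift write protection transiently, restoring it even on error."""
        self._writable = True
        try:
            yield self
        finally:
            self._writable = False


store = ProtectedStore()
with store.unprotected():
    store.write("route", 7)
```

Any stray write outside the `with` block raises, which is the property that contains the chaos of a shared address space.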

The memory types tie into the set of restart types supported by the operating system (below).
One of the main benefits of this system is that it orients application designers towards thinking of their application in terms of multiple levels of state, and the benefits of throwing state away to recover from error situations.

Multiple levels of system restart
SOS defines an Initial Program Load and three levels of restart:
  • Initial Program Load (IPL)
    This is performed only once when a module is initially loaded
  • Reload restart
    This is the most severe restart type and occurs as part of a reboot, or when an assertion failure or user request demands it.
    DSPERM memory is reset, DSTEMP memory is deallocated and reset. All modules' entry procedures are called.
  • Cold restart
    This is the second-most-severe restart type and occurs when requested by the user, or when a number of Warm restarts have failed to clear a problem. DSTEMP memory is deallocated and reset.
  • Warm restart
    This is the least severe restart type and occurs when requested by the user or when the system determines that a number of failure indicators suggest ill health. DSTEMP memory is deallocated and reset.
By placing different parts of a module's state in different memory types, and reallocating/reinitialising the state in the module's entry procedure, applications can cooperate with the system's restart escalation mechanism. One of the Call Processing (CALLP) applications written on SOS uses Warm restart to drop connecting calls, but keep connected calls, and Cold restart to drop all calls.
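The call processing example can be sketched as an entry procedure that looks at the restart type. This is a toy Python model, not PROTEL: `Restart`, `CallModule` and `entry` are hypothetical names, call state is held in plain sets rather than in SOS memory types, and treating a Reload restart like a Cold restart for calls is my assumption.

```python
from enum import IntEnum


class Restart(IntEnum):
    """Restart types, ordered by severity."""
    WARM = 1
    COLD = 2
    RELOAD = 3


class CallModule:
    """Toy call processing module with two levels of state."""

    def __init__(self):
        self.connecting = set()  # calls still being set up
        self.connected = set()   # established calls

    def entry(self, restart):
        """Entry procedure: throw away more state as severity rises."""
        self.connecting.clear()            # dropped on every restart
        if restart >= Restart.COLD:
            self.connected.clear()         # dropped on Cold (and, assumed, Reload)


module = CallModule()
module.connecting.add("call-1")
module.connected.add("call-2")
module.entry(Restart.WARM)
```

After the Warm restart above, `call-1` is gone and `call-2` survives; a subsequent `module.entry(Restart.COLD)` would drop it too.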

Low level modules in SOS monitor system health indicators (number of process deaths, exceptions while in a critical region, system load etc.) and if there is a perceived problem will trigger a warm restart of the system.

If the warm restart fails, or the system does not recover correctly after a number of warm restarts, the restart type is escalated to a cold restart. Modules are generally designed to re-initialise more state during a cold restart (which as a result, generally takes longer to accomplish).

If multiple cold restarts fail, the system escalates to a Reload restart, which, again, reinitialises more state, taking longer.

If all attempts to restart the running system fail, a reboot can be attempted which reloads the system image from disk and performs a reload restart on it.

If this fails, previously backed up images are tried.

In this way, the system automatically escalates recovery efforts, resetting more and more state each time, eventually trying previous images. The driving philosophy is to *never* give up trying to recover. Never wait for a friendly user to press a key, or intervene.
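The escalation ladder can be sketched as a generator that never runs out of things to try. The number of attempts per level is an assumption (the real thresholds are not something I can quote), as is simply staying on the last rung forever, and `is_healthy_after` is a hypothetical callback standing in for the system's health checks.

```python
LEVELS = [
    "warm restart",
    "cold restart",
    "reload restart",
    "reboot",
    "previous image",
]


def escalation(tries_per_level=2):
    """Yield recovery actions forever, escalating after repeated failures."""
    for level in LEVELS[:-1]:
        for _ in range(tries_per_level):
            yield level
    while True:  # never give up: keep trying the last resort
        yield LEVELS[-1]


def recover(is_healthy_after, tries_per_level=2):
    """Apply escalating actions until one leaves the system healthy."""
    for attempts, level in enumerate(escalation(tries_per_level), start=1):
        if is_healthy_after(level):
            return level, attempts


result = recover(lambda level: level == "cold restart")
print(result)  # ('cold restart', 3)
```

Note that `recover` has no failure return path, which is the point: there is no branch where it waits for a friendly user.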

What makes this different?
SOS is curious in the ways it differs from the Operating Systems in common use today but it is also similar in a number of ways :
  • Written in high level language
  • Written for general purpose CPU and memory model
  • Multitasking
  • Supports interactive use
These features are not particularly noteworthy for a modern general purpose OS, but for one designed in 1979 for a telecoms switch they are unusual. Other telecoms software at the time tended to be:
  • Written in assembly language and/or
  • Written in telecoms-specific DSL with severe expressivity limitations
  • Designed for telecoms specific CPUs and hardware
  • Cooperatively scheduled
  • Very limited interactivity
I believe SOS was ahead of its time in being fairly general purpose, powerful and flexible.

Well done if you got this far, I'll continue boring on about SOS in another post...

Tuesday, 9 December 2008

Hardware Transactional Memory II

In my last entry, I introduced Nortel's XA-Core platform which I believe was one of the first commercially successful HTM machines. This time I want to talk about the hardware.

Modular Architecture
An XA-Core system is comprised of various card types including Processing Elements (PEs), Shared Memory cards (SM) and IO processors (IOP). These components are connected by a 'Gigabit Interconnect' (GI) which in practice is a set of point-to-point optical links with agreeable 'hot pluggable' and optical isolation properties. All of these card types exist in various versions and live in a standardish DMS rack with redundant power and cooling.

Processing Element
The PE card has a number of large chips and a few smaller ones (probably most circuit boards do :)) :
  • Two lockstep PowerPC CPUs, initially PPC603.
  • A 'Hippo' or 'Rhino' chip acting as the CPU -> Memory interface.
  • One custom 'PIGI' chip interfacing the PE with the GI

The processors run in lockstep with a fairly standard comparator mechanism to check them. From each processor's perspective, the Hippo/Rhino chip looks like main memory. Among other things, the Hippo/Rhino and PIGI chips provide :
  • Lock-step comparison of CPU outputs
  • Mapping of PPC bus requests to GI protocol requests, including transaction identifiers etc.
  • Mapping of 32-bit PE address space to 40-bit Shared Memory Address space.

The PE board additionally has some local memory, referred to as 'Scratch' which applications can use as a non-persistent workspace. While the PE has lockstepped CPUs, it is not possible to run the PE with only one CPU functioning. If either CPU fails, the whole PE is isolated. This avoids the requirement for a post-mismatch fault detection algorithm.

The minimum configuration is two PE cards, giving tolerance of one failure. Extra PE cards can be configured to give different n+m fault tolerance configurations.

When PE cards are inserted, they perform a self test and a number of initialisation steps, and then begin executing the SOS scheduler loop, taking work. When a PE card is hot-pulled or fails, any outstanding memory transaction is rolled back. Some other PE can then pick up the aborted work from wherever in shared memory the original PE found the work.

Shared Memory card
The SM cards contain some fairly fast memory accessed via a custom 'SMOAC' (Shared Memory Ownership and Access Controller) chip.

The SMOAC chip maintains the memory ownership information that is necessary to enforce the transactional semantics of memory access. Ownership information is maintained for every 32 bytes (PPC cache line) of memory in the system using ownership information sent with cache-line read and write requests from the PE.

To support rollback of unwanted memory transactions, every cache line is duplicated within an SM card. Every cache line has an Active copy (last committed) and an Update copy (dirty, yet to be committed). This doubles the amount of memory required, although it probably simplifies the hardware design and theoretically allows arbitrarily large transactions.
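To make the Active/Update idea concrete, here is a small software model in Python. It is a sketch under assumptions, not a description of the SMOAC logic: the class and method names are invented, both reads and writes are assumed to claim ownership of a line, and conflict resolution is left to the caller.

```python
class Conflict(Exception):
    """Raised when a PE touches a line that another PE currently owns."""


class CacheLine:
    def __init__(self, value=0):
        self.active = value  # last committed value
        self.update = value  # working copy, seen only by the owner
        self.owner = None    # PE holding the line in an open transaction


class SharedMemory:
    """Toy model: one Active and one Update copy per line."""

    def __init__(self, line_count):
        self.lines = [CacheLine() for _ in range(line_count)]

    def _claim(self, pe, index):
        line = self.lines[index]
        if line.owner not in (None, pe):
            raise Conflict(f"line {index} is owned by {line.owner}")
        line.owner = pe
        return line

    def read(self, pe, index):
        return self._claim(pe, index).update

    def write(self, pe, index, value):
        self._claim(pe, index).update = value

    def commit(self, pe):
        """Make the PE's Update copies the new Active copies."""
        for line in self.lines:
            if line.owner == pe:
                line.active = line.update
                line.owner = None

    def rollback(self, pe):
        """Discard the PE's Update copies, restoring the Active values."""
        for line in self.lines:
            if line.owner == pe:
                line.update = line.active
                line.owner = None


mem = SharedMemory(line_count=4)
mem.write("pe0", 1, 42)  # uncommitted: the Active copy is untouched
mem.rollback("pe0")      # e.g. pe0 was hot-pulled mid-transaction
```

Because the Active copy is never modified until commit, rolling back is just a matter of forgetting the Update copy, however many lines the transaction touched.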

Logical memory is mapped onto the SM cards in 32MByte blocks. The normal configuration is that every block is mapped onto at least two SM cards, and sometimes three. This allows for one or two SM card failures to be tolerated. Combined with the two copies of memory required for the transaction mechanism, this means that each byte of logical shared memory requires four to six bytes of physical memory.
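The four-to-six figure is just the two multipliers combined, as this small Python calculation shows. The function and constant names are mine; the numbers come from the description above.

```python
COPIES_PER_LINE = 2          # Active + Update copy of every cache line
BLOCK_BYTES = 32 * 1024**2   # logical memory is mapped in 32 MByte blocks


def physical_bytes(logical_bytes, sm_replicas):
    """Physical memory needed when each block lives on sm_replicas SM cards."""
    return logical_bytes * sm_replicas * COPIES_PER_LINE


for replicas in (2, 3):
    ratio = physical_bytes(BLOCK_BYTES, replicas) // BLOCK_BYTES
    print(f"{replicas} SM cards per block: {ratio}x physical memory")
```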

When SM cards are inserted, SOS decides which blocks should be copied to the new card, and begins a background task to copy the blocks across.

Input/Output Processor
The IO processor cards are used to connect the XA-Core to the outside world, including terminals, disk + tape and the rest of the DMS system. The cards themselves contain single PPC CPUs (no lockstepped redundancy here) and ASICs to interface to the GI and provide some DMA capability. IOPs are deployed in pairs so that no IO facility is completely lost due to an IOP failure.

Weird / Cool things
  • Fault tolerant, single system image, shared memory multiprocessing
    I don't think many examples exist where a single-system-image SMP can handle an arbitrary processor failure without a crash.
  • Existing correct code runs correctly in parallel
    But may be serialised due to contention on shared memory access
  • Even IO is transactional
    This puts pressure on IO latencies, and requires good batching to minimise IO overheads.
  • Easier identification of transient CPU faults
    When a mismatch is detected within a PE, the failing operation can be safely rolled back and retried on the same PE multiple times. This can be used to help distinguish hard faults from transient / temporal faults.
  • Online System split is possible for upgrade
    To support online software and data upgrade, the system can de-duplicate memory, assign a PE to the 'other' half of the memory and boot it from a system image on disk. This gives 2 systems running on one machine. At cutover time, most PEs and IOPs are quickly migrated from the old to the new side. Eventually memory can be re-duplicated from the new half.
  • No 'standard-SMP' cache-coherency glue logic required
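The retry idea from the 'Easier identification of transient CPU faults' point can be sketched like this. It is purely illustrative: `run_once` is a hypothetical callback that re-runs the rolled-back work and reports whether the lockstep outputs matched, and the retry limit is an assumption.

```python
def classify_mismatch(run_once, max_retries=3):
    """Retry rolled-back work on the same PE after a lockstep mismatch.

    Returns ("transient", n) if retry n matched, or ("hard", max_retries)
    if every retry mismatched again.
    """
    for retry in range(1, max_retries + 1):
        # The failed transaction is assumed to have been rolled back already,
        # so re-running it cannot leave shared memory half-updated.
        if run_once():
            return "transient", retry
    return "hard", max_retries


outcomes = iter([False, True])  # first retry mismatches, second matches
verdict = classify_mismatch(lambda: next(outcomes))
print(verdict)  # ('transient', 2)
```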
Not so cool things
  • Four to Six times memory hardware overhead
    Perhaps some trade-off between maximum transaction size and hardware complexity could have been made?
  • Expensive memory required to contain latency
  • Large, complex custom ASICs required
    Pushes out time-to-market, reduces time-in-market for modifications. Expensive.
  • OS cooperation required to assist with transaction demarcation, ensuring forward progress, IO handling, bringing PEs, IOPs and SMs on + offline.
    Requires cooperation from OS owners.
So that's my tour of the XA-Core hardware. Please comment if you have corrections or further questions, I may be able to dredge up some more details.
Next time I'll talk about some of the modifications made to the SOS operating system to make it run on this platform.

Friday, 5 December 2008

Hardware Transactional Memory I

The first multiprocessor I worked on was Nortel's XA-Core platform. This exotic platform was a replacement for the 'Computing Module' (CM) of their DMS telecoms switching platform.

Background
Previous CM generations are built on a pair of CPUs (Motorola 88k, 68k, BNR NT40) run in lockstep through a comparator for fault tolerance. The software running on these includes a multitasking OS (SOS) and a huge amount of call processing, database, hardware support and other telecoms code written in the proprietary PROTEL language, starting around 1979. SOS supports write-protectable memory, but not per-process memory protection, so the memory map resembles a heavily multithreaded process. Shared data is commonly used with an assumption of a strictly ordered memory model. Heavy use is made of a single global lock to enforce mutual exclusion between processes, to the extent that the bulk of the computation time is spent with a single process holding the global lock in 'jumbo' timeslices of tens of milliseconds. Much of the large code base is > 10 years old and in a 'frozen' state where changes are not possible.

The problem
How to increase CM computation capacity beyond the incremental improvements available from successive generations of CPUs without a huge software rewriting and revalidation effort and while maintaining CPU and memory fault tolerance?

The solution (XA-Core patent)
Create a fault tolerant SMP platform with replicated hardware transactional memory. Modify the OS so that a process claiming the 'single global lock' implicitly sets the boundaries on a memory transaction. Handle inter-process memory access contention by rolling back one of the contenders. Handle CPU failure by rolling back in-progress memory transactions.
The achievable level of parallelism is then limited by the memory access patterns of the concurrently running processes at the cache-line level.
Code can still be written using the 'single CPU multitasking OS with big-global-lock' approach. Incremental improvements to available parallelism can be made by changing the data access patterns of the parallel processes. Tools exist to monitor contention between competing processes and map it to stack traces and/or data structures.
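To show what 'contention at the cache-line level' means in practice, here is a minimal Python sketch. It assumes ownership is tracked per 32-byte line and that any overlap between two transactions counts as a conflict; the function names and addresses are invented.

```python
LINE_SIZE = 32  # bytes per ownership unit (assumed: one PPC cache line)


def lines_touched(addresses):
    """Map the byte addresses a transaction accesses to cache-line numbers."""
    return {address // LINE_SIZE for address in addresses}


def contend(accesses_a, accesses_b):
    """True if two transactions share a line, so one must be rolled back."""
    return bool(lines_touched(accesses_a) & lines_touched(accesses_b))


# Different variables, same 32-byte line: still a conflict.
print(contend([0x1000], [0x101C]))  # True
# Neighbouring lines: the two processes can run in parallel.
print(contend([0x1000], [0x1020]))  # False
```

This is why changing data layout and access patterns, rather than rewriting the locking, is the way to win back parallelism: two processes that never touch the same lines never roll each other back.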

The interesting details and issues
The actual hardware used, the transaction handling in the operating system, handling IO, application modifications required etc.

In the spirit of actually completing some blog entries, I'll continue this post later.

To be continued...