Friday 12 December 2008

What is SOS?


SOS is either the Switch, Support or Service Operating System which runs on a number of the components making up a Digital Multiplex Switch (DMS).
Work began on the system around 1979. It is mostly written in and highly coupled to the PROTEL language and the PLS (Product Library System) SCCM tool.
It is a pre-emptive multitasking operating system with some bullet-pointable features :
  • Comprised entirely of runtime reloadable modules
  • Multiple memory pools with different durability + protection characteristics
  • Multiple levels of system restart with restart escalation
  • Strong and extensible resource ownership model
  • Prioritised proportional scheduling
  • Online code patching and extension
  • Built-in relational style database system
  • Support for multiuser interactive use
  • Support for online upgrade to new version
  • Contains online multi-threaded trace/breakpoint debugger
One of the most retro features is having no per-process memory protection. All processes run in a shared address space, which makes them similar to modern-day threads within a single process. Chaos is somewhat contained by the support for write-protected memory. One advantage of not having per-process address spaces is that processor caches and TLBs do not need to be flushed when context switching.

Module based
The PROTEL language allows a system to be split into modules, each with multiple source code files. All definitions in the files are contained in the scope of the module. Each source file can be marked as either public interface, private interface or implementation. Modules import definitions from other modules' public and permitted private interfaces. The module concept provides component-level encapsulation, independent of OO or other abstraction mechanisms used in the code itself.

SOS allows modules to be loaded at runtime. SOS also allows modules to be 'replaced' at runtime. This involves overwriting the object code of the module while only making safe modifications to the module's exported procedure entry points and global data. This is the basis of the online code patch system which allows any object code to be replaced while processes execute over it.
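
To make the replacement idea concrete, here is a rough Python sketch. It assumes callers reach exported procedures through a stable entry point that looks up the current code on each call; that indirection is my assumption, not a description of how SOS actually does it. The `Module` class and the `billing` module are hypothetical names for illustration.

```python
class Module:
    """Toy module: callers hold stable entry points, never the code itself."""

    def __init__(self, name, procedures, global_data=None):
        self.name = name
        self._code = dict(procedures)          # stands in for object code
        self.global_data = dict(global_data or {})

    def entry_point(self, proc_name):
        # The returned callable looks up the current code on every call,
        # so it stays valid after the module is replaced.
        def call(*args):
            return self._code[proc_name](*args)
        return call

    def replace(self, new_procedures):
        # Refuse a replacement that would remove an exported procedure.
        missing = set(self._code) - set(new_procedures)
        if missing:
            raise ValueError(f"replacement drops exports: {sorted(missing)}")
        self._code = dict(new_procedures)      # global data is left untouched


billing = Module("billing", {"rate": lambda minutes: minutes * 2},
                 global_data={"calls_rated": 0})
rate = billing.entry_point("rate")             # a caller binds once
before = rate(10)
billing.replace({"rate": lambda minutes: minutes * 3})  # the "online patch"
after = rate(10)                               # same entry point, new code
```

The caller never rebinds `rate`, yet sees the new behaviour after the replace, and the module's global data survives the swap.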

Each module can define an entry procedure. This is called when the system is performing a restart and allows the module to take different initialisation actions depending on the restart type.

A SOS system is comprised of a set of modules and an initialisation order. At the various restarts, the SOS system iterates through the modules in initialisation order, calling their entry procedures.
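
As an illustration only, the restart walk might look like this in Python. The `Restart` enum, the module names and the actions logged are all hypothetical; the only things taken from the description above are that entry procedures are optional, are called in initialisation order, and can act differently depending on the restart type.

```python
from enum import Enum


class Restart(Enum):
    IPL = "ipl"
    RELOAD = "reload"
    COLD = "cold"
    WARM = "warm"


def run_restart(modules, restart):
    """Call each module's entry procedure, in initialisation order."""
    for _name, entry in modules:
        if entry is not None:          # defining an entry procedure is optional
            entry(restart)


log = []


def timers_entry(restart):
    log.append(("timers", restart))


def callp_entry(restart):
    # A module can take different actions depending on the restart type.
    action = "rebuild all" if restart is Restart.RELOAD else "rebuild temp"
    log.append(("callp", restart, action))


# The list order is the initialisation order.
system = [("timers", timers_entry), ("utils", None), ("callp", callp_entry)]
run_restart(system, Restart.WARM)
```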

To allow different types of systems sharing the same source modules to be easily defined, sets of modules and their dependencies can be grouped together to form larger components. A system can then be specified in terms of these larger components. The inter-component and inter-module dependencies are then used together with some hints to compute the module initialisation order and build a system image.
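
Computing an initialisation order from dependencies is essentially a topological sort. A minimal sketch, assuming made-up component and module names and ignoring both the hints and the inter-component dependencies mentioned above, using Python's standard `graphlib`:

```python
from graphlib import TopologicalSorter

# Hypothetical components: each groups a set of modules.
COMPONENTS = {
    "base": ["kernel", "memory"],
    "callp": ["signalling", "routing"],
}

# Hypothetical dependencies: module -> modules that must initialise first.
MODULE_DEPS = {
    "kernel": [],
    "memory": ["kernel"],
    "signalling": ["memory"],
    "routing": ["signalling"],
}


def init_order(system_components):
    """Expand components into modules, then order them by dependency."""
    modules = [m for c in system_components for m in COMPONENTS[c]]
    graph = {m: MODULE_DEPS[m] for m in modules}
    return list(TopologicalSorter(graph).static_order())


# The order in which components are named does not matter.
order = init_order(["callp", "base"])
```

A dependency cycle would make `static_order()` raise `graphlib.CycleError`, which is roughly the point at which a real build would need a human to sort out the dependencies or supply hints.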

Multiple memory types
The DMS model is unusual in that memory is expected to provide sufficient persistence for most data, with disk based recovery only occasionally required. This is a reasonable assumption given fault tolerant redundant memory with redundant power supplies, arrays of lead acid batteries etc.
A number of basic memory types are defined by SOS, with a number of variants for special purposes.
  • PSPROT
    Program Store, protected. Used for object code. Write protected. Loaded from and saved to a SOS image on disk/tape.
  • DSSAVE
    Data Store. Not initialised by operating system reboot or restart. Not part of a SOS image.
  • DSPROT
    Data Store, write protected. Used for configuration or otherwise slow changing data. Loaded from and saved to a SOS image on disk/tape.
  • DSPERM
    Data Store, permanently allocated, wiped by some restarts. Not part of a SOS image.
  • DSTEMP
    Data Store, temporarily allocated, wiped by most restarts. Not part of a SOS image.
The DSSAVE memory is limited in size but is useful for tracking system debugging state across multiple OS reboots. Most applications have no need for it.
DSPROT is written to by transiently removing write protection during the write. If a write is attempted while write protection is active, the writing process gets an exception. Special handling is required while DSPROT is being backed up to ensure a consistent snapshot is taken.
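
The write discipline for DSPROT can be mimicked with a small Python sketch. `ProtectedStore`, `WriteProtectionError` and the `trunk_count` key are hypothetical names; the real protection is a property of the memory itself rather than a software flag, and the backup handling is not modelled here.

```python
from contextlib import contextmanager


class WriteProtectionError(Exception):
    """Raised when a write is attempted while protection is active."""


class ProtectedStore:
    def __init__(self):
        self._data = {}
        self._writable = False

    @contextmanager
    def unprotected(self):
        # Lift write protection only for the duration of the block,
        # restoring it even if the block raises.
        self._writable = True
        try:
            yield self
        finally:
            self._writable = False

    def write(self, key, value):
        if not self._writable:
            raise WriteProtectionError(f"write to {key!r} while protected")
        self._data[key] = value

    def read(self, key):
        return self._data[key]


dsprot = ProtectedStore()
with dsprot.unprotected():
    dsprot.write("trunk_count", 24)    # deliberate, bracketed write

try:
    dsprot.write("trunk_count", 48)    # stray write, protection active
    stray_write_blocked = False
except WriteProtectionError:
    stray_write_blocked = True
```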
DSPERM remains allocated across all restart types but is reset on some (see below). This gives it the interesting property that a pointer to allocated DSPERM must be stored in DSPROT memory to ensure that the allocated memory can be 'found' again after a restart.
DSTEMP is deallocated and reset across all restart types.

The memory types tie into the set of restart types supported by the operating system (below).
One of the main benefits of this system is that it orients application designers towards thinking of their application in terms of multiple levels of state, and the benefits of throwing state away to recover from error situations.

Multiple levels of system restart
SOS defines an Initial Program Load plus three levels of restart :
  • Initial Program Load (IPL)
    This is performed only once when a module is initially loaded
  • Reload restart
    This is the most severe restart type and occurs as part of a reboot, or when an assertion failure or user request demands it.
    DSPERM memory is reset, DSTEMP memory is deallocated and reset. All modules' entry procedures are called.
  • Cold restart
    This is the second-most-severe restart type and occurs when requested by the user, or when a number of Warm restarts have failed to clear a problem. DSTEMP memory is deallocated and reset.
  • Warm restart
    This is the least severe restart type and occurs when requested by the user or when the system determines that a number of failure indicators suggest ill health. DSTEMP memory is deallocated and reset.
By placing different parts of a module's state in different memory types, and reallocating/reinitialising the state in the module's entry procedure, applications can cooperate with the system's restart escalation mechanism. One of the Call Processing (CALLP) applications written on SOS uses warm restart to drop connecting calls, but keep connected calls, and cold restart to drop all calls.
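
Here is one way the CALLP behaviour could be modelled in Python. This is a guess at the shape, not the real design: I am assuming connected calls sit in DSPERM-like state, connecting calls sit in DSTEMP-like state, and the entry procedure explicitly clears connected calls on a cold restart. The class and call names are hypothetical.

```python
class CallProcessing:
    def __init__(self):
        self.connected = set()     # modelled as DSPERM: survives a warm restart
        self.connecting = set()    # modelled as DSTEMP: wiped by every restart

    def entry(self, restart):
        """Entry procedure, called by the system on each restart."""
        # DSTEMP has been wiped by the system, so rebuild temporary state.
        self.connecting = set()
        # Assumption: on cold (and anything more severe) the module chooses
        # to throw away its longer-lived state as well.
        if restart in ("cold", "reload"):
            self.connected = set()


callp = CallProcessing()
callp.connected = {"alice-bob"}
callp.connecting = {"carol-dave"}

callp.entry("warm")
after_warm = (set(callp.connected), set(callp.connecting))

callp.entry("cold")
after_cold = (set(callp.connected), set(callp.connecting))
```

After the warm restart only the half-set-up call is gone; after the cold restart nothing is left.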

Low level modules in SOS monitor system health indicators (number of process deaths, exceptions while in a critical region, system load etc.) and if there is a perceived problem will trigger a warm restart of the system.

If the warm restart fails, or the system does not recover correctly after a number of warm restarts, the restart type is escalated to a cold restart. Modules are generally designed to re-initialise more state during a cold restart (which, as a result, generally takes longer to accomplish).

If multiple cold restarts fail, the system escalates to a Reload restart, which, again, reinitialises more state, taking longer.

If all attempts to restart the running system fail, a reboot can be attempted which reloads the system image from disk and performs a reload restart on it.

If this fails, previously backed up images are tried.

In this way, the system automatically escalates recovery efforts, resetting more and more state each time, eventually trying previous images. The driving philosophy is to *never* give up trying to recover. Never wait for a friendly user to press a key, or intervene.
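
The escalation ladder can be sketched as a generator of recovery steps. The number of attempts per level (two here) and the choice to keep cycling through the available images forever are my assumptions; the function names and image labels are hypothetical.

```python
from itertools import cycle, islice


def escalation_plan(images, attempts_per_level=2):
    """Yield (action, image) recovery steps, mildest first."""
    for level in ("warm", "cold", "reload"):
        for _ in range(attempts_per_level):
            yield (level, None)        # restart the running system
    # Reboot from the current image, then each backup, and keep cycling:
    # as long as there is an image to try, the plan never ends.
    for image in cycle(images):
        yield ("reboot", image)


def recover(try_step, images):
    """Walk the plan until a step reports success, and return that step."""
    for step in escalation_plan(images):
        if try_step(*step):
            return step


steps = list(islice(escalation_plan(["current", "backup1"]), 9))
fixed_by = recover(lambda action, image: action == "cold", ["current"])
```

Note that `recover` has no give-up branch: if no step ever succeeds it simply keeps rebooting from the images it has, which matches the philosophy above.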

What makes this different?
SOS is curious in the ways it differs from the Operating Systems in common use today but it is also similar in a number of ways :
  • Written in high level language
  • Written for general purpose CPU and memory model
  • Multitasking
  • Supports interactive use
These features are not particularly noteworthy for a modern general purpose OS, but for one designed in 1979 for a telecoms switch they are unusual. Other telecoms software at the time tended to be :
  • Written in assembly language and/or
  • Written in telecoms-specific DSL with severe expressivity limitations
  • Designed for telecoms specific CPUs and hardware
  • Cooperatively scheduled
  • Very limited interactivity
I believe SOS was ahead of its time in being fairly general purpose, powerful and flexible.

Well done if you got this far, I'll continue boring on about SOS in another post...
