Saturday, 28 March 2009

What is SOS? Part III

Part 1 Part 2

Online code patching and extension

A SOS system is composed of modules, which contain code and various data segments and are vaguely similar to shared libraries or DLLs. Each function in a module has a function pointer stored at a known offset in a code header segment. The actual machine code for the function is stored in a different segment of the module. This indirection means that every procedure call involves a pointer dereference, but it gives the flexibility to change the implementation of any procedure at any time. The code and data segments in a module also include limited 'spare' space, so that a number of extra global data variables and functions can be added to a module online. This, coupled with the ability to load completely new modules with arbitrary code and data, makes a SOS system completely runtime-patchable, with all behaviours modifiable online. Online upgrade of a running module is referred to as load-replacement. In development it is used to test and debug code, and in deployment it is used to patch code and to add small new features.
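
To make the indirection concrete, here is a rough Python sketch of the idea. It is only a model: the real thing is Protel machine code and header segments, and the names here (Module, header, spare_slots and the greet functions) are invented for illustration.

```python
# Toy model (all names hypothetical): callers never hold a direct reference
# to a function, they always go through the module's header table.

class Module:
    def __init__(self, name, functions, spare_slots=2):
        self.name = name
        # "Header segment": slot index -> current implementation.
        # None marks the limited spare space left for online additions.
        self.header = list(functions) + [None] * spare_slots

    def call(self, slot, *args):
        # Every call pays for one extra lookup (the pointer dereference).
        return self.header[slot](*args)

    def replace(self, slot, new_impl):
        # Swap the implementation behind an existing slot.
        self.header[slot] = new_impl

    def add(self, new_impl):
        # Use one spare slot; raises ValueError once spare space is exhausted.
        slot = self.header.index(None)
        self.header[slot] = new_impl
        return slot


def greet_v1(who):
    return "hello " + who


def greet_v2(who):
    return "hello, " + who + "!"


mod = Module("demo", [greet_v1])
before = mod.call(0, "sos")   # uses greet_v1
mod.replace(0, greet_v2)      # "load-replace" just this one function
after = mod.call(0, "sos")    # same call site, new behaviour
```

The call site never changes; only the table entry does, which is what lets any procedure be swapped at any time.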

Run-time code modification is made manageable by the cooperation of the standard source code control system (PLS), the Protel language compiler and linker, and SOS. At compile and link time, metadata about the header contents and sizes, and the version of the source compiled, is included in the module file. When SOS is asked to load-replace a module, it compares the new module with the existing module and will only allow the load-replace if it can be done safely. When a module is replaced, SOS updates its module metadata with the new module's version and other details.
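
I don't remember the exact rules SOS applied, so the following Python sketch only shows the general shape of such a check. The metadata fields and the specific rules (existing header slots must not move, growth must fit in the spare space) are my assumptions, not the real SOS criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleMeta:
    # Hypothetical metadata, loosely modelled on "header contents and sizes"
    version: str
    header_slots: tuple    # function names, in header order
    header_capacity: int   # total header slots, including spare ones
    data_size: int         # bytes of global data in use
    data_capacity: int     # bytes available, including spare space


def can_load_replace(old, new):
    """Return (ok, reason). The rules are illustrative assumptions."""
    # Callers index the header by offset, so existing slots must stay put.
    if new.header_slots[:len(old.header_slots)] != old.header_slots:
        return False, "existing header slots moved or removed"
    if len(new.header_slots) > old.header_capacity:
        return False, "not enough spare header slots"
    if new.data_size < old.data_size:
        return False, "data segment shrank"
    if new.data_size > old.data_capacity:
        return False, "not enough spare data space"
    return True, "ok"


old = ModuleMeta("1.0", ("init", "run"), 4, 100, 128)
grown = ModuleMeta("1.1", ("init", "run", "stats"), 4, 120, 128)
reordered = ModuleMeta("1.1", ("run", "init"), 4, 100, 128)

ok_result = can_load_replace(old, grown)       # fits in the spare space
bad_result = can_load_replace(old, reordered)  # would break existing callers
```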

SOS also includes a patch management system which tracks the state of applied patches. Patches use the basic module load-replacement system in a controlled and automated way to:
  • Load-replace existing modules to add hooks into existing code
  • Load new modules to contain modified functionality and state storage space
  • Execute patch application and removal steps
SOS can generally unapply and reapply patches at runtime. It tracks inter-patch dependencies, and allows different deployments to run different sets of patches. This is especially useful when patches are used to implement features and functionality specific to a single user.
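
As a rough illustration of the dependency tracking, here is a minimal Python sketch. The PatchManager class and the patch ids are hypothetical, and the real system also ran the application and removal steps, which are left out here.

```python
class PatchManager:
    """Toy dependency tracker; not the real SOS patch manager."""

    def __init__(self):
        self.requires = {}   # patch id -> set of patch ids it depends on
        self.applied = []    # patch ids, in application order

    def register(self, patch_id, requires=()):
        self.requires[patch_id] = set(requires)

    def apply(self, patch_id):
        missing = self.requires[patch_id] - set(self.applied)
        if missing:
            raise RuntimeError(f"{patch_id} needs {sorted(missing)} applied first")
        if patch_id not in self.applied:
            self.applied.append(patch_id)

    def unapply(self, patch_id):
        # Refuse to remove a patch that another applied patch still relies on.
        dependants = [p for p in self.applied if patch_id in self.requires[p]]
        if dependants:
            raise RuntimeError(f"{patch_id} is still required by {dependants}")
        self.applied.remove(patch_id)


pm = PatchManager()
pm.register("A")                  # e.g. adds a hook to an existing module
pm.register("B", requires=["A"])  # e.g. new module that uses the hook
pm.apply("A")
pm.apply("B")
```

With this state, pm.unapply("A") raises until "B" has been unapplied, which is the kind of ordering a site-specific patch set needs.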

Writing SOS patches is quite an art, and whole teams that wrote nothing but patches existed in Nortel's good times. Often the patch specialists were very technically capable and innovative, being aware of the innards of SOS and able to deal with the extra dimensions of design visualisation required to consider patch application and later removal. However, extended exposure to writing patches in the convoluted style required for safe application and removal tends to corrode a designer's sense of elegance and abstraction.

'Relational' data access system

SOS includes a table-based data access layer called 'Table control'. This layer supports interactive and batch access to tables. Tables include key columns and non-key columns with a flexible type system, including inheritance. Tables are statically defined, with a good deal of flexibility in how they are mapped onto the underlying data source. Table control supports separate data representations at 'External', 'Logical', 'Data' and 'Physical' layers. These abstractions give great freedom to decouple the external, user-visible view of the data from the internal, constraint-optimised storage of the data. Table control was initially designed to give a standard way to store and retrieve DMS configuration information, but over time, in different products, it came to be used to give standardised access to huge databases of mobile subscriber information etc. The Table control API was later built on to implement external data provisioning and management systems, and is a large part of the online software upgrade process.

Despite being table and column oriented, Table control is only 'relational' in a limited sense. There is no SQL-style declarative language for querying data stored in tables, and no standard way to 'join' tables. However, foreign key constraints can be enforced, and DMS supports a basic scripting language which can be used to write scripts to automate cross-table analysis and maintenance.
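
To show what cross-table analysis without a join looks like, here is a small Python sketch of the pattern. It is not the DMS scripting language, and the two tables and their contents are made up.

```python
# Hypothetical tables: key -> non-key fields.
groups = {
    "G1": {"kind": "incoming"},
    "G2": {"kind": "outgoing"},
}
members = {
    ("G1", 0): {"state": "idle"},
    ("G1", 1): {"state": "busy"},
    ("G3", 0): {"state": "idle"},   # refers to a group that does not exist
}


def dangling_members(member_table, group_table):
    """Walk one table and look each key up in the other, by hand."""
    return sorted(key for key in member_table if key[0] not in group_table)


orphans = dangling_members(members, groups)
```

Here orphans ends up holding the single key ("G3", 0). An enforced foreign key constraint would have rejected that row at insert time; a script like this finds such rows after the fact.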

User model

SOS supports multiple interactive user sessions, connected via telnet or older technologies. Users can have various permissions with respect to commands, table access etc, and all user activity can be logged.

Online software upgrade

Theoretically, module load-replacement can be used to upgrade software, but in practice it is only used for bug fixes and small features with economic or time-pressure reasons for in-release delivery. Writing all code to be online replaceable against old code adds an excessive burden to the design and test cycles.

SOS supports online upgrade of duplex systems via the ONP (One Night(mare) Process). The steps are:
  • Hardware sync-match is dropped, splitting the system into two separate systems, one active, the other inactive.
  • The inactive side is rebooted with the new software
  • Bulk personality and state data from the active side is transferred across to the inactive side, potentially involving data reformats for changed or extended schemas.
  • Once all bulk data is transferred, transfer of up-to-date state data starts
  • Once all components agree that state transfer is reasonably up-to-date:
    - New activity on the active side is stopped
    - All remaining state is transferred
    - Inactive side becomes active side (SWitch of ACTivity, or SWACT). IO systems are reconnected to new active side.
  • Users perform acceptance testing on the new system for a limited period
  • If users decide to revert:
    - Modified state is transferred back to newly inactive side
    - SWACT is performed in reverse
  • If users decide to continue:
    - Newly inactive side is dropped and hardware sync-match is restored.
This upgrade mechanism is complex and error-prone, but it offers online upgrade with minimal service outage (of the order of 4 seconds) at the cost of a temporary loss of redundancy.
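
The sequence above can be read as a small state machine. The Python sketch below is only my way of summarising it: the phase and event names are invented, and the real process had far more checks and abort paths.

```python
from enum import Enum, auto


class Phase(Enum):
    IN_SYNC = auto()            # normal duplex operation
    SPLIT = auto()              # sync-match dropped, two separate systems
    INACTIVE_REBOOTED = auto()  # inactive side running the new software
    BULK_TRANSFERRED = auto()   # personality and bulk state copied across
    STATE_TRACKING = auto()     # ongoing state transfer, close to up-to-date
    SWACT_DONE = auto()         # new software is now the active side
    REVERTED = auto()           # users rejected the new load
    COMMITTED = auto()          # users accepted, sync-match restored


# (current phase, event) -> next phase. Event names are hypothetical.
TRANSITIONS = {
    (Phase.IN_SYNC, "drop_sync"): Phase.SPLIT,
    (Phase.SPLIT, "reboot_inactive"): Phase.INACTIVE_REBOOTED,
    (Phase.INACTIVE_REBOOTED, "bulk_transfer"): Phase.BULK_TRANSFERRED,
    (Phase.BULK_TRANSFERRED, "track_state"): Phase.STATE_TRACKING,
    (Phase.STATE_TRACKING, "swact"): Phase.SWACT_DONE,
    (Phase.SWACT_DONE, "revert"): Phase.REVERTED,
    (Phase.SWACT_DONE, "commit"): Phase.COMMITTED,
}


def run(events, phase=Phase.IN_SYNC):
    """Apply events in order, rejecting any step taken out of sequence."""
    for event in events:
        key = (phase, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event!r} is not allowed in phase {phase.name}")
        phase = TRANSITIONS[key]
    return phase


UP_TO_SWACT = ["drop_sync", "reboot_inactive", "bulk_transfer", "track_state", "swact"]
accepted = run(UP_TO_SWACT + ["commit"])
rejected = run(UP_TO_SWACT + ["revert"])
```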

Application designers need to consider:
  • Data reformats
    Table control can automate some conversions which map onto type promotions. More complex conversions can be performed in user-code callbacks.
  • State transfer
    Essential state can be transferred around SWACT using user-code callbacks.
  • Protocol compatibility
    Newer software versions must support old protocol versions until all parties can deal with newer versions.
  • Upgrade-abort implications
    If the upgrade is reverted, data and states which only exist in the new version must be avoided or dealt with.
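
For the data reformat case, the split between automatic promotions and callbacks might look something like this Python sketch. The promotion table, type names and reformat_row function are assumptions made for illustration, not the Table control API.

```python
# Assumed promotion rules: (old type, new type) -> converter.
PROMOTIONS = {
    ("int8", "int16"): int,
    ("int16", "int32"): int,
}


def reformat_row(row, old_schema, new_schema, callbacks=None):
    """Build a new-schema row from an old-schema row.

    old_schema / new_schema map column name -> type name.
    callbacks maps column name -> function(old_row), for conversions
    that are more than a simple type promotion.
    """
    callbacks = callbacks or {}
    new_row = {}
    for column, new_type in new_schema.items():
        old_type = old_schema.get(column)
        if column in callbacks:
            new_row[column] = callbacks[column](row)       # user code
        elif old_type == new_type:
            new_row[column] = row[column]                  # unchanged
        elif (old_type, new_type) in PROMOTIONS:
            convert = PROMOTIONS[(old_type, new_type)]
            new_row[column] = convert(row[column])         # automatic
        else:
            raise ValueError(f"no conversion for column {column!r}")
    return new_row


old_schema = {"id": "int8", "name": "str"}
new_schema = {"id": "int16", "name": "str", "display": "str"}
converted = reformat_row(
    {"id": 7, "name": "line"},
    old_schema,
    new_schema,
    callbacks={"display": lambda r: f"{r['name']} ({r['id']})"},
)
```

A new column with no callback and no old value raises an error here, which is roughly the situation a designer has to think about up front.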

SOS applications generally support direct upgrade over 3 versions. In a DMS system comprising a number of smaller SOS-based systems, generally the upgrades start at the leaves of the tree of systems (peripherals), and work back towards the Computing Module (CM). This implies that each system must be willing to accept old-version protocol interactions from systems higher in the tree than it, but need not worry about protocol versions for systems lower in the tree (assuming the usual computer-science leaves-at-the-bottom tree layout). Given that individual system complexity increases as you go 'up' the tree to the root, this is a good arrangement.
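
The leaves-first ordering is just a post-order walk of the tree. In the Python sketch below only "CM" comes from the text above; the other node names and the tree shape are invented.

```python
def upgrade_order(tree, root):
    """Post-order walk: every child is upgraded before its parent."""
    order = []

    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


# Hypothetical layout: parent -> children, with the CM at the root.
tree = {
    "CM": ["node_a", "node_b"],
    "node_a": ["periph_1", "periph_2"],
}
order = upgrade_order(tree, "CM")
```

In the resulting order the CM comes last, so at any moment a system's parent is running the same or an older version than it is.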


Online multithreaded breakpoint capable debugger

SOS supports interactive use, and one of the applications available to users is an interactive debugger with breakpoint and tracepoint support. This tool allows a running system to be inspected, and code and data to be modified on the fly. Breakpoints and tracepoints can be made data-conditional and can use thread (process) ids to be thread-conditional. The debugger also has some knowledge of symbols and offsets within modules.
Full debugging access with breakpoints is not usually made available for deployed systems as the risk of accidental damage is too great.
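
As a sketch of what data-conditional and thread-conditional mean here, the Python below models the decision of whether a breakpoint fires. The Breakpoint class, its fields and the module name are hypothetical and have nothing to do with the real debugger's commands.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Breakpoint:
    module: str
    offset: int                                         # offset within the module
    thread_id: Optional[int] = None                     # None means any thread
    condition: Optional[Callable[[dict], bool]] = None  # predicate over inspected data
    trace_only: bool = False                            # tracepoint: log and carry on

    def hits(self, module, offset, thread_id, data):
        if (module, offset) != (self.module, self.offset):
            return False
        if self.thread_id is not None and thread_id != self.thread_id:
            return False
        if self.condition is not None and not self.condition(data):
            return False
        return True


bp = Breakpoint("demo_module", 0x40, thread_id=7,
                condition=lambda data: data["count"] > 3)

fires = bp.hits("demo_module", 0x40, 7, {"count": 5})
wrong_thread = bp.hits("demo_module", 0x40, 8, {"count": 5})
wrong_data = bp.hits("demo_module", 0x40, 7, {"count": 1})
```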

Well, that's been a quick tour of SOS. It's an interesting system with very little documentation outside of Nortel. My own memories of it are fading fast, so please excuse mistakes and the lack of detail here. I don't intend to blog about the system in general any more, but may cover some specific details that are of interest.

(Well, of interest to me, as no one else seems interested so far :) )