Tuesday, 9 December 2008

Hardware Transactional Memory II

In my last entry, I introduced Nortel's XA-Core platform, which I believe was one of the first commercially successful HTM machines. This time I want to talk about the hardware.

Modular Architecture
An XA-Core system comprises several card types, including Processing Elements (PEs), Shared Memory cards (SMs) and IO Processors (IOPs). These components are connected by a 'Gigabit Interconnect' (GI), which in practice is a set of point-to-point optical links with agreeable 'hot pluggable' and optical isolation properties. All of these card types exist in various versions and live in a standardish DMS rack with redundant power and cooling.

Processing Element
The PE card has a number of large chips and a few smaller ones (probably most circuit boards do :)) :
  • Two lockstep PowerPC CPUs, initially PPC603.
  • A 'Hippo' or 'Rhino' chip acting as the CPU -> Memory interface.
  • One custom 'PIGI' chip interfacing the PE with the GI.

The processors run in lockstep with a fairly standard comparator mechanism to check them. From each processor's perspective, the Hippo/Rhino chip looks like main memory. Among other things, the Hippo/Rhino and PIGI chips provide :
  • Lock-step comparison of CPU outputs
  • Mapping of PPC bus requests to GI protocol requests, including transaction identifiers etc.
  • Mapping of 32-bit PE address space to 40-bit Shared Memory Address space.

The PE board additionally has some local memory, referred to as 'Scratch', which applications can use as a non-persistent workspace. Although the PE has lockstepped CPUs, it cannot run with only one CPU functioning. If either CPU fails, the whole PE is isolated. This avoids the need for a post-mismatch fault detection algorithm to decide which of the two CPUs was the faulty one.

The minimum configuration is two PE cards, giving tolerance of one failure. Extra PE cards can be configured to give different n+m fault tolerance configurations.

When PE cards are inserted, they perform a self test and a number of initialisation steps, and then begin executing the SOS scheduler loop, taking work. When a PE card is hot-pulled or fails, any outstanding memory transaction is rolled back. Some other PE can then pick up the aborted work from wherever in shared memory the original PE found it.
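The pick-up-after-failure behaviour can be illustrated with a toy model. This is only a sketch of the idea, not how SOS actually does it: the class `SharedWorkQueue`, its methods and the PE and work-item names are all hypothetical. The point it tries to show is that a PE's claim on a work item only becomes permanent at commit, so a failed PE leaves the item where it found it.

```python
class SharedWorkQueue:
    """Toy model of work held in shared memory (hypothetical, not the SOS design)."""

    def __init__(self, items):
        self.committed = list(items)  # committed queue contents
        self.claimed = {}             # pe_id -> item taken in an open transaction

    def take(self, pe_id):
        """Claim the first item not held by another open transaction."""
        for item in self.committed:
            if item not in self.claimed.values():
                self.claimed[pe_id] = item
                return item
        return None

    def commit(self, pe_id):
        """The work finished: the removal from the queue becomes permanent."""
        item = self.claimed.pop(pe_id)
        self.committed.remove(item)

    def fail(self, pe_id):
        """The PE was pulled or failed: roll back, the item stays queued."""
        self.claimed.pop(pe_id, None)


queue = SharedWorkQueue(["call-1", "call-2"])
queue.take("PE0")          # PE0 starts on call-1
queue.take("PE1")          # PE1 starts on call-2
queue.fail("PE0")          # PE0 dies mid-transaction
queue.commit("PE1")        # PE1 finishes normally
print(queue.take("PE2"))   # PE2 finds call-1 still waiting
```

Nothing in this sketch needs to know why PE0 stopped; the rollback alone is enough to make the work visible again.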

Shared Memory card
The SM cards contain some fairly fast memory accessed via a custom 'SMOAC' (Shared Memory Ownership and Access Controller) chip.

The SMOAC chip maintains the memory ownership information that is necessary to enforce the transactional semantics of memory access. Ownership information is maintained for every 32 bytes (one PPC cache line) of memory in the system, using ownership information sent with cache-line read and write requests from the PE.
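As a rough illustration of per-cache-line ownership, here is a minimal sketch. The 32-byte granularity comes from the description above; everything else is an assumption, including the `OwnershipTable` name, the single-owner rule, and the choice to simply report a conflict rather than block or abort (I have not described what the real SMOAC does on contention, or whether it distinguishes read from write ownership).

```python
CACHE_LINE_BYTES = 32  # PPC cache line size, as stated in the text


class OwnershipTable:
    """Toy model: at most one owning PE per 32-byte line (an assumption)."""

    def __init__(self):
        self.owner = {}  # line index -> owning PE id

    @staticmethod
    def line_of(address):
        """Map a byte address to its cache-line index."""
        return address // CACHE_LINE_BYTES

    def acquire(self, pe_id, address):
        """Return True if pe_id owns the line after the call, False on conflict."""
        line = self.line_of(address)
        current = self.owner.setdefault(line, pe_id)
        return current == pe_id

    def release_all(self, pe_id):
        """Drop every line owned by pe_id, as at the end of a transaction."""
        self.owner = {l: o for l, o in self.owner.items() if o != pe_id}


table = OwnershipTable()
print(table.acquire(1, 0x40))  # True: PE 1 takes the line covering 0x40..0x5F
print(table.acquire(2, 0x50))  # False: same line, already owned by PE 1
```

Two addresses conflict whenever they fall in the same 32-byte line, even if the bytes themselves do not overlap.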

To support rollback of unwanted memory transactions, every cache line is duplicated within an SM card. Every cache line has an Active copy (last committed) and an Update copy (dirty, yet to be committed). This doubles the amount of memory required, although it probably simplifies the hardware design and theoretically allows arbitrarily large transactions.
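The Active/Update scheme can be sketched in a few lines. This is a toy model under stated assumptions, not the SMOAC logic: `DualCopyMemory` and its methods are hypothetical names, a conflicting write simply raises an error, and other PEs are allowed to keep reading the Active copy while an Update copy exists (the real hardware may well treat that as a conflict).

```python
class DualCopyMemory:
    """Toy model of one Active and one Update copy per cache line."""

    def __init__(self):
        self.active = {}  # line -> last committed value
        self.update = {}  # line -> (pe_id, uncommitted value)

    def write(self, pe_id, line, value):
        """Write to the Update copy; the Active copy is left untouched."""
        holder = self.update.get(line)
        if holder is not None and holder[0] != pe_id:
            raise RuntimeError("line is being updated by another PE")
        self.update[line] = (pe_id, value)

    def read(self, pe_id, line):
        """A PE sees its own dirty data; everyone else sees committed data."""
        holder = self.update.get(line)
        if holder is not None and holder[0] == pe_id:
            return holder[1]
        return self.active.get(line)

    def commit(self, pe_id):
        """Promote this PE's Update copies to Active."""
        for line in [l for l, (o, _) in self.update.items() if o == pe_id]:
            self.active[line] = self.update.pop(line)[1]

    def rollback(self, pe_id):
        """Discard this PE's Update copies; Active copies are unchanged."""
        for line in [l for l, (o, _) in self.update.items() if o == pe_id]:
            del self.update[line]


mem = DualCopyMemory()
mem.write(1, 7, "old")
mem.commit(1)
mem.write(1, 7, "new")   # dirty, not yet committed
mem.rollback(1)          # e.g. the PE was hot-pulled
print(mem.read(2, 7))    # old
```

Because every line has room for a dirty copy, nothing in this model limits how many lines one transaction can touch, which is the 'arbitrarily large transactions' property mentioned above.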

Logical memory is mapped onto the SM cards in 32 MByte blocks. The normal configuration is that every block is mapped onto at least two SM cards, and sometimes three. This allows for one or two SM card failures to be tolerated. Combined with the two copies of memory required for the transaction mechanism, this means that each byte of logical shared memory requires four to six bytes of physical memory (two copies on each of two or three cards).
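The arithmetic is simple enough to write down. The block size, the two transactional copies and the two-or-three-way replication come from the description above; the function names are mine, and rounding a request up to whole 32 MByte blocks is an assumption about how provisioning would work.

```python
BLOCK_BYTES = 32 * 1024 * 1024  # logical memory is mapped in 32 MByte blocks
COPIES_PER_LINE = 2             # Active + Update copy of every cache line


def overhead_factor(replicas):
    """Physical bytes needed per logical byte for a given replication level."""
    return COPIES_PER_LINE * replicas


def physical_bytes(logical_bytes, replicas):
    """Physical memory for a logical size, rounded up to whole blocks (assumed)."""
    blocks = -(-logical_bytes // BLOCK_BYTES)  # ceiling division
    return blocks * BLOCK_BYTES * overhead_factor(replicas)


one_gib = 1024 ** 3
print(overhead_factor(2), overhead_factor(3))   # 4 6
print(physical_bytes(one_gib, 2) // one_gib)    # 4
print(physical_bytes(one_gib, 3) // one_gib)    # 6
```

So 1 GByte of logical shared memory costs 4 GBytes of physical memory when duplicated, and 6 GBytes when triplicated.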

When SM cards are inserted, SOS decides which blocks should be copied to the new card, and begins a background task to copy the blocks across.

Input/Output Processor
The IO processor cards are used to connect the XA-Core to the outside world, including terminals, disk + tape and the rest of the DMS system. The cards themselves contain single PPC CPUs (no lockstepped redundancy here) and ASICs to interface to the GI and provide some DMA capability. IOPs are deployed in pairs so that no IO facility is completely lost due to an IOP failure.

Weird / Cool things
  • Fault tolerant, single system image, shared memory multiprocessing
    I don't think many examples exist where a single-system-image SMP can handle an arbitrary processor failure without a crash.
  • Existing correct code runs correctly in parallel
    But may be serialised due to contention on shared memory access
  • Even IO is transactional
    This puts pressure on IO latencies, and requires good batching to minimise IO overheads.
  • Easier identification of transient CPU faults
    When a mismatch is detected within a PE, the failing operation can be safely rolled back and retried on the same PE multiple times. This can be used to help distinguish hard faults from transient / temporal faults.
  • Online System split is possible for upgrade
    To support online software and data upgrade, the system can de-duplicate memory, assign a PE to the 'other' half of the memory and boot it from a system image on disk. This gives 2 systems running on one machine. At cutover time, most PEs and IOPs are quickly migrated from the old to the new side. Eventually memory can be re-duplicated from the new half.
  • No 'standard-SMP' cache-coherency glue logic required
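The rollback-and-retry idea from the 'Easier identification of transient CPU faults' item can be sketched as below. This is a hypothetical illustration only: `classify_fault`, the `run_lockstep` callback and the retry limit of three are all assumptions, not details of the real diagnostic code.

```python
def classify_fault(run_lockstep, max_retries=3):
    """Retry a rolled-back operation after a lockstep mismatch.

    run_lockstep() is a hypothetical callback that re-runs the operation
    and returns True when both CPUs agreed. The retry limit is an assumption.
    Returns a (verdict, attempts_used) pair.
    """
    for attempt in range(1, max_retries + 1):
        if run_lockstep():
            # The operation eventually succeeded on the same PE.
            return ("transient", attempt)
    # Every retry mismatched again: treat the PE as having a hard fault.
    return ("hard", max_retries)


outcomes = iter([False, True])  # mismatch again, then a clean run
print(classify_fault(lambda: next(outcomes)))   # ('transient', 2)
print(classify_fault(lambda: False))            # ('hard', 3)
```

The retries are only safe because the transaction mechanism guarantees the failed attempt left no trace in shared memory.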

Not so cool things
  • Four to Six times memory hardware overhead
    Perhaps some trade-off between maximum transaction size and hardware complexity could have been made?
  • Expensive memory required to contain latency
  • Large, complex custom ASICs required
    Pushes out time-to-market, reduces time-in-market for modifications. Expensive.
  • OS cooperation required to assist with transaction demarcation, ensuring forward progress, IO handling, bringing PEs, IOPs and SMs on + offline.
    Requires cooperation from OS owners.

So that's my tour of the XA-Core hardware. Please comment if you have corrections or further questions; I may be able to dredge up some more details.
Next time I'll talk about some of the modifications made to the SOS operating system to make it run on this platform.
