Tuesday, 28 April 2009

Latency hiding patterns in CPUs

Latency is a major factor in CPU design. Most general purpose CPUs execute a sequence of instructions with the logical convention that one instruction completes before the next starts. However memory access latency and even the latency of pure computation bottleneck the achievable processing throughput. To circumvent this a number of techniques are used to parallelise computation and communication.

Conductors have capacitance and resistance and therefore take time to switch between voltage levels. This manifests as a propagation delay proportional to the length of the conductor. As CPU clock speeds increase, the time between each rising clock edge reduces and the maximum length of conductor that can be charged in that time shrinks. CPUs and other data-path designs must be designed to ensure that no signal propagation path is near the tolerance for propagation delay at the designed maximum clock speed.

In practice this means that there is a trade-off between clock frequency and circuit size. Large circuits can accomplish more per clock cycle, but cannot be clocked as high as smaller circuits. To continue increasing clock rates with a fixed feature size, a circuit must be broken up into smaller and smaller sub-circuits with synchronous boundaries. Even this is not enough in some cases as the clock signals themselves are skewed by the propagation latencies across the chip. In some cases this is solved by having islands of synchronous logic which communicate asynchronously as no synchronous clock can span the whole circuit.

So even within a chip, there can be a large, and growing difference between local and remote communication latency. Ignoring this simplifies designs but forces lowest common denominator performance. This trade-off between complexity and latency awareness and tolerance is repeated at many layers of the system stack.

Even within a synchronous logic island on a chip, there is latency to be hidden. Instruction fetch, decode and execute takes time and pipelining used in CPU designs to maximise throughput despite irreducible latency. The various steps of instruction decode, register selection, ALU activation etc. are split out with latches between so that the maximum propagation delay in any stage of the pipleine can be minimised. For example, a datapath with a 400 gate worst-case propagation delay could theoretically be split into 4 stages, each with a 100 gate worst case propagation delay, allowing a 4 times faster clock speed.

With pipelining in a CPU, the first instruction to be processed takes at least the same amount of time to be completed, but after that, a new instruction can be completed on every clock cycle - theoretically allowing 4 times as many instructions to be completed per unit time. In practice memory latency and bandwidth limits and dependencies between instructions mean that the pipeline can stall, or bubbles can form, reducing the achieved throughput. However, these problems can often be alleviated and pipelines have been very successful in CPU design with instruction fetch-execute-retire pipelines comprising as many as 40 or more stages with multiple instances of stages in super-scalar designs

So pipelining in general allows repetitive serial task processing to be parallelised with potential benefits :
  • Increased parallelism between instances of serial tasks
    Allowing greater throughput
  • Benefits of task-specificity (instruction and data caching benefits)
    Potentially improving efficiency

and the following caveats :
  • Dependencies between tasks must be understood and honoured
    Sometimes this reduces or eliminates parallelism
  • Pipeline stalls or bubbles due to job starvation or dependencies will reduce throughput
    The pipeline must be kept fed.
  • Achievable throughput is bottlenecked by the slowest stage
    As with a production line, the slowest worker sets the pace.
    As with a production line, slow workers can be duplicated to improve throughput.
  • Individual task latency can increase

Pipelining as a general technique can apply to any system that processes tasks of a regular form which can be split into stages. The SEDA system from Harvard is a general purpose server framework for implementing server processes as pipelined stages. The software setting allows more flexible and dynamic tradeoffs to be made between pipeline lengths and widths. It also offers a flexible way to separate asynchronous and synchronous steps.

Other interesting latency hiding techniques seen in CPUs are mostly associated with hiding memory access latency.

No comments: