
Page 436
Figure 7.12. Effect of vector chaining.
Figure 7.13. Vector chaining path.
and contention. While memory can be designed to accommodate such reference patterns, the complexity of managing both scalar and vector accesses without excessive contention becomes prohibitive. Thus, most modern vector processors execute in load/store fashion from local vector registers. The vector register set generally consists of eight registers, each containing from 16 to 64 entries, and each entry generally holds a 64-bit floating-point word. The arithmetic pipeline, or arithmetic portion, of the vector processor may be shared (at least in part) with the scalar or base portion of the processor. With current technology it is relatively straightforward to design floating-point add and floating-point multiply units that sustain one operation per cycle and still maintain a relatively low overall execution time (2 to 4 cycles). Divide is an exceptional case, for which the vector processor may or may not deliver one quotient per cycle. Cray machines use a multiplicative-based division, or reciprocal, operation that provides an approximation to the reciprocal of a number. This reciprocal can then be multiplied by a numerator to produce an approximation to the quotient. The Cray system's reciprocal operation is supported on an operation-per-cycle basis.
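The reciprocal-then-multiply approach to division can be illustrated with a short sketch. The text does not say how the Cray hardware forms its reciprocal approximation, so the Newton-Raphson refinement, the linear seed, and the names approx_reciprocal and vector_divide below are assumptions chosen for illustration only. The point is simply that a quotient can be produced with multiplies and subtracts, which pipeline far more readily than a conventional divider.

```python
import math


def approx_reciprocal(d: float, iterations: int = 4) -> float:
    """Approximate 1/d using only multiplies and subtracts.

    Illustrative Newton-Raphson scheme (an assumption, not the Cray design).
    """
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    # Split |d| into mantissa * 2**exponent with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(abs(d))
    # Linear seed for 1/mantissa on [0.5, 1); relative error is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * mantissa
    for _ in range(iterations):
        # Newton-Raphson step: the relative error is squared each time.
        x = x * (2.0 - mantissa * x)
    # Undo the scaling: 1/|d| = (1/mantissa) * 2**(-exponent).
    recip = math.ldexp(x, -exponent)
    return -recip if d < 0 else recip


def vector_divide(numerators, denominators):
    """Element-wise quotient formed as numerator * approx_reciprocal(denominator)."""
    return [n * approx_reciprocal(d) for n, d in zip(numerators, denominators)]


if __name__ == "__main__":
    print(vector_divide([1.0, 6.0, -9.0], [4.0, 3.0, 2.0]))
```

Because the quotient is the product of the numerator and an approximate reciprocal, it can differ from a correctly rounded division in the last bit or so, which matches the text's description of the result as an approximation to the quotient.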
Under some conditions, it is possible to execute more than one vector arithmetic operation per cycle. The results of one vector arithmetic operation can be directly used as an operand in subsequent vector instructions without first passing into a vector register. Such an operation, shown in Figures

 