what technique is used. Looking at the reported speedups and excluding two studies that place no reasonable limits on hardware, there seems to be generally good agreement on the potential for multiple instruction issue in the range of 1.5 to 3.
Acosta's results [2] are useful for our evaluation, since he uses multiple execution unit models and compares several different approaches to speedup. Figure 7.52 shows the parallelism that is realizable for various control/dispatch strategies for an ideal processor. The ideal processor executes all operations in one cycle and is not limited in any way by branches. In the Acosta model, all parallelism detected is within basic blocks. Several processor combinations are studied; Figure 7.53 shows the speedup attainable for an ideal processor for varying amounts of instruction dispatch (instruction window size). The system is limited to about 2.75 instructions per issuing cycle when between 12 and 16 instructions are inspected (a window size of 12 to 16). Any advantage beyond this is minimal. When a more realistic processor model is created (non-unit-time execution), various processor control strategies can be studied (Figure 7.53). In this processor model, there are two fixed-point adders, two fixed-point multipliers, two floating-point adders, and two floating-point multipliers. The fixed-point operations execute in one cycle; for the floating-point operations, the add executes in two cycles and the multiply executes in three cycles. Delays result primarily from data dependencies among the operands. The following control models are considered:
1. Simple Control and In-Order Execution: This is the basic model on which many of our earlier pipelined processor examples are based.
2. Out-of-Order Execution: This control strategy, as the name implies, allows instructions to be dispatched so long as they do not depend on any instruction already in execution. Only one instruction is inspected per cycle, and it is either executed or delayed.
3. Windowed Out-of-Order Execution: Here, an instruction window of size N is examined, and at most M instructions are dispatched when found to be independent. These are labeled M = 1, M = 2, etc., for the number of instructions that can be dispatched per cycle. The case M = 1 corresponds to "out-of-order decode and out-of-order execute," limited to issuance of a single instruction each cycle.
For the hardware model proposed, Figure 7.52 plots the throughput in instructions per cycle for the various control models against window size, and Figure 7.53 evaluates the same control strategies for a processor with pipelined execution units [302]. This processor has only one integer adder and one integer multiplier, and one floating-point adder and one floating-point multiplier, but each of these units is capable of pipelining its operations at the rate of one pair per cycle. The execution times are increased so that the integer units execute in two cycles, a floating-point add in three cycles, and the floating-point multiply in four cycles.
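The windowed dispatch model can be made concrete with a small simulation. The following is a minimal sketch, not Acosta's simulator: it assumes the non-pipelined unit counts and latencies given above (two units of each type; fixed-point in one cycle, floating-point add in two, floating-point multiply in three), it considers only true (read-after-write) dependencies, and it ignores branches. The names `Instr`, `simulate`, `LATENCY`, `UNITS`, and the four-instruction `program` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass

# Assumed from the non-pipelined model in the text: latency in cycles and
# number of units per operation type.
LATENCY = {"iadd": 1, "imul": 1, "fadd": 2, "fmul": 3}
UNITS = {"iadd": 2, "imul": 2, "fadd": 2, "fmul": 2}


@dataclass
class Instr:
    op: str       # one of the keys of LATENCY
    dest: str     # register written
    srcs: tuple   # registers read


def simulate(program, window, max_dispatch):
    """Return the cycle at which the last result of `program` is available.

    Each cycle, the first `window` (N) undispatched instructions are
    inspected and at most `max_dispatch` (M) of them are dispatched.
    Simplifying assumption: only read-after-write dependencies are tracked.
    """
    # For each instruction, find the earlier instructions whose results it reads.
    last_writer = {}
    producers = []
    for i, ins in enumerate(program):
        producers.append([last_writer[r] for r in ins.srcs if r in last_writer])
        last_writer[ins.dest] = i

    done = {}  # instruction index -> cycle at which its result is available
    unit_free = {op: [0] * n for op, n in UNITS.items()}
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        issued = []
        for i in pending[:window]:
            if len(issued) == max_dispatch:
                break
            op = program[i].op
            operands_ready = all(
                p in done and done[p] <= cycle for p in producers[i]
            )
            free = [u for u, t in enumerate(unit_free[op]) if t <= cycle]
            if operands_ready and free:
                finish = cycle + LATENCY[op]
                unit_free[op][free[0]] = finish  # unit is busy until `finish`
                done[i] = finish
                issued.append(i)
        pending = [i for i in pending if i not in issued]
        cycle += 1
    return max(done.values(), default=0)


# Hypothetical four-instruction basic block with two dependency chains.
program = [
    Instr("fmul", "f1", ("a", "b")),
    Instr("fadd", "f2", ("f1", "c")),   # reads f1
    Instr("iadd", "r1", ("x", "y")),
    Instr("iadd", "r2", ("r1", "z")),   # reads r1
]

if __name__ == "__main__":
    for n, m in [(1, 1), (4, 1), (4, 2)]:
        cycles = simulate(program, window=n, max_dispatch=m)
        print(f"N={n} M={m}: {cycles} cycles, {len(program) / cycles:.2f} instr/cycle")
```

With a window of one, the stalled floating-point add blocks the independent integer adds behind it; a wider window lets them dispatch while the multiply is still executing. On a block this short the longest dependency chain dominates, so raising M beyond one gives no further gain here.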
A number of general conclusions can be developed:
1. Instruction concurrency can potentially be realized up to the point where two and one-half to three instructions are issued per cycle.

 