
Page 471
Initially, the DIV.F is issued to the divide unit, and the divide unit fetches the two source register values. Presumably, the divide takes a number of cycles to execute. In the cycle following the divide issue, the MPY.F is decoded and issued to the multiplier reservation station. The tag for R4 is placed in the reservation station for one source operand, while the divide unit tag is placed in the other operand identifier slot. When the divide unit completes, it will, with permission of the scoreboard, broadcast its result, and this result will be detected by both R3 and the multiplier reservation station. Since the multiplier gets this operand directly from the divide unit, there is no need to put an "MPY" read tag in the R3 scoreboard entry. In the next cycle, the ADD.F is decoded and issued to the add reservation station. Since its source values are independent of other computations in process, the ADD.F could begin immediately; however, the scoreboard detects the ordering dependency mentioned earlier, based on R4, and the ADD.F is delayed until after the R4 value is available to the multiplier unit. This ensures that the ADD.F does not overwrite R4 before it is used by the multiplier. The scoreboard prevents the ADD.F from executing because of the MPY.F tag in the R4 read field and the read (R) precedence entry in the scoreboard. When the MPY.F reads R4 (i.e., begins), its tag is removed (set to ∅) and the write (W) precedence is set. This causes the adder to be notified to begin operation. On the next cycle, it begins by reading R6 and R7, setting their scoreboard tags to empty (∅). In the simple scoreboard shown, if an instruction following the ADD.F tried to read R4, it could not do so because of the pending write. In this case, the issuing stage "freezes," since there is no room in the scoreboard to enter another operation (at least, until the MPY.F read is complete).
A more complex scoreboard would schedule the ADD.F to begin a number of cycles earlier than the multiplier's use of R4. This early schedule corresponds to the delay in the adder itself.
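The DIV.F, MPY.F, ADD.F sequence traced above can be sketched in a few lines of Python. This is only an illustrative model, not the scoreboard of the figure: the class and method names (`Scoreboard`, `issue`, `can_start`, `start`, `complete`) are hypothetical, the divide sources (R1, R2) and the multiply destination (R5) are assumed because the text does not name them, and neither cycle timing nor the issue-stage freeze is modeled. Each register carries one write tag and one read tag, and an operation may start only when the operands it awaits have been broadcast and no other unit still has to read its destination.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

EMPTY = None  # stands in for the empty tag


@dataclass
class Pending:
    dest: str
    sources: List[str]
    awaited: Set[str] = field(default_factory=set)  # producers not yet broadcast


class Scoreboard:
    """Hypothetical model: one write tag and one read tag per register."""

    def __init__(self, registers):
        self.write_tag: Dict[str, Optional[str]] = {r: EMPTY for r in registers}
        self.read_tag: Dict[str, Optional[str]] = {r: EMPTY for r in registers}
        self.pending: Dict[str, Pending] = {}

    def issue(self, unit, dest, sources):
        op = Pending(dest, list(sources))
        for src in sources:
            producer = self.write_tag[src]
            if producer is EMPTY:
                self.read_tag[src] = unit   # operand will be read from the register
            else:
                op.awaited.add(producer)    # operand will be caught from the broadcast
        self.write_tag[dest] = unit
        self.pending[unit] = op

    def can_start(self, unit):
        op = self.pending[unit]
        reader = self.read_tag[op.dest]
        # The destination must not be awaiting a read by some other unit.
        no_pending_read = reader is EMPTY or reader == unit
        return no_pending_read and not op.awaited

    def start(self, unit):
        if not self.can_start(unit):
            raise RuntimeError(f"{unit} must wait")
        for src in self.pending[unit].sources:
            if self.read_tag[src] == unit:
                self.read_tag[src] = EMPTY  # operand read; tag set to empty

    def complete(self, unit):
        op = self.pending.pop(unit)
        if self.write_tag[op.dest] == unit:
            self.write_tag[op.dest] = EMPTY
        for other in self.pending.values():
            other.awaited.discard(unit)     # result broadcast to waiting units


sb = Scoreboard([f"R{i}" for i in range(1, 8)])
sb.issue("DIV", "R3", ["R1", "R2"])         # R1, R2 are assumed sources
sb.start("DIV")
sb.issue("MPY", "R5", ["R3", "R4"])         # R5 is an assumed destination
sb.issue("ADD", "R4", ["R6", "R7"])
add_blocked = not sb.can_start("ADD")       # MPY has not yet read R4
mpy_blocked = not sb.can_start("MPY")       # MPY still awaits the divide result
sb.complete("DIV")
sb.start("MPY")                             # reads R4, clearing its read tag
add_ready = sb.can_start("ADD")
sb.start("ADD")                             # reads R6 and R7
```

In this sketch the ADD.F is held back solely by the MPY read tag on R4; once the multiplier starts and clears that tag, the adder is free to read R6 and R7.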
Only one copy of the register value is kept, and that is in the register set. Since each unit must fetch its operands from a common register set and return its results to the same register set, contention is possible. Multiple buses can be used to reduce contention on register reads. Since results are broadcast to all units, contention among functional unit results must be arbitrated by a priority scheme to ensure proper operation when multiple units compete for the same bus in the same cycle.
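One simple form such a priority scheme could take is a fixed-priority arbiter, sketched below. The text does not say which ordering is used, so the priority list here (divide, then multiply, then add) and the function names `arbitrate` and `drain` are assumptions made purely for illustration.

```python
def arbitrate(requests, priority):
    """Return the unit granted the result bus this cycle, or None if no one asks."""
    for unit in priority:
        if unit in requests:
            return unit
    return None


def drain(requests, priority):
    """Grant one broadcast per cycle until every requesting unit has been served."""
    pending = set(requests)
    order = []
    while pending:
        winner = arbitrate(pending, priority)
        if winner is None:
            raise ValueError("a requesting unit is missing from the priority list")
        order.append(winner)
        pending.remove(winner)
    return order


# Assumed ordering: the divide unit outranks the multiplier, which outranks the adder.
PRIORITY = ["DIV", "MPY", "ADD"]
grants = drain({"ADD", "MPY"}, PRIORITY)
```

With the adder and the multiplier both requesting the bus in the same cycle, the multiplier broadcasts first and the adder is held for one cycle.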
Dataflow
The dataflow approach is an alternative first suggested by Tomasulo [288]. Again, each register in the central register set is extended to include a tag that identifies the functional unit that will produce the result to be placed in that register. Similarly, each of the multiple functional units has one or more reservation stations. The reservation station, however, can contain either a tag identifying another functional unit or register, or the operand value itself. The centralized scoreboard is replaced with distributed control within the functional units. Each reservation station effectively defines its own functional unit; thus, two reservation stations for a floating-point multiplier are two functional unit tags: multiplier 1 and multiplier 2 (Figure 7.38). If operands can go directly into the multiplier, then there is another tag: multiplier 3. Once a pair of operands has a designated functional unit tag, that tag remains with that operand pair until completion of

 