|
|
|
|
|
|
for either path is equal to the IF width. If we decode one instruction per cycle, the number of in-line buffer words (BF) is:

    BF = ⌈ IF access time (cycles) / (instructions/IF) ⌉

If we have s instructions decoded each cycle, then:

    BF = ⌈ s × IF access time (cycles) / (instructions/IF) ⌉

Here, the IF access time (cycles) is simply the number of cycles required for an IF, and instructions/IF is the average number of instructions accessed by an IF. These BF registers should be viewed as a minimum; in many cases it is useful to have a larger buffer, which allows normal processing to continue when an IF is delayed by cache access contention. The alternate path buffer should be the same size as the primary path buffer if target branch prediction is used. If we use a strict "guess in-line" policy, then we need only one entry (for the TIF) in the alternate path buffer.
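The buffer sizing described above can be sketched in a few lines of code. This is a minimal illustration, assuming the sizing rule BF = ⌈s × IF access time / (instructions/IF)⌉; the function name and example parameter values are hypothetical, not from the text.

```python
import math

def min_buffer_words(s, if_access_cycles, instr_per_if):
    """Minimum in-line buffer size in IF-width words: the buffer must
    hold enough words to supply the s instructions decoded per cycle
    for the duration of one IF access."""
    return math.ceil(s * if_access_cycles / instr_per_if)

# 1 decode/cycle, a 3-cycle IF, 2 instructions per IF -> 2 buffer words
print(min_buffer_words(1, 3, 2))  # -> 2
# decoding 2 instructions per cycle raises the requirement -> 3 words
print(min_buffer_words(2, 3, 2))  # -> 3
```

As the text notes, this is only a minimum; a larger buffer tolerates IFs delayed by cache contention.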
|
|
|
|
|
|
|
|
The average instruction decode delay decreases as the average execution sequence length increases, because the unmaskable delay experienced on the fetch of the first word is amortized over more instruction decodes. The average instruction decode delay is also reduced by increasing the degree of prefetch; however, this increases memory traffic, since more unnecessary words are fetched. The two criteria by which instruction prefetch techniques can be evaluated are: (1) the average delay per instruction decoded and (2) the average number of words fetched from memory per instruction decoded. In our simple example, 18 instructions were decoded in 30 cycles; thus, there is an instruction buffer runout delay of (30 − 18)/18 = 0.667 cycles per instruction decoded. Since the example does not contain any taken branches, no fetches from memory are wasted, and the number of instruction words fetched equals the number of instructions decoded. When taken branches are present, this is obviously not the case.
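The two evaluation criteria can be checked against the example's figures with a short calculation (a minimal sketch; the 18-instruction, 30-cycle numbers come from the example above):

```python
# Figures from the running example: 18 instructions decoded in 30
# cycles, with no taken branches.
instructions_decoded = 18
total_cycles = 30

# Criterion 1: average runout delay per instruction decoded
# (cycles not overlapped with a decode, spread over all decodes).
delay_per_instruction = (total_cycles - instructions_decoded) / instructions_decoded
print(round(delay_per_instruction, 3))  # -> 0.667

# Criterion 2: words fetched per instruction decoded; with no taken
# branches, every fetched word is eventually decoded.
words_fetched = instructions_decoded
print(words_fetched / instructions_decoded)  # -> 1.0
```

With taken branches, words on the discarded path would raise the second metric above 1.0 without increasing the number of instructions decoded.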
|
|
|
|
|
|
|
|
By using a sufficiently high degree of instruction prefetch, it is possible to mask the memory access time for all instruction fetches except the first one of an instruction execution sequence (although this may come at the expense of a substantial increase in memory traffic). Any further reduction in the average instruction access time must therefore concentrate on that first fetch, which contains the target of the branch instruction.
|
|
|
|
|
|
|
|
As we have seen, branches can be a major limitation to pipeline processor performance [72, 264]. There are four major approaches to the branch problem: |
|
|
|
|
|
|
|
|
1. Branch elimination. For certain code sequences, we can replace the branch with another conditional operation. |
|
|
|
|
|
|
|
|
2. Branch speedup. We can reduce the time required for target instruction fetch and condition code (CC) determination (Figure 4.18).
|
|
|
|
|