
Page 451
Table 7.3 The effects of vector bypass on performance.

                                        No Bypass           Limited Bypass       Full Bypass
Buffer size and complexity              16 entries, simple  32 entries, complex  >64 entries, complex
g                                       0                   0.5                  1.375
Total cycles for single VLD (64 reg),   104                 90                   86.5
  m = 16, n = 12
Performance relative to ideal vector    0.68                0.78                 0.81
  ALU with 6 cycles startup
  (70 cycles total)

Suppose we increase the buffer size significantly so that we achieve n = B (no contention). Then we have:
0451-01.gif
Note that we reduced the total time for the load only from 90 cycles to 86.5 cycles! With no bypassing at all, we would have:
0451-02.gif
or a total time of 103 cycles.
The buffer needed to support this case is simple: one entry per module (16 total), each holding only the deferred address and source ID.
From Table 7.3, it might appear that the "best" vector performance achievable would be in the range of 0.68 to 0.81 times maximum performance, given the memory system limitations.
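As a quick check on the last row of Table 7.3, each relative performance figure is the ideal 70-cycle time divided by the total cycles for that case. The sketch below is in Python, chosen only for illustration, and the name `relative_performance` is hypothetical. It uses the 103-cycle no-bypass total quoted in the text; with the table's 104 cycles the ratio would round to 0.67 rather than 0.68.

```python
# Ideal vector ALU time: 6 cycles startup + 64 elements (70 cycles, per Table 7.3).
IDEAL_CYCLES = 70.0


def relative_performance(total_cycles: float) -> float:
    """Ratio of the ideal 70-cycle time to the actual VLD time."""
    return IDEAL_CYCLES / total_cycles


# Total cycles for a single 64-element VLD in each bypass case.
# The no-bypass value is the 103 cycles quoted in the text (assumption:
# this, not the table's 104, is the figure behind the 0.68 entry).
totals = {"no bypass": 103.0, "limited bypass": 90.0, "full bypass": 86.5}

ratios = {case: round(relative_performance(t), 2) for case, t in totals.items()}

for case, ratio in ratios.items():
    print(f"{case:15s} {ratio:.2f}")
```

The spread between the limited-bypass and full-bypass ratios is small, which is consistent with the observation that the larger buffer only reduces the load from 90 to 86.5 cycles.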
Our analysis in this example is based on two simultaneous VLD or VST operations. Insofar as some applications may not require two simultaneous memory accesses, this is a pessimistic performance projection. For example, for one access we would have (at g = 0):
B(6, 16, g = 0) = 5.06
and
0451-03.gif
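The single-access bandwidth figure can be checked numerically. The sketch below assumes the closed-form expression B(m, n) = m + n - 1/2 - sqrt((m + n - 1/2)^2 - 2mn) for the g = 0 case. This page does not restate the formula, so treat it as an assumption, although it does reproduce the 5.06 quoted above. The function name `bandwidth_no_bypass` is hypothetical, and Python is used only for illustration.

```python
import math


def bandwidth_no_bypass(m: float, n: float) -> float:
    """Achieved bandwidth (requests serviced per memory cycle) at g = 0.

    Assumed model: B = m + n - 1/2 - sqrt((m + n - 1/2)^2 - 2*m*n),
    where m is the number of memory modules and n the number of
    requests offered per memory cycle.
    """
    s = m + n - 0.5
    return s - math.sqrt(s * s - 2.0 * m * n)


# One vector access: 16 modules, 6 requests offered per memory cycle.
b_single = bandwidth_no_bypass(16, 6)
print(f"B = {b_single:.2f}")
```

The expression is symmetric in its two arguments, so the order in which the text writes them, B(6, 16), does not affect the result. The value stays below the offered rate of 6, which reflects the remaining module contention at g = 0.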
Here again, any request bypassing would improve performance. An obvious solution is to provide more vector registers, so that the VLDs would not be used in the next vector iteration. The problem here is the expense of (area occupied by) these additional registers.
The preceding discussion is based on the assumption that we are not bypassing memory operations between vector instructions. That is, while we

 