
Study 5.4 Assessing Contention in an Integrated (vs. Split) Cache
Suppose we had selected an integrated cache in study 5.3. Now let us compute the delay at the cache access controller due to contention between instruction and data requests.
In solving this problem, we must first recognize that not all of the traffic per instruction determined in study 5.3 causes contention. Contention can arise only when both the I-buffer and the data AG are making requests to the cache access controller. In particular, contention does not arise when "run-on" instructions are making excess data requests (more than one) while the I-buffer is idle, since the decoder issues a "run-on" interlock.
Initial Analysis
Contention arises when both the I-buffer (IF) and the data fetch unit (DF) are active. The probability that the I-buffer makes an instruction request to the cache in a given (noninterlocked) cycle is 0.73 (from study 5.3). The probability that the DF will make a request during a noninterlocked (or non-run-on) period is:
Prob(DF) = (D-reads + D-writes - excess LM/STM traffic, i.e., the run-on traffic) / (R/M instruction count),
or, using Table 3.15:
[Equation image not recoverable; it evaluates the expression above from the Table 3.15 counts, giving Prob(DF) = 0.53.]
Here we determine the reads and writes from the L/S (load/store) data in Table 3.15, but divide by the R/M (register-memory) instruction count. Register-set architectures with about the same number of registers should create about the same number of reads and writes per HLL operation; this is then adjusted by the instruction count per HLL operation.
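To make this bookkeeping concrete, the following is a minimal Python sketch of the calculation. The per-HLL-operation counts are hypothetical placeholders, not the actual Table 3.15 entries; only the shape of the formula comes from the text, and the placeholders are chosen so that the result matches the 0.53 used below.

    # Sketch of the D-request probability computation in study 5.4.
    # CAUTION: these counts are hypothetical placeholders, not the actual
    # Table 3.15 entries; only the formula's shape comes from the text.

    def data_request_prob(d_reads, d_writes, excess_lm_stm, rm_instr_per_hll):
        # Reads plus writes per HLL operation, less the run-on (excess
        # LM/STM) traffic, normalized by the R/M instruction count per
        # HLL operation.
        return (d_reads + d_writes - excess_lm_stm) / rm_instr_per_hll

    # Placeholder values chosen so the result matches the text's 0.53.
    prob_df = data_request_prob(d_reads=0.55, d_writes=0.30,
                                excess_lm_stm=0.10, rm_instr_per_hll=1.42)
    print(round(prob_df, 2))  # 0.53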
The model assumes that every read or write, except those for branches, creates a reference to the cache, and that the first LM or STM data reference occurs while the I-buffer is active.
Thus, if all contention caused delay, we would have contention of:
Prob(IF) * Prob(DF or DS) = (0.73) * (0.53) = 0.39.
We would thus add 39% additional cycles to manage this contention. Not all of these cycles need cause execution delay, since either the in-line I-buffer or the store buffer can be delayed a cycle without affecting performance. However, even with a well-buffered processor and well-managed cache access priority system, we still require:
0.73 + 0.53 = 1.26 cache access cycles/instruction.
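As a quick check of the arithmetic above, here is a minimal sketch using only the two per-cycle request probabilities quoted in the text (0.73 and 0.53); nothing else is assumed.

    # Initial-analysis arithmetic; both probabilities come directly from
    # the text: Prob(IF) = 0.73, Prob(DF or DS) = 0.53.
    prob_if = 0.73  # I-buffer requests the cache in a given cycle
    prob_ds = 0.53  # data fetch/store requests the cache in a given cycle

    contention = prob_if * prob_ds      # both units request the same cycle
    access_cycles = prob_if + prob_ds   # lower bound on access cycles/instr

    print(round(contention, 2))     # 0.39
    print(round(access_cycles, 2))  # 1.26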
Improved Analysis
This "lower bound" on contention ignores an important factormany of the I-requests arise from branches that occur during "extra" branch resolution cycles. From study 4.3, decoding and branch effects result in processor performance of 2.025 ~/instruction. In other words, for every 100 instructions, we have 202.5 cycles of expected execution (assuming that

 