|
|
|
|
|
|
Study 5.4 Assessing Contention in an Integrated (vs. Split) Cache |
|
|
|
|
|
|
|
|
Suppose we had selected an integrated cache in study 5.3. Now let us compute the delay at the cache access controller that arises from contention between instruction and data references.
|
|
|
|
|
|
|
|
In solving this problem, we must first recognize that not all of the traffic per instruction determined in study 5.3 causes contention. In particular, contention can arise only when both the I-buffer and the data AG are making requests to the cache access controller. Specifically, this does not happen when "run-on" instructions are making excess data requests (more than one), since the I-buffer goes idle under the "run-on" interlock issued by the decoder.
|
|
|
|
|
|
|
|
Contention arises when both the I-buffer (IF) and the DF are active. The probability that the I-buffer makes a request to the cache in a given (noninterlocked) cycle is 0.80 (from study 5.3). The probability that the DF will make a request during a non-interlocked (non-run-on) period is:
|
|
|
 |
|
|
|
|
Prob(DF or DS) = D-reads + D-writes - excess LM/STM (run-on) traffic = 0.53.
|
|
|
|
|
|
|
|
Here we determine reads and writes from the L/S data (Table 3.15), but divide by the R/M instruction count. All register-set architectures with about the same number of registers should create about the same number of reads and writes per HLL operation. This is then adjusted by the instruction count per HLL operation.
|
|
|
|
|
|
|
|
The model assumes that all reads or writes create a reference to cache except the branch, and that the first LM or STM data reference occurs while the I-buffer is active. |
|
|
|
|
|
|
|
|
Thus, if all contention caused delay, we would have contention of: |
|
|
|
 |
|
|
|
|
Prob(IF) * Prob(DF or DS) = (0.73) * (0.53) = 0.39. |
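As a quick check on the arithmetic, the product above can be reproduced in a few lines (a Python sketch; the variable names are ours, and the probabilities are the values quoted in this study):

```python
# Probability that the I-buffer (IF) requests the cache in a given cycle,
# and probability of a data fetch/store (DF or DS) request per instruction.
prob_if = 0.73
prob_df_or_ds = 0.53

# If every simultaneous request pair caused a delay cycle:
contention = prob_if * prob_df_or_ds
print(f"{contention:.2f}")  # prints 0.39
```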
|
|
|
|
|
|
|
|
We would thus incur 39% additional cycles to manage this contention. Not all of these cycles need cause execution delay, since either the in-line I-buffer or the store buffer can be delayed a cycle without affecting performance. However, even with a well-buffered processor and a well-managed cache-access priority system, we still require:
|
|
|
 |
|
|
|
|
0.73 + 0.53 cache accesses/instr = 1.26 cache access cycles/instr.
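This lower bound is simply the sum of the two request rates, since every access must occupy the cache for a cycle regardless of buffering (again a small Python sketch with our own variable names):

```python
prob_if = 0.73    # I-fetch accesses per instruction
prob_data = 0.53  # data (DF or DS) accesses per instruction

# Even with perfect buffering, each access still takes a cache cycle,
# so per-instruction cache occupancy cannot fall below the sum.
min_access_cycles = prob_if + prob_data
print(f"{min_access_cycles:.2f}")  # prints 1.26
```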
|
|
|
|
|
|
|
|
This "lower bound" on contention ignores an important factor: many of the I-requests arise from branches that occur during "extra" branch resolution cycles. From study 4.3, decoding and branch effects result in processor performance of 2.025 cycles/instruction. In other words, for every 100 instructions, we have 202.5 cycles of expected execution (assuming that
|
|
|
|
|