
Page 537
Study 8.1 SRMP vs. Pipelined Processor
In this study, we contrast a conventional pipelined processor (similar to our baseline) with a four-processor SRMP occupying roughly the same chip area.
Suppose an L/S pipelined processor has a 16KB I-cache and an 8KB D-cache, both set associative with CBWA and LRU replacement. The caches have a 16B line and a miss delay of eight cycles. The processor makes one I-refr/I and 0.5 D-refr/I. Without cache misses, the processor itself achieves 1.5 CPI (i.e., one CPI for decode and 0.5 CPI for branch, run-on, and other effects).
We contrast the pipelined processor with a four-processor SRMP. Each processor has its own register set and I-cache (4KB direct mapped). The processors share the D-cache, decoder, floating-point ALU, etc. Once a processor is stalled (cache miss, etc.), the SRMP switches on the next cycle to the next available processor. The SRMP D-cache is designed to be non-blocking on a miss; i.e., the miss is processed concurrently with accesses for another processor (unless, of course, such an access is to the missed line).
Pipelined Processor Analysis
The base CPI = 1.5.
The additional CPI lost due to cache misses (using chapter 4 data) is computed as follows:
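\[
\begin{aligned}
\text{I-cache CPI loss} &= \text{I-cache miss rate} \times \text{I-refr/I} \times \text{miss penalty} \\
&= [0.05 \times 1.04] \times 1 \times 8\ \text{cycles} \\
&= 0.42\ \text{CPI}.
\end{aligned}
\]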
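\[
\begin{aligned}
\text{D-cache CPI loss} &= \text{D-cache miss rate} \times \text{D-refr/I} \times \text{miss penalty} \\
&= [0.08 \times 1.04] \times 0.5 \times 8\ \text{cycles} \\
&= 0.33\ \text{CPI}.
\end{aligned}
\]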
Pipelined processor CPI total = 1.5 + 0.42 + 0.33 = 2.25.
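The pipelined-processor arithmetic can be checked with a short Python sketch. The function name `cache_cpi_loss` and its parameter names are illustrative, not from the text. In particular, the text does not name the second factor inside the brackets (1.04), so calling it an "adjustment" is an assumption.

```python
def cache_cpi_loss(miss_rate, adjustment, refs_per_instr, miss_penalty_cycles):
    """CPI lost to misses: [miss_rate * adjustment] * refs/instr * penalty.

    'adjustment' is an assumed label for the second bracketed factor
    in the study; the text does not name it.
    """
    return miss_rate * adjustment * refs_per_instr * miss_penalty_cycles


BASE_CPI = 1.5      # CPI without cache misses (from the study)
MISS_PENALTY = 8    # miss delay in cycles (from the study)

# I-cache: one instruction reference per instruction.
i_loss = cache_cpi_loss(0.05, 1.04, 1.0, MISS_PENALTY)
# D-cache: 0.5 data references per instruction.
d_loss = cache_cpi_loss(0.08, 1.04, 0.5, MISS_PENALTY)

pipelined_cpi = BASE_CPI + i_loss + d_loss

print(f"I-cache CPI loss: {i_loss:.2f}")
print(f"D-cache CPI loss: {d_loss:.2f}")
print(f"Pipelined CPI total: {pipelined_cpi:.2f}")
```

Rounded to two decimals, the results agree with the 0.42, 0.33, and 2.25 figures above.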

SRMP (Figure 8.15)
Now each processor has its own I-cache: 4KB direct mapped. They share the D-cache. This ensures cache consistency and simplifies the I-cache design.
\[
\begin{aligned}
\text{I-cache CPI loss} &= [0.095 \times 1.29] \times 1 \times 8\ \text{cycles} \\
&= 0.98\ \text{CPI}.
\end{aligned}
\]

The D-cache has data for four processors resident. We approximate this situation by using MP = 3 (warm start) and Q = 100.
\[
\begin{aligned}
\text{D-cache CPI loss} &= [0.26 \times 1.04] \times 0.5 \times 8\ \text{cycles} \\
&= 1.08\ \text{CPI}.
\end{aligned}
\]
Total CPI for a single SRMP processor = 1.5 + 0.98 + 1.08 = 3.56.
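The per-processor SRMP figure can be reproduced in the same way. This is a minimal Python sketch: the dictionary layout and key names are hypothetical, and "adjustment" is again an assumed label for the unnamed second bracketed factor. It covers only the single-processor CPI derived above and does not model switching among the four processors.

```python
BASE_CPI = 1.5      # CPI without cache misses, same base as the pipelined case
MISS_PENALTY = 8    # miss delay in cycles

# Per-cache inputs taken from the study's bracketed terms.
# Key names are illustrative; 'adjustment' is an assumed label.
srmp_caches = {
    "I-cache": {"miss_rate": 0.095, "adjustment": 1.29, "refs_per_instr": 1.0},
    "D-cache": {"miss_rate": 0.26, "adjustment": 1.04, "refs_per_instr": 0.5},
}

srmp_losses = {}
for name, c in srmp_caches.items():
    # CPI loss = [miss_rate * adjustment] * refs/instr * miss penalty
    srmp_losses[name] = (
        c["miss_rate"] * c["adjustment"] * c["refs_per_instr"] * MISS_PENALTY
    )

srmp_single_cpi = BASE_CPI + sum(srmp_losses.values())

for name, loss in srmp_losses.items():
    print(f"{name} CPI loss: {loss:.2f}")
print(f"Single SRMP processor CPI: {srmp_single_cpi:.2f}")
```

Rounded to two decimals, the losses are 0.98 and 1.08, and the total is 3.56, matching the values above.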

 