Unit                                            Area
Integer ALU (32b)                               1.0A
Bypass                                          0.15A
Integer reg.                                    1.0A
Shifter                                         0.5A
Incrementor                                     0.4A
I-fetch/PC unit, PC chain, cache miss logic     0.85A
2 TLBs (assumes use of PID),
  32b virtual to 24b real                       2 × 3A
Decode + control                                1.0A
Cache controller                                1.0A
Bus logic                                       2.0A
Store buffer + bypass                           1.0A
Load/store byte support                         0.2A
Clock generator                                 1.0A
Subtotal integer                                16.1A
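As a quick check on the table, the listed areas can be summed to confirm the integer subtotal. The following is a minimal sketch, not part of the original text: the dictionary name and keys are illustrative, the 0.85A figure is entered once (against the cache miss logic row, where it is listed), and the TLB row is entered as 2 × 3A.

```python
# Areas of the integer-core units, in units of A, as listed in the table.
# The dictionary name and key spellings are illustrative assumptions.
UNIT_AREAS = {
    "Integer ALU (32b)": 1.0,
    "Bypass": 0.15,
    "Integer reg.": 1.0,
    "Shifter": 0.5,
    "Incrementor": 0.4,
    "Cache miss logic (0.85A entry)": 0.85,
    "2 TLBs": 2 * 3.0,
    "Decode + control": 1.0,
    "Cache controller": 1.0,
    "Bus logic": 2.0,
    "Store buffer + bypass": 1.0,
    "Load/store byte support": 0.2,
    "Clock generator": 1.0,
}

# Sum every listed area to reproduce the "Subtotal integer" row.
subtotal = sum(UNIT_AREAS.values())
print(f"Subtotal integer: {subtotal:.1f}A")
```

The sum agrees with the 16.1A subtotal only if the rows shown without an area of their own contribute nothing beyond the figures listed.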

Most of the preceding data is empirically determined. The TLB requires some discussion, as it occupies almost a third of the base area of the integer processor. The dual TLBs (one for IF and one for DF/DS) are assumed to consist of single-ported register sets (i.e., one TLB bit = 1 rbe). This is consistent with fast TLB access requirements. Each TLB is assumed to be 2-way set associative with a total of 128 entries (64 × 2). Each entry (4KB pages) has a 14-bit virtual address tag (32b - 12b (byte-in-page address) - 6b (TLB entry address)) and a 12-bit real address (24b - 12b). The entry also contains a 4-bit PID (process ID number) and 4 bits of control information (LRU, R/W, etc.). See Chapter 5 for further discussion. Summing up, this gives 34 bits/entry, or 4,352 rbe per TLB (34 × 128). From our earlier discussion, we know that 1,481 rbe occupy 1 mm² = 1A. Thus, including MUX and comparators, a single TLB occupies about 3A.
We assume that two TLBs will be used: one for data fetching and one for instruction fetching. Since instruction fetches are frequently simple in-line fetches, it might be possible to have instructions and data share the same TLB. For the moment, however, we choose not to do that and show two separate TLBs. This correspondingly implies that we have two separate caches, an I-cache and a D-cache.
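The TLB sizing arithmetic above can be reproduced step by step. This is a minimal sketch under the stated assumptions (one TLB bit = 1 rbe, 1,481 rbe per A); the variable names are illustrative and do not come from the text.

```python
# Parameters stated in the text; names are illustrative assumptions.
VIRTUAL_ADDR_BITS = 32
REAL_ADDR_BITS = 24
PAGE_OFFSET_BITS = 12      # 4KB pages
TOTAL_ENTRIES = 128
WAYS = 2                   # 2-way set associative
PID_BITS = 4
CONTROL_BITS = 4           # LRU, R/W, etc.
RBE_PER_A = 1481           # 1,481 rbe occupy 1 mm^2 = 1A

# 128 entries in 2 ways give 64 sets, which need 6 index bits.
sets = TOTAL_ENTRIES // WAYS
index_bits = sets.bit_length() - 1

# Tag is what remains of the virtual address after offset and index.
tag_bits = VIRTUAL_ADDR_BITS - PAGE_OFFSET_BITS - index_bits
real_page_bits = REAL_ADDR_BITS - PAGE_OFFSET_BITS

bits_per_entry = tag_bits + real_page_bits + PID_BITS + CONTROL_BITS
rbe_per_tlb = bits_per_entry * TOTAL_ENTRIES   # one TLB bit = 1 rbe
storage_area = rbe_per_tlb / RBE_PER_A         # storage only, in units of A

print(f"tag = {tag_bits}b, real page = {real_page_bits}b")
print(f"{bits_per_entry} bits/entry, {rbe_per_tlb} rbe per TLB")
print(f"storage area = {storage_area:.2f}A")
```

The storage alone comes to a little under 3A; the text rounds up to about 3A per TLB to cover the MUX and comparators.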
Floating Point. It has been empirically determined that a floating-point adder occupies 13.5 times the area of the integer ALU. For our floating-point multiplier, we assume a high-speed two-pass multiplier, which occupies 1.5 times the area of the floating-point adder. The divider uses the multiplier hardware. This combination provides the following performance:
FADD    3 cycles
FMPY    3 cycles
FDIV    15 cycles,

 