page_97


| Unit                                                  | Area   |
|-------------------------------------------------------|--------|
| Integer ALU (32b)                                     | 1.0A   |
| Bypass                                                | 0.15A  |
| Integer reg.                                          | 1.0A   |
| Shifter                                               | 0.5A   |
| Incrementor                                           | 0.4A   |
| I-fetch/PC unit (PC chain, cache miss logic)          | 0.85A  |
| 2 TLBs (assumes use of PID), 32b virtual to 24b real  | 2 × 3A |
| Decode + control                                      | 1.0A   |
| Cache controller                                      | 1.0A   |
| Bus logic                                             | 2.0A   |
| Store buffer + bypass                                 | 1.0A   |
| Load/store byte support                               | 0.2A   |
| Clock generator                                       | 1.0A   |
| Subtotal integer                                      | 16.1A  |
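As a quick check on the subtotal, the table entries can be summed directly. The following is a minimal sketch in Python; the dictionary and function names are illustrative assumptions, not part of the original text, and the two TLBs are entered as 2 × 3A = 6A.

```python
# Unit areas from the table above, in multiples of A.
# Names are hypothetical; the values are taken from the table.
INTEGER_UNIT_AREAS = {
    "Integer ALU (32b)": 1.0,
    "Bypass": 0.15,
    "Integer reg.": 1.0,
    "Shifter": 0.5,
    "Incrementor": 0.4,
    "I-fetch/PC unit": 0.85,
    "2 TLBs": 2 * 3.0,
    "Decode + control": 1.0,
    "Cache controller": 1.0,
    "Bus logic": 2.0,
    "Store buffer + bypass": 1.0,
    "Load/store byte support": 0.2,
    "Clock generator": 1.0,
}


def subtotal_area(areas):
    """Sum the unit areas, rounded to two decimals to hide float noise."""
    return round(sum(areas.values()), 2)


def area_fraction(areas, unit):
    """Fraction of the subtotal occupied by one named unit."""
    return areas[unit] / sum(areas.values())


if __name__ == "__main__":
    print("Subtotal integer:", subtotal_area(INTEGER_UNIT_AREAS), "A")
    print("TLB share:", round(area_fraction(INTEGER_UNIT_AREAS, "2 TLBs"), 3))
```

Keeping the areas in one table-like structure makes it easy to see how much any single unit, such as the pair of TLBs, contributes to the integer processor total.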



Most of the preceding data is empirically determined. The TLB requires some discussion, as it occupies almost a third of the base area of the integer processor. The dual TLBs (one for IF and one for DF/DS) are assumed to consist of single-ported register sets (i.e., one TLB bit = 1 rbe). This is consistent with fast TLB access requirements. Each TLB is assumed to be 2-way set associative with a total of 128 entries (64 × 2). Each entry (for 4KB pages) has a 14-bit virtual address tag (32b - 12b for the byte-in-page address - 6b for the TLB entry address) and a 12-bit real page address (24b - 12b). The entry also contains a 4-bit PID (process ID number) and 4 bits of control information (LRU, R/W, etc.). See Chapter 5 for further discussion. Summing up, this gives 34 bits/entry, or 4,352 rbe per TLB (34 × 128). From our earlier discussion, we know that 1,481 rbe occupy 1 mm² = 1A, so the storage alone takes about 2.9A. Thus, including MUX and comparators, a single TLB occupies about 3A.

We assume that two TLBs will be used: one for data fetching and one for instruction fetching. Since instruction fetches are frequently simple in-line fetches, it might be possible to have instructions and data share the same TLB. However, for the moment we choose not to do that, and show two separate TLBs. This correspondingly implies that we have two separate caches, an I-cache and a D-cache.
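The TLB sizing argument above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function and parameter names are assumptions introduced here, while the default values (32b virtual, 24b real, 4KB pages, 128 entries, 2-way, 4-bit PID, 4 control bits, 1,481 rbe per A) come from the text. It counts storage bits only, so the MUX and comparator overhead mentioned in the text is not modeled.

```python
RBE_PER_A = 1481  # from the text: 1,481 rbe occupy 1 mm^2 = 1A


def log2_exact(n):
    """Return log2(n) for a positive power of two, using integer math."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a positive power of two")
    return n.bit_length() - 1


def tlb_entry_bits(virtual_bits=32, real_bits=24, page_bytes=4096,
                   entries=128, ways=2, pid_bits=4, control_bits=4):
    """Bits stored per TLB entry: tag + real page address + PID + control."""
    offset_bits = log2_exact(page_bytes)          # 12 for 4KB pages
    index_bits = log2_exact(entries // ways)      # 6 for 64 sets
    tag_bits = virtual_bits - offset_bits - index_bits
    real_page_bits = real_bits - offset_bits
    return tag_bits + real_page_bits + pid_bits + control_bits


def tlb_storage_rbe(entries=128, **kwargs):
    """Total storage in rbe, assuming one TLB bit = 1 rbe as in the text."""
    return tlb_entry_bits(entries=entries, **kwargs) * entries


def tlb_storage_area(entries=128, **kwargs):
    """Storage area in units of A (excludes MUX and comparators)."""
    return tlb_storage_rbe(entries=entries, **kwargs) / RBE_PER_A


if __name__ == "__main__":
    print("bits/entry:", tlb_entry_bits())
    print("rbe per TLB:", tlb_storage_rbe())
    print("storage area (A):", round(tlb_storage_area(), 2))
```

Varying `entries`, `ways`, or `page_bytes` shows how the tag width and total area move together, which is useful when weighing a shared TLB against the two separate TLBs assumed here.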



Floating Point. It has been empirically determined that a floating-point adder occupies an area 13.5 times that of the integer ALU. For our floating-point multiplier, we assume a high-speed two-pass multiplier, which occupies 1.5 times the floating-point adder area. The divider uses the multiplier hardware. This combination provides the following performance:

FADD    3 cycles
FMPY    3 cycles
FDIV    15 cycles,
