page_452


Page 452



		allow bypassing to occur within a vector instruction, we assume that any ALU or other instruction that uses a vector register cannot begin until all previously designated operations that use that vector register have been completed.



		Interinstruction bypassing can improve performance, as in the above example, but as mentioned earlier it can be accomplished only at significant expense. Some of the issues that arise with interinstruction bypassing are similar to those covered later in this chapter in the discussion of out-of-order execution for multiple instruction execution machines.



		7.4 Vector Processor Speedup: Performance Relative to a Pipelined Processor



		7.4.1 Basic Issues



		Vector processor performance is determined by:



		1. The amount of the program that can be expressed in a vectorizable form.



		2. Vector startup costs. These correspond to the length of the pipeline for vector instructions.



		3. The number of execution units and the support for the chaining of vector operands provided within the execution unit.



		4. The number of operands that can be simultaneously loaded from or stored to the memory system.



		5. The number of vector registers.



		A secondary element in the memory system design is the g factor: the number of requests that can be bypassed before the memory accessing mechanism stalls waiting for a conflict-free reference.

		Suppose we compare the vector processor to a well-designed high-speed pipelined processor with the same cycle time. Depending upon the memory system design, the speedup of the vector processor over the pipelined processor is generally limited to four. This assumes that the memory system supports two load instructions and a store instruction executing concurrently with an arithmetic operation. If chaining is allowed, the memory system must accommodate at least one additional concurrent load instruction, but the overall speedup is then limited (for most cases) to S_p,max < 6 (3 LDs, 2 Ariths, and 1 ST).

		In practice, such speedups are not achievable, for a variety of reasons. To sustain a high degree of speedup, the program must consist purely of vector code. Since at least some address arithmetic and control operations are also required, some part of the program is not vectorizable, and this limits the attainable speedup. Suppose a particular problem has a maximum speedup of four, and that this is available for 75% of the operations to be executed by the processor. This would give an overall speedup of:
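
		The arithmetic behind this example can be checked with a short sketch. It assumes the usual Amdahl's-law form, in which the non-vectorizable fraction runs at the pipelined processor's speed and the vectorizable fraction runs at the peak vector speedup. The function name `overall_speedup` and its parameter names are illustrative and do not come from the text.

```python
def overall_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Amdahl's-law estimate of whole-program speedup (assumed model).

    vector_fraction: share of operations that can run in vector mode (0..1).
    vector_speedup:  speedup achieved on that share relative to the
                     pipelined (scalar) processor.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")

    # Normalized time spent in the part that cannot be vectorized.
    scalar_time = 1.0 - vector_fraction
    # Normalized time spent in the vectorized part, shortened by the speedup.
    vector_time = vector_fraction / vector_speedup
    return 1.0 / (scalar_time + vector_time)


if __name__ == "__main__":
    # The case described above: peak speedup of 4 on 75% of the operations.
    print(f"{overall_speedup(0.75, 4.0):.2f}")  # prints 2.29
```

		Under this model the 25% of operations that remain scalar dominate: a peak speedup of four on 75% of the work gives an overall figure of about 2.3.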
