7.7.2 Performance Comparison
The performance of vector processors is primarily a function of two factors:
1. The percentage of the code that is vectorizable in a particular application.
2. The average length of the vector seen by the vector processor.
As we saw earlier, n½, the vector length at which the vector processor achieves approximately half its asymptotic performance, is roughly the same as the length of the arithmetic-plus-memory access pipeline. Since multiple-issue processors access data from the data cache, their equivalent n½ is generally smaller than that of the vector processor, so one would expect a multiple-issue processor to outperform an equivalent vector processor on short vectors. The actual advantage depends on the sophistication of the compiler for each machine. A vectorizing compiler may be able to recognize a short vector and treat that portion of the code as scalar code that uses the scalar data cache. Usually, code corresponding to rather short vectors can be reorganized to provide the kind of instruction independence best suited to the multiple-issue machine. As vectors get longer, the performance of the multiple-issue machine becomes much more dependent on the size of the data cache. When the vector (or underlying array) exceeds the size of the data cache, there is little reuse of data values stored in the cache, and the performance of the processor suffers greatly.
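The effect of n½ on short vectors can be illustrated with a small timing model. The sketch below assumes the commonly used linear model, in which a vector of length n takes time proportional to n + n½, so the delivered rate is r∞ · n / (n + n½). The function name, the peak rate, and both n½ values are hypothetical and chosen only for illustration; they are not measurements of any machine discussed here.

```python
def delivered_rate(n, r_inf, n_half):
    """Delivered rate for vector length n, assuming t(n) = (n + n_half) / r_inf."""
    if n <= 0:
        raise ValueError("vector length must be positive")
    return r_inf * n / (n + n_half)


# Hypothetical machines with the same peak rate but different n_half values.
R_INF = 100.0        # assumed asymptotic rate (arbitrary units)
N_HALF_VECTOR = 32   # assumed: long arithmetic-plus-memory pipeline
N_HALF_MULTI = 4     # assumed: short cache-access pipeline

if __name__ == "__main__":
    for n in (4, 16, 64, 256, 1024):
        v = delivered_rate(n, R_INF, N_HALF_VECTOR)
        m = delivered_rate(n, R_INF, N_HALF_MULTI)
        print(f"n={n:5d}  vector={v:6.1f}  multiple-issue={m:6.1f}")
```

Under these assumptions each machine reaches half its peak rate exactly at its own n½, and the machine with the smaller n½ leads at short vector lengths. The model deliberately ignores cache capacity, so it says nothing about the degradation that sets in once the array no longer fits in the data cache.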
Simmons and Wasserman [258] have carried out an interesting performance evaluation comparing the IBM RISC System/6000, a scalar processor capable of issuing up to four instructions per cycle, with two vector processors (a Convex C240 and a Floating Point Systems FPS-500). Figure 7.55 shows the results of their study for the particular version of the RISC System/6000 that was used, which contains a 32-KB cache and operates at a 50-ns cycle. The steep falloff in performance begins at vector lengths above 225. Ultimately, the processor is limited by the time required (about 15 clock periods) to load a cache line following a miss. The total cache capacity is exceeded at vector length 800, which accounts for the second dip in performance. It is interesting to note that neither a vector processor of similar capability (for instance, the Convex C240) nor an improved version of the IBM RISC System/6000 (with a 40-ns cycle and a 64-KB cache) suffered any performance degradation up to vector length 1,000. One would expect, of course, that at some point a similar performance degradation would appear on the larger-cache version of the System/6000.
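The miss-limited behavior described above can be put into rough numbers. The following sketch uses the cycle time, miss penalty, and cache size quoted in the text. The element size, the cache line size, and the number of arrays in the working set are not given there, so the values used for them are labeled assumptions, and the helper names are hypothetical.

```python
CYCLE_NS = 50            # from the text: 50-ns cycle
MISS_CYCLES = 15         # from the text: about 15 clock periods per line fill
CACHE_BYTES = 32 * 1024  # from the text: 32-KB cache

ELEM_BYTES = 8           # assumption: 64-bit floating-point elements
LINE_BYTES = 64          # assumption: the line size is not stated in the text


def miss_penalty_ns(miss_cycles=MISS_CYCLES, cycle_ns=CYCLE_NS):
    """Time to load one cache line after a miss, in nanoseconds."""
    return miss_cycles * cycle_ns


def streaming_miss_cycles_per_element(miss_cycles=MISS_CYCLES,
                                      line_bytes=LINE_BYTES,
                                      elem_bytes=ELEM_BYTES):
    """Amortized miss cycles per element when each line is used once, then evicted."""
    return miss_cycles / (line_bytes // elem_bytes)


def working_set_bytes(n, arrays, elem_bytes=ELEM_BYTES):
    """Bytes touched by `arrays` vectors of length n."""
    return n * arrays * elem_bytes


if __name__ == "__main__":
    print("miss penalty (ns):", miss_penalty_ns())
    print("miss cycles per element:", streaming_miss_cycles_per_element())
    for arrays in (5, 6):  # illustrative array counts only
        ws = working_set_bytes(800, arrays)
        print(arrays, "arrays at n=800:", ws, "bytes, fits:", ws <= CACHE_BYTES)
```

With the quoted figures, each miss costs 15 × 50 = 750 ns. Under the assumed 64-byte line and 8-byte elements, a pure streaming access pattern would pay a little under two extra cycles per element, which suggests why performance drops sharply once reuse disappears. The working-set check is illustrative only and does not reproduce the benchmark used in the study.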
In addition to this simple example, Simmons and Wasserman compared the machines on application programs. Across a broad range of applications, the 40-ns RISC System/6000 performed quite well relative to the C240, with an execution time ratio of 0.4 to 2.3, and with most applications favoring the multiple-issue (RS/6000) machine by about 20 to 30%.
Another study, conducted by Bennett [33], was restricted to the FFT algorithm; it compared a high-speed pipelined processor with a vector processor on a relative cycle count basis. The FFT problem consists of finding a functional approximation that fits a set of sample data of size N; in