|
|
|
|
|
|
|
1. Communications delays. This includes the delays caused by messages that are passed among computational nodes.
|
|
|
|
|
|
|
|
In order to make the grain size larger than this overhead effect, we must have

grain size (in instructions) > k1 × (message delay) × MIPS
|
|
|
|
In this relationship, k1 is the number of messages passed in a typical computational grain. This overhead delay, when multiplied by the MIPS available at the processing node, gives a measure of the equivalent number of overhead instructions present in the computation. The reader should recall that we are assuming the program was perfectly partitionable without incurring algorithmic overhead; that is, the number of instructions needed to perform the computation with p processors (Op) is the same as the number of instructions required to perform that same computation on a single processor (O1), ignoring overhead.
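As a back-of-envelope illustration of this relationship, the sketch below computes the equivalent overhead instructions for a grain. All of the numbers (messages per grain, per-message delay, node MIPS) are assumed values for illustration, not figures from the text.

```python
# Estimate the equivalent overhead instructions per computational grain.
# Assumed, illustrative parameters:
k1 = 10                 # messages passed in a typical grain
message_delay_s = 5e-6  # communication delay per message, in seconds
mips = 100              # node performance, millions of instructions/second

# Total overhead delay times the node's instruction rate gives the
# equivalent number of overhead instructions charged to the grain.
overhead_instructions = k1 * message_delay_s * (mips * 1e6)
print(overhead_instructions)  # 5000.0
```

Under these assumptions, a grain must contain well over 5,000 useful instructions before the communication overhead stops dominating the computation.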
|
|
|
|
|
|
|
|
As networks of processors grow larger, network delay tends to increase; and as processor performance (MIPS) rises, the combined effect of longer message delays and higher MIPS accelerates the overhead costs.
|
|
|
|
|
|
|
|
While one would expect context switch time to scale with MIPS, it may or may not. As we have seen with advanced superscalar processors, there is a tendency to enlarge the register set or to increase the number of user-visible registers. Each of these tends to increase the context switch time.
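To make the register-set effect concrete, the sketch below estimates context switch cost as the cycles spent saving and restoring the visible registers. The cycle counts and register counts are assumed values for illustration only.

```python
# Rough model: a context switch must save and later restore every
# user-visible register; more registers means a costlier switch.
def context_switch_cycles(num_registers, cycles_per_transfer=1):
    # One save plus one restore per register (assumed 1 cycle each).
    return 2 * num_registers * cycles_per_transfer

print(context_switch_cycles(32))   # 64 cycles for a 32-register file
print(context_switch_cycles(128))  # 256 cycles once the file quadruples
```

In this simple model, quadrupling the user-visible register set quadruples the switch cost, independent of any improvement in the processor's MIPS rating.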
|
|
|
|
|
|
|
|
Probably the most subtle effect of overhead is the cold-cache effect. Suppose we have a loop whose iterations are independent of one another. The first tendency would be to immediately recognize these as independent computational grains and distribute them across n processors. However, a single processor would load its instruction cache once with the loop, whereas the distributed version of the program must now incur the I-cache miss penalties n times. The same may be true of certain data, which may not even be shared among the processors (in the sense of being communicated from processor to processor), but is simply used by each of them. The extra traffic created by cache initialization overhead both congests buses and slows down individual processing nodes.
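The cold-cache cost described above can be sketched as follows. The cache-line count and miss penalty are assumed, illustrative figures; the point is only that the cold-start cost is paid once per processor.

```python
# Sketch of the cold-cache effect: every processor that receives a share
# of the loop must fault the loop's instruction lines into its own cache.
def cold_cache_penalty_cycles(n, loop_cache_lines=64, miss_penalty=20):
    # n processors each take loop_cache_lines compulsory misses
    # (assumed 20 cycles each) before reaching steady state.
    return n * loop_cache_lines * miss_penalty

print(cold_cache_penalty_cycles(1))  # 1280 cycles on a single processor
print(cold_cache_penalty_cycles(8))  # 10240 cycles across 8 processors
```

The useful work per iteration is unchanged, but the compulsory-miss traffic, and the bus congestion it creates, grows linearly with the number of processors the loop is spread across.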
|
|
|
|
|
|
|
|
Effective problem partitioning is difficult and must be approached carefully; parallelism does not necessarily translate into speedup.
|
|
|
|
|
|
|
|
8.6 Types of Shared Memory Multiprocessors |
|
|
|
|
|
|
|
|
Even within the seemingly limited class of processor architectures represented by shared memory multiprocessors, there is a great deal of variety. |
|
|
|
|
|