|
|
|
|
|
|
|
Spin locks generate significant additional memory traffic (at least to cache memory) directed at the shared variables or the synchronization primitives until access is granted [19]. To reduce this extra traffic, a suspend lock can be implemented. In a suspend lock, the lock status is recorded by the equivalent of a test-and-set instruction, and the requesting process then goes idle, awaiting an interrupt from the producing process.
|
|
|
|
|
|
|
|
Finally, it is possible to combine synchronization requests from multiple sources in an interconnection network. In a fine-grain system there may be a good deal of synchronization traffic, and since this traffic is typically directed at just a few synchronization cells, significant contention can arise at those designated cells. When such contention develops, it is referred to as a hotspot in memory [233]. To avoid such hotspots, the fetch-and-add primitive was developed [111]: multiple requests (such as test-and-set operations) directed at the same synchronization variable can be combined into a single access, with each requester receiving its own returned value. This combining is done in the network before the access reaches the synchronization variable in memory.
|
|
|
|
|
|
|
|
8.5 The Effects of Partitioning and Scheduling Overhead |
|
|
|
|
|
|
|
|
When a program is partitioned into tasks, the maximum number of concurrent tasks can be determined. This is simply the maximum number of tasks that can be executed at any one time, sometimes called the degree of parallelism of the program. Even if a program has a high degree of parallelism, a corresponding degree of speedup may not be achieved. Recall the definition of speedup:

Sp = T1 / Tp
|
|
|
|
|
|
|
|
T1 represents the time required for a uniprocessor to execute the program using the best uniprocessor algorithm, and Tp is the time required for p processors to execute the program using the algorithm that gives the best execution time on p processors. It may well happen that the uniprocessor algorithm is significantly better than the parallel algorithm, thus limiting the overall potential speedup even though the parallel algorithm kept all p processors busy almost all of the time.
|
|
|
|
|
|
|
|
There is yet another reason that the parallel processor may fail to achieve a high degree of speedup with p processors: the maximum degree of parallelism within the program may occur for only a small fraction of the execution time, while the rest of the program is essentially serial. This is a problem of the control structure, or of the precedence relations defined within the program. For either of these reasons, or for others, the overall efficiency of the parallel processor is usually less than that of the uniprocessor.
|
|
|
|
|
|
|
|
Efficiency has been defined by Kuck [173] as:

Ep = Sp / p = T1 / (p Tp)
|
|
|
|
|