
Page 518
possible for an element to be written before its value was read for use in computing the value of a lower-numbered element [27, 308]. For example, if element 9 were overwritten before its original value had been read, element 3 would be computed from the wrong value.
To eliminate this problem, two solutions are shown. In (c), a temporary array is used to store the values until all of the original array has been read. The values are then copied from the temporary array to the original array. Since neither the copy loop nor the compute loop carries a dependence between its iterations, both loops can be executed in parallel. The disadvantage of this solution is that roughly twice as much work must now be done.
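The temporary-array solution can be sketched as follows. The loop body f and the read distance of 3 are hypothetical stand-ins for whatever the original program computes, since the figure itself is not reproduced here:

```python
# Solution (c), sketched in Python. The hypothetical loop body f and
# the read distance d = 3 are assumptions for illustration.

def f(x):
    # stand-in for whatever the loop body computes
    return 2 * x + 1

def serial(a, d=3):
    # original loop: iteration i reads a[i + d], which iteration
    # i + d will later overwrite; correct serially, but unsafe if
    # iterations run out of order
    for i in range(len(a) - d):
        a[i] = f(a[i + d])

def parallel_safe(a, d=3):
    # loop 1: copy the array before any element is overwritten;
    # iterations are independent, so this loop is parallel
    tmp = a[:]
    # loop 2: every read now comes from the temporary copy, so no
    # iteration can observe another iteration's write; also parallel
    for i in range(len(a) - d):
        a[i] = f(tmp[i + d])
```

Because every read in the second loop targets the copy, the order in which its iterations execute no longer matters, at the cost of the extra copy loop.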
Another solution is presented in (d). In this case, an array element is scheduled for execution as soon as its value has been read by every element that uses it in a computation. The amount of available parallelism varies from iteration to iteration of the loop.
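This scheduling idea can be sketched as a wavefront: once the iterations in one wave have read the elements they need, the iterations that overwrite those elements become safe to run. The dependence distance d and loop body f are again hypothetical choices for illustration:

```python
# Solution (d) sketched as a wavefront schedule (d and f are
# assumptions; the figure's actual values are not reproduced here).

def f(x):
    return 2 * x + 1

def wavefront(a, d=3):
    last = len(a) - d              # iterations 0 .. last - 1
    for start in range(0, last, d):
        wave = range(start, min(start + d, last))
        # iterations in `wave` are mutually independent: iteration i
        # overwrites a[i], whose old value was read one wave earlier,
        # and reads a[i + d], which is not written until the next wave
        for i in wave:             # these could execute concurrently
            a[i] = f(a[i + d])
    return a
```

The final wave may be smaller than the others, which is one way the degree of parallelism varies as the loop proceeds.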
The choice between the two solutions depends on the configuration of the parallel machine. If the machine has a large number of processing elements, the double-buffering solution might be best. For machines with a limited number of processors, the second solution is better.
The process of partitioning begins with the decomposition of the program into tasks. This process usually starts with the outermost loop of the program. Each iteration of the loop is a candidate for being designated a task. Once the various candidates have been defined, the process of dependency resolution can begin, as shown in Example 8.1. Techniques such as scalar renaming (similar to register renaming) can be used to reduce or eliminate apparent task dependencies. Similarly, techniques such as forward substitution can have the same effect. Other techniques include loop normalization and subscript expansion. Clearly, the most effort must be spent in areas that have the most potential benefit. Stone [272] describes a recipe for, in his words, finding "easy parallelism" in a program:
Step 1. Find loop or recursion structures in the program.
Step 2. Determine the loops that account for the most time, i.e., those corresponding to the largest number of iterations.
Step 3. Assign instruction execution of these structures across the n processors, if this can be done in an independent manner.
Step 4. Add synchronization and communication instructions as required.
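Under the assumptions that the dominant loop's iterations are independent (Step 3) and that starting and joining worker threads supplies the synchronization of Step 4, the recipe can be sketched as:

```python
import threading

def run_parallel(body, iterations, n):
    # Step 3: assign the iterations of the dominant loop across n
    # workers (here by interleaving); this is safe only when the
    # iterations are independent of one another
    def worker(chunk):
        for i in chunk:
            body(i)
    threads = [
        threading.Thread(target=worker, args=(range(w, iterations, n),))
        for w in range(n)
    ]
    # Step 4: starting the workers and joining them acts as the
    # required synchronization barrier
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

If each iteration writes a distinct result slot, no communication beyond the final join is needed; for example, `run_parallel(lambda i: out.__setitem__(i, i * i), 100, 4)` fills a 100-element list with squares.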
Step 3 collects sub-tasks into single tasks. In this clustering process, the compiler must be aware of the maximum number of resources that are likely to be available to it during the course of execution. Finding concurrency at any one point in excess of this is not profitable. Overhead considerations, then, can be used to assist in clustering tasks into units that afford the maximum potential speedup. This has been demonstrated by the Parafrase 2 compiler [235]. After the clustering process, the compiler may be used to determine potentially critical paths through the program. To do this, the amount of time required for task execution must be estimated; this corresponds to the probable number of instructions executed to complete the task. If determinable, this information is quite useful:
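As an illustration only, and not Parafrase 2's actual algorithm, a greedy pass might merge consecutive tasks until each cluster's estimated instruction count is large enough to amortize a given scheduling overhead:

```python
def cluster_tasks(costs, overhead):
    """Greedily merge adjacent tasks whose estimated costs (probable
    instruction counts) are too small to amortize the per-task
    scheduling overhead. A heuristic sketch, not a real compiler pass.
    """
    clusters, current, total = [], [], 0
    for c in costs:
        current.append(c)
        total += c
        if total >= overhead:        # cluster now worth scheduling
            clusters.append(current)
            current, total = [], 0
    if current:                      # leftover tasks form a final cluster
        clusters.append(current)
    return clusters
```

With an overhead of 10, four tasks of estimated cost 5 would be merged pairwise, halving the number of schedulable units while keeping each large enough to be profitable.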

 