
Page 563
Notice that in invalidate protocols, the processor-cache that last performed the write is the single owner of the line. Other processors may read the line, and any of them becomes the owner if it performs a write, which requires invalidating all other users of the line. Invalidate protocols involve additional traffic to get updated copies of shared lines. Central directories are generally used with updated memory, since it is relatively easy to write new data to memory at the same time the central directory is updated. This can, of course, create additional congestion at memory, as the memory system itself may have less available bandwidth than the directory. In any event, the central directory and its memory (when updated) represent a significant limitation to the ultimate scalability of multiprocessor systems. For the distributed directory case, as systems and applications grow large, so too do the sharing lists; while traffic per se may not be as much of a problem, the length of the list increases the transaction time for each write. Ultimately, the write transaction time becomes the limit on the scalability of such systems.
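The ownership and invalidation behavior above can be sketched as a minimal full-map directory entry. This is an illustrative model only (the class and method names are our own, not from the text): reads add sharers, and a write makes the writer the single owner after invalidating all other users, so invalidation traffic grows with the number of sharers.

```python
class DirectoryEntry:
    """Full-map directory state for one memory line (illustrative sketch)."""

    def __init__(self):
        self.sharers = set()   # processors currently holding a copy
        self.owner = None      # processor that last performed a write

    def read(self, proc):
        # A read simply adds the requester to the sharing list.
        self.sharers.add(proc)

    def write(self, proc):
        # Invalidate every other user of the line; the writer becomes
        # the single owner. Returns the set of invalidation messages
        # the directory must send -- this grows with the sharing list.
        invalidations = self.sharers - {proc}
        self.sharers = {proc}
        self.owner = proc
        return invalidations


entry = DirectoryEntry()
for p in (0, 1, 2):
    entry.read(p)          # three processors share the line
invalidated = entry.write(2)   # processor 2 writes: invalidate {0, 1}
```

The longer the sharing list, the more invalidations each write must generate, which is exactly why write transaction time limits scalability in the distributed-list case.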
8.12 Evaluating Some Systems Alternatives
Suppose we select an interconnect network based upon a grid. (The motivation for a grid interconnection will be clearer after our discussion on interconnection later in this chapter.) Now the question is how best to treat the design of the system. What coherency protocol should be selected to give minimum traffic? What parameters of an application will influence program behavior most strongly? What extra hardware ought to be considered to improve performance?
In order to look at some alternatives, let us define a node (Figure 8.37) as containing a very high-speed processor with an infinite cache and, in a network of n processors, 1/n of the memory. We assume a 64-byte line (16 4B physical words) for the cache, and a network link time (time to transmit a line between two nodes) of 16 cycles (i.e., w = 32b, l/w = 16). Suppose we arrange a 64-node system as an 8 × 8 grid. We use this as a baseline system to evaluate some alternative protocols.
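The baseline parameters above hang together arithmetically; a short calculation (using only values stated in the text) confirms the 16-cycle link time:

```python
# Baseline node parameters from the text.
line_bytes = 64                      # 64-byte cache line (16 4B words)
line_bits = line_bytes * 8           # l = 512 bits per line
link_width_bits = 32                 # w = 32 bits per cycle on a link

# Cycles to move one line between adjacent nodes: l/w.
link_cycles = line_bits // link_width_bits   # 512 / 32 = 16 cycles

nodes = 8 * 8                        # 64 nodes arranged as an 8 x 8 grid
```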
Applications vary in some important ways, even when there is ample parallelism in the program. From a network point of view, the number of distinct localities that are shared (or the number of shared data blocks) plays an important role. In our base system, we ignore nonshared traffic (infinite cache), so there can be no misses due to the capacity of the cache. An important parameter is the size of the shared item. If the shared item is significantly smaller than a line, we have the potential for poor line utilization and hence a great deal of excess traffic under protocols that move lines as a unit. Another important parameter of an application is the number of consumers: applications with many consumers create different kinds of traffic patterns than those with just a few. Finally, there is the issue of line utilization itself: how many words of a shared line are actually used by the consuming processor?
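The excess-traffic point can be made concrete. Under stated assumptions (a line-based protocol that always moves whole 64-byte lines; the function names here are illustrative, not from the text), the traffic overhead is simply the ratio of line size to shared-item size:

```python
def line_utilization(item_bytes, line_bytes=64):
    """Fraction of each transferred line that carries useful shared data."""
    return item_bytes / line_bytes


def excess_traffic_factor(item_bytes, line_bytes=64):
    """Bytes moved per useful byte when whole lines are transferred."""
    return line_bytes / item_bytes


# An 8-byte shared item in a 64-byte line: only 1/8 of each transfer
# is useful, so the protocol moves 8x the necessary traffic.
util = line_utilization(8)       # 0.125
excess = excess_traffic_factor(8)  # 8.0
```

A protocol moving only the referenced words would avoid this overhead at the cost of more, smaller transactions.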
A producer of a shared data value has a timing relationship to the consumer of that value. The exact relationship is determined by the protocol. The producer of shared data information spends some time computing values which it then writes into a shared data line. Typically, these writes are

 