Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP, LOGP, and QSM) account for bandwidth limitations using a per-processor parameter g > 1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most in messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems such as broadcasting, parity, summation, and sorting, and give several new upper and lower time bounds that demonstrate the advantage of globally limited models over locally limited models given the same aggregate bandwidth (i.e., p·1/g = m). In general, globally limited models have a possible advantage whenever there is an imbalance in the number of messages sent/received by the processors. To exploit this advantage, the processors must schedule the sending of messages so as to respect the aggregate bandwidth limit. We present a new parallel scheduling algorithm for globally limited models that enable an unknown, arbitrarily unbalanced set of messages to be sent through the limited bandwidth within a (1 + ε) factor of the optimal off-line schedule with high probability, even if the penalty for overloading the network is an exponential function of the overload. We also present a near-optimal algorithm for the case where long messages must be sent as flits in consecutive time steps, as well as for the case where new messages to be sent arrive dynamically over an infinite time line. These results consider both message passing (distributed memory) and shared memory scenarios, and improve upon the best results for the locally limited model by a factor of Θ(g). Finally, we present results quantifying the power of concurrent reads in a globally limited bandwidth setting, including showing an Ω(p lg m/m lg p)) time separation between the exclusive-read and the concurrent-read PRAM(m) models, which, when m « p, greatly improves upon the 2Ω(√1g p) separation known previously.
- Limited bandwidth
- Parallel computation