TY - GEN
T1 - Network Topologies and Inevitable Contention
AU - Ballard, Grey
AU - Demmel, James
AU - Gearhart, Andrew
AU - Lipshitz, Benjamin
AU - Oltchik, Yishai
AU - Schwartz, Oded
AU - Toledo, Sivan
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/1/23
Y1 - 2017/1/23
AB - Network topologies can have a significant effect on the execution costs of parallel algorithms due to inter-processor communication. For particular combinations of computations and network topologies, costly network contention may inevitably become a bottleneck, even if algorithms are optimally designed so that each processor communicates as little as possible. We obtain novel contention lower bounds that are functions of the network and the computation graph parameters. For several combinations of fundamental computations and common network topologies, our new analysis improves upon previous per-processor lower bounds, which only specify the number of words communicated by the busiest individual processor. We consider torus and mesh topologies, universal fat-trees, and hypercubes; algorithms covered include classical matrix multiplication and direct numerical linear algebra, fast matrix multiplication algorithms, programs that reference arrays, N-body computations, and the FFT. For example, we show that fast matrix multiplication algorithms (e.g., Strassen's) running on a 3D torus will suffer from contention bottlenecks. On the other hand, this network is likely sufficient for a classical matrix multiplication algorithm. Our new lower bounds are matched by existing algorithms only in very few cases, leaving many open problems for network and algorithmic design.
KW - Communication costs
KW - Communication-avoiding algorithms
KW - FFT
KW - Matrix Multiplication
KW - Network topology
KW - Numerical Linear Algebra
KW - Strong scaling
UR - http://www.scopus.com/inward/record.url?scp=85013967270&partnerID=8YFLogxK
U2 - 10.1109/COMHPC.2016.010
DO - 10.1109/COMHPC.2016.010
M3 - Conference contribution
AN - SCOPUS:85013967270
T3 - Proceedings of COM-HPC 2016: 1st Workshop on Optimization of Communication in HPC Runtime Systems - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 39
EP - 52
BT - Proceedings of COM-HPC 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 1st Workshop on Optimization of Communication in HPC Runtime Systems, COM-HPC 2016
Y2 - 18 November 2016
ER -