TY - JOUR
T1 - Communication lower bounds for distributed-memory matrix multiplication
AU - Irony, Dror
AU - Toledo, Sivan
AU - Tiskin, Alexander
N1 - Funding Information:
This research was supported by the Israel Science Foundation founded by the Israel Academy of Sciences and Humanities (Grant Number 572/00 and Grant Number 9060/99) and by the University Research Fund of Tel-Aviv University. *Corresponding author. E-mail addresses: irony@tau.ac.il (D. Irony), stoledo@tau.ac.il (S. Toledo), tiskin@dcs.warwick.ac.uk (A. Tiskin). URL: http://www.tau.ac.il/~stoledo, http://www.dcs.warwick.ac.uk/~tiskin.
Funding Information:
Thanks to the two anonymous referees for helpful comments and suggestions. Sivan Toledo was supported in part by an IBM Faculty Partnership Award and by Grants 572/00 and 9060/99 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities).
PY - 2004/9
Y1 - 2004/9
N2 - We present lower bounds on the amount of communication that matrix multiplication algorithms must perform on a distributed-memory parallel computer. We denote the number of processors by P and the dimension of square matrices by n. We show that the most widely used class of algorithms, the so-called two-dimensional (2D) algorithms, are optimal, in the sense that in any algorithm that only uses O(n^2 / P) words of memory per processor, at least one processor must send or receive Ω(n^2 / P^{1/2}) words. We also show that algorithms from another class, the so-called three-dimensional (3D) algorithms, are also optimal. These algorithms use replication to reduce communication. We show that in any algorithm that uses O(n^2 / P^{2/3}) words of memory per processor, at least one processor must send or receive Ω(n^2 / P^{2/3}) words. Furthermore, we show a continuous tradeoff between the size of local memories and the amount of communication that must be performed. The 2D and 3D bounds are essentially instantiations of this tradeoff. We also show that if the input is distributed across the local memories of multiple nodes without replication, then Ω(n^2) words must cross any bisection cut of the machine. All our bounds apply only to conventional O(n^3) algorithms. They do not apply to Strassen's algorithm or other o(n^3) algorithms.
KW - Communication
KW - Distributed memory
KW - Lower bounds
KW - Matrix multiplication
UR - http://www.scopus.com/inward/record.url?scp=10844258198&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2004.03.021
DO - 10.1016/j.jpdc.2004.03.021
M3 - Article
AN - SCOPUS:10844258198
VL - 64
SP - 1017
EP - 1026
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
SN - 0743-7315
IS - 9
ER -