TY - GEN
T1 - The design, implementation, and evaluation of a banded linear solver for distributed-memory parallel computers
AU - Gupta, Anshul
AU - Gustavson, Fred G.
AU - Joshi, Mahesh
AU - Toledo, Sivan
N1 - Publisher Copyright:
© Springer-Verlag Berlin Heidelberg 1996.
PY - 1996
Y1 - 1996
N2 - This paper describes the design, implementation, and evaluation of a parallel algorithm for the Cholesky factorization of banded matrices. The algorithm is part of IBM's Parallel Engineering and Scientific Subroutine Library version 1.2 and is compatible with ScaLAPACK's banded solver. Analysis, as well as experiments on an IBM SP2 distributed-memory parallel computer, show that the algorithm efficiently factors banded matrices with wide bandwidth. For example, a 31-node SP2 factors a large matrix more than 16 times faster than a single node would factor it using the best sequential algorithm, and more than 20 times faster than a single node would using LAPACK's DPBTRF. The algorithm uses novel ideas in the area of distributed dense matrix computations, including the use of a dynamic schedule for a blocked systolic-like algorithm and the separation of the input and output data layouts from the layout the algorithm uses internally. The algorithm also uses known techniques such as blocking to improve its communication-to-computation ratio and its data-cache behavior.
AB - This paper describes the design, implementation, and evaluation of a parallel algorithm for the Cholesky factorization of banded matrices. The algorithm is part of IBM's Parallel Engineering and Scientific Subroutine Library version 1.2 and is compatible with ScaLAPACK's banded solver. Analysis, as well as experiments on an IBM SP2 distributed-memory parallel computer, show that the algorithm efficiently factors banded matrices with wide bandwidth. For example, a 31-node SP2 factors a large matrix more than 16 times faster than a single node would factor it using the best sequential algorithm, and more than 20 times faster than a single node would using LAPACK's DPBTRF. The algorithm uses novel ideas in the area of distributed dense matrix computations, including the use of a dynamic schedule for a blocked systolic-like algorithm and the separation of the input and output data layouts from the layout the algorithm uses internally. The algorithm also uses known techniques such as blocking to improve its communication-to-computation ratio and its data-cache behavior.
UR - http://www.scopus.com/inward/record.url?scp=84947903312&partnerID=8YFLogxK
U2 - 10.1007/3-540-62095-8_35
DO - 10.1007/3-540-62095-8_35
M3 - Conference contribution
AN - SCOPUS:84947903312
SN - 3540620958
SN - 9783540620952
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 328
EP - 340
BT - Applied Parallel Computing
A2 - Waśniewski, Jerzy
A2 - Olesen, Dorte
A2 - Dongarra, Jack
A2 - Madsen, Kaj
PB - Springer Verlag
T2 - 3rd International Workshop on Applied Parallel Computing in Industrial Problems and Optimization, PARA 1996
Y2 - 18 August 1996 through 21 August 1996
ER -