Abstract
This paper studies two compilation techniques for enhancing scalar performance in high-speed scientific processors: software pipelining and loop unrolling. We study the impact of the architecture 1990 and of the hardware (size of instruction buffer) on the efficiency of loop unrolling. We also develop a methodology for classifying software pipelining techniques. For loop unrolling, a straightforward scheduling algorithm is shown to produce near-optimal results when not inhibited by recurrences or memory hazards. Our study indicates that the performance produced with a modified CRAY-1S scalar architecture and a code scheduler utilizing loop unrolling is comparable to the performance achieved by the CRAY-1S with a vector unit and the CFT vectorizing compiler. Finally, we show that the combination of loop unrolling and dynamic software pipelining, as implemented by a decoupled computer, substantially outperforms the vector CRAY-1S.
Original language | English |
---|---|
Pages (from-to) | 223-245 |
Number of pages | 23 |
Journal | ACM Transactions on Mathematical Software |
Volume | 16 |
Issue number | 3 |
DOIs | |
State | Published - 9 Jan 1990 |
Externally published | Yes |