TY - GEN
T1 - An FPGA implementation of pipelined multiplicative division with IEEE Rounding
AU - Goldberg, Ronen
AU - Even, Guy
AU - Seidel, Peter-M.
PY - 2007
Y1 - 2007
N2 - We report the results of an FPGA implementation of double precision floating-point division with IEEE rounding. We achieve a total latency (i.e., cycles times clock period) that is 2.6 times smaller than the latency of the fastest previous implementation on FPGAs. The amount of hardware, on the other hand, is comparable to commercial cores. The division circuit is based on Goldschmidt's algorithm. All IEEE rounding modes are supported and are implemented using dewpoint rounding. The precision of the initial approximation of the reciprocal is 14 bits. To save hardware and reduce the critical path, a half-sized 62x30 Booth radix-8 multiplier is used. This multiplier can receive both the multiplicand and the multiplier in carry-save representation. The division circuit is partitioned into four pipeline stages, has a latency of 11 cycles, and may restart a new double precision division operation after 8 cycles. Synthesis results of an implementation (not including the computation of the initial approximation of the reciprocal and the exponent path) guarantee a clock frequency of 131 MHz on an Altera Stratix II using 3592 ALMs. The implementation was successfully tested with over 10 million random vectors as well as over a million hard-to-round vectors.
AB - We report the results of an FPGA implementation of double precision floating-point division with IEEE rounding. We achieve a total latency (i.e., cycles times clock period) that is 2.6 times smaller than the latency of the fastest previous implementation on FPGAs. The amount of hardware, on the other hand, is comparable to commercial cores. The division circuit is based on Goldschmidt's algorithm. All IEEE rounding modes are supported and are implemented using dewpoint rounding. The precision of the initial approximation of the reciprocal is 14 bits. To save hardware and reduce the critical path, a half-sized 62x30 Booth radix-8 multiplier is used. This multiplier can receive both the multiplicand and the multiplier in carry-save representation. The division circuit is partitioned into four pipeline stages, has a latency of 11 cycles, and may restart a new double precision division operation after 8 cycles. Synthesis results of an implementation (not including the computation of the initial approximation of the reciprocal and the exponent path) guarantee a clock frequency of 131 MHz on an Altera Stratix II using 3592 ALMs. The implementation was successfully tested with over 10 million random vectors as well as over a million hard-to-round vectors.
UR - http://www.scopus.com/inward/record.url?scp=47349092816&partnerID=8YFLogxK
U2 - 10.1109/FCCM.2007.59
DO - 10.1109/FCCM.2007.59
M3 - Conference contribution
SN - 0-7695-2940-2
SN - 978-0-7695-2940-0
T3 - Proceedings 2007 IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2007
SP - 185
EP - 196
BT - 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007)
PB - IEEE
Y2 - 23 April 2007 through 25 April 2007
ER -