TY - GEN
T1 - Fast multiplication in binary fields on GPUs via register cache
AU - Ben-Sasson, Eli
AU - Hamilis, Matan
AU - Silberstein, Mark
AU - Tromer, Eran
N1 - Publisher Copyright:
© 2016 ACM.
PY - 2016/6/1
Y1 - 2016/6/1
N2 - Finite fields of characteristic 2 ("binary fields") are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in "large" binary fields of sizes greater than 2^32. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply a large number of finite field elements, and achieves high performance via bit-slicing and fine-grained parallelization. The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations. We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138× speedup for fields of size 2^32 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30× for larger fields of size below 2^256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory-based design.
AB - Finite fields of characteristic 2 ("binary fields") are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in "large" binary fields of sizes greater than 2^32. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply a large number of finite field elements, and achieves high performance via bit-slicing and fine-grained parallelization. The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations. We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138× speedup for fields of size 2^32 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30× for larger fields of size below 2^256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory-based design.
KW - Finite field multiplication
KW - GPGPU
KW - GPU code optimization
KW - Parallel algorithms
KW - SIMD
UR - http://www.scopus.com/inward/record.url?scp=84978535957&partnerID=8YFLogxK
U2 - 10.1145/2925426.2926259
DO - 10.1145/2925426.2926259
M3 - Conference contribution
AN - SCOPUS:84978535957
T3 - Proceedings of the International Conference on Supercomputing
BT - Proceedings of the 2016 International Conference on Supercomputing, ICS 2016
PB - Association for Computing Machinery
T2 - 30th International Conference on Supercomputing, ICS 2016
Y2 - 1 June 2016 through 3 June 2016
ER -