Fast multiplication in binary fields on GPUs via register cache

Eli Ben-Sasson, Matan Hamilis, Mark Silberstein, Eran Tromer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

16 Scopus citations

Abstract

Finite fields of characteristic 2 - "binary fields" - are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in "large" binary fields of sizes greater than 2^32. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply a large number of finite field elements, and achieves high performance via bit-slicing and fine-grained parallelization. The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations. We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138× speedup for fields of size 2^32 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30× for larger fields of size below 2^256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory based design.
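For readers unfamiliar with the operation being accelerated, the following is a minimal CPU-side sketch of multiplication in a binary field: a carry-less (XOR-based) polynomial product followed by reduction modulo an irreducible polynomial. It is not the GPU algorithm or the register cache technique from the paper. The function names `clmul32`, `reduce`, and `gf_mul` are hypothetical, and the small field GF(2^8) with modulus x^8 + x^4 + x^3 + x + 1 (0x11B) is an assumption chosen only because its results are easy to check by hand; the paper targets fields of size 2^32 and larger.

```cpp
#include <cstdint>

// Carry-less (polynomial) product of two operands over GF(2).
// Coefficient addition is XOR, so partial products never carry.
// This is the operation a CLMUL-style instruction performs in hardware.
std::uint64_t clmul32(std::uint32_t a, std::uint32_t b) {
    std::uint64_t acc = 0;
    for (int i = 0; i < 32; ++i) {
        if ((b >> i) & 1u) {
            acc ^= static_cast<std::uint64_t>(a) << i;
        }
    }
    return acc;
}

// Reduce polynomial p modulo `modulus`, a polynomial of degree `deg`
// (bit `deg` of `modulus` must be set, and deg <= 32 in this sketch).
std::uint64_t reduce(std::uint64_t p, std::uint64_t modulus, int deg) {
    for (int i = 63; i >= deg; --i) {
        if ((p >> i) & 1u) {
            p ^= modulus << (i - deg);  // cancel the term of degree i
        }
    }
    return p;
}

// Multiply two elements of GF(2^deg) represented as bit vectors.
// Illustrative only: the modulus is supplied by the caller and is
// assumed to be irreducible.
std::uint64_t gf_mul(std::uint32_t a, std::uint32_t b,
                     std::uint64_t modulus, int deg) {
    return reduce(clmul32(a, b), modulus, deg);
}
```

Under these assumptions, `gf_mul(0x57, 0x83, 0x11B, 8)` evaluates to `0xC1`, which matches the worked multiplication example in FIPS-197. Moving to a larger field only changes `modulus` and `deg`; the bit-sliced, warp-parallel formulation described in the abstract restructures this same arithmetic for GPU execution.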

Original language: English
Title of host publication: Proceedings of the 2016 International Conference on Supercomputing, ICS 2016
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450343619
DOIs
State: Published - 1 Jun 2016
Event: 30th International Conference on Supercomputing, ICS 2016 - Istanbul, Turkey
Duration: 1 Jun 2016 to 3 Jun 2016

Publication series

Name: Proceedings of the International Conference on Supercomputing
Volume: 01-03-June-2016

Conference

Conference: 30th International Conference on Supercomputing, ICS 2016
Country/Territory: Turkey
City: Istanbul
Period: 1/06/16 to 3/06/16

Funding

Funders (funder number, where given)
Israeli Ministry of Economics
Israeli Ministry of Science
Israel Science Foundation (1138/14)

    Keywords

    • Finite field multiplication
    • GPGPU
    • GPU code optimization
    • Parallel algorithms
    • SIMD
