Reference : Efficient Arithmetic on ARM-NEON and Its Application for High-Speed RSA Implementation
Scientific journals : Article
Engineering, computing & technology : Computer science
Security, Reliability and Trust
Efficient Arithmetic on ARM-NEON and Its Application for High-Speed RSA Implementation
Seo, Hwajeong [Pusan National University > School of Computer Science and Engineering]
Liu, Zhe [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)]
Groszschädl, Johann [University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)]
Kim, Howon [Pusan National University > School of Computer Science and Engineering]
Security and Communication Networks
John Wiley & Sons
Yes (verified by ORBilu)
United Kingdom
[en] Public-Key Cryptography ; Multiple-Precision Arithmetic ; Modular Reduction ; SIMD-Level Parallelism ; Vector Instructions ; ARM NEON
[en] A steadily increasing number of modern processors support Single Instruction Multiple Data (SIMD) instructions to speed up multimedia, communication, and security applications. The computational power of Intel's SSE and AVX extensions as well as ARM's NEON engine has initiated a body of research on SIMD-parallel implementation of multiple-precision integer arithmetic operations, in particular modular multiplication and modular squaring, which are performance-critical components of widely-used public-key cryptosystems such as RSA, DSA, Diffie-Hellman, and their elliptic-curve variants ECDSA and ECDH. In this paper, we introduce the Double Operand Scanning (DOS) method for multiple-precision squaring and describe its implementation for ARM NEON processors. The DOS method uses a full-radix representation of the operand to be squared and aims to maximize performance by reducing the number of Read-After-Write (RAW) dependencies between source and destination registers. We also analyze the benefits of applying Karatsuba's technique to both multiple-precision multiplication and squaring, and present an optimized implementation of Montgomery's algorithm for modular reduction. Our performance evaluation shows that the DOS method along with the other optimizations described in this paper allows one to execute a full 2048-bit modular exponentiation in about 14.25 million clock cycles on an ARM Cortex-A15 processor, which is significantly faster than previously-reported RSA implementations for the ARM-NEON platform.
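To illustrate the role that Montgomery's algorithm plays in the modular exponentiation mentioned in the abstract, here is a minimal sketch in Python. It is not the paper's NEON implementation and does not show the DOS squaring method; it relies on Python's built-in big integers instead of word-level vector arithmetic. The function names (`montgomery_setup`, `redc`, `mont_pow`) are hypothetical, and the sketch assumes an odd modulus and Python 3.8 or later (for `pow(n, -1, R)`).

```python
def montgomery_setup(n, k):
    """Return (R, n_prime) for an odd modulus n, with R = 2**k > n."""
    R = 1 << k
    # n_prime satisfies n * n_prime ≡ -1 (mod R); it exists because n is odd.
    n_prime = (-pow(n, -1, R)) % R
    return R, n_prime


def redc(t, n, k, n_prime):
    """Montgomery reduction: return t * R^-1 mod n for 0 <= t < n * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n_prime mod R
    u = (t + m * n) >> k                # t + m*n is divisible by R exactly
    return u - n if u >= n else u       # single conditional subtraction


def mont_pow(base, exp, n):
    """Left-to-right square-and-multiply using Montgomery reduction.

    Assumption: n is odd (as any RSA modulus is) and exp >= 0.
    This sketch is not constant-time.
    """
    k = n.bit_length()
    R, n_prime = montgomery_setup(n, k)
    x = (base % n) * R % n              # base in Montgomery form
    acc = R % n                         # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = redc(acc * acc, n, k, n_prime)      # modular squaring
        if bit == "1":
            acc = redc(acc * x, n, k, n_prime)    # modular multiplication
    return redc(acc, n, k, n_prime)     # convert back from Montgomery form
```

In this loop the squaring step runs once per exponent bit while the multiplication runs only for set bits, which is consistent with the abstract's emphasis on modular squaring as a performance-critical operation.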

File(s) associated to this reference

Fulltext file(s):

Limited access
SCN2016.pdf (Author postprint, 99.74 kB)

