References of "Cheng, Hao 50027928"
Peer Reviewed
Batching CSIDH Group Actions using AVX-512
Cheng, Hao UL; Fotiadis, Georgios UL; Groszschädl, Johann UL et al

in IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES) (2021, August), 2021(4), 618-649

Commutative Supersingular Isogeny Diffie-Hellman (or CSIDH for short) is a recently-proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves and comes with some very attractive features, e.g. the ability to serve as a “drop-in” replacement for the standard elliptic curve Diffie-Hellman protocol. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. Consequently, there is a strong demand for optimizations that increase the efficiency of the class group action evaluation, which is not only important for CSIDH, but also for related cryptosystems like the signature schemes CSI-FiSh and SeaSign. In this paper, we explore how the AVX-512 vector extensions (incl. AVX-512F and AVX-512IFMA) can be utilized to optimize constant-time evaluation of the CSIDH-512 class group action with the goal of, respectively, maximizing throughput and minimizing latency. We introduce different approaches for batching group actions and computing them in SIMD fashion on modern Intel processors. In particular, we present a hybrid batching technique that, when combined with optimized (8 × 1)-way prime-field arithmetic, increases the throughput by a factor of 3.64 compared to a state-of-the-art (non-vectorized) x64 implementation. On the other hand, vectorization in a 2-way fashion aimed to reduce latency makes our AVX-512 implementation of the group action evaluation about 1.54 times faster than the state-of-the-art. To the best of our knowledge, this paper is the first to demonstrate the high potential of using vector instructions to increase the throughput (resp. decrease the latency) of constant-time CSIDH.
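
The (8 × 1)-way arrangement mentioned in the abstract can be pictured as a limb-sliced layout: eight independent field elements are split into limbs, and limb i of all eight operands shares one 512-bit vector, so each vector operation advances eight computations at once. The sketch below only simulates that layout with Python lists. The 52-bit limb width (the operand size of the AVX-512IFMA multiply instructions), the limb count, and all function names are assumptions made for illustration and are not taken from the paper.

```python
LANES = 8          # eight 64-bit lanes in one 512-bit vector
LIMB_BITS = 52     # assumed limb width (AVX-512IFMA multiplies 52-bit operands)
NUM_LIMBS = 10     # assumed: 10 * 52 = 520 bits, enough for CSIDH-512 field elements
MASK = (1 << LIMB_BITS) - 1


def to_limb_sliced(values):
    """Transpose 8 integers into NUM_LIMBS simulated vectors of 8 lanes each."""
    assert len(values) == LANES
    return [[(v >> (LIMB_BITS * i)) & MASK for v in values]
            for i in range(NUM_LIMBS)]


def from_limb_sliced(vectors):
    """Inverse of to_limb_sliced; also accepts limbs that still hold carries."""
    return [sum(vec[lane] << (LIMB_BITS * i) for i, vec in enumerate(vectors))
            for lane in range(LANES)]


def batched_add(a, b):
    """Lane-wise limb addition: one simulated vector add per limb.

    Carries are not propagated here; each 64-bit lane has 12 spare bits,
    so normalization can be deferred.
    """
    return [[x + y for x, y in zip(va, vb)] for va, vb in zip(a, b)]
```

In this layout the eight operands never interact across lanes, which is why such batching targets throughput rather than the latency of a single group action.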

Peer Reviewed
AVRNTRU: Lightweight NTRU-based Post-Quantum Cryptography for 8-bit AVR Microcontrollers
Cheng, Hao UL; Groszschädl, Johann UL; Roenne, Peter UL et al

in 2021 Design, Automation and Test in Europe Conference and Exhibition, DATE 2021, Grenoble, France, February 1-5, 2021, Proceedings (2021, February)

Introduced in 1996, NTRUEncrypt is not only one of the earliest but also one of the most scrutinized lattice-based cryptosystems and expected to remain secure in the upcoming era of quantum computing. Furthermore, NTRUEncrypt offers some efficiency benefits over “pre-quantum” cryptosystems like RSA or ECC since the low-level arithmetic operations are less computation-intensive and, thus, more suitable for constrained devices. In this paper we present AVRNTRU, a highly-optimized implementation of NTRUEncrypt for 8-bit AVR microcontrollers that we developed from scratch to reach high performance and resistance to timing attacks. AVRNTRU complies with the EESS #1 v3.1 specification and supports product-form parameter sets such as ees443ep1, ees587ep1, and ees743ep1. An entire encryption (including mask generation and blinding-polynomial generation) using the ees443ep1 parameters requires 847,973 clock cycles on an ATmega1281 microcontroller; the decryption is more costly and has an execution time of 1,051,871 cycles. We achieved these results with the help of a novel hybrid technique for multiplication in a truncated polynomial ring, whereby one of the operands is a sparse ternary polynomial in product form and the other an arbitrary element of the ring. A constant-time multiplication in the ring given by the ees443ep1 parameters takes only 192,577 cycles, which sets a new speed record for the arithmetic part of a lattice-based cryptosystem on AVR.
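
To make the phrase "sparse ternary polynomial in product form" concrete, the following minimal sketch multiplies an arbitrary ring element a by f = f1·f2 + f3 in Z_q[x]/(x^n − 1), where each f_i has only a few coefficients equal to +1 or −1. It is not the paper's hybrid technique and it is not constant-time; the function names, the index-list encoding of the sparse polynomials, and the toy parameters used when calling it are all assumptions for illustration.

```python
def sparse_mul(a, plus, minus, n, q):
    """Multiply dense a by a sparse ternary polynomial in Z_q[x]/(x^n - 1).

    plus / minus list the exponents whose coefficient is +1 / -1.
    """
    c = [0] * n
    for k in plus:                      # adding a rotated copy of a per +1 term
        for i in range(n):
            c[(i + k) % n] = (c[(i + k) % n] + a[i]) % q
    for k in minus:                     # subtracting a rotated copy per -1 term
        for i in range(n):
            c[(i + k) % n] = (c[(i + k) % n] - a[i]) % q
    return c


def product_form_mul(a, f1, f2, f3, n, q):
    """Compute a * (f1*f2 + f3) as (a*f1)*f2 + a*f3 using three sparse passes.

    Each f_i is a pair (plus_exponents, minus_exponents).
    """
    t = sparse_mul(a, *f1, n, q)
    t = sparse_mul(t, *f2, n, q)
    u = sparse_mul(a, *f3, n, q)
    return [(x + y) % q for x, y in zip(t, u)]
```

The work per pass grows with the number of nonzero coefficients rather than with n, and no coefficient multiplications are needed, only additions and subtractions of rotated copies of the dense operand.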

Peer Reviewed
Lightweight Post-quantum Key Encapsulation for 8-bit AVR Microcontrollers
Cheng, Hao UL; Groszschädl, Johann UL; Roenne, Peter UL et al

in Liardet, Pierre-Yvan; Mentens, Nele (Eds.) Smart Card Research and Advanced Applications, 19th International Conference, CARDIS 2020, Virtual Event, November 18–19, 2020, Revised Selected Papers (2020, November)

Recent progress in quantum computing has increased interest in the question of how well the existing proposals for post-quantum cryptosystems are suited to replace RSA and ECC. While some aspects of this question have already been researched in detail (e.g. the relative computational cost of pre- and post-quantum algorithms), very little is known about the RAM footprint of the proposals and what execution time they can reach when low memory consumption rather than speed is the main optimization goal. This question is particularly important in the context of the Internet of Things (IoT) since many IoT devices are extremely constrained and possess only a few kB of RAM. We aim to contribute to answering this question by exploring the software design space of the lattice-based key-encapsulation scheme ThreeBears on an 8-bit AVR microcontroller. More concretely, we provide new techniques for the optimization of the ring arithmetic of ThreeBears (which is, in essence, a 3120-bit modular multiplication) to achieve either high speed or low RAM footprint, and we analyze in detail the trade-offs between these two metrics. A low-memory implementation of BabyBear that is secure against Chosen Plaintext Attacks (CPA) needs just about 1.7 kB RAM, which is significantly below the RAM footprint of other lattice-based cryptosystems reported in the literature. Yet, the encapsulation time of this RAM-optimized BabyBear version is below 12.5 million cycles, which is less than the execution time of scalar multiplication on Curve25519. The decapsulation is more than four times faster and takes roughly 3.4 million cycles on an ATmega1284 microcontroller.
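
The "3120-bit modular multiplication" can be illustrated with a short sketch. It assumes the modulus N = 2^3120 − 2^1560 − 1 from the ThreeBears specification (the abstract does not state it) and relies on Python's big integers, so it says nothing about the limb-level AVR techniques of the paper. The function names are hypothetical.

```python
D_BITS = 3120
HALF = D_BITS // 2
N = (1 << D_BITS) - (1 << HALF) - 1    # assumed ThreeBears modulus
LOW_MASK = (1 << D_BITS) - 1


def reduce_mod_n(v):
    """Reduce a non-negative v modulo N by folding: 2^3120 = 2^1560 + 1 (mod N)."""
    while v >> D_BITS:
        hi, lo = v >> D_BITS, v & LOW_MASK
        v = lo + hi + (hi << HALF)     # replace hi * 2^3120 by hi * (2^1560 + 1)
    while v >= N:                      # final correction into [0, N)
        v -= N
    return v


def mulmod(a, b):
    """Modular multiplication of two ring elements."""
    return reduce_mod_n(a * b)


ELEMENT_BYTES = D_BITS // 8            # 390 bytes of RAM per ring element
```

At 390 bytes per element, a budget of about 1.7 kB leaves room for only roughly four full ring elements at once, which gives a rough sense of why the RAM-optimized variant has to be careful about intermediate values.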

Peer Reviewed
High-Throughput Elliptic Curve Cryptography Using AVX2 Vector Instructions
Cheng, Hao UL; Groszschädl, Johann UL; Tian, Jiaqi UL et al

in Dunkelman, Orr; Jacobson Jr., Michael J.; O'Flynn, Colin (Eds.) Selected Areas in Cryptography, 27th International Conference, Halifax, NS, Canada (Virtual Event), October 21-23, 2020, Revised Selected Papers (2020, October)

Single Instruction Multiple Data (SIMD) execution engines like Intel’s Advanced Vector Extensions 2 (AVX2) offer great potential to accelerate elliptic curve cryptography compared to implementations using only basic x64 instructions. All existing AVX2 implementations of scalar multiplication on e.g. Curve25519 (and alternative curves) are optimized for low latency. We argue in this paper that many real-world applications, such as server-side SSL/TLS handshake processing, would benefit more from throughput-optimized implementations than latency-optimized ones. To support this argument, we introduce a throughput-optimized AVX2 implementation of variable-base scalar multiplication on Curve25519 and fixed-base scalar multiplication on Ed25519. Both implementations perform four scalar multiplications in parallel, where each uses a 64-bit element of a 256-bit vector. The field arithmetic is based on a radix-2^29 representation of the field elements, which makes it possible to carry out four parallel multiplications modulo a multiple of p=2^255−19 in just 88 cycles on a Skylake CPU. Four variable-base scalar multiplications on Curve25519 require less than 250,000 Skylake cycles, which translates to a throughput of 32,318 scalar multiplications per second at a clock frequency of 2 GHz. For comparison, the best latency-optimized AVX2 implementation to date has a throughput of some 21,000 scalar multiplications per second on the same Skylake CPU.
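
The throughput figures above follow from simple arithmetic, which the sketch below spells out. It also checks one property of the radix-2^29 representation: nine products of 29-bit limbs can be accumulated in a 64-bit lane without overflow. The limb count of nine and the helper names are assumptions for illustration; the abstract states only the radix.

```python
CLOCK_HZ = 2_000_000_000      # 2 GHz, as stated in the abstract
BATCH = 4                     # four scalar multiplications per vectorized run


def throughput(cycles_per_batch, clock_hz=CLOCK_HZ, batch=BATCH):
    """Scalar multiplications per second for a batch costing the given cycles."""
    return batch * clock_hz / cycles_per_batch


def cycles_for_throughput(ops_per_second, clock_hz=CLOCK_HZ, batch=BATCH):
    """Cycles per batch implied by a reported throughput."""
    return batch * clock_hz / ops_per_second


# Radix-2^29 limbs: a product of two 29-bit limbs has at most 58 bits, so
# nine such products can be summed in a 64-bit lane without overflow.
LIMB_BITS = 29
NUM_LIMBS = 9                 # assumed: 9 * 29 = 261 bits >= 255 bits
MAX_COLUMN_SUM = NUM_LIMBS * ((1 << LIMB_BITS) - 1) ** 2
```

A batch costing exactly 250,000 cycles would give 32,000 operations per second, so the reported 32,318 corresponds to roughly 247,500 cycles per batch, consistent with "less than 250,000".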

Peer Reviewed
A Lightweight Implementation of NTRU Prime for the Post-Quantum Internet of Things
Cheng, Hao UL; Dinu, Dumitru-Daniel; Groszschädl, Johann UL et al

in Laurent, Maryline; Giannetsos, Thanassis (Eds.) Information Security Theory and Practice, 13th IFIP WG 11.2 International Conference, WISTP 2019, Paris, France, December 11–12, 2019, Proceedings (2019, December)

The dawning era of quantum computing has prompted various initiatives for the standardization of post-quantum cryptosystems with the goal of (eventually) replacing RSA and ECC. NTRU Prime is a variant of the classical NTRU cryptosystem that comes with a couple of tweaks to minimize the attack surface; most notably, it avoids rings with "worrisome" structure. This paper presents, to our knowledge, the first assembler-optimized implementation of Streamlined NTRU Prime for an 8-bit AVR microcontroller and shows that high-security lattice-based cryptography is feasible for small IoT devices. An encapsulation operation using parameters for 128-bit post-quantum security requires 8.2 million clock cycles when executed on an 8-bit ATmega1284 microcontroller. The decapsulation is approximately twice as costly and has an execution time of 15.6 million cycles. We achieved this performance through (i) new low-level software optimization techniques to accelerate Karatsuba-based polynomial multiplication on the 8-bit AVR platform and (ii) an efficient implementation of the coefficient modular reduction written in assembly language. The execution time of encapsulation and decapsulation is independent of secret data, which makes our software resistant against timing attacks. Finally, we assess the performance one could theoretically gain by using a so-called product-form polynomial as part of the secret key and discuss potential security implications.
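
As a rough illustration of Karatsuba-based polynomial multiplication, the sketch below multiplies two coefficient lists recursively and then reduces the product in Z_q[x]/(x^p − x − 1), the ring used by Streamlined NTRU Prime. It is a textbook formulation rather than the paper's AVR-specific optimization, it is not constant-time, and the function names, the recursion threshold, and any toy parameters passed to it are assumptions.

```python
def schoolbook(a, b):
    """Plain O(n^2) polynomial product of coefficient lists (low degree first)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return c


def karatsuba(a, b, threshold=4):
    """Karatsuba product of two equal-length coefficient lists."""
    n = len(a)
    if n <= threshold:
        return schoolbook(a, b)
    h = n // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    low = karatsuba(a0, b0, threshold)             # a0 * b0
    high = karatsuba(a1, b1, threshold)            # a1 * b1
    pad = [0] * (len(a1) - h)                      # a1 may be one longer than a0
    mid = karatsuba([x + y for x, y in zip(a0 + pad, a1)],
                    [x + y for x, y in zip(b0 + pad, b1)], threshold)
    c = [0] * (2 * n - 1)
    for i, v in enumerate(low):
        c[i] += v
        mid[i] -= v                                # mid becomes a0*b1 + a1*b0
    for i, v in enumerate(high):
        c[i + 2 * h] += v
        mid[i] -= v
    for i, v in enumerate(mid):
        c[i + h] += v
    return c


def ring_mul(a, b, p, q):
    """Multiply in Z_q[x]/(x^p - x - 1), using x^p = x + 1 to fold high terms."""
    c = karatsuba(a, b)
    for i in range(2 * p - 2, p - 1, -1):
        c[i - p] += c[i]
        c[i - p + 1] += c[i]
    return [v % q for v in c[:p]]
```

Each recursion level replaces four half-size products by three, at the cost of a few extra additions and subtractions, which is the trade-off an 8-bit implementation has to tune.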

Peer Reviewed
A Lightweight Implementation of NTRUEncrypt for 8-bit AVR Microcontrollers
Cheng, Hao UL; Groszschädl, Johann UL; Roenne, Peter UL et al

E-print/Working paper (2019)

Introduced in 1996, NTRUEncrypt is not only one of the earliest but also one of the most scrutinized lattice-based cryptosystems and a serious contender in NIST’s ongoing Post-Quantum Cryptography (PQC) standardization project. An important criterion for the assessment of candidates is their computational cost in various hardware and software environments. This paper contributes to the evaluation of NTRUEncrypt on the ATmega class of AVR microcontrollers, which is among the most popular 8-bit platforms in the embedded domain. More concretely, we present AvrNtru, a carefully-optimized implementation of NTRUEncrypt that we developed from scratch with the goal of achieving high performance and resistance to timing attacks. AvrNtru complies with version 3.3 of the EESS#1 specification and supports recent product-form parameter sets like ees443ep1, ees587ep1, and ees743ep1. A full encryption operation (including mask generation and blinding-polynomial generation) using the ees443ep1 parameters takes 834,272 clock cycles on an ATmega1281 microcontroller; the decryption is slightly more costly and has an execution time of 1,061,683 cycles. When choosing the ees743ep1 parameters to achieve a 256-bit security level, encryption takes 1,539,829 clock cycles and decryption 2,103,228 clock cycles. We achieved these results thanks to a novel hybrid technique for multiplication in truncated polynomial rings where one of the operands is a sparse ternary polynomial in product form. Our hybrid technique is inspired by Gura et al.’s hybrid method for multiple-precision integer multiplication (CHES 2004) and takes advantage of the large register file of the AVR architecture to minimize the number of load instructions. A constant-time multiplication in the ring specified by the ees443ep1 parameters requires only 210,827 cycles, which sets a new speed record for the arithmetic component of a lattice-based cryptosystem on an 8-bit microcontroller.

Peer Reviewed
Efficient Implementation of the SHA-512 Hash Function for 8-bit AVR Microcontrollers
Cheng, Hao UL; Dinu, Dumitru-Daniel; Groszschädl, Johann UL

in Lanet, Jean-Louis; Toma, Cristian (Eds.) Innovative Security Solutions for Information Technology and Communications, 11th International Conference, SecITC 2018, Bucharest, Romania, November 8-9, 2018, Revised Selected Papers (2018, November)

SHA-512 is a member of the SHA-2 family of cryptographic hash algorithms that is based on a Davies-Meyer compression function operating on eight 64-bit words to produce a 512-bit digest. It provides strong resistance to collision and preimage attacks, and is assumed to remain secure in the dawning era of quantum computers. However, the compression function of SHA-512 is challenging to implement on small 8- and 16-bit microcontrollers because of their limited register space and the fact that 64-bit rotations are generally slow on such devices. In this paper, we present the first highly-optimized Assembler implementation of SHA-512 for the ATmega family of 8-bit AVR microcontrollers. We introduce a special optimization technique for the compression function based on a duplication of the eight working variables so that they can be more efficiently loaded from RAM via the indirect addressing mode with displacement (using the ldd and std instructions). In this way, we were able to achieve high performance without unrolling the main loop of the compression function, thereby keeping the code size small. When executed on an 8-bit AVR ATmega128 microcontroller, the compression function takes slightly less than 60k clock cycles, which corresponds to a compression rate of roughly 467 cycles per byte. The binary code size of the full SHA-512 implementation providing a standard Init-Update-Final (IUF) interface amounts to approximately 3.5 kB.
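
To illustrate why 64-bit rotations are awkward on an 8-bit processor, the sketch below performs a right rotation on eight separate bytes: rotating by a multiple of 8 is a pure byte move, while the remaining 1 to 7 bits need a carry chain across all eight bytes for every single-bit step. This is a simplified model written for this listing, not the paper's technique (which concerns duplicating the working variables); the function names are hypothetical. The constants at the end reproduce the cycles-per-byte arithmetic for a 128-byte SHA-512 message block.

```python
MASK64 = (1 << 64) - 1


def rotr64(x, n):
    """Reference 64-bit right rotation on a single integer."""
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64


def rotr64_bytes(x, n):
    """Right rotation done byte by byte: byte moves plus 1-bit carry chains."""
    b = list(x.to_bytes(8, "little"))       # b[0] is the least significant byte
    byte_moves, bit_steps = divmod(n % 64, 8)
    b = b[byte_moves:] + b[:byte_moves]     # rotating by 8*k only relabels bytes
    for _ in range(bit_steps):
        carry = b[0] & 1                    # bit that wraps around to the top
        for i in range(7, -1, -1):          # shift right from the top byte down
            nxt = b[i] & 1
            b[i] = (b[i] >> 1) | (carry << 7)
            carry = nxt
    return int.from_bytes(bytes(b), "little")


BLOCK_BYTES = 128                           # SHA-512 uses 1024-bit message blocks
CYCLES_PER_BLOCK = 467 * BLOCK_BYTES        # 59,776, i.e. slightly below 60k
```

In this model a rotation by 8 costs no bit steps at all, whereas a rotation by 39 needs seven passes over all eight bytes unless it is restructured, for example as a rotation by 40 followed by one step in the opposite direction.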
