ORBi

Highly Vectorized SIKE for AVX-512
Cheng, Hao ; Fotiadis, Georgios ; Groszschädl, Johann et al
in IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES) (2022, February), 2022(2), 41-68
It is generally accepted that a large-scale quantum computer would be capable of breaking any public-key cryptosystem used today, thereby posing a serious threat to the security of the Internet’s public-key infrastructure. The US National Institute of Standards and Technology (NIST) addresses this threat with an open process for the standardization of quantum-safe key establishment and signature schemes, which is now in the final phase of the evaluation of candidates. SIKE (an abbreviation of Supersingular Isogeny Key Encapsulation) is one of the alternate candidates under evaluation and distinguishes itself from other candidates due to relatively short key lengths and relatively high computing costs. In this paper, we analyze how the latest generation of Intel’s Advanced Vector Extensions (AVX), in particular AVX-512IFMA, can be used to minimize the latency (resp. maximize the throughput) of the SIKE key encapsulation mechanism when executed on Ice Lake CPUs based on the Sunny Cove microarchitecture. We present various techniques to parallelize and speed up the base/extension field arithmetic, point arithmetic, and isogeny computations performed by SIKE. All these parallel processing techniques are combined in AVXSIKE, a highly optimized implementation of SIKE using Intel AVX-512IFMA instructions. Our experiments indicate that AVXSIKE instantiated with the SIKEp503 parameter set is approximately 1.5 times faster than the best AVX-512IFMA-based SIKE software in the literature to date.
When executed on an Intel Core i3-1005G1 CPU, AVXSIKE outperforms the x64 assembly implementation of SIKE contained in Microsoft’s SIDHv3.4 library by a factor of about 2.5 for key generation and decapsulation, while the encapsulation is even 3.2 times faster.

Lightweight EdDSA Signature Verification for the Ultra-Low-Power Internet of Things
Groszschädl, Johann ; Franck, Christian ;
in Deng, Robert; Bao, Feng; Wang, Guilin (Eds.) et al Information Security Practice and Experience, 16th International Conference, ISPEC 2021, Nanjing, China, December 17–19, 2021, Proceedings (2021, December)
EdDSA is a digital signature scheme based on elliptic curves in Edwards form that is supported in the latest incarnation of the TLS protocol (i.e. TLS version 1.3). The straightforward way of verifying an EdDSA signature involves a costly double-scalar multiplication of the form kP - lQ where P is a "fixed" point (namely the generator of the underlying elliptic-curve group) and Q is only known at run time. This computation makes a verification not only much slower than a signature generation, but also more memory demanding. In the present paper we compare two implementations of EdDSA verification using Ed25519 as case study; the first is speed-optimized, while the other aims to achieve low RAM footprint. The speed-optimized variant performs the double-scalar multiplication in a simultaneous fashion and uses a Joint-Sparse Form (JSF) representation for the two scalars.
On the other hand, the memory-optimized variant splits the computation of kP - lQ into two separate parts, namely a fixed-base scalar multiplication that is carried out using a standard comb method with eight pre-computed points, and a variable-base scalar multiplication, which is executed by means of the conventional Montgomery ladder on the birationally-equivalent Montgomery curve. Our experiments with a 16-bit ultra-low-power MSP430 microcontroller show that the separated method is 24% slower than the simultaneous technique, but reduces the RAM footprint by 40%. This makes the separated method attractive for "lightweight" cryptographic libraries, in particular if both Ed25519 signature generation/verification and X25519 key exchange need to be supported.

An Evaluation of the Multi-Platform Efficiency of Lightweight Cryptographic Permutations
Cardoso Dos Santos, Luan ; Groszschädl, Johann
in Ryan, Peter Y A; Toma, Cristian (Eds.) Innovative Security Solutions for Information Technology and Communications, 14th International Conference, SECITC 2021, Virtual Event, November 25-26, 2021, Revised Selected Papers (2021, November)
Permutation-based symmetric cryptography has become increasingly popular over the past ten years, especially in the lightweight domain. More than half of the 32 second-round candidates of NIST's lightweight cryptography standardization project are permutation-based designs or can be instantiated with a permutation. The performance of a permutation-based construction depends, among other aspects, on the rate (i.e. the number of bytes processed per call of the permutation function) and the execution time of the permutation.
In this paper we analyze the execution time and code size of assembler implementations of the permutation of Ascon, Gimli, Schwaemm, and Xoodyak on an 8-bit AVR and a 32-bit ARM Cortex-M3 microcontroller. Our aim is to ascertain how well these four permutations perform on microcontrollers with very different architectural and micro-architectural characteristics such as the available register capacity or the latency of multi-bit shifts and rotations. We also determine the impact of flash wait states on the execution time of the permutations on Cortex-M3 development boards with 0, 2, and 4 wait states. Our results show that the throughput (in terms of permutation time divided by rate when the capacity is fixed to 256 bits) of the permutation of Ascon, Schwaemm, and Xoodyak is similar on ARM Cortex-M3 and lies in the range of 41.1 to 48.6 cycles per rate-byte. However, on an 8-bit AVR ATmega128, the permutation of Schwaemm outperforms its counterparts of Ascon and Xoodyak by a factor of 1.20 and 1.59, respectively.

Optimized Implementation of SHA-512 for 16-bit MSP430 Microcontrollers
Franck, Christian ; Groszschädl, Johann
in Ryan, Peter Y A; Toma, Cristian (Eds.) Innovative Security Solutions for Information Technology and Communications, 14th International Conference, SECITC 2021, Virtual Event, November 25-26, 2021, Revised Selected Papers (2021, November)
The enormous growth of the Internet of Things (IoT) in the recent past has fueled a strong demand for lightweight implementations of cryptosystems, i.e. implementations that are efficient enough to run on resource-limited devices like sensor nodes.
However, most of today's widely-used cryptographic algorithms, including the AES or the SHA2 family of hash functions, were already designed some 20 years ago and did not take efficiency in restricted environments into account. In this paper, we introduce implementation options and software optimization techniques to reduce the execution time of SHA-512 on 16-bit MSP430 microcontrollers. These optimizations include a novel register allocation strategy for the 512-bit hash state, a fast "on-the-fly" message schedule with low RAM footprint, special pointer arithmetic to avoid the need to copy state words, as well as instruction sequences for multi-bit rotation of 64-bit operands. Thanks to the combination of all these optimization techniques, our hand-written MSP430 Assembler code for the SHA-512 compression function reaches an execution time of roughly 40.6k cycles on an MSP430F1611 microcontroller. Hashing a message of 1000 bytes takes slightly below 338k clock cycles, which corresponds to a hash rate of about 338 cycles/byte. This execution time sets a new speed record for hashing with 256 bits of security on a 16-bit platform and improves the time needed by the fastest C implementations by a factor of 2.3. In addition, our implementation is extremely small in terms of code size (roughly 2.1k bytes) and has a RAM footprint of only 390 bytes.

Batching CSIDH Group Actions using AVX-512
Cheng, Hao ; Fotiadis, Georgios ; Groszschädl, Johann et al
in IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES) (2021, August), 2021(4), 618-649
Commutative Supersingular Isogeny Diffie-Hellman (or CSIDH for short) is a recently-proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves and comes with some very attractive features, e.g. the ability to serve as a “drop-in” replacement for the standard elliptic curve Diffie-Hellman protocol. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. Consequently, there is a strong demand for optimizations that increase the efficiency of the class group action evaluation, which is not only important for CSIDH, but also for related cryptosystems like the signature schemes CSI-FiSh and SeaSign. In this paper, we explore how the AVX-512 vector extensions (incl. AVX-512F and AVX-512IFMA) can be utilized to optimize constant-time evaluation of the CSIDH-512 class group action with the goal of, respectively, maximizing throughput and minimizing latency. We introduce different approaches for batching group actions and computing them in SIMD fashion on modern Intel processors. In particular, we present a hybrid batching technique that, when combined with optimized (8 × 1)-way prime-field arithmetic, increases the throughput by a factor of 3.64 compared to a state-of-the-art (non-vectorized) x64 implementation. On the other hand, vectorization in a 2-way fashion aimed at reducing latency makes our AVX-512 implementation of the group action evaluation about 1.54 times faster than the state-of-the-art. To the best of our knowledge, this paper is the first to demonstrate the high potential of using vector instructions to increase the throughput (resp. decrease the latency) of constant-time CSIDH.
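As a rough illustration of the batching idea in the abstract above, the following sketch emulates eight independent prime-field multiplications that advance in lock-step, one per vector lane, in plain Python. The function name `batched_modmul`, the `LANES` constant, and the toy modulus are assumptions made for illustration and are not taken from the paper; real AVX-512 code would use vector instructions and a multi-limb representation of the CSIDH-512 prime.

```python
LANES = 8  # assumed: one independent field operation per 64-bit lane of a 512-bit vector


def batched_modmul(a, b, p):
    """Multiply LANES independent operand pairs modulo p, lane by lane."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("expected exactly %d operands per batch" % LANES)
    # Every lane performs the same operation on different data (SIMD style),
    # so throughput grows with the batch while per-result latency does not shrink.
    return [(x * y) % p for x, y in zip(a, b)]


# Toy modulus for illustration only (NOT the CSIDH-512 prime).
P = 2**61 - 1
a = [3, 5, 7, 11, 13, 17, 19, 23]
b = [2] * LANES
print(batched_modmul(a, b, P))
```

A throughput-oriented design of this kind only pays off when enough independent operations are available to fill all lanes, which is why the abstract treats latency-oriented 2-way vectorization as a separate goal.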

AVRNTRU: Lightweight NTRU-based Post-Quantum Cryptography for 8-bit AVR Microcontrollers
Cheng, Hao ; Groszschädl, Johann ; Roenne, Peter et al
in 2021 Design, Automation and Test in Europe Conference and Exhibition, DATE 2021, Grenoble, France, February 1-5, 2021, Proceedings (2021, February)
Introduced in 1996, NTRUEncrypt is not only one of the earliest but also one of the most scrutinized lattice-based cryptosystems and expected to remain secure in the upcoming era of quantum computing. Furthermore, NTRUEncrypt offers some efficiency benefits over “pre-quantum” cryptosystems like RSA or ECC since the low-level arithmetic operations are less computation-intensive and, thus, more suitable for constrained devices. In this paper we present AVRNTRU, a highly-optimized implementation of NTRUEncrypt for 8-bit AVR microcontrollers that we developed from scratch to reach high performance and resistance to timing attacks. AVRNTRU complies with the EESS #1 v3.1 specification and supports product-form parameter sets such as ees443ep1, ees587ep1, and ees743ep1. An entire encryption (including mask generation and blinding-polynomial generation) using the ees443ep1 parameters requires 847973 clock cycles on an ATmega1281 microcontroller; the decryption is more costly and has an execution time of 1051871 cycles. We achieved these results with the help of a novel hybrid technique for multiplication in a truncated polynomial ring, whereby one of the operands is a sparse ternary polynomial in product form and the other an arbitrary element of the ring.
A constant-time multiplication in the ring given by the ees443ep1 parameters takes only 192577 cycles, which sets a new speed record for the arithmetic part of a lattice-based cryptosystem on AVR.

Lightweight Post-quantum Key Encapsulation for 8-bit AVR Microcontrollers
Cheng, Hao ; Groszschädl, Johann ; Roenne, Peter et al
in Liardet, Pierre-Yvan; Mentens, Nele (Eds.) Smart Card Research and Advanced Applications, 19th International Conference, CARDIS 2020, Virtual Event, November 18–19, 2020, Revised Selected Papers (2020, November)
Recent progress in quantum computing has increased interest in the question of how well the existing proposals for post-quantum cryptosystems are suited to replace RSA and ECC. While some aspects of this question have already been researched in detail (e.g. the relative computational cost of pre- and post-quantum algorithms), very little is known about the RAM footprint of the proposals and what execution time they can reach when low memory consumption rather than speed is the main optimization goal. This question is particularly important in the context of the Internet of Things (IoT) since many IoT devices are extremely constrained and possess only a few kB of RAM. We aim to contribute to answering this question by exploring the software design space of the lattice-based key-encapsulation scheme ThreeBears on an 8-bit AVR microcontroller. More concretely, we provide new techniques for the optimization of the ring arithmetic of ThreeBears (which is, in essence, a 3120-bit modular multiplication) to achieve either high speed or low RAM footprint, and we analyze in detail the trade-offs between these two metrics.
A low-memory implementation of BabyBear that is secure against Chosen Plaintext Attacks (CPA) needs just about 1.7 kB RAM, which is significantly below the RAM footprint of other lattice-based cryptosystems reported in the literature. Yet, the encapsulation time of this RAM-optimized BabyBear version is below 12.5 million cycles, which is less than the execution time of scalar multiplication on Curve25519. The decapsulation is more than four times faster and takes roughly 3.4 million cycles on an ATmega1284 microcontroller.

Parallel Implementation of SM2 Elliptic Curve Cryptography on Intel Processors with AVX2
; ; et al
in Liu, Joseph K.; Cui, Hui (Eds.) Information Security and Privacy, 25th Australasian Conference, ACISP 2020, Perth, WA, Australia, November 30 - December 2, 2020, Proceedings (2020, November)
This paper presents an efficient and secure implementation of SM2, the Chinese elliptic curve cryptography standard that has been adopted by the International Organization for Standardization (ISO) as ISO/IEC 14888-3:2018. Our SM2 implementation uses Intel’s Advanced Vector Extensions version 2.0 (AVX2), a family of three-operand SIMD instructions operating on vectors of 8, 16, 32, or 64-bit data elements in 256-bit registers, and is resistant against timing attacks. To exploit the parallel processing capabilities of AVX2, we studied the execution flows of Co-Z Jacobian point arithmetic operations and introduce a parallel 2-way Co-Z addition, Co-Z conjugate addition, and Co-Z ladder algorithm, which allow for fast Co-Z scalar multiplication. Furthermore, we developed an efficient 2-way prime-field arithmetic library using AVX2 to support our Co-Z Jacobian point operations.
Both the field and the point operations utilize branch-free (i.e. constant-time) implementation techniques, which increase their ability to resist Simple Power Analysis (SPA) and timing attacks. Our software for scalar multiplication on the SM2 curve is, to our knowledge, the first constant-time implementation of the Co-Z based ladder that leverages the parallelism of AVX2.

Fast and Flexible Elliptic Curve Cryptography for Dining Cryptographers Networks
Dupont, Elona ; Franck, Christian ; Groszschädl, Johann
in Bouzefrane, Samia; Laurent, Maryline; Boumerdassi, Selma (Eds.) et al Mobile, Secure, and Programmable Networking, 6th International Conference, MSPN 2020, Paris, France, October 28–29, 2020, Revised Selected Papers (2020, October)
A Dining Cryptographers network (DCnet for short) allows anonymous communication with sender and receiver untraceability even if an adversary has unlimited access to the connection metadata of the network. Originally introduced by David Chaum in the 1980s, DCnets were for a long time considered not practical for real-world applications because of the tremendous communication and computation overhead they introduce. However, technological innovations such as 5G networks and extremely powerful 64-bit processors make a good case to reassess the practicality of DCnets. In addition, recent advances in elliptic-curve based commitment schemes and Zero-Knowledge Proofs (ZKPs) provide a great opportunity to reduce the computational cost of modern DCnets that are able to detect malicious behavior of communicating parties.
In this paper we introduce X64ECC, a self-contained library for Elliptic Curve Cryptography (ECC) developed from scratch to support all the public-key operations needed by modern DCnets: key exchange, digital signatures, Pedersen commitments, and ZKPs. X64ECC is written in C and uses compiler intrinsics to speed up performance-critical arithmetic operations. It is highly scalable and works with Montgomery curves and twisted Edwards curves of different cryptographic strength. Despite its high scalability and portability, X64ECC is able to compute a fixed-base scalar multiplication on a twisted Edwards curve over a 255-bit prime field in about 145,000 clock cycles on a modern Intel X64 processor. All cryptosystems can be adapted on-the-fly (i.e. without recompilation) to implement DCnets with arbitrary message sizes, and tradeoffs between the cryptographic strength and throughput of a DCnet are possible.

High-Throughput Elliptic Curve Cryptography Using AVX2 Vector Instructions
Cheng, Hao ; Groszschädl, Johann ; Tian, Jiaqi et al
in Dunkelman, Orr; Jacobson Jr., Michael J.; O'Flynn, Colin (Eds.) Selected Areas in Cryptography, 27th International Conference, Halifax, NS, Canada (Virtual Event), October 21-23, 2020, Revised Selected Papers (2020, October)
Single Instruction Multiple Data (SIMD) execution engines like Intel’s Advanced Vector Extensions 2 (AVX2) offer a great potential to accelerate elliptic curve cryptography compared to implementations using only basic x64 instructions. All existing AVX2 implementations of scalar multiplication on e.g. Curve25519 (and alternative curves) are optimized for low latency.
We argue in this paper that many real-world applications, such as server-side SSL/TLS handshake processing, would benefit more from throughput-optimized implementations than latency-optimized ones. To support this argument, we introduce a throughput-optimized AVX2 implementation of variable-base scalar multiplication on Curve25519 and fixed-base scalar multiplication on Ed25519. Both implementations perform four scalar multiplications in parallel, where each uses a 64-bit element of a 256-bit vector. The field arithmetic is based on a radix-2^29 representation of the field elements, which makes it possible to carry out four parallel multiplications modulo a multiple of p=2^255−19 in just 88 cycles on a Skylake CPU. Four variable-base scalar multiplications on Curve25519 require less than 250,000 Skylake cycles, which translates to a throughput of 32,318 scalar multiplications per second at a clock frequency of 2 GHz. For comparison, the best latency-optimized AVX2 implementation to date has a throughput of some 21,000 scalar multiplications per second on the same Skylake CPU.

Alzette: A 64-Bit ARX-box (Feat. CRAX and TRAX)
; Biryukov, Alex ; Cardoso Dos Santos, Luan et al
in Micciancio, Daniele; Ristenpart, Thomas (Eds.) Advances in Cryptology -- CRYPTO 2020, 40th Annual International Cryptology Conference, CRYPTO 2020, Santa Barbara, CA, USA, August 17-21, 2020, Proceedings, Part III (2020, August)
S-boxes are the only source of non-linearity in many symmetric primitives. While they are often defined as being functions operating on a small space, some recent designs propose the use of much larger ones (e.g., 32 bits).
In this context, an S-box is then defined as a subfunction whose cryptographic properties can be estimated precisely. We present a 64-bit ARX-based S-box called Alzette, which can be evaluated in constant time using only 12 instructions on modern CPUs. Its parallel application can also leverage vector (SIMD) instructions. One iteration of Alzette has differential and linear properties comparable to those of the AES S-box, and two are at least as secure as the AES super S-box. As the state size is much larger than the typical 4 or 8 bits, the study of the relevant cryptographic properties of Alzette is not trivial. We further discuss how such wide S-boxes could be used to construct round functions of 64-, 128- and 256-bit (tweakable) block ciphers with good cryptographic properties that are guaranteed even in the related-tweak setting. We use these structures to design a very lightweight 64-bit block cipher (Crax) which outperforms SPECK-64/128 for short messages on micro-controllers, and a 256-bit tweakable block cipher (Trax) which can be used to obtain strong security guarantees against powerful adversaries (nonce misuse, quantum attacks).

Lightweight AEAD and Hashing using the Sparkle Permutation Family
Beierle, Christof ; Biryukov, Alex ; Cardoso Dos Santos, Luan et al
in IACR Transactions on Symmetric Cryptology (2020), 2020(S1), 208-261
We introduce the Sparkle family of permutations operating on 256, 384 and 512 bits. These are combined with the Beetle mode to construct a family of authenticated ciphers, Schwaemm, with security levels ranging from 120 to 250 bits. We also use them to build new sponge-based hash functions, Esch256 and Esch384.
Our permutations are among those with the lowest footprint in software, without sacrificing throughput. These properties are made possible by our use of an ARX component (the Alzette S-box) as well as a carefully chosen number of rounds. The corresponding analysis is enabled by the long trail strategy which gives us the tools we need to efficiently bound the probability of all the differential and linear trails for an arbitrary number of rounds. We also present a new application of this approach where the only trails considered are those mapping the rate to the outer part of the internal state, such trails being the only relevant trails for instance in a differential collision attack. To further decrease the number of rounds without compromising security, we modify the message injection in the classical sponge construction to break the alignment between the rate and our S-box layer.

A Lightweight Implementation of NTRU Prime for the Post-Quantum Internet of Things
Cheng, Hao ; ; Groszschädl, Johann et al
in Laurent, Maryline; Giannetsos, Thanassis (Eds.) Information Security Theory and Practice, 13th IFIP WG 11.2 International Conference, WISTP 2019, Paris, France, December 11–12, 2019, Proceedings (2019, December)
The dawning era of quantum computing has prompted various initiatives for the standardization of post-quantum cryptosystems with the goal of (eventually) replacing RSA and ECC. NTRU Prime is a variant of the classical NTRU cryptosystem that comes with a couple of tweaks to minimize the attack surface; most notably, it avoids rings with "worrisome" structure.
This paper presents, to our knowledge, the first assembler-optimized implementation of Streamlined NTRU Prime for an 8-bit AVR microcontroller and shows that high-security lattice-based cryptography is feasible for small IoT devices. An encapsulation operation using parameters for 128-bit post-quantum security requires 8.2 million clock cycles when executed on an 8-bit ATmega1284 microcontroller. The decapsulation is approximately twice as costly and has an execution time of 15.6 million cycles. We achieved this performance through (i) new low-level software optimization techniques to accelerate Karatsuba-based polynomial multiplication on the 8-bit AVR platform and (ii) an efficient implementation of the coefficient modular reduction written in assembly language. The execution time of encapsulation and decapsulation is independent of secret data, which makes our software resistant against timing attacks. Finally, we assess the performance one could theoretically gain by using a so-called product-form polynomial as part of the secret key and discuss potential security implications.

FELICS-AEAD: Benchmarking of Lightweight Authenticated Encryption Algorithms
Cardoso Dos Santos, Luan ; Groszschädl, Johann ; Biryukov, Alex
in Belaïd, Sonia; Güneysu, Tim (Eds.) Smart Card Research and Advanced Applications, 18th International Conference, CARDIS 2019, Prague, Czech Republic, November 11–13, 2019, Revised Selected Papers (2019, November)
Cryptographic algorithms that can simultaneously provide both encryption and authentication play an increasingly important role in modern security architectures and protocols (e.g. TLS v1.3).
Dozens of authenticated encryption systems have been designed in the past five years, which has initiated a large body of research in cryptanalysis. The interest in authenticated encryption has further risen after the National Institute of Standards and Technology (NIST) announced an initiative to standardize "lightweight" authenticated ciphers and hash functions that are suitable for resource-constrained devices. However, while there already exist some cryptanalytic results on these recent designs, little is known about their performance, especially when they are executed on small 8, 16, and 32-bit microcontrollers. In this paper, we introduce an open-source benchmarking tool suite for a fair and consistent evaluation of Authenticated Encryption with Associated Data (AEAD) algorithms written in C or assembly language for 8-bit AVR, 16-bit MSP430, and 32-bit ARM Cortex-M3 platforms. The tool suite is an extension of the FELICS benchmarking framework and provides a new AEAD-specific low-level API that allows users to collect very fine-grained and detailed results for execution time, RAM consumption, and binary code size in a highly automated fashion. FELICS-AEAD comes with two pre-defined evaluation scenarios, which were developed to resemble security-critical operations commonly carried out by real IoT applications to ensure the benchmarks are meaningful in practice. We tested the AEAD tool suite using five authenticated encryption algorithms, namely AES-GCM and the CAESAR candidates ACORN, ASCON, Ketje-Jr, and NORX, and present some preliminary results. 

Fast ECDH Key Exchange Using Twisted Edwards Curves with an Efficiently Computable Endomorphism
Groszschädl, Johann ; Liu, Zhe ; et al
in Proceedings of the 8th International Workshop on Secure Internet of Things 2019 (SIoT 2019) (2019, September)
It is widely accepted that public-key cryptosystems play a major role in the security arena of the Internet of Things (IoT), but they need to be implemented efficiently to not deplete the scarce resources of battery-operated devices such as wireless sensor nodes. This paper describes a highly-optimized software implementation of scalar multiplication for Elliptic Curve Diffie-Hellman (ECDH) key exchange on resource-limited IoT devices that achieves fast execution times along with reasonably small code size and RAM consumption. Our software uses a special class of elliptic curves, namely twisted Edwards curves with an efficiently computable endomorphism similar to that of the so-called Gallant-Lambert-Vanstone (GLV) curves. This allows us to combine the main advantage of the GLV model, which is an efficiently-computable endomorphism to speed up variable-base scalar multiplication, with the fast and complete addition rules of the (twisted) Edwards model. We implemented variable-base scalar multiplication for static ECDH on two such curves, one over a 159-bit and the second over a 207-bit pseudo-Mersenne prime field, respectively, and evaluated their execution time on a 16-bit MSP430F1611 processor. The arithmetic operations in the prime field do not contain operand-dependent conditional statements (in particular no "if-then-else" clauses) and also the scalar multiplication follows a fixed execution path for a given (static) scalar.
A variable-base scalar multiplication on curves over the 159 and 207-bit fields takes about 2.63 and 4.84 million clock cycles, respectively, on an MSP430F1611 processor. These results compare favorably with the Montgomery ladder on the equivalent Montgomery curves, which is almost 50% slower.

Triathlon of Lightweight Block Ciphers for the Internet of Things
Dinu, Dumitru-Daniel ; Le Corre, Yann ; Khovratovich, Dmitry et al
in Journal of Cryptographic Engineering (2019), 9(3), 283-302
In this paper, we introduce a framework for the benchmarking of lightweight block ciphers on a multitude of embedded platforms. Our framework is able to evaluate the execution time, RAM footprint, as well as binary code size, and allows one to define a custom "figure of merit" according to which all evaluated candidates can be ranked. We used the framework to benchmark implementations of 19 lightweight ciphers, namely AES, Chaskey, Fantomas, HIGHT, LBlock, LEA, LED, Piccolo, PRESENT, PRIDE, PRINCE, RC5, RECTANGLE, RoadRunneR, Robin, Simon, SPARX, Speck, and TWINE, on three microcontroller platforms: 8-bit AVR, 16-bit MSP430, and 32-bit ARM. Our results bring some new insights into the question of how well these lightweight ciphers are suited to secure the Internet of Things. The benchmarking framework provides cipher designers with an easy-to-use tool to compare new algorithms with the state of the art and allows standardization organizations to conduct a fair and consistent evaluation of a large number of candidates.

A Lightweight Implementation of NTRUEncrypt for 8-bit AVR Microcontrollers
Cheng, Hao ; Groszschädl, Johann ; Roenne, Peter et al
E-print/Working paper (2019)
Introduced in 1996, NTRUEncrypt is not only one of the earliest but also one of the most scrutinized lattice-based cryptosystems and a serious contender in NIST's ongoing Post-Quantum Cryptography (PQC) standardization project. An important criterion for the assessment of candidates is their computational cost in various hardware and software environments. This paper contributes to the evaluation of NTRUEncrypt on the ATmega class of AVR microcontrollers, which is among the most popular 8-bit platforms in the embedded domain. More concretely, we present AvrNtru, a carefully-optimized implementation of NTRUEncrypt that we developed from scratch with the goal of achieving high performance and resistance to timing attacks. AvrNtru complies with version 3.3 of the EESS#1 specification and supports recent product-form parameter sets like ees443ep1, ees587ep1, and ees743ep1. A full encryption operation (including mask generation and blinding-polynomial generation) using the ees443ep1 parameters takes 834,272 clock cycles on an ATmega1281 microcontroller; the decryption is slightly more costly and has an execution time of 1,061,683 cycles. When choosing the ees743ep1 parameters to achieve a 256-bit security level, encryption takes 1,539,829 clock cycles and decryption 2,103,228 clock cycles. We achieved these results thanks to a novel hybrid technique for multiplication in truncated polynomial rings where one of the operands is a sparse ternary polynomial in product form.
Our hybrid technique is inspired by Gura et al.'s hybrid method for multiple-precision integer multiplication (CHES 2004) and takes advantage of the large register file of the AVR architecture to minimize the number of load instructions. A constant-time multiplication in the ring specified by the ees443ep1 parameters requires only 210,827 cycles, which sets a new speed record for the arithmetic component of a lattice-based cryptosystem on an 8-bit microcontroller.

Alzette: A 64-bit ARX-box
Beierle, Christof ; Biryukov, Alex ; Cardoso Dos Santos, Luan et al
E-print/Working paper (2019)
S-boxes are the only source of non-linearity in many symmetric primitives. While they are often defined as being functions operating on a small space, some recent designs propose the use of much larger ones (e.g., 32 bits). In this context, an S-box is then defined as a subfunction whose cryptographic properties can be estimated precisely. In this paper, we present a 64-bit ARX-based S-box called Alzette, which can be evaluated in constant time using only 12 instructions on modern CPUs. Its parallel application can also leverage vector (SIMD) instructions. One iteration of Alzette has differential and linear properties comparable to those of the AES S-box, while two iterations are at least as secure as the AES super S-box. Since the state size is much larger than the typical 4 or 8 bits, the study of the relevant cryptographic properties of Alzette is not trivial.

A Family of Lightweight Twisted Edwards Curves for the Internet of Things
Ghatpande, Sankalp ; Groszschädl, Johann ; Liu, Zhe
in Blazy, Olivier; Yeun, Chan Y. (Eds.)
Information Security Theory and Practice, 12th IFIP WG 11.2 International Conference, WISTP 2018, Brussels, Belgium, December 10-11, 2018, Proceedings (2018, December)
We introduce a set of four twisted Edwards curves that satisfy common security requirements and allow for fast implementations of scalar multiplication on 8, 16, and 32-bit processors. Our curves are defined by an equation of the form -x^2 + y^2 = 1 + dx^2y^2 over a prime field Fp, where d is a small non-square modulo p. The underlying prime fields are based on "pseudo-Mersenne" primes given by p = 2^k - c and have in common that p is congruent to 5 modulo 8, k is one less than a multiple of 32, and c is at most eight bits long. Due to these common features, our primes facilitate a parameterized implementation of the low-level arithmetic so that one and the same arithmetic function is able to process operands of different length. Each of the twisted Edwards curves we introduce in this paper is birationally equivalent to a Montgomery curve of the form -(A+2)y^2 = x^3 + Ax^2 + x where 4/(A+2) is small. Even though this contrasts with the usual practice of choosing A such that (A+2)/4 is small, we show that the Montgomery form of our curves allows for an equally efficient implementation of point doubling as Curve25519. The four curves we put forward roughly match the common security levels of 80, 96, 112 and 128 bits. In addition, their Weierstraß representations are isomorphic to curves of the form y^2 = x^3 - 3x + b so as to facilitate inter-operability with TinyECC and other legacy software.
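
The prime shape described in this abstract lends itself to a short illustration. The following Python sketch is not taken from the paper: the function names `has_described_shape` and `reduce_pseudo_mersenne` are hypothetical, and the well-known prime 2^255 - 19 is used only as a stand-in that happens to match the stated conditions (it is not claimed to be one of the four primes of the paper). The sketch checks the three conditions on p = 2^k - c and shows how the congruence 2^k ≡ c (mod p) allows one routine to reduce operands for any k. It is not constant-time.

```python
def has_described_shape(k: int, c: int) -> bool:
    """Check the conditions stated in the abstract for p = 2^k - c:
    k is one less than a multiple of 32, c fits in eight bits,
    and p is congruent to 5 modulo 8 (primality is NOT checked here)."""
    p = (1 << k) - c
    return k % 32 == 31 and 0 < c < 256 and p % 8 == 5


def reduce_pseudo_mersenne(x: int, k: int, c: int) -> int:
    """Reduce a non-negative integer x modulo p = 2^k - c.

    Uses 2^k = c (mod p): the bits above position k are folded back
    into the low part after multiplication by the small constant c.
    Illustrative only; the loop and final branch are not constant-time.
    """
    p = (1 << k) - c
    mask = (1 << k) - 1
    while x >> k:                        # fold while x has more than k bits
        x = (x & mask) + c * (x >> k)
    if x >= p:                           # x < 2^k, so one subtraction suffices
        x -= p
    return x


if __name__ == "__main__":
    k, c = 255, 19                       # stand-in prime 2^255 - 19 (assumption)
    p = (1 << k) - c
    a, b = 3**150 % p, 7**90 % p
    print(has_described_shape(k, c))
    print(reduce_pseudo_mersenne(a * b, k, c) == (a * b) % p)
```

Because k and c are parameters rather than constants, the same routine serves every field size of this shape, which mirrors the parameterized low-level arithmetic that the abstract mentions.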

Efficient Implementation of the SHA-512 Hash Function for 8-bit AVR Microcontrollers
Cheng, Hao ; ; Groszschädl, Johann
in Lanet, Jean-Louis; Toma, Cristian (Eds.) Innovative Security Solutions for Information Technology and Communications, 11th International Conference, SecITC 2018, Bucharest, Romania, November 8-9, 2018, Revised Selected Papers (2018, November)
SHA-512 is a member of the SHA-2 family of cryptographic hash algorithms that is based on a Davies-Meyer compression function operating on eight 64-bit words to produce a 512-bit digest. It provides strong resistance to collision and preimage attacks, and is assumed to remain secure in the dawning era of quantum computers. However, the compression function of SHA-512 is challenging to implement on small 8 and 16-bit microcontrollers because of their limited register space and the fact that 64-bit rotations are generally slow on such devices. In this paper, we present the first highly-optimized Assembler implementation of SHA-512 for the ATmega family of 8-bit AVR microcontrollers. We introduce a special optimization technique for the compression function based on a duplication of the eight working variables so that they can be more efficiently loaded from RAM via the indirect addressing mode with displacement (using the ldd and std instructions). In this way, we were able to achieve high performance without unrolling the main loop of the compression function, thereby keeping the code size small. When executed on an 8-bit AVR ATmega128 microcontroller, the compression function takes slightly less than 60k clock cycles, which corresponds to a compression rate of roughly 467 cycles per byte.
The binary code size of the full SHA-512 implementation providing a standard Init-Update-Final (IUF) interface amounts to approximately 3.5 kB.