This post is about a rather technical coding strategy choice that arises when implementing cryptographic algorithms on some elliptic curves, namely how to represent elements of the base field. We will be discussing Curve25519 implementations, in particular as part of Ed25519 signatures, as specified in RFC 8032. The most widely used Rust implementation of these operations is the curve25519-dalek library. My own research library is crrl, also written in plain Rust (no assembly); it is meant for research purposes, but I write it using all best practices for production-level implementations, e.g. it is fully constant-time and offers an API amenable to integration into various applications.
The following table measures performance of Ed25519 signature generation and verification with these libraries, using various backend implementations for operations in the base field (integers modulo 2255 – 19), on two test platforms (64-bit x86, and 64-bit RISC-V):
Implementation | x86 (Intel “Coffee Lake”) | RISC-V (SiFive U74) | |||
Library | Backend | sign | verify | sign | verify |
crrl | m64 | 49130 | 108559 | 202021 | 412764 |
m51 | 70149 | 148653 | 158928 | 304902 | |
curve25519-dalek | simd | 59553 | 116243 | – | – |
serial | 59599 | 169621 | 180142 | 449980 | |
fiat | 70552 | 198289 | 187220 | 429755 |
Test platforms are the following:
In both cases, the current Rust “stable” version is used (1.72.0, from 2023-08-23), and compilation uses the environment variable RUSTFLAGS=”-C target-cpu=native” to allow the compiler to use all opcodes supported by the current platform. The computation is performed over a single core, with measurements averaged over randomized inputs. The CPU cycle counter is used. Figures above are listed with many digits, but in practice there is a bit of natural variance due to varying inputs (signature verification is not constant-time, since it uses only public data) and, more generally, because of the effect of various operations also occurring within the CPU (e.g. management mode, cache usage from other cores, interruptions from hardware…), so that the measured values should be taken with a grain of salt (roughly speaking, differences below about 3% are not significant).
crrl and curve25519-dalek differ a bit in how they use internal tables to speed up computations; in general, crrl tables are smaller, and crrl performs fewer point additions but more point doublings. For signature verification, crrl implements the Antipa et al optimization with Lagrange’s algorithm for lattice basis reduction, but curve25519-dalek does not. The measurements above show that crrl’s strategy works (i.e. it is a tad faster than curve25519-dalek) (note: not listed above is the fact that curve25519-dalek supports batch signature verification with a substantially lower per-signature cost; crrl does not implement that feature yet). The point of this post is not to boast about how crrl is faster; its good performance should be taken as an indication that it is decently optimized and thus a correct illustration of the effect of its implementation strategy choices. Indeed, the interesting part is how the different backends compare to each other, on the two tested architectures.
curve25519-dalek has three backends:
crrl has two backends:
The “m64” backend could be deemed to be the most classical, in that such a representation would be what was preferred for big integer computations in, say, the 1990s. It minimizes the number of multiplication opcode invocations during a field element multiplication (with two 4-limb operands, 16 register multiplications are used), but also implies quite a lot of carry propagation. See for instance this excerpt of the implementation of field element multiplication in crrl’s m64 backend:
let (e0, e1) = umull(a0, b0);
let (e2, e3) = umull(a1, b1);
let (e4, e5) = umull(a2, b2);
let (e6, e7) = umull(a3, b3);let (lo, hi) = umull(a0, b1);
let (e1, cc) = addcarry_u64(e1, lo, 0);
let (e2, cc) = addcarry_u64(e2, hi, cc);
let (lo, hi) = umull(a0, b3);
let (e3, cc) = addcarry_u64(e3, lo, cc);
let (e4, cc) = addcarry_u64(e4, hi, cc);
let (lo, hi) = umull(a2, b3);
let (e5, cc) = addcarry_u64(e5, lo, cc);
let (e6, cc) = addcarry_u64(e6, hi, cc);
let (e7, _) = addcarry_u64(e7, 0, cc);let (lo, hi) = umull(a1, b0);
let (e1, cc) = addcarry_u64(e1, lo, 0);
let (e2, cc) = addcarry_u64(e2, hi, cc);
let (lo, hi) = umull(a3, b0);
let (e3, cc) = addcarry_u64(e3, lo, cc);
let (e4, cc) = addcarry_u64(e4, hi, cc);
let (lo, hi) = umull(a3, b2);
let (e5, cc) = addcarry_u64(e5, lo, cc);
let (e6, cc) = addcarry_u64(e6, hi, cc);
let (e7, _) = addcarry_u64(e7, 0, cc);
The addcarry_u64() calls above implement the “add with carry” operation, which, on x86, map to the ADC opcode (or, on that core, the ADCX or ADOX opcodes).
When Ed25519 signatures were invented, in 2011, the Intel CPUs du jour (Intel “Westmere” core) were not very good at carry propagation; they certainly supported the ADC opcode, but with a relatively high latency (2 cycles), and that made the classical code somewhat slow. The use of 51-bit limbs allowed a different code, which, in curve25519-dalek’s serial backend, looks like this:
let b1_19 = b[1] * 19;
let b2_19 = b[2] * 19;
let b3_19 = b[3] * 19;
let b4_19 = b[4] * 19;// Multiply to get 128-bit coefficients of output
let c0: u128 = m(a[0], b[0]) + m(a[4], b1_19) + m(a[3], b2_19) + m(a[2], b3_19) + m(a[1], b4_19);
let mut c1: u128 = m(a[1], b[0]) + m(a[0], b[1]) + m(a[4], b2_19) + m(a[3], b3_19) + m(a[2], b4_19);
let mut c2: u128 = m(a[2], b[0]) + m(a[1], b[1]) + m(a[0], b[2]) + m(a[4], b3_19) + m(a[3], b4_19);
let mut c3: u128 = m(a[3], b[0]) + m(a[2], b[1]) + m(a[1], b[2]) + m(a[0], b[3]) + m(a[4], b4_19);
let mut c4: u128 = m(a[4], b[0]) + m(a[3], b[1]) + m(a[2], b[2]) + m(a[1], b[3]) + m(a[0] , b[4]);
This code excerpt computes the result over five limbs which can now range over close to 128 bits, and some extra high part propagation (not shown above) is needed to shrink limbs down to 51 bits or so. As we see here, there are now 25 individual multiplications (the m() function), since there are five limbs per input. There still are ADC opcodes in there! They are somewhat hidden behind the additions: these additions are over Rust’s u128 type, a 128-bit integer type that internally uses two registers, so that each addition implies one ADC opcode. However, these carry propagations can occur mostly in parallel (there are five independent dependency chains here), and that maps well to the abilities of a Westmere core. On such cores, the “serial” backend is faster than crrl’s m64. However, in later x86 CPUs from Intel (starting with the Haswell core), support for add-with-carry opcodes was substantially improved, and the classical method with 64-bit limbs again gained the upper hand. This was already noticed by Nath and Sarkar in 2018 and this explains why crrl’s m64 backend, on our x86 test system, is faster than curve25519-dalek’s serial and fiat backends, and even a bit faster than the AVX2-powered simd backend.
Now enters the RISC-V platform. RISC-V is an open architecture which has been designed with what can be viewed as “pure RISC philosophy”, with a much reduced instruction set. It is inspired from the older DEC Alpha, including in particular a large number of integer registers (32), one of which being fixed to the value zero, and, most notably, no carry flag at all. An “add-with-carry” operation, which adds together two 64-bit inputs x and y and an input carry c, and outputs a 64-bit result z and an output carry d, now requires no fewer than five instructions:
(I cannot prove that it is not doable in fewer RISC-V instructions; if there is a better solution please tell me.)
Thus, the add-with-carry is not only a high-latency sequence, but it also requires quite a few instructions, and instruction throughput may be a bottleneck. Out test platform (SiFive U74 core) is not well documented, but some cursory tests show the following:
Under such conditions, a 5-instruction add-with-carry will need a minimum of 2.5 cycles (in terms of throughput). The main output (z) is available with a latency of 2 cycles, but the output carry has a latency of 4 cycles. A “partial” add-with-carry with no input carry is cheaper (an ADD and a SLTU), and so is an add-with-carry with no output carry (two ADDs), but these are still relatively expensive. The high latency is similar to the Westmere situation, but the throughput cost is new. For that RISC-V platform, we need to avoid not only long dependency chains of carry propagation, but we should also endeavour to do fewer carry propagations. Another operation which is similarly expensive is the split of a 115-bit value (held in a 128-bit variable) into a low (51 bits) and high (64 bits) parts. The straightforward Rust code looks like this (from curve25519-dalek):
let carry: u64 = (c4 >> 51) as u64;
out[4] = (c4 as u64) & LOW_51_BIT_MASK;
On x86, the 128-bit value is held in two registers; the low part is a simple bitwise AND with a constant, and the high part is extracted with a single SHLD opcode, that can extract a chunk out of the concatenation of two input registers. On RISC-V, there is no shift opcode with two input registers (not counting the shift count); instead, the extraction of the high part (called carry in the code excerpt above) requires three instructions: two single-register shifts (SHR, SHL) and one bitwise OR to combine the results.
In order to yield better performance on RISC-V, the crrl m51 backend does things a bit differently:
let a0 = a0 << 6;
let b0 = b0 << 7;
// ...
let (c00, h00) = umull(a0, b0);
let d0 = c00 >> 13;
Here, the input limbs are pre-shifted (by 6 or 7 bits) so that the products are shifted by 13 bits. In that case, the boundary between the low and high parts falls exactly on the boundary between the two registers that receive the multiplication result; the extraction of the high part becomes free! The low part is obtained with a single opcode (a right shift of the low register by 13 bits). Moreover, instead of performing 128-bit additions, crrl’s m51 code adds the low and high parts separately:
let d0 = c00 >> 13;
let d1 = (c01 >> 13)
+ (c10 >> 13);
let d2 = (c02 >> 13)
+ (c11 >> 13)
+ (c20 >> 13);
// ...
let h0 = h00;
let h1 = h01 + h10;
let h2 = h02 + h11 + h20;
In that way, all add-with-carry operations are avoided. This makes crrl’s m51 code somewhat slower than curve2519-dalek’s serial backend on x86, but it quite improves the performance on RISC-V.
The discussion above is about a fairly technical point. In the grand scheme of things, the differences in performance between the various implementation strategies is not great; there are not many usage contexts where a speed difference of less than 30% in computing or verifying Ed25519 signatures has any relevance to overall application performance. But, insofar as such things matter, the following points are to be remembered:
Authored by Joshua Kamp (main author) and Alberto Segura. Summary Hook and ERMAC are Android based malware families that are both advertised by the actor named “DukeEugene”. Hook is the latest variant to be released by this actor and was first announced at the start of 2023. In this announcement,…
Mathew Vermeer is a doctoral candidate at the Organisation Governance department of the faculty of Technology, Policy and Management of Delft University of Technology. At the same university, he has received both a BSc degree in Computer Science and Engineering, as well as a MSc degree in Computer Science with…
Aaron Adams presented this talk at HITB Phuket on the 24th August 2023. The talk detailed how NCC Exploit Development Group (EDG) in Pwn2Own 2022 Toronto was able to exploit two different PostScript vulnerabilities in Lexmark printers. The presentation is a good primer for those interested in further researching the…