Ten years ago, we published a Clang Hardening Cheat Sheet. Since then, both the threat landscape and the Clang toolchain have evolved significantly. This blog post presents the new mitigations available in Clang to improve the security of your applications.
Ten years ago, we published on this blog a Clang Hardening Cheat Sheet. The original post walked through the essential hardening techniques available at the time: FORTIFY_SOURCE checks, ASLR via position-independent code, stack protection (canaries and safe stack), Control Flow Integrity (CFI), and GOT protection with RELRO/now, as well as warning options for format-string usage that could open the door to attacks.
Since that article was published in early 2016, both the threat landscape and the Clang toolchain have evolved significantly.
To celebrate the 10th anniversary of the initial article, here is a new cheat sheet with some new hardening flags to improve security.
The OpenSSF Best Practices Working Group maintains a Compiler Options Hardening Guide for C and C++. They recommend using the following set of options:
-O2 -Wall -Wformat -Wformat=2 -Wconversion -Wimplicit-fallthrough \
-Werror=format-security \
-U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 \
-D_GLIBCXX_ASSERTIONS \
-fstrict-flex-arrays=3 \
-fstack-clash-protection -fstack-protector-strong \
-Wl,-z,nodlopen -Wl,-z,noexecstack \
-Wl,-z,relro -Wl,-z,now \
-Wl,--as-needed -Wl,--no-copy-dt-needed-entries
The options we recommended in our original blog post are still present, even if some have evolved (-Wformat=2, -D_FORTIFY_SOURCE=3). Others have been added:
- -Wconversion, -Wimplicit-fallthrough;
- -Wl,--as-needed, -Wl,--no-copy-dt-needed-entries;
- -D_GLIBCXX_ASSERTIONS, -fstrict-flex-arrays=3, -fstack-clash-protection, -Wl,-z,nodlopen, -Wl,-z,noexecstack.

In this post, we present the additional hardening options recommended by the OpenSSF, as well as more specialized options that mitigate newer classes of exploits. We first go through several general protections that apply when using the standard C/C++ libraries or loading shared libraries. We then present mitigations against stack-based memory corruption and against the use of Return-Oriented Programming (ROP) or Jump-Oriented Programming (JOP) attacks. Finally, we discuss defenses against speculative execution attacks.
Since last time, the -D_FORTIFY_SOURCE flag has evolved. It now provides a level 3, which includes all the security features of levels 1 and 2, along with additional checks for potentially dangerous code patterns. These additional checks are designed to detect a wider range of security issues in functions such as:

- memcpy and memmove;
- snprintf, vsnprintf, and related functions;
- strtok, strncat, and strpbrk.
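To see fortification in action, here is a minimal, deliberately broken program; the file name and exact flags below are just an example, and note that _FORTIFY_SOURCE requires optimization (-O1 or higher) and a supporting libc such as glibc:

#include <string.h>

int main(int argc, char **argv)
{
    char buf[8];
    if (argc < 2)
        return 0;
    /* With -O2 -D_FORTIFY_SOURCE=3, glibc replaces this call with
       __strcpy_chk(), which aborts at runtime ("buffer overflow
       detected") whenever argv[1] does not fit into the 8-byte
       destination. */
    strcpy(buf, argv[1]);
    return buf[0];
}

clang -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -o fortify src/fortify.c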
In addition to -D_FORTIFY_SOURCE, C++ developers can enable extra runtime checks in the standard library by defining -D_GLIBCXX_ASSERTIONS.
This macro is one of several supported by libstdc++; it enables various NULL-pointer and bounds-checking assertions in the standard library.
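With libstdc++, for instance, an out-of-bounds operator[] that would otherwise silently read past the end of a buffer turns into a clean runtime abort. A minimal sketch:

#include <vector>

int main()
{
    std::vector<int> v{1, 2, 3};
    /* operator[] normally performs no bounds check; when built with
       clang++ -D_GLIBCXX_ASSERTIONS, libstdc++ asserts on the
       out-of-bounds index and aborts instead of reading past the
       buffer. */
    return v[3];
}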
The -Wl,-z,nodlopen linker flag is used to prevent a shared object from being dynamically loaded at runtime using dlopen().
This can help reduce an attacker’s ability to load and manipulate shared objects.
To test this flag, let's compile a simple shared object:
#include <stdio.h>
void example_function(void)
{
printf("Hello from the dynamically loaded library!\n");
}
We compile it with the nodlopen linker flag:
clang -fPIC -c src/libexample.c -o libexample.o
clang -Wl,-z,nodlopen -shared libexample.o -o libexample.so
Now we can attempt to load it at runtime using dlopen:
#include <stdio.h>
#include <dlfcn.h>
int main(void)
{
void *handle;
void (*func)(void);
/* Load the shared library */
handle = dlopen("./libexample.so", RTLD_NOW);
if (!handle) {
fprintf(stderr, "dlopen failed: %s\n", dlerror());
return 1;
}
/* Resolve the symbol */
func = (void (*)(void))dlsym(handle, "example_function");
if (!func) {
fprintf(stderr, "dlsym failed: %s\n", dlerror());
dlclose(handle);
return 1;
}
/* Call the function */
func();
/* Close the library */
dlclose(handle);
return 0;
}
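The loader itself is built normally; on glibc versions older than 2.34, dlopen() lives in libdl and -ldl must be added:

clang -o main src/main.c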
When we run this program, we get the following error:
╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen ‹main*›
╰─➤ ./main
dlopen failed: ./libexample.so: shared object cannot be dlopen()ed
This demonstrates that the nodlopen flag effectively prevents the shared library from being loaded dynamically at runtime. Note that this does not prevent the library from being linked at build time:
╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen ‹main*›
╰─➤ cat src/main2.c
#include <stdio.h>
/* declaration of the library function */
void example_function(void);
int main(void)
{
example_function();
return 0;
}
╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen ‹main*›
╰─➤ clang -o main2 src/main2.c $PWD/libexample.so
╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen ‹main*›
╰─➤ ldd main2
linux-vdso.so.1 (0x00007fec50fe3000)
/home/trikkss/work/clang-hardening/tests/nodlopen/libexample.so (0x00007fec50fd1000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007fec50c00000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fec50fe5000)
╭─trikkss@archlinux ~/work/clang-hardening/tests/nodlopen ‹main*›
╰─➤ ./main2
Hello from the dynamically loaded library!
If we inspect the dynamic section of the library, and compare it to the dynamic section of the same library generated without nodlopen, we can see that the main difference is the presence of the flag NOOPEN:
2c2
< Dynamic section at offset 0x2e08 contains 24 entries:
---
> Dynamic section at offset 0x2df8 contains 25 entries:
7c7
< 0x0000000000000019 (INIT_ARRAY) 0x3df8
---
> 0x0000000000000019 (INIT_ARRAY) 0x3de8
9c9
< 0x000000000000001a (FINI_ARRAY) 0x3e00
---
> 0x000000000000001a (FINI_ARRAY) 0x3df0
22a23
> 0x000000006ffffffb (FLAGS_1) Flags: NOOPEN
The protection may therefore be easy to bypass: if an attacker has write access to the library in question, they can easily patch it.
One of the oldest and simplest exploit mitigation mechanisms is the non-executable stack. The idea is to prevent code execution directly from the stack, which was a common exploitation technique in early stack-based buffer overflow attacks.
When a binary is compiled or linked with the noexecstack flag, the stack memory region is marked as non-executable. As a result, even if an attacker manages to inject shellcode onto the stack and redirect execution to it, the CPU will refuse to execute it and will raise a fault.
Today, this protection is enabled by default on most modern systems and toolchains.
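Whether a binary requests an executable stack can be checked by inspecting its GNU_STACK program header: RW flags denote a non-executable stack, while RWE would indicate an executable one.

readelf -lW ./main | grep GNU_STACK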
During its execution, a program uses the stack memory region to store data. This memory region is special because it grows automatically as the program needs more stack memory. To prevent uncontrolled growth, modern operating systems place guard pages below the stack, which trigger a fault when the stack grows too far.
A stack clash occurs when the stack grows by a very large amount at once and skips over these guard pages. As a result, the stack can collide with another memory region, such as the heap or a memory-mapped area, without triggering an immediate fault. This can cause stack writes to corrupt adjacent memory regions, or allow other memory regions to overlap with the stack, leading to memory corruption and potential exploitation.
The Stack Clash vulnerability was publicly disclosed on June 19, 2017 by the Qualys research team.
LLVM's solution to this problem is to divide large allocations into smaller ones of size PAGE_SIZE as described in this blog post. This can be done by adding the -fstack-clash-protection compilation flag.
To observe this behavior, we can compile a simple C program that allocates a large buffer on the stack:
#include <stdio.h>
#include <sys/user.h>
int main(void)
{
char buffer[PAGE_SIZE * 10]; /* one large allocation spanning ten pages */
printf("hello world\n");
return 0;
}
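The listings below can be reproduced with commands along these lines (the hardened output also contains a stack canary, so -fstack-protector-strong appears to have been enabled alongside the clash protection):

clang -o main src/main.c
clang -fstack-clash-protection -fstack-protector-strong -o main-hardened src/main.c
objdump -d -M intel main-hardened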
Without stack clash protection enabled, the generated assembly looks like this:
0000000000401130 <main>:
401130: 55 push rbp
401131: 48 89 e5 mov rbp,rsp
401134: 48 81 ec 10 a0 00 00 sub rsp,0xa010
40113b: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0
401142: 48 8d 3d bb 0e 00 00 lea rdi,[rip+0xebb] # 402004 <_IO_stdin_used+0x4>
401149: b0 00 mov al,0x0
40114b: e8 e0 fe ff ff call 401030 <printf@plt>
401150: 31 c0 xor eax,eax
401152: 48 81 c4 10 a0 00 00 add rsp,0xa010
401159: 5d pop rbp
40115a: c3 ret
Here, we can clearly identify a single large stack allocation performed by the instruction sub rsp, 0xa010, which decreases the stack pointer by more than ten pages at once.
Now, we can take a look at the hardened version of this program:
0000000000401140 <main>:
401140: 55 push rbp
401141: 48 89 e5 mov rbp,rsp
401144: 49 89 e3 mov r11,rsp
401147: 49 81 eb 00 a0 00 00 sub r11,0xa000
40114e: 48 81 ec 00 10 00 00 sub rsp,0x1000
401155: 48 c7 04 24 00 00 00 mov QWORD PTR [rsp],0x0
40115c: 00
40115d: 4c 39 dc cmp rsp,r11
401160: 75 ec jne 40114e <main+0xe>
401162: 48 83 ec 20 sub rsp,0x20
401166: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
40116d: 00 00
40116f: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
401173: c7 85 ec 5f ff ff 00 mov DWORD PTR [rbp-0xa014],0x0
40117a: 00 00 00
40117d: 48 8d 3d 80 0e 00 00 lea rdi,[rip+0xe80] # 402004 <_IO_stdin_used+0x4>
401184: 31 c0 xor eax,eax
401186: e8 b5 fe ff ff call 401040 <printf@plt>
40118b: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
401192: 00 00
401194: 48 8b 4d f8 mov rcx,QWORD PTR [rbp-0x8]
401198: 48 39 c8 cmp rax,rcx
40119b: 75 0b jne 4011a8 <main+0x68>
40119d: 31 c0 xor eax,eax
40119f: 48 81 c4 20 a0 00 00 add rsp,0xa020
4011a6: 5d pop rbp
4011a7: c3 ret
4011a8: e8 83 fe ff ff call 401030 <__stack_chk_fail@plt>
We can see that the single stack allocation (sub rsp, 0xa010) has been replaced by a loop (starting at address 40114e) that grows the stack by 0x1000 bytes (the size of a memory page) per iteration until the stack pointer reaches the target address. Each page is touched along the way (mov QWORD PTR [rsp],0x0), so no guard page can be skipped.
Modern exploitation techniques rarely rely on injecting new code. Instead, attackers increasingly reuse existing code already present in the binary or in linked libraries. This class of attacks, commonly referred to as code-reuse attacks, bypasses traditional defenses such as non-executable stack memory (NX) by chaining together short instruction sequences called gadgets that end with indirect control-flow transfers.
The most well-known form is Return-Oriented Programming (ROP), where execution is redirected through a sequence of RET instructions by overwriting the return address of a function. Variants such as Jump-Oriented Programming (JOP) and Call-Oriented Programming (COP) rely on indirect jumps or calls instead.
To mitigate these issues, modern defenses focus on controlling where indirect branches are allowed to land, protecting return addresses, or reducing the usefulness of available gadgets. Rather than preventing memory corruption itself, these mechanisms aim to constrain how corrupted control flow can be exploited.
In the following sections, we will examine how Clang compilation flags implement these strategies through hardware-assisted features and compiler-driven transformations, and how they can be combined to harden binaries against code-reuse attacks in practice.
If you want to learn more about how ROP works, I recommend the following blog post: ROP - Return-Oriented Programming by Pixis.
Modern x86_64 CPUs from Intel support Control-flow Enforcement Technology (CET), a hardware-assisted mechanism to protect against code-reuse attacks. CET combines two complementary features: Indirect Branch Tracking (IBT), which protects forward edges (indirect calls and jumps), and the Shadow Stack (SHSTK), which protects backward edges (returns).
Clang exposes CET through the -fcf-protection flag, which can take several values:
- -fcf-protection=return: enables shadow stack protection (SHSTK);
- -fcf-protection=branch: enables endbranch (EB) generation (IBT);
- -fcf-protection=full: enables both shadow stack protection and endbranch generation; this is the default when the option is passed without a value;
- -fcf-protection=none: disables Intel CET protection.

To study this security mechanism, consider the following trivial function:
#include <stdio.h>

void hello_world(void)
{
    printf("hello world\n");
}
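Assuming the function lives in src/main.c, the two binaries compared below can be built along these lines:

clang -fcf-protection=none -o main src/main.c
clang -fcf-protection=full -o main-hardened src/main.c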
When compiled without CET, the assembly code looks like:
0000000000401130 <hello_world>:
401130: 55 push rbp
401131: 48 89 e5 mov rbp,rsp
401134: 48 8d 3d c9 0e 00 00 lea rdi,[rip+0xec9] # 402004 <_IO_stdin_used+0x4>
40113b: b0 00 mov al,0x0
40113d: e8 ee fe ff ff call 401030 <printf@plt>
401142: 5d pop rbp
401143: c3 ret
and compiled with -fcf-protection=full, it looks like:
0000000000001150 <hello_world>:
1150: f3 0f 1e fa endbr64
1154: 55 push rbp
1155: 48 89 e5 mov rbp,rsp
1158: 48 8d 3d a5 0e 00 00 lea rdi,[rip+0xea5] # 2004 <_IO_stdin_used+0x4>
115f: b0 00 mov al,0x0
1161: e8 da fe ff ff call 1040 <printf@plt>
1166: 5d pop rbp
1167: c3 ret
The only difference is the ENDBR64 instruction at the beginning, which acts as a landing pad for indirect branches under IBT.
If we look at the properties of our ELFs, the hardened binary declares support for IBT and SHSTK:
╭─trikkss@archlinux ~/work/clang-hardening/tests/fcf-protection ‹main*›
╰─➤ readelf -n main-hardened | grep Properties
Properties: x86 feature: IBT, SHSTK
╭─trikkss@archlinux ~/work/clang-hardening/tests/fcf-protection ‹main*›
╰─➤ readelf -n main | grep Properties
Properties: x86 ISA needed: x86-64-baseline
⚠️ Today, in the 64-bit Linux kernel, only userspace shadow stack and kernel IBT are supported (kernel.org - Control-flow Enforcement Technology (CET) Shadow Stack), and Windows does not support Indirect Branch Tracking.
When IBT is enabled, the compiler inserts landing pads at the beginning of functions that may be the target of an indirect branch. These landing pads are the endbranch instructions (ENDBR32 and ENDBR64). On Intel processors, when this protection is active, an indirect branch may only land on an endbranch instruction. This prevents attackers from using JOP gadgets, because indirect jumps are only allowed to reach addresses marked by ENDBR instructions.
It is still possible to hijack the program’s control flow, but the options available to an attacker are limited. Additionally, this protection does not prevent ROP attacks.
Shadow stack, also referred to as SHSTK, is a backward-edge control-flow integrity protection feature available in both Intel and AMD processors. Its main purpose is to protect the integrity of return addresses on the call stack to prevent return-oriented programming (ROP) attacks.
The way it works is conceptually straightforward: every time a CALL instruction is executed, the CPU pushes the return address not only onto the regular stack but also onto a separate, protected “shadow” stack. Later, when a RET instruction is executed, the CPU checks that the return address on the normal stack matches the one on the shadow stack. If the two addresses differ, a fault is triggered, preventing the program from returning to an attacker-controlled location.
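To make the mechanism concrete, here is a toy software model of the check; this is not how the hardware implements it (the real shadow stack lives in protected memory and the comparison happens in the CPU on RET), it only illustrates the logic:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define DEPTH 64

static uint64_t regular_stack[DEPTH]; /* attacker-writable stack   */
static uint64_t shadow_stack[DEPTH];  /* hardware-protected mirror */
static int sp;

static void do_call(uint64_t return_address)
{
    regular_stack[sp] = return_address; /* normal push on CALL */
    shadow_stack[sp]  = return_address; /* mirrored push       */
    sp++;
}

static void do_ret(void)
{
    sp--;
    if (regular_stack[sp] != shadow_stack[sp]) {
        /* The CPU raises a control-protection (#CP) fault here. */
        fputs("control-protection fault: return address mismatch\n", stderr);
        abort();
    }
}

int main(void)
{
    do_call(0x401000);
    regular_stack[0] = 0xdeadbeef; /* simulated return-address overwrite */
    do_ret();                      /* aborts: shadow copy differs        */
    return 0;
}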
Just as x86 systems can be protected using CET, ARM provides equivalent hardware-assisted mechanisms through PAC (Pointer Authentication Codes) and BTI (Branch Target Identification). The Clang flag -mbranch-protection enables these protections, which are designed to prevent code-reuse attacks such as ROP, JOP, and COP on ARM systems.
Clang exposes PAC and BTI through the -mbranch-protection flag, which can take several values:

- -mbranch-protection=none: disables branch protection;
- -mbranch-protection=standard: enables both BTI and return-address signing (equivalent to bti+pac-ret);
- -mbranch-protection=bti: enables Branch Target Identification only;
- -mbranch-protection=pac-ret[+leaf][+b-key]: enables return-address signing, optionally extended to leaf functions and/or using the B key instead of the A key.
For more information on these flags, you can consult the ARM Developer Manual.
Branch Target Identification (BTI) is an architectural hardware security feature that restricts the set of valid destination addresses for indirect branch instructions. By enforcing where indirect calls and jumps are allowed to land, BTI helps mitigate Jump-Oriented Programming (JOP) and Call-Oriented Programming (COP) attacks.
To examine how this mechanism works in practice, we can once again compile a trivial function and inspect the generated assembly code.
Here is the assembly code of a simple hello world compiled without BTI:
STP X29, X30, [SP,#-0x10+var_s0]!
MOV X29, SP
ADRL X0, aHelloWorld ; "hello world\n"
BL .printf
LDP X29, X30, [SP+var_s0],#0x10
RET
And here is the hardened version compiled with BTI enabled:
BTI c
STP X29, X30, [SP,#-0x10+var_s0]!
MOV X29, SP
ADRL X0, aHelloWorld ; "hello world\n"
BL .printf
LDP X29, X30, [SP+var_s0],#0x10
RET
The only visible difference is the addition of a BTI instruction at the beginning of the function. This instruction plays a role similar to the ENDBR instruction on x86: every indirect branch must land on a valid BTI instruction, otherwise execution is aborted.
Unlike ENDBR, the BTI instruction takes an operand that specifies which types of indirect branches are allowed to target this location.
The operand determines which values of PSTATE.BTYPE, a 2-bit field that encodes the type of the incoming indirect control-flow transfer, are accepted:

- 00: none;
- 01: CALLS (c);
- 10: JUMPS (j);
- 11: JUMPS and CALLS (jc).

In this example, the BTI c instruction indicates that only indirect calls are allowed to branch to this function entry point. Indirect jumps targeting this address would result in a fault.
By enforcing these constraints, BTI significantly reduces the set of valid gadget entry points available to an attacker. While it does not completely prevent control-flow hijacking, it limits exploitation to a restricted set of legitimate targets, making JOP and COP chains much harder to construct reliably.
As with similar mechanisms on x86 (IBT), BTI doesn't protect against return-oriented programming. To achieve full control-flow integrity on ARM systems, BTI can be combined with PAC.
Pointer Authentication Codes (PAC) is a hardware-assisted mechanism on ARM64 that protects against ROP by signing return addresses.
When a function is called, PAC generates a cryptographic signature using:

- a secret key stored in a dedicated CPU register;
- a modifier that binds the signature to its context (for return addresses, typically the current stack pointer).
The return address is then signed and stored on the stack. When the function returns, the CPU verifies the signature using the same key and modifier. If the signature is invalid, the program triggers a fault, effectively preventing the execution of an attacker-controlled return address.
Pointer authentication takes advantage of the fact that pointers are stored in a 64-bit format, but not all those bits are needed to represent the address. The virtual address space layout is the following:
- kernel space: 0xFFF0_0000_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF;
- user space: 0x0000_0000_0000_0000 to 0x00FF_FFFF_FFFF_FFFF.

Any address that falls outside of both ranges is always invalid and results in a fault if accessed.
You can see that any valid virtual address has its top 12 bits as 0x000 or 0xFFF. When pointer authentication is enabled, the upper bits are used to store a signature and are not treated as part of the address. This signature is referred to as a Pointer Authentication Code (PAC). Its size can change depending on the architecture.
Here is an example taken from the ARM Developer Manual - Return-Oriented Programming:
SIGNED POINTER = PAC | POINTER
e.g:
PAC = 0x123
POINTER = 0x0007FFFF5678
SIGNED POINTER = 0x1237FFFF5678
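The packing can be illustrated with a toy model; the "MAC" below is a made-up stand-in (real PAC derives the code with a keyed cipher, QARMA on most implementations, from a secret key register and a modifier such as SP), and all values are arbitrary:

#include <stdint.h>
#include <stdio.h>

/* Fold a 12-bit stand-in signature into the unused top bits of a
   64-bit pointer: SIGNED POINTER = PAC | POINTER. */
static uint64_t toy_pac(uint64_t ptr, uint64_t key, uint64_t modifier)
{
    uint64_t mac = (ptr ^ key ^ modifier) * 0x9E3779B97F4A7C15ULL;
    uint64_t pac = mac >> 52;                           /* top 12 bits   */
    return (pac << 52) | (ptr & 0x000FFFFFFFFFFFFFULL); /* pack together */
}

int main(void)
{
    uint64_t key = 0x1234, sp = 0x7FFFFFFFE000, ptr = 0x0007FFFF5678;
    uint64_t signed_ptr = toy_pac(ptr, key, sp);

    printf("signed pointer: 0x%016llx\n", (unsigned long long)signed_ptr);
    /* Authentication recomputes the PAC with the same key and modifier;
       any change to either yields a mismatch and, on real hardware, a
       fault when the pointer is used. */
    printf("auth ok: %d\n", toy_pac(ptr, key, sp) == signed_ptr);
    return 0;
}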
To analyse this security mechanism, we can use our previous hello world example:
STP X29, X30, [SP,#-0x10+var_s0]!
MOV X29, SP
ADRL X0, aHelloWorld ; "hello world\n"
BL .printf
LDP X29, X30, [SP+var_s0],#0x10
RET
and compile it with PAC enabled:
PACIASP
STP X29, X30, [SP,#-0x10+var_s0]!
MOV X29, SP
ADRL X0, aHelloWorld ; "hello world\n"
BL .printf
LDP X29, X30, [SP+var_s0],#0x10
RETAA
As we can see, two things have changed between the two versions. A PACIASP instruction has been added at the beginning, and the RET instruction has been changed to a RETAA instruction.
The PACIASP instruction signs the return address with key register A and uses SP as a modifier. The RETAA instruction means that the return address must be checked using the key register A.
Several PAC bypasses exist, depending on how pointer authentication is implemented. In some cases, PAC is signed with a null modifier, allowing an attacker to reuse an old authenticated pointer since the signature is not bound to any execution context. Additionally, because the number of bits allocated to the PAC signature varies across architectures, it may sometimes be possible to brute-force the authentication code, among other implementation-specific weaknesses.
If you want to go further, here are some very interesting slides by Brandon Azad on PAC bypasses on iOS: iOS Kernel PAC, One Year Later.
Register zeroing is a hardening technique designed to reduce the amount of useful gadgets an attacker can reuse after hijacking the control flow of a program. By explicitly clearing registers before a function returns, register zeroing limits what an attacker can carry from one gadget to the next. Instead of inheriting a rich execution context, each return leaves the program in a mostly clean state, making exploit construction significantly more complex. This helps mitigate Return-Oriented Programming exploits.
Clang provides this mitigation through the -fzero-call-used-regs compilation flag which tells the compiler to zero out certain registers before the function returns.
The two top-level categories are:

- used: zero out used registers;
- all: zero out all registers, whether used or not.

The individual options are:

- skip: don't zero out any registers; this is the default;
- used: zero out all used registers;
- used-arg: zero out used registers that are used for arguments;
- used-gpr: zero out used registers that are GPRs;
- used-gpr-arg: zero out used GPRs that are used as arguments;
- all: zero out all registers;
- all-arg: zero out all registers used for arguments;
- all-gpr: zero out all GPRs;
- all-gpr-arg: zero out all GPRs used for arguments.

General-Purpose Registers (GPRs) are CPU registers that programs can use freely to hold temporary values, addresses, or intermediate results during execution; for example, on x86-64, registers like RAX or RCX can be set to zero or reused without breaking the program, while RSP is not freely usable because it must always point to the top of the stack.
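Before looking at the generated code, here is how the variants compared below might be built (a sketch):

clang -o main src/main.c
clang -fzero-call-used-regs=all -o main-all src/main.c
clang -fzero-call-used-regs=used-gpr -o main-used-gpr src/main.c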
Let's take a look at a simple hello world program:
0000000000001140 <hello_world>:
1140: 55 push rbp
1141: 48 89 e5 mov rbp,rsp
1144: 48 8d 3d b9 0e 00 00 lea rdi,[rip+0xeb9] # 2004 <_IO_stdin_used+0x4>
114b: b0 00 mov al,0x0
114d: e8 de fe ff ff call 1030 <printf@plt>
1152: 5d pop rbp
1153: c3 ret
If we compile it with -fzero-call-used-regs=all:
0000000000001140 <hello_world>:
1140: 55 push rbp
1141: 48 89 e5 mov rbp,rsp
1144: 48 8d 3d b9 0e 00 00 lea rdi,[rip+0xeb9] # 2004 <_IO_stdin_used+0x4>
114b: b0 00 mov al,0x0
114d: e8 de fe ff ff call 1030 <printf@plt>
1152: 5d pop rbp
1153: d9 ee fldz
1155: d9 ee fldz
1157: d9 ee fldz
1159: d9 ee fldz
115b: d9 ee fldz
115d: d9 ee fldz
115f: d9 ee fldz
1161: d9 ee fldz
1163: dd d8 fstp st(0)
1165: dd d8 fstp st(0)
1167: dd d8 fstp st(0)
1169: dd d8 fstp st(0)
116b: dd d8 fstp st(0)
116d: dd d8 fstp st(0)
116f: dd d8 fstp st(0)
1171: dd d8 fstp st(0)
1173: 31 c0 xor eax,eax
1175: 31 c9 xor ecx,ecx
1177: 31 ff xor edi,edi
1179: 31 d2 xor edx,edx
117b: 31 f6 xor esi,esi
117d: 45 31 c0 xor r8d,r8d
1180: 45 31 c9 xor r9d,r9d
1183: 45 31 d2 xor r10d,r10d
1186: 45 31 db xor r11d,r11d
1189: 0f 57 c0 xorps xmm0,xmm0
118c: 0f 57 c9 xorps xmm1,xmm1
118f: 0f 57 d2 xorps xmm2,xmm2
1192: 0f 57 db xorps xmm3,xmm3
1195: 0f 57 e4 xorps xmm4,xmm4
1198: 0f 57 ed xorps xmm5,xmm5
119b: 0f 57 f6 xorps xmm6,xmm6
119e: 0f 57 ff xorps xmm7,xmm7
11a1: 45 0f 57 c0 xorps xmm8,xmm8
11a5: 45 0f 57 c9 xorps xmm9,xmm9
11a9: 45 0f 57 d2 xorps xmm10,xmm10
11ad: 45 0f 57 db xorps xmm11,xmm11
11b1: 45 0f 57 e4 xorps xmm12,xmm12
11b5: 45 0f 57 ed xorps xmm13,xmm13
11b9: 45 0f 57 f6 xorps xmm14,xmm14
11bd: 45 0f 57 ff xorps xmm15,xmm15
11c1: c3 ret
Here we can clearly see the downside of using the -fzero-call-used-regs=all variant. In order to guarantee that no potentially useful state remains after the function returns, the compiler aggressively clears all classes of registers: general-purpose registers, SIMD registers (XMM), and even the x87 floating‑point stack (fldz, fstp st(0)). While this significantly reduces the availability of useful ROP gadgets, it comes at a high cost.
In practice, such an aggressive mode is often overkill. A more balanced approach is to use -fzero-call-used-regs=used-gpr, which only clears the general‑purpose registers that were actually used by the function.
Here is the code hardened with -fzero-call-used-regs=used-gpr:
0000000000001140 <hello_world>:
1140: 55 push rbp
1141: 48 89 e5 mov rbp,rsp
1144: 48 8d 3d b9 0e 00 00 lea rdi,[rip+0xeb9] # 2004 <_IO_stdin_used+0x4>
114b: b0 00 mov al,0x0
114d: e8 de fe ff ff call 1030 <printf@plt>
1152: 5d pop rbp
1153: 31 c0 xor eax,eax
1155: 31 ff xor edi,edi
1157: c3 ret
As you can see, the difference is much more subtle. Only the general‑purpose registers that were actually used by the function (RAX and RDI in this case) are explicitly cleared before returning. This drastically reduces the amount of additional instructions compared to the all variant, while still removing valuable attacker‑controlled state that could otherwise be reused as part of a ROP chain.
If a binary relies on any library (libc, etc.) not compiled with this flag, an attacker can still find plenty of gadgets there. That is why this mitigation is most effective for kernels or statically linked binaries, which do not depend on external libraries.
Modern CPUs use speculative execution to improve performance. When the processor encounters a conditional branch, it may predict the outcome and execute instructions from the predicted path before the branch condition is fully resolved. If the prediction is correct, the results are committed; if it is wrong, the architectural effects are discarded.
However, even when speculative execution is rolled back architecturally, microarchitectural side effects remain, such as changes in cache state. Speculation attacks exploit this behavior by deliberately causing mispredicted branches so that the CPU speculatively accesses sensitive data. An attacker can then infer this data through side-channel measurements, leading to vulnerabilities.
This class of attacks became public with the coordinated disclosure of Spectre and Meltdown in January 2018 by Google Project Zero.
To mitigate speculative execution attacks, Clang provides the compilation flag -mspeculative-load-hardening. For x86 targets, this flag can be combined with two main strategies, indirect masking and fencing, but in practice only indirect masking is commonly used for production code. Other options are also available, which we do not present here.
To analyse these mitigations, we will use the following simple C program:
#include <stdio.h>
int main(int argc, char **argv)
{
if (argc == 2)
printf("%s", argv[1]);
return 0;
}
This code is not vulnerable to any speculative execution attack. It is only used as a minimal example that contains a conditional branch.
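The hardened listings that follow were produced with the SLH flag; something along these lines:

clang -o main src/main.c
clang -mspeculative-load-hardening -o main-hardened src/main.c
clang -mspeculative-load-hardening -mllvm -x86-slh-lfence -o main-lfence src/main.c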
Here is the corresponding assembly code:
0000000000001140 <main>:
1140: 55 push rbp
1141: 48 89 e5 mov rbp,rsp
1144: 48 83 ec 10 sub rsp,0x10
1148: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0
114f: 89 7d f8 mov DWORD PTR [rbp-0x8],edi
1152: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
1156: 83 7d f8 02 cmp DWORD PTR [rbp-0x8],0x2
115a: 75 16 jne 1172 <main+0x32>
115c: 48 8b 45 f0 mov rax,QWORD PTR [rbp-0x10]
1160: 48 8b 70 10 mov rsi,QWORD PTR [rax+0x10]
1164: 48 8d 3d 99 0e 00 00 lea rdi,[rip+0xe99] # 2004 <_IO_stdin_used+0x4>
116b: b0 00 mov al,0x0
116d: e8 be fe ff ff call 1030 <printf@plt>
1172: 31 c0 xor eax,eax
1174: 48 83 c4 10 add rsp,0x10
1178: 5d pop rbp
1179: c3 ret
As described in the High Level Mitigation Approach section of the LLVM documentation on Speculative Load Hardening, one way to mitigate these attacks is to cause loads to be checked using branchless code to ensure that they are executing along a valid control flow path. In order to do that, Clang provides the SLH -mllvm -x86-slh-indirect option (default option).
The LLVM documentation illustrates this approach with the following example:
void leak(int data);
void example(int* pointer1, int* pointer2) {
if (condition) {
// ... lots of code ...
leak(*pointer1);
} else {
// ... more code ...
leak(*pointer2);
}
}
This code can be transformed into a hardened version such as:
uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
uintptr_t all_zeros_mask = 0;
void leak(int data);
void example(int* pointer1, int* pointer2) {
uintptr_t predicate_state = all_ones_mask;
if (condition) {
// Assuming ?: is implemented using branchless logic...
predicate_state = !condition ? all_zeros_mask : predicate_state;
// ... lots of code ...
//
// Harden the pointer so it can't be loaded
pointer1 &= predicate_state;
leak(*pointer1);
} else {
predicate_state = condition ? all_zeros_mask : predicate_state;
// ... more code ...
//
// Alternative: Harden the loaded value
int value2 = *pointer2 & predicate_state;
leak(value2);
}
}
The key aspect here is that the update of predicate_state is performed in a branchless manner. This forces the processor to wait for the branch condition to be resolved before the mask can be correctly computed. The resulting mask is then applied to either the pointer or the loaded value, ensuring that speculative execution cannot access or propagate sensitive data when the control flow is mispredicted.
If we compile our example using this hardening method, the resulting assembly code becomes significantly more verbose. We can see that pointers inside conditional branches are masked using conditional move instructions (CMOV). The key point is that the CPU does not speculate on the outcome of conditional moves. This ensures that the pointer values are only applied once the branch condition is fully resolved, preventing speculative execution from using or leaking sensitive data along mispredicted paths.
0000000000001140 <main>:
1140: 55 push rbp
1141: 48 89 e5 mov rbp,rsp
1144: 48 83 ec 30 sub rsp,0x30
1148: 48 c7 c0 ff ff ff ff mov rax,0xffffffffffffffff
114f: 48 89 45 e0 mov QWORD PTR [rbp-0x20],rax
1153: 48 89 e0 mov rax,rsp
1156: 48 c1 f8 3f sar rax,0x3f
115a: 48 89 45 e8 mov QWORD PTR [rbp-0x18],rax
115e: c7 45 fc 00 00 00 00 mov DWORD PTR [rbp-0x4],0x0
1165: 89 7d f8 mov DWORD PTR [rbp-0x8],edi
1168: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
116c: 83 7d f8 02 cmp DWORD PTR [rbp-0x8],0x2
1170: 75 02 jne 1174 <main+0x34>
1172: eb 12 jmp 1186 <main+0x46>
1174: 48 8b 4d e0 mov rcx,QWORD PTR [rbp-0x20]
1178: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18]
117c: 48 0f 44 c1 cmove rax,rcx
1180: 48 89 45 d8 mov QWORD PTR [rbp-0x28],rax
1184: eb 59 jmp 11df <main+0x9f>
1186: 48 8b 45 e0 mov rax,QWORD PTR [rbp-0x20]
118a: 48 8b 4d e8 mov rcx,QWORD PTR [rbp-0x18]
118e: 48 0f 45 c8 cmovne rcx,rax
1192: 48 8b 45 f0 mov rax,QWORD PTR [rbp-0x10]
1196: 48 8b 40 10 mov rax,QWORD PTR [rax+0x10]
119a: 48 89 ce mov rsi,rcx
119d: 48 09 c6 or rsi,rax
11a0: 48 8d 3d 5d 0e 00 00 lea rdi,[rip+0xe5d] # 2004 <_IO_stdin_used+0x4>
11a7: 31 c0 xor eax,eax
11a9: 48 c1 e1 2f shl rcx,0x2f
11ad: 48 09 cc or rsp,rcx
11b0: e8 7b fe ff ff call 1030 <printf@plt>
11b5: 48 8b 55 e0 mov rdx,QWORD PTR [rbp-0x20]
11b9: 48 8b 74 24 f8 mov rsi,QWORD PTR [rsp-0x8]
11be: 48 89 e1 mov rcx,rsp
11c1: 48 c1 f9 3f sar rcx,0x3f
11c5: 48 8d 3d e9 ff ff ff lea rdi,[rip+0xffffffffffffffe9] # 11b5 <main+0x75>
11cc: 48 39 fe cmp rsi,rdi
11cf: 48 0f 45 ca cmovne rcx,rdx
11d3: 48 89 4d d0 mov QWORD PTR [rbp-0x30],rcx
11d7: 48 8b 45 d0 mov rax,QWORD PTR [rbp-0x30]
11db: 48 89 45 d8 mov QWORD PTR [rbp-0x28],rax
11df: 48 8b 4d d8 mov rcx,QWORD PTR [rbp-0x28]
11e3: 31 c0 xor eax,eax
11e5: 48 c1 e1 2f shl rcx,0x2f
11e9: 48 09 cc or rsp,rcx
11ec: 48 83 c4 30 add rsp,0x30
11f0: 5d pop rbp
11f1: c3 ret
11f2: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
11f9: 00 00 00
11fc: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
To better understand this mechanism, let us take a closer look at what this code does in practice.
Mask initialization
mov rax,0xffffffffffffffff
mov QWORD PTR [rbp-0x20],rax
The compiler initializes a mask with all bits set (0xFFFFFFFFFFFFFFFF). This value represents a speculation-dependent mask that is later used to either preserve or deliberately corrupt pointers, ensuring that mis-speculated execution paths cannot access meaningful data.
Extracting the most significant bit of RSP
mov rax,rsp
sar rax,0x3f
mov QWORD PTR [rbp-0x18],rax
This sequence extracts the most significant bit of the stack pointer by performing an arithmetic right shift. Since RSP is expected to hold a canonical user-space address, this bit should normally be zero.
First conditional branch
After the if (argc == 2) check, the CPU may speculatively execute one of the branches.
mov rax,QWORD PTR [rbp-0x20]
mov rcx,QWORD PTR [rbp-0x18]
cmovne rcx,rax
Here, a conditional move (CMOVNE) is used to update the mask stored in RCX.
The important property is that the CPU does not speculate on the result of a conditional move. As a consequence, it must wait for the condition flags to be resolved before applying the move, which creates a data dependency.
Once the condition is resolved, speculative execution may continue, but the mask will now correctly reflect the control-flow outcome. Depending on the condition, the mask will be either 0xFFFF_FFFF_FFFF_FFFF or 0x0000_0000_0000_0000.
Applying the mask to the pointer
At this point, RCX holds the computed mask.
mov rax,QWORD PTR [rbp-0x10]
mov rax,QWORD PTR [rax+0x10]
mov rsi,rcx # copy the mask into RSI
or rsi,rax # OR the pointer to argv[1] with our mask
lea rdi,[rip+0xe5d] # 2004 <_IO_stdin_used+0x4>
xor eax,eax
shl rcx,0x2f
or rsp,rcx
call 1030 <printf@plt>
Here, the pointer to argv[1] is combined with the mask using a bitwise OR. If the execution path is valid, the mask preserves the pointer value. If the path is invalid due to mis-speculation, the mask corrupts the pointer, preventing it from being used to access meaningful data.
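The same idiom can be written out in C; here is a sketch of what SLH does conceptually (this is not the compiler's actual output):

#include <stdint.h>
#include <stdio.h>

/* On the architecturally valid path the mask is 0 and the OR leaves
   the pointer untouched; on a mis-speculated path the mask is all
   ones, turning the pointer into a non-canonical address that cannot
   be dereferenced, even speculatively, to leak data. */
static const char *harden_pointer(const char *p, uintptr_t mask)
{
    return (const char *)((uintptr_t)p | mask);
}

int main(void)
{
    const char *msg = "hello";
    printf("%s\n", harden_pointer(msg, 0));                     /* valid path */
    printf("%p\n", (void *)harden_pointer(msg, ~(uintptr_t)0)); /* poisoned   */
    return 0;
}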
⚠️ Enabling this feature has a notable performance impact, but it is still less costly than inserting lfence barriers everywhere (see next section). An attribute can be used to restrict hardening to specific functions, avoiding the need to slow down the entire program.
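This attribute can be applied per function; here is a sketch with a hypothetical lookup function:

#include <stddef.h>

/* Only this function pays the cost of speculative load hardening,
   via Clang's speculative_load_hardening function attribute. */
__attribute__((speculative_load_hardening))
int lookup(const int *table, size_t index, size_t size)
{
    if (index < size)
        return table[index];
    return -1;
}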
LFENCE barriers

Another way to avoid speculative execution on branches is to insert LFENCE instructions after all conditional jumps to act as speculation barriers. To select this strategy, Clang provides the SLH option -mllvm -x86-slh-lfence. The LFENCE instruction does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. As such, it acts as a barrier: the branch is never executed speculatively.
push rbp
mov rbp,rsp
sub rsp,0x10
mov DWORD PTR [rbp-0x4],0x0
mov DWORD PTR [rbp-0x8],edi
mov QWORD PTR [rbp-0x10],rsi
cmp DWORD PTR [rbp-0x8],0x2
jne 1175 <main+0x35>
lfence
mov rax,QWORD PTR [rbp-0x10]
mov rsi,QWORD PTR [rax+0x10]
lea rdi,[rip+0xe96] # 2004 <_IO_stdin_used+0x4>
xor eax,eax
call 1030 <printf@plt>
lfence
xor eax,eax
add rsp,0x10
pop rbp
ret
⚠️ We do not recommend using this compilation flag as its impact on program performance is very significant. It is mentioned here for completeness only.
Over the years, the list of recommended compiler hardening options has significantly evolved, driven by the continuous discovery of new classes of vulnerabilities and exploitation techniques. New compilation flags have been introduced to mitigate threats such as control-flow hijacking, speculative execution attacks, and information leaks at both the architectural and micro-architectural levels.
While these mitigations greatly improve the security posture of compiled binaries, they often come with a non-negligible performance cost. As a result, enabling them requires careful consideration and an informed trade-off between security and performance, depending on the threat model and deployment context.
The growing awareness that security is a first-class concern in modern software development suggests that these compiler options will continue to evolve. New mitigations will be added, existing ones will be refined, and defaults may change over time. Staying up to date with compiler developments and security recommendations is therefore essential.
Finally, compiler hardening options, while powerful, are not sufficient on their own. They must be combined with good programming practices, regular code reviews and audits. In addition, software protection techniques such as obfuscation, integrity checks or Runtime Application Self Protection (RASP) make the identification and exploitation of vulnerabilities more difficult and slow down attackers.
We would like to thank our colleagues Laurent Laubin and Rémy Salim for their thorough review of this article as well as their suggestions.
If you would like to learn more about our security audits and explore how we can help you, get in touch with us!