Branch target
Many architectures encode a branch/jump/call instruction with PC-relative addressing, i.e. the distance to the target is encoded in the instruction. In an executable or shared object (called a component in ELF), if the target is bound to the same component, the instruction has a fixed encoding at link time; otherwise the target is unknown at link time and there are two choices:
- text relocation
- add indirection
In All about Global Offset Table, I mentioned that linker/loader developers often frowned upon text relocations because the text segment will be unshareable. In addition, the number of relocations would be dependent on the number of calls, which can be large.
1 | call foo # R_X86_64_PC32 |
Therefore, the prevailing scheme is to add a level of indirection analogous to that provided by the Global Offset Table for data.
Procedure Linkage Table
ELF uses Procedure Linkage Table to provide the indirection. Here is an x86-64 example:
1 | <.text>: |
In the relocatable object file, we have two call instructions. The linker synthesizes stubs in .plt
and redirects calls into the stubs.
In the assembly dump, foo@plt
is such a stub. Note that foo@plt
is not a symbol table entry. It is just that objdump displays the PLT entry as foo@plt
. We use the notation to describe a stub. The stub consists of 3 instructions. The first jmpq instruction loads an entry from .got.plt
and performs an indirect jump. The remaining two instructions will be described when introducing lazy binding.
.got.plt
holds an array of word size entries. On some architectures (x86-32, x86-64) .got.plt[0]
is the link time address of _DYNAMIC
. .got.plt[1]
and .got.plt[2]
are reserved by ld.so. .got.plt[1]
is a descriptor of the current component while .got.plt[2]
is the address of the PLT resolver.
The subsequent entries are for resolved function addresses. Each entry is relocated by an R_*_JUMP_SLOT
relocation describing the target function symbol. Resolving a function address is also called a binding. There are two binding schemes.
Eager binding
The simpler scheme is eager binding. Before ld.so transfers control to the component, during relocation resolving, ld.so resolves the address of foo
and writes it into the .got.plt
entry. Therefore, when the program runs, the .got.plt
entry is guaranteed to hold the address of the target function.
This is the model used by musl. In other ld.so implementations, lazy binding is usually the default. However, with the rise of security hardening on Linux distributions, many have switched to full RELRO linking scheme. If the component is linked with ld -z now
, the linker will set the DF_1_NOW
dynamic tag, and ld.so will use eager binding.
Eager binding can also be enabled with the environment variable LD_BIND_NOW=1
.
For better understanding, we can ignore many instructions in the assembly output above.
1 | <.text>: |
Eager binding has a bonus: it can detect underlinking problems. See Dependency related linker options for details.
Lazy binding
The extra instructions and extra entries in .got.plt
are all for lazy binding.
The ELF dynamic shared library scheme postpones binding from the link time to the load time. There is some slowdown due to work before the program starts. In addition, a shared object has more code than the portion linked into the executable with static linking. This is because shared objects usually need to export all symbols which can defeat the archive member extraction trick and linker garbage collection.
Anyhow, eagerly resolving all R_*_JUMP_SLOT
relocations had a significant overhead. (The overhead is exacerbated by the typical ld.so implementation. In the absence of LD_PRELOAD
, ld.so first looks at the symbol table of the executable itself, then at the symbol tables of the DT_NEEDED
entries (in order), and then at the second level DT_NEEDED
entries, and so on. You may think that symbol search can be optimized if all defined symbols are added in one global hash table. However, a global hash table also costs some memory. In addition, dlopen, dlmopen, symbol versioning, and probably some features I don't know can all introduce complexity in the process and make a global hash table complex. Anyway, none of glibc/musl/FreeBSD rtld implements it. )
In the lazy binding scheme, ld.so does minimum work to ensure the jmp instruction lands in a trampoline which will tail call the PLT resolver in ld.so. On x86-64, the associated .got.plt
entry refers to the pushq instruction immediately after the jmpq. The pushq instructions pushes a descriptor (for the .got.plt
entry) and jumps to the PLT header. The PLT header is a stub calling the function address stored at .got.plt+16
(PLT resolver in ld.so) with an extra argument (descriptor for the current component).
1 | <.plt>: |
The PLT resolver in ld.so looks up the symbol with the .got.plt
descriptor, writes the resolved address to the .got.plt
entry, and then performs a tail call to the resolved address. Now that the .got.plt
entry has been updated to the actual function address, subsequent calls to foo
will bypass the PLT resolver.
-fno-plt
After the .got.plt
entry is resolved, a function call requires the following 3 instructions:
1 | In .text: call foo # 5 bytes |
An optimization idea is that we can just use one instruction (note that a GOT entry is eager binding), essentially inlining the PLT entry:
1 | In .text: call *.got.plt+offset(%rip) # 6 bytes |
GCC 6.0 added -fno-plt
to AArch64 and x86 to perform this transformation.
If the target is bound to the same component, GNU ld, gold, and LLD rewrite the instruction to addr32 call foo
(with an instruction prefix). Anyhow, the long code sequence can be slightly slower. On RISC architectures, -fno-plt
is usually bad. For example, AArch64 needs 3 instructions:
1 | adrp x0, _GLOBAL_OFFSET_TABLE_ |
The function attribute __attribute__ ((noplt))
can be added to individual declarations for fine-grained control.
GOT setup is expensive without PC-relative addressing
On some older architectures, PLT may introduce overhead in function prologue code. The fundamental problem is that they do not support memory load with PC-relative addressing (e.g. x86-64 PC-relative instructions). Such architectures typically make a distinction between non-PIC PLT and PIC PLT.
x86-32 is a notorious example.
1 | void bar(void); |
1 | foo: |
What went wrong?
To work around the lack of PC-relative addressing, the i386 psABI reserves the callee-saved register ebx
to refer to the GOT base (_GLOBAL_OFFSET_TABLE_
). Then the PLT entry can load the .got.plt
entry (by adding an offset to the GOT base) and do an indirect jump.
Using a callee-saved register has pros and cons. On the upside, a function just needs to compute _GLOBAL_OFFSET_TABLE_
once, because callees cannot clobber the register. On the downside, one general-purpose register is wasted (significant on x86-32 due to the lack of general-purpose registers. Prologue/epilogue code needs to save and restore the register, so tail calls are inhibited.
32-bit x86 Position Independent Code - It's that bad mentions an alternative scheme which allows a tail call:
1 | foo: |
jmp *bar@GOT+_GLOBAL_OFFSET_TABLE_(%ecx)
requires a relocation type representing the distance from PC to the GOT entry. Unfortunately R_386_GOT32
and R_386_GOT32
relocation types are misdesigned and do not support the usage. That said, adding an extra addl
instruction is not too bad. GCC since 6.0 can produce:
1 | foo: |
This code sequence uses a GOT-generating relocation which may defeat lazy binding, so it is not the default. Thankfully x86-64 supports PC-relative addressing and does not have the aforementioned problem.
Canonical PLT entries
For position dependent code (-fno-pic
), traditionally the compiler optimizes for statically linked executables and uses direct addressing (usually absolute relocations). Taking the address of a function does not use GOT indirection. If the symbol turns out to be external, the linker has to employ a trick called "canonical PLT entry" (st_shndx=0, st_value!=0
). The term is a parlance within a few LLD developers, but not broadly adopted.
Every PLT entry has an associated R_*_JUMP_SLOT
entry and has a dynamic symbol (st_shndx=0
, display as UND
in readelf
output). If we assign a non-zero value to the symbol, we can consider the address of the PLT entry as canonical and bind references from executable/shared objects to it.
1 |
|
1 | % gcc -m32 -shared -fuse-ld=bfd -fpic b.c -o b.so |
If b.so
binds foo
locally, e.g. if it is linked with -Bsymbolic-functions
, taking the address from the executable and from the shared object will get different results.
To fix this (C and C++ require address uniqueness), compilers need to use GOT indirection. See Copy relocations, canonical PLT entries and protected visibility, sadly it doesn't seem that GCC folks want to take action.
x86-32 has an extra problem that it uses R_386_PC32
to represent a function call from non-PIC code. See my article for details.
Stack unwinding
As linker synthesized code, the PLT entries may not have unwind tables information. Unwinders using .eh_frame
needs to recognize the _PROCEDURE_LINKAGE_TABLE_
region. See Explain GNU style linker options --no-ld-generated-unwind-info
(the positive option is unnecessary in my opinion) for details.
Case study
AArch64
AArch64 has the best PLT design in my opinion.
PLT header
1 | stp x16, x30, [sp,#-16]! |
.plt[n]
1 | adrp x16, Page(&(.plt.got[n])) |
The PLT resolver (e.g. _rtld_bind_start
in FreeBSD rtld) can compute the R_AARCH64_JMP_SLOT
index from &.plt.got[n]
(saved x16) and &.plt.got[2]
(x16).
When Arm v8.5 Branch Target Enablement is enabled, all indirect branches must land on a bti
instruction. A PLT entry may indirectly branch to the PLT header, so the PLT header needs a bti
instruction.
1 | % aarch64-linux-gnu-gcc -fpic -mbranch-protection=standard -shared a.c -o a.so |
PowerPC32
Power Architecture® 32-bit Application Binary Interface Supplement 1.0 - Linux® & Embedded specifies two PLT ABIs: BSS-PLT and Secure-PLT.
BSS-PLT is the older one. While .plt
on other architectures are created by the linker, BSS-PLT let ld.so generate the PLT entries. This has the advantage that the section can be made SHT_NOBITS
and therefore not occupy file size. The downside is the security concern of writable and executable memory pages. Even worse, as an implementation issue, GNU ld places .plt
in the text segment and therefore the whole text segment is writable and executable. -z relro -z now
has no effect.
In the newer Secure-PLT ABI, .plt
holds the table of function addresses. This usage is weird, because on other architectures such a section is called .got.plt
.
Like x86-32, PowerPC32 lacks memory load with relative addressing and has a distinction between non-PIC PLT and PIC PLT.
1 | 00000000.plt_call32.f: |
The call-clobbered register r11 is used to compute the .plt
entry address. For a non-PIC PLT, just use absolute addressing.
For a PIC PLT, the callee-saved register r30 holds the GOT base. PowerPC32 is one of the few old architectures where GCC -fpic
and -fPIC
are different. For -fpic
, the GOT base is at _GLOBAL_OFFSET_TABLE_
in the component. For -fPIC
, the GOT base is at .got2
for the current translation unit. The component may have multiple translation units and each has a different .got2
.
Unlike x86 (which just places the lazy binding code after the initial jmpq
instruction), the Secure-PLT ABI places the lazy binding code in __glink_PLTresolve
(like PLT header on x86) and following entries. When I added PowerPC32 port to LLD, I picked the glink code in a separate section .glink
for clarity.
For PIC function prologue code, setting up r30 is expensive. The previous foo tail calling bar C code requires these many instructions.
1 | <foo>: |
PowerPC64 ELFv1
The latest ABI is 64-bit PowerPC ELF Application Binary Interface Supplement 1.9. It defines the TOC ("Table of Contents") section which combines the functions of the GOT and the small data section. To optimize for inter-component calls, the ABI tries to avoid the per-function TOC base setup code. The TOC base address is stored in r2 which is not clobbered by inter-component calls. We will see later that as a result intra-component calls are slower.
The single .got.plt
entry on other architectures is replaced by a function descriptor which contains 3 entries:
- The first doubleword contains the address of the entry point of the function.
- The second doubleword contains the TOC base address for the function.
- The third doubleword contains the environment pointer for languages such as Pascal and PL/1.
1 | 00000000000002c0 <0000001a.plt_call.bar>: |
bl 0x2c0
jumps to the call stub. The call stub loads the function address and the TOC base address from the function descriptor. If the TOC base address is zero, which means the function descriptor hasn't been resolved, branch to the glink entry to resolve the function descriptor; otherwise branch to the resolved function address. In both cases, r2 has been updated to the new TOC base address.
After returning from the callee, the linker synthesized ld r2,40(r1)
is executed to restore the TOC base address.
When the target function is bound locally, the instruction after bl
is a nop
(instead of ld r2,40(r1)
). This scheme is better than the compiler unconditionally generating a code sequence to load the TOC base address.
When the target function is external, the scheme pays the cost in the call stub. This scheme is worse than the compiler unconditionally generating a code sequence to load the TOC base address.
PowerPC64 ELFv2
Here are the main pain points with ELFv1:
- Function descriptors waste space.
- Call stubs are too long (7 instructions padded by one nop).
The idea is to move the TOC base address setup code from call stubs to the start of functions.
1 | 00000000000002c0 <0000001a.plt_call.bar>: |
The distance between the global entry and the local entry is normally 8 bytes (2 instructions), but there can be more in the large code model. I have seen BoringSSL using more instructions to instrument assembly files to avoid relocations from text to data.
The globalentry/localentry scheme requires caution in many parts of the toolchain: taking addresses, copying symbols, computing branch distance for range extension thunks, etc.
In POWER10, IBM finally decided to add PC-relative instructions (Prefixed Load).
1 | 0000000000000300 <0000001a.plt_call.bar>: |
So why do we still need glink entries? Well, the pld
instruction loads the .plt
entry (.got.plt
on other architectures), but the address of the .plt
entry is not communicated to the PLT resolver. This means that for the small code model the pld
scheme is not optimal.
Compatibility is a thing. How to make the PC-relative instructions work with TOC and globalentry/localentry need a lot of thought and some complexity in the linker. I am sure that if you read the GNU ld or gold code, you will be amused:)
RISC-V
RISC-V supports memory load with PC-relative addressing, so it doesn't need non-PIC PLT vs PIC PLT distinction. However, its psABI defines the unneeded R_RISCV_CALL
and GNU ld does something special with it. I filed R_RISCV_CALL vs R_RISCV_CALL_PLT
about deprecation for the former.
Otherwise, its PLT scheme is quite good.
1 | Disassembly of section .plt: |
The address of an anchor in the PLT entry is available in t1, so the R_*_JUMP_SLOT
index can be computed in the PLT header.
x86
LLD supports -z retpolineplt
for Spectre v2 mitigation. The indirection branch needs to be protected by a special code sequence. With -z now
, the code sequence can be simplified a bit.
As part of Intel's Control-flow Enforcement Technology, Indirect Branch Tracking requires that indirect jumps land on an endbr
instruction.
1 | % gcc -fpic -fcf-protection=full -shared a.c -o a.so |
An endbr64
instruction takes 4 bytes and does not fit in the current 16-byte PLT entry scheme. They somehow decided to add a new section .plt.sec
. The linker uses endbr64
if all object files have the GNU_PROPERTY_X86_FEATURE_1_IBT
GNU property.
1 | Disassembly of section .plt: |
foo
will jump to bar@plt
in .plt.sec
, then either jump to the corresponding .plt
entry or the target function directly.
Let me register a complaint here. I think .plt.sec
is unneeded complexity and I proposed:
1 | endbr64 |
https://groups.google.com/g/x86-64-abi/c/sQcX3__r4c0 has more discussions. Most folks agreed that the 2-PLT scheme was over-engineered. After this event, I subscribed to x86-64-abi in case I missed such over-engineering designs in the future.
Summary
1 | if (supports memory load with PC-relative addressing) { |
The history of *NIX shared library support
... according to my archaeology.
In the late 1980s, SunOS 4.x extended the a.out binary format with dynamic shared library support. Unlike previous static shared library schemes, the a.out shared libraries are position independent and can be loaded at different addresses. The SunOS linker synthesized the jump linkage table to allow jumping to a target bound at the load time.
ELF uses Procedure Linkage Table which is similar to the SunOS a.out format.
In 1993, on NetBSD, https://github.com/NetBSD/src/commit/97ca10e37476fb84a20a8ec4b0be3188db703670 (A linker supporting shared libraries.
) and https://github.com/NetBSD/src/commit/3d68d0acaed0a32f929b2c174146c62940005a18 (A linker supporting shared libraries (run-time part).
) added shared library support similar to the SunOS scheme.
In 1994, GNU ld added a.out and ELF shared library support.
In 1995, glibc added ELF shared library support.
Dynamic shared library support on Mach-O was later. In a NeXTSTEP manual released in 1995, I can find MH_FVMLIB
(fixed virtual memory library, which appears to be a static shared library scheme) but not MH_DYLIB
(used by modern macOS for .dylib files). On Mach-O, the linker synthesizes call stubs in __TEXT,__stubs
and redirects calls into this section.