MMU-less systems and RISC-V FDPIC

UNDER CONSTRUCTION

This article describes some ABI and toolchain notes about systems without a Memory Management Unit (MMU), with a focus on FDPIC and the future RISC-V FDPIC ABI.

Linux binfmt loaders

fs/Kconfig.binfmt defines a few loaders.

BINFMT_ELF defaults to y and depends on MMU.
BINFMT_ELF_FDPIC defaults to y when BINFMT_ELF is not selected. A few architectures support BINFMT_ELF_FDPIC for NOMMU. ARM supports FDPIC even with an MMU.
BINFMT_FLAT is provided for a few architectures.

BINFMT_AOUT, removed in 2022, had been supported for alpha/arm/x86-32.

BFLT

uClinux uses an object file format called Binary Flat format (BFLT). https://myembeddeddev.blogspot.com/2010/02/uclinux-flat-file-format.html has an introduction.

The Linux kernel has supported the format since before the git era (fs/binfmt_flat.c). Both version 2 (OLD_FLAT_VERSION in the kernel) and version 4 are supported. Shared library support was removed in April 2022.
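As a rough illustration of the two supported versions, here is a minimal sketch of a header check. It assumes the header starts with the 4-byte magic "bFLT" followed by a big-endian 32-bit version field, as in the kernel's flat.h. The helper names read_be32 and bflt_version are hypothetical, and the rest of the header is deliberately ignored.

```c
#include <stdint.h>
#include <string.h>

/* Version numbers mentioned in the text; values assumed from the kernel's flat.h. */
#define OLD_FLAT_VERSION 2u
#define FLAT_VERSION 4u

/* Read a big-endian 32-bit value (header fields are assumed to be big-endian). */
static uint32_t read_be32(const unsigned char *p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Return the format version (2 or 4) if buf looks like a BFLT header, 0 otherwise. */
static unsigned bflt_version(const unsigned char *buf, size_t len) {
  if (len < 8 || memcmp(buf, "bFLT", 4) != 0)
    return 0;
  uint32_t rev = read_be32(buf + 4);
  return (rev == OLD_FLAT_VERSION || rev == FLAT_VERSION) ? (unsigned)rev : 0;
}
```

A caller would pass the first bytes of the image file and treat a zero return as "not a supported BFLT image".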

BFLT is not for relocatable files. An image file is typically converted from ELF using elf2flt. ld-elf2flt is a ld wrapper that invokes elf2flt when the option -elf2flt is seen.

GCC's m68k port supports -msep-data, a special -fPIC mode, which assumes that text and data segments are placed in different areas of memory. This option is used for XIP (eXecute In Place).

`-mno-pic-data-is-text-relative`

kpatch (live kernel patching) uses this option for s390x.

FDPIC

The Linux kernel has supported the format since before the git era (fs/binfmt_elf_fdpic.c). Only ET_DYN executables are supported. Each port that supports FDPIC defines an EI_OSABI value to be checked by the loader.

Several architectures define an FDPIC ABI.

Fujitsu FR-V: The FR-V FDPIC ABI, initial version in 2004
Blackfin: Blackfin FDPIC ABI
SuperH: The SH FDPIC ABI, initial version in 2008
ARM: ARM FDPIC ABI, initial version in 2013

Here is a summary.

The read-only sections, which can be shared, are commonly referred to as the "text segment", whereas the writable sections are non-shared and commonly referred to as the "data segment". Functions and certain data symbols (.rodata) reside in the text segment, while other data symbols and the GOT reside in the data segment. Special entries called "canonical function descriptors" also reside in the GOT.

A call-clobbered register is reserved as the FDPIC register, used to access the data segment. Upon entry to a function, the FDPIC register holds the address of _GLOBAL_OFFSET_TABLE_. The text segment can be referenced using PC-relative addressing. We will see later that sometimes it's unknown whether a non-preemptible symbol resides in the text segment or the data segment, in which case GOT-indirect addressing with the FDPIC register has to be used.

A function call is called external if the destination may reside in another module, which has a different data segment and therefore needs a different FDPIC register value. Therefore, an external function call needs to update the FDPIC register as well as changing the program counter (PC). The FDPIC register can be spilled into a stack slot or a call-saved register, if the caller needs to reference the data segment later.

Calling a function pointer, including calling a PLT entry, also sets both the FDPIC register and PC. When the address of a function is taken, the address of its canonical function descriptor is obtained, not that of the entry point. The descriptor, which resides in the GOT, contains pointers to both the function's entry point and its FDPIC register value. The two GOT entries are relocated by a dynamic relocation of type R_*_FUNCDESC_VALUE (e.g. R_FRV_FUNCDESC_VALUE).

If the symbol is preemptible, the code sequence loads a GOT entry. When the symbol is a function, the GOT entry is relocated by a dynamic relocation R_*_FUNCDESC and will contain the address of the function descriptor.
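The descriptor mechanism can be modeled in plain C. The following is only a sketch: the FDPIC register is simulated as an explicit parameter, and struct funcdesc, module_get, and call_via_descriptor are hypothetical names, not part of any ABI.

```c
#include <stdint.h>

/* Hypothetical model of a canonical function descriptor: two adjacent GOT words. */
struct funcdesc {
  int (*entry)(const intptr_t *got); /* function entry point (text segment) */
  const intptr_t *got;               /* FDPIC register value of the defining module */
};

/* Stands in for a function in shared text: it reaches its data only
   through the (simulated) FDPIC register. */
static int module_get(const intptr_t *got) { return (int)got[0]; }

/* Calling through a function pointer: load both words of the descriptor,
   set the FDPIC register, then transfer control. */
static int call_via_descriptor(const struct funcdesc *fd) {
  return fd->entry(fd->got);
}
```

Two descriptors that share the same entry point but carry different FDPIC register values behave like two processes sharing one text segment while keeping private data segments.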

Let's check out examples that take the addresses of functions and variables.

__attribute__((visibility("hidden"))) void hidden_fun();
void fun();
__attribute__((visibility("hidden"))) extern int hidden_var;
extern int var;
__attribute__((visibility("hidden"))) const int ro_hidden_var = 42;

void *get_hidden_fun() { return hidden_fun; }
void *get_fun() { return fun; }
void *get_hidden_var() { return &hidden_var; }
void *get_var() { return &var; }
const int *get_ro_hidden_var() { return &ro_hidden_var; }

Function access in FDPIC

Canonical function descriptors are stored in the GOT, and their access depends on whether the referenced function is preemptible or not.

For non-preemptible functions: the address of the descriptor is directly computed by adding an offset to the FDPIC register.
For preemptible functions: a GOT entry is loaded first. This entry, relocated by a R_*_FUNCDESC dynamic relocation, holds the final address of the function descriptor.

// arm-linux-gnueabihf-gcc -c -fpic -mfdpic -Wa,--fdpic
get_hidden_fun: // non-preemptible function
  ldr     r0, .L3            // r0 = &.got[n] - FDPIC
  add     r0, r0, r9         // r0 = &.got[n]; the address of the canonical function descriptor
  ...
.L3:
// Linker resolves this to &.got[n] - FDPIC. .got[n], relocated by R_ARM_FUNCDESC_VALUE, is the canonical function descriptor.
  .word   hidden_fun(GOTOFFFUNCDESC) // R_ARM_GOTOFFFUNCDESC(hidden_fun)

get_fun: // preemptible function
  ldr     r3, .L6            // r3 = &.got[n] - FDPIC
  ldr     r0, [r9, r3]       // r0 = .got[n]; the address of the canonical function descriptor
  ...
.L6:
// Linker resolves this to &.got[n] - FDPIC. .got[n], relocated by R_ARM_FUNCDESC, will contain the address of the canonical function descriptor.
  .word   fun(GOTFUNCDESC)   // R_ARM_GOTFUNCDESC(fun)

Unfortunately, when linking a DSO, an R_ARM_GOTOFFFUNCDESC relocation referencing a hidden symbol results in a linker error. This error likely arises because the generated R_ARM_FUNCDESC_VALUE dynamic relocation requires a dynamic symbol. While this can be implemented using an STB_LOCAL STT_SECTION dynamic symbol, GNU ld currently lacks support for this approach.

% arm-linux-gnueabihf-gcc -fpic -mfdpic -O2 -Wa,--fdpic q.c -shared
/tmp/ccxpnij8.o: in function `get_hidden_fun':
q.c:(.text+0x10): dangerous relocation: no dynamic index information available
collect2: error: ld returned 1 exit status

Let's try sh4. sh4-linux-gnu-gcc -fpic -mfdpic -O2 q.c -shared -nostdlib allows taking the address of a hidden function but not a protected function.

Then, let's see a global variable initialized by the address of a function and a C++ virtual table.


struct A { virtual void foo(); };
void call(A *a) { a->foo(); }
auto *var_call = call;

// arm-linux-gnueabihf-g++ -c -fpic -mfdpic -Wa,--fdpic
  ldr     r3, [r0]           // load vtable
...
  ldr     r3, [r3]           // load vtable entry `.word _ZN1A3fooEv(FUNCDESC)`
  ldr     r9, [r3, #4]       // load FDPIC register value
  ldr     r3, [r3]           // load foo's entry point
  blx     r3

.section        .data.rel,"aw"
var_call:
// Function descriptor address, relocated by R_ARM_FUNCDESC dynamic relocation
  .word   _Z4callP1A(FUNCDESC) // R_ARM_FUNCDESC

.section        .data.rel.ro,"aw"
_ZTV1A:
  .word   0
  .word   _ZTI1A
// Function descriptor address, relocated by R_ARM_FUNCDESC dynamic relocation
  .word   _ZN1A3fooEv(FUNCDESC) // R_ARM_FUNCDESC

Data access in FDPIC

GOT-indirect addressing is required for accessing data symbols under two conditions:

Preemptible symbols: Traditional GOT requirement.
Non-preemptible symbols with potential data segment placement: This includes
- Writable data symbols: This covers both locally declared (int var;) and externally declared (extern int var;) non-const variables.
- Potential dynamic initialization: const A a; extern const int var;
- Certain guaranteed constant initialization: extern constinit const int *const extern_const;. Constant initialization may require a relocation, e.g. constinit const int *const extern_const = &var;

get_hidden_var: // non-preemptible data with potential data segment placement
  ldr     r3, .L9             // r3 = &.got[n] - FDPIC
  ldr     r0, [r9, r3]
  ...
.L9:
// Linker resolves this to &.got[n] - FDPIC. .got[n], relocated by R_ARM_RELATIVE, will contain the address of hidden_var.
  .word   hidden_var(GOT)    // R_ARM_GOT_BREL

get_var: // preemptible data
  ldr     r3, .L12
  ldr     r0, [r9, r3]
  ...
.L12:
// Linker resolves this to &.got[n] - FDPIC. .got[n], relocated by R_ARM_GLOB_DAT, will contain the address of var.
  .word   var(GOT)           // R_ARM_GOT_BREL

The dynamic relocations R_*_RELATIVE and R_*_GLOB_DAT do not use the standard + load_base semantics. It seems that musl fdpic doesn't support the special R_*_RELATIVE.
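One way to picture the difference: since the text and data segments are placed independently, there is no single load base, and a link-time address has to be translated through the segment that contains it. The sketch below assumes a per-segment record modeled on the kernel's struct elf32_fdpic_loadseg. The names struct loadseg and fdpic_translate are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One loaded segment: its link-time range and where it landed at run time.
   Modeled on the kernel's struct elf32_fdpic_loadseg (an assumption here). */
struct loadseg {
  uint32_t addr;    /* run-time address of the segment */
  uint32_t p_vaddr; /* link-time address of the segment */
  uint32_t p_memsz; /* size of the segment in memory */
};

/* Translate a link-time address to a run-time address.
   Returns 0 if no segment covers the address. */
static uint32_t fdpic_translate(const struct loadseg *segs, size_t n,
                                uint32_t vaddr) {
  for (size_t i = 0; i < n; i++)
    if (vaddr >= segs[i].p_vaddr && vaddr - segs[i].p_vaddr < segs[i].p_memsz)
      return segs[i].addr + (vaddr - segs[i].p_vaddr);
  return 0;
}
```

With a text segment and a data segment loaded at unrelated addresses, two link-time addresses that are close together can end up far apart at run time, which is why a plain "+ load_base" adjustment is not sufficient.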

If the referenced data symbol is non-preemptible and guaranteed to be in the text segment, we can use PC-relative addressing. However, this scenario is remarkably rare in practice. The most likely use case is like the following:


const int ro_array[] = {1, 2, 3, 4}; 

int get_ro_array_elem(int i) { return ro_array[i]; }

GCC's arm port does not seem to utilize PC-relative addressing. We can try GCC's SuperH port:

// sh4-linux-gnu-gcc -S -fpic -mfdpic -O2 q.c
get_hidden_var: // non-preemptible data
  mov.l   .L12,r0
  rts
  add     r12,r0
.L13:
  .align 2
.L12:
  .long   hidden_var@GOTOFF

get_ro_hidden_var: // non-preemptible data
  mov.l   .L18,r0
  rts
  mov.l   @(r0,r12),r0
.L19:
  .align 2
.L18:
  .long   ro_hidden_var@GOT

It optimizes get_hidden_var but not get_ro_hidden_var.

Thread-local storage

The ARM FDPIC ABI defines static TLS relocations R_ARM_TLS_GD32_FDPIC, R_ARM_TLS_LDM32_FDPIC, and R_ARM_TLS_IE32_FDPIC to be relative to the GOT, as opposed to their non-FDPIC counterparts, which are relative to PC.

Toolchain notes

-mfdpic, which only makes sense for -fpie/-fpic, enables FDPIC code generation. Like -mno-pic-data-is-text-relative, external data accesses use a different base register, r9 for arm. In addition, external function calls save and restore r9.

gas's arm port needs --fdpic to assemble FDPIC-related relocation types. -mfdpic -mtls-dialect=gnu2 is not supported. The ARM FDPIC ABI uses ldr to load a 32-bit constant embedded in the text segment. The offset is used to materialize the address of a GOT entry (canonical function descriptor, address of the canonical function descriptor, or address of data).

RISC-V

Several proposals exist for defining FDPIC-like ABIs to work for MMU-less systems.

[RFC] RISC-V ELF FDPIC psABI addendum and RISC-V FDPIC/NOMMU toolchain/runtime support: This offers a starting point, but needs further discussion.
lowRISC ePIC: While simple and interesting, ePIC lacks dynamic linking support.
Stefan O'Rear's proposal: This proposal holds promise and deserves close attention.

Undoubtedly, GP should be used as the FDPIC register.

Loading a constant near code (like ARM) is not efficient. Instead, consider a two-instruction sequence:

Use hi20 and lo12 instructions to generate an offset relative to the GP register.
Use c.add a0, gp to compute the address of the GOT entry.
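The hi20/lo12 split works like the usual RISC-V lui/addi pairing: the low 12 bits are sign-extended by the consuming instruction, so the high part is rounded to compensate. A minimal sketch of that arithmetic follows. The helper names hi20, lo12, and recombine are hypothetical, and the offset is treated as a 32-bit value.

```c
#include <stdint.h>

/* 20-bit value placed in lui; adding 0x800 compensates for the
   sign extension of the low 12 bits. */
static uint32_t hi20(uint32_t off) { return (off + 0x800u) >> 12; }

/* Low 12 bits, sign-extended to the range [-2048, 2047]. */
static int32_t lo12(uint32_t off) {
  int32_t low = (int32_t)(off & 0xfffu);
  return low >= 0x800 ? low - 0x1000 : low;
}

/* What lui plus a 12-bit signed displacement reconstructs (modulo 2^32). */
static uint32_t recombine(uint32_t off) {
  return (hi20(off) << 12) + (uint32_t)lo12(off);
}
```

For example, an offset of 0x12fff splits into hi20 = 0x13 and lo12 = -1, and recombining the two gives back 0x12fff.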

Maciej's code sequence supports both function and data access through indirect GP-relative addressing. We can easily enhance it by adding R_RISCV_RELAX to enable linker relaxation and improve performance. Additionally, for consistency with similar notations on x86-64 and AArch64 ("gotpcrel"), let's adopt "gotgprel" notation.

.L0:
lui rX, %gotgprel_hi20(sym)  # R_RISCV_?(sym); R_RISCV_RELAX
c.add rX, gp                 # R_RISCV_?(.L0)
ld rY, %gotgprel_lo12(sym)(rX) # R_RISCV_?(.L0); rY = address

For data access, the code sequence is followed by instructions like:

lb a0,0(rY)
sb a1,0(rY)

Function descriptors and data have different semantics, requiring two relocation types. Stefan O'Rear proposes:

R_RISCV_FUNCDESC_GOTGPREL_HI: Find or create two GOT entries for the canonical function descriptor.
R_RISCV_GOTGPREL_HI: Find or create a GOT entry for the symbol, and return an offset from the FDPIC register.

Drawing inspiration from ARM FDPIC, two additional relocation types are needed for TLS. This results in a 4-type scheme.

RISC-V FDPIC: optimization

Addressing performance concerns is crucial. Stefan suggests an "indirect-to-relative optimization and relaxation scheme":

R_RISCV_PIC_ADD: Tags c.add rX, gp to enable optimization
R_RISCV_INTERMEDIATE_LOAD: Tags ld rY, <gotgprel_lo12>(rX) to enable optimization

Indirect GP-relative addressing can be optimized to direct GP-relative addressing under specific conditions:

Non-preemptible functions
Non-preemptible data in the data segment

# Indirect GP-relative to direct GP-relative
lui rX, <gprel_hi20>
c.add rX, gp
addi rY, rX, <gprel_lo12>

GOT-indirect addressing can be optimized to PC-relative for non-preemptible data in the text segment.

# Indirect GP-relative to PC-relative
auipc rX, %pcrel_hi20(sym)
c.nop                               # deletable
addi rY, rX, %pcrel_lo12(sym)
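The conditions above amount to a small decision procedure that a linker could apply per symbol. The sketch below is only an illustration of that logic. The type and function names (struct sym_info, choose_addressing) are hypothetical and do not come from any linker.

```c
#include <stdbool.h>

enum addressing { INDIRECT_GPREL, DIRECT_GPREL, PCREL };

/* Hypothetical summary of what the linker knows about a symbol. */
struct sym_info {
  bool preemptible;
  bool is_function;
  bool in_data_segment; /* ignored for functions */
};

static enum addressing choose_addressing(struct sym_info s) {
  if (s.preemptible)
    return INDIRECT_GPREL; /* must load the address from the GOT */
  if (s.is_function || s.in_data_segment)
    return DIRECT_GPREL;   /* descriptor or data at a fixed offset from gp */
  return PCREL;            /* non-preemptible data in the text segment */
}
```

Only the last case can drop the add with gp, which is why the c.add slot becomes a deletable c.nop in the PC-relative form.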

While enabling absolute addressing for -no-pie linking might seem straightforward, the added complexity might outweigh the benefits.

# Indirect GP-relative to absolute
lui rX, %hi20(sym)
c.nop                               # deletable
addi rY, rX, %lo12(sym)

Remember, FDPIC aims for position-independent code, and a static dynamic library scheme wouldn't be ideal. Imagine multiple executables sharing the same system: each text segment needs a unique address for proper execution.

RISC-V FDPIC: thread-local storage

To handle TLSDESC, we introduce a new relocation type: R_RISCV_TLSDESC_GPREL_HI. This type instructs the linker to find or create two GOT entries unless optimized to local-exec or initial-exec. The combined hi20 and lo12 offsets compute the GP-relative offset to the first GOT entry.

label:
lui rX, %tlsdesc_gprel_hi(sym)      # R_RISCV_TLSDESC_GPREL_HI(sym); R_RISCV_RELAX
c.add rX, gp                        # R_RISCV_PIC_ADD(label)
ld rY, rX, %tlsdesc_load_lo(label)  # R_RISCV_TLSDESC_LOAD_LO12(label)
addi a0, rX, %tlsdesc_add_lo(label) # R_RISCV_TLSDESC_ADD_LO12(label)
jalr t0, rY, %tlsdesc_call(label)   # R_RISCV_TLSDESC_CALL(label)

Existing relocation types, R_RISCV_TLSDESC_LOAD_LO12 and R_RISCV_TLSDESC_ADD_LO12, are extended to work with R_RISCV_TLSDESC_GPREL_HI.

# TLSDESC to initial-exec optimization
lui a0, <gottpoff_gprel_hi20>
c.add a0, gp
ld a0, <gottpoff_gprel_lo12>(a0)

# TLSDESC to local-exec optimization
lui a0, <tpoff_hi20>
addi a0, a0, <tpoff_lo12>

For initial-exec TLS model, we need a new pseudoinstruction, say, la.tls.ie.fd rX, sym. It expands to:


lui rX, 0                    # R_RISCV_TLS_GOTGPREL_HI20(sym)
c.add rX, gp                 # R_RISCV_PIC_ADD(label)
ld rX, 0(rX)                 # R_RISCV_PCREL_LO12_I(label)

Stefan's scheme uses R_RISCV_PIC_ADDR_LO12_I. I think we can just repurpose R_RISCV_PCREL_LO12_I.