The dark side of RISC-V linker relaxation

Linker optimization/relaxation

The linker is in a position to perform some peephole optimizations which are difficult or impossible to do on the compiler side (due to the lack of a global view and layout information). Generic link-time code sequence transformation is risky. However, if every instruction in the code sequence is associated with one or more relocations, the ABI and the implementation can assign (additional) semantics to the relocation types and make such transformations safe. This technique is usually called linker optimization or linker relaxation. It seems that the term "linker optimization" is often used when the length of the code sequence does not change, while "linker relaxation" is used when the code sequence can be shortened.

i386, x86-64 and ppc64 ELFv2 linkers have implemented various linker optimizations.

# Load the address via GOT indirection.
movq x@GOTPCREL(%rip), %rax  # R_X86_64_REX_GOTPCRELX
# =>
# Add an offset to PC.
leaq x(%rip), %rax

Because the term "link-time optimization" sounds similar to "linker optimization" but is usually used in a narrow and very different sense (communicating symbol resolution to the compiler, combining the information of multiple translation units, and performing IR-level optimization), I (and some other folks) have used "linker relaxation" to refer also to transformations that do not change the code sequence length.

RISC architectures usually need more than one instruction to materialize a symbol address. There are usually two instructions, the first materializing the high bits and the second materializing the low bits. In many cases, a single alternative instruction can replace the two instructions.

Another case is when a long instruction can be replaced with a short instruction.

jmp dest    # 4 bytes, R_AVR_CALL
# =>
rjmp dest   # 2 bytes

RISC-V linker relaxation

Instead of giving relocation types such as R_RISCV_HI20, R_RISCV_LO12, R_RISCV_PCREL_LO12_I, R_RISCV_PCREL_LO12_S, R_RISCV_CALL additional semantics, the designer introduced a new relocation type R_RISCV_RELAX. You will see that at the same location there are two relocations. The ELF specification says:

If multiple consecutive relocation records are applied to the same relocation location (r_offset), they are composed instead of being applied independently, as described above. By consecutive, we mean that the relocation records are contiguous within a single relocation section. By composed, we mean that the standard application described above is modified as follows:

In all but the last relocation operation of a composed sequence, the result of the relocation expression is retained, rather than having part extracted and placed in the relocated field. The result is retained at full pointer precision of the applicable ABI processor supplement.

In all but the first relocation operation of a composed sequence, the addend used is the retained result of the previous relocation operation, rather than that implied by the relocation type.

Note that a consequence of the above rules is that the location specified by a relocation type is relevant for the first element of a composed sequence (and then only for relocation records that do not contain an explicit addend field) and for the last element, where the location determines where the relocated value will be placed. For all other relocation operands in a composed sequence, the location specified is ignored.

An ABI processor supplement may specify individual relocation types that always stop a composition sequence, or always start a new one.

In the composed sequence, R_RISCV_RELAX performs the operation of deleting bytes.

lui a0, %hi(sym)               # R_RISCV_HI20, R_RISCV_RELAX
addi a0, a0, %lo(sym)          # R_RISCV_LO12, R_RISCV_RELAX
# =>
addi a0, gp, offset

.L0: auipc a0, %pcrel_hi(sym)  # R_RISCV_PCREL_HI20, R_RISCV_RELAX
addi a0, a0, %pcrel_lo(.L0)    # R_RISCV_PCREL_LO12_I, R_RISCV_RELAX
# =>
addi a0, gp, offset

call fun                       # auipc ra, ..; jalr ra, ..(ra)
# =>
jal fun  # or c.jal fun

beqz a0, .L0
# =>
c.beqz a0, .L0
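
The gp-relative rewrites above are only valid when the symbol lies within the reach of a 12-bit signed immediate from gp. A minimal Python sketch of that range check follows; the function name and the sample addresses are made up for illustration, and a real linker has further conditions that are not modeled here.

```python
def fits_gp_relative(gp: int, sym: int) -> bool:
    """Return True if sym is reachable from gp with a signed 12-bit offset."""
    offset = sym - gp
    return -2048 <= offset <= 2047


# Hypothetical addresses. gp is commonly placed 0x800 past the start of the
# small data area so that both halves of the window are usable.
gp = 0x12800
print(fits_gp_relative(gp, 0x12000))  # lowest address inside the window
print(fits_gp_relative(gp, 0x13000))  # one byte past the highest address
```

The first call reports a symbol that is in range and the second one that is not, so only the first could be rewritten to a single gp-relative addi.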

Example

void ext(void);
void foo(void) {
  ext();
  ext();
  ext();
  ext();
}

0000000000000000 <.text>:
# sh_addralign=4, insert NOP of sh_addralign-2 bytes
       0: 01 00         nop
                0000000000000000:  R_RISCV_ALIGN        *ABS*+0x2

0000000000000002 <foo>:
       2: 41 11         addi    sp, sp, -16
       4: 06 e4         sd      ra, 8(sp)
       6: 97 00 00 00   auipc   ra, 0
                0000000000000006:  R_RISCV_CALL ext
                0000000000000006:  R_RISCV_RELAX        *ABS*
       a: e7 80 00 00   jalr    ra
       e: 97 00 00 00   auipc   ra, 0
                000000000000000e:  R_RISCV_CALL ext
                000000000000000e:  R_RISCV_RELAX        *ABS*
      12: e7 80 00 00   jalr    ra
      16: 97 00 00 00   auipc   ra, 0
                0000000000000016:  R_RISCV_CALL ext
                0000000000000016:  R_RISCV_RELAX        *ABS*
      1a: e7 80 00 00   jalr    ra
      ...

Two instructions are used to materialize a call. If the call turns out to be short ranged, the first instruction can be dropped.

0000000000000244 <foo>:
     244: 41 11         addi    sp, sp, -16
     246: 06 e4         sd      ra, 8(sp)
     248: ef 00 80 01   jal     24 <ext>
     24c: ef 00 40 01   jal     20 <ext>
     250: ef 00 00 01   jal     16 <ext>
     254: ef 00 c0 00   jal     12 <ext>
     258: a2 60         ld      ra, 8(sp)
     25a: 41 01         addi    sp, sp, 16
     25c: 11 a0         j       4 <ext>
     25e: 00 00         unimp
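
Whether the auipc can be dropped depends on the displacement fitting into the immediate of jal, which is a signed 21-bit, 2-byte-aligned offset (roughly ±1 MiB). Below is a sketch of the decision. It assumes, from `jal 24 <ext>` at 0x248 in the listing above, that ext ends up at 0x260; the helper names are illustrative and not taken from any linker.

```python
JAL_MIN = -(1 << 20)      # lowest displacement jal can encode
JAL_MAX = (1 << 20) - 2   # highest displacement (offsets are even)


def fits_jal(pc: int, target: int) -> bool:
    """True if target is reachable from pc with a single jal."""
    disp = target - pc
    return disp % 2 == 0 and JAL_MIN <= disp <= JAL_MAX


def call_size_after_relaxation(pc: int, target: int) -> int:
    """8 bytes for auipc+jalr, 4 bytes once the pair is rewritten to jal."""
    return 4 if fits_jal(pc, target) else 8


ext = 0x260  # assumed address of ext, inferred from the listing
for pc in (0x248, 0x24C, 0x250, 0x254):
    print(hex(pc), ext - pc, call_size_after_relaxation(pc, ext))
```

The printed displacements 24, 20, 16 and 12 match the jal operands in the listing.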

Assembler implications

Alignment directives

Deleting bytes may affect a following alignment directive. In the following example, if the call code sequence is shortened to 4 bytes, the mv instruction may not start at an offset aligned to 16 bytes.

call fun
.balign 16
mv a1, a0

To address this problem, the assembler emits a padding of align-4 bytes (no RVC) or align-2 bytes (RVC) for an alignment directive, depending on whether RVC (compressed instructions) can be used. Without loss of generality, let's assume RVC is unavailable and the padding is align-4 bytes. To tell the linker the number of bytes, an R_RISCV_ALIGN with addend align-4 is emitted at the beginning of the padding.

The linker can delete some bytes to satisfy the alignment requirement. The deleted number of bytes must be less than or equal to align-4.

call fun     # 8 bytes
.zero 16-4   # R_RISCV_ALIGN(addend=16-4=12)
mv a1, a0

Since the assembler emits NOPs according to the worst case, the alignment requirement is not necessarily satisfied when no relaxation is performed. Therefore, a linker which has not implemented deleting bytes (e.g. LLD) should conservatively bail out if such an R_RISCV_ALIGN is seen. This explains the LLD diagnostic "error: relocation R_RISCV_ALIGN requires unimplemented linker relaxation".
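
To make the byte-deletion rule concrete, here is a sketch of the arithmetic at an R_RISCV_ALIGN site: the padding starts at some address loc and is addend bytes long, and everything past the next alignment boundary can be deleted. The code assumes the alignment can be recovered as the smallest power of two greater than the addend (which holds for align-2 paddings, and for align-4 paddings when the alignment is at least 8); the function name is hypothetical.

```python
def bytes_to_delete(loc: int, addend: int) -> int:
    """Number of padding bytes to remove at an R_RISCV_ALIGN site.

    loc    -- address of the first padding byte, after earlier deletions
    addend -- size of the padding emitted by the assembler
    """
    align = 1 << addend.bit_length()       # smallest power of two > addend
    aligned = (loc + align - 1) & -align   # round loc up to the boundary
    return loc + addend - aligned


# `call fun` kept at 8 bytes: the padding starts at 8 and 4 bytes can go.
print(bytes_to_delete(8, 12))
# `call fun` relaxed to 4 bytes: the padding starts at 4 and is fully needed.
print(bytes_to_delete(4, 12))
```

In both cases the mv instruction from the example ends up at offset 16.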

Fixup against a local symbol

For a reference to a local symbol in the same section (e.g. a branch target), traditionally no relocation is needed. The fixup is resolved at assembly time. With linker relaxation, we need to preserve the relocation so that the linker can fix the offset.

.globl fun
fun:
  j .L0       # No relocation
.L0:
  call local0 # No relocation

local0:

DWARF

Now let's talk about the dark side of linker relaxation. We will start with DWARF, which is used by many binary formats.

Code addresses

In DWARF, a debugging information entry (DIE) describing an entity that has a range of machine code addresses may have DW_AT_low_pc/DW_AT_high_pc/DW_AT_ranges attributes to describe the addresses.

Due to linker relaxation, the entity size is not a constant. GCC/Clang use a label difference with two relocations (R_RISCV_ADD32 and R_RISCV_SUB32) to encode the length.

0x0000002a:   DW_TAG_subprogram [2]
                DW_AT_low_pc [DW_FORM_addr]     (0x0000000000000002 ".text")
                  # R_RISCV_64
                DW_AT_high_pc [DW_FORM_data4]   (0x00000040)
                  # Constant on other architectures
                  # Two relocations: R_RISCV_ADD32 and R_RISCV_SUB32
                  # Neither GCC nor Clang uses DW_FORM_addr (DWARF v3)
                DW_AT_frame_base [DW_FORM_exprloc]      (DW_OP_reg8 X8)
                DW_AT_name [DW_FORM_strp]       ( .debug_str[0x00000020] = "foo")
                DW_AT_decl_file [DW_FORM_data1] ("/tmp/c/a.c")
                DW_AT_decl_line [DW_FORM_data1] (2)
                DW_AT_external [DW_FORM_flag_present]   (true)

An alternative choice is to let the value of the DW_AT_high_pc be of class address, specifically, DW_FORM_addr. It takes 8 bytes on ELFCLASS64 but can remove one relocation.
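
To illustrate how the relocation pair yields the length, here is a small sketch that applies R_RISCV_ADD32 and R_RISCV_SUB32 to a 4-byte little-endian field. The helper names are made up, the field is assumed to be zero before relocation, and the addresses are assumptions read off the relaxed disassembly earlier (foo starting at 0x244 and ending at 0x25e).

```python
import struct


def apply_add32(buf: bytearray, offset: int, value: int) -> None:
    """R_RISCV_ADD32: add S + A to the 32-bit word at offset."""
    (old,) = struct.unpack_from("<I", buf, offset)
    struct.pack_into("<I", buf, offset, (old + value) & 0xFFFFFFFF)


def apply_sub32(buf: bytearray, offset: int, value: int) -> None:
    """R_RISCV_SUB32: subtract S + A from the 32-bit word at offset."""
    (old,) = struct.unpack_from("<I", buf, offset)
    struct.pack_into("<I", buf, offset, (old - value) & 0xFFFFFFFF)


high_pc = bytearray(4)            # DW_AT_high_pc field, assumed initially zero
apply_add32(high_pc, 0, 0x25E)    # end label (assumed address)
apply_sub32(high_pc, 0, 0x244)    # start label (assumed address)
print(hex(struct.unpack("<I", high_pc)[0]))
```

Because both labels are resolved only after relaxation, the linker computes the final length rather than the assembler.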

Line number information

Line number information associates source file locations with machine instruction addresses. It is conceptually a matrix with one row for each instruction. The matrix has columns for:

the source file name
the source line number
the source column number
and so on

DWARF uses a byte-coded language to encode the matrix. The specification says:

Most of the instructions in a line number program are special opcodes.

For the above example (consecutive ext() calls), on most architectures, a call takes one special opcode of one byte. llvm-dwarfdump --debug-line can dump the matrix:

# x86-64
            Address            Line   Column File   ISA Flags
            ------------------ ------ ------ ------ --- -------------
0x00000025: 00 DW_LNE_set_address (0x0000000000000000)
0x00000030: 13 address += 0,  line += 1
            0x0000000000000000      2      0      1   0  is_stmt
0x00000031: 05 DW_LNS_set_column (3)
0x00000033: 0a DW_LNS_set_prologue_end
0x00000034: 4b address += 4,  line += 1
            0x0000000000000004      3      3      1   0  is_stmt prologue_end
0x00000035: 75 address += 7,  line += 1
            0x000000000000000b      4      3      1   0  is_stmt
0x00000036: 75 address += 7,  line += 1
            0x0000000000000012      5      3      1   0  is_stmt
0x00000037: 75 address += 7,  line += 1
            0x0000000000000019      6      3      1   0  is_stmt
0x00000038: 75 address += 7,  line += 1
            0x0000000000000020      7      3      1   0  is_stmt
0x00000039: 75 address += 7,  line += 1
            0x0000000000000027      8      3      1   0  is_stmt

However, on RISC-V, it is bloated with two relocations associated with one row! DW_LNS_advance_line+DW_LNS_fixed_advance_pc+DW_LNS_copy take 6 bytes. Two Elf64_Rela relocations take 48 bytes.

# RISC-V
            Address            Line   Column File   ISA Flags
            ------------------ ------ ------ ------ --- -------------
0x00000025: 00 DW_LNE_set_address (0x0000000000000002)
0x00000030: 13 address += 0,  line += 1
            0x0000000000000002      2      0      1   0  is_stmt
0x00000031: 05 DW_LNS_set_column (3)
0x00000033: 0a DW_LNS_set_prologue_end
0x00000034: 03 DW_LNS_advance_line (3)
0x00000036: 09 DW_LNS_fixed_advance_pc (0x0002)
0x00000039: 01 DW_LNS_copy
            0x0000000000000004      3      3      1   0  is_stmt prologue_end
0x0000003a: 03 DW_LNS_advance_line (4)
0x0000003c: 09 DW_LNS_fixed_advance_pc (0x000e)
0x0000003f: 01 DW_LNS_copy
            0x0000000000000012      4      3      1   0  is_stmt
0x00000040: 03 DW_LNS_advance_line (5)
0x00000042: 09 DW_LNS_fixed_advance_pc (0x0008)
0x00000045: 01 DW_LNS_copy
            0x000000000000001a      5      3      1   0  is_stmt
...

Relocation section '.rela.debug_line' at offset 0x7e0 contains 17 entries:
    Offset             Info             Type      Symbol's Value  Symbol's Name + Addend
0000000000000028  0000000400000002 R_RISCV_64    0000000000000002 <null> + 0
0000000000000037  0000000600000022 R_RISCV_ADD16 0000000000000004 <null> + 0
0000000000000037  0000000400000026 R_RISCV_SUB16 0000000000000002 <null> + 0
000000000000003d  0000000a00000022 R_RISCV_ADD16 0000000000000012 <null> + 0
000000000000003d  0000000600000026 R_RISCV_SUB16 0000000000000004 <null> + 0
0000000000000043  0000000b00000022 R_RISCV_ADD16 000000000000001a <null> + 0
0000000000000043  0000000a00000026 R_RISCV_SUB16 0000000000000012 <null> + 0
...

That is 54 bytes for a row that takes 1 byte on x86-64: a 54x waste! So, what is wrong?

Well, due to linker relaxation, the address increment between two calls is not a compile-time constant. A special opcode is encoded with the following formula:

# DWARF v4 introduced maximum_operations_per_instruction for VLIW architectures.
# maximum_operations_per_instruction is 1 and operation_increment is address_advance on non-VLIW architectures.
opcode = line_increment - line_base + (line_range * operation_increment) + opcode_base

The variables except operation_increment are compile-time constants, but we don't have a relocation type representing multiplication. If such a relocation type exists and the compiler can ensure that the maximum address_advance does not make the ubyte opcode overflow (this is not simple), we can make the linked output much smaller. That said, relocations are still the primary overhead of object files.
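
The formula can be checked against the x86-64 dump above. The sketch below assumes line_base = -5, line_range = 14 and opcode_base = 13; these header values are not shown in the dump, but they reproduce its opcode bytes. The function name is illustrative.

```python
def special_opcode(line_increment, address_advance,
                   line_base=-5, line_range=14, opcode_base=13):
    """Return the special opcode, or None if the pair is not encodable."""
    if not line_base <= line_increment < line_base + line_range:
        return None
    opcode = ((line_increment - line_base)
              + line_range * address_advance
              + opcode_base)
    return opcode if opcode <= 255 else None


print(hex(special_opcode(1, 0)))  # address += 0, line += 1
print(hex(special_opcode(1, 4)))  # address += 4, line += 1
print(hex(special_opcode(1, 7)))  # address += 7, line += 1
print(special_opcode(1, 18))      # advance too large for a one-byte opcode
```

The first three calls print 0x13, 0x4b and 0x75, the opcode bytes in the x86-64 dump. Because address_advance is multiplied by line_range, an ADD/SUB relocation pair cannot patch this byte after relaxation.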

Compared with the relocation entries of other major binary formats (Mach-O: 8 bytes, PE-COFF: 10 bytes), Elf64_Rela (24 bytes) on ELF has a very noticeable overhead. I don't know whether RISC-V people may want to switch to ELFCLASS32 for 64-bit RISC-V in the future. Currently the Linux kernel associates ELFCLASS32 with ILP32 ABI variants, but nothing prevents ELFCLASS32 from being used for small code model object files.
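
For reference, the ELF entry sizes follow directly from the field layout of the relocation records. A quick check with Python's struct module, where the format strings mirror the r_offset, r_info and r_addend fields:

```python
import struct

ELF64_RELA = struct.calcsize("<QQq")  # r_offset, r_info, r_addend: 8 bytes each
ELF32_RELA = struct.calcsize("<IIi")  # the same fields at 4 bytes each

print(ELF64_RELA, ELF32_RELA)
# Cost of one relocation pair per line-table row in each class:
print(2 * ELF64_RELA, 2 * ELF32_RELA)
```

Under ELFCLASS32 the relocation pair for one row would cost 24 bytes instead of 48.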

Call frame information

A frame description entry (FDE) in call frame information (.debug_frame/.eh_frame) encodes the length of the entity. A pair of relocations is needed.

Similar to line number information, the call frame instructions are encoded in a byte-coded language. The DW_CFA_advance_loc instruction has a 6-bit operand (encoded with the opcode) representing that the location delta is operand * code_alignment_factor. The instruction can be relocated by a pair of R_RISCV_SET6 and R_RISCV_SUB6.
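
Here is a sketch of how the pair patches the 6-bit operand while leaving the two opcode bits alone. It assumes code_alignment_factor is 1 and uses made-up label addresses; the helper names are hypothetical.

```python
DW_CFA_advance_loc = 0x40  # opcode in the high two bits, delta in the low six


def apply_set6(byte: int, value: int) -> int:
    """R_RISCV_SET6: store the low 6 bits of S + A, keep the top 2 bits."""
    return (byte & 0xC0) | (value & 0x3F)


def apply_sub6(byte: int, value: int) -> int:
    """R_RISCV_SUB6: subtract S + A from the low 6 bits, keep the top 2 bits."""
    return (byte & 0xC0) | (((byte & 0x3F) - value) & 0x3F)


insn = DW_CFA_advance_loc
insn = apply_set6(insn, 0x248)  # label after the advance (assumed address)
insn = apply_sub6(insn, 0x244)  # label before the advance (assumed address)
print(hex(insn))
```

The result is the opcode bits combined with a delta of 4. A delta above 63 does not fit the 6-bit field, so a wider form such as DW_CFA_advance_loc1 is needed in that case.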

Location lists

For -gsplit-dwarf, the .dwo files cannot have relocations. Due to lack of support for uleb128 label differences, the compact DW_LLE_offset_pair description cannot be used. GCC uses the DW_LLE_startx_endx description (PR99090). The operands are indices into the .debug_addr section. The two values in .debug_addr are relocated by R_RISCV_64 relocations.

Language-specific data area

In the Itanium C++ ABI, the information needed to process exceptions is called the language-specific data area (LSDA). On ELF targets, this is usually stored in the .gcc_except_table section. See C++ exception handling ABI for details.

int comdat() {
  try { throw 1; }
  catch (int) { return 1; }
  return 0;
}

Call site records describe the landing pad offset/length. Without linker relaxation, the values are assembly-time constants and .gcc_except_table actually has no relocations referencing text sections.

  .section .gcc_except_table,"a",@progbits
  .p2align 2
GCC_except_table0:
.Lexception0:
  .byte    255                         # @LPStart Encoding = omit
  .byte    3                           # @TType Encoding = udata4
  .uleb128 .Lttbase0-.Lttbaseref0
.Lttbaseref0:
  .byte    1                           # Call site Encoding = uleb128
  .uleb128 .Lcst_end0-.Lcst_begin0
.Lcst_begin0:
  .uleb128 .Lfunc_begin0-.Lfunc_begin0 # >> Call Site 1 <<
  .uleb128 .Ltmp0-.Lfunc_begin0        #   Call between .Lfunc_begin0 and .Ltmp0
  .byte    0                           #     has no landing pad
  .byte    0                           #   On action: cleanup
  .uleb128 .Ltmp0-.Lfunc_begin0        # >> Call Site 2 <<
  .uleb128 .Ltmp1-.Ltmp0               #   Call between .Ltmp0 and .Ltmp1
  .uleb128 .Ltmp2-.Lfunc_begin0        #     jumps to .Ltmp2
  .byte    1                           #   On action: 1
  .uleb128 .Ltmp1-.Lfunc_begin0        # >> Call Site 3 <<
  .uleb128 .Lfunc_end0-.Ltmp1          #   Call between .Ltmp1 and .Lfunc_end0
  .byte    0                           #     has no landing pad
  .byte    0                           #   On action: cleanup
.Lcst_end0:
  .byte    1                           # >> Action Record 1 <<
                                       #   Catch TypeInfo 1
  .byte   0                            #   No further actions
  .p2align 2
                                       # >> Catch TypeInfos <<
  .long _ZTIi                          # TypeInfo 1
.Lttbase0:

With linker relaxation, in the generic case we need to have relocation pairs representing label differences. If a label difference may exceed 127, a uleb128 directive will not be suitable because we don't have uleb128 label difference relocation types.

In addition, uleb128 relaxation is difficult. The call site table length field in the header is encoded in uleb128. The value and other uleb128 offsets/lengths can cause oscillation. See GNU as (PR4029).
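
The size sensitivity is easy to see from the encoding itself. The following sketch is a plain ULEB128 encoder (not taken from any assembler). It shows that the encoded length jumps from one to two bytes at 128, which is why a field whose value depends on relaxation cannot be given a fixed width up front.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


for n in (127, 128, 16383, 16384):
    print(n, len(encode_uleb128(n)))
```

A length change in one field shifts everything after it, which in turn can change other label differences; this is the oscillation risk mentioned above.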

Anyway, GCC and Clang have chosen .word to encode offsets/lengths.

  .section        .gcc_except_table,"a",@progbits
  .p2align        2
GCC_except_table1:
.Lexception0:
  .byte   255                          # @LPStart Encoding = omit
  .byte   155                          # @TType Encoding = indirect pcrel sdata4
  .uleb128 .Lttbase0-.Lttbaseref0
.Lttbaseref0:
  .byte   3                            # Call site Encoding = udata4
  .uleb128 .Lcst_end0-.Lcst_begin0
.Lcst_begin0:
# RISC-V -mrelax cannot use .uleb128.
# Worse, label differences are not resolved at assembly time.
  .word   .Lfunc_begin1-.Lfunc_begin1  # >> Call Site 1 <<
  .word   .Ltmp2-.Lfunc_begin1         #   Call between .Lfunc_begin1 and .Ltmp2
  .word   0                            #     has no landing pad
  .byte   0                            #   On action: cleanup
  .word   .Ltmp2-.Lfunc_begin1         # >> Call Site 2 <<
  .word   .Ltmp3-.Ltmp2                #   Call between .Ltmp2 and .Ltmp3
  .word   .Ltmp4-.Lfunc_begin1         #     jumps to .Ltmp4
  .byte   1                            #   On action: 1
  .word   .Ltmp3-.Lfunc_begin1         # >> Call Site 3 <<
  .word   .Lfunc_end1-.Ltmp3           #   Call between .Ltmp3 and .Lfunc_end1
  .word   0                            #     has no landing pad
  .byte   0                            #   On action: cleanup
.Lcst_end0:
  .byte   1                            # >> Action Record 1 <<
  ...

The other problem is that relocations from .gcc_except_table to the text section can cause some linker garbage collection difficulty. It requires fragmented .gcc_except_table sections. See C++ exception handling ABI for details.

Linker implementation

Here is an overview of ld.lld:

Command line options
Symbol table (input files, -e, -u, symbol assignments)
LLVM LTO (bitcode => object files)
Input sections
Split SHF_MERGE and .eh_frame
--gc-sections: markLive
Create synthetic sections (linker generated)
Move .eh_frame from inputSections to synthetic .eh_frame
Linker script SECTIONS
Identical Code Folding
Scan relocations
Finalize synthetic sections
Layout (addresses, thunks, SHT_RELR, symbol assignments)
Assign file offsets
Write header and sections

The difficulty is that relocation scanning affects address-dependent content. Basically the following three steps need to be done in a loop:

Scan relocations
Finalize synthetic sections
Layout (assign addresses to sections, compute symbol assignments, produce thunks, process SHT_RELR)

I think GNU ld uses an iterative loop with at least these steps. In LLD, calling finalizeSynthetic() more than once can be tricky.
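
To show why a loop is needed, here is a toy fixed-point iteration. It is not how GNU ld or LLD is structured: the model is a single section of items that are either 4-byte instructions or 8-byte calls, and the reach limit is an artificially small, made-up number so that the cascade is visible. Shrinking one call moves later addresses, which can bring another call within reach on the next pass.

```python
def relax(items, reach):
    """Shrink 8-byte calls to 4 bytes until nothing changes.

    items -- list of [size, target_index or None], mutated in place
    reach -- maximum distance a 4-byte call can cover (toy value)
    Returns the number of passes performed.
    """
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        # Layout: assign an address to every item from the current sizes.
        addrs = []
        pc = 0
        for size, _ in items:
            addrs.append(pc)
            pc += size
        # Scan: shrink any long call whose target is within reach.
        for i, (size, target) in enumerate(items):
            if (target is not None and size == 8
                    and abs(addrs[target] - addrs[i]) <= reach):
                items[i][0] = 4
                changed = True
    return passes


section = [[8, 2], [8, 2], [4, None]]  # two calls to the last item
print(relax(section, reach=12), [size for size, _ in section])
```

Here the first pass shrinks only the second call. That shortens the distance for the first call, which shrinks on the second pass, and the third pass finds nothing left to do. Since sizes only ever decrease, the loop terminates.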

Does link-time optimization or post link-time optimization help?

They help for a reference to a local symbol in the same section, but not for common cases like referencing .data from .text, or referencing another text section from a text section.

Epilogue

I sometimes call linker relaxation poor man's link-time optimization with nice ergonomics. I agree it is useful, probably more so for embedded systems. However, the engineering efforts are large and I worry about the additional complexity in various toolchain components.