In my previous assembler posts, I've discussed improvements on expression evaluation and relocation generation. Now, let's turn our attention to recent refinements within section fragments. Understanding how an assembler utilizes these fragments is key to appreciating the improvements we've made. At a high level, the process unfolds in three main stages:
- Parsing phase: The assembler constructs section fragments. These fragments represent sequences of regular instructions or data, span-dependent instructions, alignment directives, and other elements.
- Section layout phase: Once fragments are built, the assembler assigns offsets to them.
- Relocation decision phase: In the final stage, the assembler evaluates fixups and, if necessary, updates the content of the fragments.
When the LLVM integrated assembler was introduced in 2009, its section and fragment design was quite basic. Performance wasn't the concern at the time. As LLVM evolved, many assembler features added over the years came to rely heavily on this original design. This created a complex web that made optimizing the fragment representation increasingly challenging.
Here's a look at some of the features that added to this complexity over the years:
- 2010: Mach-O `.subsections_via_symbols` and atoms
- 2012: NativeClient's bundle alignment mode. I've created a dedicated chapter for this.
- 2015: Hexagon instruction bundle
- 2016: CodeView variable definition ranges
- 2018: RISC-V linker relaxation
- 2020: x86 `-mbranches-within-32B-boundaries`
- 2023: LoongArch linker relaxation, largely identical to RISC-V linker relaxation; refactoring or improving the RISC-V support often requires changing it as well.
- 2023: z/OS GOFF (Generalized Object File Format)
I've included the start year for each feature to indicate when it was initially introduced, to the best of my knowledge. This doesn't imply that maintenance stopped after that year. On the contrary, many of these features, like RISC-V linker relaxation, require ongoing, active maintenance.
Despite the intricate history, I've managed to untangle these dependencies and implement the necessary fixes. And that, in a nutshell, is what this blog post is all about!
The quest for trivially destructible fragments
Historically, `MCFragment` subclasses, specifically `MCDataFragment` and `MCRelaxableFragment`, relied on `SmallVector` member variables to store their content and fixups. This approach, while functional, presented two key inefficiencies:
- Inefficient storage of small objects: The content and fixups for individual fragments are typically very small. Storing a multitude of these tiny objects individually within `SmallVector`s led to less-than-optimal memory utilization.
- Non-trivial destructors: When deallocating sections, the `~MCSection` destructor had to meticulously traverse the fragment list and explicitly destroy each fragment.
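To make the problem concrete, here's a rough sketch of the old per-fragment storage, with a hypothetical name (the upstream classes differed in detail):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCFixup.h"
using namespace llvm;

// Rough shape of the old per-fragment storage (approximate; not the exact
// upstream definitions). Every fragment owned its own inline vectors, so each
// tiny fragment carried separate SmallVector bookkeeping and a non-trivial
// destructor that ~MCSection had to run.
struct OldStyleDataFragmentSketch {
  SmallVector<char, 32> Contents; // encoded instruction/data bytes
  SmallVector<MCFixup, 4> Fixups; // fixups applying to those bytes
};
```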
In 2024, @aengelke initiated a draft to store fragment content out-of-line. Building upon that foundation, I've extended this approach to also store fixups out-of-line, and ensured compatibility with the aforementioned features that cause complexity (especially RISC-V and LoongArch linker relaxation).
Furthermore, `MCRelaxableFragment` previously contained `MCInst Inst;`, which also necessitated a non-trivial destructor. To address this, I've redesigned its data structure. The operands are now stored within the parent `MCSection`, and the `MCRelaxableFragment` itself only holds references:
```cpp
uint32_t Opcode = 0;
```
Unfortunately, we still need to encode `MCInst::Flags` to support the x86 EVEX prefix, e.g., `{evex} xorw $foo, %ax`. My hope is that the x86 maintainers might refactor `X86MCCodeEmitter::encodeInstruction` to make this flag storage unnecessary.
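Roughly, the reference-based scheme looks like this (a simplified sketch; the field names here are illustrative rather than the exact upstream members):

```cpp
#include <cstdint>

// Hypothetical layout: instead of owning an MCInst (and its operand storage),
// the relaxable fragment keeps plain integers that reference operand storage
// owned by the parent MCSection, keeping the fragment trivially destructible.
struct RelaxableInstInfoSketch {
  uint32_t Opcode = 0;       // MCInst opcode
  uint32_t Flags = 0;        // MCInst flags, e.g. to preserve the x86 {evex} prefix
  uint32_t OperandStart = 0; // index of the first operand in the section's storage
  uint32_t NumOperands = 0;  // number of operands
};
```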
The new design of `MCFragment` and `MCSection` is as follows:
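In rough outline (a simplified sketch; member names and types are approximations rather than the exact upstream declarations):

```cpp
#include <cstdint>
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCFixup.h"
using namespace llvm;

// Approximate sketch of a unified, trivially destructible fragment whose
// contents and fixups live in storage owned by the parent section.
class FragmentSketch {
public:
  FragmentSketch *Next = nullptr; // next fragment within the subsection
  uint64_t Offset = 0;            // assigned during section layout
  // Fixed-size part: half-open ranges into the parent section's storage.
  uint32_t ContentStart = 0, ContentEnd = 0;
  uint32_t FixupStart = 0, FixupEnd = 0;
  // Optional variable-size tail (span-dependent instruction, align, fill, ...).
  uint32_t VarContentStart = 0, VarContentEnd = 0;
  uint32_t VarFixupStart = 0, VarFixupEnd = 0;
};

// The section owns the actual bytes and fixups referenced above, so tearing
// down a section no longer requires destroying each fragment.
class SectionSketch {
public:
  SmallVector<char, 0> ContentStorage;
  SmallVector<MCFixup, 0> FixupStorage;
  FragmentSketch *Head = nullptr, *Tail = nullptr;
};
```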
(As a side note, the LLVM `CamelCase` variables are odd. As the MC maintainer, I'd be delighted to see them refactored to `camelBack` or `snake_case` if people agree on the direction.)
These changes were primarily driven by two key pull requests:
Fewer fragments: fixed-size part and variable tail
Prior to LLVM 21.1, the assembler, operating with a fragment design dating back to 2009, placed every span-dependent instruction into its own distinct fragment. The x86 code sequence `push rax; jmp foo; nop; jmp foo` would be represented with numerous fragments: `MCDataFragment(push rax); MCRelaxableFragment(jmp foo); MCDataFragment(nop); MCRelaxableFragment(jmp foo)`.
A more efficient approach emerged: storing both a fixed-size part and an optional variable-size tail within a single fragment.
- The fixed-size part maintains a consistent size throughout the assembly process.
- The variable-size tail, if present, encodes elements that can change in size or content, such as a span-dependent instruction, an alignment directive, a fill directive, or other similar span-dependent constructs.
The new design led to significantly fewer fragments:
```
MCFragment(fixed: push rax, variable: jmp foo)
MCFragment(fixed: nop, variable: jmp foo)
```
Key changes:
- MC: Restructure MCFragment as a fixed part and a variable tail.
- MC: Encode FT_Align in fragment's variable-size tail
Reducing instruction encoding overhead
Encoding individual instructions is the most performance-critical operation within `MCObjectStreamer`. Recognizing this, significant effort has been dedicated to reducing this overhead since May 2023.
- [MC] Always encode instruction into SmallVector
- [MC] Remove the legacy overload of encodeInstruction with a lot of prior cleanups
- [MC][ELF] Emit instructions directly into fragment
- [MC][X86] Avoid copying MCInst in emitInstrEnd in 2024-06
- X86AsmBackend: Remove some overhead from auto padding feature
- X86AsmBackend: Simplify isRightAfterData for the auto-pad feature
It's worth mentioning that x86's instruction padding features, introduced in 2020, have imposed considerable overhead. Specifically, these features are:
- `-mbranches-within-32B-boundaries`. See Align branches within 32-Byte boundary (NOP padding).
- "Enhanced relaxation": This feature allows x86 prefix padding for all instructions, effectively making every instruction span-dependent and requiring its own fragment.
My recent optimization efforts demanded careful attention to this particularly complex and performance-sensitive code.
Eager fragment creation
Encoding an instruction is a far more frequent operation than appending a variable-size tail to the current fragment. In the previous design, the instruction encoder was burdened with an extra check: it had to determine if the current fragment already had a variable-size tail.
```
encodeInstruction:
  if the current fragment has a variable-size tail:
    start a new fragment
  append the encoded instruction to the current fragment's fixed-size part
```
Our new strategy optimizes this by maintaining a current fragment that is guaranteed not to have a variable-size tail. This means functions appending data to the fixed-size part no longer need to perform this check. Instead, any function that sets a variable-size tail will now immediately start a new fragment.
Here's how the workflow looks with this optimization:
```
encodeInstruction:
  append the encoded instruction to the current fragment's fixed-size part

set a variable-size tail (alignment, fill, relaxable instruction, ...):
  set the current fragment's variable-size tail
  start a new fragment
```
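As a self-contained toy model of the invariant (not the actual MCObjectStreamer code), just to show where the branch disappears:

```cpp
#include <cstddef>
#include <vector>

// Toy model (not LLVM's API): the streamer always points at a fragment that
// has no variable-size tail, so the hot append path needs no check.
struct ToyFragment {
  std::vector<char> Fixed; // fixed-size part
  bool HasVarTail = false; // set at most once, right before the fragment is sealed
};

struct ToyStreamer {
  std::vector<ToyFragment> Fragments;

  ToyStreamer() { Fragments.emplace_back(); } // start with one open, tail-free fragment

  // Hot path (encodeInstruction lands here): no "does the current fragment
  // already have a variable tail?" branch.
  void appendFixed(const char *Data, size_t Size) {
    ToyFragment &F = Fragments.back();
    F.Fixed.insert(F.Fixed.end(), Data, Data + Size);
  }

  // Cold path: setting a variable-size tail seals the current fragment and
  // eagerly starts a fresh, tail-free one.
  void setVarTail() {
    Fragments.back().HasVarTail = true;
    Fragments.emplace_back();
  }
};
```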
Key changes:
It's worth noting that the first patch was made possible thanks to the removal of the bundle alignment mode.
Fragment content in trailing data
Our `MCFragment` class manages four distinct sets of appendable data: fixed-size content, fixed-size fixups, variable-size tail content, and variable-size tail fixups. Of these, the fixed-size content is typically the largest. We can optimize its storage by utilizing it as trailing data, akin to a flexible array member.
This approach offers several compelling advantages:
- Improved data locality: Storing the content after the MCFragment object improves cache utilization.
- Simplified metadata: We can replace the pair `uint32_t ContentStart = 0; uint32_t ContentEnd = 0;` with a single `uint32_t ContentSize;`.
This optimization leverages a clever technique made possible by using a special-purpose bump allocator. After allocating `sizeof(MCFragment)` bytes for a new fragment, we know that any remaining space within the current bump allocator block immediately follows the fragment's end. This contiguous space can then be efficiently used for the fragment's trailing data.
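Here's a minimal sketch of that trick with a toy block type; the real implementation uses LLVM's allocator and the actual `MCFragment`, and handles alignment and block exhaustion more carefully:

```cpp
#include <cstddef>
#include <cstring>
#include <new>

// Toy illustration of trailing-data allocation (not the actual LLVM code).
struct ToyBlock {
  char *Cur; // bump pointer into the current block
  char *End; // one past the end of the block
};

struct ToyFragment {
  unsigned ContentSize = 0; // trailing bytes live directly after *this
  char *content() { return reinterpret_cast<char *>(this + 1); }
};

// Allocate a fragment at the bump pointer; whatever remains in the block
// immediately follows the fragment and can become its trailing content.
// (Assumes the block can hold the header; real code checks and starts a new
// block otherwise.)
ToyFragment *allocateFragment(ToyBlock &B) {
  ToyFragment *F = new (B.Cur) ToyFragment();
  B.Cur += sizeof(ToyFragment);
  return F;
}

// Append to the tail fragment (whose trailing content ends at B.Cur).
// Returns false when the block is exhausted, in which case the caller closes
// this fragment and starts a new one in a fresh, sufficiently large block.
bool tryAppend(ToyBlock &B, ToyFragment *F, const char *Data, size_t Size) {
  if (static_cast<size_t>(B.End - B.Cur) < Size)
    return false;
  std::memcpy(F->content() + F->ContentSize, Data, Size);
  F->ContentSize += static_cast<unsigned>(Size);
  B.Cur += Size;
  return true;
}
```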
However, this design introduces a few important considerations:
- Tail fragment appends only: Data can only be appended to the tail fragment of a subsection. Fragments located in the middle of a subsection are immutable in their fixed-size content. Any post-assembler-layout adjustments must target the variable-size tail.
- Dynamic allocation management: When new data needs to be appended, a function is invoked to ensure the current bump allocator block has sufficient space. If not, the current fragment is closed (its fixed-size content is finalized), and a new fragment is started. For instance, an 8-byte sequence could be stored as a single fragment or, if space constraints dictate, as two fragments each encoding 4 bytes.
- New block allocation: If the available space in the current block is insufficient, a new block large enough to accommodate both an MCFragment and the required bytes for its trailing data is allocated.
- Section/subsection switching: The previously saved fragment list tail cannot simply be reused, because it's tied to the memory space of the previous bump allocator block. Instead, a new fragment must be allocated using the current bump allocator block and appended to the new subsection's tail.
Key changes:
Deprecating complexity: NativeClient's bundle alignment mode
Google's now-discontinued Native Client (NaCl) project provided a sandboxing environment through a combination of Software Fault Isolation (SFI) and memory segmentation. A distinctive feature of its SFI implementation was the "bundle alignment mode", which enforced strict rules:
- Instructions were grouped into 32-byte "bundles".
- No instruction could cross a bundle boundary.
While the core concept of aligned bundling is intriguing and has influenced successors like Lightweight Fault Isolation (LFI), its implementation within the LLVM assembler proved problematic. Introduced in 2012, this feature significantly increased the complexity of MC's internal workings and, perhaps more critically, imposed noticeable performance penalties on users who had no need for NaCl. I was particularly concerned by its pervasive modifications to `MCObjectStreamer` and `MCAssembler`.
The complexity deepened with the introduction of:
- 2014: PendingLabels
- 2015: NaCl's `mc-relax-all` optimization
This pending label mechanism led to more complexity:
- 2015: [MC] Ensure that pending labels are flushed when -mc-relax-all flag is used
- 2019: [ MC ] Match labels to existing fragments even when switching sections, by an Apple developer. In a nutshell, the pending labels mechanism was causing headaches for Mach-O, requiring additional code to manage.
In `MCObjectStreamer`, newly defined labels were put into a "pending label" list and initially assigned to an `MCDummyFragment` associated with the current section. The symbols would be reassigned to a new fragment when the next instruction or directive was parsed. This pending label system introduced complexity and potential for subtle bugs. `flushPendingLabels` was called by many `MCObjectStreamer` functions, notably once for each new fragment, adding overhead. It also complicated label difference evaluation due to `MCDummyFragment` in `MCExpr.cpp:AttemptToFoldSymbolOffsetDifference`.
To address these long-standing issues, a series of pivotal changes was undertaken:
- 2024: [MC] Aligned bundling: remove special handling for RelaxAll (removed an optimization for NaCl in the `mc-relax-all` mode)
- 2024: [MC] Remove pending labels
- 2024: [MC] AttemptToFoldSymbolOffsetDifference: remove MCDummyFragment check. NFC
- 2025: Finally, MC: Remove bundle alignment mode, after Derek Schuff agreed to drop NaCl support from LLVM.
Should future features require a variant of bundle alignment, I firmly believe a much cleaner implementation is necessary. This could potentially be achieved through a backend hook within `X86AsmBackend::finishLayout`, applied after the primary assembler layout phase, similar to how the `-mbranches-within-32B-boundaries` option is handled.