In my previous assembler posts, I've discussed improvements on expression evaluation and relocation generation. Now, let's turn our attention to recent refinements within section fragments. Understanding how an assembler utilizes these fragments is key to appreciating the improvements we've made. At a high level, the process unfolds in three main stages:
- Parsing phase: The assembler constructs section fragments. These fragments represent sequences of regular instructions or data, span-dependent instructions, alignment directives, and other elements.
- Section layout phase: Once fragments are built, the assembler assigns offsets to them.
- Relocation decision phase: In the final stage, the assembler evaluates fixups and, if necessary, updates the content of the fragments.
When the LLVM integrated assembler was introduced in 2009, its section and fragment design was quite basic. Performance wasn't the concern at the time. As LLVM evolved, many assembler features added over the years came to rely heavily on this original design. This created a complex web that made optimizing the fragment representation increasingly challenging.
Here's a look at some of the features that added to this complexity over the years:
- 2010: Mach-O `.subsections_via_symbols` and atoms
- 2012: NativeClient's bundle alignment mode. I've created a dedicated chapter for this.
- 2015: Hexagon instruction bundle
- 2016: CodeView variable definition ranges
- 2018: RISC-V linker relaxation
- 2020: x86 `-mbranches-within-32B-boundaries`
- 2023: LoongArch linker relaxation, largely identical to RISC-V linker relaxation; refactoring or improving the RISC-V support often requires changing it as well.
- 2023: z/OS GOFF (Generalized Object File Format)
I've included the start year for each feature to indicate when it was initially introduced, to the best of my knowledge. This doesn't imply that maintenance stopped after that year. On the contrary, many of these features, like RISC-V linker relaxation, require ongoing, active maintenance.
Despite the intricate history, I've managed to untangle these dependencies and implement the necessary fixes. And that, in a nutshell, is what this blog post is all about!
The quest for trivially destructible fragments
Historically, `MCFragment` subclasses, specifically `MCDataFragment` and `MCRelaxableFragment`, relied on `SmallVector` member variables to store their content and fixups. This approach, while functional, presented two key inefficiencies:
- Inefficient storage of small objects: The content and fixups for individual fragments are typically very small. Storing a multitude of these tiny objects individually within `SmallVector`s led to less-than-optimal memory utilization.
- Non-trivial destructors: When deallocating sections, the `~MCSection` destructor had to meticulously traverse the fragment list and explicitly destroy each fragment.
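To make the problem concrete, here's a rough sketch of the old per-fragment storage, with a hypothetical name (the upstream classes differed in detail):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCFixup.h"
using namespace llvm;

// Rough shape of the old per-fragment storage (approximate; not the exact
// upstream definitions). Every fragment owned its own inline vectors, so each
// tiny fragment carried separate SmallVector bookkeeping and a non-trivial
// destructor that ~MCSection had to run.
struct OldStyleDataFragmentSketch {
  SmallVector<char, 32> Contents; // encoded instruction/data bytes
  SmallVector<MCFixup, 4> Fixups; // fixups applying to those bytes
};
```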
In 2024, @aengelke initiated a draft to store fragment content out-of-line. Building upon that foundation, I've extended this approach to also store fixups out-of-line, and ensured compatibility with the aforementioned features that cause complexity (especially RISC-V and LoongArch linker relaxation).
Furthermore, `MCRelaxableFragment` previously contained `MCInst Inst;`, which also necessitated a non-trivial destructor. To address this, I've redesigned its data structure. The operands are now stored within the parent `MCSection`, and the `MCRelaxableFragment` itself only holds references:
```cpp
uint32_t Opcode = 0;
```
Unfortunately, we still need to encode `MCInst::Flags` to support the x86 EVEX prefix, e.g., `{evex} xorw $foo, %ax`. My hope is that the x86 maintainers might refactor `X86MCCodeEmitter::encodeInstruction` to make this flag storage unnecessary.
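Roughly, the reference-based scheme looks like this (a simplified sketch; the field names here are illustrative rather than the exact upstream members):

```cpp
#include <cstdint>

// Hypothetical layout: instead of owning an MCInst (and its operand storage),
// the relaxable fragment keeps plain integers that reference operand storage
// owned by the parent MCSection, keeping the fragment trivially destructible.
struct RelaxableInstInfoSketch {
  uint32_t Opcode = 0;       // MCInst opcode
  uint32_t Flags = 0;        // MCInst flags, e.g. to preserve the x86 {evex} prefix
  uint32_t OperandStart = 0; // index of the first operand in the section's storage
  uint32_t NumOperands = 0;  // number of operands
};
```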
The new design of `MCFragment` and `MCSection` is as follows:
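In rough outline (a simplified sketch; member names and types are approximations rather than the exact upstream declarations):

```cpp
#include <cstdint>
#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCFixup.h"
using namespace llvm;

// Approximate sketch of a unified, trivially destructible fragment whose
// contents and fixups live in storage owned by the parent section.
class FragmentSketch {
public:
  FragmentSketch *Next = nullptr; // next fragment within the subsection
  uint64_t Offset = 0;            // assigned during section layout
  // Fixed-size part: half-open ranges into the parent section's storage.
  uint32_t ContentStart = 0, ContentEnd = 0;
  uint32_t FixupStart = 0, FixupEnd = 0;
  // Optional variable-size tail (span-dependent instruction, align, fill, ...).
  uint32_t VarContentStart = 0, VarContentEnd = 0;
  uint32_t VarFixupStart = 0, VarFixupEnd = 0;
};

// The section owns the actual bytes and fixups referenced above, so tearing
// down a section no longer requires destroying each fragment.
class SectionSketch {
public:
  SmallVector<char, 0> ContentStorage;
  SmallVector<MCFixup, 0> FixupStorage;
  FragmentSketch *Head = nullptr, *Tail = nullptr;
};
```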
(As a side note, the LLVM `CamelCase` variables are odd. As the MC maintainer, I'd be delighted to see them refactored to `camelBack` or `snake_case` if people agree on the direction.)
These changes were primarily driven by two key pull requests:
Fewer fragments: fixed-size part and variable tail
Prior to LLVM 21.1, the assembler, operating with a fragment design dating back to 2009, placed every span-dependent instruction into its own distinct fragment. The x86 code sequence `push rax; jmp foo; nop; jmp foo` would be represented with numerous fragments: `MCDataFragment(push rax); MCRelaxableFragment(jmp foo); MCDataFragment(nop); MCRelaxableFragment(jmp foo)`.
A more efficient approach emerged: storing both a fixed-size part and an optional variable-size tail within a single fragment.
- The fixed-size part maintains a consistent size throughout the assembly process.
- The variable-size tail, if present, encodes elements that can change in size or content, such as a span-dependent instruction, an alignment directive, a fill directive, or other similar span-dependent constructs.
The new design led to significantly fewer fragments:
```
MCFragment(fixed: push rax, variable: jmp foo)
MCFragment(fixed: nop, variable: jmp foo)
```
Key changes:
- MC: Restructure MCFragment as a fixed part and a variable tail.
- MC: Encode FT_Align in fragment's variable-size tail
Reducing instruction encoding overhead
Encoding individual instructions is the most performance-critical operation within `MCObjectStreamer`. Recognizing this, significant effort has been dedicated to reducing this overhead since May 2023.
- [MC] Always encode instruction into SmallVector
- [MC] Remove the legacy overload of encodeInstruction with a lot of prior cleanups
- [MC][ELF] Emit instructions directly into fragment
- [MC][X86] Avoid copying MCInst in emitInstrEnd in 2024-06
- X86AsmBackend: Remove some overhead from auto padding feature
- X86AsmBackend: Simplify isRightAfterData for the auto-pad feature
It's worth mentioning that x86's instruction padding features, introduced in 2020, have imposed considerable overhead. Specifically, these features are:
- `-mbranches-within-32B-boundaries`. See Align branches within 32-Byte boundary (NOP padding).
- "Enhanced relaxation": This feature allows x86 prefix padding for all instructions, effectively making every instruction span-dependent and requiring its own fragment.
My recent optimization efforts demanded careful attention to this particularly complex and performance-sensitive code.
Eager fragment creation
Encoding an instruction is a far more frequent operation than appending a variable-size tail to the current fragment. In the previous design, the instruction encoder was burdened with an extra check: it had to determine if the current fragment already had a variable-size tail.
```
encodeInstruction:
  if the current fragment has a variable-size tail:
    start a new fragment
  append the encoded instruction to the current fragment's fixed-size part
```
Our new strategy optimizes this by maintaining a current fragment that is guaranteed not to have a variable-size tail. This means functions appending data to the fixed-size part no longer need to perform this check. Instead, any function that sets a variable-size tail will now immediately start a new fragment.
Here's how the workflow looks with this optimization:
```
encodeInstruction:
  append the encoded instruction to the current fragment's fixed-size part

set a variable-size tail (alignment, fill, relaxable instruction, ...):
  set the current fragment's variable-size tail
  start a new fragment
```
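As a self-contained toy model of the invariant (not the actual MCObjectStreamer code), just to show where the branch disappears:

```cpp
#include <cstddef>
#include <vector>

// Toy model (not LLVM's API): the streamer always points at a fragment that
// has no variable-size tail, so the hot append path needs no check.
struct ToyFragment {
  std::vector<char> Fixed; // fixed-size part
  bool HasVarTail = false; // set at most once, right before the fragment is sealed
};

struct ToyStreamer {
  std::vector<ToyFragment> Fragments;

  ToyStreamer() { Fragments.emplace_back(); } // start with one open, tail-free fragment

  // Hot path (encodeInstruction lands here): no "does the current fragment
  // already have a variable tail?" branch.
  void appendFixed(const char *Data, size_t Size) {
    ToyFragment &F = Fragments.back();
    F.Fixed.insert(F.Fixed.end(), Data, Data + Size);
  }

  // Cold path: setting a variable-size tail seals the current fragment and
  // eagerly starts a fresh, tail-free one.
  void setVarTail() {
    Fragments.back().HasVarTail = true;
    Fragments.emplace_back();
  }
};
```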
Key changes:
It's worth noting that the first patch was made possible thanks to the removal of the bundle alignment mode.
Fragment content in trailing data
Our `MCFragment` class manages four distinct sets of appendable data: fixed-size content, fixed-size fixups, variable-size tail content, and variable-size tail fixups. Of these, the fixed-size content is typically the largest. We can optimize its storage by utilizing it as trailing data, akin to a flexible array member.
This approach offers several compelling advantages:
- Improved data locality: Storing the content after the MCFragment object improves cache utilization.
- Simplified metadata: We can replace the pair `uint32_t ContentStart = 0; uint32_t ContentEnd = 0;` with a single `uint32_t ContentSize;`.
This optimization leverages a clever technique made possible by using a special-purpose bump allocator. After allocating `sizeof(MCFragment)` bytes for a new fragment, we know that any remaining space within the current bump allocator block immediately follows the fragment's end. This contiguous space can then be efficiently used for the fragment's trailing data.
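Here's a minimal sketch of that trick with a toy block type; the real implementation uses LLVM's allocator and the actual `MCFragment`, and handles alignment and block exhaustion more carefully:

```cpp
#include <cstddef>
#include <cstring>
#include <new>

// Toy illustration of trailing-data allocation (not the actual LLVM code).
struct ToyBlock {
  char *Cur; // bump pointer into the current block
  char *End; // one past the end of the block
};

struct ToyFragment {
  unsigned ContentSize = 0; // trailing bytes live directly after *this
  char *content() { return reinterpret_cast<char *>(this + 1); }
};

// Allocate a fragment at the bump pointer; whatever remains in the block
// immediately follows the fragment and can become its trailing content.
// (Assumes the block can hold the header; real code checks and starts a new
// block otherwise.)
ToyFragment *allocateFragment(ToyBlock &B) {
  ToyFragment *F = new (B.Cur) ToyFragment();
  B.Cur += sizeof(ToyFragment);
  return F;
}

// Append to the tail fragment (whose trailing content ends at B.Cur).
// Returns false when the block is exhausted, in which case the caller closes
// this fragment and starts a new one in a fresh, sufficiently large block.
bool tryAppend(ToyBlock &B, ToyFragment *F, const char *Data, size_t Size) {
  if (static_cast<size_t>(B.End - B.Cur) < Size)
    return false;
  std::memcpy(F->content() + F->ContentSize, Data, Size);
  F->ContentSize += static_cast<unsigned>(Size);
  B.Cur += Size;
  return true;
}
```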
However, this design introduces a few important considerations:
- Tail fragment appends only: Data can only be appended to the tail fragment of a subsection. Fragments located in the middle of a subsection are immutable in their fixed-size content. Any post-assembler-layout adjustments must target the variable-size tail.
- Dynamic allocation management: When new data needs to be appended, a function is invoked to ensure the current bump allocator block has sufficient space. If not, the current fragment is closed (its fixed-size content is finalized), and a new fragment is started. For instance, an 8-byte sequence could be stored as a single fragment or, if space constraints dictate, as two fragments each encoding 4 bytes.
- New block allocation: If the available space in the current block is insufficient, a new block large enough to accommodate both an MCFragment and the required bytes for its trailing data is allocated.
- Section/subsection switching: The previously saved fragment list tail cannot simply be reused, because it's tied to the memory space of the previous bump allocator block. Instead, a new fragment must be allocated using the current bump allocator block and appended to the new subsection's tail.
Key changes:
Deprecating complexity: NativeClient's bundle alignment mode
Google's now-discontinued Native Client (NaCl) project provided a sandboxing environment through a combination of Software Fault Isolation (SFI) and memory segmentation. A distinctive feature of its SFI implementation was the "bundle alignment mode", which enforced strict rules:
- Instructions were grouped into 32-byte "bundles".
- No instruction could cross a bundle boundary.
While the core concept of aligned bundling is intriguing and has influenced successors like Lightweight Fault Isolation (LFI), its implementation within the LLVM assembler proved problematic. Introduced in 2012, this feature significantly increased the complexity of MC's internal workings and, perhaps more critically, imposed noticeable performance penalties on users who had no need for NaCl. I was particularly concerned by its pervasive modifications to `MCObjectStreamer` and `MCAssembler`.
The complexity deepened with the introduction of:
- 2014: PendingLabels
- 2015: NaCl's `mc-relax-all` optimization
This pending label mechanism led to more complexity:
- 2015: [MC] Ensure that pending labels are flushed when -mc-relax-all flag is used
- 2019: [ MC ] Match labels to existing fragments even when switching sections, by an Apple developer. In a nutshell, the pending labels mechanism was causing headaches for Mach-O, requiring additional code to manage.
In `MCObjectStreamer`, newly defined labels were put into a "pending label" list and initially assigned to an `MCDummyFragment` associated with the current section. The symbols would be reassigned to a new fragment when the next instruction or directive was parsed. This pending label system introduced complexity and potential for subtle bugs. `flushPendingLabels` was called by many `MCObjectStreamer` functions, notably once for each new fragment, adding overhead. It also complicated label difference evaluation due to `MCDummyFragment` in `MCExpr.cpp:AttemptToFoldSymbolOffsetDifference`.
To address these long-standing issues, a series of pivotal changes was undertaken:
- 2024: [MC] Aligned bundling: remove special handling for RelaxAll (removed an optimization for NaCl in the `mc-relax-all` mode)
- 2024: [MC] Remove pending labels
- 2024: [MC] AttemptToFoldSymbolOffsetDifference: remove MCDummyFragment check. NFC
- 2025: Finally, MC: Remove bundle alignment mode, after Derek Schuff agreed to drop NaCl support from LLVM.
Should future features require a variant of bundle alignment, I firmly believe a much cleaner implementation is necessary. This could potentially be achieved through a backend hook within `X86AsmBackend::finishLayout`, applied after the primary assembler layout phase, similar to how the `-mbranches-within-32B-boundaries` option is handled.