Since the LLVM 22 branch was cut, I've landed a series of patches to
speed up lld/ELF by parallelizing additional link phases and removing a
few per-relocation hotspots. This post walks through the main changes
and compares the current main branch against lld 22.1, mold, and wild.
## Benchmark
Setup: three reproduce tarballs, `--threads=8`, `hyperfine -w 1 -r 10`, pinned to CPU cores with `numactl -C`. `lld-0201` is main at 2026-02-01 (6a1803929817); `lld-load` is main plus the pending [ELF] Parallelize input file loading patch. mold and wild run with `--no-fork` so the wall-clock numbers include the linker process itself.
| Workload | lld-0201 | lld-load | mold | wild |
|---|---|---|---|---|
| clang-23 Release+Asserts, --gc-sections | 1.255 s | 917.8 ms | 552.6 ms | 367.2 ms |
| clang-23 Debug (no --gdb-index) | 4.582 s | 4.306 s | 2.464 s | 1.565 s |
| clang-23 Debug (--gdb-index) | 6.291 s | 5.915 s | 4.001 s | N/A |
| Chromium Debug (no --gdb-index) | 6.140 s | 5.904 s | 2.665 s | 2.010 s |
| Chromium Debug (--gdb-index) | 7.857 s | 7.322 s | 3.786 s | N/A |
Note that the llvm/lib/Support/Parallel.cpp design keeps the main thread idle during `parallelFor`, so `--threads=N` really utilizes N+1 threads. wild does not yet implement `--gdb-index`, so those rows are marked N/A.
A few observations before diving in:
- The `--gdb-index` cost on the Chromium link is +1.72 s for lld versus +1.12 s for mold. This is currently the single biggest gap.
- Excluding `--gdb-index`, mold is still 1.66x–2.22x as fast and wild 2.5x–2.94x as fast on this machine. There is plenty of room left.
## Parallelize input file loading
Historically, LinkerDriver::createFiles walked the
command line and called addFile serially.
addFile maps the file (MemoryBuffer::getFile),
sniffs the magic, and constructs an ObjFile,
SharedFile, BitcodeFile, or
ArchiveFile. For thin archives it also materializes each
member. On workloads with hundreds of archives and thousands of objects,
this serial walk dominates the early part of the link.
The pending patch rewrites `addFile` to record a `LoadJob` for each non-script input together with a snapshot of the driver's state machine (`inWholeArchive`, `inLib`, `asNeeded`, `withLOption`, `groupId`). After `createFiles` finishes, `loadFiles` fans the jobs out to worker threads. Linker scripts stay on the main thread because `INPUT()` and `GROUP()` recursively call back into `addFile`.
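A minimal sketch of the idea, assuming hypothetical types (everything except the `LoadJob`/`addFile`/`loadFiles` names is invented here), with the worker fan-out shown as a plain loop:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical snapshot of the driver's option state machine.
struct DriverState {
  bool inWholeArchive = false;
  bool inLib = false;
  bool asNeeded = false;
  bool withLOption = false;
  uint32_t groupId = 0;
};

// One recorded input: a path plus the state addFile saw at that point.
struct LoadJob {
  std::string path;
  DriverState state;
};

struct Driver {
  DriverState cur;
  std::vector<LoadJob> jobs;

  // Serial phase: only records work. Linker scripts would bypass this
  // and load eagerly, since INPUT()/GROUP() re-enter addFile.
  void addFile(const std::string &path) { jobs.push_back({path, cur}); }

  // Parallel phase (a worker loop in lld; shown serially here). Each job
  // carries its own state snapshot, so workers never read `cur`.
  std::vector<std::string> loadFiles() {
    std::vector<std::string> loaded;
    for (const LoadJob &j : jobs)
      loaded.push_back(j.path + (j.state.asNeeded ? " (as-needed)" : ""));
    return loaded;
  }
};
```

The key design point is that the snapshot makes each job self-contained, so the order in which workers run cannot change what any individual file observes.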
A few subtleties made this harder than it sounds:
- `BitcodeFile` and fat-LTO construction call `ctx.saver`/`ctx.uniqueSaver`, both of which are non-thread-safe `StringSaver`/`UniqueStringSaver`. I serialized those constructors behind a mutex; pure-ELF links hit it zero times.
- Thin-archive member buffers used to be appended to `ctx.memoryBuffers` directly. To keep the output deterministic across `--threads` values, each job now accumulates into a per-job `SmallVector`, which is merged into `ctx.memoryBuffers` in command-line order.
- `InputFile::groupId` used to be assigned inside the `InputFile` constructor from a global counter. With parallel construction the assignment race would have been unobservable but still ugly; b6c8cba516da hoists `++nextGroupId` into the serial driver loop and stores the value into each file after construction.
The output is byte-identical to that of the old lld and deterministic across `--threads` values, verified with `diff` across `--threads={1,2,4,8}` on Chromium.
## Parallelize `--gc-sections` mark
Garbage collection had been a single-threaded BFS over the `InputSection` graph. On a Release+Asserts clang link, `markLive` was ~315 ms of the 1562 ms wall time (20%).
6f9646a598f2 adds `markParallel`, a level-synchronized BFS. Each BFS level is processed with `parallelFor`; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when `!TrackWhyLive && partitions.size() == 1`.
Implementation details that turned out to matter:
- Depth-limited inline recursion (`depth < 3`) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.
- Optimistic "load then compare-exchange" section-flag dedup instead of atomic fetch-or. The vast majority of sections are visited once, so the load almost always wins.
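A sketch of the level-synchronized mark with the optimistic load-then-CAS dedup. The types and names here are hypothetical, and the per-level loop is shown serially where lld would use `parallelFor` with per-thread `next` queues:

```cpp
#include <atomic>
#include <vector>

// Hypothetical section node: a live flag plus outgoing references.
struct Section {
  std::atomic<bool> live{false};
  std::vector<int> succs;  // indices of referenced sections
};

// Optimistic dedup: a plain load first, CAS only if it looks unvisited.
// Returns true iff this caller claimed the section.
inline bool tryMark(Section &s) {
  if (s.live.load(std::memory_order_relaxed))
    return false;  // fast path: almost always taken on revisits
  bool expected = false;
  return s.live.compare_exchange_strong(expected, true);
}

// Level-synchronized BFS; returns the number of levels processed.
inline int markLevels(std::vector<Section> &secs, std::vector<int> roots) {
  int levels = 0;
  std::vector<int> cur = std::move(roots), next;
  for (int r : cur)
    tryMark(secs[r]);
  while (!cur.empty()) {
    ++levels;
    // In lld this loop runs under parallelFor, with each thread pushing
    // into its own queue; the queues are merged before the next level.
    for (int i : cur)
      for (int t : secs[i].succs)
        if (tryMark(secs[t]))
          next.push_back(t);
    cur = std::move(next);
    next.clear();
  }
  return levels;
}
```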
On the Release+Asserts clang link, `markLive` dropped from 315 ms to 82 ms at `--threads=8` (from 199 ms to 50 ms at `--threads=16`); total wall time improved by 1.16x–1.18x.
Two prerequisite cleanups were needed for correctness:
- 6a874161621e moved `Symbol::used` into the existing `std::atomic<uint16_t> flags`. The bitfield would otherwise race between concurrent mark threads.
- 2118499a898b decoupled `SharedFile::isNeeded` from the mark walk. `--as-needed` used to flip `isNeeded` inside `resolveReloc`, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.
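The atomic-flags pattern behind the first cleanup can be sketched like this; this is a hypothetical stand-in, not lld's actual `Symbol` declaration:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical flags word: one bit per boolean that used to live in a
// plain bitfield. fetch_or lets any mark thread set a bit without a lock.
struct SymbolFlags {
  static constexpr uint16_t Used = 1 << 0;
  std::atomic<uint16_t> flags{0};

  void setUsed() { flags.fetch_or(Used, std::memory_order_relaxed); }
  bool isUsed() const {
    return flags.load(std::memory_order_relaxed) & Used;
  }
};
```

Relaxed ordering is enough here because the mark phase only needs each bit to eventually be set, not to order other memory accesses around it.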
## Extending parallel relocation scanning
Relocation scanning has been parallel since LLVM 17, but three cases had opted out via `bool serial`:

- `-z nocombreloc`, because `.rela.dyn` merged relative and non-relative relocations and needed deterministic ordering.
- MIPS, because `MipsGotSection` is mutated during scanning.
- PPC64, because `ctx.ppc64noTocRelax` (a `DenseSet` of `(Symbol*, offset)` pairs) was written without a lock.
076226f378df and dc4df5da886e separate relative and non-relative dynamic relocations unconditionally and always build `.rela.dyn` with `combreloc=true`; the only remaining effect of `-z nocombreloc` is suppressing `DT_RELACOUNT`. 2f7bd4fa9723 then protects `ctx.ppc64noTocRelax` with the already-existing `ctx.relocMutex`, which is only taken on rare slow paths.
After these changes, only MIPS still runs scanning serially, and MIPS is
a marginal target for lld.
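The PPC64 fix boils down to the following pattern, sketched with `std::set` and invented names in place of lld's `DenseSet` and real types: the hot scanning path never touches the mutex, and only the rare no-TOC-relax case pays for the lock.

```cpp
#include <cstdint>
#include <mutex>
#include <set>
#include <utility>

struct Symbol;  // opaque; only pointers are stored

// Hypothetical slice of the link context: a shared set guarded by the
// same mutex relocation scanning already uses for slow paths.
struct Ctx {
  std::mutex relocMutex;
  std::set<std::pair<Symbol *, uint64_t>> ppc64noTocRelax;
};

// Rare slow path: take the lock, record the (symbol, offset) pair.
inline void recordNoTocRelax(Ctx &ctx, Symbol *sym, uint64_t off) {
  std::lock_guard<std::mutex> lock(ctx.relocMutex);
  ctx.ppc64noTocRelax.insert({sym, off});
}
```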
## `getSectionPiece` in O(1) for the common case
Merge sections (`SHF_MERGE`) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation was always a binary search in `MergeInputSection::pieces`, called from `MarkLive`, `includeInSymtab`, and `getRelocTargetVA`.
42cc45477727 changes this in two ways:
- For non-string fixed-size merge sections, `getSectionPiece` uses `offset / entsize` directly.
- For non-section `Defined` symbols pointing into merge sections, the piece index is pre-resolved during `splitSections` and packed into `Defined::value` as `((pieceIdx + 1) << 32) | intraPieceOffset`.
The binary search is now limited to references via section symbols (addend-based), which is common on AArch64 but rare on x86-64, where the assembler emits local labels for `.L` references into mergeable strings. The clang-relassert link is 1.05x faster with `--gc-sections`.
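The two fast paths can be sketched as free functions. These names are hypothetical; lld folds this logic into `getSectionPiece` and the handling of `Defined::value`:

```cpp
#include <cstdint>

// Fast path 1: non-string merge sections have a fixed entsize, so the
// piece index is just a division.
inline uint32_t pieceIndexFixed(uint64_t offset, uint64_t entsize) {
  return static_cast<uint32_t>(offset / entsize);
}

// Fast path 2: pre-resolved pieces are packed into a 64-bit value during
// splitSections. The +1 bias keeps 0 meaning "not pre-resolved".
inline uint64_t packPiece(uint32_t pieceIdx, uint32_t intraOff) {
  return (static_cast<uint64_t>(pieceIdx + 1) << 32) | intraOff;
}

inline bool isPacked(uint64_t v) { return (v >> 32) != 0; }
inline uint32_t packedPieceIdx(uint64_t v) {
  return static_cast<uint32_t>(v >> 32) - 1;  // undo the +1 bias
}
inline uint32_t packedIntraOff(uint64_t v) {
  return static_cast<uint32_t>(v);
}
```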
## Small wins worth mentioning
- 036b755daedb parallelizes `demoteAndCopyLocalSymbols`. Each file collects local `Symbol*` pointers in a per-file vector via `parallelFor`, which are merged into the symbol table serially. Linking clang-14 with its 208K `.symtab` entries is 1.04x faster.
- 0bde74ab0499 changes `getSectionPiece` to take `SectionPiece` by reference. A small but measurable win on large debug links.
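The per-file collect-then-merge pattern behind `demoteAndCopyLocalSymbols` looks roughly like this sketch, with hypothetical types, a stand-in "is local" predicate, and the parallel loop shown serially:

```cpp
#include <string>
#include <vector>

using Symtab = std::vector<std::string>;

// Hypothetical: each input file's symbols are a vector of names here.
// Returns local symbols from all files, in file order.
inline Symtab collectLocals(const std::vector<Symtab> &files) {
  std::vector<Symtab> perFile(files.size());
  // In lld this loop is a parallelFor; each iteration writes only its
  // own slot, so no synchronization is needed.
  for (size_t i = 0; i < files.size(); ++i)
    for (const std::string &s : files[i])
      if (!s.empty() && s[0] == 'L')  // stand-in locality predicate
        perFile[i].push_back(s);
  // Serial merge in file order keeps the output deterministic.
  Symtab out;
  for (const Symtab &v : perFile)
    out.insert(out.end(), v.begin(), v.end());
  return out;
}
```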
## Where lld still loses time
The benchmark makes two remaining bottlenecks obvious.
**`--gdb-index`.** This adds 1.7 s to both of my debug links, compared to 1.1–1.5 s in mold. The cost comes from reading `.debug_info` / `.debug_names` on every input, splitting CUs, and rewriting the index. The work is embarrassingly parallel per input file, but the current implementation does a lot of string interning through a single hash table.
**Debug-info-heavy links in general.** With `--gdb-index` stripped, mold is 1.75x–2.22x ahead and wild is 2.75x–2.94x ahead. The output write itself (especially the `.debug_*` sections) is a big fraction. mold and wild both do more aggressive parallelization of the write phase; lld still writes many `.debug_*` sections on a single thread.
wild is worth calling out separately: its user time is comparable to lld's but its system time is roughly half. That is consistent with wild writing a smaller intermediate output footprint and issuing fewer syscalls. mold is at the other extreme — the highest user time on every workload, bought back by aggressive parallelism.
## Reproducing
The reproduce tarballs come from `--reproduce=foo.tar` on real clang and Chromium links. If you want to try this at home:

```sh
tar xf clang-relassert.tar -C /tmp/reproduce/clang-relassert
```
Pin to a single CCX (`numactl -C 20-28` on my Zen 4 machine) if you want stable numbers — otherwise cross-CCX memory traffic dominates the noise at these wall times.