Recent lld/ELF performance improvements
2026-04-12 · Author: maskray.me

Since the LLVM 22 branch was cut, I've landed a series of patches to speed up lld/ELF by parallelizing additional link phases and removing a few per-relocation hotspots. This post walks through the main changes and compares the current main branch against lld 22.1, mold, and wild.

Benchmark

Setup: three --reproduce tarballs, --threads=8, hyperfine -w 1 -r 10, pinned to CPU cores with numactl -C. lld-0201 is main as of 2026-02-01 (6a1803929817); lld-load is main plus the pending [ELF] Parallelize input file loading patch. mold and wild run with --no-fork so the wall-clock numbers include the linker process itself.

Workload                                  lld-0201   lld-load   mold       wild
clang-23 Release+Asserts, --gc-sections   1.255 s    917.8 ms   552.6 ms   367.2 ms
clang-23 Debug (no --gdb-index)           4.582 s    4.306 s    2.464 s    1.565 s
clang-23 Debug (--gdb-index)              6.291 s    5.915 s    4.001 s    N/A
Chromium Debug (no --gdb-index)           6.140 s    5.904 s    2.665 s    2.010 s
Chromium Debug (--gdb-index)              7.857 s    7.322 s    3.786 s    N/A

Note that the llvm/lib/Support/Parallel.cpp design keeps the main thread idle during parallelFor, so --threads=N really occupies N+1 threads (N workers plus the waiting main thread).

wild does not yet implement --gdb-index, so those rows are skipped.

A few observations before diving in:

  • The --gdb-index cost on the Chromium link is +1.72 s for lld versus +1.12 s for mold. This is currently the single biggest gap.
  • Excluding --gdb-index, mold is still 1.66x–2.22x as fast and wild 2.5x–2.94x as fast on this machine. There is plenty of room left.

Parallelize input file loading

Historically, LinkerDriver::createFiles walked the command line and called addFile serially. addFile maps the file (MemoryBuffer::getFile), sniffs the magic, and constructs an ObjFile, SharedFile, BitcodeFile, or ArchiveFile. For thin archives it also materializes each member. On workloads with hundreds of archives and thousands of objects, this serial walk dominates the early part of the link.

The pending patch will rewrite addFile to record a LoadJob for each non-script input together with a snapshot of the driver's state machine (inWholeArchive, inLib, asNeeded, withLOption, groupId). After createFiles finishes, loadFiles fans the jobs out to worker threads. Linker scripts stay on the main thread because INPUT() and GROUP() recursively call back into addFile.

A few subtleties made this harder than it sounds:

  • BitcodeFile and fatLTO construction call ctx.saver / ctx.uniqueSaver, both of which are non-thread-safe StringSaver / UniqueStringSaver. I serialized those constructors behind a mutex; pure-ELF links hit it zero times.
  • Thin-archive member buffers used to be appended to ctx.memoryBuffers directly. To keep the output deterministic across --threads values, each job now accumulates into a per-job SmallVector which is merged into ctx.memoryBuffers in command-line order.
  • InputFile::groupId used to be assigned inside the InputFile constructor from a global counter. With parallel construction the assignment race would have been unobservable but still ugly; b6c8cba516da hoists ++nextGroupId into the serial driver loop and stores the value into each file after construction.

The output is byte-identical to the old lld and deterministic across --threads values, which I verified with diff across --threads={1,2,4,8} on Chromium.

Parallelize --gc-sections mark

Garbage collection had been a single-threaded BFS over the InputSection graph. On a Release+Asserts clang link, markLive took ~315 ms of the 1562 ms wall time (20%).

6f9646a598f2 adds markParallel, a level-synchronized BFS. Each BFS level is processed with parallelFor; newly discovered sections land in per-thread queues, which are merged before the next level. The parallel path activates when !TrackWhyLive && partitions.size() == 1. Implementation details that turned out to matter:

  • Depth-limited inline recursion (depth < 3) before pushing to the next-level queue. Shallow reference chains stay hot in cache and avoid queue overhead.
  • Optimistic "load then compare-exchange" section-flag dedup instead of atomic fetch-or. The vast majority of sections are visited once, so the load almost always wins.

On the Release+Asserts clang link, markLive dropped from 315 ms to 82 ms at --threads=8 (from 199 ms to 50 ms at --threads=16), a 1.16x–1.18x improvement in total wall time.

Two prerequisite cleanups were needed for correctness:

  • 6a874161621e moved Symbol::used into the existing std::atomic<uint16_t> flags. The bitfield was previously racing with other mark threads.
  • 2118499a898b decoupled SharedFile::isNeeded from the mark walk. --as-needed used to flip isNeeded inside resolveReloc, which would have required coordinated writes across threads; it is now a post-GC scan of global symbols.

Extending parallel relocation scanning

Relocation scanning has been parallel since LLVM 17, but three cases had opted out via bool serial:

  1. -z nocombreloc, because .rela.dyn merged relative and non-relative relocations and needed deterministic ordering.
  2. MIPS, because MipsGotSection is mutated during scanning.
  3. PPC64, because ctx.ppc64noTocRelax (a DenseSet of (Symbol*, offset) pairs) was written without a lock.

076226f378df and dc4df5da886e separate relative and non-relative dynamic relocations unconditionally and always build .rela.dyn with combreloc=true; the only remaining effect of -z nocombreloc is suppressing DT_RELACOUNT. 2f7bd4fa9723 then protects ctx.ppc64noTocRelax with the already-existing ctx.relocMutex, which is only taken on rare slow paths. After these changes, only MIPS still runs scanning serially, and MIPS is a marginal target for lld.

getSectionPiece in O(1) for the common case

Merge sections (SHF_MERGE) split their input into "pieces". Every reference into a merge section needs to map an offset to a piece. The old implementation always did a binary search in MergeInputSection::pieces, called from markLive, includeInSymtab, and getRelocTargetVA.

42cc45477727 changes this in two ways:

  1. For non-string fixed-size merge sections, getSectionPiece uses offset / entsize directly.
  2. For non-section Defined symbols pointing into merge sections, the piece index is pre-resolved during splitSections and packed into Defined::value as ((pieceIdx + 1) << 32) | intraPieceOffset.

The binary search is now limited to references via section symbols (addend-based), which is common on AArch64 but rare on x86-64 where the assembler emits local labels for .L references into mergeable strings. The clang-relassert link is 1.05x faster with --gc-sections.

Small wins worth mentioning

  • 036b755daedb parallelizes demoteAndCopyLocalSymbols. Each file collects local Symbol* pointers in a per-file vector via parallelFor, which are merged into the symbol table serially. Linking clang-14 with its 208K .symtab entries is 1.04x faster.
  • 0bde74ab0499 changes getSectionPiece to take SectionPiece by reference. A small but measurable win on large debug links.

Where lld still loses time

The benchmark makes two remaining bottlenecks obvious.

--gdb-index. This adds 1.7 s to both of my debug links, compared to 1.1–1.5 s in mold. The cost comes from reading .debug_info / .debug_names on every input, splitting CUs, and rewriting the index. The work is embarrassingly parallel per input file but the current implementation does a lot of string interning through a single hash table.

Debug-info-heavy links in general. With --gdb-index stripped, mold is 1.75x–2.22x ahead and wild is 2.75x–2.94x ahead. The output write itself (especially the .debug_* sections) is a big fraction. mold and wild both do more aggressive parallelization of the write phase; lld still writes many .debug_* sections on a single thread.

wild is worth calling out separately: its user time is comparable to lld's but its system time is roughly half. That is consistent with wild writing a smaller intermediate output footprint and issuing fewer syscalls. mold is at the other extreme — the highest user time on every workload, bought back by aggressive parallelism.

Reproducing

The reproduce tarballs come from --reproduce=foo.tar on real clang and Chromium links. If you want to try this at home:

tar xf clang-relassert.tar -C /tmp/reproduce/clang-relassert
cd /tmp/reproduce/clang-relassert
hyperfine -w 1 -r 10 \
  '/tmp/t/lld-0201 @response.txt --threads=8' \
  '/tmp/t/lld-load @response.txt --threads=8'

Pin to a single CCX (numactl -C 20-28 on my Zen 4 machine) if you want stable numbers — otherwise cross-CCX memory traffic dominates the noise at these wall times.


Source: https://maskray.me/blog/2026-04-12-recent-lld-elf-performance-improvements