lld 14 ELF changes
2022-02-20 16:00:00 Author: maskray.me

llvm-project 14 will be released soon. I added some lld/ELF notes to https://github.com/llvm/llvm-project/blob/release/14.x/lld/docs/ReleaseNotes.rst. Here I will elaborate on some changes.

  • --export-dynamic-symbol-list has been added. (D107317) When I added --export-dynamic-symbol to GNU ld, H.J. Lu asked me to add this option. I asked myself whether this was necessary but then realized it may help deprecate --dynamic-list in the long term. --dynamic-list is confusing: it has different semantics for executables and shared objects, and the symbolic intention for shared objects isn't clear.
  • --why-extract has been added to query why archive members/lazy object files are extracted. (D109572) This was a long-missing feature of ld.lld -Map. I picked a separate option because I realized that this need is often orthogonal to the input-section-to-output-section map.
  • If -Map is specified, --cref will be printed to the specified file. (D114663) A linker's stdout output is often interleaved with different information, so being able to redirect a piece of information to a file is useful. I think it would be nice if GNU ld had --cref=<file> and not reused -Map.
  • -z bti-report and -z cet-report are now supported. (D113901)
  • --lto-pgo-warn-mismatch has been added. (D104431)
  • Archives without an index (symbol table) are now supported and work with --warn-backrefs. One may build such an archive with llvm-ar rcS [--thin] to save space. (D117284) In 15.0.0, the archive symbol table will be entirely ignored. The post "Archives and --start-lib" has more context.
  • No longer deduplicate local symbol names at the default optimization level of -O1. This results in a larger .strtab (usually less than 1%) but a faster link time. Use optimization level -O2 to restore the deduplication. In 15.0.0, the -O2 deduplication is dropped to help parallel .symtab write.
  • In relocatable output, relocations to discarded symbols now use tombstone values. (D116946)
  • --compress-debug-sections=zlib is now run in parallel. {clang,gcc} -gz link actions are significantly faster. (D117853) The "linkers" section of the post "Compressed debug sections" has more context.
  • "relocation out of range" diagnostics and a few uncommon diagnostics now report an object file location beside a source file location. (D112518)
  • The write of .rela.dyn and SHF_MERGE|SHF_STRINGS sections (e.g. .debug_str) is now run in parallel.
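
To make the .strtab trade-off in the local symbol name item concrete, here is a minimal sketch of a string table built with and without name deduplication. It is a simplified model, not lld's implementation: the function name `build_strtab` and the sample names are made up for illustration.

```python
def build_strtab(names, dedup):
    """Build a simplified ELF-style string table.

    Returns (table_bytes, offsets), where offsets[i] is the offset of
    names[i] inside the table. This models the idea only; it is not
    how lld lays out .strtab.
    """
    table = bytearray(b"\0")  # offset 0 holds the empty string by ELF convention
    seen = {}                 # name -> offset, consulted only when dedup is on
    offsets = []
    for name in names:
        if dedup and name in seen:
            offsets.append(seen[name])  # reuse the earlier copy
            continue
        off = len(table)
        table += name.encode() + b"\0"
        seen[name] = off
        offsets.append(off)
    return bytes(table), offsets


# Local names such as "tmp" tend to repeat across object files.
names = ["a", "tmp", "a", "tmp"]
plain, _ = build_strtab(names, dedup=False)
merged, _ = build_strtab(names, dedup=True)
print(len(plain), len(merged))
```

Skipping the hash lookup makes each insertion cheaper and independent of the others, at the cost of repeated names taking space more than once.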

Linker script changes:

  • Orphan section placement now picks a more suitable segment. Previously the algorithm might pick a read-only segment for a writable orphan section and make the segment writable. (D111717)
  • An empty output section moved by an INSERT command now gets appropriate flags. (D118529)
  • Negation in a memory region attribute is now correctly handled. (D113771)
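
For readers unfamiliar with negated memory region attributes, here is a rough sketch of how an attribute string such as `rx!w` can be matched against section flags. It is a simplified model based on the documented GNU ld behavior ("!" inverts the test for the letters that follow), not lld's code. The names `region_accepts` and `letter_matches` are hypothetical, only the letters r, w, x and a are handled, and a string with only negated letters never matches in this sketch, which is an assumption of the model.

```python
SHF_WRITE = 0x1
SHF_ALLOC = 0x2
SHF_EXECINSTR = 0x4

# Attribute letter -> (flag that must be set, flag that must be clear).
ATTRS = {
    "w": (SHF_WRITE, 0),
    "x": (SHF_EXECINSTR, 0),
    "a": (SHF_ALLOC, 0),
    "r": (0, SHF_WRITE),  # "read-only" is modeled as "not writable"
}


def letter_matches(letter, sec_flags):
    """Return True if the section flags satisfy a single attribute letter."""
    must_set, must_clear = ATTRS[letter]
    if must_set:
        return bool(sec_flags & must_set)
    return not (sec_flags & must_clear)


def region_accepts(attrs, sec_flags):
    """Letters before '!' are positive tests; letters after it are negated."""
    positive, _, negative = attrs.lower().partition("!")
    # A section matching any negated letter is rejected outright.
    if any(letter_matches(c, sec_flags) for c in negative):
        return False
    # Otherwise it needs to match at least one positive letter.
    return any(letter_matches(c, sec_flags) for c in positive)


text_flags = SHF_ALLOC | SHF_EXECINSTR
data_flags = SHF_ALLOC | SHF_WRITE
print(region_accepts("rx!w", text_flags), region_accepts("rx!w", data_flags))
```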

Architecture specific changes:

  • The AArch64 port now supports adrp+ldr and adrp+add optimizations. --no-relax can suppress the optimization. (D112063) (D117614)
  • The x86-32 port now supports TLSDESC (-mtls-dialect=gnu2). (D112582)
  • The x86-64 port now handles non-RAX/non-adjacent R_X86_64_GOTPC32_TLSDESC and R_X86_64_TLSDESC_CALL (-mtls-dialect=gnu2). (D114416)
  • The x86-32 and x86-64 ports now support mixed TLSDESC and TLS GD, i.e. mixing objects compiled with and without -mtls-dialect=gnu2 referencing the same TLS variable is now supported. (D114416)
  • For x86-64, --no-relax now suppresses R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX GOT optimization (D113615)
  • R_X86_64_PLTOFF64 is now supported. (D112386)
  • R_AARCH64_NONE, R_PPC_NONE, and R_PPC64_NONE in input REL relocation sections are now supported.
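
As background for the AArch64 item, the sketch below shows the address arithmetic behind an adrp+add pair and the range check that decides whether a single adr could reach the same target. The notes above do not spell out the rewrite, so treat it as an assumption that the adrp+add optimization replaces the pair with an adr when the target is close enough. The helper names are hypothetical.

```python
def page(addr):
    """Round an address down to its 4 KiB page."""
    return addr & ~0xFFF


def adrp_add_operands(pc, target):
    """Split a target address into the operands of an adrp+add pair.

    adrp materializes Page(target) relative to Page(pc), counted in
    4 KiB pages; add supplies the low 12 bits.
    """
    return (page(target) - page(pc)) >> 12, target & 0xFFF


def can_use_adr(pc, target):
    """adr encodes a signed 21-bit byte offset, i.e. roughly +/- 1 MiB."""
    return -(1 << 20) <= target - pc < (1 << 20)


pc, target = 0x10000, 0x12345
print(adrp_add_operands(pc, target), can_use_adr(pc, target))
```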

Breaking changes

  • e_entry no longer falls back to the address of .text if the entry symbol does not exist. Instead, a value of 0 will be written. (D110014)
  • --lto-pseudo-probe-for-profiling has been removed. In LTO, the compiler enables this feature automatically. (D110209)
  • Use of --[no-]define-common, -d, -dc, and -dp will now get a warning. They will be removed or ignored in 15.0.0. (llvm-project#53660: https://github.com/llvm/llvm-project/issues/53660)
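
The e_entry change can be summarized with a toy model. This is only an illustration of the before/after behavior described in the first item: the function name, the `lld13_fallback` switch and the addresses are made up, and the other ways of specifying an entry point are left out.

```python
def entry_address(symbols, entry, text_addr, lld13_fallback=False):
    """Pick the value written to e_entry in this toy model.

    symbols maps symbol names to addresses. text_addr is the address of
    .text, which is only used when modeling the old fallback.
    """
    if entry in symbols:
        return symbols[entry]
    # Entry symbol missing: older releases fell back to .text, 14 writes 0.
    return text_addr if lld13_fallback else 0


print(hex(entry_address({}, "_start", 0x400800, lld13_fallback=True)))
print(hex(entry_address({}, "_start", 0x400800)))
```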

Speed

I use a -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXE_LINKER_FLAGS=-Wl,--push-state,$HOME/Dev/mimalloc/out/release/libmimalloc.a,--pop-state -DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD=X86 -fno-pic -no-pie build. The host compiler is a close-to-main clang. (Compared with glibc malloc, linking against libmimalloc.a is 1.12x as fast.)

I have made dozens of changes scattered across the lld/ELF codebase to improve performance.

Linking a -DCMAKE_BUILD_TYPE=Release build of clang:

% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "/tmp/llvm-{13,14}/out/release/bin/ld.lld" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8
Time (mean ± σ): 1.181 s ± 0.009 s [User: 1.300 s, System: 0.471 s]
Range (min … max): 1.161 s … 1.193 s 16 runs

Benchmark 2: numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8
Time (mean ± σ): 1.054 s ± 0.008 s [User: 1.292 s, System: 0.492 s]
Range (min … max): 1.039 s … 1.068 s 16 runs

Summary
'numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8' ran
1.12 ± 0.01 times faster than 'numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8'

(--threads=2 => 1.17x)

Linking a -DCMAKE_BUILD_TYPE=Debug build of clang:

% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "/tmp/llvm-{13,14}/out/release/bin/ld.lld" @response.txt --threads=8"
Benchmark 1: numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8
Time (mean ± σ): 6.240 s ± 0.123 s [User: 9.937 s, System: 2.517 s]
Range (min … max): 6.136 s … 6.495 s 16 runs

Benchmark 2: numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8
Time (mean ± σ): 5.089 s ± 0.066 s [User: 9.531 s, System: 2.225 s]
Range (min … max): 5.006 s … 5.210 s 16 runs

Summary
'numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8' ran
1.23 ± 0.03 times faster than 'numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8'

(--threads=2 => 1.11x)

Linking a default build of chrome:

% hyperfine --warmup 2 --min-runs 16 "numactl -C 20-27 "/tmp/llvm-{13,14}/out/release/bin/ld.lld" @response.txt --threads=8"       
Benchmark 1: numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8
Time (mean ± σ): 8.017 s ± 0.042 s [User: 7.440 s, System: 3.238 s]
Range (min … max): 7.946 s … 8.089 s 16 runs

Benchmark 2: numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8
Time (mean ± σ): 6.857 s ± 0.052 s [User: 6.921 s, System: 3.006 s]
Range (min … max): 6.796 s … 6.982 s 16 runs

Summary
'numactl -C 20-27 /tmp/llvm-14/out/release/bin/ld.lld @response.txt --threads=8' ran
1.17 ± 0.01 times faster than 'numactl -C 20-27 /tmp/llvm-13/out/release/bin/ld.lld @response.txt --threads=8'
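
The summary ratios can be rechecked from the reported means. The sketch below divides the llvm-13 mean by the llvm-14 mean and estimates the spread by first-order propagation of the two relative standard deviations. That propagation formula is an assumption about how the ± figure is derived, not something the hyperfine output states.

```python
import math

# (mean, stddev) wall times in seconds, copied from the hyperfine output above.
RUNS = {
    "clang Release": ((1.181, 0.009), (1.054, 0.008)),
    "clang Debug": ((6.240, 0.123), (5.089, 0.066)),
    "chrome": ((8.017, 0.042), (6.857, 0.052)),
}


def speedup(old, new):
    """Return (ratio, spread) of old/new mean times with propagated error."""
    (old_mean, old_sd), (new_mean, new_sd) = old, new
    ratio = old_mean / new_mean
    spread = ratio * math.hypot(old_sd / old_mean, new_sd / new_mean)
    return ratio, spread


for name, (lld13, lld14) in RUNS.items():
    ratio, spread = speedup(lld13, lld14)
    print(f"{name}: {ratio:.2f} ± {spread:.2f}")
```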

Memory usage

I have made some changes decreasing sizeof(SymbolUnion) and sizeof(InputSection). There is a 1~2% decrease for some programs with several malloc implementations.

ThinLTO applications will see a larger reduction. lld uses file-backed mmap to read input files. For ThinLTO indexing, the page buffers are nearly unused after symbol resolution. I have changed lld to call madvise(MADV_DONTNEED) so that the page buffer memory can overlap with the memory allocated by the LTO library (mostly ThinLTO import and export lists): https://reviews.llvm.org/D116367. This change led to a 16% reduction when linking a large executable.
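
The effect of madvise(MADV_DONTNEED) on a file-backed mapping can be tried in isolation. The sketch below is not lld's code. It assumes a Unix-like system and Python 3.8 or later (for `mmap.madvise`), uses a private read-only mapping as a stand-in for an input file, and all function names are made up. It shows that the data stays readable after the hint, because the pages are simply faulted in again from the file.

```python
import mmap
import os
import tempfile


def touch_pages(mm):
    """Read one byte per page so that every page is faulted in."""
    return sum(mm[i] for i in range(0, len(mm), mmap.PAGESIZE))


def read_release_reread(path):
    """Map a file, read it, tell the kernel the pages can go, read it again."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
    try:
        before = touch_pages(mm)
        if hasattr(mmap, "MADV_DONTNEED"):
            # Hint only: resident pages may be dropped, contents are unchanged.
            mm.madvise(mmap.MADV_DONTNEED)
        after = touch_pages(mm)
        return before, after
    finally:
        mm.close()


def demo():
    """Run the experiment on a temporary four-page file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x01" * (4 * mmap.PAGESIZE))
        path = tmp.name
    try:
        return read_release_reread(path)
    finally:
        os.unlink(path)


print(demo())
```

Between the two reads the kernel is free to reclaim the mapped pages, which is what lets other allocations reuse that physical memory.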

I have made another change that makes the --start-lib code path cache the symbol interning result, which led to a 0.6% reduction: https://reviews.llvm.org/D116390.


Source: https://maskray.me/blog/2022-02-20-lld-14-elf-changes