Thread-local storage (TLS) provides a mechanism for allocating separate objects for different threads. It is the usual implementation for C11 _Thread_local and C++11 thread_local on ELF platforms.
An example is POSIX errno:
Each thread has its own thread ID, scheduling priority and policy, errno value, floating point environment, thread-specific key/value bindings, and the required system resources to support a flow of control.
Different threads have different errno copies. errno is typically defined as a macro that calls a function returning the address of a thread-local variable.
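The per-thread copies can be observed directly. The following is a minimal sketch, assuming a POSIX system with pthreads; the names counter, worker and threads_have_separate_copies are made up for illustration. It compares the address of errno across threads rather than its value, since library calls are free to modify errno.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical thread-local variable: every thread starts with its own copy set to 1. */
_Thread_local int counter = 1;

struct observed {
    int counter_in_worker;     /* counter as seen by the worker after its update */
    uintptr_t errno_in_worker; /* address of errno as seen by the worker */
};

static void *worker(void *arg) {
    struct observed *o = arg;
    counter += 5;                            /* touches only the worker's copy */
    o->counter_in_worker = counter;
    o->errno_in_worker = (uintptr_t)&errno;  /* errno expands to a per-thread lvalue */
    return NULL;
}

/* Returns 1 if the worker had its own counter and its own errno. */
int threads_have_separate_copies(void) {
    struct observed o = {0, 0};
    pthread_t t;
    counter = 100;                           /* touches only the calling thread's copy */
    int rc = pthread_create(&t, NULL, worker, &o);
    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    return o.counter_in_worker == 6 && counter == 100 &&
           o.errno_in_worker != (uintptr_t)&errno;
}
```

The worker starts from the initial value 1 regardless of what the calling thread stored, because each new thread's copy is initialized from the TLS initialization image.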
Representation
Assembler behavior
The compiler usually defines thread-local variables in .tdata and .tbss sections (which have the section flag SHF_TLS). The symbols for these variables have type STT_TLS. In GNU as syntax, you can give a symbol named a the type STT_TLS with .type a, @tls_object.
.section .tbss,"awT",@nobits
In assembly produced by Clang and GCC, thread-local variables are annotated as .type a, @object (STT_OBJECT). When the assembler sees that such symbols are defined in SHF_TLS sections or referenced by TLS-specific relocations, STT_OBJECT is upgraded to STT_TLS.
GNU as supports an obscure directive .tls_common which defines STT_TLS SHN_COMMON symbols. The LLVM integrated assembler does not support .tls_common.
Linker behavior
The linker collects SHF_TLS output sections and places them into a PT_TLS program header.
GNU ld treats STT_TLS SHN_COMMON symbols as defined in .tcommon sections. The internal linker script places such sections into the output section .tbss. LLD does not support STT_TLS SHN_COMMON symbols.
Dynamic loader behavior
The dynamic loader collects PT_TLS program headers from the main executable and immediately loaded shared objects (via transitive DT_NEEDED), and allocates static TLS blocks, one block for each PT_TLS.
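To see the PT_TLS program header that the loader works from, a program can walk its own program headers. This is a rough sketch, assuming Linux with glibc or musl, where getauxval(AT_PHDR) and getauxval(AT_PHNUM) describe the main executable's program headers; tls_marker and main_executable_tls_size are hypothetical names, and shared objects are not inspected.

```c
#include <assert.h>
#include <link.h>      /* ElfW; pulls in <elf.h> for PT_TLS */
#include <stddef.h>
#include <sys/auxv.h>  /* getauxval, AT_PHDR, AT_PHNUM */

/* Hypothetical TLS variable so that the executable has something to put in PT_TLS. */
_Thread_local int tls_marker = 42;

/* Returns p_memsz of the main executable's PT_TLS segment, or 0 if there is none. */
size_t main_executable_tls_size(void) {
    const ElfW(Phdr) *phdr = (const ElfW(Phdr) *)getauxval(AT_PHDR);
    unsigned long phnum = getauxval(AT_PHNUM);

    tls_marker++;  /* keep the variable referenced so it is not discarded */
    assert(phdr != NULL);

    for (unsigned long i = 0; i < phnum; i++) {
        if (phdr[i].p_type == PT_TLS)
            return (size_t)phdr[i].p_memsz;  /* size of the static TLS block image */
    }
    return 0;
}
```

p_memsz covers both .tdata and .tbss; p_filesz covers only the initialized part that the loader copies into each thread's block.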
Models
Local exec TLS model (executable & non-preemptible)
This is the most efficient TLS model. It applies when the TLS symbol is defined in the executable.
The compiler picks this model in -fno-pic/-fpie modes if the variable is
- a definition
- or a declaration with a non-default visibility.
The first condition is obvious. The second condition holds because a non-default visibility means the variable must be defined by another translation unit in the executable.
_Thread_local int def;
# x86-64
The offset from the thread pointer to the start of a static block is fixed at program start. The offset from the thread pointer to a TLS symbol is usually referred to as the TP offset. You can usually find "TPREL" or "TPOFF" in the relocation type name.
In addition, for the static TLS block of the main executable, the TP offset of a TLS symbol is a link-time constant. The linker and the dynamic loader share the same formula.
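A consequence of fixed TP offsets can be checked from C. Reading the thread pointer is not portable, so this sketch uses a proxy: if two TLS variables of the same module each sit at a constant offset from the thread pointer, the distance between them must be identical in every thread even though their addresses differ. It assumes pthreads, and the names anchor, measure and tls_layout_is_fixed are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Two hypothetical TLS variables defined in the same module. */
_Thread_local int anchor;
_Thread_local int def = 7;

struct layout {
    uintptr_t def_addr; /* address of the calling thread's copy of def */
    intptr_t distance;  /* byte distance from anchor to def in that thread */
};

/* Records where the calling thread's copies live. */
static void *measure(void *out) {
    struct layout *l = out;
    l->def_addr = (uintptr_t)&def;
    l->distance = (intptr_t)((uintptr_t)&def - (uintptr_t)&anchor);
    return NULL;
}

/* Returns 1 if the worker's copies are distinct objects with an identical layout. */
int tls_layout_is_fixed(void) {
    struct layout mine = {0, 0}, theirs = {0, 0};
    pthread_t t;

    measure(&mine);  /* measure in the calling thread */
    int rc = pthread_create(&t, NULL, measure, &theirs);
    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);

    return mine.def_addr != theirs.def_addr && mine.distance == theirs.distance;
}
```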
Initial exec TLS model (executable & preemptible)
This model is less efficient than local exec. It applies when the TLS symbol is defined in the executable or a shared object available at program start. The shared object can be due to DT_NEEDED or LD_PRELOAD.
The compiler picks this model in -fno-pic/-fpie modes if the variable is a declaration with default visibility. The idea is that a symbol referenced by the executable must be defined by an immediately loaded shared object, rather than by a shared object loaded with dlopen. The linker enforces this as well by defaulting to -z defs for a -no-pie/-pie link.
extern thread_local int ref;
# x86-64
Because the offset from the thread pointer to the start of a static block is fixed at program start, such an offset can be encoded by a GOT relocation. Such relocation types typically have GOT and TPREL/TPOFF in their names.
If you add the __attribute__((tls_model("initial-exec"))) attribute, a thread-local variable can use this model in -fpic mode. If the object file is linked into an executable, everything is fine. If the object file is linked into a shared object, the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to annotate a shared object with initial-exec TLS relocations.
glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small. There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec shared objects, e.g. musl will error.
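For reference, this is roughly what the attribute looks like in source form. It is a minimal sketch using the GCC/Clang tls_model variable attribute; fast_counter and bump_fast_counter are hypothetical names, and whether the resulting shared object can be loaded with dlopen depends on the libc, as described above.

```c
#include <assert.h>

/* Hypothetical counter forced to the initial-exec model, even when compiled with -fpic. */
_Thread_local int fast_counter __attribute__((tls_model("initial-exec")));

/* Increments the calling thread's copy and returns the new value. */
int bump_fast_counter(void) {
    assert(fast_counter >= 0);  /* each thread's copy starts at zero */
    return ++fast_counter;
}
```

The attribute changes only the access sequence the compiler emits; the language-level semantics of the variable are the same as for any other thread-local.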
General dynamic and local dynamic TLS models (DSO)
The two modes are used when the TLS symbol may be defined by a shared object. They do not assume the TLS symbol is backed by a static TLS block. Instead, they assume that the thread-local storage of the module may be dynamically allocated, making the models suitable for dlopen usage. The dynamically allocated TLS storage is usually referred to as dynamic TLS.
Each TLS symbol is assigned a pair of (module ID, offset from dtv[m] to the symbol), which is usually referred to as a tls_index object. The module ID m is assigned by the dynamic loader when the module (the executable or a shared object) is loaded, so it is unknown at link time. dtv means the dynamic thread vector. Each thread has its own dynamic thread vector, which is a mapping from module ID to thread-local storage. dtv[m] points to the storage allocated for the module with the ID m.
In the simplest form, once we have a pointer to the (module ID, offset from dtv[m] to the symbol) pair, we can get the address of the symbol with the following C program:
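A toy model of that lookup is sketched here. It is not the code of any libc: the current thread's dtv is passed explicitly instead of being found through the thread pointer, lazy allocation is ignored, and all names (model_tls_index, model_thread, model_tls_get_addr) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The (module ID, offset from dtv[m] to the symbol) pair. */
typedef struct {
    size_t module;
    size_t offset;
} model_tls_index;

/* Toy per-thread state: dtv[m] points to the storage of module m. */
typedef struct {
    unsigned char **dtv;
    size_t dtv_len;
} model_thread;

/* Address of the symbol described by ti in the thread described by self. */
void *model_tls_get_addr(const model_thread *self, const model_tls_index *ti) {
    assert(ti->module < self->dtv_len);
    return self->dtv[ti->module] + ti->offset;
}

/* Returns 1 if a symbol at offset 8 of module 2 resolves into module 2's block. */
int model_lookup_demo(void) {
    static unsigned char exe_block[16], dso_block[16];
    /* Slot 0 is unused in this model; module 1 is the main executable. */
    unsigned char *dtv[] = { NULL, exe_block, dso_block };
    model_thread self = { dtv, 3 };
    model_tls_index ti = { 2, 8 };
    return model_tls_get_addr(&self, &ti) == (void *)(dso_block + 8);
}
```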
General dynamic TLS model (DSO & preemptible)
The general dynamic TLS model is the most flexible model. It assumes neither the module ID nor the offset from dtv[m] to the symbol is known at link time. The model is used in -fpic mode when the local dynamic TLS model does not apply. The compiler emits code to set up a pointer to the TLSGD entry of the symbol, then arranges for a call to __tls_get_addr. The return value will contain the runtime address of the TLS symbol in the current thread. On x86-64, you will notice that the leaq instruction has a data16 prefix and the call instruction has two data16 (0x66) prefixes and one rex64 prefix. This is a deliberate choice to make the total size of leaq+call 16 bytes, suitable for link-time relaxation.
data16 leaq def@tlsgd(%rip), %rdi
At the linker stage, if the TLS symbol does not satisfy relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSGD relocation, relocated by two dynamic relocations. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word. The relocation types are:
- arm: R_ARM_TLS_DTPMOD32 and R_ARM_TLS_DTPOFF32
- aarch64: R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL (rarely used because TLS descriptors are the default)
- i386: R_386_TLS_DTPMOD32 and R_386_TLS_DTPOFF32
- x86-64: R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64
- mips32: R_MIPS_TLS_DTPMOD32 and R_MIPS_TLS_DTPREL32
- mips64: R_MIPS_TLS_DTPMOD64 and R_MIPS_TLS_DTPREL64
- ppc32: R_PPC_DTPMOD32 and R_PPC_DTPREL32
- ppc64: R_PPC64_DTPMOD64 and R_PPC64_DTPREL64
Local dynamic TLS model (DSO & non-preemptible)
The local dynamic TLS model assumes that the offset from dtv[m] to the symbol is a link-time constant. This case happens when the TLS symbol is non-preemptible. The compiler emits code to set up a pointer to the TLSLD entry of the module, next arranges for a call to __tls_get_addr, then adds a link-time constant to the return value to get the address. I say "the TLSLD entry of the module" because while (on x86-64) def@tlsld looks like the TLSLD entry of the non-preemptible TLS symbol, it can really be shared by other non-preemptible TLS symbols. So one module needs just one such entry.
leaq def@tlsld(%rip), %rdi
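The benefit of sharing one entry can be illustrated with a toy model: a single lookup yields the base of the module's TLS storage, and each non-preemptible symbol is then reached by adding its link-time constant. This is only a sketch with one module and invented names (ld_tls_index, ld_tls_get_addr, DTPOFF_A, DTPOFF_B); the constants stand in for values the linker would fill in.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t module;
    size_t offset;
} ld_tls_index;

static unsigned char ld_module_block[32]; /* stands in for dtv[1] of the current thread */
static int ld_calls;                      /* how many lookups were performed */

/* Toy stand-in for __tls_get_addr in a single-module world. */
static unsigned char *ld_tls_get_addr(const ld_tls_index *ti) {
    assert(ti->module == 1);
    ld_calls++;
    return ld_module_block + ti->offset;
}

/* Hypothetical link-time constants for two non-preemptible TLS symbols. */
enum { DTPOFF_A = 0, DTPOFF_B = 4 };

/* Returns 1 if one lookup was enough to reach both symbols. */
int local_dynamic_demo(void) {
    static const ld_tls_index tlsld = { 1, 0 }; /* the module's single TLSLD entry */
    ld_calls = 0;
    unsigned char *base = ld_tls_get_addr(&tlsld);
    unsigned char *a = base + DTPOFF_A;
    unsigned char *b = base + DTPOFF_B;
    return ld_calls == 1 && a == ld_module_block && b == ld_module_block + 4;
}
```

Under general dynamic, the same two accesses would need two entries and two calls, one per symbol.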
At the linker stage, if the TLS symbol does not satisfy local-dynamic to local-exec relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSLD relocation. The dynamic loader will write the module ID to the first word. The second word is zero, because the entry refers to the start of the module's TLS storage rather than to a particular symbol; the per-symbol offsets are the link-time constants added by the code sequence.
If the architecture does not define TLS relaxations, the linker can still make an optimization: in -no-pie/-pie modes, set the first word to 1 (main executable) and omit the dynamic relocation for the module ID.
TLS descriptors
Some architectures have defined TLS descriptors as more efficient alternatives to the traditional general dynamic and local dynamic TLS models. Such ABIs repurpose the first word of the (module ID, offset from dtv[m] to the symbol) pair to represent a function pointer. The function pointer points to a very simple function in the static TLS case and a function similar to __tls_get_addr in the dynamic TLS case. The caller does an indirect function call instead of calling __tls_get_addr. There are two main points:
- The function call to __tls_get_addr uses the regular calling convention: the compiler has to make the pessimistic assumption that all volatile registers may be clobbered by __tls_get_addr.
- In glibc, __tls_get_addr is very complex. If the TLS of the module is backed by a static TLS block, the dynamic loader can simply place the TP offset into the second word and let the function pointer point to a function which simply returns the second word.
The first point is the main reason that TLS descriptors are generally more efficient. Arguably the traditional general dynamic and local dynamic TLS models could have a mechanism to use a custom calling convention for __tls_get_addr as well.
In musl, in the static TLS case, the two words will be set to ((size_t)__tlsdesc_static, tpoff) where __tlsdesc_static is a function which returns the second word. glibc's static TLS case is similar.
.globl __tlsdesc_static
The scheme optimizes for static TLS but penalizes the case that requires dynamic TLS. Remember that we have just two words in the GOT and by changing the first word to a function pointer, we have lost information about the module ID. To retain the information, the dynamic loader has to set the second word to a pointer to a (module ID, offset) pair allocated by malloc.
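Both cases can be modeled in plain C. This is a simplified sketch, not musl or glibc code: real descriptor functions are written in assembly, use a custom calling convention and find the thread state implicitly, whereas here the thread is passed as an argument. All names (toy_thread, toy_tlsdesc, toy_desc_static, toy_desc_dynamic) are invented, and a static object stands in for the pair that the loader would allocate with malloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy thread state: tp is the thread pointer, dtv[m] the storage of module m. */
typedef struct {
    unsigned char *tp;
    unsigned char **dtv;
} toy_thread;

/* The (module ID, offset) pair kept out of line in the dynamic TLS case. */
typedef struct {
    size_t module;
    size_t offset;
} toy_mod_off;

/* Two-word descriptor: a function pointer plus one argument word. */
typedef struct toy_tlsdesc toy_tlsdesc;
struct toy_tlsdesc {
    ptrdiff_t (*entry)(const toy_tlsdesc *, const toy_thread *);
    uintptr_t arg;
};

/* Static TLS: the second word already is the TP offset. */
static ptrdiff_t toy_desc_static(const toy_tlsdesc *d, const toy_thread *t) {
    (void)t;
    return (ptrdiff_t)d->arg;
}

/* Dynamic TLS: the second word points to a (module ID, offset) pair. */
static ptrdiff_t toy_desc_dynamic(const toy_tlsdesc *d, const toy_thread *t) {
    const toy_mod_off *mo = (const toy_mod_off *)d->arg;
    return (t->dtv[mo->module] + mo->offset) - t->tp;
}

/* Caller side: indirect call through the first word, then add the thread pointer. */
static unsigned char *toy_resolve(const toy_tlsdesc *d, const toy_thread *t) {
    return t->tp + d->entry(d, t);
}

/* Returns 1 if both kinds of descriptor resolve to the expected addresses. */
int toy_tlsdesc_demo(void) {
    static unsigned char arena[64];  /* one array keeps the pointer arithmetic well defined */
    static const toy_mod_off pair = { 2, 8 };
    /* Module 1 is backed by static TLS at tp; module 2 lives elsewhere. */
    unsigned char *dtv[] = { NULL, arena + 32, arena };
    toy_thread self = { arena + 32, dtv };

    toy_tlsdesc in_static = { toy_desc_static, 4 };
    toy_tlsdesc in_dynamic = { toy_desc_dynamic, (uintptr_t)&pair };

    assert(in_static.entry != in_dynamic.entry);
    return toy_resolve(&in_static, &self) == arena + 36 &&
           toy_resolve(&in_dynamic, &self) == arena + 8;
}
```

The static path touches only the descriptor itself, while the dynamic path needs an extra load through the out-of-line pair, which is the penalty described above.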
aarch64 defaults to TLS descriptors. On arm, i386 and x86-64, you can select TLS descriptors via the GCC option -mtls-dialect=gnu2.