Thread-local storage (TLS) provides a mechanism for allocating separate objects for different threads. It is the usual implementation for C11 _Thread_local, C++11 thread_local, and the vendor extension __thread on ELF platforms.
An example is POSIX errno:
Each thread has its own thread ID, scheduling priority and policy, errno value, floating point environment, thread-specific key/value bindings, and the required system resources to support a flow of control.
Different threads have different errno copies. errno is typically defined as a macro that dereferences the pointer returned by a function; that pointer refers to a thread-local variable.
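As a rough illustration of that pattern, here is a minimal C sketch of an errno-like object. The names my_errno, my_errno_location and my_errno_roundtrip are hypothetical; they only mirror the general shape described above and are not taken from any libc.

```c
#include <assert.h>

/* Hypothetical storage; a real libc keeps this in its own internals. */
static _Thread_local int my_errno_storage;

/* Returns the address of the calling thread's copy. */
int *my_errno_location(void) { return &my_errno_storage; }

/* The macro makes the per-thread object look like an ordinary lvalue. */
#define my_errno (*my_errno_location())

/* Stores a value through the macro and reads it back. */
int my_errno_roundtrip(int value) {
  my_errno = value;
  assert(&my_errno == my_errno_location());
  return my_errno;
}
```

Because every access goes through the function call, each thread that evaluates my_errno reaches its own copy of the storage.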
Representation
Assembler behavior
The compiler usually defines thread-local variables in .tdata and .tbss sections (which have the section flag SHF_TLS). The symbols have type STT_TLS, which represents a thread-local storage entity. In GNU as syntax, you can give the symbol a the type STT_TLS with .type a, @tls_object. The st_value of a TLS symbol is the offset relative to the defining section.
.section .tbss,"awT",@nobits
In this example, st_value(a)=0 while st_value(b)=4.
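The stated values are consistent with two 4-byte, 4-byte-aligned objects a and b placed back to back in .tbss. That is an assumption on my part, since the listing does not show the definitions. Under that assumption, the section-relative offsets can be reproduced with a small layout routine. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one object placed in a TLS section. */
struct tls_object {
  size_t size;
  size_t align;  /* must be a power of two */
  size_t offset; /* filled in: offset relative to the section start */
};

/* Lays objects out in order and records each section-relative offset,
   which is what st_value holds for a TLS symbol in an object file.
   Returns the total section size. */
size_t layout_tls_section(struct tls_object *objs, size_t n) {
  size_t pos = 0;
  for (size_t i = 0; i < n; i++) {
    size_t align = objs[i].align;
    assert(align != 0 && (align & (align - 1)) == 0);
    pos = (pos + align - 1) & ~(align - 1); /* round up to alignment */
    objs[i].offset = pos;
    pos += objs[i].size;
  }
  return pos;
}
```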
In assembly produced by Clang and GCC, thread-local variables are annotated as .type a, @object (STT_OBJECT). When the assembler sees that such symbols are defined in SHF_TLS sections or referenced by TLS-specific relocations, STT_OBJECT will be upgraded to STT_TLS.
GNU as supports an obscure directive, .tls_common, which defines STT_TLS SHN_COMMON symbols. The LLVM integrated assembler does not support .tls_common.
Linker behavior
The linker combines .tdata input sections into a .tdata output section. .tbss input sections are combined into a .tbss output section. The two SHF_TLS output sections are placed into a PT_TLS program header. Its p_memsz field holds the size of thread-local storage. The PT_TLS program header is contained in a PT_LOAD program header. Conceptually, PT_TLS and STT_TLS symbols live in a separate address space.
In executable and shared object files, st_value normally holds a virtual address. For an STT_TLS symbol, st_value holds an offset relative to the virtual address of the PT_TLS program header.
GNU ld treats STT_TLS SHN_COMMON symbols as defined in .tcommon sections. The internal linker script places such sections into the output section .tdata. LLD does not support STT_TLS SHN_COMMON symbols.
Dynamic loader behavior
The dynamic loader collects PT_TLS program headers from the main executable and immediately loaded shared objects (via transitive DT_NEEDED), and allocates static TLS blocks, one block for each PT_TLS.
The p_vaddr field of a PT_TLS program header holds the virtual address of the TLS initialization image. The first byte corresponds to the TLS symbol with st_value==0 in that module. The dynamic loader copies the initial content [p_vaddr,p_vaddr+p_filesz) to the corresponding static TLS block.
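The copy step can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name is hypothetical, image stands for the bytes at [p_vaddr,p_vaddr+p_filesz), and the zero fill of the remaining p_memsz-p_filesz bytes reflects the usual treatment of the .tbss part rather than anything spelled out above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Initializes one static TLS block of p_memsz bytes from the PT_TLS
   initialization image. Hypothetical helper, not a real loader API. */
void init_static_tls_block(unsigned char *block, const unsigned char *image,
                           size_t p_filesz, size_t p_memsz) {
  assert(p_filesz <= p_memsz);
  memcpy(block, image, p_filesz);                  /* initialized data (.tdata) */
  memset(block + p_filesz, 0, p_memsz - p_filesz); /* zero-filled tail (.tbss) */
}
```

A loader would run something like this once per thread and per module, so every thread starts with a fresh copy of the initial values.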
Models
Local exec TLS model (executable & non-preemptible)
This is the most efficient TLS model. It applies when the TLS symbol is defined in the executable.
The compiler picks this model in -fno-pic/-fpie modes if the variable is
- a definition
- or a declaration with a non-default visibility.
The first condition is obvious. The second condition holds because a non-default visibility means the variable must be defined by another translation unit in the executable.
_Thread_local int def;
# x86-64
The offset from the thread pointer to the start of a static block is fixed at program start. The offset from the thread pointer to a TLS symbol is usually referred to as the TP offset. You can usually find "TPREL" or "TPOFF" in the relocation type name.
In addition, for the static TLS block of the main executable, the TP offset of a TLS symbol is a link-time constant. The linker and the dynamic loader share the same formula. Here is a list of common relocation types:
- arm: R_ARM_TLS_LE32
- aarch64: R_AARCH64_TLSLE_ADD_TPREL_HI12, R_AARCH64_TLSLE_ADD_TPREL_LO12_NC
- i386: R_386_TLS_LE
- x86-64: R_X86_64_TPOFF32
- mips: R_MIPS_TPREL_HI16, R_MIPS_TPREL_LO16
- ppc32: R_PPC_TPREL_HA, R_PPC_TPREL_LO
- ppc64: R_PPC64_TPREL_HA, R_PPC64_TPREL_LO
- riscv: R_RISCV_TPREL_HI20, R_RISCV_TPREL_LO12_I, R_RISCV_TPREL_LO12_S
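To make the "link-time constant" concrete, here is a hedged sketch of the computation for one particular layout. It assumes a "variant II" arrangement in which the executable's static TLS block ends at the thread pointer (as on x86-64) and p_vaddr is a multiple of p_align. Other architectures place the block above the thread pointer and use a different formula. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds v up to a multiple of align (a power of two). */
static uint64_t align_up(uint64_t v, uint64_t align) {
  return (v + align - 1) & ~(align - 1);
}

/* TP offset of a TLS symbol in the main executable, assuming the static
   TLS block ends at the thread pointer and p_vaddr is p_align-aligned.
   st_value is the symbol's offset within PT_TLS. */
int64_t tp_offset_variant2(uint64_t st_value, uint64_t p_memsz,
                           uint64_t p_align) {
  assert(p_align != 0 && (p_align & (p_align - 1)) == 0);
  assert(st_value <= p_memsz);
  return (int64_t)st_value - (int64_t)align_up(p_memsz, p_align);
}
```

With p_memsz=8 and p_align=4, a symbol with st_value=0 gets TP offset -8 and one with st_value=4 gets -4, so both sit just below the thread pointer.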
For RISC architectures, because an instruction typically has 4 bytes and cannot encode a 32-bit offset, it usually takes two instructions to materialize the TP offset.
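The split into two instructions can be sketched as follows. I use the RISC-V style hi20/lo12 convention as an assumption, where the low part is a signed 12-bit immediate and the high part is rounded to compensate. Other architectures split the value differently. The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct hi_lo {
  int32_t hi20; /* immediate for the upper-bits instruction */
  int32_t lo12; /* signed 12-bit immediate for the second instruction */
};

/* Splits a 32-bit offset so that (hi20 << 12) + lo12 equals the value
   modulo 2^32. Adding 0x800 before shifting compensates for the sign
   extension of the low 12 bits. */
struct hi_lo split_hi20_lo12(int32_t value) {
  struct hi_lo r;
  uint32_t u = (uint32_t)value;
  r.hi20 = (int32_t)((u + 0x800u) >> 12);
  r.lo12 = (int32_t)(u & 0xfffu);
  if (r.lo12 >= 0x800)
    r.lo12 -= 0x1000; /* sign-extend the 12-bit field */
  assert(((uint32_t)r.hi20 << 12) + (uint32_t)r.lo12 == u);
  return r;
}
```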
Initial exec TLS model (executable & preemptible)
This model is less efficient than local exec. It applies when the TLS symbol is defined in the executable or a shared object available at program start. The shared object can be due to DT_NEEDED or LD_PRELOAD.
The compiler picks this model in -fno-pic/-fpie modes if the variable is a declaration with default visibility. The idea is that a symbol referenced by the executable must be defined by an immediately loaded shared object, instead of a dlopen loaded shared object. The linker enforces this as well by defaulting to -z defs for a -no-pie/-pie link.
extern thread_local int ref;
# x86-64
Because the offset from the thread pointer to the start of a static block is fixed at program start, such an offset can be encoded by a GOT relocation. Such relocation types typically have GOT and TPREL/TPOFF in their names. Here is a list of common relocation types:
- arm: R_ARM_TLS_IE32
- aarch64: R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21, R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC
- i386: R_386_TLS_IE
- x86-64: R_X86_64_GOTTPOFF
- ppc32: R_PPC_GOT_TPREL16
- ppc64: R_PPC64_GOT_TPREL16_HA, R_PPC64_GOT_TPREL16_LO_DS
- riscv: R_RISCV_TLS_GOT_HI20, R_RISCV_PCREL_LO12_I
If the TLS symbol does not satisfy initial-exec to local-exec relaxation requirements, the linker will allocate a GOT entry and emit a dynamic relocation. Here is a list of dynamic relocation types:
- arm: R_ARM_TLS_TPOFF32
- aarch64: R_AARCH64_TLS_TPREL64
- mips32: R_MIPS_TLS_TPREL32
- mips64: R_MIPS_TLS_TPREL64
- i386: R_386_TPOFF
- x86-64: R_X86_64_TPOFF64
- ppc32: R_PPC_TPREL32
- ppc64: R_PPC64_TPREL64
- riscv: R_RISCV_TLS_TPREL64
While they have TPREL or TPOFF in their names, these dynamic relocations have the same bitwidth as the word size. This is a good way to distinguish them from the local-exec relocation types used in object files.
If you add the __attribute((tls_model("initial-exec"))) attribute, a thread-local variable can use this model in -fpic mode. If the object file is linked into an executable, everything is fine. If the object file is linked into a shared object, the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to annotate a shared object with initial-exec TLS relocations.
glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small. There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec shared objects, e.g. musl will error.
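For reference, this is roughly what opting a variable into the initial-exec model looks like in C with GCC or Clang. The variable and function names are hypothetical, and whether the resulting shared object can be dlopen'ed still depends on the libc, as described above.

```c
#include <assert.h>

/* Requests the initial-exec model even when this file is built with -fpic.
   ie_counter is a hypothetical variable used only for illustration. */
static _Thread_local int ie_counter
    __attribute__((tls_model("initial-exec")));

/* Each thread increments and returns its own copy of the counter. */
int bump_ie_counter(void) {
  ie_counter += 1;
  assert(ie_counter > 0);
  return ie_counter;
}
```

Compiling this with -fpic and inspecting the object file's relocations should show an initial-exec style relocation for ie_counter instead of a general dynamic one.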
General dynamic and local dynamic TLS models (DSO)
The two modes are used when the TLS symbol may be defined by a shared object. They do not assume the TLS symbol is backed by a static TLS block. Instead, they assume that the thread-local storage of the module may be dynamically allocated, making the models suitable for dlopen usage. The dynamically allocated TLS storage is usually referred to as dynamic TLS.
Each TLS symbol is assigned a pair of (module ID, offset from dtv[m] to the symbol), which is usually referred to as a tls_index object. The module ID m is assigned by the dynamic loader when the module (the executable or a shared object) is loaded, so it is unknown at link time. dtv means the dynamic thread vector. Each thread has its own dynamic thread vector, which is a mapping from module ID to thread-local storage. dtv[m] points to the storage allocated for the module with the ID m.
In the simplest form, once we have a pointer to the (module ID, offset from dtv[m] to the symbol) pair, we can get the address of the symbol by adding the offset to dtv[module ID].
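A minimal C sketch of that lookup is shown below. It is deliberately simplified: the dynamic thread vector is passed in as a parameter instead of being found through the thread pointer, the storage is assumed to be allocated already, and the struct and function names are hypothetical stand-ins for tls_index and __tls_get_addr.

```c
#include <assert.h>
#include <stddef.h>

/* The (module ID, offset) pair; a stand-in for a tls_index object. */
struct tls_index_sketch {
  size_t module;
  size_t offset;
};

/* Simplified lookup: dtv maps a module ID to that module's TLS storage
   for the current thread. A real implementation locates dtv via the
   thread pointer and may have to allocate the storage first. */
void *tls_get_addr_sketch(unsigned char *const *dtv,
                          const struct tls_index_sketch *ti) {
  assert(dtv[ti->module] != NULL);
  return dtv[ti->module] + ti->offset;
}
```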
General dynamic TLS model (DSO & preemptible)
The general dynamic TLS model is the most flexible model. It assumes neither the module ID nor the offset from dtv[m] to the symbol is known at link time. The model is used in -fpic mode when the local dynamic TLS model does not apply. The compiler emits code to set up a pointer to the TLSGD entry of the symbol, then arranges for a call to __tls_get_addr. The return value will contain the runtime address of the TLS symbol in the current thread. On x86-64, you will notice that the leaq instruction has a data16 prefix and the call instruction has two data16 (0x66) prefixes and one rex64 prefix. This is a deliberate choice to make the total size of leaq+call 16 bytes, suitable for link-time relaxation.
data16 leaq def@tlsgd(%rip), %rdi
At the linker stage, if the TLS symbol does not satisfy relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSGD relocation, relocated by two dynamic relocations. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word. The relocation types are:
- arm: R_ARM_TLS_DTPMOD32 and R_ARM_TLS_DTPOFF32
- aarch64: R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL (rarely used because TLS descriptors are the default)
- i386: R_386_TLS_DTPMOD32 and R_386_TLS_DTPOFF32
- x86-64: R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64
- mips32: R_MIPS_TLS_DTPMOD32 and R_MIPS_TLS_DTPOFF32
- mips64: R_MIPS_TLS_DTPMOD64 and R_MIPS_TLS_DTPOFF64
- ppc32: R_PPC_DTPMOD32 and R_PPC_DTPREL32
- ppc64: R_PPC64_DTPMOD64 and R_PPC64_DTPREL64
- riscv32: R_RISCV_TLS_DTPMOD32 and R_RISCV_TLS_DTPREL32
- riscv64: R_RISCV_TLS_DTPMOD64 and R_RISCV_TLS_DTPREL64
Local dynamic TLS model (DSO & non-preemptible)
The local dynamic TLS model assumes that the offset from dtv[m] to the symbol is a link-time constant. This case happens when the TLS symbol is non-preemptible. The compiler emits code to set up a pointer to the TLSLD entry of the module, then arranges for a call to __tls_get_addr, and finally adds a link-time constant to the return value to get the address.
leaq def@tlsld(%rip), %rdi
I say "the TLSLD entry of the module" because while (on x86-64) def@tlsld looks like the TLSLD entry of the non-preemptible TLS symbol, it can actually be shared by other non-preemptible TLS symbols. So one module needs just one such entry. Technically we can just use general dynamic relocation types to represent the local dynamic TLS model. For example, GCC riscv does this:
la.tls.gd a0, .LANCHOR0
This is clever. However, I would prefer dedicated local-dynamic relocation types. If we perform a relocatable link merging this object file with another (with its own local symbol .LANCHOR0), the local symbols .LANCHOR0 are separate and their GOT entries cannot be shared. Architectures that have dedicated local-dynamic relocation types can share the GOT entries.
At the linker stage, if the TLS symbol does not satisfy local-dynamic to local-exec relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSLD relocation. The dynamic loader will write the module ID to the first word, while the second word stays zero: the offset from dtv[m] to each symbol is the link-time constant that the code adds after the call.
If the architecture does not define TLS relaxations, the linker can still make an optimization: in -no-pie/-pie modes, set the first word to 1 (main executable) and omit the dynamic relocation for the module ID.
TLS descriptors
Some architectures have defined TLS descriptors as more efficient alternatives to the traditional general dynamic and local dynamic TLS models. Such ABIs repurpose the first word of the (module ID, offset from dtv[m] to the symbol) pair to represent a function pointer. The function pointer points to a very simple function in the static TLS case and a function similar to __tls_get_addr in the dynamic TLS case. The caller does an indirect function call instead of calling __tls_get_addr. There are two main points:
- The function call to __tls_get_addr uses the regular calling convention: the compiler has to make the pessimistic assumption that all volatile registers may be clobbered by __tls_get_addr.
- In glibc (which does lazy TLS allocation), __tls_get_addr is very complex. If the TLS of the module is backed by a static TLS block, the dynamic loader can simply place the TP offset into the second word and let the function pointer point to a function which simply returns the second word.
The first point is the main reason that TLS descriptors are generally more efficient. Arguably, the traditional general dynamic and local dynamic TLS models could have a mechanism to use a custom calling convention for __tls_get_addr as well.
In musl, in the static TLS case, the two words will be set to ((size_t)__tlsdesc_static, tpoff) where __tlsdesc_static is a function which returns the second word. glibc's static TLS case is similar.
.globl __tlsdesc_static
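The static case can be modelled in C as a rough sketch. Real resolvers are written in assembly with a custom calling convention that preserves almost all registers; the ordinary C call here loses exactly that benefit, so treat this purely as a picture of the data flow. All names are hypothetical, and the thread pointer is passed in explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two-word GOT slot used by a TLS descriptor (hypothetical layout). */
struct tlsdesc_sketch {
  ptrdiff_t (*entry)(const struct tlsdesc_sketch *); /* first word */
  uintptr_t arg;                                      /* second word */
};

/* Static TLS case: the second word already holds the TP offset. */
ptrdiff_t tlsdesc_static_sketch(const struct tlsdesc_sketch *desc) {
  return (ptrdiff_t)desc->arg;
}

/* The caller makes an indirect call through the first word and adds the
   returned offset to the thread pointer. */
void *tlsdesc_resolve(unsigned char *tp, const struct tlsdesc_sketch *desc) {
  assert(desc->entry != NULL);
  return tp + desc->entry(desc);
}
```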
The scheme optimizes for static TLS but penalizes the case that requires dynamic TLS. Remember that we have just two words in the GOT and by changing the first word to a function pointer, we have lost information about the module ID. To retain the information, the dynamic loader has to set the second word to a pointer to a (module ID, offset) pair allocated by malloc.
aarch64 defaults to TLS descriptors. On arm, i386 and x86-64, you can select TLS descriptors via GCC -mtls-dialect=gnu2.
(I implemented TLS descriptors and relaxations in LLD's x86-64 port.)
TLS relaxations
The code sequences have fixed forms and are annotated with appropriate relocations. So the linker understands the compiler's intention and can perform 4 kinds of code sequence modification as optimizations. They are:
- general-dynamic/TLSDESC to local-exec relaxation: -no-pie/-pie && non-preemptible
- general-dynamic/TLSDESC to initial-exec relaxation: -no-pie/-pie && preemptible
- local-dynamic to local-exec relaxation: -no-pie/-pie (the symbol must be non-preemptible, otherwise it is an error to use local-dynamic)
- initial-exec to local-exec relaxation: -no-pie/-pie && non-preemptible
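The four rules above can be written down as a small decision function. This is only a sketch of the conditions as listed, with TLSDESC folded into general dynamic. It does not model per-architecture restrictions, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum tls_model { TLS_GD, TLS_LD, TLS_IE, TLS_LE };

/* Returns the model a code sequence ends up with after relaxation.
   linking_executable corresponds to a -no-pie/-pie link. */
enum tls_model relax_tls(enum tls_model from, bool linking_executable,
                         bool preemptible) {
  if (!linking_executable)
    return from; /* every listed relaxation requires -no-pie/-pie */
  switch (from) {
  case TLS_GD: /* also covers TLSDESC */
    return preemptible ? TLS_IE : TLS_LE;
  case TLS_LD:
    assert(!preemptible); /* local dynamic on a preemptible symbol is an error */
    return TLS_LE;
  case TLS_IE:
    return preemptible ? TLS_IE : TLS_LE;
  case TLS_LE:
    return TLS_LE;
  }
  return from;
}
```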
Async-signal-safe TLS
TLS can be accessed by signal handlers (think of CPU and memory profilers), hence TLS accesses need to be async-signal-safe. Local-exec and initial-exec TLS models trivially satisfy the requirement since static TLS blocks are fixed at program start.
For a dlopen'ed shared object which does not use initial-exec TLS model (I've said above the dynamic loader has limited support for initial-exec), there are two cases.
- The dynamic loader allocates sufficient storage for all currently running threads at dlopen time. This is musl's choice. At dlopen time, the dynamic loader needs to block signal delivery, take a thread list lock and install a new dynamic thread vector for each thread.
- Lazy TLS allocation. TLS allocation is done the first time __tls_get_addr is called. This is the choice of glibc and many other libc implementations. The allocation is typically done by malloc, which is not async-signal-safe.
Lazy TLS allocation has the nice property that it does not penalize threads which do not need to access the TLS of the new shared object. However, it is difficult to make __tls_get_addr async-signal-safe.
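To see where the difficulty comes from, consider this toy version of lazy allocation. It is a sketch under simplifying assumptions: the dynamic thread vector is a plain array passed in by the caller, there is no initialization image, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Lazily allocates the storage for a module on first access. The calloc
   call is the step that is unsafe inside a signal handler: the handler
   may have interrupted the allocator itself. */
void *lazy_tls_block(void **dtv, size_t module, size_t size) {
  if (dtv[module] == NULL) {
    dtv[module] = calloc(1, size);
    assert(dtv[module] != NULL);
  }
  return dtv[module];
}
```

A thread that never touches the module never pays for the allocation, which is the appeal of the lazy scheme. The price is that the first access from any context, including a signal handler, may enter the allocator.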
Google has attempted to upstream patches to make dynamic TLS async-signal-safe. It has been concluded that making signal-safe allocators is a dead end; see TLS, 2015 edition. If a dlopen implementing eager TLS allocation is developed, conceivably it may need a new symbol version because there can be programs desiring lazy TLS allocation.