All about thread local storage

Thread local storage provides a mechanism for allocating separate objects for different threads. It is the usual implementation for C11 _Thread_local, C++11 thread_local and the vendor extension __thread on ELF platforms.

An example is POSIX errno:

Each thread has its own thread ID, scheduling priority and policy, errno value, floating point environment, thread-specific key/value bindings, and the required system resources to support a flow of control.

Different threads have different errno copies. errno is typically defined as a function which returns a thread local variable.
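
A quick way to observe this is to compare the address of errno in two threads. The following is a minimal sketch, assuming a POSIX system with pthreads; the function names are made up for illustration.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Runs in the worker thread: record where this thread's errno lives. */
static void *record_errno_address(void *out) {
  *(uintptr_t *)out = (uintptr_t)&errno;
  return NULL;
}

/* Returns 1 if the worker saw a different errno object than the caller,
   0 if the addresses matched, and -1 if the thread could not be created. */
int errno_is_per_thread(void) {
  pthread_t worker;
  uintptr_t worker_addr = 0;

  if (pthread_create(&worker, NULL, record_errno_address, &worker_addr) != 0)
    return -1;
  pthread_join(worker, NULL);
  return worker_addr != 0 && worker_addr != (uintptr_t)&errno;
}
```

Taking &errno works because errno expands to a modifiable lvalue, typically a dereference of the pointer returned by a libc-internal function.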

Representation

Assembler behavior

The compiler usually defines thread-local variables in .tdata and .tbss sections (which have the section flag SHF_TLS). The symbols have type STT_TLS, which represents a thread-local storage entity. In GNU as syntax, you can give a symbol a the type STT_TLS with .type a, @tls_object. The st_value of a TLS symbol is the offset relative to the defining section.

.section .tbss,"awT",@nobits
.globl a, b
.type a, @tls_object
.type b, @tls_object
a:
  .zero 4
  .size a, .-a
b:
  .zero 4
  .size b, .-b

In this example, st_value(a)=0 while st_value(b)=4.

In Clang and GCC produced assembly, thread-local variables are annotated as .type a, @object (STT_OBJECT). When the assembler sees that such symbols are defined in SHF_TLS sections or referenced by TLS specific relocations, STT_OBJECT will be upgraded to STT_TLS.

GNU as supports an obscure directive .tls_common which defines STT_TLS SHN_COMMON symbols. LLVM integrated assembler does not support .tls_common.

Linker behavior

The linker combines .tdata input sections into a .tdata output section. .tbss input sections are combined into a .tbss output section. The two SHF_TLS output sections are placed into a PT_TLS program header. Its p_memsz field holds the size of thread-local storage. The PT_TLS program header is contained in a PT_LOAD program header. Conceptually PT_TLS and STT_TLS symbols behave as if they were in a separate address space.

In executable and shared object files, st_value normally holds a virtual address. For a STT_TLS symbol, st_value holds an offset relative to the virtual address of the PT_TLS program header.

GNU ld treats STT_TLS SHN_COMMON symbols as defined in .tcommon sections. The internal linker script places such sections into the output section .tdata. LLD does not support STT_TLS SHN_COMMON symbols.

Dynamic loader behavior

The dynamic loader collects PT_TLS program headers from the main executable and immediately loaded shared objects (via transitive DT_NEEDED), and allocates static TLS blocks, one block for each PT_TLS.

The p_vaddr field of a PT_TLS program header holds the virtual address of the initial TLS image. The first byte of the image corresponds to the TLS symbol with st_value==0 in that module. The dynamic loader copies the initial content [p_vaddr,p_vaddr+p_filesz) to the corresponding static TLS block.
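
The initialization of one thread's block can be sketched as follows. This is a simplified model, not code from any libc: struct tls_segment and init_tls_block are hypothetical names, and the zero fill of the remaining p_memsz - p_filesz bytes reflects that .tbss occupies no space in the file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of the PT_TLS fields used here. */
struct tls_segment {
  const unsigned char *image; /* initial content, p_filesz bytes (.tdata) */
  size_t filesz;              /* p_filesz */
  size_t memsz;               /* p_memsz, which also covers .tbss */
};

/* Initialize one thread's TLS block: copy .tdata, then zero-fill .tbss.
   The caller must provide at least seg->memsz bytes at block. */
void init_tls_block(unsigned char *block, const struct tls_segment *seg) {
  assert(seg->filesz <= seg->memsz);
  memcpy(block, seg->image, seg->filesz);
  memset(block + seg->filesz, 0, seg->memsz - seg->filesz);
}
```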

Models

Local exec TLS model (executable & non-preemptible)

This is the most efficient TLS model. It applies when the TLS symbol is defined in the executable.

The compiler picks this model in -fno-pic/-fpie modes if the variable is

a definition
or a declaration with a non-default visibility.

The first condition is obvious. The second condition is because a non-default visibility means the variable must be defined by another translation unit in the executable.

_Thread_local int def;
__attribute__((visibility("hidden"))) extern thread_local int ref;
int foo() { return def + ref; }

# x86-64
movl %fs:def@TPOFF, %eax

The offset from the thread pointer to the start of a static block is fixed at program start. The offset from the thread pointer to a TLS symbol is usually referred to as the TP offset. You can usually find "TPREL" or "TPOFF" in the relocation type name.

In addition, for the static TLS block of the main executable, the TP offset of a TLS symbol is a link-time constant. The linker and the dynamic loader share the same formula. Here is a list of common relocation types:

arm: R_ARM_TLS_LE32
aarch64: R_AARCH64_TLSLE_ADD_TPREL_HI12, R_AARCH64_TLSLE_ADD_TPREL_LO12_NC
i386: R_386_TLS_LE
x86-64: R_X86_64_TPOFF32
mips: R_MIPS_TPREL_HI16, R_MIPS_TPREL_LO16
ppc32: R_PPC_TPREL_HA, R_PPC_TPREL_LO
ppc64: R_PPC64_TPREL_HA, R_PPC64_TPREL_LO
riscv: R_RISCV_TPREL_HI20, R_RISCV_TPREL_LO12_I, R_RISCV_TPREL_LO12_S
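
As an illustration of the link-time computation, here is a sketch of the x86-64 case, where static TLS blocks are placed below the thread pointer. It assumes that p_vaddr of PT_TLS is a multiple of p_align; the function names are mine and do not come from any linker.

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to a multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t value, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (value + align - 1) & ~(align - 1);
}

/* TP offset of a symbol in the executable's static TLS block on x86-64,
   assuming p_vaddr of PT_TLS is a multiple of p_align. The block ends at
   the thread pointer, so the offset is negative. */
int64_t x86_64_tpoff(uint64_t st_value, uint64_t p_memsz, uint64_t p_align) {
  return (int64_t)st_value - (int64_t)align_up(p_memsz, p_align);
}
```

With the .tbss example from the assembler section (st_value(a)=0, st_value(b)=4, p_memsz=8, and assuming p_align=4), this gives -8 for a and -4 for b.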

For RISC architectures, because an instruction typically has 4 bytes and cannot encode a 32-bit offset, it usually takes two instructions to materialize the TP offset.
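
For RISC-V, the split into a high 20-bit part and a low 12-bit part can be sketched like this. The low part is sign-extended by the instruction that consumes it, so the high part is rounded by adding 0x800 first. The function names are hypothetical, and the code assumes the usual two's complement conversion behavior of GCC and Clang.

```c
#include <stdint.h>

/* Upper 20 bits, rounded so that adding the sign-extended low 12 bits
   gives back the original offset. */
uint32_t tprel_hi20(int32_t offset) {
  return (((uint32_t)offset + 0x800u) >> 12) & 0xfffffu;
}

/* Low 12 bits, interpreted as a signed immediate in [-2048, 2047]. */
int32_t tprel_lo12(int32_t offset) {
  int32_t lo = (int32_t)((uint32_t)offset & 0xfffu);
  return lo >= 0x800 ? lo - 0x1000 : lo;
}

/* What the two-instruction sequence computes: (hi20 << 12) + lo12. */
int32_t tprel_combine(uint32_t hi20, int32_t lo12) {
  return (int32_t)((hi20 << 12) + (uint32_t)lo12);
}
```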

Initial exec TLS model (executable & preemptible)

This model is less efficient than local exec. It applies when the TLS symbol is defined in the executable or a shared object available at program start. The shared object can be due to DT_NEEDED or LD_PRELOAD.

The compiler picks this model in -fno-pic/-fpie modes if the variable is a declaration with default visibility. The idea is that a symbol referenced by the executable must be defined by an immediately loaded shared object, instead of a dlopen loaded shared object. The linker enforces this as well by defaulting to -z defs for a -no-pie/-pie link.

extern thread_local int ref;
int foo() { return ref; }

# x86-64
movq ref@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax

Because the offset from the thread pointer to the start of a static block is fixed at program start, such an offset can be encoded by a GOT relocation. Such relocation types typically have GOT and TPREL/TPOFF in their names. Here is a list of common relocation types:

arm: R_ARM_TLS_IE32
aarch64: R_AARCH64_TLSIE_ADR_GOTTPREL_PAGE21, R_AARCH64_TLSIE_LD64_GOTTPREL_LO12_NC
i386: R_386_TLS_IE
x86-64: R_X86_64_GOTTPOFF
ppc32: R_PPC_GOT_TPREL16
ppc64: R_PPC64_GOT_TPREL16_HA, R_PPC64_GOT_TPREL16_LO_DS
riscv: R_RISCV_TLS_GOT_HI20, R_RISCV_PCREL_LO12_I

If the TLS symbol does not satisfy initial-exec to local-exec relaxation requirements, the linker will allocate a GOT entry and emit a dynamic relocation. Here is a list of dynamic relocation types:

arm: R_ARM_TLS_TPOFF32
aarch64: R_AARCH64_TLS_TPREL64
mips32: R_MIPS_TLS_TPREL32
mips64: R_MIPS_TLS_TPREL64
i386: R_386_TPOFF
x86-64: R_X86_64_TPOFF64
ppc32: R_PPC_TPREL32
ppc64: R_PPC64_TPREL64
riscv: R_RISCV_TLS_TPREL64

While they have TPREL or TPOFF in their names, these dynamic relocations have the same bitwidth as the word size. This is a good way to distinguish them from the local-exec relocation types used in object files.

If you add the __attribute((tls_model("initial-exec"))) attribute, a thread-local variable can use this model in -fpic mode. If the object file is linked into an executable, everything is fine. If the object file is linked into a shared object, the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to annotate a shared object with initial-exec TLS relocations.

glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small. There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec shared objects, e.g. musl will error.
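
For reference, this is how the attribute is spelled in GCC and Clang. It is a minimal sketch; request_count and bump_request_count are hypothetical names.

```c
/* Force the initial-exec model even if this file is compiled with -fpic.
   The variable name is made up for this example. */
__attribute__((tls_model("initial-exec")))
_Thread_local int request_count;

/* Each thread increments its own copy, which starts at zero. */
int bump_request_count(void) {
  return ++request_count;
}
```

If this file ends up in a shared object, the caveats above about dlopen apply.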

General dynamic and local dynamic TLS models (DSO)

The two modes are used when the TLS symbol may be defined by a shared object. They do not assume the TLS symbol is backed by a static TLS block. Instead, they assume that the thread-local storage of the module may be dynamically allocated, making the models suitable for dlopen usage. The dynamically allocated TLS storage is usually referred to as dynamic TLS.

Each TLS symbol is assigned a pair of (module ID, offset from dtv[m] to the symbol), which is usually referred to as a tls_index object. The module ID m is assigned by the dynamic loader when the module (the executable or a shared object) is loaded, so it is unknown at link time. dtv means the dynamic thread vector. Each thread has its own dynamic thread vector, which is a mapping from module ID to thread-local storage. dtv[m] points to the storage allocated for the module with the ID m.

In the simplest form, once we have a pointer to the (module ID, offset from dtv[m] to the symbol) pair, we can get the address of the symbol with the following C program:


void *__tls_get_addr(size_t *v) {
  pthread_t self = __pthread_self();
  return (void *)(self->dtv[v[0]] + v[1]);
}
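
The function above relies on libc internals, so here is a self-contained model of the same lookup that can be compiled outside a libc. The struct layouts and names are assumptions for illustration; real implementations keep more state in the thread control block.

```c
#include <stddef.h>

enum { MAX_MODULES = 4 };

/* Hypothetical per-thread control block. */
struct thread {
  unsigned char *dtv[MAX_MODULES + 1]; /* dtv[m]: TLS storage of module m */
};

/* The two GOT words: (module ID, offset from dtv[m] to the symbol). */
struct tls_index {
  size_t module;
  size_t offset;
};

/* Same computation as __tls_get_addr above, with the thread passed in
   explicitly instead of being read from the thread pointer. */
void *tls_get_addr_sim(const struct thread *self, const struct tls_index *ti) {
  return self->dtv[ti->module] + ti->offset;
}
```

The same tls_index resolves to different addresses in different threads because each thread has its own dtv.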

General dynamic TLS model (DSO & preemptible)

The general dynamic TLS model is the most flexible model. It assumes neither the module ID nor the offset from dtv[m] to the symbol is known at link time. The model is used in -fpic mode when the local dynamic TLS model does not apply. The compiler emits code to set up a pointer to the TLSGD entry of the symbol, then arranges for a call to __tls_get_addr. The return value will contain the runtime address of the TLS symbol in the current thread. On x86-64, you will notice that the leaq instruction has a data16 prefix and the call instruction has two data16 (0x66) prefixes and one rex64 prefix. This is a deliberate choice to make the total size of leaq+call 16 bytes, suitable for link-time relaxation.

data16 leaq def@tlsgd(%rip), %rdi
.value 0x6666
rex64 call __tls_get_addr@PLT
movl (%rax), %eax

At the linker stage, if the TLS symbol does not satisfy relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSGD relocation, relocated by two dynamic relocations. The dynamic loader will write the module ID to the first word and the offset from dtv[m] to the symbol to the second word. The relocation types are:

arm: R_ARM_TLS_DTPMOD32 and R_ARM_TLS_DTPOFF32
aarch64: R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL (rarely used because TLS descriptors are the default)
i386: R_386_TLS_DTPMOD32 and R_386_TLS_DTPOFF32
x86-64: R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64
mips32: R_MIPS_TLS_DTPMOD32 and R_MIPS_TLS_DTPOFF32
mips64: R_MIPS_TLS_DTPMOD64 and R_MIPS_TLS_DTPOFF64
ppc32: R_PPC_DTPMOD32 and R_PPC_DTPREL32
ppc64: R_PPC64_DTPMOD64 and R_PPC64_DTPREL64
riscv32: R_RISCV_TLS_DTPMOD32 and R_RISCV_TLS_DTPREL32
riscv64: R_RISCV_TLS_DTPMOD64 and R_RISCV_TLS_DTPREL64

Local dynamic TLS model (DSO & non-preemptible)

The local dynamic TLS model assumes that the offset from dtv[m] to the symbol is a link-time constant. This case happens when the TLS symbol is non-preemptible. The compiler emits code to set up a pointer to the TLSLD entry of the module, next arranges for a call to __tls_get_addr, then adds a link-time constant to the return value to get the address.

leaq def@tlsld(%rip), %rdi
call __tls_get_addr@PLT
movl def@dtpoff(%rax), %edx

I say "the TLSLD entry of the module" because while (on x86-64) def@tlsld looks like the TLSLD entry of the non-preemptible TLS symbol, it can really be shared by other non-preemptible TLS symbols. So one module needs just one such entry. Technically we can just use general dynamic relocation types to represent the local dynamic TLS model. For example, GCC riscv does this:

la.tls.gd a0, .LANCHOR0
call __tls_get_addr@plt

.section .tbss,"awT",@nobits
.align 2
.set .LANCHOR0, .+0
.type a, @object
.size a, 4
a:
  .zero 4

This is clever. However, I would prefer dedicated local-dynamic relocation types. If we perform a relocatable link merging this object file with another (with its own local symbol .LANCHOR0), the local symbols .LANCHOR0 are separate and their GOT entries cannot be shared. Architectures that have dedicated local-dynamic relocation types can share the GOT entries.

At the linker stage, if the TLS symbol does not satisfy local-dynamic to local-exec relaxation requirements, the linker will allocate two consecutive words in the .got section for the TLSLD relocation. The dynamic loader will write the module ID to the first word. The second word is zero, because the offset from dtv[m] to each symbol is added by the code sequence itself (def@dtpoff in the example above).

If the architecture does not define TLS relaxations, the linker can still make an optimization: in -no-pie/-pie modes, set the first word to 1 (main executable) and omit the dynamic relocation for the module ID.

TLS descriptors

Some architectures have defined TLS descriptors as more efficient alternatives to the traditional general dynamic and local dynamic TLS models. Such ABIs repurpose the first word of the (module ID, offset from dtv[m] to the symbol) pair to represent a function pointer. The function pointer points to a very simple function in the static TLS case and a function similar to __tls_get_addr in the dynamic TLS case. The caller does an indirect function call instead of calling __tls_get_addr. There are two main points:

The function call to __tls_get_addr uses the regular calling convention: the compiler has to make the pessimistic assumption that all volatile registers may be clobbered by __tls_get_addr.
In glibc (which does lazy TLS allocation), __tls_get_addr is very complex. If the TLS of the module is backed by a static TLS block, the dynamic loader can simply place the TP offset into the second word and let the function pointer point to a function which simply returns the second word.

The first point is the prominent reason that TLS descriptors are generally more efficient. Arguably traditional general dynamic and local dynamic TLS models could have a mechanism to use custom calling convention for __tls_get_addr as well.

In musl, in the static TLS case, the two words will be set to ((size_t)__tlsdesc_static, tpoff) where __tlsdesc_static is a function which returns the second word. glibc's static TLS case is similar.

.globl __tlsdesc_static
.hidden __tlsdesc_static
__tlsdesc_static:
  # The second word stores the TP offset of the TLS symbol.
  movq 8(%rax), %rax
  ret

The scheme optimizes for static TLS but penalizes the case that requires dynamic TLS. Remember that we have just two words in the GOT and by changing the first word to a function pointer, we have lost information about the module ID. To retain the information, the dynamic loader has to set the second word to a pointer to a (module ID, offset) pair allocated by malloc.
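
The two cases can be modeled in C as follows. This is a sketch under simplifying assumptions: the thread is passed as an explicit parameter (a real descriptor function reads the thread pointer and uses a custom calling convention), and every name here is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical thread state: thread pointer and dynamic thread vector. */
struct desc_thread {
  uintptr_t tp;
  uintptr_t dtv[4];
};

/* The two GOT words of a TLS descriptor. */
struct tlsdesc {
  uintptr_t (*resolve)(const struct tlsdesc *, const struct desc_thread *);
  uintptr_t arg;
};

/* Dynamic TLS case: the second word points to this pair. */
struct module_offset {
  size_t module;
  size_t offset;
};

/* Static TLS case: the second word already is the TP offset. */
uintptr_t tlsdesc_static_sim(const struct tlsdesc *d,
                             const struct desc_thread *self) {
  (void)self;
  return d->arg;
}

/* Dynamic TLS case: look up dtv[m] and convert the result to a TP offset. */
uintptr_t tlsdesc_dynamic_sim(const struct tlsdesc *d,
                              const struct desc_thread *self) {
  const struct module_offset *mo = (const struct module_offset *)d->arg;
  return self->dtv[mo->module] + mo->offset - self->tp;
}

/* The caller adds the thread pointer to whatever the descriptor returns. */
void *tlsdesc_address(const struct tlsdesc *d, const struct desc_thread *self) {
  return (void *)(self->tp + d->resolve(d, self));
}
```

Unsigned arithmetic is used so that negative TP offsets wrap around correctly.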

aarch64 defaults to TLS descriptors. On arm, i386 and x86-64, you can select TLS descriptors via GCC -mtls-dialect=gnu2.

(I implemented TLS descriptors and relaxations in LLD's x86-64 port.)

TLS relaxations

The code sequences have fixed forms and are annotated with appropriate relocations. So the linker understands the compiler's intention and can perform 4 kinds of code sequence modification as optimizations. They are:

general-dynamic/TLSDESC to local-exec relaxation: -no-pie/-pie && non-preemptible
general-dynamic/TLSDESC to initial-exec relaxation: -no-pie/-pie && preemptible
local-dynamic to local-exec relaxation: -no-pie/-pie (the symbol must be non-preemptible, otherwise it is an error to use local-dynamic)
initial-exec to local-exec relaxation: -no-pie/-pie && non-preemptible
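
The four rules can be condensed into one decision function. This is only a model of the list above, not code from any linker; the enum and function names are made up, and TLSDESC is treated the same as general-dynamic.

```c
enum tls_model { GENERAL_DYNAMIC, LOCAL_DYNAMIC, INITIAL_EXEC, LOCAL_EXEC };

/* Returns the model the code sequence ends up using after relaxation.
   linking_executable is nonzero for a -no-pie or -pie link. */
enum tls_model relax_tls_model(enum tls_model model, int linking_executable,
                               int preemptible) {
  if (!linking_executable)
    return model; /* shared object: none of the four relaxations apply */
  switch (model) {
  case GENERAL_DYNAMIC: /* also covers TLSDESC */
    return preemptible ? INITIAL_EXEC : LOCAL_EXEC;
  case LOCAL_DYNAMIC: /* the symbol must be non-preemptible */
    return LOCAL_EXEC;
  case INITIAL_EXEC:
    return preemptible ? INITIAL_EXEC : LOCAL_EXEC;
  case LOCAL_EXEC:
    return LOCAL_EXEC;
  }
  return model;
}
```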

Async-signal-safe TLS

TLS can be accessed by signal handlers (think of CPU and memory profilers), hence TLS accesses need to be async-signal-safe. Local-exec and initial-exec TLS models trivially satisfy the requirement since static TLS blocks are fixed at program start.

For a dlopen'ed shared object which does not use initial-exec TLS model (I've said above the dynamic loader has limited support for initial-exec), there are two cases.

The dynamic loader allocates sufficient storage for all currently running threads at dlopen time. This is musl's choice. At dlopen time, the dynamic loader needs to block signal delivery, take a thread list lock and install a new dynamic thread vector for each thread.
Lazy TLS allocation. TLS allocation is done the first time __tls_get_addr is called. This is the choice of glibc and many other libc implementations. The allocation is typically done by malloc, which is not async-signal-safe.

Lazy TLS allocation has the nice property that it does not penalize the threads which do not need to access the TLS of the new shared object. However, it is difficult to make __tls_get_addr async-signal-safe.
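
A toy model of lazy allocation shows where the problem comes from: the first access from a thread goes through the allocator. The names and layout below are assumptions for illustration and are far simpler than glibc's actual code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread state for the lazy scheme. */
struct lazy_thread {
  unsigned char *dtv[4]; /* NULL until the thread first touches module m */
  size_t allocations;    /* number of TLS blocks allocated so far */
};

/* Allocate the module's TLS block on first access, then return the address.
   Calling this from a signal handler that interrupted the allocator is
   unsafe, which is the problem described above. Returns NULL on failure. */
void *lazy_tls_get_addr(struct lazy_thread *self, size_t module, size_t offset,
                        size_t module_tls_size) {
  if (self->dtv[module] == NULL) {
    self->dtv[module] = calloc(1, module_tls_size);
    if (self->dtv[module] == NULL)
      return NULL;
    self->allocations++;
  }
  return self->dtv[module] + offset;
}
```

A thread that never calls the function for a module never pays for that module's block.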

Google has attempted to upstream patches to make dynamic TLS async-signal-safe. The conclusion was that making signal-safe allocators is a dead end; see TLS, 2015 edition. If a dlopen implementing eager TLS allocation is developed, conceivably it may need a new symbol version because there can be programs desiring lazy TLS allocation.