Stack unwinding

Stack unwinding主要有以下作用：

获取stack trace，用于debugger、crash reporter、profiler等
加上personality routine和language specific data area后实现C++ exceptions(Itanium C++ ABI)

Stack unwinding可以分成两类：

synchronous: 程序自身触发的，C++ throw、获取自身stack trace等。这类stack unwinding只在function body，不会出现在prologue/epilogue
asynchronous: 由signal或外部程序触发，这类stack unwinding可以发生在prologue/epilogue

Frame pointer

最经典、最简单的stack unwinding基于frame pointer：固定一个寄存器为frame pointer(在x86-64上为RBP)，函数prologue处把frame pointer放入栈帧，并更新frame pointer为保存的frame pointer的地址。
frame pointer值形成了一个单链表。获取初始frame pointer值(__builtin_frame_address)后，不停解引用frame pointer即可得到所有栈帧的frame pointer值。
这种方法不适用于prologue/epilogue的部分指令。

pushq %rbp
movq %rsp, %rbp
...
popq %rbp
retq

下面是个简单的stack unwinding例子：

#include <stdio.h>
[[gnu::noinline]] void qux() {
  void **fp = __builtin_frame_address(0);
  for (;;) {
    printf("%p\n", fp);
    void **next_fp = *fp;
    if (next_fp <= fp) break;
    fp = next_fp;
  }
}
[[gnu::noinline]] void bar() { qux(); }
[[gnu::noinline]] void foo() { bar(); }
int main() { foo(); }

基于frame pointer的方法简单，但是有若干缺陷。

上面的代码用-O1或以上编译时foo和bar会tail call，程序输出不会包含foo bar的栈帧(-fomit-leaf-frame-pointer并不阻碍tail call)。

实践中，有时候不能保证所有库都包含frame pointer。unwind一个线程时，为了增强健壮性需要检测一个next_fp是否像栈地址。检测的一种方法是解析/proc/*/maps判断地址是否可读(慢)，另一种是


int fd = open("/dev/random", O_WRONLY);
if (write(fd, address, 1) < 0)

另外，预留一个寄存器用于frame pointer会有性能开销(prologue、epilogue额外的指令开销和少一个寄存器带来的寄存器压力)，在寄存器贫乏的x86-32可能相当显著，在寄存器较为充足的x86-64可能也有1%以上的性能损失。

-O0: 预设-fno-omit-frame-pointer，所有函数都有frame pointer
-O1或以上: 预设-fomit-frame-pointer，只有必要情况才设置frame pointer。指定-fno-omit-leaf-frame-pointer则可得到类似-O0效果。可以额外指定-momti-leaf-frame-pointer去除leaf functions的frame pointer

libunwind

C++ exception、profiler/crash reporter的stack unwinding通常用libunwind API和DWARF Call Frame Information。Hewlett-Packard定义了一套libunwind API，分为两类：

_Unwind_*: 这部分标准化为Itanium C++ ABI: Exception Handling的Level 1: Base ABI。Level 2 C++ ABI调用这些_Unwind_* API。其中的_Unwind_Resume是唯一可能被C++编译后的代码直接调用的API
unw_*: 入口是unw_init_local(local unwinding，当前进程)和unw_init_remote(remote unwinding，其他进程)

Hewlett-Packard开源了https://www.nongnu.org/libunwind/。这套API在Linux上的常见实现是：

libgcc/unwind-* (libgcc_s.so.1或libgcc_eh.a): 实现了_Unwind_*并引入了一些扩展：_Unwind_Resume_or_Rethrow, _Unwind_FindEnclosingFunction, __register_frame等
llvm-project/libunwind (libunwind.so或libunwind.a)是HP的一个简化实现，还提供了部分unw_*，但没有实现unw_init_remote。部分代码取自ld64。使用Clang的话可以用--rtlib=compiler-rt --unwindlib=libunwind选择
glibc的_Unwind_Find_FDE内部实现，通常不导出，和__register_frame_info有关

DWARF Call Frame Information

程序不同区域需要的unwind指令由DWARF Call Frame Information (CFI)描述，在ELF平台上由.eh_frame存储。Compiler/assembler/linker/libunwind提供相应支持。

.eh_frame由Common Information Entry (CIE)和Frame Description Entry (FDE)组成。CIE提供了一些公共的字段：

augmentation_string: P字符表示personality routine指针；L字符表示FDE的augmentation data存储了language-specific data area (LSDA)
address_size: 一般为4或8
segment_selector_size
code_alignment_factor: 假设指令长度都是2或4的倍数(用于RISC)，可以减小DW_CFA_advance_loc的参数
data_alignment_factor
return_address_register
augmentation_data_length
augmentation_data: personality
initial_instructions
padding

FDE有这些字段：

length: FDE自身长度。若为0xffffffff，接下来8字节(extended_length)记录实际长度。除非特别构造，extended_length是用不到的
CIE_pointer: 从当前位置减去CIE_pointer得到相关的CIE
initial_location: 该FDE描述的第一个位置的地址。在.o中此处有一个引用section symbol的relocation
address_range: initial_location和address_range描述了一个地址区间
instructions: unwind时的指令
augmentation_data_length
augmentation_data: 如果关联的CIE augmentation包含L字符，这里会记录language-specific data area
padding

personality和lsda用于Itanium C++ ABI的Level 2: C++ ABI。

.eh_frame基于DWARF v2引入的.debug_frame。它们有一些区别：

.eh_frame带有SHF_ALLOC flag(标志一个section是否应为内存中镜像的一部分)而.debug_frame没有，因此后者的使用场景非常少。
debug_frame支持DWARF64格式(支持64-bit offsets但体积会稍大)而.eh_frame不支持(其实可以拓展，但是缺乏需求)
.debug_frame的CIE中没有augmentation_data_length和augmentation_data
FDE中CIE_pointer的含义不同。.debug_frame中表示一个section offset(absolute)而.eh_frame中表示一个relative offset。.eh_frame作出的这一改变很好。如果.eh_frame长度超过32-bit，.debug_frame得转换成DWARF64才能表示CIE_pointer，而relative offset则无需担心这一问题(如果FDE到CIE的距离超过32-bit了，追加一个CIE即可)

对于如下的函数：

1
2
3

void f() {
  __builtin_unwind_init();
}

编译器用.cfi_*(CFI directive)标注汇编，.cfi_startproc和.cfi_endproc标识FDE区域，其他CFI directives描述CFI instructions。
一个call frame用栈上的一个地址表示。这个地址叫做Canonical Frame Address (CFA)，通常是call site的stack pointer值。下面用一个例子描述CFI instructions的作用：

f:
# At the function entry, CFA = rsp+8
	.cfi_startproc
# %bb.0:
	pushq	%rbp
# Redefine CFA = rsp+16
	.cfi_def_cfa_offset 16
# rbp is saved at the address CFA-16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
# CFA = rbp+16. CFA does not needed to be redefined when rsp changes
	.cfi_def_cfa_register %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
# rbx is saved at the address CFA-56
	.cfi_offset %rbx, -56
	.cfi_offset %r12, -48
	.cfi_offset %r13, -40
	.cfi_offset %r14, -32
	.cfi_offset %r15, -24
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
# CFA = rsp+8
	.cfi_def_cfa %rsp, 8
	retq
.Lfunc_end0:
	.size	f, .Lfunc_end0-f
	.cfi_endproc

汇编器根据CFI directives生成.eh_frame(这套机制由Alan Modra在2003年引入)。Linker收集.o中的.eh_frame input sections生成output .eh_frame。

定位一个pc所在的FDE需要从头扫描.eh_frame，找到合适的FDE(pc是否落在initial_location和address_range表示的区间)，所花时间和扫描的CIE和FDE记录数相关。
https://sourceware.org/pipermail/binutils/2001-December/015674.html引入了.eh_frame_hdr，包含binary search index table描述(initial_location, FDE address) pairs。

与很多其他sections不同，linker需要“认识”.eh_frame，解析并重写.eh_frame、生成.eh_frame_hdr、创建program header PT_EH_FRAME指向.eh_frame_hdr。

CFI instructions适合编译器生成代码，而手写汇编要准确标准每一条指令是繁琐的，也很容易出错。
2015年Alex Dowad也musl libc贡献了awk脚本，解析assembly并自动标注CFI directives。
其实对于编译器生成的代码也不容易，对于一个不用frame pointer的函数，调整SP就得同时输出一条CFI directive重定义CFA。GCC是不解析inline assembly的，因此inline assembly里调整SP往往会造成不准确的CFI。
而LLVM里的CFIInstrInserter可以插入.cfi_def_cfa_* .cfi_offset .cfi_restore调整CFA和callee-saved寄存器。

void foo() {
  asm("subq $128, %rsp\n"
  
      "nop\n"
      "addq $128, %rsp\n");
}

int main() {
  foo();
}

libunwind例子

#include <libunwind.h>
#include <stdio.h>

void backtrace() {
  unw_context_t context;
  unw_cursor_t cursor;
  
  unw_getcontext(&context);
  
  unw_init_local(&cursor, &context);
  size_t rip, rsp;
  do {
    unw_get_reg(&cursor, UNW_X86_64_RIP, &rip);
    unw_get_reg(&cursor, UNW_X86_64_RSP, &rsp);
    printf("rip: %zx rsp: %zx\n", rip, rsp);
  } while (unw_step(&cursor) > 0);
}

void bar() {backtrace();}
void foo() {bar();}
int main() {foo();}

unw_getcontext: 获取寄存器值(包含PC)
unw_init_local
- 使用dl_iterate_phdr遍历可执行文件和shared objects，找到包含PC的PT_LOAD program header
- 找到所在module的PT_EH_FRAME(.eh_frame_hdr)，存入cursor
unw_step
- 二分搜索PC对应的.eh_frame_hdr项，记录找到的FDE和其指向的CIE
- 执行CIE中的initial_instructions
- 执行FDE中的instructions。维护一个location、CFA，初始指向FDE的initial_location，指令中DW_CFA_advance_loc增加location；DW_CFA_def_cfa_*更新CFA；DW_CFA_offset表示一个寄存器的值保存在CFA+offset处
- location大于等于PC时停止。也就是说，执行的指令是FDE instructions的一个前缀

Unwinder根据program counter找到适用的FDE，执行所有在program counter之前的CFI instructions。

有几种重要的

DW_CFA_def_cfa_*
DW_CFA_offset
DW_CFA_advance_loc

一个-DCMAKE_BUILD_TYPE=Release -DLLVM_TARGETS_TO_BUILD=X86的clang，.text 51.7MiB、.eh_frame 4.2MiB、.eh_frame_hdr 646、2个CIE、82745个FDE。

CFI directives适合编译器

Compact unwind descriptors

在macOS上，Apple设计了compact unwind descriptors机制加速unwinding，理论上这种技术可以用于节省一些__eh_frame空间，但并没有实现。
主要思想是：

大多数函数的FDE都有固定的模式(prologue处指定CFA、存储callee-saved registers)，可以把FDE instructions压缩为32-bit。
CIE/FDE augmentation data描述的personality/lsda很常见，可以提取出来成为固定字段。

下面只讨论64-bit。一个descriptor占32字节

.quad _foo
.set L1, Lfoo_end-_foo
.long L1
.long compact_unwind_description
.quad personality
.quad lsda_address

Compact unwind descriptor编码为：

1
2
3

uint32_t : 24; 
uint32_t mode : 4;
uint32_t flags : 4;

定义了5种mode：

0: reserved
1: FP-based frame: RBP为frame pointer，frame size可变
2: SP-based frame: 不用frame pointer，frame size编译期固定
3: large SP-based frame: 不用frame pointer，frame size编译期固定但数值较大，无法用mode 2表示
4: DWARF CFI escape

FP-based frame (`UNWIND_MODE_BP_FRAME`)

Compact unwind descriptor编码为：

uint32_t regs : 15;
uint32_t : 1; 
uint32_t stack_adjust : 8;
uint32_t mode : 4;
uint32_t flags : 4;

x86-64上callee-saved寄存器有：RBX,R12,R13,R14,R15,RBP。3 bits可以编码一个寄存器，15 bits足够表示除RBP外的5个寄存器(是否保存及保存在哪里)。
stack_adjust记录保存寄存器外的额外栈空间。

SP-based frame (`UNWIND_MODE_STACK_IMMD`)

Compact unwind descriptor编码为：

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t : 3;
uint32_t size : 8;
uint32_t mode : 4;
uint32_t flags : 4;

cnt表示保存的寄存器数(最大6)。
reg_permutation表示保存的寄存器的排列的序号。
size*8表示栈帧大小。

Large SP-based frame (`UNWIND_MODE_STACK_IND`)

Compact unwind descriptor编码为：

uint32_t reg_permutation : 10;
uint32_t cnt : 3;
uint32_t adj : 3;
uint32_t size_offset : 8;
uint32_t mode : 4;
uint32_t flags : 4;

和SP-based frame类似。特别的是：栈帧大小是从text section读取的。RSP调整量通常由subq imm, %rsp表示，用size_offset表示该指令到函数开头的距离。
实际表示的stack size还要算上adj*8。

DWARF CFI escape

如果因为各种原因，compact unwind descriptor无法表示，就要回退到DWARF CFI。

LLVM实现里，每一个函数只用一个compact unwind descriptor表示。如果asynchronous stack unwinding发生在epilogue，已有实现无法把它和发生在function body的stack unwinding区分开来。
Canonical Frame Address会计算错误，caller-saved寄存器也会错误地读取。
如果发生在prologue，且prologue在push寄存器和subq imm, $rsp外有其他指令，也会出错。
另外如果一个函数启用了shrink wrapping，prologue可能不在函数开头处。开头到prologue间的asynchronous stack unwinding也会出错。
这个问题似乎多数人都不关心，可能是因为profiler丢失几个百分点的profile大家不在乎吧。

其实如果用多个descriptors描述一个函数的各个区域，还是可以准确unwind的。
OpenVMS 2018年提出了[RFC] Improving compact x86-64 compact unwind descriptors，可惜没有相关实现。

ARM exception handling

待补充

Frame pointer

libunwind

DWARF Call Frame Information

libunwind例子

Compact unwind descriptors

FP-based frame (UNWIND_MODE_BP_FRAME)

SP-based frame (UNWIND_MODE_STACK_IMMD)

Large SP-based frame (UNWIND_MODE_STACK_IND)

DWARF CFI escape

ARM exception handling

FP-based frame (`UNWIND_MODE_BP_FRAME`)

SP-based frame (`UNWIND_MODE_STACK_IMMD`)

Large SP-based frame (`UNWIND_MODE_STACK_IND`)