Last year we published our patch gap analysis of ESXi’s TCP/IP stack, which is forked from FreeBSD 8.2. While our focus was mainly on missing FreeBSD patches in ESXi, we also came across a type confusion bug in code introduced by VMware. This blog post details a vulnerability I discovered in ESXi’s implementation of the setsockopt system call that could lead to a sandbox escape. The vulnerability was assigned CVE-2022-31696 and disclosed as part of the advisory VMSA-2022-0030. I also explore ESXi’s kernel heap allocator and weaknesses in existing kernel mitigations.
For information on the initial analysis of the TCP/IP kernel module, VMkernel debug symbols, and porting type information from FreeBSD to ESXi, we recommend reading our earlier analysis.
Comparing setsockopt in FreeBSD vs ESXi
First, let’s take a look at how the setsockopt implementation in ESXi 6.7 build 19195723 differs from that of FreeBSD. Of particular note are differences in the handling of the SO_KEEPALIVE socket option. This option enables keep-alive messages on connection-oriented sockets.
In BSD systems, the TCP timer functions are registered and executed through the callout facility. ESXi added code here to check whether there is an active callout for the keep-alive by calling tcp_timer_active. If so, it resets the TCP keepidle to a newer value using tcp_timer_activate. The keepidle value determines how long TCP should wait before sending out the first keep-alive probe.
Type confusion vulnerability in SO_KEEPALIVE handling
What’s the issue with this newly added code? To understand this better, let’s take another look at the decompiled code with type information added.
The Internet PCB structure inpcb has a pointer inp_ppcb that can point to either a TCP PCB (tcpcb) or a UDP PCB (udpcb) structure, depending on the protocol. The vulnerable code shown here always casts the pointer to tcpcb, irrespective of the socket type. If the SO_KEEPALIVE option is set for a UDP socket, inp_ppcb is a pointer to a udpcb structure, but it is cast to a tcpcb structure because no validation is performed. When the code then accesses the tcp_timer structure variable t_timers at offset 0x20, the access is out of bounds because the udpcb structure is only 0x10 bytes in size.
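The size mismatch can be checked with a small user-space model. This is only a sketch: the two structures below are hypothetical stand-ins that reproduce the two numbers quoted above (a 0x10-byte udpcb and t_timers at offset 0x20), not the real ESXi or FreeBSD definitions, and the function name is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: only the total size (0x10) is taken from the text. */
struct udpcb_model {
    uint8_t bytes[0x10];
};

/* Hypothetical stand-in: only the offset of t_timers (0x20) is taken from
 * the text. The leading fields are collapsed into an opaque byte array. */
struct tcpcb_model {
    uint8_t  leading[0x20];
    uint64_t t_timers;      /* pointer-sized slot on a 64-bit kernel */
};

/* Returns 1 if reading t_timers through a pointer that really refers to a
 * udpcb_model would touch bytes past the end of that object. */
int t_timers_read_is_out_of_bounds(void)
{
    size_t end_of_read = offsetof(struct tcpcb_model, t_timers) + sizeof(uint64_t);
    return end_of_read > sizeof(struct udpcb_model);
}
```

Under this model the read covers bytes 0x20 through 0x27 of an object that ends at 0x10, so it lies entirely outside the allocation.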
Triggering the Vulnerable Code and PSOD
To trigger the vulnerable code path, we need to create a UDP socket and then manipulate it using the setsockopt system call. Specifically, it is necessary to set the SO_KEEPALIVE option. Since ESXi does not package any build tools, we must compile the PoC statically on a Linux machine and then transfer the binary to ESXi for execution. Running the PoC will immediately trigger the Purple Screen of Death (PSOD). To trigger the bug from a sandboxed process, an attacker must be able to invoke the setsockopt system call on an existing UDP socket descriptor or create a new one for that purpose. Below is the PoC to trigger the bug:
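A minimal sketch of the two steps described above, using only the standard BSD sockets API, might look like the following. The helper name is hypothetical, and the assumption that these two calls alone reach the vulnerable path on an affected build rests on the description in this post. On a patched ESXi host or on an ordinary Linux machine, the call simply sets the option and returns.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper: creates a UDP socket and sets SO_KEEPALIVE on it.
 * Returns the setsockopt() result (0 on success, -1 on error), or -1 if
 * the socket could not be created. */
int set_keepalive_on_udp_socket(void)
{
    int enable = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (fd < 0)
        return -1;

    /* SO_KEEPALIVE is a SOL_SOCKET-level option, so the request is handled
     * by the generic socket-option code rather than by UDP itself. */
    int rc = setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable));

    close(fd);
    return rc;
}
```

To use it as a standalone binary, call the helper from main and link statically (for example with the compiler's -static option) so that the result runs on ESXi without additional libraries.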
Kernel Debug Setup for ESXi
While ESXi supports a local VMkernel debugger, VMKDBG, which can be used to inspect the PSOD, it is not as flexible as GDB. The GDB setup detailed in Attacking VMware NSX (Slides 34 – 37 in the PDF) is an excellent reference for getting started with ESXi kernel debugging. In summary, we used the GDB stubs feature provided by VMware to debug ESXi running as a guest VM on Fusion. We also disabled kASLR for ease of debugging. Since the ESXi kernel modules have symbols, it is possible to use GDB’s add-symbol-file command to load symbol information given an executable file and its base address in memory. The module base address and the path information required for add-symbol-file can be fetched using the esxcfg-info command as seen below:
While the file path to the tcpip module can be seen in the output, there is no file path entry for the VMkernel module. The VMkernel module with symbol information is found as a gzip-compressed file, k.b00, within the bootbank directory of ESXi. Alternatively, to obtain the VMkernel executable with not only symbols but also type information, one can download it from the VMware WorkBench. However, in this case, the VMware WorkBench does not have debug information for the version of ESXi currently under analysis.
Once the kernel modules are available and their base addresses are known, connect the debugger and run the PoC to trigger the crash. The exception triggered may not be caught by GDB. In that case, ESXi will continue running, executing the handler for Interrupt 13 - General Protection Fault (GP), which is responsible for collecting fault information and core dumps. Should this occur, wait for the PSOD and then hit “Control + C” (SIGINT) to break into GDB. In the debug session shown below, you can see the symbolized stack trace obtained after loading symbols with GDB’s add-symbol-file command. tcp_timer_active was the last function to be executed before the interrupt handler was called. Therefore, choose the relevant frame (12 in this case) and inspect the program state. The register RAX was found to be loaded with a garbage value, leading to an invalid memory access during the execution of the mov eax,DWORD PTR [rax+0x38] instruction.
Analyzing the Exploitability of the Type Confusion Bug
Since the debug setup with symbols is now ready, let’s take another look at the crash by setting breakpoints and stepping through the code. The tcpcb structure can be inspected during the call to the tcp_timer_active function, which takes it as the first argument. However, the type information is still missing within GDB. As a workaround, it is possible to use the type information from the FreeBSD kernel for debugging ESXi’s tcpip kernel module. Though some of the structure definitions differ between the FreeBSD and ESXi TCP/IP stacks, they have substantial similarities. Once again, GDB’s add-symbol-file command comes in handy. To import all structure definitions, use the add-symbol-file command with the address set to 0. Similarly, type information for VMkernel can be imported from an older version of ESXi vmkernel-visor (6.7-14320388) available through the VMware WorkBench.
Unlike the previous debug session, where the crash happened when accessing a garbage pointer, this time the t_timers variable is NULL, which results in a NULL pointer dereference. To better understand this behavior, it is necessary to examine the heap allocator used by ESXi. Analysis of the vmkernel-visor executable showed that ESXi’s kernel heap allocator is based on Doug Lea's Malloc:
In dlmalloc, the malloc chunk headers are 32 bytes in size. The structure definition is as follows:
The prev_foot field holds the size of the previous chunk if that chunk is free, whereas the head field holds the size of the current chunk. In addition to the size, the head field also holds two flag bits: PINUSE_BIT and CINUSE_BIT. The PINUSE_BIT (lowest-order bit) marks whether the previous chunk is in use. The CINUSE_BIT (second-lowest bit) marks whether the current chunk is in use. The forward (fd) and backward (bk) pointer fields are used only when the chunk is free. Otherwise, the chunk data starts immediately after the head field. Now, looking back at the memory pointed to by RDI, it can be inferred that it is the data region of a dlmalloc chunk of size 32 bytes, which can hold 16 bytes of data (the udpcb structure).
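The header layout and flag handling described above can be sketched as follows. The field names and the two flag values follow Doug Lea's malloc. The fixed-width field types, the "_model" suffix, and the helper function names are assumptions made for this illustration, and only the two flag bits mentioned above are masked off.

```c
#include <stddef.h>
#include <stdint.h>

/* Chunk header modeled on dlmalloc's malloc_chunk for a 64-bit target.
 * Fixed-width integers stand in for size_t and pointers (assumption). */
struct malloc_chunk_model {
    uint64_t prev_foot; /* size of the previous chunk, if it is free */
    uint64_t head;      /* size of this chunk plus the flag bits */
    uint64_t fd;        /* forward pointer, used only while free */
    uint64_t bk;        /* backward pointer, used only while free */
};

#define PINUSE_BIT 0x1u /* previous chunk is in use */
#define CINUSE_BIT 0x2u /* current chunk is in use */

/* Chunk size with the two flag bits stripped. */
uint64_t chunk_size(uint64_t head)
{
    return head & ~(uint64_t)(PINUSE_BIT | CINUSE_BIT);
}

int prev_in_use(uint64_t head)  { return (head & PINUSE_BIT) != 0; }
int chunk_in_use(uint64_t head) { return (head & CINUSE_BIT) != 0; }

/* Bytes available for data in an in-use chunk: per the description above,
 * data begins right after head, where fd would sit in a free chunk. */
uint64_t payload_capacity(uint64_t head)
{
    return chunk_size(head) - offsetof(struct malloc_chunk_model, fd);
}
```

For a head value of 32 with both flag bits set, chunk_size returns 32 and payload_capacity returns 16, which matches the 16-byte udpcb allocation inferred above.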
As explained above, when the code fetches the t_timers pointer from offset +0x20, it reads data from the adjacent chunk. This is because the allocated udpcb structure is smaller than the offset of t_timers in the tcpcb structure. Since the adjacent chunk may hold unrelated data, its contents are unpredictable (unless greater care is taken to first groom the heap). That is why the PoC crash will sometimes manifest as a NULL pointer dereference and sometimes as a different kind of invalid access. Here is what the access of t_timers looks like:
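The arithmetic behind that access can be modeled in user space. The sketch below assumes the layout inferred earlier (32-byte chunks whose data begins 16 bytes in) and that the neighboring chunk is also 32 bytes. The constants and function names are illustrative, and the byte buffer merely stands in for kernel heap memory.

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE      32u   /* size of the chunk holding the udpcb */
#define DATA_OFFSET     16u   /* data starts after prev_foot and head */
#define T_TIMERS_OFFSET 0x20u /* offset of t_timers within tcpcb */

/* Offset of the confused read, measured from the start of the udpcb chunk. */
uint32_t confused_read_offset(void)
{
    return DATA_OFFSET + T_TIMERS_OFFSET;
}

/* Same offset, measured from the start of the adjacent chunk. */
uint32_t offset_in_adjacent_chunk(void)
{
    return confused_read_offset() - CHUNK_SIZE;
}

/* Reads the 8 bytes that would be interpreted as t_timers. The buffer must
 * model two adjacent 32-byte chunks, i.e. hold at least 64 bytes. */
uint64_t confused_t_timers(const uint8_t *two_chunks)
{
    uint64_t value;
    memcpy(&value, two_chunks + confused_read_offset(), sizeof(value));
    return value;
}
```

In this model the read lands 16 bytes into the neighboring chunk, which is the first 8 bytes of its data region. A zero-filled neighbor yields the NULL case, while any other content yields an arbitrary pointer value.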
Assuming control of the t_timers pointer, it is possible to corrupt arbitrary memory during the write operations within the callout_stop or callout_reset functions. Alternatively, if there is control over the memory pointed to by the t_timers pointer, it is possible to control the subsequent access of the tcp_timer structure. Specifically, tcp_timer contains a callout substructure scheduled for execution by tcp_timer_activate. By targeting the c_func function pointer, we can gain control of the instruction pointer. Since ESXi does not support Supervisor Mode Access Prevention (SMAP), t_timers could in fact point to user space memory instead of controlled memory in kernel space.
Note that structures such as tcpcb, tcp_timer and callout in ESXi are slightly different from the corresponding structures in FreeBSD. By comparing the decompiled ESXi code against FreeBSD 8.2, I identified new structure elements and adjusted the offsets of existing fields. For example, some global variables in FreeBSD such as tcp_keepidle, tcp_keepintvl and tcp_keepcnt were turned into fields of the tcp_timer structure in ESXi. This can be recognized by analyzing the tcp_timer_keep callout function.
In addition to the lack of support for SMAP, the kASLR of kernel modules was also found to be weak. While the text base address showed significant randomization, the data segment base address did not, with as little as 1 bit of entropy in some cases. Here are the load addresses of the tcpip kernel module across multiple reboots:
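One rough way to quantify such an observation is to count how many bit positions ever change across the recorded base addresses. The helper below is a sketch with a hypothetical name. It gives only an estimate: bits that did not happen to change in the sample are missed, and bits that change together are counted separately.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts the bit positions that differ between the first address and at
 * least one of the other sampled addresses. */
unsigned varying_bits(const uint64_t *addrs, size_t n)
{
    uint64_t diff = 0;

    /* Accumulate every bit that differs from the first sample. */
    for (size_t i = 1; i < n; i++)
        diff |= addrs[0] ^ addrs[i];

    /* Population count of the accumulated difference mask. */
    unsigned count = 0;
    while (diff != 0) {
        count += (unsigned)(diff & 1u);
        diff >>= 1;
    }
    return count;
}
```

Feeding it the data segment base recorded after each reboot, and separately the text base, makes the difference between the two segments easy to compare.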
Patch Analysis
To understand the fix for the type confusion bug, a patch diff was performed against ESXi 6.7 Build 20497097 (now at end-of-life). Instead of setting up the newer version of ESXi, you can just download the relevant VIB (vSphere Installation Bundle) from the ESXi Patch Tracker. In the case of tcpip, the kernel module is found within the ESXi base system esx-base VIB. This information can be queried using the esxcli command:
The diff between the tcpip kernel modules from builds 19195723 and 20497097 revealed an additional check added to the sosetopt function. The code now checks whether the socket protocol is IPPROTO_TCP before proceeding with TCP timers. There is no explicit check to prevent a raw socket from entering the code path, but inp_ppcb is initialized only for the socket types SOCK_STREAM and SOCK_DGRAM, not for SOCK_RAW. Therefore, the timer code is reachable only when the socket type is SOCK_STREAM and the protocol is IPPROTO_TCP.
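The net effect of the fix can be summarized as a simple predicate. This is a model of the conclusion above with a hypothetical function name, not the decompiled patch: in the actual code the protocol check is explicit, while the socket type restriction follows from how inp_ppcb is initialized.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Returns 1 if, after the fix, a socket with this type and protocol can
 * still reach the TCP keep-alive timer code, and 0 otherwise. */
int keepalive_timer_path_reachable(int so_type, int protocol)
{
    return so_type == SOCK_STREAM && protocol == IPPROTO_TCP;
}
```

Under this model, the UDP socket used to trigger the bug (SOCK_DGRAM with IPPROTO_UDP) no longer reaches the timer code.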
Conclusion
Historically, kernel privilege escalation vulnerabilities in ESXi have not been frequently seen. ESXi has no login shell for low-privileged users, so that entry point is eliminated. On the other hand, user-mode daemons such as SLPD run with the highest privileges (i.e., superDom), so in the case of compromise of a daemon, there is no need for further escalation. For these reasons, ESXi kernel bugs have not been a popular topic of discussion, at least not publicly. However, the situation is changing. SLP is no longer enabled by default, and ESXi is now sandboxing more and more user-mode processes. This makes us believe ESXi kernel bugs will become important in the coming years. For anyone interested, I hope this blog post will give some ideas to get started on the topic, and I’ll continue blogging about any significant findings in the future. Until then, you can follow me @renorobertr and follow the team on Twitter, Mastodon, LinkedIn, or Instagram for the latest in exploit techniques and security patches.