Binary Vulnerability Analysis 3: Shedding Light on Huawei's Security Hypervisor (Part 2)
2023-11-17 07:8:4 Author: 安全狗的自我修养

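Contents

  • Introduction

  • Overview

    • Retrieving the Hypervisor Binary

    • Analyzing the Hypervisor Binary

    • ARM Virtualization Extensions

    • Huawei Hypervisor Execution Environment

  • Initialization

    • ARM Virtual Memory Management Refresher

    • Mapping Memory in the Hypervisor

    • Mapping Memory in the Second Stage

    • Hypervisor's Virtual Address Space

    • Initializing the Second Translation Stage

    • Powering On the Primary Core

    • Global and Local States

    • Primary Core Boot

    • Secondary Core Boot

    • Virtual Memory Management

  • Exception Handling

    • Hypervisor Calls

    • Secure Monitor Calls

    • Processing the Kernel Page Tables

    • Kernel Page Table Writes

    • Stage 2 Fault During Stage 1 Page Table Walk

    • Instruction Abort Handling

    • Data Abort Handling

    • CRn 1: SCTLR_EL1

    • CRn 2: TTBR0_EL1, TTBR1_EL1 and TCR_EL1

    • CRn 5: AFSR0_EL1, AFSR1_EL1 and ESR_EL1

    • CRn 6: FAR_EL1

    • CRn 10: MAIR_EL1 and AMAIR_EL1

    • CRn 13: CONTEXTIDR_EL1

    • Trapped Instruction Handling

    • Instruction and Data Abort Handling

    • Kernel Page Table Management

    • Hypervisor and Secure Monitor Calls

  • Conclusion

  • References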

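Continued from the previous part.

CRn 2: TTBR0_EL1, TTBR1_EL1 and TCR_EL1

hhee_sysregs_crn2 validates the modifications made to TTBR0_EL1, TTBR1_EL1 and TCR_EL1.

  • For TTBR0_EL1, modifying the ASID is not allowed until the hypervisor has finished processing the kernel page tables. Apart from this and a few sanity checks, all changes are allowed.

  • For TTBR1_EL1, the hypervisor performs sanity checks, and once the page table address has been set for the first time, the kernel is not allowed to change it again, unless it is switching from the identity-mapping PGD (idmap_pg_dir) to the swapper PGD (swapper_pg_dir). When the hypervisor allows the modification of TTBR1_EL1, the kernel page tables are processed using process_ttbr1_el1.

  • For TCR_EL1, the verifications are mostly sanity checks, and once the kernel page tables have been processed, the kernel can no longer change its value.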

uint64_t hhee_sysregs_crn2(uint64_t rt_val, uint32_t crm, uint32_t op2) {
if (crm) {
return 0xfffffffe;
}

sys_regs = current_cpu->sys_regs;

// TTBR0_EL1.
if (op2 == 0) {
// The Common not Private bit is ignored.
//
// CnP, bit [0]: Common not Private.
new_ttbr = rt_val & 0xfffffffffffffffe;
cnp = rt_val & 1;
asid = rt_val & 0xffff000000000000;

sys_regs = current_cpu->sys_regs;
if (atomic_get(&sys_regs->regs_inited)) {
// Checks if the ASID is determined by TTBR1_EL1 and if the ASID the kernel wants to set in TTBR0_EL1 is not null.
// Verifications are made using the saved value of TCR_EL1, if it's available.
//
// A1, bit [22]: Selects whether TTBR0_EL1 or TTBR1_EL1 defines the ASID.
if (asid && sys_regs->tcr_el1 & (1 << 22)) {
new_ttbr &= 0xffffffffffff;
}
}
// Checks if the ASID is determined by TTBR1_EL1 and if the ASID the kernel wants to set in TTBR0_EL1 is not null.
// Verifications are made using the current value of TCR_EL1, because the saved value has not been set yet.
else if (asid && get_tcr_el1() & (1 << 22)) {
return 0xfffffffe;
}
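// If the kernel wants to enable Common not Private translations but it is not supported by the device, it returns
// an error.
//
// ID_AA64MMFR2_EL1.CnP, bits [3:0]: non-zero when Common not Private translations are supported.
else if (cnp && !(get_id_aa64mmfr2_el1() & 0xf)) {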
return 0xfffffffe;
}

// Updates TTBR0_EL1 with the new value.
set_ttbr0_el1(new_ttbr);
return 0;
}

// TTBR1_EL1.
if (op2 == 1) {
if (atomic_get(&sys_regs->regs_inited)) {
curr_ttbr = sys_regs->ttbr1_el1;
new_ttbr = rt_val & 0xffffffffffff;
// If the ASID is determined by TTBR1_EL1.
if (sys_regs->tcr_el1 & (1 << 22)) {
// Switches from the identity mapping PGD to the swapper PGD.
if (new_ttbr - curr_ttbr == 0x2000) {
asid = rt_val & 0xffff000000000000;
goto PROCESS_KERNEL_PT;
}
}
// If we are not switching from IDMAP to SWAPPER, the page table address can't be changed once it has been set.
if (new_ttbr != curr_ttbr) {
debug_printf(0x101, "Disallowed change of privileged page table to 0x%016lx", new_ttbr);
return 0xfffffff8;
}
asid = rt_val & 0xffff000000000000;
} else {
// If TTBR0_EL1 defines the ASID, the kernel is not allowed to change it in TTBR1_EL1.
asid = rt_val & 0xffff000000000000;
if (asid && !(get_tcr_el1() & (1 << 22))) {
return 0xfffffffe;
}

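// If the kernel wants to enable Common not Private translations but it is not supported by the device, it returns
// an error.
//
// ID_AA64MMFR2_EL1.CnP, bits [3:0]: non-zero when Common not Private translations are supported.
cnp = rt_val & 1;
if (cnp && !(get_id_aa64mmfr2_el1() & 0xf)) {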
return 0xfffffffe;
}
}

PROCESS_KERNEL_PT:
// If TTBR1_EL1's ASID is non-null, kernel page tables have not been processed by the hypervisor and the MMU is
// enabled, then we process the translation table referenced by TTBR1_EL1.
if (asid && !current_cpu->kernel_pt_processed && get_sctlr_el1() & 1) {
ret = process_ttbr1_el1();
if (ret) {
return ret;
}
// If no error occurred, marks the kernel page tables as processed.
current_cpu->kernel_pt_processed = 1;
}
// Updates TTBR1_EL1 with the new value.
set_ttbr1_el1(rt_val);
return 0;
}

// TCR_EL1.
if (op2 == 2) {
if (atomic_get(&sys_regs->regs_inited)) {
// If the saved values of the system registers have already been initialized in the hypervisor, changing TCR_EL1
// returns an error.
if (rt_val != sys_regs->tcr_el1) {
return 0xfffffff8;
}
// Otherwise we just rewrite the value in the register.
set_tcr_el1(rt_val);
return 0;
}

// For translations using TTBR0_EL1 and TTBR1_EL1, makes sure bits 59, 60, 61 and 62 of each stage 1 translation
// table Block or Page entry can be used by hardware for an IMPLEMENTATION DEFINED purpose.
//
// - HWU059-HWU062, bits [46:43]: Indicates IMPLEMENTATION DEFINED hardware use of bits 59, 60, 61 and 62 of the
// stage 1 translation table Block or Page entry for translations using TTBR0_EL1.
// - HWU159-HWU162, bits [50:47]: Indicates IMPLEMENTATION DEFINED hardware use of bits 59, 60, 61 and 62 of the
// stage 1 translation table Block or Page entry for translations using TTBR1_EL1.
if (rt_val & 0xfffff80800000040) {
return 0xfffffffe;
}

// If 16-bit ASIDs are not supported by the device, returns an error if the kernel tries to configure an ASID of size
// 16 bits.
//
// - AS, bit [36]: ASID Size.
// - ID_AA64MMFR0_EL1.ASIDBits, bits [7:4]: non-zero when 16-bit ASIDs are supported.
if (rt_val & (1UL << 36) && !((get_id_aa64mmfr0_el1() >> 4) & 0xf)) {
return 0xfffffffe;
}

// If hardware updates of access flag and dirty states are not supported by the device, returns an error if the
// kernel tries to enable them.
//
// - HA, bit [39]: Hardware Access flag update in stage 1 translations.
// - HD, bit [40]: Hardware management of dirty state in stage 1 translations.
if ((rt_val & (1 << 39) || rt_val & (1 << 40)) && !(get_id_aa64mmfr1_el1() & 0xf)) {
return 0xfffffffe;
}

// If hierarchical permission disables are not supported by the device, returns an error if the kernel tries to
// enable one of them.
//
// - HPD0, bit [41]: Hierarchical Permission Disables in TTBR0_EL1.
// - HPD1, bit [42]: Hierarchical Permission Disables in TTBR1_EL1.
if ((rt_val & (1 << 41) || rt_val & (1 << 42)) && !(get_id_aa64mmfr1_el1() >> 4 & 0xf)) {
return 0xfffffffe;
}

// Updates TCR_EL1 with the new value if no error was encountered.
set_tcr_el1(rt_val);
return 0;
}
return 0xfffffffe;
}
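
To make the TTBR1_EL1 rule easier to follow, here is a minimal, self-contained sketch of just the "address cannot change" decision once the saved registers are initialized. The helper name ttbr1_write_allowed and the macro names are hypothetical; the masks and the 0x2000 gap are copied from the listing above, and the sketch assumes the saved TTBR1_EL1 value holds only the table address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constants copied from the listing above; the macro names are illustrative. */
#define TTBR_BADDR_MASK        0x0000ffffffffffffULL /* low 48 bits: table address */
#define TCR_A1                 (1ULL << 22)          /* TTBR1_EL1 defines the ASID */
#define IDMAP_TO_SWAPPER_DELTA 0x2000ULL             /* gap accepted by the listing */

/* Hypothetical helper: true when a write of new_val to TTBR1_EL1 would pass
 * the "page table address cannot change" rule of the listing. */
static bool ttbr1_write_allowed(uint64_t saved_ttbr1, uint64_t saved_tcr,
                                uint64_t new_val) {
    /* Assumption: the saved value contains the table address only. */
    assert((saved_ttbr1 & ~TTBR_BADDR_MASK) == 0);

    uint64_t new_baddr = new_val & TTBR_BADDR_MASK;

    /* The switch from the identity mapping PGD to the swapper PGD is only
     * recognized when TTBR1_EL1 defines the ASID. */
    if ((saved_tcr & TCR_A1) &&
        new_baddr - saved_ttbr1 == IDMAP_TO_SWAPPER_DELTA) {
        return true;
    }
    /* Otherwise only a rewrite of the same address is accepted. */
    return new_baddr == saved_ttbr1;
}
```

With a saved address of 0x80000000 and A1 set, a write pointing at 0x80002000 is accepted, while 0x80004000 is rejected. Changing only the ASID bits leaves the address untouched and passes this particular check.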

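CRn 5: AFSR0_EL1, AFSR1_EL1 and ESR_EL1

The next trapped registers we will look at are those with a CRn value of 5, which are handled by hhee_sysregs_crn5.

  • AFSR0_EL1: changes are ignored, but no error is returned.

  • AFSR1_EL1: changes are ignored, and an error is returned.

  • ESR_EL1: changes are accepted without any checks.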

uint64_t hhee_sysregs_crn5(uint64_t rt_val, uint32_t crm, uint32_t op2) {
if (crm == 1 && op2 == 0) {
// AFSR0_EL1 - Nothing happens, but no error is returned.
return 0;
}

if (crm == 1 && op2 != 0) {
// AFSR1_EL1 - An error is returned.
return 0xffffffff;
}

if (crm == 2 && op2 == 0) {
// Updates ESR_EL1 without further checks.
set_esr_el1(rt_val);
return 0;
}

// The rest of the registers should be unreachable with the current configuration.
return 0xffffffff;
}

CRn 6: FAR_EL1

Then we have FAR_EL1, handled by hhee_sysregs_crn6, which the kernel can change as it pleases.

uint64_t hhee_sysregs_crn6(uint64_t rt_val, uint32_t crm, uint32_t op2) {
if (crm != 0 || op2 != 0) {
return 0xfffffffe;
}
set_far_el1(rt_val);
return 0;
}

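CRn 10: MAIR_EL1 and AMAIR_EL1

The next trapped registers on our list are MAIR_EL1 and AMAIR_EL1, which are handled by hhee_sysregs_crn10.

  • MAIR_EL1 can only be changed if the hypervisor structure storing the kernel's system register values has not been initialized yet. Otherwise, if we try to change its value and it differs from the one stored in the hypervisor, the function returns an error.

  • AMAIR_EL1: changes are ignored, but no error is returned.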

uint64_t hhee_sysregs_crn10(uint64_t rt_val, uint32_t crm, uint32_t op2) {
if (op2 != 0) {
return 0xfffffffe;
}

// MAIR_EL1 - Changes are only allowed if sys_regs->mair_el1 has not been set yet. If it has been, returns an error if
// it differs from the value the kernel wants to set.
if (crm == 2) {
sys_regs = current_cpu->sys_regs;
regs_inited = atomic_get(&sys_regs->regs_inited);
if (regs_inited && rt_val != sys_regs->mair_el1) {
return 0xfffffff8;
} else {
set_mair_el1(rt_val);
return 0;
}
}

// AMAIR_EL1 - Changes ignored, but doesn't return an error.
if (crm == 3) {
return 0;
}

return 0xfffffffe;
}

CRn 13: CONTEXTIDR_EL1

Finally, CONTEXTIDR_EL1, handled by hhee_sysregs_crn13, can be changed to any value.

uint64_t hhee_sysregs_crn13(uint64_t rt_val, uint32_t crm, uint32_t op2) {
if (crm != 0) {
return 0xfffffffe;
}

// Updates CONTEXTIDR_EL1 without further checks.
if (op2 == 1) {
set_contextidr_el1(rt_val);
return 0;
}

return 0xffffffff;
}
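
All the hhee_sysregs_crnN handlers above receive crm and op2 values that identify the trapped register. The sketch below shows how these fields can be pulled out of ESR_EL2 for a trapped MSR/MRS instruction (exception class 0b011000), following the ISS layout of the ARMv8-A architecture. The structure and function names are hypothetical and are not taken from the hypervisor binary.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative container for the fields of a trapped MSR/MRS syndrome. */
typedef struct {
    uint8_t op0, op1, op2, crn, crm;
    uint8_t rt;      /* general-purpose register used by the instruction */
    uint8_t is_read; /* Direction bit: 1 = MRS (read), 0 = MSR (write) */
} sysreg_trap_t;

static sysreg_trap_t decode_sysreg_trap(uint64_t esr_el2) {
    /* This decoding is only meaningful for EC 0b011000. */
    assert(((esr_el2 >> 26) & 0x3f) == 0x18);

    sysreg_trap_t t;
    t.op0 = (esr_el2 >> 20) & 0x3;  /* Op0, bits [21:20] */
    t.op2 = (esr_el2 >> 17) & 0x7;  /* Op2, bits [19:17] */
    t.op1 = (esr_el2 >> 14) & 0x7;  /* Op1, bits [16:14] */
    t.crn = (esr_el2 >> 10) & 0xf;  /* CRn, bits [13:10] */
    t.rt = (esr_el2 >> 5) & 0x1f;   /* Rt, bits [9:5] */
    t.crm = (esr_el2 >> 1) & 0xf;   /* CRm, bits [4:1] */
    t.is_read = esr_el2 & 1;        /* Direction, bit [0] */
    return t;
}
```

For example, a write to TTBR1_EL1 from X5 (Op0 = 3, Op1 = 0, CRn = 2, CRm = 0, Op2 = 1) decodes to CRn 2 with op2 equal to 1, which is the TTBR1_EL1 branch of hhee_sysregs_crn2.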

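Instruction and Data Abort Handling

As we explained earlier, the hypervisor is also able to handle the instruction and data aborts that occur when userland or the kernel tries to read, write or execute protected memory. In particular, the hypervisor uses this to check what the kernel writes to its page tables, as we will see in the next section.

Similarly to the trapped instruction handlers, the abort handlers examine the fault-specific information held in the ESR_EL2 system register.

Instruction Abort Handling

When EL0 or EL1 tries to execute invalid/protected memory, the resulting instruction abort is handled by SynchronousExceptionA64 or SynchronousExceptionA32, both of which end up calling handle_inst_abort_from_lower_el.

handle_inst_abort_from_lower_el simply calls trigger_instr_data_abort_handling_in_el1 to let the kernel handle the abort. It also checks for the special case of a stage 2 fault occurring during a stage 1 page table walk, which results in a call to handle_s2_fault_during_s1ptw; we will detail this function later.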

void handle_inst_abort_from_lower_el(saved_regs_t* regs, uint64_t esr_el2) {
// S1PTW, bit [7] = 1: Stage 2 fault on an access made for a stage 1 translation table walk.
if (((esr_el2 >> 7) & 1) == 1) {
handle_s2_fault_during_s1ptw(0);
} else {
// Let the kernel handle the abort.
trigger_instr_data_abort_handling_in_el1();
}
}

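trigger_instr_data_abort_handling_in_el1 (which is also called by handlers other than the instruction abort one) retrieves the Instruction Specific Syndrome (ISS) from the ESR_EL2 register to determine which fault occurred. It also reads the M[0] (EL0/EL1) and M[4] (AArch32/AArch64) bits of the SPSR_EL2 register to determine which exception vector of the kernel to return to. The actual return is done by writing the ELR_EL2 register, and then executing an ERET instruction in SynchronousExceptionA64 or SynchronousExceptionA32.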

void trigger_instr_data_abort_handling_in_el1() {
// ...
esr_el2 = get_esr_el2();
ec = (esr_el2 >> 26) & 0b111111;
switch (ec) {
case 0b001000: /* Trapped VMRS access, from ID group trap */
// IL, bit [25] = 1: 32-bit instruction trapped.
esr_el1 = 1 << 25;
break;

case 0b010010: /* HVC instruction execution in AArch32 state */
case 0b010011: /* SMC instruction execution in AArch32 state */
case 0b010110: /* HVC instruction execution in AArch64 state */
case 0b010111: /* SMC instruction execution in AArch64 state */
// EC, bits [31:26] = 0b001110: Illegal Execution state.
esr_el1 = (0b001110 << 26) | (1 << 25);
break;

case 0b100000: /* Instruction Abort from a lower Exception level */
case 0b100100: /* Data Abort from a lower Exception level */
// - IFSC, bits [5:0]: Instruction Fault Status Code.
// - DFSC, bits [5:0]: Data Fault Status Code.
esr_el1 = esr_el2 & ~0b111111;

el = (get_spsr_el2() >> 2) & 0b1111;
if (el != 0b0000 /* EL0t */) {
// EC, bits [31:26]: Exception Class.
//
// - 0b100000 -> 0b100001: Instruction Abort taken without a change in Exception level.
// - 0b100100 -> 0b100101: Data Abort without a change in Exception level.
esr_el1 |= (1 << 26);
}

// S1PTW, bit [7] = 1: Fault on the stage 2 translation of an access for a stage 1 translation table walk.
if (((esr_el2 >> 7) & 1) != 0) {
// - IFSC, bits [5:0]: Instruction Fault Status Code.
// - DFSC, bits [5:0]: Data Fault Status Code.
//
// If the S1PTW bit is set, then the level refers the level of the stage 2 translation that is translating a
// stage 1 translation walk.
esr_el1 |= 0b101000 | (esr_el2 & 3);
} else {
esr_el1 |= 0b100000;
}
break;

default:
esr_el1 = esr_el2;
}

far_el1 = get_far_el2();
elr_el1 = get_elr_el2();
spsr_el1 = get_spsr_el2();

el = (spsr_el1 >> 2) & 0b1111;
if (el != 0b0000 /* EL0t */) {
if ((spsr_el1 & 1) == 0) {
offset = 0x000 /* Synchronous, Current EL with SP0 */;
} else {
offset = 0x200 /* Synchronous, Current EL with SPx */;
}
} else {
// M[4], bit [4]: Execution state.
//
// - 0b0: AArch64 execution state.
// - 0b1: AArch32 execution state.
if (((spsr_el1 >> 4) & 1) == 0) {
offset = 0x400 /* Synchronous, Lower EL using AArch64 */;
} else {
offset = 0x600 /* Synchronous, Lower EL using AArch32 */;
}
}

set_elr_el2(get_vbar_el1() + offset);
// - M[3:0], bits [3:0] = 0b0101: EL1h.
// - F, bit [6] = 1: FIQ interrupt mask.
// - I, bit [7] = 1: IRQ interrupt mask.
// - A, bit [8] = 1: SError interrupt mask.
// - D, bit [9] = 1: Debug exception mask.
set_spsr_el2((0b0101 | (1 << 6) | (1 << 7) | (1 << 8) | (1 << 9)));
set_elr_el1(elr_el1);
set_spsr_el1(spsr_el1);
set_esr_el1(esr_el1);
set_far_el1(far_el1);
}
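
The vector selection at the end of the function can be isolated in a few lines. The sketch below is an illustration only: the function name is hypothetical, and it assumes that the kernel runs in AArch64 at EL1 and that AArch32 code only exists at EL0, so M[4] is tested first and M[3:2] is then read as the exception level.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the offset, relative to VBAR_EL1, of the synchronous exception
 * vector that matches the interrupted context described by SPSR_EL2. */
static uint32_t el1_sync_vector_offset(uint64_t spsr_el2) {
    /* M[4] = 1: the exception came from AArch32, assumed to be EL0 here. */
    if ((spsr_el2 >> 4) & 1) {
        return 0x600; /* Synchronous, Lower EL using AArch32 */
    }

    uint32_t el = (spsr_el2 >> 2) & 0x3; /* M[3:2]: exception level */
    if (el == 0) {
        return 0x400; /* Synchronous, Lower EL using AArch64 */
    }

    /* Aborts forwarded by the hypervisor can only come from EL0 or EL1. */
    assert(el == 1);
    /* M[0] selects the stack pointer: 0 = SP_EL0 (EL1t), 1 = SP_EL1 (EL1h). */
    return (spsr_el2 & 1) ? 0x200 : 0x000;
}
```

A kernel thread interrupted in EL1h (M[3:0] = 0b0101) is therefore resumed at VBAR_EL1 + 0x200, and a 64-bit userland process at VBAR_EL1 + 0x400.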

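Data Abort Handling

When EL0 or EL1 tries to read or write invalid/protected memory, the resulting data abort is handled by SynchronousExceptionA64 or SynchronousExceptionA32, both of which end up calling handle_data_abort_from_lower_el.

handle_data_abort_from_lower_el also checks for the special case of a stage 2 fault occurring during a stage 1 page table walk, which we will detail later. Otherwise, in the general case, it examines the Instruction Specific Syndrome (ISS) field of the ESR_EL2 system register. If it contains a valid instruction syndrome, it extracts the access information from the ISS, then calls handle_data_abort_on_write if the exception was caused by a write, or handle_data_abort_on_read if it was caused by a read.

If the ISS is unknown and the exception was raised from AArch32, the hypervisor forwards it to the kernel by calling trigger_instr_data_abort_handling_in_el1. If we are in AArch64, handle_data_abort_on_unknown_iss is called.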

void handle_data_abort_from_lower_el(saved_regs_t* regs, uint64_t esr_el2) {
// ...
far_el2 = get_far_el2();
// Special case if the abort occurred during a stage 1 PTW.
//
// S1PTW, bit [7] = 1: Stage 2 fault on an access made for a stage 1 translation table walk.
if (((esr_el2 >> 7) & 1) == 1) {
// WnR, bit [6].
//
// - 0: Abort caused by an instruction reading from memory.
// - 1: Abort caused by an instruction writing to memory.
handle_s2_fault_during_s1ptw(((esr_el2 >> 6) & 1) == 1);
}

// Common case: there is a valid instruction syndrome that contains all the information about the access that caused
// the abort.
//
// ISV, bit [24] = 1: ISS hold a valid instruction syndrome.
else if (((esr_el2 >> 24) & 1) == 1) {
// SRT, bits [20:16]: Syndrome Register Transfer, this field holds register specifier, Xt.
srt = (esr_el2 >> 16) & 0b11111;
reg_val = *(uint64_t*)(&regs->x0 + srt);
// SAS, bits [23:22]: Syndrome Access Size, indicates the size of the access attempted by the faulting operation.
sas = (esr_el2 >> 22) & 0b11;
// SSE, bit [21]: Syndrome Sign Extend, indicates whether the data item must be sign extended.
sse = (esr_el2 >> 21) & 1;
// SF, bit [15]: Indicates if the width of the register accessed by the instruction is 64 bits.
sf = (esr_el2 >> 15) & 1;
// AR, bit [14]: Acquire/Release, did the instruction have acquire/release semantics?
ar = (esr_el2 >> 14) & 1;
wnr = (esr_el2 >> 6) & 1;
// Call a different handler depending on the access type (read or write).
if (wnr == 1) {
ret = handle_data_abort_on_write(reg_val, far_el2, sas, ar);
} else {
ret = handle_data_abort_on_read(reg_val, far_el2, sas, sse, sf, ar);
}
// Move the instruction pointer past the faulting instruction.
//
// IL, bit [25]: Instruction Length for synchronous exceptions (0: 16-bit instruction, 1: 32-bit instruction).
if (ret == 0) {
il = (esr_el2 >> 25) & 1;
set_elr_el2(get_elr_el2() + 2 * (il + 1));
}
}

// Uncommon case: there is no valid instruction syndrome and the data abort came from an AArch32 process, let the
// kernel handle the fault.
//
// M[4], bit [4] = 1: AArch32 execution state.
else if (((get_spsr_el2() >> 4) & 1) == 1) {
trigger_instr_data_abort_handling_in_el1();
}

// Uncommon case: there is no valid instruction syndrome, the hypervisor will try to figure out what instruction
// caused the abort by decoding it.
else {
handle_data_abort_on_unknown_iss(regs);
}
}
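
The field extraction performed in the common case can be summarized by the following sketch, which decodes the data abort syndrome into a structure. The bit positions are the ones quoted in the comments of the listing above; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative view of the data abort ISS fields used by the handler. */
typedef struct {
    bool isv;    /* bit [24]: the syndrome fields below are valid */
    uint8_t sas; /* bits [23:22]: log2 of the access size in bytes */
    bool sse;    /* bit [21]: the loaded value must be sign extended */
    uint8_t srt; /* bits [20:16]: index of the transfer register Xt */
    bool sf;     /* bit [15]: the transfer register is 64 bits wide */
    bool ar;     /* bit [14]: acquire/release semantics */
    bool s1ptw;  /* bit [7]: stage 2 fault during a stage 1 table walk */
    bool wnr;    /* bit [6]: 1 = write, 0 = read */
} dabt_syndrome_t;

static dabt_syndrome_t decode_data_abort(uint64_t esr_el2) {
    uint32_t ec = (esr_el2 >> 26) & 0x3f;
    /* Only data aborts (from a lower or the same EL) carry this layout. */
    assert(ec == 0x24 || ec == 0x25);

    dabt_syndrome_t s;
    s.isv = (esr_el2 >> 24) & 1;
    s.sas = (esr_el2 >> 22) & 0x3;
    s.sse = (esr_el2 >> 21) & 1;
    s.srt = (esr_el2 >> 16) & 0x1f;
    s.sf = (esr_el2 >> 15) & 1;
    s.ar = (esr_el2 >> 14) & 1;
    s.s1ptw = (esr_el2 >> 7) & 1;
    s.wnr = (esr_el2 >> 6) & 1;
    return s;
}
```

A 32-bit store from W3 that faults on a permission check would, for instance, decode to isv = 1, sas = 2 (4 bytes), srt = 3 and wnr = 1.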

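Data Abort on Write

For data aborts caused by a write, handle_data_abort_on_write calls handle_data_abort_on_write_aligned with an address aligned on the access size:

  • if the address is already aligned on the access size, it calls handle_data_abort_on_write_aligned directly;

  • otherwise, it splits the access into multiple smaller sub-accesses that end up being aligned.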

int32_t handle_data_abort_on_write(uint64_t reg_val,
uint64_t fault_va,
int32_t access_size_log2,
bool acquire_release) {
// Simple case: if the fault VA is aligned on the access size.
access_size = 1 << access_size_log2;
if ((fault_va & (access_size - 1)) == 0) {
fault_ipa = virt_to_phys_el1(fault_va);
if (fault_ipa == -1) {
trigger_instr_data_abort_handling_in_el1();
return -1;
}
if (handle_data_abort_on_write_aligned(reg_val, fault_ipa, access_size_log2) != 0) {
return -1;
}
if (acquire_release) {
dmb_ish();
}
return 0;
}

// Complicated case: if the fault VA is not aligned on the access size then iterate on each aligned chunk that are
// contained in the access.
chunk_access_size_log2 = __clz(__rbit(fault_va));
chunk_access_size = 1 << chunk_access_size_log2;
chunk_fault_va = fault_va;
while (access_size != 0) {
chunk_fault_ipa = virt_to_phys_el1(chunk_fault_va);
if (chunk_fault_ipa == -1) {
trigger_instr_data_abort_handling_in_el1();
return -1;
}
if (handle_data_abort_on_write_aligned(reg_val, chunk_fault_ipa, chunk_access_size_log2) != 0) {
return -1;
}
reg_val >>= 8 * chunk_access_size;
chunk_fault_va += chunk_access_size;
access_size -= chunk_access_size;
}

// Emulate A/R semantics.
if (acquire_release) {
dmb_ish();
}
return 0;
}
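
The splitting strategy used in the unaligned case can be illustrated in isolation. In the sketch below, the chunk size is the lowest set bit of the faulting address, which is what the __clz(__rbit(...)) sequence of the listing computes, and the access is then cut into chunks of that size. The chunk_t type and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative description of one aligned sub-access. */
typedef struct {
    uint64_t va;
    uint32_t size;
} chunk_t;

/* Splits an access of access_size bytes (a power of two) at va into aligned
 * chunks, stores them in out and returns how many chunks were produced. */
static size_t split_access(uint64_t va, uint32_t access_size,
                           chunk_t *out, size_t max_chunks) {
    assert(access_size != 0 && (access_size & (access_size - 1)) == 0);
    assert(max_chunks >= 1);

    /* Aligned access: a single chunk covers it. */
    if ((va & (access_size - 1)) == 0) {
        out[0] = (chunk_t){va, access_size};
        return 1;
    }

    /* Unaligned access: the lowest set bit of va is the largest power of two
     * on which va is aligned. It is smaller than access_size here, so it
     * divides it evenly. */
    uint32_t chunk_size = (uint32_t)(va & (~va + 1));
    size_t count = 0;
    while (access_size != 0) {
        assert(count < max_chunks);
        out[count++] = (chunk_t){va, chunk_size};
        va += chunk_size;
        access_size -= chunk_size;
    }
    return count;
}
```

An 8-byte store at 0x1002 is thus emulated as four 2-byte stores at 0x1002, 0x1004, 0x1006 and 0x1008, each consuming the next 16 bits of the register value.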

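handle_data_abort_on_write_aligned checks whether the write, although it raised an exception, should nevertheless be allowed. It does so by calling check_kernel_page_table_write, which returns a verdict:

  • if it is zero, the write is allowed, and perform_faulting_write is called to perform the write on behalf of the kernel;

  • if it is non-zero and the faulting instruction is in the kernel, execution resumes at the next instruction;

  • if it is non-zero and the faulting instruction is in a userland process, the kernel is left to handle the fault.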

int32_t handle_data_abort_on_write_aligned(uint64_t reg_val, uint64_t fault_ipa, int32_t access_size_log2) {
// ...
hvc_lock_acquire();
// Check if the write, most likely into a kernel page tables, is allowed.
verdict = check_kernel_page_table_write(&reg_val, &fault_ipa, &access_size_log2, &fault_pa);
hvc_lock_set();

if (verdict != 0) {
// If the write is not allowed: a faulting kernel instruction is skipped (execution resumes at the next instruction),
// whereas a fault coming from userland is forwarded to the kernel.
//
// - M[3:0], bits [3:0]: AArch64 Exception level and selected Stack Pointer (...) M[3:2] is set to the value of
// PSTATE.EL.
el = (get_spsr_el2() & 0b1111) >> 2;
if (el != 0 /* EL0 */) {
hvc_lock_dec();
return 0;
} else {
trigger_instr_data_abort_handling_in_el1();
hvc_lock_dec();
return verdict;
}
} else {
// If the write is allowed, performs it on behalf of the kernel.
perform_faulting_write(fault_pa, access_size_log2, reg_val);
hvc_lock_dec();
return 0;
}
}

As described above, perform_faulting_write simply maps the target physical address into the hypervisor, performs the write on behalf of the kernel, and then unmaps it.

void perform_faulting_write(uint64_t fault_pa, int32_t access_size_log2, uint64_t reg_val) {
// ...
fault_hva = map_pa_into_hypervisor(fault_pa);
switch (access_size_log2) {
case 0:
*(uint8_t*)fault_hva = reg_val;
break;
case 1:
*(uint16_t*)fault_hva = reg_val;
break;
case 2:
*(uint32_t*)fault_hva = reg_val;
break;
case 3:
*(uint64_t*)fault_hva = reg_val;
break;
}
unmap_hva_from_hypervisor(fault_hva);
}

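Data Abort on Read

For data aborts caused by a read, handle_data_abort_on_read calls handle_data_abort_on_read_aligned with an address aligned on the access size:

  • if the address is already aligned on the access size, it calls handle_data_abort_on_read_aligned directly;

  • otherwise, it splits the access into one or two accesses of 8 bytes each around the target address (e.g. a quadword access at address 0x123 results in an 8-byte access at 0x120 and an 8-byte access at 0x128).

If the access size is smaller than 8 bytes, if the instruction sign-extends the value read, or if the instruction has acquire-release semantics, some additional processing is required.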

int32_t handle_data_abort_on_read(uint64_t* reg_val_ptr,
uint64_t fault_va,
int32_t access_size_log2,
bool sign_extend,
bool sixty_four,
bool acquire_release) {
// Simple case: if the fault VA is aligned on the access size.
access_size = 1 << access_size_log2;
if ((fault_va & (access_size - 1)) == 0) {
if (handle_data_abort_on_read_aligned(reg_val_ptr, fault_va, access_size_log2) != 0) {
return -1;
}
}

// Complicated case: if the fault VA is not aligned on the access size then split it into one or two aligned accesses
// of 8 bytes each.
else {
// Handle the first 8-byte access.
fault_va_align = fault_va & ~7;
if (handle_data_abort_on_read_aligned(&g_reg_val, fault_va_align, 3) != 0) {
return -1;
}
// Handle the second 8-byte access if necessary.
if (fault_va + access_size > fault_va_align + 8 &&
handle_data_abort_on_read_aligned(&g_reg_val + 8, fault_va_align + 8, 3) != 0) {
return -1;
}
// Copy the resulting value at the right place.
if (memcpy_s(reg_val_ptr, 8, &g_reg_val + (fault_va & 7), 8) != 0) {
return -1;
}
}

// If the access size is smaller than 8 bytes, extra work might be needed.
if (access_size < 8) {
// Zero-extend the value (mask out the upper bits) when no sign extension is requested.
if (!sign_extend) {
*reg_val_ptr &= (1UL << (8 * access_size)) - 1;
}
// Downcast the value if needed.
if (!sixty_four) {
*reg_val_ptr = *(uint32_t*)reg_val_ptr;
}
}

// Emulate A/R semantics if needed.
if (acquire_release) {
dmb_ishld();
}
return 0;
}
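
The unaligned read path boils down to combining two aligned 8-byte values and then extending the result. The sketch below illustrates that step with shifts instead of the memcpy_s call of the listing, modelling little-endian memory. The function name and parameters are hypothetical, and the 32-bit downcast controlled by sixty_four is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* lo and hi are the 8-byte values read at the aligned address and at the
 * aligned address + 8 (hi may be 0 when the second read is not needed).
 * offset is the faulting address modulo 8. */
static uint64_t extract_read_value(uint64_t lo, uint64_t hi, unsigned offset,
                                   unsigned size_log2, bool sign_extend) {
    assert(offset < 8 && size_log2 <= 3);

    unsigned bits = 8u << size_log2; /* 8, 16, 32 or 64 */

    /* Bring the requested bytes down to bit 0. */
    uint64_t value = lo >> (8 * offset);
    if (offset != 0) {
        value |= hi << (64 - 8 * offset);
    }

    if (bits < 64) {
        uint64_t mask = (1ULL << bits) - 1;
        value &= mask; /* zero extension */
        if (sign_extend && ((value >> (bits - 1)) & 1)) {
            value |= ~mask; /* replicate the sign bit */
        }
    }
    return value;
}
```

For example, with lo = 0x8877665544332211, hi = 0xBBAA and a 4-byte read at offset 6, the bytes 0x77, 0x88, 0xAA and 0xBB are combined into 0xBBAA8877, or 0xFFFFFFFFBBAA8877 if the instruction sign-extends its result.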

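handle_data_abort_on_read_aligned checks whether the read, although it raised an exception, should nevertheless be allowed. It does so by calling check_kernel_page_table_read, which returns a verdict:

  • if it is zero, the read is allowed, and perform_faulting_read is called to perform the read on behalf of the faulting process;

  • if it is non-zero and the faulting instruction is in the kernel, execution resumes at the next instruction;

  • if it is non-zero and the faulting instruction is in a userland process, the kernel is left to handle the fault.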

int32_t handle_data_abort_on_read_aligned(uint64_t* reg_val_ptr, uint64_t fault_va, int32_t access_size_log2) {
// ...
fault_ipa = virt_to_phys_el1(fault_va);
if (fault_ipa == -1) {
trigger_instr_data_abort_handling_in_el1();
return -1;
}

hvc_lock_inc();
// Check if the read, most likely into a kernel page tables, is allowed.
verdict = check_kernel_page_table_read(&fault_ipa);
if (verdict != 0) {
// If the read is not allowed: a faulting kernel instruction is skipped (execution resumes at the next instruction),
// whereas a fault coming from userland is forwarded to the kernel.
//
// - M[3:0], bits [3:0]: AArch64 Exception level and selected Stack Pointer (...) M[3:2] is set to the value of
// PSTATE.EL.
el = (get_spsr_el2() & 0b1111) >> 2;
if (el != 0 /* EL0 */) {
hvc_lock_dec();
return 0;
} else {
trigger_instr_data_abort_handling_in_el1();
hvc_lock_dec();
return verdict;
}
} else {
// If the read is allowed, performs it on behalf of the kernel.
*(uint64_t*)reg_val_ptr = perform_faulting_read(fault_ipa, access_size_log2);
hvc_lock_dec();
return 0;
}
}

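check_kernel_page_table_read retrieves the stage 2 descriptor of the faulting IPA. If the address is not mapped or is marked as execute-only, the access is denied. Otherwise, it returns the verdict of report_error_on_invalid_access.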

int32_t check_kernel_page_table_read(uint64_t* fault_ipa_ptr) {
// ...
// Check if the fault IPA is mapped in the stage 2.
desc_s2 = get_stage2_page_table_descriptor_for_ipa(*fault_ipa_ptr, &desc_s2_va_range);
if ((desc_s2 & 1) == 0) {
return -1;
}

// Decide what to do depending on the software attributes.
verdict = report_error_on_invalid_access(desc_s2);
sw_attrs = get_software_attrs(desc_s2);
// Change the return code for userland execute-only memory.
if (verdict != 0 && sw_attrs == SOP_XO) {
return -2;
}
return verdict;
}

report_error_on_invalid_access only allows accesses to unmarked addresses, and logs all the other accesses.

int32_t report_error_on_invalid_access(uint64_t desc_s2) {
// ...
phys_addr = desc_s2 & 0xfffffffff000;
sw_attrs = get_software_attrs(desc_s2);
switch (sw_attrs) {
case 0:
return 0 /* allowed */;
case 1:
return -1 /* disallowed */;
case OS_RO:
case OS_RO_MOD:
// If execution at EL1 was not permitted.
//
// XN, bits [54:53]: Execute-Never.
if (((desc_s2 >> 53) & 0b11) == 0b01 || ((desc_s2 >> 53) & 0b11) == 0b10) {
code = 0x202;
segment = "data";
} else {
code = 0x201;
segment = "code";
}
debug_printf(code, "Invalid write to %s read-only %s location 0x%lx", "operating system", segment, phys_addr);
return -1;
case HYP_RO:
case HYP_RO_MOD:
debug_printf(0x203, "Invalid write to %s read-only %s location 0x%lx", "hypervisor-mediated", "data", phys_addr);
return -1;
case SOP_XO:
debug_printf(0x204, "Invalid read or write to %s execution-only %s location 0x%lx", "sop-protected", "code",
phys_addr);
return -1;
}
}

perform_faulting_read maps the target physical address into the hypervisor, performs the read on behalf of the process, and then unmaps it.

uint64_t perform_faulting_read(uint64_t fault_pa, int32_t access_size_log2) {
// ...
fault_hva = map_pa_into_hypervisor(fault_pa);
switch (access_size_log2) {
case 0:
reg_val = *(uint8_t*)fault_hva;
break;
case 1:
reg_val = *(uint16_t*)fault_hva;
break;
case 2:
reg_val = *(uint32_t*)fault_hva;
break;
case 3:
reg_val = *(uint64_t*)fault_hva;
break;
}
unmap_hva_from_hypervisor(fault_hva);
return reg_val;
}

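Data Abort on Unknown ISS

When no valid ISS is present in the ESR_EL2 register, the handle_data_abort_on_unknown_iss function is called. In this case, the hypervisor tries to figure out which instruction caused the abort and decides on a course of action.

The handler first retrieves the faulting PC by reading the ELR_EL2 register, and returns early if it is not aligned on 4 bytes. It then reads the first two bytes of the faulting instruction and calls a different handler depending on the instruction's Op0 field. If the handler returns a value of zero, execution is set to resume at the next instruction.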

int32_t handle_data_abort_on_unknown_iss(saved_regs_t* regs) {
// ...
// Get the IPA of the faulting instruction.
//
// ELR_EL2, Exception Link Register (EL2).
return_address_va = get_elr_el2();
return_address_ipa = virt_to_phys_el1(return_address_va);
if (return_address_ipa == -1) {
return 0;
}

// Unaligned instruction.
if ((return_address_ipa & 3) != 0) {
return -1;
}

// Read the first two bytes of the faulting instruction.
if (read_physical_memory(return_address_ipa, &fault_instr, 2) == 0) {
return -1;
}

// Extract the Op0 field of the instruction and call the appropriate handler.
op0 = (fault_instr >> 25) & 0b1111;
if (instr_handlers_by_op0_field[op0](regs, fault_instr) != 0) {
return -1;
}

// Move the instruction pointer to the next instruction.
set_elr_el2(return_address_va + 4);
return 0;
}

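Op0     Handler
x1x0    handle_loads_and_stores_instructions
101x    handle_branches_exception_generating_and_system_instructions

The function handle_loads_and_stores_instructions also calls a different handler depending on the instruction's Op0 field (the one specific to the encoding of load and store instructions, not to be confused with the Op0 field of the general encoding).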

uint64_t handle_loads_and_stores_instructions(saved_regs_t* regs, uint64_t fault_instr) {
op0 = (fault_instr >> 28) & 0b11;
return by_ldr_str_instr_field_op0[op0](regs);
}

Op0       Handler
0bxx00    handle_load_store_exclusive
0bxx10    handle_load_store_register_pair
0bxx11    handle_load_store_register_immediate
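
The two tables above can be read as bit-pattern matches on the instruction word. The following sketch reproduces that two-level classification on raw A64 encodings. The enumerations and function names are hypothetical; only the bit patterns come from the tables and listings above.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    INSTR_OTHER,
    INSTR_LOAD_STORE,      /* top-level Op0 matches x1x0 */
    INSTR_BRANCH_EXC_SYS   /* top-level Op0 matches 101x */
} instr_class_t;

typedef enum {
    LDST_EXCLUSIVE,          /* 0bxx00 */
    LDST_REGISTER_PAIR,      /* 0bxx10 */
    LDST_REGISTER_IMMEDIATE, /* 0bxx11 */
    LDST_UNHANDLED           /* 0bxx01: no handler listed in the table */
} ldst_class_t;

/* First level: Op0 of the general A64 encoding, bits [28:25]. */
static instr_class_t classify_a64(uint32_t instr) {
    uint32_t op0 = (instr >> 25) & 0xf;
    if ((op0 & 0x5) == 0x4) {
        return INSTR_LOAD_STORE;
    }
    if ((op0 & 0xe) == 0xa) {
        return INSTR_BRANCH_EXC_SYS;
    }
    return INSTR_OTHER;
}

/* Second level: low two bits of the load/store Op0 field, bits [29:28]. */
static ldst_class_t classify_load_store(uint32_t instr) {
    assert(classify_a64(instr) == INSTR_LOAD_STORE);
    switch ((instr >> 28) & 0x3) {
    case 0x0: return LDST_EXCLUSIVE;
    case 0x2: return LDST_REGISTER_PAIR;
    case 0x3: return LDST_REGISTER_IMMEDIATE;
    default:  return LDST_UNHANDLED;
    }
}
```

For instance, 0xF9400020 (LDR X0, [X1]) is classified as a load/store with a register immediate form, 0xA9BF7BFD (STP X29, X30, [SP, #-16]!) as a register pair, and 0xC85F7C20 (LDXR X0, [X1]) as an exclusive access.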

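We will not detail the code of the instruction-specific handlers, but here is a summary of what they do.

  • handle_load_store_exclusive handles load/store exclusive register instructions, such as STXR and LDXR. It is basically a wrapper around handle_data_abort_on_write that deals with synchronization issues.

  • handle_load_store_register_pair handles load/store register pair instructions, such as STP and LDP. It is basically a wrapper around two calls to handle_data_abort_on_write or handle_data_abort_on_read, depending on whether the instruction is a store or a load, one for each of the two registers.

  • handle_load_store_register_immediate handles load/store register immediate instructions, such as STR (immediate) and LDR (immediate). It is basically a wrapper around handle_data_abort_on_write or handle_data_abort_on_read that decodes the immediate value.

  • handle_branches_exception_generating_and_system_instructions mainly handles the DC IVAC and DC ZVA cache maintenance instructions. For DC IVAC, it emulates the instruction by executing DC CIVAC on the target virtual address, and for DC ZVA, it calls handle_data_abort_on_write on each of the targeted data cache blocks.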

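Kernel Page Table Management

This section describes how the hypervisor populates and manages the second stage page tables. Among other things, it must enforce that only kernel code is executable at EL1, and that kernel memory is not accessible from userland. This requires the hypervisor to know the kernel's memory layout. This information can be obtained either by having the kernel specify its layout through HVCs, or by having the hypervisor parse the kernel page tables. Huawei chose to implement the latter, more generic approach, which does not require the kernel's cooperation.

However, the second stage cannot distinguish between EL0 and EL1 accesses. For a given physical memory page, the data access permissions are always the same for both ELs, and this was originally also the case for the instruction execution permissions. If the hypervisor made all pages non-executable by default in the second stage, they could not be used for executable code at EL0. Likewise, if it made kernel memory inaccessible in the second stage to ensure that EL0 cannot access it, the kernel itself would not be able to access it either. The hypervisor is therefore forced to rely on stage 1 permissions to achieve this effect, and needs full control over the kernel page tables.

Note: the shortcoming of the stage 2 instruction execution permissions was addressed by the introduction of FEAT_XNX (ARMv8.2-TTS2UXN, "Execute-never control distinction by Exception level at stage 2"). When this feature is supported, the XN[0] bit is no longer ignored by the hardware, and a page can be made non-executable at EL0 only or at EL1 only.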
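
As an illustration of the note above, the sketch below decodes the XN field (bits [54:53]) of a stage 2 descriptor with and without FEAT_XNX, following the encoding of the ARMv8-A architecture. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative result: whether the page may be executed at each level. */
typedef struct {
    bool el1_exec;
    bool el0_exec;
} s2_exec_perms_t;

static s2_exec_perms_t stage2_exec_perms(uint64_t desc_s2, bool feat_xnx) {
    /* Only meaningful for valid descriptors. */
    assert(desc_s2 & 1);

    unsigned xn = (desc_s2 >> 53) & 0x3; /* XN[1:0], bits [54:53] */
    s2_exec_perms_t perms;

    if (!feat_xnx) {
        /* XN[0] is ignored: bit 54 alone applies to both EL0 and EL1. */
        perms.el1_exec = !(xn & 0x2);
        perms.el0_exec = !(xn & 0x2);
        return perms;
    }

    switch (xn) {
    case 0x0: /* executable at EL1 and EL0 */
        perms.el1_exec = true;
        perms.el0_exec = true;
        break;
    case 0x1: /* executable at EL0 only */
        perms.el1_exec = false;
        perms.el0_exec = true;
        break;
    case 0x2: /* not executable at all */
        perms.el1_exec = false;
        perms.el0_exec = false;
        break;
    default:  /* 0x3: executable at EL1 only */
        perms.el1_exec = true;
        perms.el0_exec = false;
        break;
    }
    return perms;
}
```

This matches the check made in report_error_on_invalid_access, where XN values 0b01 and 0b10 are the two encodings in which execution at EL1 is not permitted.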

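When it comes to the kernel page tables, the hypervisor intervenes in two situations:

  • when they are first initialized, by trapping accesses to SCTLR_EL1 and TTBR1_EL1;

  • and when they are modified or when a translation fault occurs.

We will now detail how the hypervisor processes the kernel page tables, and which changes are allowed.

Processing the Kernel Page Tables

When the kernel boots, it initializes its virtual address space and sets up its page tables. As mentioned earlier, once the base address of the translation table is set in TTBR1_EL1, the hypervisor calls process_kernel_page_tables. This function starts by calling get_kernel_pt_info to retrieve information about the kernel page tables, such as the IPA of the PGD, the number of entries per table, etc. The PGD is then mapped in the second stage using map_page_table_hyp_and_stage2, and each of its entries is processed using process_kernel_page_table, which walks the corresponding page tables and makes sure the stored descriptors have sane values.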

uint64_t process_kernel_page_tables() {
// ...

ret = 0;
was_not_marked = 0;
pt_size_log2 = 0;
pt_level = 0;
nb_pt_entries = 0;
kernel_page_tables_processed = atomic_get(&g_kernel_page_tables_processed);

hvc_lock_acquire();

// Returns if kernel page tables have already been processed.
if (kernel_page_tables_processed) {
goto EXIT;
}

// Retrieves page table information using system register values configured by the kernel.
ret = get_kernel_pt_info(1, &pgd, &pt_size_log2, &pt_level, &nb_pt_entries);
if (ret) {
ret = 0xffffffff;
goto EXIT;
}

// Maps the kernel PGD in the hypervisor's address space and in the second translation stage.
pgd_hva = map_page_table_hyp_and_stage2(pgd, pt_size_log2, pt_level, nb_pt_entries, &was_not_marked);
if (!pgd_hva) {
unmap_hva_from_hypervisor(pgd_hva);
ret = 0xffffffff;
goto EXIT;
}

// Checks that the PGD has the expected size.
if (pt_size_log2 != 0xc) {
unmap_hva_from_hypervisor(pgd_hva);
ret = 0xffffffff;
goto EXIT;
}

if (nb_pt_entries) {
ret = 0;
pt_entries_count = 0;
for (;;) {
pt_ret = process_kernel_page_table(pgd_desc_ptr, pt_size_log2, pt_level, 0);
ret += pt_ret;
if (pt_ret < 0) {
break;
}
if (nb_pt_entries <= ++pt_entries_count) {
goto SUCCESS;
}
}
} else {
SUCCESS:
success = 1;
unmap_hva_from_hypervisor(pgd_hva);
}

// EL0/EL1 execute control distinction at Stage 2 is supported.
if (get_id_aa64mmfr1_el1() >> 0x1c == 1) {
// Disables execution at EL1 for writable pages in the second stage.
process_stage2_page_tables((void (*)(void))set_non_executable_at_el1);
// ...
}
// ...
if (success) {
atomic_set(&g_kernel_page_tables_processed);
ret = 0;
}

EXIT:
hvc_lock_release();
return ret;
}

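In the code snippet below, you can see the code of the function get_kernel_pt_info, which retrieves information about the kernel page tables by parsing the contents of the TCR_EL1 and TTBR1_EL1 registers. The code path that uses TTBR0_EL1 is never used by the hypervisor.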

uint64_t get_kernel_pt_info(int32_t is_kernel,
uint64_t* kernel_pgd_p,
int32_t* page_size_log2_p,
uint32_t* pt_level_p,
uint64_t* nb_pt_entries_p) {
// ...
tcr_el1 = get_tcr_el1();
if (is_kernel) {
// Granule size for the TTBR1_EL1.
tg1 = (tcr_el1 >> 30) & 0b11;
// The size offset of the memory region addressed by TTBR1_EL1.
ttbr_txsz = (tcr_el1 >> 16) & 0x3f;
// TTBR1_EL1, Translation Table Base Register 1.
ttbrx_el1 = get_ttbr1_el1();
page_size_log2 = (tg1 >= 1 && tg1 <= 3) ? 0 : g_tg_to_page_size_log2[tg1];
txsz = (page_size_log2 <= 0xf) ? 0x10 : 0xc;
// ...
} else {
// Granule size for the TTBR0_EL1.
tg0 = (tcr_el1 >> 14) & 0b11;
// The size offset of the memory region addressed by TTBR0_EL1.
ttbr_txsz = tcr_el1 & 0x3f;
// TTBR0_EL1, Translation Table Base Register 0.
ttbrx_el1 = get_ttbr0_el1();
page_size_log2 = (tg0 >= 1 && tg0 <= 3) ? 0 : g_tg_to_page_size_log2[tg0];
txsz = (page_size_log2 <= 0xf) ? 0x10 : 0xc;
// ...
}
// ...
if (txsz < ttbr_txsz) {
txsz = ttbr_txsz;
}
ttbr_addr = ttbrx_el1 & 0xfffffffffffe;
// The page table entries log2.
nb_pt_entries_log2 = page_size_log2 - 3;
// The page size log2.
*page_size_log2_p = page_size_log2;
// Computes the number of page table levels based on the address space size.
*pt_level_p = 4 - (0x3c - txsz) / nb_pt_entries_log2;
// The number of entries in a page table.
*nb_pt_entries_p = 1 << nb_pt_entries_log2;
// Checks if the level 0 page table address is page aligned.
if (!(ttbr_addr & ((1 << page_size_log2) - 1))) {
*kernel_pgd_p = ttbr_addr;
return 0;
}
return 0xffffffff;
}
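
The arithmetic at the end of get_kernel_pt_info is easier to check with concrete numbers. The sketch below reproduces only that computation, using the formulas of the listing above; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative result of the geometry computation. */
typedef struct {
    uint32_t start_level;       /* level of the table referenced by TTBRx_EL1 */
    uint64_t entries_per_table; /* number of 8-byte descriptors per table */
} pt_geometry_t;

static pt_geometry_t pt_geometry(uint32_t page_size_log2, uint32_t txsz) {
    /* 4KB, 16KB or 64KB granules only. */
    assert(page_size_log2 == 12 || page_size_log2 == 14 || page_size_log2 == 16);
    assert(txsz <= 0x3c);

    /* Smallest TxSZ accepted, as in the listing. */
    uint32_t min_txsz = (page_size_log2 <= 0xf) ? 0x10 : 0xc;
    if (txsz < min_txsz) {
        txsz = min_txsz;
    }

    /* Each level resolves log2(page size / 8) bits of the virtual address. */
    uint32_t index_bits = page_size_log2 - 3;

    pt_geometry_t geometry;
    geometry.start_level = 4 - (0x3c - txsz) / index_bits;
    geometry.entries_per_table = 1ULL << index_bits;
    return geometry;
}
```

With 4KB pages and a 39-bit address space (TxSZ = 25), this gives a walk starting at level 1 with 512 entries per table. With a 48-bit address space (TxSZ = 16), the walk starts at level 0.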

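Once we have retrieved the information from the kernel, we call map_page_table_hyp_and_stage2. This function gets the stage 2 descriptor of a stage 1 page table, which in this case is the kernel's PGD. The page table is mapped into the hypervisor before the software attributes of the descriptor are checked:

  • if it has already been marked, we make sure it is marked PTABLE_Ln, where n is the page table level, and we return while noting that we did not change the second stage permissions;

  • otherwise, we map the page table as read-only in the stage 2, and iterate over its entries to:

    • prevent the reuse of an existing page table that has already been marked;

    • verify that contiguous descriptors share the same attributes and point to contiguous memory.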

uint64_t map_page_table_hyp_and_stage2(uint64_t ptable,
uint32_t page_size_log2,
uint32_t pt_level,
uint64_t nb_pt_entries,
uint8_t* was_not_marked) {
// ...
// Retrieves the descriptor of the kernel's page table in the second translation stage.
ptable_desc_s2 = get_stage2_page_table_descriptor_for_ipa(ptable, 0);
if ((ptable_desc_s2 & 1) == 0) {
return 0;
}
page_size = 1 << page_size_log2;
ptable_pa = ptable_desc_s2 & 0xfffffffff000;
ptable_s2_perms = READ | NORMAL_MEMORY;
offset_in_ptable = 0;
// ...
// If the software attributes for the page table have already been set, we just map the address in the hypervisor and
// return the corresponding virtual address.
ptable_software_attrs = get_software_attrs(ptable_desc_s2);
software_attrs = pt_level + PTABLE_L0;
if (software_attrs == ptable_software_attrs) {
*was_not_marked = 0;
return map_pa_into_hypervisor(offset_in_ptable + ptable_pa);
}
// Sets was_not_marked to signal that the stage 2 mapping is new and other initialization procedures outside of this
// function must be applied to this mapping.
*was_not_marked = 1;
// Tries to map the page table in the stage 2 and returns if an error occurs.
if (ptable_software_attrs || map_stage2_memory(ptable, ptable_pa, page_size, ptable_s2_perms, software_attrs)) {
return 0;
}
// Maps the page table's physical address in the hypervisor's address space.
ptable_hva = map_pa_into_hypervisor(offset_in_ptable + ptable_pa);
// Gets a function pointer to a function that checks a page table descriptor based on its level.
pt_check_ops = g_pt_check_ops_by_level[pt_level];
// ...
num_contiguous_entries = 0x10;
// ...
// Computes the position in the virtual address of the index into the current page table level (e.g. 12 at level 3, 21
// at level 2, etc.).
nb_pt_entries_log2 = page_size_log2 - 3;
pt_level_idx_pos = (4 - pt_level) * nb_pt_entries_log2 + 3;
if (!nb_pt_entries) {
return ptable_hva;
}

desc_va_range = 1 << pt_level_idx_pos;
for (desc_idx = 0; desc_idx < nb_pt_entries; desc_idx += num_contiguous_entries) {
contiguous_count = 0;
desc_addr = ptable_hva + 8 * desc_idx;
// Counts the number of contiguous entries in the range being mapped.
for (addr_idx = 0; addr_idx != num_contiguous_entries; addr_idx++) {
contig_desc = desc_addr[addr_idx];
if (contig_desc & 1) {
is_contiguous = pt_check_ops(contig_desc, page_size_log2);
contiguous_count += is_contiguous;
if (is_contiguous < 0) {
// If one of the descriptors is deemed invalid in the page table we mapped, the stage 2 entry of the page
// table gets reset.
goto RESET_PTABLE_FROM_S2;
}
}
}
// If there are enough contiguous entries, check that they are all mapped with the same attributes.
if (contiguous_count) {
if (num_contiguous_entries <= contiguous_count && num_contiguous_entries != 1) {
desc = *(uint64_t*)(ptable_hva + 8 * desc_idx);
desc_attrs = desc & 0x7f000000000b7f;
desc_oa = desc & 0xfffffffff000;
remaining_entries = num_contiguous_entries;
curr_oa = desc_oa & -(desc_va_range * num_contiguous_entries);
// ...
for (;;) {
iter_desc = *desc_addr++;
iter_desc_attrs = iter_desc & 0x7f000000000b7f;
iter_desc_oa = iter_desc & 0xfffffffff000;
// Checks if attributes are the same for the whole range and makes sure we haven't reached the end of the
// contiguous region.
if (desc_attrs != iter_desc_attrs || iter_desc_oa != curr_oa) {
break;
}
curr_oa += desc_va_range;
if (!--remaining_entries) {
continue;
}
}
}
}
RESET_PTABLE_FROM_S2:
map_stage2_memory(ptable, ptable_pa, page_size, EXEC_EL0 | WRITE | READ | NORMAL_MEMORY, 0);
// ...
unmap_hva_from_hypervisor(ptable_hva);
return 0;
}
}
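The index-position formula used in the function above can be checked in isolation. The following is a minimal sketch, not code taken from the hypervisor: pt_level_idx_pos and desc_va_range are local variable names from the snippet, turned here into hypothetical standalone helpers.

```c
#include <stdint.h>

/* Bit position, within a virtual address, of the index into the table at
 * `pt_level`, for a translation granule of 2^page_size_log2 bytes.
 * Each table holds 2^(page_size_log2 - 3) descriptors of 8 bytes. */
static uint32_t pt_level_idx_pos(uint32_t page_size_log2, uint32_t pt_level) {
  uint32_t nb_pt_entries_log2 = page_size_log2 - 3;
  return (4 - pt_level) * nb_pt_entries_log2 + 3;
}

/* Size, in bytes, of the address range covered by one descriptor at `pt_level`. */
static uint64_t desc_va_range(uint32_t page_size_log2, uint32_t pt_level) {
  return UINT64_C(1) << pt_level_idx_pos(page_size_log2, pt_level);
}
```

With a 4 KB granule (page_size_log2 = 12), this yields 12 at level 3 and 21 at level 2, as stated in the snippet's comment, so a single level 2 descriptor covers 2 MB.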

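We then reach the main kernel page table processing function, process_kernel_page_table. It walks the page tables recursively, and each call returns a value, called verdict in the code snippets below, that indicates:

  • if negative, that an error occurred while processing the page tables;

  • if zero, that no region of the physical memory mapped by the table is executable at EL1 or accessible from EL0;

  • if positive, that at least one region of the physical memory mapped by the table is executable at EL1 or accessible from EL0.

During the recursion, depending on the page table level and the descriptor type, process_kernel_page_table performs the operations listed below.

  • For table descriptors:

    • If the current and previous levels prevent the physical memory from being both executable at EL1 and accessible from EL0, it returns a verdict of 0 without processing the current and next-level tables.

    • Otherwise, the current table is mapped read-only and marked as PTABLE_Ln in the second stage, and process_kernel_page_table is then called on each of its descriptors. If none of the regions mapped by the table is executable at EL1 or accessible from EL0, it simply sets the APTable[0] and PXNTable bits of the current descriptor and unmarks/unprotects the table in the second stage. If at least one region is, however, the table must remain read-only.

  • For page and block descriptors:

    • If the previous levels prevent the physical memory from being both executable at EL1 and accessible from EL0, it returns a verdict of 0.

    • If the current or previous levels prevent the physical memory from being executable at EL1, it returns a verdict that depends on whether the region is accessible from EL0.

    • If the previous levels make the region read-only, the region is set as read-only and executable at EL1 in the second stage, made inaccessible from EL0 in the current descriptor, and a verdict of 1 is returned. The same happens if the physical memory is read-only and its dirty state is not tracked.

    • Otherwise, the PXN bit of the current descriptor is set, and the returned verdict depends on whether the region is accessible from EL0.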

int32_t process_kernel_page_table(uint64_t desc_p, uint32_t table_size_log2, uint32_t pt_level, uint64_t prev_attrs) {
// ...

was_not_marked = 0;
verdict = 0;
desc = *desc_p;
table_size = 1 << table_size_log2;
nb_pt_entries_log2 = table_size_log2 - 3;
nb_pt_entries = 1 << nb_pt_entries_log2;

// Computes the position in the virtual address of the index into the current page table level (e.g. 12 at level 3, 21
// at level 2, etc.).
pt_level_idx_pos = (4 - pt_level) * nb_pt_entries_log2 + 3;

// Checks if the descriptor is valid.
if (!(desc & 1)) {
return 0;
}

// Level 0, 1 or 2 tables.
if (pt_level <= 2 && (desc & 2)) {
ptable_ipa = desc & 0xfffffffff000;
// ...
// Retrieves the stage 1 upper attributes of the table descriptor.
desc_ua = desc >> 59;
prev_ua = prev_attrs >> 59;
// Page tables that don't allow access from EL0 and execution at EL1, e.g. kernel data, are ignored. We don't need
// to set the upper attributes in the next levels page tables.
//
// - APTable, bits [62:61]: APTable[0] == 1 -> No access at EL0.
// - PXNTable, bit [59]: PXNTable == 1 -> No exec at EL1.
if ((desc_ua & 5 | prev_ua & 5) == 5) {
return 0;
}
// Gets the upper attributes for the current level.
current_level_attrs = desc & 0x7800000000000000 | prev_attrs;
next_pt_level = pt_level + 1;
// Maps the current page table level.
ptable_hva =
map_page_table_hyp_and_stage2(ptable_ipa, table_size_log2, pt_level + 1, nb_pt_entries, &was_not_marked);
// ...
for (int32_t i = 0; i < 0x1000; i += 8) {
// Processes all the page table entries found in the current level we've just mapped.
ret = process_kernel_page_table(ptable_hva + i, table_size_log2, next_pt_level, current_level_attrs);
if (ret < 0) {
unmap_hva_from_hypervisor(ptable_hva);
if (was_not_marked) {
goto UNMARK_PAGE_TABLE;
}
return ret;
}
verdict += ret;
}
unmap_hva_from_hypervisor(ptable_hva);
// If the page table was made read-only by the call to map_page_table_hyp_and_stage2 and maps no region that is
// executable at EL1 or accessible from EL0, then it doesn't need to be protected as long as the previous levels are
// protected.
if (was_not_marked && !verdict) {
// Sets APTable[0] = 1 and PXNTable = 1 on the descriptor.
*desc_p = desc | (1 << 61) | (1 << 59);
UNMARK_PAGE_TABLE:
// Unmark the page table in the second translation stage.
ptable_s2_desc = get_stage2_page_table_descriptor_for_ipa(ptable_ipa, 0);
ptable_pa = ptable_s2_desc & 0xfffffffff000;
map_stage2_memory(ptable_ipa, ptable_pa, table_size, EXEC_EL0 | WRITE | READ | NORMAL_MEMORY, 0);
// ...
return verdict;
}
}
// Level 3 page or block.
else {
// Table attributes - Bit 59, PXNTable == 1 -> No exec at EL1.
if (prev_attrs >> 59 & 1) {
// Table attributes - Bit 61, APTable[0] == 1 -> No access at EL0.
if (prev_attrs >> 61 & 1) {
return 0;
}
non_executable = 1;
// Page attributes - Bit 6, AP[1]: EL0 Access.
verdict = (desc >> 6) & 1;
}
// Table attributes - Bit 59, PXNTable == 0 -> Exec at EL1.
else {
// Page descriptor - Bit 53, PXN.
non_executable = (desc >> 53) & 1;
verdict = 0;
// Table attributes - Bit 61, APTable[0] == 0 -> Access at EL0.
if (!(prev_attrs >> 61 & 1)) {
// Page attributes - Bit 6, AP[1]: EL0 Access.
verdict = (desc >> 6) & 1;
}
}
// If the region should be executable, the page will be set as RX in the second stage if the dirty state is not
// tracked and the page is not writable.
if (!non_executable) {
// Table attributes - Bit 62, APTable[1] == 1 -> Read-only.
if (prev_attrs >> 62 & 1) {
goto SET_RX;
}
// Page attributes - Bit 52, Contiguous.
if (desc >> 52 & 1) {
num_contiguous_entries = 0x10;
// Gets a pointer to the descriptor of the start of the contiguous block.
desc_contiguous_start_p = desc_p & 0xffffffffffffff80;
// ...
for (int32_t idx = 0; idx < num_contiguous_entries; idx++) {
// Retrieves the next descriptor in the contiguous range.
next_desc = *(uint64_t*)(desc_contiguous_start_p + 8 * idx);
// If one of the descriptor from the contiguous range has its dirty state tracked or is writable, the PXN bit
// is forced and we return.
//
// - DBM, bit [51]: Dirty Bit Modifier.
// - AP[2], bit [7]: read / write access.
if (!(next_desc >> 7 & 1) || (next_desc >> 51 & 1)) {
// Page descriptor - Bit 53, PXN.
*desc_p = desc | (1 << 53);
return verdict;
}
}
// Otherwise we set it as RX in the second stage.
goto SET_RX;
}
// Otherwise, if the dirty state is not tracked and the page is read-only, it is set as RX in the second stage.
//
// - DBM, bit [51]: Dirty Bit Modifier.
// - AP[2], bit [7]: read / write access.
else if ((desc >> 7 & 1) && !(desc >> 51 & 1)) {
SET_RX:
ipa = desc & 0xfffffffff000;
ipa_range_size = 1 << pt_level_idx_pos;
// ...
// Sets the current va range as read-only / executable.
ret = change_stage2_software_attrs_per_ipa_range(ipa, ipa_range_size, EXEC_EL1 | EXEC_EL0 | READ, OS_RO,
1 /* sets the software attributes in the descriptor */);
if (!ret) {
if (pt_level == 3 && !g_has_set_l3_ptable_ro) {
g_has_set_l3_ptable_ro = 1;
}
// Disables EL0 access.
*desc_p = desc & 0xffffffffffffffbf;
return 1;
}
return -1;
}
// Page descriptor - Bit 53, PXN.
*desc_p = desc | (1 << 53);
}
}
return verdict;
}
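The early exit taken for table descriptors can be restated without the shift-and-mask on the upper attributes. The helper below is a hypothetical sketch (its name and macros are not from the hypervisor) that should be equivalent to the `(desc_ua & 5 | prev_ua & 5) == 5` test, assuming the bit positions given in the snippet's comments.

```c
#include <stdbool.h>
#include <stdint.h>

#define PXNTABLE_BIT (UINT64_C(1) << 59) /* no execution at EL1 below this table */
#define APTABLE0_BIT (UINT64_C(1) << 61) /* no access from EL0 below this table */

/* Returns true when the current table descriptor, combined with the
 * attributes inherited from previous levels, already forbids both execution
 * at EL1 and access from EL0, so the subtree does not need to be walked. */
static bool table_can_be_skipped(uint64_t desc, uint64_t prev_attrs) {
  uint64_t combined = desc | prev_attrs;
  return (combined & PXNTABLE_BIT) != 0 && (combined & APTABLE0_BIT) != 0;
}
```

Note that the two restrictions may come from different levels: a PXNTable bit inherited from a parent and an APTable[0] bit set on the current descriptor are enough to skip the table.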

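Kernel Page Table Writes

As we have just seen, kernel page tables are set as read-only in the second stage when they are first processed in map_page_table_hyp_and_stage2, and those that map regions executable at EL1 or accessible from EL0 stay that way. When the kernel writes to these tables, an exception is raised, which lets the hypervisor verify the change. It is now time to detail check_kernel_page_table_write, the function the hypervisor calls to obtain a verdict when such a data abort occurs.

check_kernel_page_table_write makes sure that we are in the situation described above, i.e. that the exception was triggered by the kernel and that the written page is marked as PTABLE_Ln. If the kernel writes to a PGD other than the one currently in use, the write is allowed immediately, because verification will take place once the kernel sets it as the new translation table. In addition, if the write is smaller than 8 bytes, it is converted into a larger access that spans the whole descriptor.

If the written descriptor is currently invalid, the access is allowed, but permissions are changed to enforce non-executability at EL1 (by setting PXN or PXNTable) and non-accessibility from EL0 (by unsetting AP[1] or setting APTable[0]). Otherwise, the verdict is left to is_kernel_write_to_block_page_allowed for block and page descriptors, and to is_kernel_write_to_table_allowed for table descriptors. If the verdict is negative or zero, it is simply returned. If it is positive, however, the current and next-level descriptors are processed by the reset_executable_memory function, and permissions are changed to enforce non-executability at EL1 and non-accessibility from EL0.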
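The permission changes applied to the value the kernel wants to write can be summarized as a pure function. This is a hedged sketch that mirrors the tail of check_kernel_page_table_write as shown in this article; sanitize_new_descriptor and the macros are hypothetical names introduced for illustration.

```c
#include <stdint.h>

#define DESC_TYPE    (UINT64_C(1) << 1)  /* 1: table (levels 0-2) or page (level 3) */
#define AP1_BIT      (UINT64_C(1) << 6)  /* AP[1]: access from EL0 */
#define PXN_BIT      (UINT64_C(1) << 53) /* no execution at EL1 */
#define PXNTABLE_BIT (UINT64_C(1) << 59) /* PXN forced in next levels */
#define APTABLE0_BIT (UINT64_C(1) << 61) /* EL0 access denied in next levels */

/* Returns the descriptor value that ends up being written once the
 * hypervisor has enforced "not executable at EL1, not accessible from EL0". */
static uint64_t sanitize_new_descriptor(uint64_t new_desc, uint32_t pt_level) {
  if (new_desc == 0) {
    return 0; /* Setting a descriptor to invalid is always allowed. */
  }
  if (pt_level == 3 && (new_desc & DESC_TYPE) == 0) {
    return 0; /* Reserved encoding (would-be level 3 block): made invalid. */
  }
  if (pt_level != 3 && (new_desc & DESC_TYPE) != 0) {
    return new_desc | PXNTABLE_BIT | APTABLE0_BIT; /* Table descriptor. */
  }
  return (new_desc & ~AP1_BIT) | PXN_BIT; /* Page or block descriptor. */
}
```

For example, a level 3 page descriptor with AP[1] set comes out with AP[1] cleared and PXN set, while a level 1 table descriptor gains PXNTable and APTable[0].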

int32_t check_kernel_page_table_write(uint64_t* reg_val_ptr,
uint64_t* fault_ipa_ptr,
int32_t* access_size_log2_ptr,
uint64_t* fault_pa_ptr) {
// ...
// Userspace is not allowed to write kernel page tables.
//
// - M[3:0], bits [3:0]: AArch64 Exception level and selected Stack Pointer (...) M[3:2] is set to the value of
// PSTATE.EL.
el = (get_spsr_el2() & 0b1111) >> 2;
if (el == 0 /* EL0 */) {
debug_print(0x201, "Illegal operation from user space");
return -1;
}

// Check if the fault IPA is mapped in the stage 2.
fault_ipa = *fault_ipa_ptr;
desc_s2 = get_stage2_page_table_descriptor_for_ipa(fault_ipa, &desc_s2_va_range);
if ((desc_s2 & 1) == 0) {
return -1;
}

// Translate the fault IPA into a PA for the caller.
*fault_pa_ptr = (desc_s2 & 0xfffffffff000) | (fault_ipa & (desc_s2_va_range - 1));

// Check if the fault IPA belongs to a stage 1 PT of any level.
sw_attrs = get_software_attrs(desc_s2);
if (!is_software_attrs_ptable(sw_attrs)) {
// Decide what to do for the other software attributes.
return report_error_on_invalid_access(desc_s2);
}

// Get the information about the kernel page tables that was gathered when the kernel page tables were processed.
if (get_kernel_pt_info(s2_desc & 1, &kernel_pgd, &kernel_granule_size_log2, &kernel_pt_level,
&kernel_nb_pt_entries) != 0) {
return -1;
}

// Allow writes to other PGDs than the one currently in use by the kernel.
//
// Page tables will be checked once the kernel switches to this new PGD.
fault_pt_level = sw_attrs & 3;
if (fault_pt_level == kernel_pt_level && fault_ipa - kernel_pgd >= 8 * kernel_nb_pt_entries) {
return 0;
}

// If the access is smaller than 8 bytes, the fault IPA and written value will be adjusted to fake a qword access that
// is easier to work with.
if (*access_size_log2_ptr != 3) {
// Compute a mask of the bits updated by the write and the bits values updated by the write.
offset_in_bits = 8 * (*fault_pa_ptr & 7);
access_size_in_bits = 8 << *access_size_log2_ptr;
update_mask = (0xffffffffffffffff >> (64 - access_size_in_bits)) << offset_in_bits;
new_val = *reg_val_ptr << offset_in_bits;
// Align the fault PA on 8 bytes.
*fault_pa_ptr = *fault_pa_ptr & 0xfffffffffffffff8;
// Get the old value from memory.
old_val_hva = map_pa_into_hypervisor(*fault_pa_ptr);
old_val = *old_val_hva;
unmap_hva_from_hypervisor(old_val_hva);
// Update the written value and access size.
*reg_val_ptr = (old_val & ~update_mask) | (new_val & update_mask);
*access_size_log2_ptr = 3;
}

// Get the current value of the descriptor in the stage 1 page tables.
desc_s1_hva = map_pa_into_hypervisor(*fault_pa_ptr);
cur_desc_s1 = *desc_s1_hva;
// Pointer to the value the kernel wants to write at address fault_ipa.
new_desc_s1_ptr = reg_val_ptr;

// If the descriptor is invalid, the write is allowed but the PXN bit will be set and access from EL0 disallowed.
if ((cur_desc_s1 & 1) == 0) {
goto CHANGE_PXN_AND_AP_BITS;
}

// Check if the write is allowed by calling a different function depending on the descriptor type: block/page or
// table.
if (fault_pt_level == 3 || (cur_desc_s1 & 0b10) == 0b00) {
verdict =
is_kernel_write_to_block_page_allowed(desc_s1_hva, *new_desc_s1_ptr, kernel_granule_size_log2, fault_pt_level);
} else {
verdict = is_kernel_write_to_table_allowed(desc_s1_hva, *new_desc_s1_ptr, kernel_granule_size_log2, fault_pt_level);
}

// If verdict is:
//
// - < 0 -> the write is not allowed.
// - == 0 -> the write is allowed.
// - > 0 -> the write is allowed but the PXN bit will be set and access from EL0 disallowed.
if (verdict <= 0) {
unmap_hva_from_hypervisor(desc_s1_hva);
return verdict;
}

CHANGE_PXN_AND_AP_BITS:
// Recursively walk the page tables starting from the stage 1 descriptor and for each executable memory region found,
// remove it from the stage 1 page tables, set the memory as read-write non-exec in the stage 2 and also resets the
// software attributes, effectively unmarking the region.
reset_executable_memory(desc_s1_hva, fault_pt_level);
unmap_hva_from_hypervisor(desc_s1_hva);

// If the kernel page tables have not been processed yet, then the access is allowed.
if (!atomic_get(&g_kernel_page_tables_processed)) {
return 0;
}

// If the descriptor is set to invalid, the access is also allowed.
if (*new_desc_s1_ptr == 0) {
return 0;
}

// If the descriptor is a reserved descriptor (a would-be level 3 block descriptor), change it into an invalid
// descriptor.
if (fault_pt_level == 3 && (*new_desc_s1_ptr & 0b10) == 0b00) {
*new_desc_s1_ptr = 0;
}

// If the descriptor is a table descriptor, set the memory as not executable at EL1 and not accessible from EL0.
//
// - PXNTable, bit [59] = 1: the PXN bit is treated as 1 in all subsequent levels of lookup, regardless of the
// actual value of the bit.
// - APTable, bits [62:61] = 0bx1: Access at EL0 not permitted, regardless of permissions in subsequent levels of
// lookup.
else if (fault_pt_level != 3 && (*new_desc_s1_ptr & 0b10) != 0b00) {
*new_desc_s1_ptr = *new_desc_s1_ptr | (1 << 59) | (1 << 61);
}

// If the descriptor is a page descriptor, set the memory as not executable at EL1 and not accessible from EL0.
//
// - AP, bits [7:6] = 0bx0: Access at EL0 not permitted.
// - PXN, bits [53] = 1: Execution at EL1 not permitted.
else {
*new_desc_s1_ptr = (*new_desc_s1_ptr & ~(1 << 6)) | (1 << 53);
}

return 0;
}
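The conversion of a narrow write into a full 8-byte descriptor write can be isolated as follows. This is a minimal sketch assuming a little-endian memory layout; merge_partial_write is a hypothetical helper, and old_val stands for the 64-bit value currently stored at the 8-byte-aligned fault address.

```c
#include <stdint.h>

/* Builds the 64-bit value that memory would hold after a write of
 * 2^access_size_log2 bytes of `reg_val` at physical address `fault_pa`. */
static uint64_t merge_partial_write(uint64_t old_val, uint64_t reg_val,
                                    uint64_t fault_pa, uint32_t access_size_log2) {
  if (access_size_log2 >= 3) {
    return reg_val; /* Already a full qword access. */
  }
  /* Position of the written bytes inside the 8-byte descriptor. */
  uint32_t offset_in_bits = 8 * (uint32_t)(fault_pa & 7);
  uint32_t access_size_in_bits = 8u << access_size_log2;
  /* Mask of the bits updated by the write. */
  uint64_t update_mask = (UINT64_MAX >> (64 - access_size_in_bits)) << offset_in_bits;
  uint64_t new_val = reg_val << offset_in_bits;
  return (old_val & ~update_mask) | (new_val & update_mask);
}
```

A one-byte write at offset 1 only replaces bits [15:8] of the old descriptor, and every later check can then reason about a complete descriptor value.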

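The role of is_kernel_write_to_block_page_allowed is to check the changes made to the input block or page descriptor and decide whether they are allowed. It does so by computing the bits that differ between the old and new values, along with masks of bits to ignore, so that the following modifications to a descriptor are permitted:

  • toggling the AF bit;

  • setting the PXN and UXN bits;

  • clearing the AP[1] bit;

  • toggling the AP[2] and DBM bits, if the page is read-write or its dirty state is tracked.

If no disallowed change is found, the function returns early. Otherwise, it makes sure that none of the physical memory regions covered by the descriptor is protected by the hypervisor.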

int32_t is_kernel_write_to_block_page_allowed(uint64_t* desc_s1_hva,
uint64_t reg_val,
uint32_t kernel_granule_size_log2,
uint8_t fault_pt_level) {
// ...
// If the new descriptor is also a block descriptor.
if ((reg_val & 1) == 1) {
// Allow setting or unsetting the AF bit and the reserved bits.
//
// AF, bit [10]: Access Flag.
changed_bits = (*desc_s1_hva ^ reg_val) & ~((1 << 10) | (0b111111111 << 55));

// Allow setting the PXN and UXN bits.
//
// - PXN, bit [53]: Privileged eXecute-Never.
// - UXN, bit [54]: Unprivileged eXecute-Never.
ignored_bits_set = ~(reg_val & ((1 << 53) | (1 << 54)));
}

// Otherwise it is an invalid descriptor.
else {
changed_bits = *desc_s1_hva;
ignored_bits_set = 0xffffffffffffffff;
}

// Allow unsetting the AP[1] bit, disallowing access from EL0.
//
// AP, bits [7:6]: AP[1] selects between EL0 or EL1 control.
ignored_bits_unset = ~(*desc_s1_hva & (1 << 6));

// Compute the changed bits to check if the access is allowed.
changed_bits = changed_bits & ignored_bits_set & ignored_bits_unset;

// If access from EL1 is read-write or the dirty state is tracked, ignore setting or unsetting the AP[2] and DBM bits.
//
// - AP, bits [7:6]: AP[2] selects between read-only and read/write access.
// - DBM, bit [51]: Dirty Bit Modifier.
if (((*desc_s1_hva >> 7) & 1) == 0 || ((*desc_s1_hva >> 51) & 1) == 1) {
changed_bits &= ~((1 << 7) | (1 << 51));
}

// No disallowed changes were made.
if (changed_bits == 0) {
return 0 /* allowed */;
}

// Get the output memory range (IPAs) from the descriptor.
region_ipa = *desc_s1_hva & 0xfffffffff000;
if (kernel_granule_size_log2 > 16) {
region_ipa |= (*desc_s1_hva & 0xf000) << 36;
}
region_size = 1 << ((4 - fault_pt_level) * (kernel_granule_size_log2 - 3) + 3);
// ...

// Check if the current output memory range contains protected regions.
for (offset = 0; offset < region_size; offset += desc_s2_va_range) {
desc_s2 = get_stage2_page_table_descriptor_for_ipa(region_ipa + offset, &desc_s2_va_range);
if (!desc_s2_va_range || (desc_s2 & 1) == 0) {
break;
}

sw_attrs = get_software_attrs(desc_s2);
switch (sw_attrs) {
case OS_RO:
case HYP_RO:
case HYP_RO_MOD:
return -1 /* disallowed */;
case OS_RO_MOD:
// If execution at EL1 was not permitted.
//
// XN[1:0], bits [54:53]: Execute-Never.
if (((desc_s2 >> 53) & 0b11) == 0b01 || ((desc_s2 >> 53) & 0b11) == 0b10) {
return -1 /* disallowed */;
}
}
}
return 1 /* allowed, but PXN will be set and access from EL0 disallowed */;
}
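The mask arithmetic at the top of this function is easier to follow on concrete values. The sketch below reproduces it as a standalone helper under the bit positions given in the comments above; disallowed_changes and the macro names are hypothetical, and a return value of 0 corresponds to the early "allowed" exit.

```c
#include <stdint.h>

#define AP1_BIT       (UINT64_C(1) << 6)
#define AP2_BIT       (UINT64_C(1) << 7)
#define AF_BIT        (UINT64_C(1) << 10)
#define DBM_BIT       (UINT64_C(1) << 51)
#define PXN_BIT       (UINT64_C(1) << 53)
#define UXN_BIT       (UINT64_C(1) << 54)
#define RESERVED_BITS (UINT64_C(0x1ff) << 55)

/* Returns the set of changed bits that are NOT covered by one of the
 * permitted modifications; 0 means the write can be allowed as-is. */
static uint64_t disallowed_changes(uint64_t old_desc, uint64_t new_desc) {
  uint64_t changed;
  uint64_t ignored_set;
  if (new_desc & 1) {
    /* AF and the reserved bits may be freely toggled. */
    changed = (old_desc ^ new_desc) & ~(AF_BIT | RESERVED_BITS);
    /* PXN and UXN may be set. */
    ignored_set = ~(new_desc & (PXN_BIT | UXN_BIT));
  } else {
    /* The new descriptor is invalid: every old bit counts as changed. */
    changed = old_desc;
    ignored_set = UINT64_MAX;
  }
  /* AP[1] may be cleared. */
  uint64_t ignored_unset = ~(old_desc & AP1_BIT);
  changed &= ignored_set & ignored_unset;
  /* AP[2] and DBM are ignored if the page is writable or dirty-tracked. */
  if ((old_desc & AP2_BIT) == 0 || (old_desc & DBM_BIT) != 0) {
    changed &= ~(AP2_BIT | DBM_BIT);
  }
  return changed;
}
```

Setting PXN on a read-only page yields 0, whereas clearing PXN or changing the output address leaves a non-zero result, which sends the real function on to the stage 2 protection checks.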

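The role of is_kernel_write_to_table_allowed is to check the changes made to the input table descriptor and decide whether they are allowed. It does so in a way similar to is_kernel_write_to_block_page_allowed and allows setting the PXNTable, UXNTable and APTable bits.

If no disallowed change is found, the function returns early. Otherwise, it calls is_table_hyp_mediated, which recursively checks whether the table pointed to by the descriptor maps protected memory. Depending on this function's return value, the write is either disallowed, or allowed with the descriptor changed so that the memory is not executable at EL1 and not accessible from EL0.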

int32_t is_kernel_write_to_table_allowed(uint64_t* desc_s1_hva,
uint64_t reg_val,
uint32_t kernel_granule_size_log2,
uint8_t fault_pt_level) {
// If the new descriptor is also a table descriptor.
if ((reg_val & 0b11) == 0b11) {
// Allow setting or unsetting the reserved bits.
changed_bits = reg_val & ~((0b1111111111 << 2) | (0b11111111 << 51));
// Allow setting the PXNTable, UXNTable and APTable bits, which effectively allows disabling execution and access.
//
// - PXNTable, bit [59]: the PXN bit is treated as 1 in all subsequent levels of lookup.
// - UXNTable, bit [60]: the UXN bit is treated as 1 in all subsequent levels of lookup.
// - APTable, bits [62:61]: access permissions limit subsequent levels of lookup.
ignored_bits_set = ~(reg_val & ((1 << 59) | (1 << 60) | (0b11 << 61)));
}

// Otherwise it is an invalid descriptor.
else {
changed_bits = *desc_s1_hva ^ reg_val;
ignored_bits_set = 0xffffffffffffffff;
}

// No disallowed changes were made.
if ((changed_bits & ignored_bits_set) == 0) {
return 0 /* allowed */;
}

// Check if the table, or any of the subsequent level tables, contains memory that is protected by the hypervisor.
if (is_table_hyp_mediated(desc_s1_hva, kernel_granule_size_log2, fault_pt_level + 1) < 0) {
return -1 /* disallowed */;
} else {
return 1 /* allowed, but PXN will be set and access from EL0 disallowed */;
}
}

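is_table_hyp_mediated first retrieves the page table IPA from the input descriptor, makes sure that it is marked as PTABLE_Ln in the second stage, and maps it into the hypervisor's address space. It then iterates over each of its descriptors and, for:

  • a table descriptor, recursively calls is_table_hyp_mediated on the next-level table;

  • a page or block descriptor, makes sure that none of the physical memory regions covered by the descriptor is protected by the hypervisor.

If the write is allowed, it sets the PXNTable bit of the stage 1 descriptor and unmarks the table's physical memory page in the second stage.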

int32_t is_table_hyp_mediated(uint64_t* desc_s1_hva, uint32_t kernel_granule_size_log2, uint8_t ptable_level) {
// ...
// Get the page table IPA from the descriptor.
ptable_ipa = *desc_s1_hva & 0xfffffffff000;
if (kernel_granule_size_log2 > 16) {
ptable_ipa |= (*desc_s1_hva & 0xf000) << 36;
}

// Check if the page is mapped in the stage 2 and belongs to any stage 1 PT.
desc_s2 = get_stage2_page_table_descriptor_for_ipa(ptable_ipa, 0);
sw_attrs = get_software_attrs(desc_s2);
if ((desc_s2 & 1) == 0 || !is_software_attrs_ptable(sw_attrs)) {
return 0;
}

// Map the page table into the hypervisor.
ptable = map_pa_into_hypervisor(ptable_ipa);
ptable_size = 1 << kernel_granule_size_log2;

// And iterate over each of its descriptors.
for (desc_index = 0; desc_index < ptable_size / 8; ++desc_index) {
ptable_desc_ptr = (uint64_t*)(ptable + desc_index * 8);

// If it is an invalid descriptor, do nothing.
if ((*ptable_desc_ptr & 1) == 0) {
continue;
}

// If it is a table descriptor, call is_table_hyp_mediated to check it.
if (ptable_level < 3 && (*ptable_desc_ptr & 0b10) == 0b10) {
disallowed = is_table_hyp_mediated(ptable_desc_ptr, kernel_granule_size_log2, ptable_level + 1);
if (disallowed < 0) {
return disallowed;
}
}

// If it is a block or page descriptor.
else {
// Get the output memory range (IPAs) from the descriptor.
output_addr = *ptable_desc_ptr & 0xfffffffff000;
// ...
output_size = 1 << ((4 - ptable_level) * (kernel_granule_size_log2 - 3) + 3);

// Check if the output memory range contains protected regions.
for (offset = 0; offset < output_size; offset += desc_s2_va_range) {
out_desc_s2 = get_stage2_page_table_descriptor_for_ipa(output_addr + offset, &desc_s2_va_range);
if (!desc_s2_va_range || (out_desc_s2 & 1) == 0) {
break;
}

sw_attrs = get_software_attrs(out_desc_s2);
switch (sw_attrs) {
case OS_RO:
case HYP_RO:
case HYP_RO_MOD:
return -1 /* disallowed */;
case OS_RO_MOD:
// If execution at EL1 was not permitted.
//
// XN, bits [54:53]: Execute-Never.
if (((out_desc_s2 >> 53) & 0b11) == 0b01 || ((out_desc_s2 >> 53) & 0b11) == 0b10) {
return -1 /* disallowed */;
}
}
}
}
}

// Set the PXNTable bit of the descriptor.
//
// PXNTable, bit [59]: the PXN bit is treated as 1 in all subsequent levels of lookup.
*desc_s1_hva |= 1 << 59;
dsb_ishst();

// Resets the stage 2 attributes and permissions of the current page table before it gets changed to the new one.
map_stage2_memory(ptable_ipa, desc_s2 & 0xfffffffff000, 1 << kernel_granule_size_log2,
NORMAL_MEMORY | READ | WRITE | EXEC_EL0, 0);
// ...
tlbi_vmalle1is();
return 1 /* allowed, but PXN will be set and access from EL0 disallowed */;
}
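The stage 2 check shared by is_kernel_write_to_block_page_allowed and is_table_hyp_mediated boils down to a small predicate on the software attributes. The sketch below is an illustration only: the numeric values of the enum are made up (the article does not give them), and region_is_protected is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the stage 2 software attributes. */
typedef enum {
  SW_NONE,
  SW_OS_RO,
  SW_OS_RO_MOD,
  SW_HYP_RO,
  SW_HYP_RO_MOD
} sw_attrs_t;

/* Returns true when a stage 2 mapping must not be remapped by the kernel. */
static bool region_is_protected(sw_attrs_t sw_attrs, uint64_t desc_s2) {
  switch (sw_attrs) {
    case SW_OS_RO:
    case SW_HYP_RO:
    case SW_HYP_RO_MOD:
      return true;
    case SW_OS_RO_MOD: {
      /* XN[1:0], bits [54:53] of the stage 2 descriptor: 0b01 and 0b10 are
       * the two encodings in which execution at EL1 is not permitted. */
      uint64_t xn = (desc_s2 >> 53) & 3;
      return xn == 1 || xn == 2;
    }
    default:
      return false;
  }
}
```

In other words, OS_RO_MOD memory only blocks the write when it is not executable at EL1, whereas the three other protected markings always block it.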

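As we saw earlier, check_kernel_page_table_write calls reset_executable_memory when the verdict is positive. If the current descriptor is:

  • a table descriptor and the memory region is executable at EL1, it calls reset_executable_memory on each entry of the table;

  • a block or page descriptor and the memory region is executable at EL1, it sets the descriptor as invalid, then resets the stage 2 permissions of the physical memory region covered by the descriptor and unmarks it.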

void reset_executable_memory(uint64_t* desc_s1_hva, uint8_t fault_pt_level) {
// ...
// If the descriptor is invalid, nothing to do.
if ((*desc_s1_hva & 1) == 0) {
return;
}

// If it is a table descriptor.
if (fault_pt_level < 3 && (*desc_s1_hva & 0b10) == 0b10) {
// And the memory region is executable at EL1.
//
// PXNTable, bit [59].
//
// - 0b1: the PXN bit is treated as 1 in all subsequent levels of lookup.
// - 0b0: has no effect.
if (((*desc_s1_hva >> 59) & 1) == 0) {
// Get the page table IPA from the descriptor.
ptable_ipa = *desc_s1_hva & 0xfffffffff000;

// Get the page table PA in the stage 2.
ptable_desc_s2 = get_stage2_page_table_descriptor_for_ipa(ptable_ipa, &desc_s2_va_range);
ptable_pa = (ptable_desc_s2 & 0xfffffffff000) | (ptable_ipa & (desc_s2_va_range - 1));

// Map the page table into the hypervisor.
ptable = map_pa_into_hypervisor(ptable_pa);
ptable_size = 0x1000;

// And iterate over each of its descriptors and call reset_executable_memory recursively.
for (desc_index = 0; desc_index < ptable_size / 8; ++desc_index) {
ptable_desc_ptr = (uint64_t*)(ptable + desc_index * 8);
reset_executable_memory(ptable_desc_ptr, fault_pt_level + 1);
}

// Unmap the page table from the hypervisor.
unmap_hva_from_hypervisor(ptable);
}
}

// If it is a block or a page descriptor.
else {
// And the memory is executable at EL1.
//
// PXN, bit [53]: Privileged eXecute-Never.
if (((*desc_s1_hva >> 53) & 1) == 0) {
// And the kernel page tables have been processed.
if (atomic_get(&g_kernel_page_tables_processed)) {
// Set the descriptor as invalid.
*desc_s1_hva &= ~1;
dsb_ish();
tlbi_vmalle1is();
dsb_ish();
isb();
ic_ialluis();
// Set the PXN bit.
*desc_s1_hva |= 1 << 53;
dsb_ish();
isb();
// Remap the memory region pointed by the descriptor and resets permissions in the second stage.
region_ipa = *desc_s1_hva & 0xfffffffff000;
region_desc_s2 = get_stage2_page_table_descriptor_for_ipa(region_ipa, 0);
region_size = 1 << (39 - 9 * fault_pt_level);
region_pa = region_desc_s2 & 0xfffffffff000;
map_stage2_memory(region_ipa, region_pa, region_size, NORMAL_MEMORY | READ | WRITE | EXEC_EL0, 0);
// ...
}
}
}
}

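Stage 2 Faults During Stage 1 Page Table Walks

Although it is unrelated to security guarantees, we still need to discuss one more hypervisor feature related to the kernel page tables: the emulation of hardware-managed bits, implemented in handle_s2_fault_during_s1ptw.

If hardware updates of the dirty state and access flag are enabled, the CPU automatically changes the corresponding bits when a memory region is accessed or modified. However, we have seen that most kernel page tables are mapped read-only in the second stage. As a result, an exception is raised when the hardware tries to update a descriptor's bits. The hypervisor has to handle this exception and update the bits manually, which is the purpose of the handle_s2_fault_during_s1ptw function.

In practice, handle_s2_fault_during_s1ptw is a wrapper around update_hw_bits_on_s2_fault_during_s1ptw that propagates the exception to EL1 if the call returns an error.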

void handle_s2_fault_during_s1ptw(bool is_write) {
// Updates the hardware-managed bits and lets the kernel handle the fault.
//
// - HPFAR_EL2: Hypervisor IPA Fault Address Register, holds the IPA of the fault that occurred during the stage 1
// translation table walk. This is NOT the IPA of the VA under translation.
// - FAR_EL2: Fault Address Register (EL2), holds the VA that was being translated in the stage 1.
uint64_t fault_ipa = get_hpfar_el2() << 8;
if (update_hw_bits_on_s2_fault_during_s1ptw(fault_ipa, get_far_el2(), is_write)) {
trigger_instr_data_abort_handling_in_el1();
}
}
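The `<< 8` applied to HPFAR_EL2 deserves a short explanation: the register's FIPA field sits in bits [39:4] and holds bits [47:12] of the faulting IPA, so shifting right by 4 and left by 12 amounts to a left shift by 8. The sketch below makes this explicit; hpfar_to_fault_ipa is a hypothetical helper, and it assumes 48-bit IPAs and masks the non-address bits, which the original snippet does not do.

```c
#include <stdint.h>

/* FIPA field of HPFAR_EL2, bits [39:4], holding IPA bits [47:12]. */
#define HPFAR_FIPA_MASK UINT64_C(0x000000fffffffff0)

/* Converts a raw HPFAR_EL2 value into the page-aligned faulting IPA. */
static uint64_t hpfar_to_fault_ipa(uint64_t hpfar_el2) {
  uint64_t fipa = hpfar_el2 & HPFAR_FIPA_MASK;
  return fipa << 8; /* (fipa >> 4) << 12 */
}
```

The result is always page-aligned: the offset within the page is not reported by HPFAR_EL2, which is why update_hw_bits_on_s2_fault_during_s1ptw recomputes the descriptor index from the translated VA.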

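update_hw_bits_on_s2_fault_during_s1ptw performs the actual update of the hardware-managed bits, provided that the faulting stage 1 descriptor is valid and mapped as PTABLE_Ln in the second stage.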

int32_t update_hw_bits_on_s2_fault_during_s1ptw(uint64_t fault_ipa, uint64_t translated_va, bool is_write) {
hvc_lock_inc();

// Check if the fault IPA is mapped in the stage 2.
desc_s2 = get_stage2_page_table_descriptor_for_ipa(fault_ipa, 0);
if ((desc_s2 & 1) == 0) {
hvc_lock_dec();
return -1;
}

// Check if the fault IPA belongs to a stage 1 PT of any level.
sw_attrs = get_software_attrs(desc_s2);
if (!is_software_attrs_ptable(sw_attrs)) {
hvc_lock_dec();
// Decide what to do for the other software attributes.
return report_error_on_invalid_access(desc_s2);
}

// Map the stage 1 page table descriptor that caused the fault on access.
fault_pt_level = sw_attrs & 3;
desc_index = get_pt_desc_index(translated_va, fault_pt_level);
fault_page_pa = desc_s2 & 0xfffffffff000;
desc_s1_hva = map_pa_into_hypervisor((8 * desc_index) | fault_page_pa);
desc_s1 = *desc_s1_hva;

// Check if the stage 1 page table descriptor was valid.
if ((desc_s1 & 1) == 0) {
unmap_hva_from_hypervisor(desc_s1_hva);
hvc_lock_dec();
return 0;
}

// Check if the hypervisor needs to update the access flag.
//
// - HA, bit [39] = 1: Stage 1 Hardware Access flag update enabled.
// - AF, bit [10] = 0: Access Flag.
if (((get_tcr_el1() >> 39) & 1) == 1 && ((desc_s1 >> 10) & 1) == 0) {
// Change the AF to 1.
if (exclusive_load(desc_s1_hva) == desc_s1) {
exclusive_store(desc_s1 | (1 << 10), desc_s1_hva);
}
}

// Check if the hypervisor needs to update the dirty state.
//
// - HD, bit [40] = 1: Stage 1 hardware management of dirty state enabled.
// - AP, bits [7:6] = 0b1x: Read-only access from EL1.
// - DBM, bit [51] = 1: Dirty Bit Modifier.
else if (is_write && ((get_tcr_el1() >> 40) & 1) == 1 && ((desc_s1 >> 7) & 1) == 1 && ((desc_s1 >> 51) & 1) == 1) {
// Change the AP bits to read/write from EL1.
if (exclusive_load(desc_s1_hva) == desc_s1) {
exclusive_store(desc_s1 & ~(1 << 7), desc_s1_hva);
}
tlbi_vaale1is(desc_s1);
}

unmap_hva_from_hypervisor(desc_s1_hva);
hvc_lock_dec();
return 0;
}
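The decision logic for the two emulated updates can be exercised outside the hypervisor. In this sketch, a C11 compare-and-swap is swapped in for the exclusive load/store pair of the original, and the TCR_EL1 value is passed as a parameter instead of being read from the system register; emulate_hw_bits and the macros are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define AP2_BIT (UINT64_C(1) << 7)  /* AP[2]: read-only when set */
#define AF_BIT  (UINT64_C(1) << 10) /* Access Flag */
#define DBM_BIT (UINT64_C(1) << 51) /* Dirty Bit Modifier */
#define TCR_HA  (UINT64_C(1) << 39) /* hardware Access flag update enabled */
#define TCR_HD  (UINT64_C(1) << 40) /* hardware dirty state management enabled */

/* Emulates the hardware update of the AF or dirty state of a stage 1
 * descriptor. At most one of the two updates is applied per fault. */
static void emulate_hw_bits(_Atomic uint64_t* desc_p, uint64_t tcr_el1, bool is_write) {
  uint64_t desc = atomic_load(desc_p);
  if ((desc & 1) == 0) {
    return; /* Invalid descriptor: nothing to update. */
  }
  if ((tcr_el1 & TCR_HA) != 0 && (desc & AF_BIT) == 0) {
    /* Set the access flag, unless the descriptor changed in the meantime. */
    atomic_compare_exchange_strong(desc_p, &desc, desc | AF_BIT);
  } else if (is_write && (tcr_el1 & TCR_HD) != 0 &&
             (desc & AP2_BIT) != 0 && (desc & DBM_BIT) != 0) {
    /* Mark the page dirty by making it writable from EL1. */
    atomic_compare_exchange_strong(desc_p, &desc, desc & ~AP2_BIT);
  }
}
```

Because of the else-if, a write to a page whose access flag is still clear only gets AF set on the first fault; the dirty state update happens on a later fault, once AF is set.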

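In this section, we saw how the hypervisor enforces the following restrictions on the kernel page tables.

  • Once the hypervisor has processed its page tables, the kernel can no longer change its PGD.

  • While processing the page tables, the hypervisor looks for all physical memory regions that need to be executable at EL1 or accessible from EL0. Their page tables are made read-only in the second stage. For the other regions, non-executability at EL1 and non-accessibility from EL0 are enforced using the APTable[0] and PXNTable bits of table descriptors, in addition to the AP and PXN bits of page/block descriptors, and the page tables remain writable.

  • When the kernel modifies its page tables, the hypervisor makes sure that it neither creates a region executable at EL1 or accessible from EL0, nor removes a protected region.

Hypervisor and Secure Monitor Calls

To communicate with higher ELs, the kernel can send requests using the HVC and SMC instructions. Although SMCs are not meant for the hypervisor, we saw in the exception handling section that HHEE is configured to intercept them, in order to filter arguments coming from the kernel, be notified of power management events, etc.

Hypervisor Calls

The exception handling section introduced hhee_handle_hvc_instruction, the handler for HVCs coming from the kernel. HVCs are split into four groups based on their ID, each handled by its own function:

  • handle_hvc_c0xxxxxx: related to Arm Architecture Service SMCs;

  • handle_hvc_c4xxxxxx: related to Arm Power State Coordination Interface SMCs;

  • handle_hvc_c6xxxxxx: related to HHEE security features;

  • handle_hvc_c9xxxxxx: related to HHEE logging features.

Because the SMC handlers are implemented in the secure monitor and the logging system is not particularly relevant, the rest of this section focuses on the HVC handlers in the 0xC6001000-0xC60010FF range, which implement the memory protection features for the kernel and userspace.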

void hhee_handle_hvc_instruction(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
if ((x0 & 0x80000000) != 0) {
switch (x0 & 0x3f000000) {
case 0:
handle_hvc_c0xxxxxx(x0, x1, x2, x3, regs);
return;
case 0x4000000:
handle_hvc_c4xxxxxx(x0, x1, x2, x3, regs);
return;
case 0x6000000:
handle_hvc_c6xxxxxx(x0, x1, x2, x3, regs);
return;
case 0x9000000:
handle_hvc_c9xxxxxx(x0, x1, x2, x3, regs);
return;
}
}
// ...
}

When execution reaches handle_hvc_c6xxxxxx, the hypervisor dispatches the call to the corresponding HVC handler in the hvc_handlers array.

void handle_hvc_c6xxxxxx(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
if ((x0 & 0xff00) == 0x1000) {
// Calls the corresponding handler with the arguments stored in the general purpose registers X0, X1, X2 and X3 set
// by the kernel before making the HVC.
hvc_handlers[x0 & 0xff](x0, x1, x2, x3, regs);
} else {
/* ... */
}
}
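Put together, the two dispatch functions amount to decoding three fields of the HVC ID. The sketch below combines their checks into one hypothetical helper, decode_hvc_id, using only the masks that appear in the snippets above; the hvc_decode_t type is introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  bool is_hhee_security_call; /* true if the ID reaches the hvc_handlers array */
  uint8_t handler_index;      /* index into hvc_handlers, valid if the flag is set */
} hvc_decode_t;

/* Decodes an HVC ID (passed in X0) the way hhee_handle_hvc_instruction and
 * handle_hvc_c6xxxxxx do for the HHEE security features group. */
static hvc_decode_t decode_hvc_id(uint64_t x0) {
  hvc_decode_t out = {false, 0};
  if ((x0 & 0x80000000) == 0) {
    return out; /* Top bit of the 32-bit ID must be set. */
  }
  if ((x0 & 0x3f000000) != 0x06000000) {
    return out; /* Not the 0xC6xxxxxx group. */
  }
  if ((x0 & 0xff00) != 0x1000) {
    return out; /* Outside the range served by hvc_handlers. */
  }
  out.is_hhee_security_call = true;
  out.handler_index = (uint8_t)(x0 & 0xff);
  return out;
}
```

For instance, 0xC6001030 decodes to handler index 0x30, while 0xC4000001 belongs to the power state coordination group and is rejected by this helper.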

The table below lists the implemented HVC handlers and gives a brief description of their functionality.

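HVC ID | Name | Description
0xC6001030 | HKIP_HVC_RO_REGISTER | Sets a memory region as read-only and marks it as OS_RO.
0xC6001031 | HKIP_HVC_RO_UNREGISTER | Does nothing (unused in the kernel).
0xC6001032 | HKIP_HVC_RO_MOD_REGISTER | Sets a memory region as read-only and marks it as OS_RO_MOD.
0xC6001033 | HKIP_HVC_RO_MOD_UNREGISTER | Resets a memory region and unmarks it from OS_RO_MOD.
0xC6001040 | HKIP_HVC_ROWM_REGISTER | Sets a memory region as read-only and marks it as HYP_RO.
0xC6001041 | HKIP_HVC_ROWM_UNREGISTER | Does nothing (unused in the kernel).
0xC6001042 | HKIP_HVC_ROWM_MOD_REGISTER | Sets a memory region as read-only and marks it as HYP_RO_MOD.
0xC6001043 | HKIP_HVC_ROWM_MOD_UNREGISTER | Resets a memory region and unmarks it from HYP_RO_MOD.
0xC6001050 | HKIP_HVC_ROWM_SET_BIT | Sets a bit of a page in a HYP_RO or HYP_RO_MOD memory region.
0xC6001051 | HKIP_HVC_ROWM_WRITE | Copies a buffer into a HYP_RO or HYP_RO_MOD memory region.
0xC6001054-0xC6001057 | HKIP_HVC_ROWM_WRITE_{8,16,32,64} | Writes a value in a HYP_RO or HYP_RO_MOD memory region.
0xC6001058 | HKIP_HVC_ROWM_SET | Sets a HYP_RO or HYP_RO_MOD memory region to a value.
0xC6001059 | HKIP_HVC_ROWM_CLEAR | Sets a HYP_RO or HYP_RO_MOD memory region to zero.
0xC600105A-0xC600105D | HKIP_HVC_ROWM_CLEAR_{8,16,32,64} | Zeroes a value in a HYP_RO or HYP_RO_MOD memory region.
0xC6001060 | HKIP_HVC_XO_REGISTER | Sets a memory region as execute-only at EL0 and marks it as SOP_XO.
0xC6001082 | HHEE_LKM_UPDATE | Sets a memory region as executable at EL1 and marks it as OS_RO_MOD.
0xC6001089 | HHEE_HVC_TOKEN | Returns a random value, called the clarify token, required by HHEE_LKM_UPDATE.
0xC600108A | HHEE_HVC_ENABLE_TVM | Sets a global variable in the hypervisor that causes HCR_EL2.TVM to be re-enabled when a CPU is suspended.
0xC6001090 | HHEE_HVC_ROX_TEXT_REGISTER | Sets a memory region as executable at EL1 in the second stage only and marks it as OS_RO (unused in the kernel).
0xC6001091 | HHEE_HVC_VMMU_ENABLE | Enables the second stage of address translation (unused in the kernel).
0xC60010C8-0xC60010CB | HKIP_HVC_ROWM_XCHG_{8,16,32,64} | Exchanges a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010CC-0xC60010CF | HKIP_HVC_ROWM_CMPXCHG_{8,16,32,64} | Compares and exchanges a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010D0-0xC60010D3 | HKIP_HVC_ROWM_ADD_{8,16,32,64} | Adds to a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010D4-0xC60010D7 | HKIP_HVC_ROWM_OR_{8,16,32,64} | ORs a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010D8-0xC60010DB | HKIP_HVC_ROWM_AND_{8,16,32,64} | ANDs a value in a HYP_RO or HYP_RO_MOD memory region.
0xC60010DC-0xC60010DF | HKIP_HVC_ROWM_XOR_{8,16,32,64} | XORs a value in a HYP_RO or HYP_RO_MOD memory region.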

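Hypervisor Side

Let's first look at the hypervisor call handlers. This will give us a good idea of the different protections the kernel can apply to memory regions through the hypervisor.

Kernel Read-Only Memory Protection

There are two sets of HVC handlers that can be used to manage read-only memory regions at EL1. The first one, made of hkip_hvc_ro_register and hkip_hvc_ro_unregister, is used for kernel data.

hkip_hvc_ro_register sets a memory range as read-only and marks it as OS_RO in the second stage.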

void hkip_hvc_ro_register(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
hvc_lock_acquire();
change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL0 | READ, OS_RO, 1);
hvc_lock_release();
tlbi_vmalle1is();
// ...
}

hkip_hvc_ro_unregister always returns an error: memory marked by hkip_hvc_ro_register cannot be unmarked.

void hkip_hvc_ro_unregister(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
regs->x0 = -2;
// ...
}

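The second set of HVC handlers, made of hkip_hvc_ro_mod_register and hkip_hvc_ro_mod_unregister, is used for kernel modules, which can be loaded and unloaded after the kernel has been initialized.

hkip_hvc_ro_mod_register sets a memory range as read-only and marks it as OS_RO_MOD in the second stage.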

void hkip_hvc_ro_mod_register(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
hvc_lock_acquire();
change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL0 | READ, OS_RO_MOD, 1);
hvc_lock_release();
tlbi_vmalle1is();
// ...
}

hkip_hvc_ro_mod_unregister unmarks a memory range marked as OS_RO_MOD in the second stage, effectively making it writable again.

void hkip_hvc_ro_mod_unregister(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
hvc_lock_acquire();
change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL0 | READ, OS_RO_MOD, 0);
hvc_lock_release();
tlbi_vmalle1is();
// ...
}

Kernel Executable Memory Protection

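Similarly to kernel read-only memory protection, memory regions can also be made executable at EL1. For kernel code, this is done automatically when the hypervisor processes the page tables. For kernel modules, it is done using the hhee_lkm_update HVC. This HVC can only be called if the caller knows a random value generated at boot time, which can be obtained using the hhee_hvc_token HVC.

hhee_hvc_token returns a 64-bit random value called the clarify token. If it has not been initialized yet, it is randomly generated, stored in a global variable, and returned to the caller. Once the kernel page tables have been processed, the clarify token can no longer be obtained, which means it can only be retrieved early in the kernel boot process.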

void hhee_hvc_token(uint64_t hvc_id, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
// The token cannot be obtained after the kernel page tables have been processed.
if (current_cpu->kernel_pt_processed || atomic_get(&current_cpu->sys_regs->regs_inited)) {
regs->x0 = 0xfffffff8;
return;
}

spin_lock(&g_clarify_token_lock);
if (!g_clarify_token_set) {
// Checks if a clarify token was set earlier in the bootchain, if not derive one from the Counter-timer Physical
// Count register.
g_clarify_token = g_boot_clarify_token;
if (!g_clarify_token) {
g_clarify_token = (0x5deece66d * get_cntpct_el0() + 0xd) & 0xffffffffffff;
}
g_clarify_token_set = 1;
}
spin_unlock(&g_clarify_token_lock);
regs->x0 = 0;
regs->x1 = g_clarify_token;
// ...
}
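
The fallback path above is a single step of a linear congruential generator seeded with the physical counter. The following standalone sketch reproduces only that arithmetic. The name derive_clarify_token is hypothetical, and the counter is passed in as a plain argument instead of being read from CNTPCT_EL0. It illustrates that, when no token comes from the boot chain, the result never exceeds 48 bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the fallback path of hhee_hvc_token:
 * one linear congruential step applied to the counter value, truncated
 * to 48 bits.
 */
static uint64_t derive_clarify_token(uint64_t counter)
{
    const uint64_t multiplier = 0x5deece66dULL;
    const uint64_t increment = 0xdULL;
    const uint64_t mask48 = 0xffffffffffffULL;

    /* Unsigned arithmetic wraps modulo 2^64 before the 48-bit mask. */
    return (multiplier * counter + increment) & mask48;
}
```

Calling derive_clarify_token with two close counter values gives two related tokens, which is one reason a token provided earlier in the boot chain, when available, takes precedence in the handler.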

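hhee_lkm_update first verifies that the clarify token given by the caller matches the one stored in the global variable. If so, it makes the target memory range executable at EL1 and marks it as OS_RO_MOD in the second stage. It finally calls unset_pxn_stage1_by_ipa, which unsets the PXN bit in the stage 1 page tables, effectively making the memory executable at EL1.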

uint64_t hhee_lkm_update(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t clarify_token, saved_regs_t* regs) {
// ...
spin_lock(&g_clarify_token_lock);
if (!g_clarify_token_set || clarify_token != g_clarify_token) {
spin_unlock(&g_clarify_token_lock);
return 0xfffffff8;
}
spin_unlock(&g_clarify_token_lock);

hvc_lock_acquire();

ret = 0xfffffffd;
if (get_kernel_pt_info(1, &pgd, &pt_size_log2, &pt_level, &nb_pt_entries)) {
goto EXIT;
}

ret = 0xffffffff;
if (pt_size_log2 != 0xc) {
goto EXIT;
}

ret = change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL1 | EXEC_EL0 | READ, OS_RO_MOD, 1);
if (ret) {
goto EXIT;
}

// On our device, nb_pt_entries = 0x200 and pt_level = 0, so the computation below adds 1 << 48 to vaddr, which is the
// same as subtracting 0xffff000000000000 modulo 2^64. It effectively converts a virtual address into an offset into
// the upper VA range.
offset = vaddr + (nb_pt_entries << (39 - 9 * pt_level));

// Walk the stage 1 page tables and remove the PXN bit of descriptor of each of the pages contained in the virtual
// memory region.
pgd_desc = pgd | 3;
while (size > 0) {
ret = unset_pxn_stage1_by_ipa(&pgd_desc, offset, pt_level, nb_pt_entries);
if (ret) {
goto EXIT;
}
offset += 0x1000;
size -= 0x1000;
}

EXIT:
hvc_lock_release();
return ret;
}
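
The offset computation relies on unsigned wrap-around, which is easy to misread. As a minimal sketch, assuming the values observed on the device (nb_pt_entries = 0x200 and pt_level = 0), the hypothetical helper below shows that adding the size covered by the top-level table strips the 0xffff prefix of an upper-range virtual address.

```c
#include <stdint.h>

/*
 * Hypothetical helper: converts a virtual address of the upper VA range
 * into an offset, using the same arithmetic as hhee_lkm_update.
 */
static uint64_t upper_va_offset(uint64_t vaddr, uint64_t nb_pt_entries,
                                uint32_t pt_level)
{
    /* Size of the region covered by a whole table at this level. */
    uint64_t span = nb_pt_entries << (39 - 9 * pt_level);

    /* The addition wraps modulo 2^64 for upper-range addresses. */
    return vaddr + span;
}
```

With these parameters, span is 1 << 48, so upper_va_offset(0xffff000012345000, 0x200, 0) yields 0x12345000.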

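unset_pxn_stage1_by_ipa maps the input table into the hypervisor. If the PXNTable bit of the input descriptor is set, it iterates over all the next-level descriptors and sets their PXNTable or PXN bit, depending on their type, before unsetting the PXNTable bit of the input descriptor.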

uint64_t unset_pxn_stage1_by_ipa(uint64_t* pt_desc_ptr, uint64_t offset, uint32_t pt_level, uint64_t nb_pt_entries) {
// ...
pt_desc = *pt_desc_ptr;
pt_ipa = pt_desc & 0xfffffffff000;
pt_s2_desc = get_stage2_page_table_descriptor_for_ipa(pt_ipa, NULL);

// Returns an error if the stage 2 descriptor for the page table is invalid.
if ((pt_s2_desc & 1) == 0) {
return 0xfffffff7;
}

// Otherwise, map the table into the hypervisor to process it.
pt_pa = pt_s2_desc & 0xfffffffff000;
pt_hva = map_pa_into_hypervisor(pt_pa);

// PXNTable, bit [59]: PXN limit for subsequent levels of lookup.
if (((pt_desc >> 59) & 1) != 0) {
for (uint64_t i = 0; i < nb_pt_entries; i++) {
sub_desc = pt_hva[i];
// If the current subsequent descriptor is valid...
if ((sub_desc & 1) != 0) {
// If the current descriptor is a table descriptor, sets the PXNTable bit.
if (pt_level < 3 && (sub_desc & 0b10) != 0) {
desc = sub_desc | (1ULL << 59);
}
// Otherwise, sets the PXN bit.
else {
desc = sub_desc | (1ULL << 53);
}
pt_hva[i] = desc;
}
}
// Unsets PXNTable in the input descriptor.
*pt_desc_ptr = pt_desc & ~(1ULL << 59);
}

pt_level_idx_pos = 39 - 9 * pt_level;
pt_level_next_mask = (1ULL << pt_level_idx_pos) - 1;

sub_desc_idx = offset >> pt_level_idx_pos;
sub_desc = pt_hva[sub_desc_idx];

// Returns an error if the subsequent descriptor is invalid.
if ((sub_desc & 1) == 0) {
ret = 0xfffffff7;
}

else if (pt_level <= 2) {
// If the subsequent descriptor is a table, call unset_pxn_stage1_by_ipa recursively.
if ((sub_desc & 0b10) != 0) {
ret = unset_pxn_stage1_by_ipa(&pt_hva[sub_desc_idx], offset & pt_level_next_mask, pt_level + 1, 0x200);
}
// Otherwise, returns an error if it's a block descriptor.
else {
ret = 0xfffffff8;
}
}

// Returns an error if the subsequent descriptor is reserved (i.e. level 3 block descriptor).
else if ((sub_desc & 0b10) == 0) {
ret = 0xfffffff7;
}

// Returns an error if the page is writable.
//
// AP, bits [7:6]: AP[2] selects between read-only and read/write access.
else if (((sub_desc >> 7) & 1) == 0) {
ret = 0xfffffff8;
}

// Otherwise, unsets the PXN bit and flushes the page tables.
//
// PXN, bit [53]: The Privileged execute-never field.
else {
pt_hva[sub_desc_idx] = sub_desc & ~(1ULL << 53);
dsb_ish();
tlbi_vaale1is();
dsb_ish();
isb();
ret = 0;
}

unmap_hva_from_hypervisor(pt_hva);
return ret;
}
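
The first half of unset_pxn_stage1_by_ipa pushes the PXNTable restriction one level down so that a single page can later be made executable without affecting its neighbours. The sketch below isolates that step on a plain in-memory array. The function and macro names are hypothetical, and mapping the table into the hypervisor is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define DESC_VALID    (1ULL << 0)
#define DESC_TABLE    (1ULL << 1)  /* table descriptor below level 3 */
#define DESC_PXN      (1ULL << 53)
#define DESC_PXNTABLE (1ULL << 59)

/*
 * Hypothetical helper: if the parent descriptor carries PXNTable, move the
 * restriction to every valid entry of the table it points to, then clear
 * it in the parent. "level" is the level of the entries in "table".
 */
static void push_down_pxntable(uint64_t *parent_desc, uint64_t *table,
                               size_t nb_entries, unsigned int level)
{
    if ((*parent_desc & DESC_PXNTABLE) == 0)
        return;

    for (size_t i = 0; i < nb_entries; i++) {
        if ((table[i] & DESC_VALID) == 0)
            continue;
        if (level < 3 && (table[i] & DESC_TABLE) != 0)
            table[i] |= DESC_PXNTABLE;  /* next-level table */
        else
            table[i] |= DESC_PXN;       /* block or page */
    }
    *parent_desc &= ~DESC_PXNTABLE;
}
```

After this step the effective permissions are unchanged, but each entry now carries its own execute-never bit, which is what allows the second half of the function to clear PXN on one page only.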

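Kernel Write-Rare Memory Protection

Similarly to the kernel read-only memory protection, there are two sets of HVC handlers that can be used to manage write-rare memory regions at EL1.

  • hkip_hvc_rowm_register marks kernel memory as read-only (READ) and hypervisor-mediated (HYP_RO);

  • hkip_hvc_rowm_unregister does not allow unmarking memory marked by hkip_hvc_rowm_register;

  • hkip_hvc_rowm_mod_register marks kernel module memory as read-only (READ) and hypervisor-mediated (HYP_RO_MOD);

  • hkip_hvc_rowm_mod_unregister unmarks memory marked by hkip_hvc_rowm_mod_register.

In addition, there are HVCs that allow modifying hypervisor-mediated memory regions (i.e., all the other HVC IDs starting with HKIP_HVC_ROWM in the table of the Hypervisor Calls section). They implement the equivalent of memcpy, memset, bzero, and of the arithmetic and logical operators. The table below gives the HVCs and their corresponding handlers in the hypervisor.

[Table: HKIP_HVC_ROWM HVCs and their corresponding hypervisor handlers]

Let's take a look at one of these functions, hkip_hvc_rowm_set_bit.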

void hkip_hvc_rowm_set_bit(uint64_t hvc_id, uint64_t bits, uint64_t pos, uint64_t value, saved_regs_t* regs) {
// ...
target_ipa = virt_to_phys_el1(bits + pos / 8);
// ...
hvc_lock_inc();
desc_s2 = get_stage2_page_table_descriptor_for_ipa(target_ipa, 0);
sw_attrs = get_software_attrs(desc_s2);
// ...
if (sw_attrs == 0 || sw_attrs == HYP_RO || sw_attrs == HYP_RO_MOD) {
target_hva = map_pa_into_hypervisor((desc_s2 & 0xfffffffff000) | (target_ipa & 0xfff));
mask = 1 << (pos & 7);
if (value != 0) {
do {
cur_value = exclusive_load(target_hva);
} while (exclusive_store(cur_value | mask, target_hva));
} else {
do {
cur_value = exclusive_load(target_hva);
} while (exclusive_store(cur_value & ~mask, target_hva));
}
unmap_hva_from_hypervisor(target_hva);
}
hvc_lock_dec();
// ...
}

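After retrieving the stage 2 descriptor of the byte to write to and mapping the corresponding page into the hypervisor's address space, hkip_hvc_rowm_set_bit sets one bit of that byte. All the other functions related to write-rare memory regions follow a similar implementation and are not detailed in this article.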
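
The bit manipulation at the heart of hkip_hvc_rowm_set_bit can be sketched in portable C11. This is only an illustration: rowm_set_bit_sketch is a hypothetical name, the stage 2 lookup and the mapping into the hypervisor are left out, and C11 atomics stand in for the exclusive load/store loop.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: atomically sets or clears bit "pos" of a bit array,
 * using the same byte and mask selection as hkip_hvc_rowm_set_bit.
 */
static void rowm_set_bit_sketch(_Atomic uint8_t *bits, uint64_t pos, int value)
{
    _Atomic uint8_t *target = &bits[pos / 8];   /* byte holding the bit */
    uint8_t mask = (uint8_t)(1u << (pos & 7));  /* bit inside that byte */

    if (value != 0)
        atomic_fetch_or(target, mask);
    else
        atomic_fetch_and(target, (uint8_t)~mask);
}
```

For example, setting bit 10 touches the second byte of the array and uses the mask 0x04.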

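Userland Execute-Only Memory Protection

Finally, the hypervisor also supports another kind of memory protection: the creation of userland execute-only memory that cannot be accessed from the kernel. This protection can be applied to a memory region using the hkip_hvc_xo_register HVC handler, which marks it in the second stage as executable at EL0 (EXEC_EL0) and as shared object protection execute-only (SOP_XO).
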
void hkip_hvc_xo_register(uint64_t hvc_id, uint64_t vaddr, uint64_t size, uint64_t x3, saved_regs_t* regs) {
hvc_lock_acquire();
change_stage2_software_attrs_per_va_range(vaddr, size, EXEC_EL0, SOP_XO, 1);
hvc_lock_release();
tlbi_vmalle1is();
// ...
}
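
Miscellaneous HVC Handlers

In addition, there are three miscellaneous HVC handlers.

The first one, hhee_hvc_enable_tvm, seems to be used to set the TVM bit of the HCR_EL2 system register when a CPU is suspended. It is triggered by the /vendor/bin/hisecd daemon writing to /proc/kernel_stp. However, in its configuration file, /vendor/etc/hisecd_scd.conf, HheeTvmPolicy appears to be disabled. It is unclear to us what the real purpose of this HVC is, since the TVM bit is already set in hyp_set_el2_and_enable_stage_2_per_cpu.

The other two, hhee_hvc_rox_text_register and hhee_hvc_vmmu_enable, are not used by the kernel and are suspected to be for debugging only. hhee_hvc_rox_text_register sets a memory region as readable and executable at EL0/EL1 (READ | EXEC_EL0 | EXEC_EL1) and marks it as a kernel read-only region (OS_RO) in the second stage. hhee_hvc_vmmu_enable simply calls enable_stage2_addr_translation.

Kernel Side

We can now move on to the kernel side and explain how the features implemented in the hypervisor are leveraged to protect sensitive information accessible at EL1.

Kernel Read-Only Data

The simplest use of the protections offered by the hypervisor is to enforce read-only memory permissions on kernel data. As we have seen in the Processing the Kernel Page Tables section, the kernel code is set as read-only in the second stage when the hypervisor processes the kernel page tables.

The kernel data is set as read-only in two steps. Both steps make use of the hkip_register_ro function, which simply performs the HKIP_HVC_RO_REGISTER HVC.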

▸ include/linux/hisi/hhee_prmem.h
static inline int hkip_register_ro(const void *base, size_t size)
{
return hkip_reg_unreg(HKIP_HVC_RO_REGISTER, base, size);
}

The first function, mark_constdata_ro, sets different kernel sections as read-only, in particular .rodata, .notes, and the exception table.

▸ arch/arm64/mm/mmu.c
void mark_constdata_ro(void)
{
unsigned long start = (unsigned long)__start_rodata;
unsigned long end = (unsigned long)__init_begin;
unsigned long section_size = (unsigned long)end - (unsigned long)start;

/*
* mark .rodata as read only. Use __init_begin rather than __end_rodata
* to cover NOTES and EXCEPTION_TABLE.
*/
update_mapping_prot(__pa_symbol(__start_rodata), start,
section_size, PAGE_KERNEL_RO);
hkip_register_ro((void *)start, ALIGN(section_size, PAGE_SIZE));
// ...
}

This function is called from start_kernel, after the architecture-specific initialization.

▸ init/main.c
asmlinkage __visible void __init start_kernel(void)
{
// ...
/* setup_arch is the last function to alter the constdata content */
mark_constdata_ro();
// ...
}

Finally, the .ro_after_init_data kernel section is set as read-only by the mark_rodata_ro function.

▸ arch/arm64/mm/mmu.c
void mark_rodata_ro(void)
{
// ...
section_size = (unsigned long)__end_data_ro_after_init -
(unsigned long)__start_data_ro_after_init;
update_mapping_prot(__pa_symbol(__start_data_ro_after_init), (unsigned long)__start_data_ro_after_init,
section_size, PAGE_KERNEL_RO);
hkip_register_ro((void *)__start_data_ro_after_init,
ALIGN(section_size, PAGE_SIZE));
// ...
}

This function is called from mark_readonly.

▸ init/main.c
static void mark_readonly(void)
{
if (rodata_enabled) {
/*
* load_module() results in W+X mappings, which are cleaned up
* with call_rcu_sched(). Let's make sure that queued work is
* flushed so that we don't hit false positives looking for
* insecure pages which are W+X.
*/
rcu_barrier_sched();
mark_rodata_ro();
rodata_test();
} else {
pr_info("Kernel memory protection disabled.\n");
}
}

Unsurprisingly, mark_readonly is called from kernel_init, before the initial process is spawned.

▸ init/main.c
static int __ref kernel_init(void *unused)
{
// ...
mark_readonly();
// ...
}
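
Module Read-Only Code

Another protection enforced by the hypervisor is to make module code both read-only and executable. No permissions appear to be enforced at the hypervisor level for module data. The memory layout of a kernel module is as follows:

[Figure: memory layout of a kernel module]

The code path protecting module code is a bit simpler than the one for kernel data. The first step is to obtain the clarify token required to use the protection feature.

The clarify token is retrieved at initialization by the module_token_init function, which stores it into the clarify_token global variable.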

▸ kernel/module.c
static unsigned long clarify_token;

static int __init module_token_init(void)
{
struct arm_smccc_res res;

if (hhee_check_enable() == HHEE_ENABLE) {
arm_smccc_hvc(HHEE_HVC_TOKEN, 0, 0,
0, 0, 0, 0, 0, &res);
clarify_token = res.a1;
}
return 0;
}
module_init(module_token_init);

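The token can then be used by the hhee_lkm_update function, which makes the text section of a module layout read-only and executable in the second stage.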

▸ kernel/module.c
static inline void hhee_lkm_update(const struct module_layout *layout)
{
struct arm_smccc_res res;

if (hhee_check_enable() != HHEE_ENABLE)
return;
arm_smccc_hvc(HHEE_LKM_UPDATE, (unsigned long)layout->base,
layout->text_size, clarify_token, 0, 0, 0, 0, &res);

if (res.a0)
pr_err("service from hhee failed test.\n");
}

This function is called from module_enable_ro for both layouts of a module.

▸ kernel/module.c
void module_enable_ro(const struct module *mod, bool after_init)
{
// ...

/*
* Note: make sure this is the last time
* u change the page table to x or RO.
*/
hhee_lkm_update(&mod->init_layout);
hhee_lkm_update(&mod->core_layout);
}

module_enable_ro is called from:

  • complete_formation, during initialization;

  • do_init_module, after initialization.

▸ kernel/module.c
static int complete_formation(struct module *mod, struct load_info *info)
{
// ...
module_enable_ro(mod, false);
// ...
}
▸ kernel/module.c
static noinline int do_init_module(struct module *mod)
{
// ...
module_enable_ro(mod, true);
// ...
}

Both of these functions are called from load_module, which is invoked when a module is loaded.

▸ kernel/module.c
static int load_module(struct load_info *info, const char __user *uargs,
int flags)
{
// ...
/* Finally it's fully formed, ready to start executing. */
err = complete_formation(mod, info);
// ...
err = do_init_module(mod);
// ...
}
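
Protected Memory

The kernel uses the features offered by the hypervisor to create protected memory sections:

  • prmem_wr: read-only memory that can be written to using dedicated functions that call into the hypervisor;

  • prmem_wr_after_init: similar to prmem_wr, but the permissions are only applied after kernel initialization;

  • prmem_rw: read-write memory intended for development purposes; it should be replaced by prmem_wr and prmem_wr_after_init in production.

The prmem_wr section is registered as read-only and write-mediated in the mark_wr_data_wr function.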

▸ arch/arm64/mm/mmu.c
void mark_wr_data_wr(void)
{
unsigned long section_size;

if (prmem_bypass())
return;

section_size = (unsigned long)(uintptr_t)__end_data_wr -
(unsigned long)(uintptr_t)__start_data_wr;
update_mapping_prot(__pa_symbol(__start_data_wr),
(unsigned long)(uintptr_t)__start_data_wr,
section_size, PAGE_KERNEL_RO);
hkip_register_rowm((void *)__start_data_wr,
ALIGN(section_size, PAGE_SIZE));
}

mark_wr_data_wr is called during the initialization of the protected memory subsystem.

▸ drivers/hisi/hhee/mm/prmem.c
void __init prmem_init(void)
{
// ...
mark_wr_data_wr();
// ...
}

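The prmem_wr_after_init section is initialized similarly to the prmem_wr section, by the mark_wr_after_init_data_wr function. It is called from the previously seen mark_rodata_ro, which is invoked after kernel initialization.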

▸ arch/arm64/mm/mmu.c
void mark_rodata_ro(void)
{
// ...
mark_wr_after_init_data_wr();
}

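At compile time, security-critical global variables can be moved into the prmem_wr and prmem_wr_after_init memory sections using the __wr and __wr_after_init macros, respectively. In addition, the prmem_wr section contains the control structures of protected memory pools. These pools (defined using the PRMEM_POOL macro) can be of the following types:

  • ro_no_recl: read-only non-reclaimable;

  • wr_no_recl: write-rare non-reclaimable;

  • start_wr_no_recl: pre-protected write-rare non-reclaimable;

  • start_wr_recl: pre-protected write-rare reclaimable;

  • wr_recl: write-rare reclaimable;

  • ro_recl: read-only reclaimable;

  • rw_recl: read-write reclaimable.

Protected memory pools are paired with a runtime allocator similar to vmalloc. The pmalloc function returns memory that is initially writable by the kernel, unless it is pre-protected. This memory must then be protected using the prmem_protect_addr or prmem_protect_pool function, and the permissions applied depend on the type of the pool. Finally, if it is reclaimable, the memory can be freed using the pfree function.

If higher performance is needed, the protected memory feature offers object caches similar to the SLUB allocator. A cache for a specific object type can be defined using the PRMEM_CACHE macro. Objects can then be allocated using the prmem_cache_alloc function and freed using the prmem_cache_free function.

Objects returned from either allocator are no longer writable by the kernel once protected. To modify their fields, the kernel must use dedicated functions, such as wr_memcpy and wr_assign, which end up calling the hypervisor.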
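
The seven pool types combine three independent properties. As a rough aid for reading the PRMEM_POOL definitions that follow, the sketch below encodes those properties in a lookup function. The enum, the structure, and the function are hypothetical and only restate the descriptions of the list above; they are not taken from the kernel sources.

```c
#include <stdbool.h>

/* Hypothetical names mirroring the pool types listed above. */
enum pool_type_sketch {
    SK_RO_NO_RECL,
    SK_WR_NO_RECL,
    SK_START_WR_NO_RECL,
    SK_START_WR_RECL,
    SK_WR_RECL,
    SK_RO_RECL,
    SK_RW_RECL
};

struct pool_traits {
    bool write_rare;     /* still writable once protected, via the hypervisor */
    bool pre_protected;  /* protected as soon as the memory is allocated */
    bool reclaimable;    /* allocations can be released with pfree */
};

static struct pool_traits pool_traits_of(enum pool_type_sketch type)
{
    switch (type) {
    case SK_RO_NO_RECL:       return (struct pool_traits){ false, false, false };
    case SK_WR_NO_RECL:       return (struct pool_traits){ true,  false, false };
    case SK_START_WR_NO_RECL: return (struct pool_traits){ true,  true,  false };
    case SK_START_WR_RECL:    return (struct pool_traits){ true,  true,  true  };
    case SK_WR_RECL:          return (struct pool_traits){ true,  false, true  };
    case SK_RO_RECL:          return (struct pool_traits){ false, false, true  };
    case SK_RW_RECL:          return (struct pool_traits){ false, false, true  };
    }
    return (struct pool_traits){ false, false, false };
}
```

Note that the table cannot distinguish ro_recl from rw_recl: the difference between them is whether protection is ever applied, which this simplified model does not capture.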

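Credentials Protection

The credentials of a task, struct cred, are protected by allocating them from a cache of pre-protected write-rare non-reclaimable memory. Since some members of this structure are written to frequently, they are moved into a new dedicated struct cred_rw, allocated from a cache of read-write reclaimable memory.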

▸ kernel/cred.c
PRMEM_POOL(cred_pool, start_wr_no_recl, CRED_ALIGNMENT, kB(8), CRED_POOL_SIZE);
PRMEM_CACHE(cred_cache, &cred_pool, sizeof(struct cred), CRED_ALIGNMENT);

PRMEM_POOL(cred_rw_pool, rw_recl, CRED_ALIGNMENT, kB(8), CRED_RW_POOL_SIZE);
PRMEM_CACHE(cred_rw_cache, &cred_rw_pool, sizeof(struct cred_rw),
CRED_ALIGNMENT);

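The frequently written members of struct cred are changed into pointers to the corresponding members of struct cred_rw, and a pointer to the task is added for consistency checking.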

▸ include/linux/cred.h
struct cred {
atomic_t *usage;
// ...
struct rcu_head *rcu; /* RCU deletion hook */
struct task_struct *task;
// ...
};

In addition to the frequently written members, struct cred_rw contains a pointer back to the read-only struct cred.

▸ include/linux/cred.h
struct cred_rw {
atomic_t usage;
struct rcu_head rcu;
struct cred *cred_wr;
};

The validate_task_creds function ensures the consistency of the task->creds->task loop. It is called from many places, including the current_cred macro and the prepare_creds and commit_creds functions.

▸ include/linux/cred.h
void validate_task_creds(const struct task_struct *task)
{
struct cred *cred;
struct cred *real_cred;

WARN_ON(!task);
cred = (struct cred *)rcu_dereference_protected(task->cred, 1);
WARN_ON(!cred);
real_cred = (struct cred *)
rcu_dereference_protected(task->real_cred, 1);
WARN_ON(!real_cred);
if (likely(task != &init_task)) {
BUG_ON(!is_wr(real_cred, sizeof(struct cred)));
if (cred != real_cred)
BUG_ON(!is_wr(cred, sizeof(struct cred)));
} else {
BUG_ON(real_cred != &init_cred);
BUG_ON(cred != &init_cred);
}
BUG_ON(real_cred->task != task);
if (cred != real_cred)
BUG_ON(cred->task != task);
}

Similarly, the validate_cred_rw function ensures the consistency of the cred_rw->cred->cred_rw.rcu loop. It is also called from many places, including the prepare_creds and commit_creds functions.

▸ include/linux/cred.h
void validate_cred_rw(struct cred_rw *cred_rw)
{
BUG_ON(!cred_rw->cred_wr);
if (likely(cred_rw->cred_wr != &init_cred))
BUG_ON(!is_rw(cred_rw, sizeof(struct cred_rw)));
BUG_ON(cred_rw->cred_wr->rcu != &cred_rw->rcu);
}
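
Both validation functions rely on back-pointers that must close a loop. The sketch below models the two loops in user space with simplified, hypothetical structures (the _sketch suffix marks everything that is not in the kernel sources) and returns a boolean instead of calling BUG_ON.

```c
#include <stdbool.h>
#include <stddef.h>

struct rcu_head_sketch { void *next; };
struct task_sketch;

struct cred_sketch {
    struct rcu_head_sketch *rcu;  /* points into the cred_rw companion */
    struct task_sketch *task;     /* back-pointer to the owning task */
};

struct cred_rw_sketch {
    struct rcu_head_sketch rcu;
    struct cred_sketch *cred_wr;  /* back-pointer to the protected cred */
};

struct task_sketch {
    const struct cred_sketch *cred;
};

/* task -> cred -> task must lead back to the same task. */
static bool task_cred_loop_ok(const struct task_sketch *task)
{
    return task != NULL && task->cred != NULL && task->cred->task == task;
}

/* cred_rw -> cred -> rcu must lead back into the same cred_rw. */
static bool cred_rw_loop_ok(const struct cred_rw_sketch *rw)
{
    return rw != NULL && rw->cred_wr != NULL && rw->cred_wr->rcu == &rw->rcu;
}
```

Because struct cred lives in write-rare memory, an attacker who redirects a task to another task's credentials cannot also fix the task back-pointer, so the first check fails.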

SELinux Protection

The selinux_enabled global variable is protected by being set as read-only after initialization.

▸ security/selinux/hooks.c
#define __selinux_enabled_prot  __ro_after_init
int selinux_enabled __selinux_enabled_prot = 1;

The selinux_enforcing global variable is replaced by a constant value at compile time.

▸ security/selinux/include/avc.h
#define selinux_enforcing 1

The ss_initialized global variable is protected by moving it into the prmem_wr section.

▸ security/selinux/ss/services.c
int ss_initialized __wr;

When the security policy is loaded, the SELinux data structures are allocated from a special pool of read-only reclaimable memory.

▸ security/selinux/ss/policydb_hkip.c
#define selinux_ro_recl  ro_recl
PRMEM_POOL(selinux_pool, selinux_ro_recl, SELINUX_POOL_ALIGNMENT,
PAGE_SIZE, SELINUX_POOL_CAP);

This memory pool is write-protected once the security policy has finished loading.

▸ security/selinux/ss/services.c
int security_load_policy(void *data, size_t len)
{
// ...
prmem_protect_pool(&selinux_pool);
return rc;
}

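It should be noted that this does not prevent overwriting the AVC cache to bypass negative decisions. However, reloading the SELinux policy is not possible, because additional kernel hardening prevents tasks with a TGID different from 1 from writing to /sys/fs/selinux/load.

Poweroff Command Protection

The poweroff command, which was previously used in Samsung RKP bypasses, is protected by moving it into the prmem_wr section.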

▸ kernel/reboot.c
char poweroff_cmd[POWEROFF_CMD_PATH_LEN] __wr = "/sbin/poweroff";

BPF Protection

BPF programs are also protected by the hypervisor. struct bpf_prog and struct bpf_binary_header are allocated from a read-only reclaimable pool.

▸ kernel/bpf/core.c
#define bpf_cap round_up((CONFIG_HKIP_PROTECT_BPF_CAP * SZ_1M), PAGE_SIZE)
PRMEM_POOL(bpf_pool, ro_recl, sizeof(void *), PAGE_SIZE, bpf_cap);

The memory permissions are applied in the bpf_prog_lock_ro and bpf_jit_binary_lock_ro functions.

▸ include/linux/filter.h
static inline void bpf_prog_lock_ro(struct bpf_prog *fp)
{
fp->locked = 1;
prmem_protect_addr(fp);
}
▸ include/linux/filter.h
static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr)
{
prmem_protect_addr(hdr);
}

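It should be noted that this mechanism does not seem to prevent calling ___bpf_prog_run to execute arbitrary bytecode. However, from our understanding, the kernel CFI implementation prevents such a call.

CFI Hardening

The kernel CFI implementation is hardened by moving the cfi_check function pointer of struct module into a separate, dynamically allocated struct safe_cfi_area. This structure also contains a pointer to the owning struct module.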

▸ include/linux/module.h
struct module {
// ...
struct safe_cfi_area {
cfi_check_fn cfi_check;
struct module *owner;
} *safe_cfi_area;
// ...
};

struct safe_cfi_area is allocated from a dedicated cache backed by a pre-protected write-rare non-reclaimable memory pool.

▸ drivers/hisi/hhee/cfi_harden/cfi_harden.c
#define CFI_CHECK_POOL_SIZE  \
(CONFIG_CFI_CHECK_CACHE_NUM * sizeof(struct safe_cfi_area))
#define CFI_GUARD_SIZE (2 * PAGE_SIZE)
#define CFI_CHECK_MAX_SIZE (CFI_CHECK_POOL_SIZE + CFI_GUARD_SIZE)

PRMEM_POOL(cfi_check_pool, start_wr_no_recl, sizeof(void *),
CFI_CHECK_POOL_SIZE, CFI_CHECK_MAX_SIZE);
PRMEM_CACHE(cfi_check_cache, &cfi_check_pool,
sizeof(struct safe_cfi_area), sizeof(void *));

Instead of simply dereferencing mod->cfi_check, fetch_cfi_check_fn is used to retrieve the cfi_check function pointer.

▸ drivers/hisi/hhee/cfi_harden/cfi_harden.c
cfi_check_fn fetch_cfi_check_fn(struct module *mod)
{
/*
* In order to prevent forging the entire malicious module,
* the verification of the module is also necessary in the future.
*/
if (WARN(!is_cfi_valid(mod->safe_cfi_area, sizeof(struct safe_cfi_area)) ||
(mod != mod->safe_cfi_area->owner),
"Attempt to alter cfi_check!"))
return 0;
return mod->safe_cfi_area->cfi_check;
}

In addition, the CFI shadow pages global variable is moved into the prmem_wr section.

▸ kernel/cfi.c
static struct cfi_shadow __rcu *cfi_shadow __wr;

The shadow pages themselves are allocated from a write-rare reclaimable memory pool.

▸ drivers/hisi/hhee/cfi_harden/cfi_harden.c
#define CFI_SHADOW_POOL_SIZE (SHADOW_PAGES * PAGE_SIZE)

PRMEM_POOL(cfi_shadow_pool, wr_recl, sizeof(void *),
CFI_SHADOW_POOL_SIZE, PRMEM_NO_CAP);

The shadow pages are protected by the cfi_protect_shadow_pages function.

▸ drivers/hisi/hhee/cfi_harden/cfi_harden.c
void cfi_protect_shadow_pages(void *addr)
{
prmem_protect_addr(addr);
}
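
Privilege Escalation Detection

The hypervisor is also used to support the detection of privilege escalation methods.

It protects the addr_limit of tasks by keeping a boolean for each task that denotes whether it is USER_DS or KERNEL_DS. These booleans are contained in the hkip_addr_limit_bits array.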

▸ drivers/hisi/hhee/hkip/critdata.c
#define DEFINE_HKIP_BITS(name, count) \
u8 hkip_##name[ALIGN(DIV_ROUND_UP(count, 8), PAGE_SIZE)] \
__aligned(PAGE_SIZE)
#define DEFINE_HKIP_TASK_BITS(name) DEFINE_HKIP_BITS(name, PID_MAX_DEFAULT)

DEFINE_HKIP_TASK_BITS(addr_limit_bits);

This array is protected at initialization by the hkip_critdata_init function, which calls hkip_register_bits, a wrapper around the HKIP_HVC_ROWM_REGISTER HVC.

▸ drivers/hisi/hhee/hkip/critdata.c
static int __init hkip_critdata_init(void)
{
hkip_register_bits(hkip_addr_limit_bits, sizeof (hkip_addr_limit_bits));
// ...
}

The set_fs function, used to set the addr_limit field, calls hkip_set_fs.

▸ arch/arm64/include/asm/uaccess.h
static inline void set_fs(mm_segment_t fs)
{
current_thread_info()->addr_limit = fs;
hkip_set_fs(fs);
// ...
}

The hkip_set_fs macro updates the corresponding bit of the hkip_addr_limit_bits array.

▸ include/linux/hisi/hkip.h
#define hkip_set_fs(fs) \
hkip_set_current_bit(hkip_addr_limit_bits, (fs) == KERNEL_DS)

Similarly, the get_fs function, used to retrieve the addr_limit field, directly calls hkip_get_fs.

▸ arch/arm64/include/asm/uaccess.h
#define get_fs()    (mm_segment_t)hkip_get_fs()

The hkip_get_fs macro checks the boolean of the current thread contained in the hkip_addr_limit_bits array.

▸ include/linux/hisi/hkip.h
#define hkip_is_kernel_fs() \
((current_thread_info()->addr_limit == KERNEL_DS) \
&& hkip_get_current_bit(hkip_addr_limit_bits, true))
#define hkip_get_fs() \
(hkip_is_kernel_fs() ? KERNEL_DS : USER_DS)
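
The effect of pairing addr_limit with a protected bit can be shown with a small user space model. Everything below is hypothetical: the sketch_ names, the table sizes, and the two limit values are placeholders, and ordinary arrays stand in for the hypervisor-mediated bit array, which real kernel code could only update through an HVC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values chosen for the sketch, not taken from the kernel. */
#define SKETCH_USER_DS   0x0000007fffffffffULL
#define SKETCH_KERNEL_DS 0xffffffffffffffffULL
#define SKETCH_MAX_TASKS 64u

/* Plain kernel memory: an attacker with a write primitive can change it. */
static uint64_t sketch_addr_limit[SKETCH_MAX_TASKS];
/* Stands in for hkip_addr_limit_bits, only writable via the hypervisor. */
static uint8_t sketch_addr_limit_bits[SKETCH_MAX_TASKS / 8];

static void sketch_set_fs(unsigned int pid, uint64_t fs)
{
    uint8_t mask = (uint8_t)(1u << (pid % 8));

    sketch_addr_limit[pid] = fs;
    if (fs == SKETCH_KERNEL_DS)
        sketch_addr_limit_bits[pid / 8] |= mask;
    else
        sketch_addr_limit_bits[pid / 8] &= (uint8_t)~mask;
}

static uint64_t sketch_get_fs(unsigned int pid)
{
    bool bit = ((sketch_addr_limit_bits[pid / 8] >> (pid % 8)) & 1u) != 0;

    /* KERNEL_DS is only reported when both sources agree. */
    if (sketch_addr_limit[pid] == SKETCH_KERNEL_DS && bit)
        return SKETCH_KERNEL_DS;
    return SKETCH_USER_DS;
}
```

In this model, overwriting sketch_addr_limit[pid] directly, as a classic addr_limit attack would, leaves the bit untouched, so sketch_get_fs keeps returning the user limit.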

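In addition, a new field is added to struct pt_regs to store the PID of the thread and the corresponding boolean. These values are saved into the structure in the kernel_entry assembly function and restored in kernel_exit.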

▸ arch/arm64/include/asm/ptrace.h
struct pt_regs {
// ...
u32 orig_addr_limit_hkip[2];
// ...
};

The hypervisor also protects the UID and GID of tasks using two arrays of booleans: hkip_uid_root_bits, which denotes if a task has the root UID, and hkip_gid_root_bits, which denotes if it has the root GID.

▸ drivers/hisi/hhee/hkip/critdata.c
static DEFINE_HKIP_TASK_BITS(uid_root_bits);
static DEFINE_HKIP_TASK_BITS(gid_root_bits);
▸ drivers/hisi/hhee/hkip/critdata.c
static int __init hkip_critdata_init(void)
{
// ...
hkip_register_bits(hkip_uid_root_bits, sizeof (hkip_uid_root_bits));
hkip_register_bits(hkip_gid_root_bits, sizeof (hkip_gid_root_bits));
return 0;
}

The hkip_check_uid_root function uses hkip_uid_root_bits to check if a task has escalated to the root UID, that is, it did not have the root UID according to hkip_uid_root_bits but now has the root UID or holds inheritable or permitted capabilities.

▸ drivers/hisi/hhee/hkip/critdata.c
int hkip_check_uid_root(void)
{
const struct cred *creds = NULL;

if (hkip_get_current_bit(hkip_uid_root_bits, true)) {
return 0;
}

/*
* Note: In principles, FSUID cannot be zero if EGID is non-zero.
* But we check it separately anyway, in case of memory corruption.
*/
creds = (struct cred *)current_cred();/*lint !e666*/
if (unlikely(hkip_compute_uid_root(creds) ||
uid_eq(creds->fsuid, GLOBAL_ROOT_UID))) {
pr_alert("UID root escalation!\n");
force_sig(SIGKILL, current);
return -EPERM;
}

return 0;
}

The check on the current credentials is performed by the hkip_compute_uid_root function.

▸ drivers/hisi/hhee/hkip/critdata.c
static bool hkip_compute_uid_root(const struct cred *creds)
{
return uid_eq(creds->uid, GLOBAL_ROOT_UID) ||
uid_eq(creds->euid, GLOBAL_ROOT_UID) ||
uid_eq(creds->suid, GLOBAL_ROOT_UID) ||
/*
* Note: FSUID can only change when EUID is zero. So a change of FSUID
* will not affect the overall root status bit: it will remain true.
*/
!cap_isclear(creds->cap_inheritable) ||
!cap_isclear(creds->cap_permitted);
}

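A similar pair of functions, hkip_check_gid_root and hkip_compute_gid_root, exists to check if a task has escalated to the root GID. In addition, the hkip_check_xid_root function performs the checks on both the UID and the GID.

These checks are performed in the following functions:

  • in acl_permission_check, called on file access, the UID is checked if the file belongs to the root UID, and the GID is checked if the file belongs to the root GID;

  • in prepare_creds, both the UID and the GID are checked;

  • in copy_process, called from fork, both the UID and the GID are checked;

  • in __cap_capable, called on system calls, the UID is checked.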

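Execute-Only Userland

The last feature enabled by the hypervisor is the creation of execute-only userland memory, which cannot be accessed from the kernel. This memory is used to hold the code of shared libraries that are stored encrypted on the filesystem. The decryption key is kept in the HW_KEYMASTER trustlet.

The /vendor/bin/hw/[email protected] system service is responsible for loading the encrypted shared libraries from disk into user memory at boot. The loading process is as follows:

  • the service opens the first device exposed by the kernel XOdata subsystem, /dev/sop_privileged;

  • it makes a SOPLOCK_SET_TAG IOCTL, which creates a control structure on the kernel side associated with the tag given as argument;

  • it calls mmap with the file descriptor to allocate user memory of the specified size and map it into its address space;

  • it decrypts the shared library in TrustZone and stores the plaintext code in the mmap'ed memory;

  • it makes a SOPLOCK_SET_XOM IOCTL, which calls hkip_register_xo to make the memory execute-only at EL0 and inaccessible at EL1.

When the plaintext shared library is needed in a process, it is loaded as follows:

  • the process opens the second device exposed by the kernel XOdata subsystem, /dev/sop_unprivileged;

  • it makes a SOPLOCK_SET_TAG IOCTL, passing the same tag as argument, to get the size of the shared library;

  • it makes a SOPLOCK_MMAP_KERNEL IOCTL, passing it the address and size of the memory region into which the plaintext code will be remapped.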

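Secure Monitor Calls

By design, SMCs are handled by the secure monitor. However, if the configuration allows it, a hypervisor can intercept these calls to filter them or to perform additional operations before making the corresponding SMC. SMCs are trapped by HHEE because the TSC bit is set in HCR_EL2.

When the SMC instruction is used at EL1, execution is redirected to the hypervisor, where hhee_handle_smc_instruction handles the monitor call. In Huawei's secure monitor implementation, which is based on ARM Trusted Firmware, SMC handlers are grouped into runtime services identified by an Owning Entity Number (OEN). The OEN is specified in the high bits of the command ID sent in the first argument of the SMC. It is then used by the hypervisor, in hhee_handle_smc_instruction, to call the appropriate handler for the intercepted SMC.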

void hhee_handle_smc_instruction(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
// Call a handler specific to the Owning Entity Number (OEN).
smc_handlers[(x0 >> 24) & 0x3f](x0, x1, x2, x3, regs);
}
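
The index used by hhee_handle_smc_instruction is one of several fields packed into the SMC function identifier. The sketch below decodes the fields relevant here, following the layout of the SMC Calling Convention (call type in bit 31, 32/64-bit convention in bit 30, OEN in bits 29:24, function number in bits 15:0). The structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct smc_id_fields {
    bool fast_call;     /* bit 31: fast call (1) or yielding call (0) */
    bool smc64;         /* bit 30: SMC64 (1) or SMC32 (0) convention */
    uint8_t oen;        /* bits 29:24: Owning Entity Number */
    uint16_t function;  /* bits 15:0: function number */
};

/* Hypothetical helper: splits an SMC function identifier into its fields. */
static struct smc_id_fields decode_smc_id(uint32_t id)
{
    struct smc_id_fields f;

    f.fast_call = ((id >> 31) & 1u) != 0;
    f.smc64 = ((id >> 30) & 1u) != 0;
    f.oen = (uint8_t)((id >> 24) & 0x3fu);
    f.function = (uint16_t)(id & 0xffffu);
    return f;
}
```

For instance, the identifier 0xc4000001 used later for CPU suspend decodes to a fast SMC64 call with an OEN of 4 and a function number of 1.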

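Depending on the OEN provided by the user, the hypervisor performs different operations, which are summarized in the table below.

OEN        Action
0x00       Performs some processing on the SMCs related to arch workarounds
0x01       Allows the SMC to be called without further processing
0x02       Disallows the ARM_SIP_SVC_EXE_STATE_SWITCH SMC
0x03       Allows the SMC to be called without further processing
0x04       Calls handle_smc_oen_std
0x05-0x0A  Allows the SMC to be called without further processing
0x0B-0x2F  Returns an error to the kernel
0x30-0x3F  Allows the SMC to be called without further processing

Let's take a look at handle_smc_oen_std, which calls handle_smc_psci to perform extra processing on commands that have an OEN of 4 and a function ID below 0x20.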

void handle_smc_oen_std(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
// ...
// A dedicated function is called to handle the function IDs of the standard service calls that belong to the PSCI
// implementation.
if ((x0 & 0xffff) < 0x20) {
regs->x0 = handle_smc_psci(x0, x1, x2, x3, regs);
// ...
}
// ...
}

If the SMC ID satisfies the conditions checked by handle_smc_psci, the handler corresponding to the input function ID is called from smc_handlers_psci.

uint64_t handle_smc_psci(uint64_t x0, uint64_t x1, uint64_t x2, uint64_t x3, saved_regs_t* regs) {
// SMC must be of type FAST.
if (((x0 >> 31) & 1) == 0) {
return -1;
}
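// A 64-bit SMC is rejected when its function ID is flagged in the 0x88f45 bitmask. The caller guarantees that the
// function ID is below 0x20, so masking with 0x1f keeps the shift amount well defined.
if (((x0 >> 30) & 1) != 0 && ((0x88f45 >> (x0 & 0x1f)) & 1) != 0) {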
return -1;
}
return smc_handlers_psci[x0 & 0xffff](x0, x1, x2, x3, regs);
}

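The possible handlers called by handle_smc_psci, and the operations they perform, are listed below.

Function ID  Action
0x00         Allows the SMC to be called without further processing
0x01         Calls handle_smc_psci_cpu_suspend
0x02         Allows the SMC to be called without further processing
0x03         Calls handle_smc_psci_cpu_on
0x04-0x09    Allows the SMC to be called without further processing
0x0A         Allows or denies the SMC call, without further processing
0x0B         Allows the SMC to be called without further processing
0x0C         Calls handle_smc_psci_cpu_default_suspend
0x0D         Allows the SMC to be called without further processing
0x0E         Calls handle_smc_psci_system_suspend
0x0F-0x10    Allows or denies the SMC call, without further processing
0x11         Allows the SMC to be called without further processing
0x12         Allows or denies the SMC call, without further processing
0x13-0x14    Allows the SMC to be called without further processing
0x15-0x1F    Returns an error to the kernel

The handle_smc_psci_cpu_on handler, called when a CPU is powered on and has finished its initialization, saves the entry point given as argument and calls boot_hyp_cpu.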

uint64_t handle_smc_psci_cpu_on(uint64_t smc_id,
uint64_t target_cpu,
uint64_t entrypoint,
uint64_t context_id,
saved_regs_t* regs) {
// Saves the entrypoint in the sys_regs structure, but it doesn't seem to be used anywhere (?).
if (!set_sys_reg_value(0x20, entrypoint)) {
boot_hyp_cpu(target_cpu, entrypoint, context_id);
}
// ...
}

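The handle_smc_psci_cpu_suspend handler, called when a CPU wants to suspend its execution, saves the entry point and all registers before making the original SMC that actually suspends the CPU. When execution resumes, the entry point value is restored.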

uint64_t handle_smc_psci_cpu_suspend(uint64_t smc_id,
uint64_t power_state,
uint64_t entrypoint,
uint64_t context_id,
saved_regs_t* regs) {
// ...
set_tvm();
// ...
// If the power state is valid and the requested state is POWERDOWN.
if ((power_state & 0xfcfe0000) == 0 && (power_state & 0x10000) != 0) {
// Set the saved ELR_EL2 to the entrypoint.
set_sys_reg_value(0x28, entrypoint);
// ...
// Save all EL2 registers and call the original SMC to suspend the CPU.
if (save_el2_registers_and_cpu_suspend(/* ... */) == 0x80000000) {
// If we correctly returned from the hypervisor PSCI entrypoint restore the entrypoint value that came from the
// kernel.
return hyp_set_elr_el2_spsr_el2_sctlr_el1(entrypoint, context_id);
}
// ...
}
// ...
}

save_el2_registers_and_cpu_suspend simply calls save_el2_registers, passing as argument the do_smc_pcsi_cpu_suspend function, which performs the initial SMC.

uint64_t save_el2_registers_and_cpu_suspend(/* ... */) {
return save_el2_registers(/* ..., */ do_smc_pcsi_cpu_suspend);
}

save_el2_registers saves the general-purpose and system registers at EL2 before calling the function given as argument.

uint64_t save_el2_registers(/* ..., */ cb_func_t* func) {
// Saves TPIDR_EL2 (containing the current CPU information).
//
// Saves general registers: x18 to x30.
//
// Saves system registers: CNTHCTL_EL2, CNTVOFF_EL2, CPTR_EL2, HCR_EL2, HSTR_EL2, VTCR_EL2, VTTBR_EL2, VMPIDR_EL2,
// VPIDR_EL2.
//
// Calls the function pointer given as argument.
return func(/* ... */);
}

As mentioned previously, do_smc_pcsi_cpu_suspend performs the original SMC.

uint64_t do_smc_pcsi_cpu_suspend(uint64_t smc_id, uint64_t power_state /* ,... */) {
return make_smc(0xc4000001, power_state, psci_entrypoint, get_stack_pointer());
}

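Finally, when execution resumes, the psci_entrypoint given as argument to the SMC is executed. It sets the stack pointer, sets some system registers to their global values using hyp_config_per_cpu, and restores the general-purpose registers and the rest of the system registers at EL2.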

uint64_t psci_entrypoint(uint64_t context_id) {
set_stack_pointer(context_id);
hyp_config_per_cpu();
// Restores TPIDR_EL2 (containing the current CPU info).
tpidr_el2 = *(uint64_t*)context_id;
set_tpidr_el2(tpidr_el2);
set_vbar_el2(SynchronousExceptionSP0);
// Restores system registers CNTHCTL_EL2, CNTVOFF_EL2, CPTR_EL2, HCR_EL2, HSTR_EL2, VTCR_EL2, VTTBR_EL2, VMPIDR_EL2,
// and VPIDR_EL2.
//
// Restores general registers from x18 to x30.
return 0x80000000;
}

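The handle_smc_psci_cpu_default_suspend and handle_smc_psci_system_suspend handlers are similar to handle_smc_psci_cpu_suspend, but end up making their respective SMCs.

With the explanation of SMCs, we have covered the whole exception handling system implemented by Huawei. We are now done explaining the internals of HHEE and can finally conclude this article.

Conclusion

In this article, we have detailed the internals of HHEE, Huawei's security hypervisor. We have seen how the mechanisms it implements help prevent exploitation of the kernel. As a defense-in-depth measure, it makes it harder for an attacker with a kernel read/write primitive to do anything useful.

To recap the various assurances provided by the hypervisor:

  • kernel read-only and executable memory is protected by:

    • using the second stage of address translation;

    • monitoring the kernel page tables to enforce permissions at EL0 and EL1;

  • various kernel structures are moved into protected memory using a dedicated allocator and cache system:

    • the credentials of tasks, preventing changes to UIDs, GIDs, capabilities, and security contexts;

    • SELinux global variables and structures, to avoid policy modifications;

    • the poweroff command, to prevent abusing it to easily gain code execution;

    • BPF programs, to prevent any change that could result in code execution;

    • the CFI check function and shadow pages, to harden the CFI implementation;

  • if a privilege escalation is somehow achieved, it can be detected using:

    • two read-only arrays that keep track of the root IDs of tasks and are checked on file access and system calls;

    • one array that keeps track of the addr_limit field of tasks and is checked on use.

We found HHEE to have a well-thought-out architecture and a robust implementation, providing effective security assurances to the kernel. During our assessment, we found one vulnerability that could compromise the security hypervisor. However, kernel mitigations make it difficult, if not impossible, to exploit. For more details about this vulnerability, please refer to the corresponding advisory.

We know this blog post may be a lot to digest, but we hope it can be used in the future as a reference to guide further research on Huawei's hypervisor, or even on other OEMs' implementations. If you find any errors, feel free to reach out; we will gladly update this blog post.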

References

  • How To Tame Your Unicorn - Daniel Komaromy and Lorant Szabo, Black Hat USA, 2021

  • Checkmate Mate30 - slipper & Guanxing Wen, MOSEC, 2021

  • Defeating Samsung KNOX with Zero Privilege - Di Shen, Black Hat USA, 2017

  • A Samsung RKP Compendium - Alexandre Adamski, Impalabs Blog, 2021

  • Attacking Samsung RKP - Alexandre Adamski, Impalabs Blog, 2021

  • S8 2019-2215 PoC exploit - Valentina Palmiotti, GitHub repository, 2020

  • New Reliable Android Kernel Root Exploitation Techniques - Dong-Hoon You, POC, 2016

  • KNOX Kernel Mitigation Bypasses - Dong-Hoon You, POC, 2019

Article source: http://mp.weixin.qq.com/s?__biz=MzkwOTE5MDY5NA==&mid=2247489858&idx=1&sn=3f03ea0bcfe3737073a83001236a2365&chksm=c13f2a0bf648a31dad29b733f23796c14d0bb4c1ed889f513e719900754a8d5815618f077eba&scene=0&xtrack=1#rd