    bpf, x86: Improve PROBE_MEM runtime load check · 90156f4b
    Committed by Dave Marchevsky
    This patch rewrites the runtime PROBE_MEM check insns emitted by the BPF
    JIT in order to ensure load safety. The changes in the patch fix two
    issues with the previous logic and more generally improve size of
    emitted code. Paragraphs between this one and "FIX 1" below explain the
    purpose of the runtime check and examine the current implementation.
    
    When a load is marked PROBE_MEM - e.g. due to PTR_UNTRUSTED access - the
    address being loaded from is not necessarily valid. The BPF jit sets up
    exception handlers for each such load which catch page faults and 0 out
    the destination register.
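
    As a rough userspace model of these semantics (illustration only; the
    real mechanism is an exception table entry attached to the load insn,
    not a runtime branch, and the helper below is purely hypothetical):

      #include <stdint.h>
      #include <stdbool.h>

      /* dst_reg = *(addr) with PROBE_MEM semantics: if the load would
       * fault, the exception handler zeroes the destination register and
       * execution continues after the load insn.
       */
      static uint64_t probe_mem_load_model(const uint64_t *addr, bool would_fault)
      {
              if (would_fault)
                      return 0;       /* fault path: 0 out dest */
              return *addr;           /* normal path: actual load */
      }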
    
    Arbitrary register-relative loads can escape this exception handling
    mechanism. Specifically, a load like dst_reg = *(src_reg + off) will not
    trigger BPF exception handling if (src_reg + off) is outside of kernel
    address space, resulting in an uncaught page fault. A concrete example
    of such behavior is a program like:
    
      struct result {
        char space[40];
        long a;
      };
    
      /* if err, returns ERR_PTR(-EINVAL) */
      struct result *ptr = get_ptr_maybe_err();
      long x = ptr->a;
    
    If get_ptr_maybe_err returns ERR_PTR(-EINVAL) and the result isn't
    checked for err, 'ptr' will be (u64)-EINVAL, a number close to
    U64_MAX. The address of ptr->a will then exceed U64_MAX and wrap over
    to a small positive u64, which will be in userspace and thus not
    covered by the BPF exception handling mechanism.
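
    A quick userspace illustration of that arithmetic (assuming EINVAL is
    22 and the struct result layout from above):

      #include <stdio.h>
      #include <stdint.h>
      #include <stddef.h>

      struct result {
              char space[40];
              long a;
      };

      int main(void)
      {
              uint64_t ptr = (uint64_t)-22;   /* (u64)ERR_PTR(-EINVAL) */
              uint64_t addr = ptr + offsetof(struct result, a);

              /* prints ptr=0xffffffffffffffea addr=0x12: the address of
               * ptr->a wraps past U64_MAX to a small userspace value */
              printf("ptr=0x%llx addr=0x%llx\n",
                     (unsigned long long)ptr, (unsigned long long)addr);
              return 0;
      }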
    
    In order to prevent such loads from occurring, the BPF jit emits some
    instructions which do runtime checking of (src_reg + off) and skip the
    actual load if it's out of range. As an example, here are instructions
    emitted for a %rdi = *(%rdi + 0x10) PROBE_MEM load:
    
      72:   movabs $0x800000000010,%r11 --|
      7c:   cmp    %r11,%rdi              |- 72 - 7f: Check 1
      7f:   jb     0x000000000000008d   --|
      81:   mov    %rdi,%r11             -----|
      84:   add    $0x0000000000000010,%r11   |- 81-8b: Check 2
      8b:   jnc    0x0000000000000091    -----|
      8d:   xor    %edi,%edi             ---- 0 out dest
      8f:   jmp    0x0000000000000095
      91:   mov    0x10(%rdi),%rdi       ---- Actual load
      95:
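
    At runtime the sequence above behaves like the following C sketch (a
    model of the listed insns only, not the JIT code itself; the
    0x800000000010 constant is taken from the movabs at insn 72):

      #include <stdint.h>
      #include <stdbool.h>

      /* Model of checking insns 72-8b above, for dst = *(src + 0x10). */
      static bool old_check_allows_load(uint64_t src)
      {
              /* Check 1: insns 72-7f ("jb" taken => skip the load) */
              if (src < 0x800000000010ULL)
                      return false;
              /* Check 2: insns 81-8b ("jnc" => proceed only if adding the
               * offset did not set the carry flag, i.e. no unsigned wrap) */
              uint64_t sum = src + 0x10;
              if (sum < src)
                      return false;   /* carry would be set */
              return true;            /* fall through to insn 91: actual load */
      }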
    
    The JIT considers kernel address space to start at MAX_TASK_SIZE +
    PAGE_SIZE. Determining whether a load target falls within kernel
    address space should be a simple check:
    
      (src_reg + off) >= MAX_TASK_SIZE + PAGE_SIZE
    
    But because there is only one spare register when the checking logic is
    emitted, this logic is split into two checks:
    
      Check 1: src_reg >= (MAX_TASK_SIZE + PAGE_SIZE - off)
      Check 2: src_reg + off doesn't wrap over U64_MAX and result in small pos u64
    
    Emitted insns implementing Checks 1 and 2 are annotated in the above
    example. Check 1 can be done with a single spare register since the
    source reg by definition is the left-hand-side of the inequality.
    Since adding 'off' to both sides of Check 1's inequality results in the
    original inequality we want, it's equivalent to testing that inequality.
    Except in the case where src_reg + off wraps past U64_MAX, which is why
    Check 2 needs to actually add src_reg + off if Check 1 passes - again
    using the single spare reg.
    
    FIX 1: The Check 1 inequality listed above is not what current code is
    doing. Current code is a bit more pessimistic, instead checking:
    
      src_reg >= (MAX_TASK_SIZE + PAGE_SIZE + abs(off))
    
    The 0x800000000010 in the above example comes from this current check.
    If Check 1 were corrected to use the proper right-hand side, the value
    would be 0x7ffffffffff0. This patch changes the checking logic more
    broadly (FIX 2 below will elaborate), fixing this issue as a
    side-effect of the
    rewrite. Regardless, it's important to understand why Check 1 should've
    been doing MAX_TASK_SIZE + PAGE_SIZE - off before proceeding.
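
    A small example of the difference (taking MAX_TASK_SIZE + PAGE_SIZE =
    0x800000000000, consistent with the movabs constants in the listings,
    and off = 0x10):

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
              const uint64_t limit = 0x800000000000ULL; /* MAX_TASK_SIZE + PAGE_SIZE */
              const uint64_t off = 0x10;
              uint64_t src = limit;   /* very start of kernel address space */

              /* src + off reaches kernel address space... */
              printf("target in kernel space:  %d\n", src + off >= limit);
              /* ...but the pessimistic Check 1 (src >= limit + off) rejects it */
              printf("pessimistic Check 1:     %d\n", src >= limit + off);
              /* ...while the corrected bound (limit - off) accepts it */
              printf("corrected Check 1:       %d\n", src >= limit - off);
              return 0;
      }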
    
    FIX 2: Current code relies on a 'jnc' to determine whether src_reg + off
    addition wrapped over. For negative offsets this logic is incorrect.
    Consider Check 2 insns emitted when off = -0x10:
    
      81:   mov    %rdi,%r11
      84:   add    $0xfffffffffffffff0,%r11
      8b:   jnc    0x0000000000000091
    
    The 2's complement representation of -0x10 is a large positive u64. Any
    value of src_reg that passes Check 1 will result in the carry flag
    being set after the (src_reg + off) addition. So a load with any
    negative offset will always fail Check 2 at runtime and never do the
    actual load. This patch fixes the negative offset issue by rewriting
    both checks so that they do not rely on the carry flag.
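
    The failure can be reproduced with plain C arithmetic, where carry
    after the 'add' corresponds to unsigned wraparound (sum < src):

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
              uint64_t off_2c = (uint64_t)(int64_t)-0x10; /* 0xfffffffffffffff0 */
              uint64_t src = 0xffff888000000000ULL;       /* example kernel address
                                                             that passes Check 1 */
              uint64_t sum = src + off_2c;

              /* prints "carry: 1": jnc is never taken, the load never runs */
              printf("carry: %d\n", sum < src);
              return 0;
      }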
    
    The rewrite takes advantage of the fact that, while we only have one
    scratch reg to hold arbitrary values, we know the offset at JIT time.
    Thus we can use src_reg as a temporary scratch reg to hold src_reg +
    offset, since we can return it to its original value by later subtracting
    the offset. As a result we can directly check the original inequality we
    care about:
    
      (src_reg + off) >= MAX_TASK_SIZE + PAGE_SIZE
    
    For a load like %rdi = *(%rsi + -0x10), this results in emitted code:
    
      43:   movabs $0x800000000000,%r11
      4d:   add    $0xfffffffffffffff0,%rsi --- src_reg += off
      54:   cmp    %r11,%rsi                --- Check original inequality
      57:   jae    0x000000000000005d
      59:   xor    %edi,%edi
      5b:   jmp    0x0000000000000061
      5d:   mov    0x0(%rsi),%rdi           --- Actual load
      61:   sub    $0xfffffffffffffff0,%rsi --- src_reg -= off
    
    Note that the actual load is always done with offset 0, since previous
    insns have already done src_reg += off. Regardless of whether the new
    check succeeds or fails, insn 61 is always executed, returning src_reg
    to its original value.
    
    Because the goal of these checks is to ensure that the loaded-from
    address will be protected by the BPF exception handler, the new check
    can safely ignore any wrapover from insn 4d. If such a wrapped-over
    address passes insn 54 + 57's cmp-and-jmp, it will have that
    protection, so the load can proceed.
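
    Putting the new general-case sequence into a C sketch (again only a
    model of emitted insns 43-61, where 'limit' stands for MAX_TASK_SIZE +
    PAGE_SIZE and the 'load' callback stands in for the exception-protected
    load itself):

      #include <stdint.h>

      /* dst_reg = *(src_reg + off), src_reg != dst_reg and off != 0 */
      static void new_check_model(uint64_t *dst_reg, uint64_t *src_reg,
                                  int32_t off, uint64_t limit,
                                  uint64_t (*load)(uint64_t addr))
      {
              *src_reg += (int64_t)off;               /* insn 4d: src_reg += off */
              if (*src_reg >= limit)                  /* insns 54 + 57: cmp + jae */
                      *dst_reg = load(*src_reg);      /* insn 5d: load at offset 0 */
              else
                      *dst_reg = 0;                   /* insn 59: 0 out dest */
              *src_reg -= (int64_t)off;               /* insn 61: always restore */
      }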
    
    IMPROVEMENTS: The improved logic above is 8 insns vs the original
    logic's 9, and has 1 fewer jmp. The number of checking insns can be
    further reduced in common scenarios:
    
    If src_reg == dst_reg, the actual load insn will clobber src_reg, so
    there's no original src_reg state for the sub insn immediately
    following the load to restore, and it can be omitted. In fact, it must
    be omitted, since it would otherwise incorrectly subtract from the
    result of the load. So for src_reg == dst_reg, the JIT emits these insns:
    
      3c:   movabs $0x800000000000,%r11
      46:   add    $0xfffffffffffffff0,%rdi
      4d:   cmp    %r11,%rdi
      50:   jae    0x0000000000000056
      52:   xor    %edi,%edi
      54:   jmp    0x000000000000005a
      56:   mov    0x0(%rdi),%rdi
      5a:
    
    The only difference from the larger example is the omitted sub, which
    would have been insn 5a in this example.
    
    If offset == 0, we can similarly omit the sub as in the previous case,
    since there's nothing added to subtract. For the same reason we can
    omit the addition as well, resulting in the JIT emitting these insns:
    
      46:   movabs $0x800000000000,%r11
      4d:   cmp    %r11,%rdi
      50:   jae    0x0000000000000056
      52:   xor    %edi,%edi
      54:   jmp    0x000000000000005a
      56:   mov    0x0(%rdi),%rdi
      5a:
    
    Although the above example also happens to have src_reg == dst_reg, the
    same offset == 0 optimization applies equally when src_reg != dst_reg.
    
    To summarize the improvements in emitted insn count for the
    check-and-load:
    
    BEFORE:                8 check insns, 3 jmps
    AFTER (general case):  7 check insns, 2 jmps (12.5% fewer insns, 33% fewer jmps)
    AFTER (src == dst):    6 check insns, 2 jmps (25% fewer insns)
    AFTER (offset == 0):   5 check insns, 2 jmps (37.5% fewer insns)
    
    (Above counts don't include the 1 load insn, just checking around it)
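
    The counts above can be reproduced with a toy helper that mirrors the
    emission decisions described earlier (illustration only, not the
    actual JIT logic in bpf_jit_comp.c):

      #include <stdio.h>
      #include <stdbool.h>

      /* movabs, cmp, jae, xor, jmp are always emitted: 5 base check insns;
       * the add and sub around them are conditional. */
      static void describe_emission(int off, bool src_eq_dst)
      {
              bool emit_add = off != 0;                /* src_reg += off */
              bool emit_sub = off != 0 && !src_eq_dst; /* restore src_reg */

              printf("add: %d, sub: %d, check insns: %d\n",
                     emit_add, emit_sub, 5 + emit_add + emit_sub);
      }

      int main(void)
      {
              describe_emission(-0x10, false); /* general case: 7 check insns */
              describe_emission(-0x10, true);  /* src == dst:   6 check insns */
              describe_emission(0, false);     /* offset == 0:  5 check insns */
              return 0;
      }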
    
    Based on the BPF bytecode + JITted x86 insns I saw while experimenting with
    these improvements, I expect the src_reg == dst_reg case to occur most
    often, followed by offset == 0, then the general case.
    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20221216214319.3408356-1-davemarchevsky@fb.com