1. 25 May 2019, 2 commits
    • bpf: verifier: insert zero extension according to analysis result · a4b1d3c1
      Jiong Wang committed
      After the previous patches, the verifier marks an insn if it really needs
      zero extension on dst_reg.
      
      It is then up to the back-ends to decide how to use this information to
      eliminate unnecessary zero extension code-gen during JIT compilation.
      
      One approach is for the verifier to insert explicit zero extensions for
      those insns that need them in a generic way; JIT back-ends then do not
      generate zero extension for sub-register writes by default.
      
      However, only those back-ends which do not have hardware zero extension
      want this optimization. Back-ends like x86_64 and AArch64 have hardware
      zero extension support, so for them the insertion should be disabled.
      
      This patch introduces a new target hook "bpf_jit_needs_zext" which returns
      false by default, meaning verifier zero extension insertion is disabled by
      default. A back-end can override this hook to return true if it doesn't
      have hardware support and wants the verifier to insert zero extensions
      explicitly.
      
      Offload targets do not use this native target hook; instead, they can get
      the optimization results through bpf_prog_offload_ops.finalize.
      
      NOTE: arches can have diverse features; it is possible for one arch to
      have hardware zero extension support for some sub-register write insns
      but not for all. For example, PowerPC and SPARC have zero-extended loads,
      but not for alu32. So when verifier zero extension insertion is enabled,
      these JIT back-ends need to peephole insns to remove the zero extensions
      inserted for insns that actually have hardware zero extension support.
      The peephole can be as simple as looking at the next insn: if it is the
      special zero extension insn, it is safe to eliminate it when the current
      insn already zero extends in hardware.
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a4b1d3c1
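      For a back-end without hardware zero extension, the override plus the
      peephole described above could look roughly like the sketch below.
      bpf_jit_needs_zext() and the dedicated zero-extension insn come from this
      patch set; arch_insn_zero_extends() and the surrounding per-insn JIT loop
      are hypothetical stand-ins.

        /* arch JIT: opt in to verifier-inserted zero extensions */
        bool bpf_jit_needs_zext(void)
        {
            return true;
        }

        /* inside the JIT's per-insn loop: if the next insn is a
         * verifier-inserted zero extension but the insn just emitted already
         * zero extends dst_reg in hardware, skip the redundant insn
         */
        if (i + 1 < prog->len && insn_is_zext(&insn[1]) &&
            arch_insn_zero_extends(insn))   /* hypothetical per-arch check */
            i++;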
    • bpf: introduce new mov32 variant for doing explicit zero extension · 7d134041
      Jiong Wang committed
      The encoding for this new variant is based on the BPF_X format. The "imm"
      field used to be 0 only; now it can also be 1, which means doing zero
      extension unconditionally:
      
        .code = BPF_ALU | BPF_MOV | BPF_X
        .dst_reg = DST
        .src_reg = SRC
        .imm  = 1
      
      We use this new form for zero extension, for which the verifier
      guarantees SRC == DST.
      
      Implications on JIT back-ends when doing code-gen for
      BPF_ALU | BPF_MOV | BPF_X:
        1. No change if hardware already does zero extension unconditionally for
           sub-register write.
        2. Otherwise, when seeing imm == 1, just generate insns to clear high
           32-bit. No need to generate insns for the move because when imm == 1,
           dst_reg is the same as src_reg at the moment.
      
      The interpreter doesn't need to change either; it already does
      unconditional zero extension for mov32.
      
      One helper macro, BPF_ZEXT_REG, is added to help create a zero extension
      insn using this new mov32 variant.
      
      One helper function, insn_is_zext, is added for checking whether an insn
      is a zero extension on dst. This will be widely used by a few JIT
      back-ends in later patches in this set.
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7d134041
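      A minimal sketch of the two helpers named above, roughly as they appear
      in include/linux/filter.h after this patch:

        #define BPF_ZEXT_REG(DST)                       \
            ((struct bpf_insn) {                        \
                .code    = BPF_ALU | BPF_MOV | BPF_X,   \
                .dst_reg = DST,                         \
                .src_reg = DST,                         \
                .off     = 0,                           \
                .imm     = 1 })

        static inline bool insn_is_zext(const struct bpf_insn *insn)
        {
            return insn->code == (BPF_ALU | BPF_MOV | BPF_X) && insn->imm == 1;
        }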
  2. 30 April 2019, 2 commits
    • bpf: Use vmalloc special flag · d53d2f78
      Rick Edgecombe committed
      Use the new flag VM_FLUSH_RESET_PERMS for handling the freeing of
      special-permissioned memory in vmalloc, and remove the places where memory
      was set RW before freeing, which is no longer needed. Don't track whether
      the memory is RO anymore because vmalloc now tracks it.
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <akpm@linux-foundation.org>
      Cc: <ard.biesheuvel@linaro.org>
      Cc: <deneen.t.dock@intel.com>
      Cc: <kernel-hardening@lists.openwall.com>
      Cc: <kristen@linux.intel.com>
      Cc: <linux_dti@icloud.com>
      Cc: <will.deacon@arm.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190426001143.4983-19-namit@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d53d2f78
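      A hedged sketch of the allocation-side pattern this flag enables for
      executable BPF/module images; the function name, the page count handling
      and the lack of error checking are illustrative, not the literal kernel
      code:

        void *alloc_locked_image(const void *data, unsigned long size,
                                 unsigned int npages)
        {
            void *image = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);

            if (!image)
                return NULL;
            memcpy(image, data, size);
            /* mark the mapping as carrying special permissions: a later
             * vfree() resets them and flushes the TLB, so the region no
             * longer has to be set back to RW before freeing
             */
            set_vm_flush_reset_perms(image);
            set_memory_ro((unsigned long)image, npages);
            set_memory_x((unsigned long)image, npages);
            return image;
        }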
    • x86/modules: Avoid breaking W^X while loading modules · f2c65fb3
      Nadav Amit committed
      When modules and BPF filters are loaded, there is a time window in
      which some memory is both writable and executable. An attacker that has
      already found another vulnerability (e.g., a dangling pointer) might be
      able to exploit this behavior to overwrite kernel code. Prevent having
      writable executable PTEs in this stage.
      
      In addition, avoiding W+X mappings can also slightly simplify the
      patching of module code on initialization (e.g., by alternatives and
      static keys), as will be done in the next patch. This was actually the
      main motivation for this patch.
      
      To avoid W+X mappings, set them initially as RW (NX), and only after
      they are set RO, set them as X as well. Setting them as executable is
      done as a separate step to avoid a window in which one core still has
      the old PTE cached (hence writable) while another sees the updated PTE
      (executable), which would break the W^X protection.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <akpm@linux-foundation.org>
      Cc: <ard.biesheuvel@linaro.org>
      Cc: <deneen.t.dock@intel.com>
      Cc: <kernel-hardening@lists.openwall.com>
      Cc: <kristen@linux.intel.com>
      Cc: <linux_dti@icloud.com>
      Cc: <will.deacon@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f2c65fb3
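      A hedged sketch of the ordering described above (not the literal module
      loader code; the names and the page count are illustrative):

        void *load_text(const void *image, unsigned long size,
                        unsigned int npages)
        {
            void *addr = module_alloc(size);        /* mapped RW, non-executable */

            if (!addr)
                return NULL;
            memcpy(addr, image, size);              /* patch while still writable */
            set_memory_ro((unsigned long)addr, npages);  /* drop write first     */
            set_memory_x((unsigned long)addr, npages);   /* executable only now  */
            return addr;
        }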
  3. 13 April 2019, 4 commits
    • bpf: Add file_pos field to bpf_sysctl ctx · e1550bfe
      Andrey Ignatov committed
      Add a file_pos field to the bpf_sysctl context to read and write the file
      position at which the sysctl is being accessed (read or written).
      
      The field can be used, e.g., to override the whole sysctl value on a write
      to the sysctl even when sys_write is called by user space with
      file_pos > 0, or the BPF program may reject such accesses.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e1550bfe
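      A hedged example of a cgroup/sysctl program using ctx->file_pos; the
      section name follows the convention used by the kernel selftests, and the
      headers assumed are the UAPI bpf.h plus the selftests' bpf_helpers.h:

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        SEC("cgroup/sysctl")
        int sysctl_whole_write(struct bpf_sysctl *ctx)
        {
            /* on writes, ignore the position user space wrote at and treat
             * every write as an overwrite of the whole value
             */
            if (ctx->write)
                ctx->file_pos = 0;
            return 1;   /* allow the access */
        }

        char _license[] SEC("license") = "GPL";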
    • bpf: Introduce bpf_sysctl_{get,set}_new_value helpers · 4e63acdf
      Andrey Ignatov committed
      Add helpers to work with the new value being written to a sysctl by user
      space.
      
      bpf_sysctl_get_new_value() copies the value being written to the sysctl
      into the provided buffer.
      
      bpf_sysctl_set_new_value() overrides the new value being written by user
      space with the one from the provided buffer. The buffer should contain a
      string representation of the value, similar to what can be seen in
      /proc/sys/.
      
      Both helpers can be used only on sysctl write.
      
      File position matters and can be managed by an interface that will be
      introduced separately. E.g. if user space calls sys_write on a file in
      /proc/sys/ at file position X, where X > 0, then the value set by
      bpf_sysctl_set_new_value() will be written starting from X. If the
      program wants to override the whole value with the specified buffer, the
      file position has to be set to zero.
      
      Documentation for the new helpers is provided in bpf.h UAPI.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4e63acdf
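      A hedged sketch of the two helpers in use (same headers as the earlier
      cgroup/sysctl sketch; the forced value "1" is arbitrary):

        SEC("cgroup/sysctl")
        int sanitize_write(struct bpf_sysctl *ctx)
        {
            char new[8];
            const char one[] = "1";

            if (!ctx->write)
                return 1;
            /* peek at what user space is writing ... */
            if (bpf_sysctl_get_new_value(ctx, new, sizeof(new)) < 0)
                return 0;   /* reject values that do not fit our buffer */
            /* ... and replace it; file_pos is set to 0 so the whole value
             * is overwritten
             */
            ctx->file_pos = 0;
            if (bpf_sysctl_set_new_value(ctx, one, sizeof(one) - 1))
                return 0;
            return 1;
        }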
    • bpf: Introduce bpf_sysctl_get_current_value helper · 1d11b301
      Andrey Ignatov committed
      Add a bpf_sysctl_get_current_value() helper to copy the current sysctl
      value into a buffer provided by the BPF_PROG_TYPE_CGROUP_SYSCTL program.
      
      It provides the same string that user space can see by reading the
      corresponding file in /proc/sys/, including the newline, etc.
      
      Documentation for the new helper is provided in the bpf.h UAPI.
      
      Since the current value is kept in ctl_table->data in a parsed form,
      ctl_table->proc_handler() with write=0 is called to read that data and
      convert it to a string. Such a string can later be parsed by a program
      using helpers that will be introduced separately.
      
      Unfortunately it's not trivial to provide an API to access the parsed
      data due to the variety of data representations (string, intvec, uintvec,
      ulongvec, custom structures, even NULL, etc). Instead it's assumed that
      users know how to handle the specific sysctls they're interested in, and
      the appropriate helpers can be used.
      
      Since ctl_table->proc_handler() expects a __user buffer, the
      kernel-allocated buffer where the value is stored is converted to __user.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1d11b301
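      A hedged sketch showing the helper next to the write flag (same headers
      as the earlier cgroup/sysctl sketches; the "0\n" comparison is arbitrary):

        SEC("cgroup/sysctl")
        int gate_on_current(struct bpf_sysctl *ctx)
        {
            char cur[8];

            /* same string as reading the file in /proc/sys/, newline included */
            if (bpf_sysctl_get_current_value(ctx, cur, sizeof(cur)) < 0)
                return 1;   /* value longer than our buffer; allow */

            /* only allow writes while the current value is still "0" */
            if (ctx->write && (cur[0] != '0' || cur[1] != '\n'))
                return 0;
            return 1;
        }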
    • bpf: Sysctl hook · 7b146ceb
      Andrey Ignatov committed
      Containerized applications may run as root, and that may create problems
      for the whole host. Specifically, such applications may change a sysctl
      and affect applications in other containers.
      
      Furthermore, in existing infrastructure it may not be possible to just
      completely disable writing to sysctls; instead, such a process should be
      gradual, with the ability to log which sysctls are being changed by a
      container, investigate, limit the set of writable sysctls to currently
      used ones (so that new ones cannot be changed), and eventually reduce
      this set to zero.
      
      The patch introduces the new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
      attach type BPF_CGROUP_SYSCTL to solve these problems on a cgroup basis.
      
      The new program type has access to the following minimal context:
      	struct bpf_sysctl {
      		__u32	write;
      	};
      
      where @write indicates whether the sysctl is being read (= 0) or written
      (= 1).
      
      Helpers to access the sysctl name and value will be introduced separately.
      
      The BPF_CGROUP_SYSCTL attach point is added to the sysctl code right
      before passing control to ctl_table->proc_handler so that the BPF program
      can either allow or deny access to the sysctl.
      Suggested-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7b146ceb
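      A minimal hedged example of this program type, attached with the
      BPF_CGROUP_SYSCTL attach type named above (headers as in the earlier
      cgroup/sysctl sketches):

        SEC("cgroup/sysctl")
        int deny_sysctl_writes(struct bpf_sysctl *ctx)
        {
            /* ctx->write: 0 = sysctl is being read, 1 = written */
            if (ctx->write)
                return 0;   /* deny */
            return 1;       /* allow */
        }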
  4. 28 February 2019, 1 commit
    • bpf: enable program stats · 492ecee8
      Alexei Starovoitov committed
      JITed BPF programs are indistinguishable from kernel functions, but unlike
      kernel code, BPF code can change often.
      The typical approach of "perf record" + "perf report" profiling and tuning
      of kernel code works just as well for BPF programs, but kernel code
      doesn't need to be monitored whereas BPF programs do.
      Users load and run large numbers of BPF programs.
      These BPF stats allow tools to monitor the usage of BPF on the server.
      The monitoring tools will turn the sysctl kernel.bpf_stats_enabled
      on and off for a few seconds to sample the average cost of the programs.
      Data aggregated over hours and days will provide insight into the cost of
      BPF, and alarms can trigger in case a given program suddenly gets more
      expensive.
      
      The cost of two sched_clock() calls per program invocation adds ~20 nsec.
      Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
      from ~10 nsec to ~30 nsec.
      A static_key minimizes the cost of the stats collection.
      There is no measurable difference before/after this patch
      with kernel.bpf_stats_enabled=0.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      492ecee8
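      A hedged userspace sketch of sampling the counters this feature exposes
      through bpf_prog_info (run_time_ns, run_cnt), using libbpf's
      bpf_obj_get_info_by_fd(); obtaining prog_fd is left to the caller:

        #include <stdio.h>
        #include <linux/bpf.h>
        #include <bpf/bpf.h>

        static void print_prog_cost(int prog_fd)
        {
            struct bpf_prog_info info = {};
            __u32 len = sizeof(info);

            if (bpf_obj_get_info_by_fd(prog_fd, &info, &len))
                return;
            if (info.run_cnt)
                printf("avg cost: %llu nsec over %llu runs\n",
                       (unsigned long long)(info.run_time_ns / info.run_cnt),
                       (unsigned long long)info.run_cnt);
        }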
  5. 20 February 2019, 1 commit
  6. 01 February 2019, 1 commit
  7. 31 January 2019, 1 commit
    • bpf: fix missing prototype warnings · 116bfa96
      Valdis Kletnieks committed
      Compiling with W=1 generates warnings:
      
        CC      kernel/bpf/core.o
      kernel/bpf/core.c:721:12: warning: no previous prototype for 'bpf_jit_alloc_exec_limit' [-Wmissing-prototypes]
        721 | u64 __weak bpf_jit_alloc_exec_limit(void)
            |            ^~~~~~~~~~~~~~~~~~~~~~~~
      kernel/bpf/core.c:757:14: warning: no previous prototype for 'bpf_jit_alloc_exec' [-Wmissing-prototypes]
        757 | void *__weak bpf_jit_alloc_exec(unsigned long size)
            |              ^~~~~~~~~~~~~~~~~~
      kernel/bpf/core.c:762:13: warning: no previous prototype for 'bpf_jit_free_exec' [-Wmissing-prototypes]
        762 | void __weak bpf_jit_free_exec(void *addr)
            |             ^~~~~~~~~~~~~~~~~
      
      All three are weak functions that archs can override; provide proper
      prototypes for them for when a new arch provides its own implementations.
      Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      116bfa96
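      The declarations in question are the ones visible in the warnings above;
      the fix adds prototypes along these lines to a shared header (sketch):

        u64 bpf_jit_alloc_exec_limit(void);
        void *bpf_jit_alloc_exec(unsigned long size);
        void bpf_jit_free_exec(void *addr);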
  8. 27 January 2019, 1 commit
  9. 24 January 2019, 1 commit
  10. 22 January 2019, 1 commit
  11. 03 January 2019, 2 commits
  12. 12 December 2018, 1 commit
    • bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K · fdadd049
      Daniel Borkmann committed
      Michael and Sandipan report:
      
        Commit ede95a63 introduced a bpf_jit_limit tuneable to limit BPF
        JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
        and is adjusted again at init time if MODULES_VADDR is defined.
      
        For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
        the compile-time default at boot-time, which is 0x9c400000 when
        using 64K page size. This overflows the signed 32-bit bpf_jit_limit
        value:
      
        root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
        -1673527296
      
        and can cause various unexpected failures throughout the network
        stack. In one case `strace dhclient eth0` reported:
      
        setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8},
                   16) = -1 ENOTSUPP (Unknown error 524)
      
        and similar failures can be seen with tools like tcpdump. This doesn't
        always reproduce however, and I'm not sure why. The more consistent
        failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9
        host would time out on systemd/netplan configuring a virtio-net NIC
        with no noticeable errors in the logs.
      
      Given this, and also given that in the near future some architectures
      like arm64 will have a custom area for BPF JIT image allocations, we
      should get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely.
      For 4.21 we have an overridable bpf_jit_alloc_exec() and
      bpf_jit_free_exec(), so add another overridable helper,
      bpf_jit_alloc_exec_limit(), which returns the possible size of the memory
      area for deriving the default heuristic in bpf_jit_charge_init().
      
      Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new
      bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default
      JIT memory provider, and therefore in case archs implement their custom
      module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for
      vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.
      
      Additionally, for archs supporting large page sizes, we should change
      the sysctl to be handled as a long so as not to run into sysctl
      restrictions in the future.
      
      Fixes: ede95a63 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
      Reported-by: Sandipan Das <sandipan@linux.ibm.com>
      Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fdadd049
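      A sketch of the default limit heuristic as described above (roughly what
      the weak helper looks like; archs with their own JIT area override it):

        u64 __weak bpf_jit_alloc_exec_limit(void)
        {
        #if defined(MODULES_VADDR)
            return MODULES_END - MODULES_VADDR;
        #else
            return VMALLOC_END - VMALLOC_START;
        #endif
        }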
  13. 10 December 2018, 1 commit
    • bpf: Add bpf_line_info support · c454a46b
      Martin KaFai Lau committed
      This patch adds bpf_line_info support.
      
      It accepts an array of bpf_line_info objects during BPF_PROG_LOAD.
      The "line_info", "line_info_cnt" and "line_info_rec_size" are added
      to the "union bpf_attr".  The "line_info_rec_size" makes
      bpf_line_info extensible in the future.
      
      The new "check_btf_line()" ensures the userspace line_info is valid
      for the kernel to use.
      
      When the verifier is translating/patching the bpf_prog (through
      "bpf_patch_insn_single()"), the line_infos' insn_off is also
      adjusted by the newly added "bpf_adj_linfo()".
      
      If the bpf_prog is jited, this patch also provides the jited addrs (in
      aux->jited_linfo) for the corresponding line_info.insn_off.
      "bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo.
      It is currently called by the x86 jit.  Other jits can also use
      "bpf_prog_fill_jited_linfo()", which will be done in followup patches.
      In the future, if it is deemed necessary, a particular jit could also
      provide its own "bpf_prog_fill_jited_linfo()" implementation.
      
      A few "*line_info*" fields are added to the bpf_prog_info such
      that the user can get the xlated line_info back (i.e. the line_info
      with its insn_off reflecting the translated prog).  The jited_line_info
      is available if the prog is jited.  It is an array of __u64.
      If the prog is not jited, jited_line_info_cnt is 0.
      
      The verifier's verbose log with line_info will be done in
      a follow up patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c454a46b
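      A hedged loader-side sketch of passing line info at BPF_PROG_LOAD using
      the attr fields named above; the insns, license string, and BTF setup are
      assumed to be prepared by the caller, and ptr_to_u64() is a small local
      helper:

        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static __u64 ptr_to_u64(const void *p)
        {
            return (__u64)(unsigned long)p;
        }

        int load_with_line_info(const struct bpf_insn *insns, __u32 insn_cnt,
                                const struct bpf_line_info *linfo, __u32 linfo_cnt,
                                __u32 btf_fd)
        {
            union bpf_attr attr = {};

            attr.prog_type          = BPF_PROG_TYPE_XDP;
            attr.insns              = ptr_to_u64(insns);
            attr.insn_cnt           = insn_cnt;
            attr.license            = ptr_to_u64("GPL");
            attr.prog_btf_fd        = btf_fd;
            attr.line_info          = ptr_to_u64(linfo);
            attr.line_info_cnt      = linfo_cnt;
            attr.line_info_rec_size = sizeof(struct bpf_line_info);

            return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
        }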
  14. 01 December 2018, 1 commit
  15. 27 November 2018, 1 commit
  16. 11 November 2018, 1 commit
    • bpf: Allow narrow loads with offset > 0 · 46f53a65
      Andrey Ignatov committed
      Currently the BPF verifier allows narrow loads for a context field only
      with offset zero. E.g. if there is a __u32 field then only the following
      loads are permitted:
        * off=0, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=0, size=4 (full).
      
      On the other hand, LLVM can generate a load with an offset different from
      zero that makes sense from the program logic point of view, but the
      verifier doesn't accept it.
      
      E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code:
      
        #define DST_IP4			0xC0A801FEU /* 192.168.1.254 */
        ...
        	if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) &&
      
      where ctx is struct bpf_sock_addr.
      
      Some versions of LLVM can produce the following byte code for it:
      
             8:       71 12 07 00 00 00 00 00         r2 = *(u8 *)(r1 + 7)
             9:       67 02 00 00 18 00 00 00         r2 <<= 24
            10:       18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00         r3 = 4261412864 ll
            12:       5d 32 07 00 00 00 00 00         if r2 != r3 goto +7 <LBB0_6>
      
      where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1
      and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently
      rejected by verifier.
      
      The verifier code that rejects such loads is in bpf_ctx_narrow_access_ok(),
      which means any is_valid_access implementation that uses the function
      works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or
      sock_addr_is_valid_access() for bpf_sock_addr.
      
      The patch makes such loads supported. The offset can be in [0; size_default)
      but has to be a multiple of the load size. E.g. for a __u32 field the following
      loads are supported now:
        * off=0, size=1 (narrow);
        * off=1, size=1 (narrow);
        * off=2, size=1 (narrow);
        * off=3, size=1 (narrow);
        * off=0, size=2 (narrow);
        * off=2, size=2 (narrow);
        * off=0, size=4 (full).
      Reported-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      46f53a65
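      The relaxed rule above can be restated as a small predicate; this is a
      hedged paraphrase of the condition, not the literal verifier code:

        /* a narrow ctx load of `size` bytes at `off` inside a field whose
         * full width is `size_default` bytes is accepted when the offset is
         * aligned to the load size and stays within the field
         */
        static bool narrow_ctx_load_ok(u32 off, u32 size, u32 size_default)
        {
            return size <= size_default && off < size_default &&
                   off % size == 0;
        }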
  17. 26 October 2018, 1 commit
    • bpf: add bpf_jit_limit knob to restrict unpriv allocations · ede95a63
      Daniel Borkmann committed
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users, which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If the JIT was enabled but failed to generate the image, then
      before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      we would always fall back to the BPF interpreter. Nowadays, in the case
      where CONFIG_BPF_JIT_ALWAYS_ON is set, the load will abort
      with a failure since the BPF interpreter was compiled out.
      
      Add a global limit and enforce it for unprivileged users such that, with
      the BPF interpreter compiled out, we fail once the limit has been reached,
      or we fall back to the BPF interpreter earlier without using module memory
      if the latter was compiled in. In a next step, fair sharing among
      unprivileged users can be resolved, in particular for the case where we
      would fail hard once the limit is reached.
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ede95a63
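      A hedged sketch of the charging idea described above (not the literal
      kernel code; the accounting unit and the exact comparison against the
      sysctl-backed bpf_jit_limit are simplified):

        static atomic_long_t bpf_jit_current;

        /* charge `pages` against the global limit; only unprivileged users
         * are rejected once bpf_jit_limit is exceeded
         */
        static int bpf_jit_charge_modmem(u32 pages)
        {
            if (atomic_long_add_return(pages, &bpf_jit_current) >
                (bpf_jit_limit >> PAGE_SHIFT) && !capable(CAP_SYS_ADMIN)) {
                atomic_long_sub(pages, &bpf_jit_current);
                return -EPERM;
            }
            return 0;
        }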
  18. 20 October 2018, 1 commit
  19. 16 October 2018, 1 commit
    • bpf, sockmap: convert to generic sk_msg interface · 604326b4
      Daniel Borkmann committed
      Add a generic sk_msg layer, and convert current sockmap and later
      kTLS over to make use of it. While sk_buff handles network packet
      representation from netdevice up to socket, sk_msg handles data
      representation from application to socket layer.
      
      This means that sk_msg framework spans across ULP users in the
      kernel, and enables features such as introspection or filtering
      of data with the help of BPF programs that operate on this data
      structure.
      
      The latter becomes particularly useful for kTLS, where data encryption
      is deferred into the kernel, enabling the kernel to perform L7
      introspection and BPF-based policy for TLS connections, with the record
      being encrypted after BPF has run and come to a verdict. In order to get
      there, the first step is to transform the open coding of scatter-gather
      list handling into a common core framework that subsystems can use.
      
      The code itself has been split and refactored into three bigger
      pieces: i) the generic sk_msg API which deals with managing the
      scatter gather ring, providing helpers for walking and mangling,
      transferring application data from user space into it, and preparing
      it for BPF pre/post-processing, ii) the plain sock map itself
      where sockets can be attached to or detached from; these bits
      are independent of i) which can now be used also without sock
      map, and iii) the integration with plain TCP as one protocol
      to be used for processing L7 application data (later this could
      e.g. also be extended to other protocols like UDP). The semantics
      are the same as with the old sock map code, and therefore there is no
      change in user-facing behavior or APIs. Pursuing this work also helped
      to find a number of bugs in the old sockmap code that we've already
      fixed in earlier commits. The test_sockmap kselftest suite passes
      through fine as well.
      
      Joint work with John.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      604326b4
  20. 18 August 2018, 1 commit
    • bpf: fix redirect to map under tail calls · f6069b9a
      Daniel Borkmann committed
      Commits 109980b8 ("bpf: don't select potentially stale ri->map
      from buggy xdp progs") and 7c300131 ("bpf: fix ri->map_owner
      pointer on bpf_prog_realloc") tried to mitigate that buggy programs
      using bpf_redirect_map() helper call do not leave stale maps behind.
      The idea was to add a map_owner cookie into the per-CPU struct
      redirect_info, which was set to prog->aux by the prog making the helper
      call as proof that the map is not stale, since the prog is implicitly
      holding a reference to it. This owner cookie could later be compared with
      the program calling into BPF to see whether they match, and therefore the
      redirect could proceed with processing the map safely.
      
      In (obvious) hindsight, this approach breaks down when tail calls are
      involved since the original caller's prog->aux pointer does not have
      to match the one from one of the progs out of the tail call chain,
      and therefore the xdp buffer will be dropped instead of redirected.
      A way around that is to fix the issue differently (which also allows
      removing related work in the fast path at the same time): once the
      lifetime of a redirect map has come to its end, we use its map free
      callback, where we need to wait on synchronize_rcu() for current
      outstanding xdp buffers and remove such a map pointer from the
      redirect info if found to be present. At that time no program is
      using this map anymore, so we simply invalidate the map pointers to
      NULL iff they previously pointed to that instance, while making sure
      that the redirect path only reads out the map once.
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f6069b9a
  21. 11 August 2018, 2 commits
    • bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection · 8217ca65
      Martin KaFai Lau committed
      This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
      SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
      the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
      when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
      The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
      handle this case and return NULL immediately instead of continuing the
      sk search from its hashtable.
      
      It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
      BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
      if the attaching bpf prog is in the new SK_REUSEPORT or the existing
      SOCKET_FILTER type and then check different things accordingly.
      
      One level of "__reuseport_attach_prog()" call is removed.  The
      "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
      back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
      seems to have more knowledge on those test requirements than filter.c.
      In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
      the old_prog (if any) is also directly freed instead of returning the
      old_prog to the caller and asking the caller to free.
      
      The sysctl_optmem_max check is moved back to the
      "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
      As with other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
      bounded by the usual "bpf_prog_charge_memlock()" during load time,
      instead of being bounded by both bpf_prog_charge_memlock and
      sysctl_optmem_max.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      8217ca65
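      A hedged userspace sketch of attaching such a program through the
      existing setsockopt mentioned above; prog_fd is assumed to be a loaded
      BPF_PROG_TYPE_SK_REUSEPORT program, addr a prepared sockaddr, and error
      checking is omitted:

        #include <sys/socket.h>
        #include <netinet/in.h>

        static int setup_listener(int prog_fd, struct sockaddr_in *addr)
        {
            int one = 1;
            int lfd = socket(AF_INET, SOCK_STREAM, 0);

            setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            bind(lfd, (struct sockaddr *)addr, sizeof(*addr));
            setsockopt(lfd, SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF,
                       &prog_fd, sizeof(prog_fd));
            return listen(lfd, SOMAXCONN) ? -1 : lfd;
        }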
    • bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT · 2dbb9b9e
      Martin KaFai Lau committed
      This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
      a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
      non-SK_FILTER/CGROUP_SKB programs, it requires CAP_SYS_ADMIN.
      
      BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
      to store the bpf context instead of using the skb->cb[48].
      
      At SO_REUSEPORT sk lookup time, we are in the middle of transiting
      from a lower layer (ipv4/ipv6) to an upper layer (udp/tcp).  At this
      point, it is not always clear where the bpf context can be appended
      in the skb->cb[48] to avoid saving-and-restoring cb[], even putting
      aside the differences between ipv4 vs ipv6 and udp vs tcp.  It is not
      clear whether the lower layer will only ever be ipv4 and ipv6 in the
      future, nor whether it will avoid touching the cb[] again before
      transiting to the upper layer.
      
      For example, udp_gro_receive() uses the 48-byte NAPI_GRO_CB
      instead of IP[6]CB and it may still modify the cb[] after calling
      udp[46]_lib_lookup_skb().  For the above reasons, if
      skb->cb were used for the bpf ctx, saving-and-restoring would be needed,
      and likely the whole 48-byte cb[] would have to be saved and restored.
      
      Instead of saving, setting and restoring the cb[], this patch opts
      to create a new "struct sk_reuseport_kern" and setting the needed
      values in there.
      
      The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
      will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
      specific usage at this point and it is also inline with the current
      sock_reuseport.c implementation (i.e. no protocol specific requirement).
      
      In "struct sk_reuseport_md", this patch exposes data/data_end/len
      with semantic similar to other existing usages.  Together
      with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
      the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
      the bpf prog that the reuseport group is bind-ed to a local
      INANY address which cannot be learned from skb.
      
      The new "bind_inany" is added to "struct sock_reuseport" which will be
      used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
      to avoid repeating the "bind INANY" test on
      "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
      only be properly initialized when a "sk->sk_reuseport" enabled sk is
      adding to a hashtable (i.e. during "reuseport_alloc()" and
      "reuseport_add_sock()").
      
      The new "sk_select_reuseport()" is the main helper that the
      bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
      that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
      the earlier patch, the validity of a selected sk is checked in
      run time in "sk_select_reuseport()".  Doing the check in
      verification time is difficult and inflexible (consider the map-in-map
      use case).  The runtime check is to compare the selected sk's reuseport_id
      with the reuseport_id that we want.  This helper will return -EXXX if the
      selected sk cannot serve the incoming request (e.g. reuseport_id
      not match).  The bpf prog can decide if it wants to do SK_DROP as its
      discretion.
      
      When the bpf prog returns SK_PASS, the kernel will check if a
      valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
      If it has, it will use the selected sk.  If not, the kernel
      will select one from "reuse->socks[]" (as before this patch).
      
      The SK_DROP and SK_PASS handling logic will be in the next patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2dbb9b9e
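      A hedged sketch of a minimal SK_REUSEPORT program. The map definition
      uses the legacy bpf_map_def style from the selftests of that era and the
      reuseport socket-array map type as named in the merged UAPI
      (BPF_MAP_TYPE_REUSEPORT_SOCKARRAY); the fixed key 0 stands in for whatever
      selection logic a real program would compute from the packet:

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        struct bpf_map_def SEC("maps") reuseport_map = {
            .type        = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
            .key_size    = sizeof(__u32),
            .value_size  = sizeof(__u32),
            .max_entries = 8,
        };

        SEC("sk_reuseport")
        int select_sk(struct sk_reuseport_md *reuse_md)
        {
            __u32 key = 0;  /* a real program would derive this from data/data_end */

            if (bpf_sk_select_reuseport(reuse_md, &reuseport_map, &key, 0))
                return SK_DROP;
            return SK_PASS;
        }

        char _license[] SEC("license") = "GPL";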
  22. 10 August 2018, 2 commits
    • xdp: Helpers for disabling napi_direct of xdp_return_frame · 2539650f
      Toshiaki Makita committed
      We need some mechanism to disable napi_direct when calling
      xdp_return_frame_rx_napi() from certain contexts.
      When veth gets support for XDP_REDIRECT, it will redirect packets which
      are redirected from other devices. On redirection, veth will reuse the
      xdp_mem_info of the redirection source device to make return_frame work.
      But in this case the .ndo_xdp_xmit() called from veth redirection uses an
      xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
      is not called directly from the rxq which owns the xdp_mem_info.
      
      This approach introduces a flag in bpf_redirect_info to indicate that
      napi_direct should be disabled even when _rx_napi variant is used as
      well as helper functions to use it.
      
      A NAPI handler who wants to use this flag needs to call
      xdp_set_return_frame_no_direct() before processing packets, and call
      xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
      exiting NAPI.
      
      v4:
      - Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
        avoid per-frame copy cost.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2539650f
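      A hedged sketch of the calling convention described above for a NAPI
      handler; veth_xdp_rcv() stands in for the driver's actual packet-processing
      loop, and the completion handling is simplified:

        static int veth_poll(struct napi_struct *napi, int budget)
        {
            int done;

            /* frames returned while we run must not take the napi_direct path */
            xdp_set_return_frame_no_direct();

            done = veth_xdp_rcv(napi, budget);

            /* flush maps, then re-enable napi_direct before leaving NAPI */
            xdp_do_flush_map();
            xdp_clear_return_frame_no_direct();

            if (done < budget)
                napi_complete_done(napi, done);
            return done;
        }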
    • bpf: Make redirect_info accessible from modules · 0b19cc0a
      Toshiaki Makita committed
      We are going to add a kern_flags field in redirect_info for kernel-
      internal use.
      In order to avoid a function call to access the flags, make redirect_info
      accessible from modules. Also, as it is now non-static, add the prefix
      bpf_ to redirect_info.
      
      v6:
      - Fix sparse warning around EXPORT_SYMBOL.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0b19cc0a
  23. 08 July 2018, 1 commit
  24. 30 June 2018, 1 commit
    • bpf: undo prog rejection on read-only lock failure · 85782e03
      Daniel Borkmann committed
      Partially undo commit 9facc336 ("bpf: reject any prog that failed
      read-only lock") since it caused a regression, that is, syzkaller was
      able to manage to cause a panic via fault injection deep in set_memory_ro()
      path by letting an allocation fail: In x86's __change_page_attr_set_clr()
      it was able to change the attributes of the primary mapping but not in
      the alias mapping via cpa_process_alias(), so the second, inner call
      to __change_page_attr() via __change_page_attr_set_clr() had to split
      a larger page and failed in alloc_pages() with the artificially triggered
      allocation error, which is then propagated down to the call site.
      
      Thus, for set_memory_ro() this means that it returned with an error, but
      from debugging a probe_kernel_write() revealed EFAULT on that memory since
      the primary mapping succeeded to get changed. Therefore the subsequent
      hdr->locked = 0 reset triggered the panic as it was performed on read-only
      memory, so call-site assumptions were in fact wrong to assume that it would
      either succeed /or/ not succeed at all, since there's no such rollback in
      set_memory_*() calls from a partial change of mappings; in other words, we're
      left in a state that is "half done". A later undo via set_memory_rw() is
      succeeding though due to matching permissions on that part (aka due to the
      try_preserve_large_page() succeeding). While reproducing locally with
      explicitly triggering this error, the initial splitting only happens on
      rare occasions and in real world it would additionally need oom conditions,
      but that said, it could partially fail. Therefore, it is definitely wrong
      to bail out on set_memory_ro() error and reject the program with the
      set_memory_*() semantics we have today. We shouldn't have gone the extra
      mile since no other user in the tree today in fact checks for any
      set_memory_*() errors, e.g. neither module_enable_ro() / module_disable_ro()
      for module RO/NX handling, which is mostly the default these days, nor the
      kprobes core with alloc_insn_page() / free_insn_page(), as examples that
      could be invoked long after bootup. The original 314beb9b ("x86:
      bpf_jit_comp: secure bpf jit against spraying attacks") didn't check either
      when it first got introduced to BPF, so "improving" with bailing out was
      clearly not right when set_memory_*() cannot handle it today.
      
      Kees suggested that if set_memory_*() can fail, we should annotate it with
      __must_check, and all callers need to deal with it gracefully given those
      set_memory_*() markings aren't "advisory", but they're expected to actually
      do what they say. This might be an option worth to move forward in future
      but would at the same time require that set_memory_*() calls from supporting
      archs are guaranteed to be "atomic" in that they provide rollback if part
      of the range fails, once that happened, the transition from RW -> RO could
      be made more robust that way, while subsequent RO -> RW transition /must/
      continue guaranteeing to always succeed the undo part.
      
      Reported-by: syzbot+a4eb8c7766952a1ca872@syzkaller.appspotmail.com
      Reported-by: syzbot+d866d1925855328eac3b@syzkaller.appspotmail.com
      Fixes: 9facc336 ("bpf: reject any prog that failed read-only lock")
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      85782e03
  25. 21 June 2018, 1 commit
  26. 16 June 2018, 3 commits
    • xdp: Fix handling of devmap in generic XDP · 6d5fc195
      Toshiaki Makita committed
      Commit 67f29e07 ("bpf: devmap introduce dev_map_enqueue") changed
      the return value type of __devmap_lookup_elem() from struct net_device *
      to struct bpf_dtab_netdev * but forgot to modify generic XDP code
      accordingly.
      
      Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct
      net_device is expected, and skb->dev was set to an invalid value.
      
      v2:
      - Fix compiler warning without CONFIG_BPF_SYSCALL.
      
      Fixes: 67f29e07 ("bpf: devmap introduce dev_map_enqueue")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      6d5fc195
    • bpf: reject any prog that failed read-only lock · 9facc336
      Daniel Borkmann committed
      We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
      as well as the BPF image as read-only through bpf_prog_lock_ro(). In
      the case any of these fail, we throw a WARN_ON_ONCE() in order to
      yell loudly to the log. Perhaps, to some extent, this may be comparable
      to an allocation where __GFP_NOWARN is explicitly not set.
      
      Added via 65869a47 ("bpf: improve read-only handling"), this behavior
      is slightly different compared to any of the other in-kernel set_memory_ro()
      users who do not check the return code of set_memory_ro() and friends /at
      all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
      that in BPF this is a mandatory hardening step, we want to know whether
      there are any issues that would leave BPF data writable. So it happens
      that syzkaller enabled fault injection and it triggered a memory allocation
      failure deep inside x86's change_page_attr_set_clr(), which was triggered
      from set_memory_ro().
      
      Now, there are two options: i) leaving everything as is, and ii) reworking
      the image locking code in order to have a final checkpoint out of the
      central bpf_prog_select_runtime() which probes whether any of the calls
      during prog setup weren't successful, and then bailing out with an error.
      Option ii) is a better approach since this additional paranoia avoids
      altogether leaving any potential W+X pages from the BPF side in the system.
      Therefore, let's be strict about it and reject programs on such an unlikely
      occasion. While testing I also noticed that one bpf_prog_lock_ro()
      call was missing on the outer dummy prog in the case of calls, e.g. in the
      destructor we call bpf_prog_free_deferred() on the main prog where we
      try to bpf_prog_unlock_free() the program, and since we go via
      bpf_prog_select_runtime() do that as well.
      
      Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
      Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9facc336
    • bpf: fix panic in prog load calls cleanup · 7d1982b4
      Daniel Borkmann committed
      While testing I found that when hitting the error path in bpf_prog_load()
      where we jump to free_used_maps and the prog contained BPF-to-BPF calls
      that were JITed earlier, we never clean up the bpf_prog_kallsyms_add()
      done under jit_subprogs(). Add a proper API to make BPF kallsyms deletion
      clearer and fix that.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7d1982b4
  27. 03 June 2018, 3 commits
    • bpf: fix context access in tracing progs on 32 bit archs · bc23105c
      Daniel Borkmann committed
      Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
      program type in test_verifier report the following errors on x86_32:
      
        172/p unpriv: spill/fill of different pointers ldx FAIL
        Unexpected error message!
        0: (bf) r6 = r10
        1: (07) r6 += -8
        2: (15) if r1 == 0x0 goto pc+3
        R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
        3: (bf) r2 = r10
        4: (07) r2 += -76
        5: (7b) *(u64 *)(r6 +0) = r2
        6: (55) if r1 != 0x0 goto pc+1
        R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp
        7: (7b) *(u64 *)(r6 +0) = r1
        8: (79) r1 = *(u64 *)(r6 +0)
        9: (79) r1 = *(u64 *)(r1 +68)
        invalid bpf_context access off=68 size=8
      
        378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (71) r0 = *(u8 *)(r1 +68)
        invalid bpf_context access off=68 size=1
      
        379/p check bpf_perf_event_data->sample_period half load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (69) r0 = *(u16 *)(r1 +68)
        invalid bpf_context access off=68 size=2
      
        380/p check bpf_perf_event_data->sample_period word load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (61) r0 = *(u32 *)(r1 +68)
        invalid bpf_context access off=68 size=4
      
        381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
        Failed to load prog 'Permission denied'!
        0: (b7) r0 = 0
        1: (79) r0 = *(u64 *)(r1 +68)
        invalid bpf_context access off=68 size=8
      
      The reason is that struct pt_regs on x86_32 doesn't fully align to an
      8-byte boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok()
      will then bail out saying that off & (size_default - 1), which is 68 & 7,
      doesn't cleanly align in the case of sample_period access from struct
      bpf_perf_event_data, hence the verifier wrongly thinks we might be doing an
      unaligned access here though the underlying arch can handle it just fine.
      Therefore adjust this down to the machine size and check and rewrite the
      offset for narrow access on that basis. We also need to fix the corresponding
      pe_prog_is_valid_access(), since we hit the check for off % size != 0
      (e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs
      for tracing work on x86_32.
      Reported-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Wang YanQing <udknight@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bc23105c
    • bpf: avoid retpoline for lookup/update/delete calls on maps · 09772d92
      Daniel Borkmann committed
      While some of the BPF map lookup helpers provide a ->map_gen_lookup()
      callback for inlining the map lookup altogether it is not available
      for every map, so the remaining ones have to call bpf_map_lookup_elem()
      helper which does a dispatch to map->ops->map_lookup_elem(). In
      times of retpolines, this will control and trap speculative execution
      rather than letting it do its work for the indirect call and will
      therefore cause a slowdown. Likewise, bpf_map_update_elem() and
      bpf_map_delete_elem() do not have an inlined version and need to call
      into their map->ops->map_update_elem() resp. map->ops->map_delete_elem()
      handlers.
      
      Before:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#232656
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call bpf_map_delete_elem#215008  <-- indirect call via
         16: (95) exit                                 helper
      
      After:
      
        # bpftool prog dump xlated id 1
          0: (bf) r2 = r10
          1: (07) r2 += -8
          2: (7a) *(u64 *)(r2 +0) = 0
          3: (18) r1 = map[id:1]
          5: (85) call __htab_map_lookup_elem#233328
          6: (15) if r0 == 0x0 goto pc+4
          7: (71) r1 = *(u8 *)(r0 +35)
          8: (55) if r1 != 0x0 goto pc+1
          9: (72) *(u8 *)(r0 +35) = 1
         10: (07) r0 += 56
         11: (15) if r0 == 0x0 goto pc+4
         12: (bf) r2 = r0
         13: (18) r1 = map[id:1]
         15: (85) call htab_lru_map_delete_elem#238240  <-- direct call
         16: (95) exit
      
      In all three lookup/update/delete cases however we can use the actual
      address of the map callback directly if we find that there's only a
      single path with a map pointer leading to the helper call, meaning
      when the map pointer has not been poisoned from verifier side.
      Example code can be seen above for the delete case.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      09772d92
    • bpf: test case for map pointer poison with calls/branches · 06be0864
      Daniel Borkmann committed
      Add several test cases where the same or different map pointers
      originate from different paths in the program and execute a map
      lookup or tail call at a common location.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      06be0864
  28. 28 May 2018, 1 commit
    • bpf: Hooks for sys_sendmsg · 1cedee13
      Andrey Ignatov committed
      In addition to the already existing BPF hooks for sys_bind and sys_connect,
      the patch provides new hooks for sys_sendmsg.
      
      It leverages the existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
      that provides access to the socket itself (properties like family, type,
      protocol) and the user-passed `struct sockaddr *`, so that a BPF program can
      override the destination IP and port for system calls such as sendto(2) or
      sendmsg(2) and/or assign a source IP to the socket.
      
      The hooks are implemented as two new attach types:
      `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
      UDPv6 correspondingly.
      
      UDPv4 and UDPv6 get separate attach types for the same reason as the
      sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to
      e.g. user_ip6 fields when the user passes a sockaddr_in, since it'd be
      out of bounds.
      
      The difference from the already existing hooks is that the sys_sendmsg
      hooks are implemented only for unconnected UDP.
      
      For TCP it doesn't make sense to change the user-provided `struct sockaddr *`
      at sendto(2)/sendmsg(2) time, since the socket either was already connected
      and has source/destination set, or wasn't connected and a call to
      sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
      
      Connected UDP is already handled by the sys_connect hooks that can override
      source/destination at connect time and use the fast path later, i.e. these
      hooks don't affect the UDP fast path.
      
      Rewriting the source IP is implemented differently than in the sys_connect
      hooks. When sys_sendmsg is used with unconnected UDP, it doesn't work to
      just bind the socket to the desired local IP address, since the source IP
      can be set on a per-packet basis by using ancillary data (cmsg(3)). So no
      matter whether the socket is bound or not, the source IP has to be
      rewritten on every call to sys_sendmsg.
      
      To do so, two new fields are added to the UAPI `struct bpf_sock_addr`:
      * `msg_src_ip4` to set source IPv4 for UDPv4;
      * `msg_src_ip6` to set source IPv6 for UDPv6.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1cedee13
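      A hedged example of a sendmsg4 hook using the new msg_src_ip4 field
      together with the existing destination fields; the addresses and port are
      arbitrary, and the headers assumed are the UAPI bpf.h plus the selftests'
      bpf_helpers.h and bpf_endian.h:

        #include <linux/bpf.h>
        #include "bpf_helpers.h"
        #include "bpf_endian.h"

        SEC("cgroup/sendmsg4")
        int rewrite_sendmsg4(struct bpf_sock_addr *ctx)
        {
            ctx->msg_src_ip4 = bpf_htonl(0x0A000002);   /* source 10.0.0.2      */
            ctx->user_ip4    = bpf_htonl(0x0A000001);   /* destination 10.0.0.1 */
            ctx->user_port   = bpf_htons(5353);
            return 1;   /* allow the sendmsg */
        }

        char _license[] SEC("license") = "GPL";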