1. 21 11月, 2019 2 次提交
    • D
      bpf: Switch bpf_map_{area_alloc,area_mmapable_alloc}() to u64 size · 196e8ca7
      Daniel Borkmann 提交于
      Given we recently extended the original bpf_map_area_alloc() helper in
      commit fc970227 ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"),
      we need to apply the same logic as in ff1c08e1 ("bpf: Change size
      to u64 for bpf_map_{area_alloc, charge_init}()"). To avoid conflicts,
      extend it for bpf-next.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      196e8ca7
    • D
      bpf: Emit audit messages upon successful prog load and unload · 91e6015b
      Daniel Borkmann 提交于
      Allow for audit messages to be emitted upon BPF program load and
      unload for having a timeline of events. The load itself is in
      syscall context, so additional info about the process initiating
      the BPF prog creation can be logged and later directly correlated
      to the unload event.
      
      The only info really needed from BPF side is the globally unique
      prog ID where then audit user space tooling can query / dump all
      info needed about the specific BPF program right upon load event
      and enrich the record, thus these changes needed here can be kept
      small and non-intrusive to the core.
      
      Raw example output:
      
        # auditctl -D
        # auditctl -a always,exit -F arch=x86_64 -S bpf
        # ausearch --start recent -m 1334
        [...]
        ----
        time->Wed Nov 20 12:45:51 2019
        type=PROCTITLE msg=audit(1574271951.590:8974): proctitle="./test_verifier"
        type=SYSCALL msg=audit(1574271951.590:8974): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=7ffe2d923e80 a2=78 a3=0 items=0 ppid=742 pid=949 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
        type=UNKNOWN[1334] msg=audit(1574271951.590:8974): auid=0 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=949 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" prog-id=3260 event=LOAD
        ----
        time->Wed Nov 20 12:45:51 2019
      type=UNKNOWN[1334] msg=audit(1574271951.590:8975): prog-id=3260 event=UNLOAD
        ----
        [...]
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191120213816.8186-1-jolsa@kernel.org
      91e6015b
  2. 18 11月, 2019 3 次提交
    • A
      bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Andrii Nakryiko 提交于
      Add ability to memory-map contents of BPF array map. This is extremely useful
      for working with BPF global data from userspace programs. It allows to avoid
      typical bpf_map_{lookup,update}_elem operations, improving both performance
      and usability.
      
      There had to be special considerations for map freezing, to avoid having
      writable memory view into a frozen map. To solve this issue, map freezing and
      mmap-ing is happening under mutex now:
        - if map is already frozen, no writable mapping is allowed;
        - if map has writable memory mappings active (accounted in map->writecnt),
          map freezing will keep failing with -EBUSY;
        - once number of writable memory mappings drops to zero, map freezing can be
          performed again.
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
      to be mmap()'able. We also need to make sure that array data memory is
      page-sized and page-aligned, so we over-allocate memory in such a way that
      struct bpf_array is at the end of a single page of memory with array->value
      being aligned with the start of the second page. On deallocation we need to
      accomodate this memory arrangement to free vmalloc()'ed memory correctly.
      
      One important consideration regarding how memory-mapping subsystem functions.
      Memory-mapping subsystem provides few optional callbacks, among them open()
      and close().  close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources, if
      necessary. open() is *almost* symmetrical: it's called for each memory region
      that is being mapped, **except** the very first one. So bpf_map_mmap does
      initial refcnt bump, while open() will do any extra ones after that. Thus
      number of close() calls is equal to number of open() calls plus one more.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
      fc970227
    • A
      bpf: Convert bpf_prog refcnt to atomic64_t · 85192dbf
      Andrii Nakryiko 提交于
      Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
      and remove artificial 32k limit. This allows to make bpf_prog's refcounting
      non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc.
      
      Validated compilation by running allyesconfig kernel build.
      Suggested-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com
      85192dbf
    • A
      bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails · 1e0bd5a0
      Andrii Nakryiko 提交于
      92117d84 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
      potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
      (32k). Due to using 32-bit counter, it's possible in practice to overflow
      refcounter and make it wrap around to 0, causing erroneous map free, while
      there are still references to it, causing use-after-free problems.
      
      But having a failing refcounting operations are problematic in some cases. One
      example is mmap() interface. After establishing initial memory-mapping, user
      is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
      splitting it into multiple non-contiguous regions. All this happening without
      any control from the users of mmap subsystem. Rather mmap subsystem sends
      notifications to original creator of memory mapping through open/close
      callbacks, which are optionally specified during initial memory mapping
      creation. These callbacks are used to maintain accurate refcount for bpf_map
      (see next patch in this series). The problem is that open() callback is not
      supposed to fail, because memory-mapped resource is set up and properly
      referenced. This is posing a problem for using memory-mapping with BPF maps.
      
      One solution to this is to maintain separate refcount for just memory-mappings
      and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
      There are similar use cases in current work on tcp-bpf, necessitating extra
      counter as well. This seems like a rather unfortunate and ugly solution that
      doesn't scale well to various new use cases.
      
      Another approach to solve this is to use non-failing refcount_t type, which
      uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
      stays there. This utlimately causes memory leak, but prevents use after free.
      
      But given refcounting is not the most performance-critical operation with BPF
      maps (it's not used from running BPF program code), we can also just switch to
      64-bit counter that can't overflow in practice, potentially disadvantaging
      32-bit platforms a tiny bit. This simplifies semantics and allows above
      described scenarios to not worry about failing refcount increment operation.
      
      In terms of struct bpf_map size, we are still good and use the same amount of
      space:
      
      BEFORE (3 cache lines, 8 bytes of padding at the end):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
      	atomic_t                   usercnt;              /*   132     4 */
      	struct work_struct work;                         /*   136    32 */
      	char                       name[16];             /*   168    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 146, holes: 1, sum holes: 38 */
      	/* padding: 8 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      AFTER (same 3 cache lines, no extra padding now):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
      	atomic64_t                 usercnt;              /*   136     8 */
      	struct work_struct work;                         /*   144    32 */
      	char                       name[16];             /*   176    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 154, holes: 1, sum holes: 38 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      This patch, while modifying all users of bpf_map_inc, also cleans up its
      interface to match bpf_map_put with separate operations for bpf_map_inc and
      bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
      respectively). Also, given there are no users of bpf_map_inc_not_zero
      specifying uref=true, remove uref flag and default to uref=false internally.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
      1e0bd5a0
  3. 16 11月, 2019 7 次提交
    • A
      bpf: Support attaching tracing BPF program to other BPF programs · 5b92a28a
      Alexei Starovoitov 提交于
      Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type
      including their subprograms. This feature allows snooping on input and output
      packets in XDP, TC programs including their return values. In order to do that
      the verifier needs to track types not only of vmlinux, but types of other BPF
      programs as well. The verifier also needs to translate uapi/linux/bpf.h types
      used by networking programs into kernel internal BTF types used by FENTRY/FEXIT
      BPF programs. In some cases LLVM optimizations can remove arguments from BPF
      subprograms without adjusting BTF info that LLVM backend knows. When BTF info
      disagrees with actual types that the verifiers sees the BPF trampoline has to
      fallback to conservative and treat all arguments as u64. The FENTRY/FEXIT
      program can still attach to such subprograms, but it won't be able to recognize
      pointer types like 'struct sk_buff *' and it won't be able to pass them to
      bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program
      would need to use bpf_probe_read_kernel() instead.
      
      The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set
      to zero the attach_btf_id is one vmlinux BTF type ids. When attach_prog_fd
      points to previously loaded BPF program the attach_btf_id is BTF type id of
      main function or one of its subprograms.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
      5b92a28a
    • A
      bpf: Compare BTF types of functions arguments with actual types · 8c1b6e69
      Alexei Starovoitov 提交于
      Make the verifier check that BTF types of function arguments match actual types
      passed into top-level BPF program and into BPF-to-BPF calls. If types match
      such BPF programs and sub-programs will have full support of BPF trampoline. If
      types mismatch the trampoline has to be conservative. It has to save/restore
      five program arguments and assume 64-bit scalars.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org
      8c1b6e69
    • A
      bpf: Annotate context types · 91cc1a99
      Alexei Starovoitov 提交于
      Annotate BPF program context types with program-side type and kernel-side type.
      This type information is used by the verifier. btf_get_prog_ctx_type() is
      used in the later patches to verify that BTF type of ctx in BPF program matches to
      kernel expected ctx type. For example, the XDP program type is:
      BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff)
      That means that XDP program should be written as:
      int xdp_prog(struct xdp_md *ctx) { ... }
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-16-ast@kernel.org
      91cc1a99
    • A
      bpf: Fix race in btf_resolve_helper_id() · 9cc31b3a
      Alexei Starovoitov 提交于
      btf_resolve_helper_id() caching logic is a bit racy, since under root the
      verifier can verify several programs in parallel. Fix it with READ/WRITE_ONCE.
      Fix the type as well, since error is also recorded.
      
      Fixes: a7658e1a ("bpf: Check types of arguments passed into helpers")
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-15-ast@kernel.org
      9cc31b3a
    • A
      bpf: Introduce BPF trampoline · fec56f58
      Alexei Starovoitov 提交于
      Introduce BPF trampoline concept to allow kernel code to call into BPF programs
      with practically zero overhead.  The trampoline generation logic is
      architecture dependent.  It's converting native calling convention into BPF
      calling convention.  BPF ISA is 64-bit (even on 32-bit architectures). The
      registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
      program accepts only single argument "ctx" in R1. Whereas CPU native calling
      convention is different. x86-64 is passing first 6 arguments in registers
      and the rest on the stack. x86-32 is passing first 3 arguments in registers.
      sparc64 is passing first 6 in registers. And so on.
      
      The trampolines between BPF and kernel already exist.  BPF_CALL_x macros in
      include/linux/filter.h statically compile trampolines from BPF into kernel
      helpers. They convert up to five u64 arguments into kernel C pointers and
      integers. On 64-bit architectures this BPF_to_kernel trampolines are nops. On
      32-bit architecture they're meaningful.
      
      The opposite job kernel_to_BPF trampolines is done by CAST_TO_U64 macros and
      __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
      kernel function arguments into array of u64s that BPF program consumes via
      R1=ctx pointer.
      
      This patch set is doing the same job as __bpf_trace_##call() static
      trampolines, but dynamically for any kernel function. There are ~22k global
      kernel functions that are attachable via nop at function entry. The function
      arguments and types are described in BTF.  The job of btf_distill_func_proto()
      function is to extract useful information from BTF into "function model" that
      architecture dependent trampoline generators will use to generate assembly code
      to cast kernel function arguments into array of u64s.  For example the kernel
      function eth_type_trans has two pointers. They will be casted to u64 and stored
      into stack of generated trampoline. The pointer to that stack space will be
      passed into BPF program in R1. On x86-64 such generated trampoline will consume
      16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will
      make sure that only two u64 are accessed read-only by BPF program. The verifier
      will also recognize the precise type of the pointers being accessed and will
      not allow typecasting of the pointer to a different type within BPF program.
      
      The tracing use case in the datacenter demonstrated that certain key kernel
      functions have (like tcp_retransmit_skb) have 2 or more kprobes that are always
      active.  Other functions have both kprobe and kretprobe.  So it is essential to
      keep both kernel code and BPF programs executing at maximum speed. Hence
      generated BPF trampoline is re-generated every time new program is attached or
      detached to maintain maximum performance.
      
      To avoid the high cost of retpoline the attached BPF programs are called
      directly. __bpf_prog_enter/exit() are used to support per-program execution
      stats.  In the future this logic will be optimized further by adding support
      for bpf_stats_enabled_key inside generated assembly code. Introduction of
      preemptible and sleepable BPF programs will completely remove the need to call
      to __bpf_prog_enter/exit().
      
      Detach of a BPF program from the trampoline should not fail. To avoid memory
      allocation in detach path the half of the page is used as a reserve and flipped
      after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly
      which is enough for BPF tracing use cases. This limit can be increased in the
      future.
      
      BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
      BPF_TRACE_FEXIT programs have access to kernel return value as well. Often
      kprobe BPF program remembers function arguments in a map while kretprobe
      fetches arguments from a map and analyzes them together with return value.
      BPF_TRACE_FEXIT accelerates this typical use case.
      
      Recursion prevention for kprobe BPF programs is done via per-cpu
      bpf_prog_active counter. In practice that turned out to be a mistake. It
      caused programs to randomly skip execution. The tracing tools missed results
      they were looking for. Hence BPF trampoline doesn't provide builtin recursion
      prevention. It's a job of BPF program itself and will be addressed in the
      follow up patches.
      
      BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
      in the future. For example to remove retpoline cost from XDP programs.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
      fec56f58
    • A
      bpf: Add bpf_arch_text_poke() helper · 5964b200
      Alexei Starovoitov 提交于
      Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch
      nops/calls in kernel text into calls into BPF trampoline and to patch
      calls/nops inside BPF programs too.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org
      5964b200
    • I
      bpf: Support doubleword alignment in bpf_jit_binary_alloc · b7b3fc8d
      Ilya Leoshkevich 提交于
      Currently passing alignment greater than 4 to bpf_jit_binary_alloc does
      not work: in such cases it silently aligns only to 4 bytes.
      
      On s390, in order to load a constant from memory in a large (>512k) BPF
      program, one must use lgrl instruction, whose memory operand must be
      aligned on an 8-byte boundary.
      
      This patch makes it possible to request 8-byte alignment from
      bpf_jit_binary_alloc, and also makes it issue a warning when an
      unsupported alignment is requested.
      Signed-off-by: NIlya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191115123722.58462-1-iii@linux.ibm.com
      b7b3fc8d
  4. 03 11月, 2019 3 次提交
    • D
      bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers · 6ae08ae3
      Daniel Borkmann 提交于
      The current bpf_probe_read() and bpf_probe_read_str() helpers are broken
      in that they assume they can be used for probing memory access for kernel
      space addresses /as well as/ user space addresses.
      
      However, plain use of probe_kernel_read() for both cases will attempt to
      always access kernel space address space given access is performed under
      KERNEL_DS and some archs in-fact have overlapping address spaces where a
      kernel pointer and user pointer would have the /same/ address value and
      therefore accessing application memory via bpf_probe_read{,_str}() would
      read garbage values.
      
      Lets fix BPF side by making use of recently added 3d708182 ("uaccess:
      Add non-pagefault user-space read functions"). Unfortunately, the only way
      to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}()
      and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}()
      helpers are kept as-is to retain their current behavior.
      
      The two *_user() variants attempt the access always under USER_DS set, the
      two *_kernel() variants will -EFAULT when accessing user memory if the
      underlying architecture has non-overlapping address ranges, also avoiding
      throwing the kernel warning via 00c42373 ("x86-64: add warning for
      non-canonical user access address dereferences").
      
      Fixes: a5e8c070 ("bpf: add bpf_probe_read_str helper")
      Fixes: 2541517c ("tracing, perf: Implement BPF programs attached to kprobes")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
      6ae08ae3
    • D
      uaccess: Add strict non-pagefault kernel-space read function · 75a1a607
      Daniel Borkmann 提交于
      Add two new probe_kernel_read_strict() and strncpy_from_unsafe_strict()
      helpers which by default alias to the __probe_kernel_read() and the
      __strncpy_from_unsafe(), respectively, but can be overridden by archs
      which have non-overlapping address ranges for kernel space and user
      space in order to bail out with -EFAULT when attempting to probe user
      memory including non-canonical user access addresses [0]:
      
        4-level page tables:
          user-space mem: 0x0000000000000000 - 0x00007fffffffffff
          non-canonical:  0x0000800000000000 - 0xffff7fffffffffff
      
        5-level page tables:
          user-space mem: 0x0000000000000000 - 0x00ffffffffffffff
          non-canonical:  0x0100000000000000 - 0xfeffffffffffffff
      
      The idea is that these helpers are complementary to the probe_user_read()
      and strncpy_from_unsafe_user() which probe user-only memory. Both added
      helpers here do the same, but for kernel-only addresses.
      
      Both set of helpers are going to be used for BPF tracing. They also
      explicitly avoid throwing the splat for non-canonical user addresses from
      00c42373 ("x86-64: add warning for non-canonical user access address
      dereferences").
      
      For compat, the current probe_kernel_read() and strncpy_from_unsafe() are
      left as-is.
      
        [0] Documentation/x86/x86_64/mm.txt
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: x86@kernel.org
      Link: https://lore.kernel.org/bpf/eefeefd769aa5a013531f491a71f0936779e916b.1572649915.git.daniel@iogearbox.net
      75a1a607
    • D
      uaccess: Add non-pagefault user-space write function · 1d1585ca
      Daniel Borkmann 提交于
      Commit 3d708182 ("uaccess: Add non-pagefault user-space read functions")
      missed to add probe write function, therefore factor out a probe_write_common()
      helper with most logic of probe_kernel_write() except setting KERNEL_DS, and
      add a new probe_user_write() helper so it can be used from BPF side.
      
      Again, on some archs, the user address space and kernel address space can
      co-exist and be overlapping, so in such case, setting KERNEL_DS would mean
      that the given address is treated as being in kernel address space.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Link: https://lore.kernel.org/bpf/9df2542e68141bfa3addde631441ee45503856a8.1572649915.git.daniel@iogearbox.net
      1d1585ca
  5. 02 11月, 2019 2 次提交
  6. 01 11月, 2019 8 次提交
  7. 31 10月, 2019 15 次提交
    • A
      bpf: Replace prog_raw_tp+btf_id with prog_tracing · f1b9509c
      Alexei Starovoitov 提交于
      The bpf program type raw_tp together with 'expected_attach_type'
      was the most appropriate api to indicate BTF-enabled raw_tp programs.
      But during development it became apparent that 'expected_attach_type'
      cannot be used and new 'attach_btf_id' field had to be introduced.
      Which means that the information is duplicated in two fields where
      one of them is ignored.
      Clean it up by introducing new program type where both
      'expected_attach_type' and 'attach_btf_id' fields have
      specific meaning.
      In the future 'expected_attach_type' will be extended
      with other attach points that have similar semantics to raw_tp.
      This patch is replacing BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with
      prog_type = BPF_RPOG_TYPE_TRACING
      expected_attach_type = BPF_TRACE_RAW_TP
      attach_btf_id = btf_id of raw tracepoint inside the kernel
      Future patches will add
      expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT
      where programs have the same input context and the same helpers,
      but different attach points.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191030223212.953010-2-ast@kernel.org
      f1b9509c
    • J
      efi/efi_test: Lock down /dev/efi_test and require CAP_SYS_ADMIN · 359efcc2
      Javier Martinez Canillas 提交于
      The driver exposes EFI runtime services to user-space through an IOCTL
      interface, calling the EFI services function pointers directly without
      using the efivar API.
      
      Disallow access to the /dev/efi_test character device when the kernel is
      locked down to prevent arbitrary user-space to call EFI runtime services.
      
      Also require CAP_SYS_ADMIN to open the chardev to prevent unprivileged
      users to call the EFI runtime services, instead of just relying on the
      chardev file mode bits for this.
      
      The main user of this driver is the fwts [0] tool that already checks if
      the effective user ID is 0 and fails otherwise. So this change shouldn't
      cause any regression to this tool.
      
      [0]: https://wiki.ubuntu.com/FirmwareTestSuite/Reference/uefivarinfoSigned-off-by: NJavier Martinez Canillas <javierm@redhat.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NLaszlo Ersek <lersek@redhat.com>
      Acked-by: NMatthew Garrett <mjg59@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191029173755.27149-7-ardb@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      359efcc2
    • K
      x86, efi: Never relocate kernel below lowest acceptable address · 220dd769
      Kairui Song 提交于
      Currently, kernel fails to boot on some HyperV VMs when using EFI.
      And it's a potential issue on all x86 platforms.
      
      It's caused by broken kernel relocation on EFI systems, when below three
      conditions are met:
      
      1. Kernel image is not loaded to the default address (LOAD_PHYSICAL_ADDR)
         by the loader.
      2. There isn't enough room to contain the kernel, starting from the
         default load address (eg. something else occupied part the region).
      3. In the memmap provided by EFI firmware, there is a memory region
         starts below LOAD_PHYSICAL_ADDR, and suitable for containing the
         kernel.
      
      EFI stub will perform a kernel relocation when condition 1 is met. But
      due to condition 2, EFI stub can't relocate kernel to the preferred
      address, so it fallback to ask EFI firmware to alloc lowest usable memory
      region, got the low region mentioned in condition 3, and relocated
      kernel there.
      
      It's incorrect to relocate the kernel below LOAD_PHYSICAL_ADDR. This
      is the lowest acceptable kernel relocation address.
      
      The first thing goes wrong is in arch/x86/boot/compressed/head_64.S.
      Kernel decompression will force use LOAD_PHYSICAL_ADDR as the output
      address if kernel is located below it. Then the relocation before
      decompression, which move kernel to the end of the decompression buffer,
      will overwrite other memory region, as there is no enough memory there.
      
      To fix it, just don't let EFI stub relocate the kernel to any address
      lower than lowest acceptable address.
      
      [ ardb: introduce efi_low_alloc_above() to reduce the scope of the change ]
      Signed-off-by: NKairui Song <kasong@redhat.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191029173755.27149-6-ardb@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      220dd769
    • V
      net: sched: update action implementations to support flags · e3822678
      Vlad Buslov 提交于
      Extend struct tc_action with new "tcfa_flags" field. Set the field in
      tcf_idr_create() function and provide new helper
      tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags
      value. Update individual hardware-offloaded actions init() to pass their
      "flags" argument to new helper in order to skip percpu stats allocation
      when user requested it through flags.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3822678
    • V
      net: sched: extend TCA_ACT space with TCA_ACT_FLAGS · abbb0d33
      Vlad Buslov 提交于
      Extend TCA_ACT space with nla_bitfield32 flags. Add
      TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
      tcf_action_init_1() and pass resulting value as additional argument to
      a_o->init().
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      abbb0d33
    • V
      net: sched: modify stats helper functions to support regular stats · 5e174d5e
      Vlad Buslov 提交于
      Modify stats update helper functions introduced in previous patches in this
      series to fallback to regular tc_action->tcfa_{b|q}stats if cpu stats are
      not allocated for the action argument. If regular non-percpu allocated
      counters are in use, then obtain action tcfa_lock while modifying them.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e174d5e
    • V
      net: sched: don't expose action qstats to skb_tc_reinsert() · ef816f3c
      Vlad Buslov 提交于
      Previous commit introduced helper function for updating qstats and
      refactored set of actions to use the helpers, instead of modifying qstats
      directly. However, one of the affected action exposes its qstats to
      skb_tc_reinsert(), which then modifies it.
      
      Refactor skb_tc_reinsert() to return integer error code and don't increment
      overlimit qstats in case of error, and use the returned error code in
      tcf_mirred_act() to manually increment the overlimit counter with new
      helper function.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef816f3c
    • V
      net: sched: extract qstats update code into functions · 26b537a8
      Vlad Buslov 提交于
      Extract common code that increments cpu_qstats counters into standalone act
      API functions. Change hardware offloaded actions that use percpu counter
      allocation to use the new functions instead of accessing cpu_qstats
      directly.
      
      This commit doesn't change functionality.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26b537a8
    • V
      net: sched: extract bstats update code into function · 5e1ad95b
      Vlad Buslov 提交于
      Extract common code that increments cpu_bstats counter into standalone act
      API function. Change hardware offloaded actions that use percpu counter
      allocation to use the new function instead of incrementing cpu_bstats
      directly.
      
      This commit doesn't change functionality.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e1ad95b
    • V
      net: sched: extract common action counters update code into function · c8ecebd0
      Vlad Buslov 提交于
      Currently, all implementations of tc_action_ops->stats_update() callback
      have almost exactly the same implementation of counters update
      code (besides gact which also updates drop counter). In order to simplify
      support for using both percpu-allocated and regular action counters
      depending on run-time flag in following patches, extract action counters
      update code into standalone function in act API.
      
      This commit doesn't change functionality.
      Signed-off-by: NVlad Buslov <vladbu@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8ecebd0
    • E
      net: annotate lockless accesses to sk->sk_napi_id · ee8d153d
      Eric Dumazet 提交于
      We already annotated most accesses to sk->sk_napi_id
      
      We missed sk_mark_napi_id() and sk_mark_napi_id_once()
      which might be called without socket lock held in UDP stack.
      
      KCSAN reported :
      BUG: KCSAN: data-race in udpv6_queue_rcv_one_skb / udpv6_queue_rcv_one_skb
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 0:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
      
      write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 1:
       sk_mark_napi_id include/net/busy_poll.h:125 [inline]
       __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
       udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
       udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
       udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
       __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
       udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
       ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
       ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
       dst_input include/net/dst.h:442 [inline]
       ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 10890 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: e68b6e50 ("udp: enable busy polling for all sockets")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee8d153d
    • M
      flow_dissector: extract more ICMP information · 5dec597e
      Matteo Croce 提交于
      The ICMP flow dissector currently parses only the Type and Code fields.
      Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which
      is used to correlate packets.
      Add such field in flow_dissector_key_icmp and replace skb_flow_get_be16()
      with a more complex function which populate this field.
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5dec597e
    • M
      flow_dissector: add meaningful comments · 98298e6c
      Matteo Croce 提交于
      Documents two piece of code which can't be understood at a glance.
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98298e6c
    • E
      net: annotate accesses to sk->sk_incoming_cpu · 7170a977
      Eric Dumazet 提交于
      This socket field can be read and written by concurrent cpus.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7170a977
    • J
      tipc: add smart nagle feature · c0bceb97
      Jon Maloy 提交于
      We introduce a feature that works like a combination of TCP_NAGLE and
      TCP_CORK, but without some of the weaknesses of those. In particular,
      we will not observe long delivery delays because of delayed acks, since
      the algorithm itself decides if and when acks are to be sent from the
      receiving peer.
      
      - The nagle property as such is determined by manipulating a new
        'maxnagle' field in struct tipc_sock. If certain conditions are met,
        'maxnagle' will define max size of the messages which can be bundled.
        If it is set to zero no messages are ever bundled, implying that the
        nagle property is disabled.
      - A socket with the nagle property enabled enters nagle mode when more
        than 4 messages have been sent out without receiving any data message
        from the peer.
      - A socket leaves nagle mode whenever it receives a data message from
        the peer.
      
      In nagle mode, messages smaller than 'maxnagle' are accumulated in the
      socket write queue. The last buffer in the queue is marked with a new
      'ack_required' bit, which forces the receiving peer to send a CONN_ACK
      message back to the sender upon reception.
      
      The accumulated contents of the write queue is transmitted when one of
      the following events or conditions occur.
      
      - A CONN_ACK message is received from the peer.
      - A data message is received from the peer.
      - A SOCK_WAKEUP pseudo message is received from the link level.
      - The write queue contains more than 64 1k blocks of data.
      - The connection is being shut down.
      - There is no CONN_ACK message to expect. I.e., there is currently
        no outstanding message where the 'ack_required' bit was set. As a
        consequence, the first message added after we enter nagle mode
        is always sent directly with this bit set.
      
      This new feature gives a 50-100% improvement of throughput for small
      (i.e., less than MTU size) messages, while it might add up to one RTT
      to latency time when the socket is in nagle mode.
      Acked-by: NYing Xue <ying.xue@windreiver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0bceb97