1. 07 Nov 2020, 1 commit
  2. 29 Oct 2020, 1 commit
    • bpf: Permit cond_resched for some iterators · cf83b2d2
      Yonghong Song authored
      Commit e679654a ("bpf: Fix a rcu_sched stall issue with
      bpf task/task_file iterator") tries to fix the rcu stall warning
      caused by the bpf task_file iterator when running
      "bpftool prog".
      
            rcu: INFO: rcu_sched self-detected stall on CPU
            rcu:     7-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
                (t=21031 jiffies g=2534773 q=179750)
            NMI backtrace for cpu 7
            CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G        W         5.8.0-00004-g68bfc7f8c1b4 #6
            Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
            Call Trace:
            <IRQ>
            dump_stack+0x57/0x70
            nmi_cpu_backtrace.cold+0x14/0x53
            ? lapic_can_unplug_cpu.cold+0x39/0x39
            nmi_trigger_cpumask_backtrace+0xb7/0xc7
            rcu_dump_cpu_stacks+0xa2/0xd0
            rcu_sched_clock_irq.cold+0x1ff/0x3d9
            ? tick_nohz_handler+0x100/0x100
            update_process_times+0x5b/0x90
            tick_sched_timer+0x5e/0xf0
            __hrtimer_run_queues+0x12a/0x2a0
            hrtimer_interrupt+0x10e/0x280
            __sysvec_apic_timer_interrupt+0x51/0xe0
            asm_call_on_stack+0xf/0x20
            </IRQ>
            sysvec_apic_timer_interrupt+0x6f/0x80
            ...
            task_file_seq_next+0x52/0xa0
            bpf_seq_read+0xb9/0x320
            vfs_read+0x9d/0x180
            ksys_read+0x5f/0xe0
            do_syscall_64+0x38/0x60
            entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The fix was to limit the number of bpf program runs to
      one million. This fixed the problem in most cases. But
      we also found that under heavy load, which can increase the
      wallclock time of bpf_seq_read(), the warning may still show up.

      For example, with bpf_delay() below called in the "while" loop of
      bpf_seq_read() to introduce an artificial delay,
      the warning shows up in my qemu run.
      
        static unsigned q;
        volatile unsigned *p = &q;
        volatile unsigned long long ll;
        static void bpf_delay(void)
        {
               int i, j;
      
               for (i = 0; i < 10000; i++)
                       for (j = 0; j < 10000; j++)
                               ll += *p;
        }
      
      There are two ways to fix this issue. One is to reduce the above
      one-million threshold to, say, 100,000 and hope the rcu warning
      does not show up any more. The other is to introduce a target feature
      which enables bpf_seq_read() to call cond_resched().

      This patch takes the second approach, as the first may cause
      more -EAGAIN failures for read() syscalls. Note that not all bpf_iter
      targets can permit cond_resched() in bpf_seq_read(): for some, e.g., the
      netlink seq iterator, an rcu read lock critical section spans
      seq_ops->next() -> seq_ops->show() -> seq_ops->next().
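
      As a rough user-space analogy (a hedged sketch, not the kernel change
      itself: sched_yield() stands in for cond_resched() and all names are
      invented for illustration), the read loop keeps processing elements but
      voluntarily yields every so often when the target permits it:

        #include <sched.h>
        #include <stdbool.h>
        #include <stdio.h>

        static void process_one_element(unsigned long i)
        {
                (void)i;        /* simulated per-element work */
        }

        static void seq_read_like_loop(bool permit_resched)
        {
                unsigned long processed = 0;

                while (processed < 1000000) {
                        process_one_element(processed++);
                        /* Yield periodically so one long read() cannot hog the
                         * CPU long enough to trigger an RCU stall warning. */
                        if (permit_resched && (processed % 8192) == 0)
                                sched_yield();
                }
                printf("processed %lu elements\n", processed);
        }

        int main(void)
        {
                seq_read_like_loop(true);
                return 0;
        }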
      
      For the kernel code with the above hack, "bpftool p" roughly takes
      38 seconds to finish on my VM with 184 bpf program runs.
      Using the following command, I am able to collect the number of
      context switches:
         perf stat -e context-switches -- ./bpftool p >& log
      Without this patch,
         69      context-switches
      With this patch,
         75      context-switches
      This patch adds 6 additional context switches, roughly one reschedule
      every 6 seconds, to avoid the lengthy stretches without rescheduling
      that may cause the above RCU warnings.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201028061054.1411116-1-yhs@fb.com
  3. 12 Oct 2020, 1 commit
    • bpf: Allow for map-in-map with dynamic inner array map entries · 4a8f87e6
      Daniel Borkmann authored
      Recent work in f4d05259 ("bpf: Add map_meta_equal map ops") and 134fede4
      ("bpf: Relax max_entries check for most of the inner map types") added support
      for dynamic inner max elements for most map-in-map types. Exceptions were maps
      like array or prog array where the map_gen_lookup() callback uses the maps'
      max_entries field as a constant when emitting instructions.
      
      We recently implemented Maglev consistent hashing into Cilium's load balancer
      which uses map-in-map with an outer map being hash and inner being array holding
      the Maglev backend table for each service. It has been designed this way in
      order to reduce overall memory consumption given the outer hash map allows
      avoiding the preallocation of a large, flat memory area for all services. Also,
      the number of service mappings is not always known a priori.
      
      The use case for dynamic inner array map entries is to further reduce memory
      overhead; for example, some services might have just a small number of backends
      while others could have a large number. Right now the Maglev backend tables for
      services with few and with many backends need to have the same number of inner
      array map entries, which adds a lot of unneeded overhead.
      
      Dynamic inner array map entries can be realized by avoiding the inlined code
      generation for their lookup. The lookup will still be efficient since it will
      call into array_map_lookup_elem() directly and thus avoid a retpoline.
      The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
      inline code generation and relaxes the array_map_meta_equal() check to ignore
      both maps' max_entries. This still allows faster lookups for map-in-map
      when BPF_F_INNER_MAP is not specified and dynamic max_entries is not needed.
      
      Example code generation where inner map is dynamic sized array:
      
        # bpftool p d x i 125
        int handle__sys_enter(void * ctx):
        ; int handle__sys_enter(void *ctx)
           0: (b4) w1 = 0
        ; int key = 0;
           1: (63) *(u32 *)(r10 -4) = r1
           2: (bf) r2 = r10
        ;
           3: (07) r2 += -4
        ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
           4: (18) r1 = map[id:468]
           6: (07) r1 += 272
           7: (61) r0 = *(u32 *)(r2 +0)
           8: (35) if r0 >= 0x3 goto pc+5
           9: (67) r0 <<= 3
          10: (0f) r0 += r1
          11: (79) r0 = *(u64 *)(r0 +0)
          12: (15) if r0 == 0x0 goto pc+1
          13: (05) goto pc+1
          14: (b7) r0 = 0
          15: (b4) w6 = -1
        ; if (!inner_map)
          16: (15) if r0 == 0x0 goto pc+6
          17: (bf) r2 = r10
        ;
          18: (07) r2 += -4
        ; val = bpf_map_lookup_elem(inner_map, &key);
          19: (bf) r1 = r0                               | No inlining but instead
          20: (85) call array_map_lookup_elem#149280     | call to array_map_lookup_elem()
        ; return val ? *val : -1;                        | for inner array lookup.
          21: (15) if r0 == 0x0 goto pc+1
        ; return val ? *val : -1;
          22: (61) r6 = *(u32 *)(r0 +0)
        ; }
          23: (bc) w0 = w6
          24: (95) exit
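
      A minimal user-space sketch of creating such a setup with BPF_F_INNER_MAP,
      written against the newer libbpf bpf_map_create() API (which postdates this
      patch); map names are arbitrary and error handling is omitted:

        #include <bpf/bpf.h>          /* bpf_map_create(), LIBBPF_OPTS() */
        #include <linux/bpf.h>        /* BPF_F_INNER_MAP, BPF_ANY */

        int create_service_maps(void)
        {
                /* Inner arrays created with BPF_F_INNER_MAP skip the inlined
                 * lookup, so their max_entries may differ from the template's. */
                LIBBPF_OPTS(bpf_map_create_opts, inner_opts,
                            .map_flags = BPF_F_INNER_MAP);
                int small = bpf_map_create(BPF_MAP_TYPE_ARRAY, "backends_small",
                                           sizeof(__u32), sizeof(__u64), 16,
                                           &inner_opts);
                int large = bpf_map_create(BPF_MAP_TYPE_ARRAY, "backends_large",
                                           sizeof(__u32), sizeof(__u64), 4096,
                                           &inner_opts);

                /* The outer hash-of-maps uses one inner map as its template. */
                LIBBPF_OPTS(bpf_map_create_opts, outer_opts,
                            .inner_map_fd = small);
                int outer = bpf_map_create(BPF_MAP_TYPE_HASH_OF_MAPS, "services",
                                           sizeof(__u32), sizeof(__u32), 128,
                                           &outer_opts);

                __u32 svc = 1;
                /* Inserting an inner array with a different max_entries works
                 * because both inner maps carry BPF_F_INNER_MAP. */
                bpf_map_update_elem(outer, &svc, &large, BPF_ANY);
                return outer;
        }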
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
  4. 03 Oct 2020, 2 commits
  5. 30 Sep 2020, 2 commits
  6. 29 Sep 2020, 5 commits
  7. 26 Sep 2020, 2 commits
    • bpf: Add comment to document BTF type PTR_TO_BTF_ID_OR_NULL · ba5f4cfe
      John Fastabend authored
      The meaning of PTR_TO_BTF_ID_OR_NULL differs slightly from other types
      denoted with the *_OR_NULL type. For example the types PTR_TO_SOCKET
      and PTR_TO_SOCKET_OR_NULL can be used for branch analysis because the
      type PTR_TO_SOCKET is guaranteed to _not_ have a null value.
      
      In contrast PTR_TO_BTF_ID and PTR_TO_BTF_ID_OR_NULL have slightly
      different meanings. A PTR_TO_BTF_ID may be a NULL pointer,
      but it is safe to read this pointer in the program context because
      the program context will handle any faults. The fallout is that for
      PTR_TO_BTF_ID the verifier can assume reads are safe, but can not
      use the type in branch analysis. Additionally, authors need to be
      extra careful when passing PTR_TO_BTF_ID into helpers. In general
      helpers consuming type PTR_TO_BTF_ID will need to assume it may
      be null.
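
      As a hedged illustration (the attach point and field access are assumptions,
      not part of this patch; it assumes a generated vmlinux.h), a tracing program
      can dereference a PTR_TO_BTF_ID argument directly because a faulting load is
      handled, while a helper receiving the same pointer still has to tolerate NULL:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        char LICENSE[] SEC("license") = "GPL";

        SEC("fentry/tcp_close")
        int BPF_PROG(trace_tcp_close, struct sock *sk, long timeout)
        {
                /* sk is PTR_TO_BTF_ID: the load below needs no NULL check since
                 * a fault is handled, yet the verifier cannot treat sk as
                 * provably non-NULL for branch analysis. */
                __u16 dport = sk->__sk_common.skc_dport;

                bpf_printk("tcp_close dport %u", __builtin_bswap16(dport));
                return 0;
        }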
      
      Since the above is not obvious to readers without this background
      knowledge, let's add a comment to the type definition.
      
      Editorial comment: as networking and tracing programs get closer
      and more tightly merged, we may need to consider a new type that we
      can ensure is non-null for branch analysis and also for passing into
      helpers.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Lorenz Bauer <lmb@cloudflare.com>
    • bpf: Enable bpf_skc_to_* sock casting helper to networking prog type · 1df8f55a
      Martin KaFai Lau authored
      There is a constant need to add more fields to the bpf_tcp_sock
      for the bpf programs running at tc, sock_ops, etc.
      
      A current workaround is to use bpf_probe_read_kernel().  However,
      other than requiring another helper call to read each field and missing
      CO-RE, it is also not as intuitive to use as directly reading
      "tp->lsndtime", for example.  Since a prog doing bpf_probe_read_kernel()
      already needs the perfmon cap, it will be much easier if the bpf prog can
      directly read from the tcp_sock.
      
      This patch tries to do that by using the existing casting-helpers
      bpf_skc_to_*() whose func_proto returns a btf_id.  For example, the
      func_proto of bpf_skc_to_tcp_sock returns the btf_id of the
      kernel "struct tcp_sock".
      
      These helpers are also added to is_ptr_cast_function().
      This ensures the returned reg (BPF_REG_0) also carries the ref_obj_id,
      which keeps ref-tracking working properly.
      
      The bpf_skc_to_* helpers are made available to most of the bpf prog
      types in filter.c. The bpf_skc_to_* helpers will be limited by
      perfmon cap.
      
      This patch adds a ARG_PTR_TO_BTF_ID_SOCK_COMMON.  The helper accepting
      this arg can accept a btf-id-ptr (PTR_TO_BTF_ID + &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON])
      or a legacy-ctx-convert-skc-ptr (PTR_TO_SOCK_COMMON).  The bpf_skc_to_*()
      helpers are changed to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that
      they will accept pointer obtained from skb->sk.
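
      A hedged sketch of what this enables (section name, hook and field choice
      are illustrative assumptions; it assumes a generated vmlinux.h and that the
      loader has the required perfmon capability): a tc program can now pass
      skb->sk to bpf_skc_to_tcp_sock() and read tcp_sock fields directly.

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        char LICENSE[] SEC("license") = "GPL";

        SEC("tc")
        int read_lsndtime(struct __sk_buff *skb)
        {
                struct bpf_sock *sk = skb->sk;
                struct tcp_sock *tp;

                if (!sk)        /* skb->sk is PTR_TO_SOCK_COMMON_OR_NULL */
                        return 0;
                /* Accepted via ARG_PTR_TO_BTF_ID_SOCK_COMMON. */
                tp = bpf_skc_to_tcp_sock(sk);
                if (!tp)
                        return 0;
                /* Direct BTF-typed read, no bpf_probe_read_kernel() needed. */
                bpf_printk("lsndtime %u", tp->lsndtime);
                return 0;
        }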
      
      Instead of specifying both arg_type and arg_btf_id in the same func_proto,
      which is how the current ARG_PTR_TO_BTF_ID does it, the arg_btf_id of
      the new ARG_PTR_TO_BTF_ID_SOCK_COMMON is specified in the
      compatible_reg_types[] in verifier.c.  The reason is that the arg_btf_id is
      always the same.  Discussion in this thread:
      https://lore.kernel.org/bpf/20200922070422.1917351-1-kafai@fb.com/
      
      The ARG_PTR_TO_BTF_ID_ part gives a clear expectation that the helper is
      expecting a PTR_TO_BTF_ID which could be NULL.  This is the same
      behavior as the existing helper taking ARG_PTR_TO_BTF_ID.
      
      The _SOCK_COMMON part means the helper is also expecting the legacy
      SOCK_COMMON pointer.
      
      By excluding the _OR_NULL part, the bpf prog cannot call the helper
      with a literal NULL, which doesn't make sense in most cases.
      e.g. bpf_skc_to_tcp_sock(NULL) will be rejected.  All PTR_TO_*_OR_NULL
      regs have to do a NULL check first before being passed into the helper,
      or else the bpf prog will be rejected.  This behavior is nothing new and
      consistent with the current expectation during bpf-prog-load.
      
      [ ARG_PTR_TO_BTF_ID_SOCK_COMMON will be used to replace
        ARG_PTR_TO_SOCK* of other existing helpers later such that
        those existing helpers can take the PTR_TO_BTF_ID returned by
        the bpf_skc_to_*() helpers.
      
        The only special case is bpf_sk_lookup_assign() which can accept a
        literal NULL ptr.  It has to be handled specially in another follow
        up patch if there is a need (e.g. by renaming ARG_PTR_TO_SOCKET_OR_NULL
        to ARG_PTR_TO_BTF_ID_SOCK_COMMON_OR_NULL). ]
      
      [ When converting the older helpers that take ARG_PTR_TO_SOCK* in
        the later patch, if the kernel does not support BTF,
        ARG_PTR_TO_BTF_ID_SOCK_COMMON will behave like ARG_PTR_TO_SOCK_COMMON
        because no reg->type could have PTR_TO_BTF_ID in this case.
      
        It is not a concern for the newer-btf-only helper like the bpf_skc_to_*()
        here though because these helpers must require BTF vmlinux to begin
        with. ]
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200925000350.3855720-1-kafai@fb.com
  8. 22 Sep 2020, 3 commits
  9. 18 Sep 2020, 3 commits
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Maciej Fijalkowski authored
      This commit serves two purposes:
      1) it optimizes BPF prologue/epilogue generation
      2) it makes it possible to have tailcalls within BPF subprograms
      
      Both points are related to each other since without 1), 2) could not be
      achieved.
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call
      counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
      
      Since the stack layout is changed, tail call counter handling can no
      longer rely on popping it into rbx, the way it was handled for the
      constant prologue case where rbx was later overwritten with the actual
      value that had been pushed to the stack. Therefore, let's use one of the
      registers (%rcx) that is considered volatile/caller-saved and pop the
      value of the tail call counter into it in the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect where instruction layout is not constant
      anymore.
      
      Introduce a new poke target, 'tailcall_bypass', to the poke descriptor.
      It is dedicated to skipping the register pops and stack unwind that are
      generated right before the actual jump to the target program.
      For the case when the target program is not present, the BPF program will
      skip the pop instructions and the nop5 dedicated for jmpq $target. An
      example of such a state when only R6 of the callee-saved registers is
      used by the program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When the target program is inserted, the jump that was there to skip
      the pops/nop5 becomes the nop5, so the CPU will go over the pops and do
      the actual tailcall.

      One might ask why the pushes simply cannot come after the nop5.
      Consider the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall
      and the jump target is not present. ctx is in the %rbx register and the
      BPF subprogram that we will call into at ffffffffc037034c relies on it,
      e.g. it will pick ctx up from there. Such a code layout is therefore
      broken as we would overwrite the content of %rbx with the value that was
      pushed in the prologue. That is the reason for the 'bypass' approach.
      
      Special care needs to be taken during the install/update/remove of the
      tailcall target. In the case when the target program is not present, the
      CPU must not execute the pop instructions that precede the tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      A is forbidden (it leads to incorrectness). The state transitions between
      tailcall install/update/remove will work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to jump, then we wait for an RCU
         grace period so that other programs will finish their execution and
         after that we are safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way CPU can never be exposed to "unwind, tail" state.
      
      Last but not least, when tailcalls get mixed with bpf2bpf calls, it
      would be possible to encounter an endless loop due to clearing the
      tailcall counter if, for example, we used a subprogram-based variant of
      the tailcall3-like program from the BPF selftests, meaning the tailcall
      would be present within the BPF subprogram.
      
      This test, broken down to particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of prologue and this subprogram
      has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
      
      To address this, the idea is to go through the call chain of bpf2bpf progs
      and look for a tailcall anywhere in the whole chain. If we see a single
      tail call then each node in this call chain needs to be marked as a subprog
      that can reach the tailcall. We later feed the JIT with this info
      and:
      - set eax to 0 only when tailcall is reachable and this is the entry prog
      - if tailcall is reachable but there's no tailcall in insns of currently
        JITed prog then push rax anyway, so that it will be possible to
        propagate further down the call chain
      - finally if tailcall is reachable, then we need to precede the 'call'
        insn with mov rax, [rbp - (stack_depth + 8)]
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
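
      As a hedged illustration of what 2) enables (the map layout, section name
      and use of libbpf's bpf_tail_call_static() are assumptions for this sketch),
      a tailcall may now be issued from within a non-inlined BPF subprogram:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        char LICENSE[] SEC("license") = "GPL";

        struct {
                __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
                __uint(max_entries, 1);
                __uint(key_size, sizeof(__u32));
                __uint(value_size, sizeof(__u32));
        } jmp_table SEC(".maps");

        static __noinline int subprog_tail(struct __sk_buff *skb)
        {
                /* Tailcall inside a BPF subprogram: mixing bpf2bpf calls and
                 * tailcalls used to be rejected before this rework. */
                bpf_tail_call_static(skb, &jmp_table, 0);
                return 1;
        }

        SEC("tc")
        int entry(struct __sk_buff *skb)
        {
                return subprog_tail(skb);
        }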
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: rename poke descriptor's 'ip' member to 'tailcall_target' · cf71b174
      Maciej Fijalkowski authored
      Reflect the actual purpose of poke->ip and rename it to
      poke->tailcall_target so that it will not be confused with another
      poke target that will be introduced in the next commit.
      
      While at it, do the same thing with poke->ip_stable - rename it to
      poke->tailcall_target_stable.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: propagate poke descriptors to subprograms · a748c697
      Maciej Fijalkowski authored
      Previously, there was no need for poke descriptors to be present in a
      subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
      in them. Each subprog is JITed independently, so in order to enable
      JITing of subprograms that use tailcalls, do the following:
      
      - in fixup_bpf_calls() store the index of the tailcall insn in the
        generated poke descriptor,
      - in case insn patching occurs, adjust the tailcall insn idx from
        bpf_patch_insn_data,
      - then in jit_subprogs() check whether the given poke descriptor belongs
        to the current subprog by checking if the previously stored absolute
        index of the tail call insn is in the scope of the insns of the given
        subprog,
      - update insn->imm with the new poke descriptor slot so that the proper
        poke descriptor will be grabbed while JITing
      
      This way each of the main program's poke descriptors is distributed
      across the subprograms' poke descriptor arrays, so the main program's
      descriptors can be untracked from the prog array map.

      Also add the subprog's aux struct to the BPF map's poke_progs list by
      calling map_poke_track() on it.

      In case of any error, call map_poke_untrack() on the subprog aux
      structs that have already been registered with the prog array map.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  10. 16 Sep 2020, 1 commit
  11. 29 Aug 2020, 2 commits
    • bpf: Add bpf_copy_from_user() helper. · 07be4c4a
      Alexei Starovoitov authored
      Sleepable BPF programs can now use copy_from_user() to access user memory.
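
      A hedged sketch of the kind of program this enables (the hook choice and
      field chain are assumptions; the hook must be on the sleepable LSM
      allowlist, and libbpf's ".s" section suffix requests BPF_F_SLEEPABLE):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        char LICENSE[] SEC("license") = "GPL";

        SEC("lsm.s/bprm_committed_creds")      /* ".s" = sleepable program */
        int BPF_PROG(log_argv0, struct linux_binprm *bprm)
        {
                char argv0[64] = {};
                unsigned long arg_start = bprm->vma->vm_mm->arg_start;

                /* Only sleepable programs may call this helper, as it can
                 * fault in user pages. */
                bpf_copy_from_user(argv0, sizeof(argv0), (const void *)arg_start);
                bpf_printk("exec argv0: %s", argv0);
                return 0;
        }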
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: KP Singh <kpsingh@google.com>
      Link: https://lore.kernel.org/bpf/20200827220114.69225-4-alexei.starovoitov@gmail.com
    • bpf: Introduce sleepable BPF programs · 1e6c62a8
      Alexei Starovoitov authored
      Introduce sleepable BPF programs that can request this property for themselves
      via the BPF_F_SLEEPABLE flag at program load time. In that case they will be able
      to use helpers like bpf_copy_from_user() that might sleep. At present only
      fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
      when they are attached to kernel functions that are known to allow sleeping.
      
      The non-sleepable programs rely on implicit rcu_read_lock() and
      migrate_disable() to protect the lifetime of programs, the maps that they use
      and the per-cpu kernel structures used to pass info between bpf programs and
      the kernel. The sleepable programs cannot be enclosed in rcu_read_lock().
      migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
      should not be enclosed in migrate_disable() either. Therefore
      rcu_read_lock_trace is used to protect the lifetime of sleepable progs.
      
      There are many networking and tracing program types. In many cases the
      'struct bpf_prog *' pointer itself is rcu protected within some other kernel
      data structure and the kernel code is using rcu_dereference() to load that
      program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
      Instead sleepable bpf programs are allowed with bpf trampoline only. The
      program pointers are hard-coded into generated assembly of bpf trampoline and
      synchronize_rcu_tasks_trace() is used to protect the life time of the program.
      The same trampoline can hold both sleepable and non-sleepable progs.
      
      When rcu_read_lock_trace is held it means that some sleepable bpf program is
      running from bpf trampoline. Those programs can use bpf arrays and preallocated
      hash/lru maps. These map types wait for programs to complete via
      synchronize_rcu_tasks_trace().
      
      Updates to a trampoline now have to do synchronize_rcu_tasks_trace() and
      synchronize_rcu_tasks() to wait for sleepable progs to finish and for the
      trampoline assembly to finish.
      
      This is the first step of introducing sleepable progs. Eventually dynamically
      allocated hash maps can be allowed and networking program types can become
      sleepable too.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: KP Singh <kpsingh@google.com>
      Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
  12. 28 Aug 2020, 1 commit
    • bpf: Add map_meta_equal map ops · f4d05259
      Martin KaFai Lau authored
      Some properties of the inner map are used at verification time.
      When an inner map is inserted into an outer map at runtime,
      bpf_map_meta_equal() is currently used to ensure those properties
      of the inserted inner map stay the same as at verification
      time.
      
      In particular, the current bpf_map_meta_equal() checks max_entries, which
      turns out to be too restrictive for most of the maps that do not use
      max_entries at verification time.  It limits the use case that
      wants to replace a smaller inner map with a larger inner map.  There are
      some maps that do use max_entries during verification, though.  For example,
      the map_gen_lookup in array_map_ops uses max_entries to generate
      the inline lookup code.
      
      To accommodate differences between maps, the map_meta_equal is added
      to bpf_map_ops.  Each map-type can decide what to check when its
      map is used as an inner map during runtime.
      
      Also, some map types cannot be used as an inner map and they are
      currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.
      It is not unusual for new map types to be unaware that such a
      blacklist exists.  This patch enforces an explicit opt-in
      and only allows a map to be used as an inner map if it has
      implemented the map_meta_equal ops.  It is based on the
      discussion in [1].
      
      All maps that support being used as an inner map have their map_meta_equal
      pointing to bpf_map_meta_equal in this patch.  A later patch will
      relax the max_entries check for most maps.  bpf_types.h
      counts 28 map types.  This patch adds 23 ".map_meta_equal"
      entries by using coccinelle.  The remaining 5 are:
      	BPF_MAP_TYPE_PROG_ARRAY
      	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
      	BPF_MAP_TYPE_STRUCT_OPS
      	BPF_MAP_TYPE_ARRAY_OF_MAPS
      	BPF_MAP_TYPE_HASH_OF_MAPS
      
      The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
      is moved such that the same error is returned.
      
      [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
  13. 26 Aug 2020, 3 commits
  14. 22 Aug 2020, 3 commits
  15. 20 Aug 2020, 1 commit
  16. 07 Aug 2020, 1 commit
    • bpf: Change uapi for bpf iterator map elements · 5e7b3020
      Yonghong Song authored
      Commit a5cbe05a ("bpf: Implement bpf iterator for
      map elements") added bpf iterator support for
      map elements. The map element bpf iterator requires
      info to identify a particular map. In the above
      commit, the attr->link_create.target_fd is used
      to carry the map_fd, and an enum bpf_iter_link_info
      is added to the uapi to specify that the target_fd
      actually represents a map_fd:
          enum bpf_iter_link_info {
      	BPF_ITER_LINK_UNSPEC = 0,
      	BPF_ITER_LINK_MAP_FD = 1,
      
      	MAX_BPF_ITER_LINK_INFO,
          };
      
      This is an extensible approach as we can grow the
      enumerator for pid, cgroup_id, etc. and we can
      unionize target_fd for pid, cgroup_id, etc.
      But in the future, there are chances that
      more complex customization may happen, e.g.,
      for tasks, they could be filtered based on
      both cgroup_id and user_id.
      
      This patch changes the uapi to have the fields
      	__aligned_u64	iter_info;
      	__u32		iter_info_len;
      for additional iter_info for link_create.
      The iter_info is defined as
      	union bpf_iter_link_info {
      		struct {
      			__u32   map_fd;
      		} map;
      	};
      
      So future extension for additional customization
      will be easier. The bpf_iter_link_info will be
      passed to the target callback to validate, and the
      generic bpf_iter framework does not need to deal
      with it any more.
      
      Note that map_fd = 0 will be considered invalid
      and -EBADF will be returned to user space.
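
      A hedged user-space sketch of attaching a map-element iterator with the
      new fields, using libbpf's low-level bpf_link_create() wrapper (the opts
      field names follow current libbpf and are an assumption here; prog_fd and
      map_fd are taken as valid descriptors):

        #include <bpf/bpf.h>
        #include <linux/bpf.h>

        int attach_map_elem_iter(int prog_fd, int map_fd)
        {
                union bpf_iter_link_info linfo = {};
                LIBBPF_OPTS(bpf_link_create_opts, opts);

                linfo.map.map_fd = map_fd;   /* must be valid; 0 gets -EBADF */
                opts.iter_info = &linfo;
                opts.iter_info_len = sizeof(linfo);

                /* target_fd is unused here; the map is identified via iter_info. */
                return bpf_link_create(prog_fd, 0, BPF_TRACE_ITER, &opts);
        }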
      
      Fixes: a5cbe05a ("bpf: Implement bpf iterator for map elements")
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200805055056.1457463-1-yhs@fb.com
  17. 02 Aug 2020, 1 commit
  18. 26 Jul 2020, 7 commits