1. 17 Sep 2022, 2 commits
  2. 15 Sep 2022, 1 commit
    • bpf: Add verifier check for BPF_PTR_POISON retval and arg · 47e34cb7
      Dave Marchevsky authored
      BPF_PTR_POISON was added in commit c0a5a21c ("bpf: Allow storing
      referenced kptr in map") to denote a bpf_func_proto btf_id which the
      verifier will replace with a dynamically-determined btf_id at verification
      time.
      
      This patch adds verifier 'poison' functionality to BPF_PTR_POISON in
      order to prepare for expanded use of the value to poison ret- and
      arg-btf_id in ongoing work, namely rbtree and linked list patchsets
      [0, 1]. Specifically, when the verifier checks helper calls, it assumes
      that BPF_PTR_POISON'ed ret type will be replaced with a valid type before
      - or in lieu of - the default ret_btf_id logic. Similarly for arg btf_id.
      
      If a poisoned btf_id reaches the default handling block for either, consider
      this a verifier internal error and fail verification. Otherwise a helper
      with a poisoned btf_id but no verifier logic replacing the type would cause a
      crash as the invalid pointer is dereferenced (a hedged sketch of this check
      follows this entry).
      
      Also move BPF_PTR_POISON to the existing include/linux/poison.h header and
      remove the unnecessary shift.
      
        [0]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
        [1]: lore.kernel.org/bpf/20220904204145.3089-1-memxor@gmail.com
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220912154544.1398199-1-davemarchevsky@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      47e34cb7
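      A minimal userspace sketch of the "poison reached default handling" check described above; BPF_PTR_POISON's value and the check_helper_ret_btf_id() helper are illustrative assumptions, not the kernel's actual code:

          /* Illustrative only: a sentinel "poison" pointer that specialized
           * verifier logic is expected to replace before default handling runs. */
          #include <stdio.h>

          #define BPF_PTR_POISON ((const void *)0xeB9FUL)

          static int check_helper_ret_btf_id(const void *ret_btf_id)
          {
              if (ret_btf_id == BPF_PTR_POISON) {
                  /* Reaching the default path with a poisoned id means the
                   * specialized handling was missed: fail verification instead
                   * of dereferencing an invalid pointer later. */
                  fprintf(stderr, "verifier internal error: poisoned ret btf_id\n");
                  return -1;
              }
              /* ... default ret_btf_id handling would continue here ... */
              return 0;
          }

          int main(void)
          {
              return check_helper_ret_btf_id(BPF_PTR_POISON) ? 1 : 0;
          }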
  3. 11 Sep 2022, 2 commits
  4. 08 Sep 2022, 5 commits
  5. 07 Sep 2022, 1 commit
  6. 05 Sep 2022, 5 commits
    • bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc. · 9f2c6e96
      Alexei Starovoitov authored
      User space might be creating and destroying a lot of hash maps. Synchronous
      rcu_barrier-s in the destruction path of a hash map delay freeing of hash
      buckets and other map memory, and may cause an artificial OOM situation under
      stress. Optimize rcu_barrier usage between the bpf hash map and bpf_mem_alloc
      (a sketch of the last point follows this entry):
      - remove rcu_barrier from the hash map, since htab doesn't use call_rcu
        directly and there are no callbacks to wait for.
      - bpf_mem_alloc has a call_rcu_in_progress flag that indicates pending
        callbacks. Use it to avoid barriers in the fast path.
      - When barriers are needed, copy bpf_mem_alloc into a temporary structure
        and wait for the rcu barrier-s in a worker, letting the rest of the
        hash map freeing proceed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
      9f2c6e96
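      A userspace analogy (not the kernel code) of the "copy into a temp structure and let a worker do the expensive wait" idea above; the struct fields, destroy_worker() and the sleep() standing in for rcu_barrier() are assumptions for illustration:

          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <unistd.h>

          struct mem_alloc {            /* stand-in for bpf_mem_alloc */
              int call_rcu_in_progress; /* flag: callbacks still pending */
              /* ... caches, freelists ... */
          };

          static void *destroy_worker(void *arg)
          {
              struct mem_alloc *copy = arg;

              if (copy->call_rcu_in_progress)
                  sleep(1);   /* the expensive wait (rcu_barrier in the kernel) */
              free(copy);     /* now it is safe to free the copied caches */
              return NULL;
          }

          static void mem_alloc_destroy(struct mem_alloc *ma)
          {
              struct mem_alloc *copy = malloc(sizeof(*copy));
              pthread_t worker;

              memcpy(copy, ma, sizeof(*copy));
              memset(ma, 0, sizeof(*ma));   /* caller's struct is done immediately */
              pthread_create(&worker, NULL, destroy_worker, copy);
              pthread_detach(worker);
          }

          int main(void)
          {
              struct mem_alloc ma = { .call_rcu_in_progress = 1 };

              mem_alloc_destroy(&ma);
              puts("destroy path returned without blocking");
              sleep(2);   /* only so the detached worker finishes in this demo */
              return 0;
          }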
    • bpf: Add percpu allocation support to bpf_mem_alloc. · 4ab67149
      Alexei Starovoitov authored
      Extend bpf_mem_alloc to cache a free list of fixed-size per-cpu allocations.
      Once such a cache is created, bpf_mem_cache_alloc() will return per-cpu
      objects and bpf_mem_cache_free() will free them back into the global per-cpu
      pool after observing an RCU grace period.
      The per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps.
      
      The free list cache consists of tuples { llist_node, per-cpu pointer }.
      Unlike alloc_percpu(), which returns a per-cpu pointer,
      bpf_mem_cache_alloc() returns a pointer to the per-cpu pointer, and
      bpf_mem_cache_free() expects to receive that same pointer back (a sketch
      follows this entry).
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220902211058.60789-11-alexei.starovoitov@gmail.com
      4ab67149
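      A userspace analogy (not the kernel code) of the { llist_node, per-cpu pointer } tuple described above: the allocator hands out the address of the per-cpu pointer and the free path expects that same address back; pcpu_free_node, fake_alloc_percpu() and the simple singly linked free list are assumptions for illustration:

          #include <stddef.h>
          #include <stdio.h>
          #include <stdlib.h>

          struct pcpu_free_node {
              struct pcpu_free_node *next;   /* stand-in for llist_node */
              void *pptr;                    /* stand-in for the per-cpu pointer */
          };

          static struct pcpu_free_node *free_list;

          static void *fake_alloc_percpu(size_t size)
          {
              return calloc(1, size);        /* stand-in for alloc_percpu() */
          }

          static void **cache_alloc(size_t size)
          {
              struct pcpu_free_node *node = free_list;

              if (node) {
                  free_list = node->next;
              } else {
                  node = malloc(sizeof(*node));
                  node->pptr = fake_alloc_percpu(size);
              }
              return &node->pptr;            /* pointer to the per-cpu pointer */
          }

          static void cache_free(void **objp)
          {
              /* recover the node from the address of its pptr member */
              struct pcpu_free_node *node =
                  (struct pcpu_free_node *)((char *)objp -
                                            offsetof(struct pcpu_free_node, pptr));

              node->next = free_list;        /* the kernel waits for RCU first */
              free_list = node;
          }

          int main(void)
          {
              void **obj = cache_alloc(64);

              printf("per-cpu pointer lives at %p\n", (void *)obj);
              cache_free(obj);
              return 0;
          }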
    • bpf: Introduce any context BPF specific memory allocator. · 7c8199e2
      Alexei Starovoitov authored
      Tracing BPF programs can attach to kprobe and fentry. Hence they
      run in unknown context where calling plain kmalloc() might not be safe.
      
      Front-end kmalloc() with a minimal per-cpu cache of free elements,
      refilling the cache asynchronously from irq_work.
      
      BPF programs always run with migration disabled.
      It's safe to allocate from the cache of the current cpu with irqs disabled.
      Freeing is always done into the bucket of the current cpu as well.
      irq_work trims extra free elements from buckets with kfree
      and refills them with kmalloc, so the global kmalloc logic takes care
      of freeing objects allocated by one cpu and freed on another.
      
      struct bpf_mem_alloc supports two modes:
      - When size != 0 create a kmem_cache and bpf_mem_cache for each cpu.
        This is the typical bpf hash map use case when all elements have equal size.
      - When size == 0 allocate 11 bpf_mem_cache-s for each cpu, then rely on
        kmalloc/kfree. The max allocation size is 4096 in this case.
        This is the bpf_dynptr and bpf_kptr use case (a sketch of the size-class
        idea follows this entry).
      
      bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
      bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
      
      The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com
      7c8199e2
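      A small sketch of the size == 0 mode's size-class idea described above; the eleven class sizes listed here are an assumption chosen to cover 16..4096 bytes, not necessarily the exact ones the kernel uses:

          #include <stdio.h>

          /* Illustrative size classes: 11 caches covering allocations up to 4096. */
          static const int size_classes[11] = {
              16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096
          };

          /* Map a requested size to the smallest class that fits,
           * or -1 if the request is larger than the biggest class. */
          static int size_to_class(int size)
          {
              for (int i = 0; i < 11; i++)
                  if (size <= size_classes[i])
                      return i;
              return -1;
          }

          int main(void)
          {
              printf("96B   -> class %d\n", size_to_class(96));    /* exact fit */
              printf("100B  -> class %d\n", size_to_class(100));   /* rounds up to 128 */
              printf("5000B -> class %d\n", size_to_class(5000));  /* too big: -1 */
              return 0;
          }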
    • net: phy: Add 1000BASE-KX interface mode · 05ad5d45
      Sean Anderson authored
      Add 1000BASE-KX interface mode. This is 1G backplane ethernet, as described
      in clause 70. Clause 73 autonegotiation is mandatory, and only full duplex
      operation is supported.
      
      Although at the PMA level this interface mode is identical to
      1000BASE-X, it uses a different form of in-band autonegotiation. This
      justifies a separate interface mode, since the interface mode (along
      with the MLO_AN_* autonegotiation mode) sets the type of autonegotiation
      which will be used on a link. This results in more than just electrical
      differences between the link modes.
      
      With regard to 1000BASE-X, 1000BASE-KX holds a similar position to
      SGMII: same signaling, but different autonegotiation. PCS drivers
      (which typically handle in-band autonegotiation) may only support
      1000BASE-X, and not 1000BASE-KX. Similarly, the phy mode is used to
      configure serdes phys with phy_set_mode_ext. Due to the different
      electrical standards (SFI or XFI vs Clause 70), they will likely want to
      use different configurations. Adding a phy interface mode for
      1000BASE-KX helps simplify configuration in these areas (see the sketch
      after this entry).
      Signed-off-by: Sean Anderson <sean.anderson@seco.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05ad5d45
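      A hedged sketch of how a MAC driver might hand the interface mode to its serdes lane via the generic phy API mentioned above; the PHY_INTERFACE_MODE_1000BASEKX enum name and the example_serdes_config() helper are assumptions based on the description, not the patch itself:

          #include <linux/errno.h>
          #include <linux/phy.h>       /* phy_interface_t, PHY_INTERFACE_MODE_* */
          #include <linux/phy/phy.h>   /* struct phy, phy_set_mode_ext() */

          /* Pass the interface mode down to the serdes so it can pick the
           * matching electrical configuration (Clause 70 for 1000BASE-KX). */
          static int example_serdes_config(struct phy *serdes, phy_interface_t iface)
          {
              switch (iface) {
              case PHY_INTERFACE_MODE_1000BASEX:
              case PHY_INTERFACE_MODE_1000BASEKX:   /* assumed enum name */
                  /* Same signaling as 1000BASE-X, but the serdes may still need
                   * different electricals/autoneg handling for the KX case. */
                  return phy_set_mode_ext(serdes, PHY_MODE_ETHERNET, iface);
              default:
                  return -EINVAL;
              }
          }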
    • net: pcs: add new PCS driver for altera TSE PCS · 4a502cf4
      Maxime Chevallier authored
      The Altera Triple Speed Ethernet has an SGMII/1000BaseX PCS that can be
      integrated in several ways. It can either be part of the TSE MAC's
      address space, accessed through 32-bit accesses on the mapped mdio
      device 0, or through a dedicated 16-bit register set.
      
      This driver allows using the TSE PCS outside of altera TSE's driver,
      since it can be used standalone by other MACs.
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a502cf4
  7. 03 Sep 2022, 3 commits
  8. 02 Sep 2022, 3 commits
  9. 01 Sep 2022, 1 commit
    • mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse · 2555283e
      Jann Horn authored
      anon_vma->degree tracks the combined number of child anon_vmas and VMAs
      that use the anon_vma as their ->anon_vma.
      
      anon_vma_clone() then assumes that for any anon_vma attached to
      src->anon_vma_chain other than src->anon_vma, it is impossible for it to
      be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
      elevated by 1 because of a child anon_vma, meaning that if ->degree
      equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.
      
      This assumption is wrong because the ->degree optimization leads to leaf
      nodes being abandoned on anon_vma_clone() - an existing anon_vma is
      reused and no new parent-child relationship is created.  So it is
      possible to reuse an anon_vma for one VMA while it is still tied to
      another VMA.
      
      This is an issue because is_mergeable_anon_vma() and its callers assume
      that if two VMAs have the same ->anon_vma, the list of anon_vmas
      attached to the VMAs is guaranteed to be the same.  When this assumption
      is violated, vma_merge() can merge pages into a VMA that is not attached
      to the corresponding anon_vma, leading to dangling page->mapping
      pointers that will be dereferenced during rmap walks.
      
      Fix it by separately tracking the number of child anon_vmas and the
      number of VMAs using the anon_vma as their ->anon_vma (a simplified
      sketch follows this entry).
      
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Cc: stable@kernel.org
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2555283e
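      A simplified sketch of the split bookkeeping described above; the field names and the reuse predicate are illustrative assumptions, not the exact kernel logic:

          #include <stdio.h>

          /* Instead of one ambiguous 'degree', count the two things separately. */
          struct anon_vma_counts {
              unsigned long num_children;    /* child anon_vmas hanging off this one */
              unsigned long num_active_vmas; /* VMAs whose ->anon_vma points here */
          };

          /* Reuse on clone is only safe when no VMA still uses this anon_vma as
           * its ->anon_vma; a single 'degree' conflated that with child counts. */
          static int safe_to_reuse(const struct anon_vma_counts *c)
          {
              return c->num_active_vmas == 0;
          }

          int main(void)
          {
              struct anon_vma_counts leaf = { .num_children = 1, .num_active_vmas = 1 };

              /* 'degree' would be 2 for both a pure parent and this still-used
               * leaf; the split counters make the difference visible. */
              printf("safe to reuse: %d\n", safe_to_reuse(&leaf));
              return 0;
          }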
  10. 31 Aug 2022, 2 commits
  11. 30 Aug 2022, 3 commits
  12. 29 Aug 2022, 4 commits
  13. 27 Aug 2022, 1 commit
  14. 26 Aug 2022, 3 commits
    • lsm,io_uring: add LSM hooks for the new uring_cmd file op · 2a584012
      Luis Chamberlain authored
      io-uring cmd support was added through ee692a21 ("fs,io_uring:
      add infrastructure for uring-cmd"); this extended struct
      file_operations to allow a new command which each subsystem can use
      to enable command passthrough. Add an LSM hook specifically for the
      command passthrough which enables LSMs to inspect the command details.
      
      This was discussed long ago [0] without a clear conclusion, so this
      at least enables LSMs to reject this new file operation (a hedged
      sketch follows this entry).
      
      [0] https://lkml.kernel.org/r/8adf55db-7bab-f59d-d612-ed906b948d19@schaufler-ca.com
      
      Cc: stable@vger.kernel.org
      Fixes: ee692a21 ("fs,io_uring: add infrastructure for uring-cmd")
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      2a584012
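      A hedged sketch of an LSM wiring up the new hook to deny command passthrough outright; the uring_cmd hook name and struct io_uring_cmd argument follow the description above, but the surrounding details are assumptions, not the patch itself:

          #include <linux/errno.h>
          #include <linux/lsm_hooks.h>

          struct io_uring_cmd;   /* opaque here; provided by the io_uring headers */

          /* Deny every uring_cmd passthrough request made through this LSM. */
          static int example_uring_cmd(struct io_uring_cmd *ioucmd)
          {
              return -EACCES;
          }

          /* Hook table entry; a real LSM would register this list from its
           * DEFINE_LSM() init path. */
          static struct security_hook_list example_hooks[] = {
              LSM_HOOK_INIT(uring_cmd, example_uring_cmd),
          };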
    • netdev: Use try_cmpxchg in napi_if_scheduled_mark_missed · b9030780
      Uros Bizjak authored
      Use try_cmpxchg instead of cmpxchg(*ptr, old, new) == old in
      napi_if_scheduled_mark_missed. The x86 CMPXCHG instruction returns
      success in the ZF flag, so this change saves a compare after cmpxchg
      (and a related move instruction in front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns the old *ptr value to "old" when
      cmpxchg fails, enabling further code simplifications (a sketch of the
      pattern follows this entry).
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Link: https://lore.kernel.org/r/20220822143243.2798-1-ubizjak@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b9030780
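      A userspace sketch of the pattern using C11 atomics, whose compare-exchange has the same "report success and refresh 'old' on failure" behavior that try_cmpxchg gives kernel callers; the simplified function and the SCHED/MISSED bit values here are assumptions for illustration:

          #include <stdatomic.h>
          #include <stdio.h>

          #define STATE_SCHED  0x1UL   /* assumed bit layout, for illustration */
          #define STATE_MISSED 0x2UL

          static _Atomic unsigned long napi_state;

          /* If the NAPI is currently scheduled, mark it missed; otherwise do
           * nothing. On a failed compare-exchange, 'old' is refreshed with the
           * current value, so no explicit reload or extra compare is needed. */
          static int if_scheduled_mark_missed(void)
          {
              unsigned long old = atomic_load(&napi_state), new;

              do {
                  if (!(old & STATE_SCHED))
                      return 0;
                  new = old | STATE_MISSED;
              } while (!atomic_compare_exchange_weak(&napi_state, &old, new));

              return 1;
          }

          int main(void)
          {
              atomic_store(&napi_state, STATE_SCHED);
              printf("marked missed: %d\n", if_scheduled_mark_missed());
              return 0;
          }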
    • bpf: Introduce cgroup iter · d4ccaf58
      Hao Luo authored
      Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
      
       - walking a cgroup's descendants in pre-order.
       - walking a cgroup's descendants in post-order.
       - walking a cgroup's ancestors.
       - processing only the given cgroup.
      
      When attaching cgroup_iter, one can set a cgroup for the iter_link
      created from attaching. This cgroup is passed as a file descriptor
      or cgroup id and serves as the starting point of the walk. If no
      cgroup is specified, the starting point will be the root of the
      cgroup v2 hierarchy.
      
      For walking descendants, one can specify the order: either pre-order or
      post-order. For walking ancestors, the walk starts at the specified
      cgroup and ends at the root.
      
      One can also terminate the walk early by returning 1 from the iter
      program (a hedged sketch of such a program follows this entry).
      
      Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
      program is called with cgroup_mutex held.
      
      Currently only one session is supported, which means that, depending on
      the volume of data the bpf program intends to send to user space, the
      number of cgroups that can be walked is limited. For example, given that
      the current buffer size is 8 * PAGE_SIZE, if the program sends 64B of
      data for each cgroup, and assuming PAGE_SIZE is 4KB, the total number of
      cgroups that can be walked is 512. This is a limitation of cgroup_iter.
      If the output data is larger than the kernel buffer size, then after all
      data in the kernel buffer is consumed by user space, the subsequent
      read() syscall will signal EOPNOTSUPP. To work around this, the user may
      have to update their program to reduce the volume of data sent to output,
      for example by skipping some uninteresting cgroups. In the future, we may
      extend bpf_iter flags to allow customizing the buffer size.
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d4ccaf58
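      A hedged sketch of what a cgroup_iter program might look like; the "iter/cgroup" section name, the bpf_iter__cgroup context layout and the cgroup fields read here follow the series' conventions but should be treated as assumptions, not the exact UAPI:

          #include "vmlinux.h"            /* kernel types via BTF/CO-RE */
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_tracing.h>

          char LICENSE[] SEC("license") = "GPL";

          SEC("iter/cgroup")
          int dump_cgroup_level(struct bpf_iter__cgroup *ctx)
          {
              struct seq_file *seq = ctx->meta->seq;
              struct cgroup *cgrp = ctx->cgroup;

              /* The final call of a session passes a NULL cgroup. */
              if (!cgrp)
                  return 0;

              BPF_SEQ_PRINTF(seq, "cgroup level %d, id %llu\n",
                             cgrp->level, cgrp->kn->id);

              /* Returning 1 here would terminate the walk early. */
              return 0;
          }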
  15. 25 Aug 2022, 2 commits
    • wifi: cfg80211/mac80211: check EHT capability size correctly · ea5cba26
      Johannes Berg authored
      For AP and non-AP STAs the EHT MCS/NSS subfield size differs; the
      4-octet subfield is only used for 20 MHz-only non-AP STAs.
      Pass an argument around everywhere to be able to parse it
      properly.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      ea5cba26
    • bpf: Fix reference state management for synchronous callbacks · 9d9d00ac
      Kumar Kartikeya Dwivedi authored
      Currently, the verifier verifies callback functions (sync and async) as if
      they will be executed once (i.e. it explores the execution state as if the
      function was being called once). The next insn to explore is set to the
      start of the subprog and the exit from the nested frame is handled using
      curframe > 0 and prepare_func_exit. In the case of an async callback it
      uses a customized variant of push_stack, simulating a kind of branch to
      set up custom state and execution context for the async callback.
      
      While this approach is simple and works when the callback really will be
      executed only once, it is unsafe for all of our current helpers which
      are for_each style, i.e. they execute the callback multiple times.
      
      A callback releasing acquired references of the caller may do so
      multiple times, but currently the verifier sees it as one call inside the
      frame, which then returns to the caller. Hence, it thinks it released some
      reference that the callback e.g. got access to through callback_ctx (a
      register filled inside the callback from a spilled typed register on the
      stack). A sketch of this problematic pattern follows this entry.
      
      Similarly, it may see that an acquire call is unpaired inside the
      callback, so the caller will copy the reference state of the callback and
      then will have to release the register with the new ref_obj_ids. But
      again, while the callback may execute multiple times, the verifier will
      only account for acquired references for a single symbolic execution of
      the callback, which will cause leaks.
      
      Note that for async callback case, things are different. While currently
      we have bpf_timer_set_callback which only executes it once, even for
      multiple executions it would be safe, as reference state is NULL and
      check_reference_leak would force program to release state before
      BPF_EXIT. The state is also unaffected by analysis for the caller frame.
      Hence async callback is safe.
      
      Since we want the reference state to be accessible, e.g. for pointers
      loaded from stack through callback_ctx's PTR_TO_STACK, we still have to
      copy caller's reference_state to callback's bpf_func_state, but we
      enforce that whatever references it adds to that reference_state has
      been released before it hits BPF_EXIT. This requires introducing a new
      callback_ref member in the reference state to distinguish between caller
      vs callee references. Hence, check_reference_leak now errors out if it
      sees we are in callback_fn and we have not released callback_ref refs.
      Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2,
      etc., we also need to distinguish whether this particular ref belongs to
      this callback frame or to the parent, and only error for our own, so
      we store state->frameno (which is always non-zero for callbacks).
      
      In short, callbacks can read parent reference_state, but cannot mutate
      it, to be able to use pointers acquired by the caller. They must only
      undo their changes (by releasing their own acquired_refs before
      BPF_EXIT) on top of caller reference_state before returning (at which
      point the caller and callback state will match anyway, so no need to
      copy it back to caller).
      
      Fixes: 69c087ba ("bpf: Add bpf_for_each_map_elem() helper")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220823013125.24938-1-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9d9d00ac
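      A hedged sketch of the problematic for_each pattern described above; bpf_for_each_map_elem, bpf_sk_lookup_tcp and bpf_sk_release are real helpers, but the program itself is illustrative, not taken from the patch:

          #include "vmlinux.h"
          #include <bpf/bpf_helpers.h>

          char LICENSE[] SEC("license") = "GPL";

          struct {
              __uint(type, BPF_MAP_TYPE_ARRAY);
              __uint(max_entries, 16);
              __type(key, __u32);
              __type(value, __u64);
          } arr SEC(".maps");

          struct cb_ctx {
              struct bpf_sock *sk;   /* reference acquired by the caller */
          };

          /* Runs once per map element, yet the verifier used to model only one
           * execution; releasing the caller's reference here would therefore be
           * a double release at runtime on the second iteration. */
          static long release_cb(struct bpf_map *map, __u32 *key, __u64 *val, void *data)
          {
              struct cb_ctx *ctx = data;

              if (ctx->sk) {
                  bpf_sk_release(ctx->sk);
                  ctx->sk = NULL;
              }
              return 0;
          }

          SEC("tc")
          int leak_example(struct __sk_buff *skb)
          {
              struct bpf_sock_tuple tuple = {};
              struct cb_ctx ctx = {};

              ctx.sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                         -1 /* current netns */, 0);
              if (!ctx.sk)
                  return 0;

              /* The fixed verifier forces callback-acquired refs to be released
               * inside the callback and forbids mutating the caller's refs. */
              bpf_for_each_map_elem(&arr, release_cb, &ctx, 0);
              return 0;
          }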
  16. 24 Aug 2022, 2 commits