1. 08 Sep 2022 (4 commits)
  2. 07 Sep 2022 (1 commit)
  3. 05 Sep 2022 (5 commits)
    • bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc. · 9f2c6e96
      Committed by Alexei Starovoitov
      User space might be creating and destroying a lot of hash maps. Synchronous
      rcu_barrier-s in the destruction path of a hash map delay freeing of hash buckets
      and other map memory and may cause an artificial OOM situation under stress.
      Optimize rcu_barrier usage between the bpf hash map and bpf_mem_alloc:
      - remove rcu_barrier from the hash map, since htab doesn't use call_rcu
        directly and there are no callbacks to wait for.
      - bpf_mem_alloc has a call_rcu_in_progress flag that indicates pending callbacks.
        Use it to avoid barriers in the fast path.
      - When barriers are needed, copy bpf_mem_alloc into a temp structure
        and wait for the rcu barrier-s in the worker, letting the rest of
        the hash map freeing proceed.
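      
      The "copy and defer" idea above can be pictured with a small sketch. This is a
      rough illustration only: the struct layout and the free_mem_alloc_copy() helper
      are assumptions, not the actual bpf_mem_alloc internals.
      
      #include <linux/workqueue.h>
      #include <linux/rcupdate.h>
      #include <linux/slab.h>
      
      struct mem_alloc_deferred {
          struct work_struct work;
          struct bpf_mem_alloc copy;   /* snapshot of the allocator to destroy */
      };
      
      static void mem_alloc_deferred_fn(struct work_struct *work)
      {
          struct mem_alloc_deferred *d =
              container_of(work, struct mem_alloc_deferred, work);
      
          /* Wait for the pending call_rcu callbacks only here, in the worker,
           * so the caller's map teardown does not block on rcu_barrier().
           */
          rcu_barrier();
          free_mem_alloc_copy(&d->copy);   /* hypothetical cleanup helper */
          kfree(d);
      }
      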
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
    • bpf: Add percpu allocation support to bpf_mem_alloc. · 4ab67149
      Committed by Alexei Starovoitov
      Extend bpf_mem_alloc to cache a free list of fixed size per-cpu allocations.
      Once such a cache is created, bpf_mem_cache_alloc() will return per-cpu objects.
      bpf_mem_cache_free() will free them back into the global per-cpu pool after
      observing an RCU grace period.
      The per-cpu flavor of bpf_mem_alloc is going to be used by per-cpu hash maps.
      
      The free list cache consists of tuples { llist_node, per-cpu pointer }.
      Unlike alloc_percpu(), which returns a per-cpu pointer,
      bpf_mem_cache_alloc() returns a pointer to a per-cpu pointer and
      bpf_mem_cache_free() expects to receive it back.
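      
      To make the "pointer to a per-cpu pointer" contract concrete, a rough sketch
      follows. The field and function names here are assumptions based on the
      description above, not the actual bpf_mem_alloc definitions.
      
      #include <linux/llist.h>
      #include <linux/percpu.h>
      
      struct pcpu_freelist_elem {
          struct llist_node node;   /* links the element into the free list cache */
          void __percpu *pptr;      /* per-cpu area obtained from alloc_percpu() */
      };
      
      /* A consumer treats the object returned by bpf_mem_cache_alloc() as a
       * pointer to the per-cpu pointer and dereferences it once before using
       * the usual per-cpu accessors.
       */
      static void *use_percpu_obj(void *obj)
      {
          void __percpu *pptr = *(void __percpu **)obj;
      
          return this_cpu_ptr(pptr);   /* slot of the current cpu */
      }
      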
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220902211058.60789-11-alexei.starovoitov@gmail.com
    • bpf: Introduce any context BPF specific memory allocator. · 7c8199e2
      Committed by Alexei Starovoitov
      Tracing BPF programs can attach to kprobes and fentry. Hence they
      run in an unknown context where calling plain kmalloc() might not be safe.
      
      Front-end kmalloc() with a minimal per-cpu cache of free elements.
      Refill this cache asynchronously from irq_work.
      
      BPF programs always run with migration disabled.
      It's safe to allocate from the cache of the current cpu with irqs disabled.
      Freeing is always done into the bucket of the current cpu as well.
      irq_work trims extra free elements from the buckets with kfree
      and refills them with kmalloc, so the global kmalloc logic takes care
      of freeing objects allocated on one cpu and freed on another.
      
      struct bpf_mem_alloc supports two modes:
      - When size != 0, create a kmem_cache and a bpf_mem_cache for each cpu.
        This is the typical bpf hash map use case, where all elements have equal size.
      - When size == 0, allocate 11 bpf_mem_cache-s for each cpu, then rely on
        kmalloc/kfree. The max allocation size is 4096 in this case.
        This is the bpf_dynptr and bpf_kptr use case.
      
      bpf_mem_alloc/bpf_mem_free are bpf specific 'wrappers' of kmalloc/kfree.
      bpf_mem_cache_alloc/bpf_mem_cache_free are 'wrappers' of kmem_cache_alloc/kmem_cache_free.
      
      The allocators are NMI-safe from bpf programs only. They are not NMI-safe in general.
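      
      As a rough usage sketch of the two modes (the function names follow the
      description above; the exact init signature, including the per-cpu flag added
      later in this series, is an assumption here):
      
      #include <linux/bpf_mem_alloc.h>   /* assumed header name */
      
      /* size != 0: fixed-size cache, the hash map case. */
      static int fixed_size_example(void)
      {
          struct bpf_mem_alloc ma;
          void *elem;
          int err;
      
          err = bpf_mem_alloc_init(&ma, 64 /* element size */, false);
          if (err)
              return err;
      
          elem = bpf_mem_cache_alloc(&ma);   /* safe in any context, per above */
          if (elem)
              bpf_mem_cache_free(&ma, elem);
      
          bpf_mem_alloc_destroy(&ma);
          return 0;
      }
      
      /* size == 0: kmalloc-like behaviour, up to 4096 bytes. */
      static int any_size_example(void)
      {
          struct bpf_mem_alloc ma;
          void *obj;
          int err;
      
          err = bpf_mem_alloc_init(&ma, 0, false);
          if (err)
              return err;
      
          obj = bpf_mem_alloc(&ma, 128);
          if (obj)
              bpf_mem_free(&ma, obj);
      
          bpf_mem_alloc_destroy(&ma);
          return 0;
      }
      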
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220902211058.60789-2-alexei.starovoitov@gmail.com
    • net: phy: Add 1000BASE-KX interface mode · 05ad5d45
      Committed by Sean Anderson
      Add a 1000BASE-KX interface mode. This is 1G backplane Ethernet as described in
      Clause 70. Clause 73 autonegotiation is mandatory, and only full duplex
      operation is supported.
      
      Although at the PMA level this interface mode is identical to
      1000BASE-X, it uses a different form of in-band autonegotiation. This
      justifies a separate interface mode, since the interface mode (along
      with the MLO_AN_* autonegotiation mode) sets the type of autonegotiation
      which will be used on a link. This results in more than just electrical
      differences between the link modes.
      
      With regard to 1000BASE-X, 1000BASE-KX holds a similar position to
      SGMII: same signaling, but different autonegotiation. PCS drivers
      (which typically handle in-band autonegotiation) may only support
      1000BASE-X, and not 1000BASE-KX. Similarly, the phy mode is used to
      configure serdes phys with phy_set_mode_ext. Due to the different
      electrical standards (SFI or XFI vs Clause 70), they will likely want to
      use different configurations. Adding a phy interface mode for
      1000BASE-KX helps simplify configuration in these areas.
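      
      For instance, a serdes consumer might pick its configuration from the new mode
      roughly like this (a hedged sketch: PHY_INTERFACE_MODE_1000BASEKX is the enum
      value added here, the rest is illustrative):
      
      #include <linux/phy.h>       /* phy_interface_t, PHY_INTERFACE_MODE_* */
      #include <linux/phy/phy.h>   /* generic PHY framework: phy_set_mode_ext() */
      
      static int serdes_config_example(struct phy *serdes, phy_interface_t mode)
      {
          switch (mode) {
          case PHY_INTERFACE_MODE_1000BASEX:
          case PHY_INTERFACE_MODE_1000BASEKX:
              /* Same signaling, but the serdes/PCS can now tell Clause 37
               * in-band autonegotiation apart from Clause 73 by the mode.
               */
              return phy_set_mode_ext(serdes, PHY_MODE_ETHERNET, mode);
          default:
              return -EOPNOTSUPP;
          }
      }
      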
      Signed-off-by: Sean Anderson <sean.anderson@seco.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: pcs: add new PCS driver for altera TSE PCS · 4a502cf4
      Committed by Maxime Chevallier
      The Altera Triple Speed Ethernet has an SGMII/1000BaseX PCS that can be
      integrated in several ways. It can either be part of the TSE MAC's
      address space, accessed through 32-bit accesses on the mapped mdio
      device 0, or through a dedicated 16-bit register set.
      
      This driver allows using the TSE PCS outside of altera TSE's driver,
      since it can be used standalone by other MACs.
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 03 Sep 2022 (3 commits)
  5. 02 Sep 2022 (3 commits)
  6. 01 Sep 2022 (1 commit)
    • mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse · 2555283e
      Committed by Jann Horn
      anon_vma->degree tracks the combined number of child anon_vmas and VMAs
      that use the anon_vma as their ->anon_vma.
      
      anon_vma_clone() then assumes that for any anon_vma attached to
      src->anon_vma_chain other than src->anon_vma, it is impossible for it to
      be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
      elevated by 1 because of a child anon_vma, meaning that if ->degree
      equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.
      
      This assumption is wrong because the ->degree optimization leads to leaf
      nodes being abandoned on anon_vma_clone() - an existing anon_vma is
      reused and no new parent-child relationship is created.  So it is
      possible to reuse an anon_vma for one VMA while it is still tied to
      another VMA.
      
      This is an issue because is_mergeable_anon_vma() and its callers assume
      that if two VMAs have the same ->anon_vma, the list of anon_vmas
      attached to the VMAs is guaranteed to be the same.  When this assumption
      is violated, vma_merge() can merge pages into a VMA that is not attached
      to the corresponding anon_vma, leading to dangling page->mapping
      pointers that will be dereferenced during rmap walks.
      
      Fix it by separately tracking the number of child anon_vmas and the
      number of VMAs using the anon_vma as their ->anon_vma.
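      
      A sketch of what the split could look like (the field names and the reuse check
      are assumptions derived from the description above, not a verbatim copy of the
      patch):
      
      struct anon_vma {
          /* ... */
          unsigned long num_children;     /* child anon_vmas in the hierarchy */
          unsigned long num_active_vmas;  /* VMAs that use this as their ->anon_vma */
          /* ... */
      };
      
      /* With separate counters, anon_vma_clone() can check "no VMA currently
       * uses this anon_vma" directly instead of inferring it from ->degree.
       */
      static bool anon_vma_reusable(struct anon_vma *av)
      {
          return av->num_children < 2 && av->num_active_vmas == 0;
      }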
      
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Cc: stable@kernel.org
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 31 Aug 2022 (2 commits)
  8. 30 Aug 2022 (3 commits)
  9. 29 Aug 2022 (4 commits)
  10. 27 Aug 2022 (1 commit)
  11. 26 Aug 2022 (3 commits)
    • lsm,io_uring: add LSM hooks for the new uring_cmd file op · 2a584012
      Committed by Luis Chamberlain
      io-uring cmd support was added through ee692a21 ("fs,io_uring:
      add infrastructure for uring-cmd"); this extended struct
      file_operations to allow a new command which each subsystem can use
      to enable command passthrough. Add an LSM hook specific to the command
      passthrough which enables LSMs to inspect the command details.
      
      This was discussed long ago [0] with no clear pointer to something
      conclusive, so this enables LSMs to at least reject this new file
      operation.
      
      [0] https://lkml.kernel.org/r/8adf55db-7bab-f59d-d612-ed906b948d19@schaufler-ca.com
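      
      A minimal sketch of an LSM wiring up the new hook (the hook name and prototype
      are taken from the description above; the policy logic is purely illustrative):
      
      #include <linux/lsm_hooks.h>
      #include <linux/io_uring.h>
      
      static int example_uring_cmd(struct io_uring_cmd *ioucmd)
      {
          /* Reject all passthrough commands until a real policy exists. */
          return -EACCES;
      }
      
      static struct security_hook_list example_hooks[] __lsm_ro_after_init = {
          LSM_HOOK_INIT(uring_cmd, example_uring_cmd),
      };
      
      Registration would then go through security_add_hooks(), as for any other LSM.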
      
      Cc: stable@vger.kernel.org
      Fixes: ee692a21 ("fs,io_uring: add infrastructure for uring-cmd")
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • netdev: Use try_cmpxchg in napi_if_scheduled_mark_missed · b9030780
      Committed by Uros Bizjak
      Use try_cmpxchg instead of the cmpxchg(*ptr, old, new) == old pattern in
      napi_if_scheduled_mark_missed. The x86 CMPXCHG instruction returns
      success in the ZF flag, so this change saves a compare after cmpxchg
      (and the related move instruction in front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns the old *ptr value to "old" when cmpxchg
      fails, enabling further code simplifications.
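      
      In generic form, the transformation looks like this (not the exact
      napi_if_scheduled_mark_missed body, just the pattern):
      
      #include <linux/atomic.h>
      
      /* Before: reload and compare the returned old value on every iteration. */
      static void set_flag_cmpxchg(unsigned long *ptr, unsigned long flag)
      {
          unsigned long old, new;
      
          do {
              old = READ_ONCE(*ptr);
              new = old | flag;
          } while (cmpxchg(ptr, old, new) != old);
      }
      
      /* After: try_cmpxchg() refreshes 'old' on failure, so the explicit reload
       * and the extra compare after the cmpxchg go away.
       */
      static void set_flag_try_cmpxchg(unsigned long *ptr, unsigned long flag)
      {
          unsigned long old = READ_ONCE(*ptr), new;
      
          do {
              new = old | flag;
          } while (!try_cmpxchg(ptr, &old, new));
      }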
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Link: https://lore.kernel.org/r/20220822143243.2798-1-ubizjak@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bpf: Introduce cgroup iter · d4ccaf58
      Committed by Hao Luo
      Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
      
       - walking a cgroup's descendants in pre-order.
       - walking a cgroup's descendants in post-order.
       - walking a cgroup's ancestors.
       - processing only the given cgroup.
      
      When attaching cgroup_iter, one can set a cgroup on the iter_link
      created from attaching. This cgroup is passed as a file descriptor
      or a cgroup id and serves as the starting point of the walk. If no
      cgroup is specified, the starting point will be the cgroup v2 root.
      
      For walking descendants, one can specify the order: either pre-order or
      post-order. For walking ancestors, the walk starts at the specified
      cgroup and ends at the root.
      
      One can also terminate the walk early by returning 1 from the iter
      program.
      
      Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
      program is called with cgroup_mutex held.
      
      Currently only one session is supported, which means that, depending on
      the volume of data the bpf program intends to send to user space, the
      number of cgroups that can be walked is limited. For example, given that
      the current buffer size is 8 * PAGE_SIZE, if the program sends 64 bytes
      of data for each cgroup, and assuming PAGE_SIZE is 4KB, the total number
      of cgroups that can be walked is 512. This is a limitation of
      cgroup_iter. If the output data is larger than the kernel buffer size,
      then after all data in the kernel buffer is consumed by user space, the
      subsequent read() syscall will signal EOPNOTSUPP. To work around this,
      the user may have to update their program to reduce the volume of data
      sent to the output, for example by skipping some uninteresting cgroups.
      In the future, we may extend the bpf_iter flags to allow customizing the
      buffer size.
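      
      A sketch of a cgroup_iter program (the section name and the context struct
      follow the description above; treat the details as assumptions rather than a
      verbatim selftest):
      
      // SPDX-License-Identifier: GPL-2.0
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      #include <bpf/bpf_tracing.h>   /* BPF_SEQ_PRINTF */
      
      SEC("iter/cgroup")
      int dump_cgroup_ids(struct bpf_iter__cgroup *ctx)
      {
          struct seq_file *seq = ctx->meta->seq;
          struct cgroup *cgrp = ctx->cgroup;
      
          if (!cgrp)   /* the program is called once more with NULL at the end */
              return 0;
      
          BPF_SEQ_PRINTF(seq, "cgroup id: %llu\n", cgrp->kn->id);
      
          /* Returning 1 here would terminate the walk early, as noted above. */
          return 0;
      }
      
      char _license[] SEC("license") = "GPL";
      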
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  12. 25 Aug 2022 (2 commits)
    • wifi: cfg80211/mac80211: check EHT capability size correctly · ea5cba26
      Committed by Johannes Berg
      For AP and non-AP STAs the EHT MCS/NSS subfield size differs; the
      4-octet subfield is only used for a 20 MHz-only non-AP STA.
      Pass an argument around everywhere to be able to parse it
      properly.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • bpf: Fix reference state management for synchronous callbacks · 9d9d00ac
      Committed by Kumar Kartikeya Dwivedi
      Currently, the verifier verifies callback functions (sync and async) as if
      they will be executed once (i.e. it explores the execution state as if the
      function was being called once). The next insn to explore is set to the
      start of the subprog and the exit from the nested frame is handled using
      curframe > 0 and prepare_func_exit. In case of an async callback it uses a
      customized variant of push_stack simulating a kind of branch to set up
      custom state and an execution context for the async callback.
      
      While this approach is simple and works when callback really will be
      executed only once, it is unsafe for all of our current helpers which
      are for_each style, i.e. they execute the callback multiple times.
      
      A callback releasing acquired references of the caller may do so
      multiple times, but currently the verifier sees it as one call inside the
      frame, which then returns to the caller. Hence, it thinks it released some
      reference that the callback e.g. got access to through callback_ctx (a
      register filled inside the callback from a spilled typed register on the stack).
      
      Similarly, it may see that an acquire call is unpaired inside the
      callback, so the caller will copy the reference state of the callback and
      then will have to release the register with the new ref_obj_ids. But again,
      the callback may execute multiple times, while the verifier will only
      account for acquired references for a single symbolic execution of the
      callback, which will cause leaks.
      
      Note that for the async callback case, things are different. While currently
      we have bpf_timer_set_callback, which only executes it once, even for
      multiple executions it would be safe, as the reference state is NULL and
      check_reference_leak would force the program to release state before
      BPF_EXIT. The state is also unaffected by the analysis for the caller frame.
      Hence the async callback is safe.
      
      Since we want the reference state to be accessible, e.g. for pointers
      loaded from the stack through callback_ctx's PTR_TO_STACK, we still have to
      copy the caller's reference_state to the callback's bpf_func_state, but we
      enforce that whatever references it adds to that reference_state have
      been released before it hits BPF_EXIT. This requires introducing a new
      callback_ref member in the reference state to distinguish between caller
      vs callee references. Hence, check_reference_leak now errors out if it
      sees we are in a callback_fn and we have not released callback_ref refs.
      Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2
      etc., we need to also distinguish whether this particular ref belongs to
      this callback frame or a parent, and only error out for our own, so
      we store state->frameno (which is always non-zero for callbacks).
      
      In short, callbacks can read the parent reference_state, but cannot mutate
      it, so that they can use pointers acquired by the caller. They must only
      undo their changes (by releasing their own acquired_refs before
      BPF_EXIT) on top of the caller reference_state before returning (at which
      point the caller and callback state will match anyway, so there is no need
      to copy it back to the caller).
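      
      For reference, the helper-plus-callback shape being discussed looks roughly like
      the following (an illustrative program: the map, names and attach point are made
      up, but bpf_for_each_map_elem() is the helper named in the Fixes tag):
      
      // SPDX-License-Identifier: GPL-2.0
      #include "vmlinux.h"
      #include <bpf/bpf_helpers.h>
      
      struct {
          __uint(type, BPF_MAP_TYPE_ARRAY);
          __uint(max_entries, 16);
          __type(key, __u32);
          __type(value, __u64);
      } values SEC(".maps");
      
      /* The callback may run once per element, i.e. many times per program run.
       * Anything acquired in here (references, spin locks) must also be released
       * in here; that is what the verifier now enforces per invocation.
       */
      static long sum_cb(struct bpf_map *map, __u32 *key, __u64 *val, __u64 *sum)
      {
          *sum += *val;
          return 0;   /* 0 = continue, 1 = stop iterating */
      }
      
      SEC("tracepoint/syscalls/sys_enter_getpgid")
      int sum_elems(void *ctx)
      {
          __u64 sum = 0;
      
          bpf_for_each_map_elem(&values, sum_cb, &sum, 0);
          return 0;
      }
      
      char _license[] SEC("license") = "GPL";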
      
      Fixes: 69c087ba ("bpf: Add bpf_for_each_map_elem() helper")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220823013125.24938-1-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  13. 24 Aug 2022 (7 commits)
  14. 23 Aug 2022 (1 commit)
    • net/mlx5: Avoid false positive lockdep warning by adding lock_class_key · d59b73a6
      Committed by Moshe Shemesh
      Add a lock_class_key per mlx5 device to avoid a false positive
      "possible circular locking dependency" warning by lockdep, on flows
      which lock more than one mlx5 device, such as adding an SF.
      
      kernel log:
       ======================================================
       WARNING: possible circular locking dependency detected
       5.19.0-rc8+ #2 Not tainted
       ------------------------------------------------------
       kworker/u20:0/8 is trying to acquire lock:
       ffff88812dfe0d98 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_init_one+0x2e/0x490 [mlx5_core]
      
       but task is already holding lock:
       ffff888101aa7898 (&(&notifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&(&notifier->n_head)->rwsem){++++}-{3:3}:
              down_write+0x90/0x150
              blocking_notifier_chain_register+0x53/0xa0
              mlx5_sf_table_init+0x369/0x4a0 [mlx5_core]
              mlx5_init_one+0x261/0x490 [mlx5_core]
              probe_one+0x430/0x680 [mlx5_core]
              local_pci_probe+0xd6/0x170
              work_for_cpu_fn+0x4e/0xa0
              process_one_work+0x7c2/0x1340
              worker_thread+0x6f6/0xec0
              kthread+0x28f/0x330
              ret_from_fork+0x1f/0x30
      
       -> #0 (&dev->intf_state_mutex){+.+.}-{3:3}:
              __lock_acquire+0x2fc7/0x6720
              lock_acquire+0x1c1/0x550
              __mutex_lock+0x12c/0x14b0
              mlx5_init_one+0x2e/0x490 [mlx5_core]
              mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core]
              auxiliary_bus_probe+0x9d/0xe0
              really_probe+0x1e0/0xaa0
              __driver_probe_device+0x219/0x480
              driver_probe_device+0x49/0x130
              __device_attach_driver+0x1b8/0x280
              bus_for_each_drv+0x123/0x1a0
              __device_attach+0x1a3/0x460
              bus_probe_device+0x1a2/0x260
              device_add+0x9b1/0x1b40
              __auxiliary_device_add+0x88/0xc0
              mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core]
              blocking_notifier_call_chain+0xd5/0x130
              mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core]
              process_one_work+0x7c2/0x1340
              worker_thread+0x59d/0xec0
              kthread+0x28f/0x330
              ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&(&notifier->n_head)->rwsem);
                                      lock(&dev->intf_state_mutex);
                                      lock(&(&notifier->n_head)->rwsem);
         lock(&dev->intf_state_mutex);
      
        *** DEADLOCK ***
      
       4 locks held by kworker/u20:0/8:
        #0: ffff888150612938 ((wq_completion)mlx5_events){+.+.}-{0:0}, at: process_one_work+0x6e2/0x1340
        #1: ffff888100cafdb8 ((work_completion)(&work->work)#3){+.+.}-{0:0}, at: process_one_work+0x70f/0x1340
        #2: ffff888101aa7898 (&(&notifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
        #3: ffff88813682d0e8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x76/0x460
      
       stack backtrace:
       CPU: 6 PID: 8 Comm: kworker/u20:0 Not tainted 5.19.0-rc8+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core]
       Call Trace:
        <TASK>
        dump_stack_lvl+0x57/0x7d
        check_noncircular+0x278/0x300
        ? print_circular_bug+0x460/0x460
        ? lock_chain_count+0x20/0x20
        ? register_lock_class+0x1880/0x1880
        __lock_acquire+0x2fc7/0x6720
        ? register_lock_class+0x1880/0x1880
        ? register_lock_class+0x1880/0x1880
        lock_acquire+0x1c1/0x550
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? lockdep_hardirqs_on_prepare+0x400/0x400
        __mutex_lock+0x12c/0x14b0
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? _raw_read_unlock+0x1f/0x30
        ? mutex_lock_io_nested+0x1320/0x1320
        ? __ioremap_caller.constprop.0+0x306/0x490
        ? mlx5_sf_dev_probe+0x269/0x370 [mlx5_core]
        ? iounmap+0x160/0x160
        mlx5_init_one+0x2e/0x490 [mlx5_core]
        mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core]
        ? mlx5_sf_dev_remove+0x130/0x130 [mlx5_core]
        auxiliary_bus_probe+0x9d/0xe0
        really_probe+0x1e0/0xaa0
        __driver_probe_device+0x219/0x480
        ? auxiliary_match_id+0xe9/0x140
        driver_probe_device+0x49/0x130
        __device_attach_driver+0x1b8/0x280
        ? driver_allows_async_probing+0x140/0x140
        bus_for_each_drv+0x123/0x1a0
        ? bus_for_each_dev+0x1a0/0x1a0
        ? lockdep_hardirqs_on_prepare+0x286/0x400
        ? trace_hardirqs_on+0x2d/0x100
        __device_attach+0x1a3/0x460
        ? device_driver_attach+0x1e0/0x1e0
        ? kobject_uevent_env+0x22d/0xf10
        bus_probe_device+0x1a2/0x260
        device_add+0x9b1/0x1b40
        ? dev_set_name+0xab/0xe0
        ? __fw_devlink_link_to_suppliers+0x260/0x260
        ? memset+0x20/0x40
        ? lockdep_init_map_type+0x21a/0x7d0
        __auxiliary_device_add+0x88/0xc0
        ? auxiliary_device_init+0x86/0xa0
        mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core]
        blocking_notifier_call_chain+0xd5/0x130
        mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core]
        ? mlx5_vhca_event_arm+0x100/0x100 [mlx5_core]
        ? lock_downgrade+0x6e0/0x6e0
        ? lockdep_hardirqs_on_prepare+0x286/0x400
        process_one_work+0x7c2/0x1340
        ? lockdep_hardirqs_on_prepare+0x400/0x400
        ? pwq_dec_nr_in_flight+0x230/0x230
        ? rwlock_bug.part.0+0x90/0x90
        worker_thread+0x59d/0xec0
        ? process_one_work+0x1340/0x1340
        kthread+0x28f/0x330
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
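      
      A minimal sketch of the per-device lock class approach (the struct and field
      names are simplified assumptions; lockdep_register_key()/lockdep_set_class()
      are the stock lockdep APIs):
      
      #include <linux/lockdep.h>
      #include <linux/mutex.h>
      
      struct example_dev {
          struct mutex intf_state_mutex;
          struct lock_class_key lock_key;   /* one lockdep class per device */
      };
      
      static void example_dev_init(struct example_dev *dev)
      {
          mutex_init(&dev->intf_state_mutex);
          lockdep_register_key(&dev->lock_key);
          lockdep_set_class(&dev->intf_state_mutex, &dev->lock_key);
      }
      
      static void example_dev_cleanup(struct example_dev *dev)
      {
          mutex_destroy(&dev->intf_state_mutex);
          lockdep_unregister_key(&dev->lock_key);
      }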
      
      Fixes: 6a327321 ("net/mlx5: SF, Port function state change support")
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>