1. 27 Mar, 2021 5 commits
    • M
      bpf: selftests: Add kfunc_call test · 7bd1590d
      Martin KaFai Lau committed
      This patch adds a few kernel function bpf_kfunc_call_test*() for the
      selftest's test_run purpose.  They will be allowed for tc_cls prog.
      
      The selftest calling the kernel function bpf_kfunc_call_test*()
      is also added in this patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015252.1551395-1-kafai@fb.com
      7bd1590d
    • M
      bpf: Support bpf program calling kernel function · e6ac2450
      Martin KaFai Lau committed
      This patch adds support to BPF verifier to allow bpf program calling
      kernel function directly.
      
      The use case included in this set is to allow bpf-tcp-cc to directly
      call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
      functions have already been used by some kernel tcp-cc implementations.
      
      This set will also allow the bpf-tcp-cc program to directly call the
      kernel tcp-cc implementation.  For example, a bpf_dctcp may only want to
      implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
      from the kernel tcp_dctcp.c instead of reimplementing (or
      copy-and-pasting) them.
      
      The tcp-cc kernel functions mentioned above will be white listed
      for the struct_ops bpf-tcp-cc programs to use in a later patch.
      The white listed functions are not bound to a fixed ABI contract.
      Those functions have already been used by the existing kernel tcp-cc.
      If any of them has changed, both in-tree and out-of-tree kernel tcp-cc
      implementations have to be changed.  The same goes for the struct_ops
      bpf-tcp-cc programs which have to be adjusted accordingly.
      
      This patch is to make the required changes in the bpf verifier.
      
      The first change is in btf.c: it adds a case in "btf_check_func_arg_match()".
      When the passed in "btf->kernel_btf == true", it means matching the
      verifier regs' states with a kernel function.  This will handle the
      PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
      and PTR_TO_TCP_SOCK to their kernel btf_id.
      
      In the later libbpf patch, the insn calling a kernel function will
      look like:
      
      insn->code == (BPF_JMP | BPF_CALL)
      insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
      insn->imm == func_btf_id /* btf_id of the running kernel */
      
      [ For the future calling function-in-kernel-module support, an array
        of module btf_fds can be passed at the load time and insn->off
        can be used to index into this array. ]
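
      As an illustration of the encoding above, here is a minimal standalone
      sketch.  The struct and the three constants mirror the UAPI definitions
      locally so that it builds without kernel headers; make_kfunc_call() and
      is_kfunc_call() are hypothetical helper names used only in this sketch,
      not functions added by this patch.

```c
#include <stdint.h>

/* Local mirrors of the UAPI values, so the sketch builds without kernel headers. */
#define BPF_JMP                0x05
#define BPF_CALL               0x80
#define BPF_PSEUDO_KFUNC_CALL  2

struct bpf_insn {
	uint8_t code;        /* opcode */
	uint8_t dst_reg:4;   /* destination register */
	uint8_t src_reg:4;   /* source register, reused here as the call kind */
	int16_t off;         /* signed offset, unused for this call form */
	int32_t imm;         /* BTF id of the kernel function at load time */
};

/* Hypothetical helper: build a call insn that targets a kernel function. */
static struct bpf_insn make_kfunc_call(int32_t func_btf_id)
{
	struct bpf_insn insn = {
		.code = BPF_JMP | BPF_CALL,
		.dst_reg = 0,
		.src_reg = BPF_PSEUDO_KFUNC_CALL,
		.off = 0,
		.imm = func_btf_id,
	};
	return insn;
}

/* Hypothetical predicate: is this insn a kernel function call? */
static int is_kfunc_call(const struct bpf_insn *insn)
{
	return insn->code == (BPF_JMP | BPF_CALL) &&
	       insn->src_reg == BPF_PSEUDO_KFUNC_CALL;
}
```

      The same opcode with a different src_reg value is an ordinary helper or
      subprog call, which is why the check looks at both fields.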
      
      At an early stage, the verifier will collect all kernel
      function calls into "struct bpf_kfunc_desc".  Those
      descriptors are stored in "prog->aux->kfunc_tab" and will
      be available to the JIT.  Since this "add" operation is similar
      to the current "add_subprog()" and looking for the same insn->code,
      they are done together in the new "add_subprog_and_kfunc()".
      
      In the "do_check()" stage, the new "check_kfunc_call()" is added
      to verify the kernel function call instruction:
      1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
         A new bpf_verifier_ops "check_kfunc_call" is added to do that.
         The bpf-tcp-cc struct_ops program will implement this function in
         a later patch.
      2. Call "btf_check_kfunc_args_match()" to ensure the regs can be
         used as the args of a kernel function.
      3. Mark the regs' type, subreg_def, and zext_dst.
      
      At the later do_misc_fixups() stage, the new fixup_kfunc_call()
      will replace the insn->imm with the function address (relative
      to __bpf_call_base).  If needed, the jit can find the btf_func_model
      by calling the new bpf_jit_find_kfunc_model(prog, insn).
      With the imm set to the function address, "bpftool prog dump xlated"
      will be able to display the kernel function calls the same way as
      it displays other bpf helper calls.
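
      The imm rewrite described above can be pictured with the arithmetic
      below.  This is only a sketch: kfunc_addr_to_imm() and
      imm_to_kfunc_addr() are hypothetical names, and the explicit 32-bit
      range check is an assumption about how an out-of-range address would
      have to be rejected, since imm is a signed 32-bit field.

```c
#include <stdint.h>

/*
 * Compute the offset of func_addr relative to call_base.
 * Returns 0 and stores the offset in *imm, or -1 if it does not fit in 32 bits.
 */
static int kfunc_addr_to_imm(uint64_t func_addr, uint64_t call_base, int32_t *imm)
{
	int64_t delta = (int64_t)(func_addr - call_base);

	if (delta < INT32_MIN || delta > INT32_MAX)
		return -1;

	*imm = (int32_t)delta;
	return 0;
}

/* Recover the absolute address from the relative imm, as a JIT would. */
static uint64_t imm_to_kfunc_addr(int32_t imm, uint64_t call_base)
{
	return call_base + (uint64_t)(int64_t)imm;
}
```

      Storing a relative value keeps the instruction layout unchanged while
      still letting tools map imm back to a kernel symbol.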
      
      A gpl_compatible program is required to call a kernel function.
      
      This feature currently requires JIT.
      
      The verifier selftests are adjusted because of the changes in
      the verbose log in add_subprog_and_kfunc().
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
      e6ac2450
    • M
      bpf: Refactor btf_check_func_arg_match · 34747c41
      Martin KaFai Lau committed
      This patch moved the subprog specific logic from
      btf_check_func_arg_match() to the new btf_check_subprog_arg_match().
      The core logic is left in btf_check_func_arg_match() which
      will be reused later to check the kernel function call.
      
      The "if (!btf_type_is_ptr(t))" is checked first to improve the
      indentation which will be useful for a later patch.
      
      Some of the "btf_kind_str[]" usages are replaced with the shortcut
      "btf_type_str(t)".
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015136.1544504-1-kafai@fb.com
      34747c41
    • M
      bpf: Simplify freeing logic in linfo and jited_linfo · e16301fb
      Martin KaFai Lau committed
      This patch simplifies the linfo freeing logic by combining
      "bpf_prog_free_jited_linfo()" and "bpf_prog_free_unused_jited_linfo()"
      into the new "bpf_prog_jit_attempt_done()".
      It is prep work for the kernel function call support.  In a later
      patch, freeing the kernel function call descriptors will also
      be done in the "bpf_prog_jit_attempt_done()".
      
      "bpf_prog_free_linfo()" is removed since it is only called by
      "__bpf_prog_put_noref()".  The kvfree() are directly called
      instead.
      
      It also takes this chance to s/kcalloc/kvcalloc/ for the jited_linfo
      allocation.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015130.1544323-1-kafai@fb.com
      e16301fb
    • W
      bpf: struct sock is declared twice in bpf_sk_storage header · fcb8d0d7
      Wan Jiabing committed
      struct sock has been declared twice, therefore remove the duplicate.
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210325070602.858024-1-wanjiabing@vivo.com
      fcb8d0d7
  2. 26 Mar, 2021 6 commits
    • Y
      bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper · b910eaaa
      Yonghong Song committed
      Jiri Olsa reported a bug ([1]) in kernel where cgroup local
      storage pointer may be NULL in bpf_get_local_storage() helper.
      There are two issues uncovered by this bug:
        (1). kprobe or tracepoint prog incorrectly sets cgroup local storage
             before prog run,
        (2). due to change from preempt_disable to migrate_disable,
             preemption is possible and percpu storage might be overwritten
             by other tasks.
      
      Issue (1) is fixed in [2]. This patch addresses issue (2).
      The following shows how things can go wrong:
        task 1:   bpf_cgroup_storage_set() for percpu local storage
               preemption happens
        task 2:   bpf_cgroup_storage_set() for percpu local storage
               preemption happens
        task 1:   run bpf program
      
      task 1 will effectively use the percpu local storage set by task 2,
      which will be either NULL or incorrect.
      
      Instead of just one common local storage per cpu, this patch fixed
      the issue by permitting 8 local storages per cpu and each local
      storage is identified by a task_struct pointer. This way, we
      allow at most 8 nested preemptions between bpf_cgroup_storage_set()
      and bpf_cgroup_storage_unset(). The percpu local storage slot
      is released (calling bpf_cgroup_storage_unset()) by the same task
      after bpf program finished running.
      bpf_test_run() is also fixed to use the new bpf_cgroup_storage_set()
      interface.
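
      The slot scheme described above can be modelled in plain C as follows.
      This is a simplified user-space sketch, not the kernel code: the type
      and function names (cpu_storages, storage_set(), storage_get(),
      storage_unset()) are hypothetical, tasks are represented by opaque
      pointers, and a full table is reported with a bare -1.

```c
#include <stddef.h>

#define MAX_NEST 8   /* "8 local storages per cpu" from the message above */

struct storage_slot {
	const void *task;   /* owning task, NULL when the slot is free */
	void *storage;      /* local storage pointer installed by that task */
};

struct cpu_storages {
	struct storage_slot slot[MAX_NEST];
};

/* Claim a free slot for task; returns 0, or -1 if all slots are taken. */
static int storage_set(struct cpu_storages *cpu, const void *task, void *storage)
{
	for (int i = 0; i < MAX_NEST; i++) {
		if (cpu->slot[i].task != NULL)
			continue;
		cpu->slot[i].task = task;
		cpu->slot[i].storage = storage;
		return 0;
	}
	return -1;
}

/* Return the storage that task itself installed, ignoring other tasks' slots. */
static void *storage_get(const struct cpu_storages *cpu, const void *task)
{
	for (int i = 0; i < MAX_NEST; i++)
		if (cpu->slot[i].task == task)
			return cpu->slot[i].storage;
	return NULL;
}

/* Release the slot owned by task once its program has finished running. */
static void storage_unset(struct cpu_storages *cpu, const void *task)
{
	for (int i = 0; i < MAX_NEST; i++) {
		if (cpu->slot[i].task != task)
			continue;
		cpu->slot[i].task = NULL;
		cpu->slot[i].storage = NULL;
		return;
	}
}
```

      Because lookups are keyed by the task pointer, task 1 in the scenario
      above finds its own storage even after task 2 has set one on the same
      cpu.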
      
      The patch is tested on top of [2] with reproducer in [1].
      Without this patch, kernel will emit error in 2-3 minutes.
      With this patch, after one hour, still no error.
      
       [1] https://lore.kernel.org/bpf/CAKH8qBuXCfUz=w8L+Fj74OaUpbosO29niYwTki7e3Ag044_aww@mail.gmail.com/T
       [2] https://lore.kernel.org/bpf/20210309185028.3763817-1-yhs@fb.com
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20210323055146.3334476-1-yhs@fb.com
      b910eaaa
    • R
    • M
      mm: memblock: fix section mismatch warning again · a024b7c2
      Mike Rapoport committed
      Commit 34dc2efb ("memblock: fix section mismatch warning") marked
      memblock_bottom_up() and memblock_set_bottom_up() as __init, but they
      could be referenced from non-init functions like
      memblock_find_in_range_node() on architectures that enable
      CONFIG_ARCH_KEEP_MEMBLOCK.
      
      For such builds kernel test robot reports:
      
         WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up()
         The function memblock_find_in_range_node() references the function __init memblock_bottom_up().
         This is often because memblock_find_in_range_node lacks a __init  annotation or the annotation of memblock_bottom_up is wrong.
      
      Replace __init annotations with __init_memblock annotations so that the
      appropriate section will be selected depending on
      CONFIG_ARCH_KEEP_MEMBLOCK.
      
      Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com
      Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org
      Fixes: 34dc2efb ("memblock: fix section mismatch warning")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a024b7c2
    • S
      mm/mmu_notifiers: ensure range_end() is paired with range_start() · c2655835
      Sean Christopherson committed
      If one or more notifiers fails .invalidate_range_start(), invoke
      .invalidate_range_end() for "all" notifiers.  If there are multiple
      notifiers, those that did not fail are expecting _start() and _end() to
      be paired, e.g.  KVM's mmu_notifier_count would become imbalanced.
      Disallow notifiers that can fail _start() from implementing _end() so
      that it's unnecessary to either track which notifiers rejected _start(),
      or had already succeeded prior to a failed _start().
      
      Note, the existing behavior of calling _start() on all notifiers even
      after a previous notifier failed _start() was an unintended "feature".
      Make it canon now that the behavior is depended on for correctness.
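
      The pairing rule described above can be sketched in a few lines of
      standalone C.  The notifier struct, invalidate_start() and the example
      callbacks are hypothetical stand-ins for illustration; they are not the
      mmu_notifier API itself.

```c
#include <errno.h>
#include <stddef.h>

struct notifier {
	int  (*range_start)(struct notifier *n);  /* returns 0 or a negative errno */
	void (*range_end)(struct notifier *n);    /* optional, may be NULL */
	int  count;                               /* invalidations in progress */
};

/*
 * Call range_start() on every notifier, even after one of them fails.
 * If any failed, call range_end() on all notifiers that implement it, so
 * that those which succeeded see a balanced start/end pair.
 */
static int invalidate_start(struct notifier **list, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		int err;

		if (!list[i]->range_start)
			continue;
		err = list[i]->range_start(list[i]);
		if (err)
			ret = err;  /* remember the failure, keep calling the rest */
	}

	if (ret) {
		for (size_t i = 0; i < n; i++)
			if (list[i]->range_end)
				list[i]->range_end(list[i]);
	}
	return ret;
}

/* Example notifier that tracks a count, loosely like KVM's mmu_notifier_count. */
static int counting_start(struct notifier *n) { n->count++; return 0; }
static void counting_end(struct notifier *n) { n->count--; }

/* Example notifier that can fail range_start() and therefore has no range_end(). */
static int failing_start(struct notifier *n) { (void)n; return -EAGAIN; }
```

      Since a notifier that can fail _start() is not allowed to implement
      _end(), the unwind loop never has to know which notifier failed.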
      
      As of today, the bug is likely benign:
      
        1. The only caller of the non-blocking notifier is OOM kill.
        2. The only notifiers that can fail _start() are the i915 and Nouveau
           drivers.
        3. The only notifiers that utilize _end() are the SGI UV GRU driver
           and KVM.
        4. The GRU driver will never coincide with the i915/Nouveau drivers.
        5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the
           _guest_, and the guest is already doomed due to being an OOM victim.
      
      Fix the bug now to play nice with future usage, e.g.  KVM has a
      potential use case for blocking memslot updates in KVM while an
      invalidation is in-progress, and failure to unblock would result in said
      updates being blocked indefinitely and hanging.
      
      Found by inspection.  Verified by adding a second notifier in KVM that
      periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
      and observing that KVM exits with an elevated notifier count.
      
      Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com
      Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2655835
    • A
      kasan: fix per-page tags for non-page_alloc pages · cf10bd4c
      Andrey Konovalov committed
      To allow performing tag checks on page_alloc addresses obtained via
      page_address(), tag-based KASAN modes store tags for page_alloc
      allocations in page->flags.
      
      Currently, the default tag value stored in page->flags is 0x00.
      Therefore, page_address() returns a 0x00ffff...  address for pages that
      were not allocated via page_alloc.
      
      This might cause problems.  A particular case we encountered is a
      conflict with KFENCE.  If a KFENCE-allocated slab object is being freed
      via kfree(page_address(page) + offset), the address passed to kfree()
      will get tagged with 0x00 (as slab pages keep the default per-page
      tags).  This leads to is_kfence_address() check failing, and a KFENCE
      object ending up in normal slab freelist, which causes memory
      corruptions.
      
      This patch changes the way KASAN stores tags in page->flags: they are now
      stored xor'ed with 0xff.  This way, KASAN doesn't need to initialize
      per-page flags for every created page, which might be slow.
      
      With this change, page_address() returns natively-tagged (with 0xff)
      pointers for pages that didn't have tags set explicitly.
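
      The effect of the xor can be seen in a tiny standalone model.  Here
      page_tag_field is a hypothetical stand-in for the tag bits inside
      page->flags, and page_tag_set()/page_tag_get() are illustrative names
      rather than the kernel's accessors.

```c
#include <stdint.h>

#define KASAN_TAG_KERNEL 0xff  /* native tag carried by untagged kernel pointers */

/* Hypothetical stand-in for the tag bits stored inside page->flags. */
struct page_tag_field {
	uint8_t raw;
};

/* Store the tag xor'ed with 0xff. */
static void page_tag_set(struct page_tag_field *f, uint8_t tag)
{
	f->raw = tag ^ 0xff;
}

/* Undo the xor on the way out; a never-initialized (zero) field reads as 0xff. */
static uint8_t page_tag_get(const struct page_tag_field *f)
{
	return f->raw ^ 0xff;
}
```

      A zeroed field therefore decodes to the native 0xff tag without any
      per-page initialization, which is the property the message relies on.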
      
      This patch fixes the encountered conflict with KFENCE and prevents more
      similar issues that can occur in the future.
      
      Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
      Fixes: 2813b9c0 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf10bd4c
    • M
      hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings · d85aecf2
      Miaohe Lin committed
      The current implementation of hugetlb_cgroup for shared mappings could
      have different behavior.  Consider the following two scenarios:
      
       1. Assume initial css reference count of hugetlb_cgroup is 1:
        1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
            count is 2 associated with 1 file_region.
        1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
            count is 3 associated with 2 file_region.
        1.3 coalesce_file_region will coalesce these two file_regions into
            one. So css reference count is 3 associated with 1 file_region
            now.
      
       2. Assume initial css reference count of hugetlb_cgroup is 1 again:
        2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
            count is 2 associated with 1 file_region.
      
      Therefore, we might have one file_region while holding one or more css
      reference counts. This inconsistency could lead to imbalanced css_get()
      and css_put() pair. If we do css_put one by one (e.g. hole punch case),
      scenario 2 would put one more css reference. If we do css_put all
      together (e.g. truncate case), scenario 1 will leak one css reference.
      
      The imbalanced css_get() and css_put() pair would result in a non-zero
      reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
      directory is removed __but__ the associated resource is not freed. This
      might ultimately result in OOM, or in failure to create a new hugetlb
      cgroup under a busy workload.
      
      In order to fix this, we have to make sure that one file_region must
      hold exactly one css reference. So in coalesce_file_region case, we
      should release one css reference before coalescence. Also only put css
      reference when the entire file_region is removed.
      
      The last thing to note is that the caller of region_add() will only hold
      one reference to h_cg->css for the whole contiguous reservation region.
      But this area might be scattered when there are already some
      file_regions residing in it. As a result, many file_regions may share only
      one h_cg->css reference. In order to ensure that one file_region must
      hold exactly one css reference, we should do css_get() for each
      file_region and release the reference held by caller when they are done.
      
      [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
        Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
      Fixes: 075a61d0 ("hugetlb_cgroup: add accounting for shared mappings")
      Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR)
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d85aecf2
  3. 25 Mar, 2021 19 commits
  4. 24 Mar, 2021 10 commits
    • D
      net: make unregister netdev warning timeout configurable · 5aa3afe1
      Dmitry Vyukov committed
      netdev_wait_allrefs() issues a warning if refcount does not drop to 0
      after 10 seconds. While 10 second wait generally should not happen
      under normal workload in normal environment, it seems to fire falsely
      very often during fuzzing and/or in qemu emulation (~10x slower).
      At least it's not possible to understand if it's really a false
      positive or not. Automated testing generally bumps all timeouts
      to very high values to avoid flake failures.
      Add net.core.netdev_unregister_timeout_secs sysctl to make
      the timeout configurable for automated testing systems.
      Lowering the timeout may also be useful for e.g. manual bisection.
      The default value matches the current behavior.
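
      To make the role of the timeout concrete, the check can be sketched as
      below.  This is an assumption-laden model rather than the code of
      netdev_wait_allrefs(): should_warn() is a hypothetical name, the tick
      rate is passed in as a parameter, and the wrap-safe comparison is
      written in the style of the kernel's time_after().

```c
/* Default taken from the message above: warn after 10 seconds. */
#define DEFAULT_UNREGISTER_TIMEOUT_SECS 10

/*
 * Return 1 if "now" is past the deadline of last_warning + timeout, else 0.
 * All times are in ticks; hz is the number of ticks per second.
 * The signed difference keeps the comparison correct across counter wraparound.
 */
static int should_warn(unsigned long now, unsigned long last_warning,
		       unsigned int timeout_secs, unsigned int hz)
{
	unsigned long deadline = last_warning + (unsigned long)timeout_secs * hz;

	return (long)(deadline - now) < 0;
}
```

      With a larger timeout_secs, a slow fuzzing or emulated environment gets
      more time before the warning fires; a smaller value makes it fire
      sooner, for example during manual bisection.
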
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5aa3afe1
    • V
      net: ocelot: replay switchdev events when joining bridge · e4bd44e8
      Vladimir Oltean committed
      The premise of this change is that the switchdev port attributes and
      objects offloaded by ocelot might have been missed when we are joining
      an already existing bridge port, such as a bonding interface.
      
      The patch pulls these switchdev attributes and objects from the bridge,
      on behalf of the 'bridge port' net device which might be either the
      ocelot switch interface, or the bonding upper interface.
      
      The ocelot_net.c belongs strictly to the switchdev ocelot driver, while
      ocelot.c is part of a library shared with the DSA felix driver.
      The ocelot_port_bridge_leave function (part of the common library) used
      to call ocelot_port_vlan_filtering(false), something which is not
      necessary for DSA, since the framework deals with that already there.
      So we move this function to ocelot_switchdev_unsync, which is specific
      to the switchdev driver.
      
      The code movement described above makes ocelot_port_bridge_leave no
      longer return an error code, so we change its type from int to void.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e4bd44e8
    • V
      net: bridge: add helper to replay VLANs installed on port · 22f67cdf
      Vladimir Oltean committed
      Currently this simple setup with DSA:
      
      ip link add br0 type bridge vlan_filtering 1
      ip link add bond0 type bond
      ip link set bond0 master br0
      ip link set swp0 master bond0
      
      will not work because the bridge has created the PVID in br_add_if ->
      nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,
      but that was too early, since swp0 was not yet a lower of bond0, so it
      had no reason to act upon that notification.
      
      We need a helper in the bridge to replay the switchdev VLAN objects that
      were notified since the bridge port creation, because some of them may
      have been missed.
      
      As opposed to the br_mdb_replay function, the vg->vlan_list write side
      protection is offered by the rtnl_mutex which is sleepable, so we don't
      need to queue up the objects in atomic context, we can replay them right
      away.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22f67cdf
    • V
      net: bridge: add helper to replay port and local fdb entries · 04846f90
      Vladimir Oltean committed
      When a switchdev port starts offloading a LAG that is already in a
      bridge and has an FDB entry pointing to it:
      
      ip link set bond0 master br0
      bridge fdb add dev bond0 00:01:02:03:04:05 master static
      ip link set swp0 master bond0
      
      the switchdev driver will have no idea that this FDB entry is there,
      because it missed the switchdev event emitted at its creation.
      
      Ido Schimmel pointed this out during a discussion about challenges with
      switchdev offloading of stacked interfaces between the physical port and
      the bridge, and recommended to just catch that condition and deny the
      CHANGEUPPER event:
      https://lore.kernel.org/netdev/20210210105949.GB287766@shredder.lan/
      
      But in fact, we might need to deal with the hard thing anyway, which is
      to replay all FDB addresses relevant to this port, because it isn't just
      static FDB entries, but also local addresses (ones that are not
      forwarded but terminated by the bridge). There, we can't just say 'oh
      yeah, there was an upper already so I'm not joining that'.
      
      So, similar to the logic for replaying MDB entries, add a function that
      must be called by individual switchdev drivers and replays local FDB
      entries as well as ones pointing towards a bridge port. This time, we
      use the atomic switchdev notifier block, since that's what FDB entries
      expect for some reason.
      Reported-by: Ido Schimmel <idosch@idosch.org>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04846f90
    • V
      net: bridge: add helper to replay port and host-joined mdb entries · 4f2673b3
      Vladimir Oltean committed
      I have a system with DSA ports, and udhcpcd is configured to bring
      interfaces up as soon as they are created.
      
      I create a bridge as follows:
      
      ip link add br0 type bridge
      
      As soon as I create the bridge and udhcpcd brings it up, I also have
      avahi which automatically starts sending IPv6 packets to advertise some
      local services, and because of that, the br0 bridge joins the following
      IPv6 groups due to the code path detailed below:
      
      33:33:ff:6d:c1:9c vid 0
      33:33:00:00:00:6a vid 0
      33:33:00:00:00:fb vid 0
      
      br_dev_xmit
      -> br_multicast_rcv
         -> br_ip6_multicast_add_group
            -> __br_multicast_add_group
               -> br_multicast_host_join
                  -> br_mdb_notify
      
      This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
      hooked up, and switchdev will attempt to offload the host joined groups
      to an empty list of ports. Of course nobody offloads them.
      
      Then when we add a port to br0:
      
      ip link set swp0 master br0
      
      the bridge doesn't replay the host-joined MDB entries from br_add_if,
      and eventually the host joined addresses expire, and a switchdev
      notification for deleting it is emitted, but surprise, the original
      addition was already completely missed.
      
      The strategy to address this problem is to replay the MDB entries (both
      the port ones and the host joined ones) when the new port joins the
      bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
      be populated and only then attached to a bridge that you offload).
      However there are 2 possibilities: the addresses can be 'pushed' by the
      bridge into the port, or the port can 'pull' them from the bridge.
      
      Considering that in the general case, the new port can be really late to
      the party, and there may have been many other switchdev ports that
      already received the initial notification, we would like to avoid
      delivering duplicate events to them, since they might misbehave. And
      currently, the bridge calls the entire switchdev notifier chain, whereas
      for replaying it should just call the notifier block of the new guy.
      But the bridge doesn't know what is the new guy's notifier block, it
      just knows where the switchdev notifier chain is. So for simplification,
      we make this a driver-initiated pull for now, and the notifier block is
      passed as an argument.
      
      To emulate the calling context for mdb objects (deferred and put on the
      blocking notifier chain), we must iterate under RCU protection through
      the bridge's mdb entries, queue them, and only call them once we're out
      of the RCU read-side critical section.
      
      There was some opportunity for reuse between br_mdb_switchdev_host_port,
      br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev
      mdb object is created, so a helper was created.
      Suggested-by: Ido Schimmel <idosch@idosch.org>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4f2673b3
    • V
      net: bridge: add helper to retrieve the current ageing time · f1d42ea1
      Vladimir Oltean committed
      The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from:
      
      sysfs/ioctl/netlink
      -> br_set_ageing_time
         -> __set_ageing_time
      
      therefore not at bridge port creation time, so:
      (a) switchdev drivers have to hardcode the initial value for the address
          ageing time, because they didn't get any notification
      (b) that hardcoded value can be out of sync, if the user changes the
          ageing time before enslaving the port to the bridge
      
      We need a helper in the bridge, such that switchdev drivers can query
      the current value of the bridge ageing time when they start offloading
      it.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1d42ea1
    • V
      net: bridge: add helper for retrieving the current bridge port STP state · c0e715bb
      Vladimir Oltean committed
      It may happen that we have the following topology with DSA or any other
      switchdev driver with LAG offload:
      
      ip link add br0 type bridge stp_state 1
      ip link add bond0 type bond
      ip link set bond0 master br0
      ip link set swp0 master bond0
      ip link set swp1 master bond0
      
      STP decides that it should put bond0 into the BLOCKING state, and
      that's that. The ports that are actively listening for the switchdev
      port attributes emitted for the bond0 bridge port (because they are
      offloading it) and have the honor of seeing that switchdev port
      attribute can react to it, so we can program swp0 and swp1 into the
      BLOCKING state.
      
      But if then we do:
      
      ip link set swp2 master bond0
      
      then as far as the bridge is concerned, nothing has changed: it still
      has one bridge port. But this new bridge port will not see any STP state
      change notification and will remain FORWARDING, which is how the
      standalone code leaves it in.
      
      We need a function in the bridge driver which retrieves the current STP
      state, such that drivers can synchronize to it when they may have missed
      switchdev events.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0e715bb
    • X
      net: lapb: Make "lapb_t1timer_running" able to detect an already running timer · 65d2dbb3
      Xie He committed
      Problem:
      
      The "lapb_t1timer_running" function in "lapb_timer.c" is used in only
      one place: in the "lapb_kick" function in "lapb_out.c". "lapb_kick" calls
      "lapb_t1timer_running" to check if the timer is already pending, and if
      it is not, schedule it to run.
      
      However, if the timer has already fired and is running, and is waiting to
      get the "lapb->lock" lock, "lapb_t1timer_running" will not detect this,
      and "lapb_kick" will then schedule a new timer. The old timer will then
      abort when it sees a new timer pending.
      
      I think this is not right. The purpose of "lapb_kick" should be ensuring
      that the actual work of the timer function is scheduled to be done.
      If the timer function is already running but waiting for the lock,
      "lapb_kick" should not abort and reschedule it.
      
      Changes made:
      
      I added a new field "t1timer_running" in "struct lapb_cb" for
      "lapb_t1timer_running" to use. "t1timer_running" will accurately reflect
      whether the actual work of the timer is pending. If the timer has fired
      but is still waiting for the lock, "t1timer_running" will still correctly
      reflect whether the actual work is waiting to be done.
      
      The old "t1timer_stop" field, whose only responsibility is to ask a timer
      (that is already running but waiting for the lock) to abort, is no longer
      needed, because the new "t1timer_running" field can fully take over its
      responsibility. Therefore "t1timer_stop" is deleted.
      
      "t1timer_running" is not simply a negation of the old "t1timer_stop".
      At the end of the timer function, if it does not reschedule itself,
      "t1timer_running" is set to false to indicate that the timer is stopped.
      
      For consistency of the code, I also added "t2timer_running" and deleted
      "t2timer_stop".
      Signed-off-by: Xie He <xie.he.0141@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65d2dbb3
    • M
      mm/writeback: Add wait_on_page_writeback_killable · e5dbd332
      Matthew Wilcox (Oracle) committed
      This is the killable version of wait_on_page_writeback.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: kafs-testing@auristor.com
      cc: linux-afs@lists.infradead.org
      cc: linux-mm@kvack.org
      Link: https://lore.kernel.org/r/20210320054104.1300774-3-willy@infradead.org
      e5dbd332
    • M
      fs/cachefiles: Remove wait_bit_key layout dependency · 39f985c8
      Matthew Wilcox (Oracle) committed
      Cachefiles was relying on wait_page_key and wait_bit_key being the
      same layout, which is fragile.  Now that wait_page_key is exposed in
      the pagemap.h header, we can remove that fragility.
      
      A comment on the need to maintain structure layout equivalence was added by
      Linus[1] and that is no longer applicable.
      
      Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: kafs-testing@auristor.com
      cc: linux-cachefs@redhat.com
      cc: linux-mm@kvack.org
      Link: https://lore.kernel.org/r/20210320054104.1300774-2-willy@infradead.org/
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3510ca20ece0150af6b10c77a74ff1b5c198e3e2 [1]
      39f985c8