- 27 3月, 2021 4 次提交
-
-
由 Martin KaFai Lau 提交于
This patch adds a few kernel function bpf_kfunc_call_test*() for the selftest's test_run purpose. They will be allowed for tc_cls prog. The selftest calling the kernel function bpf_kfunc_call_test*() is also added in this patch. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210325015252.1551395-1-kafai@fb.com
-
由 Martin KaFai Lau 提交于
This patch adds support to BPF verifier to allow bpf program calling kernel function directly. The use case included in this set is to allow bpf-tcp-cc to directly call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()"). Those functions have already been used by some kernel tcp-cc implementations. This set will also allow the bpf-tcp-cc program to directly call the kernel tcp-cc implementation, For example, a bpf_dctcp may only want to implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly from the kernel tcp_dctcp.c instead of reimplementing (or copy-and-pasting) them. The tcp-cc kernel functions mentioned above will be white listed for the struct_ops bpf-tcp-cc programs to use in a later patch. The white listed functions are not bounded to a fixed ABI contract. Those functions have already been used by the existing kernel tcp-cc. If any of them has changed, both in-tree and out-of-tree kernel tcp-cc implementations have to be changed. The same goes for the struct_ops bpf-tcp-cc programs which have to be adjusted accordingly. This patch is to make the required changes in the bpf verifier. First change is in btf.c, it adds a case in "btf_check_func_arg_match()". When the passed in "btf->kernel_btf == true", it means matching the verifier regs' states with a kernel function. This will handle the PTR_TO_BTF_ID reg. It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET, and PTR_TO_TCP_SOCK to its kernel's btf_id. In the later libbpf patch, the insn calling a kernel function will look like: insn->code == (BPF_JMP | BPF_CALL) insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */ insn->imm == func_btf_id /* btf_id of the running kernel */ [ For the future calling function-in-kernel-module support, an array of module btf_fds can be passed at the load time and insn->off can be used to index into this array. ] At the early stage of verifier, the verifier will collect all kernel function calls into "struct bpf_kfunc_desc". Those descriptors are stored in "prog->aux->kfunc_tab" and will be available to the JIT. Since this "add" operation is similar to the current "add_subprog()" and looking for the same insn->code, they are done together in the new "add_subprog_and_kfunc()". In the "do_check()" stage, the new "check_kfunc_call()" is added to verify the kernel function call instruction: 1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE. A new bpf_verifier_ops "check_kfunc_call" is added to do that. The bpf-tcp-cc struct_ops program will implement this function in a later patch. 2. Call "btf_check_kfunc_args_match()" to ensure the regs can be used as the args of a kernel function. 3. Mark the regs' type, subreg_def, and zext_dst. At the later do_misc_fixups() stage, the new fixup_kfunc_call() will replace the insn->imm with the function address (relative to __bpf_call_base). If needed, the jit can find the btf_func_model by calling the new bpf_jit_find_kfunc_model(prog, insn). With the imm set to the function address, "bpftool prog dump xlated" will be able to display the kernel function calls the same way as it displays other bpf helper calls. gpl_compatible program is required to call kernel function. This feature currently requires JIT. The verifier selftests are adjusted because of the changes in the verbose log in add_subprog_and_kfunc(). Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
-
由 Martin KaFai Lau 提交于
This patch moved the subprog specific logic from btf_check_func_arg_match() to the new btf_check_subprog_arg_match(). The core logic is left in btf_check_func_arg_match() which will be reused later to check the kernel function call. The "if (!btf_type_is_ptr(t))" is checked first to improve the indentation which will be useful for a later patch. Some of the "btf_kind_str[]" usages is replaced with the shortcut "btf_type_str(t)". Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210325015136.1544504-1-kafai@fb.com
-
由 Martin KaFai Lau 提交于
This patch simplifies the linfo freeing logic by combining "bpf_prog_free_jited_linfo()" and "bpf_prog_free_unused_jited_linfo()" into the new "bpf_prog_jit_attempt_done()". It is a prep work for the kernel function call support. In a later patch, freeing the kernel function call descriptors will also be done in the "bpf_prog_jit_attempt_done()". "bpf_prog_free_linfo()" is removed since it is only called by "__bpf_prog_put_noref()". The kvfree() are directly called instead. It also takes this chance to s/kcalloc/kvcalloc/ for the jited_linfo allocation. Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210325015130.1544323-1-kafai@fb.com
-
- 26 3月, 2021 6 次提交
-
-
由 Yonghong Song 提交于
Jiri Olsa reported a bug ([1]) in kernel where cgroup local storage pointer may be NULL in bpf_get_local_storage() helper. There are two issues uncovered by this bug: (1). kprobe or tracepoint prog incorrectly sets cgroup local storage before prog run, (2). due to change from preempt_disable to migrate_disable, preemption is possible and percpu storage might be overwritten by other tasks. This issue (1) is fixed in [2]. This patch tried to address issue (2). The following shows how things can go wrong: task 1: bpf_cgroup_storage_set() for percpu local storage preemption happens task 2: bpf_cgroup_storage_set() for percpu local storage preemption happens task 1: run bpf program task 1 will effectively use the percpu local storage setting by task 2 which will be either NULL or incorrect ones. Instead of just one common local storage per cpu, this patch fixed the issue by permitting 8 local storages per cpu and each local storage is identified by a task_struct pointer. This way, we allow at most 8 nested preemption between bpf_cgroup_storage_set() and bpf_cgroup_storage_unset(). The percpu local storage slot is released (calling bpf_cgroup_storage_unset()) by the same task after bpf program finished running. bpf_test_run() is also fixed to use the new bpf_cgroup_storage_set() interface. The patch is tested on top of [2] with reproducer in [1]. Without this patch, kernel will emit error in 2-3 minutes. With this patch, after one hour, still no error. [1] https://lore.kernel.org/bpf/CAKH8qBuXCfUz=w8L+Fj74OaUpbosO29niYwTki7e3Ag044_aww@mail.gmail.com/T [2] https://lore.kernel.org/bpf/20210309185028.3763817-1-yhs@fb.comSigned-off-by: NYonghong Song <yhs@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NRoman Gushchin <guro@fb.com> Link: https://lore.kernel.org/bpf/20210323055146.3334476-1-yhs@fb.com
-
由 Ricardo Ribalda 提交于
Trivial fix. Signed-off-by: NRicardo Ribalda <ribalda@chromium.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210318202223.164873-8-ribalda@chromium.org
-
由 Mike Rapoport 提交于
Commit 34dc2efb ("memblock: fix section mismatch warning") marked memblock_bottom_up() and memblock_set_bottom_up() as __init, but they could be referenced from non-init functions like memblock_find_in_range_node() on architectures that enable CONFIG_ARCH_KEEP_MEMBLOCK. For such builds kernel test robot reports: WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up() The function memblock_find_in_range_node() references the function __init memblock_bottom_up(). This is often because memblock_find_in_range_node lacks a __init annotation or the annotation of memblock_bottom_up is wrong. Replace __init annotations with __init_memblock annotations so that the appropriate section will be selected depending on CONFIG_ARCH_KEEP_MEMBLOCK. Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org Fixes: 34dc2efb ("memblock: fix section mismatch warning") Signed-off-by: NMike Rapoport <rppt@linux.ibm.com> Reviewed-by: NArnd Bergmann <arnd@arndb.de> Reported-by: Nkernel test robot <lkp@intel.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NNick Desaulniers <ndesaulniers@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sean Christopherson 提交于
If one or more notifiers fails .invalidate_range_start(), invoke .invalidate_range_end() for "all" notifiers. If there are multiple notifiers, those that did not fail are expecting _start() and _end() to be paired, e.g. KVM's mmu_notifier_count would become imbalanced. Disallow notifiers that can fail _start() from implementing _end() so that it's unnecessary to either track which notifiers rejected _start(), or had already succeeded prior to a failed _start(). Note, the existing behavior of calling _start() on all notifiers even after a previous notifier failed _start() was an unintented "feature". Make it canon now that the behavior is depended on for correctness. As of today, the bug is likely benign: 1. The only caller of the non-blocking notifier is OOM kill. 2. The only notifiers that can fail _start() are the i915 and Nouveau drivers. 3. The only notifiers that utilize _end() are the SGI UV GRU driver and KVM. 4. The GRU driver will never coincide with the i195/Nouveau drivers. 5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the _guest_, and the guest is already doomed due to being an OOM victim. Fix the bug now to play nice with future usage, e.g. KVM has a potential use case for blocking memslot updates in KVM while an invalidation is in-progress, and failure to unblock would result in said updates being blocked indefinitely and hanging. Found by inspection. Verified by adding a second notifier in KVM that periodically returns -EAGAIN on non-blockable ranges, triggering OOM, and observing that KVM exits with an elevated notifier count. Link: https://lkml.kernel.org/r/20210311180057.1582638-1-seanjc@google.com Fixes: 93065ac7 ("mm, oom: distinguish blockable mode for mmu notifiers") Signed-off-by: NSean Christopherson <seanjc@google.com> Suggested-by: NJason Gunthorpe <jgg@ziepe.ca> Reviewed-by: NJason Gunthorpe <jgg@nvidia.com> Cc: David Rientjes <rientjes@google.com> Cc: Ben Gardon <bgardon@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Konovalov 提交于
To allow performing tag checks on page_alloc addresses obtained via page_address(), tag-based KASAN modes store tags for page_alloc allocations in page->flags. Currently, the default tag value stored in page->flags is 0x00. Therefore, page_address() returns a 0x00ffff... address for pages that were not allocated via page_alloc. This might cause problems. A particular case we encountered is a conflict with KFENCE. If a KFENCE-allocated slab object is being freed via kfree(page_address(page) + offset), the address passed to kfree() will get tagged with 0x00 (as slab pages keep the default per-page tags). This leads to is_kfence_address() check failing, and a KFENCE object ending up in normal slab freelist, which causes memory corruptions. This patch changes the way KASAN stores tag in page-flags: they are now stored xor'ed with 0xff. This way, KASAN doesn't need to initialize per-page flags for every created page, which might be slow. With this change, page_address() returns natively-tagged (with 0xff) pointers for pages that didn't have tags set explicitly. This patch fixes the encountered conflict with KFENCE and prevents more similar issues that can occur in the future. Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com Fixes: 2813b9c0 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc") Signed-off-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NMarco Elver <elver@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miaohe Lin 提交于
The current implementation of hugetlb_cgroup for shared mappings could have different behavior. Consider the following two scenarios: 1.Assume initial css reference count of hugetlb_cgroup is 1: 1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference count is 2 associated with 1 file_region. 1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference count is 3 associated with 2 file_region. 1.3 coalesce_file_region will coalesce these two file_regions into one. So css reference count is 3 associated with 1 file_region now. 2.Assume initial css reference count of hugetlb_cgroup is 1 again: 2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference count is 2 associated with 1 file_region. Therefore, we might have one file_region while holding one or more css reference counts. This inconsistency could lead to imbalanced css_get() and css_put() pair. If we do css_put one by one (i.g. hole punch case), scenario 2 would put one more css reference. If we do css_put all together (i.g. truncate case), scenario 1 will leak one css reference. The imbalanced css_get() and css_put() pair would result in a non-zero reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup directory is removed __but__ associated resource is not freed. This might result in OOM or can not create a new hugetlb cgroup in a busy workload ultimately. In order to fix this, we have to make sure that one file_region must hold exactly one css reference. So in coalesce_file_region case, we should release one css reference before coalescence. Also only put css reference when the entire file_region is removed. The last thing to note is that the caller of region_add() will only hold one reference to h_cg->css for the whole contiguous reservation region. But this area might be scattered when there are already some file_regions reside in it. As a result, many file_regions may share only one h_cg->css reference. In order to ensure that one file_region must hold exactly one css reference, we should do css_get() for each file_region and release the reference held by caller when they are done. [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings] Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com Fixes: 075a61d0 ("hugetlb_cgroup: add accounting for shared mappings") Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR) Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 3月, 2021 10 次提交
-
-
由 Ong Boon Leong 提交于
In order to discover whether remote station supports frame preemption, local station sends verify mPacket and expects response mPacket in return from the remote station. So, we add the functions to send and handle event when verify mPacket and response mPacket are exchanged between the networked stations. The mechanism to handle different FPE states between local and remote station (link partner) is implemented using workqueue which starts a task each time there is some sign of verify & response mPacket exchange as check in FPE IRQ event. The task retries couple of times to try to spot the states that both stations are ready to enter FPE ON. This allows different end points to enable FPE at different time and verify-response mPacket can happen asynchronously. Ultimately, the task will only turn FPE ON when local station have both exchange response in both directions. Thanks to Voon Weifeng for implementing the core functions for detecting FPE events and send mPacket and phylink related change. Signed-off-by: NOng Boon Leong <boon.leong.ong@intel.com> Co-developed-by: NVoon Weifeng <weifeng.voon@intel.com> Signed-off-by: NVoon Weifeng <weifeng.voon@intel.com> Co-developed-by: NTan Tee Min <tee.min.tan@intel.com> Signed-off-by: NTan Tee Min <tee.min.tan@intel.com> Co-developed-by: NMohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com> Signed-off-by: NMohammad Athari Bin Ismail <mohammad.athari.ismail@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Wong Vee Khee 提交于
Add generic code to enable C45 PHY loopback into the common phy-c45.c file. This will allow C45 PHY drivers aceess this by setting .set_loopback. Suggested-by: NHeiner Kallweit <hkallweit1@gmail.com> Signed-off-by: NWong Vee Khee <vee.khee.wong@linux.intel.com> Reviewed-by: NHeiner Kallweit <hkallweit1@gmail.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Tan Tee Min 提交于
Cross timestamping is supported on Integrated Ethernet Controller in Intel SoC such as EHL and TGL with Always Running Timer. The hardware cross-timestamp result is made available to applications through the PTP_SYS_OFFSET_PRECISE ioctl which calls stmmac_getcrosststamp(). Device time is stored in the MAC Auxiliary register. The 64-bit System time (ART timestamp) is stored in registers that are only addressable by using MDIO space. Signed-off-by: NTan Tee Min <tee.min.tan@intel.com> Co-developed-by: NWong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by: NWong Vee Khee <vee.khee.wong@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Felix Fietkau 提交于
The switch might have already added the VLAN tag through PVID hardware offload. Keep this extra VLAN in the flowtable but skip it on egress. Signed-off-by: NFelix Fietkau <nbd@nbd.name> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Felix Fietkau 提交于
Add .ndo_fill_forward_path for dsa slave port devices Signed-off-by: NFelix Fietkau <nbd@nbd.name> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Felix Fietkau 提交于
Pass on the PPPoE session ID, destination hardware address and the real device. Signed-off-by: NFelix Fietkau <nbd@nbd.name> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Felix Fietkau 提交于
Depending on the VLAN settings of the bridge and the port, the bridge can either add or remove a tag. When vlan filtering is enabled, the fdb lookup also needs to know the VLAN tag/proto for the destination address To provide this, keep track of the stack of VLAN tags for the path in the lookup context Signed-off-by: NFelix Fietkau <nbd@nbd.name> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pablo Neira Ayuso 提交于
Add .ndo_fill_forward_path for bridge devices. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pablo Neira Ayuso 提交于
Add .ndo_fill_forward_path for vlan devices. For instance, assuming the following topology: IP forwarding / \ eth0.100 eth0 | eth0 . . . ethX ab:cd:ef:ab:cd:ef For packets going through IP forwarding to eth0.100 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: eth0.100 -> eth0 Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pablo Neira Ayuso 提交于
This patch adds dev_fill_forward_path() which resolves the path to reach the real netdevice from the IP forwarding side. This function takes as input the netdevice and the destination hardware address and it walks down the devices calling .ndo_fill_forward_path() for each device until the real device is found. For instance, assuming the following topology: IP forwarding / \ br0 eth0 / \ eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity. ethX is the interface in another box which is connected to the eth1 bridge port. For packets going through IP forwarding to br0 whose destination MAC address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the following path: br0 -> eth1 .ndo_fill_forward_path for br0 looks up at the FDB for the bridge port from the destination MAC address to get the bridge port eth1. This information allows to create a fast path that bypasses the classic bridge and IP forwarding paths, so packets go directly from the bridge port eth1 to eth0 (wan interface) and vice versa. fast path .------------------------. / \ | IP forwarding | | / \ \/ | br0 eth0 . / \ -> eth1 eth2 . . . ethX ab:cd:ef:ab:cd:ef Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 3月, 2021 8 次提交
-
-
由 Dmitry Vyukov 提交于
netdev_wait_allrefs() issues a warning if refcount does not drop to 0 after 10 seconds. While 10 second wait generally should not happen under normal workload in normal environment, it seems to fire falsely very often during fuzzing and/or in qemu emulation (~10x slower). At least it's not possible to understand if it's really a false positive or not. Automated testing generally bumps all timeouts to very high values to avoid flake failures. Add net.core.netdev_unregister_timeout_secs sysctl to make the timeout configurable for automated testing systems. Lowering the timeout may also be useful for e.g. manual bisection. The default value matches the current behavior. Signed-off-by: NDmitry Vyukov <dvyukov@google.com> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877 Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vladimir Oltean 提交于
Currently this simple setup with DSA: ip link add br0 type bridge vlan_filtering 1 ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 will not work because the bridge has created the PVID in br_add_if -> nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1, but that was too early, since swp0 was not yet a lower of bond0, so it had no reason to act upon that notification. We need a helper in the bridge to replay the switchdev VLAN objects that were notified since the bridge port creation, because some of them may have been missed. As opposed to the br_mdb_replay function, the vg->vlan_list write side protection is offered by the rtnl_mutex which is sleepable, so we don't need to queue up the objects in atomic context, we can replay them right away. Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Acked-by: NNikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vladimir Oltean 提交于
When a switchdev port starts offloading a LAG that is already in a bridge and has an FDB entry pointing to it: ip link set bond0 master br0 bridge fdb add dev bond0 00:01:02:03:04:05 master static ip link set swp0 master bond0 the switchdev driver will have no idea that this FDB entry is there, because it missed the switchdev event emitted at its creation. Ido Schimmel pointed this out during a discussion about challenges with switchdev offloading of stacked interfaces between the physical port and the bridge, and recommended to just catch that condition and deny the CHANGEUPPER event: https://lore.kernel.org/netdev/20210210105949.GB287766@shredder.lan/ But in fact, we might need to deal with the hard thing anyway, which is to replay all FDB addresses relevant to this port, because it isn't just static FDB entries, but also local addresses (ones that are not forwarded but terminated by the bridge). There, we can't just say 'oh yeah, there was an upper already so I'm not joining that'. So, similar to the logic for replaying MDB entries, add a function that must be called by individual switchdev drivers and replays local FDB entries as well as ones pointing towards a bridge port. This time, we use the atomic switchdev notifier block, since that's what FDB entries expect for some reason. Reported-by: NIdo Schimmel <idosch@idosch.org> Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Acked-by: NNikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vladimir Oltean 提交于
I have a system with DSA ports, and udhcpcd is configured to bring interfaces up as soon as they are created. I create a bridge as follows: ip link add br0 type bridge As soon as I create the bridge and udhcpcd brings it up, I also have avahi which automatically starts sending IPv6 packets to advertise some local services, and because of that, the br0 bridge joins the following IPv6 groups due to the code path detailed below: 33:33:ff:6d:c1:9c vid 0 33:33:00:00:00:6a vid 0 33:33:00:00:00:fb vid 0 br_dev_xmit -> br_multicast_rcv -> br_ip6_multicast_add_group -> __br_multicast_add_group -> br_multicast_host_join -> br_mdb_notify This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host hooked up, and switchdev will attempt to offload the host joined groups to an empty list of ports. Of course nobody offloads them. Then when we add a port to br0: ip link set swp0 master br0 the bridge doesn't replay the host-joined MDB entries from br_add_if, and eventually the host joined addresses expire, and a switchdev notification for deleting it is emitted, but surprise, the original addition was already completely missed. The strategy to address this problem is to replay the MDB entries (both the port ones and the host joined ones) when the new port joins the bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can be populated and only then attached to a bridge that you offload). However there are 2 possibilities: the addresses can be 'pushed' by the bridge into the port, or the port can 'pull' them from the bridge. Considering that in the general case, the new port can be really late to the party, and there may have been many other switchdev ports that already received the initial notification, we would like to avoid delivering duplicate events to them, since they might misbehave. And currently, the bridge calls the entire switchdev notifier chain, whereas for replaying it should just call the notifier block of the new guy. But the bridge doesn't know what is the new guy's notifier block, it just knows where the switchdev notifier chain is. So for simplification, we make this a driver-initiated pull for now, and the notifier block is passed as an argument. To emulate the calling context for mdb objects (deferred and put on the blocking notifier chain), we must iterate under RCU protection through the bridge's mdb entries, queue them, and only call them once we're out of the RCU read-side critical section. There was some opportunity for reuse between br_mdb_switchdev_host_port, br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev mdb object is created, so a helper was created. Suggested-by: NIdo Schimmel <idosch@idosch.org> Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Acked-by: NNikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vladimir Oltean 提交于
The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from: sysfs/ioctl/netlink -> br_set_ageing_time -> __set_ageing_time therefore not at bridge port creation time, so: (a) switchdev drivers have to hardcode the initial value for the address ageing time, because they didn't get any notification (b) that hardcoded value can be out of sync, if the user changes the ageing time before enslaving the port to the bridge We need a helper in the bridge, such that switchdev drivers can query the current value of the bridge ageing time when they start offloading it. Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Reviewed-by: NTobias Waldekranz <tobias@waldekranz.com> Acked-by: NNikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vladimir Oltean 提交于
It may happen that we have the following topology with DSA or any other switchdev driver with LAG offload: ip link add br0 type bridge stp_state 1 ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 ip link set swp1 master bond0 STP decides that it should put bond0 into the BLOCKING state, and that's that. The ports that are actively listening for the switchdev port attributes emitted for the bond0 bridge port (because they are offloading it) and have the honor of seeing that switchdev port attribute can react to it, so we can program swp0 and swp1 into the BLOCKING state. But if then we do: ip link set swp2 master bond0 then as far as the bridge is concerned, nothing has changed: it still has one bridge port. But this new bridge port will not see any STP state change notification and will remain FORWARDING, which is how the standalone code leaves it in. We need a function in the bridge driver which retrieves the current STP state, such that drivers can synchronize to it when they may have missed switchdev events. Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Reviewed-by: NTobias Waldekranz <tobias@waldekranz.com> Acked-by: NNikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Matthew Wilcox (Oracle) 提交于
This is the killable version of wait_on_page_writeback. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid Howells <dhowells@redhat.com> Tested-by: kafs-testing@auristor.com cc: linux-afs@lists.infradead.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20210320054104.1300774-3-willy@infradead.org
-
由 Matthew Wilcox (Oracle) 提交于
Cachefiles was relying on wait_page_key and wait_bit_key being the same layout, which is fragile. Now that wait_page_key is exposed in the pagemap.h header, we can remove that fragility A comment on the need to maintain structure layout equivalence was added by Linus[1] and that is no longer applicable. Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit") Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid Howells <dhowells@redhat.com> Tested-by: kafs-testing@auristor.com cc: linux-cachefs@redhat.com cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20210320054104.1300774-2-willy@infradead.org/ Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3510ca20ece0150af6b10c77a74ff1b5c198e3e2 [1]
-
- 23 3月, 2021 6 次提交
-
-
由 Kurt Kanzenbach 提交于
Report the driver name, ASIC ID and the switch name via devlink. This is a useful information for user space tooling. Signed-off-by: NKurt Kanzenbach <kurt@kmk-computers.de> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NVladimir Oltean <olteanv@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the initial net device refcount was 0. When CONFIG_PCPU_DEV_REFCNT is not set, this means the first dev_hold() triggers an illegal refcount operation (addition on 0) refcount_t: addition on 0; use-after-free. WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4 Fix is to change initial (and final) refcount to be 1. Also add a missing kerneldoc piece, as reported by Stephen Rothwell. Fixes: 919067cc ("net: add CONFIG_PCPU_DEV_REFCNT") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NGuenter Roeck <groeck@google.com> Tested-by: NGuenter Roeck <groeck@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vladimir Oltean 提交于
ptype_all and ptype_base are declared in net/core/dev.c as non-static, because they are used by net-procfs.c too. However, a "make W=1" build complains that there was no previous declaration of ptype_all and ptype_base in a header file, so this way of declaring things constitutes a violation of coding style. Let's move the extern declarations of ptype_all and ptype_base to the linux/netdevice.h file, which is included by net-procfs.c too. Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Bhaskar Chowdhury 提交于
s/unrequired/"not required"/ s/consme/consume/ .....two different places s/accros/across/ Signed-off-by: NBhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: NIgor Russkikh <irusskikh@marvell.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vincent Mailhol 提交于
Add a function to set the dynamic queue limit minimum value. Some specific drivers might have legitimate reasons to configure dql.min_limit to a given value. Typically, this is the case when the PDU of the protocol is smaller than the packet size to used to carry those frames to the device. Concrete example: a CAN (Control Area Network) device with an USB 2.0 interface. The PDU of classical CAN protocol are roughly 16 bytes but the USB packet size (which is used to carry the CAN frames to the device) might be up to 512 bytes. Wen small traffic burst occurs, BQL algorithm is not able to immediately adjust and this would result in having to send many small USB packets (i.e packet of 16 bytes for each CAN frame). Filling up the USB packet with CAN frames is relatively fast (small latency issue) but the gain of not having to send several small USB packets is huge (big throughput increase). In this case, forcing dql.min_limit to a given value that would allow to stuff the USB packet is always a win. This function is to be used by network drivers which are able to prove through a rationale and through empirical tests on several environment (with other applications, heavy context switching, virtualization...), that they constantly reach better performances with a specific predefined dql.min_limit value with no noticeable latency impact. Signed-off-by: NVincent Mailhol <mailhol.vincent@wanadoo.fr> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Qi Zhang 提交于
The virtual channel is going to be extended to support FDIR and RSS configure from AVF. New data structures and OP codes will be added, the patch enable the FDIR part. To support above advanced AVF feature, we need to figure out what kind of data structure should be passed from VF to PF to describe an FDIR rule or RSS config rule. The common part of the requirement is we need a data structure to represent the input set selection of a rule's hash key. An input set selection is a group of fields be selected from one or more network protocol layers that could be identified as a specific flow. For example, select dst IP address from an IPv4 header combined with dst port from the TCP header as the input set for an IPv4/TCP flow. The patch adds a new data structure virtchnl_proto_hdrs to abstract a network protocol headers group which is composed of layers of network protocol header(virtchnl_proto_hdr). A protocol header contains a 32 bits mask (field_selector) to describe which fields are selected as input sets, as well as a header type (enum virtchnl_proto_hdr_type). Each bit is mapped to a field in enum virtchnl_proto_hdr_field guided by its header type. +------------+-----------+------------------------------+ | | Proto Hdr | Header Type A | | | +------------------------------+ | | | BIT 31 | ... | BIT 1 | BIT 0 | | |-----------+------------------------------+ |Proto Hdrs | Proto Hdr | Header Type B | | | +------------------------------+ | | | BIT 31 | ... | BIT 1 | BIT 0 | | |-----------+------------------------------+ | | Proto Hdr | Header Type C | | | +------------------------------+ | | | BIT 31 | ... | BIT 1 | BIT 0 | | |-----------+------------------------------+ | | .... | +-------------------------------------------------------+ All fields in enum virtchnl_proto_hdr_fields are grouped with header type and the value of the first field of a header type is always 32 aligned. enum proto_hdr_type { header_type_A = 0; header_type_B = 1; .... } enum proto_hdr_field { /* header type A */ header_A_field_0 = 0, header_A_field_1 = 1, header_A_field_2 = 2, header_A_field_3 = 3, /* header type B */ header_B_field_0 = 32, // = header_type_B << 5 header_B_field_0 = 33, header_B_field_0 = 34 header_B_field_0 = 35, .... }; So we have: proto_hdr_type = proto_hdr_field / 32 bit offset = proto_hdr_field % 32 To simply the protocol header's operations, couple help macros are added. For example, to select src IP and dst port as input set for an IPv4/UDP flow. we have: struct virtchnl_proto_hdr hdr[2]; VIRTCHNL_SET_PROTO_HDR_TYPE(&hdr[0], IPV4) VIRTCHNL_ADD_PROTO_HDR_FIELD(&hdr[0], IPV4, SRC) VIRTCHNL_SET_PROTO_HDR_TYPE(&hdr[1], UDP) VIRTCHNL_ADD_PROTO_HDR_FIELD(&hdr[1], UDP, DST) The byte array is used to store the protocol header of a training package. The byte array must be network order. The patch added virtual channel support for iAVF FDIR add/validate/delete filter. iAVF FDIR is Flow Director for Intel Adaptive Virtual Function which can direct Ethernet packets to the queues of the Network Interface Card. Add/delete command is adding or deleting one rule for each virtual channel message, while validate command is just verifying if this rule is valid without any other operations. To add or delete one rule, driver needs to config TCAM and Profile, build training packets which contains the input set value, and send the training packets through FDIR Tx queue. In addition, driver needs to manage the software context to avoid adding duplicated rules, deleting non-existent rule, input set conflicts and other invalid cases. NOTE: Supported pattern/actions and their parse functions are not be included in this patch, they will be added in a separate one. Signed-off-by: NJeff Guo <jia.guo@intel.com> Signed-off-by: NYahui Cao <yahui.cao@intel.com> Signed-off-by: NSimei Su <simei.su@intel.com> Signed-off-by: NBeilei Xing <beilei.xing@intel.com> Signed-off-by: NQi Zhang <qi.z.zhang@intel.com> Tested-by: NChen Bo <BoX.C.Chen@intel.com> Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
-
- 20 3月, 2021 2 次提交
-
-
由 Zqiang 提交于
The syzbot reported a memleak as follows: BUG: memory leak unreferenced object 0xffff888101b41d00 (size 120): comm "kworker/u4:0", pid 8, jiffies 4294944270 (age 12.780s) backtrace: [<ffffffff8125dc56>] alloc_pid+0x66/0x560 [<ffffffff81226405>] copy_process+0x1465/0x25e0 [<ffffffff81227943>] kernel_clone+0xf3/0x670 [<ffffffff812281a1>] kernel_thread+0x61/0x80 [<ffffffff81253464>] call_usermodehelper_exec_work [<ffffffff81253464>] call_usermodehelper_exec_work+0xc4/0x120 [<ffffffff812591c9>] process_one_work+0x2c9/0x600 [<ffffffff81259ab9>] worker_thread+0x59/0x5d0 [<ffffffff812611c8>] kthread+0x178/0x1b0 [<ffffffff8100227f>] ret_from_fork+0x1f/0x30 unreferenced object 0xffff888110ef5c00 (size 232): comm "kworker/u4:0", pid 8414, jiffies 4294944270 (age 12.780s) backtrace: [<ffffffff8154a0cf>] kmem_cache_zalloc [<ffffffff8154a0cf>] __alloc_file+0x1f/0xf0 [<ffffffff8154a809>] alloc_empty_file+0x69/0x120 [<ffffffff8154a8f3>] alloc_file+0x33/0x1b0 [<ffffffff8154ab22>] alloc_file_pseudo+0xb2/0x140 [<ffffffff81559218>] create_pipe_files+0x138/0x2e0 [<ffffffff8126c793>] umd_setup+0x33/0x220 [<ffffffff81253574>] call_usermodehelper_exec_async+0xb4/0x1b0 [<ffffffff8100227f>] ret_from_fork+0x1f/0x30 After the UMD process exits, the pipe_to_umh/pipe_from_umh and tgid need to be released. Fixes: d71fa5c9 ("bpf: Add kernel module with user mode driver that populates bpffs.") Reported-by: syzbot+44908bb56d2bfe56b28e@syzkaller.appspotmail.com Signed-off-by: NZqiang <qiang.zhang@windriver.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210317030915.2865-1-qiang.zhang@windriver.com
-
由 Eric Dumazet 提交于
I was working on a syzbot issue, claiming one device could not be dismantled because its refcount was -1 unregister_netdevice: waiting for sit0 to become free. Usage count = -1 It would be nice if syzbot could trigger a warning at the time this reference count became negative. This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults to per cpu variables (as before this patch) on SMP builds. v2: free_dev label in alloc_netdev_mqs() is moved to avoid a compiler warning (-Wunused-label), as reported by kernel test robot <lkp@intel.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 3月, 2021 4 次提交
-
-
由 Ard Biesheuvel 提交于
Commit 494c704f ("efi: Use 32-bit alignment for efi_guid_t") updated the type definition of efi_guid_t to ensure that it always appears sufficiently aligned (the UEFI spec is ambiguous about this, but given the fact that its EFI_GUID type is defined in terms of a struct carrying a uint32_t, the natural alignment is definitely >= 32 bits). However, we missed the EFI_GUID() macro which is used to instantiate efi_guid_t literals: that macro is still based on the guid_t type, which does not have a minimum alignment at all. This results in warnings such as In file included from drivers/firmware/efi/mokvar-table.c:35: include/linux/efi.h:1093:34: warning: passing 1-byte aligned argument to 4-byte aligned parameter 2 of 'get_var' may result in an unaligned pointer access [-Walign-mismatch] status = get_var(L"SecureBoot", &EFI_GLOBAL_VARIABLE_GUID, NULL, &size, ^ include/linux/efi.h:1101:24: warning: passing 1-byte aligned argument to 4-byte aligned parameter 2 of 'get_var' may result in an unaligned pointer access [-Walign-mismatch] get_var(L"SetupMode", &EFI_GLOBAL_VARIABLE_GUID, NULL, &size, &setupmode); The distinction only matters on CPUs that do not support misaligned loads fully, but 32-bit ARM's load-multiple instructions fall into that category, and these are likely to be emitted by the compiler that built the firmware for loading word-aligned 128-bit GUIDs from memory So re-implement the initializer in terms of our own efi_guid_t type, so that the alignment becomes a property of the literal's type. Fixes: 494c704f ("efi: Use 32-bit alignment for efi_guid_t") Reported-by: NNathan Chancellor <nathan@kernel.org> Reviewed-by: NNick Desaulniers <ndesaulniers@google.com> Reviewed-by: NNathan Chancellor <nathan@kernel.org> Tested-by: NNathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1327Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
-
由 Wong, Vee Khee 提交于
Intel mGbE variant implemented in EHL and TGL can be set to select different clock frequency based on GPO bits in MAC_GPIO_STATUS register. We introduce a new "void (*ptp_clk_freq_config)(void *priv)" in platform data so that if a platform is required to configure the frequency of clock source, in this case Intel mGBE does, the platform-specific configuration of the PTP clock setting is done when stmmac_ptp_register() is called. Signed-off-by: NWong, Vee Khee <vee.khee.wong@intel.com> Signed-off-by: NVoon Weifeng <weifeng.voon@intel.com> Co-developed-by: NOng Boon Leong <boon.leong.ong@intel.com> Signed-off-by: NOng Boon Leong <boon.leong.ong@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Antoine Tenart 提交于
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in net_device. That will simplify a lot the code removing the need for lots of if/else conditionals as the correct map will be available using its offset in the array. This should not modify the xps maps behaviour in any way. Suggested-by: NAlexander Duyck <alexander.duyck@gmail.com> Signed-off-by: NAntoine Tenart <atenart@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Antoine Tenart 提交于
Embed nr_ids (the number of cpu for the xps cpus map, and the number of rxqs for the xps cpus map) in dev_maps. That will help not accessing out of bound memory if those values change after dev_maps was allocated. Suggested-by: NAlexander Duyck <alexander.duyck@gmail.com> Signed-off-by: NAntoine Tenart <atenart@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-