- 05 7月, 2019 26 次提交
-
-
由 Alex Williamson 提交于
commit ddefc033eecf23f1e8b81d0663c5db965adf5516 upstream The commit referenced below introduced device locking around save and restore of state for each device during a PCI bus "try" reset, making it decidely non-"try" and prone to deadlock in the event that a device is already locked. Restore __pci_reset_bus() and __pci_reset_slot() to their advertised locking semantics by pushing the save and restore functions into the branch where the entire tree is already locked. Extend the helper function names with "_locked" and update the comment to reflect this calling requirement. Fixes: b014e96d ("PCI: Protect pci_error_handlers->reset_notify() usage with device_lock()") Signed-off-by: NAlex Williamson <alex.williamson@redhat.com> Signed-off-by: NZhiyuan Hou <zhiyuan2048@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 kbuild test robot 提交于
lkp-build bot reported the following link error with ipv6 disabled: ld: net/hookers/hookers.o:(.data+0x40): undefined reference to `ipv6_specific' ld: net/hookers/hookers.o:(.data+0x78): undefined reference to `ipv6_mapped' ld: net/hookers/hookers.o:(.data+0xe8): undefined reference to `inet6_stream_ops' Fixed this issue by adding IS_ENABLED(CONFIG_IPV6) check. Reported-by: Nkbuild test robot <lkp@intel.com> Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 kbuild test robot 提交于
Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree") Signed-off-by: Nkbuild test robot <lkp@intel.com> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Caspar Zhang 提交于
read/write_cr0() are used in net/hookers.c, but they are only available on x86 platform. Adding a depend-on fields in Kconfig to disable this feature in other platforms. Reported-by: Nkbuild test robot <lkp@intel.com> Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Joseph Qi 提交于
Wrap cgroup writeback v1 logic to prevent build errors without CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK. Reported-by: Nkbuild test robot <lkp@intel.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Jiufei Xue 提交于
So far writeback control is supported for cgroup v1 interface. However it also has some restrictions, so introduce a new kernel boot parameter to control the behavior which is disabled by default. Users can enable the writeback control for cgroup v1 with the command line "cgwb_v1". Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 luanshi 提交于
There might have tons of files queued in the writeback, awaiting for writing back. Unfortunately, the writeback's cgroup has been dead. In this case, we reassociate the inode with another writeback, but we possibly can't because the writeback associated with the dead cgroup is the only valid one. In this case, the new writeback is allocated, initialized and associated with the inode in the non-stopping fashion until all data resident in the inode's page cache are flushed to disk. It causes unnecessary high system load. This fixes the issue by enforce moving the inode to root cgroup when the previous binding cgroup becomes dead. With it, no more unnecessary writebacks are created, populated and the system load decreased by about 6x in the test case we carried out: Without the patch: 30% system load With the patch: 5% system load Signed-off-by: Nluanshi <zhangliguang@linux.alibaba.com> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jiufei Xue 提交于
We have gotten a WARNNING when releasing blkcg_css: [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0 [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200) ...... [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A 10/11/2016 [332489.685061] Workqueue: cgroup_destroy css_release_work_fn [332489.685654] ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78 0000000000000000 [332489.686269] ffffc9001d92bd68 ffffffff81088f8b 0000003800000000 ffff883e6b94d4a0 [332489.686867] ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400 ffff883e6b94d4a0 [332489.687479] Call Trace: [332489.688078] [<ffffffff81380042>] dump_stack+0x63/0x81 [332489.688681] [<ffffffff81088f8b>] __warn+0xcb/0xf0 [332489.689276] [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80 [332489.689877] [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0 [332489.690481] [<ffffffff81125552>] css_release_work_fn+0x42/0x140 [332489.691090] [<ffffffff810a2db9>] process_one_work+0x189/0x420 [332489.691693] [<ffffffff810a309e>] worker_thread+0x4e/0x4b0 [332489.692293] [<ffffffff810a3050>] ? process_one_work+0x420/0x420 [332489.692905] [<ffffffff810a9616>] kthread+0xe6/0x100 [332489.693504] [<ffffffff810a9530>] ? kthread_park+0x60/0x60 [332489.694099] [<ffffffff817184e1>] ret_from_fork+0x41/0x50 [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]--- ...... This is caused by calling css_get after the css is killed by another thread described below: Thread 1 Thread 2 cgroup_rmdir -> kill_css -> percpu_ref_kill_and_confirm -> css_killed_ref_fn css_killed_work_fn -> css_put -> css_release wb_get_create -> find_blkcg_css -> css_get -> css_put -> css_release (double free) -> css_release_workfn -> css_free_work_fn -> blkcg_css_free When doublefree happened, it may free the memory still used by other threads and cause a kernel panic. Fix this by using css_tryget_online in find_blkcg_css while will return false if the css is killed. Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jiufei Xue 提交于
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jiufei Xue 提交于
Here we add a global radix tree to link memcg and blkcg that the user attach the tasks to when using cgroup v1, which is used for writeback cgroup. Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 George Zhang 提交于
LVS fullnat will replace network traffic's source ip with its local ip, and thus the backend servers cannot obtain the real client ip. To solve this, LVS has introduced the tcp option address (TOA) to store the essential ip address information in the last tcp ack packet of the 3-way handshake, and the backend servers need to retrieve it from the packet header. In this patch, we have introduced the sk_toa_data member in the sock structure to hold the TOA information. There used to be an in-tree module for TOA managing, whereas it has now been maintained as an standalone module. In this case, the toa module should register its hook function(s) using the provided interfaces in the hookers module. TOA in sock structure: __be32 sk_toa_data[16]; The hookers module only provides the sk_toa_data placeholder, and the toa module can use this variable through the layout it needs. Hook interfaces: The hookers module replaces the kernel's syn_recv_sock and getname handler with a stub that chains the toa module's hook function(s) to the original handling function. The hookers module allows hook functions to be installed and uninstalled in any order. toa module: The external toa module will be provided in separate RPM package. [xuyu@linux.alibaba.com: amend commit log] Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com> Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Changpeng Liu 提交于
commit 1f23816b8eb8fdc39990abe166c10a18c16f6b21 upstream. In commit 88c85538, "virtio-blk: add discard and write zeroes features to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio block specification has been extended to add VIRTIO_BLK_T_DISCARD and VIRTIO_BLK_T_WRITE_ZEROES commands. This patch enables support for discard and write zeroes in the virtio-blk driver when the device advertises the corresponding features, VIRTIO_BLK_F_DISCARD and VIRTIO_BLK_F_WRITE_ZEROES. Signed-off-by: NChangpeng Liu <changpeng.liu@intel.com> Signed-off-by: NDaniel Verkamp <dverkamp@chromium.org> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jiufei Xue 提交于
Unstable tsc will trigger clocksource watchdog and disable itself, as a result other clocksource will be elected as the current clocksource which will result in performace issue on our servers. RHEL7 also disabled this feature for some issues, see changelog: [x86] disable clocksource watchdog (Prarit Bhargava) [914709] Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jiufei Xue 提交于
This reverts commit 76d3b851. The returned value for check_tsc_warp() is useless now, remove it. Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jiufei Xue 提交于
This reverts commit cc4db268. When we do hot-add and enable vCPU, the time inside the VM jumps and then VM stucks. The dmesg shows like this: [ 48.402948] CPU2 has been hot-added [ 48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2 [ 48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock [ 48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp. Adjust: 139318776350 [ 102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large: [ 102.060874] clocksource: 'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff [ 102.060874] clocksource: 'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff [ 102.060874] tsc: Marking TSC unstable due to clocksource watchdog [ 102.070188] KVM setup async PF for cpu 2 [ 102.071461] kvm-stealtime: cpu 2, msr 13ba95000 [ 102.074530] Will online and init hotplugged CPU: 2 This is because the TSC for the newly added VCPU is initialized to 0 while others are ahead. Guest will do the TSC ADJUST compensate and cause the time jumps. Commit bd8fab39("KVM: x86: fix maintaining of kvm_clock stability on guest CPU hotplug") can fix this problem. However, the host kernel version may be older, so do not ajust TSC if sync test fails, just mark it unstable. Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Joseph Qi 提交于
ECI may have an use case that configuring each device mapper disk throttling policy just under root blkio cgroup, but actually using them in different containers. Since hierarchical throttling is now only supported on cgroup v2 and ECI uses cgroup v1, so we have to enable hierarchical throttling on cgroup v1. This is ported from redhat 7u, and a year ago Jiufei already ported it to alikernel 4.9 as well. So I think this change should be acceptable. Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eryu Guan 提交于
Prior to xdragon platform 20181230 release (e.g. 0930 release), vring_use_dma_api() is required to return 'true' unconditionally. Introduce a new kernel boot parameter called "vring_force_dma_api" to control the behavior, boot xdragon host with "vring_force_dma_api" command line to make ENI hotplug work, so that normal ECS hosts keep the original behavior. Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: NEryu Guan <eguan@linux.alibaba.com>
-
由 Arjan van de Ven 提交于
Cherry-pick from clear-linux patches: https://github.com/clearlinux-pkgs/linux-kvm/0104-give-rdrand-some-credit.patch try to credit rdrand/rdseed with some entropy In VMs but even modern hardware, we're super starved for entropy, and while we can and do wear a tin foil hat, it's very hard to argue that rdrand and rdtsc add zero entropy. Signed-off-by: NArjan van de Ven <arjan@linux.intel.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Julio Montes 提交于
Cherry-pick from kata-container patches: https://github.com/kata-containers/packaging/tree/master/kernel/patches/0001-NO-UPSTREAM-9P-always-use-cached-inode-to-fill-in-v9.patch So that if in cache=none mode, we don't have to lookup server that might not support open-unlink-fstat operation. fixes https://github.com/01org/cc-oci-runtime/issues/47 fixes https://github.com/01org/cc-oci-runtime/issues/1062Signed-off-by: NJulio Montes <julio.montes@intel.com> Signed-off-by: NPeng Tao <bergwolf@gmail.com> Signed-off-by: NEryu Guan <eguan@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Arjan van de Ven 提交于
Cherry-pick from kata-container patches: https://github.com/kata-containers/packaging/tree/master/kernel/patches/0002-Compile-in-evged-always.patch We need evged for NEMU (and in general for hw reduced) The config option cannot be set normally since it breaks all regular systems, and hardware reduced is really a runtime choice. Signed-off-by: NArjan van de Ven <arjan@linux.intel.com> Signed-off-by: NEryu Guan <eguan@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream. Add new code to count canceled pending cluster reservations on bigalloc file systems and to reduce the cluster reservation count on all file systems using delayed allocation. This replaces old code in ext4_da_page_release_reservations that was incorrect. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream. Modify ext4_ext_remove_space() and the code it calls to correct the reserved cluster count for pending reservations (delayed allocated clusters shared with allocated blocks) when a block range is removed from the extent tree. Pending reservations may be found for the clusters at the ends of written or unwritten extents when a block range is removed. If a physical cluster at the end of an extent is freed, it's necessary to increment the reserved cluster count to maintain correct accounting if the corresponding logical cluster is shared with at least one delayed and unwritten extent as found in the extents status tree. Add a new function, ext4_rereserve_cluster(), to reapply a reservation on a delayed allocated cluster sharing blocks with a freed allocated cluster. To avoid ENOSPC on reservation, a flag is applied to ext4_free_blocks() to briefly defer updating the freeclusters counter when an allocated cluster is freed. This prevents another thread from allocating the freed block before the reservation can be reapplied. Redefine the partial cluster object as a struct to carry more state information and to clarify the code using it. Adjust the conditional code structure in ext4_ext_remove_space to reduce the indentation level in the main body of the code to improve readability. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream. Ext4 does not always reduce the reserved cluster count by the number of clusters allocated when mapping a delayed extent. It sometimes adds back one or more clusters after allocation if delalloc blocks adjacent to the range allocated by ext4_ext_map_blocks() share the clusters newly allocated for that range. However, this overcounts the number of clusters needed to satisfy future mapping requests (holding one or more reservations for clusters that have already been allocated) and premature ENOSPC and quota failures, etc., result. Ext4 also does not reduce the reserved cluster count when allocating clusters for non-delayed allocated writes that have previously been reserved for delayed writes. This also results in overcounts. To make it possible to handle reserved cluster accounting for fallocated regions in the same manner as used for other non-delayed writes, do the reserved cluster accounting for them at the time of allocation. In the current code, this is only done later when a delayed extent sharing the fallocated region is finally mapped. Address comment correcting handling of unsigned long long constant from Jan Kara's review of RFC version of this patch. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream. The code in ext4_da_map_blocks sometimes reserves space for more delayed allocated clusters than it should, resulting in premature ENOSPC, exceeded quota, and inaccurate free space reporting. Fix this by checking for written and unwritten blocks shared in the same cluster with the newly delayed allocated block. A cluster reservation should not be made for a cluster for which physical space has already been allocated. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream. Add new pending reservation mechanism to help manage reserved cluster accounting. Its primary function is to avoid the need to read extents from the disk when invalidating pages as a result of a truncate, punch hole, or collapse range operation. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
由 Eric Whitney 提交于
commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream. Ext4 contains a few functions that are used to search for delayed extents or blocks in the extents status tree. Rather than duplicate code to add new functions to search for extents with different status values, such as written or a combination of delayed and unwritten, generalize the existing code to search for caller-specified extents status values. Also, move this code into extents_status.c where it is better associated with the data structures it operates upon, and where it can be more readily used to implement new extents status tree functions that might want a broader scope for i_es_lock. Three missing static specifiers in RFC version of patch reported and fixed by Fengguang Wu <fengguang.wu@intel.com>. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
- 03 7月, 2019 14 次提交
-
-
由 Greg Kroah-Hartman 提交于
-
由 Jean-Philippe Brucker 提交于
commit c5e2edeb01ae9ffbdde95bdcdb6d3614ba1eb195 upstream. GCC 8.1.0 reports that the ldadd instruction encoding, recently added to insn.c, doesn't match the mask and couldn't possibly be identified: linux/arch/arm64/include/asm/insn.h: In function 'aarch64_insn_is_ldadd': linux/arch/arm64/include/asm/insn.h:280:257: warning: bitwise comparison always evaluates to false [-Wtautological-compare] Bits [31:30] normally encode the size of the instruction (1 to 8 bytes) and the current instruction value only encodes the 4- and 8-byte variants. At the moment only the BPF JIT needs this instruction, and doesn't require the 1- and 2-byte variants, but to be consistent with our other ldr and str instruction encodings, clear the size field in the insn value. Fixes: 34b8ab091f9ef57a ("bpf, arm64: use more scalable stadd over ldxr / stxr loop in xadd") Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Reported-by: NKuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: NYoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: NJean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Thinh Nguyen 提交于
commit c7152763f02e05567da27462b2277a554e507c89 upstream. Currently req->num_trbs is not reset after the TRBs are skipped and processed from the cancelled list. The gadget driver may reuse the request with an invalid req->num_trbs, and DWC3 will incorrectly skip trbs. To fix this, simply reset req->num_trbs to 0 after skipping through all of them. Fixes: c3acd5901414 ("usb: dwc3: gadget: use num_trbs when skipping TRBs on ->dequeue()") Signed-off-by: NThinh Nguyen <thinhn@synopsys.com> Signed-off-by: NFelipe Balbi <felipe.balbi@linux.intel.com> Cc: Sasha Levin <sashal@kernel.org> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Xin Long 提交于
commit c3bcde026684c62d7a2b6f626dc7cf763833875c upstream. udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device to count packets on dev->tstats, a perpcu variable. However, TIPC is using udp tunnel with no tunnel device, and pass the lower dev, like veth device that only initializes dev->lstats(a perpcu variable) when creating it. Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each pointer points to a bigger struct than lstats, so when tstats->tx_bytes is increased, other percpu variable's members could be overwritten. syzbot has reported quite a few crashes due to fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten, call traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... The issue exists since tunnel stats update is moved to iptunnel_xmit by Commit 039f5062 ("ip_tunnel: Move stats update to iptunnel_xmit()"), and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb so that the packets counting won't happen on dev->tstats. Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com Fixes: 039f5062 ("ip_tunnel: Move stats update to iptunnel_xmit()") Signed-off-by: NXin Long <lucien.xin@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jason Gunthorpe 提交于
commit 641114d2af312d39ca9bbc2369d18a5823da51c6 upstream. gcc 9 now does allocation size tracking and thinks that passing the member of a union and then accessing beyond that member's bounds is an overflow. Instead of using the union member, use the entire union with a cast to get to the sockaddr. gcc will now know that the memory extends the full size of the union. Signed-off-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Will Deacon 提交于
commit 427503519739e779c0db8afe876c1b33f3ac60ae upstream. The architecture implementations of 'arch_futex_atomic_op_inuser()' and 'futex_atomic_cmpxchg_inatomic()' are permitted to return only -EFAULT, -EAGAIN or -ENOSYS in the case of failure. Update the comments in the asm-generic/ implementation and also a stray reference in the robust futex documentation. Signed-off-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Daniel Borkmann 提交于
commit 34b8ab091f9ef57a2bb3c8c8359a0a03a8abf2f9 upstream. Since ARMv8.1 supplement introduced LSE atomic instructions back in 2016, lets add support for STADD and use that in favor of LDXR / STXR loop for the XADD mapping if available. STADD is encoded as an alias for LDADD with XZR as the destination register, therefore add LDADD to the instruction encoder along with STADD as special case and use it in the JIT for CPUs that advertise LSE atomics in CPUID register. If immediate offset in the BPF XADD insn is 0, then use dst register directly instead of temporary one. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NJean-Philippe Brucker <jean-philippe.brucker@arm.com> Acked-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Will Deacon 提交于
commit 8e4e0ac02b449297b86498ac24db5786ddd9f647 upstream. Returning an error code from futex_atomic_cmpxchg_inatomic() indicates that the caller should not make any use of *uval, and should instead act upon on the value of the error code. Although this is implemented correctly in our futex code, we needlessly copy uninitialised stack to *uval in the error case, which can easily be avoided. Signed-off-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Martin KaFai Lau 提交于
commit 4ac30c4b3659efac031818c418beb51e630d512d upstream. __udp6_lib_err() may be called when handling icmpv6 message. For example, the icmpv6 toobig(type=2). __udp6_lib_lookup() is then called which may call reuseport_select_sock(). reuseport_select_sock() will call into a bpf_prog (if there is one). reuseport_select_sock() is expecting the skb->data pointing to the transport header (udphdr in this case). For example, run_bpf_filter() is pulling the transport header. However, in the __udp6_lib_err() path, the skb->data is pointing to the ipv6hdr instead of the udphdr. One option is to pull and push the ipv6hdr in __udp6_lib_err(). Instead of doing this, this patch follows how the original commit 538950a1 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") was done in IPv4, which has passed a NULL skb pointer to reuseport_select_sock(). Fixes: 538950a1 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Cc: Craig Gallek <kraig@google.com> Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NSong Liu <songliubraving@fb.com> Acked-by: NCraig Gallek <kraig@google.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Martin KaFai Lau 提交于
commit 257a525fe2e49584842c504a92c27097407f778f upstream. When the commit a6024562 ("udp: Add GRO functions to UDP socket") added udp[46]_lib_lookup_skb to the udp_gro code path, it broke the reuseport_select_sock() assumption that skb->data is pointing to the transport header. This patch follows an earlier __udp6_lib_err() fix by passing a NULL skb to avoid calling the reuseport's bpf_prog. Fixes: a6024562 ("udp: Add GRO functions to UDP socket") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Daniel Borkmann 提交于
commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream. Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAndrey Ignatov <rdna@fb.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NMartynas Pumputis <m@lambda.lt> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Matt Mullins 提交于
commit 9594dc3c7e71b9f52bee1d7852eb3d4e3aea9e99 upstream. BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as they do not increment bpf_prog_active while executing. This enables three levels of nesting, to support - a kprobe or raw tp or perf event, - another one of the above that irq context happens to call, and - another one in nmi context (at most one of which may be a kprobe or perf event). Fixes: 20b9d7ac ("bpf: avoid excessive stack usage for perf_sample_data") Signed-off-by: NMatt Mullins <mmullins@fb.com> Acked-by: NAndrii Nakryiko <andriin@fb.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jonathan Lemon 提交于
commit da2577fdd0932ea4eefe73903f1130ee366767d2 upstream. If the leftmost parent node of the tree has does not have a child on the left side, then trie_get_next_key (and bpftool map dump) will not look at the child on the right. This leads to the traversal missing elements. Lookup is not affected. Update selftest to handle this case. Reproducer: bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \ value 1 entries 256 name test_lpm flags 1 bpftool map update pinned /sys/fs/bpf/lpm key 8 0 0 0 0 0 value 1 bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0 0 128 value 2 bpftool map dump pinned /sys/fs/bpf/lpm Returns only 1 element. (2 expected) Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE") Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Martynas Pumputis 提交于
commit b1d6c15b9d824a58c5415673f374fac19e8eccdf upstream. Previously, the BPF_FIB_LOOKUP_{DIRECT,OUTPUT} flags in the BPF UAPI were defined with the help of BIT macro. This had the following issues: - In order to use any of the flags, a user was required to depend on <linux/bits.h>. - No other flag in bpf.h uses the macro, so it seems that an unwritten convention is to use (1 << (nr)) to define BPF-related flags. Fixes: 87f5fc7e ("bpf: Provide helper to do forwarding lookups in kernel FIB table") Signed-off-by: NMartynas Pumputis <m@lambda.lt> Acked-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-