- 10 May, 2023 13 commits
-
-
Committed by Wei Chen
mainline inclusion
from mainline-v6.3-rc4
commit 92fbb6d1
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6XHPL
CVE: CVE-2023-2194

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=92fbb6d1296f81f41f65effd7f5f8c0f74943d15

--------------------------------

The data->block[0] variable comes from user space and is a number in the range 0-255. Without a proper check, the value may be large enough to cause an out-of-bounds access when memcpy is performed in slimpro_i2c_blkwr.

Fix this bug by checking the value of writelen.

Fixes: f6505fba ("i2c: add SLIMpro I2C device driver on APM X-Gene platform")
Signed-off-by: Wei Chen <harperchen1110@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Andi Shyti <andi.shyti@kernel.org>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Reviewed-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
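The length check described above can be illustrated with a small user-space sketch. This is not the driver code: `struct xfer_ctx`, `blk_write()` and the buffer size `DMA_BUF_LEN` are hypothetical stand-ins chosen for the example; only the idea of validating `writelen` before the `memcpy` comes from the commit message.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer size for this sketch; the real driver defines its own. */
#define DMA_BUF_LEN 32

/* Hypothetical stand-in for the driver context that owns the DMA buffer. */
struct xfer_ctx {
	unsigned char dma_buffer[DMA_BUF_LEN];
};

/*
 * Copy a user-supplied block into the fixed-size buffer.
 * The length originates from user space (0-255), so it is validated
 * against the destination size before memcpy() runs.
 */
static int blk_write(struct xfer_ctx *ctx, const unsigned char *data,
		     size_t writelen)
{
	if (writelen > sizeof(ctx->dma_buffer))
		return -EINVAL;	/* reject instead of overflowing */

	memcpy(ctx->dma_buffer, data, writelen);
	return 0;
}
```

Rejecting the request, rather than silently truncating it, keeps the caller informed that the transfer did not happen.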
-
Committed by Baokun Li
maillist inclusion category: bugfix bugzilla: 188499, https://gitee.com/openeuler/kernel/issues/I6TNVT CVE: NA Reference: https://patchwork.ozlabs.org/project/linux-ext4/patch/20230412124126.2286716-2-libaokun1@huawei.com/ ---------------------------------------- In our fault injection test, we create an ext4 file, migrate it to non-extent based file, then punch a hole and finally trigger a WARN_ON in the ext4_da_update_reserve_space(): EXT4-fs warning (device sda): ext4_da_update_reserve_space:369: ino 14, used 11 with only 10 reserved data blocks When writing back a non-extent based file, if we enable delalloc, the number of reserved blocks will be subtracted from the number of blocks mapped by ext4_ind_map_blocks(), and the extent status tree will be updated. We update the extent status tree by first removing the old extent_status and then inserting the new extent_status. If the block range we remove happens to be in an extent, then we need to allocate another extent_status with ext4_es_alloc_extent(). use old to remove to add new |----------|------------|------------| old extent_status The problem is that the allocation of a new extent_status failed due to a fault injection, and __es_shrink() did not get free memory, resulting in a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM, we map to the same extent again, and the number of reserved blocks is again subtracted from the number of blocks in that extent. Since the blocks in the same extent are subtracted twice, we end up triggering WARN_ON at ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks. For non-extent based file, we update the number of reserved blocks after ext4_ind_map_blocks() is executed, which causes a problem that when we call ext4_ind_map_blocks() to create a block, it doesn't always create a block, but we always reduce the number of reserved blocks. 
So we move the logic for updating reserved blocks into ext4_ind_map_blocks() to ensure that the number of reserved blocks is updated only after we have actually succeeded in allocating new blocks.

Fixes: 5f634d06 ("ext4: Fix quota accounting error with fallocate")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Oliver Hartkopp
stable inclusion
from stable-v5.10.159
commit c42221efb1159d6a3c89e96685ee38acdce86b6f
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6WUDS
CVE: CVE-2023-2166

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c42221efb1159d6a3c89e96685ee38acdce86b6f

--------------------------------

commit 0acc4423 upstream.

Analogous to commit 8aa59e35 ("can: af_can: fix NULL pointer dereference in can_rx_register()"), we need to check for a missing initialization of ml_priv in the receive path of CAN frames.

Since commit 4e096a18 ("net: introduce CAN specific pointer in the struct net_device"), checking that dev->type is ARPHRD_CAN is no longer sufficient, because bonding or tun netdevices claim to be CAN devices but do not initialize ml_priv accordingly.

Fixes: 4e096a18 ("net: introduce CAN specific pointer in the struct net_device")
Reported-by: syzbot+2d7f58292cb5b29eb5ad@syzkaller.appspotmail.com
Reported-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20221206201259.3028-1-socketcan@hartkopp.net
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
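The point of the fix, that a matching device type alone does not prove the CAN private data exists, can be sketched in plain C. The structure and function names below (`struct netdev_sketch`, `can_rx_usable()`) are hypothetical and only model the check; the value 280 is the `ARPHRD_CAN` constant from the Linux ARP hardware type list.

```c
#include <stdbool.h>
#include <stddef.h>

/* ARPHRD_CAN as defined in the Linux uapi headers. */
#define SKETCH_ARPHRD_CAN 280

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct netdev_sketch {
	unsigned short type;	/* hardware type claimed by the device */
	void *ml_priv;		/* CAN mid-layer private data, may be NULL */
};

/*
 * A frame may only be handed to the CAN receive path when the device
 * both claims to be a CAN device and has initialized ml_priv.
 * Bonding or tun devices can pass the first test but fail the second.
 */
static bool can_rx_usable(const struct netdev_sketch *dev)
{
	return dev->type == SKETCH_ARPHRD_CAN && dev->ml_priv != NULL;
}
```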
-
Committed by Patrisious Haddad
mainline inclusion
from mainline-v6.3-rc1
commit 8d037973
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6X49E
CVE: CVE-2023-2176

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8d037973d48c026224ab285e6a06985ccac6f7bf

---------------------------

Refactor the rdma_bind_addr function so that it doesn't require the cma destination address to be changed before calling it.

It now updates the destination address internally, only when it is really needed and after passing all the required checks.

This in turn results in cleaner and more sensible call and error handling flows for the functions that call it directly or indirectly.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/3d0e9a2fd62bc10ba02fed1c7c48a48638952320.1672819273.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
(cherry picked from commit 8d037973)
Signed-off-by: Liu Jian <liujian56@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Jason Gunthorpe
mainline inclusion
from mainline-v5.15-rc4
commit 305d568b
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6X49E
CVE: CVE-2023-2176

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=305d568b72f17f674155a2a8275f865f207b3808

---------------------------

The FSM can run in a circle allowing rdma_resolve_ip() to be called twice on the same id_priv. While this cannot happen without going through the work, it violates the invariant that the same address resolution background request cannot be active twice.

       CPU 1                                  CPU 2

rdma_resolve_addr():
  RDMA_CM_IDLE -> RDMA_CM_ADDR_QUERY
  rdma_resolve_ip(addr_handler)  #1

                         process_one_req(): for #1
                          addr_handler():
                            RDMA_CM_ADDR_QUERY -> RDMA_CM_ADDR_BOUND
                            mutex_unlock(&id_priv->handler_mutex);
                            [.. handler still running ..]

rdma_resolve_addr():
  RDMA_CM_ADDR_BOUND -> RDMA_CM_ADDR_QUERY
  rdma_resolve_ip(addr_handler)
    !! two requests are now on the req_list

rdma_destroy_id():
 destroy_id_handler_unlock():
  _destroy_id():
   cma_cancel_operation():
    rdma_addr_cancel()

                          // process_one_req() self removes it
                          spin_lock_bh(&lock);
                           cancel_delayed_work(&req->work);
                           if (!list_empty(&req->list)) == true

      ! rdma_addr_cancel() returns after process_one_req #1 is done

   kfree(id_priv)

                         process_one_req(): for #2
                          addr_handler():
                            mutex_lock(&id_priv->handler_mutex);
                            !! Use after free on id_priv

rdma_addr_cancel() expects there to be one req on the list and only cancels the first one. The self-removal behavior of the work only happens after the handler has returned. This yields a situation where the req_list can have two reqs for the same "handle" but rdma_addr_cancel() only cancels the first one.

The second req remains active beyond rdma_destroy_id() and will use-after-free id_priv once it inevitably triggers.

Fix this by remembering if the id_priv has called rdma_resolve_ip() and always cancel before calling it again. This ensures the req_list never gets more than one item in it and doesn't cost anything in the normal flow that never uses this strange error path.

Link: https://lore.kernel.org/r/0-v1-3bc675b8006d+22-syz_cancel_uaf_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: e51060f0 ("IB: IP address based RDMA connection manager")
Reported-by: syzbot+dc3dfba010d7671e05f5@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
(cherry picked from commit 305d568b)
Signed-off-by: Liu Jian <liujian56@huawei.com>

Conflicts:
	drivers/infiniband/core/cma_priv.h

Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
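The "remember and cancel before resubmitting" rule can be modelled in a few lines of C. This is a simplified sketch, not the RDMA code: `struct id_sketch`, `addr_cancel()` and `resolve_addr()` are hypothetical names, and the request list is reduced to a counter so that the invariant (never more than one outstanding request per id) is easy to see.

```c
#include <stdbool.h>

/* Hypothetical model of an id with its outstanding resolution requests. */
struct id_sketch {
	int pending_reqs;	/* requests currently queued for this id */
	bool used_resolve;	/* has a request ever been submitted? */
};

/* Model of cancelling the queued request; only one is ever removed. */
static void addr_cancel(struct id_sketch *id)
{
	if (id->pending_reqs > 0)
		id->pending_reqs--;
}

/*
 * Submit a resolution request. If this id has submitted one before,
 * cancel first, so the queue never holds two requests for the same id.
 */
static void resolve_addr(struct id_sketch *id)
{
	if (id->used_resolve)
		addr_cancel(id);
	else
		id->used_resolve = true;

	id->pending_reqs++;	/* stands in for queueing the new request */
}
```

Because at most one request is ever pending, a single cancel at destroy time is enough to guarantee nothing fires after the id is freed.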
-
Committed by Arnd Bergmann
mainline inclusion
from mainline-v6.0-rc1~14
commit b04e75a4
category: bugfix
bugzilla: 188707, https://gitee.com/src-openeuler/kernel/issues/I6VK2F
CVE: CVE-2023-2007

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b04e75a4a8a81887386a0d2dbf605a48e779d2a0

----------------------------------------

The dpt_i2o driver was fixed to stop using virt_to_bus() in 2008, but it still has a stale reference in an error handling code path that could never work.

I submitted a patch to fix this reference earlier, but Hannes Reinecke suggested that removing the driver may be just as good here.

The i2o driver layer was removed in 2015 with commit 4a72a7af ("staging: remove i2o subsystem"), but the even older dpt_i2o scsi driver stayed around.

The last non-cleanup patches I could find were from Miquel van Smoorenburg and Mark Salyzyn back in 2008; they might know if there is any chance of the hardware still being used anywhere.

Link: https://lore.kernel.org/linux-scsi/CAK8P3a1XfwkTOV7qOs1fTxf4vthNBRXKNu8A5V7TWnHT081NGA@mail.gmail.com/T/
Link: https://lore.kernel.org/r/20220624155226.2889613-3-arnd@kernel.org
Cc: Miquel van Smoorenburg <mikevs@xs4all.net>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Baokun Li
mainline inclusion from mainline-v6.3-rc8 commit 1ba1199e category: bugfix bugzilla: 188601, https://gitee.com/openeuler/kernel/issues/I6TNTC CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ba1199ec5747f475538c0d25a32804e5ba1dfde -------------------------------- KASAN report null-ptr-deref: ================================================================== BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 Write of size 8 at addr 0000000000000000 by task sync/943 CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 Call Trace: <TASK> dump_stack_lvl+0x7f/0xc0 print_report+0x2ba/0x340 kasan_report+0xc4/0x120 kasan_check_range+0x1b7/0x2e0 __kasan_check_write+0x24/0x40 bdi_split_work_to_wbs+0x5c5/0x7b0 sync_inodes_sb+0x195/0x630 sync_inodes_one_sb+0x3a/0x50 iterate_supers+0x106/0x1b0 ksys_sync+0x98/0x160 [...] ================================================================== The race that causes the above issue is as follows: cpu1 cpu2 -------------------------|------------------------- inode_switch_wbs INIT_WORK(&isw->work, inode_switch_wbs_work_fn) queue_rcu_work(isw_wq, &isw->work) // queue_work async inode_switch_wbs_work_fn wb_put_many(old_wb, nr_switched) percpu_ref_put_many ref->data->release(ref) cgwb_release queue_work(cgwb_release_wq, &wb->release_work) // queue_work async &wb->release_work cgwb_release_workfn ksys_sync iterate_supers sync_inodes_one_sb sync_inodes_sb bdi_split_work_to_wbs kmalloc(sizeof(*work), GFP_ATOMIC) // alloc memory failed percpu_ref_exit ref->data = NULL kfree(data) wb_get(wb) percpu_ref_get(&wb->refcnt) percpu_ref_get_many(ref, 1) atomic_long_add(nr, &ref->data->count) atomic64_add(i, v) // trigger null-ptr-deref bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. If the allocation of new work fails, the on-stack fallback will be used and the reference count of the current wb is increased afterwards. 
If cgroup writeback membership switches occur before getting the reference count and the current wb is released as old_wb, then calling wb_get() or wb_put() will trigger the null pointer dereference above.

This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger this issue. For scenarios called via sync_inodes_sb(), commit 7fc5854f ("writeback: synchronize sync(2) against cgroup writeback membership switches") originally reduced the possibility of the issue by adding wb_switch_rwsem, but v5.14-rc1 (see fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io. As a result, old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes easily reproducible again.

To solve this problem, percpu_ref_exit() is called under RCU protection to avoid a race between cgwb_release_workfn() and bdi_split_work_to_wbs(). Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), and skip the current wb if wb_tryget() fails because the wb has already been shut down.

Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
Fixes: b817525a ("writeback: bdi_writeback iteration must not skip dying ones")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
	mm/backing-dev.c

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
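The difference between an unconditional "get" and a "tryget" can be shown with C11 atomics. This is a generic user-space sketch of the pattern, not the percpu_ref implementation: `struct ref_sketch`, `ref_tryget()` and `count_live()` are hypothetical names, and a count of zero is taken to mean "already released".

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted object; count == 0 means released. */
struct ref_sketch {
	atomic_long count;
};

/*
 * Take a reference only if the object is still alive. Unlike a plain
 * increment, this never resurrects an object whose count reached zero.
 */
static bool ref_tryget(struct ref_sketch *r)
{
	long old = atomic_load(&r->count);

	while (old > 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Walk a set of objects and skip those that are already shut down. */
static size_t count_live(struct ref_sketch *objs, size_t n)
{
	size_t live = 0;

	for (size_t i = 0; i < n; i++) {
		if (!ref_tryget(&objs[i]))
			continue;	/* dying object: skip it */
		live++;
	}
	return live;
}
```

A caller that succeeds with `ref_tryget()` still owns a reference and must drop it later; the sketch omits the matching put for brevity.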
-
Committed by Liu Jian
mainline inclusion
from mainline-v6.3-rc2
commit d900f3d2
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d900f3d20cc3169ce42ec72acc850e662a4d4db2

---------------------------

When the buffer length of the recvmsg system call is 0, we got the following soft lockup problem:

watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [a.out:6149]
CPU: 3 PID: 6149 Comm: a.out Kdump: loaded Not tainted 6.2.0+ #30
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:remove_wait_queue+0xb/0xc0
Code: 5e 41 5f c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 48 89 fd 53 48 89 f3 4c 8d 6b 18 4c 8d 73 20
RSP: 0018:ffff88811b5978b8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88811a7d3780 RCX: ffffffffb7a4d768
RDX: dffffc0000000000 RSI: ffff88811b597908 RDI: ffff888115408040
RBP: 1ffff110236b2f1b R08: 0000000000000000 R09: ffff88811a7d37e7
R10: ffffed10234fa6fc R11: 0000000000000001 R12: ffff88811179b800
R13: 0000000000000001 R14: ffff88811a7d38a8 R15: ffff88811a7d37e0
FS: 00007f6fb5398740(0000) GS:ffff888237180000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 000000010b6ba002 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 tcp_msg_wait_data+0x279/0x2f0
 tcp_bpf_recvmsg_parser+0x3c6/0x490
 inet_recvmsg+0x280/0x290
 sock_recvmsg+0xfc/0x120
 ____sys_recvmsg+0x160/0x3d0
 ___sys_recvmsg+0xf0/0x180
 __sys_recvmsg+0xea/0x1a0
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

The logic in tcp_bpf_recvmsg_parser is as follows:

msg_bytes_ready:
	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
	if (!copied) {
		wait data;
		goto msg_bytes_ready;
	}

In this case, "copied" is always 0, so an infinite loop occurs.

According to the Linux system call man page, 0 should be returned in this case. Therefore, in tcp_bpf_recvmsg_parser(), if the length is 0, return directly. Also modify several other functions with the same problem.

Fixes: 1f5be6b3 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Fixes: 9825d866 ("af_unix: Implement unix_dgram_bpf_recvmsg()")
Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230303080946.1146638-1-liujian56@huawei.com
Signed-off-by: Liu Jian <liujian56@huawei.com>

Conflicts:
	net/ipv4/udp_bpf.c
	net/unix/unix_bpf.c

Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
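The retry loop and the zero-length early return can be modelled in user space. The sketch below is a simplification under stated assumptions: `copy_ready()` and `recv_sketch()` are hypothetical, the "wait for data" step is replaced by a bounded retry counter so that the example terminates, and `-1` stands in for whatever the real code returns on timeout.

```c
#include <stddef.h>

/* Hypothetical stand-in for the copy step: copies min(len, avail) bytes. */
static size_t copy_ready(size_t avail, size_t len)
{
	return len < avail ? len : avail;
}

/*
 * Model of a receive path that retries until some bytes were copied.
 * With len == 0 the copy step always yields 0, so without the early
 * return the loop could never make progress even when data is queued.
 */
static long recv_sketch(size_t avail, size_t len, int max_waits)
{
	if (len == 0)
		return 0;	/* nothing requested: return immediately */

	for (int waits = 0; waits <= max_waits; waits++) {
		size_t copied = copy_ready(avail, len);

		if (copied)
			return (long)copied;
		/* the real code would sleep here until data arrives */
	}
	return -1;		/* stand-in for a timeout in this sketch */
}
```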
-
Committed by John Fastabend
mainline inclusion
from mainline-v5.17-rc1
commit 218d747a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=218d747a4142f281a256687bb513a135c905867b

---------------------------

sock_map_link() is called to update a sockmap entry with a sk. But if the sock_map_init_proto() call fails, then we return an error to the map_update op against the sockmap. In the error path, though, we need to clean up the psock and dec the refcnt on any programs associated with the map, because we refcnt them early in the update process to ensure they are pinned for the psock. (This avoids a race where the user deletes programs while also updating the map with new socks.)

In the current code we do the prog refcnt dec explicitly by calling bpf_prog_put() when the program was found in the map. But after commit '38207a5e', in this error path we've already done the prog to psock assignment, so the programs have a reference from the psock as well. This then causes the psock teardown logic, invoked by sk_psock_put() in the error path, to similarly call bpf_prog_put on the programs there.

To be explicit, this logic does the prog->psock assignment:

  if (msg_*)
    psock_set_prog(...)

Then the error path under the out_progs label does a similar check and dec with:

  if (msg_*)
    bpf_prog_put(...)

And the teardown logic sk_psock_put() does ... psock_set_prog(msg_*, NULL) ... triggering another bpf_prog_put(...). Then KASAN gives us this splat, found by syzbot, because we've created an imbalance between bpf_prog_inc and bpf_prog_put by calling put twice on the program.

BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline]
BUG: KASAN: vmalloc-out-of-bounds in __bpf_prog_put kernel/bpf/syscall.c:1812 [inline] kernel/bpf/syscall.c:1829
BUG: KASAN: vmalloc-out-of-bounds in bpf_prog_put+0x8c/0x4f0 kernel/bpf/syscall.c:1829 kernel/bpf/syscall.c:1829
Read of size 8 at addr ffffc90000e76038 by task syz-executor020/3641

To fix this, clean up the error path so that it doesn't call bpf_prog_put() once progs are assigned; it then relies on the normal psock teardown logic to do the complete cleanup.

For completeness we also cover the case where sk_psock_init_strp() fails, but this is not expected because it indicates an incorrect socket type and should be caught earlier.

Fixes: 38207a5e ("bpf, sockmap: Attach map progs to psock early for feature probes")
Reported-by: syzbot+bb73e71cf4b8fd376a4f@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220104214645.290900-1-john.fastabend@gmail.com
(cherry picked from commit 218d747a)
Signed-off-by: Liu Jian <liujian56@huawei.com>

Conflicts:
	net/core/sock_map.c

Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
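The ownership rule behind this fix (once a reference is handed to the psock, only the psock teardown may drop it) can be modelled with a plain counter. The types and functions below (`struct prog_sketch`, `struct psock_sketch`, `link_sketch()`) are hypothetical and far simpler than the kernel code; they only show that the error path performs exactly one put.

```c
#include <stddef.h>

/* Hypothetical program object with a plain reference count. */
struct prog_sketch {
	int refcnt;
};

/* Hypothetical psock that holds at most one program reference. */
struct psock_sketch {
	struct prog_sketch *msg_parser;
};

static void prog_put(struct prog_sketch *p)
{
	p->refcnt--;
}

/* Teardown drops every reference the psock owns, exactly once. */
static void psock_teardown(struct psock_sketch *ps)
{
	if (ps->msg_parser) {
		prog_put(ps->msg_parser);
		ps->msg_parser = NULL;
	}
}

/*
 * Pin the program, hand the reference to the psock, then run an init
 * step that may fail. On failure, only the teardown releases the
 * reference; an additional explicit prog_put() here would be the
 * double put described in the commit message.
 */
static int link_sketch(struct psock_sketch *ps, struct prog_sketch *p,
		       int init_fails)
{
	p->refcnt++;		/* pin early */
	ps->msg_parser = p;	/* ownership moves to the psock */

	if (init_fails) {
		psock_teardown(ps);
		return -1;
	}
	return 0;
}
```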
-
Committed by John Fastabend
mainline inclusion
from mainline-v5.16-rc5
commit c0d95d33
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c0d95d3380ee099d735e08618c0d599e72f6c8b0

---------------------------

When a sock is added to a sock map we evaluate what proto op hooks need to be used. However, when the program is removed from the sock map we have not been evaluating if that changes the required program layout.

Before the patch listed in the 'fixes' tag this was not causing failures because the base program set handles all cases. Specifically, the case with a stream parser and the case without a stream parser are both handled. With the fix below we identified a race when running with a proto op that attempts to read skbs off both the stream parser and the skb->receive_queue. Namely, when the stream parser is empty, checking the skb->receive_queue from recvmsg at the precise moment when the parser is paused and the receive_queue is not empty could result in skipping the stream parser. This may break an RX policy depending on the parser to run.

The fix tag then loads a specific proto ops that resolved this race. But we missed removing that proto ops recv hook when the sock is removed from the sockmap. The result is that the stream parser is stopped, so no more skbs will be aggregated there, but the hook and BPF program continue to be attached to the psock. User space will then get an EBUSY when trying to read the socket because the recvmsg() handler is now waiting on a stopped stream parser.

To fix this, we rerun the proto ops init() function, which will look at the new set of progs attached to the psock and reset the proto ops hook to the correct handlers. And in the above case where we remove the sock from the sock map, the RX prog will no longer be listed, so the proto ops is removed.

Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211119181418.353932-3-john.fastabend@gmail.com
(cherry picked from commit c0d95d33)
Signed-off-by: Liu Jian <liujian56@huawei.com>

Conflicts:
	net/core/skmsg.c

Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by John Fastabend
mainline inclusion from mainline-v5.16-rc5 commit 38207a5e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=38207a5e81230d6ffbdd51e5fa5681be5116dcae --------------------------- When a TCP socket is added to a sock map we look at the programs attached to the map to determine what proto op hooks need to be changed. Before the patch in the 'fixes' tag there were only two categories -- the empty set of programs or a TX policy. In any case the base set handled the receive case. After the fix we have an optimized program for receive that closes a small, but possible, race on receive. This program is loaded only when the map the psock is being added to includes a RX policy. Otherwise, the race is not possible so we don't need to handle the race condition. In order for the call to sk_psock_init() to correctly evaluate the above conditions all progs need to be set in the psock before the call. However, in the current code this is not the case. We end up evaluating the requirements on the old prog state. If your psock is attached to multiple maps -- for example a tx map and rx map -- then the second update would pull in the correct maps. But, the other pattern with a single rx enabled map the correct receive hooks are not used. The result is the race fixed by the patch in the fixes tag below may still be seen in this case. To fix we simply set all psock->progs before doing the call into sock_map_init(). With this the init() call gets the full list of programs and chooses the correct proto ops on the first iteration instead of requiring the second update to pull them in. This fixes the race case when only a single map is used. 
Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211119181418.353932-2-john.fastabend@gmail.com
(cherry picked from commit 38207a5e)
Signed-off-by: Liu Jian <liujian56@huawei.com>

Conflicts:
	net/core/sock_map.c

Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by John Fastabend
mainline inclusion
from mainline-v5.17-rc1
commit 5b2c5540
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5b2c5540b8110eea0d67a78fb0ddb9654c58daeb

---------------------------

Applications can be confused slightly because we do not always return the same error code as expected, e.g. what the TCP stack normally returns. For example, on a sock err sk->sk_err, instead of returning the sock_error we return EAGAIN. This usually means the application will 'try again' instead of aborting immediately. Another example: when a shutdown event is received, we should immediately abort instead of waiting for data when the user provides a timeout.

These tend not to be fatal, and applications usually recover, but they introduce bogus errors to the user or unexpected latency. Before 'c5d2177a' we fell back to the TCP stack when no data was available, so we managed to catch many of the cases here, although with the extra latency cost of calling tcp_msg_wait_data() first.

To fix this, let's duplicate the error handling in the TCP stack into tcp_bpf so that we get the same error codes.

These were found in our CI tests that run applications against sockmap and do longer-lived testing, at least compared to test_sockmap, which does short-lived ping/pong tests, and in some of the test clusters we deploy.

It's non-trivial to do this in shorter-form CI tests that would be appropriate for BPF selftests, but we are looking into it so we can ensure this keeps working going forward. As a preview, one idea is to pull in the packetdrill testing, which catches some of this.

Fixes: c5d2177a ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com
(cherry picked from commit 5b2c5540)
Signed-off-by: Liu Jian <liujian56@huawei.com>

Conflicts:
	net/ipv4/tcp_bpf.c

Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by John Fastabend
mainline inclusion
from mainline-v5.16-rc1
commit c5d2177a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I65HYE

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c5d2177a72a1659554922728fc407f59950aa929

---------------------------

A socket in a sockmap may have different combinations of programs attached depending on configuration. There can be no programs, in which case the socket acts as a sink only. There can be a TX program, in which case a BPF program is attached to the sending side but no RX program is attached. There can be an RX program only, where sends have no BPF program attached but receives are hooked with BPF. And finally, both TX and RX programs may be attached. This gives us the permutations:

 None, Tx, Rx, and TxRx

To date most of our use cases have been the TX case, used as a fast datapath to directly copy between a local application and a userspace proxy, or Rx cases and TxRx applications that operate an in-kernel proxy. The traffic in the first case, where we hook applications into a userspace application, looks like this:

  AppA  redirect   AppB
   Tx <-----------> Rx
   |                |
   +                +
   TCP <--> lo <--> TCP

In this case all traffic from AppA (after 3whs) is copied into the AppB ingress queue and no traffic is ever on the TCP receive_queue.

In the second case the application never receives, except in some rare error cases, traffic on the actual user space socket. Instead the send happens in the kernel.

           AppProxy       socket pool
       sk0 ------------->{sk1,sk2, skn}
        ^                      |
        |                      |
        |                      v
       ingress              lb egress
       TCP                  TCP

Here, because traffic is never read off the socket with userspace recv() APIs, there is only ever one reader on the sk receive_queue, namely the BPF programs.

However, we've started to introduce a third configuration where the BPF program on receive should process the data, but then the normal case is to push the data into the receive queue of AppB.
       AppB
       recv()                (userspace)
     -----------------------
       tcp_bpf_recvmsg()     (kernel)
         |             |
         |             |
         |             |
       ingress_msgQ    |
         |             |
       RX_BPF          |
         |             |
         v             v
       sk->receive_queue

This is different from the App{A,B} redirect because traffic is first received on the sk->receive_queue.

Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg queue for any data handled by the BPF rx program and returned with PASS code so that it was enqueued on the ingress msg queue. Then, if no data exists on that queue, it checks the socket receive queue. Unfortunately, this is the same receive_queue the BPF program is reading data off of. So we get a race. It's possible for the recvmsg() hook to pull data off the receive_queue before the BPF hook has a chance to read it. It typically happens when an application is banging on recv() and getting EAGAINs, until it manages to race with the RX BPF program.

To fix this we note that before this patch, at attach time when the socket is loaded into the map, we check if it needs a TX program or just the base set of proto bpf hooks. Then it uses the above general RX hook regardless of whether we have a BPF program attached at rx or not. This patch now extends this check to handle all cases enumerated above: TX, RX, TXRX, and none. And to fix the above race when an RX program is attached, we use a new hook that is nearly identical to the old one except that now we do not let the recv() call skip the RX BPF program. Now only the BPF program pulls data from sk->receive_queue, and recv() only pulls data from the ingress msgQ post BPF program handling.

With this resolved, our AppB from above has been up and running for many hours without detecting any errors. We do this by correlating counters in RX BPF events and the AppB to ensure data is never skipping the BPF program. Selftests were not able to detect this because we only run them for a short period of time on well-ordered send/recvs, so we don't get any of the noise we see in real application environments.

Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
(cherry picked from commit c5d2177a)
Signed-off-by: Liu Jian <liujian56@huawei.com>

Conflicts:
	net/ipv4/tcp_bpf.c

Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
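The four program combinations and the rule "recv() must not read the socket receive queue when an RX program is attached" can be expressed as a small selection function. The names below (`enum hook_cfg`, `pick_cfg()`, `recv_may_read_receive_queue()`) are hypothetical and chosen for this sketch; they model the decision described in the commit message rather than reproduce the kernel's tables.

```c
#include <stdbool.h>

/* Hypothetical names for the four permutations: None, Tx, Rx, TxRx. */
enum hook_cfg {
	CFG_BASE,	/* no programs attached */
	CFG_TX,		/* send-side program only */
	CFG_RX,		/* receive-side program only */
	CFG_TXRX,	/* both attached */
};

/* Hypothetical summary of which programs are attached to a socket. */
struct progs_sketch {
	bool has_tx;
	bool has_rx;
};

/* Pick the hook set from the full list of attached programs. */
static enum hook_cfg pick_cfg(const struct progs_sketch *p)
{
	if (p->has_tx && p->has_rx)
		return CFG_TXRX;
	if (p->has_rx)
		return CFG_RX;
	if (p->has_tx)
		return CFG_TX;
	return CFG_BASE;
}

/*
 * With an RX program attached, only the BPF program may consume the
 * socket receive queue; recv() reads the ingress message queue alone.
 */
static bool recv_may_read_receive_queue(enum hook_cfg cfg)
{
	return cfg == CFG_BASE || cfg == CFG_TX;
}
```

Keeping a single reader per queue is what removes the race: there is no longer a moment at which recv() and the RX program compete for the same data.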
-
- 26 Apr, 2023 21 commits
-
-
Committed by openeuler-ci-bot
Merge Pull Request from: @zhangjialin11

Pull new CVEs:
CVE-2023-1855
CVE-2023-2006
CVE-2023-30772
CVE-2023-1872

net bugfixes from Ziyang Xuan
timer bugfix from Yu Liao
xfs bugfixes from Guo Xuenan

Link: https://gitee.com/openeuler/kernel/pulls/632
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Committed by Ido Schimmel
mainline inclusion
from mainline-v6.3
commit c484fcc0
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c484fcc058bada604d7e4e5228d4affb646ddbc2

---------------------------

When a net device is put administratively up, its 'IFF_UP' flag is set (if not set already) and a 'NETDEV_UP' notification is emitted, which causes the 8021q driver to add VLAN ID 0 on the device. The reverse happens when a net device is put administratively down.

When changing the type of a bond to Ethernet, its 'IFF_UP' flag is incorrectly cleared, resulting in the kernel skipping the above process and VLAN ID 0 being leaked [1].

Fix by restoring the flag when changing the type to Ethernet, in a similar fashion to the restoration of the 'IFF_SLAVE' flag.

The issue can be reproduced using the script in [2], with example output before and after the fix in [3].

[1]
unreferenced object 0xffff888103479900 (size 256):
  comm "ip", pid 329, jiffies 4294775225 (age 28.561s)
  hex dump (first 32 bytes):
    00 a0 0c 15 81 88 ff ff 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
backtrace: [<ffffffff81a6051a>] kmalloc_trace+0x2a/0xe0 [<ffffffff8406426c>] vlan_vid_add+0x30c/0x790 [<ffffffff84068e21>] vlan_device_event+0x1491/0x21a0 [<ffffffff81440c8e>] notifier_call_chain+0xbe/0x1f0 [<ffffffff8372383a>] call_netdevice_notifiers_info+0xba/0x150 [<ffffffff837590f2>] __dev_notify_flags+0x132/0x2e0 [<ffffffff8375ad9f>] dev_change_flags+0x11f/0x180 [<ffffffff8379af36>] do_setlink+0xb96/0x4060 [<ffffffff837adf6a>] __rtnl_newlink+0xc0a/0x18a0 [<ffffffff837aec6c>] rtnl_newlink+0x6c/0xa0 [<ffffffff837ac64e>] rtnetlink_rcv_msg+0x43e/0xe00 [<ffffffff839a99e0>] netlink_rcv_skb+0x170/0x440 [<ffffffff839a738f>] netlink_unicast+0x53f/0x810 [<ffffffff839a7fcb>] netlink_sendmsg+0x96b/0xe90 [<ffffffff8369d12f>] ____sys_sendmsg+0x30f/0xa70 [<ffffffff836a6d7a>] ___sys_sendmsg+0x13a/0x1e0 unreferenced object 0xffff88810f6a83e0 (size 32): comm "ip", pid 329, jiffies 4294775225 (age 28.561s) hex dump (first 32 bytes): a0 99 47 03 81 88 ff ff a0 99 47 03 81 88 ff ff ..G.......G..... 81 00 00 00 01 00 00 00 cc cc cc cc cc cc cc cc ................ 
backtrace: [<ffffffff81a6051a>] kmalloc_trace+0x2a/0xe0 [<ffffffff84064369>] vlan_vid_add+0x409/0x790 [<ffffffff84068e21>] vlan_device_event+0x1491/0x21a0 [<ffffffff81440c8e>] notifier_call_chain+0xbe/0x1f0 [<ffffffff8372383a>] call_netdevice_notifiers_info+0xba/0x150 [<ffffffff837590f2>] __dev_notify_flags+0x132/0x2e0 [<ffffffff8375ad9f>] dev_change_flags+0x11f/0x180 [<ffffffff8379af36>] do_setlink+0xb96/0x4060 [<ffffffff837adf6a>] __rtnl_newlink+0xc0a/0x18a0 [<ffffffff837aec6c>] rtnl_newlink+0x6c/0xa0 [<ffffffff837ac64e>] rtnetlink_rcv_msg+0x43e/0xe00 [<ffffffff839a99e0>] netlink_rcv_skb+0x170/0x440 [<ffffffff839a738f>] netlink_unicast+0x53f/0x810 [<ffffffff839a7fcb>] netlink_sendmsg+0x96b/0xe90 [<ffffffff8369d12f>] ____sys_sendmsg+0x30f/0xa70 [<ffffffff836a6d7a>] ___sys_sendmsg+0x13a/0x1e0 [2] ip link add name t-nlmon type nlmon ip link add name t-dummy type dummy ip link add name t-bond type bond mode active-backup ip link set dev t-bond up ip link set dev t-nlmon master t-bond ip link set dev t-nlmon nomaster ip link show dev t-bond ip link set dev t-dummy master t-bond ip link show dev t-bond ip link del dev t-bond ip link del dev t-dummy ip link del dev t-nlmon [3] Before: 12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/netlink 12: t-bond: <BROADCAST,MULTICAST,MASTER,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 46:57:39:a4:46:a2 brd ff:ff:ff:ff:ff:ff After: 12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/netlink 12: t-bond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 66:48:7b:74:b6:8a brd ff:ff:ff:ff:ff:ff Fixes: e36b9d16 ("bonding: clean muticast addresses when device changes type") Fixes: 75c78500 ("bonding: remap muticast addresses without using dev_close() and 
dev_open()")
Fixes: 9ec7eb60 ("bonding: restore IFF_MASTER/SLAVE flags on bond enslave ether type change")
Reported-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Link: https://lore.kernel.org/netdev/78a8a03b-6070-3e6b-5042-f848dab16fb8@alu.unizg.hr/
Tested-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Nikolay Aleksandrov
mainline inclusion from mainline-v6.3-rc3 commit e667d469 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e667d469098671261d558be0cd93dca4d285ce1e --------------------------- syzbot reported a warning[1] where the bond device itself is a slave and we try to enslave a non-ethernet device as the first slave which fails but then in the error path when ether_setup() restores the bond device it also clears all flags. In my previous fix[2] I restored the IFF_MASTER flag, but I didn't consider the case that the bond device itself might also be a slave with IFF_SLAVE set, so we need to restore that flag as well. Use the bond_ether_setup helper which does the right thing and restores the bond's flags properly. Steps to reproduce using a nlmon dev: $ ip l add nlmon0 type nlmon $ ip l add bond1 type bond $ ip l add bond2 type bond $ ip l set bond1 master bond2 $ ip l set dev nlmon0 master bond1 $ ip -d l sh dev bond1 22: bond1: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noqueue master bond2 state DOWN mode DEFAULT group default qlen 1000 (now bond1's IFF_SLAVE flag is gone and we'll hit a warning[3] if we try to delete it) [1] https://syzkaller.appspot.com/bug?id=391c7b1f6522182899efba27d891f1743e8eb3ef [2] commit 7d5cd2ce ("bonding: correctly handle bonding type change on enslave failure") [3] example warning: [ 27.008664] bond1: (slave nlmon0): The slave device specified does not support setting the MAC address [ 27.008692] bond1: (slave nlmon0): Error -95 calling set_mac_address [ 32.464639] bond1 (unregistering): Released all slaves [ 32.464685] ------------[ cut here ]------------ [ 32.464686] WARNING: CPU: 1 PID: 2004 at net/core/dev.c:10829 unregister_netdevice_many+0x72a/0x780 [ 32.464694] Modules linked in: br_netfilter bridge bonding virtio_net [ 32.464699] CPU: 1 PID: 2004 Comm: ip Kdump: loaded Not tainted 5.18.0-rc3+ #47 [ 32.464703] 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc37 04/01/2014 [ 32.464704] RIP: 0010:unregister_netdevice_many+0x72a/0x780 [ 32.464707] Code: 99 fd ff ff ba 90 1a 00 00 48 c7 c6 f4 02 66 96 48 c7 c7 20 4d 35 96 c6 05 fa c7 2b 02 01 e8 be 6f 4a 00 0f 0b e9 73 fd ff ff <0f> 0b e9 5f fd ff ff 80 3d e3 c7 2b 02 00 0f 85 3b fd ff ff ba 59 [ 32.464710] RSP: 0018:ffffa006422d7820 EFLAGS: 00010206 [ 32.464712] RAX: ffff8f6e077140a0 RBX: ffffa006422d7888 RCX: 0000000000000000 [ 32.464714] RDX: ffff8f6e12edbe58 RSI: 0000000000000296 RDI: ffffffff96d4a520 [ 32.464716] RBP: ffff8f6e07714000 R08: ffffffff96d63600 R09: ffffa006422d7728 [ 32.464717] R10: 0000000000000ec0 R11: ffffffff9698c988 R12: ffff8f6e12edb140 [ 32.464719] R13: dead000000000122 R14: dead000000000100 R15: ffff8f6e12edb140 [ 32.464723] FS: 00007f297c2f1740(0000) GS:ffff8f6e5d900000(0000) knlGS:0000000000000000 [ 32.464725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 32.464726] CR2: 00007f297bf1c800 CR3: 00000000115e8000 CR4: 0000000000350ee0 [ 32.464730] Call Trace: [ 32.464763] <TASK> [ 32.464767] rtnl_dellink+0x13e/0x380 [ 32.464776] ? cred_has_capability.isra.0+0x68/0x100 [ 32.464780] ? __rtnl_unlock+0x33/0x60 [ 32.464783] ? bpf_lsm_capset+0x10/0x10 [ 32.464786] ? security_capable+0x36/0x50 [ 32.464790] rtnetlink_rcv_msg+0x14e/0x3b0 [ 32.464792] ? _copy_to_iter+0xb1/0x790 [ 32.464796] ? post_alloc_hook+0xa0/0x160 [ 32.464799] ? rtnl_calcit.isra.0+0x110/0x110 [ 32.464802] netlink_rcv_skb+0x50/0xf0 [ 32.464806] netlink_unicast+0x216/0x340 [ 32.464809] netlink_sendmsg+0x23f/0x480 [ 32.464812] sock_sendmsg+0x5e/0x60 [ 32.464815] ____sys_sendmsg+0x22c/0x270 [ 32.464818] ? import_iovec+0x17/0x20 [ 32.464821] ? sendmsg_copy_msghdr+0x59/0x90 [ 32.464823] ? do_set_pte+0xa0/0xe0 [ 32.464828] ___sys_sendmsg+0x81/0xc0 [ 32.464832] ? mod_objcg_state+0xc6/0x300 [ 32.464835] ? refill_obj_stock+0xa9/0x160 [ 32.464838] ? 
memcg_slab_free_hook+0x1a5/0x1f0
[ 32.464842] __sys_sendmsg+0x49/0x80
[ 32.464847] do_syscall_64+0x3b/0x90
[ 32.464851] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 32.464865] RIP: 0033:0x7f297bf2e5e7
[ 32.464868] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[ 32.464869] RSP: 002b:00007ffd96c824c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 32.464872] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f297bf2e5e7
[ 32.464874] RDX: 0000000000000000 RSI: 00007ffd96c82540 RDI: 0000000000000003
[ 32.464875] RBP: 00000000640f19de R08: 0000000000000001 R09: 000000000000007c
[ 32.464876] R10: 00007f297bffabe0 R11: 0000000000000246 R12: 0000000000000001
[ 32.464877] R13: 00007ffd96c82d20 R14: 00007ffd96c82610 R15: 000055bfe38a7020
[ 32.464881] </TASK>
[ 32.464882] ---[ end trace 0000000000000000 ]---

Fixes: 7d5cd2ce ("bonding: correctly handle bonding type change on enslave failure")
Reported-by: syzbot+9dfc3f3348729cc82277@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=391c7b1f6522182899efba27d891f1743e8eb3ef
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Acked-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Nikolay Aleksandrov
mainline inclusion
from mainline-v6.3-rc3
commit 9ec7eb60
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ec7eb60dcbcb6c41076defbc5df7bbd95ceaba5

---------------------------

Add bond_ether_setup helper which is used to fix ether_setup() calls in the bonding driver. It takes care of both IFF_MASTER and IFF_SLAVE flags, the former is always restored and the latter only if it was set.

If the bond enslaves non-ARPHRD_ETHER device (changes its type), then releases it and enslaves ARPHRD_ETHER device (changes back) then we use ether_setup() to restore the bond device type but it also resets its flags and removes IFF_MASTER and IFF_SLAVE[1]. Use the bond_ether_setup helper to restore both after such transition.

[1] reproduce (nlmon is non-ARPHRD_ETHER):
 $ ip l add nlmon0 type nlmon
 $ ip l add bond2 type bond mode active-backup
 $ ip l set nlmon0 master bond2
 $ ip l set nlmon0 nomaster
 $ ip l add bond1 type bond
 (we use bond1 as ARPHRD_ETHER device to restore bond2's mode)
 $ ip l set bond1 master bond2
 $ ip l sh dev bond2
 37: bond2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether be:d7:c5:40:5b:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500
 (notice bond2's IFF_MASTER is missing)

Fixes: e36b9d16 ("bonding: clean muticast addresses when device changes type")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Conflicts:
	drivers/net/bonding/bond_main.c

Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
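The flag handling described in this commit can be sketched as plain bit arithmetic. The following Python model is an illustration only, assuming that `ether_setup()` leaves just the broadcast and multicast flags set; the function bodies are a simplified stand-in for the driver code, not a copy of it. The flag values match the Linux `IFF_*` definitions.

```python
# Interface flag bits as defined by Linux (include/uapi/linux/if.h).
IFF_UP = 0x1
IFF_BROADCAST = 0x2
IFF_MASTER = 0x400
IFF_SLAVE = 0x800
IFF_MULTICAST = 0x1000


def ether_setup(flags):
    """Model of the reset: every previous flag is discarded."""
    return IFF_BROADCAST | IFF_MULTICAST


def bond_ether_setup(flags):
    """Model of the helper: IFF_MASTER is always restored,
    IFF_SLAVE only if it was set before the reset."""
    slave_flag = flags & IFF_SLAVE
    return ether_setup(flags) | IFF_MASTER | slave_flag


# A bond that is itself enslaved to another bond carries both flags.
nested_bond = IFF_BROADCAST | IFF_MULTICAST | IFF_MASTER | IFF_SLAVE
plain_bond = IFF_BROADCAST | IFF_MULTICAST | IFF_MASTER
```

Calling `ether_setup(nested_bond)` drops both IFF_MASTER and IFF_SLAVE, which is the state shown in the reproducer output, while `bond_ether_setup(nested_bond)` keeps them. The later IFF_UP fix in this log follows the same pattern with one more preserved bit.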
-
Committed by Zheng Wang
mainline inclusion
from mainline-v6.3-rc3
commit cb090e64
category: bugfix
bugzilla: 188657, https://gitee.com/src-openeuler/kernel/issues/I6T36A
CVE: CVE-2023-1855

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb090e64cf25602b9adaf32d5dfc9c8bec493cd1

--------------------------------

In xgene_hwmon_probe, &ctx->workq is bound with xgene_hwmon_evt_work. Then it will be started.

If we remove the driver, which will call xgene_hwmon_remove to clean up, there may be unfinished work.

The possible sequence is as follows:

CPU0                              CPU1

                                 |xgene_hwmon_evt_work
xgene_hwmon_remove               |
kfifo_free(&ctx->async_msg_fifo);|
                                 |
                                 |kfifo_out_spinlocked
                                 |//use &ctx->async_msg_fifo

Fix it by finishing the work before cleanup in xgene_hwmon_remove.

Fixes: 2ca492e2 ("hwmon: (xgene) Fix crash when alarm occurs before driver probe")
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Link: https://lore.kernel.org/r/20230310084007.1403388-1-zyytlz.wz@163.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Zhao Wenhui <zhaowenhui8@huawei.com>
Reviewed-by: songping yu <yusongping@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by David Howells
stable inclusion
from stable-v5.10.157
commit 3535c632e6d16c98f76e615da8dc0cb2750c66cc
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6VK2H
CVE: CVE-2023-2006

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3535c632e6d16c98f76e615da8dc0cb2750c66cc

--------------------------------

[ Upstream commit 3bcd6c7e ]

After rxrpc_unbundle_conn() has removed a connection from a bundle, it checks to see if there are any conns with available channels and, if not, removes and attempts to destroy the bundle.

Whilst it does check after grabbing client_bundles_lock that there are no connections attached, this races with rxrpc_look_up_bundle() retrieving the bundle, but not attaching a connection for the connection to be attached later.

There is therefore a window in which the bundle can get destroyed before we manage to attach a new connection to it.

Fix this by adding an "active" counter to struct rxrpc_bundle:

 (1) rxrpc_connect_call() obtains an active count by prepping/looking up a bundle and ditches it before returning.

 (2) If, during rxrpc_connect_call(), a connection is added to the bundle, this obtains an active count, which is held until the connection is discarded.

 (3) rxrpc_deactivate_bundle() is created to drop an active count on a bundle and destroy it when the active count reaches 0. The active count is checked inside client_bundles_lock() to prevent a race with rxrpc_look_up_bundle().

 (4) rxrpc_unbundle_conn() then calls rxrpc_deactivate_bundle().

Fixes: 245500d8 ("rxrpc: Rewrite the client connection manager")
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-15975
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: zdi-disclosures@trendmicro.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>

Conflicts:
	net/rxrpc/ar-internal.h
	net/rxrpc/conn_client.c

Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
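The "active" counter scheme in points (1) to (4) can be sketched in userspace. This is a hedged Python model with hypothetical names (`Bundle`, `BundleTable`, `look_up`, `activate`, `deactivate`); a `threading.Lock` stands in for client_bundles_lock. The key property is that the count is taken and checked under the same lock that protects lookup, so a bundle that was just looked up cannot be destroyed underneath its user.

```python
import threading


class Bundle:
    def __init__(self, key):
        self.key = key
        self.active = 0         # number of users keeping the bundle alive
        self.destroyed = False


class BundleTable:
    def __init__(self):
        self.lock = threading.Lock()  # stands in for client_bundles_lock
        self.bundles = {}

    def look_up(self, key):
        """Find or create a bundle and take an active count under the lock."""
        with self.lock:
            bundle = self.bundles.get(key)
            if bundle is None:
                bundle = Bundle(key)
                self.bundles[key] = bundle
            bundle.active += 1
            return bundle

    def activate(self, bundle):
        """Extra count held while a connection is attached to the bundle."""
        with self.lock:
            bundle.active += 1

    def deactivate(self, bundle):
        """Drop one count; destroy the bundle only when it reaches zero."""
        with self.lock:
            bundle.active -= 1
            if bundle.active == 0:
                del self.bundles[bundle.key]
                bundle.destroyed = True


table = BundleTable()
call_ref = table.look_up("peer")    # (1) the call takes a count
table.activate(call_ref)            # (2) a connection is attached
table.deactivate(call_ref)          # (1) the call drops its count on return
alive_with_conn = not call_ref.destroyed
second = table.look_up("peer")      # a concurrent lookup finds the same bundle
table.deactivate(call_ref)          # (3)/(4) the connection is discarded
alive_with_lookup = not call_ref.destroyed
table.deactivate(second)            # the last user goes away
```

In this run the bundle survives as long as either the connection or the second lookup holds a count, and is destroyed only by the final `deactivate()`.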
-
Committed by Zheng Wang
stable inclusion
from stable-v5.10.177
commit 75e2144291e847009fbc0350e10ec588ff96e05a
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6W80A
CVE: CVE-2023-30772

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=75e2144291e847009fbc0350e10ec588ff96e05a

--------------------------------

[ Upstream commit 06615d11 ]

In da9150_charger_probe, &charger->otg_work is bound with da9150_charger_otg_work. da9150_charger_otg_ncb may be called to start the work.

If we remove the module, which will call da9150_charger_remove to clean up, there may be unfinished work. The possible sequence is as follows:

CPU0                      CPU1

                         |da9150_charger_otg_work
da9150_charger_remove    |
power_supply_unregister  |
device_unregister        |
power_supply_dev_release |
kfree(psy)               |
                         |
                         |power_supply_changed(charger->usb);
                         |//use

Fix it by canceling the work before cleanup in da9150_charger_remove.

Fixes: c1a281e3 ("power: Add support for DA9150 Charger")
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Frederic Weisbecker
mainline inclusion
from mainline-v5.16-rc4
commit 53e87e3c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6WCC1
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53e87e3cdc155f20c3417b689df8d2ac88d79576

--------------------------------

When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU is guaranteed to stay online and to never stop its tick.

In some rare cases, however, the dedicated timekeeper may be running with interrupts disabled for a while, such as in stop_machine.

If jiffies stop being updated, a nohz_full CPU may end up endlessly programming the next tick in the past, taking the last jiffies update monotonic timestamp as a stale base, resulting in a tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1 can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking last_jiffies_update as a base. But last_jiffies_update hasn't been updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes that the expiration is in the past and therefore programs the min delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm, the nohz_full CPU is drowned and takes ages to reach MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.

Solve this by unconditionally updating jiffies if the value is stale on nohz_full IRQ entry. IRQs and other disturbances are expected to be rare enough on nohz_full for the unconditional call to ktime_get() not to actually matter.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org

Conflicts:
	kernel/softirq.c

Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
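Steps 4) to 6) of the scenario come down to simple arithmetic on a stale base. The sketch below is a toy Python model under stated assumptions: `TICK_NSEC` assumes HZ=250 purely for illustration, and `next_tick_expiry`, `expiry_is_in_past` and `refresh_if_stale` are hypothetical names, not kernel functions.

```python
TICK_NSEC = 4_000_000  # one jiffy in nanoseconds, assuming HZ=250 for illustration


def next_tick_expiry(last_jiffies_update, jiffies_ahead=1):
    """The expiry is computed from the last jiffies update, not from 'now'."""
    return last_jiffies_update + jiffies_ahead * TICK_NSEC


def expiry_is_in_past(now, last_jiffies_update):
    """True when the programmed tick would fire immediately (step 5)."""
    return next_tick_expiry(last_jiffies_update) <= now


def refresh_if_stale(now, last_jiffies_update):
    """Model of the fix: on IRQ entry, advance the base by every whole tick
    that elapsed while the timekeeper could not update jiffies."""
    elapsed_ticks = (now - last_jiffies_update) // TICK_NSEC
    return last_jiffies_update + elapsed_ticks * TICK_NSEC


stale_base = 0
now = 2 * TICK_NSEC + 100           # the timekeeper has been stuck for 2 jiffies
fresh_base = refresh_if_stale(now, stale_base)
```

With the stale base, "one jiffy forward" lands at `TICK_NSEC`, which is already behind `now`, so every reprogramming fires at once. After the refresh the same computation yields `3 * TICK_NSEC`, which lies in the future.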
-
Committed by Darrick J. Wong
mainline inclusion from mainline-v5.18-rc2 commit a54f78de category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a54f78def73d847cb060b18c4e4a3d1d26c9ca6d -------------------------------- The recent patch to improve btree cycle checking caused a regression when I rebased the in-memory btree branch atop the 5.19 for-next branch, because in-memory short-pointer btrees do not have AG numbers. This produced the following complaint from kmemleak: unreferenced object 0xffff88803d47dde8 (size 264): comm "xfs_io", pid 4889, jiffies 4294906764 (age 24.072s) hex dump (first 32 bytes): 90 4d 0b 0f 80 88 ff ff 00 a0 bd 05 80 88 ff ff .M.............. e0 44 3a a0 ff ff ff ff 00 df 08 06 80 88 ff ff .D:............. backtrace: [<ffffffffa0388059>] xfbtree_dup_cursor+0x49/0xc0 [xfs] [<ffffffffa029887b>] xfs_btree_dup_cursor+0x3b/0x200 [xfs] [<ffffffffa029af5d>] __xfs_btree_split+0x6ad/0x820 [xfs] [<ffffffffa029b130>] xfs_btree_split+0x60/0x110 [xfs] [<ffffffffa029f6da>] xfs_btree_make_block_unfull+0x19a/0x1f0 [xfs] [<ffffffffa029fada>] xfs_btree_insrec+0x3aa/0x810 [xfs] [<ffffffffa029fff3>] xfs_btree_insert+0xb3/0x240 [xfs] [<ffffffffa02cb729>] xfs_rmap_insert+0x99/0x200 [xfs] [<ffffffffa02cf142>] xfs_rmap_map_shared+0x192/0x5f0 [xfs] [<ffffffffa02cf60b>] xfs_rmap_map_raw+0x6b/0x90 [xfs] [<ffffffffa0384a85>] xrep_rmap_stash+0xd5/0x1d0 [xfs] [<ffffffffa0384dc0>] xrep_rmap_visit_bmbt+0xa0/0xf0 [xfs] [<ffffffffa0384fb6>] xrep_rmap_scan_iext+0x56/0xa0 [xfs] [<ffffffffa03850d8>] xrep_rmap_scan_ifork+0xd8/0x160 [xfs] [<ffffffffa0385195>] xrep_rmap_scan_inode+0x35/0x80 [xfs] [<ffffffffa03852ee>] xrep_rmap_find_rmaps+0x10e/0x270 [xfs] I noticed that xfs_btree_insrec has a bunch of debug code that return out of the function immediately, without freeing the "new" btree cursor that can be returned when _make_block_unfull calls xfs_btree_split. 
Fix the error return in this function to free the btree cursor.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Dave Chinner
mainline inclusion
from mainline-v5.18-rc2
commit 5672225e
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5672225e8f2a872a22b0cecedba7a6644af1fb84

--------------------------------

Commit dc04db2a has caused a small aim7 regression, showing a small increase in CPU usage in __xfs_btree_check_sblock() as a result of the extra checking.

This is likely due to the endian conversion of the sibling pointers being unconditional instead of relying on the compiler to endian convert the NULL pointer at compile time and avoiding the runtime conversion for this common case.

Rework the checks so that endian conversion of the sibling pointers is only done if they are not null as the original code did.

.... and these need to be "inline" because the compiler completely fails to inline them automatically like it should be doing.

$ size fs/xfs/libxfs/xfs_btree.o*
   text    data     bss     dec     hex filename
  51874     240       0   52114    cb92 fs/xfs/libxfs/xfs_btree.o.orig
  51562     240       0   51802    ca5a fs/xfs/libxfs/xfs_btree.o.inline

Just when you think the tools have advanced sufficiently we don't have to care about stuff like this anymore, along comes a reminder that *our tools still suck*.

Fixes: dc04db2a ("xfs: detect self referencing btree sibling pointers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Dave Chinner
mainline inclusion
from mainline-v5.18-rc2
commit dc04db2a
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dc04db2aa7c9307e740d6d0e173085301c173b1a

--------------------------------

To catch the obvious graph cycle problem and hence potential endless looping.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
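The commit message is terse, so a small illustration of the kind of check it refers to may help. This is a Python sketch under assumptions: `check_sibling` and its parameters are hypothetical, the range check is included only for context, and the all-ones value is used as the "no sibling" sentinel as in XFS's NULLAGBLOCK. A block whose left or right sibling pointer names the block itself would make a sibling walk loop forever.

```python
NULLAGBLOCK = 0xFFFFFFFF  # "no sibling" sentinel (all ones)


def check_sibling(own_agbno, sibling, ag_blocks):
    """Return None if the sibling pointer is acceptable, else a reason string.

    own_agbno: block number of the btree block being verified
    sibling:   left or right sibling pointer read from that block
    ag_blocks: number of blocks in the allocation group (hypothetical bound)
    """
    if sibling == NULLAGBLOCK:
        return None                                  # end of the level, fine
    if sibling == own_agbno:
        return "self-referencing sibling pointer"    # trivial graph cycle
    if sibling >= ag_blocks:
        return "sibling pointer out of range"
    return None


def walk_right(blocks, start, ag_blocks):
    """Follow right-sibling pointers, verifying each block first."""
    visited = []
    agbno = start
    while agbno != NULLAGBLOCK:
        problem = check_sibling(agbno, blocks[agbno], ag_blocks)
        if problem:
            raise ValueError(f"block {agbno}: {problem}")
        visited.append(agbno)
        agbno = blocks[agbno]
    return visited
```

For example, `walk_right({1: 2, 2: NULLAGBLOCK}, 1, 10)` visits blocks 1 and 2, whereas a map such as `{1: 1}` is rejected instead of being walked endlessly. Longer cycles are not detected by this check, which matches the commit's wording about the "obvious" case.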
-
Committed by Dave Chinner
mainline inclusion
from mainline-v5.14-rc4
commit 04fcad80
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04fcad80cd068731a779fb442f78234732683755

--------------------------------

Introduce a helper function xfs_buf_daddr() to extract the disk address of the buffer from the struct xfs_buf. This will replace direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as the XFS_BUF_ADDR() macro.

This patch introduces the helper function and replaces all uses of XFS_BUF_ADDR() as this is just a simple sed replacement.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Darrick J. Wong
mainline inclusion
from mainline-v5.10-rc5
commit 3945ae03
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3945ae03d822aa47584dd502ac024ae1e1eb9e2d

--------------------------------

A couple of the superblock validation checks apply only to the kernel, so move them to xfs_fc_fill_super before we add the needsrepair "feature", which will prevent the kernel (but not xfsprogs) from mounting the filesystem. This also reduces the diff between kernel and userspace libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Dave Chinner
mainline inclusion
from mainline-v5.19-rc2
commit 7cf2b0f9
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cf2b0f9611b9971d663e1fc3206eeda3b902922

--------------------------------

Currently inodegc work can sit queued on the per-cpu queue until the workqueue is either flushed or the queue reaches a depth that triggers work queuing (and later throttling). This means that we could queue work that waits for a long time for some other event to trigger flushing.

Hence instead of just queueing work at a specific depth, use a delayed work that queues the work at a bound time. We can still schedule the work immediately at a given depth, but we no longer need to worry about leaving a number of items on the list that won't get processed until external events prevail.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
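The scheduling policy described here (run immediately at a given depth, otherwise within a bounded delay) can be modelled in a few lines. The following Python sketch uses a hypothetical `GcQueue` class with explicit timestamps instead of real timers; `max_depth` and `max_delay` are illustrative parameters, not values taken from XFS.

```python
class GcQueue:
    """Toy per-CPU queue: work runs at once when the queue is deep enough,
    and otherwise no later than max_delay after the first item was queued."""

    def __init__(self, max_depth, max_delay):
        self.max_depth = max_depth
        self.max_delay = max_delay
        self.items = []
        self.deadline = None      # time at which the worker must run

    def queue(self, item, now):
        self.items.append(item)
        if len(self.items) >= self.max_depth:
            self.deadline = now                     # schedule immediately
        elif self.deadline is None:
            self.deadline = now + self.max_delay    # bounded wait

    def run_due(self, now):
        """Process everything if the deadline has passed; return what ran."""
        if self.deadline is None or now < self.deadline:
            return []
        done, self.items, self.deadline = self.items, [], None
        return done


q = GcQueue(max_depth=3, max_delay=10)
q.queue("a", now=0)
early = q.run_due(now=5)       # deadline not reached yet
on_time = q.run_due(now=10)    # a lone item is still processed in bounded time
for name in ("b", "c", "d"):
    q.queue(name, now=20)
burst = q.run_due(now=20)      # the depth threshold triggers an immediate run
```

Without the `elif` branch a single queued item would wait until some external event flushed the queue, which is the situation the commit removes.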
-
Committed by Dave Chinner
mainline inclusion
from mainline-v5.19-rc2
commit 5e672cd6
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5e672cd69f0a534a445df4372141fd0d1d00901d

--------------------------------

The current blocking mechanism for pushing the inodegc queue out to disk can result in systems becoming unusable when there is a long running inodegc operation. This is because the statfs() implementation currently issues a blocking flush of the inodegc queue and a significant number of common system utilities will call statfs() to discover something about the underlying filesystem.

This can result in userspace operations getting stuck on inodegc progress, and when trying to remove a heavily reflinked file on slow storage with a full journal, this can result in delays measured in hours.

Avoid this problem by adding a "push" function that expedites the flushing of the inodegc queue, but doesn't wait for it to complete.

Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this mechanism so they don't block but still ensure that queued operations are expedited.

Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: fix _getquota_next to use _inodegc_push too]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Brian Foster
mainline inclusion from mainline-v5.16-rc5 commit 6191cf3a category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6191cf3ad59fda5901160633fef8e41b064a5246 -------------------------------- The xfs_inodegc_stop() helper performs a high level flush of pending work on the percpu queues and then runs a cancel_work_sync() on each of the percpu work tasks to ensure all work has completed before returning. While cancel_work_sync() waits for wq tasks to complete, it does not guarantee work tasks have started. This means that the _stop() helper can queue and instantly cancel a wq task without having completed the associated work. This can be observed by tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>" test: xfs_destroy_inode: ... ino 0x83 ... xfs_inode_set_need_inactive: ... ino 0x83 ... xfs_inodegc_stop: ... ... xfs_inodegc_start: ... xfs_inodegc_worker: ... xfs_inode_inactivating: ... ino 0x83 ... The first few lines show that the inode is removed and need inactive state set, but the inactivation work has not completed before the inodegc mechanism stops. The inactivation doesn't actually occur until the fs is unfrozen and the gc mechanism starts back up. Note that this test requires fsfreeze to reproduce because xfs_freeze indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush(). When this occurs, the workqueue try_to_grab_pending() logic first tries to steal the pending bit, which does not succeed because the bit has been set by queue_work_on(). Subsequently, it checks for association of a pool workqueue from the work item under the pool lock. This association is set at the point a work item is queued and cleared when dequeued for processing. If the association exists, the work item is removed from the queue and cancel_work_sync() returns true. 
If the pwq association is cleared, the remove attempt assumes the task is busy and retries (eventually returning false to the caller after waiting for the work task to complete).

To avoid this race, we can flush each work item explicitly before cancel. However, since the _queue_all() already schedules each underlying work item, the workqueue level helpers are sufficient to achieve the same ordering effect. E.g., the inodegc enabled flag prevents scheduling any further work in the _stop() case. Use the drain_workqueue() helper in this particular case to make the intent a bit more self explanatory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
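The difference between cancelling and draining can be shown with a toy single-threaded workqueue. This Python sketch is only a model: the `WorkItem` and `Workqueue` classes are hypothetical, and their `cancel_work_sync` and `drain` methods mimic the behaviour the commit message describes for a work item that is queued but has not started.

```python
class WorkItem:
    def __init__(self, fn):
        self.fn = fn
        self.pending = False      # queued but not yet started


class Workqueue:
    def __init__(self):
        self.queue = []

    def queue_work(self, work):
        if not work.pending:
            work.pending = True
            self.queue.append(work)

    def cancel_work_sync(self, work):
        """A queued-but-not-started item is removed without ever running."""
        if work.pending:
            self.queue.remove(work)
            work.pending = False
            return True
        return False

    def drain(self):
        """Run every queued item to completion before returning."""
        while self.queue:
            work = self.queue.pop(0)
            work.pending = False
            work.fn()


def stop_with_cancel():
    inactivated = []
    wq = Workqueue()
    work = WorkItem(lambda: inactivated.append(0x83))
    wq.queue_work(work)           # _queue_all() schedules the work item
    wq.cancel_work_sync(work)     # ...and it is cancelled before it starts
    return inactivated


def stop_with_drain():
    inactivated = []
    wq = Workqueue()
    work = WorkItem(lambda: inactivated.append(0x83))
    wq.queue_work(work)
    wq.drain()                    # the queued work is guaranteed to run
    return inactivated
```

`stop_with_cancel()` returns an empty list, mirroring the trace in which inode 0x83 is not inactivated until the filesystem is unfrozen, while `stop_with_drain()` returns the inactivated inode.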
-
Committed by Dave Chinner
mainline inclusion
from mainline-v5.17-rc6
commit 919edbad
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=919edbadebe17a67193533f531c2920c03e40fa4

--------------------------------

Jan Kara reported a performance regression in dbench that he bisected down to commit bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally").

Whilst developing the journal flush/fua optimisations this cache was part of, it appeared to make a significant difference to performance. However, now that this patchset has settled and all the correctness issues fixed, there does not appear to be any significant performance benefit to asynchronous cache flushes.

In fact, the opposite is true on some storage types and workloads, where additional cache flushes that can occur from fsync heavy workloads have measurable and significant impact on overall throughput.

Local dbench testing shows little difference on dbench runs with sync vs async cache flushes on either fast or slow SSD storage, and no difference in streaming concurrent async transaction workloads like fs-mark.

Fast NVME storage:

From `dbench -t 30`, CIL scale:

    clients        async                 sync
                BW       Latency      BW       Latency
    1          935.18     0.855      915.64     0.903
    8         2404.51     6.873     2341.77     6.511
    16        3003.42     6.460     2931.57     6.529
    32        3697.23     7.939     3596.28     7.894
    128       7237.43    15.495     7217.74    11.588
    512       5079.24    90.587     5167.08    95.822

fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize

             create    chown    unlink
    async    1m41s     1m16s    2m03s
    sync     1m40s     1m19s    1m54s

Slower SATA SSD storage:

From `dbench -t 30`, CIL scale:

    clients        async                 sync
                BW       Latency      BW       Latency
    1           78.59    15.792       83.78    10.729
    8          367.88    92.067      404.63    59.943
    16         564.51    72.524      602.71    76.089
    32         831.66   105.984      870.26   110.482
    128       1659.76   102.969     1624.73    91.356
    512       2135.91   223.054     2603.07   161.160

fsmark, 16 threads, create w/32k logbsize

             create    unlink
    async    5m06s     4m15s
    sync     5m00s     4m22s

And on Jan's test machine:

                  5.18-rc8-vanilla       5.18-rc8-patched
    Amean  1        71.22 (  0.00%)        64.94 *  8.81%*
    Amean  2        93.03 (  0.00%)        84.80 *  8.85%*
    Amean  4       150.54 (  0.00%)       137.51 *  8.66%*
    Amean  8       252.53 (  0.00%)       242.24 *  4.08%*
    Amean  16      454.13 (  0.00%)       439.08 *  3.31%*
    Amean  32      835.24 (  0.00%)       829.74 *  0.66%*
    Amean  64     1740.59 (  0.00%)      1686.73 *  3.09%*

Performance and cache flush behaviour is restored to pre-regression levels.

As such, we can now consider the async cache flush mechanism an unnecessary exercise in premature optimisation and hence we can now remove it and the infrastructure it requires completely.

Fixes: bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally")
Reported-and-tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Dave Chinner
mainline inclusion
from mainline-v5.14-rc1
commit 9d110014
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9d110014205cb1129fa570d8de83d486fa199354

--------------------------------

From the department of "generic/482 keeps on giving", we bring you another tail update race condition:

iclog:

    S1                      C1
    +-----------------------+-----------------------+
                             S2                     EOIC

Two checkpoints in a single iclog. One is complete, the other just contains the start record and overruns into a new iclog.

Timeline:

    Before S1:   Cache flush, log tail = X
    At S1:       Metadata stable, write start record and checkpoint
    At C1:       Write commit record, set NEED_FUA
                 Single iclog checkpoint, so no need for NEED_FLUSH
                 Log tail still = X, so no need for NEED_FLUSH

    After C1,
    Before S2:   Cache flush, log tail = X
    At S2:       Metadata stable, write start record and checkpoint
    After S2:    Log tail moves to X+1
    At EOIC:     End of iclog, more journal data to write
                 Releases iclog
                 Not a commit iclog, so no need for NEED_FLUSH
                 Writes log tail X+1 into iclog.

At this point, the iclog has tail X+1 and NEED_FUA set. There has been no cache flush for the metadata between X and X+1, and the iclog writes the new tail permanently to the log. This is sufficient to violate on disk metadata/journal ordering.

We have two options here. The first is to detect this case in some manner and ensure that the partial checkpoint write sets NEED_FLUSH when the iclog is already marked NEED_FUA and the log tail changes. This seems somewhat fragile and quite complex to get right, and it doesn't actually make it obvious what underlying problem it is actually addressing from reading the code.

The second option seems much cleaner to me, because it is derived directly from the requirements of the C1 commit record in the iclog. That is, when we write this commit record to the iclog, we've guaranteed that the metadata/data ordering is correct for tail update purposes. Hence if we only write the log tail into the iclog for the *first* commit record rather than the log tail at the last release, we guarantee that the log tail does not move past where the first commit record in the log expects it to be.

IOWs, taking the first option means that replay of C1 becomes dependent on future operations doing the right thing, not just the C1 checkpoint itself doing the right thing. This makes log recovery almost impossible to reason about because now we have to take into account what might or might not have happened in the future when looking at checkpoints in the log rather than just having to reconstruct the past...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Dave Chinner
mainline inclusion
from mainline-v5.14-rc1
commit b2ae3a9e
category: bugfix
bugzilla: 187526, https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b2ae3a9ef91152931b99620c431cf3805daa1429

--------------------------------

Because I cannot tell if the NEED_FLUSH flag is being set correctly by the log force and CIL push machinery without it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Jens Axboe
stable inclusion
from stable-v5.10.172
commit da24142b1ef9fd5d36b76e36bab328a5b27523e8
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6V7V1
CVE: CVE-2023-1872
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=da24142b1ef9fd5d36b76e36bab328a5b27523e8

--------------------------------

We can't use 0 here, as io_init_req() is always invoked with the ctx uring_lock held. Newer kernels have IO_URING_F_UNLOCKED for this, but previously we used IO_URING_F_NONBLOCK to indicate this as well.

Fixes: 08681391b84d ("io_uring: add missing lock in io_get_file_fixed")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Bing-Jhong Billy Jheng
stable inclusion
from stable-v5.10.171
commit 08681391b84da27133deefaaddefd0acfa90c2be
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6V7V1
CVE: CVE-2023-1872
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08681391b84da27133deefaaddefd0acfa90c2be

--------------------------------

io_get_file_fixed will access io_uring's context. Lock it if it is invoked unlocked (e.g. via io-wq) to avoid a race condition with fixed files getting unregistered.

No single upstream patch exists for this issue; it was fixed as part of the file assignment changes that went into the 5.18 cycle.

Signed-off-by: Jheng, Bing-Jhong Billy <billy@starlabs.sg>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
- 19 Apr, 2023 6 commits
-
-
Committed by openeuler-ci-bot
Merge Pull Request from: @openeuler-sync-bot

Origin pull request:
https://gitee.com/openeuler/kernel/pulls/585

PR sync from: Wei Li <liwei391@huawei.com>
https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/thread/UNWNKYXBCMVT2QIU4QPROMO5UMGXKQS7/

Link: https://gitee.com/openeuler/kernel/pulls/588

Reviewed-by: Jialin Zhang <zhangjialin11@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by openeuler-ci-bot
Merge Pull Request from: @zhangjialin11

Pull new CVEs:
CVE-2023-1829
CVE-2022-36280
CVE-2022-1015
CVE-2023-1989
CVE-2023-30456
CVE-2023-1990

xfs bugfixes from Long Li and yangerkun

Link: https://gitee.com/openeuler/kernel/pulls/591

Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Committed by Jamal Hadi Salim
stable inclusion
from stable-v5.10.173
commit 18c3fa7a7fdbb4d21dafc8a7710ae2c1680930f6
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6UYBU
CVE: CVE-2023-1829
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=18c3fa7a7fdbb4d21dafc8a7710ae2c1680930f6

--------------------------------

commit 8c710f75 upstream.

The tcindex classifier has served us well for about a quarter of a century but has not been getting much TLC due to lack of known users. Most recently it has become easy prey to syzkaller. For this reason, we are retiring it.

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Conflicts:
net/sched/cls_tcindex.c

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Zack Rusin
stable inclusion
from stable-v5.10.163
commit 439cbbc1519547f9a7b483f0de33b556ebfec901
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5QLC4
CVE: CVE-2022-36280
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=439cbbc1519547f9a7b483f0de33b556ebfec901

--------------------------------

commit 4cf949c7 upstream.

Invalid userspace dma surface copies could potentially overflow the memcpy from the surface to the snooped image, leading to crashes. To fix it, the dimensions of the copybox have to be validated against the expected size of the snooped cursor.

Signed-off-by: Zack Rusin <zackr@vmware.com>
Fixes: 2ac86371 ("vmwgfx: Snoop DMA transfers with non-covering sizes")
Cc: <stable@vger.kernel.org> # v3.2+
Reviewed-by: Michael Banack <banackm@vmware.com>
Reviewed-by: Martin Krastev <krastevm@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221026031936.1004280-1-zack@kde.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yuyao Lin <linyuyao1@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
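The validation idea in the message above (check the copy box against the destination size before any memcpy) can be sketched in plain userspace C. This is not the vmwgfx code: the 64x64 cursor size, `struct demo_box`, and `demo_snoop_copy` are assumptions made up for illustration, and the box is simplified to a width and height with no offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cursor size for this sketch; not taken from the driver. */
#define DEMO_CURSOR_W 64u
#define DEMO_CURSOR_H 64u

/* Hypothetical copy box: width and height in pixels. */
struct demo_box {
	uint32_t w;
	uint32_t h;
};

/*
 * Copy a box of 32-bit pixels from src into a fixed-size cursor image.
 * Returns 0 on success, -1 if the box does not fit the destination or
 * claims more pixels than src actually holds.
 */
static int demo_snoop_copy(uint32_t dst[DEMO_CURSOR_W * DEMO_CURSOR_H],
			   const uint32_t *src, size_t src_pixels,
			   struct demo_box box)
{
	if (box.w == 0 || box.h == 0)
		return -1;
	if (box.w > DEMO_CURSOR_W || box.h > DEMO_CURSOR_H)
		return -1;
	/* w and h are at most 64 here, so the product cannot overflow. */
	if ((size_t)box.w * box.h > src_pixels)
		return -1;

	/* Row by row, because source and destination strides differ. */
	for (uint32_t row = 0; row < box.h; row++)
		memcpy(dst + (size_t)row * DEMO_CURSOR_W,
		       src + (size_t)row * box.w,
		       box.w * sizeof(*dst));
	return 0;
}

static void demo_copy_self_check(void)
{
	static uint32_t cursor[DEMO_CURSOR_W * DEMO_CURSOR_H];
	uint32_t src[16];

	for (uint32_t i = 0; i < 16; i++)
		src[i] = i + 1;

	assert(demo_snoop_copy(cursor, src, 16, (struct demo_box){ 4, 4 }) == 0);
	assert(cursor[DEMO_CURSOR_W] == 5);	/* second row starts at src[4] */
	assert(demo_snoop_copy(cursor, src, 16, (struct demo_box){ 65, 1 }) == -1);
	assert(demo_snoop_copy(cursor, src, 16, (struct demo_box){ 8, 8 }) == -1);
}
```

The point of the sketch is the ordering: every dimension is checked against both buffers first, and the copy only runs once the box is known to fit.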
-
Committed by Antoine Tenart
mainline inclusion
from mainline-v5.18-rc3
commit 6c6f9f31
category: bugfix
bugzilla: 186582, https://gitee.com/src-openeuler/kernel/issues/I50WB5
CVE: CVE-2022-1015
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c6f9f31ecd47dce1d0dafca4bec8805f9bc97cd

--------------------------------

Since commit 6e1acfa3 ("netfilter: nf_tables: validate registers coming from userspace.") nft_parse_register can return a negative value, but the function prototype is still returning an unsigned int.

Fixes: 6e1acfa3 ("netfilter: nf_tables: validate registers coming from userspace.")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

conflict:
net/netfilter/nf_tables_api.c

Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Liu Jian <liujian56@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
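Why an unsigned return type is a problem for a function that reports errors as negative values can be shown with a small standalone sketch. The functions and the `DEMO_REG_MAX` limit below are hypothetical and do not reproduce `nft_parse_register`; they only illustrate the general C behaviour the message refers to.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical upper bound for this sketch, not the kernel's value. */
#define DEMO_REG_MAX 19u

/* Problematic shape: a negative errno leaves through an unsigned type. */
static unsigned int demo_parse_reg_unsigned(uint32_t reg)
{
	if (reg > DEMO_REG_MAX)
		return (unsigned int)-ERANGE;	/* wraps to a large positive value */
	return reg;
}

/* Corrected shape: the signed return type keeps the error negative. */
static int demo_parse_reg_signed(uint32_t reg)
{
	if (reg > DEMO_REG_MAX)
		return -ERANGE;
	return (int)reg;
}

static void demo_reg_self_check(void)
{
	/* The error is indistinguishable from a (bogus) register number. */
	assert(demo_parse_reg_unsigned(100) > DEMO_REG_MAX);

	/* With a signed type, callers can test for a negative result. */
	assert(demo_parse_reg_signed(100) == -ERANGE);
	assert(demo_parse_reg_signed(5) == 5);
}
```

With the unsigned variant, a caller has no reliable way to tell an error from a register index, which is the mismatch the prototype change addresses.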
-
Committed by Pablo Neira Ayuso
mainline inclusion
from mainline-v5.18-rc1
commit 6e1acfa3
category: bugfix
bugzilla: 186582, https://gitee.com/src-openeuler/kernel/issues/I50WB5
CVE: CVE-2022-1015
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6e1acfa387b9ff82cfc7db8cc3b6959221a95851

--------------------------------

Bail out in case userspace uses unsupported registers.

Fixes: 49499c3e ("netfilter: nf_tables: switch registers to 32 bit addressing")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

conflict:
net/netfilter/nf_tables_api.c

Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Liu Jian <liujian56@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-