- 14 1月, 2022 1 次提交
-
-
由 Willem de Bruijn 提交于
stable inclusion from stable-v5.10.88 commit 7da349f07e457cad135df0920a3f670e423fb5e9 bugzilla: 186058 https://gitee.com/openeuler/kernel/issues/I4QW6A Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7da349f07e457cad135df0920a3f670e423fb5e9 -------------------------------- [ Upstream commit ec6af094 ] Packet sockets may switch ring versions. Avoid misinterpreting state between versions, whose fields share a union. rx_owner_map is only allocated with a packet ring (pg_vec) and both are swapped together. If pg_vec is NULL, meaning no packet ring was allocated, then neither was rx_owner_map. And the field may be old state from a tpacket_v3. Fixes: 61fad681 ("net/packet: tpacket_rcv: avoid a producer race condition") Reported-by: NSyzbot <syzbot+1ac0994a0a0c55151121@syzkaller.appspotmail.com> Signed-off-by: NWillem de Bruijn <willemb@google.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20211215143937.106178-1-willemdebruijn.kernel@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 12 10月, 2021 3 次提交
-
-
由 Eric Dumazet 提交于
stable inclusion from stable-5.10.47 commit 196b22ef6cd1cdd87a85811d27ec8936fd71f346 bugzilla: 172973 https://gitee.com/openeuler/kernel/issues/I4DAKB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=196b22ef6cd1cdd87a85811d27ec8936fd71f346 -------------------------------- [ Upstream commit e032f7c9 ] Like prior patch, we need to annotate lockless accesses to po->ifindex For instance, packet_getname() is reading po->ifindex (twice) while another thread is able to change po->ifindex. KCSAN reported: BUG: KCSAN: data-race in packet_do_bind / packet_getname write to 0xffff888143ce3cbc of 4 bytes by task 25573 on cpu 1: packet_do_bind+0x420/0x7e0 net/packet/af_packet.c:3191 packet_bind+0xc3/0xd0 net/packet/af_packet.c:3255 __sys_bind+0x200/0x290 net/socket.c:1637 __do_sys_bind net/socket.c:1648 [inline] __se_sys_bind net/socket.c:1646 [inline] __x64_sys_bind+0x3d/0x50 net/socket.c:1646 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888143ce3cbc of 4 bytes by task 25578 on cpu 0: packet_getname+0x5b/0x1a0 net/packet/af_packet.c:3525 __sys_getsockname+0x10e/0x1a0 net/socket.c:1887 __do_sys_getsockname net/socket.c:1902 [inline] __se_sys_getsockname net/socket.c:1899 [inline] __x64_sys_getsockname+0x3e/0x50 net/socket.c:1899 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0x00000001 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 25578 Comm: syz-executor.5 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Eric Dumazet 提交于
stable inclusion from stable-5.10.47 commit da8b3aeff4ad7f8f777dd4b4d1929150849c9f46 bugzilla: 172973 https://gitee.com/openeuler/kernel/issues/I4DAKB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=da8b3aeff4ad7f8f777dd4b4d1929150849c9f46 -------------------------------- [ Upstream commit c7d2ef5d ] tpacket_snd(), packet_snd(), packet_getname() and packet_seq_show() can read po->num without holding a lock. This means other threads can change po->num at the same time. KCSAN complained about this known fact [1] Add READ_ONCE()/WRITE_ONCE() to address the issue. [1] BUG: KCSAN: data-race in packet_do_bind / packet_sendmsg write to 0xffff888131a0dcc0 of 2 bytes by task 24714 on cpu 0: packet_do_bind+0x3ab/0x7e0 net/packet/af_packet.c:3181 packet_bind+0xc3/0xd0 net/packet/af_packet.c:3255 __sys_bind+0x200/0x290 net/socket.c:1637 __do_sys_bind net/socket.c:1648 [inline] __se_sys_bind net/socket.c:1646 [inline] __x64_sys_bind+0x3d/0x50 net/socket.c:1646 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888131a0dcc0 of 2 bytes by task 24719 on cpu 1: packet_snd net/packet/af_packet.c:2899 [inline] packet_sendmsg+0x317/0x3570 net/packet/af_packet.c:3040 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmsg+0x1ed/0x270 net/socket.c:2433 __do_sys_sendmsg net/socket.c:2442 [inline] __se_sys_sendmsg net/socket.c:2440 [inline] __x64_sys_sendmsg+0x42/0x50 net/socket.c:2440 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0x1200 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 24719 Comm: syz-executor.5 Not tainted 5.13.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Eric Dumazet 提交于
stable inclusion from stable-5.10.47 commit f57132a887ea4be3d7d813b5003d56716ba33448 bugzilla: 172973 https://gitee.com/openeuler/kernel/issues/I4DAKB Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f57132a887ea4be3d7d813b5003d56716ba33448 -------------------------------- [ Upstream commit d1b5bee4 ] There is a known race in packet_sendmsg(), addressed in commit 32d3182c ("net/packet: fix race in tpacket_snd()") Now we have data_race(), we can use it to avoid a future KCSAN warning, as syzbot loves stressing af_packet sockets :) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 15 6月, 2021 1 次提交
-
-
由 Richard Sanger 提交于
stable inclusion from stable-5.10.42 commit 9c386011fa61ade4b20de22eb4f60b2b893ab0d4 bugzilla: 55093 CVE: NA -------------------------------- [ Upstream commit 171c3b15 ] The packetmmap tx ring should only return timestamps if requested via setsockopt PACKET_TIMESTAMP, as documented. This allows compatibility with non-timestamp aware user-space code which checks tp_status == TP_STATUS_AVAILABLE; not expecting additional timestamp flags to be set in tp_status. Fixes: b9c32fb2 ("packet: if hw/sw ts enabled in rx/tx ring, report which ts we got") Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: NRichard Sanger <rsanger@wand.net.nz> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 03 6月, 2021 2 次提交
-
-
由 Eric Dumazet 提交于
stable inclusion from stable-5.10.37 commit 2b3ae007c6394446562f9ba2e5043fb209ab3fb0 bugzilla: 51868 CVE: NA -------------------------------- [ Upstream commit 94f633ea ] af_packet fanout uses RCU rules to ensure f->arr elements are not dismantled before RCU grace period. However, it lacks rcu accessors to make sure KCSAN and other tools wont detect data races. Stupid compilers could also play games. Fixes: dc99f600 ("packet: Add fanout support.") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: N"Gong, Sishuai" <sishuai@purdue.edu> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Tanner Love 提交于
stable inclusion from stable-5.10.37 commit 3a1c395703bef879cc13b4b02530db72d2c3aeb6 bugzilla: 51868 CVE: NA -------------------------------- [ Upstream commit 9c661b0b ] One use case of PACKET_FANOUT is lockless reception with one socket per CPU. 256 is a practical limit on increasingly many machines. Increase PACKET_FANOUT_MAX to 64K. Expand setsockopt PACKET_FANOUT to take an extra argument max_num_members. Also explicitly define a fanout_args struct, instead of implicitly casting to an integer. This documents the API and simplifies the control flow. If max_num_members is not specified or is set to 0, then 256 is used, same as before. Signed-off-by: NTanner Love <tannerlove@google.com> Signed-off-by: NWillem de Bruijn <willemb@google.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 09 3月, 2021 1 次提交
-
-
由 Yonatan Linik 提交于
stable inclusion from stable-5.10.18 commit 06ab1e63ec5c72255fae316c59ea9c0475eadac8 bugzilla: 50148 -------------------------------- [ Upstream commit a268e0f2 ] proc_fs was used, in af_packet, without a surrounding #ifdef, although there is no hard dependency on proc_fs. That caused the initialization of the af_packet module to fail when CONFIG_PROC_FS=n. Specifically, proc_create_net() was used in af_packet.c, and when it fails, packet_net_init() returns -ENOMEM. It will always fail when the kernel is compiled without proc_fs, because, proc_create_net() for example always returns NULL. The calling order that starts in af_packet.c is as follows: packet_init() register_pernet_subsys() register_pernet_operations() __register_pernet_operations() ops_init() ops->init() (packet_net_ops.init=packet_net_init()) proc_create_net() It worked in the past because register_pernet_subsys()'s return value wasn't checked before this Commit 36096f2f ("packet: Fix error path in packet_init."). It always returned an error, but was not checked before, so everything was working even when CONFIG_PROC_FS=n. The fix here is simply to add the necessary #ifdef. This also fixes a similar error in tls_proc.c, that was found by Jakub Kicinski. Fixes: d26b698d ("net/tls: add skeleton of MIB statistics") Fixes: 36096f2f ("packet: Fix error path in packet_init") Signed-off-by: NYonatan Linik <yonatanlinik@gmail.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
-
- 24 11月, 2020 1 次提交
-
-
由 Eyal Birger 提交于
In the patchset merged by commit b9fcf0a0 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which did not have header_ops were given one for the purpose of protocol parsing on af_packet transmit path. That change made af_packet receive path regard these devices as having a visible L3 header and therefore aligned incoming skb->data to point to the skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not reset their mac_header prior to ingress and therefore their incoming packets became malformed. Ideally these devices would reset their mac headers, or af_packet would be able to rely on dev->hard_header_len being 0 for such cases, but it seems this is not the case. Fix by changing af_packet RX ll visibility criteria to include the existence of a '.create()' header operation, which is used when creating a device hard header - via dev_hard_header() - by upper layers, and does not exist in these L3 devices. As this predicate may be useful in other situations, add it as a common dev_has_header() helper in netdevice.h. Fixes: b9fcf0a0 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") Signed-off-by: NEyal Birger <eyal.birger@gmail.com> Acked-by: NJason A. Donenfeld <Jason@zx2c4.com> Acked-by: NWillem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 20 9月, 2020 1 次提交
-
-
由 Xie He 提交于
skb->nh.raw has been renamed as skb->network_header in 2007, in commit b0e380b1 ("[SK_BUFF]: unions of just one member don't get anything done, kill them") So here we change it to the new name. Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: NXie He <xie.he.0141@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 9月, 2020 1 次提交
-
-
由 Xie He 提交于
1. Change all "dev->hard_header" to "dev->header_ops" 2. On receiving incoming frames when header_ops == NULL: The comment only says what is wrong, but doesn't say what is right. This patch changes the comment to make it clear what is right. 3. On transmitting and receiving outgoing frames when header_ops == NULL: The comment explains that the LL header will be later added by the driver. However, I think it's better to simply say that the LL header is invisible to us. This phrasing is better from a software engineering perspective, because this makes it clear that what happens in the driver should be hidden from us and we should not care about what happens internally in the driver. 4. On resuming the LL header (for RAW frames) when header_ops == NULL: The comment says we are "unlikely" to restore the LL header. However, we should say that we are "unable" to restore it. It's not possible (rather than not likely) to restore it, because: 1) There is no way for us to restore because the LL header internally processed by the driver should be invisible to us. 2) In function packet_rcv and tpacket_rcv, the code only tries to restore the LL header when header_ops != NULL. Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: NXie He <xie.he.0141@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 9月, 2020 1 次提交
-
-
由 Xie He 提交于
This comment is outdated and no longer reflects the actual implementation of af_packet.c. Reasons for the new comment: 1. In af_packet.c, the function packet_snd first reserves a headroom of length (dev->hard_header_len + dev->needed_headroom). Then if the socket is a SOCK_DGRAM socket, it calls dev_hard_header, which calls dev->header_ops->create, to create the link layer header. If the socket is a SOCK_RAW socket, it "un-reserves" a headroom of length (dev->hard_header_len), and checks if the user has provided a header sized between (dev->min_header_len) and (dev->hard_header_len) (in dev_validate_header). This shows the developers of af_packet.c expect hard_header_len to be consistent with header_ops. 2. In af_packet.c, the function packet_sendmsg_spkt has a FIXME comment. That comment states that prepending an LL header internally in a driver is considered a bug. I believe this bug can be fixed by setting hard_header_len to 0, making the internal header completely invisible to af_packet.c (and requesting the headroom in needed_headroom instead). 3. There is a commit for a WiFi driver: commit 9454f7a8 ("mwifiex: set needed_headroom, not hard_header_len") According to the discussion about it at: https://patchwork.kernel.org/patch/11407493/ The author tried to set the WiFi driver's hard_header_len to the Ethernet header length, and request additional header space internally needed by setting needed_headroom. This means this usage is already adopted by driver developers. Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Brian Norris <briannorris@chromium.org> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NXie He <xie.he.0141@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 9月, 2020 1 次提交
-
-
由 Wang Hai 提交于
BLOCK_PRIV is never used after it was introduced. So better to remove it. Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NWang Hai <wanghai38@huawei.com> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 05 9月, 2020 1 次提交
-
-
由 Or Cohen 提交于
Using tp_reserve to calculate netoff can overflow as tp_reserve is unsigned int and netoff is unsigned short. This may lead to macoff receving a smaller value then sizeof(struct virtio_net_hdr), and if po->has_vnet_hdr is set, an out-of-bounds write will occur when calling virtio_net_hdr_from_skb. The bug is fixed by converting netoff to unsigned int and checking if it exceeds USHRT_MAX. This addresses CVE-2020-14386 Fixes: 8913336a ("packet: add PACKET_RESERVE sockopt") Signed-off-by: NOr Cohen <orcohen@paloaltonetworks.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 8月, 2020 1 次提交
-
-
由 Gustavo A. R. Silva 提交于
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-throughSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
-
- 14 8月, 2020 1 次提交
-
-
由 John Ogness 提交于
After @blk_fill_in_prog_lock is acquired there is an early out vnet situation that can occur. In that case, the rwlock needs to be released. Also, since @blk_fill_in_prog_lock is only acquired when @tp_version is exactly TPACKET_V3, only release it on that exact condition as well. And finally, add sparse annotation so that it is clearer that prb_fill_curr_block() and prb_clear_blk_fill_status() are acquiring and releasing @blk_fill_in_prog_lock, respectively. sparse is still unable to understand the balance, but the warnings are now on a higher level that make more sense. Fixes: 632ca50f ("af_packet: TPACKET_V3: replace busy-wait loop") Signed-off-by: NJohn Ogness <john.ogness@linutronix.de> Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 7月, 2020 2 次提交
-
-
由 Christoph Hellwig 提交于
Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of architecture specific code. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154] Acked-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Pass a sockptr_t to prepare for set_fs-less handling of the kernel pointer from bpf-cgroup. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 7月, 2020 2 次提交
-
-
由 Christoph Hellwig 提交于
Just check for a NULL method instead of wiring up sock_no_{get,set}sockopt. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NMarc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper that copies either a native or compat bpf_fprog from userspace after verifying the length, and remove the compat setsockopt handlers that now aren't required. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 7月, 2020 1 次提交
-
-
由 John Ogness 提交于
A busy-wait loop is used to implement waiting for bits to be copied from the skb to the kernel buffer before retiring a block. This is a problem on PREEMPT_RT because the copying task could be preempted by the busy-waiting task and thus live lock in the busy-wait loop. Replace the busy-wait logic with an rwlock_t. This provides lockdep coverage and makes the code RT ready. Signed-off-by: NJohn Ogness <john.ogness@linutronix.de> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 02 7月, 2020 1 次提交
-
-
由 Colin Ian King 提交于
The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 3月, 2020 1 次提交
-
-
由 Willem de Bruijn 提交于
PACKET_RX_RING can cause multiple writers to access the same slot if a fast writer wraps the ring while a slow writer is still copying. This is particularly likely with few, large, slots (e.g., GSO packets). Synchronize kernel thread ownership of rx ring slots with a bitmap. Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL while holding the sk receive queue lock. They release this lock before copying and set tp_status to TP_STATUS_USER to release to userspace when done. During copying, another writer may take the lock, also see TP_STATUS_KERNEL, and start writing to the same slot. Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a slot, test and set with the lock held. To release race-free, update tp_status and owner bit as a transaction, so take the lock again. This is the one of a variety of discussed options (see Link below): * instead of a shadow ring, embed the data in the slot itself, such as in tp_padding. But any test for this field may match a value left by userspace, causing deadlock. * avoid the lock on release. This leaves a small race if releasing the shadow slot before setting TP_STATUS_USER. The below reproducer showed that this race is not academic. If releasing the slot after tp_status, the race is more subtle. See the first link for details. * add a new tp_status TP_KERNEL_OWNED to avoid the transactional store of two fields. But, legacy applications may interpret all non-zero tp_status as owned by the user. As libpcap does. So this is possible only opt-in by newer processes. It can be added as an optional mode. * embed the struct at the tail of pg_vec to avoid extra allocation. The implementation proved no less complex than a separate field. The additional locking cost on release adds contention, no different than scaling on multicore or multiqueue h/w. In practice, below reproducer nor small packet tcpdump showed a noticeable change in perf report in cycles spent in spinlock. Where contention is problematic, packet sockets support mitigation through PACKET_FANOUT. And we can consider adding opt-in state TP_KERNEL_OWNED. Easy to reproduce by running multiple netperf or similar TCP_STREAM flows concurrently with `tcpdump -B 129 -n greater 60000`. Based on an earlier patchset by Jon Rosen. See links below. I believe this issue goes back to the introduction of tpacket_rcv, which predates git history. Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.htmlSuggested-by: NJon Rosen <jrosen@cisco.com> Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NJon Rosen <jrosen@cisco.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 3月, 2020 1 次提交
-
-
由 Willem de Bruijn 提交于
In one error case, tpacket_rcv drops packets after incrementing the ring producer index. If this happens, it does not update tp_status to TP_STATUS_USER and thus the reader is stalled for an iteration of the ring, causing out of order arrival. The only such error path is when virtio_net_hdr_from_skb fails due to encountering an unknown GSO type. Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 12月, 2019 1 次提交
-
-
由 Mao Wenan 提交于
If __ethtool_get_link_ksettings() is failed and with non-zero value, prb_calc_retire_blk_tmo() should return DEFAULT_PRB_RETIRE_TOV firstly. This patch is to refactory code and make it more readable. Signed-off-by: NMao Wenan <maowenan@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 12月, 2019 1 次提交
-
-
由 Arnd Bergmann 提交于
The memory mapped packet socket data structure in version 1 through 3 all contain 32-bit second values for the packet time stamps, which makes them suffer from the overflow of time_t in y2038 or y2106 (depending on whether user space interprets the value as signed or unsigned). The implementation uses the deprecated getnstimeofday() function. In order to get rid of that, this changes the code to use ktime_get_real_ts64() as a replacement, documenting the nature of the overflow. As long as the user applications treat the timestamps as unsigned, or only use the difference between timestamps, they are fine, and changing the timestamps to 64-bit wouldn't require a more invasive user space API change. Note: a lot of other APIs suffer from incompatible structures when time_t gets redefined to 64-bit in 32-bit user space, but this one does not. Acked-by: NWillem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/lkml/CAF=yD-Jomr-gWSR-EBNKnSpFL46UeG564FLfqTCMNEm-prEaXA@mail.gmail.com/T/#uSigned-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 10 12月, 2019 1 次提交
-
-
由 Mao Wenan 提交于
There is softlockup when using TPACKET_V3: ... NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms! (__irq_svc) from [<c0558a0c>] (_raw_spin_unlock_irqrestore+0x44/0x54) (_raw_spin_unlock_irqrestore) from [<c027b7e8>] (mod_timer+0x210/0x25c) (mod_timer) from [<c0549c30>] (prb_retire_rx_blk_timer_expired+0x68/0x11c) (prb_retire_rx_blk_timer_expired) from [<c027a7ac>] (call_timer_fn+0x90/0x17c) (call_timer_fn) from [<c027ab6c>] (run_timer_softirq+0x2d4/0x2fc) (run_timer_softirq) from [<c021eaf4>] (__do_softirq+0x218/0x318) (__do_softirq) from [<c021eea0>] (irq_exit+0x88/0xac) (irq_exit) from [<c0240130>] (msa_irq_exit+0x11c/0x1d4) (msa_irq_exit) from [<c0209cf0>] (handle_IPI+0x650/0x7f4) (handle_IPI) from [<c02015bc>] (gic_handle_irq+0x108/0x118) (gic_handle_irq) from [<c0558ee4>] (__irq_usr+0x44/0x5c) ... If __ethtool_get_link_ksettings() is failed in prb_calc_retire_blk_tmo(), msec and tmo will be zero, so tov_in_jiffies is zero and the timer expire for retire_blk_timer is turn to mod_timer(&pkc->retire_blk_timer, jiffies + 0), which will trigger cpu usage of softirq is 100%. Fixes: f6fb8f10 ("af-packet: TPACKET_V3 flexible buffer implementation.") Tested-by: NXiao Jiangfeng <xiaojiangfeng@huawei.com> Signed-off-by: NMao Wenan <maowenan@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 11月, 2019 1 次提交
-
-
由 Eric Dumazet 提交于
KCSAN reported the following data-race [1] Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it. Since the report hinted about multiple cpus using the history concurrently, I added a test avoiding writing on it if the victim slot already contains the desired value. [1] BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1: fanout_flow_is_huge net/packet/af_packet.c:1303 [inline] fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453 deliver_skb net/core/dev.c:1888 [inline] dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958 xmit_one net/core/dev.c:3195 [inline] dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0x9f/0xc0 net/socket.c:657 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311 __sys_sendmmsg+0x123/0x350 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0: fanout_flow_is_huge net/packet/af_packet.c:1306 [inline] fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453 deliver_skb net/core/dev.c:1888 [inline] dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958 xmit_one net/core/dev.c:3195 [inline] dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0x9f/0xc0 net/socket.c:657 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311 __sys_sendmmsg+0x123/0x350 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 3b3a5b0a ("packet: rollover huge flows before small flows") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 10月, 2019 1 次提交
-
-
由 Florian Westphal 提交于
commit 174e2381 ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi recycle always drop skb extensions. The additional skb_ext_del() that is performed via nf_reset on napi skb recycle is not needed anymore. Most nf_reset() calls in the stack are there so queued skb won't block 'rmmod nf_conntrack' indefinitely. This removes the skb_ext_del from nf_reset, and renames it to a more fitting nf_reset_ct(). In a few selected places, add a call to skb_ext_reset to make sure that no active extensions remain. I am submitting this for "net", because we're still early in the release cycle. The patch applies to net-next too, but I think the rename causes needless divergence between those trees. Suggested-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 16 8月, 2019 1 次提交
-
-
由 Eric Dumazet 提交于
packet_sendmsg() checks tx_ring.pg_vec to decide if it must call tpacket_snd(). Problem is that the check is lockless, meaning another thread can issue a concurrent setsockopt(PACKET_TX_RING ) to flip tx_ring.pg_vec back to NULL. Given that tpacket_snd() grabs pg_vec_lock mutex, we can perform the check again to solve the race. syzbot reported : kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 11429 Comm: syz-executor394 Not tainted 5.3.0-rc4+ #101 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:packet_lookup_frame+0x8d/0x270 net/packet/af_packet.c:474 Code: c1 ee 03 f7 73 0c 80 3c 0e 00 0f 85 cb 01 00 00 48 8b 0b 89 c0 4c 8d 24 c1 48 b8 00 00 00 00 00 fc ff df 4c 89 e1 48 c1 e9 03 <80> 3c 01 00 0f 85 94 01 00 00 48 8d 7b 10 4d 8b 3c 24 48 b8 00 00 RSP: 0018:ffff88809f82f7b8 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffff8880a45c7030 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 1ffff110148b8e06 RDI: ffff8880a45c703c RBP: ffff88809f82f7e8 R08: ffff888087aea200 R09: fffffbfff134ae50 R10: fffffbfff134ae4f R11: ffffffff89a5727f R12: 0000000000000000 R13: 0000000000000001 R14: ffff8880a45c6ac0 R15: 0000000000000000 FS: 00007fa04716f700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa04716edb8 CR3: 0000000091eb4000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: packet_current_frame net/packet/af_packet.c:487 [inline] tpacket_snd net/packet/af_packet.c:2667 [inline] packet_sendmsg+0x590/0x6250 net/packet/af_packet.c:2975 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:657 ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311 __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439 do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 69e3c75f ("net: TX_RING and packet mmap") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 6月, 2019 1 次提交
-
-
由 Neil Horman 提交于
When an application is run that: a) Sets its scheduler to be SCHED_FIFO and b) Opens a memory mapped AF_PACKET socket, and sends frames with the MSG_DONTWAIT flag cleared, its possible for the application to hang forever in the kernel. This occurs because when waiting, the code in tpacket_snd calls schedule, which under normal circumstances allows other tasks to run, including ksoftirqd, which in some cases is responsible for freeing the transmitted skb (which in AF_PACKET calls a destructor that flips the status bit of the transmitted frame back to available, allowing the transmitting task to complete). However, when the calling application is SCHED_FIFO, its priority is such that the schedule call immediately places the task back on the cpu, preventing ksoftirqd from freeing the skb, which in turn prevents the transmitting task from detecting that the transmission is complete. We can fix this by converting the schedule call to a completion mechanism. By using a completion queue, we force the calling task, when it detects there are no more frames to send, to schedule itself off the cpu until such time as the last transmitted skb is freed, allowing forward progress to be made. Tested by myself and the reporter, with good results Change Notes: V1->V2: Enhance the sleep logic to support being interruptible and allowing for honoring to SK_SNDTIMEO (Willem de Bruijn) V2->V3: Rearrage the point at which we wait for the completion queue, to avoid needing to check for ph/skb being null at the end of the loop. Also move the complete call to the skb destructor to avoid needing to modify __packet_set_status. Also gate calling complete on packet_read_pending returning zero to avoid multiple calls to complete. (Willem de Bruijn) Move timeo computation within loop, to re-fetch the socket timeout since we also use the timeo variable to record the return code from the wait_for_complete call (Neil Horman) V3->V4: Willem has requested that the control flow be restored to the previous state. Doing so lets us eliminate the need for the po->wait_on_complete flag variable, and lets us get rid of the packet_next_frame function, but introduces another complexity. Specifically, but using the packet pending count, we can, if an applications calls sendmsg multiple times with MSG_DONTWAIT set, each set of transmitted frames, when complete, will cause tpacket_destruct_skb to issue a complete call, for which there will never be a wait_on_completion call. This imbalance will lead to any future call to wait_for_completion here to return early, when the frames they sent may not have completed. To correct this, we need to re-init the completion queue on every call to tpacket_snd before we enter the loop so as to ensure we wait properly for the frames we send in this iteration. Change the timeout and interrupted gotos to out_put rather than out_status so that we don't try to free a non-existant skb Clean up some extra newlines (Willem de Bruijn) Reviewed-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NNeil Horman <nhorman@tuxdriver.com> Reported-by: NMatteo Croce <mcroce@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 6月, 2019 1 次提交
-
-
由 Eric Dumazet 提交于
syzbot found we can leak memory in packet_set_ring(), if user application provides buggy parameters. Fixes: 7f953ab2 ("af_packet: TX_RING support for TPACKET_V3") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 6月, 2019 8 次提交
-
-
由 Eric Dumazet 提交于
There are two places where we want to clear the pressure if possible, add a helper to make it more obvious. Signed-off-by: NEric Dumazet <edumazet@google.com> Suggested-by: NWillem de Bruijn <willemb@google.com> Acked-by: NVinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
__packet_rcv_has_room() can now be run without lock being held. po->pressure is only a non persistent hint, we can mark all read/write accesses with READ_ONCE()/WRITE_ONCE() to document the fact that the field could be written without any synchronization. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
tpacket_rcv() can be hit under DDOS quite hard, since it will always grab a socket spinlock, to eventually find there is no room for an additional packet. Using tcpdump [1] on a busy host can lead to catastrophic consequences, because of all cpus spinning on a contended spinlock. This replicates a similar strategy used in packet_rcv() [1] Also some applications mistakenly use af_packet socket bound to ETH_P_ALL only to send packets. Receive queue is never drained and immediately full. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Under DDOS, we want to be able to increment tp_drops without touching the spinlock. This will help readers to drain the receive queue slightly faster :/ Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Goal is use the helper without lock being held. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Goal is to be able to use __tpacket_v3_has_room() without holding a lock. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Goal is to be able to use __tpacket_has_room() without holding a lock. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
struct packet_sock is only read. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-