- 03 12月, 2020 1 次提交
-
-
由 Trond Myklebust 提交于
Currently, we wake up the tasks by priority queue ordering, which means that we ignore the batching that is supposed to help with QoS issues. Fixes: c049f8ea ("SUNRPC: Remove the bh-safe lock requirement on the rpc_wait_queue->lock") Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
-
- 02 12月, 2020 1 次提交
-
-
由 Chuck Lever 提交于
There's no need to defer allocation of pages for the receive buffer. - This upcall is quite infrequent - gssp_alloc_receive_pages() can allocate the pages with GFP_KERNEL, unlike the transport - gssp_alloc_receive_pages() knows exactly how many pages are needed Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NOlga Kornievskaia <aglo@umich.edu> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 28 11月, 2020 3 次提交
-
-
由 Willem de Bruijn 提交于
When setting sk_err, set it to ee_errno, not ee_origin. Commit f5f99309 ("sock: do not set sk_err in sock_dequeue_err_skb") disabled updating sk_err on errq dequeue, which is correct for most error types (origins): - sk->sk_err = err; Commit 38b25793 ("sock: reset sk_err when the error queue is empty") reenabled the behavior for IMCP origins, which do require it: + if (icmp_next) + sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin; But read from ee_errno. Fixes: 38b25793 ("sock: reset sk_err when the error queue is empty") Reported-by: NAyush Ranjan <ayushranjan@google.com> Signed-off-by: NWillem de Bruijn <willemb@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Paolo Abeni 提交于
If an msk listener receives an MPJ carrying an invalid token, it will zero the request socket msk entry. That should later cause fallback and subflow reset - as per RFC - at subflow_syn_recv_sock() time due to failing hmac validation. Since commit 4cf8b7e4 ("subflow: introduce and use mptcp_can_accept_new_subflow()"), we unconditionally dereference - in mptcp_can_accept_new_subflow - the subflow request msk before performing hmac validation. In the above scenario we hit a NULL ptr dereference. Address the issue doing the hmac validation earlier. Fixes: 4cf8b7e4 ("subflow: introduce and use mptcp_can_accept_new_subflow()") Tested-by: NDavide Caratti <dcaratti@redhat.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Reviewed-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/03b2cfa3ac80d8fc18272edc6442a9ddf0b1e34e.1606400227.git.pabeni@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Eelco Chaudron 提交于
Currently, the openvswitch module is not accepting the correctly formated netlink message for the TTL decrement action. For both setting and getting the dec_ttl action, the actions should be nested in the OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi. When the original patch was sent, it was tested with a private OVS userspace implementation. This implementation was unfortunately not upstreamed and reviewed, hence an erroneous version of this patch was sent out. Leaving the patch as-is would cause problems as the kernel module could interpret additional attributes as actions and vice-versa, due to the actions not being encapsulated/nested within the actual attribute, but being concatinated after it. Fixes: 744676e7 ("openvswitch: add TTL decrement action") Signed-off-by: NEelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/160622121495.27296.888010441924340582.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 27 11月, 2020 1 次提交
-
-
由 Oliver Hartkopp 提交于
To detect potential bugs in CAN protocol implementations (double removal of receiver entries) a WARN() statement has been used if no matching list item was found for removal. The fault injection issued by syzkaller was able to create a situation where the closing of a socket runs simultaneously to the notifier call chain for removing the CAN network device in use. This case is very unlikely in real life but it doesn't break anything. Therefore we just replace the WARN() statement with pr_warn() to preserve the notification for the CAN protocol development. Reported-by: syzbot+381d06e0c8eaacb8706f@syzkaller.appspotmail.com Reported-by: syzbot+d0ddd88c9a7432f041e6@syzkaller.appspotmail.com Reported-by: syzbot+76d62d3b8162883c7d11@syzkaller.appspotmail.com Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20201126192140.14350-1-socketcan@hartkopp.netSigned-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
-
- 26 11月, 2020 5 次提交
-
-
由 Maxim Mikityanskiy 提交于
tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after calling tls_dev_del if TLX TX offload is also enabled. Clearing tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a time frame when tls_device_down may get called and call tls_dev_del for RX one extra time, confusing the driver, which may lead to a crash. This patch corrects this racy behavior by adding a flag to prevent tls_device_down from calling tls_dev_del the second time. Fixes: e8f69799 ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: NMaxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: NSaeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Parav Pandit 提交于
When devlink reload operation is not used, netdev of an Ethernet port may be present in different net namespace than the net namespace of the devlink instance. Ensure that both the devlink instance and devlink port netdev are located in same net namespace. Fixes: 070c63f2 ("net: devlink: allow to change namespaces during reload") Signed-off-by: NParav Pandit <parav@nvidia.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Parav Pandit 提交于
A netdevice of a devlink port can be moved to different net namespace than its parent devlink instance. This scenario occurs when devlink reload is not used. When netdevice is undergoing migration to net namespace, its ifindex and name may change. In such use case, devlink port query may read stale netdev attributes. Fix it by reading them under rtnl lock. Fixes: bfcd3a46 ("Introduce devlink infrastructure") Signed-off-by: NParav Pandit <parav@nvidia.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Eric Dumazet 提交于
After cited commit, gro_cells_destroy() became damn slow on hosts with a lot of cores. This is because we have one additional synchronize_net() per cpu as stated in the changelog. gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough to not have one synchronize_net() call per netif_napi_del() We can factorize all the synchronize_net() to a single one, right before freeing per-cpu memory. Fixes: 5198d545 ("net: remove napi_hash_del() from driver-facing API") Signed-off-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201124203822.1360107-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Wang Hai 提交于
kmemleak report a memory leak as follows: unreferenced object 0xffff8880059c6a00 (size 64): comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s) hex dump (first 32 bytes): 20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00 ............... 1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00 ................ backtrace: [<00000000aa4e7a87>] ip6addrlbl_add+0x90/0xbb0 [<0000000070b8d7f1>] ip6addrlbl_net_init+0x109/0x170 [<000000006a9ca9d4>] ops_init+0xa8/0x3c0 [<000000002da57bf2>] setup_net+0x2de/0x7e0 [<000000004e52d573>] copy_net_ns+0x27d/0x530 [<00000000b07ae2b4>] create_new_namespaces+0x382/0xa30 [<000000003b76d36f>] unshare_nsproxy_namespaces+0xa1/0x1d0 [<0000000030653721>] ksys_unshare+0x3a4/0x780 [<0000000007e82e40>] __x64_sys_unshare+0x2d/0x40 [<0000000031a10c08>] do_syscall_64+0x33/0x40 [<0000000099df30e7>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 We should free all rules when we catch an error in ip6addrlbl_net_init(). otherwise a memory leak will occur. Fixes: 2a8cc6c8 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NWang Hai <wanghai38@huawei.com> Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 25 11月, 2020 2 次提交
-
-
由 Alexander Duyck 提交于
When a BPF program is used to select between a type of TCP congestion control algorithm that uses either ECN or not there is a case where the synack for the frame was coming up without the ECT0 bit set. A bit of research found that this was due to the final socket being configured to dctcp while the listener socket was staying in cubic. To reproduce it all that is needed is to monitor TCP traffic while running the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed, assuming tcp_dctcp module is loaded or compiled in and the traffic matches the rules in the sample file, is that for all frames with the exception of the synack the ECT0 bit is set. To address that it is necessary to make one additional call to tcp_bpf_ca_needs_ecn using the request socket and then use the output of that to set the ECT0 bit for the tos/tclass of the packet. Fixes: 91b5b21c ("bpf: Add support for changing congestion control") Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomainSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Moshe Shemesh 提交于
Fix reload stats structure exposed to the user. Change stats structure hierarchy to have the reload action as a parent of the stat entry and then stat entry includes value per limit. This will also help to avoid string concatenation on iproute2 output. Reload stats structure before this fix: "stats": { "reload": { "driver_reinit": 2, "fw_activate": 1, "fw_activate_no_reset": 0 } } After this fix: "stats": { "reload": { "driver_reinit": { "unspecified": 2 }, "fw_activate": { "unspecified": 1, "no_reset": 0 } } Fixes: a254c264 ("devlink: Add reload stats") Signed-off-by: NMoshe Shemesh <moshe@mellanox.com> Reviewed-by: NJiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 24 11月, 2020 3 次提交
-
-
由 Eyal Birger 提交于
In the patchset merged by commit b9fcf0a0 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which did not have header_ops were given one for the purpose of protocol parsing on af_packet transmit path. That change made af_packet receive path regard these devices as having a visible L3 header and therefore aligned incoming skb->data to point to the skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not reset their mac_header prior to ingress and therefore their incoming packets became malformed. Ideally these devices would reset their mac headers, or af_packet would be able to rely on dev->hard_header_len being 0 for such cases, but it seems this is not the case. Fix by changing af_packet RX ll visibility criteria to include the existence of a '.create()' header operation, which is used when creating a device hard header - via dev_hard_header() - by upper layers, and does not exist in these L3 devices. As this predicate may be useful in other situations, add it as a common dev_has_header() helper in netdevice.h. Fixes: b9fcf0a0 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'") Signed-off-by: NEyal Birger <eyal.birger@gmail.com> Acked-by: NJason A. Donenfeld <Jason@zx2c4.com> Acked-by: NWillem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Stefano Garzarella 提交于
Starting from commit 8692cefc ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt"), we discard packets in virtio_transport_recv_pkt() if the socket has been released. When the socket is connected, we schedule a delayed work to wait the RST packet from the other peer, also if SHUTDOWN_MASK is set in sk->sk_shutdown. This is done to complete the virtio-vsock shutdown algorithm, releasing the port assigned to the socket definitively only when the other peer has consumed all the packets. If we discard the RST packet received, the socket will be closed only when the VSOCK_CLOSE_TIMEOUT is reached. Sergio discovered the issue while running ab(1) HTTP benchmark using libkrun [1] and observing a latency increase with that commit. To avoid this issue, we discard packet only if the socket is really closed (SOCK_DONE flag is set). We also set SOCK_DONE in virtio_transport_release() when we don't need to wait any packets from the other peer (we didn't schedule the delayed work). In this case we remove the socket from the vsock lists, releasing the port assigned. [1] https://github.com/containers/libkrun Fixes: 8692cefc ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt") Cc: justin.he@arm.com Reported-by: NSergio Lopez <slp@redhat.com> Tested-by: NSergio Lopez <slp@redhat.com> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Acked-by: NJia He <justin.he@arm.com> Link: https://lore.kernel.org/r/20201120104736.73749-1-sgarzare@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Ricardo Dias 提交于
When the TCP stack is in SYN flood mode, the server child socket is created from the SYN cookie received in a TCP packet with the ACK flag set. The child socket is created when the server receives the first TCP packet with a valid SYN cookie from the client. Usually, this packet corresponds to the final step of the TCP 3-way handshake, the ACK packet. But is also possible to receive a valid SYN cookie from the first TCP data packet sent by the client, and thus create a child socket from that SYN cookie. Since a client socket is ready to send data as soon as it receives the SYN+ACK packet from the server, the client can send the ACK packet (sent by the TCP stack code), and the first data packet (sent by the userspace program) almost at the same time, and thus the server will equally receive the two TCP packets with valid SYN cookies almost at the same instant. When such event happens, the TCP stack code has a race condition that occurs between the momement a lookup is done to the established connections hashtable to check for the existence of a connection for the same client, and the moment that the child socket is added to the established connections hashtable. As a consequence, this race condition can lead to a situation where we add two child sockets to the established connections hashtable and deliver two sockets to the userspace program to the same client. This patch fixes the race condition by checking if an existing child socket exists for the same client when we are adding the second child socket to the established connections socket. If an existing child socket exists, we drop the packet and discard the second child socket to the same client. Signed-off-by: NRicardo Dias <rdias@singlestore.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lanSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 22 11月, 2020 1 次提交
-
-
由 Julian Wiedmann 提交于
Child sockets erroneously inherit their parent's sk_type (ie. SOCK_*), instead of the PF_IUCV protocol that the parent was created with in iucv_sock_create(). We're currently not using sk->sk_protocol ourselves, so this shouldn't have much impact (except eg. getting the output in skb_dump() right). Fixes: eac3731b ("[S390]: Add AF_IUCV socket support") Signed-off-by: NJulian Wiedmann <jwi@linux.ibm.com> Link: https://lore.kernel.org/r/20201120100657.34407-1-jwi@linux.ibm.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 21 11月, 2020 4 次提交
-
-
由 Alexander Duyck 提交于
When setting congestion control via a BPF program it is seen that the SYN/ACK for packets within a given flow will not include the ECT0 flag. A bit of simple printk debugging shows that when this is configured without BPF we will see the value INET_ECN_xmit value initialized in tcp_assign_congestion_control however when we configure this via BPF the socket is in the closed state and as such it isn't configured, and I do not see it being initialized when we transition the socket into the listen state. The result of this is that the ECT0 bit is configured based on whatever the default state is for the socket. Any easy way to reproduce this is to monitor the following with tcpdump: tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca Without this patch the SYN/ACK will follow whatever the default is. If dctcp all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0 will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK bit matches the value seen on the other packets in the given stream. Fixes: 91b5b21c ("bpf: Add support for changing congestion control") Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Alexander Duyck 提交于
An issue was recently found where DCTCP SYN/ACK packets did not have the ECT bit set in the L3 header. A bit of code review found that the recent change referenced below had gone though and added a mask that prevented the ECN bits from being populated in the L3 header. This patch addresses that by rolling back the mask so that it is only applied to the flags coming from the incoming TCP request instead of applying it to the socket tos/tclass field. Doing this the ECT bits were restored in the SYN/ACK packets in my testing. One thing that is not addressed by this patch set is the fact that tcp_reflect_tos appears to be incompatible with ECN based congestion avoidance algorithms. At a minimum the feature should likely be documented which it currently isn't. Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket") Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com> Acked-by: NWei Wang <weiwan@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Vadim Fedorenko 提交于
In case when tcp socket received FIN after some data and the parser haven't started before reading data caller will receive an empty buffer. This behavior differs from plain TCP socket and leads to special treating in user-space. The flow that triggers the race is simple. Server sends small amount of data right after the connection is configured to use TLS and closes the connection. In this case receiver sees TLS Handshake data, configures TLS socket right after Change Cipher Spec record. While the configuration is in process, TCP socket receives small Application Data record, Encrypted Alert record and FIN packet. So the TCP socket changes sk_shutdown to RCV_SHUTDOWN and sk_flag with SK_DONE bit set. The received data is not parsed upon arrival and is never sent to user-space. Patch unpauses parser directly if we have unparsed data in tcp receive queue. Fixes: fcf4793e ("tls: check RCV_SHUTDOWN in tls_wait_data") Signed-off-by: NVadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/1605801588-12236-1-git-send-email-vfedorenko@novek.ruSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Anmol Karn 提交于
rose_send_frame() dereferences `neigh->dev` when called from rose_transmit_clear_request(), and the first occurrence of the `neigh` is in rose_loopback_timer() as `rose_loopback_neigh`, and it is initialized in rose_add_loopback_neigh() as NULL. i.e when `rose_loopback_neigh` used in rose_loopback_timer() its `->dev` was still NULL and rose_loopback_timer() was calling rose_rx_call_request() without checking for NULL. - net/rose/rose_link.c This bug seems to get triggered in this line: rose_call = (ax25_address *)neigh->dev->dev_addr; Fix it by adding NULL checking for `rose_loopback_neigh->dev` in rose_loopback_timer(). Fixes: 1da177e4 ("Linux-2.6.12-rc2") Suggested-by: NJakub Kicinski <kuba@kernel.org> Reported-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com Tested-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=9d2a7ca8c7f2e4b682c97578dfa3f236258300b3Signed-off-by: NAnmol Karn <anmol.karan123@gmail.com> Link: https://lore.kernel.org/r/20201119191043.28813-1-anmol.karan123@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 20 11月, 2020 3 次提交
-
-
由 Karsten Graul 提交于
Sparse complaints 3 times about: net/smc/smc_ib.c:203:52: warning: incorrect type in argument 1 (different address spaces) net/smc/smc_ib.c:203:52: expected struct net_device const *dev net/smc/smc_ib.c:203:52: got struct net_device [noderef] __rcu *const ndev Fix that by using the existing and validated ndev variable instead of accessing attr->ndev directly. Fixes: 5102eca9 ("net/smc: Use rdma_read_gid_l2_fields to L2 fields") Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Karsten Graul 提交于
With the multi-subnet support of SMC-Dv2 the match for existing link groups should not include the vlanid of the network device. Set ini->smcd_version accordingly before the call to smc_conn_create() and use this value in smc_conn_create() to skip the vlanid check. Fixes: 5c21c4cc ("net/smc: determine accepted ISM devices") Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Georg Kohmann 提交于
IPV6=m NF_DEFRAG_IPV6=y ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function `nf_ct_frag6_gather': net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to `ipv6_frag_thdr_truncated' Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This dependency is forcing IPV6=y. Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This is the same solution as used with a similar issues: Referring to commit 70b095c8 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module") Fixes: 9d9e937b ("ipv6/netfilter: Discard first fragment not including all headers") Reported-by: NRandy Dunlap <rdunlap@infradead.org> Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NGeorg Kohmann <geokohma@cisco.com> Acked-by: NPablo Neira Ayuso <pablo@netfilter.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 19 11月, 2020 2 次提交
-
-
由 Florian Fainelli 提交于
DSA network devices rely on having their DSA management interface up and running otherwise their ndo_open() will return -ENETDOWN. Without doing this it would not be possible to use DSA devices as netconsole when configured on the command line. These devices also do not utilize the upper/lower linking so the check about the netpoll device having upper is not going to be a problem. The solution adopted here is identical to the one done for net/ipv4/ipconfig.c with 728c0208 ("net: ipv4: handle DSA enabled master network devices"), with the network namespace scope being restricted to that of the process configuring netpoll. Fixes: 04ff53f9 ("net: dsa: Add netconsole support") Tested-by: NVladimir Oltean <olteanv@gmail.com> Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20201117035236.22658-1-f.fainelli@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Zhang Changzhong 提交于
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 1da177e4 ("Linux-2.6.12-rc2") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NZhang Changzhong <zhangchangzhong@huawei.com> Link: https://lore.kernel.org/r/1605581105-35295-1-git-send-email-zhangchangzhong@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 18 11月, 2020 10 次提交
-
-
由 Florian Klink 提交于
Checking for ifdef CONFIG_x fails if CONFIG_x=m. Use IS_ENABLED instead, which is true for both built-ins and modules. Otherwise, a > ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1 fails with the message "Error: IPv6 support not enabled in kernel." if CONFIG_IPV6 is `m`. In the spirit of b8127113. Fixes: d1566268 ("ipv4: Allow ipv6 gateway with ipv4 routes") Cc: Kim Phillips <kim.phillips@arm.com> Signed-off-by: NFlorian Klink <flokli@flokli.de> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20201115224509.2020651-1-flokli@flokli.deSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Wang Hai 提交于
nlmsg_cancel() needs to be called in the error path of inet_req_diag_fill to cancel the message. Fixes: d545caca ("net: inet: diag: expose the socket mark to privileged processes.") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NWang Hai <wanghai38@huawei.com> Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 John Fastabend 提交于
When skb has a frag_list its possible for skb_to_sgvec() to fail. This happens when the scatterlist has fewer elements to store pages than would be needed for the initial skb plus any of its frags. This case appears rare, but is possible when running an RX parser/verdict programs exposed to the internet. Currently, when this happens we throw an error, break the pipe, and kfree the msg. This effectively breaks the application or forces it to do a retry. Lets catch this case and handle it by doing an skb_linearize() on any skb we receive with frags. At this point skb_to_sgvec should not fail because the failing conditions would require frags to be in place. Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556576837.73229.14800682790808797635.stgit@john-XPS-13-9370
-
由 John Fastabend 提交于
If the skb_verdict_prog redirects an skb knowingly to itself, fix your BPF program this is not optimal and an abuse of the API please use SK_PASS. That said there may be cases, such as socket load balancing, where picking the socket is hashed based or otherwise picks the same socket it was received on in some rare cases. If this happens we don't want to confuse userspace giving them an EAGAIN error if we can avoid it. To avoid double accounting in these cases. At the moment even if the skb has already been charged against the sockets rcvbuf and forward alloc we check it again and do set_owner_r() causing it to be orphaned and recharged. For one this is useless work, but more importantly we can have a case where the skb could be put on the ingress queue, but because we are under memory pressure we return EAGAIN. The trouble here is the skb has already been accounted for so any rcvbuf checks include the memory associated with the packet already. This rolls up and can result in unnecessary EAGAIN errors in userspace read() calls. Fix by doing an unlikely check and skipping checks if skb->sk == sk. Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556574804.73229.11328201020039674147.stgit@john-XPS-13-9370
-
由 John Fastabend 提交于
If a socket redirects to itself and it is under memory pressure it is possible to get a socket stuck so that recv() returns EAGAIN and the socket can not advance for some time. This happens because when redirecting a skb to the same socket we received the skb on we first check if it is OK to enqueue the skb on the receiving socket by checking memory limits. But, if the skb is itself the object holding the memory needed to enqueue the skb we will keep retrying from kernel side and always fail with EAGAIN. Then userspace will get a recv() EAGAIN error if there are no skbs in the psock ingress queue. This will continue until either some skbs get kfree'd causing the memory pressure to reduce far enough that we can enqueue the pending packet or the socket is destroyed. In some cases its possible to get a socket stuck for a noticeable amount of time if the socket is only receiving skbs from sk_skb verdict programs. To reproduce I make the socket memory limits ridiculously low so sockets are always under memory pressure. More often though if under memory pressure it looks like a spurious EAGAIN error on user space side causing userspace to retry and typically enough has moved on the memory side that it works. To fix skip memory checks and skb_orphan if receiving on the same sock as already assigned. For SK_PASS cases this is easy, its always the same socket so we can just omit the orphan/set_owner pair. For backlog cases we need to check skb->sk and decide if the orphan and set_owner pair are needed. Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370
-
由 John Fastabend 提交于
We use skb->size with sk_rmem_scheduled() which is not correct. Instead use truesize to align with socket and tcp stack usage of sk_rmem_schedule. Suggested-by: NDaniel Borkman <daniel@iogearbox.net> Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556570616.73229.17003722112077507863.stgit@john-XPS-13-9370
-
由 John Fastabend 提交于
Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This allows users to tune SO_RCVBUF and sockmap will honor them. We can refactor the if(charge) case out in later patches. But, keep this fix to the point. Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Suggested-by: NJakub Sitnicki <jakub@cloudflare.com> Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370
-
由 John Fastabend 提交于
If copy_page_to_iter() fails or even partially completes, but with fewer bytes copied than expected we currently reset sg.start and return EFAULT. This proves problematic if we already copied data into the user buffer before we return an error. Because we leave the copied data in the user buffer and fail to unwind the scatterlist so kernel side believes data has been copied and user side believes data has _not_ been received. Expected behavior should be to return number of bytes copied and then on the next read we need to return the error assuming its still there. This can happen if we have a copy length spanning multiple scatterlist elements and one or more complete before the error is hit. The error is rare enough though that my normal testing with server side programs, such as nginx, httpd, envoy, etc., I have never seen this. The only reliable way to reproduce that I've found is to stream movies over my browser for a day or so and wait for it to hang. Not very scientific, but with a few extra WARN_ON()s in the code the bug was obvious. When we review the errors from copy_page_to_iter() it seems we are hitting a page fault from copy_page_to_iter_iovec() where the code checks fault_in_pages_writeable(buf, copy) where buf is the user buffer. It also seems typical server applications don't hit this case. The other way to try and reproduce this is run the sockmap selftest tool test_sockmap with data verification enabled, but it doesn't reproduce the fault. Perhaps we can trigger this case artificially somehow from the test tools. I haven't sorted out a way to do that yet though. Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
-
由 Tariq Toukan 提交于
In async_resync mode, we log the TCP seq of records until the async request is completed. Later, in case one of the logged seqs matches the resync request, we return it, together with its record serial number. Before this fix, we mistakenly returned the serial number of the current record instead. Fixes: ed9b7646 ("net/tls: Add asynchronous resync") Signed-off-by: NTariq Toukan <tariqt@nvidia.com> Reviewed-by: NBoris Pismenny <borisp@nvidia.com> Link: https://lore.kernel.org/r/20201115131448.2702-1-tariqt@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Ryan Sharpelletti 提交于
During loss recovery, retransmitted packets are forced to use TCP timestamps to calculate the RTT samples, which have a millisecond granularity. BBR is designed using a microsecond granularity. As a result, multiple RTT samples could be truncated to the same RTT value during loss recovery. This is problematic, as BBR will not enter PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning that if there are persistent losses, PROBE_RTT will constantly be pushed off and potentially never re-entered. This patch makes sure that BBR enters PROBE_RTT by checking if RTT sample is < the current min_rtt sample, rather than <=. The Netflix transport/TCP team discovered this bug in the Linux TCP BBR code during lab tests. Fixes: 0f8782ea ("tcp_bbr: add BBR congestion control") Signed-off-by: NRyan Sharpelletti <sharpelletti@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 17 11月, 2020 3 次提交
-
-
由 Vadim Fedorenko 提交于
If tcp socket has more data than Encrypted Handshake Message then tls_sw_recvmsg will try to decrypt next record instead of returning full control message to userspace as mentioned in comment. The next message - usually Application Data - gets corrupted because it uses zero copy for decryption that's why the data is not stored in skb for next iteration. Revert check to not decrypt next record if current is not Application Data. Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records") Signed-off-by: NVadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/1605413760-21153-1-git-send-email-vfedorenko@novek.ruSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Heiner Kallweit 提交于
In br_forward.c and br_input.c fields dev->stats.tx_dropped and dev->stats.multicast are populated, but they are ignored in ndo_get_stats64. Fixes: 28172739 ("net: fix 64 bit counters on 32 bit arches") Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/58ea9963-77ad-a7cf-8dfd-fc95ab95f606@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Georg Kohmann 提交于
Packets are processed even though the first fragment don't include all headers through the upper layer header. This breaks TAHI IPv6 Core Conformance Test v6LC.1.3.6. Referring to RFC8200 SECTION 4.5: "If the first fragment does not include all headers through an Upper-Layer header, then that fragment should be discarded and an ICMP Parameter Problem, Code 3, message should be sent to the source of the fragment, with the Pointer field set to zero." The fragment needs to be validated the same way it is done in commit 2efdaaaf ("IPv6: reply ICMP error if the first fragment don't include all headers") for ipv6. Wrap the validation into a common function, ipv6_frag_thdr_truncated() to check for truncation in the upper layer header. This validation does not fullfill all aspects of RFC 8200, section 4.5, but is at the moment sufficient to pass mentioned TAHI test. In netfilter, utilize the fragment offset returned by find_prev_fhdr() to let ipv6_frag_thdr_truncated() start it's traverse from the fragment header. Return 0 to drop the fragment in the netfilter. This is the same behaviour as used on other protocol errors in this function, e.g. when nf_ct_frag6_queue() returns -EPROTO. The Fragment will later be picked up by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an appropriate ICMP Parameter Problem message back to the source. References commit 2efdaaaf ("IPv6: reply ICMP error if the first fragment don't include all headers") Signed-off-by: NGeorg Kohmann <geokohma@cisco.com> Acked-by: NPablo Neira Ayuso <pablo@netfilter.org> Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 16 11月, 2020 1 次提交
-
-
由 Anant Thazhemadam 提交于
In canfd_rcv(), cfd->len is uninitialized when skb->len = 0, and this uninitialized cfd->len is accessed nonetheless by pr_warn_once(). Fix this uninitialized variable access by checking cfd->len's validity condition (cfd->len > CANFD_MAX_DLEN) separately after the skb->len's condition is checked, and appropriately modify the log messages that are generated as well. In case either of the required conditions fail, the skb is freed and NET_RX_DROP is returned, same as before. Fixes: d4689846 ("can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once") Reported-by: syzbot+9bcb0c9409066696d3aa@syzkaller.appspotmail.com Tested-by: NAnant Thazhemadam <anant.thazhemadam@gmail.com> Signed-off-by: NAnant Thazhemadam <anant.thazhemadam@gmail.com> Link: https://lore.kernel.org/r/20201103213906.24219-3-anant.thazhemadam@gmail.comSigned-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>
-