- 12 2月, 2021 1 次提交
-
-
由 Tariq Toukan 提交于
Use a new config SOCK_RX_QUEUE_MAPPING to compile-in the socket RX queue field and logic, instead of the XPS config. This breaks dependency in XPS, and allows selecting it from non-XPS use cases, as we do in the next patch. In addition, use the new flag to wrap the logic in sk_rx_queue_get() and protect access to the sk_rx_queue_mapping field, while keeping the function exposed unconditionally, just like sk_rx_queue_set() and sk_rx_queue_clear(). Signed-off-by: NTariq Toukan <tariqt@nvidia.com> Reviewed-by: NMaxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 2月, 2021 1 次提交
-
-
由 Wei Wang 提交于
Currently, a percpu_counter with the default batch size (2*nr_cpus) is used to record the total # of active sockets per protocol. This means sk_sockets_allocated_read_positive() could be off by +/-2*(nr_cpus^2). This under/over-estimation could lead to wrong memory suppression conditions in __sk_raise_mem_allocated(). Fix this by using a more reasonable fixed batch size of 16. See related commit cf86a086 ("net/dst: use a smaller percpu_counter batch for dst entries accounting") that addresses a similar issue. Signed-off-by: NWei Wang <weiwan@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20210202193408.1171634-1-weiwan@google.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 20 1月, 2021 1 次提交
-
-
由 Yuchung Cheng 提交于
The previous commit 32efcc06 ("tcp: export count for rehash attempts") would mis-account rehashing SNMP and socket stats: a. During handshake of an active open, only counts the first SYN timeout b. After handshake of passive and active open, stop updating after (roughly) TCP_RETRIES1 recurring RTOs c. After the socket aborts, over count timeout_rehash by 1 This patch fixes this by checking the rehash result from sk_rethink_txhash. Fixes: 32efcc06 ("tcp: export count for rehash attempts") Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 01 12月, 2020 3 次提交
-
-
由 Paolo Abeni 提交于
This allows invoking an additional callback under the socket spin lock. Will be used by the next patches to avoid additional spin lock contention. Acked-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Reviewed-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Björn Töpel 提交于
This option lets a user set a per socket NAPI budget for busy-polling. If the options is not set, it will use the default of 8. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
-
由 Björn Töpel 提交于
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket option or system-wide using the /proc/sys/net/core/busy_read knob, is an opportunistic. That means that if the NAPI context is not scheduled, it will poll it. If, after busy-polling, the budget is exceeded the busy-polling logic will schedule the NAPI onto the regular softirq handling. One implication of the behavior above is that a busy/heavy loaded NAPI context will never enter/allow for busy-polling. Some applications prefer that most NAPI processing would be done by busy-polling. This series adds a new socket option, SO_PREFER_BUSY_POLL, that works in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced in commit 6f8b12d6 ("net: napi: add hard irqs deferral feature"), and allows for a user to defer interrupts to be enabled and instead schedule the NAPI context from a watchdog timer. When a user enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI context is being processed by a softirq, the softirq NAPI processing will exit early to allow the busy-polling to be performed. If the application stops performing busy-polling via a system call, the watchdog timer defined by gro_flush_timeout will timeout, and regular softirq handling will resume. In summary; Heavy traffic applications that prefer busy-polling over softirq processing should use this option. Example usage: $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout Note that the timeout should be larger than the userspace processing window, otherwise the watchdog will timeout and fall back to regular softirq processing. Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
-
- 21 11月, 2020 2 次提交
-
-
由 Randy Dunlap 提交于
Fix build of net/core/stream.o when CONFIG_INET is not enabled. Fixes these build errors (sample): ld: net/core/stream.o: in function `sk_stream_write_space': (.text+0x27e): undefined reference to `tcp_stream_memory_free' ld: (.text+0x29c): undefined reference to `tcp_stream_memory_free' ld: (.text+0x2ab): undefined reference to `tcp_stream_memory_free' ld: net/core/stream.o: in function `sk_stream_wait_memory': (.text+0x5a1): undefined reference to `tcp_stream_memory_free' ld: (.text+0x5bf): undefined reference to `tcp_stream_memory_free' Fixes: 1c5f2ced ("tcp: avoid indirect call to tcp_stream_memory_free()") Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Reported-by: NRandy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20201118194438.674-1-rdunlap@infradead.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Paolo Abeni 提交于
The static checker is fooled by the non-static locking scheme implemented by the mentioned helpers. Let's make its life easier adding some unconditional annotation so that the helpers are now interpreted as a plain spinlock from sparse. v1 -> v2: - add __releases() annotation to unlock_sock_fast() Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/6ed7ae627d8271fb7f20e0a9c6750fbba1ac2635.1605634911.git.pabeni@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 15 11月, 2020 1 次提交
-
-
由 Eric Dumazet 提交于
Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 03 11月, 2020 1 次提交
-
-
由 Jakub Kicinski 提交于
Currently, variables used only within lockdep expressions are flagged as unused, requiring that these variables' declarations be decorated with either #ifdef or __maybe_unused. This results in ugly code. This commit therefore causes the lockdep_sock_is_held() function to be visible even when lockdep is not enabled, thus removing the need for these decorations. This approach further relies on dead-code elimination to remove any references to functions or variables that are not available in non-lockdep kernels. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
-
- 25 9月, 2020 1 次提交
-
-
由 Geliang Tang 提交于
This patch added a new helper sk_stop_timer_sync, it deactivates a timer like sk_stop_timer, but waits for the handler to finish. Acked-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NGeliang Tang <geliangtang@gmail.com> Reviewed-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 9月, 2020 1 次提交
-
-
由 Eric Dumazet 提交于
SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state that remembers if some room has been made in the rtx queue by an incoming ACK packet. This is later used from tcp_check_space() before considering to send EPOLLOUT. Problem is: If we receive SACK packets, and no packet is removed from RTX queue, we can send fresh packets, thus moving them from write queue to rtx queue and eventually empty the write queue. This stall can happen if TCP_NOTSENT_LOWAT is used. With this fix, we no longer risk stalling sends while holes are repaired, and we can fully use socket sndbuf. This also removes a cache line dirtying for typical RPC workloads. Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 9月, 2020 1 次提交
-
-
由 Miaohe Lin 提交于
This is a pure codestyle cleanup patch. No functional change intended. Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 8月, 2020 1 次提交
-
-
由 KP Singh 提交于
A purely mechanical change to split the renaming from the actual generalization. Flags/consts: SK_STORAGE_CREATE_FLAG_MASK BPF_LOCAL_STORAGE_CREATE_FLAG_MASK BPF_SK_STORAGE_CACHE_SIZE BPF_LOCAL_STORAGE_CACHE_SIZE MAX_VALUE_SIZE BPF_LOCAL_STORAGE_MAX_VALUE_SIZE Structs: bucket bpf_local_storage_map_bucket bpf_sk_storage_map bpf_local_storage_map bpf_sk_storage_data bpf_local_storage_data bpf_sk_storage_elem bpf_local_storage_elem bpf_sk_storage bpf_local_storage The "sk" member in bpf_local_storage is also updated to "owner" in preparation for changing the type to void * in a subsequent patch. Functions: selem_linked_to_sk selem_linked_to_storage selem_alloc bpf_selem_alloc __selem_unlink_sk bpf_selem_unlink_storage_nolock __selem_link_sk bpf_selem_link_storage_nolock selem_unlink_sk __bpf_selem_unlink_storage sk_storage_update bpf_local_storage_update __sk_storage_lookup bpf_local_storage_lookup bpf_sk_storage_map_free bpf_local_storage_map_free bpf_sk_storage_map_alloc bpf_local_storage_map_alloc bpf_sk_storage_map_alloc_check bpf_local_storage_map_alloc_check bpf_sk_storage_map_check_btf bpf_local_storage_map_check_btf Signed-off-by: NKP Singh <kpsingh@google.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200825182919.1118197-2-kpsingh@chromium.org
-
- 06 8月, 2020 1 次提交
-
-
由 Alexander Aring 提交于
This patch adds a new socket helper function to set the mark value for a kernel socket. Signed-off-by: NAlexander Aring <aahringo@redhat.com> Signed-off-by: NDavid Teigland <teigland@redhat.com>
-
- 25 7月, 2020 2 次提交
-
-
由 Christoph Hellwig 提交于
Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of architecture specific code. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154] Acked-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Pass a sockptr_t to prepare for set_fs-less handling of the kernel pointer from bpf-cgroup. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 7月, 2020 3 次提交
-
-
由 Christoph Hellwig 提交于
Just check for a NULL method instead of wiring up sock_no_{get,set}sockopt. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NMarc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Handle the few cases that need special treatment in-line using in_compat_syscall(). This also removes all the now unused compat_{get,set}sockopt methods. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add the compat handling to sock_common_{get,set}sockopt instead, keyed of in_compat_syscall(). This allow to remove the now unused ->compat_{get,set}sockopt methods from struct proto_ops. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Acked-by: NStefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 7月, 2020 1 次提交
-
-
由 Kees Cook 提交于
Add missed sock updates to compat path via a new helper, which will be used more in coming patches. (The net/core/scm.c code is left as-is here to assist with -stable backports for the compat path.) Cc: Christoph Hellwig <hch@lst.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jakub Kicinski <kuba@kernel.org> Cc: stable@vger.kernel.org Fixes: 48a87cc2 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly") Fixes: d8429506 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly") Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Signed-off-by: NKees Cook <keescook@chromium.org>
-
- 10 7月, 2020 1 次提交
-
-
由 Martin KaFai Lau 提交于
bpf_sk_reuseport_detach is currently called when sk->sk_user_data is not NULL. It is incorrect because sk->sk_user_data may not be managed by the bpf's reuseport_array. It has been reported in [1] that, the bpf_sk_reuseport_detach() which is called from udp_lib_unhash() has corrupted the sk_user_data managed by l2tp. This patch solves it by using another bit (defined as SK_USER_DATA_BPF) of the sk_user_data pointer value. It marks that a sk_user_data is managed/owned by BPF. The patch depends on a PTRMASK introduced in commit f1ff5ce2 ("net, sk_msg: Clear sk_user_data pointer on clone if tagged"). [ Note: sk->sk_user_data is used by bpf's reuseport_array only when a sk is added to the bpf's reuseport_array. i.e. doing setsockopt(SO_REUSEPORT) and having "sk->sk_reuseport == 1" alone will not stop sk->sk_user_data being used by other means. ] [1]: https://lore.kernel.org/netdev/20200706121259.GA20199@katalix.com/ Fixes: 5dc4c4b7 ("bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY") Reported-by: NJames Chapman <jchapman@katalix.com> Reported-by: syzbot+9f092552ba9a5efca5df@syzkaller.appspotmail.com Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Tested-by: NJames Chapman <jchapman@katalix.com> Acked-by: NJames Chapman <jchapman@katalix.com> Link: https://lore.kernel.org/bpf/20200709061110.4019316-1-kafai@fb.com
-
- 25 6月, 2020 1 次提交
-
-
由 Dmitry Yakunin 提交于
This is preparation for usage in bpf_setsockopt. Signed-off-by: NDmitry Yakunin <zeil@yandex-team.ru> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200620153052.9439-1-zeil@yandex-team.ru
-
- 24 6月, 2020 1 次提交
-
-
由 Tariq Toukan 提交于
Clearing the sock TX queue in sk_set_socket() might cause unexpected out-of-order transmit when called from sock_orphan(), as outstanding packets can pick a different TX queue and bypass the ones already queued. This is undesired in general. More specifically, it breaks the in-order scheduling property guarantee for device-offloaded TLS sockets. Remove the call to sk_tx_queue_clear() in sk_set_socket(), and add it explicitly only where needed. Fixes: e022f0b4 ("net: Introduce sk_tx_queue_mapping") Signed-off-by: NTariq Toukan <tariqt@mellanox.com> Reviewed-by: NBoris Pismenny <borisp@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 6月, 2020 1 次提交
-
-
由 Ferenc Fejes 提交于
The sock_bindtoindex intended for kernel wide usage however it will lock the socket regardless of the context. This modification relax this behavior optionally: locking the socket will be optional by calling the sock_bindtoindex with lock_sk = true. The modification applied to all users of the sock_bindtoindex. Signed-off-by: NFerenc Fejes <fejes@inf.elte.hu> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu
-
- 30 5月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
The SCTP protocol allows to bind multiple address to a socket. That feature is currently only exposed as a socket option. Add a bind_add method struct proto that allows to bind additional addresses, and switch the dlm code to use the method instead of going through the socket option from kernel space. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 5月, 2020 9 次提交
-
-
由 Christoph Hellwig 提交于
Add a helper to directly set the SO_REUSEPORT sockopt from kernel space without going through a fake uaccess. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper to directly set the SO_RCVBUFFORCE sockopt from kernel space without going through a fake uaccess. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space without going through a fake uaccess. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper to directly enable timestamps instead of setting the SO_TIMESTAMP* sockopts from kernel space and going through a fake uaccess. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper to directly set the SO_BINDTOIFINDEX sockopt from kernel space without going through a fake uaccess. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper to directly set the SO_SNDTIMEO_NEW sockopt from kernel space without going through a fake uaccess. The interface is simplified to only pass the seconds value, as that is the only thing needed at the moment. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper to directly set the SO_PRIORITY sockopt from kernel space without going through a fake uaccess. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper to directly set the SO_LINGER sockopt from kernel space with onoff set to true and a linger time of 0 without going through a fake uaccess. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christoph Hellwig 提交于
Add a helper to directly set the SO_REUSEADDR sockopt from kernel space without going through a fake uaccess. For this the iscsi target now has to formally depend on inet to avoid a mostly theoretical compile failure. For actual operation it already did depend on having ipv4 or ipv6 support. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 4月, 2020 1 次提交
-
-
由 Lothar Rubusch 提交于
Fix warnings related to kernel-doc notation, and wording in function description. Signed-off-by: NLothar Rubusch <l.rubusch@gmail.com> Acked-by: NRandy Dunlap <rdunlap@infradead.org> Tested-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 3月, 2020 3 次提交
-
-
由 Joe Stringer 提交于
Avoid taking a reference on listen sockets by checking the socket type in the sk_assign and in the corresponding skb_steal_sock() code in the the transport layer, and by ensuring that the prefetch free (sock_pfree) function uses the same logic to check whether the socket is refcounted. Suggested-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NJoe Stringer <joe@wand.net.nz> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
-
由 Joe Stringer 提交于
Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make the determination of whether the socket is reference counted in the case where it is prefetched by earlier logic such as early_demux. Signed-off-by: NJoe Stringer <joe@wand.net.nz> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
-
由 Joe Stringer 提交于
Add support for TPROXY via a new bpf helper, bpf_sk_assign(). This helper requires the BPF program to discover the socket via a call to bpf_sk*_lookup_*(), then pass this socket to the new helper. The helper takes its own reference to the socket in addition to any existing reference that may or may not currently be obtained for the duration of BPF processing. For the destination socket to receive the traffic, the traffic must be routed towards that socket via local route. The simplest example route is below, but in practice you may want to route traffic more narrowly (eg by CIDR): $ ip route add local default dev lo This patch avoids trying to introduce an extra bit into the skb->sk, as that would require more invasive changes to all code interacting with the socket to ensure that the bit is handled correctly, such as all error-handling cases along the path from the helper in BPF through to the orphan path in the input. Instead, we opt to use the destructor variable to switch on the prefetch of the socket. Signed-off-by: NJoe Stringer <joe@wand.net.nz> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
-
- 22 2月, 2020 1 次提交
-
-
由 Jakub Sitnicki 提交于
sk_user_data can hold a pointer to an object that is not intended to be shared between the parent socket and the child that gets a pointer copy on clone. This is the case when sk_user_data points at reference-counted object, like struct sk_psock. One way to resolve it is to tag the pointer with a no-copy flag by repurposing its lowest bit. Based on the bit-flag value we clear the child sk_user_data pointer after cloning the parent socket. The no-copy flag is stored in the pointer itself as opposed to externally, say in socket flags, to guarantee that the pointer and the flag are copied from parent to child socket in an atomic fashion. Parent socket state is subject to change while copying, we don't hold any locks at that time. This approach relies on an assumption that sk_user_data holds a pointer to an object aligned at least 2 bytes. A manual audit of existing users of rcu_dereference_sk_user_data helper confirms our assumption. Also, an RCU-protected sk_user_data is not likely to hold a pointer to a char value or a pathological case of "struct { char c; }". To be safe, warn when the flag-bit is set when setting sk_user_data to catch any future misuses. It is worth considering why clearing sk_user_data unconditionally is not an option. There exist users, DRBD, NVMe, and Xen drivers being among them, that rely on the pointer being copied when cloning the listening socket. Potentially we could distinguish these users by checking if the listening socket has been created in kernel-space via sock_create_kern, and hence has sk_kern_sock flag set. However, this is not the case for NVMe and Xen drivers, which create sockets without marking them as belonging to the kernel. Signed-off-by: NJakub Sitnicki <jakub@cloudflare.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
-