- 24 8月, 2022 2 次提交
-
-
由 Kuniyuki Iwashima 提交于
While reading sysctl_optmem_max, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Kuniyuki Iwashima 提交于
While reading sysctl_[rw]mem_(max|default), they can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 7月, 2022 1 次提交
-
-
由 Jakub Kicinski 提交于
We continuously hold the socket lock during large reads and writes. This may inflate RTT and negatively impact TCP performance. Flush the backlog periodically. I tried to pick a flush period (128kB) which gives significant benefit but the max Bps rate is not yet visibly impacted. Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 6月, 2022 1 次提交
-
-
由 Eric Dumazet 提交于
sk_memory_allocated_add() has three callers, and returns to them @memory_allocated. sk_forced_mem_schedule() is one of them, and ignores the returned value. Change sk_memory_allocated_add() to return void. Change sock_reserve_memory() and __sk_mem_raise_allocated() to call sk_memory_allocated(). This removes one cache line miss [1] for RPC workloads, as first skbs in TCP write queue and receive queue go through sk_forced_mem_schedule(). [1] Cache line holding tcp_memory_allocated. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 6月, 2022 3 次提交
-
-
由 Eric Dumazet 提交于
These two helpers are only used from core networking. Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Eric Dumazet 提交于
Each protocol having a ->memory_allocated pointer gets a corresponding per-cpu reserve, that following patches will use. Instead of having reserved bytes per socket, we want to have per-cpu reserves. Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Eric Dumazet 提交于
Due to memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE. This might change in the future, but it seems better to avoid the confusion. Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 10 6月, 2022 1 次提交
-
-
由 Eric Dumazet 提交于
Check against skb dst in socket backlog has never triggered in past years. Keep the check omly for CONFIG_DEBUG_NET=y builds. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 16 5月, 2022 2 次提交
-
-
由 Eric Dumazet 提交于
sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val), because other cpus/threads might locklessly read this field. sock_getbindtodevice(), sock_getsockopt() need READ_ONCE() because they run without socket lock held. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was because that was the existing limits of IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE. v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE. Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 5月, 2022 1 次提交
-
-
由 Eyal Birger 提交于
The commit referenced in the "Fixes" tag added the SO_RCVMARK socket option for receiving the skb mark in the ancillary data. Since this is a new capability, and exposes admin configured details regarding the underlying network setup to sockets, let's align the needed capabilities with those of SO_MARK. Fixes: 6fd1d51c ("net: SO_RCVMARK socket option for SO_MARK with recvmsg()") Signed-off-by: NEyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20220504095459.2663513-1-eyal.birger@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 01 5月, 2022 3 次提交
-
-
由 Pavel Begunkov 提交于
Now we have a separate path for sock_def_write_space() and can go one step further. When it's called from sock_wfree() we know that there is a preceding atomic for putting down ->sk_wmem_alloc. We can use it to replace to replace smb_mb() with a less expensive smp_mb__after_atomic(). It also removes an extra RCU read lock/unlock as a small bonus. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pavel Begunkov 提交于
For non SOCK_USE_WRITE_QUEUE sockets, sock_wfree() (atomically) puts ->sk_wmem_alloc twice. It's needed to keep the socket alive while calling ->sk_write_space() after the first put. However, some sockets, such as UDP, are freed by RCU (i.e. SOCK_RCU_FREE) and use already RCU-safe sock_def_write_space(). Carve a fast path for such sockets, put down all refs in one go before calling sock_def_write_space() but guard the socket from being freed by an RCU read section. note: because TCP sockets are marked with SOCK_USE_WRITE_QUEUE it doesn't add extra checks in its path. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pavel Begunkov 提交于
Except for minor rounding differences the first ->sk_wmem_alloc test in sock_def_write_space() is a hand coded version of sock_writeable(). Replace it with the helper, and also kill the following if duplicating the check. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 4月, 2022 1 次提交
-
-
由 Pavel Begunkov 提交于
sock_alloc_send_skb() is simple and just proxying to another function, so we can inline it and cut associated overhead. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 4月, 2022 1 次提交
-
-
由 Erin MacNeil 提交于
Adding a new socket option, SO_RCVMARK, to indicate that SO_MARK should be included in the ancillary data returned by recvmsg(). Renamed the sock_recv_ts_and_drops() function to sock_recv_cmsgs(). Signed-off-by: NErin MacNeil <lnx.erin@gmail.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Acked-by: NMarc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/r/20220427200259.2564-1-lnx.erin@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 27 4月, 2022 1 次提交
-
-
由 Eric Dumazet 提交于
Logic added in commit f35f8219 ("tcp: defer skb freeing after socket lock is released") helped bulk TCP flows to move the cost of skbs frees outside of critical section where socket lock was held. But for RPC traffic, or hosts with RFS enabled, the solution is far from being ideal. For RPC traffic, recvmsg() has to return to user space right after skb payload has been consumed, meaning that BH handler has no chance to pick the skb before recvmsg() thread. This issue is more visible with BIG TCP, as more RPC fit one skb. For RFS, even if BH handler picks the skbs, they are still picked from the cpu on which user thread is running. Ideally, it is better to free the skbs (and associated page frags) on the cpu that originally allocated them. This patch removes the per socket anchor (sk->defer_list) and instead uses a per-cpu list, which will hold more skbs per round. This new per-cpu list is drained at the end of net_action_rx(), after incoming packets have been processed, to lower latencies. In normal conditions, skbs are added to the per-cpu list with no further action. In the (unlikely) cases where the cpu does not run net_action_rx() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ on the remote cpu. Also, we do not bother draining the per-cpu list from dev_cpu_dead() This is because skbs in this list have no requirement on how fast they should be freed. Note that we can add in the future a small per-cpu cache if we see any contention on sd->defer_lock. Tested on a pair of hosts with 100Gbit NIC, RFS enabled, and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around page recycling strategy used by NIC driver (its page pool capacity being too small compared to number of skbs/pages held in sockets receive queues) Note that this tuning was only done to demonstrate worse conditions for skb freeing for this particular test. These conditions can happen in more general production workload. 10 runs of one TCP_STREAM flow Before: Average throughput: 49685 Mbit. Kernel profiles on cpu running user thread recvmsg() show high cost for skb freeing related functions (*) 57.81% [kernel] [k] copy_user_enhanced_fast_string (*) 12.87% [kernel] [k] skb_release_data (*) 4.25% [kernel] [k] __free_one_page (*) 3.57% [kernel] [k] __list_del_entry_valid 1.85% [kernel] [k] __netif_receive_skb_core 1.60% [kernel] [k] __skb_datagram_iter (*) 1.59% [kernel] [k] free_unref_page_commit (*) 1.16% [kernel] [k] __slab_free 1.16% [kernel] [k] _copy_to_iter (*) 1.01% [kernel] [k] kfree (*) 0.88% [kernel] [k] free_unref_page 0.57% [kernel] [k] ip6_rcv_core 0.55% [kernel] [k] ip6t_do_table 0.54% [kernel] [k] flush_smp_call_function_queue (*) 0.54% [kernel] [k] free_pcppages_bulk 0.51% [kernel] [k] llist_reverse_order 0.38% [kernel] [k] process_backlog (*) 0.38% [kernel] [k] free_pcp_prepare 0.37% [kernel] [k] tcp_recvmsg_locked (*) 0.37% [kernel] [k] __list_add_valid 0.34% [kernel] [k] sock_rfree 0.34% [kernel] [k] _raw_spin_lock_irq (*) 0.33% [kernel] [k] __page_cache_release 0.33% [kernel] [k] tcp_v6_rcv (*) 0.33% [kernel] [k] __put_page (*) 0.29% [kernel] [k] __mod_zone_page_state 0.27% [kernel] [k] _raw_spin_lock After patch: Average throughput: 73076 Mbit. Kernel profiles on cpu running user thread recvmsg() looks better: 81.35% [kernel] [k] copy_user_enhanced_fast_string 1.95% [kernel] [k] _copy_to_iter 1.95% [kernel] [k] __skb_datagram_iter 1.27% [kernel] [k] __netif_receive_skb_core 1.03% [kernel] [k] ip6t_do_table 0.60% [kernel] [k] sock_rfree 0.50% [kernel] [k] tcp_v6_rcv 0.47% [kernel] [k] ip6_rcv_core 0.45% [kernel] [k] read_tsc 0.44% [kernel] [k] _raw_spin_lock_irqsave 0.37% [kernel] [k] _raw_spin_lock 0.37% [kernel] [k] native_irq_return_iret 0.33% [kernel] [k] __inet6_lookup_established 0.31% [kernel] [k] ip6_protocol_deliver_rcu 0.29% [kernel] [k] tcp_rcv_established 0.29% [kernel] [k] llist_reverse_order v2: kdoc issue (kernel bots) do not defer if (alloc_cpu == smp_processor_id()) (Paolo) replace the sk_buff_head with a single-linked list (Jakub) add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NPaolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 12 4月, 2022 1 次提交
-
-
由 Oliver Hartkopp 提交于
The internal recvmsg() functions have two parameters 'flags' and 'noblock' that were merged inside skb_recv_datagram(). As a follow up patch to commit f4b41f06 ("net: remove noblock parameter from skb_recv_datagram()") this patch removes the separate 'noblock' parameter for recvmsg(). Analogue to the referenced patch for skb_recv_datagram() the 'flags' and 'noblock' parameters are unnecessarily split up with e.g. err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); or in err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); instead of simply using only flags all the time and check for MSG_DONTWAIT where needed (to preserve for the formerly separated no(n)block condition). Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.netSigned-off-by: NPaolo Abeni <pabeni@redhat.com>
-
- 11 4月, 2022 1 次提交
-
-
由 Menglong Dong 提交于
In order to report the reasons of skb drops in 'sock_queue_rcv_skb()', introduce the function 'sock_queue_rcv_skb_reason()'. As the return value of 'sock_queue_rcv_skb()' is used as the error code, we can't make it as drop reason and have to pass extra output argument. 'sock_queue_rcv_skb()' is used in many places, so we can't change it directly. Introduce the new function 'sock_queue_rcv_skb_reason()' and make 'sock_queue_rcv_skb()' an inline call to it. Reviewed-by: NHao Peng <flyingpeng@tencent.com> Reviewed-by: NJiang Biao <benbjiang@tencent.com> Signed-off-by: NMenglong Dong <imagedong@tencent.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 4月, 2022 1 次提交
-
-
由 Jakub Kicinski 提交于
There's a number of functions and static variables used under net/core/ but not from the outside. We currently dump most of them into netdevice.h. That bad for many reasons: - netdevice.h is very cluttered, hard to figure out what the APIs are; - netdevice.h is very long; - we have to touch netdevice.h more which causes expensive incremental builds. Create a header under net/core/ and move some declarations. The new header is also a bit of a catch-all but that's fine, if we create more specific headers people will likely over-think where their declaration fit best. And end up putting them in netdevice.h, again. More work should be done on splitting netdevice.h into more targeted headers, but that'd be more time consuming so small steps. Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 09 3月, 2022 1 次提交
-
-
由 Samuel Thibault 提交于
ENOTSUPP is documented as "should never be seen by user programs", and thus not exposed in <errno.h>, and thus applications cannot safely check against it (they get "Unknown error 524" as strerror). We should rather return the well-known -EOPNOTSUPP. This is similar to 2230a7ef ("drop_monitor: Use correct error code") and 4a5cdc60 ("net/tls: Fix return values to avoid ENOTSUPP"), which did not seem to cause problems. Signed-off-by: NSamuel Thibault <samuel.thibault@labri.fr> Acked-by: NWillem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20220307223126.djzvg44v2o2jkjsx@beginSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 18 2月, 2022 2 次提交
-
-
由 Eric Dumazet 提交于
UDP sendmsg() can be lockless, this is causing all kinds of data races. This patch converts sk->sk_tskey to remove one of these races. BUG: KCSAN: data-race in __ip_append_data / __ip_append_data read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1: __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0: __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994 ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636 udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249 inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000054d -> 0x0000054e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
prot->memory_allocated should only be set if prot->sysctl_mem is also set. This is a followup of commit 25206111 ("crypto: af_alg - get rid of alg_memory_allocated"). Signed-off-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220216171801.3604366-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 02 2月, 2022 1 次提交
-
-
由 Jakub Kicinski 提交于
There's not reason SO_MARK would be allowed via setsockopt() and not via cmsg, let's keep the two consistent. See commit 079925cc ("net: allow SO_MARK with CAP_NET_RAW") for justification why NET_RAW -> SO_MARK is safe. Reviewed-by: NMaciej Żenczykowski <maze@google.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220131233357.52964-1-kuba@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 31 1月, 2022 2 次提交
-
-
由 Akhmat Karakotov 提交于
Disabling rehash behavior did not affect SYN ACK retransmits because hash was forcefully changed bypassing the sk_rethink_hash function. This patch adds a condition which checks for rehash mode before resetting hash. Signed-off-by: NAkhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Akhmat Karakotov 提交于
Add the SO_TXREHASH socket option to control hash rethink behavior per socket. When default mode is set, sockets disable rehash at initialization and use sysctl option when entering listen state. setsockopt() overrides default behavior. Signed-off-by: NAkhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 1月, 2022 1 次提交
-
-
由 David Ahern 提交于
sk_gso_max_size is set based on the dst dev. Both users of it adjust the value by the same offset - (MAX_TCP_HEADER + 1). Rather than compute the same adjusted value on each call do the adjustment once when set. Signed-off-by: NDavid Ahern <dsahern@kernel.org> Reviewed-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220125024511.27480-1-dsahern@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 17 1月, 2022 1 次提交
-
-
由 Gal Pressman 提交于
The cited Fixes patch moved to a deferred skb approach where the skbs are not freed immediately under the socket lock. Add a WARN_ON_ONCE() to verify the deferred list is empty on socket destroy, and empty it to prevent potential memory leaks. Fixes: f35f8219 ("tcp: defer skb freeing after socket lock is released") Signed-off-by: NGal Pressman <gal@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 1月, 2022 1 次提交
-
-
由 Miroslav Lichvar 提交于
Don't forget to release the device in sock_timestamping_bind_phc() after it was used to get the vclock indices. Fixes: d463126e ("net: sock: extend SO_TIMESTAMPING for PHC binding") Signed-off-by: NMiroslav Lichvar <mlichvar@redhat.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 12月, 2021 1 次提交
-
-
由 Kuniyuki Iwashima 提交于
This patch moves sock_release_ownership() down in include/net/sock.h and replaces some sk_lock.owned tests with sock_owned_by_user_nocheck(). Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp> Link: https://lore.kernel.org/r/20211208062158.54132-1-kuniyu@amazon.co.jpSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 10 12月, 2021 1 次提交
-
-
由 Eric Dumazet 提交于
Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 25 11月, 2021 2 次提交
-
-
由 Maciej Żenczykowski 提交于
A CAP_NET_RAW capable process can already spoof (on transmit) anything it desires via raw packet sockets... There is no good reason to not allow it to also be able to play routing tricks on packets from its own normal sockets. There is a desire to be able to use SO_MARK for routing table selection (via ip rule fwmark) from within a user process without having to run it as root. Granting it CAP_NET_RAW is much less dangerous than CAP_NET_ADMIN (CAP_NET_RAW doesn't permit persistent state change, while CAP_NET_ADMIN does - by for example allowing the reconfiguration of the routing tables and/or bringing up/down devices). Let's keep CAP_NET_ADMIN for persistent state changes, while using CAP_NET_RAW for non-configuration related stuff. Signed-off-by: NMaciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20211123203715.193413-1-zenczykowski@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Maciej Żenczykowski 提交于
CAP_NET_ADMIN is and should continue to be about configuring the system as a whole, not about configuring per-socket or per-packet parameters. Sending and receiving raw packets is what CAP_NET_RAW is all about. It can already send packets with any VLAN tag, and any IPv4 TOS mark, and any IPv6 TCLASS mark, simply by virtue of building such a raw packet. Not to mention using any protocol and source/ /destination ip address/port tuple. These are the fields that networking gear uses to prioritize packets. Hence, a CAP_NET_RAW process is already capable of affecting traffic prioritization after it hits the wire. This change makes it capable of affecting traffic prioritization even in the host at the nic and before that in the queueing disciplines (provided skb->priority is actually being used for prioritization, and not the TOS/TCLASS field) Hence it makes sense to allow a CAP_NET_RAW process to set the priority of sockets and thus packets it sends. Signed-off-by: NMaciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20211123203702.193221-1-zenczykowski@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 22 11月, 2021 2 次提交
-
-
由 Eric Dumazet 提交于
dev->gso_max_segs is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add netif_set_gso_max_segs() helper. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs() where we can to better document what is going on. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
dev->gso_max_size is written under RTNL protection, or when the device is not yet visible, but is read locklessly. Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_size() where we can to better document what is going on. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 11月, 2021 5 次提交
-
-
由 Eric Dumazet 提交于
net->core.sock_inuse is a per cpu variable (int), while net->core.prot_inuse is another per cpu variable of 64 integers. per cpu allocator tend to place them in very different places. Grouping them together makes sense, since it makes updates potentially faster, if hitting the same cache line. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
MPTCP hard codes it, let us instead provide this helper. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
sock_prot_inuse_add() is very small, we can inline it. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Use INDIRECT_CALL_INET() to avoid an indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Instead of using a full netdev_features_t, we can use a single bit, as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from sk->sk_route_cap. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-