- 19 6月, 2019 1 次提交
-
-
由 JingYi Hou 提交于
In sock_getsockopt(), 'optlen' is fetched the first time from userspace. 'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is fetched the second time from userspace. If change it between two fetches may cause security problems or unexpected behaivor, and there is no reason to fetch it a second time. To fix this, we need to remove the second fetch. Signed-off-by: NJingYi Hou <houjingyi647@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 6月, 2019 1 次提交
-
-
由 Eric Dumazet 提交于
>From linux-3.7, (commit 5640f768 "net: use a per task frag allocator") TCP sendmsg() has preferred using order-3 allocations. While it gives good results for most cases, we had reports that heavy uses of TCP over loopback were hitting a spinlock contention in page allocations/freeing. This commits adds a sysctl so that admins can opt-in for order-0 allocations. Hopefully mm layer might optimize order-3 allocations in the future since it could give us a nice boost (see 8 lines of following benchmark) The following benchmark shows a win when more than 8 TCP_STREAM threads are running (56 x86 cores server in my tests) for thr in {1..30} do sysctl -wq net.core.high_order_alloc_disable=0 T0=`./super_netperf $thr -H 127.0.0.1 -l 15` sysctl -wq net.core.high_order_alloc_disable=1 T1=`./super_netperf $thr -H 127.0.0.1 -l 15` echo $thr:$T0:$T1 done 1: 49979: 37267 2: 98745: 76286 3: 141088: 110051 4: 177414: 144772 5: 197587: 173563 6: 215377: 208448 7: 241061: 234087 8: 267155: 263373 9: 295069: 297402 10: 312393: 335213 11: 340462: 368778 12: 371366: 403954 13: 412344: 443713 14: 426617: 473580 15: 474418: 507861 16: 503261: 538539 17: 522331: 563096 18: 532409: 567084 19: 550824: 605240 20: 525493: 641988 21: 564574: 665843 22: 567349: 690868 23: 583846: 710917 24: 588715: 736306 25: 603212: 763494 26: 604083: 792654 27: 602241: 796450 28: 604291: 797993 29: 611610: 833249 30: 577356: 841062 Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 6月, 2019 1 次提交
-
-
由 Martin KaFai Lau 提交于
The cloned sk should not carry its parent-listener's sk_bpf_storage. This patch fixes it by setting it back to NULL. Fixes: 6ac99e8f ("bpf: Introduce bpf sk local storage") Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 31 5月, 2019 1 次提交
-
-
由 Thomas Gleixner 提交于
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NAllison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 28 4月, 2019 1 次提交
-
-
由 Martin KaFai Lau 提交于
After allowing a bpf prog to - directly read the skb->sk ptr - get the fullsock bpf_sock by "bpf_sk_fullsock()" - get the bpf_tcp_sock by "bpf_tcp_sock()" - get the listener sock by "bpf_get_listener_sock()" - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock" into different bpf running context. this patch is another effort to make bpf's network programming more intuitive to do (together with memory and performance benefit). When bpf prog needs to store data for a sk, the current practice is to define a map with the usual 4-tuples (src/dst ip/port) as the key. If multiple bpf progs require to store different sk data, multiple maps have to be defined. Hence, wasting memory to store the duplicated keys (i.e. 4 tuples here) in each of the bpf map. [ The smallest key could be the sk pointer itself which requires some enhancement in the verifier and it is a separate topic. ] Also, the bpf prog needs to clean up the elem when sk is freed. Otherwise, the bpf map will become full and un-usable quickly. The sk-free tracking currently could be done during sk state transition (e.g. BPF_SOCK_OPS_STATE_CB). The size of the map needs to be predefined which then usually ended-up with an over-provisioned map in production. Even the map was re-sizable, while the sk naturally come and go away already, this potential re-size operation is arguably redundant if the data can be directly connected to the sk itself instead of proxy-ing through a bpf map. This patch introduces sk->sk_bpf_storage to provide local storage space at sk for bpf prog to use. The space will be allocated when the first bpf prog has created data for this particular sk. The design optimizes the bpf prog's lookup (and then optionally followed by an inline update). bpf_spin_lock should be used if the inline update needs to be protected. BPF_MAP_TYPE_SK_STORAGE: ----------------------- To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can be created to fit different bpf progs' needs. The map enforces BTF to allow printing the sk-local-storage during a system-wise sk dump (e.g. "ss -ta") in the future. The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete a "sk-local-storage" data from a particular sk. Think of the map as a meta-data (or "type") of a "sk-local-storage". This particular "type" of "sk-local-storage" data can then be stored in any sk. The main purposes of this map are mostly: 1. Define the size of a "sk-local-storage" type. 2. Provide a similar syscall userspace API as the map (e.g. lookup/update, map-id, map-btf...etc.) 3. Keep track of all sk's storages of this "type" and clean them up when the map is freed. sk->sk_bpf_storage: ------------------ The main lookup/update/delete is done on sk->sk_bpf_storage (which is a "struct bpf_sk_storage"). When doing a lookup, the "map" pointer is now used as the "key" to search on the sk_storage->list. The "map" pointer is actually serving as the "type" of the "sk-local-storage" that is being requested. To allow very fast lookup, it should be as fast as looking up an array at a stable-offset. At the same time, it is not ideal to set a hard limit on the number of sk-local-storage "type" that the system can have. Hence, this patch takes a cache approach. The last search result from sk_storage->list is cached in sk_storage->cache[] which is a stable sized array. Each "sk-local-storage" type has a stable offset to the cache[] array. In the future, a map's flag could be introduced to do cache opt-out/enforcement if it became necessary. The cache size is 16 (i.e. 16 types of "sk-local-storage"). Programs can share map. On the program side, having a few bpf_progs running in the networking hotpath is already a lot. The bpf_prog should have already consolidated the existing sock-key-ed map usage to minimize the map lookup penalty. 16 has enough runway to grow. All sk-local-storage data will be removed from sk->sk_bpf_storage during sk destruction. bpf_sk_storage_get() and bpf_sk_storage_delete(): ------------------------------------------------ Instead of using bpf_map_(lookup|update|delete)_elem(), the bpf prog needs to use the new helper bpf_sk_storage_get() and bpf_sk_storage_delete(). The verifier can then enforce the ARG_PTR_TO_SOCKET argument. The bpf_sk_storage_get() also allows to "create" new elem if one does not exist in the sk. It is done by the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE. The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together, it has eliminated the potential use cases for an equivalent bpf_map_update_elem() API (for bpf_prog) in this patch. Misc notes: ---------- 1. map_get_next_key is not supported. From the userspace syscall perspective, the map has the socket fd as the key while the map can be shared by pinned-file or map-id. Since btf is enforced, the existing "ss" could be enhanced to pretty print the local-storage. Supporting a kernel defined btf with 4 tuples as the return key could be explored later also. 2. The sk->sk_lock cannot be acquired. Atomic operations is used instead. e.g. cmpxchg is done on the sk->sk_bpf_storage ptr. Please refer to the source code comments for the details in synchronization cases and considerations. 3. The mem is charged to the sk->sk_omem_alloc as the sk filter does. Benchmark: --------- Here is the benchmark data collected by turning on the "kernel.bpf_stats_enabled" sysctl. Two bpf progs are tested: One bpf prog with the usual bpf hashmap (max_entries = 8192) with the sk ptr as the key. (verifier is modified to support sk ptr as the key That should have shortened the key lookup time.) Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE. Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for each egress skb and then bump the cnt. netperf is used to drive data with 4096 connected UDP sockets. BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run) 27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633 loaded_at 2019-04-15T13:46:39-0700 uid 0 xlated 344B jited 258B memlock 4096B map_ids 16 btf_id 5 BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run) 30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739 loaded_at 2019-04-15T13:47:54-0700 uid 0 xlated 168B jited 156B memlock 4096B map_ids 17 btf_id 6 Here is a high-level picture on how are the objects organized: sk ┌──────┐ │ │ │ │ │ │ │*sk_bpf_storage─────
▶ bpf_sk_storage └──────┘ ┌───────┐ ┌───────────┤ list │ │ │ │ │ │ │ │ │ │ │ └───────┘ │ │ elem │ ┌────────┐ ├─▶ │ snode │ │ ├────────┤ │ │ data │ bpf_map │ ├────────┤ ┌─────────┐ │ │map_node│◀ ─┬─────┤ list │ │ └────────┘ │ │ │ │ │ │ │ │ elem │ │ │ │ ┌────────┐ │ └─────────┘ └─▶ │ snode │ │ ├────────┤ │ bpf_map │ data │ │ ┌─────────┐ ├────────┤ │ │ list ├───────▶ │map_node│ │ │ │ └────────┘ │ │ │ │ │ │ elem │ └─────────┘ ┌────────┐ │ ┌─▶ │ snode │ │ │ ├────────┤ │ │ │ data │ │ │ ├────────┤ │ │ │map_node│◀ ─┘ │ └────────┘ │ │ │ ┌───────┐ sk └──────────│ list │ ┌──────┐ │ │ │ │ │ │ │ │ │ │ │ │ └───────┘ │*sk_bpf_storage───────▶ bpf_sk_storage └──────┘ Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 24 4月, 2019 1 次提交
-
-
由 Stephen Rothwell 提交于
net/core/sock.c: In function 'sock_gettstamp': net/core/sock.c:3007:23: error: expected '}' before ';' token .tv_sec = ts.tv_sec; ^ net/core/sock.c:3011:4: error: expected ')' before 'return' return -EFAULT; ^~~~~~ net/core/sock.c:3013:2: error: expected expression before '}' token } ^ Fixes: c7cbdbf2 ("net: rework SIOCGSTAMP ioctl handling") Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 4月, 2019 1 次提交
-
-
由 Arnd Bergmann 提交于
The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many socket protocol handlers, and all of those end up calling the same sock_get_timestamp()/sock_get_timestampns() helper functions, which results in a lot of duplicate code. With the introduction of 64-bit time_t on 32-bit architectures, this gets worse, as we then need four different ioctl commands in each socket protocol implementation. To simplify that, let's add a new .gettstamp() operation in struct proto_ops, and move ioctl implementation into the common sock_ioctl()/compat_sock_ioctl_trans() functions that these all go through. We can reuse the sock_get_timestamp() implementation, but generalize it so it can deal with both native and compat mode, as well as timeval and timespec structures. Acked-by: NStefan Schmidt <stefan@datenfreihafen.org> Acked-by: NNeil Horman <nhorman@tuxdriver.com> Acked-by: NMarc Kleine-Budde <mkl@pengutronix.de> Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/Signed-off-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 4月, 2019 1 次提交
-
-
由 Arnd Bergmann 提交于
It looks like the new socket options only work correctly for native execution, but in case of compat mode fall back to the old behavior as we ignore the 'old_timeval' flag. Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the same way in compat and native 32-bit mode. Cc: Deepa Dinamani <deepa.kernel@gmail.com> Fixes: a9beb86a ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW") Signed-off-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 3月, 2019 2 次提交
-
-
由 Eric Dumazet 提交于
For legacy applications using 32bit variable, SO_MAX_PACING_RATE has to cap the returned value to 0xFFFFFFFF, meaning that rates above 34.35 Gbit are capped. This patch allows applications to read socket pacing rate at full resolution, if they provide a 64bit variable to store it, and the kernel is 64bit. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
64bit kernels now support 64bit pacing rates. This commit changes setsockopt() to accept 64bit values provided by applications. Old applications providing 32bit value are still supported, but limited to the old 34Gbit limitation. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 2月, 2019 1 次提交
-
-
由 Li RongQing 提交于
This pointer is RCU protected, so proper primitives should be used. Signed-off-by: NZhang Yu <zhangyu31@baidu.com> Signed-off-by: NLi RongQing <lirongqing@baidu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 2月, 2019 1 次提交
-
-
由 Guillaume Nault 提交于
SO_SNDBUF and SO_RCVBUF (and their *BUFFORCE version) may overflow or underflow their input value. This patch aims at providing explicit handling of these extreme cases, to get a clear behaviour even with values bigger than INT_MAX / 2 or lower than INT_MIN / 2. For simplicity, only SO_SNDBUF and SO_SNDBUFFORCE are described here, but the same explanation and fix apply to SO_RCVBUF and SO_RCVBUFFORCE (with 'SNDBUF' replaced by 'RCVBUF' and 'wmem_max' by 'rmem_max'). Overflow of positive values =========================== When handling SO_SNDBUF or SO_SNDBUFFORCE, if 'val' exceeds INT_MAX / 2, the buffer size is set to its minimum value because 'val * 2' overflows, and max_t() considers that it's smaller than SOCK_MIN_SNDBUF. For SO_SNDBUF, this can only happen with net.core.wmem_max > INT_MAX / 2. SO_SNDBUF and SO_SNDBUFFORCE are actually designed to let users probe for the maximum buffer size by setting an arbitrary large number that gets capped to the maximum allowed/possible size. Having the upper half of the positive integer space to potentially reduce the buffer size to its minimum value defeats this purpose. This patch caps the base value to INT_MAX / 2, so that bigger values don't overflow and keep setting the buffer size to its maximum. Underflow of negative values ============================ For negative numbers, SO_SNDBUF always considers them bigger than net.core.wmem_max, which is bounded by [SOCK_MIN_SNDBUF, INT_MAX]. Therefore such values are set to net.core.wmem_max and we're back to the behaviour of positive integers described above (return maximum buffer size if wmem_max <= INT_MAX / 2, return SOCK_MIN_SNDBUF otherwise). However, SO_SNDBUFFORCE behaves differently. The user value is directly multiplied by two and compared with SOCK_MIN_SNDBUF. If 'val * 2' doesn't underflow or if it underflows to a value smaller than SOCK_MIN_SNDBUF then buffer size is set to its minimum value. Otherwise the buffer size is set to the underflowed value. This patch treats negative values passed to SO_SNDBUFFORCE as null, to prevent underflows. Therefore negative values now always set the buffer size to its minimum value. Even though SO_SNDBUF behaves inconsistently by setting buffer size to the maximum value when passed a negative number, no attempt is made to modify this behaviour. There may exist some programs that rely on using negative numbers to set the maximum buffer size. Avoiding overflows because of extreme net.core.wmem_max values is the most we can do here. Summary of altered behaviours ============================= val : user-space value passed to setsockopt() val_uf : the underflowed value resulting from doubling val when val < INT_MIN / 2 wmem_max : short for net.core.wmem_max val_cap : min(val, wmem_max) min_len : minimal buffer length (that is, SOCK_MIN_SNDBUF) max_len : maximal possible buffer length, regardless of wmem_max (that is, INT_MAX - 1) ^^^^ : altered behaviour SO_SNDBUF: +-------------------------+-------------+------------+----------------+ | CONDITION | OLD RESULT | NEW RESULT | COMMENT | +-------------------------+-------------+------------+----------------+ | val < 0 && | | | No overflow, | | wmem_max <= INT_MAX/2 | wmem_max*2 | wmem_max*2 | keep original | | | | | behaviour | +-------------------------+-------------+------------+----------------+ | val < 0 && | | | Cap wmem_max | | INT_MAX/2 < wmem_max | min_len | max_len | to prevent | | | | ^^^^^^^ | overflow | +-------------------------+-------------+------------+----------------+ | 0 <= val <= min_len/2 | min_len | min_len | Ordinary case | +-------------------------+-------------+------------+----------------+ | min_len/2 < val && | val_cap*2 | val_cap*2 | Ordinary case | | val_cap <= INT_MAX/2 | | | | +-------------------------+-------------+------------+----------------+ | min_len < val && | | | Cap val_cap | | INT_MAX/2 < val_cap | min_len | max_len | again to | | (implies that | | ^^^^^^^ | prevent | | INT_MAX/2 < wmem_max) | | | overflow | +-------------------------+-------------+------------+----------------+ SO_SNDBUFFORCE: +------------------------------+---------+---------+------------------+ | CONDITION | BEFORE | AFTER | COMMENT | | | PATCH | PATCH | | +------------------------------+---------+---------+------------------+ | val < INT_MIN/2 && | min_len | min_len | Underflow with | | val_uf <= min_len | | | no consequence | +------------------------------+---------+---------+------------------+ | val < INT_MIN/2 && | val_uf | min_len | Set val to 0 to | | val_uf > min_len | | ^^^^^^^ | avoid underflow | +------------------------------+---------+---------+------------------+ | INT_MIN/2 <= val < 0 | min_len | min_len | No underflow | +------------------------------+---------+---------+------------------+ | 0 <= val <= min_len/2 | min_len | min_len | Ordinary case | +------------------------------+---------+---------+------------------+ | min_len/2 < val <= INT_MAX/2 | val*2 | val*2 | Ordinary case | +------------------------------+---------+---------+------------------+ | INT_MAX/2 < val | min_len | max_len | Cap val to | | | | ^^^^^^^ | prevent overflow | +------------------------------+---------+---------+------------------+ Signed-off-by: NGuillaume Nault <gnault@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 2月, 2019 1 次提交
-
-
由 Eric Dumazet 提交于
With many active TCP sockets, fat TCP sockets could fool __sk_mem_raise_allocated() thanks to an overflow. They would increase their share of the memory, instead of decreasing it. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 2月, 2019 7 次提交
-
-
由 David S. Miller 提交于
net/core/sock.c: In function 'sock_setsockopt': net/core/sock.c:914:3: warning: this statement may fall through [-Wimplicit-fallthrough=] sock_set_flag(sk, SOCK_TSTAMP_NEW); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ net/core/sock.c:915:2: note: here case SO_TIMESTAMPING_OLD: ^~~~ Fixes: 9718475e ("socket: Add SO_TIMESTAMPING_NEW") Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Deepa Dinamani 提交于
Add new socket timeout options that are y2038 safe. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Cc: ccaulfie@redhat.com Cc: davem@davemloft.net Cc: deller@gmx.de Cc: paulus@samba.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: cluster-devel@redhat.com Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Deepa Dinamani 提交于
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval as the time format. struct timeval is not y2038 safe. The subsequent patches in the series add support for new socket timeout options with _NEW suffix that will use y2038 safe data structures. Although the existing struct timeval layout is sufficiently wide to represent timeouts, because of the way libc will interpret time_t based on user defined flag, these new flags provide a way of having a structure that is the same for all architectures consistently. Rename the existing options with _OLD suffix forms so that the right option is enabled for userspace applications according to the architecture and time_t definition of libc. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Cc: ccaulfie@redhat.com Cc: deller@gmx.de Cc: paulus@samba.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: cluster-devel@redhat.com Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Deepa Dinamani 提交于
Add SO_TIMESTAMPING_NEW variant of socket timestamp options. This is the y2038 safe versions of the SO_TIMESTAMPING_OLD for all architectures. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Cc: chris@zankel.net Cc: fenghua.yu@intel.com Cc: rth@twiddle.net Cc: tglx@linutronix.de Cc: ubraun@linux.ibm.com Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-ia64@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-s390@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: sparclinux@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Deepa Dinamani 提交于
Add SO_TIMESTAMP_NEW and SO_TIMESTAMPNS_NEW variants of socket timestamp options. These are the y2038 safe versions of the SO_TIMESTAMP_OLD and SO_TIMESTAMPNS_OLD for all architectures. Note that the format of scm_timestamping.ts[0] is not changed in this patch. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Cc: jejb@parisc-linux.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: linux-alpha@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Deepa Dinamani 提交于
SO_TIMESTAMP, SO_TIMESTAMPNS and SO_TIMESTAMPING options, the way they are currently defined, are not y2038 safe. Subsequent patches in the series add new y2038 safe versions of these options which provide 64 bit timestamps on all architectures uniformly. Hence, rename existing options with OLD tag suffixes. Also note that kernel will not use the untagged SO_TIMESTAMP* and SCM_TIMESTAMP* options internally anymore. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Cc: deller@gmx.de Cc: dhowells@redhat.com Cc: jejb@parisc-linux.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: linux-afs@lists.infradead.org Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Arnd Bergmann 提交于
This is a cleanup to prepare for the addition of 64-bit time_t in O_SNDTIMEO/O_RCVTIMEO. The existing compat handler seems unnecessarily complex and error-prone, moving it all into the main setsockopt()/getsockopt() implementation requires half as much code and is easier to extend. 32-bit user space can now use old_timeval32 on both 32-bit and 64-bit machines, while 64-bit code can use __old_kernel_timeval. Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 1月, 2019 1 次提交
-
-
由 Yafang Shao 提交于
The only call site of sk_clone_lock is in inet_csk_clone_lock, and sk_cookie will be set there. So we don't need to set sk_cookie in sk_clone_lock(). Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 1月, 2019 1 次提交
-
-
由 David Herrmann 提交于
This introduces a new generic SOL_SOCKET-level socket option called SO_BINDTOIFINDEX. It behaves similar to SO_BINDTODEVICE, but takes a network interface index as argument, rather than the network interface name. User-space often refers to network-interfaces via their index, but has to temporarily resolve it to a name for a call into SO_BINDTODEVICE. This might pose problems when the network-device is renamed asynchronously by other parts of the system. When this happens, the SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong device. In most cases user-space only ever operates on devices which they either manage themselves, or otherwise have a guarantee that the device name will not change (e.g., devices that are UP cannot be renamed). However, particularly in libraries this guarantee is non-obvious and it would be nice if that race-condition would simply not exist. It would make it easier for those libraries to operate even in situations where the device-name might change under the hood. A real use-case that we recently hit is trying to start the network stack early in the initrd but make it survive into the real system. Existing distributions rename network-interfaces during the transition from initrd into the real system. This, obviously, cannot affect devices that are up and running (unless you also consider moving them between network-namespaces). However, the network manager now has to make sure its management engine for dormant devices will not run in parallel to these renames. Particularly, when you offload operations like DHCP into separate processes, these might setup their sockets early, and thus have to resolve the device-name possibly running into this race-condition. By avoiding a call to resolve the device-name, we no longer depend on the name and can run network setup of dormant devices in parallel to the transition off the initrd. The SO_BINDTOIFINDEX ioctl plugs this race. Reviewed-by: NTom Gundersen <teg@jklm.no> Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 1月, 2019 1 次提交
-
-
由 Deepa Dinamani 提交于
Al Viro mentioned (Message-ID <20170626041334.GZ10672@ZenIV.linux.org.uk>) that there is probably a race condition lurking in accesses of sk_stamp on 32-bit machines. sock->sk_stamp is of type ktime_t which is always an s64. On a 32 bit architecture, we might run into situations of unsafe access as the access to the field becomes non atomic. Use seqlocks for synchronization. This allows us to avoid using spinlocks for readers as readers do not need mutual exclusion. Another approach to solve this is to require sk_lock for all modifications of the timestamps. The current approach allows for timestamps to have their own lock: sk_stamp_lock. This allows for the patch to not compete with already existing critical sections, and side effects are limited to the paths in the patch. The addition of the new field maintains the data locality optimizations from commit 9115e8cd ("net: reorganize struct sock for better data locality") Note that all the instances of the sk_stamp accesses are either through the ioctl or the syscall recvmsg. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 12月, 2018 1 次提交
-
-
由 yupeng 提交于
after set SO_DONTROUTE to 1, the IP layer should not route packets if the dest IP address is not in link scope. But if the socket has cached the dst_entry, such packets would be routed until the sk_dst_cache expires. So we should clean the sk_dst_cache when a user set SO_DONTROUTE option. Below are server/client python scripts which could reprodue this issue: server side code: ========================================================================== import socket import struct import time s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) s.bind(('0.0.0.0', 9000)) s.listen(1) sock, addr = s.accept() sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1)) while True: sock.send(b'foo') time.sleep(1) ========================================================================== client side code: ========================================================================== import socket import time s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) s.connect(('server_address', 9000)) while True: data = s.recv(1024) print(data) ========================================================================== Signed-off-by: Nyupeng <yupeng0921@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 12月, 2018 1 次提交
-
-
由 Willem de Bruijn 提交于
Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and interpret flag MSG_ZEROCOPY. This patch was previously part of the zerocopy RFC patchsets. Zerocopy is not effective at small MTU. With segmentation offload building larger datagrams, the benefit of page flipping outweights the cost of generating a completion notification. tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on test patch and making skb_orphan_frags_rx same as skb_orphan_frags: ipv4 udp -t 1 tx=191312 (11938 MB) txc=0 zc=n rx=191312 (11938 MB) ipv4 udp -z -t 1 tx=304507 (19002 MB) txc=304507 zc=y rx=304507 (19002 MB) ok ipv6 udp -t 1 tx=174485 (10888 MB) txc=0 zc=n rx=174485 (10888 MB) ipv6 udp -z -t 1 tx=294801 (18396 MB) txc=294801 zc=y rx=294801 (18396 MB) ok Changes v1 -> v2 - Fixup reverse christmas tree violation v2 -> v3 - Split refcount avoidance optimization into separate patch - Fix refcount leak on error in fragmented case (thanks to Paolo Abeni for pointing this one out!) - Fix refcount inc on zero - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data. This is needed since commit 5cf4a853 ("tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp. Signed-off-by: NWillem de Bruijn <willemb@google.com> Acked-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 11月, 2018 1 次提交
-
-
由 David Barmann 提交于
When setting the SO_MARK socket option, if the mark changes, the dst needs to be reset so that a new route lookup is performed. This fixes the case where an application wants to change routing by setting a new sk_mark. If this is done after some packets have already been sent, the dst is cached and has no effect. Signed-off-by: NDavid Barmann <david.barmann@stackpath.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 11月, 2018 1 次提交
-
-
由 Mike Manning 提交于
Ensure an unbound datagram skt is chosen when not in a VRF. The check for a device match in compute_score() for UDP must be performed when there is no device match. For this, a failure is returned when there is no device match. This ensures that bound sockets are never selected, even if there is no unbound socket. Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These packets are currently blocked, as flowi6_oif was set to that of the master vrf device, and the ipi6_ifindex is that of the slave device. Allow these packets to be sent by checking the device with ipi6_ifindex has the same L3 scope as that of the bound device of the skt, which is the master vrf device. Note that this check always succeeds if the skt is unbound. Even though the right datagram skt is now selected by compute_score(), a different skt is being returned that is bound to the wrong vrf. The difference between these and stream sockets is the handling of the skt option for SO_REUSEPORT. While the handling when adding a skt for reuse correctly checks that the bound device of the skt is a match, the skts in the hashslot are already incorrect. So for the same hash, a skt for the wrong vrf may be selected for the required port. The root cause is that the skt is immediately placed into a slot when it is created, but when the skt is then bound using SO_BINDTODEVICE, it remains in the same slot. The solution is to move the skt to the correct slot by forcing a rehash. Signed-off-by: NMike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Tested-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 11月, 2018 1 次提交
-
-
由 Andrei Vagin 提交于
IPPROTO_RAW isn't registred as an inet protocol, so inet_protos[protocol] is always NULL for it. Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Fixes: bf2ae2e4 ("sock_diag: request _diag module only when the family or proto has been registered") Signed-off-by: NAndrei Vagin <avagin@gmail.com> Reviewed-by: NCyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 10月, 2018 2 次提交
-
-
由 Eric Dumazet 提交于
sk_pacing_rate has beed introduced as a u32 field in 2013, effectively limiting per flow pacing to 34Gbit. We believe it is time to allow TCP to pace high speed flows on 64bit hosts, as we now can reach 100Gbit on one TCP flow. This patch adds no cost for 32bit kernels. The tcpi_pacing_rate and tcpi_max_pacing_rate were already exported as 64bit, so iproute2/ss command require no changes. Unfortunately the SO_MAX_PACING_RATE socket option will stay 32bit and we will need to add a new option to let applications control high pacing rates. State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741 timer:(on,003ms,0) ino:91863 sk:2 <-> skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0) ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448 rcvmss:536 advmss:1448 cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177 segs_in:3916318 data_segs_out:177279175 bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2) send 28045.5Mbps lastrcv:73333 pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps busy:73333ms unacked:135 retrans:0/157 rcv_space:14480 notsent:2085120 minrtt:0.013 Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Convert kTLS over to make use of sk_msg interface for plaintext and encrypted scattergather data, so it reuses all the sk_msg helpers and data structure which later on in a second step enables to glue this to BPF. This also allows to remove quite a bit of open coded helpers which are covered by the sk_msg API. Recent changes in kTLs 80ece6a0 ("tls: Remove redundant vars from tls record structure") and 4e6d4720 ("tls: Add support for inplace records encryption") changed the data path handling a bit; while we've kept the latter optimization intact, we had to undo the former change to better fit the sk_msg model, hence the sg_aead_in and sg_aead_out have been brought back and are linked into the sk_msg sgs. Now the kTLS record contains a msg_plaintext and msg_encrypted sk_msg each. In the original code, the zerocopy_from_iter() has been used out of TX but also RX path. For the strparser skb-based RX path, we've left the zerocopy_from_iter() in decrypt_internal() mostly untouched, meaning it has been moved into tls_setup_from_iter() with charging logic removed (as not used from RX). Given RX path is not based on sk_msg objects, we haven't pursued setting up a dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it could be an option to prusue in a later step. Joint work with John. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 03 10月, 2018 1 次提交
-
-
由 Eric Dumazet 提交于
syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk)); in tcp_close() While a socket is being closed, it is very possible other threads find it in rtnetlink dump. tcp_get_info() will acquire the socket lock for a short amount of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);), enough to trigger the warning. Fixes: 67db3e4b ("tcp: no longer hold ehash lock while calling tcp_get_info()") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 9月, 2018 1 次提交
-
-
由 David S. Miller 提交于
An SKB is not on a list if skb->next is NULL. Codify this convention into a helper function and use it where we are dequeueing an SKB and need to mark it as such. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 8月, 2018 1 次提交
-
-
由 Yafang Shao 提交于
The sock_flag() check is alreay inside sock_enable_timestamp(), so it is unnecessary checking it in the caller. void sock_enable_timestamp(struct sock *sk, int flag) { if (!sock_flag(sk, flag)) { ... } } Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 8月, 2018 1 次提交
-
-
由 Matthieu Baerts 提交于
This refactoring work has been started by David Howells in cdfbabfb (net: Work around lockdep limitation in sockets that use sockets) but the exact same day in 581319c5 (net/socket: use per af lockdep classes for sk queues), Paolo Abeni added new classes. This reduces the amount of (nearly) duplicated code and eases the addition of new socket types. Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 7月, 2018 1 次提交
-
-
由 Daniel Borkmann 提交于
Current sg coalescing logic in sk_alloc_sg() (latter is used by tls and sockmap) is not quite correct in that we do fetch the previous sg entry, however the subsequent check whether the refilled page frag from the socket is still the same as from the last entry with prior offset and length matching the start of the current buffer is comparing always the first sg list entry instead of the prior one. Fixes: 3c4d7559 ("tls: kernel TLS support") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NDave Watson <davejwatson@fb.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 7月, 2018 2 次提交
-
-
由 Jesus Sanchez-Palencia 提交于
Use the socket error queue for reporting dropped packets if the socket has enabled that feature through the SO_TXTIME API. Packets are dropped either on enqueue() if they aren't accepted by the qdisc or on dequeue() if the system misses their deadline. Those are reported as different errors so applications can react accordingly. Userspace can retrieve the errors through the socket error queue and the corresponding cmsg interfaces. A struct sock_extended_err* is used for returning the error data, and the packet's timestamp can be retrieved by adding both ee_data and ee_info fields as e.g.: ((__u64) serr->ee_data << 32) + serr->ee_info This feature is disabled by default and must be explicitly enabled by applications. Enabling it can bring some overhead for the Tx cycles of the application. Signed-off-by: NJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Richard Cochran 提交于
This patch introduces SO_TXTIME. User space enables this option in order to pass a desired future transmit time in a CMSG when calling sendmsg(2). The argument to this socket option is a 8-bytes long struct provided by the uapi header net_tstamp.h defined as: struct sock_txtime { clockid_t clockid; u32 flags; }; Note that new fields were added to struct sock by filling a 2-bytes hole found in the struct. For that reason, neither the struct size or number of cachelines were altered. Signed-off-by: NRichard Cochran <rcochran@linutronix.de> Signed-off-by: NJesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 7月, 2018 2 次提交
-
-
由 Yafang Shao 提交于
Currently trace_sock_exceed_buf_limit() only show rmem info, but wmem limit may also be hit. So expose wmem info in this tracepoint as well. Regarding memcg, I think it is better to introduce a new tracepoint(if that is needed), i.e. trace_memcg_limit_hit other than show memcg info in trace_sock_exceed_buf_limit. Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Amritha Nambiar 提交于
This patch adds a new field to sock_common 'skc_rx_queue_mapping' which holds the receive queue number for the connection. The Rx queue is marked in tcp_finish_connect() to allow a client app to do SO_INCOMING_NAPI_ID after a connect() call to get the right queue association for a socket. Rx queue is also marked in tcp_conn_request() to allow syn-ack to go on the right tx-queue associated with the queue on which syn is received. Signed-off-by: NAmritha Nambiar <amritha.nambiar@intel.com> Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 6月, 2018 1 次提交
-
-
由 Shakeel Butt 提交于
Currently the kernel accounts the memory for network traffic through mem_cgroup_[un]charge_skmem() interface. However the memory accounted only includes the truesize of sk_buff which does not include the size of sock objects. In our production environment, with opt-out kmem accounting, the sock kmem caches (TCP[v6], UDP[v6], RAW[v6], UNIX) are among the top most charged kmem caches and consume a significant amount of memory which can not be left as system overhead. So, this patch converts the kmem caches of all sock objects to SLAB_ACCOUNT. Signed-off-by: NShakeel Butt <shakeelb@google.com> Suggested-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-