- 05 4月, 2016 3 次提交
-
-
由 Soheil Hassas Yeganeh 提交于
Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level as a basis to accept timestamping requests per write. This implementation only accepts TX recording flags (i.e., SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE, SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in control messages. Users need to set reporting flags (e.g., SOF_TIMESTAMPING_OPT_ID) per socket via socket options. This commit adds a tsflags field in sockcm_cookie which is set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_* bits in sockcm_cookie.tsflags allowing the control message to override the recording behavior per write, yet maintaining the value of other flags. This patch implements validating the control message and setting tsflags in struct sockcm_cookie. Next commits in this series will actually implement timestamping per write for different protocols. Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Soheil Hassas Yeganeh 提交于
SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs to associate timestamps with send calls. For TCP connections, tp->snd_una is used as the starting point to calculate relative IDs. This socket option will fail if set before the handshake on a passive TCP fast open connection with data in SYN or SYN/ACK, since setsockopt requires the connection to be in the ESTABLISHED state. To address these, instead of limiting the option to the ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as long as the connection is not in LISTEN or CLOSE states. Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NWillem de Bruijn <willemb@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
To process cmsg's of the SOL_SOCKET level in addition to cmsgs of another level, protocols can call sock_cmsg_send(). This causes a double walk on the cmsghdr list, one for SOL_SOCKET and one for the other level. Extract the inner demultiplex logic from the loop that walks the list, to allow having this called directly from a walker in the protocol specific code. Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 3月, 2016 1 次提交
-
-
由 Joonsoo Kim 提交于
The success of CMA allocation largely depends on the success of migration and key factor of it is page reference count. Until now, page reference is manipulated by direct calling atomic functions so we cannot follow up who and where manipulate it. Then, it is hard to find actual reason of CMA allocation failure. CMA allocation should be guaranteed to succeed so finding offending place is really important. In this patch, call sites where page reference is manipulated are converted to introduced wrapper function. This is preparation step to add tracepoint to each page reference manipulation function. With this facility, we can easily find reason of CMA allocation failure. There is no functional change in this patch. In addition, this patch also converts reference read sites. It will help a second step that renames page._count to something else and prevents later attempt to direct access to it (Suggested by Andrew). Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 2月, 2016 1 次提交
-
-
由 Tom Herbert 提交于
This patch add the SO_CNX_ADVICE socket option (setsockopt only). The purpose is to allow an application to give feedback to the kernel about the quality of the network path for a connected socket. The value argument indicates the type of quality report. For this initial patch the only supported advice is a value of 1 which indicates "bad path, please reroute"-- the action taken by the kernel is to call dst_negative_advice which will attempt to choose a different ECMP route, reset the TX hash for flow label and UDP source port in encapsulation, etc. This facility should be useful for connected UDP sockets where only the application can provide any feedback about path quality. It could also be useful for TCP applications that have additional knowledge about the path outside of the normal TCP control loop. Signed-off-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 2月, 2016 1 次提交
-
-
由 Craig Gallek 提交于
Both of the lines in this patch probably should have been included in the initial implementation of this code for generic socket support, but weren't technically necessary since only UDP sockets were supported. First, the sk_reuseport_cb points to a structure which assumes each socket in the group has this pointer assigned at the same time it's added to the array in the structure. The sk_clone_lock function breaks this assumption. Since a child socket shouldn't implicitly be in a reuseport group, the simple fix is to clear the field in the clone. Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that SO_REUSEPORT also be set first. For UDP sockets, this is easily enforced at bind-time since that process both puts the socket in the appropriate receive hlist and updates the reuseport structures. Since these operations can happen at two different times for TCP sockets (bind and listen) it must be explicitly checked to enforce the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the setsockopt call. Signed-off-by: NCraig Gallek <kraig@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 1月, 2016 4 次提交
-
-
由 Johannes Weiner 提交于
The unified hierarchy memory controller is going to use this jump label as well to control the networking callbacks. Move it to the memory controller code and give it a more generic name. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
There won't be any separate counters for socket memory consumed by protocols other than TCP in the future. Remove the indirection and link sockets directly to their owning memory cgroup. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
There won't be a tcp control soft limit, so integrating the memcg code into the global skmem limiting scheme complicates things unnecessarily. Replace this with simple and clear charge and uncharge calls--hidden behind a jump label--to account skb memory. Note that this is not purely aesthetic: as a result of shoehorning the per-memcg code into the same memory accounting functions that handle the global level, the old code would compare the per-memcg consumption against the smaller of the per-memcg limit and the global limit. This allowed the total consumption of multiple sockets to exceed the global limit, as long as the individual sockets stayed within bounds. After this change, the code will always compare the per-memcg consumption to the per-memcg limit, and the global consumption to the global limit, and thus close this loophole. Without a soft limit, the per-memcg memory pressure state in sockets is generally questionable. However, we did it until now, so we continue to enter it when the hard limit is hit, and packets are dropped, to let other sockets in the cgroup know that they shouldn't grow their transmit windows, either. However, keep it simple in the new callback model and leave memory pressure lazily when the next packet is accepted (as opposed to doing it synchroneously when packets are processed). When packets are dropped, network performance will already be in the toilet, so that should be a reasonable trade-off. As described above, consumption is now checked on the per-memcg level and the global level separately. Likewise, memory pressure states are maintained on both the per-memcg level and the global level, and a socket is considered under pressure when either level asserts as much. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Move the jump-label from sock_update_memcg() and sock_release_memcg() to the callsite, and so eliminate those function calls when socket accounting is not enabled. This also eliminates the need for dummy functions because the calls will be optimized away if the Kconfig options are not enabled. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid S. Miller <davem@davemloft.net> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 1月, 2016 1 次提交
-
-
由 Craig Gallek 提交于
Expose socket options for setting a classic or extended BPF program for use when selecting sockets in an SO_REUSEPORT group. These options can be used on the first socket to belong to a group before bind or on any socket in the group after bind. This change includes refactoring of the existing sk_filter code to allow reuse of the existing BPF filter validation checks. Signed-off-by: NCraig Gallek <kraig@google.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 12月, 2015 1 次提交
-
-
由 WANG Cong 提交于
Dmitry reported the following out-of-bound access: Call Trace: [<ffffffff816cec2e>] __asan_report_load4_noabort+0x3e/0x40 mm/kasan/report.c:294 [<ffffffff84affb14>] sock_setsockopt+0x1284/0x13d0 net/core/sock.c:880 [< inline >] SYSC_setsockopt net/socket.c:1746 [<ffffffff84aed7ee>] SyS_setsockopt+0x1fe/0x240 net/socket.c:1729 [<ffffffff85c18c76>] entry_SYSCALL_64_fastpath+0x16/0x7a arch/x86/entry/entry_64.S:185 This is because we mistake a raw socket as a tcp socket. We should check both sk->sk_type and sk->sk_protocol to ensure it is a tcp socket. Willem points out __skb_complete_tx_timestamp() needs to fix as well. Reported-by: NDmitry Vyukov <dvyukov@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 12月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
XFRM can deal with SYNACK messages, sent while listener socket is not locked. We add proper rcu protection to __xfrm_sk_clone_policy() and xfrm_sk_policy_lookup() This might serve as the first step to remove xfrm.xfrm_policy_lock use in fast path. Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSteffen Klassert <steffen.klassert@secunet.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 12月, 2015 2 次提交
-
-
由 Tejun Heo 提交于
In cgroup v1, dealing with cgroup membership was difficult because the number of membership associations was unbound. As a result, cgroup v1 grew several controllers whose primary purpose is either tagging membership or pull in configuration knobs from other subsystems so that cgroup membership test can be avoided. net_cls and net_prio controllers are examples of the latter. They allow configuring network-specific attributes from cgroup side so that network subsystem can avoid testing cgroup membership; unfortunately, these are not only cumbersome but also problematic. Both net_cls and net_prio aren't properly hierarchical. Both inherit configuration from the parent on creation but there's no interaction afterwards. An ancestor doesn't restrict the behavior in its subtree in anyway and configuration changes aren't propagated downwards. Especially when combined with cgroup delegation, this is problematic because delegatees can mess up whatever network configuration implemented at the system level. net_prio would allow the delegatees to set whatever priority value regardless of CAP_NET_ADMIN and net_cls the same for classid. While it is possible to solve these issues from controller side by implementing hierarchical allowable ranges in both controllers, it would involve quite a bit of complexity in the controllers and further obfuscate network configuration as it becomes even more difficult to tell what's actually being configured looking from the network side. While not much can be done for v1 at this point, as membership handling is sane on cgroup v2, it'd be better to make cgroup matching behave like other network matches and classifiers than introducing further complications. In preparation, this patch updates sock->sk_cgrp_data handling so that it points to the v2 cgroup that sock was created in until either net_prio or net_cls is used. Once either of the two is used, sock->sk_cgrp_data reverts to its previous role of carrying prioidx and classid. This is to avoid adding yet another cgroup related field to struct sock. As the mode switching can happen at most once per boot, the switching mechanism is aimed at lowering hot path overhead. It may leak a finite, likely small, number of cgroup refs and report spurious prioidx or classid on switching; however, dynamic updates of prioidx and classid have always been racy and lossy - socks between creation and fd installation are never updated, config changes don't update existing sockets at all, and prioidx may index with dead and recycled cgroup IDs. Non-critical inaccuracies from small race windows won't make any noticeable difference. This patch doesn't make use of the pointer yet. The following patch will implement netfilter match for cgroup2 membership. v2: Use sock_cgroup_data to avoid inflating struct sock w/ another cgroup specific field. v3: Add comments explaining why sock_data_prioidx() and sock_data_classid() use different fallback values. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Wagner <daniel.wagner@bmw-carit.de> CC: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Tejun Heo 提交于
Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data. ->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct and its accessors are defined in cgroup-defs.h. This is to prepare for overloading the fields with a cgroup pointer. This patch mostly performs equivalent conversions but the followings are noteworthy. * Equality test before updating classid is removed from sock_update_classid(). This shouldn't make any noticeable difference and a similar test will be implemented on the helper side later. * sock_update_netprioidx() now takes struct sock_cgroup_data and can be moved to netprio_cgroup.h without causing include dependency loop. Moved. * The dummy version of sock_update_netprioidx() converted to a static inline function while at it. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 12月, 2015 1 次提交
-
-
由 Marcelo Ricardo Leitner 提交于
Dmitry Vyukov reported that SCTP was triggering a WARN on socket destroy related to disabling sock timestamp. When SCTP accepts an association or peel one off, it copies sock flags but forgot to call net_enable_timestamp() if a packet timestamping flag was copied, leading to extra calls to net_disable_timestamp() whenever such clones were closed. The fix is to call net_enable_timestamp() whenever we copy a sock with that flag on, like tcp does. Reported-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: NVlad Yasevich <vyasevich@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 12月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
While testing the np->opt RCU conversion, I found that UDP/IPv6 was using a mixture of xchg() and sk_dst_lock to protect concurrent changes to sk->sk_dst_cache, leading to possible corruptions and crashes. ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest way to fix the mess is to remove sk_dst_lock completely, as we did for IPv4. __ip6_dst_store() and ip6_dst_store() share same implementation. sk_setup_caps() being called with socket lock being held or not, we have to use sk_dst_set() instead of __sk_dst_set() Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);" in ip6_dst_store() before the sk_setup_caps(sk, dst) call. This is because ip6_dst_store() can be called from process context, without any lock held. As soon as the dst is installed in sk->sk_dst_cache, dst can be freed from another cpu doing a concurrent ip6_dst_store() Doing the dst dereference before doing the install is needed to make sure no use after free would trigger. Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 12月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
This patch is a cleanup to make following patch easier to review. Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA from (struct socket)->flags to a (struct socket_wq)->flags to benefit from RCU protection in sock_wake_async() To ease backports, we rename both constants. Two new helpers, sk_set_bit(int nr, struct sock *sk) and sk_clear_bit(int net, struct sock *sk) are added so that following patch can change their implementation. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 12月, 2015 1 次提交
-
-
由 Herbert Xu 提交于
The memory barrier in the helper wq_has_sleeper is needed by just about every user of waitqueue_active. This patch generalises it by making it take a wait_queue_head_t directly. The existing helper is renamed to skwq_has_sleeper. Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 11月, 2015 1 次提交
-
-
由 Mel Gorman 提交于
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 11月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
skb_set_owner_w() is called from various places that assume skb->sk always point to a full blown socket (as it changes sk->sk_wmem_alloc) We'd like to attach skb to request sockets, and in the future to timewait sockets as well. For these kind of pseudo sockets, we need to take a traditional refcount and use sock_edemux() as the destructor. It is now time to un-inline skb_set_owner_w(), being too big. Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: NEric Dumazet <edumazet@google.com> Bisected-by: NHaiyang Zhang <haiyangz@microsoft.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 10月, 2015 1 次提交
-
-
由 Hannes Frederic Sowa 提交于
netstamp_needed is toggled for all socket families if they request timestamping. But some protocols don't need the lower-layer timestamping code at all. This patch starts disabling it for af-unix. E.g. systemd enables timestamping during boot-up on the journald af-unix sockets, thus causing the system to globally enable timestamping in the lower networking stack. Still, it is very probable that timestamping gets activated, by e.g. dhclient or various NTP implementations. Reported-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 10月, 2015 2 次提交
-
-
由 Eric Dumazet 提交于
SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command to fetch incoming cpu handling a particular TCP flow after accept() This commits adds setsockopt() support and extends SO_REUSEPORT selection logic : If a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if CPU handling the packet matches the specified one. This allows to build very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same cpu. This provides optimal NUMA behavior and keep cpu caches hot. Note that __inet_lookup_listener() still has to iterate over the list of all listeners. Following patch puts sk_refcnt in a different cache line to let this iteration hit only shared and read mostly cache lines. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Edward Jee 提交于
It's useful to allow users to set fwmark for an individual packet, without changing the socket state. The function this patch adds in sock layer can be used by the protocols that need such a feature. Signed-off-by: NEdward Hyunkoo Jee <edjee@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 10月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
Before letting request sockets being put in TCP/DCCP regular ehash table, we need to add either : - SLAB_DESTROY_BY_RCU flag to their kmem_cache - add RCU grace period before freeing them. Since we carefully respected the SLAB_DESTROY_BY_RCU protocol like ESTABLISH and TIMEWAIT sockets, use it here. req_prot_init() being only used by TCP and DCCP, I did not add a new slab_flags into their rsk_prot, but reuse prot->slab_flags Since all reqsk_alloc() users are correctly dealing with a failure, add the __GFP_NOWARN flag to avoid traces under pressure. Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 9月, 2015 1 次提交
-
-
由 Julia Lawall 提交于
Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) { \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); x = NULL; -} // </smpl> Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 8月, 2015 1 次提交
-
-
由 Jean Sacren 提交于
The symbol '__sk_reclaim' is not present in the current tree. Apparently '__sk_reclaim' was meant to be '__sk_mem_reclaim', so fix it with the right symbol name for the kernel doc. Signed-off-by: NJean Sacren <sakiwit@gmail.com> Cc: Hideo Aoki <haoki@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 7月, 2015 1 次提交
-
-
由 Sowmini Varadhan 提交于
The newsk returned by sk_clone_lock should hold a get_net() reference if, and only if, the parent is not a kernel socket (making this similar to sk_alloc()). E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock sets up the syn_recv newsk from sk_clone_lock. When the parent (listen) socket is a kernel socket (defined in sk_alloc() as having sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt and should not hold a get_net() reference. Fixes: 26abe143 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Acked-by: NEric Dumazet <edumazet@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 7月, 2015 1 次提交
-
-
由 Sabrina Dubroca 提交于
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called with flags = MSG_WAITALL | MSG_PEEK. sk_wait_data waits for sk_receive_queue not empty, but in this case, the receive queue is not empty, but does not contain any skb that we can use. Add a "last skb seen on receive queue" argument to sk_wait_data, so that it sleeps until the receive queue has new skbs. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493 Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258Reported-by: NEnrico Scholz <rh-bugzilla@ensc.de> Reported-by: NDan Searle <dan@censornet.com> Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 7月, 2015 1 次提交
-
-
由 Craig Gallek 提交于
Kernel sockets do not hold a reference for the network namespace to which they point. Socket destruction broadcasting relies on the network namespace and will cause the splat below when a kernel socket is destroyed. This fix simply ignores kernel sockets when they are destroyed. Reported as: general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1 Workqueue: sock_diag_events sock_diag_broadcast_destroy_work Stack: ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90 ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90 ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8 Call Trace: [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10 [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag] [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20 [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10 [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160 [<ffffffff8408ea97>] process_one_work+0x147/0x420 [<ffffffff8408f0f9>] worker_thread+0x69/0x470 [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0 [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320 [<ffffffff84093cd7>] kthread+0x107/0x120 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0 [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0 Tested: Using a debug kernel while 'ss -E' is running: ip netns add test-ns ip netns delete test-ns Fixes: eb4cb008 sock_diag: define destruction multicast groups Fixes: 26abe143 net: Modify sk_alloc to not reference count the netns of kernel sockets. Reported-by: NDave Jones <davej@codemonkey.org.uk> Suggested-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NCraig Gallek <kraig@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 6月, 2015 1 次提交
-
-
由 David Miller 提交于
No more users, so it can now be removed. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 6月, 2015 1 次提交
-
-
由 Craig Gallek 提交于
These groups will contain socket-destruction events for AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP. Near the end of socket destruction, a check for listeners is performed. In the presence of a listener, rather than completely cleanup the socket, a unit of work will be added to a private work queue which will first broadcast information about the socket and then finish the cleanup operation. Signed-off-by: NCraig Gallek <kraig@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 6月, 2015 1 次提交
-
-
由 Shaohua Li 提交于
We saw excessive direct memory compaction triggered by skb_page_frag_refill. This causes performance issues and add latency. Commit 5640f768 introduces the order-3 allocation. According to the changelog, the order-3 allocation isn't a must-have but to improve performance. But direct memory compaction has high overhead. The benefit of order-3 allocation can't compensate the overhead of direct memory compaction. This patch makes the order-3 page allocation atomic. If there is no memory pressure and memory isn't fragmented, the alloction will still success, so we don't sacrifice the order-3 benefit here. If the atomic allocation fails, direct memory compaction will not be triggered, skb_page_frag_refill will fallback to order-0 immediately, hence the direct memory compaction overhead is avoided. In the allocation failure case, kswapd is waken up and doing compaction, so chances are allocation could success next time. alloc_skb_with_frags is the same. The mellanox driver does similar thing, if this is accepted, we must fix the driver too. V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric V2: make the changelog clearer Cc: Eric Dumazet <edumazet@google.com> Cc: Chris Mason <clm@fb.com> Cc: Debabrata Banerjee <dbavatar@gmail.com> Signed-off-by: NShaohua Li <shli@fb.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 6月, 2015 1 次提交
-
-
由 Mel Gorman 提交于
Jeff Layton reported the following; [ 74.232485] ------------[ cut here ]------------ [ 74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80() [ 74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio [ 74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5 [ 74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 74.245546] 0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d [ 74.246786] 0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa [ 74.248175] 0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8 [ 74.249483] Call Trace: [ 74.249872] [<ffffffff8179263d>] dump_stack+0x45/0x57 [ 74.250703] [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0 [ 74.251655] [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20 [ 74.252585] [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80 [ 74.253519] [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc] [ 74.254537] [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc] [ 74.255610] [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs] [ 74.256582] [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80 [ 74.257496] [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0 [ 74.258318] [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140 [ 74.259253] [<ffffffff81798dae>] system_call_fastpath+0x12/0x71 [ 74.260158] ---[ end trace 2530722966429f10 ]--- The warning in question was unnecessary but with Jeff's series the rules are also clearer. This patch removes the warning and updates the comment to explain why sk_mem_reclaim() may still be called. [jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon points out that it's not needed.] Cc: Leon Romanovsky <leon@leon.nu> Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NJeff Layton <jeff.layton@primarydata.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 5月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
By making sure sk->sk_gso_max_segs minimal value is one, and sysctl_tcp_min_tso_segs minimal value is one as well, tcp_tso_autosize() will return a non zero value. We can then revert 843925f3 ("tcp: Do not apply TSO segment limit to non-TSO packets") and save few cpu cycles in fast path. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NHerbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 5月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
sk_mem_reclaim_partial() goal is to ensure each socket has one SK_MEM_QUANTUM forward allocation. This is needed both for performance and better handling of memory pressure situations in follow up patches. SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes as some arches have 64KB pages. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 5月, 2015 3 次提交
-
-
由 Eric W. Biederman 提交于
These functions are no longer needed and no longer used kill them. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
Now that sk_alloc knows when a kernel socket is being allocated modify it to not reference count the network namespace of kernel sockets. Keep track of if a socket needs reference counting by adding a flag to struct sock called sk_net_refcnt. Update all of the callers of sock_create_kern to stop using sk_change_net and sk_release_kernel as those hacks are no longer needed, to avoid reference counting a kernel socket. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric W. Biederman 提交于
In preparation for changing how struct net is refcounted on kernel sockets pass the knowledge that we are creating a kernel socket from sock_create_kern through to sk_alloc. Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 5月, 2015 1 次提交
-
-
由 Herbert Xu 提交于
This reverts commit c243d7e2. That patch is solving a non-existant problem while creating a real problem. Just because a socket is allocated in the init name space doesn't mean that it gets hashed in the init name space. When we unhash it the name space must be the same as the one we had when we hashed it. So this patch is completely bogus and causes socket leaks. Reported-by: NAndrey Wagin <avagin@gmail.com> Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-