- 01 4月, 2018 1 次提交
-
-
由 Eric Dumazet 提交于
In order to simplify the API, add a pointer to struct inet_frags. This will allow us to make things less complex. These functions no longer have a struct inet_frags parameter : inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */) inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */) ip6_expire_frag_queue(struct net *net, struct frag_queue *fq) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 3月, 2018 1 次提交
-
-
由 Kirill Tkhai 提交于
inet_evict_bucket() iterates global list, and several tasks may call it in parallel. All of them hash the same fq->list_evictor to different lists, which leads to list corruption. This patch makes fq be hashed to expired list only if this has not been made yet by another task. Since inet_frag_alloc() allocates fq using kmem_cache_zalloc(), we may rely on list_evictor is initially unhashed. The problem seems to exist before async pernet_operations, as there was possible to have exit method to be executed in parallel with inet_frags::frags_work, so I add two Fixes tags. This also may go to stable. Fixes: d1fe1944 "inet: frag: don't re-use chainlist for evictor" Fixes: f84c6821 "net: Convert pernet_subsys, registered from inet_init()" Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 10月, 2017 1 次提交
-
-
由 Mark Rutland 提交于
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: NMark Rutland <mark.rutland@arm.com> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 10月, 2017 1 次提交
-
-
由 Kees Cook 提交于
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Cc: Florian Westphal <fw@strlen.de> Cc: linux-wpan@vger.kernel.org Cc: netdev@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Cc: coreteam@netfilter.org Signed-off-by: NKees Cook <keescook@chromium.org> Acked-by: Stefan Schmidt <stefan@osg.samsung.com> # for ieee802154 Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 9月, 2017 1 次提交
-
-
由 Jesper Dangaard Brouer 提交于
This reverts commit 6d7b857d. There is a bug in fragmentation codes use of the percpu_counter API, that can cause issues on systems with many CPUs. The frag_mem_limit() just reads the global counter (fbc->count), without considering other CPUs can have upto batch size (130K) that haven't been subtracted yet. Due to the 3MBytes lower thresh limit, this become dangerous at >=24 CPUs (3*1024*1024/130000=24). The correct API usage would be to use __percpu_counter_compare() which does the right thing, and takes into account the number of (online) CPUs and batch size, to account for this and call __percpu_counter_sum() when needed. We choose to revert the use of the lib/percpu_counter API for frag memory accounting for several reasons: 1) On systems with CPUs > 24, the heavier fully locked __percpu_counter_sum() is always invoked, which will be more expensive than the atomic_t that is reverted to. Given systems with more than 24 CPUs are becoming common this doesn't seem like a good option. To mitigate this, the batch size could be decreased and thresh be increased. 2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX CPU, before SKBs are pushed into sockets on remote CPUs. Given NICs can only hash on L2 part of the IP-header, the NIC-RXq's will likely be limited. Thus, a fair chance that atomic add+dec happen on the same CPU. Revert note that commit 1d6119ba ("net: fix percpu memory leaks") removed init_frag_mem_limit() and instead use inet_frags_init_net(). After this revert, inet_frags_uninit_net() becomes empty. Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting") Fixes: 1d6119ba ("net: fix percpu memory leaks") Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 7月, 2017 1 次提交
-
-
由 Reshetova, Elena 提交于
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: NElena Reshetova <elena.reshetova@intel.com> Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com> Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NDavid Windsor <dwindsor@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 6月, 2016 1 次提交
-
-
由 Michal Kubeček 提交于
Before commit 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting"), setting the reassembly high threshold to 0 prevented fragment reassembly as first fragment would be always evicted before second could be added to the queue. While inefficient, some users apparently relied on this method. Since the commit mentioned above, a percpu counter is used for reassembly memory accounting and high batch size avoids taking slow path in most common scenarios. As a result, a whole full sized packet can be reassembled without the percpu counter's main counter changing its value so that even with high_thresh set to 0, fragmented packets can be still reassembled and processed. Add explicit check preventing reassembly if high threshold is zero. Signed-off-by: NMichal Kubecek <mkubecek@suse.cz> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 1月, 2016 1 次提交
-
-
由 Florian Westphal 提交于
The only user was removed in commit 029f7f3b ("netfilter: ipv6: nf_defrag: avoid/free clone operations"). Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 11月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
This patch fixes following problems : 1) percpu_counter_init() can return an error, therefore init_frag_mem_limit() must propagate this error so that inet_frags_init_net() can do the same up to its callers. 2) If ip[46]_frags_ns_ctl_register() fail, we must unwind properly and free the percpu_counter. Without this fix, we leave freed object in percpu_counters global list (if CONFIG_HOTPLUG_CPU) leading to crashes. This bug was detected by KASAN and syzkaller tool (http://github.com/google/syzkaller) Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 7月, 2015 4 次提交
-
-
由 Nikolay Aleksandrov 提交于
We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags race conditions with the evictor and use a participation test for the evictor list, when we're at that point (after inet_frag_kill) in the timer there're 2 possible cases: 1. The evictor added the entry to its evictor list while the timer was waiting for the chainlock or 2. The timer unchained the entry and the evictor won't see it In both cases we should be able to see list_evictor correctly due to the sync on the chainlock. Joint work with Florian Westphal. Tested-by: NFrank Schreuder <fschreuder@transip.nl> Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
Frank reports 'NMI watchdog: BUG: soft lockup' errors when load is high. Instead of (potentially) unbounded restarts of the eviction process, just skip to the next entry. One caveat is that, when a netns is exiting, a timer may still be running by the time inet_evict_bucket returns. We use the frag memory accounting to wait for outstanding timers, so that when we free the percpu counter we can be sure no running timer will trip over it. Reported-and-tested-by: NFrank Schreuder <fschreuder@transip.nl> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
Followup patch will call it after inet_frag_queue was freed, so q->net doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would). Tested-by: NFrank Schreuder <fschreuder@transip.nl> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
commit 65ba1f1e ("inet: frags: fix a race between inet_evict_bucket and inet_frag_kill") describes the bug, but the fix doesn't work reliably. Problem is that ->flags member can be set on other cpu without chainlock being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED bit after we put the element on the evictor private list. We can crash when walking the 'private' evictor list since an element can be deleted from list underneath the evictor. Join work with Nikolay Alexandrov. Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue") Reported-by: NJohan Schuijt <johan@transip.nl> Tested-by: NFrank Schreuder <fschreuder@transip.nl> Signed-off-by: NNikolay Alexandrov <nikolay@cumulusnetworks.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 4月, 2015 1 次提交
-
-
由 Ian Morris 提交于
The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: NIan Morris <ipm@chirality.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 11月, 2014 1 次提交
-
-
由 Joe Perches 提交于
Use the more common dynamic_debug capable net_dbg_ratelimited and remove the LIMIT_NETDEBUG macro. All messages are still ratelimited. Some KERN_<LEVEL> uses are changed to KERN_DEBUG. This may have some negative impact on messages that were emitted at KERN_INFO that are not not enabled at all unless DEBUG is defined or dynamic_debug is enabled. Even so, these messages are now _not_ emitted by default. This also eliminates the use of the net_msg_warn sysctl "/proc/sys/net/core/warnings". For backward compatibility, the sysctl is not removed, but it has no function. The extern declaration of net_msg_warn is removed from sock.h and made static in net/core/sysctl_net_core.c Miscellanea: o Update the sysctl documentation o Remove the embedded uses of pr_fmt o Coalesce format fragments o Realign arguments Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 10月, 2014 2 次提交
-
-
由 Nikolay Aleksandrov 提交于
The WARN_ON in inet_evict_bucket can be triggered by a valid case: inet_frag_kill and inet_evict_bucket can be running in parallel on the same queue which means that there has been at least one more ref added by a previous inet_frag_find call, but inet_frag_kill can delete the timer before inet_evict_bucket which will cause the WARN_ON() there to trigger since we'll have refcnt!=1. Now, this case is valid because the queue is being "killed" for some reason (removed from the chain list and its timer deleted) so it will get destroyed in the end by one of the inet_frag_put() calls which reaches 0 i.e. refcnt is still valid. CC: Florian Westphal <fw@strlen.de> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McLean <chutzpah@gentoo.org> Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue") Reported-by: NPatrick McLean <chutzpah@gentoo.org> Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
When the evictor is running it adds some chosen frags to a local list to be evicted once the chain lock has been released but at the same time the *frag_queue can be running for some of the same queues and it may call inet_frag_kill which will wait on the chain lock and will then delete the queue from the wrong list since it was added in the eviction one. The fix is simple - check if the queue has the evict flag set under the chain lock before deleting it, this is safe because the evict flag is set only under that lock and having the flag set also means that the queue has been detached from the chain list, so no need to delete it again. An important note to make is that we're safe w.r.t refcnt because inet_frag_kill and inet_evict_bucket will sync on the del_timer operation where only one of the two can succeed (or if the timer is executing - none of them), the cases are: 1. inet_frag_kill succeeds in del_timer - then the timer ref is removed, but inet_evict_bucket will not add this queue to its expire list but will restart eviction in that chain 2. inet_evict_bucket succeeds in del_timer - then the timer ref is kept until the evictor "expires" the queue, but inet_frag_kill will remove the initial ref and will set INET_FRAG_COMPLETE which will make the frag_expire fn just to remove its ref. In the end all of the queue users will do an inet_frag_put and the one that reaches 0 will free it. The refcount balance should be okay. CC: Florian Westphal <fw@strlen.de> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McLean <chutzpah@gentoo.org> Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue") Suggested-by: NEric Dumazet <eric.dumazet@gmail.com> Reported-by: NPatrick McLean <chutzpah@gentoo.org> Tested-by: NPatrick McLean <chutzpah@gentoo.org> Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Reviewed-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 8月, 2014 4 次提交
-
-
由 Nikolay Aleksandrov 提交于
Use kmem_cache to allocate/free inet_frag_queue objects since they're all the same size per inet_frags user and are alloced/freed in high volumes thus making it a perfect case for kmem_cache. Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Acked-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
Now that we have INET_FRAG_EVICTED we might as well use it to stop sending icmp messages in the "frag_expire" functions instead of stripping INET_FRAG_FIRST_IN from their flags when evicting. Also fix the comment style in ip6_expire_frag_queue(). Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Reviewed-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
Fix a couple of functions' declaration alignments. Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
The last_in field has been used to store various flags different from first/last frag in so give it a more descriptive name: flags. Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 7月, 2014 7 次提交
-
-
由 Florian Westphal 提交于
rehash is rare operation, don't force readers to take the read-side rwlock. Instead, we only have to detect the (rare) case where the secret was altered while we are trying to insert a new inetfrag queue into the table. If it was changed, drop the bucket lock and recompute the hash to get the 'new' chain bucket that we have to insert into. Joint work with Nikolay Aleksandrov. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
merge functionality into the eviction workqueue. Instead of rebuilding every n seconds, take advantage of the upper hash chain length limit. If we hit it, mark table for rebuild and schedule workqueue. To prevent frequent rebuilds when we're completely overloaded, don't rebuild more than once every 5 seconds. ipfrag_secret_interval sysctl is now obsolete and has been marked as deprecated, it still can be changed so scripts won't be broken but it won't have any effect. A comment is left above each unused secret_timer variable to avoid confusion. Joint work with Nikolay Aleksandrov. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
no longer used. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
The 'nqueues' counter is protected by the lru list lock, once thats removed this needs to be converted to atomic counter. Given this isn't used for anything except for reporting it to userspace via /proc, just remove it. We still report the memory currently used by fragment reassembly queues. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
When the high_thresh limit is reached we try to toss the 'oldest' incomplete fragment queues until memory limits are below the low_thresh value. This happens in softirq/packet processing context. This has two drawbacks: 1) processors might evict a queue that was about to be completed by another cpu, because they will compete wrt. resource usage and resource reclaim. 2) LRU list maintenance is expensive. But when constantly overloaded, even the 'least recently used' element is recent, so removing 'lru' queue first is not 'fairer' than removing any other fragment queue. This moves eviction out of the fast path: When the low threshold is reached, a work queue is scheduled which then iterates over the table and removes the queues that exceed the memory limits of the namespace. It sets a new flag called INET_FRAG_EVICTED on the evicted queues so the proper counters will get incremented when the queue is forcefully expired. When the high threshold is reached, no more fragment queues are created until we're below the limit again. The LRU list is now unused and will be removed in a followup patch. Joint work with Nikolay Aleksandrov. Suggested-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
First step to move eviction handling into a work queue. We lose two spots that accounted evicted fragments in MIB counters. Accounting will be restored since the upcoming work-queue evictor invokes the frag queue timer callbacks instead. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
hide actual hash size from individual users: The _find function will now fold the given hash value into the required range. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 3月, 2014 1 次提交
-
-
由 Florian Westphal 提交于
Quoting Alexander Aring: While fragmentation and unloading of 6lowpan module I got this kernel Oops after few seconds: BUG: unable to handle kernel paging request at f88bbc30 [..] Modules linked in: ipv6 [last unloaded: 6lowpan] Call Trace: [<c012af4c>] ? call_timer_fn+0x54/0xb3 [<c012aef8>] ? process_timeout+0xa/0xa [<c012b66b>] run_timer_softirq+0x140/0x15f Problem is that incomplete frags are still around after unload; when their frag expire timer fires, we get crash. When a netns is removed (also done when unloading module), inet_frag calls the evictor with 'force' argument to purge remaining frags. The evictor loop terminates when accounted memory ('work') drops to 0 or the lru-list becomes empty. However, the mem accounting is done via percpu counters and may not be accurate, i.e. loop may terminate prematurely. Alter evictor to only stop once the lru list is empty when force is requested. Reported-by: NPhoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Reported-by: NAlexander Aring <alex.aring@gmail.com> Tested-by: NAlexander Aring <alex.aring@gmail.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 3月, 2014 1 次提交
-
-
由 Nikolay Aleksandrov 提交于
I stumbled upon this very serious bug while hunting for another one, it's a very subtle race condition between inet_frag_evictor, inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically the users of inet_frag_kill/inet_frag_put). What happens is that after a fragment has been added to the hash chain but before it's been added to the lru_list (inet_frag_lru_add) in inet_frag_intern, it may get deleted (either by an expired timer if the system load is high or the timer sufficiently low, or by the fraq_queue function for different reasons) before it's added to the lru_list, then after it gets added it's a matter of time for the evictor to get to a piece of memory which has been freed leading to a number of different bugs depending on what's left there. I've been able to trigger this on both IPv4 and IPv6 (which is normal as the frag code is the same), but it's been much more difficult to trigger on IPv4 due to the protocol differences about how fragments are treated. The setup I used to reproduce this is: 2 machines with 4 x 10G bonded in a RR bond, so the same flow can be seen on multiple cards at the same time. Then I used multiple instances of ping/ping6 to generate fragmented packets and flood the machines with them while running other processes to load the attacked machine. *It is very important to have the _same flow_ coming in on multiple CPUs concurrently. Usually the attacked machine would die in less than 30 minutes, if configured properly to have many evictor calls and timeouts it could happen in 10 minutes or so. An important point to make is that any caller (frag_queue or timer) of inet_frag_kill will remove both the timer refcount and the original/guarding refcount thus removing everything that's keeping the frag from being freed at the next inet_frag_put. All of this could happen before the frag was ever added to the LRU list, then it gets added and the evictor uses a freed fragment. An example for IPv6 would be if a fragment is being added and is at the stage of being inserted in the hash after the hash lock is released, but before inet_frag_lru_add executes (or is able to obtain the lru lock) another overlapping fragment for the same flow arrives at a different CPU which finds it in the hash, but since it's overlapping it drops it invoking inet_frag_kill and thus removing all guarding refcounts, and afterwards freeing it by invoking inet_frag_put which removes the last refcount added previously by inet_frag_find, then inet_frag_lru_add gets executed by inet_frag_intern and we have a freed fragment in the lru_list. The fix is simple, just move the lru_add under the hash chain locked region so when a removing function is called it'll have to wait for the fragment to be added to the lru_list, and then it'll remove it (it works because the hash chain removal is done before the lru_list one and there's no window between the two list adds when the frag can get dropped). With this fix applied I couldn't kill the same machine in 24 hours with the same setup. Fixes: 3ef0eb0d ("net: frag, move LRU list maintenance outside of rwlock") CC: Florian Westphal <fw@strlen.de> CC: Jesper Dangaard Brouer <brouer@redhat.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 10月, 2013 1 次提交
-
-
由 Hannes Frederic Sowa 提交于
All fragmentation hash secrets now get initialized by their corresponding hash function with net_get_random_once. Thus we can eliminate the initial seeding. Also provide a comment that hash secret seeding happens at the first call to the corresponding hashing function. Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 7月, 2013 1 次提交
-
-
由 Jiang Liu 提交于
The global variable num_physpages is scheduled to be removed, so use totalram_pages instead of num_physpages at runtime. Signed-off-by: NJiang Liu <jiang.liu@huawei.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 6月, 2013 1 次提交
-
-
由 Rami Rosen 提交于
This patch removes an empty ifdef from inet_frag_intern() in net/ipv4/inet_fragment.c. commit b67bfe0d (hlist: drop the node parameter from iterators) removed hlist from net/ipv4/inet_fragment.c, but did not remove the enclosing ifdef command, which is now empty. Signed-off-by: NRami Rosen <ramirose@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 5月, 2013 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
This patch fixes race between inet_frag_lru_move() and inet_frag_lru_add() which was introduced in commit 3ef0eb0d ("net: frag, move LRU list maintenance outside of rwlock") One cpu already added new fragment queue into hash but not into LRU. Other cpu found it in hash and tries to move it to the end of LRU. This leads to NULL pointer dereference inside of list_move_tail(). Another possible race condition is between inet_frag_lru_move() and inet_frag_lru_del(): move can happens after deletion. This patch initializes LRU list head before adding fragment into hash and inet_frag_lru_move() doesn't touches it if it's empty. I saw this kernel oops two times in a couple of days. [119482.128853] BUG: unable to handle kernel NULL pointer dereference at (null) [119482.132693] IP: [<ffffffff812ede89>] __list_del_entry+0x29/0xd0 [119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0 [119482.140221] Oops: 0000 [#1] SMP [119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii [119482.152692] CPU 3 [119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D [119482.161478] RIP: 0010:[<ffffffff812ede89>] [<ffffffff812ede89>] __list_del_entry+0x29/0xd0 [119482.166004] RSP: 0018:ffff880216d5db58 EFLAGS: 00010207 [119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200 [119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00 [119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014 [119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00 [119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0 [119482.194140] FS: 00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000 [119482.198928] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0 [119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0) [119482.223113] Stack: [119482.228004] ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001 [119482.233038] ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000 [119482.238083] 00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00 [119482.243090] Call Trace: [119482.248009] [<ffffffff8155dcda>] ip_defrag+0x8fa/0xd10 [119482.252921] [<ffffffff815a8013>] ipv4_conntrack_defrag+0x83/0xe0 [119482.257803] [<ffffffff8154485b>] nf_iterate+0x8b/0xa0 [119482.262658] [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40 [119482.267527] [<ffffffff815448e4>] nf_hook_slow+0x74/0x130 [119482.272412] [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40 [119482.277302] [<ffffffff8155d068>] ip_rcv+0x268/0x320 [119482.282147] [<ffffffff81519992>] __netif_receive_skb_core+0x612/0x7e0 [119482.286998] [<ffffffff81519b78>] __netif_receive_skb+0x18/0x60 [119482.291826] [<ffffffff8151a650>] process_backlog+0xa0/0x160 [119482.296648] [<ffffffff81519f29>] net_rx_action+0x139/0x220 [119482.301403] [<ffffffff81053707>] __do_softirq+0xe7/0x220 [119482.306103] [<ffffffff81053868>] run_ksoftirqd+0x28/0x40 [119482.310809] [<ffffffff81074f5f>] smpboot_thread_fn+0xff/0x1a0 [119482.315515] [<ffffffff81074e60>] ? lg_local_lock_cpu+0x40/0x40 [119482.320219] [<ffffffff8106d870>] kthread+0xc0/0xd0 [119482.324858] [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40 [119482.329460] [<ffffffff816c32dc>] ret_from_fork+0x7c/0xb0 [119482.334057] [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40 [119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08 [119482.343787] RIP [<ffffffff812ede89>] __list_del_entry+0x29/0xd0 [119482.348675] RSP <ffff880216d5db58> [119482.353493] CR2: 0000000000000000 Oops happened on this path: ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry() Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Florian Westphal <fw@strlen.de> Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Acked-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 4月, 2013 1 次提交
-
-
由 Jesper Dangaard Brouer 提交于
This patch implements per hash bucket locking for the frag queue hash. This removes two write locks, and the only remaining write lock is for protecting hash rebuild. This essentially reduce the readers-writer lock to a rebuild lock. This patch is part of "net: frag performance followup" http://thread.gmane.org/gmane.linux.network/263644 of which two patches have already been accepted: Same test setup as previous: (http://thread.gmane.org/gmane.linux.network/257155) Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses Ethernet flow-control. A third interface is used for generating the DoS attack (with trafgen). Notice, I have changed the frag DoS generator script to be more efficient/deadly. Before it would only hit one RX queue, now its sending packets causing multi-queue RX, due to "better" RX hashing. Test types summary (netperf UDP_STREAM): Test-20G64K == 2x10G with 65K fragments Test-20G3F == 2x10G with 3x fragments (3*1472 bytes) Test-20G64K+DoS == Same as 20G64K with frag DoS Test-20G3F+DoS == Same as 20G3F with frag DoS Test-20G64K+MQ == Same as 20G64K with Multi-Queue frag DoS Test-20G3F+MQ == Same as 20G3F with Multi-Queue frag DoS When I rebased this-patch(03) (on top of net-next commit a210576c) and removed the _bh spinlock, I saw a performance regression. BUT this was caused by some unrelated change in-between. See tests below. Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de. Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02. Test (C) is what I reported before for this-patch Test (D) is net-next master HEAD (commit a210576c), which reveals some (unknown) performance regression (compared against test (B)). Test (D) function as a new base-test. Performance table summary (in Mbit/s): (#) Test-type: 20G64K 20G3F 20G64K+DoS 20G3F+DoS 20G64K+MQ 20G3F+MQ ---------- ------- ------- ---------- --------- -------- ------- (A) Patch-02 : 18848.7 13230.1 4103.04 5310.36 130.0 440.2 (B) 1b5ab0de : 18841.5 13156.8 4101.08 5314.57 129.0 424.2 (C) Patch-03v1: 18838.0 13490.5 4405.11 6814.72 196.6 461.6 (D) a210576c : 18321.5 11250.4 3635.34 5160.13 119.1 405.2 (E) with _bh : 17247.3 11492.6 3994.74 6405.29 166.7 413.6 (F) without bh: 17471.3 11298.7 3818.05 6102.11 165.7 406.3 Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks. I cannot explain the slow down for 20G64K (but its an artificial "lab-test" so I'm not worried). But the other results does show improvements. And test (E) "with _bh" version is slightly better. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: NEric Dumazet <edumazet@google.com> ---- V2: - By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't need the spinlock _bh versions, as Netfilter currently does a local_bh_disable() before entering inet_fragment. - Fold-in desc from cover-mail V3: - Drop the chain_len counter per hash bucket. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 3月, 2013 2 次提交
-
-
由 Jesper Dangaard Brouer 提交于
Move the protection of netns_frags.nqueues updates under the LRU_lock, instead of the write lock. As they are located on the same cacheline, and this is also needed when transitioning to use per hash bucket locking. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jesper Dangaard Brouer 提交于
The LRU list is protected by its own lock, since commit 3ef0eb0d (net: frag, move LRU list maintenance outside of rwlock), and no-longer by a read_lock. This makes it possible, to remove the inet_frag_queue, which is about to be "evicted", from the LRU list head. This avoids the problem, of several CPUs grabbing the same frag queue. Note, cannot remove the inet_frag_lru_del() call in fq_unlink() called by inet_frag_kill(), because inet_frag_kill() is also used in other situations. Thus, we use list_del_init() to allow this double list_del to work. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 3月, 2013 1 次提交
-
-
由 Hannes Frederic Sowa 提交于
This patch just moves some code arround to make the ip4_frag_ecn_table and IPFRAG_ECN_* constants accessible from the other reassembly engines. I also renamed ip4_frag_ecn_table to ip_frag_ecn_table. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com> Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 3月, 2013 1 次提交
-
-
由 Hannes Frederic Sowa 提交于
This patch introduces a constant limit of the fragment queue hash table bucket list lengths. Currently the limit 128 is choosen somewhat arbitrary and just ensures that we can fill up the fragment cache with empty packets up to the default ip_frag_high_thresh limits. It should just protect from list iteration eating considerable amounts of cpu. If we reach the maximum length in one hash bucket a warning is printed. This is implemented on the caller side of inet_frag_find to distinguish between the different users of inet_fragment.c. I dropped the out of memory warning in the ipv4 fragment lookup path, because we already get a warning by the slab allocator. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jesper Dangaard Brouer <jbrouer@redhat.com> Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 2月, 2013 1 次提交
-
-
由 Sasha Levin 提交于
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-