1. 04 9月, 2017 1 次提交
    • J
      Revert "net: use lib/percpu_counter API for fragmentation mem accounting" · fb452a1a
      Jesper Dangaard Brouer 提交于
      This reverts commit 6d7b857d.
      
      There is a bug in fragmentation codes use of the percpu_counter API,
      that can cause issues on systems with many CPUs.
      
      The frag_mem_limit() just reads the global counter (fbc->count),
      without considering other CPUs can have upto batch size (130K) that
      haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
      this become dangerous at >=24 CPUs (3*1024*1024/130000=24).
      
      The correct API usage would be to use __percpu_counter_compare() which
      does the right thing, and takes into account the number of (online)
      CPUs and batch size, to account for this and call __percpu_counter_sum()
      when needed.
      
      We choose to revert the use of the lib/percpu_counter API for frag
      memory accounting for several reasons:
      
      1) On systems with CPUs > 24, the heavier fully locked
         __percpu_counter_sum() is always invoked, which will be more
         expensive than the atomic_t that is reverted to.
      
      Given systems with more than 24 CPUs are becoming common this doesn't
      seem like a good option.  To mitigate this, the batch size could be
      decreased and thresh be increased.
      
      2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
         CPU, before SKBs are pushed into sockets on remote CPUs.  Given
         NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
         likely be limited.  Thus, a fair chance that atomic add+dec happen
         on the same CPU.
      
      Revert note that commit 1d6119ba ("net: fix percpu memory leaks")
      removed init_frag_mem_limit() and instead use inet_frags_init_net().
      After this revert, inet_frags_uninit_net() becomes empty.
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Fixes: 1d6119ba ("net: fix percpu memory leaks")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb452a1a
  2. 01 7月, 2017 1 次提交
  3. 21 6月, 2017 1 次提交
    • N
      percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch · 104b4e51
      Nikolay Borisov 提交于
      Currently, percpu_counter_add is a wrapper around __percpu_counter_add
      which is preempt safe due to explicit calls to preempt_disable.  Given
      how __ prefix is used in percpu related interfaces, the naming
      unfortunately creates the false sense that __percpu_counter_add is
      less safe than percpu_counter_add.  In terms of context-safety,
      they're equivalent.  The only difference is that the __ version takes
      a batch parameter.
      
      Make this a bit more explicit by just renaming __percpu_counter_add to
      percpu_counter_add_batch.
      
      This patch doesn't cause any functional changes.
      
      tj: Minor updates to patch description for clarity.  Cosmetic
          indentation updates.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: linux-mm@kvack.org
      Cc: "David S. Miller" <davem@davemloft.net>
      104b4e51
  4. 23 5月, 2017 1 次提交
  5. 21 1月, 2017 1 次提交
  6. 17 2月, 2016 1 次提交
  7. 06 1月, 2016 1 次提交
  8. 03 11月, 2015 1 次提交
  9. 27 7月, 2015 3 次提交
  10. 28 5月, 2015 1 次提交
    • F
      ip_fragment: don't forward defragmented DF packet · d6b915e2
      Florian Westphal 提交于
      We currently always send fragments without DF bit set.
      
      Thus, given following setup:
      
      mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
         A           R1              R2         B
      
      Where R1 and R2 run linux with netfilter defragmentation/conntrack
      enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
      will respond with icmp too big error if one of these fragments exceeded
      1400 bytes.
      
      However, if R1 receives fragment sizes 1200 and 100, it would
      forward the reassembled packet without refragmenting, i.e.
      R2 will send an icmp error in response to a packet that was never sent,
      citing mtu that the original sender never exceeded.
      
      The other minor issue is that a refragmentation on R1 will conceal the
      MTU of R2-B since refragmentation does not set DF bit on the fragments.
      
      This modifies ip_fragment so that we track largest fragment size seen
      both for DF and non-DF packets, and set frag_max_size to the largest
      value.
      
      If the DF fragment size is larger or equal to the non-df one, we will
      consider the packet a path mtu probe:
      We set DF bit on the reassembled skb and also tag it with a new IPCB flag
      to force refragmentation even if skb fits outdev mtu.
      
      We will also set DF bit on each fragment in this case.
      
      Joint work with Hannes Frederic Sowa.
      Reported-by: NJesse Gross <jesse@nicira.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6b915e2
  11. 08 9月, 2014 1 次提交
    • T
      percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo 提交于
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      908c7f19
  12. 03 8月, 2014 3 次提交
  13. 28 7月, 2014 7 次提交
  14. 24 10月, 2013 1 次提交
  15. 06 5月, 2013 1 次提交
    • K
      net: frag, fix race conditions in LRU list maintenance · b56141ab
      Konstantin Khlebnikov 提交于
      This patch fixes race between inet_frag_lru_move() and inet_frag_lru_add()
      which was introduced in commit 3ef0eb0d
      ("net: frag, move LRU list maintenance outside of rwlock")
      
      One cpu already added new fragment queue into hash but not into LRU.
      Other cpu found it in hash and tries to move it to the end of LRU.
      This leads to NULL pointer dereference inside of list_move_tail().
      
      Another possible race condition is between inet_frag_lru_move() and
      inet_frag_lru_del(): move can happens after deletion.
      
      This patch initializes LRU list head before adding fragment into hash and
      inet_frag_lru_move() doesn't touches it if it's empty.
      
      I saw this kernel oops two times in a couple of days.
      
      [119482.128853] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [119482.132693] IP: [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.136456] PGD 2148f6067 PUD 215ab9067 PMD 0
      [119482.140221] Oops: 0000 [#1] SMP
      [119482.144008] Modules linked in: vfat msdos fat 8021q fuse nfsd auth_rpcgss nfs_acl nfs lockd sunrpc ppp_async ppp_generic bridge slhc stp llc w83627ehf hwmon_vid snd_hda_codec_hdmi snd_hda_codec_realtek kvm_amd k10temp kvm snd_hda_intel snd_hda_codec edac_core radeon snd_hwdep ath9k snd_pcm ath9k_common snd_page_alloc ath9k_hw snd_timer snd soundcore drm_kms_helper ath ttm r8169 mii
      [119482.152692] CPU 3
      [119482.152721] Pid: 20, comm: ksoftirqd/3 Not tainted 3.9.0-zurg-00001-g9f95269 #132 To Be Filled By O.E.M. To Be Filled By O.E.M./RS880D
      [119482.161478] RIP: 0010:[<ffffffff812ede89>]  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.166004] RSP: 0018:ffff880216d5db58  EFLAGS: 00010207
      [119482.170568] RAX: 0000000000000000 RBX: ffff88020882b9c0 RCX: dead000000200200
      [119482.175189] RDX: 0000000000000000 RSI: 0000000000000880 RDI: ffff88020882ba00
      [119482.179860] RBP: ffff880216d5db58 R08: ffffffff8155c7f0 R09: 0000000000000014
      [119482.184570] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88020882ba00
      [119482.189337] R13: ffffffff81c8d780 R14: ffff880204357f00 R15: 00000000000005a0
      [119482.194140] FS:  00007f58124dc700(0000) GS:ffff88021fcc0000(0000) knlGS:0000000000000000
      [119482.198928] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [119482.203711] CR2: 0000000000000000 CR3: 00000002155f0000 CR4: 00000000000007e0
      [119482.208533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [119482.213371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [119482.218221] Process ksoftirqd/3 (pid: 20, threadinfo ffff880216d5c000, task ffff880216d3a9a0)
      [119482.223113] Stack:
      [119482.228004]  ffff880216d5dbd8 ffffffff8155dcda 0000000000000000 ffff000200000001
      [119482.233038]  ffff8802153c1f00 ffff880000289440 ffff880200000014 ffff88007bc72000
      [119482.238083]  00000000000079d5 ffff88007bc72f44 ffffffff00000002 ffff880204357f00
      [119482.243090] Call Trace:
      [119482.248009]  [<ffffffff8155dcda>] ip_defrag+0x8fa/0xd10
      [119482.252921]  [<ffffffff815a8013>] ipv4_conntrack_defrag+0x83/0xe0
      [119482.257803]  [<ffffffff8154485b>] nf_iterate+0x8b/0xa0
      [119482.262658]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
      [119482.267527]  [<ffffffff815448e4>] nf_hook_slow+0x74/0x130
      [119482.272412]  [<ffffffff8155c7f0>] ? inet_del_offload+0x40/0x40
      [119482.277302]  [<ffffffff8155d068>] ip_rcv+0x268/0x320
      [119482.282147]  [<ffffffff81519992>] __netif_receive_skb_core+0x612/0x7e0
      [119482.286998]  [<ffffffff81519b78>] __netif_receive_skb+0x18/0x60
      [119482.291826]  [<ffffffff8151a650>] process_backlog+0xa0/0x160
      [119482.296648]  [<ffffffff81519f29>] net_rx_action+0x139/0x220
      [119482.301403]  [<ffffffff81053707>] __do_softirq+0xe7/0x220
      [119482.306103]  [<ffffffff81053868>] run_ksoftirqd+0x28/0x40
      [119482.310809]  [<ffffffff81074f5f>] smpboot_thread_fn+0xff/0x1a0
      [119482.315515]  [<ffffffff81074e60>] ? lg_local_lock_cpu+0x40/0x40
      [119482.320219]  [<ffffffff8106d870>] kthread+0xc0/0xd0
      [119482.324858]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
      [119482.329460]  [<ffffffff816c32dc>] ret_from_fork+0x7c/0xb0
      [119482.334057]  [<ffffffff8106d7b0>] ? insert_kthread_work+0x40/0x40
      [119482.338661] Code: 00 00 55 48 8b 17 48 b9 00 01 10 00 00 00 ad de 48 8b 47 08 48 89 e5 48 39 ca 74 29 48 b9 00 02 20 00 00 00 ad de 48 39 c8 74 7a <4c> 8b 00 4c 39 c7 75 53 4c 8b 42 08 4c 39 c7 75 2b 48 89 42 08
      [119482.343787] RIP  [<ffffffff812ede89>] __list_del_entry+0x29/0xd0
      [119482.348675]  RSP <ffff880216d5db58>
      [119482.353493] CR2: 0000000000000000
      
      Oops happened on this path:
      ip_defrag() -> ip_frag_queue() -> inet_frag_lru_move() -> list_move_tail() -> __list_del_entry()
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b56141ab
  16. 30 4月, 2013 1 次提交
  17. 05 4月, 2013 1 次提交
    • J
      net: frag queue per hash bucket locking · 19952cc4
      Jesper Dangaard Brouer 提交于
      This patch implements per hash bucket locking for the frag queue
      hash.  This removes two write locks, and the only remaining write
      lock is for protecting hash rebuild.  This essentially reduce the
      readers-writer lock to a rebuild lock.
      
      This patch is part of "net: frag performance followup"
       http://thread.gmane.org/gmane.linux.network/263644
      of which two patches have already been accepted:
      
      Same test setup as previous:
       (http://thread.gmane.org/gmane.linux.network/257155)
       Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses
       Ethernet flow-control.  A third interface is used for generating the
       DoS attack (with trafgen).
      
      Notice, I have changed the frag DoS generator script to be more
      efficient/deadly.  Before it would only hit one RX queue, now its
      sending packets causing multi-queue RX, due to "better" RX hashing.
      
      Test types summary (netperf UDP_STREAM):
       Test-20G64K     == 2x10G with 65K fragments
       Test-20G3F      == 2x10G with 3x fragments (3*1472 bytes)
       Test-20G64K+DoS == Same as 20G64K with frag DoS
       Test-20G3F+DoS  == Same as 20G3F  with frag DoS
       Test-20G64K+MQ  == Same as 20G64K with Multi-Queue frag DoS
       Test-20G3F+MQ   == Same as 20G3F  with Multi-Queue frag DoS
      
      When I rebased this-patch(03) (on top of net-next commit a210576c) and
      removed the _bh spinlock, I saw a performance regression.  BUT this
      was caused by some unrelated change in-between.  See tests below.
      
      Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
      Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02.
      Test (C) is what I reported before for this-patch
      
      Test (D) is net-next master HEAD (commit a210576c), which reveals some
      (unknown) performance regression (compared against test (B)).
      Test (D) function as a new base-test.
      
      Performance table summary (in Mbit/s):
      
      (#) Test-type:  20G64K    20G3F    20G64K+DoS  20G3F+DoS  20G64K+MQ 20G3F+MQ
          ----------  -------   -------  ----------  ---------  --------  -------
      (A) Patch-02  : 18848.7   13230.1   4103.04     5310.36     130.0    440.2
      (B) 1b5ab0de  : 18841.5   13156.8   4101.08     5314.57     129.0    424.2
      (C) Patch-03v1: 18838.0   13490.5   4405.11     6814.72     196.6    461.6
      
      (D) a210576c  : 18321.5   11250.4   3635.34     5160.13     119.1    405.2
      (E) with _bh  : 17247.3   11492.6   3994.74     6405.29     166.7    413.6
      (F) without bh: 17471.3   11298.7   3818.05     6102.11     165.7    406.3
      
      Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks.
      
      I cannot explain the slow down for 20G64K (but its an artificial
      "lab-test" so I'm not worried).  But the other results does show
      improvements.  And test (E) "with _bh" version is slightly better.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      
      ----
      V2:
      - By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
        need the spinlock _bh versions, as Netfilter currently does a
        local_bh_disable() before entering inet_fragment.
      - Fold-in desc from cover-mail
      V3:
      - Drop the chain_len counter per hash bucket.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19952cc4
  18. 28 3月, 2013 1 次提交
  19. 25 3月, 2013 1 次提交
  20. 19 3月, 2013 1 次提交
    • H
      inet: limit length of fragment queue hash table bucket lists · 5a3da1fe
      Hannes Frederic Sowa 提交于
      This patch introduces a constant limit of the fragment queue hash
      table bucket list lengths. Currently the limit 128 is choosen somewhat
      arbitrary and just ensures that we can fill up the fragment cache with
      empty packets up to the default ip_frag_high_thresh limits. It should
      just protect from list iteration eating considerable amounts of cpu.
      
      If we reach the maximum length in one hash bucket a warning is printed.
      This is implemented on the caller side of inet_frag_find to distinguish
      between the different users of inet_fragment.c.
      
      I dropped the out of memory warning in the ipv4 fragment lookup path,
      because we already get a warning by the slab allocator.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a3da1fe
  21. 23 2月, 2013 1 次提交
  22. 30 1月, 2013 6 次提交
  23. 20 9月, 2012 1 次提交
  24. 27 8月, 2012 1 次提交
  25. 18 5月, 2012 1 次提交