1. 09 Aug, 2019 1 commit
    • inet: frags: re-introduce skb coalescing for local delivery · 891584f4
      Guillaume Nault authored
      Before commit d4289fcc ("net: IP6 defrag: use rbtrees for IPv6
      defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
      generating many fragments) and running over an IPsec tunnel, reported
      more than 6Gbps throughput. After that patch, the same test gets only
      9Mbps when receiving on a be2net nic (driver can make a big difference
      here, for example, ixgbe doesn't seem to be affected).
      
      By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
      (IPv4 fragment coalescing was dropped by commit 14fe22e3 ("Revert
      "ipv4: use skb coalescing in defragmentation"")).
      
      Without fragment coalescing, be2net runs out of Rx ring entries and
      starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
      the netperf traffic is only composed of UDP fragments, any lost packet
      prevents reassembly of the full datagram. Therefore, fragments which
      have no possibility to ever get reassembled pile up in the reassembly
      queue, until the memory accounting exceeds the threshold. At that point
      no fragment is accepted anymore, which effectively discards all
      netperf traffic.
      
      When reassembly timeout expires, some stale fragments are removed from
      the reassembly queue, so a few packets can be received, reassembled
      and delivered to the netperf receiver. But the nic still drops frames
      and soon the reassembly queue gets filled again with stale fragments.
      These long time frames where no datagram can be received explain why
      the performance drop is so significant.
      
      Re-introducing fragment coalescing is enough to restore the initial
      performance (6.6Gbps with be2net): the driver no longer drops frames
      (no more rx_drops_no_frags errors) and the reassembly engine works at
      full speed.
      
      This patch is quite conservative and only coalesces skbs for local
      IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
      forwarding). Coalescing could be extended in the future if need be, as
      more scenarios would probably benefit from it.
      
      [0]: Test configuration
      Sender:
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
      ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
      ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
      ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
      netserver -D -L fc00:2::1
      
      Receiver:
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
      ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
      ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
      ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
      netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
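The stall described in 891584f4 can be illustrated with a toy model. This is a minimal Python sketch, not kernel code: `ReassemblyTable`, `run()` and every size in it are made up for illustration, the budget check is applied only when a new queue has to be created, and the reassembly timeout is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class ReassemblyTable:
    high_thresh: int                            # byte budget for pending fragments
    mem: int = 0                                # bytes currently accounted
    queues: dict = field(default_factory=dict)  # datagram id -> fragment indexes seen
    delivered: int = 0
    rejected: int = 0

    def receive(self, dgram_id, index, total, size):
        """Account one fragment and deliver the datagram once it is complete."""
        parts = self.queues.get(dgram_id)
        if parts is None:
            if self.mem > self.high_thresh:
                self.rejected += 1              # over budget: no new queue is created
                return
            parts = self.queues[dgram_id] = set()
        parts.add(index)
        self.mem += size
        if len(parts) == total:                 # complete: release queue and memory
            self.mem -= size * total
            del self.queues[dgram_id]
            self.delivered += 1


def run(drop_first_fragment, datagrams=100, frags=4, size=1000, high_thresh=10_000):
    """Feed `datagrams` fragmented datagrams, optionally losing fragment 0 of each."""
    table = ReassemblyTable(high_thresh)
    for d in range(datagrams):
        for i in range(frags):
            if drop_first_fragment and i == 0:
                continue                        # frame dropped by the NIC
            table.receive(d, i, frags, size)
    return table


healthy = run(drop_first_fragment=False)
lossy = run(drop_first_fragment=True)
print(healthy.delivered, healthy.rejected)
print(lossy.delivered, lossy.rejected, len(lossy.queues))
```

With these made-up numbers the lossless run delivers all 100 datagrams. The lossy run delivers none: four incomplete queues push the accounting over the budget and the remaining 288 fragments are refused, which mirrors the long periods without any delivery described in the commit message.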
  2. 19 Jun, 2019 1 commit
    • inet: fix various use-after-free in defrags units · d5dd8879
      Eric Dumazet authored
      syzbot reported another issue caused by my recent patches. [1]
      
      The issue here is that fqdir_exit() is initiating a work queue
      and immediately returns. A bit later cleanup_net() was able
      to free the MIB (percpu data) and the whole struct net was freed,
      but we had active frag timers that fired and triggered use-after-free.
      
      We need to make sure that timers can see fqdir->dead being set
      and bail out.
      
      Since RCU is used for the reader side, this means
      we want to respect an RCU grace period between these operations:
      
      1) fqdir->dead = 1;
      
      2) netns dismantle (freeing of various data structures)
      
      This patch uses the new (struct pernet_operations)->pre_exit
      infrastructure to ensure a full RCU grace period
      happens between fqdir_pre_exit() and fqdir_exit().
      
      This also means we can use a regular work queue; we no
      longer need rcu_work.
      
      Tested:
      
      $ time for i in {1..1000}; do unshare -n /bin/false;done
      
      real	0m2.585s
      user	0m0.160s
      sys	0m2.214s
      
      [1]
      
      BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
      Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860
      
      CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
       call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
       expire_timers kernel/time/timer.c:1366 [inline]
       __run_timers kernel/time/timer.c:1685 [inline]
       __run_timers kernel/time/timer.c:1653 [inline]
       run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
       __do_softirq+0x25c/0x94c kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:374 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:414
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
       </IRQ>
      RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
      Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
      RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
      RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
      RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
      R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
      R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
       tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
       tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
       tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
       tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
       security_file_ioctl+0x77/0xc0 security/security.c:1370
       ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4592c9
      Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
      RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
      RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
      R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff
      
      Allocated by task 9047:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
       slab_post_alloc_hook mm/slab.h:437 [inline]
       slab_alloc mm/slab.c:3326 [inline]
       kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
       kmem_cache_zalloc include/linux/slab.h:732 [inline]
       net_alloc net/core/net_namespace.c:386 [inline]
       copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
       create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
       ksys_unshare+0x440/0x980 kernel/fork.c:2692
       __do_sys_unshare kernel/fork.c:2760 [inline]
       __se_sys_unshare kernel/fork.c:2758 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 2541:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kmem_cache_free+0x86/0x260 mm/slab.c:3698
       net_free net/core/net_namespace.c:402 [inline]
       net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
       net_drop_ns net/core/net_namespace.c:408 [inline]
       cleanup_net+0x538/0x960 net/core/net_namespace.c:571
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      The buggy address belongs to the object at ffff88808b9fe100
       which belongs to the cache net_namespace of size 6784
      The buggy address is located 560 bytes inside of
       6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
      The buggy address belongs to the page:
      page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
      flags: 0x1fffc0000010200(slab|head)
      raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
      raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
       ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 3c8fc878 ("inet: frags: rework rhashtable dismantle")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 13 Jun, 2019 1 commit
  4. 31 May, 2019 1 commit
  5. 29 May, 2019 1 commit
  6. 27 May, 2019 7 commits
  7. 27 Feb, 2019 1 commit
  8. 26 Jan, 2019 1 commit
  9. 31 Dec, 2018 1 commit
  10. 21 Dec, 2018 1 commit
  11. 06 Dec, 2018 1 commit
    • ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes · ebaf39e6
      Jiri Wiesner authored
      The *_frag_reasm() functions are susceptible to miscalculating the byte
      count of packet fragments in case the truesize of a head buffer changes.
      The truesize member may be changed by the call to skb_unclone(), leaving
      the fragment memory limit counter unbalanced even if all fragments are
      processed. This miscalculation goes unnoticed as long as the network
      namespace which holds the counter is not destroyed.
      
      Should an attempt be made to destroy a network namespace that holds an
      unbalanced fragment memory limit counter, the cleanup of the namespace
      never finishes. The thread handling the cleanup gets stuck in
      inet_frags_exit_net() waiting for the percpu counter to reach zero. The
      thread is usually in running state with a stacktrace similar to:
      
       PID: 1073   TASK: ffff880626711440  CPU: 1   COMMAND: "kworker/u48:4"
        #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
        #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
        #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
        #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
        #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
       #10 [ffff880621563e38] process_one_work at ffffffff81096f14
      
      It is not possible to create new network namespaces, and processes
      that call unshare() end up being stuck in uninterruptible sleep state
      waiting to acquire the net_mutex.
      
      The bug was observed in the IPv6 netfilter code by Per Sundstrom.
      I thank him for his analysis of the problem. The parts of this patch
      that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
      Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
      Reported-by: Per Sundstrom <per.sundstrom@redqube.se>
      Acked-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
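The accounting imbalance described in ebaf39e6 can be sketched in a few lines. This is a toy Python model under stated assumptions, not the kernel implementation: `reassemble()` and the truesize values are hypothetical, and `head_growth` stands in for whatever change skb_unclone() makes to the head buffer's truesize.

```python
def reassemble(truesizes, head_growth, adjust):
    """Charge each fragment, let the head buffer grow, then uncharge everything.

    Returns the counter value left after reassembly; anything other than 0
    means the fragment memory counter is unbalanced.
    """
    counter = 0
    for size in truesizes:
        counter += size                   # charged when each fragment is queued

    head = truesizes[0] + head_growth     # head truesize changed during reassembly
    if adjust:
        counter += head_growth            # the fix: account for the delta

    counter -= head + sum(truesizes[1:])  # uncharge using the current truesizes
    return counter


leftover_without_fix = reassemble([768, 768, 768], head_growth=256, adjust=False)
leftover_with_fix = reassemble([768, 768, 768], head_growth=256, adjust=True)
print(leftover_without_fix, leftover_with_fix)
```

Without the adjustment the counter ends up off by exactly the amount the head buffer grew, so it never returns to zero. That is the condition the namespace cleanup waits on in the hang described in the commit message.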
  12. 22 Sep, 2018 2 commits
    • net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net · 83619623
      Peter Oskolkov authored
      Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
      hard-limited to those of the root/init ns.
      
      There are at least two use cases when it would be desirable to
      set the high_thresh values higher in a child namespace vs the global hard
      limit:
      
      - a security/ddos protection policy may lower the thresholds in the
        root/init ns but allow for a special exception in a child namespace
      - testing: a test running in a namespace may want to set these
        thresholds higher in its namespace than what is in the root/init ns
      
      The new behavior:
      
       # ip netns add testns
       # ip netns exec testns bash
      
       # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
       net.ipv4.ipfrag_high_thresh = 9000000
      
       # sysctl net.ipv4.ipfrag_high_thresh
       net.ipv4.ipfrag_high_thresh = 9000000
      
       # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
       net.ipv6.ip6frag_high_thresh = 9000000
      
       # sysctl net.ipv6.ip6frag_high_thresh
       net.ipv6.ip6frag_high_thresh = 9000000
      
      The old behavior:
      
       # ip netns add testns
       # ip netns exec testns bash
      
       # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
       net.ipv4.ipfrag_high_thresh = 9000000
      
       # sysctl net.ipv4.ipfrag_high_thresh
       net.ipv4.ipfrag_high_thresh = 4194304
      
       # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
       net.ipv6.ip6frag_high_thresh = 9000000
      
       # sysctl net.ipv6.ip6frag_high_thresh
       net.ipv6.ip6frag_high_thresh = 4194304
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
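The difference between the two listings in 83619623 can be reduced to a small model. This Python sketch is illustrative only: `write_high_thresh()` is a hypothetical helper, and it assumes that the child namespace started at 4194304 and that a write above the init_net limit simply left the stored value unchanged. Both assumptions are consistent with the listings but are not taken from the kernel source.

```python
INIT_NET_HIGH_THRESH = 4194304   # value read back in the "old behavior" listing


def write_high_thresh(current, requested, cap=None):
    """Return the stored value after a write; a write above `cap` is not applied."""
    if cap is not None and requested > cap:
        return current
    return requested


# Old behavior: the child namespace is capped by the init_net value.
old = write_high_thresh(INIT_NET_HIGH_THRESH, 9000000, cap=INIT_NET_HIGH_THRESH)
# New behavior: no cap inherited from init_net.
new = write_high_thresh(INIT_NET_HIGH_THRESH, 9000000)
print(old, new)
```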
    • ipv6: discard IP frag queue on more errors · 2475f59c
      Peter Oskolkov authored
      This is similar to how ipv4 now behaves:
      commit 0ff89efb ("ip: fail fast on IP defrag errors").
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 11 Sep, 2018 1 commit
  14. 06 Aug, 2018 2 commits
  15. 18 Jul, 2018 1 commit
    • ipv6: remove dependency of nf_defrag_ipv6 on ipv6 module · 70b095c8
      Florian Westphal authored
      IPV6=m
      DEFRAG_IPV6=m
      CONNTRACK=y yields:
      
      net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get':
      net/netfilter/nf_conntrack_proto.c:802: undefined reference to `nf_defrag_ipv6_enable'
      net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to `nf_conntrack_l4proto_icmpv6'
      
      Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params
      ip6_frag_init and ip6_expire_frag_queue so it would be needed to force
      IPV6=y too.
      
      This patch gets rid of the 'followup linker error' by removing
      the dependency of ipv6.ko symbols from netfilter ipv6 defrag.
      
      Shared code is placed into a header, then used from both.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  16. 19 Apr, 2018 1 commit
    • ipv6: frags: fix a lockdep false positive · 415787d7
      Eric Dumazet authored
      lockdep does not know that the locks used by IPv4 defrag
      and IPv6 reassembly units are of different classes.
      
      It complains because of the following chains:
      
      1) sch_direct_xmit()        (lock txq->_xmit_lock)
          dev_hard_start_xmit()
           xmit_one()
            dev_queue_xmit_nit()
             packet_rcv_fanout()
              ip_check_defrag()
               ip_defrag()
                spin_lock()     (lock frag queue spinlock)
      
      2) ip6_input_finish()
          ipv6_frag_rcv()       (lock frag queue spinlock)
           ip6_frag_queue()
            icmpv6_param_prob() (lock txq->_xmit_lock at some point)
      
      We could add lockdep annotations, but we also can make sure IPv6
      calls icmpv6_param_prob() only after the release of the frag queue spinlock,
      since this naturally makes frag queue spinlock a leaf in lock hierarchy.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
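The two chains quoted in 415787d7 form a cycle once both frag queue spinlocks are treated as one lock class. The sketch below shows that with a plain depth-first cycle check in Python. It is an illustration of the idea only, not how lockdep is implemented: `has_cycle()` and the edge lists are hypothetical, and each edge means "the first lock is held while the second is taken".

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed lock-order graph."""
    graph = {}
    for held, taken in edges:
        graph.setdefault(held, set()).add(taken)
        graph.setdefault(taken, set())

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:        # back edge: lock order inversion
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


XMIT = "txq->_xmit_lock"
FRAG = "frag queue spinlock"              # one class for IPv4 and IPv6, as lockdep sees it

chain1 = (XMIT, FRAG)                     # sch_direct_xmit() ... ip_defrag()
chain2 = (FRAG, XMIT)                     # ipv6_frag_rcv() ... icmpv6_param_prob()

before = [chain1, chain2]
after = [chain1]                          # icmpv6_param_prob() called after unlock

print(has_cycle(before), has_cycle(after))
```

Calling icmpv6_param_prob() only after the frag queue spinlock is released removes the second edge, so the frag queue spinlock becomes a leaf and the cycle disappears.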
  17. 18 Apr, 2018 1 commit
  18. 05 Apr, 2018 1 commit
  19. 02 Apr, 2018 1 commit
  20. 01 Apr, 2018 9 commits
    • ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CB · 219badfa
      Eric Dumazet authored
      ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that
      we could use two cache lines per skb when finding the insertion point,
      if for some reason inet6_skb_parm size is increased in the future.
      
      By using skb->ip_defrag_offset instead of skb->cb[], we pack all
      the fields in a single cache line, matching what we did for IPv4.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: frags: rewrite ip6_expire_frag_queue() · 05c0b86b
      Eric Dumazet authored
      Make it similar to IPv4 ip_expire(), and release the lock
      before calling icmp functions.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: break the 2GB limit for frags storage · 3e67f106
      Eric Dumazet authored
      Some users are willing to provision huge amounts of memory to be able
      to perform reassembly reasonably well under pressure.
      
      Current memory tracking is using one atomic_t and integers.
      
      Switch to atomic_long_t so that 64bit arches can use more than 2GB,
      without any cost for 32bit arches.
      
      Note that this patch avoids an overflow error, if high_thresh was set
      to ~2GB, since this test in inet_frag_alloc() was never true :
      
      if (... || frag_mem_limit(nf) > nf->high_thresh)
      
      Tested:
      
      $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
      
      <frag DDOS>
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 14705885 memory 16000002880
      
      $ nstat -n ; sleep 1 ; nstat | grep Reas
      IpReasmReqds                    3317150            0.0
      IpReasmFails                    3317112            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
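The overflow note in 3e67f106 is easier to see with numbers. The following Python sketch emulates a 32-bit signed counter. It is a model of the arithmetic only: `as_int32()`, `over_limit_32()` and `over_limit_64()` are hypothetical helpers, and the 4096-byte overshoot is an arbitrary example value.

```python
INT_MAX = 2**31 - 1


def as_int32(value):
    """Wrap an integer the way a 32-bit two's-complement counter would."""
    return (value + 2**31) % 2**32 - 2**31


def over_limit_32(mem, high_thresh):
    """The limit test as seen through 32-bit signed integers."""
    return as_int32(mem) > as_int32(high_thresh)


def over_limit_64(mem, high_thresh):
    """The same test with a counter wide enough to hold the real value."""
    return mem > high_thresh


high_thresh = INT_MAX                     # threshold set to roughly 2GB
mem = INT_MAX + 4096                      # accounting has just passed the threshold

print(as_int32(mem), over_limit_32(mem, high_thresh), over_limit_64(mem, high_thresh))
```

Once the 32-bit counter passes INT_MAX it wraps to a negative value, so the comparison against a threshold near 2GB can never be true. A wider counter, as with atomic_long_t on 64-bit arches, keeps the test meaningful.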
    • inet: frags: remove inet_frag_maybe_warn_overflow() · 2d44ed22
      Eric Dumazet authored
      This function is obsolete, after rhashtable addition to inet defrag.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: get rif of inet_frag_evicting() · 399d1404
      Eric Dumazet authored
      This refactors ip_expire() since one indentation level is removed.
      
      Note: in the future, we should try hard to avoid the skb_clone()
      since this is a serious performance cost.
      Under DDOS, the ICMP message won't be sent because of rate limits.
      
      The fact that ip6_expire_frag_queue() does not use skb_clone() is
      disturbing too. Presumably IPv6 should have the same
      issue as the one we fixed in commit ec4fbd64
      ("inet: frag: release spinlock before calling icmp_send()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: use rhashtables for reassembly units · 648700f7
      Eric Dumazet authored
      Some applications still rely on IP fragmentation, and to be fair the Linux
      reassembly unit is not working under any serious load.
      
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
      
      A work queue is supposed to garbage collect items when host is under memory
      pressure, and doing a hash rebuild, changing seed used in hash computations.
      
      This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
      occurring every 5 seconds if host is under fire.
      
      Then there is the problem of sharing this hash table for all netns.
      
      It is time to switch to rhashtables, and allocate one of them per netns
      to speed up netns dismantle, since this is a critical metric these days.
      
      Lookup is now using RCU. A followup patch will even remove
      the refcount hold/release left from prior implementation and save
      a couple of atomic operations.
      
      Before this patch, 16 cpus (16 RX queue NIC) could not handle more
      than 1 Mpps frags DDOS.
      
      After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
      of storage for the fragments (exact number depends on frags being evicted
      after timeout)
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 1966916 memory 2140004608
      
      A followup patch will change the limits for 64bit arches.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Aring <alex.aring@gmail.com>
      Cc: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
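The figures quoted in 648700f7 lend themselves to a quick back-of-the-envelope check. The Python sketch below only does arithmetic on numbers that appear in this log: the old table geometry (1024 buckets, up to 128 items each) and the two `FRAG:` lines from /proc/net/sockstat, the second of which comes from 3e67f106. `bytes_per_entry()` is a hypothetical helper, and reading `memory / inuse` as a per-entry cost is an interpretation, not something the commit messages state.

```python
OLD_BUCKETS = 1024
OLD_MAX_ITEMS_PER_BUCKET = 128

# Upper bound on entries the old static hash table could hold.
old_capacity = OLD_BUCKETS * OLD_MAX_ITEMS_PER_BUCKET


def bytes_per_entry(inuse, memory):
    """Split a sockstat FRAG line into (bytes per in-use entry, remainder)."""
    return divmod(memory, inuse)


after_rhashtable = bytes_per_entry(1966916, 2140004608)       # from 648700f7
after_64bit_limits = bytes_per_entry(14705885, 16000002880)   # from 3e67f106

print(old_capacity, after_rhashtable, after_64bit_limits)
```

Both samples divide evenly into 1088 bytes per in-use entry, and the old table topped out at 131072 entries, far below the roughly two million entries reported after the switch to rhashtables.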
    • inet: frags: refactor ipv6_frag_init() · 5b975bab
      Eric Dumazet authored
      We want to call inet_frags_init() earlier.
      
      This is a prereq to "inet: frags: use rhashtables for reassembly units"
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: add a pointer to struct netns_frags · 093ba729
      Eric Dumazet authored
      In order to simplify the API, add a pointer to struct inet_frags
      in struct netns_frags. This will allow us to make things less complex.
      
      These functions no longer have a struct inet_frags parameter :
      
      inet_frag_destroy(struct inet_frag_queue *q  /*, struct inet_frags *f */)
      inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
      inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
      inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
      ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: change inet_frags_init_net() return value · 787bea77
      Eric Dumazet authored
      We will soon initialize one rhashtable per struct netns_frags
      in inet_frags_init_net().
      
      This patch changes the return value to eventually propagate an
      error.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 30 Mar, 2018 1 commit
  22. 28 Mar, 2018 1 commit
  23. 20 Feb, 2018 1 commit
    • net: Convert ip6_frags_ops · 5fc094f5
      Kirill Tkhai authored
      Exit methods call inet_frags_exit_net() with the global ip6_frags
      as argument. So, after we make the pernet_operations async,
      a pair of exit methods may be called to iterate this hash table.
      Since there is inet_frag_worker(), which may already work
      in parallel with inet_frags_exit_net(), and it can do the same
      cleanup that inet_frags_exit_net() does, it's safe. So we may
      mark these pernet_operations as async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 18 Oct, 2017 1 commit
    • inet: frags: Convert timers to use timer_setup() · 78802011
      Kees Cook authored
      In preparation for unconditionally passing the struct timer_list pointer to
      all timer callbacks, switch to using the new timer_setup() and from_timer()
      to pass the timer pointer explicitly.
      
      Cc: Alexander Aring <alex.aring@gmail.com>
      Cc: Stefan Schmidt <stefan@osg.samsung.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: linux-wpan@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Cc: netfilter-devel@vger.kernel.org
      Cc: coreteam@netfilter.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Stefan Schmidt <stefan@osg.samsung.com> # for ieee802154
      Signed-off-by: David S. Miller <davem@davemloft.net>