1. 13 1月, 2017 1 次提交
    • N
      ipmr: improve hash scalability · 8fb472c0
      Nikolay Aleksandrov 提交于
      Recently we started using ipmr with thousands of entries and easily hit
      soft lockups on smaller devices. The reason is that the hash function
      uses the high order bits from the src and dst, but those don't change in
      many common cases, also the hash table  is only 64 elements so with
      thousands it doesn't scale at all.
      This patch migrates the hash table to rhashtable, and in particular the
      rhl interface which allows for duplicate elements to be chained because
      of the MFC_PROXY support (*,G; *,*,oif cases) which allows for multiple
      duplicate entries to be added with different interfaces (IMO wrong, but
      it's been in for a long time).
      
      And here are some results from tests I've run in a VM:
       mr_table size (default, allocated for all namespaces):
        Before                    After
         49304 bytes               2400 bytes
      
       Add 65000 routes (the diff is much larger on smaller devices):
        Before                    After
         1m42s                     58s
      
       Forwarding 256 byte packets with 65000 routes (test done in a VM):
        Before                    After
         3 Mbps / ~1465 pps        122 Mbps / ~59000 pps
      
      As a bonus we no longer see the soft lockups on smaller devices which
      showed up even with 2000 entries before.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8fb472c0
  2. 09 1月, 2017 1 次提交
  3. 03 1月, 2017 1 次提交
  4. 25 12月, 2016 1 次提交
  5. 04 11月, 2016 1 次提交
  6. 01 11月, 2016 2 次提交
    • D
      net: Enable support for VRF with ipv4 multicast · e58e4159
      David Ahern 提交于
      Enable support for IPv4 multicast:
      - similar to unicast the flow struct is updated to L3 master device
        if relevant prior to calling fib_rules_lookup. The table id is saved
        to the lookup arg so the rule action for ipmr can return the table
        associated with the device.
      
      - ip_mr_forward needs to check for master device mismatch as well
        since the skb->dev is set to it
      
      - allow multicast address on VRF device for Rx by checking for the
        daddr in the VRF device as well as the original ingress device
      
      - on Tx need to drop to __mkroute_output when FIB lookup fails for
        multicast destination address.
      
      - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates
        IPMR FIB rules on first device create similar to FIB rules. In
        addition the VRF driver does not divert IPv4 multicast packets:
        it breaks on Tx since the fib lookup fails on the mcast address.
      
      With this patch, ipmr forwarding and local rx/tx work.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e58e4159
    • N
  7. 26 9月, 2016 1 次提交
    • N
      ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route · 2cf75070
      Nikolay Aleksandrov 提交于
      Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid
      instead of the previous dst_pid which was copied from in_skb's portid.
      Since the skb is new the portid is 0 at that point so the packets are sent
      to the kernel and we get scheduling while atomic or a deadlock (depending
      on where it happens) by trying to acquire rtnl two times.
      Also since this is RTM_GETROUTE, it can be triggered by a normal user.
      
      Here's the sleeping while atomic trace:
      [ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
      [ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
      [ 7858.212881] 2 locks held by swapper/0/0:
      [ 7858.213013]  #0:  (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350
      [ 7858.213422]  #1:  (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130
      [ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179
      [ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 7858.214108]  0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000
      [ 7858.214412]  ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e
      [ 7858.214716]  000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f
      [ 7858.215251] Call Trace:
      [ 7858.215412]  <IRQ>  [<ffffffff813a7804>] dump_stack+0x85/0xc1
      [ 7858.215662]  [<ffffffff810a4a72>] ___might_sleep+0x192/0x250
      [ 7858.215868]  [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100
      [ 7858.216072]  [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0
      [ 7858.216279]  [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460
      [ 7858.216487]  [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40
      [ 7858.216687]  [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260
      [ 7858.216900]  [<ffffffff81573c70>] rtnl_unicast+0x20/0x30
      [ 7858.217128]  [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0
      [ 7858.217351]  [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130
      [ 7858.217581]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.217785]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.217990]  [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350
      [ 7858.218192]  [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350
      [ 7858.218415]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.218656]  [<ffffffff810fde10>] run_timer_softirq+0x260/0x640
      [ 7858.218865]  [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f
      [ 7858.219068]  [<ffffffff816637c8>] __do_softirq+0xe8/0x54f
      [ 7858.219269]  [<ffffffff8107a948>] irq_exit+0xb8/0xc0
      [ 7858.219463]  [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50
      [ 7858.219678]  [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0
      [ 7858.219897]  <EOI>  [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10
      [ 7858.220165]  [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10
      [ 7858.220373]  [<ffffffff810298e3>] default_idle+0x23/0x190
      [ 7858.220574]  [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20
      [ 7858.220790]  [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60
      [ 7858.221016]  [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0
      [ 7858.221257]  [<ffffffff8164f995>] rest_init+0x135/0x140
      [ 7858.221469]  [<ffffffff81f83014>] start_kernel+0x50e/0x51b
      [ 7858.221670]  [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120
      [ 7858.221894]  [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c
      [ 7858.222113]  [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a
      
      Fixes: 2942e900 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2cf75070
  8. 21 9月, 2016 1 次提交
  9. 27 7月, 2016 1 次提交
  10. 20 7月, 2016 1 次提交
  11. 17 7月, 2016 1 次提交
    • N
      net: ipmr/ip6mr: add support for keeping an entry age · 43b9e127
      Nikolay Aleksandrov 提交于
      In preparation for hardware offloading of ipmr/ip6mr we need an
      interface that allows to check (and later update) the age of entries.
      Relying on stats alone can show activity but not actual age of the entry,
      furthermore when there're tens of thousands of entries a lot of the
      hardware implementations only support "hit" bits which are cleared on
      read to denote that the entry was active and shouldn't be aged out,
      these can then be naturally translated into age timestamp and will be
      compatible with the software forwarding age. Using a lastuse entry doesn't
      affect performance because the members in that cache line are written to
      along with the age.
      Since all new users are encouraged to use ipmr via netlink, this is
      exported via the RTA_EXPIRES attribute.
      Also do a minor local variable declaration style adjustment - arrange them
      longest to shortest.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Shrijeet Mukherjee <shm@cumulusnetworks.com>
      CC: Satish Ashok <sashok@cumulusnetworks.com>
      CC: Donald Sharp <sharpd@cumulusnetworks.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      43b9e127
  12. 28 6月, 2016 1 次提交
  13. 22 4月, 2016 1 次提交
  14. 01 12月, 2015 5 次提交
  15. 25 11月, 2015 1 次提交
  16. 24 11月, 2015 9 次提交
  17. 23 11月, 2015 1 次提交
    • N
      net: ipmr: fix static mfc/dev leaks on table destruction · 0e615e96
      Nikolay Aleksandrov 提交于
      When destroying an mrt table the static mfc entries and the static
      devices are kept, which leads to devices that can never be destroyed
      (because of refcnt taken) and leaked memory, for example:
      unreferenced object 0xffff880034c144c0 (size 192):
        comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
        hex dump (first 32 bytes):
          98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff  .S.4.....S.4....
          ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00  ................
        backtrace:
          [<ffffffff815c1b9e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811ea6e0>] kmem_cache_alloc+0x190/0x300
          [<ffffffff815931cb>] ip_mroute_setsockopt+0x5cb/0x910
          [<ffffffff8153d575>] do_ip_setsockopt.isra.11+0x105/0xff0
          [<ffffffff8153e490>] ip_setsockopt+0x30/0xa0
          [<ffffffff81564e13>] raw_setsockopt+0x33/0x90
          [<ffffffff814d1e14>] sock_common_setsockopt+0x14/0x20
          [<ffffffff814d0b51>] SyS_setsockopt+0x71/0xc0
          [<ffffffff815cdbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Make sure that everything is cleaned on netns destruction.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e615e96
  18. 03 11月, 2015 1 次提交
    • A
      ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context. · 44f49dd8
      Ani Sinha 提交于
      Fixes the following kernel BUG :
      
      BUG: using __this_cpu_add() in preemptible [00000000] code: bash/2758
      caller is __this_cpu_preempt_check+0x13/0x15
      CPU: 0 PID: 2758 Comm: bash Tainted: P           O   3.18.19 #2
       ffffffff8170eaca ffff880110d1b788 ffffffff81482b2a 0000000000000000
       0000000000000000 ffff880110d1b7b8 ffffffff812010ae ffff880007cab800
       ffff88001a060800 ffff88013a899108 ffff880108b84240 ffff880110d1b7c8
      Call Trace:
      [<ffffffff81482b2a>] dump_stack+0x52/0x80
      [<ffffffff812010ae>] check_preemption_disabled+0xce/0xe1
      [<ffffffff812010d4>] __this_cpu_preempt_check+0x13/0x15
      [<ffffffff81419d60>] ipmr_queue_xmit+0x647/0x70c
      [<ffffffff8141a154>] ip_mr_forward+0x32f/0x34e
      [<ffffffff8141af76>] ip_mroute_setsockopt+0xe03/0x108c
      [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
      [<ffffffff810e6974>] ? pollwake+0x4d/0x51
      [<ffffffff81058ac0>] ? default_wake_function+0x0/0xf
      [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
      [<ffffffff810613d9>] ? __wake_up_common+0x45/0x77
      [<ffffffff81486ea9>] ? _raw_spin_unlock_irqrestore+0x1d/0x32
      [<ffffffff810618bc>] ? __wake_up_sync_key+0x4a/0x53
      [<ffffffff8139a519>] ? sock_def_readable+0x71/0x75
      [<ffffffff813dd226>] do_ip_setsockopt+0x9d/0xb55
      [<ffffffff81429818>] ? unix_seqpacket_sendmsg+0x3f/0x41
      [<ffffffff813963fe>] ? sock_sendmsg+0x6d/0x86
      [<ffffffff813959d4>] ? sockfd_lookup_light+0x12/0x5d
      [<ffffffff8139650a>] ? SyS_sendto+0xf3/0x11b
      [<ffffffff810d5738>] ? new_sync_read+0x82/0xaa
      [<ffffffff813ddd19>] compat_ip_setsockopt+0x3b/0x99
      [<ffffffff813fb24a>] compat_raw_setsockopt+0x11/0x32
      [<ffffffff81399052>] compat_sock_common_setsockopt+0x18/0x1f
      [<ffffffff813c4d05>] compat_SyS_setsockopt+0x1a9/0x1cf
      [<ffffffff813c4149>] compat_SyS_socketcall+0x180/0x1e3
      [<ffffffff81488ea1>] cstar_dispatch+0x7/0x1e
      Signed-off-by: NAni Sinha <ani@arista.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44f49dd8
  19. 08 10月, 2015 1 次提交
  20. 18 9月, 2015 5 次提交
  21. 10 9月, 2015 1 次提交
    • P
      net: ipv6: use common fib_default_rule_pref · f53de1e9
      Phil Sutter 提交于
      This switches IPv6 policy routing to use the shared
      fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
      multicast routing for IPv4 as well as IPv6.
      
      The motivation for this patch is a complaint about iproute2 behaving
      inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
      IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
      assigned priority value was decreased with each rule added.
      
      Since then all users of the default_pref field have been converted to
      assign the generic function fib_default_rule_pref(), fib_nl_newrule()
      may just use it directly instead. Therefore get rid of the function
      pointer altogether and make fib_default_rule_pref() static, as it's not
      used outside fib_rules.c anymore.
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f53de1e9
  22. 08 4月, 2015 1 次提交
    • D
      netfilter: Pass socket pointer down through okfn(). · 7026b1dd
      David Miller 提交于
      On the output paths in particular, we have to sometimes deal with two
      socket contexts.  First, and usually skb->sk, is the local socket that
      generated the frame.
      
      And second, is potentially the socket used to control a tunneling
      socket, such as one the encapsulates using UDP.
      
      We do not want to disassociate skb->sk when encapsulating in order
      to fix this, because that would break socket memory accounting.
      
      The most extreme case where this can cause huge problems is an
      AF_PACKET socket transmitting over a vxlan device.  We hit code
      paths doing checks that assume they are dealing with an ipv4
      socket, but are actually operating upon the AF_PACKET one.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7026b1dd
  23. 04 4月, 2015 1 次提交