1. 11 Oct, 2018 2 commits
  2. 09 Oct, 2018 6 commits
  3. 06 Oct, 2018 1 commit
    • rtnetlink: fix rtnl_fdb_dump() for ndmsg header · bd961c9b
      Committed by Mauricio Faria de Oliveira
      Currently, rtnl_fdb_dump() assumes the family header is 'struct ifinfomsg',
      which is not always true -- 'struct ndmsg' is used by iproute2 ('ip neigh').
      
      The problem is, the function bails out early if nlmsg_parse() fails, which
      does occur for iproute2 usage of 'struct ndmsg' because the payload length
      is shorter than the family header alone (as 'struct ifinfomsg' is assumed).
      
      This breaks backward compatibility with userspace -- nothing is sent back.
      
      Some examples with iproute2 and netlink library for go [1]:
      
       1) $ bridge fdb show
          33:33:00:00:00:01 dev ens3 self permanent
          01:00:5e:00:00:01 dev ens3 self permanent
          33:33:ff:15:98:30 dev ens3 self permanent
      
            This one works, as it uses 'struct ifinfomsg'.
      
            fdb_show() @ iproute2/bridge/fdb.c
              """
              .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
              ...
              if (rtnl_dump_request(&rth, RTM_GETNEIGH, [...]
              """
      
       2) $ ip --family bridge neigh
          RTNETLINK answers: Invalid argument
          Dump terminated
      
            This one fails, as it uses 'struct ndmsg'.
      
            do_show_or_flush() @ iproute2/ip/ipneigh.c
              """
              .n.nlmsg_type = RTM_GETNEIGH,
              .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg)),
              """
      
       3) $ ./neighlist
          < no output >
      
            This one fails, as its request is 'struct ndmsg'-based.
      
            neighList() @ netlink/neigh_linux.go
              """
              req := h.newNetlinkRequest(unix.RTM_GETNEIGH, [...]
              msg := Ndmsg{
              """
      
      The actual breakage was introduced by commit 0ff50e83 ("net: rtnetlink:
      bail out from rtnl_fdb_dump() on parse error"), because nlmsg_parse() fails
      if the payload length (with the _actual_ family header) is less than the
      family header length alone (which is assumed, in parameter 'hdrlen').
      This is true in the examples above with struct ndmsg, with size and payload
      length shorter than struct ifinfomsg.
      
      However, that commit just intends to fix something under the assumption the
      family header is indeed a 'struct ifinfomsg' - by preventing access to the
      payload as such (via 'ifm' pointer) if the payload length is not sufficient
      to actually contain it.
      
      The assumption was introduced by commit 5e6d2435 ("bridge: netlink dump
      interface at par with brctl"), to support iproute2's 'bridge fdb' command
      (not 'ip neigh') which indeed uses 'struct ifinfomsg', thus is not broken.
      
      So, in order to unbreak the 'struct ndmsg' family headers and still allow
      'struct ifinfomsg' to continue to work, check for the known message sizes
      used with 'struct ndmsg' in iproute2 (with zero or one attribute which is
      not used in this function anyway) then do not parse the data as ifinfomsg.
      
      Same examples with this patch applied (or with the original fix reverted, or on a kernel from before it):
      
          $ bridge fdb show
          33:33:00:00:00:01 dev ens3 self permanent
          01:00:5e:00:00:01 dev ens3 self permanent
          33:33:ff:15:98:30 dev ens3 self permanent
      
          $ ip --family bridge neigh
          dev ens3 lladdr 33:33:00:00:00:01 PERMANENT
          dev ens3 lladdr 01:00:5e:00:00:01 PERMANENT
          dev ens3 lladdr 33:33:ff:15:98:30 PERMANENT
      
          $ ./neighlist
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0x0, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x1, 0x0, 0x5e, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0xff, 0x15, 0x98, 0x30}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
      
      Tested on mainline (v4.19-rc6) and net-next (3bd09b05b068).
      
      References:
      
      [1] netlink library for go (test-case)
          https://github.com/vishvananda/netlink
      
          $ cat ~/go/src/neighlist/main.go
          package main
          import ("fmt"; "syscall"; "github.com/vishvananda/netlink")
          func main() {
              neighs, _ := netlink.NeighList(0, syscall.AF_BRIDGE)
              for _, neigh := range neighs { fmt.Printf("%#v\n", neigh) }
          }
      
          $ export GOPATH=~/go
          $ go get github.com/vishvananda/netlink
          $ go build neighlist
          $ ~/go/src/neighlist/neighlist
      
      Thanks to David Ahern for suggestions to improve this patch.
      
      Fixes: 0ff50e83 ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse error")
      Fixes: 5e6d2435 ("bridge: netlink dump interface at par with brctl")
      Reported-by: Aidan Obley <aobley@pivotal.io>
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
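The size check described in the commit message above can be sketched in userspace C. This is only an illustration, not the kernel patch itself: the function names are hypothetical, and the assumption that the single optional attribute carries a u32 payload is mine. The structures and macros come from the Linux UAPI headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/netlink.h>     /* struct nlmsghdr, NLMSG_HDRLEN, NLA_HDRLEN */
#include <linux/rtnetlink.h>   /* struct ifinfomsg (for callers) */
#include <linux/neighbour.h>   /* struct ndmsg */

/* Payload length: total message length minus the aligned netlink header. */
size_t fdb_payload_len(const struct nlmsghdr *nlh)
{
    assert(nlh->nlmsg_len >= (uint32_t)NLMSG_HDRLEN);
    return nlh->nlmsg_len - NLMSG_HDRLEN;
}

/*
 * Hypothetical helper: true when the payload is exactly a struct ndmsg,
 * optionally followed by one attribute with a 4-byte payload (assumed u32).
 * In that case the data must not be parsed as a struct ifinfomsg.
 */
bool fdb_request_uses_ndmsg(const struct nlmsghdr *nlh)
{
    size_t len = fdb_payload_len(nlh);
    size_t one_u32_attr = NLA_HDRLEN + sizeof(uint32_t);

    return len == sizeof(struct ndmsg) ||
           len == sizeof(struct ndmsg) + one_u32_attr;
}
```

A request built with NLMSG_LENGTH(sizeof(struct ndmsg)), as 'ip neigh' does, is recognized by this helper, while one built with NLMSG_LENGTH(sizeof(struct ifinfomsg)), as 'bridge fdb show' does, is not and would still go through the ifinfomsg parsing path.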
  4. 03 Oct, 2018 1 commit
    • rtnl: limit IFLA_NUM_TX_QUEUES and IFLA_NUM_RX_QUEUES to 4096 · 0e1d6eca
      Committed by Eric Dumazet
      We have an impressive number of syzkaller bugs that are linked
      to the fact that syzbot was able to create a networking device
      with millions of TX (or RX) queues.
      
      Let's limit the number of RX/TX queues to 4096; this really should
      cover all known cases.
      
      A separate patch will add various cond_resched() in the loops
      handling sysfs entries at device creation and dismantle.
      
      Tested:
      
      lpaa6:~# ip link add gre-4097 numtxqueues 4097 numrxqueues 4097 type ip6gretap
      RTNETLINK answers: Invalid argument
      
      lpaa6:~# time ip link add gre-4096 numtxqueues 4096 numrxqueues 4096 type ip6gretap
      
      real	0m0.180s
      user	0m0.000s
      sys	0m0.107s
      
      Fixes: 76ff5cc9 ("rtnl: allow to specify number of rx and tx queues on device creation")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
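A minimal sketch of the kind of bounds check this commit describes. The helper name is hypothetical and the lower bound of 1 is an assumption; only the upper limit of 4096 and the "Invalid argument" result come from the commit message.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_MAX_QUEUES 4096U   /* upper limit stated in the commit message */

/*
 * Hypothetical validation helper: accept queue counts in 1..4096 and
 * reject anything else with -EINVAL ("Invalid argument" in ip's output).
 */
int sketch_check_queue_counts(unsigned int num_tx, unsigned int num_rx)
{
    if (num_tx < 1 || num_tx > SKETCH_MAX_QUEUES)
        return -EINVAL;
    if (num_rx < 1 || num_rx > SKETCH_MAX_QUEUES)
        return -EINVAL;
    return 0;
}
```

With this check, the 4097-queue request from the test above would be rejected, while the 4096-queue request would pass.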
  5. 02 Oct, 2018 1 commit
  6. 26 Sep, 2018 1 commit
  7. 14 Sep, 2018 1 commit
  8. 06 Sep, 2018 3 commits
  9. 30 Aug, 2018 1 commit
    • net: rtnl: return early from rtnl_unregister_all when protocol isn't registered · f707ef61
      Committed by Sabrina Dubroca
      rtnl_unregister_all(PF_INET6) gets called from inet6_init in cases when
      no handler has been registered for PF_INET6 yet, for example if
      ip6_mr_init() fails. Abort and avoid a NULL pointer deref in that case.
      
      Example of panic (triggered by faking a failure of
       register_pernet_subsys):
      
          general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
          [...]
          RIP: 0010:rtnl_unregister_all+0x17e/0x2a0
          [...]
          Call Trace:
           ? rtnetlink_net_init+0x250/0x250
           ? sock_unregister+0x103/0x160
           ? kernel_getsockopt+0x200/0x200
           inet6_init+0x197/0x20d
      
      Fixes: e2fddf5e ("[IPV6]: Make af_inet6 to check ip6_route_init return value.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
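The early-return pattern from this commit can be illustrated with a small userspace model. Everything here (table layout, names, sizes) is hypothetical; it only shows why an "unregister all" routine has to tolerate a protocol whose handler table was never allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PROTO_MAX 4   /* hypothetical number of protocols */
#define SKETCH_MSG_TYPES 8   /* hypothetical handlers per protocol */

struct sketch_handler { int registered; };

/* One lazily allocated table per protocol; NULL means "never registered". */
static struct sketch_handler *sketch_tab[SKETCH_PROTO_MAX];

int sketch_register(int protocol, int msgtype)
{
    assert(protocol >= 0 && protocol < SKETCH_PROTO_MAX);
    assert(msgtype >= 0 && msgtype < SKETCH_MSG_TYPES);

    if (!sketch_tab[protocol]) {
        sketch_tab[protocol] = calloc(SKETCH_MSG_TYPES,
                                      sizeof(struct sketch_handler));
        if (!sketch_tab[protocol])
            return -1;
    }
    sketch_tab[protocol][msgtype].registered = 1;
    return 0;
}

/* Returns the number of handlers dropped; 0 if nothing was registered. */
int sketch_unregister_all(int protocol)
{
    struct sketch_handler *tab;
    int dropped = 0;
    int i;

    assert(protocol >= 0 && protocol < SKETCH_PROTO_MAX);

    tab = sketch_tab[protocol];
    if (!tab)
        return 0;   /* early return: no table, nothing to walk or free */

    for (i = 0; i < SKETCH_MSG_TYPES; i++)
        if (tab[i].registered)
            dropped++;

    sketch_tab[protocol] = NULL;
    free(tab);
    return dropped;
}
```

Without the NULL check, the loop would dereference a missing table, which is the analogue of the fault shown in the trace above.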
  10. 30 Jul, 2018 2 commits
  11. 23 Jul, 2018 1 commit
    • rtnetlink: add rtnl_link_state check in rtnl_configure_link · 5025f7f7
      Committed by Roopa Prabhu
      rtnl_configure_link sets dev->rtnl_link_state to
      RTNL_LINK_INITIALIZED and unconditionally calls
      __dev_notify_flags to notify user-space of dev flags.
      
      current call sequence for rtnl_configure_link
      rtnetlink_newlink
          rtnl_link_ops->newlink
          rtnl_configure_link (unconditionally notifies userspace of
                               default and new dev flags)
      
      If a newlink handler wants to call rtnl_configure_link
      early, we will end up with duplicate notifications to
      user-space.
      
      This patch fixes rtnl_configure_link to check rtnl_link_state
      and call __dev_notify_flags with gchanges = 0 if already
      RTNL_LINK_INITIALIZED.
      
      Later in the series, this patch lets a driver implementing newlink
      call rtnl_configure_link to initialize the link early.
      
      It makes the following call sequence work:
      rtnetlink_newlink
          rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
                                                      link and notifies
                                                      user-space of default
                                                      dev flags)
          rtnl_configure_link (updates dev flags if requested by user ifm
                               and notifies user-space of new dev flags)
      
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
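The state check described in this commit message can be modelled with a toy device structure. All names below are hypothetical stand-ins, and using an all-ones mask on the first call is an assumption; only the "gchanges = 0 if already RTNL_LINK_INITIALIZED" behaviour is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the link state tracked in dev->rtnl_link_state. */
enum sketch_link_state { SKETCH_LINK_INITIALIZING, SKETCH_LINK_INITIALIZED };

struct sketch_dev {
    enum sketch_link_state state;
    unsigned int flags;
    unsigned int notify_count;    /* how many notifications were sent */
    unsigned int last_gchanges;   /* mask passed with the last notification */
};

/* Stand-in for __dev_notify_flags(): records what would be announced. */
static void sketch_notify_flags(struct sketch_dev *dev, unsigned int gchanges)
{
    dev->notify_count++;
    dev->last_gchanges = gchanges;
}

void sketch_configure_link(struct sketch_dev *dev, unsigned int new_flags)
{
    assert(dev != NULL);
    dev->flags = new_flags;

    if (dev->state == SKETCH_LINK_INITIALIZED) {
        /* Link was already announced: report no global changes. */
        sketch_notify_flags(dev, 0U);
    } else {
        /* First call: mark the link initialized and announce everything. */
        dev->state = SKETCH_LINK_INITIALIZED;
        sketch_notify_flags(dev, ~0U);
    }
}
```

In this model, a driver calling the function early from its newlink handler gets the full announcement, and the later call from the core path no longer repeats it.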
  12. 19 Jul, 2018 1 commit
  13. 14 Jul, 2018 3 commits
  14. 07 Jul, 2018 1 commit
    • rtnetlink: add rtnl_link_state check in rtnl_configure_link · 8d356b89
      Committed by Roopa Prabhu
      rtnl_configure_link sets dev->rtnl_link_state to
      RTNL_LINK_INITIALIZED and unconditionally calls
      __dev_notify_flags to notify user-space of dev flags.
      
      current call sequence for rtnl_configure_link
      rtnetlink_newlink
          rtnl_link_ops->newlink
          rtnl_configure_link (unconditionally notifies userspace of
                               default and new dev flags)
      
      If a newlink handler wants to call rtnl_configure_link
      early, we will end up with duplicate notifications to
      user-space.
      
      This patch fixes rtnl_configure_link to check rtnl_link_state
      and call __dev_notify_flags with gchanges = 0 if already
      RTNL_LINK_INITIALIZED.
      
      Later in the series, this patch lets a driver implementing newlink
      call rtnl_configure_link to initialize the link early.
      
      It makes the following call sequence work:
      rtnetlink_newlink
          rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
                                                      link and notifies
                                                      user-space of default
                                                      dev flags)
          rtnl_configure_link (updates dev flags if requested by user ifm
                               and notifies user-space of new dev flags)
      
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 06 Jun, 2018 1 commit
    • rtnetlink: validate attributes in do_setlink() · 644c7eeb
      Committed by Eric Dumazet
      It seems that rtnl_group_changelink() can call do_setlink()
      while a prior call to validate_linkmsg(dev = NULL, ...) could
      not validate IFLA_ADDRESS / IFLA_BROADCAST.
      
      Make sure do_setlink() calls validate_linkmsg() instead
      of leaving this responsibility to its callers.
      
      With help from Dmitry Vyukov, thanks a lot!
      
      BUG: KMSAN: uninit-value in is_valid_ether_addr include/linux/etherdevice.h:199 [inline]
      BUG: KMSAN: uninit-value in eth_prepare_mac_addr_change net/ethernet/eth.c:275 [inline]
      BUG: KMSAN: uninit-value in eth_mac_addr+0x203/0x2b0 net/ethernet/eth.c:308
      CPU: 1 PID: 8695 Comm: syz-executor3 Not tainted 4.17.0-rc5+ #103
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:113
       kmsan_report+0x149/0x260 mm/kmsan/kmsan.c:1084
       __msan_warning_32+0x6e/0xc0 mm/kmsan/kmsan_instr.c:686
       is_valid_ether_addr include/linux/etherdevice.h:199 [inline]
       eth_prepare_mac_addr_change net/ethernet/eth.c:275 [inline]
       eth_mac_addr+0x203/0x2b0 net/ethernet/eth.c:308
       dev_set_mac_address+0x261/0x530 net/core/dev.c:7157
       do_setlink+0xbc3/0x5fc0 net/core/rtnetlink.c:2317
       rtnl_group_changelink net/core/rtnetlink.c:2824 [inline]
       rtnl_newlink+0x1fe9/0x37a0 net/core/rtnetlink.c:2976
       rtnetlink_rcv_msg+0xa32/0x1560 net/core/rtnetlink.c:4646
       netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2448
       rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4664
       netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
       netlink_unicast+0x1678/0x1750 net/netlink/af_netlink.c:1336
       netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
       sock_sendmsg_nosec net/socket.c:629 [inline]
       sock_sendmsg net/socket.c:639 [inline]
       ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
       __sys_sendmsg net/socket.c:2155 [inline]
       __do_sys_sendmsg net/socket.c:2164 [inline]
       __se_sys_sendmsg net/socket.c:2162 [inline]
       __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
       do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x455a09
      RSP: 002b:00007fc07480ec68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007fc07480f6d4 RCX: 0000000000455a09
      RDX: 0000000000000000 RSI: 00000000200003c0 RDI: 0000000000000014
      RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000005d0 R14: 00000000006fdc20 R15: 0000000000000000
      
      Uninit was stored to memory at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
       kmsan_save_stack mm/kmsan/kmsan.c:294 [inline]
       kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:685
       kmsan_memcpy_origins+0x11d/0x170 mm/kmsan/kmsan.c:527
       __msan_memcpy+0x109/0x160 mm/kmsan/kmsan_instr.c:478
       do_setlink+0xb84/0x5fc0 net/core/rtnetlink.c:2315
       rtnl_group_changelink net/core/rtnetlink.c:2824 [inline]
       rtnl_newlink+0x1fe9/0x37a0 net/core/rtnetlink.c:2976
       rtnetlink_rcv_msg+0xa32/0x1560 net/core/rtnetlink.c:4646
       netlink_rcv_skb+0x378/0x600 net/netlink/af_netlink.c:2448
       rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:4664
       netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
       netlink_unicast+0x1678/0x1750 net/netlink/af_netlink.c:1336
       netlink_sendmsg+0x104f/0x1350 net/netlink/af_netlink.c:1901
       sock_sendmsg_nosec net/socket.c:629 [inline]
       sock_sendmsg net/socket.c:639 [inline]
       ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
       __sys_sendmsg net/socket.c:2155 [inline]
       __do_sys_sendmsg net/socket.c:2164 [inline]
       __se_sys_sendmsg net/socket.c:2162 [inline]
       __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
       do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:279 [inline]
       kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:189
       kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:315
       kmsan_slab_alloc+0x10/0x20 mm/kmsan/kmsan.c:322
       slab_post_alloc_hook mm/slab.h:446 [inline]
       slab_alloc_node mm/slub.c:2753 [inline]
       __kmalloc_node_track_caller+0xb32/0x11b0 mm/slub.c:4395
       __kmalloc_reserve net/core/skbuff.c:138 [inline]
       __alloc_skb+0x2cb/0x9e0 net/core/skbuff.c:206
       alloc_skb include/linux/skbuff.h:988 [inline]
       netlink_alloc_large_skb net/netlink/af_netlink.c:1182 [inline]
       netlink_sendmsg+0x76e/0x1350 net/netlink/af_netlink.c:1876
       sock_sendmsg_nosec net/socket.c:629 [inline]
       sock_sendmsg net/socket.c:639 [inline]
       ___sys_sendmsg+0xec0/0x1310 net/socket.c:2117
       __sys_sendmsg net/socket.c:2155 [inline]
       __do_sys_sendmsg net/socket.c:2164 [inline]
       __se_sys_sendmsg net/socket.c:2162 [inline]
       __x64_sys_sendmsg+0x331/0x460 net/socket.c:2162
       do_syscall_64+0x152/0x230 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: e7ed828f ("netlink: support setting devgroup parameters")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
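The idea of validating inside the setter, rather than trusting callers, can be sketched as follows. The types and function names are hypothetical, and a fixed 6-byte Ethernet-style address length is assumed for simplicity; the point is only that a too-short address attribute is rejected before any bytes are copied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ADDR_LEN 6   /* assumed device address length (Ethernet) */

struct sketch_addr_dev {
    unsigned char addr[SKETCH_ADDR_LEN];
};

/* Reject an address attribute shorter than the device address length. */
int sketch_validate_addr(const void *attr, size_t attr_len)
{
    if (attr != NULL && attr_len < SKETCH_ADDR_LEN)
        return -EINVAL;
    return 0;
}

/* The setter validates on its own, so no caller can skip the check. */
int sketch_set_link_addr(struct sketch_addr_dev *dev,
                         const void *attr, size_t attr_len)
{
    int err;

    assert(dev != NULL);

    err = sketch_validate_addr(attr, attr_len);
    if (err)
        return err;

    if (attr != NULL)
        memcpy(dev->addr, attr, SKETCH_ADDR_LEN);
    return 0;
}
```

If the check were left to callers, a path that skipped it could copy bytes past the end of a short attribute, which is the kind of uninitialized read KMSAN reports above.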
  16. 01 Jun, 2018 2 commits
    • rtnetlink: Fix null-ptr-deref in rtnl_newlink · af066ed3
      Committed by Prashant Bhole
      In rtnl_newlink(), the NULL check is performed on m_ops, but a member of
      ops is accessed. Fix this by accessing the member of m_ops instead.
      
      [  345.432629] BUG: KASAN: null-ptr-deref in rtnl_newlink+0x400/0x1110
      [  345.432629] Read of size 4 at addr 0000000000000088 by task ip/986
      [  345.432629]
      [  345.432629] CPU: 1 PID: 986 Comm: ip Not tainted 4.17.0-rc6+ #9
      [  345.432629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [  345.432629] Call Trace:
      [  345.432629]  dump_stack+0xc6/0x150
      [  345.432629]  ? dump_stack_print_info.cold.0+0x1b/0x1b
      [  345.432629]  ? kasan_report+0xb4/0x410
      [  345.432629]  kasan_report.cold.4+0x8f/0x91
      [  345.432629]  ? rtnl_newlink+0x400/0x1110
      [  345.432629]  rtnl_newlink+0x400/0x1110
      [...]
      
      Fixes: ccf8dbcd ("rtnetlink: Remove VLA usage")
      Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Tested-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
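The bug class fixed here (checking one pointer, dereferencing another) is easy to show in isolation. The struct, the limit value and the function name below are hypothetical; the sketch only demonstrates the corrected shape, where the pointer that was NULL-checked is the one that gets dereferenced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_SLAVE_MAX_TYPE 33U   /* illustrative limit */

struct sketch_ops {
    unsigned int maxtype;
    unsigned int slave_maxtype;
};

/*
 * ops may legitimately be NULL while m_ops is set, so the body guarded
 * by "if (m_ops)" must only touch m_ops.
 */
int sketch_check_slave_type(const struct sketch_ops *ops,
                            const struct sketch_ops *m_ops)
{
    (void)ops;   /* unused in this branch; must not be dereferenced here */

    if (m_ops) {
        if (m_ops->slave_maxtype > SKETCH_SLAVE_MAX_TYPE)
            return -EINVAL;
    }
    return 0;
}
```

Calling the helper with ops set to NULL and a valid m_ops is safe in this form; the buggy form would read through the NULL ops pointer, as in the KASAN report above.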
    • rtnetlink: Remove VLA usage · ccf8dbcd
      Committed by Kees Cook
      In the quest to remove all stack VLA usage from the kernel[1], this
      allocates the maximum size expected for all possible types and adds
      sanity-checks at both registration and usage to make sure nothing gets
      out of sync. This matches the proposed VLA solution for nfnetlink[2]. The
      values chosen here were based on finding assignments for .maxtype and
      .slave_maxtype and manually counting the enums:
      
      slave_maxtype (max 33):
      	IFLA_BRPORT_MAX     33
      	IFLA_BOND_SLAVE_MAX  9
      
      maxtype (max 45):
      	IFLA_BOND_MAX       28
      	IFLA_BR_MAX         45
      	__IFLA_CAIF_HSI_MAX  8
      	IFLA_CAIF_MAX        4
      	IFLA_CAN_MAX        16
      	IFLA_GENEVE_MAX     12
      	IFLA_GRE_MAX        25
      	IFLA_GTP_MAX         5
      	IFLA_HSR_MAX         7
      	IFLA_IPOIB_MAX       4
      	IFLA_IPTUN_MAX      21
      	IFLA_IPVLAN_MAX      3
      	IFLA_MACSEC_MAX     15
      	IFLA_MACVLAN_MAX     7
      	IFLA_PPP_MAX         2
      	__IFLA_RMNET_MAX     4
      	IFLA_VLAN_MAX        6
      	IFLA_VRF_MAX         2
      	IFLA_VTI_MAX         7
      	IFLA_VXLAN_MAX      28
      	VETH_INFO_MAX        2
      	VXCAN_INFO_MAX       2
      
      This additionally changes maxtype and slave_maxtype fields to unsigned,
      since they're only ever using positive values.
      
      [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
      [2] https://patchwork.kernel.org/patch/10439647/
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
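The VLA-removal approach described above (a fixed maximum-size array plus sanity checks at registration and at use) might look roughly like this in miniature. The names are hypothetical, and the limit simply reuses the largest maxtype counted in the commit message (45); the constant chosen in the actual patch may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_MAX_TYPE 45U   /* largest maxtype listed in the message above */

struct sketch_link_ops {
    const char *kind;
    unsigned int maxtype;   /* unsigned, as the commit message describes */
};

/* Registration-time check: refuse ops that would overflow the fixed array. */
int sketch_link_register(const struct sketch_link_ops *ops)
{
    assert(ops != NULL);
    if (ops->maxtype > SKETCH_MAX_TYPE)
        return -EINVAL;
    return 0;
}

/*
 * Usage-time check plus a fixed-size table instead of a VLA sized by
 * ops->maxtype + 1. Returns the number of distinct known types seen.
 */
int sketch_parse_attrs(const struct sketch_link_ops *ops,
                       const unsigned int *types, size_t ntypes)
{
    const unsigned int *attr[SKETCH_MAX_TYPE + 1] = { NULL };
    int accepted = 0;
    size_t i;

    if (ops->maxtype > SKETCH_MAX_TYPE)
        return -EINVAL;

    for (i = 0; i < ntypes; i++) {
        if (types[i] == 0 || types[i] > ops->maxtype)
            continue;   /* unknown attribute type: skipped */
        if (!attr[types[i]])
            accepted++;
        attr[types[i]] = &types[i];
    }
    return accepted;
}
```

The cost of this design is a stack array sized for the worst case; the benefit is that its size is known at compile time and both checks keep it in sync with what link types declare.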
  17. 18 Apr, 2018 1 commit
  18. 01 Apr, 2018 1 commit
    • net: Do not take net_rwsem in __rtnl_link_unregister() · 554873e5
      Committed by Kirill Tkhai
      This function calls call_netdevice_notifier(), which
      may also take net_rwsem. So, we can't use net_rwsem here.
      
      This patch makes callers of this function take pernet_ops_rwsem,
      like register_netdevice_notifier() does. This will protect
      the modifications of net_namespace_list, and allows notifiers
      to take it (they won't have to care about context).
      
      Since __rtnl_link_unregister() is used on module load
      and unload (which are not frequent operations), this looks
      better to me than making every call_netdevice_notifier()
      always execute in "protected net_namespace_list" context.
      
      Also, this fixes the problem we dealt with in 328fbe74
      ("Close race between {un, }register_netdevice_notifier and ..."),
      and guarantees __rtnl_link_unregister() does not skip
      an exiting net.
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 30 Mar, 2018 1 commit
    • net: Introduce net_rwsem to protect net_namespace_list · f0b07bb1
      Committed by Kirill Tkhai
      rtnl_lock() is used everywhere, and contention is very high.
      When someone wants to iterate over alive net namespaces,
      there is no way to do that without an exclusive lock.
      But the exclusive rtnl_lock() in such places is overkill,
      and it just increases the contention. Yes, there is already
      for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
      and this can't be sleepable. Also, sometimes it may be necessary
      to really prevent net_namespace_list growth, so for_each_net_rcu()
      does not fit there.
      
      This patch introduces a new rw_semaphore, which will be used
      instead of rtnl_mutex to protect net_namespace_list. It is
      sleepable and allows non-exclusive iterations over the net
      namespaces list. It allows us to stop using rtnl_lock()
      in several places (which is done in the next patches) and reduces
      the time we hold rtnl_mutex. Here we just add the new lock,
      while the explanations of why we can remove rtnl_lock() in each
      place are in the next patches.
      
      Fine-grained locks are generally better than one big lock,
      so let's do that with net_namespace_list, while the situation
      allows that.
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 28 Mar, 2018 3 commits
  21. 17 Mar, 2018 1 commit
    • net: Add rtnl_lock_killable() · 79ffdfc6
      Committed by Kirill Tkhai
      rtnl_lock() is a widely used mutex in the kernel. Some kernel code
      does memory allocations under it. In case of memory deficit this
      may invoke the OOM killer, but the problem is that a killed task can't
      exit if it's waiting for the mutex. This may cause deadlock
      and panic.
      
      This patch adds a new primitive, which responds to SIGKILL, so
      it can be used in places where we don't want to sleep
      forever.
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 13 Feb, 2018 2 commits
    • net: Convert rtnetlink_net_ops · 46456675
      Committed by Kirill Tkhai
      rtnetlink_net_init() and rtnetlink_net_exit()
      create and destroy the netlink socket net::rtnl.
      
      The socket is used to send rtnl notifications via
      rtnl_net_notifyid(). There is no problem with creating
      and destroying it in parallel with other
      pernet operations, as we link the net in setup_net()
      after the socket is created, and destroy it
      in cleanup_net() after the net is unhashed from all
      the lists and there are no RCU references to it.
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Introduce net_sem for protection of pernet_list · 1a57feb8
      Committed by Kirill Tkhai
      Currently, the mutex is mostly used to protect pernet operations
      list. It orders setup_net() and cleanup_net() with parallel
      {un,}register_pernet_operations() calls, so ->exit{,batch} methods
      of the same pernet operations are executed for a dying net, as
      were used to call ->init methods, even after the net namespace
      is unlinked from net_namespace_list in cleanup_net().
      
      But there are several problems with scalability. The first one
      is that more than one net can't be created or destroyed
      at the same moment on the node. Big machines with many CPUs
      running many containers are very sensitive to this.
      
      The second one is that synchronize_rcu() is needed after the net
      is removed from net_namespace_list:
      
      Destroy net_ns:
      cleanup_net()
        mutex_lock(&net_mutex)
        list_del_rcu(&net->list)
        synchronize_rcu()                                  <--- Sleep there for ages
        list_for_each_entry_reverse(ops, &pernet_list, list)
          ops_exit_list(ops, &net_exit_list)
        list_for_each_entry_reverse(ops, &pernet_list, list)
          ops_free_list(ops, &net_exit_list)
        mutex_unlock(&net_mutex)
      
      This primitive is not fast, especially on systems with many processors
      and/or when preemptible RCU is enabled in the config. So, all the time
      cleanup_net() is waiting for the RCU grace period, creation of new net
      namespaces is not possible; the tasks that attempt it sleep on the same mutex:
      
      Create net_ns:
      copy_net_ns()
        mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
      
      I observed 20-30 second hangs of "unshare -n" on an ordinary 8-CPU laptop
      with preemptible RCU enabled after a CRIU test round had finished.
      
      The solution is to convert net_mutex to an rw_semaphore and add fine-grained
      locks to the really small number of pernet_operations that really need them.
      
      Then, pernet_operations::init/::exit methods, modifying the net-related data,
      will require down_read() locking only, while down_write() will be used
      for changing pernet_list (i.e., when modules are being loaded and unloaded).
      
      This gives a significant performance increase after the whole patch set is
      applied, as you can see here:
      
      %for i in {1..10000}; do unshare -n bash -c exit; done
      
      *before*
      real 1m40,377s
      user 0m9,672s
      sys 0m19,928s
      
      *after*
      real 0m17,007s
      user 0m5,311s
      sys 0m11,779s
      
      (about 5.9 times faster)
      
      This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore,
      describes the variables it protects, and uses it where appropriate.
      net_mutex is still present, and the next patches will kick it out step by step.
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 09 Feb, 2018 1 commit
    • rtnetlink: require unique netns identifier · 4ff66cae
      Committed by Christian Brauner
      Since we've added support for IFLA_IF_NETNSID for RTM_{DEL,GET,SET,NEW}LINK
      it is possible for userspace to send us requests with three different
      properties to identify a target network namespace. This affects at least
      RTM_{NEW,SET}LINK. Each of them could potentially refer to a different
      network namespace which is confusing. For legacy reasons the kernel will
      pick the IFLA_NET_NS_PID property first and then look for the
      IFLA_NET_NS_FD property but there is no reason to extend this type of
      behavior to network namespace ids. The regression potential is quite
      minimal since the rtnetlink requests in question either won't allow
      IFLA_IF_NETNSID requests before 4.16 is out (RTM_{NEW,SET}LINK) or don't
      support IFLA_NET_NS_{PID,FD} (RTM_{DEL,GET}LINK) in the first place.
      
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
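The uniqueness rule from this commit reduces to "at most one of the three namespace identifiers may be present". A minimal sketch, with a hypothetical function name and booleans standing in for the presence of IFLA_NET_NS_PID, IFLA_NET_NS_FD and IFLA_IF_NETNSID; returning -EINVAL for a conflict is an assumption, since the message does not name the error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical check: a request may identify the target network
 * namespace by PID, by fd, or by netns id, but not by more than one.
 */
int sketch_ensure_unique_netns(bool has_pid, bool has_fd, bool has_netnsid)
{
    int present = (has_pid ? 1 : 0) + (has_fd ? 1 : 0) + (has_netnsid ? 1 : 0);

    if (present > 1)
        return -EINVAL;   /* assumed error code for conflicting identifiers */
    return 0;
}
```

Rejecting the combination outright avoids having to define yet another precedence order on top of the legacy "PID first, then fd" behaviour.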
  24. 01 Feb, 2018 1 commit
  25. 31 Jan, 2018 1 commit
    • rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK · 5bb8ed07
      Committed by Christian Brauner
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_NEWLINK supports the
        IFLA_IF_NETNSID property, it should first send an RTM_GETLINK request
        with IFLA_IF_NETNSID on lo. If either EACCES is returned or the reply
        does not include IFLA_IF_NETNSID, userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property, userspace
        can send an RTM_NEWLINK with an IFLA_IF_NETNSID property. If it receives
        EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
        with RTM_NEWLINK. Userspace should then fall back to other means.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
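The two-step probing procedure in the backwards-compatibility note can be written down as a small decision function. This is a sketch only: the enum and function names are hypothetical, the inputs are the (positive) errno values and reply contents a caller would have collected from its own RTM_GETLINK and RTM_NEWLINK attempts, and no netlink traffic is performed here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum sketch_netnsid_support {
    SKETCH_NETNSID_UNSUPPORTED,    /* kernel does not know IFLA_IF_NETNSID */
    SKETCH_NETNSID_GETLINK_ONLY,   /* known, but not for RTM_NEWLINK */
    SKETCH_NETNSID_NEWLINK         /* usable with RTM_NEWLINK */
};

/*
 * getlink_err / newlink_err: 0 on success, otherwise a positive errno.
 * reply_has_netnsid: whether the RTM_GETLINK reply carried IFLA_IF_NETNSID.
 */
enum sketch_netnsid_support
sketch_classify_netnsid(int getlink_err, bool reply_has_netnsid, int newlink_err)
{
    /* Step 1: RTM_GETLINK with IFLA_IF_NETNSID on lo. */
    if (getlink_err == EACCES || !reply_has_netnsid)
        return SKETCH_NETNSID_UNSUPPORTED;

    /* Step 2: RTM_NEWLINK with IFLA_IF_NETNSID. */
    if (newlink_err == EOPNOTSUPP)
        return SKETCH_NETNSID_GETLINK_ONLY;

    return SKETCH_NETNSID_NEWLINK;
}
```

A caller that gets anything other than SKETCH_NETNSID_NEWLINK would fall back to other means of selecting the target namespace, as the note recommends.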