1. 13 Feb 2018 — 2 commits
  2. 26 Jan 2018 — 1 commit
    • net: Move net::netns_ids destruction out of rtnl_lock() and document locking scheme · fb07a820
      Authored by Kirill Tkhai
      Currently, we unhash a dying net from netns_ids lists
      under rtnl_lock(). It's a leftover from the time when
      net::netns_ids was introduced. There was no net::nsid_lock,
      and rtnl_lock() was mostly needed to order modification
      of the nsid idrs of alive nets, i.e. for:
      	for_each_net(tmp) {
      		...
      		id = __peernet2id(tmp, net);
      		idr_remove(&tmp->netns_ids, id);
      		...
      	}
      
      Since we have net::nsid_lock, the modifications are
      protected by this local lock, and now we may introduce
      a better scheme of netns_ids destruction.
      
      Let's look at the functions peernet2id_alloc() and
      get_net_ns_by_id(). Previous commits taught these
      functions to work well with a dying net acquired from
      rtnl-unlocked lists, and they are the only functions
      which can hash a net into netns_ids or obtain it from there.
      As is easy to check, the other functions operating on
      netns_ids work with ids, not with net pointers, so we do not
      need rtnl_lock() to synchronize cleanup_net() with any of them.
      
      Another property used in the patch is that a net is
      unhashed from net_namespace_list in only one place and
      by only one process. So we avoid an excess rcu_read_lock()
      or rtnl_lock() when iterating over the list in unhash_nsid().
      
      All of the above makes it possible to keep rtnl_lock() held
      only for the net->list deletion, and to avoid it completely
      for netns_ids unhashing and destruction. As these two
      operations may take a long time (e.g., memory allocation
      to send an skb), the patch should improve scalability and
      significantly decrease the time for which rtnl_lock() is
      held in cleanup_net().
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fb07a820
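      A minimal sketch of how the rtnl-free unhashing described above might look. The helper names (__peernet2id(), idr_remove(), rtnl_net_notifyid()) come from the surrounding commit messages, while the function shape, its "last" parameter and the error handling are assumptions, not verified upstream code.
      	/* Sketch: unhash a dying net from every other net's netns_ids
      	 * without rtnl_lock(); each idr is protected by its nsid_lock.
      	 */
      	static void unhash_nsid(struct net *net, struct net *last)
      	{
      		struct net *tmp;
      
      		/* Only the cleanup_net() work deletes entries from
      		 * net_namespace_list, and we are that very work, so the
      		 * list can only grow under us; no rtnl_lock() needed.
      		 */
      		for_each_net(tmp) {
      			int id;
      
      			spin_lock_bh(&tmp->nsid_lock);
      			id = __peernet2id(tmp, net);
      			if (id >= 0)
      				idr_remove(&tmp->netns_ids, id);
      			spin_unlock_bh(&tmp->nsid_lock);
      			if (id >= 0)
      				rtnl_net_notifyid(tmp, RTM_DELNSID, id);
      			if (tmp == last)
      				break;
      		}
      		spin_lock_bh(&net->nsid_lock);
      		idr_destroy(&net->netns_ids);
      		spin_unlock_bh(&net->nsid_lock);
      	}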
  3. 18 Jan 2018 — 2 commits
    • net: Remove spinlock from get_net_ns_by_id() · 42157277
      Authored by Kirill Tkhai
      idr_find() is safe under rcu_read_lock() and
      maybe_get_net() guarantees that net is alive.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42157277
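      With the spinlock gone, the lookup can rely on RCU alone. A rough sketch of the resulting function, assuming the shape implied by the commit message (idr_find() under rcu_read_lock(), then maybe_get_net()):
      	/* Sketch: RCU protects the idr lookup; maybe_get_net() only
      	 * succeeds if the peer's refcount has not dropped to zero.
      	 */
      	struct net *get_net_ns_by_id(struct net *net, int id)
      	{
      		struct net *peer;
      
      		if (id < 0)
      			return NULL;
      
      		rcu_read_lock();
      		peer = idr_find(&net->netns_ids, id);
      		if (peer)
      			peer = maybe_get_net(peer);
      		rcu_read_unlock();
      
      		return peer;
      	}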
    • net: Fix possible race in peernet2id_alloc() · 0c06bea9
      Authored by Kirill Tkhai
      peernet2id_alloc() is racy without rtnl_lock(), as refcount_read(&peer->count)
      under net->nsid_lock does not guarantee that peer is alive:
      
      rcu_read_lock()
      peernet2id_alloc()                            ..
        spin_lock_bh(&net->nsid_lock)               ..
        refcount_read(&peer->count) (!= 0)          ..
        ..                                          put_net()
        ..                                            cleanup_net()
        ..                                              for_each_net(tmp)
        ..                                                spin_lock_bh(&tmp->nsid_lock)
        ..                                                __peernet2id(tmp, net) == -1
        ..                                                    ..
        ..                                                    ..
          __peernet2id_alloc(alloc == true)                   ..
        ..                                                    ..
      rcu_read_unlock()                                       ..
      ..                                                synchronize_rcu()
      ..                                                kmem_cache_free(net)
      
      After the above situation, net::netns_ids contains an id pointing to freed
      memory, and any later dereference by that id will operate on this freed memory.
      
      Currently, peernet2id_alloc() is used under rtnl_lock() everywhere except
      ovs_vport_cmd_fill_info(), so this race can't occur there. But peernet2id_alloc()
      is a generic interface, and it is better to fix it before someone really
      starts to use it in the wrong context.
      
      v2: Don't place refcount_read(&net->count) under net->nsid_lock
          as suggested by Eric W. Biederman <ebiederm@xmission.com>
      v3: Rebase on top of net-next
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c06bea9
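      A hedged sketch of the fixed allocation path: take a conditional reference on the peer under nsid_lock, so a dying peer is never hashed. The __peernet2id_alloc() signature and the notify step are assumptions based on the messages above, not verified upstream code.
      	/* Sketch: maybe_get_net() fails once peer's refcount is zero,
      	 * so we never insert a dying peer into net->netns_ids.
      	 */
      	int peernet2id_alloc(struct net *net, struct net *peer)
      	{
      		bool alloc = false, alive = false;
      		int id;
      
      		spin_lock_bh(&net->nsid_lock);
      		if (maybe_get_net(peer))
      			alive = alloc = true;
      		id = __peernet2id_alloc(net, peer, &alloc);
      		spin_unlock_bh(&net->nsid_lock);
      		if (alloc && id >= 0)
      			rtnl_net_notifyid(net, RTM_NEWNSID, id);
      		if (alive)
      			put_net(peer);
      		return id;
      	}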
  4. 16 Jan 2018 — 1 commit
  5. 21 Dec 2017 — 1 commit
    • net: Fix double free and memory corruption in get_net_ns_by_id() · 21b59443
      Authored by Eric W. Biederman
      (I can trivially verify that the idr_remove in cleanup_net happens
       after the network namespace count has dropped to zero --EWB)
      
      Function get_net_ns_by_id() does not check net::count
      after it has found a peer in the netns_ids idr.
      
      It may dereference a peer after its count has already been
      finally decremented. This leads to a double free and memory
      corruption:
      
      put_net(peer)                                   rtnl_lock()
      atomic_dec_and_test(&peer->count) [count=0]     ...
      __put_net(peer)                                 get_net_ns_by_id(net, id)
        spin_lock(&cleanup_list_lock)
        list_add(&net->cleanup_list, &cleanup_list)
        spin_unlock(&cleanup_list_lock)
      queue_work()                                      peer = idr_find(&net->netns_ids, id)
        |                                               get_net(peer) [count=1]
        |                                               ...
        |                                               (use after final put)
        v                                               ...
        cleanup_net()                                   ...
          spin_lock(&cleanup_list_lock)                 ...
          list_replace_init(&cleanup_list, ..)          ...
          spin_unlock(&cleanup_list_lock)               ...
          ...                                           ...
          ...                                           put_net(peer)
          ...                                             atomic_dec_and_test(&peer->count) [count=0]
          ...                                               spin_lock(&cleanup_list_lock)
          ...                                               list_add(&net->cleanup_list, &cleanup_list)
          ...                                               spin_unlock(&cleanup_list_lock)
          ...                                             queue_work()
          ...                                           rtnl_unlock()
          rtnl_lock()                                   ...
          for_each_net(tmp) {                           ...
            id = __peernet2id(tmp, peer)                ...
            spin_lock_irq(&tmp->nsid_lock)              ...
            idr_remove(&tmp->netns_ids, id)             ...
            ...                                         ...
            net_drop_ns()                               ...
      	net_free(peer)                            ...
          }                                             ...
        |
        v
        cleanup_net()
          ...
          (Second free of peer)
      
      Also, put_net() on the right CPU may reorder with the left CPU's
      list_replace_init(&cleanup_list, ..), and then cleanup_list
      will be corrupted.
      
      Since cleanup_net() is executed in a worker thread, while
      put_net(peer) can happen anywhere, there should be
      enough time for a concurrent get_net_ns_by_id() to pick
      the peer up, so the race does not seem unlikely.
      The patch fixes the problem in the standard way.
      
      (Also, there is a possible problem in peernet2id_alloc(), which requires
      a check of net::count under nsid_lock and maybe_get_net(peer), but
      in the current stable kernel it's used under rtnl_lock(), so it has to be
      safe. Open vSwitch has begun to use peernet2id_alloc(), and possibly it
      should be fixed too. But this is not in a stable kernel yet, so I'll send
      a separate message to netdev@ later.)
      
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Fixes: 0c7aecd4 "netns: add rtnl cmd to add and get peer netns ids"
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21b59443
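      The "standard way" referred to above is a conditional reference: take the count only if it has not already hit zero. A minimal sketch of such a helper, assuming refcount_t semantics (older kernels would use atomic_inc_not_zero() on an atomic_t count, as the diagram above suggests):
      	/* Sketch: return NULL instead of resurrecting a net whose
      	 * refcount has already dropped to zero.
      	 */
      	static inline struct net *maybe_get_net(struct net *net)
      	{
      		if (!refcount_inc_not_zero(&net->count))
      			net = NULL;
      		return net;
      	}
      
      	/* get_net_ns_by_id() would then call maybe_get_net(peer) instead
      	 * of an unconditional get_net(peer) after idr_find().
      	 */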
  6. 05 Nov 2017 — 1 commit
  7. 10 Aug 2017 — 2 commits
  8. 01 Jul 2017 — 1 commit
  9. 20 Jun 2017 — 1 commit
    • netns: add and use net_ns_barrier · 7866cc57
      Authored by Florian Westphal
      Quoting Joe Stringer:
        If a user loads nf_conntrack_ftp, sends FTP traffic through a network
        namespace, destroys that namespace then unloads the FTP helper module,
        then the kernel will crash.
      
      Events that lead to the crash:
      1. conntrack is created with ftp helper in netns x
      2. This netns is destroyed
      3. netns destruction is scheduled
      4. netns destruction wq starts, removes netns from global list
      5. ftp helper is unloaded, which resets all helpers of the conntracks
      via for_each_net();
      
      but because the netns is already gone from the list, the for_each_net()
      loop doesn't include it, so all of these conntracks are unaffected.
      
      6. helper module unload finishes
      7. netns wq invokes destructor for rmmod'ed helper
      
      CC: "Eric W. Biederman" <ebiederm@xmission.com>
      Reported-by: Joe Stringer <joe@ovn.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      7866cc57
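      The barrier this commit introduces is, in essence, a lock/unlock pair on the mutex that serializes netns cleanup, so that already-scheduled cleanup work finishes before module unload proceeds. A rough sketch under that assumption (the net_mutex name and the exact placement of the call are assumptions):
      	/* Sketch: waiting on the cleanup mutex guarantees that netns
      	 * destruction work queued before the barrier has completed.
      	 */
      	void net_ns_barrier(void)
      	{
      		mutex_lock(&net_mutex);
      		mutex_unlock(&net_mutex);
      	}
      	EXPORT_SYMBOL(net_ns_barrier);
      
      	/* The conntrack-helper unregister path would call net_ns_barrier()
      	 * after resetting helpers, before the module's data is freed.
      	 */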
  10. 11 Jun 2017 — 2 commits
  11. 26 May 2017 — 1 commit
    • net: move somaxconn init from sysctl code · 7c3f1875
      Authored by Roman Kapl
      The default value for somaxconn is set in sysctl_core_net_init(), but this
      function is not called when the kernel is configured without CONFIG_SYSCTL.
      
      This results in the kernel not being able to accept TCP connections,
      because the backlog has zero size. Usually, the user ends up with:
      "TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request.  Check SNMP counters."
      If SYN cookies are not enabled, the connection is rejected.
      
      Before ef547f2a (tcp: remove max_qlen_log), the effects were less
      severe, because the backlog was always at least eight slots long.
      Signed-off-by: Roman Kapl <roman.kapl@sysgo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c3f1875
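      One way to move the default out of the sysctl path is a dedicated pernet init that always runs; a sketch under that assumption (the ops and function names here are illustrative, not necessarily the upstream ones):
      	/* Sketch: set per-netns defaults unconditionally, so they exist
      	 * even when CONFIG_SYSCTL (and thus sysctl_core_net_init) is off.
      	 */
      	static int __net_init net_defaults_init_net(struct net *net)
      	{
      		net->core.sysctl_somaxconn = SOMAXCONN;
      		return 0;
      	}
      
      	static struct pernet_operations net_defaults_ops = {
      		.init = net_defaults_init_net,
      	};
      
      	static int __init net_defaults_init(void)
      	{
      		return register_pernet_subsys(&net_defaults_ops);
      	}
      	core_initcall(net_defaults_init);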
  12. 01 May 2017 — 1 commit
  13. 18 Apr 2017 — 1 commit
  14. 14 Apr 2017 — 1 commit
  15. 02 Mar 2017 — 1 commit
  16. 15 Dec 2016 — 1 commit
  17. 04 Dec 2016 — 3 commits
    • netns: fix net_generic() "id - 1" bloat · 6af2d5ff
      Authored by Alexey Dobriyan
      net_generic() function is both a) inline and b) used ~600 times.
      
      It has the following code inside
      
      		...
      	ptr = ng->ptr[id - 1];
      		...
      
      "id" is never compile time constant so compiler is forced to subtract 1.
      And those decrements or LEA [r32 - 1] instructions add up.
      
      We also start id'ing from 1 to catch bugs where pernet sybsystem id
      is not initialized and 0. This is quite pointless idea (nothing will
      work or immediate interference with first registered subsystem) in
      general but it hints what needs to be done for code size reduction.
      
      Namely, overlaying allocation of pointer array and fixed part of
      structure in the beginning and using usual base-0 addressing.
      
      Ids are just cookies, their exact values do not matter, so lets start
      with 3 on x86_64.
      
      Code size savings (oh boy): -4.2 KB
      
      As usual, ignore the initial compiler stupidity part of the table.
      
      	add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
      	function                                     old     new   delta
      	tipc_nametbl_insert_publ                    1250    1270     +20
      	nlmclnt_lookup_host                          686     703     +17
      	nfsd4_encode_fattr                          5930    5941     +11
      	nfs_get_client                              1050    1061     +11
      	register_pernet_operations                   333     342      +9
      	tcf_mirred_init                              843     849      +6
      	tcf_bpf_init                                1143    1149      +6
      	gss_setup_upcall                             990     994      +4
      	idmap_name_to_id                             432     434      +2
      	ops_init                                     274     275      +1
      	nfsd_inject_forget_client                    259     260      +1
      	nfs4_alloc_client                            612     613      +1
      	tunnel_key_walker                            164     163      -1
      
      		...
      
      	tipc_bcbase_select_primary                   392     360     -32
      	mac80211_hwsim_new_radio                    2808    2767     -41
      	ipip6_tunnel_ioctl                          2228    2186     -42
      	tipc_bcast_rcv                               715     672     -43
      	tipc_link_build_proto_msg                   1140    1089     -51
      	nfsd4_lock                                  3851    3796     -55
      	tipc_mon_rcv                                1012     956     -56
      	Total: Before=156643951, After=156639743, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6af2d5ff
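      A rough sketch of the overlay idea described above: the fixed bookkeeping shares storage with the first pointer slots, so ptr[id] needs no "- 1". The exact field names (s.len, rcu, net->gen) are assumptions drawn from the neighbouring commit messages.
      	/* Sketch: bookkeeping lives in a union with the pointer array,
      	 * so small ids simply alias the reserved leading slots.
      	 */
      	struct net_generic {
      		union {
      			struct {
      				unsigned int len;
      				struct rcu_head rcu;
      			} s;
      
      			void *ptr[0];
      		};
      	};
      
      	static inline void *net_generic(const struct net *net, unsigned int id)
      	{
      		struct net_generic *ng;
      		void *ptr;
      
      		rcu_read_lock();
      		ng = rcu_dereference(net->gen);
      		ptr = ng->ptr[id];	/* base-0: no "- 1" subtraction */
      		rcu_read_unlock();
      
      		return ptr;
      	}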
    • netns: add dummy struct inside "struct net_generic" · 9bfc7b99
      Authored by Alexey Dobriyan
      This is a precursor to fixing the "[id - 1]" bloat inside net_generic().
      
      The name "s" is chosen to complement the name "u" often used for dummy unions.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9bfc7b99
    • netns: publish net_generic correctly · 1a9a0592
      Authored by Alexey Dobriyan
      Publishing the net_generic pointer is done with a silly mistake: the new array
      is published BEFORE setting the freshly acquired pernet subsystem pointer.
      
      	memcpy
      	rcu_assign_pointer
      	kfree_rcu
      	ng->ptr[id - 1] = data;
      
      This bug was introduced with commit dec827d1
      ("[NETNS]: The generic per-net pointers.") in the glorious days of
      chopping networking stack into containers proper 8.5 years ago (whee...)
      
      How did it not trigger for so long?
      Well, you need a quite specific set of conditions:
      
      *) the race window opens once per pernet subsystem addition
         (read: modprobe or boot)
      
      *) not every pernet subsystem is eligible (it needs ->id and ->size)
      
      *) not every pernet subsystem is vulnerable (it needs incorrect or absent
         ordering of register_pernet_subsys() and actually using net_generic())
      
      *) to hide the bug even more, the default is to preallocate 13 pointers,
         which is actually quite a lot. You need IPv6, netfilter, bridging etc.
         loaded together to trigger a reallocation in the first place. Trimmed-down
         configs are OK.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1a9a0592
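      Schematically, the fix is just to fill the new slot before publishing the new array; a sketch continuing the pseudo-code quoted in the message (variable and field names are illustrative):
      	/* Sketch of the corrected ordering: complete the new array,
      	 * then publish it, then free the old one after a grace period.
      	 */
      	memcpy(ng->ptr, old_ng->ptr, old_len * sizeof(void *));
      	ng->ptr[id - 1] = data;			/* set the new pointer first */
      	rcu_assign_pointer(net->gen, ng);	/* only now publish the array */
      	kfree_rcu(old_ng, rcu);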
  18. 18 Nov 2016 — 2 commits
    • netns: make struct pernet_operations::id unsigned int · c7d03a00
      Authored by Alexey Dobriyan
      Make struct pernet_operations::id unsigned.
      
      There are 2 reasons to do so:
      
      1)
      This field is really an index into a zero-based array and
      is thus an unsigned entity. Using a negative value is
      out-of-bounds access by definition.
      
      2)
      On x86_64, unsigned 32-bit data mixed with pointers via
      array indexing, or via offsets added to or subtracted from
      pointers, is preferred over signed 32-bit data.
      
      An "int" used as an array index needs to be sign-extended
      to 64 bits before being used.
      
      	void f(long *p, int i)
      	{
      		g(p[i]);
      	}
      
        roughly translates to
      
      	movsx	rsi, esi
      	mov	rdi, [rsi+...]
      	call 	g
      
      MOVSX is a 3-byte instruction which isn't necessary if the variable is
      unsigned, because x86_64 zero-extends 32-bit operations by default.
      
      Now, there is the net_generic() function which, you guessed it right, uses
      "int" as an array index:
      
      	static inline void *net_generic(const struct net *net, int id)
      	{
      		...
      		ptr = ng->ptr[id - 1];
      		...
      	}
      
      And this function is used a lot, so those sign extensions add up.
      
      The patch snips ~1730 bytes off an allyesconfig kernel (without all the junk
      messing with code generation):
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      
      Unfortunately, some functions actually grow bigger.
      This is a seemingly random artefact of code generation, with the register
      allocator being used differently: gcc decides that some variable
      needs to live in the new r8+ registers, and every access now requires a REX
      prefix. Or it is shifted into r12, so the [r12+0] addressing mode has to be
      used, which is longer than [r8].
      
      However, the overall balance is in the negative direction:
      
      	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
      	function                                     old     new   delta
      	nfsd4_lock                                  3886    3959     +73
      	tipc_link_build_proto_msg                   1096    1140     +44
      	mac80211_hwsim_new_radio                    2776    2808     +32
      	tipc_mon_rcv                                1032    1058     +26
      	svcauth_gss_legacy_init                     1413    1429     +16
      	tipc_bcbase_select_primary                   379     392     +13
      	nfsd4_exchange_id                           1247    1260     +13
      	nfsd4_setclientid_confirm                    782     793     +11
      		...
      	put_client_renew_locked                      494     480     -14
      	ip_set_sockfn_get                            730     716     -14
      	geneve_sock_add                              829     813     -16
      	nfsd4_sequence_done                          721     703     -18
      	nlmclnt_lookup_host                          708     686     -22
      	nfsd4_lockt                                 1085    1063     -22
      	nfs_get_client                              1077    1050     -27
      	tcf_bpf_init                                1106    1076     -30
      	nfsd4_encode_fattr                          5997    5930     -67
      	Total: Before=154856051, After=154854321, chg -0.00%
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7d03a00
    • net: check dead netns for peernet2id_alloc() · cfc44a4d
      Authored by WANG Cong
      Andrei reports that we still allocate a netns ID from the idr after we
      destroy it in cleanup_net().
      
      cleanup_net():
        ...
        idr_destroy(&net->netns_ids);
        ...
        list_for_each_entry_reverse(ops, &pernet_list, list)
          ops_exit_list(ops, &net_exit_list);
            -> rollback_registered_many()
              -> rtmsg_ifinfo_build_skb()
               -> rtnl_fill_ifinfo()
                 -> peernet2id_alloc()
      
      After that point we should not even access net->netns_ids; we
      should check for the death of the current netns as early as we can in
      peernet2id_alloc().
      
      For net-next we can consider avoiding sending the rtmsg entirely;
      it is a good optimization for the netns teardown path.
      
      Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
      Reported-by: Andrei Vagin <avagin@gmail.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Andrei Vagin <avagin@openvz.org>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfc44a4d
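      The "check the death of the current netns" part boils down to an early bail-out at the top of peernet2id_alloc(); a fragment sketching it (the return value and the refcount accessor are assumptions, and kernels of that era would use atomic_read(&net->count)):
      	/* Sketch: refuse to touch net->netns_ids once the calling netns
      	 * is already dead, i.e. cleanup_net() has run idr_destroy().
      	 */
      	if (refcount_read(&net->count) == 0)
      		return NETNSA_NSID_NOT_ASSIGNED;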
  19. 24 Oct 2016 — 1 commit
  20. 23 Oct 2016 — 1 commit
  21. 24 Sep 2016 — 1 commit
    • netns: move {inc,dec}_net_namespaces into #ifdef · 2ed6afde
      Authored by Arnd Bergmann
      With the newly enforced limit on the number of namespaces,
      we get a build warning if CONFIG_NET_NS is disabled:
      
      net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function]
      net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function]
      
      This moves the two added functions inside the #ifdef that guards
      their callers.
      
      Fixes: 70328660 ("netns: Add a limit on the number of net namespaces")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      2ed6afde
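      Schematically, the fix just moves the two helpers inside the guard that already wraps their callers; a sketch, with the helper bodies assumed from the ucount-limit work referenced in the Fixes tag:
      	/* Sketch: only define the ucount helpers when netns support is
      	 * compiled in, so !CONFIG_NET_NS builds see no unused functions.
      	 */
      	#ifdef CONFIG_NET_NS
      	static struct ucounts *inc_net_namespaces(struct user_namespace *ns)
      	{
      		return inc_ucount(ns, current_euid(), UCOUNT_NET_NAMESPACES);
      	}
      
      	static void dec_net_namespaces(struct ucounts *ucounts)
      	{
      		dec_ucount(ucounts, UCOUNT_NET_NAMESPACES);
      	}
      	#endif /* CONFIG_NET_NS */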
  22. 23 Sep 2016 — 2 commits
  23. 05 Sep 2016 — 2 commits
  24. 02 Sep 2016 — 1 commit
  25. 15 Aug 2016 — 1 commit
    • netns: do not call pernet ops for not yet set up init_net namespace · f8c46cb3
      Authored by Dmitry Torokhov
      When CONFIG_NET_NS is disabled, registering pernet operations causes
      init() to be called immediately with init_net as an argument. Unfortunately
      this leads to some pernet ops, such as proc_net_ns_init(), being called too
      early, when the init_net namespace has not been fully initialized. This causes
      issues when we want to change pernet ops to use more data from the net
      namespace in question, for example to reference the user namespace that owns
      our network namespace.
      
      To fix this we could either play a game of musical chairs and rearrange the
      init order, or we could do the same as when CONFIG_NET_NS is enabled and
      postpone calling pernet ops->init() until the namespace is set up properly.
      
      Note that we can not simply undo commit ed160e83 ("[NET]: Cleanup
      pernet operation without CONFIG_NET_NS") and use the same implementations
      for __register_pernet_operations() and __unregister_pernet_operations(),
      because many pernet ops are marked as __net_initdata and will be discarded,
      which wreaks havoc on our ops lists. Here we rely on the fact that we only
      use lists until init_net is fully initialized, which happens much earlier
      than discarding __net_initdata sections.
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8c46cb3
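      A rough sketch of the !CONFIG_NET_NS registration path after this change, assuming a flag that records whether init_net has been set up (the flag and function shapes are assumptions based on the description above):
      	/* Sketch: queue pernet ops on a list until init_net is ready,
      	 * then run their init() for init_net directly.
      	 */
      	static bool init_net_initialized;	/* set once init_net is set up */
      
      	static int __register_pernet_operations(struct list_head *list,
      						struct pernet_operations *ops)
      	{
      		if (!init_net_initialized) {
      			list_add_tail(&ops->list, list);
      			return 0;
      		}
      
      		return ops_init(ops, &init_net);
      	}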
  26. 09 Aug 2016 — 1 commit
  27. 18 May 2015 — 1 commit
  28. 15 May 2015 — 1 commit
  29. 13 May 2015 — 1 commit
  30. 10 May 2015 — 2 commits
    • netlink: allow to listen "all" netns · 59324cf3
      Authored by Nicolas Dichtel
      More accurately, listen to all netns that have a nsid assigned in the netns
      where the netlink socket is opened.
      For this purpose, a netlink socket option is added:
      NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, the
      socket will receive netlink notifications from all netns that have a nsid
      assigned in the netns where the socket has been opened. The nsid is sent
      to userland via ancillary data.
      
      With this patch, a daemon needs only one socket to listen to many netns. This
      is useful when the number of netns is high.
      
      Because 0 is a valid value for a nsid, the field nsid_is_set indicates whether
      the field nsid is valid or not. skb->cb is initialized to 0 on skb
      allocation, thus we are sure that we will never send a nsid of 0 to
      userland by mistake.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59324cf3
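      A hedged userspace sketch of how the option could be consumed; the setsockopt level and option name follow the commit message, while the detail that the nsid arrives as a SOL_NETLINK/NETLINK_LISTEN_ALL_NSID control message is an assumption here.
      	/* Sketch: subscribe one netlink socket to events from all peer
      	 * netns and read the originating nsid from the ancillary data.
      	 */
      	#include <linux/netlink.h>
      	#include <sys/socket.h>
      	#include <stdio.h>
      
      	static void enable_all_nsid(int fd)
      	{
      		int on = 1;
      
      		if (setsockopt(fd, SOL_NETLINK, NETLINK_LISTEN_ALL_NSID,
      			       &on, sizeof(on)) < 0)
      			perror("NETLINK_LISTEN_ALL_NSID");
      	}
      
      	static void print_nsid(struct msghdr *msg)
      	{
      		struct cmsghdr *cmsg;
      
      		for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
      			if (cmsg->cmsg_level == SOL_NETLINK &&
      			    cmsg->cmsg_type == NETLINK_LISTEN_ALL_NSID)
      				printf("nsid %d\n", *(int *)CMSG_DATA(cmsg));
      		}
      	}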
    • netns: use a spin_lock to protect nsid management · 95f38411
      Authored by Nicolas Dichtel
      Before this patch, nsids were protected by the rtnl lock. The goal of this
      patch is to be able to find a nsid without needing to hold the rtnl lock.
      
      The next patch will introduce a netlink socket option to listen to all
      netns that have a nsid assigned in the netns where the socket is opened.
      Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
      avoid a recursive lock (nsids are notified via rtnl). This was the main
      reason for the previous patch.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95f38411