1. 21 May, 2019 1 commit
  2. 28 Apr, 2019 1 commit
    • netlink: make validation more configurable for future strictness · 8cb08174
      Committed by Johannes Berg
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
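The four options above behave like independent bits that can be OR-ed into presets. The following is a minimal user-space sketch of that idea, not the kernel code: the `VALIDATE_*` names, `struct toy_attr`, `struct toy_policy` and `toy_validate_attr()` are all hypothetical, and the TRAILING bit is only defined here because it concerns the message as a whole rather than a single attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag names modelled on the four options in the text. */
enum toy_validate_flags {
	VALIDATE_LIBERAL      = 0,
	VALIDATE_TRAILING     = 1 << 0, /* no trailing data after the attributes */
	VALIDATE_MAXTYPE      = 1 << 1, /* reject types above the known maximum */
	VALIDATE_UNSPEC       = 1 << 2, /* reject attributes without a policy entry */
	VALIDATE_STRICT_ATTRS = 1 << 3, /* attribute size must match exactly */
};

/* The old "strict" mode as described: TRAILING and MAXTYPE only. */
#define VALIDATE_DEPRECATED_STRICT (VALIDATE_TRAILING | VALIDATE_MAXTYPE)
/* "Everything", the intended default for new users. */
#define VALIDATE_STRICT (VALIDATE_TRAILING | VALIDATE_MAXTYPE | \
			 VALIDATE_UNSPEC | VALIDATE_STRICT_ATTRS)

struct toy_attr {
	unsigned int type;
	size_t len;
};

/* expected_len == 0 stands for an unspecified policy entry in this toy. */
struct toy_policy {
	size_t expected_len;
};

/* Returns 0 if the attribute is accepted, -1 if it is rejected. */
int toy_validate_attr(const struct toy_attr *attr,
		      const struct toy_policy *policy,
		      unsigned int maxtype, unsigned int flags)
{
	size_t expected;

	assert(attr != NULL && policy != NULL);

	if (attr->type > maxtype)
		return (flags & VALIDATE_MAXTYPE) ? -1 : 0;

	expected = policy[attr->type].expected_len;
	if (expected == 0)
		return (flags & VALIDATE_UNSPEC) ? -1 : 0;

	if (attr->len < expected)
		return -1; /* too short is rejected in every mode */
	if (attr->len > expected && (flags & VALIDATE_STRICT_ATTRS))
		return -1; /* longer than expected is only tolerated when not strict */
	return 0;
}
```

With a policy that only describes type 1, an attribute of type 5 passes under `VALIDATE_LIBERAL` and is rejected under `VALIDATE_DEPRECATED_STRICT`, while an over-long type 1 attribute is rejected only under `VALIDATE_STRICT`.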
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
       * nlmsg_validate      -> nlmsg_validate_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
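One way to picture the shared helper: validation and parsing walk the attributes identically, and parsing additionally fills a table indexed by type. The sketch below is a user-space illustration under that assumption; `struct toy_attr` and the `toy_*` functions are hypothetical and do not mirror the kernel signatures.

```c
#include <assert.h>
#include <stddef.h>

struct toy_attr {
	unsigned int type;
	int value;
};

/*
 * Shared worker: checks every attribute and, when tb is non-NULL,
 * also records a pointer to each attribute indexed by its type.
 */
static int toy_validate_parse(const struct toy_attr *attrs, size_t n,
			      unsigned int maxtype,
			      const struct toy_attr **tb)
{
	size_t i;

	if (tb)
		for (i = 0; i <= maxtype; i++)
			tb[i] = NULL;

	for (i = 0; i < n; i++) {
		if (attrs[i].type > maxtype)
			return -1; /* this toy always rejects unknown types */
		if (tb)
			tb[attrs[i].type] = &attrs[i];
	}
	return 0;
}

/* Validation is the worker without an output table. */
int toy_validate(const struct toy_attr *attrs, size_t n, unsigned int maxtype)
{
	return toy_validate_parse(attrs, n, maxtype, NULL);
}

/* Parsing is the same walk with an output table of maxtype + 1 slots. */
int toy_parse(const struct toy_attr **tb, unsigned int maxtype,
	      const struct toy_attr *attrs, size_t n)
{
	assert(tb != NULL);
	return toy_validate_parse(attrs, n, maxtype, tb);
}
```

Keeping a single walk means a change to the acceptance rules cannot make the two entry points disagree.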
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 12 Apr, 2019 1 commit
  4. 29 Mar, 2019 1 commit
  5. 20 Jan, 2019 1 commit
  6. 25 Dec, 2018 1 commit
  7. 28 Nov, 2018 5 commits
  8. 09 Oct, 2018 1 commit
  9. 22 Aug, 2018 1 commit
  10. 21 Jul, 2018 1 commit
  11. 01 Apr, 2018 1 commit
    • net: Do not take net_rwsem in __rtnl_link_unregister() · 554873e5
      Committed by Kirill Tkhai
      This function calls call_netdevice_notifier(), which also
      may take net_rwsem. So, we can't use net_rwsem here.
      
      This patch makes callers of this function take pernet_ops_rwsem,
      like register_netdevice_notifier() does. This will protect
      the modifications of net_namespace_list, and allows notifiers
      to take it (they won't have to care about context).
      
      Since __rtnl_link_unregister() is used on module load
      and unload (which are not frequent operations), this looks
      better to me than making every call_netdevice_notifier()
      always execute in "protected net_namespace_list" context.
      
      Also, this fixes the problem we dealt with in 328fbe74
      "Close race between {un, }register_netdevice_notifier and ...",
      and guarantees __rtnl_link_unregister() does not skip
      an exiting net.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 30 Mar, 2018 1 commit
    • net: Introduce net_rwsem to protect net_namespace_list · f0b07bb1
      Committed by Kirill Tkhai
      rtnl_lock() is used everywhere, and contention is very high.
      When someone wants to iterate over alive net namespaces,
      there is no way to do that without an exclusive lock.
      But the exclusive rtnl_lock() in such places is overkill,
      and it just increases the contention. Yes, there is already
      for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
      so the iteration can't sleep. Also, sometimes it may be necessary
      to really prevent net_namespace_list growth, so for_each_net_rcu()
      does not fit there.
      
      This patch introduces a new rw_semaphore, which will be used
      instead of rtnl_mutex to protect net_namespace_list. It is
      sleepable and allows non-exclusive iteration over the net
      namespaces list. It allows us to stop using rtnl_lock()
      in several places (which is done in the next patches) and reduces
      the time we hold rtnl_mutex. Here we just add the new lock,
      while the explanations of why we can remove rtnl_lock() there are
      in the next patches.
      
      Fine-grained locks are generally better than one big lock,
      so let's do that with net_namespace_list, while the situation
      allows that.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 28 Mar, 2018 4 commits
  14. 23 Mar, 2018 1 commit
  15. 07 Mar, 2018 1 commit
    • net: Make account struct net to memcg · 30855ffc
      Committed by Kirill Tkhai
      The patch adds SLAB_ACCOUNT to flags of net_cachep cache,
      which enables accounting of struct net memory to memcg kmem.
      Since the number of net namespaces may be significant, users
      want to know how much memory they consume, and to control it.
      
      Note that we do not account net_generic to the same memcg
      where net was accounted; moreover, we don't account it at all (*).
      We do not want a situation where a memory deficit in a single memcg
      prevents us from registering new pernet_operations.
      
      (*) This holds even though !current process accounting is already
      available in linux-next. See kmalloc_memcg() there for the details.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 27 Feb, 2018 1 commit
  17. 21 Feb, 2018 3 commits
  18. 13 Feb, 2018 7 commits
    • net: Convert net_defaults_ops · ff291d00
      Committed by Kirill Tkhai
      net_defaults_ops introduces only the net_defaults_init_net method,
      and it acts on net::core::sysctl_somaxconn, which
      is not interesting for the rest of the pernet_subsys and
      pernet_device lists. So, make them async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Convert net_ns_ops methods · 3fc3b827
      Committed by Kirill Tkhai
      This patch starts to convert the pernet_subsys operations registered
      from pure initcalls.
      
      The net_ns_ops::net_ns_net_init/net_ns_net_exit methods use only
      ida_simple_* functions, which do not need synchronization.
      They are synchronized by the idr subsystem.
      
      So, net_ns_ops methods are able to be executed
      in parallel with methods of other pernet operations.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Allow pernet_operations to be executed in parallel · 447cd7a0
      Committed by Kirill Tkhai
      This adds a new pernet_operations::async flag to mark operations
      whose ->init(), ->exit() and ->exit_batch() methods are allowed
      to be executed in parallel with the methods of any other pernet_operations.
      
      When there are only asynchronous pernet_operations in the system,
      net_mutex won't be taken for a net construction and destruction.
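The rule in the paragraph above can be modelled as a counter of registered operations that are not async: the global mutex is needed only while that counter is non-zero. This is a hedged user-space sketch; `struct toy_pernet_ops`, `toy_nr_sync_ops` and the `toy_*` functions are hypothetical names, and real code would also need to protect the counter itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for pernet_operations: only the flag matters here. */
struct toy_pernet_ops {
	const char *name;
	bool async; /* methods may run in parallel with any other ops */
};

/* Number of registered operations that still require the global mutex. */
static unsigned int toy_nr_sync_ops;

void toy_register_ops(const struct toy_pernet_ops *ops)
{
	assert(ops != NULL);
	if (!ops->async)
		toy_nr_sync_ops++;
}

void toy_unregister_ops(const struct toy_pernet_ops *ops)
{
	assert(ops != NULL);
	if (!ops->async) {
		assert(toy_nr_sync_ops > 0);
		toy_nr_sync_ops--;
	}
}

/* Would net construction or destruction have to take the global mutex now? */
bool toy_need_global_mutex(void)
{
	return toy_nr_sync_ops != 0;
}
```

As long as every registered operation sets `async`, `toy_need_global_mutex()` stays false; registering a single non-async operation flips it to true until that operation is unregistered.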
      
      Also, remove BUG_ON(mutex_is_locked()) from net_assign_generic()
      without replacing with the equivalent net_sem check, as there is
      one more lockdep assert below.
      
      v3: Add comment near net_mutex.
      Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Move mutex_unlock() in cleanup_net() up · bcab1ddd
      Committed by Kirill Tkhai
      net_sem protects pernet_list from changing, while
      ops_free_list() does a simple kfree(), and it can't
      race with other pernet_operations callbacks.
      
      So we may release net_mutex earlier than before.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Introduce net_sem for protection of pernet_list · 1a57feb8
      Committed by Kirill Tkhai
      Currently, the mutex is mostly used to protect pernet operations
      list. It orders setup_net() and cleanup_net() with parallel
      {un,}register_pernet_operations() calls, so ->exit{,batch} methods
      of the same pernet operations are executed for a dying net, as
      were used to call ->init methods, even after the net namespace
      is unlinked from net_namespace_list in cleanup_net().
      
      But there are several problems with scalability. The first one
      is that more than one net can't be created or destroyed
      at the same moment on the node. For big machines with many cpus
      running many containers this is very noticeable.
      
      The second one is that synchronize_rcu() is needed after a net
      is removed from net_namespace_list:
      
      Destroy net_ns:
      cleanup_net()
        mutex_lock(&net_mutex)
        list_del_rcu(&net->list)
        synchronize_rcu()                                  <--- Sleep there for ages
        list_for_each_entry_reverse(ops, &pernet_list, list)
          ops_exit_list(ops, &net_exit_list)
        list_for_each_entry_reverse(ops, &pernet_list, list)
          ops_free_list(ops, &net_exit_list)
        mutex_unlock(&net_mutex)
      
      This primitive is not fast, especially on systems with many processors
      and/or when preemptible RCU is enabled in the config. So, for the whole time
      cleanup_net() is waiting for an RCU grace period, creation of new net namespaces
      is not possible; the tasks that attempt it sleep on the same mutex:
      
      Create net_ns:
      copy_net_ns()
        mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages
      
      I observed 20-30 second hangs of "unshare -n" on an ordinary 8-cpu laptop
      with preemptible RCU enabled after a round of CRIU tests had finished.
      
      The solution is to convert net_mutex to an rw_semaphore and add fine-grained
      locks to the really small number of pernet_operations that actually need them.
      
      Then, pernet_operations::init/::exit methods, modifying the net-related data,
      will require down_read() locking only, while down_write() will be used
      for changing pernet_list (i.e., when modules are being loaded and unloaded).
      
      This gives a significant performance increase after the whole patch set
      is applied, as you may see here:
      
      %for i in {1..10000}; do unshare -n bash -c exit; done
      
      *before*
      real 1m40,377s
      user 0m9,672s
      sys 0m19,928s
      
      *after*
      real 0m17,007s
      user 0m5,311s
      sys 0m11,779s
      
      (about 5.9 times faster in real time)
      
      This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore,
      describes the variables it protects, and uses it where appropriate.
      net_mutex is still present, and the next patches will kick it out step by step.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Cleanup in copy_net_ns() · 5ba049a5
      Committed by Kirill Tkhai
      Line up the destructor actions in the reverse order
      of the constructors. The next patches will add more actions,
      and having this order in place will make that
      convenient.
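Undoing steps in the reverse order of setup is commonly written in C as a chain of goto labels, where each label undoes one step and falls through to the earlier ones. The sketch below illustrates only that pattern; `toy_copy()`, its `fail_at` parameter and the trace strings are hypothetical and unrelated to the real copy_net_ns() body.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append a step marker to the trace without overflowing it. */
static void note(char *trace, size_t cap, const char *step)
{
	strncat(trace, step, cap - strlen(trace) - 1);
}

/*
 * Runs setup steps A, B, C in order. fail_at selects which step fails
 * (1 = A, 2 = B, 3 = C, 0 = none). On failure the completed steps are
 * undone in reverse order through the label chain.
 */
int toy_copy(int fail_at, char *trace, size_t cap)
{
	assert(trace != NULL && cap > 0);
	trace[0] = '\0';

	if (fail_at == 1)
		goto out;
	note(trace, cap, "+A");

	if (fail_at == 2)
		goto undo_a;
	note(trace, cap, "+B");

	if (fail_at == 3)
		goto undo_b;
	note(trace, cap, "+C");
	return 0;

undo_b:
	note(trace, cap, "-B");
undo_a:
	note(trace, cap, "-A");
out:
	return -1;
}
```

Adding a new setup step then means adding one label at the matching position in the chain, which is why a consistent order pays off when more actions are added later.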
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Assign net to net_namespace_list in setup_net() · 98f6c533
      Committed by Kirill Tkhai
      This patch merges two repeating pieces of code in one,
      and they will live in setup_net() now.
      
      The only change is that assignment:
      
      	init_net_initialized = true;
      
      becomes reordered with:
      
      	list_add_tail_rcu(&net->list, &net_namespace_list);
      
      The order does not have a visible effect, and it is a simple
      cleanup, because:
      
      init_net_initialized is used in the !CONFIG_NET_NS case
      to order proc_net_ns_ops registration occurring at boot time:
      
      	start_kernel()->proc_root_init()->proc_net_init(),
      with
      	net_ns_init()->setup_net(&init_net, &init_user_ns)
      
      also occurring at boot time from the same init_task.
      
      When there are no other tasks to race with them,
      it does not matter to the single task in which order
      two sequential independent loads are made.
      So we reorder them.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 26 Jan, 2018 1 commit
    • net: Move net:netns_ids destruction out of rtnl_lock() and document locking scheme · fb07a820
      Committed by Kirill Tkhai
      Currently, we unhash a dying net from netns_ids lists
      under rtnl_lock(). It's a leftover from the time when
      net::netns_ids was introduced. There was no net::nsid_lock,
      and rtnl_lock() was mostly needed to order modification
      of alive nets' nsid idr, i.e. for:
      	for_each_net(tmp) {
      		...
      		id = __peernet2id(tmp, net);
      		idr_remove(&tmp->netns_ids, id);
      		...
      	}
      
      Since we have net::nsid_lock, the modifications are
      protected by this local lock, and now we may introduce
      a better scheme of netns_ids destruction.
      
      Let's look at the functions peernet2id_alloc() and
      get_net_ns_by_id(). Previous commits taught these
      functions to work well with a dying net acquired from
      rtnl-unlocked lists. And they are the only functions
      which can hash a net to netns_ids or obtain one from there.
      And, as is easy to check, the other functions operating on
      netns_ids work with ids, not with net pointers. So, we do not
      need rtnl_lock to synchronize cleanup_net() with any of them.
      
      Another property used in the patch
      is that a net is unhashed from net_namespace_list
      in only one place and by only one process. So
      we avoid an excess rcu_read_lock() or rtnl_lock()
      when we are iterating over the list in unhash_nsid().
      
      All of the above makes it possible to keep rtnl_lock() held
      only for net->list deletion, and completely avoid it
      for netns_ids unhashing and destruction. As these two
      actions may take a long time (e.g., memory allocation
      to send an skb), the patch should have a positive effect on
      scalability and significantly decrease the time for which
      rtnl_lock() is held in cleanup_net().
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 18 Jan, 2018 2 commits
    • net: Remove spinlock from get_net_ns_by_id() · 42157277
      Committed by Kirill Tkhai
      idr_find() is safe under rcu_read_lock() and
      maybe_get_net() guarantees that net is alive.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Fix possible race in peernet2id_alloc() · 0c06bea9
      Committed by Kirill Tkhai
      peernet2id_alloc() is racy without rtnl_lock(), as refcount_read(&peer->count)
      under net->nsid_lock does not guarantee that peer is alive:
      
      rcu_read_lock()
      peernet2id_alloc()                            ..
        spin_lock_bh(&net->nsid_lock)               ..
        refcount_read(&peer->count) (!= 0)          ..
        ..                                          put_net()
        ..                                            cleanup_net()
        ..                                              for_each_net(tmp)
        ..                                                spin_lock_bh(&tmp->nsid_lock)
        ..                                                __peernet2id(tmp, net) == -1
        ..                                                    ..
        ..                                                    ..
          __peernet2id_alloc(alloc == true)                   ..
        ..                                                    ..
      rcu_read_unlock()                                       ..
      ..                                                synchronize_rcu()
      ..                                                kmem_cache_free(net)
      
      After the above situation, net::netns_ids contains an id pointing to freed memory,
      and any later dereference by that id will operate on this freed memory.
      
      Currently, peernet2id_alloc() is used under rtnl_lock() everywhere except
      ovs_vport_cmd_fill_info(), and this race can't occur. But peernet2id_alloc()
      is a generic interface, and we had better fix it before someone really starts
      to use it in the wrong context.
      
      v2: Don't place refcount_read(&net->count) under net->nsid_lock
          as suggested by Eric W. Biederman <ebiederm@xmission.com>
      v3: Rebase on top of net-next
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 16 Jan, 2018 1 commit
  22. 21 Dec, 2017 1 commit
    • net: Fix double free and memory corruption in get_net_ns_by_id() · 21b59443
      Committed by Eric W. Biederman
      (I can trivially verify that that idr_remove in cleanup_net happens
       after the network namespace count has dropped to zero --EWB)
      
      Function get_net_ns_by_id() does not check for net::count
      after it has found a peer in netns_ids idr.
      
      It may dereference a peer after its count has already been
      finally decremented. This leads to a double free and memory
      corruption:
      
      put_net(peer)                                   rtnl_lock()
      atomic_dec_and_test(&peer->count) [count=0]     ...
      __put_net(peer)                                 get_net_ns_by_id(net, id)
        spin_lock(&cleanup_list_lock)
        list_add(&net->cleanup_list, &cleanup_list)
        spin_unlock(&cleanup_list_lock)
      queue_work()                                      peer = idr_find(&net->netns_ids, id)
        |                                               get_net(peer) [count=1]
        |                                               ...
        |                                               (use after final put)
        v                                               ...
        cleanup_net()                                   ...
          spin_lock(&cleanup_list_lock)                 ...
          list_replace_init(&cleanup_list, ..)          ...
          spin_unlock(&cleanup_list_lock)               ...
          ...                                           ...
          ...                                           put_net(peer)
          ...                                             atomic_dec_and_test(&peer->count) [count=0]
          ...                                               spin_lock(&cleanup_list_lock)
          ...                                               list_add(&net->cleanup_list, &cleanup_list)
          ...                                               spin_unlock(&cleanup_list_lock)
          ...                                             queue_work()
          ...                                           rtnl_unlock()
          rtnl_lock()                                   ...
          for_each_net(tmp) {                           ...
            id = __peernet2id(tmp, peer)                ...
            spin_lock_irq(&tmp->nsid_lock)              ...
            idr_remove(&tmp->netns_ids, id)             ...
            ...                                         ...
            net_drop_ns()                               ...
              net_free(peer)                            ...
          }                                             ...
        |
        v
        cleanup_net()
          ...
          (Second free of peer)
      
      Also, put_net() on the right cpu may be reordered with the left cpu's
      list_replace_init(&cleanup_list, ..), and then cleanup_list
      will be corrupted.
      
      Since cleanup_net() is executed in a worker thread, while
      put_net(peer) can happen anywhere, there should be
      enough time for a concurrent get_net_ns_by_id() to pick
      the peer up, and the race does not seem to be unlikely.
      The patch fixes the problem in the standard way.
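A usual remedy for this class of bug is to take a reference only if the count is still non-zero, so that an object whose final put has already happened can never be revived; commit 42157277 in this list names maybe_get_net() in that role. The sketch below shows the pattern with C11 atomics. It is an assumption about the general technique, not the kernel implementation, and `struct toy_net`, `toy_maybe_get()` and `toy_put()` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_net {
	atomic_int count; /* 0 means the final put has already happened */
};

/* Take a reference only while the object is still alive. */
bool toy_maybe_get(struct toy_net *net)
{
	int old;

	assert(net != NULL);
	old = atomic_load(&net->count);
	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&net->count, &old, old + 1))
			return true;
	}
	return false; /* too late: never go from 0 back to 1 */
}

/* Drop a reference; returns true when this was the final put. */
bool toy_put(struct toy_net *net)
{
	assert(net != NULL);
	return atomic_fetch_sub(&net->count, 1) == 1;
}
```

A lookup that finds the peer in the idr would call `toy_maybe_get()` and treat a false return as "not found", instead of incrementing unconditionally as in the trace above.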
      
      (Also, there is a possible problem in peernet2id_alloc(), which requires
      a check for net::count under nsid_lock and maybe_get_net(peer), but
      in the current stable kernel it's used under rtnl_lock() and it has to be
      safe. Open vSwitch has begun to use peernet2id_alloc(), and possibly it should
      be fixed too. That is not in the stable kernel yet, so I'll send
      a separate message to netdev@ later).
      
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Fixes: 0c7aecd4 "netns: add rtnl cmd to add and get peer netns ids"
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 05 Nov, 2017 1 commit
  24. 10 Aug, 2017 1 commit