1. 28 Mar, 2020 1 commit
    • D
      bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Daniel Borkmann committed
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
      between Cilium managed nodes. While this works in its current shape and avoids
      packet-level NAT for inter Cilium managed node traffic, there is one major
      limitation we're facing today, that is, lack of netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces, so while we can use the global scope
      of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
      NodePort ports on loopback addresses), we also have the need to differentiate
      between the initial network namespace and non-initial ones. For example, ExternalIP
      services mandate that non-local service IPs are not to be translated from the
      host (initial) network namespace. Right now, we have an ugly
      work-around in place where non-local service IPs for ExternalIP services are
      not xlated from connect() and friends BPF hooks but instead via less efficient
      packet-level NAT on the veth tc ingress hook for Pod traffic.
      
      On top of determining whether we're in initial or non-initial network namespace
      we also have a need for a socket-cookie like mechanism for network namespaces
      scope. Socket cookies have the nice property that they can be combined as part
      of the key structure e.g. for BPF LRU maps without having to worry that the
      cookie could be recycled. We are planning to use this for our sessionAffinity
      implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
      which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
      provide the cookie for the initial network namespace while passing the context
      instead of NULL would provide the cookie from the application's network namespace.
      We're using a hole, so no size increase; the assignment happens only once.
      Therefore this allows for a comparison on initial namespace as well as regular
      cookie usage as we have today with socket cookies. We could later on enable
      this helper for other program types as well as we would see need.
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
      f318903c
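The cookie semantics described in this commit message can be illustrated with a small userspace model. This is only a sketch, not the kernel implementation: `struct model_net`, `struct model_ctx`, `model_gen_cookie()`, `model_get_netns_cookie()` and `model_in_init_netns()` are hypothetical names invented for illustration, and the model is single-threaded, whereas the kernel has to make the one-time assignment safe against concurrent callers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net: only the cookie is modelled. */
struct model_net {
    uint64_t cookie;              /* 0 means "not assigned yet" */
};

/* Hypothetical stand-in for the BPF program context. */
struct model_ctx {
    struct model_net *net;        /* namespace of the calling application */
};

static struct model_net model_init_net;   /* models the initial namespace */
static uint64_t model_cookie_next;        /* never reused, so cookies are unique */

/* Assign a cookie on first use; later calls return the same value. */
static uint64_t model_gen_cookie(struct model_net *net)
{
    if (net->cookie == 0)
        net->cookie = ++model_cookie_next;
    return net->cookie;
}

/* NULL context: cookie of the initial namespace; otherwise the caller's. */
static uint64_t model_get_netns_cookie(const struct model_ctx *ctx)
{
    return model_gen_cookie(ctx ? ctx->net : &model_init_net);
}

/* Detecting the host namespace then needs a single comparison. */
static int model_in_init_netns(const struct model_ctx *ctx)
{
    return model_get_netns_cookie(ctx) == model_get_netns_cookie(NULL);
}
```

Because the counter only grows, a cookie value is never handed out twice in this model, which is the property that makes it usable as part of a map key.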
  2. 17 Jan, 2020 1 commit
  3. 15 Jan, 2020 3 commits
    • G
      netns: don't disable BHs when locking "nsid_lock" · 8d7e5dee
      Guillaume Nault committed
      When peernet2id() had to lock "nsid_lock" before iterating through the
      nsid table, we had to disable BHs, because VXLAN can call peernet2id()
      from the xmit path:
        vxlan_xmit() -> vxlan_fdb_miss() -> vxlan_fdb_notify()
          -> __vxlan_fdb_notify() -> vxlan_fdb_info() -> peernet2id().
      
      Now that peernet2id() uses RCU protection, "nsid_lock" isn't used in BH
      context anymore. Therefore, we can safely use plain
      spin_lock()/spin_unlock() and let BHs run when holding "nsid_lock".
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d7e5dee
    • G
      netns: protect netns ID lookups with RCU · 2dce224f
      Guillaume Nault committed
      __peernet2id() can be protected by RCU as it only calls idr_for_each(),
      which is RCU-safe, and never modifies the nsid table.
      
      rtnl_net_dumpid() can also do lockless lookups. It does two nested
      idr_for_each() calls on nsid tables (one direct call and one indirect
      call because of rtnl_net_dumpid_one() calling __peernet2id()). The
      netnsid tables are never updated. Therefore it is safe to not take the
      nsid_lock and run within an RCU-critical section instead.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2dce224f
    • G
      netns: Remove __peernet2id_alloc() · 49052941
      Guillaume Nault committed
      __peernet2id_alloc() was used for both plain lookups and for netns ID
      allocations (depending on the value of '*alloc'). Let's separate lookups
      from allocations instead. That is, integrate the lookup code into
      __peernet2id() and make peernet2id_alloc() responsible for allocating
      new netns IDs when necessary.
      
      This makes it clear that __peernet2id() doesn't modify the idr and
      prepares the code for lockless lookups.
      
      Also, mark the 'net' argument of __peernet2id() as 'const', since we're
      modifying this line.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49052941
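The split between lookup and allocation described in this commit can be sketched in userspace C. This is an illustrative model only: the table layout, `MODEL_NSID_NOT_ASSIGNED`, `model_peernet2id()` and `model_peernet2id_alloc()` are hypothetical and use a flat array instead of the kernel's idr. The point is the shape of the two functions: the lookup takes a `const` table and cannot modify it, while the allocation path looks up first and allocates only when nothing is found.

```c
#include <stddef.h>

#define MODEL_NSID_NOT_ASSIGNED (-1)
#define MODEL_MAX_PEERS 8

/* Hypothetical ID table: the slot index doubles as the netns ID. */
struct model_nsid_table {
    const void *peer[MODEL_MAX_PEERS];
    int used;
};

/* Pure lookup: the const qualifier documents that nothing is modified. */
static int model_peernet2id(const struct model_nsid_table *t, const void *peer)
{
    for (int id = 0; id < t->used; id++)
        if (t->peer[id] == peer)
            return id;
    return MODEL_NSID_NOT_ASSIGNED;
}

/* Allocation path: reuse an existing ID, otherwise hand out a new one. */
static int model_peernet2id_alloc(struct model_nsid_table *t, const void *peer)
{
    int id = model_peernet2id(t, peer);

    if (id != MODEL_NSID_NOT_ASSIGNED)
        return id;
    if (t->used == MODEL_MAX_PEERS)
        return MODEL_NSID_NOT_ASSIGNED;   /* table full */
    t->peer[t->used] = peer;
    return t->used++;
}
```

Keeping the read-only path free of any write is what makes it a candidate for lockless readers later on.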
  4. 26 Oct, 2019 1 commit
    • G
      netns: fix GFP flags in rtnl_net_notifyid() · d4e4fdf9
      Guillaume Nault committed
      In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
      rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
      but there are a few paths calling rtnl_net_notifyid() from atomic
      context or from RCU critical sections. The latter also precludes the use
      of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
      call is wrong too, as it uses GFP_KERNEL unconditionally.
      
      Therefore, we need to pass the GFP flags as parameter and propagate it
      through function calls until the proper flags can be determined.
      
      In most cases, GFP_KERNEL is fine. The exceptions are:
        * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
          indirectly call rtnl_net_notifyid() from RCU critical section,
      
        * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
          parameter.
      
      Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
      by nlmsg_new(). The function is allowed to sleep, so better make the
      flags consistent with the ones used in the following
      ovs_vport_cmd_fill_info() call.
      
      Found by code inspection.
      
      Fixes: 9a963454 ("netns: notify netns id events")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4e4fdf9
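The idea of passing allocation flags down from the caller, instead of guessing them at the allocation site, can be modelled in a few lines of userspace C. Everything here is hypothetical (`enum model_gfp`, the simulated `model_in_rcu_section` flag, and all function names); the sketch only shows why the notifier takes the flags as a parameter: it cannot tell which context it runs in, but its callers can.

```c
#include <stdbool.h>

/* Hypothetical allocation modes: ATOMIC never sleeps, KERNEL may sleep. */
enum model_gfp { MODEL_GFP_ATOMIC, MODEL_GFP_KERNEL };

static bool model_in_rcu_section;   /* simulated execution context */
static int model_sleep_bugs;        /* sleeping allocations made in atomic context */

/* Simulated allocation: records a bug if it could sleep where it must not. */
static void model_alloc_msg(enum model_gfp gfp)
{
    if (gfp == MODEL_GFP_KERNEL && model_in_rcu_section)
        model_sleep_bugs++;
}

/* The notifier does not pick flags itself; it propagates the caller's. */
static void model_net_notifyid(enum model_gfp gfp)
{
    model_alloc_msg(gfp);
}

/* A caller in process context is allowed to sleep. */
static void model_caller_process_context(void)
{
    model_net_notifyid(MODEL_GFP_KERNEL);
}

/* A caller inside an RCU critical section must not sleep. */
static void model_caller_rcu(void)
{
    model_in_rcu_section = true;
    model_net_notifyid(MODEL_GFP_ATOMIC);
    model_in_rcu_section = false;
}
```

In this model, hard-coding `MODEL_GFP_KERNEL` inside `model_net_notifyid()` would make `model_caller_rcu()` trip the bug counter.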
  5. 25 Oct, 2019 1 commit
    • T
      keys: Fix memory leak in copy_net_ns · 82ecff65
      Takeshi Misawa committed
      If copy_net_ns() failed after net_alloc(), net->key_domain is leaked.
      Fix this, by freeing key_domain in error path.
      
      syzbot report:
      BUG: memory leak
      unreferenced object 0xffff8881175007e0 (size 32):
        comm "syz-executor902", pid 7069, jiffies 4294944350 (age 28.400s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000a83ed741>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<00000000a83ed741>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000a83ed741>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000a83ed741>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<0000000059fc92b9>] kmalloc include/linux/slab.h:547 [inline]
          [<0000000059fc92b9>] kzalloc include/linux/slab.h:742 [inline]
          [<0000000059fc92b9>] net_alloc net/core/net_namespace.c:398 [inline]
          [<0000000059fc92b9>] copy_net_ns+0xb2/0x220 net/core/net_namespace.c:445
          [<00000000a9d74bbc>] create_new_namespaces+0x141/0x2a0 kernel/nsproxy.c:103
          [<000000008047d645>] unshare_nsproxy_namespaces+0x7f/0x100 kernel/nsproxy.c:202
          [<000000005993ea6e>] ksys_unshare+0x236/0x490 kernel/fork.c:2674
          [<0000000019417e75>] __do_sys_unshare kernel/fork.c:2742 [inline]
          [<0000000019417e75>] __se_sys_unshare kernel/fork.c:2740 [inline]
          [<0000000019417e75>] __x64_sys_unshare+0x16/0x20 kernel/fork.c:2740
          [<00000000f4c5f2c8>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
          [<0000000038550184>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      syzbot also reported another leak in copy_net_ns -> setup_net.
      This problem is already fixed by cf47a0b8.
      
      Fixes: 9b242610 ("keys: Network namespace domain tag")
      Reported-and-tested-by: syzbot+3b3296d032353c33184b@syzkaller.appspotmail.com
      Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      82ecff65
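The class of bug fixed here, a sub-allocation that is not released on the error path, can be reproduced in a small userspace model. The names below (`struct model_net`, `model_net_alloc()`, `model_copy_net_ns()`, the `setup_ok` parameter and the `model_live_tags` counter) are hypothetical and only mirror the structure described in the commit message, not the kernel code.

```c
#include <stdlib.h>

struct model_key_tag { int usage; };
struct model_net { struct model_key_tag *key_domain; };

static int model_live_tags;   /* outstanding key_domain allocations */

/* Allocates the namespace together with its key domain tag. */
static struct model_net *model_net_alloc(void)
{
    struct model_net *net = calloc(1, sizeof(*net));

    if (!net)
        return NULL;
    net->key_domain = calloc(1, sizeof(*net->key_domain));
    if (!net->key_domain) {
        free(net);
        return NULL;
    }
    model_live_tags++;
    return net;
}

/* setup_ok simulates whether a later setup step succeeds. */
static struct model_net *model_copy_net_ns(int setup_ok)
{
    struct model_net *net = model_net_alloc();

    if (!net)
        return NULL;
    if (!setup_ok)
        goto err_free;
    return net;

err_free:
    /* Without these two lines the tag would leak on failure. */
    free(net->key_domain);
    model_live_tags--;
    free(net);
    return NULL;
}
```

On success the caller owns both allocations and has to release them when the namespace goes away.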
  6. 10 Oct, 2019 1 commit
  7. 27 Jun, 2019 1 commit
    • D
      keys: Network namespace domain tag · 9b242610
      David Howells committed
      Create key domain tags for network namespaces and make it possible to
      automatically tag keys that are used by networked services (e.g. AF_RXRPC,
      AFS, DNS) with the default network namespace if not set by the caller.
      
      This allows keys with the same description but in different namespaces to
      coexist within a keyring.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: netdev@vger.kernel.org
      cc: linux-nfs@vger.kernel.org
      cc: linux-cifs@vger.kernel.org
      cc: linux-afs@lists.infradead.org
      9b242610
  8. 23 Jun, 2019 1 commit
  9. 19 Jun, 2019 1 commit
    • E
      netns: add pre_exit method to struct pernet_operations · d7d99872
      Eric Dumazet committed
      Current struct pernet_operations exit() handlers are strongly
      discouraged from calling synchronize_rcu().
      
      There are cases where we need them, and exit_batch() does
      not help the common case where a single netns is dismantled.
      
      This patch leverages the existing synchronize_rcu() call
      in cleanup_net().
      
      Calling the optional ->pre_exit() method before ->exit() or
      ->exit_batch() allows all handlers to benefit from a single
      synchronize_rcu() call.
      
      Note that the synchronize_rcu() calls added in this patch
      are only in error paths or slow paths.
      
      Tested:
      
      $ time for i in {1..1000}; do unshare -n /bin/false;done
      
      real	0m2.612s
      user	0m0.171s
      sys	0m2.216s
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7d99872
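The two-phase teardown described in this commit can be sketched as a userspace model. The names (`struct model_pernet_ops`, `model_cleanup_net()`, the demo callbacks and the event log) are hypothetical, the grace period is only simulated by a log entry, and the iteration order is simplified. What the sketch shows is the ordering: all `pre_exit` callbacks run first, then one shared grace period, then all `exit` callbacks.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-subsystem callbacks; both are optional. */
struct model_pernet_ops {
    void (*pre_exit)(void);   /* unpublish pointers, no freeing yet */
    void (*exit)(void);       /* free, after the grace period */
};

/* Event log: 'p' = pre_exit ran, 's' = grace period, 'x' = exit ran. */
static char model_log[32];
static size_t model_log_len;

static void model_record(char c)
{
    if (model_log_len < sizeof(model_log) - 1)
        model_log[model_log_len++] = c;
}

/* Stand-in for a grace period; the real one waits for readers. */
static void model_synchronize_rcu(void) { model_record('s'); }

static void model_demo_pre_exit(void) { model_record('p'); }
static void model_demo_exit(void)     { model_record('x'); }

static void model_cleanup_net(const struct model_pernet_ops *ops, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)          /* phase 1: every pre_exit */
        if (ops[i].pre_exit)
            ops[i].pre_exit();

    model_synchronize_rcu();         /* one grace period for everyone */

    for (i = 0; i < n; i++)          /* phase 2: every exit */
        if (ops[i].exit)
            ops[i].exit();
}
```

If each `exit` handler waited for its own grace period instead, the number of waits would grow with the number of registered subsystems.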
  10. 21 May, 2019 1 commit
  11. 28 Apr, 2019 1 commit
    • J
      netlink: make validation more configurable for future strictness · 8cb08174
      Johannes Berg committed
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8cb08174
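The strictness options listed in this commit message behave like independent bits that a caller can combine. The following userspace sketch models that idea for a single attribute. The `MODEL_VALIDATE_*` names, `struct model_attr`, `struct model_policy` and `model_validate_attr()` are hypothetical and much simpler than real netlink parsing; the trailing-data check works on whole messages and is only represented by its flag here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical strictness bits, following the four options in the text. */
enum model_validate {
    MODEL_VALIDATE_LIBERAL      = 0,
    MODEL_VALIDATE_TRAILING     = 1 << 0,   /* message level, not modelled below */
    MODEL_VALIDATE_MAXTYPE      = 1 << 1,
    MODEL_VALIDATE_UNSPEC       = 1 << 2,
    MODEL_VALIDATE_STRICT_ATTRS = 1 << 3,
    /* The old "strict" mode is TRAILING plus MAXTYPE. */
    MODEL_VALIDATE_DEPRECATED_STRICT =
        MODEL_VALIDATE_TRAILING | MODEL_VALIDATE_MAXTYPE,
    /* New code should get everything. */
    MODEL_VALIDATE_STRICT =
        MODEL_VALIDATE_TRAILING | MODEL_VALIDATE_MAXTYPE |
        MODEL_VALIDATE_UNSPEC | MODEL_VALIDATE_STRICT_ATTRS,
};

struct model_attr   { int type; size_t len; };
struct model_policy { bool specified; size_t len; };  /* !specified models NLA_UNSPEC */

/* Returns 0 if the attribute is accepted, -1 if it is rejected.
 * policy must have maxtype + 1 entries. */
static int model_validate_attr(const struct model_attr *a,
                               const struct model_policy *policy,
                               int maxtype, unsigned int validate)
{
    if (a->type < 0)
        return -1;
    if (a->type > maxtype)                       /* unknown attribute type */
        return (validate & MODEL_VALIDATE_MAXTYPE) ? -1 : 0;
    if (!policy[a->type].specified)              /* no policy entry */
        return (validate & MODEL_VALIDATE_UNSPEC) ? -1 : 0;
    if (a->len < policy[a->type].len)            /* too short: always rejected */
        return -1;
    if (a->len > policy[a->type].len &&
        (validate & MODEL_VALIDATE_STRICT_ATTRS))
        return -1;                               /* too long: only if strict */
    return 0;
}
```

Because each check is gated by its own bit, an existing caller can opt into a single stricter check without adopting the full strict mode.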
  12. 12 Apr, 2019 1 commit
  13. 29 Mar, 2019 1 commit
  14. 20 Jan, 2019 1 commit
  15. 25 Dec, 2018 1 commit
  16. 28 Nov, 2018 5 commits
  17. 09 Oct, 2018 1 commit
  18. 22 Aug, 2018 1 commit
  19. 21 Jul, 2018 1 commit
  20. 01 Apr, 2018 1 commit
    • K
      net: Do not take net_rwsem in __rtnl_link_unregister() · 554873e5
      Kirill Tkhai committed
      This function calls call_netdevice_notifier(), which may also
      take net_rwsem. So, we can't use net_rwsem here.
      
      This patch makes callers of this function take pernet_ops_rwsem,
      like register_netdevice_notifier() does. This protects
      the modifications of net_namespace_list, and allows notifiers
      to take it (they won't have to care about context).
      
      Since __rtnl_link_unregister() is used on module load
      and unload (which are not frequent operations), this looks
      better to me than making every call_netdevice_notifier()
      always execute in "protected net_namespace_list" context.
      
      Also, this fixes the problem we dealt with in 328fbe74
      "Close race between {un, }register_netdevice_notifier and ...",
      and guarantees __rtnl_link_unregister() does not skip
      an exiting net.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      554873e5
  21. 30 Mar, 2018 1 commit
    • K
      net: Introduce net_rwsem to protect net_namespace_list · f0b07bb1
      Kirill Tkhai committed
      rtnl_lock() is used everywhere, and contention is very high.
      When someone wants to iterate over alive net namespaces,
      there is no way to do that without an exclusive lock.
      But the exclusive rtnl_lock() in such places is overkill,
      and it just increases the contention. Yes, there is already
      for_each_net_rcu() in the kernel, but it requires rcu_read_lock(),
      and this can't be sleepable. Also, sometimes it may be necessary
      to really prevent net_namespace_list growth, so for_each_net_rcu()
      does not fit there.
      
      This patch introduces a new rw_semaphore, which will be used
      instead of rtnl_mutex to protect net_namespace_list. It is
      sleepable and allows non-exclusive iteration over the net
      namespaces list. It allows us to stop using rtnl_lock()
      in several places (which is done in the next patches) and reduces
      the time we hold rtnl_mutex. Here we just add the new lock,
      while the explanations of why we can remove rtnl_lock() are
      in the next patches.
      
      Fine-grained locks are generally better than one big lock,
      so let's do that with net_namespace_list, while the situation
      allows that.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f0b07bb1
  22. 28 Mar, 2018 4 commits
  23. 23 Mar, 2018 1 commit
  24. 07 Mar, 2018 1 commit
    • K
      net: Make account struct net to memcg · 30855ffc
      Kirill Tkhai committed
      The patch adds SLAB_ACCOUNT to the flags of the net_cachep cache,
      which enables accounting of struct net memory to memcg kmem.
      Since the number of net namespaces may be significant, users
      want to know how much memory they consume, and to control it.
      
      Note that we do not account net_generic to the same memcg
      where net was accounted; moreover, we don't account it at all (*).
      We do not want a situation where a single memcg's memory deficit
      prevents us from registering new pernet_operations.
      
      (*) Even though !current process accounting is already
      available in linux-next. See kmalloc_memcg() there for the details.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30855ffc
  25. 27 Feb, 2018 1 commit
  26. 21 Feb, 2018 3 commits
  27. 13 Feb, 2018 3 commits