1. 10 3月, 2018 1 次提交
    • E
      net: do not create fallback tunnels for non-default namespaces · 79134e6c
      Eric Dumazet 提交于
      fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
      ip6tnl0, ip6gre0) are automatically created when the corresponding
      module is loaded.
      
      These tunnels are also automatically created when a new network
      namespace is created, at a great cost.
      
      In many cases, netns are used for isolation purposes, and these
      extra network devices are a waste of resources. We are using
      thousands of netns per host, and hit the netns creation/delete
      bottleneck a lot. (Many thanks to Kirill for recent work on this)
      
      Add a new sysctl so that we can opt-out from this automatic creation.
      
      Note that these tunnels are still created for the initial namespace,
      to be the least intrusive for typical setups.
      
      Tested:
      lpk43:~# cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
      done
      wait
      
      lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m37.521s
      user	0m0.886s
      sys	7m7.084s
      lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m4.761s
      user	0m0.851s
      sys	1m8.343s
      lpk43:~#
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79134e6c
  2. 05 3月, 2018 2 次提交
  3. 28 2月, 2018 2 次提交
  4. 07 2月, 2018 2 次提交
  5. 26 1月, 2018 1 次提交
  6. 19 1月, 2018 1 次提交
    • A
      ip6_gre: init dev->mtu and dev->hard_header_len correctly · 128bb975
      Alexey Kodanev 提交于
      Commit b05229f4 ("gre6: Cleanup GREv6 transmit path,
      call common GRE functions") moved dev->mtu initialization
      from ip6gre_tunnel_setup() to ip6gre_tunnel_init(), as a
      result, the previously set values, before ndo_init(), are
      reset in the following cases:
      
      * rtnl_create_link() can update dev->mtu from IFLA_MTU
        parameter.
      
      * ip6gre_tnl_link_config() is invoked before ndo_init() in
        netlink and ioctl setup, so ndo_init() can reset MTU
        adjustments with the lower device MTU as well, dev->mtu
        and dev->hard_header_len.
      
        Not applicable for ip6gretap because it has one more call
        to ip6gre_tnl_link_config(tunnel, 1) in ip6gre_tap_init().
      
      Fix the first case by updating dev->mtu with 'tb[IFLA_MTU]'
      parameter if a user sets it manually on a device creation,
      and fix the second one by moving ip6gre_tnl_link_config()
      call after register_netdevice().
      
      Fixes: b05229f4 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
      Fixes: db2ec95d ("ip6_gre: Fix MTU setting")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      128bb975
  7. 27 12月, 2017 2 次提交
    • W
      net: erspan: remove md NULL check · 214bb1c7
      William Tu 提交于
      The 'md' is allocated from 'tun_dst = ip_tun_rx_dst' and
      since we've checked 'tun_dst', 'md' will never be NULL.
      The patch removes it at both ipv4 and ipv6 erspan.
      
      Fixes: afb4c97d ("ip6_gre: fix potential memory leak in ip6erspan_rcv")
      Fixes: 50670b6e ("ip_gre: fix potential memory leak in erspan_rcv")
      Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      214bb1c7
    • A
      ip6_gre: fix device features for ioctl setup · e5a9336a
      Alexey Kodanev 提交于
      When ip6gre is created using ioctl, its features, such as
      scatter-gather, GSO and tx-checksumming will be turned off:
      
        # ip -f inet6 tunnel add gre6 mode ip6gre remote fd00::1
        # ethtool -k gre6 (truncated output)
          tx-checksumming: off
          scatter-gather: off
          tcp-segmentation-offload: off
          generic-segmentation-offload: off [requested on]
      
      But when netlink is used, they will be enabled:
        # ip link add gre6 type ip6gre remote fd00::1
        # ethtool -k gre6 (truncated output)
          tx-checksumming: on
          scatter-gather: on
          tcp-segmentation-offload: on
          generic-segmentation-offload: on
      
      This results in a loss of performance when gre6 is created via ioctl.
      The issue was found with LTP/gre tests.
      
      Fix it by moving the setup of device features to a separate function
      and invoke it with ndo_init callback because both netlink and ioctl
      will eventually call it via register_netdevice():
      
         register_netdevice()
             - ndo_init() callback -> ip6gre_tunnel_init() or ip6gre_tap_init()
                 - ip6gre_tunnel_init_common()
                      - ip6gre_tnl_init_features()
      
      The moved code also contains two minor style fixes:
        * removed needless tab from GRE6_FEATURES on NETIF_F_HIGHDMA line.
        * fixed the issue reported by checkpatch: "Unnecessary parentheses around
          'nt->encap.type == TUNNEL_ENCAP_NONE'"
      
      Fixes: ac4eb009 ("ip6gre: Add support for basic offloads offloads excluding GSO")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e5a9336a
  8. 21 12月, 2017 3 次提交
  9. 20 12月, 2017 1 次提交
  10. 19 12月, 2017 2 次提交
  11. 16 12月, 2017 2 次提交
  12. 07 12月, 2017 1 次提交
  13. 05 12月, 2017 1 次提交
  14. 02 12月, 2017 2 次提交
  15. 19 11月, 2017 1 次提交
  16. 13 11月, 2017 2 次提交
    • X
      ip6_gre: process toobig in a better way · fe1a4ca0
      Xin Long 提交于
      Now ip6gre processes toobig icmp packet by setting gre dev's mtu in
      ip6gre_err, which would cause few things not good:
      
        - It couldn't set mtu with dev_set_mtu due to it's not in user context,
          which causes route cache and idev->cnf.mtu6 not to be updated.
      
        - It has to update sk dst pmtu in tx path according to gredev->mtu for
          ip6gre, while it updates pmtu again according to lower dst pmtu in
          ip6_tnl_xmit.
      
        - To change dev->mtu by toobig icmp packet is not a good idea, it should
          only work on pmtu.
      
      This patch is to process toobig by updating the lower dst's pmtu, as later
      sk dst pmtu will be updated in ip6_tnl_xmit, the same way as in ip4gre.
      
      Note that gre dev's mtu will not be updated any more, it doesn't make any
      sense to change dev's mtu after receiving a toobig packet.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fe1a4ca0
    • X
      ip6_gre: add the process for redirect in ip6gre_err · 929fc032
      Xin Long 提交于
      This patch is to add redirect icmp packet process for ip6gre by
      calling ip6_redirect() in ip6gre_err(), as in vti6_err.
      
      Prior to this patch, there's even no route cache generated after
      receiving redirect.
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      929fc032
  17. 27 10月, 2017 2 次提交
  18. 01 10月, 2017 1 次提交
  19. 20 9月, 2017 1 次提交
    • E
      ipv6: speedup ipv6 tunnels dismantle · bb401cae
      Eric Dumazet 提交于
      Implement exit_batch() method to dismantle more devices
      per round.
      
      (rtnl_lock() ...
       unregister_netdevice_many() ...
       rtnl_unlock())
      
      Tested:
      $ cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
      done
      wait ; grep net_namespace /proc/slabinfo
      
      Before patch :
      $ time ./add_del_unshare.sh
      net_namespace        110    267   5504    1    2 : tunables    8    4    0 : slabdata    110    267      0
      
      real    3m25.292s
      user    0m0.644s
      sys     0m40.153s
      
      After patch:
      
      $ time ./add_del_unshare.sh
      net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0
      
      real	1m38.965s
      user	0m0.688s
      sys	0m37.017s
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bb401cae
  20. 19 9月, 2017 1 次提交
    • X
      ip6_gre: skb_push ipv6hdr before packing the header in ip6gre_header · 76cc0d32
      Xin Long 提交于
      Now in ip6gre_header before packing the ipv6 header, it skb_push t->hlen
      which only includes encap_hlen + tun_hlen. It means greh and inner header
      would be over written by ipv6 stuff and ipv6h might have no chance to set
      up.
      
      Jianlin found this issue when using remote any on ip6_gre, the packets he
      captured on gre dev are truncated:
      
      22:50:26.210866 Out ethertype IPv6 (0x86dd), length 120: truncated-ip6 -\
      8128 bytes missing!(flowlabel 0x92f40, hlim 0, next-header Options (0)  \
      payload length: 8192) ::1:2000:0 > ::1:0:86dd: HBH [trunc] ip-proto-128 \
      8184
      
      It should also skb_push ipv6hdr so that ipv6h points to the right position
      to set ipv6 stuff up.
      
      This patch is to skb_push hlen + sizeof(*ipv6h) and also fix some indents
      in ip6gre_header.
      
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76cc0d32
  21. 08 9月, 2017 1 次提交
    • X
      ip6_gre: update mtu properly in ip6gre_err · 5c25f30c
      Xin Long 提交于
      Now when probessing ICMPV6_PKT_TOOBIG, ip6gre_err only subtracts the
      offset of gre header from mtu info. The expected mtu of gre device
      should also subtract gre header. Otherwise, the next packets still
      can't be sent out.
      
      Jianlin found this issue when using the topo:
        client(ip6gre)<---->(nic1)route(nic2)<----->(ip6gre)server
      
      and reducing nic2's mtu, then both tcp and sctp's performance with
      big size data became 0.
      
      This patch is to fix it by also subtracting grehdr (tun->tun_hlen)
      from mtu info when updating gre device's mtu in ip6gre_err(). It
      also needs to subtract ETH_HLEN if gre dev'type is ARPHRD_ETHER.
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c25f30c
  22. 02 8月, 2017 1 次提交
  23. 29 7月, 2017 1 次提交
  24. 27 6月, 2017 3 次提交
  25. 16 6月, 2017 1 次提交
    • J
      networking: make skb_push & __skb_push return void pointers · d58ff351
      Johannes Berg 提交于
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
          @@
          expression SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - fn(SKB, LEN)[0]
          + *(u8 *)fn(SKB, LEN)
      
      Note that the last part there converts from push(...)[0] to the
      more idiomatic *(u8 *)push(...).
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d58ff351
  26. 08 6月, 2017 1 次提交
    • D
      net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      David S. Miller 提交于
      Network devices can allocate reasources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally does also a free_netdev().
      
      This creates a problem for the logic in register_netdevice().
      Because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf124db5
  27. 27 5月, 2017 1 次提交
    • P
      ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets · 0e9a7095
      Peter Dawson 提交于
      This fix addresses two problems in the way the DSCP field is formulated
       on the encapsulating header of IPv6 tunnels.
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195661
      
      1) The IPv6 tunneling code was manipulating the DSCP field of the
       encapsulating packet using the 32b flowlabel. Since the flowlabel is
       only the lower 20b it was incorrect to assume that the upper 12b
       containing the DSCP and ECN fields would remain intact when formulating
       the encapsulating header. This fix handles the 'inherit' and
       'fixed-value' DSCP cases explicitly using the extant dsfield u8 variable.
      
      2) The use of INET_ECN_encapsulate(0, dsfield) in ip6_tnl_xmit was
       incorrect and resulted in the DSCP value always being set to 0.
      
      Commit 90427ef5 ("ipv6: fix flow labels when the traffic class
       is non-0") caused the regression by masking out the flowlabel
       which exposed the incorrect handling of the DSCP portion of the
       flowlabel in ip6_tunnel and ip6_gre.
      
      Fixes: 90427ef5 ("ipv6: fix flow labels when the traffic class is non-0")
      Signed-off-by: NPeter Dawson <peter.a.dawson@boeing.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e9a7095