1. 26 3月, 2018 2 次提交
    • K
      net: Drop NETDEV_UNREGISTER_FINAL · 070f2d7e
      Kirill Tkhai 提交于
      Last user is gone after bdf5bd7f "rds: tcp: remove
      register_netdevice_notifier infrastructure.", so we can
      remove this netdevice command. This allows to delete
      rtnl_lock() in netdev_run_todo(), which is hot path for
      net namespace unregistration.
      
      dev_change_net_namespace() and netdev_wait_allrefs()
      have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
      and the source commits say they were introduced to
      delemit the call with NETDEV_UNREGISTER, but this patch
      leaves them on the places, since they require additional
      analysis, whether we need in them for something else.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      070f2d7e
    • K
      net: Make NETDEV_XXX commands enum { } · ede2762d
      Kirill Tkhai 提交于
      This patch is preparation to drop NETDEV_UNREGISTER_FINAL.
      Since the cmd is used in usnic_ib_netdev_event_to_string()
      to get cmd name, after plain removing NETDEV_UNREGISTER_FINAL
      from everywhere, we'd have holes in event2str[] in this
      function.
      
      Instead of that, let's make NETDEV_XXX commands names
      available for everyone, and to define netdev_cmd_to_name()
      in the way we won't have to shaffle names after their
      numbers are changed.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ede2762d
  2. 14 3月, 2018 1 次提交
    • A
      net: fix sysctl_fb_tunnels_only_for_init_net link error · be9fc097
      Arnd Bergmann 提交于
      The new variable is only available when CONFIG_SYSCTL is enabled,
      otherwise we get a link error:
      
      net/ipv4/ip_tunnel.o: In function `ip_tunnel_init_net':
      ip_tunnel.c:(.text+0x278b): undefined reference to `sysctl_fb_tunnels_only_for_init_net'
      net/ipv6/sit.o: In function `sit_init_net':
      sit.c:(.init.text+0x4c): undefined reference to `sysctl_fb_tunnels_only_for_init_net'
      net/ipv6/ip6_tunnel.o: In function `ip6_tnl_init_net':
      ip6_tunnel.c:(.init.text+0x39): undefined reference to `sysctl_fb_tunnels_only_for_init_net'
      
      This adds an extra condition, keeping the traditional behavior when
      CONFIG_SYSCTL is disabled.
      
      Fixes: 79134e6c ("net: do not create fallback tunnels for non-default namespaces")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be9fc097
  3. 10 3月, 2018 2 次提交
    • P
      net: introduce IFF_NO_RX_HANDLER · f5426250
      Paolo Abeni 提交于
      Some network devices - notably ipvlan slave - are not compatible with
      any kind of rx_handler. Currently the hook can be installed but any
      configuration (bridge, bond, macsec, ...) is nonfunctional.
      
      This change allocates a priv_flag bit to mark such devices and explicitly
      forbid installing a rx_handler if such bit is set. The new bit is used
      by ipvlan slave device.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5426250
    • E
      net: do not create fallback tunnels for non-default namespaces · 79134e6c
      Eric Dumazet 提交于
      fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
      ip6tnl0, ip6gre0) are automatically created when the corresponding
      module is loaded.
      
      These tunnels are also automatically created when a new network
      namespace is created, at a great cost.
      
      In many cases, netns are used for isolation purposes, and these
      extra network devices are a waste of resources. We are using
      thousands of netns per host, and hit the netns creation/delete
      bottleneck a lot. (Many thanks to Kirill for recent work on this)
      
      Add a new sysctl so that we can opt-out from this automatic creation.
      
      Note that these tunnels are still created for the initial namespace,
      to be the least intrusive for typical setups.
      
      Tested:
      lpk43:~# cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
      done
      wait
      
      lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m37.521s
      user	0m0.886s
      sys	7m7.084s
      lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m4.761s
      user	0m0.851s
      sys	1m8.343s
      lpk43:~#
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79134e6c
  4. 08 3月, 2018 1 次提交
    • P
      net: unpollute priv_flags space · 1ec54cb4
      Paolo Abeni 提交于
      the ipvlan device driver defines and uses 2 bits inside the priv_flags
      net_device field. Such bits and the related helper are used only
      inside the ipvlan device driver, and the core networking does not
      need to be aware of them.
      
      This change moves netif_is_ipvlan* helper in the ipvlan driver and
      re-implement them looking for ipvlan specific symbols instead of
      using priv_flags.
      
      Overall this frees two bits inside priv_flags - and move the following
      ones to avoid gaps - without any intended functional change.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ec54cb4
  5. 15 2月, 2018 3 次提交
  6. 01 2月, 2018 1 次提交
    • D
      bpf: fix null pointer deref in bpf_prog_test_run_xdp · 65073a67
      Daniel Borkmann 提交于
      syzkaller was able to generate the following XDP program ...
      
        (18) r0 = 0x0
        (61) r5 = *(u32 *)(r1 +12)
        (04) (u32) r0 += (u32) 0
        (95) exit
      
      ... and trigger a NULL pointer dereference in ___bpf_prog_run()
      via bpf_prog_test_run_xdp() where this was attempted to run.
      
      Reason is that recent xdp_rxq_info addition to XDP programs
      updated all drivers, but not bpf_prog_test_run_xdp(), where
      xdp_buff is set up. Thus when context rewriter does the deref
      on the netdev it's NULL at runtime. Fix it by using xdp_rxq
      from loopback dev. __netif_get_rx_queue() helper can also be
      reused in various other locations later on.
      
      Fixes: 02dd3291 ("bpf: finally expose xdp_rxq_info to XDP bpf-programs")
      Reported-by: syzbot+1eb094057b338eb1fc00@syzkaller.appspotmail.com
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      65073a67
  7. 30 1月, 2018 1 次提交
  8. 25 1月, 2018 2 次提交
  9. 24 1月, 2018 1 次提交
  10. 23 1月, 2018 1 次提交
  11. 18 1月, 2018 1 次提交
  12. 15 1月, 2018 2 次提交
    • J
      bpf: offload: add map offload infrastructure · a3884572
      Jakub Kicinski 提交于
      BPF map offload follow similar path to program offload.  At creation
      time users may specify ifindex of the device on which they want to
      create the map.  Map will be validated by the kernel's
      .map_alloc_check callback and device driver will be called for the
      actual allocation.  Map will have an empty set of operations
      associated with it (save for alloc and free callbacks).  The real
      device callbacks are kept in map->offload->dev_ops because they
      have slightly different signatures.  Map operations are called in
      process context so the driver may communicate with HW freely,
      msleep(), wait() etc.
      
      Map alloc and free callbacks are muxed via existing .ndo_bpf, and
      are always called with rtnl lock held.  Maps and programs are
      guaranteed to be destroyed before .ndo_uninit (i.e. before
      unregister_netdev() returns).  Map callbacks are invoked with
      bpf_devs_lock *read* locked, drivers must take care of exclusive
      locking if necessary.
      
      All offload-specific branches are marked with unlikely() (through
      bpf_map_is_dev_bound()), given that branch penalty will be
      negligible compared to IO anyway, and we don't want to penalize
      SW path unnecessarily.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a3884572
    • N
      net: sch: prio: Add offload ability to PRIO qdisc · 7fdb61b4
      Nogah Frankel 提交于
      Add the ability to offload PRIO qdisc by using ndo_setup_tc.
      There are three commands for PRIO offloading:
      * TC_PRIO_REPLACE: handles set and tune
      * TC_PRIO_DESTROY: handles qdisc destroy
      * TC_PRIO_STATS: updates the qdiscs counters (given as reference)
      
      Like RED qdisc, the indication of whether PRIO is being offloaded is being
      set and updated as part of the dump function. It is so because the driver
      could decide to offload or not based on the qdisc parent, which could
      change without notifying the qdisc.
      Signed-off-by: NNogah Frankel <nogahf@mellanox.com>
      Reviewed-by: NYuval Mintz <yuvalm@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7fdb61b4
  13. 11 1月, 2018 1 次提交
  14. 09 1月, 2018 2 次提交
  15. 06 1月, 2018 1 次提交
  16. 31 12月, 2017 1 次提交
  17. 21 12月, 2017 1 次提交
  18. 20 12月, 2017 1 次提交
  19. 03 12月, 2017 2 次提交
  20. 24 11月, 2017 1 次提交
    • W
      net: accept UFO datagrams from tuntap and packet · 0c19f846
      Willem de Bruijn 提交于
      Tuntap and similar devices can inject GSO packets. Accept type
      VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
      
      Processes are expected to use feature negotiation such as TUNSETOFFLOAD
      to detect supported offload types and refrain from injecting other
      packets. This process breaks down with live migration: guest kernels
      do not renegotiate flags, so destination hosts need to expose all
      features that the source host does.
      
      Partially revert the UFO removal from 182e0b6b~1..d9d30adf.
      This patch introduces nearly(*) no new code to simplify verification.
      It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
      insertion and software UFO segmentation.
      
      It does not reinstate protocol stack support, hardware offload
      (NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
      of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.
      
      To support SKB_GSO_UDP reappearing in the stack, also reinstate
      logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
      by squashing in commit 93991221 ("net: skb_needs_check() removes
      CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee6
      ("net: avoid skb_warn_bad_offload false positives on UFO").
      
      (*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
      ipv6_proxy_select_ident is changed to return a __be32 and this is
      assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
      at the end of the enum to minimize code churn.
      
      Tested
        Booted a v4.13 guest kernel with QEMU. On a host kernel before this
        patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
        enabled, same as on a v4.13 host kernel.
      
        A UFO packet sent from the guest appears on the tap device:
          host:
            nc -l -p -u 8000 &
            tcpdump -n -i tap0
      
          guest:
            dd if=/dev/zero of=payload.txt bs=1 count=2000
            nc -u 192.16.1.1 8000 < payload.txt
      
        Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
        packets arriving fragmented:
      
          ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
          (from https://github.com/wdebruij/kerneltools/tree/master/tests)
      
      Changes
        v1 -> v2
          - simplified set_offload change (review comment)
          - documented test procedure
      
      Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
      Fixes: fb652fdf ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
      Reported-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c19f846
  21. 10 11月, 2017 1 次提交
  22. 09 11月, 2017 1 次提交
  23. 08 11月, 2017 3 次提交
  24. 05 11月, 2017 2 次提交
  25. 03 11月, 2017 1 次提交
  26. 28 10月, 2017 1 次提交
  27. 21 10月, 2017 1 次提交
  28. 18 10月, 2017 1 次提交
    • J
      bpf: cpumap xdp_buff to skb conversion and allocation · 1c601d82
      Jesper Dangaard Brouer 提交于
      This patch makes cpumap functional, by adding SKB allocation and
      invoking the network stack on the dequeuing CPU.
      
      For constructing the SKB on the remote CPU, the xdp_buff in converted
      into a struct xdp_pkt, and it mapped into the top headroom of the
      packet, to avoid allocating separate mem.  For now, struct xdp_pkt is
      just a cpumap internal data structure, with info carried between
      enqueue to dequeue.
      
      If a driver doesn't have enough headroom it is simply dropped, with
      return code -EOVERFLOW.  This will be picked up the xdp tracepoint
      infrastructure, to allow users to catch this.
      
      V2: take into account xdp->data_meta
      
      V4:
       - Drop busypoll tricks, keeping it more simple.
       - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei
      
      V5: correct RCU read protection around __netif_receive_skb_core.
      
      V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c601d82
  29. 17 10月, 2017 1 次提交
    • C
      tun: call dev_get_valid_name() before register_netdevice() · 0ad646c8
      Cong Wang 提交于
      register_netdevice() could fail early when we have an invalid
      dev name, in which case ->ndo_uninit() is not called. For tun
      device, this is a problem because a timer etc. are already
      initialized and it expects ->ndo_uninit() to clean them up.
      
      We could move these initializations into a ->ndo_init() so
      that register_netdevice() knows better, however this is still
      complicated due to the logic in tun_detach().
      
      Therefore, I choose to just call dev_get_valid_name() before
      register_netdevice(), which is quicker and much easier to audit.
      And for this specific case, it is already enough.
      
      Fixes: 96442e42 ("tuntap: choose the txq based on rxq")
      Reported-by: NDmitry Alexeev <avekceeb@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ad646c8