1. 06 12月, 2017 1 次提交
  2. 24 11月, 2017 1 次提交
    • W
      net: accept UFO datagrams from tuntap and packet · 0c19f846
      Willem de Bruijn 提交于
      Tuntap and similar devices can inject GSO packets. Accept type
      VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.
      
      Processes are expected to use feature negotiation such as TUNSETOFFLOAD
      to detect supported offload types and refrain from injecting other
      packets. This process breaks down with live migration: guest kernels
      do not renegotiate flags, so destination hosts need to expose all
      features that the source host does.
      
      Partially revert the UFO removal from 182e0b6b~1..d9d30adf.
      This patch introduces nearly(*) no new code to simplify verification.
      It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
      insertion and software UFO segmentation.
      
      It does not reinstate protocol stack support, hardware offload
      (NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
      of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.
      
      To support SKB_GSO_UDP reappearing in the stack, also reinstate
      logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
      by squashing in commit 93991221 ("net: skb_needs_check() removes
      CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee6
      ("net: avoid skb_warn_bad_offload false positives on UFO").
      
      (*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
      ipv6_proxy_select_ident is changed to return a __be32 and this is
      assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
      at the end of the enum to minimize code churn.
      
      Tested
        Booted a v4.13 guest kernel with QEMU. On a host kernel before this
        patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
        enabled, same as on a v4.13 host kernel.
      
        A UFO packet sent from the guest appears on the tap device:
          host:
            nc -l -p -u 8000 &
            tcpdump -n -i tap0
      
          guest:
            dd if=/dev/zero of=payload.txt bs=1 count=2000
            nc -u 192.16.1.1 8000 < payload.txt
      
        Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
        packets arriving fragmented:
      
          ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
          (from https://github.com/wdebruij/kerneltools/tree/master/tests)
      
      Changes
        v1 -> v2
          - simplified set_offload change (review comment)
          - documented test procedure
      
      Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
      Fixes: fb652fdf ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
      Reported-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c19f846
  3. 21 11月, 2017 2 次提交
  4. 14 11月, 2017 8 次提交
  5. 05 11月, 2017 2 次提交
  6. 03 11月, 2017 1 次提交
  7. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  8. 18 10月, 2017 2 次提交
    • J
      bpf: cpumap xdp_buff to skb conversion and allocation · 1c601d82
      Jesper Dangaard Brouer 提交于
      This patch makes cpumap functional, by adding SKB allocation and
      invoking the network stack on the dequeuing CPU.
      
      For constructing the SKB on the remote CPU, the xdp_buff in converted
      into a struct xdp_pkt, and it mapped into the top headroom of the
      packet, to avoid allocating separate mem.  For now, struct xdp_pkt is
      just a cpumap internal data structure, with info carried between
      enqueue to dequeue.
      
      If a driver doesn't have enough headroom it is simply dropped, with
      return code -EOVERFLOW.  This will be picked up the xdp tracepoint
      infrastructure, to allow users to catch this.
      
      V2: take into account xdp->data_meta
      
      V4:
       - Drop busypoll tricks, keeping it more simple.
       - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei
      
      V5: correct RCU read protection around __netif_receive_skb_core.
      
      V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c601d82
    • H
      net: export netdev_txq_to_tc to allow sch_mqprio to compile as module · 8a5f2166
      Henrik Austad 提交于
      In commit 32302902 ("mqprio: Reserve last 32 classid values for HW
      traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
      to find the correct tc instead of dev->tc_to_txq[]
      
      However, when mqprio is compiled as a module, it cannot resolve the
      symbol, leading to this error:
      
           ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!
      
      This adds an EXPORT_SYMBOL() since the other user in the kernel
      (netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
      sysfs-callback.
      
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NHenrik Austad <haustad@cisco.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a5f2166
  9. 17 10月, 2017 1 次提交
    • C
      tun: call dev_get_valid_name() before register_netdevice() · 0ad646c8
      Cong Wang 提交于
      register_netdevice() could fail early when we have an invalid
      dev name, in which case ->ndo_uninit() is not called. For tun
      device, this is a problem because a timer etc. are already
      initialized and it expects ->ndo_uninit() to clean them up.
      
      We could move these initializations into a ->ndo_init() so
      that register_netdevice() knows better, however this is still
      complicated due to the logic in tun_detach().
      
      Therefore, I choose to just call dev_get_valid_name() before
      register_netdevice(), which is quicker and much easier to audit.
      And for this specific case, it is already enough.
      
      Fixes: 96442e42 ("tuntap: choose the txq based on rxq")
      Reported-by: NDmitry Alexeev <avekceeb@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ad646c8
  10. 05 10月, 2017 4 次提交
  11. 04 10月, 2017 1 次提交
  12. 27 9月, 2017 1 次提交
    • D
      bpf: add meta pointer for direct access · de8f3a83
      Daniel Borkmann 提交于
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from it's
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de8f3a83
  13. 23 9月, 2017 2 次提交
  14. 21 9月, 2017 1 次提交
  15. 09 9月, 2017 1 次提交
    • J
      net: rcu lock and preempt disable missing around generic xdp · bbbe211c
      John Fastabend 提交于
      do_xdp_generic must be called inside rcu critical section with preempt
      disabled to ensure BPF programs are valid and per-cpu variables used
      for redirect operations are consistent. This patch ensures this is true
      and fixes the splat below.
      
      The netif_receive_skb_internal() code path is now broken into two rcu
      critical sections. I decided it was better to limit the preempt_enable/disable
      block to just the xdp static key portion and the fallout is more
      rcu_read_lock/unlock calls. Seems like the best option to me.
      
      [  607.596901] =============================
      [  607.596906] WARNING: suspicious RCU usage
      [  607.596912] 4.13.0-rc4+ #570 Not tainted
      [  607.596917] -----------------------------
      [  607.596923] net/core/dev.c:3948 suspicious rcu_dereference_check() usage!
      [  607.596927]
      [  607.596927] other info that might help us debug this:
      [  607.596927]
      [  607.596933]
      [  607.596933] rcu_scheduler_active = 2, debug_locks = 1
      [  607.596938] 2 locks held by pool/14624:
      [  607.596943]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff95445ffd>] ip_finish_output2+0x14d/0x890
      [  607.596973]  #1:  (rcu_read_lock_bh){......}, at: [<ffffffff953c8e3a>] __dev_queue_xmit+0x14a/0xfd0
      [  607.597000]
      [  607.597000] stack backtrace:
      [  607.597006] CPU: 5 PID: 14624 Comm: pool Not tainted 4.13.0-rc4+ #570
      [  607.597011] Hardware name: Dell Inc. Precision Tower 5810/0HHV7N, BIOS A17 03/01/2017
      [  607.597016] Call Trace:
      [  607.597027]  dump_stack+0x67/0x92
      [  607.597040]  lockdep_rcu_suspicious+0xdd/0x110
      [  607.597054]  do_xdp_generic+0x313/0xa50
      [  607.597068]  ? time_hardirqs_on+0x5b/0x150
      [  607.597076]  ? mark_held_locks+0x6b/0xc0
      [  607.597088]  ? netdev_pick_tx+0x150/0x150
      [  607.597117]  netif_rx_internal+0x205/0x3f0
      [  607.597127]  ? do_xdp_generic+0xa50/0xa50
      [  607.597144]  ? lock_downgrade+0x2b0/0x2b0
      [  607.597158]  ? __lock_is_held+0x93/0x100
      [  607.597187]  netif_rx+0x119/0x190
      [  607.597202]  loopback_xmit+0xfd/0x1b0
      [  607.597214]  dev_hard_start_xmit+0x127/0x4e0
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Fixes: b5cdae32 ("net: Generic XDP")
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bbbe211c
  16. 02 9月, 2017 1 次提交
  17. 29 8月, 2017 1 次提交
  18. 25 8月, 2017 1 次提交
  19. 14 8月, 2017 1 次提交
    • J
      net: export some generic xdp helpers · 7c497478
      Jason Wang 提交于
      This patch tries to export some generic xdp helpers to drivers. This
      can let driver to do XDP for a specific skb. This is useful for the
      case when the packet is hard to be processed at page level directly
      (e.g jumbo/GSO frame).
      
      With this patch, there's no need for driver to forbid the XDP set when
      configuration is not suitable. Instead, it can defer the XDP for
      packets that is hard to be processed directly after skb is created.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c497478
  20. 12 8月, 2017 1 次提交
  21. 09 8月, 2017 1 次提交
    • W
      net: avoid skb_warn_bad_offload false positives on UFO · 8d63bee6
      Willem de Bruijn 提交于
      skb_warn_bad_offload triggers a warning when an skb enters the GSO
      stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
      checksum offload set.
      
      Commit b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      observed that SKB_GSO_DODGY producers can trigger the check and
      that passing those packets through the GSO handlers will fix it
      up. But, the software UFO handler will set ip_summed to
      CHECKSUM_NONE.
      
      When __skb_gso_segment is called from the receive path, this
      triggers the warning again.
      
      Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
      Tx these two are equivalent. On Rx, this better matches the
      skb state (checksum computed), as CHECKSUM_NONE here means no
      checksum computed.
      
      See also this thread for context:
      http://patchwork.ozlabs.org/patch/799015/
      
      Fixes: b2504a5d ("net: reduce skb_warn_bad_offload() noise")
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8d63bee6
  22. 04 8月, 2017 1 次提交
    • W
      sock: enable MSG_ZEROCOPY · 1f8b977a
      Willem de Bruijn 提交于
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, .. paths
      support reference counted zerocopy buffers, so do not do a deep copy.
      Add skb_orphan_frags_rx for paths that may loop packets to receive
      sockets. That is not allowed, as it may cause unbounded latency.
      Deep copy all zerocopy copy buffers, ref-counted or not, in this path.
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      the SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f8b977a
  23. 25 7月, 2017 2 次提交
  24. 20 7月, 2017 1 次提交
  25. 18 7月, 2017 1 次提交