1. 04 Jul, 2018 26 commits
    • vhost_net: Avoid tx vring kicks during busyloop · 027b1760
      Toshiaki Makita committed
      Under heavy load vhost busypoll may run without suppressing
      notification. For example tx zerocopy callback can push tx work while
      handle_tx() is running, then busyloop exits due to vhost_has_work()
      condition and enables notification but immediately reenters handle_tx()
      because the pushed work was tx. In this case handle_tx() tries to
      disable notification again, but when using event_idx it by design
      cannot. Then busyloop will run without suppressing notification.
      Another example is the case where handle_tx() tries to enable
      notification but avail idx is advanced so disables it again. This case
      also leads to the same situation with event_idx.
      
      The problem is that once we enter this situation, busyloop does not work
      under heavy load for a considerable amount of time, because notification
      is likely to happen during busyloop and handle_tx() immediately enables
      notification after notification happens. Specifically busyloop detects
      notification by vhost_has_work() and then handle_tx() calls
      vhost_enable_notify(). Because the detected work was the tx work, it
      enters handle_tx(), and enters busyloop without suppression again.
      This is likely to be repeated, so with event_idx we are almost not able
      to suppress notification in this case.
      
      To fix this, poll the work instead of enabling notification when
      busypoll is interrupted by something. IMHO vhost_has_work() is a kind of
      interruption rather than a signal to completely cancel the busypoll, so
      let's run busypoll after the necessary work is done.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vhost_net: Rename local variables in vhost_net_rx_peek_head_len · 28b9b33b
      Toshiaki Makita committed
      So we can easily see which variable is for which, tx or rx.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net:sched: add action inheritdsfield to skbedit · e7e3728b
      Qiaobin Fu committed
      The new action inheritdsfield copies the field DS of
      IPv4 and IPv6 packets into skb->priority. This enables
      later classification of packets based on the DS field.
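As a rough illustration of what "copying the DS field into the priority" involves, the sketch below reads the DS byte from a raw IPv4 or IPv6 header. The function names are hypothetical, and the final shift (dropping the two ECN bits so that the priority equals the 6-bit DSCP) is an assumption about the mapping, not something the commit message states.

```python
def ds_field(packet: bytes) -> int:
    """Return the 8-bit DS field (former TOS / Traffic Class) of an IP header."""
    version = packet[0] >> 4
    if version == 4:
        # IPv4: the DS field is the second byte of the header.
        return packet[1]
    if version == 6:
        # IPv6: 4 bits version, then 8 bits traffic class, so the field
        # straddles the first two bytes of the header.
        return ((packet[0] & 0x0F) << 4) | (packet[1] >> 4)
    raise ValueError(f"not an IP packet (version nibble {version})")


def inherited_priority(packet: bytes) -> int:
    """Assumed mapping: priority = DSCP, i.e. the DS field without the ECN bits."""
    return ds_field(packet) >> 2
```

For example, an IPv4 header starting with bytes `0x45 0xB8` and an IPv6 header starting with `0x6B 0x80` both carry DS field `0xB8`, which this sketch maps to priority 46.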
      
      v5:
      * Update the drop counter for TC_ACT_SHOT.

      v4:
      * Do not allow setting flags other than the expected ones.
      * Allow dumping the pure flags.

      v3:
      * Use optional flags, so that it won't break old versions of tc.
      * Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags.

      v2:
      * Fix the style issue.
      * Move the code from skbmod to skbedit.
      Original idea by Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu>
      Reviewed-by: Michel Machado <michel@digirati.com.br>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'More-mirror-to-gretap-tests-with-bridge-in-UL' · f145b0a7
      David S. Miller committed
      Petr Machata says:
      
      ====================
      More mirror-to-gretap tests with bridge in UL
      
      This patchset adds two more tests where the mirror-to-gretap has a
      bridge in underlay packet path, without a VLAN above or below that
      bridge.
      
      In patch #1, a non-VLAN-filtering bridge is tested.
      
      In patch #2, a VLAN-filtering bridge is tested.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q · 239e754a
      Petr Machata committed
      Test for "tc action mirred egress mirror" that mirrors to gretap when
      the underlay route points at a VLAN-aware bridge (802.1q).
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: Test mirror-to-gretap w/ UL 802.1d · 35c31d5c
      Petr Machata committed
      Test for "tc action mirred egress mirror" that mirrors to gretap when
      the underlay route points at a VLAN-unaware bridge (802.1d).
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Handle-multiple-received-packets-at-each-stage' · 2d1b1385
      David S. Miller committed
      Edward Cree says:
      
      ====================
      Handle multiple received packets at each stage
      
      This patch series adds the capability for the network stack to receive a
       list of packets and process them as a unit, rather than handling each
       packet singly in sequence.  This is done by factoring out the existing
       datapath code at each layer and wrapping it in list handling code.
      
      The motivation for this change is twofold:
      * Instruction cache locality.  Currently, running the entire network
        stack receive path on a packet involves more code than will fit in the
        lowest-level icache, meaning that when the next packet is handled, the
        code has to be reloaded from more distant caches.  By handling packets
        in "row-major order", we ensure that the code at each layer is hot for
        most of the list.  (There is a corresponding downside in _data_ cache
        locality, since we are now touching every packet at every layer, but in
        practice there is easily enough room in dcache to hold one cacheline of
        each of the 64 packets in a NAPI poll.)
      * Reduction of indirect calls.  Owing to Spectre mitigations, indirect
        function calls are now more expensive than ever; they are also heavily
        used in the network stack's architecture (see [1]).  By replacing 64
        indirect calls to the next-layer per-packet function with a single
        indirect call to the next-layer list function, we can save CPU cycles.
      
      Drivers pass an SKB list to the stack at the end of the NAPI poll; this
       gives a natural batch size (the NAPI poll weight) and avoids waiting at
       the software level for further packets to make a larger batch (which
       would add latency).  It also means that the batch size is automatically
       tuned by the existing interrupt moderation mechanism.
      The stack then runs each layer of processing over all the packets in the
       list before proceeding to the next layer.  Where the 'next layer' (or
       the context in which it must run) differs among the packets, the stack
       splits the list; this 'late demux' means that packets which differ only
       in later headers (e.g. same L2/L3 but different L4) can traverse the
       early part of the stack together.
      Also, where the next layer is not (yet) list-aware, the stack can revert
       to calling the rest of the stack in a loop; this allows gradual/creeping
       listification, with no 'flag day' patch needed to listify everything.
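The difference between per-packet and per-list traversal can be sketched with a toy model. Everything here is hypothetical (the layer names, the `trace` list, counting one "hand-off" per call into a layer); it only shows the ordering and the call-count argument made above, not the kernel code.

```python
from typing import Callable, Iterable, List, Tuple

Trace = List[Tuple[str, int]]


def make_layer(name: str, trace: Trace) -> Callable[[int], None]:
    """Build a toy protocol layer that records which packet it touched."""
    def layer(pkt_id: int) -> None:
        trace.append((name, pkt_id))
    return layer


def per_packet(packets: Iterable[int], layers: List[Callable[[int], None]]) -> int:
    """Run each packet through the whole stack before touching the next one."""
    handoffs = 0
    for pkt in packets:
        for layer in layers:
            handoffs += 1          # one call into the next layer per packet
            layer(pkt)
    return handoffs


def listified(packets: List[int], layers: List[Callable[[int], None]]) -> int:
    """Run one layer over the whole list before moving to the next layer."""
    handoffs = 0
    for layer in layers:
        handoffs += 1              # one call into the next layer per list
        for pkt in packets:
            layer(pkt)
    return handoffs
```

With 64 packets and three layers, the first function hands off 192 times and the second 3 times, while each layer's code runs back to back for the whole batch.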
      
      Patches 1-2 simply place received packets on a list during the event
       processing loop on the sfc EF10 architecture, then call the normal stack
       for each packet singly at the end of the NAPI poll.  (Analogues of patch
       #2 for other NIC drivers should be fairly straightforward.)
      Patches 3-9 extend the list processing as far as the IP receive handler.
      
      Patches 1-2 alone give about a 10% improvement in packet rate in the
       baseline test; adding patches 3-9 raises this to around 25%.
      
      Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
       packets and a single core to handle interrupts on the RX side; this was
       in order to measure as simply as possible the packet rate handled by a
       single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
       setup was tuned for maximum reproducibility, rather than raw performance.
       Full details and more results (both with and without retpolines) from a
       previous version of the patch series are presented in [2].
      
      The baseline test uses four streams, and multiple RXQs all bound to a
       single CPU (the netperf binary is bound to a neighbouring CPU).  These
       tests were run with retpolines.
      net-next: 6.91 Mb/s (datum)
       after 9: 8.46 Mb/s (+22.5%)
      Note however that these results are not robust; changes in the parameters
       of the test sometimes shrink the gain to single-digit percentages.  For
       instance, when using only a single RXQ, only a 4% gain was seen.
      
      One test variation was the use of software filtering/firewall rules.
       Adding a single iptables rule (UDP port drop on a port range not matching
       the test traffic), thus making the netfilter hook have work to do,
       reduced baseline performance but showed a similar gain from the patches:
      net-next: 5.02 Mb/s (datum)
       after 9: 6.78 Mb/s (+35.1%)
      
      Similarly, testing with a set of TC flower filters (kindly supplied by
       Cong Wang) gave the following:
      net-next: 6.83 Mb/s (datum)
       after 9: 8.86 Mb/s (+29.7%)
      
      These data suggest that the batching approach remains effective in the
       presence of software switching rules, and perhaps even improves the
       performance of those rules by allowing them and their codepaths to stay
       in cache between packets.
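The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sketch: the helper names are made up, and it assumes the Mbit/s values count payload bits, so that with 1-byte packets each packet contributes 8 bits. Because the published throughput numbers are rounded, a recomputed percentage may differ from the quoted one in the last digit.

```python
def mbit_to_mpps(mbit_per_s: float, payload_bytes: int = 1) -> float:
    """Convert payload throughput to packet rate (million packets per second)."""
    return mbit_per_s / (8 * payload_bytes)


def gain_percent(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


# (net-next, after patch 9) in Mbit/s, as quoted in the cover letter.
RESULTS = {
    "baseline": (6.91, 8.46),
    "iptables rule": (5.02, 6.78),
    "TC flower": (6.83, 8.86),
}

for name, (before, after) in RESULTS.items():
    print(f"{name}: {mbit_to_mpps(before):.3f} -> {mbit_to_mpps(after):.3f} Mpps "
          f"({gain_percent(before, after):+.1f}%)")
```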
      
      Changes from v3:
      * Fixed build error when CONFIG_NETFILTER=n (thanks kbuild).
      
      Changes from v2:
      * Used standard list handling (and skb->list) instead of the skb-queue
        functions (that use skb->next, skb->prev).
        - As part of this, changed from a "dequeue, process, enqueue" model to
          using list_for_each_safe, list_del, and (new) list_cut_before.
      * Altered __netif_receive_skb_core() changes in patch 6 as per Willem de
        Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming).
      * Removed patches to Generic XDP, since they were producing no benefit.
        I may revisit them later.
      * Removed RFC tags.
      
      Changes from v1:
      * Rebased across 2 years' net-next movement (surprisingly straightforward).
        - Added Generic XDP handling to netif_receive_skb_list_internal()
        - Dealt with changes to PFMEMALLOC setting APIs
      * General cleanup of code and comments.
      * Skipped function calls for empty lists at various points in the stack
        (patch #9).
      * Added listified Generic XDP handling (patches 10-12), though it doesn't
        seem to help (see above).
      * Extended testing to cover software firewalls / netfilter etc.
      
      [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
      [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: don't bother calling list RX functions on empty lists · b9f463d6
      Edward Cree committed
      Generally the check should be very cheap, as the sk_buff_head is in cache.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: listify ip_rcv_finish · 5fa12739
      Edward Cree committed
      ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
       demux or route lookup.  The last step, calling dst_input(skb), is left to
       the caller; in the listified case, we split to form sublists with a common
       dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
      The next step in listification would thus be to add a list_input() method
       to struct dst_entry.
      
      Early demux is an indirect call based on iph->protocol; this is another
       opportunity for listification which is not taken here (it would require
       slicing up ip_rcv_finish_core() to allow splitting on protocol changes).
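The "split into sublists with a common dst, then call dst_input() in a loop" step can be pictured as follows. This is a simplified sketch with hypothetical names (`split_by_dst`, `sublist_rcv_finish`, packets as dictionaries); it assumes a new sublist starts whenever the dst changes from one packet to the next, so packet order is preserved.

```python
from itertools import groupby
from typing import Callable, Dict, List, Tuple

Packet = Dict[str, object]


def split_by_dst(packets: List[Packet]) -> List[Tuple[object, List[Packet]]]:
    """Cut the list into runs of consecutive packets that share a dst."""
    return [(dst, list(run)) for dst, run in groupby(packets, key=lambda p: p["dst"])]


def sublist_rcv_finish(sublist: List[Packet], dst_input: Callable[[Packet], None]) -> None:
    """No list-aware dst method exists yet, so fall back to a per-packet loop."""
    for pkt in sublist:
        dst_input(pkt)
```

A list-aware input method on the dst, as suggested above, would replace the loop in `sublist_rcv_finish` with a single call per sublist.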
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: listified version of ip_rcv · 17266ee9
      Edward Cree committed
      Also involved adding a way to run a netfilter hook over a list of packets.
       Rather than attempting to make netfilter know about lists (which would be
       a major project in itself) we just let it call the regular okfn (in this
       case ip_rcv_finish()) for any packets it steals, and have it give us back
       a list of packets it's synchronously accepted (which normally NF_HOOK
       would automatically call okfn() on, but we want to be able to potentially
       pass the list to a listified version of okfn().)
      The netfilter hooks themselves are indirect calls that still happen per-
       packet (see nf_hook_entry_hookfn()), but again, changing that can be left
       for future work.
      
      There is potential for out-of-order receives if the netfilter hook ends up
       synchronously stealing packets, as they will be processed before any
       accepts earlier in the list.  However, it was already possible for an
       asynchronous accept to cause out-of-order receives, so presumably this is
       considered OK.
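The hook-over-a-list idea and the reordering it can cause are easy to see in a toy model. The names below (`Verdict`, `run_hook_over_list`, the `hook(pkt, okfn)` signature) are invented for illustration and do not mirror the netfilter API; the sketch assumes that a hook which steals a packet calls `okfn` on it itself.

```python
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    ACCEPT = 1   # caller keeps the packet and continues processing it
    DROP = 2     # packet is discarded
    STOLEN = 3   # hook took ownership and is responsible for calling okfn


def run_hook_over_list(packets: List[int],
                       hook: Callable[[int, Callable[[int], None]], Verdict],
                       okfn: Callable[[int], None]) -> List[int]:
    """Run the hook per packet and return only the synchronously accepted ones."""
    accepted = []
    for pkt in packets:
        if hook(pkt, okfn) is Verdict.ACCEPT:
            accepted.append(pkt)
    # The caller can now hand `accepted` to a list-aware version of okfn.
    return accepted
```

If packet 2 of `[1, 2, 3]` is stolen and passed to `okfn` immediately, it is processed before packet 1, which is still waiting in the accepted list; that is the out-of-order case described above.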
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: propagate SKB lists through packet_type lookup · 88eb1944
      Edward Cree committed
      __netif_receive_skb_core() does a depressingly large amount of per-packet
       work that can't easily be listified, because the another_round looping
       makes it nontrivial to slice up into smaller functions.
      Fortunately, most of that work disappears in the fast path:
       * Hardware devices generally don't have an rx_handler
       * Unless you're tcpdumping or something, there is usually only one ptype
       * VLAN processing comes before the protocol ptype lookup, so doesn't force
         a pt_prev deliver
       so normally, __netif_receive_skb_core() will run straight through and pass
       back the one ptype found in ptype_base[hash of skb->protocol].
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: another layer of lists, around PF_MEMALLOC skb handling · 4ce0017a
      Edward Cree committed
      First example of a layer splitting the list (rather than merely taking
       individual packets off it).
      Involves new list.h function, list_cut_before(), like list_cut_position()
       but cuts on the other side of the given entry.
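The difference between the two cut operations comes down to whether the given entry moves with the front part or stays behind. The sketch below models that on plain Python lists; the function names echo the kernel helpers, but the signatures are simplified and hypothetical (the moved part is returned rather than written to a separate list head).

```python
from typing import List, TypeVar

T = TypeVar("T")


def cut_position(head: List[T], entry: T) -> List[T]:
    """Move the front of `head`, up to and including `entry`, into a new list."""
    idx = head.index(entry) + 1
    moved = head[:idx]
    del head[:idx]
    return moved


def cut_before(head: List[T], entry: T) -> List[T]:
    """Move the front of `head`, up to but excluding `entry`, into a new list."""
    idx = head.index(entry)
    moved = head[:idx]
    del head[:idx]
    return moved
```

Cutting before the entry is convenient when the entry is the first packet that needs different handling: everything ahead of it is dispatched as one sublist, and the entry itself becomes the head of the remainder.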
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: Another step of skb receive list processing · 7da517a3
      Edward Cree committed
      netif_receive_skb_list_internal() now processes a list and hands it
       on to the next function.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 920572b7
    • sfc: batch up RX delivery · e090bfb9
      Edward Cree committed
      Improves packet rate of 1-byte UDP receives by up to 10%.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: trivial netif_receive_skb_list() entry point · f6ad8c1b
      Edward Cree committed
      Just calls netif_receive_skb() in a loop.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sctp-fully-support-for-dscp-and-flowlabel-per-transport' · 2bdea157
      David S. Miller committed
      Xin Long says:
      
      ====================
      sctp: fully support for dscp and flowlabel per transport
      
      Now dscp and flowlabel are set from sock when sending the packets,
      but being multi-homed, sctp also supports dscp and flowlabel
      per transport, as described in section 8.1.12 of RFC6458.
      
      v1->v2:
        - define ip_queue_xmit as inline in net/ip.h, instead of exporting
          it in Patch 1/5 according to David's suggestion.
        - fix the param len check in sctp_s/getsockopt_peer_addr_params()
          in Patch 3/5 to guarantee that an old app built with old kernel
          headers could work on the newer kernel per Marcelo's point.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: check for ipv6_pinfo legal sndflow with flowlabel in sctp_v6_get_dst · 0999f021
      Xin Long committed
      A transport with an illegal flowlabel should not be allowed to send
      packets. Other transport protocols already deny this.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add support for setting flowlabel when adding a transport · 4be4139f
      Xin Long committed
      Struct sockaddr_in6 has the member sin6_flowinfo that includes the
      ipv6 flowlabel, so sctp should also support setting the flowlabel when
      adding a transport whose ipaddr is from userspace.
      
      Note that addrinfo in sctp_sendmsg is using struct in6_addr for
      the secondary addrs, which doesn't contain sin6_flowinfo, and
      it needs to copy sin6_flowinfo from the primary addr.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams · 0b0dce7a
      Xin Long committed
      spp_ipv6_flowlabel and spp_dscp are added in sctp_paddrparams in
      this patch so that users can set sctp_sock/asoc/transport dscp
      and flowlabel with spp_flags SPP_IPV6_FLOWLABEL or SPP_DSCP by
      SCTP_PEER_ADDR_PARAMS, as described in section 8.1.12 of RFC6458.

      As said in the previous patch, it uses '| 0x100000' or '| 0x1' to mark
      that flowlabel or dscp is set, so that their values can be set
      to 0.
      
      Note that to guarantee that an old app built with old kernel
      headers can work on the newer kernel, the param length check in
      sctp_g/setsockopt_peer_addr_params() is also improved, following
      the approach of sctp_g/setsockopt_delayed_ack() and other
      sockopts that accept two types of params.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add support for dscp and flowlabel per transport · 8a9c58d2
      Xin Long committed
      Like some other per transport params, flowlabel and dscp are added
      in transport, asoc and sctp_sock. By default, transport sets its
      value from asoc's, and asoc does it from sctp_sock. flowlabel
      only works for ipv6 transport.
      
      Other than that they need to be passed down in sctp_xmit, flow4/6
      also needs to set them before looking up route in get_dst.
      
      Note that it uses '& 0x100000' to check if flowlabel is set and
      '& 0x1' (tos 1st bit is unused) to check if dscp is set by users,
      so that they could be set to 0 by sockopt in next patch.
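The "marker bit" trick described above, which lets a stored value of 0 be told apart from "never set", can be sketched like this. The constant and function names are hypothetical; only the two marker values (`0x100000` and `0x1`) come from the commit message. The value masks are assumptions: a 20-bit flow label, and a TOS-style byte whose DSCP sits in the upper six bits.

```python
from typing import Optional

FLOWLABEL_SET = 0x100000   # marker bit just above the 20-bit flow label
FLOWLABEL_MASK = 0x0FFFFF  # assumed: flow label occupies the low 20 bits
DSCP_SET = 0x1             # marker bit in the lowest bit of the TOS-style byte
DSCP_MASK = 0xFC           # assumed: DSCP occupies the upper six bits


def store_flowlabel(value: int) -> int:
    """Encode a user-supplied flow label together with the 'is set' marker."""
    return (value & FLOWLABEL_MASK) | FLOWLABEL_SET


def load_flowlabel(stored: int) -> Optional[int]:
    """Return the flow label, or None if the user never set one."""
    return (stored & FLOWLABEL_MASK) if stored & FLOWLABEL_SET else None


def store_dscp(tos_byte: int) -> int:
    """Encode a user-supplied DSCP (as a TOS-style byte) with the marker."""
    return (tos_byte & DSCP_MASK) | DSCP_SET


def load_dscp(stored: int) -> Optional[int]:
    """Return the DSCP bits, or None if the user never set them."""
    return (stored & DSCP_MASK) if stored & DSCP_SET else None
```

With this encoding, `store_flowlabel(0)` yields `0x100000`, which decodes back to an explicit 0, whereas a field that was never written stays 0 and decodes to "not set".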
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: add __ip_queue_xmit() that supports tos param · 69b9e1e0
      Xin Long committed
      This patch introduces __ip_queue_xmit(), through which the callers
      can pass tos param into it without having to set inet->tos. For
      ipv6, ip6_xmit() already allows passing tclass parameter.
      
      It's needed when some transport protocol doesn't use inet->tos,
      like sctp's per transport dscp, which will be added in next patch.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Add Vitesse VSC73xx DSA router driver · 05bd97fc
      Linus Walleij committed
      This adds a DSA driver for:
      
      Vitesse VSC7385 SparX-G5 5-port Integrated Gigabit Ethernet Switch
      Vitesse VSC7388 SparX-G8 8-port Integrated Gigabit Ethernet Switch
      Vitesse VSC7395 SparX-G5e 5+1-port Integrated Gigabit Ethernet Switch
      Vitesse VSC7398 SparX-G8e 8-port Integrated Gigabit Ethernet Switch
      
      These switches have a built-in 8051 CPU and can download and execute
      firmware in this CPU. They can also be configured to use an external
      CPU handling the switch in a memory-mapped manner by connecting to
      that external CPU's memory bus.
      
      This driver (currently) only takes control of the switch chip over
      SPI and configures it to route packets around when connected to a
      CPU port. The chip has embedded PHYs and VLAN support so we model it
      using DSA as a best fit so we can easily add VLAN support and maybe
      later also exploit the internal frame header to get more direct
      control over the switch.
      
      The four built-in GPIO lines are exposed using a standard GPIO chip.
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: vitesse: Add support for VSC73xx · 975ae7c6
      Linus Walleij committed
      The VSC7385, VSC7388, VSC7395 and VSC7398 are integrated
      switch/router chips for 5+1 or 8-port switches/routers. When
      managed directly by Linux using DSA we need to do a special
      set-up "dance" on the PHY. Unfortunately these sequences
      switch the PHY to undocumented pages named 2a30 and 52b6
      and do undocumented things. The reference manual also describes
      them only as opaque sequences. This is a best
      effort to integrate them anyway.
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Add DT bindings for Vitesse VSC73xx switches · 1decd2ec
      Linus Walleij committed
      This adds the device tree bindings for the Vitesse VSC73xx
      switches. We also add the vendor name for Vitesse.
      
      Cc: devicetree@vger.kernel.org
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · b6803408
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-07-03
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Various improvements to bpftool and libbpf, that is, bpftool build
         speed improvements, missing BPF program types added for detection
         by section name, ability to load programs from '.text' section is
         made to work again, and better bash completion handling, from Jakub.
      
      2) Improvements to nfp JIT's map read handling which allows for optimizing
         memcpy from map to packet, from Jiong.
      
      3) New BPF sample is added which demonstrates XDP in combination with
         bpf_perf_event_output() helper to sample packets on all CPUs, from Toke.
      
      4) Add a new BPF kselftest case for tracking connect(2) BPF hooks
         infrastructure in combination with TFO, from Andrey.
      
      5) Extend the XDP/BPF xdp_rxq_info sample code with a cmdline option to
         read payload from packet data in order to use it for benchmarking.
         Also for '--action XDP_TX' option implement swapping of MAC addresses
         to avoid drops on some hardware seen during testing, from Jesper.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 03 Jul, 2018 13 commits
  3. 02 Jul, 2018 1 commit
    • Merge branch 'hns3-a-few-code-improvements' · f6779e4e
      David S. Miller committed
      Peng Li says:
      
      ====================
      net: hns3: a few code improvements
      
      This patchset removes some redundant code and fixes a few code
      style issues found in a focused internal review;
      no functional changes are introduced.
      
      ---
      Change log:
      V1 -> V2:
      1. Remove a patch according to the comment from David Miller.
      ---
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>