1. 03 Oct, 2016 (2 commits)
    • net: rtnl: avoid uninitialized data in IFLA_VF_VLAN_LIST handling · fa34cd94
      Committed by Arnd Bergmann
      With the newly added support for IFLA_VF_VLAN_LIST netlink messages,
      we get a warning about potential uninitialized variable use in
      the parsing of the user input when enabling the -Wmaybe-uninitialized
      warning:
      
      net/core/rtnetlink.c: In function 'do_setvfinfo':
      net/core/rtnetlink.c:1756:9: error: 'ivvl$' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      I have not been able to prove whether it is possible to arrive in
      this code with an empty IFLA_VF_VLAN_LIST block, but if we do,
      then ndo_set_vf_vlan gets called with uninitialized arguments.
      
      This adds an explicit check for an empty list, making it obvious
      to the reader and the compiler that this cannot happen (a minimal
      sketch of such a check follows this entry).
      
      Fixes: 79aab093 ("net: Update API for VF vlan protocol 802.1ad support")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa34cd94
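      A minimal sketch of the kind of check the commit describes, modelled
      on the do_setvfinfo() parsing loop of that era (names such as
      MAX_VLAN_LIST_LEN and the exact error codes are assumptions here,
      not a verbatim copy of the upstream code):

      	struct ifla_vf_vlan_info *ivvl[MAX_VLAN_LIST_LEN];
      	struct nlattr *attr;
      	int rem, len = 0;

      	nla_for_each_nested(attr, tb[IFLA_VF_VLAN_LIST], rem) {
      		if (nla_type(attr) != IFLA_VF_VLAN_INFO ||
      		    nla_len(attr) < sizeof(struct ifla_vf_vlan_info))
      			return -EINVAL;
      		if (len >= MAX_VLAN_LIST_LEN)
      			return -EOPNOTSUPP;	/* only one vlan supported for now */
      		ivvl[len++] = nla_data(attr);
      	}

      	if (len == 0)	/* empty list: ivvl[0] would be uninitialized */
      		return -EINVAL;

      	err = ops->ndo_set_vf_vlan(dev, ivvl[0]->vf, ivvl[0]->vlan,
      				   ivvl[0]->qos, ivvl[0]->vlan_proto);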
    • net: pktgen: fix pkt_size · 63d75463
      Committed by Paolo Abeni
      The commit 879c7220 ("net: pktgen: Observe needed_headroom
      of the device") increased the 'pkt_overhead' field value by
      LL_RESERVED_SPACE.
      As a side effect the generated packet size, computed as:
      
      	/* Eth + IPh + UDPh + mpls */
      	datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 -
      		  pkt_dev->pkt_overhead;
      
      is decreased by the same value.
      The above slightly changed the behavior of existing pktgen users
      and made the procfs interface somewhat inconsistent.
      Fix it by restoring the previous pkt_overhead value and using
      LL_RESERVED_SPACE as extralen in the skb allocation.
      Also, change pktgen_alloc_skb() to only partially reserve
      the headroom, so that the caller can prefetch from the ll header
      start (see the sketch after this entry).
      
      v1 -> v2:
       - fixed some typos in the comments
      
      Fixes: 879c7220 ("net: pktgen: Observe needed_headroom of the device")
      Suggested-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      63d75463
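      A hedged sketch of the intended split (simplified from
      pktgen_alloc_skb(); the exact size arithmetic and the 16-byte
      reserve split are assumptions here): pkt_overhead no longer
      includes LL_RESERVED_SPACE, which is instead passed as extra
      allocation room and mostly reserved up front so the caller can
      still prefetch from the ll header start:

      	unsigned int extralen = LL_RESERVED_SPACE(dev);
      	struct sk_buff *skb;
      	int datalen, size;

      	/* Eth + IPh + UDPh + mpls: unchanged, pkt_overhead excludes extralen */
      	datalen = pkt_dev->cur_pkt_size - 14 - 20 - 8 - pkt_dev->pkt_overhead;

      	size = pkt_dev->cur_pkt_size + 64 + extralen + pkt_dev->pkt_overhead;
      	skb = __netdev_alloc_skb(dev, size, GFP_NOWAIT);
      	if (likely(skb))
      		/* reserve the headroom only partially; the caller prefetches
      		 * from the ll header start and reserves the remaining bytes */
      		skb_reserve(skb, extralen - 16);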
  2. 29 Sep, 2016 (1 commit)
  3. 25 Sep, 2016 (1 commit)
  4. 24 Sep, 2016 (1 commit)
    • net: Update API for VF vlan protocol 802.1ad support · 79aab093
      Committed by Moshe Shemesh
      Introduce new rtnl UAPI that exposes a list of vlans per VF, giving
      the ability for user-space application to specify it for the VF, as an
      option to support 802.1ad.
      We adjusted IP Link tool to support this option.
      
      For future use cases, the new UAPI supports multiple vlans. For now we
      limit the list size to a single vlan in kernel.
      Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
      compatibility with older versions of IP Link tool.
      
      Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
      We kept 802.1Q as the drivers' default vlan protocol (a signature
      sketch follows this entry).
      Suitable ip link tool command examples:
        Set vf vlan protocol 802.1ad:
          ip link set eth0 vf 1 vlan 100 proto 802.1ad
        Set vf to VST (802.1Q) mode:
          ip link set eth0 vf 1 vlan 100 proto 802.1Q
        Or by omitting the new parameter
          ip link set eth0 vf 1 vlan 100
      Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      79aab093
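      A sketch of the extended callback and the guard a VST-only driver
      would typically add (parameter names and the hypothetical foo_
      driver are illustrative; the key point is the new __be16 vlan
      protocol argument, defaulting to 802.1Q):

      	int (*ndo_set_vf_vlan)(struct net_device *dev, int vf, u16 vlan,
      			       u8 qos, __be16 vlan_proto);

      	static int foo_set_vf_vlan(struct net_device *dev, int vf, u16 vlan,
      				   u8 qos, __be16 vlan_proto)
      	{
      		if (vlan_proto != htons(ETH_P_8021Q))	/* no 802.1ad support */
      			return -EPROTONOSUPPORT;
      		/* existing 802.1Q handling continues here */
      		return 0;
      	}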
  5. 23 Sep, 2016 (3 commits)
  6. 22 Sep, 2016 (3 commits)
  7. 21 Sep, 2016 (1 commit)
    • bpf: direct packet write and access for helpers for clsact progs · 36bbef52
      Committed by Daniel Borkmann
      This work implements direct packet access for helpers and direct packet
      write in a similar fashion as already available for XDP types via commits
      4acf6c0b ("bpf: enable direct packet data write for xdp progs") and
      6841de8b ("bpf: allow helpers access the packet directly"), and as a
      complementary feature to the already available direct packet read for tc
      (cls/act) programs.
      
      For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
      and bpf_csum_update(). The first is generally needed for both, read and
      write, because they would otherwise only be limited to the current linear
      skb head. Usually, when the data_end test fails, programs just bail out,
      or, in the direct read case, use bpf_skb_load_bytes() as an alternative
      to overcome this limitation. If such data sits in non-linear parts, we
      can just pull them in once with the new helper, retest and eventually
      access them.
      
      At the same time, this also makes sure the skb is uncloned, which is, of
      course, a necessary condition for direct write. As this needs to be an
      invariant for the write part only, the verifier detects writes and adds
      a prologue that is calling bpf_skb_pull_data() to effectively unclone the
      skb from the very beginning in case it is indeed cloned. The heuristic
      makes use of a similar trick that was done in 233577a2 ("net: filter:
      constify detection of pkt_type_offset"). This comes at zero cost for other
      programs that do not use the direct write feature. Should a program use
      this feature only sparsely and has read access for the most parts with,
      for example, drop return codes, then such write action can be delegated
      to a tail called program for mitigating this cost of potential uncloning
      to a late point in time where it would have been paid similarly with the
      bpf_skb_store_bytes() as well. The advantage of direct write is that the
      writes are inlined, whereas the helper cannot make any length assumptions
      and thus needs to generate a call to memcpy() even for small sizes; the
      cost of the helper call itself, with its sanity checks, is also avoided. Plus, when
      direct read is already used, we don't need to cache or perform rechecks
      on the data boundaries (due to verifier invalidating previous checks for
      helpers that change skb->data), so more complex programs using rewrites
      can benefit from switching to direct read plus write.
      
      For direct packet access to helpers, we save the otherwise needed copy into
      a temp struct sitting on stack memory when use-case allows. Both facilities
      are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
      this to map helpers and csum_diff, and can successively enable other helpers
      where we find it makes sense. Helpers that definitely cannot be allowed for
      this are those part of bpf_helper_changes_skb_data() since they can change
      underlying data, and those that write into memory as this could happen for
      packet typed args when still cloned. bpf_csum_update() helper accommodates
      for the fact that we need to fixup checksum_complete when using direct write
      instead of bpf_skb_store_bytes(), meaning the programs can use available
      helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
      csum_block_add(), csum_block_sub() equivalents in eBPF together with the
      new helper. A usage example will be provided for iproute2's examples/bpf/
      directory; a minimal pull-and-retest sketch also follows this entry.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36bbef52
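      A minimal pull-and-retest sketch for a tc (clsact) program in the
      clang/iproute2 style of that era; the section name, the bare helper
      declaration and the dummy write are illustrative assumptions, while
      the pattern itself (bpf_skb_pull_data() plus a fresh data/data_end
      test before the direct write) is the one this commit describes:

      	#include <linux/bpf.h>
      	#include <linux/if_ether.h>
      	#include <linux/pkt_cls.h>

      	static int (*bpf_skb_pull_data)(void *skb, int len) =
      		(void *) BPF_FUNC_skb_pull_data;

      	__attribute__((section("classifier"), used))
      	int pull_then_write(struct __sk_buff *skb)
      	{
      		void *data = (void *)(long)skb->data;
      		void *data_end = (void *)(long)skb->data_end;
      		struct ethhdr *eth;

      		if (data + sizeof(*eth) > data_end) {
      			/* header may sit in the non-linear part: pull it in once... */
      			if (bpf_skb_pull_data(skb, sizeof(*eth)))
      				return TC_ACT_SHOT;
      			/* ...and retest, since the helper invalidates old bounds */
      			data = (void *)(long)skb->data;
      			data_end = (void *)(long)skb->data_end;
      			if (data + sizeof(*eth) > data_end)
      				return TC_ACT_SHOT;
      		}

      		eth = data;
      		eth->h_dest[0] = 0x02;	/* direct packet write, skb already uncloned */

      		return TC_ACT_OK;
      	}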
  8. 20 Sep, 2016 (2 commits)
  9. 19 Sep, 2016 (1 commit)
  10. 11 Sep, 2016 (1 commit)
  11. 10 Sep, 2016 (5 commits)
  12. 09 Sep, 2016 (1 commit)
    • tcp: use an RB tree for ooo receive queue · 9f5afeae
      Committed by Yaogong Wang
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering reaching the 2 Gbytes limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
      
      In the presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and the number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in the presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to an RB tree keyed by sequence number, to get
      bounded latencies (an insertion sketch follows this entry).
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: Yaogong Wang <wygivan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f5afeae
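      A hedged sketch of the core insertion step, keyed by TCP sequence
      number via the kernel rbtree API; it deliberately ignores the overlap
      detection and coalescing the real tcp_data_queue_ofo() also has to
      perform, and the ofo_insert() name is illustrative:

      	static void ofo_insert(struct rb_root *root, struct sk_buff *skb)
      	{
      		struct rb_node **p = &root->rb_node, *parent = NULL;
      		u32 seq = TCP_SKB_CB(skb)->seq;

      		while (*p) {
      			struct sk_buff *skb1 = rb_entry(*p, struct sk_buff, rbnode);

      			parent = *p;
      			if (before(seq, TCP_SKB_CB(skb1)->seq))
      				p = &parent->rb_left;
      			else
      				p = &parent->rb_right;
      		}
      		rb_link_node(&skb->rbnode, parent, p);
      		rb_insert_color(&skb->rbnode, root);
      	}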
  13. 05 Sep, 2016 (3 commits)
  14. 02 Sep, 2016 (4 commits)
  15. 31 Aug, 2016 (2 commits)
    • net: lwtunnel: Handle fragmentation · 14972cbd
      Committed by Roopa Prabhu
      Today the mpls iptunnel lwtunnel_output redirect expects the tunnel
      output function to handle fragmentation. This is OK, but it could be
      avoided if we did not do the mpls output redirect so early, i.e. if
      we waited until ip fragmentation is done and then called mpls output
      for each ip fragment.
      
      To make this work we need:
      1) the lwtunnel state to carry the encap headroom, and
      2) the redirect to the encap output handler to happen on each ip fragment
      (essentially doing the output redirect after fragmentation).
      
      This patch adds tunnel headroom in lwtstate to make sure we
      account for tunnel data in mtu calculations during fragmentation,
      and adds a new xmit redirect handler to redirect to the lwtunnel
      xmit func after ip fragmentation (a sketch of the hook follows
      this entry).
      
      This includes IPv6 support and some mtu fixes and testing from David Ahern.
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14972cbd
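      The shape of the new hook in the IPv4 finish-output path, i.e. after
      fragmentation (a hedged sketch; the flag and helper names follow this
      series but should be treated as assumptions here):

      	/* inside ip_finish_output2(), once fragmentation has been done */
      	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
      		int res = lwtunnel_xmit(skb);

      		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
      			return res;
      		/* otherwise fall through to the normal neighbour output */
      	}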
    • net: batch calls to flush_all_backlogs() · 41852497
      Committed by Eric Dumazet
      After commit 145dd5f9 ("net: flush the softnet backlog in process
      context"), we can easily batch calls to flush_all_backlogs() for all
      devices processed in rollback_registered_many() (see the sketch after
      this entry).
      
      Tested:
      
      Before patch, on an idle host.
      
      modprobe dummy numdummies=10000
      perf stat -e context-switches -a rmmod dummy
      
       Performance counter stats for 'system wide':
      
               1,211,798      context-switches
      
             1.302137465 seconds time elapsed
      
      After patch:
      
      perf stat -e context-switches -a rmmod dummy
      
       Performance counter stats for 'system wide':
      
                 225,523      context-switches
      
             0.721623566 seconds time elapsed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41852497
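      A hedged sketch of the batching idea in rollback_registered_many():
      unlist every device of the batch first, then flush the backlogs once
      for all of them instead of once per device (simplified, the real
      function does much more per-device work):

      	list_for_each_entry(dev, head, unreg_list)
      		unlist_netdevice(dev);	/* plus the usual per-device shutdown */

      	synchronize_net();

      	flush_all_backlogs();	/* one flush now covers every device on 'head' */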
  16. 27 Aug, 2016 (2 commits)
    • bridge: switchdev: Add forward mark support for stacked devices · 6bc506b4
      Committed by Ido Schimmel
      switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
      port netdevs so that packets being flooded by the device won't be
      flooded twice.
      
      It works by assigning a unique identifier (the ifindex of the first
      bridge port) to bridge ports sharing the same parent ID. This prevents
      packets from being flooded twice by the same switch, but will flood
      packets through bridge ports belonging to a different switch.
      
      This method is problematic when stacked devices are taken into account,
      such as VLANs. In such cases, a physical port netdev can have upper
      devices being members in two different bridges, thus requiring two
      different 'offload_fwd_mark's to be configured on the port netdev, which
      is impossible.
      
      The main problem is that packet and netdev marking is performed at the
      physical netdev level, whereas flooding occurs between bridge ports,
      which are not necessarily port netdevs.
      
      Instead, packet and netdev marking should really be done in the bridge
      driver with the switch driver only telling it which packets it already
      forwarded. The bridge driver will mark such packets using the mark
      assigned to the ingress bridge port and will prevent the packet from
      being forwarded through any bridge port sharing the same mark (i.e.
      having the same parent ID).
      
      Remove the current switchdev 'offload_fwd_mark' implementation and
      instead implement the proposed method (a sketch of the egress check
      follows this entry). In addition, make rocker - the sole user of the
      mark - use the proposed method.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bc506b4
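      A hedged sketch of the resulting egress check in the bridge forwarding
      path: a packet the switch has already forwarded carries the mark of
      its ingress bridge port and is not sent again through any port with
      the same mark (the function name and field placement loosely follow
      the idea, not necessarily the final upstream naming):

      	static bool nbp_allowed_egress(const struct net_bridge_port *p,
      				       const struct sk_buff *skb)
      	{
      		const struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);

      		return !skb->offload_fwd_mark ||
      		       cb->offload_fwd_mark != p->offload_fwd_mark;
      	}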
    • net: flush the softnet backlog in process context · 145dd5f9
      Committed by Paolo Abeni
      Currently in process_backlog(), the process_queue dequeuing is
      performed with local IRQ disabled, to protect against
      flush_backlog(), which runs in hard IRQ context.
      
      This patch moves the flush operation to a work queue and runs the
      callback with bottom half disabled to protect the process_queue
      against dequeuing.
      Since process_queue is now always manipulated in bottom half context,
      the irq disable/enable pair around the dequeue operation is removed.
      
      To keep the flush time as low as possible, the flush
      works are scheduled on all online cpus simultaneously, using the
      high-priority workqueue and statically allocated, per-cpu
      work structs (see the sketch after this entry).
      
      Overall this change increases the time required to destroy a device
      in order to slightly improve packet reinjection performance.
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      145dd5f9
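      A hedged sketch of the per-cpu scheme (simplified from the
      net/core/dev.c change): one statically allocated work struct per cpu,
      initialized at boot with the flush callback that runs with bottom
      halves disabled, queued on the high-priority workqueue and then
      waited for:

      	static DEFINE_PER_CPU(struct work_struct, flush_works);

      	static void flush_all_backlogs(void)
      	{
      		unsigned int cpu;

      		get_online_cpus();

      		for_each_online_cpu(cpu)
      			queue_work_on(cpu, system_highpri_wq,
      				      per_cpu_ptr(&flush_works, cpu));

      		for_each_online_cpu(cpu)
      			flush_work(per_cpu_ptr(&flush_works, cpu));

      		put_online_cpus();
      	}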
  17. 24 Aug, 2016 (3 commits)
  18. 19 Aug, 2016 (4 commits)