1. 18 Oct, 2018 1 commit
    • Merge branch 'mlx5-next' of... · 186daf0c
      Saeed Mahameed committed
      Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into net-next
      
      mlx5 updates for both net-next and rdma-next
      
      * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: (21 commits)
        net/mlx5: Expose DC scatter to CQE capability bit
        net/mlx5: Update mlx5_ifc with DEVX UID bits
        net/mlx5: Set uid as part of DCT commands
        net/mlx5: Set uid as part of SRQ commands
        net/mlx5: Set uid as part of SQ commands
        net/mlx5: Set uid as part of RQ commands
        net/mlx5: Set uid as part of QP commands
        net/mlx5: Set uid as part of CQ commands
        net/mlx5: Rename incorrect naming in IFC file
        net/mlx5: Export packet reformat alloc/dealloc functions
        net/mlx5: Pass a namespace for packet reformat ID allocation
        net/mlx5: Expose new packet reformat capabilities
        {net, RDMA}/mlx5: Rename encap to reformat packet
        net/mlx5: Move header encap type to IFC header file
        net/mlx5: Break encap/decap into two separated flow table creation flags
        net/mlx5: Add support for more namespaces when allocating modify header
        net/mlx5: Export modify header alloc/dealloc functions
        net/mlx5: Add proper NIC TX steering flow tables support
        net/mlx5: Cleanup flow namespace getter switch logic
        net/mlx5: Add memic command opcode to command checker
        ...
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
  2. 17 Oct, 2018 15 commits
  3. 16 Oct, 2018 24 commits
    • Merge branch 'net-Kernel-side-filtering-for-route-dumps' · 2c59f06c
      David S. Miller committed
      David Ahern says:
      
      ====================
      net: Kernel side filtering for route dumps
      
      Implement kernel side filtering of route dumps by protocol (e.g., which
      routing daemon installed the route), route type (e.g., unicast), table
      id and nexthop device.
      
      iproute2 has been doing this filtering in userspace for years; pushing
      the filters to the kernel side reduces the amount of data the kernel
      sends and reduces wasted cycles on both sides processing unwanted data.
      These initial options provide a huge improvement for efficiently
      examining routes on large scale systems.
      
      v2
      - better handling of requests for a specific table. Rather than walking
        the hash of all tables, lookup the specific table and dump it
      - refactor mr_rtm_dumproute moving the loop over the table into a
        helper that can be invoked directly
      - add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
        it is returned even when the dump returns nothing
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4: Bail early if user only wants prefix entries · e4e92fb1
      David Ahern committed
      Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the
      flag is set in the dump request, just return.
      
      In the process of this change, move the CLONE check to use the new
      filter flags.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Bail early if user only wants cloned entries · 08e814c9
      David Ahern committed
      Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user
      requests a route dump for only cloned entries, no sense walking the FIB
      and returning everything.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mpls: Handle kernel side filtering of route dumps · 196cfebf
      David Ahern committed
      Update the dump request parsing in MPLS for the non-INET case to
      enable kernel side filtering. If INET is disabled the only filters
      that make sense for MPLS are protocol and nexthop device.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Enable kernel side filtering of route dumps · effe6792
      David Ahern committed
      Update parsing of route dump request to enable kernel side filtering.
      Allow filtering results by protocol (e.g., which routing daemon installed
      the route), route type (e.g., unicast), table id and nexthop device. These
      amount to the low hanging fruit, yet a huge improvement, for dumping
      routes.
      
      ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
      be used to look up the device index without taking a reference. From
      there filter->dev is only used during dump loops with the lock still held.
      
      Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
      have been filtered should no entries be returned.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
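The filter semantics described in this commit (protocol, route type, table id, nexthop device) can be modelled in userspace. The following Python sketch is illustrative only and is not the kernel implementation: `Route`, `DumpFilter`, `matches` and `dump_routes` are hypothetical names, and treating an unset (`None`) field as "match anything" is an assumption of the sketch. The numeric constants mirror the Linux uapi constants of the same names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Values mirror the Linux uapi constants of the same names.
RT_TABLE_MAIN = 254
RT_TABLE_LOCAL = 255
RTPROT_KERNEL = 2
RTPROT_STATIC = 4
RTN_UNICAST = 1
RTN_LOCAL = 2


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified route record used only by this sketch."""
    prefix: str
    table: int
    protocol: int
    rtype: int
    dev: str


@dataclass(frozen=True)
class DumpFilter:
    """Hypothetical stand-in for a dump filter; None means 'match anything'."""
    table: Optional[int] = None
    protocol: Optional[int] = None
    rtype: Optional[int] = None
    dev: Optional[str] = None


def matches(route: Route, flt: DumpFilter) -> bool:
    """Return True when the route passes every filter field that is set."""
    checks = (
        (flt.table, route.table),
        (flt.protocol, route.protocol),
        (flt.rtype, route.rtype),
        (flt.dev, route.dev),
    )
    return all(want is None or want == have for want, have in checks)


def dump_routes(routes: Iterable[Route], flt: DumpFilter) -> List[Route]:
    """Filter at the source so unwanted routes never reach the caller."""
    return [r for r in routes if matches(r, flt)]


ROUTES = [
    Route("10.0.0.0/24", RT_TABLE_MAIN, RTPROT_KERNEL, RTN_UNICAST, "eth0"),
    Route("10.1.0.0/16", RT_TABLE_MAIN, RTPROT_STATIC, RTN_UNICAST, "eth1"),
    Route("10.0.0.1/32", RT_TABLE_LOCAL, RTPROT_KERNEL, RTN_LOCAL, "eth0"),
]
```

For example, `dump_routes(ROUTES, DumpFilter(protocol=RTPROT_STATIC))` keeps only the route installed with the static protocol, which is the kind of selection iproute2 previously had to make after receiving the full dump.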
    • net: Plumb support for filtering ipv4 and ipv6 multicast route dumps · cb167893
      David Ahern committed
      Implement kernel side filtering of routes by egress device index and
      table id. If the table id is given in the filter, lookup table and
      call mr_table_dump directly for it.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipmr: Refactor mr_rtm_dumproute · e1cedae1
      David Ahern committed
      Move per-table loops from mr_rtm_dumproute to mr_table_dump and export
      mr_table_dump for dumps by specific table id.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mpls: Plumb support for filtering route dumps · bae9a78b
      David Ahern committed
      Implement kernel side filtering of routes by egress device index and
      protocol. MPLS uses only a single table and route type.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Plumb support for filtering route dumps · 13e38901
      David Ahern committed
      Implement kernel side filtering of routes by table id, egress device
      index, protocol, and route type. If the table id is given in the filter,
      lookup the table and call fib6_dump_table directly for it.
      
      Move the existing route flags check for prefix only routes to the new
      filter.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4: Plumb support for filtering route dumps · 18a8021a
      David Ahern committed
      Implement kernel side filtering of routes by table id, egress device index,
      protocol and route type. If the table id is given in the filter, lookup the
      table and call fib_table_dump directly for it.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
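The benefit of looking up a requested table directly, rather than walking every table, can be shown with a small model. This is a Python sketch under stated assumptions, not kernel code: `dump` is a hypothetical helper, tables are modelled as a plain dictionary keyed by table id, and returning an empty result for an unknown table id is an assumption of the sketch.

```python
from typing import Dict, List, Optional, Tuple


def dump(tables: Dict[int, List[str]],
         table_id: Optional[int] = None) -> Tuple[List[str], int]:
    """Return (routes, number_of_tables_visited).

    With a table id in the filter, only that table is looked up and dumped.
    Without one, every table is walked.
    """
    if table_id is not None:
        table = tables.get(table_id)
        if table is None:
            # Assumption of this sketch: an unknown table yields an empty dump.
            return [], 0
        return list(table), 1

    routes: List[str] = []
    for table in tables.values():
        routes.extend(table)
    return routes, len(tables)


TABLES = {
    254: ["10.0.0.0/24", "10.1.0.0/16"],  # main table
    255: ["10.0.0.1/32"],                 # local table
    100: [],                              # an empty policy-routing table
}
```

With `TABLES` above, `dump(TABLES, 254)` visits one table, while `dump(TABLES)` visits all three.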
    • net: Add struct for fib dump filter · 4724676d
      David Ahern committed
      Add struct fib_dump_filter for options on limiting which routes are
      returned in a dump request. The current list is table id, protocol,
      route type, rtm_flags and nexthop device index. struct net is needed
      to lookup the net_device from the index.
      
      Declare the filter for each route dump handler and plumb the new
      arguments from dump handlers to ip_valid_fib_dump_req.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: Add answer_flags to netlink_callback · 22e6c58b
      David Ahern committed
      With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED
      flag is set on a message back to the user if the data returned is
      influenced by some input attributes. Normally this can be done as
      messages are added to the skb, but if the filter results in no data
      being returned, the user could be confused as to why.
      
      This patch adds answer_flags to the netlink_callback allowing dump
      handlers to set the NLM_F_DUMP_FILTERED at a minimum in the
      NLMSG_DONE message ensuring the flag gets back to the user.
      
      The netlink_callback space is initialized to 0 via a memset in
      __netlink_dump_start, so init of the new answer_flags is covered.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
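The reason for carrying the flag in the final message can be illustrated with a userspace model of a dump. This Python sketch is not the kernel code path: `Callback` and `run_dump` are hypothetical names, and messages are modelled as dictionaries. The constants `NLMSG_DONE`, `NLM_F_MULTI`, `NLM_F_DUMP_FILTERED` and `RTM_NEWROUTE` use the values from the Linux uapi headers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional

NLMSG_DONE = 0x3
NLM_F_MULTI = 0x2
NLM_F_DUMP_FILTERED = 0x20
RTM_NEWROUTE = 24


@dataclass
class Callback:
    """Hypothetical model of per-dump state; answer_flags starts at 0."""
    answer_flags: int = 0


def run_dump(entries: Iterable[int],
             predicate: Optional[Callable[[int], bool]] = None) -> List[Dict]:
    """Emit one message per matching entry, then a terminating DONE message."""
    cb = Callback()
    if predicate is not None:
        # The handler records that a filter influenced the result.
        cb.answer_flags |= NLM_F_DUMP_FILTERED

    messages: List[Dict] = []
    for entry in entries:
        if predicate is None or predicate(entry):
            messages.append({"type": RTM_NEWROUTE,
                             "flags": NLM_F_MULTI | cb.answer_flags,
                             "payload": entry})

    # The DONE message always carries answer_flags, so the flag reaches the
    # user even when the filter matched nothing.
    messages.append({"type": NLMSG_DONE,
                     "flags": NLM_F_MULTI | cb.answer_flags})
    return messages
```

A filter that matches nothing still produces a single DONE message with `NLM_F_DUMP_FILTERED` set, so the caller can tell an empty filtered result from an empty table.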
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · e8567951
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf-next 2018-10-16
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
         sk_msg BPF integration for the latter, from Daniel and John.
      
      2) Enable BPF syscall side to indicate for maps that they do not support
         a map lookup operation as opposed to just missing key, from Prashant.
      
      3) Add bpftool map create command which after map creation pins the
         map into bpf fs for further processing, from Jakub.
      
      4) Add bpftool support for attaching programs to maps allowing sock_map
         and sock_hash to be used from bpftool, from John.
      
      5) Improve syscall BPF map update/delete path for map-in-map types to
         wait for an RCU grace period for pending references to complete, from Daniel.
      
      6) Couple of follow-up fixes for the BPF socket lookup to get it
         enabled also when IPv6 is compiled as a module, from Joe.
      
      7) Fix a generic-XDP bug to handle the case when the Ethernet header
         was mangled and thus update skb's protocol and data, from Jesper.
      
      8) Add a missing BTF header length check between header copies from
         user space, from Wenwen.
      
      9) Minor fixups in libbpf to use __u32 instead of u32 types and include
         proper perf_event.h uapi header instead of perf internal one, from Yonghong.
      
      10) Allow passing user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
          to bpftool's build, from Jiri.
      
      11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
          with_addr.sh script from flow dissector selftest, from Anders.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: merge phy_start_aneg and phy_start_aneg_priv · c45d7150
      Heiner Kallweit committed
      After commit 9f2959b6 ("net: phy: improve handling delayed work")
      the sync parameter isn't needed any longer in phy_start_aneg_priv().
      This allows merging phy_start_aneg() and phy_start_aneg_priv().
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hv_netvsc: fix vf serial matching with pci slot info · 00547955
      Haiyang Zhang committed
      The VF device's serial number is saved as a string in PCI slot's
      kobj name, not the slot->number. This patch corrects the netvsc
      driver, so the VF device can be successfully paired with synthetic
      NIC.
      
      Fixes: 00d7ddba ("hv_netvsc: pair VF based on serial number")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-second-round-for-EDT-conversion' · b1394967
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      tcp: second round for EDT conversion
      
      First round of EDT patches left TCP stack in a non optimal state.
      
      - High speed flows suffered from loss of performance, addressed
        by the first patch of this series.
      
      - Second patch brings pacing to the current state of networking,
        since we now reach ~100 Gbit on a single TCP flow.
      
      - Third patch implements a mitigation for scheduling delays,
        like the one we did in sch_fq in the past.
      
      - Fourth patch removes one special case in sch_fq for ACK packets.
      
      - Fifth patch removes a serious performance cost for TCP internal
        pacing. We should set up the high resolution timer only if
        really needed.
      
      - Sixth patch fixes a typo in BBR.
      
      - Last patch is one minor change in cdg congestion control.
      
      Neal Cardwell also has a patch series fixing BBR after
      EDT adoption.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: cdg: use tcp high resolution clock cache · 825e1c52
      Eric Dumazet committed
      We store in tcp socket a cache of most recent high resolution
      clock, there is no need to call local_clock() again, since
      this cache is good enough.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp_bbr: fix typo in bbr_pacing_margin_percent · 97ec3eb3
      Neal Cardwell committed
      There was a typo in this parameter name.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: optimize tcp internal pacing · 864e5c09
      Eric Dumazet committed
      When TCP implements its own pacing (when no fq packet scheduler is used),
      it arms a high resolution timer after a packet is sent.
      
      But in many cases (like TCP_RR kind of workloads), this high resolution
      timer expires before the application attempts to write the following
      packet. This overhead also happens when the flow is ACK clocked and
      cwnd limited instead of being limited by the pacing rate.
      
      This leads to extra overhead (a high number of IRQs).
      
      Now that tcp_wstamp_ns is reserved for the pacing timer only
      (after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
      we can set up the timer only when a packet is about to be sent,
      and if tcp_wstamp_ns is in the future.
      
      This leads to a ~10% performance increase in TCP_RR workloads.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
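The effect of arming the timer only when a send would otherwise happen too early can be illustrated with a toy model. This Python sketch is a simplification and not the kernel logic: `count_timer_arms` is a hypothetical helper, `wstamp` stands in for the earliest allowed departure time of the next packet, and every packet is assumed to have the same air-time.

```python
from typing import Iterable, Tuple


def count_timer_arms(send_times_ns: Iterable[int],
                     air_time_ns: int) -> Tuple[int, int]:
    """Return (arms_when_armed_after_every_send, arms_when_armed_on_demand).

    send_times_ns are the times at which the application tries to write.
    """
    after_every_send = 0
    on_demand = 0
    wstamp = 0  # earliest departure time allowed for the next packet

    for now in send_times_ns:
        # On-demand policy: a timer is needed only if a packet is about to be
        # sent and its allowed departure time is still in the future.
        if wstamp > now:
            on_demand += 1
            now = wstamp  # the packet leaves when the timer fires

        # The packet is sent; the next one may not leave before its air-time.
        wstamp = max(wstamp, now) + air_time_ns

        # Eager policy: a timer is armed after each packet that is sent.
        after_every_send += 1

    return after_every_send, on_demand
```

In a request/response pattern such as `count_timer_arms([0, 1_000_000, 2_000_000], 10_000)`, each write happens long after the previous packet's air-time has elapsed, so the on-demand policy arms no timer at all while the eager policy arms one per packet.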
    • net_sched: sch_fq: no longer use skb_is_tcp_pure_ack() · 7baf33bd
      Eric Dumazet committed
      With the new EDT model, sch_fq no longer has to special
      case TCP pure acks, since their skb->tstamp will allow them
      being sent without pacing delay.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: mitigate scheduling jitter in EDT pacing model · a7a25630
      Eric Dumazet committed
      In commit fefa569a ("net_sched: sch_fq: account for schedule/timers
      drifts") we added a mitigation for scheduling jitter in fq packet scheduler.
      
      This patch does the same in the TCP stack, now that it is using the EDT model.
      
      Note that this mitigation is valid for both external (fq packet scheduler)
      and internal TCP pacing.
      
      This uses the same strategy as the above commit, allowing
      a time credit of half the packet currently sent.
      
      Consider the following case:
      
      An skb is sent, after an idle period of 300 usec.
      The air-time (skb->len/pacing_rate) is 500 usec
      Instead of setting the pacing timer to now+500 usec,
      it will use now+min(500/2, 300) -> now+250usec
      
      This is like having a token bucket with a depth of half
      an skb.
      
      Tested:
      
      tc qdisc replace dev eth0 root pfifo_fast
      
      Before
      netperf -P0 -H remote -- -q 1000000000 # 8000Mbit
      540000 262144 262144    10.00    7710.43
      
      After :
      netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit
      540000 262144 262144    10.00    7999.75   # Much closer to 8000Mbit target
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
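The worked case in this commit message (300 usec idle, 500 usec air-time, timer at now+250 usec) can be reproduced with a short calculation. This Python sketch encodes one reading of "a time credit of half the packet currently sent": the credit is the idle time, capped at half the air-time, and it is subtracted from the air-time. The function name `next_pacing_delay_us` and that exact formulation are assumptions of the sketch, not a quote of the kernel code.

```python
def next_pacing_delay_us(air_time_us: int, idle_us: int) -> int:
    """Delay until the next packet may be sent, in microseconds.

    air_time_us: skb->len / pacing_rate for the packet just sent.
    idle_us: how long the flow was idle before this packet.
    """
    # The credit behaves like a token bucket whose depth is half a packet.
    credit_us = min(air_time_us // 2, idle_us)
    return air_time_us - credit_us
```

For the case above, `next_pacing_delay_us(500, 300)` gives 250. With no idle time the full 500 usec air-time applies, and a long idle period never reduces the delay below half the air-time.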
    • net: extend sk_pacing_rate to unsigned long · 76a9ebe8
      Eric Dumazet committed
      sk_pacing_rate was introduced as a u32 field in 2013,
      effectively limiting per flow pacing to 34Gbit.
      
      We believe it is time to allow TCP to pace high speed flows
      on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
      
      This patch adds no cost for 32bit kernels.
      
      The tcpi_pacing_rate and tcpi_max_pacing_rate were already
      exported as 64bit, so the iproute2 ss command requires no changes.
      
      Unfortunately the SO_MAX_PACING_RATE socket option will stay
      32bit and we will need to add a new option to let applications
      control high pacing rates.
      
      State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
      ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                       timer:(on,003ms,0) ino:91863 sk:2 <->
       skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
       ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
       rcvmss:536 advmss:1448
       cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
       segs_in:3916318 data_segs_out:177279175
       bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
       send 28045.5Mbps lastrcv:73333
       pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
       busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
       notsent:2085120 minrtt:0.013
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
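The 34Gbit ceiling quoted in this commit follows from the width of the old field. The Python sketch below works through the arithmetic, assuming the pacing rate is stored in bytes per second (the unit used by SO_MAX_PACING_RATE); the helper names are hypothetical.

```python
U32_MAX = 2**32 - 1  # largest value a u32 field can hold


def bytes_per_sec_to_gbit(rate_bytes_per_sec: int) -> float:
    """Convert a rate in bytes per second to Gbit per second."""
    return rate_bytes_per_sec * 8 / 1e9


def mbps_to_bytes_per_sec(rate_mbps: float) -> float:
    """Convert a rate in Mbit per second to bytes per second."""
    return rate_mbps * 1e6 / 8


# Highest pacing rate a u32 field can express, in Gbit per second.
U32_LIMIT_GBIT = bytes_per_sec_to_gbit(U32_MAX)

# The pacing_rate shown in the ss output of this commit message.
SS_PACING_RATE_BYTES = mbps_to_bytes_per_sec(38705.0)
```

`U32_LIMIT_GBIT` comes out at roughly 34.36, and the 38705.0Mbps pacing rate from the ss output corresponds to more bytes per second than `U32_MAX`, so it could not have been stored in the old field.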
    • tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh · 5f6188a8
      Eric Dumazet committed
      In EDT design, I made the mistake of using tcp_wstamp_ns
      to store the last tcp_clock_ns() sample and to store the
      pacing virtual timer.
      
      This causes major regressions for high speed flows.
      
      Introduce tcp_clock_cache to store last tcp_clock_ns().
      This is needed because some arches have slow high-resolution
      kernel time service.
      
      tcp_wstamp_ns is only updated when a packet is sent.
      
      Note that we can remove tcp_mstamp in the future since
      tcp_mstamp is essentially tcp_clock_cache/1000, so the
      apparent socket size increase is temporary.
      
      Fixes: 9799ccb0 ("tcp: add tcp_wstamp_ns socket field")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: fix a possible memory leak in __vlan_add · 1a3aea25
      Li RongQing committed
      Since per-port vlan stats were added, the vlan stats should be
      released when adding a vlan fails.
      
      Fixes: 9163a0fc ("net: bridge: add support for per-port vlan stats")
      CC: bridge@lists.linux-foundation.org
      cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>