1. 16 10月, 2018 14 次提交
    • D
      net: Enable kernel side filtering of route dumps · effe6792
      David Ahern 提交于
      Update parsing of route dump request to enable kernel side filtering.
      Allow filtering results by protocol (e.g., which routing daemon installed
      the route), route type (e.g., unicast), table id and nexthop device. These
      amount to the low hanging fruit, yet a huge improvement, for dumping
      routes.
      
      ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
      be used to look up the device index without taking a reference. From
      there filter->dev is only used during dump loops with the lock still held.
      
      Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
      have been filtered should no entries be returned.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      effe6792
    • D
      net: Plumb support for filtering ipv4 and ipv6 multicast route dumps · cb167893
      David Ahern 提交于
      Implement kernel side filtering of routes by egress device index and
      table id. If the table id is given in the filter, lookup table and
      call mr_table_dump directly for it.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb167893
    • D
      ipmr: Refactor mr_rtm_dumproute · e1cedae1
      David Ahern 提交于
      Move per-table loops from mr_rtm_dumproute to mr_table_dump and export
      mr_table_dump for dumps by specific table id.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1cedae1
    • D
      net/ipv4: Plumb support for filtering route dumps · 18a8021a
      David Ahern 提交于
      Implement kernel side filtering of routes by table id, egress device index,
      protocol and route type. If the table id is given in the filter, lookup the
      table and call fib_table_dump directly for it.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18a8021a
    • D
      net: Add struct for fib dump filter · 4724676d
      David Ahern 提交于
      Add struct fib_dump_filter for options on limiting which routes are
      returned in a dump request. The current list is table id, protocol,
      route type, rtm_flags and nexthop device index. struct net is needed
      to lookup the net_device from the index.
      
      Declare the filter for each route dump handler and plumb the new
      arguments from dump handlers to ip_valid_fib_dump_req.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4724676d
    • E
      tcp: cdg: use tcp high resolution clock cache · 825e1c52
      Eric Dumazet 提交于
      We store in tcp socket a cache of most recent high resolution
      clock, there is no need to call local_clock() again, since
      this cache is good enough.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      825e1c52
    • N
      tcp_bbr: fix typo in bbr_pacing_margin_percent · 97ec3eb3
      Neal Cardwell 提交于
      There was a typo in this parameter name.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      97ec3eb3
    • E
      tcp: optimize tcp internal pacing · 864e5c09
      Eric Dumazet 提交于
      When TCP implements its own pacing (when no fq packet scheduler is used),
      it is arming high resolution timer after a packet is sent.
      
      But in many cases (like TCP_RR kind of workloads), this high resolution
      timer expires before the application attempts to write the following
      packet. This overhead also happens when the flow is ACK clocked and
      cwnd limited instead of being limited by the pacing rate.
      
      This leads to extra overhead (high number of IRQ)
      
      Now tcp_wstamp_ns is reserved for the pacing timer only
      (after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
      we can setup the timer only when a packet is about to be sent,
      and if tcp_wstamp_ns is in the future.
      
      This leads to a ~10% performance increase in TCP_RR workloads.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      864e5c09
    • E
      tcp: mitigate scheduling jitter in EDT pacing model · a7a25630
      Eric Dumazet 提交于
      In commit fefa569a ("net_sched: sch_fq: account for schedule/timers
      drifts") we added a mitigation for scheduling jitter in fq packet scheduler.
      
      This patch does the same in TCP stack, now it is using EDT model.
      
      Note that this mitigation is valid for both external (fq packet scheduler)
      or internal TCP pacing.
      
      This uses the same strategy than the above commit, allowing
      a time credit of half the packet currently sent.
      
      Consider following case :
      
      An skb is sent, after an idle period of 300 usec.
      The air-time (skb->len/pacing_rate) is 500 usec
      Instead of setting the pacing timer to now+500 usec,
      it will use now+min(500/2, 300) -> now+250usec
      
      This is like having a token bucket with a depth of half
      an skb.
      
      Tested:
      
      tc qdisc replace dev eth0 root pfifo_fast
      
      Before
      netperf -P0 -H remote -- -q 1000000000 # 8000Mbit
      540000 262144 262144    10.00    7710.43
      
      After :
      netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit
      540000 262144 262144    10.00    7999.75   # Much closer to 8000Mbit target
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a7a25630
    • E
      net: extend sk_pacing_rate to unsigned long · 76a9ebe8
      Eric Dumazet 提交于
      sk_pacing_rate has beed introduced as a u32 field in 2013,
      effectively limiting per flow pacing to 34Gbit.
      
      We believe it is time to allow TCP to pace high speed flows
      on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
      
      This patch adds no cost for 32bit kernels.
      
      The tcpi_pacing_rate and tcpi_max_pacing_rate were already
      exported as 64bit, so iproute2/ss command require no changes.
      
      Unfortunately the SO_MAX_PACING_RATE socket option will stay
      32bit and we will need to add a new option to let applications
      control high pacing rates.
      
      State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
      ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                       timer:(on,003ms,0) ino:91863 sk:2 <->
       skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
       ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
       rcvmss:536 advmss:1448
       cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
       segs_in:3916318 data_segs_out:177279175
       bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
       send 28045.5Mbps lastrcv:73333
       pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
       busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
       notsent:2085120 minrtt:0.013
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76a9ebe8
    • E
      tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh · 5f6188a8
      Eric Dumazet 提交于
      In EDT design, I made the mistake of using tcp_wstamp_ns
      to store the last tcp_clock_ns() sample and to store the
      pacing virtual timer.
      
      This causes major regressions at high speed flows.
      
      Introduce tcp_clock_cache to store last tcp_clock_ns().
      This is needed because some arches have slow high-resolution
      kernel time service.
      
      tcp_wstamp_ns is only updated when a packet is sent.
      
      Note that we can remove tcp_mstamp in the future since
      tcp_mstamp is essentially tcp_clock_cache/1000, so the
      apparent socket size increase is temporary.
      
      Fixes: 9799ccb0 ("tcp: add tcp_wstamp_ns socket field")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5f6188a8
    • D
      bpf, sockmap: convert to generic sk_msg interface · 604326b4
      Daniel Borkmann 提交于
      Add a generic sk_msg layer, and convert current sockmap and later
      kTLS over to make use of it. While sk_buff handles network packet
      representation from netdevice up to socket, sk_msg handles data
      representation from application to socket layer.
      
      This means that sk_msg framework spans across ULP users in the
      kernel, and enables features such as introspection or filtering
      of data with the help of BPF programs that operate on this data
      structure.
      
      Latter becomes in particular useful for kTLS where data encryption
      is deferred into the kernel, and as such enabling the kernel to
      perform L7 introspection and policy based on BPF for TLS connections
      where the record is being encrypted after BPF has run and came to
      a verdict. In order to get there, first step is to transform open
      coding of scatter-gather list handling into a common core framework
      that subsystems can use.
      
      The code itself has been split and refactored into three bigger
      pieces: i) the generic sk_msg API which deals with managing the
      scatter gather ring, providing helpers for walking and mangling,
      transferring application data from user space into it, and preparing
      it for BPF pre/post-processing, ii) the plain sock map itself
      where sockets can be attached to or detached from; these bits
      are independent of i) which can now be used also without sock
      map, and iii) the integration with plain TCP as one protocol
      to be used for processing L7 application data (later this could
      e.g. also be extended to other protocols like UDP). The semantics
      are the same with the old sock map code and therefore no change
      of user facing behavior or APIs. While pursuing this work it
      also helped finding a number of bugs in the old sockmap code
      that we've fixed already in earlier commits. The test_sockmap
      kselftest suite passes through fine as well.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      604326b4
    • D
      tcp, ulp: remove ulp bits from sockmap · 1243a51f
      Daniel Borkmann 提交于
      In order to prepare sockmap logic to be used in combination with kTLS
      we need to detangle it from ULP, and further split it in later commits
      into a generic API.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      1243a51f
    • D
      tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup · 8b9088f8
      Daniel Borkmann 提交于
      Whenever the ULP data on the socket is mangled, enforce that the
      caller has the socket lock held as otherwise things may race with
      initialization and cleanup callbacks from ulp ops as both would
      mangle internal socket state.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      8b9088f8
  2. 13 10月, 2018 1 次提交
    • D
      net: Evict neighbor entries on carrier down · 859bd2ef
      David Ahern 提交于
      When a link's carrier goes down it could be a sign of the port changing
      networks. If the new network has overlapping addresses with the old one,
      then the kernel will continue trying to use neighbor entries established
      based on the old network until the entries finally age out - meaning a
      potentially long delay with communications not working.
      
      This patch evicts neighbor entries on carrier down with the exception of
      those marked permanent. Permanent entries are managed by userspace (either
      an admin or a routing daemon such as FRR).
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      859bd2ef
  3. 11 10月, 2018 3 次提交
  4. 09 10月, 2018 5 次提交
  5. 08 10月, 2018 1 次提交
    • J
      udp: Unbreak modules that rely on external __skb_recv_udp() availability · 7e823644
      Jiri Kosina 提交于
      Commit 2276f58a ("udp: use a separate rx queue for packet reception")
      turned static inline __skb_recv_udp() from being a trivial helper around
      __skb_recv_datagram() into a UDP specific implementaion, making it
      EXPORT_SYMBOL_GPL() at the same time.
      
      There are external modules that got broken by __skb_recv_udp() not being
      visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().
      
      Rationale (one of those) why this is actually "technically correct" thing
      to do: __skb_recv_udp() used to be an inline wrapper around
      __skb_recv_datagram(), which itself (still, and correctly so, I believe)
      is EXPORT_SYMBOL().
      
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e823644
  6. 06 10月, 2018 1 次提交
  7. 05 10月, 2018 4 次提交
  8. 03 10月, 2018 6 次提交
  9. 02 10月, 2018 5 次提交