1. 05 Mar, 2015 4 commits
  2. 04 Mar, 2015 1 commit
    • neigh: Factor out ___neigh_lookup_noref · 60395a20
      Committed by Eric W. Biederman
      While looking at the mpls code I found myself writing yet another
      version of neigh_lookup_noref.  We currently have __ipv4_lookup_noref
      and __ipv6_lookup_noref.
      
      So to make my work a little easier and to make it a smidge easier to
      verify/maintain the mpls code in the future I stopped and wrote
      ___neigh_lookup_noref.  Then I rewrote __ipv4_lookup_noref and
      __ipv6_lookup_noref in terms of this new function.  I tested my new
      version by verifying that the same code is generated in
      ip_finish_output2 and ip6_finish_output2 where these functions are
      inlined.
      
      To get to ___neigh_lookup_noref I added a new neighbour cache table
      function, key_eq, so that the static size of the key would be
      available.
      
      I also added __neigh_lookup_noref for people who want to look up
      a neighbour table entry quickly but don't know which neighbour table
      they are going to look up.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
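The idea of one generic lookup that takes the hash and key comparison as arguments can be sketched in plain C. This is a simplified userspace illustration, not the kernel code: `struct entry`, `struct table`, `generic_lookup`, `ipv4_hash`, `ipv4_key_eq` and `ipv4_lookup` are hypothetical names, and RCU and reference counting are left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the neighbour structures. */
struct entry {
    struct entry *next;
    const void *key;
};

struct table {
    struct entry **buckets;
    size_t nbuckets;            /* assumed to be a power of two */
};

typedef uint32_t (*hash_fn)(const void *key);
typedef bool (*key_eq_fn)(const struct entry *e, const void *key);

/* Generic lookup: hash and key_eq come from the caller, so once this is
 * inlined the compiler can see the fixed key size at each call site. */
static inline struct entry *generic_lookup(const struct table *t,
                                           hash_fn hash, key_eq_fn key_eq,
                                           const void *key)
{
    struct entry *e = t->buckets[hash(key) & (t->nbuckets - 1)];

    for (; e != NULL; e = e->next)
        if (key_eq(e, key))
            return e;
    return NULL;
}

/* IPv4 specialisation: the key is always exactly 4 bytes. */
static uint32_t ipv4_hash(const void *key)
{
    uint32_t k;

    memcpy(&k, key, sizeof(k));
    return k * 2654435761u;     /* arbitrary multiplicative hash */
}

static bool ipv4_key_eq(const struct entry *e, const void *key)
{
    return memcmp(e->key, key, 4) == 0;
}

static inline struct entry *ipv4_lookup(const struct table *t, uint32_t addr)
{
    return generic_lookup(t, ipv4_hash, ipv4_key_eq, &addr);
}
```

A per-protocol wrapper such as `ipv4_lookup` stays a one-liner, which is the maintenance benefit the commit message describes.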
  3. 03 Mar, 2015 5 commits
  4. 02 Mar, 2015 1 commit
  5. 01 Mar, 2015 4 commits
    • tcp: cleanup static functions · 74abc20c
      Committed by Eric Dumazet
      tcp_fastopen_create_child() is static and should not be exported.
      
      tcp4_gso_segment() and tcp6_gso_segment() should be static.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: tso: allow CA_CWR state in tcp_tso_should_defer() · a0ea700e
      Committed by Eric Dumazet
      Another TCP issue is triggered by ECN.
      
      Under pressure, the receiver gets ECN marks and sends back ACK packets
      with the ECE TCP flag. Senders enter the CA_CWR state.
      
      In this state, tcp_tso_should_defer() is short-circuited:
      
      if (icsk->icsk_ca_state != TCP_CA_Open)
          goto send_now;
      
      This means that nearly all ACK packets we receive trigger
      a partial send, and because cwnd is kept small, we can only send
      a small amount of data for each incoming ACK,
      which in turn generates more ACK packets.
      
      Allowing CA_Open and CA_CWR states to enable TSO defer in
      tcp_tso_should_defer() brings performance back:
      TSO autodefer has more chance to defer under pressure.
      
      This patch increases TSO and LRO/GRO efficiency back to normal levels,
      and does not impact overall ECN behavior.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
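The change in the state check can be illustrated with a small sketch. The enum below is a local stand-in that is assumed to mirror the order of the kernel's congestion-avoidance states, and `may_defer_before` / `may_defer_after` are hypothetical helper names; the bitmask form is one plausible way to express "Open or CWR", not necessarily the exact expression used in the patch.

```c
#include <stdbool.h>

/* Local stand-ins; assumed to follow the order of the kernel's states. */
enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

/* Behaviour described for the old code: only Open may defer. */
static bool may_defer_before(enum ca_state s)
{
    return s == CA_OPEN;
}

/* Behaviour described for the new code: Open and CWR may defer. */
static bool may_defer_after(enum ca_state s)
{
    const unsigned allowed = (1u << CA_OPEN) | (1u << CA_CWR);

    return ((1u << s) & allowed) != 0;
}
```

With this check a sender that has entered CWR because of ECE marks is still allowed to wait for more ACKs before sending, instead of emitting a small segment per ACK.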
    • tcp: tso: restore IW10 after TSO autosizing · 50c8339e
      Committed by Eric Dumazet
      With sysctl_tcp_min_tso_segs being 4, it is very possible
      that tcp_tso_should_defer() decides not to send the last 2 MSS
      of an initial window of 10 packets. This also applies if
      autosizing decides to send X MSS per GSO packet, and cwnd
      is not a multiple of X.
      
      This patch implements a heuristic based on the age of the first
      skb in the write queue: if it was sent very recently (less than half srtt),
      we can predict that no ACK packet will come in less than half an RTT,
      so deferring might cause underutilization of our window.
      
      This is visible on the initial send (IW10) on web servers,
      but more generally on some RPC, as the last part of the message
      might need an extra RTT to get delivered.
      
      Tested:
      
      Ran the following packetdrill test:
      // A simple server-side test that sends exactly an initial window (IW10)
      // worth of packets.
      
      `sysctl -e -q net.ipv4.tcp_min_tso_segs=4`
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +.1   < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.1   < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      +0    write(4, ..., 14600) = 14600
      +0    > . 1:5841(5840) ack 1 win 457
      +0    > . 5841:11681(5840) ack 1 win 457
      // Following packet should be sent right now.
      +0    > P. 11681:14601(2920) ack 1 win 457
      
      +.1   < . 1:1(0) ack 14601 win 257
      
      +0    close(4) = 0
      +0    > F. 14601:14601(0) ack 1
      +.1   < F. 1:1(0) ack 14602 win 257
      +0    > . 14602:14602(0) ack 2
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
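The half-srtt heuristic can be sketched as a single comparison. In the packetdrill trace above, the 14600-byte write goes out as 4 + 4 + 2 MSS (5840 + 5840 + 2920 bytes), and the last 2 MSS are the part that would otherwise wait. The function name `head_sent_recently` and the microsecond units are assumptions made for this illustration; the real code works on kernel timestamps.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when the oldest unacknowledged segment was sent less than
 * half a smoothed RTT ago. In that case no ACK is expected soon, so a
 * sender following the heuristic would transmit now rather than defer.
 */
static bool head_sent_recently(uint32_t now_us, uint32_t head_sent_us,
                               uint32_t srtt_us)
{
    uint32_t age_us = now_us - head_sent_us;  /* unsigned math handles wrap */

    return age_us < srtt_us / 2;
}
```

For a burst written in one go, the head of the queue is only a few microseconds old, so the check fires and the tail of the initial window is sent immediately.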
    • tcp: tso: remove tp->tso_deferred · 5f852eb5
      Committed by Eric Dumazet
      TSO relies on the ability to defer sending a small number of packets.
      The heuristic is to wait for future ACKs in the hope of sending more
      packets at once.
      The current algorithm uses a per-socket tso_deferred field as a pseudo timer.
      
      This pseudo timer relies on future ACKs, but there is no guarantee
      we receive them in time.
      
      The fix would be to use a real timer, but such a timer is probably too
      expensive for typical cases.
      
      This patch changes the logic to test the time of last transmit,
      because we should not add bursts of more than 1ms for any given flow.
      
      We've used this patch for about two years at Google, before FQ/pacing,
      as it would reduce a fair amount of bursts.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
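Replacing the pseudo timer with a test on the last transmit time can be sketched as follows. This assumes a millisecond tick and a hypothetical helper name, `last_tx_too_old`; the kernel works on its own timestamp units, so the 1 ms granularity here is only an assumption that matches the bound stated in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when at least one tick (assumed to be 1 ms) has passed
 * since the flow last transmitted, meaning it should send now instead of
 * deferring any longer. The signed cast keeps the comparison correct
 * across counter wrap-around.
 */
static bool last_tx_too_old(uint32_t now_ms, uint32_t last_tx_ms)
{
    return (int32_t)(now_ms - last_tx_ms) > 0;
}
```

Because the check only reads a timestamp that is already maintained on every transmit, it needs no extra per-socket state and no timer.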
  6. 28 Feb, 2015 6 commits
  7. 23 Feb, 2015 1 commit
  8. 21 Feb, 2015 3 commits
  9. 15 Feb, 2015 1 commit
  10. 13 Feb, 2015 2 commits
  11. 12 Feb, 2015 6 commits
    • mm: page_counter: pull "-1" handling out of page_counter_memparse() · 650c5e56
      Committed by Johannes Weiner
      The unified hierarchy interface for memory cgroups will no longer use "-1"
      to mean maximum possible resource value.  In preparation for this, make
      the string an argument and let the caller supply it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
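Passing the "maximum" literal in from the caller can be illustrated with a small userspace analogue. `parse_limit` is a hypothetical function written for this sketch; it is not the kernel's page_counter_memparse(), does not handle size suffixes, and uses `ULLONG_MAX` as a stand-in for the maximum counter value.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue: the caller supplies the string meaning "no limit". */
static int parse_limit(const char *buf, const char *max_str,
                       unsigned long long *out)
{
    char *end;
    unsigned long long v;

    if (strcmp(buf, max_str) == 0) {
        *out = ULLONG_MAX;
        return 0;
    }

    /* strtoull() would silently negate a leading '-', so require a digit. */
    if (buf[0] < '0' || buf[0] > '9')
        return -EINVAL;

    errno = 0;
    v = strtoull(buf, &end, 10);
    if (errno != 0 || *end != '\0')
        return -EINVAL;

    *out = v;
    return 0;
}
```

A legacy interface could keep calling `parse_limit(buf, "-1", &v)` while a newer one passes a different literal, which is the flexibility the commit prepares for.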
    • gue: Use checksum partial with remote checksum offload · fe881ef1
      Committed by Tom Herbert
      Change remote checksum handling to set checksum partial as default
      behavior. Added an iflink parameter to configure not using
      checksum partial (calling csum_partial to update checksum).
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Infrastructure for CHECKSUM_PARTIAL with remote checksum offload · 15e2396d
      Committed by Tom Herbert
      This patch adds infrastructure so that remote checksum offload can
      set CHECKSUM_PARTIAL instead of calling csum_partial and writing
      the modified checksum field.
      
      Add skb_remcsum_adjust_partial function to set an skb for using
      CHECKSUM_PARTIAL with remote checksum offload.  Changed
      skb_remcsum_process and skb_gro_remcsum_process to take a boolean
      argument to indicate if checksum partial can be set or the
      checksum needs to be modified using the normal algorithm.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Set SKB_GSO_UDP_TUNNEL* in UDP GRO path · 6db93ea1
      Committed by Tom Herbert
      Properly set GSO types and skb->encapsulation in the UDP tunnel GRO
      complete so that packets are properly represented for GSO. This sets
      SKB_GSO_UDP_TUNNEL or SKB_GSO_UDP_TUNNEL_CSUM depending on whether
      non-zero checksums were received, and sets SKB_GSO_TUNNEL_REMCSUM if
      the remote checksum option was processed.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Fix remcsum in GRO path to not change packet · 26c4f7da
      Committed by Tom Herbert
      Remote checksum offload processing is currently the same for both
      the GRO and non-GRO path. When the remote checksum offload option
      is encountered, the checksum field referred to is modified in
      the packet. So in the GRO case, the packet is modified in the
      GRO path and then the operation is skipped when the packet goes
      through the normal path based on skb->remcsum_offload. There is
      a problem in that the packet may be modified in the GRO path, but
      then forwarded off host still containing the remote checksum option.
      A remote host will again perform RCO but now the checksum verification
      will fail since GRO RCO already modified the checksum.
      
      To fix this, we ensure that GRO restores a packet to its original
      state before returning. In this model, when GRO processes a remote
      checksum option it still changes the checksum per the algorithm
      but on return from lower layer processing the checksum is restored
      to its original value.
      
      In this patch we define a gro_remcsum structure which is passed
      to skb_gro_remcsum_process to save the offset and delta for the checksum
      being changed. After lower layer processing, skb_gro_remcsum_cleanup
      is called to restore the checksum before returning from GRO.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cipso: don't use IPCB() to locate the CIPSO IP option · 04f81f01
      Committed by Paul Moore
      Using the IPCB() macro to get the IPv4 options is convenient, but
      unfortunately NetLabel often needs to examine the CIPSO option outside
      of the scope of the IP layer in the stack.  While historically IPCB()
      worked above the IP layer, due to the inclusion of the inet_skb_param
      struct at the head of the {tcp,udp}_skb_cb structs, recent commit
      971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      reordered the tcp_skb_cb struct and invalidated this IPCB() trick.
      
      This patch fixes the problem by creating a new function,
      cipso_v4_optptr(), which locates the CIPSO option inside the IP header
      without calling IPCB().  Unfortunately, this isn't as fast as a simple
      lookup so some additional tweaks were made to limit the use of this
      new function.
      
      Cc: <stable@vger.kernel.org> # 3.18
      Reported-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Tested-by: Casey Schaufler <casey@schaufler-ca.com>
  12. 10 Feb, 2015 2 commits
  13. 09 Feb, 2015 2 commits
    • net: rfs: add hash collision detection · 567e4b79
      Committed by Eric Dumazet
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, once the flow hash table is populated.
      
      Also, clearing the flow in inet_release() makes RFS not very good
      for short-lived flows, as many packets can follow close()
      (FIN, ACK packets, ...).
      
      This patch extends the information stored in the global hash table
      to include not only the cpu number, but also the upper part of the
      hash value.
      
      I use a 32-bit value, and dynamically split it in two parts.
      
      For a host with fewer than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection uses the low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in the flow table does not match, we fall back to RPS
      (if it is enabled for the rxqueue).
      
      This means that a packet for a non-connected flow can avoid the
      IPI through an unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short-lived flow performance.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
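The split of a 32-bit table entry into a cpu number and the upper hash bits can be sketched like this. All function names (`cpu_mask_for`, `flow_entry_pack`, `flow_entry_matches`, `flow_entry_cpu`) are hypothetical and chosen for this illustration; the sketch only shows the bit arithmetic described above, not the surrounding RFS table handling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Smallest mask of the form 2^k - 1 that covers cpu ids 0..nr_cpus-1. */
static uint32_t cpu_mask_for(uint32_t nr_cpus)
{
    uint32_t mask = 0;

    while (mask + 1 < nr_cpus)
        mask = (mask << 1) | 1;
    return mask;
}

/* Low bits hold the cpu number, the remaining high bits hold the hash. */
static uint32_t flow_entry_pack(uint32_t hash, uint32_t cpu, uint32_t cpu_mask)
{
    return (hash & ~cpu_mask) | (cpu & cpu_mask);
}

/* A stored entry belongs to this flow only if the upper hash bits agree. */
static bool flow_entry_matches(uint32_t entry, uint32_t hash, uint32_t cpu_mask)
{
    return ((entry ^ hash) & ~cpu_mask) == 0;
}

static uint32_t flow_entry_cpu(uint32_t entry, uint32_t cpu_mask)
{
    return entry & cpu_mask;
}
```

With 64 possible cpus the mask is 0x3f, leaving 26 bits of hash in each entry. When `flow_entry_matches` fails, the entry was written by a different flow that landed in the same bucket, and the receive path can fall back to RPS as the message describes.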
    • gre/ipip: use be16 variants of netlink functions · 3e97fa70
      Committed by Sabrina Dubroca
      encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead
      of nla_{get,put}_u16.
      
      Fixes the sparse warnings:
      
      warning: incorrect type in assignment (different base types)
         expected restricted __be32 [addressable] [usertype] o_key
         got restricted __be16 [addressable] [usertype] i_flags
      warning: incorrect type in assignment (different base types)
         expected restricted __be16 [usertype] sport
         got unsigned short
      warning: incorrect type in assignment (different base types)
         expected restricted __be16 [usertype] dport
         got unsigned short
      warning: incorrect type in argument 3 (different base types)
         expected unsigned short [unsigned] [usertype] value
         got restricted __be16 [usertype] sport
      warning: incorrect type in argument 3 (different base types)
         expected unsigned short [unsigned] [usertype] value
         got restricted __be16 [usertype] dport
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 08 Feb, 2015 2 commits