1. 03 3月, 2015 3 次提交
  2. 01 3月, 2015 1 次提交
  3. 13 2月, 2015 1 次提交
  4. 12 2月, 2015 3 次提交
    • T
      vxlan: Use checksum partial with remote checksum offload · 0ace2ca8
      Tom Herbert 提交于
      Change remote checksum handling to set checksum partial as default
      behavior. Added an iflink parameter to configure not using
      checksum partial (calling csum_partial to update checksum).
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ace2ca8
    • T
      net: Fix remcsum in GRO path to not change packet · 26c4f7da
      Tom Herbert 提交于
      Remote checksum offload processing is currently the same for both
      the GRO and non-GRO path. When the remote checksum offload option
      is encountered, the checksum field referred to is modified in
      the packet. So in the GRO case, the packet is modified in the
      GRO path and then the operation is skipped when the packet goes
      through the normal path based on skb->remcsum_offload. There is
      a problem in that the packet may be modified in the GRO path, but
      then forwarded off host still containing the remote checksum option.
      A remote host will again perform RCO but now the checksum verification
      will fail since GRO RCO already modified the checksum.
      
      To fix this, we ensure that GRO restores a packet to it's original
      state before returning. In this model, when GRO processes a remote
      checksum option it still changes the checksum per the algorithm
      but on return from lower layer processing the checksum is restored
      to its original value.
      
      In this patch we add define gro_remcsum structure which is passed
      to skb_gro_remcsum_process to save offset and delta for the checksum
      being changed. After lower layer processing, skb_gro_remcsum_cleanup
      is called to restore the checksum before returning from GRO.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26c4f7da
    • P
      cipso: don't use IPCB() to locate the CIPSO IP option · 04f81f01
      Paul Moore 提交于
      Using the IPCB() macro to get the IPv4 options is convenient, but
      unfortunately NetLabel often needs to examine the CIPSO option outside
      of the scope of the IP layer in the stack.  While historically IPCB()
      worked above the IP layer, due to the inclusion of the inet_skb_param
      struct at the head of the {tcp,udp}_skb_cb structs, recent commit
      971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      reordered the tcp_skb_cb struct and invalidated this IPCB() trick.
      
      This patch fixes the problem by creating a new function,
      cipso_v4_optptr(), which locates the CIPSO option inside the IP header
      without calling IPCB().  Unfortunately, this isn't as fast as a simple
      lookup so some additional tweaks were made to limit the use of this
      new function.
      
      Cc: <stable@vger.kernel.org> # 3.18
      Reported-by: NCasey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Tested-by: NCasey Schaufler <casey@schaufler-ca.com>
      04f81f01
  5. 10 2月, 2015 3 次提交
  6. 09 2月, 2015 1 次提交
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet 提交于
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567e4b79
  7. 08 2月, 2015 2 次提交
    • N
      tcp: mitigate ACK loops for connections as tcp_request_sock · a9b2c06d
      Neal Cardwell 提交于
      In the SYN_RECV state, where the TCP connection is represented by
      tcp_request_sock, we now rate-limit SYNACKs in response to a client's
      retransmitted SYNs: we do not send a SYNACK in response to client SYN
      if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
      since we last sent a SYNACK in response to a client's retransmitted
      SYN.
      
      This allows the vast majority of legitimate client connections to
      proceed unimpeded, even for the most aggressive platforms, iOS and
      MacOS, which actually retransmit SYNs 1-second intervals for several
      times in a row. They use SYN RTO timeouts following the progression:
      1,1,1,1,1,2,4,8,16,32.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9b2c06d
    • N
      tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell 提交于
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to thus elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing analogous
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      032ee423
  8. 06 2月, 2015 1 次提交
    • E
      net: ipv6: allow explicitly choosing optimistic addresses · c58da4c6
      Erik Kline 提交于
      RFC 4429 ("Optimistic DAD") states that optimistic addresses
      should be treated as deprecated addresses.  From section 2.1:
      
         Unless noted otherwise, components of the IPv6 protocol stack
         should treat addresses in the Optimistic state equivalently to
         those in the Deprecated state, indicating that the address is
         available for use but should not be used if another suitable
         address is available.
      
      Optimistic addresses are indeed avoided when other addresses are
      available (i.e. at source address selection time), but they have
      not heretofore been available for things like explicit bind() and
      sendmsg() with struct in6_pktinfo, etc.
      
      This change makes optimistic addresses treated more like
      deprecated addresses than tentative ones.
      Signed-off-by: NErik Kline <ek@google.com>
      Acked-by: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c58da4c6
  9. 05 2月, 2015 5 次提交
    • E
      ipv6: fix sparse errors in ip6_make_flowlabel() · 67765146
      Eric Dumazet 提交于
      include/net/ipv6.h:713:22: warning: incorrect type in assignment (different base types)
      include/net/ipv6.h:713:22:    expected restricted __be32 [usertype] hash
      include/net/ipv6.h:713:22:    got unsigned int
      include/net/ipv6.h:719:25: warning: restricted __be32 degrades to integer
      include/net/ipv6.h:719:22: warning: invalid assignment: ^=
      include/net/ipv6.h:719:22:    left side has type restricted __be32
      include/net/ipv6.h:719:22:    right side has type unsigned int
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      67765146
    • E
      flow_keys: n_proto type should be __be16 · f4575d35
      Eric Dumazet 提交于
      (struct flow_keys)->n_proto is in network order, use
      proper type for this.
      
      Fixes following sparse errors :
      
      net/core/flow_dissector.c:139:39: warning: incorrect type in assignment (different base types)
      net/core/flow_dissector.c:139:39:    expected unsigned short [unsigned] [usertype] n_proto
      net/core/flow_dissector.c:139:39:    got restricted __be16 [assigned] [usertype] proto
      net/core/flow_dissector.c:237:23: warning: incorrect type in assignment (different base types)
      net/core/flow_dissector.c:237:23:    expected unsigned short [unsigned] [usertype] n_proto
      net/core/flow_dissector.c:237:23:    got restricted __be16 [assigned] [usertype] proto
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: e0f31d84 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4575d35
    • E
      tcp: do not pace pure ack packets · 98781965
      Eric Dumazet 提交于
      When we added pacing to TCP, we decided to let sch_fq take care
      of actual pacing.
      
      All TCP had to do was to compute sk->pacing_rate using simple formula:
      
      sk->pacing_rate = 2 * cwnd * mss / rtt
      
      It works well for senders (bulk flows), but not very well for receivers
      or even RPC :
      
      cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
      can end up pacing ACK packets, slowing down the sender.
      
      Really, only the sender should pace, according to its own logic.
      
      Instead of adding a new bit in skb, or call yet another flow
      dissection, we tweak skb->truesize to a small value (2), and
      we instruct sch_fq to use new helper and not pace pure ack.
      
      Note this also helps TCP small queue, as ack packets present
      in qdisc/NIC do not prevent sending a data packet (RPC workload)
      
      This helps to reduce tx completion overhead, ack packets can use regular
      sock_wfree() instead of tcp_wfree() which is a bit more expensive.
      
      This has no impact in the case packets are sent to loopback interface,
      as we do not coalesce ack packets (were we would detect skb->truesize
      lie)
      
      In case netem (with a delay) is used, skb_orphan_partial() also sets
      skb->truesize to 1.
      
      This patch is a combination of two patches we used for about one year at
      Google.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98781965
    • M
      net/bonding: Notify state change on slaves · 69e61133
      Moni Shoua 提交于
      Use notifier chain to dispatch an event upon a change in slave state.
      Event is dispatched with slave specific info.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      69e61133
    • M
      net/bonding: Move slave state changes to a helper function · 69a2338e
      Moni Shoua 提交于
      Move slave state changes to a helper function, this is a pre-step for adding
      functionality of dispatching an event when this helper is called.
      
      This commit doesn't add new functionality.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      69a2338e
  10. 04 2月, 2015 8 次提交
  11. 03 2月, 2015 11 次提交
  12. 02 2月, 2015 1 次提交