1. 09 2月, 2015 1 次提交
  2. 08 2月, 2015 5 次提交
    • N
      tcp: mitigate ACK loops for connections as tcp_timewait_sock · 4fb17a60
      Neal Cardwell 提交于
      Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
      represented by a tcp_timewait_sock, we rate limit dupacks in response
      to incoming packets (a) with TCP timestamps that fail PAWS checks, or
      (b) with sequence numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4fb17a60
    • N
      tcp: mitigate ACK loops for connections as tcp_sock · f2b2c582
      Neal Cardwell 提交于
      Ensure that in state ESTABLISHED, where the connection is represented
      by a tcp_sock, we rate limit dupacks in response to incoming packets
      (a) with TCP timestamps that fail PAWS checks, or (b) with sequence
      numbers or ACK numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      
      There is already a similar (although global) rate-limiting mechanism
      for "challenge ACKs". When deciding whether to send a challence ACK,
      we first consult the new per-connection rate limit, and then the
      global rate limit.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2b2c582
    • N
      tcp: mitigate ACK loops for connections as tcp_request_sock · a9b2c06d
      Neal Cardwell 提交于
      In the SYN_RECV state, where the TCP connection is represented by
      tcp_request_sock, we now rate-limit SYNACKs in response to a client's
      retransmitted SYNs: we do not send a SYNACK in response to client SYN
      if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
      since we last sent a SYNACK in response to a client's retransmitted
      SYN.
      
      This allows the vast majority of legitimate client connections to
      proceed unimpeded, even for the most aggressive platforms, iOS and
      MacOS, which actually retransmit SYNs 1-second intervals for several
      times in a row. They use SYN RTO timeouts following the progression:
      1,1,1,1,1,2,4,8,16,32.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9b2c06d
    • N
      tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell 提交于
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to thus elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing analogous
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      032ee423
    • J
      net: openvswitch: Support masked set actions. · 83d2b9ba
      Jarno Rajahalme 提交于
      OVS userspace already probes the openvswitch kernel module for
      OVS_ACTION_ATTR_SET_MASKED support.  This patch adds the kernel module
      implementation of masked set actions.
      
      The existing set action sets many fields at once.  When only a subset
      of the IP header fields, for example, should be modified, all the IP
      fields need to be exact matched so that the other field values can be
      copied to the set action.  A masked set action allows modification of
      an arbitrary subset of the supported header bits without requiring the
      rest to be matched.
      
      Masked set action is now supported for all writeable key types, except
      for the tunnel key.  The set tunnel action is an exception as any
      input tunnel info is cleared before action processing starts, so there
      is no tunnel info to mask.
      
      The kernel module converts all (non-tunnel) set actions to masked set
      actions.  This makes action processing more uniform, and results in
      less branching and duplicating the action processing code.  When
      returning actions to userspace, the fully masked set actions are
      converted back to normal set actions.  We use a kernel internal action
      code to be able to tell the userspace provided and converted masked
      set actions apart.
      Signed-off-by: NJarno Rajahalme <jrajahalme@nicira.com>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      83d2b9ba
  3. 05 2月, 2015 12 次提交
    • E
      ipv6: fix sparse errors in ip6_make_flowlabel() · 67765146
      Eric Dumazet 提交于
      include/net/ipv6.h:713:22: warning: incorrect type in assignment (different base types)
      include/net/ipv6.h:713:22:    expected restricted __be32 [usertype] hash
      include/net/ipv6.h:713:22:    got unsigned int
      include/net/ipv6.h:719:25: warning: restricted __be32 degrades to integer
      include/net/ipv6.h:719:22: warning: invalid assignment: ^=
      include/net/ipv6.h:719:22:    left side has type restricted __be32
      include/net/ipv6.h:719:22:    right side has type unsigned int
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      67765146
    • E
      flow_keys: n_proto type should be __be16 · f4575d35
      Eric Dumazet 提交于
      (struct flow_keys)->n_proto is in network order, use
      proper type for this.
      
      Fixes following sparse errors :
      
      net/core/flow_dissector.c:139:39: warning: incorrect type in assignment (different base types)
      net/core/flow_dissector.c:139:39:    expected unsigned short [unsigned] [usertype] n_proto
      net/core/flow_dissector.c:139:39:    got restricted __be16 [assigned] [usertype] proto
      net/core/flow_dissector.c:237:23: warning: incorrect type in assignment (different base types)
      net/core/flow_dissector.c:237:23:    expected unsigned short [unsigned] [usertype] n_proto
      net/core/flow_dissector.c:237:23:    got restricted __be16 [assigned] [usertype] proto
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: e0f31d84 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4575d35
    • E
      pkt_sched: fq: better control of DDOS traffic · 06eb395f
      Eric Dumazet 提交于
      FQ has a fast path for skb attached to a socket, as it does not
      have to compute a flow hash. But for other packets, FQ being non
      stochastic means that hosts exposed to random Internet traffic
      can allocate million of flows structure (104 bytes each) pretty
      easily. Not only host can OOM, but lookup in RB trees can take
      too much cpu and memory resources.
      
      This patch adds a new attribute, orphan_mask, that is adding
      possibility of having a stochastic hash for orphaned skb.
      
      Its default value is 1024 slots, to mimic SFQ behavior.
      
      Note: This does not apply to locally generated TCP traffic,
      and no locally generated traffic will share a flow structure
      with another perfect or stochastic flow.
      
      This patch also handles the specific case of SYNACK messages:
      
      They are attached to the listener socket, and therefore all map
      to a single hash bucket. If listener have set SO_MAX_PACING_RATE,
      hoping to have new accepted socket inherit this rate, SYNACK
      might be paced and even dropped.
      
      This is very similar to an internal patch Google have used more
      than one year.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06eb395f
    • E
      tcp: do not pace pure ack packets · 98781965
      Eric Dumazet 提交于
      When we added pacing to TCP, we decided to let sch_fq take care
      of actual pacing.
      
      All TCP had to do was to compute sk->pacing_rate using simple formula:
      
      sk->pacing_rate = 2 * cwnd * mss / rtt
      
      It works well for senders (bulk flows), but not very well for receivers
      or even RPC :
      
      cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
      can end up pacing ACK packets, slowing down the sender.
      
      Really, only the sender should pace, according to its own logic.
      
      Instead of adding a new bit in skb, or call yet another flow
      dissection, we tweak skb->truesize to a small value (2), and
      we instruct sch_fq to use new helper and not pace pure ack.
      
      Note this also helps TCP small queue, as ack packets present
      in qdisc/NIC do not prevent sending a data packet (RPC workload)
      
      This helps to reduce tx completion overhead, ack packets can use regular
      sock_wfree() instead of tcp_wfree() which is a bit more expensive.
      
      This has no impact in the case packets are sent to loopback interface,
      as we do not coalesce ack packets (were we would detect skb->truesize
      lie)
      
      In case netem (with a delay) is used, skb_orphan_partial() also sets
      skb->truesize to 1.
      
      This patch is a combination of two patches we used for about one year at
      Google.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98781965
    • H
      rhashtable: Introduce rhashtable_walk_* · f2dba9c6
      Herbert Xu 提交于
      Some existing rhashtable users get too intimate with it by walking
      the buckets directly.  This prevents us from easily changing the
      internals of rhashtable.
      
      This patch adds the helpers rhashtable_walk_init/exit/start/next/stop
      which will replace these custom walkers.
      
      They are meant to be usable for both procfs seq_file walks as well
      as walking by a netlink dump.  The iterator structure should fit
      inside a netlink dump cb structure, with at least one element to
      spare.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2dba9c6
    • M
      net/mlx4_core: Port aggregation upper layer interface · 53f33ae2
      Moni Shoua 提交于
      Supply interface functions to bond and unbond ports of a mlx4 internal
      interfaces. Example for such an interface is the one registered by the
      mlx4 IB driver under RoCE.
      
      There are
      
      1. Functions to go in/out to/from bonded mode
      2. Function to remap virtual ports to physical ports
      
      The bond_mutex prevents simultaneous access to data that keep status of
      the device in bonded mode.
      
      The upper mlx4 interface marks to the mlx4 core module that they
      want to be subject for such bonding by setting the MLX4_INTFF_BONDING
      flag. Interface which goes to/from bonded mode is re-created.
      
      The mlx4 Ethernet driver does not set this flag when registering the
      interface, the IB driver does.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53f33ae2
    • M
      net/mlx4_core: Port aggregation low level interface · 59e14e32
      Moni Shoua 提交于
      Implement the hardware interface required for port aggregation.
      
      1. Disable RX port check on receive - don't perform a validity check
      that matches to QP's port and the port where the packet is received.
      
      2. Virtual to physical port remap - configure virtual to physical port
      mapping. Port remap capability for virtual functions.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      59e14e32
    • M
      net/bonding: Notify state change on slaves · 69e61133
      Moni Shoua 提交于
      Use notifier chain to dispatch an event upon a change in slave state.
      Event is dispatched with slave specific info.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      69e61133
    • M
      net/bonding: Move slave state changes to a helper function · 69a2338e
      Moni Shoua 提交于
      Move slave state changes to a helper function, this is a pre-step for adding
      functionality of dispatching an event when this helper is called.
      
      This commit doesn't add new functionality.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      69a2338e
    • M
      net/core: Add event for a change in slave state · 61bd3857
      Moni Shoua 提交于
      Add event which provides an indication on a change in the state
      of a bonding slave. The event handler should cast the pointer to the
      appropriate type (struct netdev_bonding_info) in order to get the
      full info about the slave.
      Signed-off-by: NMoni Shoua <monis@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61bd3857
    • T
      net: add skb functions to process remote checksum offload · dcdc8994
      Tom Herbert 提交于
      This patch adds skb_remcsum_process and skb_gro_remcsum_process to
      perform the appropriate adjustments to the skb when receiving
      remote checksum offload.
      
      Updated vxlan and gue to use these functions.
      
      Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did
      not see any change in performance.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dcdc8994
    • E
      xps: fix xps for stacked devices · 2bd82484
      Eric Dumazet 提交于
      A typical qdisc setup is the following :
      
      bond0 : bonding device, using HTB hierarchy
      eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc
      
      XPS allows to spread packets on specific tx queues, based on the cpu
      doing the send.
      
      Problem is that dequeues from bond0 qdisc can happen on random cpus,
      due to the fact that qdisc_run() can dequeue a batch of packets.
      
      CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
      CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0
      
      CPUB -> dequeue packet P1 from bond0
              enqueue packet on eth1/eth2
      CPUC -> dequeue packet P2 from bond0
              enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)
      
      get_xps_queue() then might select wrong queue for P1, since current cpu
      might be different than CPUA.
      
      P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
      if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)
      
      Effect of this bug is TCP reorders, and more generally not optimal
      TX queue placement. (A victim bulk flow can be migrated to the wrong TX
      queue for a while)
      
      To fix this, we have to record sender cpu number the first time
      dev_queue_xmit() is called for one tx skb.
      
      We can union napi_id (used on receive path) and sender_cpu,
      granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
      this union idea)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bd82484
  4. 04 2月, 2015 15 次提交
  5. 03 2月, 2015 7 次提交