1. 08 Feb, 2015 2 commits
    • tcp: mitigate ACK loops for connections as tcp_request_sock · a9b2c06d
      Neal Cardwell authored
      In the SYN_RECV state, where the TCP connection is represented by
      tcp_request_sock, we now rate-limit SYNACKs in response to a client's
      retransmitted SYNs: we do not send a SYNACK in response to client SYN
      if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
      since we last sent a SYNACK in response to a client's retransmitted
      SYN.
      
      This allows the vast majority of legitimate client connections to
      proceed unimpeded, even for the most aggressive platforms, iOS and
      MacOS, which actually retransmit SYNs at 1-second intervals several
      times in a row. They use SYN RTO timeouts following the progression:
      1,1,1,1,1,2,4,8,16,32.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
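The effect of the SYNACK rate limit on the SYN retransmission schedule quoted above can be checked with a small userspace sketch. It is not the kernel's code: `struct synack_limiter`, `synack_allowed()` and `count_answered()` are hypothetical names, and time is modelled as plain milliseconds rather than kernel timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request state, standing in for what tcp_request_sock tracks. */
struct synack_limiter {
	bool     sent_before;   /* has a SYNACK for a retransmitted SYN been sent? */
	uint64_t last_sent_ms;  /* when that SYNACK was sent */
};

/* Returns true (and records the send) if a SYNACK may go out at now_ms. */
static bool synack_allowed(struct synack_limiter *l, uint64_t now_ms,
			   uint32_t ratelimit_ms)
{
	assert(l != NULL);
	if (l->sent_before && now_ms - l->last_sent_ms < ratelimit_ms)
		return false;   /* too soon since the last SYNACK: suppress */
	l->sent_before = true;
	l->last_sent_ms = now_ms;
	return true;
}

/* Counts how many retransmitted SYNs, spaced by gaps_sec[], get a SYNACK. */
static size_t count_answered(const unsigned *gaps_sec, size_t n,
			     uint32_t ratelimit_ms)
{
	struct synack_limiter l = { false, 0 };
	uint64_t now_ms = 0;
	size_t answered = 0;

	assert(gaps_sec != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		now_ms += (uint64_t)gaps_sec[i] * 1000;
		if (synack_allowed(&l, now_ms, ratelimit_ms))
			answered++;
	}
	return answered;
}
```

With the 500 ms default, every SYN in the 1,1,1,1,1,2,4,8,16,32 progression is at least 1 second after the previous one, so all ten are answered; only SYNs arriving less than 500 ms after the last SYNACK are suppressed.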
    • tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell authored
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing knob for
      ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
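The decision rule described in this commit message (always answer segments that carry data, rate-limit SYNs and pure ACKs, count the suppressed replies) might be sketched as follows. This is an illustrative userspace model, not the kernel helper: `struct oow_state`, `struct segment` and `should_send_dupack()` are hypothetical names, and the choice not to refresh the timestamp for data segments is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state for out-of-window (OOW) dupacks. */
struct oow_state {
	bool     have_last;       /* has a rate-limited dupack been sent before? */
	uint64_t last_dupack_ms;  /* time of the last such dupack */
	uint32_t suppressed;      /* stand-in for the SNMP counter */
};

/* Minimal description of an incoming out-of-window segment. */
struct segment {
	bool     syn;
	uint32_t payload_len;
};

/* Returns true if a dupack should be sent in response to seg at now_ms. */
static bool should_send_dupack(struct oow_state *st, const struct segment *seg,
			       uint64_t now_ms, uint32_t ratelimit_ms)
{
	assert(st != NULL && seg != NULL);

	/* Data segments may be spurious retransmissions wanting a DSACK. */
	if (seg->payload_len > 0 && !seg->syn)
		return true;

	/* SYNs and pure ACKs: allow at most one dupack per ratelimit_ms. */
	if (st->have_last && now_ms - st->last_dupack_ms < ratelimit_ms) {
		st->suppressed++;
		return false;
	}
	st->have_last = true;
	st->last_dupack_ms = now_ms;
	return true;
}
```

Because a suppressed reply does not refresh the timestamp, a peer sending pure ACKs continuously still receives one dupack per interval, which is what keeps zero window probes working.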
  2. 05 Feb, 2015 5 commits
    • ipv6: fix sparse errors in ip6_make_flowlabel() · 67765146
      Eric Dumazet authored
      include/net/ipv6.h:713:22: warning: incorrect type in assignment (different base types)
      include/net/ipv6.h:713:22:    expected restricted __be32 [usertype] hash
      include/net/ipv6.h:713:22:    got unsigned int
      include/net/ipv6.h:719:25: warning: restricted __be32 degrades to integer
      include/net/ipv6.h:719:22: warning: invalid assignment: ^=
      include/net/ipv6.h:719:22:    left side has type restricted __be32
      include/net/ipv6.h:719:22:    right side has type unsigned int
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • flow_keys: n_proto type should be __be16 · f4575d35
      Eric Dumazet authored
      (struct flow_keys)->n_proto is in network byte order, so use the
      proper type for it.
      
      Fixes the following sparse errors:
      
      net/core/flow_dissector.c:139:39: warning: incorrect type in assignment (different base types)
      net/core/flow_dissector.c:139:39:    expected unsigned short [unsigned] [usertype] n_proto
      net/core/flow_dissector.c:139:39:    got restricted __be16 [assigned] [usertype] proto
      net/core/flow_dissector.c:237:23: warning: incorrect type in assignment (different base types)
      net/core/flow_dissector.c:237:23:    expected unsigned short [unsigned] [usertype] n_proto
      net/core/flow_dissector.c:237:23:    got restricted __be16 [assigned] [usertype] proto
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: e0f31d84 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
      Signed-off-by: David S. Miller <davem@davemloft.net>
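To illustrate why a field held in network byte order deserves its own type, here is a small userspace sketch. The `be16` typedef is only a stand-in for the kernel's `__be16` (plain C has no sparse checking), and `struct flow_keys_sketch`, `ETHERTYPE_IPV4_HOST` and `is_ipv4()` are hypothetical names.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for __be16: documents that the value is big-endian on the wire. */
typedef uint16_t be16;

/* IPv4 EtherType in host byte order (hypothetical constant name). */
#define ETHERTYPE_IPV4_HOST 0x0800

/* Cut-down, hypothetical version of a flow key holding the L3 protocol. */
struct flow_keys_sketch {
	be16 n_proto;   /* network byte order, as the "n_" prefix suggests */
};

/* Compare in network order: convert the constant, not the stored field. */
static bool is_ipv4(const struct flow_keys_sketch *keys)
{
	assert(keys != NULL);
	return keys->n_proto == htons(ETHERTYPE_IPV4_HOST);
}
```

In the kernel, sparse flags any assignment that mixes `__be16` with a plain integer, which is the class of warning the commit message lists.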
    • tcp: do not pace pure ack packets · 98781965
      Eric Dumazet authored
      When we added pacing to TCP, we decided to let sch_fq take care
      of actual pacing.
      
      All TCP had to do was to compute sk->pacing_rate using a simple formula:
      
      sk->pacing_rate = 2 * cwnd * mss / rtt
      
      It works well for senders (bulk flows), but not very well for receivers
      or even RPC:
      
      cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
      can end up pacing ACK packets, slowing down the sender.
      
      Really, only the sender should pace, according to its own logic.
      
      Instead of adding a new bit in skb, or calling yet another flow
      dissection, we tweak skb->truesize to a small value (2), and
      we instruct sch_fq to use a new helper and not pace pure acks.
      
      Note this also helps TCP small queues, as ack packets present
      in the qdisc/NIC do not prevent sending a data packet (RPC workload).
      
      This helps to reduce tx completion overhead: ack packets can use regular
      sock_wfree() instead of tcp_wfree(), which is a bit more expensive.
      
      This has no impact when packets are sent to the loopback interface,
      as we do not coalesce ack packets (where we would detect the
      skb->truesize lie).
      
      In case netem (with a delay) is used, skb_orphan_partial() also sets
      skb->truesize to 1.
      
      This patch is a combination of two patches we used for about one year at
      Google.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
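As a worked instance of the pacing formula quoted in this commit message, the sketch below plugs in the receiver-side numbers it mentions (cwnd around 10, rtt around 100 ms). The 1460-byte MSS and the 66-byte ACK size are assumptions made for the example, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* pacing_rate = 2 * cwnd * mss / rtt, in bytes per second (rtt in microseconds). */
static uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
					  uint32_t rtt_us)
{
	assert(rtt_us > 0);
	return 2ULL * cwnd * mss * 1000000ULL / rtt_us;
}

/* Time a pacer would wait between packets of pkt_len bytes at the given rate. */
static uint64_t packet_gap_us(uint64_t rate_bytes_per_sec, uint32_t pkt_len)
{
	assert(rate_bytes_per_sec > 0);
	return (uint64_t)pkt_len * 1000000ULL / rate_bytes_per_sec;
}
```

With cwnd = 10, mss = 1460 and rtt = 100 ms the rate is 292000 bytes per second, so a pacer would hold each 66-byte ACK for roughly 226 microseconds. A receiver with a smaller cwnd or a longer rtt would delay its ACKs even more, which is the slowdown the patch avoids by exempting pure ACKs.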
    • net/bonding: Notify state change on slaves · 69e61133
      Moni Shoua authored
      Use a notifier chain to dispatch an event upon a change in slave state.
      The event is dispatched with slave-specific info.
      Signed-off-by: Moni Shoua <monis@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/bonding: Move slave state changes to a helper function · 69a2338e
      Moni Shoua authored
      Move slave state changes to a helper function. This is a preparatory step
      for dispatching an event whenever this helper is called.
      
      This commit doesn't add new functionality.
      Signed-off-by: Moni Shoua <monis@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 04 Feb, 2015 8 commits
  4. 03 Feb, 2015 11 commits
  5. 02 Feb, 2015 4 commits
  6. 01 Feb, 2015 2 commits
    • net: sched: fix panic in rate estimators · 0d32ef8c
      Eric Dumazet authored
      Running the following commands on a non-idle network device
      panics the box instantly, because cpu_bstats gets overwritten
      by stats.
      
      tc qdisc add dev eth0 root <your_favorite_qdisc>
      ... some traffic (one packet is enough) ...
      tc qdisc replace dev eth0 root est 1sec 4sec <your_favorite_qdisc>
      
      [  325.355596] BUG: unable to handle kernel paging request at ffff8841dc5a074c
      [  325.362609] IP: [<ffffffff81541c9e>] __gnet_stats_copy_basic+0x3e/0x90
      [  325.369158] PGD 1fa7067 PUD 0
      [  325.372254] Oops: 0000 [#1] SMP
      [  325.375514] Modules linked in: ...
      [  325.398346] CPU: 13 PID: 14313 Comm: tc Not tainted 3.19.0-smp-DEV #1163
      [  325.412042] task: ffff8800793ab5d0 ti: ffff881ff2fa4000 task.ti: ffff881ff2fa4000
      [  325.419518] RIP: 0010:[<ffffffff81541c9e>]  [<ffffffff81541c9e>] __gnet_stats_copy_basic+0x3e/0x90
      [  325.428506] RSP: 0018:ffff881ff2fa7928  EFLAGS: 00010286
      [  325.433824] RAX: 000000000000000c RBX: ffff881ff2fa796c RCX: 000000000000000c
      [  325.440988] RDX: ffff8841dc5a0744 RSI: 0000000000000060 RDI: 0000000000000060
      [  325.448120] RBP: ffff881ff2fa7948 R08: ffffffff81cd4f80 R09: 0000000000000000
      [  325.455268] R10: ffff883ff223e400 R11: 0000000000000000 R12: 000000015cba0744
      [  325.462405] R13: ffffffff81cd4f80 R14: ffff883ff223e460 R15: ffff883feea0722c
      [  325.469536] FS:  00007f2ee30fa700(0000) GS:ffff88407fa20000(0000) knlGS:0000000000000000
      [  325.477630] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  325.483380] CR2: ffff8841dc5a074c CR3: 0000003feeae9000 CR4: 00000000001407e0
      [  325.490510] Stack:
      [  325.492524]  ffff883feea0722c ffff883fef719dc0 ffff883feea0722c ffff883ff223e4a0
      [  325.499990]  ffff881ff2fa79a8 ffffffff815424ee ffff883ff223e49c 000000015cba0744
      [  325.507460]  00000000f2fa7978 0000000000000000 ffff881ff2fa79a8 ffff883ff223e4a0
      [  325.514956] Call Trace:
      [  325.517412]  [<ffffffff815424ee>] gen_new_estimator+0x8e/0x230
      [  325.523250]  [<ffffffff815427aa>] gen_replace_estimator+0x4a/0x60
      [  325.529349]  [<ffffffff815718ab>] tc_modify_qdisc+0x52b/0x590
      [  325.535117]  [<ffffffff8155edd0>] rtnetlink_rcv_msg+0xa0/0x240
      [  325.540963]  [<ffffffff8155ed30>] ? __rtnl_unlock+0x20/0x20
      [  325.546532]  [<ffffffff8157f811>] netlink_rcv_skb+0xb1/0xc0
      [  325.552145]  [<ffffffff8155b355>] rtnetlink_rcv+0x25/0x40
      [  325.557558]  [<ffffffff8157f0d8>] netlink_unicast+0x168/0x220
      [  325.563317]  [<ffffffff8157f47c>] netlink_sendmsg+0x2ec/0x3e0
      
      Let's play it safe and not use a union: percpu 'pointers' are mostly read
      anyway, and we typically have few qdiscs per host.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Fixes: 22e0f8b9 ("net: sched: make bstats per cpu and estimator RCU safe")
      Signed-off-by: David S. Miller <davem@davemloft.net>
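The hazard of keeping a pointer and a counter block in one union can be reproduced in a few lines of userspace C. This is only a model of the problem the commit message describes: the struct and function names are hypothetical, an ordinary pointer stands in for the percpu pointer, and the result assumes a typical 64-bit or little-endian platform where the pointer overlaps the low bytes of the counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct basic_stats {
	uint64_t bytes;
	uint32_t packets;
};

/* Union layout: the counters and the pointer share the same storage. */
struct qdisc_union {
	union {
		struct basic_stats  bstats;
		struct basic_stats *cpu_bstats;
	} u;
};

/* Separate layout: a few more bytes per qdisc, but no aliasing. */
struct qdisc_separate {
	struct basic_stats  bstats;
	struct basic_stats *cpu_bstats;
};

/* Updates the byte counter, then reports whether the pointer is still intact. */
static bool union_pointer_survives(uint64_t len)
{
	struct basic_stats percpu_area = { 0, 0 };
	struct qdisc_union q;

	assert(len > 0);
	memset(&q, 0, sizeof(q));
	q.u.cpu_bstats = &percpu_area;
	q.u.bstats.bytes += len;   /* clobbers the pointer bits */
	return q.u.cpu_bstats == &percpu_area;
}

static bool separate_pointer_survives(uint64_t len)
{
	struct basic_stats percpu_area = { 0, 0 };
	struct qdisc_separate q;

	assert(len > 0);
	memset(&q, 0, sizeof(q));
	q.cpu_bstats = &percpu_area;
	q.bstats.bytes += len;     /* independent storage */
	return q.cpu_bstats == &percpu_area;
}
```

In the union layout the pointer no longer refers to the stats area after a single counter update, so any later dereference faults, much like the paging-request oops in the trace. The separate layout trades a little memory for safety, in line with the remark that hosts typically have few qdiscs.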
    • ipv4: icmp: use percpu allocation · 349c9e3c
      Eric Dumazet authored
      Get rid of nr_cpu_ids and use modern percpu allocation.
      
      Note that the sockets themselves are not yet allocated
      using NUMA affinity.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 31 Jan, 2015 1 commit
  8. 29 Jan, 2015 4 commits
    • net: remove sock_iocb · 7cc05662
      Christoph Hellwig authored
      The sock_iocb structure is allocated on the stack for each read/write-like
      operation on sockets, and contains various fields of which only the
      embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
      Get rid of the sock_iocb and put a msghdr directly on the stack and pass
      the scm_cookie explicitly to netlink_mmap_sendmsg.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • openvswitch: Add support for checksums on UDP tunnels. · b8693877
      Jesse Gross authored
      Currently, it isn't possible to request checksums on the outer UDP
      header of tunnels - the TUNNEL_CSUM flag is ignored. This adds
      support for requesting that UDP checksums be computed on transmit
      and properly reported if they are present on receive.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: stretch ACK fixes prep · e73ebb08
      Neal Cardwell authored
      LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
      cover more than the RFC-specified maximum of 2 packets. These stretch
      ACKs can cause serious performance shortfalls in common congestion
      control algorithms that were designed and tuned years ago with
      receiver hosts that were not using LRO or GRO, and were instead
      politely ACKing every other packet.
      
      This patch series fixes Reno and CUBIC to handle stretch ACKs.
      
      This patch prepares for the upcoming stretch ACK bug fix patches. It
      adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
      fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
      changes all congestion control algorithms to pass in 1 for the ACKed
      count. It also changes tcp_slow_start() to return the number of packet
      ACK "credits" that were not processed in slow start mode, and can be
      processed by the congestion control module in additive increase mode.
      
      In future patches we will fix tcp_cong_avoid_ai() to handle stretch
      ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
      and additive increase mode.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
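The credit hand-off this commit message describes (slow start consumes what it can and returns the rest for additive increase) could look roughly like the simplified model below. It is not the kernel's tcp_slow_start() or tcp_cong_avoid_ai(): `struct cc_state` and the function names are hypothetical, and it shows the direction the series is heading, whereas the message says this prep patch itself still passes 1 as the ACKed count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down congestion control state (units: packets). */
struct cc_state {
	uint32_t cwnd;      /* congestion window */
	uint32_t ssthresh;  /* slow start threshold */
	uint32_t cwnd_cnt;  /* ACK credits accumulated toward the next +1 */
};

/* Grow cwnd by up to `acked` while below ssthresh; return unused credits. */
static uint32_t slow_start(struct cc_state *s, uint32_t acked)
{
	uint32_t room, used;

	assert(s != NULL);
	room = s->cwnd < s->ssthresh ? s->ssthresh - s->cwnd : 0;
	used = acked < room ? acked : room;
	s->cwnd += used;
	return acked - used;
}

/* Additive increase: add one packet to cwnd per `w` ACKed packets. */
static void cong_avoid_ai(struct cc_state *s, uint32_t w, uint32_t acked)
{
	assert(s != NULL && w > 0);
	s->cwnd_cnt += acked;
	while (s->cwnd_cnt >= w) {
		s->cwnd_cnt -= w;
		s->cwnd++;
	}
}

/* Reno-like ACK handler: slow start first, leftover credits go to AI. */
static void on_ack(struct cc_state *s, uint32_t acked)
{
	acked = slow_start(s, acked);
	if (acked > 0)
		cong_avoid_ai(s, s->cwnd, acked);
}
```

For a stretch ACK covering 5 packets with cwnd = 8 and ssthresh = 10, slow start uses 2 credits to reach 10 and the remaining 3 are banked toward the next additive increase instead of being dropped.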
    • Bluetooth: Perform a power cycle when receiving hardware error event · c7741d16
      Marcel Holtmann authored
      When receiving an HCI Hardware Error event, the controller should be
      assumed to be non-functional until an HCI Reset command is issued.
      
      Bluetooth hardware errors are vendor specific, so add a new
      hdev->hw_error callback that drivers can provide to run extra
      code to handle the hardware error.
      
      After completing the vendor-specific error handling, perform a full
      reset of the Bluetooth stack by closing and re-opening the transport.
      Based-on-patch-by: Johan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
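The control flow described here (run an optional vendor hook, then close and re-open the transport) can be outlined as below. Only the `hw_error` callback name comes from the commit message; `struct hci_dev_sketch`, the counters and the helper functions are hypothetical stand-ins, and the real driver interface may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down device structure with an optional vendor hook. */
struct hci_dev_sketch {
	void (*hw_error)(struct hci_dev_sketch *hdev, uint8_t code); /* may be NULL */
	unsigned vendor_calls;  /* bookkeeping for the example only */
	unsigned closes;
	unsigned opens;
};

/* Stand-ins for closing and re-opening the transport. */
static void transport_close(struct hci_dev_sketch *hdev) { hdev->closes++; }
static void transport_open(struct hci_dev_sketch *hdev)  { hdev->opens++; }

/* Handle a Hardware Error event: vendor hook first, then a full reset. */
static void handle_hw_error(struct hci_dev_sketch *hdev, uint8_t code)
{
	assert(hdev != NULL);
	if (hdev->hw_error)
		hdev->hw_error(hdev, code);
	transport_close(hdev);
	transport_open(hdev);
}

/* Example vendor hook a driver might install. */
static void vendor_hw_error(struct hci_dev_sketch *hdev, uint8_t code)
{
	(void)code;   /* a real driver would decode the vendor-specific code */
	hdev->vendor_calls++;
}
```

Keeping the hook optional means drivers without vendor-specific handling still get the close and re-open cycle.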
  9. 28 Jan, 2015 3 commits