1. 19 1月, 2017 3 次提交
  2. 18 1月, 2017 5 次提交
    • J
      tcp: accept RST for rcv_nxt - 1 after receiving a FIN · 0e40f4c9
      Jason Baron 提交于
      Using a Mac OSX box as a client connecting to a Linux server, we have found
      that when certain applications (such as 'ab'), are abruptly terminated
      (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
      FIN is accepted by the Linux stack but the RST is sent with the same
      sequence number as the FIN, and Linux responds with a challenge ACK per
      RFC 5961. The OSX client then sometimes (they are rate-limited) does not
      reply with any RST as would be expected on a closed socket.
      
      This results in sockets accumulating on the Linux server left mostly in
      the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
      This sequence of events can tie up a lot of resources on the Linux server
      since there may be a lot of data in write buffers at the time of the RST.
      Accepting a RST equal to rcv_nxt - 1, after we have already successfully
      processed a FIN, has made a significant difference for us in practice, by
      freeing up unneeded resources in a more expedient fashion.
      
      A packetdrill test demonstrating the behavior:
      
      // testing mac osx rst behavior
      
      // Establish a connection
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
      0.200 < . 1:1(0) ack 1 win 32768
      0.200 accept(3, ..., ...) = 4
      
      // Client closes the connection
      0.300 < F. 1:1(0) ack 1 win 32768
      
      // now send rst with same sequence
      0.300 < R. 1:1(0) ack 1 win 32768
      
      // make sure we are in TCP_CLOSE
      0.400 %{
      assert tcpi_state == 7
      }%
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e40f4c9
    • G
      net: ping: Use right format specifier to avoid type casting · a7ef6715
      Gao Feng 提交于
      The inet_num is u16, so use %hu instead of casting it to int. And
      the sk_bound_dev_if is int actually, so it needn't cast to int.
      Signed-off-by: NGao Feng <fgao@ikuai8.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a7ef6715
    • L
      bridge: sparse fixes in br_ip6_multicast_alloc_query() · 53631a5f
      Lance Richardson 提交于
      Changed type of csum field in struct igmpv3_query from __be16 to
      __sum16 to eliminate type warning, made same change in struct
      igmpv3_report for consistency.
      
      Fixed up an ntohs() where htons() should have been used instead.
      Signed-off-by: NLance Richardson <lrichard@redhat.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      53631a5f
    • R
      mpls: Packet stats · 27d69105
      Robert Shearman 提交于
      Having MPLS packet stats is useful for observing network operation and
      for diagnosing network problems. In the absence of anything better,
      RFC2863 and RFC3813 are used for guidance for which stats to expose
      and the semantics of them. In particular rx_noroutes maps to in
      unknown protos in RFC2863. The stats are exposed to userspace via
      AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
      RTM_GETSTATS messages.
      
      All the introduced fields are 64-bit, even error ones, to ensure no
      overflow with long uptimes. Per-CPU counters are used to avoid
      cache-line contention on the commonly used fields. The other fields
      have also been made per-CPU for code to avoid performance problems in
      error conditions on the assumption that on some platforms the cost of
      atomic operations could be more expensive than sending the packet
      (which is what would be done in the success case). If that's not the
      case, we could instead not use per-CPU counters for these fields.
      
      Only unicast and non-fragment are exposed at the moment, but other
      counters can be exposed in the future either by adding to the end of
      struct mpls_link_stats or by additional netlink attributes in the
      AF_MPLS IFLA_STATS_AF_SPEC nested attribute.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27d69105
    • R
      net: AF-specific RTM_GETSTATS attributes · aefb4d4a
      Robert Shearman 提交于
      Add the functionality for including address-family-specific per-link
      stats in RTM_GETSTATS messages. This is done through adding a new
      IFLA_STATS_AF_SPEC attribute under which address family attributes are
      nested and then the AF-specific attributes can be further nested. This
      follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
      advantage of presenting an easily extended hierarchy. The rtnl_af_ops
      structure is extended to provide AFs with the opportunity to fill and
      provide the size of their stats attributes.
      
      One alternative would have been to provide AFs with the ability to add
      attributes directly into the RTM_GETSTATS message without a nested
      hierarchy. I discounted this approach as it increases the rate at
      which the 32 attribute number space is used up and it makes
      implementation a little more tricky for stats dump resuming (at the
      moment the order in which attributes are added to the message has to
      match the numeric order of the attributes).
      
      Another alternative would have been to register per-AF RTM_GETSTATS
      handlers. I discounted this approach as I perceived a common use-case
      to be getting all the stats for an interface and this approach would
      necessitate multiple requests/dumps to retrieve them all.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aefb4d4a
  3. 17 1月, 2017 11 次提交
    • J
      net sched actions: fix refcnt when GETing of action after bind · 0faa9cb5
      Jamal Hadi Salim 提交于
      Demonstrating the issue:
      
      .. add a drop action
      $sudo $TC actions add action drop index 10
      
      .. retrieve it
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 2 bind 0 installed 29 sec used 29 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      ... bug 1 above: reference is two.
          Reference is actually 1 but we forget to subtract 1.
      
      ... do a GET again and we see the same issue
          try a few times and nothing changes
      ~$ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 2 bind 0 installed 31 sec used 31 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      ... lets try to bind the action to a filter..
      $ sudo $TC qdisc add dev lo ingress
      $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
        u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10
      
      ... and now a few GETs:
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 3 bind 1 installed 204 sec used 204 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 4 bind 1 installed 206 sec used 206 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 5 bind 1 installed 235 sec used 235 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      .... as can be observed the reference count keeps going up.
      
      After the fix
      
      $ sudo $TC actions add action drop index 10
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 1 bind 0 installed 4 sec used 4 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 1 bind 0 installed 6 sec used 6 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC qdisc add dev lo ingress
      $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
        u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 2 bind 1 installed 32 sec used 32 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      $ sudo $TC -s actions get action gact index 10
      
      	action order 1: gact action drop
      	 random type none pass val 0
      	 index 10 ref 2 bind 1 installed 33 sec used 33 sec
       	Action statistics:
      	Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
      	backlog 0b 0p requeues 0
      
      Fixes: aecc5cef ("net sched actions: fix GETing actions")
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0faa9cb5
    • P
      net/sched: cls_flower: Disallow duplicate internal elements · a3308d8f
      Paul Blakey 提交于
      Flower currently allows having the same filter twice with the same
      priority. Actions (and statistics update) will always execute on the
      first inserted rule leaving the second rule unused.
      This patch disallows that.
      Signed-off-by: NPaul Blakey <paulb@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3308d8f
    • B
      ax25: Fix segfault after sock connection timeout · 8a367e74
      Basil Gunn 提交于
      The ax.25 socket connection timed out & the sock struct has been
      previously taken down ie. sock struct is now a NULL pointer. Checking
      the sock_flag causes the segfault.  Check if the socket struct pointer
      is NULL before checking sock_flag. This segfault is seen in
      timed out netrom connections.
      
      Please submit to -stable.
      Signed-off-by: NBasil Gunn <basil@pacabunga.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a367e74
    • D
      bpf: rework prog_digest into prog_tag · f1f7714e
      Daniel Borkmann 提交于
      Commit 7bd509e3 ("bpf: add prog_digest and expose it via
      fdinfo/netlink") was recently discussed, partially due to
      admittedly suboptimal name of "prog_digest" in combination
      with sha1 hash usage, thus inevitably and rightfully concerns
      about its security in terms of collision resistance were
      raised with regards to use-cases.
      
      The intended use cases are for debugging resp. introspection
      only for providing a stable "tag" over the instruction sequence
      that both kernel and user space can calculate independently.
      It's not usable at all for making a security relevant decision.
      So collisions where two different instruction sequences generate
      the same tag can happen, but ideally at a rather low rate. The
      "tag" will be dumped in hex and is short enough to introspect
      in tracepoints or kallsyms output along with other data such
      as stack trace, etc. Thus, this patch performs a rename into
      prog_tag and truncates the tag to a short output (64 bits) to
      make it obvious it's not collision-free.
      
      Should in future a hash or facility be needed with a security
      relevant focus, then we can think about requirements, constraints,
      etc that would fit to that situation. For now, rework the exposed
      parts for the current use cases as long as nothing has been
      released yet. Tested on x86_64 and s390x.
      
      Fixes: 7bd509e3 ("bpf: add prog_digest and expose it via fdinfo/netlink")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1f7714e
    • M
      sctp: remove useless code from sctp_apply_peer_addr_params · cdfb1a9f
      Marcelo Ricardo Leitner 提交于
      sctp_frag_point() doesn't store anything, and thus just calling it
      cannot do anything useful.
      
      sctp_apply_peer_addr_params is only called by
      sctp_setsockopt_peer_addr_params. When operating on an asoc,
      sctp_setsockopt_peer_addr_params will call sctp_apply_peer_addr_params
      once for the asoc, and then once for each transport this asoc has,
      meaning that the frag_point will be recomputed when updating the
      transports and calling it when updating the asoc is not necessary.
      IOW, no action is needed here and we can remove this call.
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Reviewed-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdfb1a9f
    • M
    • C
      flow dissector: check if arp_eth is null rather than arp · 57b68ec2
      Colin Ian King 提交于
      arp is being checked instead of arp_eth to see if the call to
      __skb_header_pointer failed. Fix this by checking arp_eth is
      null instead of arp.   Also fix to use length hlen rather than
      hlen - sizeof(_arp); thanks to Eric Dumazet for spotting
      this latter issue.
      
      CoverityScan CID#1396428 ("Logically dead code") on 2nd
      arp comparison (which should be arp_eth instead).
      
      Fixes: commit 55733350 ("flow disector: ARP support")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57b68ec2
    • E
      netlink: do not enter direct reclaim from netlink_trim() · e89df813
      Eric Dumazet 提交于
      In commit d35c99ff ("netlink: do not enter direct reclaim from
      netlink_dump()") we made sure to not trigger expensive memory reclaim.
      
      Problem is that a bit later, netlink_trim() might be called and
      trigger memory reclaim.
      
      netlink_trim() should be best effort, and really as fast as possible.
      Under memory pressure, it is fine to not trim this skb.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e89df813
    • P
      tipc: allocate user memory with GFP_KERNEL flag · 57d5f64d
      Parthasarathy Bhuvaragan 提交于
      Until now, we allocate memory always with GFP_ATOMIC flag.
      When the system is under memory pressure and a user tries to send,
      the send fails due to low memory. However, the user application
      can wait for free memory if we allocate it using GFP_KERNEL flag.
      
      In this commit, we use allocate memory with GFP_KERNEL for all user
      allocation.
      Reported-by: NRune Torgersen <runet@innovsys.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57d5f64d
    • J
      ip6_tunnel: Account for tunnel header in tunnel MTU · 02ca0423
      Jakub Sitnicki 提交于
      With ip6gre we have a tunnel header which also makes the tunnel MTU
      smaller. We need to reserve room for it. Previously we were using up
      space reserved for the Tunnel Encapsulation Limit option
      header (RFC 2473).
      
      Also, after commit b05229f4 ("gre6: Cleanup GREv6 transmit path,
      call common GRE functions") our contract with the caller has
      changed. Now we check if the packet length exceeds the tunnel MTU after
      the tunnel header has been pushed, unlike before.
      
      This is reflected in the check where we look at the packet length minus
      the size of the tunnel header, which is already accounted for in tunnel
      MTU.
      
      Fixes: b05229f4 ("gre6: Cleanup GREv6 transmit path, call common GRE functions")
      Signed-off-by: NJakub Sitnicki <jkbs@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      02ca0423
    • H
      mld: do not remove mld souce list info when set link down · 1666d49e
      Hangbin Liu 提交于
      This is an IPv6 version of commit 24803f38 ("igmp: do not remove igmp
      souce list..."). In mld_del_delrec(), we will restore back all source filter
      info instead of flush them.
      
      Move mld_clear_delrec() from ipv6_mc_down() to ipv6_mc_destroy_dev() since
      we should not remove source list info when set link down. Remove
      igmp6_group_dropped() in ipv6_mc_destroy_dev() since we have called it in
      ipv6_mc_down().
      
      Also clear all source info after igmp6_group_dropped() instead of in it
      because ipv6_mc_down() will call igmp6_group_dropped().
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1666d49e
  4. 16 1月, 2017 1 次提交
    • L
      openvswitch: maintain correct checksum state in conntrack actions · 75f01a4c
      Lance Richardson 提交于
      When executing conntrack actions on skbuffs with checksum mode
      CHECKSUM_COMPLETE, the checksum must be updated to account for
      header pushes and pulls. Otherwise we get "hw csum failure"
      logs similar to this (ICMP packet received on geneve tunnel
      via ixgbe NIC):
      
      [  405.740065] genev_sys_6081: hw csum failure
      [  405.740106] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G          I     4.10.0-rc3+ #1
      [  405.740108] Call Trace:
      [  405.740110]  <IRQ>
      [  405.740113]  dump_stack+0x63/0x87
      [  405.740116]  netdev_rx_csum_fault+0x3a/0x40
      [  405.740118]  __skb_checksum_complete+0xcf/0xe0
      [  405.740120]  nf_ip_checksum+0xc8/0xf0
      [  405.740124]  icmp_error+0x1de/0x351 [nf_conntrack_ipv4]
      [  405.740132]  nf_conntrack_in+0xe1/0x550 [nf_conntrack]
      [  405.740137]  ? find_bucket.isra.2+0x62/0x70 [openvswitch]
      [  405.740143]  __ovs_ct_lookup+0x95/0x980 [openvswitch]
      [  405.740145]  ? netif_rx_internal+0x44/0x110
      [  405.740149]  ovs_ct_execute+0x147/0x4b0 [openvswitch]
      [  405.740153]  do_execute_actions+0x22e/0xa70 [openvswitch]
      [  405.740157]  ovs_execute_actions+0x40/0x120 [openvswitch]
      [  405.740161]  ovs_dp_process_packet+0x84/0x120 [openvswitch]
      [  405.740166]  ovs_vport_receive+0x73/0xd0 [openvswitch]
      [  405.740168]  ? udp_rcv+0x1a/0x20
      [  405.740170]  ? ip_local_deliver_finish+0x93/0x1e0
      [  405.740172]  ? ip_local_deliver+0x6f/0xe0
      [  405.740174]  ? ip_rcv_finish+0x3a0/0x3a0
      [  405.740176]  ? ip_rcv_finish+0xdb/0x3a0
      [  405.740177]  ? ip_rcv+0x2a7/0x400
      [  405.740180]  ? __netif_receive_skb_core+0x970/0xa00
      [  405.740185]  netdev_frame_hook+0xd3/0x160 [openvswitch]
      [  405.740187]  __netif_receive_skb_core+0x1dc/0xa00
      [  405.740194]  ? ixgbe_clean_rx_irq+0x46d/0xa20 [ixgbe]
      [  405.740197]  __netif_receive_skb+0x18/0x60
      [  405.740199]  netif_receive_skb_internal+0x40/0xb0
      [  405.740201]  napi_gro_receive+0xcd/0x120
      [  405.740204]  gro_cell_poll+0x57/0x80 [geneve]
      [  405.740206]  net_rx_action+0x260/0x3c0
      [  405.740209]  __do_softirq+0xc9/0x28c
      [  405.740211]  irq_exit+0xd9/0xf0
      [  405.740213]  do_IRQ+0x51/0xd0
      [  405.740215]  common_interrupt+0x93/0x93
      
      Fixes: 7f8a436e ("openvswitch: Add conntrack action")
      Signed-off-by: NLance Richardson <lrichard@redhat.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      75f01a4c
  5. 14 1月, 2017 15 次提交
    • Y
      tcp: disable fack by default · 94bdc978
      Yuchung Cheng 提交于
      This patch disables FACK by default as RACK is the successor of FACK
      (inspired by the insights behind FACK).
      
      FACK[1] in Linux works as follows: a packet P is deemed lost,
      if packet Q of higher sequence is s/acked and P and Q are distant
      by at least dupthresh number of packets in sequence space.
      
      FACK is more aggressive than the IETF recommened recovery for SACK
      (RFC3517 A Conservative Selective Acknowledgment (SACK)-based Loss
       Recovery Algorithm for TCP), because a single SACK may trigger
      fast recovery. This obviously won't work well with reordering so
      FACK is dynamically disabled upon detecting reordering.
      
      RACK supersedes FACK by using time distance instead of sequence
      distance. On reordering, RACK waits for a quarter of RTT receiving
      a single SACK before starting recovery. (the timer can be made more
      adaptive in the future by measuring reordering distance in time,
      but currently RTT/4 seem to work well.) Once the recovery starts,
      RACK behaves almost like FACK because it reduces the reodering
      window to 1ms, so it fast retransmits quickly. In addition RACK
      can detect loss retransmission as it does not care about the packet
      sequences (being repeated or not), which is extremely useful when
      the connection is going through a traffic policer.
      
      Google server experiments indicate that disabling FACK after enabling
      RACK has negligible impact on the overall loss recovery performance
      with more reordering events detected.  But we still keep the FACK
      implementation for backup if RACK has bugs that needs to be disabled.
      
      [1] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining
      TCP Congestion Control," In Proceedings of SIGCOMM '96, August 1996.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94bdc978
    • Y
      tcp: remove thin_dupack feature · 4a7f6009
      Yuchung Cheng 提交于
      Thin stream DUPACK is to start fast recovery on only one DUPACK
      provided the connection is a thin stream (i.e., low inflight).  But
      this older feature is now subsumed with RACK. If a connection
      receives only a single DUPACK, RACK would arm a reordering timer
      and soon starts fast recovery instead of timeout if no further
      ACKs are received.
      
      The socket option (THIN_DUPACK) is kept as a nop for compatibility.
      Note that this patch does not change another thin-stream feature
      which enables linear RTO. Although it might be good to generalize
      that in the future (i.e., linear RTO for the first say 3 retries).
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4a7f6009
    • Y
      tcp: remove RFC4653 NCR · ac229dca
      Yuchung Cheng 提交于
      This patch removes the (partial) implementation of the aggressive
      limited transmit in RFC4653 TCP Non-Congestion Robustness (NCR).
      
      NCR is a mitigation to the problem created by the dynamic
      DUPACK threshold.  With the current adaptive DUPACK threshold
      (tp->reordering) could cause timeouts by preventing fast recovery.
      For example, if the last packet of a cwnd burst was reordered, the
      threshold will be set to the size of cwnd. But if next application
      burst is smaller than threshold and has drops instead of reorderings,
      the sender would not trigger fast recovery but instead resorts to a
      timeout recovery.
      
      NCR mitigates this issue by checking the number of DUPACKs against
      the current flight size additionally. The techniqueue is similar to
      the early retransmit RFC.
      
      With RACK loss detection, this mitigation is not needed, because RACK
      does not use DUPACK threshold to detect losses. RACK arms a reordering
      timer to fire at most a quarter RTT later to start fast recovery.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ac229dca
    • Y
      tcp: remove early retransmit · bec41a11
      Yuchung Cheng 提交于
      This patch removes the support of RFC5827 early retransmit (i.e.,
      fast recovery on small inflight with <3 dupacks) because it is
      subsumed by the new RACK loss detection. More specifically when
      RACK receives DUPACKs, it'll arm a reordering timer to start fast
      recovery after a quarter of (min)RTT, hence it covers the early
      retransmit except RACK does not limit itself to specific inflight
      or dupack numbers.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bec41a11
    • Y
      tcp: remove forward retransmit feature · 840a3cbe
      Yuchung Cheng 提交于
      Forward retransmit is an esoteric feature in RFC3517 (condition(3)
      in the NextSeg()). Basically if a packet is not considered lost by
      the current criteria (# of dupacks etc), but the congestion window
      has room for more packets, then retransmit this packet.
      
      However it actually conflicts with the rest of recovery design. For
      example, when reordering is detected we want to be conservative
      in retransmitting packets but forward-retransmit feature would
      break that to force more retransmission. Also the implementation is
      fairly complicated inside the retransmission logic inducing extra
      iterations in the write queue. With RACK losses are being detected
      timely and this heuristic is no longer necessary. There this patch
      removes the feature.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      840a3cbe
    • Y
      tcp: extend F-RTO to catch more spurious timeouts · 89fe18e4
      Yuchung Cheng 提交于
      Current F-RTO reverts cwnd reset whenever a never-retransmitted
      packet was (s)acked. The timeout can be declared spurious because
      the packets acknoledged with this ACK was transmitted before the
      timeout, so clearly not all the packets are lost to reset the cwnd.
      
      This nice detection does not really depend F-RTO internals. This
      patch applies the detection universally. On Google servers this
      change detected 20% more spurious timeouts.
      Suggested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      89fe18e4
    • Y
      tcp: enable RACK loss detection to trigger recovery · a0370b3f
      Yuchung Cheng 提交于
      This patch changes two things:
      
      1. Start fast recovery with RACK in addition to other heuristics
         (e.g., DUPACK threshold, FACK). Prior to this change RACK
         is enabled to detect losses only after the recovery has
         started by other algorithms.
      
      2. Disable TCP early retransmit. RACK subsumes the early retransmit
         with the new reordering timer feature. A latter patch in this
         series removes the early retransmit code.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0370b3f
    • Y
      tcp: check undo conditions before detecting losses · 98e36d44
      Yuchung Cheng 提交于
      Currently RACK would mark loss before the undo operations in TCP
      loss recovery. This could incorrectly identify real losses as
      spurious. For example a sender first experiences a delay spike and
      then eventually some packets were lost due to buffer overrun.
      In this case, the sender should perform fast recovery b/c not all
      the packets were lost.
      
      But the sender may first trigger a (spurious) RTO and reset
      cwnd to 1. The following ACKs may used to mark real losses by
      tcp_rack_mark_lost. Then in tcp_process_loss this ACK could trigger
      F-RTO undo condition and unmark real losses and revert the cwnd
      reduction. If there are no more ACKs coming back, eventually the
      sender would timeout again instead of performing fast recovery.
      
      The patch fixes this incorrect process by always performing
      the undo checks before detecting losses.
      
      Fixes: 4f41b1c5 ("tcp: use RACK to detect losses")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98e36d44
    • Y
      tcp: use sequence to break TS ties for RACK loss detection · 1d0833df
      Yuchung Cheng 提交于
      The packets inside a jumbo skb (e.g., TSO) share the same skb
      timestamp, even though they are sent sequentially on the wire. Since
      RACK is based on time, it can not detect some packets inside the
      same skb are lost.  However, we can leverage the packet sequence
      numbers as extended timestamps to detect losses. Therefore, when
      RACK timestamp is identical to skb's timestamp (i.e., one of the
      packets of the skb is acked or sacked), we use the sequence numbers
      of the acked and unacked packets to break ties.
      
      We can use the same sequence logic to advance RACK xmit time as
      well to detect more losses and avoid timeout.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d0833df
    • Y
      tcp: add reordering timer in RACK loss detection · 57dde7f7
      Yuchung Cheng 提交于
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
      a little bit to accomodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
      This translates to expecting a packet (Packet) should take
      (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
      The fudge factor is 2 jiffies to ensure when the timer fires, all
      the suspected packets would exceed the deadline and be marked lost
      by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
      clock may tick between calling icsk_reset_xmit_timer(timeout) and
      actually hang the timer. The next jiffy is to lower-bound the timeout
      to 2 jiffies when reo_wnd is < 1ms.
      
      When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
      in Recovery we'll enter fast recovery and force fast retransmit.
      This is very similar to the early retransmit (RFC5827) except RACK
      is not constrained to only enter recovery for small outstanding
      flights.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57dde7f7
    • Y
      tcp: record most recent RTT in RACK loss detection · deed7be7
      Yuchung Cheng 提交于
      Record the most recent RTT in RACK. It is often identical to the
      "ca_rtt_us" values in tcp_clean_rtx_queue. But when the packet has
      been retransmitted, RACK choses to believe the ACK is for the
      (latest) retransmitted packet if the RTT is over minimum RTT.
      
      This requires passing the arrival time of the most recent ACK to
      RACK routines. The timestamp is now recorded in the "ack_time"
      in tcp_sacktag_state during the ACK processing.
      
      This patch does not change the RACK algorithm itself. It only adds
      the RTT variable to prepare the next main patch.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      deed7be7
    • Y
      tcp: new helper for RACK to detect loss · e636f8b0
      Yuchung Cheng 提交于
      Create a new helper tcp_rack_detect_loss to prepare the upcoming
      RACK reordering timer patch.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e636f8b0
    • Y
      tcp: new helper function for RACK loss detection · db8da6bb
      Yuchung Cheng 提交于
      Create a new helper tcp_rack_mark_skb_lost to prepare the
      upcoming RACK reordering timer support.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db8da6bb
    • S
      tcp: fix tcp_fastopen unaligned access complaints on sparc · 003c9410
      Shannon Nelson 提交于
      Fix up a data alignment issue on sparc by swapping the order
      of the cookie byte array field with the length field in
      struct tcp_fastopen_cookie, and making it a proper union
      to clean up the typecasting.
      
      This addresses log complaints like these:
          log_unaligned: 113 callbacks suppressed
          Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
          Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
          Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
          Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
          Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NShannon Nelson <shannon.nelson@oracle.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      003c9410
    • D
      ipv6: sr: fix several BUGs when preemption is enabled · fa79581e
      David Lebrun 提交于
      When CONFIG_PREEMPT=y, CONFIG_IPV6=m and CONFIG_SEG6_HMAC=y,
      seg6_hmac_init() is called during the initialization of the ipv6 module.
      This causes a subsequent call to smp_processor_id() with preemption
      enabled, resulting in the following trace.
      
      [   20.451460] BUG: using smp_processor_id() in preemptible [00000000] code: systemd/1
      [   20.452556] caller is debug_smp_processor_id+0x17/0x19
      [   20.453304] CPU: 0 PID: 1 Comm: systemd Not tainted 4.9.0-rc5-00973-g46738b13 #1
      [   20.454406]  ffffc9000062fc18 ffffffff813607b2 0000000000000000 ffffffff81a7f782
      [   20.455528]  ffffc9000062fc48 ffffffff813778dc 0000000000000000 00000000001dcf98
      [   20.456539]  ffffffffa003bd08 ffffffff81af93e0 ffffc9000062fc58 ffffffff81377905
      [   20.456539] Call Trace:
      [   20.456539]  [<ffffffff813607b2>] dump_stack+0x63/0x7f
      [   20.456539]  [<ffffffff813778dc>] check_preemption_disabled+0xd1/0xe3
      [   20.456539]  [<ffffffff81377905>] debug_smp_processor_id+0x17/0x19
      [   20.460260]  [<ffffffffa0061f3b>] seg6_hmac_init+0xfa/0x192 [ipv6]
      [   20.460260]  [<ffffffffa0061ccc>] seg6_init+0x39/0x6f [ipv6]
      [   20.460260]  [<ffffffffa006121a>] inet6_init+0x21a/0x321 [ipv6]
      [   20.460260]  [<ffffffffa0061000>] ? 0xffffffffa0061000
      [   20.460260]  [<ffffffff81000457>] do_one_initcall+0x8b/0x115
      [   20.460260]  [<ffffffff811328a3>] do_init_module+0x53/0x1c4
      [   20.460260]  [<ffffffff8110650a>] load_module+0x1153/0x14ec
      [   20.460260]  [<ffffffff81106a7b>] SYSC_finit_module+0x8c/0xb9
      [   20.460260]  [<ffffffff81106a7b>] ? SYSC_finit_module+0x8c/0xb9
      [   20.460260]  [<ffffffff81106abc>] SyS_finit_module+0x9/0xb
      [   20.460260]  [<ffffffff810014d1>] do_syscall_64+0x62/0x75
      [   20.460260]  [<ffffffff816834f0>] entry_SYSCALL64_slow_path+0x25/0x25
      
      Moreover, dst_cache_* functions also call smp_processor_id(), generating
      a similar trace.
      
      This patch uses raw_cpu_ptr() in seg6_hmac_init() rather than this_cpu_ptr()
      and disable preemption when using dst_cache_* functions.
      Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa79581e
  6. 13 1月, 2017 5 次提交
    • M
      mac80211: prevent skb/txq mismatch · dbef5362
      Michal Kazior 提交于
      Station structure is considered as not uploaded
      (to driver) until drv_sta_state() finishes. This
      call is however done after the structure is
      attached to mac80211 internal lists and hashes.
      This means mac80211 can lookup (and use) station
      structure before it is uploaded to a driver.
      
      If this happens (structure exists, but
      sta->uploaded is false) fast_tx path can still be
      taken. Deep in the fastpath call the sta->uploaded
      is checked against to derive "pubsta" argument for
      ieee80211_get_txq(). If sta->uploaded is false
      (and sta is actually non-NULL) ieee80211_get_txq()
      effectively downgraded to vif->txq.
      
      At first glance this may look innocent but coerces
      mac80211 into a state that is almost guaranteed
      (codel may drop offending skb) to crash because a
      station-oriented skb gets queued up on
      vif-oriented txq. The ieee80211_tx_dequeue() ends
      up looking at info->control.flags and tries to use
      txq->sta which in the fail case is NULL.
      
      It's probably pointless to pretend one can
      downgrade skb from sta-txq to vif-txq.
      
      Since downgrading unicast traffic to vif->txq must
      not be done there's no txq to put a frame on if
      sta->uploaded is false. Therefore the code is made
      to fall back to regular tx() op path if the
      described condition is hit.
      
      Only drivers using wake_tx_queue were affected.
      
      Example crash dump before fix:
      
       Unable to handle kernel paging request at virtual address ffffe26c
       PC is at ieee80211_tx_dequeue+0x204/0x690 [mac80211]
       [<bf4252a4>] (ieee80211_tx_dequeue [mac80211]) from
       [<bf4b1388>] (ath10k_mac_tx_push_txq+0x54/0x1c0 [ath10k_core])
       [<bf4b1388>] (ath10k_mac_tx_push_txq [ath10k_core]) from
       [<bf4bdfbc>] (ath10k_htt_txrx_compl_task+0xd78/0x11d0 [ath10k_core])
       [<bf4bdfbc>] (ath10k_htt_txrx_compl_task [ath10k_core])
       [<bf51c5a4>] (ath10k_pci_napi_poll+0x54/0xe8 [ath10k_pci])
       [<bf51c5a4>] (ath10k_pci_napi_poll [ath10k_pci]) from
       [<c0572e90>] (net_rx_action+0xac/0x160)
      Reported-by: NMohammed Shafi Shajakhan <mohammed@qti.qualcomm.com>
      Signed-off-by: NMichal Kazior <michal.kazior@tieto.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      dbef5362
    • F
      mac80211: initialize SMPS field in HT capabilities · 43071d8f
      Felix Fietkau 提交于
      ibss and mesh modes copy the ht capabilites from the band without
      overriding the SMPS state. Unfortunately the default value 0 for the
      SMPS field means static SMPS instead of disabled.
      
      This results in HT ibss and mesh setups using only single-stream rates,
      even though SMPS is not supposed to be active.
      
      Initialize SMPS to disabled for all bands on ieee80211_hw_register to
      ensure that the value is sane where it is not overriden with the real
      SMPS state.
      Reported-by: NElektra Wagenrad <onelektra@gmx.net>
      Signed-off-by: NFelix Fietkau <nbd@nbd.name>
      [move VHT TODO comment to a better place]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      43071d8f
    • P
      cfg80211: Specify the reason for connect timeout · 3093ebbe
      Purushottam Kushwaha 提交于
      This enhances the connect timeout API to also carry the reason for the
      timeout. These reason codes for the connect time out are represented by
      enum nl80211_timeout_reason and are passed to user space through a new
      attribute NL80211_ATTR_TIMEOUT_REASON (u32).
      Signed-off-by: NPurushottam Kushwaha <pkushwah@qti.qualcomm.com>
      Signed-off-by: NJouni Malinen <jouni@qca.qualcomm.com>
      [keep gfp_t argument last]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      3093ebbe
    • V
      cfg80211: Add support to sched scan to report better BSSs · bf95ecdb
      vamsi krishna 提交于
      Enhance sched scan to support option of finding a better BSS while in
      connected state. Firmware scans the medium and reports when it finds a
      known BSS which has better RSSI than the current connected BSS. New
      attributes to specify the relative RSSI (compared to the current BSS)
      are added to the sched scan to implement this.
      Signed-off-by: Nvamsi krishna <vamsin@qti.qualcomm.com>
      Signed-off-by: NJouni Malinen <jouni@qca.qualcomm.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      bf95ecdb
    • V
      cfg80211: Add support for randomizing TA of Public Action frames · ab5bb2d5
      vamsi krishna 提交于
      Add support to use a random local address (Address 2 = TA in transmit
      and the same address in receive functionality) for Public Action frames
      in order to improve privacy of WLAN clients. Applications fill the
      random transmit address in the frame buffer in the NL80211_CMD_FRAME
      command. This can be used only with the drivers that indicate support
      for random local address by setting the new
      NL80211_EXT_FEATURE_MGMT_TX_RANDOM_TA and/or
      NL80211_EXT_FEATURE_MGMT_TX_RANDOM_TA_CONNECTED in ext_features.
      
      The driver needs to configure receive behavior to accept frames to the
      specified random address during the time the frame exchange is pending
      and such frames need to be acknowledged similarly to frames sent to the
      local permanent address when this random address functionality is not
      used.
      Signed-off-by: Nvamsi krishna <vamsin@qti.qualcomm.com>
      Signed-off-by: NJouni Malinen <jouni@qca.qualcomm.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      ab5bb2d5