1. 03 11月, 2015 1 次提交
  2. 22 10月, 2015 1 次提交
    • A
      netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen 提交于
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding this a argument to get_link_af_size op in rtnl_af_ops.
      
      Bridge module already used filtering aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and thus removed.
      Signed-off-by: NRonen Arad <ronen.arad@intel.com>
      Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1974ed0
  3. 21 10月, 2015 3 次提交
    • Y
      tcp: use RACK to detect losses · 4f41b1c5
      Yuchung Cheng 提交于
      This patch implements the second half of RACK that uses the the most
      recent transmit time among all delivered packets to detect losses.
      
      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if an not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered.
      If so the packet is deemed lost.
      
      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found 1msec accommodates well on tiny degree of reordering
      (<3 pkts) on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicated congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
      
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f41b1c5
    • Y
      tcp: track the packet timings in RACK · 659a8ad5
      Yuchung Cheng 提交于
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then current retransmitted
      sequence below the newly sacked sequence must been lost,
      since at least one round trip time has elapsed.
      
      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery ut not timeout recovery
      
      RACK (Recently ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when later retransmission is
      s/acked while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering.
      
      This patch implements tcp_advanced_rack() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example that the sender sends six packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
      
      The second half is implemented in the next patch that marks packet
      lost using RACK timestamp.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      659a8ad5
    • Y
      tcp: track min RTT using windowed min-filter · f6722583
      Yuchung Cheng 提交于
      Kathleen Nichols' algorithm for tracking the minimum RTT of a
      data stream over some measurement window. It uses constant space
      and constant time per update. Yet it almost always delivers
      the same minimum as an implementation that has to keep all
      the data in the window. The measurement window is tunable via
      sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
      
      The algorithm keeps track of the best, 2nd best & 3rd best min
      values, maintaining an invariant that the measurement time of
      the n'th best >= n-1'th best. It also makes sure that the three
      values are widely separated in the time window since that bounds
      the worse case error when that data is monotonically increasing
      over the window.
      
      Upon getting a new min, we can forget everything earlier because
      it has no value - the new min is less than everything else in the
      window by definition and it's the most recent. So we restart fresh
      on every new min and overwrites the 2nd & 3rd choices. The same
      property holds for the 2nd & 3rd best.
      
      Therefore we have to maintain two invariants to maximize the
      information in the samples, one on values (1st.v <= 2nd.v <=
      3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
      now). These invariants determine the structure of the code
      
      The RTT input to the windowed filter is the minimum RTT measured
      from ACK or SACK, or as the last resort from TCP timestamps.
      
      The accessor tcp_min_rtt() returns the minimum RTT seen in the
      window. ~0U indicates it is not available. The minimum is 1usec
      even if the true RTT is below that.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6722583
  4. 19 10月, 2015 1 次提交
  5. 17 10月, 2015 2 次提交
  6. 16 10月, 2015 2 次提交
  7. 15 10月, 2015 7 次提交
  8. 14 10月, 2015 1 次提交
  9. 13 10月, 2015 12 次提交
    • D
      net: Add IPv6 support to l3mdev · ccf3c8c3
      David Ahern 提交于
      Add operations to retrieve cached IPv6 dst entry from l3mdev device
      and lookup IPv6 source address.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ccf3c8c3
    • A
      cfg80211: Add multiple scan plans for scheduled scan · 3b06d277
      Avraham Stern 提交于
      Add the option to configure multiple 'scan plans' for scheduled scan.
      Each 'scan plan' defines the number of scan cycles and the interval
      between scans. The scan plans are executed in the order they were
      configured. The last scan plan will always run infinitely and thus
      defines only the interval between scans.
      The maximum number of scan plans supported by the device and the
      maximum number of iterations in a single scan plan are advertised
      to userspace so it can configure the scan plans appropriately.
      
      When scheduled scan results are received there is no way to know which
      scan plan is being currently executed, so there is no way to know when
      the next scan iteration will start. This is not a problem, however.
      The scan start timestamp is only used for flushing old scan results,
      and there is no difference between flushing all results received until
      the end of the previous iteration or the start of the current one,
      since no results will be received in between.
      Signed-off-by: NAvraham Stern <avraham.stern@intel.com>
      Signed-off-by: NLuca Coelho <luciano.coelho@intel.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      3b06d277
    • D
      nl80211: allow BSS data to include CLOCK_BOOTTIME timestamp · 6e19bc4b
      Dmitry Shmidt 提交于
      For location and connectivity services, userspace would often like
      to know the time when the BSS was last seen. The current "last seen"
      value is calculated in a way that makes it less useful, especially
      if the system suspended in the meantime.
      
      Add the ability for the driver to report a real CLOCK_BOOTTIME stamp
      that can then be reported to userspace (if present).
      
      Drivers wishing to use this must be converted to the new API to call
      cfg80211_inform_bss_data() or cfg80211_inform_bss_frame_data(). They
      need to ensure the reported value is accurate enough even when the
      frame might have been buffered in the device (e.g. firmware.)
      Signed-off-by: NDmitry Shmidt <dimitrysh@google.com>
      [modified to use struct, inlines]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      6e19bc4b
    • T
      Revert "mac80211: remove exposing 'mfp' to drivers" · 93f0490e
      Tamizh chelvam 提交于
      This reverts commit 5c48f120.
      
      Some device drivers (ath10k) offload part of aggregation including AddBA/DelBA
      negotiations to firmware. In such scenario, the PMF configuration of
      the station needs to be provided to driver to enable encryption of
      AddBA/DelBA action frames.
      Signed-off-by: NTamizh chelvam <c_traja@qti.qualcomm.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      93f0490e
    • E
      ipv6: Pass struct net into nf_ct_frag6_gather · b7277597
      Eric W. Biederman 提交于
      The function nf_ct_frag6_gather is called on both the input and the
      output paths of the networking stack.  In particular ipv6_defrag which
      calls nf_ct_frag6_gather is called from both the the PRE_ROUTING chain
      on input and the LOCAL_OUT chain on output.
      
      The addition of a net parameter makes it explicit which network
      namespace the packets are being reassembled in, and removes the need
      for nf_ct_frag6_gather to guess.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7277597
    • E
      ipv4: Pass struct net into ip_defrag and ip_check_defrag · 19bcf9f2
      Eric W. Biederman 提交于
      The function ip_defrag is called on both the input and the output
      paths of the networking stack.  In particular conntrack when it is
      tracking outbound packets from the local machine calls ip_defrag.
      
      So add a struct net parameter and stop making ip_defrag guess which
      network namespace it needs to defragment packets in.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19bcf9f2
    • P
      ipv4/icmp: redirect messages can use the ingress daddr as source · e2ca690b
      Paolo Abeni 提交于
      This patch allows configuring how the source address of ICMP
      redirect messages is selected; by default the old behaviour is
      retained, while setting icmp_redirects_use_orig_daddr force the
      usage of the destination address of the packet that caused the
      redirect.
      
      The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
      following scenario:
      
      Two machines are set up with VRRP to act as routers out of a subnet,
      they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
      x.x.x.254/24.
      
      If a host in said subnet needs to get an ICMP redirect from the VRRP
      router, i.e. to reach a destination behind a different gateway, the
      source IP in the ICMP redirect is chosen as the primary IP on the
      interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
      
      The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
      and will continue to use the wrong next-op.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2ca690b
    • E
      tcp: shrink tcp_timewait_sock by 8 bytes · d475f090
      Eric Dumazet 提交于
      Reducing tcp_timewait_sock from 280 bytes to 272 bytes
      allows SLAB to pack 15 objects per page instead of 14 (on x86)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d475f090
    • E
      net: shrink struct sock and request_sock by 8 bytes · ed53d0ab
      Eric Dumazet 提交于
      One 32bit hole is following skc_refcnt, use it.
      skc_incoming_cpu can also be an union for request_sock rcv_wnd.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed53d0ab
    • E
      net: align sk_refcnt on 128 bytes boundary · 8e5eb54d
      Eric Dumazet 提交于
      sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
      This is a performance issue if multiple cpus hit a common socket,
      or multiple sockets are chained due to SO_REUSEPORT.
      
      By moving sk_refcnt 8 bytes further, first 128 bytes of sockets
      are mostly read. As they contain the lookup keys, this has
      a considerable performance impact, as cpus can cache them.
      
      These 8 bytes are not wasted, we use them as a place holder
      for various fields, depending on the socket type.
      
      Tested:
       SYN flood hitting a 16 RX queues NIC.
       TCP listener using 16 sockets and SO_REUSEPORT
       and SO_INCOMING_CPU for proper siloing.
      
       Could process 6.0 Mpps SYN instead of 4.2 Mpps
      
       Kernel profile looked like :
          11.68%  [kernel]  [k] sha_transform
           6.51%  [kernel]  [k] __inet_lookup_listener
           5.07%  [kernel]  [k] __inet_lookup_established
           4.15%  [kernel]  [k] memcpy_erms
           3.46%  [kernel]  [k] ipt_do_table
           2.74%  [kernel]  [k] fib_table_lookup
           2.54%  [kernel]  [k] tcp_make_synack
           2.34%  [kernel]  [k] tcp_conn_request
           2.05%  [kernel]  [k] __netif_receive_skb_core
           2.03%  [kernel]  [k] kmem_cache_alloc
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e5eb54d
    • E
      net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet 提交于
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch incoming cpu handling a particular TCP flow after accept()
      
      This commits adds setsockopt() support and extends SO_REUSEPORT selection
      logic : If a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if CPU handling the packet matches the specified
      one.
      
      This allows to build very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keep cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. Following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70da268b
    • E
      sock: support per-packet fwmark · f28ea365
      Edward Jee 提交于
      It's useful to allow users to set fwmark for an individual packet,
      without changing the socket state. The function this patch adds in
      sock layer can be used by the protocols that need such a feature.
      Signed-off-by: NEdward Hyunkoo Jee <edjee@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f28ea365
  10. 12 10月, 2015 3 次提交
  11. 11 10月, 2015 4 次提交
  12. 08 10月, 2015 3 次提交