1. 23 10月, 2015 5 次提交
    • R
      mpls: flow-based multipath selection · 1c78efa8
      Robert Shearman 提交于
      Change the selection of a multipath route to use a flow-based
      hash. This more suitable for traffic sensitive to reordering within a
      flow (e.g. TCP, L2VPN) and whilst still allowing a good distribution
      of traffic given enough flows.
      
      Selection of the path for a multipath route is done using a hash of:
      1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and
         including entropy label, whichever is first.
      2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS
         payload, if present.
      
      Naturally, a 5-tuple hash using L4 information in addition would be
      possible and be better in some scenarios, but there is a tradeoff
      between looking deeper into the packet to achieve good distribution,
      and packet forwarding performance, and I have erred on the side of the
      latter as the default.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c78efa8
    • R
      mpls: multipath route support · f8efb73c
      Roopa Prabhu 提交于
      This patch adds support for MPLS multipath routes.
      
      Includes following changes to support multipath:
      - splits struct mpls_route into 'struct mpls_route + struct mpls_nh'
      
      - 'struct mpls_nh' represents a mpls nexthop label forwarding entry
      
      - moves mpls route and nexthop structures into internal.h
      
      - A mpls_route can point to multiple mpls_nh structs
      
      - the nexthops are maintained as a array (similar to ipv4 fib)
      
      - In the process of restructuring, this patch also consistently changes
        all labels to u8
      
      - Adds support to parse/fill RTA_MULTIPATH netlink attribute for
      multipath routes similar to ipv4/v6 fib
      
      - In this patch, the multipath route nexthop selection algorithm
      simply returns the first nexthop. It is replaced by a
      hash based algorithm from Robert Shearman in the next patch
      
      - mpls_route_update cleanup: remove 'dev' handling in mpls_route_update.
      mpls_route_update though implemented to update based on dev, it was
      never used that way. And the dev handling gets tricky with multiple
      nexthops. Cannot match against any single nexthops dev. So, this patch
      removes the unused 'dev' handling in mpls_route_update.
      
      - dead route/path handling will be implemented in a subsequent patch
      
      Example:
      
      $ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \
                      nexthop as 700 via inet 10.1.1.6 dev swp2 \
                      nexthop as 800 via inet 40.1.1.2 dev swp3
      
      $ip  -f mpls route show
      100
              nexthop as to 200 via inet 10.1.1.2  dev swp1
              nexthop as to 700 via inet 10.1.1.6  dev swp2
              nexthop as to 800 via inet 40.1.1.2  dev swp3
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Acked-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8efb73c
    • H
      if_link: Add control trust VF · dd461d6a
      Hiroshi Shimamoto 提交于
      Add netlink directives and ndo entry to trust VF user.
      
      This controls the special permission of VF user.
      The administrator will dedicatedly trust VF user to use some features
      which impacts security and/or performance.
      
      The administrator never turn it on unless VF user is fully trusted.
      
      CC: Sy Jong Choi <sy.jong.choi@intel.com>
      Signed-off-by: NHiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Acked-by: NGreg Rose <gregory.v.rose@intel.com>
      Tested-by: NKrishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      dd461d6a
    • E
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet 提交于
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e0724d0
    • P
      ipv4: implement support for NOPREFIXROUTE ifa flag for ipv4 address · 7b131180
      Paolo Abeni 提交于
      Currently adding a new ipv4 address always cause the creation of the
      related network route, with default metric. When a host has multiple
      interfaces on the same network, multiple routes with the same metric
      are created.
      
      If the userspace wants to set specific metric on each routes, i.e.
      giving better metric to ethernet links in respect to Wi-Fi ones,
      the network routes must be deleted and recreated, which is error-prone.
      
      This patch implements the support for IFA_F_NOPREFIXROUTE for ipv4
      address. When an address is added with such flag set, no associated
      network route is created, no network route is deleted when
      said IP is gone and it's up to the user space manage such route.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b131180
  2. 22 10月, 2015 20 次提交
  3. 21 10月, 2015 15 次提交
    • E
      Adding switchdev ageing notification on port bridged · 6ac311ae
      Elad Raz 提交于
      Configure ageing time to the HW for newly bridged device
      
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NElad Raz <eladr@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NScott Feldman <sfeldma@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ac311ae
    • Y
      tcp: use RACK to detect losses · 4f41b1c5
      Yuchung Cheng 提交于
      This patch implements the second half of RACK that uses the the most
      recent transmit time among all delivered packets to detect losses.
      
      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if an not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered.
      If so the packet is deemed lost.
      
      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found 1msec accommodates well on tiny degree of reordering
      (<3 pkts) on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicated congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
      
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f41b1c5
    • Y
      tcp: track the packet timings in RACK · 659a8ad5
      Yuchung Cheng 提交于
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then current retransmitted
      sequence below the newly sacked sequence must been lost,
      since at least one round trip time has elapsed.
      
      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery ut not timeout recovery
      
      RACK (Recently ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when later retransmission is
      s/acked while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering.
      
      This patch implements tcp_advanced_rack() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example that the sender sends six packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
      
      The second half is implemented in the next patch that marks packet
      lost using RACK timestamp.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      659a8ad5
    • Y
      tcp: add tcp_tsopt_ecr_before helper · 77c63127
      Yuchung Cheng 提交于
      a helper to prepare the main RACK patch
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77c63127
    • Y
      tcp: remove tcp_mark_lost_retrans() · af82f4e8
      Yuchung Cheng 提交于
      Remove the existing lost retransmit detection because RACK subsumes
      it completely. This also stops the overloading the ack_seq field of
      the skb control block.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af82f4e8
    • Y
      tcp: track min RTT using windowed min-filter · f6722583
      Yuchung Cheng 提交于
      Kathleen Nichols' algorithm for tracking the minimum RTT of a
      data stream over some measurement window. It uses constant space
      and constant time per update. Yet it almost always delivers
      the same minimum as an implementation that has to keep all
      the data in the window. The measurement window is tunable via
      sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
      
      The algorithm keeps track of the best, 2nd best & 3rd best min
      values, maintaining an invariant that the measurement time of
      the n'th best >= n-1'th best. It also makes sure that the three
      values are widely separated in the time window since that bounds
      the worse case error when that data is monotonically increasing
      over the window.
      
      Upon getting a new min, we can forget everything earlier because
      it has no value - the new min is less than everything else in the
      window by definition and it's the most recent. So we restart fresh
      on every new min and overwrites the 2nd & 3rd choices. The same
      property holds for the 2nd & 3rd best.
      
      Therefore we have to maintain two invariants to maximize the
      information in the samples, one on values (1st.v <= 2nd.v <=
      3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
      now). These invariants determine the structure of the code
      
      The RTT input to the windowed filter is the minimum RTT measured
      from ACK or SACK, or as the last resort from TCP timestamps.
      
      The accessor tcp_min_rtt() returns the minimum RTT seen in the
      window. ~0U indicates it is not available. The minimum is 1usec
      even if the true RTT is below that.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6722583
    • Y
      tcp: apply Kern's check on RTTs used for congestion control · 9e45a3e3
      Yuchung Cheng 提交于
      Currently ca_seq_rtt_us does not use Kern's check. Fix that by
      checking if any packet acked is a retransmit, for both RTT used
      for RTT estimation and congestion control.
      
      Fixes: 5b08e47c ("tcp: prefer packet timing to TS-ECR for RTT")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e45a3e3
    • J
      Bluetooth: Fix missing hdev locking for LE scan cleanup · 8ce783dc
      Johan Hedberg 提交于
      The hci_conn objects don't have a dedicated lock themselves but rely
      on the caller to hold the hci_dev lock for most types of access. The
      hci_conn_timeout() function has so far sent certain HCI commands based
      on the hci_conn state which has been possible without holding the
      hci_dev lock.
      
      The recent changes to do LE scanning before connect attempts added
      even more operations to hci_conn and hci_dev from hci_conn_timeout,
      thereby exposing potential race conditions with the hci_dev and
      hci_conn states.
      
      As an example of such a race, here there's a timeout but an
      l2cap_sock_connect() call manages to race with the cleanup routine:
      
      [Oct21 08:14] l2cap_chan_timeout: chan ee4b12c0 state BT_CONNECT
      [  +0.000004] l2cap_chan_close: chan ee4b12c0 state BT_CONNECT
      [  +0.000002] l2cap_chan_del: chan ee4b12c0, conn f3141580, err 111, state BT_CONNECT
      [  +0.000002] l2cap_sock_teardown_cb: chan ee4b12c0 state BT_CONNECT
      [  +0.000005] l2cap_chan_put: chan ee4b12c0 orig refcnt 4
      [  +0.000010] hci_conn_drop: hcon f53d56e0 orig refcnt 1
      [  +0.000013] l2cap_chan_put: chan ee4b12c0 orig refcnt 3
      [  +0.000063] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
      [  +0.000049] hci_conn_params_del: addr ee:0d:30:09:53:1f (type 1)
      [  +0.000002] hci_chan_list_flush: hcon f53d56e0
      [  +0.000001] hci_chan_del: hci0 hcon f53d56e0 chan f4e7ccc0
      [  +0.004528] l2cap_sock_create: sock e708fc00
      [  +0.000023] l2cap_chan_create: chan ee4b1770
      [  +0.000001] l2cap_chan_hold: chan ee4b1770 orig refcnt 1
      [  +0.000002] l2cap_sock_init: sk ee4b3390
      [  +0.000029] l2cap_sock_bind: sk ee4b3390
      [  +0.000010] l2cap_sock_setsockopt: sk ee4b3390
      [  +0.000037] l2cap_sock_connect: sk ee4b3390
      [  +0.000002] l2cap_chan_connect: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f (type 2) psm 0x00
      [  +0.000002] hci_get_route: 00:02:72:d9:e5:8b -> ee:0d:30:09:53:1f
      [  +0.000001] hci_dev_hold: hci0 orig refcnt 8
      [  +0.000003] hci_conn_hold: hcon f53d56e0 orig refcnt 0
      
      Above the l2cap_chan_connect() shouldn't have been able to reach the
      hci_conn f53d56e0 anymore but since hci_conn_timeout didn't do proper
      locking that's not the case. The end result is a reference to hci_conn
      that's not in the conn_hash list, resulting in list corruption when
      trying to remove it later:
      
      [Oct21 08:15] l2cap_chan_timeout: chan ee4b1770 state BT_CONNECT
      [  +0.000004] l2cap_chan_close: chan ee4b1770 state BT_CONNECT
      [  +0.000003] l2cap_chan_del: chan ee4b1770, conn f3141580, err 111, state BT_CONNECT
      [  +0.000001] l2cap_sock_teardown_cb: chan ee4b1770 state BT_CONNECT
      [  +0.000005] l2cap_chan_put: chan ee4b1770 orig refcnt 4
      [  +0.000002] hci_conn_drop: hcon f53d56e0 orig refcnt 1
      [  +0.000015] l2cap_chan_put: chan ee4b1770 orig refcnt 3
      [  +0.000038] hci_conn_timeout: hcon f53d56e0 state BT_CONNECT
      [  +0.000003] hci_chan_list_flush: hcon f53d56e0
      [  +0.000002] hci_conn_hash_del: hci0 hcon f53d56e0
      [  +0.000001] ------------[ cut here ]------------
      [  +0.000461] WARNING: CPU: 0 PID: 1782 at lib/list_debug.c:56 __list_del_entry+0x3f/0x71()
      [  +0.000839] list_del corruption, f53d56e0->prev is LIST_POISON2 (00000200)
      
      The necessary fix is unfortunately more complicated than just adding
      hci_dev_lock/unlock calls to the hci_conn_timeout() call path.
      Particularly, the hci_conn_del() API, which expects the hci_dev lock to
      be held, performs a cancel_delayed_work_sync(&hcon->disc_work) which
      would lead to a deadlock if the hci_conn_timeout() call path tries to
      acquire the same lock.
      
      This patch solves the problem by deferring the cleanup work to a
      separate work callback. To protect against the hci_dev or hci_conn
      going away meanwhile temporary references are taken with the help of
      hci_dev_hold() and hci_conn_get().
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Cc: stable@vger.kernel.org # 4.3
      8ce783dc
    • J
      mac80211: move station statistics into sub-structs · e5a9f8d0
      Johannes Berg 提交于
      Group station statistics by where they're (mostly) updated
      (TX, RX and TX-status) and group them into sub-structs of
      the struct sta_info.
      
      Also rename the variables since the grouping now makes it
      obvious where they belong.
      
      This makes it easier to identify where the statistics are
      updated in the code, and thus easier to think about them.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      e5a9f8d0
    • J
      mac80211: move beacon_loss_count into ifmgd · 976bd9ef
      Johannes Berg 提交于
      There's little point in keeping (and even sending to userspace)
      the beacon_loss_count value per station, since it can only apply
      to the AP on a managed-mode connection. Move the value to ifmgd,
      advertise it only in managed mode, and remove it from ethtool as
      it's available through better interfaces.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      976bd9ef
    • J
      mac80211: remove sta->last_ack_signal · 763aa27a
      Johannes Berg 提交于
      This file only feeds a debugfs file that isn't very useful, so remove
      it. If necessary, we can add other ways to get this information, for
      example in the NL80211_CMD_PROBE_CLIENT response.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      763aa27a
    • M
      Bluetooth: Introduce driver specific post init callback · 98a63aaf
      Marcel Holtmann 提交于
      Some drivers might have to restore certain settings after the init
      procedure has been completed. This driver callback allows them to hook
      into that stage. This callback is run just before the controller is
      declared as powered up.
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Signed-off-by: NJohan Hedberg <johan.hedberg@intel.com>
      98a63aaf
    • D
      Bluetooth: l2cap_disconnection_req priority over shutdown · 9f7378a9
      Dean Jenkins 提交于
      There is a L2CAP protocol race between the local peer and
      the remote peer demanding disconnection of the L2CAP link.
      
      When L2CAP ERTM is used, l2cap_sock_shutdown() can be called
      from userland to disconnect L2CAP. However, there can be a
      delay introduced by waiting for ACKs. During this waiting
      period, the remote peer may have sent a Disconnection Request.
      Therefore, recheck the shutdown status of the socket
      after waiting for ACKs because there is no need to do
      further processing if the connection has gone.
      Signed-off-by: NDean Jenkins <Dean_Jenkins@mentor.com>
      Signed-off-by: NHarish Jenny K N <harish_kandiga@mentor.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      9f7378a9
    • D
      Bluetooth: Reorganize mutex lock in l2cap_sock_shutdown() · 04ba72e6
      Dean Jenkins 提交于
      This commit reorganizes the mutex lock and is now
      only protecting l2cap_chan_close(). This is now consistent
      with other places where l2cap_chan_close() is called.
      
      If a conn connection exists, call
      mutex_lock(&conn->chan_lock) before calling l2cap_chan_close()
      to ensure other L2CAP protocol operations do not interfere.
      
      Note that the conn structure has to be protected from being
      freed as it is possible for the connection to be disconnected
      whilst the locks are not held. This solution allows the mutex
      lock to be used even when the connection has just been
      disconnected.
      
      This commit also reduces the scope of chan locking.
      
      The only place where chan locking is needed is the call to
      l2cap_chan_close(chan, 0) which if necessary closes the channel.
      Therefore, move the l2cap_chan_lock(chan) and
      l2cap_chan_lock(chan) locking calls to around
      l2cap_chan_close(chan, 0).
      
      This allows __l2cap_wait_ack(sk, chan) to be called with no
      chan locks being held so L2CAP messaging over the ACL link
      can be done unimpaired.
      Signed-off-by: NDean Jenkins <Dean_Jenkins@mentor.com>
      Signed-off-by: NHarish Jenny K N <harish_kandiga@mentor.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      04ba72e6
    • D
      Bluetooth: Unwind l2cap_sock_shutdown() · e7456437
      Dean Jenkins 提交于
      l2cap_sock_shutdown() is designed to only action shutdown
      of the channel when shutdown is not already in progress.
      Therefore, reorganise the code flow by adding a goto
      to jump to the end of function handling when shutdown is
      already being actioned. This removes one level of code
      indentation and make the code more readable.
      Signed-off-by: NDean Jenkins <Dean_Jenkins@mentor.com>
      Signed-off-by: NHarish Jenny K N <harish_kandiga@mentor.com>
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      e7456437