1. 11 Jul, 2018 5 commits
    • sch_cake: Add NAT awareness to packet classifier · ea825115
      Committed by Toke Høiland-Jørgensen
      When CAKE is deployed on a gateway that also performs NAT (which is a
      common deployment mode), the host fairness mechanism cannot distinguish
      internal hosts from each other, and so fails to work correctly.
      
      To fix this, we add an optional NAT awareness mode, which will query the
      kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
      and use that in the flow and host hashing.
      
      When the shaper is enabled and the host is already performing NAT, the cost
      of this lookup is negligible. However, in unlimited mode with no NAT being
      performed, there is a significant CPU cost at higher bandwidths. For this
      reason, the feature is turned off by default.
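
      As a rough illustration of why the pre-NAT address matters for host
      fairness, the sketch below hashes the source host either from the
      on-wire (post-NAT) address or from the original address that connection
      tracking would report. All struct, field and function names are
      hypothetical and not taken from sch_cake, and the hash is a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a packet leaving a NAT gateway. */
struct pkt_view {
	uint32_t wire_src;   /* source address after NAT (the gateway's own) */
	uint32_t orig_src;   /* pre-NAT source, as conntrack would report it */
	bool     have_orig;  /* false if no conntrack entry was found */
};

/* Placeholder integer mixer (assumption: not the hash the kernel uses). */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x7feb352dU;
	x ^= x >> 15;
	x *= 0x846ca68bU;
	x ^= x >> 16;
	return x;
}

/* Pick the address that identifies the internal host, then hash it. */
static uint32_t src_host_hash(const struct pkt_view *p, bool nat_mode)
{
	assert(p != NULL);
	uint32_t addr = (nat_mode && p->have_orig) ? p->orig_src : p->wire_src;
	return mix32(addr);
}
```

      Without NAT mode, two internal hosts behind the same gateway address
      hash to the same value; with it, they are told apart.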
      
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: Add nf_ct_get_tuple_skb global lookup function · b60a6040
      Committed by Toke Høiland-Jørgensen
      This adds a global netfilter function to extract a conntrack tuple from an
      skb. The function uses a new function added to nf_ct_hook, which will try
      to get the tuple from skb->_nfct, and do a full lookup if that fails. This
      makes it possible to use the lookup function before the skb has passed
      through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is
      copied to the caller to avoid issues with reference counting.
      
      The function returns false if conntrack is not loaded, allowing it to be
      used without incurring a module dependency on conntrack. This is used by
      the NAT mode in sch_cake.
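
      The lookup order described above can be sketched in userspace C as
      follows. This is a simplified model, not the kernel code: the types,
      the hook table and every function name here are hypothetical stand-ins,
      and the cached-entry check is folded into the wrapper for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types. */
struct tuple { uint32_t src, dst; uint16_t sport, dport; };
struct packet {
	const struct tuple *cached;  /* plays the role of skb->_nfct */
	struct tuple hdr;            /* tuple parsed from the packet itself */
};

/* Hook table that an optional conntrack module would register. */
struct ct_hook {
	bool (*full_lookup)(const struct packet *pkt, struct tuple *out);
};
static const struct ct_hook *ct_hook_ptr;  /* NULL while not loaded */

/* Copy the tuple to the caller; return false if it cannot be obtained. */
static bool get_tuple(const struct packet *pkt, struct tuple *out)
{
	const struct ct_hook *hook = ct_hook_ptr;

	assert(pkt && out);
	if (!hook)
		return false;            /* module absent: no hard dependency */
	if (pkt->cached) {
		*out = *pkt->cached;     /* fast path: entry already attached */
		return true;
	}
	return hook->full_lookup(pkt, out);  /* slow path: full lookup */
}

/* Toy lookup used only to exercise the sketch: echoes the header tuple. */
static bool toy_lookup(const struct packet *pkt, struct tuple *out)
{
	*out = pkt->hdr;
	return true;
}
static const struct ct_hook toy_hook = { .full_lookup = toy_lookup };
```

      Copying the tuple by value means the caller never holds a reference to
      the tracking entry, which is the reference-counting point made above.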
      
      Cc: netfilter-devel@vger.kernel.org
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sch_cake: Add optional ACK filter · 8b713881
      Committed by Toke Høiland-Jørgensen
      The ACK filter is an optional feature of CAKE which is designed to improve
      performance on links with very asymmetrical rate limits. On such links
      (which are unfortunately quite prevalent, especially for DSL and cable
      subscribers), the downstream throughput can be limited by the number of
      ACKs capable of being transmitted in the *upstream* direction.
      
      Filtering ACKs can, in general, have adverse effects on TCP performance
      because it interferes with ACK clocking (especially in slow start), and it
      reduces the flow's resiliency to ACKs being dropped further along the path.
      To alleviate these drawbacks, the ACK filter in CAKE tries its best to
      always keep enough ACKs queued to ensure forward progress in the TCP flow
      being filtered. It does this by only filtering redundant ACKs. In its
      default 'conservative' mode, the filter will always keep at least two
      redundant ACKs in the queue, while in 'aggressive' mode, it will filter
      down to a single ACK.
      
      The ACK filter works by inspecting the per-flow queue on every packet
      enqueue. Starting at the head of the queue, the filter looks for another
      eligible packet to drop (so the ACK being dropped is always closer to the
      head of the queue than the packet being enqueued). An ACK is eligible only
      if it ACKs *fewer* bytes than the new packet being enqueued, including any
      SACK options. This prevents duplicate ACKs from being filtered, to avoid
      interfering with retransmission logic. In addition, we check TCP header
      options and only drop those that are known to not interfere with sender
      state. In particular, packets with unknown option codes are never dropped.
      
      In aggressive mode, an eligible packet is always dropped, while in
      conservative mode, at least two ACKs are kept in the queue. Only pure ACKs
      (with no data segments) are considered eligible for dropping, but when an
      ACK with data segments is enqueued, this can cause another pure ACK to
      become eligible for dropping.
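
      A much-simplified sketch of the eligibility scan follows. It models only
      the cumulative ACK number and ignores SACK and other TCP options. The
      names are hypothetical, and the conservative rule shown (drop only when
      two other eligible ACKs remain queued) is one reading of the description
      above, not the kernel's exact logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct qpkt {
	bool     pure_ack;  /* ACK carrying no payload */
	uint32_t ack_seq;   /* cumulative ACK number */
};

/* True if a acknowledges fewer bytes than b (wrap-safe comparison). */
static bool acks_fewer(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Scan the flow queue from the head and return the index of the queued ACK
 * to drop when a packet acknowledging new_ack is enqueued, or -1 for none.
 */
static int ack_filter_pick(const struct qpkt *q, size_t len,
			   uint32_t new_ack, bool aggressive)
{
	int first = -1;
	size_t eligible = 0;

	assert(q != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		if (!q[i].pure_ack || !acks_fewer(q[i].ack_seq, new_ack))
			continue;         /* data packets and duplicate ACKs stay */
		if (first < 0)
			first = (int)i;   /* candidate closest to the head */
		eligible++;
	}
	if (first < 0)
		return -1;
	if (aggressive)
		return first;             /* an eligible ACK is always dropped */
	return eligible > 2 ? first : -1; /* conservative: leave two behind */
}
```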
      
      The approach described above ensures that this ACK filter avoids most of
      the drawbacks of a naive filtering mechanism that only keeps flow state but
      does not inspect the queue. This is the rationale for including the ACK
      filter in CAKE itself rather than as a separate module (as a TC filter,
      for instance).
      
      Our performance evaluation has shown that on a 30/1 Mbps link with a
      bidirectional traffic test (RRUL), turning on the ACK filter on the
      upstream link improves downstream throughput by ~20% (both modes) and
      upstream throughput by ~12% in conservative mode and ~40% in aggressive
      mode, at the cost of ~5ms of inter-flow latency due to the increased
      congestion.
      
      In *really* pathological cases, the effect can be a lot more; for instance,
      the ACK filter increases the achievable downstream throughput on a link
      with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5
      Mbps to ~25 Mbps).
      
      Finally, even though we consider the ACK filter to be safer than most, we
      do not recommend turning it on everywhere: on more symmetrical link
      bandwidths the effect is negligible at best.
      
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sch_cake: Add ingress mode · 7298de9c
      Committed by Toke Høiland-Jørgensen
      The ingress mode is meant to be enabled when CAKE runs downlink of the
      actual bottleneck (such as on an IFB device). The mode changes the shaper
      to also account dropped packets to the shaped rate, as these have already
      traversed the bottleneck.
      
      Enabling ingress mode will also tune the AQM to always keep at least two
      packets queued *for each flow*. This is done by scaling the minimum queue
      occupancy level that will disable the AQM by the number of active bulk
      flows. The rationale for this is that retransmits are more expensive in
      ingress mode, since dropped packets have to traverse the bottleneck again
      when they are retransmitted; thus, being more lenient and keeping a minimum
      number of packets queued will improve throughput in cases where the number
      of active flows is so large that they saturate the bottleneck even at
      their minimum window size.
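
      The scaling of the minimum occupancy can be pictured with the sketch
      below. It counts the backlog in packets purely for illustration; the
      function name, the packet-based unit and the base floor of two packets
      outside ingress mode are assumptions, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical base floor: the AQM never acts on a queue this short. */
#define BASE_FLOOR_PKTS 2u
static_assert(BASE_FLOOR_PKTS >= 1, "floor must keep at least one packet");

/* Decide whether the AQM is allowed to drop or mark at this backlog. */
static bool aqm_may_act(uint32_t backlog_pkts, uint32_t bulk_flows,
			bool ingress)
{
	uint64_t floor_pkts = BASE_FLOOR_PKTS;

	/* In ingress mode, scale the floor by the number of bulk flows. */
	if (ingress && bulk_flows > 1)
		floor_pkts *= bulk_flows;
	return backlog_pkts > floor_pkts;
}
```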
      
      This commit also adds a separate switch to enable ingress mode rate
      autoscaling. If enabled, the autoscaling code will observe the actual
      traffic rate and adjust the shaper rate to match it. This can help avoid
      latency increases in the case where the actual bottleneck rate decreases
      below the shaped rate. The scaling filters out spikes by an EWMA filter.
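
      A minimal integer EWMA of the kind mentioned above might look like the
      following. The function name and the single shift parameter are
      assumptions made for this sketch; the filter in sch_cake may be
      parameterized differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One EWMA step: move avg towards sample by 1/2^shift of the difference.
 * A larger shift smooths more and reacts more slowly to spikes.
 */
static uint64_t ewma_update(uint64_t avg, uint64_t sample, unsigned shift)
{
	assert(shift < 64);
	if (sample >= avg)
		return avg + ((sample - avg) >> shift);
	return avg - ((avg - sample) >> shift);
}
```

      With a shift of 3, a single sample at twice the running average moves
      the estimate by only one eighth of the difference.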
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sched: Add Common Applications Kept Enhanced (cake) qdisc · 046f6fd5
      Committed by Toke Høiland-Jørgensen
      sch_cake targets the home router use case and is intended to squeeze the
      most bandwidth and the lowest latency out of even the slowest ISP links
      and routers, while presenting an API simple enough that even an ISP can
      configure it.
      
      Example of use on a cable ISP uplink:
      
      tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
      
      To shape a cable download link (ifb and tc-mirred setup elided):
      
      tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
      
      CAKE is filled with:
      
      * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
        derived Flow Queuing system, which autoconfigures based on the bandwidth.
      * A novel "triple-isolate" mode (the default) which balances per-host
        and per-flow FQ even through NAT.
      * A deficit-based shaper that can also be used in an unlimited mode.
      * 8 way set associative hashing to reduce flow collisions to a minimum.
      * A reasonable interpretation of various diffserv latency/loss tradeoffs.
      * Support for zeroing diffserv markings for entering and exiting traffic.
      * Support for interacting well with Docsis 3.0 shaper framing.
      * Extensive support for DSL framing types.
      * Support for ack filtering.
      * Extensive statistics for measuring loss, ECN markings, and latency
        variation.
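
      To make the set-associative hashing item concrete, here is a small
      sketch of an 8-way lookup. The table size, the struct layout and the
      fallback used when a set is full are assumptions for illustration and
      do not mirror the data structures in sch_cake.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS 8
#define SETS 4  /* hypothetical tiny table: 32 queues in total */

struct flow_table {
	uint32_t tag[SETS * WAYS];   /* flow hash stored in each slot */
	bool     used[SETS * WAYS];
};

/* Map a flow hash to a queue index, claiming a free way on a miss. */
static unsigned flow_lookup(struct flow_table *t, uint32_t hash)
{
	unsigned base = (hash % SETS) * WAYS;
	int free_way = -1;

	assert(t != NULL);
	for (unsigned w = 0; w < WAYS; w++) {
		unsigned i = base + w;

		if (t->used[i] && t->tag[i] == hash)
			return i;             /* hit: flow already has a queue */
		if (!t->used[i] && free_way < 0)
			free_way = (int)w;    /* remember the first free way */
	}
	if (free_way < 0)
		return base + (hash / SETS) % WAYS;  /* set full: share a queue */

	t->used[base + (unsigned)free_way] = true;
	t->tag[base + (unsigned)free_way] = hash;
	return base + (unsigned)free_way;
}
```

      Eight flows that land in the same set still get eight distinct queues;
      only the ninth has to share one.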
      
      A paper describing the design of CAKE is available at
      https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
      International Symposium on Local and Metropolitan Area Networks (LANMAN).
      
      This patch adds the base shaper and packet scheduler, while subsequent
      commits add the optional (configurable) features. The full userspace API
      and most data structures are included in this commit, but options not
      understood in the base version will be ignored.
      
      Various versions have been baking as an out-of-tree build for kernel
      versions going back to 3.10, as the embedded router world has been
      running a few years behind mainline Linux. A stable version has been
      generally available on lede-17.01 and later.
      
      sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
      in the sqm-scripts, with sane defaults and vastly simpler configuration.
      
      CAKE's principal author is Jonathan Morton, with contributions from
      Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
      Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
      and Loganaden Velvindron.
      
      Testing was done by Pete Heist, Georgios Amanakis, and the many other
      members of the cake@lists.bufferbloat.net mailing list.
      
      tc -s qdisc show dev eth2
       qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
       Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
        backlog 0b 0p requeues 12
        memory used: 1053008b of 15140Kb
        capacity estimate: 970Mbit
        min/max network layer size:           28 /    1500
        min/max overhead-adjusted size:       84 /    1538
        average network hdr offset:           14
                          Bulk  Best Effort        Voice
         thresh      62500Kbit        1Gbit      250Mbit
         target          5.0ms        5.0ms        5.0ms
         interval      100.0ms      100.0ms      100.0ms
         pk_delay          5us          5us          6us
         av_delay          3us          2us          2us
         sp_delay          2us          1us          1us
         backlog            0b           0b           0b
         pkts          3164050     25030267      9530280
         bytes      3227519915  35396974782  12879808898
         way_inds            0            8            0
         way_miss           21          366           25
         way_cols            0            0            0
         drops               5            0            1
         marks               0            0            0
         ack_drop            0            0            0
         sp_flows            1            3            0
         bk_flows            0            1            1
         un_flows            0            0            0
         max_len         68130        68130        68130
      Tested-by: Pete Heist <peteheist@gmail.com>
      Tested-by: Georgios Amanakis <gamanakis@gmail.com>
      Signed-off-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 10 Jul, 2018 20 commits
  3. 08 Jul, 2018 15 commits
    • tcp: remove redundant SOCK_DONE checks · c47078d6
      Committed by Eric Dumazet
      In both tcp_splice_read() and tcp_recvmsg(), we already test
      sock_flag(sk, SOCK_DONE) right before evaluating sk->sk_state,
      so "!sock_flag(sk, SOCK_DONE)" is always true.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-Spectrum2-acl-prep' · 3d907eaf
      Committed by David S. Miller
      Ido Schimmel says:
      
      ====================
      mlxsw: Spectrum-2 small ACL preparations
      
      This is the first set of changes towards Spectrum-2 support in the mlxsw
      driver. It contains small changes that prepare the code for the later
      introduction of Spectrum-2 support.
      
      The Spectrum-2 ASIC uses an algorithmic TCAM (A-TCAM) instead of the
      circuit TCAM (C-TCAM) used by Spectrum, and thus most of the changes are
      around the ACL code.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_acl_flex_actions: Fix helper to get the first KVD linear index · 0317a6f4
      Committed by Jiri Pirko
      The helper should always return the KVD linear index of the second set.
      It is unused now, but will be used soon.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_acl_flex_actions: Allow the first set to be dummy · 5b9488fd
      Committed by Jiri Pirko
      In Spectrum-2, the real action sets are always in KVD linear. The first
      set is always empty and contains only a pointer to the first real set in
      KVD linear. So provide the possibility to specify that the first set is
      a dummy one.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Put pointer to flex action ops to mlxsw_sp · 9dbab6f5
      Committed by Jiri Pirko
      Spectrum-2 needs slightly different handling of flexible actions. So
      put an ops pointer in the mlxsw_sp struct and rename it.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_acl_flex_keys: Change SRC_SYS_PORT flex key element size · 82b63bcf
      Committed by Jiri Pirko
      The SRC_SYS_PORT is passed as an 8-bit value down to hw anyway, so cap it
      in the driver as well. Also, in Spectrum-2 the FW iface for SRC_SYS_PORT
      is only 8 bits, so prepare for it.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_acl_flex_keys: Split MAC and IP address flex key elements · c43ea06d
      Committed by Jiri Pirko
      In Spectrum-2, MAC addresses and IP addresses are both split. To use
      the same elements for Spectrum and Spectrum-2, split them now.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_acl: Ignore always-zeroed bits in tp->prio · 2139469b
      Committed by Jiri Pirko
      The lowest 16 bits of tp->prio are always zero, so ignore them with a
      shift.
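
      In other words, only the upper half of the 32-bit value carries
      information. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Drop the always-zero low 16 bits of a tp->prio style value. */
static uint32_t rule_priority(uint32_t tp_prio)
{
	assert((tp_prio & 0xffffu) == 0);  /* premise stated in the commit */
	return tp_prio >> 16;
}
```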
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Introduce Flex2 key type for PTAR register · 45e0620d
      Committed by Jiri Pirko
      Introduce the Flex2 key type for the PTAR register, which is used in
      Spectrum-2. Also, extend mlxsw_reg_ptar_pack() to set the value
      according to the caller.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Change name of mlxsw_sp_afk_blocks to mlxsw_sp1_afk_blocks · d4b0d20f
      Committed by Jiri Pirko
      This is specific to Spectrum, as Spectrum-2 has completely different key
      blocks.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: Fix warnings from xchg() on RCU'd cookie pointer. · 0dbc81ea
      Committed by David S. Miller
      The kbuild test robot reports:
      
      >> net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different address spaces) @@    expected struct tc_cookie [noderef] <asn:4>*__ret @@    got [noderef] <asn:4>*__ret @@
         net/sched/act_api.c:71:15:    expected struct tc_cookie [noderef] <asn:4>*__ret
         net/sched/act_api.c:71:15:    got struct tc_cookie *new_cookie
      >> net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different address spaces) @@    expected struct tc_cookie *old @@    got struct tc_cookie [noderef] <struct tc_cookie *old @@
         net/sched/act_api.c:71:13:    expected struct tc_cookie *old
         net/sched/act_api.c:71:13:    got struct tc_cookie [noderef] <asn:4>*[assigned] __ret
      >> net/sched/act_api.c:132:48: sparse: dereference of noderef expression
      
      Handle this in the usual way by force casting away the __rcu annotation
      when we are using xchg() on it.
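
      The shape of such a cast can be sketched in userspace as follows. Here
      __rcu and __force are defined as empty macros because they only have
      meaning to sparse, xchg() is emulated with a GCC/Clang builtin, and
      cookie_swap() is a hypothetical helper rather than the function touched
      by this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Sparse-only annotations in the kernel; they expand to nothing here. */
#define __rcu
#define __force

struct tc_cookie { int len; };

/* Userspace stand-in for the kernel's xchg() (GCC/Clang builtin). */
#define xchg(ptr, val) __atomic_exchange_n((ptr), (val), __ATOMIC_SEQ_CST)

/* Swap in a new cookie and return the old one. */
static struct tc_cookie *cookie_swap(struct tc_cookie __rcu **slot,
				     struct tc_cookie *new_cookie)
{
	assert(slot != NULL);
	/* Cast away the address-space annotation so both operands match. */
	return xchg((__force struct tc_cookie **)slot, new_cookie);
}
```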
      
      Fixes: eec94fdb ("net: sched: use rcu for action cookie update")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Modify-action-API-for-implementing-lockless-actions' · e9ec8045
      Committed by David S. Miller
      Vlad Buslov says:
      
      ====================
      Modify action API for implementing lockless actions
      
      Currently, all netlink protocol handlers for updating rules, actions and
      qdiscs are protected by a single global rtnl lock, which removes any
      possibility of parallelism. This patch set is a first step towards
      removing the rtnl lock dependency from the TC rules update path.

      Recently, a new rtnl registration flag, RTNL_FLAG_DOIT_UNLOCKED, was
      added. Handlers registered with this flag are called without RTNL taken.
      The end goal is to have rule update handlers (RTM_NEWTFILTER,
      RTM_DELTFILTER, etc.) registered with the UNLOCKED flag to allow
      parallel execution. However, there is no intention to completely remove
      or split the rtnl lock itself. This patch set addresses specific
      problems in the action API that prevent it from being executed
      concurrently. It does not completely unlock the rules or actions update
      path. Additional patch sets are required to refactor individual actions
      and filters update for parallel execution.

      As a preparation for executing TC rules update handlers without the rtnl
      lock, the action API code was audited to determine areas that assume
      external synchronization with the rtnl lock and must be changed to allow
      safe concurrent access, with the following results:
      
      1. Action idr is already protected with a spinlock. However, some code
         paths assume that idr state does not change between several
         consecutive tcf_idr_* function calls.
      2. tc_action reference and bind counters are implemented as plain
         integers. Their purpose was to allow single actions to be shared
         between multiple filters, not to provide means for concurrent
         modification.
      3. tc_action 'cookie' pointer field is not protected against
         modification.
      4. Action API functions, that work with set of actions, use intrusive
         linked list, which cannot be used concurrently without additional
         synchronization.
      5. Action API functions don't take reference to actions while using
         them, assuming external synchronization with rtnl lock.
      
      The following solutions to these problems are implemented:
      
      1. To remove assumption that idr state doesn't change between tcf_idr_*
         calls, implement new functions that atomically perform several
         operations on idr without releasing idr spinlock. (function to
         atomically lookup and delete action by index, function to atomically
         check if action exists and allocate new one if necessary, etc.)
      2. Use atomic operations on counters to make them suitable for
         concurrent get/put operations.
      3. Data that 'cookie' points to is never modified, so it is enough to
         refactor it to an rcu pointer to prevent concurrent de-allocation.
      4. Action API doesn't actually use any linked list specific operations
         on actions intrusive linked list, so it can be refactored to array in
         straightforward manner.
      5. Always take a reference to an action while accessing it in the action
         API. The tcf_idr_search function is modified to take a reference to
         the action before returning it, so there is no way to look up an
         action without incrementing its reference counter. All users of this
         function are modified to release the reference after they are done
         using the action. With all users using reference counting, it is now
         safe to concurrently delete actions.
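
      Points 2 and 5 can be illustrated with a small userspace sketch using
      C11 atomics. The table stands in for the action idr, all names are
      hypothetical, and the spinlock that would protect the lookup itself is
      omitted for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct action {
	atomic_int refcnt;
	int index;
};

#define MAX_ACTIONS 4
static struct action *table[MAX_ACTIONS];  /* stand-in for the idr */

/* Lookup always returns with a reference held, or NULL if not found. */
static struct action *action_search(int index)
{
	struct action *a;

	if (index < 0 || index >= MAX_ACTIONS)
		return NULL;
	a = table[index];
	if (a)
		atomic_fetch_add(&a->refcnt, 1);
	return a;
}

/* Drop one reference; returns true when the last one was released. */
static bool action_put(struct action *a)
{
	assert(a != NULL);
	return atomic_fetch_sub(&a->refcnt, 1) == 1;
}
```

      Because every successful lookup is paired with a put, a concurrent
      delete only frees the action once the last user has let go of it.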
      
      Additionally, the actions init function signature was expanded with a
      'rtnl_held' argument, which allows actions that have an internal
      dependency on the rtnl lock to take/release it when necessary.

      Since the only shared state in the action API module is the actions
      themselves and the action idr, these changes are sufficient to stop
      relying on the global rtnl lock for protection of internal action API
      data structures.
      
      Changes from V5 to V6:
      - Rebase on current net-next
      - When action is deleted, set pointer in actions array to NULL to
        prevent double freeing.
      
      Changes from V4 to V5:
      - Change action delete API to track actions that were deleted, to
        prevent releasing them on error.
      
      Changes from V3 to V4:
      - Expand cover letter.
      - Reduce actions array size in tcf_action_init_1.
      - Rebase on latest net-next.
      
      Changes from V2 to V3:
      - Re-send with changelog copied to individual patches.
      
      Changes from V1 to V2:
      - Removed redundant actions ops lookup during delete.
      - Merge action ops delete definition and implementation.
      - Assume all actions have delete implemented and don't check for it
        explicitly.
      - Resplit action lookup/release code to prevent memory leaks in
        individual patches.
      - Make __tcf_idr_check function static
      - Remove unique idr insertion function. Change original idr insert to do
        the same thing.
      - Merge changes that take reference to action when performing lookup and
        changes that account for this additional reference when dumping action
        to user space into single patch.
      - Change convoluted commit message.
      - Rename "unlocked" to "rtnl_held" for clarity.
      - Remove estimator lock add patch.
      - Refactor action check-alloc code into standalone function.
      - Rename tcf_idr_find_delete to tcf_idr_delete_index.
      - Rearrange variable definitions in tc_action_delete.
      - Add patch that refactors action API code to use array of pointers to
        actions instead of intrusive linked list.
      - Expand cover letter.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: change action API to use array of pointers to actions · 90b73b77
      Committed by Vlad Buslov
      The act API used a linked list to pass a set of actions to functions. It
      is an intrusive data structure that stores list nodes inside the action
      structure itself, which means it is not safe to modify such a list
      concurrently. However, the action API doesn't use any linked-list-specific
      operations on this set of actions, so it can be safely refactored into a
      plain pointer array.
      
      Refactor action API to use array of pointers to tc_actions instead of
      linked list. Change argument 'actions' type of exported action init,
      destroy and dump functions.
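
      A sketch of what passing the set as a plain array can look like is shown
      below. The struct, the helper name and the plain integer counter are
      hypothetical simplifications; clearing each slot after release is
      included only to show how an array makes double release easy to avoid.

```c
#include <assert.h>
#include <stddef.h>

struct act { int refcnt; };

/*
 * Release every action in a NULL-terminated array of at most cap entries
 * and return how many were released.
 */
static size_t acts_put_all(struct act *acts[], size_t cap)
{
	size_t n = 0;

	assert(acts != NULL);
	for (size_t i = 0; i < cap && acts[i]; i++) {
		acts[i]->refcnt--;
		acts[i] = NULL;  /* a second pass cannot release it again */
		n++;
	}
	return n;
}
```

      The array lives in the caller, so nothing in the action structure itself
      is modified just to build or walk the set.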
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: atomically check-allocate action · 0190c1d4
      Committed by Vlad Buslov
      Implement function that atomically checks if action exists and either takes
      reference to it, or allocates idr slot for action index to prevent
      concurrent allocations of actions with same index. Use EBUSY error pointer
      to indicate that idr slot is reserved.
      
      Implement cleanup helper function that removes temporary error pointer from
      idr. (in case of error between idr allocation and insertion of newly
      created action to specified index)
      
      Refactor all action init functions to insert new action to idr using this
      API.
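
      The reserve-then-insert pattern can be sketched as follows. This is a
      single-threaded model with hypothetical names: a real implementation
      would hold the idr spinlock around the check, and returning -EBUSY to
      the caller when the slot is already reserved is an assumption of this
      sketch rather than the behaviour of the kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct act { int refcnt; };

#define SLOTS 8
static struct act *slots[SLOTS];  /* stand-in for the action idr */

/* Error pointer marking a slot as reserved but not yet populated. */
#define RESERVED ((struct act *)(intptr_t)-EBUSY)

/*
 * Returns 1 with *out referenced if the action exists, 0 if the slot was
 * free and is now reserved for the caller, or a negative errno.
 */
static int act_check_alloc(unsigned idx, struct act **out)
{
	assert(out != NULL);
	if (idx >= SLOTS)
		return -EINVAL;
	if (slots[idx] == RESERVED)
		return -EBUSY;          /* someone else is creating it */
	if (slots[idx]) {
		slots[idx]->refcnt++;   /* existing action: take a reference */
		*out = slots[idx];
		return 1;
	}
	slots[idx] = RESERVED;          /* block concurrent allocations */
	*out = NULL;
	return 0;
}

/* Publish a newly created action in its reserved slot. */
static void act_insert(unsigned idx, struct act *a)
{
	assert(idx < SLOTS && slots[idx] == RESERVED);
	slots[idx] = a;
}

/* Undo a reservation after a failure between reserve and insert. */
static void act_cleanup(unsigned idx)
{
	if (idx < SLOTS && slots[idx] == RESERVED)
		slots[idx] = NULL;
}
```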
      Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: use reference counting action init · cae422f3
      Committed by Vlad Buslov
      Change action API to assume that action init function always takes
      reference to action, even when overwriting existing action. This is
      necessary because action API continues to use action pointer after init
      function is done. At this point action becomes accessible for concurrent
      modifications, so user must always hold reference to it.
      
      Implement helper put list function to atomically release list of actions
      after action API init code is done using them.
      Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>