1. 08 10月, 2017 12 次提交
    • W
      ipv6: update fn_sernum after route is inserted to tree · bbd63f06
      Wei Wang 提交于
      fib6_add() logic currently calls fib6_add_1() to figure out what node
      should be used for the newly added route and then call
      fib6_add_rt2node() to insert the route to the node.
      And during the call of fib6_add_1(), fn_sernum is updated for all nodes
      that share the same prefix as the new route.
      This does not have issue in the current code because reader thread will
      not be able to access the tree while writer thread is inserting new
      route to it. However, it is not the case once we transition to use RCU.
      Reader thread could potentially see the new fn_sernum before the new
      route is inserted. As a result, reader thread's route lookup will return
      a stale route with the new fn_sernum.
      
      In order to solve this issue, we remove all the update of fn_sernum in
      fib6_add_1(), and instead, introduce a new function that updates fn_sernum
      for all related nodes and call this functions once the route is
      successfully inserted to the tree.
      Also, smp_wmb() is used after a route is successfully inserted into the
      fib tree and right before the updated of fn->sernum. And smp_rmb() is
      used right after fn->sernum is accessed in rt6_get_cookie_safe(). This
      is to guarantee that when the reader thread sees the new fn->sernum, the
      new route is already inserted in the tree in memory.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bbd63f06
    • W
      ipv6: replace dst_hold() with dst_hold_safe() in routing code · d3843fe5
      Wei Wang 提交于
      With rwlock, it is safe to call dst_hold() in the read thread because
      read thread is guaranteed to be separated from write thread.
      However, after we replace rwlock with rcu, it is no longer safe to use
      dst_hold(). A dst might already have been deleted but is waiting for the
      rcu grace period to pass before freeing the memory when a read thread is
      trying to do dst_hold(). This could potentially cause double free issue.
      
      So this commit replaces all dst_hold() with dst_hold_safe() in all read
      thread to avoid this double free issue.
      And in order to make the code more compact, a new function ip6_hold_safe()
      is introduced. It calls dst_hold_safe() first, and if that fails, it will
      either fall back to hold and return net->ipv6.ip6_null_entry or set rt to
      NULL according to the caller's need.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3843fe5
    • W
      ipv6: don't release rt->rt6i_pcpu memory during rt6_release() · 51e398e8
      Wei Wang 提交于
      After rwlock is replaced with rcu and spinlock, route lookup can happen
      simultanously with route deletion.
      This patch removes the call to free_percpu(rt->rt6i_pcpu) from
      rt6_release() to avoid the race condition between rt6_release() and
      rt6_get_pcpu_route(). And as free_percpu(rt->rt6i_pcpu) is already
      called in ip6_dst_destroy() after the rcu grace period, it is safe to do
      this change.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51e398e8
    • W
      ipv6: grab rt->rt6i_ref before allocating pcpu rt · a94b9367
      Wei Wang 提交于
      After rwlock is replaced with rcu and spinlock, ip6_pol_route() will be
      called with only rcu held. That means rt6 route deletion could happen
      simultaneously with rt6_make_pcpu_rt(). This could potentially cause
      memory leak if rt6_release() is called right before rt6_make_pcpu_rt()
      on the same route.
      
      This patch grabs rt->rt6i_ref safely before calling rt6_make_pcpu_rt()
      to make sure rt6_release() will not get triggered while
      rt6_make_pcpu_rt() is in progress. And rt6_release() is called after
      rt6_make_pcpu_rt() is finished.
      
      Note: As we are incrementing rt->rt6i_ref in ip6_pol_route(), there is a
      very slim chance that fib6_purge_rt() will be triggered unnecessarily
      when deleting a route if ip6_pol_route() running on another thread picks
      this route as well and tries to make pcpu cache for it.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a94b9367
    • W
      ipv6: hook up exception table to store dst cache · 2b760fcf
      Wei Wang 提交于
      This commit makes use of the exception hash table implementation to
      store dst caches created by pmtu discovery and ip redirect into the hash
      table under the rt_info and no longer inserts these routes into fib6
      tree.
      This makes the fib6 tree only contain static configured routes and could
      now be protected by rcu instead of a rw lock.
      With this change, in the route lookup related functions, after finding
      the rt6_info with the longest prefix, we also need to search for the
      exception table before doing backtracking.
      In the route delete function, if the route being deleted is not a dst
      cache, deletion of this route also need to flush the whole hash table
      under it. If it is a dst cache, then only delete the cached dst in the
      hash table.
      
      Note: for fib6_walk_continue() function, w->root now is always pointing
      to a root node considering that fib6_prune_clones() is removed from the
      code. So we add a WARN_ON() msg to make sure w->root always points to a
      root node and also removed the update of w->root in fib6_repair_tree().
      This is a prerequisite for later patch because we don't need to make
      w->root as rcu protected when replacing rwlock with RCU.
      Also, we remove all prune related variables as it is no longer used.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b760fcf
    • W
      ipv6: prepare fib6_locate() for exception table · 38fbeeee
      Wei Wang 提交于
      fib6_locate() is used to find the fib6_node according to the passed in
      prefix address key. It currently tries to find the fib6_node with the
      exact match of the passed in key. However, when we move cached routes
      into the exception table, fib6_locate() will fail to find the fib6_node
      for it as the cached routes will be stored in the exception table under
      the fib6_node with the longest prefix match of the cache's dst addr key.
      This commit adds a new parameter to let the caller specify if it needs
      exact match or longest prefix match.
      Right now, all callers still does exact match when calling
      fib6_locate(). It will be changed in later commit where exception table
      is hooked up to store cached routes.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38fbeeee
    • W
      ipv6: prepare fib6_age() for exception table · c757faa8
      Wei Wang 提交于
      If all dst cache entries are stored in the exception table under the
      main route, we have to go through them during fib6_age() when doing
      garbage collecting.
      Introduce a new function rt6_age_exception() which goes through all dst
      entries in the exception table and remove those entries that are expired.
      This function is called in fib6_age() so that all dst caches are also
      garbage collected.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c757faa8
    • W
      ipv6: prepare rt6_clean_tohost() for exception table · b16cb459
      Wei Wang 提交于
      If we move all cached dst into the exception table under the main route,
      current rt6_clean_tohost() will no longer be able to access them.
      This commit makes fib6_clean_tohost() to also go through all cached
      routes in exception table and removes cached gateway routes to the
      passed in gateway.
      This is a preparation in order to move all cached routes into the
      exception table.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b16cb459
    • W
      ipv6: prepare rt6_mtu_change() for exception table · f5bbe7ee
      Wei Wang 提交于
      If we move all cached dst into the exception table under the main route,
      current rt6_mtu_change() will no longer be able to access them.
      This commit makes rt6_mtu_change_route() function to also go through all
      cached routes in the exception table under the main route and do proper
      updates on the mtu.
      This is a preparation in order to move all cached routes into the
      exception table.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f5bbe7ee
    • W
      ipv6: prepare fib6_remove_prefsrc() for exception table · 60006a48
      Wei Wang 提交于
      After we move cached dst entries into the exception table under its
      parent route, current fib6_remove_prefsrc() no longer can access them.
      This commit makes fib6_remove_prefsrc() also go through all routes
      in the exception table to remove the pref src.
      This is a preparation patch in order to move all cached dst into the
      exception table.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      60006a48
    • W
      ipv6: introduce a hash table to store dst cache · 35732d01
      Wei Wang 提交于
      Add a hash table into struct rt6_info in order to store dst caches
      created by pmtu discovery and ip redirect in ipv6 routing code.
      APIs to add dst cache, delete dst cache, find dst cache and update
      dst cache in the hash table are implemented and will be used in later
      commits.
      This is a preparation work to move all cache routes into the exception
      table instead of getting inserted into the fib6 tree.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35732d01
    • W
      ipv6: introduce a new function fib6_update_sernum() · 180ca444
      Wei Wang 提交于
      This function takes a route as input and tries to update the sernum in
      the fib6_node this route is associated with. It will be used in later
      commit when adding a cached route into the exception table under that
      route.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      180ca444
  2. 07 10月, 2017 25 次提交
  3. 06 10月, 2017 3 次提交
    • J
      i40e/i40evf: organize and re-number feature flags · b74f571f
      Jacob Keller 提交于
      Now that we've reduced the number of flags, organize similar flags
      together and re-number them accordingly.
      
      Since we don't yet have more than 32 flags, we'll use a u32 for both the
      hw_features and flag field. Should we gain more flags in the future, we
      may need to convert to a u64 or separate flags out into two fields.
      
      One alternative approach considered, but not implemented here, was to
      use an enumeration for the flag variables, and create a macro
      I40E_FLAG() which used string concatenation to generate BIT_ULL values.
      This has the advantage of making the actual bit values compile-time
      dynamic so that we do not need to worry about matching the order to the
      bit value. However, this does produce a high level of code churn, and
      makes it more difficult to read a dumped flags value when debugging.
      
      Change-ID: I8653fff69453cd547d6fe98d29dfa9d8710387d1
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: NMitch Williams <mitch.a.williams@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      b74f571f
    • J
      i40e: ignore skb->xmit_more when deciding to set RS bit · a5340d93
      Jacob Keller 提交于
      Since commit 6a7fded7 ("i40e: Fix RS bit update in Tx path and
      disable force WB workaround") we've tried to "optimize" setting the
      RS bit based around skb->xmit_more. This same logic was refactored
      in commit 1dc8b538 ("i40e: Reorder logic for coalescing RS bits"),
      but ultimately was not functionally changed.
      
      Using skb->xmit_more in this way is incorrect, because in certain
      circumstances we may see a large number of skbs in sequence with
      xmit_more set. This leads to a performance loss as the hardware does not
      writeback anything for those packets, which delays the time it takes for
      us to respond to the stack transmit requests. This significantly impacts
      UDP performance, especially when layered with multiple devices, such as
      bonding, VLANs, and vnet setups.
      
      This was not noticed until now because it is difficult to create a setup
      which reproduces the issue. It was discovered in a UDP_STREAM test in
      a VM, connected using a vnet device to a bridge, which is connected to
      a bonded pair of X710 ports in active-backup mode with a VLAN. These
      layered devices seem to compound the number of skbs transmitted at once
      by the qdisc. Additionally, the problem can be masked by reducing the
      ITR value.
      
      Since the original commit does not provide strong justification for this
      RS bit "optimization", revert to the previous behavior of setting the RS
      bit every 4th packet.
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      a5340d93
    • J
      i40evf: enable support for VF VLAN tag stripping control · 0a3b4f70
      Jacob Keller 提交于
      A recent commit 809481484e5d ("i40e/i40evf: support for VF VLAN tag
      stripping control") added support for VFs to negotiate the control of
      VLAN tag stripping. This should have allowed VFs to disable the feature.
      Unfortunately, the flag was set only in netdev->feature flags and not in
      netdev->hw_features.
      
      This ultimately causes the stack to assume that it cannot change the
      flag, so it was unchangeable and marked as [fixed] in the ethtool -k
      output.
      
      Fix this by setting the feature in hw_features first, just as we do for
      the PF code. This enables ethtool -K to disable the feature correctly,
      and fully enables user control of the VLAN tag stripping feature.
      Signed-off-by: NJacob Keller <jacob.e.keller@intel.com>
      Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      0a3b4f70