1. 23 October 2017, 7 commits
    • mlxsw: spectrum: Increase number of linear entries · f11fbaf8
      Authored by Ido Schimmel
      The memory region where adjacency entries (nexthops) are stored is
      called the KVD linear and is configured during initialization with a
      size of 64K.
      
      Extend this area by 32K more entries, which will be partitioned into 64
      groups of 0.5K entries, thereby allowing us to support weighted nexthops
      with high accuracy.
      
      Change the ratio between the two types of hash entries, so as to prevent
      a reduction in the number of double hash entries, which are used for IPv6
      neighbours and routes with a prefix length greater than 64.
      
      Note that the user will be able to control all these sizes once the
      devlink resource manager is introduced.
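
      For reference, a minimal sketch of the sizing arithmetic described above
      (the define names are illustrative, not the driver's actual ones):

      /* Illustrative constants only; the real driver defines its own KVDL layout. */
      #define KVDL_SINGLE_SIZE       (64 * 1024)   /* existing linear entries      */
      #define KVDL_LARGE_GROUP_SIZE  512           /* 0.5K entries per group       */
      #define KVDL_LARGE_GROUP_COUNT 64            /* groups for weighted nexthops */
      #define KVDL_LARGE_SIZE        (KVDL_LARGE_GROUP_COUNT * \
                                      KVDL_LARGE_GROUP_SIZE)         /* 32K extra  */
      #define KVDL_TOTAL_SIZE        (KVDL_SINGLE_SIZE + KVDL_LARGE_SIZE) /* 96K   */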
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Populate adjacency entries according to weights · eb789980
      Authored by Ido Schimmel
      Up until now the driver assumed all the nexthops have an equal weight
      and wrote each to a single adjacency entry.
      
      This patch takes the `weight` parameter into account and populates the
      adjacency group according to the relative weight of each nexthop.
      
      Specifically, the weights of all the nexthops that should be offloaded
      are first normalized and then used to calculate the upper adjacency
      index of each nexthop. This is done according to the hash-threshold
      algorithm used by the kernel for IPv4 multi-path routing.
      
      Adjacency groups are currently limited to 32 entries, which limits the
      weights that can be used, but follow-up patches will introduce groups of
      512 entries.
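
      A rough sketch of the population step described above, with made-up names
      rather than the driver's actual code: the weights are summed, and each
      nexthop's upper bound within the group is its cumulative weight scaled to
      the group size, hash-threshold style.

      struct nh { unsigned int weight; unsigned int num_entries; };

      /* Distribute 'group_size' adjacency entries among the nexthops in
       * proportion to their weights.
       */
      static void adj_group_fill(struct nh *nhs, int nh_count, unsigned int group_size)
      {
              unsigned int total = 0, cum = 0, lower = 0, upper;
              int i;

              for (i = 0; i < nh_count; i++)
                      total += nhs[i].weight;

              for (i = 0; i < nh_count; i++) {
                      cum += nhs[i].weight;
                      upper = cum * group_size / total;
                      nhs[i].num_entries = upper - lower;
                      lower = upper;
              }
      }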
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Prepare for large adjacency groups · 425a08c6
      Authored by Ido Schimmel
      The device has certain restrictions regarding the size of an adjacency
      group.
      
      Have the router determine the size of the adjacency group according to
      available KVDL allocation sizes and these restrictions.
      
      This was not needed until now since only allocations of up to 32 entries
      were supported and these are all valid sizes for an adjacency group.
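
      A minimal sketch of this step, assuming a hypothetical table of the
      allocation sizes the KVDL allocator supports; the real driver also applies
      the device's own restrictions on adjacency group sizes.

      /* Hypothetical list of supported KVDL allocation sizes; pick the smallest
       * one that can hold the requested number of adjacency entries.
       */
      static const unsigned int kvdl_alloc_sizes[] = { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 };

      static int adj_group_size_get(unsigned int requested, unsigned int *p_size)
      {
              size_t i;

              for (i = 0; i < sizeof(kvdl_alloc_sizes) / sizeof(kvdl_alloc_sizes[0]); i++) {
                      if (kvdl_alloc_sizes[i] >= requested) {
                              *p_size = kvdl_alloc_sizes[i];
                              return 0;
                      }
              }
              return -ENOBUFS;   /* larger than any supported group size */
      }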
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_router: Store weight in nexthop struct · 408bd946
      Authored by Ido Schimmel
      As the first step towards non-equal-cost multi-path support, store each
      nexthop's weight.
      
      For IPv6 nexthops, always set the weight to 1, as IPv6 only supports ECMP.
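
      A sketch of the bookkeeping this describes, with illustrative names (the
      driver's actual struct and field names differ):

      struct nexthop_sketch {
              /* ... existing nexthop state ... */
              unsigned int weight;   /* relative weight within the nexthop group */
      };

      /* IPv4: take the weight from the FIB nexthop configuration.
       * IPv6: always use 1, since IPv6 multi-path routing is equal-cost only.
       */
      static void nexthop_weight_set(struct nexthop_sketch *nh, bool is_v6,
                                     unsigned int fib_weight)
      {
              nh->weight = is_v6 ? 1 : fib_weight;
      }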
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Add ability to query KVDL allocation size · d672aec4
      Authored by Ido Schimmel
      The current KVDL allocation API allows the user to specify the requested
      number of entries, but the user has no way of knowing how many entries
      were actually allocated.
      
      This works because existing users (e.g., the router) request the exact
      number they end up using. With the introduction of large adjacency
      groups, this will change, as the router will have the ability to choose
      from several allocation sizes, where larger allocations provide higher
      accuracy with respect to requested weights and better resilience against
      nexthop failures.
      
      One option is to have the router try several allocations of descending
      size until one succeeds, but a better way is to simply allow it to query
      the actual allocation size and then size its request accordingly.
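
      A sketch of what such an interface could look like (hypothetical names,
      not the driver's actual functions): the caller first asks what size a
      request would actually get, then sizes both its allocation and its weight
      arithmetic to that answer.

      /* Hypothetical allocator interface: report the size a request for
       * 'entry_count' entries would be rounded up to, then allocate it.
       */
      int kvdl_alloc_size_query(unsigned int entry_count, unsigned int *p_alloc_size);
      int kvdl_alloc(unsigned int entry_count, unsigned int *p_entry_index);

      static int adj_group_get(unsigned int sum_of_weights, unsigned int *p_adj_index,
                               unsigned int *p_group_size)
      {
              int err;

              err = kvdl_alloc_size_query(sum_of_weights, p_group_size);
              if (err)
                      return err;

              /* Request exactly what the allocator reported it would hand out. */
              return kvdl_alloc(*p_group_size, p_adj_index);
      }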
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Better represent KVDL partitions · a875a2ee
      Authored by Ido Schimmel
      The KVD linear (KVDL) allocator currently consists of a very large
      bitmap that reflects the KVDL's usage. The boundaries of each partition
      as well as their allocation size are represented using defines.
      
      This representation requires us to patch all the functions that act on a
      partition whenever the partitioning scheme is changed. In addition, it
      does not allow dynamic configuration of the KVDL via the upcoming
      resource manager.
      
      Add objects to represent these partitions as well as the accompanying
      code that acts on them to perform allocations and de-allocations.
      
      In the following patches, this will allow us to easily add another
      partition as well as new operations to act on these partitions.
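
      A sketch of the kind of objects this refers to, with illustrative field
      names: each partition knows its own boundaries and allocation granularity,
      so the allocator can iterate over partitions instead of relying on
      hard-coded defines.

      struct kvdl_part_info {
              unsigned int start_index;   /* first KVDL linear index in this partition */
              unsigned int end_index;     /* last KVDL linear index in this partition  */
              unsigned int alloc_size;    /* allocation granularity, in entries        */
      };

      struct kvdl_part {
              const struct kvdl_part_info *info;
              unsigned long usage[];      /* allocation bitmap covering the partition  */
      };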
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_dpipe: Add adjacency group size · e69cd9d7
      Authored by Ido Schimmel
      The adjacency group size is part of the match on the adjacency group and
      should therefore be exposed using dpipe.
      
      When non-equal-cost multi-path support is introduced, the group's size
      will help users understand the exact number of adjacency entries each
      nexthop occupies, as a nexthop will no longer correspond to a single
      entry.
      
      The output for a multi-path route with two nexthops, one with weight 255
      and the other with weight 1, will be:
      
      Example:
      
      $ devlink dpipe table dump pci/0000:01:00.0 name mlxsw_adj
      pci/0000:01:00.0:
        index 0
        match_value:
          type field_exact header mlxsw_meta field adj_index value 65536
          type field_exact header mlxsw_meta field adj_size value 512
          type field_exact header mlxsw_meta field adj_hash_index value 0
        action_value:
          type field_modify header ethernet field destination mac value e4:1d:2d:a5:f3:64
          type field_modify header mlxsw_meta field erif_port mapping ifindex mapping_value 3 value 1
      
        index 1
        match_value:
          type field_exact header mlxsw_meta field adj_index value 65536
          type field_exact header mlxsw_meta field adj_size value 512
          type field_exact header mlxsw_meta field adj_hash_index value 510
        action_value:
          type field_modify header ethernet field destination mac value e4:1d:2d:a5:f3:65
          type field_modify header mlxsw_meta field erif_port mapping ifindex mapping_value 4 value 2
      
      Thus, the first nexthop occupies 510 adjacency entries and the second 2,
      which leads to a ratio of 255 to 1.
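
      The split in the dump follows directly from scaling the weights to the
      group size (a worked example; the driver's exact rounding may differ by an
      entry):

      unsigned int group_size = 512, w1 = 255, w2 = 1;
      unsigned int total = w1 + w2;               /* 256                      */
      unsigned int n1 = w1 * group_size / total;  /* 255 * 512 / 256 = 510    */
      unsigned int n2 = group_size - n1;          /* 2, hence the 255:1 ratio */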
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 22 October 2017, 7 commits
  3. 21 October 2017, 5 commits
  4. 20 October 2017, 2 commits
  5. 18 October 2017, 1 commit
    • mlxsw: core: Fix possible deadlock · d965465b
      Authored by Ido Schimmel
      When an EMAD is transmitted, a timeout work item is scheduled with a
      delay of 200ms, so that the EMAD can be retransmitted if no response
      arrives, up to a maximum of five retries.
      
      In certain situations, it's possible for the function waiting on the
      EMAD to be associated with a work item that is queued on the same
      workqueue (`mlxsw_core`) as the timeout work item. This results in
      flushing a work item on the same workqueue.
      
      According to commit e159489b ("workqueue: relax lockdep annotation
      on flush_work()") the above may lead to a deadlock in case the workqueue
      has only one worker active or if the system is under memory pressure and
      the rescue worker is in use. The latter explains the very rare and
      random nature of the lockdep splats we have been seeing:
      
      [   52.730240] ============================================
      [   52.736179] WARNING: possible recursive locking detected
      [   52.742119] 4.14.0-rc3jiri+ #4 Not tainted
      [   52.746697] --------------------------------------------
      [   52.752635] kworker/1:3/599 is trying to acquire lock:
      [   52.758378]  (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c4fa4>] flush_work+0x3a4/0x5e0
      [   52.767837]
                     but task is already holding lock:
      [   52.774360]  (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
      [   52.784495]
                     other info that might help us debug this:
      [   52.791794]  Possible unsafe locking scenario:
      [   52.798413]        CPU0
      [   52.801144]        ----
      [   52.803875]   lock(mlxsw_core_driver_name);
      [   52.808556]   lock(mlxsw_core_driver_name);
      [   52.813236]
                      *** DEADLOCK ***
      [   52.819857]  May be due to missing lock nesting notation
      [   52.827450] 3 locks held by kworker/1:3/599:
      [   52.832221]  #0:  (mlxsw_core_driver_name){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
      [   52.842846]  #1:  ((&(&bridge->fdb_notify.dw)->work)){+.+.}, at: [<ffffffff811c65c4>] process_one_work+0x7d4/0x12f0
      [   52.854537]  #2:  (rtnl_mutex){+.+.}, at: [<ffffffff822ad8e7>] rtnl_lock+0x17/0x20
      [   52.863021]
                     stack backtrace:
      [   52.867890] CPU: 1 PID: 599 Comm: kworker/1:3 Not tainted 4.14.0-rc3jiri+ #4
      [   52.875773] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
      [   52.886267] Workqueue: mlxsw_core mlxsw_sp_fdb_notify_work [mlxsw_spectrum]
      [   52.894060] Call Trace:
      [   52.909122]  __lock_acquire+0xf6f/0x2a10
      [   53.025412]  lock_acquire+0x158/0x440
      [   53.047557]  flush_work+0x3c4/0x5e0
      [   53.087571]  __cancel_work_timer+0x3ca/0x5e0
      [   53.177051]  cancel_delayed_work_sync+0x13/0x20
      [   53.182142]  mlxsw_reg_trans_bulk_wait+0x12d/0x7a0 [mlxsw_core]
      [   53.194571]  mlxsw_core_reg_access+0x586/0x990 [mlxsw_core]
      [   53.225365]  mlxsw_reg_query+0x10/0x20 [mlxsw_core]
      [   53.230882]  mlxsw_sp_fdb_notify_work+0x2a3/0x9d0 [mlxsw_spectrum]
      [   53.237801]  process_one_work+0x8f1/0x12f0
      [   53.321804]  worker_thread+0x1fd/0x10c0
      [   53.435158]  kthread+0x28e/0x370
      [   53.448703]  ret_from_fork+0x2a/0x40
      [   53.453017] mlxsw_spectrum 0000:01:00.0: EMAD retries (2/5) (tid=bf4549b100000774)
      [   53.453119] mlxsw_spectrum 0000:01:00.0: EMAD retries (5/5) (tid=bf4549b100000770)
      [   53.453132] mlxsw_spectrum 0000:01:00.0: EMAD reg access failed (tid=bf4549b100000770,reg_id=200b(sfn),type=query,status=0(operation performed))
      [   53.453143] mlxsw_spectrum 0000:01:00.0: Failed to get FDB notifications
      
      Fix this by creating another workqueue for EMAD timeouts, thereby
      preventing the situation of a work item trying to flush a work item
      queued on the same workqueue.
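
      A minimal sketch of that shape of fix, with illustrative names (the actual
      patch differs in details such as workqueue flags):

      #include <linux/workqueue.h>

      static struct workqueue_struct *emad_timeout_wq;   /* dedicated to EMAD timeouts */

      static int emad_wq_init(void)
      {
              emad_timeout_wq = alloc_workqueue("emad_timeouts", 0, 0);
              return emad_timeout_wq ? 0 : -ENOMEM;
      }

      /* Schedule the retransmission timer on the dedicated workqueue, so that
       * cancelling or flushing it never targets the workqueue the waiter is
       * running on.
       */
      static void emad_timeout_schedule(struct delayed_work *dwork)
      {
              queue_delayed_work(emad_timeout_wq, dwork, msecs_to_jiffies(200));
      }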
      
      Fixes: caf7297e ("mlxsw: core: Introduce support for asynchronous EMAD register access")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 17 October 2017, 5 commits
  7. 15 October 2017, 10 commits
  8. 12 October 2017, 3 commits