1. 18 May, 2017 8 commits
  2. 16 May, 2017 1 commit
  3. 14 May, 2017 5 commits
  4. 09 May, 2017 4 commits
    • net/mlx4_core: Reduce harmless SRIOV error message to debug level · 83bd5118
      Committed by Jack Morgenstein
      Under SRIOV resource management, extra counters are allocated to VFs
      from a free pool. If that pool is empty, the ALLOC_RES command for
      a counter resource fails -- and this generates a misleading error
      message in the message log.
      
      Under SRIOV, each VF is allocated (i.e., guaranteed) 2 counters --
      one counter per port. For ETH ports, the RoCE driver requests an
      additional counter (above the guaranteed counters). If that request
      fails, the VF RoCE driver simply uses the default (i.e., guaranteed)
      counter for that port.
      
      Thus, failing to allocate an additional counter does not constitute
      a problem, and the error message on the PF when this occurs should
      be reduced to debug level.

      Finally, to identify failures caused by having no resources available
      to grant to the VF, we changed the error returned by
      mlx4_grant_resource to -EDQUOT (Quota exceeded), which describes the
      error more accurately.
      
      Fixes: c3abb51b ("IB/mlx4: Add RoCE/IB dedicated counters")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
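The fallback described in this commit can be sketched in plain C. This is a minimal user-space illustration, not the driver code: `counter_pool`, `grant_extra_counter`, `pick_port_counter`, and `DEFAULT_COUNTER_IDX` are hypothetical names, and only the `-EDQUOT` error code comes from the commit message.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical index of the counter that is guaranteed to the VF port. */
#define DEFAULT_COUNTER_IDX 0

/* Hypothetical free pool of extra counters shared by all VFs. */
struct counter_pool {
    int free_extra;  /* extra counters still available */
    int next_idx;    /* index handed out on the next successful grant */
};

/* Grant one extra counter, or fail with -EDQUOT when the pool is empty. */
static int grant_extra_counter(struct counter_pool *pool, int *idx)
{
    assert(pool && idx);
    if (pool->free_extra == 0)
        return -EDQUOT;
    pool->free_extra--;
    *idx = pool->next_idx++;
    return 0;
}

/* Requester side: an exhausted pool is not an error, use the default. */
static int pick_port_counter(struct counter_pool *pool)
{
    int idx;

    if (grant_extra_counter(pool, &idx) == -EDQUOT)
        return DEFAULT_COUNTER_IDX;
    return idx;
}
```

With one extra counter in the pool, the first call to `pick_port_counter()` returns the extra index and the second falls back to `DEFAULT_COUNTER_IDX`, which is why the failure only merits a debug-level message.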
    • net/mlx4_en: Avoid adding steering rules with invalid ring · 89c55768
      Committed by Talat Batheesh
      Inserting a steering rule with an illegal ring is an invalid operation;
      block it.
      
      Fixes: 82067281 ('net/mlx4_en: Manage flow steering rules with ethtool')
      Signed-off-by: Talat Batheesh <talatb@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
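A check of this kind might look like the sketch below. The function name and parameters are hypothetical, and the real driver may accept additional special ring values that this illustration ignores.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical validation helper: a rule may only target a ring index
 * below the number of configured RX rings.
 */
static int validate_rule_ring(uint32_t ring, uint32_t rx_ring_num)
{
    return ring < rx_ring_num ? 0 : -EINVAL;
}
```

Rejecting the rule before it reaches the hardware keeps the error local to the request that caused it.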
    • net/mlx4_en: Change the error print to debug print · 505a9249
      Committed by Kamal Heib
      The error print within mlx4_en_calc_rx_buf() should be a debug print.
      
      Fixes: 51151a16 ('mlx4: allow order-0 memory allocations in RX path')
      Signed-off-by: Kamal Heib <kamalh@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Committed by Michal Hocko
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand, those requests might now fall back to vmalloc even
      in cases where the memory allocator would previously have succeeded
      after several more reclaim/compaction attempts.  There is no guarantee
      something like that happens, though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
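The "try the cheap allocator first, then fall back" pattern that this commit centralises can be modelled in user space. The sketch below is only an analogy under loud assumptions: `contig_alloc()` stands in for a physically contiguous allocator and fails above a made-up `CONTIG_LIMIT`, `fallback_alloc()` stands in for vmalloc, and none of these names are kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Made-up size above which the stand-in contiguous allocator gives up. */
#define CONTIG_LIMIT ((size_t)64 * 1024)

struct kv_buf {
    void *ptr;
    bool  from_fallback;  /* true if the fallback allocator was used */
};

/* Stand-in for a contiguous allocation that fails quickly when too large. */
static void *contig_alloc(size_t size)
{
    return size <= CONTIG_LIMIT ? malloc(size) : NULL;
}

/* Stand-in for the slower allocator that has no contiguity requirement. */
static void *fallback_alloc(size_t size)
{
    return malloc(size);
}

/* One helper replaces every open-coded "try A, else B" sequence. */
static struct kv_buf kv_alloc_sketch(size_t size)
{
    struct kv_buf buf = { .ptr = contig_alloc(size), .from_fallback = false };

    if (!buf.ptr) {
        buf.ptr = fallback_alloc(size);
        buf.from_fallback = (buf.ptr != NULL);
    }
    return buf;
}

/* A single free routine, so callers need not remember which path was taken. */
static void kv_free_sketch(struct kv_buf *buf)
{
    assert(buf);
    free(buf->ptr);
    buf->ptr = NULL;
}
```

Keeping the policy in one helper means a change in fallback behaviour is made once instead of in every caller.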
  5. 05 May, 2017 1 commit
  6. 01 May, 2017 1 commit
    • mlxsw: spectrum_router: Simplify VRF enslavement · b1e45526
      Committed by Ido Schimmel
      When a netdev is enslaved to a VRF master, its router interface (RIF)
      needs to be destroyed (if it exists) and a new one created using the
      corresponding virtual router (VR).

      From the driver's perspective, the above is equivalent to an inetaddr
      event sent for this netdev. Therefore, when a port netdev (or one of its
      uppers) is enslaved to a VRF master, call the same function that
      would've been called had a NETDEV_UP been sent for this netdev in the
      inetaddr notification chain.
      
      This patch also fixes a bug when a LAG netdev with an existing RIF is
      enslaved to a VRF. Before this patch, each LAG port would drop the
      reference on the RIF, but would re-join the same one (in the wrong VR)
      soon after. With this patch, the corresponding RIF is first destroyed
      and a new one is created using the correct VR.
      
      Fixes: 7179eb5a ("mlxsw: spectrum_router: Add support for VRFs")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 30 April, 2017 15 commits
    • net/mlx5: E-Switch, Avoid redundant memory allocation · 0a0ab1d2
      Committed by Eli Cohen
      struct esw_mc_addr is a small struct that can be part of struct
      mlx5_eswitch. Define it as a field and not as a pointer, saving the
      kzalloc call and its error flow handling.
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
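The difference between a pointer member and an embedded member can be illustrated as follows. All struct and function names here are hypothetical stand-ins, not the driver's definitions, and `calloc` plays the role of kzalloc.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical small struct, standing in for the multicast address entry. */
struct mc_addr_sketch {
    uint8_t  addr[6];
    uint32_t refcnt;
};

/* Before: a pointer member needs an allocation and an error path. */
struct eswitch_ptr {
    struct mc_addr_sketch *mc_promisc;
};

static int eswitch_ptr_init(struct eswitch_ptr *esw)
{
    esw->mc_promisc = calloc(1, sizeof(*esw->mc_promisc));
    return esw->mc_promisc ? 0 : -1;  /* caller must handle failure */
}

static void eswitch_ptr_cleanup(struct eswitch_ptr *esw)
{
    free(esw->mc_promisc);
    esw->mc_promisc = NULL;
}

/* After: an embedded member lives and dies with its parent. */
struct eswitch_embedded {
    struct mc_addr_sketch mc_promisc;
};

static void eswitch_embedded_init(struct eswitch_embedded *esw)
{
    memset(&esw->mc_promisc, 0, sizeof(esw->mc_promisc));  /* cannot fail */
}
```

The embedded variant has no failure path and no separate cleanup step, which is the simplification the commit describes.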
    • net/mlx5e: Disable HW LRO when PCI is slower than link on striding RQ · 0f6e4cf6
      Committed by Eran Ben Elisha
      We will activate the HW LRO only on servers with PCI BW > MAX LINK BW,
      or when PCI BW > 16Gbps. In other cases we do not want LRO by default, as
      LRO sessions might time out and add redundant software overhead.
      
      Tested:
      	ethtool -k <ifs-name> | grep large-receive-offload
      	On systems with and without the limitations.
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Cc: kernel-team@fb.com
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
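The rule stated in this commit message can be written as a small predicate. This is a sketch of the stated condition only: the function name, the Mbps units, and the constant name are assumptions, and the driver's actual heuristic may differ in detail.

```c
#include <stdbool.h>
#include <stdint.h>

/* 16 Gbps threshold from the commit message, expressed in Mbps (assumed unit). */
#define PCI_BW_THRESHOLD_MBPS 16000u

/* LRO is on by default when PCI outruns the link or exceeds the threshold. */
static bool lro_enabled_by_default(uint32_t pci_bw_mbps, uint32_t max_link_mbps)
{
    return pci_bw_mbps > max_link_mbps ||
           pci_bw_mbps > PCI_BW_THRESHOLD_MBPS;
}
```

For example, 8000 Mbps of PCI bandwidth behind a 10000 Mbps link leaves LRO off, while 32000 Mbps of PCI bandwidth enables it even on a 100000 Mbps link.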
    • net/mlx5e: Use u8 as ownership type in mlx5e_get_cqe() · b1b03bde
      Committed by Tariq Toukan
      CQE ownership indication is as small as a single bit.
      Use u8 to speed up the comparison.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Cc: kernel-team@fb.com
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
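A single-bit ownership comparison with 8-bit operands might look like the sketch below. The mask value, the parameter names, and the way the software-side value is derived from a consumer index are all assumptions made for illustration, not the driver's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the ownership bit inside the CQE's op/own byte. */
#define CQE_OWNER_MASK 0x1u

/*
 * Returns true when the ownership bit written by hardware matches the
 * parity of how many times software has wrapped around the queue.
 * Both operands fit in a u8, so the comparison is byte-sized.
 */
static bool cqe_is_sw_owned(uint8_t op_own, uint32_t consumer_index,
                            uint8_t log_cq_size)
{
    uint8_t cqe_ownership_bit = op_own & CQE_OWNER_MASK;
    uint8_t sw_ownership_val = (consumer_index >> log_cq_size) & 1u;

    return cqe_ownership_bit == sw_ownership_val;
}
```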
    • net/mlx5e: Use prefetchw when a write is to follow · ad78af9b
      Committed by Tariq Toukan
      "prefetchw()" prefetches the cacheline for write. Use it for
      skb->data, as soon we'll be copying the packet header there.
      
      Performance:
      Single-stream packet-rate tested with pktgen.
      Packets are dropped at the tc level to zoom into the driver data-path.
      Larger gain is expected for smaller packets, as less time
      is spent on handling SKB fragments, making the path shorter
      and the improvement more significant.
      
      ---------------------------------------------
      packet size | before    | after     | gain  |
      64B         | 4,113,306 | 4,778,720 |  16%  |
      1024B       | 3,633,819 | 3,950,593 | 8.7%  |
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Cc: kernel-team@fb.com
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
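Outside the kernel, a write prefetch can be approximated with the GCC/Clang builtin `__builtin_prefetch(addr, 1)`, where the second argument `1` requests the line for writing. The sketch below assumes such a compiler; `copy_header` and its parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hint that dst is about to be written, then copy the header into it.
 * In a real data path, other work would sit between the hint and the
 * copy so the cache line has time to arrive.
 */
static void copy_header(unsigned char *dst, const unsigned char *src,
                        size_t len)
{
    __builtin_prefetch(dst, 1);  /* rw = 1: prefetch for write */
    memcpy(dst, src, len);
}
```

The prefetch is only a hint and never changes program results, so its benefit has to be measured, as the packet-rate table in the commit does.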
    • net/mlx5e: Optimize poll ICOSQ completion queue · 1f5b1e47
      Committed by Tariq Toukan
      UMR operations are more frequent and important.
      Check them first, and add a compiler branch predictor hint.
      
      According to current design, ICOSQ CQ can contain at most one
      pending CQE per napi. Poll function is optimized accordingly.
      
      Performance:
      Single-stream packet-rate tested with pktgen.
      Packets are dropped at the tc level to zoom into the driver data-path.
      Larger gain is expected for larger packet sizes, as BW is higher
      and UMR posts are more frequent.
      
      ---------------------------------------------
      packet size | before    | after     | gain  |
      64B         | 4,092,370 | 4,113,306 |  0.5% |
      1024B       | 3,421,435 | 3,633,819 |  6.2% |
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Cc: kernel-team@fb.com
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
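Checking the common case first with a branch hint can be sketched as below. The `likely`/`unlikely` macros are defined locally on top of the GCC/Clang builtin `__builtin_expect`; the opcode enum, the stats struct, and the handler name are hypothetical.

```c
/* Local definitions in the style of the kernel's branch-hint macros. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical opcodes seen on the internal control queue. */
enum icosq_opcode { OP_UMR, OP_NOP, OP_OTHER };

struct icosq_stats {
    unsigned int umr_done;
    unsigned int nop_done;
    unsigned int unexpected;
};

/* Handle the single pending completion, testing the frequent case first. */
static void handle_icosq_cqe(enum icosq_opcode op, struct icosq_stats *st)
{
    if (likely(op == OP_UMR)) {
        st->umr_done++;
        return;
    }
    if (unlikely(op != OP_NOP)) {
        st->unexpected++;
        return;
    }
    st->nop_done++;
}
```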
    • net/mlx5e: Act on delay probe time updates · a2fa1fe5
      Committed by Hadar Hen Zion
      The user can change the delay_first_probe_time parameter through sysctl.
      Listen to NETEVENT_DELAY_PROBE_TIME_UPDATE notifications and update the
      intervals of the periodic task that updates the neighbours' 'used' value
      and of the periodic task that queries the flow HW counters.
      Both intervals are updated only if the new delay probe time value is
      lower than the current interval.

      Since the driver saves only one minimum interval value, rather than one
      per device, users can set a lower interval for the periodic task that
      updates the neighbours' 'used' value, but they cannot schedule a higher
      interval for it.
      The interval used for scheduling this periodic task is the minimal
      delay probe time parameter ever seen by the driver.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
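The "only ever lower the interval" rule amounts to keeping a running minimum. A minimal sketch, with a hypothetical function name and unspecified time units:

```c
/*
 * Return the interval to use after a delay-probe-time update: the stored
 * minimum changes only when the new value is lower than it.
 */
static unsigned long update_min_interval(unsigned long current_min,
                                         unsigned long new_delay_probe_time)
{
    return new_delay_probe_time < current_min ? new_delay_probe_time
                                              : current_min;
}
```

Starting from 5000, an update to 3000 lowers the interval, and a later update to 8000 leaves it at 3000, matching the limitation the commit message points out.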
    • net/mlx5e: Update neighbour 'used' state using HW flow rules counters · f6dfb4c3
      Committed by Hadar Hen Zion
      When IP tunnel encapsulation rules are offloaded, the kernel can't see
      the traffic of the offloaded flow. The neighbour for the IP tunnel
      destination of the offloaded flow can mistakenly become STALE and be
      deleted by the kernel since its 'used' value wasn't changed.
      
      To make sure that a neighbour which is used by the HW won't become
      STALE, we proactively update the neighbour 'used' value every
      DELAY_PROBE_TIME period, when packets were matched and counted by the HW
      for one of the tunnel encap flows related to this neighbour.
      
      The periodic task that updates the used neighbours is scheduled when a
      tunnel encap rule is successfully offloaded into HW and keeps re-scheduling
      itself as long as the representor's neighbours list isn't empty.
      
      Add, remove, lookup and status change operations done over the
      representor's neighbours list or the neighbour hash entry encaps list
      are all serialized by RTNL lock.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Add support to neighbour update flow · 232c0013
      Committed by Hadar Hen Zion
      In order to offload TC encap rules, the driver does a lookup for the IP
      tunnel neighbour according to the output device and the destination IP
      given by the user.
      
      To keep track of the validity state of such neighbours, we keep
      the neighbours' information (a pair of device pointer and destination IP)
      in a hash table maintained at the relevant egress representor and
      register to get NETEVENT_NEIGH_UPDATE events. When getting a neighbour update
      netevent, we search for a match among the cached neighbour entries used for
      encapsulation.
      
      In case the neighbour isn't valid, we can't offload the flow into the
      HW. We cache the flow (requested matching and actions) in the driver and
      offload the rule later, when the neighbour is resolved and becomes
      valid.
      
      When a flow is only cached in the driver and not offloaded into HW
      yet, we use the EAGAIN return value to mark it internally; the TC ndo still
      returns success.

      Listen to kernel neighbour update netevents to track the validity state
      of the relevant neighbours:
      
      1. If a neighbour becomes valid, offload the related rules to HW.
      
      2. If the neighbour becomes invalid, remove the related rules from HW.
      
      3. If the neighbour MAC address was changed, update the encap header.
         Remove all the offloaded rules using the old encap header from the HW
         and insert new rules to HW with updated encap header.
      
      Access to the neighbors hash table is protected by RTNL lock of its
      caller or by the table's spinlock.
      
      Details of the locking/synchronization among the different actions
      applied on the neighbour table:
      
      Add/remove operations - protected by RTNL lock of its caller (all TC
      commands are protected by RTNL lock). Add and remove operations are
      initiated only when the user inserts/removes a TC rule into/from the driver.
      
      Lookup/remove operations - since the lookup operation is done from
      netevent notifier block, RTNL lock can't be used (atomic context).
      Use the table's spin lock to protect lookups from TC user removal operation.
      bh is used since netevent can be called from a softirq context.
      
      Lookup/add operations - The hash table access functions are taking
      care of the protection between lookup and add operations.
      
      When adding/removing encap headers and rules to/from the HW, RTNL lock
      is used. It can happen when:
      
      1. The user inserts/removes a TC rule into/from the driver (TC commands
      are protected by RTNL lock of its caller).
      
      2. The driver gets neighbour notification event, which reports about
      neighbour validity status change. Before adding/removing encap headers
      and rules to/from the HW, RTNL lock is taken.
      
      A neighbour hash table entry should be freed when its encap list is empty.
      Since the neighbour update netevent notification schedules a neighbour
      update work that uses the neighbour hash entry, it can't be freed
      unconditionally when the encap list becomes empty during TC delete rule flow.
      Use reference count to protect from freeing neighbour hash table entry
      while it's still in use.
      
      When the user asks to unregister a netdevice used by one of the neighbours,
      a neighbour removal notification is received. Then we take a reference on the
      neighbour and don't free it until the relevant encap entries (and flows) are
      marked as invalid (not offloaded) and removed from HW.
      As long as the encap entry is still valid (checked under RTNL lock) we
      can safely access the neighbour device saved on mlx5e_neigh struct.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
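The reference-counting rule for neighbour hash entries described in this commit can be illustrated with a single-threaded sketch. The struct and function names are hypothetical, the counter is a plain `int` (real concurrent code would need an atomic reference count), and the `freed` flag merely stands in for releasing the memory.

```c
#include <assert.h>
#include <stdbool.h>

struct neigh_entry_sketch {
    int  refcnt;  /* plain int: this sketch is single-threaded */
    bool freed;   /* stands in for actually releasing the entry */
};

/* Take a reference, e.g. for a non-empty encap list or a scheduled work. */
static void neigh_entry_hold(struct neigh_entry_sketch *e)
{
    e->refcnt++;
}

/* Drop a reference; the entry is released only when the last one goes. */
static void neigh_entry_put(struct neigh_entry_sketch *e)
{
    assert(e->refcnt > 0);
    if (--e->refcnt == 0)
        e->freed = true;
}
```

If the encap list holds one reference and a scheduled update work holds another, deleting the last TC rule drops the first but the entry survives until the work drops the second.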
    • net/mlx5e: Add neighbour hash table to the representors · 37b498ff
      Committed by Hadar Hen Zion
      Add hash table to the representors which is to be used by the next patch
      to save neighbours information in the driver.
      
      In order to offload IP tunnel encapsulation rules, the driver must find
      the tunnel dst neighbour according to the output device and the
      destination address given by the user. The next patch will cache the
      neighbors information in the driver to allow support in neigh update
      flow for tunnel encap rules.
      
      The neighbour entries are also saved in a list so we can easily iterate
      over them when querying statistics in order to provide 'used' feedback to
      the kernel neighbour NUD core.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Read neigh parameters with proper locking · 033354d5
      Committed by Hadar Hen Zion
      The nud_state and hardware address fields are protected by the neighbour
      lock; we should acquire it before accessing those parameters.

      Use this lock to avoid inconsistency between the neighbour validity state
      and its hardware address.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Use flag to properly monitor a flow rule offloading state · 0b67a38f
      Committed by Hadar Hen Zion
      Instead of relying on the 'flow->rule' pointer value, which can be
      valid or invalid (in case the FW returns an error while trying to offload
      the rule), monitor the rule state using a flag.

      In a downstream patch that adds support for the IP tunneling neigh update
      flow, a TC rule could be cached in the driver and not offloaded into the
      HW. In this case, the flow handle pointer stays NULL.
      
      Check the offloaded flag to properly deal with rules which are currently
      not offloaded when querying rule statistics.
      
      This patch doesn't add any new functionality.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
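Tracking the offload state with an explicit flag, rather than inferring it from a pointer, can be sketched as follows. The flag name, struct layout, and stats helper are hypothetical illustrations of the idea, not the driver's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit marking a flow that is currently in hardware. */
#define FLOW_OFFLOADED (1u << 0)

struct flow_sketch {
    uint32_t flags;
    void    *rule;        /* may be NULL or an error value; not trusted here */
    uint64_t hw_packets;  /* stand-in for a hardware counter readout */
};

/* Report statistics only for flows the flag says are offloaded. */
static bool flow_get_stats(const struct flow_sketch *flow, uint64_t *packets)
{
    if (!(flow->flags & FLOW_OFFLOADED))
        return false;  /* cached in the driver only: nothing to query */
    *packets = flow->hw_packets;
    return true;
}
```

The flag gives one unambiguous answer even when the pointer could be NULL, valid, or an encoded error.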
    • net/mlx5e: Remove output device parameter from create encap header helpers definition · 1a8552bd
      Committed by Hadar Hen Zion
      Passing output device parameter to the helper functions that deal with
      creation of encapsulation headers is redundant. Output device parameter
      can be defined inside those helpers, no need to pass it. Refactor the code by
      removing the parameter from the function signature.
      
      This patch doesn't change any functionality.
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Move the encap entry structure from the eswitch header · c1ae1152
      Committed by Or Gerlitz
      The encap entry structure isn't manipulated by the eswitch code,
      hence it can/needs to be removed from the eswitch header.
      
      Do that, and change it to have mlx5e_ prefix.
      
      This patch doesn't change any functionality.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Remove encap entry pointer from the eswitch flow attributes · 45247bf2
      Committed by Or Gerlitz
      Encap-wise, the tc eswitch flow attribute struct needs to have
      only the encap ID, which is programmed later to the HW, and none
      of the higher-level encap params; fix that.
      
      This patch doesn't change any functionality.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Extendable vport representor netdev private data · 1d447a39
      Committed by Saeed Mahameed
      Make representor netdev private data extendable by adding new struct
      "mlx5e_rep_priv" and use it as the rep netdev private data struct
      instead of directly pointing to mlx5_eswitch_rep.
      
      Added new en_rep.h header file to contain all representor related
      definitions and prototypes, and moved all representor specific logic
      into en_rep.c.
      
      Needed for downstream patches to extend representor functionality to
      support neighbour update.
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
  8. 25 April, 2017 1 commit
    • net/mlx5e: Fix race in mlx5e_sw_stats and mlx5e_vport_stats · 1510d728
      Committed by Martin KaFai Lau
      We have observed a sudden spike in rx/tx_packets and rx/tx_bytes
      reported under /proc/net/dev.  There is a race in mlx5e_update_stats()
      and some of the get-stats functions (the one that we hit is the
      mlx5e_get_stats() which is called by ndo_get_stats64()).
      
      In particular, the very first thing mlx5e_update_sw_counters()
      does is 'memset(s, 0, sizeof(*s))'.  For example, if mlx5e_get_stats()
      is unlucky at one point, rx_bytes and rx_packets could be 0.  One second
      later, a normal (and much bigger than 0) value will be reported.
      
      This patch uses a 'struct mlx5e_sw_stats temp' to avoid
      a direct memset zero on priv->stats.sw.

      mlx5e_update_vport_counters() has a similar race, so it is addressed
      in the same patch.  There, however, the memset zero is simply removed
      because it is not needed.
      
      I am lucky enough to catch this 0-reset in rx multicast:
      eth0: 41457665   76804   70    0    0    70          0     47085 15586634   87502    3    0    0     0       3          0
      eth0: 41459860   76815   70    0    0    70          0     47094 15588376   87516    3    0    0     0       3          0
      eth0: 41460577   76822   70    0    0    70          0         0 15589083   87521    3    0    0     0       3          0
      eth0: 41463293   76838   70    0    0    70          0     47108 15595872   87538    3    0    0     0       3          0
      eth0: 41463379   76839   70    0    0    70          0     47116 15596138   87539    3    0    0     0       3          0
      
      v2: Remove memset zero from mlx5e_update_vport_counters()
      v1: Use temp and memcpy
      
      Fixes: 9218b44d ("net/mlx5e: Statistics handling refactoring")
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Suggested-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
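The "accumulate into a temporary, then copy" approach described in this commit can be sketched in user-space C. The struct names, field names, and function below are hypothetical simplifications; only the idea of avoiding a memset on the published struct comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical published totals that readers may sample at any time. */
struct sw_stats_sketch {
    uint64_t rx_packets;
    uint64_t rx_bytes;
};

/* Hypothetical per-ring counters that get summed into the totals. */
struct ring_sketch {
    uint64_t packets;
    uint64_t bytes;
};

/*
 * Sum the rings into a local temporary and publish the result in one
 * copy, so the published struct never holds the transient all-zero state.
 */
static void update_sw_counters(struct sw_stats_sketch *published,
                               const struct ring_sketch *rings, size_t n)
{
    struct sw_stats_sketch temp;
    size_t i;

    memset(&temp, 0, sizeof(temp));  /* zeroing touches only the local copy */
    for (i = 0; i < n; i++) {
        temp.rx_packets += rings[i].packets;
        temp.rx_bytes   += rings[i].bytes;
    }
    memcpy(published, &temp, sizeof(temp));
}
```

The copy itself is not atomic, so a concurrent reader might still see a mix of old and new fields, but it no longer observes counters reset to zero in the middle of an update.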
  9. 23 April, 2017 4 commits