  1. 01 Mar, 2016 1 commit
  2. 23 Dec, 2015 3 commits
  3. 12 Dec, 2015 1 commit
    • IB: add a proper completion queue abstraction · 14d3a3b2
      Christoph Hellwig committed
      This adds an abstraction that allows ULPs to simply pass a completion
      object and completion callback with each submitted WR and let the RDMA
      core handle the nitty gritty details of how to handle completion
      interrupts and poll the CQ.
      
      In detail there is a new ib_cqe structure which just contains the
      completion callback, and which can be used to get at the containing
      object using container_of.  It is pointed to by the WR and WC as an
      alternative to the wr_id field, similar to how many ULPs already use
      the field to store a pointer using casts.
      
      A driver using the new completion callbacks allocates its CQs using
      the new ib_create_cq API, which in addition to the number of CQEs and
      the completion vectors also takes a mode on how we poll for CQEs.
      Three modes are available: direct for drivers that never take CQ
      interrupts and just poll for them, softirq to poll from softirq context
      using the to-be-renamed blk-iopoll infrastructure, which takes care of
      rearming and budgeting, or a workqueue for consumers who want to be
      called from user context.
      
      Thanks a lot to Sagi Grimberg, who helped review the API, wrote
      the current version of the workqueue code because my two previous
      attempts sucked too much, and converted the iSER initiator to the new
      API.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
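The container_of pattern described in this commit message can be sketched in plain userspace C. This is only an illustration under stated assumptions: every `demo_*` name is hypothetical, the callback signature is simplified, and nothing here is the actual RDMA core API.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_wc;

/* Modeled on the ib_cqe idea: it holds nothing but the completion callback. */
struct demo_cqe {
    void (*done)(struct demo_wc *wc);
};

/* Hypothetical work completion that points back at the submitter's cqe. */
struct demo_wc {
    struct demo_cqe *cqe;
    int status;
};

/* Hypothetical ULP request that embeds the cqe in its own context. */
struct demo_request {
    int tag;
    int completed_status;
    struct demo_cqe cqe;
};

/* Completion callback: recover the containing request from the cqe pointer. */
static void demo_request_done(struct demo_wc *wc)
{
    struct demo_request *req =
        container_of(wc->cqe, struct demo_request, cqe);

    req->completed_status = wc->status;
}

/* Stand-in for the core's poll loop handing one completion to its owner. */
static void demo_dispatch(struct demo_wc *wc)
{
    assert(wc->cqe != NULL && wc->cqe->done != NULL);
    wc->cqe->done(wc);
}
```

Because the callback travels with each submitted request, the dispatcher needs no cast of an integer cookie such as wr_id to find the owning object.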
  4. 22 Oct, 2015 2 commits
  5. 31 Aug, 2015 7 commits
    • IB/core: Remove needless bracketization · b8071ad8
      Doug Ledford committed
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/core: missing curly braces in ib_find_gid() · 98d25afa
      Dan Carpenter committed
      Smatch says that, based on the indenting, we should probably add curly
      braces here.
      
      Fixes: 03db3a2d ('IB/core: Add RoCE GID table management')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/core: Add RoCE GID table management · 03db3a2d
      Matan Barak committed
      RoCE GIDs are based on IP addresses configured on Ethernet net-devices
      which relate to the RDMA (RoCE) device port.
      
      Currently, each of the low-level drivers that support RoCE (ocrdma,
      mlx4) manages its own RoCE port GID table. As there's nothing which is
      essentially vendor specific, we generalize that, and enhance the RDMA
      core GID cache to do this job.
      
      In order to populate the GID table, we listen for events:
      
      (a) netdev up/down/change_addr events - if a netdev is built onto
          our RoCE device, we need to add/delete its IPs. This involves
          adding all GIDs related to this ndev, adding default GIDs, etc.
      
      (b) inet events - add new GIDs (according to the IP addresses)
          to the table.
      
      For programming the port RoCE GID table, providers must implement
      the add_gid and del_gid callbacks.
      
      RoCE GID management requires us to state the associated net_device
      alongside the GID. This information is necessary in order to manage
      the GID table. For example, when a net_device is removed, its
      associated GIDs need to be removed as well.
      
      RoCE mandates generating a default GID for each port, based on the
      related net-device's IPv6 link local. In contrast to the GID based on
      the regular IPv6 link-local (as we generate GID per IP address),
      the default GID is also available when the net device is down (in
      order to support loopback).
      
      Locking is done as follows:
      The patch modifies the GID table code both for new RoCE drivers
      implementing the add_gid/del_gid callbacks and for current RoCE and
      IB drivers that do not. The flows for updating the table are
      different, so the locking requirements are too.
      
      While updating the RoCE GID table, protection against multiple writers
      is achieved via mutex_lock(&table->lock). Since writing to a table
      requires us to find an entry (possibly a free entry) in the table and
      then modify it, this mutex protects both find_gid and write_gid,
      ensuring the atomicity of the action.
      Each entry in the GID cache is protected by an rwlock. In RoCE, writing
      (which usually results from a netdev notifier) involves invoking the
      vendor's add_gid and del_gid callbacks, which could sleep.
      Therefore, an invalid flag is added for each entry. Updates for RoCE are
      done via a workqueue, thus sleeping is permitted.
      
      In IB, updates are done in write_lock_irq(&device->cache.lock), thus
      write_gid isn't allowed to sleep and add_gid/del_gid are not called.
      
      When passing net-device into/out-of the GID cache, the device
      is always passed held (dev_hold).
      
      The code uses a single work item for updating all RDMA devices,
      following a netdev or inet notifier.
      
      The patch moves the cache from being a client (which was incorrect,
      as the cache is part of the IB infrastructure) to being explicitly
      initialized/freed when a device is registered/removed.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
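As a rough illustration of the default GID mentioned in this commit message, the sketch below builds a link-local style GID from a 48-bit MAC address using the standard IPv6 modified EUI-64 mapping. The name `demo_default_gid` is hypothetical, and the assumption that a port's default GID follows this plain mapping (with no VLAN handling) is an illustration choice, not something the commit message spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill a 16-byte GID with fe80::/64 followed by the modified EUI-64
 * interface identifier derived from a 48-bit MAC address.
 * Hypothetical helper, not the kernel's implementation.
 */
static void demo_default_gid(const uint8_t mac[6], uint8_t gid[16])
{
    assert(mac != NULL && gid != NULL);

    memset(gid, 0, 16);
    gid[0] = 0xfe;               /* link-local prefix fe80::/64 */
    gid[1] = 0x80;

    gid[8]  = mac[0] ^ 0x02;     /* flip the universal/local bit */
    gid[9]  = mac[1];
    gid[10] = mac[2];
    gid[11] = 0xff;              /* ff:fe is inserted in the middle */
    gid[12] = 0xfe;
    gid[13] = mac[3];
    gid[14] = mac[4];
    gid[15] = mac[5];
}
```

Since the result depends only on the MAC address, such a GID can be computed even while the net device is down, which fits the loopback case the message describes.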
    • IB/core: Make ib_alloc_device init the kobject · 55aeed06
      Jason Gunthorpe committed
      This gets rid of the weird in-between state where struct ib_device
      was allocated but the kobject didn't work.
      
      Consequently ib_device_release is now guaranteed to be called in
      all situations and we needn't duplicate its kfrees on error paths.
      Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/core: Find the network device matching connection parameters · 9268f72d
      Yotam Kenneth committed
      In the case of IPoIB, and maybe in other cases, the network device is
      managed by an upper-layer protocol (ULP). In order to expose this
      network device to other users of the IB device, let ULPs implement
      a callback that returns the network device according to connection parameters.
      
      The IB device and port, together with the P_Key and the GID should
      be enough to uniquely identify the ULP net device. However, in current
      kernels there can be multiple IPoIB interfaces created with the same GID.
      Furthermore, such a configuration may be desirable to support ipvlan-like
      configurations for RDMA CM with IPoIB.  To resolve the device in these
      cases the code will also take the IP address as an additional input.
      Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: Yotam Kenneth <yotamke@mellanox.com>
      Signed-off-by: Shachar Raindel <raindel@mellanox.com>
      Signed-off-by: Guy Shapiro <guysh@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/core: lock client data with lists_rwsem · 7c1eb45a
      Haggai Eran committed
      An ib_client callback that is called with the lists_rwsem locked only for
      read is protected from changes to the IB client lists, but not from
      ib_unregister_device() freeing its client data. This is because
      ib_unregister_device() will remove the device from the device list with
      lists_rwsem locked for write, but perform the rest of the cleanup,
      including the call to remove() without that lock.
      
      Mark client data that is undergoing de-registration with a new going_down
      flag in the client data context. Lock the client data list with lists_rwsem
      for write in addition to using the spinlock, so that functions calling the
      callback would be able to lock only lists_rwsem for read and let callbacks
      sleep.
      
      Since ib_unregister_client() now marks the client data context, there is
      no need for remove() to search for the context again, so pass the client
      data directly to remove() callbacks.
      Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/core: Add rwsem to allow reading device list or client list · 5aa44bb9
      Haggai Eran committed
      Currently the RDMA subsystem's device list and client list are protected by
      a single mutex. This prevents adding user-facing APIs that iterate these
      lists, since using them may cause a deadlock. The patch attempts to solve
      this problem by adding a read-write semaphore to protect the lists. Readers
      now don't need the mutex, and are safe just by read-locking the semaphore.
      
      The ib_register_device, ib_register_client, ib_unregister_device, and
      ib_unregister_client functions are modified to lock the semaphore for write
      during their respective list modification. Also, in order to make sure
      client callbacks are called only between add() and remove() calls, the code
      is changed to only add items to the lists after the add() calls and remove
      from the lists before the remove() calls.
      
      This patch attempts to solve a similar need [1] that was seen in the RoCE
      v2 patch series.
      
      [1] http://www.spinics.net/lists/linux-rdma/msg24733.html

      Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
  6. 13 Jun, 2015 3 commits
  7. 21 May, 2015 2 commits
  8. 19 May, 2015 2 commits
  9. 01 Oct, 2012 1 commit
    • IB/core: Handle table with full and partial membership for the same P_Key · ff7166c4
      Jack Morgenstein committed
      Extend the cached and non-cached P_Key table lookups to allow limited
      and full membership versions of the same P_Key to coexist in the P_Key
      table.
      
      This is necessary for SR-IOV, to allow some guests to have
      the full membership P_Key in their virtual P_Key table, while other
      guests on the same physical HCA have the limited one.
      To support this, we need both the limited and full membership P_Keys
      to be present in the master's (hypervisor physical port) P_Key table.
      
      The algorithm for handling P_Key tables which contain both the limited
      and the full membership versions of the same P_Key works as follows:
      
      When scanning the P_Key table for a 15-bit P_Key:
      
      A. If there is a full member version of that P_Key anywhere in the
         table, return its index (even if a limited-member version of the
         P_Key exists earlier in the table).
      
      B. If the full member version is not in the table, but the
         limited-member version is in the table, return the index of the
         limited P_Key.
      Signed-off-by: Liran Liss <liranl@mellanox.com>
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
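Rules A and B above can be sketched as a small userspace lookup. This is a minimal illustration, not the kernel's code: the `demo_*` names are hypothetical, the membership bit is assumed to be bit 15 of the 16-bit P_Key, and remembering the first limited entry (rather than any other) is a choice the commit message does not specify.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PKEY_FULL_BIT  0x8000u  /* assumed: bit 15 marks full membership */
#define DEMO_PKEY_BASE_MASK 0x7fffu  /* the 15-bit P_Key value */

/*
 * Return the table index that best matches the 15-bit value of pkey,
 * or -1 when neither a full nor a limited version is present.
 */
static int demo_find_pkey(const uint16_t *table, int len, uint16_t pkey)
{
    int partial_ix = -1;

    assert(len >= 0);

    for (int i = 0; i < len; i++) {
        if ((table[i] & DEMO_PKEY_BASE_MASK) != (pkey & DEMO_PKEY_BASE_MASK))
            continue;
        if (table[i] & DEMO_PKEY_FULL_BIT)
            return i;           /* rule A: a full member wins wherever it is */
        if (partial_ix < 0)
            partial_ix = i;     /* remember the first limited match */
    }

    return partial_ix;          /* rule B, or -1 if nothing matched */
}
```

For a table holding 0x7fff, 0x0001, 0xffff and 0x8001 in that order, a lookup of 0x7fff yields index 2: the full-member entry is preferred over the limited one at index 0.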
  10. 19 Jul, 2011 1 commit
  11. 21 May, 2011 2 commits
  12. 17 Jan, 2011 1 commit
    • RDMA: Update workqueue usage · f0626710
      Tejun Heo committed
      * ib_wq is added, which is used as the common workqueue for infiniband
        instead of the system workqueue.  All system workqueue usages
        including flush_scheduled_work() callers are converted to use and
        flush ib_wq.
      
      * cancel_delayed_work() + flush_scheduled_work() converted to
        cancel_delayed_work_sync().
      
      * qib_wq is removed and ib_wq is used instead.
      
      This is to prepare for deprecation of flush_scheduled_work().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  13. 22 May, 2010 1 commit
    • IB/core: Allow device-specific per-port sysfs files · 9a6edb60
      Ralph Campbell committed
      Add a new parameter to ib_register_device() so that low-level device
      drivers can pass in a pointer to a callback function that will be
      called for each port that is registered in sysfs.  This allows
      low-level device drivers to create files in
      
          /sys/class/infiniband/<hca>/ports/<N>/
      
      without having to poke through the internals of the RDMA sysfs handling.
      
      There is no need for an unregister function since the kobject
      reference will go to zero when ib_unregister_device() is called.
      Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  14. 26 Feb, 2009 1 commit
    • IB: Remove sysfs files before unregistering device · 9206dff1
      Roland Dreier committed
      Move the ib_device_unregister_sysfs() call from ib_dealloc_device() to
      ib_unregister_device().  The old code allows device unregister to
      proceed even if some sysfs files are open, which leaves a window where
      userspace can open a file before a device is removed but then end up
      reading the file after the device is removed, which leads to various
      kernel crashes either because the device data structure is freed or
      because the low-level driver code is gone after module removal.
      
      By not returning from ib_unregister_device() until after all sysfs
      entries are removed, we make sure that data structures and/or module
      code is not freed until after all sysfs access is done.
      Reported-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  15. 15 Jul, 2008 1 commit
  16. 10 Oct, 2007 1 commit
    • IB: find_first_zero_bit() takes unsigned pointer · 65d470b3
      Roland Dreier committed
      Fix sparse warning
      
          drivers/infiniband/core/device.c:142:6: warning: incorrect type in argument 1 (different signedness)
          drivers/infiniband/core/device.c:142:6:    expected unsigned long const *addr
          drivers/infiniband/core/device.c:142:6:    got long *[assigned] inuse
      
      by making the local variable inuse unsigned.  Does not affect generated
      code at all.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
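The signedness point in this commit can be illustrated with a userspace stand-in for a first-zero-bit search over a bitmap of unsigned long words. The name `demo_find_first_zero_bit` is hypothetical and this simple bit-by-bit loop is only a sketch, not the kernel's optimized helper; it assumes, as the warning text indicates, that the bitmap is addressed through a `const unsigned long *`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Return the index of the first clear bit among the first `size` bits
 * of the bitmap, or `size` when every one of those bits is set.
 */
static size_t demo_find_first_zero_bit(const unsigned long *addr, size_t size)
{
    assert(addr != NULL);

    for (size_t i = 0; i < size; i++) {
        unsigned long word = addr[i / DEMO_BITS_PER_LONG];

        if (!(word & (1UL << (i % DEMO_BITS_PER_LONG))))
            return i;
    }

    return size;
}
```

Declaring the caller's bitmap as unsigned long matches the pointer type the helper expects, so no signedness conversion happens at the call site.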
  17. 04 Aug, 2007 1 commit
    • IB/core: Ignore membership bit in ib_find_pkey() · 36026ecc
      Moni Shoua committed
      ib_find_pkey() is used as a replacement for ib_find_cached_pkey(), and
      the original function ignored the membership bit when searching for a
      P_Key, so ib_find_pkey() should ignore the bit too.
      
      In particular, IPoIB turns on the P_Key membership bit of limited
      membership P_Keys when creating a child interface and looks for the
      full membership P_Key.  This broke if a port was a partial member of a
      partition when IPoIB switched from ib_find_cached_pkey() to
      ib_find_pkey(), and this change fixes things again.
      Signed-off-by: Moni Shoua <monis@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  18. 22 May, 2007 1 commit
    • Detach sched.h from mm.h · e8edc6e0
      Alexey Dobriyan committed
      The first thing mm.h does is include sched.h, solely for the can_do_mlock()
      inline function, which dereferences "current". By dealing with can_do_mlock(),
      mm.h can be detached from sched.h, which is good. See below for why.
      
      This patch
      a) removes unconditional inclusion of sched.h from mm.h
      b) makes can_do_mlock() a normal function in mm/mlock.c
      c) exports can_do_mlock() to not break compilation
      d) adds sched.h inclusions back to files that were getting it indirectly.
      e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
         getting them indirectly
      
      Net result is:
      a) mm.h users would get less code to open, read, preprocess, parse, ... if
         they don't need sched.h
      b) sched.h stops being a dependency for a significant number of files:
         on x86_64 allmodconfig, touching sched.h results in a recompile of 4083
         files; after the patch it's only 3744 (-8.3%).
      
      Cross-compile tested on
      
      	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
      	alpha alpha-up
      	arm
      	i386 i386-up i386-defconfig i386-allnoconfig
      	ia64 ia64-up
      	m68k
      	mips
      	parisc parisc-up
      	powerpc powerpc-up
      	s390 s390-up
      	sparc sparc-up
      	sparc64 sparc64-up
      	um-x86_64
      	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
      
      as well as my two usual configs.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 19 May, 2007 2 commits
  20. 09 May, 2007 1 commit
    • IB/uverbs: Export ib_umem_get()/ib_umem_release() to modules · f7c6a7b5
      Roland Dreier committed
      Export ib_umem_get()/ib_umem_release() and put low-level drivers in
      control of when to call ib_umem_get() to pin and DMA map userspace
      memory, rather than always calling it in ib_uverbs_reg_mr() before
      calling the low-level driver's reg_user_mr method.
      
      Also move these functions to be in the ib_core module instead of
      ib_uverbs, so that driver modules using them do not depend on
      ib_uverbs.
      
      This has a number of advantages:
       - It is better design from the standpoint of making generic code a
         library that can be used or overridden by device-specific code as
         the details of specific devices dictate.
       - Drivers that do not need to pin userspace memory regions do not
         need to take the performance hit of calling ib_umem_get().  For
         example, although I have not tried to implement it in this patch,
         the ipath driver should be able to avoid pinning memory and just
         use copy_{to,from}_user() to access userspace memory regions.
       - Buffers that need special mapping treatment can be identified by
         the low-level driver.  For example, it may be possible to solve
         some Altix-specific memory ordering issues with mthca CQs in
         userspace by mapping CQ buffers with extra flags.
       - Drivers that need to pin and DMA map userspace memory for things
         other than memory regions can use ib_umem_get() directly, instead
         of hacks using extra parameters to their reg_phys_mr method.  For
         example, the mlx4 driver that is pending being merged needs to pin
         and DMA map QP and CQ buffers, but it does not need to create a
         memory key for these buffers.  So the cleanest solution is for mlx4
         to call ib_umem_get() in the create_qp and create_cq methods.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
  21. 11 Feb, 2007 1 commit
  22. 23 Sep, 2006 2 commits
  23. 14 Jan, 2006 1 commit
  24. 02 Nov, 2005 1 commit