- 01 Feb 2020, 1 commit

By John Hubbard

And get rid of the mmap_sem calls, as part of that. Note that get_user_pages_fast() will, if necessary, fall back to __gup_longterm_unlocked(), which takes the mmap_sem as needed.

Link: http://lkml.kernel.org/r/20200107224558.2362728-10-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
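As a minimal sketch of the conversion this commit describes (illustrative, not the literal diff; the wrapper name is made up), the caller stops taking mmap_sem around a slow-path get_user_pages() call and lets the fast path handle any fallback locking internally:

```c
#include <linux/mm.h>

/* Hypothetical helper showing the before/after shape of the change. */
static int pin_user_region(unsigned long addr, int npages,
			   struct page **page_list)
{
	/*
	 * Before, the caller took mmap_sem itself:
	 *
	 *   down_read(&current->mm->mmap_sem);
	 *   ret = get_user_pages(addr, npages,
	 *                        FOLL_WRITE | FOLL_LONGTERM,
	 *                        page_list, NULL);
	 *   up_read(&current->mm->mmap_sem);
	 *
	 * After: no caller-side locking; the fast path falls back to
	 * __gup_longterm_unlocked(), which takes mmap_sem as needed.
	 */
	return get_user_pages_fast(addr, npages,
				   FOLL_WRITE | FOLL_LONGTERM, page_list);
}
```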
-
- 16 Jan 2020, 1 commit

By Moni Shoua

So far the assumption was that ib_umem_get() and ib_umem_odp_get() are called from flows that start in UVERBS and therefore have a user context. This assumption restricts flows that are initiated by ULPs and need the service that ib_umem_get() provides. This patch changes ib_umem_get() and ib_umem_odp_get() to get the IB device directly, relying on the fact that both UVERBS and ULPs set that field correctly.

Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
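As a sketch of the resulting interface (the parameter list is reconstructed from the description above and should be treated as illustrative, not the exact upstream prototype), ib_umem_get() is now keyed off the ib_device rather than a uverbs context:

```c
#include <rdma/ib_umem.h>

/* Illustrative prototype: callable from both UVERBS and kernel ULP
 * flows, since only the ib_device is needed. */
struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
			    size_t size, int access);
```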
-
- 17 Nov 2019, 1 commit

By Christoph Hellwig

The argument is always ignored, so remove it.

Link: https://lore.kernel.org/r/20191113073214.9514-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 15 Nov 2019, 1 commit

By Christoph Hellwig

This flag is not implemented by any backend and is only set by the ib_umem module in a single instance.

Link: https://lore.kernel.org/r/20191113073214.9514-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 25 Sep 2019, 1 commit

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()

Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()", v3.

There are about 50+ patches in my tree [2], and I'll be sending out the remaining ones in a few more groups:

* The block/bio related changes (Jerome mostly wrote those, but I've had to move stuff around extensively, and add a little code)
* mm/ changes
* other subsystem patches
* an RFC that shows the current state of the tracking patch set. That can only be applied after all call sites are converted, but it's good to get an early look at it.

This is part of a tree-wide conversion, as described in fc1d8e7c ("mm: introduce put_user_page*(), placeholder versions").

This patch (of 3):

Provide a more capable variation of put_user_pages_dirty_lock(), and delete put_user_pages_dirty(). This is based on the following:

1. Lots of call sites become simpler if a bool is passed into put_user_page*(), instead of making the call site choose which put_user_page*() variant to call.

2. Christoph Hellwig's observation that set_page_dirty_lock() is usually correct, and set_page_dirty() is usually a bug, or at least questionable, within a put_user_page*() calling chain.

This leads to the following API choices:

* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to hand code that, in the rare case that it's required.

[jhubbard@nvidia.com: remove unused variable in siw_free_plist()]
Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
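A hedged sketch of the call-site simplification described above (the surrounding helper is hypothetical; put_user_pages_dirty_lock() and its make_dirty argument follow the API choice listed in the patch):

```c
#include <linux/mm.h>

/* Hypothetical release helper for pages pinned via get_user_pages*(). */
static void release_pinned_pages(struct page **pages, unsigned long npages,
				 bool writable, bool dirty)
{
	/*
	 * One call now covers both cases: make_dirty selects whether
	 * set_page_dirty_lock() is applied before the put.
	 */
	put_user_pages_dirty_lock(pages, npages, writable && dirty);
}
```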
-
- 22 Aug 2019, 3 commits

By Jason Gunthorpe

At this point the ucontext is only being stored to access the ib_device, so just store the ib_device directly instead. This is more natural and logical, as the umem has nothing to do with the ucontext.

Link: https://lore.kernel.org/r/20190806231548.25242-8-jgg@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Jason Gunthorpe

Now that there are allocator APIs that return the ib_umem_odp directly, it should be freed through a umem_odp free'er as well.

Link: https://lore.kernel.org/r/20190819111710.18440-8-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Jason Gunthorpe

This is the last creation API that is overloaded for both; there is very little code sharing, and a driver has to be specifically ready for a umem_odp to be created in order to use the ODP version.

Link: https://lore.kernel.org/r/20190819111710.18440-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 21 Aug 2019, 1 commit

By Jason Gunthorpe

When ODP is enabled with IB_ACCESS_HUGETLB, the required pages should be calculated based on the extent of the MR, which is rounded to the nearest huge page alignment.

Fixes: d2183c6f ("RDMA/umem: Move page_shift from ib_umem to ib_odp_umem")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190815083834.9245-5-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
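A sketch of the accounting rule described above, under the assumption of hypothetical start/length/shift parameters (not the literal patch): the page count comes from the MR extent rounded out to huge-page boundaries.

```c
#include <linux/kernel.h>

/* Hypothetical helper: pages required for an ODP MR registered with
 * IB_ACCESS_HUGETLB, counted in huge-page-sized units. */
static unsigned long odp_hugetlb_npages(unsigned long start,
					unsigned long length,
					unsigned int page_shift)
{
	unsigned long first = ALIGN_DOWN(start, 1UL << page_shift);
	unsigned long last  = ALIGN(start + length, 1UL << page_shift);

	return (last - first) >> page_shift;
}
```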
-
- 21 Jun 2019, 1 commit

By Leon Romanovsky

Update ib_umem_release() to behave similarly to kfree() and accept a NULL pointer as safe input to this function.

Fixes: a52c8e24 ("RDMA: Clean destroy CQ in drivers do not return errors")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
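The contract reads naturally as code; here is a sketch with the function body abbreviated (the real teardown work is elided):

```c
struct ib_umem;

/* A NULL umem is a no-op, mirroring kfree(), so error paths can call
 * this unconditionally. */
void ib_umem_release(struct ib_umem *umem)
{
	if (!umem)
		return;

	/* ... unmap and unpin pages, drop the mm, free the struct ... */
}
```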
-
- 28 May 2019, 1 commit

By John Hubbard

For infiniband code that retains pages via get_user_pages*(), release those pages via the new put_user_page() or put_user_pages*(), instead of put_page().

This is a tiny part of the second step of fixing the problem described in [1]. The steps are:

1) Provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*() to invoke put_user_page*() instead of put_page(). This involves dozens of call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting.

4) Use the tracking and identification of these pages to implement special handling (especially in writeback paths) when the pages are backed by a filesystem. Again, [1] provides details as to why that is desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
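The per-page shape of the conversion, as a sketch (the wrapper is hypothetical and dirty handling is elided):

```c
#include <linux/mm.h>

/* Hypothetical release path for a page that was pinned with
 * get_user_pages*(). */
static void release_one(struct page *page)
{
	/* Before: put_page(page); */
	put_user_page(page);	/* pairs with get_user_pages*() */
}
```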
-
- 22 May 2019, 1 commit

By Jason Gunthorpe

This value has always been set to PAGE_SHIFT in the core code; the only place that did anything different was the ODP path. Move the value into the ODP struct and still use it for ODP, but change all the non-ODP code to just use PAGE_SHIFT/PAGE_SIZE/PAGE_MASK directly.

Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
- 15 May 2019, 1 commit

By Ira Weiny

Patch series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast(), which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality. [1]

In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name.

Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... for the overall application performance. I don't have the numbers, as the tests for HFI1 were done a long time ago, but there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they rework the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast.

As an aside, Jason pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers would like to do a longterm (user controlled) pin of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for Filesystems, and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better, in that we are calling __get_user_pages_locked() on the pages before and after a potential migration.

As a side effect some of the interfaces are cleaned up, but this is not the primary purpose of the series.

In review [1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points. First, "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name.

Second, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... for the overall application performance. I don't have the numbers, as the tests for HFI1 were done a long time ago, but there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they rework the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast.

As an aside, Jason pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment.
</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
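A sketch of the caller-side change (the wrapper function is hypothetical; the flag and GUP entry points are as named in the series):

```c
#include <linux/mm.h>

/* Hypothetical long-lived pin, e.g. for an RDMA MR. */
static int pin_longterm(unsigned long start, int npages,
			struct page **pages)
{
	/*
	 * Before, a dedicated helper existed for this case:
	 *   get_user_pages_longterm(start, npages, FOLL_WRITE,
	 *                           pages, NULL);
	 * After: the same request is just a flag on the normal calls.
	 */
	return get_user_pages_fast(start, npages,
				   FOLL_WRITE | FOLL_LONGTERM, pages);
}
```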
-
- 07 May 2019, 2 commits

By Shiraz Saleem

The i40iw and bnxt_re drivers no longer depend on the hugetlb flag, so remove this flag from the ib_umem structure.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Shiraz Saleem

This helper iterates through the SG list to find the best page size to use from a bitmap of HW-supported page sizes. Drivers that support multiple page sizes, but not mixed sizes within an MR, can use this API.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 03 May 2019, 1 commit

By Shiraz Saleem

The flag update_cur_sg tracks whether contiguous pages from a new set of page_list pages can be merged into the SGE passed into ib_umem_add_sg_table(). If this flag is true but the total segment length exceeds the max_seg_size supported by HW, we avoid combining into this SGE and move to a new SGE (x), merging 'len' pages into it. However, if i < npages, the next iteration can incorrectly merge 'len' contiguous pages into x instead of into a new SGE, since update_cur_sg is still true.

Reset update_cur_sg to false always after the check to merge pages into the first SGE passed in to ib_umem_add_sg_table(). Also, prevent a new SGE's segment length from ever exceeding HW max_seg_sz.

There is a crash on hfi1 as a result of this, where max_seg_sz defaults to 64K. Due to the above bug, unfolding SGEs in __ib_umem_release points to a bad page pointer.

TEST comp-wfr.perfnative.STL-22166-WDT _ perftest native 2-Write_4097QP_4MB STARTING at 1555387093
BUG: Bad page state in process ib_write_bw pfn:7ebca0
page:ffffcd675faf2800 count:0 mapcount:1 mapping:0000000000000000 index:0x1
flags: 0x17ffffc0000000()
raw: 0017ffffc0000000 dead000000000100 dead000000000200 0000000000000000
raw: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
page dumped because: nonzero mapcount
CPU: 18 PID: 15853 Comm: ib_write_bw Tainted: G B 5.1.0-rc4 #1
Hardware name: Intel Corporation S2600CWR/S2600CW, BIOS SE5C610.86B.01.01.0014.121820151719 12/18/2015
Call Trace:
 dump_stack+0x5a/0x73
 bad_page+0xf5/0x10f
 free_pcppages_bulk+0x62c/0x680
 free_unref_page+0x54/0x70
 __ib_umem_release+0x148/0x1a0 [ib_uverbs]
 ib_umem_release+0x22/0x80 [ib_uverbs]
 rvt_dereg_mr+0x67/0xb0 [rdmavt]
 ib_dereg_mr_user+0x37/0x60 [ib_core]
 destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
 uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs]
 uobj_destroy+0x4d/0x60 [ib_uverbs]
 __uobj_get_destroy+0x33/0x50 [ib_uverbs]
 __uobj_perform_destroy+0xa/0x30 [ib_uverbs]
 ib_uverbs_dereg_mr+0x66/0x90 [ib_uverbs]
 ib_uverbs_write+0x3e1/0x500 [ib_uverbs]
 vfs_write+0xad/0x1b0
 ksys_write+0x5a/0xd0
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: d10bcf94 ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 09 Apr 2019, 2 commits

By Shiraz Saleem

With page combining, the assumption that the number of SG entries in the umem SGL equals the number of system pages in the umem no longer holds. umem->sg_nents tracks the SG entries in the umem SGL. Use it in sg_pcopy_to_buffer() instead of ib_umem_num_pages(umem).

Fixes: d10bcf94 ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Reported-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Shiraz Saleem

Combine contiguous regions of PAGE_SIZE pages into single scatter-list entries while building the scatter table for a umem. This minimizes the number of entries in the scatter list and reduces the DMA mapping overhead, particularly with the IOMMU. Set the default max_seg_size in the core for IB devices to 2G, and do not combine if we exceed this limit.

Also, purge npages in struct ib_umem, as we now DMA map the umem SGL with sg_nents and the npages computation is not needed. Drivers should now be using ib_umem_num_pages(), so fix the last stragglers. Move npages tracking to ib_umem_odp, as ODP drivers still need it.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Gal Pressman <galpress@amazon.com>
Tested-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 27 Mar 2019, 1 commit

By Ira Weiny

No device supports ODP MRs without an invalidate_range callback. Warn on any device which attempts to support ODP without supplying this callback. Then we can remove the checks for the callback within the code.

This stems from the discussion at https://www.spinics.net/lists/linux-rdma/msg76460.html, which concluded this code was no longer necessary.

Acked-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 16 Feb 2019, 1 commit

By Shamir Rabinovitch

Add ib_ucontext to the uverbs_attr_bundle sent down the ioctl and cmd flows as soon as the flow has an ib_uobject. In addition, remove the rdma_get_ucontext helper function, which is only used by ib_umem_get.

Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 08 Feb 2019, 2 commits

By Davidlohr Bueso

ib_umem_get() uses gup_longterm() and relies on the lock to stabilize the vma_list, so we cannot really get rid of mmap_sem altogether. But now that the counter is atomic, we can drop some of the complexity that mmap_sem brings when it only protects pinned_vm.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Davidlohr Bueso

Taking a sleeping lock only to increment a variable is quite the overkill, and pretty much all users do this. Furthermore, some drivers (e.g. infiniband and scif) that need pinned semantics go to quite some trouble to delay (un)accounting of pinned pages via a workqueue when the lock cannot be acquired.

By making the counter atomic we no longer need to hold the mmap_sem, and can simplify some code around it for pinned_vm users. The counter is 64-bit, so we need not worry about overflows such as from RDMA, where user input is controlled from userspace.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 11 Jan 2019, 1 commit

By Jason Gunthorpe

ib_umem_get() can only be called in a method callback, which always has a udata parameter. This allows ib_umem_get() to derive the ucontext pointer directly from the udata, without requiring the drivers to find it in some way or another.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
-
- 28 Sep 2018, 1 commit

By Parav Pandit

Currently mmap_sem is read-locked while pinning the memory. In a multi-threaded application, holding the mmap_sem lock creates contention with other threads which might be registering memory, creating QPs, or simply doing mmap(), as such operations also require the mmap_sem write lock. None of these operations can make forward progress until one memory pin operation is completed. It becomes even worse if the memory is unpinned and/or the memory registration is large (in the GB range).

Therefore, instead of holding mmap_sem for too long (for whole-region pinning), acquire and release the lock for every few pages. For example, on x86 with a 4K page size, acquire and release mmap_sem for every 2 MB memory chunk. This allows other competing threads, which might wish to hold mmap_sem for a shorter duration, to make progress.

When memory registration latency is measured using [1] for memory sizes ranging from 4K to 48GB, <= 1% or 0.5% degradation is noticed. In many runs no difference is seen other than run-to-run variance. In other targeted tests of users with large memory, the desired improvements are seen due to reduced contention on mmap_sem.

[1] https://github.com/paravmellanox/rtool

$ rdma_resource_lat -c 1 -s 48G -a -u L -i 500 -A

registers pinned memory from 4K to 48GB in size, with 500 iterations for each memory size.

$ rdma_resource_lat -c 1 -s 12G -a -u L -i 500 -t 4

runs 4 competing threads which pin memory, each of 12GB size, with 500 iterations.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
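A sketch of the chunked-pinning loop described above, assuming a hypothetical wrapper; the chunk size, flags, and error handling are illustrative, not the literal patch:

```c
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>

/* Hypothetical loop: pin npages in bounded chunks so mmap_sem is
 * never held across the whole (possibly multi-GB) region. */
static long pin_in_chunks(struct mm_struct *mm, unsigned long addr,
			  unsigned long npages, struct page **pages)
{
	unsigned long done = 0;

	while (done < npages) {
		/* 512 x 4K pages = 2 MB per lock hold on x86. */
		unsigned long chunk = min(npages - done, 512UL);
		long ret;

		down_read(&mm->mmap_sem);
		ret = get_user_pages(addr + done * PAGE_SIZE, chunk,
				     FOLL_WRITE, pages + done, NULL);
		up_read(&mm->mmap_sem);
		if (ret <= 0)
			return ret ? ret : -EFAULT;
		done += ret;
	}
	return done;
}
```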
-
- 26 Sep 2018, 2 commits

By Doug Ledford

Given a large enough memory allocation, it is possible to wrap the pinned_vm counter. Check for addition overflow to prevent such eventualities.

Fixes: 40ddacf2 ("RDMA/umem: Don't hold mmap_sem for too long")
Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
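A sketch of the guard; check_add_overflow() is the generic kernel helper, but the exact upstream form of this fix may differ, and the surrounding function is hypothetical:

```c
#include <linux/capability.h>
#include <linux/overflow.h>
#include <linux/sched/mm.h>

/* Hypothetical check before committing a new pin of npages pages. */
static int check_pinned_limit(struct mm_struct *mm, unsigned long npages,
			      unsigned long lock_limit)
{
	unsigned long new_pinned;

	/* Reject both a wrapped counter and an over-limit pin. */
	if (check_add_overflow(mm->pinned_vm, npages, &new_pinned) ||
	    (new_pinned > lock_limit && !capable(CAP_IPC_LOCK)))
		return -ENOMEM;

	mm->pinned_vm = new_pinned;
	return 0;
}
```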
-
By Doug Ledford

Noticed while reviewing commit d4b4dd1b ("RDMA/umem: Do not use current->tgid to track the mm_struct"): why would we take a lock, adjust a protected variable, drop the lock, and *then* check the input into our protected variable adjustment? Then we have to take the lock again on our error unwind. Let's just check the input early and skip taking the locks needlessly if the input isn't valid.

It was also noticed that we set mm = current->mm and then never modify mm, but we still go back and reference current->mm a number of times needlessly. Be consistent in using the stored reference in mm.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 21 Sep 2018, 4 commits

By Jason Gunthorpe

This no longer has any use; we can use container_of to get to the umem_odp, and a simple flag to indicate if this is an ODP MR. Remove the few remaining references to it.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
By Jason Gunthorpe

These two structures are linked together; use the container_of pattern instead of a double allocation to make the code simpler and easier to follow.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
By Jason Gunthorpe

All of these functions already require the ODP version of the umem struct, so make this very clear by having the signature require it. This paves the way to using the container_of() pattern to link umem_odp and umem together.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
By Jason Gunthorpe

This is just wrong: the process that calls into reg_mr is the process associated with the umem, and that does not have to be the same process that created the context. When this code was first written mmgrab() didn't exist; these days we can just directly hold the mm_struct pointer in the umem and have no ambiguity, when it comes to releasing the umem, as to which mm it was associated with.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-
- 14 Jul 2018, 2 commits

By Leon Romanovsky

Simplify the exit paths in ib_umem_get to use the standard goto-unwind pattern.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Leon Romanovsky

DMA mapping is a time-consuming operation and doesn't need to be performed while the mmap_sem semaphore is held. The semaphore only needs to be held for accounting and get_user_pages-related activities.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 27 Jun 2018, 1 commit

By Leon Romanovsky

dma_map_sg_attrs() returns 0 on error and can't return a negative number (ensured by a BUG_ON), so don't check for one.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 29 May 2018, 1 commit

By Jack Morgenstein

Make the MR writability flags check, which is performed in umem.c, a static inline function in ib_verbs.h. This allows the function to be used by low-level infiniband drivers.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
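A sketch of the helper's shape after the move (the function here is a local re-sketch, not the header copy; the flag set shown is the usual writable-MR access bits and should be treated as an assumption):

```c
#include <rdma/ib_verbs.h>

/* Mirrors the moved check: any of these access modes implies the MR's
 * pages must be writable. */
static inline bool mr_access_is_writable(int access_flags)
{
	return access_flags &
	       (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
		IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND);
}
```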
-
- 16 May 2018, 2 commits

By Lidong Chen

User space may invoke ibv_reg_mr and ibv_dereg_mr from different threads. If ibv_dereg_mr is called after the thread which invoked ibv_reg_mr has exited, get_pid_task will return NULL and ib_umem_release will not decrease mm->pinned_vm.

Instead of using threads to locate the mm, use the overall tgid from the ib_ucontext struct. This matches the behavior of ODP and disassociate in handling the mm of the process that called ibv_reg_mr.

Cc: <stable@vger.kernel.org>
Fixes: 87773dd5 ("IB: ib_umem_release() should decrement mm->pinned_vm from ib_umem_get")
Signed-off-by: Lidong Chen <lidongchen@tencent.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
By Yuval Shaia

A "return" statement at the end of a void function is redundant; remove it.

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 19 Dec 2017, 1 commit

By Artemy Kovalyov

In the ib_umem structure, npages holds the original number of sg entries, while nmap is the number of DMA blocks returned by dma_map_sg.

Fixes: c5d76f13 ("IB/core: Add umem function to read data from user-space")
Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
- 30 Nov 2017, 1 commit

By Dan Williams

Until there is a solution to the dma-to-dax vs truncate problem, it is not safe to allow RDMA to create long-standing memory registrations against filesystem-dax vmas.

Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Doug Ledford <dledford@redhat.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Inki Dae <inki.dae@samsung.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joonyoung Shim <jy0922.shim@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 02 Jun 2017, 1 commit

By Qing Huang

This change optimizes kernel memory deregistration operations. __ib_umem_release() used to call set_page_dirty_lock() against every writable page in its memory region; its purpose is to keep data synced between the CPU and the DMA device when swapping happens after memory deregistration. Now we choose not to set the page dirty bit if it has already been set by the kernel prior to calling __ib_umem_release(). This reduces memory deregistration time by half or even more when running our application simulation test program.

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
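A sketch of the release-path check (the wrapper and its parameters are illustrative):

```c
#include <linux/mm.h>

/* Hypothetical per-page release in __ib_umem_release()-style code:
 * only pay for set_page_dirty_lock() when the page is not already
 * marked dirty. */
static void release_umem_page(struct page *page, bool writable, bool dirty)
{
	if (writable && dirty && !PageDirty(page))
		set_page_dirty_lock(page);
	put_page(page);
}
```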
-
- 26 Apr 2017, 1 commit

By Artemy Kovalyov

Add the IB_ACCESS_HUGETLB ib_reg_mr flag. A hugetlb region registered with this flag will use a single translation entry per huge page.

Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
-