1. 01 Feb, 2020 1 commit
  2. 16 Jan, 2020 1 commit
  3. 17 Nov, 2019 1 commit
  4. 15 Nov, 2019 1 commit
  5. 25 Sep, 2019 1 commit
    • mm/gup: add make_dirty arg to put_user_pages_dirty_lock() · 2d15eb31
      akpm@linux-foundation.org committed
      From: John Hubbard <jhubbard@nvidia.com>
      Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
      
      Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()",
      v3.
      
      There are about 50+ patches in my tree [2], and I'll be sending out the
      remaining ones in a few more groups:
      
      * The block/bio related changes (Jerome mostly wrote those, but I've had
        to move stuff around extensively, and add a little code)
      
      * mm/ changes
      
      * other subsystem patches
      
      * an RFC that shows the current state of the tracking patch set.  That
        can only be applied after all call sites are converted, but it's good to
        get an early look at it.
      
      This is part of a tree-wide conversion, as described in fc1d8e7c ("mm:
      introduce put_user_page*(), placeholder versions").
      
      This patch (of 3):
      
      Provide a more capable variation of put_user_pages_dirty_lock(), and delete
      put_user_pages_dirty().  This is based on the following:
      
      1.  Lots of call sites become simpler if a bool is passed into
         put_user_page*(), instead of making the call site choose which
         put_user_page*() variant to call.
      
      2.  Christoph Hellwig's observation that set_page_dirty_lock() is
         usually correct, and set_page_dirty() is usually a bug, or at least
         questionable, within a put_user_page*() calling chain.
      
      This leads to the following API choices:
      
          * put_user_pages_dirty_lock(page, npages, make_dirty)
      
          * There is no put_user_pages_dirty(). You have to
            hand code that, in the rare case that it's
            required.
      
      [jhubbard@nvidia.com: remove unused variable in siw_free_plist()]
        Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com
      Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
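
The API choice described in commit 2d15eb31 can be illustrated with a small userspace sketch. This is not the kernel implementation: `struct mock_page` and every `mock_*` function are hypothetical stand-ins invented here, and the only idea taken from the commit message is that the caller passes a `make_dirty` bool instead of picking between release variants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct page (illustration only). */
struct mock_page {
    int refcount;   /* pins currently held on the page */
    bool dirty;     /* models the page dirty flag */
};

/* Stand-in for set_page_dirty_lock(): here it only sets the flag. */
static void mock_set_page_dirty_lock(struct mock_page *page)
{
    page->dirty = true;
}

/* Stand-in for put_user_page(): drop one pin. */
static void mock_put_user_page(struct mock_page *page)
{
    page->refcount--;
}

/*
 * Single entry point for both cases: the caller passes a bool rather
 * than choosing between a "dirty" and a "non-dirty" release function.
 */
static void mock_put_user_pages_dirty_lock(struct mock_page **pages,
                                           size_t npages, bool make_dirty)
{
    for (size_t i = 0; i < npages; i++) {
        if (make_dirty)
            mock_set_page_dirty_lock(pages[i]);
        mock_put_user_page(pages[i]);
    }
}
```

A call site that used to branch on whether the buffer was written can then make a single call such as `mock_put_user_pages_dirty_lock(pages, npages, was_written)`.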
  6. 22 Aug, 2019 3 commits
  7. 21 Aug, 2019 1 commit
  8. 21 Jun, 2019 1 commit
  9. 28 May, 2019 1 commit
  10. 22 May, 2019 1 commit
  11. 15 May, 2019 1 commit
    • mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny committed
      Patch series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on if we are going to use FL_LAYOUT or something else to
      solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR in the performance path...
      For the overall application performance.  I don't have the numbers as the
      tests for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they rework the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers would like to take a longterm (user
      controlled) pin of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side effect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, it depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they rework the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
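
The idea in commit 932f4a63 of replacing a dedicated get_user_pages_longterm() call with a flag can be sketched in userspace as follows. Everything here is an assumption made for illustration: the `MOCK_FOLL_*` values are not the kernel's real bit assignments, `struct mock_vma` only records whether a mapping is FS DAX, and the `-1` return is a placeholder rather than the kernel's actual error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; the kernel's real bit assignments differ. */
#define MOCK_FOLL_WRITE    0x01u
#define MOCK_FOLL_LONGTERM 0x02u

/* Stand-in for a vma: only records whether it maps FS DAX memory. */
struct mock_vma {
    bool is_fsdax;
};

/*
 * Returns the number of pages "pinned", or -1 (placeholder error) when
 * a longterm pin is requested on an FS DAX mapping.  Without
 * MOCK_FOLL_LONGTERM the DAX check is skipped, as for a short-lived pin.
 */
static long mock_get_user_pages(const struct mock_vma *vma, size_t npages,
                                unsigned int gup_flags)
{
    if ((gup_flags & MOCK_FOLL_LONGTERM) && vma->is_fsdax)
        return -1;
    return (long)npages;
}
```

With a flag, one entry point serves both short-lived and longterm callers, which is the design choice the commit message argues for; the check itself needs the vma, matching the note that FOLL_LONGTERM requires vmas to determine whether DAX is in use.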
  12. 07 May, 2019 2 commits
  13. 03 May, 2019 1 commit
    • RDMA/umem: Handle page combining avoidance correctly in ib_umem_add_sg_table() · 7872168a
      Shiraz Saleem committed
      The flag update_cur_sg tracks whether contiguous pages from a new set of
      page_list pages can be merged into the SGE passed into
      ib_umem_add_sg_table(). If this flag is true, but the total segment length
      exceeds the max_seg_size supported by HW, we avoid combining to this SGE
      and move to a new SGE (x) and merge 'len' pages to it. However, if i <
      npages, the next iteration can incorrectly merge 'len' contiguous pages
      into x instead of into a new SGE since update_cur_sg is still true.
      
      Reset update_cur_sg to false always after the check to merge pages into
      the first SGE passed in to ib_umem_add_sg_table().  Also, prevent a new
      SGE's segment length from ever exceeding HW max_seg_sz.
      
      There is a crash on hfi1 as a result of this, where max_seg_sz
      defaults to 64K. Due to the above bug, unfolding SGEs in __ib_umem_release
      points to a bad page ptr.
      
       TEST comp-wfr.perfnative.STL-22166-WDT _ perftest native 2-Write_4097QP_4MB STARTING at 1555387093
       BUG: Bad page state in process ib_write_bw  pfn:7ebca0
       page:ffffcd675faf2800 count:0 mapcount:1 mapping:0000000000000000 index:0x1
       flags: 0x17ffffc0000000()
       raw: 0017ffffc0000000 dead000000000100 dead000000000200 0000000000000000
       raw: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
       page dumped because: nonzero mapcount
       CPU: 18 PID: 15853 Comm: ib_write_bw Tainted: G    B             5.1.0-rc4 #1
       Hardware name: Intel Corporation S2600CWR/S2600CW, BIOS SE5C610.86B.01.01.0014.121820151719 12/18/2015
       Call Trace:
        dump_stack+0x5a/0x73
        bad_page+0xf5/0x10f
        free_pcppages_bulk+0x62c/0x680
        free_unref_page+0x54/0x70
        __ib_umem_release+0x148/0x1a0 [ib_uverbs]
        ib_umem_release+0x22/0x80 [ib_uverbs]
        rvt_dereg_mr+0x67/0xb0 [rdmavt]
        ib_dereg_mr_user+0x37/0x60 [ib_core]
        destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
        uverbs_destroy_uobject+0x2e/0x180 [ib_uverbs]
        uobj_destroy+0x4d/0x60 [ib_uverbs]
        __uobj_get_destroy+0x33/0x50 [ib_uverbs]
        __uobj_perform_destroy+0xa/0x30 [ib_uverbs]
        ib_uverbs_dereg_mr+0x66/0x90 [ib_uverbs]
        ib_uverbs_write+0x3e1/0x500 [ib_uverbs]
        vfs_write+0xad/0x1b0
        ksys_write+0x5a/0xd0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: d10bcf94 ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
      Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
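
The merging rule that commit 7872168a describes can be sketched in userspace as below. This is a simplified model, not the code in ib_umem_add_sg_table(): pages are represented by bare page frame numbers, `struct mock_sge` and `mock_add_sg_table()` are hypothetical names, and a 4K page size is assumed. It shows the two points from the commit message: update_cur_sg is cleared after the first merge attempt, and no segment grows beyond max_seg_sz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096ul   /* assumed page size */

/* Simplified scatter/gather entry: first page frame number and byte length. */
struct mock_sge {
    unsigned long first_pfn;
    unsigned long length;
};

/*
 * Append pfns[0..npages) to the table sg[0..*nsge).  When update_cur_sg
 * is true, the first contiguous run may be merged into the last existing
 * entry, but only if the result stays within max_seg_sz.  The flag is
 * cleared after that first attempt whether or not the merge happened.
 * The caller must provide enough room in sg[].
 */
static void mock_add_sg_table(struct mock_sge *sg, size_t *nsge,
                              const unsigned long *pfns, size_t npages,
                              unsigned long max_seg_sz, bool update_cur_sg)
{
    size_t max_pages = max_seg_sz / MOCK_PAGE_SIZE;
    size_t i = 0;

    if (max_pages == 0)
        return;     /* a segment must hold at least one page */

    while (i < npages) {
        /* Length of the contiguous run at i, capped at max_pages. */
        size_t len = 1;
        while (i + len < npages && len < max_pages &&
               pfns[i + len] == pfns[i] + len)
            len++;

        if (update_cur_sg && *nsge > 0) {
            struct mock_sge *cur = &sg[*nsge - 1];
            unsigned long next_pfn =
                cur->first_pfn + cur->length / MOCK_PAGE_SIZE;

            update_cur_sg = false;  /* always reset after the check */
            if (next_pfn == pfns[i] &&
                max_seg_sz - cur->length >= len * MOCK_PAGE_SIZE) {
                cur->length += len * MOCK_PAGE_SIZE;
                i += len;
                continue;
            }
        }
        update_cur_sg = false;

        /* Start a new entry for this run. */
        sg[*nsge].first_pfn = pfns[i];
        sg[*nsge].length = len * MOCK_PAGE_SIZE;
        (*nsge)++;
        i += len;
    }
}
```

For example, with a 64K limit, an existing 15-page entry, and 20 further contiguous pages, the sketch leaves the first entry untouched and produces a 16-page entry followed by a 4-page entry, rather than folding the tail into an entry that is already full.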
  14. 09 Apr, 2019 2 commits
  15. 27 Mar, 2019 1 commit
  16. 16 Feb, 2019 1 commit
  17. 08 Feb, 2019 2 commits
  18. 11 Jan, 2019 1 commit
  19. 28 Sep, 2018 1 commit
    • RDMA/core: Acquire and release mmap_sem on page range · 3994586f
      Parav Pandit committed
      Currently mmap_sem is read locked while pinning the memory.  In a
      multi-threaded process, holding mmap_sem creates contention with other
      threads that might be registering memory, creating QPs or simply
      calling mmap(), as such operations also require holding the mmap_sem
      write lock.
      
      None of these operations can make forward progress until the memory pin
      operation completes.  It becomes worse if the memory is unpinned
      and/or the memory registration is large (in the GB range).
      
      Therefore, instead of holding mmap_sem for too long (for whole region
      pinning), acquire and release the lock for every few pages.  For example
      on x86 with 4K page size, acquire and release mmap_sem for every 2Mbytes
      memory chunk.
      
      This allows competing threads that need mmap_sem only for a short
      duration to make progress.
      
      When memory registration latency is measured using [1] for memory sizes
      ranging from 4K to 48GB, <= 1% or 0.5% degradation is noticed. In many
      runs no difference is seen other than run-to-run variance.
      
      In other targeted tests of users with large memory, desired improvements
      are seen due to reduced contention of mmap_sem.
      
      [1] https://github.com/paravmellanox/rtool
      
      $ rdma_resource_lat -c 1 -s 48G -a -u L -i 500 -A
      
      It registers pinned memory from 4K to 48GB size with 500 iterations for
      each memory size.
      
      $ rdma_resource_lat -c 1 -s 12G -a -u L -i 500 -t 4
      
      4 competing threads pin memory, each of 12GB size with 500 iterations.
      Signed-off-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
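
The locking pattern from commit 3994586f can be sketched in userspace as follows. This is only a model: `struct mock_mmap_sem` is a counter rather than a real lock, all `mock_*` names are hypothetical, and the chunk size of 512 pages is derived from the figures in the commit message (2 MB chunks with 4K pages).

```c
#include <assert.h>
#include <stddef.h>

/* 2 MB / 4 KB = 512 pages per lock hold, per the commit message. */
#define MOCK_CHUNK_PAGES 512u

/* Counter-based stand-in for mmap_sem; it does no real locking. */
struct mock_mmap_sem {
    int readers;                 /* read holds currently active */
    unsigned long acquisitions;  /* total number of read acquisitions */
};

static void mock_down_read(struct mock_mmap_sem *sem)
{
    sem->readers++;
    sem->acquisitions++;
}

static void mock_up_read(struct mock_mmap_sem *sem)
{
    sem->readers--;
}

/*
 * Pin npages in chunks, taking and dropping the lock around each chunk
 * instead of holding it across the whole region.  Returns pages pinned.
 */
static size_t mock_pin_region(struct mock_mmap_sem *sem, size_t npages)
{
    size_t pinned = 0;

    while (pinned < npages) {
        size_t chunk = npages - pinned;

        if (chunk > MOCK_CHUNK_PAGES)
            chunk = MOCK_CHUNK_PAGES;

        mock_down_read(sem);
        /* Real code would pin this chunk of pages here. */
        pinned += chunk;
        mock_up_read(sem);   /* a waiting writer can get in between chunks */
    }
    return pinned;
}
```

Pinning 1300 pages in this model takes the lock three times (512 + 512 + 276 pages), so a thread waiting for the write lock gets an opportunity after each 2 MB chunk instead of after the whole region.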
  20. 26 Sep, 2018 2 commits
  21. 21 Sep, 2018 4 commits
  22. 14 Jul, 2018 2 commits
  23. 27 Jun, 2018 1 commit
  24. 29 May, 2018 1 commit
  25. 16 May, 2018 2 commits
  26. 19 Dec, 2017 1 commit
  27. 30 Nov, 2017 1 commit
  28. 02 Jun, 2017 1 commit
    • RDMA/core: not to set page dirty bit if it's already set. · 53376fed
      Qing Huang committed
      This change optimizes kernel memory deregistration operations.
      __ib_umem_release() used to call set_page_dirty_lock() on every
      writable page in its memory region, to keep data in sync between the
      CPU and the DMA device when swapping happens after memory
      deregistration.  Now we skip setting the page dirty bit if the kernel
      has already set it before __ib_umem_release() is called.  This
      reduced memory deregistration time by half or more when we ran an
      application simulation test program.
      Signed-off-by: Qing Huang <qing.huang@oracle.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
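
The optimization in commit 53376fed amounts to testing the dirty bit before taking the more expensive path. A minimal userspace sketch, assuming a hypothetical `struct mock_page` and `mock_*` helpers that only model the flag, might look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a page; only the dirty flag is modelled. */
struct mock_page {
    bool dirty;
};

/* Stand-in for set_page_dirty_lock(), the costly call being avoided. */
static void mock_set_page_dirty_lock(struct mock_page *page)
{
    page->dirty = true;
}

/*
 * Release a region of npages.  For writable regions, mark a page dirty
 * only if it is not dirty already.  Returns how many times the costly
 * call was made, so the saving can be observed.
 */
static unsigned long mock_umem_release(struct mock_page *pages,
                                       size_t npages, bool writable)
{
    unsigned long calls = 0;

    for (size_t i = 0; i < npages; i++) {
        if (writable && !pages[i].dirty) {
            mock_set_page_dirty_lock(&pages[i]);
            calls++;
        }
        /* Real code would drop the page reference here. */
    }
    return calls;
}
```

In a region where half the pages are already dirty, this makes half as many calls, which is consistent in spirit with the speedup the commit message reports, although the real gain depends on the workload.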
  29. 26 Apr, 2017 1 commit