1. 03 Apr, 2020 12 commits
    • P
      mm/gup: fix __get_user_pages() on fault retry of hugetlb · ad415db8
      Committed by Peter Xu
      When follow_hugetlb_page() returns with *locked==0, it means we've got a
      VM_FAULT_RETRY within the faulting process and we've released the mmap_sem.
      When that happens, we should stop and bail out.
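      A minimal userspace sketch of the bail-out described above, not the
      kernel code itself.  The callback type follow_fn, the loop
      gup_loop_model() and the demo callee demo_follow() are hypothetical
      names introduced for illustration; only the "*locked == 0 means the
      lock was dropped" convention comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for follow_hugetlb_page(): handles up to `want`
 * pages and may clear *locked to signal that a fault retry dropped the lock. */
typedef long (*follow_fn)(long want, int *locked);

/* Model of the outer loop: stop as soon as the callee reports *locked == 0. */
long gup_loop_model(follow_fn follow, long nr_pages, int *locked)
{
	long done = 0;

	assert(follow != NULL);
	while (done < nr_pages) {
		long got = follow(nr_pages - done, locked);

		if (got > 0)
			done += got;
		if (locked && !*locked)
			break;	/* lock was released: bail out, caller decides on retry */
		if (got <= 0)
			break;	/* error or no progress */
	}
	return done;
}

/* Demo callee: handles two pages per call and drops the lock on its second call. */
static int demo_calls;

long demo_follow(long want, int *locked)
{
	long got = want < 2 ? want : 2;

	if (++demo_calls == 2 && locked)
		*locked = 0;
	return got;
}
```

      With demo_follow() dropping the lock on its second call, a request for
      10 pages stops after 4 pages and leaves locked at 0, so the caller
      knows it no longer holds the lock.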
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-3-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad415db8
    • P
      mm/gup: rename "nonblocking" to "locked" where proper · 4f6da934
      Committed by Peter Xu
      Patch series "mm: Page fault enhancements", v6.
      
      This series contains cleanups and enhancements to current page fault
      logic.  The whole idea comes from the discussion between Andrea and Linus
      on the bug reported by syzbot here:
      
        https://lkml.org/lkml/2017/11/2/833
      
      Basically it does two things:
      
        (a) Allows the page fault logic to be more interactive on not only
            SIGKILL, but also the rest of userspace signals, and,
      
        (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
            than once.
      
      For (a): with the changes we should be able to react faster when page
      faults are working in parallel with userspace signals like SIGSTOP and
      SIGCONT (and more), and with that we can remove the buggy part in
      userfaultfd and benefit the whole page fault mechanism on faster signal
      processing to reach the userspace.
      
      For (b), we should be able to allow the page fault handler to loop more
      than twice.  Some context: for now, since we have
      FAULT_FLAG_ALLOW_RETRY, we can retry the page fault once within the
      same interrupt context, but the fault is never handled more than twice.
      Lifting this limit is a potential cleanup, since AFAIU the code itself
      doesn't really have this twice-only limitation (though it was probably
      meant as a protective approach in the past).  It will also greatly
      simplify future work like userfaultfd write-protect, where it's possible
      to retry more than twice (please have a look at [1] below for a
      possible user that might require the page fault to be handled for a third
      time; if we can remove the retry limitation we can simply drop that patch
      and its complexity).
      
      This patch (of 16):
      
      There are plenty of places around __get_user_pages() that have a parameter
      "nonblocking" which does not really mean that "it won't block" (because it
      can really block).  Instead, it shows whether the mmap_sem was released by
      up_read() during page fault handling, mostly when VM_FAULT_RETRY is
      returned.

      We have the correct naming in e.g. get_user_pages_locked() or
      get_user_pages_remote() as "locked"; however, there are still many places
      that use "nonblocking" as the name.

      Rename those places to "locked" where proper, to better suit the
      functionality of the variable.  While at it, fix up some of the
      comments accordingly.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f6da934
    • P
      mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path · df3a0a21
      Committed by Pingfan Liu
      FOLL_LONGTERM is a special case of FOLL_PIN.  It suggests a pin which is
      going to be given to hardware and can't move.  It would truncate CMA
      permanently and should be excluded.
      
      In the gup slow path,
      __gup_longterm_locked->check_and_migrate_cma_pages() handles
      FOLL_LONGTERM, but the fast path lacks such a check, which means a
      CMA page can leak into a longterm pin.

      Place a check in try_grab_compound_head() in the fast path to fix the
      leak.  If FOLL_LONGTERM is requested on a CMA page, gup falls back to
      the slow path to migrate the page.

      A note about the check: a huge page's subpages all have the same migrate
      type, due to either allocation from a free_list[] or alloc_contig_range()
      with param MIGRATE_MOVABLE.  So it is enough to check a single subpage
      with is_migrate_cma_page(subpage).
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Link: http://lkml.kernel.org/r/1584876733-17405-3-git-send-email-kernelfans@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df3a0a21
    • P
      mm/gup: rename nr as nr_pinned in get_user_pages_fast() · 4628b063
      Committed by Pingfan Liu
      To better reflect the held state of pages and make code self-explaining,
      rename nr as nr_pinned.
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Link: http://lkml.kernel.org/r/1584876733-17405-2-git-send-email-kernelfans@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4628b063
    • C
      mm/gup/writeback: add callbacks for inaccessible pages · f28d4363
      Committed by Claudio Imbrenda
      With the introduction of protected KVM guests on s390 there is now a
      concept of inaccessible pages.  These pages need to be made accessible
      before the host can access them.
      
      While cpu accesses will trigger a fault that can be resolved, I/O accesses
      will just fail.  We need to add a callback into architecture code for
      places that will do I/O, namely when writeback is started or when a page
      reference is taken.
      
      This is not only to enable paging, file backing etc, it is also necessary
      to protect the host against a malicious user space.  For example a bad
      QEMU could simply start direct I/O on such protected memory.  We do not
      want userspace to be able to trigger I/O errors and thus the logic is
      "whenever somebody accesses that page (gup) or does I/O, make sure that
      this page can be accessed".  When the guest tries to access that page we
      will wait in the page fault handler for writeback to have finished and for
      the page_ref to be the expected value.
      
      On s390x the function is not supposed to fail, so it is ok to use a
      WARN_ON on failure.  If we ever need more fine-grained handling we can
      tackle this when we know the details.
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f28d4363
    • J
      mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting · 1970dc6f
      Committed by John Hubbard
      Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
      unpin_user_pages*(), we need some visibility into whether all of this is
      working correctly.
      
      Add two new fields to /proc/vmstat:
      
          nr_foll_pin_acquired
          nr_foll_pin_released
      
      These are documented in Documentation/core-api/pin_user_pages.rst.  They
      represent the number of pages (since boot time) that have been pinned
      ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
      pin_user_pages*() and unpin_user_pages*().
      
      In the absence of long-running DMA or RDMA operations that hold pages
      pinned, the above two fields will normally be equal to each other.
      
      Also: update Documentation/core-api/pin_user_pages.rst, to remove an
      earlier (now confirmed untrue) claim about a performance problem with
      /proc/vmstat.
      
      Also: update Documentation/core-api/pin_user_pages.rst to rename the new
      /proc/vmstat entries, to the names listed here.
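      As a rough illustration of how the two counters might be consumed, the
      sketch below parses /proc/vmstat-style "name value" text and reports
      the difference between pins acquired and released.  The helper names
      vmstat_lookup() and foll_pins_outstanding() are hypothetical; only the
      two field names come from this commit.  The text is passed in as a
      string so the example does not depend on the running kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find `key` at the start of a line in "name value\n" text.
 * Returns 0 and stores the value on success, -1 if the key is absent. */
int vmstat_lookup(const char *text, const char *key, unsigned long long *out)
{
	size_t klen;
	const char *line = text;

	assert(text && key && out);
	klen = strlen(key);
	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
			return sscanf(line + klen, "%llu", out) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}

/* Pages currently pinned = acquired - released; -1 if either counter is missing. */
long long foll_pins_outstanding(const char *text)
{
	unsigned long long acquired, released;

	if (vmstat_lookup(text, "nr_foll_pin_acquired", &acquired) ||
	    vmstat_lookup(text, "nr_foll_pin_released", &released))
		return -1;
	return (long long)(acquired - released);
}
```

      A result of zero matches the "normally equal" case described above; a
      persistently growing difference with no long-running DMA in flight
      would hint at a missing unpin somewhere.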
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1970dc6f
    • J
      mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages · 47e29d32
      Committed by John Hubbard
      For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
      scheme tends to overflow too easily, because each tail page increments the
      head page->_refcount by GUP_PIN_COUNTING_BIAS (1024).  That limits the
      number of huge pages that can be pinned.
      
      This patch removes that limitation, by using an exact form of pin counting
      for compound pages of order > 1.  The "order > 1" is required because this
      approach uses the 3rd struct page in the compound page, and order 1
      compound pages only have two pages, so that won't work there.
      
      A new struct page field, hpage_pinned_refcount, has been added, replacing
      a padding field in the union (so no new space is used).
      
      This enhancement also has a useful side effect: huge pages and compound
      pages (of order > 1) do not suffer from the "potential false positives"
      problem that is discussed in the page_dma_pinned() comment block.  That is
      because these compound pages have extra space for tracking things, so they
      get exact pin counts instead of overloading page->_refcount.
      
      Documentation/core-api/pin_user_pages.rst is updated accordingly.
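      The overflow concern can be made concrete with a little arithmetic.
      The sketch below is a userspace illustration, assuming a signed 32-bit
      _refcount, 4 KiB base pages, and that every base page of the compound
      page is pinned once per pass; max_full_pins() and has_exact_pincount()
      are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <limits.h>

#define GUP_PIN_COUNTING_BIAS 1024LL	/* value quoted in the commit message */

/* Number of times all base pages of a compound page can be pinned before the
 * head page's refcount (assumed to be a signed 32-bit int) would overflow. */
long max_full_pins(long base_pages)
{
	long long per_pass;

	assert(base_pages > 0);
	per_pass = (long long)base_pages * GUP_PIN_COUNTING_BIAS;
	return (long)(INT_MAX / per_pass);
}

/* Exact counting lives in the 3rd struct page, so it needs order > 1. */
int has_exact_pincount(unsigned int order)
{
	return order > 1;
}
```

      Under these assumptions a 1 GiB huge page (262144 base pages) tolerates
      only 7 such passes, while a 2 MiB huge page (512 base pages) tolerates
      4095.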
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47e29d32
    • J
      mm/gup: track FOLL_PIN pages · 3faa52c0
      Committed by John Hubbard
      Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
      implemented via overloading of page->_refcount: pins are added by adding
      GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
      indication of pinning, and it can have false positives (and that's OK).
      Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
      details.
      
      As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
      (typically via pin_user_pages*()) are required to ultimately free such
      pages via unpin_user_page().
      
      Please also note the limitation, discussed in pin_user_pages.rst under the
      "TODO: for 1GB and larger huge pages" section.  (That limitation will be
      removed in a following patch.)
      
      The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
      thought of as "FOLL_GET for DIO and/or RDMA use".
      
      Pages that have been pinned via FOLL_PIN are identifiable via a new
      function call:
      
         bool page_maybe_dma_pinned(struct page *page);
      
      What to do in response to encountering such a page, is left to later
      patchsets. There is discussion about this in [1], [2], [3], and [4].
      
      This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
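      To illustrate why the bias scheme is "fuzzy", here is a small userspace
      model of the counting described above.  struct page_model and the
      model_*() helpers are hypothetical stand-ins for struct page and the
      real get/put/pin/unpin paths; only the bias value (1024) and the idea
      of comparing the refcount against it come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024	/* value quoted in the commit message */

/* Hypothetical stand-in for struct page: only the overloaded refcount. */
struct page_model {
	int refcount;
};

/* An ordinary reference (FOLL_GET-style) adds 1. */
void model_get(struct page_model *p)
{
	p->refcount += 1;
}

/* A pin (FOLL_PIN-style) adds the whole bias. */
void model_pin(struct page_model *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Unpinning must remove exactly what pinning added. */
void model_unpin(struct page_model *p)
{
	assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* "Maybe" pinned: true for any refcount at or above the bias. */
bool model_maybe_dma_pinned(const struct page_model *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

      In this model a page with 1024 or more ordinary references looks
      pinned even though nobody pinned it, which is the false positive the
      commit message accepts.  It also shows why a plain put is not a valid
      way to undo a pin: the counts would no longer balance.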
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      [4] LWN kernel index: get_user_pages():
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      [jhubbard@nvidia.com: add kerneldoc]
        Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
      [imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
        Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
      [akpm@linux-foundation.org: fix put_compound_head defined but not used]
      Suggested-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3faa52c0
    • J
      mm/gup: require FOLL_GET for get_user_pages_fast() · 94202f12
      Committed by John Hubbard
      Internal to mm/gup.c, require that get_user_pages_fast() and
      __get_user_pages_fast() identify themselves, by setting FOLL_GET.  This is
      required in order to be able to make decisions based on "FOLL_PIN, or
      FOLL_GET, or both or neither are set", in upcoming patches.
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-6-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94202f12
    • J
      mm/gup: pass gup flags to two more routines · 3b78d834
      Committed by John Hubbard
      In preparation for an upcoming patch, send gup flags args to two more
      routines: put_compound_head(), and undo_dev_pagemap().
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-5-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b78d834
    • J
      mm/gup: pass a flags arg to __gup_device_* functions · 86dfbed4
      Committed by John Hubbard
      A subsequent patch requires access to gup flags, so pass the flags
      argument through to the __gup_device_* functions.
      
      Also placate checkpatch.pl by shortening a nearby line.
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-3-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86dfbed4
    • J
      mm/gup: split get_user_pages_remote() into two routines · 22bf29b6
      Committed by John Hubbard
      Patch series "mm/gup: track FOLL_PIN pages", v6.
      
      This activates tracking of FOLL_PIN pages.  This is in support of fixing
      the get_user_pages()+DMA problem described in [1]-[4].
      
      FOLL_PIN support is now in the main linux tree.  However, the patch to use
      FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test
      suite failure that involved (I think) page refcount overflows when huge
      pages were used.
      
      This patch definitively solves that kind of overflow problem, by adding an
      exact pincount, for compound pages (of order > 1), in the 3rd struct page
      of a compound page.  If available, that form of pincounting is used,
      instead of the GUP_PIN_COUNTING_BIAS approach.  Thanks again to Jan Kara
      for that idea.
      
      Other interesting changes:
      
      * dump_page(): added one, or two new things to report for compound
        pages: head refcount (for all compound pages), and map_pincount (for
        compound pages of order > 1).
      
      * Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
        huge page refcount upper limit problems, and added notes about how it
        works now.  Also added a note about the dump_page() enhancements.
      
      * Added some comments in gup.c and mm.h, to explain that there are two
        ways to count pinned pages: exact (for compound pages of order > 1) and
        fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
      
      ============================================================
      General notes about the tracking patch:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages, and [R]DMA activities, as discussed in [1],
      [2], [3], [4] and in a remarkable number of email threads since about
      2017.  :)
      
      In contrast to earlier approaches, the page tracking can be incrementally
      applied to the kernel call sites that, until now, have been simply calling
      get_user_pages() ("gup").  In other words, opt-in by changing from this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:
          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
      
      ============================================================
      Future steps:
      
      * Convert more subsystems from get_user_pages() to pin_user_pages().
        The first probably needs to be bio/biovecs, because any filesystem
        testing is too difficult without those in place.
      
      * Change VFS and filesystems to respond appropriately when encountering
        dma-pinned pages.
      
      * Work with Ira and others to connect this all up with file system
        leases.
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      
      [4] LWN kernel index: get_user_pages()
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      This patch (of 12):
      
      An upcoming patch requires reusing the implementation of
      get_user_pages_remote().  Split up get_user_pages_remote() into an outer
      routine that checks flags, and an implementation routine that will be
      reused.  This makes subsequent changes much easier to understand.
      
      There should be no change in behavior due to this patch.
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-2-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22bf29b6
  2. 04 Feb, 2020 1 commit
  3. 01 Feb, 2020 8 commits
    • J
      mm, tree-wide: rename put_user_page*() to unpin_user_page*() · f1f6a7dd
      Committed by John Hubbard
      Rename put_user_page*() to unpin_user_page*() in order to provide a
      clearer, more symmetric API for pinning and unpinning DMA pages.  This
      way, pin_user_pages*() calls match up with unpin_user_pages*() calls,
      and the API is a lot closer to being self-explanatory.

      Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1f6a7dd
    • J
      mm/gup: introduce pin_user_pages*() and FOLL_PIN · eddb1c22
      Committed by John Hubbard
      Introduce pin_user_pages*() variations of get_user_pages*() calls, and
      also pin_longterm_pages*() variations.
      
      For now, these are placeholder calls, until the various call sites are
      converted to use the correct get_user_pages*() or pin_user_pages*() API.
      
      These variants will eventually all set FOLL_PIN, which is also
      introduced, and thoroughly documented.
      
          pin_user_pages()
          pin_user_pages_remote()
          pin_user_pages_fast()
      
      All pages that are pinned via the above calls, must be unpinned via
      put_user_page().
      
      The underlying rules are:
      
      * FOLL_PIN is a gup-internal flag, so the call sites should not directly
        set it.  That behavior is enforced with assertions.
      
      * Call sites that want to indicate that they are going to do DirectIO
        ("DIO") or something with similar characteristics, should call a
        get_user_pages()-like wrapper call that sets FOLL_PIN.  These wrappers
        will:
      
          * Start with "pin_user_pages" instead of "get_user_pages".  That
            makes it easy to find and audit the call sites.
      
          * Set FOLL_PIN
      
      * For pages that are received via FOLL_PIN, those pages must be returned
        via put_user_page().
      
      Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in
      this documentation.  (I've reworded it and expanded upon it.)
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[Documentation]
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eddb1c22
    • J
      mm/gup: allow FOLL_FORCE for get_user_pages_fast() · f4000fdf
      Committed by John Hubbard
      Commit 817be129 ("mm: validate get_user_pages_fast flags") allowed
      only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
      This, combined with the fact that get_user_pages_fast() falls back to
      "slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
      if you need FOLL_FORCE, you cannot call get_user_pages_fast().
      
      There does not appear to be any reason for filtering out FOLL_FORCE.
      There is nothing in the _fast() implementation that requires that we
      avoid writing to the pages.  So it appears to have been an oversight.
      
      Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
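
      As a rough userspace sketch of the flag check described above: the
      fix amounts to adding FOLL_FORCE to the mask of accepted flags.  The
      numeric flag values and the helper name gup_fast_flags_ok() are
      assumptions made for illustration; this is not the code in gup.c.

```c
#include <stdbool.h>

/* Illustrative flag values, assumed for this sketch only. */
#define FOLL_WRITE    0x01u
#define FOLL_FORCE    0x10u
#define FOLL_NOWAIT   0x20u
#define FOLL_LONGTERM 0x10000u

/* Flags that the fast path accepts once FOLL_FORCE is allowed. */
#define GUP_FAST_ALLOWED (FOLL_WRITE | FOLL_LONGTERM | FOLL_FORCE)

/* Hypothetical helper: true if no flag outside the allowed mask is set. */
static bool gup_fast_flags_ok(unsigned int gup_flags)
{
	return (gup_flags & ~GUP_FAST_ALLOWED) == 0;
}
```

      With a mask like this, a caller passing FOLL_WRITE | FOLL_FORCE is
      accepted, while any flag outside the mask is still rejected.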
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com
      Fixes: 817be129 ("mm: validate get_user_pages_fast flags")
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4000fdf
    • J
      mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM · c4237f8b
      John Hubbard committed
      As it says in the updated comment in gup.c: current FOLL_LONGTERM
      behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
      DAX check requirement on vmas.
      
      However, the corresponding restriction in get_user_pages_remote() was
      slightly stricter than is actually required: it forbade all
      FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
      that do not set the "locked" arg.
      
      Update the code and comments to loosen the restriction, allowing
      FOLL_LONGTERM in some cases.
      
      Also, copy the DAX check ("if a VMA is DAX, don't allow long term
      pinning") from the VFIO call site, all the way into the internals of
      get_user_pages_remote() and __gup_longterm_locked().  That is:
      get_user_pages_remote() calls __gup_longterm_locked(), which in turn
      calls check_dax_vmas().  This check will then be removed from the VFIO
      call site in a subsequent patch.
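
      The loosened restriction can be modeled in userspace roughly as
      follows.  This is only a sketch: the flag value, the enum, and the
      helper name remote_gup_pick_path() are hypothetical, and the real
      routines take many more arguments.

```c
#include <errno.h>
#include <stddef.h>

#define FOLL_LONGTERM 0x10000u /* illustrative value, assumed */

enum gup_path { GUP_PATH_REJECTED, GUP_PATH_LONGTERM, GUP_PATH_REGULAR };

/*
 * Hypothetical model of the entry check.  A non-NULL "locked" pointer
 * means the caller allows the fault path to drop the lock and retry,
 * which is what FOLL_LONGTERM cannot be combined with.
 */
static int remote_gup_pick_path(unsigned int gup_flags, const int *locked,
				enum gup_path *path)
{
	if (gup_flags & FOLL_LONGTERM) {
		if (locked) {
			*path = GUP_PATH_REJECTED;
			return -EINVAL;
		}
		/* Long-term path: this is where the DAX vma check would run. */
		*path = GUP_PATH_LONGTERM;
		return 0;
	}
	*path = GUP_PATH_REGULAR;
	return 0;
}
```

      The point of the sketch is that only the combination of
      FOLL_LONGTERM with a "locked" argument is refused; FOLL_LONGTERM
      alone is routed to the long-term path.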
      
      Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
      to Dan Williams for helping clarify the DAX refactoring.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Tested-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4237f8b
    • J
      mm/gup: move try_get_compound_head() to top, fix minor issues · a707cdd5
      John Hubbard committed
      An upcoming patch uses try_get_compound_head() more widely, so move it to
      the top of gup.c.
      
      Also fix a tiny spelling error and a checkpatch.pl warning.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a707cdd5
    • J
      mm/gup: factor out duplicate code from four routines · a43e9820
      John Hubbard committed
      Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.
      
      Overview:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages, and [R]DMA activities, as discussed in [1],
      [2], [3], and in a remarkable number of email threads since about
      2017.  :)
      
      A new internal gup flag, FOLL_PIN, is introduced, and thoroughly
      documented in the last patch's Documentation/vm/pin_user_pages.rst.
      
      I believe that this will provide a good starting point for doing the
      layout lease work that Ira Weiny has been working on.  That's because
      these new wrapper functions provide a clean, constrained, systematically
      named set of functionality that, again, is required in order to even
      know if a page is "dma-pinned".
      
      In contrast to earlier approaches, the page tracking can be
      incrementally applied to the kernel call sites that, until now, have
      been simply calling get_user_pages() ("gup").  In other words, opt-in by
      changing from this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:

          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
      
      Testing:
      
      * I've done some overall kernel testing (LTP, and a few other goodies),
        and some directed testing to exercise some of the changes. And as you
        can see, gup_benchmark is enhanced to exercise this. Basically, I've
        been able to runtime test the core get_user_pages() and
        pin_user_pages() and related routines, but not so much on several of
        the call sites--but those are generally just a couple of lines
        changed, each.
      
        Not much of the kernel is actually using this, which on one hand
        reduces risk quite a lot. But on the other hand, testing coverage
        is low. So I'd love it if, in particular, the Infiniband and PowerPC
        folks could do a smoke test of this series for me.
      
        Runtime testing for the call sites so far is pretty light:
      
          * io_uring: Some directed tests from liburing exercise this, and
                      they pass.
          * process_vm_access.c: A small directed test passes.
          * gup_benchmark: the enhanced version hits the new gup.c code, and
                           passes.
          * infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
          * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                            not ready just yet)
          * powerpc: it compiles...
          * drm/via: compiles...
          * goldfish: compiles...
          * net/xdp: compiles...
          * media/v4l2: compiles...
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
      
      This patch (of 22):
      
      There are four locations in gup.c that have a fair amount of code
      duplication.  This means that changing one requires making the same
      changes in four places, not to mention reading the same code four times,
      and wondering if there are subtle differences.
      
      Factor out the common code into static functions, thus reducing the
      overall line count and the code's complexity.
      
      Also, take the opportunity to slightly improve the efficiency of the
      error cases, by doing a mass subtraction of the refcount, surrounded by
      get_page()/put_page().
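
      The mass subtraction can be pictured with a small userspace model.
      This is a sketch under stated assumptions: struct toy_page and the
      toy_*() helpers are hypothetical stand-ins built on C11 atomics, not
      struct page or the kernel's refcount helpers.

```c
#include <stdatomic.h>

/* Toy stand-in for a page with a reference count (not struct page). */
struct toy_page {
	atomic_int refcount;
	int freed; /* set when the last reference is dropped */
};

static void toy_get_page(struct toy_page *p)
{
	atomic_fetch_add(&p->refcount, 1);
}

static void toy_put_page(struct toy_page *p)
{
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	if (atomic_fetch_sub(&p->refcount, 1) == 1)
		p->freed = 1;
}

/* Drop "refs" references with one subtraction instead of a put loop. */
static void toy_put_many(struct toy_page *p, int refs)
{
	/* Extra reference keeps the count above zero during the subtraction. */
	toy_get_page(p);
	atomic_fetch_sub(&p->refcount, refs);
	/* The final put is the only place where the free can happen. */
	toy_put_page(p);
}
```

      Bracketing the subtraction with a get and a put means the count
      cannot reach zero inside the subtraction itself, even when refs
      equals the current count, so the free path stays in one place.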
      
      Also, further simplify (slightly), by waiting until the successful
      end of each routine, to increment *nr.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a43e9820
    • W
      be9d3045
    • Q
      mm: fix gup_pud_range · 15494520
      Qiujun Huang committed
      Sorry for not following up on this for a long time.  I hit the problem again.
      
      patch v1   https://lkml.org/lkml/2019/9/20/656
      
      do_machine_check()
        do_memory_failure()
          memory_failure()
            hw_poison_user_mappings()
              try_to_unmap()
                pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
      
      ...and now we have a swap entry that indicates that the page entry
      refers to a bad (and poisoned) page of memory, but gup_fast() at this
      level of the page table was ignoring swap entries, and incorrectly
      assuming that "!pxd_none() == valid and present".
      
      And this was not just a poisoned page problem, but a general swap entry
      problem.  So, any swap entry type (device memory migration, numa
      migration, or just regular swapping) could lead to the same problem.
      
      Fix this by checking for pxd_present(), instead of pxd_none().
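
      To see why the two checks differ, consider a toy encoding of a page
      table entry.  Everything here is an assumption made for illustration
      (the bit layout, the toy_*() names, and the way a swap-style entry is
      built); real entry formats are architecture specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy entry encoding, assumed for this sketch: bit 0 marks "present". */
#define TOY_PRESENT 0x1ull

typedef uint64_t toy_pxd_t;

static bool toy_pxd_none(toy_pxd_t e)
{
	return e == 0;
}

static bool toy_pxd_present(toy_pxd_t e)
{
	return (e & TOY_PRESENT) != 0;
}

/* Hypothetical swap-style entry: payload in the upper bits, present bit clear. */
static toy_pxd_t toy_make_swap_entry(uint64_t payload)
{
	return payload << 1;
}

/* Old test: anything that is not "none" was treated as walkable. */
static bool old_check_walks(toy_pxd_t e)
{
	return !toy_pxd_none(e);
}

/* New test: only entries that are actually present are walked. */
static bool new_check_walks(toy_pxd_t e)
{
	return toy_pxd_present(e);
}
```

      In this model a swap-style entry with a non-zero payload passes the
      old test but fails the new one, which is the case the fix closes.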
      
      Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com
      Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15494520
  4. 01 Dec, 2019 2 commits
  5. 19 Oct, 2019 1 commit
  6. 26 Sep, 2019 1 commit
  7. 25 Sep, 2019 3 commits
  8. 17 Jul, 2019 1 commit
  9. 13 Jul, 2019 11 commits