1. 10 Jun, 2020 5 commits
  2. 09 Jun, 2020 3 commits
  3. 04 Jun, 2020 5 commits
  4. 03 Jun, 2020 4 commits
    • mm/gup.c: further document vma_permits_fault() · 548b6a1e
      Miles Chen committed
      Describe the caller's responsibilities when passing
      FAULT_FLAG_ALLOW_RETRY.
      
      Link: http://lkml.kernel.org/r/1586915606.5647.5.camel@mtkswgap22
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: introduce pin_user_pages_unlocked · 91429023
      John Hubbard committed
      Introduce pin_user_pages_unlocked(), which is nearly identical to the
      get_user_pages_unlocked() that it wraps, except that it sets FOLL_PIN
      and rejects FOLL_GET.
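The wrapper described above can be pictured with a small userspace model. This is only a sketch under stated assumptions: the flag values, the `model_` names, the stubbed stand-in for get_user_pages_unlocked(), and the choice of -EINVAL as the rejection code are illustrative and are not taken from the commit text.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative flag bits (assumed values, not the kernel's definitions). */
#define MODEL_FOLL_WRITE 0x01u
#define MODEL_FOLL_GET   0x04u
#define MODEL_FOLL_PIN   0x40000u

/* Records the flags that the wrapped call last received. */
static unsigned int model_last_flags;

/* Stand-in for get_user_pages_unlocked(): pretends every page was obtained. */
static long model_get_user_pages_unlocked(unsigned long nr_pages,
                                          unsigned int gup_flags)
{
    model_last_flags = gup_flags;
    return (long)nr_pages;
}

/* Model of the wrapper: reject FOLL_GET, add FOLL_PIN, then delegate. */
static long model_pin_user_pages_unlocked(unsigned long nr_pages,
                                          unsigned int gup_flags)
{
    if (gup_flags & MODEL_FOLL_GET)
        return -EINVAL;   /* assumed error code for the rejected flag */

    return model_get_user_pages_unlocked(nr_pages,
                                         gup_flags | MODEL_FOLL_PIN);
}
```

A caller passing only a write flag would see the pin flag added on the way through, while a caller passing the get flag would be refused before any page is touched.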
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Walls <awalls@md.metrocast.net>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Link: http://lkml.kernel.org/r/20200518012157.1178336-2-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup.c: update the documentation · adc8cb40
      Souptick Joarder committed
      This patch is an attempt to update the documentation.
      
       - Add/remove the extra * depending on whether the function is static or global.
      
       - Add description for functions and their input arguments.
      
      [akpm@linux-foundation.org: s@/*@/**@]
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1588013630-4497-1-git-send-email-jrdr.linux@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • gup: document and work around "COW can break either way" issue · 17839856
      Linus Torvalds committed
      Doing a "get_user_pages()" on a copy-on-write page for reading can be
      ambiguous: the page can be COW'ed at any time afterwards, and the
      direction of a COW event isn't defined.
      
      Yes, whoever writes to it will generally do the COW, but if the thread
      that did the get_user_pages() unmapped the page before the write (and
      that could happen due to memory pressure in addition to any outright
      action), the writer could also just take over the old page instead.
      
      End result: the get_user_pages() call might result in a page pointer
      that is no longer associated with the original VM, and is associated
      with - and controlled by - another VM having taken it over instead.
      
      So when doing a get_user_pages() on a COW mapping, the only really safe
      thing to do would be to break the COW when getting the page, even when
      only getting it for reading.
      
      At the same time, some users simply don't even care.
      
      For example, the perf code wants to look up the page not because it
      cares about the page, but because the code simply wants to look up the
      physical address of the access for informational purposes, and doesn't
      really care about races when a page might be unmapped and remapped
      elsewhere.
      
      This adds logic to force a COW event by setting FOLL_WRITE on any
      copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
      pointer as a result.
      
      The current semantics end up being:
      
       - __get_user_pages_fast(): no change. If you don't ask for a write,
         you won't break COW. You'd better know what you're doing.
      
       - get_user_pages_fast(): the fast-case "look it up in the page tables
         without anything getting mmap_sem" now refuses to follow a read-only
         page, since it might need COW breaking.  Which happens in the slow
         path - the fast path doesn't know if the memory might be COW or not.
      
       - get_user_pages() (including the slow-path fallback for gup_fast()):
         for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
         very similar semantics to FOLL_FORCE.
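The slow-path rule in the last bullet can be sketched as a userspace model. The flag values and `model_` names below are assumptions for illustration, and the definition of a "COW mapping" as one that is private but may be made writable is likewise an assumption, not something this commit message spells out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bits only (assumed values, not the kernel's definitions). */
#define MODEL_VM_SHARED   0x08u
#define MODEL_VM_MAYWRITE 0x20u
#define MODEL_FOLL_WRITE  0x01u
#define MODEL_FOLL_GET    0x04u
#define MODEL_FOLL_PIN    0x40000u

/* Assumption: a COW mapping is private (not shared) and may become writable. */
static bool model_is_cow_mapping(unsigned int vm_flags)
{
    return (vm_flags & (MODEL_VM_SHARED | MODEL_VM_MAYWRITE)) ==
           MODEL_VM_MAYWRITE;
}

/* Returns the flags the lookup would use after the forced COW break rule:
 * a page reference (GET or PIN) on a COW mapping also turns on WRITE. */
static unsigned int model_apply_cow_break(unsigned int vm_flags,
                                          unsigned int gup_flags)
{
    if (model_is_cow_mapping(vm_flags) &&
        (gup_flags & (MODEL_FOLL_GET | MODEL_FOLL_PIN)))
        gup_flags |= MODEL_FOLL_WRITE;
    return gup_flags;
}
```

In this model a shared mapping, or a lookup that takes no page reference, keeps its flags unchanged, which matches the "some users simply don't even care" case described earlier in the message.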
      
      If it turns out that we want finer granularity (ie "only break COW when
      it might actually matter" - things like the zero page are special and
      don't need to be broken) we might need to push these semantics deeper
      into the lookup fault path.  So if people care enough, it's possible
      that we might end up adding a new internal FOLL_BREAK_COW flag to go
      with the internal FOLL_COW flag we already have for tracking "I had a
      COW".
      
      Alternatively, if it turns out that different callers might want to
      explicitly control the forced COW break behavior, we might even want to
      make such a flag visible to the users of get_user_pages() instead of
      using the above default semantics.
      
      But for now, this is mostly commentary on the issue (this commit message
      being a lot bigger than the patch, and that patch in turn is almost all
      comments), with that minimal "enable COW breaking early" logic using the
      existing FOLL_WRITE behavior.
      
      [ It might be worth noting that we've always had this ambiguity, and it
        could arguably be seen as a user-space issue.
      
        You only get private COW mappings that could break either way in
        situations where user space is doing cooperative things (ie fork()
        before an execve() etc), but it _is_ surprising and very subtle, and
        fork() is supposed to give you independent address spaces.
      
        So let's treat this as a kernel issue and make the semantics of
        get_user_pages() easier to understand. Note that obviously a true
        shared mapping will still get a page that can change under us, so this
        does _not_ mean that get_user_pages() somehow returns any "stable"
        page ]
      Reported-by: Jann Horn <jannh@google.com>
      Tested-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill Shutemov <kirill@shutemov.name>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 15 May, 2020 1 commit
  6. 22 Apr, 2020 1 commit
  7. 21 Apr, 2020 1 commit
  8. 09 Apr, 2020 1 commit
  9. 08 Apr, 2020 5 commits
  10. 03 Apr, 2020 14 commits
    • mm/gup: allow to react to fatal signals · 71335f37
      Peter Xu committed
      The existing gup code does not react to fatal signals in many code
      paths.  For example, in one retry path of gup we're still using
      down_read() rather than down_read_killable().  Also, when doing page
      faults we don't pass in FAULT_FLAG_KILLABLE, which means that
      within the faulting process we'll wait in a non-killable way.  These
      were spotted by Linus during the code review of some other patches.
      
      Let's allow the gup code to react to fatal signals to improve the
      responsiveness of threads that are killed while in gup.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160256.9887-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: allow VM_FAULT_RETRY for multiple times · 4426e945
      Peter Xu committed
      This is the gup counterpart of the change that allows VM_FAULT_RETRY
      to happen more than once.  One thing to mention is that we must check
      for a fatal signal before retrying, because GUP can be interrupted by
      one; otherwise we can loop forever.
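The retry rule above can be sketched as a userspace toy. Everything here is hypothetical scaffolding: the `model_` types and helpers, the way a fatal signal is simulated, and the -EINTR return value are assumptions chosen for illustration, not the code in mm/gup.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of one fault attempt. */
enum model_fault { MODEL_FAULT_DONE, MODEL_FAULT_RETRY };

/* Toy task state: how many attempts will still ask for a retry, and after
 * how many attempts a fatal signal becomes pending (negative: never). */
struct model_task {
    int retries_left;
    int attempts;
    int fatal_after;
};

static enum model_fault model_handle_fault(struct model_task *t)
{
    t->attempts++;
    if (t->retries_left > 0) {
        t->retries_left--;
        return MODEL_FAULT_RETRY;
    }
    return MODEL_FAULT_DONE;
}

static bool model_fatal_signal_pending(const struct model_task *t)
{
    return t->fatal_after >= 0 && t->attempts >= t->fatal_after;
}

/* Retry as often as needed, but bail out once a fatal signal is pending. */
static int model_fault_in(struct model_task *t)
{
    while (model_handle_fault(t) == MODEL_FAULT_RETRY) {
        if (model_fatal_signal_pending(t))
            return -EINTR;   /* assumed error code for the interrupted case */
    }
    return 0;
}
```

Without the signal check inside the loop, a task whose faults keep asking for a retry would never leave it, which is the "loop forever" case the message warns about.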
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: fix __get_user_pages() on fault retry of hugetlb · ad415db8
      Peter Xu committed
      When follow_hugetlb_page() returns with *locked==0, it means we've got a
      VM_FAULT_RETRY within the faulting process and we've released the mmap_sem.
      When that happens, we should stop and bail out.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-3-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: rename "nonblocking" to "locked" where proper · 4f6da934
      Peter Xu committed
      Patch series "mm: Page fault enhancements", v6.
      
      This series contains cleanups and enhancements to current page fault
      logic.  The whole idea comes from the discussion between Andrea and Linus
      on the bug reported by syzbot here:
      
        https://lkml.org/lkml/2017/11/2/833
      
      Basically it does two things:
      
        (a) Allows the page fault logic to be more responsive not only to
            SIGKILL, but also to the rest of the userspace signals, and,
      
        (b) Allows the page fault retry (VM_FAULT_RETRY) to happen more
            than once.
      
      For (a): with the changes we should be able to react faster when page
      faults are working in parallel with userspace signals like SIGSTOP and
      SIGCONT (and more), and with that we can remove the buggy part in
      userfaultfd and let the whole page fault mechanism benefit from signals
      reaching userspace faster.
      
      For (b), we should be able to allow the page fault handler to loop
      more than twice.  Some context: currently, with
      FAULT_FLAG_ALLOW_RETRY, we can retry the page fault once within the
      same interrupt context, but never handle it more than twice.  Removing
      this assumption is not only a potential cleanup, since AFAIU the code
      itself doesn't really have this twice-only limitation (though it was
      presumably a protective measure in the past); it will also greatly
      simplify future work like userfaultfd write-protect, where it's possible
      to retry more than twice (please have a look at [1] below for a
      possible user that might require the page fault to be handled for a third
      time; if we can remove the retry limitation we can simply drop that patch
      and its complexity).
      
      This patch (of 16):
      
      There are plenty of places around __get_user_pages() that have a parameter
      "nonblocking" which does not really mean that "it won't block" (because it
      can really block); instead it shows whether the mmap_sem was released by
      up_read() during page fault handling, mostly when VM_FAULT_RETRY is
      returned.
      
      We have the correct naming in e.g.  get_user_pages_locked() or
      get_user_pages_remote() as "locked", however there are still many places
      that use "nonblocking" as the name.
      
      Rename those places to "locked" where proper, to better suit the
      functionality of the variable.  While at it, fix up some of the
      comments accordingly.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: fix omission of check on FOLL_LONGTERM in gup fast path · df3a0a21
      Pingfan Liu committed
      FOLL_LONGTERM is a special case of FOLL_PIN.  It suggests a pin which is
      going to be given to hardware and can't move.  It would truncate CMA
      permanently and should be excluded.
      
      In the gup slow path,
      __gup_longterm_locked->check_and_migrate_cma_pages() handles
      FOLL_LONGTERM, but the fast path lacks such a check, which means a
      CMA page can leak into a longterm pin.
      
      Place a check in try_grab_compound_head() in the fast path to fix the
      leak; if FOLL_LONGTERM happens on CMA, it will fall back to the slow path
      to migrate the page.
      
      A note about the check: a huge page's subpages have the same migrate type
      due to either allocation from a free_list[] or alloc_contig_range() with
      param MIGRATE_MOVABLE.  So it is enough to check a single subpage with
      is_migrate_cma_page(subpage).
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Link: http://lkml.kernel.org/r/1584876733-17405-3-git-send-email-kernelfans@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: rename nr as nr_pinned in get_user_pages_fast() · 4628b063
      Pingfan Liu committed
      To better reflect the held state of pages and make code self-explaining,
      rename nr as nr_pinned.
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Link: http://lkml.kernel.org/r/1584876733-17405-2-git-send-email-kernelfans@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup/writeback: add callbacks for inaccessible pages · f28d4363
      Claudio Imbrenda committed
      With the introduction of protected KVM guests on s390 there is now a
      concept of inaccessible pages.  These pages need to be made accessible
      before the host can access them.
      
      While cpu accesses will trigger a fault that can be resolved, I/O accesses
      will just fail.  We need to add a callback into architecture code for
      places that will do I/O, namely when writeback is started or when a page
      reference is taken.
      
      This is not only to enable paging, file backing etc, it is also necessary
      to protect the host against a malicious user space.  For example a bad
      QEMU could simply start direct I/O on such protected memory.  We do not
      want userspace to be able to trigger I/O errors and thus the logic is
      "whenever somebody accesses that page (gup) or does I/O, make sure that
      this page can be accessed".  When the guest tries to access that page we
      will wait in the page fault handler for writeback to have finished and for
      the page_ref to be the expected value.
      
      On s390x the function is not supposed to fail, so it is ok to use a
      WARN_ON on failure.  If we ever need some more fine-grained handling we can
      tackle this when we know the details.
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting · 1970dc6f
      John Hubbard committed
      Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
      unpin_user_pages*(), we need some visibility into whether all of this is
      working correctly.
      
      Add two new fields to /proc/vmstat:
      
          nr_foll_pin_acquired
          nr_foll_pin_released
      
      These are documented in Documentation/core-api/pin_user_pages.rst.  They
      represent the number of pages (since boot time) that have been pinned
      ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
      pin_user_pages*() and unpin_user_pages*().
      
      In the absence of long-running DMA or RDMA operations that hold pages
      pinned, the above two fields will normally be equal to each other.
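One way to use the two counters is to subtract them and see how many pins are still outstanding. The sketch below assumes the usual one "name value" pair per line layout of /proc/vmstat and parses an in-memory copy of the text; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Finds "name value" in vmstat-style text.
 * Returns 0 on success, -1 if the field is absent or malformed. */
static int vmstat_lookup(const char *text, const char *name,
                         unsigned long long *out)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return sscanf(line + len, "%llu", out) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline to the next entry */
    }
    return -1;
}

/* Outstanding pins = acquired - released; -1 if either field is missing. */
static long long foll_pins_outstanding(const char *text)
{
    unsigned long long acquired, released;

    if (vmstat_lookup(text, "nr_foll_pin_acquired", &acquired) ||
        vmstat_lookup(text, "nr_foll_pin_released", &released))
        return -1;
    return (long long)(acquired - released);
}
```

A caller would read /proc/vmstat into a NUL-terminated buffer and pass it to foll_pins_outstanding(). A result that stays above zero with no long-running DMA or RDMA in flight would suggest that some pins are not being released.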
      
      Also: update Documentation/core-api/pin_user_pages.rst, to remove an
      earlier (now confirmed untrue) claim about a performance problem with
      /proc/vmstat.
      
      Also: update Documentation/core-api/pin_user_pages.rst to rename the new
      /proc/vmstat entries, to the names listed here.
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages · 47e29d32
      John Hubbard committed
      For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
      scheme tends to overflow too easily: each tail page increments the head
      page->_refcount by GUP_PIN_COUNTING_BIAS (1024).  That limits the number
      of huge pages that can be pinned.
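A rough calculation shows how quickly this limit is reached. The sketch assumes a signed 32-bit reference counter, 4 KiB base pages, and huge pages of 2 MiB (512 subpages) and 1 GiB (262144 subpages); it ignores ordinary references, and the function name is hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_PIN_BIAS 1024LL   /* GUP_PIN_COUNTING_BIAS, as quoted above */

/* How many times can every subpage of one compound page be pinned before
 * the head refcount, modelled as a signed 32-bit int, would pass INT_MAX? */
static long long model_max_full_pins(long long subpages)
{
    return (long long)INT_MAX / (subpages * MODEL_PIN_BIAS);
}
```

Under these assumptions, pinning every subpage of a 1 GiB page adds 2^28 to the head counter each time, so only a handful of full pins fit before the counter would overflow; a 2 MiB page leaves room for a few thousand.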
      
      This patch removes that limitation, by using an exact form of pin counting
      for compound pages of order > 1.  The "order > 1" is required because this
      approach uses the 3rd struct page in the compound page, and order 1
      compound pages only have two pages, so that won't work there.
      
      A new struct page field, hpage_pinned_refcount, has been added, replacing
      a padding field in the union (so no new space is used).
      
      This enhancement also has a useful side effect: huge pages and compound
      pages (of order > 1) do not suffer from the "potential false positives"
      problem that is discussed in the page_dma_pinned() comment block.  That is
      because these compound pages have extra space for tracking things, so they
      get exact pin counts instead of overloading page->_refcount.
      
      Documentation/core-api/pin_user_pages.rst is updated accordingly.
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: track FOLL_PIN pages · 3faa52c0
      John Hubbard committed
      Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
      implemented via overloading of page->_refcount: pins are added by adding
      GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
      indication of pinning, and it can have false positives (and that's OK).
      Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
      details.
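The fuzzy nature of this scheme is easy to see in a toy model. The struct and `model_` helpers below are hypothetical userspace stand-ins; the threshold test is an assumption about how a check such as page_maybe_dma_pinned() could be modelled, not a copy of the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PIN_BIAS 1024   /* GUP_PIN_COUNTING_BIAS from the text above */

/* Toy page: one counter shared by ordinary references and pins. */
struct model_page {
    int refcount;
};

static void model_get_page(struct model_page *p)   { p->refcount += 1; }
static void model_put_page(struct model_page *p)   { p->refcount -= 1; }
static void model_pin_page(struct model_page *p)   { p->refcount += MODEL_PIN_BIAS; }
static void model_unpin_page(struct model_page *p) { p->refcount -= MODEL_PIN_BIAS; }

/* Fuzzy test: true when the counter is large enough to contain one pin. */
static bool model_maybe_dma_pinned(const struct model_page *p)
{
    return p->refcount >= MODEL_PIN_BIAS;
}
```

In this model a single pin is always detected, but 1024 ordinary references look exactly like one pin. That is the false positive the message calls acceptable.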
      
      As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
      (typically via pin_user_pages*()) are required to ultimately free such
      pages via unpin_user_page().
      
      Please also note the limitation, discussed in pin_user_pages.rst under the
      "TODO: for 1GB and larger huge pages" section.  (That limitation will be
      removed in a following patch.)
      
      The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
      thought of as "FOLL_GET for DIO and/or RDMA use".
      
      Pages that have been pinned via FOLL_PIN are identifiable via a new
      function call:
      
         bool page_maybe_dma_pinned(struct page *page);
      
      What to do in response to encountering such a page, is left to later
      patchsets. There is discussion about this in [1], [2], [3], and [4].
      
      This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      [4] LWN kernel index: get_user_pages():
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      [jhubbard@nvidia.com: add kerneldoc]
        Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
      [imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
        Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
      [akpm@linux-foundation.org: fix put_compound_head defined but not used]
      Suggested-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: require FOLL_GET for get_user_pages_fast() · 94202f12
      John Hubbard committed
      Internal to mm/gup.c, require that get_user_pages_fast() and
      __get_user_pages_fast() identify themselves, by setting FOLL_GET.  This is
      required in order to be able to make decisions based on "FOLL_PIN, or
      FOLL_GET, or both or neither are set", in upcoming patches.
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-6-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: pass gup flags to two more routines · 3b78d834
      John Hubbard committed
      In preparation for an upcoming patch, send gup flags args to two more
      routines: put_compound_head(), and undo_dev_pagemap().
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-5-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: pass a flags arg to __gup_device_* functions · 86dfbed4
      John Hubbard committed
      A subsequent patch requires access to gup flags, so pass the flags
      argument through to the __gup_device_* functions.
      
      Also placate checkpatch.pl by shortening a nearby line.
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-3-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm/gup: split get_user_pages_remote() into two routines · 22bf29b6
      Committed by John Hubbard
      Patch series "mm/gup: track FOLL_PIN pages", v6.
      
      This activates tracking of FOLL_PIN pages.  This is in support of fixing
      the get_user_pages()+DMA problem described in [1]-[4].
      
      FOLL_PIN support is now in the main Linux tree.  However, the patch to use
      FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test
      suite failure that involved (I think) page refcount overflows when huge
      pages were used.
      
      This patch definitively solves that kind of overflow problem, by adding an
      exact pincount, for compound pages (of order > 1), in the 3rd struct page
      of a compound page.  If available, that form of pincounting is used,
      instead of the GUP_PIN_COUNTING_BIAS approach.  Thanks again to Jan Kara
      for that idea.
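      
      The two counting schemes described above can be modelled in userspace C.
      This is a minimal sketch, not kernel code: struct model_page and the
      model_* helpers are hypothetical stand-ins, and only the value of
      GUP_PIN_COUNTING_BIAS (1U << 10) is taken from the kernel.  It assumes
      that a pin on a page with an exact pincount also takes one ordinary
      reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the kernel's GUP_PIN_COUNTING_BIAS. */
#define GUP_PIN_COUNTING_BIAS (1U << 10)

/* Hypothetical stand-in for struct page; this is NOT the kernel layout. */
struct model_page {
	unsigned int refcount;  /* ordinary references plus pin contributions */
	unsigned int pincount;  /* exact pin count, used only if has_pincount */
	bool has_pincount;      /* models a compound page of order > 1 */
};

/* Take one pin: exact counting if available, else add the bias. */
void model_pin(struct model_page *p)
{
	if (p->has_pincount) {
		p->refcount += 1;
		p->pincount += 1;
	} else {
		p->refcount += GUP_PIN_COUNTING_BIAS;
	}
}

/* Drop one pin, undoing exactly what model_pin() added. */
void model_unpin(struct model_page *p)
{
	if (p->has_pincount) {
		assert(p->pincount > 0);
		p->pincount -= 1;
		p->refcount -= 1;
	} else {
		assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
		p->refcount -= GUP_PIN_COUNTING_BIAS;
	}
}

/* Exact answer with a pincount; a "maybe" (fuzzy) answer with the bias. */
bool model_maybe_pinned(const struct model_page *p)
{
	if (p->has_pincount)
		return p->pincount > 0;
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

      In this model, a small page that merely holds 1024 ordinary references
      is reported as possibly pinned, which is why the bias scheme is the
      fuzzy one.  A page with an exact pincount has no such false positive.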
      
      Other interesting changes:
      
      * dump_page(): added one or two new things to report for compound
        pages: head refcount (for all compound pages), and map_pincount (for
        compound pages of order > 1).
      
      * Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
        huge page refcount upper limit problems, and added notes about how it
        works now.  Also added a note about the dump_page() enhancements.
      
      * Added some comments in gup.c and mm.h, to explain that there are two
        ways to count pinned pages: exact (for compound pages of order > 1) and
        fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
      
      ============================================================
      General notes about the tracking patch:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages, and [R]DMA activities, as discussed in [1],
      [2], [3], [4] and in a remarkable number of email threads since about
      2017.  :)
      
      In contrast to earlier approaches, the page tracking can be incrementally
      applied to the kernel call sites that, until now, have been simply calling
      get_user_pages() ("gup").  In other words, opt-in by changing from this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:
      
          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
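      
      The acquire and release calls in each pair above have to match, because
      they adjust the page refcount by different amounts.  A minimal userspace
      sketch of that, assuming the bias-based scheme for a small page: the
      model_* functions and struct model_page are hypothetical stand-ins
      named after the kernel calls, not the kernel implementations.

```c
#include <assert.h>

/* Same value as the kernel's GUP_PIN_COUNTING_BIAS. */
#define GUP_PIN_COUNTING_BIAS (1U << 10)

/* Hypothetical stand-in for a small (non-compound) page. */
struct model_page {
	unsigned int refcount;
};

/* FOLL_GET-style pair: one reference in, one reference out. */
void model_get_user_page(struct model_page *p)
{
	p->refcount += 1;
}

void model_put_page(struct model_page *p)
{
	assert(p->refcount >= 1);
	p->refcount -= 1;
}

/* FOLL_PIN-style pair: a whole bias in, a whole bias out. */
void model_pin_user_page(struct model_page *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;
}

void model_unpin_user_page(struct model_page *p)
{
	assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* References left behind when a pin is released with the matching call. */
unsigned int model_leak_matched(void)
{
	struct model_page page = { .refcount = 1 };

	model_pin_user_page(&page);
	model_unpin_user_page(&page);
	return page.refcount - 1;
}

/* References left behind when a pin is wrongly released with put_page. */
unsigned int model_leak_mismatched(void)
{
	struct model_page page = { .refcount = 1 };

	model_pin_user_page(&page);
	model_put_page(&page);
	return page.refcount - 1;
}
```

      In this model the mismatched release leaves GUP_PIN_COUNTING_BIAS - 1
      references behind, so the page would still look pinned.  That is why a
      call site has to switch both calls when it opts in.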
      
      ============================================================
      Future steps:
      
      * Convert more subsystems from get_user_pages() to pin_user_pages().
        The first probably needs to be bio/biovecs, because any filesystem
        testing is too difficult without those in place.
      
      * Change VFS and filesystems to respond appropriately when encountering
        dma-pinned pages.
      
      * Work with Ira and others to connect this all up with file system
        leases.
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      
      [4] LWN kernel index: get_user_pages()
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      This patch (of 12):
      
      An upcoming patch requires reusing the implementation of
      get_user_pages_remote().  Split up get_user_pages_remote() into an outer
      routine that checks flags, and an implementation routine that will be
      reused.  This makes subsequent changes much easier to understand.
      
      There should be no change in behavior due to this patch.
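      
      The shape of the split can be sketched in userspace C as follows.  This
      is an illustration only: the MODEL_FOLL_* values, the function names and
      the specific flag checks are assumptions made for the example, and the
      second wrapper merely shows the kind of reuse the split allows.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for this model; not the kernel's definitions. */
#define MODEL_FOLL_GET (1U << 0)
#define MODEL_FOLL_PIN (1U << 1)

/*
 * Inner implementation routine: does the work and relies on the outer
 * routines to have validated the flags already.
 */
static long model_gup_remote_impl(unsigned long nr_pages, unsigned int flags)
{
	/* The wrappers must never let both modes through together. */
	assert((flags & (MODEL_FOLL_GET | MODEL_FOLL_PIN)) !=
	       (MODEL_FOLL_GET | MODEL_FOLL_PIN));
	return (long)nr_pages;  /* pretend every requested page was resolved */
}

/* Outer routine for get-style callers: check flags, then delegate. */
long model_get_user_pages_remote(unsigned long nr_pages, unsigned int flags)
{
	if (flags & MODEL_FOLL_PIN)
		return -EINVAL;
	return model_gup_remote_impl(nr_pages, flags | MODEL_FOLL_GET);
}

/* A second outer routine reusing the same implementation. */
long model_pin_user_pages_remote(unsigned long nr_pages, unsigned int flags)
{
	if (flags & MODEL_FOLL_GET)
		return -EINVAL;
	return model_gup_remote_impl(nr_pages, flags | MODEL_FOLL_PIN);
}
```

      Keeping the flag checks in thin outer routines means each entry point
      can enforce its own rules while the shared body stays unchanged.
      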
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-2-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>