1. 03 4月, 2020 4 次提交
    • P
      mm/gup: allow VM_FAULT_RETRY for multiple times · 4426e945
      Peter Xu 提交于
      This is the gup counterpart of the change that allows the VM_FAULT_RETRY
      to happen for more than once.  One thing to mention is that we must check
      the fatal signal here before retry because the GUP can be interrupted by
      that, otherwise we can loop forever.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220195357.16371-1-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4426e945
    • P
      mm/gup: rename "nonblocking" to "locked" where proper · 4f6da934
      Peter Xu 提交于
      Patch series "mm: Page fault enhancements", v6.
      
      This series contains cleanups and enhancements to current page fault
      logic.  The whole idea comes from the discussion between Andrea and Linus
      on the bug reported by syzbot here:
      
        https://lkml.org/lkml/2017/11/2/833
      
      Basically it does two things:
      
        (a) Allows the page fault logic to be more interactive on not only
            SIGKILL, but also the rest of userspace signals, and,
      
        (b) Allows the page fault retry (VM_FAULT_RETRY) to happen for more
            than once.
      
      For (a): with the changes we should be able to react faster when page
      faults are working in parallel with userspace signals like SIGSTOP and
      SIGCONT (and more), and with that we can remove the buggy part in
      userfaultfd and benefit the whole page fault mechanism on faster signal
      processing to reach the userspace.
      
      For (b), we should be able to allow the page fault handler to loop for
      even more than twice.  Some context: for now since we have
      FAULT_FLAG_ALLOW_RETRY we can allow to retry the page fault once with the
      same interrupt context, however never more than twice.  This can be not
      only a potential cleanup to remove this assumption since AFAIU the code
      itself doesn't really have this twice-only limitation (though that should
      be a protective approach in the past), at the same time it'll greatly
      simplify future works like userfaultfd write-protect where it's possible
      to retry for more than twice (please have a look at [1] below for a
      possible user that might require the page fault to be handled for a third
      time; if we can remove the retry limitation we can simply drop that patch
      and those complexity).
      
      This patch (of 16):
      
      There's plenty of places around __get_user_pages() that has a parameter
      "nonblocking" which does not really mean that "it won't block" (because it
      can really block) but instead it shows whether the mmap_sem is released by
      up_read() during the page fault handling mostly when VM_FAULT_RETRY is
      returned.
      
      We have the correct naming in e.g.  get_user_pages_locked() or
      get_user_pages_remote() as "locked", however there're still many places
      that are using the "nonblocking" as name.
      
      Renaming the places to "locked" where proper to better suite the
      functionality of the variable.  While at it, fixing up some of the
      comments accordingly.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NBrian Geffon <bgeffon@google.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220155353.8676-2-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f6da934
    • J
      mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages · 47e29d32
      John Hubbard 提交于
      For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
      scheme tends to overflow too easily, each tail page increments the head
      page->_refcount by GUP_PIN_COUNTING_BIAS (1024).  That limits the number
      of huge pages that can be pinned.
      
      This patch removes that limitation, by using an exact form of pin counting
      for compound pages of order > 1.  The "order > 1" is required because this
      approach uses the 3rd struct page in the compound page, and order 1
      compound pages only have two pages, so that won't work there.
      
      A new struct page field, hpage_pinned_refcount, has been added, replacing
      a padding field in the union (so no new space is used).
      
      This enhancement also has a useful side effect: huge pages and compound
      pages (of order > 1) do not suffer from the "potential false positives"
      problem that is discussed in the page_dma_pinned() comment block.  That is
      because these compound pages have extra space for tracking things, so they
      get exact pin counts instead of overloading page->_refcount.
      
      Documentation/core-api/pin_user_pages.rst is updated accordingly.
      Suggested-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47e29d32
    • J
      mm/gup: track FOLL_PIN pages · 3faa52c0
      John Hubbard 提交于
      Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
      implemented via overloading of page->_refcount: pins are added by adding
      GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
      indication of pinning, and it can have false positives (and that's OK).
      Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
      details.
      
      As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
      (typically via pin_user_pages*()) are required to ultimately free such
      pages via unpin_user_page().
      
      Please also note the limitation, discussed in pin_user_pages.rst under the
      "TODO: for 1GB and larger huge pages" section.  (That limitation will be
      removed in a following patch.)
      
      The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
      thought of as "FOLL_GET for DIO and/or RDMA use".
      
      Pages that have been pinned via FOLL_PIN are identifiable via a new
      function call:
      
         bool page_maybe_dma_pinned(struct page *page);
      
      What to do in response to encountering such a page, is left to later
      patchsets. There is discussion about this in [1], [2], [3], and [4].
      
      This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      [4] LWN kernel index: get_user_pages():
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      [jhubbard@nvidia.com: add kerneldoc]
        Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
      [imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
        Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
      [akpm@linux-foundation.org: fix put_compound_head defined but not used]
      Suggested-by: NJan Kara <jack@suse.cz>
      Suggested-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NClaudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3faa52c0
  2. 05 1月, 2020 1 次提交
    • W
      mm/hugetlb: defer freeing of huge pages if in non-task context · c77c0a8a
      Waiman Long 提交于
      The following lockdep splat was observed when a certain hugetlbfs test
      was run:
      
        ================================
        WARNING: inconsistent lock state
        4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
        --------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
        {SOFTIRQ-ON-W} state was registered at:
          lock_acquire+0x14f/0x3b0
          _raw_spin_lock+0x30/0x70
          __nr_hugepages_store_common+0x11b/0xb30
          hugetlb_sysctl_handler_common+0x209/0x2d0
          proc_sys_call_handler+0x37f/0x450
          vfs_write+0x157/0x460
          ksys_write+0xb8/0x170
          do_syscall_64+0xa5/0x4d0
          entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        irq event stamp: 691296
        hardirqs last  enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
        hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
        softirqs last  enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
        softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(hugetlb_lock);
          <Interrupt>
            lock(hugetlb_lock);
      
         *** DEADLOCK ***
            :
        Call Trace:
         <IRQ>
         __lock_acquire+0x146b/0x48c0
         lock_acquire+0x14f/0x3b0
         _raw_spin_lock+0x30/0x70
         free_huge_page+0x36f/0xaa0
         bio_check_pages_dirty+0x2fc/0x5c0
         clone_endio+0x17f/0x670 [dm_mod]
         blk_update_request+0x276/0xe50
         scsi_end_request+0x7b/0x6a0
         scsi_io_completion+0x1c6/0x1570
         blk_done_softirq+0x22e/0x350
         __do_softirq+0x23d/0xad8
         irq_exit+0x23e/0x2b0
         do_IRQ+0x11a/0x200
         common_interrupt+0xf/0xf
         </IRQ>
      
      Both the hugetbl_lock and the subpool lock can be acquired in
      free_huge_page().  One way to solve the problem is to make both locks
      irq-safe.  However, Mike Kravetz had learned that the hugetlb_lock is
      held for a linear scan of ALL hugetlb pages during a cgroup reparentling
      operation.  So it is just too long to have irq disabled unless we can
      break hugetbl_lock down into finer-grained locks with shorter lock hold
      times.
      
      Another alternative is to defer the freeing to a workqueue job.  This
      patch implements the deferred freeing by adding a free_hpage_workfn()
      work function to do the actual freeing.  The free_huge_page() call in a
      non-task context saves the page to be freed in the hpage_freelist linked
      list in a lockless manner using the llist APIs.
      
      The generic workqueue is used to process the work, but a dedicated
      workqueue can be used instead if it is desirable to have the huge page
      freed ASAP.
      
      Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of
      llist APIs which simplfy the code.
      
      Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NDavidlohr Bueso <dbueso@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c77c0a8a
  3. 02 12月, 2019 7 次提交
  4. 19 10月, 2019 1 次提交
  5. 25 9月, 2019 1 次提交
  6. 14 8月, 2019 1 次提交
    • M
      hugetlbfs: fix hugetlb page migration/fault race causing SIGBUS · 4643d67e
      Mike Kravetz 提交于
      Li Wang discovered that LTP/move_page12 V2 sometimes triggers SIGBUS in
      the kernel-v5.2.3 testing.  This is caused by a race between hugetlb
      page migration and page fault.
      
      If a hugetlb page can not be allocated to satisfy a page fault, the task
      is sent SIGBUS.  This is normal hugetlbfs behavior.  A hugetlb fault
      mutex exists to prevent two tasks from trying to instantiate the same
      page.  This protects against the situation where there is only one
      hugetlb page, and both tasks would try to allocate.  Without the mutex,
      one would fail and SIGBUS even though the other fault would be
      successful.
      
      There is a similar race between hugetlb page migration and fault.
      Migration code will allocate a page for the target of the migration.  It
      will then unmap the original page from all page tables.  It does this
      unmap by first clearing the pte and then writing a migration entry.  The
      page table lock is held for the duration of this clear and write
      operation.  However, the beginnings of the hugetlb page fault code
      optimistically checks the pte without taking the page table lock.  If
      clear (as it can be during the migration unmap operation), a hugetlb
      page allocation is attempted to satisfy the fault.  Note that the page
      which will eventually satisfy this fault was already allocated by the
      migration code.  However, the allocation within the fault path could
      fail which would result in the task incorrectly being sent SIGBUS.
      
      Ideally, we could take the hugetlb fault mutex in the migration code
      when modifying the page tables.  However, locks must be taken in the
      order of hugetlb fault mutex, page lock, page table lock.  This would
      require significant rework of the migration code.  Instead, the issue is
      addressed in the hugetlb fault code.  After failing to allocate a huge
      page, take the page table lock and check for huge_pte_none before
      returning an error.  This is the same check that must be made further in
      the code even if page allocation is successful.
      
      Link: http://lkml.kernel.org/r/20190808000533.7701-1-mike.kravetz@oracle.com
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NLi Wang <liwang@redhat.com>
      Tested-by: NLi Wang <liwang@redhat.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Cyril Hrubis <chrubis@suse.cz>
      Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4643d67e
  7. 29 6月, 2019 1 次提交
  8. 21 5月, 2019 1 次提交
  9. 15 5月, 2019 9 次提交
    • M
      hugetlbfs: always use address space in inode for resv_map pointer · f27a5136
      Mike Kravetz 提交于
      Continuing discussion about 58b6e5e8 ("hugetlbfs: fix memory leak for
      resv_map") brought up the issue that inode->i_mapping may not point to the
      address space embedded within the inode at inode eviction time.  The
      hugetlbfs truncate routine handles this by explicitly using inode->i_data.
      However, code cleaning up the resv_map will still use the address space
      pointed to by inode->i_mapping.  Luckily, private_data is NULL for address
      spaces in all such cases today but, there is no guarantee this will
      continue.
      
      Change all hugetlbfs code getting a resv_map pointer to explicitly get it
      from the address space embedded within the inode.  In addition, add more
      comments in the code to indicate why this is being done.
      
      Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NYufen Yu <yuyufen@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f27a5136
    • J
      mm/mmu_notifier: use correct mmu_notifier events for each invalidation · 7269f999
      Jérôme Glisse 提交于
      This updates each existing invalidation to use the correct mmu notifier
      event that represent what is happening to the CPU page table.  See the
      patch which introduced the events to see the rational behind this.
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7269f999
    • J
      mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Jérôme Glisse 提交于
      CPU page table update can happens for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).
      
      Users of mmu notifier API track changes to the CPU page table and take
      specific action for them.  While current API only provide range of virtual
      address affected by the change, not why the changes is happening.
      
      This patchset do the initial mechanical convertion of all the places that
      calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
      event as well as the vma if it is know (most invalidation happens against
      a given vma).  Passing down the vma allows the users of mmu notifier to
      inspect the new vma page protection.
      
      The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
      should assume that every for the range is going away when that event
      happens.  A latter patch do convert mm call path to use a more appropriate
      events for each call.
      
      This is done as 2 patches so that no call site is forgotten especialy
      as it uses this following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f4f13e8
    • M
      hugetlb: use same fault hash key for shared and private mappings · 1b426bac
      Mike Kravetz 提交于
      hugetlb uses a fault mutex hash table to prevent page faults of the
      same pages concurrently.  The key for shared and private mappings is
      different.  Shared keys off address_space and file index.  Private keys
      off mm and virtual address.  Consider a private mappings of a populated
      hugetlbfs file.  A fault will map the page from the file and if needed
      do a COW to map a writable page.
      
      Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
      pages.  It uses the address_space file index key.  However, private
      mappings will use a different key and could race with this code to map
      the file page.  This causes problems (BUG) for the page cache remove
      code as it expects the page to be unmapped.  A sample stack is:
      
      page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
      kernel BUG at mm/filemap.c:169!
      ...
      RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
      ...
      Call Trace:
      __delete_from_page_cache+0x39/0x220
      delete_from_page_cache+0x45/0x70
      remove_inode_hugepages+0x13c/0x380
      ? __add_to_page_cache_locked+0x162/0x380
      hugetlbfs_fallocate+0x403/0x540
      ? _cond_resched+0x15/0x30
      ? __inode_security_revalidate+0x5d/0x70
      ? selinux_file_permission+0x100/0x130
      vfs_fallocate+0x13f/0x270
      ksys_fallocate+0x3c/0x80
      __x64_sys_fallocate+0x1a/0x20
      do_syscall_64+0x5b/0x180
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      There seems to be another potential COW issue/race with this approach
      of different private and shared keys as noted in commit 8382d914
      ("mm, hugetlb: improve page-fault scalability").
      
      Since every hugetlb mapping (even anon and private) is actually a file
      mapping, just use the address_space index key for all mappings.  This
      results in potentially more hash collisions.  However, this should not
      be the common case.
      
      Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NDavidlohr Bueso <dbueso@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b426bac
    • M
      hugetlbfs: on restore reserve error path retain subpool reservation · 0919e1b6
      Mike Kravetz 提交于
      When a huge page is allocated, PagePrivate() is set if the allocation
      consumed a reservation.  When freeing a huge page, PagePrivate is checked.
      If set, it indicates the reservation should be restored.  PagePrivate
      being set at free huge page time mostly happens on error paths.
      
      When huge page reservations are created, a check is made to determine if
      the mapping is associated with an explicitly mounted filesystem.  If so,
      pages are also reserved within the filesystem.  The default action when
      freeing a huge page is to decrement the usage count in any associated
      explicitly mounted filesystem.  However, if the reservation is to be
      restored the reservation/use count within the filesystem should not be
      decrementd.  Otherwise, a subsequent page allocation and free for the same
      mapping location will cause the file filesystem usage to go 'negative'.
      
      Filesystem                         Size  Used Avail Use% Mounted on
      nodev                              4.0G -4.0M  4.1G    - /opt/hugepool
      
      To fix, when freeing a huge page do not adjust filesystem usage if
      PagePrivate() is set to indicate the reservation should be restored.
      
      I did not cc stable as the problem has been around since reserves were
      added to hugetlbfs and nobody has noticed.
      
      Link: http://lkml.kernel.org/r/20190328234704.27083-2-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0919e1b6
    • O
      mm/hugetlb: get rid of NODEMASK_ALLOC · 2d0adf7e
      Oscar Salvador 提交于
      NODEMASK_ALLOC is used to allocate a nodemask bitmap, and it does it by
      first determining whether it should be allocated on the stack or
      dynamically, depending on NODES_SHIFT.  Right now, it goes the dynamic
      path whenever the nodemask_t is above 32 bytes.
      
      Although we could bump it to a reasonable value, the largest a nodemask_t
      can get is 128 bytes, so since __nr_hugepages_store_common is called from
      a rather short stack we can just get rid of the NODEMASK_ALLOC call here.
      
      This reduces some code churn and complexity.
      
      Link: http://lkml.kernel.org/r/20190402133415.21983-1-osalvador@suse.deSigned-off-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Alex Ghiti <alex@ghiti.fr>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d0adf7e
    • M
      hugetlbfs: fix potential over/underflow setting node specific nr_hugepages · fd875dca
      Mike Kravetz 提交于
      The number of node specific huge pages can be set via a file such as:
      /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
      When a node specific value is specified, the global number of huge pages
      must also be adjusted.  This adjustment is calculated as the specified
      node specific value + (global value - current node value).  If the node
      specific value provided by the user is large enough, this calculation
      could overflow an unsigned long leading to a smaller than expected number
      of huge pages.
      
      To fix, check the calculation for overflow.  If overflow is detected, use
      ULONG_MAX as the requested value.  This is inline with the user request to
      allocate as many huge pages as possible.
      
      It was also noticed that the above calculation was done outside the
      hugetlb_lock.  Therefore, the values could be inconsistent and result in
      underflow.  To fix, the calculation is moved within the routine
      set_max_huge_pages() where the lock is held.
      
      In addition, the code in __nr_hugepages_store_common() which tries to
      handle the case of not being able to allocate a node mask would likely
      result in incorrect behavior.  Luckily, it is very unlikely we will ever
      take this path.  If we do, simply return ENOMEM.
      
      Link: http://lkml.kernel.org/r/20190328220533.19884-1-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Alex Ghiti <alex@ghiti.fr>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd875dca
    • A
      hugetlb: allow to free gigantic pages regardless of the configuration · 4eb0716e
      Alexandre Ghiti 提交于
      On systems without CONTIG_ALLOC activated but that support gigantic pages,
      boottime reserved gigantic pages can not be freed at all.  This patch
      simply enables the possibility to hand back those pages to memory
      allocator.
      
      Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.frSigned-off-by: NAlexandre Ghiti <alex@ghiti.fr>
      Acked-by: David S. Miller <davem@davemloft.net> [sparc]
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eb0716e
    • K
      mm/hugetlb.c: don't put_page in lock of hugetlb_lock · 2bf753e6
      Kai Shen 提交于
      spinlock recursion happened when do LTP test:
      #!/bin/bash
      ./runltp -p -f hugetlb &
      ./runltp -p -f hugetlb &
      ./runltp -p -f hugetlb &
      ./runltp -p -f hugetlb &
      ./runltp -p -f hugetlb &
      
      The dtor returned by get_compound_page_dtor in __put_compound_page may be
      the function of free_huge_page which will lock the hugetlb_lock, so don't
      put_page in lock of hugetlb_lock.
      
       BUG: spinlock recursion on CPU#0, hugemmap05/1079
        lock: hugetlb_lock+0x0/0x18, .magic: dead4ead, .owner: hugemmap05/1079, .owner_cpu: 0
       Call trace:
        dump_backtrace+0x0/0x198
        show_stack+0x24/0x30
        dump_stack+0xa4/0xcc
        spin_dump+0x84/0xa8
        do_raw_spin_lock+0xd0/0x108
        _raw_spin_lock+0x20/0x30
        free_huge_page+0x9c/0x260
        __put_compound_page+0x44/0x50
        __put_page+0x2c/0x60
        alloc_surplus_huge_page.constprop.19+0xf0/0x140
        hugetlb_acct_memory+0x104/0x378
        hugetlb_reserve_pages+0xe0/0x250
        hugetlbfs_file_mmap+0xc0/0x140
        mmap_region+0x3e8/0x5b0
        do_mmap+0x280/0x460
        vm_mmap_pgoff+0xf4/0x128
        ksys_mmap_pgoff+0xb4/0x258
        __arm64_sys_mmap+0x34/0x48
        el0_svc_common+0x78/0x130
        el0_svc_handler+0x38/0x78
        el0_svc+0x8/0xc
      
      Link: http://lkml.kernel.org/r/b8ade452-2d6b-0372-32c2-703644032b47@huawei.com
      Fixes: 9980d744 ("mm, hugetlb: get rid of surplus page accounting tricks")
      Signed-off-by: NKai Shen <shenkai8@huawei.com>
      Signed-off-by: NFeilong Lin <linfeilong@huawei.com>
      Reported-by: NWang Wang <wangwang2@huawei.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bf753e6
  10. 15 4月, 2019 1 次提交
  11. 03 4月, 2019 1 次提交
  12. 06 3月, 2019 4 次提交
  13. 02 3月, 2019 1 次提交
    • M
      hugetlbfs: fix races and page leaks during migration · cb6acd01
      Mike Kravetz 提交于
      hugetlb pages should only be migrated if they are 'active'.  The
      routines set/clear_page_huge_active() modify the active state of hugetlb
      pages.
      
      When a new hugetlb page is allocated at fault time, set_page_huge_active
      is called before the page is locked.  Therefore, another thread could
      race and migrate the page while it is being added to page table by the
      fault code.  This race is somewhat hard to trigger, but can be seen by
      strategically adding udelay to simulate worst case scheduling behavior.
      Depending on 'how' the code races, various BUG()s could be triggered.
      
      To address this issue, simply delay the set_page_huge_active call until
      after the page is successfully added to the page table.
      
      Hugetlb pages can also be leaked at migration time if the pages are
      associated with a file in an explicitly mounted hugetlbfs filesystem.
      For example, consider a two node system with 4GB worth of huge pages
      available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
      then migrates the pages associated with the file from one node to
      another.  When the program exits, huge page counts are as follows:
      
        node0
        1024    free_hugepages
        1024    nr_hugepages
      
        node1
        0       free_hugepages
        1024    nr_hugepages
      
        Filesystem                         Size  Used Avail Use% Mounted on
        nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
      
      That is as expected.  2G of huge pages are taken from the free_hugepages
      counts, and 2G is the size of the file in the explicitly mounted
      filesystem.  If the file is then removed, the counts become:
      
        node0
        1024    free_hugepages
        1024    nr_hugepages
      
        node1
        1024    free_hugepages
        1024    nr_hugepages
      
        Filesystem                         Size  Used Avail Use% Mounted on
        nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool
      
      Note that the filesystem still shows 2G of pages used, while there
      actually are no huge pages in use.  The only way to 'fix' the filesystem
      accounting is to unmount the filesystem
      
      If a hugetlb page is associated with an explicitly mounted filesystem,
      this information in contained in the page_private field.  At migration
      time, this information is not preserved.  To fix, simply transfer
      page_private from old to new page at migration time if necessary.
      
      There is a related race with removing a huge page from a file and
      migration.  When a huge page is removed from the pagecache, the
      page_mapping() field is cleared, yet page_private remains set until the
      page is actually freed by free_huge_page().  A page could be migrated
      while in this state.  However, since page_mapping() is not set the
      hugetlbfs specific routine to transfer page_private is not called and we
      leak the page count in the filesystem.
      
      To fix that, check for this condition before migrating a huge page.  If
      the condition is detected, return EBUSY for the page.
      
      Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
      Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
      Fixes: bcc54222 ("mm: hugetlb: introduce page_huge_active")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      [mike.kravetz@oracle.com: v2]
        Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
      [mike.kravetz@oracle.com: update comment and changelog]
        Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.comSigned-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb6acd01
  14. 02 2月, 2019 1 次提交
  15. 09 1月, 2019 2 次提交
    • M
      hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization" · ddeaab32
      Mike Kravetz 提交于
      This reverts b43a9990
      
      The reverted commit caused issues with migration and poisoning of anon
      huge pages.  The LTP move_pages12 test will cause an "unable to handle
      kernel NULL pointer" BUG would occur with stack similar to:
      
        RIP: 0010:down_write+0x1b/0x40
        Call Trace:
          migrate_pages+0x81f/0xb90
          __ia32_compat_sys_migrate_pages+0x190/0x190
          do_move_pages_to_node.isra.53.part.54+0x2a/0x50
          kernel_move_pages+0x566/0x7b0
          __x64_sys_move_pages+0x24/0x30
          do_syscall_64+0x5b/0x180
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The purpose of the reverted patch was to fix some long existing races
      with huge pmd sharing.  It used i_mmap_rwsem for this purpose with the
      idea that this could also be used to address truncate/page fault races
      with another patch.  Further analysis has determined that i_mmap_rwsem
      can not be used to address all these hugetlbfs synchronization issues.
      Therefore, revert this patch while working an another approach to the
      underlying issues.
      
      Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ddeaab32
    • M
      hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race" · e7c58097
      Mike Kravetz 提交于
      This reverts c86aa7bb
      
      The reverted commit caused ABBA deadlocks when file migration raced with
      file eviction for specific hugetlbfs files.  This was discovered with a
      modified version of the LTP move_pages12 test.
      
      The purpose of the reverted patch was to close a long existing race
      between hugetlbfs file truncation and page faults.  After more analysis
      of the patch and impacted code, it was determined that i_mmap_rwsem can
      not be used for all required synchronization.  Therefore, revert this
      patch while working an another approach to the underlying issue.
      
      Link: http://lkml.kernel.org/r/20190103235452.29335-1-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7c58097
  16. 05 1月, 2019 1 次提交
  17. 29 12月, 2018 3 次提交
    • M
      hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race · c86aa7bb
      Mike Kravetz 提交于
      hugetlbfs page faults can race with truncate and hole punch operations.
      Current code in the page fault path attempts to handle this by 'backing
      out' operations if we encounter the race.  One obvious omission in the
      current code is removing a page newly added to the page cache.  This is
      pretty straight forward to address, but there is a more subtle and
      difficult issue of backing out hugetlb reservations.  To handle this
      correctly, the 'reservation state' before page allocation needs to be
      noted so that it can be properly backed out.  There are four distinct
      possibilities for reservation state: shared/reserved, shared/no-resv,
      private/reserved and private/no-resv.  Backing out a reservation may
      require memory allocation which could fail so that needs to be taken into
      account as well.
      
      Instead of writing the required complicated code for this rare occurrence,
      just eliminate the race.  i_mmap_rwsem is now held in read mode for the
      duration of page fault processing.  Hold i_mmap_rwsem longer in truncation
      and hold punch code to cover the call to remove_inode_hugepages.
      
      With this modification, code in remove_inode_hugepages checking for races
      becomes 'dead' as it can not longer happen.  Remove the dead code and
      expand comments to explain reasoning.  Similarly, checks for races with
      truncation in the page fault path can be simplified and removed.
      
      [mike.kravetz@oracle.com: incorporat suggestions from Kirill]
        Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com
      Fixes: ebed4bfc ("hugetlb: fix absurd HugePages_Rsvd")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c86aa7bb
    • M
      hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · b43a9990
      Mike Kravetz 提交于
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with the
        ptep.
      
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
        called.
      
      [mike.kravetz@oracle.com: add explicit check for mapping != null]
      Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b43a9990
    • J
      mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 · ac46d4f3
      Jérôme Glisse 提交于
      To avoid having to change many call sites everytime we want to add a
      parameter use a structure to group all parameters for the mmu_notifier
      invalidate_range_start/end cakks.  No functional changes with this patch.
      
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Acked-by: NChristian König <christian.koenig@amd.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      From: Jérôme Glisse <jglisse@redhat.com>
      Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3
      
      fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
      
      Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac46d4f3