  1. 01 Dec 2022, 1 commit
  2. 09 Nov 2022, 4 commits
  3. 04 Oct 2022, 7 commits
    • hugetlb: clean up code checking for fault/truncation races · fa27759a
      Authored by Mike Kravetz
      With the new hugetlb vma lock in place, it can also be used to handle page
      fault races with file truncation.  The lock is taken in read mode at the
      beginning of the fault path.  During truncation, it is taken in write
      mode for each vma which has the file mapped.  The file's size (i_size) is
      modified before taking the vma lock to unmap.
      
      How are races handled?
      
      The page fault code checks i_size early in processing after taking the vma
      lock.  If the fault is beyond i_size, the fault is aborted.  If the fault
      is not beyond i_size the fault will continue and a new page will be added
      to the file.  It could be that truncation code modifies i_size after the
      check in fault code.  That is OK, as truncation code will soon remove the
      page.  The truncation code will wait until the fault is finished, as it
      must obtain the vma lock in write mode.
      
      This patch cleans up/removes late checks in the fault paths that try to
      back out pages racing with truncation.  As noted above, we just let the
      truncation code remove the pages.
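
      A minimal sketch of the check described above (illustrative only; the
      locking helpers are assumed to be the hugetlb_vma_lock_read()/
      hugetlb_vma_unlock_read() primitives introduced by this series, and the
      surrounding fault code is abbreviated):

              /* Fault path (sketch, not the exact kernel code): */
              hugetlb_vma_lock_read(vma);
              idx = vma_hugecache_offset(h, vma, haddr);
              if (idx >= i_size_read(mapping->host) >> huge_page_shift(h)) {
                      /* fault beyond i_size: abort under the vma lock */
                      hugetlb_vma_unlock_read(vma);
                      return VM_FAULT_SIGBUS;
              }
              /* ... continue; any page added here is visible to truncation,
               * which takes the vma lock in write mode and removes it ... */
              hugetlb_vma_unlock_read(vma);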
      
      [mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
        Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
      Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa27759a
    • hugetlb: use new vma_lock for pmd sharing synchronization · 40549ba8
      Authored by Mike Kravetz
      The new hugetlb vma lock is used to address this race:
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      
      The vma_lock is used as follows:
      - During fault processing. The lock is acquired in read mode before
        doing a page table lock and allocation (huge_pte_alloc).  The lock is
        held until code is finished with the page table entry (ptep).
      - The lock must be held in write mode whenever huge_pmd_unshare is
        called.
      
      Lock ordering issues come into play when unmapping a page from all
      vmas mapping the page.  The i_mmap_rwsem must be held to search for the
      vmas, and the vma lock must be held before calling unmap which will
      call huge_pmd_unshare.  This is done today in:
      - try_to_migrate_one and try_to_unmap_one for page migration and memory
        error handling.  In these routines we 'try' to obtain the vma lock and
        fail to unmap if unsuccessful.  Calling routines already deal with the
        failure to unmap (see the sketch after this list).
      - hugetlb_vmdelete_list for truncation and hole punch.  This routine
        also tries to acquire the vma lock.  If it fails, it skips the
        unmapping.  However, we cannot have file truncation or hole punch
        fail because of contention.  After hugetlb_vmdelete_list, truncation
        and hole punch call remove_inode_hugepages.  remove_inode_hugepages
        checks for mapped pages and calls hugetlb_unmap_file_folio to unmap them.
        hugetlb_unmap_file_folio is designed to drop locks and reacquire them
        in the correct order to guarantee unmap success.
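
      A rough sketch of the 'try' pattern in the migration/unmap paths
      (illustrative; hugetlb_vma_trylock_write()/hugetlb_vma_unlock_write() are
      assumed to be the write-mode primitives of the new vma lock):

              /* In a try_to_unmap_one()-style caller (sketch): */
              if (is_vm_hugetlb_page(vma)) {
                      if (!hugetlb_vma_trylock_write(vma)) {
                              /*
                               * Contended: skip this vma.  The caller already
                               * copes with pages that remain mapped.
                               */
                              return false;
                      }
              }
              /* ... huge_pmd_unshare() may safely run while the lock is held ... */
              hugetlb_vma_unlock_write(vma);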
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      40549ba8
    • hugetlb: create hugetlb_unmap_file_folio to unmap single file folio · 378397cc
      Authored by Mike Kravetz
      Create the new routine hugetlb_unmap_file_folio that will unmap a single
      file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
      modified to do locking within the routine itself and check whether the
      page is mapped within a specific vma before unmapping.
      
      This refactoring will be put to use and expanded upon in a subsequent
      patch adding vma specific locking.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-8-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      378397cc
    • hugetlb: create remove_inode_single_folio to remove single file folio · c8627228
      Authored by Mike Kravetz
      Create the new routine remove_inode_single_folio that will remove a single
      folio from a file.  This is refactored code from remove_inode_hugepages. 
      It checks for the uncommon case in which the folio is still mapped and
      unmaps it.
      
      No functional change.  This refactoring will be put to use and expanded
      upon in subsequent patches.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c8627228
    • hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache · 7e1813d4
      Authored by Mike Kravetz
      remove_huge_page removes a hugetlb page from the page cache.  Change to
      hugetlb_delete_from_page_cache as it is a more descriptive name. 
      huge_add_to_page_cache is global in scope, but only deals with hugetlb
      pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7e1813d4
    • hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization · 3a47c54f
      Authored by Mike Kravetz
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") added code to take i_mmap_rwsem in read mode for the
      duration of fault processing.  However, this has been shown to cause
      performance/scaling issues.  Revert the code and go back to only taking
      the semaphore in huge_pmd_share during the fault path.
      
      Keep the code that takes i_mmap_rwsem in write mode before calling
      try_to_unmap as this is required if huge_pmd_unshare is called.
      
      NOTE: Reverting this code does expose the following race condition.
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      ptl = huge_pte_lock(ptep)
      get/update pte
      set_pte_at(pte, ptep)
      
      It is unknown if the above race was ever experienced by a user.  It was
      discovered via code inspection when initially addressed.
      
      In subsequent patches, a new synchronization mechanism will be added to
      coordinate pmd sharing and eliminate this race.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a47c54f
    • hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race · 188a3972
      Authored by Mike Kravetz
      Patch series "hugetlb: Use new vma lock for huge pmd sharing
      synchronization", v2.
      
      hugetlb fault scalability regressions have recently been reported [1]. 
      This is not the first such report, as regressions were also noted when
      commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") was added [2] in v5.7.  At that time, a proposal to
      address the regression was suggested [3] but went nowhere.
      
      The regression and the benefit of this patch series are not evident when
      using the vm_scalability benchmark reported in [2] on a recent kernel.
      Results from running
      "./usemem -n 48 --prealloc --prefault -O -U 3448054972"
      (average of 48 samples, all three columns on a next-20220913 base):
      
      unmodified              revert i_mmap_sema locking      vma sema locking (this series)
      -----------------------------------------------------------------------------------------
      498150 KB/s             501934 KB/s                     504793 KB/s
      
      The recent regression report [1] notes page fault and fork latency of
      shared hugetlb mappings.  To measure this, I created two simple programs:
      1) map a shared hugetlb area, write fault all pages, unmap area
         Do this in a continuous loop to measure faults per second
      2) map a shared hugetlb area, write fault a few pages, fork and exit
         Do this in a continuous loop to measure forks per second
      These programs were run on a 48 CPU VM with 320GB memory.  The shared
      mapping size was 250GB.  For comparison, a single instance of the program
      was run.  Then, multiple instances were run in parallel to introduce
      lock contention.  Changing the locking scheme results in a significant
      performance benefit.
      
      test		instances	unmodified	revert		vma
      --------------------------------------------------------------------------
      faults per sec	1		393043		395680		389932
      faults per sec  24		 71405		 81191		 79048
      forks per sec   1		  2802		  2747		  2725
      forks per sec   24		   439		   536		   500
      Combined faults 24		  1621		 68070		 53662
      Combined forks  24		   358		    67		   142
      
      Combined test is when running both faulting program and forking program
      simultaneously.
      
      Patches 1 and 2 of this series revert c0d0381a and 87bf91d3 which
      depends on c0d0381a.  Acquisition of i_mmap_rwsem is still required in
      the fault path to establish pmd sharing, so this is moved back to
      huge_pmd_share.  With c0d0381a reverted, this race is exposed:
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      ptl = huge_pte_lock(ptep)
      get/update pte
      set_pte_at(pte, ptep)
      
      Reverting 87bf91d3 exposes races in page fault/file truncation.  When
      the new vma lock is put to use in patch 8, this will handle the fault/file
      truncation races.  This is explained in patch 9 where code associated with
      these races is cleaned up.
      
      Patches 3 - 5 restructure existing code in preparation for using the new
      vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
      this semaphore will be held in read mode for the duration of fault
      processing, and held in write mode for unmap operations which may call
      huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
      synchronize huge pmd sharing.  However it is only required in the fault
      path when setting up sharing, and will be acquired in huge_pmd_share().
      
      Patch 6 adds the new vma lock and all supporting routines, but does not
      actually change code to use the new lock.
      
      Patch 7 refactors code in preparation for using the new lock.  And, patch
      8 finally adds code to make use of this new vma lock.  Unfortunately, the
      fault code and truncate/hole punch code would naturally take locks in the
      opposite order which could lead to deadlock.  Since the performance of
      page faults is more important, the truncation/hole punch code is modified
      to back out and take locks in the correct order if necessary.
      
      [1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
      [2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
      [3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
      
      
      This patch (of 9):
      
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") added code to take i_mmap_rwsem in read mode for the
      duration of fault processing.  The use of i_mmap_rwsem to prevent
      fault/truncate races depends on this.  However, this has been shown to
      cause performance/scaling issues.  As a result, that code will be
      reverted.  Since the use of i_mmap_rwsem to address page fault/truncate races
      depends on this, it must also be reverted.
      
      In a subsequent patch, code will be added to detect the fault/truncate
      race and back out operations as required.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      188a3972
  4. 24 Sep 2022, 2 commits
  5. 09 Aug 2022, 1 commit
  6. 03 Aug 2022, 1 commit
  7. 30 Jul 2022, 5 commits
  8. 29 Jun 2022, 2 commits
  9. 17 Jun 2022, 1 commit
    • hugetlbfs: zero partial pages during fallocate hole punch · 68d32527
      Authored by Mike Kravetz
      hugetlbfs fallocate support was originally added with commit 70c3547e
      ("hugetlbfs: add hugetlbfs_fallocate()").  Initial support only operated
      on whole hugetlb pages.  This makes sense for populating files as other
      interfaces such as mmap and truncate require hugetlb page size alignment. 
      Only operating on whole hugetlb pages for the hole punch case was a
      simplification and there was no compelling use case to zero partial pages.
      
      In a recent discussion [1], it was assumed that hugetlbfs hole punch would
      zero partial hugetlb pages as that is in line with the man page
      description saying 'partial filesystem blocks are zeroed'.  However, the
      hugetlbfs hole punch code actually does this:
      
              hole_start = round_up(offset, hpage_size);
              hole_end = round_down(offset + len, hpage_size);
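
      A worked example (assuming 2 MB hugetlb pages): punching offset = 1 MB,
      len = 2 MB gives hole_start = round_up(1 MB, 2 MB) = 2 MB and
      hole_end = round_down(3 MB, 2 MB) = 2 MB.  No whole page lies in
      [hole_start, hole_end), so before this patch nothing was removed and the
      two 1 MB partial ranges on either side of the 2 MB boundary were left
      unzeroed, contrary to the man page wording quoted above.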
      
      Modify code to zero partial hugetlb pages in hole punch range.  It is
      possible that application code could note a change in behavior.  However,
      that would imply the code is passing in an unaligned range and expecting
      only whole pages to be removed.  This is unlikely, as the fallocate
      documentation states the opposite.
      
      The current hugetlbfs fallocate hole punch behavior is tested with the
      libhugetlbfs test fallocate_align[2].  This test will be updated to
      validate partial page zeroing.
      
      [1] https://lore.kernel.org/linux-mm/20571829-9d3d-0b48-817c-b6b15565f651@redhat.com/
      [2] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/fallocate_align.c
      
      Link: https://lkml.kernel.org/r/YqeiMlZDKI1Kabfe@monkey
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      68d32527
  10. 13 May 2022, 2 commits
    • mm/hugetlb: only drop uffd-wp special pte if required · 05e90bd0
      Authored by Peter Xu
      As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
      if unmapping an entire vma or synchronized such that faults can not race
      with the unmap operation.  This requires passing zap_flags all the way to
      the lowest level hugetlb unmap routine: __unmap_hugepage_range.
      
      In general, unmap calls originated in hugetlbfs code will pass the
      ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
      faults.  The exception is hole punch which will first unmap without any
      synchronization.  Later when hole punch actually removes the page from the
      file, it will check to see if there was a subsequent fault and if so take
      the hugetlb fault mutex while unmapping again.  This second unmap will
      pass in ZAP_FLAG_DROP_MARKER.
      
      The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when
      unmapping a hugetlb range" is (IMHO): we should never reach a state where a
      page fault could erroneously fault in a wr-protected page-cache page as
      writable, even for an extremely short period.  That could happen if e.g.
      we pass ZAP_FLAG_DROP_MARKER when hugetlbfs_punch_hole() calls
      hugetlb_vmdelete_list(), because if a page faults after that call and
      before remove_inode_hugepages() is executed, the page cache can be mapped
      writable again in that small racy window, which can cause unexpected data
      to be overwritten.
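
      A heavily simplified sketch of the zap-time decision (illustrative only;
      pte_is_uffd_wp_marker() is a stand-in name for the marker test, not the
      real helper):

              /* Inside the hugetlb unmap loop (sketch): */
              if (pte_is_uffd_wp_marker(pte) &&       /* stand-in predicate */
                  !(zap_flags & ZAP_FLAG_DROP_MARKER)) {
                      /*
                       * Unsynchronized unmap (e.g. first pass of hole punch):
                       * keep the wr-protect marker so a racing fault cannot
                       * map the page-cache page writable.
                       */
                      continue;
              }
              /* otherwise clear the pte, dropping any marker with it */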
      
      [peterx@redhat.com: fix sparse warning]
        Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
      [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
      Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05e90bd0
    • hugetlbfs: fix hugetlbfs_statfs() locking · 4b25f030
      Authored by Mina Almasry
      After commit db71ef79 ("hugetlb: make free_huge_page irq safe"), the
      subpool lock should be taken with spin_lock_irq(), and all call sites were
      modified as such, except for the ones in hugetlbfs_statfs().
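
      A minimal sketch of the fix (assuming the usual sbinfo->spool->lock
      layout; the surrounding statfs bookkeeping is omitted):

              /* hugetlbfs_statfs() (sketch): take the subpool lock IRQ-safe,
               * matching every other call site after commit db71ef79. */
              spin_lock_irq(&sbinfo->spool->lock);
              /* read the subpool counters reported by statfs */
              spin_unlock_irq(&sbinfo->spool->lock);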
      
      Link: https://lkml.kernel.org/r/20220429202207.3045-1-almasrymina@google.com
      Fixes: db71ef79 ("hugetlb: make free_huge_page irq safe")
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4b25f030
  11. 09 May 2022, 1 commit
  12. 05 May 2022, 2 commits
  13. 22 Apr 2022, 1 commit
    • mm, hugetlb: allow for "high" userspace addresses · 5f24d5a5
      Authored by Christophe Leroy
      This is a fix for commit f6795053 ("mm: mmap: Allow for "high"
      userspace addresses") for hugetlb.
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by making changes to
      their architectural versions of hugetlb_get_unmapped_area() function.
      However, arm64 uses the generic version of that function.
      
      So take into account arch_get_mmap_base() and arch_get_mmap_end() in
      hugetlb_get_unmapped_area().  To allow that, move those two macros out
      of mm/mmap.c into include/linux/sched/mm.h
      
      If these macros are not defined in architectural code then they default
      to (TASK_SIZE) and (base) so should not introduce any behavioural
      changes to architectures that do not define them.
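
      The fallback definitions look roughly like this (a sketch of the defaults
      described above, not a verbatim copy of the moved code):

              #ifndef arch_get_mmap_end
              #define arch_get_mmap_end(addr)         (TASK_SIZE)
              #endif

              #ifndef arch_get_mmap_base
              #define arch_get_mmap_base(addr, base)  (base)
              #endif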
      
      For the time being, only ARM64 is affected by this change.
      
      Catalin (ARM64) said
       "We should have fixed hugetlb_get_unmapped_area() as well when we added
        support for 52-bit VA. The reason for commit f6795053 was to
        prevent normal mmap() from returning addresses above 48-bit by default
        as some user-space had hard assumptions about this.
      
        It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
        but I doubt anyone would notice. It's more likely that the current
        behaviour would cause issues, so I'd rather have them consistent.
      
        Basically when arm64 gained support for 52-bit addresses we did not
        want user-space calling mmap() to suddenly get such high addresses,
        otherwise we could have inadvertently broken some programs (similar
        behaviour to x86 here). Hence we added commit f6795053. But we
        missed hugetlbfs which could still get such high mmap() addresses. So
        in theory that's a potential regression that should have been addressed
        at the same time as commit f6795053 (and before arm64 enabled
        52-bit addresses)"
      
      Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
      Fixes: f6795053 ("mm: mmap: Allow for "high" userspace addresses")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>	[5.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f24d5a5
  14. 23 Mar 2022, 1 commit
  15. 17 Mar 2022, 1 commit
  16. 15 Jan 2022, 1 commit
    • hugetlbfs: fix off-by-one error in hugetlb_vmdelete_list() · d6aba4c8
      Authored by Sean Christopherson
      Pass "end - 1" instead of "end" when walking the interval tree in
      hugetlb_vmdelete_list() to fix an inclusive vs.  exclusive bug.  The two
      callers that pass a non-zero "end" treat it as exclusive, whereas the
      interval tree iterator expects an inclusive "last".  E.g.  punching a
      hole in a file that precisely matches the size of a single hugepage,
      with a vma starting right on the boundary, will result in
      unmap_hugepage_range() being called twice, with the second call having
      start==end.
      
      The off-by-one error doesn't cause functional problems as
      __unmap_hugepage_range() turns into a massive nop due to
      short-circuiting its for-loop on "address < end".  But the mmu_notifier
      invocations of invalidate_range_{start,end}() are passed a bogus zero-sized
      range, which may be unexpected behavior for secondary MMUs.
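
      A sketch of the conversion at the boundary (illustrative; the interval
      tree iterator takes an inclusive 'last', while hugetlb_vmdelete_list()'s
      callers pass an exclusive 'end', with 0 meaning "to the end of the file"):

              /* hugetlb_vmdelete_list() (sketch): */
              vma_interval_tree_foreach(vma, root, start,
                                        end ? end - 1 : ULONG_MAX) {
                      /* compute the per-vma range and unmap it
                       * (details omitted) */
              }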
      
      The bug was exposed by commit ed922739 ("KVM: Use interval tree to
      do fast hva lookup in memslots"), currently queued in the KVM tree for
      5.17, which added a WARN to detect ranges with start==end.
      
      Link: https://lkml.kernel.org/r/20211228234257.1926057-1-seanjc@google.com
      Fixes: 1bfad99a ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reported-by: syzbot+4e697fe80a31aa7efe21@syzkaller.appspotmail.com
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6aba4c8
  17. 10 Nov 2021, 1 commit
  18. 24 Jul 2021, 1 commit
  19. 17 Jun 2021, 1 commit
    • mm/hugetlb: expand restore_reserve_on_error functionality · 846be085
      Authored by Mike Kravetz
      The routine restore_reserve_on_error is called to restore reservation
      information when an error occurs after page allocation.  The routine
      alloc_huge_page modifies the mapping reserve map and potentially the
      reserve count during allocation.  If code calling alloc_huge_page
      encounters an error after allocation and needs to free the page, the
      reservation information needs to be adjusted.
      
      Currently, restore_reserve_on_error only takes action on pages for which
      the reserve count was adjusted (HPageRestoreReserve flag).  There is
      nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
      modifies the reserve map during allocation even if the reserve count is
      not adjusted.  This can cause issues as observed during development of
      this patch [1].
      
      One specific series of operations causing an issue is:
      
       - Create a shared hugetlb mapping
         Reservations for all pages created by default
      
       - Fault in a page in the mapping
         Reservation exists so reservation count is decremented
      
       - Punch a hole in the file/mapping at index previously faulted
         Reservation and any associated pages will be removed
      
       - Allocate a page to fill the hole
         No reservation entry, so reserve count unmodified
         Reservation entry added to map by alloc_huge_page
      
       - Error after allocation and before instantiating the page
         Reservation entry remains in map
      
       - Allocate a page to fill the hole
         Reservation entry exists, so decrement reservation count
      
      This will cause a reservation count underflow as the reservation count
      was decremented twice for the same index.
      
      A user would observe a very large number for HugePages_Rsvd in
      /proc/meminfo.  This would also likely cause subsequent allocations of
      hugetlb pages to fail as it would 'appear' that all pages are reserved.
      
      This sequence of operations is unlikely to happen, however they were
      easily reproduced and observed using hacked up code as described in [1].
      
      Address the issue by having the routine restore_reserve_on_error take
      action on pages where HPageRestoreReserve is not set.  In this case, we
      need to remove any reserve map entry created by alloc_huge_page.  A new
      helper routine vma_del_reservation assists with this operation.
      
      There are three callers of alloc_huge_page which do not currently call
      restore_reserve_on_error before freeing a page on error paths.  Add
      those missing calls.
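
      The caller-side pattern being added looks roughly like this (a sketch,
      assuming the current alloc_huge_page()/restore_reserve_on_error()
      signatures; the error condition is illustrative):

              page = alloc_huge_page(vma, haddr, 0);
              if (IS_ERR(page))
                      return PTR_ERR(page);

              if (setup_failed) {             /* illustrative error path */
                      /*
                       * alloc_huge_page() may have consumed a reservation
                       * and/or added a reserve map entry; undo that before
                       * freeing the page.
                       */
                      restore_reserve_on_error(h, vma, haddr, page);
                      put_page(page);
                      return -ENOMEM;
              }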
      
      [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
      
      Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
      Fixes: 96b96a96 ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      846be085
  20. 23 May 2021, 1 commit
  21. 15 May 2021, 1 commit
  22. 06 May 2021, 2 commits