  1. 01 Dec 2022, 1 commit
  2. 09 Nov 2022, 4 commits
  3. 04 Oct 2022, 7 commits
    • hugetlb: clean up code checking for fault/truncation races · fa27759a
      Authored by Mike Kravetz
      With the new hugetlb vma lock in place, it can also be used to handle page
      fault races with file truncation.  The lock is taken in read mode at the
      beginning of the fault path.  During truncation, it is taken in write
      mode for each vma which has the file mapped.  The file's size (i_size) is
      modified before taking the vma lock to unmap.
      
      How are races handled?
      
      The page fault code checks i_size early in processing after taking the vma
      lock.  If the fault is beyond i_size, the fault is aborted.  If the fault
      is not beyond i_size the fault will continue and a new page will be added
      to the file.  It could be that truncation code modifies i_size after the
      check in fault code.  That is OK, as truncation code will soon remove the
      page.  The truncation code will wait until the fault is finished, as it
      must obtain the vma lock in write mode.
      
      This patch cleans up/removes late checks in the fault paths that try to
      back out pages racing with truncation.  As noted above, we just let the
      truncation code remove the pages.
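
      A minimal sketch of the check described above (illustrative only; the
      locking helpers are assumed to be the hugetlb_vma_lock_read()/
      hugetlb_vma_unlock_read() primitives introduced by this series, and the
      surrounding fault code is abbreviated):

              /* Fault path (sketch, not the exact kernel code): */
              hugetlb_vma_lock_read(vma);
              idx = vma_hugecache_offset(h, vma, haddr);
              if (idx >= i_size_read(mapping->host) >> huge_page_shift(h)) {
                      /* fault beyond i_size: abort under the vma lock */
                      hugetlb_vma_unlock_read(vma);
                      return VM_FAULT_SIGBUS;
              }
              /* ... continue; any page added here is visible to truncation,
               * which takes the vma lock in write mode and removes it ... */
              hugetlb_vma_unlock_read(vma);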
      
      [mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
        Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
      Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa27759a
    • hugetlb: use new vma_lock for pmd sharing synchronization · 40549ba8
      Authored by Mike Kravetz
      The new hugetlb vma lock is used to address this race:
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      
      The vma_lock is used as follows:
      - During fault processing. The lock is acquired in read mode before
        doing a page table lock and allocation (huge_pte_alloc).  The lock is
        held until code is finished with the page table entry (ptep).
      - The lock must be held in write mode whenever huge_pmd_unshare is
        called.
      
      Lock ordering issues come into play when unmapping a page from all
      vmas mapping the page.  The i_mmap_rwsem must be held to search for the
      vmas, and the vma lock must be held before calling unmap which will
      call huge_pmd_unshare.  This is done today in:
      - try_to_migrate_one and try_to_unmap_one for page migration and memory
        error handling.  In these routines we 'try' to obtain the vma lock and
        fail to unmap if unsuccessful.  Calling routines already deal with the
        failure to unmap (see the sketch after this list).
      - hugetlb_vmdelete_list for truncation and hole punch.  This routine
        also tries to acquire the vma lock.  If it fails, it skips the
        unmapping.  However, we cannot have file truncation or hole punch
        fail because of contention.  After hugetlb_vmdelete_list, truncation
        and hole punch call remove_inode_hugepages.  remove_inode_hugepages
        checks for mapped pages and calls hugetlb_unmap_file_folio to unmap them.
        hugetlb_unmap_file_folio is designed to drop locks and reacquire them
        in the correct order to guarantee unmap success.
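
      A rough sketch of the 'try' pattern in the migration/unmap paths
      (illustrative; hugetlb_vma_trylock_write()/hugetlb_vma_unlock_write() are
      assumed to be the write-mode primitives of the new vma lock):

              /* In a try_to_unmap_one()-style caller (sketch): */
              if (is_vm_hugetlb_page(vma)) {
                      if (!hugetlb_vma_trylock_write(vma)) {
                              /*
                               * Contended: skip this vma.  The caller already
                               * copes with pages that remain mapped.
                               */
                              return false;
                      }
              }
              /* ... huge_pmd_unshare() may safely run while the lock is held ... */
              hugetlb_vma_unlock_write(vma);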
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      40549ba8
    • hugetlb: create hugetlb_unmap_file_folio to unmap single file folio · 378397cc
      Authored by Mike Kravetz
      Create the new routine hugetlb_unmap_file_folio that will unmap a single
      file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
      modified to do locking within the routine itself and check whether the
      page is mapped within a specific vma before unmapping.
      
      This refactoring will be put to use and expanded upon in a subsequent
      patch adding vma specific locking.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-8-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      378397cc
    • hugetlb: create remove_inode_single_folio to remove single file folio · c8627228
      Authored by Mike Kravetz
      Create the new routine remove_inode_single_folio that will remove a single
      folio from a file.  This is refactored code from remove_inode_hugepages. 
      It checks for the uncommon case in which the folio is still mapped and
      unmaps it.
      
      No functional change.  This refactoring will be put to use and expanded
      upon in subsequent patches.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-5-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c8627228
    • hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache · 7e1813d4
      Authored by Mike Kravetz
      remove_huge_page removes a hugetlb page from the page cache.  Change to
      hugetlb_delete_from_page_cache as it is a more descriptive name. 
      huge_add_to_page_cache is global in scope, but only deals with hugetlb
      pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7e1813d4
    • hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization · 3a47c54f
      Authored by Mike Kravetz
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") added code to take i_mmap_rwsem in read mode for the
      duration of fault processing.  However, this has been shown to cause
      performance/scaling issues.  Revert the code and go back to only taking
      the semaphore in huge_pmd_share during the fault path.
      
      Keep the code that takes i_mmap_rwsem in write mode before calling
      try_to_unmap as this is required if huge_pmd_unshare is called.
      
      NOTE: Reverting this code does expose the following race condition.
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      ptl = huge_pte_lock(ptep)
      get/update pte
      set_pte_at(pte, ptep)
      
      It is unknown if the above race was ever experienced by a user.  It was
      discovered via code inspection when initially addressed.
      
      In subsequent patches, a new synchronization mechanism will be added to
      coordinate pmd sharing and eliminate this race.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a47c54f
    • hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race · 188a3972
      Authored by Mike Kravetz
      Patch series "hugetlb: Use new vma lock for huge pmd sharing
      synchronization", v2.
      
      hugetlb fault scalability regressions have recently been reported [1]. 
      This is not the first such report, as regressions were also noted when
      commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") was added [2] in v5.7.  At that time, a proposal to
      address the regression was suggested [3] but went nowhere.
      
      The regression and the benefit of this patch series are not evident when
      using the vm_scalability benchmark reported in [2] on a recent kernel.
      Results from running
      "./usemem -n 48 --prealloc --prefault -O -U 3448054972"
      (average of 48 samples, all three columns on a next-20220913 base):
      
      unmodified              revert i_mmap_sema locking      vma sema locking (this series)
      -----------------------------------------------------------------------------------------
      498150 KB/s             501934 KB/s                     504793 KB/s
      
      The recent regression report [1] notes page fault and fork latency of
      shared hugetlb mappings.  To measure this, I created two simple programs:
      1) map a shared hugetlb area, write fault all pages, unmap area
         Do this in a continuous loop to measure faults per second
      2) map a shared hugetlb area, write fault a few pages, fork and exit
         Do this in a continuous loop to measure forks per second
      These programs were run on a 48 CPU VM with 320GB memory.  The shared
      mapping size was 250GB.  For comparison, a single instance of the program
      was run.  Then, multiple instances were run in parallel to introduce
      lock contention.  Changing the locking scheme results in a significant
      performance benefit.
      
      test		instances	unmodified	revert		vma
      --------------------------------------------------------------------------
      faults per sec	1		393043		395680		389932
      faults per sec  24		 71405		 81191		 79048
      forks per sec   1		  2802		  2747		  2725
      forks per sec   24		   439		   536		   500
      Combined faults 24		  1621		 68070		 53662
      Combined forks  24		   358		    67		   142
      
      Combined test is when running both faulting program and forking program
      simultaneously.
      
      Patches 1 and 2 of this series revert c0d0381a and 87bf91d3 which
      depends on c0d0381a.  Acquisition of i_mmap_rwsem is still required in
      the fault path to establish pmd sharing, so this is moved back to
      huge_pmd_share.  With c0d0381a reverted, this race is exposed:
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      ptl = huge_pte_lock(ptep)
      get/update pte
      set_pte_at(pte, ptep)
      
      Reverting 87bf91d3 exposes races in page fault/file truncation.  When
      the new vma lock is put to use in patch 8, this will handle the fault/file
      truncation races.  This is explained in patch 9 where code associated with
      these races is cleaned up.
      
      Patches 3 - 5 restructure existing code in preparation for using the new
      vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
      this semaphore will be held in read mode for the duration of fault
      processing, and held in write mode for unmap operations which may call
      huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
      synchronize huge pmd sharing.  However it is only required in the fault
      path when setting up sharing, and will be acquired in huge_pmd_share().
      
      Patch 6 adds the new vma lock and all supporting routines, but does not
      actually change code to use the new lock.
      
      Patch 7 refactors code in preparation for using the new lock.  And, patch
      8 finally adds code to make use of this new vma lock.  Unfortunately, the
      fault code and truncate/hole punch code would naturally take locks in the
      opposite order which could lead to deadlock.  Since the performance of
      page faults is more important, the truncation/hole punch code is modified
      to back out and take locks in the correct order if necessary.
      
      [1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
      [2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
      [3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
      
      
      This patch (of 9):
      
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") added code to take i_mmap_rwsem in read mode for the
      duration of fault processing.  The use of i_mmap_rwsem to prevent
      fault/truncate races depends on this.  However, this has been shown to
      cause performance/scaling issues.  As a result, that code will be
      reverted.  Since the use of i_mmap_rwsem to address page fault/truncate races
      depends on this, it must also be reverted.
      
      In a subsequent patch, code will be added to detect the fault/truncate
      race and back out operations as required.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      188a3972
  4. 24 Sep 2022, 2 commits
  5. 09 Aug 2022, 1 commit
  6. 03 Aug 2022, 1 commit
  7. 30 Jul 2022, 5 commits
  8. 29 Jun 2022, 2 commits
  9. 17 Jun 2022, 1 commit
    • hugetlbfs: zero partial pages during fallocate hole punch · 68d32527
      Authored by Mike Kravetz
      hugetlbfs fallocate support was originally added with commit 70c3547e
      ("hugetlbfs: add hugetlbfs_fallocate()").  Initial support only operated
      on whole hugetlb pages.  This makes sense for populating files as other
      interfaces such as mmap and truncate require hugetlb page size alignment. 
      Only operating on whole hugetlb pages for the hole punch case was a
      simplification and there was no compelling use case to zero partial pages.
      
      In a recent discussion [1], it was assumed that hugetlbfs hole punch would
      zero partial hugetlb pages as that is in line with the man page
      description saying 'partial filesystem blocks are zeroed'.  However, the
      hugetlbfs hole punch code actually does this:
      
              hole_start = round_up(offset, hpage_size);
              hole_end = round_down(offset + len, hpage_size);
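
      A worked example (assuming 2 MB hugetlb pages): punching offset = 1 MB,
      len = 2 MB gives hole_start = round_up(1 MB, 2 MB) = 2 MB and
      hole_end = round_down(3 MB, 2 MB) = 2 MB.  No whole page lies in
      [hole_start, hole_end), so before this patch nothing was removed and the
      two 1 MB partial ranges on either side of the 2 MB boundary were left
      unzeroed, contrary to the man page wording quoted above.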
      
      Modify code to zero partial hugetlb pages in hole punch range.  It is
      possible that application code could note a change in behavior.  However,
      that would imply the code is passing in an unaligned range and expecting
      only whole pages to be removed.  This is unlikely, as the fallocate
      documentation states the opposite.
      
      The current hugetlbfs fallocate hole punch behavior is tested with the
      libhugetlbfs test fallocate_align[2].  This test will be updated to
      validate partial page zeroing.
      
      [1] https://lore.kernel.org/linux-mm/20571829-9d3d-0b48-817c-b6b15565f651@redhat.com/
      [2] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/fallocate_align.c
      
      Link: https://lkml.kernel.org/r/YqeiMlZDKI1Kabfe@monkey
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      68d32527
  10. 13 May 2022, 2 commits
    • mm/hugetlb: only drop uffd-wp special pte if required · 05e90bd0
      Authored by Peter Xu
      As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
      if unmapping an entire vma or synchronized such that faults can not race
      with the unmap operation.  This requires passing zap_flags all the way to
      the lowest level hugetlb unmap routine: __unmap_hugepage_range.
      
      In general, unmap calls originated in hugetlbfs code will pass the
      ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
      faults.  The exception is hole punch which will first unmap without any
      synchronization.  Later when hole punch actually removes the page from the
      file, it will check to see if there was a subsequent fault and if so take
      the hugetlb fault mutex while unmapping again.  This second unmap will
      pass in ZAP_FLAG_DROP_MARKER.
      
      The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when
      unmapping a hugetlb range" is (IMHO): we should never reach a state where a
      page fault could erroneously fault in a wr-protected page-cache page as
      writable, even for an extremely short period.  That could happen if e.g.
      we pass ZAP_FLAG_DROP_MARKER when hugetlbfs_punch_hole() calls
      hugetlb_vmdelete_list(), because if a page faults after that call and
      before remove_inode_hugepages() is executed, the page cache can be mapped
      writable again in that small racy window, which can cause unexpected data
      to be overwritten.
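
      A heavily simplified sketch of the zap-time decision (illustrative only;
      pte_is_uffd_wp_marker() is a stand-in name for the marker test, not the
      real helper):

              /* Inside the hugetlb unmap loop (sketch): */
              if (pte_is_uffd_wp_marker(pte) &&       /* stand-in predicate */
                  !(zap_flags & ZAP_FLAG_DROP_MARKER)) {
                      /*
                       * Unsynchronized unmap (e.g. first pass of hole punch):
                       * keep the wr-protect marker so a racing fault cannot
                       * map the page-cache page writable.
                       */
                      continue;
              }
              /* otherwise clear the pte, dropping any marker with it */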
      
      [peterx@redhat.com: fix sparse warning]
        Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
      [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
      Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05e90bd0
    • hugetlbfs: fix hugetlbfs_statfs() locking · 4b25f030
      Authored by Mina Almasry
      After commit db71ef79 ("hugetlb: make free_huge_page irq safe"), the
      subpool lock should be taken with spin_lock_irq(), and all call sites were
      modified as such, except for the ones in hugetlbfs_statfs().
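
      A minimal sketch of the fix (assuming the usual sbinfo->spool->lock
      layout; the surrounding statfs bookkeeping is omitted):

              /* hugetlbfs_statfs() (sketch): take the subpool lock IRQ-safe,
               * matching every other call site after commit db71ef79. */
              spin_lock_irq(&sbinfo->spool->lock);
              /* read the subpool counters reported by statfs */
              spin_unlock_irq(&sbinfo->spool->lock);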
      
      Link: https://lkml.kernel.org/r/20220429202207.3045-1-almasrymina@google.com
      Fixes: db71ef79 ("hugetlb: make free_huge_page irq safe")
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4b25f030
  11. 09 May 2022, 1 commit
  12. 05 May 2022, 2 commits
  13. 22 Apr 2022, 1 commit
    • mm, hugetlb: allow for "high" userspace addresses · 5f24d5a5
      Authored by Christophe Leroy
      This is a fix for commit f6795053 ("mm: mmap: Allow for "high"
      userspace addresses") for hugetlb.
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by making changes to
      their architectural versions of hugetlb_get_unmapped_area() function.
      However, arm64 uses the generic version of that function.
      
      So take into account arch_get_mmap_base() and arch_get_mmap_end() in
      hugetlb_get_unmapped_area().  To allow that, move those two macros out
      of mm/mmap.c into include/linux/sched/mm.h
      
      If these macros are not defined in architectural code then they default
      to (TASK_SIZE) and (base) so should not introduce any behavioural
      changes to architectures that do not define them.
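
      The fallback definitions look roughly like this (a sketch of the defaults
      described above, not a verbatim copy of the moved code):

              #ifndef arch_get_mmap_end
              #define arch_get_mmap_end(addr)         (TASK_SIZE)
              #endif

              #ifndef arch_get_mmap_base
              #define arch_get_mmap_base(addr, base)  (base)
              #endif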
      
      For the time being, only ARM64 is affected by this change.
      
      Catalin (ARM64) said
       "We should have fixed hugetlb_get_unmapped_area() as well when we added
        support for 52-bit VA. The reason for commit f6795053 was to
        prevent normal mmap() from returning addresses above 48-bit by default
        as some user-space had hard assumptions about this.
      
        It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
        but I doubt anyone would notice. It's more likely that the current
        behaviour would cause issues, so I'd rather have them consistent.
      
        Basically when arm64 gained support for 52-bit addresses we did not
        want user-space calling mmap() to suddenly get such high addresses,
        otherwise we could have inadvertently broken some programs (similar
        behaviour to x86 here). Hence we added commit f6795053. But we
        missed hugetlbfs which could still get such high mmap() addresses. So
        in theory that's a potential regression that should have been addressed
        at the same time as commit f6795053 (and before arm64 enabled
        52-bit addresses)"
      
      Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
      Fixes: f6795053 ("mm: mmap: Allow for "high" userspace addresses")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>	[5.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f24d5a5
  14. 23 Mar 2022, 1 commit
  15. 17 Mar 2022, 1 commit
  16. 15 Jan 2022, 1 commit
    • hugetlbfs: fix off-by-one error in hugetlb_vmdelete_list() · d6aba4c8
      Authored by Sean Christopherson
      Pass "end - 1" instead of "end" when walking the interval tree in
      hugetlb_vmdelete_list() to fix an inclusive vs.  exclusive bug.  The two
      callers that pass a non-zero "end" treat it as exclusive, whereas the
      interval tree iterator expects an inclusive "last".  E.g.  punching a
      hole in a file that precisely matches the size of a single hugepage,
      with a vma starting right on the boundary, will result in
      unmap_hugepage_range() being called twice, with the second call having
      start==end.
      
      The off-by-one error doesn't cause functional problems as
      __unmap_hugepage_range() turns into a massive nop due to
      short-circuiting its for-loop on "address < end".  But the mmu_notifier
      invocations of invalidate_range_{start,end}() are passed a bogus zero-sized
      range, which may be unexpected behavior for secondary MMUs.
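
      A sketch of the conversion at the boundary (illustrative; the interval
      tree iterator takes an inclusive 'last', while hugetlb_vmdelete_list()'s
      callers pass an exclusive 'end', with 0 meaning "to the end of the file"):

              /* hugetlb_vmdelete_list() (sketch): */
              vma_interval_tree_foreach(vma, root, start,
                                        end ? end - 1 : ULONG_MAX) {
                      /* compute the per-vma range and unmap it
                       * (details omitted) */
              }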
      
      The bug was exposed by commit ed922739 ("KVM: Use interval tree to
      do fast hva lookup in memslots"), currently queued in the KVM tree for
      5.17, which added a WARN to detect ranges with start==end.
      
      Link: https://lkml.kernel.org/r/20211228234257.1926057-1-seanjc@google.com
      Fixes: 1bfad99a ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reported-by: syzbot+4e697fe80a31aa7efe21@syzkaller.appspotmail.com
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6aba4c8
  17. 10 Nov 2021, 1 commit
  18. 24 Jul 2021, 1 commit
  19. 17 Jun 2021, 1 commit
    • mm/hugetlb: expand restore_reserve_on_error functionality · 846be085
      Authored by Mike Kravetz
      The routine restore_reserve_on_error is called to restore reservation
      information when an error occurs after page allocation.  The routine
      alloc_huge_page modifies the mapping reserve map and potentially the
      reserve count during allocation.  If code calling alloc_huge_page
      encounters an error after allocation and needs to free the page, the
      reservation information needs to be adjusted.
      
      Currently, restore_reserve_on_error only takes action on pages for which
      the reserve count was adjusted (HPageRestoreReserve flag).  There is
      nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
      modifies the reserve map during allocation even if the reserve count is
      not adjusted.  This can cause issues as observed during development of
      this patch [1].
      
      One specific series of operations causing an issue is:
      
       - Create a shared hugetlb mapping
         Reservations for all pages created by default
      
       - Fault in a page in the mapping
         Reservation exists so reservation count is decremented
      
       - Punch a hole in the file/mapping at index previously faulted
         Reservation and any associated pages will be removed
      
       - Allocate a page to fill the hole
         No reservation entry, so reserve count unmodified
         Reservation entry added to map by alloc_huge_page
      
       - Error after allocation and before instantiating the page
         Reservation entry remains in map
      
       - Allocate a page to fill the hole
         Reservation entry exists, so decrement reservation count
      
      This will cause a reservation count underflow as the reservation count
      was decremented twice for the same index.
      
      A user would observe a very large number for HugePages_Rsvd in
      /proc/meminfo.  This would also likely cause subsequent allocations of
      hugetlb pages to fail as it would 'appear' that all pages are reserved.
      
      This sequence of operations is unlikely to happen, however they were
      easily reproduced and observed using hacked up code as described in [1].
      
      Address the issue by having the routine restore_reserve_on_error take
      action on pages where HPageRestoreReserve is not set.  In this case, we
      need to remove any reserve map entry created by alloc_huge_page.  A new
      helper routine vma_del_reservation assists with this operation.
      
      There are three callers of alloc_huge_page which do not currently call
      restore_reserve_on_error before freeing a page on error paths.  Add
      those missing calls.
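
      The caller-side pattern being added looks roughly like this (a sketch,
      assuming the current alloc_huge_page()/restore_reserve_on_error()
      signatures; the error condition is illustrative):

              page = alloc_huge_page(vma, haddr, 0);
              if (IS_ERR(page))
                      return PTR_ERR(page);

              if (setup_failed) {             /* illustrative error path */
                      /*
                       * alloc_huge_page() may have consumed a reservation
                       * and/or added a reserve map entry; undo that before
                       * freeing the page.
                       */
                      restore_reserve_on_error(h, vma, haddr, page);
                      put_page(page);
                      return -ENOMEM;
              }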
      
      [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
      
      Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
      Fixes: 96b96a96 ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      846be085
  20. 23 May 2021, 1 commit
  21. 15 May 2021, 1 commit
  22. 06 May 2021, 2 commits