1. 15 Oct, 2020 (1 commit)
2. 22 Sep, 2020 (4 commits)
• mm/hugetlb: fix a race between hugetlb sysctl handlers · 403e3ba3
Muchun Song authored
      mainline inclusion
      from mainline-v5.9-rc4
      commit 17743798
      category: bugfix
      bugzilla: NA
      CVE: CVE-2020-25285
      
      --------------------------------
      
There is a race between the assignment of `table->data` on one thread
and the write through the pointer stored in `table->data` in
__do_proc_doulongvec_minmax() on another thread.
      
        CPU0:                                 CPU1:
                                              proc_sys_write
        hugetlb_sysctl_handler                  proc_sys_call_handler
        hugetlb_sysctl_handler_common             hugetlb_sysctl_handler
          table->data = &tmp;                       hugetlb_sysctl_handler_common
                                                      table->data = &tmp;
            proc_doulongvec_minmax
              do_proc_doulongvec_minmax           sysctl_head_finish
                __do_proc_doulongvec_minmax         unuse_table
                  i = table->data;
                  *i = val;  // corrupt CPU1's stack
      
Fix this by duplicating `table` and updating only the duplicate. Also
introduce a helper, proc_hugetlb_doulongvec_minmax(), to simplify the
code.
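
A minimal sketch of that shape in C (illustrative: it assumes the
upstream naming of hugetlb_sysctl_handler_common() and simplifies the
set_max_huge_pages() signature):

    /*
     * Work on an on-stack copy of the ctl_table so each writer has a
     * private ->data pointer and cannot corrupt another thread's stack.
     */
    static int hugetlb_sysctl_handler_common(struct ctl_table *table, int write,
                    void *buffer, size_t *length, loff_t *ppos)
    {
            struct hstate *h = &default_hstate;
            unsigned long tmp = h->max_huge_pages;
            struct ctl_table dup_table;
            int ret;

            dup_table = *table;     /* private copy; shared 'table' is never written */
            dup_table.data = &tmp;

            ret = proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
            if (ret || !write)
                    return ret;

            return set_max_huge_pages(h, tmp);      /* simplified signature */
    }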
      
      The following oops was seen:
      
          BUG: kernel NULL pointer dereference, address: 0000000000000000
          #PF: supervisor instruction fetch in kernel mode
          #PF: error_code(0x0010) - not-present page
          Code: Bad RIP value.
          ...
          Call Trace:
           ? set_max_huge_pages+0x3da/0x4f0
           ? alloc_pool_huge_page+0x150/0x150
           ? proc_doulongvec_minmax+0x46/0x60
           ? hugetlb_sysctl_handler_common+0x1c7/0x200
           ? nr_hugepages_store+0x20/0x20
           ? copy_fd_bitmaps+0x170/0x170
           ? hugetlb_sysctl_handler+0x1e/0x20
           ? proc_sys_call_handler+0x2f1/0x300
           ? unregister_sysctl_table+0xb0/0xb0
           ? __fd_install+0x78/0x100
           ? proc_sys_write+0x14/0x20
           ? __vfs_write+0x4d/0x90
           ? vfs_write+0xef/0x240
           ? ksys_write+0xc0/0x160
           ? __ia32_sys_read+0x50/0x50
           ? __close_fd+0x129/0x150
           ? __x64_sys_write+0x43/0x50
           ? do_syscall_64+0x6c/0x200
           ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: e5ff2159 ("hugetlb: multiple hstates for multiple page sizes")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• arm64/ascend: Add hugepage flags change interface · db1d159b
Weilong Chen authored
      ascend inclusion
      category: feature
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      
Add a variable that changes the gfp flags used for hugepage allocation,
and export it. Hugepages need to be accounted by cgroup; we use this to
pass the ACCOUNT flag (__GFP_ACCOUNT) to the memory subsystem.
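
A hedged sketch of what such an interface could look like; the symbol
name hugetlb_alloc_gfp_extra is an assumption, not taken from this log:

    /* Extra flags OR'ed into hugepage allocations; name is hypothetical. */
    gfp_t hugetlb_alloc_gfp_extra;
    EXPORT_SYMBOL(hugetlb_alloc_gfp_extra);

    static struct page *alloc_one_hugepage(struct hstate *h, int nid)
    {
            /*
             * Setting hugetlb_alloc_gfp_extra = __GFP_ACCOUNT routes the
             * pages through memory cgroup accounting.
             */
            gfp_t gfp = htlb_alloc_mask(h) | hugetlb_alloc_gfp_extra;

            return alloc_pages_node(nid, gfp, huge_page_order(h));
    }
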
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• arm64/ascend: Add set hugepage number helper function · b6bcd500
Weilong Chen authored
      ascend inclusion
      category: feature
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      
Add a helper function to change the system hugepage count
(nr_hugepages), and export it.
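
A hedged sketch of such a helper; the function name is an assumption,
and the set_max_huge_pages() signature is simplified:

    /*
     * Let other kernel modules adjust nr_hugepages programmatically,
     * funnelling into the same path the sysctl handler uses.
     */
    int hugetlb_set_nr_hugepages(unsigned long count)
    {
            return set_max_huge_pages(&default_hstate, count);
    }
    EXPORT_SYMBOL(hugetlb_set_nr_hugepages);
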
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
• mm/hugetlb: fix an addressing exception caused by huge_pte_offset · f053d5db
Longpeng authored
      stable inclusion
      from linux-4.19.119
      commit dcca7d2f751014d8caa3711e93b2119305e837df
      
      --------------------------------
      
      commit 3c1d7e6c upstream.
      
Our machine encountered a panic (addressing exception) after running
for a long time, and the call trace is:
      
          RIP: hugetlb_fault+0x307/0xbe0
          RSP: 0018:ffff9567fc27f808  EFLAGS: 00010286
          RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
          RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
          RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
          R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
          R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
          FS:  00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
            follow_hugetlb_page+0x175/0x540
            __get_user_pages+0x2a0/0x7e0
            __get_user_pages_unlocked+0x15d/0x210
            __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
            try_async_pf+0x6e/0x2a0 [kvm]
            tdp_page_fault+0x151/0x2d0 [kvm]
           ...
            kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
            kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
            do_vfs_ioctl+0x3f0/0x540
            SyS_ioctl+0xa1/0xc0
            system_call_fastpath+0x22/0x27
      
      For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
      may return a wrong 'pmdp' if there is a race.  Please look at the
      following code snippet:
      
          ...
          pud = pud_offset(p4d, addr);
          if (sz != PUD_SIZE && pud_none(*pud))
              return NULL;
          /* hugepage or swap? */
          if (pud_huge(*pud) || !pud_present(*pud))
              return (pte_t *)pud;
      
          pmd = pmd_offset(pud, addr);
          if (sz != PMD_SIZE && pmd_none(*pmd))
              return NULL;
          /* hugepage or swap? */
          if (pmd_huge(*pmd) || !pmd_present(*pmd))
              return (pte_t *)pmd;
          ...
      
      The following sequence would trigger this bug:
      
 - CPU0: sz = PUD_SIZE and *pud = 0, continue
       - CPU0: "pud_huge(*pud)" is false
       - CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
       - CPU0: "!pud_present(*pud)" is false, continue
       - CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp
      
      However, we want CPU0 to return NULL or pudp in this case.
      
      We must make sure there is exactly one dereference of pud and pmd.
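
A minimal illustration of that principle against the snippet above (not
the verbatim upstream diff): read the entry once into a local, then
test only the local, so a racing writer cannot change the value between
the checks.

    pud_t pud_entry = READ_ONCE(*pud);

    if (sz != PUD_SIZE && pud_none(pud_entry))
            return NULL;
    /* hugepage or swap? */
    if (pud_huge(pud_entry) || !pud_present(pud_entry))
            return (pte_t *)pud;
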
Signed-off-by: Longpeng <longpeng2@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200413010342.771-1-longpeng2@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
3. 31 Aug, 2020 (3 commits)
4. 27 Dec, 2019 (20 commits)
5. 08 Dec, 2018 (1 commit)
• userfaultfd: use ENOENT instead of EFAULT if the atomic copy user fails · 10f98c13
Andrea Arcangeli authored
      commit 9e368259ad988356c4c95150fafd1a06af095d98 upstream.
      
      Patch series "userfaultfd shmem updates".
      
      Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
      lack of the VM_MAYWRITE check and the lack of i_size checks.
      
      Then looking into the above we also fixed the MAP_PRIVATE case.
      
      Hugh by source review also found a data loss source if UFFDIO_COPY is
      used on shmem MAP_SHARED PROT_READ mappings (the production usages
      incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
      happen in those production usages like with QEMU).
      
      The whole patchset is marked for stable.
      
We verified that QEMU postcopy live migration, with the guest running
on shmem MAP_PRIVATE, runs as well after the fix of shmem MAP_PRIVATE
as it did before.
Regardless of whether it's shmem or hugetlbfs, MAP_PRIVATE or
MAP_SHARED, QEMU unconditionally invokes a hole punch if the guest
mapping is file-backed, and a MADV_DONTNEED too (needed to get rid of
the MAP_PRIVATE COWs and for the anon backend).
      
      This patch (of 5):
      
We internally used EFAULT to communicate with the caller; switch to
ENOENT so that EFAULT can be used as a non-internal return value.
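
A hedged sketch of the new convention in an mcopy_atomic_pte()-style
copy path (illustrative, not the verbatim diff):

    ret = copy_from_user(page_kaddr,
                         (const void __user *)(unsigned long)src_addr,
                         PAGE_SIZE);
    if (unlikely(ret)) {
            /*
             * Was -EFAULT: -ENOENT now tells the caller to retry the
             * copy outside the lock; -EFAULT is reserved for reporting
             * a genuine fault to userspace.
             */
            ret = -ENOENT;
            goto out;
    }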
      
      Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
      Fixes: 4c27fe4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6. 21 Nov, 2018 (1 commit)
• hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444! · 7b46e532
Mike Kravetz authored
      commit 5e41540c8a0f0e98c337dda8b391e5dda0cde7cf upstream.
      
      This bug has been experienced several times by the Oracle DB team.  The
      BUG is in remove_inode_hugepages() as follows:
      
      	/*
      	 * If page is mapped, it was faulted in after being
      	 * unmapped in caller.  Unmap (again) now after taking
      	 * the fault mutex.  The mutex will prevent faults
      	 * until we finish removing the page.
      	 *
      	 * This race can only happen in the hole punch case.
      	 * Getting here in a truncate operation is a bug.
      	 */
      	if (unlikely(page_mapped(page))) {
      		BUG_ON(truncate_op);
      
      In this case, the elevated map count is not the result of a race.
      Rather it was incorrectly incremented as the result of a bug in the huge
      pmd sharing code.  Consider the following:
      
       - Process A maps a hugetlbfs file of sufficient size and alignment
         (PUD_SIZE) that a pmd page could be shared.
      
       - Process B maps the same hugetlbfs file with the same size and
         alignment such that a pmd page is shared.
      
       - Process B then calls mprotect() to change protections for the mapping
         with the shared pmd. As a result, the pmd is 'unshared'.
      
 - Process B then calls mprotect() again to change protections for the
   mapping back to their original value. pmd remains unshared.
      
       - Process B then forks and process C is created. During the fork
         process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
         tables. Copying page tables for hugetlb mappings is done in the
         routine copy_hugetlb_page_range.
      
      In copy_hugetlb_page_range(), the destination pte is obtained by:
      
      	dst_pte = huge_pte_alloc(dst, addr, sz);
      
      If pmd sharing is possible, the returned pointer will be to a pte in an
      existing page table.  In the situation above, process C could share with
      either process A or process B.  Since process A is first in the list,
      the returned pte is a pointer to a pte in process A's page table.
      
      However, the check for pmd sharing in copy_hugetlb_page_range is:
      
      	/* If the pagetables are shared don't copy or take references */
      	if (dst_pte == src_pte)
      		continue;
      
      Since process C is sharing with process A instead of process B, the
      above test fails.  The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte.  It copies the pte entry
from src_pte to dst_pte and increments the map count of the associated
page.  This is how we end up with an elevated map count.
      
To solve this, check the dst_pte entry for huge_pte_none.  If it is not
none, this implies PMD sharing, so do not copy.
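
A hedged sketch of that check in the copy_hugetlb_page_range() loop
(paraphrasing the fix, not the verbatim diff):

    dst_pte = huge_pte_alloc(dst, addr, sz);
    dst_entry = huge_ptep_get(dst_pte);

    /*
     * A !none destination entry means huge_pte_alloc() returned a pte
     * inside a shared pmd page, possibly shared with a third process:
     * don't copy or take references.
     */
    if (dst_pte == src_pte || !huge_pte_none(dst_entry))
            continue;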
      
      Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com
      Fixes: c5c99429 ("fix hugepages leak due to pagetable page sharing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7. 14 Nov, 2018 (1 commit)
• hugetlbfs: dirty pages as they are added to pagecache · fa5466d7
Mike Kravetz authored
      commit 22146c3ce98962436e401f7b7016a6f664c9ffb5 upstream.
      
      Some test systems were experiencing negative huge page reserve counts and
      incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
      removing clean pages from hugetlbfs file pagecaches.  When non-hugetlbfs
      explicit code removes the pages, the appropriate accounting is not
      performed.
      
      This can be recreated as follows:
       fallocate -l 2M /dev/hugepages/foo
       echo 1 > /proc/sys/vm/drop_caches
       fallocate -l 2M /dev/hugepages/foo
       grep -i huge /proc/meminfo
         AnonHugePages:         0 kB
         ShmemHugePages:        0 kB
         HugePages_Total:    2048
         HugePages_Free:     2047
         HugePages_Rsvd:    18446744073709551615
         HugePages_Surp:        0
         Hugepagesize:       2048 kB
         Hugetlb:         4194304 kB
       ls -lsh /dev/hugepages/foo
         4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
      
      To address this issue, dirty pages as they are added to pagecache.  This
      can easily be reproduced with fallocate as shown above.  Read faulted
      pages will eventually end up being marked dirty.  But there is a window
      where they are clean and could be impacted by code such as drop_caches.
      So, just dirty them all as they are added to the pagecache.
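
A hedged sketch of one plausible placement, at pagecache insertion
(illustrative, not the verbatim diff):

    int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
                               pgoff_t idx)
    {
            int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);

            if (err)
                    return err;
            /*
             * Dirty the page immediately so non-hugetlbfs paths such as
             * drop_caches never see it as clean and removable.
             */
            set_page_dirty(page);
            return 0;
    }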
      
      Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
      Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
8. 06 Oct, 2018 (2 commits)
9. 24 Aug, 2018 (2 commits)
• mm: Change return type int to vm_fault_t for fault handlers · 2b740303
Souptick Joarder authored
      Use new return type vm_fault_t for fault handler.  For now, this is just
      documenting that the function returns a VM_FAULT value rather than an
      errno.  Once all instances are converted, vm_fault_t will become a
      distinct type.
      
      Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      The aim is to change the return type of finish_fault() and
      handle_mm_fault() to vm_fault_t type.  As part of that clean up return
      type of all other recursively called functions have been changed to
      vm_fault_t type.
      
The places from which handle_mm_fault() is invoked will be changed to
vm_fault_t type as well, but in a separate patch.
      
vmf_error() is a newly introduced inline function in 4.17-rc6.
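
A hedged sketch of the converted pattern; demo_fault is a hypothetical
driver fault handler, not code from this patch:

    static vm_fault_t demo_fault(struct vm_fault *vmf)
    {
            struct page *page = alloc_page(GFP_KERNEL);

            if (!page)
                    return vmf_error(-ENOMEM);      /* errno -> VM_FAULT_* */

            vmf->page = page;
            return 0;       /* a VM_FAULT value, not an errno */
    }

    static const struct vm_operations_struct demo_vm_ops = {
            .fault = demo_fault,
    };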
      
      [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: fix race on soft-offlining free huge pages · 6bc9b564
Naoya Horiguchi authored
      Patch series "mm: soft-offline: fix race against page allocation".
      
Xishi recently reported an issue about a race on reusing the target
pages of soft offlining.  Discussion and analysis showed that we need
to make sure that setting PG_hwpoison is done in the right place, under
zone->lock, for soft offline.  Patch 1/2 handles the free hugepage
case, and patch 2/2 handles the free buddy page case.
      
      This patch (of 2):
      
      There's a race condition between soft offline and hugetlb_fault which
      causes unexpected process killing and/or hugetlb allocation failure.
      
      The process killing is caused by the following flow:
      
        CPU 0               CPU 1              CPU 2
      
        soft offline
          get_any_page
          // find the hugetlb is free
                            mmap a hugetlb file
                            page fault
                              ...
                                hugetlb_fault
                                  hugetlb_no_page
                                    alloc_huge_page
                                    // succeed
            soft_offline_free_page
            // set hwpoison flag
                                               mmap the hugetlb file
                                               page fault
                                                 ...
                                                   hugetlb_fault
                                                     hugetlb_no_page
                                                       find_lock_page
                                                         return VM_FAULT_HWPOISON
                                                 mm_fault_error
                                                   do_sigbus
                                                   // kill the process
      
      The hugetlb allocation failure comes from the following flow:
      
        CPU 0                          CPU 1
      
                                       mmap a hugetlb file
                                       // reserve all free page but don't fault-in
        soft offline
          get_any_page
          // find the hugetlb is free
            soft_offline_free_page
            // set hwpoison flag
              dissolve_free_huge_page
              // fail because all free hugepages are reserved
                                       page fault
                                         ...
                                           hugetlb_fault
                                             hugetlb_no_page
                                               alloc_huge_page
                                                 ...
                                                   dequeue_huge_page_node_exact
                                                   // ignore hwpoisoned hugepage
                                                   // and finally fail due to no-mem
      
The root cause of this is that the current soft-offline code is written
based on the assumption that the PageHWPoison flag should be set first
to avoid accessing the corrupted data.  This makes sense for
memory_failure() or hard offline, but not for soft offline, because
soft offline is about corrected (not uncorrected) errors and is safe
from data loss.  This patch changes the soft offline semantics so that
the PageHWPoison flag is set only after containment of the error page
completes successfully.
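
A hedged sketch of the new ordering for the free-hugepage case; the
helper name is hypothetical, and the actual patch folds this logic into
dissolve_free_huge_page() under hugetlb_lock:

    static int soft_offline_free_hugepage(struct page *page)
    {
            int ret = dissolve_free_huge_page(page);

            if (!ret) {
                    /*
                     * The page left the free pool successfully; only
                     * now is it safe to mark it poisoned.
                     */
                    SetPageHWPoison(page);
                    num_poisoned_pages_inc();
            }
            return ret;
    }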
      
Link: http://lkml.kernel.org/r/1531452366-11661-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Suggested-by: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <zy.zhengyi@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
10. 18 Aug, 2018 (4 commits)
11. 03 Aug, 2018 (1 commit)