1. 03 6月, 2021 1 次提交
  2. 19 4月, 2021 1 次提交
    • M
      hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings · c329e216
      Miaohe Lin 提交于
      stable inclusion
      from stable-5.10.27
      commit fe03ccc3ce906a31005637263fb82dd84d5d1dac
      bugzilla: 51493
      
      --------------------------------
      
      commit d85aecf2 upstream.
      
      The current implementation of hugetlb_cgroup for shared mappings could
      have different behavior.  Consider the following two scenarios:
      
       1.Assume initial css reference count of hugetlb_cgroup is 1:
        1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
            count is 2 associated with 1 file_region.
        1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
            count is 3 associated with 2 file_region.
        1.3 coalesce_file_region will coalesce these two file_regions into
            one. So css reference count is 3 associated with 1 file_region
            now.
      
       2.Assume initial css reference count of hugetlb_cgroup is 1 again:
        2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
            count is 2 associated with 1 file_region.
      
      Therefore, we might have one file_region while holding one or more css
      reference counts. This inconsistency could lead to imbalanced css_get()
      and css_put() pair. If we do css_put one by one (i.g. hole punch case),
      scenario 2 would put one more css reference. If we do css_put all
      together (i.g. truncate case), scenario 1 will leak one css reference.
      
      The imbalanced css_get() and css_put() pair would result in a non-zero
      reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
      directory is removed __but__ associated resource is not freed. This
      might result in OOM or can not create a new hugetlb cgroup in a busy
      workload ultimately.
      
      In order to fix this, we have to make sure that one file_region must
      hold exactly one css reference. So in coalesce_file_region case, we
      should release one css reference before coalescence. Also only put css
      reference when the entire file_region is removed.
      
      The last thing to note is that the caller of region_add() will only hold
      one reference to h_cg->css for the whole contiguous reservation region.
      But this area might be scattered when there are already some
      file_regions reside in it. As a result, many file_regions may share only
      one h_cg->css reference. In order to ensure that one file_region must
      hold exactly one css reference, we should do css_get() for each
      file_region and release the reference held by caller when they are done.
      
      [linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings]
        Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com
      
      Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com
      Fixes: 075a61d0 ("hugetlb_cgroup: add accounting for shared mappings")
      Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR)
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: N  Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      c329e216
  3. 09 4月, 2021 4 次提交
  4. 09 3月, 2021 4 次提交
  5. 28 1月, 2021 1 次提交
  6. 18 1月, 2021 1 次提交
    • M
      mm/hugetlb: fix deadlock in hugetlb_cow error path · 69d4df11
      Mike Kravetz 提交于
      stable inclusion
      from stable-5.10.5
      commit df73c80338ef397d5fb2fe2631d24e2256bed9bd
      bugzilla: 46931
      
      --------------------------------
      
      commit e7dd91c4 upstream.
      
      syzbot reported the deadlock here [1].  The issue is in hugetlb cow
      error handling when there are not enough huge pages for the faulting
      task which took the original reservation.  It is possible that other
      (child) tasks could have consumed pages associated with the reservation.
      In this case, we want the task which took the original reservation to
      succeed.  So, we unmap any associated pages in children so that they can
      be used by the faulting task that owns the reservation.
      
      The unmapping code needs to hold i_mmap_rwsem in write mode.  However,
      due to commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd
      sharing synchronization") we are already holding i_mmap_rwsem in read
      mode when hugetlb_cow is called.
      
      Technically, i_mmap_rwsem does not need to be held in read mode for COW
      mappings as they can not share pmd's.  Modifying the fault code to not
      take i_mmap_rwsem in read mode for COW (and other non-sharable) mappings
      is too involved for a stable fix.
      
      Instead, we simply drop the hugetlb_fault_mutex and i_mmap_rwsem before
      unmapping.  This is OK as it is technically not needed.  They are
      reacquired after unmapping as expected by calling code.  Since this is
      done in an uncommon error path, the overhead of dropping and reacquiring
      mutexes is acceptable.
      
      While making changes, remove redundant BUG_ON after unmap_ref_private.
      
      [1] https://lkml.kernel.org/r/000000000000b73ccc05b5cf8558@google.com
      
      Link: https://lkml.kernel.org/r/4c5781b8-3b00-761e-c0c7-c5edebb6ec1a@oracle.com
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: syzbot+5eee4145df3c15e96625@syzkaller.appspotmail.com
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      69d4df11
  7. 12 1月, 2021 1 次提交
  8. 12 12月, 2020 1 次提交
    • G
      mm/hugetlb: clear compound_nr before freeing gigantic pages · ba9c1201
      Gerald Schaefer 提交于
      Commit 1378a5ee ("mm: store compound_nr as well as compound_order")
      added compound_nr counter to first tail struct page, overlaying with
      page->mapping.  The overlay itself is fine, but while freeing gigantic
      hugepages via free_contig_range(), a "bad page" check will trigger for
      non-NULL page->mapping on the first tail page:
      
        BUG: Bad page state in process bash  pfn:380001
        page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
        aops:0x0
        flags: 0x3ffff00000000000()
        raw: 3ffff00000000000 0000000000000100 0000000000000122 0000000100000000
        raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
        page dumped because: non-NULL mapping
        Modules linked in:
        CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
        Hardware name: IBM 3906 M03 703 (LPAR)
        Call Trace:
          show_stack+0x6e/0xe8
          dump_stack+0x90/0xc8
          bad_page+0xd6/0x130
          free_pcppages_bulk+0x26a/0x800
          free_unref_page+0x6e/0x90
          free_contig_range+0x94/0xe8
          update_and_free_page+0x1c4/0x2c8
          free_pool_huge_page+0x11e/0x138
          set_max_huge_pages+0x228/0x300
          nr_hugepages_store_common+0xb8/0x130
          kernfs_fop_write+0xd2/0x218
          vfs_write+0xb0/0x2b8
          ksys_write+0xac/0xe0
          system_call+0xe6/0x288
        Disabling lock debugging due to kernel taint
      
      This is because only the compound_order is cleared in
      destroy_compound_gigantic_page(), and compound_nr is set to
      1U << order == 1 for order 0 in set_compound_order(page, 0).
      
      Fix this by explicitly clearing compound_nr for first tail page after
      calling set_compound_order(page, 0).
      
      Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.com
      Fixes: 1378a5ee ("mm: store compound_nr as well as compound_order")
      Signed-off-by: NGerald Schaefer <gerald.schaefer@linux.ibm.com>
      Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: <stable@vger.kernel.org>	[5.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba9c1201
  9. 15 11月, 2020 1 次提交
    • M
      hugetlbfs: fix anon huge page migration race · 336bf30e
      Mike Kravetz 提交于
      Qian Cai reported the following BUG in [1]
      
        LTP: starting move_pages12
        BUG: unable to handle page fault for address: ffffffffffffffe0
        ...
        RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
        Call Trace:
          rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
          try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
          migrate_pages+0x1005/0x1fb0
          move_pages_and_store_status.isra.47+0xd7/0x1a0
          __x64_sys_move_pages+0xa5c/0x1100
          do_syscall_64+0x5f/0x310
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Hugh Dickins diagnosed this as a migration bug caused by code introduced
      to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
      routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
      flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
      anon pages as the anon_vma_lock should be held in this case.  Further
      analysis suggested that i_mmap_rwsem was not required to he held at all
      when calling try_to_unmap for anon pages as an anon page could never be
      part of a shared pmd mapping.
      
      Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
      to drop page lock and acquire i_mmap_rwsem is wrong.  There is no way to
      keep mapping valid while dropping page lock.
      
      This patch does the following:
      
       - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
         calling try_to_unmap.
      
       - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
         will now simply do a 'trylock' while still holding the page lock. If
         the trylock fails, it will return NULL. This could impact the
         callers:
      
          - migration calling code will receive -EAGAIN and retry up to the
            hard coded limit (10).
      
          - memory error code will treat the page as BUSY. This will force
            killing (SIGKILL) instead of SIGBUS any mapping tasks.
      
         Do note that this change in behavior only happens when there is a
         race. None of the standard kernel testing suites actually hit this
         race, but it is possible.
      
      [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
      [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
      
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Reported-by: NQian Cai <cai@lca.pw>
      Suggested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      336bf30e
  10. 03 11月, 2020 1 次提交
  11. 14 10月, 2020 10 次提交
  12. 02 10月, 2020 1 次提交
  13. 06 9月, 2020 2 次提交
    • M
      mm/hugetlb: fix a race between hugetlb sysctl handlers · 17743798
      Muchun Song 提交于
      There is a race between the assignment of `table->data` and write value
      to the pointer of `table->data` in the __do_proc_doulongvec_minmax() on
      the other thread.
      
        CPU0:                                 CPU1:
                                              proc_sys_write
        hugetlb_sysctl_handler                  proc_sys_call_handler
        hugetlb_sysctl_handler_common             hugetlb_sysctl_handler
          table->data = &tmp;                       hugetlb_sysctl_handler_common
                                                      table->data = &tmp;
            proc_doulongvec_minmax
              do_proc_doulongvec_minmax           sysctl_head_finish
                __do_proc_doulongvec_minmax         unuse_table
                  i = table->data;
                  *i = val;  // corrupt CPU1's stack
      
      Fix this by duplicating the `table`, and only update the duplicate of
      it.  And introduce a helper of proc_hugetlb_doulongvec_minmax() to
      simplify the code.
      
      The following oops was seen:
      
          BUG: kernel NULL pointer dereference, address: 0000000000000000
          #PF: supervisor instruction fetch in kernel mode
          #PF: error_code(0x0010) - not-present page
          Code: Bad RIP value.
          ...
          Call Trace:
           ? set_max_huge_pages+0x3da/0x4f0
           ? alloc_pool_huge_page+0x150/0x150
           ? proc_doulongvec_minmax+0x46/0x60
           ? hugetlb_sysctl_handler_common+0x1c7/0x200
           ? nr_hugepages_store+0x20/0x20
           ? copy_fd_bitmaps+0x170/0x170
           ? hugetlb_sysctl_handler+0x1e/0x20
           ? proc_sys_call_handler+0x2f1/0x300
           ? unregister_sysctl_table+0xb0/0xb0
           ? __fd_install+0x78/0x100
           ? proc_sys_write+0x14/0x20
           ? __vfs_write+0x4d/0x90
           ? vfs_write+0xef/0x240
           ? ksys_write+0xc0/0x160
           ? __ia32_sys_read+0x50/0x50
           ? __close_fd+0x129/0x150
           ? __x64_sys_write+0x43/0x50
           ? do_syscall_64+0x6c/0x200
           ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: e5ff2159 ("hugetlb: multiple hstates for multiple page sizes")
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: http://lkml.kernel.org/r/20200828031146.43035-1-songmuchun@bytedance.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17743798
    • L
      mm/hugetlb: try preferred node first when alloc gigantic page from cma · 953f064a
      Li Xinhai 提交于
      Since commit cf11e85f ("mm: hugetlb: optionally allocate gigantic
      hugepages using cma"), the gigantic page would be allocated from node
      which is not the preferred node, although there are pages available from
      that node.  The reason is that the nid parameter has been ignored in
      alloc_gigantic_page().
      
      Besides, the __GFP_THISNODE also need be checked if user required to
      alloc only from the preferred node.
      
      After this patch, the preferred node is tried first before other allowed
      nodes, and don't try to allocate from other nodes if __GFP_THISNODE is
      specified.  If user don't specify the preferred node, the current node
      will be used as preferred node, which makes sure consistent behavior of
      allocating gigantic and non-gigantic hugetlb page.
      
      Fixes: cf11e85f ("mm: hugetlb: optionally allocate gigantic hugepages using cma")
      Signed-off-by: NLi Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Link: https://lkml.kernel.org/r/20200902025016.697260-1-lixinhai.lxh@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      953f064a
  14. 01 9月, 2020 1 次提交
  15. 13 8月, 2020 6 次提交
  16. 08 8月, 2020 2 次提交
    • P
      mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible · 75802ca6
      Peter Xu 提交于
      This is found by code observation only.
      
      Firstly, the worst case scenario should assume the whole range was covered
      by pmd sharing.  The old algorithm might not work as expected for ranges
      like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the
      expected range should be (0, 2g).
      
      Since at it, remove the loop since it should not be required.  With that,
      the new code should be faster too when the invalidating range is huge.
      
      Mike said:
      
      : With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
      : adjust to (0, 1g+2m) which is incorrect.
      :
      : We should cc stable.  The original reason for adjusting the range was to
      : prevent data corruption (getting wrong page).  Since the range is not
      : always adjusted correctly, the potential for corruption still exists.
      :
      : However, I am fairly confident that adjust_range_if_pmd_sharing_possible
      : is only gong to be called in two cases:
      :
      : 1) for a single page
      : 2) for range == entire vma
      :
      : In those cases, the current code should produce the correct results.
      :
      : To be safe, let's just cc stable.
      
      Fixes: 017b1660 ("mm: migration: fix migration of huge PMD shared pages")
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75802ca6
    • M
      mm: remove unneeded includes of <asm/pgalloc.h> · ca15ca40
      Mike Rapoport 提交于
      Patch series "mm: cleanup usage of <asm/pgalloc.h>"
      
      Most architectures have very similar versions of pXd_alloc_one() and
      pXd_free_one() for intermediate levels of page table.  These patches add
      generic versions of these functions in <asm-generic/pgalloc.h> and enable
      use of the generic functions where appropriate.
      
      In addition, functions declared and defined in <asm/pgalloc.h> headers are
      used mostly by core mm and early mm initialization in arch and there is no
      actual reason to have the <asm/pgalloc.h> included all over the place.
      The first patch in this series removes unneeded includes of
      <asm/pgalloc.h>
      
      In the end it didn't work out as neatly as I hoped and moving
      pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
      unnecessary changes to arches that have custom page table allocations, so
      I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
      to mm/.
      
      This patch (of 8):
      
      In most cases <asm/pgalloc.h> header is required only for allocations of
      page table memory.  Most of the .c files that include that header do not
      use symbols declared in <asm/pgalloc.h> and do not require that header.
      
      As for the other header files that used to include <asm/pgalloc.h>, it is
      possible to move that include into the .c file that actually uses symbols
      from <asm/pgalloc.h> and drop the include from the header file.
      
      The process was somewhat automated using
      
      	sed -i -E '/[<"]asm\/pgalloc\.h/d' \
                      $(grep -L -w -f /tmp/xx \
                              $(git grep -E -l '[<"]asm/pgalloc\.h'))
      
      where /tmp/xx contains all the symbols defined in
      arch/*/include/asm/pgalloc.h.
      
      [rppt@linux.ibm.com: fix powerpc warning]
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ca15ca40
  17. 25 7月, 2020 1 次提交
  18. 04 7月, 2020 1 次提交