1. 01 12月, 2022 3 次提交
    • M
      hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing · 04ada095
      Mike Kravetz 提交于
      madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
      tables associated with the address range.  For hugetlb vmas,
      zap_page_range will call __unmap_hugepage_range_final.  However,
      __unmap_hugepage_range_final assumes the passed vma is about to be removed
      and deletes the vma_lock to prevent pmd sharing as the vma is on the way
      out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
      missing vma_lock prevents pmd sharing and could potentially lead to issues
      with truncation/fault races.
      
      This issue was originally reported here [1] as a BUG triggered in
      page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
      vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
      prevent pmd sharing.  Subsequent faults on this vma were confused as
      VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
      not set in new pages added to the page table.  This resulted in pages that
      appeared anonymous in a VM_SHARED vma and triggered the BUG.
      
      Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
      call from unmap_vmas().  This is used to indicate the 'final' unmapping of
      a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
      the vm_lock is not deleted.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NWei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      04ada095
    • M
      madvise: use zap_page_range_single for madvise dontneed · 21b85b09
      Mike Kravetz 提交于
      This series addresses the issue first reported in [1], and fully described
      in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
      for stable backports.
      
      While exploring solutions to this issue, related problems with mmu
      notification calls were discovered.  This is addressed in the patch
      "hugetlb: remove duplicate mmu notifications:".  Since there are no user
      visible effects, this third is not tagged for stable backports.
      
      Previous discussions suggested further cleanup by removing the
      routine zap_page_range.  This is possible because zap_page_range_single
      is now exported, and all callers of zap_page_range pass ranges entirely
      within a single vma.  This work will be done in a later patch so as not
      to distract from this bug fix.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      
      This patch (of 2):
      
      Expose the routine zap_page_range_single to zap a range within a single
      vma.  The madvise routine madvise_dontneed_single_vma can use this routine
      as it explicitly operates on a single vma.  Also, update the mmu
      notification range in zap_page_range_single to take hugetlb pmd sharing
      into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NWei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      21b85b09
    • Y
      mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE · dec1d352
      Yang Shi 提交于
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node
      include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
      hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221
      alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted
      6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc
      ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9
      96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89
      f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      It is because khugepaged allocates pages with __GFP_THISNODE, but the
      preferred node is bogus.  The previous patch fixed the khugepaged code to
      avoid allocating page from non-existing node.  But it is still racy
      against memory hotremove.  There is no synchronization with the memory
      hotplug so it is possible that memory gets offline during a longer taking
      scanning.
      
      So this warning still seems not quite helpful because:
        * There is no guarantee the node is online for __GFP_THISNODE context
          for all the callsites.
        * Kernel just fails the allocation regardless the warning, and it looks
          all callsites handle the allocation failure gracefully.
      
      Although while the warning has helped to identify a buggy code, it is not
      safe in general and this warning could panic the system with panic-on-warn
      configuration which tends to be used surprisingly often.  So replace
      VM_WARN_ON to pr_warn().  And the warning will be triggered if
      __GFP_NOWARN is set since the allocator would print out warning for such
      case if __GFP_NOWARN is not set.
      
      [shy828301@gmail.com: rename nid to this_node and gfp to warn_gfp]
        Link: https://lkml.kernel.org/r/20221123193014.153983-1-shy828301@gmail.com
      [akpm@linux-foundation.org: fix whitespace]
      [akpm@linux-foundation.org: print gfp_mask instead of warn_gfp, per Michel]
      Link: https://lkml.kernel.org/r/20221108184357.55614-3-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: NYang Shi <shy828301@gmail.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      dec1d352
  2. 23 11月, 2022 24 次提交
    • L
      test_kprobes: fix implicit declaration error of test_kprobes · de3db3f8
      Li Hua 提交于
      If KPROBES_SANITY_TEST and ARCH_CORRECT_STACKTRACE_ON_KRETPROBE is enabled, but
      STACKTRACE is not set. Build failed as below:
      
      lib/test_kprobes.c: In function `stacktrace_return_handler':
      lib/test_kprobes.c:228:8: error: implicit declaration of function `stack_trace_save'; did you mean `stacktrace_driver'? [-Werror=implicit-function-declaration]
        ret = stack_trace_save(stack_buf, STACK_BUF_SIZE, 0);
              ^~~~~~~~~~~~~~~~
              stacktrace_driver
      cc1: all warnings being treated as errors
      scripts/Makefile.build:250: recipe for target 'lib/test_kprobes.o' failed
      make[2]: *** [lib/test_kprobes.o] Error 1
      
      To fix this error, Select STACKTRACE if ARCH_CORRECT_STACKTRACE_ON_KRETPROBE is enabled.
      
      Link: https://lkml.kernel.org/r/20221121030620.63181-1-hucool.lihua@huawei.com
      Fixes: 1f6d3a8f ("kprobes: Add a test case for stacktrace from kretprobe handler")
      Signed-off-by: NLi Hua <hucool.lihua@huawei.com>
      Acked-by: NMasami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      de3db3f8
    • C
      nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty · 512c5ca0
      Chen Zhongjin 提交于
      When extending segments, nilfs_sufile_alloc() is called to get an
      unassigned segment, then mark it as dirty to avoid accidentally allocating
      the same segment in the future.
      
      But for some special cases such as a corrupted image it can be unreliable.
      If such corruption of the dirty state of the segment occurs, nilfs2 may
      reallocate a segment that is in use and pick the same segment for writing
      twice at the same time.
      
      This will cause the problem reported by syzkaller:
      https://syzkaller.appspot.com/bug?id=c7c4748e11ffcc367cef04f76e02e931833cbd24
      
      This case started with segbuf1.segnum = 3, nextnum = 4 when constructed. 
      It supposed segment 4 has already been allocated and marked as dirty.
      
      However the dirty state was corrupted and segment 4 usage was not dirty. 
      For the first time nilfs_segctor_extend_segments() segment 4 was allocated
      again, which made segbuf2 and next segbuf3 had same segment 4.
      
      sb_getblk() will get same bh for segbuf2 and segbuf3, and this bh is added
      to both buffer lists of two segbuf.  It makes the lists broken which
      causes NULL pointer dereference.
      
      Fix the problem by setting usage as dirty every time in
      nilfs_sufile_mark_dirty(), which is called during constructing current
      segment to be written out and before allocating next segment.
      
      [chenzhongjin@huawei.com: add lock protection per Ryusuke]
        Link: https://lkml.kernel.org/r/20221121091141.214703-1-chenzhongjin@huawei.com
      Link: https://lkml.kernel.org/r/20221118063304.140187-1-chenzhongjin@huawei.com
      Fixes: 9ff05123 ("nilfs2: segment constructor")
      Signed-off-by: NChen Zhongjin <chenzhongjin@huawei.com>
      Reported-by: <syzbot+77e4f0...@syzkaller.appspotmail.com>
      Reported-by: NLiu Shixin <liushixin2@huawei.com>
      Acked-by: NRyusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: NRyusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      512c5ca0
    • A
      mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1 · 81a70c21
      Aneesh Kumar K.V 提交于
      balance_dirty_pages doesn't do the required dirty throttling on cgroupv1. 
      See commit 9badce00 ("cgroup, writeback: don't enable cgroup writeback
      on traditional hierarchies").  Instead, the kernel depends on writeback
      throttling in shrink_folio_list to achieve the same goal.  With large
      memory systems, the flusher may not be able to writeback quickly enough
      such that we will start finding pages in the shrink_folio_list already in
      writeback.  Hence for cgroupv1 let's do a reclaim throttle after waking up
      the flusher.
      
      The below test which used to fail on a 256GB system completes till the the
      file system is full with this change.
      
      root@lp2:/sys/fs/cgroup/memory# mkdir test
      root@lp2:/sys/fs/cgroup/memory# cd test/
      root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
      root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
      root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
      Killed
      
      Link: https://lkml.kernel.org/r/20221118070603.84081-1-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: zefan li <lizefan.x@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      81a70c21
    • Q
      mm: fix unexpected changes to {failslab|fail_page_alloc}.attr · ea4452de
      Qi Zheng 提交于
      When we specify __GFP_NOWARN, we only expect that no warnings will be
      issued for current caller.  But in the __should_failslab() and
      __should_fail_alloc_page(), the local GFP flags alter the global
      {failslab|fail_page_alloc}.attr, which is persistent and shared by all
      tasks.  This is not what we expected, let's fix it.
      
      [akpm@linux-foundation.org: unexport should_fail_ex()]
      Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
      Fixes: 3f913fc5 ("mm: fix missing handler for __GFP_NOWARN")
      Signed-off-by: NQi Zheng <zhengqi.arch@bytedance.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: NJason Gunthorpe <jgg@nvidia.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      ea4452de
    • C
      swapfile: fix soft lockup in scan_swap_map_slots · de1ccfb6
      Chen Wandun 提交于
      A softlockup occurs in scan free swap slot under huge memory pressure. 
      The test scenario is: 64 CPU cores, 64GB memory, and 28 zram devices, the
      disksize of each zram device is 50MB.
      
      LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(), but
      the real loop number would more than LATENCY_LIMIT because of "goto checks
      and goto scan" repeatly without decreasing latency limit.
      
      In order to fix it, decrease latency_ration in advance.
      
      There is also a suspicious place that will cause softlockups in
      get_swap_pages().  In this function, the "goto start_over" may result in
      continuous scanning of the swap partition.  If there is no cond_sched in
      scan_swap_map_slots(), it would cause a softlockup (I am not sure about
      this).
      
      WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
      CPU: 11 PID: 466 Comm: kswapd@ Kdump: loaded Tainted: G
      dump backtrace+0x0/0x1le4
      show stack+0x20/@x2c
      dump_stack+0xd8/0x140
      watchdog print_info+0x48/0x54
      watchdog_process_before_softlockup+0x98/0xa0
      watchdog_timer_fn+0xlac/0x2d0
      hrtimer_rum_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handLe_percpu_devid_irq+0x90/0x1f4
      handle domain irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      e11 ira+0xhB/Bx140
      scan_swap_map_slots+0x678/0x890
      get_swap_pages+0x29c/0x440
      get_swap_page+0x120/0x2e0
      add_to_swap+UX2U/0XyC
      shrink_page_list+0x5d0/0x152c
      shrink_inactive_list+0xl6c/Bx500
      shrink_lruvec+0x270/0x304
      
      WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
      watchdog_timer_fn+0x1ac/0x2d0
      __run_hrtimer+0x98/0x2a0
      __hrtimer_run_queues+0xb0/0x130
      hrtimer_interrupt+0x13c/0x3c0
      arch_timer_handler_virt+0x3c/0x50
      handle_percpu_devid_irq+0x90/0x1f4
      __handle_domain_irq+0x84/0x100
      gic_handle_irq+0x88/0x2b0
      el1_irq+0xb8/0x140
      get_swap_pages+0x1e8/0x440
      get_swap_page+0x1c8/0x2e0
      add_to_swap+0x20/0x9c
      shrink_page_list+0x5d0/0x152c
      reclaim_pages+0x160/0x310
      madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
      walk_pmd_range.isra.0+0xac/0x22c
      walk_pud_range+0xfc/0x1c0
      walk_pgd_range+0x158/0x1b0
      __walk_page_range+0x64/0x100
      walk_page_range+0x104/0x150
      
      Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
      Fixes: 048c27fd ("[PATCH] swap: scan_swap_map latency breaks")
      Signed-off-by: NChen Wandun <chenwandun@huawei.com>
      Reviewed-by: N"Huang, Ying" <ying.huang@intel.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: <xialonglong1@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      de1ccfb6
    • M
      hugetlb: fix __prep_compound_gigantic_page page flag setting · 7fb0728a
      Mike Kravetz 提交于
      Commit 2b21624f ("hugetlb: freeze allocated pages before creating
      hugetlb pages") changed the order page flags were cleared and set in the
      head page.  It moved the __ClearPageReserved after __SetPageHead. 
      However, there is a check to make sure __ClearPageReserved is never done
      on a head page.  If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG
      will be hit when creating a hugetlb gigantic page:
      
          page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
          ------------[ cut here ]------------
          kernel BUG at include/linux/page-flags.h:500!
          Call Trace will differ depending on whether hugetlb page is created
          at boot time or run time.
      
      Make sure to __ClearPageReserved BEFORE __SetPageHead.
      
      Link: https://lkml.kernel.org/r/20221118195249.178319-1-mike.kravetz@oracle.com
      Fixes: 2b21624f ("hugetlb: freeze allocated pages before creating hugetlb pages")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Acked-by: NMuchun Song <songmuchun@bytedance.com>
      Tested-by: NTarun Sahu <tsahu@linux.ibm.com>
      Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      7fb0728a
    • M
      kfence: fix stack trace pruning · 747c0f35
      Marco Elver 提交于
      Commit b1405135 ("mm/sl[au]b: generalize kmalloc subsystem")
      refactored large parts of the kmalloc subsystem, resulting in the stack
      trace pruning logic done by KFENCE to no longer work.
      
      While b1405135 attempted to fix the situation by including
      '__kmem_cache_free' in the list of functions KFENCE should skip through,
      this only works when the compiler actually optimized the tail call from
      kfree() to __kmem_cache_free() into a jump (and thus kfree() _not_
      appearing in the full stack trace to begin with).
      
      In some configurations, the compiler no longer optimizes the tail call
      into a jump, and __kmem_cache_free() appears in the stack trace.  This
      means that the pruned stack trace shown by KFENCE would include kfree()
      which is not intended - for example:
      
       | BUG: KFENCE: invalid free in kfree+0x7c/0x120
       |
       | Invalid free of 0xffff8883ed8fefe0 (in kfence-#126):
       |  kfree+0x7c/0x120
       |  test_double_free+0x116/0x1a9
       |  kunit_try_run_case+0x90/0xd0
       | [...]
      
      Fix it by moving __kmem_cache_free() to the list of functions that may be
      tail called by an allocator entry function, making the pruning logic work
      in both the optimized and unoptimized tail call cases.
      
      Link: https://lkml.kernel.org/r/20221118152216.3914899-1-elver@google.com
      Fixes: b1405135 ("mm/sl[au]b: generalize kmalloc subsystem")
      Signed-off-by: NMarco Elver <elver@google.com>
      Reviewed-by: NAlexander Potapenko <glider@google.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      747c0f35
    • Y
      proc/meminfo: fix spacing in SecPageTables · f850c849
      Yosry Ahmed 提交于
      SecPageTables has a tab after it instead of a space, this can break
      fragile parsers that depend on spaces after the stat names.
      
      Link: https://lkml.kernel.org/r/20221117043247.133294-1-yosryahmed@google.com
      Fixes: ebc97a52 ("mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.")
      Signed-off-by: NYosry Ahmed <yosryahmed@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NShakeel Butt <shakeelb@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      f850c849
    • Y
      mm: multi-gen LRU: retry folios written back while isolated · 359a5e14
      Yu Zhao 提交于
      The page reclaim isolates a batch of folios from the tail of one of the
      LRU lists and works on those folios one by one.  For a suitable
      swap-backed folio, if the swap device is async, it queues that folio for
      writeback.  After the page reclaim finishes an entire batch, it puts back
      the folios it queued for writeback to the head of the original LRU list.
      
      In the meantime, the page writeback flushes the queued folios also by
      batches.  Its batching logic is independent from that of the page reclaim.
      For each of the folios it writes back, the page writeback calls
      folio_rotate_reclaimable() which tries to rotate a folio to the tail.
      
      folio_rotate_reclaimable() only works for a folio after the page reclaim
      has put it back.  If an async swap device is fast enough, the page
      writeback can finish with that folio while the page reclaim is still
      working on the rest of the batch containing it.  In this case, that folio
      will remain at the head and the page reclaim will not retry it before
      reaching there.
      
      This patch adds a retry to evict_folios().  After evict_folios() has
      finished an entire batch and before it puts back folios it cannot free
      immediately, it retries those that may have missed the rotation.
      
      Before this patch, ~60% of folios swapped to an Intel Optane missed
      folio_rotate_reclaimable().  After this patch, ~99% of missed folios were
      reclaimed upon retry.
      
      This problem affects relatively slow async swap devices like Samsung 980
      Pro much less and does not affect sync swap devices like zram or zswap at
      all.
      
      Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com
      Fixes: ac35a490 ("mm: multi-gen LRU: minimal implementation")
      Signed-off-by: NYu Zhao <yuzhao@google.com>
      Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      359a5e14
    • S
      mailmap: update email address for Satya Priya · 47123d7f
      Satya Priya 提交于
      Add and also update email address, skakit@codeaurora.org is no longer
      active.
      
      Link: https://lkml.kernel.org/r/20221116105017.3018971-1-quic_c_skakit@quicinc.comSigned-off-by: NSatya Priya <quic_c_skakit@quicinc.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      47123d7f
    • A
      mm/migrate_device: return number of migrating pages in args->cpages · 44af0b45
      Alistair Popple 提交于
      migrate_vma->cpages originally contained a count of the number of pages
      migrating including non-present pages which can be populated directly on
      the target.
      
      Commit 241f6885 ("mm/migrate_device.c: refactor migrate_vma and
      migrate_device_coherent_page()") inadvertantly changed this to contain
      just the number of pages that were unmapped.  Usage of migrate_vma->cpages
      isn't documented, but most drivers use it to see if all the requested
      addresses can be migrated so restore the original behaviour.
      
      Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
      Fixes: 241f6885 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()")
      Signed-off-by: NAlistair Popple <apopple@nvidia.com>
      Reported-by: NRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: NRalph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      44af0b45
    • S
      kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible · 50c69721
      Sam James 提交于
      Add missing <linux/string.h> include for strcmp.
      
      Clang 16 makes -Wimplicit-function-declaration an error by default. 
      Unfortunately, out of tree modules may use this in configure scripts,
      which means failure might cause silent miscompilation or misconfiguration.
      
      For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
      or the (new) c-std-porting mailing list [3].
      
      [0] https://lwn.net/Articles/913505/
      [1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
      [2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
      [3] hosted at lists.linux.dev.
      
      [akpm@linux-foundation.org: remember "linux/"]
      Link: https://lkml.kernel.org/r/20221116182634.2823136-1-sam@gentoo.orgSigned-off-by: NSam James <sam@gentoo.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      50c69721
    • A
      db8e0998
    • A
      mailmap: update Alex Hung's email address · d39e2ad6
      Alex Hung 提交于
      I am no longer at Canonical and add entry of my personal email address.
      
      Link: https://lkml.kernel.org/r/20221114001302.671897-1-alex.hung@amd.comSigned-off-by: NAlex Hung <alexhung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      d39e2ad6
    • I
      mm: mmap: fix documentation for vma_mas_szero · 4a423440
      Ian Cowan 提交于
      When the struct_mm input, mm, was changed to a struct ma_state, mas, the
      documentation for the function was never updated.  This updates that
      documentation reference.
      
      Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aeroSigned-off-by: NIan Cowan <ian@linux.cowan.aero>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      4a423440
    • S
      mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed · 8468b486
      SeongJae Park 提交于
      A DAMON sysfs interface user can start DAMON with a scheme, remove the
      sysfs directory for the scheme, and then ask update of the scheme's stats.
      Because the schemes stats update logic isn't aware of the situation, it
      results in an invalid memory access.  Fix the bug by checking if the
      scheme sysfs directory exists.
      
      Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
      Fixes: 0ac32b8a ("mm/damon/sysfs: support DAMOS stats")
      Signed-off-by: NSeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[v5.18]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      8468b486
    • A
      mm/memory: return vm_fault_t result from migrate_to_ram() callback · 4a955bed
      Alistair Popple 提交于
      The migrate_to_ram() callback should always succeed, but in rare cases can
      fail usually returning VM_FAULT_SIGBUS.  Commit 16ce101d
      ("mm/memory.c: fix race when faulting a device private page") incorrectly
      stopped passing the return code up the stack.  Fix this by setting the ret
      variable, restoring the previous behaviour on migrate_to_ram() failure.
      
      Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
      Fixes: 16ce101d ("mm/memory.c: fix race when faulting a device private page")
      Signed-off-by: NAlistair Popple <apopple@nvidia.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      4a955bed
    • L
      mm: correctly charge compressed memory to its memcg · cd08d80e
      Li Liguang 提交于
      Kswapd will reclaim memory when memory pressure is high, the annonymous
      memory will be compressed and stored in the zpool if zswap is enabled. 
      The memcg_kmem_bypass() in get_obj_cgroup_from_page() will bypass the
      kernel thread and cause the compressed memory not be charged to its memory
      cgroup.
      
      Remove the memcg_kmem_bypass() call and properly charge compressed memory
      to its corresponding memory cgroup.
      
      Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org
      Fixes: f4840ccf ("zswap: memcg accounting")
      Signed-off-by: NLi Liguang <liliguang@baidu.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      cd08d80e
    • M
      ipc/shm: call underlying open/close vm_ops · b6305049
      Mike Kravetz 提交于
      Shared memory segments can be created that are backed by hugetlb pages. 
      When this happens, the vmas associated with any mappings (shmat) are
      marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
      ipc/shm (shm_vm_ops).  There is a mechanism to call the underlying hugetlb
      vm_ops, and this is done for most operations.  However, it is not done for
      open and close.
      
      This was not an issue until the introduction of the hugetlb vma_lock. 
      This lock structure is pointed to by vm_private_data and the open/close
      vm_ops help maintain this structure.  The special hugetlb routine called
      at fork took care of structure updates at fork time.  However,
      vma_splitting is not properly handled for ipc shared memory mappings
      backed by hugetlb pages.  This can result in a "kernel NULL pointer
      dereference" BUG or use after free as two vmas point to the same lock
      structure.
      
      Update the shm open and close routines to always call the underlying open
      and close routines.
      
      Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NDoug Nelson <doug.nelson@intel.com>
      Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com>
      Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      b6305049
    • M
      gcov: clang: fix the buffer overflow issue · a6f810ef
      Mukesh Ojha 提交于
      Currently, in clang version of gcov code when module is getting removed
      gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
      dst->functions and it result in the kernel panic in below crash report. 
      Fix this by properly handling it.
      
      [    8.899094][  T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000
      [    8.899100][  T599] Mem abort info:
      [    8.899102][  T599]   ESR = 0x9600004f
      [    8.899103][  T599]   EC = 0x25: DABT (current EL), IL = 32 bits
      [    8.899105][  T599]   SET = 0, FnV = 0
      [    8.899107][  T599]   EA = 0, S1PTW = 0
      [    8.899108][  T599]   FSC = 0x0f: level 3 permission fault
      [    8.899110][  T599] Data abort info:
      [    8.899111][  T599]   ISV = 0, ISS = 0x0000004f
      [    8.899113][  T599]   CM = 0, WnR = 1
      [    8.899114][  T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000
      [    8.899116][  T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787
      [    8.899124][  T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP
      [    8.899265][  T599] Skip md ftrace buffer dump for: 0x1609e0
      ....
      ..,
      [    8.899544][  T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S         OE     5.15.41-android13-8-g38e9b1af6bce #1
      [    8.899547][  T599] Hardware name: XXX (DT)
      [    8.899549][  T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
      [    8.899551][  T599] pc : gcov_info_add+0x9c/0xb8
      [    8.899557][  T599] lr : gcov_event+0x28c/0x6b8
      [    8.899559][  T599] sp : ffffffc00e733b00
      [    8.899560][  T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470
      [    8.899563][  T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000
      [    8.899566][  T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000
      [    8.899569][  T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058
      [    8.899572][  T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785
      [    8.899575][  T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785
      [    8.899577][  T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80
      [    8.899580][  T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000
      [    8.899583][  T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f
      [    8.899586][  T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0
      [    8.899589][  T599] Call trace:
      [    8.899590][  T599]  gcov_info_add+0x9c/0xb8
      [    8.899592][  T599]  gcov_module_notifier+0xbc/0x120
      [    8.899595][  T599]  blocking_notifier_call_chain+0xa0/0x11c
      [    8.899598][  T599]  do_init_module+0x2a8/0x33c
      [    8.899600][  T599]  load_module+0x23cc/0x261c
      [    8.899602][  T599]  __arm64_sys_finit_module+0x158/0x194
      [    8.899604][  T599]  invoke_syscall+0x94/0x2bc
      [    8.899607][  T599]  el0_svc_common+0x1d8/0x34c
      [    8.899609][  T599]  do_el0_svc+0x40/0x54
      [    8.899611][  T599]  el0_svc+0x94/0x2f0
      [    8.899613][  T599]  el0t_64_sync_handler+0x88/0xec
      [    8.899615][  T599]  el0t_64_sync+0x1b4/0x1b8
      [    8.899618][  T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c)
      [    8.899620][  T599] ---[ end trace ed5218e9e5b6e2e6 ]---
      
      Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com
      Fixes: e178a5be ("gcov: clang support")
      Signed-off-by: NMukesh Ojha <quic_mojha@quicinc.com>
      Reviewed-by: NPeter Oberparleiter <oberpar@linux.ibm.com>
      Tested-by: NPeter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      a6f810ef
    • G
      mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove filename from function call · 045634ff
      Gautam Menghani 提交于
      Refactor the mm_khugepaged_scan_file tracepoint to move filename
      dereference to the tracepoint definition, to maintain consistency with
      other tracepoints[1].
      
      [1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/
      
      Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
      Fixes: d41fd201 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
      Signed-off-by: NGautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Reviewed-by: NZach O'Keefe <zokeefe@google.com>
      Reviewed-by: NSteven Rostedt (Google) <rostedt@goodmis.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      045634ff
    • C
      mm/page_exit: fix kernel doc warning in page_ext_put() · ed86b748
      Charan Teja Kalla 提交于
      Fix the below compiler warnings reported with 'make W=1 mm/'. 
      mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not
      described in 'page_ext_put'.
      
      [quic_pkondeti@quicinc.com: better patch title]
      Link: https://lkml.kernel.org/r/1667884582-2465-1-git-send-email-quic_charante@quicinc.com
      Fixes: b1d5488a ("mm: fix use-after free of page_ext after race with memory-offline")
      Signed-off-by: NCharan Teja Kalla <quic_charante@quicinc.com>
      Reported-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      ed86b748
    • Y
      mm: khugepaged: allow page allocation fallback to eligible nodes · e031ff96
      Yang Shi 提交于
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      The khugepaged code would pick up the node with the most hit as the preferred
      node, and also tries to do some balance if several nodes have the same
      hit record.  Basically it does conceptually:
          * If the target_node <= last_target_node, then iterate from
      last_target_node + 1 to MAX_NUMNODES (1024 on default config)
          * If the max_value == node_load[nid], then target_node = nid
      
      But there is a corner case, paritucularly for MADV_COLLAPSE, that the
      non-existing node may be returned as preferred node.
      
      Assuming the system has 2 nodes, the target_node is 0 and the
      last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may
      be 0, then it may return 2 for target_node, but it is actually not
      existing (offline), so the warn is triggered.
      
      The node balance was introduced by commit 9f1b868a ("mm: thp:
      khugepaged: add policy for finding target node") to satisfy
      "numactl --interleave=all".  But interleaving is a mere hint rather than
      something that has hard requirements.
      
      So use nodemask to record the nodes which have the same hit record, the
      hugepage allocation could fallback to those nodes.  And remove
      __GFP_THISNODE since it does disallow fallback.  And if the nodemask
      just has one node set, it means there is one single node has the most
      hit record, the nodemask approach actually behaves like __GFP_THISNODE.
      
      Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: NYang Shi <shy828301@gmail.com>
      Suggested-by: NZach O'Keefe <zokeefe@google.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NZach O'Keefe <zokeefe@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      e031ff96
    • J
      mm: vmscan: fix extreme overreclaim and swap floods · f53af428
      Johannes Weiner 提交于
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance to their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefor must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
      
      This fixes the extreme overreclaim problem.
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      f53af428
  3. 09 11月, 2022 13 次提交