1. 23 11月, 2022 15 次提交
    • S
      mailmap: update email address for Satya Priya · 47123d7f
      Satya Priya 提交于
      Add and also update email address, skakit@codeaurora.org is no longer
      active.
      
      Link: https://lkml.kernel.org/r/20221116105017.3018971-1-quic_c_skakit@quicinc.comSigned-off-by: NSatya Priya <quic_c_skakit@quicinc.com>
      Cc: Konrad Dybcio <konrad.dybcio@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      47123d7f
    • A
      mm/migrate_device: return number of migrating pages in args->cpages · 44af0b45
      Alistair Popple 提交于
      migrate_vma->cpages originally contained a count of the number of pages
      migrating including non-present pages which can be populated directly on
      the target.
      
      Commit 241f6885 ("mm/migrate_device.c: refactor migrate_vma and
      migrate_device_coherent_page()") inadvertantly changed this to contain
      just the number of pages that were unmapped.  Usage of migrate_vma->cpages
      isn't documented, but most drivers use it to see if all the requested
      addresses can be migrated so restore the original behaviour.
      
      Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com
      Fixes: 241f6885 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()")
      Signed-off-by: NAlistair Popple <apopple@nvidia.com>
      Reported-by: NRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: NRalph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      44af0b45
    • S
      kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible · 50c69721
      Sam James 提交于
      Add missing <linux/string.h> include for strcmp.
      
      Clang 16 makes -Wimplicit-function-declaration an error by default. 
      Unfortunately, out of tree modules may use this in configure scripts,
      which means failure might cause silent miscompilation or misconfiguration.
      
      For more information, see LWN.net [0] or LLVM's Discourse [1], gentoo-dev@ [2],
      or the (new) c-std-porting mailing list [3].
      
      [0] https://lwn.net/Articles/913505/
      [1] https://discourse.llvm.org/t/configure-script-breakage-with-the-new-werror-implicit-function-declaration/65213
      [2] https://archives.gentoo.org/gentoo-dev/message/dd9f2d3082b8b6f8dfbccb0639e6e240
      [3] hosted at lists.linux.dev.
      
      [akpm@linux-foundation.org: remember "linux/"]
      Link: https://lkml.kernel.org/r/20221116182634.2823136-1-sam@gentoo.orgSigned-off-by: NSam James <sam@gentoo.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      50c69721
    • A
      db8e0998
    • A
      mailmap: update Alex Hung's email address · d39e2ad6
      Alex Hung 提交于
      I am no longer at Canonical and add entry of my personal email address.
      
      Link: https://lkml.kernel.org/r/20221114001302.671897-1-alex.hung@amd.comSigned-off-by: NAlex Hung <alexhung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      d39e2ad6
    • I
      mm: mmap: fix documentation for vma_mas_szero · 4a423440
      Ian Cowan 提交于
      When the struct_mm input, mm, was changed to a struct ma_state, mas, the
      documentation for the function was never updated.  This updates that
      documentation reference.
      
      Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aeroSigned-off-by: NIan Cowan <ian@linux.cowan.aero>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      4a423440
    • S
      mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed · 8468b486
      SeongJae Park 提交于
      A DAMON sysfs interface user can start DAMON with a scheme, remove the
      sysfs directory for the scheme, and then ask update of the scheme's stats.
      Because the schemes stats update logic isn't aware of the situation, it
      results in an invalid memory access.  Fix the bug by checking if the
      scheme sysfs directory exists.
      
      Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org
      Fixes: 0ac32b8a ("mm/damon/sysfs: support DAMOS stats")
      Signed-off-by: NSeongJae Park <sj@kernel.org>
      Cc: <stable@vger.kernel.org>	[v5.18]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      8468b486
    • A
      mm/memory: return vm_fault_t result from migrate_to_ram() callback · 4a955bed
      Alistair Popple 提交于
      The migrate_to_ram() callback should always succeed, but in rare cases can
      fail usually returning VM_FAULT_SIGBUS.  Commit 16ce101d
      ("mm/memory.c: fix race when faulting a device private page") incorrectly
      stopped passing the return code up the stack.  Fix this by setting the ret
      variable, restoring the previous behaviour on migrate_to_ram() failure.
      
      Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com
      Fixes: 16ce101d ("mm/memory.c: fix race when faulting a device private page")
      Signed-off-by: NAlistair Popple <apopple@nvidia.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      4a955bed
    • L
      mm: correctly charge compressed memory to its memcg · cd08d80e
      Li Liguang 提交于
      Kswapd will reclaim memory when memory pressure is high, the annonymous
      memory will be compressed and stored in the zpool if zswap is enabled. 
      The memcg_kmem_bypass() in get_obj_cgroup_from_page() will bypass the
      kernel thread and cause the compressed memory not be charged to its memory
      cgroup.
      
      Remove the memcg_kmem_bypass() call and properly charge compressed memory
      to its corresponding memory cgroup.
      
      Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org
      Fixes: f4840ccf ("zswap: memcg accounting")
      Signed-off-by: NLi Liguang <liliguang@baidu.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NMuchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: <stable@vger.kernel.org>	[5.19+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      cd08d80e
    • M
      ipc/shm: call underlying open/close vm_ops · b6305049
      Mike Kravetz 提交于
      Shared memory segments can be created that are backed by hugetlb pages. 
      When this happens, the vmas associated with any mappings (shmat) are
      marked VM_HUGETLB, yet the vm_ops for such mappings are provided by
      ipc/shm (shm_vm_ops).  There is a mechanism to call the underlying hugetlb
      vm_ops, and this is done for most operations.  However, it is not done for
      open and close.
      
      This was not an issue until the introduction of the hugetlb vma_lock. 
      This lock structure is pointed to by vm_private_data and the open/close
      vm_ops help maintain this structure.  The special hugetlb routine called
      at fork took care of structure updates at fork time.  However,
      vma_splitting is not properly handled for ipc shared memory mappings
      backed by hugetlb pages.  This can result in a "kernel NULL pointer
      dereference" BUG or use after free as two vmas point to the same lock
      structure.
      
      Update the shm open and close routines to always call the underlying open
      and close routines.
      
      Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com
      Fixes: 8d9bfb26 ("hugetlb: add vma based lock for pmd sharing")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NDoug Nelson <doug.nelson@intel.com>
      Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com>
      Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      b6305049
    • M
      gcov: clang: fix the buffer overflow issue · a6f810ef
      Mukesh Ojha 提交于
      Currently, in clang version of gcov code when module is getting removed
      gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
      dst->functions and it result in the kernel panic in below crash report. 
      Fix this by properly handling it.
      
      [    8.899094][  T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000
      [    8.899100][  T599] Mem abort info:
      [    8.899102][  T599]   ESR = 0x9600004f
      [    8.899103][  T599]   EC = 0x25: DABT (current EL), IL = 32 bits
      [    8.899105][  T599]   SET = 0, FnV = 0
      [    8.899107][  T599]   EA = 0, S1PTW = 0
      [    8.899108][  T599]   FSC = 0x0f: level 3 permission fault
      [    8.899110][  T599] Data abort info:
      [    8.899111][  T599]   ISV = 0, ISS = 0x0000004f
      [    8.899113][  T599]   CM = 0, WnR = 1
      [    8.899114][  T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000
      [    8.899116][  T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787
      [    8.899124][  T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP
      [    8.899265][  T599] Skip md ftrace buffer dump for: 0x1609e0
      ....
      ..,
      [    8.899544][  T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S         OE     5.15.41-android13-8-g38e9b1af6bce #1
      [    8.899547][  T599] Hardware name: XXX (DT)
      [    8.899549][  T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
      [    8.899551][  T599] pc : gcov_info_add+0x9c/0xb8
      [    8.899557][  T599] lr : gcov_event+0x28c/0x6b8
      [    8.899559][  T599] sp : ffffffc00e733b00
      [    8.899560][  T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470
      [    8.899563][  T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000
      [    8.899566][  T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000
      [    8.899569][  T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058
      [    8.899572][  T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785
      [    8.899575][  T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785
      [    8.899577][  T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80
      [    8.899580][  T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000
      [    8.899583][  T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f
      [    8.899586][  T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0
      [    8.899589][  T599] Call trace:
      [    8.899590][  T599]  gcov_info_add+0x9c/0xb8
      [    8.899592][  T599]  gcov_module_notifier+0xbc/0x120
      [    8.899595][  T599]  blocking_notifier_call_chain+0xa0/0x11c
      [    8.899598][  T599]  do_init_module+0x2a8/0x33c
      [    8.899600][  T599]  load_module+0x23cc/0x261c
      [    8.899602][  T599]  __arm64_sys_finit_module+0x158/0x194
      [    8.899604][  T599]  invoke_syscall+0x94/0x2bc
      [    8.899607][  T599]  el0_svc_common+0x1d8/0x34c
      [    8.899609][  T599]  do_el0_svc+0x40/0x54
      [    8.899611][  T599]  el0_svc+0x94/0x2f0
      [    8.899613][  T599]  el0t_64_sync_handler+0x88/0xec
      [    8.899615][  T599]  el0t_64_sync+0x1b4/0x1b8
      [    8.899618][  T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c)
      [    8.899620][  T599] ---[ end trace ed5218e9e5b6e2e6 ]---
      
      Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com
      Fixes: e178a5be ("gcov: clang support")
      Signed-off-by: NMukesh Ojha <quic_mojha@quicinc.com>
      Reviewed-by: NPeter Oberparleiter <oberpar@linux.ibm.com>
      Tested-by: NPeter Oberparleiter <oberpar@linux.ibm.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      a6f810ef
    • G
      mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove filename from function call · 045634ff
      Gautam Menghani 提交于
      Refactor the mm_khugepaged_scan_file tracepoint to move filename
      dereference to the tracepoint definition, to maintain consistency with
      other tracepoints[1].
      
      [1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/
      
      Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
      Fixes: d41fd201 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
      Signed-off-by: NGautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: NYang Shi <shy828301@gmail.com>
      Reviewed-by: NZach O'Keefe <zokeefe@google.com>
      Reviewed-by: NSteven Rostedt (Google) <rostedt@goodmis.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      045634ff
    • C
      mm/page_exit: fix kernel doc warning in page_ext_put() · ed86b748
      Charan Teja Kalla 提交于
      Fix the below compiler warnings reported with 'make W=1 mm/'. 
      mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not
      described in 'page_ext_put'.
      
      [quic_pkondeti@quicinc.com: better patch title]
      Link: https://lkml.kernel.org/r/1667884582-2465-1-git-send-email-quic_charante@quicinc.com
      Fixes: b1d5488a ("mm: fix use-after free of page_ext after race with memory-offline")
      Signed-off-by: NCharan Teja Kalla <quic_charante@quicinc.com>
      Reported-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      ed86b748
    • Y
      mm: khugepaged: allow page allocation fallback to eligible nodes · e031ff96
      Yang Shi 提交于
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      The khugepaged code would pick up the node with the most hit as the preferred
      node, and also tries to do some balance if several nodes have the same
      hit record.  Basically it does conceptually:
          * If the target_node <= last_target_node, then iterate from
      last_target_node + 1 to MAX_NUMNODES (1024 on default config)
          * If the max_value == node_load[nid], then target_node = nid
      
      But there is a corner case, paritucularly for MADV_COLLAPSE, that the
      non-existing node may be returned as preferred node.
      
      Assuming the system has 2 nodes, the target_node is 0 and the
      last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may
      be 0, then it may return 2 for target_node, but it is actually not
      existing (offline), so the warn is triggered.
      
      The node balance was introduced by commit 9f1b868a ("mm: thp:
      khugepaged: add policy for finding target node") to satisfy
      "numactl --interleave=all".  But interleaving is a mere hint rather than
      something that has hard requirements.
      
      So use nodemask to record the nodes which have the same hit record, the
      hugepage allocation could fallback to those nodes.  And remove
      __GFP_THISNODE since it does disallow fallback.  And if the nodemask
      just has one node set, it means there is one single node has the most
      hit record, the nodemask approach actually behaves like __GFP_THISNODE.
      
      Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: NYang Shi <shy828301@gmail.com>
      Suggested-by: NZach O'Keefe <zokeefe@google.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NZach O'Keefe <zokeefe@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      e031ff96
    • J
      mm: vmscan: fix extreme overreclaim and swap floods · f53af428
      Johannes Weiner 提交于
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance to their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefor must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
      
      This fixes the extreme overreclaim problem.
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      f53af428
  2. 09 11月, 2022 22 次提交
  3. 07 11月, 2022 3 次提交
    • L
      Linux 6.1-rc4 · f0c4d9fc
      Linus Torvalds 提交于
      f0c4d9fc
    • L
      Merge tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 16c7a368
      Linus Torvalds 提交于
      Pull cxl fixes from Dan Williams:
       "Several fixes for CXL region creation crashes, leaks and failures.
      
        This is mainly fallout from the original implementation of dynamic CXL
        region creation (instantiate new physical memory pools) that arrived
        in v6.0-rc1.
      
        Given the theme of "failures in the presence of pass-through decoders"
        this also includes new regression test infrastructure for that case.
      
        Summary:
      
         - Fix region creation crash with pass-through decoders
      
         - Fix region creation crash when no decoder allocation fails
      
         - Fix region creation crash when scanning regions to enforce the
           increasing physical address order constraint that CXL mandates
      
         - Fix a memory leak for cxl_pmem_region objects, track 1:N instead of
           1:1 memory-device-to-region associations.
      
         - Fix a memory leak for cxl_region objects when regions with active
           targets are deleted
      
         - Fix assignment of NUMA nodes to CXL regions by CFMWS (CXL Window)
           emulated proximity domains.
      
         - Fix region creation failure for switch attached devices downstream
           of a single-port host-bridge
      
         - Fix false positive memory leak of cxl_region objects by recycling
           recently used region ids rather than freeing them
      
         - Add regression test infrastructure for a pass-through decoder
           configuration
      
         - Fix some mailbox payload handling corner cases"
      
      * tag 'cxl-fixes-for-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl/region: Recycle region ids
        cxl/region: Fix 'distance' calculation with passthrough ports
        tools/testing/cxl: Add a single-port host-bridge regression config
        tools/testing/cxl: Fix some error exits
        cxl/pmem: Fix cxl_pmem_region and cxl_memdev leak
        cxl/region: Fix cxl_region leak, cleanup targets at region delete
        cxl/region: Fix region HPA ordering validation
        cxl/pmem: Use size_add() against integer overflow
        cxl/region: Fix decoder allocation crash
        ACPI: NUMA: Add CXL CFMWS 'nodes' to the possible nodes set
        cxl/pmem: Fix failure to account for 8 byte header for writes to the device LSA.
        cxl/region: Fix null pointer dereference due to pass through decoder commit
        cxl/mbox: Add a check on input payload size
      16c7a368
    • L
      Merge tag 'hwmon-for-v6.1-rc4' of... · aa529949
      Linus Torvalds 提交于
      Merge tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
       "Fix two regressions:
      
         - Commit 54cc3dbf ("hwmon: (pmbus) Add regulator supply into
           macro") resulted in regulator undercount when disabling regulators.
           Revert it.
      
         - The thermal subsystem rework caused the scmi driver to no longer
           register with the thermal subsystem because index values no longer
           match. To fix the problem, the scmi driver now directly registers
           with the thermal subsystem, no longer through the hwmon core"
      
      * tag 'hwmon-for-v6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        Revert "hwmon: (pmbus) Add regulator supply into macro"
        hwmon: (scmi) Register explicitly with Thermal Framework
      aa529949