1. 01 December, 2022 (2 commits)
  2. 23 November, 2022 (2 commits)
    • mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove filename from function call · 045634ff
      Authored by Gautam Menghani
      Refactor the mm_khugepaged_scan_file tracepoint to move the filename
      dereference into the tracepoint definition, maintaining consistency with
      other tracepoints[1].
      
      [1]: https://lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/
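
      A minimal sketch of the shape of the change (field set abbreviated;
      treat names as illustrative): callers pass the struct file, and the
      name is resolved once inside TP_fast_assign instead of at each call
      site:

              TRACE_EVENT(mm_khugepaged_scan_file,

                      TP_PROTO(struct mm_struct *mm, struct page *page,
                               struct file *file, int present, int swap,
                               int result),

                      TP_ARGS(mm, page, file, present, swap, result),

                      TP_STRUCT__entry(
                              __string(filename, file->f_path.dentry->d_iname)
                              /* ... remaining fields elided ... */
                      ),

                      TP_fast_assign(
                              /* the dereference happens here, in one place */
                              __assign_str(filename, file->f_path.dentry->d_iname);
                      ),

                      TP_printk("filename=%s", __get_str(filename))
              );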
      
      Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
      Fixes: d41fd201 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
      Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: khugepaged: allow page allocation fallback to eligible nodes · e031ff96
      Authored by Yang Shi
      Syzbot reported the below splat:
      
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Modules linked in:
      CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga7038524 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline]
      RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline]
      RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963
      Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae
      RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001
      RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715
       hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156
       madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419
       do_madvise mm/madvise.c:1432 [inline]
       __do_sys_madvise mm/madvise.c:1432 [inline]
       __se_sys_madvise mm/madvise.c:1430 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f6b48a4eef9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4
      R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000
       </TASK>
      
      The khugepaged code picks the node with the most hits as the preferred
      node, and also tries to balance between nodes if several nodes share
      the same hit count.  Conceptually, it does:
          * If target_node <= last_target_node, then iterate from
            last_target_node + 1 to MAX_NUMNODES (1024 with the default config)
          * If max_value == node_load[nid], then target_node = nid
      
      But there is a corner case, particularly for MADV_COLLAPSE, where a
      non-existent node may be returned as the preferred node.
      
      Assume the system has 2 nodes, target_node is 0, and last_target_node
      is 1.  If the MADV_COLLAPSE path is hit, max_value may be 0, so 2 may
      be returned as target_node - but that node doesn't actually exist (it
      is offline), so the warning is triggered.
      
      The node balance was introduced by commit 9f1b868a ("mm: thp:
      khugepaged: add policy for finding target node") to satisfy
      "numactl --interleave=all".  But interleaving is a mere hint rather than
      something that has hard requirements.
      
      So use a nodemask to record the nodes that share the top hit count;
      the hugepage allocation can then fall back to those nodes.  Also
      remove __GFP_THISNODE, since it disallows fallback.  If the nodemask
      has just one node set, a single node has the most hits, and the
      nodemask approach behaves just like __GFP_THISNODE.
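
      A sketch of the direction of the fix (names follow the description
      above; treat the details as illustrative).  Every node tying for the
      top hit count is set in a nodemask, and the allocation, no longer
      marked __GFP_THISNODE, may fall back within that mask:

              static int hpage_collapse_find_target_node(struct collapse_control *cc)
              {
                      int nid, target_node = 0, max_value = 0;

                      /* find the first node with the most hits */
                      for (nid = 0; nid < MAX_NUMNODES; nid++) {
                              if (cc->node_load[nid] > max_value) {
                                      max_value = cc->node_load[nid];
                                      target_node = nid;
                              }
                      }

                      /* allow fallback to any online node that ties */
                      for_each_online_node(nid) {
                              if (max_value == cc->node_load[nid])
                                      node_set(nid, cc->alloc_nmask);
                      }

                      return target_node;
              }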
      
      Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Suggested-by: Zach O'Keefe <zokeefe@google.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 04 October, 2022 (7 commits)
    • mm/khugepaged: add tracepoint to hpage_collapse_scan_file() · d41fd201
      Authored by Zach O'Keefe
      Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
      hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
      
      While this change is targeted at debugging the MADV_COLLAPSE pathway,
      the "mm_khugepaged" prefix is retained for symmetry with
      huge_memory:trace_mm_khugepaged_scan_pmd, which keeps its legacy name
      to avoid changing the kernel ABI as much as possible.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-5-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-5-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: add file and shmem support to MADV_COLLAPSE · 34488399
      Authored by Zach O'Keefe
      Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
      memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
      
      On success, the backing memory will be a hugepage.  For the memory range
      and process provided, the page tables will synchronously have a huge pmd
      installed, mapping the THP.  Other mappings of the file extent mapped by
      the memory range may be added to a set of entries that khugepaged will
      later process, attempting to update their page tables to map the THP
      by a pmd.
      
      This functionality unlocks two important uses:
      
      (1)	Immediately back executable text by THPs.  Current support provided
      	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
      	system, which might keep services from serving at their full rated
      	load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevent
      	page sharing and demand paging, both of which increase steady-state
      	memory footprint.  Now, we can have the best of both worlds: peak
      	upfront performance and lower RAM footprints.
      
      (2)	Userfaultfd-based live migration of virtual machines satisfies UFFD
      	faults by fetching native-sized pages over the network (to avoid
      	the latency of transferring an entire hugepage).  However, after
      	guest memory has been fully copied to the new host, MADV_COLLAPSE
      	can be used to immediately increase guest performance (a usage
      	sketch follows this list).
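
      As a usage sketch for these cases (hypothetical address and length;
      MADV_COLLAPSE may need defining by hand with older headers), collapsing
      an already-mapped file or shmem region is a single madvise(2) call:

              #include <sys/mman.h>

              #ifndef MADV_COLLAPSE
              #define MADV_COLLAPSE 25
              #endif

              /* text: start of a hugepage-aligned, file-backed executable
               * mapping; file-backed text needs CONFIG_READ_ONLY_THP_FOR_FS */
              if (madvise(text, 2 * 1024 * 1024, MADV_COLLAPSE))
                      perror("madvise(MADV_COLLAPSE)");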
      
      Since khugepaged is single-threaded, this change introduces the
      possibility of collapse contexts racing in the file collapse path.
      There are a few important places to consider:
      
      (1)	hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
      	We could have the memory collapsed out from under us, but
      	the next xas_for_each() iteration will correctly pick up the
      	hugepage.  The hugepage might not be up to date (insofar as
      	copying of small page contents might not have completed - the
      	page still may be locked), but regardless what small page index
      	we were iterating over, we'll find the hugepage and identify it
      	as a suitably aligned compound page of order HPAGE_PMD_ORDER.
      
      	In khugepaged path, we locklessly check the value of the pmd,
      	and only add it to deferred collapse array if we find pmd
      	mapping pte table. This is fine, since other values that could
      	have raced in right afterwards denote failure, or that the
      	memory was successfully collapsed, so we don't need further
      	processing.
      
      	In madvise path, we'll take mmap_lock() in write to serialize
      	against page table updates and will know what to do based on the
      	true value of the pmd: recheck all ptes if we point to a pte table,
      	directly install the pmd, if the pmd has been cleared, but
      	memory not yet faulted, or nothing at all if we find a huge pmd.
      
      	It's worth emphasizing how we treat the none pmd here.  If
      	khugepaged has processed this mm's page tables already, it will
      	have left the pmd cleared (ready for refault by the process).
      	Depending on the VMA flags and sysfs settings, the amount of RAM
      	on the machine, and the current load, this could be a relatively
      	common occurrence - and as such is one we'd like to handle
      	successfully in MADV_COLLAPSE.  When we see the none pmd in
      	collapse_pte_mapped_thp(), we've locked mmap_lock in write and
      	checked (a) hugepage_vma_check() to see if the backing memory is
      	still appropriate, along with VMA sizing and appropriate hugepage
      	alignment within the file, and (b) we've found a hugepage head of
      	order HPAGE_PMD_ORDER at the offset in the file mapped by our
      	hugepage-aligned virtual address.  Even though the common case is
      	likely a race with khugepaged, given these checks (regardless of
      	how we got here - we could be operating on a completely different
      	file than originally checked in hpage_collapse_scan_file() for
      	all we know) it should be safe to directly make the pmd a huge
      	pmd pointing to this hugepage.
      
      (2)	collapse_file() is mostly serialized on the same file extent by
      	lock sequence:
      
      		|	lock hugepage
      		|		lock mapping->i_pages
      		|			lock 1st page
      		|		unlock mapping->i_pages
      		|				<page checks>
      		|		lock mapping->i_pages
      		|				page_ref_freeze(3)
      		|				xas_store(hugepage)
      		|		unlock mapping->i_pages
      		|				page_ref_unfreeze(1)
      		|			unlock 1st page
      		V	unlock hugepage
      
      	Once a context (who already has their fresh hugepage locked)
      	locks mapping->i_pages exclusively, it will hold said lock
      	until it locks the first page, and it will hold that lock until
      	after the hugepage has been added to the page cache (and
      	will unlock the hugepage after page table update, though that
      	isn't important here).
      
      	A racing context that loses the race for mapping->i_pages will
      	then lose the race to locking the first page.  Here - depending
      	on how far the other racing context has gotten - we might find
      	the new hugepage (in which case we'll exit cleanly when we
      	check PageTransCompound()), or we'll find the "old" 1st small
      	page (in which case we'll exit cleanly when we discover an
      	unexpected refcount of 2 after isolate_lru_page()).  This is
      	assuming we are able to successfully lock the page we find - in
      	the shmem path, we could just fail the trylock and exit cleanly
      	anyway.
      
      	Failure path in collapse_file() is similar: once we hold lock
      	on 1st small page, we are serialized against other collapse
      	contexts.  Before the 1st small page is unlocked, we add it
      	back to the pagecache and unfreeze the refcount appropriately.
      	Contexts who lost the race to the 1st small page will then find
      	the same 1st small page with the correct refcount and will be
      	able to proceed.
      
      [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
        Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
      [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove
      	check for multi-add in khugepaged_add_pte_mapped_thp()]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds · 58ac9a89
      Authored by Zach O'Keefe
      The main benefit of THPs is that they can be mapped at the pmd level,
      increasing the likelihood of TLB hits and spending fewer cycles in page
      table walks.  pte-mapped hugepages - that is - hugepage-aligned compound
      pages of order HPAGE_PMD_ORDER mapped by ptes - although being contiguous
      in physical memory, don't have this advantage.  In fact, one could argue
      they are detrimental to system performance overall since they occupy a
      precious hugepage-aligned/sized region of physical memory that could
      otherwise be used more effectively.  Additionally, pte-mapped hugepages
      can be the cheapest memory to collapse for khugepaged since no new
      hugepage allocation or copying of memory contents is necessary - we only
      need to update the mapping page tables.
      
      In the anonymous collapse path, we are able to collapse pte-mapped
      hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
      effort when compound pages (of any order) are encountered.
      
      Identify pte-mapped hugepages in the file/shmem collapse path, the
      final step of which makes a racy check of the value of the pmd to
      ensure it maps a pte table.  This should be fine, since races that
      result in false-positive (i.e.  attempt collapse even though we
      shouldn't) will fail later in collapse_pte_mapped_thp() once we
      actually lock mmap_lock and reinspect the pmd value.  Races that result
      in false-negatives (i.e.  where we decide to not attempt collapse, but
      should have) shouldn't be an issue, since in the worst case, we do
      nothing - which is what we've done up to this point.  We make a similar
      check in retract_page_tables().  If we do think we've found a
      pte-mapped hugepage in khugepaged context, attempt to update the page
      tables mapping this hugepage.
      
      Note that these collapses still count towards the
      /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
      and if the pte-mapped hugepage was also mapped into multiple processes'
      address spaces, the counter could be incremented once for each page
      table update.  Since we
      increment the counter when a pte-mapped hugepage is successfully added to
      the list of to-collapse pte-mapped THPs, it's possible that we never
      actually update the page table either.  This is different from how
      file/shmem pages_collapsed accounting works today where only a successful
      page cache update is counted (it's also possible here that no page tables
      are actually changed).  Though it incurs some slop, this is preferred to
      either not accounting for the event at all, or plumbing through data in
      struct mm_slot on whether to account for the collapse or not.
      
      Also note that work still needs to be done to support arbitrary compound
      pages, and that this should all be converted to using folios.
      
      [shy828301@gmail.com: Spelling mistake, update comment, and add Documentation]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkpHwZxFzjfX9nxVoRhzup8WMjMfyL6Xiq8mZ9M-N3ombw@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-3-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-3-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: MADV_COLLAPSE return EAGAIN when page cannot be isolated · 0f3e2a2c
      Authored by Zach O'Keefe
      MADV_COLLAPSE is a best-effort request that attempts to set an actionable
      errno value if the request cannot be fulfilled at the time.  EAGAIN should
      be used to communicate that a resource was temporarily unavailable, but
      that the user may try again immediately.
      
      SCAN_DEL_PAGE_LRU is an internal result code used when a page cannot be
      isolated from its LRU list.  Since this, like SCAN_PAGE_LRU, is likely a
      transitory state, make MADV_COLLAPSE return EAGAIN so that users know they
      may reattempt the operation.
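
      A sketch of the result-to-errno mapping this implies (the helper name
      follows the series; the exact case list is illustrative):

              static int madvise_collapse_errno(enum scan_result r)
              {
                      switch (r) {
                      case SCAN_ALLOC_HUGE_PAGE_FAIL:
                              return -ENOMEM;
                      case SCAN_CGROUP_CHARGE_FAIL:
                              return -EBUSY;
                      /* transient states: the caller may retry immediately */
                      case SCAN_PAGE_LOCK:
                      case SCAN_PAGE_LRU:
                      case SCAN_DEL_PAGE_LRU:
                              return -EAGAIN;
                      default:
                              return -EINVAL;
                      }
              }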
      
      Another important scenario to consider is a race with khugepaged:
      khugepaged might isolate a page while MADV_COLLAPSE is interested in it.
      Even though racing with khugepaged might mean that the memory has already
      been collapsed, signalling an errno that is non-intrinsic to that memory
      or arguments provided to madvise(2) lets the user know that future
      attempts might (and in this case likely would) succeed, and avoids
      false-negative assumptions by the user.
      
      Link: https://lkml.kernel.org/r/20220922184651.1016461-1-zokeefe@google.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: check compound_order() in collapse_pte_mapped_thp() · 780a4b6f
      Authored by Zach O'Keefe
      By the time we lock a page in collapse_pte_mapped_thp(), the page mapped
      by the address pushed onto the slot's .pte_mapped_thp[] array might have
      changed arbitrarily since we last looked at it.  We revalidate that the
      page is still the head of a compound page, but we don't revalidate if the
      compound page is of order HPAGE_PMD_ORDER before applying rmap and page
      table updates.
      
      Since the kernel now supports large folios of arbitrary order, and since
      replacing a page's pte mappings with a pmd mapping only makes sense for
      compound pages of order HPAGE_PMD_ORDER, revalidate that the compound
      order is indeed HPAGE_PMD_ORDER before proceeding.
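
      A minimal sketch of the added revalidation, assuming the function's
      existing drop_hpage error path:

              /* page is locked; recheck both head and order */
              if (!PageHead(hpage))
                      goto drop_hpage;
              if (compound_order(hpage) != HPAGE_PMD_ORDER)
                      goto drop_hpage;        /* no longer a pmd-sized THP */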
      
      Link: https://lore.kernel.org/linux-mm/CAHbLzkon+2ky8v9ywGcsTUgXM_B35jt5NThYqQKXW2YV_GUacw@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220922222731.1124481-1-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Suggested-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • khugepaged: call shmem_get_folio() · 7459c149
      Authored by Matthew Wilcox (Oracle)
      shmem_getpage() is being removed, so call its replacement and find the
      precise page ourselves.
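
      A sketch of the replacement, assuming the shmem_get_folio() signature
      of this era and an illustrative error path:

              struct folio *folio;

              /* SGP_NOALLOC: look up without allocating, then pick the
               * precise page for this index out of the folio */
              if (shmem_get_folio(mapping->host, index, &folio, SGP_NOALLOC)) {
                      result = SCAN_FAIL;
                      goto xa_unlocked;
              }
              page = folio_file_page(folio, index);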
      
      Link: https://lkml.kernel.org/r/20220902194653.1739778-32-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: thp: convert to use common struct mm_slot · b26e2701
      Authored by Qi Zheng
      Rename private struct mm_slot to struct khugepaged_mm_slot and convert to
      use common struct mm_slot with no functional change.
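
      A sketch of the resulting layout (private fields illustrative): the
      common part is embedded first, so generic mm_slot helpers can address
      any user's slot type:

              struct khugepaged_mm_slot {
                      struct mm_slot slot;    /* common: hash node, mm, list */

                      /* khugepaged-private state */
                      int nr_pte_mapped_thp;
                      unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
              };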
      
      [zhengqi.arch@bytedance.com: fix build error with CONFIG_SHMEM disabled]
        Link: https://lkml.kernel.org/r/639fa8d5-8e5b-2333-69dc-40ed46219364@bytedance.com
      Link: https://lkml.kernel.org/r/20220831031951.43152-3-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. 27 September, 2022 (4 commits)
    • mm/khugepaged: stop using vma linked list · 68540502
      Authored by Matthew Wilcox (Oracle)
      Use vma iterator & find_vma() instead of vma linked list.
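
      A sketch of the resulting pattern, using the maple-tree vma iterator
      (start address illustrative):

              struct vma_iterator vmi;
              struct vm_area_struct *vma;

              vma_iter_init(&vmi, mm, khugepaged_scan.address);
              for_each_vma(vmi, vma) {
                      /* scan vma; replaces following vma->vm_next */
              }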
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-53-Liam.Howlett@oracle.com
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: optimize collapse_pte_mapped_thp() by using vma_lookup() · 94d815b2
      Authored by Liam R. Howlett
      vma_lookup() will walk the vma tree once and not continue to look for the
      next vma.  Since the exact vma is checked below, this is a more optimal
      way of searching.
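
      A sketch of the substitution (the bail-out shown is illustrative):

              /* find_vma() returns the *next* vma when none contains haddr;
               * vma_lookup() returns NULL instead, avoiding a wasted walk */
              struct vm_area_struct *vma = vma_lookup(mm, haddr);

              if (!vma || !vma->vm_file)
                      return;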
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-22-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: MADV_COLLAPSE: refetch vm_end after reacquiring mmap_lock · 4d24de94
      Authored by Yang Shi
      The syzbot reported the below problem:
      
      BUG: Bad page map in process syz-executor198  pte:8000000071c00227 pmd:74b30067
      addr:0000000020563000 vm_flags:08100077 anon_vma:ffff8880547d2200 mapping:0000000000000000 index:20563
      file:(null) fault:0x0 mmap:0x0 read_folio:0x0
      CPU: 1 PID: 3614 Comm: syz-executor198 Not tainted 6.0.0-rc3-next-20220901-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_bad_pte.cold+0x2a7/0x2d0 mm/memory.c:565
       vm_normal_page+0x10c/0x2a0 mm/memory.c:636
       hpage_collapse_scan_pmd+0x729/0x1da0 mm/khugepaged.c:1199
       madvise_collapse+0x481/0x910 mm/khugepaged.c:2433
       madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1062
       madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1236
       do_madvise.part.0+0x24a/0x340 mm/madvise.c:1415
       do_madvise mm/madvise.c:1428 [inline]
       __do_sys_madvise mm/madvise.c:1428 [inline]
       __se_sys_madvise mm/madvise.c:1426 [inline]
       __x64_sys_madvise+0x113/0x150 mm/madvise.c:1426
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f770ba87929
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f770ba18308 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
      RAX: ffffffffffffffda RBX: 00007f770bb0f3f8 RCX: 00007f770ba87929
      RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000
      RBP: 00007f770bb0f3f0 R08: 00007f770ba18700 R09: 0000000000000000
      R10: 00007f770ba18700 R11: 0000000000000246 R12: 00007f770bb0f3fc
      R13: 00007ffc2d8b62ef R14: 00007f770ba18400 R15: 0000000000022000
      
      Basically the test program does the below conceptually:
      1. mmap 0x20000000 - 0x21000000 as anonymous region
      2. mmap io_uring SQ stuff at 0x20563000 with MAP_FIXED, io_uring_mmap()
         actually remaps the pages with special PTEs
      3. call MADV_COLLAPSE for 0x20000000 - 0x21000000
      
      It actually triggered the below race:
      
                   CPU A                                          CPU B
      mmap 0x20000000 - 0x21000000 as anon
                                                 madvise_collapse is called on this area
                                                   Retrieve start and end address from the vma (NEVER updated later!)
                                                   Collapsed the first 2M area and dropped mmap_lock
      Acquire mmap_lock
      mmap io_uring file at 0x20563000
      Release mmap_lock
                                                   Reacquire mmap_lock
                                                   revalidate vma pass since 0x20200000 + 0x200000 > 0x20563000
                                                   scan the next 2M (0x20200000 - 0x20400000), but due to whatever reason it didn't release mmap_lock
                                                   scan the 3rd 2M area (start from 0x20400000)
                                                     get into the vma created by io_uring
      
      The hend should be updated after MADV_COLLAPSE reacquires mmap_lock,
      since the vma may have shrunk.  We don't have to worry about a shrink
      from the other direction, since that would be caught by
      hugepage_vma_revalidate(): either no valid vma is found or the vma
      doesn't fit anymore.
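
      A sketch of the fix (surrounding structure abbreviated): recompute the
      clamped end from the revalidated vma every time mmap_lock is retaken:

              mmap_read_lock(mm);
              result = hugepage_vma_revalidate(mm, addr, &vma, cc);
              if (result != SCAN_SUCCEED)
                      goto out;
              /* the vma may have shrunk while mmap_lock was dropped */
              hend = vma->vm_end & HPAGE_PMD_MASK;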
      
      Link: https://lkml.kernel.org/r/20220914162220.787703-1-shy828301@gmail.com
      Fixes: 7d8faaf1 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse")
      Reported-by: syzbot+915f3e317adb0e85835f@syzkaller.appspotmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: gup: fix the fast GUP race against THP collapse · 70cbc3cc
      Authored by Yang Shi
      Since general RCU GUP fast was introduced in commit 2667f50e ("mm:
      introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
      sufficient to handle concurrent GUP-fast in all cases; it only handles
      traditional IPI-based GUP-fast correctly.  On architectures that send an
      IPI broadcast on TLB flush, it works as expected.  But on the
      architectures that do not use IPI to broadcast TLB flush, it may have the
      below race:
      
         CPU A                                          CPU B
      THP collapse                                     fast GUP
                                                    gup_pmd_range() <-- see valid pmd
                                                        gup_pte_range() <-- work on pte
      pmdp_collapse_flush() <-- clear pmd and flush
      __collapse_huge_page_isolate()
          check page pinned <-- before GUP bump refcount
                                                            pin the page
                                                            check PTE <-- no change
      __collapse_huge_page_copy()
          copy data to huge page
          ptep_clear()
      install huge pmd for the huge page
                                                            return the stale page
      discard the stale page
      
      The race can be fixed by checking whether PMD is changed or not after
      taking the page pin in fast GUP, just like what it does for PTE.  If the
      PMD is changed it means there may be parallel THP collapse, so GUP should
      back off.
      
      Also update the stale comment about serializing against fast GUP in
      khugepaged.
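
      A sketch of the fix in the GUP-fast pte walker: after pinning, re-read
      both the pmd and the pte and back off if either changed:

              /* in gup_pte_range(), after the page has been pinned */
              if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
                  unlikely(pte_val(pte) != pte_val(*ptep))) {
                      gup_put_folio(folio, 1, flags);
                      goto pte_unmap;   /* possible parallel THP collapse */
              }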
      
      Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
      Fixes: 2667f50e ("mm: introduce a general RCU get_user_pages_fast()")
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  5. 12 September, 2022 (9 commits)
    • mm/khugepaged: rename prefix of shared collapse functions · 7d2c4385
      Authored by Zach O'Keefe
      The following functions are shared between khugepaged and madvise collapse
      contexts.  Replace the "khugepaged_" prefix with generic "hpage_collapse_"
      prefix in such cases:
      
      khugepaged_test_exit() -> hpage_collapse_test_exit()
      khugepaged_scan_abort() -> hpage_collapse_scan_abort()
      khugepaged_scan_pmd() -> hpage_collapse_scan_pmd()
      khugepaged_find_target_node() -> hpage_collapse_find_target_node()
      khugepaged_alloc_page() -> hpage_collapse_alloc_page()
      
      The kernel ABI (e.g.  huge_memory:mm_khugepaged_scan_pmd tracepoint) is
      unaltered.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-11-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse · 7d8faaf1
      Authored by Zach O'Keefe
      This idea was introduced by David Rientjes[1].
      
      Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
      a synchronous collapse of memory at their own expense.
      
      The benefits of this approach are:
      
      * CPU is charged to the process that wants to spend the cycles for the
        THP
      * Avoid unpredictable timing of khugepaged collapse
      
      Semantics
      
      This call is independent of the system-wide THP sysfs settings, but will
      fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
      multiple VMAs, the semantics of the collapse over each VMA is independent
      from the others.  This implies a hugepage cannot cross a VMA boundary.  If
      collapse of a given hugepage-aligned/sized region fails, the operation may
      continue to attempt collapsing the remainder of memory specified.
      
      The memory ranges provided must be page-aligned, but are not required to
      be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
      start/end of the range will be clamped to the first/last hugepage-aligned
      address covered by said range.  The memory ranges must span at least one
      hugepage-sized region.
      
      All non-resident pages covered by the range will first be
      swapped/faulted-in, before being internally copied onto a freshly
      allocated hugepage.  Unmapped pages will have their data directly
      initialized to 0 in the new hugepage.  However, for every eligible
      hugepage aligned/sized region to-be collapsed, at least one page must
      currently be backed by memory (a PMD covering the address range must
      already exist).
      
      Allocation for the new hugepage may enter direct reclaim and/or
      compaction, regardless of VMA flags.  When the system has multiple NUMA
      nodes, the hugepage will be allocated from the node providing the most
      native pages.  This operation operates on the current state of the
      specified process and makes no persistent changes or guarantees on how
      pages will be mapped, constructed, or faulted in the future.
      
      Return Value
      
      If all hugepage-sized/aligned regions covered by the provided range were
      either successfully collapsed, or were already PMD-mapped THPs, this
      operation will be deemed successful.  On success, process_madvise(2)
      returns the number of bytes advised, and madvise(2) returns 0.  Else, -1
      is returned and errno is set to indicate the error for the most-recently
      attempted hugepage collapse.  Note that many failures might have occurred,
      since the operation may continue to collapse in the event a single
      hugepage-sized/aligned region fails.
      
      	ENOMEM	Memory allocation failed or VMA not found
      	EBUSY	Memcg charging failed
      	EAGAIN	Required resource temporarily unavailable.  Try again
      		might succeed.
      	EINVAL	Other error: No PMD found, subpage doesn't have Present
      		bit set, "Special" page not backed by struct page, VMA
      		incorrectly sized, address not page-aligned, ...
      
      Most notable here is ENOMEM and EBUSY (new to madvise) which are intended
      to provide the caller with actionable feedback so they may take an
      appropriate fallback measure.
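
      A usage sketch tying the semantics together (buf and len stand in for
      a real mapping; MADV_COLLAPSE may need defining by hand with pre-6.1
      headers):

              #include <errno.h>
              #include <stdio.h>
              #include <sys/mman.h>

              #ifndef MADV_COLLAPSE
              #define MADV_COLLAPSE 25
              #endif

              /* buf: page-aligned anon mapping covering >= one hugepage-sized
               * region, with at least one resident page in each region */
              if (madvise(buf, len, MADV_COLLAPSE)) {
                      if (errno == EAGAIN)
                              fprintf(stderr, "transient failure; retry\n");
                      else
                              perror("madvise(MADV_COLLAPSE)");
              }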
      
      Use Cases
      
      An immediate user of this new functionality are malloc() implementations
      that manage memory in hugepage-sized chunks, but sometimes subrelease
      memory back to the system in native-sized chunks via MADV_DONTNEED;
      zapping the pmd.  Later, when the memory is hot, the implementation could
      madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage
      coverage and dTLB performance.  TCMalloc is such an implementation that
      could benefit from this[2].
      
      Only privately-mapped anon memory is supported for now, but additional
      support for file, shmem, and HugeTLB high-granularity mappings[2] is
      expected.  File and tmpfs/shmem support would permit:
      
      * Backing executable text by THPs.  Current support provided by
        CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
        might impair services from serving at their full rated load after
        (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
        immediately realize iTLB performance prevents page sharing and demand
        paging, both of which increase steady state memory footprint.  With
        MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
        and lower RAM footprints.
      * Backing guest memory by hugepages after the memory contents have been
        migrated in native-page-sized chunks to a new host, in a
        userfaultfd-based live-migration stack.
      
      [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
      [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
      
      [jrdr.linux@gmail.com: avoid possible memory leak in failure path]
        Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
      [zokeefe@google.com: add missing kfree() to madvise_collapse()]
        Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
        Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
      [zokeefe@google.com: delay computation of hpage boundaries until use]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds hugepage · 50722804
      Authored by Zach O'Keefe
      When scanning an anon pmd to see if it's eligible for collapse, return
      SCAN_PMD_MAPPED if the pmd already maps a hugepage.  Note that
      SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
      file-collapse path, since the latter might identify pte-mapped compound
      pages.  This is required by MADV_COLLAPSE which necessarily needs to know
      what hugepage-aligned/sized regions are already pmd-mapped.
      
      In order to determine if a pmd already maps a hugepage, refactor
      mm_find_pmd():
      
      Return mm_find_pmd() to its pre-commit f72e7dcd ("mm: let mm_find_pmd
      fix buggy race with THP fault") behavior.  ksm was the only caller that
      explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
      there (pmd_present() and pmd_trans_huge() checks).
      
      Undo the revert in commit f72e7dcd ("mm: let mm_find_pmd fix buggy
      race with THP fault") that open-coded the split_huge_pmd_address() pmd
      lookup, and use mm_find_pmd() instead.
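
      A sketch of the resulting check (the helper name follows the series;
      the lockless read primitive varies by kernel version):

              static int find_pmd_or_thp_or_none(struct mm_struct *mm,
                                                 unsigned long address,
                                                 pmd_t **pmd)
              {
                      pmd_t pmde;

                      *pmd = mm_find_pmd(mm, address);
                      if (!*pmd)
                              return SCAN_PMD_NULL;

                      pmde = pmd_read_atomic(*pmd);   /* lockless snapshot */
                      if (!pmd_present(pmde))
                              return SCAN_PMD_NULL;
                      if (pmd_trans_huge(pmde))
                              return SCAN_PMD_MAPPED; /* already a huge pmd */
                      if (pmd_bad(pmde))
                              return SCAN_PMD_NULL;
                      return SCAN_SUCCEED;
              }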
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/thp: add flag to enforce sysfs THP in hugepage_vma_check() · a7f4e6e4
      Authored by Zach O'Keefe
      MADV_COLLAPSE is not coupled to the kernel-oriented sysfs THP settings[1].
      
      hugepage_vma_check() is the authority on determining if a VMA is eligible
      for THP allocation/collapse, and currently enforces the sysfs THP
      settings.  Add a flag to disable these checks.  For now, only apply this
      arg to anon and file, which use /sys/kernel/mm/transparent_hugepage/enabled.
      We can expand this to shmem, which uses
      /sys/kernel/mm/transparent_hugepage/shmem_enabled, later.
      
      Use this flag in collapse_pte_mapped_thp() where previously the VMA flags
      passed to hugepage_vma_check() were OR'd with VM_HUGEPAGE to elide the
      VM_HUGEPAGE check in "madvise" THP mode.  Prior to "mm: khugepaged: check
      THP flag in hugepage_vma_check()", this check also didn't check "never"
      THP mode.  As such, this restores the previous behavior of
      collapse_pte_mapped_thp() where sysfs THP settings are ignored.  See
      comment in code for justification why this is OK.
      
      [1] https://lore.kernel.org/linux-mm/CAAa6QmQxay1_=Pmt8oCX2-Va18t44FV-Vs-WsQt_6+qBks4nZA@mail.gmail.com/
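
      A sketch of the call-site effect (argument order as in this era's
      hugepage_vma_check(); treat the details as illustrative):

              /* khugepaged context: honour the sysfs THP settings */
              if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true))
                      return SCAN_VMA_CHECK;

              /* MADV_COLLAPSE context: skip the sysfs "enabled" checks */
              if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false))
                      return SCAN_VMA_CHECK;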
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-8-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: add flag to predicate khugepaged-only behavior · d8ea7cc8
      Authored by Zach O'Keefe
      Add .is_khugepaged flag to struct collapse_control so khugepaged-specific
      behavior can be elided by MADV_COLLAPSE context.
      
      Start by protecting khugepaged-specific heuristics with this flag.  In
      MADV_COLLAPSE, the user presumably has reason to believe the collapse
      will be beneficial, and khugepaged heuristics shouldn't prevent the
      user from doing so (a sketch of the gating follows the list):
      
      1) sysfs-controlled knobs khugepaged_max_ptes_[none|swap|shared]
      
      2) requirement that some pages in region being collapsed be young or
         referenced
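
      A sketch of how the flag gates one such heuristic, following the
      max_ptes_none pattern (context abbreviated):

              if (pte_none(pteval) || (pte_present(pteval) &&
                                       is_zero_pfn(pte_pfn(pteval)))) {
                      ++none_or_zero;
                      if (!userfaultfd_armed(vma) &&
                          (!cc->is_khugepaged ||
                           none_or_zero <= khugepaged_max_ptes_none))
                              continue;
                      /* only the khugepaged context bails out here */
                      result = SCAN_EXCEED_NONE_PTE;
                      goto out;
              }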
      
      [zokeefe@google.com: consistently order cc->is_khugepaged and pte_* checks]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-3-zokeefe@google.com
        Link: https://lore.kernel.org/linux-mm/Ys2qJm6FaOQcxkha@google.com/
      Link: https://lkml.kernel.org/r/20220706235936.2197195-7-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: propagate enum scan_result codes back to callers · 50ad2f24
      Authored by Zach O'Keefe
      Propagate enum scan_result codes back through return values of
      functions downstream of khugepaged_scan_file() and
      khugepaged_scan_pmd() to inform callers if the operation was
      successful, and if not, why.
      
      Since khugepaged_scan_pmd()'s return value already has a specific meaning
      (whether mmap_lock was unlocked or not), add a bool* argument to
      khugepaged_scan_pmd() to retrieve this information.
      
      Change khugepaged to take action based on the return values of
      khugepaged_scan_file() and khugepaged_scan_pmd() instead of acting deep
      within the collapsing functions themselves.
      
      hugepage_vma_revalidate() now returns SCAN_SUCCEED on success to be more
      consistent with enum scan_result propagation.
      
      Remove dependency on error pointers to communicate to khugepaged that
      allocation failed and it should sleep; instead just use the result of the
      scan (SCAN_ALLOC_HUGE_PAGE_FAIL if allocation fails).
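
      A sketch of the resulting caller shape (pre-rename function names;
      details illustrative):

              bool mmap_locked = true;
              int result;

              result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, cc);
              if (!mmap_locked)
                      vma = NULL;     /* stale once the lock was dropped */
              if (result == SCAN_SUCCEED)
                      ++khugepaged_pages_collapsed;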
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-6-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: dedup and simplify hugepage alloc and charging · 9710a78a
      Authored by Zach O'Keefe
      The following code is duplicated in collapse_huge_page() and
      collapse_file():
      
              gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

              new_page = khugepaged_alloc_page(hpage, gfp, node);
              if (!new_page) {
                      result = SCAN_ALLOC_HUGE_PAGE_FAIL;
                      goto out;
              }

              if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) {
                      result = SCAN_CGROUP_CHARGE_FAIL;
                      goto out;
              }
              count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);
      
      Also, "node" is passed as an argument to both collapse_huge_page() and
      collapse_file() and obtained the same way, via
      khugepaged_find_target_node().
      
      Move all this into a new helper, alloc_charge_hpage(), and remove the
      duplicate code from collapse_huge_page() and collapse_file().  Also,
      simplify khugepaged_alloc_page() by returning a bool indicating allocation
      success instead of a copy of the allocated struct page *.
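       
       For reference, a minimal sketch of what the consolidated helper could
       look like (illustrative rather than the verbatim in-tree code; the
       signature and the cc argument to khugepaged_find_target_node() follow
       the description in this series):
       
       	static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
       				      struct collapse_control *cc)
       	{
       		/* Only allocate from the target node. */
       		gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
       		int node = khugepaged_find_target_node(cc);
       
       		if (!khugepaged_alloc_page(hpage, gfp, node))
       			return SCAN_ALLOC_HUGE_PAGE_FAIL;
       		if (unlikely(mem_cgroup_charge(page_folio(*hpage), mm, gfp)))
       			return SCAN_CGROUP_CHARGE_FAIL;
       		count_memcg_page_event(*hpage, THP_COLLAPSE_ALLOC);
       		return SCAN_SUCCEED;
       	}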
      
       Link: https://lkml.kernel.org/r/20220706235936.2197195-5-zokeefe@google.com
       Signed-off-by: Zach O'Keefe <zokeefe@google.com>
       Suggested-by: Peter Xu <peterx@redhat.com>
       Acked-by: David Rientjes <rientjes@google.com>
       Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9710a78a
    • Z
      mm/khugepaged: add struct collapse_control · 34d6b470
       Committed by Zach O'Keefe
      Modularize hugepage collapse by introducing struct collapse_control.  This
      structure serves to describe the properties of the requested collapse, as
      well as serve as a local scratch pad to use during the collapse itself.
      
      Start by moving global per-node khugepaged statistics into this new
      structure.  Note that this structure is still statically allocated since
      CONFIG_NODES_SHIFT might be arbitrary large, and stack-allocating a
      MAX_NUMNODES-sized array could cause -Wframe-large-than= errors.
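       
       A simplified sketch of the structure as introduced here (the field
       type is illustrative; the fixups noted below adjusted it):
       
       	struct collapse_control {
       		/* Num pages scanned per node, reset before each PMD scan. */
       		u32 node_load[MAX_NUMNODES];
       	};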
      
      [zokeefe@google.com: use minimal bits to store num page < HPAGE_PMD_NR]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-2-zokeefe@google.com
        Link: https://lore.kernel.org/linux-mm/Ys2CeIm%2FQmQwWh9a@google.com/
      [sfr@canb.auug.org.au: fix build]
        Link: https://lkml.kernel.org/r/20220721195508.15f1e07a@canb.auug.org.au
      [zokeefe@google.com: fix struct collapse_control load_node definition]
        Link: https://lore.kernel.org/linux-mm/202209021349.F73i5d6X-lkp@intel.com/
        Link: https://lkml.kernel.org/r/20220903021221.1130021-1-zokeefe@google.com
       Link: https://lkml.kernel.org/r/20220706235936.2197195-4-zokeefe@google.com
       Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      34d6b470
    • Y
      mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA · c6a7f445
       Committed by Yang Shi
      Patch series "mm: userspace hugepage collapse", v7.
      
      Introduction
      --------------------------------
      
      This series provides a mechanism for userspace to induce a collapse of
      eligible ranges of memory into transparent hugepages in process context,
      thus permitting users to more tightly control their own hugepage
      utilization policy at their own expense.
      
      This idea was introduced by David Rientjes[5].
      
      Interface
      --------------------------------
      
      The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
      leverages the new process_madvise(2) call.
      
      process_madvise(2)
      
      	Performs a synchronous collapse of the native pages
      	mapped by the list of iovecs into transparent hugepages.
      
      	This operation is independent of the system THP sysfs settings,
      	but attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.
      
      	THP allocation may enter direct reclaim and/or compaction.
      
       	When a range spans multiple VMAs, the semantics of the collapse
       	over each VMA are independent of the others.
      
      	Caller must have CAP_SYS_ADMIN if not acting on self.
      
      	Return value follows existing process_madvise(2) conventions.  A
      	“success” indicates that all hugepage-sized/aligned regions
      	covered by the provided range were either successfully
      	collapsed, or were already pmd-mapped THPs.
      
      madvise(2)
      
      	Equivalent to process_madvise(2) on self, with 0 returned on
      	“success”.
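       
       As a usage sketch (hedged: MADV_COLLAPSE requires a kernel with this
       series; the fallback #define mirrors the uapi value and is only needed
       with older userspace headers):
       
       	#include <stdio.h>
       	#include <sys/mman.h>
       
       	#ifndef MADV_COLLAPSE
       	#define MADV_COLLAPSE 25	/* uapi/asm-generic/mman-common.h */
       	#endif
       
       	/* Ask the kernel to back [addr, addr + len) with THPs right now. */
       	int collapse_range(void *addr, size_t len)
       	{
       		if (madvise(addr, len, MADV_COLLAPSE)) {
       			perror("madvise(MADV_COLLAPSE)");
       			return -1;
       		}
       		return 0;
       	}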
      
      Current Use-Cases
      --------------------------------
      
       (1)	Immediately back executable text by THPs.  Current support provided
       	by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
       	system, which can keep services from running at their full rated
       	load after (re)starting.  Tricks like mremap(2)'ing text onto
       	anonymous memory to immediately realize iTLB performance prevent
       	page sharing and demand paging, both of which increase steady-state
       	memory footprint.  With MADV_COLLAPSE, we get the best of both
       	worlds: peak upfront performance and a lower RAM footprint.  Note
       	that subsequent support for file-backed memory is required here.
      
       (2)	malloc() implementations that manage memory in hugepage-sized
       	chunks, but sometimes subrelease memory back to the system in
       	native-sized chunks via MADV_DONTNEED, zapping the pmd.  Later,
       	when the memory is hot, the implementation could
       	madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain
       	hugepage coverage and dTLB performance.  TCMalloc is such an
       	implementation that could benefit from this[6].  A prior study of
       	Google internal workloads during evaluation of Temeraire, a
       	hugepage-aware enhancement to TCMalloc, showed that nearly 20% of
       	all cpu cycles were spent in dTLB stalls, and that increasing
       	hugepage coverage by even a small amount can help with that[7].
      
       (3)	userfaultfd-based live migration of virtual machines satisfies UFFD
       	faults by fetching native-sized pages over the network (to avoid
       	latency of transferring an entire hugepage).  However, after guest
       	memory has been fully copied to the new host, MADV_COLLAPSE can
       	be used to immediately increase guest performance.  Note that
       	subsequent support for file/shmem-backed memory is required here.
      
       (4)	HugeTLB high-granularity mapping allows a HugeTLB page to
       	be mapped at different levels in the page tables[8].  As it's not
       	"transparent" like THP, HugeTLB high-granularity mappings require
       	an explicit user API.  It is intended that MADV_COLLAPSE be co-opted
       	for this use case[9].  Note that subsequent support for HugeTLB
       	memory is required here.
      
      Future work
      --------------------------------
      
      Only private anonymous memory is supported by this series. File and
      shmem memory support will be added later.
      
      One possible user of this functionality is a userspace agent that
      attempts to optimize THP utilization system-wide by allocating THPs
      based on, for example, task priority, task performance requirements, or
      heatmaps.  For the latter, one idea that has already surfaced is using
      DAMON to identify hot regions, and driving THP collapse through a new
      DAMOS_COLLAPSE scheme[10].
      
      
      This patch (of 17):
      
       khugepaged has an optimization that reduces huge page allocation calls
       for !CONFIG_NUMA by carrying an allocated-but-failed-to-collapse huge
       page over to the next loop iteration.  CONFIG_NUMA doesn't do so since
       the next loop may try to collapse a huge page from a different node, so
       it doesn't make much sense to carry it.
      
       But when NUMA=n, the huge page is allocated by khugepaged_prealloc_page()
       before scanning the address space, which means a huge page may be
       allocated even when there is no suitable range to collapse.  The page is
       then simply freed if khugepaged has already made enough progress.  This
       can make a NUMA=n run record 5 times as many thp_collapse_alloc events
       as a NUMA=y run.  The flood of pointless THP allocations actually makes
       things worse and defeats the purpose of the optimization.
      
       This could be fixed by carrying the huge page across scans, but that
       would complicate the code further and the huge page might be carried
       indefinitely.  Taking a step back, the optimization itself no longer
       seems worth keeping, since:
       
         * Few users build NUMA=n kernels nowadays, even when the kernel
           actually runs on a non-NUMA machine.  Some small devices may run
           NUMA=n kernels, but they are unlikely to use THP.
         * Since commit 44042b44 ("mm/page_alloc: allow high-order pages to be
           stored on the per-cpu lists"), THP can be cached on the per-cpu
           lists, which largely achieves what this optimization was doing.
      
      Link: https://lkml.kernel.org/r/20220706235936.2197195-1-zokeefe@google.com
       Link: https://lkml.kernel.org/r/20220706235936.2197195-3-zokeefe@google.com
       Signed-off-by: Yang Shi <shy828301@gmail.com>
       Signed-off-by: Zach O'Keefe <zokeefe@google.com>
       Co-developed-by: Peter Xu <peterx@redhat.com>
       Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c6a7f445
  6. 08 Sep, 2022 1 commit
    • P
      freezer,sched: Rewrite core freezer logic · f5d39b02
       Committed by Peter Zijlstra
      Rewrite the core freezer to behave better wrt thawing and be simpler
      in general.
      
      By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
      ensured frozen tasks stay frozen until thawed and don't randomly wake
      up early, as is currently possible.
      
      As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
      two PF_flags (yay!).
      
       Specifically, the current scheme works a little like:
      
      	freezer_do_not_count();
      	schedule();
      	freezer_count();
      
       And either the task is blocked, or it lands in try_to_freeze()
      through freezer_count(). Now, when it is blocked, the freezer
      considers it frozen and continues.
      
      However, on thawing, once pm_freezing is cleared, freezer_count()
      stops working, and any random/spurious wakeup will let a task run
      before its time.
      
       That is, thawing tries to thaw things in an explicit order: kernel
       threads and workqueues first, then SMP is brought back, and only then
       userspace, etc.  However, due to the above-mentioned races it is
       entirely possible for userspace tasks to thaw (by accident) before SMP
       is back.
      
      This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
      where the userspace task requires a special CPU to run.
      
      As said; replace this with a special task state TASK_FROZEN and add
      the following state transitions:
      
      	TASK_FREEZABLE	-> TASK_FROZEN
      	__TASK_STOPPED	-> TASK_FROZEN
      	__TASK_TRACED	-> TASK_FROZEN
      
       The new TASK_FREEZABLE can be set on any state that is part of TASK_NORMAL
      (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
      is already required to deal with spurious wakeups and the freezer
      causes one such when thawing the task (since the original state is
      lost).
      
      The special __TASK_{STOPPED,TRACED} states *can* be restored since
      their canonical state is in ->jobctl.
      
      With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
      free of undue (early / spurious) wakeups.
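       
       As an illustration (a simplified sketch; the actual conversion touches
       many call sites), a freezable wait under the new scheme marks the sleep
       state itself instead of bracketing the sleep with
       freezer_do_not_count()/freezer_count():
       
       	/*
       	 * New-style freezable sleep: the freezer can move this task to
       	 * TASK_FROZEN, and only an explicit TASK_FROZEN wakeup thaws it.
       	 */
       	set_current_state(TASK_INTERRUPTIBLE | TASK_FREEZABLE);
       	schedule();
       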
       Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
       Reviewed-by: Ingo Molnar <mingo@kernel.org>
       Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
      f5d39b02
  7. 18 Jul, 2022 7 commits
  8. 04 Jul, 2022 7 commits
  9. 20 May, 2022 1 commit