- 29 April 2022, 40 commits
-
Submitted by Miaohe Lin

The compaction sysfs file is created via compaction_register_node() in register_node(), but we forgot to remove it in unregister_node(), so the compaction sysfs file is leaked. Use compaction_unregister_node() to fix this issue.

Link: https://lkml.kernel.org/r/20220401070905.43679-1-linmiaohe@huawei.com
Fixes: ed4a6d7f ("mm: compaction: add /sys trigger for per-node memory compaction")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
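A minimal sketch of the shape of the fix (drivers/base/node.c, simplified, with the other teardown calls elided): node removal now undoes what compaction_register_node() set up during registration.

    void unregister_node(struct node *node)
    {
            compaction_unregister_node(node);  /* remove the per-node compact sysfs file */

            /* ... the existing teardown (accesses, caches, etc.) stays as before ... */

            device_unregister(&node->dev);
    }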
-
Submitted by Miaohe Lin

Use the isolation_suitable() helper to check whether a page is suitable to isolate, which simplifies the code. Minor readability improvement.

Link: https://lkml.kernel.org/r/20220322110750.60311-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

The only caller, z3fold_free(), never calls free_handle() in the PAGE_HEADLESS case. Remove this unneeded check.

Link: https://lkml.kernel.org/r/20220308134311.59086-9-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

do_compact_page() will do list_del_init(&zhdr->buddy) for us. Remove this extra one to save some possible cpu cycles.

Link: https://lkml.kernel.org/r/20220308134311.59086-8-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

z3fold always does atomic64_dec(&pool->pages_nr) when __release_z3fold_page() is called, so we can move the decrement of pool->pages_nr into __release_z3fold_page() to simplify the code. This also shrinks z3fold.o by ~1k.

Without this patch:
   text    data     bss     dec     hex filename
  15444    1376       8   16828    41bc mm/z3fold.o
With this patch:
   text    data     bss     dec     hex filename
  15044    1248       8   16300    3fac mm/z3fold.o

Link: https://lkml.kernel.org/r/20220308134311.59086-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
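A hedged sketch of the idea (not the exact mm/z3fold.c code; the list handling is elided): the decrement lives in the release helper itself, so callers no longer have to pair every __release_z3fold_page() call with their own atomic64_dec().

    static void __release_z3fold_page(struct z3fold_header *zhdr, bool locked)
    {
            struct z3fold_pool *pool = zhdr_to_pool(zhdr);

            /* ... unlink zhdr from the pool lists as before ... */

            /* account the released page here, once, instead of in every caller */
            atomic64_dec(&pool->pages_nr);
    }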
-
Submitted by Miaohe Lin

The local variable l holds the address of unbuddied[i], which won't change after we take the pool lock. Remove it to avoid confusion.

Link: https://lkml.kernel.org/r/20220308134311.59086-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

page->page_type and PagePrivate are not used in z3fold, so remove these confusing, unneeded operations. z3fold only did them because the code was modeled on zsmalloc's migration code, which does need these operations.

Link: https://lkml.kernel.org/r/20220308134311.59086-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

Use put_z3fold_header() to pair with get_z3fold_header(). Also fix the wrong comments. Minor readability improvement.

Link: https://lkml.kernel.org/r/20220308134311.59086-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

Highmem pages have been supported since commit f1549cb5 ("mm/z3fold.c: allow __GFP_HIGHMEM in z3fold_alloc"). Remove the obsolete comment.

Link: https://lkml.kernel.org/r/20220308134311.59086-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

Patch series "A few cleanup patches for z3fold", v2.

This series contains a few patches to simplify the code, remove unneeded code, fix an obsolete comment and so on. More details can be found in the respective changelogs.

This patch (of 8): z3fold_mount() is only called during init, so declare it with __init.

Link: https://lkml.kernel.org/r/20220308134311.59086-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220308134311.59086-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Xianting Tian
pte_page() always returns a valid page, so remove the redundant page validation, as we did in many other places.

Link: https://lkml.kernel.org/r/20220316025947.328276-1-xianting.tian@linux.alibaba.com
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

Since commit 791b48b6 ("mm: vmscan: scan until it finds eligible pages"), splicing any skipped pages to the tail of the LRU list won't put the system at risk of premature OOM but will waste lots of cpu cycles. Correct the comment accordingly.

Link: https://lkml.kernel.org/r/20220416025231.8082-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin
Since commit 6d6435811c19 ("remove bdi_congested() and wb_congested() and related functions"), there is no congested backing device check anymore. Correct the comment accordingly.

[akpm@linux-foundation.org: tweak grammar]
Link: https://lkml.kernel.org/r/20220414120202.30082-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

Since commit 1431d4d1 ("mm: base LRU balancing on an explicit cost model"), the relative value of each set of LRU lists is based on a cost model instead of the rotated/scanned ratio. Clean up the relevant comment.

Link: https://lkml.kernel.org/r/20220409030245.61211-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Wei Yang

lruvec_lru_size() is only used in get_scan_count(), so the only possible zone_idx is sc->reclaim_idx. Since sc->reclaim_idx is guaranteed to be a valid zone index, we can remove the extra check in the zone iteration.

Link: https://lkml.kernel.org/r/20220317234624.23358-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
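A hedged sketch of the resulting function (mm/vmscan.c, simplified): with sc->reclaim_idx known to be a valid zone index, the loop bound needs no extra clamp against MAX_NR_ZONES.

    static unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                         int zone_idx)
    {
            unsigned long size = 0;
            int zid;

            for (zid = 0; zid <= zone_idx; zid++) {   /* no "&& zid < MAX_NR_ZONES" */
                    struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];

                    if (!managed_zone(zone))
                            continue;

                    if (!mem_cgroup_disabled())
                            size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
                    else
                            size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
            }

            return size;
    }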
-
Submitted by Wei Yang

wakeup_kswapd() only wakes up kswapd when the zone is managed. Its two callers work from a node perspective:
* wake_all_kswapds
* numamigrate_isolate_page
If we pick up a !managed zone there, it is not what we expect. This patch makes sure we pick up a managed zone for wakeup_kswapd(), and it also uses managed_zone() in migrate_balanced_pgdat() to get a proper zone.

[richard.weiyang@gmail.com: adjust the usage in migrate_balanced_pgdat()]
Link: https://lkml.kernel.org/r/20220329010901.1654-2-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20220327024101.10378-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
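A hedged sketch of the wake_all_kswapds() side of the change (mm/page_alloc.c, simplified): zones without managed pages are skipped, so wakeup_kswapd() is only ever handed a managed zone.

    static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
                                 const struct alloc_context *ac)
    {
            struct zoneref *z;
            struct zone *zone;
            pg_data_t *last_pgdat = NULL;
            enum zone_type highest_zoneidx = ac->highest_zoneidx;

            for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, highest_zoneidx,
                                            ac->nodemask) {
                    if (!managed_zone(zone))
                            continue;               /* new: skip unmanaged zones */
                    if (last_pgdat != zone->zone_pgdat)
                            wakeup_kswapd(zone, gfp_mask, order, highest_zoneidx);
                    last_pgdat = zone->zone_pgdat;
            }
    }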
-
Submitted by Wei Yang

As mentioned in commit 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator"), reclaim only affects managed_zones. Adjust the code and comment accordingly.

Link: https://lkml.kernel.org/r/20220327024101.10378-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Muchun Song

The feature of minimizing the overhead of struct page associated with each HugeTLB page aims to free its vmemmap pages (used as struct page) to save memory, which amounts to ~14GB/16GB per 1TB of HugeTLB pages (2MB/1GB type). In short, when a HugeTLB page is allocated or freed, the vmemmap array representing the range associated with the page needs to be remapped. When a page is allocated, vmemmap pages are freed after remapping. When a page is freed, previously discarded vmemmap pages must be allocated before remapping. More implementation details can be found in [1].

The infrastructure for freeing vmemmap pages associated with each HugeTLB page is already there, so we can easily enable HUGETLB_PAGE_FREE_VMEMMAP for arm64. The only thing to be fixed is flush_dcache_page(): it needs to be adapted to operate on the head page's flags, since the tail vmemmap pages are mapped read-only after the feature is enabled (clearing a flag there is not permitted).

There were some discussions about this in thread [2], but no conclusion was reached in the end. I have copied the concerns raised by Anshuman here and explain why they are superfluous. It is safe to enable it for x86_64 as well as arm64.

1st concern:
'''
But what happens when a hot remove section's vmemmap area (which is being teared down) is nearby another vmemmap area which is either created or being destroyed for HugeTLB alloc/free purpose. As you mentioned HugeTLB pages inside the hot remove section might be safe. But what about other HugeTLB areas whose vmemmap area shares page table entries with vmemmap entries for a section being hot removed ? Massive HugeTLB alloc/use/free test cycle using memory just adjacent to a memory hotplug area, which is always added and removed periodically, should be able to expose this problem.
'''
Answer: At the time memory is removed, all HugeTLB pages have either been migrated away or dissolved, so there is no race between memory hot remove and free_huge_page_vmemmap(); therefore, HugeTLB pages inside the hot-removed section are safe. As for the question "what about other HugeTLB areas whose vmemmap area shares page table entries with vmemmap entries for a section being hot removed?", the situation cannot arise. The minimal granularity of hotplug memory is 128MB (on arm64 with a 4k base page), so any HugeTLB smaller than 128MB lies within a single section; there are then no PTE page tables shared between HugeTLB pages in this section and ones in other sections, a HugeTLB page cannot cross two sections, and in this case the section cannot be freed. Any HugeTLB bigger than 128MB (the section size) has vmemmap pages that are an integer multiple of 2MB (PMD-mapped). As long as:
1) HugeTLBs are naturally aligned, power-of-two sizes
2) The HugeTLB size >= the section size
3) The HugeTLB size >= the vmemmap leaf mapping size
then a HugeTLB will not share any leaf page table entries with *anything else*, but will share intermediate entries. In this case too, at the time memory is removed, all HugeTLB pages have either been migrated away or dissolved, so there is also no race between memory hot remove and free_huge_page_vmemmap().

2nd concern:
'''
differently, not sure if ptdump would require any synchronization. Dumping an wrong value is probably okay but crashing because a page table entry is being freed after ptdump acquired the pointer is bad. On arm64, ptdump() is protected against hotremove via [get|put]_online_mems().
'''
Answer: ptdump should be fine, since vmemmap_remap_free() only exchanges PTEs or splits a PMD entry (which means allocating a PTE page table). Neither operation frees any page tables (PTE), so ptdump cannot run into a use-after-free on any page tables. The worst case is just dumping a wrong value.

[1] https://lore.kernel.org/all/20210510030027.56044-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/all/20210518091826.36937-1-songmuchun@bytedance.com/

[songmuchun@bytedance.com: restructure the code comment inside flush_dcache_page()]
Link: https://lkml.kernel.org/r/20220414072646.21910-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220331065640.5777-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Tested-by: Barry Song <baohua@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
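A hedged sketch of the arm64 flush_dcache_page() adaptation described above (arch/arm64/mm/flush.c, simplified): only the head page's flags are touched, because the tail struct pages may sit in read-only vmemmap mappings once the feature is enabled.

    void flush_dcache_page(struct page *page)
    {
            /*
             * Track the dcache state on the head page only: with
             * HUGETLB_PAGE_FREE_VMEMMAP the tail struct pages are backed by
             * read-only vmemmap, so their flags must not be written.
             */
            if (PageHuge(page))
                    page = compound_head(page);

            if (test_bit(PG_dcache_clean, &page->flags))
                    clear_bit(PG_dcache_clean, &page->flags);
    }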
-
Submitted by Muchun Song

The feature of minimizing the overhead of struct page associated with each HugeTLB page is implemented on x86_64; however, the infrastructure of this feature is already there, so we could easily enable it for other architectures. Introduce ARCH_WANT_HUGETLB_PAGE_FREE_VMEMMAP so that other architectures can be easily enabled: they just need to select this config if they want the feature.

Link: https://lkml.kernel.org/r/20220331065640.5777-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Tested-by: Barry Song <baohua@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Jakob Koschel

In preparation for limiting the scope of the list iterator to the list traversal loop, use a dedicated pointer to iterate through the list [1].

Before this change, hugetlb_resv_map_add() expected a file_region struct, but if the list iterator in add_reservation_in_range() did not exit early, the variable passed in was not actually a valid structure. In such a case 'rg' is computed from the head element of the list and represents an out-of-bounds pointer. This remains safe *iff* only the link member is used (as hugetlb_resv_map_add() does). To avoid the type confusion altogether and limit the list iterator to the loop, only a list_head pointer is kept and passed to hugetlb_resv_map_add().

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20220331224323.903842-1-jakobkoschel@gmail.com
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Brian Johannesmeyer" <bjohannesmeyer@gmail.com>
Cc: Cristiano Giuffrida <c.giuffrida@vu.nl>
Cc: "Bos, H.J." <h.j.bos@vu.nl>
Cc: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
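A self-contained illustration of the pattern (hypothetical struct item, not the hugetlb code itself): the loop returns a plain list_head position instead of letting the list_for_each_entry() iterator variable escape, which is exactly the type confusion removed from add_reservation_in_range()/hugetlb_resv_map_add().

    #include <linux/list.h>

    struct item {
            int key;
            struct list_head link;
    };

    /*
     * Return the list position after which a new element with @key should be
     * inserted.  The caller only ever sees a valid struct list_head (possibly
     * the head itself), never a bogus container computed from the list head.
     */
    static struct list_head *find_insert_pos(struct list_head *head, int key)
    {
            struct item *iter;

            list_for_each_entry(iter, head, link)
                    if (iter->key > key)
                            return iter->link.prev;   /* insert right before iter */

            return head->prev;                        /* append at the tail */
    }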
-
Submitted by Naoya Horiguchi

We know that HPageFreed pages have page refcount 0, so get_page_unless_zero() always fails on them and returns 0. Explicitly separate the branch based on page state, for a minor optimization and better readability.

Link: https://lkml.kernel.org/r/20220415041848.GA3034499@ik1-406-35019.vs.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
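A hedged sketch of the separated branches (the hugetlb hwpoison refcount-taking path; the function name here is approximate and surrounding logic is elided): a free hugetlb page is known to have refcount zero, so it no longer goes through get_page_unless_zero().

    static int __get_hwpoison_hugepage(struct page *head, bool *count_increased)
    {
            int ret;

            if (HPageFreed(head)) {
                    /* free hugetlb pages have refcount 0; nothing to pin */
                    ret = 0;
            } else if (HPageMigratable(head)) {
                    ret = get_page_unless_zero(head);
                    if (ret)
                            *count_increased = true;
            } else {
                    ret = -EBUSY;
            }

            return ret;
    }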
-
Submitted by Miaohe Lin

If me_huge_page() meets a truncated but not yet freed hugepage, the page won't be dissolved even if we hold the last refcount, because the hugepage has a NULL page_mapping while not being an anonymous hugepage either. Thus we lose the last chance to dissolve it into the buddy allocator to save the healthy subpages. Remove the PageAnon check to handle these hugepages too.

Link: https://lkml.kernel.org/r/20220414114941.11223-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Miaohe Lin

Patch series "A few fixup and cleanup patches for memory failure", v2.

This series contains a patch to clean up HWPoisonHandlable() and another one to dissolve truncated hugetlb pages. More details can be found in the respective changelogs.

This patch (of 2): The local variable movable can be removed by returning true directly. Also fix the typo 'mirgate'. No functional change intended.

Link: https://lkml.kernel.org/r/20220414114941.11223-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220414114941.11223-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Naoya Horiguchi

Revert commit 888af270 ("mm/memory-failure.c: fix race with changing page compound again"), because we now fetch the page refcount under hugetlb_lock in try_memory_failure_hugetlb(), so the race check is no longer necessary.

Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Naoya Horiguchi

In the already-hwpoisoned case, memory_failure() is supposed to return after releasing the page refcount taken for error handling. But currently the refcount is not released when called with MF_COUNT_INCREASED, which makes the page refcount inconsistent. This should be rare and non-critical, but it can be inconvenient in testing (unpoison doesn't work).

Link: https://lkml.kernel.org/r/20220408135323.1559401-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by liqiong

No need to cast (void *) to (struct hwp_walk *).

Link: https://lkml.kernel.org/r/20220322142826.25939-1-liqiong@nfschina.com
Signed-off-by: liqiong <liqiong@nfschina.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Zi Yan

Whenever the buddy of a page is found from __find_buddy_pfn(), page_is_buddy() should be used to check its validity. Add a helper function, find_buddy_page_pfn(), to find the buddy page and do the check together.

[ziy@nvidia.com: updates per David]
Link: https://lkml.kernel.org/r/20220401230804.1658207-2-zi.yan@sent.com
Link: https://lore.kernel.org/linux-mm/CAHk-=wji_AmYygZMTsPMdJ7XksMt7kOur8oDfDdniBRMjm4VkQ@mail.gmail.com/
Link: https://lkml.kernel.org/r/7236E7CA-B5F1-4C04-AB85-E86FA3E9A54B@nvidia.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
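A hedged sketch of the helper (close to what the patch adds to mm/internal.h): it computes the candidate buddy from __find_buddy_pfn() and validates it with page_is_buddy() in one place, returning NULL when the neighbour is not actually a free buddy of the right order.

    static inline struct page *find_buddy_page_pfn(struct page *page,
                                                   unsigned long pfn,
                                                   unsigned int order,
                                                   unsigned long *buddy_pfn)
    {
            unsigned long __buddy_pfn = __find_buddy_pfn(pfn, order);
            struct page *buddy;

            buddy = page + (__buddy_pfn - pfn);
            if (buddy_pfn)
                    *buddy_pfn = __buddy_pfn;

            if (page_is_buddy(page, buddy, order))
                    return buddy;
            return NULL;
    }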
-
Submitted by Zi Yan

Move the pageblock migratetype check code into the while loop to simplify the logic. It also saves redundant buddy page checking code.

Link: https://lkml.kernel.org/r/20220401230804.1658207-1-zi.yan@sent.com
Link: https://lore.kernel.org/linux-mm/27ff69f9-60c5-9e59-feb2-295250077551@suse.cz/
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Wei Yang

To get a round-robin node order within the same distance group, we add a penalty to the first node we pick in each round. We don't need to decrease the penalty afterwards, since:
* find_next_best_node() always iterates nodes in the same order
* distance matters more than the penalty in find_next_best_node()
* among nodes with the same distance, the first one will be picked
So it is fine to apply the same penalty increase when we get the first node in the same distance group. And since we only add a constant of 1 to the node penalty, it is no longer necessary to multiply by MAX_NODE_LOAD for preference.

[richard.weiyang@gmail.com: remove MAX_NODE_LOAD, per Vlastimil]
Link: https://lkml.kernel.org/r/20220412001319.7462-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20220123013537.20491-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Joel Savitz

Commit 5ef64cc8 ("mm: allow a controlled amount of unfairness in the page lock") introduced a new sysctl but no accompanying documentation. Add a simple entry to the documentation.

Link: https://lkml.kernel.org/r/20220325164437.120246-1-jsavitz@redhat.com
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "zhangyi (F)" <yi.zhang@huawei.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Yury Norov

vmap() takes struct page *pages as one of its arguments, and a user may provide an invalid pointer, which may lead to a corrupted translation table. An example of such behaviour is erroneous usage of virt_to_page():

    vaddr1 = dma_alloc_coherent()
    page = virt_to_page()     // Wrong here
    ...
    vaddr2 = vmap(page)
    memset(vaddr2)            // Faulting here

virt_to_page() returns a wrong pointer if vaddr1 is not a linear kernel address. The problem is that vmap() populates the pte with the bad pfn successfully, and it's much harder to debug at memory access time. This case should be caught by DEBUG_VIRTUAL when it is enabled, but it's not enabled in popular distros.

The kernel already checks the pages against NULL. In the case mentioned above, however, the address is not NULL, and it's big enough that the hardware generated an Address Size Abort on arm64:

[  665.484101] Unhandled fault at 0xffff8000252cd000
[  665.488807] Mem abort info:
[  665.491617]   ESR = 0x96000043
[  665.494675]   EC = 0x25: DABT (current EL), IL = 32 bits
[  665.499985]   SET = 0, FnV = 0
[  665.503039]   EA = 0, S1PTW = 0
[  665.506167] Data abort info:
[  665.509047]   ISV = 0, ISS = 0x00000043
[  665.512882]   CM = 0, WnR = 1
[  665.515851] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000818cb000
[  665.522550] [ffff8000252cd000] pgd=000000affcfff003, pud=000000affcffe003, pmd=0000008fad8c3003, pte=00688000a5217713
[  665.533160] Internal error: level 3 address size fault: 96000043 [#1] SMP
[  665.539936] Modules linked in: [...]
[  665.616212] CPU: 178 PID: 13199 Comm: test Tainted: P OE 5.4.0-84-generic #94~18.04.1-Ubuntu
[  665.626806] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
[  665.636618] pstate: 80400009 (Nzcv daif +PAN -UAO)
[  665.641407] pc : __memset+0x38/0x188
[  665.645146] lr : test+0xcc/0x3f8
[  665.650184] sp : ffff8000359bb840
[  665.653486] x29: ffff8000359bb840 x28: 0000000000000000
[  665.658785] x27: 0000000000000000 x26: 0000000000231000
[  665.664083] x25: ffff00ae660f6110 x24: ffff00ae668cb800
[  665.669382] x23: 0000000000000001 x22: ffff00af533e5000
[  665.674680] x21: 0000000000001000 x20: 0000000000000000
[  665.679978] x19: ffff00ae66950000 x18: ffffffffffffffff
[  665.685276] x17: 00000000588636a5 x16: 0000000000000013
[  665.690574] x15: ffffffffffffffff x14: 000000000007ffff
[  665.695872] x13: 0000000080000000 x12: 0140000000000000
[  665.701170] x11: 0000000000000041 x10: ffff8000652cd000
[  665.706468] x9 : ffff8000252cf000 x8 : ffff8000252cd000
[  665.711767] x7 : 0303030303030303 x6 : 0000000000001000
[  665.717065] x5 : ffff8000252cd000 x4 : 0000000000000000
[  665.722363] x3 : ffff8000252cdfff x2 : 0000000000000001
[  665.727661] x1 : 0000000000000003 x0 : ffff8000252cd000
[  665.732960] Call trace:
[  665.735395]  __memset+0x38/0x188
[...]

Interestingly, this abort happens even if copy_from_kernel_nofault() is used, which is quite inconvenient for debugging purposes.

This patch adds a pfn_valid() check into the vmap() path, so that an invalid mapping will not be created; WARN_ON() is used to let client code know that something went wrong, and that it's not a regular EINVAL situation.

Link: https://lkml.kernel.org/r/20220422220410.1308706-1-yury.norov@gmail.com
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
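A hedged sketch of where the check lands (the PTE-installing loop used by vmap() in mm/vmalloc.c, simplified; the upstream function also threads a pgtbl_mod_mask through): an invalid pfn is rejected with a WARN_ON() before any PTE is written.

    static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
                                    unsigned long end, pgprot_t prot,
                                    struct page **pages, int *nr)
    {
            pte_t *pte = pte_alloc_kernel(pmd, addr);

            if (!pte)
                    return -ENOMEM;
            do {
                    struct page *page = pages[*nr];

                    if (WARN_ON(!pte_none(*pte)))
                            return -EBUSY;
                    if (WARN_ON(!page))
                            return -ENOMEM;
                    /* new: reject bogus struct page pointers at map time */
                    if (WARN_ON(!pfn_valid(page_to_pfn(page))))
                            return -EINVAL;
                    set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
                    (*nr)++;
            } while (pte++, addr += PAGE_SIZE, addr != end);
            return 0;
    }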
-
Submitted by Yixuan Cao

The sentence "but the mempolcy want to alloc memory by interleaving" should be rephrased as "but the mempolicy wants to alloc memory by interleaving", where "mempolicy" is a struct name.

This work is coauthored by Yinan Zhang, Jiajian Ye, Shenghong Han, Chongxi Zhao, Yuhong Feng and Yongqiang Liu.

Link: https://lkml.kernel.org/r/20220401064543.4447-1-caoyixuan2019@email.szu.edu.cn
Signed-off-by: Yixuan Cao <caoyixuan2019@email.szu.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Lu Jialin

There is no use for the private value, __OOM_TYPE and the OOM notifier OOM_CONTROL. Therefore remove them to make the code clean.

Link: https://lkml.kernel.org/r/20220421122755.40899-1-lujialin4@huawei.com
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Lu Jialin

cgroup_memory_noswap is only used in mm/memcontrol.c, so just make it static and remove the export in include/linux/memcontrol.h.

Link: https://lkml.kernel.org/r/20220421124736.62180-1-lujialin4@huawei.com
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Roman Gushchin
List memory control and kernel memory control kselftests in the memory resource controller entry.

Link: https://lkml.kernel.org/r/20220415000133.3955987-5-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: David Vernet <void@manifault.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Roman Gushchin
List cgroup kselftests in the cgroup MAINTAINERS entry. These are tests covering core, freezer and cgroup.kill functionality.

Link: https://lkml.kernel.org/r/20220415000133.3955987-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Down <chris@chrisdown.name>
Cc: David Vernet <void@manifault.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Roman Gushchin

After commit 0e4b01df ("mm, memcg: throttle allocators when failing reclaim over memory.high"), allocating memory over memory.high became very time consuming. But that's exactly what the memory.high test from the cgroup kselftests is doing: it tries to allocate 100M with a 30M memory.high value, and it takes forever to complete.

In order to keep it passing (or failing) in a reasonable amount of time, let's try to allocate only a little over 30M: 31M to be precise. With this change test_memcontrol finishes in a reasonable amount of time:

  $ time ./test_memcontrol
  ok 1 test_memcg_subtree_control
  ok 2 test_memcg_current
  ok 3 test_memcg_min
  ok 4 test_memcg_low
  ok 5 test_memcg_high
  ok 6 test_memcg_max
  ok 7 test_memcg_oom_events
  ok 8 test_memcg_swap_max
  ok 9 test_memcg_sock
  ok 10 test_memcg_oom_group_leaf_events
  ok 11 test_memcg_oom_group_parent_events
  ok 12 test_memcg_oom_group_score_events

  real  0m2.273s
  user  0m0.064s
  sys   0m0.739s

Link: https://lkml.kernel.org/r/20220415000133.3955987-3-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: David Vernet <void@manifault.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
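A hedged sketch of the test change (tools/testing/selftests/cgroup/test_memcontrol.c; helper names approximate): with memory.high at 30M, the child now allocates just past the limit instead of far past it, so the per-page throttling introduced by the cited commit no longer dominates the runtime.

    /* inside test_memcg_high(), sketched */
    if (cg_write(memcg, "memory.high", "30M"))
            goto cleanup;

    /*
     * Was MB(100): every page charged above memory.high is heavily throttled,
     * so allocating 70M over the limit took minutes.  31M still crosses the
     * limit but finishes in seconds.
     */
    if (cg_run(memcg, alloc_anon, (void *)MB(31)))
            goto cleanup;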
-
Submitted by Roman Gushchin

Patch series "mm: memcg kselftests fixes".

This patch (of 4): Commit 9852ae3f ("mm, memcg: consider subtrees in memory.events") made memory.events recursive: all events are propagated upwards through the tree. It was a change in semantics, and it broke the oom group leaf events test, which assumes that after an OOM the oom_kill counter is zero at the parent's level.

Let's adjust the test: it should have similar expectations for the child and parent levels. The test passes after this fix.

Link: https://lkml.kernel.org/r/20220415000133.3955987-2-roman.gushchin@linux.dev
Link: https://lkml.kernel.org/r/20220415000133.3955987-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: David Vernet <void@manifault.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Wei Yang

After commit bef8620c ("mm: memcg: deprecate the non-hierarchical mode"), we won't have a NULL parent except for root_mem_cgroup, and that case is already handled when (memcg == root).

Link: https://lkml.kernel.org/r/20220403020833.26164-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Submitted by Wei Yang

For each round trip, we assign the generation on the first invocation and compare it on subsequent invocations. Let's move the two together to make the code more self-explanatory. This also removes a check on prev.

[hannes@cmpxchg.org: better comment to explain reclaim model]
Link: https://lkml.kernel.org/r/20220330234719.18340-4-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-