- 03 November 2022, 1 commit
-
-
Committed by Alistair Popple

mainline inclusion
from mainline-v6.1-rc1
commit 16ce101d
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5VZ0L
CVE: CVE-2022-3523
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16ce101db85db694a91380aa4c89b25530871d33

--------------------------------

Patch series "Fix several device private page reference counting issues", v2

This series aims to fix a number of page reference counting issues in drivers dealing with device private ZONE_DEVICE pages. These result in use-after-free type bugs, either from accessing a struct page which no longer exists because it has been removed, or from accessing fields within the struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems. However, without these fixes it is possible to crash the kernel from userspace. These crashes can be triggered either by unloading the kernel module or unbinding the device from the driver prior to a userspace task exiting. In modules such as Nouveau it is also possible to trigger some of these issues by explicitly closing the device file-descriptor prior to the task exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code. Unfortunately I lack hardware to test either of those, so any help there would be appreciated. The changes mimic what is done for both Nouveau and hmm-tests, though, so I doubt they will cause problems.

This patch (of 8):

When the CPU tries to access a device private page, the migrate_to_ram() callback associated with the pgmap for the page is called. However, no reference is taken on the faulting page. Therefore a concurrent migration of the device private page can free the page and possibly the underlying pgmap. This results in a race which can crash the kernel due to the migrate_to_ram() function pointer becoming invalid. It also means drivers can't reliably read the zone_device_data field because the page may have been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to ensure it has not been freed. Unfortunately the elevated reference count will cause the migration required to handle the fault to fail. To avoid this failure, pass the faulting page into the migrate_vma functions so that if an elevated reference count is found it can be checked to see if it's expected or not.

[mpe@ellerman.id.au: fix build]
Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	arch/powerpc/kvm/book3s_hv_uvmem.c
	include/linux/migrate.h
	lib/test_hmm.c
	mm/migrate.c
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
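The closing of the race described here follows a simple pattern: take a reference on the device-private page while the PTE lock still guarantees the page cannot be freed, then drop the lock and call migrate_to_ram(). The fragment below is an illustrative sketch of that pattern (not the literal patch); the do_swap_page() context, the name of the swap-entry-to-page helper, and the error handling are assumptions.

```c
/* Sketch of the device-private branch in do_swap_page() (assumed context). */
vmf->page = device_private_entry_to_page(entry); /* helper name varies by kernel version */
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
	/* The PTE changed under us; nothing to do, let the fault retry. */
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	goto out;
}

/*
 * Take a reference while the ptl guarantees the page has not been
 * freed, so the pgmap and its migrate_to_ram() callback stay valid.
 */
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);

ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
put_page(vmf->page);
```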
-
- 02 November 2022, 1 commit
-
-
Committed by Gowans, James

stable inclusion
from stable-v5.10.132
commit 931dbcc2e02f0409c095b11e35490cade9ac14af
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5YS3T
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=931dbcc2e02f0409c095b11e35490cade9ac14af

--------------------------------

commit 14c99d65 upstream.

Currently the implementation will split the PUD when a fallback is taken inside the create_huge_pud function. This isn't where it should be done: the splitting should be done in wp_huge_pud, just like it's done for PMDs. The reason is that if a fallback is taken during create, there is no PUD yet so there is nothing to split, whereas if a fallback is taken when encountering a write protection fault there is something to split.

It looks like this was the original intention with the commit where the splitting was introduced, but somehow it got moved to the wrong place between v1 and v2 of the patch series. Rebase mistake, perhaps.

Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com
Fixes: 327e9fd4 ("mm: Split huge pages on write-notify or COW")
Signed-off-by: James Gowans <jgowans@amazon.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
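The shape of the fix, as described above, is to let create_huge_pud() simply fall back while wp_huge_pud() performs the split when the write-protection fault cannot be handled at PUD level. The sketch below illustrates that placement; it is an approximation, not the exact hunk, and the huge_fault() condition details are assumptions.

```c
/* Sketch: split on the write-protect path, mirroring the PMD code. */
static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
{
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
    defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
	/* No support for anonymous transparent PUD pages yet. */
	if (vma_is_anonymous(vmf->vma))
		goto split;
	if (vmf->vma->vm_ops->huge_fault) {
		vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);

		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
	}
split:
	/* COW or write-notify not handled on PUD level: split the PUD. */
	__split_huge_pud(vmf->vma, vmf->pud, vmf->address);
#endif
	return VM_FAULT_FALLBACK;
}
```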
-
- 19 October 2022, 6 commits
-
-
Committed by Jann Horn
stable inclusion from stable-v5.10.141 commit 98f401d36396134c0c86e9e3bd00b6b6b028b521 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5USOP CVE: CVE-2022-42703 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=98f401d36396134c0c86e9e3bd00b6b6b028b521 -------------------------------- commit 2555283e upstream. anon_vma->degree tracks the combined number of child anon_vmas and VMAs that use the anon_vma as their ->anon_vma. anon_vma_clone() then assumes that for any anon_vma attached to src->anon_vma_chain other than src->anon_vma, it is impossible for it to be a leaf node of the VMA tree, meaning that for such VMAs ->degree is elevated by 1 because of a child anon_vma, meaning that if ->degree equals 1 there are no VMAs that use the anon_vma as their ->anon_vma. This assumption is wrong because the ->degree optimization leads to leaf nodes being abandoned on anon_vma_clone() - an existing anon_vma is reused and no new parent-child relationship is created. So it is possible to reuse an anon_vma for one VMA while it is still tied to another VMA. This is an issue because is_mergeable_anon_vma() and its callers assume that if two VMAs have the same ->anon_vma, the list of anon_vmas attached to the VMAs is guaranteed to be the same. When this assumption is violated, vma_merge() can merge pages into a VMA that is not attached to the corresponding anon_vma, leading to dangling page->mapping pointers that will be dereferenced during rmap walks. Fix it by separately tracking the number of child anon_vmas and the number of VMAs using the anon_vma as their ->anon_vma. Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy") Cc: stable@kernel.org Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NJann Horn <jannh@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
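The fix replaces the overloaded ->degree counter with two separate counts, so that "no VMA uses this anon_vma" can be tested without the leaf-node assumption described above. A rough sketch of the resulting structure is shown below; the field names follow the description, the rest of the struct is abbreviated, and the exact layout is an assumption rather than the real definition.

```c
/* Sketch of struct anon_vma with the split counters (abbreviated). */
struct anon_vma {
	struct anon_vma		*root;
	struct rw_semaphore	rwsem;
	atomic_t		refcount;

	/*
	 * Previously a single ->degree mixed both quantities below, which
	 * made "degree == 1" an unreliable test for "no VMAs use this
	 * anon_vma". Count them separately instead.
	 */
	unsigned long		num_children;	 /* child anon_vmas */
	unsigned long		num_active_vmas; /* VMAs with ->anon_vma == this */

	struct anon_vma		*parent;
	struct rb_root_cached	rb_root;
};
```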
-
Committed by Andrey Konovalov
mainline inclusion from mainline-v6.1-rc1 commit ca77f290 category: bugfix bugzilla: 187796, https://gitee.com/openeuler/kernel/issues/I5W6YV CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca77f290cff1dfa095d71ae16cc7cda8ee6df495 -------------------------------- Patch series "kasan: switch tag-based modes to stack ring from per-object metadata", v3. This series makes the tag-based KASAN modes use a ring buffer for storing stack depot handles for alloc/free stack traces for slab objects instead of per-object metadata. This ring buffer is referred to as the stack ring. On each alloc/free of a slab object, the tagged address of the object and the current stack trace are recorded in the stack ring. On each bug report, if the accessed address belongs to a slab object, the stack ring is scanned for matching entries. The newest entries are used to print the alloc/free stack traces in the report: one entry for alloc and one for free. The advantages of this approach over storing stack trace handles in per-object metadata with the tag-based KASAN modes: - Allows to find relevant stack traces for use-after-free bugs without using quarantine for freed memory. (Currently, if the object was reallocated multiple times, the report contains the latest alloc/free stack traces, not necessarily the ones relevant to the buggy allocation.) - Allows to better identify and mark use-after-free bugs, effectively making the CONFIG_KASAN_TAGS_IDENTIFY functionality always-on. - Has fixed memory overhead. The disadvantage: - If the affected object was allocated/freed long before the bug happened and the stack trace events were purged from the stack ring, the report will have no stack traces. Discussion ========== The proposed implementation of the stack ring uses a single ring buffer for the whole kernel. This might lead to contention due to atomic accesses to the ring buffer index on multicore systems. At this point, it is unknown whether the performance impact from this contention would be significant compared to the slowdown introduced by collecting stack traces due to the planned changes to the latter part, see the section below. For now, the proposed implementation is deemed to be good enough, but this might need to be revisited once the stack collection becomes faster. A considered alternative is to keep a separate ring buffer for each CPU and then iterate over all of them when printing a bug report. This approach requires somehow figuring out which of the stack rings has the freshest stack traces for an object if multiple stack rings have them. Further plans ============= This series is a part of an effort to make KASAN stack trace collection suitable for production. This requires stack trace collection to be fast and memory-bounded. The planned steps are: 1. Speed up stack trace collection (potentially, by using SCS; patches on-hold until steps #2 and #3 are completed). 2. Keep stack trace handles in the stack ring (this series). 3. Add a memory-bounded mode to stack depot or provide an alternative memory-bounded stack storage. 4. Potentially, implement stack trace collection sampling to minimize the performance impact. This patch (of 34): __kasan_metadata_size() calculates the size of the redzone for objects in a slab cache. When accounting for presence of kasan_free_meta in the redzone, this function only compares free_meta_offset with 0. 
But free_meta_offset could also be equal to KASAN_NO_FREE_META, which indicates that kasan_free_meta is not present at all. Add a comparison with KASAN_NO_FREE_META into __kasan_metadata_size(). Link: https://lkml.kernel.org/r/cover.1662411799.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/c7b316d30d90e5947eb8280f4dc78856a49298cf.1662411799.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NMarco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Peter Collingbourne <pcc@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
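The one-line change this entry describes is easiest to see in context: __kasan_metadata_size() must treat KASAN_NO_FREE_META, and not just 0, as "no free metadata". The sketch below approximates the fixed function; the kasan_info field names are taken from the description and the exact expression in this tree is assumed.

```c
/* Approximate sketch of the fixed redzone-size calculation. */
size_t __kasan_metadata_size(struct kmem_cache *cache)
{
	return (cache->kasan_info.alloc_meta_offset ?
		sizeof(struct kasan_alloc_meta) : 0) +
	       ((cache->kasan_info.free_meta_offset &&
		 cache->kasan_info.free_meta_offset != KASAN_NO_FREE_META) ?
		sizeof(struct kasan_free_meta) : 0);
}
```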
-
Committed by Andrey Konovalov
mainline inclusion from mainline-v5.11-rc1 commit 97593cad category: bugfix bugzilla: 187796, https://gitee.com/openeuler/kernel/issues/I5W6YV CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=97593cad003c668e2532cb2939a24a031f8de52d -------------------------------- KASAN marks caches that are sanitized with the SLAB_KASAN cache flag. Currently if the metadata that is appended after the object (stores e.g. stack trace ids) doesn't fit into KMALLOC_MAX_SIZE (can only happen with SLAB, see the comment in the patch), KASAN turns off sanitization completely. With this change sanitization of the object data is always enabled. However the metadata is only stored when it fits. Instead of checking for SLAB_KASAN flag accross the code to find out whether the metadata is there, use cache->kasan_info.alloc/free_meta_offset. As 0 can be a valid value for free_meta_offset, introduce KASAN_NO_FREE_META as an indicator that the free metadata is missing. Without this change all sanitized KASAN objects would be put into quarantine with generic KASAN. With this change, only the objects that have metadata (i.e. when it fits) are put into quarantine, the rest is freed right away. Along the way rework __kasan_cache_create() and add claryfying comments. Link: https://lkml.kernel.org/r/aee34b87a5e4afe586c2ac6a0b32db8dc4dcc2dc.1606162397.git.andreyknvl@google.com Link: https://linux-review.googlesource.com/id/Icd947e2bea054cb5cfbdc6cf6652227d97032dcbCo-developed-by: NVincenzo Frascino <Vincenzo.Frascino@arm.com> Signed-off-by: NVincenzo Frascino <Vincenzo.Frascino@arm.com> Signed-off-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NMarco Elver <elver@google.com> Tested-by: NVincenzo Frascino <vincenzo.frascino@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/kasan/common.c mm/kasan/hw_tags.c mm/kasan/kasan.h mm/kasan/quarantine.c mm/kasan/report.c mm/kasan/report_sw_tags.c mm/kasan/sw_tags.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Andrey Konovalov
mainline inclusion from mainline-v5.11-rc1 commit 8bb0009b category: bugfix bugzilla: 187796, https://gitee.com/openeuler/kernel/issues/I5W6YV CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8bb0009b19465da5a0cd394b5a6ccc2eaf418f23 -------------------------------- Add set_alloc_info() helper and move kasan_set_track() into it. This will simplify the code for one of the upcoming changes. No functional changes. Link: https://lkml.kernel.org/r/b2393e8f1e311a70fc3aaa2196461b6acdee7d21.1606162397.git.andreyknvl@google.com Link: https://linux-review.googlesource.com/id/I0316193cbb4ecc9b87b7c2eee0dd79f8ec908c1aSigned-off-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NDmitry Vyukov <dvyukov@google.com> Reviewed-by: NMarco Elver <elver@google.com> Tested-by: NVincenzo Frascino <vincenzo.frascino@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Andrey Konovalov
mainline inclusion from mainline-v5.11-rc1 commit 6476792f category: bugfix bugzilla: 187796, https://gitee.com/openeuler/kernel/issues/I5W6YV CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6476792f1015a356e6864076c210b328b64d08cc -------------------------------- Rename get_alloc_info() and get_free_info() to kasan_get_alloc_meta() and kasan_get_free_meta() to better reflect what those do and avoid confusion with kasan_set_free_info(). No functional changes. Link: https://lkml.kernel.org/r/27b7c036b754af15a2839e945f6d8bfce32b4c2f.1606162397.git.andreyknvl@google.com Link: https://linux-review.googlesource.com/id/Ib6e4ba61c8b12112b403d3479a9799ac8fff8de1Signed-off-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NDmitry Vyukov <dvyukov@google.com> Reviewed-by: NMarco Elver <elver@google.com> Tested-by: NVincenzo Frascino <vincenzo.frascino@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/kasan/generic.c mm/kasan/quarantine.c mm/kasan/report.c mm/kasan/report_sw_tags.c mm/kasan/sw_tags.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Andrey Konovalov
mainline inclusion from mainline-v5.11-rc1 commit c696de9f category: bugfix bugzilla: 187796, https://gitee.com/openeuler/kernel/issues/I5W6YV CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c696de9f12b7ddeddc05d378fc4dc0f66e9a8c95 -------------------------------- Patch series "kasan: boot parameters for hardware tag-based mode", v4. === Overview Hardware tag-based KASAN mode [1] is intended to eventually be used in production as a security mitigation. Therefore there's a need for finer control over KASAN features and for an existence of a kill switch. This patchset adds a few boot parameters for hardware tag-based KASAN that allow to disable or otherwise control particular KASAN features, as well as provides some initial optimizations for running KASAN in production. There's another planned patchset what will further optimize hardware tag-based KASAN, provide proper benchmarking and tests, and will fully enable tag-based KASAN for production use. Hardware tag-based KASAN relies on arm64 Memory Tagging Extension (MTE) [2] to perform memory and pointer tagging. Please see [3] and [4] for detailed analysis of how MTE helps to fight memory safety problems. The features that can be controlled are: 1. Whether KASAN is enabled at all. 2. Whether KASAN collects and saves alloc/free stacks. 3. Whether KASAN panics on a detected bug or not. The patch titled "kasan: add and integrate kasan boot parameters" of this series adds a few new boot parameters. kasan.mode allows to choose one of three main modes: - kasan.mode=off - KASAN is disabled, no tag checks are performed - kasan.mode=prod - only essential production features are enabled - kasan.mode=full - all KASAN features are enabled The chosen mode provides default control values for the features mentioned above. However it's also possible to override the default values by providing: - kasan.stacktrace=off/on - enable stacks collection (default: on for mode=full, otherwise off) - kasan.fault=report/panic - only report tag fault or also panic (default: report) If kasan.mode parameter is not provided, it defaults to full when CONFIG_DEBUG_KERNEL is enabled, and to prod otherwise. It is essential that switching between these modes doesn't require rebuilding the kernel with different configs, as this is required by the Android GKI (Generic Kernel Image) initiative. === Benchmarks For now I've only performed a few simple benchmarks such as measuring kernel boot time and slab memory usage after boot. There's an upcoming patchset which will optimize KASAN further and include more detailed benchmarking results. The benchmarks were performed in QEMU and the results below exclude the slowdown caused by QEMU memory tagging emulation (as it's different from the slowdown that will be introduced by hardware and is therefore irrelevant). KASAN_HW_TAGS=y + kasan.mode=off introduces no performance or memory impact compared to KASAN_HW_TAGS=n. kasan.mode=prod (manually excluding tagging) introduces 3% of performance and no memory impact (except memory used by hardware to store tags) compared to kasan.mode=off. kasan.mode=full has about 40% performance and 30% memory impact over kasan.mode=prod. Both come from alloc/free stack collection. === Notes This patchset is available here: https://github.com/xairy/linux/tree/up-boot-mte-v4 This patchset is based on v11 of "kasan: add hardware tag-based mode for arm64" patchset [1]. For testing in QEMU hardware tag-based KASAN requires: 1. 
QEMU built from master [6] (use "-machine virt,mte=on -cpu max" arguments to run). 2. GCC version 10. [1] https://lore.kernel.org/linux-arm-kernel/cover.1606161801.git.andreyknvl@google.com/T/#t [2] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety [3] https://arxiv.org/pdf/1802.09517.pdf [4] https://github.com/microsoft/MSRC-Security-Research/blob/master/papers/2020/Security%20analysis%20of%20memory%20tagging.pdf [5] https://source.android.com/devices/architecture/kernel/generic-kernel-image [6] https://github.com/qemu/qemu === Tags Tested-by: NVincenzo Frascino <vincenzo.frascino@arm.com> This patch (of 19): Move get_free_info() call into quarantine_put() to simplify the call site. No functional changes. Link: https://lkml.kernel.org/r/cover.1606162397.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/312d0a3ef92cc6dc4fa5452cbc1714f9393ca239.1606162397.git.andreyknvl@google.com Link: https://linux-review.googlesource.com/id/Iab0f04e7ebf8d83247024b7190c67c3c34c7940fSigned-off-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NDmitry Vyukov <dvyukov@google.com> Reviewed-by: NMarco Elver <elver@google.com> Tested-by: NVincenzo Frascino <vincenzo.frascino@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 29 September 2022, 3 commits
-
-
Committed by Jann Horn
stable inclusion from stable-v5.10.142 commit 895428ee124ad70b9763259308354877b725c31d category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5PE9S CVE: CVE-2022-39188 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=895428ee124ad70b9763259308354877b725c31d -------------------------------- commit b67fbebd upstream. Some drivers rely on having all VMAs through which a PFN might be accessible listed in the rmap for correctness. However, on X86, it was possible for a VMA with stale TLB entries to not be listed in the rmap. This was fixed in mainline with commit b67fbebd ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"), but that commit relies on preceding refactoring in commit 18ba064e ("mmu_gather: Let there be one tlb_{start,end}_vma() implementation") and commit 1e9fdf21 ("mmu_gather: Remove per arch tlb_{start,end}_vma()"). This patch provides equivalent protection without needing that refactoring, by forcing a TLB flush between removing PTEs in unmap_vmas() and the call to unlink_file_vma() in free_pgtables(). [This is a stable-specific rewrite of the upstream commit!] Signed-off-by: NJann Horn <jannh@google.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Nze zuo <zuoze1@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
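Because this is a stable-specific rewrite rather than the upstream refactoring, the whole fix boils down to ordering: flush the TLB after unmap_vmas() and before free_pgtables()/unlink_file_vma(). The sketch below shows that ordering inside unmap_region(); the exact signatures and surrounding details in this 5.10-based tree are assumptions.

```c
/* Sketch of mm/mmap.c:unmap_region() with the forced flush (details assumed). */
static void unmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
			 struct vm_area_struct *prev,
			 unsigned long start, unsigned long end)
{
	struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
	struct mmu_gather tlb;

	lru_add_drain();
	tlb_gather_mmu(&tlb, mm, start, end);
	update_hiwater_rss(mm);
	unmap_vmas(&tlb, vma, start, end);

	/*
	 * Flush now, so no stale TLB entry can still reach a PFN once the
	 * VMA is unlinked from the rmap below.
	 */
	tlb_flush_mmu(&tlb);

	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
		      next ? next->vm_start : USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb, start, end);
}
```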
-
Committed by Mike Kravetz

stable inclusion
from stable-v5.10.121
commit 63758dd9595f87c7e7b5f826fd2dcf53d6aff0cf
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5L6CQ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=63758dd9595f87c7e7b5f826fd2dcf53d6aff0cf

--------------------------------

commit 48381273 upstream.

The routine huge_pmd_unshare() is passed a pointer to an address associated with an area which may be unshared. If unshare is successful, this address is updated to 'optimize' callers iterating over huge page addresses. For the optimization to work correctly, address should be updated to the last huge page in the unmapped/unshared area. However, in the common case where the passed address is PUD_SIZE aligned, the address is incorrectly updated to the address of the preceding huge page. That wastes CPU cycles as the unmapped/unshared range is scanned twice.

Link: https://lkml.kernel.org/r/20220524205003.126184-1-mike.kravetz@oracle.com
Fixes: 39dde65c ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
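The off-by-one-huge-page behaviour described above is pure address arithmetic, so a tiny userspace demo makes it concrete. The program below assumes 2 MiB huge pages and 1 GiB PUD-sized sharing (typical for x86-64/arm64) and compares a pre-fix style rounding with the intended "last huge page in the unshared range"; it is an illustration, not kernel code, and not the literal patch.

```c
#include <stdio.h>

#define HPAGE_SIZE	(2UL << 20)	/* 2 MiB huge page    */
#define PUD_SIZE	(1UL << 30)	/* 1 GiB shared range */
#define ALIGN_UP(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

int main(void)
{
	unsigned long addr = 2 * PUD_SIZE;	/* already PUD_SIZE aligned */

	/*
	 * Pre-fix style update: rounding up an already aligned address is a
	 * no-op, so subtracting one huge page lands on the page *before*
	 * the range that was just unshared, and the caller rescans it.
	 */
	unsigned long buggy = ALIGN_UP(addr, PUD_SIZE) - HPAGE_SIZE;

	/* Intended target: the last huge page inside the unshared range. */
	unsigned long wanted = ALIGN_DOWN(addr, PUD_SIZE) + PUD_SIZE - HPAGE_SIZE;

	printf("buggy: %#lx  wanted: %#lx\n", buggy, wanted);
	return 0;
}
```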
-
Committed by Rei Yamamoto

stable inclusion
from stable-v5.10.121
commit 7994d890123a6cad033f2842ff0177a9bda1cb23
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5L6CQ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7994d890123a6cad033f2842ff0177a9bda1cb23

--------------------------------

commit bbe832b9 upstream.

At present, pages not in the target zone are added to cc->migratepages list in isolate_migratepages_block(). As a result, pages may migrate between nodes unintentionally. This would be a serious problem for older kernels without commit a984226f ("mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec"), because it can corrupt the lru list by handling pages in list without holding proper lru_lock.

Avoid returning a pfn outside the target zone in the case that it is not aligned with a pageblock boundary. Otherwise isolate_migratepages_block() will handle pages not in the target zone.

Link: https://lkml.kernel.org/r/20220511044300.4069-1-yamamoto.rei@jp.fujitsu.com
Fixes: 70b44595 ("mm, compaction: use free lists to quickly locate a migration source")
Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Wonhyuk Yang <vvghjk1234@gmail.com>
Cc: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
- 20 September 2022, 5 commits
-
-
Committed by liubo

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
CVE: NA

-------------------------------------------------

etmem is the memory vertical expansion technology. By default, the existing memory expansion tool etmem swaps out every page of the process that can be swapped out, unless the page is marked with the lock flag.

This patch adds the ability to swap out only specified pages: the process marks the pages to be swapped out with the VM_SWAPFLAG flag, and etmem adds a filter to its scanning module so that only those pages are swapped out.

Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
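As a rough idea of what such a filter looks like, the helper below skips every VMA that the process has not tagged with VM_SWAPFLAG (the flag named in this entry). The helper name and its exact placement in the etmem scan path are assumptions, not the actual implementation.

```c
/* Hypothetical filter used while walking a task's VMAs for swap-out. */
static bool etmem_vma_swappable(struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_SWAPFLAG))
		return false;	/* not tagged by the process: skip */
	if (vma->vm_flags & VM_LOCKED)
		return false;	/* locked memory is never swapped  */
	return true;
}
```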
-
Committed by liubo

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
CVE: NA

-------------------------------------------------

etmem is the memory vertical expansion technology. In the current etmem process, memory page swapping is implemented by invoking shrink_page_list. When this interface is invoked for the first time, pages are added to the swap cache and written to disk. The swap cache page is reclaimed only when this interface is invoked a second time and no process accesses the page. However, in the etmem process, user mode scans pages that have been accessed, and migration is not delivered for pages that are not accessed by processes. Therefore, the swap cache may remain occupied.

To solve this problem, add logic for actively reclaiming the swap cache: when the swap cache occupies a large amount of memory, the system proactively scans the LRU lists and reclaims swap cache entries to keep memory usage within the specified range.

Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by liubo

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A
CVE: NA

-------------------------------------------------

etmem, the memory vertical expansion technology, uses DRAM and new high-performance storage media to form multi-level memory storage. By grading the stored data, etmem migrates the classified cold data from the storage medium to the high-performance storage medium, so as to achieve memory capacity expansion and memory cost reduction.

When the memory expansion function etmem is running, the native swap function of the kernel needs to be disabled in certain scenarios to avoid the impact of kernel swap. This feature provides that capability: the /sys/kernel/mm/swap/ directory provides the kernel_swap_enable sysfs interface to enable or disable the native swap function of the kernel. The default value of /sys/kernel/mm/swap/kernel_swap_enable is true, that is, kernel swap is enabled by default.

Turn on kernel swap:
echo true > /sys/kernel/mm/swap/kernel_swap_enable

Turn off kernel swap:
echo false > /sys/kernel/mm/swap/kernel_swap_enable

Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by liubo

euleros inclusion
category: feature
feature: etmem
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A

-------------------------------------------------

Add the CONFIG_ETMEM macro definition for the etmem feature.

Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by liubo

euleros inclusion
category: feature
feature: etmem
bugzilla: https://gitee.com/openeuler/kernel/issues/I5DC4A

-------------------------------------------------

etmem, the memory vertical expansion technology, uses DRAM and new high-performance storage media to form multi-level memory storage.

The etmem feature was introduced in a previous commit (aa7f1d22), but only the config options for the etmem_swap and etmem_scan modules were added; the config option for the etmem feature itself was missing. This commit therefore adds the CONFIG_ETMEM option for the etmem feature.

Signed-off-by: liubo <liubo254@huawei.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 07 September 2022, 3 commits
-
-
Committed by Lu Jialin

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK
CVE: NA

--------

This patch fixes reading memory.high_async_ratio with cat. After this patch, when the user cats memory.high_async_ratio, the correct value is shown.

Show case:
/sys/fs/cgroup/test # cat memory.high_async_ratio
0
/sys/fs/cgroup/test # echo 90 > memory.high_async_ratio
/sys/fs/cgroup/test # cat memory.high_async_ratio
90
/sys/fs/cgroup/test # echo 85 > memory.high_async_ratio
/sys/fs/cgroup/test # cat memory.high_async_ratio
85

Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Lu Jialin

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK
CVE: NA

--------

This patch changes HIGH_ASYNC_RATIO_BASE from 10 to 100 and HIGH_ASYNC_RATIO_GAP from 1 to 10. After this patch, the user can set high_async_ratio from 0 to 99, which allows finer-grained control of memcg async reclaim. If high_async_ratio is smaller than 10, try to reclaim all the pages of the memcg.

Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Sultan Alsawaf
stable inclusion from stable-v5.10.120 commit fae05b2314b147a78fbed1dc4c645d9a66313758 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5L6BR Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fae05b2314b147a78fbed1dc4c645d9a66313758 -------------------------------- commit 2505a981 upstream. The asynchronous zspage free worker tries to lock a zspage's entire page list without defending against page migration. Since pages which haven't yet been locked can concurrently migrate off the zspage page list while lock_zspage() churns away, lock_zspage() can suffer from a few different lethal races. It can lock a page which no longer belongs to the zspage and unsafely dereference page_private(), it can unsafely dereference a torn pointer to the next page (since there's a data race), and it can observe a spurious NULL pointer to the next page and thus not lock all of the zspage's pages (since a single page migration will reconstruct the entire page list, and create_page_chain() unconditionally zeroes out each list pointer in the process). Fix the races by using migrate_read_lock() in lock_zspage() to synchronize with page migration. Link: https://lkml.kernel.org/r/20220509024703.243847-1-sultan@kerneltoast.com Fixes: 77ff4657 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse") Signed-off-by: NSultan Alsawaf <sultan@kerneltoast.com> Acked-by: NMinchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com> Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
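The repaired lock_zspage() walks the page list only while holding migrate_read_lock(), and takes a temporary page reference whenever it has to drop that lock to sleep on a page lock. The sketch below follows the upstream description; get_first_page(), get_next_page() and migrate_read_lock() are zsmalloc-internal helpers, and the details may differ from the backported code.

```c
/* Sketch of lock_zspage() synchronized against page migration. */
static void lock_zspage(struct zspage *zspage)
{
	struct page *curr_page, *page;

	while (1) {
		migrate_read_lock(zspage);
		page = get_first_page(zspage);
		if (trylock_page(page))
			break;
		get_page(page);		/* keep it valid while we sleep */
		migrate_read_unlock(zspage);
		wait_on_page_locked(page);
		put_page(page);
	}

	curr_page = page;
	while ((page = get_next_page(curr_page))) {
		if (trylock_page(page)) {
			curr_page = page;
		} else {
			get_page(page);
			migrate_read_unlock(zspage);
			wait_on_page_locked(page);
			put_page(page);
			migrate_read_lock(zspage);
		}
	}
	migrate_read_unlock(zspage);
}
```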
-
- 01 September 2022, 2 commits
-
-
Committed by David Hildenbrand
mainline inclusion from mainline-v5.19-rc1 commit 50053941 category: bugfix bugzilla: 187533, https://gitee.com/openeuler/kernel/issues/I5OITX CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=500539419fae0aeb27189b2d62a238a056ca6742 -------------------------------- We can already theoretically fail to unmap (still having page_mapped()) in case arch_unmap_one() fails, which can happen on sparc. Failures to unmap are handled gracefully, just as if there are other references on the target page: freezing the refcount in split_huge_page_to_list() will fail if still mapped and we'll simply remap. In commit 504e070d ("mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split") we already converted to VM_WARN_ON_ONCE_PAGE, let's get rid of it completely now. This is a preparation for making try_to_migrate() fail on anonymous pages with GUP pins, which will make this VM_WARN_ON_ONCE_PAGE trigger more frequently. Link: https://lkml.kernel.org/r/20220428083441.37290-11-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Reported-by: NYang Shi <shy828301@gmail.com> Reviewed-by: NYang Shi <shy828301@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Rientjes <rientjes@google.com> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Liang Zhang <zhangliang5@huawei.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nadav Amit <namit@vmware.com> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Jason A. Donenfeld

stable inclusion
from stable-v5.10.119
commit 552ae8e4841ba0550952063922d3b38bcd84ec6e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5L6BB
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=552ae8e4841ba0550952063922d3b38bcd84ec6e

--------------------------------

commit 5ad7dd88 upstream.

randomize_page is an mm function. It is documented like one. It contains the history of one. It has the naming convention of one. It looks just like another very similar function in mm, randomize_stack_top(). And it has always been maintained and updated by mm people. There is no need for it to be in random.c. In the "which shape does not look like the other ones" test, pointing to randomize_page() is correct.

So move randomize_page() into mm/util.c, right next to the similar randomize_stack_top() function.

This commit contains no actual code changes.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
-
- 17 August 2022, 17 commits
-
-
Committed by Chen Wandun

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

Add /proc/sys/vm/cache_limit_mbytes to set the page cache limit. This interface sets the upper limit of the page cache: if page cache usage exceeds cache_limit_mbytes, memory reclaim is triggered. The reclaim size and reclaim interval are decided by the interfaces /proc/sys/vm/cache_reclaim_s and /proc/sys/vm/cache_reclaim_weight, which were introduced in the previous patch.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
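The check behind such a sysctl is essentially a unit conversion plus a comparison against the global file-page counter. The sketch below shows one way to express it; the variable and helper names are illustrative assumptions, not the names used by this patch.

```c
/* Hypothetical helpers for a page-cache limit given in MiB. */
unsigned long cache_limit_mbytes;		/* set via /proc/sys/vm */

static unsigned long cache_limit_pages(void)
{
	return cache_limit_mbytes << (20 - PAGE_SHIFT);	/* MiB -> pages */
}

static bool page_cache_over_limit(void)
{
	unsigned long file_pages = global_node_page_state(NR_FILE_PAGES);

	return cache_limit_mbytes && file_pages > cache_limit_pages();
}
```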
-
Committed by Chen Wandun

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

Add periodical memory reclaim support. There are three new interfaces:

1) /proc/sys/vm/cache_reclaim_s --- used to set the reclaim interval
2) /proc/sys/vm/cache_reclaim_weight --- used to calculate the reclaim amount
3) /proc/sys/vm/cache_reclaim_enable --- used to switch this feature on/off

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Chen Wandun

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

This reverts commit 933db18a. This feature will be reimplemented.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Chen Wandun

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

This reverts commit b072a9d4. This feature will be reimplemented.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Chen Wandun

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

This reverts commit 425bce98. This feature will be reimplemented.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Chen Wandun

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

This reverts commit 7be2f4c4. This feature will be reimplemented.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Chen Wandun

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

This reverts commit 955d63ae. This feature will be reimplemented.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Chen Wandun

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

This reverts commit 9fea105d. This feature will be reimplemented.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Chen Wandun

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4HOXK
CVE: NA

--------------------------------

This reverts commit e56e8310. This feature will be reimplemented.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Mike Rapoport
mainline inclusion from mainline-v5.15-rc1 commit 08678804 category: bugfix bugzilla: 186414, https://gitee.com/openeuler/kernel/issues/I5K7IY CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=08678804e0b305bbbf5b756ad365373e5fe885a2 -------------------------------- Functions memblock_alloc_exact_nid_raw() and memblock_alloc_try_nid_raw() are intended for early memory allocation without overhead of zeroing the allocated memory. Since these functions were used to allocate the memory map, they have ended up with addition of a call to page_init_poison() that poisoned the allocated memory when CONFIG_PAGE_POISON was set. Since the memory map is allocated using a dedicated memmep_alloc() function that takes care of the poisoning, remove page poisoning from the memblock_alloc_*_raw() functions. Link: https://lkml.kernel.org/r/20210714123739.16493-5-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Mike Rapoport
mainline inclusion from mainline-v5.15-rc1 commit c803b3c8 category: bugfix bugzilla: 186414, https://gitee.com/openeuler/kernel/issues/I5K7IY CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c803b3c8b3b70f306ee6300bf8acdd70ffd1441a -------------------------------- There are several places that allocate memory for the memory map: alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and __populate_section_memmap() for SPARSEMEM. The memory allocated in the FLATMEM case is zeroed and it is never poisoned, regardless of CONFIG_PAGE_POISON setting. The memory allocated in the SPARSEMEM cases is not zeroed and it is implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set. Introduce memmap_alloc() wrapper for memblock allocators that will be used for both FLATMEM and SPARSEMEM cases and will makei memory map zeroing and poisoning consistent for different memory models. Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/page_alloc.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
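The wrapper described above centralizes the raw memblock allocation and the page poisoning in one place so that FLATMEM and SPARSEMEM behave the same. The sketch below approximates it, assuming the 5.10-era memblock_alloc_*_raw() signatures; treat it as an illustration rather than the exact upstream function.

```c
/* Sketch of a unified memory-map allocator (approximate). */
void * __init memmap_alloc(phys_addr_t size, phys_addr_t align,
			   phys_addr_t min_addr, int nid, bool exact_nid)
{
	void *ptr;

	if (exact_nid)
		ptr = memblock_alloc_exact_nid_raw(size, align, min_addr,
						   MEMBLOCK_ALLOC_ACCESSIBLE,
						   nid);
	else
		ptr = memblock_alloc_try_nid_raw(size, align, min_addr,
						 MEMBLOCK_ALLOC_ACCESSIBLE,
						 nid);

	if (ptr && size > 0)
		page_init_poison(ptr, size);	/* poisoning handled here, not in memblock */

	return ptr;
}
```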
-
Committed by Mike Rapoport
mainline inclusion from mainline-v5.15-rc1 commit c3ab6baf category: bugfix bugzilla: 186414, https://gitee.com/openeuler/kernel/issues/I5K7IY CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3ab6baf6a004eab7344a1d8880a971f2414e1b6 -------------------------------- Patch series "mm: ensure consistency of memory map poisoning". Currently memory map allocation for FLATMEM case does not poison the struct pages regardless of CONFIG_PAGE_POISON setting. This happens because allocation of the memory map for FLATMEM and SPARSMEM use different memblock functions and those that are used for SPARSMEM case (namely memblock_alloc_try_nid_raw() and memblock_alloc_exact_nid_raw()) implicitly poison the allocated memory. Another side effect of this implicit poisoning is that early setup code that uses the same functions to allocate memory burns cycles for the memory poisoning even if it was not intended. These patches introduce memmap_alloc() wrapper that ensure that the memory map allocation is consistent for different memory models. This patch (of 4): Currently memory map for the holes is initialized only when SPARSEMEM memory model is used. Yet, even with FLATMEM there could be holes in the physical memory layout that have memory map entries. For instance, the memory reserved using e820 API on i386 or "reserved-memory" nodes in device tree would not appear in memblock.memory and hence the struct pages for such holes will be skipped during memory map initialization. These struct pages will be zeroed because the memory map for FLATMEM systems is allocated with memblock_alloc_node() that clears the allocated memory. While zeroed struct pages do not cause immediate problems, the correct behaviour is to initialize every page using __init_single_page(). Besides, enabling page poison for FLATMEM case will trigger PF_POISONED_CHECK() unless the memory map is properly initialized. Make sure init_unavailable_range() is called for both SPARSEMEM and FLATMEM so that struct pages representing memory holes would appear as PG_Reserved with any memory layout. [rppt@kernel.org: fix microblaze] Link: https://lkml.kernel.org/r/YQWW3RCE4eWBuMu/@kernel.org Link: https://lkml.kernel.org/r/20210714123739.16493-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210714123739.16493-2-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Tested-by: NGuenter Roeck <linux@roeck-us.net> Cc: Michal Simek <monstr@monstr.eu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: arch/microblaze/include/asm/page.h mm/page_alloc.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Guo Mengqi

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5J0Z9
CVE: NA

--------------------------------

When there is only one mm in a group allocating memory and the process is killed, the error path in sp_alloc_mmap_populate tries to access the next spg_node->master->mm in the group's proc list. However, in this case the next spg_node in the proc list is the list head, and spg_node->master is NULL, which leads to the log below:

[file:test_sp_alloc.c, func:alloc_large_repeat, line:437] start to alloc...
[ 264.699086][ T1772] share pool: gonna sp_alloc_unmap...
[ 264.699939][ T1772] share pool: list_next_entry(spg_node, proc_node) is ffff0004c4907028
[ 264.700380][ T1772] share pool: master is 0
[ 264.701240][ T1772] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
...
[ 264.704764][ T1772] Internal error: Oops: 96000006 [#1] SMP
[ 264.705166][ T1772] Modules linked in: sharepool_dev(OE)
[ 264.705823][ T1772] CPU: 3 PID: 1772 Comm: test_sp_alloc Tainted: G OE 5.10.0+ #23
...
[ 264.712513][ T1772] Call trace:
[ 264.713057][ T1772] sp_alloc+0x528/0xa88
[ 264.713740][ T1772] dev_ioctl+0x6ec/0x1d00 [sharepool_dev]
[ 264.714035][ T1772] __arm64_sys_ioctl+0xb0/0xe8
...
[ 264.716891][ T1772] ---[ end trace 1587677032f666c6 ]---
[ 264.717457][ T1772] Kernel panic - not syncing: Oops: Fatal exception
[ 264.717961][ T1772] SMP: stopping secondary CPUs
[ 264.718787][ T1772] Kernel Offset: disabled
[ 264.719718][ T1772] Memory Limit: none
[ 264.720333][ T1772] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

Add a list_is_last check to avoid this NULL pointer access.

Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
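The guard itself is a one-liner around the list walk; the sketch below shows the intent. The struct name sp_group_node and the field names (proc_node, spg->procs, master->mm) are taken from the log excerpt above and are assumptions about the surrounding code, not the actual hunk.

```c
/* Hypothetical guard: never treat the list head as a real spg_node. */
struct sp_group_node *next = NULL;

if (!list_is_last(&spg_node->proc_node, &spg->procs))
	next = list_next_entry(spg_node, proc_node);

if (next && next->master)
	mm = next->master->mm;	/* only dereference a real entry */
```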
-
Committed by Guo Mengqi

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5J0YW
CVE: NA

--------------------------------

Sharepool has a statistics system which allows users to check memory use easily. The statistics code is quite independent from the major functions. However, its implementation is very similar to that of the major functions, which doubles the lock use and causes nesting problems. Thus we remove the statistics system and fold all the statistics into the raw data structures as built-in statistics. The user API does not change.

This greatly reduces the complexity of locking and removes hundreds of lines of redundant code.

Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Aili Yao
mainline inclusion from mainline-v5.12-rc1 commit 30c9cf49 category: feature bugzilla: 187341, https://gitee.com/openeuler/kernel/issues/I5JGLK CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30c9cf49270423f8cb0d2c152486e248f375cccb -------------------------------- When a memory uncorrected error is triggered by process who accessed the address with error, It's Action Required Case for only current process which triggered this; This Action Required case means Action optional to other process who share the same page. Usually killing current process will be sufficient, other processes sharing the same page will get be signaled when they really touch the poisoned page. But there is another scenario that other processes sharing the same page want to be signaled early with PF_MCE_EARLY set. In this case, we should get them into kill list and signal BUS_MCEERR_AO to them. So in this patch, task_early_kill will check current process if force_early is set, and if not current,the code will fallback to find_early_kill_thread() to check if there is PF_MCE_EARLY process who cares the error. In kill_proc(), BUS_MCEERR_AR is only send to current, other processes in kill list will be signaled with BUS_MCEERR_AO. Link: https://lkml.kernel.org/r/20210122132424.313c8f5f.yaoaili@kingsoft.comSigned-off-by: NAili Yao <yaoaili@kingsoft.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
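The behavioural change is easiest to read as the selection function itself: the faulting task gets BUS_MCEERR_AR, while other tasks sharing the page are only added to the kill list (and later signalled with BUS_MCEERR_AO) if they opted in via PF_MCE_EARLY. The sketch below approximates the post-fix task_early_kill(); find_early_kill_thread() is the existing helper that checks the early-kill flags, and the details are assumed.

```c
/* Approximate sketch of mm/memory-failure.c:task_early_kill() after the change. */
static struct task_struct *task_early_kill(struct task_struct *tsk,
					   int force_early)
{
	if (!tsk->mm)
		return NULL;
	/* Action Required: only the task that touched the poison is killed directly. */
	if (force_early && tsk->mm == current->mm)
		return current;
	/* Other tasks sharing the page are signalled early only if they asked for it. */
	return find_early_kill_thread(tsk);
}
```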
-
Committed by Muchun Song
mainline inclusion from mainline-v5.19-rc7 commit 39d35ede category: feature bugzilla: 187198, https://gitee.com/openeuler/kernel/issues/I5GVFO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=39d35edee4537487e5178f258e23518272a66413 -------------------------------- Higher order allocations for vmemmap pages from buddy allocator must be able to be treated as indepdenent small pages as they can be freed individually by the caller. There is no problem for higher order vmemmap pages allocated at boot time since each individual small page will be initialized at boot time. However, it will be an issue for memory hotplug case since those higher order vmemmap pages are allocated from buddy allocator without initializing each individual small page's refcount. The system will panic in put_page_testzero() when CONFIG_DEBUG_VM is enabled if the vmemmap page is freed. Link: https://lkml.kernel.org/r/20220620023019.94257-1-songmuchun@bytedance.com Fixes: d8d55f56 ("mm: sparsemem: use page table lock to protect kernel pmd operations") Signed-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/sparse-vmemmap.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Jann Horn
stable inclusion from stable-v5.10.130 commit 6c32496964da0dc230cea763a0e934b2e02dabd5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5LJFH CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6c32496964da0dc230cea763a0e934b2e02dabd5 -------------------------------- commit eeaa345e upstream. The fastpath in slab_alloc_node() assumes that c->slab is stable as long as the TID stays the same. However, two places in __slab_alloc() currently don't update the TID when deactivating the CPU slab. If multiple operations race the right way, this could lead to an object getting lost; or, in an even more unlikely situation, it could even lead to an object being freed onto the wrong slab's freelist, messing up the `inuse` counter and eventually causing a page to be freed to the page allocator while it still contains slab objects. (I haven't actually tested these cases though, this is just based on looking at the code. Writing testcases for this stuff seems like it'd be a pain...) The race leading to state inconsistency is (all operations on the same CPU and kmem_cache): - task A: begin do_slab_free(): - read TID - read pcpu freelist (==NULL) - check `slab == c->slab` (true) - [PREEMPT A->B] - task B: begin slab_alloc_node(): - fastpath fails (`c->freelist` is NULL) - enter __slab_alloc() - slub_get_cpu_ptr() (disables preemption) - enter ___slab_alloc() - take local_lock_irqsave() - read c->freelist as NULL - get_freelist() returns NULL - write `c->slab = NULL` - drop local_unlock_irqrestore() - goto new_slab - slub_percpu_partial() is NULL - get_partial() returns NULL - slub_put_cpu_ptr() (enables preemption) - [PREEMPT B->A] - task A: finish do_slab_free(): - this_cpu_cmpxchg_double() succeeds() - [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL] From there, the object on c->freelist will get lost if task B is allowed to continue from here: It will proceed to the retry_load_slab label, set c->slab, then jump to load_freelist, which clobbers c->freelist. But if we instead continue as follows, we get worse corruption: - task A: run __slab_free() on object from other struct slab: - CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial) - task A: run slab_alloc_node() with NUMA node constraint: - fastpath fails (c->slab is NULL) - call __slab_alloc() - slub_get_cpu_ptr() (disables preemption) - enter ___slab_alloc() - c->slab is NULL: goto new_slab - slub_percpu_partial() is non-NULL - set c->slab to slub_percpu_partial(c) - [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects from slab-2] - goto redo - node_match() fails - goto deactivate_slab - existing c->freelist is passed into deactivate_slab() - inuse count of slab-1 is decremented to account for object from slab-2 At this point, the inuse count of slab-1 is 1 lower than it should be. This means that if we free all allocated objects in slab-1 except for one, SLUB will think that slab-1 is completely unused, and may free its page, leading to use-after-free. 
Fixes: c17dda40 ("slub: Separate out kmem_cache_cpu processing from deactivate_slab") Fixes: 03e404af ("slub: fast release on full slab") Cc: stable@vger.kernel.org Signed-off-by: NJann Horn <jannh@google.com> Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NDavid Rientjes <rientjes@google.com> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Tested-by: NHyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220608182205.2945720-1-jannh@google.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYongqiang Liu <liuyongqiang13@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
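Conceptually the fix is tiny: whenever ___slab_alloc() tears down the per-CPU slab, it must also advance the per-CPU transaction id so that a do_slab_free() fastpath that started before the deactivation can no longer succeed with its stale snapshot. The fragment below sketches that idea using the field and helper names from the race description above (c->slab, next_tid(), local_lock_irqsave()); placement and surrounding code are assumptions, and this 5.10-based tree may spell some of them differently (for example c->page instead of c->slab).

```c
/* Sketch: deactivating the cpu slab must also invalidate in-flight cmpxchg. */
local_lock_irqsave(&s->cpu_slab->lock, flags);
c->slab = NULL;
c->freelist = NULL;
c->tid = next_tid(c->tid);	/* make the racing fastpath cmpxchg fail */
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
```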
-
- 09 August 2022, 2 commits
-
-
Committed by Lu Jialin

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK
CVE: NA

-------------------------------

The user can set high_async_ratio from 0 to 9; memcg high async reclaim starts when memcg usage is larger than memory.high * high_async_ratio / 10.

Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Lu Jialin

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK
CVE: NA

-------------------------------

Enable memcg async reclaim when memcg usage is larger than memory_high * memcg->high_async_ratio / 10. If memcg usage is larger than memory_high * (memcg->high_async_ratio - 1) / 10, the number of pages to reclaim is the difference between memcg usage and memory_high * (memcg->high_async_ratio - 1) / 10; otherwise MEMCG_CHARGE_BATCH pages are reclaimed.

The default memcg->high_async_ratio is 0; when high_async_ratio is 0, memcg async reclaim is disabled.

Async reclaim is triggered in two situations: 1) try_charge; 2) resetting memory_high.

Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
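The thresholds above are simple ratio arithmetic. As a worked example with memory.high = 1000 MiB and high_async_ratio = 9: async reclaim is triggered once usage exceeds 900 MiB, and the amount reclaimed is usage minus the 800 MiB floor (otherwise MEMCG_CHARGE_BATCH). The helper below restates that computation; its name and the page-based units are assumptions for illustration, not the patch itself.

```c
/* Hypothetical helper mirroring the reclaim-amount rule described above. */
static unsigned long high_async_reclaim_pages(unsigned long usage,
					      unsigned long high,
					      unsigned int ratio)
{
	unsigned long floor = high / 10 * (ratio - 1);

	/* Async reclaim is only queued when usage > high / 10 * ratio. */
	if (usage > floor)
		return usage - floor;
	return MEMCG_CHARGE_BATCH;
}
```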
-