提交 · 69dc8010b8fca475170650a4ebbc0074541df859 · openeuler / Kernel

21 8月, 2021 1 次提交

dm ima: update dm documentation for ima measurement support · 17bfa968

由 Tushar Sugandhi 提交于 8月 13, 2021

The ima documentation for measuring DM targets (dm-ima.rst) is
missing the attribute information for the targets - 'cache', 'integrity',
'multipath', and 'snapshot'.  It is also missing the grammar for
various DM events and targets, which can help the attestation servers
to determine what data to expect for a given DM device.  Further,
the documentation needs to be updated to incorporate code changes
made to DM ima events and targets as part of this patch series.  For
instance, prefixing the event names with "dm_", adding the DM version to
events, prefixing the table hashes in the ima log with the
hash algorithm etc.  There are warnings reported by 'make htmldocs' on
dm-ima.rst, which need to be fixed.  And lastly, the expected behavior
needs to be documented when the configuration CONFIG_IMA_DISABLE_HTABLE
is disabled.

Update the documentation to add examples for 'cache', 'integrity',
'multipath', and 'snapshot' targets.  Add the grammar for
various DM events and targets in Backus Naur form,
so that the attestation servers can interpret and act on the ima
measurements for DM target.  Fix htmldocs warnings in dm-ima.rst.  Update
the documentation to be consistent with the code changes that are part of
this patch series.
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NTushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

17bfa968

13 8月, 2021 1 次提交

cgroup/cpuset: Enable memory migration for cpuset v2 · ee9707e8

由 Waiman Long 提交于 8月 11, 2021

When a user changes cpuset.cpus, each task in a v2 cpuset will be moved
to one of the new cpus if it is not there already. For memory, however,
they won't be migrated to the new nodes when cpuset.mems changes. This is
an inconsistency in behavior.

In cpuset v1, there is a memory_migrate control file to enable such
behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default
for cpuset v2 so that we have a consistent set of behavior for both
cpus and memory.

There is certainly a cost to make memory migration the default, but it
is a one time cost that shouldn't really matter as long as cpuset.mems
isn't changed frequenty.  Update the cgroup-v2.rst file to document the
new behavior and recommend against changing cpuset.mems frequently.

Since there won't be any concurrent access to the newly allocated cpuset
structure in cpuset_css_alloc(), we can use the cheaper non-atomic
__set_bit() instead of the more expensive atomic set_bit().
Signed-off-by: NWaiman Long <longman@redhat.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

ee9707e8

12 8月, 2021 1 次提交

x86/reboot: Document how to override DMI platform quirks · 12febc18

由 Paul Gortmaker 提交于 5月 30, 2021

commit 5955633e ("x86/reboot: Skip DMI checks if reboot set by user")
made it so that it's not required to recompile the kernel in order to
bypass broken reboot quirks compiled into an image:

 * This variable is used privately to keep track of whether or not
 * reboot_type is still set to its default value (i.e., reboot= hasn't
 * been set on the command line).  This is needed so that we can
 * suppress DMI scanning for reboot quirks.  Without it, it's
 * impossible to override a faulty reboot quirk without recompiling.

However, at the time it was not eally documented outside the source code,
and so this information isn't really available to the average user out
there.

The change is a little white lie and invented "reboot=default" since it is
easy to remember, and documents well.  The truth is that any random string
that is *not* a currently accepted string will work.

Since that doesn't document well for non-coders, and since it's unknown
what the future additions might be, lay claim on "default" since that is
exactly what it achieves.
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210530162447.996461-3-paul.gortmaker@windriver.com

12febc18

11 8月, 2021 2 次提交

dm: add documentation for IMA measurement support · 00d43995

由 Tushar Sugandhi 提交于 7月 12, 2021

To interpret various DM target measurement data in IMA logs,
a separate documentation page is needed under
Documentation/admin-guide/device-mapper.

Add documentation to help system administrators and attestation
client/server component owners to interpret the measurement
data generated by various DM targets, on various device/table state
changes.
Signed-off-by: NTushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

00d43995

dm writecache: add event counters · e3a35d03

由 Mikulas Patocka 提交于 7月 27, 2021

Add 10 counters for various events (hit, miss, etc) and export them in
the status line (accessed from userspace with "dmsetup status"). Also
add a message "clear_stats" that resets these counters.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

e3a35d03

28 7月, 2021 1 次提交

Documentation: Add L1D flushing Documentation · b7fe54f6

由 Balbir Singh 提交于 1月 08, 2021

Add documentation of l1d flushing, explain the need for the
feature and how it can be used.
Signed-off-by: NBalbir Singh <sblbir@amazon.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-6-sblbir@amazon.com

b7fe54f6

01 7月, 2021 7 次提交

userfaultfd/shmem: advertise shmem minor fault support · 964ab004

由 Axel Rasmussen 提交于 6月 30, 2021

Now that the feature is fully implemented (the faulting path hooks exist
so userspace is notified, and the ioctl to resolve such faults is
available), advertise this as a supported feature.

Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.comSigned-off-by: NAxel Rasmussen <axelrasmussen@google.com>
Acked-by: NHugh Dickins <hughd@google.com>
Acked-by: NPeter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

964ab004

mm/pagemap: export uffd-wp protection information · fb8e37f3

由 Peter Xu 提交于 6月 30, 2021

Export the PTE/PMD status of uffd-wp to pagemap too.

Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wang Qing <wangqing@vivo.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fb8e37f3

mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON · e6d41f12

由 Muchun Song 提交于 6月 30, 2021

When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing unused vmemmap pages
associated with each HugeTLB page is default off.  Now the vmemmap is PMD
mapped.  So there is no side effect when this feature is enabled with no
HugeTLB pages in the system.  Someone may want to enable this feature in
the compiler time instead of using boot command line.  So add a config to
make it default on when someone do not want to enable it via command line.

Link: https://lkml.kernel.org/r/20210616094915.34432-4-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e6d41f12

mm: sparsemem: use huge PMD mapping for vmemmap pages · 2d7a2171

由 Muchun Song 提交于 6月 30, 2021

The preparation of splitting huge PMD mapping of vmemmap pages is ready,
so switch the mapping from PTE to PMD.

Link: https://lkml.kernel.org/r/20210616094915.34432-3-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2d7a2171

mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled · 4bab4964

由 Muchun Song 提交于 6月 30, 2021

The parameter of memory_hotplug.memmap_on_memory is not compatible with
hugetlb_free_vmemmap.  So disable it when hugetlb_free_vmemmap is enabled.

[akpm@linux-foundation.org: remove unneeded include, per Oscar]

Link: https://lkml.kernel.org/r/20210510030027.56044-9-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Acked-by: NMike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4bab4964

mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap · e9fdff87

由 Muchun Song 提交于 6月 30, 2021

Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
freeing unused vmemmap pages associated with each hugetlb page on boot.

We disable PMD mapping of vmemmap pages for x86-64 arch when this feature
is enabled.  Because vmemmap_remap_free() depends on vmemmap being base
page mapped.

Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Reviewed-by: NBarry Song <song.bao.hua@hisilicon.com>
Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com>
Tested-by: NChen Huang <chenhuang5@huawei.com>
Tested-by: NBodeddula Balasubramaniam <bodeddub@amazon.com>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e9fdff87

mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page · ad2fa371

由 Muchun Song 提交于 6月 30, 2021

When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it.  However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure.  In
this case, we just refuse to free the HugeTLB page.  This changes behavior
in some corner cases as listed below:

 1) Failing to free a huge page triggered by the user (decrease nr_pages).

    User needs to try again later.

 2) Failing to free a surplus huge page when freed by the application.

    Try again later when freeing a huge page next time.

 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
    offline_pages().

    This can happen when we have plenty of ZONE_MOVABLE memory, but
    not enough kernel memory to allocate vmemmmap pages.  We may even
    be able to migrate huge page contents, but will not be able to
    dissolve the source huge page.  This will prevent an offline
    operation and is unfortunate as memory offlining is expected to
    succeed on movable zones.  Users that depend on memory hotplug
    to succeed for movable zones should carefully consider whether the
    memory savings gained from this feature are worth the risk of
    possibly not being able to offline memory in certain situations.

 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
    alloc_contig_range() - once we have that handling in place. Mainly
    affects CMA and virtio-mem.

    Similar to 3). virito-mem will handle migration errors gracefully.
    CMA might be able to fallback on other free areas within the CMA
    region.

Vmemmap pages are allocated from the page freeing context.  In order for
those allocations to be not disruptive (e.g.  trigger oom killer)
__GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure.  GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.

[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
  Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org

Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: NOscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ad2fa371

30 6月, 2021 6 次提交

docs: remove description of DISCONTIGMEM · 48d9f335

由 Mike Rapoport 提交于 6月 28, 2021

Remove description of DISCONTIGMEM from the "Memory Models" document and
update VM sysctl description so that it won't mention DISCONIGMEM.

Link: https://lkml.kernel.org/r/20210608091316.3622-8-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
Acked-by: NArnd Bergmann <arnd@arndb.de>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

48d9f335

mm/page_alloc: introduce vm.percpu_pagelist_high_fraction · 74f44822

由 Mel Gorman 提交于 6月 28, 2021

This introduces a new sysctl vm.percpu_pagelist_high_fraction.  It is
similar to the old vm.percpu_pagelist_fraction.  The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention.  However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled.  This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.

  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=8
  # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  35071
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=64
              high:  4383
              batch: 63

  # sysctl vm.percpu_pagelist_high_fraction=0
              high:  649
              batch: 63

[mgorman@techsingularity.net: fix documentation]
  Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net

Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

74f44822

mm/page_alloc: delete vm.percpu_pagelist_fraction · bbbecb35

由 Mel Gorman 提交于 6月 28, 2021

Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.

The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and neither takes the
zone size into account or the number of CPUs local to a zone. With larger
zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
and high values means that the sysctl can reduce zone lock contention but
also increase allocation latencies.

This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static. It
also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.

The motivation is partially to adjust to larger memory sizes but is also
driven by the fact that large batches of page freeing via release_pages()
often shows zone contention as a major part of the problem. Another is a
bug report based on an older kernel where a multi-terabyte process can
takes several minutes to exit. A workaround was to use
vm.percpu_pagelist_fraction to increase the pcp->high value but testing
indicated that a production workload could not use the same values because
of an increase in allocation latencies. Unfortunately, I cannot reproduce
this test case myself as the multi-terabyte machines are in active use but
it should alleviate the problem.

The series aims to address both and partially acts as a pre-requisite.
pcp only works with order-0 which is useless for SLUB (when using high
orders) and THP (unconditionally). To store high-order pages on PCP, the
pcp->high values need to be increased first.

This patch (of 6):

The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP). The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it
has a problem. While it can decrease contention, it can also increase
latency on the allocation side due to unreasonably large batch sizes.
This leads to games where an administrator adjusts
percpu_pagelist_fraction on the fly to work around contention and
allocation latency problems.

This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems. For the purposes of
review, it's easier to remove this sysctl now and reintroduce a similar
sysctl later in the series that deals only with pcp->high.

Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bbbecb35

mm/page_reporting: export reporting order as module parameter · f58780a8

由 Gavin Shan 提交于 6月 28, 2021

The macro PAGE_REPORTING_MIN_ORDER is defined as the page reporting
threshold.  It can't be adjusted at runtime.

This introduces a variable (@page_reporting_order) to replace the marcro
(PAGE_REPORTING_MIN_ORDER).  MAX_ORDER is assigned to it initially,
meaning the page reporting is disabled.  It will be specified by driver if
valid one is provided.  Otherwise, it will fall back to @pageblock_order.
It's also exported so that the page reporting order can be adjusted at
runtime.

Link: https://lkml.kernel.org/r/20210625014710.42954-3-gshan@redhat.comSigned-off-by: NGavin Shan <gshan@redhat.com>
Suggested-by: NDavid Hildenbrand <david@redhat.com>
Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f58780a8

doc: watchdog: modify the doc related to "watchdog/%u" · 256f7a67

由 Wang Qing 提交于 6月 28, 2021

"watchdog/%u" threads has be replaced by cpu_stop_work.  The current
description is extremely misleading.

Link: https://lkml.kernel.org/r/1619687073-24686-5-git-send-email-wangqing@vivo.comSigned-off-by: NWang Qing <wangqing@vivo.com>
Reviewed-by: NPetr Mladek <pmladek@suse.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@canonical.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Qais Yousef <qais.yousef@arm.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Santosh Sivaraj <santosh@fossix.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

256f7a67

doc: watchdog: modify the explanation related to watchdog thread · e55fda8c

由 Wang Qing 提交于 6月 28, 2021

"watchdog/%u" threads has be replaced by cpu_stop_work.  The current
description is extremely misleading.

Link: https://lkml.kernel.org/r/1619687073-24686-4-git-send-email-wangqing@vivo.comSigned-off-by: NWang Qing <wangqing@vivo.com>
Reviewed-by: NPetr Mladek <pmladek@suse.com>
Cc: "Guilherme G. Piccoli" <gpiccoli@canonical.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Qais Yousef <qais.yousef@arm.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Santosh Sivaraj <santosh@fossix.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e55fda8c

29 6月, 2021 1 次提交

dm writecache: make writeback pause configurable · 5c0de3d7

由 Mikulas Patocka 提交于 6月 28, 2021

Commit 95b88f4d ("dm writecache: pause
writeback if cache full and origin being written directly") introduced a
code that pauses cache flushing if we are issuing writes directly to the
origin.

Improve that initial commit by making the timeout code configurable
(via the option "pause_writeback"). Also change the default from 1s to
3s because it performed better.
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

5c0de3d7

26 6月, 2021 2 次提交

dm writecache: add optional "metadata_only" parameter · 611c3e16

由 Mikulas Patocka 提交于 6月 21, 2021

Add a "metadata_only" parameter that when present: only metadata is
promoted to the cache. This option improves performance for heavier
REQ_META workloads (e.g. device-mapper-test-suite's "git clone and
checkout" benchmark improves from 341s to 312s).
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

611c3e16

dm writecache: add "cleaner" and "max_age" to Documentation · cd039afa

由 Mike Snitzer 提交于 6月 25, 2021

Backfill missing Documentation.

Fixes: 93de44eb ("dm writecache: implement the "cleaner" policy")
Fixes: 3923d485 ("dm writecache: implement gradual cleanup")
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

cd039afa

22 6月, 2021 4 次提交

clocksource: Provide kernel module to test clocksource watchdog · 1253b9b8

由 Paul E. McKenney 提交于 5月 27, 2021

When the clocksource watchdog marks a clock as unstable, this might
be due to that clock being unstable or it might be due to delays that
happen to occur between the reads of the two clocks. It would be good
to have a way of testing the clocksource watchdog's ability to
distinguish between these two causes of clock skew and instability.

Therefore, provide a new clocksource-wdtest module selected by a new
TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module
parameter named "holdoff" that provides the number of seconds of delay
before testing should start, which defaults to zero when built as a module
and to 10 seconds when built directly into the kernel. Very large systems
that boot slowly may need to increase the value of this module parameter.

This module uses hand-crafted clocksource structures to do its testing,
thus avoiding messing up timing for the rest of the kernel and for user
applications. This module first verifies that the ->uncertainty_margin
field of the clocksource structures are set sanely. It then tests the
delay-detection capability of the clocksource watchdog, increasing the
number of consecutive delays injected, first provoking console messages
complaining about the delays and finally forcing a clock-skew event.
Unexpected test results cause at least one WARN_ON_ONCE() console splat.
If there are no splats, the test has passed. Finally, it fuzzes the
value returned from a clocksource to test the clocksource watchdog's
ability to detect time skew.

This module checks the state of its clocksource after each test, and
uses WARN_ON_ONCE() to emit a console splat if there are any failures.
This should enable all types of test frameworks to detect any such
failures.

This facility is intended for diagnostic use only, and should be avoided
on production systems.
Reported-by: NChris Mason <clm@fb.com>
Suggested-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NFeng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-5-paulmck@kernel.org

1253b9b8

clocksource: Limit number of CPUs checked for clock synchronization · fa218f1c

由 Paul E. McKenney 提交于 5月 27, 2021

Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU,
that clock is checked on all CPUs.  This is thorough, but might not be
what you want on a system with a few tens of CPUs, let alone a few hundred
of them.

Therefore, by default check only up to eight randomly chosen CPUs.  Also
provide a new clocksource.verify_n_cpus kernel boot parameter.  A value of
-1 says to check all of the CPUs, and a non-negative value says to randomly
select that number of CPUs, without concern about selecting the same CPU
multiple times.  However, make use of a cpumask so that a given CPU will be
checked at most once.

Suggested-by: Thomas Gleixner <tglx@linutronix.de> # For verify_n_cpus=1.
Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NFeng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-3-paulmck@kernel.org

fa218f1c

clocksource: Retry clock read if long delays detected · db3a34e1

由 Paul E. McKenney 提交于 5月 27, 2021

When the clocksource watchdog marks a clock as unstable, this might be due
to that clock being unstable or it might be due to delays that happen to
occur between the reads of the two clocks. Yes, interrupts are disabled
across those two reads, but there are no shortage of things that can delay
interrupts-disabled regions of code ranging from SMI handlers to vCPU
preemption. It would be good to have some indication as to why the clock
was marked unstable.

Therefore, re-read the watchdog clock on either side of the read from the
clock under test. If the watchdog clock shows an excessive time delta
between its pair of reads, the reads are retried.

The maximum number of retries is specified by a new kernel boot parameter
clocksource.max_cswd_read_retries, which defaults to three, that is, up to
four reads, one initial and up to three retries. If more than one retry
was required, a message is printed on the console (the occasional single
retry is expected behavior, especially in guest OSes). If the maximum
number of retries is exceeded, the clock under test will be marked
unstable. However, the probability of this happening due to various sorts
of delays is quite small. In addition, the reason (clock-read delays) for
the unstable marking will be apparent.
Reported-by: NChris Mason <clm@fb.com>
Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NFeng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org

db3a34e1

block: Introduce the ioprio rq-qos policy · 556910e3

由 Bart Van Assche 提交于 6月 17, 2021

Introduce an rq-qos policy that assigns an I/O priority to requests based
on blk-cgroup configuration settings. This policy has the following
advantages over the ioprio_set() system call:
- This policy is cgroup based so it has all the advantages of cgroups.
- While ioprio_set() does not affect page cache writeback I/O, this rq-qos
  controller affects page cache writeback I/O for filesystems that support
  assiociating a cgroup with writeback I/O. See also
  Documentation/admin-guide/cgroup-v2.rst.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: NBart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-5-bvanassche@acm.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

556910e3

18 6月, 2021 4 次提交

docs: admin-guide: sysctl: avoid using ReST :doc:`foo` markup · 2793e19d

由 Mauro Carvalho Chehab 提交于 6月 16, 2021

The :doc:`foo` tag is auto-generated via automarkup.py.
So, use the filename at the sources, instead of :doc:`foo`.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/12abd2290c7ebc05c89178d2556bea740bd70fac.1623824363.git.mchehab+huawei@kernel.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

2793e19d

docs: admin-guide: hw-vuln: avoid using ReST :doc:`foo` markup · e499f4c2

由 Mauro Carvalho Chehab 提交于 6月 16, 2021

The :doc:`foo` tag is auto-generated via automarkup.py.
So, use the filename at the sources, instead of :doc:`foo`.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/4e378517761f3df07165d5ecdac5a0a81577e68f.1623824363.git.mchehab+huawei@kernel.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

e499f4c2

docs: admin-guide: pm: avoid using ReST :doc:`foo` markup · 17420f31

由 Mauro Carvalho Chehab 提交于 6月 16, 2021

The :doc:`foo` tag is auto-generated via automarkup.py.
So, use the filename at the sources, instead of :doc:`foo`.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/04616d9fc0b4a0d33486fa0018631a2db2eba860.1623824363.git.mchehab+huawei@kernel.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

17420f31

docs: admin-guide: reporting-issues.rst: replace some characters · 349660e9

由 Mauro Carvalho Chehab 提交于 6月 16, 2021

The conversion tools used during DocBook/LaTeX/html/Markdown->ReST
conversion and some cut-and-pasted text contain some characters that
aren't easily reachable on standard keyboards and/or could cause
troubles when parsed by the documentation build system.

Replace the occurences of the following characters:

	- U+00a0 (' '): NO-BREAK SPACE
	  as it can cause lines being truncated on PDF output
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/551a2af0e654226067e5c376d3e2d959cc738f39.1623826294.git.mchehab+huawei@kernel.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

349660e9

17 6月, 2021 7 次提交

ACPI: sysfs: Allow bitmap list to be supplied to acpi_mask_gpe · d3121e64

由 Andy Shevchenko 提交于 6月 16, 2021

Currently we need to use as many acpi_mask_gpe options as we want to have
GPEs to be masked. Even with two it already becomes inconveniently large
the kernel command line.

Instead, allow acpi_mask_gpe to represent bitmap list.
Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

d3121e64

iommu: Update "iommu.strict" documentation · 531353e6

由 Robin Murphy 提交于 6月 14, 2021

Consolidating the flush queue logic also meant that the "iommu.strict"
option started taking effect on x86 as well. Make sure we document that.

Fixes: a250c23f ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE")
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
Reviewed-by: NLu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: NJohn Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/2c8c06e1b449d6b060c5bf9ad3b403cd142f405d.1623682646.git.robin.murphy@arm.comSigned-off-by: NJoerg Roedel <jroedel@suse.de>

531353e6

tracing: Add tp_printk_stop_on_boot option · f3860136

由 Steven Rostedt (VMware) 提交于 6月 17, 2021

Add a kernel command line option that disables printing of events to
console at late_initcall_sync(). This is useful when needing to see
specific events written to console on boot up, but not wanting it when
user space starts, as user space may make the console so noisy that the
system becomes inoperable.
Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>

f3860136

docs/cgroup-v1/blkio: update for 5.x kernels · 82861595

由 Kir Kolyshkin 提交于 6月 10, 2021

Commit bf382fb0bcef4 ("block: remove legacy IO schedulers", Oct 12 2018)
removes the CFQ scheduler, together with blkio.weight and
blkio.weight_device described in cgroup v1 documentation. Users are
supposed to use the BFQ scheduler, which cgroup file for setting weight
is blkio.bfq.weight, but there is no way to set per-device weight.

Later, commit 795fe54c per-device weights for BFQ, meaning that
blkio.bfq.weight and blkio.bfq.weight_device can be used in a way
similar to the old CFQ cgroup interface.

Yet, the cgroup v1 docs were never updated. Fix this:
 - use the new file names;
 - fix the range for weight (used to be 10..1000, now 1..1000);
 - link to BFQ scheduler docs.
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NKir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

82861595

docs/cgroup-v1/blkio: stop abusing itemized list · 37fe4038

由 Kir Kolyshkin 提交于 6月 10, 2021

Fix many formatting issues by stop (ab)using itemized lists for
everything (mostly replaced by definition lists).
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NKir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

37fe4038

pstore/blk: Fix kerndoc and redundancy on blkdev param · c811659b

由 Kees Cook 提交于 6月 15, 2021

Remove redundant details of blkdev and fix up resulting kerndoc.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKees Cook <keescook@chromium.org>

c811659b

pstore/blk: Use the normal block device I/O path · 7bb9557b

由 Kees Cook 提交于 6月 14, 2021

Stop poking into block layer internals and just open the block device
file an use kernel_read and kernel_write on it. Note that this means
the transformation from name_to_dev_t can't be used anymore when
pstore_blk is loaded as a module: a full filesystem device path name
must be used instead. Additionally removes ":internal:" kerndoc link,
since no such documentation remains.
Co-developed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKees Cook <keescook@chromium.org>

7bb9557b

16 6月, 2021 2 次提交

media: admin-guide: avoid using ReST :doc:`foo` markup · 6ef43d27

由 Mauro Carvalho Chehab 提交于 6月 05, 2021

The :doc:`foo` tag is auto-generated via automarkup.py.
So, use the filename at the sources, instead of :doc:`foo`.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>

6ef43d27

media: docs: */media/index.rst: don't use ReST doc:`foo` · 703ac06a

由 Mauro Carvalho Chehab 提交于 6月 05, 2021

The :doc:`foo` tag is auto-generated via automarkup.py.
So, use the filename at the sources, instead of :doc:`foo`.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>

703ac06a

15 6月, 2021 1 次提交

docs: cputopology: move the sysfs ABI description to right place · 05a463ec

由 Tian Tao 提交于 6月 11, 2021

Documentation/admin-guide/cputopology.rst is the wrong place to describe
sysfs ABI. So move the cputopology ABI things to
Documentation/ABI/stable/sysfs-devices-system-cpu and add a reference to
ABI doc in Documentation/admin-guide/cputopology.rst.

Link: https://lkml.kernel.org/r/20210319041618.14316-1-song.bao.hua@hisilicon.com
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NTian Tao <tiantao6@hisilicon.com>
Signed-off-by: NBarry Song <song.bao.hua@hisilicon.com>
Link: https://lore.kernel.org/r/20210611052249.25776-1-song.bao.hua@hisilicon.comSigned-off-by: NJonathan Corbet <corbet@lwn.net>

05a463ec

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功