提交 · db33ec371be8e45956e8cebb5b0fe641f008430b · openeuler / Kernel

09 6月, 2020 8 次提交

doc: cgroup: update note about conditions when oom killer is invoked · db33ec37

由 Konstantin Khlebnikov 提交于 6月 07, 2020

Starting from v4.19 commit 29ef680a ("memcg, oom: move out_of_memory
back to the charge path") cgroup oom killer is no longer invoked only
from page faults.  Now it implements the same semantics as global OOM
killer: allocation context invokes OOM killer and keeps retrying until
success.

[akpm@linux-foundation.org: fixes per Randy]
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: http://lkml.kernel.org/r/158894738928.208854.5244393925922074518.stgit@buzzSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

db33ec37

panic: add sysctl to dump all CPUs backtraces on oops event · 60c958d8

由 Guilherme G. Piccoli 提交于 6月 07, 2020

Usually when the kernel reaches an oops condition, it's a point of no
return; in case not enough debug information is available in the kernel
splat, one of the last resorts would be to collect a kernel crash dump
and analyze it. The problem with this approach is that in order to
collect the dump, a panic is required (to kexec-load the crash kernel).
When in an environment of multiple virtual machines, users may prefer to
try living with the oops, at least until being able to properly shutdown
their VMs / finish their important tasks.

This patch implements a way to collect a bit more debug details when an
oops event is reached, by printing all the CPUs backtraces through the
usage of NMIs (on architectures that support that). The sysctl added
(and documented) here was called "oops_all_cpu_backtrace", and when set
will (as the name suggests) dump all CPUs backtraces.

Far from ideal, this may be the last option though for users that for
some reason cannot panic on oops. Most of times oopses are clear enough
to indicate the kernel portion that must be investigated, but in virtual
environments it's possible to observe hypervisor/KVM issues that could
lead to oopses shown in other guests CPUs (like virtual APIC crashes).
This patch hence aims to help debug such complex issues without
resorting to kdump.
Signed-off-by: NGuilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NKees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

60c958d8

kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected · 0ec9dc9b

由 Guilherme G. Piccoli 提交于 6月 07, 2020

Commit 401c636a ("kernel/hung_task.c: show all hung tasks before
panic") introduced a change in that we started to show all CPUs
backtraces when a hung task is detected _and_ the sysctl/kernel
parameter "hung_task_panic" is set. The idea is good, because usually
when observing deadlocks (that may lead to hung tasks), the culprit is
another task holding a lock and not necessarily the task detected as
hung.

The problem with this approach is that dumping backtraces is a slightly
expensive task, specially printing that on console (and specially in
many CPU machines, as servers commonly found nowadays). So, users that
plan to collect a kdump to investigate the hung tasks and narrow down
the deadlock definitely don't need the CPUs backtrace on dmesg/console,
which will delay the panic and pollute the log (crash tool would easily
grab all CPUs traces with 'bt -a' command).

Also, there's the reciprocal scenario: some users may be interested in
seeing the CPUs backtraces but not have the system panic when a hung
task is detected. The current approach hence is almost as embedding a
policy in the kernel, by forcing the CPUs backtraces' dump (only) on
hung_task_panic.

This patch decouples the panic event on hung task from the CPUs
backtraces dump, by creating (and documenting) a new sysctl called
"hung_task_all_cpu_backtrace", analog to the approach taken on soft/hard
lockups, that have both a panic and an "all_cpu_backtrace" sysctl to
allow individual control. The new mechanism for dumping the CPUs
backtraces on hung task detection respects "hung_task_warnings" by not
dumping the traces in case there's no warnings left.
Signed-off-by: NGuilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NKees Cook <keescook@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0ec9dc9b

kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases · f117955a

由 Guilherme G. Piccoli 提交于 6月 07, 2020

After a recent change introduced by Vlastimil's series [0], kernel is
able now to handle sysctl parameters on kernel command line; also, the
series introduced a simple infrastructure to convert legacy boot
parameters (that duplicate sysctls) into sysctl aliases.

This patch converts the watchdog parameters softlockup_panic and
{hard,soft}lockup_all_cpu_backtrace to use the new alias infrastructure.
It fixes the documentation too, since the alias only accepts values 0 or
1, not the full range of integers.

We also took the opportunity here to improve the documentation of the
previously converted hung_task_panic (see the patch series [0]) and put
the alias table in alphabetical order.

[0] http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.czSigned-off-by: NGuilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Link: http://lkml.kernel.org/r/20200507214624.21911-1-gpiccoli@canonical.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f117955a

kernel/hung_task convert hung_task_panic boot parameter to sysctl · b467f3ef

由 Vlastimil Babka 提交于 6月 07, 2020

We can now handle sysctl parameters on kernel command line and have
infrastructure to convert legacy command line options that duplicate
sysctl to become a sysctl alias.

This patch converts the hung_task_panic parameter.  Note that the sysctl
handler is more strict and allows only 0 and 1, while the legacy
parameter allowed any non-zero value.  But there is little reason anyone
would not be using 1.
Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NKees Cook <keescook@chromium.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-4-vbabka@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b467f3ef

kernel/sysctl: support setting sysctl parameters from kernel command line · 3db978d4

由 Vlastimil Babka 提交于 6月 07, 2020

Patch series "support setting sysctl parameters from kernel command line", v3.

This series adds support for something that seems like many people
always wanted but nobody added it yet, so here's the ability to set
sysctl parameters via kernel command line options in the form of
sysctl.vm.something=1

The important part is Patch 1.  The second, not so important part is an
attempt to clean up legacy one-off parameters that do the same thing as
a sysctl.  I don't want to remove them completely for compatibility
reasons, but with generic sysctl support the idea is to remove the
one-off param handlers and treat the parameters as aliases for the
sysctl variants.

I have identified several parameters that mention sysctl counterparts in
Documentation/admin-guide/kernel-parameters.txt but there might be more.
The conversion also has varying level of success:

 - numa_zonelist_order is converted in Patch 2 together with adding the
   necessary infrastructure. It's easy as it doesn't really do anything
   but warn on deprecated value these days.

 - hung_task_panic is converted in Patch 3, but there's a downside that
   now it only accepts 0 and 1, while previously it was any integer
   value

 - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
   so there's no straighforward conversion possible

 - traceoff_on_warning is a flag without value and it would be required
   to handle that somehow in the conversion infractructure, which seems
   pointless for a single flag

This patch (of 5):

A recently proposed patch to add vm_swappiness command line parameter in
addition to existing sysctl [1] made me wonder why we don't have a
general support for passing sysctl parameters via command line.

Googling found only somebody else wondering the same [2], but I haven't
found any prior discussion with reasons why not to do this.

Settings the vm_swappiness issue aside (the underlying issue might be
solved in a different way), quick search of kernel-parameters.txt shows
there are already some that exist as both sysctl and kernel parameter -
hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning.

A general mechanism would remove the need to add more of those one-offs
and might be handy in situations where configuration by e.g.
/etc/sysctl.d/ is impractical.

Hence, this patch adds a new parse_args() pass that looks for parameters
prefixed by 'sysctl.' and tries to interpret them as writes to the
corresponding sys/ files using an temporary in-kernel procfs mount.
This mechanism was suggested by Eric W.  Biederman [3], as it handles
all dynamically registered sysctl tables, even though we don't handle
modular sysctls.  Errors due to e.g.  invalid parameter name or value
are reported in the kernel log.

The processing is hooked right before the init process is loaded, as
some handlers might be more complicated than simple setters and might
need some subsystems to be initialized.  At the moment the init process
can be started and eventually execute a process writing to /proc/sys/
then it should be also fine to do that from the kernel.

Sysctls registered later on module load time are not set by this
mechanism - it's expected that in such scenarios, setting sysctl values
from userspace is practical enough.

[1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/
[2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter
[3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NLuis Chamberlain <mcgrof@kernel.org>
Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org>
Acked-by: NKees Cook <keescook@chromium.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz
Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3db978d4

kernel: add panic_on_taint · db38d5c1

由 Rafael Aquini 提交于 6月 07, 2020

Analogously to the introduction of panic_on_warn, this patch introduces
a kernel option named panic_on_taint in order to provide a simple and
generic way to stop execution and catch a coredump when the kernel gets
tainted by any given flag.

This is useful for debugging sessions as it avoids having to rebuild the
kernel to explicitly add calls to panic() into the code sites that
introduce the taint flags of interest.

For instance, if one is interested in proceeding with a post-mortem
analysis at the point a given code path is hitting a bad page (i.e.
unaccount_page_cache_page(), or slab_bug()), a coredump can be collected
by rebooting the kernel with 'panic_on_taint=0x20' amended to the
command line.

Another, perhaps less frequent, use for this option would be as a means
for assuring a security policy case where only a subset of taints, or no
single taint (in paranoid mode), is allowed for the running system.  The
optional switch 'nousertaint' is handy in this particular scenario, as
it will avoid userspace induced crashes by writes to sysctl interface
/proc/sys/kernel/tainted causing false positive hits for such policies.

[akpm@linux-foundation.org: tweak kernel-parameters.txt wording]
Suggested-by: NQian Cai <cai@lca.pw>
Signed-off-by: NRafael Aquini <aquini@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NLuis Chamberlain <mcgrof@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Bunk <bunk@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Link: http://lkml.kernel.org/r/20200515175502.146720-1-aquini@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

db38d5c1

dynamic_debug: add an option to enable dynamic debug for modules only · ceabef7d

由 Orson Zhai 提交于 6月 07, 2020

Instead of enabling dynamic debug globally with CONFIG_DYNAMIC_DEBUG,
CONFIG_DYNAMIC_DEBUG_CORE will only enable core function of dynamic
debug.  With the DYNAMIC_DEBUG_MODULE defined for any modules, dynamic
debug will be tied to them.

This is useful for people who only want to enable dynamic debug for
kernel modules without worrying about kernel image size and memory
consumption is increasing too much.

[orson.zhai@unisoc.com: v2]
  Link: http://lkml.kernel.org/r/1587408228-10861-1-git-send-email-orson.unisoc@gmail.comSigned-off-by: NOrson Zhai <orson.zhai@unisoc.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: NPetr Mladek <pmladek@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: http://lkml.kernel.org/r/1586521984-5890-1-git-send-email-orson.unisoc@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ceabef7d

06 6月, 2020 1 次提交

dm integrity: add status line documentation · 40e9c5ac

由 Mikulas Patocka 提交于 6月 02, 2020

Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

40e9c5ac

04 6月, 2020 4 次提交

mm: allow swappiness that prefers reclaiming anon over the file workingset · c843966c

由 Johannes Weiner 提交于 6月 03, 2020

With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
devices such as zswap, it's possible for swap to be much faster than
filesystems, and for swapping to be preferable over thrashing filesystem
caches.

Allow setting swappiness - which defines the rough relative IO cost of
cache misses between page cache and swap-backed pages - to reflect such
situations by making the swap-preferred range configurable.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c843966c

mm: memcontrol: document the new swap control behavior · 0a27cae1

由 Alex Shi 提交于 6月 03, 2020

Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-18-hannes@cmpxchg.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0a27cae1

hugetlbfs: clean up command line processing · 282f4214

由 Mike Kravetz 提交于 6月 03, 2020

With all hugetlb page processing done in a single file clean up code.

- Make code match desired semantics
  - Update documentation with semantics
- Make all warnings and errors messages start with 'HugeTLB:'.
- Consistently name command line parsing routines.
- Warn if !hugepages_supported() and command line parameters have
  been specified.
- Add comments to code
  - Describe some of the subtle interactions
  - Describe semantics of command line arguments

This patch also fixes issues with implicitly setting the number of
gigantic huge pages to preallocate.  Previously on X86 command line,

        hugepages=2 default_hugepagesz=1G

would result in zero 1G pages being preallocated and,

        # grep HugePages_Total /proc/meminfo
        HugePages_Total:       0
        # sysctl -a | grep nr_hugepages
        vm.nr_hugepages = 2
        vm.nr_hugepages_mempolicy = 2
        # cat /proc/sys/vm/nr_hugepages
        2

After this patch 2 gigantic pages will be preallocated and all the proc,
sysfs, sysctl and meminfo files will accurately reflect this.

To address the issue with gigantic pages, a small change in behavior was
made to command line processing.  Previously the command line,

        hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256

would result in the allocation of 256 2M huge pages.  The value 128 would
be ignored without any warning.  After this patch, 128 2M pages will be
allocated and a warning message will be displayed indicating the value of
256 is ignored.  This change in behavior is required because allocation of
implicitly specified gigantic pages must be done when the
default_hugepagesz= is encountered for gigantic pages.  Previously the
code waited until later in the boot process (hugetlb_init), to allocate
pages of default size.  However the bootmem allocator required for
gigantic allocations is not available at this time.
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Tested-by: NSandipan Das <sandipan@linux.ibm.com>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>	[s390]
Acked-by: NWill Deacon <will@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Longpeng <longpeng2@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200417185049.275845-5-mike.kravetz@oracle.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

282f4214

khugepaged: introduce 'max_ptes_shared' tunable · 71a2c112

由 Kirill A. Shutemov 提交于 6月 03, 2020

'max_ptes_shared' specifies how many pages can be shared across multiple
processes.  Exceeding the number would block the collapse::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

By default, at least half of pages has to be not shared.

[colin.king@canonical.com: fix several spelling mistakes]
  Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Tested-by: NZi Yan <ziy@nvidia.com>
Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
Reviewed-by: NZi Yan <ziy@nvidia.com>
Acked-by: NYang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-9-kirill.shutemov@linux.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

71a2c112

03 6月, 2020 2 次提交

mm/memcg: automatically penalize tasks with high swap use · 4b82ab4f

由 Jakub Kicinski 提交于 6月 01, 2020

Add a memory.swap.high knob, which can be used to protect the system
from SWAP exhaustion. The mechanism used for penalizing is similar to
memory.high penalty (sleep on return to user space).

That is not to say that the knob itself is equivalent to memory.high.
The objective is more to protect the system from potentially buggy tasks
consuming a lot of swap and impacting other tasks, or even bringing the
whole system to stand still with complete SWAP exhaustion. Hopefully
without the need to find per-task hard limits.

Slowing misbehaving tasks down gradually allows user space oom killers
or other protection mechanisms to react. oomd and earlyoom already do
killing based on swap exhaustion, and memory.swap.high protection will
help implement such userspace oom policies more reliably.

We can use one counter for number of pages allocated under pressure to
save struct task space and avoid two separate hierarchy walks on the hot
path. The exact overage is calculated on return to user space, anyway.

Take the new high limit into account when determining if swap is "full".
Borrowing the explanation from Johannes:

The idea behind "swap full" is that as long as the workload has plenty
of swap space available and it's not changing its memory contents, it
makes sense to generously hold on to copies of data in the swap device,
even after the swapin. A later reclaim cycle can drop the page without
any IO. Trading disk space for IO.

But the only two ways to reclaim a swap slot is when they're faulted
in and the references go away, or by scanning the virtual address space
like swapoff does - which is very expensive (one could argue it's too
expensive even for swapoff, it's often more practical to just reboot).

So at some point in the fill level, we have to start freeing up swap
slots on fault/swapin. Otherwise we could eventually run out of swap
slots while they're filled with copies of data that is also in RAM.

We don't want to OOM a workload because its available swap space is
filled with redundant cache.
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4b82ab4f

mm, memcg: add workingset_restore in memory.stat · a6f5576b

由 Yafang Shao 提交于 6月 01, 2020

There's a new workingset counter introduced in commit 1899ad18 ("mm:
workingset: tell cache transitions from workingset thrashing").  With
the help of this counter we can know the workingset is transitioning or
thrashing.  To leverage the benifit of this counter to memcg, we should
introduce it into memory.stat.  Then we could know the workingset of the
workload inside a memcg better.

Bellow is the verification of this new counter in memory.stat.  Read a
file into the memory and then read it again to make these pages be
active.  The size of this file is 1G.  (memory.max is greater than file
size) The counters in memory.stat will be

	inactive_file 0
	active_file 1073639424

	workingset_refault 0
	workingset_activate 0
	workingset_restore 0
	workingset_nodereclaim 0

Trigger the memcg reclaim by setting a lower value to memory.high, and
then some pages will be demoted into inactive list, and then some pages
in the inactive list will be evicted into the storage.

	inactive_file 498094080
	active_file 310063104

	workingset_refault 0
	workingset_activate 0
	workingset_restore 0
	workingset_nodereclaim 0

Then recover the memory.high and read the file into memory again.  As a
result of it, the transitioning will occur.  Bellow is the result of
this transitioning,

	inactive_file 498094080
	active_file 575397888

	workingset_refault 64746
	workingset_activate 64746
	workingset_restore 64746
	workingset_nodereclaim 0
Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NChris Down <chris@chrisdown.name>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a6f5576b

02 6月, 2020 1 次提交

Documentation: kgdboc: Document new kgdboc_earlycon parameter · f71fc3bc

由 Douglas Anderson 提交于 5月 07, 2020

The recent patch ("kgdboc: Add kgdboc_earlycon to support early kgdb
using boot consoles") adds a new kernel command line parameter.
Document it.

Note that the patch adding the feature does some comparing/contrasting
of "kgdboc_earlycon" vs. the existing "ekgdboc". See that patch for
more details, but briefly "ekgdboc" can be used _instead_ of "kgdboc"
and just makes "kgdboc" do its normal initialization early (only works
if your tty driver is already ready). The new "kgdboc_earlycon" works
in combination with "kgdboc" and is backed by boot consoles.
Signed-off-by: NDouglas Anderson <dianders@chromium.org>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NDaniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200507130644.v4.9.I7d5eb42c6180c831d47aef1af44d0b8be3fac559@changeidSigned-off-by: NDaniel Thompson <daniel.thompson@linaro.org>

f71fc3bc

01 6月, 2020 2 次提交

mtd: Support kmsg dumper based on pstore/blk · 78c08247

由 WeiXiong Liao 提交于 3月 25, 2020

This introduces mtdpstore, which is similar to mtdoops but more
powerful. It uses pstore/blk, and aims to store panic and oops logs to
a flash partition, where pstore can later read back and present as files
in the mounted pstore filesystem.

To make mtdpstore work, the "blkdev" of pstore/blk should be set
as MTD device name or MTD device number. For more details, see
Documentation/admin-guide/pstore-blk.rst

This solves a number of issues:
- Work duplication: both of pstore and mtdoops do the same job storing
  panic/oops log. They have very similar logic, registering to kmsg
  dumper and storing logs to several chunks one by one.
- Layer violations: drivers should provides methods instead of polices.
  MTD should provide read/write/erase operations, and allow a higher
  level drivers to provide the chunk management, kmsg dump
  configuration, etc.
- Missing features: pstore provides many additional features, including
  presenting the logs as files, logging dump time and count, and
  supporting other frontends like pmsg, console, etc.
Signed-off-by: NWeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-11-keescook@chromium.org/
Link: https://lore.kernel.org/r/1589266715-4168-1-git-send-email-liaoweixiong@allwinnertech.comSigned-off-by: NKees Cook <keescook@chromium.org>

78c08247

pstore/blk: Support non-block storage devices · 7dcb7848

由 WeiXiong Liao 提交于 3月 25, 2020

Add support for non-block devices (e.g. MTD). A non-block driver calls
pstore_blk_register_device() to register iself.

In addition, pstore/zone is updated to handle non-block devices,
where an erase must be done before a write. Without this, there is no
way to remove records stored to an MTD.
Signed-off-by: NWeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-10-keescook@chromium.org/Co-developed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NKees Cook <keescook@chromium.org>

7dcb7848

31 5月, 2020 2 次提交

Documentation: Add details for pstore/blk · 649304c9

由 WeiXiong Liao 提交于 3月 25, 2020

Add details on using pstore/blk, the new backend of pstore to record
dumps to block devices, in Documentation/admin-guide/pstore-blk.rst
Signed-off-by: NWeiXiong Liao <liaoweixiong@allwinnertech.com>
Link: https://lore.kernel.org/lkml/20200511233229.27745-7-keescook@chromium.org/Signed-off-by: NKees Cook <keescook@chromium.org>

649304c9

pstore/ram: Introduce max_reason and convert dump_oops · 791205e3

由 Kees Cook 提交于 5月 13, 2020

Now that pstore_register() can correctly pass max_reason to the kmesg
dump facility, introduce a new "max_reason" module parameter and
"max-reason" Device Tree field.

The "dump_oops" module parameter and "dump-oops" Device
Tree field are now considered deprecated, but are now automatically
converted to their corresponding max_reason values when present, though
the new max_reason setting has precedence.

For struct ramoops_platform_data, the "dump_oops" member is entirely
replaced by a new "max_reason" member, with the only existing user
updated in place.

Additionally remove the "reason" filter logic from ramoops_pstore_write(),
as that is not specifically needed anymore, though technically
this is a change in behavior for any ramoops users also setting the
printk.always_kmsg_dump boot param, which will cause ramoops to behave as
if max_reason was set to KMSG_DUMP_MAX.
Co-developed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-6-keescook@chromium.org/Signed-off-by: NKees Cook <keescook@chromium.org>

791205e3

28 5月, 2020 1 次提交

cgroup: add cpu.stat file to root cgroup · 936f2a70

由 Boris Burkov 提交于 5月 27, 2020

Currently, the root cgroup does not have a cpu.stat file. Add one which
is consistent with /proc/stat to capture global cpu statistics that
might not fall under cgroup accounting.

We haven't done this in the past because the data are already presented
in /proc/stat and we didn't want to add overhead from collecting root
cgroup stats when cgroups are configured, but no cgroups have been
created.

By keeping the data consistent with /proc/stat, I think we avoid the
first problem, while improving the usability of cgroups stats.
We avoid the second problem by computing the contents of cpu.stat from
existing data collected for /proc/stat anyway.
Signed-off-by: NBoris Burkov <boris@bur.io>
Suggested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NTejun Heo <tj@kernel.org>

936f2a70

26 5月, 2020 4 次提交

docs: acpi: fix old http link and improve document format · dd9a41bc

由 Flavio Suligoi 提交于 5月 19, 2020

The website:

    http://wiki.minnowboard.org

doesn't exist anymore. The same pages are moved to:

    https://www.elinux.org/Minnowboard

Other improvements concern the introduction of some rst
semantic markup in the document.
Signed-off-by: NFlavio Suligoi <f.suligoi@asem.it>
Link: https://lore.kernel.org/r/20200519084128.12756-2-f.suligoi@asem.itSigned-off-by: NJonathan Corbet <corbet@lwn.net>

dd9a41bc

docs: sysctl/kernel: document unaligned controls · 997c798e

由 Stephen Kitt 提交于 5月 15, 2020

This documents ignore-unaligned-usertrap, unaligned-dump-stack, and
unaligned-trap, based on arch/arc/kernel/unaligned.c,
arch/ia64/kernel/unaligned.c, and arch/parisc/kernel/unaligned.c.

While we're at it, integrate unaligned-memory-access.txt into the docs
tree.
Signed-off-by: NStephen Kitt <steve@sk2.org>
Link: https://lore.kernel.org/r/20200515212443.5012-1-steve@sk2.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

997c798e

Documentation: admin-guide: update bug-hunting.rst · 4eb92411

由 Randy Dunlap 提交于 5月 16, 2020

Update Documentation/admin-guide/bug-hunting.rst:

- add a small section on "Modules linked in" and their possible flags;
- delete all references to ksymoops since it is no longer applicable;
- fix spello, grammar, and punctuation;
- note that get_maintainers.pl only provides recent patchers if it is
  run inside a git tree;
- add mention of scripts/decode_stacktrace.sh;
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Cc: greg@wind.rmcc.com
Link: https://lore.kernel.org/r/c629a9ef-3867-c3d1-f6c9-2c3b0e4ac68a@infradead.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

4eb92411

docs: sysctl/kernel: document ngroups_max · 17444d9b

由 Stephen Kitt 提交于 5月 18, 2020

This is a read-only export of NGROUPS_MAX.
Signed-off-by: NStephen Kitt <steve@sk2.org>
Link: https://lore.kernel.org/r/20200518145836.15816-1-steve@sk2.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

17444d9b

22 5月, 2020 1 次提交

x86/tsc: Add tsc_early_khz command line parameter · bd35c77e

由 Krzysztof Piecuch 提交于 1月 23, 2020

Changing base clock frequency directly impacts TSC Hz but not CPUID.16h
value. An overclocked CPU supporting CPUID.16h and with partial CPUID.15h
support will set TSC KHZ according to "best guess" given by CPUID.16h
relying on tsc_refine_calibration_work to give better numbers later.
tsc_refine_calibration_work will refuse to do its work when the outcome is
off the early TSC KHZ value by more than 1% which is certain to happen on
an overclocked system.

Fix this by adding a tsc_early_khz command line parameter that makes the
kernel skip early TSC calibration and use the given value instead.

This allows the user to provide the expected TSC frequency that is closer
to reality than the one reported by the hardware, enabling
tsc_refine_calibration_work to do meaningful error checking.

[ tglx: Made the variable __initdata as it's only used on init and
        removed the error checking in the argument parser because
	kstrto*() only stores to the variable if the string is valid ]
Signed-off-by: NKrzysztof Piecuch <piecuch@protonmail.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/O2CpIOrqLZHgNRkfjRpz_LGqnc1ix_seNIiOCvHY4RHoulOVRo6kMXKuLOfBVTi0SMMevg6Go1uZ_cL9fLYtYdTRNH78ChaFaZyG3VAyYz8=@protonmail.com

bd35c77e

21 5月, 2020 1 次提交

dm zoned: metadata version 2 · bd5c4031

由 Hannes Reinecke 提交于 5月 11, 2020

Implement handling for metadata version 2. The new metadata adds a
label and UUID for the device mapper device, and additional UUID for
the underlying block devices.

It also allows for an additional regular drive to be used for
emulating random access zones. The emulated zones will be placed
logically in front of the zones from the zoned block device, causing
the superblocks and metadata to be stored on that device.

The first zone of the original zoned device will be used to hold
another, tertiary copy of the metadata; this copy carries a generation
number of 0 and is never updated; it's just used for identification.
Signed-off-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

bd5c4031

20 5月, 2020 2 次提交

powerpc/64s/hash: Add stress_slb kernel boot option to increase SLB faults · 82a1b8ed

由 Nicholas Piggin 提交于 5月 11, 2020

This option increases the number of SLB misses by limiting the number
of kernel SLB entries, and increased flushing of cached lookaside
information. This helps stress test difficult to hit paths in the
kernel.
Reported-by: Nkbuild test robot <lkp@intel.com>
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
[mpe: Relocate the code into arch/powerpc/mm, s/torture/stress/]
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200511125825.3081305-1-mpe@ellerman.id.au

82a1b8ed

Documentation: admin-guide: pm: Document intel-speed-select · 213081da

由 Srinivas Pandruvada 提交于 5月 11, 2020

Added documentation to configure servers to use Intel(R) Speed
Select Technology using intel-speed-select tool.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: NAndriy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

213081da

19 5月, 2020 1 次提交

Documentation: cpuidle: update the document · 7395683a

由 Hanjun Guo 提交于 5月 19, 2020

Update the document after the remove of cpuidle_sysfs_switch.
Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
Reviewed-by: NDoug Smythies <dsmythies@telus.net>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

7395683a

18 5月, 2020 2 次提交

Revert "docs: sysctl/kernel: document ngroups_max" · fdb1b5e0

由 Jonathan Corbet 提交于 5月 18, 2020

This reverts commit 2f4c3306.

The changes here were fine, but there's a non-documentation change to
sysctl.c that makes messes elsewhere; those changes should have been done
independently.
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

fdb1b5e0

docs: gpio: Add GPIO Aggregator documentation · ce7a2f77

由 Geert Uytterhoeven 提交于 5月 11, 2020

Document the GPIO Aggregator, and the two typical use-cases.
Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
Tested-by: NEugeniu Rosca <erosca@de.adit-jv.com>
Reviewed-by: NUlrich Hecht <uli+renesas@fpond.eu>
Reviewed-by: NEugeniu Rosca <erosca@de.adit-jv.com>
Link: https://lore.kernel.org/r/20200511145257.22970-6-geert+renesas@glider.beSigned-off-by: NLinus Walleij <linus.walleij@linaro.org>

ce7a2f77

17 5月, 2020 1 次提交

netns: enable to inherit devconf from current netns · 9efd6a3c

由 Nicolas Dichtel 提交于 5月 13, 2020

The goal is to be able to inherit the initial devconf parameters from the
current netns, ie the netns where this new netns has been created.

This is useful in a containers environment where /proc/sys is read only.
For example, if a pod is created with specifics devconf parameters and has
the capability to create netns, the user expects to get the same parameters
than his 'init_net', which is not the real init_net in this case.
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9efd6a3c

16 5月, 2020 5 次提交

docs: add IRQ documentation at the core-api book · e00b0ab8

由 Mauro Carvalho Chehab 提交于 5月 01, 2020

There are 4 IRQ documentation files under Documentation/*.txt.

Move them into a new directory (core-api/irq) and add a new
index file for it.

While here, use a title markup for the Debugging section of the
irq-domain.rst file.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/2da7485c3718e1442e6b4c2dd66857b776e8899b.1588345503.git.mchehab+huawei@kernel.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

e00b0ab8

docs: debugging-via-ohci1394.txt: add it to the core-api book · a74e2a22

由 Mauro Carvalho Chehab 提交于 5月 01, 2020

There is an special chapter inside the core-api book about
some debug infrastructure like tracepoints and debug objects.

It sounded to me that this is the best place to add a chapter
explaining how to use a FireWire controller to do remote
kernel debugging, as explained on this document.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/9b489d36d08ad89d3ad5aefef1f52a0715b29716.1588345503.git.mchehab+huawei@kernel.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

a74e2a22

doc: Fix some errors in ras.rst · b17b24fc

由 Waiman Long 提交于 5月 06, 2020

Make some miscellaneous fixes to the first paragraph of "ECC memory":
 - Change the incorrect "74 bits" to "72 bits".
 - Change "mentioned on" to "mentioned in".
 - Remove the extra "extra".
 - Rephrase some sentences as suggested by Matthew Wilcox.
Signed-off-by: NWaiman Long <longman@redhat.com>
Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20200506162217.16633-1-longman@redhat.comSigned-off-by: NJonathan Corbet <corbet@lwn.net>

b17b24fc

Documentation: update numastat explanation · 77691ee9

由 Vlastimil Babka 提交于 5月 07, 2020

During recent patch discussion [1] it became apparent that the "other_node"
definition in the numastat documentation has always been different from actual
implementation. It was also noted that the stats can be innacurate on systems
with memoryless nodes.

This patch corrects the other_node definition (with minor tweaks to two more
definitions), adds a note about memoryless nodes and also two introductory
paragraphs to the numastat documentation.

[1] https://lore.kernel.org/linux-mm/20200504070304.127361-1-sandipan@linux.ibm.com/T/#uSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NSandipan Das <sandipan@linux.ibm.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Link: https://lore.kernel.org/r/20200507120217.12313-1-vbabka@suse.czSigned-off-by: NJonathan Corbet <corbet@lwn.net>

77691ee9

docs: sysctl/kernel: document ngroups_max · 2f4c3306

由 Stephen Kitt 提交于 5月 15, 2020

This is a read-only export of NGROUPS_MAX, so this patch also changes
the declarations in kernel/sysctl.c to const.
Signed-off-by: NStephen Kitt <steve@sk2.org>
Reviewed-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200515160222.7994-1-steve@sk2.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>

2f4c3306

15 5月, 2020 2 次提交

dm zoned: add 'message' callback · 90b39d58

由 Hannes Reinecke 提交于 5月 11, 2020

Add callback for 'dmsetup message' to allow the reclaim process
to be triggered manually.
Eg.

	dmsetup message /dev/dm-X 0 message

will start the reclaim process even if the default threshold
of 50 percent of free random zones is not reached.
Signed-off-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

90b39d58

dm zoned: add 'status' callback · bc3d5717

由 Hannes Reinecke 提交于 5月 11, 2020

Add callback to supply information for 'dmsetup status'
and 'dmsetup table'. The output for 'dmsetup status' is

0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential

where <nr_unmap_rnd> is the number of unmapped (ie free) random zones,
<nr_rnd> the total number of random zones, <nr_unmap_seq> the number
of unmapped sequential zones, and <nr_seq> the total number of
sequential zones.
Signed-off-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

bc3d5717

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功