提交 · 77a3ed6484959fb2da2a42952981f6c06ed51cb6 · openanolis / cloud-kernel

28 4月, 2020 7 次提交

x86/resctrl: Initialize the vendor-specific resource functions · 77a3ed64

由 Babu Moger 提交于 11月 21, 2018

to #26613714

commit 1ad4fa41d99f606098fd7b90c1f5deb2b3dfae5d upstream.

Initialize the resource functions that are different between the
vendors. Some features are initialized differently between the vendors.
Add _intel suffix to Intel-specific functions.

For example, the MBA feature varies significantly between Intel and AMD.
Separate the initialization of these resource functions. That way we can
easily add AMD's functions later.
Signed-off-by: NBabu Moger <babu.moger@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: <qianyue.zj@alibaba-inc.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Rian Hunter <rian@alum.mit.edu>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <xiaochen.shen@intel.com>
Link: https://lkml.kernel.org/r/20181121202811.4492-6-babu.moger@amd.comSigned-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Tested-by: NWANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

77a3ed64

x86/resctrl: Move all the macros to resctrl/internal.h · a7b49907

由 Babu Moger 提交于 11月 21, 2018

to #26613714

commit aa50453a448ad645ea05788505680aa403934aa8 upstream.

Move all the macros to resctrl/internal.h and rename the registers with
MSR_ prefix for consistency.

 [bp: align MSR definitions vertically ]
Signed-off-by: NBabu Moger <babu.moger@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: <qianyue.zj@alibaba-inc.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Rian Hunter <rian@alum.mit.edu>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <xiaochen.shen@intel.com>
Link: https://lkml.kernel.org/r/20181121202811.4492-5-babu.moger@amd.comSigned-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Tested-by: NWANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

a7b49907

x86/resctrl: Re-arrange the RDT init code · 0d7d96de

由 Babu Moger 提交于 11月 21, 2018

to #26613714

commit 0f00717ecce4b366b3f530153e982cb2eeadafd1 upstream.

Separate the call sequence for rdt_quirks and MBA feature. This is in
preparation to handle vendor differences in these call sequences. Rename
the functions to make the flow a bit more meaningful.
Signed-off-by: NBabu Moger <babu.moger@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: <qianyue.zj@alibaba-inc.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Rian Hunter <rian@alum.mit.edu>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <xiaochen.shen@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20181121202811.4492-4-babu.moger@amd.comSigned-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Tested-by: NWANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

0d7d96de

x86/resctrl: Rename the RDT functions and definitions · 51947221

由 Babu Moger 提交于 11月 21, 2018

to #26613714

commit 352940ececaca58536a7fc4ff6b41d181156fd65 upstream.

As AMD is starting to support RESCTRL features, rename the RDT functions
and definitions to more generic names.

Replace "intel_rdt" with "resctrl" where applicable.
Signed-off-by: NBabu Moger <babu.moger@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NBorislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: <qianyue.zj@alibaba-inc.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Rian Hunter <rian@alum.mit.edu>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <xiaochen.shen@intel.com>
Link: https://lkml.kernel.org/r/20181121202811.4492-3-babu.moger@amd.comSigned-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Tested-by: NWANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

51947221

x86/resctrl: Rename and move rdt files to a separate directory · ad6da85b

由 Babu Moger 提交于 11月 21, 2018

to #26613714

commit fa7d949337ccad32c76740c88e0e0351c349053b upstream.

New generation of AMD processors add support for RDT (or QOS) features.
Together, these features will be called RESCTRL. With more than one
vendors supporting these features, it seems more appropriate to rename
these files.

Create a new directory with the name 'resctrl' and move all the
intel_rdt files to the new directory. This way all the resctrl related
code resides inside one directory.

 [ bp: Add SPDX identifier to the Makefile ]
Suggested-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NBabu Moger <babu.moger@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: <qianyue.zj@alibaba-inc.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Rian Hunter <rian@alum.mit.edu>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <xiaochen.shen@intel.com>
Link: https://lkml.kernel.org/r/20181121202811.4492-2-babu.moger@amd.com

[ Shile: fixed conflict in arch/x86/kernel/cpu/resctrl/pseudo_lock.c ]
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Tested-by: NWANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

ad6da85b

mm: mempolicy: require at least one nodeid for MPOL_PREFERRED · fd0d6625

由 Randy Dunlap 提交于 4月 01, 2020

to #24913189, fixes CVE-2020-11565.

commit aa9f7d5172fac9bf1f09e678c35e287a40a7b7dd upstream.

Using an empty (malformed) nodelist that is not caught during mount option
parsing leads to a stack-out-of-bounds access.

The option string that was used was: "mpol=prefer:,".  However,
MPOL_PREFERRED requires a single node number, which is not being provided
here.

Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's
nodeid.

Fixes: 095f1fc4 ("mempolicy: rework shmem mpol parsing and display")
Reported-by: NEntropy Moe <3ntr0py1337@gmail.com>
Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

fd0d6625

configs: update configs to adapt code changes · af776790

由 Shile Zhang 提交于 4月 28, 2020

to #26832542
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

af776790

27 4月, 2020 1 次提交

serial: 8250: Support disabling mdelay-filled probes of 16550A variants · 67cdb419

由 Josh Triplett 提交于 1月 10, 2020

to #26832542

commit dc56ecb81a0aa46a7e127e916df5c8fdb8364f0b upstream

The 8250 driver can probe for many variants of the venerable 16550A
serial port. Some of those probes involve long (20ms) mdelay calls,
which delay system boot. Modern systems and virtual machines don't have
those variants.

Provide a Kconfig option to disable probes for 16550A variants.
Disabling this speeds up the boot of a virtual machine with a serial
console by more than 20ms (a substantial fraction of the ~100ms needed
to boot a carefully configured VM).

Before:
[  +0.021919] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
After:
[  +0.000097] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
Link: https://lore.kernel.org/r/20200111022513.GA166267@localhostSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NHuaixin Chang <changhuaixin@linux.alibaba.com>
Acked-by: Nluanshi <zhangliguang@linux.alibaba.com>

67cdb419

26 4月, 2020 5 次提交

alinux: kernel: Reduce tasklist_lock contention at fork and exit · f14304cd

由 Xunlei Pang 提交于 11月 16, 2018

to #16966300
to #16966377

We observed heavy tasklist_lock contention when offline tasks
start and end, which caused long scheduling latency and long
period with interrupt off since write_lock_irq() is called.

In extreme cases with tons of concurrent fork and exit events,
it can cause the system hung.

This patch changed them to use trylock, then operations within
lock consume little time, so it naturally addresses the issue.

After this patch, when I launched and killed thousands of tasks,
the latency can reduce from tens of milliscends to around 2ms on
my box.

The patch can pass Unixbench tests, no regression introduced.
Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
Acked-by: NShile Zhang <shile.zhang@linux.alibaba.com>

f14304cd

alinux: oom: add ratelimit printk to prevent softlockup · 41a1a935

由 zhongjiang-ali 提交于 3月 28, 2020

fix #26198457

Due to some pressure case, it is easier to triggers oom, which will
result in more printk message. we limit the dump_header printk by
ratelimit in memory cgroup further, but the single printk should be
limited in that cases, Otherwise, softlockup will triggers.

[xuyu: replace printk_ratelimit with __ratelimit, and fix warning on
quoted string split across lines]
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>

41a1a935

alinux: Limit the print message frequency when memcg oom triggers · a4229072

由 zhongjiang-ali 提交于 3月 18, 2020

fix #25854413

By above message, we see that a lot of memcg oom is triggered by
test cases. And has no eligiable process to be killed in that case.
Hence, It will print too much oom message by dump_header, resulting
in the interrupt is not responded normally.

The patch restrict the print times by ratelimit when memcg oom
triggers. And as for system level oom, we do not limit tht case
because it will panic later.
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com>

a4229072

alinux: mm: restrict the print message frequency further when memcg oom triggers · 8962f125

由 zhongjiang-ali 提交于 2月 24, 2020

to #24843736

It is because too much memcg oom printed message will trigger the softlockup.
In general, we use the same ratelimit oom_rc between system and memcg
to limit the print message. But it is more frequent to exceed its limit
of the memcg, thus it would will result in oom easily. And A lot of
printed information will be outputed. It's likely to trigger softlockup.

The patch use different ratelimit to limit the memcg and system oom. And
we test the patch using the default value in the memcg, The issue will
go.

[xuyu: adjust corresponding sysctl indexes]
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com>

8962f125

configs: enable null block device for blk-mq tests · f5d6c930

由 Joseph Qi 提交于 4月 24, 2020

to #26648913
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f5d6c930

24 4月, 2020 13 次提交

configs: enable security and sm3/4 configs for trust · 34e80779

由 Tianjia Zhang 提交于 4月 23, 2020

to #26581627

Enable a set of config related to kernel trust, including
vptm, IMA, EVM, smack LSM, and SM3, SM4, in which SM algorithm
is compiled into module.

These configs also support the x64 arm platform.

The following are some performance data. The output of
`systemd-analyze` on the startup time of an ECS (executed
three times):

before:
725ms (kernel) + 471ms (initrd) + 3.375s (userspace) = 4.572s
794ms (kernel) + 426ms (initrd) + 3.281s (userspace) = 4.501s
797ms (kernel) + 464ms (initrd) + 3.275s (userspace) = 4.538s

after:
777ms (kernel) + 439ms (initrd) + 3.456s (userspace) = 4.672s
785ms (kernel) + 450ms (initrd) + 3.313s (userspace) = 4.549s
741ms (kernel) + 481ms (initrd) + 3.274s (userspace) = 4.497s
Signed-off-by: NTianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Jia Zhang <zhang.jia@linux.alibaba.com>
Reviewed-by: NShile Zhang <shile.zhang@linux.alibaba.com>

34e80779

ext4: fix error pointer dereference · f958a9df

由 Jeffle Xu 提交于 4月 22, 2020

fix #27122350

Don't pass error pointers to brelse().

commit 7159a986b420 ("ext4: fix some error pointer dereferences") has fixed
some cases, fix the remaining one case.

Once ext4_xattr_block_find()->ext4_sb_bread() failed, error pointer is
stored in @bs->bh, which will be passed to brelse() in the cleanup
routine of ext4_xattr_set_handle(). This will then cause a NULL panic
crash in __brelse().

BUG: unable to handle kernel NULL pointer dereference at 000000000000005b
RIP: 0010:__brelse+0x1b/0x50
Call Trace:
 ext4_xattr_set_handle+0x163/0x5d0
 ext4_xattr_set+0x95/0x110
 __vfs_setxattr+0x6b/0x80
 __vfs_setxattr_noperm+0x68/0x1b0
 vfs_setxattr+0xa0/0xb0
 setxattr+0x12c/0x1a0
 path_setxattr+0x8d/0xc0
 __x64_sys_setxattr+0x27/0x30
 do_syscall_64+0x60/0x250
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this case, @bs->bh stores '-EIO' actually.

Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Cc: stable@kernel.org # 2.6.19
Reviewed-by: NRitesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Link: https://lore.kernel.org/linux-ext4/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com/Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

f958a9df

alinux: sched: Introduce per-cgroup iowait accounting · 9e7b35d6

由 Yihao Wu 提交于 4月 21, 2020

to #26424323

We account iowait when the cgroup's se is idle, and it has blocked
task on the hierarchy of se->my_q.

To achieve this, we also add cg_nr_running to track the hierarchical
number of blocked tasks. We do it when a blocked task wakes up or
a task is blocked.
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

9e7b35d6

alinux: sched: Introduce per-cgroup steal accounting · c7552980

由 Yihao Wu 提交于 3月 10, 2020

to #26424323

From the previous patch. We know there are 4 possible states.
Since steal state's transition is complex. We choose to account
its supplement.

        steal = elapse - idle - sum_exec_raw - ineffective

Where elapse is the time since the cgroup is created. sum_exec_raw is
the running time including IRQ time. ineffective is the total time that
the cpuacct-binded cpuset doesn't allow this cpu for the cgroup.
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

c7552980

alinux: sched: Introduce per-cgroup idle accounting · 61e58859

由 Yihao Wu 提交于 3月 10, 2020

to #26424323

Since we concern idle, let's take idle as the center state. And omit
transition between other stats. Below is the state transition graph:

                                sleep->deque
+-----------+ cpumask +-------+ exit->deque +-------+
|ineffective|-------- | idle  | <-----------|running|
+-----------+         +-------+             +-------+
                        ^ |
 unthrtl child -> deque | |
          wake -> deque | |thrtl chlid -> enque
       migrate -> deque | |migrate -> enque
                        | v
                      +-------+
                      | steal |
                      +-------+

We conclude idle state condition as:

!se->on_rq && !my_q->throttled && cpu allowed.

From this graph and condition, we can hook (de|en)queue_task_fair
update_cpumasks_hier, (un|)throttle_cfs_rq to account idle state.

In the hooked functions, we also check the conditions, to avoid
accounting unwanted cpu clocks.
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

61e58859

alinux: cpuacct: make cpuacct record nr_migrations · 965d75d3

由 Yihao Wu 提交于 1月 17, 2020

to #26424323

This patch makes cpuacct to be able to monitor the number of
across-cpu-migrations. Output as follows:

   [root@caspar /sys/fs/cgroup/cpuacct]
   # cat cpuacct.proc_stat
   user 7727
   nice 4
   <snip>
   nr_migrations 48432
Signed-off-by: NZhu Yanhai <zhu.yanhai@linux.alibaba.com>
Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>

965d75d3

alinux: cpuacct: Export nr_running & nr_uninterruptible · 9410d314

由 Yihao Wu 提交于 1月 17, 2020

to #26424323

cpu cgroup's nr_running and nr_uninterruptible are useful
for troubleshooting. Export them in cpuacct.proc_stat.
Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>

9410d314

alinux: sched/cputime: Fix guest cputime of cpuacct.proc_stat · 004b3ba9

由 Shanpei Chen 提交于 11月 06, 2019

to #26424323

For container only cases, since guest cputime is always 0, we don't
calculate it and return 0 directly before.

Howerver, when running vm inside a cgroup, we expect the cgroup to
maintain guest cputime correctly.
Tested-by: NYihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>

004b3ba9

alinux: cpuacct/proc_stat: Consider isolcpus · 8ab45c97

由 Xunlei Pang 提交于 11月 06, 2019

to #26424323

When "isolcpus=" is passed, skip all its accountings.
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
Tested-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>

8ab45c97

alinux: cpuacct: export cpuacct.proc_stat interface · 9be0ac2b

由 Xunlei Pang 提交于 7月 23, 2019

to #26424323

Add the cgroup file "cpuacct.proc_stat", we'll export per-cgroup
cpu usages and some other scheduler statistics in this interface.
Reviewed-by: NMichael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>

9be0ac2b

alinux: sched: Maintain "nr_uninterruptible" in runqueue · 36da4fe9

由 Xunlei Pang 提交于 7月 24, 2019

to #26424323

It's relatively easy to maintain nr_uninterruptible in scheduler
compared to doing it in cpuacct, we assume that "cpu,cpuacct" are
bound together, so that it can be used for per-cgroup load.

This will be needed to calculate per-cgroup load average later.
Reviewed-by: NMichael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>

36da4fe9

alinux: config: enable CONFIG_SCHED_SLI · 50fb11b1

由 Yihao Wu 提交于 4月 19, 2020

to #26424323

Enable CONFIG_SCHED_SLI by default.
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NShanpei Chen <shanpeic@linux.alibaba.com>

50fb11b1

alinux: sched: add kconfig SCHED_SLI · 06d7286d

由 Yihao Wu 提交于 4月 18, 2020

to #26424323

This introduces the new bool kconfig SCHED_SLI, determining whether the
scheduler SLI feature should be built-in or not.
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Acked-by: NShanpei Chen <shanpeic@linux.alibaba.com>

06d7286d

23 4月, 2020 14 次提交

mm, compaction: fully assume capture is not NULL in compact_zone_order() · 087f53fc

由 Vlastimil Babka 提交于 4月 01, 2020

to #26255339

commit 6467552ca64c4ddd2b83ed73192107d7145f533b upstream

Dan reports:

The patch 5e1f0f098b46: "mm, compaction: capture a page under direct
compaction" from Mar 5, 2019, leads to the following Smatch complaint:

    mm/compaction.c:2321 compact_zone_order()
     error: we previously assumed 'capture' could be null (see line 2313)

mm/compaction.c
  2288  static enum compact_result compact_zone_order(struct zone *zone, int order,
  2289                  gfp_t gfp_mask, enum compact_priority prio,
  2290                  unsigned int alloc_flags, int classzone_idx,
  2291                  struct page **capture)
                                      ^^^^^^^

  2313		if (capture)
                    ^^^^^^^
Check for NULL

  2314			current->capture_control = &capc;
  2315
  2316		ret = compact_zone(&cc, &capc);
  2317
  2318		VM_BUG_ON(!list_empty(&cc.freepages));
  2319		VM_BUG_ON(!list_empty(&cc.migratepages));
  2320
  2321		*capture = capc.page;
                ^^^^^^^^
Unchecked dereference.

  2322		current->capture_control = NULL;
  2323

In practice this is not an issue, as the only caller path passes non-NULL
capture:

__alloc_pages_direct_compact()
  struct page *page = NULL;
  try_to_compact_pages(capture = &page);
    compact_zone_order(capture = capture);

So let's remove the unnecessary check, which should also make Smatch happy.

Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NMel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/18b0df3c-0589-d96c-23fa-040798fee187@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

087f53fc

mm/compaction: add missing annotation for compact_lock_irqsave · 61d6ab62

由 Jules Irenge 提交于 4月 06, 2020

to #26255339

commit 77337edee7598d82fb5acf66cb91a5b3f0c46add upstream

Sparse reports a warning at compact_lock_irqsave()

warning: context imbalance in compact_lock_irqsave() - wrong count at exit

The root cause is the missing annotation at compact_lock_irqsave()
Add the missing __acquires(lock) annotation.
Signed-off-by: NJules Irenge <jbi.octave@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200214204741.94112-6-jbi.octave@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

61d6ab62

mm, compaction: fix wrong pfn handling in __reset_isolation_pfn() · a5c295c0

由 Vlastimil Babka 提交于 10月 14, 2019

to #26255339

commit a2e9a5afce080226edbf1882d63d99bf32070e9e upstream

Florian and Dave reported [1] a NULL pointer dereference in
__reset_isolation_pfn().  While the exact cause is unclear, staring at
the code revealed two bugs, which might be related.

One bug is that if zone starts in the middle of pageblock, block_page
might correspond to different pfn than block_pfn, and then the
pfn_valid_within() checks will check different pfn's than those accessed
via struct page.  This might result in acessing an unitialized page in
CONFIG_HOLES_IN_ZONE configs.

The other bug is that end_page refers to the first page of next
pageblock and not last page of current pageblock.  The online and valid
check is then wrong and with sections, the while (page < end_page) loop
might wander off actual struct page arrays.

[1] https://lore.kernel.org/linux-xfs/87o8z1fvqu.fsf@mid.deneb.enyo.de/

Link: http://lkml.kernel.org/r/20191008152915.24704-1-vbabka@suse.cz
Fixes: 6b0868c820ff ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints")
Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Reported-by: NFlorian Weimer <fw@deneb.enyo.de>
Reported-by: NDave Chinner <david@fromorbit.com>
Acked-by: NMel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

a5c295c0

mm/compaction.c: remove unnecessary zone parameter in isolate_migratepages() · 57405fb9

由 Pengfei Li 提交于 9月 23, 2019

to #26255339

commit 32aaf0553df99cc4314f6e9f43216cd83afc6c20 upstream

Like commit 40cacbcb3240 ("mm, compaction: remove unnecessary zone
parameter in some instances"), remove unnecessary zone parameter.

No functional change.

Link: http://lkml.kernel.org/r/20190806151616.21107-1-lpf.vector@gmail.comSigned-off-by: NPengfei Li <lpf.vector@gmail.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

57405fb9

mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone · 8539ae81

由 Yafang Shao 提交于 9月 23, 2019

to #26255339

commit a94b525241c0fff3598809131d7cfcfe1d572d8c upstream

total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and
COMPACTFREE_SCANNED in compact_zone().  We should clear them before
scanning a new zone.  In the proc triggered compaction, we forgot clearing
them.

[laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
  Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
[akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
[vbabka@suse.cz: squash compact_zone() list_head init as well]
  Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
[akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
Fixes: 7f354a54 ("mm, compaction: add vmstats for kcompactd work")
Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

8539ae81

mm: compaction: avoid 100% CPU usage during compaction when a task is killed · 377f424a

由 Mel Gorman 提交于 4月 04, 2020

to #26255339

commit 670105a25608affe01cb0ccdc2a1f4bd2327172b upstream

"howaboutsynergy" reported via kernel buzilla number 204165 that
compact_zone_order was consuming 100% CPU during a stress test for
prolonged periods of time. Specifically the following command, which
should exit in 10 seconds, was taking an excessive time to finish while
the CPU was pegged at 100%.

stress -m 220 --vm-bytes 1000000000 --timeout 10

Tracing indicated a pattern as follows

stress-3923 [007] 519.106208: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106212: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106216: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106219: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106223: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106227: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106231: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106235: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106238: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0
stress-3923 [007] 519.106242: mm_compaction_isolate_migratepages: range=(0x70bb80 ~ 0x70bb80) nr_scanned=0 nr_taken=0

Note that compaction is entered in rapid succession while scanning and
isolating nothing. The problem is that when a task that is compacting
receives a fatal signal, it retries indefinitely instead of exiting
while making no progress as a fatal signal is pending.

It's not easy to trigger this condition although enabling zswap helps on
the basis that the timing is altered. A very small window has to be hit
for the problem to occur (signal delivered while compacting and
isolating a PFN for migration that is not aligned to SWAP_CLUSTER_MAX).

This was reproduced locally -- 16G single socket system, 8G swap, 30%
zswap configured, vm-bytes 22000000000 using Colin Kings stress-ng
implementation from github running in a loop until the problem hits).
Tracing recorded the problem occurring almost 200K times in a short
window. With this patch, the problem hit 4 times but the task existed
normally instead of consuming CPU.

This problem has existed for some time but it was made worse by commit
cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as
contention"). Before that commit, if the same condition was hit then
locks would be quickly contended and compaction would exit that way.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204165
Link: http://lkml.kernel.org/r/20190718085708.GE24383@techsingularity.net
Fixes: cf66f0700c8f ("mm, compaction: do not consider a need to reschedule as contention")
Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [5.1+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

377f424a

mm, compaction: make sure we isolate a valid PFN · 89c885ab

由 Suzuki K Poulose 提交于 5月 31, 2019

to #26255339

commit e577c8b64d58fe307ea4d5149d31615df2d90861 upstream

When we have holes in a normal memory zone, we could endup having
cached_migrate_pfns which may not necessarily be valid, under heavy memory
pressure with swapping enabled ( via __reset_isolation_suitable(),
triggered by kswapd).

Later if we fail to find a page via fast_isolate_freepages(), we may end
up using the migrate_pfn we started the search with, as valid page.  This
could lead to accessing NULL pointer derefernces like below, due to an
invalid mem_section pointer.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825]
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
 [0000000000000008] pgd=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 ...
 CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
 Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
 pstate: 60000005 (nZCv daif -PAN -UAO)
 pc : set_pfnblock_flags_mask+0x58/0xe8
 lr : compaction_alloc+0x300/0x950
 [...]
 Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
 Call trace:
  set_pfnblock_flags_mask+0x58/0xe8
  compaction_alloc+0x300/0x950
  migrate_pages+0x1a4/0xbb0
  compact_zone+0x750/0xde8
  compact_zone_order+0xd8/0x118
  try_to_compact_pages+0xb4/0x290
  __alloc_pages_direct_compact+0x84/0x1e0
  __alloc_pages_nodemask+0x5e0/0xe18
  alloc_pages_vma+0x1cc/0x210
  do_huge_pmd_anonymous_page+0x108/0x7c8
  __handle_mm_fault+0xdd4/0x1190
  handle_mm_fault+0x114/0x1c0
  __get_user_pages+0x198/0x3c0
  get_user_pages_unlocked+0xb4/0x1d8
  __gfn_to_pfn_memslot+0x12c/0x3b8
  gfn_to_pfn_prot+0x4c/0x60
  kvm_handle_guest_abort+0x4b0/0xcd8
  handle_exit+0x140/0x1b8
  kvm_arch_vcpu_ioctl_run+0x260/0x768
  kvm_vcpu_ioctl+0x490/0x898
  do_vfs_ioctl+0xc4/0x898
  ksys_ioctl+0x8c/0xa0
  __arm64_sys_ioctl+0x28/0x38
  el0_svc_common+0x74/0x118
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc
 Code: f8607840 f100001f 8b011401 9a801020 (f9400400)
 ---[ end trace af6a35219325a9b6 ]---

The issue was reported on an arm64 server with 128GB with holes in the
zone (e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while
running 100 KVM guest instances.

This patch fixes the issue by ensuring that the page belongs to a valid
PFN when we fallback to using the lower limit of the scan range upon
failure in fast_isolate_freepages().

Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com
Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: NSuzuki K Poulose <suzuki.poulose@arm.com>
Reported-by: NMarc Zyngier <marc.zyngier@arm.com>
Reviewed-by: NMel Gorman <mgorman@techsingularity.net>
Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

89c885ab

mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock · 39470173

由 Mel Gorman 提交于 5月 17, 2019

to #26255339

commit 60fce36afa9c77c7ccbf980c4f670f3be3651fce upstream

syzbot reported the following error from a tree with a head commit of
baf76f0c58ae ("slip: make slhc_free() silently accept an error pointer")

  BUG: unable to handle kernel paging request at ffffea0003348000
  #PF error: [normal kernel read fault]
  PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
  RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
  RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
  Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
  ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49
  c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
  RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
  RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
  RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
  RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
  R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
  FS:  00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
  Call Trace:
   fast_isolate_around mm/compaction.c:1243 [inline]
   fast_isolate_freepages mm/compaction.c:1418 [inline]
   isolate_freepages mm/compaction.c:1438 [inline]
   compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550

There is no reproducer and it is difficult to hit -- 1 crash every few
days.  The issue is very similar to the fix in commit 6b0868c820ff
("mm/compaction.c: correct zone boundary handling when resetting pageblock
skip hints").  When isolating free pages around a target pageblock, the
boundary handling is off by one and can stray into the next pageblock.
Triggering the syzbot error requires that the end of pageblock is section
or zone aligned, and that the next section is unpopulated.

A more subtle consequence of the bug is that pageblocks were being
improperly used as migration targets which potentially hurts fragmentation
avoidance in the long-term one page at a time.

A debugging patch revealed that it's definitely possible to stray outside
of a pageblock which is not intended.  While syzbot cannot be used to
verify this patch, it was confirmed that the debugging warning no longer
triggers with this patch applied.  It has also been confirmed that the THP
allocation stress tests are not degraded by this patch.

Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> # v5.1+
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

39470173

mm/compaction.c: fix an undefined behaviour · 00bb2964

由 Qian Cai 提交于 5月 13, 2019

to #26255339

commit dd7ef7bd14640f11763b54f55131000165f48321 upstream

In a low-memory situation, cc->fast_search_fail can keep increasing as it
is unable to find an available page to isolate in
fast_isolate_freepages().  As the result, it could trigger an error below,
so just compare with the maximum bits can be shifted first.

UBSAN: Undefined behaviour in mm/compaction.c:1160:30
shift exponent 64 is too large for 64-bit type 'unsigned long'
CPU: 131 PID: 1308 Comm: kcompactd1 Kdump: loaded Tainted: G
W    L    5.0.0+ #17
Call trace:
 dump_backtrace+0x0/0x450
 show_stack+0x20/0x2c
 dump_stack+0xc8/0x14c
 __ubsan_handle_shift_out_of_bounds+0x7e8/0x8c4
 compaction_alloc+0x2344/0x2484
 unmap_and_move+0xdc/0x1dbc
 migrate_pages+0x274/0x1310
 compact_zone+0x26ec/0x43bc
 kcompactd+0x15b8/0x1a24
 kthread+0x374/0x390
 ret_from_fork+0x10/0x18

[akpm@linux-foundation.org: code cleanup]
Link: http://lkml.kernel.org/r/20190320203338.53367-1-cai@lca.pw
Fixes: 70b44595eafe ("mm, compaction: use free lists to quickly locate a migration source")
Signed-off-by: NQian Cai <cai@lca.pw>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NMel Gorman <mgorman@techsingularity.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

00bb2964

mm/compaction.c: abort search if isolation fails · 7f458a1c

由 Qian Cai 提交于 4月 04, 2019

to #26255339

commit 5b56d996dd50a9d2ca87c25ebb50c07b255b7e04 upstream

Running LTP oom01 in a tight loop or memory stress testing put the system
in a low-memory situation could triggers random memory corruption like
page flag corruption below due to in fast_isolate_freepages(), if
isolation fails, next_search_order() does not abort the search immediately
could lead to improper accesses.

UBSAN: Undefined behaviour in ./include/linux/mm.h:1195:50
index 7 is out of range for type 'zone [5]'
Call Trace:
 dump_stack+0x62/0x9a
 ubsan_epilogue+0xd/0x7f
 __ubsan_handle_out_of_bounds+0x14d/0x192
 __isolate_free_page+0x52c/0x600
 compaction_alloc+0x886/0x25f0
 unmap_and_move+0x37/0x1e70
 migrate_pages+0x2ca/0xb20
 compact_zone+0x19cb/0x3620
 kcompactd_do_work+0x2df/0x680
 kcompactd+0x1d8/0x6c0
 kthread+0x32c/0x3f0
 ret_from_fork+0x35/0x40
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:3124!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
RIP: 0010:__isolate_free_page+0x464/0x600
RSP: 0000:ffff888b9e1af848 EFLAGS: 00010007
RAX: 0000000030000000 RBX: ffff888c39fcf0f8 RCX: 0000000000000000
RDX: 1ffff111873f9e25 RSI: 0000000000000004 RDI: ffffed1173c35ef6
RBP: ffff888b9e1af898 R08: fffffbfff4fc2461 R09: fffffbfff4fc2460
R10: fffffbfff4fc2460 R11: ffffffffa7e12303 R12: 0000000000000008
R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000007
FS:  0000000000000000(0000) GS:ffff888ba8e80000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc7abc00000 CR3: 0000000752416004 CR4: 00000000001606a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 compaction_alloc+0x886/0x25f0
 unmap_and_move+0x37/0x1e70
 migrate_pages+0x2ca/0xb20
 compact_zone+0x19cb/0x3620
 kcompactd_do_work+0x2df/0x680
 kcompactd+0x1d8/0x6c0
 kthread+0x32c/0x3f0
 ret_from_fork+0x35/0x40

Link: http://lkml.kernel.org/r/20190320192648.52499-1-cai@lca.pw
Fixes: dbe2d4e4f12e ("mm, compaction: round-robin the order while searching the free lists for a target")
Signed-off-by: NQian Cai <cai@lca.pw>
Acked-by: NMel Gorman <mgorman@techsingularity.net>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

7f458a1c

mm, page_alloc: always use a captured page regardless of compaction result · 19a4cd3c

由 Mel Gorman 提交于 4月 04, 2020

to #26255339

commit ee8ab0eeb49bd3982090c8f14dc9cc65bcd13c5c upstream

During the development of commit 5e1f0f098b46 ("mm, compaction: capture
a page under direct compaction"), a paranoid check was added to ensure
that if a captured page was available after compaction that it was
consistent with the final state of compaction.  The intent was to catch
serious programming bugs such as using a stale page pointer and causing
corruption problems.

However, it is possible to get a captured page even if compaction was
unsuccessful if an interrupt triggered and happened to free pages in
interrupt context that got merged into a suitable high-order page.  It's
highly unlikely but Li Wang did report the following warning on s390
occuring when testing OOM handling.  Note that the warning is slightly
edited for clarity.

  WARNING: CPU: 0 PID: 9783 at mm/page_alloc.c:3777 __alloc_pages_direct_compact+0x182/0x190
  Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs
    lockd grace fscache sunrpc pkey ghash_s390 prng xts aes_s390
    des_s390 des_generic sha512_s390 zcrypt_cex4 zcrypt vmur binfmt_misc
    ip_tables xfs libcrc32c dasd_fba_mod qeth_l2 dasd_eckd_mod dasd_mod
    qeth qdio lcs ctcm ccwgroup fsm dm_mirror dm_region_hash dm_log
    dm_mod
  CPU: 0 PID: 9783 Comm: copy.sh Kdump: loaded Not tainted 5.1.0-rc 5 #1

This patch simply removes the check entirely instead of trying to be
clever about pages freed from interrupt context.  If a serious
programming error was introduced, it is highly likely to be caught by
prep_new_page() instead.

Link: http://lkml.kernel.org/r/20190419085133.GH18914@techsingularity.net
Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
Reported-by: NLi Wang <liwang@redhat.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

19a4cd3c

mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints · 99d1c410

由 Mel Gorman 提交于 4月 04, 2019

to #26255339

commit 6b0868c820ff7370d15d6ddfe71b1ce6bbe8a25d upstream

Mikhail Gavrilo reported the following bug being triggered in a Fedora
kernel based on 5.1-rc1 but it is relevant to a vanilla kernel.

 kernel: page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
 kernel: ------------[ cut here ]------------
 kernel: kernel BUG at include/linux/mm.h:1021!
 kernel: invalid opcode: 0000 [#1] SMP NOPTI
 kernel: CPU: 6 PID: 116 Comm: kswapd0 Tainted: G         C        5.1.0-0.rc1.git1.3.fc31.x86_64 #1
 kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1201 12/07/2018
 kernel: RIP: 0010:__reset_isolation_pfn+0x244/0x2b0
 kernel: Code: fe 06 e8 0f 8e fc ff 44 0f b6 4c 24 04 48 85 c0 0f 85 dc fe ff ff e9 68 fe ff ff 48 c7 c6 58 b7 2e 8c 4c 89 ff e8 0c 75 00 00 <0f> 0b 48 c7 c6 58 b7 2e 8c e8 fe 74 00 00 0f 0b 48 89 fa 41 b8 01
 kernel: RSP: 0018:ffff9e2d03f0fde8 EFLAGS: 00010246
 kernel: RAX: 0000000000000034 RBX: 000000000081f380 RCX: ffff8cffbddd6c20
 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8cffbddd6c20
 kernel: RBP: 0000000000000001 R08: 0000009898b94613 R09: 0000000000000000
 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000100000
 kernel: R13: 0000000000100000 R14: 0000000000000001 R15: ffffca7de07ce000
 kernel: FS:  0000000000000000(0000) GS:ffff8cffbdc00000(0000) knlGS:0000000000000000
 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: CR2: 00007fc1670e9000 CR3: 00000007f5276000 CR4: 00000000003406e0
 kernel: Call Trace:
 kernel:  __reset_isolation_suitable+0x62/0x120
 kernel:  reset_isolation_suitable+0x3b/0x40
 kernel:  kswapd+0x147/0x540
 kernel:  ? finish_wait+0x90/0x90
 kernel:  kthread+0x108/0x140
 kernel:  ? balance_pgdat+0x560/0x560
 kernel:  ? kthread_park+0x90/0x90
 kernel:  ret_from_fork+0x27/0x50

He bisected it down to e332f741a8dd ("mm, compaction: be selective about
what pageblocks to clear skip hints").  The problem is that the patch in
question was sloppy with respect to the handling of zone boundaries.  In
some instances, it was possible for PFNs outside of a zone to be examined
and if those were not properly initialised or poisoned then it would
trigger the VM_BUG_ON.  This patch corrects the zone boundary issues when
resetting pageblock skip hints and Mikhail reported that the bug did not
trigger after 30 hours of testing.

Link: http://lkml.kernel.org/r/20190327085424.GL3189@techsingularity.net
Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
Reported-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Tested-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

99d1c410

mm, compaction: capture a page under direct compaction · 35d915be

由 Mel Gorman 提交于 4月 04, 2020

to #26255339

commit 5e1f0f098b4649fad53011246bcaeff011ffdf5d upstream

Compaction is inherently race-prone as a suitable page freed during
compaction can be allocated by any parallel task. This patch uses a
capture_control structure to isolate a page immediately when it is freed
by a direct compactor in the slow path of the page allocator. The
intent is to avoid redundant scanning.

5.0.0-rc1 5.0.0-rc1
selective-v3r17 capture-v3r19
Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%)
Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%)
Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%)
Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%)
Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%)
Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%*
Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%)
Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%)

Latency is only moderately affected but the devil is in the details. A
closer examination indicates that base page fault latency is reduced but
latency of huge pages is increased as it takes creater care to succeed.
Part of the "problem" is that allocation success rates are close to 100%
even when under pressure and compaction gets harder

5.0.0-rc1 5.0.0-rc1
selective-v3r17 capture-v3r19
Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%)
Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%)
Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%)
Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%)
Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%)
Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%)
Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%)
Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%)

And scan rates are reduced as expected by 6% for the migration scanner
and 29% for the free scanner indicating that there is less redundant
work.

Compaction migrate scanned 20815362 19573286
Compaction free scanned 16352612 11510663

[mgorman@techsingularity.net: remove redundant check]
Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

35d915be

mm, compaction: be selective about what pageblocks to clear skip hints · 1edbee61

由 Mel Gorman 提交于 3月 05, 2019

to #26255339

commit e332f741a8dd1ec9a6dc8aa997296ecbfe64323e upstream

Pageblock hints are cleared when compaction restarts or kswapd makes
enough progress that it can sleep but it's over-eager in that the bit is
cleared for migration sources with no LRU pages and migration targets
with no free pages.  As pageblock skip hint flushes are relatively rare
and out-of-band with respect to kswapd, this patch makes a few more
expensive checks to see if it's appropriate to even clear the bit.
Every pageblock that is not cleared will avoid 512 pages being scanned
unnecessarily on x86-64.

The impact is variable with different workloads showing small
differences in latency, success rates and scan rates.  This is expected
as clearing the hints is not that common but doing a small amount of
work out-of-band to avoid a large amount of work in-band later is
generally a good thing.

Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Signed-off-by: NQian Cai <cai@lca.pw>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
[cai@lca.pw: no stuck in __reset_isolation_pfn()]
  Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pwSigned-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

1edbee61

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功