Commits · a816f8314e763b606d273b6516912b94b6bffcf8 · openanolis / cloud-kernel

31 Dec, 2019 1 commit

alinux: hotfix: Add Cloud Kernel hotfix enhancement · a816f831

Committed by Xunlei Pang on Jul 18, 2019

Reserve some fields in advance in core structures that are prone to change,
so that a hotfix which has to add extra fields does not break the existing
layout. This increases the hotfix success rate, and even makes it possible
to hot-add features.

The reserved fields (usually at the tail) are normally not accessed at all,
so they should not affect cache behavior.
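
The idea of reserving tail fields can be sketched in plain C. This is only an illustration under stated assumptions: `struct demo_core`, its fields, and `layouts_match()` are hypothetical names and do not come from the actual patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical core structure as shipped, with two reserved tail slots. */
struct demo_core {
	int      id;
	uint64_t flags;
	uint64_t reserved1;	/* reserved for future hotfix use */
	uint64_t reserved2;	/* reserved for future hotfix use */
};

/* The same structure after a hotfix claims the first reserved slot. */
struct demo_core_patched {
	int      id;
	uint64_t flags;
	uint64_t new_counter;	/* takes over the slot of reserved1 */
	uint64_t reserved2;
};

/* The size and the offsets of existing fields must not change. */
_Static_assert(sizeof(struct demo_core) == sizeof(struct demo_core_patched),
	       "hotfix changed the structure size");
_Static_assert(offsetof(struct demo_core, flags) ==
	       offsetof(struct demo_core_patched, flags),
	       "hotfix moved an existing field");

/* Returns 1 when the new field sits exactly in the reserved slot. */
int layouts_match(void)
{
	return offsetof(struct demo_core, reserved1) ==
	       offsetof(struct demo_core_patched, new_counter);
}
```

Because the size and the offsets of the existing fields are unchanged, code compiled against the old layout keeps working when the patched code starts using the claimed slot.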

The following structures are currently covered:
    MM:
    struct zone
    struct pglist_data
    struct mm_struct
    struct vm_area_struct
    struct mem_cgroup
    struct writeback_control

    Block:
    struct gendisk
    struct backing_dev_info
    struct bio
    struct queue_limits
    struct request_queue
    struct blkcg
    struct blkcg_policy
    struct blk_mq_hw_ctx
    struct blk_mq_tag_set
    struct blk_mq_queue_data
    struct blk_mq_ops
    struct elevator_mq_ops
    struct inode
    struct dentry
    struct address_space
    struct block_device
    struct hd_struct
    struct bio_set

    Network:
    struct sk_buff
    struct sock
    struct net_device_ops
    struct xt_target
    struct dst_entry
    struct dst_ops
    struct fib_rule

    Scheduler:
    struct task_struct
    struct cfs_rq
    struct rq
    struct sched_statistics
    struct sched_entity
    struct signal_struct
    struct task_group
    struct cpuacct

    cgroup:
    struct cgroup_root
    struct cgroup_subsys_state
    struct cgroup_subsys
    struct css_set

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
[ caspar: use SPDX-License-Identifier ]
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>

27 Dec, 2019 7 commits

x86/unwind/orc: Remove boot-time ORC unwind tables sorting · ec642e37

Committed by Shile Zhang on Dec 04, 2019

commit f14bf6a350dfd6613dbf91be5b423bc7eab690da upstream.

Now that the orc_unwind and orc_unwind_ip tables are sorted at build time,
remove the boot time sorting pass.

No change in functionality.

[ mingo: Rewrote the changelog and code comments. ]

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204004633.88660-8-shile.zhang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

scripts/sorttable: Implement build-time ORC unwind table sorting · 4aa2b0bf

Committed by Shile Zhang on Dec 04, 2019

commit 57fa1899428538e314a7e0d52a5b617af082389a upstream.

The ORC unwinder has two tables: .orc_unwind_ip and .orc_unwind, which
need to be sorted for binary search. Previously this sorting was done
during bootup.

Sort them at build time to speed up booting.

Add the ORC tables sorting in a parallel build process to speed up the build.
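
The principle of sorting two parallel tables once, so that lookups can use binary search, can be sketched as follows. This is not the sorttable implementation: the real tool works on ELF sections, while this sketch uses plain arrays, and `struct demo_entry`, `sort_tables()` and `lookup()` are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical payload that belongs to one instruction address. */
struct demo_entry {
	short sp_offset;
	short bp_offset;
};

/* Temporary pairing so that key and payload move together. */
struct demo_pair {
	unsigned long ip;
	struct demo_entry entry;
};

static int cmp_pair(const void *a, const void *b)
{
	const struct demo_pair *pa = a, *pb = b;

	return (pa->ip > pb->ip) - (pa->ip < pb->ip);
}

/* Sort both tables by ip, keeping ips[i] and entries[i] paired.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int sort_tables(unsigned long *ips, struct demo_entry *entries, size_t n)
{
	struct demo_pair *tmp;
	size_t i;

	if (n == 0)
		return 0;
	tmp = malloc(n * sizeof(*tmp));
	if (!tmp)
		return -1;
	for (i = 0; i < n; i++) {
		tmp[i].ip = ips[i];
		tmp[i].entry = entries[i];
	}
	qsort(tmp, n, sizeof(*tmp), cmp_pair);
	for (i = 0; i < n; i++) {
		ips[i] = tmp[i].ip;
		entries[i] = tmp[i].entry;
	}
	free(tmp);
	return 0;
}

/* Binary search on the sorted table: index of the last entry whose
 * ip is <= target, or -1 if every entry is above target. */
long lookup(const unsigned long *ips, size_t n, unsigned long target)
{
	size_t lo = 0, hi = n;	/* search window is [lo, hi) */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ips[mid] <= target)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (long)lo - 1;
}
```

Doing the sort once at build time means that every boot only pays for the lookups.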

[ mingo: Rewrote the changelog and fixed some comments. ]

Suggested-by: Andy Lutomirski <luto@amacapital.net>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204004633.88660-7-shile.zhang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

scripts/sorttable: Rename 'sortextable' to 'sorttable' · b0b806ee

Committed by Shile Zhang on Dec 04, 2019

commit 1091670637be8bd34a39dd1ddcc0a10a7c88d4e2 upstream.

Use a more generic name for additional table sorting use cases,
such as the upcoming ORC table sorting feature. This tool is
no longer tied to exception table sorting.

No functional changes intended.

[ mingo: Rewrote the changelog. ]

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204004633.88660-6-shile.zhang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

scripts/sortextable: Refactor the do_func() function · 007f52e4

Committed by Shile Zhang on Dec 04, 2019

commit 57cafdf2a04e161b9654c4ae3888a7549594c499 upstream.

Refine the loop, naming and code structure to make the code more readable
and extensible. No functional changes intended.

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204004633.88660-5-shile.zhang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

scripts/sortextable: Remove dead code · 31e4898f

Committed by Shile Zhang on Dec 04, 2019

commit abe4f92ca8948a3e04c56788354933c326909acb upstream.

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204004633.88660-4-shile.zhang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

scripts/sortextable: Clean up the code to meet the kernel coding style better · d8a6d037

Committed by Shile Zhang on Dec 04, 2019

commit 6402e1416255a7bb94834925ba0255c750f54a2d upstream.

Fix various style errors and inconsistencies. No functional changes
intended.

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204004633.88660-3-shile.zhang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

scripts/sortextable: Rewrite error/success handling · 85d40b81

Committed by Shile Zhang on Dec 04, 2019

commit 3c47b787b6516d2c3cbaa193fe13a83adbaaad1f upstream.

The scripts/sortextable.c code originally copied some code from
scripts/recordmcount.c, which used the same setjmp/longjmp method to
manage control flow.

Meanwhile recordmcount has improved its error handling via:

   3f1df12019f3 ("recordmcount: Rewrite error/success handling").

So rewrite this part of sortextable as well to get rid of the setjmp/longjmp
kludges, with additional refactoring, to make it more readable and
easier to extend.
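
A minimal sketch of the general pattern, assuming nothing about the actual sortextable code: each step reports failure through its return value and the caller propagates it, so no setjmp/longjmp is needed. The function names are hypothetical; only the ELF magic bytes and the EI_CLASS values are standard.

```c
#include <stddef.h>

/* Each step returns 0 on success and -1 on error. */
static int check_magic(const unsigned char *buf, size_t len)
{
	if (len < 4)
		return -1;
	if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
		return -1;
	return 0;
}

static int check_class(const unsigned char *buf, size_t len)
{
	if (len < 5)
		return -1;
	/* Byte 4 (EI_CLASS): 1 means 32-bit, 2 means 64-bit. */
	if (buf[4] != 1 && buf[4] != 2)
		return -1;
	return 0;
}

/* The caller checks every step and returns the first error it sees. */
int process_image(const unsigned char *buf, size_t len)
{
	int rc;

	rc = check_magic(buf, len);
	if (rc)
		return rc;
	rc = check_class(buf, len);
	if (rc)
		return rc;
	return 0;
}
```

With explicit return codes the control flow stays local to each function, which makes it easier to add further steps later.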

No functional changes intended.

[ mingo: Rewrote the changelog. ]

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: linux-kbuild@vger.kernel.org
Link: https://lkml.kernel.org/r/20191204004633.88660-2-shile.zhang@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

26 Dec, 2019 1 commit

alios: introduce psi_v1 boot parameter · ad41765e

Committed by Joseph Qi on Dec 25, 2019

Instead of using the static Kconfig option CONFIG_PSI_CGROUP_V1, introduce
a boot parameter psi_v1 to enable PSI cgroup v1 support. It is disabled by
default, which means that passing only the psi=1 boot parameter enables
support for cgroup v2 alone.

This keeps PSI consistent with other cgroup v1 features such as cgroup
writeback v1 (cgwb_v1).

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>

25 Dec, 2019 1 commit

alios: psi: Support PSI under cgroup v1 · fcb45d2d

Committed by Xunlei Pang on Dec 23, 2019

Export "cpu|io|memory.pressure" under cgroup v1 "cpuacct" subsystem.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>

24 Dec, 2019 1 commit

perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER · 1d3d91ad

Committed by Kairui Song on Apr 23, 2019

commit d15d356887e770c5f2dcf963b52c7cb510c9e42d upstream.

Currently perf callchains don't work well with the ORC unwinder
when sampling from a trace point. We get a useless in-kernel callchain
like this:

perf  6429 [000]    22.498450:             kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
	7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
	5651468729c1 [unknown] (/usr/bin/perf)
	5651467ee82a main+0x69a (/usr/bin/perf)
	7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
    5541f689495641d7 [unknown] ([unknown])

The root cause is that trace point events don't provide a real
snapshot of the hardware registers. Instead, perf tries to get the
required caller's registers and compose a fake register snapshot
which is supposed to contain enough information to start unwinding.
However, without CONFIG_FRAME_POINTER, perf fails to get the caller's BP
as the frame pointer, so the current frame pointer is returned instead.
The result is an invalid register combination which confuses the unwinder
and ends the stacktrace early.

So in such a case just don't try to dump BP, and let the unwinder start
directly when the register is not a real snapshot. Use SP as the skip
mark: the unwinder will skip all the frames until it meets the frame of
the trace point caller.

Tested with frame pointer unwinder and ORC unwinder, this makes perf
callchain get the full kernel space stacktrace again like this:

perf  6503 [000]  1567.570191:             kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
	7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
	55a22960d9c1 [unknown] (/usr/bin/perf)
	55a22958982a main+0x69a (/usr/bin/perf)
	7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
    5541f689495641d7 [unknown] ([unknown])

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190422162652.15483-1-kasong@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

19 Dec, 2019 10 commits

x86/cpufeatures: Add WBNOINVD feature definition · 27a59c39

Committed by Janakarajan Natarajan on Dec 18, 2019

commit 08e823c2c5899ef2de3aa1727233f1f19e8c1cc1 upstream.

Add a new cpufeature definition for the WBNOINVD instruction.

The WBNOINVD instruction writes all modified cache lines in all levels of
the cache associated with a processor to main memory while retaining the
cached values.

Both AMD and Intel support this instruction.

Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: David Woodhouse <dwmw@amazon.co.uk>
CC: Fenghua Yu <fenghua.yu@intel.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: Rudolf Marek <r.marek@assembler.cz>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1541624211-32196-1-git-send-email-Janakarajan.Natarajan@amd.com
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ACPI / APEI: Fix parsing HEST that includes Deferred Machine Check subtable · b63fcb11

Committed by Yazen Ghannam on Dec 18, 2019

commit f3355298fc5a24eb7606448bc02a08b3485e5979 upstream.

ACPI 6.2 includes a new definition for a Deferred Machine Check "DMC"
subtable.

The definition of this subtable was included in the following commit:

c042933d (ACPICA: Add support for new HEST subtable)

However, the HEST parsing function was not updated to include this new
subtable. Therefore, Linux will fail to parse the HEST on systems that
include a DMC entry.

Add the length check for the new DMC subtable so that HEST parsing
doesn't fail on systems that include it.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ACPI / processor: Set P_LVL{2,3} idle state descriptions · 65ac216a

Committed by Yazen Ghannam on Dec 18, 2019

commit 34a62cd0df89dd7034165048c0921d1314191b66 upstream.

The ACPI idle driver will fall back to using the legacy P_LVL* SystemIO
method of entering C-states if the _CST method is disabled and P_BLK is
defined. However, in this case the C2 and C3 states won't have a
description set, so the user will see "<null>" when reading the
description from sysfs.

Give each of these states a description.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

perf vendor events amd: perf PMU events for AMD Family 17h · 18fd4056

Committed by Martin Liška on Dec 18, 2019

commit 98c07a8f74f85a19aeee2016f5afa0c667fa694d upstream.

This patch adds PMC events for AMD Family 17h CPUs as defined in [1].  It
covers the events described in section 2.1.13. The regex pattern in
mapfile.csv covers all CPUs of the family.

[1] https://support.amd.com/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf

Signed-off-by: Martin Liška <mliska@suse.cz>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jon Grimm <jon.grimm@amd.com>
Cc: Martin Jambor <mjambor@suse.cz>
Cc: William Cohen <wcohen@redhat.com>
Link: https://lkml.kernel.org/r/d65873ca-e402-b198-4fe9-8c4af81258c8@suse.cz
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) · fdd99a14

Committed by Singh, Brijesh on Dec 18, 2019

commit 05d5a48635259e621ea26d01e8316c6feeb34190 upstream.

Errata#1096:

On a nested data page fault when CR4.SMAP=1 and the guest data read
generates a SMAP violation, the GuestInstrBytes field of the VMCB on a
VMEXIT will incorrectly return 0h instead of the correct guest
instruction bytes.

Recommended Workaround:

To determine what instruction the guest was executing the hypervisor
will have to decode the instruction at the instruction pointer.

The recommended workaround cannot be implemented for an SEV
guest because guest memory is encrypted with the guest specific key,
and the instruction decoder will not be able to decode the instruction
bytes. If we hit this erratum in an SEV guest then log a message
and request a guest shutdown.

Reported-by: Venkatesh Srinivas <venkateshs@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

svm/avic: Fix invalidate logical APIC id entry · 85d84a38

Committed by Suthikulpanit, Suravee on Dec 18, 2019

commit e44e3eacccfd2294a1ce279f68452b1635d7fa82 upstream.

Only clear the valid bit when invalidating a logical APIC ID entry.
The current logic clears the valid bit, but also sets the rest of
the bits (including reserved bits) to 1.
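
The difference between the two behaviours can be sketched with plain bit operations. The position of the valid bit (bit 31) and all names below are assumptions made for illustration; the buggy variant shows one way such a bug can arise and is not quoted from the kernel source.

```c
#include <stdint.h>

/* Assumed position of the valid bit in a 32-bit table entry. */
#define DEMO_ENTRY_VALID (1u << 31)

/* Buggy pattern: the whole entry is overwritten with the inverted mask,
 * so the valid bit is cleared but every other bit becomes 1. */
uint32_t invalidate_buggy(uint32_t entry)
{
	(void)entry;
	return ~DEMO_ENTRY_VALID;
}

/* Intended pattern: AND with the inverted mask clears only the valid
 * bit and leaves all other bits untouched. */
uint32_t invalidate_fixed(uint32_t entry)
{
	return entry & ~DEMO_ENTRY_VALID;
}
```

For an entry such as 0x80000005, the first function yields 0x7fffffff while the second yields 0x00000005.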

Fixes: 98d90582be2e ('svm: Fix AVIC DFR and LDR handling')
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

svm: Fix improper check when deactivate AVIC · bf789ee2

Committed by Suthikulpanit, Suravee on Dec 18, 2019

commit c57cd3c89ecf2812976f53e494580a396f93efd2 upstream.

The function svm_refresh_apicv_exec_ctrl() always returns prematurely
because kvm_vcpu_apicv_active() always returns false when called from
the function arch/x86/kvm/x86.c:kvm_vcpu_deactivate_apicv().
This is because apicv_active is set to false just before calling
refresh_apicv_exec_ctrl().

Also, we need to mark the VMCB_AVIC bit as dirty instead of VMCB_INTR.

So, fix svm_refresh_apicv_exec_ctrl() to properly deactivate AVIC.

Fixes: 67034bb9 ('KVM: SVM: Add irqchip_split() checks before enabling AVIC')
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

svm: Fix AVIC DFR and LDR handling · 9f9f0a44

Committed by Suthikulpanit, Suravee on Dec 18, 2019

commit 98d90582be2e08246a70af31e09950ecb8876252 upstream.

Current SVM AVIC driver makes two incorrect assumptions:
  1. APIC LDR register cannot be zero
  2. APIC DFR for all vCPUs must be the same

LDR=0 means the local APIC does not support logical destination mode.
Therefore, the driver should mark any previously assigned logical APIC ID
table entry as invalid, and return success.  Also, DFR is specific to
a particular local APIC, and can be different among all vCPUs
(as observed on Windows 10).

These incorrect assumptions cause Windows 10 and FreeBSD VMs to fail
to boot with AVIC enabled. So, instead of flushing the whole logical APIC ID
table, handle DFR and LDR for each vCPU independently.

Fixes: 18f40c53 ('svm: Add VMEXIT handlers for AVIC')
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

svm: Add warning message for AVIC IPI invalid target · aebb60fc

Committed by Suravee Suthikulpanit on Dec 18, 2019

commit 37ef0c4414c9743ba7f1af4392f0a27a99649f2a upstream.

Print warning message when IPI target ID is invalid due to one of
the following reasons:
  * In logical mode: cluster > max_cluster (64)
  * In physical mode: target > max_physical (512)
  * Address is not present in the physical or logical ID tables

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

KVM: nSVM: Fix nested guest support for PAUSE filtering. · a5bbbd0d

Committed by Tambe, William on Dec 18, 2019

commit e081354d6aa7b67c6d0ef51ff8c428b6c261a6fe upstream.

Currently, the nested guest's PAUSE intercept intentions are not being
honored.  Instead, since the L0 hypervisor's pause_filter_count and
pause_filter_thresh values are still in place, these values are used
instead of those programmed in the VMCB by the L1 hypervisor.

To honor the desired PAUSE intercept support of the L1 hypervisor, the L0
hypervisor must use the PAUSE filtering fields of the L1 hypervisor. This
requires saving and restoring of both the L0 and L1 hypervisor's PAUSE
filtering fields.

Signed-off-by: William Tambe <william.tambe@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

09 Dec, 2019 3 commits

alios: jbd2: fix build warning on i386 · 8d34c9ab

Committed by Joseph Qi on Dec 09, 2019

Fix the following build warning on arch i386:

ld: fs/jbd2/journal.o: in function `jbd2_seq_stats_show':
journal.c:(.text+0x137d): undefined reference to `__udivdi3'
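
For background: on 32-bit targets the compiler may turn a 64-bit `/` into a call to the libgcc helper `__udivdi3`, which the kernel does not link against. Kernel code usually goes through helpers such as `div_u64()` or `do_div()` instead; the log above does not say which approach this fix takes. The sketch below is a userspace illustration of dividing a 64-bit value by a 32-bit value with shifts and subtractions only, and `demo_div_u64()` is a hypothetical name.

```c
#include <stdint.h>

/* Divide a 64-bit dividend by a non-zero 32-bit divisor using binary
 * long division, so that no 64-bit '/' operator appears in the code. */
uint64_t demo_div_u64(uint64_t dividend, uint32_t divisor)
{
	uint64_t quotient = 0;
	uint64_t rem = 0;
	int i;

	for (i = 63; i >= 0; i--) {
		/* Bring down the next bit of the dividend. */
		rem = (rem << 1) | ((dividend >> i) & 1);
		if (rem >= divisor) {
			rem -= divisor;
			quotient |= (uint64_t)1 << i;
		}
	}
	return quotient;
}
```

The remainder always stays below the divisor before each shift, so `rem` cannot overflow 64 bits.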

Fixes: 3550da0c ("alios: jbd2: add new "stats" proc file")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

alios: jbd2/doc: fix new kernel-doc warning · eefa339e

Committed by Joseph Qi on Dec 09, 2019

Fix the following kernel-doc warning:

include/linux/jbd2.h:1184: warning: Function parameter or member 'j_checkpoint_task' not described in 'journal_s'

Fixes: c31b17e5 ("alios: jbd2: create jbd2-ckpt thread for journal checkpoint")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

alios: mm/thp: remove unused variable 'pgdata' in split_huge_page_to_list() · 8978f14a

Committed by Joseph Qi on Dec 05, 2019

This fixes the following build warning:
mm/huge_memory.c: In function ‘split_huge_page_to_list’:
mm/huge_memory.c:2656:22: warning: unused variable ‘pgdata’ [-Wunused-variable]
  struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
                      ^

Fixes: 6c52af5ee5c5 ("mm: thp: extract split_queue_* into a struct")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

08 Dec, 2019 12 commits

mm: thp: move deferred split queue to memcg's nodeinfo · 1d1b4c6c

Committed by Yang Shi on Oct 22, 2019

Commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe ("mm: thp: make
deferred split shrinker memcg aware") makes the deferred split queue per
memcg to resolve the memcg premature OOM problem.  But as a result all
nodes end up sharing the same queue, instead of having one queue per node
as before the commit. This is not a big deal for memcg limit reclaim, but
it may cause global kswapd to shrink THPs from a different node.

Also, 0-day testing reported a -19.6% regression in stress-ng's madvise
test [1].  I didn't see that much regression on my test box (24 threads,
48GB memory, 2 nodes). With the same test (stress-ng --timeout 1
--metrics-brief --sequential 72  --class vm --exclude spawn,exec), I
sometimes saw an average regression of -3%, and sometimes none at all.
The average is taken over 10 runs, since the test itself may vary by up
to 15% on my box.

This might be caused by deferred split queue lock contention.  With some
configurations (e.g. just one root memcg) the lock contention may be worse
than before (given 2 nodes, two locks are reduced to one lock).

So, move the deferred split queue to memcg's nodeinfo to make it NUMA
aware again.

With this change stress-ng's madvise test sometimes shows an average 4%
improvement, and I didn't see degradation anymore.

[1]: https://lore.kernel.org/lkml/20190930084604.GC17687@shao2-debian/

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

mm: thp: make deferred split shrinker memcg aware · ace35514

Committed by Yang Shi on Oct 22, 2019

commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe upstream

Currently the THP deferred split shrinker is not memcg aware, which may
cause premature OOM with some configurations.  For example the test below
runs into premature OOM easily:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from the kernel selftests.

It is easy to hit OOM while there are still a lot of THPs on the deferred
split queue; memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per-memcg
deferred split queue.  A THP is on the per-memcg deferred split queue if
it belongs to a memcg, and on the per-node queue otherwise.  When the page
is migrated to another memcg, it is moved to the target memcg's deferred
split queue too.

Reuse the second tail page's deferred_list for the per-memcg list since the
same THP can't be on multiple deferred split queues.

[yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
  Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

mm: shrinker: make shrinker not depend on memcg kmem · b382ffa5

Committed by Yang Shi on Oct 22, 2019

commit 0a432dcbeb32edcd211a5d8f7847d0da7642a8b4 upstream

Currently shrinkers are only allocated, and can only work, when memcg kmem
is enabled.  But the THP deferred split shrinker is not a slab shrinker, so
it doesn't make much sense to have it depend on memcg kmem.
It should be able to reclaim THPs even when memcg kmem is disabled.

Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinkers.
When memcg kmem is disabled, only such shrinkers are called when
shrinking memcg slab.

[yang.shi@linux.alibaba.com: add comment]
  Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

mm: move mem_cgroup_uncharge out of __page_cache_release() · 79044939

Committed by Yang Shi on Oct 22, 2019

commit 7ae88534cdd96235cd775c03b32a75009355740b upstream

A later patch makes THP deferred split shrinker memcg aware, but it
needs page->mem_cgroup information in THP destructor, which is called after
mem_cgroup_uncharge() now.

So move mem_cgroup_uncharge() from __page_cache_release() to compound
page destructor, which is called by both THP and other compound pages except
HugeTLB.  And call it in __put_single_page() for single order page.

Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

mm: thp: extract split_queue_* into a struct · c9acf2bd

Committed by Yang Shi on Oct 22, 2019

commit 364c1eebe453f06f0c1e837eb155a5725c9cd272 upstream

Patch series "Make deferred split shrinker memcg aware", v6.

Currently the THP deferred split shrinker is not memcg aware, which may
cause premature OOM with some configurations.  For example the test below
runs into premature OOM easily:

$ cgcreate -g memory:thp
$ echo 4G > /sys/fs/cgroup/memory/thp/memory.limit_in_bytes
$ cgexec -g memory:thp transhuge-stress 4000

transhuge-stress comes from the kernel selftests.

It is easy to hit OOM while there are still a lot of THPs on the deferred
split queue; memcg direct reclaim can't touch them since the deferred split
shrinker is not memcg aware.

Make the deferred split shrinker memcg aware by introducing a per-memcg
deferred split queue.  A THP is on the per-memcg deferred split queue if
it belongs to a memcg, and on the per-node queue otherwise.  When the page
is migrated to another memcg, it is moved to the target memcg's deferred
split queue too.

Reuse the second tail page's deferred_list for the per-memcg list since the
same THP can't be on multiple deferred split queues.

Make the deferred split shrinker not depend on memcg kmem since it is not
a slab shrinker.  It doesn't make sense to skip shrinking THPs just because
memcg kmem is disabled.

With the above change the test demonstrated above doesn't trigger OOM
even with cgroup.memory=nokmem.

This patch (of 4):

Put split_queue, split_queue_lock and split_queue_len into a struct in
order to reduce code duplication when we convert deferred_split to memcg
aware in the later patches.
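
A userspace sketch of why grouping the three fields helps: once the queue, its lock and its length live in one struct, the same helpers can serve any object that embeds it. All names below are hypothetical, the list type is a minimal stand-in for the kernel's `struct list_head`, and locking is not modeled.

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Queue, lock and length grouped together (field names are assumptions). */
struct demo_deferred_split {
	int lock;	/* placeholder only; real code would use a spinlock */
	struct demo_list_head queue;
	unsigned long len;
};

/* Two hypothetical owners that embed the same struct. */
struct demo_node  { int nid; struct demo_deferred_split split; };
struct demo_memcg { int id;  struct demo_deferred_split split; };

void demo_queue_init(struct demo_deferred_split *ds)
{
	ds->lock = 0;
	ds->queue.next = ds->queue.prev = &ds->queue;
	ds->len = 0;
}

/* Append item at the tail of the queue and bump the length. */
void demo_queue_add(struct demo_deferred_split *ds, struct demo_list_head *item)
{
	item->prev = ds->queue.prev;
	item->next = &ds->queue;
	ds->queue.prev->next = item;
	ds->queue.prev = item;
	ds->len++;
}

/* Unlink item from the queue and drop the length. */
void demo_queue_del(struct demo_deferred_split *ds, struct demo_list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	item->next = item->prev = item;
	ds->len--;
}
```

The helpers take a pointer to the grouped struct, so they do not need to know whether it belongs to a node or to a memcg.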

Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

c9acf2bd

alios: mm: Support kidled · fd952d8c

Committed by Gavin Shan on Aug 30, 2019

This enables scanning pages at a fixed interval to determine their access
frequency (hot/cold). The result is exported to user land on a per memory
cgroup basis through "memory.idle_page_stats". The design is highlighted
below:

   * A kernel thread is spawned when this feature is enabled by writing a
     non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
     The thread sequentially scans the nodes and their pages that have
     been chained up in the LRU lists.

   * For each page, its corresponding age information is stored in the
     page flags or in an array in the node. The age represents the number
     of scanning intervals in which the page hasn't been accessed. Also,
     the page flag (PG_idle) is leveraged. The page's age is increased by
     one if the idle flag isn't cleared between two consecutive scans.
     Otherwise, the page's age is cleared. Also, the page's age information
     is cleared when the page is freed so that stale age information won't
     be fetched when it's allocated again.

   * Initially, the flag is set, while the access bit in its PTE is cleared
     by the thread. In the next scanning period, its PTE access bit is
     synchronized with the page flag: the flag is cleared if the access bit
     is set, and kept otherwise. For unmapped pages, the flag is cleared
     when the page is accessed.

   * Eventually, the page's aging information is accumulated into the
     unstable bucket of its corresponding memory cgroup as statistics. The
     unstable bucket (statistics) is copied to the stable bucket once all
     pages in all nodes have been scanned. The stable bucket (statistics)
     is exported to user land through "memory.idle_page_stats".
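
The aging rule in the list above can be modelled in a few lines of Python.
This is only a toy sketch of the description, not the kernel code: the
`Page` class, the `scan()` helper and the `accessed` argument are
hypothetical names introduced here, and the real implementation works on
page flags and PTE access bits rather than booleans.

```python
from dataclasses import dataclass


@dataclass
class Page:
    idle: bool = False  # stands in for the PG_idle page flag
    age: int = 0        # scan periods without an observed access


def scan(page: Page, accessed: bool) -> int:
    """Run one scan period over `page` and return its updated age.

    `accessed` says whether the page was touched since the previous scan
    (hypothetical input; the kernel derives this from the PTE access bit
    or from a direct access to an unmapped page).
    """
    if accessed:
        page.idle = False   # an access clears the idle flag
    if page.idle:
        page.age += 1       # flag survived since the last scan: one period older
    else:
        page.age = 0        # flag was cleared (or never set): reset the age
    page.idle = True        # re-arm the flag for the next period
    return page.age


page = Page()
history = [scan(page, accessed) for accessed in (False, False, False, True, False)]
print(history)  # [0, 1, 2, 0, 1]
```

The age only grows while the flag survives from one scan to the next, and
a single access in between resets it to zero.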

TESTING
=======

   * cgroup1, unmapped pagecache

     # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
     #
     # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
     # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
     # mkdir -p /cgroup/memory
     # mount -t cgroup -o memory none /cgroup/memory
     # echo 1 > /cgroup/memory/memory.use_hierarchy
     # mkdir -p /cgroup/memory/test
     # echo 1 > /cgroup/memory/test/memory.use_hierarchy
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # dd if=/ext4/test.data of=/dev/null bs=1M count=128
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0

   * cgroup1, mapped pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and access the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0

   * cgroup1, mapped and locked pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and mlock the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0

   * cgroup1, anonymous and locked area

     # < create memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap anonymous area and mlock it >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0

   * Rerunning the above test cases in cgroup2 shows no unexpected results.
     However, the cgroups are populated in a different way, as below:

     # mkdir -p /cgroup
     # mount -tcgroup2 none /cgroup
     # echo "+memory" > /cgroup/cgroup.subtree_control
     # mkdir -p /cgroup/test

Signed-off-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

fd952d8c

alios: mm: memcontrol: make distance between wmark_low and wmark_high configurable · 33ef4784

Committed by Yang Shi on Aug 17, 2019

Introduce a new interface, wmark_scale_factor, which defines the
distance between wmark_high and wmark_low.  The unit is in fractions of
10,000. The default value of 50 means the distance between wmark_high
and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
is 1000, or 10% of the max limit.

The distance between wmark_low and wmark_high has an impact on how hard
memcg kswapd reclaims.
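
As a rough illustration of the arithmetic described above, the sketch
below computes the distance for a given limit. The function name
`wmark_distance` and the constants are hypothetical, the lower bound of 1
is an assumption (the text only states the maximum), and integer floor
division is used as a guess at the rounding.

```python
WMARK_SCALE_UNIT = 10_000  # wmark_scale_factor is in fractions of 10,000
WMARK_SCALE_MAX = 1_000    # maximum value named in the commit message (10%)


def wmark_distance(max_limit: int, scale_factor: int = 50) -> int:
    """Distance between wmark_high and wmark_low, in the unit of max_limit."""
    # Assumption: the lower bound of 1 is not stated in the commit message.
    if not 1 <= scale_factor <= WMARK_SCALE_MAX:
        raise ValueError("scale_factor must be between 1 and 1000")
    return max_limit * scale_factor // WMARK_SCALE_UNIT


GIB = 1 << 30
print(wmark_distance(4 * GIB))        # default factor 50: 0.5% of 4 GiB
print(wmark_distance(4 * GIB, 1000))  # maximum factor 1000: 10% of 4 GiB
```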

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

33ef4784

alios: mm: vmscan: make memcg kswapd set memcg state to dirty or writeback · e10c247b

Committed by Yang Shi on Aug 02, 2019

The global kswapd can set a memory node to the dirty or writeback state
if the current scan finds that all pages are unqueued dirty or under
writeback. kswapd then writes out dirty pages or waits for writeback to
finish. The memcg kswapd behaves like the global kswapd, so it should set
the dirty or writeback state on the memcg too if the same condition is met.

Since direct reclaim can't write out page cache, the system depends on
kswapd to write out dirty pages when a scan finds too many of them, in
order to avoid premature OOM. But if page cache is dirtied too fast,
writing out pages can't keep up with dirtying them. It is the
responsibility of dirty page balancing to throttle the dirtying of pages.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

e10c247b

alios: mm: memcontrol: treat memcg wmark reclaim work as kswapd · f7c87fa3

Committed by Yang Shi on Aug 02, 2019

Since background water mark reclaim is scheduled by a workqueue, it can
do more work than direct reclaim, e.g. write out dirty pages.

So add the PF_KSWAPD flag so that current_is_kswapd() returns true for
memcg background reclaim.  The condition "current_is_kswapd() &&
!global_reclaim(sc)" is good enough to tell whether current is global
kswapd or memcg background reclaim.

Also, kswapd is not allowed to break memory.low protection for now, so
memcg kswapd should not break it either.
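
The condition quoted above can be read as a small truth table. The sketch
below is a plain Python model of that reading, not kernel code: the
function `reclaim_context` and its string labels are hypothetical, and
the two booleans stand in for the results of current_is_kswapd() and
global_reclaim(sc).

```python
def reclaim_context(current_is_kswapd: bool, global_reclaim: bool) -> str:
    """Classify the reclaim context from the two predicates in the text."""
    if current_is_kswapd and not global_reclaim:
        # PF_KSWAPD is set but the reclaim targets a memcg.
        return "memcg background reclaim"
    if current_is_kswapd:
        return "global kswapd"
    # PF_KSWAPD is not set: the task reclaims on its own behalf.
    return "direct reclaim"


for is_kswapd in (True, False):
    for is_global in (True, False):
        print(is_kswapd, is_global, reclaim_context(is_kswapd, is_global))
```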

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

f7c87fa3

alios: mm: memcontrol: add background reclaim support for cgroupv2 · 256b5d94

Committed by Yang Shi on Aug 14, 2019

Like v1, add background reclaim support for cgroup v2. The interfaces
are exactly the same as in v1. However, if a high limit is set for v2,
the water mark is calculated from the high limit instead of the max limit.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

256b5d94

alios: mm: memcontrol: support background async page reclaim · 6b2ef082

Committed by Yang Shi on Aug 14, 2019

Currently, when memory usage exceeds the memory cgroup limit, the memory
cgroup can only do sync direct reclaim.  This may incur unexpected stalls
in applications which are sensitive to latency.  Introduce a background
async page reclaim mechanism, like what kswapd does.

Define memcg memory usage water marks by introducing the wmark_ratio
interface, which ranges from 0 to 100 and represents a percentage of the
max limit.  wmark_high is calculated as (max * wmark_ratio / 100), and
wmark_low is (wmark_high - (wmark_high >> 8)), which is an empirical
value.  If wmark_ratio is 0, the water mark is disabled and both wmark_low
and wmark_high are max; this is the default.

If wmark_ratio is set, then when charging a page, if usage is greater
than wmark_high, which means the available memory of the memcg is low, a
work item is scheduled to do background page reclaim until memory usage
is reduced to wmark_low if possible.
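
The water mark formulas above can be checked with a short Python sketch.
It only mirrors the arithmetic in this message: `memcg_watermarks` and
`needs_background_reclaim` are hypothetical helper names, the limit is
treated as a plain integer (for example a page count), and the shift is
parenthesized so that it applies to wmark_high alone.

```python
from typing import Tuple


def memcg_watermarks(max_limit: int, wmark_ratio: int) -> Tuple[int, int]:
    """Return (wmark_high, wmark_low) for a memcg with the given max limit."""
    if not 0 <= wmark_ratio <= 100:
        raise ValueError("wmark_ratio must be between 0 and 100")
    if wmark_ratio == 0:
        # Water mark disabled: both marks sit at the max limit.
        return max_limit, max_limit
    wmark_high = max_limit * wmark_ratio // 100
    wmark_low = wmark_high - (wmark_high >> 8)  # empirical gap of 1/256
    return wmark_high, wmark_low


def needs_background_reclaim(usage: int, wmark_high: int) -> bool:
    """Background reclaim is scheduled once usage rises above wmark_high."""
    return usage > wmark_high


high, low = memcg_watermarks(1 << 20, 50)  # 1 Mi pages, ratio 50
print(high, low)                           # 524288 522240
print(needs_background_reclaim(600_000, high))
```

With a ratio of 50 on a limit of 1,048,576 pages, reclaim would start
above 524,288 pages and aim to bring usage back down to 522,240 pages.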

Define a dedicated unbound workqueue for scheduling water mark reclaim
works.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

6b2ef082

alios: mm: vmscan: make it sane reclaim if cgwb_v1 is enabled · 76e0403d

Committed by Yang Shi on Aug 02, 2019

AliOS Cloud Kernel has cgroup writeback support for v1, so the reclaim could be
treated as sane reclaim if cgwb_v1 is enabled.

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>

76e0403d

Dec 05, 2019 · 3 commits

iocost: rename weight to cost.weight to avoid conflict with cfq · 78e38d28

Committed by Jiufei Xue on Dec 05, 2019

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

78e38d28

ovl: implement async IO routines · b7706b3d

Committed by Jiufei Xue on Nov 14, 2019

A performance regression has been observed since Linux v4.19 when running
an aio test using fio with iodepth 128 on overlayfs: the queue depth of
the device is always 1, which is unexpected.

After investigation, it was found that commit 16914e6f
("ovl: add ovl_read_iter()") and commit 2a92e07e
("ovl: add ovl_write_iter()") use do_iter_readv_writev() to submit
requests to the real filesystem. Async IOs are converted to sync IOs
there, which causes the performance regression.

So implement async IO for stacked reading and writing.

Changes since v1:
  - add a cleanup helper for completion/error handling
  - handle the case when aio_req allocation failed

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

b7706b3d

vfs: add vfs_iocb_iter_[read|write] helper functions · 7ff6623e

Committed by Jiufei Xue on Nov 14, 2019

This doesn't cause any behavior change and will be used by the overlay
async IO implementation.

Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

7ff6623e

Nov 29, 2019 · 1 commit

alios: mm, memcg: fix possible soft lockup in try_charge · 1f6142a0

Committed by Xu Yu on Nov 26, 2019

When events such as direct reclaim and OOM occur intensively, a soft
lockup is very likely to happen on instances with 1 vCPU and kernel
preemption disabled.

An example soft lockup is as follows.

[  160.555984] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [malloc:2188]
[  160.557975] Modules linked in: button
[  160.559495] CPU: 0 PID: 2188 Comm: malloc Not tainted 4.19.57-15.457.al7.x86_64 #1
[  160.561546] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
[  160.563707] RIP: 0010:shrink_node+0x1ae/0x450
[  160.565391] Code: 00 00 00 49 8b 4f 20 ba 01 00 00 00 4c 8b 74 24 10 4d 8b 47 28 49 8b 77 10 48 2b 4c 24 08 41 8b 7f 1c 4d8
[  160.570747] RSP: 0000:ffff9d0ec07a3b58 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
[  160.572889] RAX: ffff982ab6014330 RBX: ffff982ab6014000 RCX: 0000000000000000
[  160.574992] RDX: 0000000000000001 RSI: ffff982ab6014000 RDI: ffff982ab6014000
[  160.577106] RBP: ffff982afffb6000 R08: 0000000000000000 R09: ffff982ab6014000
[  160.579219] R10: 0000000000000004 R11: 0000000000aaaaaa R12: 0000000000000000
[  160.581326] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9d0ec07a3c50
[  160.583450] FS:  00007f8b414f7740(0000) GS:ffff982afda00000(0000) knlGS:0000000000000000
[  160.585704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  160.587662] CR2: 00007f8adb800010 CR3: 000000007ac9e001 CR4: 00000000003606b0
[  160.589835] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  160.591971] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  160.594133] Call Trace:
[  160.595602]  do_try_to_free_pages+0xcc/0x390
[  160.597356]  try_to_free_mem_cgroup_pages+0xf9/0x1d0
[  160.599198]  ? out_of_memory+0xb5/0x4a0
[  160.600882]  try_charge+0x244/0x750
[  160.602522]  ? __pagevec_lru_add_fn+0x1d0/0x330
[  160.604310]  mem_cgroup_try_charge+0xb4/0x1d0
[  160.606085]  mem_cgroup_try_charge_delay+0x1c/0x40
[  160.607892]  do_anonymous_page+0xf7/0x540
[  160.609574]  __handle_mm_fault+0x665/0xa00
[  160.611233]  ? __switch_to_asm+0x35/0x70
[  160.612838]  handle_mm_fault+0x122/0x1e0
[  160.614407]  __do_page_fault+0x1b7/0x470
[  160.615962]  do_page_fault+0x32/0x140
[  160.617474]  ? async_page_fault+0x8/0x30
[  160.619012]  async_page_fault+0x1e/0x30
[  160.620526] RIP: 0033:0x40068e

Fix it by adding cond_resched() in try_charge(), just before goto retry
after OOM_SUCCESS, in order to let OOM free some memory first.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

1f6142a0

openanolis / cloud-kernel · synced successfully more than 1 year ago
