1. 27 Mar 2020, 1 commit
  2. 18 Mar 2020, 39 commits
    • X
      alinux: mm, memcg: account number of processes in the css · 2061acd6
      Xu Yu committed
      Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at
      mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() check only one
      thread from each thread group, cgroup_subsys_state::nr_tasks can record
      only the thread group leader, i.e., the process, instead of every
      thread. Accordingly, this renames cgroup_subsys_state::nr_tasks to
      cgroup_subsys_state::nr_procs.
      
      Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in the css and its descendants")
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      2061acd6
    • X
      alinux: mm, memcg: record latency of memcg wmark reclaim · 40969475
      Xu Yu committed
      The memcg background async page reclaim, a.k.a. memcg kswapd, is
      currently implemented with a dedicated unbound workqueue.
      
      However, memcg kswapd may run too frequently, resulting in high
      overhead, page cache thrashing, frequent dirty page writeback, etc., due
      to an improper memcg memory.wmark_ratio, unreasonable memcg memory
      capacity, or even abnormal memcg memory usage.
      
      We need to find out the problematic memcg(s) where memcg kswapd
      introduces significant overhead.
      
      This records the latency of each run of the memcg kswapd work, and then
      aggregates it into the per-memcg exstat.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      40969475
    • M
      KVM: vgic-v4: Track the number of VLPIs per vcpu · 27840020
      Marc Zyngier committed
      commit 5bd90b0989731520f2cdcfbbe467f1271f3cc803 upstream.
      
      In order to find out whether a vcpu is likely to be the target of
      VLPIs (and to further optimize the way we deal with those), let's
      track the number of VLPIs a vcpu can receive.
      
      This gets implemented with an atomic variable that gets incremented
      or decremented on map, unmap and move of a VLPI.
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Reviewed-by: NZenghui Yu <yuzenghui@huawei.com>
      Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com>
      Link: https://lore.kernel.org/r/20191107160412.30301-2-maz@kernel.org
      Signed-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      27840020
    • M
      KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put · 42993070
      Marc Zyngier committed
      commit 8e01d9a396e6db153d94a6004e6473d9ff251a6a upstream.
      
      When the VHE code was reworked, a lot of the vgic stuff was moved around,
      but the GICv4 residency code did stay untouched, meaning that we come
      in and out of residency on each flush/sync, which is obviously suboptimal.
      
      To address this, let's move things around a bit:
      
      - Residency entry (flush) moves to vcpu_load
      - Residency exit (sync) moves to vcpu_put
      - On blocking (entry to WFI), we "put"
      - On unblocking (exit from WFI), we "load"
      
      Because these can nest (load/block/put/load/unblock/put, for example),
      we now have per-VPE tracking of the residency state.
      
      Additionally, vgic_v4_put gains a "need doorbell" parameter, which only
      gets set to true when blocking because of a WFI. This allows a finer
      control of the doorbell, which now also gets disabled as soon as
      it gets signaled.
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.org
      Signed-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      42993070
    • M
      mm: introduce MADV_PAGEOUT · 23757dcc
      Minchan Kim committed
      commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream
      
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply the SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If the PMD
      size (e.g., 256) makes some trouble, we could fix it later by limiting it
      to SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
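      
      As a quick userspace illustration of the hint (not part of the original
      patch), a minimal sketch follows; it assumes a libc that defines
      MADV_PAGEOUT (added in Linux 5.4) and otherwise falls back to the raw
      value 21:
      
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      
      #ifndef MADV_PAGEOUT
      #define MADV_PAGEOUT 21         /* value used by Linux 5.4+; assumed if libc lacks it */
      #endif
      
      int main(void)
      {
              size_t len = 64 << 20;  /* 64 MiB region we no longer expect to touch soon */
              char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
              if (buf == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }
              memset(buf, 0x5a, len); /* populate the pages so there is something to reclaim */
      
              /* Ask the kernel to reclaim these pages now; contents are preserved,
               * so later accesses take (major) faults instead of losing data. */
              if (madvise(buf, len, MADV_PAGEOUT) != 0)
                      perror("madvise(MADV_PAGEOUT)");
      
              munmap(buf, len);
              return 0;
      }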
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      23757dcc
    • M
      mm: introduce MADV_COLD · 1af766e8
      Minchan Kim committed
      commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream
      
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd (low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidates for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed (we measured the performance of swapping out
      vs. zapping the memory by killing a process; unsurprisingly, zapping is
      10x faster even though we use zram, which is much faster than real
      storage), so a kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduces two new options for madvise.
      One is MADV_COLD, which will deactivate activated pages, and the other is
      MADV_PAGEOUT, which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works on every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves
      
      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
      file LRU's head because MADV_COLD has slightly different semantics.
      MADV_FREE means it's okay to discard the page under memory pressure
      because its content is *garbage*, so freeing such pages has almost zero
      overhead: no swap-out is needed and a later access causes just a minor
      fault.  Thus, it makes sense to put those freeable pages on the inactive
      file LRU to compete with other used-once pages.  It also makes sense from
      an implementation point of view, because the page is no longer
      swap-backed until it is re-dirtied.  It could even give a bonus by
      letting them be reclaimed on a swapless system.  However, MADV_COLD
      doesn't mean garbage, so reclaiming such pages requires swap-out/in in
      the end, which is a bigger cost.  Since VM LRU aging is designed around a
      cost model, anonymous cold pages are better positioned on the inactive
      anon LRU list, not the file LRU.  Furthermore, this helps avoid
      unnecessary scanning if the system doesn't have a swap device.  Let's
      start the simpler way without adding complexity at this moment.  However,
      keep in mind the caveat that workloads with a lot of page cache are
      likely to see MADV_COLD ignored on anonymous memory because anonymous LRU
      lists are rarely aged.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
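      
      Similarly, a minimal userspace sketch of MADV_COLD (again assuming the
      constant is available, falling back to the raw value 20 otherwise):
      
      #include <stdio.h>
      #include <sys/mman.h>
      
      #ifndef MADV_COLD
      #define MADV_COLD 20            /* value used by Linux 5.4+; assumed if libc lacks it */
      #endif
      
      int main(void)
      {
              size_t len = 16 << 20;
              char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
              if (buf == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }
      
              /* Deactivate the range: pages move to the inactive LRU and are only
               * reclaimed later under memory pressure; contents stay intact. */
              if (madvise(buf, len, MADV_COLD) != 0)
                      perror("madvise(MADV_COLD)");
      
              munmap(buf, len);
              return 0;
      }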
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1af766e8
    • S
      efi: Make efi_rts_work accessible to efi page fault handler · 95fc4624
      Sai Praneeth committed
      [ Upstream commit 9dbbedaa6171247c4c7c40b83f05b200a117c2e0 ]
      
      After the kernel has booted, if any access by firmware causes a page
      fault, the efi page fault handler freezes efi_rts_wq and schedules
      a new process. To do this, the efi page fault handler needs
      efi_rts_work. Hence, make it accessible.
      
      There will be no race conditions in accessing this structure, because
      all the calls to efi runtime services are already serialized.
      
      [ Wen: This patch also fixes a memory corruption:
             #define efi_queue_work(_rts, _arg1, _arg2, _arg3, _arg4, _arg5)\
             ({                                                             \
              struct efi_runtime_work efi_rts_work;                           \
             …
              init_completion(&efi_rts_work.efi_rts_comp);                    \
              INIT_WORK(&efi_rts_work.work, efi_call_rts);                    \
             …
      
      efi_rts_work is on the stack; registering it with a workqueue will cause
      the following error:
      
             ODEBUG: object (____ptrval____) is on stack (____ptrval____),
             but NOT annotated.
             ------------[ cut here ]------------
             WARNING: CPU: 6 PID: 1 at lib/debugobjects.c:368
             __debug_object_init+0x218/0x538
             Modules linked in:
             CPU: 6 PID: 1 Comm: swapper/0 Tainted: G        W         4.19.91 #19
             …
             Call trace:
             __debug_object_init+0x218/0x538
             debug_object_init+0x20/0x28
             __init_work+0x34/0x58
             virt_efi_get_time.part.5+0x6c/0x12c
             virt_efi_get_time+0x4c/0x58
             efi_read_time+0x40/0x9c
             __rtc_read_time+0x50/0x118
             rtc_read_time+0x60/0x1f0
             rtc_hctosys+0x74/0x124
             do_one_initcall+0xac/0x3d4
             kernel_init_freeable+0x49c/0x59c
             kernel_init+0x18/0x110 ]
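      
      For reference, the splat above is the generic "work item on the stack
      without annotation" problem. The patch sidesteps it by taking efi_rts_work
      off the stack; when a stack-allocated work item is genuinely wanted, the
      usual kernel pattern looks roughly like this (a sketch, not code from this
      tree):
      
      #include <linux/workqueue.h>
      
      static void my_work_fn(struct work_struct *work)
      {
              /* runs in workqueue context */
      }
      
      static void queue_work_from_stack(struct workqueue_struct *wq)
      {
              struct work_struct w;
      
              /*
               * A plain INIT_WORK() on a stack object is what triggers the
               * "object is on stack, but NOT annotated" ODEBUG warning; the
               * _ONSTACK variant registers the object with debugobjects.
               */
              INIT_WORK_ONSTACK(&w, my_work_fn);
              queue_work(wq, &w);
              flush_work(&w);                 /* must finish before the stack frame dies */
              destroy_work_on_stack(&w);
      }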
      Tested-by: NBhupesh Sharma <bhsharma@redhat.com>
      Suggested-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Based-on-code-from: Ricardo Neri <ricardo.neri@intel.com>
      Signed-off-by: NSai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Fixes: 3eb420e7 ("efi: Use a work queue to invoke EFI Runtime Services")
      Signed-off-by: NWen Yang <wenyang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      95fc4624
    • J
      iomap: Allow forcing of waiting for running DIO in iomap_dio_rw() · 96d39291
      Jan Kara committed
      commit 13ef954445df4fd1d7c003a500ec5ce49573e14b upstream
      
      Notes from Xiaoguang Wang:
          Indeed this patch should be applied before "ext4: introduce direct I/O
      read using iomap infrastructure", but given that we have already applied
      "ext4: introduce direct I/O read using iomap infrastructure" previously,
      we need to update iomap_dio_rw() calls with the new argument in ext4.
      
      Filesystems do not support doing IO asynchronously in some cases, for
      example in case of unaligned writes or in case the file size needs to be
      extended (e.g. for ext4). Instead of forcing the filesystem to wait for
      AIO in such cases, add an argument to iomap_dio_rw() which makes the function
      wait for IO completion. This also results in executing
      iomap_dio_complete() inline in iomap_dio_rw() providing its return value
      to the caller as for ordinary sync IO.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      96d39291
    • J
      cpuidle: allow governor switch on cpuidle_register_driver() · 0176005e
      Joao Martins committed
      commit 11c59eae6633b8a7e77b8ee1cf908964d80c78cd upstream
      
      The recently introduced haltpoll driver is largely only useful with the
      haltpoll governor. To allow drivers to associate with a particular idle
      behaviour, add a @governor property to 'struct cpuidle_driver' and thus
      allow a cpuidle driver to switch to a *preferred* governor on idle driver
      registration. We save the previous governor, and when an idle driver is
      unregistered we switch back to that.
      
      The @governor can be overridden by cpuidle.governor= boot param or
      alternatively be ignored if the governor doesn't exist.
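      
      A sketch of how an idle driver might use the new property (assuming it is
      a governor-name string, as the haltpoll driver is expected to set it; not
      copied from this tree):
      
      #include <linux/cpuidle.h>
      #include <linux/module.h>
      
      static struct cpuidle_driver haltpoll_driver = {
              .name           = "haltpoll",
              .owner          = THIS_MODULE,
              /*
               * Preferred governor to switch to when this driver registers;
               * ignored if absent, and overridable via cpuidle.governor= at boot.
               */
              .governor       = "haltpoll",
              /* .states[] and .state_count omitted in this sketch */
      };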
      Signed-off-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      0176005e
    • M
      governors: unify last_state_idx · dc59f8b0
      Marcelo Tosatti committed
      commit 7d4daeedd575bbc3c40c87fc6708a8b88c50fe7e upstream
      
      Since this field is shared by all governors, move it to the
      cpuidle device structure.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      dc59f8b0
    • M
      cpuidle: add poll_limit_ns to cpuidle_device structure · 2163e221
      Marcelo Tosatti committed
      commit 259231a045616c4101d023a8f4dcc8379af265a6 upstream
      
      Add a poll_limit_ns variable to the cpuidle_device structure.
      
      Calculate and configure it in the new cpuidle_poll_time
      function, in case it is zero.
      
      Individual governors are allowed to override this value.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      2163e221
    • W
      alinux: mm: memcontrol: introduce memcg priority oom · 52e375fc
      Wenwei Tao committed
      Under memory pressure, reclaim and oom can happen; with multiple
      cgroups existing in one system, we might want some of their memory
      or tasks to survive the reclaim and oom while there are other
      candidates.
      
      @memory.low and @memory.min already make that happen during reclaim;
      this patch introduces memcg priority oom to meet the above requirement
      during oom.
      
      The priority ranges from 0 to 12; the higher the number, the higher the
      priority. When oom happens, the victim is always chosen from a
      low-priority memcg. This works both for memcg oom and global oom, and it
      can be enabled/disabled through @memory.use_priority_oom (for global oom,
      through the root memcg's @memory.use_priority_oom); it is disabled by
      default.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      52e375fc
    • W
      alinux: kernel: cgroup: account number of tasks in the css and its descendants · 1e91d392
      Wenwei Tao committed
      Account the number of tasks in the css and its descendants; this is
      preparation for the incoming memcg priority patch.
      
      In memcg priority oom, we will select a victim cgroup which has victim
      tasks in it. We need to know whether the memcg and its descendants
      have tasks before the selection can move on.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1e91d392
    • X
      alinux: memcg: Account throttled time due to memory.wmark_min_adj · ef467b9d
      Xunlei Pang committed
      Accessing the original memory.stat turned out to be a heavy
      operation which has caused many real production problems.
      
      Introduce a new cgroup file, memory.exstat, which stands
      for "extra/extended memory.stat" and contains dedicated
      statistics from the Alibaba Cloud Kernel.
      
      memory.exstat is supposed to provide hierarchical statistics.
      
      Export its first field, "wmark_min_throttled_ms"; more will be added
      later, such as direct reclaim, direct compaction, etc.
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      ef467b9d
    • X
      alinux: memcg: Introduce memory.wmark_min_adj · 60be0f54
      Xunlei Pang committed
      In a co-location environment, there is more or less some memory
      overcommitment, so BATCH tasks may break the shared global min
      watermark, resulting in all types of applications falling into
      the direct reclaim slow path and hurting the RT of LS tasks.
      (NOTE: BATCH tasks tolerate big latency spikes, even in seconds,
      as long as overall throughput is not hurt, while LS tasks
      are very Latency-Sensitive and may time out or fail if a
      sudden latency spike lasts even hundreds of ms.)
      
      Actually BATCH tasks are not sensitive to memory latency; they
      can be assigned a strict min watermark which is different from
      that of LS tasks (which can be assigned a lenient min watermark
      accordingly), thus isolating them from each other for global memory
      allocation. This is kind of like the idea behind ALLOC_HARDER
      for rt_task(), see gfp_to_alloc_flags().
      
      memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment,
      it is used to realize separate min watermarks above-mentioned for
      memcgs, its valid value is within [-25, 50], specifically:
      negative value means to be relative to [0, WMARK_MIN],
      positive value means to be relative to [WMARK_MIN, WMARK_LOW].
      For examples,
        -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)"
         50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%"
      
      Note that the minimum -25 is what ALLOC_HARDER uses which is safe
      for us to adopt, and the maximum 50 is one experienced value.
      
      Negative memory.wmark_min_adj means high QoS requirements, it can
      allocate below the global WMARK_MIN, which is kind of like the idea
      behind ALLOC_HARDER, see gfp_to_alloc_flags().
      
      Positive memory.wmark_min_adj means low QoS requirements; when an
      allocation breaks the memcg min watermark, which would traditionally
      trigger direct reclaim, we trigger throttling instead to further prevent
      such tasks from disturbing others.
      
      With this interface, we can assign positive values for BATCH memcgs
      and negative values for LS memcgs.
      
      memory.wmark_min_adj defaults to 0 and is inherited from the parent.
      Note that the final effective wmark_min_adj considers all the
      hierarchical values; its value is the maximal (most conservative)
      wmark_min_adj along the hierarchy, excluding intermediate default
      values (zero).
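      
      To make the arithmetic concrete, a small standalone sketch computing the
      effective min watermark for a few adjustment values (the page counts are
      made-up illustrative numbers, not kernel defaults):
      
      #include <stdio.h>
      
      /*
       * Effective memcg min watermark for a given wmark_min_adj, following the
       * description above:
       *   adj < 0: WMARK_MIN + (WMARK_MIN - 0)         * adj%
       *   adj > 0: WMARK_MIN + (WMARK_LOW - WMARK_MIN) * adj%
       */
      static long effective_wmark_min(long wmark_min, long wmark_low, int adj)
      {
              if (adj < 0)
                      return wmark_min + wmark_min * adj / 100;
              return wmark_min + (wmark_low - wmark_min) * adj / 100;
      }
      
      int main(void)
      {
              long wmark_min = 10000, wmark_low = 14000;      /* pages, illustrative only */
      
              printf("adj=-25 -> %ld pages\n", effective_wmark_min(wmark_min, wmark_low, -25)); /* 7500  */
              printf("adj=  0 -> %ld pages\n", effective_wmark_min(wmark_min, wmark_low,   0)); /* 10000 */
              printf("adj= 50 -> %ld pages\n", effective_wmark_min(wmark_min, wmark_low,  50)); /* 12000 */
              return 0;
      }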
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      60be0f54
    • X
      alinux: memcg: Provide users the ability to reap zombie memcgs · 63442ea9
      Xunlei Pang committed
      After a memcg is deleted, page caches may still reference it,
      leaving a large number of dead (zombie) memcgs in the system. This
      slows down access to "/sys/fs/cgroup/cpu/memory.stat", etc. due to
      tons of iterations, further causing various latencies.
      
      This patch introduces two ways to reclaim these zombie memcgs.
      1) Background kthread reaper
      Introduce a kernel thread "memcg_zombie_reaper" to reclaim zombie
      memcgs at background periodically.
      
      Several knobs are also added to control the reaper scan frequency:
      - /sys/kernel/mm/memcg_reaper/scan_interval
        The scan period in second. Default 5s.
      - /sys/kernel/mm/memcg_reaper/pages_scan
        The scan rate of pages per scan. Default 1310720(5GiB for 4KiB page).
      - /sys/kernel/mm/memcg_reaper/verbose
        Output some zombie memcg information for debug purpose. Default off.
      - /sys/kernel/mm/memcg_reaper/reap_background
        "on/off" switch. Default "0" means off. Write "1" to switch it on.
      
      2) One-shot trigger by users
      - /sys/kernel/mm/memcg_reaper/reap
        Write "1" to trigger one round of zombie memcg reaping, but without
        any guarantee, you may need to launch multiple rounds as needed.
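      
      For example, the one-shot reap can be kicked from a tiny program,
      equivalent to echoing "1" into the knob above:
      
      #include <fcntl.h>
      #include <stdio.h>
      #include <unistd.h>
      
      int main(void)
      {
              int fd = open("/sys/kernel/mm/memcg_reaper/reap", O_WRONLY);
      
              if (fd < 0) {
                      perror("open");
                      return 1;
              }
              if (write(fd, "1", 1) != 1)     /* kick one round of zombie memcg reaping */
                      perror("write");
              close(fd);
              return 0;
      }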
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      63442ea9
    • C
      iomap: move the iomap_dio_rw ->end_io callback into a structure · 64ce72b7
      Christoph Hellwig committed
      commit 838c4f3d7515efe9d0e32c846fb5d102b6d8a29d upstream.
      
      Add a new iomap_dio_ops structure that for now just contains the end_io
      handler.  This avoids storing the function pointer in a mutable structure,
      which is a possible exploit vector for kernel code execution, and prepares
      for adding a submit_io handler that btrfs needs.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      64ce72b7
    • G
      iomap: use a srcmap for a read-modify-write I/O · 5b44c648
      Goldwyn Rodrigues committed
      commit c039b99792726346ad46ff17c5a5bcb77a5edac4 upstream.
      
      The srcmap is used to identify where the read is to be performed from.
      It is passed to ->iomap_begin, which can fill it in if we need to read
      data for partially written blocks from a different location than the
      write target.  The srcmap is only supported for buffered writes so far.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      [hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as
            srcmap if not set, adjust length down to srcmap end as well]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Acked-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5b44c648
    • J
      net: add __sys_accept4_file() helper · a73fda3c
      Jens Axboe committed
      commit de2ea4b64b75a79ed9cdf9bf30e0e197901084e4 upstream.
      
      This is identical to __sys_accept4(), except it takes a struct file
      instead of an fd, and it also allows passing in extra file->f_flags
      flags. The latter is done to support masking in O_NONBLOCK without
      manipulating the original file flags.
      
      No functional changes in this patch.
      
      Cc: netdev@vger.kernel.org
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a73fda3c
    • J
      io-wq: small threadpool implementation for io_uring · 8a308e54
      Jens Axboe committed
      commit 771b53d033e8663abdf59704806aa856b236dcdb upstream.
      
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the life time of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      [Joseph: Cherry-pick allow_kernel_signal() from upstream commit 33da8e7c814f]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      8a308e54
    • T
      sched: Remove stale PF_MUTEX_TESTER bit · 04695e1f
      Thomas Gleixner committed
      commit 15917dc02841862840efcbfe1da0830f88078b5c upstream.
      
      The RTMUTEX tester was removed long ago but the PF bit stayed
      around. Remove it and free up the space.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      04695e1f
    • S
      include/linux/notifier.h: SRCU: fix ctags · ffeba5d0
      Sam Protsenko committed
      commit 94e297c50b529f5d01cfd1dbc808d61e95180ab7 upstream.
      
      ctags indexing ("make tags" command) throws this warning:
      
          ctags: Warning: include/linux/notifier.h:125:
          null expansion of name pattern "\1"
      
      This is the result of DEFINE_PER_CPU() macro expansion.  Fix that by
      getting rid of line break.
      
      Similar fix was already done in commit 25528213 ("tags: Fix
      DEFINE_PER_CPU expansions"), but this one probably wasn't noticed.
      
      Link: http://lkml.kernel.org/r/20181030202808.28027-1-semen.protsenko@linaro.org
      Fixes: 9c80172b ("kernel/SRCU: provide a static initializer")
      Signed-off-by: NSam Protsenko <semen.protsenko@linaro.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com>
      Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
      ffeba5d0
    • Z
      io_uring: track io length in async_list based on bytes · e280026f
      Zhengyuan Liu committed
      commit 9310a7ba6de8cce6209e3e8a3cdf733f824cdd9b upstream.
      
      We are using PAGE_SIZE as the unit to determine if the total len in
      async_list has exceeded max_pages, it's not fair for smaller io sizes.
      For example, if we are doing 1k-size io streams, we will never exceed
      max_pages since len >>= PAGE_SHIFT always gets zero. So use original
      bytes to make it more accurate.
      Signed-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      e280026f
    • O
      signal: simplify set_user_sigmask/restore_user_sigmask · f12f9562
      Oleg Nesterov committed
      commit b772434be0891ed1081a08ae7cfd4666728f8e82 upstream.
      
      task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
      syscall paths.  This means that set_user_sigmask() can save ->blocked in
      ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
      was modified.
      
      This way the callers do not need 2 sigset_t's passed to set/restore and
      restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
      into the trivial helper which just calls restore_saved_sigmask().
      
      Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Eric Wong <e@80x24.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Laight <David.Laight@aculab.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      f12f9562
    • J
      io_uring: add support for recvmsg() · 3962b3d0
      Jens Axboe committed
      commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream.
      
      This is done through IORING_OP_RECVMSG. This opcode uses the same
      sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
      msghdr struct in the sqe->addr field as well.
      
      We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      3962b3d0
    • J
      io_uring: add support for sendmsg() · 0cb8acf9
      Jens Axboe committed
      commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream.
      
      This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
      for the flags argument, and the msghdr struct is passed in the
      sqe->addr field.
      
      We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
      block, and punt to async execution if it would have.
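      
      A minimal userspace sketch of the opcode via liburing (assuming liburing
      is available and using a socketpair for illustration; error handling
      abridged):
      
      #include <liburing.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      
      /* Queue one IORING_OP_SENDMSG on a connected socket and wait for its
       * completion; returns bytes sent or a negative errno-style value. */
      static int send_one(struct io_uring *ring, int sockfd, const char *text)
      {
              struct iovec iov = { .iov_base = (void *)text, .iov_len = strlen(text) };
              struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
              struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
              struct io_uring_cqe *cqe;
              int ret;
      
              if (!sqe)
                      return -1;
              io_uring_prep_sendmsg(sqe, sockfd, &msg, 0); /* flags end up in sqe->msg_flags */
              io_uring_submit(ring);
      
              ret = io_uring_wait_cqe(ring, &cqe);
              if (ret < 0)
                      return ret;
              ret = cqe->res;                 /* bytes sent, or -errno if the op failed */
              io_uring_cqe_seen(ring, cqe);
              return ret;
      }
      
      int main(void)
      {
              struct io_uring ring;
              int sv[2];
      
              if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) || io_uring_queue_init(8, &ring, 0))
                      return 1;
              printf("sent %d bytes\n", send_one(&ring, sv[0], "hello"));
              io_uring_queue_exit(&ring);
              return 0;
      }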
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0cb8acf9
    • C
      block: never take page references for ITER_BVEC · 709d159e
      Christoph Hellwig committed
      Cherry-pick from commit b620743077e291ae7d0debd21f50413a8c266229 upstream.
      
      If we pass pages through an iov_iter we always already have a reference
      in the caller.  Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
      reference to pages by default for bvec backed iov_iters.
      
      [Joseph] Resolve conflicts since we don't have:
      81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF"
      7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages"
      Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      709d159e
    • O
      signal: remove the wrong signal_pending() check in restore_user_sigmask() · a48e4674
      Oleg Nesterov committed
      commit 97abc889ee296faf95ca0e978340fb7b942a3e32 upstream.
      
      This is the minimal fix for stable, I'll send cleanups later.
      
      Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
      the visible change which breaks user-space: a signal temporarily unblocked
      by set_user_sigmask() can be delivered even if the caller returns
      success or timeout.
      
      Change restore_user_sigmask() to accept the additional "interrupted"
      argument which should be used instead of signal_pending() check, and
      update the callers.
      
      Eric said:
      
      : For clarity.  I don't think this is required by posix, or fundamentally to
      : remove the races in select.  It is what linux has always done and we have
      : applications who care so I agree this fix is needed.
      :
      : Further in any case where the semantic change that this patch rolls back
      : (aka where allowing a signal to be delivered and the select like call to
      : complete) would be advantage we can do as well if not better by using
      : signalfd.
      :
      : Michael is there any chance we can get this guarantee of the linux
      : implementation of pselect and friends clearly documented.  The guarantee
      : that if the system call completes successfully we are guaranteed that no
      : signal that is unblocked by using sigmask will be delivered?
      
      Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
      Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NEric Wong <e@80x24.org>
      Tested-by: NEric Wong <e@80x24.org>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      a48e4674
    • J
      uio: make import_iovec()/compat_import_iovec() return bytes on success · 0c13034a
      Jens Axboe committed
      commit 87e5e6dab6c2a21fab2620f37786276d202e2ce0 upstream.
      
      Currently these functions return < 0 on error, and 0 for success.
      Change that so that we return < 0 on error, but number of bytes
      for success.
      
      Some callers already treat the return value that way, others need a
      slight tweak.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0c13034a
    • J
      fs: add sync_file_range() helper · ceda0208
      Jens Axboe committed
      commit 22f96b3808c12a218e9a3bce6e1bfbd74efbe374 upstream.
      
      This just pulls out the ksys_sync_file_range() code to work on a struct
      file instead of an fd, so we can use it elsewhere.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      ceda0208
    • S
      perf/smmuv3: Enable HiSilicon Erratum 162001800 quirk · 54c387a7
      Shameer Kolothum committed
      commit 24062fe85860debfdae0eeaa495f27c9971ec163 upstream
      
      HiSilicon erratum 162001800 describes the limitation of
      SMMUv3 PMCG implementation on HiSilicon Hip08 platforms.
      
      On these platforms, the PMCG event counter registers
      (SMMU_PMCG_EVCNTRn) are read only and as a result it
      is not possible to set the initial counter period value
      on event monitor start.
      
      To work around this, the current value of the counter
      is read and used for delta calculations. OEM information
      from ACPI header is used to identify the affected hardware
      platforms.
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NHanjun Guo <hanjun.guo@linaro.org>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      [will: update silicon-errata.txt and add reason string to acpi match]
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      54c387a7
    • N
      ACPI/IORT: Add support for PMCG · 9c6dfb51
      Neil Leeder committed
      commit 24e516049360eda85cf3fe9903221d43886c2689 upstream.
      
      Add support for the SMMU Performance Monitor Counter Group
      information from ACPI. This is in preparation for its use
      in the SMMUv3 PMU driver.
      Signed-off-by: NNeil Leeder <nleeder@codeaurora.org>
      Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      9c6dfb51
    • P
      mm/hotplug: make remove_memory() interface usable · f8502f80
      Pavel Tatashin committed
      commit eca499ab3749a4537dee77ffead47a1a2c0dee19 upstream
      
      Presently the remove_memory() interface is inherently broken.  It tries
      to remove memory but panics if some memory is not offline.  The problem
      is that it is impossible to ensure that all memory blocks are offline as
      this function also takes lock_device_hotplug that is required to change
      memory state via sysfs.
      
      So, between calling this function and offlining all memory blocks there
      is always a window when lock_device_hotplug is released, and therefore,
      there is always a chance for a panic during this window.
      
      Make this interface return an error if memory removal fails.  This
      way it is safe to call this function without panicking the machine, and it
      also makes it symmetric to add_memory() which already returns an error.
      
      Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.com
      Signed-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nyinhe <yinhe@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      f8502f80
    • D
      mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · d2097173
      David Hildenbrand committed
      commit d15e59260f62bd5e0f625cf5f5240f6ffac78ab6 upstream
      
      Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
      
      Reading through the code and studying how mem_hotplug_lock is to be used,
      I noticed that there are two places where we can end up calling
      device_online()/device_offline() - online_pages()/offline_pages() without
      the mem_hotplug_lock.  And there are other places where we call
      device_online()/device_offline() without the device_hotplug_lock.
      
      While e.g.
      	echo "online" > /sys/devices/system/memory/memory9/state
      is fine, e.g.
      	echo 1 > /sys/devices/system/memory/memory9/online
      will not take the mem_hotplug_lock, only the device_lock() and
      device_hotplug_lock.
      
      E.g.  via memory_probe_store(), we can end up calling
      add_memory()->online_pages() without the device_hotplug_lock.  So we can
      have concurrent callers in online_pages().  We e.g.  touch in
      online_pages() basically unprotected zone->present_pages then.
      
      Looks like there is a longer history to that (see Patch #2 for details),
      and fixing it to work the way it was intended is not really possible.  We
      would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
      sounds wrong.
      
      Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
      More details can be found in patch 3 and patch 6.
      
      I propose the general rules (documentation added in patch 6):
      
      1. add_memory/add_memory_resource() must only be called with
         device_hotplug_lock.
      2. remove_memory() must only be called with device_hotplug_lock. This is
         already documented and holds for all callers.
      3. device_online()/device_offline() must only be called with
         device_hotplug_lock. This is already documented and true for now in core
         code. Other callers (related to memory hotplug) have to be fixed up.
      4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
         online_pages/offline_pages.
      
      To me, this looks way cleaner than what we have right now (and easier to
      verify).  And looking at the documentation of remove_memory, using
      lock_device_hotplug also for add_memory() feels natural.
      
      This patch (of 6):
      
      remove_memory() is exported right now but requires the
      device_hotplug_lock, which is not exported.  So let's provide a variant
      that takes the lock and only export that one.
      
      The lock is already held in
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      	arch/powerpc/platforms/powernv/memtrace.c
      
      Apart from that, there are not other users in the tree.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nyinhe <yinhe@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d2097173
    • A
      mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections · f38de7b3
      Alexander Duyck committed
      commit 0e56acae4b4dd4a9fbe897854ab83a109e2a9e11 upstream.
      
      Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
      use it to support initializing and freeing pages in groups no larger than
      MAX_ORDER_NR_PAGES.  By doing this we can greatly improve the cache
      locality of the pages while we do several loops over them in the init and
      freeing process.
      
      We are able to tighten the loops further as a result of the "from"
      iterator as we can perform the initial checks for first_init_pfn in our
      first call to the iterator, and continue without the need for those checks
      via the "from" iterator.  I have added this functionality in the function
      called deferred_init_mem_pfn_range_in_zone that primes the iterator and
      causes us to exit if we encounter any failure.
      
      On my x86_64 test system with 384GB of memory per node I saw a reduction
      in initialization time from 1.85s to 1.38s as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomain
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <yi.z.zhang@linux.intel.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f38de7b3
    • A
      mm: implement new zone specific memblock iterator · ad97e5e4
      Alexander Duyck committed
      commit 837566e7e08e3f89444166444836a8a49b9f9322 upstream.
      
      Introduce a new iterator for_each_free_mem_pfn_range_in_zone.
      
      This iterator will take care of making sure a given memory range provided
      is in fact contained within a zone.  It takes care of all the bounds
      checking we were doing in deferred_grow_zone and deferred_init_memmap.
      In addition it should help to speed up the search a bit by iterating until
      the end of a range is greater than the start of the zone pfn range, and
      will exit completely if the start is beyond the end of the zone.
      
      Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomain
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ad97e5e4
    • A
      mm: use mm_zero_struct_page from SPARC on all 64b architectures · e23b0cb5
      Alexander Duyck committed
      commit 5470dea49f5382257c242ac617d908267727f1a8 upstream.
      
      Patch series "Deferred page init improvements", v7.
      
      This patchset is essentially a refactor of the page initialization logic
      that is meant to provide for better code reuse while providing a
      significant improvement in deferred page initialization performance.
      
      In my testing on an x86_64 system with 384GB of RAM I have seen the
      following.  In the case of regular memory initialization the deferred init
      time was decreased from 3.75s to 1.38s on average.  This amounts to a 172%
      improvement for the deferred memory initialization performance.
      
      I have called out the improvement observed with each patch.
      
      This patch (of 4):
      
      Use the same approach that was already in use on Sparc on all the
      architectures that support a 64b long.
      
      This is mostly motivated by the fact that 7 to 10 store/move instructions
      are likely always going to be faster than having to call into a function
      that is not specialized for handling page init.
      
      An added advantage to doing it this way is that the compiler can get away
      with combining writes in the __init_single_page call.  As a result the
      memset call will be reduced to only about 4 write operations, or at least
      that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
      count/mapcount seem to be cancelling out at least 4 of the 8 assignments
      on my system.
      
      One change I had to make to the function was to reduce the minimum page
      size to 56 to support some powerpc64 configurations.
      
      This change should introduce no change on SPARC since it already had this
      code.  In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
      initializing 384GB of RAM per node.  Pavel Tatashin tested on a system
      with Broadcom's Stingray CPU and 48GB of RAM and found that
      __init_single_page() takes 19.30ns / 64-byte struct page before this patch
      and with this patch it takes 17.33ns / 64-byte struct page.  Mike Rapoport
      ran a similar test on an OpenPower (S812LC 8348-21C) with a Power8 processor
      and 128GB of RAM.  His results per 64-byte struct page were 4.68ns before,
      and 4.59ns after this patch.
      
      Link: http://lkml.kernel.org/r/20190405221213.12227.9392.stgit@localhost.localdomain
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e23b0cb5
    • M
      blk-mq: not embed .mq_kobj and ctx->kobj into queue instance · 9ff28240
      Ming Lei committed
      commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d upstream.
      
      Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime
      from the block layer's view, actually they don't, because userspace may
      grab one kobject anytime via sysfs.
      
      This patch fixes the issue by the following approach:
      
      1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
      all ctxs
      
      2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
      handler of .mq_kobj
      
      3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
      .mq_kobj is always released after all ctxs are freed.
      
      This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
      is enabled.
      Reported-by: NGuenter Roeck <linux@roeck-us.net>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9ff28240
    • Q
      mm/memblock.c: skip kmemleak for kasan_init() · 9f093569
      Qian Cai committed
      commit fed84c78527009d4f799a3ed9a566502fa026d82 upstream.
      
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), the kmemleak early
      log buffer went from something like 280 to 260000, which caused kmemleak to
      be disabled and crash dump memory reservation to fail.  The multitude of
      kmemleak_alloc() calls is from nested loops while KASAN is setting up full
      memory mappings, so let early kmemleak allocations skip those
      memblock_alloc_internal() calls that came from kasan_init(), given that those
      early KASAN memory mappings should not reference other memory.  Hence,
      no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us
      Signed-off-by: NQian Cai <cai@gmx.us>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9f093569