Commits · 6221288e5e97c3a3feca3bb1cc30d9661d36e459 · openanolis / cloud-kernel

22 Feb, 2020 3 commits

alinux: memcg: Account throttled time due to memory.wmark_min_adj · 6221288e

Committed by Xunlei Pang on Sep 01, 2019

Accessing the original memory.stat turned out to be a heavy
operation which has caused many real production problems.

Introduce new cgroup memory.exstat, memory.exstat stands
for "extra/extended memory.stat", which contains dedicated
statistics from Alibaba Cloud Kernel.

memory.exstat is supposed to provide hierarchical statistics.

Export its first statistic, "wmark_min_throttled_ms"; more will be
added later, such as direct reclaim, direct compaction, etc.

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: memcg: Introduce memory.wmark_min_adj · 6bc07d25

Committed by Xunlei Pang on Aug 27, 2019

In a co-location environment there is usually some degree of memory
overcommitment, so BATCH tasks may break the shared global min
watermark. All types of applications then fall into the direct
reclaim slow path, hurting the response time (RT) of LS tasks.
(NOTE: BATCH tasks tolerate large latency spikes, even of seconds,
as long as their overall throughput is not hurt. LS tasks are very
Latency-Sensitive; they may time out or fail when a sudden latency
spike lasts, typically, hundreds of ms.)

In fact, BATCH tasks are not sensitive to memory latency, so they
can be assigned a strict min watermark that differs from that of
LS tasks (which can be assigned a lenient min watermark
accordingly), thus isolating the two from each other during global
memory allocation. This is similar to the idea behind ALLOC_HARDER
for rt_task(); see gfp_to_alloc_flags().

memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment.
It is used to realize the separate min watermarks mentioned above for
memcgs. Its valid value is within [-25, 50], specifically:
a negative value is relative to [0, WMARK_MIN],
a positive value is relative to [WMARK_MIN, WMARK_LOW].
For example:
  -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)"
   50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%"

Note that the minimum, -25, is what ALLOC_HARDER uses, so it is safe
for us to adopt, and the maximum, 50, is an empirical value.
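The adjustment rule above can be sketched in a few lines. This is an illustrative model only, not the kernel implementation: the function name `adjusted_wmark_min`, the page-count arguments, the sample watermarks, and the integer rounding are assumptions; only the [-25, 50] range and the two formulas come from the text.

```python
def adjusted_wmark_min(wmark_min, wmark_low, adj):
    """Return the effective min watermark (in pages) for a given adjustment.

    Hypothetical helper: models the two formulas quoted above with
    integer arithmetic; the kernel's actual rounding may differ.
    """
    if not -25 <= adj <= 50:
        raise ValueError("wmark_min_adj must be within [-25, 50]")
    if adj < 0:
        # Negative: move below WMARK_MIN, relative to [0, WMARK_MIN].
        return wmark_min - wmark_min * (-adj) // 100
    # Zero or positive: move above WMARK_MIN, relative to [WMARK_MIN, WMARK_LOW].
    return wmark_min + (wmark_low - wmark_min) * adj // 100


# Made-up watermarks (in pages) purely for illustration.
print(adjusted_wmark_min(1000, 1250, -25))  # 750
print(adjusted_wmark_min(1000, 1250, 50))   # 1125
```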

A negative memory.wmark_min_adj means high QoS requirements: the memcg
can allocate below the global WMARK_MIN, similar to the idea behind
ALLOC_HARDER (see gfp_to_alloc_flags()).

A positive memory.wmark_min_adj means low QoS requirements. When an
allocation breaks the memcg min watermark, it would traditionally
trigger direct reclaim; we throttle the task instead, to further
prevent it from disturbing others.

With this interface, we can assign positive values for BATCH memcgs
and negative values for LS memcgs.

memory.wmark_min_adj defaults to 0 and is inherited from the parent.
Note that the final effective wmark_min_adj considers all the values
in the hierarchy: it is the maximal (most conservative)
wmark_min_adj along the hierarchy, excluding intermediate default
values (zero).
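One possible reading of the hierarchy rule, as a minimal sketch: take the maximum over all explicitly set (non-zero) values from the root down to the memcg in question, and fall back to 0 when nothing is set. The function name and the list representation are hypothetical, and the kernel's handling of corner cases may differ.

```python
def effective_wmark_min_adj(values_root_to_leaf):
    """Effective adjustment for a memcg, given the configured values
    of every memcg on the path from the root down to it.

    Assumption: zero means "not set" and is skipped; the result is the
    largest (most conservative) explicitly set value, or 0 if none.
    """
    explicit = [v for v in values_root_to_leaf if v != 0]
    return max(explicit) if explicit else 0


print(effective_wmark_min_adj([0, -25, 0]))  # -25
print(effective_wmark_min_adj([50, -25]))    # 50
```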

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: memcg: Provide users the ability to reap zombie memcgs · 867b4772

Committed by Xunlei Pang on May 06, 2019

After a memcg is deleted, page caches may still hold references to it,
leaving a large number of dead (zombie) memcgs in the system. This
slows down access to "/sys/fs/cgroup/cpu/memory.stat", etc., because of
the many iterations required, which in turn causes various latencies.

This patch introduces two ways to reclaim these zombie memcgs.
1) Background kthread reaper
Introduce a kernel thread "memcg_zombie_reaper" that periodically
reclaims zombie memcgs in the background.

Several knobs are also added to control the reaper scan frequency:
- /sys/kernel/mm/memcg_reaper/scan_interval
  The scan period in seconds. Default 5s.
- /sys/kernel/mm/memcg_reaper/pages_scan
  The number of pages scanned per scan. Default 1310720 (5GiB with 4KiB pages).
- /sys/kernel/mm/memcg_reaper/verbose
  Output some zombie memcg information for debugging. Default off.
- /sys/kernel/mm/memcg_reaper/reap_background
  "on/off" switch. Default "0" means off. Write "1" to switch it on.

2) One-shot trigger by users
- /sys/kernel/mm/memcg_reaper/reap
  Write "1" to trigger one round of zombie memcg reaping. This comes
  without any guarantee; you may need to launch multiple rounds as needed.
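As a rough illustration of the knobs above, the sketch below checks the pages_scan default against the quoted 5GiB figure and writes "1" to the one-shot trigger. The helper names are hypothetical, the 4KiB page size is an assumption, and writing to the real sysfs file requires root on a kernel that carries this patch.

```python
import os

REAPER_DIR = "/sys/kernel/mm/memcg_reaper"  # directory named in the commit message
PAGE_SIZE = 4096                            # assumption: 4KiB pages
DEFAULT_PAGES_SCAN = 1310720                # default quoted above


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


def trigger_reap(rounds=1, reaper_dir=REAPER_DIR):
    """Write "1" to the one-shot 'reap' file `rounds` times.

    Hypothetical helper: one round gives no guarantee, so callers may
    want to request several rounds.
    """
    for _ in range(rounds):
        with open(os.path.join(reaper_dir, "reap"), "w") as f:
            f.write("1")


print(pages_to_gib(DEFAULT_PAGES_SCAN))  # 5.0
```

Calling `trigger_reap(rounds=3)` would request three reaping rounds in a row; whether that is enough depends on how many page caches still pin the zombie memcgs.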

Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>

17 Feb, 2020 2 commits

iomap: move the iomap_dio_rw ->end_io callback into a structure · 99d88c6e

Committed by Christoph Hellwig on Sep 19, 2019

commit 838c4f3d7515efe9d0e32c846fb5d102b6d8a29d upstream.

Add a new iomap_dio_ops structure that for now just contains the end_io
handler.  This avoids storing the function pointer in a mutable structure,
which is a possible exploit vector for kernel code execution, and prepares
for adding a submit_io handler that btrfs needs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

iomap: use a srcmap for a read-modify-write I/O · 1103210e

Committed by Goldwyn Rodrigues on Oct 18, 2019

commit c039b99792726346ad46ff17c5a5bcb77a5edac4 upstream.

The srcmap is used to identify where the read is to be performed from.
It is passed to ->iomap_begin, which can fill it in if we need to read
data for partially written blocks from a different location than the
write target.  The srcmap is only supported for buffered writes so far.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
[hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as
      srcmap if not set, adjust length down to srcmap end as well]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

13 Feb, 2020 1 commit

io_uring: fix build warning on arm · 0676d7a1

Committed by Joseph Qi on Feb 13, 2020

./include/linux/socket.h:380:38: warning: 'struct file' declared inside parameter list will not be visible outside of this definition or declaration
  380 | extern int __sys_accept4_file(struct file *file, unsigned file_flags,
      |                                      ^~~~

Fixes: d7d134b7 ("net: add __sys_accept4_file() helper")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

11 Feb, 2020 3 commits

net: add __sys_accept4_file() helper · d7d134b7

Committed by Jens Axboe on Oct 17, 2019

commit de2ea4b64b75a79ed9cdf9bf30e0e197901084e4 upstream.

This is identical to __sys_accept4(), except it takes a struct file
instead of an fd, and it also allows passing in extra file->f_flags
flags. The latter is done to support masking in O_NONBLOCK without
manipulating the original file flags.

No functional changes in this patch.

Cc: netdev@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io-wq: small threadpool implementation for io_uring · 9c8dc805

Committed by Jens Axboe on Oct 22, 2019

commit 771b53d033e8663abdf59704806aa856b236dcdb upstream.

This adds support for io-wq, a smaller and specialized thread pool
implementation. This is meant to replace workqueues for io_uring. Among
the reasons for this addition are:

- We can assign memory context smarter and more persistently if we
  manage the life time of threads.

- We can drop various work-arounds we have in io_uring, like the
  async_list.

- We can implement hashed work insertion, to manage concurrency of
  buffered writes without needing a) an extra workqueue, or b)
  needlessly making the concurrency of said workqueue very low
  which hurts performance of multiple buffered file writers.

- We can implement cancel through signals, for cancelling
  interruptible work like read/write (or send/recv) to/from sockets.

- We need the above cancel for being able to assign and use file tables
  from a process.

- We can implement a more thorough cancel operation in general.

- We need it to move towards a syslet/threadlet model for even faster
  async execution. For that we need to take ownership of the used
  threads.

This list is just off the top of my head. Performance should be the
same, or better, at least that's what I've seen in my testing. io-wq
supports basic NUMA functionality, setting up a pool per node.

io-wq hooks up to the scheduler schedule in/out just like workqueue
and uses that to drive the need for more/less workers.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: Cherry-pick allow_kernel_signal() from upstream commit 33da8e7c814f]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

sched: Remove stale PF_MUTEX_TESTER bit · e96c72ce

Committed by Thomas Gleixner on Dec 19, 2018

commit 15917dc02841862840efcbfe1da0830f88078b5c upstream.

The RTMUTEX tester was removed long ago but the PF bit stayed
around. Remove it and free up the space.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

06 Feb, 2020 1 commit

include/linux/notifier.h: SRCU: fix ctags · 7d6e6270

Committed by Sam Protsenko on Jan 21, 2020

commit 94e297c50b529f5d01cfd1dbc808d61e95180ab7 upstream.

ctags indexing ("make tags" command) throws this warning:

    ctags: Warning: include/linux/notifier.h:125:
    null expansion of name pattern "\1"

This is the result of DEFINE_PER_CPU() macro expansion.  Fix that by
getting rid of line break.

Similar fix was already done in commit 25528213 ("tags: Fix
DEFINE_PER_CPU expansions"), but this one probably wasn't noticed.

Link: http://lkml.kernel.org/r/20181030202808.28027-1-semen.protsenko@linaro.org
Fixes: 9c80172b ("kernel/SRCU: provide a static initializer")
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>

04 Feb, 2020 5 commits

io_uring: track io length in async_list based on bytes · 70b10c40

Committed by Zhengyuan Liu on Jul 22, 2019

commit 9310a7ba6de8cce6209e3e8a3cdf733f824cdd9b upstream.

We are using PAGE_SIZE as the unit to determine whether the total len in
async_list has exceeded max_pages, which is not fair for smaller io sizes.
For example, if we are doing 1k-size io streams, we will never exceed
max_pages since len >>= PAGE_SHIFT always gets zero. So use original
bytes to make it more accurate.
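The truncation described above is easy to reproduce with plain integer arithmetic. The sketch below is only a model of the accounting, not the io_uring code: the function names are hypothetical and PAGE_SHIFT = 12 assumes 4KiB pages.

```python
PAGE_SHIFT = 12  # assumption: 4KiB pages


def total_in_pages(io_sizes):
    """Old-style accounting: shift each request to whole pages, then sum."""
    return sum(size >> PAGE_SHIFT for size in io_sizes)


def total_in_bytes(io_sizes):
    """Byte-based accounting: sum the original lengths."""
    return sum(io_sizes)


stream = [1024] * 64  # sixty-four 1k requests, 64KiB in total
print(total_in_pages(stream))  # 0
print(total_in_bytes(stream))  # 65536
```

With page-based accounting the running total stays at zero no matter how many 1k requests are queued, so a limit expressed in pages can never trigger; the byte total grows as expected.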

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

signal: simplify set_user_sigmask/restore_user_sigmask · a4dd0237

Committed by Oleg Nesterov on Jul 16, 2019

commit b772434be0891ed1081a08ae7cfd4666728f8e82 upstream.

task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths.  This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: add support for recvmsg() · 7cabfcb1

Committed by Jens Axboe on Apr 19, 2019

commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream.

This is done through IORING_OP_RECVMSG. This opcode uses the same
sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
msghdr struct in the sqe->addr field as well.

We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
block, and punt to async execution if it would have.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

io_uring: add support for sendmsg() · ce630b99

Committed by Jens Axboe on Apr 19, 2019

commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream.

This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
for the flags argument, and the msghdr struct is passed in the
sqe->addr field.

We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
block, and punt to async execution if it would have.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

block: never take page references for ITER_BVEC · 3f9da4d9

Committed by Christoph Hellwig on Jun 26, 2019

Cherry-pick from commit b620743077e291ae7d0debd21f50413a8c266229 upstream.

If we pass pages through an iov_iter we always already have a reference
in the caller.  Thus remove the ITER_BVEC_FLAG_NO_REF and don't take
reference to pages by default for bvec backed iov_iters.

[Joseph] Resolve conflicts since we don't have:
81ba6abd2bcd "block: loop: mark bvec as ITER_BVEC_FLAG_NO_REF"
7321ecbfc7cf "block: change how we get page references in bio_iov_iter_get_pages"

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

03 Feb, 2020 3 commits

signal: remove the wrong signal_pending() check in restore_user_sigmask() · 3e25e056

Committed by Oleg Nesterov on Jun 28, 2019

commit 97abc889ee296faf95ca0e978340fb7b942a3e32 upstream.

This is the minimal fix for stable, I'll send cleanups later.

Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
the visible change which breaks user-space: a signal temporarily unblocked
by set_user_sigmask() can be delivered even if the caller returns
success or timeout.

Change restore_user_sigmask() to accept the additional "interrupted"
argument which should be used instead of signal_pending() check, and
update the callers.

Eric said:

: For clarity.  I don't think this is required by posix, or fundamentally to
: remove the races in select.  It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented.  The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?

Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org>	[5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

uio: make import_iovec()/compat_import_iovec() return bytes on success · c93ad6cc

Committed by Jens Axboe on May 14, 2019

commit 87e5e6dab6c2a21fab2620f37786276d202e2ce0 upstream.

Currently these functions return < 0 on error, and 0 for success.
Change that so that we return < 0 on error, but number of bytes
for success.

Some callers already treat the return value that way, others need a
slight tweak.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

fs: add sync_file_range() helper · bce27885

Committed by Jens Axboe on Apr 09, 2019

commit 22f96b3808c12a218e9a3bce6e1bfbd74efbe374 upstream.

This just pulls out the ksys_sync_file_range() code to work on a struct
file instead of an fd, so we can use it elsewhere.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

19 Jan, 2020 4 commits

perf/smmuv3: Enable HiSilicon Erratum 162001800 quirk · 8c6138de

Committed by Shameer Kolothum on Mar 26, 2019

commit 24062fe85860debfdae0eeaa495f27c9971ec163 upstream

HiSilicon erratum 162001800 describes the limitation of
SMMUv3 PMCG implementation on HiSilicon Hip08 platforms.

On these platforms, the PMCG event counter registers
(SMMU_PMCG_EVCNTRn) are read only and as a result it
is not possible to set the initial counter period value
on event monitor start.

To work around this, the current value of the counter
is read and used for delta calculations. OEM information
from ACPI header is used to identify the affected hardware
platforms.
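The workaround can be pictured as follows: since the counter cannot be preset, remember the value read at start and compute later readings as a delta from it. This is a minimal sketch, not the driver code; the function name is hypothetical, and both the 32-bit default width and the wraparound masking are assumptions that the commit message does not spell out.

```python
def counter_delta(prev, now, width_bits=32):
    """Events counted between two raw reads of a free-running counter.

    Assumption: the counter wraps at 2**width_bits, so the difference
    is reduced modulo that width.
    """
    mask = (1 << width_bits) - 1
    return (now - prev) & mask


print(counter_delta(100, 250))       # 150
print(counter_delta(0xFFFFFFFE, 5))  # 7, across a wraparound
```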

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
[will: update silicon-errata.txt and add reason string to acpi match]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

ACPI/IORT: Add support for PMCG · c54f184c

Committed by Neil Leeder on Mar 26, 2019

commit 24e516049360eda85cf3fe9903221d43886c2689 upstream.

Add support for the SMMU Performance Monitor Counter Group
information from ACPI. This is in preparation for its use
in the SMMUv3 PMU driver.

Signed-off-by: Neil Leeder <nleeder@codeaurora.org>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

mm/hotplug: make remove_memory() interface usable · 27d25b17

Committed by Pavel Tatashin on Jul 16, 2019

commit eca499ab3749a4537dee77ffead47a1a2c0dee19 upstream

Presently the remove_memory() interface is inherently broken.  It tries
to remove memory but panics if some memory is not offline.  The problem
is that it is impossible to ensure that all memory blocks are offline as
this function also takes lock_device_hotplug that is required to change
memory state via sysfs.

So, between calling this function and offlining all memory blocks there
is always a window when lock_device_hotplug is released, and therefore,
there is always a chance for a panic during this window.

Make this interface return an error if memory removal fails.  This
way it is safe to call this function without panicking the machine, and it
also becomes symmetric to add_memory(), which already returns an error.

Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: yinhe <yinhe@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · e060ade6

Committed by David Hildenbrand on Oct 30, 2018

commit d15e59260f62bd5e0f625cf5f5240f6ffac78ab6 upstream

Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.

Reading through the code and studying how mem_hotplug_lock is to be used,
I noticed that there are two places where we can end up calling
device_online()/device_offline() - online_pages()/offline_pages() without
the mem_hotplug_lock.  And there are other places where we call
device_online()/device_offline() without the device_hotplug_lock.

While e.g.
	echo "online" > /sys/devices/system/memory/memory9/state
is fine, e.g.
	echo 1 > /sys/devices/system/memory/memory9/online
will not take the mem_hotplug_lock. It does, however, take the
device_lock() and device_hotplug_lock.

E.g.  via memory_probe_store(), we can end up calling
add_memory()->online_pages() without the device_hotplug_lock.  So we can
have concurrent callers in online_pages().  We e.g.  touch in
online_pages() basically unprotected zone->present_pages then.

Looks like there is a longer history to that (see Patch #2 for details),
and fixing it to work the way it was intended is not really possible.  We
would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
sounds wrong.

Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
More details can be found in patch 3 and patch 6.

I propose the general rules (documentation added in patch 6):

1. add_memory/add_memory_resource() must only be called with
   device_hotplug_lock.
2. remove_memory() must only be called with device_hotplug_lock. This is
   already documented and holds for all callers.
3. device_online()/device_offline() must only be called with
   device_hotplug_lock. This is already documented and true for now in core
   code. Other callers (related to memory hotplug) have to be fixed up.
4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
   online_pages/offline_pages.

To me, this looks way cleaner than what we have right now (and easier to
verify).  And looking at the documentation of remove_memory, using
lock_device_hotplug also for add_memory() feels natural.

This patch (of 6):

remove_memory() is exported right now but requires the
device_hotplug_lock, which is not exported.  So let's provide a variant
that takes the lock and only export that one.

The lock is already held in
	arch/powerpc/platforms/pseries/hotplug-memory.c
	drivers/acpi/acpi_memhotplug.c
	arch/powerpc/platforms/powernv/memtrace.c

Apart from that, there are no other users in the tree.

Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: yinhe <yinhe@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

15 Jan, 2020 9 commits

mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections · 78c5482c

Committed by Alexander Duyck on May 13, 2019

commit 0e56acae4b4dd4a9fbe897854ab83a109e2a9e11 upstream.

Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
use it to support initializing and freeing pages in groups no larger than
MAX_ORDER_NR_PAGES.  By doing this we can greatly improve the cache
locality of the pages while we do several loops over them in the init and
freeing process.
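The chunking idea can be illustrated with a small generator that splits a pfn range into pieces no larger than MAX_ORDER_NR_PAGES. This is a sketch under stated assumptions rather than the kernel logic: the function name is hypothetical, the value 1024 is only an example, and aligning chunk boundaries to multiples of the chunk size is an assumption.

```python
MAX_ORDER_NR_PAGES = 1024  # assumption: example value only


def pfn_chunks(start_pfn, end_pfn, step=MAX_ORDER_NR_PAGES):
    """Yield (start, end) pfn pairs covering [start_pfn, end_pfn),
    each at most `step` pages and ending on a `step`-aligned boundary
    where possible.
    """
    pfn = start_pfn
    while pfn < end_pfn:
        # Next aligned boundary, capped at the end of the range.
        nxt = min((pfn // step + 1) * step, end_pfn)
        yield pfn, nxt
        pfn = nxt


print(list(pfn_chunks(1000, 3000)))
# [(1000, 1024), (1024, 2048), (2048, 3000)]
```

Working through one such chunk completely (init, then free) before moving to the next is what keeps the touched struct pages warm in cache across the several passes.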

We are able to tighten the loops further as a result of the "from"
iterator as we can perform the initial checks for first_init_pfn in our
first call to the iterator, and continue without the need for those checks
via the "from" iterator.  I have added this functionality in the function
called deferred_init_mem_pfn_range_in_zone that primes the iterator and
causes us to exit if we encounter any failure.

On my x86_64 test system with 384GB of memory per node I saw a reduction
in initialization time from 1.85s to 1.38s as a result of this patch.

Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <yi.z.zhang@linux.intel.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

mm: implement new zone specific memblock iterator · cff9acae

Committed by Alexander Duyck on May 13, 2019

commit 837566e7e08e3f89444166444836a8a49b9f9322 upstream.

Introduce a new iterator for_each_free_mem_pfn_range_in_zone.

This iterator will take care of making sure a given memory range provided
is in fact contained within a zone.  It takes care of all the bounds
checking we were doing in deferred_grow_zone, and deferred_init_memmap.
In addition it should help to speed up the search a bit by iterating until
the end of a range is greater than the start of the zone pfn range, and
will exit completely if the start is beyond the end of the zone.
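The behaviour described above (skip ranges that end before the zone, clamp ranges that straddle its edges, stop once a range starts past the zone end) can be modelled as a generator. This is an illustrative sketch, not the memblock iterator itself: the function name and the representation of ranges as sorted, non-overlapping half-open (start_pfn, end_pfn) tuples are assumptions.

```python
def free_ranges_in_zone(free_ranges, zone_start, zone_end):
    """Yield the parts of each free pfn range that lie inside the zone.

    Assumes `free_ranges` is sorted by start pfn and non-overlapping.
    """
    for start, end in free_ranges:
        if end <= zone_start:
            continue  # range lies entirely before the zone: keep searching
        if start >= zone_end:
            break     # range starts past the zone: nothing further can match
        yield max(start, zone_start), min(end, zone_end)


ranges = [(0, 10), (20, 40), (50, 70), (90, 100)]
print(list(free_ranges_in_zone(ranges, 30, 60)))
# [(30, 40), (50, 60)]
```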

Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <yi.z.zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

mm: use mm_zero_struct_page from SPARC on all 64b architectures · e258d861

Committed by Alexander Duyck on May 13, 2019

commit 5470dea49f5382257c242ac617d908267727f1a8 upstream.

Patch series "Deferred page init improvements", v7.

This patchset is essentially a refactor of the page initialization logic
that is meant to provide for better code reuse while providing a
significant improvement in deferred page initialization performance.

In my testing on an x86_64 system with 384GB of RAM I have seen the
following.  In the case of regular memory initialization the deferred init
time was decreased from 3.75s to 1.38s on average.  This amounts to a 172%
improvement for the deferred memory initialization performance.
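The 172% figure follows if "improvement" is read as the gain in initialization rate rather than the fraction of time saved. The short calculation below makes both readings explicit; the helper names are hypothetical and the interpretation is an assumption about how the number was derived.

```python
def speedup_percent(before_s, after_s):
    """Rate improvement: how much more work per second after the change."""
    return (before_s / after_s - 1) * 100


def time_saved_percent(before_s, after_s):
    """Fraction of the original wall-clock time that was eliminated."""
    return (1 - after_s / before_s) * 100


print(round(speedup_percent(3.75, 1.38)))     # 172
print(round(time_saved_percent(3.75, 1.38)))  # 63
```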

I have called out the improvement observed with each patch.

This patch (of 4):

Use the same approach that was already in use on Sparc on all the
architectures that support a 64b long.

This is mostly motivated by the fact that 7 to 10 store/move instructions
are likely always going to be faster than having to call into a function
that is not specialized for handling page init.

An added advantage to doing it this way is that the compiler can get away
with combining writes in the __init_single_page call.  As a result the
memset call will be reduced to only about 4 write operations, or at least
that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
count/mapcount seem to be cancelling out at least 4 of the 8 assignments
on my system.

One change I had to make to the function was to reduce the minimum page
size to 56 to support some powerpc64 configurations.

This change should introduce no change on SPARC since it already had this
code.  In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
initializing 384GB of RAM per node.  Pavel Tatashin tested on a system
with Broadcom's Stingray CPU and 48GB of RAM and found that
__init_single_page() takes 19.30ns / 64-byte struct page before this patch
and with this patch it takes 17.33ns / 64-byte struct page.  Mike Rapoport
ran a similar test on an OpenPower (S812LC 8348-21C) with Power8 processor
and 128GB of RAM.  His results per 64-byte struct page were 4.68ns before,
and 4.59ns after this patch.
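As a sanity check on the figures quoted in this message, the percentages follow from the ratio of the old time to the new time. The sketch below is not part of the patch; the helper name `improvement_pct` and the `MEASUREMENTS` table are made up for illustration, and only the timings themselves come from the text above.

```python
def improvement_pct(before: float, after: float) -> float:
    """Return how much faster `after` is than `before`, as a percentage.

    A drop from 3.75s to 1.38s means the work now runs 3.75 / 1.38 times
    as fast, which is roughly 172% faster.
    """
    if before <= 0 or after <= 0:
        raise ValueError("timings must be positive")
    return (before / after - 1.0) * 100.0


# Timings quoted in the commit message: (label, before, after).
MEASUREMENTS = [
    ("whole series, x86_64 deferred init (s)", 3.75, 1.38),
    ("this patch, x86_64 deferred init (s)", 3.75, 2.80),
    ("Stingray __init_single_page (ns)", 19.30, 17.33),
    ("Power8 __init_single_page (ns)", 4.68, 4.59),
]

if __name__ == "__main__":
    for label, before, after in MEASUREMENTS:
        print(f"{label}: {improvement_pct(before, after):.1f}%")
```

Read this way, the "172%" figure describes the whole series, while this patch alone accounts for the 3.75s to 2.80s step.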

Link: http://lkml.kernel.org/r/20190405221213.12227.9392.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <yi.z.zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e258d861

blk-mq: not embed .mq_kobj and ctx->kobj into queue instance · 1c230acb

Committed by Ming Lei on Nov 20, 2018

commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d upstream.

Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime
from the block layer's view, in practice they don't, because userspace may
grab a kobject at any time via sysfs.

This patch fixes the issue by the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
.mq_kobj is always released after all ctxs are freed.
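The ordering guarantee in step 3 can be seen in a toy reference-counting model. This is only a sketch in Python, not kernel code: the `Kobj` class and `build_and_teardown()` are hypothetical stand-ins, and the only property being modelled is that each ctx holds one reference on the parent object and drops it from its own release handler.

```python
class Kobj:
    """Toy stand-in for a kobject: a name, a refcount, a release callback."""

    def __init__(self, name, release):
        self.name = name
        self.refs = 1            # the creator holds the initial reference
        self._release = release

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self._release(self)


def build_and_teardown(nr_ctx=2):
    """Return the order in which the release handlers run."""
    released = []

    # Stand-in for .mq_kobj; its release is where all ctxs would be freed.
    mq = Kobj("mq_kobj", lambda k: released.append(k.name))

    def release_ctx(k):
        released.append(k.name)
        mq.put()                 # drop the reference taken for this ctx

    ctxs = []
    for i in range(nr_ctx):
        mq.get()                 # step 3: one ref on mq_kobj per ctx kobject
        ctxs.append(Kobj(f"ctx{i}", release_ctx))

    mq.put()                     # the queue drops its own reference first
    for ctx in ctxs:             # holders of the ctx kobjects let go later
        ctx.put()
    return released


if __name__ == "__main__":
    print(build_and_teardown())
```

Even though the parent's own reference is dropped first, its release handler runs last, which is the property the patch relies on.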

This patch fixes a kernel panic during boot when DEBUG_KOBJECT_RELEASE
is enabled.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

1c230acb

mm/memblock.c: skip kmemleak for kasan_init() · aae2160b

Committed by Qian Cai on Dec 28, 2018

commit fed84c78527009d4f799a3ed9a566502fa026d82 upstream.

Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
Huawei TaiShan 2280 aarch64 servers).

After calling start_kernel()->setup_arch()->kasan_init(), the kmemleak early
log buffer went from something like 280 to 260000, which caused kmemleak to
be disabled and the crash dump memory reservation to fail.  The multitude of
kmemleak_alloc() calls comes from nested loops while KASAN is setting up full
memory mappings, so let early kmemleak allocations skip those
memblock_alloc_internal() calls that come from kasan_init(), given that those
early KASAN memory mappings should not reference other memory.  Hence,
no kmemleak false positives.

kasan_init
  kasan_map_populate [1]
    kasan_pgd_populate [2]
      kasan_pud_populate [3]
        kasan_pmd_populate [4]
          kasan_pte_populate [5]
            kasan_alloc_zeroed_page
              memblock_alloc_try_nid
                memblock_alloc_internal
                  kmemleak_alloc

[1] for_each_memblock(memory, reg)
[2] while (pgdp++, addr = next, addr != end)
[3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
[4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
[5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))

Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us
Signed-off-by: Qian Cai <cai@gmx.us>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

aae2160b

alinux: jbd2: track slow handle which is preventing transaction committing · 861575c9

Committed by Xiaoguang Wang on Jan 09, 2020

When a transaction is about to commit, it first sets its state to
T_LOCKED and waits for all outstanding handles to complete. The
committing transaction stays in the locked state as long as it
has outstanding handles; meanwhile the whole fs is locked and all later
fs modification operations are stuck in wait_transaction_locked().

It's hard to tell why handles are that slow, so here we add a new static
tracepoint to track such slow handles and show their io wait time and sched
wait time. The output looks like this:
  fsstress-20347 [024] ....  1570.305454: jbd2_slow_handle_stats: dev 254,17
tid 15853 type 4 line_no 3101 interval 126 sync 0 requested_blocks 24
dirtied_blocks 0 trans_wait 122 space_wait 0 sched_wait 0 io_wait 126

"trans_wait 122" means that the currently committing transaction has been
locked for 122ms because this handle did not complete quickly.

From "io_wait 126", we can see that io is the major reason.
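For anyone post-processing these trace lines, a small parser might look like the sketch below. It assumes the payload after `jbd2_slow_handle_stats:` is a flat sequence of key/value pairs, exactly as in the sample above; `parse_slow_handle()` and `dominant_wait()` are hypothetical helper names, not anything provided by the patch.

```python
# The sample line from the commit message, joined back into one line.
SAMPLE = (
    "  fsstress-20347 [024] ....  1570.305454: jbd2_slow_handle_stats: dev 254,17 "
    "tid 15853 type 4 line_no 3101 interval 126 sync 0 requested_blocks 24 "
    "dirtied_blocks 0 trans_wait 122 space_wait 0 sched_wait 0 io_wait 126"
)


def parse_slow_handle(line: str) -> dict:
    """Split one jbd2_slow_handle_stats trace line into a dict of fields."""
    _, marker, payload = line.partition("jbd2_slow_handle_stats:")
    if not marker:
        raise ValueError("not a jbd2_slow_handle_stats line")
    tokens = payload.split()
    if len(tokens) % 2:
        raise ValueError("expected key/value pairs")
    fields = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        # "dev" is a "major,minor" pair; everything else is an integer.
        fields[key] = value if key == "dev" else int(value)
    return fields


def dominant_wait(fields: dict) -> str:
    """Name whichever of sched_wait and io_wait is larger (io_wait on a tie)."""
    return "io_wait" if fields["io_wait"] >= fields["sched_wait"] else "sched_wait"


if __name__ == "__main__":
    stats = parse_slow_handle(SAMPLE)
    print(stats["trans_wait"], dominant_wait(stats))
```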

In this patch, we also add a per fs control file used to determine
whether a handle can be considered to be slow.
    /proc/fs/jbd2/vdb1-8/stall_thresh
The default value is 100ms; users can set a new threshold by echoing a new
value to this file.

Later I also plan to add a per-fs proc file to record this info.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

861575c9

alinux: fs: record page or bio info while process is waiting on it · 79209707

Committed by Xiaoguang Wang on Nov 07, 2019

If a process context is stuck in wait_on_buffer(), lock_buffer(),
lock_page(), wait_on_page_writeback() or wait_on_bit_io(), it's
hard to tell the true reason, for example, whether this page is under io,
or this page is just locked for too long by another process context.

Normally an io request has multiple bios, and every bio contains multiple
pages which hold data to be read from or written to the device, so here
we record page info or bio info in task_struct while the process calls
lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer()
or wait_on_bit_io(). We add a new proc interface:
[lege@localhost linux]$ cat /proc/4516/wait_res
1 ffffd0969f95d3c0 4295369599 4295381596

The above info means that thread 4516 is waiting on a page whose address is
ffffd0969f95d3c0, and that it has waited for 11997ms.

In the sample, the first field (1) indicates that the wait is on a page and
the second field is the address of the page the process is waiting on.
The third field is the moment the wait began and the fourth is the current
moment.
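A monitoring script could turn that line into a wait duration along these lines. This is a sketch under assumptions: the four-field layout is inferred from the sample output (a kind marker, a hex address, then two timestamps), and the difference of the two timestamps is treated as milliseconds because that is how the message reads it. `WaitRes` and `parse_wait_res()` are hypothetical names.

```python
from typing import NamedTuple


class WaitRes(NamedTuple):
    kind: int        # leading field of the sample (1 in the page example)
    address: int     # address of the page or bio being waited on
    wait_start: int  # moment the wait began
    now: int         # moment the file was read

    @property
    def waited(self) -> int:
        """Elapsed time, in the same unit as the two timestamps."""
        return self.now - self.wait_start


def parse_wait_res(text: str) -> WaitRes:
    """Parse one line in the sample's layout: 'kind hexaddr start now'."""
    parts = text.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    kind, address, start, now = parts
    return WaitRes(int(kind), int(address, 16), int(start), int(now))


if __name__ == "__main__":
    res = parse_wait_res("1 ffffd0969f95d3c0 4295369599 4295381596")
    print(f"waiting on {res.address:x} for {res.waited}ms")
```

For the sample line this gives 4295381596 - 4295369599 = 11997, matching the 11997ms quoted in the message.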

In practice, if we find a process waiting on one page for too long,
we can get the page's address by reading /proc/$pid/wait_res, and search for this
page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang.
If the search hits, we can get the request and learn why this io
request has hung for so long.

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

79209707

alinux: blk: add iohang check function · e036d088

Committed by Xiaoguang Wang on Oct 11, 2019

Background:
  We do not have a dependable block layer interface to determine whether
a block device has io requests which have not completed for a
long time. Currently we have the 'in_flight' interface, which counts the number
of I/O requests that have been issued to the device driver but have
not yet completed. It does not include I/O requests that are in the
queue but not yet issued to the device driver, which means it will not
count io requests that are stuck in the block layer.
  Also, if io requests are steadily issued to the device driver,
'in_flight' may always be non-zero, and you cannot determine whether
there is one io request which has not completed for too long.

Solution:
  To find io requests which have not completed for too long, here we
add 3 new interfaces:
  /sys/block/vdb/queue/hang_threshold
If an io request has been running for longer than this value, count
this io as hung.

  /sys/block/vdb/hang
Shows the hang counters for read/write io requests.

  /sys/kernel/debug/block/vdb/rq_hang
Shows detailed info for all hung io requests, like below:
  ffff97db96301200 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=STARTED|
ELVPRIV|IO_STAT|STATS, .state=in_flight, .tag=30, .internal_tag=169,
.start_time_ns=140634088407, .io_start_time_ns=140634102958,
.current_time=146497371953, .bio = ffff97db91e8e000,
.bio_pages = { ffffd096a0602540 }, .bio = ffff97db91e8ec00,
.bio_pages = { ffffd096a070eec0 }, .bio = ffff97db91e8f600,
.bio_pages = { ffffd096a0424cc0 }, .bio = ffff97db91e8f300,
.bio_pages = { ffffd096a0600a80 }}

With above info, we can easily see this request's latency distribution,
and see next patch for bio_pages's usage.
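The latency breakdown can be computed from the three timestamps in the entry. The sketch below assumes the entry text has been joined into a single string and that the field names appear exactly as in the sample; `parse_rq_hang()` is a hypothetical helper, and the split into "queued" (start_time_ns to io_start_time_ns) and "issued" (io_start_time_ns to current_time) is one reading of those fields rather than something the message spells out.

```python
import re

# The sample entry from the commit message, joined into one string.
SAMPLE = (
    "ffff97db96301200 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=STARTED|"
    "ELVPRIV|IO_STAT|STATS, .state=in_flight, .tag=30, .internal_tag=169, "
    ".start_time_ns=140634088407, .io_start_time_ns=140634102958, "
    ".current_time=146497371953, .bio = ffff97db91e8e000, "
    ".bio_pages = { ffffd096a0602540 }, .bio = ffff97db91e8ec00, "
    ".bio_pages = { ffffd096a070eec0 }, .bio = ffff97db91e8f600, "
    ".bio_pages = { ffffd096a0424cc0 }, .bio = ffff97db91e8f300, "
    ".bio_pages = { ffffd096a0600a80 }}"
)

_TIME_RE = re.compile(
    r"\.(start_time_ns|io_start_time_ns|current_time)\s*=\s*(\d+)"
)
_PAGES_RE = re.compile(r"\.bio_pages\s*=\s*\{([^}]*)\}")


def parse_rq_hang(entry: str) -> dict:
    """Extract the timestamps and page addresses from one rq_hang entry."""
    times = {name: int(value) for name, value in _TIME_RE.findall(entry)}
    for name in ("start_time_ns", "io_start_time_ns", "current_time"):
        if name not in times:
            raise ValueError(f"missing field: {name}")
    pages = []
    for group in _PAGES_RE.findall(entry):
        pages.extend(re.findall(r"[0-9a-f]+", group))
    return {
        "queued_ns": times["io_start_time_ns"] - times["start_time_ns"],
        "issued_ns": times["current_time"] - times["io_start_time_ns"],
        "total_ns": times["current_time"] - times["start_time_ns"],
        "pages": pages,
    }


if __name__ == "__main__":
    info = parse_rq_hang(SAMPLE)
    print(f"total {info['total_ns'] / 1e9:.2f}s, queued {info['queued_ns']}ns")
    print("pages:", ", ".join(info["pages"]))
```

For the sample, nearly all of the roughly 5.86 seconds falls after io_start_time_ns, and the `pages` list is what one would match against a page address obtained elsewhere.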

Note, /sys/kernel/debug/block/vdb/rq_hang only exists for blk-mq device drivers
and needs CONFIG_BLK_DEBUG_FS enabled.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

e036d088

alinux: mm: thp: remove deferred split queue from mem_cgroup · bb09ae16

Committed by Caspar Zhang on Jan 15, 2020

In commit 295949f0 ("alinux: mm: thp: move deferred split queue
to memcg's nodeinfo"), we unexpectedly failed to remove the deferred split
queue from mem_cgroup. Fix it by removing it manually.

Fixes: 295949f0 ("alinux: mm: thp: move deferred split queue to memcg's nodeinfo")
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

bb09ae16

14 Jan, 2020 9 commits

tcp: Add TCP_INFO counter for packets received out-of-order · ad8cde51

Committed by Thomas Higdon on Sep 13, 2019

commit f9af2dbbfe01def62765a58af7fbc488351893c3 upstream

For receive-heavy cases on the server-side, we want to track the
connection quality for individual client IPs. This counter, similar to
the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
tracks out-of-order packet reception. By providing this counter in
TCP_INFO, it will allow understanding to what degree receive-heavy
sockets are experiencing out-of-order delivery and packet drops
indicating congestion.

Please note that this is similar to the counter in NetBSD TCP_INFO, and
has the same name.

Also note that we avoid increasing the size of the tcp_sock struct by
taking advantage of a hole.
Signed-off-by: Thomas Higdon <tph@fb.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Dust Li <dust.li@linux.alibaba.com>

ad8cde51

mm, memcg: introduce memory.events.local · 5fcab459

Committed by Shakeel Butt on Dec 16, 2019

commit 1e577f970f66a53d429cbee37b36177c9712f488 upstream.

The memory controller in cgroup v2 exposes memory.events file for each
memcg which shows the number of times events like low, high, max, oom
and oom_kill have happened for the whole tree rooted at that memcg.
Users can also poll or register notification to monitor the changes in
that file. Any event at any level of the tree rooted at memcg will
notify all the listeners along the path till root_mem_cgroup. There are
existing users which depend on this behavior.

However there are users which are only interested in the events
happening at a specific level of the memcg tree and not in the events in
the underlying tree rooted at that memcg. One such use-case is a
centralized resource monitor which can dynamically adjust the limits of
the jobs running on a system. The jobs can create their sub-hierarchy
for their own sub-tasks. The centralized monitor is only interested in
the events at the top level memcgs of the jobs as it can then act and
adjust the limits of the jobs. Using the current memory.events for such
centralized monitor is very inconvenient. The monitor will keep
receiving events in which it is not interested, and to find out whether a
received event is interesting, it has to read the memory.events files of the
next level and compare them with the top level one. So, let's introduce
memory.events.local to the memcg, which shows and notifies for the events
at the memcg level.
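From the monitor's side, both files use the same flat "name value" layout, so telling local events from inherited ones is a matter of subtracting two dictionaries. The following is a sketch only: `parse_events()`, `read_events()` and `descendant_events()` are hypothetical helpers, and the cgroup directory passed to `read_events()` is whatever path the monitor already tracks.

```python
from pathlib import Path


def parse_events(text: str) -> dict:
    """Parse the 'name value' lines of memory.events or memory.events.local."""
    events = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, value = line.split()
        events[name] = int(value)
    return events


def read_events(cgroup_dir: str, local: bool = False) -> dict:
    """Read one of the two files from a cgroup v2 directory."""
    name = "memory.events.local" if local else "memory.events"
    return parse_events((Path(cgroup_dir) / name).read_text())


def descendant_events(hierarchical: dict, local: dict) -> dict:
    """Counts present in memory.events that did not occur at this level."""
    return {
        name: count - local.get(name, 0)
        for name, count in hierarchical.items()
    }


if __name__ == "__main__":
    tree = parse_events("low 0\nhigh 0\nmax 7\noom 1\noom_kill 1\n")
    own = parse_events("low 0\nhigh 0\nmax 2\noom 0\noom_kill 0\n")
    print(descendant_events(tree, own))
```

With memory.events.local available, the monitor can skip this subtraction entirely and watch only the local file of each top-level job memcg.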

Now, do memory.stat and memory.pressure need their local versions? IMHO
no, due to the no-internal-process constraint of cgroup v2. The
memory.stat file of the top level memcg of a job shows the stats and
vmevents of the whole tree. The local stats or vmevents of the top level
memcg will only change if there is a process running in that memcg but v2
does not allow that. Similarly for memory.pressure there will not be any
process in the internal nodes and thus no chance of local pressure.

Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

5fcab459

mm, memcg: consider subtrees in memory.events · 0d9e08d3

Committed by Chris Down on Dec 16, 2019

commit 9852ae3fe5293264f01c49f2571ef7688f7823ce upstream.

memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.

The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files.  For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour.  The same confusion applies to other counters in this file when
debugging issues.

Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.

After this patch, events are propagated up the hierarchy:

    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 0
    oom 0
    oom_kill 0
    [root@ktst ~]# systemd-run -p MemoryMax=1 true
    Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 7
    oom 1
    oom_kill 1
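The write-time aggregation described above can be modelled in a few lines. This is a simplified sketch, not the kernel implementation: the `Memcg` class is hypothetical, the unit name `run.service` is a placeholder for the transient unit in the transcript, and the model walks all the way up the parent chain without treating the root specially.

```python
class Memcg:
    """Toy memcg node that aggregates event counters when they are recorded."""

    EVENTS = ("low", "high", "max", "oom", "oom_kill")

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.local = dict.fromkeys(self.EVENTS, 0)   # events at this level only
        self.events = dict.fromkeys(self.EVENTS, 0)  # this level plus subtree

    def record(self, event, count=1):
        """Count an event here and propagate it to every ancestor."""
        self.local[event] += count
        node = self
        while node is not None:
            # Updating ancestors at write time keeps reads cheap; this is
            # also the natural place to notify listeners on each level.
            node.events[event] += count
            node = node.parent


if __name__ == "__main__":
    root = Memcg("/")
    system_slice = Memcg("system.slice", parent=root)
    unit = Memcg("run.service", parent=system_slice)

    unit.record("max", 7)
    unit.record("oom")
    unit.record("oom_kill")

    # The parent sees the child's events without having any of its own.
    print(system_slice.events)
    print(system_slice.local)
```

Reading a counter is then a plain lookup at any level, at the cost of one increment per ancestor whenever an event fires.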

As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set.  However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.

akpm: this is a behaviour change, so Cc:stable.  This is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.

[xuyu: remove the new memory_localevents mount option because it is
rarely used]

Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

0d9e08d3

mm: introduce ARCH_HAS_PTE_DEVMAP · 7569cab3

Committed by Robin Murphy on Jul 16, 2019

commit 175967318c3018d01931ac950c82adab5deb47ca upstream.

ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
with the long-out-of-date comment can lead to the impression that an
architecture may just enable it (since __add_pages() now "comprehends
device memory" for itself) and expect things to work.

In practice, however, ZONE_DEVICE users have little chance of
functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
dependency so the real situation is clearer.

Link: http://lkml.kernel.org/r/87554aa78478a02a63f2c4cf60a847279ae3eb3b.1558547956.git.robin.murphy@arm.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shannon Zhao <shannon.zhao@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

7569cab3

linkage: add generic GLOBAL() macro · 01356c64

Committed by Mark Rutland on Nov 15, 2018

commit ad697a1aecac19ec351063b5d8e6fc9d4bca7ee5 upstream.

Declaring a global symbol in assembly is tedious, error-prone, and
painful to read. While ENTRY() exists, this is supposed to be used for
function entry points, and this affects alignment in a potentially
undesirable manner.

Instead, let's add a generic GLOBAL() macro for this, as x86 added
locally in commit:

  95695547 ("x86: asm linkage - introduce GLOBAL macro")

... thus allowing us to use this more freely in the kernel.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Torsten Duwe <duwe@suse.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>

01356c64

compiler.h: add CC_USING_PATCHABLE_FUNCTION_ENTRY · 4b73556a

Committed by Sven Schnelle on Jun 05, 2019

commit 2809b392a62ae307da058a52d451b2fc3ce4de7e upstream.

This can be used for architectures implementing dynamic
ftrace via -fpatchable-function-entry.
Signed-off-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>

4b73556a

module/ftrace: handle patchable-function-entry · 1c78bf00

Committed by Mark Rutland on Oct 16, 2019

backport from a1326b17ac03a9012cb3d01e434aacb4d67a416c upstream

When using patchable-function-entry, the compiler will record the
callsites into a section named "__patchable_function_entries" rather
than "__mcount_loc". Let's abstract this difference behind a new
FTRACE_CALLSITE_SECTION, so that architectures don't have to handle this
explicitly (e.g. with custom module linker scripts).

As parisc currently handles this explicitly, it is fixed up accordingly,
with its custom linker script removed. Since FTRACE_CALLSITE_SECTION is
only defined when DYNAMIC_FTRACE is selected, the parisc module loading
code is updated to only use the definition in that case. When
DYNAMIC_FTRACE is not selected, modules shouldn't have this section, so
this removes some redundant work in that case.

To make sure that this is kept up-to-date for modules and the main
kernel, a comment is added to vmlinux.lds.h, with the existing ifdeffery
simplified for legibility.

I built parisc generic-{32,64}bit_defconfig with DYNAMIC_FTRACE enabled,
and verified that the section made it into the .ko files for modules.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Helge Deller <deller@gmx.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Torsten Duwe <duwe@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>

1c78bf00

ftrace: add ftrace_init_nop() · 3dd21835

Committed by Mark Rutland on Oct 16, 2019

commit fbf6c73c5b264c25484fa9f449b5546569fe11f0 upstream

Architectures may need to perform special initialization of ftrace
callsites, and today they do so by special-casing ftrace_make_nop() when
the expected branch address is MCOUNT_ADDR. In some cases (e.g. for
patchable-function-entry), we don't have an mcount-like symbol and don't
want a synthetic MCOUNT_ADDR, but we may need to perform some
initialization of callsites.

To make it possible to separate initialization from runtime
modification, and to handle cases without an mcount-like symbol, this
patch adds an optional ftrace_init_nop() function that architectures can
implement, which does not pass a branch address.

Where an architecture does not provide ftrace_init_nop(), we will fall
back to the existing behaviour of calling ftrace_make_nop() with
MCOUNT_ADDR.

At the same time, ftrace_code_disable() is renamed to
ftrace_nop_initialize() to make it clearer that it is intended to
initialize a callsite into a disabled state, and is not for disabling a
callsite that has been runtime enabled. The kerneldoc description of rec
arguments is updated to cover non-mcount callsites.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Torsten Duwe <duwe@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>

3dd21835

spi: Optionally use GPIO descriptors for CS GPIOs · 068726e1

Committed by Linus Walleij on Jan 07, 2019

commit f3186dd876697e696d07136623d5cf0a6fb0bc0f upstream

This augments the SPI core to optionally use GPIO descriptors
for chip select on a per-master-driver opt-in basis.

Drivers using this will rely on the SPI core to look up
GPIO descriptors associated with the device, such as
when using device tree or board files with GPIO descriptor
tables.

When getting descriptors from the device tree, this will in
turn activate the code in gpiolib that was
added in commit 6953c57ab172
("gpio: of: Handle SPI chipselect legacy bindings")
which means that these descriptors are aware of the active
low semantics that is the default for SPI CS GPIO lines
and we can assume that all of these are "active high" and
thus assign SPI_CS_HIGH to all CS lines on the DT path.

The previously used gpio_set_value() would call down into
gpiod_set_raw_value() and ignore the polarity inversion
semantics.

It seems like many drivers go to great lengths to set up the
CS GPIO line as non-asserted, respecting SPI_CS_HIGH. We pull
this out of the SPI drivers and into the core by simply
requesting the line as GPIOD_OUT_LOW when retrieving it from
the device and relying on gpiolib to handle any inversion
semantics. This way a lot of code can be simplified and
removed in each converted driver.

The end goal after dealing with each driver in turn, is to
delete the non-descriptor path (of_spi_register_master() for
example) and let the core deal with only descriptors.

The different SPI drivers have complex interactions with the
core so we cannot simply change them all over, we need to use
a stepwise, bisectable approach so that each driver can be
converted and fixed in isolation.

This patch has the intended side effect of adding support for
ACPI GPIOs as it starts relying on gpiod_get_*() to get
the GPIO handle associated with the device.

Cc: Linuxarm <linuxarm@huawei.com>
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Fangjian (Turing) <f.fangjian@huawei.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Reviewed-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>

068726e1

openanolis / cloud-kernel synced successfully about 2 years ago
