- 24 April 2020: 6 commits
-
-
Submitted by Yihao Wu
to #26424323

This patch makes cpuacct able to monitor the number of cross-CPU migrations. Example output:

  [root@caspar /sys/fs/cgroup/cpuacct] # cat cpuacct.proc_stat
  user 7727
  nice 4
  <snip>
  nr_migrations 48432

Signed-off-by: Zhu Yanhai <zhu.yanhai@linux.alibaba.com>
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
Submitted by Yihao Wu
to #26424323

The cpu cgroup's nr_running and nr_uninterruptible are useful for troubleshooting. Export them in cpuacct.proc_stat.

Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
Submitted by Shanpei Chen
to #26424323

For container-only cases, guest cputime is always 0, so previously we did not calculate it and returned 0 directly. However, when running a VM inside a cgroup, we expect the cgroup to maintain guest cputime correctly.

Tested-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
Submitted by Xunlei Pang
to #26424323 When "isolcpus=" is passed, skip all its accountings. Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com> Tested-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
-
Submitted by Xunlei Pang
to #26424323

Add the cgroup file "cpuacct.proc_stat"; we'll export per-cgroup cpu usage and some other scheduler statistics through this interface.

Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
Submitted by Xunlei Pang
to #26424323

It is relatively easy to maintain nr_uninterruptible in the scheduler compared to doing it in cpuacct. We assume that "cpu" and "cpuacct" are bound together, so the counter can be used for per-cgroup load. This will be needed to calculate per-cgroup load averages later.

Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
- 23 April 2020: 1 commit
-
-
Submitted by Mel Gorman
to #26255339

commit 5e1f0f098b4649fad53011246bcaeff011ffdf5d upstream

Compaction is inherently race-prone as a suitable page freed during compaction can be allocated by any parallel task. This patch uses a capture_control structure to isolate a page immediately when it is freed by a direct compactor in the slow path of the page allocator. The intent is to avoid redundant scanning.

                               5.0.0-rc1              5.0.0-rc1
                         selective-v3r17          capture-v3r19
  Amean fault-both-1      0.00 (  0.00%)        0.00 *   0.00%*
  Amean fault-both-3   2582.11 (  0.00%)     2563.68 (   0.71%)
  Amean fault-both-5   4500.26 (  0.00%)     4233.52 (   5.93%)
  Amean fault-both-7   5819.53 (  0.00%)     6333.65 (  -8.83%)
  Amean fault-both-12  9321.18 (  0.00%)     9759.38 (  -4.70%)
  Amean fault-both-18  9782.76 (  0.00%)    10338.76 (  -5.68%)
  Amean fault-both-24 15272.81 (  0.00%)    13379.55 *  12.40%*
  Amean fault-both-30 15121.34 (  0.00%)    16158.25 (  -6.86%)
  Amean fault-both-32 18466.67 (  0.00%)    18971.21 (  -2.73%)

Latency is only moderately affected but the devil is in the details. A closer examination indicates that base page fault latency is reduced but latency of huge pages is increased as it takes greater care to succeed. Part of the "problem" is that allocation success rates are close to 100% even when under pressure and compaction gets harder:

                               5.0.0-rc1             5.0.0-rc1
                         selective-v3r17         capture-v3r19
  Percentage huge-3     96.70 (  0.00%)      98.23 (   1.58%)
  Percentage huge-5     96.99 (  0.00%)      95.30 (  -1.75%)
  Percentage huge-7     94.19 (  0.00%)      97.24 (   3.24%)
  Percentage huge-12    94.95 (  0.00%)      97.35 (   2.53%)
  Percentage huge-18    96.74 (  0.00%)      97.30 (   0.58%)
  Percentage huge-24    97.07 (  0.00%)      97.55 (   0.50%)
  Percentage huge-30    95.69 (  0.00%)      98.50 (   2.95%)
  Percentage huge-32    96.70 (  0.00%)      99.27 (   2.65%)

And scan rates are reduced as expected, by 6% for the migration scanner and 29% for the free scanner, indicating that there is less redundant work.

  Compaction migrate scanned    20815362    19573286
  Compaction free scanned       16352612    11510663

[mgorman@techsingularity.net: remove redundant check]
Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
- 22 April 2020: 4 commits
-
-
Submitted by Christian Brauner
fix #27124689

commit 7f2923c4f73f21cfd714d12a2d48de8c21f11cfe upstream.

proc_get_long() is a funny function. It uses simple_strtoul() and for a good reason. proc_get_long() wants to always succeed the parse and return the maybe incorrect value and the trailing characters to check against a pre-defined list of acceptable trailing values. However, simple_strtoul() explicitly ignores overflows, which can cause funny things like the following to happen:

  echo 18446744073709551616 > /proc/sys/fs/file-max
  cat /proc/sys/fs/file-max
  0

(Which will cause your system to silently die behind your back.)

On the other hand kstrtoul() does do overflow detection but does not return the trailing characters, and also fails the parse when anything other than '\n' is a trailing character, whereas proc_get_long() wants to be more lenient.

Now, before adding another kstrtoul() function, let's simply add a static parse strtoul_lenient() which:
- fails on overflow with -ERANGE
- returns the trailing characters to the caller

The reason why we should fail on ERANGE is that we already do a partial fail on overflow right now, namely when the TMPBUFLEN is exceeded. So we already reject values such as 184467440737095516160 (21 chars) but accept values such as 18446744073709551616 (20 chars) even though both are overflows. So we should just always reject 64bit overflows and not special-case this based on the number of chars.

Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
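The intended semantics can be pictured with a small userspace sketch (an illustration of the behaviour described above, not the kernel's static helper in kernel/sysctl.c):

```c
/*
 * Hedged sketch: a "lenient" strtoul -- unlike simple_strtoul() it reports
 * overflow, and unlike kstrtoul() it hands the trailing characters back to
 * the caller via *endp for further validation.
 */
#include <errno.h>
#include <stdlib.h>

static int strtoul_lenient_sketch(const char *str, char **endp,
                                  unsigned int base, unsigned long *result)
{
        unsigned long val;

        errno = 0;
        val = strtoul(str, endp, base);
        if (errno == ERANGE)
                return -ERANGE;   /* overflow: reject instead of wrapping */

        *result = val;            /* trailing characters available via *endp */
        return 0;
}
```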
-
Submitted by Michael Wang
fix #26198889

commit 26cf52229efc87e2effa9d788f9b33c40fb3358a linux-next

During our testing, we found a case where shares no longer work correctly. The cgroup topology is like:

  /sys/fs/cgroup/cpu/A        (shares=102400)
  /sys/fs/cgroup/cpu/A/B      (shares=2)
  /sys/fs/cgroup/cpu/A/B/C    (shares=1024)

  /sys/fs/cgroup/cpu/D        (shares=1024)
  /sys/fs/cgroup/cpu/D/E      (shares=1024)
  /sys/fs/cgroup/cpu/D/E/F    (shares=1024)

The same benchmark is running in groups C and F, no other tasks are running, and the benchmark is capable of consuming all the CPUs. We expect group C to win more CPU resources since it could enjoy all the shares of group A, but it's F who wins much more.

The reason is that we have group B with shares of 2. Since A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus, A->cfs_rq.load.weight becomes very small. And in calc_group_shares() we calculate shares as:

  load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
  shares = (tg_shares * load) / tg_weight;

Since 'cfs_rq->load.weight' is too small, the load becomes 0 after scale down; although 'tg_shares' is 102400, the shares of the se which stands for group A on the root cfs_rq become 2. The se of D on the root cfs_rq is far bigger than 2, so it wins the battle.

Thus when scale_load_down() scales the real weight down to 0, it no longer tells the real story: the caller gets wrong information and the calculation becomes buggy.

This patch adds a check in scale_load_down() so the real weight is >= MIN_SHARES after scaling; with it applied, group C wins as expected.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
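A minimal sketch of the clamp this patch describes (constants and names here are illustrative, not the kernel's exact definitions):

```c
/* Hedged sketch: never let a non-zero group weight scale down to 0,
 * otherwise calc_group_shares() computes shares from a bogus load of 0. */
#define SCHED_FIXEDPOINT_SHIFT_SKETCH   10
#define MIN_SHARES_SKETCH               2UL

static inline unsigned long scale_load_down_sketch(unsigned long w)
{
        unsigned long scaled = w >> SCHED_FIXEDPOINT_SHIFT_SKETCH;

        if (w && scaled < MIN_SHARES_SKETCH)
                return MIN_SHARES_SKETCH;   /* keep the weight >= MIN_SHARES */
        return scaled;
}
```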
-
Submitted by Huaixin Chang
fix #25892693

commit 26a8b12747c975b33b4a82d62e4a307e1c07f31b upstream

Currently, there is a potential race between distribute_cfs_runtime() and assign_cfs_rq_runtime(). The race happens when cfs_b->runtime is read and distributed without holding the lock, and the distributor then finds there is not enough runtime to charge against after distribution, because assign_cfs_rq_runtime() may be called during distribution and use cfs_b->runtime at the same time.

Fibtest is the tool to test this race. Assume all gcfs_rq are throttled and the cfs period timer runs; slow threads might run and sleep, returning unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local pool. If all this happens sufficiently quickly, cfs_b->runtime will drop a lot. If the runtime distributed is large too, over-use of runtime happens.

A runtime over-use of about 70 percent of quota is seen when we test fibtest on a 96-core machine. We run fibtest with 1 fast thread and 95 slow threads in the test group, configure a 10ms quota for this group, and see that the CPU usage of fibtest is 17.0%, which is far more than the expected 10%. On a smaller machine with 32 cores, we also run fibtest with 96 threads; CPU usage is more than 12%, which is also more than the expected 10%. This shows that on similar workloads, this race does affect CPU bandwidth control.

Solve this by holding the lock inside distribute_cfs_runtime().

Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
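The essence of the fix, read-and-charge under one lock, can be sketched in a self-contained way (a plain pthreads illustration, not the kernel's cfs_bandwidth code):

```c
/* Hedged illustration: remaining runtime must be read and charged under the
 * same lock, otherwise a concurrent consumer can drain the pool between the
 * unlocked read and the distribution, over-committing the quota. */
#include <pthread.h>

struct runtime_pool {
        pthread_mutex_t lock;
        unsigned long long runtime;             /* quota left this period */
};

static unsigned long long grant_runtime(struct runtime_pool *pool,
                                        unsigned long long want)
{
        unsigned long long granted;

        pthread_mutex_lock(&pool->lock);
        granted = want < pool->runtime ? want : pool->runtime;
        pool->runtime -= granted;               /* charge while holding the lock */
        pthread_mutex_unlock(&pool->lock);

        return granted;
}
```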
-
Submitted by Xunlei Pang
to #26424323

task_css() should be protected by RCU; fix several callers.

Fixes: 1f49a738 ("alinux: psi: Support PSI under cgroup v1")
Acked-by: Michael Wang <yun.wany@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
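The pattern the fix applies looks roughly like this (a kernel-style sketch; the actual call sites touched by the patch may differ):

```c
/* Hedged sketch: task_css() must only be dereferenced inside an RCU
 * read-side critical section. */
#include <linux/cgroup.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

static void use_task_css_sketch(struct task_struct *tsk)
{
        struct cgroup_subsys_state *css;

        rcu_read_lock();
        css = task_css(tsk, cpuacct_cgrp_id);
        /* ... use css only while still under rcu_read_lock() ... */
        rcu_read_unlock();
}
```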
-
- 17 April 2020: 1 commit
-
-
Submitted by zhongjiang-ali
to #26788859

We've met several real-world issues where the child reaper (i.e. systemd) gets stuck in some aborted status and can't reap its zombie children, so we provide an interface to do the reaping for a specified pid.

Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
-
- 18 March 2020: 10 commits
-
-
Submitted by Lingpeng Chen
commit e7a5f1f1cd0008e5ad379270a8657e121eedb669 upstream

Right now in tcp_bpf_recvmsg, the sock reads data first from sk_receive_queue if it is not empty, and from psock->ingress_msg otherwise. If a FIN packet arrives and there is also some data in psock->ingress_msg, the data in psock->ingress_msg will be purged. This always happens with requests to an HTTP 1.0 server like python SimpleHTTPServer, since the server sends a FIN packet after the data is sent out.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Arika Chen <eaglesora@gmail.com>
Suggested-by: Arika Chen <eaglesora@gmail.com>
Signed-off-by: Lingpeng Chen <forrest0579@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com
[tonylu: patch modified to match BIG rework between v4.19 and upstream]
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Xu Yu
Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() check only one thread from each thread group, we can make cgroup_subsys_state::nr_tasks record only the thread group leader, i.e., the process, instead of threads. Furthermore, this renames cgroup_subsys_state::nr_tasks to cgroup_subsys_state::nr_procs.

Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in the css and its descendants")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Xiaoguang Wang
For some workloads whose io activities are mostly random, the context readahead feature can introduce unnecessary io read operations, which will impact the app's performance. Context readahead's algorithm is straightforward and not that smart.

This patch adds "/proc/sys/vm/enable_context_readahead" to control whether to enable or disable this feature. Currently context readahead is enabled by default; users can echo 0 to /proc/sys/vm/enable_context_readahead to disable it.

We have also tested mongodb's performance in the 'random point select' case:

  With context readahead enabled:  mongodb eps 12409
  With context readahead disabled: mongodb eps 14443

About 16% performance improvement.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
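A knob like this is typically wired into the vm sysctl table along these lines (a hedged sketch; the variable and handler names here are illustrative and may not match what the patch actually uses):

```c
/* Hedged sketch of a 0/1 vm sysctl knob (names made up for illustration). */
#include <linux/sysctl.h>

static int zero;
static int one = 1;
static int sysctl_enable_context_readahead = 1;   /* enabled by default */

static struct ctl_table context_readahead_table[] = {
        {
                .procname     = "enable_context_readahead",
                .data         = &sysctl_enable_context_readahead,
                .maxlen       = sizeof(int),
                .mode         = 0644,
                .proc_handler = proc_dointvec_minmax,
                .extra1       = &zero,
                .extra2       = &one,
        },
        { }   /* table terminator */
};
```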
-
Submitted by Wenwei Tao
Account the number of tasks in the css and its descendants; this prepares for the incoming memcg priority patch. In memcg priority oom, we will select a victim cgroup which has victim tasks in it, so we need to know whether the memcg and its descendants have tasks before the selection can move on.

Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Jens Axboe
commit 771b53d033e8663abdf59704806aa856b236dcdb upstream.

This adds support for io-wq, a smaller and specialized thread pool implementation. This is meant to replace workqueues for io_uring. Among the reasons for this addition are:

- We can assign memory context smarter and more persistently if we manage the life time of threads.
- We can drop various work-arounds we have in io_uring, like the async_list.
- We can implement hashed work insertion, to manage concurrency of buffered writes without needing a) an extra workqueue, or b) needlessly making the concurrency of said workqueue very low which hurts performance of multiple buffered file writers.
- We can implement cancel through signals, for cancelling interruptible work like read/write (or send/recv) to/from sockets.
- We need the above cancel for being able to assign and use file tables from a process.
- We can implement a more thorough cancel operation in general.
- We need it to move towards a syslet/threadlet model for even faster async execution. For that we need to take ownership of the used threads.

This list is just off the top of my head. Performance should be the same, or better, at least that's what I've seen in my testing. io-wq supports basic NUMA functionality, setting up a pool per node.

io-wq hooks up to the scheduler schedule in/out just like workqueue and uses that to drive the need for more/less workers.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: Cherry-pick allow_kernel_signal() from upstream commit 33da8e7c814f]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Thomas Gleixner
commit 6d25be5782e482eb93e3de0c94d0a517879377d0 upstream.

The worker accounting for CPU bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself.

Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. When nr_running is updated when the task is returning from schedule(), it is later compared when it is done from ttwu().

[ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Oleg Nesterov
commit b772434be0891ed1081a08ae7cfd4666728f8e82 upstream.

task->saved_sigmask and ->restore_sigmask are only used in the ret-from-syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified.

This way the callers do not need 2 sigset_t's passed to set/restore, and restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Oleg Nesterov
commit 97abc889ee296faf95ca0e978340fb7b942a3e32 upstream. This is the minimal fix for stable, I'll send cleanups later. Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced the visible change which breaks user-space: a signal temporary unblocked by set_user_sigmask() can be delivered even if the caller returns success or timeout. Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of signal_pending() check, and update the callers. Eric said: : For clarity. I don't think this is required by posix, or fundamentally to : remove the races in select. It is what linux has always done and we have : applications who care so I agree this fix is needed. : : Further in any case where the semantic change that this patch rolls back : (aka where allowing a signal to be delivered and the select like call to : complete) would be advantage we can do as well if not better by using : signalfd. : : Michael is there any chance we can get this guarantee of the linux : implementation of pselect and friends clearly documented. The guarantee : that if the system call completes successfully we are guaranteed that no : signal that is unblocked by using sigmask will be delivered? Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()") Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reported-by: NEric Wong <e@80x24.org> Tested-by: NEric Wong <e@80x24.org> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NDeepa Dinamani <deepa.kernel@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Mark Rutland
backport from commit a1326b17ac03a9012cb3d01e434aacb4d67a416c upstream

When using patchable-function-entry, the compiler will record the callsites into a section named "__patchable_function_entries" rather than "__mcount_loc". Let's abstract this difference behind a new FTRACE_CALLSITE_SECTION, so that architectures don't have to handle this explicitly (e.g. with custom module linker scripts).

As parisc currently handles this explicitly, it is fixed up accordingly, with its custom linker script removed. Since FTRACE_CALLSITE_SECTION is only defined when DYNAMIC_FTRACE is selected, the parisc module loading code is updated to only use the definition in that case. When DYNAMIC_FTRACE is not selected, modules shouldn't have this section, so this removes some redundant work in that case.

To make sure that this is kept up-to-date for modules and the main kernel, a comment is added to vmlinux.lds.h, with the existing ifdeffery simplified for legibility.

I built parisc generic-{32,64}bit_defconfig with DYNAMIC_FTRACE enabled, and verified that the section made it into the .ko files for modules.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Helge Deller <deller@gmx.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Torsten Duwe <duwe@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
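The abstraction reads roughly like this (a sketch based on the commit description; the exact guard macro used may differ):

```c
/* Hedged sketch: pick the callsite section name based on how the compiler
 * records ftrace callsites. */
#ifdef CC_USING_PATCHABLE_FUNCTION_ENTRY
#define FTRACE_CALLSITE_SECTION "__patchable_function_entries"
#else
#define FTRACE_CALLSITE_SECTION "__mcount_loc"
#endif
```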
-
Submitted by Mark Rutland
commit fbf6c73c5b264c25484fa9f449b5546569fe11f0 upstream

Architectures may need to perform special initialization of ftrace callsites, and today they do so by special-casing ftrace_make_nop() when the expected branch address is MCOUNT_ADDR. In some cases (e.g. for patchable-function-entry), we don't have an mcount-like symbol and don't want a synthetic MCOUNT_ADDR, but we may need to perform some initialization of callsites.

To make it possible to separate initialization from runtime modification, and to handle cases without an mcount-like symbol, this patch adds an optional ftrace_init_nop() function that architectures can implement, which does not pass a branch address. Where an architecture does not provide ftrace_init_nop(), we will fall back to the existing behaviour of calling ftrace_make_nop() with MCOUNT_ADDR.

At the same time, ftrace_code_disable() is renamed to ftrace_nop_initialize() to make it clearer that it is intended to initialize a callsite into a disabled state, and is not for disabling a callsite that has been runtime enabled. The kerneldoc description of rec arguments is updated to cover non-mcount callsites.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Torsten Duwe <duwe@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Acked-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
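The fallback described above amounts to roughly the following (a hedged sketch of the idea; the real definition lives in the ftrace headers and may differ in detail):

```c
/* Hedged sketch: architectures without a custom ftrace_init_nop() keep the
 * historical behaviour of initializing a callsite via ftrace_make_nop(). */
#ifndef ftrace_init_nop
static inline int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
{
        return ftrace_make_nop(mod, rec, MCOUNT_ADDR);
}
#endif
```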
-
- 17 January 2020: 10 commits
-
-
Submitted by Jens Axboe
commit edafccee56ff31678a091ddb7219aba9b28bc3cb upstream. If we have fixed user buffers, we can map them into the kernel when we setup the io_uring. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must call io_uring_register() after having setup an io_uring instance, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and the nr_args should contain how many iovecs the application wishes to map. If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len must point to somewhere inside the indexed buffer. The application may register buffers throughout the lifetime of the io_uring instance. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring instance. It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. For now, buffers must not be file backed. If file backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per buffer size is also imposed. Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
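From userspace, registering a fixed buffer boils down to a single io_uring_register() call; a hedged raw-syscall sketch (assumes kernel headers that define __NR_io_uring_register and IORING_REGISTER_BUFFERS; error handling trimmed):

```c
/* Hedged sketch: register one fixed buffer so IORING_OP_READ_FIXED /
 * IORING_OP_WRITE_FIXED can reference it via buf_index 0. */
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/io_uring.h>

static int register_fixed_buffer(int ring_fd, void *buf, size_t len)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        return syscall(__NR_io_uring_register, ring_fd,
                       IORING_REGISTER_BUFFERS, &iov, 1);
}
```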
-
Submitted by Jens Axboe
commit 2b188cc1bb857a9d4701ae59aa7768b5124e262e upstream.

The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
    Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
    Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time; this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel.

With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
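A minimal userspace sketch of the two syscalls (hedged: raw syscall numbers come from the kernel headers, and a real application would mmap the SQ/CQ rings using the offsets returned in params before submitting anything):

```c
/* Hedged sketch: create a ring, then ask for any already-completed events
 * without sleeping (to_submit = 0, min_complete = 0, IORING_ENTER_GETEVENTS). */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

static int setup_and_poll(unsigned int entries)
{
        struct io_uring_params p;
        int fd;

        memset(&p, 0, sizeof(p));
        fd = syscall(__NR_io_uring_setup, entries, &p);
        if (fd < 0)
                return fd;

        return syscall(__NR_io_uring_enter, fd, 0, 0,
                       IORING_ENTER_GETEVENTS, NULL, 0);
}
```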
-
Submitted by Deepa Dinamani
commit 854a6ed56839a40f6b5d02a2962f48841482eec4 upstream. Refactor the logic to restore the sigmask before the syscall returns into an api. This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during the execution and restored after the execution of the syscall. With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Deepa Dinamani
commit ded653ccbec0335a78fa7a7aff3ec9870349fafb upstream. Refactor reading sigset from userspace and updating sigmask into an api. This is useful for versions of syscalls that pass in the sigmask and expect the current->sigmask to be changed during, and restored after, the execution of the syscall. With the advent of new y2038 syscalls in the subsequent patches, we add two more new versions of the syscalls (for pselect, ppoll, and io_pgetevents) in addition to the existing native and compat versions. Adding such an api reduces the logic that would need to be replicated otherwise. Note that the calls to sigprocmask() ignored the return value from the api as the function only returns an error on an invalid first argument that is hardcoded at these call sites. The updated logic uses set_current_blocked() instead. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
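A userspace analogue of the pattern these helpers capture (hedged illustration; the in-kernel helpers operate on current->blocked and the saved sigmask rather than via sigprocmask()):

```c
/* Hedged sketch: install a caller-supplied mask for the duration of a call,
 * then restore the original -- the set/restore pattern described above. */
#include <signal.h>

static int call_with_sigmask(const sigset_t *mask, int (*fn)(void *), void *arg)
{
        sigset_t saved;
        int ret;

        if (sigprocmask(SIG_SETMASK, mask, &saved))     /* "set_user_sigmask" */
                return -1;
        ret = fn(arg);
        sigprocmask(SIG_SETMASK, &saved, NULL);         /* restore on the way out */

        return ret;
}
```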
-
Submitted by Joseph Qi
This is to fix the build error if CONFIG_CGROUP_CPUACCT is not enabled:

  kernel/sched/psi.c: In function 'iterate_groups':
  kernel/sched/psi.c:732:31: error: 'cpuacct_cgrp_id' undeclared (first use in this function); did you mean 'cpuacct_charge'?

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 1f49a738 ("alinux: psi: Support PSI under cgroup v1")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Huaixin Chang
For a long time, runnable cpu load has been used in selecting the task rq when waking up tasks. Recent tests have shown that for a test load with a large quantity of short running tasks and almost full cpu utilization, static load is more helpful. In our e2e tests, the runnable load avg of java threads ranges from less than 10 to as large as 362, while these java threads are no different from each other and should be treated in the same way. After using static load, a qps improvement has been seen in multiple test cases.

A new sched feature WA_STATIC_WEIGHT is introduced here to control this. Echo WA_STATIC_WEIGHT to /sys/kernel/debug/sched_features to turn static load in wake_affine_weight on, and NO_WA_STATIC_WEIGHT to turn it off. This feature is kept off by default.

Tests were done on the following hardware:

  4 threads Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz

In tests with 120 threads and sql loglevel configured to info:

  NO_WA_STATIC_WEIGHT   WA_STATIC_WEIGHT
  33170.63              34614.95 (+4.35%)

In tests with 160 threads and sql loglevel configured to info:

  NO_WA_STATIC_WEIGHT   WA_STATIC_WEIGHT
  35888.71              38247.20 (+6.57%)

In tests with 160 threads and sql loglevel configured to warn:

  NO_WA_STATIC_WEIGHT   WA_STATIC_WEIGHT
  39118.72              39698.72 (+1.48%)

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
-
Submitted by Ke Wu
commit e84cd7ee630e44a2cc8ae49e85920a271b214cb3 upstream

Make mod_verify_sig use all trusted keys. This allows keys in secondary_trusted_keys to be used to verify the PKCS#7 signature on a kernel module.

Signed-off-by: Ke Wu <mikewu@google.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Reviewed-by: Jia Zhang <zhang.jia@linux.alibaba.com>
-
Submitted by Xunlei Pang
We reserve some fields beforehand for core structures prone to change, so that we won't be hurt when extra fields have to be added for a hotfix, thereby increasing the success rate; we can even hot add features with this enhancement.

After reserving, cache normally does not matter, as the reserved fields (usually at the tail) are not accessed at all.

The following structures are currently involved:

MM:
  struct zone
  struct pglist_data
  struct mm_struct
  struct vm_area_struct
  struct mem_cgroup
  struct writeback_control

Block:
  struct gendisk
  struct backing_dev_info
  struct bio
  struct queue_limits
  struct request_queue
  struct blkcg
  struct blkcg_policy
  struct blk_mq_hw_ctx
  struct blk_mq_tag_set
  struct blk_mq_queue_data
  struct blk_mq_ops
  struct elevator_mq_ops
  struct inode
  struct dentry
  struct address_space
  struct block_device
  struct hd_struct
  struct bio_set

Network:
  struct sk_buff
  struct sock
  struct net_device_ops
  struct xt_target
  struct dst_entry
  struct dst_ops
  struct fib_rule

Scheduler:
  struct task_struct
  struct cfs_rq
  struct rq
  struct sched_statistics
  struct sched_entity
  struct signal_struct
  struct task_group
  struct cpuacct

cgroup:
  struct cgroup_root
  struct cgroup_subsys_state
  struct cgroup_subsys
  struct css_set

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
[ caspar: use SPDX-License-Identifier ]
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
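The reservation idea can be pictured as tail padding in the affected structures (a hedged illustration; the field and macro names here are made up, not the ones the patch uses):

```c
/* Hedged sketch: unused tail slots that a later hotfix can claim without
 * changing the structure's size or layout. */
#define RESERVE_SLOT(n)  unsigned long reserved_##n

struct example_core_struct {
        int existing_field;
        /* ... other existing fields ... */
        RESERVE_SLOT(1);
        RESERVE_SLOT(2);
        RESERVE_SLOT(3);
        RESERVE_SLOT(4);
};
```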
-
Submitted by Joseph Qi
Instead of using the static kconfig option CONFIG_PSI_CGROUP_V1, we introduce a boot parameter psi_v1 to enable psi cgroup v1 support. By default it is disabled, which means that when only the psi=1 boot parameter is passed, we support cgroup v2 only. This keeps consistent with other cgroup v1 features such as cgroup writeback v1 (cgwb_v1).

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Xunlei Pang
Export "cpu|io|memory.pressure" under cgroup v1 "cpuacct" subsystem. Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
- 15 January 2020: 8 commits
-
-
Submitted by zhangliguang
This is a temporary workaround to avoid the limitation when creating a hard link across two projids.

Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Borislav Petkov
commit f26621e60b35369bca9228bc936dc723b3e421af upstream.

Add the missing kernel-doc style function parameter documentation.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: linux-tip-commits@vger.kernel.org
Cc: rdunlap@infradead.org
Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[joseph: fix find_next_iomem_res() documentation]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Randy Dunlap
commit f75d651587f719a813ebbbfeee570e6570731d55 upstream.

The first group of warnings is caused by a "/**" kernel-doc notation marker, but the function comments are not in kernel-doc format. Also add another error return value here.

  ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
  ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'

Add the missing function parameter documentation for the other warnings:

  ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
  ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[joseph: fix find_next_iomem_res() documentation]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Dave Hansen
commit 2b539aefe9e48e3908cff02699aa63a8b9bd268e upstream

In the process of onlining memory, we use walk_system_ram_range() to find the actual RAM areas inside of the area being onlined. However, it currently only finds memory resources which are "top-level" iomem_resources. Children are not currently searched, which causes it to skip System RAM in areas like this (in the format of /proc/iomem):

  a0000000-bfffffff : Persistent Memory (legacy)
    a0000000-afffffff : System RAM

Changing the true->false here allows children to be searched as well. We need this because we add a new "System RAM" resource underneath the "persistent memory" resource when we use persistent memory in a volatile mode.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
-
Submitted by Dave Hansen
commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream HMM consumes physical address space for its own use, even though nothing is mapped or accessible there. It uses a special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY) to uniquely identify these areas. When HMM consumes address space, it makes a best guess about what to consume. However, it is possible that a future memory or device hotplug can collide with the reserved area. In the case of these conflicts, there is an error message in register_memory_resource(). Later patches in this series move register_memory_resource() from using request_resource_conflict() to __request_region(). Unfortunately, __request_region() does not return the conflict like the previous function did, which makes it impossible to check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting resource. Instead of warning in register_memory_resource(), move the check into the core resource code itself (__request_region()) where the conflicting resource _is_ available. This has the added bonus of producing a warning in case of HMM conflicts with devices *or* RAM address space, as opposed to the RAM- only warnings that were there previously. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Reviewed-by: NJerome Glisse <jglisse@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying <ying.huang@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
-
Submitted by Dave Hansen
commit 5cd401ace914dc68556c6d2fcae0c349444d5f86 upstream

walk_system_ram_range() can return an error code either because *it* failed, or because the 'func' that it calls returned an error. The memory hotplug code does the following:

  ret = walk_system_ram_range(..., func);
  if (ret)
        return ret;

and 'ret' makes it out to userspace, eventually. The problem is, walk_system_ram_range() failures that result from *it* failing (as opposed to 'func') return -1. That leads to a very odd -EPERM (-1) return code out to userspace.

Make walk_system_ram_range() return -EINVAL for internal failures to keep userspace less confused. This return code is compatible with all the callers that I audited.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
-
Submitted by Oscar Salvador
commit 65c78784135f847e49eb98e6b976e453e71100c3 upstream

This is a preparation for the next patch.

Currently, we only call release_mem_region_adjustable() in __remove_pages if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm are released by themselves with devm_release_mem_region.

Since we do not want to touch any zone/page stuff during the removing of the memory (but during the offlining), we do not want to check for the zone here. So we need another way to tell release_mem_region_adjustable() not to release the resource in case it belongs to HMM/devm.

HMM/devm acquires/releases a resource through devm_request_mem_region/devm_release_mem_region. These resources have the flag IORESOURCE_MEM, while resources acquired by the hot-add memory path (register_memory_resource()) contain IORESOURCE_SYSTEM_RAM. So, we can check for this flag in release_mem_region_adjustable, and if the resource does not contain such a flag, we know that we are dealing with a HMM/devm resource, so we can back off.

Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
-
Submitted by Borislav Petkov
commit b69c2e20f6e4046da84ce5b33ba1ef89cb087b40 upstream

- Drop BUG_ON()s and do normal error handling instead, in find_next_iomem_res().
- Align function arguments on opening braces.
- Get rid of the local var sibling_only in find_next_iomem_res().
- Shorten the unnecessarily long first_level_children_only arg name.

Signed-off-by: Borislav Petkov <bp@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Bjorn Helgaas <bhelgaas@google.com>
CC: Brijesh Singh <brijesh.singh@amd.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Lianbo Jiang <lijiang@redhat.com>
CC: Takashi Iwai <tiwai@suse.de>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
CC: bhe@redhat.com
CC: dan.j.williams@intel.com
CC: dyoung@redhat.com
CC: kexec@lists.infradead.org
CC: mingo@redhat.com
Link: <new submission>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
-