- 02 Sep, 2020 40 commits
-
-
Submitted by Eryu Guan
to #29613774

commit cf576c58b3a283333fc6e9a7c1c8e5342fa59b97 upstream

Under writeback mode, inode->i_blocks is not updated, making utils like du read st.blocks as 0. For example, when using virtiofs (cache=always & non-dax mode) with writeback_cache enabled, writing a new file and checking its disk usage with du, du reports 0 usage.

  # uname -r
  5.6.0-rc6+
  # mount -t virtiofs virtiofs /mnt/virtiofs
  # rm -f /mnt/virtiofs/testfile
  # create new file and do extend write
  # xfs_io -fc "pwrite 0 4k" /mnt/virtiofs/testfile
  wrote 4096/4096 bytes at offset 0
  4 KiB, 1 ops; 0.0001 sec (28.103 MiB/sec and 7194.2446 ops/sec)
  # du -k /mnt/virtiofs/testfile
  0         <==== disk usage is 0
  # stat -c %s,%b /mnt/virtiofs/testfile
  4096,0    <==== i_size is correct, but st_blocks is 0

Fix it by invalidating attr in fuse_flush(), so we get up-to-date attr from the server on the next getattr.

Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
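
A minimal sketch of the kind of change described; the placement and surrounding logic in fs/fuse/file.c are assumptions, not the verbatim upstream diff:

    static int fuse_flush(struct file *file, fl_owner_t id)
    {
    	struct inode *inode = file_inode(file);

    	/*
    	 * Under writeback mode i_blocks is maintained by the server,
    	 * not the client, so drop the cached attributes here: the next
    	 * getattr (e.g. from stat or du) must refetch them.
    	 */
    	fuse_invalidate_attr(inode);

    	return 0;	/* the real handler also flushes pending writes */
    }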
-
Submitted by Charan Teja Reddy
to #29931646

commit f80b08fc44536a311a9f3182e50f318b79076425 upstream.

When boosting is enabled, it is observed that the rate of atomic order-0 allocation failures is high, because free levels in the system are checked with the ->watermark_boost offset. This is not a problem for sleepable allocations, but for atomic allocations it looks like a regression.

This problem is seen frequently on an Android kernel running on Snapdragon hardware with 4GB RAM. When no extfrag event has occurred in the system, the ->watermark_boost factor is zero, thus the watermark configuration is:

   _watermark = (
      [WMARK_MIN] = 1272,   --> ~5MB
      [WMARK_LOW] = 9067,   --> ~36MB
      [WMARK_HIGH] = 9385), --> ~38MB
   watermark_boost = 0

After launching some memory-hungry applications in Android which cause extfrag events, ->watermark_boost can be set to its max, i.e. the default boost factor makes it 150% of the high watermark:

   _watermark = (
      [WMARK_MIN] = 1272,   --> ~5MB
      [WMARK_LOW] = 9067,   --> ~36MB
      [WMARK_HIGH] = 9385), --> ~38MB
   watermark_boost = 14077, --> ~57MB

With the default system configuration, ~2MB of free memory suffices for an atomic order-0 allocation to succeed. But boosting raises the min watermark to ~61MB, so an atomic order-0 allocation only succeeds when the system has at least ~23MB of free memory (from the calculations in zone_watermark_ok(), min = 3/4(min/2)). Yet failures are observed even though the system has ~20MB of free memory. In testing, this is reproducible as early as the first 300 secs since boot, and with further low-RAM configurations (<2GB) as early as the first 150 secs.

These failures can be avoided by excluding ->watermark_boost from the watermark calculations for atomic order-0 allocations.

[akpm@linux-foundation.org: fix comment grammar, reflow comment]
[charante@codeaurora.org: fix suggested by Mel Gorman]
  Link: http://lkml.kernel.org/r/31556793-57b1-1c21-1a9d-22674d9bd938@codeaurora.org

Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/1589882284-21010-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
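
A hedged sketch of how the exclusion could look inside zone_watermark_fast() in mm/page_alloc.c, reconstructed from the description rather than copied from the upstream hunk:

    	/*
    	 * An atomic order-0 allocation checked against the boosted min
    	 * watermark gets one more try against the unboosted value, so
    	 * that watermark boosting alone cannot fail it.
    	 */
    	if (unlikely(!order && (gfp_mask & __GFP_ATOMIC) &&
    		     z->watermark_boost &&
    		     (alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN)) {
    		mark = z->_watermark[WMARK_MIN];	/* boost excluded */
    		return __zone_watermark_ok(z, order, mark, classzone_idx,
    					   alloc_flags, free_pages);
    	}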
-
Submitted by Bob Liu
to #29683594

commit 7996a8b5511a72465b0b286763c2d8f412b8874a upstream.

The following is a description of a hang in blk_mq_freeze_queue_wait(). The hang happens on an attempt to freeze a queue while another task does queue unfreeze.

The root cause is an incorrect sequence of percpu_ref_resurrect() and percpu_ref_kill(), and as a result those two can be swapped:

 CPU#0                          CPU#1
 ----------------               -----------------
 q1 = blk_mq_init_queue(shared_tags)

                                q2 = blk_mq_init_queue(shared_tags):
                                  blk_mq_add_queue_tag_set(shared_tags):
                                    blk_mq_update_tag_set_depth(shared_tags):
                                      list_for_each_entry()
                                        blk_mq_freeze_queue(q1)
                                         > percpu_ref_kill()
                                         > blk_mq_freeze_queue_wait()

 blk_cleanup_queue(q1)
  blk_mq_freeze_queue(q1)
   > percpu_ref_kill()
        ^^^^^^ freeze_depth can't guarantee the order

                                        blk_mq_unfreeze_queue()
                                         > percpu_ref_resurrect()

   > blk_mq_freeze_queue_wait()
        ^^^^^^ Hang here!!!!

This wrong sequence raises a kernel warning:

 percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
 WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0

But the most unpleasant effect is a hang of blk_mq_freeze_queue_wait(), which waits for q_usage_counter to reach zero, which never happens because the percpu-ref was reinited (instead of being killed) and stays in PERCPU state forever.

How to reproduce:
 - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
 - cpu0: python Script.py 0; taskset the corresponding process running on cpu0
 - cpu1: python Script.py 1; taskset the corresponding process running on cpu1

Script.py:
 ------
 #!/usr/bin/python3

 import os
 import sys

 while True:
     on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
     off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
     os.system(on)
     os.system(off)
 ------

This bug was first reported and fixed by Roman, previous discussion:
[1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
[2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
[3] https://patchwork.kernel.org/patch/9268199/

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
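
A sketch of the serialized freeze path, assuming a q->mq_freeze_lock mutex guarding a plain mq_freeze_depth counter as the fix describes (close to, but not guaranteed to be, the verbatim upstream code):

    void blk_freeze_queue_start(struct request_queue *q)
    {
    	mutex_lock(&q->mq_freeze_lock);
    	if (++q->mq_freeze_depth == 1) {
    		/* kill/resurrect can no longer be reordered by racing
    		 * freeze/unfreeze pairs: both run under the mutex */
    		percpu_ref_kill(&q->q_usage_counter);
    		mutex_unlock(&q->mq_freeze_lock);
    		blk_mq_run_hw_queues(q, false);
    	} else {
    		mutex_unlock(&q->mq_freeze_lock);
    	}
    }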
-
Submitted by Bart Van Assche
to #29683594

commit bdd6316094e0370cd183bc979dd7e322b68dc993 upstream.

A later patch will call blk_freeze_queue_start() followed by blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop to zero. Make sure that this doesn't cause a kernel warning to appear by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The former namely requires that the refcount it operates on is zero.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Xiaoguang Wang
task #26578122

Currently we can create multiple io_uring instances which all have SQPOLL enabled and bind them to the same cpu core by setting the sq_thread_cpu argument, but sometimes this isn't efficient. Imagine such an extreme case: create two io_uring instances, both with SQPOLL enabled and bound to the same cpu core. One instance submits io every 500ms, the other submits io continually; then obviously the 1st instance still always contends for cpu resources, which impacts the 2nd instance.

To fix this issue, add a new flag IORING_SETUP_SQPOLL_PERCPU: when both IORING_SETUP_SQ_AFF and IORING_SETUP_SQPOLL_PERCPU are enabled, we create a percpu io sq_thread to handle multiple io_uring instances' io requests with a round-robin strategy. The improvements are very big, see below:

  IOPS:
    No of instances      1      2      4      8     16     32
    kernel unpatched  589k   487k   303k   165k  85.8k  43.7k
    kernel patched    590k   593k   581k   538k   494k   406k

  LATENCY(us):
    No of instances      1      2      4      8     16     32
    kernel unpatched   217    262    422    775   1488   2917
    kernel patched     216    215    219    237    258    313

Link: https://lore.kernel.org/io-uring/20200520115648.6140-1-xiaoguang.wang@linux.alibaba.com/
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
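
A hedged userspace sketch of how an instance would opt in to the shared per-CPU sq_thread. IORING_SETUP_SQPOLL_PERCPU is the flag introduced by this patch (so the header must carry the patched definition); the raw-syscall wrapper is an assumption for illustration:

    #include <linux/io_uring.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int setup_percpu_sqpoll_ring(unsigned entries, unsigned cpu)
    {
    	struct io_uring_params p = { 0 };

    	/* bind to one CPU and share that CPU's round-robin poll thread */
    	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF |
    		  IORING_SETUP_SQPOLL_PERCPU;
    	p.sq_thread_cpu = cpu;

    	return syscall(__NR_io_uring_setup, entries, &p);
    }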
-
Submitted by Baolin Wang
fix #29375191

For an NVMe discard request, special_vec is used to describe the size of the request, so blk_rq_bytes() returns an incorrect request size when handling the discard request. Use blk_rq_payload_bytes() to calculate the data transfer size instead, which fixes this issue.

Fixes: 220741e8c12d ("alios: nvme-pci: Improve mapping single segment requests using SGLs")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
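
A sketch of the distinction; the helper name is hypothetical, only the blk_rq_payload_bytes() semantics are from the block layer:

    static unsigned int nvme_xfer_len(struct request *req)
    {
    	/*
    	 * blk_rq_payload_bytes() returns the special_vec length for a
    	 * discard request (RQF_SPECIAL_PAYLOAD) and falls back to
    	 * blk_rq_bytes() otherwise, so it is the right data transfer
    	 * size in both cases.
    	 */
    	return blk_rq_payload_bytes(req);
    }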
-
Submitted by Erwei Deng
to #27883731

Enable the IGB, IGBVF, IXGBE and IXGBEVF configs for x86_64.

Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
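
The corresponding defconfig fragment presumably looks like this (the symbol names are the mainline ones; =m vs =y is an assumption):

    CONFIG_IGB=m
    CONFIG_IGBVF=m
    CONFIG_IXGBE=m
    CONFIG_IXGBEVF=m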
-
Submitted by Peng Wang
to #29722367

On some specific hardware with 128-byte LLC cachelines, tk_core may suffer from false sharing. We can align it to 128 bytes so that it won't be affected by other global variables. This change wastes a little cache utilization but brings a good performance improvement. So for both 64- and 128-byte LLC cachelines, we adjust the tk_core memory layout to avoid potential cacheline contention.

Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
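
A sketch of the described layout change in kernel/time/timekeeping.c; the struct fields match mainline, but the exact alignment annotation used by the patch is an assumption:

    /*
     * Align tk_core to 128 bytes so that on LLCs with 128-byte
     * cachelines no other global variable can share (and
     * false-share) its line.
     */
    static struct {
    	seqcount_t		seq;
    	struct timekeeper	timekeeper;
    } tk_core __attribute__((aligned(128))) = {
    	.seq = SEQCNT_ZERO(tk_core.seq),
    };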
-
Submitted by Shile Zhang
fix #30070241

Enable vsyscall emulation to support some legacy docker images which are based on CentOS 6, etc.

Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
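
Presumably the option being switched on is the mainline x86 one (a sketch; the exact defconfig delta is not shown in the log):

    CONFIG_LEGACY_VSYSCALL_EMULATE=y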
-
Submitted by Vladis Dronov
fix #30009260

commit 286d3250c9d6437340203fb64938bea344729a0e upstream

There is a race and a buffer overflow corrupting kernel memory while reading an EFI variable with a size of more than 1024 bytes via the older sysfs method. This happens because accessing struct efi_variable in efivar_{attr,size,data}_read() and friends is not protected from concurrent access, leading to a kernel memory corruption and, at best, to a crash.

The race scenario is the following:

  CPU0:                               CPU1:
  efivar_attr_read()
    var->DataSize = 1024;
    efivar_entry_get(... &var->DataSize)
      down_interruptible(&efivars_lock)
                                      efivar_attr_read() // same EFI var
                                        var->DataSize = 1024;
                                        efivar_entry_get(... &var->DataSize)
                                          down_interruptible(&efivars_lock)
      virt_efi_get_variable()
      // returns EFI_BUFFER_TOO_SMALL but
      // var->DataSize is set to a real
      // var size more than 1024 bytes
      up(&efivars_lock)
                                          virt_efi_get_variable()
                                          // called with var->DataSize set
                                          // to a real var size, returns
                                          // successfully and overwrites
                                          // a 1024-bytes kernel buffer
                                          up(&efivars_lock)

This can be reproduced by concurrent reading of an EFI variable whose size is more than 1024 bytes:

  ts# for cpu in $(seq 0 $(nproc --ignore=1)); do ( taskset -c $cpu \
        cat /sys/firmware/efi/vars/KEKDefault*/size & ) ; done

Fix this by using a local variable for a var's data buffer size so it does not get overwritten.

Fixes: e14ab23d ("efivars: efivar_entry API")
Reported-by: Bob Sanders <bob.sanders@hpe.com> and the LTP testsuite
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200305084041.24053-2-vdronov@redhat.com
Link: https://lore.kernel.org/r/20200308080859.21568-24-ardb@kernel.org
Signed-off-by: Chunmei Xu <xuchunmei@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
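
A sketch of the fix pattern in one of the readers (hedged; per the commit text the other efivar_{attr,size,data}_read() helpers get the same treatment):

    	struct efi_variable *var = &entry->var;
    	unsigned long size = sizeof(var->Data);	/* stack-local, per reader */
    	int ret;

    	/*
    	 * Passing &size instead of the shared &var->DataSize means a
    	 * concurrent reader can never observe, or feed back, a size
    	 * larger than the fixed 1024-byte var->Data buffer.
    	 */
    	ret = efivar_entry_get(entry, &var->Attributes, &size, var->Data);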
-
Submitted by Erwei Deng
fix #30012285

Enable the UIO Kconfig for x86_64.

Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Tony Luck
fix #29902604

commit 29b8e84fbc23cb2b70317b745641ea0569426872 upstream

Simplifies the code a little.

Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Submitted by Dan Carpenter
fix #29902604

commit f84afbdd3a9e5e10633695677b95422572f920dc upstream

The "cmd" comes from the user and it can be up to 255. If it's more than the number of bits in a long, it results in an out-of-bounds read when we check test_bit(cmd, &cmd_mask). The highest valid value for "cmd" is ND_CMD_CALL (10), so I added a compare against that.

Fixes: 62232e45 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
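
A sketch of the described guard; the exact placement in the libnvdimm ioctl path and the error code are assumptions:

    	/*
    	 * cmd is user-controlled and may be up to 255, but cmd_mask is
    	 * a single unsigned long: validate before using cmd as a bit
    	 * index. ND_CMD_CALL (10) is the highest valid command.
    	 */
    	if (cmd > ND_CMD_CALL)
    		return -ENOTTY;
    	if (!test_bit(cmd, &cmd_mask))
    		return -ENOTTY;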
-
Submitted by Yihao Wu
fix #29692432

seqcount assumes that writers are already exclusive, so it saves extra locking. However, the critical section protected by idle_seqcount can be entered twice if an interrupt tries to wake up a task on this cpu. Once the race caused by interrupts is avoided, writers are exclusive. So a seqlock is unnecessary, and local_irq_save + seqcount is enough.

Fixes: 61e58859 ("alinux: sched: Introduce per-cgroup idle accounting")
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
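
A sketch of the resulting write side; idle_seqcount is named in the text, the update itself is elided:

    	unsigned long flags;

    	/*
    	 * With IRQs off, a wakeup interrupt on this CPU can no longer
    	 * re-enter the write side, so writers are exclusive and a bare
    	 * seqcount is enough: no full seqlock needed.
    	 */
    	local_irq_save(flags);
    	write_seqcount_begin(&idle_seqcount);
    	/* ... update the per-cgroup idle accounting ... */
    	write_seqcount_end(&idle_seqcount);
    	local_irq_restore(flags);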
-
Submitted by Jiufei Xue
fix #29820404

commit 6d816e088c359866f9867057e04f244c608c42fe linux-block/io_uring-5.9 branch.

We're holding the request reference, but we need to go one higher to ensure that the ctx remains valid after the request has finished. If the ring is closed with pending task_work inflight, and the given io_kiocb finishes sync during issue, then we need a reference to the ring itself around the task_work execution cycle.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: syzbot+9b260fc33297966f5a8e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
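
A sketch of the described reference dance, reconstructed from the commit text rather than the exact diff:

    	struct io_ring_ctx *ctx = req->ctx;

    	/*
    	 * The request reference alone does not keep ctx alive once the
    	 * request completes, so hold the ring itself across the whole
    	 * task_work cycle; the matching percpu_ref_put(&ctx->refs)
    	 * runs at the end of the task_work callback.
    	 */
    	percpu_ref_get(&ctx->refs);
    	task_work_add(tsk, &req->task_work, true);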
-
Submitted by Thomas Gleixner
fix #29760792

commit 94a46d316f2b54e3de8a4fa884cb16383db7fcd8 upstream

There is no reason to have nmi_enter/exit() in the actual MCE handlers. Move it to the entry point. This also covers the until now uncovered initial handler which only prints.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.243936614@linutronix.de
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
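
A sketch, assuming the do_mce()/machine_check_vector() entry shape of that kernel generation (names hedged):

    dotraplinkage void do_mce(struct pt_regs *regs, long error_code)
    {
    	/*
    	 * Bracket every MCE handler, including the early print-only
    	 * one, in one place instead of inside each handler.
    	 */
    	nmi_enter();
    	machine_check_vector(regs, error_code);
    	nmi_exit();
    }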
-
Submitted by Guoyu Huang
fix #29760246

Cherry-pick 2dd2111d0d383df104b144e0d1f6b5a00cb7cd88 from io_uring-5.9.

loop_rw_iter() does not check whether the file has a read or write function. This can lead to a NULL pointer dereference when the user passes in a file descriptor that does not have a read or write function.

The crash log looks like this:

[ 99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 99.835364] #PF: supervisor instruction fetch in kernel mode
[ 99.836522] #PF: error_code(0x0010) - not-present page
[ 99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
[ 99.839649] Oops: 0010 [#2] SMP PTI
[ 99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G D 5.8.0 #2
[ 99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 99.845140] RIP: 0010:0x0
[ 99.845840] Code: Bad RIP value.
[ 99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[ 99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
[ 99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
[ 99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[ 99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[ 99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
[ 99.856914] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[ 99.858651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
[ 99.861979] Call Trace:
[ 99.862617] loop_rw_iter.part.0+0xad/0x110
[ 99.863838] io_write+0x2ae/0x380
[ 99.864644] ? kvm_sched_clock_read+0x11/0x20
[ 99.865595] ? sched_clock+0x9/0x10
[ 99.866453] ? sched_clock_cpu+0x11/0xb0
[ 99.867326] ? newidle_balance+0x1d4/0x3c0
[ 99.868283] io_issue_sqe+0xd8f/0x1340
[ 99.869216] ? __switch_to+0x7f/0x450
[ 99.870280] ? __switch_to_asm+0x42/0x70
[ 99.871254] ? __switch_to_asm+0x36/0x70
[ 99.872133] ? lock_timer_base+0x72/0xa0
[ 99.873155] ? switch_mm_irqs_off+0x1bf/0x420
[ 99.874152] io_wq_submit_work+0x64/0x180
[ 99.875192] ? kthread_use_mm+0x71/0x100
[ 99.876132] io_worker_handle_work+0x267/0x440
[ 99.877233] io_wqe_worker+0x297/0x350
[ 99.878145] kthread+0x112/0x150
[ 99.878849] ? __io_worker_unuse+0x100/0x100
[ 99.879935] ? kthread_park+0x90/0x90
[ 99.880874] ret_from_fork+0x22/0x30
[ 99.881679] Modules linked in:
[ 99.882493] CR2: 0000000000000000
[ 99.883324] ---[ end trace 4453745f4673190b ]---
[ 99.884289] RIP: 0010:0x0
[ 99.884837] Code: Bad RIP value.
[ 99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[ 99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
[ 99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
[ 99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[ 99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[ 99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
[ 99.896079] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[ 99.898017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0

Fixes: 32960613b7c3 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
Cc: stable@vger.kernel.org
Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
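
A sketch of the described check, hedged; it would sit where io_uring decides it must fall back to loop_rw_iter():

    	/*
    	 * loop_rw_iter() calls file->f_op->read/->write directly, and
    	 * a file may implement neither the iter nor the plain variant:
    	 * bail out instead of jumping through a NULL pointer.
    	 */
    	if (rw == READ && !file->f_op->read_iter && !file->f_op->read)
    		return -EINVAL;
    	if (rw == WRITE && !file->f_op->write_iter && !file->f_op->write)
    		return -EINVAL;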
-
Submitted by Baolin Wang
fix #29327388

Just use bio->bi_vcnt directly to validate whether there is only one bvec in a bio for PRP mode, which removes warnings for dm devices. No functional changes.

Fixes: c8b92b847512 ("alios: nvme-pci: Improve mapping single segment requests using PRP")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Jens Axboe
to #29139300

commit 4711b57317f0ff5ca9fbd5e2df6c73b2c07ddc53 upstream

If we yank a 'same_queue_rq' request off the plug list, we should also decrement the cached request count.

Fixes: 5f0ed774ed29 ("block: sum requests in the plug structure")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Jens Axboe
to #29139300

commit 5f0ed774ed2914decfd397569fface997532e94d upstream

This isn't exactly the same as the previous count, as it includes requests for all devices. But that really doesn't matter; if we have more than the threshold (16) queued up, flush it. It's not worth it to have an expensive list loop for this.

[Hongnan Li] performance evaluation

Performance results running fio (ioengine=io_uring, iodepth=256):

  bs     IOPS(randread nomerges=0)   IOPS(randread nomerges=2)
         before / after              before / after
  -----  --------------------------  --------------------------
  512    818K / 840K                 855K / 897K
  1k     816K / 842K                 853K / 898K
  2k     820K / 839K                 850K / 899K
  4k     818K / 840K                 852K / 895K
  8k     574K / 574K                 574K / 574K

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
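
A sketch of the check that replaces the per-queue list walk; BLK_MAX_REQUEST_COUNT is the existing threshold of 16:

    	/*
    	 * plug->rq_count sums requests across all devices on the plug
    	 * list; once past the threshold just flush, rather than
    	 * walking the list to count requests for this queue alone.
    	 */
    	if (plug->rq_count >= BLK_MAX_REQUEST_COUNT)
    		blk_flush_plug_list(plug, false);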
-
Submitted by Jeffle Xu
fix #29612968

The sector addresses of all bios in a single request are guaranteed to be contiguous, except for DISCARD requests. We could get the whole sector range of the request from blk_rq_pos() and blk_rq_bytes() for normal read/write requests, but here we still print the sector range of every bio for code simplicity. Since this is a low-frequency operation, this design incurs no performance penalty.

Besides, squash the 'if (bio)' and 'while (1)' into one single 'while (bio)'.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
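
A sketch of the squashed loop; the message format is illustrative, not taken from the patch:

    	/* one 'while (bio)' instead of 'if (bio)' + 'while (1)' */
    	while (bio) {
    		pr_info("sector %llu, len %u\n",
    			(unsigned long long)bio->bi_iter.bi_sector,
    			bio->bi_iter.bi_size);
    		bio = bio->bi_next;
    	}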
-
Submitted by Kirill A. Shutemov
to #28718400

commit 246c320a8cfe0b11d81a4af38fa9985ef0cc9a4c upstream.

VMAs with the VM_GROWSDOWN or VM_GROWSUP flag set can change their size under mmap_read_lock(). This can lead to a race with __do_munmap():

    Thread A                         Thread B
    __do_munmap()
      detach_vmas_to_be_unmapped()
      mmap_write_downgrade()
                                     expand_downwards()
                                       vma->vm_start = address;
                                       // The VMA now overlaps with
                                       // VMAs detached by the Thread A

                                     // page fault populates expanded part
                                     // of the VMA
      unmap_region()
        // Zaps pagetables partly
        // populated by Thread B

A similar race exists for expand_upwards().

The fix is to avoid downgrading mmap_lock in __do_munmap() if detached VMAs are next to a VM_GROWSDOWN or VM_GROWSUP VMA.

[akpm@linux-foundation.org: s/mmap_sem/mmap_lock/ in comment]
Fixes: 3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org> [4.20+]
Link: http://lkml.kernel.org/r/20200709105309.42495-1-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
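
A hedged reconstruction of the described condition as it could look in __do_munmap() (prev/next are the VMAs flanking the detached range):

    	/*
    	 * Do not downgrade mmap_lock if neighbouring VMAs can still
    	 * grow into the detached range: expand_downwards() and
    	 * expand_upwards() only hold the lock for read.
    	 */
    	if (downgrade) {
    		if (next && (next->vm_flags & VM_GROWSDOWN))
    			downgrade = false;
    		else if (prev && (prev->vm_flags & VM_GROWSUP))
    			downgrade = false;
    	}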
-
Submitted by Dave Hansen
to #28718400

commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream.

This is a bit of a mess, to put it mildly. But, it's a bug that only seems to have showed up in 4.20 but wasn't noticed until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX uses bounds tables that protect other areas of memory. When memory is unmapped, there is also a need to unmap the MPX bounds tables. Barring this, unused bounds tables can eat 80% of the address space.

But, the recursive do_munmap() that gets called via arch_unmap() wreaks havoc with __do_munmap()'s state. It can result in freeing populated page tables, accessing bogus VMA state, double-freed VMAs and more. See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_munmap() has a chance to do anything meaningful. Also, remove the 'vma' argument and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap(). No functional change. I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the VDSO and zeroes out 'current->mm->context.vdso_base'. Moving arch_unmap() makes this happen earlier in __do_munmap(). But, 'vdso_base' seems to only be used in perf and in the signal delivery that happens near the return to userspace. I can not find any likely impact to powerpc, other than the zeroing happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by its removal. I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to what was there before. For the munmap() failure case, it's possible that some MPX tables will be zapped for memory that continues to be in use. But, this is an extraordinarily unlikely scenario and the harm would be that MPX provides no protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

	ptr = mmap();
	// use ptr
	ret = munmap(ptr);
	if (ret)
		// oh, there was an error, I'll
		// keep using ptr.

Because if you're doing munmap(), you are *done* with the memory. There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:

 1. Find the affected VMA(s)
 2. Split the start/end one(s) if necessary
 3. Pull the VMAs out of the rbtree
 4. Actually zap the memory via unmap_region(), including freeing page tables (or queueing them to be freed).
 5. Fix up some of the accounting (like fput()) and actually free the VMA itself.

This specific ordering was actually introduced by:

	dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window. The previous __do_munmap() code was actually safe because the only thing after arch_unmap() was remove_vma_list(). arch_unmap() could not see 'vma' in the rbtree because it was detached, so it is not even capable of doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

	[1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
	[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via arch_unmap(). It was freeing page tables that had not been properly zapped.

But, the problem was bigger than this. For one, arch_unmap() can free VMAs. But, the calling __do_munmap() has variables that *point* to VMAs and obviously can't handle them just getting freed while the pointer is still in use.

I tried a couple of things here. First, I tried to fix the page table freeing problem in isolation, but I then found the VMA issue. I also tried having the MPX code return a flag if it modified the rbtree which would force __do_munmap() to re-walk to restart. That spiralled out of control in complexity pretty fast.

Just moving arch_unmap() and accepting that the bonkers failure case might eat some bounds tables seems like the simplest viable fix.

This was also reported in the following kernel bugzilla entry:

	https://bugzilla.kernel.org/show_bug.cgi?id=203123

There are some reports that this commit triggered this bug:

	3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")

While that commit certainly made the issues easier to hit, I believe the fundamental issue has been with us as long as MPX itself, thus the Fixes: tag below is for one of the original MPX commits.

[ mingo: Minor edits to the changelog and the patch. ]

Reported-by: Richard Biener <rguenther@suse.de>
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-um@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: stable@vger.kernel.org
Fixes: 3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")
Link: http://lkml.kernel.org/r/20190419194747.5E1AD6DC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[xuyu: resolve conflicts in arch/x86/include/asm/mpx.h]
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Yihao Wu
to #29558346 rcu variable can be NULL inside rcu lock region, if rcu_read_lock is called in grace period. So check if (!variable) before dereference it. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
Submitted by Xiaoguang Wang
fix #29605829

commit 23b3628e45924419399da48c2b3a522b05557c91 upstream

In io_sq_thread(), if there are task works to handle, the current code skips schedule() and goes on polling the sq again, but forgets to clear the IORING_SQ_NEED_WAKEUP flag; fix this issue. Also add two helpers to set and clear the IORING_SQ_NEED_WAKEUP flag.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
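
A sketch of the two helpers the commit text mentions; any memory-ordering or locking details around sq_flags are omitted here and are an assumption:

    static inline void io_ring_set_wakeup_flag(struct io_ring_ctx *ctx)
    {
    	ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP;
    }

    static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
    {
    	ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP;
    }

With these, io_sq_thread() can clear the flag again after draining task works instead of looping back to poll with the flag still set.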
-
Submitted by Joseph Qi
to #29613419

To be consistent, disable the low limit on arm as it is not used; also enable io latency on x86.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #29608102

commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5 upstream.

io_fail_links() doesn't consider REQ_F_COMP_LOCKED, leading to a nested spin_lock(completion_lock) and lockup.

[ 197.680409] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
[ 197.680411] rcu: blocking rcu_node structures:
[ 197.680412] Task dump for CPU 6:
[ 197.680413] link-timeout R running task 0 1669 1 0x8000008a
[ 197.680414] Call Trace:
[ 197.680420] ? io_req_find_next+0xa0/0x200
[ 197.680422] ? io_put_req_find_next+0x2a/0x50
[ 197.680423] ? io_poll_task_func+0xcf/0x140
[ 197.680425] ? task_work_run+0x67/0xa0
[ 197.680426] ? do_exit+0x35d/0xb70
[ 197.680429] ? syscall_trace_enter+0x187/0x2c0
[ 197.680430] ? do_group_exit+0x43/0xa0
[ 197.680448] ? __x64_sys_exit_group+0x18/0x20
[ 197.680450] ? do_syscall_64+0x52/0xa0
[ 197.680452] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #29608102

commit d5e16d8e23825304c6a9945116cc6b6f8d51f28c upstream.

req->work might be already initialised by the time it gets into __io_arm_poll_handler(), which will corrupt it by using fields that are in a union with req->work. Luckily, the only side effect is a missing put_creds(). Clean req->work before going there.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Pavel Begunkov
to #29608102

commit 3e863ea3bb1a2203ae648eb272db0ce6a1a2072c upstream.

The IOSQE_ASYNC branch of io_queue_sqe() is another place where an uninitialised req->work can be accessed (i.e. prior to io_req_init_async()). Nothing really bad though, it just loses the IO_WQ_WORK_CONCURRENT flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Daniele Albano
to #29608102

commit 61710e437f2807e26a3402543bdbb7217a9c8620 upstream.

We currently filter these for timeout_remove/async_cancel/files_update, but we should only be filtering for fixed file and buffer select. This also causes a second read of sqe->flags, which isn't needed. Just check req->flags for the relevant bits. This then allows these commands to be used in links, for example, like everything else.

Signed-off-by: Daniele Albano <d.albano@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
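
A sketch of the relaxed check, using the single read of req->flags the text describes (placement in each prep handler is an assumption):

    	/*
    	 * Only fixed files and buffer selection are invalid for these
    	 * commands; other sqe flags, such as link membership, are now
    	 * allowed.
    	 */
    	if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
    		return -EINVAL;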
-
Submitted by Jens Axboe
to #29608102

commit 807abcb0883439af5ead73f3308310453b97b624 upstream.

The double poll additions were centered around doing POLL_ADD on file descriptors that use more than one waitqueue (typically one for read, one for write) when being polled. However, it can also end up being triggered for when we use poll-triggered retry. For that case, we cannot safely use req->io, as that could be used by the request type itself. Add a second io_poll_iocb pointer in the structure we allocate for poll-based retry, and ensure we use the right one from the two paths.

Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
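
A sketch of the structure change described; the surrounding fields are assumptions based on the io_uring of that era:

    struct async_poll {
    	struct io_poll_iocb	poll;
    	struct io_poll_iocb	*double_poll;	/* second waitqueue entry;
    						 * lives here so poll-based
    						 * retry no longer borrows
    						 * req->io */
    	struct io_wq_work	work;
    };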
-
Submitted by Wetp Zhang
fix #29415191

commit 03151c6e0b66c63c3e9980edf78c3a7a99801764 upstream

An Action Required memory error should happen only when a processor is about to access corrupted memory, so it's synchronous and only affects the current process/thread.

Recently commit 872e9a205c84 ("mm, memory_failure: don't send BUS_MCEERR_AO for action required error") fixed the issue that Action Required memory errors could unnecessarily send SIGBUS to processes which share the error memory. But we still have another issue: we could send SIGBUS to a wrong thread.

This is because collect_procs() and task_early_kill() fail to add the current process to the "to-kill" list. So this patch is suggesting to fix it. With this fix, SIGBUS(BUS_MCEERR_AR) is never sent to a non-current process/thread.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1591321039-22141-3-git-send-email-naoya.horiguchi@nec.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Submitted by Wetp Zhang
fix #29415191

commit 872e9a205c8491daf1a51ea3733c8c1d15d51e10 upstream

Some processes don't want to be killed early, but in the "Action Required" case, those may also be killed by BUS_MCEERR_AO when sharing memory with another process which is accessing the failed memory. And sending SIGBUS with BUS_MCEERR_AO for an action required error is strange, so ignore the non-current processes here.

Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1590817116-21281-1-git-send-email-wetp.zy@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Submitted by Qiuxu Zhuo
fix #29307272

commit ce20670828c1228ecd37befbdda87a1f87a803b9 upstream

The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville servers if their CPU stepping >= 4, and failed on Ice Lake-D servers from stepping 0. The root cause is that for Ice Lake and Tremont/Jacobsville servers with CPU stepping >= 4, the offset for the bus number configuration register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all steppings use the updated 0xd0 offset.

Fix the issue by using the appropriate offset for the bus number configuration register according to the CPU model number and stepping.

Reported-by: Jerry Chen <jerry.t.chen@intel.com>
Reported-and-tested-by: Jin Wen <wen.jin@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
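
A hedged sketch combining this fix with the res_config structure introduced by the next entry below; 'cfg' and the Ice Lake-D model check are illustrative names, not the driver's:

    	/* the register offset is selected per CPU model and stepping */
    	if (cpu_is_icx_d || boot_cpu_data.x86_stepping >= 4)
    		cfg->busno_cfg_offset = 0xd0;	/* updated offset */
    	else
    		cfg->busno_cfg_offset = 0xcc;	/* original offset */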
-
Submitted by Youquan Song
fix #29307272

commit ee5340abab3babb91c1807cea47de4468b2dfc91 upstream

The device ID for the configuration agent PCI device and the offset for the bus number configuration register can be CPU model specific. So add a new structure res_config to make them configurable, and pass res_config to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Submitted by Borislav Petkov
fix #29415191

commit 1df73b2131e3b33d518609769636b41ce00212de upstream

The severity grading code returns the IN_KERNEL_RECOV error context for errors which have happened in kernel space but from which the kernel can recover. Whether the recovery can happen is determined by the exception table entry having ex_handler_fault() as its handler, declared at build time using _ASM_EXTABLE_FAULT().

IN_KERNEL_RECOV is used in mce_severity_intel() to look up the corresponding error severity in the severities table. However, the mapping back from error severity to whether the error is IN_KERNEL_RECOV is ambiguous, and in the very paranoid case - which might not be possible right now, but better safe than sorry later - an exception fixup could be attempted for another MCE whose address is in the exception table and has the proper severity. Which would be unfortunate, to say the least.

Therefore, mark such MCEs explicitly as MCE_IN_KERNEL_RECOV so that the recovery attempt is done only for them. Document the whole handling, while at it, as it is not trivial.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-10-bp@alien8.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Submitted by Tony Luck
fix #29415191

commit 43505646941bee217b91d064756975aa1ab6ee3b upstream

Sometimes, when logs are getting lost, it's nice to just have everything dumped to the serial console.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-7-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Submitted by Tony Luck
fix #29415191

commit 925946cfa715a5a71639528f82b98e58f14dd4cb upstream

Instead of keeping count of how many handlers are registered on the MCE notifier chain and printing if the count is below some magic value, look at mce->kflags to see if anyone claims to have handled/logged this error.

[ bp: Do not print ->kflags in __print_mce(). ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-6-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Submitted by Tony Luck
fix #29415191

commit 23ba710a0864108910c7531dc4c73ef65eca5568 upstream

If the handler took any action to log or deal with the error, set a bit in mce->kflags so that the default handler at the end of the machine check chain can see what has been done. Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers skip over errors already processed by CEC.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
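
An EDAC-side sketch of the described pattern; the flag names come from this patch series, the surrounding handler is condensed:

    	/* errors the CEC already consumed need no further logging */
    	if (mce->kflags & MCE_HANDLED_CEC)
    		return NOTIFY_DONE;

    	/* ... decode and log the error ... */

    	mce->kflags |= MCE_HANDLED_EDAC;	/* tell the default handler */
    	return NOTIFY_OK;			/* no more NOTIFY_STOP */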
-
Submitted by Tony Luck
fix #29415191

commit 1de08dccd383482a3e88845d3554094d338f5ff9 upstream

Many different subsystems can register on the MCE handler chain. Add a new bitmask field and define values so that handlers can indicate whether they took any action to log or otherwise handle an error. The default handler at the end of the chain can use this information to decide whether to print to the console log.

Boris suggested a generic name and leaving plenty of spare bits for possible future use.

[ bp: Move flag bits to the internal mce.h header and use BIT_ULL(). ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-4-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
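
A sketch of the flag definitions this series adds to the internal mce.h header; the exact set and bit positions are reconstructed from memory of the upstream series and should be treated as assumptions:

    /* who handled/logged this error; BIT_ULL() per Boris's suggestion */
    #define MCE_HANDLED_CEC		BIT_ULL(0)
    #define MCE_HANDLED_UC		BIT_ULL(1)
    #define MCE_HANDLED_EXTLOG	BIT_ULL(2)
    #define MCE_HANDLED_NFIT	BIT_ULL(3)
    #define MCE_HANDLED_EDAC	BIT_ULL(4)
    #define MCE_HANDLED_MCELOG	BIT_ULL(5)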
-