- 02 Sep, 2020: 40 commits
-
Committed by Erwei Deng
fix #30012285

Open the UIO Kconfig for x86_64.

Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Tony Luck
fix #29902604

commit 29b8e84fbc23cb2b70317b745641ea0569426872 upstream

Simplifies the code a little.

Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Dan Carpenter
fix #29902604

commit f84afbdd3a9e5e10633695677b95422572f920dc upstream

The "cmd" comes from the user and it can be up to 255. If it's more than the number of bits in a long, it results in an out-of-bounds read when we check test_bit(cmd, &cmd_mask). The highest valid value for "cmd" is ND_CMD_CALL (10), so I added a compare against that.

Fixes: 62232e45 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
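A minimal sketch of the bound check described above, assuming the upstream __nd_ioctl() context (surrounding code illustrative):

    /* Reject user-supplied commands above the highest valid ioctl
     * number before "cmd" is used as a bit index into cmd_mask. */
    if (cmd > ND_CMD_CALL)
            return -EINVAL;
    if (!test_bit(cmd, &cmd_mask))
            return -ENOTTY;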
-
Committed by Yihao Wu
fix #29692432

seqcount assumes that writers are already exclusive, so it saves extra locking. However, this critical section protected by idle_seqcount can be entered twice if an interrupt tries to wake up a task on this CPU. Once the race caused by interrupts is avoided, writers are exclusive. So a seqlock is unnecessary, and local_irq_save + seqcount is enough.

Fixes: 61e58859 ("alinux: sched: Introduce per-cgroup idle accounting")
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
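A sketch of the resulting write-side pattern, assuming idle_seqcount is the per-cgroup seqcount from the patch above (the update body is a placeholder):

    unsigned long flags;

    /* Disabling local interrupts keeps the wakeup path on this CPU
     * from re-entering the write section, so the seqcount writer is
     * exclusive without a full seqlock. */
    local_irq_save(flags);
    write_seqcount_begin(&idle_seqcount);
    /* ... update per-cgroup idle statistics ... */
    write_seqcount_end(&idle_seqcount);
    local_irq_restore(flags);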
-
Committed by Jiufei Xue
fix #29820404

commit 6d816e088c359866f9867057e04f244c608c42fe linux-block/io_uring-5.9 branch

We're holding the request reference, but we need to go one higher to ensure that the ctx remains valid after the request has finished. If the ring is closed with pending task_work inflight, and the given io_kiocb finishes sync during issue, then we need a reference to the ring itself around the task_work execution cycle.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: syzbot+9b260fc33297966f5a8e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
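A sketch of the reference dance described above, assuming the ring ctx is pinned by a percpu_ref as upstream (placement of get/put illustrative):

    /* Pin the ring itself, not just the request, across queueing
     * and running the task_work; the ring may otherwise be torn
     * down while the work is still inflight. */
    percpu_ref_get(&req->ctx->refs);
    io_req_task_work_add(req, &req->task_work);
    /* ... after the task_work has executed ... */
    percpu_ref_put(&req->ctx->refs);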
-
Committed by Thomas Gleixner
fix #29760792

commit 94a46d316f2b54e3de8a4fa884cb16383db7fcd8 upstream

There is no reason to have nmi_enter/exit() in the actual MCE handlers. Move it to the entry point. This also covers the until now uncovered initial handler which only prints.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.243936614@linutronix.de
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Committed by Guoyu Huang
fix #29760246

Cherry-pick 2dd2111d0d383df104b144e0d1f6b5a00cb7cd88 from io_uring-5.9.

loop_rw_iter() does not check whether the file has a read or write function. This can lead to a NULL pointer dereference when the user passes in a file descriptor that does not have a read or write function. The crash log looks like this:

[ 99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 99.835364] #PF: supervisor instruction fetch in kernel mode
[ 99.836522] #PF: error_code(0x0010) - not-present page
[ 99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
[ 99.839649] Oops: 0010 [#2] SMP PTI
[ 99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G D 5.8.0 #2
[ 99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 99.845140] RIP: 0010:0x0
[ 99.845840] Code: Bad RIP value.
[ 99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[ 99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
[ 99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
[ 99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[ 99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[ 99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
[ 99.856914] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[ 99.858651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
[ 99.861979] Call Trace:
[ 99.862617]  loop_rw_iter.part.0+0xad/0x110
[ 99.863838]  io_write+0x2ae/0x380
[ 99.864644]  ? kvm_sched_clock_read+0x11/0x20
[ 99.865595]  ? sched_clock+0x9/0x10
[ 99.866453]  ? sched_clock_cpu+0x11/0xb0
[ 99.867326]  ? newidle_balance+0x1d4/0x3c0
[ 99.868283]  io_issue_sqe+0xd8f/0x1340
[ 99.869216]  ? __switch_to+0x7f/0x450
[ 99.870280]  ? __switch_to_asm+0x42/0x70
[ 99.871254]  ? __switch_to_asm+0x36/0x70
[ 99.872133]  ? lock_timer_base+0x72/0xa0
[ 99.873155]  ? switch_mm_irqs_off+0x1bf/0x420
[ 99.874152]  io_wq_submit_work+0x64/0x180
[ 99.875192]  ? kthread_use_mm+0x71/0x100
[ 99.876132]  io_worker_handle_work+0x267/0x440
[ 99.877233]  io_wqe_worker+0x297/0x350
[ 99.878145]  kthread+0x112/0x150
[ 99.878849]  ? __io_worker_unuse+0x100/0x100
[ 99.879935]  ? kthread_park+0x90/0x90
[ 99.880874]  ret_from_fork+0x22/0x30
[ 99.881679] Modules linked in:
[ 99.882493] CR2: 0000000000000000
[ 99.883324] ---[ end trace 4453745f4673190b ]---
[ 99.884289] RIP: 0010:0x0
[ 99.884837] Code: Bad RIP value.
[ 99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[ 99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
[ 99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
[ 99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[ 99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[ 99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
[ 99.896079] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[ 99.898017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0

Fixes: 32960613b7c3 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
Cc: stable@vger.kernel.org
Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
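A sketch of the dispatch after the fix, assuming the upstream io_read()/io_write() structure (simplified):

    /* Fall back to loop_rw_iter() only when a non-iter method
     * exists; otherwise fail instead of calling a NULL file op. */
    if (req->file->f_op->read_iter)
            ret = io_iter_do_read(req, iter);
    else if (req->file->f_op->read)
            ret = loop_rw_iter(READ, req->file, kiocb, iter);
    else
            ret = -EINVAL;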
-
Committed by Baolin Wang
fix #29327388

Just use bio->bi_vcnt directly to validate that there is only one bvec in a bio for PRP mode, which removes warnings for dm devices. No functional changes.

Fixes: c8b92b847512 ("alios: nvme-pci: Improve mapping single segment requests using PRP")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #29139300

commit 4711b57317f0ff5ca9fbd5e2df6c73b2c07ddc53 upstream

If we yank a 'same_queue_rq' request off the plug list, we should also decrement the cached request count.

Fixes: 5f0ed774ed29 ("block: sum requests in the plug structure")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
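A sketch of the one-line fix described above, in the blk-mq submit path where 'same_queue_rq' is stolen from the plug list (context illustrative):

    /* The request leaves the plug list, so the cached count must
     * follow it; otherwise the flush threshold drifts upward. */
    list_del_init(&same_queue_rq->queuelist);
    plug->rq_count--;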
-
Committed by Jens Axboe
to #29139300

commit 5f0ed774ed2914decfd397569fface997532e94d upstream

This isn't exactly the same as the previous count, as it includes requests for all devices. But that really doesn't matter, if we have more than the threshold (16) queued up, flush it. It's not worth it to have an expensive list loop for this.

[Hongnan Li] performance evaluation

Performance results running fio (ioengine=io_uring, iodepth=256):

bs      IOPS(randread nomerges=0)    IOPS(randread nomerges=2)
        before / after               before / after
-----   --------------------------   --------------------------
512     818K / 840K                  855K / 897K
1k      816K / 842K                  853K / 898K
2k      820K / 839K                  850K / 899K
4k      818K / 840K                  852K / 895K
8k      574K / 574K                  574K / 574K

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
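A sketch of the threshold check described above, assuming the rq_count field this series adds to struct blk_plug (BLK_MAX_REQUEST_COUNT is 16 upstream):

    /* Count across all queues in the plug instead of walking the
     * list; flush once the total passes the threshold. */
    plug->rq_count++;
    if (plug->rq_count >= BLK_MAX_REQUEST_COUNT)
            blk_flush_plug_list(plug, false);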
-
Committed by Jeffle Xu
fix #29612968

Sector addresses of all bios in a single request are guaranteed to be contiguous, except for DISCARD requests. We could get the whole sector range of the request by blk_rq_pos() and blk_rq_bytes() for normal read/write requests, but here we still print the sector range of every bio for code simplicity. Since it is a low frequency operation, this design leads to no performance penalty.

Besides, squash the 'if (bio)' and 'while (1)' into one single 'while (bio)'.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
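A sketch of the squashed loop described above, walking a request's bio chain and printing each bio's sector range (print format illustrative):

    struct bio *bio = rq->bio;

    /* One 'while (bio)' replaces the previous 'if (bio)' guard plus
     * the 'while (1)' body with its explicit break. */
    while (bio) {
            pr_info("sector %llu, bytes %u\n",
                    (unsigned long long)bio->bi_iter.bi_sector,
                    bio->bi_iter.bi_size);
            bio = bio->bi_next;
    }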
-
Committed by Kirill A. Shutemov
to #28718400

commit 246c320a8cfe0b11d81a4af38fa9985ef0cc9a4c upstream.

VMAs with the VM_GROWSDOWN or VM_GROWSUP flag set can change their size under mmap_read_lock(). It can lead to a race with __do_munmap():

    Thread A                     Thread B
    __do_munmap()
      detach_vmas_to_be_unmapped()
      mmap_write_downgrade()
                                 expand_downwards()
                                   vma->vm_start = address;
                                   // The VMA now overlaps with
                                   // VMAs detached by the Thread A
                                 // page fault populates expanded part
                                 // of the VMA
      unmap_region()
        // Zaps pagetables partly
        // populated by Thread B

A similar race exists for expand_upwards().

The fix is to avoid downgrading mmap_lock in __do_munmap() if detached VMAs are next to a VM_GROWSDOWN or VM_GROWSUP VMA.

[akpm@linux-foundation.org: s/mmap_sem/mmap_lock/ in comment]

Fixes: 3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org> [4.20+]
Link: http://lkml.kernel.org/r/20200709105309.42495-1-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
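A sketch of the guard in __do_munmap(), following the shape of the upstream fix (simplified; 'prev'/'next' are the VMAs bracketing the detached range):

    /* Do not downgrade to a read lock if a neighbouring VMA could
     * grow into the range we just detached. */
    if (downgrade) {
            if (next && (next->vm_flags & VM_GROWSDOWN))
                    downgrade = false;
            else if (prev && (prev->vm_flags & VM_GROWSUP))
                    downgrade = false;
            else
                    mmap_write_downgrade(mm);
    }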
-
Committed by Dave Hansen
to #28718400

commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream.

This is a bit of a mess, to put it mildly. But, it's a bug that only seems to have showed up in 4.20 but wasn't noticed until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX uses bounds tables that protect other areas of memory. When memory is unmapped, there is also a need to unmap the MPX bounds tables. Barring this, unused bounds tables can eat 80% of the address space.

But, the recursive do_munmap() that gets called via arch_unmap() wreaks havoc with __do_munmap()'s state. It can result in freeing populated page tables, accessing bogus VMA state, double-freed VMAs and more. See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_munmap() has a chance to do anything meaningful. Also, remove the 'vma' argument and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap(). No functional change. I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the VDSO and zeroes out 'current->mm->context.vdso_base'. Moving arch_unmap() makes this happen earlier in __do_munmap(). But, 'vdso_base' seems to only be used in perf and in the signal delivery that happens near the return to userspace. I can not find any likely impact to powerpc, other than the zeroing happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by its removal. I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to what was there before. For the munmap() failure case, it's possible that some MPX tables will be zapped for memory that continues to be in use. But, this is an extraordinarily unlikely scenario and the harm would be that MPX provides no protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

    ptr = mmap();
    // use ptr
    ret = munmap(ptr);
    if (ret)
            // oh, there was an error, I'll
            // keep using ptr.

Because if you're doing munmap(), you are *done* with the memory. There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if necessary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including freeing page tables (or queueing them to be freed).
5. Fix up some of the accounting (like fput()) and actually free the VMA itself.

This specific ordering was actually introduced by:

  dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window. The previous __do_munmap() code was actually safe because the only thing after arch_unmap() was remove_vma_list(). arch_unmap() could not see 'vma' in the rbtree because it was detached, so it is not even capable of doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

  [1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
  [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via arch_unmap(). It was freeing page tables that had not been properly zapped. But, the problem was bigger than this. For one, arch_unmap() can free VMAs. But, the calling __do_munmap() has variables that *point* to VMAs and obviously can't handle them just getting freed while the pointer is still in use.

I tried a couple of things here. First, I tried to fix the page table freeing problem in isolation, but I then found the VMA issue. I also tried having the MPX code return a flag if it modified the rbtree which would force __do_munmap() to re-walk to restart. That spiralled out of control in complexity pretty fast. Just moving arch_unmap() and accepting that the bonkers failure case might eat some bounds tables seems like the simplest viable fix.

This was also reported in the following kernel bugzilla entry:

  https://bugzilla.kernel.org/show_bug.cgi?id=203123

There are some reports that this commit triggered this bug:

  3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")

While that commit certainly made the issues easier to hit, I believe the fundamental issue has been with us as long as MPX itself, thus the Fixes: tag below is for one of the original MPX commits.

[ mingo: Minor edits to the changelog and the patch. ]

Reported-by: Richard Biener <rguenther@suse.de>
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-um@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: stable@vger.kernel.org
Fixes: 3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")
Link: http://lkml.kernel.org/r/20190419194747.5E1AD6DC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[xuyu: resolve conflicts in arch/x86/include/asm/mpx.h]
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
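A sketch of the reordering described above (signature simplified; the upstream patch also drops the 'vma' parameter from arch_unmap()):

    int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
                    struct list_head *uf, bool downgrade)
    {
            /* Let the arch hook (MPX) run before any VMA is split,
             * detached or freed; it now does its own VMA lookup. */
            arch_unmap(mm, start, start + len);

            /* ... find, split, detach and zap the VMAs as before ... */
    }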
-
Committed by Yihao Wu
to #29558346

An RCU-protected variable can be NULL inside an RCU read-side critical section if rcu_read_lock() is called during a grace period. So check if (!variable) before dereferencing it.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
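A minimal sketch of the pattern, with 'ptr' standing in for whichever RCU-protected pointer the patch guards:

    rcu_read_lock();
    p = rcu_dereference(ptr);
    if (!p) {
            /* Object was already unpublished; bail out safely. */
            rcu_read_unlock();
            return;
    }
    /* ... p is safe to use until rcu_read_unlock() ... */
    rcu_read_unlock();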
-
Committed by Xiaoguang Wang
fix #29605829

commit 23b3628e45924419399da48c2b3a522b05557c91 upstream

In io_sq_thread(), if there are task works to handle, the current code will skip schedule() and go on polling the sq again, but forgets to clear the IORING_SQ_NEED_WAKEUP flag; fix this issue. Also add two helpers to set and clear the IORING_SQ_NEED_WAKEUP flag.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
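A sketch of the two helpers, following the shape of the upstream commit:

    static void io_ring_set_wakeup_flag(struct io_ring_ctx *ctx)
    {
            /* Tell submitters the sq thread is parked and needs a
             * wakeup via io_uring_enter(IORING_ENTER_SQ_WAKEUP). */
            ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP;
    }

    static void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
    {
            ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP;
    }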
-
Committed by Joseph Qi
to #29613419

To be consistent, disable the low limit on arm as it is not used, and also enable io latency on x86.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Committed by Pavel Begunkov
to #29608102

commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5 upstream.

io_fail_links() doesn't consider REQ_F_COMP_LOCKED, leading to nested spin_lock(completion_lock) and lockup.

[ 197.680409] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
[ 197.680411] rcu: blocking rcu_node structures:
[ 197.680412] Task dump for CPU 6:
[ 197.680413] link-timeout R running task 0 1669 1 0x8000008a
[ 197.680414] Call Trace:
[ 197.680420]  ? io_req_find_next+0xa0/0x200
[ 197.680422]  ? io_put_req_find_next+0x2a/0x50
[ 197.680423]  ? io_poll_task_func+0xcf/0x140
[ 197.680425]  ? task_work_run+0x67/0xa0
[ 197.680426]  ? do_exit+0x35d/0xb70
[ 197.680429]  ? syscall_trace_enter+0x187/0x2c0
[ 197.680430]  ? do_group_exit+0x43/0xa0
[ 197.680448]  ? __x64_sys_exit_group+0x18/0x20
[ 197.680450]  ? do_syscall_64+0x52/0xa0
[ 197.680452]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Pavel Begunkov
to #29608102

commit d5e16d8e23825304c6a9945116cc6b6f8d51f28c upstream.

req->work might be already initialised by the time it gets into __io_arm_poll_handler(), which will corrupt it by using fields that are in a union with req->work. Luckily, the only side effect is a missing put_creds(). Clean req->work before going there.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Pavel Begunkov
to #29608102

commit 3e863ea3bb1a2203ae648eb272db0ce6a1a2072c upstream.

The IOSQE_ASYNC branch of io_queue_sqe() is another place where an uninitialised req->work can be accessed (i.e. prior to io_req_init_async()). Nothing really bad though, it just loses the IO_WQ_WORK_CONCURRENT flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Daniele Albano
to #29608102

commit 61710e437f2807e26a3402543bdbb7217a9c8620 upstream.

We currently filter these for timeout_remove/async_cancel/files_update, but we only should be filtering for fixed file and buffer select. This also causes a second read of sqe->flags, which isn't needed. Just check req->flags for the relevant bits. This then allows these commands to be used in links, for example, like everything else.

Signed-off-by: Daniele Albano <d.albano@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
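A sketch of the narrowed check, following the shape of the upstream commit:

    /* Only fixed-file and buffer-select make no sense for these
     * opcodes; req->flags already mirrors sqe->flags, so no second
     * read of the sqe is needed. */
    if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
            return -EINVAL;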
-
Committed by Jens Axboe
to #29608102

commit 807abcb0883439af5ead73f3308310453b97b624 upstream.

The double poll additions were centered around doing POLL_ADD on file descriptors that use more than one waitqueue (typically one for read, one for write) when being polled. However, it can also end up being triggered for when we use poll triggered retry. For that case, we cannot safely use req->io, as that could be used by the request type itself. Add a second io_poll_iocb pointer in the structure we allocate for poll based retry, and ensure we use the right one from the two paths.

Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
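A sketch of the structural change, assuming the async_poll container used for poll-based retry (the upstream patch adds the second pointer alongside the embedded iocb):

    struct async_poll {
            struct io_poll_iocb     poll;
            /* Second waitqueue entry for double-wait files, so the
             * retry path no longer has to borrow req->io. */
            struct io_poll_iocb     *double_poll;
            struct io_wq_work       work;
    };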
-
Committed by Wetp Zhang
fix #29415191

commit 03151c6e0b66c63c3e9980edf78c3a7a99801764 upstream

Action Required memory errors should happen only when a processor is about to access corrupted memory, so they are synchronous and only affect the current process/thread. Recently, commit 872e9a205c84 ("mm, memory_failure: don't send BUS_MCEERR_AO for action required error") fixed the issue that an Action Required memory error could unnecessarily send SIGBUS to processes which share the error memory. But we still have another issue: we could send SIGBUS to a wrong thread. This is because collect_procs() and task_early_kill() fail to add the current process to the "to-kill" list. So this patch is suggesting to fix it. With this fix, SIGBUS(BUS_MCEERR_AR) is never sent to a non-current process/thread.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1591321039-22141-3-git-send-email-naoya.horiguchi@nec.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Wetp Zhang
fix #29415191

commit 872e9a205c8491daf1a51ea3733c8c1d15d51e10 upstream

Some processes don't want to be killed early, but in the "Action Required" case, those may also be killed by BUS_MCEERR_AO when sharing memory with another process which is accessing the failed memory. And sending SIGBUS with BUS_MCEERR_AO for an action required error is strange, so ignore the non-current processes here.

Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1590817116-21281-1-git-send-email-wetp.zy@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Qiuxu Zhuo
fix #29307272

commit ce20670828c1228ecd37befbdda87a1f87a803b9 upstream

The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville servers if their CPU stepping >= 4, and failed on Ice Lake-D servers from stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville servers with CPU stepping >= 4, the offset for the bus number configuration register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the steppings use the updated 0xd0 offset. Fix the issue by using the appropriate offset for the bus number configuration register according to the CPU model number and stepping.

Reported-by: Jerry Chen <jerry.t.chen@intel.com>
Reported-and-tested-by: Jin Wen <wen.jin@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
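An illustrative sketch of the selection logic (the real driver keys this off per-model res_config data; the helper below is hypothetical):

    /* 0xd0 for Ice Lake-D (all steppings) and for Ice Lake /
     * Tremont-Jacobsville from stepping 4; 0xcc otherwise. */
    static int busno_cfg_offset(const struct cpuinfo_x86 *c)
    {
            if (c->x86_model == INTEL_FAM6_ICELAKE_D)
                    return 0xd0;
            return c->x86_stepping >= 4 ? 0xd0 : 0xcc;
    }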
-
Committed by Youquan Song
fix #29307272

commit ee5340abab3babb91c1807cea47de4468b2dfc91 upstream

The device ID for the configuration agent PCI device and the offset for the bus number configuration register can be CPU model specific. So add a new structure res_config to make them configurable, and pass res_config to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
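A sketch of the structure described above (field set follows the upstream skx_common.h addition; details may differ in this backport):

    struct res_config {
            int type;                 /* SKX or I10NM */
            /* Configuration-agent PCI device ID. */
            unsigned int decs_did;
            /* Offset of the bus-number configuration register. */
            int busno_cfg_offset;
    };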
-
Committed by Borislav Petkov
fix #29415191

commit 1df73b2131e3b33d518609769636b41ce00212de upstream

The severity grading code returns IN_KERNEL_RECOV error context for errors which have happened in kernel space but from which the kernel can recover. Whether the recovery can happen is determined by the exception table entry having as handler ex_handler_fault() and which has been declared at build time using _ASM_EXTABLE_FAULT(). IN_KERNEL_RECOV is used in mce_severity_intel() to look up the corresponding error severity in the severities table.

However, the mapping back from error severity to whether the error is IN_KERNEL_RECOV is ambiguous, and in the very paranoid case - which might not be possible right now, but better safe than sorry later - an exception fixup could be attempted for another MCE whose address is in the exception table and has the proper severity. Which would be unfortunate, to say the least.

Therefore, mark such MCEs explicitly as MCE_IN_KERNEL_RECOV so that the recovery attempt is done only for them. Document the whole handling, while at it, as it is not trivial.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-10-bp@alien8.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Tony Luck
fix #29415191

commit 43505646941bee217b91d064756975aa1ab6ee3b upstream

Sometimes, when logs are getting lost, it's nice to just have everything dumped to the serial console.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-7-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Tony Luck
fix #29415191

commit 925946cfa715a5a71639528f82b98e58f14dd4cb upstream

Instead of keeping count of how many handlers are registered on the MCE notifier chain and printing if below some magic value, look at mce->kflags to see if anyone claims to have handled/logged this error.

[ bp: Do not print ->kflags in __print_mce(). ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-6-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
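A sketch of the default notifier after this change, combined with the console-dump switch from the commit above (shape follows upstream; simplified):

    /* Print only when no earlier handler on the chain claimed the
     * error via m->kflags, or when the user asked for everything. */
    if (mca_cfg.print_all || !m->kflags)
            __print_mce(m);
    return NOTIFY_DONE;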
-
Committed by Tony Luck
fix #29415191

commit 23ba710a0864108910c7531dc4c73ef65eca5568 upstream

If the handler took any action to log or deal with the error, set a bit in mce->kflags so that the default handler at the end of the machine check chain can see what has been done. Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers skip over errors already processed by CEC.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Tony Luck
fix #29415191

commit 1de08dccd383482a3e88845d3554094d338f5ff9 upstream

There can be many different subsystems registered on the MCE handler chain. Add a new bitmask field and define values so that handlers can indicate whether they took any action to log or otherwise handle an error. The default handler at the end of the chain can use this information to decide whether to print to the console log. Boris suggested a generic name and leaving plenty of spare bits for possible future use.

[ bp: Move flag bits to the internal mce.h header and use BIT_ULL(). ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-4-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
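A sketch of the flag values, following the upstream MCE_HANDLED_* names (the exact set in this backport may differ):

    /* struct mce gains a __u64 kflags field; a handler ORs in its
     * bit when it logs or otherwise deals with the error. */
    #define MCE_HANDLED_CEC      BIT_ULL(0)
    #define MCE_HANDLED_UC       BIT_ULL(1)
    #define MCE_HANDLED_EXTLOG   BIT_ULL(2)
    #define MCE_HANDLED_NFIT     BIT_ULL(3)
    #define MCE_HANDLED_EDAC     BIT_ULL(4)
    #define MCE_HANDLED_MCELOG   BIT_ULL(5)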
-
Committed by Peter Zijlstra
fix #29415191

commit 0d00449c7a28a1514595630735df383dec606812 upstream

A few exceptions (like #DB and #BP) can happen at any location in the code, which means that tracers should treat events from these exceptions as NMI-like. The interrupted context could be holding locks with interrupts disabled, for instance. Similarly, #MC is an actual NMI-like exception. All of them use ist_enter(), which only concerns itself with RCU but does not do any of the other setup that NMIs need. This means things like:

    printk()
      raw_spin_lock_irq(&logbuf_lock);
    <#DB/#BP/#MC>
      printk()
        raw_spin_lock_irq(&logbuf_lock);

are entirely possible (well, not really, since printk tries hard to play nice, but the concept stands). So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter() caller must be both notrace and NOKPROBE, or in the noinstr text section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.525508608@linutronix.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Peter Zijlstra
fix #29415191

commit 5567d11c21a1d508a91a8cb64a819783a0835d9f upstream

Convert #MC over to using task_work_add(); it will run the same code slightly later, on the return-to-user path of the same exception.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.957390899@linutronix.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
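A sketch of the deferral in do_machine_check(), following the shape of the upstream change (field and callback names as upstream; simplified):

    /* Instead of acting on the user task from #MC context, stash
     * what the recovery needs and queue it for return-to-user. */
    current->mce_addr = m.addr;
    current->mce_status = m.mcgstatus;
    init_task_work(&current->mce_kill_me, kill_me_maybe);
    task_work_add(current, &current->mce_kill_me, true);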
-
Committed by Youquan Song
fix #29415191

commit b052df3da821adfd6be26a6eb16624fb50e90e56 upstream

This is completely overengineered and definitely not an interface which should be made available to anything else than this particular MCE case.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.462640294@linutronix.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Tony Luck
fix #29415191

commit 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 upstream

An interesting thing happened when a guest Linux instance took a machine check. The VMM unmapped the bad page from guest physical space and passed the machine check to the guest. Linux took all the normal actions to offline the page from the process that was using it. But then guest Linux crashed because it said there was a second machine check inside the kernel with this stack trace:

    do_memory_failure
      set_mce_nospec
        set_memory_uc
          _set_memory_uc
            change_page_attr_set_clr
              cpa_flush
                clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine check (it isn't consuming the data). Further investigation showed that the VMM had passed in another machine check because it appeared that the guest was accessing the bad page.

The fix is to check the scope of the poison by checking the MCi_MISC register. If the entire page is affected, then unmap the page. If only part of the page is affected, then mark the page as uncacheable. This assumes that VMMs will do the logical thing and pass in the "whole page scope" via the MCi_MISC register (since they unmapped the entire page).

[ bp: Adjust to x86/entry changes. ]

Fixes: 284ce401 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
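A sketch of the scope check, following the upstream whole_page() helper (the LSB field of MCi_MISC encodes the size of the poisoned region):

    static bool whole_page(struct mce *m)
    {
            /* No valid MISC info: conservatively assume whole page. */
            if (!(m->status & MCI_STATUS_MISCV))
                    return true;
            /* A recoverable-address LSB >= PAGE_SHIFT means the
             * poison scope covers at least one full page. */
            return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
    }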
-
Committed by Alexandru Gagniuc
task #29600094

commit f496648b99f8f7f6711f7c30a6327381f37dd1e8 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

When in-band presence detect is disabled, PDS may come up at any time or not at all. PDS being low may indicate that the card is still mating, and we could expect contact bounce to bring down the link as well. It is reasonable to assume that most cards will mate in a hotplug slot in about a second. Thus, when we know PDS only reflects out-of-band presence detect, it's worthwhile to wait the extra second or so to make sure the card is properly mated before loading the driver and to prevent the hotplug code from disabling a device if the presence detect change goes active after the device is enabled.

Link: https://lore.kernel.org/r/20191025190047.38130-3-stuart.w.hayes@gmail.com
[bhelgaas: use ctrl_info() instead of pci_info()]
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
(cherry picked from commit f496648b99f8f7f6711f7c30a6327381f37dd1e8)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
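A sketch of the wait, following the shape of the helper this patch adds (polling budget per its "about a second" estimate; details may differ in the backport):

    static void pcie_wait_for_presence(struct pci_dev *pdev)
    {
            int timeout = 1250;     /* ms; covers mating plus bounce */
            u16 slot_status;

            do {
                    pcie_capability_read_word(pdev, PCI_EXP_SLTSTA,
                                              &slot_status);
                    if (slot_status & PCI_EXP_SLTSTA_PDS)
                            return;
                    msleep(10);
                    timeout -= 10;
            } while (timeout > 0);
    }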
-
Committed by Alexandru Gagniuc
task #29600094

commit 202853595e53f981c86656c49fc1cc1e3620f558 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

The presence detect state (PDS) is normally a logical OR of in-band and out-of-band (OOB) presence detect. As of PCIe 4.0, there is the option to disable in-band presence so that the PDS bit always reflects the state of the out-of-band presence. The recommendation of the PCIe spec is to disable in-band presence whenever supported (PCIe r5.0, appendix I implementation note):

  Due to architectural issues, the in-band (Physical-Layer-based) portion of the PD mechanism is deprecated for use with async hot-plug. One issue is that in-band PD as architected does not detect adapter removal during certain LTSSM states, notably the L1 and Disabled States. Another issue is that when both in-band and OOB PD are being used together, the Presence Detect State bit and its associated interrupt mechanism always reflect the logical OR of the in-band and OOB PD states, and with some hot-plug hardware configurations, it is important for software to detect and respond to in-band and OOB PD events independently. If OOB PD is being used and the associated DSP supports In-Band PD Disable, it is recommended that the In-Band PD Disable bit be Set, and the Presence Detect State bit and its associated interrupt mechanism be used exclusively for OOB PD. As a substitute for in-band PD with async hot-plug, the reference model uses either the DPC or the DLL Link Active mechanism.

Link: https://lore.kernel.org/r/20191025190047.38130-2-stuart.w.hayes@gmail.com
[bhelgaas: move PCI_EXP_SLTCAP2 read earlier & print PCI_EXP_SLTCAP2_IBPD value (suggested by Lukas)]
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
(cherry picked from commit 202853595e53f981c86656c49fc1cc1e3620f558)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Conflicts:
	drivers/pci/hotplug/pciehp.h
	drivers/pci/hotplug/pciehp_hpc.c
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
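A sketch of probing and setting the In-Band PD Disable bit during controller init, using the register names this patch introduces (placement illustrative):

    u32 slot_cap2;

    pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP2, &slot_cap2);
    if (slot_cap2 & PCI_EXP_SLTCAP2_IBPD) {
            /* Make PDS reflect out-of-band presence only. */
            pcie_write_cmd_nowait(ctrl, PCI_EXP_SLTCTL_IBPD_DISABLE,
                                  PCI_EXP_SLTCTL_IBPD_DISABLE);
            ctrl->inband_presence_disabled = 1;
    }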
-
Committed by Thomas Gleixner
task #29600094

commit 9ae0522537852408f0f48af888e44d6876777463 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

The AER error injection mechanism just blindly abuses generic_handle_irq(), which is really not meant for consumption by random drivers. The include of linux/irq.h should have been a red flag in the first place. Driver code, unless implementing interrupt chips or low-level hypervisor functionality, has absolutely no business with that.

Invoking generic_handle_irq() from non-interrupt-handling context can have nasty side effects, at least on x86, due to the hardware trainwreck which makes interrupt affinity changes a fragile beast. Sathyanarayanan triggered a NULL pointer dereference in the low-level APIC code that way. While the particular pointer could be checked, this would only paper over the issue because there are other ways to trigger warnings or silently corrupt state.

Invoke the new irq_inject_interrupt() mechanism, which has the necessary sanity checks in place and injects the interrupt via the irq_retrigger() mechanism, which is at least halfways safe vs. the fragile x86 affinity change mechanics. It's safe on x86 as it does not corrupt state, but it still can cause a premature completion of an interrupt affinity change, causing the interrupt line to become stale. Very unlikely, but possible.

For regular operations this is a non-issue, as AER error injection is meant for debugging and testing and not for usage on production systems. People using this should better know what they are doing.

Fixes: 390e2db82480 ("PCI/AER: Abstract AER interrupt handling")
Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lkml.kernel.org/r/20200306130624.098374457@linutronix.de
(cherry picked from commit 9ae0522537852408f0f48af888e44d6876777463)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Conflicts:
	drivers/pci/pcie/Kconfig
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
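A before/after sketch of the injection path, assuming the surrounding code resembles upstream aer_inject.c:

    /* Before: raw replay from process context, unsafe on x86. */
    local_irq_disable();
    generic_handle_irq(edev->irq);
    local_irq_enable();

    /* After: sanity-checked injection via the retrigger machinery. */
    ret = irq_inject_interrupt(edev->irq);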
-
Committed by Thomas Gleixner
task #29600094

commit acd26bcf362708594ea081ef55140e37d0854ed2 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

Error injection mechanisms need a halfways safe way to inject interrupts, as invoking generic_handle_irq() or the actual device interrupt handler directly from e.g. a debugfs write is not guaranteed to be safe. On x86, generic_handle_irq() is unsafe due to the hardware trainwreck which is the base of x86 interrupt delivery and affinity management.

Move the irq debugfs injection code into a separate function which can be used by error injection code as well. The implementation prevents at least that state is corrupted, but it cannot close a very tiny race window on x86 which might result in a stale and not serviced device interrupt under very unlikely circumstances.

This is explicitly for debugging and testing and not for production use or abuse in random driver code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.990928309@linutronix.de
(cherry picked from commit acd26bcf362708594ea081ef55140e37d0854ed2)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Conflicts:
	include/linux/interrupt.h
	kernel/irq/debugfs.c
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Thomas Gleixner
task #29600094

commit da90921acc62c71d27729ae211ccfda5370bf75b upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

The code sets IRQS_REPLAY unconditionally, whether the resend happens or not. That doesn't have bad side effects right now, but inconsistent state is always a latent source of problems.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.882129117@linutronix.de
(cherry picked from commit da90921acc62c71d27729ae211ccfda5370bf75b)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Thomas Gleixner
task #29600094

commit 1f85b1f5e1f5541272abedc19ba7b6c5b564c228 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

In preparation for an interrupt injection interface which can be used safely by error injection mechanisms, e.g. PCIe AER, add a return value to check_irq_resend() so errors can be propagated to the caller. Split out the software resend code so the ugly #ifdef in check_irq_resend() goes away and the whole thing becomes readable.

Fix up the caller in debugfs. The caller in irq_startup() does not care about the return value, as this is unconditionally invoked for all interrupts and the resend is best effort anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.775200917@linutronix.de
(cherry picked from commit 1f85b1f5e1f5541272abedc19ba7b6c5b564c228)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-