- 02 Sep, 2020: 40 commits
-
Committed by Erwei Deng
fix #30012285

Open the UIO Kconfig for x86_64.

Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Tony Luck
fix #29902604

commit 29b8e84fbc23cb2b70317b745641ea0569426872 upstream

Simplifies the code a little.

Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Dan Carpenter
fix #29902604

commit f84afbdd3a9e5e10633695677b95422572f920dc upstream

The "cmd" comes from the user and it can be up to 255. If it's more than the number of bits in a long, it results in an out-of-bounds read when we check test_bit(cmd, &cmd_mask). The highest valid value for "cmd" is ND_CMD_CALL (10), so I added a compare against that.

Fixes: 62232e45 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
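A minimal sketch of the bound check described above, assuming the upstream __nd_ioctl() context (surrounding code illustrative):

    /* Reject user-supplied commands above the highest valid ioctl
     * number before "cmd" is used as a bit index into cmd_mask. */
    if (cmd > ND_CMD_CALL)
            return -EINVAL;
    if (!test_bit(cmd, &cmd_mask))
            return -ENOTTY;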
-
Committed by Yihao Wu
fix #29692432

seqcount assumes that writers are already exclusive, so it saves extra locking. However, this critical section protected by idle_seqcount can be entered twice if an interrupt tries to wake up a task on this CPU. Once the race caused by interrupts is avoided, writers are exclusive. So a seqlock is unnecessary, and local_irq_save + seqcount is enough.

Fixes: 61e58859 ("alinux: sched: Introduce per-cgroup idle accounting")
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
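A sketch of the resulting write-side pattern, assuming idle_seqcount is the per-cgroup seqcount from the patch above (the update body is a placeholder):

    unsigned long flags;

    /* Disabling local interrupts keeps the wakeup path on this CPU
     * from re-entering the write section, so the seqcount writer is
     * exclusive without a full seqlock. */
    local_irq_save(flags);
    write_seqcount_begin(&idle_seqcount);
    /* ... update per-cgroup idle statistics ... */
    write_seqcount_end(&idle_seqcount);
    local_irq_restore(flags);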
-
Committed by Jiufei Xue
fix #29820404

commit 6d816e088c359866f9867057e04f244c608c42fe linux-block/io_uring-5.9 branch

We're holding the request reference, but we need to go one higher to ensure that the ctx remains valid after the request has finished. If the ring is closed with pending task_work inflight, and the given io_kiocb finishes sync during issue, then we need a reference to the ring itself around the task_work execution cycle.

Cc: stable@vger.kernel.org # v5.7+
Reported-by: syzbot+9b260fc33297966f5a8e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
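A sketch of the reference dance described above, assuming the ring ctx is pinned by a percpu_ref as upstream (placement of get/put illustrative):

    /* Pin the ring itself, not just the request, across queueing
     * and running the task_work; the ring may otherwise be torn
     * down while the work is still inflight. */
    percpu_ref_get(&req->ctx->refs);
    io_req_task_work_add(req, &req->task_work);
    /* ... after the task_work has executed ... */
    percpu_ref_put(&req->ctx->refs);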
-
Committed by Thomas Gleixner
fix #29760792

commit 94a46d316f2b54e3de8a4fa884cb16383db7fcd8 upstream

There is no reason to have nmi_enter/exit() in the actual MCE handlers. Move it to the entry point. This also covers the until now uncovered initial handler which only prints.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.243936614@linutronix.de
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Committed by Guoyu Huang
fix #29760246

Cherry-pick 2dd2111d0d383df104b144e0d1f6b5a00cb7cd88 from io_uring-5.9.

loop_rw_iter() does not check whether the file has a read or write function. This can lead to a NULL pointer dereference when the user passes in a file descriptor that does not have a read or write function. The crash log looks like this:

[ 99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 99.835364] #PF: supervisor instruction fetch in kernel mode
[ 99.836522] #PF: error_code(0x0010) - not-present page
[ 99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
[ 99.839649] Oops: 0010 [#2] SMP PTI
[ 99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G D 5.8.0 #2
[ 99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 99.845140] RIP: 0010:0x0
[ 99.845840] Code: Bad RIP value.
[ 99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[ 99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
[ 99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
[ 99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[ 99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[ 99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
[ 99.856914] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[ 99.858651] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
[ 99.861979] Call Trace:
[ 99.862617]  loop_rw_iter.part.0+0xad/0x110
[ 99.863838]  io_write+0x2ae/0x380
[ 99.864644]  ? kvm_sched_clock_read+0x11/0x20
[ 99.865595]  ? sched_clock+0x9/0x10
[ 99.866453]  ? sched_clock_cpu+0x11/0xb0
[ 99.867326]  ? newidle_balance+0x1d4/0x3c0
[ 99.868283]  io_issue_sqe+0xd8f/0x1340
[ 99.869216]  ? __switch_to+0x7f/0x450
[ 99.870280]  ? __switch_to_asm+0x42/0x70
[ 99.871254]  ? __switch_to_asm+0x36/0x70
[ 99.872133]  ? lock_timer_base+0x72/0xa0
[ 99.873155]  ? switch_mm_irqs_off+0x1bf/0x420
[ 99.874152]  io_wq_submit_work+0x64/0x180
[ 99.875192]  ? kthread_use_mm+0x71/0x100
[ 99.876132]  io_worker_handle_work+0x267/0x440
[ 99.877233]  io_wqe_worker+0x297/0x350
[ 99.878145]  kthread+0x112/0x150
[ 99.878849]  ? __io_worker_unuse+0x100/0x100
[ 99.879935]  ? kthread_park+0x90/0x90
[ 99.880874]  ret_from_fork+0x22/0x30
[ 99.881679] Modules linked in:
[ 99.882493] CR2: 0000000000000000
[ 99.883324] ---[ end trace 4453745f4673190b ]---
[ 99.884289] RIP: 0010:0x0
[ 99.884837] Code: Bad RIP value.
[ 99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
[ 99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
[ 99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
[ 99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
[ 99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
[ 99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
[ 99.896079] FS: 0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
[ 99.898017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0

Fixes: 32960613b7c3 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
Cc: stable@vger.kernel.org
Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
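A sketch of the dispatch after the fix, assuming the upstream io_read()/io_write() structure (simplified):

    /* Fall back to loop_rw_iter() only when a non-iter method
     * exists; otherwise fail instead of calling a NULL file op. */
    if (req->file->f_op->read_iter)
            ret = io_iter_do_read(req, iter);
    else if (req->file->f_op->read)
            ret = loop_rw_iter(READ, req->file, kiocb, iter);
    else
            ret = -EINVAL;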
-
Committed by Baolin Wang
fix #29327388

Just use bio->bi_vcnt directly to validate that there is only one bvec in a bio for PRP mode, which removes warnings for dm devices. No functional changes.

Fixes: c8b92b847512 ("alios: nvme-pci: Improve mapping single segment requests using PRP")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #29139300

commit 4711b57317f0ff5ca9fbd5e2df6c73b2c07ddc53 upstream

If we yank a 'same_queue_rq' request off the plug list, we should also decrement the cached request count.

Fixes: 5f0ed774ed29 ("block: sum requests in the plug structure")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
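A sketch of the one-line fix described above, in the blk-mq submit path where 'same_queue_rq' is stolen from the plug list (context illustrative):

    /* The request leaves the plug list, so the cached count must
     * follow it; otherwise the flush threshold drifts upward. */
    list_del_init(&same_queue_rq->queuelist);
    plug->rq_count--;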
-
Committed by Jens Axboe
to #29139300

commit 5f0ed774ed2914decfd397569fface997532e94d upstream

This isn't exactly the same as the previous count, as it includes requests for all devices. But that really doesn't matter, if we have more than the threshold (16) queued up, flush it. It's not worth it to have an expensive list loop for this.

[Hongnan Li] performance evaluation

Performance results running fio (ioengine=io_uring, iodepth=256):

bs      IOPS(randread nomerges=0)    IOPS(randread nomerges=2)
        before / after               before / after
-----   --------------------------   --------------------------
512     818K / 840K                  855K / 897K
1k      816K / 842K                  853K / 898K
2k      820K / 839K                  850K / 899K
4k      818K / 840K                  852K / 895K
8k      574K / 574K                  574K / 574K

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
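A sketch of the threshold check described above, assuming the rq_count field this series adds to struct blk_plug (BLK_MAX_REQUEST_COUNT is 16 upstream):

    /* Count across all queues in the plug instead of walking the
     * list; flush once the total passes the threshold. */
    plug->rq_count++;
    if (plug->rq_count >= BLK_MAX_REQUEST_COUNT)
            blk_flush_plug_list(plug, false);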
-
Committed by Jeffle Xu
fix #29612968

Sector addresses of all bios in a single request are guaranteed to be contiguous, except for DISCARD requests. We could get the whole sector range of the request by blk_rq_pos() and blk_rq_bytes() for normal read/write requests, but here we still print the sector range of every bio for code simplicity. Since it is a low frequency operation, this design leads to no performance penalty.

Besides, squash the 'if (bio)' and 'while (1)' into one single 'while (bio)'.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
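A sketch of the squashed loop described above, walking a request's bio chain and printing each bio's sector range (print format illustrative):

    struct bio *bio = rq->bio;

    /* One 'while (bio)' replaces the previous 'if (bio)' guard plus
     * the 'while (1)' body with its explicit break. */
    while (bio) {
            pr_info("sector %llu, bytes %u\n",
                    (unsigned long long)bio->bi_iter.bi_sector,
                    bio->bi_iter.bi_size);
            bio = bio->bi_next;
    }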
-
Committed by Kirill A. Shutemov
to #28718400

commit 246c320a8cfe0b11d81a4af38fa9985ef0cc9a4c upstream.

VMAs with the VM_GROWSDOWN or VM_GROWSUP flag set can change their size under mmap_read_lock(). It can lead to a race with __do_munmap():

    Thread A                     Thread B
    __do_munmap()
      detach_vmas_to_be_unmapped()
      mmap_write_downgrade()
                                 expand_downwards()
                                   vma->vm_start = address;
                                   // The VMA now overlaps with
                                   // VMAs detached by the Thread A
                                 // page fault populates expanded part
                                 // of the VMA
      unmap_region()
        // Zaps pagetables partly
        // populated by Thread B

A similar race exists for expand_upwards().

The fix is to avoid downgrading mmap_lock in __do_munmap() if detached VMAs are next to a VM_GROWSDOWN or VM_GROWSUP VMA.

[akpm@linux-foundation.org: s/mmap_sem/mmap_lock/ in comment]

Fixes: 3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org> [4.20+]
Link: http://lkml.kernel.org/r/20200709105309.42495-1-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
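A sketch of the guard in __do_munmap(), following the shape of the upstream fix (simplified; 'prev'/'next' are the VMAs bracketing the detached range):

    /* Do not downgrade to a read lock if a neighbouring VMA could
     * grow into the range we just detached. */
    if (downgrade) {
            if (next && (next->vm_flags & VM_GROWSDOWN))
                    downgrade = false;
            else if (prev && (prev->vm_flags & VM_GROWSUP))
                    downgrade = false;
            else
                    mmap_write_downgrade(mm);
    }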
-
Committed by Dave Hansen
to #28718400

commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream.

This is a bit of a mess, to put it mildly. But, it's a bug that only seems to have showed up in 4.20 but wasn't noticed until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX uses bounds tables that protect other areas of memory. When memory is unmapped, there is also a need to unmap the MPX bounds tables. Barring this, unused bounds tables can eat 80% of the address space.

But, the recursive do_munmap() that gets called via arch_unmap() wreaks havoc with __do_munmap()'s state. It can result in freeing populated page tables, accessing bogus VMA state, double-freed VMAs and more. See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_munmap() has a chance to do anything meaningful. Also, remove the 'vma' argument and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap(). No functional change. I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the VDSO and zeroes out 'current->mm->context.vdso_base'. Moving arch_unmap() makes this happen earlier in __do_munmap(). But, 'vdso_base' seems to only be used in perf and in the signal delivery that happens near the return to userspace. I can not find any likely impact to powerpc, other than the zeroing happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by its removal. I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to what was there before. For the munmap() failure case, it's possible that some MPX tables will be zapped for memory that continues to be in use. But, this is an extraordinarily unlikely scenario and the harm would be that MPX provides no protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

    ptr = mmap();
    // use ptr
    ret = munmap(ptr);
    if (ret)
            // oh, there was an error, I'll
            // keep using ptr.

Because if you're doing munmap(), you are *done* with the memory. There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if necessary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including freeing page tables (or queueing them to be freed).
5. Fix up some of the accounting (like fput()) and actually free the VMA itself.

This specific ordering was actually introduced by:

  dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window. The previous __do_munmap() code was actually safe because the only thing after arch_unmap() was remove_vma_list(). arch_unmap() could not see 'vma' in the rbtree because it was detached, so it is not even capable of doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

  [1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
  [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via arch_unmap(). It was freeing page tables that had not been properly zapped. But, the problem was bigger than this. For one, arch_unmap() can free VMAs. But, the calling __do_munmap() has variables that *point* to VMAs and obviously can't handle them just getting freed while the pointer is still in use.

I tried a couple of things here. First, I tried to fix the page table freeing problem in isolation, but I then found the VMA issue. I also tried having the MPX code return a flag if it modified the rbtree which would force __do_munmap() to re-walk to restart. That spiralled out of control in complexity pretty fast. Just moving arch_unmap() and accepting that the bonkers failure case might eat some bounds tables seems like the simplest viable fix.

This was also reported in the following kernel bugzilla entry:

  https://bugzilla.kernel.org/show_bug.cgi?id=203123

There are some reports that this commit triggered this bug:

  3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")

While that commit certainly made the issues easier to hit, I believe the fundamental issue has been with us as long as MPX itself, thus the Fixes: tag below is for one of the original MPX commits.

[ mingo: Minor edits to the changelog and the patch. ]

Reported-by: Richard Biener <rguenther@suse.de>
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-um@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: stable@vger.kernel.org
Fixes: 3ee4347a3fb3 ("mm: mmap: zap pages with read mmap_sem in munmap")
Link: http://lkml.kernel.org/r/20190419194747.5E1AD6DC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[xuyu: resolve conflicts in arch/x86/include/asm/mpx.h]
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
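A sketch of the reordering described above (signature simplified; the upstream patch also drops the 'vma' parameter from arch_unmap()):

    int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
                    struct list_head *uf, bool downgrade)
    {
            /* Let the arch hook (MPX) run before any VMA is split,
             * detached or freed; it now does its own VMA lookup. */
            arch_unmap(mm, start, start + len);

            /* ... find, split, detach and zap the VMAs as before ... */
    }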
-
Committed by Yihao Wu
to #29558346

An RCU-protected variable can be NULL inside an RCU read-side critical section if rcu_read_lock() is called during a grace period. So check if (!variable) before dereferencing it.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
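A minimal sketch of the pattern, with 'ptr' standing in for whichever RCU-protected pointer the patch guards:

    rcu_read_lock();
    p = rcu_dereference(ptr);
    if (!p) {
            /* Object was already unpublished; bail out safely. */
            rcu_read_unlock();
            return;
    }
    /* ... p is safe to use until rcu_read_unlock() ... */
    rcu_read_unlock();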
-
Committed by Xiaoguang Wang
fix #29605829

commit 23b3628e45924419399da48c2b3a522b05557c91 upstream

In io_sq_thread(), if there are task works to handle, the current code will skip schedule() and go on polling the sq again, but forgets to clear the IORING_SQ_NEED_WAKEUP flag; fix this issue. Also add two helpers to set and clear the IORING_SQ_NEED_WAKEUP flag.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
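A sketch of the two helpers, following the shape of the upstream commit:

    static void io_ring_set_wakeup_flag(struct io_ring_ctx *ctx)
    {
            /* Tell submitters the sq thread is parked and needs a
             * wakeup via io_uring_enter(IORING_ENTER_SQ_WAKEUP). */
            ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP;
    }

    static void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
    {
            ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP;
    }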
-
Committed by Joseph Qi
to #29613419

To be consistent, disable the low limit on arm as it is not used, and also enable io latency on x86.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Committed by Pavel Begunkov
to #29608102

commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5 upstream.

io_fail_links() doesn't consider REQ_F_COMP_LOCKED, leading to nested spin_lock(completion_lock) and lockup.

[ 197.680409] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
[ 197.680411] rcu: blocking rcu_node structures:
[ 197.680412] Task dump for CPU 6:
[ 197.680413] link-timeout R running task 0 1669 1 0x8000008a
[ 197.680414] Call Trace:
[ 197.680420]  ? io_req_find_next+0xa0/0x200
[ 197.680422]  ? io_put_req_find_next+0x2a/0x50
[ 197.680423]  ? io_poll_task_func+0xcf/0x140
[ 197.680425]  ? task_work_run+0x67/0xa0
[ 197.680426]  ? do_exit+0x35d/0xb70
[ 197.680429]  ? syscall_trace_enter+0x187/0x2c0
[ 197.680430]  ? do_group_exit+0x43/0xa0
[ 197.680448]  ? __x64_sys_exit_group+0x18/0x20
[ 197.680450]  ? do_syscall_64+0x52/0xa0
[ 197.680452]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Pavel Begunkov
to #29608102

commit d5e16d8e23825304c6a9945116cc6b6f8d51f28c upstream.

req->work might be already initialised by the time it gets into __io_arm_poll_handler(), which will corrupt it by using fields that are in a union with req->work. Luckily, the only side effect is a missing put_creds(). Clean req->work before going there.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Pavel Begunkov
to #29608102

commit 3e863ea3bb1a2203ae648eb272db0ce6a1a2072c upstream.

The IOSQE_ASYNC branch of io_queue_sqe() is another place where an uninitialised req->work can be accessed (i.e. prior to io_req_init_async()). Nothing really bad though, it just loses the IO_WQ_WORK_CONCURRENT flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Daniele Albano
to #29608102

commit 61710e437f2807e26a3402543bdbb7217a9c8620 upstream.

We currently filter these for timeout_remove/async_cancel/files_update, but we only should be filtering for fixed file and buffer select. This also causes a second read of sqe->flags, which isn't needed. Just check req->flags for the relevant bits. This then allows these commands to be used in links, for example, like everything else.

Signed-off-by: Daniele Albano <d.albano@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
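A sketch of the narrowed check, following the shape of the upstream commit:

    /* Only fixed-file and buffer-select make no sense for these
     * opcodes; req->flags already mirrors sqe->flags, so no second
     * read of the sqe is needed. */
    if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
            return -EINVAL;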
-
Committed by Jens Axboe
to #29608102

commit 807abcb0883439af5ead73f3308310453b97b624 upstream.

The double poll additions were centered around doing POLL_ADD on file descriptors that use more than one waitqueue (typically one for read, one for write) when being polled. However, it can also end up being triggered for when we use poll triggered retry. For that case, we cannot safely use req->io, as that could be used by the request type itself. Add a second io_poll_iocb pointer in the structure we allocate for poll based retry, and ensure we use the right one from the two paths.

Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
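A sketch of the structural change, assuming the async_poll container used for poll-based retry (the upstream patch adds the second pointer alongside the embedded iocb):

    struct async_poll {
            struct io_poll_iocb     poll;
            /* Second waitqueue entry for double-wait files, so the
             * retry path no longer has to borrow req->io. */
            struct io_poll_iocb     *double_poll;
            struct io_wq_work       work;
    };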
-
Committed by Wetp Zhang
fix #29415191

commit 03151c6e0b66c63c3e9980edf78c3a7a99801764 upstream

Action Required memory errors should happen only when a processor is about to access corrupted memory, so they are synchronous and only affect the current process/thread. Recently, commit 872e9a205c84 ("mm, memory_failure: don't send BUS_MCEERR_AO for action required error") fixed the issue that an Action Required memory error could unnecessarily send SIGBUS to processes which share the error memory. But we still have another issue: we could send SIGBUS to a wrong thread. This is because collect_procs() and task_early_kill() fail to add the current process to the "to-kill" list. So this patch is suggesting to fix it. With this fix, SIGBUS(BUS_MCEERR_AR) is never sent to a non-current process/thread.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1591321039-22141-3-git-send-email-naoya.horiguchi@nec.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Wetp Zhang
fix #29415191

commit 872e9a205c8491daf1a51ea3733c8c1d15d51e10 upstream

Some processes don't want to be killed early, but in the "Action Required" case, those may also be killed by BUS_MCEERR_AO when sharing memory with another process which is accessing the failed memory. And sending SIGBUS with BUS_MCEERR_AO for an action required error is strange, so ignore the non-current processes here.

Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/1590817116-21281-1-git-send-email-wetp.zy@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Qiuxu Zhuo
fix #29307272

commit ce20670828c1228ecd37befbdda87a1f87a803b9 upstream

The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville servers if their CPU stepping >= 4, and failed on Ice Lake-D servers from stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville servers with CPU stepping >= 4, the offset for the bus number configuration register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the steppings use the updated 0xd0 offset. Fix the issue by using the appropriate offset for the bus number configuration register according to the CPU model number and stepping.

Reported-by: Jerry Chen <jerry.t.chen@intel.com>
Reported-and-tested-by: Jin Wen <wen.jin@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
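An illustrative sketch of the selection logic (the real driver keys this off per-model res_config data; the helper below is hypothetical):

    /* 0xd0 for Ice Lake-D (all steppings) and for Ice Lake /
     * Tremont-Jacobsville from stepping 4; 0xcc otherwise. */
    static int busno_cfg_offset(const struct cpuinfo_x86 *c)
    {
            if (c->x86_model == INTEL_FAM6_ICELAKE_D)
                    return 0xd0;
            return c->x86_stepping >= 4 ? 0xd0 : 0xcc;
    }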
-
Committed by Youquan Song
fix #29307272

commit ee5340abab3babb91c1807cea47de4468b2dfc91 upstream

The device ID for the configuration agent PCI device and the offset for the bus number configuration register can be CPU model specific. So add a new structure res_config to make them configurable, and pass res_config to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
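A sketch of the structure described above (field set follows the upstream skx_common.h addition; details may differ in this backport):

    struct res_config {
            int type;                 /* SKX or I10NM */
            /* Configuration-agent PCI device ID. */
            unsigned int decs_did;
            /* Offset of the bus-number configuration register. */
            int busno_cfg_offset;
    };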
-
Committed by Borislav Petkov
fix #29415191

commit 1df73b2131e3b33d518609769636b41ce00212de upstream

The severity grading code returns IN_KERNEL_RECOV error context for errors which have happened in kernel space but from which the kernel can recover. Whether the recovery can happen is determined by the exception table entry having as handler ex_handler_fault() and which has been declared at build time using _ASM_EXTABLE_FAULT(). IN_KERNEL_RECOV is used in mce_severity_intel() to look up the corresponding error severity in the severities table.

However, the mapping back from error severity to whether the error is IN_KERNEL_RECOV is ambiguous, and in the very paranoid case - which might not be possible right now, but better safe than sorry later - an exception fixup could be attempted for another MCE whose address is in the exception table and has the proper severity. Which would be unfortunate, to say the least.

Therefore, mark such MCEs explicitly as MCE_IN_KERNEL_RECOV so that the recovery attempt is done only for them. Document the whole handling, while at it, as it is not trivial.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-10-bp@alien8.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Tony Luck
fix #29415191

commit 43505646941bee217b91d064756975aa1ab6ee3b upstream

Sometimes, when logs are getting lost, it's nice to just have everything dumped to the serial console.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-7-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Tony Luck
fix #29415191

commit 925946cfa715a5a71639528f82b98e58f14dd4cb upstream

Instead of keeping count of how many handlers are registered on the MCE notifier chain and printing if below some magic value, look at mce->kflags to see if anyone claims to have handled/logged this error.

[ bp: Do not print ->kflags in __print_mce(). ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-6-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
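A sketch of the default notifier after this change, combined with the console-dump switch from the commit above (shape follows upstream; simplified):

    /* Print only when no earlier handler on the chain claimed the
     * error via m->kflags, or when the user asked for everything. */
    if (mca_cfg.print_all || !m->kflags)
            __print_mce(m);
    return NOTIFY_DONE;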
-
Committed by Tony Luck
fix #29415191

commit 23ba710a0864108910c7531dc4c73ef65eca5568 upstream

If the handler took any action to log or deal with the error, set a bit in mce->kflags so that the default handler at the end of the machine check chain can see what has been done. Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers skip over errors already processed by CEC.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Tony Luck
fix #29415191

commit 1de08dccd383482a3e88845d3554094d338f5ff9 upstream

There can be many different subsystems registered on the MCE handler chain. Add a new bitmask field and define values so that handlers can indicate whether they took any action to log or otherwise handle an error. The default handler at the end of the chain can use this information to decide whether to print to the console log. Boris suggested a generic name and leaving plenty of spare bits for possible future use.

[ bp: Move flag bits to the internal mce.h header and use BIT_ULL(). ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-4-tony.luck@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
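A sketch of the flag values, following the upstream MCE_HANDLED_* names (the exact set in this backport may differ):

    /* struct mce gains a __u64 kflags field; a handler ORs in its
     * bit when it logs or otherwise deals with the error. */
    #define MCE_HANDLED_CEC      BIT_ULL(0)
    #define MCE_HANDLED_UC       BIT_ULL(1)
    #define MCE_HANDLED_EXTLOG   BIT_ULL(2)
    #define MCE_HANDLED_NFIT     BIT_ULL(3)
    #define MCE_HANDLED_EDAC     BIT_ULL(4)
    #define MCE_HANDLED_MCELOG   BIT_ULL(5)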
-
Committed by Peter Zijlstra
fix #29415191

commit 0d00449c7a28a1514595630735df383dec606812 upstream

A few exceptions (like #DB and #BP) can happen at any location in the code, which means that tracers should treat events from these exceptions as NMI-like. The interrupted context could be holding locks with interrupts disabled, for instance. Similarly, #MC is an actual NMI-like exception. All of them use ist_enter(), which only concerns itself with RCU but does not do any of the other setup that NMIs need. This means things like:

    printk()
      raw_spin_lock_irq(&logbuf_lock);
    <#DB/#BP/#MC>
      printk()
        raw_spin_lock_irq(&logbuf_lock);

are entirely possible (well, not really, since printk tries hard to play nice, but the concept stands). So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter() caller must be both notrace and NOKPROBE, or in the noinstr text section.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.525508608@linutronix.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Peter Zijlstra
fix #29415191

commit 5567d11c21a1d508a91a8cb64a819783a0835d9f upstream

Convert #MC over to using task_work_add(); it will run the same code slightly later, on the return-to-user path of the same exception.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.957390899@linutronix.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
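A sketch of the deferral in do_machine_check(), following the shape of the upstream change (field and callback names as upstream; simplified):

    /* Instead of acting on the user task from #MC context, stash
     * what the recovery needs and queue it for return-to-user. */
    current->mce_addr = m.addr;
    current->mce_status = m.mcgstatus;
    init_task_work(&current->mce_kill_me, kill_me_maybe);
    task_work_add(current, &current->mce_kill_me, true);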
-
Committed by Youquan Song
fix #29415191

commit b052df3da821adfd6be26a6eb16624fb50e90e56 upstream

This is completely overengineered and definitely not an interface which should be made available to anything else than this particular MCE case.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.462640294@linutronix.de
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Committed by Tony Luck
fix #29415191

commit 17fae1294ad9d711b2c3dd0edef479d40c76a5e8 upstream

An interesting thing happened when a guest Linux instance took a machine check. The VMM unmapped the bad page from guest physical space and passed the machine check to the guest. Linux took all the normal actions to offline the page from the process that was using it. But then guest Linux crashed because it said there was a second machine check inside the kernel with this stack trace:

    do_memory_failure
      set_mce_nospec
        set_memory_uc
          _set_memory_uc
            change_page_attr_set_clr
              cpa_flush
                clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine check (it isn't consuming the data). Further investigation showed that the VMM had passed in another machine check because it appeared that the guest was accessing the bad page.

The fix is to check the scope of the poison by checking the MCi_MISC register. If the entire page is affected, then unmap the page. If only part of the page is affected, then mark the page as uncacheable. This assumes that VMMs will do the logical thing and pass in the "whole page scope" via the MCi_MISC register (since they unmapped the entire page).

[ bp: Adjust to x86/entry changes. ]

Fixes: 284ce401 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
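A sketch of the scope check, following the upstream whole_page() helper (the LSB field of MCi_MISC encodes the size of the poisoned region):

    static bool whole_page(struct mce *m)
    {
            /* No valid MISC info: conservatively assume whole page. */
            if (!(m->status & MCI_STATUS_MISCV))
                    return true;
            /* A recoverable-address LSB >= PAGE_SHIFT means the
             * poison scope covers at least one full page. */
            return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
    }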
-
Committed by Alexandru Gagniuc
task #29600094

commit f496648b99f8f7f6711f7c30a6327381f37dd1e8 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

When in-band presence detect is disabled, PDS may come up at any time or not at all. PDS being low may indicate that the card is still mating, and we could expect contact bounce to bring down the link as well. It is reasonable to assume that most cards will mate in a hotplug slot in about a second. Thus, when we know PDS only reflects out-of-band presence detect, it's worthwhile to wait the extra second or so to make sure the card is properly mated before loading the driver and to prevent the hotplug code from disabling a device if the presence detect change goes active after the device is enabled.

Link: https://lore.kernel.org/r/20191025190047.38130-3-stuart.w.hayes@gmail.com
[bhelgaas: use ctrl_info() instead of pci_info()]
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
(cherry picked from commit f496648b99f8f7f6711f7c30a6327381f37dd1e8)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
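A sketch of the wait, following the shape of the helper this patch adds (polling budget per its "about a second" estimate; details may differ in the backport):

    static void pcie_wait_for_presence(struct pci_dev *pdev)
    {
            int timeout = 1250;     /* ms; covers mating plus bounce */
            u16 slot_status;

            do {
                    pcie_capability_read_word(pdev, PCI_EXP_SLTSTA,
                                              &slot_status);
                    if (slot_status & PCI_EXP_SLTSTA_PDS)
                            return;
                    msleep(10);
                    timeout -= 10;
            } while (timeout > 0);
    }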
-
Committed by Alexandru Gagniuc
task #29600094

commit 202853595e53f981c86656c49fc1cc1e3620f558 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

The presence detect state (PDS) is normally a logical OR of in-band and out-of-band (OOB) presence detect. As of PCIe 4.0, there is the option to disable in-band presence so that the PDS bit always reflects the state of the out-of-band presence. The recommendation of the PCIe spec is to disable in-band presence whenever supported (PCIe r5.0, appendix I implementation note):

  Due to architectural issues, the in-band (Physical-Layer-based) portion of the PD mechanism is deprecated for use with async hot-plug. One issue is that in-band PD as architected does not detect adapter removal during certain LTSSM states, notably the L1 and Disabled States. Another issue is that when both in-band and OOB PD are being used together, the Presence Detect State bit and its associated interrupt mechanism always reflect the logical OR of the in-band and OOB PD states, and with some hot-plug hardware configurations, it is important for software to detect and respond to in-band and OOB PD events independently. If OOB PD is being used and the associated DSP supports In-Band PD Disable, it is recommended that the In-Band PD Disable bit be Set, and the Presence Detect State bit and its associated interrupt mechanism be used exclusively for OOB PD. As a substitute for in-band PD with async hot-plug, the reference model uses either the DPC or the DLL Link Active mechanism.

Link: https://lore.kernel.org/r/20191025190047.38130-2-stuart.w.hayes@gmail.com
[bhelgaas: move PCI_EXP_SLTCAP2 read earlier & print PCI_EXP_SLTCAP2_IBPD value (suggested by Lukas)]
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
(cherry picked from commit 202853595e53f981c86656c49fc1cc1e3620f558)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Conflicts:
	drivers/pci/hotplug/pciehp.h
	drivers/pci/hotplug/pciehp_hpc.c
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
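A sketch of probing and setting the In-Band PD Disable bit during controller init, using the register names this patch introduces (placement illustrative):

    u32 slot_cap2;

    pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP2, &slot_cap2);
    if (slot_cap2 & PCI_EXP_SLTCAP2_IBPD) {
            /* Make PDS reflect out-of-band presence only. */
            pcie_write_cmd_nowait(ctrl, PCI_EXP_SLTCTL_IBPD_DISABLE,
                                  PCI_EXP_SLTCTL_IBPD_DISABLE);
            ctrl->inband_presence_disabled = 1;
    }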
-
Committed by Thomas Gleixner
task #29600094

commit 9ae0522537852408f0f48af888e44d6876777463 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

The AER error injection mechanism just blindly abuses generic_handle_irq(), which is really not meant for consumption by random drivers. The include of linux/irq.h should have been a red flag in the first place. Driver code, unless implementing interrupt chips or low-level hypervisor functionality, has absolutely no business with that.

Invoking generic_handle_irq() from non-interrupt-handling context can have nasty side effects, at least on x86, due to the hardware trainwreck which makes interrupt affinity changes a fragile beast. Sathyanarayanan triggered a NULL pointer dereference in the low-level APIC code that way. While the particular pointer could be checked, this would only paper over the issue because there are other ways to trigger warnings or silently corrupt state.

Invoke the new irq_inject_interrupt() mechanism, which has the necessary sanity checks in place and injects the interrupt via the irq_retrigger() mechanism, which is at least halfways safe vs. the fragile x86 affinity change mechanics. It's safe on x86 as it does not corrupt state, but it still can cause a premature completion of an interrupt affinity change, causing the interrupt line to become stale. Very unlikely, but possible.

For regular operations this is a non-issue, as AER error injection is meant for debugging and testing and not for usage on production systems. People using this should better know what they are doing.

Fixes: 390e2db82480 ("PCI/AER: Abstract AER interrupt handling")
Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lkml.kernel.org/r/20200306130624.098374457@linutronix.de
(cherry picked from commit 9ae0522537852408f0f48af888e44d6876777463)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Conflicts:
	drivers/pci/pcie/Kconfig
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
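A before/after sketch of the injection path, assuming the surrounding code resembles upstream aer_inject.c:

    /* Before: raw replay from process context, unsafe on x86. */
    local_irq_disable();
    generic_handle_irq(edev->irq);
    local_irq_enable();

    /* After: sanity-checked injection via the retrigger machinery. */
    ret = irq_inject_interrupt(edev->irq);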
-
Committed by Thomas Gleixner
task #29600094

commit acd26bcf362708594ea081ef55140e37d0854ed2 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

Error injection mechanisms need a halfways safe way to inject interrupts, as invoking generic_handle_irq() or the actual device interrupt handler directly from e.g. a debugfs write is not guaranteed to be safe. On x86, generic_handle_irq() is unsafe due to the hardware trainwreck which is the base of x86 interrupt delivery and affinity management.

Move the irq debugfs injection code into a separate function which can be used by error injection code as well. The implementation prevents at least that state is corrupted, but it cannot close a very tiny race window on x86 which might result in a stale and not serviced device interrupt under very unlikely circumstances.

This is explicitly for debugging and testing and not for production use or abuse in random driver code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.990928309@linutronix.de
(cherry picked from commit acd26bcf362708594ea081ef55140e37d0854ed2)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Conflicts:
	include/linux/interrupt.h
	kernel/irq/debugfs.c
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Thomas Gleixner
task #29600094

commit da90921acc62c71d27729ae211ccfda5370bf75b upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

The code sets IRQS_REPLAY unconditionally, whether the resend happens or not. That doesn't have bad side effects right now, but inconsistent state is always a latent source of problems.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.882129117@linutronix.de
(cherry picked from commit da90921acc62c71d27729ae211ccfda5370bf75b)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Thomas Gleixner
task #29600094

commit 1f85b1f5e1f5541272abedc19ba7b6c5b564c228 upstream.

Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

In preparation for an interrupt injection interface which can be used safely by error injection mechanisms, e.g. PCIe AER, add a return value to check_irq_resend() so errors can be propagated to the caller. Split out the software resend code so the ugly #ifdef in check_irq_resend() goes away and the whole thing becomes readable.

Fix up the caller in debugfs. The caller in irq_startup() does not care about the return value, as this is unconditionally invoked for all interrupts and the resend is best effort anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.775200917@linutronix.de
(cherry picked from commit 1f85b1f5e1f5541272abedc19ba7b6c5b564c228)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-