- 20 3月, 2020 2 次提交
-
-
由 Al Viro 提交于
commit 6404674acd596de41fd3ad5f267b4525494a891a upstream Brown paperbag time: fetching ->i_uid/->i_mode really should've been done from nd->inode. I even suggested that, but the reason for that has slipped through the cracks and I went for dir->d_inode instead - made for more "obvious" patch. Analysis: - at the entry into do_last() and all the way to step_into(): dir (aka nd->path.dentry) is known not to have been freed; so's nd->inode and it's equal to dir->d_inode unless we are already doomed to -ECHILD. inode of the file to get opened is not known. - after step_into(): inode of the file to get opened is known; dir might be pointing to freed memory/be negative/etc. - at the call of may_create_in_sticky(): guaranteed to be out of RCU mode; inode of the file to get opened is known and pinned; dir might be garbage. The last was the reason for the original patch. Except that at the do_last() entry we can be in RCU mode and it is possible that nd->path.dentry->d_inode has already changed under us. In that case we are going to fail with -ECHILD, but we need to be careful; nd->inode is pointing to valid struct inode and it's the same as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we should use that. Reported-by: N"Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com> Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org Fixes: d0cb50185ae9 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late") Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit b60fda6000a99a7ccac36005ab78b14b47c06de3 upstream We currently have a race where if setup is really slow, we can be calling io_wq_destroy() before we're done setting up. This will cause the caller to get stuck waiting for the manager to set things up, but the manager already exited. Fix this by doing a sync setup of the manager. This also fixes the case where if we failed creating workers, we'd also get stuck. In practice this race window was really small, as we already wait for the manager to start. Hence someone would have to call io_wq_destroy() after the task has started, but before it started the first loop. The reported test case forked tons of these, which is why it became an issue. Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com Fixes: 771b53d033e8 ("io-wq: small threadpool implementation for io_uring") Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
- 19 3月, 2020 3 次提交
-
-
由 Jens Axboe 提交于
commit 576a347b7af8abfbddc80783fb6629c2894d036e upstream. If we don't inherit the original task creds, then we can confuse users like fuse that pass creds in the request header. See link below on identical aio issue. Link: https://lore.kernel.org/linux-fsdevel/26f0d78e-99ca-2f1b-78b9-433088053a61@scylladb.com/T/#uSigned-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 576a347b7af8abfbddc80783fb6629c2894d036e upstream. We currently pass in 4 arguments outside of the bounded size. In preparation for adding one more argument, let's bundle them up in a struct to make it more readable. No functional changes in this patch. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 7d7230652e7c788ef908536fd79f4cca077f269f upstream. For cancellation, we need to ensure that the work item stays valid for as long as ->cur_work is valid. Right now we can't safely dereference the work item even under the wqe->lock, because while the ->cur_work pointer will remain valid, the work could be completing and be freed in parallel. Only invoke ->get/put_work() on items we know that the caller queued themselves. Add IO_WQ_WORK_INTERNAL for io-wq to use, which is needed when we're queueing a flush item, for instance. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
- 18 3月, 2020 35 次提交
-
-
由 YueHaibing 提交于
commit 1d3ff0950e2b40dc861b1739029649d03f591820 upstream. [ Fixes: CVE-2019-20096 ] If dccp_feat_push_change fails, we forget free the mem which is alloced by kmemdup in dccp_feat_clone_sp_val. Reported-by: NHulk Robot <hulkci@huawei.com> Fixes: e8ef967a ("dccp: Registration routines for changing feature values") Reviewed-by: NMukesh Ojha <mojha@codeaurora.org> Signed-off-by: NYueHaibing <yuehaibing@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Jason Yan 提交于
commit f70267f379b5e5e11bdc5d72a56bf17e5feed01f upstream. [ Fixes: CVE-2019-19965 ] The discovering of sas port is driven by workqueue in libsas. When libsas is processing port events or phy events in workqueue, new events may rise up and change the state of some structures such as asd_sas_phy. This may cause some problems such as follows: ==>thread 1 ==>thread 2 ==>phy up ==>phy_up_v3_hw() ==>oob_mode = SATA_OOB_MODE; ==>phy down quickly ==>hisi_sas_phy_down() ==>sas_ha->notify_phy_event() ==>sas_phy_disconnected() ==>oob_mode = OOB_NOT_CONNECTED ==>workqueue wakeup ==>sas_form_port() ==>sas_discover_domain() ==>sas_get_port_device() ==>oob_mode is OOB_NOT_CONNECTED and device is wrongly taken as expander This at last lead to the panic when libsas trying to issue a command to discover the device. [183047.614035] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058 [183047.622896] Mem abort info: [183047.625762] ESR = 0x96000004 [183047.628893] Exception class = DABT (current EL), IL = 32 bits [183047.634888] SET = 0, FnV = 0 [183047.638015] EA = 0, S1PTW = 0 [183047.641232] Data abort info: [183047.644189] ISV = 0, ISS = 0x00000004 [183047.648100] CM = 0, WnR = 0 [183047.651145] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000b7df67be [183047.657834] [0000000000000058] pgd=0000000000000000 [183047.662789] Internal error: Oops: 96000004 [#1] SMP [183047.667740] Process kworker/u16:2 (pid: 31291, stack limit = 0x00000000417c4974) [183047.675208] CPU: 0 PID: 3291 Comm: kworker/u16:2 Tainted: G W OE 4.19.36-vhulk1907.1.0.h410.eulerosv2r8.aarch64 #1 [183047.687015] Hardware name: N/A N/A/Kunpeng Desktop Board D920S10, BIOS 0.15 10/22/2019 [183047.695007] Workqueue: 0000:74:02.0_disco_q sas_discover_domain [183047.700999] pstate: 20c00009 (nzCv daif +PAN +UAO) [183047.705864] pc : prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw] [183047.711510] lr : prep_ata_v3_hw+0xb0/0x230 [hisi_sas_v3_hw] [183047.717153] sp : ffff00000f28ba60 [183047.720541] x29: ffff00000f28ba60 x28: ffff8026852d7228 [183047.725925] x27: ffff8027dba3e0a8 x26: ffff8027c05fc200 [183047.731310] x25: 0000000000000000 x24: ffff8026bafa8dc0 [183047.736695] x23: ffff8027c05fc218 x22: ffff8026852d7228 [183047.742079] x21: ffff80007c2f2940 x20: ffff8027c05fc200 [183047.747464] x19: 0000000000f80800 x18: 0000000000000010 [183047.752848] x17: 0000000000000000 x16: 0000000000000000 [183047.758232] x15: ffff000089a5a4ff x14: 0000000000000005 [183047.763617] x13: ffff000009a5a50e x12: ffff8026bafa1e20 [183047.769001] x11: ffff0000087453b8 x10: ffff00000f28b870 [183047.774385] x9 : 0000000000000000 x8 : ffff80007e58f9b0 [183047.779770] x7 : 0000000000000000 x6 : 000000000000003f [183047.785154] x5 : 0000000000000040 x4 : ffffffffffffffe0 [183047.790538] x3 : 00000000000000f8 x2 : 0000000002000007 [183047.795922] x1 : 0000000000000008 x0 : 0000000000000000 [183047.801307] Call trace: [183047.803827] prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw] [183047.809127] hisi_sas_task_prep+0x750/0x888 [hisi_sas_main] [183047.814773] hisi_sas_task_exec.isra.7+0x88/0x1f0 [hisi_sas_main] [183047.820939] hisi_sas_queue_command+0x28/0x38 [hisi_sas_main] [183047.826757] smp_execute_task_sg+0xec/0x218 [183047.831013] smp_execute_task+0x74/0xa0 [183047.834921] sas_discover_expander.part.7+0x9c/0x5f8 [183047.839959] sas_discover_root_expander+0x90/0x160 [183047.844822] sas_discover_domain+0x1b8/0x1e8 [183047.849164] process_one_work+0x1b4/0x3f8 [183047.853246] worker_thread+0x54/0x470 [183047.856981] kthread+0x134/0x138 [183047.860283] ret_from_fork+0x10/0x18 [183047.863931] Code: f9407a80 528000e2 39409281 72a04002 (b9405800) [183047.870097] kernel fault(0x1) notification starting on CPU 0 [183047.875828] kernel fault(0x1) notification finished on CPU 0 [183047.881559] Modules linked in: unibsp(OE) hns3(OE) hclge(OE) hnae3(OE) mem_drv(OE) hisi_sas_v3_hw(OE) hisi_sas_main(OE) [183047.892418] ---[ end trace 4cc26083fc11b783 ]--- [183047.897107] Kernel panic - not syncing: Fatal exception [183047.902403] kernel fault(0x5) notification starting on CPU 0 [183047.908134] kernel fault(0x5) notification finished on CPU 0 [183047.913865] SMP: stopping secondary CPUs [183047.917861] Kernel Offset: disabled [183047.921422] CPU features: 0x2,a2a00a38 [183047.925243] Memory Limit: none [183047.928372] kernel reboot(0x2) notification starting on CPU 0 [183047.934190] kernel reboot(0x2) notification finished on CPU 0 [183047.940008] ---[ end Kernel panic - not syncing: Fatal exception ]--- Fixes: 2908d778 ("[SCSI] aic94xx: new driver") Link: https://lore.kernel.org/r/20191206011118.46909-1-yanaijie@huawei.comReported-by: NGao Chuan <gaochuan4@huawei.com> Reviewed-by: NJohn Garry <john.garry@huawei.com> Signed-off-by: NJason Yan <yanaijie@huawei.com> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Akeem G Abodunrin 提交于
commit bc8a76a152c5f9ef3b48104154a65a68a8b76946 upstream. [ Fixes: CVE-2019-14615 ] Intel ID: PSIRT-TA-201910-001 CVEID: CVE-2019-14615 Intel GPU Hardware prior to Gen11 does not clear EU state during a context switch. This can result in information leakage between contexts. For Gen8 and Gen9, hardware provides a mechanism for fast cleardown of the EU state, by issuing a PIPE_CONTROL with bit 27 set. We can use this in a context batch buffer to explicitly cleardown the state on every context switch. As this workaround is already in place for gen8, we can borrow the code verbatim for Gen9. Signed-off-by: NMika Kuoppala <mika.kuoppala@linux.intel.com> Signed-off-by: NAkeem G Abodunrin <akeem.g.abodunrin@intel.com> Cc: Kumar Valsan Prathap <prathap.kumar.valsan@intel.com> Cc: Chris Wilson <chris.p.wilson@intel.com> Cc: Balestrieri Francesco <francesco.balestrieri@intel.com> Cc: Bloomfield Jon <jon.bloomfield@intel.com> Cc: Dutt Sudeep <sudeep.dutt@intel.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Navid Emamdoost 提交于
commit 4a9d46a9fe14401f21df69cea97c62396d5fb053 upstream. [ Fixes: CVE-2019-19077 ] In bnxt_re_create_srq(), when ib_copy_to_udata() fails allocated memory should be released by goto fail. Fixes: 37cb11ac ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters") Link: https://lore.kernel.org/r/20190910222120.16517-1-navid.emamdoost@gmail.comSigned-off-by: NNavid Emamdoost <navid.emamdoost@gmail.com> Reviewed-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com> Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Navid Emamdoost 提交于
commit 4aa7afb0ee20a97fbf0c5bab3df028d5fb85fdab upstream. [ Fixes: CVE-2019-19046 ] In the impelementation of __ipmi_bmc_register() the allocated memory for bmc should be released in case ida_simple_get() fails. Fixes: 68e7e50f ("ipmi: Don't use BMC product/dev ids in the BMC name") Signed-off-by: NNavid Emamdoost <navid.emamdoost@gmail.com> Message-Id: <20191021200649.1511-1-navid.emamdoost@gmail.com> Signed-off-by: NCorey Minyard <cminyard@mvista.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Jiri Slaby 提交于
commit 07e6124a1a46b4b5a9b3cacc0c306b50da87abf5 upstream. [ Fixes: CVE-2020-8648 ] syzkaller reported this UAF: BUG: KASAN: use-after-free in n_tty_receive_buf_common+0x2481/0x2940 drivers/tty/n_tty.c:1741 Read of size 1 at addr ffff8880089e40e9 by task syz-executor.1/13184 CPU: 0 PID: 13184 Comm: syz-executor.1 Not tainted 5.4.7 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: ... kasan_report+0xe/0x20 mm/kasan/common.c:634 n_tty_receive_buf_common+0x2481/0x2940 drivers/tty/n_tty.c:1741 tty_ldisc_receive_buf+0xac/0x190 drivers/tty/tty_buffer.c:461 paste_selection+0x297/0x400 drivers/tty/vt/selection.c:372 tioclinux+0x20d/0x4e0 drivers/tty/vt/vt.c:3044 vt_ioctl+0x1bcf/0x28d0 drivers/tty/vt/vt_ioctl.c:364 tty_ioctl+0x525/0x15a0 drivers/tty/tty_io.c:2657 vfs_ioctl fs/ioctl.c:47 [inline] It is due to a race between parallel paste_selection (TIOCL_PASTESEL) and set_selection_user (TIOCL_SETSEL) invocations. One uses sel_buffer, while the other frees it and reallocates a new one for another selection. Add a mutex to close this race. The mutex takes care properly of sel_buffer and sel_buffer_lth only. The other selection global variables (like sel_start, sel_end, and sel_cons) are protected only in set_selection_user. The other functions need quite some more work to close the races of the variables there. This is going to happen later. This likely fixes (I am unsure as there is no reproducer provided) bug 206361 too. It was marked as CVE-2020-8648. Signed-off-by: NJiri Slaby <jslaby@suse.cz> Reported-by: syzbot+59997e8d5cbdc486e6f6@syzkaller.appspotmail.com References: https://bugzilla.kernel.org/show_bug.cgi?id=206361 Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20200210081131.23572-2-jslaby@suse.czSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Zhang Xiaoxu 提交于
commit 513dc792d6060d5ef572e43852683097a8420f56 upstream. [ Fixes: CVE-2020-8647, CVE-2020-8649 ] When syzkaller tests, there is a UAF: BUG: KASan: use after free in vgacon_invert_region+0x9d/0x110 at addr ffff880000100000 Read of size 2 by task syz-executor.1/16489 page:ffffea0000004000 count:0 mapcount:-127 mapping: (null) index:0x0 page flags: 0xfffff00000000() page dumped because: kasan: bad access detected CPU: 1 PID: 16489 Comm: syz-executor.1 Not tainted Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: [<ffffffffb119f309>] dump_stack+0x1e/0x20 [<ffffffffb04af957>] kasan_report+0x577/0x950 [<ffffffffb04ae652>] __asan_load2+0x62/0x80 [<ffffffffb090f26d>] vgacon_invert_region+0x9d/0x110 [<ffffffffb0a39d95>] invert_screen+0xe5/0x470 [<ffffffffb0a21dcb>] set_selection+0x44b/0x12f0 [<ffffffffb0a3bfae>] tioclinux+0xee/0x490 [<ffffffffb0a1d114>] vt_ioctl+0xff4/0x2670 [<ffffffffb0a0089a>] tty_ioctl+0x46a/0x1a10 [<ffffffffb052db3d>] do_vfs_ioctl+0x5bd/0xc40 [<ffffffffb052e2f2>] SyS_ioctl+0x132/0x170 [<ffffffffb11c9b1b>] system_call_fastpath+0x22/0x27 Memory state around the buggy address: ffff8800000fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8800000fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff880000100000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff It can be reproduce in the linux mainline by the program: #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/ioctl.h> #include <linux/vt.h> struct tiocl_selection { unsigned short xs; /* X start */ unsigned short ys; /* Y start */ unsigned short xe; /* X end */ unsigned short ye; /* Y end */ unsigned short sel_mode; /* selection mode */ }; #define TIOCL_SETSEL 2 struct tiocl { unsigned char type; unsigned char pad; struct tiocl_selection sel; }; int main() { int fd = 0; const char *dev = "/dev/char/4:1"; struct vt_consize v = {0}; struct tiocl tioc = {0}; fd = open(dev, O_RDWR, 0); v.v_rows = 3346; ioctl(fd, VT_RESIZEX, &v); tioc.type = TIOCL_SETSEL; ioctl(fd, TIOCLINUX, &tioc); return 0; } When resize the screen, update the 'vc->vc_size_row' to the new_row_size, but when 'set_origin' in 'vgacon_set_origin', vgacon use 'vga_vram_base' for 'vc_origin' and 'vc_visible_origin', not 'vc_screenbuf'. It maybe smaller than 'vc_screenbuf'. When TIOCLINUX, use the new_row_size to calc the offset, it maybe larger than the vga_vram_size in vgacon driver, then bad access. Also, if set an larger screenbuf firstly, then set an more larger screenbuf, when copy old_origin to new_origin, a bad access may happen. So, If the screen size larger than vga_vram, resize screen should be failed. This alse fix CVE-2020-8649 and CVE-2020-8647. Linus pointed out that overflow checking seems absent. We're saved by the existing bounds checks in vc_do_resize() with rather strict limits: if (cols > VC_RESIZE_MAXCOL || lines > VC_RESIZE_MAXROW) return -EINVAL; Fixes: 0aec4867 ("[PATCH] SVGATextMode fix") Reference: CVE-2020-8647 and CVE-2020-8649 Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NZhang Xiaoxu <zhangxiaoxu5@huawei.com> [danvet: augment commit message to point out overflow safety] Cc: stable@vger.kernel.org Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20200304022429.37738-1-zhangxiaoxu5@huawei.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Al Viro 提交于
commit d0cb50185ae942b03c4327be322055d622dc79f6 upstream. [ Fixes: CVE-2020-8428 ] may_create_in_sticky() call is done when we already have dropped the reference to dir. Fixes: 30aba665 (namei: allow restricted O_CREAT of FIFOs and regular files) Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Boris Ostrovsky 提交于
commit 8c6de56a42e0c657955e12b882a81ef07d1d073e upstream. [ Fixes: CVE-2019-3016 ] kvm_steal_time_set_preempted() may accidentally clear KVM_VCPU_FLUSH_TLB bit if it is called more than once while VCPU is preempted. This is part of CVE-2019-3016. (This bug was also independently discovered by Jim Mattson <jmattson@google.com>) Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: NJoao Martins <joao.m.martins@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Oliver Upton 提交于
commit 35a571346a94fb93b5b3b6a599675ef3384bc75c upstream. [ Fixes: CVE-2020-2732 ] Consult the 'unconditional IO exiting' and 'use IO bitmaps' VM-execution controls when checking instruction interception. If the 'use IO bitmaps' VM-execution control is 1, check the instruction access against the IO bitmaps to determine if the instruction causes a VM-exit. Signed-off-by: NOliver Upton <oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Oliver Upton 提交于
commit e71237d3ff1abf9f3388337cfebf53b96df2020d upstream. [ Fixes: CVE-2020-2732 ] Checks against the IO bitmap are useful for both instruction emulation and VM-exit reflection. Refactor the IO bitmap checks into a helper function. Signed-off-by: NOliver Upton <oupton@google.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Paolo Bonzini 提交于
commit 07721feee46b4b248402133228235318199b05ec upstream. [ Fixes: CVE-2020-2732 ] vmx_check_intercept is not yet fully implemented. To avoid emulating instructions disallowed by the L1 hypervisor, refuse to emulate instructions by default. Cc: stable@vger.kernel.org [Made commit, added commit msg - Oliver] Signed-off-by: NOliver Upton <oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Shile Zhang 提交于
commit 07447453db3aebb6a0917592f411a7122d12a8b9 upstream linux-next. When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdatinit' kthread will initialise the deferred pages with local interrupts disabled. It is introduced by commit 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages"). On machine with NCPUS <= 2, the 'pgdatinit' kthread could be bound to the boot CPU, which could caused the tick timer long time stall, system jiffies not be updated in time. The dmesg shown that: [ 0.197975] node 0 initialised, 32170688 pages in 1ms Obviously, 1ms is unreasonable. Now, fix it by restore in the pending interrupts for every 32*1204 pages (128MB) initialized, give the chance to update the systemd jiffies. The reasonable demsg shown likes: [ 1.069306] node 0 initialised, 32203456 pages in 894ms Link: http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages") Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Co-developed-by: NKirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Xu Yu 提交于
This exports the workingset counters, i.e., workingset_refault, workingset_activate, workingset_restore, and workingset_nodereclaim, to memory cgroup v1. The stat collection of these counters is shared between memory cgroup v1 and v2. What this patch does is just to export them on memory cgroup v1. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Lingpeng Chen 提交于
commit e7a5f1f1cd0008e5ad379270a8657e121eedb669 upstream Right now in tcp_bpf_recvmsg, sock read data first from sk_receive_queue if not empty than psock->ingress_msg otherwise. If a FIN packet arrives and there's also some data in psock->ingress_msg, the data in psock->ingress_msg will be purged. It is always happen when request to a HTTP1.0 server like python SimpleHTTPServer since the server send FIN packet after data is sent out. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: NArika Chen <eaglesora@gmail.com> Suggested-by: NArika Chen <eaglesora@gmail.com> Signed-off-by: NLingpeng Chen <forrest0579@gmail.com> Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NSong Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com [tonylu: patched modified to match BIG rework between v4.19 and upstream] Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 liushanghui 提交于
It enables SRIOV of PCIe devices of Alibaba MOC, then VFs can be used by other host or VM above on the host. Signed-off-by: Nliushanghui <liushanghui@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Xu Yu 提交于
Explicitly abort mem_cgroup_select_bad_process in priority oom if there is already a task as oom victim without MMF_OOM_SKIP flag set. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
-
由 Xu Yu 提交于
Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() to check only one thread from each thread group, we can make cgroup_subsys_state::nr_tasks to record only the thread group leader, i.e., process, instead of thread(s). Furthermore, this renames cgroup_subsys_state::nr_tasks to cgroup_subsys_state::nr_procs. Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in the css and its descendants") Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Tetsuo Handa 提交于
commit f168a9a54ec39b3f832c353733898b713b6b5c1f upstream. Since commit c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works, mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check only one thread from each thread group. [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()] Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xu Yu 提交于
Assuming that there is a memory cgroup tree as follows: A (use_priority_oom=1, limit=2.5G) / \ / C (priority=3, usage=1.5G) B (priority=0, usage=1G) As task in C (task-c) invokes oom-killer, task in B (task-b) is chosen and killed, and then task-c returns from mem_cgroup_oom and retries in try_charge. If memory page_counter of B has not been reset yet, leading to task-c invokes oom-killer again, the soft lockup may happen. In this situation, task-c keeps selecting bad process in B, while the only task-b in B has already been set PF_EXITING flag, which makes task-b skipped in css_task_iter_advance. Finally, task-c selected no bad process in B and keeps retrying, and task-b is stalled in synchronize_rcu when do_exit, exit_task_namespaces specifically. In a nutshell, the new behavior of css_task_iter_advance, i.e., commit c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations"), causes priority oom to misbehave. This fixes the soft lockup by accounting num_oom_skip of the victim memcg and its parents (sift up to oc->memcg), if no bad process is chosen from it. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Xiaoguang Wang 提交于
commit 32b2244a840a90ea94ba42392de5c48d53f521f5 upstream linux-next When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need to do io completion events polling again, they can rely on io_sq_thread to do polling work, which can reduce cpu usage and uring_lock contention. I modify fio io_uring engine codes a bit to evaluate the performance: static int fio_ioring_getevents(struct thread_data *td, unsigned int min, continue; } - if (!o->sqpoll_thread) { + if (o->sqpoll_thread && o->hipri) { r = io_uring_enter(ld, 0, actual_min, IORING_ENTER_GETEVENTS); if (r < 0) { and use "fio -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread -rw=read -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=1 -time_based -runtime=120" original codes -------------------------------------------------------------------- iodepth | 4 | 8 | 16 | 32 | 64 bw | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s fio cpu usage | 100% | 100% | 100% | 100% | 100% -------------------------------------------------------------------- with patch -------------------------------------------------------------------- iodepth | 4 | 8 | 16 | 32 | 64 bw | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s fio cpu usage | 63.8% | 74.4%% | 81.1% | 83.7% | 82.4% -------------------------------------------------------------------- bw improve | 5.5% | 13.2% | 12.3% | 9.8% | 11.5% -------------------------------------------------------------------- From above test results, we can see that bw has above 5.5%~13% improvement, and fio process's cpu usage also drops much. Note this won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL are both enabled, in this case, io_sq_thread always has 100% cpu usage. I think this patch will be friendly to applications which will often use io_uring_wait_cqe() or similar from liburing. Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Yufen Yu 提交于
commit 3b7436cc9449d5ff7fa1c1fd5bc3edb6402ff5b8 upstream. For super_90_load, we need to make sure 'desc_nr' less than MD_SB_DISKS, avoiding invalid memory access of 'sb->disks'. Fixes: 228fc7d76db6 ("md: avoid invalid memory access for array sb->dev_roles") Signed-off-by: NYufen Yu <yuyufen@huawei.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Yufen Yu 提交于
commit 228fc7d76db68732677230a3c64337908fd298e3 upstream. we need to gurantee 'desc_nr' valid before access array of sb->dev_roles. In addition, we should avoid .load_super always return '0' when level is LEVEL_MULTIPATH, which is not expected. Reported-by: Ncoverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1487373 ("Memory - illegal accesses") Fixes: 6a5cb53aaa4e ("md: no longer compare spare disk superblock events in super_load") Signed-off-by: NYufen Yu <yuyufen@huawei.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Yufen Yu 提交于
commit 6a5cb53aaa4ef515ddeffa04ce18b771121127b4 upstream. We have a test case as follow: mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \ --assume-clean --bitmap=internal mdadm -S /dev/md1 mdadm -A /dev/md1 /dev/sd[b-c] --run --force mdadm --zero /dev/sda mdadm /dev/md1 -a /dev/sda echo offline > /sys/block/sdc/device/state echo offline > /sys/block/sdb/device/state sleep 5 mdadm -S /dev/md1 echo running > /sys/block/sdb/device/state echo running > /sys/block/sdc/device/state mdadm -A /dev/md1 /dev/sd[a-c] --run --force When we readd /dev/sda to the array, it started to do recovery. After offline the other two disks in md1, the recovery have been interrupted and superblock update info cannot be written to the offline disks. While the spare disk (/dev/sda) can continue to update superblock info. After stopping the array and assemble it, we found the array run fail, with the follow kernel message: [ 172.986064] md: kicking non-fresh sdb from array! [ 173.004210] md: kicking non-fresh sdc from array! [ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors [ 173.022406] md1: failed to create bitmap (-5) [ 173.023466] md: md1 stopped. Since both sdb and sdc have the value of 'sb->events' smaller than that in sda, they have been kicked from the array. However, the only remained disk sda is in 'spare' state before stop and it cannot be added to conf->mirrors[] array. In the end, raid array assemble and run fail. In fact, we can use the older disk sdb or sdc to assemble the array. That means we should not choose the 'spare' disk as the fresh disk in analyze_sbs(). To fix the problem, we do not compare superblock events when it is a spare disk, as same as validate_super. Signed-off-by: NYufen Yu <yuyufen@huawei.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Pawel Baldysiak 提交于
commit c42d3240990814eec1e4b2b93fa0487fc4873aed upstream. Mdadm expects that setting drive as faulty will fail with -EBUSY only if this operation will cause RAID to be failed. If this happens, it will try to stop the array. Currently -EBUSY might also be returned if rdev is in the middle of the removal process - for example there is a race with mdmon that already requested the drive to be failed/removed. If rdev does not contain mddev, return -ENODEV instead, so the caller can distinguish between those two cases and behave accordingly. Reviewed-by: NNeilBrown <neilb@suse.com> Signed-off-by: NPawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Alex Wu 提交于
commit ee37d7314a32ab6809eacc3389bad0406c69a81f upstream. [Symptom] Resync thread hang when new added disk faulty during replacing. [Root Cause] In raid10_sync_request(), we expect to issue a bio with callback end_sync_read(), and a bio with callback end_sync_write(). In normal situation, we will add resyncing sectors into mddev->recovery_active when raid10_sync_request() returned, and sub resynced sectors from mddev->recovery_active when end_sync_write() calls end_sync_request(). If new added disk, which are replacing the old disk, is set faulty, there is a race condition: 1. In the first rcu protected section, resync thread did not detect that mreplace is set faulty and pass the condition. 2. In the second rcu protected section, mreplace is set faulty. 3. But, resync thread will prepare the read object first, and then check the write condition. 4. It will find that mreplace is set faulty and do not have to prepare write object. This cause we add resync sectors but never sub it. [How to Reproduce] This issue can be easily reproduced by the following steps: mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd] mdadm /dev/md0 -a /dev/sde mdadm /dev/md0 --replace /dev/sdd sleep 1 mdadm /dev/md0 -f /dev/sde [How to Fix] This issue can be fixed by using local variables to record the result of test conditions. Once the conditions are satisfied, we can make sure that we need to issue a bio for read and a bio for write. Previous 'commit 24afd80d ("md/raid10: handle recovery of replacement devices.")' will also check whether bio is NULL, but leave the comment saying that it is a pointless test. So we remove this dummy check. Reported-by: NAlex Chen <alexchen@synology.com> Reviewed-by: NAllen Peng <allenpeng@synology.com> Reviewed-by: NBingJing Chang <bingjingc@synology.com> Signed-off-by: NAlex Wu <alexwu@synology.com> Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Xu Yu 提交于
The memcg background async page reclaim, a.k.a, memcg kswapd, is implemented with a dedicated unbound workqueue currently. However, memcg kswapd will run too frequently, resulting in high overhead, page cache thrashing, frequent dirty page writeback, etc., due to improper memcg memory.wmark_ratio, unreasonable memcg memor capacity, or even abnormal memcg memory usage. We need to find out the problematic memcg(s) where memcg kswapd introduces significant overhead. This records the latency of each run of memcg kswapd work, and then aggregates into the exstat of per memcg. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Rafael J. Wysocki 提交于
commit 22782b3f9bb8ae21c710e2880db21bc729771e92 upstream After commit 61cb5758d3c4 ("cpuidle: Add cpuidle.governor= command line parameter") new cpuidle governors are not added to the list of available governors, so governor selection via sysfs doesn't work as expected (even though it is rarely used anyway). Fix that by making cpuidle_register_governor() add new governors to cpuidle_governors again. Fixes: 61cb5758d3c4 ("cpuidle: Add cpuidle.governor= command line parameter") Reported-by: NKees Cook <keescook@chromium.org> Cc: 5.0+ <stable@vger.kernel.org> # 5.0+ Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Marcelo Tosatti 提交于
commit 2d5ba19bdfef4dd06add144eb04287ee98409f75 upstream Add an MSRs which allows the guest to disable host polling (specifically the cpuidle-haltpoll, when performing polling in the guest, disables host side polling). Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Marc Zyngier 提交于
commit ef2e78ddadbb939ce79553b10dee0131d65d8f3e upstream. Just like we do for WFE trapping, it can be useful to turn off WFI trapping when the physical CPU is not oversubscribed (that is, the vcpu is the only runnable process on this CPU) *and* that we're using direct injection of interrupts. The conditions are reevaluated on each vcpu_load(), ensuring that we don't switch to this mode on a busy system. On a GICv4 system, this has the effect of reducing the generation of doorbell interrupts to zero when the right conditions are met, which is a huge improvement over the current situation (where the doorbells are screaming if the CPU ever hits a blocking WFI). Signed-off-by: NMarc Zyngier <maz@kernel.org> Reviewed-by: NZenghui Yu <yuzenghui@huawei.com> Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com> Link: https://lore.kernel.org/r/20191107160412.30301-3-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com> Acked-by: NZou Cao <zoucao@linux.alibaba.com>
-
由 Marc Zyngier 提交于
commit 5bd90b0989731520f2cdcfbbe467f1271f3cc803 upstream. In order to find out whether a vcpu is likely to be the target of VLPIs (and to further optimize the way we deal with those), let's track the number of VLPIs a vcpu can receive. This gets implemented with an atomic variable that gets incremented or decremented on map, unmap and move of a VLPI. Signed-off-by: NMarc Zyngier <maz@kernel.org> Reviewed-by: NZenghui Yu <yuzenghui@huawei.com> Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com> Link: https://lore.kernel.org/r/20191107160412.30301-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com> Acked-by: NZou Cao <zoucao@linux.alibaba.com>
-
由 Marc Zyngier 提交于
commit 8e01d9a396e6db153d94a6004e6473d9ff251a6a upstream. When the VHE code was reworked, a lot of the vgic stuff was moved around, but the GICv4 residency code did stay untouched, meaning that we come in and out of residency on each flush/sync, which is obviously suboptimal. To address this, let's move things around a bit: - Residency entry (flush) moves to vcpu_load - Residency exit (sync) moves to vcpu_put - On blocking (entry to WFI), we "put" - On unblocking (exit from WFI), we "load" Because these can nest (load/block/put/load/unblock/put, for example), we now have per-VPE tracking of the residency state. Additionally, vgic_v4_put gains a "need doorbell" parameter, which only gets set to true when blocking because of a WFI. This allows a finer control of the doorbell, which now also gets disabled as soon as it gets signaled. Signed-off-by: NMarc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com> Acked-by: NZou Cao <zoucao@linux.alibaba.com>
-
由 Tony Luck 提交于
commit e80634a75aba90e7485cd1fdb463fcac5d45f14d upstream Skylake logs some additional useful information in per-channel registers in addition the the architectural status/addr/misc logged in the machine check bank. Pick up this information and add it to the EDAC log: retry_rd_err_[five 32-bit register values] Sorry, no definitions for these registers. OEMs and DIMM vendors will be able to use them to isolate which cells in the DIMM are causing problems. correrrcnt[per rank corrected error counts] Note that if additional errors are logged while these registers are being read, you may see a jumble of values some from earlier errors, others from later errors (since the registers report the most recent logged error). The correrrcnt registers provide error counts per possible rank. If these counts only change by one since the previous error logged for this channel, then it is safe to assume that the registers logged provide a coherent view of one error. With this change EDAC logs look like this: EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000]) Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: Nzhaobing <zhaobing@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
由 Arnaldo Carvalho de Melo 提交于
commit b1ba55cf1cfb9f3e0e00d743534684a25bf66d28 upstream To pick the changes from: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") 9c276cc65a58 ("mm: introduce MADV_COLD") That result in these changes in the tools: $ tools/perf/trace/beauty/madvise_behavior.sh > before $ cp include/uapi/asm-generic/mman-common.h tools/include/uapi/asm-generic/mman-common.h $ git diff diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h index 63b1f506ea67..c160a5354eb6 100644 --- a/tools/include/uapi/asm-generic/mman-common.h +++ b/tools/include/uapi/asm-generic/mman-common.h @@ -67,6 +67,9 @@ #define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */ #define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */ +#define MADV_COLD 20 /* deactivate these pages */ +#define MADV_PAGEOUT 21 /* reclaim these pages */ + /* compatibility flags */ #define MAP_FILE 0 $ tools/perf/trace/beauty/madvise_behavior.sh > after $ diff -u before after --- before 2019-09-27 11:29:43.346320100 -0300 +++ after 2019-09-27 11:30:03.838570439 -0300 @@ -16,6 +16,8 @@ [17] = "DODUMP", [18] = "WIPEONFORK", [19] = "KEEPONFORK", + [20] = "COLD", + [21] = "PAGEOUT", [100] = "HWPOISON", [101] = "SOFT_OFFLINE", }; $ I.e. now when madvise gets those behaviours as args, it will be able to translate from the number to a human readable string. This addresses the following perf build warning: Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/mman-common.h' differs from latest version at 'include/uapi/asm-generic/mman-common.h' diff -u tools/include/uapi/asm-generic/mman-common.h include/uapi/asm-generic/mman-common.h Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-n40y6c4sa49p29q6sl8w3ufx@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 zhong jiang 提交于
commit 82072962973008201b817fae1896512977dd5083 upstream Recently, I hit the following issue when running upstream. kernel BUG at mm/vmscan.c:1521! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1 RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521 Call Trace: reclaim_pages+0x499/0x800 mm/vmscan.c:2188 madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453 walk_pmd_range mm/pagewalk.c:53 [inline] walk_pud_range mm/pagewalk.c:112 [inline] walk_p4d_range mm/pagewalk.c:139 [inline] walk_pgd_range mm/pagewalk.c:166 [inline] __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261 walk_page_range+0x179/0x310 mm/pagewalk.c:349 madvise_pageout_page_range mm/madvise.c:506 [inline] madvise_pageout+0x1f0/0x330 mm/madvise.c:542 madvise_vma mm/madvise.c:931 [inline] __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113 do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe madvise_pageout() accesses the specified range of the vma and isolates them, then runs shrink_page_list() to reclaim its memory. But it also isolates the unevictable pages to reclaim. Hence, we can catch the cases in shrink_page_list(). The root cause is that we scan the page tables instead of specific LRU list. and so we need to filter out the unevictable lru pages from our end. Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: Nzhong jiang <zhongjiang@huawei.com> Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
-