1. 19 3月, 2020 1 次提交
  2. 18 3月, 2020 39 次提交
    • Y
      dccp: Fix memleak in __feat_register_sp · 84971431
      YueHaibing 提交于
      commit 1d3ff0950e2b40dc861b1739029649d03f591820 upstream.
      
      [ Fixes: CVE-2019-20096 ]
      
      If dccp_feat_push_change fails, we forget free the mem
      which is alloced by kmemdup in dccp_feat_clone_sp_val.
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Fixes: e8ef967a ("dccp: Registration routines for changing feature values")
      Reviewed-by: NMukesh Ojha <mojha@codeaurora.org>
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      84971431
    • J
      scsi: libsas: stop discovering if oob mode is disconnected · d6fbed97
      Jason Yan 提交于
      commit f70267f379b5e5e11bdc5d72a56bf17e5feed01f upstream.
      
      [ Fixes: CVE-2019-19965 ]
      
      The discovering of sas port is driven by workqueue in libsas. When libsas
      is processing port events or phy events in workqueue, new events may rise
      up and change the state of some structures such as asd_sas_phy.  This may
      cause some problems such as follows:
      
      ==>thread 1                       ==>thread 2
      
                                        ==>phy up
                                        ==>phy_up_v3_hw()
                                          ==>oob_mode = SATA_OOB_MODE;
                                        ==>phy down quickly
                                        ==>hisi_sas_phy_down()
                                          ==>sas_ha->notify_phy_event()
                                          ==>sas_phy_disconnected()
                                            ==>oob_mode = OOB_NOT_CONNECTED
      ==>workqueue wakeup
      ==>sas_form_port()
        ==>sas_discover_domain()
          ==>sas_get_port_device()
            ==>oob_mode is OOB_NOT_CONNECTED and device
               is wrongly taken as expander
      
      This at last lead to the panic when libsas trying to issue a command to
      discover the device.
      
      [183047.614035] Unable to handle kernel NULL pointer dereference at
      virtual address 0000000000000058
      [183047.622896] Mem abort info:
      [183047.625762]   ESR = 0x96000004
      [183047.628893]   Exception class = DABT (current EL), IL = 32 bits
      [183047.634888]   SET = 0, FnV = 0
      [183047.638015]   EA = 0, S1PTW = 0
      [183047.641232] Data abort info:
      [183047.644189]   ISV = 0, ISS = 0x00000004
      [183047.648100]   CM = 0, WnR = 0
      [183047.651145] user pgtable: 4k pages, 48-bit VAs, pgdp =
      00000000b7df67be
      [183047.657834] [0000000000000058] pgd=0000000000000000
      [183047.662789] Internal error: Oops: 96000004 [#1] SMP
      [183047.667740] Process kworker/u16:2 (pid: 31291, stack limit =
      0x00000000417c4974)
      [183047.675208] CPU: 0 PID: 3291 Comm: kworker/u16:2 Tainted: G
      W  OE 4.19.36-vhulk1907.1.0.h410.eulerosv2r8.aarch64 #1
      [183047.687015] Hardware name: N/A N/A/Kunpeng Desktop Board D920S10,
      BIOS 0.15 10/22/2019
      [183047.695007] Workqueue: 0000:74:02.0_disco_q sas_discover_domain
      [183047.700999] pstate: 20c00009 (nzCv daif +PAN +UAO)
      [183047.705864] pc : prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
      [183047.711510] lr : prep_ata_v3_hw+0xb0/0x230 [hisi_sas_v3_hw]
      [183047.717153] sp : ffff00000f28ba60
      [183047.720541] x29: ffff00000f28ba60 x28: ffff8026852d7228
      [183047.725925] x27: ffff8027dba3e0a8 x26: ffff8027c05fc200
      [183047.731310] x25: 0000000000000000 x24: ffff8026bafa8dc0
      [183047.736695] x23: ffff8027c05fc218 x22: ffff8026852d7228
      [183047.742079] x21: ffff80007c2f2940 x20: ffff8027c05fc200
      [183047.747464] x19: 0000000000f80800 x18: 0000000000000010
      [183047.752848] x17: 0000000000000000 x16: 0000000000000000
      [183047.758232] x15: ffff000089a5a4ff x14: 0000000000000005
      [183047.763617] x13: ffff000009a5a50e x12: ffff8026bafa1e20
      [183047.769001] x11: ffff0000087453b8 x10: ffff00000f28b870
      [183047.774385] x9 : 0000000000000000 x8 : ffff80007e58f9b0
      [183047.779770] x7 : 0000000000000000 x6 : 000000000000003f
      [183047.785154] x5 : 0000000000000040 x4 : ffffffffffffffe0
      [183047.790538] x3 : 00000000000000f8 x2 : 0000000002000007
      [183047.795922] x1 : 0000000000000008 x0 : 0000000000000000
      [183047.801307] Call trace:
      [183047.803827]  prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
      [183047.809127]  hisi_sas_task_prep+0x750/0x888 [hisi_sas_main]
      [183047.814773]  hisi_sas_task_exec.isra.7+0x88/0x1f0 [hisi_sas_main]
      [183047.820939]  hisi_sas_queue_command+0x28/0x38 [hisi_sas_main]
      [183047.826757]  smp_execute_task_sg+0xec/0x218
      [183047.831013]  smp_execute_task+0x74/0xa0
      [183047.834921]  sas_discover_expander.part.7+0x9c/0x5f8
      [183047.839959]  sas_discover_root_expander+0x90/0x160
      [183047.844822]  sas_discover_domain+0x1b8/0x1e8
      [183047.849164]  process_one_work+0x1b4/0x3f8
      [183047.853246]  worker_thread+0x54/0x470
      [183047.856981]  kthread+0x134/0x138
      [183047.860283]  ret_from_fork+0x10/0x18
      [183047.863931] Code: f9407a80 528000e2 39409281 72a04002 (b9405800)
      [183047.870097] kernel fault(0x1) notification starting on CPU 0
      [183047.875828] kernel fault(0x1) notification finished on CPU 0
      [183047.881559] Modules linked in: unibsp(OE) hns3(OE) hclge(OE)
      hnae3(OE) mem_drv(OE) hisi_sas_v3_hw(OE) hisi_sas_main(OE)
      [183047.892418] ---[ end trace 4cc26083fc11b783  ]---
      [183047.897107] Kernel panic - not syncing: Fatal exception
      [183047.902403] kernel fault(0x5) notification starting on CPU 0
      [183047.908134] kernel fault(0x5) notification finished on CPU 0
      [183047.913865] SMP: stopping secondary CPUs
      [183047.917861] Kernel Offset: disabled
      [183047.921422] CPU features: 0x2,a2a00a38
      [183047.925243] Memory Limit: none
      [183047.928372] kernel reboot(0x2) notification starting on CPU 0
      [183047.934190] kernel reboot(0x2) notification finished on CPU 0
      [183047.940008] ---[ end Kernel panic - not syncing: Fatal exception
      ]---
      
      Fixes: 2908d778 ("[SCSI] aic94xx: new driver")
      Link: https://lore.kernel.org/r/20191206011118.46909-1-yanaijie@huawei.comReported-by: NGao Chuan <gaochuan4@huawei.com>
      Reviewed-by: NJohn Garry <john.garry@huawei.com>
      Signed-off-by: NJason Yan <yanaijie@huawei.com>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d6fbed97
    • A
      drm/i915/gen9: Clear residual context state on context switch · 3a25a42b
      Akeem G Abodunrin 提交于
      commit bc8a76a152c5f9ef3b48104154a65a68a8b76946 upstream.
      
      [ Fixes: CVE-2019-14615 ]
      
      Intel ID: PSIRT-TA-201910-001
      CVEID: CVE-2019-14615
      
      Intel GPU Hardware prior to Gen11 does not clear EU state
      during a context switch. This can result in information
      leakage between contexts.
      
      For Gen8 and Gen9, hardware provides a mechanism for
      fast cleardown of the EU state, by issuing a PIPE_CONTROL
      with bit 27 set. We can use this in a context batch buffer
      to explicitly cleardown the state on every context switch.
      
      As this workaround is already in place for gen8, we can borrow
      the code verbatim for Gen9.
      Signed-off-by: NMika Kuoppala <mika.kuoppala@linux.intel.com>
      Signed-off-by: NAkeem G Abodunrin <akeem.g.abodunrin@intel.com>
      Cc: Kumar Valsan Prathap <prathap.kumar.valsan@intel.com>
      Cc: Chris Wilson <chris.p.wilson@intel.com>
      Cc: Balestrieri Francesco <francesco.balestrieri@intel.com>
      Cc: Bloomfield Jon <jon.bloomfield@intel.com>
      Cc: Dutt Sudeep <sudeep.dutt@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      3a25a42b
    • N
      RDMA: Fix goto target to release the allocated memory · ecf0326a
      Navid Emamdoost 提交于
      commit 4a9d46a9fe14401f21df69cea97c62396d5fb053 upstream.
      
      [ Fixes: CVE-2019-19077 ]
      
      In bnxt_re_create_srq(), when ib_copy_to_udata() fails allocated memory
      should be released by goto fail.
      
      Fixes: 37cb11ac ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters")
      Link: https://lore.kernel.org/r/20190910222120.16517-1-navid.emamdoost@gmail.comSigned-off-by: NNavid Emamdoost <navid.emamdoost@gmail.com>
      Reviewed-by: NJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      ecf0326a
    • N
      ipmi: Fix memory leak in __ipmi_bmc_register · 07c1ee1f
      Navid Emamdoost 提交于
      commit 4aa7afb0ee20a97fbf0c5bab3df028d5fb85fdab upstream.
      
      [ Fixes: CVE-2019-19046 ]
      
      In the impelementation of __ipmi_bmc_register() the allocated memory for
      bmc should be released in case ida_simple_get() fails.
      
      Fixes: 68e7e50f ("ipmi: Don't use BMC product/dev ids in the BMC name")
      Signed-off-by: NNavid Emamdoost <navid.emamdoost@gmail.com>
      Message-Id: <20191021200649.1511-1-navid.emamdoost@gmail.com>
      Signed-off-by: NCorey Minyard <cminyard@mvista.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      07c1ee1f
    • J
      vt: selection, close sel_buffer race · 210ea9a5
      Jiri Slaby 提交于
      commit 07e6124a1a46b4b5a9b3cacc0c306b50da87abf5 upstream.
      
      [ Fixes: CVE-2020-8648 ]
      
      syzkaller reported this UAF:
      BUG: KASAN: use-after-free in n_tty_receive_buf_common+0x2481/0x2940 drivers/tty/n_tty.c:1741
      Read of size 1 at addr ffff8880089e40e9 by task syz-executor.1/13184
      
      CPU: 0 PID: 13184 Comm: syz-executor.1 Not tainted 5.4.7 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      Call Trace:
      ...
       kasan_report+0xe/0x20 mm/kasan/common.c:634
       n_tty_receive_buf_common+0x2481/0x2940 drivers/tty/n_tty.c:1741
       tty_ldisc_receive_buf+0xac/0x190 drivers/tty/tty_buffer.c:461
       paste_selection+0x297/0x400 drivers/tty/vt/selection.c:372
       tioclinux+0x20d/0x4e0 drivers/tty/vt/vt.c:3044
       vt_ioctl+0x1bcf/0x28d0 drivers/tty/vt/vt_ioctl.c:364
       tty_ioctl+0x525/0x15a0 drivers/tty/tty_io.c:2657
       vfs_ioctl fs/ioctl.c:47 [inline]
      
      It is due to a race between parallel paste_selection (TIOCL_PASTESEL)
      and set_selection_user (TIOCL_SETSEL) invocations. One uses sel_buffer,
      while the other frees it and reallocates a new one for another
      selection. Add a mutex to close this race.
      
      The mutex takes care properly of sel_buffer and sel_buffer_lth only. The
      other selection global variables (like sel_start, sel_end, and sel_cons)
      are protected only in set_selection_user. The other functions need quite
      some more work to close the races of the variables there. This is going
      to happen later.
      
      This likely fixes (I am unsure as there is no reproducer provided) bug
      206361 too. It was marked as CVE-2020-8648.
      Signed-off-by: NJiri Slaby <jslaby@suse.cz>
      Reported-by: syzbot+59997e8d5cbdc486e6f6@syzkaller.appspotmail.com
      References: https://bugzilla.kernel.org/show_bug.cgi?id=206361
      Cc: stable <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20200210081131.23572-2-jslaby@suse.czSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      210ea9a5
    • Z
      vgacon: Fix a UAF in vgacon_invert_region · 057afee4
      Zhang Xiaoxu 提交于
      commit 513dc792d6060d5ef572e43852683097a8420f56 upstream.
      
      [ Fixes: CVE-2020-8647, CVE-2020-8649 ]
      
      When syzkaller tests, there is a UAF:
        BUG: KASan: use after free in vgacon_invert_region+0x9d/0x110 at addr
          ffff880000100000
        Read of size 2 by task syz-executor.1/16489
        page:ffffea0000004000 count:0 mapcount:-127 mapping:          (null)
        index:0x0
        page flags: 0xfffff00000000()
        page dumped because: kasan: bad access detected
        CPU: 1 PID: 16489 Comm: syz-executor.1 Not tainted
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
        Call Trace:
          [<ffffffffb119f309>] dump_stack+0x1e/0x20
          [<ffffffffb04af957>] kasan_report+0x577/0x950
          [<ffffffffb04ae652>] __asan_load2+0x62/0x80
          [<ffffffffb090f26d>] vgacon_invert_region+0x9d/0x110
          [<ffffffffb0a39d95>] invert_screen+0xe5/0x470
          [<ffffffffb0a21dcb>] set_selection+0x44b/0x12f0
          [<ffffffffb0a3bfae>] tioclinux+0xee/0x490
          [<ffffffffb0a1d114>] vt_ioctl+0xff4/0x2670
          [<ffffffffb0a0089a>] tty_ioctl+0x46a/0x1a10
          [<ffffffffb052db3d>] do_vfs_ioctl+0x5bd/0xc40
          [<ffffffffb052e2f2>] SyS_ioctl+0x132/0x170
          [<ffffffffb11c9b1b>] system_call_fastpath+0x22/0x27
          Memory state around the buggy address:
           ffff8800000fff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           00 00
           ffff8800000fff80: 00 00 00 00 00 00 00 00 00 00 00 00 00
           00 00 00
          >ffff880000100000: ff ff ff ff ff ff ff ff ff ff ff ff ff
           ff ff ff
      
      It can be reproduce in the linux mainline by the program:
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/types.h>
        #include <sys/stat.h>
        #include <sys/ioctl.h>
        #include <linux/vt.h>
      
        struct tiocl_selection {
          unsigned short xs;      /* X start */
          unsigned short ys;      /* Y start */
          unsigned short xe;      /* X end */
          unsigned short ye;      /* Y end */
          unsigned short sel_mode; /* selection mode */
        };
      
        #define TIOCL_SETSEL    2
        struct tiocl {
          unsigned char type;
          unsigned char pad;
          struct tiocl_selection sel;
        };
      
        int main()
        {
          int fd = 0;
          const char *dev = "/dev/char/4:1";
      
          struct vt_consize v = {0};
          struct tiocl tioc = {0};
      
          fd = open(dev, O_RDWR, 0);
      
          v.v_rows = 3346;
          ioctl(fd, VT_RESIZEX, &v);
      
          tioc.type = TIOCL_SETSEL;
          ioctl(fd, TIOCLINUX, &tioc);
      
          return 0;
        }
      
      When resize the screen, update the 'vc->vc_size_row' to the new_row_size,
      but when 'set_origin' in 'vgacon_set_origin', vgacon use 'vga_vram_base'
      for 'vc_origin' and 'vc_visible_origin', not 'vc_screenbuf'. It maybe
      smaller than 'vc_screenbuf'. When TIOCLINUX, use the new_row_size to calc
      the offset, it maybe larger than the vga_vram_size in vgacon driver, then
      bad access.
      Also, if set an larger screenbuf firstly, then set an more larger
      screenbuf, when copy old_origin to new_origin, a bad access may happen.
      
      So, If the screen size larger than vga_vram, resize screen should be
      failed. This alse fix CVE-2020-8649 and CVE-2020-8647.
      
      Linus pointed out that overflow checking seems absent. We're saved by
      the existing bounds checks in vc_do_resize() with rather strict
      limits:
      
      	if (cols > VC_RESIZE_MAXCOL || lines > VC_RESIZE_MAXROW)
      		return -EINVAL;
      
      Fixes: 0aec4867 ("[PATCH] SVGATextMode fix")
      Reference: CVE-2020-8647 and CVE-2020-8649
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: NZhang Xiaoxu <zhangxiaoxu5@huawei.com>
      [danvet: augment commit message to point out overflow safety]
      Cc: stable@vger.kernel.org
      Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20200304022429.37738-1-zhangxiaoxu5@huawei.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      057afee4
    • A
      do_last(): fetch directory ->i_mode and ->i_uid before it's too late · 98ab6ba3
      Al Viro 提交于
      commit d0cb50185ae942b03c4327be322055d622dc79f6 upstream.
      
      [ Fixes: CVE-2020-8428 ]
      
      may_create_in_sticky() call is done when we already have dropped the
      reference to dir.
      
      Fixes: 30aba665 (namei: allow restricted O_CREAT of FIFOs and regular files)
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      98ab6ba3
    • B
      x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit · dad6cc96
      Boris Ostrovsky 提交于
      commit 8c6de56a42e0c657955e12b882a81ef07d1d073e upstream.
      
      [ Fixes: CVE-2019-3016 ]
      
      kvm_steal_time_set_preempted() may accidentally clear KVM_VCPU_FLUSH_TLB
      bit if it is called more than once while VCPU is preempted.
      
      This is part of CVE-2019-3016.
      
      (This bug was also independently discovered by Jim Mattson
      <jmattson@google.com>)
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NJoao Martins <joao.m.martins@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      dad6cc96
    • O
      KVM: nVMX: Check IO instruction VM-exit conditions · 4cb032ec
      Oliver Upton 提交于
      commit 35a571346a94fb93b5b3b6a599675ef3384bc75c upstream.
      
      [ Fixes: CVE-2020-2732 ]
      
      Consult the 'unconditional IO exiting' and 'use IO bitmaps' VM-execution
      controls when checking instruction interception. If the 'use IO bitmaps'
      VM-execution control is 1, check the instruction access against the IO
      bitmaps to determine if the instruction causes a VM-exit.
      Signed-off-by: NOliver Upton <oupton@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      4cb032ec
    • O
      KVM: nVMX: Refactor IO bitmap checks into helper function · f26796b0
      Oliver Upton 提交于
      commit e71237d3ff1abf9f3388337cfebf53b96df2020d upstream.
      
      [ Fixes: CVE-2020-2732 ]
      
      Checks against the IO bitmap are useful for both instruction emulation
      and VM-exit reflection. Refactor the IO bitmap checks into a helper
      function.
      Signed-off-by: NOliver Upton <oupton@google.com>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      f26796b0
    • P
      KVM: nVMX: Don't emulate instructions in guest mode · c2868767
      Paolo Bonzini 提交于
      commit 07721feee46b4b248402133228235318199b05ec upstream.
      
      [ Fixes: CVE-2020-2732 ]
      
      vmx_check_intercept is not yet fully implemented. To avoid emulating
      instructions disallowed by the L1 hypervisor, refuse to emulate
      instructions by default.
      
      Cc: stable@vger.kernel.org
      [Made commit, added commit msg - Oliver]
      Signed-off-by: NOliver Upton <oupton@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      c2868767
    • S
      mm: fix tick timer stall during deferred page init · bdfadace
      Shile Zhang 提交于
      commit 07447453db3aebb6a0917592f411a7122d12a8b9 upstream linux-next.
      
      When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdatinit' kthread will
      initialise the deferred pages with local interrupts disabled. It is
      introduced by commit 3a2d7fa8 ("mm: disable interrupts while
      initializing deferred pages").
      
      On machine with NCPUS <= 2, the 'pgdatinit' kthread could be bound to
      the boot CPU, which could caused the tick timer long time stall, system
      jiffies not be updated in time.
      
      The dmesg shown that:
      
          [    0.197975] node 0 initialised, 32170688 pages in 1ms
      
      Obviously, 1ms is unreasonable.
      
      Now, fix it by restore in the pending interrupts for every 32*1204 pages
      (128MB) initialized, give the chance to update the systemd jiffies.
      The reasonable demsg shown likes:
      
          [    1.069306] node 0 initialised, 32203456 pages in 894ms
      
      Link: http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Co-developed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      bdfadace
    • X
      alinux: mm, memcg: export workingset counters on memcg v1 · 27393b9b
      Xu Yu 提交于
      This exports the workingset counters, i.e., workingset_refault,
      workingset_activate, workingset_restore, and workingset_nodereclaim, to
      memory cgroup v1.
      
      The stat collection of these counters is shared between memory cgroup v1
      and v2.  What this patch does is just to export them on memory cgroup v1.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      27393b9b
    • L
      bpf/sockmap: Read psock ingress_msg before sk_receive_queue · 350f8ab8
      Lingpeng Chen 提交于
      commit e7a5f1f1cd0008e5ad379270a8657e121eedb669 upstream
      
      Right now in tcp_bpf_recvmsg, sock read data first from sk_receive_queue
      if not empty than psock->ingress_msg otherwise. If a FIN packet arrives
      and there's also some data in psock->ingress_msg, the data in
      psock->ingress_msg will be purged. It is always happen when request to a
      HTTP1.0 server like python SimpleHTTPServer since the server send FIN
      packet after data is sent out.
      
      Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
      Reported-by: NArika Chen <eaglesora@gmail.com>
      Suggested-by: NArika Chen <eaglesora@gmail.com>
      Signed-off-by: NLingpeng Chen <forrest0579@gmail.com>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com
      [tonylu: patched modified to match BIG rework between v4.19 and upstream]
      Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      350f8ab8
    • L
      alinux: pci/iohub-sriov: Support for Alibaba PCIe IOHub SRIOV · 4a5d2b59
      liushanghui 提交于
      It enables SRIOV of PCIe devices of Alibaba MOC,
      then VFs can be used by other host or VM above on the host.
      Signed-off-by: Nliushanghui <liushanghui@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      4a5d2b59
    • X
      alinux: mm, memcg: abort priority oom if with oom victim · ec661706
      Xu Yu 提交于
      Explicitly abort mem_cgroup_select_bad_process in priority oom if there
      is already a task as oom victim without MMF_OOM_SKIP flag set.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      ec661706
    • X
      alinux: mm, memcg: account number of processes in the css · 2061acd6
      Xu Yu 提交于
      Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at
      mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() to check only one
      thread from each thread group, we can make cgroup_subsys_state::nr_tasks
      to record only the thread group leader, i.e., process, instead of
      thread(s). Furthermore, this renames cgroup_subsys_state::nr_tasks to
      cgroup_subsys_state::nr_procs.
      
      Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in
      the css and its descendants")
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      2061acd6
    • T
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · 7bf04cbb
      Tetsuo Handa 提交于
      commit f168a9a54ec39b3f832c353733898b713b6b5c1f upstream.
      
      Since commit c03cd7738a83 ("cgroup: Include dying leaders with live
      threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
      mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
      only one thread from each thread group.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
        Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      7bf04cbb
    • X
      alinux: mm, memcg: fix soft lockup in priority oom · fb4c5ea6
      Xu Yu 提交于
      Assuming that there is a memory cgroup tree as follows:
      
              A (use_priority_oom=1, limit=2.5G)
             / \
            /   C (priority=3, usage=1.5G)
           B (priority=0, usage=1G)
      
      As task in C (task-c) invokes oom-killer, task in B (task-b) is chosen
      and killed, and then task-c returns from mem_cgroup_oom and retries in
      try_charge.
      
      If memory page_counter of B has not been reset yet, leading to task-c
      invokes oom-killer again, the soft lockup may happen. In this situation,
      task-c keeps selecting bad process in B, while the only task-b in B has
      already been set PF_EXITING flag, which makes task-b skipped in
      css_task_iter_advance.
      
      Finally, task-c selected no bad process in B and keeps retrying, and
      task-b is stalled in synchronize_rcu when do_exit, exit_task_namespaces
      specifically.
      
      In a nutshell, the new behavior of css_task_iter_advance, i.e., commit
      c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS
      iterations"), causes priority oom to misbehave.
      
      This fixes the soft lockup by accounting num_oom_skip of the victim
      memcg and its parents (sift up to oc->memcg), if no bad process is
      chosen from it.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      fb4c5ea6
    • X
      io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled · f1046eaf
      Xiaoguang Wang 提交于
      commit 32b2244a840a90ea94ba42392de5c48d53f521f5 upstream linux-next
      
      When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, applications don't need
      to do io completion events polling again, they can rely on io_sq_thread to do
      polling work, which can reduce cpu usage and uring_lock contention.
      
      I modify fio io_uring engine codes a bit to evaluate the performance:
      static int fio_ioring_getevents(struct thread_data *td, unsigned int min,
                              continue;
                      }
      
      -               if (!o->sqpoll_thread) {
      +               if (o->sqpoll_thread && o->hipri) {
                              r = io_uring_enter(ld, 0, actual_min,
                                                      IORING_ENTER_GETEVENTS);
                              if (r < 0) {
      
      and use "fio  -name=fiotest -filename=/dev/nvme0n1 -iodepth=$depth -thread
      -rw=read -ioengine=io_uring  -hipri=1 -sqthread_poll=1  -direct=1 -bs=4k
      -size=10G -numjobs=1  -time_based -runtime=120"
      
      original codes
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1133MB/s | 1519MB/s | 2090MB/s | 2710MB/s | 3012MB/s
      fio cpu usage |     100% |     100% |     100% |     100% |     100%
      --------------------------------------------------------------------
      
      with patch
      --------------------------------------------------------------------
      iodepth       |        4 |        8 |       16 |       32 |       64
      bw            | 1196MB/s | 1721MB/s | 2351MB/s | 2977MB/s | 3357MB/s
      fio cpu usage |    63.8% |   74.4%% |    81.1% |    83.7% |    82.4%
      --------------------------------------------------------------------
      bw improve    |     5.5% |    13.2% |    12.3% |     9.8% |    11.5%
      --------------------------------------------------------------------
      
      From above test results, we can see that bw has above 5.5%~13%
      improvement, and fio process's cpu usage also drops much. Note this
      won't improve io_sq_thread's cpu usage when SETUP_IOPOLL|SETUP_SQPOLL
      are both enabled, in this case, io_sq_thread always has 100% cpu usage.
      I think this patch will be friendly to applications which will often use
      io_uring_wait_cqe() or similar from liburing.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f1046eaf
    • Y
      md: make sure desc_nr less than MD_SB_DISKS · 659772b2
      Yufen Yu 提交于
      commit 3b7436cc9449d5ff7fa1c1fd5bc3edb6402ff5b8 upstream.
      
      For super_90_load, we need to make sure 'desc_nr' less
      than MD_SB_DISKS, avoiding invalid memory access of 'sb->disks'.
      
      Fixes: 228fc7d76db6 ("md: avoid invalid memory access for array sb->dev_roles")
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      659772b2
    • Y
      md: avoid invalid memory access for array sb->dev_roles · 3126899a
      Yufen Yu 提交于
      commit 228fc7d76db68732677230a3c64337908fd298e3 upstream.
      
      we need to gurantee 'desc_nr' valid before access array
      of sb->dev_roles.
      
      In addition, we should avoid .load_super always return '0'
      when level is LEVEL_MULTIPATH, which is not expected.
      Reported-by: Ncoverity-bot <keescook+coverity-bot@chromium.org>
      Addresses-Coverity-ID: 1487373 ("Memory - illegal accesses")
      Fixes: 6a5cb53aaa4e ("md: no longer compare spare disk superblock events in super_load")
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      3126899a
    • Y
      md: no longer compare spare disk superblock events in super_load · 432bc300
      Yufen Yu 提交于
      commit 6a5cb53aaa4ef515ddeffa04ce18b771121127b4 upstream.
      
      We have a test case as follow:
      
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
      	--assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      When we readd /dev/sda to the array, it started to do recovery.
      After offline the other two disks in md1, the recovery have
      been interrupted and superblock update info cannot be written
      to the offline disks. While the spare disk (/dev/sda) can continue
      to update superblock info.
      
      After stopping the array and assemble it, we found the array
      run fail, with the follow kernel message:
      
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      [  173.023466] md: md1 stopped.
      
      Since both sdb and sdc have the value of 'sb->events' smaller than
      that in sda, they have been kicked from the array. However, the only
      remained disk sda is in 'spare' state before stop and it cannot be
      added to conf->mirrors[] array. In the end, raid array assemble
      and run fail.
      
      In fact, we can use the older disk sdb or sdc to assemble the array.
      That means we should not choose the 'spare' disk as the fresh disk in
      analyze_sbs().
      
      To fix the problem, we do not compare superblock events when it is
      a spare disk, as same as validate_super.
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      432bc300
    • P
      md: return -ENODEV if rdev has no mddev assigned · d6d6b831
      Pawel Baldysiak 提交于
      commit c42d3240990814eec1e4b2b93fa0487fc4873aed upstream.
      
      Mdadm expects that setting drive as faulty will fail with -EBUSY only if
      this operation will cause RAID to be failed. If this happens, it will
      try to stop the array. Currently -EBUSY might also be returned if rdev
      is in the middle of the removal process - for example there is a race
      with mdmon that already requested the drive to be failed/removed.
      
      If rdev does not contain mddev, return -ENODEV instead, so the caller
      can distinguish between those two cases and behave accordingly.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NPawel Baldysiak <pawel.baldysiak@intel.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d6d6b831
    • A
      md/raid10: Fix raid10 replace hang when new added disk faulty · 52c7cc01
      Alex Wu 提交于
      commit ee37d7314a32ab6809eacc3389bad0406c69a81f upstream.
      
      [Symptom]
      
      Resync thread hang when new added disk faulty during replacing.
      
      [Root Cause]
      
      In raid10_sync_request(), we expect to issue a bio with callback
      end_sync_read(), and a bio with callback end_sync_write().
      
      In normal situation, we will add resyncing sectors into
      mddev->recovery_active when raid10_sync_request() returned, and sub
      resynced sectors from mddev->recovery_active when end_sync_write()
      calls end_sync_request().
      
      If new added disk, which are replacing the old disk, is set faulty,
      there is a race condition:
          1. In the first rcu protected section, resync thread did not detect
             that mreplace is set faulty and pass the condition.
          2. In the second rcu protected section, mreplace is set faulty.
          3. But, resync thread will prepare the read object first, and then
             check the write condition.
          4. It will find that mreplace is set faulty and do not have to
             prepare write object.
      This cause we add resync sectors but never sub it.
      
      [How to Reproduce]
      
      This issue can be easily reproduced by the following steps:
          mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
          mdadm /dev/md0 -a /dev/sde
          mdadm /dev/md0 --replace /dev/sdd
          sleep 1
          mdadm /dev/md0 -f /dev/sde
      
      [How to Fix]
      
      This issue can be fixed by using local variables to record the result
      of test conditions. Once the conditions are satisfied, we can make sure
      that we need to issue a bio for read and a bio for write.
      
      Previous 'commit 24afd80d ("md/raid10: handle recovery of
      replacement devices.")' will also check whether bio is NULL, but leave
      the comment saying that it is a pointless test. So we remove this dummy
      check.
      Reported-by: NAlex Chen <alexchen@synology.com>
      Reviewed-by: NAllen Peng <allenpeng@synology.com>
      Reviewed-by: NBingJing Chang <bingjingc@synology.com>
      Signed-off-by: NAlex Wu <alexwu@synology.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      52c7cc01
    • X
      alinux: mm, memcg: record latency of memcg wmark reclaim · 40969475
      Xu Yu 提交于
      The memcg background async page reclaim, a.k.a, memcg kswapd, is
      implemented with a dedicated unbound workqueue currently.
      
      However, memcg kswapd will run too frequently, resulting in high
      overhead, page cache thrashing, frequent dirty page writeback, etc., due
      to improper memcg memory.wmark_ratio, unreasonable memcg memor capacity,
      or even abnormal memcg memory usage.
      
      We need to find out the problematic memcg(s) where memcg kswapd
      introduces significant overhead.
      
      This records the latency of each run of memcg kswapd work, and then
      aggregates into the exstat of per memcg.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      40969475
    • R
      cpuidle: governor: Add new governors to cpuidle_governors again · 8e22a1fb
      Rafael J. Wysocki 提交于
      commit 22782b3f9bb8ae21c710e2880db21bc729771e92 upstream
      
      After commit 61cb5758d3c4 ("cpuidle: Add cpuidle.governor= command
      line parameter") new cpuidle governors are not added to the list
      of available governors, so governor selection via sysfs doesn't
      work as expected (even though it is rarely used anyway).
      
      Fix that by making cpuidle_register_governor() add new governors to
      cpuidle_governors again.
      
      Fixes: 61cb5758d3c4 ("cpuidle: Add cpuidle.governor= command line parameter")
      Reported-by: NKees Cook <keescook@chromium.org>
      Cc: 5.0+ <stable@vger.kernel.org> # 5.0+
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      8e22a1fb
    • M
      kvm: x86: add host poll control msrs · 1e53713f
      Marcelo Tosatti 提交于
      commit 2d5ba19bdfef4dd06add144eb04287ee98409f75 upstream
      
      Add an MSRs which allows the guest to disable
      host polling (specifically the cpuidle-haltpoll,
      when performing polling in the guest, disables
      host side polling).
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
      1e53713f
    • M
      KVM: arm64: Opportunistically turn off WFI trapping when using direct LPI injection · ff9b66b5
      Marc Zyngier 提交于
      commit ef2e78ddadbb939ce79553b10dee0131d65d8f3e upstream.
      
      Just like we do for WFE trapping, it can be useful to turn off
      WFI trapping when the physical CPU is not oversubscribed (that
      is, the vcpu is the only runnable process on this CPU) *and*
      that we're using direct injection of interrupts.
      
      The conditions are reevaluated on each vcpu_load(), ensuring that
      we don't switch to this mode on a busy system.
      
      On a GICv4 system, this has the effect of reducing the generation
      of doorbell interrupts to zero when the right conditions are
      met, which is a huge improvement over the current situation
      (where the doorbells are screaming if the CPU ever hits a
      blocking WFI).
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Reviewed-by: NZenghui Yu <yuzenghui@huawei.com>
      Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com>
      Link: https://lore.kernel.org/r/20191107160412.30301-3-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      ff9b66b5
    • M
      KVM: vgic-v4: Track the number of VLPIs per vcpu · 27840020
      Marc Zyngier 提交于
      commit 5bd90b0989731520f2cdcfbbe467f1271f3cc803 upstream.
      
      In order to find out whether a vcpu is likely to be the target of
      VLPIs (and to further optimize the way we deal with those), let's
      track the number of VLPIs a vcpu can receive.
      
      This gets implemented with an atomic variable that gets incremented
      or decremented on map, unmap and move of a VLPI.
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Reviewed-by: NZenghui Yu <yuzenghui@huawei.com>
      Reviewed-by: NChristoffer Dall <christoffer.dall@arm.com>
      Link: https://lore.kernel.org/r/20191107160412.30301-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      27840020
    • M
      KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put · 42993070
      Marc Zyngier 提交于
      commit 8e01d9a396e6db153d94a6004e6473d9ff251a6a upstream.
      
      When the VHE code was reworked, a lot of the vgic stuff was moved around,
      but the GICv4 residency code did stay untouched, meaning that we come
      in and out of residency on each flush/sync, which is obviously suboptimal.
      
      To address this, let's move things around a bit:
      
      - Residency entry (flush) moves to vcpu_load
      - Residency exit (sync) moves to vcpu_put
      - On blocking (entry to WFI), we "put"
      - On unblocking (exit from WFI), we "load"
      
      Because these can nest (load/block/put/load/unblock/put, for example),
      we now have per-VPE tracking of the residency state.
      
      Additionally, vgic_v4_put gains a "need doorbell" parameter, which only
      gets set to true when blocking because of a WFI. This allows a finer
      control of the doorbell, which now also gets disabled as soon as
      it gets signaled.
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.orgSigned-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      42993070
    • T
      EDAC, skx: Retrieve and print retry_rd_err_log registers · a20889d9
      Tony Luck 提交于
      commit e80634a75aba90e7485cd1fdb463fcac5d45f14d upstream
      
      Skylake logs some additional useful information in per-channel
      registers in addition the the architectural status/addr/misc
      logged in the machine check bank.
      
      Pick up this information and add it to the EDAC log:
      
      	retry_rd_err_[five 32-bit register values]
      
      Sorry, no definitions for these registers. OEMs and DIMM vendors
      will be able to use them to isolate which cells in the DIMM are
      causing problems.
      
      	correrrcnt[per rank corrected error counts]
      
      Note that if additional errors are logged while these registers are
      being read, you may see a jumble of values some from earlier errors,
      others from later errors (since the registers report the most recent
      logged error). The correrrcnt registers provide error counts per possible
      rank. If these counts only change by one since the previous error logged
      for this channel, then it is safe to assume that the registers logged
      provide a coherent view of one error.
      
      With this change EDAC logs look like this:
      
      EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 -  err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: Nzhaobing <zhaobing@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      a20889d9
    • A
      tools headers uapi: Sync asm-generic/mman-common.h with the kernel · 5c1675fc
      Arnaldo Carvalho de Melo 提交于
      commit b1ba55cf1cfb9f3e0e00d743534684a25bf66d28 upstream
      
      To pick the changes from:
      
        1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
        9c276cc65a58 ("mm: introduce MADV_COLD")
      
      That result in these changes in the tools:
      
        $ tools/perf/trace/beauty/madvise_behavior.sh > before
        $ cp include/uapi/asm-generic/mman-common.h tools/include/uapi/asm-generic/mman-common.h
        $ git diff
        diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
        index 63b1f506ea67..c160a5354eb6 100644
        --- a/tools/include/uapi/asm-generic/mman-common.h
        +++ b/tools/include/uapi/asm-generic/mman-common.h
        @@ -67,6 +67,9 @@
         #define MADV_WIPEONFORK 18             /* Zero memory on fork, child only */
         #define MADV_KEEPONFORK 19             /* Undo MADV_WIPEONFORK */
      
        +#define MADV_COLD      20              /* deactivate these pages */
        +#define MADV_PAGEOUT   21              /* reclaim these pages */
        +
         /* compatibility flags */
         #define MAP_FILE       0
      
        $ tools/perf/trace/beauty/madvise_behavior.sh > after
        $ diff -u before after
        --- before	2019-09-27 11:29:43.346320100 -0300
        +++ after	2019-09-27 11:30:03.838570439 -0300
        @@ -16,6 +16,8 @@
         	[17] = "DODUMP",
         	[18] = "WIPEONFORK",
         	[19] = "KEEPONFORK",
        +	[20] = "COLD",
        +	[21] = "PAGEOUT",
         	[100] = "HWPOISON",
         	[101] = "SOFT_OFFLINE",
         };
        $
      
      I.e. now when madvise gets those behaviours as args, it will be able to
      translate from the number to a human readable string.
      
      This addresses the following perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/mman-common.h' differs from latest version at 'include/uapi/asm-generic/mman-common.h'
        diff -u tools/include/uapi/asm-generic/mman-common.h include/uapi/asm-generic/mman-common.h
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Luis Cláudio Gonçalves <lclaudio@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Link: https://lkml.kernel.org/n/tip-n40y6c4sa49p29q6sl8w3ufx@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      5c1675fc
    • Z
      mm: fix trying to reclaim unevictable lru page when calling madvise_pageout · 88ec97ec
      zhong jiang 提交于
      commit 82072962973008201b817fae1896512977dd5083 upstream
      
      Recently, I hit the following issue when running upstream.
      
        kernel BUG at mm/vmscan.c:1521!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
        RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
        Call Trace:
         reclaim_pages+0x499/0x800 mm/vmscan.c:2188
         madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
         walk_pmd_range mm/pagewalk.c:53 [inline]
         walk_pud_range mm/pagewalk.c:112 [inline]
         walk_p4d_range mm/pagewalk.c:139 [inline]
         walk_pgd_range mm/pagewalk.c:166 [inline]
         __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
         walk_page_range+0x179/0x310 mm/pagewalk.c:349
         madvise_pageout_page_range mm/madvise.c:506 [inline]
         madvise_pageout+0x1f0/0x330 mm/madvise.c:542
         madvise_vma mm/madvise.c:931 [inline]
         __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
         do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      madvise_pageout() accesses the specified range of the vma and isolates
      them, then runs shrink_page_list() to reclaim its memory.  But it also
      isolates the unevictable pages to reclaim.  Hence, we can catch the
      cases in shrink_page_list().
      
      The root cause is that we scan the page tables instead of specific LRU
      list.  and so we need to filter out the unevictable lru pages from our
      end.
      
      Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
      Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      88ec97ec
    • M
      mm: factor out common parts between MADV_COLD and MADV_PAGEOUT · b6a18a3c
      Minchan Kim 提交于
      commit d616d5126503967bf365db0711ee3c78b356efe9 upstream
      
      There are many common parts between MADV_COLD and MADV_PAGEOUT.
      This patch factor them out to save code duplication.
      
      Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: kbuild test robot <lkp@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      b6a18a3c
    • M
      mm: introduce MADV_PAGEOUT · 23757dcc
      Minchan Kim 提交于
      commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream
      
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size(e.g., 256) makes some trouble, we could fix it later by limit it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      23757dcc
    • M
      mm: introduce MADV_COLD · 1af766e8
      Minchan Kim 提交于
      commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream
      
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1af766e8
    • M
      mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM · a0747c91
      Minchan Kim 提交于
      commit 8940b34a4e082ae11498ddae8432f2ac07685d1c upstream
      
      The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
      as default.  It is for preventing to reclaim dirty pages when CMA try to
      migrate pages.  Strictly speaking, we don't need it because CMA didn't
      allow to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
      
      Moreover, it has a problem to prevent anonymous pages's swap out even
      though force_reclaim = true in shrink_page_list on upcoming patch.  So
      this patch makes references's default value to PAGEREF_RECLAIM and rename
      force_reclaim with ignore_references to make it more clear.
      
      This is a preparatory work for next patch.
      
      Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: kbuild test robot <lkp@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      a0747c91