1. 17 December 2022, 1 commit
  2. 15 June 2022, 1 commit
    • powerpc/32: Fix overread/overwrite of thread_struct via ptrace · ae7a0c91
      By Michael Ellerman
      mainline inclusion
      from mainline-v5.19-rc2
      commit 8e127844
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5C43D
      CVE: CVE-2022-32981
      
      --------------------------------
      
      The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process
      to read/write registers of another process.
      
      To get/set a register, the API takes an index into an imaginary address
      space called the "USER area", where the registers of the process are
      laid out in some fashion.
      
      The kernel then maps that index to a particular register in its own data
      structures and gets/sets the value.
      
      The API only allows a single machine-word to be read/written at a time.
      So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.
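
      For reference, a minimal user-space sketch of reading one word through
      this API (standard ptrace(2) usage; the offset value and its mapping to
      a particular register are architecture-specific and illustrative here):

        #include <errno.h>
        #include <stdio.h>
        #include <sys/ptrace.h>
        #include <sys/types.h>

        /* Read one machine word from the tracee's USER area at the given
         * offset.  -1 can be a valid word, so errno must be checked. */
        static long peek_user_word(pid_t pid, long offset)
        {
            errno = 0;
            long word = ptrace(PTRACE_PEEKUSER, pid, (void *)offset, NULL);
            if (word == -1 && errno != 0)
                perror("PTRACE_PEEKUSER");
            return word;
        }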
      
      The way floating point registers (FPRs) are addressed is somewhat
      complicated, because double precision float values are 64-bit even on
      32-bit CPUs. That means on 32-bit kernels each FPR occupies two
      word-sized locations in the USER area. On 64-bit kernels each FPR
      occupies one word-sized location in the USER area.
      
      Internally the kernel stores the FPRs in an array of u64s, or if VSX is
      enabled, an array of pairs of u64s where one half of each pair stores
      the FPR. Which half of the pair stores the FPR depends on the kernel's
      endianness.
      
      To handle the different layouts of the FPRs depending on VSX/no-VSX and
      big/little endian, the TS_FPR() macro was introduced.
      
      Unfortunately the TS_FPR() macro does not take into account the fact
      that the addressing of each FPR differs between 32-bit and 64-bit
      kernels. It just takes the index into the "USER area" passed from
      userspace and indexes into the fp_state.fpr array.
      
      On 32-bit there are 64 indexes that address FPRs, but only 32 entries in
      the fp_state.fpr array, meaning the user can read/write 256 bytes past
      the end of the array. Because the fp_state sits in the middle of the
      thread_struct there are various fields that can be overwritten,
      including some pointers. As such it may be exploitable.
      
      It has also been observed to cause systems to hang or otherwise
      misbehave when using gdbserver, and is probably the root cause of this
      report which could not be easily reproduced:
        https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@keymile.com/
      
      Rather than trying to make the TS_FPR() macro even more complicated to
      fix the bug, or add more macros, instead add a special-case for 32-bit
      kernels. This is more obvious and hopefully avoids a similar bug
      happening again in future.
      
      Note that because 32-bit kernels never have VSX enabled the code doesn't
      need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to
      ensure that 32-bit && VSX is never enabled.
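
      As a standalone sketch of the index arithmetic (illustrative, not the
      kernel diff): on 32-bit the user-supplied word index runs 0..63 and must
      be halved before indexing the 32-entry FPR array, with the word within
      each 64-bit FPR selected by the index parity.

        #include <stdint.h>
        #include <stdio.h>

        #define NFPR 32  /* fp_state.fpr has 32 entries on 32-bit */

        /* Map a USER-area word index onto the u64 FPR array.  Indexing
         * fpr[idx] directly, as the buggy TS_FPR() usage did, walks up to
         * 256 bytes past the end of the array. */
        static uint32_t fpr_word(const uint64_t fpr[NFPR], unsigned int idx)
        {
            const uint32_t *half = (const uint32_t *)&fpr[idx / 2];
            return half[idx % 2];
        }

        int main(void)
        {
            uint64_t fpr[NFPR] = { 0x1122334455667788ULL };
            printf("%08x %08x\n", (unsigned)fpr_word(fpr, 0),
                                  (unsigned)fpr_word(fpr, 1));
            return 0;
        }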
      
      Fixes: 87fec051 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
      Cc: stable@vger.kernel.org # v3.13+
      Reported-by: Ariel Miculas <ariel.miculas@belden.com>
      Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
      Conflicts:
                  arch/powerpc/kernel/ptrace/ptrace-fpu.c
                  arch/powerpc/kernel/ptrace/ptrace.c
      Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
      Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
      Reviewed-by: Liao Chang <liaochang1@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
  3. 14 April 2022, 1 commit
  4. 17 January 2022, 1 commit
    • hugetlbfs: extend the definition of hugepages parameter to support node allocation · b0750f70
      By Zhenguo Yao
      mainline inclusion
      from mainline-v5.16-rc1
      commit b5389086
      category: feature
      bugzilla: 186043
      CVE: NA
      
      --------------------------------
      
      We can specify the number of hugepages to allocate at boot.  But at
      present the hugepages are balanced across all nodes.  In some
      scenarios, we only need hugepages in one node.  For example: DPDK
      needs hugepages which are in the same node as the NIC.
      
      If DPDK needs four hugepages of 1G size in node1 and the system has
      16 NUMA nodes, we must reserve 64 hugepages on the kernel cmdline,
      but only four of them are used.  The others should be freed after
      boot.  If system memory is low (for example: 64G), this becomes an
      impossible task.
      
      So extend the hugepages parameter to support specifying hugepages on a
      specific node.  For example, add the following parameter:
      
        hugepagesz=1G hugepages=0:1,1:3
      
      It will allocate 1 hugepage in node0 and 3 hugepages in node1.
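
      A minimal parser sketch for this node:count syntax (illustrative only;
      the real parsing lives in mm/hugetlb.c and differs in detail):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            char arg[] = "0:1,1:3";  /* from hugepages=0:1,1:3 */
            for (char *pair = strtok(arg, ","); pair; pair = strtok(NULL, ","))
            {
                int node, count;
                if (sscanf(pair, "%d:%d", &node, &count) == 2)
                    printf("node %d: reserve %d hugepages\n", node, count);
            }
            return 0;
        }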
      
      Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
      Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	Documentation/admin-guide/kernel-parameters.txt
      	Documentation/admin-guide/mm/hugetlbpage.rst
      	arch/powerpc/mm/hugetlbpage.c
      	include/linux/hugetlb.h
      	mm/hugetlb.c
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  5. 29 October 2021, 3 commits
  6. 24 August 2021, 1 commit
  7. 30 July 2021, 1 commit
  8. 29 June 2021, 1 commit
  9. 22 May 2021, 1 commit
  10. 14 April 2021, 1 commit
    • mm: allow VM_FAULT_RETRY for multiple times · 9745f703
      By Peter Xu
      mainline inclusion
      from mainline-5.6
      commit 4064b982
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      The idea comes from a discussion between Linus and Andrea [1].
      
      Before this patch we only allow a page fault to retry once.  We achieved
      this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
      handle_mm_fault() the second time.  This was mainly used to avoid
      unexpected starvation of the system by looping over forever to handle the
      page fault on a single page.  However that should hardly happen, and after
      all for each code path to return a VM_FAULT_RETRY we'll first wait for a
      condition (during which time we should possibly yield the cpu) to happen
      before VM_FAULT_RETRY is really returned.
      
      This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
      flag when we receive VM_FAULT_RETRY.  It means that the page fault handler
      can now retry the page fault multiple times if necessary without the
      need to generate another page fault event.  Meanwhile we still keep the
      FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
      page fault is the first attempt or not.
      
      Then we'll have these combinations of fault flags (only considering
      ALLOW_RETRY flag and TRIED flag):
      
        - ALLOW_RETRY and !TRIED:  this means the page fault allows to
                                   retry, and this is the first try
      
        - ALLOW_RETRY and TRIED:   this means the page fault allows to
                                   retry, and this is not the first try
      
        - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
                                   to retry at all
      
        - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
      
      In existing code we have multiple places that has taken special care of
      the first condition above by checking against (fault_flags &
      FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to detect
      the first retry of a page fault by checking against both (fault_flags &
      FAULT_FLAG_ALLOW_RETRY) and !(fault_flags & FAULT_FLAG_TRIED) because now
      even the 2nd try will have the ALLOW_RETRY set, then use that helper in
      all existing special paths.  One example is in __lock_page_or_retry(), now
      we'll drop the mmap_sem only in the first attempt of page fault and we'll
      keep it in follow up retries, so old locking behavior will be retained.
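
      The helper reads roughly as follows (a sketch based on this
      description; upstream names it fault_flag_allow_retry_first(), and the
      flag values below are stand-ins, not the kernel's bit assignments):

        #include <stdbool.h>

        #define FAULT_FLAG_ALLOW_RETRY 0x04  /* stand-in value */
        #define FAULT_FLAG_TRIED       0x20  /* stand-in value */

        /* True only on the first attempt: retries keep ALLOW_RETRY set but
         * also carry TRIED, so both bits must be checked. */
        static inline bool fault_flag_allow_retry_first(unsigned int flags)
        {
            return (flags & FAULT_FLAG_ALLOW_RETRY) &&
                   !(flags & FAULT_FLAG_TRIED);
        }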
      
      This will be a nice enhancement for current code [2] at the same time a
      supporting material for the future userfaultfd-writeprotect work, since in
      that work there will always be an explicit userfault writeprotect retry
      for protected pages, and if that cannot resolve the page fault (e.g., when
      userfaultfd-writeprotect is used in conjunction with swapped pages) then
      we'll possibly need a 3rd retry of the page fault.  It might also benefit
      other potential users who will have similar requirement like userfault
      write-protection.
      
      GUP code is not touched yet and will be covered in follow up patch.
      
      Please read the thread below for more information.
      
      [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
      [2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
      
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
       Conflicts:
      	arch/arc/mm/fault.c
      	arch/arm64/mm/fault.c
      	arch/x86/mm/fault.c
      	drivers/gpu/drm/ttm/ttm_bo_vm.c
      	include/linux/mm.h
      	mm/internal.h
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
  11. 22 February 2021, 3 commits
    • powerpc: fix a compiling error for 'access_ok' · 10ea25c2
      By Bixuan Cui
      hulk inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      An error is reported when compiling for the powerpc platform,
      because VERIFY_WRITE has already been removed there.
      
          ./arch/powerpc/include/asm/uaccess.h: In function ‘clear_user’:
          ./arch/powerpc/include/asm/uaccess.h:446:48: error: macro "access_ok" passed 3 arguments, but takes just 2
            if (likely(access_ok(VERIFY_WRITE, addr, size))) {
                                                          ^
          In file included from ./include/asm-generic/div64.h:25:0,
                           from ./arch/powerpc/include/generated/asm/div64.h:1,
                           from ./include/linux/math64.h:6,
                           from ./include/linux/time64.h:5,
                           from ./include/linux/compat_time.h:6,
                           from ./include/linux/compat.h:10,
                           from arch/powerpc/kernel/asm-offsets.c:16:
          ./arch/powerpc/include/asm/uaccess.h:446:13: error: ‘access_ok’ undeclared (first use in this function)
            if (likely(access_ok(VERIFY_WRITE, addr, size))) {
                       ^
          ./include/linux/compiler.h:76:40: note: in definition of macro ‘likely’
           # define likely(x) __builtin_expect(!!(x), 1)
                                                  ^
          ./arch/powerpc/include/asm/uaccess.h:446:13: note: each undeclared identifier is reported only once for each function it appears in
            if (likely(access_ok(VERIFY_WRITE, addr, size))) {
                       ^
          ./include/linux/compiler.h:76:40: note: in definition of macro ‘likely’
           # define likely(x) __builtin_expect(!!(x), 1)
                                                  ^
          Kbuild:56: recipe for target 'arch/powerpc/kernel/asm-offsets.s' failed
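
      The shape of the fix (a hedged sketch, not the exact hulk diff) is to
      drop the VERIFY_WRITE argument, since the modern access_ok() takes two
      arguments:

        /* before: three-argument form, no longer valid */
        if (likely(access_ok(VERIFY_WRITE, addr, size))) {

        /* after: two-argument form */
        if (likely(access_ok(addr, size))) {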
      
      Fixes: 837baab68b87 ("powerpc: Add a framework for user access tracking")
      Fixes: a10f8b4fe993 ("powerpc: Implement user_access_begin and friends")
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
    • mmap: fix a compiling error for 'MAP_CHECKNODE' · 9ced0cc2
      By Bixuan Cui
      hulk inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      MAP_CHECKNODE was defined in uapi/asm-generic/mman.h, which is not
      automatically included by mm/mmap.c when building on platforms such
      as mips, resulting in the following compile error:
      
          mm/mmap.c: In function ‘__do_mmap’:
          mm/mmap.c:1581:14: error: ‘MAP_CHECKNODE’ undeclared (first use in this function)
            if (flags & MAP_CHECKNODE)
                        ^
          mm/mmap.c:1581:14: note: each undeclared identifier is reported only once for each function it appears in
          scripts/Makefile.build:303: recipe for target 'mm/mmap.o' failed
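
      One common shape for this kind of fix (an assumption here, since the
      hunk itself isn't shown) is to make the flag visible from a header that
      mm/mmap.c pulls in on every architecture:

        /* in a commonly-included mman header (value is a stand-in): */
        #ifndef MAP_CHECKNODE
        #define MAP_CHECKNODE 0x0400
        #endif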
      
      Fixes: cdccf4d4b7b5 ("arm64/ascend: mm: Add MAP_CHECKNODE flag to check node hugetlb")
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
    • mmap: fix a compiling error for 'MAP_PA32BIT' · b9d00721
      By Zhengyuan Liu
      hulk inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      MAP_PA32BIT was defined in uapi/asm-generic/mman.h, which is not
      automatically included by mm/mmap.c when building on platforms such
      as mips, resulting in the following compile error:
      
      	mm/mmap.c: In function ‘do_mmap’:
      	mm/mmap.c:1450:14: error: ‘MAP_PA32BIT’ undeclared (first use in this function); did you mean ‘MAP_32BIT’?
      	  if (flags & MAP_PA32BIT)
      	              ^~~~~~~~~~~
      	              MAP_32BIT
      	mm/mmap.c:1450:14: note: each undeclared identifier is reported only once for each function it appears in
      	make[1]: *** [scripts/Makefile.build:303: mm/mmap.o] Error 1
      
      Fixes: e422eca7f9c5 ("svm: add support for allocing memory which is within 4G physical address in svm_mmap")
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@tj.kylinos.cn>
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
  12. 08 January 2021, 1 commit
    • powerpc/rtas: Restrict RTAS requests from userspace · c050e20b
      By Andrew Donnellan
      stable inclusion
      from linux-4.19.155
      commit 94e8f0bbc475228c93d28b2e0f7e37303db80ffe
      CVE: CVE-2020-27777
      
      --------------------------------
      
      commit bd59380c upstream.
      
      A number of userspace utilities depend on making calls to RTAS to retrieve
      information and update various things.
      
      The existing API through which we expose RTAS to userspace exposes more
      RTAS functionality than we actually need, through the sys_rtas syscall,
      which allows root (or anyone with CAP_SYS_ADMIN) to make any RTAS call they
      want with arbitrary arguments.
      
      Many RTAS calls take the address of a buffer as an argument, and it's
      up to the caller to specify the physical address of that buffer. We
      allocate a buffer (the "RMO buffer") in the Real Memory Area that RTAS can
      access, and then expose the physical address and size of this buffer in
      /proc/powerpc/rtas/rmo_buffer. Userspace is expected to read this address,
      poke at the buffer using /dev/mem, and pass an address in the RMO buffer to
      the RTAS call.
      
      However, there's nothing stopping the caller from specifying whatever
      address they want in the RTAS call, and it's easy to construct a series of
      RTAS calls that can overwrite arbitrary bytes (even without /dev/mem
      access).
      
      Additionally, there are some RTAS calls that do potentially dangerous
      things and for which there are no legitimate userspace use cases.
      
      In the past, this would not have been a particularly big deal as it was
      assumed that root could modify all system state freely, but with Secure
      Boot and lockdown we need to care about this.
      
      We can't fundamentally change the ABI at this point, however we can address
      this by implementing a filter that checks RTAS calls against a list
      of permitted calls and forces the caller to use addresses within the RMO
      buffer.
      
      The list is based off the list of calls that are used by the librtas
      userspace library, and has been tested with a number of existing userspace
      RTAS utilities. For compatibility with any applications we are not aware of
      that require other calls, the filter can be turned off at build time.
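
      A condensed sketch of the filtering idea (structure and entries are
      illustrative; the real table and checks live in
      arch/powerpc/kernel/rtas.c):

        #include <stdbool.h>
        #include <stddef.h>
        #include <string.h>

        struct rtas_filter {
            const char *name;  /* RTAS call name */
            int buf_idx;       /* argument holding a buffer address, -1 if none */
        };

        /* allow-list derived from the calls librtas actually uses */
        static const struct rtas_filter allowed[] = {
            { "get-time-of-day", -1 },
            { "ibm,get-system-parameter", 1 },
        };

        static bool rtas_call_allowed(const char *name)
        {
            for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
                if (strcmp(allowed[i].name, name) == 0)
                    return true;  /* buffer args must also sit in the RMO buffer */
            return false;
        }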
      
      Cc: stable@vger.kernel.org
      Reported-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200820044512.7543-1-ajd@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  13. 24 November 2020, 7 commits
  14. 22 September 2020, 6 commits
    • sched/core: Fix illegal RCU from offline CPUs · 6cf457d1
      By Peter Zijlstra
      stable inclusion
      from linux-4.19.129
      commit 373491f1f41896241864b527b584856d8a510946
      
      --------------------------------
      
      [ Upstream commit bf2c59fc ]
      
      In the CPU-offline process, it calls mmdrop() after idle entry and the
      subsequent call to cpuhp_report_idle_dead(). Once execution passes the
      call to rcu_report_dead(), RCU is ignoring the CPU, which results in
      lockdep complaining when mmdrop() uses RCU from either memcg or
      debugobjects below.
      
      Fix it by cleaning up the active_mm state from BP instead. Every arch
      which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
      from AP. The only exception is parisc because it switches them to
      &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
      but the patch will still work there because it calls mmgrab(&init_mm) in
      smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
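
      A hedged reconstruction of the BP-side cleanup this describes
      (simplified from the upstream change to kernel/sched/core.c):

        /* Runs on a surviving CPU once the dead CPU is fully offline, so
         * the mmdrop() happens in a context where RCU is still watching. */
        int finish_cpu(unsigned int cpu)
        {
            struct task_struct *idle = idle_task(cpu);
            struct mm_struct *mm = idle->active_mm;

            if (mm != &init_mm)
                idle->active_mm = &init_mm;
            mmdrop(mm);
            return 0;
        }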
      
        WARNING: suspicious RCU usage
        -----------------------------
        kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
      
        other info that might help us debug this:
      
        RCU used illegally from offline CPU!
        Call Trace:
         dump_stack+0xf4/0x164 (unreliable)
         lockdep_rcu_suspicious+0x140/0x164
         get_work_pool+0x110/0x150
         __queue_work+0x1bc/0xca0
         queue_work_on+0x114/0x120
         css_release+0x9c/0xc0
         percpu_ref_put_many+0x204/0x230
         free_pcp_prepare+0x264/0x570
         free_unref_page+0x38/0xf0
         __mmdrop+0x21c/0x2c0
         idle_task_exit+0x170/0x1b0
         pnv_smp_cpu_kill_self+0x38/0x2e0
         cpu_die+0x48/0x64
         arch_cpu_idle_dead+0x30/0x50
         do_idle+0x2f4/0x470
         cpu_startup_entry+0x38/0x40
         start_secondary+0x7a8/0xa80
         start_secondary_resume+0x10/0x14
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memory_hotplug: shrink zones when offlining memory · 8de9a7ef
      By David Hildenbrand
      stable inclusion
      from linux-4.19.100
      commit 86834898d5a5e5aef9ae6d285201f2d99a4eb300
      
      --------------------------------
      
      commit feee6b29 upstream.
      
      -- snip --
      
      - Missing arm64 hot(un)plug support
      - Missing some vmem_altmap_offset() cleanups
      - Missing sub-section hotadd support
      - Missing unification of mm/hmm.c and kernel/memremap.c
      
      -- snip --
      
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone() for that.  We now
      properly shrink the zones, even if we have DIMMs whereby
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks that fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
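
      The new helper's interface, per this description (prototype only; the
      exact signature may differ across stable branches):

        /* shrink the zone's spanned range while offlining or after a failed
         * online, instead of during hot-remove */
        void remove_pfn_range_from_zone(struct zone *zone,
                                        unsigned long start_pfn,
                                        unsigned long nr_pages);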
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      [yyl: drop the zone parameter in arch/arm64/mm/mmu.c]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memory_hotplug: allow arch_remove_memory() without CONFIG_MEMORY_HOTREMOVE · 34b64663
      By David Hildenbrand
      stable inclusion
      from linux-4.19.100
      commit 000a1d59cfe9d6e875462ed72de32770322c282b
      
      --------------------------------
      
      commit 80ec922d upstream.
      
      -- snip --
      
      Missing arm64 memory hot(un)plug support.
      
      -- snip --
      
      We want to improve error handling while adding memory by allowing to use
      arch_remove_memory() and __remove_pages() even if
      CONFIG_MEMORY_HOTREMOVE is not set to e.g., implement something like:
      
      	arch_add_memory()
      	rc = do_something();
      	if (rc) {
      		arch_remove_memory();
      	}
      
      We won't get rid of CONFIG_MEMORY_HOTREMOVE for now, as it will require
      quite some dependencies for memory offlining.
      
      Link: http://lkml.kernel.org/r/20190527111152.16324-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "mike.travis@hpe.com" <mike.travis@hpe.com>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      [yyl: remove CONFIG_MEMORY_HOTREMOVE in arch/arm64/mm/mmu.c]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/memory_hotplug: make __remove_pages() and arch_remove_memory() never fail · a860e45f
      By David Hildenbrand
      stable inclusion
      from linux-4.19.100
      commit 5163b1ec3a0c3a2e1e53b7794b64866cd6ba8697
      
      --------------------------------
      
      commit ac5c9426 upstream.
      
      -- snip --
      
      Minor conflict in arch/powerpc/mm/mem.c
      
      -- snip --
      
      All callers of arch_remove_memory() ignore errors.  And we should really
      try to remove any errors from the memory removal path.  No more errors are
      reported from __remove_pages().  BUG() in s390x code in case
      arch_remove_memory() is triggered.  We may implement that properly later.
      WARN in case powerpc code failed to remove the section mapping, which is
      better than ignoring the error completely right now.
      
      Link: http://lkml.kernel.org/r/20190409100148.24703-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Stefan Agner <stefan@agner.ch>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Andrew Banman <andrew.banman@hpe.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mike Travis <mike.travis@hpe.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • powerpc/mm: Fix section mismatch warning · 7c9ee508
      By Aneesh Kumar K.V
      stable inclusion
      from linux-4.19.100
      commit 58ddf0b0eff2a6cb536082fc6b046f5eb51c240c
      
      --------------------------------
      
      commit 26ad2671 upstream.
      
      This patch fix the below section mismatch warnings.
      
      WARNING: vmlinux.o(.text+0x2d1f44): Section mismatch in reference from the function devm_memremap_pages_release() to the function .meminit.text:arch_remove_memory()
      WARNING: vmlinux.o(.text+0x2d265c): Section mismatch in reference from the function devm_memremap_pages() to the function .meminit.text:arch_add_memory()
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm, memory_hotplug: add nid parameter to arch_remove_memory · 1ef436b7
      By Oscar Salvador
      stable inclusion
      from linux-4.19.100
      commit 5c1f8f5358e8cd501245ee0e954dc0c0b231d6a2
      
      --------------------------------
      
      commit 2c2a5af6 upstream.
      
      -- snip --
      
      Missing unification of mm/hmm.c and kernel/memremap.c
      
      -- snip --
      
      Patch series "Do not touch pages in hot-remove path", v2.
      
      This patchset aims for two things:
      
       1) A better definition about offline and hot-remove stage
       2) Solving bugs where we can access non-initialized pages
          during hot-remove operations [2] [3].
      
      This is achieved by moving all page/zone handling to the offline
      stage, so we do not need to access pages when hot-removing memory.
      
      [1] https://patchwork.kernel.org/cover/10691415/
      [2] https://patchwork.kernel.org/patch/10547445/
      [3] https://www.spinics.net/lists/linux-mm/msg161316.html
      
      This patch (of 5):
      
      This is a preparation for the following-up patches.  The idea of passing
      the nid is that it will allow us to get rid of the zone parameter
      afterwards.
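
      The resulting signature change, per this description (prototype only;
      details may differ across stable branches):

        /* was: arch_remove_memory(u64 start, u64 size,
         *                         struct vmem_altmap *altmap) */
        int arch_remove_memory(int nid, u64 start, u64 size,
                               struct vmem_altmap *altmap);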
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  15. 31 August 2020, 1 commit
  16. 22 April 2020, 1 commit
  17. 05 March 2020, 9 commits
    • powerpc/spinlocks: Include correct header for static key · 8d12e2e5
      By Jason A. Donenfeld
      commit 6da3eced upstream.
      
      Recently, the spinlock implementation grew a static key optimization,
      but the jump_label.h header include was left out, leading to build
      errors:
      
        linux/arch/powerpc/include/asm/spinlock.h:44:7: error: implicit declaration of function ‘static_branch_unlikely’
         44 |  if (!static_branch_unlikely(&shared_processor))
      
      This commit adds the missing header.
      
      mpe: The build break is only seen with CONFIG_JUMP_LABEL=n.
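
      The fix amounts to one include in arch/powerpc/include/asm/spinlock.h
      (a hedged reconstruction of the diff):

        #include <linux/jump_label.h>  /* declares static_branch_unlikely() */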
      
      Fixes: 656c21d6 ("powerpc/shared: Use static key to detect shared processor")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20191223133147.129983-1-Jason@zx2c4.com
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • powerpc/vcpu: Assume dedicated processors as non-preempt · e181693f
      By Srikar Dronamraju
      commit 14c73bd3 upstream.
      
      With commit 247f2f6f ("sched/core: Don't schedule threads on
      pre-empted vCPUs"), the scheduler avoids preempted vCPUs to schedule
      tasks on wakeup. This leads to wrong choice of CPU, which in-turn
      leads to larger wakeup latencies. Eventually, it leads to performance
      regression in latency sensitive benchmarks like soltp, schbench etc.
      
      On Powerpc, vcpu_is_preempted() only looks at yield_count. If the
      yield_count is odd, the vCPU is assumed to be preempted. However
      yield_count is increased whenever the LPAR enters CEDE state (idle).
      So any CPU that has entered CEDE state is assumed to be preempted.
      
      Even if the vCPU of a dedicated LPAR is preempted/donated, it should
      have the right of first use, since it is supposed to own the vCPU.
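
      A standalone illustration of the resulting check (the static key is
      modeled as a plain bool; names stand in for the kernel's
      shared_processor key and the lppaca yield_count):

        #include <stdbool.h>
        #include <stdio.h>

        static bool shared_processor;  /* stand-in for the static key */

        static bool vcpu_is_preempted(unsigned int yield_count)
        {
            if (!shared_processor)
                return false;        /* dedicated vCPU: never "preempted" */
            return yield_count & 1;  /* odd => LPAR in CEDE or preempted */
        }

        int main(void)
        {
            printf("dedicated: %d\n", vcpu_is_preempted(3));  /* 0 */
            shared_processor = true;
            printf("shared:    %d\n", vcpu_is_preempted(3));  /* 1 */
            return 0;
        }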
      
      On a Power9 System with 32 cores:
        # lscpu
        Architecture:        ppc64le
        Byte Order:          Little Endian
        CPU(s):              128
        On-line CPU(s) list: 0-127
        Thread(s) per core:  8
        Core(s) per socket:  1
        Socket(s):           16
        NUMA node(s):        2
        Model:               2.2 (pvr 004e 0202)
        Model name:          POWER9 (architected), altivec supported
        Hypervisor vendor:   pHyp
        Virtualization type: para
        L1d cache:           32K
        L1i cache:           32K
        L2 cache:            512K
        L3 cache:            10240K
        NUMA node0 CPU(s):   0-63
        NUMA node1 CPU(s):   64-127
      
        # perf stat -a -r 5 ./schbench
        v5.4                               v5.4 + patch
        Latency percentiles (usec)         Latency percentiles (usec)
              50.0000th: 45                      50.0th: 45
              75.0000th: 62                      75.0th: 63
              90.0000th: 71                      90.0th: 74
              95.0000th: 77                      95.0th: 78
              *99.0000th: 91                     *99.0th: 82
              99.5000th: 707                     99.5th: 83
              99.9000th: 6920                    99.9th: 86
              min=0, max=10048                   min=0, max=96
        Latency percentiles (usec)         Latency percentiles (usec)
              50.0000th: 45                      50.0th: 46
              75.0000th: 61                      75.0th: 64
              90.0000th: 72                      90.0th: 75
              95.0000th: 79                      95.0th: 79
              *99.0000th: 691                    *99.0th: 83
              99.5000th: 3972                    99.5th: 85
              99.9000th: 8368                    99.9th: 91
              min=0, max=16606                   min=0, max=117
        Latency percentiles (usec)         Latency percentiles (usec)
              50.0000th: 45                      50.0th: 46
              75.0000th: 61                      75.0th: 64
              90.0000th: 71                      90.0th: 75
              95.0000th: 77                      95.0th: 79
              *99.0000th: 106                    *99.0th: 83
              99.5000th: 2364                    99.5th: 84
              99.9000th: 7480                    99.9th: 90
              min=0, max=10001                   min=0, max=95
        Latency percentiles (usec)         Latency percentiles (usec)
              50.0000th: 45                      50.0th: 47
              75.0000th: 62                      75.0th: 65
              90.0000th: 72                      90.0th: 75
              95.0000th: 78                      95.0th: 79
              *99.0000th: 93                     *99.0th: 84
              99.5000th: 108                     99.5th: 85
              99.9000th: 6792                    99.9th: 90
              min=0, max=17681                   min=0, max=117
        Latency percentiles (usec)         Latency percentiles (usec)
              50.0000th: 46                      50.0th: 45
              75.0000th: 62                      75.0th: 64
              90.0000th: 73                      90.0th: 75
              95.0000th: 79                      95.0th: 79
              *99.0000th: 113                    *99.0th: 82
              99.5000th: 2724                    99.5th: 83
              99.9000th: 6184                    99.9th: 93
              min=0, max=9887                    min=0, max=111
      
         Performance counter stats for 'system wide' (5 runs):
      
        context-switches    43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
        cpu-migrations       1,211  ( +-  5.04% )      220 ( +-  6.23% )
        page-faults         15,983  ( +-  5.21% )   15,360 ( +-  3.38% )
      
      Waiman Long suggested using static_keys.
      
      Fixes: 247f2f6f ("sched/core: Don't schedule threads on pre-empted vCPUs")
      Cc: stable@vger.kernel.org # v4.18+
      Reported-by: Parth Shah <parth@linux.ibm.com>
      Reported-by: Ihor Pasichnyk <Ihor.Pasichnyk@ibm.com>
      Tested-by: Juri Lelli <juri.lelli@redhat.com>
      Acked-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
      Tested-by: Parth Shah <parth@linux.ibm.com>
      [mpe: Move the key and setting of the key to pseries/setup.c]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20191213035036.6913-1-mpe@ellerman.id.au
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • powerpc: Ensure that swiotlb buffer is allocated from low memory · 3cab13df
      By Mike Rapoport
      [ Upstream commit 8fabc623 ]
      
      Some powerpc platforms (e.g. 85xx) limit DMA-able memory way below 4G.
      If a system has more physical memory than this limit, the swiotlb
      buffer is not addressable because it is allocated from memblock using
      top-down mode.
      
      Force memblock to bottom-up mode before calling swiotlb_init() to
      ensure that the swiotlb buffer is DMA-able.
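
      A hedged reconstruction of the fix's shape (in arch/powerpc/mm/mem.c;
      surrounding context omitted):

        /* swiotlb allocates from memblock; top-down allocation can land the
         * buffer above the platform's DMA limit, so flip to bottom-up. */
        memblock_set_bottom_up(true);
        swiotlb_init(0);
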
      Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20191204123524.22919-1-rppt@kernel.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag · fe380374
      By Michael Roth
      [ Upstream commit 3a83f677 ]
      
      On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
      of memory running the following guest configs:
      
        guest A:
          - 224GB of memory
          - 56 VCPUs (sockets=1,cores=28,threads=2), where:
            VCPUs 0-1 are pinned to CPUs 0-3,
            VCPUs 2-3 are pinned to CPUs 4-7,
            ...
            VCPUs 54-55 are pinned to CPUs 108-111
      
        guest B:
          - 4GB of memory
          - 4 VCPUs (sockets=1,cores=4,threads=1)
      
      with the following workloads (with KSM and THP enabled in all):
      
        guest A:
          stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
      
        guest B:
          stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
      
        host:
          stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
      
      the below soft-lockup traces were observed after an hour or so and
      persisted until the host was reset (this was found to be reliably
      reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
      and 5.3-rc5):
      
        [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1253.183319] rcu:     124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
        [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
        [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
        [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
        [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
        [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
        [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1316.198032] rcu:     124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
        [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1379.212629] rcu:     124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
        [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1442.227115] rcu:     124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
        [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
        [ 1455.111822]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
        [ 1455.111905]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
        [ 1455.111986]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
        [ 1455.112068]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
        [ 1455.112159]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
        [ 1455.112231]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
        [ 1455.112303]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
        [ 1455.112392]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
      
      CPUs 45, 24, and 124 are stuck on spin locks, likely held by
      CPUs 105 and 31.
      
      CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
      target CPU 42. For instance:
      
        # CPU 105 registers (via xmon)
        R00 = c00000000020b20c   R16 = 00007d1bcd800000
        R01 = c00000363eaa7970   R17 = 0000000000000001
        R02 = c0000000019b3a00   R18 = 000000000000006b
        R03 = 000000000000002a   R19 = 00007d537d7aecf0
        R04 = 000000000000002a   R20 = 60000000000000e0
        R05 = 000000000000002a   R21 = 0801000000000080
        R06 = c0002073fb0caa08   R22 = 0000000000000d60
        R07 = c0000000019ddd78   R23 = 0000000000000001
        R08 = 000000000000002a   R24 = c00000000147a700
        R09 = 0000000000000001   R25 = c0002073fb0ca908
        R10 = c000008ffeb4e660   R26 = 0000000000000000
        R11 = c0002073fb0ca900   R27 = c0000000019e2464
        R12 = c000000000050790   R28 = c0000000000812b0
        R13 = c000207fff623e00   R29 = c0002073fb0ca808
        R14 = 00007d1bbee00000   R30 = c0002073fb0ca800
        R15 = 00007d1bcd600000   R31 = 0000000000000800
        pc  = c00000000020b260 smp_call_function_many+0x3d0/0x460
        cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
        lr  = c00000000020b20c smp_call_function_many+0x37c/0x460
        msr = 900000010288b033   cr  = 44024824
        ctr = c000000000050790   xer = 0000000000000000   trap =  100
      
      CPU 42 is running normally, doing VCPU work:
      
        # CPU 42 stack trace (via xmon)
        [link register   ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
        [c000008ed3343820] c000008ed3343850 (unreliable)
        [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
        [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
        [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
        [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
        [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
        [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
        [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
        [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
        [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
        --- Exception: c00 (System Call) at 00007d545cfd7694
        SP (7d53ff7edf50) is in userspace
      
      It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
      was set for CPU 42 by at least 1 of the CPUs waiting in
      smp_call_function_many(), but somehow the corresponding
      call_single_queue entries were never processed by CPU 42, causing the
      callers to spin in csd_lock_wait() indefinitely.
      
      Nick Piggin suggested something similar to the following sequence as
      a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
      IPI messages seems to be most common, but any mix of CALL_FUNCTION and
      !CALL_FUNCTION messages could trigger it):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
          105: smp_muxed_ipi_set_message():
      105:   smp_mb()
               // STORE DEFERRED DUE TO RE-ORDERING
        --105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |  42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        |  42: // returns to executing guest
        |      // RE-ORDERED STORE COMPLETES
        ->105:   message[CALL_FUNCTION] = 1
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      This can be prevented with an smp_mb() at the beginning of
      kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
      state indicated by the host_ipi flag) are ordered vs. the store to
      host_ipi.
      
      However, doing so might still allow for the following scenario (not
      yet observed):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
               // STORE DEFERRED DUE TO RE-ORDERING
        -- 42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        | 105: smp_muxed_ipi_set_message():
  | 105:   smp_mb()
        | 105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |      // RE-ORDERED STORE COMPLETES
        -> 42:   kvmppc_set_host_ipi(42, 0)
           42: // returns to executing guest
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      Fixing this scenario would require an smp_mb() *after* clearing
      host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
      subsequent processing of IPI messages.
      
      To handle both cases, this patch splits kvmppc_set_host_ipi() into
      separate set/clear functions, where we execute smp_mb() prior to
      setting host_ipi flag, and after clearing host_ipi flag. These
      functions pair with each other to synchronize the sender and receiver
      sides.
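
      A hedged reconstruction of the described split (simplified; the real
      helpers live in the powerpc KVM headers):

        static inline void kvmppc_set_host_ipi(int cpu)
        {
            /* order stores of IPI messages vs. setting of host_ipi */
            smp_mb();
            paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
        }

        static inline void kvmppc_clear_host_ipi(int cpu)
        {
            paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
            /* order clearing of host_ipi vs. later message processing */
            smp_mb();
        }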
      
      With that change in place the above workload ran for 20 hours without
      triggering any lock-ups.
      
      Fixes: 755563bc ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
      Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • powerpc/pseries/hvconsole: Fix stack overread via udbg · a3284606
      By Daniel Axtens
      [ Upstream commit 934bda59 ]
      
      While developing KASAN for 64-bit book3s, I hit the following stack
      over-read.
      
      It occurs because the hypercall to put characters onto the terminal
      takes 2 longs (128 bits/16 bytes) of characters at a time, and so
      hvc_put_chars() would unconditionally copy 16 bytes from the argument
       buffer, regardless of the supplied length. However, udbg_hvc_putc()
       can call hvc_put_chars() with a single-byte buffer, leading to the
       over-read shown below.
      
        ==================================================================
        BUG: KASAN: stack-out-of-bounds in hvc_put_chars+0xdc/0x110
        Read of size 8 at addr c0000000023e7a90 by task swapper/0
      
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-rc2-next-20190528-02824-g048a6ab4835b #113
        Call Trace:
          dump_stack+0x104/0x154 (unreliable)
          print_address_description+0xa0/0x30c
          __kasan_report+0x20c/0x224
          kasan_report+0x18/0x30
          __asan_report_load8_noabort+0x24/0x40
          hvc_put_chars+0xdc/0x110
          hvterm_raw_put_chars+0x9c/0x110
          udbg_hvc_putc+0x154/0x200
          udbg_write+0xf0/0x240
          console_unlock+0x868/0xd30
          register_console+0x970/0xe90
          register_early_udbg_console+0xf8/0x114
          setup_arch+0x108/0x790
          start_kernel+0x104/0x784
          start_here_common+0x1c/0x534
      
        Memory state around the buggy address:
         c0000000023e7980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         c0000000023e7a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
        >c0000000023e7a80: f1 f1 01 f2 f2 f2 00 00 00 00 00 00 00 00 00 00
                                 ^
         c0000000023e7b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         c0000000023e7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
      
       Document that a 16-byte buffer is required, and provide one in udbg.
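
       As a sketch of that approach (a simplified, hypothetical form of
       udbg_hvc_putc(); the real function in drivers/tty/hvc/hvc_vio.c
       also handles the other hvterm protocols):

         static void udbg_hvc_putc(char c)
         {
                 /*
                  * hvc_put_chars() always reads 2 longs (16 bytes), so
                  * never hand it a smaller buffer: bounce the single
                  * character through a zeroed 16-byte buffer instead.
                  */
                 char bounce_buffer[16] = { 0 };

                 bounce_buffer[0] = c;
                 hvterm_raw_put_chars(0, bounce_buffer, 1);
         }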
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      a3284606
    • G
      Revert "powerpc/vcpu: Assume dedicated processors as non-preempt" · f6ab004e
       Greg Kroah-Hartman committed
       This reverts commit 4ba32bdbd8c66d9c7822aea8dcf4e51410df84a8, which
       is commit 14c73bd3 upstream.
      
      It breaks the build.
      
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Parth Shah <parth@linux.ibm.com>
      Cc: Ihor Pasichnyk <Ihor.Pasichnyk@ibm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f6ab004e
    • M
      libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h · b409b11b
       Masahiro Yamada committed
      [ Upstream commit a8de1304 ]
      
       DTC v1.5.1 added references to (U)INT32_MAX.
      
      This is no problem for user-space programs since <stdint.h> defines
      (U)INT32_MAX along with (u)int32_t.
      
      For the kernel space, libfdt_env.h needs to be adjusted before we
      pull in the changes.
      
      In the kernel, we usually use s/u32 instead of (u)int32_t for the
      fixed-width types.
      
      Accordingly, we already have S/U32_MAX for their max values.
      So, we should not add (U)INT32_MAX to <linux/limits.h> any more.
      
      Instead, add them to the in-kernel libfdt_env.h to compile the
      latest libfdt.
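
       The addition amounts to something like the following in the
       in-kernel libfdt_env.h (a sketch; exact placement may differ):

         #include <linux/limits.h>

         /*
          * DTC v1.5.1 references these; map them to the kernel's
          * existing fixed-width maximums rather than extending
          * <linux/limits.h>.
          */
         #define INT32_MAX   S32_MAX
         #define UINT32_MAX  U32_MAX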
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NRob Herring <robh@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b409b11b
    • N
      powerpc: Don't add -mabi= flags when building with Clang · 4f0810c4
       Nathan Chancellor committed
      [ Upstream commit 465bfd9c ]
      
       When building pseries_defconfig, the vdso32 build errors out:
      
        error: unknown target ABI 'elfv1'
      
      This happens because -m32 in clang changes the target to 32-bit,
      which does not allow the ABI to be changed.
      
      Commit 4dc831aa ("powerpc: Fix compiling a BE kernel with a
      powerpc64le toolchain") added these flags to fix building big endian
      kernels with a little endian GCC.
      
      Clang doesn't need -mabi because the target triple controls the
      default value. -mlittle-endian and -mbig-endian manipulate the triple
      into either powerpc64-* or powerpc64le-*, which properly sets the
      default ABI.
      
       Adding a debug printout in clang's PPC64TargetInfo constructor
       shows this:
      
        $ echo | ./clang -E --target=powerpc64-linux -mbig-endian -o /dev/null -
        Default ABI: elfv1
      
        $ echo | ./clang -E --target=powerpc64-linux -mlittle-endian -o /dev/null -
        Default ABI: elfv2
      
        $ echo | ./clang -E --target=powerpc64le-linux -mbig-endian -o /dev/null -
        Default ABI: elfv1
      
        $ echo | ./clang -E --target=powerpc64le-linux -mlittle-endian -o /dev/null -
        Default ABI: elfv2
      
       Don't specify -mabi when building with clang, to avoid the build
       error with -m32; this does not change any code generation.
      
       -mcall-aixdesc is not an implemented flag in clang, so it can
       safely be excluded as well; see commit 238abecd ("powerpc: Don't
       use gcc specific options on clang").
      
      pseries_defconfig successfully builds after this patch and
      powernv_defconfig and ppc44x_defconfig don't regress.
      Reviewed-by: NDaniel Axtens <dja@axtens.net>
      Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
      [mpe: Trim clang links in change log]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
       Link: https://lore.kernel.org/r/20191119045712.39633-2-natechancellor@gmail.com
       Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      4f0810c4
    • G
       powerpc/security: Fix wrong message when RFI Flush is disabled · 900553fb
       Gustavo L. F. Walbon committed
      [ Upstream commit 4e706af3 ]
      
       The issue was that sysfs showed the "Mitigation" message regardless
       of the state of "RFI Flush", when it should show "Vulnerable"
       whenever RFI Flush is disabled.
       
       If the "L1D private" feature is enabled but "RFI Flush" is not, the
       system is still vulnerable to Meltdown attacks.
       
       "RFI Flush" is the key feature for mitigating Meltdown, whatever
       the "L1D private" state.
      
      SEC_FTR_L1D_THREAD_PRIV is a feature for Power9 only.
      
      So the message should be as the truth table shows:
      
        CPU | L1D private | RFI Flush |                sysfs
        ----|-------------|-----------|-------------------------------------
         P9 |    False    |   False   | Vulnerable
         P9 |    False    |   True    | Mitigation: RFI Flush
         P9 |    True     |   False   | Vulnerable: L1D private per thread
         P9 |    True     |   True    | Mitigation: RFI Flush, L1D private per thread
         P8 |    False    |   False   | Vulnerable
         P8 |    False    |   True    | Mitigation: RFI Flush
      
      Output before this fix:
        # cat /sys/devices/system/cpu/vulnerabilities/meltdown
        Mitigation: RFI Flush, L1D private per thread
        # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
        # cat /sys/devices/system/cpu/vulnerabilities/meltdown
        Mitigation: L1D private per thread
      
      Output after fix:
        # cat /sys/devices/system/cpu/vulnerabilities/meltdown
        Mitigation: RFI Flush, L1D private per thread
        # echo 0 > /sys/kernel/debug/powerpc/rfi_flush
        # cat /sys/devices/system/cpu/vulnerabilities/meltdown
        Vulnerable: L1D private per thread
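
       A sketch of the reporting logic the truth table implies (a
       simplified, hypothetical form of the sysfs show routine; the real
       code in arch/powerpc/kernel/security.c differs in detail):

         static ssize_t show_meltdown_state(char *buf)
         {
                 bool thread_priv =
                         security_ftr_enabled(SEC_FTR_L1D_THREAD_PRIV);

                 if (rfi_flush)
                         return sprintf(buf, thread_priv ?
                                 "Mitigation: RFI Flush, L1D private per thread\n" :
                                 "Mitigation: RFI Flush\n");

                 /* RFI Flush disabled: L1D private alone is not a mitigation */
                 if (thread_priv)
                         return sprintf(buf,
                                 "Vulnerable: L1D private per thread\n");

                 return sprintf(buf, "Vulnerable\n");
         }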
      Signed-off-by: NGustavo L. F. Walbon <gwalbon@linux.ibm.com>
      Signed-off-by: NMauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
       Link: https://lore.kernel.org/r/20190502210907.42375-1-gwalbon@linux.ibm.com
       Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      900553fb