1. 16 9月, 2019 9 次提交
    • B
      powerpc/tm: Remove msr_tm_active() · 052bc385
      Breno Leitao 提交于
      [ Upstream commit 5c784c8414fba11b62e12439f11e109fb5751f38 ]
      
      Currently msr_tm_active() is a wrapper around MSR_TM_ACTIVE() if
      CONFIG_PPC_TRANSACTIONAL_MEM is set, or it is just a function that
      returns false if CONFIG_PPC_TRANSACTIONAL_MEM is not set.
      
      This function is not necessary, since MSR_TM_ACTIVE() just do the same and
      could be used, removing the dualism and simplifying the code.
      
      This patchset remove every instance of msr_tm_active() and replaced it
      by MSR_TM_ACTIVE().
      Signed-off-by: NBreno Leitao <leitao@debian.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      052bc385
    • S
      powerpc/mm: Limit rma_size to 1TB when running without HV mode · c4fc7cb9
      Suraj Jitindar Singh 提交于
      [ Upstream commit da0ef93310e67ae6902efded60b6724dab27a5d1 ]
      
      The virtual real mode addressing (VRMA) mechanism is used when a
      partition is using HPT (Hash Page Table) translation and performs real
      mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this mode
      effective address bits 0:23 are treated as zero (i.e. the access is
      aliased to 0) and the access is performed using an implicit 1TB SLB
      entry.
      
      The size of the RMA (Real Memory Area) is communicated to the guest as
      the size of the first memory region in the device tree. And because of
      the mechanism described above can be expected to not exceed 1TB. In
      the event that the host erroneously represents the RMA as being larger
      than 1TB, guest accesses in real mode to memory addresses above 1TB
      will be aliased down to below 1TB. This means that a memory access
      performed in real mode may differ to one performed in virtual mode for
      the same memory address, which would likely have unintended
      consequences.
      
      To avoid this outcome have the guest explicitly limit the size of the
      RMA to the current maximum, which is 1TB. This means that even if the
      first memory block is larger than 1TB, only the first 1TB should be
      accessed in real mode.
      
      Fixes: c610d65c ("powerpc/pseries: lift RTAS limit for hash")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Tested-by: NSatheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190710052018.14628-1-sjitindarsingh@gmail.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      c4fc7cb9
    • M
      KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation · 3a1b79ad
      Michael Neuling 提交于
      [ Upstream commit 3fefd1cd95df04da67c83c1cb93b663f04b3324f ]
      
      When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The
      code currently sets:
          CR0 <- 00 || MSR[TS]
      but according to the ISA it should be:
          CR0 <-  0 || MSR[TS] || 0
      
      This fixes the bit shift to put the bits in the correct location.
      
      This is a data integrity issue as CR0 is corrupted.
      
      Fixes: 4bb3c7a0 ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9")
      Cc: stable@vger.kernel.org # v4.17+
      Tested-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3a1b79ad
    • P
      KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct · 3ac71806
      Paul Mackerras 提交于
      [ Upstream commit fd0944baad806dfb4c777124ec712c55b714ff51 ]
      
      When the 'regs' field was added to struct kvm_vcpu_arch, the code
      was changed to use several of the fields inside regs (e.g., gpr, lr,
      etc.) but not the ccr field, because the ccr field in struct pt_regs
      is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
      only 32 bits.  This changes the code to use the regs.ccr field
      instead of cr, and changes the assembly code on 64-bit platforms to
      use 64-bit loads and stores instead of 32-bit ones.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3ac71806
    • M
      powerpc/kvm: Save and restore host AMR/IAMR/UAMOR · 915c9d0a
      Michael Ellerman 提交于
      [ Upstream commit c3c7470c75566a077c8dc71dcf8f1948b8ddfab4 ]
      
      When the hash MMU is active the AMR, IAMR and UAMOR are used for
      pkeys. The AMR is directly writable by user space, and the UAMOR masks
      those writes, meaning both registers are effectively user register
      state. The IAMR is used to create an execute only key.
      
      Also we must maintain the value of at least the AMR when running in
      process context, so that any memory accesses done by the kernel on
      behalf of the process are correctly controlled by the AMR.
      
      Although we are correctly switching all registers when going into a
      guest, on returning to the host we just write 0 into all regs, except
      on Power9 where we restore the IAMR correctly.
      
      This could be observed by a user process if it writes the AMR, then
      runs a guest and we then return immediately to it without
      rescheduling. Because we have written 0 to the AMR that would have the
      effect of granting read/write permission to pages that the process was
      trying to protect.
      
      In addition, when using the Radix MMU, the AMR can prevent inadvertent
      kernel access to userspace data, writing 0 to the AMR disables that
      protection.
      
      So save and restore AMR, IAMR and UAMOR.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NRussell Currey <ruscur@russell.cc>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      915c9d0a
    • R
      powerpc/pkeys: Fix handling of pkey state across fork() · cfbf227e
      Ram Pai 提交于
      [ Upstream commit 2cd4bd192ee94848695c1c052d87913260e10f36 ]
      
      Protection key tracking information is not copied over to the
      mm_struct of the child during fork(). This can cause the child to
      erroneously allocate keys that were already allocated. Any allocated
      execute-only key is lost aswell.
      
      Add code; called by dup_mmap(), to copy the pkey state from parent to
      child explicitly.
      
      This problem was originally found by Dave Hansen on x86, which turns
      out to be a problem on powerpc aswell.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Reviewed-by: NThiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: NRam Pai <linuxram@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      cfbf227e
    • P
      KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch · d3984e80
      Paul Mackerras 提交于
      [ Upstream commit 234ff0b729ad882d20f7996591a964965647addf ]
      
      Testing has revealed an occasional crash which appears to be caused
      by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
      The symptom is a NULL pointer dereference in __find_linux_pte() called
      from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
      
      Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
      kvm->arch.pgtable (via kvmppc_free_radix()) before setting
      kvm->arch.radix to NULL, and there is nothing to prevent
      kvm_unmap_hva_range_hv() or the other MMU callback functions from
      being called concurrently with kvmppc_switch_mmu_to_hpt() or
      kvmppc_switch_mmu_to_radix().
      
      This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
      around the assignments to kvm->arch.radix, and makes sure that the
      partition-scoped radix tree or HPT is only freed after changing
      kvm->arch.radix.
      
      This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
      that the clearing of each rmap array (one per memslot) doesn't happen
      concurrently with use of the array in the kvm_unmap_hva_range_hv()
      or the other MMU callbacks.
      
      Fixes: 18c3640c ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d3984e80
    • C
      powerpc/64: mark start_here_multiplatform as __ref · 7f8b2360
      Christophe Leroy 提交于
      [ Upstream commit 9c4e4c90ec24652921e31e9551fcaedc26eec86d ]
      
      Otherwise, the following warning is encountered:
      
      WARNING: vmlinux.o(.text+0x3dc6): Section mismatch in reference from the variable start_here_multiplatform to the function .init.text:.early_setup()
      The function start_here_multiplatform() references
      the function __init .early_setup().
      This is often because start_here_multiplatform lacks a __init
      annotation or the annotation of .early_setup is wrong.
      
      Fixes: 56c46bba9bbf ("powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWX")
      Cc: Russell Currey <ruscur@russell.cc>
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      7f8b2360
    • G
      powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction · 47a0f70d
      Gustavo Romero 提交于
      commit 8205d5d98ef7f155de211f5e2eb6ca03d95a5a60 upstream.
      
      When we take an FP unavailable exception in a transaction we have to
      account for the hardware FP TM checkpointed registers being
      incorrect. In this case for this process we know the current and
      checkpointed FP registers must be the same (since FP wasn't used
      inside the transaction) hence in the thread_struct we copy the current
      FP registers to the checkpointed ones.
      
      This copy is done in tm_reclaim_thread(). We use thread->ckpt_regs.msr
      to determine if FP was on when in userspace. thread->ckpt_regs.msr
      represents the state of the MSR when exiting userspace. This is setup
      by check_if_tm_restore_required().
      
      Unfortunatley there is an optimisation in giveup_all() which returns
      early if tsk->thread.regs->msr (via local variable `usermsr`) has
      FP=VEC=VSX=SPE=0. This optimisation means that
      check_if_tm_restore_required() is not called and hence
      thread->ckpt_regs.msr is not updated and will contain an old value.
      
      This can happen if due to load_fp=255 we start a userspace process
      with MSR FP=1 and then we are context switched out. In this case
      thread->ckpt_regs.msr will contain FP=1. If that same process is then
      context switched in and load_fp overflows, MSR will have FP=0. If that
      process now enters a transaction and does an FP instruction, the FP
      unavailable will not update thread->ckpt_regs.msr (the bug) and MSR
      FP=1 will be retained in thread->ckpt_regs.msr.  tm_reclaim_thread()
      will then not perform the required memcpy and the checkpointed FP regs
      in the thread struct will contain the wrong values.
      
      The code path for this happening is:
      
             Userspace:                      Kernel
                         Start userspace
                          with MSR FP/VEC/VSX/SPE=0 TM=1
                            < -----
             ...
             tbegin
             bne
             fp instruction
                         FP unavailable
                             ---- >
                                              fp_unavailable_tm()
      					  tm_reclaim_current()
      					    tm_reclaim_thread()
      					      giveup_all()
      					        return early since FP/VMX/VSX=0
      						/* ckpt MSR not updated (Incorrect) */
      					      tm_reclaim()
      					        /* thread_struct ckpt FP regs contain junk (OK) */
                                                    /* Sees ckpt MSR FP=1 (Incorrect) */
      					      no memcpy() performed
      					        /* thread_struct ckpt FP regs not fixed (Incorrect) */
      					  tm_recheckpoint()
      					     /* Put junk in hardware checkpoint FP regs */
                                               ....
                            < -----
                         Return to userspace
                           with MSR TM=1 FP=1
                           with junk in the FP TM checkpoint
             TM rollback
             reads FP junk
      
      This is a data integrity problem for the current process as the FP
      registers are corrupted. It's also a security problem as the FP
      registers from one process may be leaked to another.
      
      This patch moves up check_if_tm_restore_required() in giveup_all() to
      ensure thread->ckpt_regs.msr is updated correctly.
      
      A simple testcase to replicate this will be posted to
      tools/testing/selftests/powerpc/tm/tm-poison.c
      
      Similarly for VMX.
      
      This fixes CVE-2019-15030.
      
      Fixes: f48e91e8 ("powerpc/tm: Fix FP and VMX register corruption")
      Cc: stable@vger.kernel.org # 4.12+
      Signed-off-by: NGustavo Romero <gromero@linux.vnet.ibm.com>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190904045529.23002-1-gromero@linux.vnet.ibm.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      47a0f70d
  2. 06 9月, 2019 1 次提交
  3. 29 8月, 2019 1 次提交
  4. 16 8月, 2019 1 次提交
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 2bc73d91
      Wanpeng Li 提交于
      commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream.
      
      After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
      five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
      on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() sustains to be true before finish_swait() is called in
      kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
      by kvm_vcpu_on_spin() loop greatly increases the probability condition
      kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
      is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
      vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
      VMCS.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2bc73d91
  5. 31 7月, 2019 10 次提交
    • M
      powerpc/tm: Fix oops on sigreturn on systems without TM · b993a66d
      Michael Neuling 提交于
      commit f16d80b75a096c52354c6e0a574993f3b0dfbdfe upstream.
      
      On systems like P9 powernv where we have no TM (or P8 booted with
      ppc_tm=off), userspace can construct a signal context which still has
      the MSR TS bits set. The kernel tries to restore this context which
      results in the following crash:
      
        Unexpected TM Bad Thing exception at c0000000000022fc (msr 0x8000000102a03031) tm_scratch=800000020280f033
        Oops: Unrecoverable exception, sig: 6 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 0 PID: 1636 Comm: sigfuz Not tainted 5.2.0-11043-g0a8ad0ffa4 #69
        NIP:  c0000000000022fc LR: 00007fffb2d67e48 CTR: 0000000000000000
        REGS: c00000003fffbd70 TRAP: 0700   Not tainted  (5.2.0-11045-g7142b497d8)
        MSR:  8000000102a03031 <SF,VEC,VSX,FP,ME,IR,DR,LE,TM[E]>  CR: 42004242  XER: 00000000
        CFAR: c0000000000022e0 IRQMASK: 0
        GPR00: 0000000000000072 00007fffb2b6e560 00007fffb2d87f00 0000000000000669
        GPR04: 00007fffb2b6e728 0000000000000000 0000000000000000 00007fffb2b6f2a8
        GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR12: 0000000000000000 00007fffb2b76900 0000000000000000 0000000000000000
        GPR16: 00007fffb2370000 00007fffb2d84390 00007fffea3a15ac 000001000a250420
        GPR20: 00007fffb2b6f260 0000000010001770 0000000000000000 0000000000000000
        GPR24: 00007fffb2d843a0 00007fffea3a14a0 0000000000010000 0000000000800000
        GPR28: 00007fffea3a14d8 00000000003d0f00 0000000000000000 00007fffb2b6e728
        NIP [c0000000000022fc] rfi_flush_fallback+0x7c/0x80
        LR [00007fffb2d67e48] 0x7fffb2d67e48
        Call Trace:
        Instruction dump:
        e96a0220 e96a02a8 e96a0330 e96a03b8 394a0400 4200ffdc 7d2903a6 e92d0c00
        e94d0c08 e96d0c10 e82d0c18 7db242a6 <4c000024> 7db243a6 7db142a6 f82d0c18
      
      The problem is the signal code assumes TM is enabled when
      CONFIG_PPC_TRANSACTIONAL_MEM is enabled. This may not be the case as
      with P9 powernv or if `ppc_tm=off` is used on P8.
      
      This means any local user can crash the system.
      
      Fix the problem by returning a bad stack frame to the user if they try
      to set the MSR TS bits with sigreturn() on systems where TM is not
      supported.
      
      Found with sigfuz kernel selftest on P9.
      
      This fixes CVE-2019-13648.
      
      Fixes: 2b0a576d ("powerpc: Add new transactional memory state to the signal context")
      Cc: stable@vger.kernel.org # v3.9
      Reported-by: NPraveen Pandey <Praveen.Pandey@in.ibm.com>
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190719050502.405-1-mikey@neuling.orgSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b993a66d
    • G
      powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask() · b9310c56
      Gautham R. Shenoy 提交于
      commit 4d202c8c8ed3822327285747db1765967110b274 upstream.
      
      xive_find_target_in_mask() has the following for(;;) loop which has a
      bug when @first == cpumask_first(@mask) and condition 1 fails to hold
      for every CPU in @mask. In this case we loop forever in the for-loop.
      
        first = cpu;
        for (;;) {
        	  if (cpu_online(cpu) && xive_try_pick_target(cpu)) // condition 1
      		  return cpu;
      	  cpu = cpumask_next(cpu, mask);
      	  if (cpu == first) // condition 2
      		  break;
      
      	  if (cpu >= nr_cpu_ids) // condition 3
      		  cpu = cpumask_first(mask);
        }
      
      This is because, when @first == cpumask_first(@mask), we never hit the
      condition 2 (cpu == first) since prior to this check, we would have
      executed "cpu = cpumask_next(cpu, mask)" which will set the value of
      @cpu to a value greater than @first or to nr_cpus_ids. When this is
      coupled with the fact that condition 1 is not met, we will never exit
      this loop.
      
      This was discovered by the hard-lockup detector while running LTP test
      concurrently with SMT switch tests.
      
       watchdog: CPU 12 detected hard LOCKUP on other CPUs 68
       watchdog: CPU 12 TB:85587019220796, last SMP heartbeat TB:85578827223399 (15999ms ago)
       watchdog: CPU 68 Hard LOCKUP
       watchdog: CPU 68 TB:85587019361273, last heartbeat TB:85576815065016 (19930ms ago)
       CPU: 68 PID: 45050 Comm: hxediag Kdump: loaded Not tainted 4.18.0-100.el8.ppc64le #1
       NIP:  c0000000006f5578 LR: c000000000cba9ec CTR: 0000000000000000
       REGS: c000201fff3c7d80 TRAP: 0100   Not tainted  (4.18.0-100.el8.ppc64le)
       MSR:  9000000002883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 24028424  XER: 00000000
       CFAR: c0000000006f558c IRQMASK: 1
       GPR00: c0000000000afc58 c000201c01c43400 c0000000015ce500 c000201cae26ec18
       GPR04: 0000000000000800 0000000000000540 0000000000000800 00000000000000f8
       GPR08: 0000000000000020 00000000000000a8 0000000080000000 c00800001a1beed8
       GPR12: c0000000000b1410 c000201fff7f4c00 0000000000000000 0000000000000000
       GPR16: 0000000000000000 0000000000000000 0000000000000540 0000000000000001
       GPR20: 0000000000000048 0000000010110000 c00800001a1e3780 c000201cae26ed18
       GPR24: 0000000000000000 c000201cae26ed8c 0000000000000001 c000000001116bc0
       GPR28: c000000001601ee8 c000000001602494 c000201cae26ec18 000000000000001f
       NIP [c0000000006f5578] find_next_bit+0x38/0x90
       LR [c000000000cba9ec] cpumask_next+0x2c/0x50
       Call Trace:
       [c000201c01c43400] [c000201cae26ec18] 0xc000201cae26ec18 (unreliable)
       [c000201c01c43420] [c0000000000afc58] xive_find_target_in_mask+0x1b8/0x240
       [c000201c01c43470] [c0000000000b0228] xive_pick_irq_target.isra.3+0x168/0x1f0
       [c000201c01c435c0] [c0000000000b1470] xive_irq_startup+0x60/0x260
       [c000201c01c43640] [c0000000001d8328] __irq_startup+0x58/0xf0
       [c000201c01c43670] [c0000000001d844c] irq_startup+0x8c/0x1a0
       [c000201c01c436b0] [c0000000001d57b0] __setup_irq+0x9f0/0xa90
       [c000201c01c43760] [c0000000001d5aa0] request_threaded_irq+0x140/0x220
       [c000201c01c437d0] [c00800001a17b3d4] bnx2x_nic_load+0x188c/0x3040 [bnx2x]
       [c000201c01c43950] [c00800001a187c44] bnx2x_self_test+0x1fc/0x1f70 [bnx2x]
       [c000201c01c43a90] [c000000000adc748] dev_ethtool+0x11d8/0x2cb0
       [c000201c01c43b60] [c000000000b0b61c] dev_ioctl+0x5ac/0xa50
       [c000201c01c43bf0] [c000000000a8d4ec] sock_do_ioctl+0xbc/0x1b0
       [c000201c01c43c60] [c000000000a8dfb8] sock_ioctl+0x258/0x4f0
       [c000201c01c43d20] [c0000000004c9704] do_vfs_ioctl+0xd4/0xa70
       [c000201c01c43de0] [c0000000004ca274] sys_ioctl+0xc4/0x160
       [c000201c01c43e30] [c00000000000b388] system_call+0x5c/0x70
       Instruction dump:
       78aad182 54a806be 3920ffff 78a50664 794a1f24 7d294036 7d43502a 7d295039
       4182001c 48000034 78a9d182 79291f24 <7d23482a> 2fa90000 409e0020 38a50040
      
      To fix this, move the check for condition 2 after the check for
      condition 3, so that we are able to break out of the loop soon after
      iterating through all the CPUs in the @mask in the problem case. Use
      do..while() to achieve this.
      
      Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Reported-by: NIndira P. Joga <indira.priya@in.ibm.com>
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1563359724-13931-1-git-send-email-ego@linux.vnet.ibm.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9310c56
    • O
      powerpc/eeh: Handle hugepages in ioremap space · 7f775a67
      Oliver O'Halloran 提交于
      [ Upstream commit 33439620680be5225c1b8806579a291e0d761ca0 ]
      
      In commit 4a7b06c157a2 ("powerpc/eeh: Handle hugepages in ioremap
      space") support for using hugepages in the vmalloc and ioremap areas was
      enabled for radix. Unfortunately this broke EEH MMIO error checking.
      
      Detection works by inserting a hook which checks the results of the
      ioreadXX() set of functions.  When a read returns a 0xFFs response we
      need to check for an error which we do by mapping the (virtual) MMIO
      address back to a physical address, then mapping physical address to a
      PCI device via an interval tree.
      
      When translating virt -> phys we currently assume the ioremap space is
      only populated by PAGE_SIZE mappings. If a hugepage mapping is found we
      emit a WARN_ON(), but otherwise handles the check as though a normal
      page was found. In pathalogical cases such as copying a buffer
      containing a lot of 0xFFs from BAR memory this can result in the system
      not booting because it's too busy printing WARN_ON()s.
      
      There's no real reason to assume huge pages can't be present and we're
      prefectly capable of handling them, so do that.
      
      Fixes: 4a7b06c157a2 ("powerpc/eeh: Handle hugepages in ioremap space")
      Reported-by: NSachin Sant <sachinp@linux.vnet.ibm.com>
      Signed-off-by: NOliver O'Halloran <oohall@gmail.com>
      Tested-by: NSachin Sant <sachinp@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190710150517.27114-1-oohall@gmail.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      7f775a67
    • M
      powerpc/boot: add {get, put}_unaligned_be32 to xz_config.h · 4b9dc73a
      Masahiro Yamada 提交于
      [ Upstream commit 9e005b761e7ad153dcf40a6cba1d681fe0830ac6 ]
      
      The next commit will make the way of passing CONFIG options more robust.
      Unfortunately, it would uncover another hidden issue; without this
      commit, skiroot_defconfig would be broken like this:
      
      |   WRAP    arch/powerpc/boot/zImage.pseries
      | arch/powerpc/boot/wrapper.a(decompress.o): In function `bcj_powerpc.isra.10':
      | decompress.c:(.text+0x720): undefined reference to `get_unaligned_be32'
      | decompress.c:(.text+0x7a8): undefined reference to `put_unaligned_be32'
      | make[1]: *** [arch/powerpc/boot/Makefile;383: arch/powerpc/boot/zImage.pseries] Error 1
      | make: *** [arch/powerpc/Makefile;295: zImage] Error 2
      
      skiroot_defconfig is the only defconfig that enables CONFIG_KERNEL_XZ
      for ppc, which has never been correctly built before.
      
      I figured out the root cause in lib/decompress_unxz.c:
      
      | #ifdef CONFIG_PPC
      | #      define XZ_DEC_POWERPC
      | #endif
      
      CONFIG_PPC is undefined here in the ppc bootwrapper because autoconf.h
      is not included except by arch/powerpc/boot/serial.c
      
      XZ_DEC_POWERPC is not defined, therefore, bcj_powerpc() is not compiled
      for the bootwrapper.
      
      With the next commit passing CONFIG_PPC correctly, we would realize that
      {get,put}_unaligned_be32 was missing.
      
      Unlike the other decompressors, the ppc bootwrapper duplicates all the
      necessary helpers in arch/powerpc/boot/.
      
      The other architectures define __KERNEL__ and pull in helpers for
      building the decompressors.
      
      If ppc bootwrapper had defined __KERNEL__, lib/xz/xz_private.h would
      have included <asm/unaligned.h>:
      
      | #ifdef __KERNEL__
      | #       include <linux/xz.h>
      | #       include <linux/kernel.h>
      | #       include <asm/unaligned.h>
      
      However, doing so would cause tons of definition conflicts since the
      bootwrapper has duplicated everything.
      
      I just added copies of {get,put}_unaligned_be32, following the
      bootwrapper coding convention.
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190705100144.28785-1-yamada.masahiro@socionext.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      4b9dc73a
    • A
      powerpc/mm: Handle page table allocation failures · d48720ba
      Aneesh Kumar K.V 提交于
      [ Upstream commit 2230ebf6e6dd0b7751e2921b40f6cfe34f09bb16 ]
      
      This fixes kernel crash that arises due to not handling page table allocation
      failures while allocating hugetlb page table.
      
      Fixes: e2b3d202 ("powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format")
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d48720ba
    • C
      powerpc/4xx/uic: clear pending interrupt after irq type/pol change · 52373ab6
      Christian Lamparter 提交于
      [ Upstream commit 3ab3a0689e74e6aa5b41360bc18861040ddef5b1 ]
      
      When testing out gpio-keys with a button, a spurious
      interrupt (and therefore a key press or release event)
      gets triggered as soon as the driver enables the irq
      line for the first time.
      
      This patch clears any potential bogus generated interrupt
      that was caused by the switching of the associated irq's
      type and polarity.
      Signed-off-by: NChristian Lamparter <chunkeey@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      52373ab6
    • N
      powerpc/xmon: Fix disabling tracing while in xmon · 9fac3948
      Naveen N. Rao 提交于
      [ Upstream commit aaf06665f7ea3ee9f9754e16c1a507a89f1de5b1 ]
      
      Commit ed49f7fd ("powerpc/xmon: Disable tracing when entering
      xmon") added code to disable recording trace entries while in xmon. The
      commit introduced a variable 'tracing_enabled' to record if tracing was
      enabled on xmon entry, and used this to conditionally enable tracing
      during exit from xmon.
      
      However, we are not checking the value of 'fromipi' variable in
      xmon_core() when setting 'tracing_enabled'. Due to this, when secondary
      cpus enter xmon, they will see tracing as being disabled already and
      tracing won't be re-enabled on exit. Fix the same.
      
      Fixes: ed49f7fd ("powerpc/xmon: Disable tracing when entering xmon")
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      9fac3948
    • Q
      powerpc/cacheflush: fix variable set but not used · a80f67d5
      Qian Cai 提交于
      [ Upstream commit 04db3ede40ae4fc23a5c4237254c4a53bbe4c1f2 ]
      
      The powerpc's flush_cache_vmap() is defined as a macro and never use
      both of its arguments, so it will generate a compilation warning,
      
      lib/ioremap.c: In function 'ioremap_page_range':
      lib/ioremap.c:203:16: warning: variable 'start' set but not used
      [-Wunused-but-set-variable]
      
      Fix it by making it an inline function.
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a80f67d5
    • A
      powerpc/pci/of: Fix OF flags parsing for 64bit BARs · 216462fa
      Alexey Kardashevskiy 提交于
      [ Upstream commit df5be5be8735ef2ae80d5ae1f2453cd81a035c4b ]
      
      When the firmware does PCI BAR resource allocation, it passes the assigned
      addresses and flags (prefetch/64bit/...) via the "reg" property of
      a PCI device device tree node so the kernel does not need to do
      resource allocation.
      
      The flags are stored in resource::flags - the lower byte stores
      PCI_BASE_ADDRESS_SPACE/etc bits and the other bytes are IORESOURCE_IO/etc.
      Some flags from PCI_BASE_ADDRESS_xxx and IORESOURCE_xxx are duplicated,
      such as PCI_BASE_ADDRESS_MEM_PREFETCH/PCI_BASE_ADDRESS_MEM_TYPE_64/etc.
      When parsing the "reg" property, we copy the prefetch flag but we skip
      on PCI_BASE_ADDRESS_MEM_TYPE_64 which leaves the flags out of sync.
      
      The missing IORESOURCE_MEM_64 flag comes into play under 2 conditions:
      1. we remove PCI_PROBE_ONLY for pseries (by hacking pSeries_setup_arch()
      or by passing "/chosen/linux,pci-probe-only");
      2. we request resource alignment (by passing pci=resource_alignment=
      via the kernel cmd line to request PAGE_SIZE alignment or defining
      ppc_md.pcibios_default_alignment which returns anything but 0). Note that
      the alignment requests are ignored if PCI_PROBE_ONLY is enabled.
      
      With 1) and 2), the generic PCI code in the kernel unconditionally
      decides to:
      - reassign the BARs in pci_specified_resource_alignment() (works fine)
      - write new BARs to the device - this fails for 64bit BARs as the generic
      code looks at IORESOURCE_MEM_64 (not set) and writes only lower 32bits
      of the BAR and leaves the upper 32bit unmodified which breaks BAR mapping
      in the hypervisor.
      
      This fixes the issue by copying the flag. This is useful if we want to
      enforce certain BAR alignment per platform as handling subpage sized BARs
      is proven to cause problems with hotplug (SLOF already aligns BARs to 64k).
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NSam Bobroff <sbobroff@linux.ibm.com>
      Reviewed-by: NOliver O'Halloran <oohall@gmail.com>
      Reviewed-by: NShawn Anastasio <shawn@anastas.io>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      216462fa
    • N
      powerpc/pseries/mobility: prevent cpu hotplug during DT update · fd0d171c
      Nathan Lynch 提交于
      [ Upstream commit e59a175faa8df9d674247946f2a5a9c29c835725 ]
      
      CPU online/offline code paths are sensitive to parts of the device
      tree (various cpu node properties, cache nodes) that can be changed as
      a result of a migration.
      
      Prevent CPU hotplug while the device tree potentially is inconsistent.
      
      Fixes: 410bccf9 ("powerpc/pseries: Partition migration in the kernel")
      Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com>
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      fd0d171c
  6. 26 7月, 2019 4 次提交
    • N
      powerpc/pseries: Fix oops in hotplug memory notifier · 01982f7b
      Nathan Lynch 提交于
      commit 0aa82c482ab2ece530a6f44897b63b274bb43c8e upstream.
      
      During post-migration device tree updates, we can oops in
      pseries_update_drconf_memory() if the source device tree has an
      ibm,dynamic-memory-v2 property and the destination has a
      ibm,dynamic_memory (v1) property. The notifier processes an "update"
      for the ibm,dynamic-memory property but it's really an add in this
      scenario. So make sure the old property object is there before
      dereferencing it.
      
      Fixes: 2b31e3ae ("powerpc/drmem: Add support for ibm, dynamic-memory-v2 property")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      01982f7b
    • G
      powerpc/powernv/npu: Fix reference leak · e725502b
      Greg Kurz 提交于
      commit 02c5f5394918b9b47ff4357b1b18335768cd867d upstream.
      
      Since 902bdc57, get_pci_dev() calls pci_get_domain_bus_and_slot(). This
      has the effect of incrementing the reference count of the PCI device, as
      explained in drivers/pci/search.c:
      
       * Given a PCI domain, bus, and slot/function number, the desired PCI
       * device is located in the list of PCI devices. If the device is
       * found, its reference count is increased and this function returns a
       * pointer to its data structure.  The caller must decrement the
       * reference count by calling pci_dev_put().  If no device is found,
       * %NULL is returned.
      
      Nothing was done to call pci_dev_put() and the reference count of GPU and
      NPU PCI devices rockets up.
      
      A natural way to fix this would be to teach the callers about the change,
      so that they call pci_dev_put() when done with the pointer. This turns
      out to be quite intrusive, as it affects many paths in npu-dma.c,
      pci-ioda.c and vfio_pci_nvlink2.c. Also, the issue appeared in 4.16 and
      some affected code got moved around since then: it would be problematic
      to backport the fix to stable releases.
      
      All that code never cared for reference counting anyway. Call pci_dev_put()
      from get_pci_dev() to revert to the previous behavior.
      
      Fixes: 902bdc57 ("powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn")
      Cc: stable@vger.kernel.org # v4.16
      Signed-off-by: NGreg Kurz <groug@kaod.org>
      Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e725502b
    • R
      powerpc/watchpoint: Restore NV GPRs while returning from exception · 1e3b61cb
      Ravi Bangoria 提交于
      commit f474c28fbcbe42faca4eb415172c07d76adcb819 upstream.
      
      powerpc hardware triggers watchpoint before executing the instruction.
      To make trigger-after-execute behavior, kernel emulates the
      instruction. If the instruction is 'load something into non-volatile
      register', exception handler should restore emulated register state
      while returning back, otherwise there will be register state
      corruption. eg, adding a watchpoint on a list can corrput the list:
      
        # cat /proc/kallsyms | grep kthread_create_list
        c00000000121c8b8 d kthread_create_list
      
      Add watchpoint on kthread_create_list->prev:
      
        # perf record -e mem:0xc00000000121c8c0
      
      Run some workload such that new kthread gets invoked. eg, I just
      logged out from console:
      
        list_add corruption. next->prev should be prev (c000000001214e00), \
      	but was c00000000121c8b8. (next=c00000000121c8b8).
        WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
        CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
        ...
        NIP __list_add_valid+0xb4/0xc0
        LR __list_add_valid+0xb0/0xc0
        Call Trace:
        __list_add_valid+0xb0/0xc0 (unreliable)
        __kthread_create_on_node+0xe0/0x260
        kthread_create_on_node+0x34/0x50
        create_worker+0xe8/0x260
        worker_thread+0x444/0x560
        kthread+0x160/0x1a0
        ret_from_kernel_thread+0x5c/0x70
      
      List corruption happened because it uses 'load into non-volatile
      register' instruction:
      
      Snippet from __kthread_create_on_node:
      
        c000000000136be8:     addis   r29,r2,-19
        c000000000136bec:     ld      r29,31424(r29)
              if (!__list_add_valid(new, prev, next))
        c000000000136bf0:     mr      r3,r30
        c000000000136bf4:     mr      r5,r28
        c000000000136bf8:     mr      r4,r29
        c000000000136bfc:     bl      c00000000059a2f8 <__list_add_valid+0x8>
      
      Register state from WARN_ON():
      
        GPR00: c00000000059a3a0 c000007ff23afb50 c000000001344e00 0000000000000075
        GPR04: 0000000000000000 0000000000000000 0000001852af8bc1 0000000000000000
        GPR08: 0000000000000001 0000000000000007 0000000000000006 00000000000004aa
        GPR12: 0000000000000000 c000007ffffeb080 c000000000137038 c000005ff62aaa00
        GPR16: 0000000000000000 0000000000000000 c000007fffbe7600 c000007fffbe7370
        GPR20: c000007fffbe7320 c000007fffbe7300 c000000001373a00 0000000000000000
        GPR24: fffffffffffffef7 c00000000012e320 c000007ff23afcb0 c000000000cb8628
        GPR28: c00000000121c8b8 c000000001214e00 c000007fef5b17e8 c000007fef5b17c0
      
      Watchpoint hit at 0xc000000000136bec.
      
        addis   r29,r2,-19
         => r29 = 0xc000000001344e00 + (-19 << 16)
         => r29 = 0xc000000001214e00
      
        ld      r29,31424(r29)
         => r29 = *(0xc000000001214e00 + 31424)
         => r29 = *(0xc00000000121c8c0)
      
      0xc00000000121c8c0 is where we placed a watchpoint and thus this
      instruction was emulated by emulate_step. But because handle_dabr_fault
      did not restore emulated register state, r29 still contains stale
      value in above register state.
      
      Fixes: 5aae8a53 ("powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors")
      Signed-off-by: NRavi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: stable@vger.kernel.org # 2.6.36+
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1e3b61cb
    • C
      powerpc/32s: fix suspend/resume when IBATs 4-7 are used · 237ac0d7
      Christophe Leroy 提交于
      commit 6ecb78ef56e08d2119d337ae23cb951a640dc52d upstream.
      
      Previously, only IBAT1 and IBAT2 were used to map kernel linear mem.
      Since commit 63b2bc619565 ("powerpc/mm/32s: Use BATs for
      STRICT_KERNEL_RWX"), we may have all 8 BATs used for mapping
      kernel text. But the suspend/restore functions only save/restore
      BATs 0 to 3, and clears BATs 4 to 7.
      
      Make suspend and restore functions respectively save and reload
      the 8 BATs on CPUs having MMU_FTR_USE_HIGH_BATS feature.
      Reported-by: NAndreas Schwab <schwab@linux-m68k.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      237ac0d7
  7. 25 6月, 2019 2 次提交
    • M
      powerpc/mm/64s/hash: Reallocate context ids on fork · cd3e4939
      Michael Ellerman 提交于
      commit ca72d88378b2f2444d3ec145dd442d449d3fefbc upstream.
      
      When using the Hash Page Table (HPT) MMU, userspace memory mappings
      are managed at two levels. Firstly in the Linux page tables, much like
      other architectures, and secondly in the SLB (Segment Lookaside
      Buffer) and HPT. It's the SLB and HPT that are actually used by the
      hardware to do translations.
      
      As part of the series adding support for 4PB user virtual address
      space using the hash MMU, we added support for allocating multiple
      "context ids" per process, one for each 512TB chunk of address space.
      These are tracked in an array called extended_id in the mm_context_t
      of a process that has done a mapping above 512TB.
      
      If such a process forks (ie. clone(2) without CLONE_VM set) it's mm is
      copied, including the mm_context_t, and then init_new_context() is
      called to reinitialise parts of the mm_context_t as appropriate to
      separate the address spaces of the two processes.
      
      The key step in ensuring the two processes have separate address
      spaces is to allocate a new context id for the process, this is done
      at the beginning of hash__init_new_context(). If we didn't allocate a
      new context id then the two processes would share mappings as far as
      the SLB and HPT are concerned, even though their Linux page tables
      would be separate.
      
      For mappings above 512TB, which use the extended_id array, we
      neglected to allocate new context ids on fork, meaning the parent and
      child use the same ids and therefore share those mappings even though
      they're supposed to be separate. This can lead to the parent seeing
      writes done by the child, which is essentially memory corruption.
      
      There is an additional exposure which is that if the child process
      exits, all its context ids are freed, including the context ids that
      are still in use by the parent for mappings above 512TB. One or more
      of those ids can then be reallocated to a third process, that process
      can then read/write to the parent's mappings above 512TB. Additionally
      if the freed id is used for the third process's primary context id,
      then the parent is able to read/write to the third process's mappings
      *below* 512TB.
      
      All of these are fundamental failures to enforce separation between
      processes. The only mitigating factor is that the bug only occurs if a
      process creates mappings above 512TB, and most applications still do
      not create such mappings.
      
      Only machines using the hash page table MMU are affected, eg. PowerPC
      970 (G5), PA6T, Power5/6/7/8/9. By default Power9 bare metal machines
      (powernv) use the Radix MMU and are not affected, unless the machine
      has been explicitly booted in HPT mode (using disable_radix on the
      kernel command line). KVM guests on Power9 may be affected if the host
      or guest is configured to use the HPT MMU. LPARs under PowerVM on
      Power9 are affected as they always use the HPT MMU. Kernels built with
      PAGE_SIZE=4K are not affected.
      
      The fix is relatively simple, we need to reallocate context ids for
      all extended mappings on fork.
      
      Fixes: f384796c ("powerpc/mm: Add support for handling > 512TB address in SLB miss")
      Cc: stable@vger.kernel.org # v4.17+
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cd3e4939
    • N
      powerpc/bpf: use unsigned division instruction for 64-bit operations · 48ee85dc
      Naveen N. Rao 提交于
      commit 758f2046ea040773ae8ea7f72dd3bbd8fa984501 upstream.
      
      BPF_ALU64 div/mod operations are currently using signed division, unlike
      BPF_ALU32 operations. Fix the same. DIV64 and MOD64 overflow tests pass
      with this fix.
      
      Fixes: 156d0e29 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
      Cc: stable@vger.kernel.org # v4.8+
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      48ee85dc
  8. 22 6月, 2019 3 次提交
    • P
      KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu · df6384e0
      Paul Mackerras 提交于
      [ Upstream commit 5a3f49364c3ffa1107bd88f8292406e98c5d206c ]
      
      Currently the HV KVM code takes the kvm->lock around calls to
      kvm_for_each_vcpu() and kvm_get_vcpu_by_id() (which can call
      kvm_for_each_vcpu() internally).  However, that leads to a lock
      order inversion problem, because these are called in contexts where
      the vcpu mutex is held, but the vcpu mutexes nest within kvm->lock
      according to Documentation/virtual/kvm/locking.txt.  Hence there
      is a possibility of deadlock.
      
      To fix this, we simply don't take the kvm->lock mutex around these
      calls.  This is safe because the implementations of kvm_for_each_vcpu()
      and kvm_get_vcpu_by_id() have been designed to be able to be called
      locklessly.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      df6384e0
    • P
      KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list · b376683f
      Paul Mackerras 提交于
      [ Upstream commit 1659e27d2bc1ef47b6d031abe01b467f18cb72d9 ]
      
      Currently the Book 3S KVM code uses kvm->lock to synchronize access
      to the kvm->arch.rtas_tokens list.  Because this list is scanned
      inside kvmppc_rtas_hcall(), which is called with the vcpu mutex held,
      taking kvm->lock cause a lock inversion problem, which could lead to
      a deadlock.
      
      To fix this, we add a new mutex, kvm->arch.rtas_token_lock, which nests
      inside the vcpu mutexes, and use that instead of kvm->lock when
      accessing the rtas token list.
      
      This removes the lockdep_assert_held() in kvmppc_rtas_tokens_free().
      At this point we don't hold the new mutex, but that is OK because
      kvmppc_rtas_tokens_free() is only called when the whole VM is being
      destroyed, and at that point nothing can be looking up a token in
      the list.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b376683f
    • A
      powerpc/powernv: Return for invalid IMC domain · 930d31a6
      Anju T Sudhakar 提交于
      [ Upstream commit b59bd3527fe3c1939340df558d7f9d568fc9f882 ]
      
      Currently init_imc_pmu() can fail either because we try to register an
      IMC unit with an invalid domain (i.e an IMC node not supported by the
      kernel) or something went wrong while registering a valid IMC unit. In
      both the cases kernel provides a 'Register failed' error message.
      
      For example when trace-imc node is not supported by the kernel, but
      skiboot advertises a trace-imc node we print:
      
        IMC Unknown Device type
        IMC PMU (null) Register failed
      
      To avoid confusion just print the unknown device type message, before
      attempting PMU registration, so the second message isn't printed.
      
      Fixes: 8f95faaa ("powerpc/powernv: Detect and create IMC device")
      Reported-by: NPavaman Subramaniyam <pavsubra@in.ibm.com>
      Signed-off-by: NAnju T Sudhakar <anju@linux.vnet.ibm.com>
      Reviewed-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      [mpe: Reword change log a bit]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      930d31a6
  9. 11 6月, 2019 1 次提交
    • K
      pstore: Convert buf_lock to semaphore · d4128a1b
      Kees Cook 提交于
      commit ea84b580b95521644429cc6748b6c2bf27c8b0f3 upstream.
      
      Instead of running with interrupts disabled, use a semaphore. This should
      make it easier for backends that may need to sleep (e.g. EFI) when
      performing a write:
      
      |BUG: sleeping function called from invalid context at kernel/sched/completion.c:99
      |in_atomic(): 1, irqs_disabled(): 1, pid: 2236, name: sig-xstate-bum
      |Preemption disabled at:
      |[<ffffffff99d60512>] pstore_dump+0x72/0x330
      |CPU: 26 PID: 2236 Comm: sig-xstate-bum Tainted: G      D           4.20.0-rc3 #45
      |Call Trace:
      | dump_stack+0x4f/0x6a
      | ___might_sleep.cold.91+0xd3/0xe4
      | __might_sleep+0x50/0x90
      | wait_for_completion+0x32/0x130
      | virt_efi_query_variable_info+0x14e/0x160
      | efi_query_variable_store+0x51/0x1a0
      | efivar_entry_set_safe+0xa3/0x1b0
      | efi_pstore_write+0x109/0x140
      | pstore_dump+0x11c/0x330
      | kmsg_dump+0xa4/0xd0
      | oops_exit+0x22/0x30
      ...
      Reported-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Fixes: 21b3ddd3 ("efi: Don't use spinlocks for efi vars")
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4128a1b
  10. 09 6月, 2019 3 次提交
    • T
      KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID · 6a2fbec7
      Thomas Huth 提交于
      commit a86cb413f4bf273a9d341a3ab2c2ca44e12eb317 upstream.
      
      KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all
      architectures. However, on s390x, the amount of usable CPUs is determined
      during runtime - it is depending on the features of the machine the code
      is running on. Since we are using the vcpu_id as an index into the SCA
      structures that are defined by the hardware (see e.g. the sca_add_vcpu()
      function), it is not only the amount of CPUs that is limited by the hard-
      ware, but also the range of IDs that we can use.
      Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too.
      So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
      code into the architecture specific code, and on s390x we have to return
      the same value here as for KVM_CAP_MAX_VCPUS.
      This problem has been discovered with the kvm_create_max_vcpus selftest.
      With this change applied, the selftest now passes on s390x, too.
      Reviewed-by: NAndrew Jones <drjones@redhat.com>
      Reviewed-by: NCornelia Huck <cohuck@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Message-Id: <20190523164309.13345-9-thuth@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      6a2fbec7
    • R
      powerpc/perf: Fix MMCRA corruption by bhrb_filter · ca221cf9
      Ravi Bangoria 提交于
      commit 3202e35ec1c8fc19cea24253ff83edf702a60a02 upstream.
      
      Consider a scenario where user creates two events:
      
        1st event:
          attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
          attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY;
          fd = perf_event_open(attr, 0, 1, -1, 0);
      
        This sets cpuhw->bhrb_filter to 0 and returns valid fd.
      
        2nd event:
          attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
          attr.branch_sample_type = PERF_SAMPLE_BRANCH_CALL;
          fd = perf_event_open(attr, 0, 1, -1, 0);
      
        It overrides cpuhw->bhrb_filter to -1 and returns with error.
      
      Now if power_pmu_enable() gets called by any path other than
      power_pmu_add(), ppmu->config_bhrb(-1) will set MMCRA to -1.
      
      Fixes: 3925f46b ("powerpc/perf: Enable branch stack sampling framework")
      Cc: stable@vger.kernel.org # v3.10+
      Signed-off-by: NRavi Bangoria <ravi.bangoria@linux.ibm.com>
      Reviewed-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ca221cf9
    • C
      KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interrupts · 55a94d81
      Cédric Le Goater 提交于
      commit ef9740204051d0e00f5402fe96cf3a43ddd2bbbf upstream.
      
      The passthrough interrupts are defined at the host level and their IRQ
      data should not be cleared unless specifically deconfigured (shutdown)
      by the host. They differ from the IPI interrupts which are allocated
      by the XIVE KVM device and reserved to the guest usage only.
      
      This fixes a host crash when destroying a VM in which a PCI adapter
      was passed-through. In this case, the interrupt is cleared and freed
      by the KVM device and then shutdown by vfio at the host level.
      
      [ 1007.360265] BUG: Kernel NULL pointer dereference at 0x00000d00
      [ 1007.360285] Faulting instruction address: 0xc00000000009da34
      [ 1007.360296] Oops: Kernel access of bad area, sig: 7 [#1]
      [ 1007.360303] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
      [ 1007.360314] Modules linked in: vhost_net vhost iptable_mangle ipt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc kvm_hv kvm xt_tcpudp iptable_filter squashfs fuse binfmt_misc vmx_crypto ib_iser rdma_cm iw_cm ib_cm libiscsi scsi_transport_iscsi nfsd ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath mlx5_ib ib_uverbs ib_core crc32c_vpmsum mlx5_core
      [ 1007.360425] CPU: 9 PID: 15576 Comm: CPU 18/KVM Kdump: loaded Not tainted 5.1.0-gad7e7d0ef #4
      [ 1007.360454] NIP:  c00000000009da34 LR: c00000000009e50c CTR: c00000000009e5d0
      [ 1007.360482] REGS: c000007f24ccf330 TRAP: 0300   Not tainted  (5.1.0-gad7e7d0ef)
      [ 1007.360500] MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002484  XER: 00000000
      [ 1007.360532] CFAR: c00000000009da10 DAR: 0000000000000d00 DSISR: 00080000 IRQMASK: 1
      [ 1007.360532] GPR00: c00000000009e62c c000007f24ccf5c0 c000000001510600 c000007fe7f947c0
      [ 1007.360532] GPR04: 0000000000000d00 0000000000000000 0000000000000000 c000005eff02d200
      [ 1007.360532] GPR08: 0000000000400000 0000000000000000 0000000000000000 fffffffffffffffd
      [ 1007.360532] GPR12: c00000000009e5d0 c000007fffff7b00 0000000000000031 000000012c345718
      [ 1007.360532] GPR16: 0000000000000000 0000000000000008 0000000000418004 0000000000040100
      [ 1007.360532] GPR20: 0000000000000000 0000000008430000 00000000003c0000 0000000000000027
      [ 1007.360532] GPR24: 00000000000000ff 0000000000000000 00000000000000ff c000007faa90d98c
      [ 1007.360532] GPR28: c000007faa90da40 00000000000fe040 ffffffffffffffff c000007fe7f947c0
      [ 1007.360689] NIP [c00000000009da34] xive_esb_read+0x34/0x120
      [ 1007.360706] LR [c00000000009e50c] xive_do_source_set_mask.part.0+0x2c/0x50
      [ 1007.360732] Call Trace:
      [ 1007.360738] [c000007f24ccf5c0] [c000000000a6383c] snooze_loop+0x15c/0x270 (unreliable)
      [ 1007.360775] [c000007f24ccf5f0] [c00000000009e62c] xive_irq_shutdown+0x5c/0xe0
      [ 1007.360795] [c000007f24ccf630] [c00000000019e4a0] irq_shutdown+0x60/0xe0
      [ 1007.360813] [c000007f24ccf660] [c000000000198c44] __free_irq+0x3a4/0x420
      [ 1007.360831] [c000007f24ccf700] [c000000000198dc8] free_irq+0x78/0xe0
      [ 1007.360849] [c000007f24ccf730] [c00000000096c5a8] vfio_msi_set_vector_signal+0xa8/0x350
      [ 1007.360878] [c000007f24ccf7f0] [c00000000096c938] vfio_msi_set_block+0xe8/0x1e0
      [ 1007.360899] [c000007f24ccf850] [c00000000096cae0] vfio_msi_disable+0xb0/0x110
      [ 1007.360912] [c000007f24ccf8a0] [c00000000096cd04] vfio_pci_set_msi_trigger+0x1c4/0x3d0
      [ 1007.360922] [c000007f24ccf910] [c00000000096d910] vfio_pci_set_irqs_ioctl+0xa0/0x170
      [ 1007.360941] [c000007f24ccf930] [c00000000096b400] vfio_pci_disable+0x80/0x5e0
      [ 1007.360963] [c000007f24ccfa10] [c00000000096b9bc] vfio_pci_release+0x5c/0x90
      [ 1007.360991] [c000007f24ccfa40] [c000000000963a9c] vfio_device_fops_release+0x3c/0x70
      [ 1007.361012] [c000007f24ccfa70] [c0000000003b5668] __fput+0xc8/0x2b0
      [ 1007.361040] [c000007f24ccfac0] [c0000000001409b0] task_work_run+0x140/0x1b0
      [ 1007.361059] [c000007f24ccfb20] [c000000000118f8c] do_exit+0x3ac/0xd00
      [ 1007.361076] [c000007f24ccfc00] [c0000000001199b0] do_group_exit+0x60/0x100
      [ 1007.361094] [c000007f24ccfc40] [c00000000012b514] get_signal+0x1a4/0x8f0
      [ 1007.361112] [c000007f24ccfd30] [c000000000021cc8] do_notify_resume+0x1a8/0x430
      [ 1007.361141] [c000007f24ccfe20] [c00000000000e444] ret_from_except_lite+0x70/0x74
      [ 1007.361159] Instruction dump:
      [ 1007.361175] 38422c00 e9230000 712a0004 41820010 548a2036 7d442378 78840020 71290020
      [ 1007.361194] 4082004c e9230010 7c892214 7c0004ac <e9240000> 0c090000 4c00012c 792a0022
      
      Cc: stable@vger.kernel.org # v4.12+
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NGreg Kurz <groug@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      55a94d81
  11. 04 6月, 2019 1 次提交
  12. 31 5月, 2019 4 次提交