1. 12 Oct, 2019 21 commits
    • riscv: Avoid interrupts being erroneously enabled in handle_exception() · d286a374
      Vincent Chen committed
      [ Upstream commit c82dd6d078a2bb29d41eda032bb96d05699a524d ]
      
      When the handle_exception function addresses an exception, the interrupts
      will be unconditionally enabled after finishing the context save. However,
      it may erroneously enable the interrupts if the interrupts are disabled
      before entering the handle_exception.
      
      For example, one of the WARN_ON() conditions is satisfied in the scheduling
      where the interrupt is disabled and rq.lock is locked. The WARN_ON will
      trigger a break exception and the handle_exception function will enable the
      interrupts before entering do_trap_break function. During the procedure, if
      a timer interrupt is pending, it will be taken when interrupts are enabled.
      In this case, it may cause a deadlock problem if the rq.lock is locked
      again in the timer ISR.
      
      Hence, the handle_exception() can only enable interrupts when the state of
      sstatus.SPIE is 1.
      
      This patch is tested on HiFive Unleashed board.
      Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
      Reviewed-by: Palmer Dabbelt <palmer@sifive.com>
      [paul.walmsley@sifive.com: updated to apply]
      Fixes: bcae803a ("RISC-V: Enable IRQ during exception handling")
      Cc: David Abdurachmanov <david.abdurachmanov@sifive.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
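The rule stated in the riscv commit above (only re-enable interrupts when sstatus.SPIE is 1) can be sketched in C as follows. This is an illustration of the decision only: the real handler is written in assembly, and the helper name `irqs_were_enabled` is hypothetical. The bit position of SPIE (bit 5) follows the RISC-V privileged specification.

```c
#include <stdbool.h>

/* sstatus.SPIE (bit 5): interrupt-enable state saved when the trap was taken. */
#define SR_SPIE 0x00000020UL

/*
 * Hypothetical helper, not the kernel's code: returns true only if
 * interrupts were enabled before the exception, i.e. the only case in
 * which the handler may safely turn them back on.
 */
static bool irqs_were_enabled(unsigned long saved_sstatus)
{
    return (saved_sstatus & SR_SPIE) != 0;
}
```

A handler following this rule would check the saved sstatus value before setting SIE, so a WARN_ON() hit with interrupts disabled stays in an interrupts-disabled context.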
    • powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag · d1e4b4cc
      Aneesh Kumar K.V committed
      commit 09ce98cacd51fcd0fa0af2f79d1e1d3192f4cbb0 upstream.
      
      Rename the #define to indicate this is related to the store vs tlbie
      ordering issue. In the next patch, we will be adding another feature
      flag that is used to handle the ERAT flush vs tlbie ordering issue.
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt() · f5f31a6e
      Gautham R. Shenoy committed
      [ Upstream commit c784be435d5dae28d3b03db31753dd7a18733f0c ]
      
      The calls to arch_add_memory()/arch_remove_memory() are always made
      with the read-side cpu_hotplug_lock acquired via memory_hotplug_begin().
      On pSeries, arch_add_memory()/arch_remove_memory() eventually call
      resize_hpt() which in turn calls stop_machine() which acquires the
      read-side cpu_hotplug_lock again, thereby resulting in the recursive
      acquisition of this lock.
      
      In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
      lockup during a memory hotplug operation because cpus_read_lock() is a
      per-cpu rwsem read, which, in the fast-path (in the absence of the
      writer, which in our case is a CPU-hotplug operation) simply
      increments the read_count on the semaphore. Thus a recursive read in
      the fast-path doesn't cause any problems.
      
      However, we can hit this problem in practice if there is a concurrent
      CPU-Hotplug operation in progress which is waiting to acquire the
      write-side of the lock. This will cause the second recursive read to
      block until the writer finishes, while the writer is itself blocked
      because the first read holds the lock. Thus neither the reader nor the
      writer makes any progress, blocking both CPU-Hotplug and
      Memory-Hotplug operations.
      
      Memory-Hotplug				CPU-Hotplug
      CPU 0					CPU 1
      ------                                  ------
      
      1. down_read(cpu_hotplug_lock.rw_sem)
         [memory_hotplug_begin]
      					2. down_write(cpu_hotplug_lock.rw_sem)
      					[cpu_up/cpu_down]
      3. down_read(cpu_hotplug_lock.rw_sem)
         [stop_machine()]
      
      Lockdep complains as follows in these code-paths.
      
       swapper/0/1 is trying to acquire lock:
       (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: stop_machine+0x2c/0x60
      
      but task is already holding lock:
      (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(cpu_hotplug_lock.rw_sem);
         lock(cpu_hotplug_lock.rw_sem);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       3 locks held by swapper/0/1:
        #0: (____ptrval____) (&dev->mutex){....}, at: __driver_attach+0x12c/0x1b0
        #1: (____ptrval____) (cpu_hotplug_lock.rw_sem){++++}, at: mem_hotplug_begin+0x20/0x50
        #2: (____ptrval____) (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x54/0x1a0
      
      stack backtrace:
       CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
       Call Trace:
         dump_stack+0xe8/0x164 (unreliable)
         __lock_acquire+0x1110/0x1c70
         lock_acquire+0x240/0x290
         cpus_read_lock+0x64/0xf0
         stop_machine+0x2c/0x60
         pseries_lpar_resize_hpt+0x19c/0x2c0
         resize_hpt_for_hotplug+0x70/0xd0
         arch_add_memory+0x58/0xfc
         devm_memremap_pages+0x5e8/0x8f0
         pmem_attach_disk+0x764/0x830
         nvdimm_bus_probe+0x118/0x240
         really_probe+0x230/0x4b0
         driver_probe_device+0x16c/0x1e0
         __driver_attach+0x148/0x1b0
         bus_for_each_dev+0x90/0x130
         driver_attach+0x34/0x50
         bus_add_driver+0x1a8/0x360
         driver_register+0x108/0x170
         __nd_driver_register+0xd0/0xf0
         nd_pmem_driver_init+0x34/0x48
         do_one_initcall+0x1e0/0x45c
         kernel_init_freeable+0x540/0x64c
         kernel_init+0x2c/0x160
         ret_from_kernel_thread+0x5c/0x68
      
      Fix this issue by
        1) Requiring all the calls to pseries_lpar_resize_hpt() be made
           with cpu_hotplug_lock held.
      
        2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
           as a consequence of 1)
      
        3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
           with cpu_hotplug_lock held.
      
      Fixes: dbcf929c ("powerpc/pseries: Add support for hash table resizing")
      Cc: stable@vger.kernel.org # v4.11+
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1557906352-29048-1-git-send-email-ego@linux.vnet.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: nVMX: Fix consistency check on injected exception error code · 63bb8b76
      Sean Christopherson committed
      [ Upstream commit 567926cca99ba1750be8aae9c4178796bf9bb90b ]
      
      Current versions of Intel's SDM incorrectly state that "bits 31:15 of
      the VM-Entry exception error-code field" must be zero.  In reality, bits
      31:16 must be zero, i.e. error codes are 16-bit values.
      
      The bogus error code check manifests as an unexpected VM-Entry failure
      due to an invalid code field (error number 7) in L1, e.g. when injecting
      a #GP with error_code=0x9f00.
      
      Nadav previously reported the bug[*], both to KVM and Intel, and fixed
      the associated kvm-unit-test.
      
      [*] https://patchwork.kernel.org/patch/11124749/

      Reported-by: Nadav Amit <namit@vmware.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
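The off-by-one-bit difference described in the nVMX commit above can be seen with a small C sketch. The two masks encode "bits 31:15 must be zero" (the SDM wording) and "bits 31:16 must be zero" (the actual behavior); the macro and function names are hypothetical and are not taken from the KVM source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reserved-bit masks for the VM-Entry exception error-code field. */
#define ERR_CODE_RSVD_BUGGY 0xFFFF8000u /* bits 31:15, as the SDM states */
#define ERR_CODE_RSVD_FIXED 0xFFFF0000u /* bits 31:16, 16-bit error codes */

/* Hypothetical helper: an error code is valid if no reserved bit is set. */
static bool error_code_valid(uint32_t error_code, uint32_t reserved_mask)
{
    return (error_code & reserved_mask) == 0;
}
```

With the error code from the commit message, 0x9f00, bit 15 is set, so the check against the 31:15 mask rejects it while the check against the 31:16 mask accepts it.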
    • KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP · 34b13ff6
      Cédric Le Goater committed
      [ Upstream commit 237aed48c642328ff0ab19b63423634340224a06 ]
      
      When a vCPU is brought down, the XIVE VP (Virtual Processor) is first
      disabled and then the event notification queues are freed. When freeing
      the queues, we check for possible escalation interrupts and free them
      also.
      
      But when a XIVE VP is disabled, the underlying XIVE ENDs also are
      disabled in OPAL. When an END (Event Notification Descriptor) is
      disabled, its ESB pages (ESn and ESe) are disabled and loads return all
      1s. Which means that any access on the ESB page of the escalation
      interrupt will return invalid values.
      
      When an interrupt is freed, the shutdown handler computes a 'saved_p'
      field from the value returned by a load in xive_do_source_set_mask().
      This value is incorrect for escalation interrupts for the reason
      described above.
      
      This has no impact on Linux/KVM today because we don't make use of it
      but we will introduce in future changes a xive_get_irqchip_state()
      handler. This handler will use the 'saved_p' field to return the state
      of an interrupt and 'saved_p' being incorrect, softlockup will occur.
      
      Fix the vCPU cleanup sequence by first freeing the escalation interrupts
      if any, then disable the XIVE VP and last free the queues.
      
      Fixes: 90c73795afa2 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/purgatory: Disable the stackleak GCC plugin for the purgatory · 9dabade5
      Arvind Sankar committed
      [ Upstream commit ca14c996afe7228ff9b480cf225211cc17212688 ]
      
      Since commit:
      
        b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS")
      
      kexec breaks if GCC_PLUGIN_STACKLEAK=y is enabled, as the purgatory
      contains undefined references to stackleak_track_stack.
      
      Attempting to load a kexec kernel results in this failure:
      
        kexec: Undefined symbol: stackleak_track_stack
        kexec-bzImage64: Loading purgatory failed
      
      Fix this by disabling the stackleak plugin for the purgatory.
      Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS")
      Link: https://lkml.kernel.org/r/20190923171753.GA2252517@rani.riverdale.lan
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • arm64: cpufeature: Detect SSBS and advertise to userspace · 6df3c66d
      Will Deacon committed
      commit d71be2b6c0e19180b5f80a6d42039cc074a693a2 upstream.
      
      Armv8.5 introduces a new PSTATE bit known as Speculative Store Bypass
      Safe (SSBS) which can be used as a mitigation against Spectre variant 4.
      
      Additionally, a CPU may provide instructions to manipulate PSTATE.SSBS
      directly, so that userspace can toggle the SSBS control without trapping
      to the kernel.
      
      This patch probes for the existence of SSBS and advertises the new instructions
      to userspace if they exist.
      Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • MIPS: Treat Loongson Extensions as ASEs · fb93ccde
      Jiaxun Yang committed
      commit d2f965549006acb865c4638f1f030ebcefdc71f6 upstream.
      
      Recently, binutils split the Loongson-3 Extensions into four ASEs:
      MMI, CAM, EXT, EXT2. This patch does the same thing in the kernel and exposes
      them in cpuinfo so applications can probe supported ASEs at runtime.
      Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Huacai Chen <chenhc@lemote.com>
      Cc: Yunqiang Su <ysu@wavecomp.com>
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Cc: linux-mips@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions · 9124eac4
      Aneesh Kumar K.V committed
      commit 677733e296b5c7a37c47da391fc70a43dc40bd67 upstream.
      
      The store ordering vs tlbie issue mentioned in commit
      a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on
      POWER9") is fixed for Nimbus 2.3 and Cumulus 1.3 revisions. We don't
      need to apply the fixup if we are running on them.

      We can only do this on PowerNV. On pseries guest with KVM we still
      don't support redoing the feature fixup after migration. So we should
      be enabling all the workarounds needed, because we can possibly
      migrate between DD 2.3 and DD 2.2.
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/powernv/ioda: Fix race in TCE level allocation · 19c12f12
      Alexey Kardashevskiy committed
      commit 56090a3902c80c296e822d11acdb6a101b322c52 upstream.
      
      pnv_tce() returns a pointer to a TCE entry and originally a TCE table
      would be pre-allocated. For the default case of 2GB window the table
      needs only a single level and that is fine. However if more levels are
      requested, it is possible to get a race when 2 threads want a pointer
      to a TCE entry from the same page of TCEs.
      
      This adds cmpxchg to handle the race. Note that once TCE is non-zero,
      it cannot become zero again.
      
      Fixes: a68bd126 ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand")
      CC: stable@vger.kernel.org # v4.19+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190718051139.74787-2-aik@ozlabs.ru
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
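The cmpxchg pattern mentioned in the TCE commit above can be sketched in portable C11. This is a user-space illustration, not the kernel's pnv_tce() code: it uses `atomic_compare_exchange_strong` in place of the kernel's cmpxchg, `calloc` in place of the kernel allocator, and the names `tce_slot_t` and `get_or_alloc_level` are hypothetical. It relies on the invariant stated in the commit: once a slot is non-zero, it never becomes zero again.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* A slot in an upper-level table; zero means "no lower level yet". */
typedef _Atomic(uintptr_t) tce_slot_t;

/*
 * Hypothetical helper: return the lower-level table behind *slot,
 * allocating it on demand. If two threads race, exactly one
 * compare-and-exchange succeeds; the loser frees its allocation and
 * uses the winner's table.
 */
static uintptr_t *get_or_alloc_level(tce_slot_t *slot, size_t entries)
{
    uintptr_t cur = atomic_load(slot);
    if (cur)
        return (uintptr_t *)cur;

    uintptr_t *fresh = calloc(entries, sizeof(*fresh));
    if (!fresh)
        return NULL;

    uintptr_t expected = 0;
    if (atomic_compare_exchange_strong(slot, &expected, (uintptr_t)fresh))
        return fresh;

    /* Another thread installed a table first; 'expected' now holds it. */
    free(fresh);
    return (uintptr_t *)expected;
}
```

Allocating before the exchange and freeing on failure keeps the fast path lock-free, at the cost of an occasional wasted allocation when two threads race.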
    • powerpc/powernv: Restrict OPAL symbol map to only be readable by root · 032ce7d7
      Andrew Donnellan committed
      commit e7de4f7b64c23e503a8c42af98d56f2a7462bd6d upstream.
      
      Currently the OPAL symbol map is globally readable, which seems bad as
      it contains physical addresses.
      
      Restrict it to root.
      
      Fixes: c8742f85 ("powerpc/powernv: Expose OPAL firmware symbol map")
      Cc: stable@vger.kernel.org # v3.19+
      Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190503075253.22798-1-ajd@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/mce: Schedule work from irq_work · ba3ca9fc
      Santosh Sivaraj committed
      commit b5bda6263cad9a927e1a4edb7493d542da0c1410 upstream.
      
      schedule_work() cannot be called from MCE exception context as MCE can
      interrupt even in interrupt disabled context.
      
      Fixes: 733e4a4c ("powerpc/mce: hookup memory_failure for UE errors")
      Cc: stable@vger.kernel.org # v4.15+
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190820081352.8641-2-santosh@fossix.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/mce: Fix MCE handling for huge pages · ee6eeeb8
      Balbir Singh committed
      commit 99ead78afd1128bfcebe7f88f3b102fb2da09aee upstream.
      
      The current code would fail on huge pages addresses, since the shift would
      be incorrect. Use the correct page shift value returned by
      __find_linux_pte() to get the correct physical address. The code is more
      generic and can handle both regular and compound pages.
      
      Fixes: ba41e1e1 ("powerpc/mce: Hookup derror (load/store) UE errors")
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [arbab@linux.ibm.com: Fixup pseries_do_memory_failure()]
      Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
      Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190820081352.8641-3-santosh@fossix.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: nVMX: handle page fault in vmread fix · eff3a54a
      Jack Wang committed
      During the backport of f7eea636c3d5 ("KVM: nVMX: handle page fault in vmread"),
      a mistake was made: the exception reference should be passed to
      kvm_write_guest_virt_system() instead of NULL; otherwise we get a
      NULL pointer dereference, e.g.:
      
      kvm-unit-test triggered a NULL pointer deref below:
      [  948.518437] kvm [24114]: vcpu0, guest rIP: 0x407ef9 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x3, nop
      [  949.106464] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [  949.106707] PGD 0 P4D 0
      [  949.106872] Oops: 0002 [#1] SMP
      [  949.107038] CPU: 2 PID: 24126 Comm: qemu-2.7 Not tainted 4.19.77-pserver #4.19.77-1+feature+daily+update+20191005.1625+a4168bb~deb9
      [  949.107283] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 2.7.3 01/31/2018
      [  949.107549] RIP: 0010:kvm_write_guest_virt_system+0x12/0x40 [kvm]
      [  949.107719] Code: c0 5d 41 5c 41 5d 41 5e 83 f8 03 41 0f 94 c0 41 c1 e0 02 e9 b0 ed ff ff 0f 1f 44 00 00 48 89 f0 c6 87 59 56 00 00 01 48 89 d6 <49> c7 00 00 00 00 00 89 ca 49 c7 40 08 00 00 00 00 49 c7 40 10 00
      [  949.108044] RSP: 0018:ffffb31b0a953cb0 EFLAGS: 00010202
      [  949.108216] RAX: 000000000046b4d8 RBX: ffff9e9f415b0000 RCX: 0000000000000008
      [  949.108389] RDX: ffffb31b0a953cc0 RSI: ffffb31b0a953cc0 RDI: ffff9e9f415b0000
      [  949.108562] RBP: 00000000d2e14928 R08: 0000000000000000 R09: 0000000000000000
      [  949.108733] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffc8
      [  949.108907] R13: 0000000000000002 R14: ffff9e9f4f26f2e8 R15: 0000000000000000
      [  949.109079] FS:  00007eff8694c700(0000) GS:ffff9e9f51a80000(0000) knlGS:0000000031415928
      [  949.109318] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  949.109495] CR2: 0000000000000000 CR3: 00000003be53b002 CR4: 00000000003626e0
      [  949.109671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  949.109845] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  949.110017] Call Trace:
      [  949.110186]  handle_vmread+0x22b/0x2f0 [kvm_intel]
      [  949.110356]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
      [  949.110549]  kvm_arch_vcpu_ioctl_run+0xa98/0x1b30 [kvm]
      [  949.110725]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
      [  949.110901]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
      [  949.111072]  do_vfs_ioctl+0xa2/0x620
      Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix userspace set invalid CR4 · 21874027
      Wanpeng Li committed
      commit 3ca94192278ca8de169d78c085396c424be123b3 upstream.
      
      Reported by syzkaller:
      
      	WARNING: CPU: 0 PID: 6544 at /home/kernel/data/kvm/arch/x86/kvm//vmx/vmx.c:4689 handle_desc+0x37/0x40 [kvm_intel]
      	CPU: 0 PID: 6544 Comm: a.out Tainted: G           OE     5.3.0-rc4+ #4
      	RIP: 0010:handle_desc+0x37/0x40 [kvm_intel]
      	Call Trace:
      	 vmx_handle_exit+0xbe/0x6b0 [kvm_intel]
      	 vcpu_enter_guest+0x4dc/0x18d0 [kvm]
      	 kvm_arch_vcpu_ioctl_run+0x407/0x660 [kvm]
      	 kvm_vcpu_ioctl+0x3ad/0x690 [kvm]
      	 do_vfs_ioctl+0xa2/0x690
      	 ksys_ioctl+0x6d/0x80
      	 __x64_sys_ioctl+0x1a/0x20
      	 do_syscall_64+0x74/0x720
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When CR4.UMIP is set, the guest should have the UMIP cpuid flag. The current
      kvm set_sregs function doesn't have such a check when userspace inputs
      sregs values. SECONDARY_EXEC_DESC is enabled on writes to CR4.UMIP
      in vmx_set_cr4 even though the guest doesn't have the UMIP cpuid flag. The testcase
      triggers the handle_desc warning when executing the ltr instruction since
      the guest's architectural CR4 doesn't set UMIP. This patch fixes it by
      adding valid CR4 and CPUID combination checking in __set_sregs.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=138efb99600000
      
      Reported-by: syzbot+0f1819555fbdce992df9@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 · 30fbe0d3
      Paul Mackerras committed
      commit ff42df49e75f053a8a6b4c2533100cdcc23afe69 upstream.
      
      On POWER9, when userspace reads the value of the DPDES register on a
      vCPU, it is possible for 0 to be returned although there is a doorbell
      interrupt pending for the vCPU.  This can lead to a doorbell interrupt
      being lost across migration.  If the guest kernel uses doorbell
      interrupts for IPIs, then it could malfunction because of the lost
      interrupt.
      
      This happens because a newly-generated doorbell interrupt is signalled
      by setting vcpu->arch.doorbell_request to 1; the DPDES value in
      vcpu->arch.vcore->dpdes is not updated, because it can only be updated
      when holding the vcpu mutex, in order to avoid races.
      
      To fix this, we OR in vcpu->arch.doorbell_request when reading the
      DPDES value.
      
      Cc: stable@vger.kernel.org # v4.13+
      Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
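The fix described in the doorbell commit above (OR in doorbell_request when reading DPDES) can be sketched as follows. The struct layouts and the name `read_dpdes` are simplified, hypothetical stand-ins for the real vcpu and vcore structures; only the two fields named in the commit message are modeled.

```c
#include <stdint.h>

/* Simplified stand-ins for the structures named in the commit message. */
struct vcore_sketch {
    uint64_t dpdes;            /* updated only under the vcpu mutex */
};

struct vcpu_sketch {
    struct vcore_sketch *vcore;
    uint8_t doorbell_request;  /* set to 1 when a new doorbell is signalled */
};

/*
 * Hypothetical read path: report a doorbell as pending if either the
 * saved DPDES value or the not-yet-folded-in request flag says so.
 */
static uint64_t read_dpdes(const struct vcpu_sketch *vcpu)
{
    return vcpu->vcore->dpdes | vcpu->doorbell_request;
}
```

Reading the union of both fields means a doorbell signalled just before migration is still visible to userspace even though dpdes itself has not been updated yet.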
    • KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores · 4faa7f05
      Paul Mackerras committed
      commit d28eafc5a64045c78136162af9d4ba42f8230080 upstream.
      
      When we are running multiple vcores on the same physical core, they
      could be from different VMs and so it is possible that one of the
      VMs could have its arch.mmu_ready flag cleared (for example by a
      concurrent HPT resize) when we go to run it on a physical core.
      We currently check the arch.mmu_ready flag for the primary vcore
      but not the flags for the other vcores that will be run alongside
      it.  This adds that check, and also a check when we select the
      secondary vcores from the preempted vcores list.
      
      Cc: stable@vger.kernel.org # v4.14+
      Fixes: 38c53af8 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts · 577a5119
      Paul Mackerras committed
      commit 959c5d5134786b4988b6fdd08e444aa67d1667ed upstream.
      
      Escalation interrupts are interrupts sent to the host by the XIVE
      hardware when it has an interrupt to deliver to a guest VCPU but that
      VCPU is not running anywhere in the system.  Hence we disable the
      escalation interrupt for the VCPU being run when we enter the guest
      and re-enable it when the guest does an H_CEDE hypercall indicating
      it is idle.
      
      It is possible that an escalation interrupt gets generated just as we
      are entering the guest.  In that case the escalation interrupt may be
      using a queue entry in one of the interrupt queues, and that queue
      entry may not have been processed when the guest exits with an H_CEDE.
      The existing entry code detects this situation and does not clear the
      vcpu->arch.xive_esc_on flag as an indication that there is a pending
      queue entry (if the queue entry gets processed, xive_esc_irq() will
      clear the flag).  There is a comment in the code saying that if the
      flag is still set on H_CEDE, we have to abort the cede rather than
      re-enabling the escalation interrupt, lest we end up with two
      occurrences of the escalation interrupt in the interrupt queue.
      
      However, the exit code doesn't do that; it aborts the cede in the sense
      that vcpu->arch.ceded gets cleared, but it still enables the escalation
      interrupt by setting the source's PQ bits to 00.  Instead we need to
      set the PQ bits to 10, indicating that an interrupt has been triggered.
      We also need to avoid setting vcpu->arch.xive_esc_on in this case
      (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
      xive_esc_irq() will run at some point and clear it, and if we race with
      that we may end up with an incorrect result (i.e. xive_esc_on set when
      the escalation interrupt has just been handled).
      
      It is extremely unlikely that having two queue entries would cause
      observable problems; theoretically it could cause queue overflow, but
      the CPU would have to have thousands of interrupts targeted to it for
      that to be possible.  However, this fix will also make it possible to
      determine accurately whether there is an unhandled escalation
      interrupt in the queue, which will be needed by the following patch.
      
      Fixes: 9b9b13a6 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • s390/topology: avoid firing events before kobjs are created · 9aa823b3
      Vasily Gorbik committed
      commit f3122a79a1b0a113d3aea748e0ec26f2cb2889de upstream.
      
      arch_update_cpu_topology is first called from:
      kernel_init_freeable->sched_init_smp->sched_init_domains
      
      even before cpus have been registered in:
      kernel_init_freeable->do_one_initcall->s390_smp_init
      
      Do not trigger kobject_uevent change events until cpu devices are
      actually created. Fixes the following kasan findings:
      
      BUG: KASAN: global-out-of-bounds in kobject_uevent_env+0xb40/0xee0
      Read of size 8 at addr 0000000000000020 by task swapper/0/1
      
      BUG: KASAN: global-out-of-bounds in kobject_uevent_env+0xb36/0xee0
      Read of size 8 at addr 0000000000000018 by task swapper/0/1
      
      CPU: 0 PID: 1 Comm: swapper/0 Tainted: G    B
      Hardware name: IBM 3906 M04 704 (LPAR)
      Call Trace:
      ([<0000000143c6db7e>] show_stack+0x14e/0x1a8)
       [<0000000145956498>] dump_stack+0x1d0/0x218
       [<000000014429fb4c>] print_address_description+0x64/0x380
       [<000000014429f630>] __kasan_report+0x138/0x168
       [<0000000145960b96>] kobject_uevent_env+0xb36/0xee0
       [<0000000143c7c47c>] arch_update_cpu_topology+0x104/0x108
       [<0000000143df9e22>] sched_init_domains+0x62/0xe8
       [<000000014644c94a>] sched_init_smp+0x3a/0xc0
       [<0000000146433a20>] kernel_init_freeable+0x558/0x958
       [<000000014599002a>] kernel_init+0x22/0x160
       [<00000001459a71d4>] ret_from_fork+0x28/0x30
       [<00000001459a71dc>] kernel_thread_starter+0x0/0x10
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: s390: Test for bad access register and size at the start of S390_MEM_OP · ddfef75f
      Thomas Huth committed
      commit a13b03bbb4575b350b46090af4dfd30e735aaed1 upstream.
      
      If the KVM_S390_MEM_OP ioctl is called with an access register >= 16,
      then there is certainly a bug in the calling userspace application.
      We check for wrong access registers, but only if the vCPU was already
      in the access register mode before (i.e. the SIE block has recorded
      it). The check is also buried somewhere deep in the calling chain (in
      the function ar_translation()), so this is somewhat hard to find.
      
      It's better to always report an error to the userspace in case this
      field is set wrong, and it's safer in the KVM code if we block wrong
      values here early instead of relying on a check somewhere deep down
      the calling chain, so let's add another check to kvm_s390_guest_mem_op()
      directly.
      
      We also should check that the "size" is non-zero here (thanks to Janosch
      Frank for the hint!). If we do not check the size, we could call vmalloc()
      with this 0 value, and this will cause a kernel warning.
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Link: https://lkml.kernel.org/r/20190829122517.31042-1-thuth@redhat.com
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
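The early validation described in the S390_MEM_OP commit above can be sketched as a stand-alone check. The struct and function names are hypothetical and model only the two fields discussed (the access register number and the size); the limit of 16 comes from the commit message's "access register >= 16".

```c
#include <errno.h>
#include <stdint.h>

/* Access registers are numbered 0..15, so 16 and above is invalid. */
#define NUM_ACRS 16

/* Hypothetical, reduced view of the ioctl argument. */
struct mem_op_sketch {
    uint32_t size; /* number of bytes to transfer */
    uint8_t ar;    /* access register number */
};

/*
 * Reject bad input up front rather than deep in the call chain:
 * an out-of-range access register or a zero size fails with -EINVAL.
 */
static int mem_op_precheck(const struct mem_op_sketch *mop)
{
    if (mop->ar >= NUM_ACRS || mop->size == 0)
        return -EINVAL;
    return 0;
}
```

Checking at the entry point means the result no longer depends on whether the vCPU happened to be in access register mode, and a zero size never reaches the allocator.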
    • s390/process: avoid potential reading of freed stack · 8b41a30f
      Vasily Gorbik committed
      commit 8769f610fe6d473e5e8e221709c3ac402037da6c upstream.
      
      With THREAD_INFO_IN_TASK (which is selected on s390) task's stack usage
      is refcounted and should always be protected by get/put when touching
      other task's stack to avoid race conditions with task's destruction code.
      
      Fixes: d5c352cd ("s390: move thread_info into task_struct")
      Cc: stable@vger.kernel.org # v4.10+
      Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 08 Oct, 2019 19 commits