1. 20 Nov, 2019 2 commits
  2. 14 Nov, 2019 1 commit
    • S
      KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast() · ed69a6cb
      Sean Christopherson authored
      Acquire the per-VM slots_lock when zapping all shadow pages as part of
      toggling nx_huge_pages.  The fast zap algorithm relies on exclusivity
      (via slots_lock) to identify obsolete vs. valid shadow pages, because it
      uses a single bit for its generation number. Holding slots_lock also
      obviates the need to acquire a read lock on the VM's srcu.
      
      Failing to take slots_lock when toggling nx_huge_pages allows multiple
      instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
      user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
      (kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
      temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
      to enforce exclusivity).
      
      Concurrent fast zap instances cause obsolete shadow pages to be
      incorrectly identified as valid due to the single bit generation number
      wrapping, which results in stale shadow pages being left in KVM's MMU
      and leads to all sorts of undesirable behavior.
      The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
      toggling nx_huge_pages via its module param.
      
      Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
      using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
      an ulong-sized generation instead of relying on exclusivity for
      correctness, but all callers except the recently added set_nx_huge_pages()
      needed to hold slots_lock anyways.  Therefore, this patch does not have
      to be backported to stable kernels.
      
      Given that toggling nx_huge_pages is by no means a fast path, force it
      to conform to the current approach instead of reintroducing the previous
      generation count.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed69a6cb
  3. 13 Nov, 2019 3 commits
  4. 12 Nov, 2019 7 commits
  5. 07 Nov, 2019 1 commit
    • J
      x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs · 012206a8
      Josh Poimboeuf authored
      For new IBRS_ALL CPUs, the Enhanced IBRS check at the beginning of
      cpu_bugs_smt_update() causes the function to return early, unintentionally
      skipping the MDS and TAA logic.
      
      This is not a problem for MDS, because there appears to be no overlap
      between IBRS_ALL and MDS-affected CPUs.  So the MDS mitigation would be
      disabled and nothing would need to be done in this function anyway.
      
      But for TAA, the TAA_MSG_SMT string will never get printed on Cascade
      Lake and newer.
      
      The check is superfluous anyway: when 'spectre_v2_enabled' is
      SPECTRE_V2_IBRS_ENHANCED, 'spectre_v2_user' is always
      SPECTRE_V2_USER_NONE, and so the 'spectre_v2_user' switch statement
      handles it appropriately by doing nothing.  So just remove the check.
      
      Fixes: 1b42f017 ("x86/speculation/taa: Add mitigation for TSX Async Abort")
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      012206a8
  6. 05 Nov, 2019 4 commits
    • M
      x86/tsc: Respect tsc command line parameter for clocksource_tsc_early · 63ec58b4
      Michael Zhivich authored
      The introduction of clocksource_tsc_early broke the functionality of
      "tsc=reliable" and "tsc=nowatchdog" command line parameters, since
      clocksource_tsc_early is unconditionally registered with
      CLOCK_SOURCE_MUST_VERIFY and thus put on the watchdog list.
      
      This can cause the TSC to be declared unstable during boot:
      
        clocksource: timekeeping watchdog on CPU0: Marking clocksource
                     'tsc-early' as unstable because the skew is too large:
        clocksource: 'refined-jiffies' wd_now: fffb7018 wd_last: fffb6e9d
                     mask: ffffffff
        clocksource: 'tsc-early' cs_now: 68a6a7070f6a0 cs_last: 68a69ab6f74d6
                     mask: ffffffffffffffff
        tsc: Marking TSC unstable due to clocksource watchdog
      
      The corresponding elapsed times are cs_nsec=1224152026 and wd_nsec=378942392, so
      the watchdog differs from TSC by 0.84 seconds.
      
      This happens when HPET is not available and jiffies are used as the TSC
      watchdog instead and the jiffies update is not happening due to lost timer
      interrupts in periodic mode, which can happen e.g. with expensive debug
      mechanisms enabled or under massive overload conditions in virtualized
      environments.
      
      Before the introduction of the early TSC clocksource the command line
      parameters "tsc=reliable" and "tsc=nowatchdog" could be used to work around
      this issue.
      
      Restore the behaviour by disabling the watchdog if requested on the kernel
      command line.
      
      [ tglx: Clarify changelog ]
      
      Fixes: aa83c457 ("x86/tsc: Introduce early tsc clocksource")
      Signed-off-by: Michael Zhivich <mzhivich@akamai.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191024175945.14338-1-mzhivich@akamai.com
      63ec58b4
    • T
      x86/dumpstack/64: Don't evaluate exception stacks before setup · e361362b
      Thomas Gleixner authored
      Cyrill reported the following crash:
      
        BUG: unable to handle page fault for address: 0000000000001ff0
        #PF: supervisor read access in kernel mode
        RIP: 0010:get_stack_info+0xb3/0x148
      
      It turns out that if the stack tracer is invoked before the exception stack
      mappings are initialized, in_exception_stack() can erroneously classify an
      invalid address as an address inside of an exception stack:
      
          begin = this_cpu_read(cea_exception_stacks);  <- 0
          end = begin + sizeof(exception stacks);
      
      i.e. any address between 0 and end will be considered an exception stack
      address, and the subsequent code will then try to dereference the resulting
      stack frame at an unmapped address.
      
       end = begin + (unsigned long)ep->size;
           ==> end = 0x2000
      
       regs = (struct pt_regs *)end - 1;
           ==> regs = 0x2000 - sizeof(struct pt_regs) = 0x1f58

       info->next_sp   = (unsigned long *)regs->sp;
           ==> Crashes due to accessing regs->sp at 0x1ff0
      
      Prevent this by checking the validity of the cea_exception_stack base
      address and bailing out if it is zero.
      
      Fixes: afcd21da ("x86/dumpstack/64: Use cpu_entry_area instead of orig_ist")
      Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Cyrill Gorcunov <gorcunov@gmail.com>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1910231950590.1852@nanos.tec.linutronix.de
      e361362b
    • J
      x86/apic/32: Avoid bogus LDR warnings · fe6f85ca
      Jan Beulich authored
      The removal of the LDR initialization in the bigsmp_32 APIC code unearthed
      a problem in setup_local_APIC().
      
      The code checks unconditionally for a mismatch of the logical APIC id by
      comparing the early APIC id which was initialized in get_smp_config() with
      the actual LDR value in the APIC.
      
      Due to the removal of the bogus LDR initialization the check now can
      trigger on bigsmp_32 APIC systems emitting a warning for every booting
      CPU. This is of course a false positive because the APIC is not using
      logical destination mode.
      
      Restrict the check and the possibly resulting fixup to systems which are
      actually using the APIC in logical destination mode.
      
      [ tglx: Massaged changelog and added Cc stable ]
      
      Fixes: bae3a8d3 ("x86/apic: Do not initialize LDR and DFR for bigsmp")
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/666d8f91-b5a8-1afd-7add-821e72a35f03@suse.com
      fe6f85ca
    • J
      kvm: x86: mmu: Recovery of shattered NX large pages · 1aa9b957
      Junaid Shahid authored
      The page table pages corresponding to broken down large pages are zapped in
      FIFO order, so that the large page can potentially be recovered, if it is
      no longer being used for execution.  This removes the performance penalty
      for walking deeper EPT page tables.
      
      By default, one large page will last about one hour once the guest
      reaches a steady state.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1aa9b957
  7. 04 Nov, 2019 4 commits
  8. 31 Oct, 2019 2 commits
    • P
      KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active · 9167ab79
      Paolo Bonzini authored
      VMX already does so if the host has SMEP, in order to support the combination of
      CR0.WP=1 and CR4.SMEP=1.  However, it is perfectly safe to always do so, and in
      fact VMX already ends up running with EFER.NXE=1 on old processors that lack the
      "load EFER" controls, because it may help avoiding a slow MSR write.  Removing
      all the conditionals simplifies the code.
      
      SVM does not have similar code, but it should since recent AMD processors do
      support SMEP.  So this patch also makes the code for the two vendors more similar
      while fixing NPT=0, CR0.WP=1 and CR4.SMEP=1 on AMD processors.
      
      Cc: stable@vger.kernel.org
      Cc: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9167ab79
    • K
      x86, efi: Never relocate kernel below lowest acceptable address · 220dd769
      Kairui Song authored
      Currently, the kernel fails to boot on some Hyper-V VMs when using EFI,
      and it is a potential issue on all x86 platforms.
      
      It is caused by broken kernel relocation on EFI systems when the
      following three conditions are met:
      
      1. Kernel image is not loaded to the default address (LOAD_PHYSICAL_ADDR)
         by the loader.
      2. There isn't enough room to contain the kernel, starting from the
         default load address (e.g. something else occupies part of the region).
      3. In the memmap provided by the EFI firmware, there is a memory region
         that starts below LOAD_PHYSICAL_ADDR and is suitable for containing
         the kernel.
      
      The EFI stub performs a kernel relocation when condition 1 is met. But
      due to condition 2, the EFI stub can't relocate the kernel to the
      preferred address, so it falls back to asking the EFI firmware to
      allocate the lowest usable memory region, gets the low region mentioned
      in condition 3, and relocates the kernel there.
      
      It's incorrect to relocate the kernel below LOAD_PHYSICAL_ADDR. This
      is the lowest acceptable kernel relocation address.
      
      The first thing that goes wrong is in arch/x86/boot/compressed/head_64.S.
      Kernel decompression forces the use of LOAD_PHYSICAL_ADDR as the output
      address if the kernel is located below it. Then the relocation before
      decompression, which moves the kernel to the end of the decompression
      buffer, overwrites other memory regions, as there is not enough memory
      there.
      
      To fix it, just don't let the EFI stub relocate the kernel to any address
      lower than the lowest acceptable address.
      
      [ ardb: introduce efi_low_alloc_above() to reduce the scope of the change ]
      Signed-off-by: Kairui Song <kasong@redhat.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191029173755.27149-6-ardb@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      220dd769
  9. 28 Oct, 2019 11 commits
  10. 25 Oct, 2019 1 commit
  11. 23 Oct, 2019 2 commits
    • J
      KVM: nVMX: Don't leak L1 MMIO regions to L2 · 671ddc70
      Jim Mattson authored
      If the "virtualize APIC accesses" VM-execution control is set in the
      VMCS, the APIC virtualization hardware is triggered when a page walk
      in VMX non-root mode terminates at a PTE wherein the address of the 4k
      page frame matches the APIC-access address specified in the VMCS. On
      hardware, the APIC-access address may be any valid 4k-aligned physical
      address.
      
      KVM's nVMX implementation enforces the additional constraint that the
      APIC-access address specified in the vmcs12 must be backed by
      a "struct page" in L1. If not, L0 will simply clear the "virtualize
      APIC accesses" VM-execution control in the vmcs02.
      
      The problem with this approach is that the L1 guest has arranged the
      vmcs12 EPT tables--or shadow page tables, if the "enable EPT"
      VM-execution control is clear in the vmcs12--so that the L2 guest
      physical address(es)--or L2 guest linear address(es)--that reference
      the L2 APIC map to the APIC-access address specified in the
      vmcs12. Without the "virtualize APIC accesses" VM-execution control in
      the vmcs02, the APIC accesses in the L2 guest will directly access the
      APIC-access page in L1.
      
      When there is no mapping whatsoever for the APIC-access address in L1,
      the L2 VM just loses the intended APIC virtualization. However, when
      the APIC-access address is mapped to an MMIO region in L1, the L2
      guest gets direct access to the L1 MMIO device. For example, if the
      APIC-access address specified in the vmcs12 is 0xfee00000, then L2
      gets direct access to L1's APIC.
      
      Since this vmcs12 configuration is something that KVM cannot
      faithfully emulate, the appropriate response is to exit to userspace
      with KVM_INTERNAL_ERROR_EMULATION.
      
      Fixes: fe3ef05c ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
      Reported-by: Dan Cross <dcross@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      671ddc70
    • M
      KVM: SVM: Fix potential wrong physical id in avic_handle_ldr_update · 5c94ac5d
      Miaohe Lin authored
      The guest physical APIC ID may not be equal to vcpu->vcpu_id in some cases.
      We may set the wrong physical ID in avic_handle_ldr_update as we
      always use vcpu->vcpu_id. Get the physical APIC ID from the vAPIC page
      instead.
      Export and use kvm_xapic_id here and in avic_handle_apic_id_update
      as suggested by Vitaly.
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c94ac5d
  12. 22 Oct, 2019 2 commits