1. 16 September 2019, 16 commits
    • KVM: VMX: Fix handling of #MC that occurs during VM-Entry · 891011ca
      Authored by Sean Christopherson
      [ Upstream commit beb8d93b3e423043e079ef3dda19dad7b28467a8 ]
      
      A previous fix to prevent KVM from consuming stale VMCS state after a
      failed VM-Entry inadvertently blocked KVM's handling of machine checks
      that occur during VM-Entry.
      
      Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways,
      depending on when the #MC is recognized.  As it pertains to this bug
      fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY
      is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to
      indicate the VM-Entry failed.
      
      If a machine-check event occurs during a VM entry, one of the following occurs:
       - The machine-check event is handled as if it occurred before the VM entry:
              ...
       - The machine-check event is handled after VM entry completes:
              ...
       - A VM-entry failure occurs as described in Section 26.7. The basic
         exit reason is 41, for "VM-entry failure due to machine-check event".
      
      Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in
      vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit().
      Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY
      in a sane fashion and also simplifies vmx_complete_atomic_exit() since
      VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh.
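
      A minimal sketch of the resulting check in vmx_vcpu_run(), following the
      upstream patch (surrounding code elided; the (u16) cast strips bit 31 so
      a failed VM-Entry due to #MC is still recognized):

        vmx->exit_reason = vmx->fail ? 0xdead : vmcs_read32(VM_EXIT_REASON);

        /* Handle the #MC as a one-off, before bit 31 is consulted. */
        if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
            kvm_machine_check();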
      
      Fixes: b060ca3b ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: VMX: Always signal #GP on WRMSR to MSR_IA32_CR_PAT with bad value · 74ce1333
      Authored by Sean Christopherson
      [ Upstream commit d28f4290b53a157191ed9991ad05dffe9e8c0c89 ]
      
      The behavior of WRMSR is in no way dependent on whether or not KVM
      consumes the value.
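
      A sketch of the resulting shape in vmx_set_msr(), per the upstream
      patch: the validity check now runs before any decision about where the
      PAT value lands, so an invalid value always raises #GP:

        case MSR_IA32_CR_PAT:
            if (!kvm_pat_valid(data))
                return 1;

            if (vmcs_config.vmexit_ctrl & VM_EXIT_SAVE_IA32_PAT) {
                vmcs_write64(GUEST_IA32_PAT, data);
                vcpu->arch.pat = data;
                break;
            }
            ret = kvm_set_msr_common(vcpu, msr_info);
            break;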
      
      Fixes: 4566654b ("KVM: vmx: Inject #GP on invalid PAT CR")
      Cc: stable@vger.kernel.org
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: x86: optimize check for valid PAT value · 74fd8aae
      Authored by Paolo Bonzini
      [ Upstream commit 674ea351cdeb01d2740edce31db7f2d79ce6095d ]
      
      This check will soon be done on every nested vmentry and vmexit;
      "parallelize" it using bitwise operations.
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • kvm: Check irqchip mode before assign irqfd · d5f65393
      Authored by Peter Xu
      [ Upstream commit 654f1f13ea56b92bacade8ce2725aea0457f91c0 ]
      
      When assigning a kvm irqfd we didn't check the irqchip mode, so
      KVM_IRQFD succeeded under all irqchip modes.  However, it makes little
      sense to create an irqfd without an in-kernel irqchip.  Provide an
      arch-dependent helper to check whether a specific irqfd is allowed by
      the arch.  At least for x86, it makes sense to check:
      
      - when irqchip mode is NONE, all irqfds should be disallowed, and,
      
      - when irqchip mode is SPLIT, irqfds that carry a resamplefd should
        be disallowed.
      
      In either case, we previously ignored the irq or the irq ack event
      silently if the irqchip mode was incorrect.  That can cause mysterious
      guest behaviors and can be hard to triage.  Fail KVM_IRQFD earlier to
      detect these incorrect configurations, as sketched below.
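
      The x86 helper from the upstream commit boils down to the following
      (generic irqfd plumbing elided):

        bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args)
        {
            bool resample = args->flags & KVM_IRQFD_FLAG_RESAMPLE;

            /* NONE: no irqfds at all; SPLIT: no resampling irqfds. */
            return resample ? irqchip_kernel(kvm) : irqchip_in_kernel(kvm);
        }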
      
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Radim Krčmář <rkrcmar@redhat.com>
      CC: Alex Williamson <alex.williamson@redhat.com>
      CC: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernels · df5d4ea2
      Authored by Sean Christopherson
      [ Upstream commit b68f3cc7d978943fcf85148165b00594c38db776 ]
      
      Invoking the 64-bit variation on a 32-bit kernel will crash the guest,
      trigger a WARN, and/or lead to a buffer overrun in the host, e.g.
      rsm_load_state_64() writes r8-r15 unconditionally, but enum kvm_reg and
      thus x86_emulate_ctxt._regs only define r8-r15 for CONFIG_X86_64.
      
      KVM allows userspace to report long mode support via CPUID, even though
      the guest is all but guaranteed to crash if it actually tries to enable
      long mode.  But, a pure 32-bit guest that is ignorant of long mode will
      happily plod along.
      
      SMM complicates things as 64-bit CPUs use a different SMRAM save state
      area.  KVM handles this correctly for 64-bit kernels, e.g. uses the
      legacy save state map if userspace has hidden long mode from the guest,
      but doesn't fare well when userspace reports long mode support on a
      32-bit host kernel (32-bit KVM doesn't support 64-bit guests).
      
      Since the alternative is to crash the guest, e.g. by not loading state
      or explicitly requesting shutdown, unconditionally use the legacy SMRAM
      save state map for 32-bit KVM.  If a guest has managed to get far enough
      to handle SMIs when running under a weird/buggy userspace hypervisor,
      then don't deliberately crash the guest, since there is no downside
      (from KVM's perspective) to allowing it to continue running.
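
      A sketch of the resulting guard in the RSM emulation path (names follow
      the upstream emulator code; on a 32-bit build the 64-bit loader is not
      even compiled in):

        #ifdef CONFIG_X86_64
            if (emulator_has_longmode(ctxt))
                ret = rsm_load_state_64(ctxt, smbase + 0x8000);
            else
        #endif
                ret = rsm_load_state_32(ctxt, smbase + 0x8000);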
      
      Fixes: 660a5d51 ("KVM: x86: save/load state on SMM switch")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/kvm: move kvm_load/put_guest_xcr0 into atomic context · 7a74d806
      Authored by WANG Chao
      [ Upstream commit 1811d979c71621aafc7b879477202d286f7e863b ]
      
      Guest xcr0 can leak into the host when an MCE happens in guest mode,
      because do_machine_check() can schedule out at a few places.
      
      For example:
      
      kvm_load_guest_xcr0
      ...
      kvm_x86_ops->run(vcpu) {
        vmx_vcpu_run
          vmx_complete_atomic_exit
            kvm_machine_check
              do_machine_check
                do_memory_failure
                  memory_failure
                    lock_page
      
      In this case, host_xcr0 is 0x2ff and the guest vcpu xcr0 is 0xff. After
      scheduling out, the host CPU still has the guest xcr0 (0xff) loaded.
      
      In __switch_to {
           switch_fpu_finish
             copy_kernel_to_fpregs
               XRSTORS
      
      If XSTATE_BV[i] == 1 for any bit i with xcr0[i] == 0, XRSTORS generates
      a #GP (here, bit 9). Then ex_handler_fprestore kicks in and tries to
      reinitialize the FPU by restoring the init fpu state. Same story as the
      last #GP, except this time we get a DOUBLE FAULT.
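
      A sketch of the fix, assuming the vcpu-run structure of this kernel
      generation: the xcr0 switch moves inside the IRQs-disabled window
      around VM-Enter, so nothing between load and put can schedule:

        static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
        {
            /* ... IRQs are disabled from here on ... */
            kvm_load_guest_xcr0(vcpu);  /* moved in: no preemption point below */

            /* VM-Enter/VM-Exit, atomic exit completion, #MC handling */

            kvm_put_guest_xcr0(vcpu);   /* host xcr0 back before anything can schedule */
            /* ... */
        }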
      
      Cc: stable@vger.kernel.org
      Signed-off-by: WANG Chao <chao.wang@ucloud.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • kvm: mmu: Fix overflow on kvm mmu page limit calculation · 163b24b1
      Authored by Ben Gardon
      [ Upstream commit bc8a3d8925a8fa09fa550e0da115d95851ce33c6 ]
      
      KVM bases its memory usage limits on the total number of guest pages
      across all memslots. However, those limits, and the calculations to
      produce them, use 32 bit unsigned integers. This can result in overflow
      if a VM has more guest pages than can be represented by a u32. As a
      result of this overflow, KVM can use a low limit on the number of MMU
      pages it will allocate. This makes KVM unable to map all of guest memory
      at once, prompting spurious faults.
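
      A sketch of the widened calculation (this mirrors the upstream function
      after the patch; the bookkeeping types change from u32 to unsigned
      long):

        unsigned long kvm_mmu_calculate_default_mmu_pages(struct kvm *kvm)
        {
            unsigned long nr_pages = 0;
            struct kvm_memslots *slots;
            struct kvm_memory_slot *memslot;
            int i;

            for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
                slots = __kvm_memslots(kvm, i);
                kvm_for_each_memslot(memslot, slots)
                    nr_pages += memslot->npages;
            }

            /* no longer truncated to 32 bits for very large guests */
            return max_t(unsigned long,
                         nr_pages * KVM_PERMILLE_MMU_PAGES / 1000,
                         KVM_MIN_ALLOC_MMU_PAGES);
        }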
      
      Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch
      	introduced no new failures.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/kvmclock: set offset for kvm unstable clock · 1d60902a
      Authored by Pavel Tatashin
      [ Upstream commit b5179ec4187251a751832193693d6e474d3445ac ]
      
      VMs may show incorrect uptime and dmesg printk offsets on hypervisors
      with an unstable clock. The problem shows up when a VM is rebooted
      without exiting qemu.
      
      The fix is to calculate the clock offset not only for the stable clock
      but for the unstable clock as well, and to use kvm_sched_clock_read(),
      which subtracts the offset for both clocks.
      
      This is safe because pvclock_clocksource_read() does the right thing
      and makes sure that the clock always goes forward, so once the offset
      is calculated with the unstable clock, we won't get new reads that are
      smaller than the offset, and thus won't get negative results.
      
      Thank you Jon DeVree for helping to reproduce this issue.
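
      A sketch of kvm_sched_clock_init() after the fix (names follow the
      upstream kvmclock code; the pv-ops hook name varies slightly across
      kernel versions):

        static u64 kvm_sched_clock_read(void)
        {
            return kvm_clock_read() - kvm_sched_clock_offset;
        }

        static void __init kvm_sched_clock_init(bool stable)
        {
            if (!stable)
                clear_sched_clock_stable();

            /* record the offset unconditionally, not only when stable */
            kvm_sched_clock_offset = kvm_clock_read();
            pv_time_ops.sched_clock = kvm_sched_clock_read;
        }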
      
      Fixes: 857baa87 ("sched/clock: Enable sched clock early")
      Cc: stable@vger.kernel.org
      Reported-by: Dominique Martinet <asmadeus@codewreck.org>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: VMX: Compare only a single byte for VMCS' "launched" in vCPU-run · cd490d44
      Authored by Sean Christopherson
      [ Upstream commit 61c08aa9606d4e48a8a50639c956448a720174c3 ]
      
      The vCPU-run asm blob does a manual comparison of a VMCS' launched
      status to execute the correct VM-Enter instruction, i.e. VMLAUNCH vs.
      VMRESUME.  The launched flag is a bool, which is a typedef of _Bool.
      C99 does not define an exact size for _Bool, stating only that it must
      be large enough to hold '0' and '1'.  Most, if not all, compilers use
      a single byte for _Bool, including gcc[1].
      
      Originally, 'launched' was of type 'int' and so the asm blob used 'cmpl'
      to check the launch status.  When 'launched' was moved to be stored on a
      per-VMCS basis, struct vcpu_vmx's "temporary" __launched flag was added
      in order to avoid having to pass the current VMCS into the asm blob.
      The new '__launched' was defined as a 'bool' and not an 'int', but the
      'cmp' instruction was not updated.
      
      This has not caused any known problems, likely due to compilers aligning
      variables to 4-byte or 8-byte boundaries and KVM zeroing out struct
      vcpu_vmx during allocation.  I.e. vCPU-run accesses "junk" data; it
      just happens to always be zero and so doesn't affect the result.
      
      [1] https://gcc.gnu.org/ml/gcc-patches/2000-10/msg01127.html
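
      The fix is a one-byte opcode change in the vCPU-run asm blob; an
      illustrative before/after (the exact operand addressing differs across
      kernel versions):

        before:  cmpl $0, %c[launched](%0)   /* reads 4 bytes, 3 of them junk */
        after:   cmpb $0, %c[launched](%0)   /* reads only the one-byte bool */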
      
      Fixes: d462b819 ("KVM: VMX: Keep list of loaded VMCSs, instead of vcpus")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/kvm/lapic: preserve gfn_to_hva_cache len on cache reinit · 796469e3
      Authored by Vitaly Kuznetsov
      [ Upstream commit a7c42bb6da6b1b54b2e7bd567636d72d87b10a79 ]
      
      vcpu->arch.pv_eoi is accessible through both HV_X64_MSR_VP_ASSIST_PAGE
      and MSR_KVM_PV_EOI_EN, so on migration userspace may try to restore
      them in any order. The values match; however, kvm_lapic_enable_pv_eoi()
      uses a different length for each: for the Hyper-V case it is the whole
      struct hv_vp_assist_page, for the KVM-native case it is 8 bytes. If the
      KVM-native MSR is restored last, the cache is reinitialized with len=8,
      so accessing the VP assist page beyond 8 bytes with
      kvm_read_guest_cached() fails.
      
      Check whether we are reinitializing the cache for the same address, and
      preserve the length if it was greater.
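
      Essentially the upstream result, sketched:

        int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data,
                                    unsigned long len)
        {
            u64 addr = data & ~KVM_MSR_ENABLED;
            struct gfn_to_hva_cache *ghc = &vcpu->arch.pv_eoi.data;
            unsigned long new_len;

            if (!IS_ALIGNED(addr, 4))
                return 1;

            vcpu->arch.pv_eoi.msr_val = data;
            if (!pv_eoi_enabled(vcpu))
                return 0;

            if (addr == ghc->gpa && len <= ghc->len)
                new_len = ghc->len;    /* same page: keep the larger mapping */
            else
                new_len = len;

            return kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, addr, new_len);
        }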
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: hyperv: define VP assist page helpers · cdad0f65
      Authored by Ladi Prosek
      [ Upstream commit 72bbf9358c3676bd89dc4bd8fb0b1f2a11c288fc ]
      
      The state related to the VP assist page is still managed by the LAPIC
      code in the pv_eoi field.
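
      The helpers introduced by the upstream commit look roughly like this
      (sketch; the enable bit lives in the HV_X64_MSR_VP_ASSIST_PAGE value):

        bool kvm_hv_assist_page_enabled(struct kvm_vcpu *vcpu)
        {
            if (!(vcpu->arch.hyperv.hv_vapic & HV_X64_MSR_VP_ASSIST_PAGE_ENABLE))
                return false;
            return vcpu->arch.pv_eoi.msr_val & KVM_MSR_ENABLED;
        }

        bool kvm_hv_get_assist_page(struct kvm_vcpu *vcpu,
                                    struct hv_vp_assist_page *assist_page)
        {
            if (!kvm_hv_assist_page_enabled(vcpu))
                return false;
            return !kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data,
                                          assist_page, sizeof(*assist_page));
        }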
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: x86: hyperv: keep track of mismatched VP indexes · b0d9043b
      Authored by Vitaly Kuznetsov
      [ Upstream commit 87ee613d076351950b74383215437f841ebbeb75 ]
      
      In most common cases the VP index of a vcpu matches its vcpu index.
      Userspace is, however, free to set any mapping it wishes, and we need
      to account for that when we need to find a vCPU with a particular VP
      index. To keep search algorithms optimal in both cases, introduce a
      'num_mismatched_vp_indexes' counter showing how many vCPUs have a
      mismatching VP index. When the counter is zero we can assume
      vp_index == vcpu_idx.
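
      A sketch of the counter maintenance in kvm_hv_set_msr(), following the
      upstream change:

        case HV_X64_MSR_VP_INDEX: {
            struct kvm_hv *hv = &vcpu->kvm->arch.hyperv;
            int vcpu_idx = kvm_vcpu_get_idx(vcpu);
            u32 new_vp_index = (u32)data;

            if (!host || new_vp_index >= KVM_MAX_VCPUS)
                return 1;
            if (new_vp_index == hv_vcpu->vp_index)
                return 0;

            /* count transitions into and out of the mismatched state */
            if (hv_vcpu->vp_index == vcpu_idx)
                atomic_inc(&hv->num_mismatched_vp_indexes);
            else if (new_vp_index == vcpu_idx)
                atomic_dec(&hv->num_mismatched_vp_indexes);

            hv_vcpu->vp_index = new_vp_index;
            break;
        }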
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: x86: hyperv: consistently use 'hv_vcpu' for 'struct kvm_vcpu_hv' variables · f031fd03
      Authored by Vitaly Kuznetsov
      [ Upstream commit 1779a39f786397760ae7a7cc03cf37697d8ae58d ]
      
      Rename 'hv' to 'hv_vcpu' in kvm_hv_set_msr/kvm_hv_get_msr(); 'hv' is
      'reserved' for 'struct kvm_hv' variables across the file.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • KVM: x86: hyperv: enforce vp_index < KVM_MAX_VCPUS · 0b535f7b
      Authored by Vitaly Kuznetsov
      [ Upstream commit 9170200ec0ebad70e5b9902bc93e2b1b11456a3b ]
      
      Hyper-V TLFS (5.0b) states:
      
      > Virtual processors are identified by using an index (VP index). The
      > maximum number of virtual processors per partition supported by the
      > current implementation of the hypervisor can be obtained through CPUID
      > leaf 0x40000005. A virtual processor index must be less than the
      > maximum number of virtual processors per partition.
      
      Forbid userspace from setting VP_INDEX at or above KVM_MAX_VCPUS.
      get_vcpu_by_vpidx() can now be optimized to bail early when the
      supplied vpidx is >= KVM_MAX_VCPUS.
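
      The optimized lookup, essentially as upstream:

        static struct kvm_vcpu *get_vcpu_by_vpidx(struct kvm *kvm, u32 vpidx)
        {
            struct kvm_vcpu *vcpu = NULL;
            int i;

            if (vpidx >= KVM_MAX_VCPUS)
                return NULL;    /* VP indexes are now guaranteed bounded */

            vcpu = kvm_get_vcpu(kvm, vpidx);
            if (vcpu && vcpu_to_hv_vcpu(vcpu)->vp_index == vpidx)
                return vcpu;
            kvm_for_each_vcpu(i, vcpu, kvm)
                if (vcpu_to_hv_vcpu(vcpu)->vp_index == vpidx)
                    return vcpu;
            return NULL;
        }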
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86, hibernate: Fix nosave_regions setup for hibernation · 4d970758
      Authored by Zhimin Gu
      [ Upstream commit cc55f7537db6af371e9c1c6a71161ee40f918824 ]
      
      On 32-bit systems, nosave regions (non-RAM areas) located between
      max_low_pfn and max_pfn are currently not excluded from the hibernation
      snapshot, which may result in a machine check exception when trying to
      access these unsafe regions during hibernation:
      
      [  612.800453] Disabling lock debugging due to kernel taint
      [  612.805786] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 6: fe00000000801136
      [  612.814344] mce: [Hardware Error]: RIP !INEXACT! 60:<00000000d90be566> {swsusp_save+0x436/0x560}
      [  612.823167] mce: [Hardware Error]: TSC 1f5939fe276 ADDR dd000000 MISC 30e0000086
      [  612.830677] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1529487426 SOCKET 0 APIC 0 microcode 24
      [  612.839581] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
      [  612.846394] mce: [Hardware Error]: Machine check: Processor context corrupt
      [  612.853380] Kernel panic - not syncing: Fatal machine check
      [  612.858978] Kernel Offset: 0x18000000 from 0xc1000000 (relocation range: 0xc0000000-0xf7ffdfff)
      
      This is because on 32-bit systems, pages above max_low_pfn are regarded
      as high memory, and accessing such unsafe pages can trigger the MCE
      shown above. On the problematic 32-bit system, there is reserved memory
      above low memory, which triggered the MCE:
      
      e820 memory mapping:
      [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
      [    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000d160cfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d160d000-0x00000000d1613fff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000d1614000-0x00000000d1a44fff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d1a45000-0x00000000d1ecffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000d1ed0000-0x00000000d7eeafff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d7eeb000-0x00000000d7ffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000d8000000-0x00000000d875ffff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d8760000-0x00000000d87fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000d8800000-0x00000000d8fadfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d8fae000-0x00000000d8ffffff] ACPI data
      [    0.000000] BIOS-e820: [mem 0x00000000d9000000-0x00000000da71bfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000da71c000-0x00000000da7fffff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000da800000-0x00000000dbb8bfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000dbb8c000-0x00000000dbffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000dd000000-0x00000000df1fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000041edfffff] usable
      
      Fix this problem by changing the pfn limit from max_low_pfn to max_pfn.
      This does not impact 64-bit systems, where max_low_pfn is the same as
      max_pfn.
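
      A sketch of the one-line change in setup_arch()
      (arch/x86/kernel/setup.c, per the upstream commit):

        /*
         * Register the nosave regions up to max_pfn, not max_low_pfn, so
         * that reserved areas in high memory are excluded from the
         * hibernation snapshot as well.
         */
        e820__register_nosave_regions(max_pfn);    /* was: max_low_pfn */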
      Signed-off-by: Zhimin Gu <kookoo.gu@intel.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: All applicable <stable@vger.kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • x86/ftrace: Fix warning and consolidate ftrace_jmp_replace() and ftrace_call_replace() · 85a24825
      Authored by Steven Rostedt (VMware)
      [ Upstream commit 745cfeaac09ce359130a5451d90cb0bd4094c290 ]
      
      Arnd reported the following compiler warning:
      
      arch/x86/kernel/ftrace.c:669:23: error: 'ftrace_jmp_replace' defined but not used [-Werror=unused-function]
      
      The ftrace_jmp_replace() function now has only a single user and could
      simply be moved into that user. But looking at the code shows that
      ftrace_jmp_replace() is identical to ftrace_call_replace() except that
      it uses the opcode 0xe9 instead of 0xe8. It makes more sense to
      consolidate the two into one implementation that both
      ftrace_jmp_replace() and ftrace_call_replace() use, passing in the
      opcode separately.
      
      The structure in ftrace_code_union is also modified to replace the "e8"
      field with the more appropriate name "op".
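
      The consolidated helpers, essentially as upstream:

        static unsigned char *
        ftrace_text_replace(unsigned char op, unsigned long ip, unsigned long addr)
        {
            static union ftrace_code_union calc;

            calc.op     = op;
            calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);

            return calc.code;
        }

        static unsigned char *
        ftrace_call_replace(unsigned long ip, unsigned long addr)
        {
            return ftrace_text_replace(0xe8, ip, addr);    /* near call */
        }

        static unsigned char *
        ftrace_jmp_replace(unsigned long ip, unsigned long addr)
        {
            return ftrace_text_replace(0xe9, ip, addr);    /* near jmp */
        }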
      
      Cc: stable@vger.kernel.org
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Link: http://lkml.kernel.org/r/20190304200748.1418790-1-arnd@arndb.de
      Fixes: d2a68c4effd8 ("x86/ftrace: Do not call function graph from dynamic trampolines")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 10 September 2019, 4 commits
  3. 06 September 2019, 6 commits
    • x86/ptrace: fix up botched merge of spectrev1 fix · b307f99d
      Authored by Greg Kroah-Hartman
      I incorrectly merged commit 31a2fbb390fe ("x86/ptrace: Fix possible
      spectre-v1 in ptrace_get_debugreg()") when backporting it, as was
      graciously pointed out at
      https://grsecurity.net/teardown_of_a_failed_linux_lts_spectre_fix.php
      
      Resolve the upstream difference with the stable kernel merge to properly
      protect things.
      Reported-by: Brad Spengler <spender@grsecurity.net>
      Cc: Dianzhang Chen <dianzhangchen0@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <bp@alien8.de>
      Cc: <hpa@zytor.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/apic: Include the LDR when clearing out APIC registers · edc454cd
      Authored by Bandan Das
      commit 558682b5291937a70748d36fd9ba757fb25b99ae upstream.
      
      Although APIC initialization will typically clear out the LDR before
      setting it, the APIC cleanup code should reset the LDR.
      
      This was discovered with a 32-bit KVM guest jumping into a kdump
      kernel. The stale bits in the LDR triggered a bug in the KVM APIC
      implementation which caused the destination mapping for VCPUs to be
      corrupted.
      
      Note that this isn't intended to paper over the KVM APIC bug. The kernel
      has to clear the LDR when resetting the APIC registers except when X2APIC
      is enabled.
      
      This lacks a Fixes tag because the missing LDR clearing goes way back
      into pre-git history.
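
      A sketch of the addition to clear_local_APIC(), per the upstream change
      (the LDR is left alone under X2APIC, where it is read-only):

        if (!x2apic_enabled()) {
            v = apic_read(APIC_LDR);
            apic_write(APIC_LDR, 0);
        }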
      
      [ tglx: Made x2apic_enabled a function call as required ]
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190826101513.5080-3-bsd@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/apic: Do not initialize LDR and DFR for bigsmp · 95983265
      Authored by Bandan Das
      commit bae3a8d3308ee69a7dbdf145911b18dfda8ade0d upstream.
      
      Legacy APIC init uses bigsmp for SMP systems with 8 or more CPUs. The
      bigsmp APIC implementation uses physical destination mode, but it
      nevertheless initializes LDR and DFR. The LDR even ends up with
      multiple bits set incorrectly.
      
      This does not cause a functional problem because LDR and DFR are ignored
      when physical destination mode is active, but it triggered a problem on a
      32-bit KVM guest which jumps into a kdump kernel.
      
      The multiple bits set unearthed a bug in the KVM APIC implementation. The
      code which creates the logical destination map for VCPUs ignores the
      disabled state of the APIC and ends up overwriting an existing valid entry
      and as a result, APIC calibration hangs in the guest during kdump
      initialization.
      
      Remove the bogus LDR/DFR initialization.
      
      This is not intended to work around the KVM APIC bug. The LDR/DFR
      initialization is wrong on its own.
      
      The issue goes back to pre-git history. The Fixes tag is the commit in
      the bitkeeper import which introduced bigsmp support in 2003.
      
        git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      
      Fixes: db7b9e9f26b8 ("[PATCH] Clustered APIC setup for >8 CPU systems")
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190826101513.5080-2-bsd@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • uprobes/x86: Fix detection of 32-bit user mode · 941d875c
      Authored by Sebastian Mayr
      commit 9212ec7d8357ea630031e89d0d399c761421c83b upstream.
      
      32-bit processes running on a 64-bit kernel are not always detected
      correctly, causing the process to crash when uretprobes are installed.
      
      The reason for the crash is that in_ia32_syscall() is used to determine the
      process's mode, which only works correctly when called from a syscall.
      
      In the case of uretprobes, however, the function is called from an
      exception and always returns 'false' on a 64-bit kernel. In consequence
      this leads to corruption of the process's return address.
      
      Fix this by using user_64bit_mode() instead of in_ia32_syscall(), which
      is correct in any situation.
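
      A sketch of the fix, per the upstream commit: derive the word size from
      the saved register state instead of the syscall-only TS_COMPAT flag:

        static inline int sizeof_long(struct pt_regs *regs)
        {
            /*
             * user_64bit_mode() is valid in any context, including the
             * exception path that uretprobes run in; in_ia32_syscall()
             * is only meaningful inside a syscall.
             */
            return user_64bit_mode(regs) ? 8 : 4;
        }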
      
      [ tglx: Add a comment and the following historical info ]
      
      This should have been detected by the rename which happened in commit
      
        abfb9498 ("x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()")
      
      which states in the changelog:
      
          The is_ia32_task()/is_x32_task() function names are a big misnomer: they
          suggests that the compat-ness of a system call is a task property, which
          is not true, the compatness of a system call purely depends on how it
          was invoked through the system call layer.
          .....
      
      and then it went and blindly renamed every call site.
      
      Sadly enough this was already mentioned here:
      
         8faaed1b ("uprobes/x86: Introduce sizeof_long(), cleanup adjust_ret_addr() and
      arch_uretprobe_hijack_return_addr()")
      
      where the changelog says:
      
          TODO: is_ia32_task() is not what we actually want, TS_COMPAT does
          not necessarily mean 32bit. Fortunately syscall-like insns can't be
          probed so it actually works, but it would be better to rename and
          use is_ia32_frame().
      
      and goes all the way back to:
      
          0326f5a9 ("uprobes/core: Handle breakpoint and singlestep exceptions")
      
      Oh well. 7+ years until someone actually tried a uretprobe on a 32bit
      process on a 64bit kernel....
      
      Fixes: 0326f5a9 ("uprobes/core: Handle breakpoint and singlestep exceptions")
      Signed-off-by: Sebastian Mayr <me@sam.st>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190728152617.7308-1-me@sam.st
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Don't update RIP or do single-step on faulting emulation · 3c2b4827
      Authored by Sean Christopherson
      commit 75ee23b30dc712d80d2421a9a547e7ab6e379b44 upstream.
      
      Don't advance RIP or inject a single-step #DB if emulation signals a
      fault.  This logic applies to all state updates that are conditional on
      clean retirement of the emulated instruction, e.g. updating RFLAGS was
      previously handled by commit 38827dbd ("KVM: x86: Do not update
      EFLAGS on faulting emulation").
      
      Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with
      ctxt->_eip until emulation "retires" anyways.  Skipping #DB injection
      fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to
      invalid state with EFLAGS.TF=1 would loop indefinitely due to emulation
      overwriting the #UD with #DB and thus restarting the bad SYSCALL over
      and over.
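
      A sketch of the gate in x86_emulate_instruction() after the fix (this
      mirrors the upstream patch): RIP, RFLAGS and the single-step #DB are
      only touched when the instruction retired cleanly:

        if (!ctxt->have_exception ||
            exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
            kvm_rip_write(vcpu, ctxt->eip);
            if (r == EMULATE_DONE && ctxt->tf)
                kvm_vcpu_do_singlestep(vcpu, &r);
            __kvm_set_rflags(vcpu, ctxt->eflags);
        }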
      
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: stable@vger.kernel.org
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Fixes: 663f4c61 ("KVM: x86: handle singlestep during emulation")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kvm: x86: skip populating logical dest map if apic is not sw enabled · 3ec35109
      Authored by Radim Krcmar
      commit b14c876b994f208b6b95c222056e1deb0a45de0e upstream.
      
      recalculate_apic_map does not sanitize the LDR, and it's possible that
      multiple bits are set. In that case, a previous valid entry can
      potentially be overwritten by an invalid one.
      
      This condition is hit when booting a 32 bit, >8 CPU, RHEL6 guest and then
      triggering a crash to boot a kdump kernel. This is the sequence of
      events:
      1. Linux boots in bigsmp mode and enables PhysFlat; however, it still
         writes to the LDR, which probably will never be used.
      2. When booting into kdump, the stale LDR values remain, as they are
         not cleared by the guest and there isn't an APIC reset.
      3. kdump boots with 1 cpu and uses Logical Destination Mode, but the
         logical map has been overwritten and points to an inactive vcpu.
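
      The upstream fix is a single guard in the recalculate_apic_map() loop;
      a sketch:

        kvm_for_each_vcpu(i, vcpu, kvm) {
            struct kvm_lapic *apic = vcpu->arch.apic;
            /* ... */
            if (!kvm_apic_sw_enabled(apic))
                continue;    /* ignore the stale LDR of a sw-disabled APIC */
            /* ... populate the logical destination map ... */
        }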
      Signed-off-by: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Bandan Das <bsd@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 29 August 2019, 6 commits
    • x86/boot: Fix boot regression caused by bootparam sanitizing · f7d157f3
      Authored by John Hubbard
      commit 7846f58fba964af7cb8cf77d4d13c33254725211 upstream.
      
      commit a90118c445cc ("x86/boot: Save fields explicitly, zero out everything
      else") had two errors:
      
          * It preserved boot_params.acpi_rsdp_addr, and
          * It failed to preserve boot_params.hdr
      
      Therefore, zero out acpi_rsdp_addr, and preserve hdr.
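
      A sketch of the corrected preserve list in sanitize_boot_params(),
      using the BOOT_PARAM_PRESERVE() helper from the commit below (other
      entries elided):

        const struct boot_params_to_save to_save[] = {
            BOOT_PARAM_PRESERVE(screen_info),
            /* ... */
            BOOT_PARAM_PRESERVE(hdr),        /* now preserved */
            /* acpi_rsdp_addr intentionally absent: it gets zeroed */
            BOOT_PARAM_PRESERVE(e820_table),
        };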
      
      Fixes: a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else")
      Reported-by: Neil MacLeod <neil@nmacleod.com>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Neil MacLeod <neil@nmacleod.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190821192513.20126-1-jhubbard@nvidia.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/boot: Save fields explicitly, zero out everything else · d9556011
      Authored by John Hubbard
      commit a90118c445cc7f07781de26a9684d4ec58bfcfd1 upstream.
      
      Recent gcc compilers (gcc 9.1) generate warnings about an out-of-bounds
      memset if the memset goes across several fields of a struct. This
      generated a couple of warnings on x86_64 builds in sanitize_boot_params().
      
      Fix this by explicitly saving the fields in struct boot_params
      that are intended to be preserved, and zeroing all the rest.
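
      The mechanism, sketched after the upstream patch: each field to keep is
      described by offset and length, copied into a scratch struct, and
      copied back after the whole boot_params has been zeroed:

        struct boot_params_to_save {
            unsigned int start;
            unsigned int len;
        };

        #define BOOT_PARAM_PRESERVE(struct_member)                         \
            {                                                              \
                .start = offsetof(struct boot_params, struct_member),      \
                .len   = sizeof(((struct boot_params *)0)->struct_member), \
            }

        for (i = 0; i < ARRAY_SIZE(to_save); i++)
            memcpy(save_base + to_save[i].start,
                   bp_base + to_save[i].start, to_save[i].len);

        memset(boot_params, 0, sizeof(*boot_params));
        memcpy(boot_params, &scratch, sizeof(scratch));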
      
      [ tglx: Tagged for stable as it breaks the warning free build there as well ]
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190731054627.5627-2-jhubbard@nvidia.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h · e063b03b
      Authored by Tom Lendacky
      commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24 upstream.
      
      There have been reports of RDRAND issues after resuming from suspend on
      some AMD family 15h and family 16h systems. This issue stems from a BIOS
      not performing the proper steps during resume to ensure RDRAND continues
      to function properly.
      
      RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
      reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
      support using CPUID, including the kernel, will believe that RDRAND is
      not supported.
      
      Update the CPU initialization to clear the RDRAND CPUID bit for any family
      15h and 16h processor that supports RDRAND. If it is known that the family
      15h or family 16h system does not have an RDRAND resume issue or that the
      system will not be placed in suspend, the "rdrand=force" kernel parameter
      can be used to stop the clearing of the RDRAND CPUID bit.
      
      Additionally, update the suspend and resume path to save and restore the
      MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
      place after resuming from suspend.
      
      Note that clearing the RDRAND CPUID bit does not prevent a processor
      that normally supports the RDRAND instruction from executing it. So any
      code that determined the support based on family and model won't #UD.
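
      A sketch of the clearing helper, following the upstream commit (MSR
      C001_1004 is MSR_AMD64_CPUID_FN_1 in the kernel; the "rdrand=force"
      cmdline handling sets rdrand_force):

        static void clear_rdrand_cpuid_bit(struct cpuinfo_x86 *c)
        {
            /* The user vouches for the BIOS: leave RDRAND advertised. */
            if (rdrand_force)
                return;

            /* Clearing MSR C001_1004[62] clears CPUID Fn00000001_ECX[30]. */
            msr_clear_bit(MSR_AMD64_CPUID_FN_1, 62);
            clear_cpu_cap(c, X86_FEATURE_RDRAND);
        }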
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
      Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "x86@kernel.org" <x86@kernel.org>
      Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/apic: Handle missing global clockevent gracefully · 685e598e
      Authored by Thomas Gleixner
      commit f897e60a12f0b9146357780d317879bce2a877dc upstream.
      
      Some newer machines do not advertise legacy timers. The kernel can handle
      that situation if the TSC and the CPU frequency are enumerated by CPUID or
      MSRs and the CPU supports TSC deadline timer. If the CPU does not support
      TSC deadline timer the local APIC timer frequency has to be known as well.
      
      Some Ryzen machines do not advertise legacy timers, and there is no
      reliable way to determine the bus frequency which feeds the local APIC
      timer when the machine allows overclocking of that frequency.
      
      As there is no legacy timer, the local APIC timer calibration crashes
      due to a NULL pointer dereference when accessing the not-installed
      global clockevent device.
      
      Switch the calibration loop to a non-interrupt-based one, which polls
      either the TSC (if its frequency is known) or jiffies. The latter
      requires a global clockevent. As the machines which lack a global
      clockevent have a known TSC frequency, this is a non-issue. For older
      machines where the TSC frequency is not known, there is no known case
      where the legacy timers do not exist, as that would have been reported
      long ago.
      Reported-by: Daniel Drake <drake@endlessm.com>
      Reported-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908091443030.21433@nanos.tec.linutronix.de
      Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1142926#c12
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386 · f9747104
      Authored by Sean Christopherson
      commit b63f20a778c88b6a04458ed6ffc69da953d3a109 upstream.
      
      Use 'lea' instead of 'add' when adjusting %rsp in CALL_NOSPEC so as to
      avoid clobbering flags.
      
      KVM's emulator makes indirect calls into a jump table of sorts, where
      the destination of the CALL_NOSPEC is a small blob of code that performs
      fast emulation by executing the target instruction with fixed operands.
      
        adcb_al_dl:
           0x000339f8 <+0>:   adc    %dl,%al
           0x000339fa <+2>:   ret
      
      A major motivation for doing fast emulation is to leverage the CPU to
      handle consumption and manipulation of arithmetic flags, i.e. RFLAGS is
      both an input and output to the target of CALL_NOSPEC.  Clobbering flags
      results in all sorts of incorrect emulation, e.g. Jcc instructions often
      take the wrong path.  Sans the nops...
      
        asm("push %[flags]; popf; " CALL_NOSPEC " ; pushf; pop %[flags]\n"
           0x0003595a <+58>:  mov    0xc0(%ebx),%eax
           0x00035960 <+64>:  mov    0x60(%ebx),%edx
           0x00035963 <+67>:  mov    0x90(%ebx),%ecx
           0x00035969 <+73>:  push   %edi
           0x0003596a <+74>:  popf
           0x0003596b <+75>:  call   *%esi
           0x000359a0 <+128>: pushf
           0x000359a1 <+129>: pop    %edi
           0x000359a2 <+130>: mov    %eax,0xc0(%ebx)
           0x000359b1 <+145>: mov    %edx,0x60(%ebx)
      
        ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
           0x000359a8 <+136>: mov    -0x10(%ebp),%eax
           0x000359ab <+139>: and    $0x8d5,%edi
           0x000359b4 <+148>: and    $0xfffff72a,%eax
           0x000359b9 <+153>: or     %eax,%edi
           0x000359bd <+157>: mov    %edi,0x4(%ebx)
      
      For the most part this has gone unnoticed as emulation of guest code
      that can trigger fast emulation is effectively limited to MMIO when
      running on modern hardware, and MMIO is rarely, if ever, accessed by
      instructions that affect or consume flags.
      
      Breakage is almost instantaneous when running with unrestricted guest
      disabled, in which case KVM must emulate all instructions when the guest
      has invalid state, e.g. when the guest is in Big Real Mode during early
      BIOS.
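
      The fix itself, sketched from the upstream change to the i386
      CALL_NOSPEC retpoline sequence: the stack adjustment becomes an lea,
      which does not touch the arithmetic flags:

        /* before: clobbers EFLAGS */
        addl   $4, %esp
        /* after: flags-preserving */
        lea    4(%esp), %esp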
      
      Fixes: 776b043848fd2 ("x86/retpoline: Add initial retpoline support")
      Fixes: 1a29b5b7 ("KVM: x86: Make indirect calls in emulator speculation safe")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190822211122.27579-1-sean.j.christopherson@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/lib/cpu: Address missing prototypes warning · 923de016
      Authored by Valdis Klētnieks
      [ Upstream commit 04f5bda84b0712d6f172556a7e8dca9ded5e73b9 ]
      
      When building with W=1, warnings about missing prototypes are emitted:
      
        CC      arch/x86/lib/cpu.o
      arch/x86/lib/cpu.c:5:14: warning: no previous prototype for 'x86_family' [-Wmissing-prototypes]
          5 | unsigned int x86_family(unsigned int sig)
            |              ^~~~~~~~~~
      arch/x86/lib/cpu.c:18:14: warning: no previous prototype for 'x86_model' [-Wmissing-prototypes]
         18 | unsigned int x86_model(unsigned int sig)
            |              ^~~~~~~~~
      arch/x86/lib/cpu.c:33:14: warning: no previous prototype for 'x86_stepping' [-Wmissing-prototypes]
         33 | unsigned int x86_stepping(unsigned int sig)
            |              ^~~~~~~~~~~~
      
      Add the proper include file so the prototypes are there.
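
      The fix is a one-line include in arch/x86/lib/cpu.c (per the upstream
      commit, asm/cpu.h carries the prototypes):

        #include <asm/cpu.h>    /* x86_family(), x86_model(), x86_stepping() */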
      Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/42513.1565234837@turing-police
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  5. 16 August 2019, 5 commits
  6. 07 August 2019, 3 commits
    • x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS · b88241ae
      Authored by Thomas Gleixner
      commit f36cf386e3fec258a341d446915862eded3e13d8 upstream
      
      Intel provided the following information:
      
       On all current Atom processors, instructions that use a segment register
       value (e.g. a load or store) will not speculatively execute before the
       last writer of that segment retires. Thus they will not use a
       speculatively written segment value.
      
      That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS
      entry paths can be excluded from the extra LFENCE if PTI is disabled.
      
      Create a separate bug flag for speculation through SWAPGS and mark all
      out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order
      ATOMs are excluded from the whole mitigation mess anyway.
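
      A sketch of how the whitelist is consulted (the upstream change adds a
      NO_SWAPGS flag to the cpu_vuln_whitelist table in common.c):

        /* In cpu_set_bug_bits(): */
        if (!cpu_matches(NO_SWAPGS))
            setup_force_cpu_bug(X86_BUG_SWAPGS);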
      Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/entry/64: Use JMP instead of JMPQ · 931b6bfe
      Authored by Josh Poimboeuf
      commit 64dbc122b20f75183d8822618c24f85144a5a94d upstream
      
      Somehow the swapgs mitigation entry code patch ended up with a JMPQ
      instruction instead of JMP, where only the short jump is needed.  Some
      assembler versions apparently fail to optimize JMPQ into a two-byte JMP
      when possible, instead always using a 7-byte JMP with relocation.  For
      some reason that makes the entry code explode with a #GP during boot.
      
      Change it back to "JMP" as originally intended.
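
      An illustrative before/after (the label is hypothetical; the point is
      the mnemonic, which lets the assembler emit a two-byte short jump):

        /* before: some assemblers emit a 7-byte jump with relocation */
        jmpq   .Ldone
        /* after: assembles to a two-byte short jump */
        jmp    .Ldone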
      
      Fixes: 18ec54fdd6d1 ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations")
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation: Enable Spectre v1 swapgs mitigations · 23e7a7b3
      Authored by Josh Poimboeuf
      commit a2059825986a1c8143fd6698774fa9d83733bb11 upstream
      
      The previous commit added macro calls in the entry code which mitigate the
      Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are
      enabled.  Enable those features where applicable.
      
      The mitigations may be disabled with "nospectre_v1" or "mitigations=off".
      
      There are different features which can affect the risk of attack:
      
      - When FSGSBASE is enabled, unprivileged users are able to place any
        value in GS, using the wrgsbase instruction.  This means they can
        write a GS value which points to any value in kernel space, which can
        be useful with the following gadget in an interrupt/exception/NMI
        handler:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	// dependent load or store based on the value of %reg1
      	// for example: mov %(reg1), %reg2
      
        If an interrupt is coming from user space, and the entry code
        speculatively skips the swapgs (due to user branch mistraining), it
        may speculatively execute the GS-based load and a subsequent dependent
        load or store, exposing the kernel data to an L1 side channel leak.
      
        Note that, on Intel, a similar attack exists in the above gadget when
        coming from kernel space, if the swapgs gets speculatively executed to
        switch back to the user GS.  On AMD, this variant isn't possible
        because swapgs is serializing with respect to future GS-based
        accesses.
      
        NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case
      	doesn't exist quite yet.
      
      - When FSGSBASE is disabled, the issue is mitigated somewhat because
        unprivileged users must use prctl(ARCH_SET_GS) to set GS, which
        restricts GS values to user space addresses only.  That means the
        gadget would need an additional step, since the target kernel address
        needs to be read from user space first.  Something like:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	mov (%reg1), %reg2
      	// dependent load or store based on the value of %reg2
      	// for example: mov %(reg2), %reg3
      
        It's difficult to audit for this gadget in all the handlers, so while
        there are no known instances of it, it's entirely possible that it
        exists somewhere (or could be introduced in the future).  Without
        tooling to analyze all such code paths, consider it vulnerable.
      
        Effects of SMAP on the !FSGSBASE case:
      
        - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not
          susceptible to Meltdown), the kernel is prevented from speculatively
          reading user space memory, even L1 cached values.  This effectively
          disables the !FSGSBASE attack vector.
      
        - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP
          still prevents the kernel from speculatively reading user space
          memory.  But it does *not* prevent the kernel from reading the
          user value from L1, if it has already been cached.  This is probably
          only a small hurdle for an attacker to overcome.
      
      Thanks to Dave Hansen for contributing the speculative_smap() function.
      
      Thanks to Andrew Cooper for providing the inside scoop on whether swapgs
      is serializing on AMD.
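
      A sketch of how the feature bits get set in
      spectre_v1_select_mitigation(), following the upstream patch (the
      FENCE_SWAPGS_*_ENTRY alternatives were added by the previous commit):

        if (!smap_works_speculatively()) {
            /*
             * Mitigation can come from SWAPGS itself or from PTI (the
             * CR3 write in the Meltdown mitigation is serializing).
             */
            if (boot_cpu_has_bug(X86_BUG_SWAPGS) &&
                !boot_cpu_has(X86_FEATURE_PTI))
                setup_force_cpu_cap(X86_FEATURE_FENCE_SWAPGS_USER);

            /*
             * Enable lfences in the kernel entry (non-swapgs) paths,
             * to prevent user entry from speculatively skipping swapgs.
             */
            setup_force_cpu_cap(X86_FEATURE_FENCE_SWAPGS_KERNEL);
        }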
      
      [ tglx: Fixed the USER fence decision and polished the comment as suggested
        	by Dave Hansen ]
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>