1. 30 April 2019, 14 commits
  2. 26 April 2019, 1 commit
    • x86/mm/tlb: Remove 'struct flush_tlb_info' from the stack · 3db6d5a5
      Committed by Nadav Amit
      Move the flush_tlb_info variables off the stack. This allows
      flush_tlb_info to be cache-line aligned, avoiding potentially
      unnecessary cache-line movements, and gives the variables a fixed
      virtual-to-physical translation, which reduces TLB misses.
      
      Use a per-CPU struct for flush_tlb_mm_range() and
      flush_tlb_kernel_range(), and add debug assertions to ensure there are
      no nested TLB flushes that might overwrite the per-CPU data. For
      arch_tlbbatch_flush(), use a const struct.
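      A rough kernel-style sketch of the per-CPU scheme and the nesting
      assertion described above (helper names and details here are
      illustrative, not the commit's exact hunks):

        #include <linux/mmdebug.h>
        #include <linux/percpu.h>
        #include <asm/tlbflush.h>       /* struct flush_tlb_info */

        static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
        /* Debug-only nesting counter (guarded by CONFIG_DEBUG_VM in practice). */
        static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);

        /* Caller must have preemption disabled; TLB flushes must not nest. */
        static struct flush_tlb_info *get_flush_tlb_info(void)
        {
                VM_WARN_ON_ONCE(this_cpu_inc_return(flush_tlb_info_idx) != 1);
                return this_cpu_ptr(&flush_tlb_info);
        }

        static void put_flush_tlb_info(void)
        {
                VM_WARN_ON_ONCE(this_cpu_dec_return(flush_tlb_info_idx) != 0);
        }
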
      
      Results from running a microbenchmark that performs 10^6 MADV_DONTNEED
      operations and touches a page, while 3 additional threads run a
      busy-wait loop (5 runs, PTI and retpolines turned off; a sketch of such
      a benchmark follows the table):
      
                        base       off-stack
                        ----       ---------
        avg (usec/op)   1.629      1.570   (-3%)
        stddev          0.014      0.009
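
      A hedged user-space sketch of such a microbenchmark (simplified: a
      single measurement loop plus busy-wait threads; not the exact benchmark
      used for the numbers above):

        /* Build: gcc -O2 -pthread madv_bench.c */
        #include <pthread.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <time.h>

        #define ITERS 1000000
        #define NSPIN 3

        static volatile int stop;

        static void *spin(void *arg)
        {
                while (!stop)
                        ;               /* keep another CPU busy */
                return NULL;
        }

        int main(void)
        {
                pthread_t t[NSPIN];
                struct timespec a, b;
                long psz = 4096;
                char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                int i;

                if (p == MAP_FAILED)
                        return 1;
                for (i = 0; i < NSPIN; i++)
                        pthread_create(&t[i], NULL, spin, NULL);

                clock_gettime(CLOCK_MONOTONIC, &a);
                for (i = 0; i < ITERS; i++) {
                        madvise(p, psz, MADV_DONTNEED);  /* zap the page, flush the TLB */
                        p[0] = 1;                        /* fault it back in */
                }
                clock_gettime(CLOCK_MONOTONIC, &b);

                stop = 1;
                for (i = 0; i < NSPIN; i++)
                        pthread_join(t[i], NULL);

                printf("%.3f usec/op\n",
                       ((b.tv_sec - a.tv_sec) * 1e9 + b.tv_nsec - a.tv_nsec) / ITERS / 1e3);
                return 0;
        }
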
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190425230143.7008-1-namit@vmware.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3db6d5a5
  3. 25 April 2019, 3 commits
  4. 24 April 2019, 3 commits
    • x86/build: Move _etext to actual end of .text · 392bef70
      Committed by Kees Cook
      When building x86 with Clang LTO and CFI, CFI jump regions are
      automatically added to the end of the .text section late in linking. As
      a result, the _etext symbol was placed before the appended jump regions,
      causing confusion about where the boundaries of the executable region
      actually are in the running kernel, and breaking at least the fault
      injection code. Move the _etext mark to outside (and immediately after)
      the .text area, as is already the case on other architectures
      (e.g. arm64, arm).
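      Why the placement matters, in a minimal hedged sketch (illustrative
      helper, not a quote of the fault-injection code): range checks of this
      shape treat anything past _etext as non-text, so code the linker
      appends after a prematurely placed _etext is misclassified:

        #include <linux/types.h>
        #include <asm/sections.h>       /* _stext, _etext */

        /* Anything outside [_stext, _etext) is not considered kernel text. */
        static bool addr_is_kernel_text(unsigned long addr)
        {
                return addr >= (unsigned long)_stext &&
                       addr <  (unsigned long)_etext;
        }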
      Reported-and-tested-by: Sami Tolvanen <samitolvanen@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20190423183827.GA4012@beast
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      392bef70
    • x86/mm: Remove in_nmi() warning from 64-bit implementation of vmalloc_fault() · a65c88e1
      Committed by Jiri Kosina
      In-NMI warnings were added to vmalloc_fault() via:
      
        ebc8827f ("x86: Barf when vmalloc and kmemcheck faults happen in NMI")
      
      back when our NMI entry code could not cope with nested NMIs.
      
      These days, it's perfectly fine to take a fault in NMI context and we
      don't have to care about the fact that IRET from the fault handler might
      cause NMI nesting.
      
      This warning has already been removed from the 32-bit implementation of
      vmalloc_fault() in:
      
        6863ea0c ("x86/mm: Remove in_nmi() warning from vmalloc_fault()")
      
      but the 64-bit version was omitted.
      
      Remove the bogus warning from the 64-bit implementation of vmalloc_fault() as well.
      Reported-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 6863ea0c ("x86/mm: Remove in_nmi() warning from vmalloc_fault()")
      Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1904240902280.9803@cbobk.fhfr.pm
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a65c88e1
    • x86/mm: Fix a crash with kmemleak_scan() · 0d02113b
      Committed by Qian Cai
      The first kmemleak_scan() call after boot would trigger the crash below
      because this callpath:
      
        kernel_init
          free_initmem
            mem_encrypt_free_decrypted_mem
              free_init_pages
      
      unmaps memory inside the .bss when DEBUG_PAGEALLOC=y.
      
      kmemleak_init() will register the .data/.bss sections and then
      kmemleak_scan() will scan those addresses and dereference them looking
      for pointer references. If free_init_pages() frees and unmaps pages in
      those sections, kmemleak_scan() will crash when it dereferences one of
      those addresses:
      
        BUG: unable to handle kernel paging request at ffffffffbd402000
        CPU: 12 PID: 325 Comm: kmemleak Not tainted 5.1.0-rc4+ #4
        RIP: 0010:scan_block
        Call Trace:
         scan_gray_list
         kmemleak_scan
         kmemleak_scan_thread
         kthread
         ret_from_fork
      
      Since kmemleak_free_part() is tolerant of unknown objects (not tracked
      by kmemleak), it is fine to call it from free_init_pages() even if not
      all address ranges passed to this function are known to kmemleak.
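      A rough sketch of the resulting call (the function body is abbreviated;
      this is not the exact hunk):

        #include <linux/kmemleak.h>

        static void free_init_pages(const char *what, unsigned long begin,
                                    unsigned long end)
        {
                /* Tell kmemleak to forget this range before it is unmapped
                 * and freed; ranges kmemleak never tracked are ignored. */
                kmemleak_free_part((void *)begin, end - begin);

                /* ... existing code that poisons, unmaps and frees the pages ... */
        }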
      
       [ bp: Massage. ]
      
      Fixes: b3f0907c ("x86/mm: Add .bss..decrypted section to hold shared variables")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190423165811.36699-1-cai@lca.pw
      0d02113b
  5. 22 April 2019, 2 commits
    • x86/boot: Disable RSDP parsing temporarily · 36f0c423
      Committed by Borislav Petkov
      The original intention behind moving RSDP parsing very early, before
      KASLR does its range selection, was to accommodate machines with
      movable memory regions (CONFIG_MEMORY_HOTREMOVE) so that they can
      still do memory hotplug.
      
      However, that broke kexec'ing a kernel on EFI machines because depending
      on where the EFI systab was mapped, on at least one machine it isn't
      present in the kexec mapping of the second kernel, leading to a triple
      fault in the early code.
      
      Fixing this properly requires significantly involved surgery, and we
      cannot allow ourselves to do that this close to the merge window.
      
      So disable the RSDP parsing code temporarily until it is fixed properly
      in the next release cycle.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chao Fan <fanc.fnst@cn.fujitsu.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: indou.takao@jp.fujitsu.com
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kasong@redhat.com
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: msys.mizuma@gmail.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190419141952.GE10324@zn.tnic
      36f0c423
    • x86/fault: Make fault messages more succinct · ea2f8d60
      Committed by Borislav Petkov
      Since we are going to be staring at these for years to come, let's make
      them more succinct. In particular:
      
       - change "address = " to "address: "
      
       - "-privileged" reads funny. It should be simply "kernel" or "user"
      
       - "from kernel code" reads funny too. "kernel mode" or "user mode" is
         more natural.
      
      An actual example says more than 1000 words, of course:
      
        [    0.248370] BUG: kernel NULL pointer dereference, address: 00000000000005b8
        [    0.249120] #PF: supervisor write access in kernel mode
        [    0.249717] #PF: error_code(0x0002) - not-present page
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave.hansen@linux.intel.com
      Cc: luto@kernel.org
      Cc: riel@surriel.com
      Cc: sean.j.christopherson@intel.com
      Cc: yu-cheng.yu@intel.com
      Link: http://lkml.kernel.org/r/20190421183524.GC6048@zn.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ea2f8d60
  6. 20 April 2019, 3 commits
    • x86/fault: Decode and print #PF oops in human readable form · 18ea35c5
      Committed by Sean Christopherson
      Linus pointed out that deciphering the raw #PF error code and printing
      a more human readable message are two different things, and also that
      printing the negative cases is mostly just noise[1].  For example, the
      USER bit doesn't mean the fault originated in user code and stating
      that an oops wasn't due to a protection keys violation isn't interesting
      since an oops on a keys violation is a one-in-a-million scenario.
      
      Remove the per-bit decoding of the error code and instead print:
        - the raw error code
        - why the fault occurred
        - the effective privilege level of the access
        - the type of access
        - whether the fault originated in user code or kernel code
      
      This provides the user with the information needed to triage 99.9% of
      oopses without polluting the log with useless information or conflating
      the error_code with the CPL.
      
      Sample output:
      
          BUG: kernel NULL pointer dereference, address = 0000000000000008
          #PF: supervisor-privileged instruction fetch from kernel code
          #PF: error_code(0x0010) - not-present page
      
          BUG: unable to handle page fault for address = ffffbeef00000000
          #PF: supervisor-privileged instruction fetch from kernel code
          #PF: error_code(0x0010) - not-present page
      
          BUG: unable to handle page fault for address = ffffc90000230000
          #PF: supervisor-privileged write access from kernel code
          #PF: error_code(0x000b) - reserved bit violation
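
      For reference, a hedged user-space sketch of this style of decoding
      (the X86_PF_* names and bit positions mirror the kernel's #PF error
      code flags; the strings only approximate the kernel's output):

        #include <stdio.h>

        #define X86_PF_PROT  (1 << 0)  /* 0: not-present page, 1: protection violation */
        #define X86_PF_WRITE (1 << 1)  /* access was a write */
        #define X86_PF_USER  (1 << 2)  /* access originated at CPL 3 */
        #define X86_PF_RSVD  (1 << 3)  /* reserved bit set in a paging structure */
        #define X86_PF_INSTR (1 << 4)  /* instruction fetch */

        static void decode_pf(unsigned long ec)
        {
                printf("#PF: %s %s\n",
                       ec & X86_PF_USER  ? "user-privileged" : "supervisor-privileged",
                       ec & X86_PF_INSTR ? "instruction fetch" :
                       ec & X86_PF_WRITE ? "write access" : "read access");
                printf("#PF: error_code(0x%04lx) - %s\n", ec,
                       ec & X86_PF_RSVD ? "reserved bit violation" :
                       ec & X86_PF_PROT ? "protection violation" : "not-present page");
        }

        int main(void)
        {
                decode_pf(0x0010);  /* supervisor instruction fetch of a not-present page */
                decode_pf(0x000b);  /* reserved bit violation on a supervisor write */
                return 0;
        }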
      
      [1] https://lkml.kernel.org/r/CAHk-=whk_fsnxVMvF1T2fFCaP2WrvSybABrLQCWLJyCvHw6NKA@mail.gmail.com
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/20181221213657.27628-3-sean.j.christopherson@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      18ea35c5
    • x86/fault: Reword initial BUG message for unhandled page faults · f28b11a2
      Committed by Sean Christopherson
      Reword the NULL pointer dereference case to simply state that a NULL
      pointer was dereferenced, i.e. drop "unable to handle" as that implies
      that there are instances where the kernel actually does handle NULL
      pointer dereferences, which is not true barring funky exception fixup.
      
      For the non-NULL case, replace "kernel paging request" with "page fault"
      as the kernel can technically oops on faults that originated in user
      code.  Dropping "kernel" also allows future patches to provide detailed
      information on where the fault occurred, e.g. user vs. kernel, without
      conflicting with the initial BUG message.
      
      In both cases, replace "at address=" with wording more appropriate to
      the oops, as "at" may be interpreted as stating that the address is the
      RIP of the instruction that faulted.
      
      Last, and probably least, further qualify the NULL-pointer path by
      checking that the fault actually originated in kernel code.  It's
      technically possible for userspace to map address 0, and not printing
      a super specific message is the least of our worries if the kernel does
      manage to oops on an actual NULL pointer dereference from userspace.
      
      Before:
          BUG: unable to handle kernel NULL pointer dereference at ffffbeef00000000
          BUG: unable to handle kernel paging request at ffffbeef00000000
      
      After:
          BUG: kernel NULL pointer dereference, address = 0000000000000008
          BUG: unable to handle page fault for address = ffffbeef00000000
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/20181221213657.27628-2-sean.j.christopherson@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f28b11a2
    • x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority · 2ee27796
      Committed by Hans de Goede
      The "ENERGY_PERF_BIAS: Set to 'normal', was 'performance'" message triggers
      on pretty much every Intel machine. The purpose of log messages with
      a warning level is to notify the user of something which potentially is
      a problem, or at least somewhat unexpected.
      
      This message clearly does not match those criteria, so lower its log
      priority from warning to info.
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181230172715.17469-1-hdegoede@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2ee27796
  7. 19 April 2019, 3 commits
  8. 18 April 2019, 2 commits
    • perf/x86/amd: Add event map for AMD Family 17h · 3fe3331b
      Committed by Kim Phillips
      Family 17h differs from prior families in that it:
      
       - does not support an L2 cache miss event
       - has re-enumerated PMC counters for:
         - L2 cache references
         - front- and back-end stalled cycles
      
      So we add a new amd_f17h_perfmon_event_map[] so that the generic
      perf event names will resolve to the correct h/w events on
      family 17h and above processors.
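      
      A hedged sketch of how such a selection typically looks (the family
      check and array names follow the commit subject; the actual event codes
      come from the PPR referenced below and are not reproduced here):
      
        static u64 amd_pmu_event_map(int hw_event)
        {
                /* Family 17h and later use the re-enumerated event map. */
                if (boot_cpu_data.x86 >= 0x17)
                        return amd_f17h_perfmon_event_map[hw_event];

                return amd_perfmon_event_map[hw_event];
        }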
      
      Reference sections 2.1.13.3.3 (stalls) and 2.1.13.3.6 (L2):
      
        https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Cc: <stable@vger.kernel.org> # v4.9+
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Liška <mliska@suse.cz>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors")
      [ Improved the formatting a bit. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3fe3331b
    • x86/mm/KASLR: Fix the size of the direct mapping section · ec393710
      Committed by Baoquan He
      kernel_randomize_memory() uses __PHYSICAL_MASK_SHIFT to calculate
      the maximum amount of system RAM supported. The size of the direct
      mapping section is taken as the smaller of the two values below:
      
        (actual system RAM size + padding size) vs (max system RAM size supported)
      
      This calculation is wrong since commit
      
        b83ce5ee ("x86/mm/64: Make __PHYSICAL_MASK_SHIFT always 52").
      
      In it, __PHYSICAL_MASK_SHIFT was changed to be 52, regardless of whether
      the kernel is using 4-level or 5-level page tables. Thus, it will always
      use 4 PB as the maximum amount of system RAM, even in 4-level paging
      mode where it should actually be 64 TB.
      
      Thus, the size of the direct mapping section will always
      be the sum of the actual system RAM size plus the padding size.
      
      Even when the amount of system RAM is 64 TB, the following layout will
      still be used. Obviously KASLR will be weakened significantly.
      
         |____|_______actual RAM_______|_padding_|______the rest_______|
         0            64TB                                            ~120TB
      
      Instead, it should be like this:
      
         |____|_______actual RAM_______|_________the rest______________|
         0            64TB                                            ~120TB
      
      The size of padding region is controlled by
      CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING, which is 10 TB by default.
      
      The above issue only exists when
      CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING is set to a non-zero value,
      which is the case when CONFIG_MEMORY_HOTPLUG is enabled. Otherwise,
      using __PHYSICAL_MASK_SHIFT doesn't affect KASLR.
      
      Fix it by replacing __PHYSICAL_MASK_SHIFT with MAX_PHYSMEM_BITS.
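      
      A hedged sketch of the sizing logic in question (simplified user-space
      arithmetic only, not the kernel's code; with 4-level paging
      MAX_PHYSMEM_BITS is 46, i.e. 64 TB, whereas __PHYSICAL_MASK_SHIFT is
      always 52, i.e. 4 PB, which is why the cap never kicks in):
      
        #include <stdio.h>

        static unsigned long direct_map_size_tb(unsigned long ram_tb,
                                                unsigned long padding_tb,
                                                unsigned int physmem_bits)
        {
                unsigned long max_tb  = 1UL << (physmem_bits - 40); /* bytes -> TB */
                unsigned long want_tb = ram_tb + padding_tb;

                return want_tb < max_tb ? want_tb : max_tb;
        }

        int main(void)
        {
                /* 64 TB of RAM, 10 TB padding (the default). */
                printf("shift 52: %lu TB\n", direct_map_size_tb(64, 10, 52)); /* 74: cap never hits */
                printf("shift 46: %lu TB\n", direct_map_size_tb(64, 10, 46)); /* 64: capped correctly */
                return 0;
        }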
      
       [ bp: Massage commit message. ]
      
      Fixes: b83ce5ee ("x86/mm/64: Make __PHYSICAL_MASK_SHIFT always 52")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Thomas Garnier <thgarnie@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: frank.ramsay@hpe.com
      Cc: herbert@gondor.apana.org.au
      Cc: kirill@shutemov.name
      Cc: mike.travis@hpe.com
      Cc: thgarnie@google.com
      Cc: x86-ml <x86@kernel.org>
      Cc: yamada.masahiro@socionext.com
      Link: https://lkml.kernel.org/r/20190417083536.GE7065@MiWiFi-R3L-srv
      ec393710
  9. 16 April 2019, 9 commits
    • KVM: x86: avoid misreporting level-triggered irqs as edge-triggered in tracing · 7a223e06
      Committed by Vitaly Kuznetsov
      In the __apic_accept_irq() interface trig_mode is an int, and on some
      code paths it is actually set to values that do not fit in a u8:
      
      kvm_apic_set_irq() extracts it from 'struct kvm_lapic_irq', where
      trig_mode is u16. This is done on purpose, as e.g. kvm_set_msi_irq()
      sets it to (1 << 15) & e->msi.data, and
      
      kvm_apic_local_deliver() sets it to reg & (1 << 15).
      
      Fix the immediate issue by making 'tm' a u16. We may also want to
      adjust the __apic_accept_irq() interface and use proper sizes for
      vector, level and trig_mode, but this is not urgent.
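      
      A small user-space illustration of the truncation being fixed
      (standalone sketch; only the bit layout of the MSI data word is taken
      from the text above):
      
        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint32_t msi_data = 1u << 15;          /* level-triggered interrupt */
                int trig_mode = msi_data & (1 << 15);  /* 0x8000, as extracted above */

                uint8_t  tm8  = trig_mode;  /* truncates to 0: traced as edge-triggered */
                uint16_t tm16 = trig_mode;  /* keeps 0x8000: traced as level-triggered */

                printf("u8 tm = %u, u16 tm = %u\n", tm8, tm16);
                return 0;
        }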
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7a223e06
    • KVM: fix spectrev1 gadgets · 1d487e9b
      Committed by Paolo Bonzini
      These were found with smatch, and then generalized when applicable.
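      
      The commit message does not spell out the pattern, but the usual
      Spectre-v1 hardening in KVM looks roughly like this (illustrative
      array/index names; a sketch rather than one of the actual hunks):
      
        #include <linux/errno.h>
        #include <linux/kernel.h>
        #include <linux/nospec.h>

        static u32 table[16];

        /* Clamp a guest-controlled index before using it to address the
         * array, so it cannot be speculated out of bounds. */
        static int read_entry(unsigned long idx, u32 *val)
        {
                if (idx >= ARRAY_SIZE(table))
                        return -EINVAL;

                idx = array_index_nospec(idx, ARRAY_SIZE(table));
                *val = table[idx];
                return 0;
        }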
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1d487e9b
    • KVM: x86: fix warning Using plain integer as NULL pointer · be43c440
      Committed by Hariprasad Kelam
      Change the argument passed from 0 to NULL, which resolves the sparse warning below:
      
      arch/x86/kvm/x86.c:3096:61: warning: Using plain integer as NULL pointer
      Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      be43c440
    • KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernels · b68f3cc7
      Committed by Sean Christopherson
      Invoking the 64-bit variation on a 32-bit kernel will crash the guest,
      trigger a WARN, and/or lead to a buffer overrun in the host, e.g.
      rsm_load_state_64() writes r8-r15 unconditionally, but enum kvm_reg and
      thus x86_emulate_ctxt._regs only define r8-r15 for CONFIG_X86_64.
      
      KVM allows userspace to report long mode support via CPUID, even though
      the guest is all but guaranteed to crash if it actually tries to enable
      long mode.  But, a pure 32-bit guest that is ignorant of long mode will
      happily plod along.
      
      SMM complicates things as 64-bit CPUs use a different SMRAM save state
      area.  KVM handles this correctly for 64-bit kernels, e.g. uses the
      legacy save state map if userspace has hidden long mode from the guest,
      but doesn't fare well when userspace reports long mode support on a
      32-bit host kernel (32-bit KVM doesn't support 64-bit guests).
      
      Since the alternative is to crash the guest, e.g. by not loading state
      or explicitly requesting shutdown, unconditionally use the legacy SMRAM
      save state map for 32-bit KVM.  If a guest has managed to get far enough
      to handle SMIs when running under a weird/buggy userspace hypervisor,
      then don't deliberately crash the guest since there are no downsides
      (from KVM's perspective) to allowing it to continue running.
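      
      A hedged sketch of the resulting dispatch (guest_supports_long_mode()
      and the buffer type are illustrative placeholders; in the real code the
      64-bit path is compiled out entirely on 32-bit builds):
      
        static int rsm_load_state(struct x86_emulate_ctxt *ctxt, const char *smram)
        {
        #ifdef CONFIG_X86_64
                if (guest_supports_long_mode(ctxt))
                        return rsm_load_state_64(ctxt, smram);
        #endif
                /* 32-bit KVM, and guests without long mode, always use the
                 * legacy (32-bit) SMRAM save state map. */
                return rsm_load_state_32(ctxt, smram);
        }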
      
      Fixes: 660a5d51 ("KVM: x86: save/load state on SMM switch")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b68f3cc7
    • KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU · 8f4dc2e7
      Committed by Sean Christopherson
      Neither AMD nor Intel CPUs have an EFER field in the legacy SMRAM save
      state area, i.e. don't save/restore EFER across SMM transitions.  KVM
      somewhat models this, e.g. doesn't clear EFER on entry to SMM if the
      guest doesn't support long mode.  But during RSM, KVM unconditionally
      clears EFER so that it can get back to pure 32-bit mode in order to
      start loading CRs with their actual non-SMM values.
      
      Clear EFER only when it will be written when loading the non-SMM state
      so as to preserve bits that can theoretically be set on 32-bit vCPUs,
      e.g. KVM always emulates EFER_SCE.
      
      And because CR4.PAE is cleared only to play nice with EFER, wrap that
      code in the long mode check as well.  Note, this may result in a
      compiler warning about cr4 being consumed uninitialized.  Re-read CR4
      even though it's technically unnecessary, as doing so allows for more
      readable code and RSM emulation is not a performance critical path.
      
      Fixes: 660a5d51 ("KVM: x86: save/load state on SMM switch")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8f4dc2e7
    • KVM: x86: clear SMM flags before loading state while leaving SMM · 9ec19493
      Committed by Sean Christopherson
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Stop dancing around the issue of HF_SMM_MASK being set when
      loading SMSTATE into architectural state, e.g. by toggling it for
      problematic flows, and simply clear HF_SMM_MASK prior to loading
      architectural state (from SMRAM save state area).
      Reported-by: Jon Doron <arilou@gmail.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Fixes: 5bea5123 ("KVM: VMX: check nested state and CR4.VMXE against SMM")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9ec19493
    • KVM: x86: Open code kvm_set_hflags · c5833c7a
      Committed by Sean Christopherson
      Prepare for clearing HF_SMM_MASK prior to loading state from the SMRAM
      save state map, i.e. kvm_smm_changed() needs to be called after state
      has been loaded and so cannot be done automatically when setting
      hflags from RSM.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c5833c7a
    • KVM: x86: Load SMRAM in a single shot when leaving SMM · ed19321f
      Committed by Sean Christopherson
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
      when loading SMSTATE into architectural state, ideally RSM emulation
      itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
      architectural state.
      
      Ostensibly, the only motivation for having HF_SMM_MASK set throughout
      the loading of state from the SMRAM save state area is so that the
      memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
      all of the SMRAM save state area from guest memory at the beginning of
      RSM emulation, and load state from the buffer instead of reading guest
      memory one-by-one.
      
      This paves the way for clearing HF_SMM_MASK prior to loading state,
      and also aligns RSM with the enter_smm() behavior, which fills a
      buffer and writes SMRAM save state in a single go.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed19321f
    • KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU · e51bfdb6
      Committed by Liran Alon
      The issue was discovered when running kvm-unit-tests on KVM running as
      L1 on top of Hyper-V.
      
      When the vmx_instruction_intercept unit test attempts to run RDPMC to
      test RDPMC-exiting, it is intercepted by L1 KVM, whose
      EXIT_REASON_RDPMC handler raises #GP because the vCPU exposed by
      Hyper-V doesn't support a PMU, instead of the exit being reflected to
      the unit test as EXIT_REASON_RDPMC as it expects.
      
      The reason the vmx_instruction_intercept unit test attempts to run
      RDPMC even though Hyper-V doesn't support a PMU is that L1 exposes
      RDPMC-exiting support to L2, which is reasonable to assume is supported
      only if the CPU supports a PMU to begin with.
      
      The above issue can easily be simulated by modifying the
      vmx_instruction_intercept config in x86/unittests.cfg to run QEMU with
      "-cpu host,+vmx,-pmu" and running the unit test.
      
      To handle the issue, change KVM to expose RDPMC-exiting only when the
      guest supports a PMU.
      Reported-by: Saar Amar <saaramar@microsoft.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e51bfdb6