1. 25 11月, 2019 1 次提交
    • I
      x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the... · 05b042a1
      Ingo Molnar 提交于
      x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise
      
      When two recent commits that increased the size of the 'struct cpu_entry_area'
      were merged in -tip, the 32-bit defconfig build started failing on the following
      build time assert:
      
        ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
        arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’
        In function ‘setup_cpu_entry_area_ptes’,
      
      Which corresponds to the following build time assert:
      
      	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
      
      The purpose of this assert is to sanity check the fixed-value definition of
      CPU_ENTRY_AREA_PAGES arch/x86/include/asm/pgtable_32_types.h:
      
      	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 41)
      
      The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value
      we didn't want to define in such a low level header, because it would cause
      dependency hell.
      
      Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES
      accordingly - and this assert is checking that constraint.
      
      But the assert is both imprecise and buggy, primarily because it doesn't
      include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE
      (which begins at a PMD boundary).
      
      This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined
      too large upstream (v5.4-rc8):
      
      	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 40)
      
      While 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra
      pages, which hid the bug.
      
      The following commit (not yet upstream) increased the size to 40 pages:
      
        x86/iopl: ("Restrict iopl() permission scope")
      
      ... but increased CPU_ENTRY_AREA_PAGES only 41 - i.e. shortening the gap
      to just 1 extra page.
      
      Then another not-yet-upstream commit changed the size again:
      
        880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      
      Which increased the cpu_entry_area size from 38 to 39 pages, but
      didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked
      fine, because we still had a page left from the accidental 'reserve'.
      
      But when these two commits were merged into the same tree, the
      combined size of cpu_entry_area grew from 38 to 40 pages, while
      CPU_ENTRY_AREA_PAGES finally caught up to 40 as well.
      
      Which is fine in terms of functionality, but the assert broke:
      
      	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
      
      because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area,
      which is 1 page larger due to the IDT page.
      
      To fix all this, change the assert to two precise asserts:
      
      	BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
      	BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
      
      This takes the IDT page into account, and also connects the size-based
      define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based
      define of CPU_ENTRY_AREA_MAP_SIZE.
      
      Also clean up some of the names which made it rather confusing:
      
       - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of
         the cpu-entry-area, but the per-cpu array size, so rename this
         to CPU_ENTRY_AREA_ARRAY_SIZE.
      
       - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping
         size, with the IDT included.
      
       - Add comments where '+1' denotes the IDT mapping - it wasn't
         obvious and took me about 3 hours to decode...
      
      Finally, because this particular commit is actually applied after
      this patch:
      
        880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      
      Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages.
      
      All future commits that change cpu_entry_area will have to adjust
      this value precisely.
      
      As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES
      and derive its value directly from the structure, without causing
      header hell - but that is an adventure for another day! :-)
      
      Fixes: 880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      05b042a1
  2. 22 11月, 2019 1 次提交
  3. 21 11月, 2019 1 次提交
  4. 20 11月, 2019 1 次提交
  5. 18 11月, 2019 1 次提交
  6. 16 11月, 2019 13 次提交
  7. 15 11月, 2019 7 次提交
    • N
      KVM: x86: deliver KVM IOAPIC scan request to target vCPUs · 7ee30bc1
      Nitesh Narayan Lal 提交于
      In IOAPIC fixed delivery mode instead of flushing the scan
      requests to all vCPUs, we should only send the requests to
      vCPUs specified within the destination field.
      
      This patch introduces kvm_get_dest_vcpus_mask() API which
      retrieves an array of target vCPUs by using
      kvm_apic_map_get_dest_lapic() and then based on the
      vcpus_idx, it sets the bit in a bitmap. However, if the above
      fails kvm_get_dest_vcpus_mask() finds the target vCPUs by
      traversing all available vCPUs. Followed by setting the
      bits in the bitmap.
      
      If we had different vCPUs in the previous request for the
      same redirection table entry then bits corresponding to
      these vCPUs are also set. This to done to keep
      ioapic_handled_vectors synchronized.
      
      This bitmap is then eventually passed on to
      kvm_make_vcpus_request_mask() to generate a masked request
      only for the target vCPUs.
      
      This would enable us to reduce the latency overhead on isolated
      vCPUs caused by the IPI to process due to KVM_REQ_IOAPIC_SCAN.
      Suggested-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NNitesh Narayan Lal <nitesh@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7ee30bc1
    • L
      KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC · b35e5548
      Like Xu 提交于
      Currently, a host perf_event is created for a vPMC functionality emulation.
      It’s unpredictable to determine if a disabled perf_event will be reused.
      If they are disabled and are not reused for a considerable period of time,
      those obsolete perf_events would increase host context switch overhead that
      could have been avoided.
      
      If the guest doesn't WRMSR any of the vPMC's MSRs during an entire vcpu
      sched time slice, and its independent enable bit of the vPMC isn't set,
      we can predict that the guest has finished the use of this vPMC, and then
      do request KVM_REQ_PMU in kvm_arch_sched_in and release those perf_events
      in the first call of kvm_pmu_handle_event() after the vcpu is scheduled in.
      
      This lazy mechanism delays the event release time to the beginning of the
      next scheduled time slice if vPMC's MSRs aren't changed during this time
      slice. If guest comes back to use this vPMC in next time slice, a new perf
      event would be re-created via perf_event_create_kernel_counter() as usual.
      Suggested-by: NWei Wang <wei.w.wang@intel.com>
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NLike Xu <like.xu@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b35e5548
    • L
      KVM: x86/vPMU: Reuse perf_event to avoid unnecessary pmc_reprogram_counter · a6da0d77
      Like Xu 提交于
      The perf_event_create_kernel_counter() in the pmc_reprogram_counter() is
      a heavyweight and high-frequency operation, especially when host disables
      the watchdog (maximum 21000000 ns) which leads to an unacceptable latency
      of the guest NMI handler. It limits the use of vPMUs in the guest.
      
      When a vPMC is fully enabled, the legacy reprogram_*_counter() would stop
      and release its existing perf_event (if any) every time EVEN in most cases
      almost the same requested perf_event will be created and configured again.
      
      For each vPMC, if the reuqested config ('u64 eventsel' for gp and 'u8 ctrl'
      for fixed) is the same as its current config AND a new sample period based
      on pmc->counter is accepted by host perf interface, the current event could
      be reused safely as a new created one does. Otherwise, do release the
      undesirable perf_event and reprogram a new one as usual.
      
      It's light-weight to call pmc_pause_counter (disable, read and reset event)
      and pmc_resume_counter (recalibrate period and re-enable event) as guest
      expects instead of release-and-create again on any condition. Compared to
      use the filterable event->attr or hw.config, a new 'u64 current_config'
      field is added to save the last original programed config for each vPMC.
      
      Based on this implementation, the number of calls to pmc_reprogram_counter
      is reduced by ~82.5% for a gp sampling event and ~99.9% for a fixed event.
      In the usage of multiplexing perf sampling mode, the average latency of the
      guest NMI handler is reduced from 104923 ns to 48393 ns (~2.16x speed up).
      If host disables watchdog, the minimum latecy of guest NMI handler could be
      speed up at ~3413x (from 20407603 to 5979 ns) and at ~786x in the average.
      Suggested-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NLike Xu <like.xu@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a6da0d77
    • C
      x86/pci: Remove #ifdef __KERNEL__ guard from <asm/pci.h> · b52b0c4f
      Christoph Hellwig 提交于
      pci.h is not a UAPI header, so the __KERNEL__ ifdef is rather pointless.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191113071836.21041-4-hch@lst.de
      b52b0c4f
    • C
      x86/pci: Remove pci_64.h · 948fdcf9
      Christoph Hellwig 提交于
      This file only contains external declarations for two non-existing
      function pointers.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191113071836.21041-3-hch@lst.de
      948fdcf9
    • C
      x86: Remove the calgary IOMMU driver · 90dc392f
      Christoph Hellwig 提交于
      The calgary IOMMU was only used on high-end IBM systems in the early
      x86_64 age and has no known users left.  Remove it to avoid having to
      touch it for pending changes to the DMA API.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191113071836.21041-2-hch@lst.de
      90dc392f
    • L
      x86/kdump: Remove the backup region handling · 7c321eb2
      Lianbo Jiang 提交于
      When the crashkernel kernel command line option is specified, the low
      1M memory will always be reserved now. Therefore, it's not necessary to
      create a backup region anymore and also no need to copy the contents of
      the first 640k to it.
      
      Remove all the code related to handling that backup region.
      
       [ bp: Massage commit message. ]
      Signed-off-by: NLianbo Jiang <lijiang@redhat.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: bhe@redhat.com
      Cc: Dave Young <dyoung@redhat.com>
      Cc: d.hatayama@fujitsu.com
      Cc: dhowells@redhat.com
      Cc: ebiederm@xmission.com
      Cc: horms@verge.net.au
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jürgen Gross <jgross@suse.com>
      Cc: kexec@lists.infradead.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: vgoyal@redhat.com
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191108090027.11082-3-lijiang@redhat.com
      7c321eb2
  8. 14 11月, 2019 3 次提交
    • L
      x86/kdump: Always reserve the low 1M when the crashkernel option is specified · 6f599d84
      Lianbo Jiang 提交于
      On x86, purgatory() copies the first 640K of memory to a backup region
      because the kernel needs those first 640K for the real mode trampoline
      during boot, among others.
      
      However, when SME is enabled, the kernel cannot properly copy the old
      memory to the backup area but reads only its encrypted contents. The
      result is that the crash tool gets invalid pointers when parsing vmcore:
      
        crash> kmem -s|grep -i invalid
        kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4
        kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4
        crash>
      
      So reserve the remaining low 1M memory when the crashkernel option is
      specified (after reserving real mode memory) so that allocated memory
      does not fall into the low 1M area and thus the copying of the contents
      of the first 640k to a backup region in purgatory() can be avoided
      altogether.
      
      This way, it does not need to be included in crash dumps or used for
      anything except the trampolines that must live in the low 1M.
      
       [ bp: Heavily rewrite commit message, flip check logic in
         crash_reserve_low_1M().]
      Signed-off-by: NLianbo Jiang <lijiang@redhat.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: bhe@redhat.com
      Cc: Dave Young <dyoung@redhat.com>
      Cc: d.hatayama@fujitsu.com
      Cc: dhowells@redhat.com
      Cc: ebiederm@xmission.com
      Cc: horms@verge.net.au
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jürgen Gross <jgross@suse.com>
      Cc: kexec@lists.infradead.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: vgoyal@redhat.com
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191108090027.11082-2-lijiang@redhat.com
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=204793
      6f599d84
    • L
      x86/crash: Add a forward declaration of struct kimage · 112eee5d
      Lianbo Jiang 提交于
      Add a forward declaration of struct kimage to the crash.h header because
      future changes will invoke a crash-specific function from the realmode
      init path and the compiler will complain otherwise like this:
      
        In file included from arch/x86/realmode/init.c:11:
        ./arch/x86/include/asm/crash.h:5:32: warning: ‘struct kimage’ declared inside\
         parameter list will not be visible outside of this definition or declaration
            5 | int crash_load_segments(struct kimage *image);
              |                                ^~~~~~
        ./arch/x86/include/asm/crash.h:6:37: warning: ‘struct kimage’ declared inside\
         parameter list will not be visible outside of this definition or declaration
            6 | int crash_copy_backup_region(struct kimage *image);
              |                                     ^~~~~~
        ./arch/x86/include/asm/crash.h:7:39: warning: ‘struct kimage’ declared inside\
         parameter list will not be visible outside of this definition or declaration
            7 | int crash_setup_memmap_entries(struct kimage *image,
              |
      
       [ bp: Rewrite the commit message. ]
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NLianbo Jiang <lijiang@redhat.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: bhe@redhat.com
      Cc: d.hatayama@fujitsu.com
      Cc: dhowells@redhat.com
      Cc: dyoung@redhat.com
      Cc: ebiederm@xmission.com
      Cc: horms@verge.net.au
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jürgen Gross <jgross@suse.com>
      Cc: kexec@lists.infradead.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: vgoyal@redhat.com
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191108090027.11082-4-lijiang@redhat.com
      Link: https://lkml.kernel.org/r/201910310233.EJRtTMWP%25lkp@intel.com
      112eee5d
    • J
      xen/mcelog: add PPIN to record when available · 4e3f77d8
      Jan Beulich 提交于
      This is to augment commit 3f5a7896 ("x86/mce: Include the PPIN in MCE
      records when available").
      
      I'm also adding "synd" and "ipid" fields to struct xen_mce, in an
      attempt to keep field offsets in sync with struct mce. These two fields
      won't get populated for now, though.
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      4e3f77d8
  9. 12 11月, 2019 5 次提交
  10. 07 11月, 2019 1 次提交
  11. 05 11月, 2019 3 次提交
    • J
      kvm: x86: mmu: Recovery of shattered NX large pages · 1aa9b957
      Junaid Shahid 提交于
      The page table pages corresponding to broken down large pages are zapped in
      FIFO order, so that the large page can potentially be recovered, if it is
      not longer being used for execution.  This removes the performance penalty
      for walking deeper EPT page tables.
      
      By default, one large page will last about one hour once the guest
      reaches a steady state.
      Signed-off-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      1aa9b957
    • K
      x86/mm: Report which part of kernel image is freed · 5494c3a6
      Kees Cook 提交于
      The memory freeing report wasn't very useful for figuring out which
      parts of the kernel image were being freed. Add the details for clearer
      reporting in dmesg.
      
      Before:
      
        Freeing unused kernel image memory: 1348K
        Write protecting the kernel read-only data: 20480k
        Freeing unused kernel image memory: 2040K
        Freeing unused kernel image memory: 172K
      
      After:
      
        Freeing unused kernel image (initmem) memory: 1348K
        Write protecting the kernel read-only data: 20480k
        Freeing unused kernel image (text/rodata gap) memory: 2040K
        Freeing unused kernel image (rodata/data gap) memory: 172K
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: x86-ml <x86@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20191029211351.13243-28-keescook@chromium.org
      5494c3a6
    • K
      x86/vmlinux: Actually use _etext for the end of the text segment · b9076938
      Kees Cook 提交于
      Various calculations are using the end of the exception table (which
      does not need to be executable) as the end of the text segment. Instead,
      in preparation for moving the exception table into RO_DATA, move _etext
      after the exception table and update the calculations.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Ross Zwisler <zwisler@chromium.org>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: x86-ml <x86@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20191029211351.13243-16-keescook@chromium.org
      b9076938
  12. 04 11月, 2019 2 次提交
    • P
      kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Paolo Bonzini 提交于
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because then
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, again the nested guest can cause problems shadow
      and direct EPT is treated in the same way.
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      b8e8c830
    • V
      x86/bugs: Add ITLB_MULTIHIT bug infrastructure · db4d30fb
      Vineela Tummalapalli 提交于
      Some processors may incur a machine check error possibly resulting in an
      unrecoverable CPU lockup when an instruction fetch encounters a TLB
      multi-hit in the instruction TLB. This can occur when the page size is
      changed along with either the physical address or cache type. The relevant
      erratum can be found here:
      
         https://bugzilla.kernel.org/show_bug.cgi?id=205195
      
      There are other processors affected for which the erratum does not fully
      disclose the impact.
      
      This issue affects both bare-metal x86 page tables and EPT.
      
      It can be mitigated by either eliminating the use of large pages or by
      using careful TLB invalidations when changing the page size in the page
      tables.
      
      Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in
      MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which
      are mitigated against this issue.
      Signed-off-by: NVineela Tummalapalli <vineela.tummalapalli@intel.com>
      Co-developed-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Signed-off-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      db4d30fb
  13. 28 10月, 2019 1 次提交
    • P
      x86/speculation/taa: Add mitigation for TSX Async Abort · 1b42f017
      Pawan Gupta 提交于
      TSX Async Abort (TAA) is a side channel vulnerability to the internal
      buffers in some Intel processors similar to Microachitectural Data
      Sampling (MDS). In this case, certain loads may speculatively pass
      invalid data to dependent operations when an asynchronous abort
      condition is pending in a TSX transaction.
      
      This includes loads with no fault or assist condition. Such loads may
      speculatively expose stale data from the uarch data structures as in
      MDS. Scope of exposure is within the same-thread and cross-thread. This
      issue affects all current processors that support TSX, but do not have
      ARCH_CAP_TAA_NO (bit 8) set in MSR_IA32_ARCH_CAPABILITIES.
      
      On CPUs which have their IA32_ARCH_CAPABILITIES MSR bit MDS_NO=0,
      CPUID.MD_CLEAR=1 and the MDS mitigation is clearing the CPU buffers
      using VERW or L1D_FLUSH, there is no additional mitigation needed for
      TAA. On affected CPUs with MDS_NO=1 this issue can be mitigated by
      disabling the Transactional Synchronization Extensions (TSX) feature.
      
      A new MSR IA32_TSX_CTRL in future and current processors after a
      microcode update can be used to control the TSX feature. There are two
      bits in that MSR:
      
      * TSX_CTRL_RTM_DISABLE disables the TSX sub-feature Restricted
      Transactional Memory (RTM).
      
      * TSX_CTRL_CPUID_CLEAR clears the RTM enumeration in CPUID. The other
      TSX sub-feature, Hardware Lock Elision (HLE), is unconditionally
      disabled with updated microcode but still enumerated as present by
      CPUID(EAX=7).EBX{bit4}.
      
      The second mitigation approach is similar to MDS which is clearing the
      affected CPU buffers on return to user space and when entering a guest.
      Relevant microcode update is required for the mitigation to work.  More
      details on this approach can be found here:
      
        https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html
      
      The TSX feature can be controlled by the "tsx" command line parameter.
      If it is force-enabled then "Clear CPU buffers" (MDS mitigation) is
      deployed. The effective mitigation state can be read from sysfs.
      
       [ bp:
         - massage + comments cleanup
         - s/TAA_MITIGATION_TSX_DISABLE/TAA_MITIGATION_TSX_DISABLED/g - Josh.
         - remove partial TAA mitigation in update_mds_branch_idle() - Josh.
         - s/tsx_async_abort_cmdline/tsx_async_abort_parse_cmdline/g
       ]
      Signed-off-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      1b42f017