1. 31 12月, 2017 4 次提交
    • T
      x86/ldt: Make LDT pgtable free conditional · 7f414195
      Thomas Gleixner 提交于
      Andy prefers to be paranoid about the pagetable free in the error path of
      write_ldt(). Make it conditional and warn whenever the installment of a
      secondary LDT fails.
      Requested-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      7f414195
    • T
      x86/ldt: Plug memory leak in error path · a62d6985
      Thomas Gleixner 提交于
      The error path in write_ldt() tries to free 'old_ldt' instead of the newly
      allocated 'new_ldt', resulting in a memory leak. It also misses to clean up a
      half populated LDT pagetable, which is not a leak as it gets cleaned up
      when the process exits.
      
      Free both the potentially half populated LDT pagetable and the newly
      allocated LDT struct. This can be done unconditionally because once an LDT
      is mapped subsequent maps will succeed, because the PTE page is already
      populated and the two LDTs fit into that single page.
      Reported-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: f55f0501 ("x86/pti: Put the LDT in its own PGD if PTI is on")
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1712311121340.1899@nanosSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a62d6985
    • T
      x86/mm: Remove preempt_disable/enable() from __native_flush_tlb() · decab088
      Thomas Gleixner 提交于
      The preempt_disable/enable() pair in __native_flush_tlb() was added in
      commit:
      
        5cf0791d ("x86/mm: Disable preemption during CR3 read+write")
      
      ... to protect the UP variant of flush_tlb_mm_range().
      
      That preempt_disable/enable() pair should have been added to the UP variant
      of flush_tlb_mm_range() instead.
      
      The UP variant was removed with commit:
      
        ce4a4e56 ("x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code")
      
      ... but the preempt_disable/enable() pair stayed around.
      
      The latest change to __native_flush_tlb() in commit:
      
        6fd166aa ("x86/mm: Use/Fix PCID to optimize user/kernel switches")
      
      ... added an access to a per CPU variable outside the preempt disabled
      regions, which makes no sense at all. __native_flush_tlb() must always
      be called with at least preemption disabled.
      
      Remove the preempt_disable/enable() pair and add a WARN_ON_ONCE() to catch
      bad callers independent of the smp_processor_id() debugging.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20171230211829.679325424@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      decab088
    • T
      x86/smpboot: Remove stale TLB flush invocations · 322f8b8b
      Thomas Gleixner 提交于
      smpboot_setup_warm_reset_vector() and smpboot_restore_warm_reset_vector()
      invoke local_flush_tlb() for no obvious reason.
      
      Digging in history revealed that the original code in the 2.1 era added
      those because the code manipulated a swapper_pg_dir pagetable entry. The
      pagetable manipulation was removed long ago in the 2.3 timeframe, but the
      TLB flush invocations stayed around forever.
      
      Remove them along with the pointless pr_debug()s which come from the same 2.1
      change.
      Reported-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20171230211829.586548655@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      322f8b8b
  2. 30 12月, 2017 3 次提交
    • T
      genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI · bc976233
      Thomas Gleixner 提交于
      The new reservation mode for interrupts assigns a dummy vector when the
      interrupt is allocated and assigns a real vector when the interrupt is
      requested. The reservation mode prevents vector pressure when devices with
      a large amount of queues/interrupts are initialized, but only a minimal
      subset of those queues/interrupts is actually used.
      
      This mode has an issue with MSI interrupts which cannot be masked. If the
      driver is not careful or the hardware emits an interrupt before the device
      irq is requestd by the driver then the interrupt ends up on the dummy
      vector as a spurious interrupt which can cause malfunction of the device or
      in the worst case a lockup of the machine.
      
      Change the logic for the reservation mode so that the early activation of
      MSI interrupts checks whether:
      
       - the device is a PCI/MSI device
       - the reservation mode of the underlying irqdomain is activated
       - PCI/MSI masking is globally enabled
       - the PCI/MSI device uses either MSI-X, which supports masking, or
         MSI with the maskbit supported.
      
      If one of those conditions is false, then clear the reservation mode flag
      in the irq data of the interrupt and invoke irq_domain_activate_irq() with
      the reserve argument cleared. In the x86 vector code, clear the can_reserve
      flag in the vector allocation data so a subsequent free_irq() won't create
      the same situation again. The interrupt stays assigned to a real vector
      until pci_disable_msi() is invoked and all allocations are undone.
      
      Fixes: 4900be83 ("x86/vector/msi: Switch to global reservation mode")
      Reported-by: NAlexandru Chirvasitu <achirvasub@gmail.com>
      Reported-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NAlexandru Chirvasitu <achirvasub@gmail.com>
      Tested-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>,
      Cc: linux-media@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos
      bc976233
    • T
      genirq/irqdomain: Rename early argument of irq_domain_activate_irq() · 702cb0a0
      Thomas Gleixner 提交于
      The 'early' argument of irq_domain_activate_irq() is actually used to
      denote reservation mode. To avoid confusion, rename it before abuse
      happens.
      
      No functional change.
      
      Fixes: 72491643 ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexandru Chirvasitu <achirvasub@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>,
      Cc: linux-media@vger.kernel.org
      702cb0a0
    • T
      x86/vector: Use IRQD_CAN_RESERVE flag · 945f50a5
      Thomas Gleixner 提交于
      Set the new CAN_RESERVE flag when the initial reservation for an interrupt
      happens. The flag is used in a subsequent patch to disable reservation mode
      for a certain class of MSI devices.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NAlexandru Chirvasitu <achirvasub@gmail.com>
      Tested-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Mikael Pettersson <mikpelinux@gmail.com>
      Cc: Josh Poulson <jopoulso@microsoft.com>
      Cc: Mihai Costache <v-micos@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Simon Xiao <sixiao@microsoft.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Jork Loeser <Jork.Loeser@microsoft.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: devel@linuxdriverproject.org
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: Sakari Ailus <sakari.ailus@intel.com>,
      Cc: linux-media@vger.kernel.org
      
      945f50a5
  3. 29 12月, 2017 1 次提交
    • T
      x86/apic: Switch all APICs to Fixed delivery mode · a31e58e1
      Thomas Gleixner 提交于
      Some of the APIC incarnations are operating in lowest priority delivery
      mode. This worked as long as the vector management code allocated the same
      vector on all possible CPUs for each interrupt.
      
      Lowest priority delivery mode does not necessarily respect the affinity
      setting and may redirect to some other online CPU. This was documented
      somewhere in the old code and the conversion to single target delivery
      missed to update the delivery mode of the affected APIC drivers which
      results in spurious interrupts on some of the affected CPU/Chipset
      combinations.
      
      Switch the APIC drivers over to Fixed delivery mode and remove all
      leftovers of lowest priority delivery mode.
      
      Switching to Fixed delivery mode is not a problem on these CPUs because the
      kernel already uses Fixed delivery mode for IPIs. The reason for this is
      that th SDM explicitely forbids lowest prio mode for IPIs. The reason is
      obvious: If the irq routing does not honor destination targets in lowest
      prio mode then an IPI targeted at CPU1 might end up on CPU0, which would be
      a fatal problem in many cases.
      
      As a consequence of this change, the apic::irq_delivery_mode field is now
      pointless, but this needs to be cleaned up in a separate patch.
      
      Fixes: fdba46ff ("x86/apic: Get rid of multi CPU affinity")
      Reported-by: vcaputo@pengaru.com
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: vcaputo@pengaru.com
      Cc: Pavel Machek <pavel@ucw.cz>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712281140440.1688@nanos
      a31e58e1
  4. 28 12月, 2017 4 次提交
  5. 24 12月, 2017 28 次提交
    • T
      x86/ldt: Make the LDT mapping RO · 9f5cb6b3
      Thomas Gleixner 提交于
      Now that the LDT mapping is in a known area when PAGE_TABLE_ISOLATION is
      enabled its a primary target for attacks, if a user space interface fails
      to validate a write address correctly. That can never happen, right?
      
      The SDM states:
      
          If the segment descriptors in the GDT or an LDT are placed in ROM, the
          processor can enter an indefinite loop if software or the processor
          attempts to update (write to) the ROM-based segment descriptors. To
          prevent this problem, set the accessed bits for all segment descriptors
          placed in a ROM. Also, remove operating-system or executive code that
          attempts to modify segment descriptors located in ROM.
      
      So its a valid approach to set the ACCESS bit when setting up the LDT entry
      and to map the table RO. Fixup the selftest so it can handle that new mode.
      
      Remove the manual ACCESS bit setter in set_tls_desc() as this is now
      pointless. Folded the patch from Peter Ziljstra.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9f5cb6b3
    • T
      x86/mm/dump_pagetables: Allow dumping current pagetables · a4b51ef6
      Thomas Gleixner 提交于
      Add two debugfs files which allow to dump the pagetable of the current
      task.
      
      current_kernel dumps the regular page table. This is the page table which
      is normally shared between kernel and user space. If kernel page table
      isolation is enabled this is the kernel space mapping.
      
      If kernel page table isolation is enabled the second file, current_user,
      dumps the user space page table.
      
      These files allow to verify the resulting page tables for page table
      isolation, but even in the normal case its useful to be able to inspect
      user space page tables of current for debugging purposes.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      a4b51ef6
    • T
      x86/mm/dump_pagetables: Check user space page table for WX pages · b4bf4f92
      Thomas Gleixner 提交于
      ptdump_walk_pgd_level_checkwx() checks the kernel page table for WX pages,
      but does not check the PAGE_TABLE_ISOLATION user space page table.
      
      Restructure the code so that dmesg output is selected by an explicit
      argument and not implicit via checking the pgd argument for !NULL.
      
      Add the check for the user space page table.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b4bf4f92
    • B
      x86/mm/dump_pagetables: Add page table directory to the debugfs VFS hierarchy · 75298aa1
      Borislav Petkov 提交于
      The upcoming support for dumping the kernel and the user space page tables
      of the current process would create more random files in the top level
      debugfs directory.
      
      Add a page table directory and move the existing file to it.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      75298aa1
    • V
      x86/dumpstack: Indicate in Oops whether PTI is configured and enabled · 5f26d76c
      Vlastimil Babka 提交于
      CONFIG_PAGE_TABLE_ISOLATION is relatively new and intrusive feature that may
      still have some corner cases which could take some time to manifest and be
      fixed. It would be useful to have Oops messages indicate whether it was
      enabled for building the kernel, and whether it was disabled during boot.
      
      Example of fully enabled:
      
      	Oops: 0001 [#1] SMP PTI
      
      Example of enabled during build, but disabled during boot:
      
      	Oops: 0001 [#1] SMP NOPTI
      
      We can decide to remove this after the feature has been tested in the field
      long enough.
      
      [ tglx: Made it use boot_cpu_has() as requested by Borislav ]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NEduardo Valentin <eduval@amazon.com>
      Acked-by: NDave Hansen <dave.hansen@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: bpetkov@suse.de
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: jkosina@suse.cz
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      5f26d76c
    • P
      x86/mm: Clarify the whole ASID/kernel PCID/user PCID naming · 0a126abd
      Peter Zijlstra 提交于
      Ideally we'd also use sparse to enforce this separation so it becomes much
      more difficult to mess up.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0a126abd
    • D
      x86/mm: Use INVPCID for __native_flush_tlb_single() · 6cff64b8
      Dave Hansen 提交于
      This uses INVPCID to shoot down individual lines of the user mapping
      instead of marking the entire user map as invalid. This
      could/might/possibly be faster.
      
      This for sure needs tlb_single_page_flush_ceiling to be redetermined;
      esp. since INVPCID is _slow_.
      
      A detailed performance analysis is available here:
      
        https://lkml.kernel.org/r/3062e486-3539-8a1f-5724-16199420be71@intel.com
      
      [ Peterz: Split out from big combo patch ]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6cff64b8
    • P
      x86/mm: Optimize RESTORE_CR3 · 21e94459
      Peter Zijlstra 提交于
      Most NMI/paranoid exceptions will not in fact change pagetables and would
      thus not require TLB flushing, however RESTORE_CR3 uses flushing CR3
      writes.
      
      Restores to kernel PCIDs can be NOFLUSH, because we explicitly flush the
      kernel mappings and now that we track which user PCIDs need flushing we can
      avoid those too when possible.
      
      This does mean RESTORE_CR3 needs an additional scratch_reg, luckily both
      sites have plenty available.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      21e94459
    • P
      x86/mm: Use/Fix PCID to optimize user/kernel switches · 6fd166aa
      Peter Zijlstra 提交于
      We can use PCID to retain the TLBs across CR3 switches; including those now
      part of the user/kernel switch. This increases performance of kernel
      entry/exit at the cost of more expensive/complicated TLB flushing.
      
      Now that we have two address spaces, one for kernel and one for user space,
      we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID
      (just like we use the PFN LSB for the PGD). Since we do TLB invalidation
      from kernel space, the existing code will only invalidate the kernel PCID,
      we augment that by marking the corresponding user PCID invalid, and upon
      switching back to userspace, use a flushing CR3 write for the switch.
      
      In order to access the user_pcid_flush_mask we use PER_CPU storage, which
      means the previously established SWAPGS vs CR3 ordering is now mandatory
      and required.
      
      Having to do this memory access does require additional registers, most
      sites have a functioning stack and we can spill one (RAX), sites without
      functional stack need to otherwise provide the second scratch register.
      
      Note: PCID is generally available on Intel Sandybridge and later CPUs.
      Note: Up until this point TLB flushing was broken in this series.
      
      Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6fd166aa
    • D
      x86/mm: Abstract switching CR3 · 48e11198
      Dave Hansen 提交于
      In preparation to adding additional PCID flushing, abstract the
      loading of a new ASID into CR3.
      
      [ PeterZ: Split out from big combo patch ]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      48e11198
    • D
      x86/mm: Allow flushing for future ASID switches · 2ea907c4
      Dave Hansen 提交于
      If changing the page tables in such a way that an invalidation of all
      contexts (aka. PCIDs / ASIDs) is required, they can be actively invalidated
      by:
      
       1. INVPCID for each PCID (works for single pages too).
      
       2. Load CR3 with each PCID without the NOFLUSH bit set
      
       3. Load CR3 with the NOFLUSH bit set for each and do INVLPG for each address.
      
      But, none of these are really feasible since there are ~6 ASIDs (12 with
      PAGE_TABLE_ISOLATION) at the time that invalidation is required.
      Instead of actively invalidating them, invalidate the *current* context and
      also mark the cpu_tlbstate _quickly_ to indicate future invalidation to be
      required.
      
      At the next context-switch, look for this indicator
      ('invalidate_other' being set) invalidate all of the
      cpu_tlbstate.ctxs[] entries.
      
      This ensures that any future context switches will do a full flush
      of the TLB, picking up the previous changes.
      
      [ tglx: Folded more fixups from Peter ]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2ea907c4
    • A
      x86/pti: Map the vsyscall page if needed · 85900ea5
      Andy Lutomirski 提交于
      Make VSYSCALLs work fully in PTI mode by mapping them properly to the user
      space visible page tables.
      
      [ tglx: Hide unused functions (Patch by Arnd Bergmann) ]
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      85900ea5
    • A
      x86/pti: Put the LDT in its own PGD if PTI is on · f55f0501
      Andy Lutomirski 提交于
      With PTI enabled, the LDT must be mapped in the usermode tables somewhere.
      The LDT is per process, i.e. per mm.
      
      An earlier approach mapped the LDT on context switch into a fixmap area,
      but that's a big overhead and exhausted the fixmap space when NR_CPUS got
      big.
      
      Take advantage of the fact that there is an address space hole which
      provides a completely unused pgd. Use this pgd to manage per-mm LDT
      mappings.
      
      This has a down side: the LDT isn't (currently) randomized, and an attack
      that can write the LDT is instant root due to call gates (thanks, AMD, for
      leaving call gates in AMD64 but designing them wrong so they're only useful
      for exploits).  This can be mitigated by making the LDT read-only or
      randomizing the mapping, either of which is strightforward on top of this
      patch.
      
      This will significantly slow down LDT users, but that shouldn't matter for
      important workloads -- the LDT is only used by DOSEMU(2), Wine, and very
      old libc implementations.
      
      [ tglx: Cleaned it up. ]
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      f55f0501
    • A
      x86/mm/64: Make a full PGD-entry size hole in the memory map · 9f449772
      Andy Lutomirski 提交于
      Shrink vmalloc space from 16384TiB to 12800TiB to enlarge the hole starting
      at 0xff90000000000000 to be a full PGD entry.
      
      A subsequent patch will use this hole for the pagetable isolation LDT
      alias.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9f449772
    • H
      x86/events/intel/ds: Map debug buffers in cpu_entry_area · c1961a46
      Hugh Dickins 提交于
      The BTS and PEBS buffers both have their virtual addresses programmed into
      the hardware.  This means that any access to them is performed via the page
      tables.  The times that the hardware accesses these are entirely dependent
      on how the performance monitoring hardware events are set up.  In other
      words, there is no way for the kernel to tell when the hardware might
      access these buffers.
      
      To avoid perf crashes, place 'debug_store' allocate pages and map them into
      the cpu_entry_area.
      
      The PEBS fixup buffer does not need this treatment.
      
      [ tglx: Got rid of the kaiser_add_mapping() complication ]
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c1961a46
    • T
      x86/cpu_entry_area: Add debugstore entries to cpu_entry_area · 10043e02
      Thomas Gleixner 提交于
      The Intel PEBS/BTS debug store is a design trainwreck as it expects virtual
      addresses which must be visible in any execution context.
      
      So it is required to make these mappings visible to user space when kernel
      page table isolation is active.
      
      Provide enough room for the buffer mappings in the cpu_entry_area so the
      buffers are available in the user space visible page tables.
      
      At the point where the kernel side entry area is populated there is no
      buffer available yet, but the kernel PMD must be populated. To achieve this
      set the entries for these buffers to non present.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      10043e02
    • A
      x86/mm/pti: Map ESPFIX into user space · 4b6bbe95
      Andy Lutomirski 提交于
      Map the ESPFIX pages into user space when PTI is enabled.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      4b6bbe95
    • T
      x86/mm/pti: Share entry text PMD · 6dc72c3c
      Thomas Gleixner 提交于
      Share the entry text PMD of the kernel mapping with the user space
      mapping. If large pages are enabled this is a single PMD entry and at the
      point where it is copied into the user page table the RW bit has not been
      cleared yet. Clear it right away so the user space visible map becomes RX.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6dc72c3c
    • T
      x86/entry: Align entry text section to PMD boundary · 2f7412ba
      Thomas Gleixner 提交于
      The (irq)entry text must be visible in the user space page tables. To allow
      simple PMD based sharing, make the entry text PMD aligned.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2f7412ba
    • A
      x86/mm/pti: Share cpu_entry_area with user space page tables · f7cfbee9
      Andy Lutomirski 提交于
      Share the cpu entry area so the user space and kernel space page tables
      have the same P4D page.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      f7cfbee9
    • T
      x86/mm/pti: Force entry through trampoline when PTI active · 8d4b0678
      Thomas Gleixner 提交于
      Force the entry through the trampoline only when PTI is active. Otherwise
      go through the normal entry code.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8d4b0678
    • A
      x86/mm/pti: Add functions to clone kernel PMDs · 03f4424f
      Andy Lutomirski 提交于
      Provide infrastructure to:
      
       - find a kernel PMD for a mapping which must be visible to user space for
         the entry/exit code to work.
      
       - walk an address range and share the kernel PMD with it.
      
      This reuses a small part of the original KAISER patches to populate the
      user space page table.
      
      [ tglx: Made it universally usable so it can be used for any kind of shared
      	mapping. Add a mechanism to clear specific bits in the user space
      	visible PMD entry. Folded Andys simplifactions ]
      Originally-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      03f4424f
    • D
      x86/mm/pti: Populate user PGD · fc2fbc85
      Dave Hansen 提交于
      In clone_pgd_range() copy the init user PGDs which cover the kernel half of
      the address space, so a process has all the required kernel mappings
      visible.
      
      [ tglx: Split out from the big kaiser dump and folded Andys simplification ]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      fc2fbc85
    • D
      x86/mm/pti: Allocate a separate user PGD · d9e9a641
      Dave Hansen 提交于
      Kernel page table isolation requires to have two PGDs. One for the kernel,
      which contains the full kernel mapping plus the user space mapping and one
      for user space which contains the user space mappings and the minimal set
      of kernel mappings which are required by the architecture to be able to
      transition from and to user space.
      
      Add the necessary preliminaries.
      
      [ tglx: Split out from the big kaiser dump. EFI fixup from Kirill ]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      d9e9a641
    • D
      x86/mm/pti: Allow NX poison to be set in p4d/pgd · 1c4de1ff
      Dave Hansen 提交于
      With PAGE_TABLE_ISOLATION the user portion of the kernel page tables is
      poisoned with the NX bit so if the entry code exits with the kernel page
      tables selected in CR3, userspace crashes.
      
      But doing so trips the p4d/pgd_bad() checks.  Make sure it does not do
      that.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1c4de1ff
    • D
      x86/mm/pti: Add mapping helper functions · 61e9b367
      Dave Hansen 提交于
      Add the pagetable helper functions do manage the separate user space page
      tables.
      
      [ tglx: Split out from the big combo kaiser patch. Folded Andys
      	simplification and made it out of line as Boris suggested ]
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      61e9b367
    • B
      x86/pti: Add the pti= cmdline option and documentation · 41f4c20b
      Borislav Petkov 提交于
      Keep the "nopti" optional for traditional reasons.
      
      [ tglx: Don't allow force on when running on XEN PV and made 'on'
      	printout conditional ]
      Requested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Link: https://lkml.kernel.org/r/20171212133952.10177-1-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      41f4c20b
    • T
      x86/mm/pti: Add infrastructure for page table isolation · aa8c6248
      Thomas Gleixner 提交于
      Add the initial files for kernel page table isolation, with a minimal init
      function and the boot time detection for this misfeature.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      aa8c6248