1. 08 5月, 2015 3 次提交
    • P
      locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching · f233f7f1
      Peter Zijlstra (Intel) 提交于
      We use the regular paravirt call patching to switch between:
      
        native_queued_spin_lock_slowpath()	__pv_queued_spin_lock_slowpath()
        native_queued_spin_unlock()		__pv_queued_spin_unlock()
      
      We use a callee saved call for the unlock function which reduces the
      i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
      again.
      
      We further optimize the unlock path by patching the direct call with a
      "movb $0,%arg1" if we are indeed using the native unlock code. This
      makes the unlock code almost as fast as the !PARAVIRT case.
      
      This significantly lowers the overhead of having
      CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-10-git-send-email-Waiman.Long@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f233f7f1
    • P
      locking/qspinlock: Revert to test-and-set on hypervisors · 2aa79af6
      Peter Zijlstra (Intel) 提交于
      When we detect a hypervisor (!paravirt, see qspinlock paravirt support
      patches), revert to a simple test-and-set lock to avoid the horrors
      of queue preemption.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-Waiman.Long@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2aa79af6
    • W
      locking/qspinlock, x86: Enable x86-64 to use queued spinlocks · d73a3397
      Waiman Long 提交于
      This patch makes the necessary changes at the x86 architecture
      specific layer to enable the use of queued spinlocks for x86-64. As
      x86-32 machines are typically not multi-socket. The benefit of queue
      spinlock may not be apparent. So queued spinlocks are not enabled.
      
      Currently, there is some incompatibilities between the para-virtualized
      spinlock code (which hard-codes the use of ticket spinlock) and the
      queued spinlocks. Therefore, the use of queued spinlocks is disabled
      when the para-virtualized spinlock is enabled.
      
      The arch/x86/include/asm/qspinlock.h header file includes some x86
      specific optimization which will make the queueds spinlock code
      perform better than the generic implementation.
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Daniel J Blueman <daniel@numascale.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Douglas Hatch <doug.hatch@hp.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1429901803-29771-3-git-send-email-Waiman.Long@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d73a3397
  2. 06 5月, 2015 2 次提交
  3. 05 5月, 2015 1 次提交
  4. 27 4月, 2015 2 次提交
  5. 17 4月, 2015 1 次提交
  6. 15 4月, 2015 4 次提交
    • V
      mm: move memtest under mm · 4a20799d
      Vladimir Murzin 提交于
      Memtest is a simple feature which fills the memory with a given set of
      patterns and validates memory contents, if bad memory regions is detected
      it reserves them via memblock API.  Since memblock API is widely used by
      other architectures this feature can be enabled outside of x86 world.
      
      This patch set promotes memtest to live under generic mm umbrella and
      enables memtest feature for arm/arm64.
      
      It was reported that this patch set was useful for tracking down an issue
      with some errant DMA on an arm64 platform.
      
      This patch (of 6):
      
      There is nothing platform dependent in the core memtest code, so other
      platforms might benefit from this feature too.
      
      [linux@roeck-us.net: MEMTEST depends on MEMBLOCK]
      Signed-off-by: NVladimir Murzin <vladimir.murzin@arm.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a20799d
    • K
      mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE · 204db6ed
      Kees Cook 提交于
      The arch_randomize_brk() function is used on several architectures,
      even those that don't support ET_DYN ASLR. To avoid bulky extern/#define
      tricks, consolidate the support under CONFIG_ARCH_HAS_ELF_RANDOMIZE for
      the architectures that support it, while still handling CONFIG_COMPAT_BRK.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Russell King <linux@arm.linux.org.uk>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "David A. Long" <dave.long@linaro.org>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Arun Chandran <achandran@mvista.com>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Min-Hua Chen <orca.chen@gmail.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Alex Smith <alex@alex-smith.me.uk>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: Vineeth Vijayan <vvijayan@mvista.com>
      Cc: Jeff Bailey <jeffbailey@google.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Behan Webster <behanw@converseincode.com>
      Cc: Ismael Ripoll <iripoll@upv.es>
      Cc: Jan-Simon Mller <dl9pf@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      204db6ed
    • T
      x86, mm: support huge I/O mapping capability I/F · 5d72b4fb
      Toshi Kani 提交于
      Implement huge I/O mapping capability interfaces for ioremap() on x86.
      
      IOREMAP_MAX_ORDER is defined to PUD_SHIFT on x86/64 and PMD_SHIFT on
      x86/32, which overrides the default value defined in <linux/vmalloc.h>.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d72b4fb
    • K
      x86: expose number of page table levels on Kconfig level · 98233368
      Kirill A. Shutemov 提交于
      We would want to use number of page table level to define mm_struct.
      Let's expose it as CONFIG_PGTABLE_LEVELS.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98233368
  7. 13 4月, 2015 1 次提交
  8. 09 4月, 2015 3 次提交
    • A
      x86/iommu: Fix header comments regarding standard and _FINISH macros · b4491592
      Aravind Gopalakrishnan 提交于
      The comment line regarding IOMMU_INIT and IOMMU_INIT_FINISH
      macros is incorrect:
      
        "The standard vs the _FINISH differs in that the _FINISH variant
        will continue detecting other IOMMUs in the call list..."
      
      It should be "..the *standard* variant will continue
      detecting..."
      
      Fix that. Also, make it readable while at it.
      Signed-off-by: NAravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: konrad.wilk@oracle.com
      Fixes: 6e963669 ("x86, iommu: Update header comments with appropriate naming")
      Link: http://lkml.kernel.org/r/1428508017-5316-1-git-send-email-Aravind.Gopalakrishnan@amd.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b4491592
    • A
      jump_label: Allow asm/jump_label.h to be included in assembly · 55dd0df7
      Anton Blanchard 提交于
      Wrap asm/jump_label.h for all archs with #ifndef __ASSEMBLY__.
      Since these are kernel only headers, we don't need #ifdef
      __KERNEL__ so can simplify things a bit.
      
      If an architecture wants to use jump labels in assembly, it
      will still need to define a macro to create the __jump_table
      entries (see ARCH_STATIC_BRANCH in the powerpc asm/jump_label.h
      for an example).
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: benh@kernel.crashing.org
      Cc: catalin.marinas@arm.com
      Cc: davem@davemloft.net
      Cc: heiko.carstens@de.ibm.com
      Cc: jbaron@akamai.com
      Cc: linux@arm.linux.org.uk
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: liuj97@gmail.com
      Cc: mgorman@suse.de
      Cc: mmarek@suse.cz
      Cc: mpe@ellerman.id.au
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: rostedt@goodmis.org
      Cc: schwidefsky@de.ibm.com
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1428551492-21977-1-git-send-email-anton@samba.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      55dd0df7
    • L
      x86: clean up/fix 'copy_in_user()' tail zeroing · cae2a173
      Linus Torvalds 提交于
      The rule for 'copy_from_user()' is that it zeroes the remaining kernel
      buffer even when the copy fails halfway, just to make sure that we don't
      leave uninitialized kernel memory around.  Because even if we check for
      errors, some kernel buffers stay around after thge copy (think page
      cache).
      
      However, the x86-64 logic for user copies uses a copy_user_generic()
      function for all the cases, that set the "zerorest" flag for any fault
      on the source buffer.  Which meant that it didn't just try to clear the
      kernel buffer after a failure in copy_from_user(), it also tried to
      clear the destination user buffer for the "copy_in_user()" case.
      
      Not only is that pointless, it also means that the clearing code has to
      worry about the tail clearing taking page faults for the user buffer
      case.  Which is just stupid, since that case shouldn't happen in the
      first place.
      
      Get rid of the whole "zerorest" thing entirely, and instead just check
      if the destination is in kernel space or not.  And then just use
      memset() to clear the tail of the kernel buffer if necessary.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cae2a173
  9. 08 4月, 2015 7 次提交
    • W
      kvm: mmu: lazy collapse small sptes into large sptes · 3ea3b7fa
      Wanpeng Li 提交于
      Dirty logging tracks sptes in 4k granularity, meaning that large sptes
      have to be split.  If live migration is successful, the guest in the
      source machine will be destroyed and large sptes will be created in the
      destination. However, the guest continues to run in the source machine
      (for example if live migration fails), small sptes will remain around
      and cause bad performance.
      
      This patch introduce lazy collapsing of small sptes into large sptes.
      The rmap will be scanned in ioctl context when dirty logging is stopped,
      dropping those sptes which can be collapsed into a single large-page spte.
      Later page faults will create the large-page sptes.
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Message-Id: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3ea3b7fa
    • N
      KVM: x86: DR0-DR3 are not clear on reset · ae561ede
      Nadav Amit 提交于
      DR0-DR3 are not cleared as they should during reset and when they are set from
      userspace.  It appears to be caused by c77fb5fe ("KVM: x86: Allow the guest
      to run with dirty debug registers").
      
      Force their reload on these situations.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1427933438-12782-4-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ae561ede
    • R
      KVM: x86: simplify kvm_apic_map · 3b5a5ffa
      Radim Krčmář 提交于
      recalculate_apic_map() uses two passes over all VCPUs.  This is a relic
      from time when we selected a global mode in the first pass and set up
      the optimized table in the second pass (to have a consistent mode).
      
      Recent changes made mixed mode unoptimized and we can do it in one pass.
      Format of logical MDA is a function of the mode, so we encode it in
      apic_logical_id() and drop obsoleted variables from the struct.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-5-git-send-email-rkrcmar@redhat.com>
      [Add lid_bits temporary in apic_logical_id. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3b5a5ffa
    • R
      KVM: x86: avoid logical_map when it is invalid · 3548a259
      Radim Krčmář 提交于
      We want to support mixed modes and the easiest solution is to avoid
      optimizing those weird and unlikely scenarios.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-4-git-send-email-rkrcmar@redhat.com>
      [Add comment above KVM_APIC_MODE_* defines. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3548a259
    • R
      KVM: x86: fix mixed APIC mode broadcast · 9ea369b0
      Radim Krčmář 提交于
      Broadcast allowed only one global APIC mode, but mixed modes are
      theoretically possible.  x2APIC IPI doesn't mean 0xff as broadcast,
      the rest does.
      
      x2APIC broadcasts are accepted by xAPIC.  If we take SDM to be logical,
      even addreses beginning with 0xff should be accepted, but real hardware
      disagrees.  This patch aims for simple code by considering most of real
      behavior as undefined.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Message-Id: <1423766494-26150-3-git-send-email-rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9ea369b0
    • E
      KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu · 5a4f55cd
      Eugene Korenevsky 提交于
      cpuid_maxphyaddr(), which performs lot of memory accesses is called
      extensively across KVM, especially in nVMX code.
      
      This patch adds a cached value of maxphyaddr to vcpu.arch to reduce the
      pressure onto CPU cache and simplify the code of cpuid_maxphyaddr()
      callers. The cached value is initialized in kvm_arch_vcpu_init() and
      reloaded every time CPUID is updated by usermode. It is obvious that
      these reloads occur infrequently.
      Signed-off-by: NEugene Korenevsky <ekorenevsky@gmail.com>
      Message-Id: <20150329205612.GA1223@gnote>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5a4f55cd
    • D
      x86/asm/entry/irq: Simplify interrupt dispatch table (IDT) layout · 3304c9c3
      Denys Vlasenko 提交于
      Interrupt entry points are handled with the following code,
      each 32-byte code block contains seven entry points:
      
      		...
      		[push][jump 22] // 4 bytes
      		[push][jump 18] // 4 bytes
      		[push][jump 14] // 4 bytes
      		[push][jump 10] // 4 bytes
      		[push][jump  6] // 4 bytes
      		[push][jump  2] // 4 bytes
      		[push][jump common_interrupt][padding] // 8 bytes
      
      		[push][jump]
      		[push][jump]
      		[push][jump]
      		[push][jump]
      		[push][jump]
      		[push][jump]
      		[push][jump common_interrupt][padding]
      
      		[padding_2]
      	common_interrupt:
      
      And there is a table which holds pointers to every entry point,
      IOW: to every push.
      
      In cold cache, two jumps are still costlier than one, even
      though we get the benefit of them residing in the same
      cacheline.
      
      This change replaces short jumps with near ones to
      'common_interrupt', and pads every push+jump pair to 8 bytes. This
      way, each interrupt takes only one jump.
      
      This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before
      dispatch table with ".align 8" - we do not need anything
      stronger than that.
      
      The table of entry addresses (the interrupt[] array) is no
      longer necessary, the address of entries can be easily
      calculated as (irq_entries_start + i*8).
      
         text	   data	    bss	    dec	    hex	filename
        12546	      0	      0	  12546	   3102	entry_64.o.before
        11626	      0	      0	  11626	   2d6a	entry_64.o
      
      The size decrease is because 1656 bytes of .init.rodata are
      gone. That's initdata, though. The resident size does go up a
      bit.
      
      Run-tested (32 and 64 bits).
      Acked-and-Tested-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1428090553-7283-1-git-send-email-dvlasenk@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3304c9c3
  10. 06 4月, 2015 2 次提交
    • D
      x86/asm/entry: Clear EXTRA_REGS for all executable formats · fc3e958a
      Denys Vlasenko 提交于
      On failure, sys_execve() does not clobber EXTRA_REGS, so we can
      just return to userpsace without saving/restoring them.
      
      On success, ELF_PLAT_INIT() in sys_execve() clears all these
      registers.
      
      On other executable formats:
      
        - binfmt_flat.c has similar FLAT_PLAT_INIT, but x86 (and everyone
          else except sh) doesn't define it.
      
        - binfmt_elf_fdpic.c has ELF_FDPIC_PLAT_INIT, but x86 (and most
          others) doesn't define it.
      
        - There are no such hooks in binfmt_aout.c et al. We inherit
          EXTRA_REGS from the prior executable.
      
      This inconsistency was not intended.
      
      This change removes SAVE/RESTORE_EXTRA_REGS in stub_execve,
      removes register clearing in ELF_PLAT_INIT(),
      and instead simply clears them on success in stub_execve.
      
      Run-tested.
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1428173719-7637-1-git-send-email-dvlasenk@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fc3e958a
    • B
      x86/signal: Remove pax argument from restore_sigcontext · 6a3713f0
      Brian Gerst 提交于
      The 'pax' argument is unnecesary.  Instead, store the RAX value
      directly in regs.
      
      This pattern goes all the way back to 2.1.106pre1, when restore_sigcontext()
      was changed to return an error code instead of EAX directly:
      
        https://git.kernel.org/cgit/linux/kernel/git/history/history.git/diff/arch/i386/kernel/signal.c?id=9a8f8b7ca3f319bd668298d447bdf32730e51174
      
      In 2007 sigaltstack syscall support was added, where the return
      value of restore_sigcontext() was changed to carry the memory-copying
      failure code.
      
      But instead of putting 'ax' into regs->ax directly, it was carried
      in via a pointer and then returned, where the generic syscall return
      code copied it to regs->ax.
      
      So there was never any deeper reason for this suboptimal pattern, it
      was simply never noticed after being introduced.
      Signed-off-by: NBrian Gerst <brgerst@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1428152303-17154-1-git-send-email-brgerst@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6a3713f0
  11. 04 4月, 2015 1 次提交
    • B
      x86/alternatives: Fix ALTERNATIVE_2 padding generation properly · dbe4058a
      Borislav Petkov 提交于
      Quentin caught a corner case with the generation of instruction
      padding in the ALTERNATIVE_2 macro: if len(orig_insn) <
      len(alt1) < len(alt2), then not enough padding gets added and
      that is not good(tm) as we could overwrite the beginning of the
      next instruction.
      
      Luckily, at the time of this writing, we don't have
      ALTERNATIVE_2() invocations which have that problem and even if
      we did, a simple fix would be to prepend the instructions with
      enough prefixes so that that corner case doesn't happen.
      
      However, best it would be if we fixed it properly. See below for
      a simple, abstracted example of what we're doing.
      
      So what we ended up doing is, we compute the
      
      	max(len(alt1), len(alt2)) - len(orig_insn)
      
      and feed that value to the .skip gas directive. The max() cannot
      have conditionals due to gas limitations, thus the fancy integer
      math.
      
      With this patch, all ALTERNATIVE_2 sites get padded correctly;
      generating obscure test cases pass too:
      
        #define alt_max_short(a, b)    ((a) ^ (((a) ^ (b)) & -(-((a) < (b)))))
      
        #define gen_skip(orig, alt1, alt2, marker)	\
        	.skip -((alt_max_short(alt1, alt2) - (orig)) > 0) * \
        		(alt_max_short(alt1, alt2) - (orig)),marker
      
        	.pushsection .text, "ax"
        .globl main
        main:
        	gen_skip(1, 2, 4, 0x09)
        	gen_skip(4, 1, 2, 0x10)
        	...
        	.popsection
      
      Thanks to Quentin for catching it and double-checking the fix!
      Reported-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150404133443.GE21152@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      dbe4058a
  12. 03 4月, 2015 5 次提交
    • B
      x86/asm/entry/64: Use a define for an invalid segment selector · 6b51311c
      Borislav Petkov 提交于
      ... instead of a naked number, for better readability.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1428054130-25847-1-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6b51311c
    • B
      x86/mm/KASLR: Propagate KASLR status to kernel proper · 78cac48c
      Borislav Petkov 提交于
      Commit:
      
        e2b32e67 ("x86, kaslr: randomize module base load address")
      
      made module base address randomization unconditional and didn't regard
      disabled KKASLR due to CONFIG_HIBERNATION and command line option
      "nokaslr". For more info see (now reverted) commit:
      
        f47233c2 ("x86/mm/ASLR: Propagate base load address calculation")
      
      In order to propagate KASLR status to kernel proper, we need a single bit
      in boot_params.hdr.loadflags and we've chosen bit 1 thus leaving the
      top-down allocated bits for bits supposed to be used by the bootloader.
      
      Originally-From: Jiri Kosina <jkosina@suse.cz>
      Suggested-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      78cac48c
    • B
      x86/asm/entry: Drop now unused ENABLE_INTERRUPTS_SYSEXIT32 · 47091e3c
      Borislav Petkov 提交于
      Commit:
      
        4214a16b ("x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER")
      
      removed the last user of ENABLE_INTERRUPTS_SYSEXIT32. Kill the
      macro now too.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: virtualization@lists.linux-foundation.org
      Link: http://lkml.kernel.org/r/1428049714-829-1-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      47091e3c
    • A
      x86/asm/entry/32: Stop caching MSR_IA32_SYSENTER_ESP in tss.sp1 · cf9328cc
      Andy Lutomirski 提交于
      We write a stack pointer to MSR_IA32_SYSENTER_ESP exactly once,
      and we unnecessarily cache the value in tss.sp1.  We never
      read the cached value.
      
      Remove all of the caching.  It serves no purpose.
      Suggested-by: NDenys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/05a0163eb33ef5208363f0015496855da7cebadd.1428002830.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cf9328cc
    • R
      x86/asm: Add support for the CLWB instruction · d9dc64f3
      Ross Zwisler 提交于
      Add support for the new CLWB (cache line write back)
      instruction.  This instruction was announced in the document
      "Intel Architecture Instruction Set Extensions Programming
      Reference" with reference number 319433-022.
      
        https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
      
      The CLWB instruction is used to write back the contents of
      dirtied cache lines to memory without evicting the cache lines
      from the processor's cache hierarchy.  This should be used in
      favor of clflushopt or clflush in cases where you require the
      cache line to be written to memory but plan to access the data
      again in the near future.
      
      One of the main use cases for this is with persistent memory
      where CLWB can be used with PCOMMIT to ensure that data has been
      accepted to memory and is durable on the DIMM.
      
      This function shows how to properly use CLWB/CLFLUSHOPT/CLFLUSH
      and PCOMMIT with appropriate fencing:
      
      void flush_and_commit_buffer(void *vaddr, unsigned int size)
      {
      	void *vend = vaddr + size - 1;
      
      	for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
      		clwb(vaddr);
      
      	/* Flush any possible final partial cacheline */
      	clwb(vend);
      
      	/*
      	 * Use SFENCE to order CLWB/CLFLUSHOPT/CLFLUSH cache flushes.
      	 * (MFENCE via mb() also works)
      	 */
      	wmb();
      
      	/* PCOMMIT and the required SFENCE for ordering */
      	pcommit_sfence();
      }
      
      After this function completes the data pointed to by vaddr is
      has been accepted to memory and will be durable if the vaddr
      points to persistent memory.
      
      Regarding the details of how the alternatives assembly is set
      up, we need one additional byte at the beginning of the CLFLUSH
      so that we can flip it into a CLFLUSHOPT by changing that byte
      into a 0x66 prefix.  Two options are to either insert a 1 byte
      ASM_NOP1, or to add a 1 byte NOP_DS_PREFIX.  Both have no
      functional effect with the plain CLFLUSH, but I've been told
      that executing a CLFLUSH + prefix should be faster than
      executing a CLFLUSH + NOP.
      
      We had to hard code the assembly for CLWB because, lacking the
      ability to assemble the CLWB instruction itself, the next
      closest thing is to have an xsaveopt instruction with a 0x66
      prefix.  Unfortunately XSAVEOPT itself is also relatively new,
      and isn't included by all the GCC versions that the kernel needs
      to support.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1422377631-8986-3-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d9dc64f3
  13. 02 4月, 2015 2 次提交
  14. 01 4月, 2015 3 次提交
  15. 31 3月, 2015 2 次提交
    • I
      x86/asm/entry: Remove user_mode_ignore_vm86() · 55474c48
      Ingo Molnar 提交于
      user_mode_ignore_vm86() can be used instead of user_mode(), in
      places where we have already done a v8086_mode() security
      check of ptregs.
      
      But doing this check in the wrong place would be a bug that
      could result in security problems, and also the naming still
      isn't very clear.
      
      Furthermore, it only affects 32-bit kernels, while most
      development happens on 64-bit kernels.
      
      If we replace them with user_mode() checks then the cost is only
      a very minor increase in various slowpaths:
      
         text             data   bss     dec              hex    filename
         10573391         703562 1753042 13029995         c6d26b vmlinux.o.before
         10573423         703562 1753042 13030027         c6d28b vmlinux.o.after
      
      So lets get rid of this distinction once and for all.
      Acked-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150329090233.GA1963@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      55474c48
    • H
      x86/mm: Improve AMD Bulldozer ASLR workaround · 4e26d11f
      Hector Marco-Gisbert 提交于
      The ASLR implementation needs to special-case AMD F15h processors by
      clearing out bits [14:12] of the virtual address in order to avoid I$
      cross invalidations and thus performance penalty for certain workloads.
      For details, see:
      
        dfb09f9b ("x86, amd: Avoid cache aliasing penalties on AMD family 15h")
      
      This special case reduces the mmapped file's entropy by 3 bits.
      
      The following output is the run on an AMD Opteron 62xx class CPU
      processor under x86_64 Linux 4.0.0:
      
        $ for i in `seq 1 10`; do cat /proc/self/maps | grep "r-xp.*libc" ; done
        b7588000-b7736000 r-xp 00000000 00:01 4924       /lib/i386-linux-gnu/libc.so.6
        b7570000-b771e000 r-xp 00000000 00:01 4924       /lib/i386-linux-gnu/libc.so.6
        b75d0000-b777e000 r-xp 00000000 00:01 4924       /lib/i386-linux-gnu/libc.so.6
        b75b0000-b775e000 r-xp 00000000 00:01 4924       /lib/i386-linux-gnu/libc.so.6
        b7578000-b7726000 r-xp 00000000 00:01 4924       /lib/i386-linux-gnu/libc.so.6
        ...
      
      Bits [12:14] are always 0, i.e. the address always ends in 0x8000 or
      0x0000.
      
      32-bit systems, as in the example above, are especially sensitive
      to this issue because 32-bit randomness for VA space is 8 bits (see
      mmap_rnd()). With the Bulldozer special case, this diminishes to only 32
      different slots of mmap virtual addresses.
      
      This patch randomizes per boot the three affected bits rather than
      setting them to zero. Since all the shared pages have the same value
      at bits [12..14], there is no cache aliasing problems. This value gets
      generated during system boot and it is thus not known to a potential
      remote attacker. Therefore, the impact from the Bulldozer workaround
      gets diminished and ASLR randomness increased.
      
      More details at:
      
        http://hmarco.org/bugs/AMD-Bulldozer-linux-ASLR-weakness-reducing-mmaped-files-by-eight.html
      
      Original white paper by AMD dealing with the issue:
      
        http://developer.amd.com/wordpress/media/2012/10/SharedL1InstructionCacheonAMD15hCPU.pdfMentored-by: NIsmael Ripoll <iripoll@disca.upv.es>
      Signed-off-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan-Simon <dl9pf@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-fsdevel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1427456301-3764-1-git-send-email-hecmargi@upv.esSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4e26d11f
  16. 30 3月, 2015 1 次提交