1. 19 2月, 2016 1 次提交
    • D
      mm/core: Do not enforce PKEY permissions on remote mm access · 1b2ee126
      Dave Hansen 提交于
      We try to enforce protection keys in software the same way that we
      do in hardware.  (See long example below).
      
      But, we only want to do this when accessing our *own* process's
      memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
      tried to PTRACE_POKE a target process which just happened to have
      some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
      debugger access to that memory.  PKRU is fundamentally a
      thread-local structure and we do not want to enforce it on access
      to _another_ thread's data.
      
      This gets especially tricky when we have workqueues or other
      delayed-work mechanisms that might run in a random process's context.
      We can check that we only enforce pkeys when operating on our *own* mm,
      but delayed work gets performed when a random user context is active.
      We might end up with a situation where a delayed-work gup fails when
      running randomly under its "own" task but succeeds when running under
      another process.  We want to avoid that.
      
      To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
      fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
      walking an mm which is not guranteed to be the same as
      current->mm and should not be subject to protection key
      enforcement.
      
      Thanks to Jerome Glisse for pointing out this scenario.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1b2ee126
  2. 18 2月, 2016 2 次提交
    • D
      mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys · 33a709b2
      Dave Hansen 提交于
      Today, for normal faults and page table walks, we check the VMA
      and/or PTE to ensure that it is compatible with the action.  For
      instance, if we get a write fault on a non-writeable VMA, we
      SIGSEGV.
      
      We try to do the same thing for protection keys.  Basically, we
      try to make sure that if a user does this:
      
      	mprotect(ptr, size, PROT_NONE);
      	*ptr = foo;
      
      they see the same effects with protection keys when they do this:
      
      	mprotect(ptr, size, PROT_READ|PROT_WRITE);
      	set_pkey(ptr, size, 4);
      	wrpkru(0xffffff3f); // access disable pkey 4
      	*ptr = foo;
      
      The state to do that checking is in the VMA, but we also
      sometimes have to do it on the page tables only, like when doing
      a get_user_pages_fast() where we have no VMA.
      
      We add two functions and expose them to generic code:
      
      	arch_pte_access_permitted(pte_flags, write)
      	arch_vma_access_permitted(vma, write)
      
      These are, of course, backed up in x86 arch code with checks
      against the PTE or VMA's protection key.
      
      But, there are also cases where we do not want to respect
      protection keys.  When we ptrace(), for instance, we do not want
      to apply the tracer's PKRU permissions to the PTEs from the
      process being traced.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      33a709b2
    • D
      x86/mm/pkeys: Add arch-specific VMA protection bits · 8f62c883
      Dave Hansen 提交于
      Lots of things seem to do:
      
              vma->vm_page_prot = vm_get_page_prot(flags);
      
      and the ptes get created right from things we pull out
      of ->vm_page_prot.  So it is very convenient if we can
      store the protection key in flags and vm_page_prot, just
      like the existing permission bits (_PAGE_RW/PRESENT).  It
      greatly reduces the amount of plumbing and arch-specific
      hacking we have to do in generic code.
      
      This also takes the new PROT_PKEY{0,1,2,3} flags and
      turns *those* in to VM_ flags for vma->vm_flags.
      
      The protection key values are stored in 4 places:
      	1. "prot" argument to system calls
      	2. vma->vm_flags, filled from the mmap "prot"
      	3. vma->vm_page prot, filled from vma->vm_flags
      	4. the PTE itself.
      
      The pseudocode for these for steps are as follows:
      
      	mmap(PROT_PKEY*)
      	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
      	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
      	pte = pfn | vma->vm_page_prot
      
      Note that this provides a new definitions for x86:
      
      	arch_vm_get_page_prot()
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210210.FE483A42@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8f62c883
  3. 13 1月, 2016 1 次提交
  4. 11 1月, 2016 1 次提交
    • A
      x86/mm: Add barriers and document switch_mm()-vs-flush synchronization · 71b3c126
      Andy Lutomirski 提交于
      When switch_mm() activates a new PGD, it also sets a bit that
      tells other CPUs that the PGD is in use so that TLB flush IPIs
      will be sent.  In order for that to work correctly, the bit
      needs to be visible prior to loading the PGD and therefore
      starting to fill the local TLB.
      
      Document all the barriers that make this work correctly and add
      a couple that were missing.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      71b3c126
  5. 31 7月, 2015 2 次提交
    • A
      x86/ldt: Make modify_ldt() optional · a5b9e5a2
      Andy Lutomirski 提交于
      The modify_ldt syscall exposes a large attack surface and is
      unnecessary for modern userspace.  Make it optional.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org
      [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      a5b9e5a2
    • A
      x86/ldt: Make modify_ldt synchronous · 37868fe1
      Andy Lutomirski 提交于
      modify_ldt() has questionable locking and does not synchronize
      threads.  Improve it: redesign the locking and synchronize all
      threads' LDTs using an IPI on all modifications.
      
      This will dramatically slow down modify_ldt in multithreaded
      programs, but there shouldn't be any multithreaded programs that
      care about modify_ldt's performance in the first place.
      
      This fixes some fallout from the CVE-2015-5157 fixes.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      37868fe1
  6. 10 7月, 2015 1 次提交
  7. 09 6月, 2015 1 次提交
  8. 04 2月, 2015 3 次提交
  9. 23 1月, 2015 1 次提交
    • D
      x86, mpx: Fix potential performance issue on unmaps · c922228e
      Dave Hansen 提交于
      The 3.19 merge window saw some TLB modifications merged which caused a
      performance regression. They were fixed in commit 045bbb9fa.
      
      Once that fix was applied, I also noticed that there was a small
      but intermittent regression still present.  It was not present
      consistently enough to bisect reliably, but I'm fairly confident
      that it came from (my own) MPX patches.  The source was reading
      a relatively unused field in the mm_struct via arch_unmap.
      
      I also noted that this code was in the main instruction flow of
      do_munmap() and probably had more icache impact than we want.
      
      This patch does two things:
      1. Adds a static (via Kconfig) and dynamic (via cpuid) check
         for MPX with cpu_feature_enabled().  This keeps us from
         reading that cacheline in the mm and trades it for a check
         of the global CPUID variables at least on CPUs without MPX.
      2. Adds an unlikely() to ensure that the MPX call ends up out
         of the main instruction flow in do_munmap().  I've added
         a detailed comment about why this was done and why we want
         it even on systems where MPX is present.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: luto@amacapital.net
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20150108223021.AEEAB987@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      c922228e
  10. 19 11月, 2014 1 次提交
    • D
      x86: Cleanly separate use of asm-generic/mm_hooks.h · a1ea1c03
      Dave Hansen 提交于
      asm-generic/mm_hooks.h provides some generic fillers for the 90%
      of architectures that do not need to hook some mmap-manipulation
      functions.  A comment inside says:
      
      > Define generic no-op hooks for arch_dup_mmap and
      > arch_exit_mmap, to be included in asm-FOO/mmu_context.h
      > for any arch FOO which doesn't need to hook these.
      
      So, does x86 need to hook these?  It depends on CONFIG_PARAVIRT.
      We *conditionally* include this generic header if we have
      CONFIG_PARAVIRT=n.  That's madness.
      
      With this patch, x86 stops using asm-generic/mmu_hooks.h entirely.
      We use our own copies of the functions.  The paravirt code
      provides some stubs if it is disabled, and we always call those
      stubs in our x86-private versions of arch_exit_mmap() and
      arch_dup_mmap().
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/20141118182349.14567FA5@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      a1ea1c03
  11. 18 11月, 2014 2 次提交
    • D
      x86, mpx: Cleanup unused bound tables · 1de4fa14
      Dave Hansen 提交于
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace use the prctl() indroduced earlier in this patchset
      to enable the management of bounds tables in kernel, when it
      unmaps the first type of mapping with the actual data, the kernel
      needs to free the mapping for the bounds table backing the data.
      This patch hooks in at the very end of do_unmap() to do so.
      We look at the addresses being unmapped and find the bounds
      directory entries and tables which cover those addresses.  If
      an entire table is unused, we clear associated directory entry
      and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an enture bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use a
      get_user_pages() to resolve the fault.
      
      The unmapping of bounds tables happends under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      1de4fa14
    • D
      x86, mpx: On-demand kernel allocation of bounds tables · fe3d197f
      Dave Hansen 提交于
      This is really the meat of the MPX patch set.  If there is one patch to
      review in the entire series, this is the one.  There is a new ABI here
      and this kernel code also interacts with userspace memory in a
      relatively unusual manner.  (small FAQ below).
      
      Long Description:
      
      This patch adds two prctl() commands to provide enable or disable the
      management of bounds tables in kernel, including on-demand kernel
      allocation (See the patch "on-demand kernel allocation of bounds tables")
      and cleanup (See the patch "cleanup unused bound tables"). Applications
      do not strictly need the kernel to manage bounds tables and we expect
      some applications to use MPX without taking advantage of this kernel
      support. This means the kernel can not simply infer whether an application
      needs bounds table management from the MPX registers.  The prctl() is an
      explicit signal from userspace.
      
      PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
      require kernel's help in managing bounds tables.
      
      PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
      want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
      won't allocate and free bounds tables even if the CPU supports MPX.
      
      PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
      directory out of a userspace register (bndcfgu) and then cache it into
      a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
      will set "bd_addr" to an invalid address.  Using this scheme, we can
      use "bd_addr" to determine whether the management of bounds tables in
      kernel is enabled.
      
      Also, the only way to access that bndcfgu register is via an xsaves,
      which can be expensive.  Caching "bd_addr" like this also helps reduce
      the cost of those xsaves when doing table cleanup at munmap() time.
      Unfortunately, we can not apply this optimization to #BR fault time
      because we need an xsave to get the value of BNDSTATUS.
      
      ==== Why does the hardware even have these Bounds Tables? ====
      
      MPX only has 4 hardware registers for storing bounds information.
      If MPX-enabled code needs more than these 4 registers, it needs to
      spill them somewhere. It has two special instructions for this
      which allow the bounds to be moved between the bounds registers
      and some new "bounds tables".
      
      They are similar conceptually to a page fault and will be raised by
      the MPX hardware during both bounds violations or when the tables
      are not present. This patch handles those #BR exceptions for
      not-present tables by carving the space out of the normal processes
      address space (essentially calling the new mmap() interface indroduced
      earlier in this patch set.) and then pointing the bounds-directory
      over to it.
      
      The tables *need* to be accessed and controlled by userspace because
      the instructions for moving bounds in and out of them are extremely
      frequent. They potentially happen every time a register pointing to
      memory is dereferenced. Any direct kernel involvement (like a syscall)
      to access the tables would obviously destroy performance.
      
      ==== Why not do this in userspace? ====
      
      This patch is obviously doing this allocation in the kernel.
      However, MPX does not strictly *require* anything in the kernel.
      It can theoretically be done completely from userspace. Here are
      a few ways this *could* be done. I don't think any of them are
      practical in the real-world, but here they are.
      
      Q: Can virtual space simply be reserved for the bounds tables so
         that we never have to allocate them?
      A: As noted earlier, these tables are *HUGE*. An X-GB virtual
         area needs 4*X GB of virtual space, plus 2GB for the bounds
         directory. If we were to preallocate them for the 128TB of
         user virtual address space, we would need to reserve 512TB+2GB,
         which is larger than the entire virtual address space today.
         This means they can not be reserved ahead of time. Also, a
         single process's pre-popualated bounds directory consumes 2GB
         of virtual *AND* physical memory. IOW, it's completely
         infeasible to prepopulate bounds directories.
      
      Q: Can we preallocate bounds table space at the same time memory
         is allocated which might contain pointers that might eventually
         need bounds tables?
      A: This would work if we could hook the site of each and every
         memory allocation syscall. This can be done for small,
         constrained applications. But, it isn't practical at a larger
         scale since a given app has no way of controlling how all the
         parts of the app might allocate memory (think libraries). The
         kernel is really the only place to intercept these calls.
      
      Q: Could a bounds fault be handed to userspace and the tables
         allocated there in a signal handler instead of in the kernel?
      A: (thanks to tglx) mmap() is not on the list of safe async
         handler functions and even if mmap() would work it still
         requires locking or nasty tricks to keep track of the
         allocation state there.
      
      Having ruled out all of the userspace-only approaches for managing
      bounds tables that we could think of, we create them on demand in
      the kernel.
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      fe3d197f
  12. 28 10月, 2014 1 次提交
  13. 31 7月, 2014 1 次提交
  14. 01 8月, 2013 1 次提交
  15. 15 5月, 2012 1 次提交
  16. 27 7月, 2011 1 次提交
  17. 04 2月, 2011 1 次提交
    • S
      x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after switching mm · 831d52bc
      Suresh Siddha 提交于
      Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb
      IPI's while the cr3 is still pointing to the prev mm.  And this window
      can lead to the possibility of bogus TLB fills resulting in strange
      failures.  One such problematic scenario is mentioned below.
      
       T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI
           etc between the point of clearing the cpu from the mm_cpumask(mm1)
           and before reloading the cr3 with the new mm2.
      
       T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with
           flushing the TLB for mm1.  It doesn't send the flush TLB to CPU-1
           as it doesn't see that cpu listed in the mm_cpumask(mm1).
      
       T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
           page-table pages associated with the removed vma mapping.
      
       T4. CPU-2 now allocates those freed page-table pages for something
           else.
      
       T5. As the CR3 and TLB caches for mm1 is still active on CPU-1, CPU-1
           can potentially speculate and walk through the page-table caches
           and can insert new TLB entries.  As the page-table pages are
           already freed and being used on CPU-2, this page walk can
           potentially insert a bogus global TLB entry depending on the
           (random) contents of the page that is being used on CPU-2.
      
       T6. This bogus TLB entry being global will be active across future CR3
           changes and can result in weird memory corruption etc.
      
      To avoid this issue, for the prev mm that is handing over the cpu to
      another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is
      changed.
      
      Marking it for -stable, though we haven't seen any reported failure that
      can be attributed to this.
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: stable@kernel.org	[v2.6.32+]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      831d52bc
  18. 24 9月, 2009 1 次提交
  19. 10 2月, 2009 2 次提交
    • T
      x86: make lazy %gs optional on x86_32 · ccbeed3a
      Tejun Heo 提交于
      Impact: pt_regs changed, lazy gs handling made optional, add slight
              overhead to SAVE_ALL, simplifies error_code path a bit
      
      On x86_32, %gs hasn't been used by kernel and handled lazily.  pt_regs
      doesn't have place for it and gs is saved/loaded only when necessary.
      In preparation for stack protector support, this patch makes lazy %gs
      handling optional by doing the followings.
      
      * Add CONFIG_X86_32_LAZY_GS and place for gs in pt_regs.
      
      * Save and restore %gs along with other registers in entry_32.S unless
        LAZY_GS.  Note that this unfortunately adds "pushl $0" on SAVE_ALL
        even when LAZY_GS.  However, it adds no overhead to common exit path
        and simplifies entry path with error code.
      
      * Define different user_gs accessors depending on LAZY_GS and add
        lazy_save_gs() and lazy_load_gs() which are noop if !LAZY_GS.  The
        lazy_*_gs() ops are used to save, load and clear %gs lazily.
      
      * Define ELF_CORE_COPY_KERNEL_REGS() which always read %gs directly.
      
      xen and lguest changes need to be verified.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ccbeed3a
    • T
      x86: add %gs accessors for x86_32 · d9a89a26
      Tejun Heo 提交于
      Impact: cleanup
      
      On x86_32, %gs is handled lazily.  It's not saved and restored on
      kernel entry/exit but only when necessary which usually is during task
      switch but there are few other places.  Currently, it's done by
      calling savesegment() and loadsegment() explicitly.  Define
      get_user_gs(), set_user_gs() and task_user_gs() and use them instead.
      
      While at it, clean up register access macros in signal.c.
      
      This cleans up code a bit and will help future changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d9a89a26
  20. 21 1月, 2009 1 次提交
  21. 23 10月, 2008 2 次提交
  22. 23 7月, 2008 1 次提交
    • V
      x86: consolidate header guards · 77ef50a5
      Vegard Nossum 提交于
      This patch is the result of an automatic script that consolidates the
      format of all the headers in include/asm-x86/.
      
      The format:
      
      1. No leading underscore. Names with leading underscores are reserved.
      2. Pathname components are separated by two underscores. So we can
         distinguish between mm_types.h and mm/types.h.
      3. Everything except letters and numbers are turned into single
         underscores.
      Signed-off-by: NVegard Nossum <vegard.nossum@gmail.com>
      77ef50a5
  23. 08 7月, 2008 1 次提交
  24. 11 10月, 2007 1 次提交