1. 08 Jun, 2017 (1 commit)
  2. 05 Jun, 2017 (2 commits)
    • x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant · d6e41f11
      Andy Lutomirski committed
      When PCID is enabled, CR3's PCID bits can change during context
      switches, so KVM won't be able to treat CR3 as a per-mm constant any
      more.
      
      I structured this like the existing CR4 handling.  Under ordinary
      circumstances (PCID disabled, or the current PCID matching the value
      that's already in the VMCS), we won't do an extra VMCS write, and
      we'll never do an extra direct CR3 read.  The overhead should be
      minimal.
      
      I disallowed using the new helper in non-atomic context because
      PCID support will cause CR3 to stop being constant in non-atomic
      process context.
      
      (Frankly, it also scares me a bit that KVM ever treated CR3 as
      constant, but it looks like it was okay before.)
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d6e41f11
    • x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code · ce4a4e56
      Andy Lutomirski committed
      The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
      Aside from that, it's fallen quite a bit behind the SMP code:
      
       - flush_tlb_mm_range() didn't flush individual pages if the range
         was small.
      
       - The lazy TLB code was much weaker.  This usually wouldn't matter,
         but, if a kernel thread flushed its lazy "active_mm" more than
         once (due to reclaim or similar), it wouldn't be unlazied and
         would instead pointlessly flush repeatedly.
      
       - Tracepoints were missing.
      
      Aside from that, simply having the UP code around was a maintenance
      burden, since it means that any change to the TLB flush code had to
      make sure not to break it.
      
      Simplify everything by deleting the UP code.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ce4a4e56
  3. 23 Apr, 2017 (1 commit)
    • Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" · 6dd29b3d
      Ingo Molnar committed
      This reverts commit 2947ba05.
      
      Dan Williams reported dax-pmem kernel warnings with the following signature:
      
         WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
         percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
      
      ... and bisected it to this commit, which suggests possible memory corruption
      caused by the x86 fast-GUP conversion.
      
      He also pointed out:
      
       "
        This is similar to the backtrace when we were not properly handling
        pud faults and was fixed with this commit: 220ced16 "mm: fix
        get_user_pages() vs device-dax pud mappings"
      
        I've found some missing _devmap checks in the generic
        get_user_pages_fast() path, but this does not fix the regression
        [...]
       "
      
      So given that there are known bugs, and a pretty robust-looking bisection
      points to this commit suggesting that there are unknown bugs in the
      conversion as well, revert it for the time being - we'll re-try in v4.13.
      Reported-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dann.frazier@canonical.com
      Cc: dave.hansen@intel.com
      Cc: steve.capper@linaro.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6dd29b3d
  4. 18 Mar, 2017 (2 commits)
  5. 10 Dec, 2016 (1 commit)
    • x86/ldt: Make all size computations unsigned · 990e9dc3
      Thomas Gleixner committed
      ldt->size can never be negative. The helper functions take 'unsigned int'
      arguments which are assigned from ldt->size. The related user space
      user_desc struct member entry_number is unsigned as well.
      
      But ldt->size itself and a few local variables which are related to
      ldt->size are type 'int' which makes no sense whatsoever and results in
      typecasts which make the eyes bleed.
      
      Clean it up and convert everything which is related to ldt->size to
      unsigned int.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      990e9dc3
  6. 09 Sep, 2016 (2 commits)
    • x86/pkeys: Allocation/free syscalls · e8c24d3a
      Dave Hansen committed
      This patch adds two new system calls:
      
      	int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
      	int pkey_free(int pkey);
      
      These implement an "allocator" for the protection keys
      themselves, which can be thought of as analogous to the allocator
      that the kernel has for file descriptors.  The kernel tracks
      which numbers are in use, and only allows operations on keys that
      are valid.  A key which was not obtained by pkey_alloc() may not,
      for instance, be passed to pkey_mprotect().
      
      These system calls are also very important given the kernel's use
      of pkeys to implement execute-only support.  These help ensure
      that userspace can never assume that it has control of a key
      unless it first asks the kernel.  The kernel does not promise to
      preserve PKRU (rights register) contents except for allocated
      pkeys.
      
      The 'init_access_rights' argument to pkey_alloc() specifies the
      rights that will be established for the returned pkey.  For
      instance:
      
      	pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
      
      will allocate 'pkey', but will also set the bits in PKRU[1] such that
      writing to 'pkey' is already denied.
      
      The kernel does not prevent pkey_free() from successfully freeing
      in-use pkeys (those still assigned to a memory range by
      pkey_mprotect()).  It would be expensive to implement the checks
      for this, so we instead say, "Just don't do it" since sane
      software will never do it anyway.
      
      Any piece of userspace calling pkey_alloc() needs to be prepared
      for it to fail.  Why?  pkey_alloc() returns the same error code
      (ENOSPC) when there are no pkeys and when pkeys are unsupported.
      They can be unsupported for a whole host of reasons, so apps must
      be prepared for this.  Also, libraries or LD_PRELOADs might steal
      keys before an application gets access to them.
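      For example, a robust caller might look something like this (sketch;
      the fallback helper is hypothetical application code):

      	pkey = pkey_alloc(0, PKEY_DENY_WRITE);
      	if (pkey < 0) {
      		/* ENOSPC: no free keys, or no pkey support at all */
      		use_mprotect_only_fallback();	/* hypothetical fallback path */
      	}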
      
      This allocation mechanism could be implemented in userspace.
      Even if we did it in userspace, we would still need additional
      user/kernel interfaces to tell userspace which keys are being
      used by the kernel internally (such as for execute-only
      mappings).  Having the kernel provide this facility completely
      removes the need for these additional interfaces, or having an
      implementation of this in userspace at all.
      
      Note that we have to make changes to all of the architectures
      that do not use mman-common.h because we use the new
      PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
      
      1. PKRU is the Protection Key Rights User register.  It is a
         usermode-accessible register that controls whether writes
         and/or access to each individual pkey is allowed or denied.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: linux-arch@vger.kernel.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: arnd@arndb.de
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e8c24d3a
    • mm: Implement new pkey_mprotect() system call · 7d06d9c9
      Dave Hansen committed
      pkey_mprotect() is just like mprotect, except it also takes a
      protection key as an argument.  On systems that do not support
      protection keys, it still works, but requires that key=0.
      Otherwise it does exactly what mprotect does.
      
      I expect it to get used like this, if you want to guarantee that
      any mapping you create can *never* be accessed without the right
      protection keys set up.
      
      	int real_prot = PROT_READ|PROT_WRITE;
      	pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
      	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
      
      This way, there is *no* window where the mapping is accessible
      since it was always either PROT_NONE or had a protection key set
      that denied all access.
      
      We settled on 'unsigned long' for the type of the key here.  We
      only need 4 bits on x86 today, but I figured that other
      architectures might need some more space.
      
      Semantically, we have a bit of a problem if we combine this
      syscall with our previously-introduced execute-only support:
      What do we do when we mix execute-only pkey use with
      pkey_mprotect() use?  For instance:
      
      	pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
      	mprotect(ptr, PAGE_SIZE, PROT_EXEC);  // set pkey=X_ONLY_PKEY?
      	mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?
      
      To solve that, we make the plain-mprotect()-initiated execute-only
      support only apply to VMAs that have the default protection key (0)
      set on them.
      
      Proposed semantics:
      1. protection key 0 is special and represents the default,
         "unassigned" protection key.  It is always allocated.
      2. mprotect() never affects a mapping's pkey_mprotect()-assigned
         protection key. A protection key of 0 (even if set explicitly)
         represents an unassigned protection key.
         2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
             key may or may not result in a mapping with execute-only
             properties.  pkey_mprotect() plus pkey_set() on all threads
             should be used to _guarantee_ execute-only semantics if this
             is not a strong enough semantic.
      3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
         kernel will internally attempt to allocate and dedicate a
         protection key for the purpose of execute-only mappings.  This
         may not be possible in cases where there are no free protection
         keys available.  It can also happen, of course, in situations
         where there is no hardware support for protection keys.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: linux-arch@vger.kernel.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: arnd@arndb.de
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7d06d9c9
  7. 04 Aug, 2016 (1 commit)
    • tree-wide: replace config_enabled() with IS_ENABLED() · 97f2645f
      Masahiro Yamada committed
      The use of config_enabled() against config options is ambiguous.  In
      practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
      author might have used it for the meaning of IS_ENABLED().  Using
      IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc.  makes the intention
      clearer.
      
      This commit replaces config_enabled() with IS_ENABLED() where possible.
      This commit is only touching bool config options.
      
      I noticed two cases where config_enabled() is used against a tristate
      option:
      
       - config_enabled(CONFIG_HWMON)
        [ drivers/net/wireless/ath/ath10k/thermal.c ]
      
       - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
        [ drivers/gpu/drm/gma500/opregion.c ]
      
      I did not touch them because they should be converted to IS_BUILTIN()
      in order to keep the logic, but I was not sure it was the authors'
      intention.
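      For a quick sense of the difference, with a made-up tristate option
      CONFIG_FOO set to =m (the driver call here is hypothetical):

      	/* config_enabled(CONFIG_FOO) -> 0, same as IS_BUILTIN(CONFIG_FOO) */
      	/* IS_MODULE(CONFIG_FOO)      -> 1                                 */
      	/* IS_ENABLED(CONFIG_FOO)     -> 1, i.e. built-in || module        */
      	if (IS_ENABLED(CONFIG_FOO))
      		register_foo_driver();	/* hypothetical */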
      
      Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Stas Sergeev <stsp@list.ru>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Joshua Kinard <kumba@gentoo.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: yu-cheng yu <yu-cheng.yu@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Nikolay Martynov <mar.kolya@gmail.com>
      Cc: Huacai Chen <chenhc@lemote.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
      Cc: Rafal Milecki <zajec5@gmail.com>
      Cc: James Cowgill <James.Cowgill@imgtec.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Alex Smith <alex.smith@imgtec.com>
      Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
      Cc: Qais Yousef <qais.yousef@imgtec.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Mikko Rapeli <mikko.rapeli@iki.fi>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: "Luis R. Rodriguez" <mcgrof@do-not-panic.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Kalle Valo <kvalo@qca.qualcomm.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Tony Wu <tung7970@gmail.com>
      Cc: Huaitong Han <huaitong.han@intel.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Gelmini <andrea.gelmini@gelma.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Rabin Vincent <rabin@rab.in>
      Cc: "Maciej W. Rozycki" <macro@imgtec.com>
      Cc: David Daney <david.daney@cavium.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97f2645f
  8. 28 Apr, 2016 (2 commits)
  9. 19 Feb, 2016 (3 commits)
    • x86/mm: Factor out LDT init from context init · 39a0526f
      Dave Hansen committed
      The arch-specific mm_context_t is a great place to put
      protection-key allocation state.
      
      But, we need to initialize the allocation state because pkey 0 is
      always "allocated".  All of the runtime initialization of
      mm_context_t is done in *_ldt() manipulation functions.  This
      renames the existing LDT functions like this:
      
      	init_new_context() -> init_new_context_ldt()
      	destroy_context() -> destroy_context_ldt()
      
      and makes init_new_context() and destroy_context() available for
      generic use.
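      After the rename, the generic-looking entry points become thin
      wrappers, roughly along these lines (sketch, not the verbatim patch):

      	static inline int init_new_context(struct task_struct *tsk,
      					   struct mm_struct *mm)
      	{
      		/* pkey allocation-state init will be added here later */
      		return init_new_context_ldt(tsk, mm);
      	}

      	static inline void destroy_context(struct mm_struct *mm)
      	{
      		destroy_context_ldt(mm);
      	}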
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210234.DB34FCC5@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      39a0526f
    • mm/core, x86/mm/pkeys: Differentiate instruction fetches · d61172b4
      Dave Hansen committed
      As discussed earlier, we attempt to enforce protection keys in
      software.
      
      However, the code checks all faults to ensure that they are not
      violating protection key permissions.  It was assumed that all
      faults are either write faults where we check PKRU[key].WD (write
      disable) or read faults where we check the AD (access disable)
      bit.
      
      But, there is a third category of faults for protection keys:
      instruction faults.  Instruction faults never run afoul of
      protection keys because protection keys do not affect instruction
      fetches.
      
      So, plumb the PF_INSTR bit down in to the
      arch_vma_access_permitted() function where we do the protection
      key checks.
      
      We also add a new FAULT_FLAG_INSTRUCTION.  This is because
      handle_mm_fault() is not passed the architecture-specific
      error_code where we keep PF_INSTR, so we need to encode the
      instruction fetch information in to the arch-generic fault
      flags.
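      The plumbing itself is tiny; in the x86 fault handler it is roughly:

      	if (error_code & PF_INSTR)
      		flags |= FAULT_FLAG_INSTRUCTION;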
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d61172b4
    • mm/core: Do not enforce PKEY permissions on remote mm access · 1b2ee126
      Dave Hansen committed
      We try to enforce protection keys in software the same way that we
      do in hardware.  (See long example below).
      
      But, we only want to do this when accessing our *own* process's
      memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
      tried to PTRACE_POKE a target process which just happened to have
      some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
      debugger access to that memory.  PKRU is fundamentally a
      thread-local structure and we do not want to enforce it on access
      to _another_ thread's data.
      
      This gets especially tricky when we have workqueues or other
      delayed-work mechanisms that might run in a random process's context.
      We can check that we only enforce pkeys when operating on our *own* mm,
      but delayed work gets performed when a random user context is active.
      We might end up with a situation where a delayed-work gup fails when
      running randomly under its "own" task but succeeds when running under
      another process.  We want to avoid that.
      
      To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
      fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
      walking an mm which is not guaranteed to be the same as
      current->mm and should not be subject to protection key
      enforcement.
      
      Thanks to Jerome Glisse for pointing out this scenario.
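      On the x86 side the check ends up looking roughly like the sketch
      below ('foreign' is the new remote-access hint; helper names are
      approximate):

      	static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
      			bool write, bool execute, bool foreign)
      	{
      		/* pkeys never block instruction fetches or remote (foreign) access */
      		if (execute || foreign)
      			return true;
      		return __pkru_allows_pkey(vma_pkey(vma), write);
      	}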
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1b2ee126
  10. 18 Feb, 2016 (2 commits)
    • mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys · 33a709b2
      Dave Hansen committed
      Today, for normal faults and page table walks, we check the VMA
      and/or PTE to ensure that it is compatible with the action.  For
      instance, if we get a write fault on a non-writeable VMA, we
      SIGSEGV.
      
      We try to do the same thing for protection keys.  Basically, we
      try to make sure that if a user does this:
      
      	mprotect(ptr, size, PROT_NONE);
      	*ptr = foo;
      
      they see the same effects with protection keys when they do this:
      
      	mprotect(ptr, size, PROT_READ|PROT_WRITE);
      	set_pkey(ptr, size, 4);
      	wrpkru(0xffffff3f); // access disable pkey 4
      	*ptr = foo;
      
      The state to do that checking is in the VMA, but we also
      sometimes have to do it on the page tables only, like when doing
      a get_user_pages_fast() where we have no VMA.
      
      We add two functions and expose them to generic code:
      
      	arch_pte_access_permitted(pte_flags, write)
      	arch_vma_access_permitted(vma, write)
      
      These are, of course, backed up in x86 arch code with checks
      against the PTE or VMA's protection key.
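      In the fast-GUP walker, where no VMA is available, the check is along
      these lines (sketch only; the exact call site and fallback differ):

      	if (!arch_pte_access_permitted(pte_flags(pte), write))
      		return 0;	/* bail to the slow path, which can consult the VMA */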
      
      But, there are also cases where we do not want to respect
      protection keys.  When we ptrace(), for instance, we do not want
      to apply the tracer's PKRU permissions to the PTEs from the
      process being traced.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      33a709b2
    • x86/mm/pkeys: Add arch-specific VMA protection bits · 8f62c883
      Dave Hansen committed
      Lots of things seem to do:
      
              vma->vm_page_prot = vm_get_page_prot(flags);
      
      and the ptes get created right from things we pull out
      of ->vm_page_prot.  So it is very convenient if we can
      store the protection key in flags and vm_page_prot, just
      like the existing permission bits (_PAGE_RW/PRESENT).  It
      greatly reduces the amount of plumbing and arch-specific
      hacking we have to do in generic code.
      
      This also takes the new PROT_PKEY{0,1,2,3} flags and
      turns *those* in to VM_ flags for vma->vm_flags.
      
      The protection key values are stored in 4 places:
      	1. "prot" argument to system calls
      	2. vma->vm_flags, filled from the mmap "prot"
      	3. vma->vm_page_prot, filled from vma->vm_flags
      	4. the PTE itself.
      
      The pseudocode for these four steps is as follows:
      
      	mmap(PROT_PKEY*)
      	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
      	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
      	pte = pfn | vma->vm_page_prot
      
      Note that this provides a new definition for x86:
      
      	arch_vm_get_page_prot()
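      Roughly, it just translates the VM_PKEY_* bits in vma->vm_flags into
      the pkey bits of the pte protection (a sketch; macro spellings
      approximate):

      	#define arch_vm_get_page_prot(vm_flags)	__pgprot(		\
      		((vm_flags) & VM_PKEY_BIT0 ? _PAGE_PKEY_BIT0 : 0) |	\
      		((vm_flags) & VM_PKEY_BIT1 ? _PAGE_PKEY_BIT1 : 0) |	\
      		((vm_flags) & VM_PKEY_BIT2 ? _PAGE_PKEY_BIT2 : 0) |	\
      		((vm_flags) & VM_PKEY_BIT3 ? _PAGE_PKEY_BIT3 : 0))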
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210210.FE483A42@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8f62c883
  11. 13 Jan, 2016 (1 commit)
  12. 11 Jan, 2016 (1 commit)
    • x86/mm: Add barriers and document switch_mm()-vs-flush synchronization · 71b3c126
      Andy Lutomirski committed
      When switch_mm() activates a new PGD, it also sets a bit that
      tells other CPUs that the PGD is in use so that TLB flush IPIs
      will be sent.  In order for that to work correctly, the bit
      needs to be visible prior to loading the PGD and therefore
      starting to fill the local TLB.
      
      Document all the barriers that make this work correctly and add
      a couple that were missing.
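      The ordering that matters is roughly the following (simplified sketch
      of switch_mm(); the real code documents each barrier in detail):

      	cpumask_set_cpu(cpu, mm_cpumask(next));	/* announce that we use 'next' */
      	/*
      	 * The store above must be visible before we start filling the TLB
      	 * from the new page tables; on x86 the CR3 write below acts as a
      	 * full barrier, and the flush side pairs with it by ordering its
      	 * mm_cpumask() read after the PTE updates it is flushing for.
      	 */
      	load_cr3(next->pgd);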
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71b3c126
  13. 31 Jul, 2015 (2 commits)
    • x86/ldt: Make modify_ldt() optional · a5b9e5a2
      Andy Lutomirski committed
      The modify_ldt syscall exposes a large attack surface and is
      unnecessary for modern userspace.  Make it optional.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org
      [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a5b9e5a2
    • x86/ldt: Make modify_ldt synchronous · 37868fe1
      Andy Lutomirski committed
      modify_ldt() has questionable locking and does not synchronize
      threads.  Improve it: redesign the locking and synchronize all
      threads' LDTs using an IPI on all modifications.
      
      This will dramatically slow down modify_ldt in multithreaded
      programs, but there shouldn't be any multithreaded programs that
      care about modify_ldt's performance in the first place.
      
      This fixes some fallout from the CVE-2015-5157 fixes.
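      The synchronization itself boils down to something like this (sketch;
      names are close to, but not necessarily identical to, the patch):

      	static void flush_ldt(void *__mm)
      	{
      		struct mm_struct *mm = __mm;

      		if (current->active_mm == mm)
      			load_mm_ldt(mm);	/* reload LDTR from mm->context */
      	}

      	/* after publishing the new LDT in mm->context: */
      	on_each_cpu(flush_ldt, mm, 1);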
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      37868fe1
  14. 10 Jul, 2015 (1 commit)
  15. 09 Jun, 2015 (1 commit)
  16. 04 Feb, 2015 (3 commits)
  17. 23 Jan, 2015 (1 commit)
    • x86, mpx: Fix potential performance issue on unmaps · c922228e
      Dave Hansen committed
      The 3.19 merge window saw some TLB modifications merged which caused a
      performance regression. They were fixed in commit 045bbb9fa.
      
      Once that fix was applied, I also noticed that there was a small
      but intermittent regression still present.  It was not present
      consistently enough to bisect reliably, but I'm fairly confident
      that it came from (my own) MPX patches.  The source was reading
      a relatively unused field in the mm_struct via arch_unmap.
      
      I also noted that this code was in the main instruction flow of
      do_munmap() and probably had more icache impact than we want.
      
      This patch does two things:
      1. Adds a static (via Kconfig) and dynamic (via cpuid) check
         for MPX with cpu_feature_enabled().  This keeps us from
         reading that cacheline in the mm and trades it for a check
         of the global CPUID variables at least on CPUs without MPX.
      2. Adds an unlikely() to ensure that the MPX call ends up out
         of the main instruction flow in do_munmap().  I've added
         a detailed comment about why this was done and why we want
         it even on systems where MPX is present.
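      The resulting hook looks roughly like this (simplified; the exact
      placement of the checks differs slightly in the patch):

      	static inline void arch_unmap(struct mm_struct *mm,
      				      struct vm_area_struct *vma,
      				      unsigned long start, unsigned long end)
      	{
      		/*
      		 * cpu_feature_enabled() compiles away to 0 when MPX is
      		 * configured out and otherwise checks the cached CPUID
      		 * bits, so non-MPX systems never touch mm->bd_addr here.
      		 */
      		if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX)))
      			mpx_notify_unmap(mm, vma, start, end);
      	}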
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: luto@amacapital.net
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20150108223021.AEEAB987@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c922228e
  18. 19 Nov, 2014 (1 commit)
    • x86: Cleanly separate use of asm-generic/mm_hooks.h · a1ea1c03
      Dave Hansen committed
      asm-generic/mm_hooks.h provides some generic fillers for the 90%
      of architectures that do not need to hook some mmap-manipulation
      functions.  A comment inside says:
      
      > Define generic no-op hooks for arch_dup_mmap and
      > arch_exit_mmap, to be included in asm-FOO/mmu_context.h
      > for any arch FOO which doesn't need to hook these.
      
      So, does x86 need to hook these?  It depends on CONFIG_PARAVIRT.
      We *conditionally* include this generic header if we have
      CONFIG_PARAVIRT=n.  That's madness.
      
      With this patch, x86 stops using asm-generic/mm_hooks.h entirely.
      We use our own copies of the functions.  The paravirt code
      provides some stubs if it is disabled, and we always call those
      stubs in our x86-private versions of arch_exit_mmap() and
      arch_dup_mmap().
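      The x86-private versions then reduce to roughly:

      	static inline void arch_dup_mmap(struct mm_struct *oldmm,
      					 struct mm_struct *mm)
      	{
      		paravirt_arch_dup_mmap(oldmm, mm);	/* no-op stub if PARAVIRT=n */
      	}

      	static inline void arch_exit_mmap(struct mm_struct *mm)
      	{
      		paravirt_arch_exit_mmap(mm);		/* no-op stub if PARAVIRT=n */
      	}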
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/20141118182349.14567FA5@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a1ea1c03
  19. 18 Nov, 2014 (2 commits)
    • x86, mpx: Cleanup unused bound tables · 1de4fa14
      Dave Hansen committed
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace uses the prctl() introduced earlier in this patchset
      to enable kernel management of bounds tables, then when it unmaps
      the first type of mapping (the one with the actual data), the kernel
      needs to free the mapping for the bounds table backing that data.
      This patch hooks in at the very end of do_munmap() to do so.
      We look at the addresses being unmapped and find the bounds
      directory entries and tables which cover those addresses.  If
      an entire table is unused, we clear the associated directory entry
      and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an entire bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use a
      get_user_pages() to resolve the fault.
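      In code, the pattern is roughly (simplified; helper names approximate):

      	/* bd_entry is a 'long __user *' pointing into the bounds directory */
      	pagefault_disable();			/* we hold mmap_sem for write */
      	ret = put_user(0, bd_entry);		/* invalidate the directory entry */
      	pagefault_enable();
      	if (ret == -EFAULT)
      		ret = mpx_resolve_fault(bd_entry, 1);	/* fault it in via get_user_pages(), then retry */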
      
      The unmapping of bounds tables happens under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1de4fa14
    • x86, mpx: On-demand kernel allocation of bounds tables · fe3d197f
      Dave Hansen committed
      This is really the meat of the MPX patch set.  If there is one patch to
      review in the entire series, this is the one.  There is a new ABI here
      and this kernel code also interacts with userspace memory in a
      relatively unusual manner.  (small FAQ below).
      
      Long Description:
      
      This patch adds two prctl() commands that enable or disable kernel
      management of bounds tables, including on-demand kernel allocation
      (See the patch "on-demand kernel allocation of bounds tables")
      and cleanup (See the patch "cleanup unused bound tables"). Applications
      do not strictly need the kernel to manage bounds tables and we expect
      some applications to use MPX without taking advantage of this kernel
      support. This means the kernel cannot simply infer whether an application
      needs bounds table management from the MPX registers.  The prctl() is an
      explicit signal from userspace.
      
      PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
      request the kernel's help in managing bounds tables.

      PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace no
      longer wants the kernel's help. With PR_MPX_DISABLE_MANAGEMENT, the kernel
      won't allocate and free bounds tables even if the CPU supports MPX.
      
      PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
      directory out of a userspace register (bndcfgu) and then cache it into
      a new field (->bd_addr) in the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
      will set "bd_addr" to an invalid address.  Using this scheme, we can
      use "bd_addr" to determine whether the management of bounds tables in
      kernel is enabled.
      
      Also, the only way to access that bndcfgu register is via an xsaves,
      which can be expensive.  Caching "bd_addr" like this also helps reduce
      the cost of those xsaves when doing table cleanup at munmap() time.
      Unfortunately, we can not apply this optimization to #BR fault time
      because we need an xsave to get the value of BNDSTATUS.
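      From userspace, opting in and out looks roughly like this
      (illustrative sketch; assumes the PR_MPX_* constants are visible
      through the uapi headers):

      	#include <sys/prctl.h>

      	if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0))
      		perror("kernel-side MPX management unavailable");
      	/* ... run MPX-instrumented code ... */
      	prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0);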
      
      ==== Why does the hardware even have these Bounds Tables? ====
      
      MPX only has 4 hardware registers for storing bounds information.
      If MPX-enabled code needs more than these 4 registers, it needs to
      spill them somewhere. It has two special instructions for this
      which allow the bounds to be moved between the bounds registers
      and some new "bounds tables".
      
      The #BR exceptions involved are similar conceptually to a page fault
      and will be raised by the MPX hardware both on bounds violations and
      when the tables are not present. This patch handles those #BR
      exceptions for not-present tables by carving the space out of the
      normal process's address space (essentially calling the new mmap()
      interface introduced earlier in this patch set) and then pointing the
      bounds-directory
      over to it.
      
      The tables *need* to be accessed and controlled by userspace because
      the instructions for moving bounds in and out of them are extremely
      frequent. They potentially happen every time a register pointing to
      memory is dereferenced. Any direct kernel involvement (like a syscall)
      to access the tables would obviously destroy performance.
      
      ==== Why not do this in userspace? ====
      
      This patch is obviously doing this allocation in the kernel.
      However, MPX does not strictly *require* anything in the kernel.
      It can theoretically be done completely from userspace. Here are
      a few ways this *could* be done. I don't think any of them are
      practical in the real-world, but here they are.
      
      Q: Can virtual space simply be reserved for the bounds tables so
         that we never have to allocate them?
      A: As noted earlier, these tables are *HUGE*. An X-GB virtual
         area needs 4*X GB of virtual space, plus 2GB for the bounds
         directory. If we were to preallocate them for the 128TB of
         user virtual address space, we would need to reserve 512TB+2GB,
         which is larger than the entire virtual address space today.
         This means they can not be reserved ahead of time. Also, a
      single process's pre-populated bounds directory consumes 2GB
         of virtual *AND* physical memory. IOW, it's completely
         infeasible to prepopulate bounds directories.
      
      Q: Can we preallocate bounds table space at the same time memory
         is allocated which might contain pointers that might eventually
         need bounds tables?
      A: This would work if we could hook the site of each and every
         memory allocation syscall. This can be done for small,
         constrained applications. But, it isn't practical at a larger
         scale since a given app has no way of controlling how all the
         parts of the app might allocate memory (think libraries). The
         kernel is really the only place to intercept these calls.
      
      Q: Could a bounds fault be handed to userspace and the tables
         allocated there in a signal handler instead of in the kernel?
      A: (thanks to tglx) mmap() is not on the list of safe async
         handler functions and even if mmap() would work it still
         requires locking or nasty tricks to keep track of the
         allocation state there.
      
      Having ruled out all of the userspace-only approaches for managing
      bounds tables that we could think of, we create them on demand in
      the kernel.
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      fe3d197f
  20. 28 Oct, 2014 (1 commit)
  21. 31 Jul, 2014 (1 commit)
  22. 01 Aug, 2013 (1 commit)
  23. 15 May, 2012 (1 commit)
  24. 27 Jul, 2011 (1 commit)
  25. 04 Feb, 2011 (1 commit)
    • x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after switching mm · 831d52bc
      Suresh Siddha committed
      Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb
      IPI's while the cr3 is still pointing to the prev mm.  And this window
      can lead to the possibility of bogus TLB fills resulting in strange
      failures.  One such problematic scenario is mentioned below.
      
       T1. CPU-1 is context switching from mm1 to mm2 context and got an NMI
           etc between the point of clearing the cpu from the mm_cpumask(mm1)
           and before reloading the cr3 with the new mm2.
      
       T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with
           flushing the TLB for mm1.  It doesn't send the flush TLB to CPU-1
           as it doesn't see that cpu listed in the mm_cpumask(mm1).
      
       T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
           page-table pages associated with the removed vma mapping.
      
       T4. CPU-2 now allocates those freed page-table pages for something
           else.
      
       T5. As the CR3 and TLB caches for mm1 are still active on CPU-1, CPU-1
           can potentially speculate and walk through the page-table caches
           and can insert new TLB entries.  As the page-table pages are
           already freed and being used on CPU-2, this page walk can
           potentially insert a bogus global TLB entry depending on the
           (random) contents of the page that is being used on CPU-2.
      
       T6. This bogus TLB entry being global will be active across future CR3
           changes and can result in weird memory corruption etc.
      
      To avoid this issue, for the prev mm that is handing over the cpu to
      another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is
      changed.
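      In other words, the switch_mm() ordering becomes roughly:

      	load_cr3(next->pgd);				/* start using next's page tables */
      	cpumask_clear_cpu(cpu, mm_cpumask(prev));	/* only now opt out of prev's flush IPIs */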
      
      Marking it for -stable, though we haven't seen any reported failure that
      can be attributed to this.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: stable@kernel.org	[v2.6.32+]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      831d52bc
  26. 24 Sep, 2009 (1 commit)
  27. 10 Feb, 2009 (2 commits)
    • x86: make lazy %gs optional on x86_32 · ccbeed3a
      Tejun Heo committed
      Impact: pt_regs changed, lazy gs handling made optional, adds slight
              overhead to SAVE_ALL, simplifies error_code path a bit
      
      On x86_32, %gs hasn't been used by the kernel and is handled lazily.  pt_regs
      doesn't have a place for it and gs is saved/loaded only when necessary.
      In preparation for stack protector support, this patch makes lazy %gs
      handling optional by doing the following.
      
      * Add CONFIG_X86_32_LAZY_GS and a place for gs in pt_regs.
      
      * Save and restore %gs along with other registers in entry_32.S unless
        LAZY_GS.  Note that this unfortunately adds "pushl $0" on SAVE_ALL
        even when LAZY_GS.  However, it adds no overhead to common exit path
        and simplifies entry path with error code.
      
      * Define different user_gs accessors depending on LAZY_GS and add
        lazy_save_gs() and lazy_load_gs() which are noop if !LAZY_GS.  The
        lazy_*_gs() ops are used to save, load and clear %gs lazily.
      
      * Define ELF_CORE_COPY_KERNEL_REGS() which always reads %gs directly.
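      The config-dependent lazy ops end up looking roughly like this
      (simplified sketch):

      	#ifdef CONFIG_X86_32_LAZY_GS
      	#define lazy_save_gs(v)		savesegment(gs, (v))
      	#define lazy_load_gs(v)		loadsegment(gs, (v))
      	#else	/* !LAZY_GS: %gs has a slot in pt_regs, nothing to do lazily */
      	#define lazy_save_gs(v)		do { } while (0)
      	#define lazy_load_gs(v)		do { } while (0)
      	#endif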
      
      xen and lguest changes need to be verified.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ccbeed3a
    • x86: add %gs accessors for x86_32 · d9a89a26
      Tejun Heo committed
      Impact: cleanup
      
      On x86_32, %gs is handled lazily.  It's not saved and restored on
      kernel entry/exit but only when necessary, which usually is during task
      switch, but there are a few other places.  Currently, it's done by
      calling savesegment() and loadsegment() explicitly.
      get_user_gs(), set_user_gs() and task_user_gs() and use them instead.
      
      While at it, clean up register access macros in signal.c.
      
      This cleans up code a bit and will help future changes.
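      With lazy %gs, the new accessors are essentially thin wrappers, roughly:

      	#define get_user_gs(regs)	(u16)({ unsigned long v; savesegment(gs, v); v; })
      	#define set_user_gs(regs, v)	loadsegment(gs, (unsigned long)(v))
      	#define task_user_gs(tsk)	((tsk)->thread.gs)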
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d9a89a26
  28. 21 Jan, 2009 (1 commit)