1. 14 Oct 2017, 1 commit
    • x86/mm: Flush more aggressively in lazy TLB mode · b956575b
      Committed by Andy Lutomirski
      Since commit:
      
        94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
      
      x86's lazy TLB mode has been all the way lazy: when running a kernel thread
      (including the idle thread), the kernel keeps using the last user mm's
      page tables without attempting to maintain user TLB coherence at all.
      
      From a pure semantic perspective, this is fine -- kernel threads won't
      attempt to access user pages, so having stale TLB entries doesn't matter.
      
      Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
      we also allow any paging-structure caches that may exist on the CPU
      to become incoherent.  This means that we can have a
      paging-structure cache entry that references a freed page table, and
      the CPU is within its rights to do a speculative page walk starting
      at the freed page table.
      
      I can imagine this causing two different problems:
      
       - A speculative page walk starting from a bogus page table could read
         IO addresses.  I haven't seen any reports of this causing problems.
      
       - A speculative page walk that involves a bogus page table can install
         garbage in the TLB.  Such garbage would always be at a user VA, but
         some AMD CPUs have logic that triggers a machine check when it notices
         these bogus entries.  I've seen a couple reports of this.
      
      Boris further explains the failure mode:
      
      > It is actually more of an optimization which assumes that paging-structure
      > entries are in WB DRAM:
      >
      > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
      > performance optimization that assumes PML4, PDP, PDE, and PTE entries
      > are in cacheable WB-DRAM; memory type checks may be bypassed, and
      > addresses outside of WB-DRAM may result in undefined behavior or NB
      > protocol errors. 1=Disables performance optimization and allows PML4,
      > PDP, PDE and PTE entries to be in any memory type. Operating systems
      > that maintain page tables in memory types other than WB- DRAM must set
      > TlbCacheDis to insure proper operation."
      >
      > The MCE generated is an NB protocol error to signal that
      >
      > "Link: A specific coherent-only packet from a CPU was issued to an
      > IO link. This may be caused by software which addresses page table
      > structures in a memory type other than cacheable WB-DRAM without
      > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
      > example, when page table structure addresses are above top of memory. In
      > such cases, the NB will generate an MCE if it sees a mismatch between
      > the memory operation generated by the core and the link type."
      >
      > I'm assuming coherent-only packets don't go out on IO links, thus the
      > error.
      
      To fix this, reinstate TLB coherence in lazy mode.  With this patch
      applied, we do it in one of two ways:
      
       - If we have PCID, we simply switch back to init_mm's page tables
         when we enter a kernel thread -- this seems to be quite cheap
         except for the cost of serializing the CPU.
      
       - If we don't have PCID, then we set a flag and switch to init_mm
         the first time we would otherwise need to flush the TLB.
      
      The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
      to override the default mode for benchmarking.
      
      In theory, we could optimize this better by only flushing the TLB in
      lazy CPUs when a page table is freed.  Doing that would require
      auditing the mm code to make sure that all page table freeing goes
      through tlb_remove_page() as well as reworking some data structures
      to implement the improved flush logic.
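      A rough sketch of the two strategies above (simplified, with illustrative
      helper and field names; not the exact upstream code):
      
          static void sketch_enter_lazy_tlb(struct mm_struct *mm, int cpu)
          {
              if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
                  return;
      
              if (static_cpu_has(X86_FEATURE_PCID)) {
                  /* PCID makes the CR3 switch cheap: go to init_mm right away. */
                  switch_mm_irqs_off(NULL, &init_mm, NULL);
              } else {
                  /* No PCID: just note that we are lazy; the first flush that
                   * would otherwise be needed does the switch instead. */
                  this_cpu_write(cpu_tlbstate.is_lazy, true);
              }
          }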
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
      Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b956575b
  2. 18 Sep 2017, 2 commits
    • x86/mm/64: Stop using CR3.PCID == 0 in ASID-aware code · 52a2af40
      Committed by Andy Lutomirski
      Putting the logical ASID into CR3's PCID bits directly means that we
      have two cases to consider separately: ASID == 0 and ASID != 0.
      This means that bugs that only hit in one of these cases trigger
      nondeterministically.
      
      There were some bugs like this in the past, and I think there's
      still one in current kernels.  In particular, we have a number of
      ASID-unware code paths that save CR3, write some special value, and
      then restore CR3.  This includes suspend/resume, hibernate, kexec,
      EFI, and maybe other things I've missed.  This is currently
      dangerous: if ASID != 0, then this code sequence will leave garbage
      in the TLB tagged for ASID 0.  We could potentially see corruption
      when switching back to ASID 0.  In principle, an
      initialize_tlbstate_and_flush() call after these sequences would
      solve the problem, but EFI, at least, does not call this.  (And it
      probably shouldn't -- initialize_tlbstate_and_flush() is rather
      expensive.)
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/cdc14bbe5d3c3ef2a562be09a6368ffe9bd947a6.1505663533.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      52a2af40
    • x86/mm: Factor out CR3-building code · 47061a24
      Committed by Andy Lutomirski
      Currently, the code that assembles a value to load into CR3 is
      open-coded everywhere.  Factor it out into helpers build_cr3() and
      build_cr3_noflush().
      
      This makes one semantic change: __get_current_cr3_fast() was wrong
      on SME systems.  No one noticed because the only caller is in the
      VMX code, and there are no CPUs with both SME and VMX.
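      For illustration, the factored-out helpers look roughly like this
      (simplified sketch; CR3_NOFLUSH is bit 63, and __sme_pa() is what folds in
      the SME mask that the open-coded __get_current_cr3_fast() was missing):
      
          static inline unsigned long sketch_build_cr3(pgd_t *pgd, u16 asid)
          {
              return __sme_pa(pgd) | asid;
          }
      
          static inline unsigned long sketch_build_cr3_noflush(pgd_t *pgd, u16 asid)
          {
              /* Ask the CPU to keep the PCID's cached translations. */
              return sketch_build_cr3(pgd, asid) | CR3_NOFLUSH;
          }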
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <Thomas.Lendacky@amd.com>
      Link: http://lkml.kernel.org/r/ce350cf11e93e2842d14d0b95b0199c7d881f527.1505663533.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      47061a24
  3. 25 Aug 2017, 1 commit
    • x86/mm: Fix use-after-free of ldt_struct · ccd5b323
      Committed by Eric Biggers
      The following commit:
      
        39a0526f ("x86/mm: Factor out LDT init from context init")
      
      renamed init_new_context() to init_new_context_ldt() and added a new
      init_new_context() which calls init_new_context_ldt().  However, the
      error code of init_new_context_ldt() was ignored.  Consequently, if a
      memory allocation in alloc_ldt_struct() failed during a fork(), the
      ->context.ldt of the new task remained the same as that of the old task
      (due to the memcpy() in dup_mm()).  ldt_struct's are not intended to be
      shared, so a use-after-free occurred after one task exited.
      
      Fix the bug by making init_new_context() pass through the error code of
      init_new_context_ldt().
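      
      In sketch form, the corrected init_new_context() simply becomes (the pkey
      initialization shown here is unchanged context, slightly simplified):
      
          static inline int init_new_context(struct task_struct *tsk,
                                             struct mm_struct *mm)
          {
          #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
              if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
                  mm->context.pkey_allocation_map = 0x1;  /* pkey 0 always allocated */
                  mm->context.execute_only_pkey = -1;     /* -1 == unallocated */
              }
          #endif
              /* Was: init_new_context_ldt(tsk, mm); return 0; -- which hid an
               * allocation failure and left ->context.ldt shared with the parent. */
              return init_new_context_ldt(tsk, mm);
          }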
      
      This bug was found by syzkaller, which encountered the following splat:
      
          BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
          Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710
      
          CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:16 [inline]
           dump_stack+0x194/0x257 lib/dump_stack.c:52
           print_address_description+0x73/0x250 mm/kasan/report.c:252
           kasan_report_error mm/kasan/report.c:351 [inline]
           kasan_report+0x24e/0x340 mm/kasan/report.c:409
           __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
           free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
           free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
           destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
           destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
           __mmdrop+0xe9/0x530 kernel/fork.c:889
           mmdrop include/linux/sched/mm.h:42 [inline]
           exec_mmap fs/exec.c:1061 [inline]
           flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291
           load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855
           search_binary_handler+0x142/0x6b0 fs/exec.c:1652
           exec_binprm fs/exec.c:1694 [inline]
           do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816
           do_execve+0x31/0x40 fs/exec.c:1860
           call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
           ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
      
          Allocated by task 3700:
           save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
           save_stack+0x43/0xd0 mm/kasan/kasan.c:447
           set_track mm/kasan/kasan.c:459 [inline]
           kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
           kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627
           kmalloc include/linux/slab.h:493 [inline]
           alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67
           write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277
           sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307
           entry_SYSCALL_64_fastpath+0x1f/0xbe
      
          Freed by task 3700:
           save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
           save_stack+0x43/0xd0 mm/kasan/kasan.c:447
           set_track mm/kasan/kasan.c:459 [inline]
           kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
           __cache_free mm/slab.c:3503 [inline]
           kfree+0xca/0x250 mm/slab.c:3820
           free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121
           free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
           destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
           destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
           __mmdrop+0xe9/0x530 kernel/fork.c:889
           mmdrop include/linux/sched/mm.h:42 [inline]
           __mmput kernel/fork.c:916 [inline]
           mmput+0x541/0x6e0 kernel/fork.c:927
           copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931
           copy_process kernel/fork.c:1546 [inline]
           _do_fork+0x1ef/0xfb0 kernel/fork.c:2025
           SYSC_clone kernel/fork.c:2135 [inline]
           SyS_clone+0x37/0x50 kernel/fork.c:2129
           do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287
           return_from_SYSCALL_64+0x0/0x7a
      
      Here is a C reproducer:
      
          #include <asm/ldt.h>
          #include <pthread.h>
          #include <signal.h>
          #include <stdlib.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          static void *fork_thread(void *_arg)
          {
              fork();
          }
      
          int main(void)
          {
              struct user_desc desc = { .entry_number = 8191 };
      
              syscall(__NR_modify_ldt, 1, &desc, sizeof(desc));
      
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
      
                      srand(getpid());
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
                  }
                  wait(NULL);
              }
          }
      
      Note: the reproducer takes advantage of the fact that alloc_ldt_struct()
      may use vmalloc() to allocate a large ->entries array, and after
      commit:
      
        5d17a73a ("vmalloc: back off when the current task is killed")
      
      it is possible for userspace to fail a task's vmalloc() by
      sending a fatal signal, e.g. via exit_group().  It would be more
      difficult to reproduce this bug on kernels without that commit.
      
      This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org> [v4.6+]
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Fixes: 39a0526f ("x86/mm: Factor out LDT init from context init")
      Link: http://lkml.kernel.org/r/20170824175029.76040-1-ebiggers3@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ccd5b323
  4. 25 Jul 2017, 1 commit
    • x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID · 10af6235
      Committed by Andy Lutomirski
      PCID is a "process context ID" -- it's what other architectures call
      an address space ID.  Every non-global TLB entry is tagged with a
      PCID, only TLB entries that match the currently selected PCID are
      used, and we can switch PGDs without flushing the TLB.  x86's
      PCID is 12 bits.
      
      This is an unorthodox approach to using PCID.  x86's PCID is far too
      short to uniquely identify a process, and we can't even really
      uniquely identify a running process because there are monster
      systems with over 4096 CPUs.  To make matters worse, past attempts
      to use all 12 PCID bits have resulted in slowdowns instead of
      speedups.
      
      This patch uses PCID differently.  We use a PCID to identify a
      recently-used mm on a per-cpu basis.  An mm has no fixed PCID
      binding at all; instead, we give it a fresh PCID each time it's
      loaded except in cases where we want to preserve the TLB, in which
      case we reuse a recent value.
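      
      Conceptually, the per-CPU bookkeeping looks like the sketch below (names
      and the slot-replacement policy are simplified):
      
          #define NR_DYN_ASIDS 6    /* small pool of per-CPU ASID slots */
      
          struct sketch_asid_slot {
              u64 ctx_id;    /* which mm last owned this slot */
              u64 tlb_gen;   /* how fresh that mm's TLB is on this CPU */
          };
      
          static void sketch_choose_new_asid(u64 next_ctx_id, u64 next_tlb_gen,
                                             struct sketch_asid_slot *slots,
                                             u16 *new_asid, bool *need_flush)
          {
              u16 asid;
      
              for (asid = 0; asid < NR_DYN_ASIDS; asid++) {
                  if (slots[asid].ctx_id != next_ctx_id)
                      continue;
                  /* Recently-used mm: reuse its slot, flush only if stale. */
                  *new_asid = asid;
                  *need_flush = slots[asid].tlb_gen < next_tlb_gen;
                  return;
              }
              /* No slot remembers this mm: take a fresh one and flush. */
              *new_asid = 0;    /* the real code round-robins here */
              *need_flush = true;
          }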
      
      Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
      (turbo off, intel_pstate requesting max performance) under KVM with
      the guest using idle=poll (to avoid artifacts when bouncing between
      CPUs).  I haven't done any real statistics here -- I just ran them
      in a loop and picked the fastest results that didn't look like
      outliers.  Unpatched means commit a4eb8b99, so all the
      bookkeeping overhead is gone.
      
      ping-pong between two mms on the same CPU using eventfd:
      
        patched:         1.22µs
        patched, nopcid: 1.33µs
        unpatched:       1.34µs
      
      Same ping-pong, but now touch 512 pages (all zero-page to minimize
      cache misses) each iteration.  dTLB misses are measured by
      dtlb_load_misses.miss_causes_a_walk:
      
        patched:         1.8µs,  11M dTLB misses
        patched, nopcid: 6.2µs, 207M dTLB misses
        unpatched:       6.1µs, 190M dTLB misses
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/9ee75f17a81770feed616358e6860d98a2a5b1e7.1500957502.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      10af6235
  5. 18 Jul 2017, 1 commit
    • x86/mm, KVM: Fix warning when !CONFIG_PREEMPT_COUNT · 4c07f904
      Committed by Roman Kagan
      A recent commit:
      
        d6e41f11 ("x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant")
      
      introduced a VM_WARN_ON(!in_atomic()) which generates false positives
      on every VM entry on !CONFIG_PREEMPT_COUNT kernels.
      
      Replace it with a test for preemptible(), which appears to match the
      original intent and works across different CONFIG_PREEMPT* variations.
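      
      Roughly (a sketch of the assertion, not a literal diff):
      
          /*
           * Before: VM_WARN_ON(!in_atomic()) -- on !CONFIG_PREEMPT_COUNT kernels
           * in_atomic() cannot observe the preempt count, so this fired on
           * every VM entry.
           *
           * After: preemptible() evaluates to 0 whenever full preemption is not
           * configured, so the check is quiet there and still catches genuinely
           * preemptible callers elsewhere:
           */
          VM_WARN_ON_ONCE(preemptible());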
      Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Cc: linux-mm@kvack.org
      Fixes: d6e41f11 ("x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4c07f904
  6. 05 Jul 2017, 2 commits
    • x86/mm: Rework lazy TLB mode and TLB freshness tracking · 94b1b03b
      Committed by Andy Lutomirski
      x86's lazy TLB mode used to be fairly weak -- it would switch to
      init_mm the first time it tried to flush a lazy TLB.  This meant an
      unnecessary CR3 write and, if the flush was remote, an unnecessary
      IPI.
      
      Rewrite it entirely.  When we enter lazy mode, we simply remove the
      CPU from mm_cpumask.  This means that we need a way to figure out
      whether we've missed a flush when we switch back out of lazy mode.
      I use the tlb_gen machinery to track whether a context is up to
      date.
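      
      In sketch form (simplified; not the verbatim kernel code):
      
          static void sketch_enter_lazy_tlb(struct mm_struct *mm, unsigned int cpu)
          {
              /* Going lazy: unsubscribe from this mm's flush IPIs. */
              cpumask_clear_cpu(cpu, mm_cpumask(mm));
          }
      
          static void sketch_unlazy(struct mm_struct *mm, unsigned int cpu,
                                    u64 local_tlb_gen)
          {
              /* Switching back: resubscribe, then catch up on missed flushes. */
              cpumask_set_cpu(cpu, mm_cpumask(mm));
              if (local_tlb_gen < atomic64_read(&mm->context.tlb_gen))
                  local_flush_tlb();
          }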
      
      Note to reviewers: this patch, by itself, looks a bit odd.  I'm
      using an array of length 1 containing (ctx_id, tlb_gen) rather than
      just storing tlb_gen, and making it an array isn't necessary yet.
      I'm doing this because the next few patches add PCID support, and,
      with PCID, we need ctx_id, and the array will end up with a length
      greater than 1.  Making it an array now means that there will be
      less churn and therefore less stress on your eyeballs.
      
      NB: This is dubious but, AFAICT, still correct on Xen and UV.
      xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
      patch changes the way that mm_cpumask() works.  This should be okay,
      since Xen *also* iterates all online CPUs to find all the CPUs it
      needs to twiddle.
      
      The UV tlbflush code is rather dated and should be changed.
      
      Here are some benchmark results, done on a Skylake laptop at 2.3 GHz
      (turbo off, intel_pstate requesting max performance) under KVM with
      the guest using idle=poll (to avoid artifacts when bouncing between
      CPUs).  I haven't done any real statistics here -- I just ran them
      in a loop and picked the fastest results that didn't look like
      outliers.  Unpatched means commit a4eb8b99, so all the
      bookkeeping overhead is gone.
      
      MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity.  In
      an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU.
      This is intended to be a nearly worst-case test.
      
        patched:         13.4µs
        unpatched:       21.6µs
      
      Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores),
      nrounds = 100, 256M data
      
        patched:         1.1 seconds or so
        unpatched:       1.9 seconds or so
      
      The speedup on Vitaly's test appears to be because it spends a lot
      of time blocked on mmap_sem, and this patch avoids sending IPIs to
      blocked CPUs.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/ddf2c92962339f4ba39d8fc41b853936ec0b44f1.1498751203.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      94b1b03b
    • x86/mm: Give each mm TLB flush generation a unique ID · f39681ed
      Committed by Andy Lutomirski
      This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
      ctx_id uniquely identifies the mm_struct and will never be reused.
      For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
      count of the number of times that a TLB flush has been requested.
      The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
      flush actions and will be used in subsequent patches to reliably
      determine whether all needed TLB flushes have occurred on a given
      CPU.
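      
      In sketch form, the new state and the operation that updates it
      (simplified; existing mm_context_t members elided):
      
          typedef struct {
              /* ... existing members elided ... */
              u64        ctx_id;    /* unique per mm_struct, never reused */
              atomic64_t tlb_gen;   /* bumped every time a flush is requested */
          } mm_context_t;
      
          static inline u64 sketch_inc_mm_tlb_gen(struct mm_struct *mm)
          {
              /* Called whenever a TLB flush of this mm is requested. */
              return atomic64_inc_return(&mm->context.tlb_gen);
          }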
      
      This patch is split out for ease of review.  By itself, it has no
      real effect other than creating and updating the new variables.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/413a91c24dab3ed0caa5f4e4d017d87b0857f920.1498751203.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f39681ed
  7. 22 Jun 2017, 1 commit
    • x86/ldt: Simplify the LDT switching logic · 73534258
      Committed by Andy Lutomirski
      Originally, Linux reloaded the LDT whenever the prev mm or the next
      mm had an LDT. It was changed in 2002 in:
      
        0bbed3beb4f2 ("[PATCH] Thread-Local Storage (TLS) support")
      
      (commit from the historical tree), like this:
      
      -		/* load_LDT, if either the previous or next thread
      -		 * has a non-default LDT.
      +		/*
      +		 * load the LDT, if the LDT is different:
      		 */
      -		if (next->context.size+prev->context.size)
      +		if (unlikely(prev->context.ldt != next->context.ldt))
      			load_LDT(&next->context);
      
      The current code is unlikely to avoid any LDT reloads, since different
      mms won't share an LDT.
      
      When we redo lazy mode to stop sending flush IPIs without switching to
      init_mm, though, the current logic would become incorrect: it will
      be possible to have real_prev == next but nonetheless have a stale
      LDT descriptor.
      
      Simplify the code to update LDTR if either the previous or the next
      mm has an LDT, i.e. effectively restore the historical logic.
      While we're at it, clean up the code by moving all the ifdeffery to
      a header where it belongs.
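      
      A sketch of the restored check (illustrative; see the quoted 2002 diff
      above for the original form):
      
          static inline void sketch_switch_ldt(struct mm_struct *prev,
                                               struct mm_struct *next)
          {
              /* Reload if either side has a non-default LDT; comparing the
               * pointers alone is not enough once real_prev can equal next
               * while LDTR is stale. */
              if (unlikely((unsigned long)prev->context.ldt |
                           (unsigned long)next->context.ldt))
                  load_mm_ldt(next);
          }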
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/2a859ac01245f9594c58f9d0a8b2ed8a7cd2507e.1498022414.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      73534258
  8. 13 Jun 2017, 2 commits
  9. 08 Jun 2017, 1 commit
  10. 05 Jun 2017, 2 commits
    • x86/mm, KVM: Teach KVM's VMX code that CR3 isn't a constant · d6e41f11
      Committed by Andy Lutomirski
      When PCID is enabled, CR3's PCID bits can change during context
      switches, so KVM won't be able to treat CR3 as a per-mm constant any
      more.
      
      I structured this like the existing CR4 handling.  Under ordinary
      circumstances (PCID disabled or if the current PCID and the value
      that's already in the VMCS match), then we won't do an extra VMCS
      write, and we'll never do an extra direct CR3 read.  The overhead
      should be minimal.
      
      I disallowed using the new helper in non-atomic context because
      PCID support will cause CR3 to stop being constant in non-atomic
      process context.
      
      (Frankly, it also scares me a bit that KVM ever treated CR3 as
      constant, but it looks like it was okay before.)
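      
      The resulting structure is roughly the following (sketch only; struct
      sketch_vcpu is an illustrative stand-in for KVM's per-vCPU host-state
      bookkeeping):
      
          struct sketch_vcpu {
              unsigned long host_state_cr3;   /* last CR3 written to the VMCS */
          };
      
          static void sketch_vmx_update_host_cr3(struct sketch_vcpu *v)
          {
              unsigned long cr3 = __get_current_cr3_fast();  /* may vary with PCID */
      
              if (unlikely(cr3 != v->host_state_cr3)) {
                  vmcs_writel(HOST_CR3, cr3);   /* extra VMCS write only on change */
                  v->host_state_cr3 = cr3;
              }
          }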
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d6e41f11
    • x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code · ce4a4e56
      Committed by Andy Lutomirski
      The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
      Aside from that, it's fallen quite a bit behind the SMP code:
      
       - flush_tlb_mm_range() didn't flush individual pages if the range
         was small.
      
       - The lazy TLB code was much weaker.  This usually wouldn't matter,
         but, if a kernel thread flushed its lazy "active_mm" more than
         once (due to reclaim or similar), it wouldn't be unlazied and
         would instead pointlessly flush repeatedly.
      
       - Tracepoints were missing.
      
      Aside from that, simply having the UP code around was a maintenance
      burden, since it means that any change to the TLB flush code had to
      make sure not to break it.
      
      Simplify everything by deleting the UP code.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ce4a4e56
  11. 23 Apr 2017, 1 commit
    • Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" · 6dd29b3d
      Committed by Ingo Molnar
      This reverts commit 2947ba05.
      
      Dan Williams reported dax-pmem kernel warnings with the following signature:
      
         WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
         percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
      
      ... and bisected it to this commit, which suggests possible memory corruption
      caused by the x86 fast-GUP conversion.
      
      He also pointed out:
      
       "
        This is similar to the backtrace when we were not properly handling
        pud faults and was fixed with this commit: 220ced16 "mm: fix
        get_user_pages() vs device-dax pud mappings"
      
        I've found some missing _devmap checks in the generic
        get_user_pages_fast() path, but this does not fix the regression
        [...]
       "
      
      So given that there are known bugs, and a pretty robust looking bisection
      points to this commit suggesting that are unknown bugs in the conversion
      as well, revert it for the time being - we'll re-try in v4.13.
      Reported-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aneesh.kumar@linux.vnet.ibm.com
      Cc: dann.frazier@canonical.com
      Cc: dave.hansen@intel.com
      Cc: steve.capper@linaro.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6dd29b3d
  12. 18 Mar 2017, 2 commits
  13. 10 Dec 2016, 1 commit
    • x86/ldt: Make all size computations unsigned · 990e9dc3
      Committed by Thomas Gleixner
      ldt->size can never be negative. The helper functions take 'unsigned int'
      arguments which are assigned from ldt->size. The related user space
      user_desc struct member entry_number is unsigned as well.
      
      But ldt->size itself and a few local variables which are related to
      ldt->size are type 'int' which makes no sense whatsoever and results in
      typecasts which make the eyes bleed.
      
      Clean it up and convert everything which is related to ldt->size to
      unsigned int.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      990e9dc3
  14. 09 Sep 2016, 2 commits
    • x86/pkeys: Allocation/free syscalls · e8c24d3a
      Committed by Dave Hansen
      This patch adds two new system calls:
      
      	int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
      	int pkey_free(int pkey);
      
      These implement an "allocator" for the protection keys
      themselves, which can be thought of as analogous to the allocator
      that the kernel has for file descriptors.  The kernel tracks
      which numbers are in use, and only allows operations on keys that
      are valid.  A key which was not obtained by pkey_alloc() may not,
      for instance, be passed to pkey_mprotect().
      
      These system calls are also very important given the kernel's use
      of pkeys to implement execute-only support.  These help ensure
      that userspace can never assume that it has control of a key
      unless it first asks the kernel.  The kernel does not promise to
      preserve PKRU (rights register) contents except for allocated
      pkeys.
      
      The 'init_access_rights' argument to pkey_alloc() specifies the
      rights that will be established for the returned pkey.  For
      instance:
      
      	pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
      
      will allocate 'pkey', but also sets the bits in PKRU[1] such that
      writing to 'pkey' is already denied.
      
      The kernel does not prevent pkey_free() from successfully freeing
      in-use pkeys (those still assigned to a memory range by
      pkey_mprotect()).  It would be expensive to implement the checks
      for this, so we instead say, "Just don't do it" since sane
      software will never do it anyway.
      
      Any piece of userspace calling pkey_alloc() needs to be prepared
      for it to fail.  Why?  pkey_alloc() returns the same error code
      (ENOSPC) when there are no pkeys and when pkeys are unsupported.
      They can be unsupported for a whole host of reasons, so apps must
      be prepared for this.  Also, libraries or LD_PRELOADs might steal
      keys before an application gets access to them.
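      
      For example, a minimal userspace caller that is prepared for failure might
      look like this (a sketch using raw syscall(); the syscall numbers shown
      are the x86-64 ones and are given only as an illustrative fallback):
      
          #define _GNU_SOURCE
          #include <stdio.h>
          #include <sys/syscall.h>
          #include <unistd.h>
      
          #ifndef SYS_pkey_alloc
          #define SYS_pkey_alloc 330
          #define SYS_pkey_free  331
          #endif
      
          int main(void)
          {
              /* flags = 0, no initial access restrictions */
              long pkey = syscall(SYS_pkey_alloc, 0UL, 0UL);
      
              if (pkey < 0) {
                  /* ENOSPC covers both "no free keys" and "no pkey support". */
                  perror("pkey_alloc");
                  return 1;
              }
              printf("allocated pkey %ld\n", pkey);
              syscall(SYS_pkey_free, pkey);
              return 0;
          }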
      
      This allocation mechanism could be implemented in userspace.
      Even if we did it in userspace, we would still need additional
      user/kernel interfaces to tell userspace which keys are being
      used by the kernel internally (such as for execute-only
      mappings).  Having the kernel provide this facility completely
      removes the need for these additional interfaces, or having an
      implementation of this in userspace at all.
      
      Note that we have to make changes to all of the architectures
      that do not use mman-common.h because we use the new
      PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
      
      1. PKRU is the Protection Key Rights User register.  It is a
         usermode-accessible register that controls whether writes
         and/or access to each individual pkey is allowed or denied.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: linux-arch@vger.kernel.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: arnd@arndb.de
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e8c24d3a
    • mm: Implement new pkey_mprotect() system call · 7d06d9c9
      Committed by Dave Hansen
      pkey_mprotect() is just like mprotect, except it also takes a
      protection key as an argument.  On systems that do not support
      protection keys, it still works, but requires that key=0.
      Otherwise it does exactly what mprotect does.
      
      I expect it to get used like this, if you want to guarantee that
      any mapping you create can *never* be accessed without the right
      protection keys set up.
      
      	int real_prot = PROT_READ|PROT_WRITE;
      	pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
      	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);
      
      This way, there is *no* window where the mapping is accessible
      since it was always either PROT_NONE or had a protection key set
      that denied all access.
      
      We settled on 'unsigned long' for the type of the key here.  We
      only need 4 bits on x86 today, but I figured that other
      architectures might need some more space.
      
      Semantically, we have a bit of a problem if we combine this
      syscall with our previously-introduced execute-only support:
      What do we do when we mix execute-only pkey use with
      pkey_mprotect() use?  For instance:
      
      	pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
      	mprotect(ptr, PAGE_SIZE, PROT_EXEC);  // set pkey=X_ONLY_PKEY?
      	mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?
      
      To solve that, we make the plain-mprotect()-initiated execute-only
      support only apply to VMAs that have the default protection key (0)
      set on them.
      
      Proposed semantics:
      1. protection key 0 is special and represents the default,
         "unassigned" protection key.  It is always allocated.
      2. mprotect() never affects a mapping's pkey_mprotect()-assigned
         protection key. A protection key of 0 (even if set explicitly)
         represents an unassigned protection key.
         2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
             key may or may not result in a mapping with execute-only
             properties.  pkey_mprotect() plus pkey_set() on all threads
             should be used to _guarantee_ execute-only semantics if this
             is not a strong enough semantic.
      3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
         kernel will internally attempt to allocate and dedicate a
         protection key for the purpose of execute-only mappings.  This
         may not be possible in cases where there are no free protection
         keys available.  It can also happen, of course, in situations
         where there is no hardware support for protection keys.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: linux-arch@vger.kernel.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: arnd@arndb.de
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7d06d9c9
  15. 04 Aug 2016, 1 commit
    • tree-wide: replace config_enabled() with IS_ENABLED() · 97f2645f
      Committed by Masahiro Yamada
      The use of config_enabled() against config options is ambiguous.  In
      practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the
      author might have used it for the meaning of IS_ENABLED().  Using
      IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc.  makes the intention
      clearer.
      
      This commit replaces config_enabled() with IS_ENABLED() where possible.
      This commit is only touching bool config options.
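      
      Illustrative before/after with a hypothetical bool option CONFIG_FOO:
      
          /* Before: works for a bool option, but the same macro used on a
           * tristate option silently means "built-in only". */
          if (config_enabled(CONFIG_FOO))
              setup_foo();
      
          /* After: the intent is explicit -- true for both =y and =m. */
          if (IS_ENABLED(CONFIG_FOO))
              setup_foo();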
      
      I noticed two cases where config_enabled() is used against a tristate
      option:
      
       - config_enabled(CONFIG_HWMON)
        [ drivers/net/wireless/ath/ath10k/thermal.c ]
      
       - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
        [ drivers/gpu/drm/gma500/opregion.c ]
      
      I did not touch them because they should be converted to IS_BUILTIN()
      in order to keep the logic, but I was not sure it was the authors'
      intention.
      
      Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Stas Sergeev <stsp@list.ru>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Joshua Kinard <kumba@gentoo.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: yu-cheng yu <yu-cheng.yu@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Nikolay Martynov <mar.kolya@gmail.com>
      Cc: Huacai Chen <chenhc@lemote.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
      Cc: Rafal Milecki <zajec5@gmail.com>
      Cc: James Cowgill <James.Cowgill@imgtec.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Alex Smith <alex.smith@imgtec.com>
      Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
      Cc: Qais Yousef <qais.yousef@imgtec.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Mikko Rapeli <mikko.rapeli@iki.fi>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: "Luis R. Rodriguez" <mcgrof@do-not-panic.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Kalle Valo <kvalo@qca.qualcomm.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Tony Wu <tung7970@gmail.com>
      Cc: Huaitong Han <huaitong.han@intel.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Gelmini <andrea.gelmini@gelma.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Rabin Vincent <rabin@rab.in>
      Cc: "Maciej W. Rozycki" <macro@imgtec.com>
      Cc: David Daney <david.daney@cavium.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97f2645f
  16. 28 Apr 2016, 2 commits
  17. 19 Feb 2016, 3 commits
    • x86/mm: Factor out LDT init from context init · 39a0526f
      Committed by Dave Hansen
      The arch-specific mm_context_t is a great place to put
      protection-key allocation state.
      
      But, we need to initialize the allocation state because pkey 0 is
      always "allocated".  All of the runtime initialization of
      mm_context_t is done in *_ldt() manipulation functions.  This
      renames the existing LDT functions like this:
      
      	init_new_context() -> init_new_context_ldt()
      	destroy_context() -> destroy_context_ldt()
      
      and makes init_new_context() and destroy_context() available for
      generic use.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210234.DB34FCC5@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      39a0526f
    • mm/core, x86/mm/pkeys: Differentiate instruction fetches · d61172b4
      Committed by Dave Hansen
      As discussed earlier, we attempt to enforce protection keys in
      software.
      
      However, the code checks all faults to ensure that they are not
      violating protection key permissions.  It was assumed that all
      faults are either write faults where we check PKRU[key].WD (write
      disable) or read faults where we check the AD (access disable)
      bit.
      
      But, there is a third category of faults for protection keys:
      instruction faults.  Instruction faults never run afoul of
      protection keys because they do not affect instruction fetches.
      
      So, plumb the PF_INSTR bit down in to the
      arch_vma_access_permitted() function where we do the protection
      key checks.
      
      We also add a new FAULT_FLAG_INSTRUCTION.  This is because
      handle_mm_fault() is not passed the architecture-specific
      error_code where we keep PF_INSTR, so we need to encode the
      instruction fetch information in to the arch-generic fault
      flags.
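      
      The resulting check is, in spirit (a simplified sketch, not the exact x86
      implementation; the "foreign" argument is covered by the next entry):
      
          static inline bool sketch_vma_pkey_access_permitted(struct vm_area_struct *vma,
                                                              bool write, bool execute,
                                                              bool foreign)
          {
              if (execute)
                  return true;   /* pkeys never block instruction fetches */
              if (foreign)
                  return true;   /* remote mm access; see the next entry */
              return __pkru_allows_pkey(vma_pkey(vma), write);
          }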
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d61172b4
    • mm/core: Do not enforce PKEY permissions on remote mm access · 1b2ee126
      Committed by Dave Hansen
      We try to enforce protection keys in software the same way that we
      do in hardware.  (See long example below).
      
      But, we only want to do this when accessing our *own* process's
      memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
      tried to PTRACE_POKE a target process which just happened to have
      some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
      debugger access to that memory.  PKRU is fundamentally a
      thread-local structure and we do not want to enforce it on access
      to _another_ thread's data.
      
      This gets especially tricky when we have workqueues or other
      delayed-work mechanisms that might run in a random process's context.
      We can check that we only enforce pkeys when operating on our *own* mm,
      but delayed work gets performed when a random user context is active.
      We might end up with a situation where a delayed-work gup fails when
      running randomly under its "own" task but succeeds when running under
      another process.  We want to avoid that.
      
      To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
      fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
      walking an mm which is not guaranteed to be the same as
      current->mm and should not be subject to protection key
      enforcement.
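      
      In sketch form, building on the sketch in the previous entry (simplified):
      
          static inline bool sketch_arch_vma_access_permitted(struct vm_area_struct *vma,
                                                              bool write, bool execute,
                                                              bool foreign)
          {
              /* "foreign" is set for FOLL_REMOTE / FAULT_FLAG_REMOTE accesses:
               * PKRU is thread-local, so don't apply it to another mm. */
              if (foreign)
                  return true;
              return sketch_vma_pkey_access_permitted(vma, write, execute, foreign);
          }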
      
      Thanks to Jerome Glisse for pointing out this scenario.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: iommu@lists.linux-foundation.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1b2ee126
  18. 18 Feb 2016, 2 commits
    • mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys · 33a709b2
      Committed by Dave Hansen
      Today, for normal faults and page table walks, we check the VMA
      and/or PTE to ensure that it is compatible with the action.  For
      instance, if we get a write fault on a non-writeable VMA, we
      SIGSEGV.
      
      We try to do the same thing for protection keys.  Basically, we
      try to make sure that if a user does this:
      
      	mprotect(ptr, size, PROT_NONE);
      	*ptr = foo;
      
      they see the same effects with protection keys when they do this:
      
      	mprotect(ptr, size, PROT_READ|PROT_WRITE);
      	set_pkey(ptr, size, 4);
      	wrpkru(0xffffff3f); // access disable pkey 4
      	*ptr = foo;
      
      The state to do that checking is in the VMA, but we also
      sometimes have to do it on the page tables only, like when doing
      a get_user_pages_fast() where we have no VMA.
      
      We add two functions and expose them to generic code:
      
      	arch_pte_access_permitted(pte_flags, write)
      	arch_vma_access_permitted(vma, write)
      
      These are, of course, backed up in x86 arch code with checks
      against the PTE or VMA's protection key.
      
      But, there are also cases where we do not want to respect
      protection keys.  When we ptrace(), for instance, we do not want
      to apply the tracer's PKRU permissions to the PTEs from the
      process being traced.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      33a709b2
    • x86/mm/pkeys: Add arch-specific VMA protection bits · 8f62c883
      Committed by Dave Hansen
      Lots of things seem to do:
      
              vma->vm_page_prot = vm_get_page_prot(flags);
      
      and the ptes get created right from things we pull out
      of ->vm_page_prot.  So it is very convenient if we can
      store the protection key in flags and vm_page_prot, just
      like the existing permission bits (_PAGE_RW/PRESENT).  It
      greatly reduces the amount of plumbing and arch-specific
      hacking we have to do in generic code.
      
      This also takes the new PROT_PKEY{0,1,2,3} flags and
      turns *those* in to VM_ flags for vma->vm_flags.
      
      The protection key values are stored in 4 places:
      	1. "prot" argument to system calls
      	2. vma->vm_flags, filled from the mmap "prot"
      	3. vma->vm_page prot, filled from vma->vm_flags
      	4. the PTE itself.
      
      The pseudocode for these four steps is as follows:
      
      	mmap(PROT_PKEY*)
      	vma->vm_flags 	  = ... | arch_calc_vm_prot_bits(mmap_prot);
      	vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags);
      	pte = pfn | vma->vm_page_prot
      
      Note that this provides a new definitions for x86:
      
      	arch_vm_get_page_prot()
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210210.FE483A42@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8f62c883
  19. 13 January 2016, 1 commit
  20. 11 January 2016, 1 commit
    • A
      x86/mm: Add barriers and document switch_mm()-vs-flush synchronization · 71b3c126
      Committed by Andy Lutomirski
      When switch_mm() activates a new PGD, it also sets a bit that
      tells other CPUs that the PGD is in use so that TLB flush IPIs
      will be sent.  In order for that to work correctly, the bit
      needs to be visible prior to loading the PGD and therefore
      starting to fill the local TLB.
      
      Document all the barriers that make this work correctly and add
      a couple that were missing.
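      
      A minimal C11 sketch of the required ordering follows; cpu_uses_mm and
      load_new_pgd() are hypothetical stand-ins for this CPU's bit in
      mm_cpumask() and the CR3 write, and the explicit fence models the
      smp_mb()-style barrier the real code documents.
      
      #include <stdatomic.h>
      #include <stdio.h>
      
      /* Hypothetical stand-ins for this CPU's bit in mm_cpumask() and the CR3 load. */
      static atomic_int cpu_uses_mm;
      static void load_new_pgd(void) { puts("CR3 loaded; TLB fills may begin"); }
      
      static void sketch_switch_mm(void)
      {
              /*
               * Publish "this CPU is using the new mm" first.  The fence keeps
               * that store ordered before the page-table load, so a concurrent
               * flusher either sees the bit and sends its IPI, or we observe
               * its PTE update before caching anything stale.
               */
              atomic_store(&cpu_uses_mm, 1);
              atomic_thread_fence(memory_order_seq_cst);   /* models smp_mb() */
              load_new_pgd();
      }
      
      int main(void) { sketch_switch_mm(); return 0; }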
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71b3c126
  21. 31 July 2015, 2 commits
    • A
      x86/ldt: Make modify_ldt() optional · a5b9e5a2
      Committed by Andy Lutomirski
      The modify_ldt syscall exposes a large attack surface and is
      unnecessary for modern userspace.  Make it optional.
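      
      A hedged sketch of what "optional" means for callers is below: when the
      (real) CONFIG_MODIFY_LDT_SYSCALL option is disabled, the syscall simply
      reports -ENOSYS.  The helper name and the wiring are illustrative, not the
      actual kernel implementation.
      
      #include <errno.h>
      #include <stddef.h>
      #include <stdio.h>
      
      /* Hypothetical helper standing in for the real LDT write path. */
      static long do_modify_ldt_stub(int func, void *ptr, unsigned long bytecount)
      {
              (void)func; (void)ptr; (void)bytecount;
              return 0;
      }
      
      static long sketch_sys_modify_ldt(int func, void *ptr, unsigned long bytecount)
      {
      #ifdef CONFIG_MODIFY_LDT_SYSCALL
              return do_modify_ldt_stub(func, ptr, bytecount);
      #else
              (void)func; (void)ptr; (void)bytecount;
              return -ENOSYS;   /* compiled out: behave as if the syscall never existed */
      #endif
      }
      
      int main(void)
      {
              printf("modify_ldt -> %ld\n", sketch_sys_modify_ldt(0, NULL, 0));
              return 0;
      }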
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org
      [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a5b9e5a2
    • A
      x86/ldt: Make modify_ldt synchronous · 37868fe1
      Committed by Andy Lutomirski
      modify_ldt() has questionable locking and does not synchronize
      threads.  Improve it: redesign the locking and synchronize all
      threads' LDTs using an IPI on all modifications.
      
      This will dramatically slow down modify_ldt in multithreaded
      programs, but there shouldn't be any multithreaded programs that
      care about modify_ldt's performance in the first place.
      
      This fixes some fallout from the CVE-2015-5157 fixes.
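      
      The sketch below models the synchronization idea only: publish the new
      LDT, then run a callback on every CPU so each one reloads before the
      syscall returns.  All names here (current_ldt, run_on_each_cpu(), the
      single-CPU stub) are hypothetical stand-ins for the real mm->context.ldt
      and on_each_cpu() machinery.
      
      #include <stddef.h>
      #include <stdio.h>
      
      struct ldt_struct { int nentries; };
      static struct ldt_struct *current_ldt;                 /* models mm->context.ldt */
      
      static void reload_ldt_on_this_cpu(void *unused)
      {
              (void)unused;
              printf("CPU reloads LDT with %d entries\n",
                     current_ldt ? current_ldt->nentries : 0);
      }
      
      static void run_on_each_cpu(void (*fn)(void *), void *arg)
      {
              fn(arg);                                       /* pretend there is one CPU */
      }
      
      static void install_ldt_sketch(struct ldt_struct *new_ldt)
      {
              current_ldt = new_ldt;                         /* publish the new table... */
              /*
               * ...then make every CPU reload its LDT register before returning,
               * so all threads observe the change, at the cost of one IPI per
               * modify_ldt() call.
               */
              run_on_each_cpu(reload_ldt_on_this_cpu, NULL);
      }
      
      int main(void)
      {
              struct ldt_struct ldt = { .nentries = 1 };
              install_ldt_sketch(&ldt);
              return 0;
      }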
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/4c6978476782160600471bd865b318db34c7b628.1438291540.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      37868fe1
  22. 10 July 2015, 1 commit
  23. 09 June 2015, 1 commit
  24. 04 February 2015, 3 commits
  25. 23 January 2015, 1 commit
    • D
      x86, mpx: Fix potential performance issue on unmaps · c922228e
      Committed by Dave Hansen
      The 3.19 merge window saw some TLB modifications merged which caused a
      performance regression. They were fixed in commit 045bbb9fa.
      
      Once that fix was applied, I also noticed that there was a small
      but intermittent regression still present.  It was not present
      consistently enough to bisect reliably, but I'm fairly confident
      that it came from (my own) MPX patches.  The source was reading
      a relatively unused field in the mm_struct via arch_unmap.
      
      I also noted that this code was in the main instruction flow of
      do_munmap() and probably had more icache impact than we want.
      
      This patch does two things:
      1. Adds a static (via Kconfig) and dynamic (via cpuid) check
         for MPX with cpu_feature_enabled().  This keeps us from
         reading that cacheline in the mm and trades it for a check
         of the global CPUID variables at least on CPUs without MPX.
      2. Adds an unlikely() to ensure that the MPX call ends up out
         of the main instruction flow in do_munmap().  I've added
         a detailed comment about why this was done and why we want
         it even on systems where MPX is present.
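      
      Both points are easy to see in a small sketch; cpu_has_mpx() and
      mpx_unmap_tables_stub() are hypothetical stand-ins for
      cpu_feature_enabled(X86_FEATURE_MPX) and the real MPX teardown path.
      
      #include <stdbool.h>
      #include <stdio.h>
      
      #define unlikely(x) __builtin_expect(!!(x), 0)  /* keep the cold path out of line */
      
      /* Hypothetical stand-ins for the feature check and the MPX teardown. */
      static bool cpu_has_mpx(void) { return false; }
      static void mpx_unmap_tables_stub(unsigned long start, unsigned long end)
      {
              printf("free bounds tables for [%lx, %lx)\n", start, end);
      }
      
      static void arch_unmap_sketch(unsigned long start, unsigned long end)
      {
              /*
               * The check reads global CPU-feature state instead of a rarely
               * used mm_struct field, and unlikely() moves the MPX branch off
               * do_munmap()'s main instruction flow.
               */
              if (unlikely(cpu_has_mpx()))
                      mpx_unmap_tables_stub(start, end);
      }
      
      int main(void) { arch_unmap_sketch(0x1000, 0x2000); return 0; }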
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: luto@amacapital.net
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20150108223021.AEEAB987@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c922228e
  26. 19 November 2014, 1 commit
    • D
      x86: Cleanly separate use of asm-generic/mm_hooks.h · a1ea1c03
      Committed by Dave Hansen
      asm-generic/mm_hooks.h provides some generic fillers for the 90%
      of architectures that do not need to hook some mmap-manipulation
      functions.  A comment inside says:
      
      > Define generic no-op hooks for arch_dup_mmap and
      > arch_exit_mmap, to be included in asm-FOO/mmu_context.h
      > for any arch FOO which doesn't need to hook these.
      
      So, does x86 need to hook these?  It depends on CONFIG_PARAVIRT.
      We *conditionally* include this generic header if we have
      CONFIG_PARAVIRT=n.  That's madness.
      
      With this patch, x86 stops using asm-generic/mm_hooks.h entirely.
      We use our own copies of the functions.  The paravirt code
      provides some stubs if it is disabled, and we always call those
      stubs in our x86-private versions of arch_exit_mmap() and
      arch_dup_mmap().
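      
      A minimal sketch of that arrangement follows, with CONFIG_PARAVIRT and the
      stub shown schematically rather than as the real x86 code: when paravirt
      is disabled the stub collapses to an inline no-op, so the x86 hook has a
      single, unconditional definition.
      
      #include <stdio.h>
      
      struct mm_struct { int dummy; };
      
      /* Hypothetical sketch of the paravirt stub pair. */
      #ifdef CONFIG_PARAVIRT
      static void paravirt_arch_exit_mmap(struct mm_struct *mm)
      {
              printf("paravirt backend notified for mm %p\n", (void *)mm);
      }
      #else
      static inline void paravirt_arch_exit_mmap(struct mm_struct *mm) { (void)mm; }
      #endif
      
      /* x86-private hook: always calls the (possibly no-op) paravirt stub. */
      static void arch_exit_mmap_sketch(struct mm_struct *mm)
      {
              paravirt_arch_exit_mmap(mm);
      }
      
      int main(void)
      {
              struct mm_struct mm = { 0 };
              arch_exit_mmap_sketch(&mm);
              return 0;
      }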
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/20141118182349.14567FA5@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a1ea1c03
  27. 18 November 2014, 1 commit
    • D
      x86, mpx: Cleanup unused bound tables · 1de4fa14
      Committed by Dave Hansen
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace uses the prctl() introduced earlier in this patchset
      to enable kernel management of bounds tables, then when it
      unmaps the first type of mapping with the actual data, the kernel
      needs to free the mapping for the bounds table backing the data.
      This patch hooks in at the very end of do_munmap() to do so.
      We look at the addresses being unmapped and find the bounds
      directory entries and tables which cover those addresses.  If
      an entire table is unused, we clear the associated directory entry
      and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an entire bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use a
      get_user_pages() to resolve the fault.
      
      The unmapping of bounds tables happens under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
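      
      The fault-retry pattern described above can be sketched as follows;
      pagefault_disable()/pagefault_enable() are real kernel primitives, while
      the no-fault write and the fault-in helper are hypothetical stand-ins for
      the actual user access and the get_user_pages() call.
      
      #include <stdint.h>
      #include <stdio.h>
      
      /* Hypothetical stand-ins for the kernel primitives involved. */
      static int faults_pending = 1;
      static void pagefault_disable(void) { }
      static void pagefault_enable(void)  { }
      static int write_user_word_nofault(uint64_t *uaddr, uint64_t val)
      {
              if (faults_pending)
                      return -1;        /* would have faulted: fail instead */
              *uaddr = val;
              return 0;
      }
      static int fault_in_user_page(uint64_t *uaddr)
      {
              (void)uaddr;
              faults_pending = 0;       /* pretend get_user_pages() resolved it */
              return 0;
      }
      
      /* Clear a bounds-directory entry while mmap_sem is held for write. */
      static int clear_bd_entry_sketch(uint64_t *bd_entry)
      {
              int ret;
      
              do {
                      pagefault_disable();   /* never re-enter the fault handler */
                      ret = write_user_word_nofault(bd_entry, 0);
                      pagefault_enable();
                      /* On failure, resolve the fault explicitly, then retry. */
              } while (ret && fault_in_user_page(bd_entry) == 0);
      
              return ret;
      }
      
      int main(void)
      {
              uint64_t bd_entry = 0xdeadbeef;
              printf("clear -> %d, entry now %llx\n",
                     clear_bd_entry_sketch(&bd_entry),
                     (unsigned long long)bd_entry);
              return 0;
      }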
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1de4fa14