1. 15 4月, 2015 2 次提交
  2. 13 2月, 2015 1 次提交
  3. 12 2月, 2015 4 次提交
    • A
      mm: gup: use get_user_pages_unlocked within get_user_pages_fast · a7b78075
      Andrea Arcangeli 提交于
      This allows the get_user_pages_fast slow path to release the mmap_sem
      before blocking.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7b78075
    • A
      mm: gup: add __get_user_pages_unlocked to customize gup_flags · 0fd71a56
      Andrea Arcangeli 提交于
      Some callers (like KVM) may want to set the gup_flags like FOLL_HWPOSION
      to get a proper -EHWPOSION retval instead of -EFAULT to take a more
      appropriate action if get_user_pages runs into a memory failure.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fd71a56
    • A
      mm: gup: add get_user_pages_locked and get_user_pages_unlocked · f0818f47
      Andrea Arcangeli 提交于
      FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
      reading to reduce the mmap_sem contention (for writing), like while
      waiting for I/O completion.  The problem is that right now practically no
      get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging
      that nifty feature.
      
      Andres fixed it for the KVM page fault.  However get_user_pages_fast
      remains uncovered, and 99% of other get_user_pages aren't using it either
      (the only exception being FOLL_NOWAIT in KVM which is really nonblocking
      and in fact it doesn't even release the mmap_sem).
      
      So this patchsets extends the optimization Andres did in the KVM page
      fault to the whole kernel.  It makes most important places (including
      gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times
      during I/O.
      
      The only few places that remains uncovered are drivers like v4l and other
      exceptions that tends to work on their own memory and they're not working
      on random user memory (for example like O_DIRECT that uses gup_fast and is
      fully covered by this patch).
      
      A follow up patch should probably also add a printk_once warning to
      get_user_pages that should go obsolete and be phased out eventually.  The
      "vmas" parameter of get_user_pages makes it fundamentally incompatible
      with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the
      mmap_sem is released).
      
      While this is just an optimization, this becomes an absolute requirement
      for the userfaultfd feature http://lwn.net/Articles/615086/ .
      
      The userfaultfd allows to block the page fault, and in order to do so I
      need to drop the mmap_sem first.  So this patch also ensures that all
      memory where userfaultfd could be registered by KVM, the very first fault
      (no matter if it is a regular page fault, or a get_user_pages) always has
      FAULT_FOLL_ALLOW_RETRY set.  Then the userfaultfd blocks and it is waken
      only when the pagetable is already mapped.  The second fault attempt after
      the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry
      without it.
      
      This patch (of 5):
      
      We can leverage the VM_FAULT_RETRY functionality in the page fault paths
      better by using either get_user_pages_locked or get_user_pages_unlocked.
      
      The former allows conversion of get_user_pages invocations that will have
      to pass a "&locked" parameter to know if the mmap_sem was dropped during
      the call.  Example from:
      
          down_read(&mm->mmap_sem);
          do_something()
          get_user_pages(tsk, mm, ..., pages, NULL);
          up_read(&mm->mmap_sem);
      
      to:
      
          int locked = 1;
          down_read(&mm->mmap_sem);
          do_something()
          get_user_pages_locked(tsk, mm, ..., pages, &locked);
          if (locked)
              up_read(&mm->mmap_sem);
      
      The latter is suitable only as a drop in replacement of the form:
      
          down_read(&mm->mmap_sem);
          get_user_pages(tsk, mm, ..., pages, NULL);
          up_read(&mm->mmap_sem);
      
      into:
      
          get_user_pages_unlocked(tsk, mm, ..., pages);
      
      Where tsk, mm, the intermediate "..." paramters and "pages" can be any
      value as before.  Just the last parameter of get_user_pages (vmas) must be
      NULL for get_user_pages_locked|unlocked to be usable (the latter original
      form wouldn't have been safe anyway if vmas wasn't null, for the former we
      just make it explicit by dropping the parameter).
      
      If vmas is not NULL these two methods cannot be used.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Reviewed-by: NPeter Feiner <pfeiner@google.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0818f47
    • N
      mm/hugetlb: take page table lock in follow_huge_pmd() · e66f17ff
      Naoya Horiguchi 提交于
      We have a race condition between move_pages() and freeing hugepages, where
      move_pages() calls follow_page(FOLL_GET) for hugepages internally and
      tries to get its refcount without preventing concurrent freeing.  This
      race crashes the kernel, so this patch fixes it by moving FOLL_GET code
      for hugepages into follow_huge_pmd() with taking the page table lock.
      
      This patch intentionally removes page==NULL check after pte_page.
      This is justified because pte_page() never returns NULL for any
      architectures or configurations.
      
      This patch changes the behavior of follow_huge_pmd() for tail pages and
      then tail pages can be pinned/returned.  So the caller must be changed to
      properly handle the returned tail pages.
      
      We could have a choice to add the similar locking to
      follow_huge_(addr|pud) for consistency, but it's not necessary because
      currently these functions don't support FOLL_GET flag, so let's leave it
      for future development.
      
      Here is the reproducer:
      
        $ cat movepages.c
        #include <stdio.h>
        #include <stdlib.h>
        #include <numaif.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
        #define PS              0x1000
      
        int main(int argc, char *argv[]) {
                int i;
                int nr_hp = strtol(argv[1], NULL, 0);
                int nr_p  = nr_hp * HPS / PS;
                int ret;
                void **addrs;
                int *status;
                int *nodes;
                pid_t pid;
      
                pid = strtol(argv[2], NULL, 0);
                addrs  = malloc(sizeof(char *) * nr_p + 1);
                status = malloc(sizeof(char *) * nr_p + 1);
                nodes  = malloc(sizeof(char *) * nr_p + 1);
      
                while (1) {
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 1;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
      
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 0;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err("move_pages");
                }
                return 0;
        }
      
        $ cat hugepage.c
        #include <stdio.h>
        #include <sys/mman.h>
        #include <string.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
      
        int main(int argc, char *argv[]) {
                int nr_hp = strtol(argv[1], NULL, 0);
                char *p;
      
                while (1) {
                        p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                        if (p != (void *)ADDR_INPUT) {
                                perror("mmap");
                                break;
                        }
                        memset(p, 0, nr_hp * HPS);
                        munmap(p, nr_hp * HPS);
                }
        }
      
        $ sysctl vm.nr_hugepages=40
        $ ./hugepage 10 &
        $ ./movepages 10 $(pgrep -f hugepage)
      
      Fixes: e632a938 ("mm: migrate: add hugepage migration code to move_pages()")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e66f17ff
  4. 11 2月, 2015 1 次提交
  5. 30 1月, 2015 1 次提交
    • L
      vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Linus Torvalds 提交于
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de>
      Tested-by: NJan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33692f27
  6. 19 1月, 2015 1 次提交
  7. 18 12月, 2014 1 次提交
  8. 14 11月, 2014 1 次提交
  9. 10 10月, 2014 1 次提交
    • S
      mm: introduce a general RCU get_user_pages_fast() · 2667f50e
      Steve Capper 提交于
      This series implements general forms of get_user_pages_fast and
      __get_user_pages_fast in core code and activates them for arm and arm64.
      
      These are required for Transparent HugePages to function correctly, as a
      futex on a THP tail will otherwise result in an infinite loop (due to the
      core implementation of __get_user_pages_fast always returning 0).
      
      Unfortunately, a futex on THP tail can be quite common for certain
      workloads; thus THP is unreliable without a __get_user_pages_fast
      implementation.
      
      This series may also be beneficial for direct-IO heavy workloads and
      certain KVM workloads.
      
      This patch (of 6):
      
      get_user_pages_fast() attempts to pin user pages by walking the page
      tables directly and avoids taking locks.  Thus the walker needs to be
      protected from page table pages being freed from under it, and needs to
      block any THP splits.
      
      One way to achieve this is to have the walker disable interrupts, and rely
      on IPIs from the TLB flushing code blocking before the page table pages
      are freed.
      
      On some platforms we have hardware broadcast of TLB invalidations, thus
      the TLB flushing code doesn't necessarily need to broadcast IPIs; and
      spuriously broadcasting IPIs can hurt system performance if done too
      often.
      
      This problem has been solved on PowerPC and Sparc by batching up page
      table pages belonging to more than one mm_user, then scheduling an
      rcu_sched callback to free the pages.  This RCU page table free logic has
      been promoted to core code and is activated when one enables
      HAVE_RCU_TABLE_FREE.  Unfortunately, these architectures implement their
      own get_user_pages_fast routines.
      
      The RCU page table free logic coupled with an IPI broadcast on THP split
      (which is a rare event), allows one to protect a page table walker by
      merely disabling the interrupts during the walk.
      
      This patch provides a general RCU implementation of get_user_pages_fast
      that can be used by architectures that perform hardware broadcast of TLB
      invalidations.
      
      It is based heavily on the PowerPC implementation by Nick Piggin.
      
      [akpm@linux-foundation.org: various comment fixes]
      Signed-off-by: NSteve Capper <steve.capper@linaro.org>
      Tested-by: NDann Frazier <dann.frazier@canonical.com>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2667f50e
  10. 24 9月, 2014 1 次提交
    • A
      kvm: Faults which trigger IO release the mmap_sem · 234b239b
      Andres Lagar-Cavilla 提交于
      When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
      has been swapped out or is behind a filemap, this will trigger async
      readahead and return immediately. The rationale is that KVM will kick
      back the guest with an "async page fault" and allow for some other
      guest process to take over.
      
      If async PFs are enabled the fault is retried asap from an async
      workqueue. If not, it's retried immediately in the same code path. In
      either case the retry will not relinquish the mmap semaphore and will
      block on the IO. This is a bad thing, as other mmap semaphore users
      now stall as a function of swap or filemap latency.
      
      This patch ensures both the regular and async PF path re-enter the
      fault allowing for the mmap semaphore to be relinquished in the case
      of IO wait.
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      234b239b
  11. 07 8月, 2014 1 次提交
  12. 05 6月, 2014 5 次提交