1. 05 4月, 2013 1 次提交
    • J
      mm: prevent mmap_cache race in find_vma() · b6a9b7f6
      Jan Stancek 提交于
      find_vma() can be called by multiple threads with read lock
      held on mm->mmap_sem and any of them can update mm->mmap_cache.
      Prevent compiler from re-fetching mm->mmap_cache, because other
      readers could update it in the meantime:
      
                     thread 1                             thread 2
                                              |
        find_vma()                            |  find_vma()
          struct vm_area_struct *vma = NULL;  |
          vma = mm->mmap_cache;               |
          if (!(vma && vma->vm_end > addr     |
              && vma->vm_start <= addr)) {    |
                                              |    mm->mmap_cache = vma;
          return vma;                         |
           ^^ compiler may optimize this      |
              local variable out and re-read  |
              mm->mmap_cache                  |
      
      This issue can be reproduced with gcc-4.8.0-1 on s390x by running
      mallocstress testcase from LTP, which triggers:
      
        kernel BUG at mm/rmap.c:1088!
          Call Trace:
           ([<000003d100c57000>] 0x3d100c57000)
            [<000000000023a1c0>] do_wp_page+0x2fc/0xa88
            [<000000000023baae>] handle_pte_fault+0x41a/0xac8
            [<000000000023d832>] handle_mm_fault+0x17a/0x268
            [<000000000060507a>] do_protection_exception+0x1e2/0x394
            [<0000000000603a04>] pgm_check_handler+0x138/0x13c
            [<000003fffcf1f07a>] 0x3fffcf1f07a
          Last Breaking-Event-Address:
            [<000000000024755e>] page_add_new_anon_rmap+0xc2/0x168
      
      Thanks to Jakub Jelinek for his insight on gcc and helping to
      track this down.
      Signed-off-by: NJan Stancek <jstancek@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6a9b7f6
  2. 24 2月, 2013 5 次提交
    • M
      mm: accelerate mm_populate() treatment of THP pages · 240aadee
      Michel Lespinasse 提交于
      This change adds a follow_page_mask function which is equivalent to
      follow_page, but with an extra page_mask argument.
      
      follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
      a THP page, and to 0 in other cases.
      
      __get_user_pages() makes use of this in order to accelerate populating
      THP ranges - that is, when both the pages and vmas arrays are NULL, we
      don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
      we also avoid taking mm->page_table_lock that many times).
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      240aadee
    • M
      mm: use long type for page counts in mm_populate() and get_user_pages() · 28a35716
      Michel Lespinasse 提交于
      Use long type for page counts in mm_populate() so as to avoid integer
      overflow when running the following test code:
      
      int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
      }
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28a35716
    • S
      swap: add per-partition lock for swapfile · ec8acf20
      Shaohua Li 提交于
      swap_lock is heavily contended when I test swap to 3 fast SSD (even
      slightly slower than swap to 2 such SSD).  The main contention comes
      from swap_info_get().  This patch tries to fix the gap with adding a new
      per-partition lock.
      
      Global data like nr_swapfiles, total_swap_pages, least_priority and
      swap_list are still protected by swap_lock.
      
      nr_swap_pages is an atomic now, it can be changed without swap_lock.  In
      theory, it's possible get_swap_page() finds no swap pages but actually
      there are free swap pages.  But sounds not a big problem.
      
      Accessing partition specific data (like scan_swap_map and so on) is only
      protected by swap_info_struct.lock.
      
      Changing swap_info_struct.flags need hold swap_lock and
      swap_info_struct.lock, because scan_scan_map() will check it.  read the
      flags is ok with either the locks hold.
      
      If both swap_lock and swap_info_struct.lock must be hold, we always hold
      the former first to avoid deadlock.
      
      swap_entry_free() can change swap_list.  To delete that code, we add a
      new highest_priority_index.  Whenever get_swap_page() is called, we
      check it.  If it's valid, we use it.
      
      It's a pity get_swap_page() still holds swap_lock().  But in practice,
      swap_lock() isn't heavily contended in my test with this patch (or I can
      say there are other much more heavier bottlenecks like TLB flush).  And
      BTW, looks get_swap_page() doesn't really need the lock.  We never free
      swap_info[] and we check SWAP_WRITEOK flag.  The only risk without the
      lock is we could swapout to some low priority swap, but we can quickly
      recover after several rounds of swap, so sounds not a big deal to me.
      But I'd prefer to fix this if it's a real problem.
      
      "swap: make each swap partition have one address_space" improved the
      swapout speed from 1.7G/s to 2G/s.  This patch further improves the
      speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
      so TLB flush isn't the biggest bottleneck before the patches.
      
      [arnd@arndb.de: fix it for nommu]
      [hughd@google.com: add missing unlock]
      [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec8acf20
    • M
      mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool · 41badc15
      Michel Lespinasse 提交于
      do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
      multiple, however there was no equivalent code in mm_populate(), which
      caused issues.
      
      This could be fixed by introduced the same rounding in mm_populate(),
      however I think it's preferable to make do_mmap_pgoff() return populate
      as a size rather than as a boolean, so we don't have to duplicate the
      size rounding logic in mm_populate().
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Tested-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41badc15
    • M
      mm: introduce mm_populate() for populating new vmas · bebeb3d6
      Michel Lespinasse 提交于
      When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
      with MCL_FUTURE in effect), we want to populate the pages within the
      newly created vmas.  This may take a while as we may have to read pages
      from disk, so ideally we want to do this outside of the write-locked
      mmap_sem region.
      
      This change introduces mm_populate(), which is used to defer populating
      such mappings until after the mmap_sem write lock has been released.
      This is implemented as a generalization of the former do_mlock_pages(),
      which accomplished the same task but was using during mlock() /
      mlockall().
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reported-by: NAndy Lutomirski <luto@amacapital.net>
      Acked-by: NRik van Riel <riel@redhat.com>
      Tested-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bebeb3d6
  3. 23 2月, 2013 1 次提交
  4. 08 2月, 2013 1 次提交
  5. 16 11月, 2012 1 次提交
  6. 09 10月, 2012 4 次提交
    • M
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse 提交于
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
    • K
      mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas · e9714acf
      Konstantin Khlebnikov 提交于
      Currently the kernel sets mm->exe_file during sys_execve() and then tracks
      number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
      as this counter drops to zero kernel resets mm->exe_file to NULL.  Plus it
      resets mm->exe_file at last mmput() when mm->mm_users drops to zero.
      
      VMA with VM_EXECUTABLE flag appears after mapping file with flag
      MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
      splitting, because sys_mmap ignores this flag.  Usually binfmt module sets
      mm->exe_file and mmaps executable vmas with this file, they hold
      mm->exe_file while task is running.
      
      comment from v2.6.25-6245-g925d1c40 ("procfs task exe symlink"),
      where all this stuff was introduced:
      
      > The kernel implements readlink of /proc/pid/exe by getting the file from
      > the first executable VMA.  Then the path to the file is reconstructed and
      > reported as the result.
      >
      > Because of the VMA walk the code is slightly different on nommu systems.
      > This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      > walking the VMAs to find the first executable file-backed VMA we store a
      > reference to the exec'd file in the mm_struct.
      >
      > That reference would prevent the filesystem holding the executable file
      > from being unmounted even after unmapping the VMAs.  So we track the number
      > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      > unmapped.  This avoids pinning the mounted filesystem.
      
      exe_file's vma accounting is hooked into every file mmap/unmmap and vma
      split/merge just to fix some hypothetical pinning fs from umounting by mm,
      which already unmapped all its executable files, but still alive.
      
      Seems like currently nobody depends on this behaviour.  We can try to
      remove this logic and keep mm->exe_file until final mmput().
      
      mm->exe_file is still protected with mm->mmap_sem, because we want to
      change it via new sys_prctl(PR_SET_MM_EXE_FILE).  Also via this syscall
      task can change its mm->exe_file and unpin mountpoint explicitly.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9714acf
    • K
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov 提交于
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
  7. 27 9月, 2012 1 次提交
  8. 05 6月, 2012 1 次提交
    • G
      nommu: fix compilation of nommu.c · ad1ed293
      Greg Ungerer 提交于
      Compiling 3.5-rc1 for nommu targets gives:
      
        CC      mm/nommu.o
      mm/nommu.c: In function ‘sys_mmap_pgoff’:
      mm/nommu.c:1489:2: error: ‘ret’ undeclared (first use in this function)
      mm/nommu.c:1489:2: note: each undeclared identifier is reported only once for each function it appears in
      
      It is trivially fixed by replacing 'ret' with the local variable that is
      already defined for the return value 'retval'.
      Signed-off-by: NGreg Ungerer <gerg@uclinux.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ad1ed293
  9. 01 6月, 2012 5 次提交
  10. 31 5月, 2012 1 次提交
  11. 21 4月, 2012 4 次提交
  12. 25 2月, 2012 2 次提交
  13. 17 11月, 2011 1 次提交
  14. 31 10月, 2011 1 次提交
  15. 26 7月, 2011 1 次提交
  16. 09 7月, 2011 1 次提交
    • B
      mm/nommu.c: fix remap_pfn_range() · 8f3b1327
      Bob Liu 提交于
      remap_pfn_range() means map physical address pfn<<PAGE_SHIFT to user addr.
      
      For nommu arch it's implemented by vma->vm_start = pfn << PAGE_SHIFT which
      is wrong acroding the original meaning of this function.  And some driver
      developer using remap_pfn_range() with correct parameter will get
      unexpected result because vm_start is changed.  It should be implementd
      like addr = pfn << PAGE_SHIFT but which is meanless on nommu arch, this
      patch just make it simply return.
      
      Parameter name and setting of vma->vm_flags also be fixed.
      Signed-off-by: NBob Liu <lliubbo@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: NGreg Ungerer <gerg@uclinux.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f3b1327
  17. 23 6月, 2011 1 次提交
    • T
      ptrace: kill trivial tracehooks · a288eecc
      Tejun Heo 提交于
      At this point, tracehooks aren't useful to mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users, it is difficult to tell what are their
      assumptions and they're actually trying to achieve.  To mainline
      kernel, they just aren't worth keeping around.
      
      This patch kills the following trivial tracehooks.
      
      * Ones testing whether task is ptraced.  Replace with ->ptrace test.
      
      	tracehook_expect_breakpoints()
      	tracehook_consider_ignored_signal()
      	tracehook_consider_fatal_signal()
      
      * ptrace_event() wrappers.  Call directly.
      
      	tracehook_report_exec()
      	tracehook_report_exit()
      	tracehook_report_vfork_done()
      
      * ptrace_release_task() wrapper.  Call directly.
      
      	tracehook_finish_release_task()
      
      * noop
      
      	tracehook_prepare_release_task()
      	tracehook_report_death()
      
      This doesn't introduce any behavior change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      a288eecc
  18. 25 5月, 2011 8 次提交