1. 12 2月, 2015 3 次提交
    • R
      mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() · 8138a67a
      Roman Gushchin 提交于
      I noticed that "allowed" can easily overflow by falling below 0, because
      (total_vm / 32) can be larger than "allowed".  The problem occurs in
      OVERCOMMIT_NONE mode.
      
      In this case, a huge allocation can success and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
      (system-wide), so system become unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() large file multiple times (with VM_SHARED flag)
      3) try to malloc() large amount of memory
      
      It also can be reproduced on newer kernels, but miss-configured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
      Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8138a67a
    • A
      mm: gup: add __get_user_pages_unlocked to customize gup_flags · 0fd71a56
      Andrea Arcangeli 提交于
      Some callers (like KVM) may want to set the gup_flags like FOLL_HWPOSION
      to get a proper -EHWPOSION retval instead of -EFAULT to take a more
      appropriate action if get_user_pages runs into a memory failure.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fd71a56
    • A
      mm: gup: add get_user_pages_locked and get_user_pages_unlocked · f0818f47
      Andrea Arcangeli 提交于
      FAULT_FOLL_ALLOW_RETRY allows the page fault to drop the mmap_sem for
      reading to reduce the mmap_sem contention (for writing), like while
      waiting for I/O completion.  The problem is that right now practically no
      get_user_pages call uses FAULT_FOLL_ALLOW_RETRY, so we're not leveraging
      that nifty feature.
      
      Andres fixed it for the KVM page fault.  However get_user_pages_fast
      remains uncovered, and 99% of other get_user_pages aren't using it either
      (the only exception being FOLL_NOWAIT in KVM which is really nonblocking
      and in fact it doesn't even release the mmap_sem).
      
      So this patchsets extends the optimization Andres did in the KVM page
      fault to the whole kernel.  It makes most important places (including
      gup_fast) to use FAULT_FOLL_ALLOW_RETRY to reduce the mmap_sem hold times
      during I/O.
      
      The only few places that remains uncovered are drivers like v4l and other
      exceptions that tends to work on their own memory and they're not working
      on random user memory (for example like O_DIRECT that uses gup_fast and is
      fully covered by this patch).
      
      A follow up patch should probably also add a printk_once warning to
      get_user_pages that should go obsolete and be phased out eventually.  The
      "vmas" parameter of get_user_pages makes it fundamentally incompatible
      with FAULT_FOLL_ALLOW_RETRY (vmas array becomes meaningless the moment the
      mmap_sem is released).
      
      While this is just an optimization, this becomes an absolute requirement
      for the userfaultfd feature http://lwn.net/Articles/615086/ .
      
      The userfaultfd allows to block the page fault, and in order to do so I
      need to drop the mmap_sem first.  So this patch also ensures that all
      memory where userfaultfd could be registered by KVM, the very first fault
      (no matter if it is a regular page fault, or a get_user_pages) always has
      FAULT_FOLL_ALLOW_RETRY set.  Then the userfaultfd blocks and it is waken
      only when the pagetable is already mapped.  The second fault attempt after
      the wakeup doesn't need FAULT_FOLL_ALLOW_RETRY, so it's ok to retry
      without it.
      
      This patch (of 5):
      
      We can leverage the VM_FAULT_RETRY functionality in the page fault paths
      better by using either get_user_pages_locked or get_user_pages_unlocked.
      
      The former allows conversion of get_user_pages invocations that will have
      to pass a "&locked" parameter to know if the mmap_sem was dropped during
      the call.  Example from:
      
          down_read(&mm->mmap_sem);
          do_something()
          get_user_pages(tsk, mm, ..., pages, NULL);
          up_read(&mm->mmap_sem);
      
      to:
      
          int locked = 1;
          down_read(&mm->mmap_sem);
          do_something()
          get_user_pages_locked(tsk, mm, ..., pages, &locked);
          if (locked)
              up_read(&mm->mmap_sem);
      
      The latter is suitable only as a drop in replacement of the form:
      
          down_read(&mm->mmap_sem);
          get_user_pages(tsk, mm, ..., pages, NULL);
          up_read(&mm->mmap_sem);
      
      into:
      
          get_user_pages_unlocked(tsk, mm, ..., pages);
      
      Where tsk, mm, the intermediate "..." paramters and "pages" can be any
      value as before.  Just the last parameter of get_user_pages (vmas) must be
      NULL for get_user_pages_locked|unlocked to be usable (the latter original
      form wouldn't have been safe anyway if vmas wasn't null, for the former we
      just make it explicit by dropping the parameter).
      
      If vmas is not NULL these two methods cannot be used.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Reviewed-by: NPeter Feiner <pfeiner@google.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0818f47
  2. 11 2月, 2015 1 次提交
    • K
      mm: replace remap_file_pages() syscall with emulation · c8d78c18
      Kirill A. Shutemov 提交于
      remap_file_pages(2) was invented to be able efficiently map parts of
      huge file into limited 32-bit virtual address space such as in database
      workloads.
      
      Nonlinear mappings are pain to support and it seems there's no
      legitimate use-cases nowadays since 64-bit systems are widely available.
      
      Let's drop it and get rid of all these special-cased code.
      
      The patch replaces the syscall with emulation which creates new VMA on
      each remap_file_pages(), unless they it can be merged with an adjacent
      one.
      
      I didn't find *any* real code that uses remap_file_pages(2) to test
      emulation impact on.  I've checked Debian code search and source of all
      packages in ALT Linux.  No real users: libc wrappers, mentions in
      strace, gdb, valgrind and this kind of stuff.
      
      There are few basic tests in LTP for the syscall.  They work just fine
      with emulation.
      
      To test performance impact, I've written small test case which
      demonstrate pretty much worst case scenario: map 4G shmfs file, write to
      begin of every page pgoff of the page, remap pages in reverse order,
      read every page.
      
      The test creates 1 million of VMAs if emulation is in use, so I had to
      set vm.max_map_count to 1100000 to avoid -ENOMEM.
      
      Before:		23.3 ( +-  4.31% ) seconds
      After:		43.9 ( +-  0.85% ) seconds
      Slowdown:	1.88x
      
      I believe we can live with that.
      
      Test case:
      
              #define _GNU_SOURCE
              #include <assert.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <sys/mman.h>
      
              #define MB	(1024UL * 1024)
              #define SIZE	(4096 * MB)
      
              int main(int argc, char **argv)
              {
                      unsigned long *p;
                      long i, pass;
      
                      for (pass = 0; pass < 10; pass++) {
                              p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                              if (p == MAP_FAILED) {
                                      perror("mmap");
                                      return -1;
                              }
      
                              for (i = 0; i < SIZE / 4096; i++)
                                      p[i * 4096 / sizeof(*p)] = i;
      
                              for (i = 0; i < SIZE / 4096; i++) {
                                      if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                      0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                              perror("remap_file_pages");
                                              return -1;
                                      }
                              }
      
                              for (i = SIZE / 4096 - 1; i >= 0; i--)
                                      assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
      
                              munmap(p, SIZE);
                      }
      
                      return 0;
              }
      
      [akpm@linux-foundation.org: fix spello]
      [sasha.levin@oracle.com: initialize populate before usage]
      [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
      Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8d78c18
  3. 06 2月, 2015 1 次提交
  4. 14 12月, 2014 3 次提交
  5. 08 9月, 2014 1 次提交
    • T
      percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo 提交于
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      908c7f19
  6. 09 8月, 2014 1 次提交
    • A
      arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area · a6c19dfe
      Andy Lutomirski 提交于
      The core mm code will provide a default gate area based on
      FIXADDR_USER_START and FIXADDR_USER_END if
      !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).
      
      This default is only useful for ia64.  arm64, ppc, s390, sh, tile, 64-bit
      UML, and x86_32 have their own code just to disable it.  arm, 32-bit UML,
      and x86_64 have gate areas, but they have their own implementations.
      
      This gets rid of the default and moves the code into ia64.
      
      This should save some code on architectures without a gate area: it's now
      possible to inline the gate_area functions in the default case.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Acked-by: NNathan Lynch <nathan_lynch@mentor.com>
      Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle]
      Acked-by: Richard Weinberger <richard@nod.at> [for um]
      Acked-by: Will Deacon <will.deacon@arm.com> [for arm64]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Nathan Lynch <Nathan_Lynch@mentor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6c19dfe
  7. 24 6月, 2014 1 次提交
    • S
      mm: nommu: per-thread vma cache fix · e020d5bd
      Steven Miao 提交于
      mm could be removed from current task struct, using previous vma->vm_mm
      
      It will crash on blackfin after updated to Linux 3.15.  The commit "mm:
      per-thread vma caching" caused the crash.  mm could be removed from
      current task struct before
      
        mmput()->
          exit_mmap()->
            delete_vma_from_mm()
      
      the detailed fault information:
      
          NULL pointer access
          Kernel OOPS in progress
          Deferred Exception context
          CURRENT PROCESS:
          COMM=modprobe PID=278  CPU=0
          invalid mm
          return address: [0x000531de]; contents of:
          0x000531b0:  c727  acea  0c42  181d  0000  0000  0000  a0a8
          0x000531c0:  b090  acaa  0c42  1806  0000  0000  0000  a0e8
          0x000531d0:  b0d0  e801  0000  05b3  0010  e522  0046 [a090]
          0x000531e0:  6408  b090  0c00  17cc  3042  e3ff  f37b  2fc8
      
          CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25
          task: 0572b720 ti: 0569e000 task.ti: 0569e000
          Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0)
          ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off)
          Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014
      
          SEQUENCER STATUS:		Not tainted
           SEQSTAT: 00000027  IPEND: 8008  IMASK: ffff  SYSCFG: 2806
            EXCAUSE   : 0x27
            physical IVG3 asserted : <0xffa00744> { _trap + 0x0 }
            physical IVG15 asserted : <0xffa00d68> { _evt_system_call + 0x0 }
            logical irq   6 mapped  : <0xffa003bc> { _bfin_coretmr_interrupt + 0x0 }
            logical irq   7 mapped  : <0x00008828> { _bfin_fault_routine + 0x0 }
            logical irq  11 mapped  : <0x00007724> { _l2_ecc_err + 0x0 }
            logical irq  13 mapped  : <0x00008828> { _bfin_fault_routine + 0x0 }
            logical irq  39 mapped  : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 }
            logical irq  40 mapped  : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 }
           RETE: <0x00000000> /* Maybe null pointer? */
           RETN: <0x0569fe50> /* kernel dynamic memory (maybe user-space) */
           RETX: <0x00000480> /* Maybe fixed code section */
           RETS: <0x00053384> { _exit_mmap + 0x28 }
           PC  : <0x000531de> { _delete_vma_from_mm + 0x92 }
          DCPLB_FAULT_ADDR: <0x00000008> /* Maybe null pointer? */
          ICPLB_FAULT_ADDR: <0x000531de> { _delete_vma_from_mm + 0x92 }
          PROCESSOR STATE:
           R0 : 00000004    R1 : 0569e000    R2 : 00bf3db4    R3 : 00000000
           R4 : 057f9800    R5 : 00000001    R6 : 0569ddd0    R7 : 0572b720
           P0 : 0572b854    P1 : 00000004    P2 : 00000000    P3 : 0569dda0
           P4 : 0572b720    P5 : 0566c368    FP : 0569fe5c    SP : 0569fd74
           LB0: 057f523f    LT0: 057f523e    LC0: 00000000
           LB1: 0005317c    LT1: 00053172    LC1: 00000002
           B0 : 00000000    L0 : 00000000    M0 : 0566f5bc    I0 : 00000000
           B1 : 00000000    L1 : 00000000    M1 : 00000000    I1 : ffffffff
           B2 : 00000001    L2 : 00000000    M2 : 00000000    I2 : 00000000
           B3 : 00000000    L3 : 00000000    M3 : 00000000    I3 : 057f8000
          A0.w: 00000000   A0.x: 00000000   A1.w: 00000000   A1.x: 00000000
          USP : 056ffcf8  ASTAT: 02003024
      
          Hardware Trace:
             0 Target : <0x00003fb8> { _trap_c + 0x0 }
               Source : <0xffa006d8> { _exception_to_level5 + 0xa0 } JUMP.L
             1 Target : <0xffa00638> { _exception_to_level5 + 0x0 }
               Source : <0xffa004f2> { _bfin_return_from_exception + 0x6 } RTX
             2 Target : <0xffa004ec> { _bfin_return_from_exception + 0x0 }
               Source : <0xffa00590> { _ex_trap_c + 0x70 } JUMP.S
             3 Target : <0xffa00520> { _ex_trap_c + 0x0 }
               Source : <0xffa0076e> { _trap + 0x2a } JUMP (P4)
             4 Target : <0xffa00744> { _trap + 0x0 }
                FAULT : <0x000531de> { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2]
               Source : <0x000531da> { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18]
             5 Target : <0x000531da> { _delete_vma_from_mm + 0x8e }
               Source : <0x00053176> { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel
             6 Target : <0x0005314c> { _delete_vma_from_mm + 0x0 }
               Source : <0x00053380> { _exit_mmap + 0x24 } JUMP.L
             7 Target : <0x00053378> { _exit_mmap + 0x1c }
               Source : <0x00053394> { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP)
             8 Target : <0x00053390> { _exit_mmap + 0x34 }
               Source : <0xffa020e0> { __cond_resched + 0x20 } RTS
             9 Target : <0xffa020c0> { __cond_resched + 0x0 }
               Source : <0x0005338c> { _exit_mmap + 0x30 } JUMP.L
            10 Target : <0x0005338c> { _exit_mmap + 0x30 }
               Source : <0x0005333a> { _delete_vma + 0xb2 } RTS
            11 Target : <0x00053334> { _delete_vma + 0xac }
               Source : <0x0005507a> { _kmem_cache_free + 0xba } RTS
            12 Target : <0x00055068> { _kmem_cache_free + 0xa8 }
               Source : <0x0005505e> { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP)
            13 Target : <0x00055052> { _kmem_cache_free + 0x92 }
               Source : <0x0005501a> { _kmem_cache_free + 0x5a } IF CC JUMP pcrel
            14 Target : <0x00054ff4> { _kmem_cache_free + 0x34 }
               Source : <0x00054fce> { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP)
            15 Target : <0x00054fc0> { _kmem_cache_free + 0x0 }
               Source : <0x00053330> { _delete_vma + 0xa8 } JUMP.L
          Kernel Stack
          Stack info:
           SP: [0x0569ff24] <0x0569ff24> /* kernel dynamic memory (maybe user-space) */
           Memory from 0x0569ff20 to 056a0000
          0569ff20: 00000001 [04e8da5a] 00008000  00000000  00000000  056a0000  04e8da5a  04e8da5a
          0569ff40: 04eb9eea  ffa00dce  02003025  04ea09c5  057f523f  04ea09c4  057f523e  00000000
          0569ff60: 00000000  00000000  00000000  00000000  00000000  00000000  00000001  00000000
          0569ff80: 00000000  00000000  00000000  00000000  00000000  00000000  00000000  00000000
          0569ffa0: 0566f5bc  057f8000  057f8000  00000001  04ec0170  056ffcf8  056ffd04  057f9800
          0569ffc0: 04d1d498  057f9800  057f8fe4  057f8ef0  00000001  057f928c  00000001  00000001
          0569ffe0: 057f9800  00000000  00000008  00000007  00000001  00000001  00000001 <00002806>
          Return addresses in stack:
              address : <0x00002806> { _show_cpuinfo + 0x2d2 }
          Modules linked in:
          Kernel panic - not syncing: Kernel exception
          [ end Kernel panic - not syncing: Kernel exception
      Signed-off-by: NSteven Miao <realmz6@gmail.com>
      Acked-by: NDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.15.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e020d5bd
  8. 07 6月, 2014 1 次提交
  9. 08 4月, 2014 4 次提交
    • C
    • G
      mm: use macros from compiler.h instead of __attribute__((...)) · 3b32123d
      Gideon Israel Dsouza 提交于
      To increase compiler portability there is <linux/compiler.h> which
      provides convenience macros for various gcc constructs.  Eg: __weak for
      __attribute__((weak)).  I've replaced all instances of gcc attributes with
      the right macro in the memory management (/mm) subsystem.
      
      [akpm@linux-foundation.org: while-we're-there consistency tweaks]
      Signed-off-by: NGideon Israel Dsouza <gidisrael@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b32123d
    • D
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso 提交于
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      615d6e87
    • K
      mm: implement ->map_pages for page cache · f1820361
      Kirill A. Shutemov 提交于
      filemap_map_pages() is generic implementation of ->map_pages() for
      filesystems who uses page cache.
      
      It should be safe to use filemap_map_pages() for ->map_pages() if
      filesystem use filemap_fault() for ->fault().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1820361
  10. 31 3月, 2014 1 次提交
  11. 22 1月, 2014 1 次提交
    • J
      mm: add overcommit_kbytes sysctl variable · 49f0ce5f
      Jerome Marchand 提交于
      Some applications that run on HPC clusters are designed around the
      availability of RAM and the overcommit ratio is fine tuned to get the
      maximum usage of memory without swapping.  With growing memory, the
      1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
      for these workload (on a 2TB machine it represents no less than 20GB).
      
      This patch adds the new overcommit_kbytes sysctl variable that allow a
      much finer grain.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49f0ce5f
  12. 13 11月, 2013 1 次提交
  13. 25 10月, 2013 1 次提交
  14. 11 7月, 2013 1 次提交
  15. 04 7月, 2013 2 次提交
  16. 30 4月, 2013 3 次提交
    • A
      mm: replace hardcoded 3% with admin_reserve_pages knob · 4eeab4f5
      Andrew Shewmaker 提交于
      Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
      memory reserve to something other than 3%, which may be multiple
      gigabytes on large memory systems.  Only about 8MB is necessary to
      enable recovery in the default mode, and only a few hundred MB are
      required even when overcommit is disabled.
      
      This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.
      
      admin_reserve_kbytes is initialized to min(3% free pages, 8MB)
      
      I arrived at 8MB by summing the RSS of sshd or login, bash, and top.
      
      Please see first patch in this series for full background, motivation,
      testing, and full changelog.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_admin_reserve() static]
      Signed-off-by: NAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eeab4f5
    • A
      mm: limit growth of 3% hardcoded other user reserve · c9b1d098
      Andrew Shewmaker 提交于
      Add user_reserve_kbytes knob.
      
      Limit the growth of the memory reserved for other user processes to
      min(3% current process size, user_reserve_pages).  Only about 8MB is
      necessary to enable recovery in the default mode, and only a few hundred
      MB are required even when overcommit is disabled.
      
      user_reserve_pages defaults to min(3% free pages, 128MB)
      
      I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
      then adding the RSS of each.
      
      This only affects OVERCOMMIT_NEVER mode.
      
      Background
      
      1. user reserve
      
      __vm_enough_memory reserves a hardcoded 3% of the current process size for
      other applications when overcommit is disabled.  This was done so that a
      user could recover if they launched a memory hogging process.  Without the
      reserve, a user would easily run into a message such as:
      
      bash: fork: Cannot allocate memory
      
      2. admin reserve
      
      Additionally, a hardcoded 3% of free memory is reserved for root in both
      overcommit 'guess' and 'never' modes.  This was intended to prevent a
      scenario where root-cant-log-in and perform recovery operations.
      
      Note that this reserve shrinks, and doesn't guarantee a useful reserve.
      
      Motivation
      
      The two hardcoded memory reserves should be updated to account for current
      memory sizes.
      
      Also, the admin reserve would be more useful if it didn't shrink too much.
      
      When the current code was originally written, 1GB was considered
      "enterprise".  Now the 3% reserve can grow to multiple GB on large memory
      systems, and it only needs to be a few hundred MB at most to enable a user
      or admin to recover a system with an unwanted memory hogging process.
      
      I've found that reducing these reserves is especially beneficial for a
      specific type of application load:
      
       * single application system
       * one or few processes (e.g. one per core)
       * allocating all available memory
       * not initializing every page immediately
       * long running
      
      I've run scientific clusters with this sort of load.  A long running job
      sometimes failed many hours (weeks of CPU time) into a calculation.  They
      weren't initializing all of their memory immediately, and they weren't
      using calloc, so I put systems into overcommit 'never' mode.  These
      clusters run diskless and have no swap.
      
      However, with the current reserves, a user wishing to allocate as much
      memory as possible to one process may be prevented from using, for
      example, almost 2GB out of 32GB.
      
      The effect is less, but still significant when a user starts a job with
      one process per core.  I have repeatedly seen a set of processes
      requesting the same amount of memory fail because one of them could not
      allocate the amount of memory a user would expect to be able to allocate.
      For example, Message Passing Interfce (MPI) processes, one per core.  And
      it is similar for other parallel programming frameworks.
      
      Changing this reserve code will make the overcommit never mode more useful
      by allowing applications to allocate nearly all of the available memory.
      
      Also, the new admin_reserve_kbytes will be safer than the current behavior
      since the hardcoded 3% of available memory reserve can shrink to something
      useless in the case where applications have grabbed all available memory.
      
      Risks
      
      * "bash: fork: Cannot allocate memory"
      
        The downside of the first patch-- which creates a tunable user reserve
        that is only used in overcommit 'never' mode--is that an admin can set
        it so low that a user may not be able to kill their process, even if
        they already have a shell prompt.
      
        Of course, a user can get in the same predicament with the current 3%
        reserve--they just have to launch processes until 3% becomes negligible.
      
      * root-cant-log-in problem
      
        The second patch, adding the tunable rootuser_reserve_pages, allows
        the admin to shoot themselves in the foot by setting it too small.  They
        can easily get the system into a state where root-can't-log-in.
      
        However, the new admin_reserve_kbytes will be safer than the current
        behavior since the hardcoded 3% of available memory reserve can shrink
        to something useless in the case where applications have grabbed all
        available memory.
      
      Alternatives
      
       * Memory cgroups provide a more flexible way to limit application memory.
      
         Not everyone wants to set up cgroups or deal with their overhead.
      
       * We could create a fourth overcommit mode which provides smaller reserves.
      
         The size of useful reserves may be drastically different depending
         on the whether the system is embedded or enterprise.
      
       * Force users to initialize all of their memory or use calloc.
      
         Some users don't want/expect the system to overcommit when they malloc.
         Overcommit 'never' mode is for this scenario, and it should work well.
      
      The new user and admin reserve tunables are simple to use, with low
      overhead compared to cgroups.  The patches preserve current behavior where
      3% of memory is less than 128MB, except that the admin reserve doesn't
      shrink to an unusable size under pressure.  The code allows admins to tune
      for embedded and enterprise usage.
      
      FAQ
      
       * How is the root-cant-login problem addressed?
         What happens if admin_reserve_pages is set to 0?
      
         Root is free to shoot themselves in the foot by setting
         admin_reserve_kbytes too low.
      
         On x86_64, the minimum useful reserve is:
           8MB for overcommit 'guess'
         128MB for overcommit 'never'
      
         admin_reserve_pages defaults to min(3% free memory, 8MB)
      
         So, anyone switching to 'never' mode needs to adjust
         admin_reserve_pages.
      
       * How do you calculate a minimum useful reserve?
      
         A user or the admin needs enough memory to login and perform
         recovery operations, which includes, at a minimum:
      
         sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
      
         For overcommit 'guess', we can sum resident set sizes (RSS)
         because we only need enough memory to handle what the recovery
         programs will typically use. On x86_64 this is about 8MB.
      
         For overcommit 'never', we can take the max of their virtual sizes (VSZ)
         and add the sum of their RSS. We use VSZ instead of RSS because mode
         forces us to ensure we can fulfill all of the requested memory allocations--
         even if the programs only use a fraction of what they ask for.
         On x86_64 this is about 128MB.
      
         When swap is enabled, reserves are useful even when they are as
         small as 10MB, regardless of overcommit mode.
      
         When both swap and overcommit are disabled, then the admin should
         tune the reserves higher to be absolutley safe. Over 230MB each
         was safest in my testing.
      
       * What happens if user_reserve_pages is set to 0?
      
         Note, this only affects overcomitt 'never' mode.
      
         Then a user will be able to allocate all available memory minus
         admin_reserve_kbytes.
      
         However, they will easily see a message such as:
      
         "bash: fork: Cannot allocate memory"
      
         And they won't be able to recover/kill their application.
         The admin should be able to recover the system if
         admin_reserve_kbytes is set appropriately.
      
       * What's the difference between overcommit 'guess' and 'never'?
      
         "Guess" allows an allocation if there are enough free + reclaimable
         pages. It has a hardcoded 3% of free pages reserved for root.
      
         "Never" allows an allocation if there is enough swap + a configurable
         percentage (default is 50) of physical RAM. It has a hardcoded 3% of
         free pages reserved for root, like "Guess" mode. It also has a
         hardcoded 3% of the current process size reserved for additional
         applications.
      
       * Why is overcommit 'guess' not suitable even when an app eventually
         writes to every page? It takes free pages, file pages, available
         swap pages, reclaimable slab pages into consideration. In other words,
         these are all pages available, then why isn't overcommit suitable?
      
         Because it only looks at the present state of the system. It
         does not take into account the memory that other applications have
         malloced, but haven't initialized yet. It overcommits the system.
      
      Test Summary
      
      There was little change in behavior in the default overcommit 'guess'
      mode with swap enabled before and after the patch. This was expected.
      
      Systems run most predictably (i.e. no oom kills) in overcommit 'never'
      mode with swap enabled. This also allowed the most memory to be allocated
      to a user application.
      
      Overcommit 'guess' mode without swap is a bad idea. It is easy to
      crash the system. None of the other tested combinations crashed.
      This matches my experience on the Roadrunner supercomputer.
      
      Without the tunable user reserve, a system in overcommit 'never' mode
      and without swap does not allow the admin to recover, although the
      admin can.
      
      With the new tunable reserves, a system in overcommit 'never' mode
      and without swap can be configured to:
      
      1. maximize user-allocatable memory, running close to the edge of
      recoverability
      
      2. maximize recoverability, sacrificing allocatable memory to
      ensure that a user cannot take down a system
      
      Test Description
      
      Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
      
      System is booted into multiuser console mode, with unnecessary services
      turned off. Caches were dropped before each test.
      
      Hogs are user memtester processes that attempt to allocate all free memory
      as reported by /proc/meminfo
      
      In overcommit 'never' mode, memory_ratio=100
      
      Test Results
      
      3.9.0-rc1-mm1
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5432/5432       no     yes             yes
      guess        yes    4      5444/5444       1      yes             yes
      guess        no     1      5302/5449       no     yes             yes
      guess        no     4      -               crash  no              no
      
      never        yes    1      5460/5460       1      yes             yes
      never        yes    4      5460/5460       1      yes             yes
      never        no     1      5218/5432       no     no              yes
      never        no     4      5203/5448       no     no              yes
      
      3.9.0-rc1-mm1-tunablereserves
      
      User and Admin Recovery show their respective reserves, if applicable.
      
      Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
      ----------   ----   ----   -------------   ----   -------------   --------------
      guess        yes    1      5419/5419       no     - yes           8MB yes
      guess        yes    4      5436/5436       1      - yes           8MB yes
      guess        no     1      5440/5440       *      - yes           8MB yes
      guess        no     4      -               crash  - no            8MB no
      
      * process would successfully mlock, then the oom killer would pick it
      
      never        yes    1      5446/5446       no     10MB yes        20MB yes
      never        yes    4      5456/5456       no     10MB yes        20MB yes
      never        no     1      5387/5429       no     128MB no        8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      never        no     1      5323/5428       no     226MB barely    8MB barely
      
      never        no     1      5359/5448       no     10MB no         10MB barely
      
      never        no     1      5323/5428       no     0MB no          10MB barely
      never        no     1      5332/5428       no     0MB no          50MB yes
      never        no     1      5293/5429       no     0MB no          90MB yes
      
      never        no     1      5001/5427       no     230MB yes       338MB yes
      never        no     4*     4998/5424       no     230MB yes       338MB yes
      
      * more memtesters were launched, able to allocate approximately another 100MB
      
      Future Work
      
       - Test larger memory systems.
      
       - Test an embedded image.
      
       - Test other architectures.
      
       - Time malloc microbenchmarks.
      
       - Would it be useful to be able to set overcommit policy for
         each memory cgroup?
      
       - Some lines are slightly above 80 chars.
         Perhaps define a macro to convert between pages and kb?
         Other places in the kernel do this.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: make init_user_reserve() static]
      Signed-off-by: NAndrew Shewmaker <agshew@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9b1d098
    • J
      mm, vmalloc: export vmap_area_list, instead of vmlist · f1c4069e
      Joonsoo Kim 提交于
      Although our intention is to unexport internal structure entirely, but
      there is one exception for kexec.  kexec dumps address of vmlist and
      makedumpfile uses this information.
      
      We are about to remove vmlist, then another way to retrieve information
      of vmalloc layer is needed for makedumpfile.  For this purpose, we
      export vmap_area_list, instead of vmlist.
      Signed-off-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Dave Anderson <anderson@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1c4069e
  17. 28 4月, 2013 1 次提交
    • L
      vm: add no-mmu vm_iomap_memory() stub · 3c0b9de6
      Linus Torvalds 提交于
      I think we could just move the full vm_iomap_memory() function into
      util.h or similar, but I didn't get any reply from anybody actually
      using nommu even to this trivial patch, so I'm not going to touch it any
      more than required.
      
      Here's the fairly minimal stub to make the nommu case at least
      potentially work.  It doesn't seem like anybody cares, though.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c0b9de6
  18. 05 4月, 2013 1 次提交
    • J
      mm: prevent mmap_cache race in find_vma() · b6a9b7f6
      Jan Stancek 提交于
      find_vma() can be called by multiple threads with read lock
      held on mm->mmap_sem and any of them can update mm->mmap_cache.
      Prevent compiler from re-fetching mm->mmap_cache, because other
      readers could update it in the meantime:
      
                     thread 1                             thread 2
                                              |
        find_vma()                            |  find_vma()
          struct vm_area_struct *vma = NULL;  |
          vma = mm->mmap_cache;               |
          if (!(vma && vma->vm_end > addr     |
              && vma->vm_start <= addr)) {    |
                                              |    mm->mmap_cache = vma;
          return vma;                         |
           ^^ compiler may optimize this      |
              local variable out and re-read  |
              mm->mmap_cache                  |
      
      This issue can be reproduced with gcc-4.8.0-1 on s390x by running
      mallocstress testcase from LTP, which triggers:
      
        kernel BUG at mm/rmap.c:1088!
          Call Trace:
           ([<000003d100c57000>] 0x3d100c57000)
            [<000000000023a1c0>] do_wp_page+0x2fc/0xa88
            [<000000000023baae>] handle_pte_fault+0x41a/0xac8
            [<000000000023d832>] handle_mm_fault+0x17a/0x268
            [<000000000060507a>] do_protection_exception+0x1e2/0x394
            [<0000000000603a04>] pgm_check_handler+0x138/0x13c
            [<000003fffcf1f07a>] 0x3fffcf1f07a
          Last Breaking-Event-Address:
            [<000000000024755e>] page_add_new_anon_rmap+0xc2/0x168
      
      Thanks to Jakub Jelinek for his insight on gcc and helping to
      track this down.
      Signed-off-by: NJan Stancek <jstancek@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6a9b7f6
  19. 04 3月, 2013 1 次提交
  20. 24 2月, 2013 5 次提交
    • M
      mm: accelerate mm_populate() treatment of THP pages · 240aadee
      Michel Lespinasse 提交于
      This change adds a follow_page_mask function which is equivalent to
      follow_page, but with an extra page_mask argument.
      
      follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
      a THP page, and to 0 in other cases.
      
      __get_user_pages() makes use of this in order to accelerate populating
      THP ranges - that is, when both the pages and vmas arrays are NULL, we
      don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
      we also avoid taking mm->page_table_lock that many times).
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      240aadee
    • M
      mm: use long type for page counts in mm_populate() and get_user_pages() · 28a35716
      Michel Lespinasse 提交于
      Use long type for page counts in mm_populate() so as to avoid integer
      overflow when running the following test code:
      
      int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
      }
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28a35716
    • S
      swap: add per-partition lock for swapfile · ec8acf20
      Shaohua Li 提交于
      swap_lock is heavily contended when I test swap to 3 fast SSD (even
      slightly slower than swap to 2 such SSD).  The main contention comes
      from swap_info_get().  This patch tries to fix the gap with adding a new
      per-partition lock.
      
      Global data like nr_swapfiles, total_swap_pages, least_priority and
      swap_list are still protected by swap_lock.
      
      nr_swap_pages is an atomic now, it can be changed without swap_lock.  In
      theory, it's possible get_swap_page() finds no swap pages but actually
      there are free swap pages.  But sounds not a big problem.
      
      Accessing partition specific data (like scan_swap_map and so on) is only
      protected by swap_info_struct.lock.
      
      Changing swap_info_struct.flags need hold swap_lock and
      swap_info_struct.lock, because scan_scan_map() will check it.  read the
      flags is ok with either the locks hold.
      
      If both swap_lock and swap_info_struct.lock must be hold, we always hold
      the former first to avoid deadlock.
      
      swap_entry_free() can change swap_list.  To delete that code, we add a
      new highest_priority_index.  Whenever get_swap_page() is called, we
      check it.  If it's valid, we use it.
      
      It's a pity get_swap_page() still holds swap_lock().  But in practice,
      swap_lock() isn't heavily contended in my test with this patch (or I can
      say there are other much more heavier bottlenecks like TLB flush).  And
      BTW, looks get_swap_page() doesn't really need the lock.  We never free
      swap_info[] and we check SWAP_WRITEOK flag.  The only risk without the
      lock is we could swapout to some low priority swap, but we can quickly
      recover after several rounds of swap, so sounds not a big deal to me.
      But I'd prefer to fix this if it's a real problem.
      
      "swap: make each swap partition have one address_space" improved the
      swapout speed from 1.7G/s to 2G/s.  This patch further improves the
      speed to 2.3G/s, so around 15% improvement.  It's a multi-process test,
      so TLB flush isn't the biggest bottleneck before the patches.
      
      [arnd@arndb.de: fix it for nommu]
      [hughd@google.com: add missing unlock]
      [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec8acf20
    • M
      mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool · 41badc15
      Michel Lespinasse 提交于
      do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
      multiple, however there was no equivalent code in mm_populate(), which
      caused issues.
      
      This could be fixed by introduced the same rounding in mm_populate(),
      however I think it's preferable to make do_mmap_pgoff() return populate
      as a size rather than as a boolean, so we don't have to duplicate the
      size rounding logic in mm_populate().
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Tested-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41badc15
    • M
      mm: introduce mm_populate() for populating new vmas · bebeb3d6
      Michel Lespinasse 提交于
      When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
      with MCL_FUTURE in effect), we want to populate the pages within the
      newly created vmas.  This may take a while as we may have to read pages
      from disk, so ideally we want to do this outside of the write-locked
      mmap_sem region.
      
      This change introduces mm_populate(), which is used to defer populating
      such mappings until after the mmap_sem write lock has been released.
      This is implemented as a generalization of the former do_mlock_pages(),
      which accomplished the same task but was using during mlock() /
      mlockall().
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Reported-by: NAndy Lutomirski <luto@amacapital.net>
      Acked-by: NRik van Riel <riel@redhat.com>
      Tested-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bebeb3d6
  21. 23 2月, 2013 1 次提交
  22. 08 2月, 2013 1 次提交
  23. 16 11月, 2012 1 次提交
  24. 09 10月, 2012 3 次提交
    • M
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse 提交于
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
    • K
      mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas · e9714acf
      Konstantin Khlebnikov 提交于
      Currently the kernel sets mm->exe_file during sys_execve() and then tracks
      number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
      as this counter drops to zero kernel resets mm->exe_file to NULL.  Plus it
      resets mm->exe_file at last mmput() when mm->mm_users drops to zero.
      
      VMA with VM_EXECUTABLE flag appears after mapping file with flag
      MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
      splitting, because sys_mmap ignores this flag.  Usually binfmt module sets
      mm->exe_file and mmaps executable vmas with this file, they hold
      mm->exe_file while task is running.
      
      comment from v2.6.25-6245-g925d1c40 ("procfs task exe symlink"),
      where all this stuff was introduced:
      
      > The kernel implements readlink of /proc/pid/exe by getting the file from
      > the first executable VMA.  Then the path to the file is reconstructed and
      > reported as the result.
      >
      > Because of the VMA walk the code is slightly different on nommu systems.
      > This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      > walking the VMAs to find the first executable file-backed VMA we store a
      > reference to the exec'd file in the mm_struct.
      >
      > That reference would prevent the filesystem holding the executable file
      > from being unmounted even after unmapping the VMAs.  So we track the number
      > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      > unmapped.  This avoids pinning the mounted filesystem.
      
      exe_file's vma accounting is hooked into every file mmap/unmmap and vma
      split/merge just to fix some hypothetical pinning fs from umounting by mm,
      which already unmapped all its executable files, but still alive.
      
      Seems like currently nobody depends on this behaviour.  We can try to
      remove this logic and keep mm->exe_file until final mmput().
      
      mm->exe_file is still protected with mm->mmap_sem, because we want to
      change it via new sys_prctl(PR_SET_MM_EXE_FILE).  Also via this syscall
      task can change its mm->exe_file and unpin mountpoint explicitly.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9714acf