1. 06 Nov 2015 - 4 commits
    • mm: mlock: add new mlock system call · a8ca5d0e
      By Eric B Munson
      With the refactored mlock code, introduce a new system call for mlock.
      The new call will allow the user to specify what lock states are being
      added.  mlock2 is trivial at the moment, but a follow-on patch will add a
      new mlock state, making it useful.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8ca5d0e
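
      The interface is easiest to see from user space.  A minimal sketch, assuming a
      kernel that provides mlock2 and the MLOCK_ONFAULT flag added later in this
      series (glibc did not yet provide a wrapper, so the raw syscall is used; the
      flag value mirrors the uapi header):

      #define _GNU_SOURCE
      #include <errno.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      #ifndef MLOCK_ONFAULT
      #define MLOCK_ONFAULT 0x01      /* from the follow-on patches; see uapi mman-common.h */
      #endif

      /* thin wrapper until a libc wrapper exists */
      static int my_mlock2(const void *addr, size_t len, int flags)
      {
      #ifdef __NR_mlock2
              return syscall(__NR_mlock2, addr, len, flags);
      #else
              errno = ENOSYS;
              return -1;
      #endif
      }

      int main(void)
      {
              size_t len = 64ul << 20;        /* 64 MB */
              void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

              if (buf == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }
              /* charged against RLIMIT_MEMLOCK now, but faulted in lazily */
              if (my_mlock2(buf, len, MLOCK_ONFAULT))
                      perror("mlock2");
              munlock(buf, len);
              munmap(buf, len);
              return 0;
      }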
    • mm: mlock: refactor mlock, munlock, and munlockall code · 1aab92ec
      By Eric B Munson
      mlock() allows a user to control page-out of program memory, but this
      comes at the cost of faulting in the entire mapping when it is allocated.
      For large mappings where the entire area is not necessary this is not
      ideal.  Instead of forcing all locked pages to be present when they are
      allocated, this set creates a middle ground.  Pages are marked to be
      placed on the unevictable LRU (locked) when they are first used, but they
      are not faulted in by the mlock call.
      
      This series introduces a new mlock() system call that takes a flags
      argument along with the start address and size.  This flags argument gives
      the caller the ability to request memory be locked in the traditional way,
      or to be locked after the page is faulted in.  A new MCL flag is added to
      mirror the lock-on-fault behavior from mlock() in mlockall().
      
      There are two main use cases that this set covers.  The first is the
      security-focused mlock case.  A buffer is needed that cannot be written
      to swap.  The maximum size is known, but on average the memory used is
      significantly less than this maximum.  With lock on fault, the buffer is
      guaranteed to never be paged out without consuming the maximum size every
      time such a buffer is created.
      
      The second use case is focused on performance.  Portions of a large file
      are needed and we want to keep the used portions in memory once accessed.
      This is the case for large graphical models where the path through the
      graph is not known until run time.  The entire graph is unlikely to be
      used in a given invocation, but once a node has been used it needs to stay
      resident for further processing.  Given these constraints we have a number
      of options.  We can potentially waste a large amount of memory by mlocking
      the entire region (this can also cause a significant stall at startup as
      the entire file is read in).  We can mlock every page as we access them
      without tracking if the page is already resident but this introduces large
      overhead for each access.  The third option is mapping the entire region
      with PROT_NONE and using a signal handler for SIGSEGV to
      mprotect(PROT_READ) and mlock() the needed page.  Doing this page at a
      time adds a significant performance penalty.  Batching can be used to
      mitigate this overhead, but in order to safely avoid trying to mprotect
      pages outside of the mapping, the boundaries of each mapping to be used in
      this way must be tracked and available to the signal handler.  This is
      precisely what the mm system in the kernel should already be doing.
      
      For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if
      mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was used, i.e. when the VMA is
      created, not when the pages are faulted in.  For mlockall(MCL_ONFAULT) the
      user is charged as if MCL_FUTURE was used.  This decision was made to keep
      the accounting checks out of the page fault path.
      
      To illustrate the benefit of this set I wrote a test program that mmaps a
      5 GB file filled with random data and then makes 15,000,000 accesses to
      random addresses in that mapping.  The test program was run 20 times for
      each setup.  Results are reported for two program portions, setup and
      execution.  The setup phase is calling mmap and optionally mlock on the
      entire region.  For most experiments this is trivial, but it highlights
      the cost of faulting in the entire region.  Results are averages across
      the 20 runs in milliseconds.
      
      mmap with mlock(MLOCK_LOCKED) on entire range:
      Setup avg:      8228.666
      Processing avg: 8274.257
      
      mmap with mlock(MLOCK_LOCKED) before each access:
      Setup avg:      0.113
      Processing avg: 90993.552
      
      mmap with PROT_NONE and signal handler and batch size of 1 page:
      With the default value in max_map_count, this gets ENOMEM as I attempt
      to change the permissions, after upping the sysctl significantly I get:
      Setup avg:      0.058
      Processing avg: 69488.073
      mmap with PROT_NONE and signal handler and batch size of 8 pages:
      Setup avg:      0.068
      Processing avg: 38204.116
      
      mmap with PROT_NONE and signal handler and batch size of 16 pages:
      Setup avg:      0.044
      Processing avg: 29671.180
      
      mmap with mlock(MLOCK_ONFAULT) on entire range:
      Setup avg:      0.189
      Processing avg: 17904.899
      
      The signal handler in the batch cases faulted in memory in two steps to
      avoid having to know the start and end of the faulting mapping.  The first
      step covers the page that caused the fault as we know that it will be
      possible to lock.  The second step speculatively tries to mlock and
      mprotect the batch size - 1 pages that follow.  There may be a clever way
      to avoid this without having the program track each mapping to be covered
      by this handler in a globally accessible structure, but I could not find
      it.  It should be noted that with a large enough batch size this two step
      fault handler can still cause the program to crash if it reaches far
      beyond the end of the mapping.
      
      These results show that if the developer knows that a majority of the
      mapping will be used, it is better to try and fault it in at once,
      otherwise mlock(MLOCK_ONFAULT) is significantly faster.
      
      The performance cost of these patches is minimal on the two benchmarks I
      have tested (stream and kernbench).  The following are the average values
      across 20 runs of stream and 10 runs of kernbench after a warmup run whose
      results were discarded.
      
      Avg throughput in MB/s from stream using 1000000 element arrays
      Test     4.2-rc1      4.2-rc1+lock-on-fault
      Copy:    10,566.5     10,421
      Scale:   10,685       10,503.5
      Add:     12,044.1     11,814.2
      Triad:   12,064.8     11,846.3
      
      Kernbench optimal load
                       4.2-rc1  4.2-rc1+lock-on-fault
      Elapsed Time     78.453   78.991
      User Time        64.2395  65.2355
      System Time      9.7335   9.7085
      Context Switches 22211.5  22412.1
      Sleeps           14965.3  14956.1
      
      This patch (of 6):
      
      Extending the mlock system call is very difficult because it currently
      does not take a flags argument.  A later patch in this set will extend
      mlock to support a middle ground between pages that are locked and faulted
      in immediately and unlocked pages.  To pave the way for the new system
      call, the code needs some reorganization so that all the actual entry
      points do is check input and translate it to VMA flags.
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1aab92ec
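
      For comparison, a user-space sketch of the PROT_NONE + SIGSEGV batching scheme
      that this series makes unnecessary.  It is illustrative only: anonymous memory
      stands in for the file mapping, the batch size and bounds handling are
      simplified, and the syscalls in the handler are assumed safe in this context.

      #define _GNU_SOURCE
      #include <signal.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define BATCH_PAGES 8

      static char *map_start;              /* bounds of the one mapping we manage */
      static size_t map_len;
      static long page_size;

      static void fault_handler(int sig, siginfo_t *si, void *uctx)
      {
              char *addr = (char *)((uintptr_t)si->si_addr & ~((uintptr_t)page_size - 1));
              size_t batch = (size_t)BATCH_PAGES * page_size;

              (void)sig; (void)uctx;
              if (addr < map_start || addr >= map_start + map_len)
                      _exit(1);                                /* genuine wild access */
              if (batch > (size_t)(map_start + map_len - addr))
                      batch = map_start + map_len - addr;      /* clamp to the mapping */

              /* make the batch accessible and pin it; the faulting access retries */
              mprotect(addr, batch, PROT_READ | PROT_WRITE);
              mlock(addr, batch);
      }

      int main(void)
      {
              struct sigaction sa;

              page_size = sysconf(_SC_PAGESIZE);
              memset(&sa, 0, sizeof(sa));
              sa.sa_sigaction = fault_handler;
              sa.sa_flags = SA_SIGINFO;
              sigaction(SIGSEGV, &sa, NULL);

              map_len = 1ul << 30;                             /* illustrative 1 GB */
              map_start = mmap(NULL, map_len, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (map_start == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }

              /* each first touch traps, then unprotects and mlocks one batch */
              map_start[123 * page_size] = 1;
              map_start[map_len - 1] = 2;

              munmap(map_start, map_len);
              return 0;
      }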
    • mm/mlock: use offset_in_page macro · 8fd9e488
      By Alexander Kuleshov
      linux/mm.h provides the offset_in_page() macro.  Let's use the already-defined
      macro instead of open-coding (addr & ~PAGE_MASK).
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fd9e488
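
      For reference, the macro (as defined in include/linux/mm.h at the time) and
      the kind of conversion the patch makes:

      #define offset_in_page(p)       ((unsigned long)(p) & ~PAGE_MASK)

      /* so a line such as */
      len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
      /* becomes */
      len = PAGE_ALIGN(len + offset_in_page(start));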
    • mm/mlock.c: reorganize mlockall() return values and remove goto-out label · 86d2adcc
      By Alexey Klimov
      In the mlockall() syscall wrapper, the code after the out label does nothing
      but return.  Remove the goto out statements and return the error values
      directly.

      Also, instead of rewriting the ret variable before every if-check, move the
      returns onto the error paths under each if-check.

      An objdump listing shows a reduction of a few assembly lines.  The object file
      size decreased from 220592 bytes to 220528 bytes for me (on aarch64).
      Signed-off-by: Alexey Klimov <klimov.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86d2adcc
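
      A sketch of the resulting shape of the wrapper (simplified from mm/mlock.c of
      that era; details such as the pagevec drain are elided):

      SYSCALL_DEFINE1(mlockall, int, flags)
      {
              unsigned long lock_limit;
              int ret;

              if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
                      return -EINVAL;         /* was: ret = -EINVAL; goto out; */

              if (!can_do_mlock())
                      return -EPERM;          /* was: ret = -EPERM; goto out; */

              lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

              ret = -ENOMEM;
              down_write(&current->mm->mmap_sem);
              if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) ||
                  capable(CAP_IPC_LOCK))
                      ret = do_mlockall(flags);
              up_write(&current->mm->mmap_sem);
              if (!ret && (flags & MCL_CURRENT))
                      mm_populate(0, TASK_SIZE);
              return ret;                     /* the out: label is gone */
      }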
  2. 05 Sep 2015 - 1 commit
  3. 15 Apr 2015 - 4 commits
  4. 13 Mar 2015 - 1 commit
    • mm: reorder can_do_mlock to fix audit denial · a5a6579d
      By Jeff Vander Stoep
      A userspace call to mmap(MAP_LOCKED) may result in the successful locking
      of memory while also producing a confusing audit log denial.  can_do_mlock
      checks capable and rlimit.  If either of these returns positive,
      can_do_mlock returns true.  The capable check leads to an LSM hook used by
      AppArmor and SELinux, which produce the audit denial.  Reordering so that
      rlimit is checked first eliminates the denial on success; a denial is only
      recorded when the lock actually fails because the capability is refused.
      Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
      Acked-by: Nick Kralevich <nnk@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Paul Cassella <cassella@cray.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5a6579d
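
      The reorder amounts to consulting the audit-free rlimit before the LSM-hooked
      capable() check; roughly:

      bool can_do_mlock(void)
      {
              /* a non-zero RLIMIT_MEMLOCK already permits some locking,
               * and checking it generates no audit traffic */
              if (rlimit(RLIMIT_MEMLOCK) != 0)
                      return true;

              /* only now fall back to the capability check, whose LSM hook
               * (SELinux, AppArmor) may log a denial */
              if (capable(CAP_IPC_LOCK))
                      return true;

              return false;
      }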
  5. 10 Oct 2014 - 2 commits
  6. 08 Sep 2014 - 1 commit
  7. 07 Aug 2014 - 1 commit
  8. 08 Apr 2014 - 1 commit
    • mm: try_to_unmap_cluster() should lock_page() before mlocking · 57e68e9c
      By Vlastimil Babka
      A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
      fuzzing with trinity.  The call site try_to_unmap_cluster() does not lock
      the pages other than its check_page parameter (which is already locked).
      
      The BUG_ON in mlock_vma_page() is not documented and its purpose is
      somewhat unclear, but apparently it serializes against page migration,
      which could otherwise fail to transfer the PG_mlocked flag.  This would
      not be fatal, as the page would be eventually encountered again, but
      NR_MLOCK accounting would become distorted nevertheless.  This patch adds
      a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
      effect.
      
      The call site try_to_unmap_cluster() is fixed so that for page !=
      check_page, trylock_page() is attempted (to avoid possible deadlocks as we
      already have check_page locked) and mlock_vma_page() is performed only
      upon success.  If the page lock cannot be obtained, the page is left
      without PG_mlocked, which is again not a problem in the whole unevictable
      memory design.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57e68e9c
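
      Inside try_to_unmap_cluster()'s pte loop, the fix follows the usual trylock
      pattern; a sketch (check_page is the caller-locked page, the surrounding
      variables are as in that loop):

              if (locked_vma) {
                      if (page == check_page) {
                              /* the caller already holds this page's lock */
                              mlock_vma_page(page);
                              ret = SWAP_MLOCK;
                      } else if (trylock_page(page)) {
                              /*
                               * If we can lock the page, perform the mlock;
                               * otherwise leave the page alone, it will be
                               * encountered (and mlocked) again later.
                               */
                              mlock_vma_page(page);
                              unlock_page(page);
                      }
                      continue;       /* don't unmap a page in a mlocked VMA */
              }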
  9. 24 Jan 2014 - 2 commits
    • mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      By Sasha Levin
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      Based on recent requests to add a small piece of code that dumps the page
      at various VM_BUG_ON sites, I've noticed that the page dump is quite useful
      to people debugging issues in mm.
      
      This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      309381fe
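
      A sketch of the new assertion (the exact dump_page() signature has changed
      over time) and of the kind of conversion done at the call sites:

      #ifdef CONFIG_DEBUG_VM
      #define VM_BUG_ON_PAGE(cond, page)                                      \
              do {                                                            \
                      if (unlikely(cond)) {                                   \
                              dump_page(page);   /* flags, mapping, counts */ \
                              BUG();                                          \
                      }                                                       \
              } while (0)
      #else
      #define VM_BUG_ON_PAGE(cond, page)      VM_BUG_ON(cond)
      #endif

      /* call sites change from, e.g. */
              VM_BUG_ON(PageTail(page));
      /* to */
              VM_BUG_ON_PAGE(PageTail(page), page);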
    • mm: munlock: fix potential race with THP page split · 01cc2e58
      By Vlastimil Babka
      Since commit ff6a6da6 ("mm: accelerate munlock() treatment of THP
      pages") munlock skips tail pages of a munlocked THP page.  There is some
      attempt to prevent bad consequences of racing with a THP page split, but
      code inspection indicates that there are two problems that may lead to a
      non-fatal, yet wrong outcome.
      
      First, __split_huge_page_refcount() copies flags including PageMlocked
      from the head page to the tail pages.  Clearing PageMlocked by
      munlock_vma_page() in the middle of this operation might result in part
      of tail pages left with PageMlocked flag.  As the head page still
      appears to be a THP page until all tail pages are processed,
      munlock_vma_page() might think it munlocked the whole THP page and skip
      all the former tail pages.  Before ff6a6da6, those pages would be
      cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
      would still become undercounted (related to the next point).
      
      Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after
      the PageMlocked is cleared.  The accounting might also become
      inconsistent due to race with __split_huge_page_refcount()
      
      - undercount when HUGE_PMD_NR is subtracted, but some tail pages are
        left with PageMlocked set and counted again (only possible before
        ff6a6da6)
      
      - overcount when hpage_nr_pages() sees a normal page (split has already
        finished), but the parallel split has meanwhile cleared PageMlocked from
        additional tail pages
      
      This patch prevents both problems by extending the scope of lru_lock in
      munlock_vma_page().  This is convenient because:
      
      - __split_huge_page_refcount() takes lru_lock for its whole operation
      
      - munlock_vma_page() typically takes lru_lock anyway for page isolation
      
      As this becomes a second function where page isolation is done with
      lru_lock already held, factor this out to a new
      __munlock_isolate_lru_page() function and clean up the code around.
      
      [akpm@linux-foundation.org: avoid a coding-style ugly]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01cc2e58
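
      A sketch of munlock_vma_page() after this patch: the PageMlocked clear, the
      hpage_nr_pages() read and the LRU isolation now all happen under one lru_lock,
      so a parallel __split_huge_page_refcount() cannot slip in between.

      unsigned int munlock_vma_page(struct page *page)
      {
              unsigned int nr_pages;
              struct zone *zone = page_zone(page);

              /* serializes against page migration transferring PG_mlocked */
              BUG_ON(!PageLocked(page));

              spin_lock_irq(&zone->lru_lock);

              nr_pages = hpage_nr_pages(page);        /* stable under lru_lock */
              if (!TestClearPageMlocked(page))
                      goto unlock_out;

              __mod_zone_page_state(zone, NR_MLOCK, -(int)nr_pages);

              if (__munlock_isolate_lru_page(page, true)) {
                      spin_unlock_irq(&zone->lru_lock);
                      __munlock_isolated_page(page);
                      goto out;
              }
              __munlock_isolation_failed(page);

      unlock_out:
              spin_unlock_irq(&zone->lru_lock);

      out:
              return nr_pages - 1;
      }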
  10. 22 Jan 2014 - 1 commit
  11. 03 Jan 2014 - 2 commits
    • mm: munlock: fix deadlock in __munlock_pagevec() · 3b25df93
      By Vlastimil Babka
      Commit 7225522b ("mm: munlock: batch non-THP page isolation and
      munlock+putback using pagevec") introduced __munlock_pagevec() to speed
      up munlock by holding lru_lock over multiple isolated pages.  Pages that
      fail to be isolated are put_page()d immediately, also within the lock.
      
      This can lead to deadlock when __munlock_pagevec() becomes the holder of
      the last page pin and put_page() leads to __page_cache_release() which
      also locks lru_lock.  The deadlock has been observed by Sasha Levin
      using trinity.
      
      This patch avoids the deadlock by deferring put_page() operations until
      lru_lock is released.  Another pagevec (which is also used by later
      phases of the function) is reused to gather the pages for the put_page()
      operation.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b25df93
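
      The shape of the fix inside __munlock_pagevec(), sketched: pins of pages that
      will not be munlocked are collected in a second pagevec and only dropped after
      lru_lock is released.  (Isolation is shown via the helper factored out in
      01cc2e58 above; the code of this era open-coded it.)

              struct pagevec pvec_putback;
              int i, delta_munlocked = 0;
              int nr = pagevec_count(pvec);

              pagevec_init(&pvec_putback, 0);

              spin_lock_irq(&zone->lru_lock);
              for (i = 0; i < nr; i++) {
                      struct page *page = pvec->pages[i];

                      if (TestClearPageMlocked(page) &&
                          __munlock_isolate_lru_page(page, false)) {
                              delta_munlocked--;
                              continue;               /* munlocked in phase 2 */
                      }
                      /*
                       * This page keeps only its follow_page_mask() pin.  We must
                       * not drop that pin under lru_lock: if it is the last one,
                       * __page_cache_release() would take lru_lock again.
                       */
                      pagevec_add(&pvec_putback, page);
                      pvec->pages[i] = NULL;
              }
              __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
              spin_unlock_irq(&zone->lru_lock);

              /* now it is safe to drop those pins */
              pagevec_release(&pvec_putback);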
    • mm: munlock: fix a bug where THP tail page is encountered · c424be1c
      By Vlastimil Babka
      Since commit ff6a6da6 ("mm: accelerate munlock() treatment of THP
      pages") munlock skips tail pages of a munlocked THP page.  However, when
      the head page already has PageMlocked unset, it will not skip the tail
      pages.
      
      Commit 7225522b ("mm: munlock: batch non-THP page isolation and
      munlock+putback using pagevec") has added a PageTransHuge() check which
      contains VM_BUG_ON(PageTail(page)).  Sasha Levin found this triggered
      using trinity, on the first tail page of a THP page without PageMlocked
      flag.
      
      This patch fixes the issue by skipping tail pages also in the case when
      PageMlocked flag is unset.  There is still a possibility of race with
      THP page split between clearing PageMlocked and determining how many
      pages to skip.  The race might result in former tail pages not being
      skipped, which is however no longer a bug, as during the skip the
      PageTail flags are cleared.
      
      However this race also affects correctness of NR_MLOCK accounting, which
      is to be fixed in a separate patch.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c424be1c
  12. 01 Oct 2013 - 1 commit
    • mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration · eadb41ae
      By Vlastimil Babka
      The function __munlock_pagevec_fill() introduced in commit 7a8010cd
      ("mm: munlock: manual pte walk in fast path instead of
      follow_page_mask()") uses pmd_addr_end() for restricting its operation
      within current page table.
      
      This is insufficient on architectures/configurations where pmd is folded
      and pmd_addr_end() just returns the end of the full range to be walked.
      In this case, it allows pte++ to walk off the end of a page table
      resulting in unpredictable behaviour.
      
      This patch fixes the function by using pgd_addr_end() and pud_addr_end()
      before pmd_addr_end(), which will yield correct page table boundary on
      all configurations.  This is similar to what existing page walkers do
      when walking each level of the page table.
      
      Additionally, the patch clarifies a comment for the get_locked_pte() call in the
      function.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Jörn Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eadb41ae
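
      The fix simply clamps the walk boundary at every level, so a folded pmd (or
      pud) cannot report a boundary beyond the current page table; sketched together
      with the clarified comment:

              /*
               * Initialize pte walk starting at the already pinned page where we
               * are sure that there is a pte, as it was pinned under the same
               * mmap_sem write op.
               */
              pte = get_locked_pte(vma->vm_mm, start, &ptl);
              /* Make sure we do not cross the page table boundary */
              end = pgd_addr_end(start, end);
              end = pud_addr_end(start, end);
              end = pmd_addr_end(start, end);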
  13. 25 Sep 2013 - 1 commit
    • mm: Place preemption point in do_mlockall() loop · 22356f44
      By Paul E. McKenney
      There is a loop in do_mlockall() that lacks a preemption point, which
      means that the following can happen on non-preemptible builds of the
      kernel. Dave Jones reports:
      
       "My fuzz tester keeps hitting this.  Every instance shows the non-irq
        stack came in from mlockall.  I'm only seeing this on one box, but
        that has more ram (8gb) than my other machines, which might explain
        it.
      
          INFO: rcu_preempt self-detected stall on CPU { 3}  (t=6500 jiffies g=470344 c=470343 q=0)
          sending NMI to all CPUs:
          NMI backtrace for cpu 3
          CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32
          Call Trace:
            lru_add_drain_all+0x15/0x20
            SyS_mlockall+0xa5/0x1a0
            tracesys+0xdd/0xe2"
      
      This commit addresses this problem by inserting the required preemption
      point.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22356f44
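
      A sketch of do_mlockall() with the added preemption point:

      static int do_mlockall(int flags)
      {
              struct vm_area_struct *vma, *prev = NULL;

              if (flags & MCL_FUTURE)
                      current->mm->def_flags |= VM_LOCKED;
              else
                      current->mm->def_flags &= ~VM_LOCKED;
              if (flags == MCL_FUTURE)
                      goto out;

              for (vma = current->mm->mmap; vma; vma = prev->vm_next) {
                      vm_flags_t newflags;

                      newflags = vma->vm_flags & ~VM_LOCKED;
                      if (flags & MCL_CURRENT)
                              newflags |= VM_LOCKED;

                      /* Ignore errors */
                      mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
                      cond_resched();         /* the added preemption point */
              }
      out:
              return 0;
      }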
  14. 21 Sep 2013 - 1 commit
    • mm: Place preemption point in do_mlockall() loop · 5c889690
      By Paul E. McKenney
      There is a loop in do_mlockall() that lacks a preemption point, which
      means that the following can happen on non-preemptible builds of the
      kernel:
      
      > My fuzz tester keeps hitting this. Every instance shows the non-irq stack
      > came in from mlockall.  I'm only seeing this on one box, but that has more
      > ram (8gb) than my other machines, which might explain it.
      >
      > 	Dave
      >
      > INFO: rcu_preempt self-detected stall on CPU { 3}  (t=6500 jiffies g=470344 c=470343 q=0)
      > sending NMI to all CPUs:
      > NMI backtrace for cpu 3
      > CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32
      > task: ffff88023e743fc0 ti: ffff88022f6f2000 task.ti: ffff88022f6f2000
      > RIP: 0010:[<ffffffff810bf7d1>]  [<ffffffff810bf7d1>] trace_hardirqs_off_caller+0x21/0xb0
      > RSP: 0018:ffff880244e03c30  EFLAGS: 00000046
      > RAX: ffff88023e743fc0 RBX: 0000000000000001 RCX: 000000000000003c
      > RDX: 000000000000000f RSI: 0000000000000004 RDI: ffffffff81033cab
      > RBP: ffff880244e03c38 R08: ffff880243288a80 R09: 0000000000000001
      > R10: 0000000000000000 R11: 0000000000000001 R12: ffff880243288a80
      > R13: ffff8802437eda40 R14: 0000000000080000 R15: 000000000000d010
      > FS:  00007f50ae33b740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 000000000097f000 CR3: 0000000240fa0000 CR4: 00000000001407e0
      > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      > Stack:
      >  ffffffff810bf86d ffff880244e03c98 ffffffff81033cab 0000000000000096
      >  000000000000d008 0000000300000002 0000000000000004 0000000000000003
      >  0000000000002710 ffffffff81c50d00 ffffffff81c50d00 ffff880244fcde00
      > Call Trace:
      >  <IRQ>
      >  [<ffffffff810bf86d>] ? trace_hardirqs_off+0xd/0x10
      >  [<ffffffff81033cab>] __x2apic_send_IPI_mask+0x1ab/0x1c0
      >  [<ffffffff81033cdc>] x2apic_send_IPI_all+0x1c/0x20
      >  [<ffffffff81030115>] arch_trigger_all_cpu_backtrace+0x65/0xa0
      >  [<ffffffff811144b1>] rcu_check_callbacks+0x331/0x8e0
      >  [<ffffffff8108bfa0>] ? hrtimer_run_queues+0x20/0x180
      >  [<ffffffff8109e905>] ? sched_clock_cpu+0xb5/0x100
      >  [<ffffffff81069557>] update_process_times+0x47/0x80
      >  [<ffffffff810bd115>] tick_sched_handle.isra.16+0x25/0x60
      >  [<ffffffff810bd231>] tick_sched_timer+0x41/0x60
      >  [<ffffffff8108ace1>] __run_hrtimer+0x81/0x4e0
      >  [<ffffffff810bd1f0>] ? tick_sched_do_timer+0x60/0x60
      >  [<ffffffff8108b93f>] hrtimer_interrupt+0xff/0x240
      >  [<ffffffff8102de84>] local_apic_timer_interrupt+0x34/0x60
      >  [<ffffffff81718c5f>] smp_apic_timer_interrupt+0x3f/0x60
      >  [<ffffffff817178ef>] apic_timer_interrupt+0x6f/0x80
      >  [<ffffffff8170e8e0>] ? retint_restore_args+0xe/0xe
      >  [<ffffffff8105f101>] ? __do_softirq+0xb1/0x440
      >  [<ffffffff8105f64d>] irq_exit+0xcd/0xe0
      >  [<ffffffff81718c65>] smp_apic_timer_interrupt+0x45/0x60
      >  [<ffffffff817178ef>] apic_timer_interrupt+0x6f/0x80
      >  <EOI>
      >  [<ffffffff8170e8e0>] ? retint_restore_args+0xe/0xe
      >  [<ffffffff8170b830>] ? wait_for_completion_killable+0x170/0x170
      >  [<ffffffff8170c853>] ? preempt_schedule_irq+0x53/0x90
      >  [<ffffffff8170e9f6>] retint_kernel+0x26/0x30
      >  [<ffffffff8107a523>] ? queue_work_on+0x43/0x90
      >  [<ffffffff8107c369>] schedule_on_each_cpu+0xc9/0x1a0
      >  [<ffffffff81167770>] ? lru_add_drain+0x50/0x50
      >  [<ffffffff811677c5>] lru_add_drain_all+0x15/0x20
      >  [<ffffffff81186965>] SyS_mlockall+0xa5/0x1a0
      >  [<ffffffff81716e94>] tracesys+0xdd/0xe2
      
      This commit addresses this problem by inserting the required preemption
      point.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      5c889690
  15. 12 Sep 2013 - 6 commits
    • mm: munlock: manual pte walk in fast path instead of follow_page_mask() · 7a8010cd
      By Vlastimil Babka
      Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
      each individual struct page.  This entails repeated full page table
      translations and page table lock taken for each page separately.
      
      This patch avoids the costly follow_page_mask() where possible, by
      iterating over ptes within single pmd under single page table lock.  The
      first pte is obtained by get_locked_pte() for non-THP page acquired by the
      initial follow_page_mask().  The rest of the on-stack pagevec for munlock
      is filled up using pte_walk as long as pte_present() and vm_normal_page()
      are sufficient to obtain the struct page.
      
      After this patch, a 14% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jörn Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a8010cd
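
      A rough sketch of the new fill helper: starting from a page that
      follow_page_mask() has already pinned, it walks the following ptes under a
      single page table lock and bails out to the slow path for anything it cannot
      handle cheaply.  The bail-out conditions are simplified here, and a later fix
      (eadb41ae above) also clamps the boundary at the pgd/pud level.

      static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
                      struct vm_area_struct *vma, unsigned long start,
                      unsigned long end)
      {
              pte_t *pte;
              spinlock_t *ptl;

              /* the initial page is pinned and mapped, so a pte must exist */
              pte = get_locked_pte(vma->vm_mm, start, &ptl);
              /* do not walk past the current page table */
              end = pmd_addr_end(start, end);

              /* the first page itself was already handled by the caller */
              start += PAGE_SIZE;
              while (start < end) {
                      struct page *page = NULL;

                      pte++;
                      if (pte_present(*pte))
                              page = vm_normal_page(vma, start, *pte);
                      /* anything unusual goes back to follow_page_mask() */
                      if (!page || PageTransCompound(page))
                              break;

                      get_page(page);
                      start += PAGE_SIZE;     /* this page now belongs to the batch */
                      if (pagevec_add(pvec, page) == 0)
                              break;          /* pagevec is full */
              }
              pte_unmap_unlock(pte, ptl);
              return start;                   /* caller continues from here */
      }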
    • mm: munlock: remove redundant get_page/put_page pair on the fast path · 5b40998a
      By Vlastimil Babka
      The performance of the fast path in munlock_vma_range() can be further
      improved by avoiding atomic ops of a redundant get_page()/put_page() pair.
      
      When calling get_page() during page isolation, we already have the pin
      from follow_page_mask().  This pin will be then returned by
      __pagevec_lru_add(), after which we do not reference the pages anymore.
      
      After this patch, an 8% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b40998a
    • mm: munlock: bypass per-cpu pvec for putback_lru_page · 56afe477
      By Vlastimil Babka
      After introducing batching by pagevecs into munlock_vma_range(), we can
      further improve performance by bypassing the copying into per-cpu pagevec
      and the get_page/put_page pair associated with that.  Instead we perform
      LRU putback directly from our pagevec.  However, this is possible only for
      single-mapped pages that are evictable after munlock.  Unevictable pages
      require rechecking after putting on the unevictable list, so for those we
      fall back to putback_lru_page(), which handles that.
      
      After this patch, a 13% speedup was measured for munlocking a 56GB large
      memory area with THP disabled.
      
      [akpm@linux-foundation.org:clarify comment]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56afe477
    • mm: munlock: batch NR_MLOCK zone state updates · 1ebb7cc6
      By Vlastimil Babka
      Depending on the previous patch, which introduced batched isolation in
      munlock_vma_range(), we can also batch the updates of the NR_MLOCK page
      stats.  After the whole pagevec is processed for page isolation, the stats
      are updated only once with the number of successful isolations.  There were,
      however, no measurable performance gains.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ebb7cc6
    • mm: munlock: batch non-THP page isolation and munlock+putback using pagevec · 7225522b
      By Vlastimil Babka
      Currently, munlock_vma_range() calls munlock_vma_page on each page in a
      loop, which results in repeated taking and releasing of the lru_lock
      spinlock for isolating pages one by one.  This patch batches the munlock
      operations using an on-stack pagevec, so that isolation is done under
      single lru_lock.  For THP pages, the old behavior is preserved as they
      might be split while putting them into the pagevec.  After this patch, a
      9% speedup was measured for munlocking a 56GB large memory area with THP
      disabled.
      
      A new function __munlock_pagevec() is introduced that takes a pagevec and:

      1) clears PageMlocked and isolates all pages under lru_lock.  Zone page
         stats can also be updated using the variant which assumes disabled
         interrupts.

      2) finishes the munlock and lru putback on all pages under their lock_page.
         Note that previously, lock_page also covered the PageMlocked clearing and
         page isolation, but it is not needed for those operations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7225522b
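
      A condensed sketch of the second phase of __munlock_pagevec() (a code
      fragment; declarations elided).  Phase 1, the batched clearing and isolation
      under one lru_lock, is what the later fixes above refine.

              /* Phase 2: munlock and LRU putback, under each page's lock_page only */
              for (i = 0; i < pagevec_count(pvec); i++) {
                      struct page *page = pvec->pages[i];

                      if (!page)              /* dropped from the batch in phase 1 */
                              continue;

                      lock_page(page);
                      /*
                       * Helper factored out by this patch: re-mlocks the page if
                       * another VMA still needs it locked, otherwise puts it back
                       * on the appropriate LRU list.
                       */
                      __munlock_isolated_page(page);
                      unlock_page(page);
                      put_page(page);         /* drop the follow_page_mask() pin */
              }
              pagevec_reinit(pvec);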
    • mm: munlock: remove unnecessary call to lru_add_drain() · 586a32ac
      By Vlastimil Babka
      In munlock_vma_range(), lru_add_drain() is currently called in a loop
      before each munlock_vma_page() call.
      
      This is suboptimal for performance when munlocking many pages.  The
      benefits of per-cpu pagevec for batching the LRU putback are removed since
      the pagevec only holds at most one page from the previous loop's
      iteration.
      
      The lru_add_drain() call also does not serve any purpose for correctness
      - it does not even drain the pagevecs of all CPUs.  The munlock code already
      expects and handles situations where a page cannot be isolated from the
      LRU (e.g.  because it is on some per-cpu pagevec).
      
      The history of the (not commented) call also suggests that it appears there
      as an oversight rather than intentionally.  Before commit ff6a6da6 ("mm:
      accelerate munlock() treatment of THP pages") the call happened only once
      upon entering the function.  That commit moved the call into the while
      loop.  So while the other changes in the commit improved munlock
      performance for THP pages, it introduced the abovementioned suboptimal
      per-cpu pagevec usage.
      
      Further in history, before commit 408e82b7 ("mm: munlock use
      follow_page"), munlock_vma_pages_range() was just a wrapper around
      __mlock_vma_pages_range which performed both mlock and munlock depending
      on a flag.  However, before ba470de4 ("mmap: handle mlocked pages during
      map, remap, unmap") the function handled only mlock, not munlock.  The
      lru_add_drain call thus comes from the implementation in commit b291f000
      ("mlock: mlocked pages are unevictable" and was intended only for
      mlocking, not munlocking.  The original intention of draining the LRU
      pagevec at mlock time was to ensure the pages were on the LRU before the
      lock operation so that they could be placed on the unevictable list
      immediately.  There is very little motivation to do the same in the
      munlock path, particularly for every single page.
      
      This patch therefore removes the call completely.  After removing the
      call, a 10% speedup was measured for munlock() of a 56GB large memory area
      with THP disabled.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Jörn Engel <joern@logfs.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      586a32ac
  16. 29 Mar 2013 - 1 commit
  17. 28 Feb 2013 - 1 commit
  18. 24 Feb 2013 - 5 commits
  19. 13 Feb 2013 - 1 commit
  20. 09 Oct 2012 - 3 commits
    • mm, thp: fix mlock statistics · 8449d21f
      By David Rientjes
      NR_MLOCK is only accounted in single page units: there's no logic to
      handle transparent hugepages.  This patch checks the appropriate number of
      pages to adjust the statistics by so that the correct amount of memory is
      reflected.
      
      Currently:
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           19636 kB
      
      	#define MAP_SIZE	(4 << 30)	/* 4GB */
      
      	void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	mlock(ptr, MAP_SIZE);
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           29844 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           19636 kB
      
      And with this patch:
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:           19636 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:         4213664 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:           19636 kB
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Hugh Dickens <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8449d21f
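
      The fix boils down to scaling the NR_MLOCK update by the compound page size;
      roughly, for the mlock side:

      void mlock_vma_page(struct page *page)
      {
              BUG_ON(!PageLocked(page));

              if (!TestSetPageMlocked(page)) {
                      /* was: mod_zone_page_state(page_zone(page), NR_MLOCK, 1); */
                      mod_zone_page_state(page_zone(page), NR_MLOCK,
                                          hpage_nr_pages(page));
                      count_vm_event(UNEVICTABLE_PGMLOCKED);
                      if (!isolate_lru_page(page))
                              putback_lru_page(page);
              }
      }

      The munlock side is scaled the same way via hpage_nr_pages().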
    • mm: use clear_page_mlock() in page_remove_rmap() · e6c509f8
      By Hugh Dickins
      We had thought that pages could no longer get freed while still marked as
      mlocked; but Johannes Weiner posted this program to demonstrate that
      truncating an mlocked private file mapping containing COWed pages is still
      mishandled:
      
      #include <sys/types.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <stdio.h>
      
      int main(void)
      {
      	char *map;
      	int fd;
      
      	system("grep mlockfreed /proc/vmstat");
      	fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR, 0600);
      	unlink("chigurh");
      	ftruncate(fd, 4096);
      	map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
      	map[0] = 11;
      	mlock(map, sizeof(fd));
      	ftruncate(fd, 0);
      	close(fd);
      	munlock(map, sizeof(fd));
      	munmap(map, 4096);
      	system("grep mlockfreed /proc/vmstat");
      	return 0;
      }
      
      The anon COWed pages are not caught by truncation's clear_page_mlock() of
      the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
      look out for them there in page_remove_rmap().  Indeed, why should
      truncation or invalidation be doing the clear_page_mlock() when removing
      from pagecache?  mlock is a property of mapping in userspace, not a
      property of pagecache: an mlocked unmapped page is nonsensical.
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6c509f8
    • mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      By Konstantin Khlebnikov
      A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA;
      it has since lost its original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes the reserved_vm counter from mm_struct.  Seems like nobody
      cares about it; it is not exported to userspace directly, it only
      reduces the total_vm shown in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or the pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      314e51b9
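
      For driver code the conversion is mechanical; a sketch of a typical mmap
      handler (the handler name and the pfn computation are illustrative only):

      static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
      {
              unsigned long pfn = 0;  /* computed from the device resource, elided */

              /* was: vma->vm_flags |= VM_RESERVED; */
              vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;

              /* remap_pfn_range() now sets these flags itself, so the explicit
               * assignment above only matters for handlers that do not call it */
              return remap_pfn_range(vma, vma->vm_start, pfn,
                                     vma->vm_end - vma->vm_start, vma->vm_page_prot);
      }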