1. 14 Feb, 2013 1 commit
    • s390/mm: implement software dirty bits · abf09bed
      Martin Schwidefsky authored
      The s390 architecture is unique in respect to dirty page detection,
      it uses the change bit in the per-page storage key to track page
      modifications. All other architectures track dirty bits by means
      of page table entries. This property of s390 has caused numerous
      problems in the past, e.g. see git commit ef5d437f
      "mm: fix XFS oops due to dirty pages without buffers on s390".
      
      To avoid future issues with per-page dirty bits, convert
      s390 to a fault-based software dirty bit detection mechanism. All
      user page table entries which are marked as clean will be hardware
      read-only, even if the pte is supposed to be writable. A write by
      the user process will trigger a protection fault, which will cause
      the user pte to be marked as dirty and the hardware read-only bit
      to be removed.
      
      With this change the dirty bit in the storage key is irrelevant
      for Linux as a host, but the storage key is still required for
      KVM guests. The effect is that page_test_and_clear_dirty and the
      related code can be removed. The referenced bit in the storage
      key is still used by the page_test_and_clear_young primitive to
      provide page age information.
      
      For page cache pages of mappings with mapping_cap_account_dirty
      there will not be any change in behavior as the dirty bit tracking
      already uses read-only ptes to control the amount of dirty pages.
      Only for swap cache pages and pages of mappings without
      mapping_cap_account_dirty there can be additional protection faults.
      To avoid an excessive number of additional faults the mk_pte
      primitive checks for PageDirty if the pgprot value allows for writes
      and pre-dirties the pte. That avoids all additional faults for
      tmpfs and shmem pages until these pages are added to the swap cache.
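
      The fault-based scheme described above can be modelled in a few lines of
      userspace C. This is only a sketch: struct soft_pte, its fields and the
      helper names are hypothetical and do not mirror the real s390 pte layout
      or the kernel's functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified pte; not the real s390 layout. */
struct soft_pte {
    bool sw_write;  /* the mapping permits writes (software view) */
    bool sw_dirty;  /* software dirty bit */
    bool hw_ro;     /* hardware read-only bit */
};

/* Maintain the invariant: a clean pte is always hardware read-only. */
static void soft_pte_sync_hw(struct soft_pte *pte)
{
    pte->hw_ro = !(pte->sw_write && pte->sw_dirty);
}

/* Build a pte; pre-dirty it when writes are allowed and the page is
 * already dirty, which avoids the extra fault later. */
static struct soft_pte mk_soft_pte(bool writable, bool page_dirty)
{
    struct soft_pte pte = { .sw_write = writable, .sw_dirty = false, .hw_ro = true };

    if (writable && page_dirty)
        pte.sw_dirty = true;
    soft_pte_sync_hw(&pte);
    return pte;
}

/* Simulate a user write. Returns the number of protection faults taken
 * (0 or 1), or -1 if the mapping does not permit writes at all. */
static int soft_pte_write(struct soft_pte *pte)
{
    if (!pte->hw_ro)
        return 0;           /* already dirty and writable: no fault */
    if (!pte->sw_write)
        return -1;          /* genuine protection violation */
    pte->sw_dirty = true;   /* fault handler marks the pte dirty ... */
    soft_pte_sync_hw(pte);  /* ... and drops the hardware read-only bit */
    assert(!pte->hw_ro);
    return 1;
}
```

      In this model a clean writable pte takes exactly one fault on its first
      write, while a pte built for an already dirty page takes none.
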
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  2. 13 Dec, 2012 1 commit
  3. 12 Dec, 2012 2 commits
  4. 11 Dec, 2012 2 commits
    • mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable · 4fc3f1d6
      Ingo Molnar authored
      rmap_walk_anon() and try_to_unmap_anon() appear to be too
      careful about locking the anon vma: while it needs protection
      against anon vma list modifications, it does not need exclusive
      access to the list itself.
      
      Transforming this exclusive lock to a read-locked rwsem removes
      a global lock from the hot path of page-migration intense
      threaded workloads which can cause pathological performance like
      this:
      
          96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                        |
                        --- perf_trace_sched_switch
                            __schedule
                            schedule
                            schedule_preempt_disabled
                            __mutex_lock_common.isra.6
                            __mutex_lock_slowpath
                            mutex_lock
                           |
                           |--50.61%-- rmap_walk
                           |          move_to_new_page
                           |          migrate_pages
                           |          migrate_misplaced_page
                           |          __do_numa_page.isra.69
                           |          handle_pte_fault
                           |          handle_mm_fault
                           |          __do_page_fault
                           |          do_page_fault
                           |          page_fault
                           |          __memset_sse2
                           |          |
                           |           --100.00%-- worker_thread
                           |                     |
                           |                      --100.00%-- start_thread
                           |
                            --49.39%-- page_lock_anon_vma
                                      try_to_unmap_anon
                                      try_to_unmap
                                      migrate_pages
                                      migrate_misplaced_page
                                      __do_numa_page.isra.69
                                      handle_pte_fault
                                      handle_mm_fault
                                      __do_page_fault
                                      do_page_fault
                                      page_fault
                                      __memset_sse2
                                      |
                                       --100.00%-- worker_thread
                                                 start_thread
      
      With this change applied the profile is now nicely flat
      and there's no anon-vma related scheduling/blocking.
      
      Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
      to make it clearer that it's an exclusive write-lock in
      that case - suggested by Rik van Riel.
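
      The difference between the exclusive lock and the read-locked rwsem can
      be illustrated with a toy reader/writer lock. This is a sketch under
      stated assumptions: toy_rwsem and its helpers are hypothetical names
      built on C11 atomics, offer only try-lock operations, and are not the
      kernel's rw_semaphore implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock state: -1 = one writer, 0 = free, n > 0 = n readers. */
struct toy_rwsem {
    atomic_int state;
};

static void toy_rwsem_init(struct toy_rwsem *sem)
{
    atomic_init(&sem->state, 0);
}

/* Readers may share the lock as long as no writer holds it. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    int old = atomic_load(&sem->state);

    while (old >= 0) {
        /* On failure 'old' is reloaded and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&sem->state, &old, old + 1))
            return true;
    }
    return false;
}

static void toy_up_read(struct toy_rwsem *sem)
{
    int prev = atomic_fetch_sub(&sem->state, 1);

    assert(prev > 0);
}

/* A writer needs the lock completely free. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&sem->state, &expected, -1);
}

static void toy_up_write(struct toy_rwsem *sem)
{
    atomic_store(&sem->state, 0);
}
```

      Two walkers that only read the list can both take the lock for reading,
      whereas a list modification has to wait until all readers are gone.
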
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
    • mm/rmap: Convert the struct anon_vma::mutex to an rwsem · 5a505085
      Ingo Molnar authored
      Convert the struct anon_vma::mutex to an rwsem, which will help
      in solving a page-migration scalability problem. (Addressed in
      a separate patch.)
      
      The conversion is simple and straightforward: in every case
      where we mutex_lock()ed we'll now down_write().
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
  5. 26 Oct, 2012 1 commit
    • mm: fix XFS oops due to dirty pages without buffers on s390 · ef5d437f
      Jan Kara authored
      On s390 any write to a page (even from the kernel itself) sets the
      architecture-specific page dirty bit.  Thus when a page is written to via a
      buffered write, the HW dirty bit gets set, and when we later map and unmap the
      page, page_remove_rmap() finds the dirty bit and calls set_page_dirty().
      
      Dirtying of a page which shouldn't be dirty can cause all sorts of
      problems to filesystems.  The bug we observed in practice is that
      buffers from the page get freed, so when the page gets later marked as
      dirty and writeback writes it, XFS crashes due to an assertion
      BUG_ON(!PagePrivate(page)) in page_buffers() called from
      xfs_count_page_state().
      
      A similar problem can also happen when a zero_user_segment() call from
      xfs_vm_writepage() (or block_write_full_page() for that matter) sets the
      hardware dirty bit during writeback, the buffers later get freed, and then
      the page is unmapped.
      
      Fix the issue by ignoring s390 HW dirty bit for page cache pages of
      mappings with mapping_cap_account_dirty().  This is safe because for
      such mappings when a page gets marked as writeable in PTE it is also
      marked dirty in do_wp_page() or do_page_fault().  When the dirty bit is
      cleared by clear_page_dirty_for_io(), the page gets writeprotected in
      page_mkclean().  So pagecache page is writeable if and only if it is
      dirty.
      
      Thanks to Hugh Dickins for pointing out mapping has to have
      mapping_cap_account_dirty() for things to work and proposing a cleaned
      up variant of the patch.
      
      The patch has survived about two hours of running fsx-linux on tmpfs
      while heavily swapping and several days of running on our build machines
      where the original problem was triggered.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@vger.kernel.org>		[3.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 09 Oct, 2012 7 commits
    • mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
      Sagi Grimberg authored
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling invalidate_page notification after releasing the lock
      (but before freeing the page itself), or by wrapping the page invalidation
      with calls to invalidate_range_begin and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_begin, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards on-demand paging design to be
      added to the RDMA stack.
      
      Why do we want on-demand paging for Infiniband?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
        This will allow users the same flexibility they get when swapping any
        other part of their process's address space.  Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
        which pages need to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address space as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with the secondary page tables change.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
      [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
      Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use clear_page_mlock() in page_remove_rmap() · e6c509f8
      Hugh Dickins authored
      We had thought that pages could no longer get freed while still marked as
      mlocked; but Johannes Weiner posted this program to demonstrate that
      truncating an mlocked private file mapping containing COWed pages is still
      mishandled:
      
      #include <sys/types.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <stdio.h>
      
      int main(void)
      {
      	char *map;
      	int fd;
      
      	system("grep mlockfreed /proc/vmstat");
      	fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR, 0600);	/* O_CREAT needs a mode */
      	unlink("chigurh");
      	ftruncate(fd, 4096);
      	map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
      	map[0] = 11;
      	mlock(map, sizeof(fd));
      	ftruncate(fd, 0);
      	close(fd);
      	munlock(map, sizeof(fd));
      	munmap(map, 4096);
      	system("grep mlockfreed /proc/vmstat");
      	return 0;
      }
      
      The anon COWed pages are not caught by truncation's clear_page_mlock() of
      the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
      look out for them there in page_remove_rmap().  Indeed, why should
      truncation or invalidation be doing the clear_page_mlock() when removing
      from pagecache?  mlock is a property of mapping in userspace, not a
      property of pagecache: an mlocked unmapped page is nonsensical.
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove vma arg from page_evictable · 39b5f29a
      Hugh Dickins authored
      page_evictable(page, vma) is an irritant: almost all its callers pass
      NULL for vma.  Remove the vma arg and use mlocked_vma_newpage(vma, page)
      explicitly in the couple of places it's needed.  But in those places we
      don't even need page_evictable() itself!  They're dealing with a freshly
      allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm rmap: remove vma_address check for address inside vma · 86c2ad19
      Michel Lespinasse authored
      In file and anon rmap, we use interval trees to find potentially relevant
      vmas and then call vma_address() to find the virtual address the given
      page might be found at in these vmas.  vma_address() used to include a
      check that the returned address falls within the limits of the vma, but
      this check isn't necessary now that we always use interval trees in rmap:
      the interval tree just doesn't return any vmas which this check would find
      to be irrelevant.  As a result, we can replace the use of -EFAULT error
      code (which then needed to be checked in every call site) with a
      VM_BUG_ON().
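
      As a rough illustration of the address computation and the assertion
      that replaces the -EFAULT return, consider the userspace sketch below.
      struct demo_vma, DEMO_PAGE_SHIFT and demo_vma_address() are hypothetical
      stand-ins, and a plain assert() plays the role of VM_BUG_ON().

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12  /* assumed 4 KiB pages */

/* Hypothetical, cut-down vma: just the fields the computation needs. */
struct demo_vma {
    unsigned long vm_start;  /* first address of the mapping */
    unsigned long vm_end;    /* first address past the mapping */
    unsigned long vm_pgoff;  /* page offset of vm_start in the object */
};

/* Translate a page offset into the virtual address inside the vma.
 * The caller is trusted to pass only vmas that cover pgoff (as an
 * interval tree lookup would), so a miss is a bug, not an error code. */
static unsigned long demo_vma_address(const struct demo_vma *vma,
                                      unsigned long pgoff)
{
    unsigned long address =
        vma->vm_start + ((pgoff - vma->vm_pgoff) << DEMO_PAGE_SHIFT);

    /* Stands in for VM_BUG_ON(address < vm_start || address >= vm_end). */
    assert(address >= vma->vm_start && address < vma->vm_end);
    return address;
}
```

      For a vma covering 0x10000-0x14000 with vm_pgoff 5, page offset 7 maps
      to 0x10000 + 2 * 4096 = 0x12000.
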
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm anon rmap: replace same_anon_vma linked list with an interval tree. · bf181b9f
      Michel Lespinasse authored
      When a large VMA (anon or private file mapping) is first touched, which
      will populate its anon_vma field, and then split into many regions through
      the use of mprotect(), the original anon_vma ends up linking all of the
      vmas on a linked list.  This can cause rmap to become inefficient, as we
      have to walk potentially thousands of irrelevant vmas before finding the
      one a given anon page might fall into.
      
      By replacing the same_anon_vma linked list with an interval tree (where
      each avc's interval is determined by its vma's start and last pgoffs), we
      can make rmap efficient for this use case again.
      
      While the change is large, all of its pieces are fairly simple.
      
      Most places that were walking the same_anon_vma list were looking for a
      known pgoff, so they can just use the anon_vma_interval_tree_foreach()
      interval tree iterator instead.  The exception here is ksm, where the
      page's index is not known.  It would probably be possible to rework ksm so
      that the index would be known, but for now I have decided to keep things
      simple and just walk the entirety of the interval tree there.
      
      When updating vma's that already have an anon_vma assigned, we must take
      care to re-index the corresponding avc's on their interval tree.  This is
      done through the use of anon_vma_interval_tree_pre_update_vma() and
      anon_vma_interval_tree_post_update_vma(), which remove the avc's from
      their interval tree before the update and re-insert them after the update.
      The anon_vma stays locked during the update, so there is no chance that
      rmap would miss the vmas that are being updated.
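
      The interval attached to each avc, and the overlap test a lookup relies
      on, can be sketched as follows. The names (demo_avc and the demo_*
      helpers) are hypothetical, and a linear scan stands in for the real
      tree walk purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical avc stand-in: the vma's first page offset and its length. */
struct demo_avc {
    unsigned long vm_pgoff;  /* page offset of the vma's first page */
    unsigned long nr_pages;  /* number of pages the vma spans (> 0) */
};

static unsigned long demo_avc_start_pgoff(const struct demo_avc *avc)
{
    return avc->vm_pgoff;
}

static unsigned long demo_avc_last_pgoff(const struct demo_avc *avc)
{
    assert(avc->nr_pages > 0);
    return avc->vm_pgoff + avc->nr_pages - 1;  /* inclusive end */
}

/* Count the avcs whose closed interval [start, last] overlaps the
 * queried range [first, last]. An interval tree answers the same
 * question without visiting the non-overlapping entries. */
static size_t demo_count_overlaps(const struct demo_avc *avcs, size_t n,
                                  unsigned long first, unsigned long last)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        if (demo_avc_start_pgoff(&avcs[i]) <= last &&
            first <= demo_avc_last_pgoff(&avcs[i]))
            hits++;
    }
    return hits;
}
```

      A lookup for a single page uses first == last, so only the avcs whose
      vma actually covers that page offset are reported.
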
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm anon rmap: remove anon_vma_moveto_tail · 108d6642
      Michel Lespinasse authored
      mremap() had a clever optimization where move_ptes() did not take the
      anon_vma lock to avoid a race with anon rmap users such as page migration.
      Instead, the avc's were ordered in such a way that the origin vma was
      always visited by rmap before the destination.  This ordering and the use
      of page table locks made rmap usage safe.  However, we want to replace the use
      of linked lists in anon rmap with an interval tree, and this will make it
      harder to impose such ordering as the interval tree will always be sorted
      by the avc->vma->vm_pgoff value.  For now, let's replace the
      anon_vma_moveto_tail() ordering function with proper anon_vma locking in
      move_ptes().  Once we have the anon interval tree in place, we will
      re-introduce an optimization to avoid taking these locks in the most
      common cases.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse authored
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
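
      The idea of moving a common algorithm into a template and filling in
      the node type and endpoint accessors with the C preprocessor can be
      sketched like this. DEMO_INTERVAL_TEMPLATE, tmpl_vma and the accessor
      names are hypothetical, and the generated function is a simple linear
      search rather than the kernel's tree code.

```c
#include <assert.h>
#include <stddef.h>

/* Generate PREFIX##_first_overlap() for any node TYPE, given accessor
 * functions (or macros) START and LAST that yield the node's interval
 * endpoints. The node itself does not need to store them explicitly. */
#define DEMO_INTERVAL_TEMPLATE(TYPE, START, LAST, PREFIX)                  \
static const TYPE *PREFIX##_first_overlap(const TYPE *nodes, size_t n,     \
                                          unsigned long first,             \
                                          unsigned long last)              \
{                                                                          \
    assert(nodes != NULL || n == 0);                                       \
    for (size_t i = 0; i < n; i++)                                         \
        if (START(&nodes[i]) <= last && first <= LAST(&nodes[i]))          \
            return &nodes[i];                                              \
    return NULL;                                                           \
}

#define TMPL_PAGE_SHIFT 12  /* assumed 4 KiB pages */

/* Hypothetical vma stand-in: the last page offset is derived, not stored. */
struct tmpl_vma {
    unsigned long vm_start, vm_end, vm_pgoff;
};

static unsigned long tmpl_vma_start_pgoff(const struct tmpl_vma *v)
{
    return v->vm_pgoff;
}

static unsigned long tmpl_vma_last_pgoff(const struct tmpl_vma *v)
{
    return v->vm_pgoff + ((v->vm_end - v->vm_start) >> TMPL_PAGE_SHIFT) - 1;
}

/* Instantiate the template for struct tmpl_vma. */
DEMO_INTERVAL_TEMPLATE(struct tmpl_vma, tmpl_vma_start_pgoff,
                       tmpl_vma_last_pgoff, tmpl_vma)
```

      The same template could be instantiated a second time for a different
      node type by supplying another pair of accessors and another prefix.
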
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 30 May, 2012 1 commit
  8. 22 Mar, 2012 3 commits
    • memcg: use new logic for page stat accounting · 89c06bd5
      KAMEZAWA Hiroyuki authored
      Currently, per-memcg page stats are recorded in a per-page_cgroup flag by
      duplicating the page's status into the flag.  The reason is that memcg has a
      feature to move a page from one group to another, and there is a race
      between "move" and "page stat accounting".
      
      Under current logic, assume CPU-A and CPU-B.  CPU-A does "move" and CPU-B
      does "page stat accounting".
      
      When CPU-A goes 1st,
      
                  CPU-A                           CPU-B
                                          update "struct page" info.
          move_lock_mem_cgroup(memcg)
          see pc->flags
          copy page stat to new group
          overwrite pc->mem_cgroup.
          move_unlock_mem_cgroup(memcg)
                                          move_lock_mem_cgroup(mem)
                                          set pc->flags
                                          update page stat accounting
                                          move_unlock_mem_cgroup(mem)
      
      stat accounting is guarded by move_lock_mem_cgroup() and "move" logic
      (CPU-A) doesn't see changes in "struct page" information.
      
      But it's costly to have the same information both in 'struct page' and
      'struct page_cgroup'.  And, there is a potential problem.
      
      For example, assume we have PG_dirty accounting in memcg.
      PG_xxx is a flag for struct page.
      PCG_xxx is a flag for struct page_cgroup.
      (This is just an example. The same problem can be found in any
       kind of page stat accounting.)
      
      	  CPU-A                               CPU-B
            TestSet PG_dirty
            (delay)                        TestClear PG_dirty
                                           if (TestClear(PCG_dirty))
                                                memcg->nr_dirty--
            if (TestSet(PCG_dirty))
                memcg->nr_dirty++
      
      Here, memcg->nr_dirty = +1, which is wrong.  This race was reported by Greg
      Thelen <gthelen@google.com>.  Currently only FILE_MAPPED is supported, and
      fortunately it is serialized by the page table lock, so this is not a real
      bug _now_.
      
      If this potential problem is caused by having duplicated information in
      struct page and struct page_cgroup, we may be able to fix this by using
      the original 'struct page' information.  But then we'll have a problem in
      "move account".
      
      Assume we use only PG_dirty.
      
               CPU-A                   CPU-B
          TestSet PG_dirty
          (delay)                    move_lock_mem_cgroup()
                                     if (PageDirty(page))
                                            new_memcg->nr_dirty++
                                     pc->mem_cgroup = new_memcg;
                                     move_unlock_mem_cgroup()
          move_lock_mem_cgroup()
          memcg = pc->mem_cgroup
          new_memcg->nr_dirty++
      
      The accounting information may be double-counted.  This was the original
      reason to have PCG_xxx flags, but it seems PCG_xxx has another problem.
      
      I think we need a bigger lock as
      
           move_lock_mem_cgroup(page)
           TestSetPageDirty(page)
           update page stats (without any checks)
           move_unlock_mem_cgroup(page)
      
      This fixes both problems, and we don't have to duplicate the page flag into
      page_cgroup.  Please note: move_lock_mem_cgroup() is held only when there
      is a possibility of "account move" in the system.  So, in most paths,
      status updates will go without atomic locks.
      
      This patch introduces mem_cgroup_begin_update_page_stat() and
      mem_cgroup_end_update_page_stat(); both should be called when modifying
      'struct page' information if memcg takes care of it, as follows:
      
           mem_cgroup_begin_update_page_stat()
           modify page information
           mem_cgroup_update_page_stat()
           => never check any 'struct page' info, just update counters.
           mem_cgroup_end_update_page_stat().
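
      The first race above, and the effect of treating the flag update and
      the counter update as one locked section, can be replayed
      deterministically. This is only a single-threaded model: demo_state and
      both replay functions are hypothetical, and the "lock" is represented
      simply by never interleaving the two sections.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state: the page flag, its duplicate, the counter. */
struct demo_state {
    bool pg_dirty;   /* PG_dirty in struct page */
    bool pcg_dirty;  /* duplicated PCG_dirty in struct page_cgroup */
    int nr_dirty;    /* memcg->nr_dirty */
};

/* Replay the interleaving from the diagram with duplicated flags.
 * Returns the final counter; the page itself ends up clean. */
static int demo_replay_duplicated_flag_race(void)
{
    struct demo_state s = { false, false, 0 };

    s.pg_dirty = true;              /* CPU-A: TestSet PG_dirty, then delay */
    s.pg_dirty = false;             /* CPU-B: TestClear PG_dirty */
    if (s.pcg_dirty) {              /* CPU-B: TestClear PCG_dirty (no-op) */
        s.pcg_dirty = false;
        s.nr_dirty--;
    }
    if (!s.pcg_dirty) {             /* CPU-A: TestSet PCG_dirty */
        s.pcg_dirty = true;
        s.nr_dirty++;
    }
    assert(!s.pg_dirty);            /* page is clean, yet counted dirty */
    return s.nr_dirty;
}

/* Replay with the flag and counter updated inside one locked section
 * per CPU, using only the 'struct page' flag. */
static int demo_replay_with_stat_lock(void)
{
    struct demo_state s = { false, false, 0 };

    /* CPU-A: begin_update ... end_update */
    if (!s.pg_dirty) {
        s.pg_dirty = true;
        s.nr_dirty++;
    }
    /* CPU-B: begin_update ... end_update */
    if (s.pg_dirty) {
        s.pg_dirty = false;
        s.nr_dirty--;
    }
    assert(s.nr_dirty == (s.pg_dirty ? 1 : 0));
    return s.nr_dirty;
}
```

      In the first replay the counter drifts to 1 for a clean page; in the
      second it returns to 0 because neither CPU can observe a half-finished
      update.
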
      
      This patch is slow because we need to call begin_update_page_stat()/
      end_update_page_stat() regardless of whether the accounting will change or
      not.  A following patch adds an easy optimization and reduces the cost.
      
      [akpm@linux-foundation.org: s/lock/locked/]
      [hughd@google.com: fix deadlock by avoiding stat lock when anon]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Greg Thelen <gthelen@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rmap: anon_vma_prepare: Reduce code duplication by calling anon_vma_chain_link · 6583a843
      Kautuk Consul authored
      Reduce code duplication by calling anon_vma_chain_link() from
      anon_vma_prepare().
      
      Also move anon_vma_chain_link() to a more suitable location in the file.
      Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace PAGE_MIGRATION with IS_ENABLED(CONFIG_MIGRATION) · ce1744f4
      Konstantin Khlebnikov authored
      Since commit 2a11c8ea ("kconfig: Introduce IS_ENABLED(),
      IS_BUILTIN() and IS_MODULE()") there is a generic grep-friendly method
      for checking config options in C expressions.
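
      How a config option can be tested inside a C expression, even when the
      option is not defined at all, can be sketched with a placeholder trick.
      The macro and option names below (demo_is_defined, CONFIG_DEMO_*) are
      hypothetical; this simplified version only handles options defined to
      1, whereas the kernel's IS_ENABLED() also covers the module case.

```c
#include <assert.h>

/* If the option expands to 1, DEMO_PLACEHOLDER_1 turns into "0," and
 * shifts a literal 1 into the second argument slot. If the option is
 * undefined, the pasted token is junk and the second slot holds 0. */
#define DEMO_PLACEHOLDER_1 0,
#define demo_take_second(ignored, val, ...) val
#define demo_is_defined(x) demo_is_defined_(x)
#define demo_is_defined_(val) demo_is_defined__(DEMO_PLACEHOLDER_##val)
#define demo_is_defined__(arg1_or_junk) \
    demo_take_second(arg1_or_junk 1, 0, 0)

/* Hypothetical options: one enabled, one deliberately left undefined. */
#define CONFIG_DEMO_MIGRATION 1
/* CONFIG_DEMO_COMPACTION is intentionally not defined. */

static int demo_migration_enabled(void)
{
    /* An ordinary C expression, so dead branches still get compiled
     * and type-checked, unlike code hidden behind #ifdef. */
    return demo_is_defined(CONFIG_DEMO_MIGRATION);
}

static int demo_compaction_enabled(void)
{
    return demo_is_defined(CONFIG_DEMO_COMPACTION);
}

/* Small self-check showing both outcomes. */
static void demo_selftest(void)
{
    assert(demo_migration_enabled() == 1);
    assert(demo_compaction_enabled() == 0);
}
```

      Because the result is a plain constant, a condition such as
      "if (demo_is_defined(CONFIG_DEMO_MIGRATION))" is easy to grep for and
      can be folded away by the compiler.
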
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 13 Jan, 2012 1 commit
  10. 11 Jan, 2012 1 commit
    • mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() · 948f017b
      Andrea Arcangeli authored
      migrate was doing an rmap_walk with speculative lock-less access on
      pagetables.  That could lead it to not serializing properly against mremap
      PT locks.  But a second problem remains in the order of vmas in the
      same_anon_vma list used by the rmap_walk.
      
      If vma_merge succeeds in copy_vma, the src vma could be placed after the
      dst vma in the same_anon_vma list.  That could still lead to migrate
      missing some pte.
      
      This patch adds an anon_vma_moveto_tail() function to force the dst vma at
      the end of the list before mremap starts to solve the problem.
      
      If the mremap is very large and there are a lot of parents or children
      sharing the anon_vma root lock, this should still scale better than taking
      the anon_vma root lock around every pte copy practically for the whole
      duration of mremap.
      
      Update: Hugh noticed that special care is needed in the error path where
      move_page_tables goes in the reverse direction: a second
      anon_vma_moveto_tail() call is needed there.
      
      This program exercises the anon_vma_moveto_tail:
      
      ===
      
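      #define _GNU_SOURCE	/* for mremap() and MREMAP_* */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/time.h>

      /* Assumed value: the original definition of SIZE is not shown. */
      #define SIZE (5*1024*1024)

      int main()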
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	printf("%p\n", p);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	if (p4 != p3)
      		perror("mremap"), exit(1);
      	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
      	if (p4 != p+SIZE/2)
      		perror("mremap"), exit(1);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      $ perf probe -a anon_vma_moveto_tail
      Add new event:
        probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
      
      You can now use it on all perf tools, such as:
      
              perf record -e probe:anon_vma_moveto_tail -aR sleep 1
      
      $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
      0x7f2ca2800000
      ok
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
      $ perf report --stdio
         100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Nai Xia <nai.xia@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pawel Sikora <pluto@agmk.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 01 Nov, 2011 1 commit
  12. 31 Oct, 2011 1 commit
  13. 24 Jul, 2011 1 commit
    • M
      [S390] reference bit testing for unmapped pages · 50a15981
      Martin Schwidefsky committed on
      On x86 a page without a mapper is by definition not referenced / old.
      The s390 architecture keeps the reference bit in the storage key and
      the current code will check the storage key for a page without a mapper.
      This leads to an interesting effect: the first time an s390 system
      needs to write pages to swap it only finds referenced pages. This
      causes a lot of pages to get added and written to the swap device.
      To avoid this behaviour, change page_referenced to query the storage
      key only if there is a mapper of the page.
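      
      As a rough illustration of that rule, here is a userspace sketch.
      struct fake_page and page_was_referenced() are hypothetical names, and
      the storage key's reference bit is modelled as a plain boolean; the
      real page_referenced() does considerably more than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page: a mapper count plus the reference bit
 * that s390 keeps in the per-page storage key. */
struct fake_page {
	int mapcount;          /* number of mappers of the page */
	bool skey_referenced;  /* models the storage key reference bit */
};

/* Report whether the page counts as referenced.  A page without a
 * mapper is treated as old and its storage key is not consulted. */
static bool page_was_referenced(struct fake_page *page)
{
	bool referenced;

	if (page->mapcount == 0)
		return false;

	/* Test and clear, so the next query starts from a clean state. */
	referenced = page->skey_referenced;
	page->skey_referenced = false;
	return referenced;
}
```

      With this rule an unmapped page whose key still carries a stale
      reference bit no longer looks recently used.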
      
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  14. 21 Jul, 2011 1 commit
    • C
      fs: kill i_alloc_sem · bd5fe6c5
      Christoph Hellwig committed on
      i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
      be released by a non-owner, and its write side is always mirrored by
      real exclusion.  Its intended use is to wait for all pending direct I/O
      requests to finish before starting a truncate.
      
      Replace it with a hand-grown construct:
      
       - exclusion for truncates is already guaranteed by i_mutex, so it can
         simply fall away
       - the reader side is replaced by an i_dio_count member in struct inode
         that counts the number of pending direct I/O requests.  Truncate can't
         proceed as long as it's non-zero
       - when i_dio_count reaches zero we wake up a pending truncate using
         wake_up_bit on a new bit in i_flags
       - new references to i_dio_count can't appear while we are waiting for
         it to read zero because the direct I/O count always needs i_mutex
         (or an equivalent like XFS's i_iolock) for starting a new operation.
      
      This scheme is much simpler, and saves the space of a spinlock_t and a
      struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
      system).
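      
      The counting scheme can be sketched in userspace with C11 atomics.
      This is a single-threaded model under stated assumptions, not the
      kernel implementation: struct fake_inode, dio_begin(), dio_done() and
      truncate_may_proceed() are hypothetical names, and the wake_up_bit()
      call is modelled by setting a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct inode. */
struct fake_inode {
	atomic_int dio_count;     /* pending direct I/O requests */
	atomic_bool wakeup_sent;  /* models wake_up_bit() on the i_flags bit */
};

/* Start a direct I/O request.  The caller is assumed to hold the lock
 * that also excludes truncate, so the count cannot grow while a
 * truncate is waiting for it to drain. */
static void dio_begin(struct fake_inode *inode)
{
	atomic_fetch_add(&inode->dio_count, 1);
}

/* Complete a request.  atomic_fetch_sub() returns the old value, so a
 * result of 1 means this was the last pending request. */
static void dio_done(struct fake_inode *inode)
{
	if (atomic_fetch_sub(&inode->dio_count, 1) == 1)
		atomic_store(&inode->wakeup_sent, true);
}

/* Truncate may proceed only once no direct I/O is pending. */
static bool truncate_may_proceed(struct fake_inode *inode)
{
	return atomic_load(&inode->dio_count) == 0;
}
```

      Only the completion that drops the count to zero signals the waiter,
      so a truncate is not woken while requests are still in flight.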
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 28 Jun, 2011 1 commit
  16. 18 Jun, 2011 3 commits
    • L
      mm: avoid anon_vma_chain allocation under anon_vma lock · dd34739c
      Linus Torvalds committed on
      Hugh Dickins points out that lockdep (correctly) spots a potential
      deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
      of anon_vma_chain while doing anon_vma_clone().  The problem is that
      page reclaim will want to take the anon_vma lock of any anonymous pages
      that it will try to reclaim.
      
      So re-organize the code in anon_vma_clone() slightly: first do just a
      GFP_NOWAIT allocation, which will usually work fine.  But if that fails,
      let's just drop the lock and re-do the allocation, now with GFP_KERNEL.
      
      End result: not only do we avoid the locking problem, this also ends up
      getting better concurrency in case the allocation does need to block.
      Tim Chen reports that with all these anon_vma locking tweaks, we're now
      almost back up to the spinlock performance.
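      
      The "try without blocking, else drop the lock and retry" pattern can be
      sketched in userspace as follows.  Everything here is a hypothetical
      model: struct fake_lock only records whether the lock is held,
      alloc_nowait() stands in for a GFP_NOWAIT allocation and
      alloc_blocking() for a GFP_KERNEL one, with malloc() doing the work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical lock model that counts acquisitions. */
struct fake_lock {
	bool held;
	int acquisitions;
};

static void fake_lock_acquire(struct fake_lock *lock)
{
	assert(!lock->held);
	lock->held = true;
	lock->acquisitions++;
}

static void fake_lock_release(struct fake_lock *lock)
{
	assert(lock->held);
	lock->held = false;
}

/* Stand-in for a non-blocking allocation: it may fail but never sleeps.
 * simulate_failure lets a caller force the failure path. */
static void *alloc_nowait(size_t size, bool simulate_failure)
{
	return simulate_failure ? NULL : malloc(size);
}

/* Stand-in for a blocking allocation: it may sleep and trigger reclaim,
 * so it must never run with the lock held. */
static void *alloc_blocking(const struct fake_lock *lock, size_t size)
{
	assert(!lock->held);
	return malloc(size);
}

/* Called with the lock held; returns with the lock held. */
static void *alloc_under_lock(struct fake_lock *lock, size_t size,
			      bool simulate_failure)
{
	void *p = alloc_nowait(size, simulate_failure);

	if (p)
		return p;		/* fast path: lock never dropped */

	fake_lock_release(lock);	/* slow path: drop the lock, */
	p = alloc_blocking(lock, size);	/* allocate while unlocked,  */
	fake_lock_acquire(lock);	/* then take the lock again  */
	return p;
}
```

      Because the lock is released on the slow path, anything read under it
      before the allocation has to be treated as possibly stale afterwards.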
      
      Reported-and-tested-by: Hugh Dickins <hughd@google.com>
      Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm: avoid repeated anon_vma lock/unlock sequences in unlink_anon_vmas() · eee2acba
      Peter Zijlstra committed on
      This matches the anon_vma_clone() case, and uses the same lock helper
      functions.  Because of the need to potentially release the anon_vma's,
      it's a bit more complex, though.
      
      We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets
      the anon_vma lock (with the helper function that only takes the lock
      once for the whole loop), and removes any entries that don't need any
      more processing.
      
      The second phase just traverses the remaining list entries (without
      holding the anon_vma lock), and does any actual freeing of the
      anon_vma's that is required.
      
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Hugh Dickins <hughd@google.com>
      Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      mm: avoid repeated anon_vma lock/unlock sequences in anon_vma_clone() · bb4aa396
      Linus Torvalds committed on
      In anon_vma_clone() we traverse the vma->anon_vma_chain of the source
      vma, locking the anon_vma for each entry.
      
      But they are all going to have the same root entry, which means that
      we're locking and unlocking the same lock over and over again.  Which is
      expensive in locked operations, but can get _really_ expensive when that
      root entry sees any kind of lock contention.
      
      In fact, Tim Chen reports a big performance regression due to this: when
      we switched to use a mutex instead of a spinlock, the contention case
      gets much worse.
      
      So to alleviate this all, this commit creates a small helper function
      (lock_anon_vma_root()) that can be used to take the lock just once
      rather than taking and releasing it over and over again.
      
      We still have the same "take the lock and release it" behavior in the
      exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
      since we're actually freeing the anon_vma entries as we go, and that
      will touch the lock too.
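      
      The effect of such a helper can be shown with a userspace model.  This
      is a sketch under assumptions, not the kernel's lock_anon_vma_root():
      struct fake_anon_vma, lock_root_once() and unlock_root() are
      hypothetical names, and the lock is a counter rather than a mutex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each entry points at its root, and the root
 * carries the lock.  lock_count records how often it was taken. */
struct fake_anon_vma {
	struct fake_anon_vma *root;
	int lock_count;
	int locked;
};

/* Lock the root of 'av' unless it is the root we already hold.
 * 'held' is the currently locked root (NULL if none); the return
 * value is the root that is locked afterwards. */
static struct fake_anon_vma *lock_root_once(struct fake_anon_vma *held,
					    struct fake_anon_vma *av)
{
	struct fake_anon_vma *root = av->root;

	if (root != held) {
		if (held)
			held->locked = 0;	/* switch roots */
		root->locked = 1;
		root->lock_count++;
		held = root;
	}
	return held;
}

/* Release whatever root is still held at the end of the walk. */
static void unlock_root(struct fake_anon_vma *held)
{
	if (held)
		held->locked = 0;
}
```

      A walk over a chain whose entries share one root then costs a single
      lock acquisition instead of one per entry.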
      
      Reported-and-tested-by: Tim Chen <tim.c.chen@linux.intel.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 08 Jun, 2011 1 commit
    • C
      writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Christoph Hellwig committed on
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contentions to 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  18. 30 May, 2011 1 commit
  19. 29 May, 2011 2 commits
    • H
      mm: fix page_lock_anon_vma leaving mutex locked · eee0f252
      Hugh Dickins committed on
      On one machine I've been getting hangs, a page fault's anon_vma_prepare()
      waiting in anon_vma_lock(), other processes waiting for that page's lock.
      
      This is a replay of last year's f1819427 "mm: fix hang on
      anon_vma->root->lock".
      
      The new page_lock_anon_vma() places too much faith in its refcount: when
      it has acquired the mutex_trylock(), it's possible that a racing task in
      anon_vma_alloc() has just reallocated the struct anon_vma, set refcount
      to 1, and is about to reset its anon_vma->root.
      
      Fix this by saving anon_vma->root, and relying on the usual page_mapped()
      check instead of a refcount check: if page is still mapped, the anon_vma
      is still ours; if page is not still mapped, we're no longer interested.
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm: fix kernel BUG at mm/rmap.c:1017! · 5dbe0af4
      Hugh Dickins committed on
      I've hit the "address >= vma->vm_end" check in do_page_add_anon_rmap()
      just once.  The stack showed khugepaged allocation trying to compact
      pages: the call to page_add_anon_rmap() coming from remove_migration_pte().
      
      That path holds anon_vma lock, but does not hold mmap_sem: it can
      therefore race with a split_vma(), and in commit 5f70b962 "mmap:
      avoid unnecessary anon_vma lock" we just took away the anon_vma lock
      protection when adjusting vma->vm_end.
      
      I don't think that particular BUG_ON ever caught anything interesting,
      so it is better to replace it with a comment than to reinstate the
      anon_vma locking.
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 25 May, 2011 6 commits
  21. 23 May, 2011 1 commit
    • M
      [S390] merge page_test_dirty and page_clear_dirty · 2d42552d
      Martin Schwidefsky committed on
      The page_clear_dirty primitive always sets the default storage key
      which resets the access control bits and the fetch protection bit.
      That will surprise a KVM guest that sets non-zero access control
      bits or the fetch protection bit. Merge page_test_dirty and
      page_clear_dirty back to a single function and only clear the
      dirty bit from the storage key.
      
      In addition, move the functions page_test_and_clear_dirty and
      page_test_and_clear_young to page.h where they belong. This
      requires changing the parameter from a struct page * to a page
      frame number.
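      
      The difference between the two behaviours can be sketched with plain
      bit operations.  This is an illustration only: the storage key is
      modelled as a byte in memory rather than accessed through the s390
      storage key instructions, the SKEY_* masks and the zero default key
      are assumptions made for this sketch, and both function names are
      hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Storage key bit layout, assumed here for illustration. */
#define SKEY_ACC_BITS   0xf0u	/* access control bits */
#define SKEY_FP_BIT     0x08u	/* fetch protection bit */
#define SKEY_REFERENCED 0x04u	/* reference bit */
#define SKEY_CHANGED    0x02u	/* change (dirty) bit */
#define SKEY_DEFAULT    0x00u	/* assumed default key */

/* Behaviour the commit message describes as the problem: writing the
 * default key clears the dirty bit but also wipes the access control
 * and fetch protection bits. */
static uint8_t clear_dirty_by_reset(uint8_t skey)
{
	(void)skey;
	return SKEY_DEFAULT;
}

/* Merged test-and-clear: report the dirty state and clear only the
 * change bit, leaving every other bit of the key untouched. */
static bool test_and_clear_dirty(uint8_t *skey)
{
	bool dirty = (*skey & SKEY_CHANGED) != 0;

	*skey &= (uint8_t)~SKEY_CHANGED;
	return dirty;
}
```

      In this model a guest key with non-zero access control bits keeps
      them across test_and_clear_dirty(), while clear_dirty_by_reset()
      loses them.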
      
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  22. 25 Mar, 2011 1 commit