1. 28 2月, 2017 1 次提交
  2. 25 2月, 2017 1 次提交
  3. 27 12月, 2016 1 次提交
    • J
      mm: Invalidate DAX radix tree entries only if appropriate · c6dcf52c
      Jan Kara 提交于
      Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
      just delete all exceptional radix tree entries they find. For DAX this
      is not desirable as we track cache dirtiness in these entries and when
      they are evicted, we may not flush caches although it is necessary. This
      can for example manifest when we write to the same block both via mmap
      and via write(2) (to different offsets) and fsync(2) then does not
      properly flush CPU caches when modification via write(2) was the last
      one.
      
      Create appropriate DAX functions to handle invalidation of DAX entries
      for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
      wire them up into the corresponding mm functions.
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      c6dcf52c
  4. 13 12月, 2016 2 次提交
    • J
      mm: workingset: move shadow entry tracking to radix tree exceptional tracking · 14b46879
      Johannes Weiner 提交于
      Currently, we track the shadow entries in the page cache in the upper
      bits of the radix_tree_node->count, behind the back of the radix tree
      implementation.  Because the radix tree code has no awareness of them,
      we rely on random subtleties throughout the implementation (such as the
      node->count != 1 check in the shrinking code, which is meant to exclude
      multi-entry nodes but also happens to skip nodes with only one shadow
      entry, as that's accounted in the upper bits).  This is error prone and
      has, in fact, caused the bug fixed in d3798ae8 ("mm: filemap: don't
      plant shadow entries without radix tree node").
      
      To remove these subtleties, this patch moves shadow entry tracking from
      the upper bits of node->count to the existing counter for exceptional
      entries.  node->count goes back to being a simple counter of valid
      entries in the tree node and can be shrunk to a single byte.
      
      This vastly simplifies the page cache code.  All accounting happens
      natively inside the radix tree implementation, and maintaining the LRU
      linkage of shadow nodes is consolidated into a single function in the
      workingset code that is called for leaf nodes affected by a change in
      the page cache tree.
      
      This also removes the last user of the __radix_delete_node() return
      value.  Eliminate it.
      
      Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14b46879
    • J
      lib: radix-tree: check accounting of existing slot replacement users · 6d75f366
      Johannes Weiner 提交于
      The bug in khugepaged fixed earlier in this series shows that radix tree
      slot replacement is fragile; and it will become more so when not only
      NULL<->!NULL transitions need to be caught but transitions from and to
      exceptional entries as well.  We need checks.
      
      Re-implement radix_tree_replace_slot() on top of the sanity-checked
      __radix_tree_replace().  This requires existing callers to also pass the
      radix tree root, but it'll warn us when somebody replaces slots with
      contents that need proper accounting (transitions between NULL entries,
      real entries, exceptional entries) and where a replacement through the
      slot pointer would corrupt the radix tree node counts.
      
      Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d75f366
  5. 01 12月, 2016 1 次提交
  6. 27 7月, 2016 1 次提交
  7. 20 5月, 2016 1 次提交
    • J
      dax: New fault locking · ac401cc7
      Jan Kara 提交于
      Currently DAX page fault locking is racy.
      
      CPU0 (write fault)		CPU1 (read fault)
      
      __dax_fault()			__dax_fault()
        get_block(inode, block, &bh, 0) -> not mapped
      				  get_block(inode, block, &bh, 0)
      				    -> not mapped
        if (!buffer_mapped(&bh))
          if (vmf->flags & FAULT_FLAG_WRITE)
            get_block(inode, block, &bh, 1) -> allocates blocks
        if (page) -> no
      				  if (!buffer_mapped(&bh))
      				    if (vmf->flags & FAULT_FLAG_WRITE) {
      				    } else {
      				      dax_load_hole();
      				    }
        dax_insert_mapping()
      
      And we are in a situation where we fail in dax_radix_entry() with -EIO.
      
      Another problem with the current DAX page fault locking is that there is
      no race-free way to clear dirty tag in the radix tree. We can always
      end up with clean radix tree and dirty data in CPU cache.
      
      We fix the first problem by introducing locking of exceptional radix
      tree entries in DAX mappings acting very similarly to page lock and thus
      synchronizing properly faults against the same mapping index. The same
      lock can later be used to avoid races when clearing radix tree dirty
      tag.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      ac401cc7
  8. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  9. 16 3月, 2016 3 次提交
  10. 23 1月, 2016 1 次提交
    • R
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler 提交于
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
  11. 02 6月, 2015 2 次提交
    • G
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen 提交于
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: NSha Zhengju <handai.szj@gmail.com>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c4843a75
    • T
      page_writeback: revive cancel_dirty_page() in a restricted form · 11f81bec
      Tejun Heo 提交于
      cancel_dirty_page() had some issues and b9ea2515 ("page_writeback:
      clean up mess around cancel_dirty_page()") replaced it with
      account_page_cleaned() which makes the caller responsible for clearing
      the dirty bit; unfortunately, the planned changes for cgroup writeback
      support requires synchronization between dirty bit manipulation and
      stat updates.  While we can open-code such synchronization in each
      account_page_cleaned() callsite, that's gonna be unnecessarily awkward
      and verbose.
      
      This patch revives cancel_dirty_page() but in a more restricted form.
      All it does is TestClearPageDirty() followed by account_page_cleaned()
      invocation if the page was dirty.  This helper covers all
      account_page_cleaned() usages except for __delete_from_page_cache()
      which is a special case anyway and left alone.  As this leaves no
      module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
      from it.
      
      This patch just revives cancel_dirty_page() as a trivial wrapper to
      replace equivalent usages and doesn't introduce any functional
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      11f81bec
  12. 16 4月, 2015 1 次提交
  13. 15 4月, 2015 1 次提交
    • K
      page_writeback: clean up mess around cancel_dirty_page() · b9ea2515
      Konstantin Khlebnikov 提交于
      This patch replaces cancel_dirty_page() with a helper function
      account_page_cleaned() which only updates counters.  It's called from
      truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
      Page is locked in both cases, page-lock protects against concurrent
      dirtiers: see commit 2d6d7f98 ("mm: protect set_page_dirty() from
      ongoing truncation").
      
      Delete_from_page_cache() shouldn't be called for dirty pages, they must
      be handled by caller (either written or truncated).  This patch treats
      final dirty accounting fixup at the end of __delete_from_page_cache() as
      a debug check and adds WARN_ON_ONCE() around it.  If something removes
      dirty pages without proper handling that might be a bug and unwritten
      data might be lost.
      
      Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
      here.
      
      cancel_dirty_page() in nfs_wb_page_cancel() is redundant.  This is
      helper for nfs_invalidate_page() and it's called only in case complete
      invalidation.
      
      The mess was started in v2.6.20 after commits 46d2277c ("Clean up
      and make try_to_free_buffers() not race with dirty pages") and
      3e67c098 ("truncate: clear page dirtiness before running
      try_to_free_buffers()") first was reverted right in v2.6.20 in commit
      ecdfc978 ("Resurrect 'try_to_free_buffers()' VM hackery"), second in
      v2.6.25 commit a2b34564 ("Fix dirty page accounting leak with ext3
      data=journal").
      
      Custom fixes were introduced between these points.  NFS in v2.6.23, commit
      1b3b4a1a ("NFS: Fix a write request leak in nfs_invalidate_page()").
      Kludge in __delete_from_page_cache() in v2.6.24, commit 3a692790 ("Do
      dirty page accounting when removing a page from the page cache").  Since
      v2.6.25 all of them are redundant.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9ea2515
  14. 21 1月, 2015 1 次提交
  15. 07 11月, 2014 1 次提交
  16. 30 10月, 2014 1 次提交
  17. 02 10月, 2014 1 次提交
    • J
      vfs: fix data corruption when blocksize < pagesize for mmaped data · 90a80202
      Jan Kara 提交于
      ->page_mkwrite() is used by filesystems to allocate blocks under a page
      which is becoming writeably mmapped in some process' address space. This
      allows a filesystem to return a page fault if there is not enough space
      available, user exceeds quota or similar problem happens, rather than
      silently discarding data later when writepage is called.
      
      However VFS fails to call ->page_mkwrite() in all the cases where
      filesystems need it when blocksize < pagesize. For example when
      blocksize = 1024, pagesize = 4096 the following is problematic:
        ftruncate(fd, 0);
        pwrite(fd, buf, 1024, 0);
        map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
        map[0] = 'a';       ----> page_mkwrite() for index 0 is called
        ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
        mremap(map, 1024, 10000, 0);
        map[4095] = 'a';    ----> no page_mkwrite() called
      
      At the moment ->page_mkwrite() is called, filesystem can allocate only
      one block for the page because i_size == 1024. Otherwise it would create
      blocks beyond i_size which is generally undesirable. But later at
      ->writepage() time, we also need to store data at offset 4095 but we
      don't have block allocated for it.
      
      This patch introduces a helper function filesystems can use to have
      ->page_mkwrite() called at all the necessary moments.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      90a80202
  18. 09 8月, 2014 1 次提交
    • J
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner 提交于
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: NJet Chen <jet.chen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NFelipe Balbi <balbi@ti.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
  19. 24 7月, 2014 1 次提交
    • H
      mm/fs: fix pessimization in hole-punching pagecache · 792ceaef
      Hugh Dickins 提交于
      I wanted to revert my v3.1 commit d0823576 ("mm: pincer in
      truncate_inode_pages_range"), to keep truncate_inode_pages_range() in
      synch with shmem_undo_range(); but have stepped back - a change to
      hole-punching in truncate_inode_pages_range() is a change to
      hole-punching in every filesystem (except tmpfs) that supports it.
      
      If there's a logical proof why no filesystem can depend for its own
      correctness on the pincer guarantee in truncate_inode_pages_range() - an
      instant when the entire hole is removed from pagecache - then let's
      revisit later.  But the evidence is that only tmpfs suffered from the
      livelock, and we have no intention of extending hole-punch to ramfs.  So
      for now just add a few comments (to match or differ from those in
      shmem_undo_range()), and fix one silliness noticed in d0823576...
      
      Its "index == start" addition to the hole-punch termination test was
      incomplete: it opened a way for the end condition to be missed, and the
      loop go on looking through the radix_tree, all the way to end of file.
      Fix that pessimization by resetting index when detected in inner loop.
      
      Note that it's actually hard to hit this case, without the obsessive
      concurrent faulting that trinity does: normally all pages are removed in
      the initial trylock_page() pass, and this loop finds nothing to do.  I
      had to "#if 0" out the initial pass to reproduce bug and test fix.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      792ceaef
  20. 07 5月, 2014 1 次提交
    • J
      mm: filemap: update find_get_pages_tag() to deal with shadow entries · 139b6a6f
      Johannes Weiner 提交于
      Dave Jones reports the following crash when find_get_pages_tag() runs
      into an exceptional entry:
      
        kernel BUG at mm/filemap.c:1347!
        RIP: find_get_pages_tag+0x1cb/0x220
        Call Trace:
          find_get_pages_tag+0x36/0x220
          pagevec_lookup_tag+0x21/0x30
          filemap_fdatawait_range+0xbe/0x1e0
          filemap_fdatawait+0x27/0x30
          sync_inodes_sb+0x204/0x2a0
          sync_inodes_one_sb+0x19/0x20
          iterate_supers+0xb2/0x110
          sys_sync+0x44/0xb0
          ia32_do_call+0x13/0x13
      
        1343                         /*
        1344                          * This function is never used on a shmem/tmpfs
        1345                          * mapping, so a swap entry won't be found here.
        1346                          */
        1347                         BUG();
      
      After commit 0cd6144a ("mm + fs: prepare for non-page entries in
      page cache radix trees") this comment and BUG() are out of date because
      exceptional entries can now appear in all mappings - as shadows of
      recently evicted pages.
      
      However, as Hugh Dickins notes,
      
        "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
         any other PAGECACHE_TAG_*) to appear on an exceptional entry.
      
         I expect it comes down to an occasional race in RCU lookup of the
         radix_tree: lacking absolute synchronization, we might sometimes
         catch an exceptional entry, with the tag which really belongs with
         the unexceptional entry which was there an instant before."
      
      And indeed, not only is the tree walk lockless, the tags are also read
      in chunks, one radix tree node at a time.  There is plenty of time for
      page reclaim to swoop in and replace a page that was already looked up
      as tagged with a shadow entry.
      
      Remove the BUG() and update the comment.  While reviewing all other
      lookup sites for whether they properly deal with shadow entries of
      evicted pages, update all the comments and fix memcg file charge moving
      to not miss shmem/tmpfs swapcache pages.
      
      Fixes: 0cd6144a ("mm + fs: prepare for non-page entries in page cache radix trees")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NDave Jones <davej@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      139b6a6f
  21. 04 4月, 2014 3 次提交
    • J
      mm: keep page cache radix tree nodes in check · 449dd698
      Johannes Weiner 提交于
      Previously, page cache radix tree nodes were freed after reclaim emptied
      out their page pointers.  But now reclaim stores shadow entries in their
      place, which are only reclaimed when the inodes themselves are
      reclaimed.  This is problematic for bigger files that are still in use
      after they have a significant amount of their cache reclaimed, without
      any of those pages actually refaulting.  The shadow entries will just
      sit there and waste memory.  In the worst case, the shadow entries will
      accumulate until the machine runs out of memory.
      
      To get this under control, the VM will track radix tree nodes
      exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
      rather than global because we expect the radix tree nodes themselves to
      be allocated node-locally and we want to reduce cross-node references of
      otherwise independent cache workloads.  A simple shrinker will then
      reclaim these nodes on memory pressure.
      
      A few things need to be stored in the radix tree node to implement the
      shadow node LRU and allow tree deletions coming from the list:
      
      1. There is no index available that would describe the reverse path
         from the node up to the tree root, which is needed to perform a
         deletion.  To solve this, encode in each node its offset inside the
         parent.  This can be stored in the unused upper bits of the same
         member that stores the node's height at no extra space cost.
      
      2. The number of shadow entries needs to be counted in addition to the
         regular entries, to quickly detect when the node is ready to go to
         the shadow node LRU list.  The current entry count is an unsigned
         int but the maximum number of entries is 64, so a shadow counter
         can easily be stored in the unused upper bits.
      
      3. Tree modification needs tree lock and tree root, which are located
         in the address space, so store an address_space backpointer in the
         node.  The parent pointer of the node is in a union with the 2-word
         rcu_head, so the backpointer comes at no extra cost as well.
      
      4. The node needs to be linked to an LRU list, which requires a list
         head inside the node.  This does increase the size of the node, but
         it does not change the number of objects that fit into a slab page.
      
      [akpm@linux-foundation.org: export the right function]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      449dd698
    • J
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner 提交于
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
    • J
      mm + fs: prepare for non-page entries in page cache radix trees · 0cd6144a
      Johannes Weiner 提交于
      shmem mappings already contain exceptional entries where swap slot
      information is remembered.
      
      To be able to store eviction information for regular page cache, prepare
      every site dealing with the radix trees directly to handle entries other
      than pages.
      
      The common lookup functions will filter out non-page entries and return
      NULL for page cache holes, just as before.  But provide a raw version of
      the API which returns non-page entries as well, and switch shmem over to
      use it.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cd6144a
  22. 13 9月, 2013 1 次提交
  23. 28 5月, 2013 1 次提交
    • L
      mm: teach truncate_inode_pages_range() to handle non page aligned ranges · 5a720394
      Lukas Czerner 提交于
      This commit changes truncate_inode_pages_range() so it can handle non
      page aligned regions of the truncate. Currently we can hit BUG_ON when
      the end of the range is not page aligned, but we can handle unaligned
      start of the range.
      
      Being able to handle non page aligned regions of the page can help file
      system punch_hole implementations and save some work, because once we're
      holding the page we might as well deal with it right away.
      
      In previous commits we've changed ->invalidatepage() prototype to accept
      'length' argument to be able to specify range to invalidate. No we can
      use that new ability in truncate_inode_pages_range().
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      5a720394
  24. 22 5月, 2013 1 次提交
    • L
      mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner 提交于
      Currently there is no way to truncate partial page where the end
      truncate point is not at the end of the page. This is because it was not
      needed and the functionality was enough for file system truncate
      operation to work properly. However more file systems now support punch
      hole feature and it can benefit from mm supporting truncating page just
      up to the certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept range to be invalidated and update all the instances
      for it.
      
      We also change the block_invalidatepage() in the same way and actually
      make a use of the new length argument implementing range invalidation.
      
      Actual file system implementations will follow except the file systems
      where the changes are really simple and should not change the behaviour
      in any way .Implementation for truncate_page_range() which will be able
      to accept page unaligned ranges will follow as well.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      d47992f8
  25. 21 12月, 2012 1 次提交
  26. 09 10月, 2012 2 次提交
    • H
      mm: use clear_page_mlock() in page_remove_rmap() · e6c509f8
      Hugh Dickins 提交于
      We had thought that pages could no longer get freed while still marked as
      mlocked; but Johannes Weiner posted this program to demonstrate that
      truncating an mlocked private file mapping containing COWed pages is still
      mishandled:
      
      #include <sys/types.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <stdio.h>
      
      int main(void)
      {
      	char *map;
      	int fd;
      
      	system("grep mlockfreed /proc/vmstat");
      	fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
      	unlink("chigurh");
      	ftruncate(fd, 4096);
      	map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
      	map[0] = 11;
      	mlock(map, sizeof(fd));
      	ftruncate(fd, 0);
      	close(fd);
      	munlock(map, sizeof(fd));
      	munmap(map, 4096);
      	system("grep mlockfreed /proc/vmstat");
      	return 0;
      }
      
      The anon COWed pages are not caught by truncation's clear_page_mlock() of
      the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
      look out for them there in page_remove_rmap().  Indeed, why should
      truncation or invalidation be doing the clear_page_mlock() when removing
      from pagecache?  mlock is a property of mapping in userspace, not a
      property of pagecache: an mlocked unmapped page is nonsensical.
      Reported-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6c509f8
    • H
      mm: fix invalidate_complete_page2() lock ordering · ec4d9f62
      Hugh Dickins 提交于
      In fuzzing with trinity, lockdep protested "possible irq lock inversion
      dependency detected" when isolate_lru_page() reenabled interrupts while
      still holding the supposedly irq-safe tree_lock:
      
      invalidate_inode_pages2
        invalidate_complete_page2
          spin_lock_irq(&mapping->tree_lock)
          clear_page_mlock
            isolate_lru_page
              spin_unlock_irq(&zone->lru_lock)
      
      isolate_lru_page() is correct to enable interrupts unconditionally:
      invalidate_complete_page2() is incorrect to call clear_page_mlock() while
      holding tree_lock, which is supposed to nest inside lru_lock.
      
      Both truncate_complete_page() and invalidate_complete_page() call
      clear_page_mlock() before taking tree_lock to remove page from radix_tree.
       I guess invalidate_complete_page2() preferred to test PageDirty (again)
      under tree_lock before committing to the munlock; but since the page has
      already been unmapped, its state is already somewhat inconsistent, and no
      worse if clear_page_mlock() moved up.
      Reported-by: NSasha Levin <levinsasha928@gmail.com>
      Deciphered-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec4d9f62
  27. 30 5月, 2012 1 次提交
  28. 29 3月, 2012 1 次提交
    • H
      mm for fs: add truncate_pagecache_range() · 623e3db9
      Hugh Dickins 提交于
      Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
      but forgetting to unmap pages first (ocfs2 remembers).  This is not really
      a bug, since races already require truncate_inode_page() to handle that
      case once the page is locked; but it can be very inefficient if the file
      being punched happens to be mapped into many vmas.
      
      Provide a drop-in replacement truncate_pagecache_range() which does the
      unmapping pass first, handling the awkward mismatch between arguments to
      truncate_inode_pages_range() and arguments to unmap_mapping_range().
      
      Note that holepunching does not unmap privately COWed pages in the range:
      POSIX requires that we do so when truncating, but it's hard to justify,
      difficult to implement without an i_size cutoff, and no filesystem is
      attempting to implement it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      623e3db9
  29. 23 2月, 2012 1 次提交
  30. 24 1月, 2012 1 次提交
    • D
      mm: cleancache: s/flush/invalidate/ · 3167760f
      Dan Magenheimer 提交于
      Per akpm suggestions alter the use of the term flush to be
      invalidate. The next patch will do this across all MM.
      
      This change is completely cosmetic.
      
      [v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 3]
      Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jan Beulich <JBeulich@novell.com>
      Reviewed-by: NSeth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Rik Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [v10: Fixed  fs: move code out of buffer.c conflict change]
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      3167760f
  31. 31 10月, 2011 1 次提交
  32. 04 8月, 2011 1 次提交
  33. 26 7月, 2011 1 次提交