1. 14 Jul 2021 (5 commits)
  2. 03 Nov 2020 (1 commit)
  3. 17 Oct 2020 (1 commit)
  4. 14 Oct 2020 (1 commit)
    • mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED · eb1d7a65
      Authored by Yafang Shao
      Our users reported random latency spikes while their RT process was
      running.  We eventually found that the spikes were caused by
      FADV_DONTNEED, which may call lru_add_drain_all() to drain the LRU
      cache on remote CPUs and then wait for the per-cpu work to complete.
      The wait time is uncertain and can reach tens of milliseconds.
      
      That behavior is unreasonable: the process is bound to a specific CPU
      and the file is accessed only by that process, so there should be no
      pagecache pages on a per-cpu pagevec of a remote CPU.  The behavior is
      partially caused by an incorrect comparison of the number of
      invalidated pages against the target count.  For example,
      
              if (count < (end_index - start_index + 1))
      
      The count above is how many pages were invalidated on the local CPU,
      and (end_index - start_index + 1) is how many pages should be
      invalidated.  Using (end_index - start_index + 1) is incorrect because
      the range describes offsets rather than pages that actually exist:
      parts of it may not be mapped to pages at all, and there may be holes
      between start and end.  So we had better check whether there are still
      pages on a per-cpu pagevec after draining the local CPU, and only then
      decide whether to call lru_add_drain_all().
      
      After I applied it with a hotfix to our production environment, most of
      the lru_add_drain_all() can be avoided.
      Suggested-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lkml.kernel.org/r/20200923133318.14373-1-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb1d7a65
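      The shape of the resulting check, as a minimal sketch (the helper that
      probes the remaining cached pages is hypothetical; the upstream patch
      instead counts pages that were skipped during local invalidation):

        count = invalidate_mapping_pages(mapping, start_index, end_index);

        /* Drain only the local CPU's pagevecs first. */
        lru_add_drain();

        /*
         * Only if pages in the range still linger in the page cache do we
         * pay for draining every CPU.  range_still_cached() stands in for
         * whatever check the real code uses; it is illustrative only.
         */
        if (range_still_cached(mapping, start_index, end_index))
                lru_add_drain_all();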
  5. 19 Oct 2019 (1 commit)
  6. 21 May 2019 (1 commit)
  7. 06 Mar 2019 (1 commit)
  8. 01 Dec 2018 (1 commit)
  9. 21 Oct 2018 (1 commit)
  10. 30 Sep 2018 (1 commit)
    • xarray: Replace exceptional entries · 3159f943
      Authored by Matthew Wilcox
      Introduce xarray value entries and tagged pointers to replace radix
      tree exceptional entries.  This is a slight change in encoding to allow
      the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
      value entry).  It is also a change in emphasis; exceptional entries are
      intimidating and different.  As the comment explains, you can choose
      to store values or pointers in the xarray and they are both first-class
      citizens.
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      3159f943
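      A small illustration of the value-entry encoding described above,
      using the xarray helpers (sketch only; error handling omitted):

        #include <linux/xarray.h>

        void *entry = xa_mk_value(42);          /* tag an integer as a value entry */

        if (xa_is_value(entry))                 /* value entries and pointers coexist */
                pr_info("stored value: %lu\n", xa_to_value(entry));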
  11. 12 Apr 2018 (1 commit)
  12. 01 Feb 2018 (1 commit)
  13. 16 Nov 2017 (5 commits)
    • mm, pagevec: remove cold parameter for pagevecs · 86679820
      Authored by Mel Gorman
      Every pagevec_init user claims the pages being released are hot even in
      cases where it is unlikely the pages are hot.  As no one cares about the
      hotness of pages being released to the allocator, just ditch the
      parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86679820
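      The effect on callers is a one-argument initialisation; a before/after
      sketch:

        struct pagevec pvec;

        /* before: every caller had to claim a hotness it did not know */
        /*      pagevec_init(&pvec, 0);                                 */

        /* after: the cold parameter is gone                            */
        pagevec_init(&pvec);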
    • mm, truncate: remove all exceptional entries from pagevec under one lock · f2187599
      Authored by Mel Gorman
      During truncate, each entry in a pagevec is checked to see whether it
      is an exceptional entry and, if so, the shadow entry is cleaned up.
      This is potentially expensive, as multiple entries for a mapping each
      lock and unlock the tree lock.  This patch batches the operation so
      that the exceptional entries removed from a pagevec acquire the
      mapping tree lock only once.  The corner case where this is more
      expensive is when there is only one exceptional entry, but that is
      unlikely due to temporal locality and how it affects LRU ordering.
      Note that for truncations of small files created recently, this patch
      should show no gain because it only batches the handling of
      exceptional entries.
      
      sparsetruncate (large)
                                    4.14.0-rc4             4.14.0-rc4
                               pickhelper-v1r1       batchshadow-v1r1
      Min          Time       38.00 (   0.00%)       27.00 (  28.95%)
      1st-qrtle    Time       40.00 (   0.00%)       28.00 (  30.00%)
      2nd-qrtle    Time       44.00 (   0.00%)       41.00 (   6.82%)
      3rd-qrtle    Time      146.00 (   0.00%)      147.00 (  -0.68%)
      Max-90%      Time      153.00 (   0.00%)      153.00 (   0.00%)
      Max-95%      Time      155.00 (   0.00%)      156.00 (  -0.65%)
      Max-99%      Time      181.00 (   0.00%)      171.00 (   5.52%)
      Amean        Time       93.04 (   0.00%)       88.43 (   4.96%)
      Best99%Amean Time       92.08 (   0.00%)       86.13 (   6.46%)
      Best95%Amean Time       89.19 (   0.00%)       83.13 (   6.80%)
      Best90%Amean Time       85.60 (   0.00%)       79.15 (   7.53%)
      Best75%Amean Time       72.95 (   0.00%)       65.09 (  10.78%)
      Best50%Amean Time       39.86 (   0.00%)       28.20 (  29.25%)
      Best25%Amean Time       39.44 (   0.00%)       27.70 (  29.77%)
      
      bonnie
                                            4.14.0-rc4             4.14.0-rc4
                                       pickhelper-v1r1       batchshadow-v1r1
      Hmean     SeqCreate ops         71.92 (   0.00%)       76.78 (   6.76%)
      Hmean     SeqCreate read        42.42 (   0.00%)       45.01 (   6.10%)
      Hmean     SeqCreate del      26519.88 (   0.00%)    27191.87 (   2.53%)
      Hmean     RandCreate ops        71.92 (   0.00%)       76.95 (   7.00%)
      Hmean     RandCreate read       44.44 (   0.00%)       49.23 (  10.78%)
      Hmean     RandCreate del     24948.62 (   0.00%)    24764.97 (  -0.74%)
      
      Truncation of a large number of files shows a substantial gain, with
      99% of files being truncated 6.46% faster.  bonnie shows a modest gain
      of 2.53%.
      
      [jack@suse.cz: fix truncate_exceptional_pvec_entries()]
        Link: http://lkml.kernel.org/r/20171108164226.26788-1-jack@suse.cz
      Link: http://lkml.kernel.org/r/20171018075952.10627-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2187599
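      A rough sketch of the batching (simplified; the lock-free
      shadow-clearing helper is assumed, and regular pages are handled
      separately):

        spin_lock_irq(&mapping->tree_lock);
        for (i = 0; i < pagevec_count(pvec); i++) {
                struct page *page = pvec->pages[i];

                if (!radix_tree_exceptional_entry(page))
                        continue;       /* only shadow/exceptional entries here */

                __clear_shadow_entry(mapping, indices[i], page);
        }
        spin_unlock_irq(&mapping->tree_lock);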
    • mm, truncate: do not check mapping for every page being truncated · c7df8ad2
      Authored by Mel Gorman
      During truncation, the mapping has already been checked for shmem and
      dax so it's known that workingset_update_node is required.
      
      This patch avoids the checks on mapping for each page being truncated.
      In all other cases, a lookup helper is used to determine if
      workingset_update_node() needs to be called.  The one danger is that the
      API is slightly harder to use as calling workingset_update_node directly
      without checking for dax or shmem mappings could lead to surprises.
      However, the API rarely needs to be used and hopefully the comment is
      enough to give people the hint.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                   oneirq-v1r1        pickhelper-v1r1
      Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
      1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
      Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
      Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
      Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
      Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
      Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
      Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
      Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
      Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
      Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
      Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
      Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)
      
      As you'd expect, the gain is marginal but it can be detected.  The
      differences in bonnie are all within the noise which is not surprising
      given the impact on the microbenchmark.
      
      radix_tree_update_node_t is a callback for some radix operations that
      optionally passes in a private field.  The only user of the callback is
      workingset_update_node and as it no longer requires a mapping, the
      private field is removed.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7df8ad2
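      The lookup helper pattern, sketched with illustrative names (the real
      code checks the shmem and DAX cases once per mapping and returns the
      callback, or NULL, accordingly):

        static inline radix_tree_update_node_t lookup_update(struct address_space *mapping)
        {
                if (shmem_mapping(mapping) || dax_mapping(mapping))
                        return NULL;            /* no shadow entries to maintain */
                return workingset_update_node;
        }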
    • mm: batch radix tree operations when truncating pages · aa65c29c
      Authored by Jan Kara
      Currently we remove pages from the radix tree one by one.  To speed up
      page cache truncation, lock several pages at once and free them in one
      go.  This allows us to batch radix tree operations in a more efficient
      way and also save round-trips on mapping->tree_lock.  As a result we
      gain about 20% speed improvement in page cache truncation.
      
      Data from a simple benchmark timing 10000 truncates of 1024 pages (on
      ext4 on ramdisk but the filesystem is barely visible in the profiles).
      The range shows 1% and 95% percentiles of the measured times:
      
        4.14-rc2	4.14-rc2 + batched truncation
        248-256	209-219
        249-258	209-217
        248-255	211-239
        248-255	209-217
        247-256	210-218
      
      [jack@suse.cz: convert delete_from_page_cache_batch() to pagevec]
        Link: http://lkml.kernel.org/r/20171018111648.13714-1-jack@suse.cz
      [akpm@linux-foundation.org: move struct pagevec forward declaration to top-of-file]
      Link: http://lkml.kernel.org/r/20171010151937.26984-8-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa65c29c
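      The batched shape, simplified: pages in a pagevec are locked up front
      and then removed from the tree in one pass, so mapping->tree_lock is
      taken once per batch rather than once per page.

        for (i = 0; i < pagevec_count(&pvec); i++)
                lock_page(pvec.pages[i]);

        /* one tree_lock round-trip for the whole batch */
        delete_from_page_cache_batch(mapping, &pvec);

        for (i = 0; i < pagevec_count(&pvec); i++)
                unlock_page(pvec.pages[i]);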
    • mm: refactor truncate_complete_page() · 9f4e41f4
      Authored by Jan Kara
      Move call of delete_from_page_cache() and page->mapping check out of
      truncate_complete_page() into the single caller - truncate_inode_page().
      Also move page_mapped() check into truncate_complete_page().  That way
      it will be easier to batch operations.
      
      Also rename truncate_complete_page() to truncate_cleanup_page().
      
      Link: http://lkml.kernel.org/r/20171010151937.26984-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f4e41f4
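      The caller after the refactor, roughly (simplified; the cleanup step
      no longer touches the page cache itself, which is what later allows
      batching):

        int truncate_inode_page(struct address_space *mapping, struct page *page)
        {
                if (page->mapping != mapping)
                        return -EIO;

                truncate_cleanup_page(mapping, page);   /* was truncate_complete_page() */
                delete_from_page_cache(page);           /* moved out into the caller */
                return 0;
        }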
  14. 11 Jul 2017 (1 commit)
  15. 13 May 2017 (2 commits)
    • mm: fix data corruption due to stale mmap reads · cd656375
      Authored by Jan Kara
      Currently we do not invalidate page tables during
      invalidate_inode_pages2() for DAX.  That can result in, for example, a
      2MiB zero page staying mapped in the page tables even though the
      underlying blocks have already been allocated, so data seen through
      mmap differs from data seen by read(2).  The following sequence
      reproduces the problem:
      
       - open an mmap over a 2MiB hole
      
       - read from a 2MiB hole, faulting in a 2MiB zero page
      
       - write to the hole with write(3p). The write succeeds but we
         incorrectly leave the 2MiB zero page mapping intact.
      
       - via the mmap, read the data that was just written. Since the zero
         page mapping is still intact we read back zeroes instead of the new
         data.
      
      Fix the problem by unconditionally calling invalidate_inode_pages2_range()
      in dax_iomap_actor() for new block allocations and by properly
      invalidating page tables in invalidate_inode_pages2_range() for DAX
      mappings.
      
      Fixes: c6dcf52c
      Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd656375
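      The write-path half of the fix, sketched (simplified from
      dax_iomap_actor(); exact error handling differs upstream):

        if (iomap->flags & IOMAP_F_NEW) {
                /*
                 * New blocks were allocated for this range, so any stale
                 * mapping of the former hole (e.g. the 2MiB zero page)
                 * must be invalidated before the data is exposed.
                 */
                invalidate_inode_pages2_range(inode->i_mapping,
                                              pos >> PAGE_SHIFT,
                                              (pos + length - 1) >> PAGE_SHIFT);
        }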
    • dax: prevent invalidation of mapped DAX entries · 4636e70b
      Authored by Ross Zwisler
      Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
      v4.
      
      This series fixes data corruption that can happen for DAX mounts when
      page faults race with write(2) and as a result page tables get out of
      sync with block mappings in the filesystem and thus data seen through
      mmap is different from data seen through read(2).
      
      The series passes testing with t_mmap_stale test program from Ross and
      also other mmap related tests on DAX filesystem.
      
      This patch (of 4):
      
      dax_invalidate_mapping_entry() currently removes DAX exceptional entries
      only if they are clean and unlocked.  This is done via:
      
        invalidate_mapping_pages()
          invalidate_exceptional_entry()
            dax_invalidate_mapping_entry()
      
      However, for page cache pages removed in invalidate_mapping_pages()
      there is an additional criteria which is that the page must not be
      mapped.  This is noted in the comments above invalidate_mapping_pages()
      and is checked in invalidate_inode_page().
      
      For DAX entries this means that we can end up in a situation where a
      DAX exceptional entry, either a huge zero page or a regular DAX entry,
      is mapped but has no associated radix tree entry.  This is
      inconsistent with the rest of the DAX code and with what happens in
      the page cache case.
      
      We aren't able to unmap the DAX exceptional entry because according to
      its comments invalidate_mapping_pages() isn't allowed to block, and
      unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
      
      Since we essentially never have unmapped DAX entries to evict from the
      radix tree, just remove dax_invalidate_mapping_entry().
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4636e70b
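      After this patch the non-blocking invalidation path simply leaves DAX
      entries alone; roughly:

        static int invalidate_exceptional_entry(struct address_space *mapping,
                                                pgoff_t index, void *entry)
        {
                /* Handled by shmem itself; for DAX, do nothing here. */
                if (shmem_mapping(mapping) || dax_mapping(mapping))
                        return 1;
                clear_shadow_entry(mapping, index, entry);
                return 1;
        }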
  16. 04 May 2017 (2 commits)
  17. 28 Feb 2017 (1 commit)
  18. 25 Feb 2017 (1 commit)
  19. 27 Dec 2016 (1 commit)
    • mm: Invalidate DAX radix tree entries only if appropriate · c6dcf52c
      Authored by Jan Kara
      Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
      just delete all exceptional radix tree entries they find.  For DAX this
      is not desirable, because we track cache dirtiness in these entries, and
      when they are evicted we may fail to flush caches even though flushing
      is necessary.  This can manifest, for example, when we write to the same
      block both via mmap and via write(2) (at different offsets): if the
      modification via write(2) was the last one, fsync(2) then does not
      properly flush CPU caches.
      
      Create appropriate DAX functions to handle invalidation of DAX entries
      for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
      wire them up into the corresponding mm functions.
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      c6dcf52c
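      The wiring on the invalidate_inode_pages2_range() side, roughly (the
      plain, non-sync DAX variant introduced here is later removed by commit
      4636e70b above):

        static int invalidate_exceptional_entry2(struct address_space *mapping,
                                                 pgoff_t index, void *entry)
        {
                /* Dirty DAX entries must be flushed, not blindly dropped. */
                if (dax_mapping(mapping))
                        return dax_invalidate_mapping_entry_sync(mapping, index);
                clear_shadow_entry(mapping, index, entry);
                return 1;
        }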
  20. 13 Dec 2016 (2 commits)
    • mm: workingset: move shadow entry tracking to radix tree exceptional tracking · 14b46879
      Authored by Johannes Weiner
      Currently, we track the shadow entries in the page cache in the upper
      bits of the radix_tree_node->count, behind the back of the radix tree
      implementation.  Because the radix tree code has no awareness of them,
      we rely on random subtleties throughout the implementation (such as the
      node->count != 1 check in the shrinking code, which is meant to exclude
      multi-entry nodes but also happens to skip nodes with only one shadow
      entry, as that's accounted in the upper bits).  This is error prone and
      has, in fact, caused the bug fixed in d3798ae8 ("mm: filemap: don't
      plant shadow entries without radix tree node").
      
      To remove these subtleties, this patch moves shadow entry tracking from
      the upper bits of node->count to the existing counter for exceptional
      entries.  node->count goes back to being a simple counter of valid
      entries in the tree node and can be shrunk to a single byte.
      
      This vastly simplifies the page cache code.  All accounting happens
      natively inside the radix tree implementation, and maintaining the LRU
      linkage of shadow nodes is consolidated into a single function in the
      workingset code that is called for leaf nodes affected by a change in
      the page cache tree.
      
      This also removes the last user of the __radix_delete_node() return
      value.  Eliminate it.
      
      Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14b46879
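      The consolidated hook, in simplified form (the node's own exceptional
      count now drives the shadow-node LRU linkage; locking and the
      callback's private argument are omitted):

        void workingset_update_node(struct radix_tree_node *node)
        {
                if (node->count && node->count == node->exceptional) {
                        /* node holds only shadow entries: track it for reclaim */
                        if (list_empty(&node->private_list))
                                list_lru_add(&shadow_nodes, &node->private_list);
                } else {
                        if (!list_empty(&node->private_list))
                                list_lru_del(&shadow_nodes, &node->private_list);
                }
        }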
    • lib: radix-tree: check accounting of existing slot replacement users · 6d75f366
      Authored by Johannes Weiner
      The bug in khugepaged fixed earlier in this series shows that radix tree
      slot replacement is fragile; and it will become more so when not only
      NULL<->!NULL transitions need to be caught but transitions from and to
      exceptional entries as well.  We need checks.
      
      Re-implement radix_tree_replace_slot() on top of the sanity-checked
      __radix_tree_replace().  This requires existing callers to also pass the
      radix tree root, but it'll warn us when somebody replaces slots with
      contents that need proper accounting (transitions between NULL entries,
      real entries, exceptional entries) and where a replacement through the
      slot pointer would corrupt the radix tree node counts.
      
      Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d75f366
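      At call sites the change looks like this (caller context assumed):

        /* before: no way to sanity-check or account the replacement */
        /*      radix_tree_replace_slot(slot, new_entry);             */

        /* after: the root is passed so node counts can be verified   */
        radix_tree_replace_slot(&mapping->page_tree, slot, new_entry);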
  21. 01 Dec 2016 (1 commit)
  22. 27 Jul 2016 (1 commit)
  23. 20 May 2016 (1 commit)
    • dax: New fault locking · ac401cc7
      Authored by Jan Kara
      Currently DAX page fault locking is racy.
      
      CPU0 (write fault)		CPU1 (read fault)
      
      __dax_fault()			__dax_fault()
        get_block(inode, block, &bh, 0) -> not mapped
      				  get_block(inode, block, &bh, 0)
      				    -> not mapped
        if (!buffer_mapped(&bh))
          if (vmf->flags & FAULT_FLAG_WRITE)
            get_block(inode, block, &bh, 1) -> allocates blocks
        if (page) -> no
      				  if (!buffer_mapped(&bh))
      				    if (vmf->flags & FAULT_FLAG_WRITE) {
      				    } else {
      				      dax_load_hole();
      				    }
        dax_insert_mapping()
      
      And we are in a situation where we fail in dax_radix_entry() with -EIO.
      
      Another problem with the current DAX page fault locking is that there
      is no race-free way to clear the dirty tag in the radix tree.  We can
      always end up with a clean radix tree and dirty data in the CPU cache.
      
      We fix the first problem by introducing locking of exceptional radix
      tree entries in DAX mappings acting very similarly to page lock and thus
      synchronizing properly faults against the same mapping index. The same
      lock can later be used to avoid races when clearing radix tree dirty
      tag.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      ac401cc7
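      The per-index serialisation, sketched with the entry-locking helpers
      this series introduces (simplified; PMD handling and error paths
      omitted):

        entry = grab_mapping_entry(mapping, index);      /* lock or create the entry */
        if (IS_ERR(entry))
                return VM_FAULT_OOM;

        /* ... map blocks and install the PTE while the entry is held ... */

        put_locked_mapping_entry(mapping, index, entry); /* unlock, wake any waiters */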
  24. 05 Apr 2016 (1 commit)
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Authored by Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to
      implement the page cache with chunks bigger than PAGE_SIZE.

      That promise never materialized, and it is unlikely that it ever will.

      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it is a constant source of confusion whether the
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.

      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
      much breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is reverting the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.

      There are a few places in the code that coccinelle did not reach; I
      will fix them manually in a separate patch.  Comments and
      documentation will also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
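      At a typical call site the generated change reduces to:

        /* before */
        index = pos >> PAGE_CACHE_SHIFT;
        page_cache_release(page);

        /* after */
        index = pos >> PAGE_SHIFT;
        put_page(page);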
  25. 16 Mar 2016 (3 commits)
  26. 23 Jan 2016 (1 commit)
    • dax: support dirty DAX entries in radix tree · f9fe48be
      Authored by Ross Zwisler
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
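      The core idea, sketched (simplified: dax_entry stands for the encoded
      exceptional entry carrying the writeback address, and error handling
      plus replacement of existing entries are omitted):

        spin_lock_irq(&mapping->tree_lock);
        /* record the dirty DAX PTE/PMD as an exceptional entry ... */
        radix_tree_insert(&mapping->page_tree, index, dax_entry);
        /* ... and tag it so fsync/msync can find what to flush */
        radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
        spin_unlock_irq(&mapping->tree_lock);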
  27. 02 Jun 2015 (1 commit)
    • memcg: add per cgroup dirty page accounting · c4843a75
      Authored by Greg Thelen
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat(), which returns the memcg later
      needed for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c4843a75
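      The changelog's update pattern, expanded slightly for concreteness
      (argument lists simplified; the accounting runs only when the page was
      not already dirty):

        struct mem_cgroup *memcg;

        memcg = mem_cgroup_begin_page_stat(page);   /* serialises against memcg moves */
        if (!TestSetPageDirty(page)) {
                /* newly dirtied: bump the per-memcg dirty counter too */
                mem_cgroup_update_page_stat(memcg, MEM_CGROUP_STAT_DIRTY, 1);
        }
        mem_cgroup_end_page_stat(memcg);            /* often just rcu_read_unlock() */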