1. 01 October 2016, 1 commit
    • mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page() · 22f2ac51
      Committed by Johannes Weiner
      Antonio reports the following crash when using fuse under memory pressure:
      
        kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: all of them
        CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
        Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
        task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
        RIP: shadow_lru_isolate+0x181/0x190
        Call Trace:
          __list_lru_walk_one.isra.3+0x8f/0x130
          list_lru_walk_one+0x23/0x30
          scan_shadow_nodes+0x34/0x50
          shrink_slab.part.40+0x1ed/0x3d0
          shrink_zone+0x2ca/0x2e0
          kswapd+0x51e/0x990
          kthread+0xd8/0xf0
          ret_from_fork+0x3f/0x70
      
      which corresponds to the following sanity check in the shadow node
      tracking:
      
        BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
      
      The workingset code tracks radix tree nodes that exclusively contain
      shadow entries of evicted pages in them, and this (somewhat obscure)
      line checks whether there are real pages left that would interfere with
      reclaim of the radix tree node under memory pressure.
      
      While discussing ways how fuse might sneak pages into the radix tree
      past the workingset code, Miklos pointed to replace_page_cache_page(),
      and indeed there is a problem there: it properly accounts for the old
      page being removed - __delete_from_page_cache() does that - but then
      does a raw radix_tree_insert(), not accounting for the replacement
      page.  Eventually the page count bits in node->count underflow while
      leaving the node incorrectly linked to the shadow node LRU.
      
      To address this, make sure replace_page_cache_page() uses the tracked
      page insertion code, page_cache_tree_insert().  This fixes the page
      accounting and makes sure page-containing nodes are properly unlinked
      from the shadow node LRU again.
      
      Also, make the sanity checks a bit less obscure by using the helpers for
      checking the number of pages and shadows in a radix tree node.
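
      For illustration, a minimal sketch of the shape of that fix, assuming the
      mm/filemap.c helpers of that era (page_cache_tree_insert(),
      __delete_from_page_cache()); error handling, statistics and the new-page
      setup are omitted:

        /* Sketch: replace the old page via the tracked insertion path. */
        static int replace_page_cache_page_sketch(struct page *old, struct page *new)
        {
                struct address_space *mapping = old->mapping;
                unsigned long flags;
                int error;

                spin_lock_irqsave(&mapping->tree_lock, flags);
                __delete_from_page_cache(old, NULL);    /* accounts for the old page */
                /*
                 * Unlike a raw radix_tree_insert(), page_cache_tree_insert()
                 * updates the node page count and unlinks the node from the
                 * shadow node LRU once real pages are present again.
                 */
                error = page_cache_tree_insert(mapping, new, NULL);
                spin_unlock_irqrestore(&mapping->tree_lock, flags);
                return error;
        }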
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
      Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
      Cc: <stable@vger.kernel.org>	[3.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 08 August 2016, 1 commit
  3. 05 August 2016, 1 commit
  4. 29 July 2016, 3 commits
  5. 27 July 2016, 4 commits
  6. 25 June 2016, 1 commit
  7. 21 May 2016, 4 commits
    • radix-tree: introduce radix_tree_replace_clear_tags() · d604c324
      Committed by Matthew Wilcox
      In addition to replacing the entry, we also clear all associated tags.
      This is really a one-off special for page_cache_tree_delete() which had
      far too much detailed knowledge about how the radix tree works.
      
      For efficiency, factor node_tag_clear() out of radix_tree_tag_clear().  It
      can be used by radix_tree_delete_item() as well as
      radix_tree_replace_clear_tags().
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make faultaround produce old ptes · 5c0a85fa
      Committed by Kirill A. Shutemov
      Currently, faultaround code produces young pte.  This can screw up
      vmscan behaviour[1], as it makes vmscan think that these pages are hot
      and not push them out on first round.
      
      During sparse file access faultaround gets more pages mapped and all of
      them are young.  Under memory pressure, this makes vmscan swap out anon
      pages instead, or to drop other page cache pages which otherwise stay
      resident.
      
      Modify faultaround to produce old ptes, so they can easily be reclaimed
      under memory pressure.
      
      This can to some extent defeat the purpose of faultaround on machines
      without hardware accessed bit as it will not help us with reducing the
      number of minor page faults.
      
      We may want to disable faultaround on such machines altogether, but
      that's subject for separate patchset.
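
      As a rough illustration of the idea (a fragment only: vma, page, ptep,
      address and the fault_around flag are assumed from the surrounding fault
      path; the actual patch plumbs an "old" flag through the fault code):

        /* Sketch: map speculative faultaround pages with the accessed bit clear. */
        pte_t entry = mk_pte(page, vma->vm_page_prot);

        if (fault_around)
                entry = pte_mkold(entry);       /* reclaimable on the first scan */
        else
                entry = pte_mkyoung(entry);     /* the address actually faulted on */
        set_pte_at(vma->vm_mm, address, ptep, entry);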
      
      Minchan:
       "I tested 512M mmap sequential word read test on non-HW access bit
        system (i.e., ARM) and confirmed it doesn't increase minor fault any
        more.
      
        old: 4096 fault_around
        minor fault: 131291
        elapsed time: 6747645 usec
      
        new: 65536 fault_around
        minor fault: 131291
        elapsed time: 6709263 usec
      
        0.56% benefit"
      
      [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
      
      Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: filemap: only do access activations on reads · bbddabe2
      Committed by Johannes Weiner
      Andres observed that his database workload is struggling with the
      transaction journal creating pressure on frequently read pages.
      
      Access patterns like transaction journals frequently write the same
      pages over and over, but in the majority of cases those pages are never
      read back.  There are no caching benefits to be had for those pages, so
      activating them and having them put pressure on pages that do benefit
      from caching is a bad choice.
      
      Leave page activations to read accesses and don't promote pages based on
      writes alone.
      
      It could be said that partially written pages do contain cache-worthy
      data, because even if *userspace* does not access the unwritten part,
      the kernel still has to read it from the filesystem for correctness.
      However, a counter argument is that these pages enjoy at least *some*
      protection over other inactive file pages through the writeback cache,
      in the sense that dirty pages are written back with a delay and cache
      reclaim leaves them alone until they have been written back to disk.
      Should that turn out to be insufficient and we see increased read IO
      from partial writes under memory pressure, we can always go back and
      update grab_cache_page_write_begin() to take (pos, len) so that it can
      tell partial writes from pages that don't need partial reads.  But for
      now, keep it simple.
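
      A hedged sketch of what that looks like in the write-begin path, assuming
      the pagecache_get_page()/FGP_* interface of mm/filemap.c (the exact flag
      set used by the mainline patch may differ):

        /* Sketch: request the page for writing without marking it accessed. */
        struct page *grab_cache_page_write_begin_sketch(struct address_space *mapping,
                                                        pgoff_t index, unsigned flags)
        {
                int fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT;   /* no FGP_ACCESSED */

                if (flags & AOP_FLAG_NOFS)
                        fgp_flags |= FGP_NOFS;
                return pagecache_get_page(mapping, index, fgp_flags,
                                          mapping_gfp_mask(mapping));
        }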
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Andres Freund <andres@anarazel.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: only do workingset activations on reads · f0281a00
      Committed by Rik van Riel
      This is a follow-up to
      
        http://www.spinics.net/lists/linux-mm/msg101739.html
      
      where Andres reported his database workingset being pushed out by the
      minimum size enforcement of the inactive file list - currently 50% of
      cache - as well as repeatedly written file pages that are never actually
      read.
      
      Two changes fell out of the discussions.  The first change observes that
      pages that are only ever written don't benefit from caching beyond what
      the writeback cache does for partial page writes, and so we shouldn't
      promote them to the active file list where they compete with pages whose
      cached data is actually accessed repeatedly.  This change comes in two
      patches - one for in-cache write accesses and one for refaults triggered
      by writes, neither of which should promote a cache page.
      
      Second, with the refault detection we don't need to set 50% of the cache
      aside for used-once cache anymore since we can detect frequently used
      pages even when they are evicted between accesses.  We can allow the
      active list to be bigger and thus protect a bigger workingset that isn't
      challenged by streamers.  Depending on the access patterns, this can
      increase major faults during workingset transitions for better
      performance during stable phases.
      
      This patch (of 3):
      
      When rewriting a page, the data in that page is replaced with new data.
      This means that evicting something else from the active file list, in
      order to cache data that will be replaced by something else, is likely
      to be a waste of memory.
      
      It is better to save the active list for frequently read pages, because
      reads actually use the data that is in the page.
      
      This patch ignores partial writes, because it is unclear whether the
      complexity of identifying those is worth any potential performance gain
      obtained from better caching pages that see repeated partial writes at
      large enough intervals to not get caught by the use-twice promotion code
      used for the inactive file list.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 20 May 2016, 3 commits
  9. 02 May 2016, 5 commits
  10. 05 April 2016, 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it's a constant source of confusion whether the
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using the
      script below.  For some reason, coccinelle doesn't patch header files;
      I've called spatch on them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 26 March 2016, 1 commit
    • mm/filemap: generic_file_read_iter(): check for zero reads unconditionally · e7080a43
      Committed by Nicolai Stange
      If
       - generic_file_read_iter() gets called with a zero read length,
       - the read offset is at a page boundary,
       - IOCB_DIRECT is not set,
       - and the page in question hasn't made it into the page cache yet,
      then do_generic_file_read() will trigger a readahead with a req_size hint
      of zero.
      
      Since roundup_pow_of_two(0) is undefined, UBSAN reports
      
        UBSAN: Undefined behaviour in include/linux/log2.h:63:13
        shift exponent 64 is too large for 64-bit type 'long unsigned int'
        CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
        [...]
        Call Trace:
         [...]
         [<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
         [<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
         [<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
         [<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
         [<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
         [<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
         [...]
         [<ffffffff81510b06>] __vfs_read+0x256/0x3d0
         [...]
      
      when get_init_ra_size() gets called from ondemand_readahead().
      
      The net effect is that the initial readahead size is arch dependent for
      requested read lengths of zero: for example, since
      
        1UL << (sizeof(unsigned long) * 8)
      
      evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
      size becomes 4 on the former and 0 on the latter.
      
      What's more, whether or not the file access timestamp is updated for zero
      length reads is decided differently for the two cases of IOCB_DIRECT
      being set or cleared: in the first case, generic_file_read_iter()
      explicitly skips updating that timestamp while in the latter case, it is
      always updated through the call to do_generic_file_read().
      
      According to POSIX, zero length reads "do not modify the last data access
      timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.
      
      Let generic_file_read_iter() unconditionally check the requested read
      length at its entry and return immediately with success if it is zero.
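
      A minimal sketch of that early return, assuming the iov_iter-based entry
      point of mm/filemap.c (the rest of the function body is elided):

        /* Sketch: zero-length reads return 0 before readahead or atime updates. */
        ssize_t generic_file_read_iter_sketch(struct kiocb *iocb, struct iov_iter *iter)
        {
                size_t count = iov_iter_count(iter);
                ssize_t retval = 0;

                if (!count)
                        return 0;       /* POSIX: no data access timestamp update */

                /* ... existing buffered / O_DIRECT read path fills in retval ... */
                return retval;
        }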
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 18 March 2016, 2 commits
    • mm: use radix_tree_iter_retry() · 2cf938aa
      Committed by Matthew Wilcox
      Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
      restart from our current position.  This will make a difference when
      there are more ways to happen across an indirect pointer.  And it
      eliminates some confusing gotos.
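
      The resulting iteration pattern looks roughly like this (a sketch over the
      page-cache radix tree; exceptional-entry handling is elided):

        /* Sketch: restart a slot walk in place instead of 'goto restart'. */
        struct radix_tree_iter iter;
        void **slot;

        rcu_read_lock();
        radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                struct page *page = radix_tree_deref_slot(slot);

                if (unlikely(!page))
                        continue;
                if (radix_tree_deref_retry(page)) {
                        /* Re-read the slot at the current index and keep going. */
                        slot = radix_tree_iter_retry(&iter);
                        continue;
                }
                /* ... process the page ... */
        }
        rcu_read_unlock();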
      
      [vbabka@suse.cz: remove now-obsolete-and-misleading comment]
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: add support for multi-order entries · e6145236
      Committed by Matthew Wilcox
      With huge pages, it is convenient to have the radix tree be able to
      return an entry that covers multiple indices.  Previous attempts to deal
      with the problem have involved inserting N duplicate entries, which is a
      waste of memory and leads to problems trying to handle aliased tags, or
      probing the tree multiple times to find alternative entries which might
      cover the requested index.
      
      This approach inserts one canonical entry into the tree for a given
      range of indices, and may also insert other entries in order to ensure
      that lookups find the canonical entry.
      
      This solution only tolerates inserting powers of two that are greater
      than the fanout of the tree.  If we wish to expand the radix tree's
      abilities to support large-ish pages that are smaller than the fanout at the
      penultimate level of the tree, then we would need to add one more step
      in lookup to ensure that any sibling nodes in the final level of the
      tree are dereferenced and we return the canonical entry that they
      reference.
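
      A hedged caller-side sketch, assuming an __radix_tree_insert() entry point
      that takes an order argument (the exact interface introduced by this
      series may differ):

        /* Sketch: insert one canonical entry covering 2^order contiguous indices. */
        static int insert_multiorder_entry(struct radix_tree_root *root,
                                           unsigned long index, void *entry,
                                           unsigned order)
        {
                /* order == 0 behaves like a plain radix_tree_insert() */
                return __radix_tree_insert(root, index, order, entry);
        }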
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 16 March 2016, 6 commits
  14. 10 March 2016, 1 commit
    • mm: __delete_from_page_cache show Bad page if mapped · 06b241f3
      Committed by Hugh Dickins
      Commit e1534ae9 ("mm: differentiate page_mapped() from
      page_mapcount() for compound pages") changed the famous
      BUG_ON(page_mapped(page)) in __delete_from_page_cache() to
      VM_BUG_ON_PAGE(page_mapped(page)): which gives us more info when
      CONFIG_DEBUG_VM=y, but nothing at all when not.
      
      Although it has not usually been very helpful, being hit long after the
      error in question, we do need to know if it actually happens on users'
      systems; but reinstating a crash there is likely to be opposed :)
      
      In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
      dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
      but that seems to be the standard procedure now.  Move that, or the
      VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
      unNULLified page->mapping gives a little more information.
      
      If the inode is being evicted (rather than truncated), it won't have any
      vmas left, so it's safe(ish) to assume that the raised mapcount is
      erroneous, and we can discount it from page_count to avoid leaking the
      page (I'm less worried by leaking the occasional 4kB, than losing a
      potential 2MB page with each 4kB page leaked).
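
      A sketch of the non-debug report described above, using the standard
      taint/dump helpers (the page-count discount for the eviction case is
      omitted here):

        /* Sketch: in __delete_from_page_cache(), before the tree deletion. */
        if (page_mapped(page)) {
                pr_alert("BUG: Bad page cache in process %s  pfn:%05lx\n",
                         current->comm, page_to_pfn(page));
                dump_page(page, "still mapped when deleted");
                dump_stack();
                add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
        }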
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 28 February 2016, 1 commit
    • dax: move writeback calls into the filesystems · 7f6d5b52
      Committed by Ross Zwisler
      Previously calls to dax_writeback_mapping_range() for all DAX filesystems
      (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
      
      dax_writeback_mapping_range() needs a struct block_device, and it used
      to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
      mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
      block devices and for XFS real-time files.
      
      Instead, call dax_writeback_mapping_range() directly from the filesystem
      ->writepages function so that it can supply us with a valid block
      device.  This also fixes DAX code to properly flush caches in response
      to sync(2).
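
      Roughly, each DAX filesystem's ->writepages gains a hook of this shape (a
      sketch modelled on the ext2 case; which block device to pass in is the
      filesystem's call, which is the point of the change):

        /* Sketch: a filesystem ->writepages calling into DAX directly. */
        static int fs_writepages_sketch(struct address_space *mapping,
                                        struct writeback_control *wbc)
        {
                if (dax_mapping(mapping))
                        return dax_writeback_mapping_range(mapping,
                                        mapping->host->i_sb->s_bdev, wbc);

                return generic_writepages(mapping, wbc);
        }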
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 12 February 2016, 1 commit
  17. 23 January 2016, 4 commits
    • dax: add support for fsync/sync · 9973c98e
      Committed by Ross Zwisler
      To properly handle fsync/msync in an efficient way DAX needs to track
      dirty pages so it is able to flush them durably to media on demand.
      
      The tracking of dirty pages is done via the radix tree in struct
      address_space.  This radix tree is already used by the page writeback
      infrastructure for tracking dirty pages associated with an open file,
      and it already has support for exceptional (non struct page*) entries.
      We build upon these features to add exceptional entries to the radix
      tree for DAX dirty PMD or PTE pages at fault time.
      
      [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add find_get_entries_tag() · 7e7f7749
      Committed by Ross Zwisler
      Add find_get_entries_tag() to the family of functions that include
      find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
      needed for DAX dirty page handling because we need a list of both page
      offsets and radix tree entries ('indices' and 'entries' in this
      function) that are marked with the PAGECACHE_TAG_TOWRITE tag.
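
      A hedged sketch of how a caller might walk those tagged entries, batching
      through a struct pagevec the way the other find_get_* users do (for DAX
      the returned entries are exceptional radix-tree values, not struct pages,
      so nothing has to be released afterwards):

        /* Sketch: collect TOWRITE-tagged entries in batches of PAGEVEC_SIZE. */
        pgoff_t indices[PAGEVEC_SIZE];
        struct pagevec pvec;
        pgoff_t start = 0;
        unsigned i;

        for (;;) {
                pvec.nr = find_get_entries_tag(mapping, start,
                                PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
                                pvec.pages, indices);
                if (!pvec.nr)
                        break;
                for (i = 0; i < pvec.nr; i++) {
                        /* ... flush the entry at page offset indices[i] ... */
                }
                start = indices[pvec.nr - 1] + 1;
        }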
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: support dirty DAX entries in radix tree · f9fe48be
      Committed by Ross Zwisler
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wrappers for ->i_mutex access · 5955102c
      Committed by Al Viro
      Add inode_foo(inode) wrappers, parallel to
      mutex_{lock,unlock,trylock,is_locked,lock_nested}, with inode_foo(inode)
      expanding to mutex_foo(&inode->i_mutex).
      
      Please use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become an rwsem, with ->lookup() done with it held
      only shared.
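
      The wrappers are roughly of this shape (only a few of the variants are
      shown):

        /* Sketch: hide the lock type behind inode_*() helpers. */
        static inline void inode_lock(struct inode *inode)
        {
                mutex_lock(&inode->i_mutex);
        }

        static inline void inode_unlock(struct inode *inode)
        {
                mutex_unlock(&inode->i_mutex);
        }

        static inline int inode_is_locked(struct inode *inode)
        {
                return mutex_is_locked(&inode->i_mutex);
        }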
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>