1. 06 Oct 2016 (2 commits)
    • mm: filemap: fix mapping->nrpages double accounting in fuse · 3ddf40e8
      By Johannes Weiner
      Commit 22f2ac51 ("mm: workingset: fix crash in shadow node shrinker
      caused by replace_page_cache_page()") switched replace_page_cache() from
      raw radix tree operations to page_cache_tree_insert() but didn't take
      into account that the latter function, unlike the raw radix tree op,
      handles mapping->nrpages.  As a result, that counter is bumped for each
      page replacement rather than kept balanced.
      
      The mapping->nrpages counter is used to skip needless radix tree walks
      when invalidating, truncating, syncing inodes without pages, as well as
      statistics for userspace.  Since the error is positive, we'll do more
      page cache tree walks than necessary; we won't miss a necessary one.
      And we'll report more buffer pages to userspace than there are.  The
      error is limited to fuse inodes.
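      
      A sketch of the shape of the fix (abbreviated, not the verbatim diff):
      once page_cache_tree_insert() maintains mapping->nrpages itself,
      replace_page_cache_page() must not bump the counter a second time.
      
        spin_lock_irq(&mapping->tree_lock);
        __delete_from_page_cache(old, NULL);    /* decrements nrpages */
        error = page_cache_tree_insert(mapping, new, NULL);
        BUG_ON(error);
        /*
         * No mapping->nrpages++ here: page_cache_tree_insert() already
         * accounts for the new page.  Doing both was the double count.
         */
        spin_unlock_irq(&mapping->tree_lock);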
      
      Fixes: 22f2ac51 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: filemap: don't plant shadow entries without radix tree node · d3798ae8
      By Johannes Weiner
      When the underflow checks were added to workingset_node_shadow_dec(),
      they triggered immediately:
      
        kernel BUG at ./include/linux/swap.h:276!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
         soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
        CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60b #1
        Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
        task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
        RIP: page_cache_tree_insert+0xf1/0x100
        Call Trace:
          __add_to_page_cache_locked+0x12e/0x270
          add_to_page_cache_lru+0x4e/0xe0
          mpage_readpages+0x112/0x1d0
          blkdev_readpages+0x1d/0x20
          __do_page_cache_readahead+0x1ad/0x290
          force_page_cache_readahead+0xaa/0x100
          page_cache_sync_readahead+0x3f/0x50
          generic_file_read_iter+0x5af/0x740
          blkdev_read_iter+0x35/0x40
          __vfs_read+0xe1/0x130
          vfs_read+0x96/0x130
          SyS_read+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x13/0x8f
        Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
        RIP  page_cache_tree_insert+0xf1/0x100
      
      This is a long-standing bug in the way shadow entries are accounted in
      the radix tree nodes. The shrinker needs to know when radix tree nodes
      contain only shadow entries, no pages, so node->count is split in half
      to count shadows in the upper bits and pages in the lower bits.
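      
      The split looks roughly like the helpers that lived in
      include/linux/swap.h around this time (a sketch; the exact constants
      are defined alongside the radix tree internals):
      
        /* Upper bits of node->count track shadows, lower bits track pages. */
        #define RADIX_TREE_COUNT_SHIFT  (RADIX_TREE_MAP_SHIFT + 1)
        #define RADIX_TREE_COUNT_MASK   ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
        
        static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
        {
                return node->count & RADIX_TREE_COUNT_MASK;
        }
        
        static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
        {
                return node->count >> RADIX_TREE_COUNT_SHIFT;
        }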
      
      Unfortunately, the radix tree implementation doesn't know of this and
      assumes all entries are in node->count. When there is a shadow entry
      directly in root->rnode and the tree is later extended, the radix tree
      implementation will copy that entry into the new node and bump its
      node->count, i.e. increase the page count bits.  Once the shadow gets
      removed and we subtract from the upper counter, node->count underflows
      and triggers the warning. Afterwards, without node->count reaching 0
      again, the radix tree node is leaked.
      
      Limit shadow entries to when we have actual radix tree nodes and can
      count them properly. That means we lose the ability to detect refaults
      from files that had only the first page faulted in at eviction time.
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 03 Oct 2016 (1 commit)
  3. 01 Oct 2016 (1 commit)
    • mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page() · 22f2ac51
      By Johannes Weiner
      Antonio reports the following crash when using fuse under memory pressure:
      
        kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: all of them
        CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
        Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
        task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
        RIP: shadow_lru_isolate+0x181/0x190
        Call Trace:
          __list_lru_walk_one.isra.3+0x8f/0x130
          list_lru_walk_one+0x23/0x30
          scan_shadow_nodes+0x34/0x50
          shrink_slab.part.40+0x1ed/0x3d0
          shrink_zone+0x2ca/0x2e0
          kswapd+0x51e/0x990
          kthread+0xd8/0xf0
          ret_from_fork+0x3f/0x70
      
      which corresponds to the following sanity check in the shadow node
      tracking:
      
        BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
      
      The workingset code tracks radix tree nodes that exclusively contain
      shadow entries of evicted pages in them, and this (somewhat obscure)
      line checks whether there are real pages left that would interfere with
      reclaim of the radix tree node under memory pressure.
      
      While discussing ways how fuse might sneak pages into the radix tree
      past the workingset code, Miklos pointed to replace_page_cache_page(),
      and indeed there is a problem there: it properly accounts for the old
      page being removed - __delete_from_page_cache() does that - but then
      does a raw radix_tree_insert(), not accounting for the replacement
      page.  Eventually the page count bits in node->count underflow while
      leaving the node incorrectly linked to the shadow node LRU.
      
      To address this, make sure replace_page_cache_page() uses the tracked
      page insertion code, page_cache_tree_insert().  This fixes the page
      accounting and makes sure page-containing nodes are properly unlinked
      from the shadow node LRU again.
      
      Also, make the sanity checks a bit less obscure by using the helpers for
      checking the number of pages and shadows in a radix tree node.
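      
      With those helpers, the sanity check in shadow_lru_isolate() can read
      as something like the following sketch instead of a raw mask test
      (illustrative; the real code may bail out more gracefully):
      
        /* A node on the shadow LRU must hold shadows and no real pages. */
        BUG_ON(!workingset_node_shadows(node));
        BUG_ON(workingset_node_pages(node));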
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
      Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
      Cc: <stable@vger.kernel.org>	[3.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 08 Aug 2016 (1 commit)
  5. 05 Aug 2016 (1 commit)
  6. 29 Jul 2016 (3 commits)
  7. 27 Jul 2016 (4 commits)
  8. 25 Jun 2016 (1 commit)
  9. 21 May 2016 (4 commits)
    • radix-tree: introduce radix_tree_replace_clear_tags() · d604c324
      By Matthew Wilcox
      In addition to replacing the entry, we also clear all associated tags.
      This is really a one-off special for page_cache_tree_delete() which had
      far too much detailed knowledge about how the radix tree works.
      
      For efficiency, factor node_tag_clear() out of radix_tree_tag_clear().
      It can be used by radix_tree_delete_item() as well as
      radix_tree_replace_clear_tags().
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make faultaround produce old ptes · 5c0a85fa
      By Kirill A. Shutemov
      Currently, the faultaround code produces young ptes.  This can screw up
      vmscan behaviour [1], as it makes vmscan think that these pages are hot
      and not push them out in the first round.
      
      During sparse file access faultaround gets more pages mapped and all of
      them are young.  Under memory pressure, this makes vmscan swap out anon
      pages instead, or drop other page cache pages which otherwise stay
      resident.
      
      Modify faultaround to produce old ptes, so they can easily be reclaimed
      under memory pressure.
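      
      Mechanically this amounts to clearing the accessed bit on the ptes that
      faultaround installs; a sketch of the idea (the actual patch threads a
      flag through do_set_pte(); fault_around here is an illustrative
      variable, not a real kernel symbol):
      
        pte_t entry = mk_pte(page, vma->vm_page_prot);
        
        if (write)
                entry = maybe_mkwrite(pte_mkdirty(entry), vma);
        if (fault_around)                 /* speculative neighbour, not the */
                entry = pte_mkold(entry); /* address that actually faulted */
        set_pte_at(vma->vm_mm, address, pte, entry);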
      
      This can, to some extent, defeat the purpose of faultaround on machines
      without a hardware accessed bit, as it will not help us with reducing
      the number of minor page faults.
      
      We may want to disable faultaround on such machines altogether, but
      that's a subject for a separate patchset.
      
      Minchan:
       "I tested 512M mmap sequential word read test on non-HW access bit
        system (i.e., ARM) and confirmed it doesn't increase minor fault any
        more.
      
        old: 4096 fault_around
        minor fault: 131291
        elapsed time: 6747645 usec
      
        new: 65536 fault_around
        minor fault: 131291
        elapsed time: 6709263 usec
      
        0.56% benefit"
      
      [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org
      
      Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: filemap: only do access activations on reads · bbddabe2
      By Johannes Weiner
      Andres observed that his database workload is struggling with the
      transaction journal creating pressure on frequently read pages.
      
      Access patterns like transaction journals frequently write the same
      pages over and over, but in the majority of cases those pages are never
      read back.  There are no caching benefits to be had for those pages, so
      activating them and having them put pressure on pages that do benefit
      from caching is a bad choice.
      
      Leave page activations to read accesses and don't promote pages based on
      writes alone.
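      
      Concretely, the write path stops asking for the "accessed" treatment
      when it grabs pages; a sketch in grab_cache_page_write_begin() terms
      (abbreviated, assuming the FGP_* flags of this era):
      
        /* Dropping FGP_ACCESSED means pages touched only by writes are
         * no longer marked referenced and thus never write-activated. */
        int fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT;
        
        if (flags & AOP_FLAG_NOFS)
                fgp_flags |= FGP_NOFS;
        page = pagecache_get_page(mapping, index, fgp_flags,
                                  mapping_gfp_mask(mapping));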
      
      It could be said that partially written pages do contain cache-worthy
      data, because even if *userspace* does not access the unwritten part,
      the kernel still has to read it from the filesystem for correctness.
      However, a counter argument is that these pages enjoy at least *some*
      protection over other inactive file pages through the writeback cache,
      in the sense that dirty pages are written back with a delay and cache
      reclaim leaves them alone until they have been written back to disk.
      Should that turn out to be insufficient and we see increased read IO
      from partial writes under memory pressure, we can always go back and
      update grab_cache_page_write_begin() to take (pos, len) so that it can
      tell partial writes from pages that don't need partial reads.  But for
      now, keep it simple.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Andres Freund <andres@anarazel.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: workingset: only do workingset activations on reads · f0281a00
      By Rik van Riel
      This is a follow-up to
      
        http://www.spinics.net/lists/linux-mm/msg101739.html
      
      where Andres reported his database workingset being pushed out by the
      minimum size enforcement of the inactive file list - currently 50% of
      cache - as well as repeatedly written file pages that are never actually
      read.
      
      Two changes fell out of the discussions.  The first change observes that
      pages that are only ever written don't benefit from caching beyond what
      the writeback cache does for partial page writes, and so we shouldn't
      promote them to the active file list where they compete with pages whose
      cached data is actually accessed repeatedly.  This change comes in two
      patches - one for in-cache write accesses and one for refaults triggered
      by writes, neither of which should promote a cache page.
      
      Second, with the refault detection we don't need to set 50% of the cache
      aside for used-once cache anymore since we can detect frequently used
      pages even when they are evicted between accesses.  We can allow the
      active list to be bigger and thus protect a bigger workingset that isn't
      challenged by streamers.  Depending on the access patterns, this can
      increase major faults during workingset transitions for better
      performance during stable phases.
      
      This patch (of 3):
      
      When rewriting a page, the data in that page is replaced with new data.
      This means that evicting something else from the active file list, in
      order to cache data that will be replaced by something else, is likely
      to be a waste of memory.
      
      It is better to save the active list for frequently read pages, because
      reads actually use the data that is in the page.
      
      This patch ignores partial writes, because it is unclear whether the
      complexity of identifying those is worth any potential performance gain
      obtained from better caching pages that see repeated partial writes at
      large enough intervals to not get caught by the use-twice promotion code
      used for the inactive file list.
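      
      In add_to_page_cache_lru() terms, write-triggered refaults are told
      apart by the caller's gfp mask; a sketch of the gating (close to, but
      not guaranteed to be, the verbatim patch):
      
        /* Only count a refault as a workingset activation on reads;
         * __GFP_WRITE marks pages being allocated for a write. */
        if (!(gfp_mask & __GFP_WRITE) && shadow && workingset_refault(shadow)) {
                SetPageActive(page);
                workingset_activation(page);
        }
        lru_cache_add(page);
      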
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 20 May 2016 (3 commits)
  11. 02 May 2016 (5 commits)
  12. 05 Apr 2016 (1 commit)
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      By Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to
      implement the page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it's a constant source of confusion whether the
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straightforward; an illustrative before/after
      example follows the list:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
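      
      In practice the conversion is this kind of mechanical rewrite
      (an illustrative before/after, not taken from any specific file):
      
        /* before */
        index  = pos >> PAGE_CACHE_SHIFT;
        offset = pos & ~PAGE_CACHE_MASK;
        page_cache_get(page);
        page_cache_release(page);
        
        /* after */
        index  = pos >> PAGE_SHIFT;
        offset = pos & ~PAGE_MASK;
        get_page(page);
        put_page(page);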
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 26 Mar 2016 (1 commit)
    • mm/filemap: generic_file_read_iter(): check for zero reads unconditionally · e7080a43
      By Nicolai Stange
      If
       - generic_file_read_iter() gets called with a zero read length,
       - the read offset is at a page boundary,
       - IOCB_DIRECT is not set
       - and the page in question hasn't made it into the page cache yet,
      then do_generic_file_read() will trigger a readahead with a req_size hint
      of zero.
      
      Since roundup_pow_of_two(0) is undefined, UBSAN reports
      
        UBSAN: Undefined behaviour in include/linux/log2.h:63:13
        shift exponent 64 is too large for 64-bit type 'long unsigned int'
        CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14
        [...]
        Call Trace:
         [...]
         [<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0
         [<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0
         [<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210
         [<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0
         [<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90
         [<ffffffff813cc955>] generic_file_read_iter+0x185/0x420
         [...]
         [<ffffffff81510b06>] __vfs_read+0x256/0x3d0
         [...]
      
      when get_init_ra_size() gets called from ondemand_readahead().
      
      The net effect is that the initial readahead size is arch dependent for
      requested read lengths of zero: for example, since
      
        1UL << (sizeof(unsigned long) * 8)
      
      evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead
      size becomes 4 on the former and 0 on the latter.
      
      What's more, whether or not the file access timestamp is updated for zero
      length reads is decided differently for the two cases of IOCB_DIRECT
      being set or cleared: in the first case, generic_file_read_iter()
      explicitly skips updating that timestamp while in the latter case, it is
      always updated through the call to do_generic_file_read().
      
      According to POSIX, zero length reads "do not modify the last data access
      timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct.
      
      Let generic_file_read_iter() unconditionally check the requested read
      length at its entry and return immediately with success if it is zero.
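      
      A sketch of the resulting entry check (matching the described
      behaviour; do_generic_file_read_path is an illustrative stand-in for
      the rest of the function, not a real kernel symbol):
      
        ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
        {
                if (!iov_iter_count(iter))
                        return 0;  /* zero-length read: skip atime, per POSIX */
        
                return do_generic_file_read_path(iocb, iter);
        }
      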
      Signed-off-by: Nicolai Stange <nicstange@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 18 Mar 2016 (2 commits)
    • mm: use radix_tree_iter_retry() · 2cf938aa
      By Matthew Wilcox
      Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
      restart from our current position.  This will make a difference when
      there are more ways to come across an indirect pointer.  And it
      eliminates some confusing gotos.
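      
      The lookup loops then follow a pattern along these lines (a sketch of
      the idiom, not a specific hunk from the patch):
      
        rcu_read_lock();
        radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                struct page *page = radix_tree_deref_slot(slot);
        
                if (unlikely(!page))
                        continue;
                if (radix_tree_deref_retry(page)) {
                        /* The tree mutated under us: retry from the current
                         * index instead of restarting the whole walk. */
                        slot = radix_tree_iter_retry(&iter);
                        continue;
                }
                /* process page */
        }
        rcu_read_unlock();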
      
      [vbabka@suse.cz: remove now-obsolete-and-misleading comment]
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • radix_tree: add support for multi-order entries · e6145236
      By Matthew Wilcox
      With huge pages, it is convenient to have the radix tree be able to
      return an entry that covers multiple indices.  Previous attempts to deal
      with the problem have involved inserting N duplicate entries, which is a
      waste of memory and leads to problems trying to handle aliased tags, or
      probing the tree multiple times to find alternative entries which might
      cover the requested index.
      
      This approach inserts one canonical entry into the tree for a given
      range of indices, and may also insert other entries in order to ensure
      that lookups find the canonical entry.
      
      This solution only tolerates inserting powers of two that are greater
      than the fanout of the tree.  If we wish to expand the radix tree's
      abilities to support large-ish pages that are smaller than the fanout at the
      penultimate level of the tree, then we would need to add one more step
      in lookup to ensure that any sibling nodes in the final level of the
      tree are dereferenced and we return the canonical entry that they
      reference.
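      
      On the API side, insertion gains an order parameter; a sketch of a
      caller storing one canonical entry that covers 2^order indices
      (assuming the __radix_tree_insert() signature this patch introduces):
      
        /* One entry covers [index, index + (1 << order)) instead of
         * (1 << order) duplicate entries. */
        err = __radix_tree_insert(&tree, index & ~((1UL << order) - 1),
                                  order, entry);
      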
      Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 16 Mar 2016 (6 commits)
  16. 10 Mar 2016 (1 commit)
    • mm: __delete_from_page_cache show Bad page if mapped · 06b241f3
      By Hugh Dickins
      Commit e1534ae9 ("mm: differentiate page_mapped() from
      page_mapcount() for compound pages") changed the famous
      BUG_ON(page_mapped(page)) in __delete_from_page_cache() to
      VM_BUG_ON_PAGE(page_mapped(page)): which gives us more info when
      CONFIG_DEBUG_VM=y, but nothing at all when not.
      
      Although it has not usually been very helpful, being hit long after the
      error in question, we do need to know if it actually happens on users'
      systems; but reinstating a crash there is likely to be opposed :)
      
      In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
      dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
      but that seems to be the standard procedure now.  Move that, or the
      VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
      unNULLified page->mapping gives a little more information.
      
      If the inode is being evicted (rather than truncated), it won't have any
      vmas left, so it's safe(ish) to assume that the raised mapcount is
      erroneous, and we can discount it from page_count to avoid leaking the
      page (I'm less worried by leaking the occasional 4kB, than losing a
      potential 2MB page with each 4kB page leaked).
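      
      A sketch of the resulting non-debug path in __delete_from_page_cache()
      (abbreviated from the description above; details may differ from the
      actual patch):
      
        if (unlikely(page_mapped(page))) {
                int mapcount;
        
                pr_alert("BUG: Bad page cache in process %s  pfn:%05lx\n",
                         current->comm, page_to_pfn(page));
                dump_page(page, "still mapped when deleted");
                dump_stack();
                add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
        
                mapcount = page_mapcount(page);
                if (mapping_exiting(mapping) &&
                    page_count(page) >= mapcount + 2) {
                        /* All vmas are gone; the raised mapcount is most
                         * likely erroneous, so discount it rather than
                         * leak the page. */
                        page_mapcount_reset(page);
                        atomic_sub(mapcount, &page->_count);
                }
        }
      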
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 28 Feb 2016 (1 commit)
    • dax: move writeback calls into the filesystems · 7f6d5b52
      By Ross Zwisler
      Previously calls to dax_writeback_mapping_range() for all DAX filesystems
      (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
      
      dax_writeback_mapping_range() needs a struct block_device, and it used
      to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
      mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
      block devices and for XFS real-time files.
      
      Instead, call dax_writeback_mapping_range() directly from the filesystem
      ->writepages function so that it can supply us with a valid block
      device.  This also fixes DAX code to properly flush caches in response
      to sync(2).
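      
      The per-filesystem hook then looks something like this sketch
      (example_dax_writepages is an illustrative name modelled on the
      ext2/ext4/xfs variants; a filesystem with multiple devices would pass
      its data device rather than i_sb->s_bdev):
      
        static int example_dax_writepages(struct address_space *mapping,
                                          struct writeback_control *wbc)
        {
                struct inode *inode = mapping->host;
        
                if (dax_mapping(mapping))
                        return dax_writeback_mapping_range(mapping,
                                        inode->i_sb->s_bdev, wbc);
                return generic_writepages(mapping, wbc);
        }
      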
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 12 Feb 2016 (1 commit)
  19. 23 Jan 2016 (1 commit)
    • dax: add support for fsync/sync · 9973c98e
      By Ross Zwisler
      To properly handle fsync/msync in an efficient way DAX needs to track
      dirty pages so it is able to flush them durably to media on demand.
      
      The tracking of dirty pages is done via the radix tree in struct
      address_space.  This radix tree is already used by the page writeback
      infrastructure for tracking dirty pages associated with an open file,
      and it already has support for exceptional (non struct page*) entries.
      We build upon these features to add exceptional entries to the radix
      tree for DAX dirty PMD or PTE pages at fault time.
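      
      Conceptually, the fault path records the mapping as an exceptional
      entry and tags it dirty so a later fsync can find and flush it; a
      simplified sketch (dax_entry is illustrative, and the real insertion
      path also handles replacement and PMD entries):
      
        spin_lock_irq(&mapping->tree_lock);
        error = radix_tree_insert(&mapping->page_tree, index, dax_entry);
        if (!error)
                radix_tree_tag_set(&mapping->page_tree, index,
                                   PAGECACHE_TAG_DIRTY);
        spin_unlock_irq(&mapping->tree_lock);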
      
      [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>