1. 20 May 2016 (4 commits)
    • dax: New fault locking · ac401cc7
      Committed by Jan Kara
      Currently DAX page fault locking is racy.
      
      CPU0 (write fault)		CPU1 (read fault)
      
      __dax_fault()			__dax_fault()
        get_block(inode, block, &bh, 0) -> not mapped
      				  get_block(inode, block, &bh, 0)
      				    -> not mapped
        if (!buffer_mapped(&bh))
          if (vmf->flags & FAULT_FLAG_WRITE)
            get_block(inode, block, &bh, 1) -> allocates blocks
        if (page) -> no
      				  if (!buffer_mapped(&bh))
      				    if (vmf->flags & FAULT_FLAG_WRITE) {
      				    } else {
      				      dax_load_hole();
      				    }
        dax_insert_mapping()
      
      We thus end up in a situation where dax_radix_entry() fails with -EIO.
      
      Another problem with the current DAX page fault locking is that there is
      no race-free way to clear the dirty tag in the radix tree. We can always
      end up with a clean radix tree and dirty data in the CPU cache.
      
      We fix the first problem by introducing locking of exceptional radix
      tree entries in DAX mappings, acting very similarly to the page lock and
      thus properly synchronizing faults against the same mapping index. The
      same lock can later be used to avoid races when clearing the radix tree
      dirty tag. (A minimal model of the entry lock follows this entry.)
      Reviewed-by: NeilBrown <neilb@suse.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
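      A minimal userspace model of the entry-lock idea (not the kernel
      implementation; the names, the single static slot and the busy-wait are
      all illustrative): the low bit of a per-index entry acts like a page
      lock, and a fault must own it before inspecting or changing that index.

      #include <stdatomic.h>
      #include <sched.h>

      #define DAX_ENTRY_LOCK 1UL                  /* illustrative lock bit */

      static _Atomic unsigned long entry;  /* stands in for one radix tree slot */

      static unsigned long lock_entry(void)
      {
              unsigned long old;

              for (;;) {
                      old = atomic_load(&entry);
                      if (!(old & DAX_ENTRY_LOCK) &&
                          atomic_compare_exchange_weak(&entry, &old,
                                                       old | DAX_ENTRY_LOCK))
                              return old;   /* this fault now owns the index */
                      sched_yield();  /* the kernel sleeps on a waitqueue instead */
              }
      }

      static void unlock_entry(unsigned long new_val)
      {
              atomic_store(&entry, new_val & ~DAX_ENTRY_LOCK);
      }

      With both faulting CPUs required to take the entry lock for the same
      index before calling get_block(), the interleaving above can no longer
      allocate blocks and load a hole for the same offset concurrently.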
    • dax: Define DAX lock bit for radix tree exceptional entry · e804315d
      Committed by Jan Kara
      We will use the lowest available bit in the radix tree exceptional entry
      for locking of the entry. Define it. Also clean up the definitions of
      DAX entry type bits in DAX exceptional entries to use named constants
      instead of hardcoded numbers, and clean up the checking of these bits so
      that it does not rely on how other bits in the entry are set. (An
      illustrative sketch of such definitions follows this entry.)
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
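      A sketch of what such definitions might look like; the names and exact
      bit positions are assumptions for illustration, not the patch's literal
      values. Bits 0-1 of an entry are used by the radix tree itself, so the
      DAX-private flags start above them, and the helpers test only the bits
      they care about.

      #define RADIX_DAX_ENTRY_LOCK   (1UL << 2)  /* lowest bit available to DAX */
      #define RADIX_DAX_PTE          (1UL << 3)
      #define RADIX_DAX_PMD          (1UL << 4)
      #define RADIX_DAX_TYPE_MASK    (RADIX_DAX_PTE | RADIX_DAX_PMD)

      static inline unsigned long dax_entry_type(void *entry)
      {
              return (unsigned long)entry & RADIX_DAX_TYPE_MASK;
      }

      static inline int dax_entry_locked(void *entry)
      {
              return ((unsigned long)entry & RADIX_DAX_ENTRY_LOCK) != 0;
      }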
    • dax: Make huge page handling depend on CONFIG_BROKEN · 348e967a
      Committed by Jan Kara
      Currently the handling of huge pages for DAX is racy. For example the
      following can happen:
      
      CPU0 (THP write fault)			CPU1 (normal read fault)
      
      __dax_pmd_fault()			__dax_fault()
        get_block(inode, block, &bh, 0) -> not mapped
      					get_block(inode, block, &bh, 0)
      					  -> not mapped
        if (!buffer_mapped(&bh) && write)
          get_block(inode, block, &bh, 1) -> allocates blocks
        truncate_pagecache_range(inode, lstart, lend);
      					dax_load_hole();
      
      This results in data corruption since the process on CPU1 won't see the
      changes made to the file by CPU0.
      
      The race can happen even when two normal faults race; with THP, however,
      the situation is even worse because the two faults don't operate on the
      same entries in the radix tree, and we want to use these entries for
      serialization. So make THP support in the DAX code depend on
      CONFIG_BROKEN for now.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
    • dax: Fix condition for filling of PMD holes · b9953536
      Committed by Jan Kara
      Currently dax_pmd_fault() decides to fill a PMD-sized hole only if the
      returned buffer has BH_Uptodate set. However, that flag doesn't get set
      for any mapping buffer, so that branch is actually dead code. The
      BH_Uptodate check doesn't make any sense, so just remove it. (A
      paraphrased before/after follows this entry.)
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
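      A paraphrase of the change (the helper calls are elided and the exact
      lines may differ from the patch):

      /* before: hole filling gated on a flag no get_block() implementation
       * actually sets for a mapping buffer, i.e. dead code */
      if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
              /* ... map the huge zero page ... */
      }

      /* after: fill the PMD-sized hole whenever a read fault hits an
       * unmapped range */
      if (!write && !buffer_mapped(&bh)) {
              /* ... map the huge zero page ... */
      }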
  2. 19 May 2016 (4 commits)
  3. 17 May 2016 (7 commits)
  4. 13 May 2016 (1 commit)
    • dax: call get_blocks() with create == 1 for write faults to unwritten extents · aef39ab1
      Committed by Jan Kara
      Currently, __dax_fault() does not call the get_blocks() callback with the
      create argument set when we get back an unwritten extent from the initial
      get_blocks() call during a write fault. This is because filesystems were
      originally supposed to convert unwritten extents to written ones using
      the complete_unwritten() callback. Later this was abandoned in favor of
      using pre-zeroed blocks, but the condition for whether get_blocks() needs
      to be called with create == 1 remained.
      
      Fix the condition so that filesystems are not forced to zero out and
      convert unwritten extents when get_blocks() is called with create == 0
      (which introduces unnecessary overhead for read faults and can be
      problematic as the filesystem may possibly be read-only). (A paraphrase
      of the intended condition follows this entry.)
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
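      A paraphrase of the intended condition in __dax_fault() (not the literal
      diff): on a write fault, call get_block() again with create == 1 when
      the initial lookup returned an unmapped or unwritten buffer, so blocks
      are allocated or converted only when we are actually writing.

      if ((vmf->flags & FAULT_FLAG_WRITE) &&
          (!buffer_mapped(&bh) || buffer_unwritten(&bh)))
              error = get_block(inode, block, &bh, 1);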
  5. 05 Apr 2016 (2 commits)
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Committed by Kirill A. Shutemov
      Mostly direct substitution, with occasional adjustments or removal of
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it is a constant source of confusion whether the
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straightforward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header
      files, so I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach; I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch. (A before/after C sketch of the
      substitution follows this entry.)
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
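      A before/after sketch of a typical conversion (the function is made up
      for illustration; the macros are the real ones being replaced):

      /* before */
      static void old_style(struct page *page, loff_t pos)
      {
              pgoff_t index  = pos >> PAGE_CACHE_SHIFT;
              size_t  to_end = PAGE_CACHE_SIZE - (pos & ~PAGE_CACHE_MASK);

              page_cache_get(page);
              /* ... use index and to_end ... */
              page_cache_release(page);
      }

      /* after */
      static void new_style(struct page *page, loff_t pos)
      {
              pgoff_t index  = pos >> PAGE_SHIFT;
              size_t  to_end = PAGE_SIZE - (pos & ~PAGE_MASK);

              get_page(page);
              /* ... use index and to_end ... */
              put_page(page);
      }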
  6. 10 Mar 2016 (1 commit)
  7. 28 Feb 2016 (2 commits)
    • dax: move writeback calls into the filesystems · 7f6d5b52
      Committed by Ross Zwisler
      Previously calls to dax_writeback_mapping_range() for all DAX filesystems
      (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
      
      dax_writeback_mapping_range() needs a struct block_device, and it used
      to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
      mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
      block devices and for XFS real-time files.
      
      Instead, call dax_writeback_mapping_range() directly from the
      filesystem's ->writepages function so that it can supply us with a valid
      block device.  This also fixes the DAX code to properly flush caches in
      response to sync(2). (A sketch of such a hook follows this entry.)
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
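      A sketch of such a hook for a hypothetical filesystem (the function name
      is made up, and the dax_writeback_mapping_range() arguments assume the
      post-patch form taking the mapping, a bdev and a writeback_control):

      static int myfs_dax_writepages(struct address_space *mapping,
                                     struct writeback_control *wbc)
      {
              struct inode *inode = mapping->host;

              /* the filesystem, not generic code, knows the right bdev here
               * (e.g. an XFS real-time device or a raw block device) */
              return dax_writeback_mapping_range(mapping,
                                                 inode->i_sb->s_bdev, wbc);
      }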
    • dax: give DAX clearing code correct bdev · 20a90f58
      Committed by Ross Zwisler
      dax_clear_blocks() needs a valid struct block_device and previously it
      was using inode->i_sb->s_bdev in all cases.  This is correct for normal
      inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
      DAX raw block devices and for XFS real-time devices.
      
      Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
      its arguments to take a bdev and a sector instead of an inode and a
      block.  This better reflects what the function does, and it allows the
      filesystem and raw block device code to pass in an appropriate struct
      block_device. (An illustrative call site follows this entry.)
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
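      An illustrative call site (the helper name and its bdev/sector arguments
      come from this patch; the surrounding function and the byte-count
      argument are assumptions):

      static int myfs_zero_block(struct inode *inode, sector_t block)
      {
              /* filesystem block number -> 512-byte sector number */
              sector_t sector = block << (inode->i_blkbits - 9);

              /* the caller passes the bdev it knows is right for this inode */
              return dax_clear_sectors(inode->i_sb->s_bdev, sector,
                                       1 << inode->i_blkbits);
      }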
  8. 08 Feb 2016 (1 commit)
  9. 06 Feb 2016 (1 commit)
  10. 31 Jan 2016 (1 commit)
  11. 23 Jan 2016 (5 commits)
    • dax: never rely on bh.b_bdev being set by get_block() · eab95db6
      Committed by Ross Zwisler
      Previously in DAX we assumed that calls to get_block() would set
      bh.b_bdev, and we would then use that value even in error cases for
      debugging.  This caused a NULL pointer dereference in __dax_dbg() which
      was fixed by a previous commit, but that commit only changed the one
      place where we were hitting an error.
      
      Instead, update dax.c so that we always initialize bh.b_bdev as best we
      can based on the information that DAX has.  get_block() may or may not
      update it to a new value, but this at least lets us get something
      helpful from bh.b_bdev for error messages without worrying about whether
      it was set by get_block() or not. (A sketch of the initialization
      follows this entry.)
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
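      A simplified sketch of the initialization (not the literal patch):

      struct buffer_head bh;

      memset(&bh, 0, sizeof(bh));
      bh.b_bdev = inode->i_sb->s_bdev;     /* best guess available to DAX */
      bh.b_size = PAGE_SIZE;

      error = get_block(inode, block, &bh, 0);
      /* even if get_block() fails or never touches b_bdev, debug messages
       * now have something meaningful to print */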
    • dax: add support for fsync/sync · 9973c98e
      Committed by Ross Zwisler
      To properly handle fsync/msync in an efficient way, DAX needs to track
      dirty pages so it is able to flush them durably to media on demand.
      
      The tracking of dirty pages is done via the radix tree in struct
      address_space.  This radix tree is already used by the page writeback
      infrastructure for tracking dirty pages associated with an open file,
      and it already has support for exceptional (non-struct-page) entries.
      We build upon these features to add exceptional entries to the radix
      tree for dirty DAX PMD or PTE pages at fault time. (A simplified sketch
      of the fault-time tagging follows this entry.)
      
      [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
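      A simplified sketch of the fault-time bookkeeping (the entry encoding is
      illustrative and preload/error handling is omitted): store an
      exceptional entry for the faulted index and tag it dirty, so that
      fsync/msync can later walk PAGECACHE_TAG_DIRTY and flush the
      corresponding persistent-memory ranges.

      void *entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | (sector << 4));

      spin_lock_irq(&mapping->tree_lock);
      error = radix_tree_insert(&mapping->page_tree, index, entry);
      if (!error || error == -EEXIST)
              radix_tree_tag_set(&mapping->page_tree, index,
                                 PAGECACHE_TAG_DIRTY);
      spin_unlock_irq(&mapping->tree_lock);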
    • dax: fix conversion of holes to PMDs · de14b9cb
      Committed by Ross Zwisler
      When we get a DAX PMD fault for a write it is possible that there could
      be some number of 4k zero pages already present for the same range that
      were inserted to service reads from a hole.  These 4k zero pages need to
      be unmapped from the VMAs and removed from the struct address_space
      radix tree before the real DAX PMD entry can be inserted.
      
      For PTE faults this same use case also exists and is handled by a
      combination of unmap_mapping_range() to unmap the VMAs and
      delete_from_page_cache() to remove the page from the address_space radix
      tree.
      
      For PMD faults we do have a call to unmap_mapping_range() (protected by
      a buffer_new() check), but nothing clears out the radix tree entry.  The
      buffer_new() check is also incorrect as the current ext4 and XFS
      filesystem code will never return a buffer_head with BH_New set, even
      when allocating new blocks over a hole.  Instead the filesystem will
      zero the blocks manually and return a buffer_head with only BH_Mapped
      set.
      
      Fix this situation by removing the buffer_new() check and adding a call
      to truncate_inode_pages_range() to clear out the radix tree entries
      before we insert the DAX PMD (see the sketch after this entry).
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
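      A simplified sketch of the PMD write-fault cleanup (the range math is
      illustrative and the real fault path differs in detail): drop any 4k
      zero pages covering the range before inserting the PMD entry.

      if (write) {
              loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
              loff_t lend   = lstart + PMD_SIZE - 1;

              /* unmap the VMAs, then remove the radix tree entries */
              unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
              truncate_inode_pages_range(mapping, lstart, lend);
      }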
    • dax: fix NULL pointer dereference in __dax_dbg() · d4bbe706
      Committed by Ross Zwisler
      In __dax_pmd_fault() we currently assume that get_block() will always
      set bh.b_bdev and we unconditionally dereference it in __dax_dbg().
      
      This assumption isn't always true: when called for reads of holes,
      ext4_dax_mmap_get_block() returns a buffer head where bh->b_bdev is
      never set.  I hit this BUG while testing the DAX PMD fault path.
      
      Instead, initialize bh.b_bdev before passing bh into get_block().  It is
      possible that the filesystem's get_block() will update bh.b_bdev, and
      this is fine - we just want to initialize bh.b_bdev to something
      reasonable so that the calls to __dax_dbg() work and print something
      useful.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wrappers for ->i_mutex access · 5955102c
      Committed by Al Viro
      Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}:
      inode_foo(inode) is mutex_foo(&inode->i_mutex).
      
      Please use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become an rwsem, with ->lookup() done with it held
      only shared. (A sketch of the wrappers follows this entry.)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
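      A sketch of a few of the wrappers, following the naming pattern
      described above (the real definitions live in the VFS headers):

      static inline void inode_lock(struct inode *inode)
      {
              mutex_lock(&inode->i_mutex);
      }

      static inline void inode_unlock(struct inode *inode)
      {
              mutex_unlock(&inode->i_mutex);
      }

      static inline int inode_trylock(struct inode *inode)
      {
              return mutex_trylock(&inode->i_mutex);
      }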
  12. 16 Jan 2016 (10 commits)
  13. 17 Nov 2015 (1 commit)
    • dax: disable pmd mappings · ee82c9ed
      Committed by Dan Williams
      While dax pmd mappings are functional in the nominal path, they trigger
      kernel crashes in the following paths:
      
       BUG: unable to handle kernel paging request at ffffea0004098000
       IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0
       [..]
       Call Trace:
        [<ffffffff811f6573>] follow_page_mask+0x2d3/0x380
        [<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0
        [<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0
        [<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0
      
       kernel BUG at arch/x86/mm/gup.c:131!
       [..]
       Call Trace:
        [<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220
        [<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0
      
       BUG: unable to handle kernel paging request at ffffea0004088000
       IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350
       [..]
       Call Trace:
        [<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0
        [<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10
        [<ffffffff810a11c1>] _do_fork+0x91/0x590
      
      All of these paths are interpreting a dax pmd mapping as a transparent
      huge page and making the assumption that the pfn is covered by the
      memmap, i.e. that the pfn has an associated struct page.  PTE mappings
      do not suffer the same fate since they have the _PAGE_SPECIAL flag to
      cause the gup path to fault.  We can do something similar for the PMD
      path, or otherwise defer pmd support to cases where a struct page is
      available.  For now, 4.4-rc and -stable need to disable dax pmd support
      by default.
      
      For development, the "depends on BROKEN" line can be removed from
      CONFIG_FS_DAX_PMD.
      
      Cc: <stable@vger.kernel.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>