1. 30 Nov 2016, 2 commits
    • xfs: use iomap_dio_rw · acdda3aa
      Authored by Christoph Hellwig
      Straight switch over to using iomap for direct I/O - we already have the
      non-COW dio path in write_begin for DAX and files with extent size hints,
      so nothing to add there.  The COW path is ported over from the old
      get_blocks version and is a bit of a mess, but I have some work in progress
      to make it look more like the buffered I/O COW path.
      
      This gets rid of xfs_get_blocks_direct and the last caller of
      xfs_get_blocks with the create flag set, so all that code can be removed.
      
      Last but not least, I've removed a comment in xfs_filemap_fault that
      refers to xfs_get_blocks entirely rather than updating it - while the
      reference was correct, the whole DAX fault path looks different from
      the non-DAX one, so keeping it seemed rather pointless.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
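
      For reference, a minimal sketch of what the read side looks like after
      this switch (assuming the iomap calling convention of this era; error
      handling elided, and no end_io callback is needed for reads):

              static ssize_t
              xfs_file_dio_aio_read(struct kiocb *iocb, struct iov_iter *to)
              {
                      struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
                      ssize_t ret;

                      /* Direct reads reduce to one iomap_dio_rw() call. */
                      xfs_ilock(ip, XFS_IOLOCK_SHARED);
                      ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
                      xfs_iunlock(ip, XFS_IOLOCK_SHARED);
                      return ret;
              }
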
    • xfs: remove i_iolock and use i_rwsem in the VFS inode instead · 65523218
      Authored by Christoph Hellwig
      This patch drops the XFS-own i_iolock and instead uses the VFS i_rwsem,
      which recently replaced i_mutex.  This means we only have to take
      one lock instead of two in many fast path operations, and we can
      also shrink the xfs_inode structure.  Thanks to the xfs_ilock family
      there is very little churn; the only thing of note is that we need
      to switch to the lock_two_nondirectories helper for taking the i_rwsem
      on two inodes in a few places to make sure our lock order matches
      the one used in the VFS.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
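
      To illustrate, a sketch of how the iolock classes map onto the VFS
      rwsem inside xfs_ilock() after this change (lockdep nesting via the
      existing XFS_IOLOCK_DEP annotation; surrounding code elided):

              /* Sketch: XFS iolock flags now drive the VFS i_rwsem. */
              if (lock_flags & XFS_IOLOCK_EXCL)
                      down_write_nested(&VFS_I(ip)->i_rwsem,
                                        XFS_IOLOCK_DEP(lock_flags));
              else if (lock_flags & XFS_IOLOCK_SHARED)
                      down_read_nested(&VFS_I(ip)->i_rwsem,
                                       XFS_IOLOCK_DEP(lock_flags));
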
  2. 24 Nov 2016, 1 commit
  3. 08 Nov 2016, 2 commits
    • xfs: don't BUG() on mixed direct and mapped I/O · 04197b34
      Authored by Brian Foster
      We've had reports of generic/095 causing XFS to BUG() in
      __xfs_get_blocks() due to the existence of delalloc blocks on a
      direct I/O read. generic/095 issues a mix of various types of I/O,
      including direct and memory mapped I/O to a single file. This is
      clearly not supported behavior and is known to lead to such
      problems. E.g., the lack of exclusion between the direct I/O and
      write fault paths means that a write fault can allocate delalloc
      blocks in a region of a file that was previously a hole after the
      direct read has attempted to flush/inval the file range, but before
      it actually reads the block mapping. In turn, the direct read
      discovers a delalloc extent and cannot proceed.
      
      While the appropriate solution here is to not mix direct and memory
      mapped I/O to the same regions of the same file, the current
      BUG_ON() behavior is probably overkill as it can crash the entire
      system.  Instead, localize the failure to the I/O in question by
      returning an error for a direct I/O that cannot be handled safely
      due to delalloc blocks. Be careful to allow the case of a direct
      write to post-eof delalloc blocks. This can occur due to speculative
      preallocation and is safe as post-eof blocks are not accompanied by
      dirty pages in pagecache (conversely, preallocation within eof must
      have been zeroed, and thus dirtied, before the inode size could have
      been increased beyond said blocks).
      
      Finally, provide an additional warning if a direct I/O write occurs
      while the file is memory mapped. This may not catch all problematic
      scenarios, but provides a hint that some known-to-be-problematic I/O
      methods are in use.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
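
      For reference, the unsupported pattern reduces to the two access
      methods below touching the same file range. generic/095 does this
      concurrently from multiple threads; this sequential userspace sketch
      (file name and sizes illustrative) only shows the ingredients, not
      the race:

              #define _GNU_SOURCE
              #include <fcntl.h>
              #include <stdlib.h>
              #include <sys/mman.h>
              #include <unistd.h>

              int main(void)
              {
                      /* O_DIRECT I/O plus a mmap write fault on the same
                       * range: the problematic mix described above.
                       * Error handling elided. */
                      int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
                      void *buf;

                      ftruncate(fd, 4096);
                      posix_memalign(&buf, 4096, 4096); /* O_DIRECT alignment */

                      char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
                      map[0] = 'x';            /* write fault: may create delalloc */
                      pread(fd, buf, 4096, 0); /* direct read of the same range */

                      munmap(map, 4096);
                      close(fd);
                      return 0;
              }
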
    • xfs: use struct iomap based DAX PMD fault path · 862f1b9d
      Authored by Ross Zwisler
      Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
      improved dax_iomap_pmd_fault().  Also, now that it has no more users,
      remove xfs_get_blocks_dax_fault().
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  4. 11 Oct 2016, 1 commit
  5. 06 Oct 2016, 3 commits
    • xfs: implement CoW for directio writes · 0613f16c
      Authored by Darrick J. Wong
      For O_DIRECT writes to shared blocks, we have to CoW them just like
      we would with buffered writes.  For writes that are not block-aligned,
      just bounce them to the page cache.
      
      For block-aligned writes, however, we can do better than that.  Use
      the same mechanisms that we employ for buffered CoW to set up a
      delalloc reservation, allocate all the blocks at once, issue the
      writes against the new blocks and use the same ioend functions to
      remap the blocks after the write.  This should be fairly performant.
      
      Christoph discovered that xfs_reflink_allocate_cow_range may stumble
      over invalid entries in the extent array given that it drops the ilock
      but still expects the index to be stable.  Simply fixing it to do a
      new lookup for every iteration still isn't correct, given that
      xfs_bmapi_allocate will trigger a BUG_ON() if it hits a hole, and
      there is nothing preventing an xfs_bunmapi_cow call from removing
      extents once we have dropped the ilock either.
      
      This patch duplicates the inner loop of xfs_bmapi_allocate into a
      helper for xfs_reflink_allocate_cow_range so that it can be done under
      the same ilock critical section as our CoW fork delayed allocation.
      The directio CoW warts will be revisited in a later patch.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • xfs: report shared extent mappings to userspace correctly · db1327b1
      Authored by Darrick J. Wong
      Report shared extents through the iomap interface so that FIEMAP flags
      shared blocks accurately.  Have xfs_vm_bmap return zero for reflinked
      files because the bmap-based swap code requires static block mappings,
      which is incompatible with copy on write.
      
      NOTE: Existing userspace bmap users such as lilo will have the same
      problem with reflink files.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
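
      The xfs_vm_bmap side of this is small enough to sketch from the
      description (xfs_is_reflink_inode is the predicate used elsewhere in
      this series; generic_block_bmap is the stock fallback):

              /* Sketch: swap-over-bmap needs static block mappings,
               * which copy on write cannot guarantee, so report
               * "no mapping" for reflinked files. */
              if (xfs_is_reflink_inode(ip))
                      return 0;
              return generic_block_bmap(mapping, block, xfs_get_blocks);
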
    • xfs: move mappings from cow fork to data fork after copy-write · 43caeb18
      Authored by Darrick J. Wong
      After the write component of a copy-write operation finishes, clean up
      the bookkeeping left behind.  On error, we simply free the new blocks
      and pass the error up.  If we succeed, however, then we must remove
      the old data fork mapping and move the cow fork mapping to the data
      fork.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      [hch: Call the CoW failure function during xfs_cancel_ioend]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  6. 05 Oct 2016, 2 commits
    • xfs: allocate delayed extents in CoW fork · ef473667
      Authored by Darrick J. Wong
      Modify the writepage handler to find and convert pending delalloc
      extents to real allocations.  Furthermore, when we're doing non-CoW
      writes to a part of a file that already has a CoW reservation (this is
      facilitated by the cowextsz hint that we set up in a subsequent patch),
      promote the write to copy-on-write so that the entire extent can get
      written out as a single extent on disk, thereby reducing post-CoW
      fragmentation.
      
      Christoph moved the CoW support code in _map_blocks to a separate helper
      function, refactored other functions, and reduced the number of CoW fork
      lookups, so I merged those changes here to reduce churn.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • xfs: support allocating delayed extents in CoW fork · 60b4984f
      Authored by Darrick J. Wong
      Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed
      allocation extents in the CoW fork to real allocations, and wire this
      up all the way back to xfs_iomap_write_allocate().  In a subsequent
      patch, we'll modify the writepage handler to call this.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  7. 19 Sep 2016, 1 commit
  8. 22 Jul 2016, 2 commits
    • xfs: bufferhead chains are invalid after end_page_writeback · 28b783e4
      Authored by Dave Chinner
      In xfs_finish_page_writeback(), we have a loop that looks like this:
      
              do {
                      if (off < bvec->bv_offset)
                              goto next_bh;
                      if (off > end)
                              break;
                      bh->b_end_io(bh, !error);
      next_bh:
                      off += bh->b_size;
              } while ((bh = bh->b_this_page) != head);
      
      The b_end_io function is end_buffer_async_write(), which will call
      end_page_writeback() once all the buffers have been marked as no longer
      under IO.  The issue here is that the only thing currently
      protecting both the bufferhead chain and the page from being
      reclaimed is the PageWriteback state held on the page.
      
      While we attempt to limit the loop to just the buffers covered by
      the IO, we still read the buffer size and follow the next
      pointer in the bufferhead chain. There is no guarantee that either
      of these is valid after the PageWriteback flag has been cleared.
      Hence, loops like this are completely unsafe, and result in
      use-after-free issues. One such problem was caught by Calvin Owens
      with KASAN:
      
      .....
       INFO: Freed in 0x103fc80ec age=18446651500051355200 cpu=2165122683 pid=-1
        free_buffer_head+0x41/0x90
        __slab_free+0x1ed/0x340
        kmem_cache_free+0x270/0x300
        free_buffer_head+0x41/0x90
        try_to_free_buffers+0x171/0x240
        xfs_vm_releasepage+0xcb/0x3b0
        try_to_release_page+0x106/0x190
        shrink_page_list+0x118e/0x1a10
        shrink_inactive_list+0x42c/0xdf0
        shrink_zone_memcg+0xa09/0xfa0
        shrink_zone+0x2c3/0xbc0
      .....
       Call Trace:
        <IRQ>  [<ffffffff81e8b8e4>] dump_stack+0x68/0x94
        [<ffffffff8153a995>] print_trailer+0x115/0x1a0
        [<ffffffff81541174>] object_err+0x34/0x40
        [<ffffffff815436e7>] kasan_report_error+0x217/0x530
        [<ffffffff81543b33>] __asan_report_load8_noabort+0x43/0x50
        [<ffffffff819d651f>] xfs_destroy_ioend+0x3bf/0x4c0
        [<ffffffff819d69d4>] xfs_end_bio+0x154/0x220
        [<ffffffff81de0c58>] bio_endio+0x158/0x1b0
        [<ffffffff81dff61b>] blk_update_request+0x18b/0xb80
        [<ffffffff821baf57>] scsi_end_request+0x97/0x5a0
        [<ffffffff821c5558>] scsi_io_completion+0x438/0x1690
        [<ffffffff821a8d95>] scsi_finish_command+0x375/0x4e0
        [<ffffffff821c3940>] scsi_softirq_done+0x280/0x340
      
      
      Where the access is occurring during IO completion after the buffer
      had been freed by direct memory reclaim.
      
      Prevent use-after-free accidents in this end_io processing loop by
      pre-calculating the loop conditionals before calling bh->b_end_io().
      The loop is already limited to just the bufferheads covered by the
      IO in progress, so the offset checks are sufficient to prevent
      accessing buffers in the chain after end_page_writeback() has been
      called by the bh->b_end_io() callout.
      
      Yet another example of why Bufferheads Must Die.
      
      cc: <stable@vger.kernel.org> # 4.7
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reported-and-Tested-by: Calvin Owens <calvinowens@fb.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
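      The repaired loop, per the description above, captures the next
      pointer and the buffer size before calling b_end_io(), so nothing
      dereferences the chain after it may have been freed (a sketch, in the
      same context as the loop quoted above; b_size is uniform across a
      page's bufferheads):

              unsigned int bsize = bh->b_size;

              do {
                      next = bh->b_this_page; /* grab before b_end_io() */
                      if (off < bvec->bv_offset)
                              goto next_bh;
                      if (off > end)
                              break;
                      bh->b_end_io(bh, !error);
      next_bh:
                      off += bsize;           /* no bh dereference here */
              } while ((bh = next) != head);
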
    • xfs: skip dirty pages in ->releasepage() · 99579cce
      Authored by Brian Foster
      XFS has had scattered reports of delalloc blocks present at
      ->releasepage() time. This results in a warning with a stack trace
      similar to the following:
      
       ...
       Call Trace:
        [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
        [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
        [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
        [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
        [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
        [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
        [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
        [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
        [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
        [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
        [<ffffffffa2168539>] kswapd+0x4f9/0x970
        [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
        [<ffffffffa20a0d99>] kthread+0xc9/0xe0
        [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
        [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
        [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
      
      This occurs because it is possible for shrink_active_list() to send
      pages marked dirty to ->releasepage() when certain buffer_head threshold
      conditions are met. shrink_active_list() doesn't check the page dirty
      state, apparently to handle an old ext3 corner case where in some cases
      clean pages would not have the dirty bit cleared; thus it is up to the
      filesystem to determine how to handle the page.
      
      XFS currently handles the delalloc case properly, but this behavior
      makes the warning spurious. Update the XFS ->releasepage() handler to
      explicitly skip dirty pages. Retain the existing delalloc/unwritten
      checks so we continue to warn if such buffers exist on clean pages when
      they shouldn't.
      Diagnosed-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  9. 20 Jul 2016, 1 commit
  10. 21 Jun 2016, 2 commits
  11. 08 Jun 2016, 2 commits
  12. 20 May 2016, 1 commit
  13. 02 May 2016, 1 commit
  14. 06 Apr 2016, 4 commits
    • xfs: better xfs_trans_alloc interface · 253f4911
      Authored by Christoph Hellwig
      Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
      that returns a transaction with all the required log and block reservations,
      and which allows passing transaction flags directly to avoid the cumbersome
      _xfs_trans_alloc interface.
      
      While we're at it we also get rid of the transaction type argument that has
      been superfluous since we stopped supporting the non-CIL logging mode.  The
      guts of it will be removed in another patch.
      
      [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
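
      As a sketch of the new calling convention (the tr_ichange reservation
      is picked purely for illustration), the old allocate-then-reserve-
      then-cancel-on-error dance becomes a single call that either returns
      a fully reserved transaction or nothing:

              struct xfs_trans        *tp;
              int                     error;

              /* One call: log/block reservation, flags, transaction. */
              error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange,
                                      0, 0, 0, &tp);
              if (error)
                      return error;
              /* ... modify metadata ... */
              error = xfs_trans_commit(tp);
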
    • xfs: optimize bio handling in the buffer writeback path · 0e51a8e1
      Authored by Christoph Hellwig
      This patch implements two closely related changes:  First it embeds
      a bio in the ioend structure so that we don't have to allocate one
      separately.  Second it uses the block layer bio chaining mechanism
      to chain additional bios off this first one if needed instead of
      manually accounting for multiple bio completions in the ioend
      structure.  Together this removes a memory allocation per ioend and
      greatly simplifies the ioend setup and I/O completion path.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
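      A sketch of the resulting layout (field set abridged from the XFS
      writeback headers of this era); the embedded bio must be the final
      member because bios end in a variable-sized inline bvec array:

              struct xfs_ioend {
                      struct list_head   io_list;        /* next ioend in chain */
                      unsigned int       io_type;
                      struct inode       *io_inode;      /* file being written to */
                      size_t             io_size;        /* size of the extent */
                      xfs_off_t          io_offset;      /* offset in the file */
                      struct work_struct io_work;        /* completion work queue */
                      struct xfs_trans   *io_append_trans;/* xact. for size update */
                      struct bio         *io_bio;        /* bio being built */
                      struct bio         io_inline_bio;  /* MUST BE LAST! */
              };
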
    • xfs: don't release bios on completion immediately · 37992c18
      Authored by Dave Chinner
      Completion of an ioend requires us to walk the bufferhead list to
      end writeback on all the bufferheads. This, in turn, is needed so
      that we can end writeback on all the pages we just did IO on.
      
      To remove our dependency on bufferheads in writeback, we need to
      turn this around the other way - we need to walk the pages we've
      just completed IO on, and then walk the buffers attached to the
      pages and complete their IO. In doing this, we remove the
      requirement for the ioend to track bufferheads directly.
      
      To enable IO completion to walk all the pages we've submitted IO on,
      we need to keep the bios that we used for IO around until the ioend
      has been completed. We can do this simply by chaining the bios to
      the ioend at completion time, and then walking their pages directly
      just before destroying the ioend.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [hch: changed the xfs_finish_page_writeback calling convention]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
    • xfs: build bios directly in xfs_add_to_ioend · bb18782a
      Authored by Dave Chinner
      Currently adding a buffer to the ioend and then building a bio from
      the buffer list are two separate operations. We don't build the bios
      and submit them until the ioend is submitted, and this places a
      fixed dependency on bufferhead chaining in the ioend.
      
      The first step to removing the bufferhead chaining in the ioend is
      on the IO submission side. We can build the bio directly as we add
      the buffers to the ioend chain, thereby removing the need for a
      latter "buffer-to-bio" submission loop. This allows us to submit
      bios on large ioends as soon as we cannot add more data to the bio.
      
      These bios then get captured by the active plug, and hence will be
      dispatched as soon as either the plug overflows or we schedule away
      from the writeback context. This will reduce submission latency for
      large IOs, but will also allow more timely request queue based
      writeback blocking when the device becomes congested.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [hch: various small updates]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
  15. 05 Apr 2016, 2 commits
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Authored by Kirill A. Shutemov
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Authored by Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to
      implement the page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straightforward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is the revert of changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
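
      A before/after fragment (illustrative, not from any particular file)
      showing what the substitutions look like in practice:

              /* Before: page-cache units pretended to differ from pages. */
              offset = pos & (PAGE_CACHE_SIZE - 1);
              index  = pos >> PAGE_CACHE_SHIFT;
              page_cache_get(page);
              /* ... */
              page_cache_release(page);

              /* After: plain page units; the (PAGE_CACHE_SHIFT - PAGE_SHIFT)
               * shift terms drop out entirely since the delta was always 0. */
              offset = pos & (PAGE_SIZE - 1);
              index  = pos >> PAGE_SHIFT;
              get_page(page);
              /* ... */
              put_page(page);
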
  16. 16 Mar 2016, 2 commits
  17. 15 Mar 2016, 1 commit
    • xfs: debug mode forced buffered write failure · 801cc4e1
      Authored by Brian Foster
      Add a DEBUG mode-only sysfs knob to enable forced buffered write
      failure. An additional side effect of this mode is brute force killing
      of delayed allocation blocks in the range of the write. The latter is
      the prime motiviation behind this patch, as userspace test
      infrastructure requires a reliable mechanism to create and split
      delalloc extents without causing extent conversion.
      
      Certain fallocate operations (i.e., zero range) were used for this in
      the past, but the implementations have changed such that delalloc
      extents are flushed and converted to real blocks, rendering the test
      useless.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
  18. 07 Mar 2016, 1 commit
  19. 28 Feb 2016, 2 commits
    • dax: move writeback calls into the filesystems · 7f6d5b52
      Authored by Ross Zwisler
      Previously calls to dax_writeback_mapping_range() for all DAX filesystems
      (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
      
      dax_writeback_mapping_range() needs a struct block_device, and it used
      to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
      mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
      block devices and for XFS real-time files.
      
      Instead, call dax_writeback_mapping_range() directly from the filesystem
      ->writepages function so that it can supply us with a valid block
      device.  This also fixes DAX code to properly flush caches in response
      to sync(2).
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
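
      The filesystem side of this amounts to a short-circuit at the top of
      ->writepages; a sketch of the XFS hook (xfs_find_bdev_for_inode is
      the existing helper that resolves the data vs. real-time device):

              /* Sketch: flush DAX mappings here, where the filesystem
               * can supply the correct block device (real-time files
               * included). */
              if (dax_mapping(mapping))
                      return dax_writeback_mapping_range(mapping,
                                      xfs_find_bdev_for_inode(mapping->host),
                                      wbc);
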
    • dax: give DAX clearing code correct bdev · 20a90f58
      Authored by Ross Zwisler
      dax_clear_blocks() needs a valid struct block_device and previously it
      was using inode->i_sb->s_bdev in all cases.  This is correct for normal
      inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
      DAX raw block devices and for XFS real-time devices.
      
      Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
      its arguments to take a bdev and a sector instead of an inode and a
      block.  This better reflects what the function does, and it allows the
      filesystem and raw block device code to pass in an appropriate struct
      block_device.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
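
      Sketched from the description, the rename is easiest to see as the
      two prototypes side by side (parameter names assumed):

              /* Before: the bdev was inferred from the inode, which is
               * wrong for raw block devices and XFS real-time files. */
              int dax_clear_blocks(struct inode *inode, sector_t block,
                                   long size);

              /* After: the caller passes the correct device and sector. */
              int dax_clear_sectors(struct block_device *bdev,
                                    sector_t sector, long size);
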
  20. 15 Feb 2016, 6 commits
    • xfs: don't chain ioends during writepage submission · e10de372
      Authored by Dave Chinner
      Currently we can build a long ioend chain during ->writepages that
      gets attached to the writepage context, and IO submission then occurs
      only when we finish all the writepage processing. This means we
      can have many ioends allocated and pending, and this violates the
      mempool guarantees that we need to give about forwards progress.
      i.e. we really should only have one ioend being built at a time,
      otherwise we may drain the mempool trying to allocate a new ioend
      and that blocks submission, completion and freeing of ioends that
      are already in progress.
      
      To prevent this situation from happening, we need to submit ioends
      for IO as soon as they are ready for dispatch rather than queuing
      them for later submission. This means the ioends have bios built
      immediately and they get queued on any plug that is currently active.
      Hence if we schedule away from writeback, the ioends that have been
      built will make forwards progress due to the plug flushing on
      context switch. This will also prevent context switches from
      creating unnecessary IO submission latency.
      
      We can't completely avoid having nested IO allocation - when we have
      a block size smaller than a page size, we still need to hold the
      ioend submission until after we have marked the current page dirty.
      Hence we may need multiple ioends to be held while the current page
      is completely mapped and made ready for IO dispatch. We cannot avoid
      this problem - the current code already has this ioend chaining
      within a page so we can mostly ignore that it occurs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: factor mapping out of xfs_do_writepage · bfce7d2e
      Authored by Dave Chinner
      Separate out the bufferhead based mapping from the writepage code so
      that we have a clear separation of the page operations and the
      bufferhead state.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: xfs_cluster_write is redundant · ad68972a
      Authored by Dave Chinner
      xfs_cluster_write() is not necessary now that xfs_vm_writepages()
      aggregates writepage calls across a single mapping. This means we no
      longer need to do page lookups in xfs_cluster_write, so writeback
      only needs to look up the page cache once per page being written.
      This also removes a large amount of mostly duplicate code between
      xfs_do_writepage() and xfs_convert_page().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: Introduce writeback context for writepages · fbcc0256
      Authored by Dave Chinner
      xfs_vm_writepages() calls generic_writepages to write back a range of
      a file, but xfs_vm_writepage() then clusters pages itself as it does
      not have any context it can pass between ->writepage calls from
      __write_cache_pages().
      
      Introduce a writeback context for xfs_vm_writepages() and call
      __write_cache_pages directly with our own writepage callback so that
      we can pass that context to each writepage invocation. This
      encapsulates the current mapping, whether it is valid or not, the
      current ioend and its IO type, and the ioend chain being built.
      
      This requires us to move the ioend submission up to the level where
      the writepage context is declared. This does mean we do not submit
      IO until we have packaged the entire writeback range, but with the
      block plugging in the writepages call this is the way IO is
      submitted anyway.
      
      It also means that we need to handle discontiguous page ranges.  If
      the pages sent down by write_cache_pages to the writepage callback
      are discontiguous, we need to detect this and put each discontiguous
      page range into individual ioends. This is needed to ensure that the
      ioend accurately represents the range of the file that it covers so
      that file size updates during IO completion set the size correctly.
      Failure to take into account the discontiguous ranges results in
      files being too small when writeback patterns are non-sequential.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
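
      A sketch of the context this introduces, with one field per item in
      the description (the cached mapping and its validity, the IO type,
      the ioend being built, plus a marker for detecting discontiguous
      ranges):

              struct xfs_writepage_ctx {
                      struct xfs_bmbt_irec  imap;       /* cached block mapping */
                      bool                  imap_valid; /* mapping still usable? */
                      unsigned int          io_type;    /* overwrite/delalloc/... */
                      struct xfs_ioend      *ioend;     /* ioend being built */
                      sector_t              last_block; /* discontiguity check */
              };
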
    • xfs: remove xfs_cancel_ioend · 150d5be0
      Authored by Dave Chinner
      We currently have code to cancel ioends being built because we
      change bufferhead state as we build the ioend. On error, this needs
      to be unwound and so we have cancelling code that walks the buffers
      on the ioend chain and undoes these state changes.
      
      However, the IO submission path already handles state changes for
      buffers when a submission error occurs, so we don't really need a
      separate cancel function to do this - we can simply submit the
      ioend chain with the specific error and it will be cancelled rather
      than submitted.
      
      Hence we can remove the explicit cancel code and just rely on
      submission to deal with the error correctly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: remove nonblocking mode from xfs_vm_writepage · 988ef927
      Authored by Dave Chinner
      Remove the nonblocking optimisation done for mapping lookups during
      writeback. It's not clear that leaving a hole in the writeback range
      just because we couldn't get a lock is really a win, as it makes us
      do another small random IO later on rather than a large sequential
      IO now.
      
      As this gets in the way of sane error handling later on, just remove
      it for the moment; we can re-introduce an equivalent optimisation in
      future if we see problems due to extent map lock contention.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  21. 08 Feb 2016, 1 commit
    • xfs: fix xfs_log_ticket leak in xfs_end_io() after fs shutdown · af055e37
      Authored by Brian Foster
      If the filesystem has shut down, xfs_end_io() currently sets an
      error on the ioend and proceeds to ioend destruction. The ioend
      might contain a truncate transaction if the I/O extended the size of
      the file. This transaction is only cleaned up in
      xfs_setfilesize_ioend(), however, which is skipped in this case.
      This results in an xfs_log_ticket leak message when the associated
      cache slab is destroyed (e.g., on rmmod).
      
      This was originally reproduced by xfs/141 on a distro kernel. The
      problem is reproducible on an upstream kernel, but not easily
      detected in current upstream if the xfs_log_ticket cache happens to
      be merged with another cache. This can be reproduced more
      deterministically with the 'slab_nomerge' kernel boot option.
      
      Update xfs_end_io() to proceed with normal end I/O processing after
      an error is set on an ioend due to fs shutdown. The I/O type-based
      processing is already designed to handle an I/O error and ensure
      that the ioend is cleaned up correctly.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
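      The fix, per the description, is to record the error and fall through
      rather than short-circuit; a sketch of the shutdown check at the top
      of xfs_end_io():

              /* Sketch: on a forced shutdown, set the error but keep
               * going, so the type-based processing can cancel e.g. the
               * setfilesize transaction instead of leaking its ticket. */
              if (XFS_FORCED_SHUTDOWN(ip->i_mount))
                      error = -EIO;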