1. 15 May, 2012 3 commits
    • D
      xfs: page type check in writeback only checks last buffer · 6ffc4db5
      Dave Chinner committed
      xfs_is_delayed_page() checks to see if a page has buffers matching
      the given IO type passed in. It does so by walking the buffer heads
      on the page and checking if the state flags match the IO type.
      
      However, the "acceptable" variable that is calculated is overwritten
      every time a new buffer is checked. Hence if the first buffer on the
      page is of the right type, this state is lost if the second buffer
      is not of the correct type. This means that xfs_aops_discard_page()
      may not discard delalloc regions when it is supposed to, and
      xfs_convert_page() may not cluster IO as efficiently as possible.
      
      This problem only occurs on filesystems with a block size smaller
      than page size.
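
The overwrite described above can be illustrated with a minimal sketch. This is not the kernel code: `buf_type`, `check_page_type_buggy()` and `check_page_type_fixed()` are hypothetical names, and a page is modelled as a plain array of per-buffer types rather than a list of buffer heads.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-buffer state flags. */
enum buf_type { BUF_OTHER, BUF_DELALLOC, BUF_UNWRITTEN };

/*
 * Buggy shape: "acceptable" is overwritten for every buffer,
 * so only the last buffer on the page decides the result.
 */
bool check_page_type_buggy(const enum buf_type *bufs, size_t n,
                           enum buf_type want)
{
    bool acceptable = false;

    for (size_t i = 0; i < n; i++)
        acceptable = (bufs[i] == want);
    return acceptable;
}

/*
 * Fixed shape: a match on any buffer is remembered, so a later
 * non-matching buffer cannot erase it.
 */
bool check_page_type_fixed(const enum buf_type *bufs, size_t n,
                           enum buf_type want)
{
    bool acceptable = false;

    for (size_t i = 0; i < n; i++)
        if (bufs[i] == want)
            acceptable = true;
    return acceptable;
}
```

With a two-buffer page whose first buffer is delalloc and whose second is not, the buggy variant reports no match while the fixed variant reports a match. A page with a single buffer (block size equal to page size) behaves identically in both, which is consistent with the problem only showing up on small block sizes.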
      
      Also, rename xfs_is_delayed_page() to xfs_check_page_type() to
      better describe what it is doing - it is not delalloc specific
      anymore.
      
      The problem was first noticed by Peter Watkins.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      6ffc4db5
    • D
      xfs: punch all delalloc blocks beyond EOF on write failure. · 01c84d2d
      Dave Chinner committed
      I've been seeing regular ASSERT failures in xfstests when running
      fsstress based tests over the past month. xfs_getbmap() has been
      failing this test:
      
      XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) ||
      (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c,
      line: 5650
      
      where it is encountering a delayed allocation extent after writing
      all the dirty data to disk and then walking the extent map
      atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed
      allocation extents from being created.
      
      Test 083 on a 512 byte block size filesystem was used to reproduce
      the problem, because it only had a 5s run time and would usually fail
      every 3-4 runs. This test is exercising ENOSPC behaviour by running
      fsstress on a nearly full filesystem. The following trace extract
      shows the final few events on the inode that tripped the assert:
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_setfilesize
       xfs_setfilesize:       isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680
      
      file size updated to 0x180000 by IO completion
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_iext_insert:       state  idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay
       xfs_get_blocks_alloc:  size 0x180000 offset 0x180000 count 512 type  startoff 0xc00 startblock -1 blockcount 0x1
       xfs_ilock:             flags ILOCK_EXCL caller __xfs_get_blocks
      
      delalloc write, adding a single block at offset 0x180000
      
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180200 count 512
      
      ENOSPC trying to allocate a delalloc block at offset 0x180200
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_get_blocks_alloc:  size 0x180000 offset 0x180200 count 512 type  startoff 0xc00 startblock -1 blockcount 0x2
      
      And succeeding on retry after flushing dirty inodes.
      
       xfs_ilock:             flags ILOCK_EXCL caller __xfs_get_blocks
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180400 count 512
      
      ENOSPC trying to allocate a delalloc block at offset 0x180400
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180400 count 512
      
      And failing the retry, giving a real ENOSPC error.
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_vm_write_failed
                                                      ^^^^^^^^^^^^^^^^^^^
      The smoking gun - the write being failed and cleaning up delalloc
      blocks beyond EOF allocated by the failed write.
      
       xfs_getattr:
       xfs_ilock:             flags IOLOCK_SHARED caller xfs_getbmap
       xfs_ilock:             flags ILOCK_SHARED caller xfs_ilock_map_shared
      
      And that's where we died almost immediately afterwards.
      xfs_bmapi_read() found a delalloc extent beyond the current in-memory
      file size. Some debug I added to xfs_getbmap() showed the state just
      before the assert failure:
      
       ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000
       start_fsb 0x106, end_fsb 0x638
       ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00
       ext 0: off 0x1fc, fsb 0x24782, len 0x254
       ext 1: off 0x450, fsb 0x40851, len 0x30
       ext 2: off 0x480, fsb 0xd99, len 0x1b8
       ext 3: off 0x92f, fsb 0x4099a, len 0x3b
       ext 4: off 0x96d, fsb 0x41844, len 0x98
       ext 5: off 0xbf1, fsb 0x408ab, len 0xf
      
      which shows that we found a single delalloc block beyond EOF (first
      line of output) when we were returning the map for a length
      somewhere around 10^16 bytes long (second line), and the on-disk
      extents showed they didn't go past EOF (last lines).
      
      Further debug added to xfs_vm_write_failed() showed this happened
      when punching out delalloc blocks beyond the end of the file after
      the failed write:
      
      [  132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000
      [  132.609573] start_fsb 0xc01, end_fsb 0xc08
      
      It punched the range 0xc01 -> 0xc08, but the range we really need to
      punch is 0xc00 -> 0xc07 (8 blocks from 0xc00) as this testing was
      run on a 512 byte block size filesystem (8 blocks per page), so
      the block to punch from is 0xc00. So end_fsb is correct, but start_fsb is
      wrong as we punch from start_fsb for (end_fsb - start_fsb) blocks.
      Hence we are not punching the delalloc block beyond EOF in this case.
      
      The fix is simple - it's a silly off-by-one mistake in calculating
      the range. It's especially silly because the macro used to calculate
      the start_fsb already takes into account the case where the inode
      size is an exact multiple of the filesystem block size...
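
The arithmetic can be checked with a small sketch. It assumes the mistake took the form of adding one to an already rounded-up block number; `bytes_to_fsb_roundup()`, `punch_start_buggy()` and `punch_start_fixed()` are hypothetical helpers, not the kernel macro or functions, and the 512 byte block size is taken from the test setup described above.

```c
#include <stdint.h>

/* Assumed: 512-byte filesystem blocks, as in the failing test run. */
#define TOY_BLOCK_SIZE 512u

/* Convert a byte offset to a block number, rounding up. */
uint64_t bytes_to_fsb_roundup(uint64_t bytes)
{
    return (bytes + TOY_BLOCK_SIZE - 1) / TOY_BLOCK_SIZE;
}

/* Assumed buggy form: an extra +1 on top of the round-up. */
uint64_t punch_start_buggy(uint64_t isize)
{
    return bytes_to_fsb_roundup(isize) + 1;
}

/* Fixed form: the round-up alone already yields the first block past EOF. */
uint64_t punch_start_fixed(uint64_t isize)
{
    return bytes_to_fsb_roundup(isize);
}
```

For isize 0x180000 the buggy form yields 0xc01 and the fixed form 0xc00, while the end of the range for 0x181000 is 0xc08 in both cases, matching the debug output. When the size is not block aligned (for example 0x180001), the round-up already skips the partially used block 0xc00, so no extra increment is needed.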
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      01c84d2d
    • D
      xfs: use shared ilock mode for direct IO writes by default · 507630b2
      Dave Chinner committed
      For the direct IO write path, we only really need the ilock to be taken in
      exclusive mode during IO submission if we need to do extent allocation
      instead of all the time.
      
      Change the block mapping code to take the ilock in shared mode for the
      initial block mapping, and only retake it exclusively when we actually
      have to perform extent allocations.  We were already dropping the ilock
      for the transaction allocation, so this doesn't introduce new race windows.
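
The locking pattern can be sketched as follows. This is a toy model only: the lock merely records its mode and never blocks, and every name (`toy_inode`, `toy_lock()`, `map_for_direct_write()`) is hypothetical. The recheck after retaking the lock exclusively is an assumption about how a caller would cope with the lock having been dropped, not something stated in the commit message.

```c
#include <stdbool.h>

enum lock_mode { UNLOCKED, LOCK_SHARED, LOCK_EXCL };

struct toy_inode {
    enum lock_mode ilock;   /* toy lock: records the mode, never blocks */
    bool mapped;            /* does the target block already have an extent? */
    int excl_acquisitions;  /* how often the exclusive mode was needed */
};

void toy_lock(struct toy_inode *ip, enum lock_mode mode)
{
    ip->ilock = mode;
    if (mode == LOCK_EXCL)
        ip->excl_acquisitions++;
}

void toy_unlock(struct toy_inode *ip)
{
    ip->ilock = UNLOCKED;
}

void map_for_direct_write(struct toy_inode *ip)
{
    bool need_alloc;

    /* Initial block mapping only reads the extent state: shared mode. */
    toy_lock(ip, LOCK_SHARED);
    need_alloc = !ip->mapped;
    toy_unlock(ip);

    if (!need_alloc)
        return;

    /* The transaction would be allocated here, with the lock dropped. */

    /* Allocation modifies the extent map: retake the lock exclusively. */
    toy_lock(ip, LOCK_EXCL);
    if (!ip->mapped)        /* assumed recheck after the lock was dropped */
        ip->mapped = true;
    toy_unlock(ip);
}
```

A write over an already mapped block never takes the exclusive mode in this model, which is the case the change is meant to make cheaper.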
      
      Based on an earlier patch from Dave Chinner.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      507630b2
  2. 14 March, 2012 1 commit
  3. 06 March, 2012 3 commits
  4. 18 January, 2012 2 commits
    • C
      xfs: remove the i_new_size field in struct xfs_inode · 2813d682
      Christoph Hellwig committed
      Now that we use the VFS i_size field throughout XFS there is no need for the
      i_new_size field any more given that the VFS i_size field gets updated
      in ->write_end before unlocking the page, and thus is always uptodate when
      writeback could see a page.  Removing i_new_size also has the advantage that
      we will never have to trim back di_size during a failed buffered write,
      given that it never gets updated past i_size.
      
      Note that currently the generic direct I/O code only updates i_size after
      calling our end_io handler, which requires a small workaround to make
      sure di_size actually makes it to disk.  I hope to fix this properly in
      the generic code.
      
      A downside is that we lose the support for parallel non-overlapping O_DIRECT
      appending writes that recently was added.  I don't think keeping the complex
      and fragile i_new_size infrastructure for this is a good tradeoff - if we
      really care about parallel appending writers we should investigate turning
      the iolock into a range lock, which would also allow for parallel
      non-overlapping buffered writers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      2813d682
    • C
      xfs: remove the i_size field in struct xfs_inode · ce7ae151
      Christoph Hellwig committed
      There is no fundamental need to keep an in-memory inode size copy in the XFS
      inode.  We already have the on-disk value in the dinode, and the separate
      in-memory copy that we need for regular files only in the XFS inode.
      
      Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
      VFS inode i_size field for regular files.  Switch code that was directly
      accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
      we are limited to regular files direct access of the VFS inode i_size field.
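
A rough sketch of the size lookup described above, using toy structures rather than the real kernel types. `toy_vfs_inode`, `toy_xfs_inode` and `toy_isize()` are hypothetical; the only behaviour taken from the text is that regular files use the VFS inode i_size, with the on-disk size assumed as the fallback for everything else.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy VFS inode: only the fields this sketch needs. */
struct toy_vfs_inode {
    int64_t i_size;
    bool is_regular;
};

/* Toy XFS inode: embeds the VFS inode and keeps the on-disk size. */
struct toy_xfs_inode {
    struct toy_vfs_inode vfs;
    int64_t di_size;
};

/*
 * Size accessor: regular files read the VFS i_size, other
 * inode types fall back to the on-disk size (assumed here).
 */
int64_t toy_isize(const struct toy_xfs_inode *ip)
{
    if (ip->vfs.is_regular)
        return ip->vfs.i_size;
    return ip->di_size;
}
```

With no separate in-memory copy in the XFS inode, there is nothing left to keep in sync with the value that ->write_end updates.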
      
      This also allows dropping some fairly complicated code in the write path
      which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
      that is getting updated inside ->write_end.
      
      Note that we do not bother resetting the VFS i_size when truncating a file
      that gets freed to zero as there is no point in doing so because the VFS inode
      is no longer in use at this point.  Just relax the assert in xfs_ifree to
      only check the on-disk size instead.
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ce7ae151
  5. 09 November, 2011 1 commit
  6. 01 November, 2011 1 commit
    • M
      xfs: warn if direct reclaim tries to writeback pages · 94054fa3
      Mel Gorman committed
      Direct reclaim should never writeback pages.  For now, handle the
      situation and warn about it.  Ultimately, this will be a BUG_ON.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94054fa3
  7. 12 October, 2011 6 commits
  8. 14 September, 2011 1 commit
  9. 13 August, 2011 1 commit
    • C
      xfs: remove subdirectories · c59d87c4
      Christoph Hellwig committed
      Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
      annoying subdirectories in the XFS source code.  Besides the large
      number of file renames, the only changes are to the Makefile, a few
      files including headers with the subdirectory prefix, and the binary
      sysctl compat code that includes a header under fs/xfs/ from
      kernel/.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      c59d87c4
  10. 21 July, 2011 2 commits
  11. 08 July, 2011 2 commits
  12. 08 June, 2011 1 commit
  13. 31 March, 2011 1 commit
  14. 10 March, 2011 2 commits
    • J
      block: kill off REQ_UNPLUG · 721a9602
      Jens Axboe committed
      With the plugging now being explicitly controlled by the
      submitter, callers need not pass down unplugging hints
      to the block layer. If they want to unplug, it's because they
      manually plugged on their own - in which case, they should just
      unplug at will.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      721a9602
    • J
      block: remove per-queue plugging · 7eaceacc
      Jens Axboe committed
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So let's kill off the old plugging along with aops->sync_page().
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      7eaceacc
  15. 07 March, 2011 1 commit
  16. 01 December, 2010 1 commit
    • D
      xfs: fix failed write truncation handling. · c726de44
      Dave Chinner committed
      Since the move to the new truncate sequence we call xfs_setattr to
      truncate down excessively instantiated blocks.  As shown by the testcase
      in kernel.org BZ #22452 that doesn't work too well.  Due to the confusion
      between the internal inode size and the VFS inode i_size, it zeroes data
      that it shouldn't.
      
      But full blown truncate seems like overkill here.  We only instantiate
      delayed allocations in the write path, and given that we never released
      the iolock we can't have converted them to real allocations yet either.
      
      The only nasty case is pre-existing preallocation which we need to skip.
      We already do this for page discard during writeback, so make the delayed
      allocation block punching a generic function and call it from the failed
      write path as well as xfs_aops_discard_page. The callers are
      responsible for ensuring that partial blocks are not truncated away,
      and that they hold the ilock.
      
      Based on a fix originally from Christoph Hellwig. This version used
      filesystem blocks as the range unit.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      c726de44
  17. 17 December, 2010 9 commits
  18. 11 November, 2010 1 commit
    • C
      xfs: remove incorrect assert in xfs_vm_writepage · ece413f5
      Christoph Hellwig committed
      In commit 20cb52eb, titled
      "xfs: simplify xfs_vm_writepage" I added an assert that any !mapped and
      uptodate buffers are not dirty.  That assert turns out to trigger a lot
      when running fsx on filesystems with small block sizes.  The reason for
      that is that the assert is simply incorrect.  !mapped and uptodate
      just mean this buffer covers a hole, and whenever we do a set_page_dirty
      we mark all blocks in the page dirty, no matter if they have data or
      not.  So remove the assert, and update the comment above the condition
      to match reality.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      ece413f5
  19. 27 October, 2010 1 commit
    • W
      writeback: remove nonblocking/encountered_congestion references · 1b430bee
      Wu Fengguang committed
      This removes more dead code that was somehow missed by commit 0d99519e
      (writeback: remove unused nonblocking and congestion checks).  There is
      no behavior change except for the removal of two entries from one of the
      ext4 tracing interfaces.
      
      The nonblocking checks in ->writepages are no longer used because the
      flusher now prefers to block on get_request_wait() rather than skip inodes
      on IO congestion.  The latter will lead to more seeky IO.
      
      The nonblocking checks in ->writepage are no longer used because they are
      redundant with the WB_SYNC_NONE check.
      
      We no longer set ->nonblocking in VM page out and page migration, because
      a) it's effectively redundant with WB_SYNC_NONE in current code
      b) its old semantics of "Don't get stuck on request queues" is misbehavior:
         that would skip some dirty inodes on congestion and page out others, which
         is unfair in terms of LRU age.
      
      Inspired by Christoph Hellwig. Thanks!
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Sage Weil <sage@newdream.net>
      Cc: Steve French <sfrench@samba.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b430bee