1. 13 Aug 2013 (1 commit)
    • xfs: create xfs_bmap_util.[ch] · 68988114
      Committed by Dave Chinner
      There is a bunch of code in xfs_bmap.c that is kernel specific and
      not shared with userspace. To minimise the difference between the
      kernel and userspace code, shift this unshared code to
      xfs_bmap_util.c, and the declarations to xfs_bmap_util.h.
      
      The biggest issue here is xfs_bmap_finish() - userspace has its own
      definition of this function, and so we need to move it out of
      xfs_bmap.[ch]. This means several other files need to include
      xfs_bmap_util.h as well.
      
      It also introduces an interesting dance for the stack switching
      code in xfs_bmapi_allocate(). The stack switching/workqueue code is
      actually moved to xfs_bmap_util.c, so that userspace can simply use
      a #define in a header file to connect the dots without needing to
      know about the stack switch code at all.
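      
      A rough sketch of that split; the wrapper/worker names below are
      illustrative and may not match the exact XFS symbols:
      
          /* Shared declaration: the real allocation routine, common to
           * kernel and userspace (sketch only). */
          int __xfs_bmapi_allocate(struct xfs_bmalloca *args);
      
          #ifdef __KERNEL__
          /* Kernel: xfs_bmap_util.c wraps the call in a work item so the
           * allocation runs on a fresh stack, then waits for completion. */
          int xfs_bmapi_allocate(struct xfs_bmalloca *args);
          #else
          /* Userspace: no stack switch is needed, so a header just
           * connects the dots with a #define. */
          #define xfs_bmapi_allocate(args)  __xfs_bmapi_allocate(args)
          #endif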
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      68988114
  2. 23 Jul 2013 (1 commit)
    • xfs: fix assertion failure in xfs_vm_write_failed() · 58e59854
      Committed by Jie Liu
      In xfs_vm_write_failed(), we evaluate the block_offset of pos with
      PAGE_MASK which is an unsigned long.  That is fine on 64-bit platforms
      regardless of whether the request pos is 32-bit or 64-bit.  However, on
      32-bit platforms PAGE_MASK is 0xfffff000, so (pos & PAGE_MASK) masks
      off the high 32 bits of a 64-bit pos.
      
      As a result, the evaluated block_offset is incorrect, which will
      trigger the failure ASSERT(block_offset + from == pos) and can
      potentially pass the wrong block to xfs_vm_kill_delalloc_range().
      
      In this case, we can get a kernel panic if CONFIG_XFS_DEBUG is enabled:
      
      XFS: Assertion failed: block_offset + from == pos, file: fs/xfs/xfs_aops.c, line: 1504
      
      ------------[ cut here ]------------
       kernel BUG at fs/xfs/xfs_message.c:100!
       invalid opcode: 0000 [#1] SMP
       ........
       Pid: 4057, comm: mkfs.xfs Tainted: G           O 3.9.0-rc2 #1
       EIP: 0060:[<f94a7e8b>] EFLAGS: 00010282 CPU: 0
       EIP is at assfail+0x2b/0x30 [xfs]
       EAX: 00000056 EBX: f6ef28a0 ECX: 00000007 EDX: f57d22a4
       ESI: 1c2fb000 EDI: 00000000 EBP: ea6b5d30 ESP: ea6b5d1c
       DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 094f3ff4 CR3: 2bcb4000 CR4: 000006f0
       DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
       DR6: ffff0ff0 DR7: 00000400
       Process mkfs.xfs (pid: 4057, ti=ea6b4000 task=ea5799e0 task.ti=ea6b4000)
       Stack:
       00000000 f9525c48 f951fa80 f951f96b 000005e4 ea6b5d7c f9494b34 c19b0ea2
       00000066 f3d6c620 c19b0ea2 00000000 e9a91458 00001000 00000000 00000000
       00000000 c15c7e89 00000000 1c2fb000 00000000 00000000 1c2fb000 00000080
       Call Trace:
       [<f9494b34>] xfs_vm_write_failed+0x74/0x1b0 [xfs]
       [<c15c7e89>] ? printk+0x4d/0x4f
       [<f9494d7d>] xfs_vm_write_begin+0x10d/0x170 [xfs]
       [<c110a34c>] generic_file_buffered_write+0xdc/0x210
       [<f949b669>] xfs_file_buffered_aio_write+0xf9/0x190 [xfs]
       [<f949b7f3>] xfs_file_aio_write+0xf3/0x160 [xfs]
       [<c115e504>] do_sync_write+0x94/0xd0
       [<c115ed1f>] vfs_write+0x8f/0x160
       [<c115e470>] ? wait_on_retry_sync_kiocb+0x50/0x50
       [<c115f017>] sys_write+0x47/0x80
       [<c15d860d>] sysenter_do_call+0x12/0x28
       .............
       EIP: [<f94a7e8b>] assfail+0x2b/0x30 [xfs] SS:ESP 0068:ea6b5d1c
       ---[ end trace cdd9af4f4ecab42f ]---
       Kernel panic - not syncing: Fatal exception
      
      In order to avoid this, we can evaluate the block_offset of the start
      of the page using shifts rather than masks, which sidesteps the
      mismatch problem.
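      
      A small standalone C illustration of the mismatch, assuming a 4k page
      size; the 32-bit PAGE_MASK is emulated with a uint32_t so the
      truncation is visible on any host:
      
          #include <stdio.h>
          #include <stdint.h>
      
          #define PAGE_SHIFT 12
          /* On a 32-bit kernel PAGE_MASK is an unsigned long: 0xfffff000 */
          static const uint32_t PAGE_MASK_32 =
                  ~((uint32_t)(1u << PAGE_SHIFT) - 1);
      
          int main(void)
          {
                  uint64_t pos = 0x100003200ULL;  /* 64-bit pos beyond 4GiB */
      
                  /* Buggy: ANDing with a 32-bit mask drops the high bits. */
                  uint64_t bad  = pos & PAGE_MASK_32;
      
                  /* Fixed: find the start of the page with shifts instead. */
                  uint64_t good = (pos >> PAGE_SHIFT) << PAGE_SHIFT;
      
                  printf("bad  = 0x%llx\n", (unsigned long long)bad);
                  printf("good = 0x%llx\n", (unsigned long long)good);
                  return 0;       /* bad = 0x3000, good = 0x100003000 */
          }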
      
      Thanks to Dave Chinner for help finding and fixing this bug.
      Reported-by: Michael L. Semon <mlsemon35@gmail.com>
      Reviewed-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      58e59854
  3. 25 May 2013 (1 commit)
    • xfs: fix sub-page blocksize data integrity writes · 480d7467
      Committed by Dave Chinner
      FSX on 512 byte block size filesystems has been failing for some
      time with corrupted data. The fault dates back to the change in
      the writeback data integrity algorithm that uses a mark-and-sweep
      approach to avoid data writeback livelocks.
      
      Unfortunately, a side effect of this mark-and-sweep approach is that
      each page will only be written once for a data integrity sync, and
      there is a condition in writeback in XFS where a page may require
      two writeback attempts to be fully written. As a result of the high
      level change, we now only get a partial page writeback during the
      integrity sync because the first pass through writeback clears the
      mark left on the page index to tell writeback that the page needs
      writeback....
      
      The cause is writing a partial page in the clustering code. This can
      happen when a mapping boundary falls in the middle of a page - we
      end up writing back the first part of the page that the mapping
      covers, but then never revisit the page to have the remainder mapped
      and written.
      
      The fix is simple - if the mapping boundary falls inside a page,
      then simply abort clustering without touching the page. This means
      that the next ->writepage entry that write_cache_pages() will make
      is the page we aborted on, and xfs_vm_writepage() will map all
      sections of the page correctly. This behaviour is also optimal for
      non-data integrity writes, as it results in contiguous sequential
      writeback of the file rather than missing small holes and having to
      write them as "random" writes in a future pass.
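      
      A minimal sketch of that check (simplified, not the literal
      xfs_convert_page() code): a candidate page only joins the cluster
      when the current mapping covers all of it.
      
          /* Refuse to cluster a page the mapping does not fully cover, so
           * write_cache_pages() re-enters ->writepage on it and
           * xfs_vm_writepage() can map and write the whole page. */
          static bool imap_covers_page(xfs_off_t imap_start, xfs_off_t imap_end,
                                       xfs_off_t page_start, unsigned int page_size)
          {
                  return imap_start <= page_start &&
                         imap_end >= page_start + page_size;
          }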
      
      With this fix, all the fsx tests in xfstests now pass on a 512 byte
      block size filesystem on a 4k page machine.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      
      (cherry picked from commit 49b137cb)
      480d7467
  4. 22 May 2013 (2 commits)
    • xfs: use ->invalidatepage() length argument · 34097dfe
      Committed by Lukas Czerner
      The ->invalidatepage() aop now accepts a range to invalidate, so we
      can make use of it in xfs_vm_invalidatepage().
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      34097dfe
    • mm: change invalidatepage prototype to accept length · d47992f8
      Committed by Lukas Czerner
      Currently there is no way to truncate a partial page where the end
      truncation point is not at the end of the page. This was not needed
      before, because the existing functionality was enough for the file
      system truncate operation to work properly. However, more file
      systems now support the punch hole feature, which can benefit from
      mm supporting truncation of a page only up to a certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept the range to be invalidated, and updates all
      instances of it.
      
      We also change block_invalidatepage() in the same way and actually
      make use of the new length argument by implementing range
      invalidation.
      
      Actual file system implementations will follow, except for the file
      systems where the changes are really simple and should not change
      the behaviour in any way. An implementation of truncate_page_range(),
      which will be able to accept page-unaligned ranges, will follow as
      well.
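      
      Roughly, the prototype change looks like this (a sketch of the
      address_space_operations member and the buffer-layer helper):
      
          /* Before: only the offset of the first byte to invalidate. */
          void (*invalidatepage)(struct page *page, unsigned long offset);
      
          /* After: a length is passed too, so a partial page can be
           * invalidated up to an arbitrary point rather than always to
           * the end of the page. */
          void (*invalidatepage)(struct page *page, unsigned int offset,
                                 unsigned int length);
      
          void block_invalidatepage(struct page *page, unsigned int offset,
                                    unsigned int length);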
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      d47992f8
  5. 21 May 2013 (1 commit)
    • xfs: fix sub-page blocksize data integrity writes · 49b137cb
      Committed by Dave Chinner
      FSX on 512 byte block size filesystems has been failing for some
      time with corrupted data. The fault dates back to the change in
      the writeback data integrity algorithm that uses a mark-and-sweep
      approach to avoid data writeback livelocks.
      
      Unfortunately, a side effect of this mark-and-sweep approach is that
      each page will only be written once for a data integrity sync, and
      there is a condition in writeback in XFS where a page may require
      two writeback attempts to be fully written. As a result of the high
      level change, we now only get a partial page writeback during the
      integrity sync because the first pass through writeback clears the
      mark left on the page index to tell writeback that the page needs
      writeback....
      
      The cause is writing a partial page in the clustering code. This can
      happen when a mapping boundary falls in the middle of a page - we
      end up writing back the first part of the page that the mapping
      covers, but then never revisit the page to have the remainder mapped
      and written.
      
      The fix is simple - if the mapping boundary falls inside a page,
      then simply abort clustering without touching the page. This means
      that the next ->writepage entry that write_cache_pages() will make
      is the page we aborted on, and xfs_vm_writepage() will map all
      sections of the page correctly. This behaviour is also optimal for
      non-data integrity writes, as it results in contiguous sequential
      writeback of the file rather than missing small holes and having to
      write them as "random" writes in a future pass.
      
      With this fix, all the fsx tests in xfstests now pass on a 512 byte
      block size filesystem on a 4k page machine.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      49b137cb
  6. 08 May 2013 (1 commit)
  7. 23 Mar 2013 (1 commit)
    • xfs: Fix WARN_ON(delalloc) in xfs_vm_releasepage() · ff9a28f6
      Committed by Jan Kara
      When a dirty page is truncated from a file but reclaim gets to it before
      truncate_inode_pages(), we hit WARN_ON(delalloc) in
      xfs_vm_releasepage(). This is because reclaim tries to write the page,
      xfs_vm_writepage() just bails out (leaving the page clean), and thus
      reclaim thinks it can continue and calls xfs_vm_releasepage() on a
      page with dirty buffers.
      
      Fix the issue by redirtying the page in xfs_vm_writepage(). This makes
      reclaim stop reclaiming the page and, logically, it also keeps the
      page in a more consistent state, where a page with dirty buffers has
      PageDirty set.
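      
      A sketch of the shape of the fix (not the literal XFS diff): on the
      bail-out path, redirty the page before unlocking it so reclaim backs
      off.
      
          /* xfs_vm_writepage() bail-out path (sketch) */
          redirty_page_for_writepage(wbc, page);
          unlock_page(page);
          return 0;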
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ff9a28f6
  8. 29 Jan 2013 (1 commit)
  9. 26 Jan 2013 (1 commit)
  10. 30 Nov 2012 (1 commit)
    • xfs: fix direct IO nested transaction deadlock. · 437a255a
      Committed by Dave Chinner
      The direct IO path can do a nested transaction reservation when
      writing past the EOF. The first transaction is the append
      transaction for setting the filesize at IO completion, but we may
      also need a transaction for allocation of blocks. If the log is low
      on space due to reservations and a small log, the append transaction
      can be granted, after waiting for space, as the only active
      transaction in the system. This then attempts a reservation for an
      allocation, for which there isn't space in the log, and the
      reservation sleeps.
      The result is that there is nothing left in the system to wake up
      all the processes waiting for log space to come free.
      
      The stack trace that shows this deadlock is relatively innocuous:
      
       xlog_grant_head_wait
       xlog_grant_head_check
       xfs_log_reserve
       xfs_trans_reserve
       xfs_iomap_write_direct
       __xfs_get_blocks
       xfs_get_blocks_direct
       do_blockdev_direct_IO
       __blockdev_direct_IO
       xfs_vm_direct_IO
       generic_file_direct_write
       xfs_file_dio_aio_write
       xfs_file_aio_write
       do_sync_write
       vfs_write
      
      This was discovered on a filesystem with a log of only 10MB, and a
      log stripe unit of 256k which increased the base reservations by
      512k. Hence an allocation transaction requires 1.2MB of log space to
      be available instead of only 260k, and so greatly increased the
      chance that there wouldn't be enough log space available for the
      nested transaction to succeed. The key to reproducing it is this
      mkfs command:
      
      mkfs.xfs -f -d agcount=16,su=256k,sw=12 -l su=256k,size=2560b $SCRATCH_DEV
      
      The test case was a 1000 fsstress processes running with random
      freeze and unfreezes every few seconds. Thanks to Eryu Guan
      (eguan@redhat.com) for writing the test that found this on a system
      with a somewhat unique default configuration....
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Andrew Dahl <adahl@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      437a255a
  11. 17 Nov 2012 (1 commit)
    • xfs: fix broken error handling in xfs_vm_writepage · 3daed8bc
      Committed by Dave Chinner
      When we shut down the filesystem, it might first be detected in
      writeback when we are allocating an inode size transaction. This
      happens after we have moved all the pages into the writeback state
      and unlocked them. Unfortunately, if we fail to set up the
      transaction we then abort writeback and try to invalidate the
      current page. This then triggers a BUG() in block_invalidatepage()
      because we are trying to invalidate an unlocked page.
      
      Fixing this is a bit of a chicken and egg problem - we can't
      allocate the transaction until we've clustered all the pages into
      the IO and we know the size of it (i.e. whether the last block of
      the IO is beyond the current EOF or not). However, we don't want to
      hold pages locked for long periods of time, especially while we lock
      other pages to cluster them into the write.
      
      To fix this, we need to make a clear delineation in writeback where
      errors can only be handled by IO completion processing. That is,
      once we have marked a page for writeback and unlocked it, we have to
      report errors via IO completion because we've already started the
      IO. We may not have submitted any IO, but we've changed the page
      state to indicate that it is under IO so we must now use the IO
      completion path to report errors.
      
      To do this, add an error field to xfs_submit_ioend() to pass it the
      error that occurred during the building of the ioend chain. When
      this is non-zero, mark each ioend with the error and call
      xfs_finish_ioend() directly rather than building bios. This will
      immediately push the ioends through completion processing with the
      error that has occurred.
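      
      A simplified sketch of that submission logic (field and helper names
      follow the description above; the real function also builds and
      submits bios on the success path):
      
          static void xfs_submit_ioend(struct writeback_control *wbc,
                                       struct xfs_ioend *ioend, int fail)
          {
                  for (; ioend; ioend = ioend->io_list) {
                          if (fail) {
                                  /* Setup failed after pages were marked
                                   * for writeback: report the error via
                                   * IO completion processing. */
                                  ioend->io_error = fail;
                                  xfs_finish_ioend(ioend);
                                  continue;
                          }
                          /* normal path: build and submit bios */
                  }
          }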
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      3daed8bc
  12. 15 Nov 2012 (1 commit)
    • xfs: remove xfs_flush_pages · 4bc1ea6b
      Committed by Dave Chinner
      It is a complex wrapper around VFS functions, but there are VFS
      functions that provide exactly the same functionality. Call the VFS
      functions directly and remove the unnecessary indirection and
      complexity.
      
      We don't need to care about clearing the XFS_ITRUNCATED flag, as
      that is done during .writepages. Hence it is cleared by the VFS
      writeback path if there is anything to write back during the flush.
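      
      For example (a sketch, assuming a caller that used xfs_flush_pages()
      to write back and wait on a byte range of the inode's mapping):
      
          /* Write back and wait on a range of the inode's page cache
           * directly via the VFS instead of going through xfs_flush_pages(). */
          error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
                                               first, last);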
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Andrew Dahl <adahl@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      4bc1ea6b
  13. 14 Nov 2012 (1 commit)
    • xfs: fix broken error handling in xfs_vm_writepage · 7bf7f352
      Committed by Dave Chinner
      When we shut down the filesystem, it might first be detected in
      writeback when we are allocating an inode size transaction. This
      happens after we have moved all the pages into the writeback state
      and unlocked them. Unfortunately, if we fail to set up the
      transaction we then abort writeback and try to invalidate the
      current page. This then triggers a BUG() in block_invalidatepage()
      because we are trying to invalidate an unlocked page.
      
      Fixing this is a bit of a chicken and egg problem - we can't
      allocate the transaction until we've clustered all the pages into
      the IO and we know the size of it (i.e. whether the last block of
      the IO is beyond the current EOF or not). However, we don't want to
      hold pages locked for long periods of time, especially while we lock
      other pages to cluster them into the write.
      
      To fix this, we need to make a clear delineation in writeback where
      errors can only be handled by IO completion processing. That is,
      once we have marked a page for writeback and unlocked it, we have to
      report errors via IO completion because we've already started the
      IO. We may not have submitted any IO, but we've changed the page
      state to indicate that it is under IO so we must now use the IO
      completion path to report errors.
      
      To do this, add an error field to xfs_submit_ioend() to pass it the
      error that occurred during the building of the ioend chain. When
      this is non-zero, mark each ioend with the error and call
      xfs_finish_ioend() directly rather than building bios. This will
      immediately push the ioends through completion processing with the
      error that has occurred.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      7bf7f352
  14. 31 Jul 2012 (1 commit)
    • xfs: Convert to new freezing code · d9457dc0
      Committed by Jan Kara
      Generic code now blocks all writers from standard write paths. So we add
      blocking of all writers coming from ioctls (as a bonus, we get
      protection of ioctls against a racing remount read-only) and convert
      xfs_file_aio_write() to non-racy freeze protection. We also keep
      freeze protection on transaction
      start to block internal filesystem writes such as removal of preallocated
      blocks.
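      
      A sketch of the pattern for the ioctl paths; sb_start_write() and
      sb_end_write() are the generic freeze-protection helpers this series
      converts to, and xfs_do_ioctl_write() is a placeholder for the actual
      work:
      
          sb_start_write(mp->m_super);    /* blocks while a freeze is in progress */
          error = xfs_do_ioctl_write(mp, arg);   /* placeholder */
          sb_end_write(mp->m_super);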
      
      CC: Ben Myers <bpm@sgi.com>
      CC: Alex Elder <elder@kernel.org>
      CC: xfs@oss.sgi.com
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d9457dc0
  15. 23 Jul 2012 (1 commit)
  16. 22 Jul 2012 (1 commit)
  17. 21 Jun 2012 (1 commit)
    • xfs: xfs_vm_writepage clear iomap_valid when !buffer_uptodate (REV2) · 66f93113
      Committed by Alain Renaud
      On filesystems with a block size smaller than PAGE_SIZE we currently
      have a problem with unwritten extents.  If we have a multi-block page
      for which an unwritten extent has been allocated, and only some of the
      buffers have been written to, and they are not contiguous, we can expose
      stale data from disk in the blocks between the writes after extent
      conversion.
      
      Example of a page with unwritten and real data.
      buffer  content
      0       empty  b_state = 0
      1       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      2       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      3       empty  b_state = 0
      4       empty  b_state = 0
      5       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      6       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      7       empty  b_state = 0
      
      Buffers 1, 2, 5, and 6 have been written to, leaving 0, 3, 4, and 7
      empty.  Currently buffers 1, 2, 5, and 6 are added to a single ioend,
      and when IO has completed, extent conversion creates a real extent from
      block 1 through block 6, leaving 0 and 7 unwritten.  However buffers 3
      and 4 were not written to disk, so stale data is exposed from those
      blocks on a subsequent read.
      
      Fix this by setting iomap_valid = 0 when we find a buffer that is not
      Uptodate.  This ensures that buffers 5 and 6 are not added to the same
      ioend as buffers 1 and 2.  Later these blocks will be converted into two
      separate real extents, leaving the blocks in between unwritten.
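      
      A sketch of the check in the xfs_vm_writepage() buffer walk
      (simplified; variable names follow the description above):
      
          if (!buffer_uptodate(bh)) {
                  /* Not written to: break the run so the next uptodate
                   * buffer starts a new ioend and a separate conversion. */
                  iomap_valid = 0;
                  continue;
          }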
      Signed-off-by: Alain Renaud <arenaud@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      66f93113
  18. 15 Jun 2012 (2 commits)
    • xfs: m_maxioffset is redundant · d2c28191
      Committed by Dave Chinner
      The m_maxioffset field in the struct xfs_mount contains the same
      value as the superblock s_maxbytes field. There is no need to carry
      two copies of this limit around, so use the VFS superblock version.
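      
      The idea, roughly (a sketch; the mount and superblock field names are
      the ones used elsewhere in XFS, but the exact macro shape is
      illustrative):
      
          /* Read the limit from the VFS superblock instead of caching a
           * copy of it in the struct xfs_mount. */
          #define XFS_MAXIOFFSET(mp)      ((mp)->m_super->s_maxbytes)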
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      d2c28191
    • xfs: xfs_vm_writepage clear iomap_valid when !buffer_uptodate (REV2) · 7d0fa3ec
      Committed by Alain Renaud
      On filesystems with a block size smaller than PAGE_SIZE we currently
      have a problem with unwritten extents.  If we have a multi-block page
      for which an unwritten extent has been allocated, and only some of the
      buffers have been written to, and they are not contiguous, we can expose
      stale data from disk in the blocks between the writes after extent
      conversion.
      
      Example of a page with unwritten and real data.
      buffer  content
      0       empty  b_state = 0
      1       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      2       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      3       empty  b_state = 0
      4       empty  b_state = 0
      5       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      6       DATA   b_state = 0x1023 Uptodate,Dirty,Mapped,Unwritten
      7       empty  b_state = 0
      
      Buffers 1, 2, 5, and 6 have been written to, leaving 0, 3, 4, and 7
      empty.  Currently buffers 1, 2, 5, and 6 are added to a single ioend,
      and when IO has completed, extent conversion creates a real extent from
      block 1 through block 6, leaving 0 and 7 unwritten.  However buffers 3
      and 4 were not written to disk, so stale data is exposed from those
      blocks on a subsequent read.
      
      Fix this by setting iomap_valid = 0 when we find a buffer that is not
      Uptodate.  This ensures that buffers 5 and 6 are not added to the same
      ioend as buffers 1 and 2.  Later these blocks will be converted into two
      separate real extents, leaving the blocks in between unwritten.
      Signed-off-by: Alain Renaud <arenaud@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      7d0fa3ec
  19. 15 May 2012 (8 commits)
    • xfs: clean up xfs_bit.h includes · ad1e95c5
      Committed by Dave Chinner
      With the removal of xfs_rw.h and other changes over time, xfs_bit.h
      is being included in many files that don't actually need it. Clean
      up the includes as necessary.
      
      Also move the only-used-once xfs_ialloc_find_free() static inline
      function out of a header file that is widely included to reduce
      the number of needless dependencies on xfs_bit.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ad1e95c5
    • xfs: move xfs_get_extsz_hint() and kill xfs_rw.h · 2a0ec1d9
      Committed by Dave Chinner
      The only thing left in xfs_rw.h is a function prototype for an inode
      function.  Move that to xfs_inode.h, and kill xfs_rw.h.
      
      Also move the function implementing the prototype from xfs_rw.c to
      xfs_inode.c so we only have one function left in xfs_rw.c
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      2a0ec1d9
    • xfs: move xfsagino_t to xfs_types.h · 60a34607
      Committed by Dave Chinner
      Untangle the header file includes a bit by moving the definition of
      xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
      xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
      xfs_ag.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      60a34607
    • xfs: Use preallocation for inodes with extsz hints · aff3a9ed
      Committed by Dave Chinner
      xfstest 229 exposes a problem with buffered IO, delayed allocation
      and extent size hints. That is, when we do delayed allocation during
      buffered IO, we reserve space for the extent size hint alignment and
      allocate the physical space to align the extent, but we do not zero
      the regions of the extent that aren't written by the write(2)
      syscall. The result is that we expose stale data in unwritten
      regions of the extent size hints.
      
      There are two ways to fix this. The first is to detect that we are
      doing unaligned writes, check if there is already a mapping or data
      over the extent size hint range, and if not zero the page cache
      first before then doing the real write. This can be very expensive
      for large extent size hints, especially if the subsequent writes
      fill the entire extent size before the data is written to disk.
      
      The second, and simpler way, is simply to turn off delayed
      allocation when the extent size hint is set and use preallocation
      instead. This results in unwritten extents being laid down on disk
      and so only the written portions will be converted. This matches the
      behaviour for direct IO, and will also work for the real time
      device. The disadvantage of this approach is that for small extent
      size hints we can get file fragmentation, but in general extent size
      hints are fairly large (e.g. stripe width sized) so this isn't a big
      deal.
      
      Implement the second approach as it is simple and effective.
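      
      A simplified sketch of the decision in the buffered write mapping
      path (not the literal __xfs_get_blocks() logic; the call signatures
      are abbreviated):
      
          if (xfs_get_extsz_hint(ip)) {
                  /* Extent size hint set: preallocate unwritten extents,
                   * as the direct IO path does, instead of using delayed
                   * allocation. */
                  error = xfs_iomap_write_direct(ip, offset, size, &imap, nimaps);
          } else {
                  error = xfs_iomap_write_delay(ip, offset, size, &imap);
          }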
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      aff3a9ed
    • xfs: punch new delalloc blocks out of failed writes inside EOF. · d3bc815a
      Committed by Dave Chinner
      When a partial write inside EOF fails, it can leave delayed
      allocation blocks lying around because they don't get punched back
      out. This leads to assert failures like:
      
      XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 847
      
      when evicting inodes from the cache. This can be trivially triggered
      by xfstests 083, which takes between 5 and 15 executions on a 512
      byte block size filesystem to trip over this. Debugging shows a
      failed write due to ENOSPC calling xfs_vm_write_failed such as:
      
      [ 5012.329024] ino 0xa0026: vwf to 0x17000, sze 0x1c85ae
      
      and no action is taken on it. This leaves behind a delayed
      allocation extent that has no page covering it and no data in it:
      
      [ 5015.867162] ino 0xa0026: blks: 0x83 delay blocks 0x1, size 0x2538c0
      [ 5015.868293] ext 0: off 0x4a, fsb 0x50306, len 0x1
      [ 5015.869095] ext 1: off 0x4b, fsb 0x7899, len 0x6b
      [ 5015.869900] ext 2: off 0xb6, fsb 0xffffffffe0008, len 0x1
                                          ^^^^^^^^^^^^^^^
      [ 5015.871027] ext 3: off 0x36e, fsb 0x7a27, len 0xd
      [ 5015.872206] ext 4: off 0x4cf, fsb 0x7a1d, len 0xa
      
      So the delayed allocation extent is one block long at offset
      0x16c00. Tracing shows that a bigger write:
      
      xfs_file_buffered_write: size 0x1c85ae offset 0x959d count 0x1ca3f ioflags
      
      allocates the block, and then fails with ENOSPC trying to allocate
      the last block on the page, leading to a failed write with stale
      delalloc blocks on it.
      
      Because we've had an ENOSPC when trying to allocate 0x16e00, it
      means that we are never going to call ->write_end on the page, and
      so the newly allocated buffer will not get marked dirty or have the
      buffer_new state cleared. In other words, what the above write is
      supposed to end up with is this mapping for the page:
      
          +------+------+------+------+------+------+------+------+
            UMA    UMA    UMA    UMA    UMA    UMA    UND    FAIL
      
      where:  U = uptodate
              M = mapped
              N = new
              A = allocated
              D = delalloc
              FAIL = block we ENOSPC'd on.
      
      and the key point being the buffer_new() state for the newly
      allocated delayed allocation block. Except it doesn't - we're not
      marking buffers new correctly.
      
      That buffer_new() problem goes back to the xfs_iomap removal days,
      where xfs_iomap() used to return a "new" status for any map with
      newly allocated blocks, so that __xfs_get_blocks() could call
      set_buffer_new() on it. We still have the "new" variable and the
      check for it in the set_buffer_new() logic - except we never set it
      now!
      
      Hence that newly allocated delalloc block doesn't have the new flag
      set on it, so when the write fails we cannot tell which blocks we
      are supposed to punch out. Why do we need the buffer_new flag? Well,
      that's because we can have this case:
      
          +------+------+------+------+------+------+------+------+
            UMD    UMD    UMD    UMD    UMD    UMD    UND    FAIL
      
      where all the UMD buffers contain valid data from a previously
      successful write() system call. We only want to punch the UND buffer
      because that's the only one that we added in this write and it was
      only this write that failed.
      
      That implies that even the old buffer_new() logic was wrong -
      because it would result in all those UMD buffers on the page having
      set_buffer_new() called on them even though they aren't new. Hence
      we should only be calling set_buffer_new() for delalloc buffers that
      were allocated (i.e. were a hole before xfs_iomap_write_delay() was
      called).
      
      So, fix this set_buffer_new logic according to how we need it to
      work for handling failed writes correctly. Also, restore the new
      buffer logic handling for blocks allocated via
      xfs_iomap_write_direct(), because it should still set the buffer_new
      flag appropriately for newly allocated blocks, too.
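      
      A sketch of the resulting __xfs_get_blocks() behaviour (simplified):
      the "new" flag is now actually set when the mapping covers freshly
      allocated space, and only then is the buffer marked new.
      
          if (new)
                  set_buffer_new(bh_result);  /* freshly allocated blocks only */
          if (ISUNWRITTEN(&imap))
                  set_buffer_unwritten(bh_result);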
      
      So, now we have the buffer_new() being set appropriately in
      __xfs_get_blocks(), we can detect the exact delalloc ranges that
      we allocated in a failed write, and hence can now do a walk of the
      buffers on a page to find them.
      
      Except, it's not that easy. When block_write_begin() fails, it
      unlocks and releases the page that we just had an error on, so we
      can't use that page to handle errors anymore. We have to get access
      to the page while it is still locked to walk the buffers. Hence we
      have to open code block_write_begin() in xfs_vm_write_begin() to be
      able to insert xfs_vm_write_failed() in the right place.
      
      With that, we can pass the page and write range to
      xfs_vm_write_failed() and walk the buffers on the page, looking for
      delalloc buffers that are either new or beyond EOF and punch them
      out. Handling buffers beyond EOF ensures we still handle the
      existing case that xfs_vm_write_failed() handles.
      
      Of special note is the truncate_pagecache() handling - that should
      only be done for pages outside EOF - pages within EOF can still
      contain valid, dirty data so we must not punch them out of the
      cache.
      
      That just leaves the xfs_vm_write_end() failure handling.
      The only failure case here is that we didn't copy the entire range,
      and generic_write_end() handles that by zeroing the region of the
      page that wasn't copied, so we don't have to punch out blocks within
      the file because they are guaranteed to contain zeros. Hence we only
      have to handle the existing "beyond EOF" case and don't need access
      to the buffers on the page. Hence it remains largely unchanged.
      
      Note that xfs_getbmap() can still trip over delalloc blocks beyond
      EOF that are left there by speculative delayed allocation. Hence
      this bug fix does not solve all known issues with bmap vs delalloc,
      but it does fix all the known accidental occurrences of the
      problem.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      d3bc815a
    • xfs: page type check in writeback only checks last buffer · 6ffc4db5
      Committed by Dave Chinner
      xfs_is_delayed_page() checks to see if a page has buffers matching
      the given IO type passed in. It does so by walking the buffer heads
      on the page and checking if the state flags match the IO type.
      
      However, the "acceptable" variable that is calculated is overwritten
      every time a new buffer is checked. Hence if the first buffer on the
      page is of the right type, this state is lost if the second buffer
      is not of the correct type. This means that xfs_aops_discard_page()
      may not discard delalloc regions when it is supposed to, and
      xfs_convert_page() may not cluster IO as efficiently as possible.
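      
      A sketch of the corrected walk (simplified from xfs_check_page_type();
      the IO type constant names are illustrative): the result is
      accumulated instead of overwritten, so one matching buffer anywhere
      on the page is enough.
      
          do {
                  if (buffer_unwritten(bh))
                          acceptable += (type == XFS_IO_UNWRITTEN);
                  else if (buffer_delay(bh))
                          acceptable += (type == XFS_IO_DELALLOC);
                  else if (buffer_dirty(bh) && buffer_mapped(bh))
                          acceptable += (type == XFS_IO_OVERWRITE);
                  else
                          break;
          } while ((bh = bh->b_this_page) != head);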
      
      This problem only occurs on filesystems with a block size smaller
      than page size.
      
      Also, rename xfs_is_delayed_page() to xfs_check_page_type() to
      better describe what it is doing - it is not delalloc specific
      anymore.
      
      The problem was first noticed by Peter Watkins.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      6ffc4db5
    • xfs: punch all delalloc blocks beyond EOF on write failure. · 01c84d2d
      Committed by Dave Chinner
      I've been seeing regular ASSERT failures in xfstests when running
      fsstress based tests over the past month. xfs_getbmap() has been
      failing this test:
      
      XFS: Assertion failed: ((iflags & BMV_IF_DELALLOC) != 0) ||
      (map[i].br_startblock != DELAYSTARTBLOCK), file: fs/xfs/xfs_bmap.c,
      line: 5650
      
      where it is encountering a delayed allocation extent after writing
      all the dirty data to disk and then walking the extent map
      atomically by holding the XFS_IOLOCK_SHARED to prevent new delayed
      allocation extents from being created.
      
      Test 083 on a 512 byte block size filesystem was used to reproduce
      the problem, because it only had a 5s run time and would usually fail
      every 3-4 runs. This test is exercising ENOSPC behaviour by running
      fsstress on a nearly full filesystem. The following trace extract
      shows the final few events on the inode that tripped the assert:
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_setfilesize
       xfs_setfilesize:       isize 0x180000 disize 0x12d400 offset 0x17e200 count 7680
      
      file size updated to 0x180000 by IO completion
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_iext_insert:       state  idx 3 offset 3072 block 4503599627239432 count 1 flag 0 caller xfs_bmap_add_extent_hole_delay
       xfs_get_blocks_alloc:  size 0x180000 offset 0x180000 count 512 type  startoff 0xc00 startblock -1 blockcount 0x1
       xfs_ilock:             flags ILOCK_EXCL caller __xfs_get_blocks
      
      delalloc write, adding a single block at offset 0x180000
      
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180200 count 512
      
      ENOSPC trying to allocate a delalloc block at offset 0x180200
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_get_blocks_alloc:  size 0x180000 offset 0x180200 count 512 type  startoff 0xc00 startblock -1 blockcount 0x2
      
      And succeeding on retry after flushing dirty inodes.
      
       xfs_ilock:             flags ILOCK_EXCL caller __xfs_get_blocks
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180400 count 512
      
      ENOSPC trying to allocate a delalloc block at offset 0x180400
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_iomap_write_delay
       xfs_delalloc_enospc:   isize 0x180000 disize 0x180000 offset 0x180400 count 512
      
      And failing the retry, giving a real ENOSPC error.
      
       xfs_ilock:             flags ILOCK_EXCL caller xfs_vm_write_failed
                                                      ^^^^^^^^^^^^^^^^^^^
      The smoking gun - the write being failed and cleaning up delalloc
      blocks beyond EOF allocated by the failed write.
      
       xfs_getattr:
       xfs_ilock:             flags IOLOCK_SHARED caller xfs_getbmap
       xfs_ilock:             flags ILOCK_SHARED caller xfs_ilock_map_shared
      
      And that's where we died almost immediately afterwards.
      xfs_bmapi_read() found a delalloc extent beyond the current in-memory
      file size. Some debug I added to xfs_getbmap() showed the state just
      before the assert failure:
      
       ino 0x80e48: off 0xc00, fsb 0xffffffffffffffff, len 0x1, size 0x180000
       start_fsb 0x106, end_fsb 0x638
       ino flags 0x2 nex 0xd bmvcnt 0x555, len 0x3c58a6f23c0bf1, start 0xc00
       ext 0: off 0x1fc, fsb 0x24782, len 0x254
       ext 1: off 0x450, fsb 0x40851, len 0x30
       ext 2: off 0x480, fsb 0xd99, len 0x1b8
       ext 3: off 0x92f, fsb 0x4099a, len 0x3b
       ext 4: off 0x96d, fsb 0x41844, len 0x98
       ext 5: off 0xbf1, fsb 0x408ab, len 0xf
      
      which shows that we found a single delalloc block beyond EOF (first
      line of output) when we were returning the map for a length
      somewhere around 10^16 bytes long (second line), and the on-disk
      extents showed they didn't go past EOF (last lines).
      
      Further debug added to xfs_vm_write_failed() showed this happened
      when punching out delalloc blocks beyond the end of the file after
      the failed write:
      
      [  132.606693] ino 0x80e48: vwf to 0x181000, sze 0x180000
      [  132.609573] start_fsb 0xc01, end_fsb 0xc08
      
      It punched the range 0xc01 -> 0xc08, but the range we really need to
      punch is 0xc00 -> 0xc07 (8 blocks from 0xc00), as this testing was
      run on a 512 byte block size filesystem (8 blocks per page), so the
      punch should start from 0xc00. End_fsb is correct, but start_fsb is
      wrong, as we punch from start_fsb for (end_fsb - start_fsb) blocks.
      Hence we are not punching the delalloc block beyond EOF in this case.
      
      The fix is simple - it's a silly off-by-one mistake in calculating
      the range. It's especially silly because the macro used to calculate
      the start_fsb already takes into account the case where the inode
      size is an exact multiple of the filesystem block size...
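      
      A standalone illustration using the numbers from the trace above
      (512 byte blocks, i_size 0x180000, failed write extending to
      0x181000); the rounding macro mimics XFS_B_TO_FSB, and the buggy
      start is shown as the equivalent off-by-one:
      
          #include <stdio.h>
          #include <stdint.h>
      
          #define BLOCKLOG    9                           /* 512 byte blocks */
          #define B_TO_FSB(b) (((b) + (1u << BLOCKLOG) - 1) >> BLOCKLOG)
      
          int main(void)
          {
                  uint64_t isize = 0x180000, write_end = 0x181000;
      
                  uint64_t end_fsb   = B_TO_FSB(write_end);   /* 0xc08 */
                  uint64_t start_bad = B_TO_FSB(isize + 1);   /* 0xc01: one late */
                  uint64_t start_fix = B_TO_FSB(isize);       /* 0xc00: correct */
      
                  printf("bad:  punch %llu blocks from 0x%llx\n",
                         (unsigned long long)(end_fsb - start_bad),
                         (unsigned long long)start_bad);
                  printf("good: punch %llu blocks from 0x%llx\n",
                         (unsigned long long)(end_fsb - start_fix),
                         (unsigned long long)start_fix);
                  return 0;
          }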
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      01c84d2d
    • xfs: use shared ilock mode for direct IO writes by default · 507630b2
      Committed by Dave Chinner
      For the direct IO write path, we only really need the ilock to be
      taken in exclusive mode during IO submission when we need to do
      extent allocation, rather than all the time.
      
      Change the block mapping code to take the ilock in shared mode for the
      initial block mapping, and only retake it exclusively when we actually
      have to perform extent allocations.  We were already dropping the ilock
      for the transaction allocation, so this doesn't introduce new race windows.
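      
      A sketch of the resulting locking pattern for direct IO writes
      (simplified; needs_alloc stands in for the real "hole or delalloc"
      check):
      
          lockmode = XFS_ILOCK_SHARED;
          xfs_ilock(ip, lockmode);
          /* look up the existing mapping under the shared lock */
          if (needs_alloc) {
                  xfs_iunlock(ip, lockmode);
                  lockmode = XFS_ILOCK_EXCL;
                  xfs_ilock(ip, lockmode);
                  /* re-check the mapping, then allocate */
          }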
      
      Based on an earlier patch from Dave Chinner.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      507630b2
  20. 14 Mar 2012 (1 commit)
  21. 06 Mar 2012 (3 commits)
  22. 18 Jan 2012 (2 commits)
    • xfs: remove the i_new_size field in struct xfs_inode · 2813d682
      Committed by Christoph Hellwig
      Now that we use the VFS i_size field throughout XFS there is no need for the
      i_new_size field any more given that the VFS i_size field gets updated
      in ->write_end before unlocking the page, and thus is always uptodate when
      writeback could see a page.  Removing i_new_size also has the advantage that
      we will never have to trim back di_size during a failed buffered write,
      given that it never gets updated past i_size.
      
      Note that currently the generic direct I/O code only updates i_size after
      calling our end_io handler, which requires a small workaround to make
      sure di_size actually makes it to disk.  I hope to fix this properly in
      the generic code.
      
      A downside is that we lose the support for parallel non-overlapping O_DIRECT
      appending writes that recently was added.  I don't think keeping the complex
      and fragile i_new_size infrastructure for this is a good tradeoff - if we
      really care about parallel appending writers we should investigate turning
      the iolock into a range lock, which would also allow for parallel
      non-overlapping buffered writers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      2813d682
    • xfs: remove the i_size field in struct xfs_inode · ce7ae151
      Committed by Christoph Hellwig
      There is no fundamental need to keep an in-memory inode size copy in the XFS
      inode.  We already have the on-disk value in the dinode, and the separate
      in-memory copy that we need for regular files only in the VFS inode.
      
      Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
      VFS inode i_size field for regular files.  Switch code that was directly
      accessing the i_size field in the xfs_inode to XFS_ISIZE, or, in cases
      where we are limited to regular files, to direct access of the VFS
      inode i_size field.
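      
      The resulting helper looks roughly like this (a sketch, close to but
      not necessarily verbatim the XFS code):
      
          static inline xfs_fsize_t XFS_ISIZE(struct xfs_inode *ip)
          {
                  if (S_ISREG(VFS_I(ip)->i_mode))
                          return i_size_read(VFS_I(ip));
                  return ip->i_d.di_size;
          }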
      
      This also allows dropping some fairly complicated code in the write path
      which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
      that is getting updated inside ->write_end.
      
      Note that we do not bother resetting the VFS i_size when truncating a file
      that gets freed to zero as there is no point in doing so because the VFS inode
      is no longer in use at this point.  Just relax the assert in xfs_ifree to
      only check the on-disk size instead.
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ce7ae151
  23. 09 Nov 2011 (1 commit)
  24. 01 Nov 2011 (1 commit)
    • xfs: warn if direct reclaim tries to writeback pages · 94054fa3
      Committed by Mel Gorman
      Direct reclaim should never writeback pages.  For now, handle the
      situation and warn about it.  Ultimately, this will be a BUG_ON.
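      
      A sketch of the check added to xfs_vm_writepage() (simplified): if the
      caller is direct reclaim rather than kswapd, warn once and redirty the
      page instead of writing it.
      
          if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC | PF_KSWAPD)) ==
                           PF_MEMALLOC))
                  goto redirty;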
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94054fa3
  25. 12 Oct 2011 (4 commits)