1. 02 3月, 2010 2 次提交
  2. 06 2月, 2010 2 次提交
    • D
      xfs: Use delayed write for inodes rather than async V2 · c854363e
      Dave Chinner 提交于
      We currently do background inode flush asynchronously, resulting in
      inodes being written in whatever order the background writeback
      issues them. Not only that, there are also blocking and non-blocking
      asynchronous inode flushes, depending on where the flush comes from.
      
      This patch completely removes asynchronous inode writeback. It
      removes all the strange writeback modes and replaces them with
      either a synchronous flush or a non-blocking delayed write flush.
      That is, inode flushes will only issue IO directly if they are
      synchronous, and background flushing may do nothing if the operation
      would block (e.g. on a pinned inode or buffer lock).
      
      Delayed write flushes will now result in the inode buffer sitting in
      the delwri queue of the buffer cache to be flushed by either an AIL
      push or by the xfsbufd timing out the buffer. This will allow
      accumulation of dirty inode buffers in memory and allow optimisation
      of inode cluster writeback at the xfsbufd level where we have much
      greater queue depths than the block layer elevators. We will also
      get adjacent inode cluster buffer IO merging for free when a later
      patch in the series allows sorting of the delayed write buffers
      before dispatch.
      
      This effectively means that any inode that is written back by
      background writeback will be seen as flush locked during AIL
      pushing, and will result in the buffers being pushed from there.
      This writeback path is currently non-optimal, but the next patch
      in the series will fix that problem.
      
      A side effect of this delayed write mechanism is that background
      inode reclaim will no longer directly flush inodes, nor can it wait
      on the flush lock. The result is that inode reclaim must leave the
      inode in the reclaimable state until it is clean. Hence attempts to
      reclaim a dirty inode in the background will simply skip the inode
      until it is clean and this allows other mechanisms (i.e. xfsbufd) to
      do more optimal writeback of the dirty buffers. As a result, the
      inode reclaim code has been rewritten so that it no longer relies on
      the ambiguous return values of xfs_iflush() to determine whether it
      is safe to reclaim an inode.
      
      Portions of this patch are derived from patches by Christoph
      Hellwig.
      
      Version 2:
      - cleanup reclaim code as suggested by Christoph
      - log background reclaim inode flush errors
      - just pass sync flags to xfs_iflush
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      c854363e
    • D
      xfs: Make inode reclaim states explicit · 777df5af
      Dave Chinner 提交于
      A.K.A.: don't rely on xfs_iflush() return value in reclaim
      
      We have gradually been moving checks out of the reclaim code because
      they are duplicated in xfs_iflush(). We've had a history of problems
      in this area, and many of them stem from the overloading of the
      return values from xfs_iflush() and interaction with inode flush
      locking to determine if the inode is safe to reclaim.
      
      With the desire to move to delayed write flushing of inodes and
      non-blocking inode tree reclaim walks, the overloading of the
      return value of xfs_iflush makes it very difficult to determine
      the correct thing to do next.
      
      This patch explicitly re-adds the checks to the inode reclaim code,
      removing the reliance on the return value of xfs_iflush() to
      determine what to do next. It also means that we can clearly
      document all the inode states that reclaim must handle and hence
      we can easily see that we handled all the necessary cases.
      
      This also removes the need for the xfs_inode_clean() check in
      xfs_iflush() as all callers now check this first (safely).
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      777df5af
  3. 22 1月, 2010 2 次提交
  4. 16 1月, 2010 3 次提交
  5. 11 1月, 2010 1 次提交
    • D
      xfs: Don't flush stale inodes · 44e08c45
      Dave Chinner 提交于
      Because inodes remain in cache much longer than inode buffers do
      under memory pressure, we can get the situation where we have
      stale, dirty inodes being reclaimed but the backing storage has
      been freed.  Hence we should never, ever flush XFS_ISTALE inodes
      to disk as there is no guarantee that the backing buffer is in
      cache and still marked stale when the flush occurs.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      44e08c45
  6. 15 12月, 2009 2 次提交
    • C
      xfs: event tracing support · 0b1b213f
      Christoph Hellwig 提交于
      Convert the old xfs tracing support that could only be used with the
      out of tree kdb and xfsidbg patches to use the generic event tracer.
      
      To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
      all xfs trace channels by:
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
      
      or alternatively enable single events by just doing the same in one
      event subdirectory, e.g.
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable
      
      or set more complex filters, etc. In Documentation/trace/events.txt
      all this is desctribed in more detail.  To reads the events do a
      
         cat /sys/kernel/debug/tracing/trace
      
      Compared to the last posting this patch converts the tracing mostly to
      the one tracepoint per callsite model that other users of the new
      tracing facility also employ.  This allows a very fine-grained control
      of the tracing, a cleaner output of the traces and also enables the
      perf tool to use each tracepoint as a virtual performance counter,
           allowing us to e.g. count how often certain workloads git various
           spots in XFS.  Take a look at
      
          http://lwn.net/Articles/346470/
      
      for some examples.
      
      Also the btree tracing isn't included at all yet, as it will require
      additional core tracing features not in mainline yet, I plan to
      deliver it later.
      
      And the really nice thing about this patch is that it actually removes
      many lines of code while adding this nice functionality:
      
       fs/xfs/Makefile                |    8
       fs/xfs/linux-2.6/xfs_acl.c     |    1
       fs/xfs/linux-2.6/xfs_aops.c    |   52 -
       fs/xfs/linux-2.6/xfs_aops.h    |    2
       fs/xfs/linux-2.6/xfs_buf.c     |  117 +--
       fs/xfs/linux-2.6/xfs_buf.h     |   33
       fs/xfs/linux-2.6/xfs_fs_subr.c |    3
       fs/xfs/linux-2.6/xfs_ioctl.c   |    1
       fs/xfs/linux-2.6/xfs_ioctl32.c |    1
       fs/xfs/linux-2.6/xfs_iops.c    |    1
       fs/xfs/linux-2.6/xfs_linux.h   |    1
       fs/xfs/linux-2.6/xfs_lrw.c     |   87 --
       fs/xfs/linux-2.6/xfs_lrw.h     |   45 -
       fs/xfs/linux-2.6/xfs_super.c   |  104 ---
       fs/xfs/linux-2.6/xfs_super.h   |    7
       fs/xfs/linux-2.6/xfs_sync.c    |    1
       fs/xfs/linux-2.6/xfs_trace.c   |   75 ++
       fs/xfs/linux-2.6/xfs_trace.h   | 1369 +++++++++++++++++++++++++++++++++++++++++
       fs/xfs/linux-2.6/xfs_vnode.h   |    4
       fs/xfs/quota/xfs_dquot.c       |  110 ---
       fs/xfs/quota/xfs_dquot.h       |   21
       fs/xfs/quota/xfs_qm.c          |   40 -
       fs/xfs/quota/xfs_qm_syscalls.c |    4
       fs/xfs/support/ktrace.c        |  323 ---------
       fs/xfs/support/ktrace.h        |   85 --
       fs/xfs/xfs.h                   |   16
       fs/xfs/xfs_ag.h                |   14
       fs/xfs/xfs_alloc.c             |  230 +-----
       fs/xfs/xfs_alloc.h             |   27
       fs/xfs/xfs_alloc_btree.c       |    1
       fs/xfs/xfs_attr.c              |  107 ---
       fs/xfs/xfs_attr.h              |   10
       fs/xfs/xfs_attr_leaf.c         |   14
       fs/xfs/xfs_attr_sf.h           |   40 -
       fs/xfs/xfs_bmap.c              |  507 +++------------
       fs/xfs/xfs_bmap.h              |   49 -
       fs/xfs/xfs_bmap_btree.c        |    6
       fs/xfs/xfs_btree.c             |    5
       fs/xfs/xfs_btree_trace.h       |   17
       fs/xfs/xfs_buf_item.c          |   87 --
       fs/xfs/xfs_buf_item.h          |   20
       fs/xfs/xfs_da_btree.c          |    3
       fs/xfs/xfs_da_btree.h          |    7
       fs/xfs/xfs_dfrag.c             |    2
       fs/xfs/xfs_dir2.c              |    8
       fs/xfs/xfs_dir2_block.c        |   20
       fs/xfs/xfs_dir2_leaf.c         |   21
       fs/xfs/xfs_dir2_node.c         |   27
       fs/xfs/xfs_dir2_sf.c           |   26
       fs/xfs/xfs_dir2_trace.c        |  216 ------
       fs/xfs/xfs_dir2_trace.h        |   72 --
       fs/xfs/xfs_filestream.c        |    8
       fs/xfs/xfs_fsops.c             |    2
       fs/xfs/xfs_iget.c              |  111 ---
       fs/xfs/xfs_inode.c             |   67 --
       fs/xfs/xfs_inode.h             |   76 --
       fs/xfs/xfs_inode_item.c        |    5
       fs/xfs/xfs_iomap.c             |   85 --
       fs/xfs/xfs_iomap.h             |    8
       fs/xfs/xfs_log.c               |  181 +----
       fs/xfs/xfs_log_priv.h          |   20
       fs/xfs/xfs_log_recover.c       |    1
       fs/xfs/xfs_mount.c             |    2
       fs/xfs/xfs_quota.h             |    8
       fs/xfs/xfs_rename.c            |    1
       fs/xfs/xfs_rtalloc.c           |    1
       fs/xfs/xfs_rw.c                |    3
       fs/xfs/xfs_trans.h             |   47 +
       fs/xfs/xfs_trans_buf.c         |   62 -
       fs/xfs/xfs_vnodeops.c          |    8
       70 files changed, 2151 insertions(+), 2592 deletions(-)
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      0b1b213f
    • C
      xfs: change the xfs_iext_insert / xfs_iext_remove · 6ef35544
      Christoph Hellwig 提交于
      Change the xfs_iext_insert / xfs_iext_remove prototypes to pass more
      information which will allow pushing the trace points from the callers
      into those functions.  This includes folding the whichfork information
      into the state variable to minimize the addition stack footprint.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      6ef35544
  7. 09 10月, 2009 1 次提交
    • C
      xfs: implement ->dirty_inode to fix timestamp handling · f9581b14
      Christoph Hellwig 提交于
      This is picking up on Felix's repost of Dave's patch to implement a
      .dirty_inode method.  We really need this notification because
      the VFS keeps writing directly into the inode structure instead
      of going through methods to update this state.  In addition to
      the long-known atime issue we now also have a caller in VM code
      that updates c/mtime that way for shared writeable mmaps.  And
      I found another one that no one has noticed in practice in the FIFO
      code.
      
      So implement ->dirty_inode to set i_update_core whenever the
      inode gets externally dirtied, and switch the c/mtime handling to
      the same scheme we already use for atime (always picking up
      the value from the Linux inode).
      
      Note that this patch also removes the xfs_synchronize_atime call
      in xfs_reclaim it was superflous as we already synchronize the time
      when writing the inode via the log (xfs_inode_item_format) or the
      normal buffers (xfs_iflush_int).
      
      In addition also remove the I_CLEAR check before copying the Linux
      timestamps - now that we always have the Linux inode available
      we can always use the timestamps in it.
      
      Also switch to just using file_update_time for regular reads/writes -
      that will get us all optimization done to it for free and make
      sure we notice early when it breaks.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NFelix Blyakher <felixb@sgi.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      f9581b14
  8. 01 9月, 2009 1 次提交
  9. 12 8月, 2009 2 次提交
  10. 02 7月, 2009 1 次提交
  11. 10 6月, 2009 1 次提交
  12. 30 4月, 2009 1 次提交
    • L
      xfs_file_last_byte() needs to acquire ilock · def6b3ba
      Lachlan McIlroy 提交于
      We had some systems crash with this stack:
      
      [<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
      [<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
      [<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
      [<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
      [<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
      [<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
      [<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
      [<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
      [<a000000100162ea0>] __fput+0x1a0/0x420
      [<a000000100163160>] fput+0x40/0x60
      
      The problem here is that xfs_file_last_byte() does not acquire the
      inode lock and can therefore race with another thread that is modifying
      the extext list.  While xfs_bmap_last_offset() is trying to lookup
      what was the last extent some extents were merged and the extent list
      shrunk so the index we lookup is now beyond the end of the extent list
      and potentially in a freed buffer.
      Signed-off-by: NLachlan McIlroy <lmcilroy@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NFelix Blyakher <felixb@sgi.com>
      Signed-off-by: NFelix Blyakher <felixb@sgi.com>
      def6b3ba
  13. 29 4月, 2009 1 次提交
    • L
      xfs_file_last_byte() needs to acquire ilock · f25181f5
      Lachlan McIlroy 提交于
      We had some systems crash with this stack:
      
      [<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
      [<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
      [<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
      [<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
      [<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
      [<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
      [<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
      [<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
      [<a000000100162ea0>] __fput+0x1a0/0x420
      [<a000000100163160>] fput+0x40/0x60
      
      The problem here is that xfs_file_last_byte() does not acquire the
      inode lock and can therefore race with another thread that is modifying
      the extext list.  While xfs_bmap_last_offset() is trying to lookup
      what was the last extent some extents were merged and the extent list
      shrunk so the index we lookup is now beyond the end of the extent list
      and potentially in a freed buffer.
      Signed-off-by: NLachlan McIlroy <lmcilroy@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NFelix Blyakher <felixb@sgi.com>
      Signed-off-by: NFelix Blyakher <felixb@sgi.com>
      f25181f5
  14. 19 1月, 2009 3 次提交
  15. 16 1月, 2009 1 次提交
  16. 22 12月, 2008 1 次提交
  17. 11 12月, 2008 1 次提交
    • C
      [XFS] resync headers with libxfs · 6d73cf13
      Christoph Hellwig 提交于
       - xfs_sb.h add the XFS_SB_VERSION2_PARENTBIT features2 that has been
         around in userspace for some time
       - xfs_inode.h: move a few things out of __KERNEL__ that are needed by
         userspace
       - xfs_mount.h: only include xfs_sync.h under __KERNEL__
       - xfs_inode.c: minor whitespace fixup.  I accidentaly changes this when
         importing this file for use by userspace.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NLachlan McIlroy <lachlan@sgi.com>
      6d73cf13
  18. 10 12月, 2008 1 次提交
  19. 04 12月, 2008 2 次提交
  20. 01 12月, 2008 9 次提交
  21. 17 11月, 2008 1 次提交
    • D
      [XFS] Fix double free of log tickets · cc09c0dc
      Dave Chinner 提交于
      When an I/O error occurs during an intermediate commit on a rolling
      transaction, xfs_trans_commit() will free the transaction structure
      and the related ticket. However, the duplicate transaction that
      gets used as the transaction continues still contains a pointer
      to the ticket. Hence when the duplicate transaction is cancelled
      and freed, we free the ticket a second time.
      
      Add reference counting to the ticket so that we hold an extra
      reference to the ticket over the transaction commit. We drop the
      extra reference once we have checked that the transaction commit
      did not return an error, thus avoiding a double free on commit
      error.
      
      Credit to Nick Piggin for tripping over the problem.
      
      SGI-PV: 989741
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NLachlan McIlroy <lachlan@sgi.com>
      cc09c0dc
  22. 10 11月, 2008 1 次提交
    • L
      [XFS] Wait for all I/O on truncate to zero file size · 2cf7f0da
      Lachlan McIlroy 提交于
      It's possible to have outstanding xfs_ioend_t's queued when the file size
      is zero. This can happen in the direct I/O path when a direct I/O write
      fails due to ENOSPC. In this case the xfs_ioend_t will still be queued (ie
      xfs_end_io_direct() does not know that the I/O failed so can't force the
      xfs_ioend_t to be flushed synchronously).
      
      When we truncate a file on unlink we don't know to wait for these
      xfs_ioend_ts and we can have a use-after-free situation if the inode is
      reclaimed before the xfs_ioend_t is finally processed.
      
      As was suggested by Dave Chinner lets wait for all I/Os to complete when
      truncating the file size to zero.
      
      SGI-PV: 981668
      
      SGI-Modid: xfs-linux-melb:xfs-kern:32216a
      Signed-off-by: NLachlan McIlroy <lachlan@sgi.com>
      Signed-off-by: NChristoph Hellwig <hch@infradead.org>
      2cf7f0da