  1. 19 October 2010, 3 commits
  2. 24 August 2010, 1 commit
    •
      xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE · 5b3eed75
      Committed by Dave Chinner
      Under heavy parallel metadata load (e.g. dbench), we can fail to
      mark all the inodes in a cluster being freed as XFS_ISTALE, as we
      skip inodes we cannot get the XFS_ILOCK_EXCL or the flush lock on.
      When this happens and the inode cluster buffer has already been
      marked stale and freed, inode reclaim can try to write the inode
      out as it is dirty and not marked stale. This can result in
      writing the metadata to a freed extent, or, if it has already
      been overwritten, trigger a magic number check failure and return
      an EUCLEAN error such as:
      
      Filesystem "ram0": inode 0x442ba1 background reclaim flush failed with 117
      
      Fix this by ensuring that we hoover up all in-memory inodes in the
      cluster and mark them XFS_ISTALE when freeing the cluster (a
      simplified sketch of the idea follows this entry).
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      5b3eed75
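      Below is a minimal C model of the rule this commit enforces, using
      made-up stand-in types and flag names rather than the real XFS
      structures: every cached inode in the cluster gets marked stale,
      including the ones whose locks cannot be taken right now.

         /* Simplified model only; names are illustrative, not kernel code. */
         #include <stdbool.h>
         #include <stddef.h>

         #define MODEL_ISTALE  (1u << 0)

         struct model_inode {
             unsigned long ino;
             unsigned int  flags;
             bool          lock_available;  /* stand-in for ILOCK/flush lock */
         };

         /*
          * Free an inode cluster: every cached inode in it must be marked
          * stale, even if it cannot be locked for flushing right now.
          * Skipping locked inodes is exactly the bug being fixed.
          */
         void model_free_cluster(struct model_inode **cluster, size_t ninodes)
         {
             for (size_t i = 0; i < ninodes; i++) {
                 struct model_inode *ip = cluster[i];

                 if (!ip)
                     continue;

                 /* Unconditional: a stale inode must never be written back. */
                 ip->flags |= MODEL_ISTALE;

                 /* Further teardown only if the inode can be locked. */
                 if (!ip->lock_available)
                     continue;

                 /* ... detach from the (now stale) cluster buffer ... */
             }
         }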
  3. 27 July 2010, 9 commits
  4. 24 June 2010, 2 commits
  5. 03 June 2010, 1 commit
    •
      xfs: fix race in inode cluster freeing failing to stale inodes · 5b257b4a
      Committed by Dave Chinner
      When an inode cluster is freed, it needs to mark all inodes in memory as
      XFS_ISTALE before marking the buffer as stale. This is needed because the inodes
      have a different life cycle to the buffer, and once the buffer is torn down
      during transaction completion, we must ensure none of the inodes get written
      back (which is what XFS_ISTALE does).
      
      Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being
      marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these
      inodes either during inode reclaim or tail pushing on the AIL.  The buffer is
      read back, but no longer contains inodes and so triggers assert failures and
      shutdowns. This was reproducible with a run.dbench10 invocation from xfstests.
      
      There are two main causes of xfs_ifree_cluster() failing. The first is simple -
      it checks in-memory inodes it finds in the per-ag icache to see if they are
      clean without holding the flush lock. If they are clean it skips them
      completely. However, if an inode has been flushed delwri, it will
      appear clean, but is not guaranteed to be written back until the flush lock has
      been dropped. Hence we may have raced on the clean check and the inode may
      actually be dirty. Hence always mark inodes found in memory stale before we
      check properly if they are clean.
      
      The second is more complex, and makes the first problem easier to hit.
      Basically the in-memory inode scan is done with full knowledge that it can be
      racing with inode flushing and AIL tail pushing, which means that inodes it
      can't get the flush lock on might no longer be attached to the buffer after
      the in-memory inode scan due to IO completion occurring. This is actually
      documented in the code as "needs better interlocking". i.e. this is a zero-day bug.
      
      Effectively, the in-memory scan must be done while the inode buffer is locked
      and IO cannot be issued on it while we do the in-memory inode scan. This
      ensures that inodes we couldn't get the flush lock on are guaranteed to be
      attached to the cluster buffer, so we can then catch all in-memory inodes and
      mark them stale.
      
      Now that the inode cluster buffer is locked before the in-memory scan is done,
      there is no need for the two-phase update of the in-memory inodes, so simplify
      the code into two loops and remove the allocation of the temporary buffer used
      to hold locked inodes across the phases. (A simplified model of this reworked
      ordering follows this entry.)
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      5b257b4a
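      Below is a minimal C model of the reworked ordering, with stand-in
      types and a pthread mutex in place of the real buffer lock: the
      cluster buffer is locked before the in-memory scan so IO completion
      cannot detach inodes mid-scan, and every cached inode is marked
      stale before any per-inode clean check.

         /* Simplified model only; not the actual xfs_ifree_cluster(). */
         #include <pthread.h>
         #include <stdbool.h>
         #include <stddef.h>

         #define MODEL_ISTALE (1u << 0)

         struct model_buf {
             pthread_mutex_t lock;   /* stand-in for the cluster buffer lock */
             bool stale;
         };

         struct model_inode {
             unsigned int flags;
             bool clean;
         };

         void model_ifree_cluster(struct model_buf *bp,
                                  struct model_inode **in_mem, size_t n)
         {
             /* Lock the buffer first: no IO can complete and detach
              * inodes while the in-memory scan runs. */
             pthread_mutex_lock(&bp->lock);

             /* Loop 1: mark every cached inode stale, unconditionally. */
             for (size_t i = 0; i < n; i++)
                 if (in_mem[i])
                     in_mem[i]->flags |= MODEL_ISTALE;

             /* Loop 2: per-inode state can now be inspected safely. */
             for (size_t i = 0; i < n; i++) {
                 if (in_mem[i] && in_mem[i]->clean) {
                     /* ... nothing to write back for this inode ... */
                 }
             }

             bp->stale = true;
             pthread_mutex_unlock(&bp->lock);
         }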
  6. 29 May 2010, 1 commit
    •
      xfs: fix access to upper inodes without inode64 · fb3b504a
      Committed by Christoph Hellwig
      If a filesystem is mounted without the inode64 mount option we
      should still be able to access inodes not fitting into 32 bits, just
      not create new ones.  For this to work we need to make sure the
      inode cache radix tree is initialized for all allocation groups, not
      just those we plan to allocate inodes from.  This patch makes sure
      we initialize the inode cache radix tree for all allocation groups,
      and also cleans xfs_initialize_perag up a bit to separate the
      inode32 logic from the general perag structure setup.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      fb3b504a
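      A hedged C sketch of the idea, with made-up structures rather than
      the real xfs_initialize_perag(): the per-AG inode cache is set up
      for every allocation group, while the inode32 policy only restricts
      which groups new inodes may be allocated from.

         /* Simplified model only; names are illustrative. */
         #include <stdbool.h>
         #include <stdlib.h>

         struct model_perag {
             void *icache_root;           /* stand-in for the radix tree root */
             bool  inode_alloc_allowed;   /* may new inodes be placed here? */
         };

         struct model_mount {
             struct model_perag *pags;
             unsigned int agcount;
             unsigned int max_inode32_ag; /* last AG reachable with 32-bit inos */
             bool inode64;
         };

         int model_initialize_perag(struct model_mount *mp)
         {
             mp->pags = calloc(mp->agcount, sizeof(*mp->pags));
             if (!mp->pags)
                 return -1;

             for (unsigned int agno = 0; agno < mp->agcount; agno++) {
                 /* Cache setup happens for every AG so existing high
                  * inodes can still be looked up without inode64. */
                 mp->pags[agno].icache_root = NULL;   /* empty cache */

                 /* The inode32 restriction only gates new allocations. */
                 mp->pags[agno].inode_alloc_allowed =
                     mp->inode64 || agno <= mp->max_inode32_ag;
             }
             return 0;
         }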
  7. 19 May 2010, 1 commit
  8. 02 March 2010, 2 commits
  9. 06 February 2010, 2 commits
    •
      xfs: Use delayed write for inodes rather than async V2 · c854363e
      Committed by Dave Chinner
      We currently do background inode flush asynchronously, resulting in
      inodes being written in whatever order the background writeback
      issues them. Not only that, there are also blocking and non-blocking
      asynchronous inode flushes, depending on where the flush comes from.
      
      This patch completely removes asynchronous inode writeback. It
      removes all the strange writeback modes and replaces them with
      either a synchronous flush or a non-blocking delayed write flush.
      That is, inode flushes will only issue IO directly if they are
      synchronous, and background flushing may do nothing if the operation
      would block (e.g. on a pinned inode or buffer lock).
      
      Delayed write flushes will now result in the inode buffer sitting in
      the delwri queue of the buffer cache to be flushed by either an AIL
      push or by the xfsbufd timing out the buffer. This will allow
      accumulation of dirty inode buffers in memory and allow optimisation
      of inode cluster writeback at the xfsbufd level where we have much
      greater queue depths than the block layer elevators. We will also
      get adjacent inode cluster buffer IO merging for free when a later
      patch in the series allows sorting of the delayed write buffers
      before dispatch.
      
      This effectively means that any inode that is written back by
      background writeback will be seen as flush locked during AIL
      pushing, and will result in the buffers being pushed from there.
      This writeback path is currently non-optimal, but the next patch
      in the series will fix that problem.
      
      A side effect of this delayed write mechanism is that background
      inode reclaim will no longer directly flush inodes, nor can it wait
      on the flush lock. The result is that inode reclaim must leave the
      inode in the reclaimable state until it is clean. Hence attempts to
      reclaim a dirty inode in the background will simply skip the inode
      until it is clean and this allows other mechanisms (i.e. xfsbufd) to
      do more optimal writeback of the dirty buffers. As a result, the
      inode reclaim code has been rewritten so that it no longer relies on
      the ambiguous return values of xfs_iflush() to determine whether it
      is safe to reclaim an inode.
      
      Portions of this patch are derived from patches by Christoph
      Hellwig.
      
      Version 2:
      - cleanup reclaim code as suggested by Christoph
      - log background reclaim inode flush errors
      - just pass sync flags to xfs_iflush
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      c854363e
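      A minimal C model of the flush policy this commit describes, using
      invented names: a synchronous flush issues IO directly, while a
      background flush never blocks and only queues the backing buffer
      for delayed write.

         /* Simplified model only; not the real xfs_iflush(). */
         #include <stdbool.h>

         enum model_flush_mode { MODEL_FLUSH_SYNC, MODEL_FLUSH_DELWRI };

         struct model_inode {
             bool dirty;
             bool pinned;          /* still held in the in-core log */
             bool buffer_locked;   /* backing cluster buffer is busy */
         };

         /* Returns true if the inode was flushed or queued, false if skipped. */
         bool model_iflush(struct model_inode *ip, enum model_flush_mode mode)
         {
             if (!ip->dirty)
                 return true;

             if (mode == MODEL_FLUSH_SYNC) {
                 /* Synchronous: wait for the pin and the buffer, then
                  * issue the IO directly. */
                 ip->dirty = false;
                 return true;
             }

             /* Background (delayed write): never block.  Skip pinned
              * inodes and busy buffers; the AIL push or xfsbufd will
              * pick them up later. */
             if (ip->pinned || ip->buffer_locked)
                 return false;

             /* Queue the backing buffer on the delwri list instead of
              * issuing IO here, so adjacent cluster writes can merge. */
             return true;
         }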
    •
      xfs: Make inode reclaim states explicit · 777df5af
      Committed by Dave Chinner
      A.K.A.: don't rely on xfs_iflush() return value in reclaim
      
      We have gradually been moving checks out of the reclaim code because
      they are duplicated in xfs_iflush(). We've had a history of problems
      in this area, and many of them stem from the overloading of the
      return values from xfs_iflush() and interaction with inode flush
      locking to determine if the inode is safe to reclaim.
      
      With the desire to move to delayed write flushing of inodes and
      non-blocking inode tree reclaim walks, the overloading of the
      return value of xfs_iflush makes it very difficult to determine
      the correct thing to do next.
      
      This patch explicitly re-adds the checks to the inode reclaim code,
      removing the reliance on the return value of xfs_iflush() to
      determine what to do next. It also means that we can clearly
      document all the inode states that reclaim must handle and hence
      we can easily see that we handled all the necessary cases.
      
      This also removes the need for the xfs_inode_clean() check in
      xfs_iflush() as all callers now check this first (safely).
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      777df5af
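      A hedged C sketch of what "explicit states" means here, with
      invented flags and an enum standing in for the real XFS checks:
      reclaim classifies the inode itself instead of decoding an
      overloaded xfs_iflush() return value.

         /* Simplified model only; the real code inspects XFS inode flags. */
         #include <stdbool.h>

         struct model_inode {
             bool stale;        /* cluster already freed */
             bool clean;        /* nothing left to write back */
             bool pinned;       /* waiting on log IO */
             bool flush_locked; /* a flush is already in progress */
         };

         enum model_reclaim_action {
             MODEL_RECLAIM_NOW,    /* safe to free the in-memory inode */
             MODEL_RECLAIM_FLUSH,  /* needs a flush first */
             MODEL_RECLAIM_SKIP,   /* leave it for a later pass */
         };

         enum model_reclaim_action model_classify(const struct model_inode *ip)
         {
             /* Each state reclaim must handle is spelled out, rather than
              * inferred from the return value of a flush attempt. */
             if (ip->stale || ip->clean)
                 return MODEL_RECLAIM_NOW;
             if (ip->flush_locked || ip->pinned)
                 return MODEL_RECLAIM_SKIP;
             return MODEL_RECLAIM_FLUSH;
         }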
  10. 22 January 2010, 2 commits
  11. 16 January 2010, 3 commits
  12. 11 January 2010, 1 commit
    •
      xfs: Don't flush stale inodes · 44e08c45
      Committed by Dave Chinner
      Because inodes remain in cache much longer than inode buffers do
      under memory pressure, we can get the situation where we have
      stale, dirty inodes being reclaimed but the backing storage has
      been freed.  Hence we should never, ever flush XFS_ISTALE inodes
      to disk as there is no guarantee that the backing buffer is in
      cache and still marked stale when the flush occurs.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      44e08c45
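      The flush-side check this implies, as a tiny hedged C model with an
      invented flag name:

         /* Simplified model only. */
         #define MODEL_ISTALE (1u << 0)

         /* Returns 0 and does no IO for a stale inode, because its
          * backing cluster buffer may already have been freed. */
         int model_iflush_allowed(unsigned int inode_flags)
         {
             if (inode_flags & MODEL_ISTALE)
                 return 0;   /* never write a stale inode back */

             return 1;       /* proceed with the normal flush path */
         }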
  13. 15 December 2009, 2 commits
    •
      xfs: event tracing support · 0b1b213f
      Committed by Christoph Hellwig
      Convert the old xfs tracing support that could only be used with the
      out of tree kdb and xfsidbg patches to use the generic event tracer.
      
      To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
      all xfs trace channels by:
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
      
      or alternatively enable single events by just doing the same in one
      event subdirectory, e.g.
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable
      
      or set more complex filters, etc. In Documentation/trace/events.txt
      all this is described in more detail.  To read the events do a
      
         cat /sys/kernel/debug/tracing/trace
      
      Compared to the last posting this patch converts the tracing mostly to
      the one tracepoint per callsite model that other users of the new
      tracing facility also employ.  This allows a very fine-grained control
      of the tracing, a cleaner output of the traces and also enables the
      perf tool to use each tracepoint as a virtual performance counter,
      allowing us to e.g. count how often certain workloads hit various
      spots in XFS.  Take a look at
      
          http://lwn.net/Articles/346470/
      
      for some examples.
      
      Also the btree tracing isn't included at all yet, as it will require
      additional core tracing features that are not in mainline yet; I plan
      to deliver it later.
      
      And the really nice thing about this patch is that it actually removes
      many lines of code while adding this nice functionality:
      
       fs/xfs/Makefile                |    8
       fs/xfs/linux-2.6/xfs_acl.c     |    1
       fs/xfs/linux-2.6/xfs_aops.c    |   52 -
       fs/xfs/linux-2.6/xfs_aops.h    |    2
       fs/xfs/linux-2.6/xfs_buf.c     |  117 +--
       fs/xfs/linux-2.6/xfs_buf.h     |   33
       fs/xfs/linux-2.6/xfs_fs_subr.c |    3
       fs/xfs/linux-2.6/xfs_ioctl.c   |    1
       fs/xfs/linux-2.6/xfs_ioctl32.c |    1
       fs/xfs/linux-2.6/xfs_iops.c    |    1
       fs/xfs/linux-2.6/xfs_linux.h   |    1
       fs/xfs/linux-2.6/xfs_lrw.c     |   87 --
       fs/xfs/linux-2.6/xfs_lrw.h     |   45 -
       fs/xfs/linux-2.6/xfs_super.c   |  104 ---
       fs/xfs/linux-2.6/xfs_super.h   |    7
       fs/xfs/linux-2.6/xfs_sync.c    |    1
       fs/xfs/linux-2.6/xfs_trace.c   |   75 ++
       fs/xfs/linux-2.6/xfs_trace.h   | 1369 +++++++++++++++++++++++++++++++++++++++++
       fs/xfs/linux-2.6/xfs_vnode.h   |    4
       fs/xfs/quota/xfs_dquot.c       |  110 ---
       fs/xfs/quota/xfs_dquot.h       |   21
       fs/xfs/quota/xfs_qm.c          |   40 -
       fs/xfs/quota/xfs_qm_syscalls.c |    4
       fs/xfs/support/ktrace.c        |  323 ---------
       fs/xfs/support/ktrace.h        |   85 --
       fs/xfs/xfs.h                   |   16
       fs/xfs/xfs_ag.h                |   14
       fs/xfs/xfs_alloc.c             |  230 +-----
       fs/xfs/xfs_alloc.h             |   27
       fs/xfs/xfs_alloc_btree.c       |    1
       fs/xfs/xfs_attr.c              |  107 ---
       fs/xfs/xfs_attr.h              |   10
       fs/xfs/xfs_attr_leaf.c         |   14
       fs/xfs/xfs_attr_sf.h           |   40 -
       fs/xfs/xfs_bmap.c              |  507 +++------------
       fs/xfs/xfs_bmap.h              |   49 -
       fs/xfs/xfs_bmap_btree.c        |    6
       fs/xfs/xfs_btree.c             |    5
       fs/xfs/xfs_btree_trace.h       |   17
       fs/xfs/xfs_buf_item.c          |   87 --
       fs/xfs/xfs_buf_item.h          |   20
       fs/xfs/xfs_da_btree.c          |    3
       fs/xfs/xfs_da_btree.h          |    7
       fs/xfs/xfs_dfrag.c             |    2
       fs/xfs/xfs_dir2.c              |    8
       fs/xfs/xfs_dir2_block.c        |   20
       fs/xfs/xfs_dir2_leaf.c         |   21
       fs/xfs/xfs_dir2_node.c         |   27
       fs/xfs/xfs_dir2_sf.c           |   26
       fs/xfs/xfs_dir2_trace.c        |  216 ------
       fs/xfs/xfs_dir2_trace.h        |   72 --
       fs/xfs/xfs_filestream.c        |    8
       fs/xfs/xfs_fsops.c             |    2
       fs/xfs/xfs_iget.c              |  111 ---
       fs/xfs/xfs_inode.c             |   67 --
       fs/xfs/xfs_inode.h             |   76 --
       fs/xfs/xfs_inode_item.c        |    5
       fs/xfs/xfs_iomap.c             |   85 --
       fs/xfs/xfs_iomap.h             |    8
       fs/xfs/xfs_log.c               |  181 +----
       fs/xfs/xfs_log_priv.h          |   20
       fs/xfs/xfs_log_recover.c       |    1
       fs/xfs/xfs_mount.c             |    2
       fs/xfs/xfs_quota.h             |    8
       fs/xfs/xfs_rename.c            |    1
       fs/xfs/xfs_rtalloc.c           |    1
       fs/xfs/xfs_rw.c                |    3
       fs/xfs/xfs_trans.h             |   47 +
       fs/xfs/xfs_trans_buf.c         |   62 -
       fs/xfs/xfs_vnodeops.c          |    8
       70 files changed, 2151 insertions(+), 2592 deletions(-)
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      0b1b213f
    •
      xfs: change the xfs_iext_insert / xfs_iext_remove · 6ef35544
      Committed by Christoph Hellwig
      Change the xfs_iext_insert / xfs_iext_remove prototypes to pass more
      information, which will allow pushing the trace points from the callers
      into those functions.  This includes folding the whichfork information
      into the state variable to minimize the additional stack footprint.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      6ef35544
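      A hedged C sketch of the "fold whichfork into state" idea, with
      invented flag names and a stub function body rather than the real
      extent-list code:

         /* Simplified model only. */
         #define MODEL_EXT_ATTR_FORK  (1 << 0)   /* else: data fork */
         #define MODEL_EXT_INSERTING  (1 << 1)

         void model_iext_insert(void *ifp, int idx, int count, int state)
         {
             /* One state word carries both the operation flags and the
              * fork selector, so a tracepoint inside this function can
              * report everything without extra caller arguments. */
             int whichfork = (state & MODEL_EXT_ATTR_FORK) ? 1 : 0;

             (void)ifp; (void)idx; (void)count; (void)whichfork;

             /* ... perform the actual extent-record insertion ... */
         }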
  14. 09 October 2009, 1 commit
    •
      xfs: implement ->dirty_inode to fix timestamp handling · f9581b14
      Committed by Christoph Hellwig
      This is picking up on Felix's repost of Dave's patch to implement a
      .dirty_inode method.  We really need this notification because
      the VFS keeps writing directly into the inode structure instead
      of going through methods to update this state.  In addition to
      the long-known atime issue we now also have a caller in VM code
      that updates c/mtime that way for shared writeable mmaps.  And
      I found another one that no one has noticed in practice in the FIFO
      code.
      
      So implement ->dirty_inode to set i_update_core whenever the
      inode gets externally dirtied, and switch the c/mtime handling to
      the same scheme we already use for atime (always picking up
      the value from the Linux inode).
      
      Note that this patch also removes the xfs_synchronize_atime call
      in xfs_reclaim; it was superfluous as we already synchronize the time
      when writing the inode via the log (xfs_inode_item_format) or the
      normal buffers (xfs_iflush_int).
      
      In addition also remove the I_CLEAR check before copying the Linux
      timestamps - now that we always have the Linux inode available
      we can always use the timestamps in it.
      
      Also switch to just using file_update_time for regular reads/writes -
      that will get us all the optimizations done to it for free and make
      sure we notice early when it breaks.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Felix Blyakher <felixb@sgi.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      f9581b14
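      A minimal C model of the ->dirty_inode idea, with invented types in
      place of the real VFS and XFS structures: when the VFS writes
      timestamps straight into the inode, a flag records that the
      filesystem's core fields need updating, and the timestamps are
      always picked up from the generic inode when it is logged or flushed.

         /* Simplified model only; not the kernel super_operations hook. */
         #include <stdbool.h>
         #include <time.h>

         struct model_inode {
             struct timespec mtime;
             struct timespec ctime;
             bool update_core;    /* stand-in for i_update_core */
         };

         /* Called after the "VFS" has dirtied the inode directly. */
         void model_dirty_inode(struct model_inode *ip)
         {
             ip->update_core = true;
         }

         /* When logging or flushing, copy the timestamps from the generic
          * inode instead of keeping a second, possibly stale copy. */
         void model_sync_timestamps(struct model_inode *ip,
                                    struct timespec *disk_mtime,
                                    struct timespec *disk_ctime)
         {
             *disk_mtime = ip->mtime;
             *disk_ctime = ip->ctime;
             ip->update_core = false;
         }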
  15. 01 September 2009, 1 commit
  16. 12 August 2009, 2 commits
  17. 02 July 2009, 1 commit
  18. 10 June 2009, 1 commit
  19. 30 April 2009, 1 commit
    •
      xfs_file_last_byte() needs to acquire ilock · def6b3ba
      Committed by Lachlan McIlroy
      We had some systems crash with this stack:
      
      [<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
      [<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
      [<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
      [<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
      [<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
      [<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
      [<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
      [<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
      [<a000000100162ea0>] __fput+0x1a0/0x420
      [<a000000100163160>] fput+0x40/0x60
      
      The problem here is that xfs_file_last_byte() does not acquire the
      inode lock and can therefore race with another thread that is modifying
      the extent list.  While xfs_bmap_last_offset() is trying to look up
      what was the last extent, some extents were merged and the extent list
      shrank, so the index we look up is now beyond the end of the extent list
      and potentially in a freed buffer.
      Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: Felix Blyakher <felixb@sgi.com>
      def6b3ba
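      A hedged C model of the fix described above (here and in the
      identical commit that follows), using a pthread rwlock in place of
      the XFS ilock: the last-extent lookup happens under a shared lock,
      so a concurrent merge cannot shrink the extent list underneath it.

         /* Simplified model only; not the real xfs_file_last_byte(). */
         #include <pthread.h>
         #include <stddef.h>

         struct model_inode {
             pthread_rwlock_t ilock;    /* stand-in for the XFS ilock */
             long long *extent_blocks;  /* last mapped block of each extent */
             size_t nextents;
         };

         long long model_file_last_byte(struct model_inode *ip,
                                        long long block_size)
         {
             long long last = 0;

             /* Shared lock: readers of the extent list exclude merges. */
             pthread_rwlock_rdlock(&ip->ilock);
             if (ip->nextents)
                 last = (ip->extent_blocks[ip->nextents - 1] + 1) * block_size;
             pthread_rwlock_unlock(&ip->ilock);

             return last;
         }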
  20. 29 April 2009, 1 commit
    •
      xfs_file_last_byte() needs to acquire ilock · f25181f5
      Committed by Lachlan McIlroy
      We had some systems crash with this stack:
      
      [<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
      [<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
      [<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
      [<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
      [<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
      [<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
      [<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
      [<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
      [<a000000100162ea0>] __fput+0x1a0/0x420
      [<a000000100163160>] fput+0x40/0x60
      
      The problem here is that xfs_file_last_byte() does not acquire the
      inode lock and can therefore race with another thread that is modifying
      the extent list.  While xfs_bmap_last_offset() is trying to look up
      what was the last extent, some extents were merged and the extent list
      shrank, so the index we look up is now beyond the end of the extent list
      and potentially in a freed buffer.
      Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Felix Blyakher <felixb@sgi.com>
      Signed-off-by: Felix Blyakher <felixb@sgi.com>
      f25181f5
  21. 19 January 2009, 2 commits