1. 23 Sep 2014, 1 commit
  2. 25 Jun 2014, 1 commit
    • xfs: global error sign conversion · 2451337d
      Committed by Dave Chinner
      Convert all the errors in the core XFS code to negative error values,
      like the rest of the kernel, and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
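      A minimal userspace sketch of the sign convention this change adopts,
      with hypothetical helper names rather than code from the patch itself:
      core functions return negative errno values directly, so interface
      layers no longer flip the sign on the way out.

      /* Hypothetical illustration of the old vs. new error convention. */
      #include <errno.h>

      /* Old style: core code returned positive errno values. */
      static int old_core_op(int fail)
      {
              return fail ? EINVAL : 0;       /* positive error */
      }
      static int old_interface_op(int fail)
      {
              return -old_core_op(fail);      /* sign flipped at the boundary */
      }

      /* New style: core code returns negative errno values everywhere. */
      static int new_core_op(int fail)
      {
              return fail ? -EINVAL : 0;      /* negative error */
      }
      static int new_interface_op(int fail)
      {
              return new_core_op(fail);       /* no conversion needed */
      }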
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2451337d
  3. 06 Jun 2014, 1 commit
  4. 14 Apr 2014, 1 commit
  5. 07 Feb 2014, 1 commit
  6. 17 Dec 2013, 1 commit
    • xfs: abort metadata writeback on permanent errors · ac8809f9
      Committed by Dave Chinner
      If we are doing async writeback of metadata, we can get write errors
      but have nobody to report them to. At the moment, we simply attempt
      to reissue the write from io completion in the hope that it's a
      transient error.
      
      When it's not a transient error, the buffer is stuck forever in
      this loop, and we cannot break out of it. Eventually, unmount will
      hang because the AIL cannot be emptied and everything goes downhill
      from there.
      
      To solve this problem, only retry the write IO once before aborting
      it. We don't throw the buffer away because some transient errors can
      last minutes (e.g.  FC path failover) or even hours (thin
      provisioned devices that have run out of backing space) before they
      go away. Hence we really want to keep trying until we can't try any
      more.
      
      Because the buffer was not cleaned, however, it does not get removed
      from the AIL and hence the next pass across the AIL will start IO on
      it again. As such, we still get the "retry forever" semantics that
      we currently have, but we allow other access to the buffer in the
      meantime. Meanwhile, the filesystem can continue to modify the
      buffer and relog it, so the IO errors won't hang the log or the
      filesystem.
      
      Now when we are pushing the AIL, we can see all these "permanent IO
      error" buffers and we can issue a warning about failures before we
      retry the IO. We can also catch these buffers when unmounting and
      issue a corruption warning, too.
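      A standalone sketch of the completion-time policy described above,
      using illustrative names rather than the XFS internals:

      /* Hypothetical sketch: retry a failed metadata write once from I/O
       * completion, then leave the still-dirty buffer for the AIL to retry. */
      #include <stdbool.h>

      struct mbuf {
              bool    failed_once;            /* set after the first write error */
              bool    needs_ail_retry;        /* AIL push will warn and retry later */
      };

      static void mbuf_resubmit(struct mbuf *bp)
      {
              (void)bp;                       /* stand-in for reissuing the I/O */
      }

      static void mbuf_write_error(struct mbuf *bp)
      {
              if (!bp->failed_once) {
                      bp->failed_once = true;
                      mbuf_resubmit(bp);      /* first failure: retry immediately */
                      return;
              }
              /*
               * Second failure: stop retrying from completion.  The buffer stays
               * dirty and tracked by the AIL, so later AIL pushes (or unmount)
               * can warn and retry, and the filesystem can still relog it.
               */
              bp->needs_ail_retry = true;
      }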
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ac8809f9
  7. 13 Dec 2013, 3 commits
  8. 31 Oct 2013, 1 commit
  9. 24 Oct 2013, 1 commit
    • xfs: decouple log and transaction headers · 239880ef
      Committed by Dave Chinner
      xfs_trans.h has a dependency on xfs_log.h for a couple of
      structures. Most code that does transactions doesn't need to know
      anything about the log, but this dependency means that it has to
      include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
      files and clean up the includes to be in dependency order.
      
      In doing this, remove the direct include of xfs_trans_reserve.h from
      xfs_trans.h so that we remove the dependency between xfs_trans.h and
      xfs_mount.h. Hence the xfs_trans.h include can be moved to
      indicate the actual dependencies other header files have on it.
      
      Note that these are kernel only header files, so this does not
      translate to any userspace changes at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      239880ef
  10. 25 Sep 2013, 1 commit
  11. 11 Sep 2013, 1 commit
    • xfs: aborted buf items can be in the AIL. · 46f9d2eb
      Committed by Dave Chinner
      Saw this on generic/270 after a DQALLOC transaction overrun
      shutdown:
      
      XFS: Assertion failed: !(bip->bli_item.li_flags & XFS_LI_IN_AIL), file: fs/xfs/xfs_buf_item.c, line: 952
      .....
       xfs_buf_item_relse+0x4f/0xd0
       xfs_buf_item_unlock+0x1b4/0x1e0
       xfs_trans_free_items+0x7d/0xb0
       xfs_trans_cancel+0x13c/0x1b0
       xfs_symlink+0x37e/0xa60
      ....
      
      When a transaction abort occurred.
      
      If we are aborting a transaction and trigger this code path, then
      the item may be dirty. If the item is dirty, then it may be in the
      AIL. Hence if we are aborting, we need to check if the item is in
      the AIL and remove it before freeing it.
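      A minimal sketch of that abort-path fix, with hypothetical types
      standing in for the XFS log item code:

      /* Hypothetical sketch: on transaction abort, a dirty item may still be
       * in the AIL, so remove it from the AIL before freeing it. */
      #include <stdbool.h>
      #include <stdlib.h>

      struct log_item {
              bool    in_ail;
      };

      static void ail_delete(struct log_item *lip)
      {
              lip->in_ail = false;    /* stand-in for the real AIL removal */
      }

      static void free_item_on_abort(struct log_item *lip, bool aborting)
      {
              if (aborting && lip->in_ail)
                      ail_delete(lip);        /* never free an item the AIL still references */
              free(lip);
      }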
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      46f9d2eb
  12. 16 Aug 2013, 1 commit
  13. 14 Aug 2013, 1 commit
  14. 28 Jun 2013, 2 commits
    • xfs: Use inode create transaction · ddf6ad01
      Committed by Dave Chinner
      Replace the use of buffer-based logging of inode initialisation with
      the new logical form that describes the range to be initialised
      in recovery. We continue to "log" the inode buffers to push them
      into the AIL and ensure that the inode create transaction is not
      removed from the log before the inode buffers are written to disk.
      
      Update the transaction identifier and reservations to match the
      changed implementation.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ddf6ad01
    • xfs: Introduce an ordered buffer item · 5f6bed76
      Committed by Dave Chinner
      If we have a buffer that we have modified but we do not wish to
      physically log in a transaction (e.g. we've logged a logical
      change), we still need to ensure that transactional integrity is
      maintained. Hence we must not move the tail of the log past the
      transaction that the buffer is associated with before the buffer is
      written to disk.
      
      This means these special buffers still need to be included in the
      transaction and added to the AIL just like a normal buffer, but we
      do not want the modifications to the buffer written into the
      transaction. IOWs, what we want is an "ordered buffer" that
      maintains the same transactional life cycle as a physically logged
      buffer, just without the transcribing of the modifications to the
      log.
      
      Hence we need to flag the buffer as an "ordered buffer" to avoid
      including it in vector size calculations or formatting during the
      transaction. Once the transaction is committed, the buffer appears
      for all intents to be the same as a physically logged buffer as it
      transitions through the log and AIL.
      
      Relogging will also work just fine for such an ordered buffer - the
      logical transaction will be replayed before the subsequent
      modifications that relog the buffer, so everything will be
      reconstructed correctly by recovery.
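      A small sketch of the "ordered buffer" idea using hypothetical types:
      the item follows the normal transaction and AIL life cycle, but
      contributes nothing to the log vectors.

      /* Hypothetical sketch: an ordered buffer item is committed and AIL-tracked
       * like any other, but is skipped when sizing/formatting log vectors. */
      #include <stdbool.h>
      #include <stddef.h>

      struct buf_item {
              bool    ordered;        /* logged for ordering only */
              size_t  dirty_bytes;    /* bytes that would otherwise be formatted */
      };

      static size_t buf_item_vec_size(const struct buf_item *bip)
      {
              if (bip->ordered)
                      return 0;       /* no data copied into the transaction */
              return bip->dirty_bytes;
      }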
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      5f6bed76
  15. 31 May 2013, 2 commits
    • xfs: fix split buffer vector log recovery support · 7d2ffe80
      Committed by Dave Chinner
      A long time ago in a galaxy far away....
      
      .. there was a commit made to fix some Linux-specific "fragmented
      buffer" log recovery problem:
      
      http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603
      
      That problem occurred when a contiguous dirty region of a buffer was
      split across two pages of an unmapped buffer. It's been a
      long time since that has been done in XFS, and the changes to log
      the entire inode buffers for CRC enabled filesystems have
      re-introduced that corner case.
      
      And, of course, it turns out that the above commit didn't actually
      fix anything - it just ensured that log recovery is guaranteed to
      fail when this situation occurs. And now for the gory details.
      
      xfstest xfs/085 is failing with this assert:
      
      XFS (vdb): bad number of regions (0) in inode log format
      XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583
      
      Largely undocumented factoid #1: Log recovery depends on all log
      buffer format items starting with this format:
      
      struct foo_log_format {
      	__uint16_t	type;
      	__uint16_t	size;
      	....
      
      This is because recovery uses the size field and assumptions about
      32 bit alignment in decoding format items.  So don't pay much attention
      to the fact that log recovery thinks it is decoding an inode log format
      item - it just uses them to determine what the size of the item is.
      
      But why would it see a log format item with a zero size? Well,
      luckily enough xfs_logprint uses the same code and gives the same
      error, so with a bit of gdb magic, it turns out that it isn't a log
      format that is being decoded. What logprint tells us is this:
      
      Oper (130): tid: a0375e1a  len: 28  clientid: TRANS  flags: none
      BUF:  #regs: 2   start blkno: 144 (0x90)  len: 16  bmap size: 2  flags: 0x4000
      Oper (131): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
      BUF DATA
      ----------------------------------------------------------------------------
      Oper (132): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
      xfs_logprint: unknown log operation type (4e49)
      **********************************************************************
      * ERROR: data block=2                                                 *
      **********************************************************************
      
      So we've got a buffer format item (oper 130) that has two regions:
      the format item itself and one dirty region. The subsequent region
      after the buffer format item and its data is then what we are
      tripping over, and the first bytes of it are an inode magic number.
      Not a log opheader like there is supposed to be.
      
      That means there's a problem with the buffer format item. Its dirty
      data region is 4096 bytes, and it contains - you guessed it -
      initialised inodes. But inode buffers are 8k, not 4k, and we log
      them in their entirety. So something is wrong here. The buffer
      format item contains:
      
      (gdb) p /x *(struct xfs_buf_log_format *)in_f
      $22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
             blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
             blf_data_map = {0xffffffff, 0xffffffff, .... }}
      
      Two regions, and a single dirty contiguous region of 64 bits.  64 *
      128 = 8k, so this should be followed by a single 8k region of data.
      And the blf_flags tell us that the type of buffer is a
      XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
      the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
      buffer. So, it should be followed by 8k of inode data.
      
      But we know that the next region has a header of:
      
      (gdb) p /x *ohead
      $25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
             oh_flags = 0x0, oh_res2 = 0x0}
      
      and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
      long enough to hold all the logged data. There must be another
      region. There is - there's a following opheader for another 4k of
      data that contains the other half of the inode cluster data - the
      one we assert fail on because it's not a log format header.
      
      So why is the second part of the data not being accounted to the
      correct buffer log format structure? It took a little more work with
      gdb to work out that the buffer log format structure was
      expecting it to be there but hadn't accounted for it. It was at that
      point I went to the kernel code, as clearly this wasn't a bug in
      xfs_logprint and the kernel was writing bad stuff to the log.
      
      First port of call was the buffer item formatting code, and the
      discontiguous memory/contiguous dirty region handling code
      immediately stood out. I've wondered for a long time why the code
      had this comment in it:
      
                              vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
                              vecp->i_len = nbits * XFS_BLF_CHUNK;
                              vecp->i_type = XLOG_REG_TYPE_BCHUNK;
      /*
       * You would think we need to bump the nvecs here too, but we do not
       * this number is used by recovery, and it gets confused by the boundary
       * split here
       *                      nvecs++;
       */
                              vecp++;
      
      And it didn't account for the extra vector pointer. The case being
      handled here is that a contiguous dirty region lies across a
      boundary that cannot be memcpy()d across, and so has to be split
      into two separate operations for xlog_write() to perform.
      
      What this code assumes is that what is written to the log is two
      consecutive blocks of data that are accounted in the buf log format
      item as the same contiguous dirty region and so will get decoded as
      such by the log recovery code.
      
      The thing is, xlog_write() knows nothing about this, and so just
      does its normal thing of adding an opheader for each vector. That
      means the 8k region gets written to the log as two separate regions
      of 4k each, but because nvecs has not been incremented, the buf log
      format item accounts for only one of them.
      
      Hence when we come to log recovery, we process the first 4k region
      and then expect to come across a new item that starts with a log
      format structure of some kind that tells us what the next data is
      going to be. Instead, we hit raw buffer data and things go bad real
      quick.
      
      So, the commit from 2002 that commented out nvecs++ is just plain
      wrong. It breaks log recovery completely, and it would seem the only
      reason this hasn't been noticed since then is that we don't log large
      contiguous regions of multi-page unmapped buffers very often. Never
      would be a closer estimate, at least until the CRC code came along....
      
      So, let's fix that by restoring the nvecs accounting for the extra
      region when we hit this case.....
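      A schematic sketch of that corrected accounting, with hypothetical
      types rather than the real log vector code: when one contiguous dirty
      range has to be split into two copy operations, both resulting vectors
      are counted so recovery sees two regions.

      /* Hypothetical sketch: count both halves of a split dirty range. */
      struct log_vec {
              void    *addr;
              int     len;
      };

      static int format_split_range(struct log_vec *vecp,
                                    void *first, int first_len,
                                    void *second, int second_len)
      {
              int nvecs = 0;

              vecp->addr = first;
              vecp->len = first_len;
              vecp++;
              nvecs++;                /* first half of the split range */

              vecp->addr = second;
              vecp->len = second_len;
              nvecs++;                /* second half must be counted too */

              return nvecs;
      }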
      
      .... and there's the problem in log recovery it is apparently working
      around:
      
      XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135
      
      Yup, xlog_recover_do_reg_buffer() doesn't handle contiguous dirty
      regions being broken up into multiple regions by the log formatting
      code. That's an easy fix, though - if the number of contiguous dirty
      bits exceeds the length of the region being copied out of the log,
      only account for the number of dirty bits that region covers, and
      then loop again and copy more from the next region. It's a 2 line
      fix.
      
      Now xfstests xfs/085 passes, we have one less piece of mystery
      code, and one more important piece of knowledge about how to
      structure new log format items..
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      
      (cherry picked from commit 709da6a6)
      7d2ffe80
    • xfs: fix split buffer vector log recovery support · 709da6a6
      Committed by Dave Chinner
      A long time ago in a galaxy far away....
      
      .. there was a commit made to fix some Linux-specific "fragmented
      buffer" log recovery problem:
      
      http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603
      
      That problem occurred when a contiguous dirty region of a buffer was
      split across two pages of an unmapped buffer. It's been a
      long time since that has been done in XFS, and the changes to log
      the entire inode buffers for CRC enabled filesystems have
      re-introduced that corner case.
      
      And, of course, it turns out that the above commit didn't actually
      fix anything - it just ensured that log recovery is guaranteed to
      fail when this situation occurs. And now for the gory details.
      
      xfstest xfs/085 is failing with this assert:
      
      XFS (vdb): bad number of regions (0) in inode log format
      XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583
      
      Largely undocumented factoid #1: Log recovery depends on all log
      buffer format items starting with this format:
      
      struct foo_log_format {
      	__uint16_t	type;
      	__uint16_t	size;
      	....
      
      This is because recovery uses the size field and assumptions about
      32 bit alignment in decoding format items.  So don't pay much attention
      to the fact that log recovery thinks it is decoding an inode log format
      item - it just uses them to determine what the size of the item is.
      
      But why would it see a log format item with a zero size? Well,
      luckily enough xfs_logprint uses the same code and gives the same
      error, so with a bit of gdb magic, it turns out that it isn't a log
      format that is being decoded. What logprint tells us is this:
      
      Oper (130): tid: a0375e1a  len: 28  clientid: TRANS  flags: none
      BUF:  #regs: 2   start blkno: 144 (0x90)  len: 16  bmap size: 2  flags: 0x4000
      Oper (131): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
      BUF DATA
      ----------------------------------------------------------------------------
      Oper (132): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
      xfs_logprint: unknown log operation type (4e49)
      **********************************************************************
      * ERROR: data block=2                                                 *
      **********************************************************************
      
      So we've got a buffer format item (oper 130) that has two regions:
      the format item itself and one dirty region. The subsequent region
      after the buffer format item and its data is then what we are
      tripping over, and the first bytes of it are an inode magic number.
      Not a log opheader like there is supposed to be.
      
      That means there's a problem with the buffer format item. Its dirty
      data region is 4096 bytes, and it contains - you guessed it -
      initialised inodes. But inode buffers are 8k, not 4k, and we log
      them in their entirety. So something is wrong here. The buffer
      format item contains:
      
      (gdb) p /x *(struct xfs_buf_log_format *)in_f
      $22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
             blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
             blf_data_map = {0xffffffff, 0xffffffff, .... }}
      
      Two regions, and a single dirty contiguous region of 64 bits.  64 *
      128 = 8k, so this should be followed by a single 8k region of data.
      And the blf_flags tell us that the type of buffer is a
      XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
      the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
      buffer. So, it should be followed by 8k of inode data.
      
      But we know that the next region has a header of:
      
      (gdb) p /x *ohead
      $25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
             oh_flags = 0x0, oh_res2 = 0x0}
      
      and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
      long enough to hold all the logged data. There must be another
      region. There is - there's a following opheader for another 4k of
      data that contains the other half of the inode cluster data - the
      one we assert fail on because it's not a log format header.
      
      So why is the second part of the data not being accounted to the
      correct buffer log format structure? It took a little more work with
      gdb to work out that the buffer log format structure was
      expecting it to be there but hadn't accounted for it. It was at that
      point I went to the kernel code, as clearly this wasn't a bug in
      xfs_logprint and the kernel was writing bad stuff to the log.
      
      First port of call was the buffer item formatting code, and the
      discontiguous memory/contiguous dirty region handling code
      immediately stood out. I've wondered for a long time why the code
      had this comment in it:
      
                              vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
                              vecp->i_len = nbits * XFS_BLF_CHUNK;
                              vecp->i_type = XLOG_REG_TYPE_BCHUNK;
      /*
       * You would think we need to bump the nvecs here too, but we do not
       * this number is used by recovery, and it gets confused by the boundary
       * split here
       *                      nvecs++;
       */
                              vecp++;
      
      And it didn't account for the extra vector pointer. The case being
      handled here is that a contiguous dirty region lies across a
      boundary that cannot be memcpy()d across, and so has to be split
      into two separate operations for xlog_write() to perform.
      
      What this code assumes is that what is written to the log is two
      consecutive blocks of data that are accounted in the buf log format
      item as the same contiguous dirty region and so will get decoded as
      such by the log recovery code.
      
      The thing is, xlog_write() knows nothing about this, and so just
      does its normal thing of adding an opheader for each vector. That
      means the 8k region gets written to the log as two separate regions
      of 4k each, but because nvecs has not been incremented, the buf log
      format item accounts for only one of them.
      
      Hence when we come to log recovery, we process the first 4k region
      and then expect to come across a new item that starts with a log
      format structure of some kind that tells us what the next data is
      going to be. Instead, we hit raw buffer data and things go bad real
      quick.
      
      So, the commit from 2002 that commented out nvecs++ is just plain
      wrong. It breaks log recovery completely, and it would seem the only
      reason this hasn't been noticed since then is that we don't log large
      contiguous regions of multi-page unmapped buffers very often. Never
      would be a closer estimate, at least until the CRC code came along....
      
      So, let's fix that by restoring the nvecs accounting for the extra
      region when we hit this case.....
      
      .... and there's the problem in log recovery it is apparently working
      around:
      
      XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135
      
      Yup, xlog_recover_do_reg_buffer() doesn't handle contiguous dirty
      regions being broken up into multiple regions by the log formatting
      code. That's an easy fix, though - if the number of contiguous dirty
      bits exceeds the length of the region being copied out of the log,
      only account for the number of dirty bits that region covers, and
      then loop again and copy more from the next region. It's a 2 line
      fix.
      
      Now xfstests xfs/085 passes, we have one less piece of mystery
      code, and one more important piece of knowledge about how to
      structure new log format items..
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      709da6a6
  16. 15 Feb 2013, 1 commit
    • xfs: recheck buffer pinned status after push trylock failure · 5337fe9b
      Committed by Brian Foster
      The buffer pinned check and trylock sequence in xfs_buf_item_push()
      can race with an active transaction on marking the buffer pinned.
      This can result in the buffer becoming pinned and stale after the
      initial check and the trylock failure, but before the check in
      xfs_buf_trylock() that issues a log force. If the log force is
      issued from this context, a spinlock recursion occurs on xa_lock.
      
      Prepare xfs_buf_item_push() to handle the race by detecting a
      pinned buffer after the trylock failure so xfsaild issues a log
      force from a safe context. This, along with various previous fixes,
      renders the log force in xfs_buf_trylock() redundant.
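      A condensed sketch of the push logic described above, using hypothetical
      names in place of the real buffer item code:

      /* Hypothetical sketch: recheck the pinned state after a failed trylock so
       * the AIL push caller issues the log force from a safe context. */
      #include <stdbool.h>

      enum push_result { PUSH_SUCCESS, PUSH_LOCKED, PUSH_PINNED };

      struct mbuf {
              bool    pinned;
      };

      static bool mbuf_trylock(struct mbuf *bp)
      {
              (void)bp;
              return false;           /* stand-in: pretend the trylock failed */
      }

      static enum push_result buf_item_push(struct mbuf *bp)
      {
              if (bp->pinned)
                      return PUSH_PINNED;     /* caller forces the log safely */
              if (!mbuf_trylock(bp)) {
                      /* A commit may have pinned the buffer after the check above. */
                      if (bp->pinned)
                              return PUSH_PINNED;
                      return PUSH_LOCKED;
              }
              return PUSH_SUCCESS;
      }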
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      5337fe9b
  17. 29 Jan 2013, 1 commit
    • xfs: fix shutdown hang on invalid inode during create · 9f87832a
      Committed by Dave Chinner
      When the new inode verifier in xfs_iread() fails, the create
      transaction is aborted and a shutdown occurs. The subsequent unmount
      then hangs in xfs_wait_buftarg() on a buffer that has an elevated
      hold count. Debug showed that it was an AGI buffer getting stuck:
      
      [   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      
      The trace of this buffer leading up to the shutdown (trimmed for
      brevity) looks like:
      
      xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
      xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
      xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
      xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
      xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
      xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      
      And that is the AGI buffer from cold cache read into memory to
      transaction abort. You can see at transaction abort the bli is dirty
      and only has a single reference. The item is not pinned, and it's
      not in the AIL. Hence the only reference to it is this transaction.
      
      The problem is that the xfs_buf_item_unlock() call is dropping the
      last reference to the xfs_buf_log_item attached to the buffer (which
      holds a reference to the buffer), but it is not freeing the
      xfs_buf_log_item. Hence nothing will ever release the buffer, and
      the unmount hangs waiting for this reference to go away.
      
      The fix is simple - xfs_buf_item_unlock needs to detect the last
      reference going away in this case and free the xfs_buf_log_item to
      release the reference it holds on the buffer.
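      A minimal sketch of that unlock-path fix with hypothetical types: if
      dropping the reference in unlock leaves an aborted item with no other
      users and no AIL tracking, free it there so the hold it keeps on the
      buffer is released.

      /* Hypothetical sketch: free the log item on its final reference when the
       * transaction was aborted, instead of leaking the item and its buffer hold. */
      #include <stdbool.h>
      #include <stdlib.h>

      struct buf_log_item {
              int     refcount;
              bool    aborted;
              bool    in_ail;
      };

      static void buf_item_unlock(struct buf_log_item *bip)
      {
              bool last_ref = (--bip->refcount == 0);

              if (last_ref && bip->aborted && !bip->in_ail)
                      free(bip);      /* also drops the hold it keeps on the buffer */
      }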
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      9f87832a
  18. 26 Jan 2013, 1 commit
    • xfs: fix shutdown hang on invalid inode during create · 3b19034d
      Committed by Dave Chinner
      When the new inode verifier in xfs_iread() fails, the create
      transaction is aborted and a shutdown occurs. The subsequent unmount
      then hangs in xfs_wait_buftarg() on a buffer that has an elevated
      hold count. Debug showed that it was an AGI buffer getting stuck:
      
      [   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      
      The trace of this buffer leading up to the shutdown (trimmed for
      brevity) looks like:
      
      xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
      xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
      xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
      xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
      xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
      xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      
      And that is the AGI buffer from cold cache read into memory to
      transaction abort. You can see at transaction abort the bli is dirty
      and only has a single reference. The item is not pinned, and it's
      not in the AIL. Hence the only reference to it is this transaction.
      
      The problem is that the xfs_buf_item_unlock() call is dropping the
      last reference to the xfs_buf_log_item attached to the buffer (which
      holds a reference to the buffer), but it is not freeing the
      xfs_buf_log_item. Hence nothing will ever release the buffer, and
      the unmount hangs waiting for this reference to go away.
      
      The fix is simple - xfs_buf_item_unlock needs to detect the last
      reference going away in this case and free the xfs_buf_log_item to
      release the reference it holds on the buffer.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      3b19034d
  19. 17 Jan 2013, 3 commits
  20. 18 Dec 2012, 4 commits
  21. 09 Nov 2012, 1 commit
    • xfs: fix buffer shutdown reference count mismatch · 03b1293e
      Committed by Dave Chinner
      When we shut down the filesystem, we have to unpin and free all the
      buffers currently active in the CIL. To do this we unpin and remove
      them in one operation as a result of a failed iclogbuf write. For
      buffers, we do this removal via a simulated IO completion after
      marking the buffer stale.
      
      At the time we do this, we have two references to the buffer - the
      active LRU reference and the buf log item.  The LRU reference is
      removed by marking the buffer stale, and the active CIL reference is
      removed by the xfs_buf_iodone() callback that is run by
      xfs_buf_do_callbacks() during ioend processing (via the bp->b_iodone
      callback).
      
      However, ioend processing requires one more reference - that of the
      IO that it is completing. We don't have this reference, so we free
      the buffer prematurely and use it after it is freed. For buffers
      marked with XBF_ASYNC, this leads to assert failures in
      xfs_buf_rele() on debug kernels because the b_hold count is zero.
      
      Fix this by making sure we take the necessary IO reference before
      starting IO completion processing on the stale buffer, and set the
      XBF_ASYNC flag to ensure that IO completion processing removes all
      the active references from the buffer to ensure it is fully torn
      down.
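      A compact sketch of that ordering with hypothetical fields: the
      shutdown path takes the reference that completion will consume and
      marks the buffer async before simulating the completion.

      /* Hypothetical sketch: balance references before a simulated completion. */
      #include <stdbool.h>

      struct mbuf {
              int     hold;   /* reference count */
              bool    async;  /* async completion tears down remaining refs */
      };

      static void mbuf_ioend(struct mbuf *bp)
      {
              bp->hold--;     /* stand-in: completion drops the I/O reference */
      }

      static void shutdown_stale_buf(struct mbuf *bp)
      {
              bp->hold++;             /* the reference the "I/O" owns */
              bp->async = true;       /* let completion strip the active references */
              mbuf_ioend(bp);         /* simulated I/O completion */
      }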
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      03b1293e
  22. 08 Nov 2012, 1 commit
    • xfs: fix buffer shutdown reference count mismatch · 137fff09
      Committed by Dave Chinner
      When we shut down the filesystem, we have to unpin and free all the
      buffers currently active in the CIL. To do this we unpin and remove
      them in one operation as a result of a failed iclogbuf write. For
      buffers, we do this removal via a simulated IO completion after
      marking the buffer stale.
      
      At the time we do this, we have two references to the buffer - the
      active LRU reference and the buf log item.  The LRU reference is
      removed by marking the buffer stale, and the active CIL reference is
      removed by the xfs_buf_iodone() callback that is run by
      xfs_buf_do_callbacks() during ioend processing (via the bp->b_iodone
      callback).
      
      However, ioend processing requires one more reference - that of the
      IO that it is completing. We don't have this reference, so we free
      the buffer prematurely and use it after it is freed. For buffers
      marked with XBF_ASYNC, this leads to assert failures in
      xfs_buf_rele() on debug kernels because the b_hold count is zero.
      
      Fix this by making sure we take the necessary IO reference before
      starting IO completion processing on the stale buffer, and set the
      XBF_ASYNC flag to ensure that IO completion processing removes all
      the active references from the buffer to ensure it is fully torn
      down.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      137fff09
  23. 14 Jul 2012, 2 commits
  24. 02 Jul 2012, 2 commits
    • xfs: support discontiguous buffers in the xfs_buf_log_item · 372cc85e
      Committed by Dave Chinner
      Log each segment of a discontiguous buffer in separate buffer
      format structures. This means log
      recovery will recover all the changes on a per segment basis without
      requiring any knowledge of the fact that it was logged from a
      compound buffer.
      
      To do this, we need to be able to determine what buffer segment any
      given offset into the compound buffer sits over. This enables us to
      translate the dirty bitmap into the number of separate buffer format
      structures required.
      
      We also need to be able to determine the number of bitmap elements
      that a given buffer segment has, as this determines the size of the
      buffer format structure. Hence we need to be able to determine
      both the start offset into the buffer and the length of a given
      segment to be able to calculate this.
      
      With this information, we can preallocate, build and format the
      correct log vector array for each segment in a compound buffer to
      appear exactly the same as individually logged buffers in the log.
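      A small sketch of the offset-to-segment translation described above,
      with hypothetical types in place of the real buffer map:

      /* Hypothetical sketch: find which segment of a compound buffer contains a
       * given offset, and report where that segment starts. */
      #include <stddef.h>

      static int offset_to_segment(const size_t *seg_len, int nsegs,
                                   size_t offset, size_t *seg_start)
      {
              size_t start = 0;
              int i;

              for (i = 0; i < nsegs; i++) {
                      if (offset < start + seg_len[i]) {
                              *seg_start = start;     /* start offset of the segment */
                              return i;               /* segment index */
                      }
                      start += seg_len[i];
              }
              return -1;      /* offset is beyond the compound buffer */
      }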
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      372cc85e
    • xfs: struct xfs_buf_log_format isn't variable sized. · 77c1a08f
      Committed by Dave Chinner
      The struct xfs_buf_log_format wants to think the dirty bitmap is
      variable sized.  In fact, it is variable size on disk simply due to
      the way we map it from the in-memory structure, but we still just
      use a fixed size memory allocation for the in-memory structure.
      
      Hence it makes no sense to set the structure up as a variable sized
      structure when we already know its maximum size, and we always
      allocate it as such. Simplify the structure by making the dirty
      bitmap a fixed sized array and just using the size of the structure
      for the allocation size.
      
      This will make it much simpler to allocate and manipulate an array
      of format structures for discontiguous buffer support.
      
      The previous struct xfs_buf_log_item size according to
      /proc/slabinfo was 224 bytes. pahole doesn't give the same size
      because of the variable size definition. With this modification,
      pahole reports the same as /proc/slabinfo:
      
      	/* size: 224, cachelines: 4, members: 6 */
      
      Because the xfs_buf_log_item size is now determined by the maximum
      supported block size we introduce a dependency on xfs_alloc_btree.h.
      Avoid this dependency by moving the #defines for the maximum block
      sizes supported to xfs_types.h with all the other max/min type
      defines to avoid any new dependencies.
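      A schematic sketch of the simplification with illustrative constants
      (the real maximums live in the xfs headers): size the in-memory bitmap
      for the largest supported block size rather than declaring it as a
      variable-sized trailing array.

      /* Hypothetical sketch: a fixed-size dirty bitmap sized for the maximum. */
      #include <stdint.h>

      #define MAX_BLOCKSIZE   (64 * 1024)     /* illustrative maximum block size */
      #define CHUNK_BYTES     128             /* bytes tracked per bitmap bit */
      #define MAP_WORDS       (MAX_BLOCKSIZE / CHUNK_BYTES / 32)

      struct buf_log_format_fixed {
              uint16_t        type;
              uint16_t        size;
              uint32_t        data_map[MAP_WORDS];    /* fixed-size, no flex array */
      };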
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      77c1a08f
  25. 15 May 2012, 5 commits
    • xfs: move xfs_agino_t to xfs_types.h · 60a34607
      Committed by Dave Chinner
      Untangle the header file includes a bit by moving the definition of
      xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
      xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
      xfs_ag.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      60a34607
    • xfs: use blocks for storing the desired IO size · aa0e8833
      Committed by Dave Chinner
      Now that we pass block counts everywhere, and index buffers by block
      number and length in units of blocks, convert the desired IO size
      into block counts rather than bytes. Convert the code to use block
      counts, and those that need byte counts get converted at the time of
      use.
      
      Rename the b_desired_count variable to something closer to its
      purpose - b_io_length - as it is only used to specify the length of
      an IO for a subset of the buffer.  The only time this is used is for
      log IO - both writing iclogs and during log recovery. In all other
      cases, the b_io_length matches b_length, and hence a lot of code
      confuses the two. e.g. the buf item code uses the io count
      exclusively when it should be using the buffer length. Fix these
      appropriately as they are found.
      
      Also, remove the XFS_BUF_{SET_}COUNT() macros that are just wrappers
      around the desired IO length. They only serve to make the code
      shouty loud, don't actually add any real value, and are often used
      incorrectly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      aa0e8833
    • xfs: pass shutdown method into xfs_trans_ail_delete_bulk · 04913fdd
      Committed by Dave Chinner
      xfs_trans_ail_delete_bulk() can be called from different contexts so
      if the item is not in the AIL we need different shutdown for each
      context.  Pass in the shutdown method needed so the correct action
      can be taken.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      04913fdd
    • xfs: on-stack delayed write buffer lists · 43ff2122
      Committed by Christoph Hellwig
      Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
      and write back the buffers per-process instead of by waking up xfsbufd.
      
      This is now easily doable given that we have very few places left that write
      delwri buffers:
      
       - log recovery:
      	Only done at mount time, and already forcing out the buffers
      	synchronously using xfs_flush_buftarg
      
       - quotacheck:
      	Same story.
      
       - dquot reclaim:
      	Writes out dirty dquots on the LRU under memory pressure.  We might
      	want to look into doing more of this via xfsaild, but it's already
      	more optimal than the synchronous inode reclaim that writes each
      	buffer synchronously.
      
       - xfsaild:
      	This is the main beneficiary of the change.  By keeping a local list
      	of buffers to write we reduce latency of writing out buffers, and
	more importantly we can remove all the delwri list promotions which
      	were hitting the buffer cache hard under sustained metadata loads.
      
      The implementation is very straightforward - xfs_buf_delwri_queue now gets
      a new list_head pointer that it adds the delwri buffers to, and all callers
      need to eventually submit the list using xfs_buf_delwri_submit or
      xfs_buf_delwri_submit_nowait.  Buffers that already are on a delwri list are
      skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
      list.  The biggest change to pass down the buffer list was done to the AIL
      pushing. Now that we operate on buffers the trylock, push and pushbuf log
      item methods are merged into a single push routine, which tries to lock the
      item, and if possible add the buffer that needs writeback to the buffer list.
      This leads to much simpler code than the previous split but requires the
      individual IOP_PUSH instances to unlock and reacquire the AIL lock around
      calls to blocking routines.
      
      Given that xfsailds now also handle writing out buffers, the conditions for
      log forcing and the sleep times needed some small changes.  The most
      important one is that we consider an AIL busy as long we still have buffers
      to push, and the other one is that we do increment the pushed LSN for
      buffers that are under flushing at this moment, but still count them towards
      the stuck items for restart purposes.  Without this we could hammer on stuck
      items without ever forcing the log and not make progress under heavy random
      delete workloads on fast flash storage devices.
      
      [ Dave Chinner:
      	- rebase on previous patches.
      	- improved comments for XBF_DELWRI_Q handling
      	- fix XBF_ASYNC handling in queue submission (test 106 failure)
      	- rename delwri submit function buffer list parameters for clarity
      	- xfs_efd_item_push() should return XFS_ITEM_PINNED ]
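      A toy model of the caller pattern described above, with stand-in types
      rather than the real buffer cache API: queue onto a local on-stack
      list, then submit the whole list when done.

      /* Hypothetical sketch of the on-stack delwri list usage pattern. */
      #include <stdio.h>

      struct buf_list {
              int     nqueued;        /* stand-in for a linked list of buffers */
      };

      static void delwri_queue(struct buf_list *list, int bufno)
      {
              list->nqueued++;
              printf("queued buffer %d\n", bufno);
      }

      static void delwri_submit(struct buf_list *list)
      {
              printf("submitting %d buffers\n", list->nqueued);
              list->nqueued = 0;
      }

      int main(void)
      {
              struct buf_list list = { 0 };   /* local, per-caller list */

              delwri_queue(&list, 1);
              delwri_queue(&list, 2);
              delwri_submit(&list);           /* caller writes back its own buffers */
              return 0;
      }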
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      43ff2122
    • xfs: do not add buffers to the delwri queue until pushed · 960c60af
      Committed by Christoph Hellwig
      Instead of adding buffers to the delwri list as soon as they are logged,
      even if they can't be written until committed because they are pinned,
      defer adding them to the delwri list until xfsaild pushes them.  This
      makes the code more similar to other log items and prepares for writing
      buffers directly from xfsaild.
      
      The complication here is that we need to fail buffers that were added
      but not logged yet in xfs_buf_item_unpin, borrowing code from
      xfs_bioerror.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      960c60af