1. 03 8月, 2016 6 次提交
    • D
      xfs: define the on-disk rmap btree format · 035e00ac
      Darrick J. Wong 提交于
      Originally-From: Dave Chinner <dchinner@redhat.com>
      
      Now we have all the surrounding call infrastructure in place, we can
      start filling out the rmap btree implementation. Start with the
      on-disk btree format; add everything needed to read, write and
      manipulate rmap btree blocks. This prepares the way for adding the
      btree operations implementation.
      
      [darrick: record owner and offset info in rmap btree]
      [darrick: fork, bmbt and unwritten state in rmap btree]
      [darrick: flags are a separate field in xfs_rmap_irec]
      [darrick: calculate maxlevels separately]
      [darrick: move the 'unwritten' bit into unused parts of rm_offset]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      035e00ac
    • D
      xfs: introduce rmap extent operation stubs · 673930c3
      Darrick J. Wong 提交于
      Originally-From: Dave Chinner <dchinner@redhat.com>
      
      Add the stubs into the extent allocation and freeing paths that the
      rmap btree implementation will hook into. While doing this, add the
      trace points that will be used to track rmap btree extent
      manipulations.
      
      [darrick.wong@oracle.com: Extend the stubs to take full owner info.]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      673930c3
    • D
      xfs: add tracepoints and error injection for deferred extent freeing · ba9e7802
      Darrick J. Wong 提交于
      Add a couple of tracepoints for the deferred extent free operation and
      a site for injecting errors while finishing the operation.  This makes
      it easier to debug deferred ops and test log redo.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ba9e7802
    • D
      xfs: add tracepoints for the deferred ops mechanism · 3cd48abc
      Darrick J. Wong 提交于
      Add tracepoints for the internals of the deferred ops mechanism
      and tracepoint classes for clients of the dops, to make debugging
      easier.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3cd48abc
    • D
      xfs: introduce interval queries on btrees · 105f7d83
      Darrick J. Wong 提交于
      Create a function to enable querying of btree records mapping to a
      range of keys.  This will be used in subsequent patches to allow
      querying the reverse mapping btree to find the extents mapped to a
      range of physical blocks, though the generic code can be used for
      any range query.
      
      The overlapped query range function needs to use the btree get_block
      helper because the root block could be an inode, in which case
      bc_bufs[nlevels-1] will be NULL.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      105f7d83
    • D
      xfs: support btrees with overlapping intervals for keys · 2c813ad6
      Darrick J. Wong 提交于
      On a filesystem with both reflink and reverse mapping enabled, it's
      possible to have multiple rmap records referring to the same blocks on
      disk.  When overlapping intervals are possible, querying a classic
      btree to find all records intersecting a given interval is inefficient
      because we cannot use the left side of the search interval to filter
      out non-matching records the same way that we can use the existing
      btree key to filter out records coming after the right side of the
      search interval.  This will become important once we want to use the
      rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl.
      
      (For the non-overlapping case, we can perform such queries trivially
      by starting at the left side of the interval and walking the tree
      until we pass the right side.)
      
      Therefore, extend the btree code to come closer to supporting
      intervals as a first-class record attribute.  This involves widening
      the btree node's key space to store both the lowest key reachable via
      the node pointer (as the btree does now) and the highest key reachable
      via the same pointer and teaching the btree modifying functions to
      keep the highest-key records up to date.
      
      This behavior can be turned on via a new btree ops flag so that btrees
      that cannot store overlapping intervals don't pay the overhead costs
      in terms of extra code and disk format changes.
      
      When we're deleting a record in a btree that supports overlapped
      interval records and the deletion results in two btree blocks being
      joined, we defer updating the high/low keys until after all possible
      joining (at higher levels in the tree) have finished.  At this point,
      the btree pointers at all levels have been updated to remove the empty
      blocks and we can update the low and high keys.
      
      When we're doing this, we must be careful to update the keys of all
      node pointers up to the root instead of stopping at the first set of
      keys that don't need updating.  This is because it's possible for a
      single deletion to cause joining of multiple levels of tree, and so
      we need to update everything going back to the root.
      
      The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than,
      equal to, or greater than key2, respectively.  This is consistent
      with the rest of the kernel and the C library.
      
      In btree_updkeys(), we need to evaluate the force_all parameter before
      running the key diff to avoid reading uninitialized memory when we're
      forcing a key update.  This happens when we've allocated an empty slot
      at level N + 1 to point to a new block at level N and we're in the
      process of filling out the new keys.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2c813ad6
  2. 20 7月, 2016 2 次提交
  3. 21 6月, 2016 2 次提交
  4. 18 5月, 2016 1 次提交
  5. 06 4月, 2016 2 次提交
  6. 08 2月, 2016 1 次提交
    • C
      xfs: don't use ioends for direct write completions · 273dda76
      Christoph Hellwig 提交于
      We only need to communicate two bits of information to the direct I/O
      completion handler:
      
       (1) do we need to convert any unwritten extents in the range
       (2) do we need to check if we need to update the inode size based
           on the range passed to the completion handler
      
      We can use the private data passed to the get_block handler and the
      completion handler as a simple bitmask to communicate this information
      instead of the current complicated infrastructure reusing the ioends
      from the buffer I/O path, and thus avoiding a memory allocation and
      a context switch for any non-trivial direct write.  As a nice side
      effect we also decouple the direct I/O path implementation from that
      of the buffered I/O path.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      
      273dda76
  7. 08 1月, 2016 1 次提交
  8. 03 11月, 2015 1 次提交
  9. 12 10月, 2015 1 次提交
  10. 09 9月, 2015 1 次提交
  11. 19 8月, 2015 1 次提交
  12. 29 5月, 2015 1 次提交
    • B
      xfs: allocate sparse inode chunks on full chunk allocation failure · 56d1115c
      Brian Foster 提交于
      xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode
      chunk. If all else fails, reduce the allocation to the sparse length and
      alignment and attempt to allocate a sparse inode chunk.
      
      If sparse chunk allocation succeeds, check whether an inobt record
      already exists that can track the chunk. If so, inherit and update the
      existing record. Otherwise, insert a new record for the sparse chunk.
      
      Create helpers to align sparse chunk inode records and insert or update
      existing records in the inode btrees. The xfs_inobt_insert_sprec()
      helper implements the merge or update semantics required for sparse
      inode records with respect to both the inobt and finobt. To update the
      inobt, either insert a new record or merge with an existing record. To
      update the finobt, use the updated inobt record to either insert or
      replace an existing record.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      56d1115c
  13. 16 4月, 2015 3 次提交
    • D
      xfs: DIO writes within EOF don't need an ioend · a06c277a
      Dave Chinner 提交于
      DIO writes that lie entirely within EOF have nothing to do in IO
      completion. In this case, we don't need no steekin' ioend, and so we
      can avoid allocating an ioend until we have a mapping that spans
      EOF.
      
      This means that IO completion has two contexts - deferred completion
      to the dio workqueue that uses an ioend, and interrupt completion
      that does nothing because there is nothing that can be done in this
      context.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      a06c277a
    • D
      xfs: handle DIO overwrite EOF update completion correctly · 6dfa1b67
      Dave Chinner 提交于
      Currently a DIO overwrite that extends the EOF (e.g sub-block IO or
      write into allocated blocks beyond EOF) requires a transaction for
      the EOF update. Thi is done in IO completion context, but we aren't
      explicitly handling this situation properly and so it can run in
      interrupt context. Ensure that we defer IO that spans EOF correctly
      to the DIO completion workqueue, and now that we have an ioend in IO
      completion we can use the common ioend completion path to do all the
      work.
      
      Note: we do not preallocate the append transaction as we can have
      multiple mapping and allocation calls per direct IO. hence
      preallocating can still leave us with nested transactions by
      attempting to map and allocate more blocks after we've preallocated
      an append transaction.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      6dfa1b67
    • D
      xfs: DIO needs an ioend for writes · d5cc2e3f
      Dave Chinner 提交于
      Currently we can only tell DIO completion that an IO requires
      unwritten extent completion. This is done by a hacky non-null
      private pointer passed to Io completion, but the private pointer
      does not actually contain any information that is used.
      
      We also need to pass to IO completion the fact that the IO may be
      beyond EOF and so a size update transaction needs to be done. This
      is currently determined by checks in the io completion, but we need
      to determine if this is necessary at block mapping time as we need
      to defer the size update transactions to a completion workqueue,
      just like unwritten extent conversion.
      
      To do this, first we need to allocate and pass an ioend to to IO
      completion. Add this for unwritten extent conversion; we'll do the
      EOF updates in the next commit.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      d5cc2e3f
  14. 25 3月, 2015 2 次提交
  15. 23 2月, 2015 2 次提交
  16. 02 10月, 2014 1 次提交
    • D
      xfs: introduce xfs_buf_submit[_wait] · 595bff75
      Dave Chinner 提交于
      There is a lot of cookie-cutter code that looks like:
      
      	if (shutdown)
      		handle buffer error
      	xfs_buf_iorequest(bp)
      	error = xfs_buf_iowait(bp)
      	if (error)
      		handle buffer error
      
      spread through XFS. There's significant complexity now in
      xfs_buf_iorequest() to specifically handle this sort of synchronous
      IO pattern, but there's all sorts of nasty surprises in different
      error handling code dependent on who owns the buffer references and
      the locks.
      
      Pull this pattern into a single helper, where we can hide all the
      synchronous IO warts and hence make the error handling for all the
      callers much saner. This removes the need for a special extra
      reference to protect IO completion processing, as we can now hold a
      single reference across dispatch and waiting, simplifying the sync
      IO smeantics and error handling.
      
      In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
      make it explicitly handle on asynchronous IO. This forces all users
      to be switched specifically to one interface or the other and
      removes any ambiguity between how the interfaces are to be used. It
      also means that xfs_buf_iowait() goes away.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      595bff75
  17. 12 6月, 2014 1 次提交
    • A
      ->splice_write() via ->write_iter() · 8d020765
      Al Viro 提交于
      iter_file_splice_write() - a ->splice_write() instance that gathers the
      pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
      it to ->write_iter().  A bunch of simple cases coverted to that...
      
      [AV: fixed the braino spotted by Cyrill]
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8d020765
  18. 23 4月, 2014 1 次提交
  19. 14 4月, 2014 1 次提交
  20. 24 2月, 2014 1 次提交
  21. 07 11月, 2013 2 次提交
  22. 28 6月, 2013 1 次提交
    • D
      xfs: Introduce an ordered buffer item · 5f6bed76
      Dave Chinner 提交于
      If we have a buffer that we have modified but we do not wish to
      physically log in a transaction (e.g. we've logged a logical
      change), we still need to ensure that transactional integrity is
      maintained. Hence we must not move the tail of the log past the
      transaction that the buffer is associated with before the buffer is
      written to disk.
      
      This means these special buffers still need to be included in the
      transaction and added to the AIL just like a normal buffer, but we
      do not want the modifications to the buffer written into the
      transaction. IOWs, what we want is an "ordered buffer" that
      maintains the same transactional life cycle as a physically logged
      buffer, just without the transcribing of the modifications to the
      log.
      
      Hence we need to flag the buffer as an "ordered buffer" to avoid
      including it in vector size calculations or formatting during the
      transaction. Once the transaction is committed, the buffer appears
      for all intents to be the same as a physically logged buffer as it
      transitions through the log and AIL.
      
      Relogging will also work just fine for such an ordered buffer - the
      logical transaction will be replayed before the subsequent
      modifications that relog the buffer, so everything will be
      reconstructed correctly by recovery.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      5f6bed76
  23. 20 6月, 2013 1 次提交
  24. 22 5月, 2013 1 次提交
  25. 23 3月, 2013 1 次提交
  26. 29 1月, 2013 1 次提交
    • D
      xfs: fix shutdown hang on invalid inode during create · 9f87832a
      Dave Chinner 提交于
      When the new inode verify in xfs_iread() fails, the create
      transaction is aborted and a shutdown occurs. The subsequent unmount
      then hangs in xfs_wait_buftarg() on a buffer that has an elevated
      hold count. Debug showed that it was an AGI buffer getting stuck:
      
      [   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      
      The trace of this buffer leading up to the shutdown (trimmed for
      brevity) looks like:
      
      xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
      xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
      xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
      xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
      xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
      xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      
      And that is the AGI buffer from cold cache read into memory to
      transaction abort. You can see at transaction abort the bli is dirty
      and only has a single reference. The item is not pinned, and it's
      not in the AIL. Hence the only reference to it is this transaction.
      
      The problem is that the xfs_buf_item_unlock() call is dropping the
      last reference to the xfs_buf_log_item attached to the buffer (which
      holds a reference to the buffer), but it is not freeing the
      xfs_buf_log_item. Hence nothing will ever release the buffer, and
      the unmount hangs waiting for this reference to go away.
      
      The fix is simple - xfs_buf_item_unlock needs to detect the last
      reference going away in this case and free the xfs_buf_log_item to
      release the reference it holds on the buffer.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      9f87832a
  27. 26 1月, 2013 1 次提交
    • D
      xfs: fix shutdown hang on invalid inode during create · 3b19034d
      Dave Chinner 提交于
      When the new inode verify in xfs_iread() fails, the create
      transaction is aborted and a shutdown occurs. The subsequent unmount
      then hangs in xfs_wait_buftarg() on a buffer that has an elevated
      hold count. Debug showed that it was an AGI buffer getting stuck:
      
      [   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      [   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
      
      The trace of this buffer leading up to the shutdown (trimmed for
      brevity) looks like:
      
      xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
      xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
      xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
      xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
      xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
      xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
      xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
      xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
      xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
      xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
      xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
      xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
      xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
      xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
      
      And that is the AGI buffer from cold cache read into memory to
      transaction abort. You can see at transaction abort the bli is dirty
      and only has a single reference. The item is not pinned, and it's
      not in the AIL. Hence the only reference to it is this transaction.
      
      The problem is that the xfs_buf_item_unlock() call is dropping the
      last reference to the xfs_buf_log_item attached to the buffer (which
      holds a reference to the buffer), but it is not freeing the
      xfs_buf_log_item. Hence nothing will ever release the buffer, and
      the unmount hangs waiting for this reference to go away.
      
      The fix is simple - xfs_buf_item_unlock needs to detect the last
      reference going away in this case and free the xfs_buf_log_item to
      release the reference it holds on the buffer.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      3b19034d