1. 04 Oct, 2016 · 1 commit
  2. 26 Sep, 2016 · 2 commits
    • xfs: log recovery tracepoints to track current lsn and buffer submission · 5cd9cee9
      Committed by Brian Foster
      Log recovery has particular rules around buffer submission along with
      tricky corner cases where independent transactions can share an LSN. As
      such, it can be difficult to follow when/why buffers are submitted
      during recovery.
      
      Add a couple tracepoints to post the current LSN of a record when a new
      record is being processed and when a buffer is being skipped due to LSN
      ordering. Also, update the recover item class to include the LSN of the
      current transaction for the item being processed.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      5cd9cee9
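
      A minimal user-space sketch of the LSN-ordering rule that these tracepoints make visible; the type and trace helpers below are simplified stand-ins, not the kernel's actual definitions:

      #include <stdio.h>
      #include <stdint.h>
      #include <stdbool.h>

      typedef uint64_t xfs_lsn_t;

      /* stand-in for a "recover record" tracepoint posting the current LSN */
      static void trace_recover_record(xfs_lsn_t current_lsn)
      {
              printf("recover record: current_lsn=0x%llx\n",
                     (unsigned long long)current_lsn);
      }

      /* stand-in for the buffer-skip tracepoint */
      static void trace_buf_skip(xfs_lsn_t buf_lsn, xfs_lsn_t current_lsn)
      {
              printf("skip buffer: buf_lsn=0x%llx >= current_lsn=0x%llx\n",
                     (unsigned long long)buf_lsn,
                     (unsigned long long)current_lsn);
      }

      /* Replay a buffer only if its on-disk LSN is older than the LSN of
       * the record being recovered; otherwise submission is skipped to
       * preserve ordering. */
      static bool recover_buffer(xfs_lsn_t buf_lsn, xfs_lsn_t current_lsn)
      {
              trace_recover_record(current_lsn);
              if (buf_lsn >= current_lsn) {
                      trace_buf_skip(buf_lsn, current_lsn);
                      return false;   /* buffer already up to date */
              }
              return true;            /* submit the buffer */
      }

      int main(void)
      {
              recover_buffer(0x200, 0x100);   /* skipped */
              recover_buffer(0x080, 0x100);   /* submitted */
              return 0;
      }
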
    • xfs: remote attribute blocks aren't really userdata · 292378ed
      Committed by Dave Chinner
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that allocation is
      metadata and reuse of busy extents is acceptable due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, this patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      292378ed
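
      A simplified sketch of the "userdata" to "datatype" bitmask change described above; the flag names and values here are illustrative, not the kernel's actual allocation flag definitions:

      #include <stdio.h>
      #include <stdbool.h>

      #define ALLOC_USERDATA          (1 << 0)  /* allocation is for user data */
      #define ALLOC_NOBUSY_REUSE      (1 << 1)  /* must not reuse busy extents */

      /* Decide the datatype for an allocation: any unordered data write
       * (data fork or remote attribute extent) forbids busy-extent reuse,
       * but only data fork allocations are tagged as user data, so extent
       * size hints never apply to remote attribute extents. */
      static int alloc_datatype(bool metadata_ordered, bool data_fork)
      {
              int datatype = 0;

              if (!metadata_ordered) {
                      datatype |= ALLOC_NOBUSY_REUSE;
                      if (data_fork)
                              datatype |= ALLOC_USERDATA;
              }
              return datatype;
      }

      int main(void)
      {
              /* remote attribute block: unordered, not user data */
              printf("remote attr: 0x%x\n", alloc_datatype(false, false));
              /* regular file data: unordered and user data */
              printf("file data:   0x%x\n", alloc_datatype(false, true));
              /* journalled metadata: ordered, busy reuse allowed */
              printf("metadata:    0x%x\n", alloc_datatype(true, false));
              return 0;
      }
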
  3. 19 Sep, 2016 · 1 commit
    • xfs: set up per-AG free space reservations · 3fd129b6
      Committed by Darrick J. Wong
      One unfortunate quirk of the reference count and reverse mapping
      btrees -- they can expand in size when blocks are written to *other*
      allocation groups if, say, one large extent becomes a lot of tiny
      extents.  Since we don't want to start throwing errors in the middle
      of CoWing, we need to reserve some blocks to handle future expansion.
      The transaction block reservation counters aren't sufficient here
      because we have to have a reserve of blocks in every AG, not just
      somewhere in the filesystem.
      
      Therefore, create two per-AG block reservation pools.  One feeds the
      AGFL so that rmapbt expansion always succeeds, and the other feeds all
      other metadata so that refcountbt expansion never fails.
      
      Use the count of how many reserved blocks we need to have on hand to
      create a virtual reservation in the AG.  Through selective clamping of
      the maximum length of allocation requests and of the length of the
      longest free extent, we can make it look like there's less free space
      in the AG unless the reservation owner is asking for blocks.
      
      In other words, play some accounting tricks in-core to make sure that
      we always have blocks available.  On the plus side, there's nothing to
      clean up if we crash, which is in contrast to the strategy that the rough
      draft used (actually removing extents from the freespace btrees).
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      3fd129b6
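
      A user-space sketch of the in-core "accounting trick" described above: reserved blocks are subtracted from the free-space numbers reported to ordinary callers, so only the reservation owner can see and consume them. The struct fields and function names are illustrative:

      #include <stdio.h>
      #include <stdint.h>
      #include <stdbool.h>

      struct ag_counters {
              uint32_t freeblks;      /* true free blocks in the AG */
              uint32_t longest;       /* true longest free extent */
              uint32_t reserved;      /* blocks held back for rmapbt/refcountbt */
      };

      static uint32_t clamp_sub(uint32_t value, uint32_t reserved)
      {
              return value > reserved ? value - reserved : 0;
      }

      /* free space as seen by an allocation request */
      static uint32_t visible_freeblks(const struct ag_counters *ag, bool owner)
      {
              return owner ? ag->freeblks : clamp_sub(ag->freeblks, ag->reserved);
      }

      /* longest free extent as seen by an allocation request */
      static uint32_t visible_longest(const struct ag_counters *ag, bool owner)
      {
              return owner ? ag->longest : clamp_sub(ag->longest, ag->reserved);
      }

      int main(void)
      {
              struct ag_counters ag = { .freeblks = 100, .longest = 60, .reserved = 80 };

              printf("ordinary caller sees %u free, longest %u\n",
                     visible_freeblks(&ag, false), visible_longest(&ag, false));
              printf("reservation owner sees %u free, longest %u\n",
                     visible_freeblks(&ag, true), visible_longest(&ag, true));
              return 0;
      }
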
  4. 30 Aug, 2016 · 1 commit
    • xfs: track log done items directly in the deferred pending work item · ea78d808
      Committed by Darrick J. Wong
      Christoph reports slab corruption when a deferred refcount update
      aborts during _defer_finish().  The cause of this was broken log item
      state tracking in xfs_defer_pending -- upon an abort,
      _defer_trans_abort() will call abort_intent on all intent items,
      including the ones that have already had a done item attached.
      
      This is incorrect because each intent item has two refcounts: the first
      is released when the intent item is committed to the log; and the
      second is released when the _done_ item is committed to the log, or
      by the intent creator if there is no done item.  In other words, once
      we log the done item, responsibility for releasing the intent item's
      second refcount is transferred to the done item and /must not/ be
      performed by anything else.
      
      The dfp_committed flag should have been tracking whether or not we had
      a done item so that _defer_trans_abort could decide if it needs to
      abort the intent item, but due to a thinko this was not the case.  Rip
      it out and track the done item directly so that we do the right thing
      w.r.t. intent item freeing.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reported-by: Christoph Hellwig <hch@infradead.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      ea78d808
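
      A toy model of the refcounting rule in the message above: an intent starts with two references, and logging a done item transfers ownership of the second reference to that done item, so the abort path may only release intents that have no done item. The names are illustrative, not the kernel's:

      #include <stdio.h>
      #include <stdbool.h>

      struct intent {
              int refcount;
              bool have_done_item;    /* replaces the old dfp_committed flag */
      };

      static void intent_put(struct intent *ip, const char *who)
      {
              ip->refcount--;
              printf("%s dropped a reference, %d left\n", who, ip->refcount);
      }

      /* Abort path: drop the second reference only if no done item was
       * logged; otherwise the done item owns it and must release it. */
      static void abort_intent(struct intent *ip)
      {
              if (!ip->have_done_item)
                      intent_put(ip, "abort");
      }

      int main(void)
      {
              struct intent a = { .refcount = 2, .have_done_item = false };
              struct intent b = { .refcount = 2, .have_done_item = true };

              intent_put(&a, "log commit");   /* first ref: intent hit the log */
              intent_put(&b, "log commit");

              abort_intent(&a);       /* no done item: abort drops the second ref */
              abort_intent(&b);       /* done item logged: abort must not touch it */
              intent_put(&b, "done item");    /* done item releases it instead */
              return 0;
      }
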
  5. 17 Aug, 2016 · 1 commit
  6. 03 Aug, 2016 · 10 commits
  7. 20 Jul, 2016 · 2 commits
  8. 21 Jun, 2016 · 2 commits
  9. 18 May, 2016 · 1 commit
  10. 06 Apr, 2016 · 2 commits
  11. 08 Feb, 2016 · 1 commit
    • xfs: don't use ioends for direct write completions · 273dda76
      Committed by Christoph Hellwig
      We only need to communicate two bits of information to the direct I/O
      completion handler:
      
       (1) do we need to convert any unwritten extents in the range
       (2) do we need to check if we need to update the inode size based
           on the range passed to the completion handler
      
      We can use the private data passed to the get_block handler and the
      completion handler as a simple bitmask to communicate this information
      instead of the current complicated infrastructure reusing the ioends
      from the buffer I/O path, and thus avoiding a memory allocation and
      a context switch for any non-trivial direct write.  As a nice side
      effect we also decouple the direct I/O path implementation from that
      of the buffered I/O path.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      
      273dda76
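
      A sketch of the two-bit completion protocol described above: the block-mapping step encodes what the completion handler must do in a plain integer instead of an allocated ioend. The flag names here are made up for illustration:

      #include <stdio.h>
      #include <stdbool.h>

      #define DIO_UNWRITTEN   (1 << 0)        /* convert unwritten extents */
      #define DIO_APPEND      (1 << 1)        /* may need an inode size update */

      /* get_block side: record what completion has to do */
      static int dio_map_flags(bool unwritten, bool beyond_eof)
      {
              int flags = 0;

              if (unwritten)
                      flags |= DIO_UNWRITTEN;
              if (beyond_eof)
                      flags |= DIO_APPEND;
              return flags;
      }

      /* completion side: act on the bits, no ioend allocation involved */
      static void dio_end_io(int flags)
      {
              if (flags & DIO_UNWRITTEN)
                      printf("convert unwritten extents in the written range\n");
              if (flags & DIO_APPEND)
                      printf("check whether the inode size needs updating\n");
              if (!flags)
                      printf("trivial overwrite: nothing to do\n");
      }

      int main(void)
      {
              dio_end_io(dio_map_flags(false, false));
              dio_end_io(dio_map_flags(true, true));
              return 0;
      }
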
  12. 08 Jan, 2016 · 1 commit
  13. 03 Nov, 2015 · 1 commit
  14. 12 Oct, 2015 · 1 commit
  15. 09 Sep, 2015 · 1 commit
  16. 19 Aug, 2015 · 1 commit
  17. 29 May, 2015 · 1 commit
    • xfs: allocate sparse inode chunks on full chunk allocation failure · 56d1115c
      Committed by Brian Foster
      xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode
      chunk. If all else fails, reduce the allocation to the sparse length and
      alignment and attempt to allocate a sparse inode chunk.
      
      If sparse chunk allocation succeeds, check whether an inobt record
      already exists that can track the chunk. If so, inherit and update the
      existing record. Otherwise, insert a new record for the sparse chunk.
      
      Create helpers to align sparse chunk inode records and insert or update
      existing records in the inode btrees. The xfs_inobt_insert_sprec()
      helper implements the merge or update semantics required for sparse
      inode records with respect to both the inobt and finobt. To update the
      inobt, either insert a new record or merge with an existing record. To
      update the finobt, use the updated inobt record to either insert or
      replace an existing record.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      56d1115c
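
      A high-level sketch of the fallback described above: try a full inode chunk first, retry with the smaller sparse length, then either merge into an existing inobt record or insert a new one. All names and sizes here are illustrative:

      #include <stdio.h>
      #include <stdbool.h>

      #define FULL_CHUNK_INODES       64
      #define SPARSE_CHUNK_INODES     16

      /* stand-in for the extent allocator: pretend only small requests fit */
      static bool alloc_extent(int inodes)
      {
              return inodes <= SPARSE_CHUNK_INODES;
      }

      static void insert_or_merge_inobt_record(bool record_exists)
      {
              if (record_exists)
                      printf("merge sparse chunk into the existing inobt record\n");
              else
                      printf("insert a new inobt record for the sparse chunk\n");
      }

      static int ialloc_ag_alloc(bool record_exists)
      {
              if (alloc_extent(FULL_CHUNK_INODES)) {
                      printf("allocated a full %d-inode chunk\n", FULL_CHUNK_INODES);
                      return 0;
              }
              /* full chunk failed: fall back to a sparse chunk */
              if (!alloc_extent(SPARSE_CHUNK_INODES))
                      return -1;      /* the AG really is out of space */
              printf("allocated a sparse %d-inode chunk\n", SPARSE_CHUNK_INODES);
              insert_or_merge_inobt_record(record_exists);
              return 0;
      }

      int main(void)
      {
              ialloc_ag_alloc(false);
              ialloc_ag_alloc(true);
              return 0;
      }
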
  18. 16 Apr, 2015 · 3 commits
    • xfs: DIO writes within EOF don't need an ioend · a06c277a
      Committed by Dave Chinner
      DIO writes that lie entirely within EOF have nothing to do in IO
      completion. In this case, we don't need no steekin' ioend, and so we
      can avoid allocating an ioend until we have a mapping that spans
      EOF.
      
      This means that IO completion has two contexts - deferred completion
      to the dio workqueue that uses an ioend, and interrupt completion
      that does nothing because there is nothing that can be done in this
      context.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      a06c277a
    • xfs: handle DIO overwrite EOF update completion correctly · 6dfa1b67
      Committed by Dave Chinner
      Currently a DIO overwrite that extends the EOF (e.g. sub-block IO or
      write into allocated blocks beyond EOF) requires a transaction for
      the EOF update. This is done in IO completion context, but we aren't
      explicitly handling this situation properly and so it can run in
      interrupt context. Ensure that we defer IO that spans EOF correctly
      to the DIO completion workqueue, and now that we have an ioend in IO
      completion we can use the common ioend completion path to do all the
      work.
      
      Note: we do not preallocate the append transaction as we can have
      multiple mapping and allocation calls per direct IO. Hence
      preallocating can still leave us with nested transactions by
      attempting to map and allocate more blocks after we've preallocated
      an append transaction.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      6dfa1b67
    • xfs: DIO needs an ioend for writes · d5cc2e3f
      Committed by Dave Chinner
      Currently we can only tell DIO completion that an IO requires
      unwritten extent completion. This is done by a hacky non-null
      private pointer passed to IO completion, but the private pointer
      does not actually contain any information that is used.
      
      We also need to pass to IO completion the fact that the IO may be
      beyond EOF and so a size update transaction needs to be done. This
      is currently determined by checks in the IO completion code, but we need
      to determine if this is necessary at block mapping time as we need
      to defer the size update transactions to a completion workqueue,
      just like unwritten extent conversion.
      
      To do this, first we need to allocate and pass an ioend to IO
      completion. Add this for unwritten extent conversion; we'll do the
      EOF updates in the next commit.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      d5cc2e3f
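
      A combined sketch of the three DIO commits in this group: the mapping step allocates an ioend only when completion has real work to do (unwritten extents or a write spanning EOF), interrupt-context completion does nothing when there is no ioend, and anything with an ioend is deferred to workqueue context where transactions are allowed. The types and names below are simplified stand-ins:

      #include <stdio.h>
      #include <stdbool.h>
      #include <stdlib.h>

      struct ioend {
              bool unwritten;         /* needs unwritten extent conversion */
              bool append;            /* needs an EOF size-update transaction */
      };

      /* mapping time: only allocate an ioend when completion has work to do */
      static struct ioend *dio_map(bool unwritten, bool spans_eof)
      {
              struct ioend *io;

              if (!unwritten && !spans_eof)
                      return NULL;    /* overwrite within EOF: no ioend needed */
              io = calloc(1, sizeof(*io));
              io->unwritten = unwritten;
              io->append = spans_eof;
              return io;
      }

      /* deferred (workqueue) completion: transactions are allowed here */
      static void dio_complete_deferred(struct ioend *io)
      {
              if (io->unwritten)
                      printf("convert unwritten extents\n");
              if (io->append)
                      printf("run EOF size-update transaction\n");
              free(io);
      }

      /* interrupt-context completion: defer whenever there is an ioend */
      static void dio_complete(struct ioend *io)
      {
              if (!io) {
                      printf("in-EOF overwrite: nothing to do\n");
                      return;
              }
              dio_complete_deferred(io);      /* stands in for queue_work() */
      }

      int main(void)
      {
              dio_complete(dio_map(false, false));
              dio_complete(dio_map(true, true));
              return 0;
      }
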
  19. 25 Mar, 2015 · 2 commits
  20. 23 Feb, 2015 · 2 commits
  21. 02 Oct, 2014 · 1 commit
    • xfs: introduce xfs_buf_submit[_wait] · 595bff75
      Committed by Dave Chinner
      There is a lot of cookie-cutter code that looks like:
      
      	if (shutdown)
      		handle buffer error
      	xfs_buf_iorequest(bp)
      	error = xfs_buf_iowait(bp)
      	if (error)
      		handle buffer error
      
      spread through XFS. There's significant complexity now in
      xfs_buf_iorequest() to specifically handle this sort of synchronous
      IO pattern, but there's all sorts of nasty surprises in different
      error handling code dependent on who owns the buffer references and
      the locks.
      
      Pull this pattern into a single helper, where we can hide all the
      synchronous IO warts and hence make the error handling for all the
      callers much saner. This removes the need for a special extra
      reference to protect IO completion processing, as we can now hold a
      single reference across dispatch and waiting, simplifying the sync
      IO semantics and error handling.
      
      In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
      make it explicitly handle only asynchronous IO. This forces all users
      to be switched specifically to one interface or the other and
      removes any ambiguity between how the interfaces are to be used. It
      also means that xfs_buf_iowait() goes away.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      595bff75
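
      A user-space sketch of the consolidation described above: the repeated shutdown-check/submit/wait/error pattern is folded into a single synchronous helper so every caller stops open-coding it. This is a stand-in under simplified assumptions; the real xfs_buf_submit interfaces differ:

      #include <stdio.h>
      #include <errno.h>
      #include <stdbool.h>

      struct buf {
              int error;
      };

      static bool fs_shutdown;

      static void buf_ioerror(struct buf *bp, int error)
      {
              bp->error = error;
      }

      static void buf_do_io(struct buf *bp)
      {
              /* pretend the IO completed successfully */
              bp->error = 0;
      }

      /* single synchronous helper: hides the shutdown and error handling
       * warts from every caller */
      static int buf_submit_wait(struct buf *bp)
      {
              if (fs_shutdown) {
                      buf_ioerror(bp, -EIO);
                      return -EIO;
              }
              buf_do_io(bp);          /* dispatch and wait in one place */
              return bp->error;
      }

      int main(void)
      {
              struct buf bp = { 0 };

              printf("submit_wait -> %d\n", buf_submit_wait(&bp));
              fs_shutdown = true;
              printf("submit_wait after shutdown -> %d\n", buf_submit_wait(&bp));
              return 0;
      }
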
  22. 12 Jun, 2014 · 1 commit
    • ->splice_write() via ->write_iter() · 8d020765
      Committed by Al Viro
      iter_file_splice_write() - a ->splice_write() instance that gathers the
      pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
      it to ->write_iter().  A bunch of simple cases converted to that...
      
      [AV: fixed the braino spotted by Cyrill]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8d020765
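
      A user-space analogy for the idea above: rather than writing each buffer separately, gather them into one vector and hand the whole thing to a single vectored write, much as iter_file_splice_write() gathers pipe buffers into one iov_iter for ->write_iter(). This uses plain writev(), not the kernel interfaces:

      #include <stdio.h>
      #include <string.h>
      #include <sys/uio.h>
      #include <unistd.h>

      int main(void)
      {
              const char *chunks[] = { "spliced ", "data ", "in one call\n" };
              struct iovec iov[3];
              int i;

              /* gather the separate buffers into a single vector */
              for (i = 0; i < 3; i++) {
                      iov[i].iov_base = (void *)chunks[i];
                      iov[i].iov_len = strlen(chunks[i]);
              }

              /* one vectored write covers all of them, like one ->write_iter() */
              if (writev(STDOUT_FILENO, iov, 3) < 0)
                      perror("writev");
              return 0;
      }
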
  23. 23 Apr, 2014 · 1 commit