1. 11 5月, 2022 1 次提交
  2. 04 5月, 2022 1 次提交
    • D
      xfs: intent item whiteouts · 0d227466
      Dave Chinner 提交于
      When we log modifications based on intents, we add both intent
      and intent done items to the modification being made. These get
      written to the log to ensure that the operation is re-run if the
      intent done is not found in the log.
      
      However, for operations that complete wholly within a single
      checkpoint, the change in the checkpoint is atomic and will never
      need replay. In this case, we don't need to actually write the
      intent and intent done items to the journal because log recovery
      will never need to manually restart this modification.
      
      Log recovery currently handles intent/intent done matching by
      inserting the intent into the AIL, then removing it when a matching
      intent done item is found. Hence for all the intent-based operations
      that complete within a checkpoint, we spend all that time parsing
      the intent/intent done items just to cancel them and do nothing with
      them.
      
      Hence it follows that the only time we actually need intents in the
      log is when the modification crosses checkpoint boundaries in the
      log and so may only be partially complete in the journal. Hence if
      we commit and intent done item to the CIL and the intent item is in
      the same checkpoint, we don't actually have to write them to the
      journal because log recovery will always cancel the intents.
      
      We've never really worried about the overhead of logging intents
      unnecessarily like this because the intents we log are generally
      very much smaller than the change being made. e.g. freeing an extent
      involves modifying at lease two freespace btree blocks and the AGF,
      so the EFI/EFD overhead is only a small increase in space and
      processing time compared to the overall cost of freeing an extent.
      
      However, delayed attributes change this cost equation dramatically,
      especially for inline attributes. In the case of adding an inline
      attribute, we only log the inode core and attribute fork at present.
      With delayed attributes, we now log the attr intent which includes
      the name and value, the inode core adn attr fork, and finally the
      attr intent done item. We increase the number of items we log from 1
      to 3, and the number of log vectors (regions) goes up from 3 to 7.
      Hence we tripple the number of objects that the CIL has to process,
      and more than double the number of log vectors that need to be
      written to the journal.
      
      At scale, this means delayed attributes cause a non-pipelined CIL to
      become CPU bound processing all the extra items, resulting in a > 40%
      performance degradation on 16-way file+xattr create worklaods.
      Pipelining the CIL (as per 5.15) reduces the performance degradation
      to 20%, but now the limitation is the rate at which the log items
      can be written to the iclogs and iclogs be dispatched for IO and
      completed.
      
      Even log IO completion is slowed down by these intents, because it
      now has to process 3x the number of items in the checkpoint.
      Processing completed intents is especially inefficient here, because
      we first insert the intent into the AIL, then remove it from the AIL
      when the intent done is processed. IOWs, we are also doing expensive
      operations in log IO completion we could completely avoid if we
      didn't log completed intent/intent done pairs.
      
      Enter log item whiteouts.
      
      When an intent done is committed, we can check to see if the
      associated intent is in the same checkpoint as we are currently
      committing the intent done to. If so, we can mark the intent log
      item with a whiteout and immediately free the intent done item
      rather than committing it to the CIL. We can basically skip the
      entire formatting and CIL insertion steps for the intent done item.
      
      However, we cannot remove the intent item from the CIL at this point
      because the unlocked per-cpu CIL item lists do not permit removal
      without holding the CIL context lock exclusively. Transaction commit
      only holds the context lock shared, hence the best we can do is mark
      the intent item with a whiteout so that the CIL push can release it
      rather than writing it to the log.
      
      This means we never write the intent to the log if the intent done
      has also been committed to the same checkpoint, but we'll always
      write the intent if the intent done has not been committed or has
      been committed to a different checkpoint. This will result in
      correct log recovery behaviour in all cases, without the overhead of
      logging unnecessary intents.
      
      This intent whiteout concept is generic - we can apply it to all
      intent/intent done pairs that have a direct 1:1 relationship. The
      way deferred ops iterate and relog intents mean that all intents
      currently have a 1:1 relationship with their done intent, and hence
      we can apply this cancellation to all existing intent/intent done
      implementations.
      
      For delayed attributes with a 16-way 64kB xattr create workload,
      whiteouts reduce the amount of journalled metadata from ~2.5GB/s
      down to ~600MB/s and improve the creation rate from 9000/s to
      14000/s.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      0d227466
  3. 29 4月, 2022 3 次提交
  4. 28 4月, 2022 1 次提交
  5. 21 4月, 2022 2 次提交
  6. 11 4月, 2022 2 次提交
  7. 20 3月, 2022 1 次提交
  8. 15 3月, 2022 1 次提交
  9. 20 10月, 2021 1 次提交
  10. 20 8月, 2021 16 次提交
  11. 19 8月, 2021 2 次提交
  12. 17 8月, 2021 1 次提交
  13. 10 8月, 2021 6 次提交
  14. 07 8月, 2021 1 次提交
    • D
      xfs: per-cpu deferred inode inactivation queues · ab23a776
      Dave Chinner 提交于
      Move inode inactivation to background work contexts so that it no
      longer runs in the context that releases the final reference to an
      inode. This will allow process work that ends up blocking on
      inactivation to continue doing work while the filesytem processes
      the inactivation in the background.
      
      A typical demonstration of this is unlinking an inode with lots of
      extents. The extents are removed during inactivation, so this blocks
      the process that unlinked the inode from the directory structure. By
      moving the inactivation to the background process, the userspace
      applicaiton can keep working (e.g. unlinking the next inode in the
      directory) while the inactivation work on the previous inode is
      done by a different CPU.
      
      The implementation of the queue is relatively simple. We use a
      per-cpu lockless linked list (llist) to queue inodes for
      inactivation without requiring serialisation mechanisms, and a work
      item to allow the queue to be processed by a CPU bound worker
      thread. We also keep a count of the queue depth so that we can
      trigger work after a number of deferred inactivations have been
      queued.
      
      The use of a bound workqueue with a single work depth allows the
      workqueue to run one work item per CPU. We queue the work item on
      the CPU we are currently running on, and so this essentially gives
      us affine per-cpu worker threads for the per-cpu queues. THis
      maintains the effective CPU affinity that occurs within XFS at the
      AG level due to all objects in a directory being local to an AG.
      Hence inactivation work tends to run on the same CPU that last
      accessed all the objects that inactivation accesses and this
      maintains hot CPU caches for unlink workloads.
      
      A depth of 32 inodes was chosen to match the number of inodes in an
      inode cluster buffer. This hopefully allows sequential
      allocation/unlink behaviours to defering inactivation of all the
      inodes in a single cluster buffer at a time, further helping
      maintain hot CPU and buffer cache accesses while running
      inactivations.
      
      A hard per-cpu queue throttle of 256 inode has been set to avoid
      runaway queuing when inodes that take a long to time inactivate are
      being processed. For example, when unlinking inodes with large
      numbers of extents that can take a lot of processing to free.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      [djwong: tweak comments and tracepoints, convert opflags to state bits]
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      ab23a776
  15. 30 7月, 2021 1 次提交