1. 22 Jun 2021, 13 commits
    • xfs: shorten the shutdown messages to a single line · c06ad17c
      Committed by Darrick J. Wong
      Consolidate the shutdown messages to a single line containing the
      reason, the passed-in flags, the source of the shutdown, and the end
      result.  This means we now only have one line to look for when
      debugging, which is useful when the fs goes down while something else is
      flooding dmesg.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
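      A hedged sketch of the consolidated message (the format string here is
      illustrative rather than a verbatim quote of the patch; xfs_alert_tag(),
      %pS and __return_address are real kernel facilities):

          /*
           * One line carries the reason, the passed-in flags and the
           * caller, so a single grep through a flooded dmesg finds it.
           */
          xfs_alert_tag(mp, tag,
              "%s (0x%x) detected at %pS (%s:%d). Shutting down filesystem.",
              why, flags, __return_address, fname, lnnum);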
    • xfs: print name of function causing fs shutdown instead of hex pointer · 3a1c3abe
      Committed by Darrick J. Wong
      In xfs_do_force_shutdown, print the symbolic name of the function that
      called us to shut down the filesystem instead of a raw hex pointer.
      This makes debugging a lot easier:
      
      XFS (sda): xfs_do_force_shutdown(0x2) called from line 2440 of file
      	fs/xfs/xfs_log.c. Return address = ffffffffa038bc38
      
      becomes:
      
      XFS (sda): xfs_do_force_shutdown(0x2) called from line 2440 of file
      	fs/xfs/xfs_log.c. Return address = xfs_trans_mod_sb+0x25
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
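      The mechanism is the kernel's %pS printk format specifier, which
      resolves a code address to symbol+offset through kallsyms at print
      time. A minimal before/after sketch (printk level and format string
      simplified from the output shown above):

          /* Before: a raw pointer, meaningless without the matching vmlinux. */
          xfs_notice(mp, "... Return address = 0x%p", __return_address);

          /* After: %pS resolves the address to symbol+offset. */
          xfs_notice(mp, "... Return address = %pS", __return_address);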
    • xfs: fix type mismatches in the inode reclaim functions · 10be350b
      Committed by Darrick J. Wong
      It's currently unlikely that we will ever end up with more than 4
      billion inodes waiting for reclamation, but the fs object code uses long
      int for object counts and we're certainly capable of generating that
      many.  Instead of truncating the internal counters, widen them and
      report the object counts correctly.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
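      A sketch of the kind of widening involved, assuming the count funnels
      through a helper like this one (treat the exact name and signature as
      assumptions):

          /* Before: the object count is truncated through int. */
          int xfs_reclaim_inodes_count(struct xfs_mount *mp);

          /* After: widened to long, matching the fs object-count convention. */
          long xfs_reclaim_inodes_count(struct xfs_mount *mp);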
    • xfs: separate primary inode selection criteria in xfs_iget_cache_hit · 77b4d286
      Committed by Darrick J. Wong
      During review of the v6 deferred inode inactivation patchset[1], Dave
      commented that _cache_hit should have a clear separation between inode
      selection criteria and actions performed on a selected inode.  Move a
      hunk to make this true, and compact the shrink cases in the function.
      
      [1] https://lore.kernel.org/linux-xfs/162310469340.3465262.504398465311182657.stgit@locust/T/#mca6d958521cb88bbc1bfe1a30767203328d410b5
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
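      A simplified skeleton of the separation described above (XFS_IRECLAIM
      and XFS_IRECLAIMABLE are real inode flags; the structure shown is
      editorial shorthand, not the full function):

          spin_lock(&ip->i_flags_lock);

          /* Selection criteria: every reason to reject this inode, up front. */
          if (ip->i_ino != ino)
                  goto out_skip;          /* raced with inode recycling */
          if (ip->i_flags & XFS_IRECLAIM)
                  goto out_skip;          /* reclaim in progress, retry */

          /* Actions: only now recycle or reuse the selected inode. */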
    • xfs: refactor the inode recycling code · ff7bebeb
      Committed by Darrick J. Wong
      Hoist the code in xfs_iget_cache_hit that restores the VFS inode state
      to an xfs_inode that was previously vfs-destroyed.  The next patch will
      add a new set of state flags, so we need the helper to avoid
      duplication.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
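      A hedged sketch of the hoisted helper (body abbreviated; the helper
      name follows the description and its signature is an assumption):

          /*
           * Everything needed to take a vfs-destroyed inode and make it
           * usable again lives here, so the next patch can add new state
           * flags in one place instead of in the middle of the cache hit
           * path.
           */
          static int
          xfs_iget_recycle(struct xfs_perag *pag, struct xfs_inode *ip)
          {
                  /* reinitialise VFS inode state, clear reclaim flags */
                  return 0;
          }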
    • xfs: add iclog state trace events · 956f6daa
      Committed by Dave Chinner
      For the DEBUGS!
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Committed by Dave Chinner
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This led me to xfs_log_force_lsn() and a look at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
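      The resulting API shape, sketched; the function and type names follow
      the cleanup described above, while the exact parameter lists are
      assumptions:

          /* Before: an xfs_lsn_t that actually held a CIL context sequence. */
          int xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn,
                                uint flags, int *log_flushed);

          /* After: an explicit CIL sequence type (xfs_csn_t); only the CIL
           * code converts it to the real commit LSN for the iclog sync. */
          int xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq,
                                uint flags, int *log_flushed);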
    • xfs: Fix CIL throttle hang when CIL space used going backwards · 19f4e7cc
      Committed by Dave Chinner
      A hang with tasks stuck on the CIL hard throttle was reported and
      largely diagnosed by Donald Buczek, who discovered that it was a
      result of the CIL context space usage decrementing in committed
      transactions once the hard throttle limit had been hit and processes
      were already blocked.  This resulted in the CIL push not waking up
      those waiters because the CIL context was no longer over the hard
      throttle limit.
      
      The surprising aspect of this was the CIL space usage going
      backwards regularly enough to trigger this situation. Assumptions
      had been made in design that the relogging process would only
      increase the size of the objects in the CIL, and so that space would
      only increase.
      
      This change and its commit message fix the issue and document the
      result of an audit of the triggers that can cause the CIL space to
      go backwards, how large the backwards steps tend to be, the
      frequency in which they occur, and what the impact on the CIL
      accounting code is.
      
      Even though the CIL ctx->space_used can go backwards, it will only
      do so if the log item is already logged to the CIL and contains a
      space reservation for its entire logged state. This is tracked by
      the shadow buffer state on the log item. If the item is not
      previously logged in the CIL it has no shadow buffer nor log vector,
      and hence the entire size of the logged item copied to the log
      vector is accounted to the CIL space usage. i.e.  it will always go
      up in this case.
      
      If the item has a log vector (i.e. already in the CIL) and the size
      decreases, then the existing log vector will be overwritten and the
      space usage will go down. This is the only condition where the space
      usage reduces, and it can only occur when an item is already tracked
      in the CIL. Hence we are safe from CIL space usage underruns as a
      result of log items decreasing in size when they are relogged.
      
      Typically this reduction in CIL usage occurs from metadata blocks
      being freed, such as when a btree block merge occurs or a directory
      entry/xattr entry is removed and the da-tree is reduced in size.
      This generally results in a reduction in size of around a single
      block in the CIL, but also tends to increase the number of log
      vectors because the parent and sibling nodes in the tree need to be
      updated when a btree block is removed. If a multi-level merge
      occurs, then we see reduction in size of 2+ blocks, but again the
      log vector count goes up.
      
      The other vector is inode fork size changes, which only log the
      current size of the fork and ignore the previously logged size when
      the fork is relogged. Hence if we are removing items from the inode
      fork (dir/xattr removal in shortform, extent record removal in
      extent form, etc.) the relogged size of the inode fork can decrease.
      
      No other log items can decrease in size either because they are a
      fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
      an intent actually creates a new intent log item and doesn't relog
      the old item at all.) Hence the only two vectors for CIL context
      size reduction are relogging inode forks and marking buffers active
      in the CIL as stale.
      
      Long story short: the majority of the code does the right thing and
      handles the reduction in log item size correctly, and only the CIL
      hard throttle implementation is problematic and needs fixing. This
      patch makes that fix and adds comments to the log item code paths
      that shrink items when they are relogged, as a clear reminder that
      this can and does happen frequently.
      
      The throttle fix is based upon the change Donald proposed, though it
      goes further to ensure that once the throttle is activated, it
      captures all tasks until the CIL push issues a wakeup, regardless of
      whether the CIL space used has gone back under the throttle
      threshold.
      
      This ensures that we prevent tasks reducing the CIL slightly under
      the throttle threshold and then making more changes that push it
      well over the throttle limit. This is achieved by checking if the
      throttle wait queue is already active as a condition of throttling.
      Hence once we start throttling, we continue to apply the throttle
      until the CIL context push wakes everything on the wait queue.
      
      We can use waitqueue_active() for the waitqueue manipulations and
      checks as they are all done under the ctx->xc_push_lock. Hence the
      waitqueue has external serialisation and we can safely peek inside
      the wait queue without holding the internal waitqueue locks.
      
      Many thanks to Donald for his diagnostic and analysis work to
      isolate the cause of this hang.
      Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
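      A simplified sketch of the fixed throttle check, condensed from the
      description above (xlog_wait() sleeps and drops the given spinlock;
      field names follow the CIL code):

          /*
           * Once anyone is parked on the throttle, every new committer
           * joins them until the push wakes the whole queue, even if
           * space_used has dipped back under the limit. xc_push_lock is
           * the external serialisation that makes the lockless
           * waitqueue_active() check safe.
           */
          if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
              waitqueue_active(&cil->xc_push_wait)) {
                  trace_xfs_log_cil_wait(log, ctx->ticket);
                  xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
          }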
    • xfs: journal IO cache flush reductions · eef983ff
      Committed by Dave Chinner
      Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
      guarantee the ordering requirements the journal has w.r.t. metadata
      writeback. The two ordering constraints are:
      
      1. we cannot overwrite metadata in the journal until we guarantee
      that the dirty metadata has been written back in place and is
      stable.
      
      2. we cannot write back dirty metadata until it has been written to
      the journal and guaranteed to be stable (and hence recoverable) in
      the journal.
      
      The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
      causes the journal IO to issue a cache flush and wait for it to
      complete before issuing the write IO to the journal. Hence all
      completed metadata IO is guaranteed to be stable before the journal
      overwrites the old metadata.
      
      The ordering guarantees of #2 are provided by the REQ_FUA, which
      ensures the journal writes do not complete until they are on stable
      storage. Hence by the time the last journal IO in a checkpoint
      completes, we know that the entire checkpoint is on stable storage
      and we can unpin the dirty metadata and allow it to be written back.
      
      This is the mechanism by which ordering was first implemented in XFS
      way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
      ("Add support for drive write cache flushing") in the xfs-archive
      tree.
      
      A lot has changed since then, most notably we now use delayed
      logging to checkpoint the filesystem to the journal rather than
      write each individual transaction to the journal. Cache flushes on
      journal IO are necessary when individual transactions are wholly
      contained within a single iclog. However, CIL checkpoints are single
      transactions that typically span hundreds to thousands of individual
      journal writes, and so the requirements for device cache flushing
      have changed.
      
      That is, the ordering rules I state above apply to ordering of
      atomic transactions recorded in the journal, not to the journal IO
      itself. Hence we need to ensure metadata is stable before we start
      writing a new transaction to the journal (guarantee #1), and we need
      to ensure the entire transaction is stable in the journal before we
      start metadata writeback (guarantee #2).
      
      Hence we only need a REQ_PREFLUSH on the journal IO that starts a
      new journal transaction to provide #1, and it is not needed on any other
      journal IO done within the context of that journal transaction.
      
      The CIL checkpoint already issues a cache flush before it starts
      writing to the log, so we no longer need the iclog IO to issue a
      REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
      to xlog_write(), we no longer need to mark the first iclog in
      the log write with REQ_PREFLUSH for this case. As an added bonus,
      this ordering mechanism works for both internal and external logs,
      meaning we can remove the explicit data device cache flushes from
      the iclog write code when using external logs.
      
      Given the new ordering semantics of commit records for the CIL, we
      need iclogs containing commit records to issue a REQ_PREFLUSH. We
      also require unmount records to do this. Hence for both
      XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
      to mark the first iclog being written with REQ_PREFLUSH.
      
      For both commit records and unmount records, we also want them
      immediately on stable storage, so we want to also mark the iclogs
      that contain these records to be marked REQ_FUA. That means if a
      record is split across multiple iclogs, they are all marked REQ_FUA
      and not just the last one so that when the transaction is completed
      all the parts of the record are on stable storage.
      
      And for external logs, unmount records need a pre-write data device
      cache flush similar to the CIL checkpoint cache pre-flush as the
      internal iclog write code does not do this implicitly anymore.
      
      As an optimisation, when the commit record lands in the same iclog
      as the journal transaction starts, we don't need to wait for
      anything and can simply use REQ_FUA to provide guarantee #2.  This
      means that for fsync() heavy workloads, the cache flush behaviour is
      completely unchanged and there is no degradation in performance as a
      result of optimising the multi-IO transaction case.
      
      The most notable sign that there is less IO latency on my test
      machine (nvme SSDs) is that the "noiclogs" rate has dropped
      substantially. This metric indicates that the CIL push is blocking
      in xlog_get_iclog_space() waiting for iclog IO completion to occur.
      With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
      every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
      is blocking waiting for log IO. With the changes in this patch, this
      drops to 1 noiclog event for every 100 iclog writes. Hence it is
      clear that log IO is completing much faster than it was previously,
      but it is also clear that for large iclog sizes, this isn't the
      performance limiting factor on this hardware.
      
      With smaller iclogs (32kB), however, there is a substantial
      difference. With the cache flush modifications, the journal is now
      running at over 4000 write IOPS, and the journal throughput is
      largely identical to the 256kB iclogs and the noiclog event rate
      stays low at about 1:50 iclog writes. The existing code tops out at
      about 2500 IOPS as the number of cache flushes dominate performance
      and latency. The noiclog event rate is about 1:4, and the
      performance variance is quite large as the journal throughput can
      fall to less than half the peak sustained rate when the cache flush
      rate prevents metadata writeback from keeping up and the log runs
      out of space and throttles reservations.
      
      As a result:
      
      	logbsize	fsmark create rate	rm -rf
      before	32kb		152851+/-5.3e+04	5m28s
      patched	32kb		221533+/-1.1e+04	5m24s
      
      before	256kb		220239+/-6.2e+03	4m58s
      patched	256kb		228286+/-9.2e+03	5m06s
      
      The rm -rf times are included because I ran them, but the
      differences are largely noise. This workload is largely metadata
      read IO latency bound and the changes to the journal cache flushing
      don't really make any noticeable difference to behaviour apart from
      a reduction in noiclog events from background CIL pushing.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
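      In code terms, the policy above reduces to per-iclog bio op flags
      instead of a blanket REQ_PREFLUSH | REQ_FUA on every write; a hedged
      sketch (flag names per this series, surrounding plumbing omitted):

          iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC;

          /* Guarantee #1: only the iclog opening a checkpoint pre-flushes. */
          if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
                  iclog->ic_bio.bi_opf |= REQ_PREFLUSH;

          /* Guarantee #2: only commit/unmount record iclogs need FUA. */
          if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
                  iclog->ic_bio.bi_opf |= REQ_FUA;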
    • xfs: remove need_start_rec parameter from xlog_write() · 3468bb1c
      Committed by Dave Chinner
      The CIL push is the only call to xlog_write that sets this variable
      to true. The other callers don't need a start rec, and they tell
      xlog_write what to do by passing the type of ophdr they need written
      in the flags field. The need_start_rec parameter essentially tells
      xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
      so get rid of the variable to do this and pass XLOG_START_TRANS as
      the flag value into xlog_write() from the CIL push.
      
      $ size fs/xfs/xfs_log.o*
        text	   data	    bss	    dec	    hex	filename
       27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
       27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
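      The call-site change in the CIL push, sketched (the argument list is
      an assumption; only the trailing arguments matter here):

          /* Before: a dedicated bool asks xlog_write() for a start record. */
          error = xlog_write(log, lv_chain, tic, &commit_lsn, NULL,
                             XLOG_COMMIT_TRANS, true);

          /* After: XLOG_START_TRANS in the flags replaces the bool. */
          error = xlog_write(log, lv_chain, tic, &commit_lsn, NULL,
                             XLOG_START_TRANS);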
    • xfs: CIL checkpoint flushes caches unconditionally · bad77c37
      Committed by Dave Chinner
      Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
      guarantee the ordering requirements the journal has w.r.t. metadata
      writeback. The two ordering constraints are:
      
      1. we cannot overwrite metadata in the journal until we guarantee
      that the dirty metadata has been written back in place and is
      stable.
      
      2. we cannot write back dirty metadata until it has been written to
      the journal and guaranteed to be stable (and hence recoverable) in
      the journal.
      
      These rules apply to the atomic transactions recorded in the
      journal, not to the journal IO itself. Hence we need to ensure
      metadata is stable before we start writing a new transaction to the
      journal (guarantee #1), and we need to ensure the entire transaction
      is stable in the journal before we start metadata writeback
      (guarantee #2).
      
      The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
      being added to every iclog IO. This causes the journal IO to issue a
      cache flush and wait for it to complete before issuing the write IO
      to the journal. Hence all completed metadata IO is guaranteed to be
      stable before the journal overwrites the old metadata.
      
      However, for long running CIL checkpoints that might do a thousand
      journal IOs, we don't need every single one of these iclog IOs to
      issue a cache flush - the cache flush done before the first iclog is
      submitted is sufficient to cover the entire range in the log that
      the checkpoint will overwrite because the CIL space reservation
      guarantees the tail of the log (completed metadata) is already
      beyond the range of the checkpoint write.
      
      Hence we only need a full cache flush between closing off the CIL
      checkpoint context (i.e. when the push switches it out) and issuing
      the first journal IO. Rather than plumbing this through to the
      journal IO, we can start this cache flush the moment the CIL context
      is owned exclusively by the push worker. The cache flush can be in
      progress while we process the CIL ready for writing, hence
      reducing the latency of the initial iclog write. This is especially
      true for large checkpoints, where we might have to process hundreds
      of thousands of log vectors before we issue the first iclog write.
      In these cases, it is likely the cache flush has already been
      completed by the time we have built the CIL log vector chain.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
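      A sketch of the early flush (synchronous form; the next entry adds the
      async primitives that hide its latency behind CIL processing, and
      blkdev_issue_flush() is shown in its v5.13-era single-argument form):

          /*
           * The push worker owns ctx exclusively from this point, so the
           * flush covering all previously completed metadata writeback
           * can be issued now, well before the first iclog write of the
           * checkpoint.
           */
          blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);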
    • xfs: async blkdev cache flush · 0431d926
      Committed by Dave Chinner
      The new checkpoint cache flush mechanism requires us to issue an
      unconditional cache flush before we start a new checkpoint. We don't
      want to block for this if we can help it, and we have a fair chunk
      of CPU work to do between starting the checkpoint and issuing the
      first journal IO.
      
      Hence it makes sense to amortise the latency cost of the cache flush
      by issuing it asynchronously and then waiting for it only when we
      need to issue the first IO in the transaction.
      
      To do this, we need async cache flush primitives to submit the cache
      flush bio and to wait on it. The block layer has no such primitives
      for filesystems, so roll our own for the moment.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
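      A minimal sketch of such primitives, in the spirit of what the message
      describes (names are illustrative; bio_init()'s signature varies across
      kernel versions, the three-argument form here matching the v5.13 era):

          static void flush_bdev_async_endio(struct bio *bio)
          {
                  complete(bio->bi_private);
          }

          /* Submit the flush and return; pair with wait_for_completion(). */
          static void flush_bdev_async(struct bio *bio,
                                       struct block_device *bdev,
                                       struct completion *done)
          {
                  bio_init(bio, NULL, 0);
                  bio_set_dev(bio, bdev);
                  bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
                  bio->bi_private = done;
                  bio->bi_end_io = flush_bdev_async_endio;
                  submit_bio(bio);
          }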
    • xfs: remove xfs_blkdev_issue_flush · b5071ada
      Committed by Dave Chinner
      It's a one-line wrapper around blkdev_issue_flush(). Just replace it
      with direct calls to blkdev_issue_flush().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
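      For reference, a sketch of the change (the wrapper body is the single
      call the message describes; "targp" is an illustrative variable name):

          /* Removed wrapper: */
          void
          xfs_blkdev_issue_flush(struct xfs_buftarg *buftarg)
          {
                  blkdev_issue_flush(buftarg->bt_bdev);
          }

          /* Call sites now do this directly: */
          blkdev_issue_flush(targp->bt_bdev);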
2. 18 Jun 2021, 5 commits
3. 10 Jun 2021, 2 commits
4. 09 Jun 2021, 6 commits
5. 07 Jun 2021, 5 commits
6. 04 Jun 2021, 9 commits