1. 22 June 2021 (4 commits)
    • xfs: add iclog state trace events · 956f6daa
      By Dave Chinner
      For the DEBUGS!
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      By Dave Chinner
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This led me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
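      
      As a sketch of the resulting API shape (a hedged illustration; the
      type and function names used here are assumptions, not necessarily
      the final identifiers):
      
      	/* a CIL sequence number is not an on-disk LSN */
      	typedef uint64_t	xfs_csn_t;
      
      	int
      	xfs_log_force_seq(
      		struct xfs_mount	*mp,
      		xfs_csn_t		seq,	/* from ->iop_committing */
      		uint			flags,
      		int			*log_flushed)
      	{
      		/* convert the CIL sequence to the commit LSN of that
      		 * checkpoint context... */
      		xfs_lsn_t	lsn = xlog_cil_force_seq(mp->m_log, seq);
      
      		/* ...and only then flush iclogs to that on-disk LSN */
      		return xfs_log_force_lsn(mp, lsn, flags, log_flushed);
      	}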
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: journal IO cache flush reductions · eef983ff
      By Dave Chinner
      Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
      guarantee the ordering requirements the journal has w.r.t. metadata
      writeback. The two ordering constraints are:
      
      1. we cannot overwrite metadata in the journal until we guarantee
      that the dirty metadata has been written back in place and is
      stable.
      
      2. we cannot write back dirty metadata until it has been written to
      the journal and guaranteed to be stable (and hence recoverable) in
      the journal.
      
      The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
      causes the journal IO to issue a cache flush and wait for it to
      complete before issuing the write IO to the journal. Hence all
      completed metadata IO is guaranteed to be stable before the journal
      overwrites the old metadata.
      
      The ordering guarantees of #2 are provided by the REQ_FUA, which
      ensures the journal writes do not complete until they are on stable
      storage. Hence by the time the last journal IO in a checkpoint
      completes, we know that the entire checkpoint is on stable storage
      and we can unpin the dirty metadata and allow it to be written back.
      
      This is the mechanism by which ordering was first implemented in XFS
      way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
      ("Add support for drive write cache flushing") in the xfs-archive
      tree.
      
      A lot has changed since then, most notably we now use delayed
      logging to checkpoint the filesystem to the journal rather than
      write each individual transaction to the journal. Cache flushes on
      journal IO are necessary when individual transactions are wholly
      contained within a single iclog. However, CIL checkpoints are single
      transactions that typically span hundreds to thousands of individual
      journal writes, and so the requirements for device cache flushing
      have changed.
      
      That is, the ordering rules I state above apply to ordering of
      atomic transactions recorded in the journal, not to the journal IO
      itself. Hence we need to ensure metadata is stable before we start
      writing a new transaction to the journal (guarantee #1), and we need
      to ensure the entire transaction is stable in the journal before we
      start metadata writeback (guarantee #2).
      
      Hence we only need a REQ_PREFLUSH on the journal IO that starts a
      new journal transaction to provide #1, and it is not needed on any
      other journal IO done within the context of that journal transaction.
      
      The CIL checkpoint already issues a cache flush before it starts
      writing to the log, so we no longer need the iclog IO to issue a
      REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
      to xlog_write(), we no longer need to mark the first iclog in
      the log write with REQ_PREFLUSH for this case. As an added bonus,
      this ordering mechanism works for both internal and external logs,
      meaning we can remove the explicit data device cache flushes from
      the iclog write code when using external logs.
      
      Given the new ordering semantics of commit records for the CIL, we
      need iclogs containing commit records to issue a REQ_PREFLUSH. We
      also require unmount records to do this. Hence for both
      XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
      to mark the first iclog being written with REQ_PREFLUSH.
      
      For both commit records and unmount records, we also want them
      immediately on stable storage, so we also mark the iclogs that
      contain these records with REQ_FUA. That means if a record is
      split across multiple iclogs, they are all marked REQ_FUA and not
      just the last one, so that when the transaction is completed all
      the parts of the record are on stable storage.
      
      And for external logs, unmount records need a pre-write data device
      cache flush similar to the CIL checkpoint cache pre-flush as the
      internal iclog write code does not do this implicitly anymore.
      
      As an optimisation, when the commit record lands in the same iclog
      as the journal transaction starts, we don't need to wait for
      anything and can simply use REQ_FUA to provide guarantee #2.  This
      means that for fsync() heavy workloads, the cache flush behaviour is
      completely unchanged and there is no degradation in performance as a
      result of optimising the multi-IO transaction case.
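      
      A minimal sketch of the resulting per-iclog flag selection at
      submission time (hedged: XLOG_ICL_NEED_FLUSH/XLOG_ICL_NEED_FUA and
      the function shape are illustrative assumptions, not the exact
      implementation):
      
      	static void
      	xlog_write_iclog(
      		struct xlog		*log,
      		struct xlog_in_core	*iclog)
      	{
      		int	op_flags = REQ_OP_WRITE | REQ_SYNC;
      
      		/* flush/FUA only where the checkpoint asked for it,
      		 * not unconditionally on every journal IO */
      		if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
      			op_flags |= REQ_PREFLUSH;	/* guarantee #1 */
      		if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
      			op_flags |= REQ_FUA;		/* guarantee #2 */
      		iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
      
      		/* ... build the bio and submit it with op_flags ... */
      	}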
      
      The most notable sign that there is less IO latency on my test
      machine (nvme SSDs) is that the "noiclogs" rate has dropped
      substantially. This metric indicates that the CIL push is blocking
      in xlog_get_iclog_space() waiting for iclog IO completion to occur.
      With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
      every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
      is blocking waiting for log IO. With the changes in this patch, this
      drops to 1 noiclog event for every 100 iclog writes. Hence it is
      clear that log IO is completing much faster than it was previously,
      but it is also clear that for large iclog sizes, this isn't the
      performance limiting factor on this hardware.
      
      With smaller iclogs (32kB), however, there is a substantial
      difference. With the cache flush modifications, the journal is now
      running at over 4000 write IOPS, and the journal throughput is
      largely identical to the 256kB iclogs and the noiclog event rate
      stays low at about 1:50 iclog writes. The existing code tops out at
      about 2500 IOPS as the number of cache flushes dominate performance
      and latency. The noiclog event rate is about 1:4, and the
      performance variance is quite large as the journal throughput can
      fall to less than half the peak sustained rate when the cache flush
      rate prevents metadata writeback from keeping up and the log runs
      out of space and throttles reservations.
      
      As a result:
      
      	logbsize	fsmark create rate	rm -rf
      before	32kB		152851+/-5.3e+04	5m28s
      patched	32kB		221533+/-1.1e+04	5m24s
      
      before	256kB		220239+/-6.2e+03	4m58s
      patched	256kB		228286+/-9.2e+03	5m06s
      
      The rm -rf times are included because I ran them, but the
      differences are largely noise. This workload is largely metadata
      read IO latency bound and the changes to the journal cache flushing
      don't really make any noticeable difference to behaviour apart from
      a reduction in noiclog events from background CIL pushing.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: remove need_start_rec parameter from xlog_write() · 3468bb1c
      By Dave Chinner
      The CIL push is the only call to xlog_write that sets this variable
      to true. The other callers don't need a start rec, and they tell
      xlog_write what to do by passing the type of ophdr they need written
      in the flags field. The need_start_rec parameter essentially tells
      xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
      so get rid of the variable to do this and pass XLOG_START_TRANS as
      the flag value into xlog_write() from the CIL push.
      
      $ size fs/xfs/xfs_log.o*
        text	   data	    bss	    dec	    hex	filename
       27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
       27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
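      
      In other words, the call-site change looks roughly like this
      (argument lists abbreviated; a sketch, not the exact signature):
      
      	/* before: a boolean asks xlog_write() for a start record */
      	error = xlog_write(log, lv_chain, tic, &lsn, NULL, 0, true);
      
      	/* after: the CIL push passes the ophdr type in the flags */
      	error = xlog_write(log, lv_chain, tic, &lsn, NULL, XLOG_START_TRANS);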
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  2. 18 June 2021 (2 commits)
  3. 29 July 2020 (1 commit)
  4. 23 June 2020 (1 commit)
    • xfs: fix use-after-free on CIL context on shutdown · c7f87f39
      By Dave Chinner
      xlog_wait() on the CIL context can reference a freed context if the
      waiter doesn't get scheduled before the CIL context is freed. This
      can happen when a task is on the hard throttle and the CIL push
      aborts due to a shutdown. This was detected by generic/019:
      
      thread 1			thread 2
      
      __xfs_trans_commit
       xfs_log_commit_cil
        <CIL size over hard throttle limit>
        xlog_wait
         schedule
      				xlog_cil_push_work
      				wake_up_all
      				<shutdown aborts commit>
      				xlog_cil_committed
      				kmem_free
      
         remove_wait_queue
          spin_lock_irqsave --> UAF
      
      Fix it by moving the wait queue to the CIL rather than keeping it
      in the CIL context that gets freed on push completion. Because the
      wait queue is now independent of the CIL context and we might have
      multiple contexts in flight at once, only wake the waiters on the
      push throttle when the context we are pushing is over the hard
      throttle size threshold.
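      
      Sketched out, the structural change looks something like this (the
      field and macro placement are assumptions for illustration):
      
      	struct xfs_cil {
      		/* ... */
      		/* hard throttle waiters now live in the long-lived
      		 * CIL, not in the per-checkpoint context that gets
      		 * freed on push completion */
      		wait_queue_head_t	xc_push_wait;
      	};
      
      	/* on push completion, only wake throttled committers if the
      	 * context being pushed was over the blocking threshold */
      	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
      		wake_up_all(&cil->xc_push_wait);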
      
      Fixes: 0e7ab7ef ("xfs: Throttle commits on delayed background CIL push")
      Reported-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  5. 27 March 2020 (7 commits)
  6. 14 March 2020 (2 commits)
  7. 14 November 2019 (1 commit)
  8. 11 November 2019 (1 commit)
  9. 22 October 2019 (4 commits)
  10. 03 July 2019 (1 commit)
  11. 29 June 2019 (5 commits)
  12. 07 June 2018 (1 commit)
    • xfs: convert to SPDX license tags · 0b61f8a4
      By Dave Chinner
      Remove the verbose license text from XFS files and replace them
      with SPDX tags. This does not change the license of any of the code,
      merely refers to the common, up-to-date license files in LICENSES/
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  13. 25 October 2017 (1 commit)
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE() · 6aa7de05
      By Mark Rutland
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
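      
      For a concrete illustration of what the script produces (p->state
      is a hypothetical shared field, used only as an example):
      
      	/* before: the direction of the access is implicit */
      	old = ACCESS_ONCE(p->state);
      	ACCESS_ONCE(p->state) = new;
      
      	/* after: reads and writes are explicitly distinguished */
      	old = READ_ONCE(p->state);
      	WRITE_ONCE(p->state, new);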
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 20 June 2017 (1 commit)
    • xfs: remove double-underscore integer types · c8ce540d
      By Darrick J. Wong
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
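      
      The net effect on a declaration, for illustration (xfs_agnumber_t
      is used here purely as an example):
      
      	/* before */
      	typedef	__uint32_t	xfs_agnumber_t;
      
      	/* after */
      	typedef	uint32_t	xfs_agnumber_t;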
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  15. 19 June 2017 (1 commit)
    • xfs: dump transaction usage details on log reservation overrun · d4ca1d55
      By Brian Foster
      If a transaction log reservation overrun occurs, the ticket data
      associated with the reservation is dumped in xfs_log_commit_cil().
      This occurs long after the transaction items and details have been
      removed from the transaction and effectively lost. This limited set
      of ticket data provides very little information to support debugging
      transaction overruns based on the typical report.
      
      To improve transaction log reservation overrun reporting, create a
      helper to dump transaction details such as log items, log vector
      data, etc., as well as the underlying ticket data for the
      transaction. Move the overrun detection from xfs_log_commit_cil() to
      xlog_cil_insert_items() so it occurs prior to migration of the
      logged items to the CIL. Call the new helper such that it is able to
      dump this transaction data before it is lost.
      
      Also, warn on overrun to provide callstack context for the offending
      transaction and include a few additional messages from
      xlog_cil_insert_items() to display the reservation consumed locally
      for overhead such as log vector headers, split region headers and
      the context ticket. This provides a complete general breakdown of
      the reservation consumption of a transaction when/if it happens to
      overrun the reservation.
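      
      A minimal sketch of where the check now sits (hedged: the helper
      name and the exact condition are assumptions for illustration):
      
      	/* in xlog_cil_insert_items(), while the transaction still
      	 * owns its log items and vectors */
      	if (WARN_ON(tp->t_ticket->t_curr_res < 0))
      		xlog_print_trans(tp);	/* assumed helper: dumps the
      					 * items, log vectors and the
      					 * underlying ticket */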
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  16. 10 February 2017 (1 commit)
  17. 26 September 2016 (1 commit)
    • xfs: rework log recovery to submit buffers on LSN boundaries · 12818d24
      By Brian Foster
      The fix to log recovery to update the metadata LSN in recovered buffers
      introduces the requirement that a buffer is submitted only once per
      current LSN. Log recovery currently submits buffers on transaction
      boundaries. This is not sufficient as the abstraction between log
      records and transactions allows for various scenarios where multiple
      transactions can share the same current LSN. If independent transactions
      share an LSN and both modify the same buffer, log recovery can
      incorrectly skip updates and leave the filesystem in an inconsistent
      state.
      
      In preparation for proper metadata LSN updates during log recovery,
      update log recovery to submit buffers for write on LSN change boundaries
      rather than transaction boundaries. Explicitly track the current LSN in
      a new struct xlog field to handle the various corner cases of when the
      current LSN may or may not change.
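      
      Conceptually (the field and list names are illustrative, not
      confirmed), the recovery loop becomes:
      
      	/* flush queued buffers whenever the current LSN changes,
      	 * rather than at transaction boundaries */
      	if (log->l_recovery_lsn != current_lsn) {
      		/* everything modified at the old LSN must reach the
      		 * disk before items tagged with the new LSN replay */
      		error = xfs_buf_delwri_submit(&buffer_list);
      		log->l_recovery_lsn = current_lsn;
      	}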
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  18. 06 April 2016 (1 commit)
  19. 05 January 2016 (1 commit)
    • xfs: debug mode log record crc error injection · 609adfc2
      By Brian Foster
      XFS now uses CRC verification over a limited section of the log to
      detect torn writes prior to a crash. This is difficult to test directly
      due to the timing and hardware requirements to cause a short write.
      
      Add a mechanism to inject CRC errors into log records to facilitate
      testing torn write detection during log recovery. This mechanism is
      dangerous and can result in filesystem corruption. Thus, it is only
      available in DEBUG mode for testing/development purposes. Set a non-zero
      value to the following sysfs entry to enable error injection:
      
      	/sys/fs/xfs/<dev>/log/log_badcrc_factor
      
      Once enabled, XFS intentionally writes an invalid CRC to a log record at
      some random point in the future based on the provided frequency. The
      filesystem immediately shuts down once the record has been written to
      the physical log to prevent metadata writeback (e.g., AIL insertion)
      once the log write completes. This helps reasonably simulate a torn
      write to the log as the affected record must be safe to discard. The
      next mount after the intentional shutdown requires log recovery and
      should detect and recover from the torn write.
      
      Note again that this _will_ result in data loss or worse. For testing
      and development purposes only!
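      
      A sketch of the DEBUG-only injection point in the log write path
      (hedged: the exact corruption pattern and field names are
      assumptions):
      
      	#ifdef DEBUG
      		/* roughly 1 in l_badcrc_factor records gets a bad CRC */
      		if (log->l_badcrc_factor &&
      		    (prandom_u32() % log->l_badcrc_factor == 0)) {
      			iclog->ic_header.h_crc &= cpu_to_le32(0xAAAAAAAA);
      			xfs_warn(log->l_mp,
      	"Intentionally corrupted log record CRC. Shutdown imminent.");
      		}
      	#endif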
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  20. 12 October 2015 (1 commit)
    • xfs: validate metadata LSNs against log on v5 superblocks · a45086e2
      By Brian Foster
      Since the onset of v5 superblocks, the LSN of the last modification has
      been included in a variety of on-disk data structures. This LSN is used
      to provide log recovery ordering guarantees (e.g., to ensure an older
      log recovery item is not replayed over a newer target data structure).
      
      While this works correctly from the point a filesystem is formatted and
      mounted, userspace tools have some problematic behaviors that defeat
      this mechanism. For example, xfs_repair historically zeroes out the log
      unconditionally (regardless of whether corruption is detected). If this
      occurs, the LSN of the filesystem is reset and the log is now in a
      problematic state with respect to on-disk metadata structures that might
      have a larger LSN. Until either the log catches up to the highest
      previously used metadata LSN or each affected data structure is modified
      and written out without incident (which resets the metadata LSN), log
      recovery is susceptible to filesystem corruption.
      
      This problem is ultimately addressed and repaired in the associated
      userspace tools. The kernel is still responsible to detect the problem
      and notify the user that something is wrong. Check the superblock LSN at
      mount time and fail the mount if it is invalid. From that point on,
      trigger verifier failure on any metadata I/O where an invalid LSN is
      detected. This results in a filesystem shutdown and guarantees that we
      do not log metadata changes with invalid LSNs on disk. Since this is a
      known issue with a known recovery path, present a warning to instruct
      the user how to recover.
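      
      The check itself reduces to comparing the metadata LSN against the
      current head of the log, along these lines (a sketch; the function
      name is an assumption):
      
      	static bool
      	xlog_metadata_lsn_valid(
      		struct xlog	*log,
      		xfs_lsn_t	lsn)
      	{
      		/* an LSN ahead of the current log head means the log
      		 * was rewritten, e.g. zeroed by an old xfs_repair */
      		if (CYCLE_LSN(lsn) > log->l_curr_cycle)
      			return false;
      		if (CYCLE_LSN(lsn) == log->l_curr_cycle &&
      		    BLOCK_LSN(lsn) > log->l_curr_block)
      			return false;
      		return true;
      	}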
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  21. 19 August 2015 (1 commit)
    • xfs: don't leave EFIs on AIL on mount failure · f0b2efad
      By Brian Foster
      Log recovery occurs in two phases at mount time. In the first phase,
      EFIs and EFDs are processed and potentially cancelled out. EFIs without
      EFD objects are inserted into the AIL for processing and recovery in the
      second phase. xfs_mountfs() runs various other operations between the
      phases and is thus subject to failure. If failure occurs after the first
      phase but before the second, pending EFIs sit on the AIL, pin it and
      cause the mount to hang.
      
      Update the mount sequence to ensure that pending EFIs are cancelled in
      the event of failure. Add a recovery cancellation mechanism to iterate
      the AIL and cancel all EFI items when requested. Plumb cancellation
      support through the log mount finish helper and update xfs_mountfs() to
      invoke cancellation in the event of failure after recovery has started.
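      
      The cancellation pass is conceptually a walk of the AIL that
      releases every EFI it finds, roughly as follows (a sketch; the
      function and helper names are assumptions, and AIL locking is
      elided for brevity):
      
      	static void
      	xlog_recover_cancel_efis(
      		struct xlog		*log)
      	{
      		struct xfs_ail		*ailp = log->l_ailp;
      		struct xfs_ail_cursor	cur;
      		struct xfs_log_item	*lip;
      
      		for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
      		     lip != NULL;
      		     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
      			if (lip->li_type != XFS_LI_EFI)
      				continue;
      			/* assumed helper: drops the reference recovery
      			 * took, so the AIL can drain and the mount can
      			 * unwind cleanly */
      			xfs_efi_cancel(lip);
      		}
      		xfs_trans_ail_cursor_done(&cur);
      	}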
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  22. 22 June 2015 (1 commit)