1. 29 6月, 2019 3 次提交
  2. 30 7月, 2018 1 次提交
  3. 07 6月, 2018 1 次提交
    • D
      xfs: convert to SPDX license tags · 0b61f8a4
      Dave Chinner 提交于
      Remove the verbose license text from XFS files and replace them
      with SPDX tags. This does not change the license of any of the code,
      merely refers to the common, up-to-date license files in LICENSES/
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0b61f8a4
  4. 10 5月, 2018 1 次提交
  5. 15 3月, 2018 2 次提交
  6. 12 3月, 2018 1 次提交
  7. 29 1月, 2018 3 次提交
  8. 07 11月, 2017 1 次提交
    • C
      xfs: use a b+tree for the in-core extent list · 6bdcf26a
      Christoph Hellwig 提交于
      Replace the current linear list and the indirection array for the in-core
      extent list with a b+tree to avoid the need for larger memory allocations
      for the indirection array when lots of extents are present.  The current
      extent list implementations leads to heavy pressure on the memory
      allocator when modifying files with a high extent count, and can lead
      to high latencies because of that.
      
      The replacement is a b+tree with a few quirks.  The leaf nodes directly
      store the extent record in two u64 values.  The encoding is a little bit
      different from the existing in-core extent records so that the start
      offset and length which are required for lookups can be retreived with
      simple mask operations.  The inner nodes store a 64-bit key containing
      the start offset in the first half of the node, and the pointers to the
      next lower level in the second half.  In either case we walk the node
      from the beginninig to the end and do a linear search, as that is more
      efficient for the low number of cache lines touched during a search
      (2 for the inner nodes, 4 for the leaf nodes) than a binary search.
      We store termination markers (zero length for the leaf nodes, an
      otherwise impossible high bit for the inner nodes) to terminate the key
      list / records instead of storing a count to use the available cache
      lines as efficiently as possible.
      
      One quirk of the algorithm is that while we normally split a node half and
      half like usual btree implementations we just spill over entries added at
      the very end of the list to a new node on its own.  This means we get a
      100% fill grade for the common cases of bulk insertion when reading an
      inode into memory, and when only sequentially appending to a file.  The
      downside is a slightly higher chance of splits on the first random
      insertions.
      
      Both insert and removal manually recurse into the lower levels, but
      the bulk deletion of the whole tree is still implemented as a recursive
      function call, although one limited by the overall depth and with very
      little stack usage in every iteration.
      
      For the first few extents we dynamically grow the list from a single
      extent to the next powers of two until we have a first full leaf block
      and that building the actual tree.
      
      The code started out based on the generic lib/btree.c code from Joern
      Engel based on earlier work from Peter Zijlstra, but has since been
      rewritten beyond recognition.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      6bdcf26a
  9. 27 10月, 2017 2 次提交
  10. 12 10月, 2017 1 次提交
  11. 27 9月, 2017 1 次提交
  12. 23 8月, 2017 1 次提交
    • C
      xfs: Properly retry failed inode items in case of error during buffer writeback · d3a304b6
      Carlos Maiolino 提交于
      When a buffer has been failed during writeback, the inode items into it
      are kept flush locked, and are never resubmitted due the flush lock, so,
      if any buffer fails to be written, the items in AIL are never written to
      disk and never unlocked.
      
      This causes unmount operation to hang due these items flush locked in AIL,
      but this also causes the items in AIL to never be written back, even when
      the IO device comes back to normal.
      
      I've been testing this patch with a DM-thin device, creating a
      filesystem larger than the real device.
      
      When writing enough data to fill the DM-thin device, XFS receives ENOSPC
      errors from the device, and keep spinning on xfsaild (when 'retry
      forever' configuration is set).
      
      At this point, the filesystem can not be unmounted because of the flush locked
      items in AIL, but worse, the items in AIL are never retried at all
      (once xfs_inode_item_push() will skip the items that are flush locked),
      even if the underlying DM-thin device is expanded to the proper size.
      
      This patch fixes both cases, retrying any item that has been failed
      previously, using the infra-structure provided by the previous patch.
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      d3a304b6
  13. 05 6月, 2017 1 次提交
  14. 26 4月, 2017 1 次提交
  15. 08 11月, 2016 1 次提交
  16. 06 10月, 2016 1 次提交
    • D
      xfs: create a separate cow extent size hint for the allocator · f7ca3522
      Darrick J. Wong 提交于
      Create a per-inode extent size allocator hint for copy-on-write.  This
      hint is separate from the existing extent size hint so that CoW can
      take advantage of the fragmentation-reducing properties of extent size
      hints without disabling delalloc for regular writes.
      
      The extent size hint that's fed to the allocator during a copy on
      write operation is the greater of the cowextsize and regular extsize
      hint.
      
      During reflink, if we're sharing the entire source file to the entire
      destination file and the destination file doesn't already have a
      cowextsize hint, propagate the source file's cowextsize hint to the
      destination file.
      
      Furthermore, zero the bulkstat buffer prior to setting the fields
      so that we don't copy kernel memory contents into userspace.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      f7ca3522
  17. 22 7月, 2016 1 次提交
    • D
      xfs: allocate log vector buffers outside CIL context lock · b1c5ebb2
      Dave Chinner 提交于
      One of the problems we currently have with delayed logging is that
      under serious memory pressure we can deadlock memory reclaim. THis
      occurs when memory reclaim (such as run by kswapd) is reclaiming XFS
      inodes and issues a log force to unpin inodes that are dirty in the
      CIL.
      
      The CIL is pushed, but this will only occur once it gets the CIL
      context lock to ensure that all committing transactions are complete
      and no new transactions start being committed to the CIL while the
      push switches to a new context.
      
      The deadlock occurs when the CIL context lock is held by a
      committing process that is doing memory allocation for log vector
      buffers, and that allocation is then blocked on memory reclaim
      making progress. Memory reclaim, however, is blocked waiting for
      a log force to make progress, and so we effectively deadlock at this
      point.
      
      To solve this problem, we have to move the CIL log vector buffer
      allocation outside of the context lock so that memory reclaim can
      always make progress when it needs to force the log. The problem
      with doing this is that a CIL push can take place while we are
      determining if we need to allocate a new log vector buffer for
      an item and hence the current log vector may go away without
      warning. That means we canot rely on the existing log vector being
      present when we finally grab the context lock and so we must have a
      replacement buffer ready to go at all times.
      
      To ensure this, introduce a "shadow log vector" buffer that is
      always guaranteed to be present when we gain the CIL context lock
      and format the item. This shadow buffer may or may not be used
      during the formatting, but if the log item does not have an existing
      log vector buffer or that buffer is too small for the new
      modifications, we swap it for the new shadow buffer and format
      the modifications into that new log vector buffer.
      
      The result of this is that for any object we modify more than once
      in a given CIL checkpoint, we double the memory required
      to track dirty regions in the log. For single modifications then
      we consume the shadow log vectorwe allocate on commit, and that gets
      consumed by the checkpoint. However, if we make multiple
      modifications, then the second transaction commit will allocate a
      shadow log vector and hence we will end up with double the memory
      usage as only one of the log vectors is consumed by the CIL
      checkpoint. The remaining shadow vector will be freed when th elog
      item is freed.
      
      This can probably be optimised in future - access to the shadow log
      vector is serialised by the object lock (as opposited to the active
      log vector, which is controlled by the CIL context lock) and so we
      can probably free shadow log vector from some objects when the log
      item is marked clean on removal from the AIL.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      b1c5ebb2
  18. 06 4月, 2016 2 次提交
  19. 09 2月, 2016 8 次提交
  20. 03 11月, 2015 1 次提交
    • D
      xfs: optimise away log forces on timestamp updates for fdatasync · fc0561ce
      Dave Chinner 提交于
      xfs: timestamp updates cause excessive fdatasync log traffic
      
      Sage Weil reported that a ceph test workload was writing to the
      log on every fdatasync during an overwrite workload. Event tracing
      showed that the only metadata modification being made was the
      timestamp updates during the write(2) syscall, but fdatasync(2)
      is supposed to ignore them. The key observation was that the
      transactions in the log all looked like this:
      
      INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32
      
      And contained a flags field of 0x45 or 0x85, and had data and
      attribute forks following the inode core. This means that the
      timestamp updates were triggering dirty relogging of previously
      logged parts of the inode that hadn't yet been flushed back to
      disk.
      
      There are two parts to this problem. The first is that XFS relogs
      dirty regions in subsequent transactions, so it carries around the
      fields that have been dirtied since the last time the inode was
      written back to disk, not since the last time the inode was forced
      into the log.
      
      The second part is that on v5 filesystems, the inode change count
      update during inode dirtying also sets the XFS_ILOG_CORE flag, so
      on v5 filesystems this makes a timestamp update dirty the entire
      inode.
      
      As a result when fdatasync is run, it looks at the dirty fields in
      the inode, and sees more than just the timestamp flag, even though
      the only metadata change since the last fdatasync was just the
      timestamps. Hence we force the log on every subsequent fdatasync
      even though it is not needed.
      
      To fix this, add a new field to the inode log item that tracks
      changes since the last time fsync/fdatasync forced the log to flush
      the changes to the journal. This flag is updated when we dirty the
      inode, but we do it before updating the change count so it does not
      carry the "core dirty" flag from timestamp updates. The fields are
      zeroed when the inode is marked clean (due to writeback/freeing) or
      when an fsync/datasync forces the log. Hence if we only dirty the
      timestamps on the inode between fsync/fdatasync calls, the fdatasync
      will not trigger another log force.
      
      Over 100 runs of the test program:
      
      Ext4 baseline:
      	runtime: 1.63s +/- 0.24s
      	avg lat: 1.59ms +/- 0.24ms
      	iops: ~2000
      
      XFS, vanilla kernel:
              runtime: 2.45s +/- 0.18s
      	avg lat: 2.39ms +/- 0.18ms
      	log forces: ~400/s
      	iops: ~1000
      
      XFS, patched kernel:
              runtime: 1.49s +/- 0.26s
      	avg lat: 1.46ms +/- 0.25ms
      	log forces: ~30/s
      	iops: ~1500
      Reported-by: NSage Weil <sage@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      fc0561ce
  21. 19 8月, 2015 1 次提交
  22. 28 11月, 2014 3 次提交
  23. 03 10月, 2014 1 次提交
    • M
      xfs: xfs_iflush_done checks the wrong log item callback · 52177937
      Mark Tinguely 提交于
      Commit 30136832 ("xfs: remove all the inodes on a buffer from the AIL
      in bulk") made the xfs inode flush callback more efficient by
      combining all the inode writes on the buffer and the deletions of
      the inode log item from AIL.
      
      The initial loop in this patch should be looping through all
      the log items on the buffer to see which items have
      xfs_iflush_done as their callback function. But currently,
      only the log item passed to the function has its callback
      compared to xfs_iflush_done. If the log item pointer passed to
      the function does have the xfs_iflush_done callback function,
      then all the log items on the buffer are removed from the
      li_bio_list on the buffer b_fspriv and could be removed from
      the AIL even though they may have not been written yet.
      
      This problem is masked by the fact that currently all inodes on a
      buffer will have the same calback function - either xfs_iflush_done
      or xfs_istale_done - and hence the bug cannot manifest in any way.
      Still, we need to remove the landmine so that if we add new
      callbacks in future this doesn't cause us problems.
      Signed-off-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      52177937
  24. 25 6月, 2014 1 次提交
    • D
      xfs: global error sign conversion · 2451337d
      Dave Chinner 提交于
      Convert all the errors the core XFs code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2451337d