1. 15 4月, 2019 1 次提交
  2. 12 2月, 2019 1 次提交
    • D
      xfs: cache unlinked pointers in an rhashtable · 9b247179
      Darrick J. Wong 提交于
      Use a rhashtable to cache the unlinked list incore.  This should speed
      up unlinked processing considerably when there are a lot of inodes on
      the unlinked list because iunlink_remove no longer has to traverse an
      entire bucket list to find which inode points to the one being removed.
      
      The incore list structure records "X.next_unlinked = Y" relations, with
      the rhashtable using Y to index the records.  This makes finding the
      inode X that points to a inode Y very quick.  If our cache fails to find
      anything we can always fall back on the old method.
      
      FWIW this drastically reduces the amount of time it takes to remove
      inodes from the unlinked list.  I wrote a program to open a lot of
      O_TMPFILE files and then close them in the same order, which takes
      a very long time if we have to traverse the unlinked lists.  With the
      ptach, I see:
      
      + /d/t/tmpfile/tmpfile
      Opened 193531 files in 6.33s.
      Closed 193531 files in 5.86s
      
      real    0m12.192s
      user    0m0.064s
      sys     0m11.619s
      + cd /
      + umount /mnt
      
      real    0m0.050s
      user    0m0.004s
      sys     0m0.030s
      
      And without the patch:
      
      + /d/t/tmpfile/tmpfile
      Opened 193588 files in 6.35s.
      Closed 193588 files in 751.61s
      
      real    12m38.853s
      user    0m0.084s
      sys     12m34.470s
      + cd /
      + umount /mnt
      
      real    0m0.086s
      user    0m0.000s
      sys     0m0.060s
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      9b247179
  3. 13 12月, 2018 2 次提交
  4. 13 8月, 2018 1 次提交
  5. 01 8月, 2018 1 次提交
  6. 27 7月, 2018 1 次提交
  7. 24 7月, 2018 2 次提交
  8. 07 6月, 2018 1 次提交
    • D
      xfs: convert to SPDX license tags · 0b61f8a4
      Dave Chinner 提交于
      Remove the verbose license text from XFS files and replace them
      with SPDX tags. This does not change the license of any of the code,
      merely refers to the common, up-to-date license files in LICENSES/
      
      This change was mostly scripted. fs/xfs/Makefile and
      fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
      and modified by the following command:
      
      for f in `git grep -l "GNU General" fs/xfs/` ; do
      	echo $f
      	cat $f | awk -f hdr.awk > $f.new
      	mv -f $f.new $f
      done
      
      And the hdr.awk script that did the modification (including
      detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
      is as follows:
      
      $ cat hdr.awk
      BEGIN {
      	hdr = 1.0
      	tag = "GPL-2.0"
      	str = ""
      }
      
      /^ \* This program is free software/ {
      	hdr = 2.0;
      	next
      }
      
      /any later version./ {
      	tag = "GPL-2.0+"
      	next
      }
      
      /^ \*\// {
      	if (hdr > 0.0) {
      		print "// SPDX-License-Identifier: " tag
      		print str
      		print $0
      		str=""
      		hdr = 0.0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \* / {
      	if (hdr > 1.0)
      		next
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      	next
      }
      
      /^ \*/ {
      	if (hdr > 0.0)
      		next
      	print $0
      	next
      }
      
      // {
      	if (hdr > 0.0) {
      		if (str != "")
      			str = str "\n"
      		str = str $0
      		next
      	}
      	print $0
      }
      
      END { }
      $
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      0b61f8a4
  9. 06 6月, 2018 1 次提交
  10. 16 5月, 2018 1 次提交
  11. 26 3月, 2018 1 次提交
  12. 12 3月, 2018 1 次提交
  13. 13 1月, 2018 1 次提交
  14. 10 11月, 2017 1 次提交
  15. 12 10月, 2017 1 次提交
  16. 18 8月, 2017 2 次提交
    • D
      xfs: don't leak quotacheck dquots when cow recovery · 77aff8c7
      Darrick J. Wong 提交于
      If we fail a mount on account of cow recovery errors, it's possible that
      a previous quotacheck left some dquots in memory.  The bailout clause of
      xfs_mountfs forgets to purge these, and so we leak them.  Fix that.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      77aff8c7
    • D
      xfs: clear MS_ACTIVE after finishing log recovery · 8204f8dd
      Darrick J. Wong 提交于
      Way back when we established inode block-map redo log items, it was
      discovered that we needed to prevent the VFS from evicting inodes during
      log recovery because any given inode might be have bmap redo items to
      replay even if the inode has no link count and is ultimately deleted,
      and any eviction of an unlinked inode causes the inode to be truncated
      and freed too early.
      
      To make this possible, we set MS_ACTIVE so that inodes would not be torn
      down immediately upon release.  Unfortunately, this also results in the
      quota inodes not being released at all if a later part of the mount
      process should fail, because we never reclaim the inodes.  So, set
      MS_ACTIVE right before we do the last part of log recovery and clear it
      immediately after we finish the log recovery so that everything
      will be torn down properly if we abort the mount.
      
      Fixes: 17c12bcd ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      8204f8dd
  17. 28 6月, 2017 1 次提交
  18. 21 6月, 2017 1 次提交
    • N
      percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch · 104b4e51
      Nikolay Borisov 提交于
      Currently, percpu_counter_add is a wrapper around __percpu_counter_add
      which is preempt safe due to explicit calls to preempt_disable.  Given
      how __ prefix is used in percpu related interfaces, the naming
      unfortunately creates the false sense that __percpu_counter_add is
      less safe than percpu_counter_add.  In terms of context-safety,
      they're equivalent.  The only difference is that the __ version takes
      a batch parameter.
      
      Make this a bit more explicit by just renaming __percpu_counter_add to
      percpu_counter_add_batch.
      
      This patch doesn't cause any functional changes.
      
      tj: Minor updates to patch description for clarity.  Cosmetic
          indentation updates.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: linux-mm@kvack.org
      Cc: "David S. Miller" <davem@davemloft.net>
      104b4e51
  19. 20 6月, 2017 1 次提交
    • D
      xfs: remove double-underscore integer types · c8ce540d
      Darrick J. Wong 提交于
      This is a purely mechanical patch that removes the private
      __{u,}int{8,16,32,64}_t typedefs in favor of using the system
      {u,}int{8,16,32,64}_t typedefs.  This is the sed script used to perform
      the transformation and fix the resulting whitespace and indentation
      errors:
      
      s/typedef\t__uint8_t/typedef __uint8_t\t/g
      s/typedef\t__uint/typedef __uint/g
      s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g
      s/__uint8_t\t/__uint8_t\t\t/g
      s/__uint/uint/g
      s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g
      s/__int/int/g
      /^typedef.*int[0-9]*_t;$/d
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      c8ce540d
  20. 05 6月, 2017 3 次提交
  21. 28 4月, 2017 1 次提交
  22. 08 3月, 2017 1 次提交
    • C
      xfs: Use xfs_icluster_size_fsb() to calculate inode alignment mask · d5825712
      Chandan Rajendra 提交于
      When block size is larger than inode cluster size, the call to
      XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs
      would have set xfs_sb->sb_inoalignmt to 0. Hence in
      xfs_set_inoalignment(), xfs_mount->m_inoalign_mask gets initialized to
      -1 instead of 0. However, xfs_mount->m_sinoalign would get correctly
      intialized to 0 because for every positive value of xfs_mount->m_dalign,
      the condition "!(mp->m_dalign & mp->m_inoalign_mask)" would evaluate to
      false.
      
      Also, xfs_imap() worked fine even with xfs_mount->m_inoalign_mask having
      -1 as the value because blks_per_cluster variable would have the value 1
      and hence we would never have a need to use xfs_mount->m_inoalign_mask
      to compute the inode chunk's agbno and offset within the chunk.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      d5825712
  23. 10 2月, 2017 3 次提交
  24. 07 12月, 2016 1 次提交
    • L
      xfs: use rhashtable to track buffer cache · 6031e73a
      Lucas Stach 提交于
      On filesystems with a lot of metadata and in metadata intensive workloads
      xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
      the CPU time is spent on CPU cache misses while traversing the rbtree.
      
      As the buffer cache does not need any kind of ordering, but fast lookups
      a hashtable is the natural data structure to use. The rhashtable
      infrastructure provides a self-scaling hashtable implementation and
      allows lookups to proceed while the table is going through a resize
      operation.
      
      This reduces the CPU-time spent for the lookups to 1/3 even for small
      filesystems with a relatively small number of cached buffers, with
      possibly much larger gains on higher loaded filesystems.
      
      [dchinner: reduce minimum hash size to an acceptable size for large
      	   filesystems with many AGs with no active use.]
      [dchinner: remove stale rbtree asserts.]
      [dchinner: use xfs_buf_map for compare function argument.]
      [dchinner: make functions static.]
      [dchinner: remove redundant comments.]
      Signed-off-by: NLucas Stach <dev@lynxeye.de>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      6031e73a
  25. 20 10月, 2016 1 次提交
    • D
      xfs: unset MS_ACTIVE if mount fails · d0992452
      Darrick J. Wong 提交于
      As part of the inode block map intent log item recovery process, we
      had to set the IRECOVERY flag to prevent an unlinked inode from
      being truncated during the first iput call.  This required us to set
      MS_ACTIVE so that iput puts the inode on the lru instead of
      immediately evicting the inode.
      
      Unfortunately, if the mount fails later on, the inodes that have
      been loaded (root dir and realtime) actually need to be evicted
      since we're aborting the mount.  If we don't clear MS_ACTIVE in the
      failure step, those inodes are not evicted and therefore leak.   The
      leak was found by running xfs/130 and rmmoding xfs immediately after
      the test.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      d0992452
  26. 06 10月, 2016 3 次提交
    • D
      xfs: garbage collect old cowextsz reservations · 83104d44
      Darrick J. Wong 提交于
      Trim CoW reservations made on behalf of a cowextsz hint if they get too
      old or we run low on quota, so long as we don't have dirty data awaiting
      writeback or directio operations in progress.
      
      Garbage collection of the cowextsize extents are kept separate from
      prealloc extent reaping because setting the CoW prealloc lifetime to a
      (much) higher value than the regular prealloc extent lifetime has been
      useful for combatting CoW fragmentation on VM hosts where the VMs
      experience bursty write behaviors and we can keep the utilization ratios
      low enough that we don't start to run out of space.  IOWs, it benefits
      us to keep the CoW fork reservations around for as long as we can unless
      we run out of blocks or hit inode reclaim.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      83104d44
    • D
      xfs: preallocate blocks for worst-case btree expansion · 84d69619
      Darrick J. Wong 提交于
      To gracefully handle the situation where a CoW operation turns a
      single refcount extent into a lot of tiny ones and then run out of
      space when a tree split has to happen, use the per-AG reserved block
      pool to pre-allocate all the space we'll ever need for a maximal
      btree.  For a 4K block size, this only costs an overhead of 0.3% of
      available disk space.
      
      When reflink is enabled, we have an unfortunate problem with rmap --
      since we can share a block billions of times, this means that the
      reverse mapping btree can expand basically infinitely.  When an AG is
      so full that there are no free blocks with which to expand the rmapbt,
      the filesystem will shut down hard.
      
      This is rather annoying to the user, so use the AG reservation code to
      reserve a "reasonable" amount of space for rmap.  We'll prevent
      reflinks and CoW operations if we think we're getting close to
      exhausting an AG's free space rather than shutting down, but this
      permanent reservation should be enough for "most" users.  Hopefully.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      [hch@lst.de: ensure that we invalidate the freed btree buffer]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      84d69619
    • D
      xfs: store in-progress CoW allocations in the refcount btree · 174edb0e
      Darrick J. Wong 提交于
      Due to the way the CoW algorithm in XFS works, there's an interval
      during which blocks allocated to handle a CoW can be lost -- if the FS
      goes down after the blocks are allocated but before the block
      remapping takes place.  This is exacerbated by the cowextsz hint --
      allocated reservations can sit around for a while, waiting to get
      used.
      
      Since the refcount btree doesn't normally store records with refcount
      of 1, we can use it to record these in-progress extents.  In-progress
      blocks cannot be shared because they're not user-visible, so there
      shouldn't be any conflicts with other programs.  This is a better
      solution than holding EFIs during writeback because (a) EFIs can't be
      relogged currently, (b) even if they could, EFIs are bound by
      available log space, which puts an unnecessary upper bound on how much
      CoW we can have in flight, and (c) we already have a mechanism to
      track blocks.
      
      At mount time, read the refcount records and free anything we find
      with a refcount of 1 because those were in-progress when the FS went
      down.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      174edb0e
  27. 05 10月, 2016 1 次提交
    • D
      xfs: when replaying bmap operations, don't let unlinked inodes get reaped · 17c12bcd
      Darrick J. Wong 提交于
      Log recovery will iget an inode to replay BUI items and iput the inode
      when it's done.  Unfortunately, if the inode was unlinked, the iput
      will see that i_nlink == 0 and decide to truncate & free the inode,
      which prevents us from replaying subsequent BUIs.  We can't skip the
      BUIs because we have to replay all the redo items to ensure that
      atomic operations complete.
      
      Since unlinked inode recovery will reap the inode anyway, we can
      safely introduce a new inode flag to indicate that an inode is in this
      'unlinked recovery' state and should not be auto-reaped in the
      drop_inode path.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      17c12bcd
  28. 04 10月, 2016 1 次提交
  29. 26 9月, 2016 1 次提交
    • D
      xfs: quiesce the filesystem after recovery on readonly mount · ddeb14f4
      Dave Chinner 提交于
      Recently we've had a number of reports where log recovery on a v5
      filesystem has reported corruptions that looked to be caused by
      recovery being re-run over the top of an already-recovered
      metadata. This has uncovered a bug in recovery (fixed elsewhere)
      but the vector that caused this was largely unknown.
      
      A kdump test started tripping over this problem - the system
      would be crashed, the kdump kernel and environment would boot and
      dump the kernel core image, and then the system would reboot. After
      reboot, the root filesystem was triggering log recovery and
      corruptions were being detected. The metadumps indicated the above
      log recovery issue.
      
      What is happening is that the kdump kernel and environment is
      mounting the root device read-only to find the binaries needed to do
      it's work. The result of this is that it is running log recovery.
      However, because there were unlinked files and EFIs to be processed
      by recovery, the completion of phase 1 of log recovery could not
      mark the log clean. And because it's a read-only mount, the unmount
      process does not write records to the log to mark it clean, either.
      Hence on the next mount of the filesystem, log recovery was run
      again across all the metadata that had already been recovered and
      this is what triggered corruption warnings.
      
      To avoid this problem, we need to ensure that a read-only mount
      always updates the log when it completes the second phase of
      recovery. We already handle this sort of issue with rw->ro remount
      transitions, so the solution is as simple as quiescing the
      filesystem at the appropriate time during the mount process. This
      results in the log being marked clean so the mount behaviour
      recorded in the logs on repeated RO mounts will change (i.e. log
      recovery will no longer be run on every mount until a RW mount is
      done). This is a user visible change in behaviour, but it is
      harmless.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ddeb14f4
  30. 03 8月, 2016 2 次提交
    • D
      xfs: rmap btree requires more reserved free space · 52548852
      Darrick J. Wong 提交于
      Originally-From: Dave Chinner <dchinner@redhat.com>
      
      The rmap btree is allocated from the AGFL, which means we have to
      ensure ENOSPC is reported to userspace before we run out of free
      space in each AG. The last allocation in an AG can cause a full
      height rmap btree split, and that means we have to reserve at least
      this many blocks *in each AG* to be placed on the AGFL at ENOSPC.
      Update the various space calculation functions to handle this.
      
      Also, because the macros are now executing conditional code and are
      called quite frequently, convert them to functions that initialise
      variables in the struct xfs_mount, use the new variables everywhere
      and document the calculations better.
      
      [darrick.wong@oracle.com: don't reserve blocks if !rmap]
      [dchinner@redhat.com: update m_ag_max_usable after growfs]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      52548852
    • D
      xfs: define the on-disk rmap btree format · 035e00ac
      Darrick J. Wong 提交于
      Originally-From: Dave Chinner <dchinner@redhat.com>
      
      Now we have all the surrounding call infrastructure in place, we can
      start filling out the rmap btree implementation. Start with the
      on-disk btree format; add everything needed to read, write and
      manipulate rmap btree blocks. This prepares the way for adding the
      btree operations implementation.
      
      [darrick: record owner and offset info in rmap btree]
      [darrick: fork, bmbt and unwritten state in rmap btree]
      [darrick: flags are a separate field in xfs_rmap_irec]
      [darrick: calculate maxlevels separately]
      [darrick: move the 'unwritten' bit into unused parts of rm_offset]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      035e00ac