1. 27 10月, 2017 1 次提交
  2. 02 9月, 2017 1 次提交
  3. 20 6月, 2017 2 次提交
  4. 08 6月, 2017 1 次提交
  5. 28 4月, 2017 2 次提交
    • B
      xfs: update ag iterator to support wait on new inodes · ae2c4ac2
      Brian Foster 提交于
      The AG inode iterator currently skips new inodes as such inodes are
      inserted into the inode radix tree before they are fully
      constructed. Certain contexts require the ability to wait on the
      construction of new inodes, however. The fs-wide dquot release from
      the quotaoff sequence is an example of this.
      
      Update the AG inode iterator to support the ability to wait on
      inodes flagged with XFS_INEW upon request. Create a new
      xfs_inode_ag_iterator_flags() interface and support a set of
      iteration flags to modify the iteration behavior. When the
      XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the
      radix tree inode lookup and wait on them before the callback is
      executed.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      ae2c4ac2
    • B
      xfs: support ability to wait on new inodes · 756baca2
      Brian Foster 提交于
      Inodes that are inserted into the perag tree but still under
      construction are flagged with the XFS_INEW bit. Most contexts either
      skip such inodes when they are encountered or have the ability to
      handle them.
      
      The runtime quotaoff sequence introduces a context that must wait
      for construction of such inodes to correctly ensure that all dquots
      in the fs are released. In anticipation of this, support the ability
      to wait on new inodes. Wake the appropriate bit when XFS_INEW is
      cleared.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      756baca2
  6. 08 3月, 2017 1 次提交
  7. 31 1月, 2017 2 次提交
    • B
      xfs: sync eofblocks scans under iolock are livelock prone · c3155097
      Brian Foster 提交于
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c3155097
    • B
      xfs: pull up iolock from xfs_free_eofblocks() · a36b9261
      Brian Foster 提交于
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require the caller to acquire the iolock
      appropriately and kill the need_iolock parameter. While here, the mp
      param can be removed as well as the xfs_mount is accessible from the
      xfs_inode structure. This patch does not change behavior.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a36b9261
  8. 04 1月, 2017 1 次提交
    • C
      xfs: fix crash and data corruption due to removal of busy COW extents · a1b7a4de
      Christoph Hellwig 提交于
      There is a race window between write_cache_pages calling
      clear_page_dirty_for_io and XFS calling set_page_writeback, in which
      the mapping for an inode is tagged neither as dirty, nor as writeback.
      
      If the COW shrinker hits in exactly that window we'll remove the delayed
      COW extents and writepages trying to write it back, which in release
      kernels will manifest as corruption of the bmap btree, and in debug
      kernels will trip the ASSERT about now calling xfs_bmapi_write with the
      COWFORK flag for holes.  A complex customer load manages to hit this
      window fairly reliably, probably by always having COW writeback in flight
      while the cow shrinker runs.
      
      This patch adds another check for having the I_DIRTY_PAGES flag set,
      which is still set during this race window.  While this fixes the problem
      I'm still not overly happy about the way the COW shrinker works as it
      still seems a bit fragile.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      a1b7a4de
  9. 30 11月, 2016 1 次提交
  10. 10 11月, 2016 1 次提交
    • B
      xfs: fix unbalanced inode reclaim flush locking · 98efe8af
      Brian Foster 提交于
      Filesystem shutdown testing on an older distro kernel has uncovered an
      imbalanced locking pattern for the inode flush lock in
      xfs_reclaim_inode(). Specifically, there is a double unlock sequence
      between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
      "reclaim:" label.
      
      This actually does not cause obvious problems on current kernels due to
      the current flush lock implementation. Older kernels use a counting
      based flush lock mechanism, however, which effectively breaks the lock
      indefinitely when an already unlocked flush lock is repeatedly unlocked.
      Though this only currently occurs on filesystem shutdown, it has
      reproduced the effect of elevating an fs shutdown to a system-wide crash
      or hang.
      
      As it turns out, the flush lock is not actually required for the reclaim
      logic in xfs_reclaim_inode() because by that time we have already cycled
      the flush lock once while holding ILOCK_EXCL. Therefore, remove the
      additional flush lock/unlock cycle around the 'reclaim:' label and
      update branches into this label to release the flush lock where
      appropriate. Add an assert to xfs_ifunlock() to help prevent future
      occurences of the same problem.
      Reported-by: NZorro Lang <zlang@redhat.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      98efe8af
  11. 08 11月, 2016 1 次提交
    • B
      xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan · 39937234
      Brian Foster 提交于
      The cowblocks background scanner currently clears the cowblocks tag
      for inodes without any real allocations in the cow fork. This
      excludes inodes with only delalloc blocks in the cow fork. While we
      might never expect to clear delalloc blocks from the cow fork in the
      background scanner, it is not necessarily correct to clear the
      cowblocks tag from such inodes.
      
      For example, if the background scanner happens to process an inode
      between a buffered write and writeback, the scanner catches the
      inode in a state after delalloc blocks have been allocated to the
      cow fork but before the delalloc blocks have been converted to real
      blocks by writeback. The background scanner then incorrectly clears
      the cowblocks tag, even if part of the aforementioned delalloc
      reservation will not be remapped to the data fork (i.e., extra
      blocks due to the cowextsize hint). This means that any such
      additional blocks in the cow fork might never be reclaimed by the
      background scanner and could persist until the inode itself is
      reclaimed.
      
      To address this problem, only skip and clear inodes without any cow
      fork allocations whatsoever from the background scanner. While we
      generally do not want to cancel delalloc reservations from the
      background scanner, the pagecache dirty check following the
      cowblocks check should prevent that situation. If we do end up with
      delalloc cow fork blocks without a dirty address space mapping, this
      is probably an indication that something has gone wrong and the
      blocks should be reclaimed, as they may never be converted to a real
      allocation.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      39937234
  12. 24 10月, 2016 1 次提交
  13. 06 10月, 2016 1 次提交
    • D
      xfs: garbage collect old cowextsz reservations · 83104d44
      Darrick J. Wong 提交于
      Trim CoW reservations made on behalf of a cowextsz hint if they get too
      old or we run low on quota, so long as we don't have dirty data awaiting
      writeback or directio operations in progress.
      
      Garbage collection of the cowextsize extents are kept separate from
      prealloc extent reaping because setting the CoW prealloc lifetime to a
      (much) higher value than the regular prealloc extent lifetime has been
      useful for combatting CoW fragmentation on VM hosts where the VMs
      experience bursty write behaviors and we can keep the utilization ratios
      low enough that we don't start to run out of space.  IOWs, it benefits
      us to keep the CoW fork reservations around for as long as we can unless
      we run out of blocks or hit inode reclaim.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      83104d44
  14. 05 10月, 2016 1 次提交
  15. 19 9月, 2016 1 次提交
  16. 21 6月, 2016 1 次提交
    • B
      xfs: cancel eofblocks background trimming on remount read-only · fa5a4f57
      Brian Foster 提交于
      The filesystem quiesce sequence performs the operations necessary to
      drain all background work, push pending transactions through the log
      infrastructure and wait on I/O resulting from the final AIL push. We
      have had reports of remount,ro hangs in xfs_log_quiesce() ->
      xfs_wait_buftarg(), however, and some instrumentation code to detect
      transaction commits at this point in the quiesce sequence has inculpated
      the eofblocks background scanner as a cause.
      
      While higher level remount code generally prevents user modifications by
      the time the filesystem has made it to xfs_log_quiesce(), the background
      scanner may still be alive and can perform pending work at any time. If
      this occurs between the xfs_log_force() and xfs_wait_buftarg() calls
      within xfs_log_quiesce(), this can lead to an indefinite lockup in
      xfs_wait_buftarg().
      
      To prevent this problem, cancel the background eofblocks scan worker
      during the remount read-only quiesce sequence. This suspends background
      trimming when a filesystem is remounted read-only. This is only done in
      the remount path because the freeze codepath has already locked out new
      transactions by the time the filesystem attempts to quiesce (and thus
      waiting on an active work item could deadlock). Kick the eofblocks
      worker to pick up where it left off once an fs is remounted back to
      read-write.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      fa5a4f57
  17. 18 5月, 2016 4 次提交
    • D
      xfs: move reclaim tagging functions · ad438c40
      Dave Chinner 提交于
      Rearrange the inode tagging functions so that they are higher up in
      xfs_cache.c and so there is no need for forward prototypes to be
      defined. This is purely code movement, no other change.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      ad438c40
    • D
      xfs: simplify inode reclaim tagging interfaces · 545c0889
      Dave Chinner 提交于
      Inode radix tree tagging for reclaim passes a lot of unnecessary
      variables around. Over time the xfs-perag has grown a xfs_mount
      backpointer, and an internal agno so we don't need to pass other
      variables into the tagging functions to supply this information.
      
      Rework the functions to pass the minimal variable set required
      and simplify the internal logic and flow.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      545c0889
    • D
      xfs: mark reclaimed inodes invalid earlier · 8a17d7dd
      Dave Chinner 提交于
      The last thing we do before using call_rcu() on an xfs_inode to be
      freed is mark it as invalid. This means there is a window between
      when we know for certain that the inode is going to be freed and
      when we do actually mark it as "freed".
      
      This is important in the context of RCU lookups - we can look up the
      inode, find that it is valid, and then use it as such not realising
      that it is in the final stages of being freed.
      
      As such, mark the inode as being invalid the moment we know it is
      going to be reclaimed. This can be done while we still hold the
      XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that
      it occurs well before we remove it from the radix tree, and that
      the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
      synchronisation points for detecting that an inode is about to go
      away.
      
      For defensive purposes, this allows us to add a further check to
      xfs_iflush_cluster to ensure we skip inodes that are being freed
      after we grab the XFS_ILOCK_SHARED and the flush lock - we know that
      if the inode number if valid while we have these locks held we know
      that it has not progressed through reclaim to the point where it is
      clean and is about to be freed.
      
      [bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it
      	  had already been zeroed.]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8a17d7dd
    • D
      xfs: xfs_inode_free() isn't RCU safe · 1f2dcfe8
      Dave Chinner 提交于
      The xfs_inode freed in xfs_inode_free() has multiple allocated
      structures attached to it. We free these in xfs_inode_free() before
      we mark the inode as invalid, and before we run call_rcu() to queue
      the structure for freeing.
      
      Unfortunately, this freeing can race with other accesses that are in
      the RCU current grace period that have found the inode in the radix
      tree with a valid state.  This includes xfs_iflush_cluster(), which
      calls xfs_inode_clean(), and that accesses the inode log item on the
      xfs_inode.
      
      The log item structure is freed in xfs_inode_free(), so there is the
      possibility we can be accessing freed memory in xfs_iflush_cluster()
      after validating the xfs_inode structure as being valid for this RCU
      context. Hence we can get spuriously incorrect clean state returned
      from such checks. This can lead to use thinking the inode is dirty
      when it is, in fact, clean, and so incorrectly attaching it to the
      buffer for IO and completion processing.
      
      This then leads to use-after-free situations on the xfs_inode itself
      if the IO completes after the current RCU grace period expires. The
      buffer callbacks will access the xfs_inode and try to do all sorts
      of things it shouldn't with freed memory.
      
      IOWs, xfs_iflush_cluster() only works correctly when racing with
      inode reclaim if the inode log item is present and correctly stating
      the inode is clean. If the inode is being freed, then reclaim has
      already made sure the inode is clean, and hence xfs_iflush_cluster
      can skip it. However, we are accessing the inode inode under RCU
      read lock protection and so also must ensure that all dynamically
      allocated memory we reference in this context is not freed until the
      RCU grace period expires.
      
      To fix this, move all the potential memory freeing into
      xfs_inode_free_callback() so that we are guarantee RCU protected
      lookup code will always have the memory structures it needs
      available during the RCU grace period that lookup races can occur
      in.
      Discovered-by: NBrain Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      1f2dcfe8
  18. 09 2月, 2016 6 次提交
  19. 12 10月, 2015 1 次提交
    • B
      xfs: per-filesystem stats counter implementation · ff6d6af2
      Bill O'Donnell 提交于
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ff6d6af2
  20. 28 8月, 2015 1 次提交
  21. 23 2月, 2015 1 次提交
    • D
      xfs: inodes are new until the dentry cache is set up · 58c90473
      Dave Chinner 提交于
      Al Viro noticed a generic set of issues to do with filehandle lookup
      racing with dentry cache setup. They involve a filehandle lookup
      occurring while an inode is being created and the filehandle lookup
      racing with the dentry creation for the real file. This can lead to
      multiple dentries for the one path being instantiated. There are a
      host of other issues around this same set of paths.
      
      The underlying cause is that file handle lookup only waits on inode
      cache instantiation rather than full dentry cache instantiation. XFS
      is mostly immune to the problems discovered due to it's own internal
      inode cache, but there are a couple of corner cases where races can
      happen.
      
      We currently clear the XFS_INEW flag when the inode is fully set up
      after insertion into the cache. Newly allocated inodes are inserted
      locked and so aren't usable until the allocation transaction
      commits. This, however, occurs before the dentry and security
      information is fully initialised and hence the inode is unlocked and
      available for lookups to find too early.
      
      To solve the problem, only clear the XFS_INEW flag for newly created
      inodes once the dentry is fully instantiated. This means lookups
      will retry until the XFS_INEW flag is removed from the inode and
      hence avoids the race conditions in questions.
      
      THis also means that xfs_create(), xfs_create_tmpfile() and
      xfs_symlink() need to finish the setup of the inode in their error
      paths if we had allocated the inode but failed later in the creation
      process. xfs_symlink(), in particular, needed a lot of help to make
      it's error handling match that of xfs_create().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      58c90473
  22. 04 12月, 2014 1 次提交
  23. 28 11月, 2014 2 次提交
  24. 23 9月, 2014 1 次提交
  25. 24 7月, 2014 3 次提交
    • B
      xfs: run an eofblocks scan on ENOSPC/EDQUOT · dc06f398
      Brian Foster 提交于
      From: Brian Foster <bfoster@redhat.com>
      
      Speculative preallocation and and the associated throttling metrics
      assume we're working with large files on large filesystems. Users have
      reported inefficiencies in these mechanisms when we happen to be dealing
      with large files on smaller filesystems. This can occur because while
      prealloc throttling is aggressive under low free space conditions, it is
      not active until we reach 5% free space or less.
      
      For example, a 40GB filesystem has enough space for several files large
      enough to have multi-GB preallocations at any given time. If those files
      are slow growing, they might reserve preallocation for long periods of
      time as well as avoid the background scanner due to frequent
      modification. If a new file is written under these conditions, said file
      has no access to this already reserved space and premature ENOSPC is
      imminent.
      
      To handle this scenario, modify the buffered write ENOSPC handling and
      retry sequence to invoke an eofblocks scan. In the smaller filesystem
      scenario, the eofblocks scan resets the usage of preallocation such that
      when the 5% free space threshold is met, throttling effectively takes
      over to provide fair and efficient preallocation until legitimate
      ENOSPC.
      
      The eofblocks scan is selective based on the nature of the failure. For
      example, an EDQUOT failure in a particular quota will use a filtered
      scan for that quota. Because we don't know which quota might have caused
      an allocation failure at any given time, we include each applicable
      quota determined to be under low free space conditions in the scan.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      dc06f398
    • B
      xfs: support a union-based filter for eofblocks scans · f4526397
      Brian Foster 提交于
      From: Brian Foster <bfoster@redhat.com>
      
      The eofblocks scan inode filter uses intersection logic by default.
      E.g., specifying both user and group quota ids filters out inodes that
      are not covered by both the specified user and group quotas. This is
      suitable for behavior exposed to userspace.
      
      Scans that are initiated from within the kernel might require more broad
      semantics, such as scanning all inodes under each quota associated with
      an inode to alleviate low free space conditions in each.
      
      Create the XFS_EOF_FLAGS_UNION flag to support a conditional union-based
      filtering algorithm for eofblocks scans. This flag is intentionally left
      out of the valid mask as it is not supported for scans initiated from
      userspace.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f4526397
    • B
      xfs: add scan owner field to xfs_eofblocks · 5400da7d
      Brian Foster 提交于
      From: Brian Foster <bfoster@redhat.com>
      
      The scan owner field represents an optional inode number that is
      responsible for the current scan. The purpose is to identify that an
      inode is under iolock and as such, the iolock shouldn't be attempted
      when trimming eofblocks. This is an internal only field.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5400da7d
  26. 25 6月, 2014 1 次提交
    • D
      xfs: global error sign conversion · 2451337d
      Dave Chinner 提交于
      Convert all the errors the core XFs code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2451337d