1. 31 Jan 2017, 2 commits
    • xfs: sync eofblocks scans under iolock are livelock prone · c3155097
      Committed by Brian Foster
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: pull up iolock from xfs_free_eofblocks() · a36b9261
      Committed by Brian Foster
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require the caller to acquire the iolock
      appropriately and kill the need_iolock parameter. While here, the mp
      param can be removed as well as the xfs_mount is accessible from the
      xfs_inode structure. This patch does not change behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
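The locking rule that the two iolock commits above (c3155097 and a36b9261) settle on can be sketched as a small Python model. This is not kernel code: the `Inode` class, the block counts, and the function names are all illustrative assumptions, and the scan here simply skips inodes whose lock is busy. The point it shows is the ordering: the writer drops its own iolock before the scan, the trimming helper expects its caller to hold the lock, and the writer retakes the lock before retrying.

```python
import threading


class Inode:
    """Toy stand-in for an XFS inode: an iolock plus post-EOF preallocation."""

    def __init__(self, ino, prealloc_blocks):
        self.ino = ino
        self.iolock = threading.Lock()
        self.prealloc_blocks = prealloc_blocks


def free_eofblocks(ip):
    """Trim preallocation. The caller must already hold ip.iolock."""
    assert ip.iolock.locked()
    freed, ip.prealloc_blocks = ip.prealloc_blocks, 0
    return freed


def eofblocks_scan(inodes):
    """Trim every inode whose iolock can be taken; skip the rest."""
    freed = 0
    for ip in inodes:
        if not ip.iolock.acquire(blocking=False):
            continue  # lock is busy, so this model just moves on
        try:
            freed += free_eofblocks(ip)
        finally:
            ip.iolock.release()
    return freed


def buffered_write(ip, inodes, free_blocks, needed):
    """Return (succeeded, free_blocks_after_any_scan)."""
    ip.iolock.acquire()
    try:
        if needed > free_blocks:        # simulated ENOSPC
            ip.iolock.release()         # never scan while under iolock
            try:
                free_blocks += eofblocks_scan(inodes)
            finally:
                ip.iolock.acquire()     # retake the lock before the retry
            # The initial write checks would be repeated here, since the
            # inode state may have changed while the lock was dropped.
        return needed <= free_blocks, free_blocks
    finally:
        ip.iolock.release()
```

Because the writer no longer holds any iolock during the scan, two writers that hit ENOSPC at the same time cannot block each other's scans in this model.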
  2. 04 Jan 2017, 1 commit
    • xfs: fix crash and data corruption due to removal of busy COW extents · a1b7a4de
      Committed by Christoph Hellwig
      There is a race window between write_cache_pages calling
      clear_page_dirty_for_io and XFS calling set_page_writeback, in which
      the mapping for an inode is tagged neither as dirty, nor as writeback.
      
      If the COW shrinker hits in exactly that window, we'll remove the delayed
      COW extents while writepages is still trying to write them back. In release
      kernels this manifests as corruption of the bmap btree, and in debug
      kernels it trips the ASSERT about not calling xfs_bmapi_write with the
      COWFORK flag for holes.  A complex customer load manages to hit this
      window fairly reliably, probably by always having COW writeback in flight
      while the COW shrinker runs.
      
      This patch adds another check for having the I_DIRTY_PAGES flag set,
      which is still set during this race window.  While this fixes the problem
      I'm still not overly happy about the way the COW shrinker works as it
      still seems a bit fragile.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  3. 30 Nov 2016, 1 commit
  4. 10 Nov 2016, 1 commit
    • xfs: fix unbalanced inode reclaim flush locking · 98efe8af
      Committed by Brian Foster
      Filesystem shutdown testing on an older distro kernel has uncovered an
      imbalanced locking pattern for the inode flush lock in
      xfs_reclaim_inode(). Specifically, there is a double unlock sequence
      between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the
      "reclaim:" label.
      
      This actually does not cause obvious problems on current kernels due to
      the current flush lock implementation. Older kernels use a counting
      based flush lock mechanism, however, which effectively breaks the lock
      indefinitely when an already unlocked flush lock is repeatedly unlocked.
      Though this only currently occurs on filesystem shutdown, it has
      reproduced the effect of elevating an fs shutdown to a system-wide crash
      or hang.
      
      As it turns out, the flush lock is not actually required for the reclaim
      logic in xfs_reclaim_inode() because by that time we have already cycled
      the flush lock once while holding ILOCK_EXCL. Therefore, remove the
      additional flush lock/unlock cycle around the 'reclaim:' label and
      update branches into this label to release the flush lock where
      appropriate. Add an assert to xfs_ifunlock() to help prevent future
      occurrences of the same problem.
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
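Why an unbalanced unlock is harmful for a counting-based lock can be illustrated with Python's standard semaphores. This is only an analogy under stated assumptions: `threading.Semaphore` stands in for the older counting-based flush lock, and `threading.BoundedSemaphore` stands in for the assert added to xfs_ifunlock(). Neither is the kernel's actual implementation, and the helper names are hypothetical.

```python
import threading


def holders_after_double_unlock(lock):
    """Lock once, unlock twice, then count how many holders get in."""
    lock.acquire()
    lock.release()
    lock.release()          # the unbalanced second unlock
    holders = 0
    while lock.acquire(blocking=False):
        holders += 1
    return holders


def double_unlock_is_caught(lock):
    """Return True if the lock rejects an unlock while it is not held."""
    lock.acquire()
    lock.release()
    try:
        lock.release()
    except ValueError:
        return True
    return False
```

With a plain counting semaphore the extra release raises the count, so two callers can "hold" the lock at once and mutual exclusion is gone from then on. The bounded variant refuses the second release, which is the same kind of early detection an assert provides.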
  5. 08 Nov 2016, 1 commit
    • xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan · 39937234
      Committed by Brian Foster
      The cowblocks background scanner currently clears the cowblocks tag
      for inodes without any real allocations in the cow fork. This
      excludes inodes with only delalloc blocks in the cow fork. While we
      might never expect to clear delalloc blocks from the cow fork in the
      background scanner, it is not necessarily correct to clear the
      cowblocks tag from such inodes.
      
      For example, if the background scanner happens to process an inode
      between a buffered write and writeback, the scanner catches the
      inode in a state after delalloc blocks have been allocated to the
      cow fork but before the delalloc blocks have been converted to real
      blocks by writeback. The background scanner then incorrectly clears
      the cowblocks tag, even if part of the aforementioned delalloc
      reservation will not be remapped to the data fork (i.e., extra
      blocks due to the cowextsize hint). This means that any such
      additional blocks in the cow fork might never be reclaimed by the
      background scanner and could persist until the inode itself is
      reclaimed.
      
      To address this problem, only skip and clear inodes without any cow
      fork allocations whatsoever from the background scanner. While we
      generally do not want to cancel delalloc reservations from the
      background scanner, the pagecache dirty check following the
      cowblocks check should prevent that situation. If we do end up with
      delalloc cow fork blocks without a dirty address space mapping, this
      is probably an indication that something has gone wrong and the
      blocks should be reclaimed, as they may never be converted to a real
      allocation.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
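The change in the scanner's skip rule described in the commit above can be written as two small predicates. This is a sketch, not the kernel check: the `CowFork` class and its extent counters are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CowFork:
    """Toy cow fork: counts of real and delalloc extents (illustrative)."""
    real_extents: int = 0
    delalloc_extents: int = 0


def skip_and_clear_before(cow):
    """Old rule: clear the tag whenever there are no real allocations."""
    return cow.real_extents == 0


def skip_and_clear_after(cow):
    """New rule: clear the tag only when the cow fork holds nothing at all."""
    return cow.real_extents == 0 and cow.delalloc_extents == 0
```

A fork holding only delalloc blocks is the case that differs: the old rule dropped the cowblocks tag, while the new rule keeps the inode visible to later scans.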
  6. 24 Oct 2016, 1 commit
  7. 06 Oct 2016, 1 commit
    • xfs: garbage collect old cowextsz reservations · 83104d44
      Committed by Darrick J. Wong
      Trim CoW reservations made on behalf of a cowextsz hint if they get too
      old or we run low on quota, so long as we don't have dirty data awaiting
      writeback or directio operations in progress.
      
      Garbage collection of the cowextsize extents is kept separate from
      prealloc extent reaping because setting the CoW prealloc lifetime to a
      (much) higher value than the regular prealloc extent lifetime has been
      useful for combatting CoW fragmentation on VM hosts where the VMs
      experience bursty write behaviors and we can keep the utilization ratios
      low enough that we don't start to run out of space.  IOWs, it benefits
      us to keep the CoW fork reservations around for as long as we can unless
      we run out of blocks or hit inode reclaim.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  8. 05 Oct 2016, 1 commit
  9. 19 Sep 2016, 1 commit
  10. 21 Jun 2016, 1 commit
    • xfs: cancel eofblocks background trimming on remount read-only · fa5a4f57
      Committed by Brian Foster
      The filesystem quiesce sequence performs the operations necessary to
      drain all background work, push pending transactions through the log
      infrastructure and wait on I/O resulting from the final AIL push. We
      have had reports of remount,ro hangs in xfs_log_quiesce() ->
      xfs_wait_buftarg(), however, and some instrumentation code to detect
      transaction commits at this point in the quiesce sequence has inculpated
      the eofblocks background scanner as a cause.
      
      While higher level remount code generally prevents user modifications by
      the time the filesystem has made it to xfs_log_quiesce(), the background
      scanner may still be alive and can perform pending work at any time. If
      this occurs between the xfs_log_force() and xfs_wait_buftarg() calls
      within xfs_log_quiesce(), this can lead to an indefinite lockup in
      xfs_wait_buftarg().
      
      To prevent this problem, cancel the background eofblocks scan worker
      during the remount read-only quiesce sequence. This suspends background
      trimming when a filesystem is remounted read-only. This is only done in
      the remount path because the freeze codepath has already locked out new
      transactions by the time the filesystem attempts to quiesce (and thus
      waiting on an active work item could deadlock). Kick the eofblocks
      worker to pick up where it left off once an fs is remounted back to
      read-write.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  11. 18 May 2016, 4 commits
    • xfs: move reclaim tagging functions · ad438c40
      Committed by Dave Chinner
      Rearrange the inode tagging functions so that they are higher up in
      xfs_icache.c and so there is no need for forward prototypes to be
      defined. This is purely code movement, no other change.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
    • xfs: simplify inode reclaim tagging interfaces · 545c0889
      Committed by Dave Chinner
      Inode radix tree tagging for reclaim passes a lot of unnecessary
      variables around. Over time the xfs_perag has grown an xfs_mount
      backpointer and an internal agno, so we don't need to pass other
      variables into the tagging functions to supply this information.
      
      Rework the functions to pass the minimal variable set required
      and simplify the internal logic and flow.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: mark reclaimed inodes invalid earlier · 8a17d7dd
      Committed by Dave Chinner
      The last thing we do before using call_rcu() on an xfs_inode to be
      freed is mark it as invalid. This means there is a window between
      when we know for certain that the inode is going to be freed and
      when we do actually mark it as "freed".
      
      This is important in the context of RCU lookups - we can look up the
      inode, find that it is valid, and then use it as such not realising
      that it is in the final stages of being freed.
      
      As such, mark the inode as being invalid the moment we know it is
      going to be reclaimed. This can be done while we still hold the
      XFS_ILOCK_EXCL and the flush lock in xfs_inode_reclaim, meaning that
      it occurs well before we remove it from the radix tree, and that
      the i_flags_lock, the XFS_ILOCK and the inode flush lock all act as
      synchronisation points for detecting that an inode is about to go
      away.
      
      For defensive purposes, this allows us to add a further check to
      xfs_iflush_cluster to ensure we skip inodes that are being freed
      after we grab the XFS_ILOCK_SHARED and the flush lock. If the inode
      number is valid while we hold these locks, we know that it has not
      progressed through reclaim to the point where it is clean and is
      about to be freed.
      
      [bfoster: fixed __xfs_inode_clear_reclaim() using ip->i_ino after it
      	  had already been zeroed.]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: xfs_inode_free() isn't RCU safe · 1f2dcfe8
      Committed by Dave Chinner
      The xfs_inode freed in xfs_inode_free() has multiple allocated
      structures attached to it. We free these in xfs_inode_free() before
      we mark the inode as invalid, and before we run call_rcu() to queue
      the structure for freeing.
      
      Unfortunately, this freeing can race with other accesses that are in
      the current RCU grace period that have found the inode in the radix
      tree with a valid state.  This includes xfs_iflush_cluster(), which
      calls xfs_inode_clean(), and that accesses the inode log item on the
      xfs_inode.
      
      The log item structure is freed in xfs_inode_free(), so there is the
      possibility we can be accessing freed memory in xfs_iflush_cluster()
      after validating the xfs_inode structure as being valid for this RCU
      context. Hence we can get spuriously incorrect clean state returned
      from such checks. This can lead to us thinking the inode is dirty
      when it is, in fact, clean, and so incorrectly attaching it to the
      buffer for IO and completion processing.
      
      This then leads to use-after-free situations on the xfs_inode itself
      if the IO completes after the current RCU grace period expires. The
      buffer callbacks will access the xfs_inode and try to do all sorts
      of things it shouldn't with freed memory.
      
      IOWs, xfs_iflush_cluster() only works correctly when racing with
      inode reclaim if the inode log item is present and correctly stating
      the inode is clean. If the inode is being freed, then reclaim has
      already made sure the inode is clean, and hence xfs_iflush_cluster
      can skip it. However, we are accessing the inode under RCU
      read lock protection and so also must ensure that all dynamically
      allocated memory we reference in this context is not freed until the
      RCU grace period expires.
      
      To fix this, move all the potential memory freeing into
      xfs_inode_free_callback() so that we are guaranteed that RCU protected
      lookup code will always have the memory structures it needs
      available during the RCU grace period that lookup races can occur
      in.
      Discovered-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
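The ordering problem fixed by the xfs_inode_free() commit above can be modelled with a toy grace period. Everything here is an illustrative assumption: the `Rcu` class only queues callbacks until `end_grace_period()` is called, and setting `log_item` to `None` stands in for freeing memory. It is a sketch of the idea, not of the kernel's RCU machinery.

```python
class Inode:
    """Toy inode with one dynamically allocated attachment."""

    def __init__(self):
        self.log_item = {"dirty": False}
        self.valid = True


class Rcu:
    """Toy grace period: callbacks run only at end_grace_period()."""

    def __init__(self):
        self.pending = []

    def call_rcu(self, callback):
        self.pending.append(callback)

    def end_grace_period(self):
        callbacks, self.pending = self.pending, []
        for callback in callbacks:
            callback()


def inode_free_old(ip, rcu):
    """Old order: the attachment is freed before the grace period ends."""
    ip.log_item = None              # a racing lookup can now see freed state
    ip.valid = False
    rcu.call_rcu(lambda: None)      # only the inode itself is deferred


def inode_free_new(ip, rcu):
    """New order: all freeing is deferred to the RCU callback."""
    ip.valid = False

    def free_callback():
        ip.log_item = None

    rcu.call_rcu(free_callback)
```

In the new ordering, a reader that found the inode before it was marked invalid can still inspect `log_item` safely until the grace period ends.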
  12. 09 Feb 2016, 6 commits
  13. 12 Oct 2015, 1 commit
    • xfs: per-filesystem stats counter implementation · ff6d6af2
      Committed by Bill O'Donnell
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: Bill O'Donnell <billodo@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  14. 28 Aug 2015, 1 commit
  15. 23 Feb 2015, 1 commit
    • xfs: inodes are new until the dentry cache is set up · 58c90473
      Committed by Dave Chinner
      Al Viro noticed a generic set of issues to do with filehandle lookup
      racing with dentry cache setup. They involve a filehandle lookup
      occurring while an inode is being created and the filehandle lookup
      racing with the dentry creation for the real file. This can lead to
      multiple dentries for the one path being instantiated. There are a
      host of other issues around this same set of paths.
      
      The underlying cause is that file handle lookup only waits on inode
      cache instantiation rather than full dentry cache instantiation. XFS
      is mostly immune to the problems discovered due to its own internal
      inode cache, but there are a couple of corner cases where races can
      happen.
      
      We currently clear the XFS_INEW flag when the inode is fully set up
      after insertion into the cache. Newly allocated inodes are inserted
      locked and so aren't usable until the allocation transaction
      commits. This, however, occurs before the dentry and security
      information is fully initialised and hence the inode is unlocked and
      available for lookups to find too early.
      
      To solve the problem, only clear the XFS_INEW flag for newly created
      inodes once the dentry is fully instantiated. This means lookups
      will retry until the XFS_INEW flag is removed from the inode and
      hence avoids the race conditions in question.

      This also means that xfs_create(), xfs_create_tmpfile() and
      xfs_symlink() need to finish the setup of the inode in their error
      paths if we had allocated the inode but failed later in the creation
      process. xfs_symlink(), in particular, needed a lot of help to make
      its error handling match that of xfs_create().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  16. 04 Dec 2014, 1 commit
  17. 28 Nov 2014, 2 commits
  18. 23 Sep 2014, 1 commit
  19. 24 Jul 2014, 3 commits
    • xfs: run an eofblocks scan on ENOSPC/EDQUOT · dc06f398
      Committed by Brian Foster
      From: Brian Foster <bfoster@redhat.com>
      
      Speculative preallocation and the associated throttling metrics
      assume we're working with large files on large filesystems. Users have
      reported inefficiencies in these mechanisms when we happen to be dealing
      with large files on smaller filesystems. This can occur because while
      prealloc throttling is aggressive under low free space conditions, it is
      not active until we reach 5% free space or less.
      
      For example, a 40GB filesystem has enough space for several files large
      enough to have multi-GB preallocations at any given time. If those files
      are slow growing, they might reserve preallocation for long periods of
      time as well as avoid the background scanner due to frequent
      modification. If a new file is written under these conditions, said file
      has no access to this already reserved space and premature ENOSPC is
      imminent.
      
      To handle this scenario, modify the buffered write ENOSPC handling and
      retry sequence to invoke an eofblocks scan. In the smaller filesystem
      scenario, the eofblocks scan resets the usage of preallocation such that
      when the 5% free space threshold is met, throttling effectively takes
      over to provide fair and efficient preallocation until legitimate
      ENOSPC.
      
      The eofblocks scan is selective based on the nature of the failure. For
      example, an EDQUOT failure in a particular quota will use a filtered
      scan for that quota. Because we don't know which quota might have caused
      an allocation failure at any given time, we include each applicable
      quota determined to be under low free space conditions in the scan.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: support a union-based filter for eofblocks scans · f4526397
      Committed by Brian Foster
      From: Brian Foster <bfoster@redhat.com>
      
      The eofblocks scan inode filter uses intersection logic by default.
      E.g., specifying both user and group quota ids filters out inodes that
      are not covered by both the specified user and group quotas. This is
      suitable for behavior exposed to userspace.
      
      Scans that are initiated from within the kernel might require more broad
      semantics, such as scanning all inodes under each quota associated with
      an inode to alleviate low free space conditions in each.
      
      Create the XFS_EOF_FLAGS_UNION flag to support a conditional union-based
      filtering algorithm for eofblocks scans. This flag is intentionally left
      out of the valid mask as it is not supported for scans initiated from
      userspace.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: add scan owner field to xfs_eofblocks · 5400da7d
      Committed by Brian Foster
      From: Brian Foster <bfoster@redhat.com>
      
      The scan owner field represents an optional inode number that is
      responsible for the current scan. The purpose is to identify that an
      inode is under iolock and as such, the iolock shouldn't be attempted
      when trimming eofblocks. This is an internal only field.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
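The difference between the default intersection filter and the union filter introduced by the XFS_EOF_FLAGS_UNION commit in this group (f4526397) can be sketched as follows. The `Inode` and `EofFilter` classes and the `inode_matches` helper are hypothetical names for illustration; the kernel's structures and matching code are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inode:
    uid: int
    gid: int
    prid: int


@dataclass
class EofFilter:
    """Quota ids to filter on; None means that id is not part of the filter."""
    uid: Optional[int] = None
    gid: Optional[int] = None
    prid: Optional[int] = None
    union: bool = False     # models the union flag; intersection by default


def inode_matches(ip, flt):
    """Return True if the scan should process this inode."""
    checks = []
    for name in ("uid", "gid", "prid"):
        wanted = getattr(flt, name)
        if wanted is not None:
            checks.append(getattr(ip, name) == wanted)
    if not checks:
        return True         # an empty filter selects every inode
    return any(checks) if flt.union else all(checks)
```

For an inode owned by uid 1000 and gid 50, a filter on uid 1000 and gid 99 rejects it under intersection semantics but accepts it under union semantics, which is the broader behaviour the commit message says in-kernel scans may need.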
  20. 25 Jun 2014, 1 commit
    • xfs: global error sign conversion · 2451337d
      Committed by Dave Chinner
      Convert all the errors in the core XFS code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
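The kind of change the error sign conversion above makes to a `return E...` site can be illustrated with a small regular-expression rewrite. This is not how the commit was produced (the message describes manual `git grep` searches); the pattern and the function name are assumptions, and a mechanical rewrite like this would still need review, since any identifier that merely looks like an errno name would also be rewritten.

```python
import re

# Matches "return ENOSPC", "return E2BIG", and similar positive error returns.
POSITIVE_RETURN = re.compile(r"\breturn (E[A-Z0-9]+)\b")


def negate_error_returns(source):
    """Rewrite 'return EFOO' as 'return -EFOO'; leave other text alone."""
    return POSITIVE_RETURN.sub(r"return -\1", source)
```

Already negated returns and calls such as `return ERR_PTR(...)` are left untouched, because the pattern requires the errno-style name to start directly after `return ` and to end at a word boundary.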
  21. 22 Jun 2014, 1 commit
  22. 14 Apr 2014, 1 commit
  23. 24 Oct 2013, 2 commits
    • xfs: decouple inode and bmap btree header files · a4fbe6ab
      Committed by Dave Chinner
      Currently the xfs_inode.h header has a dependency on the definition
      of the BMAP btree records as the inode fork includes an array of
      xfs_bmbt_rec_host_t objects in its definition.
      
      Move all the btree format definitions from xfs_btree.h,
      xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
      xfs_format.h to continue the process of centralising the on-disk
      format definitions. With this done, the xfs inode definitions are no
      longer dependent on btree header files.
      
      This enables a massive culling of unnecessary includes, with close to
      200 #include directives removed from the XFS kernel code base.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: decouple log and transaction headers · 239880ef
      Committed by Dave Chinner
      xfs_trans.h has a dependency on xfs_log.h for a couple of
      structures. Most code that does transactions doesn't need to know
      anything about the log, but this dependency means that they have to
      include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
      files and clean up the includes to be in dependency order.
      
      In doing this, remove the direct include of xfs_trans_reserve.h from
      xfs_trans.h so that we remove the dependency between xfs_trans.h and
      xfs_mount.h. Hence the xfs_trans.h include can be moved to
      indicate the actual dependencies other header files have on it.
      
      Note that these are kernel only header files, so this does not
      translate to any userspace changes at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  24. 02 Oct 2013, 1 commit
  25. 25 Sep 2013, 1 commit
    • xfs: asserting lock not held during freeing not valid · b313a5f1
      Committed by Dave Chinner
      When we free an inode, we do so via RCU. As an RCU lookup can occur
      at any time before we free an inode, and that lookup takes the inode
      flags lock, we cannot safely assert that the flags lock is not held
      just before marking it dead and running call_rcu() to free the
      inode.
      
      We check on allocation of a new inode structure that the lock is not
      held, so we still have protection against locks being leaked and
      hence not correctly initialised when allocated out of the slab.
      Hence just remove the assert...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  26. 11 Sep 2013, 2 commits
    • shrinker: convert superblock shrinkers to new API · 0a234c6d
      Committed by Dave Chinner
      Convert superblock shrinker to use the new count/scan API, and propagate
      the API changes through to the filesystem callouts.  The filesystem
      callouts already use a count/scan API, so it's just changing counters to
      longs to match the VM API.
      
      This requires the dentry and inode shrinker callouts to be converted to
      the count/scan API.  This is mainly a mechanical change.
      
      [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • xfs: recovery of swap extents operations for CRC filesystems · 638f4416
      Committed by Dave Chinner
      This is the recovery side of the btree block owner change operation
      performed by swapext on CRC enabled filesystems. We detect that an
      owner change is needed by the flag that has been placed on the inode
      log format flag field. Because the inode recovery is being replayed
      after the buffers that make up the BMBT in the given checkpoint, we
      can walk all the buffers and directly modify them when we see the
      flag set on an inode.
      
      Because the inode can be relogged and hence present in multiple
      checkpoints with the "change owner" flag set, we could do multiple
      passes across the inode to do this change. While this isn't optimal,
      we can't directly ignore the flag as there may be multiple
      independent swap extent operations being replayed on the same inode
      in different checkpoints so we can't ignore them.
      
      Further, because the owner change operation uses ordered buffers, we
      might have buffers that are newer on disk than the current
      checkpoint and so already have the owner changed in them. Hence we
      cannot just peek at a buffer in the tree and check that it has the
      correct owner and assume that the change was completed.
      
      So, for the moment just brute force the owner change every time we
      see an inode with the flag set. Note that we have to be careful here
      because the owner of the buffers may point to either the old owner
      or the new owner. Currently the verifier can't verify the owner
      directly, so there is no failure case here right now. If we verify
      the owner exactly in future, then we'll have to take this into
      account.
      
      This was tested in terms of normal operation via xfstests - all of
      the fsr tests now pass without failure. However, we really need to
      modify xfs/227 to stress v3 inodes correctly to ensure we fully
      cover this case for v5 filesystems.
      
      In terms of recovery testing, I used a hacked version of xfs_fsr
      that held the temp inode open for a few seconds before exiting so
      that the filesystem could be shut down with an open owner change
      recovery flags set on at least the temp inode. fsr leaves the temp
      inode unlinked and in btree format, so this was necessary for the
      owner change to be reliably replayed.
      
      logprint confirmed the tmp inode in the log had the correct flag set:
      
      INO: cnt:3 total:3 a:0x69e9e0 len:56 a:0x69ea20 len:176 a:0x69eae0 len:88
              INODE: #regs:3   ino:0x44  flags:0x209   dsize:88
      	                                 ^^^^^
      
      0x200 is set, indicating a data fork owner change needed to be
      replayed on inode 0x44.  A printk in the recovery code confirmed that
      the inode change was recovered:
      
      XFS (vdc): Mounting Filesystem
      XFS (vdc): Starting recovery (logdev: internal)
      recovering owner change ino 0x44
      XFS (vdc): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
      Use of these features in this kernel is at your own risk!
      XFS (vdc): Ending recovery (logdev: internal)
      
      The script used to test this was:
      
      $ cat ./recovery-fsr.sh
      #!/bin/bash
      
      dev=/dev/vdc
      mntpt=/mnt/scratch
      testfile=$mntpt/testfile
      
      umount $mntpt
      mkfs.xfs -f -m crc=1 $dev
      mount $dev $mntpt
      chmod 777 $mntpt
      
      for i in `seq 10000 -1 0`; do
              xfs_io -f -d -c "pwrite $(($i * 4096)) 4096" $testfile > /dev/null 2>&1
      done
      xfs_bmap -vp $testfile |head -20
      
      xfs_fsr -d -v $testfile &
      sleep 10
      /home/dave/src/xfstests-dev/src/godown -f $mntpt
      wait
      umount $mntpt
      
      xfs_logprint -t $dev |tail -20
      time mount $dev $mntpt
      xfs_bmap -vp $testfile
      umount $mntpt
      $
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
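The flag check described in the swap extents recovery commit above, where logprint shows `flags:0x209` and the message notes that 0x200 is set, is a plain bit test. In this sketch the value 0x200 comes from the commit message, while the constant and function names are illustrative and not taken from the kernel headers.

```python
# 0x200 is the bit the commit message identifies as "data fork owner
# change"; the constant name below is made up for this example.
DATA_FORK_OWNER_CHANGE = 0x200


def needs_owner_change(flags):
    """Return True if the logged inode flags request an owner change replay."""
    return bool(flags & DATA_FORK_OWNER_CHANGE)
```

Applied to the logged value, `0x209 & 0x200` is non-zero, so recovery would replay the owner change for inode 0x44; the remaining bits (0x009) are unrelated to this check.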