1. 23 2月, 2015 3 次提交
    • D
      xfs: use generic percpu counters for free block counter · 0d485ada
      Dave Chinner 提交于
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. The free block counter is
      special in that it is used for ENOSPC detection outside transaction
      contexts for for delayed allocation. This means that the counter
      needs to be accurate at zero. The current per-cpu counter code jumps
      through lots of hoops to ensure we never run past zero, but we don't
      need to make all those jumps with the generic counter
      implementation.
      
      The generic counter implementation allows us to pass a "batch"
      threshold at which the addition/subtraction to the counter value
      will be folded back into global value under lock. We can use this
      feature to reduce the batch size as we approach 0 in a very similar
      manner to the existing counters and their rebalance algorithm. If we
      use a batch size of 1 as we approach 0, then every addition and
      subtraction will be done against the global value and hence allow
      accurate detection of zero threshold crossing.
      
      Hence we can replace the handrolled, accurate-at-zero counters with
      generic percpu counters.
      
      Note: this removes just enough of the icsb infrastructure to compile
      without warnings. The rest will go in subsequent commits.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      0d485ada
    • D
      xfs: use generic percpu counters for free inode counter · e88b64ea
      Dave Chinner 提交于
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. The free inode counter is not
      used for any limit enforcement - the per-AG free inode counters are
      used during allocation to determine if there are inode available for
      allocation.
      
      Hence we don't need any of the complexity of the hand-rolled
      counters and we can simply replace them with generic per-cpu
      counters similar to the inode counter.
      
      This version introduces a xfs_mod_ifree() helper function from
      Christoph Hellwig.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      e88b64ea
    • D
      xfs: use generic percpu counters for inode counter · 501ab323
      Dave Chinner 提交于
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. There are some warts around
      the  use of them for the inode counter as the hand rolled counter is
      designed to be accurate at zero, but has no specific accurracy at
      any other value. This design causes problems for the maximum inode
      count threshold enforcement, as there is no trigger that balances
      the counters as they get close tothe maximum threshold.
      
      Instead of designing new triggers for balancing, just replace the
      handrolled per-cpu counter with a generic counter.  This enables us
      to update the counter through the normal superblock modification
      funtions, but rather than do that we add a xfs_mod_icount() helper
      function (from Christoph Hellwig) and keep the percpu counter
      outside the superblock in the struct xfs_mount.
      
      This means we still need to initialise the per-cpu counter
      specifically when we read the superblock, and vice versa when we
      log/write it, but it does mean that we don't need to change any
      other code.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      501ab323
  2. 16 2月, 2015 1 次提交
  3. 22 1月, 2015 2 次提交
    • D
      xfs: consolidate superblock logging functions · 61e63ecb
      Dave Chinner 提交于
      We now have several superblock loggin functions that are identical
      except for the transaction reservation and whether it shoul dbe a
      synchronous transaction or not. Consolidate these all into a single
      function, a single reserveration and a sync flag and call it
      xfs_sync_sb().
      
      Also, xfs_mod_sb() is not really a modification function - it's the
      operation of logging the superblock buffer. hence change the name of
      it to reflect this.
      
      Note that we have to change the mp->m_update_flags that are passed
      around at mount time to a boolean simply to indicate a superblock
      update is needed.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      61e63ecb
    • D
      xfs: remove bitfield based superblock updates · 4d11a402
      Dave Chinner 提交于
      When we log changes to the superblock, we first have to write them
      to the on-disk buffer, and then log that. Right now we have a
      complex bitfield based arrangement to only write the modified field
      to the buffer before we log it.
      
      This used to be necessary as a performance optimisation because we
      logged the superblock buffer in every extent or inode allocation or
      freeing, and so performance was extremely important. We haven't done
      this for years, however, ever since the lazy superblock counters
      pulled the superblock logging out of the transaction commit
      fast path.
      
      Hence we have a bunch of complexity that is not necessary that makes
      writing the in-core superblock to disk much more complex than it
      needs to be. We only need to log the superblock now during
      management operations (e.g. during mount, unmount or quota control
      operations) so it is not a performance critical path anymore.
      
      As such, remove the complex field based logging mechanism and
      replace it with a simple conversion function similar to what we use
      for all other on-disk structures.
      
      This means we always log the entirity of the superblock, but again
      because we rarely modify the superblock this is not an issue for log
      bandwidth or CPU time. Indeed, if we do log the superblock
      frequently, delayed logging will minimise the impact of this
      overhead.
      
      [Fixed gquota/pquota inode sharing regression noticed by bfoster.]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      4d11a402
  4. 28 11月, 2014 3 次提交
    • C
      xfs: merge xfs_ag.h into xfs_format.h · 4fb6e8ad
      Christoph Hellwig 提交于
      More on-disk format consolidation.  A few declarations that weren't on-disk
      format related move into better suitable spots.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      4fb6e8ad
    • B
      xfs: allow lazy sb counter sync during filesystem freeze sequence · 91ee575f
      Brian Foster 提交于
      The expectation since the introduction the lazy superblock counters is
      that the counters are synced and superblock logged appropriately as part
      of the filesystem freeze sequence. This does not occur, however, due to
      the logic in xfs_fs_writable() that prevents progress when the fs is in
      any state other than SB_UNFROZEN.
      
      While this is a bug, it has not been exposed to date because the last
      thing XFS does during freeze is dirty the log. The log recovery process
      recalculates the counters from AGI/AGF metadata to ensure everything is
      correct. Therefore should a crash occur while an fs is frozen, the
      subsequent log recovery puts everything back in order. See the following
      commit for reference:
      
      	92821e2b [XFS] Lazy Superblock Counters
      
      We might not always want to rely on dirtying the log on a frozen fs.
      Modify xfs_log_sbcount() to proceed when the filesystem is freezing but
      not once the freeze process has completed. Modify xfs_fs_writable() to
      accept the minimum freeze level for which modifications should be
      blocked to support various codepaths.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      91ee575f
    • B
      xfs: replace global xfslogd wq with per-mount wq · 78c931b8
      Brian Foster 提交于
      The xfslogd workqueue is a global, single-job workqueue for buffer ioend
      processing. This means we allow for a single work item at a time for all
      possible XFS mounts on a system. fsstress testing in loopback XFS over
      XFS configurations has reproduced xfslogd deadlocks due to the single
      threaded nature of the queue and dependencies introduced between the
      separate XFS instances by online discard (-o discard).
      
      Discard over a loopback device converts the discard request to a hole
      punch (fallocate) on the underlying file. Online discard requests are
      issued synchronously and from xfslogd context in XFS, hence the xfslogd
      workqueue is blocked in the upper fs waiting on a hole punch request to
      be servied in the lower fs. If the lower fs issues I/O that depends on
      xfslogd to complete, both filesystems end up hung indefinitely. This is
      reproduced reliabily by generic/013 on XFS->loop->XFS test devices with
      the '-o discard' mount option.
      
      Further, docker implementations appear to use this kind of configuration
      for container instance filesystems by default (container fs->dm->
      loop->base fs) and therefore are subject to this deadlock when running
      on XFS.
      
      Replace the global xfslogd workqueue with a per-mount variant. This
      guarantees each mount access to a single worker and prevents deadlocks
      due to inter-fs dependencies introduced by discard. Since the queue is
      only responsible for buffer iodone processing at this point in time,
      rename xfslogd to xfs-buf.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      78c931b8
  5. 15 7月, 2014 1 次提交
    • B
      xfs: add xfs_mount sysfs kobject · a31b1d3d
      Brian Foster 提交于
      Embed a base kobject into xfs_mount. This creates a kobject associated
      with each XFS mount and a subdirectory in sysfs with the name of the
      filesystem. The subdirectory lifecycle matches that of the mount. Also
      add the new xfs_sysfs.[c,h] source files with some XFS sysfs
      infrastructure to facilitate attribute creation.
      
      Note that there are currently no attributes exported as part of the
      xfs_mount kobject. It exists solely to serve as a per-mount container
      for child objects.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      a31b1d3d
  6. 06 6月, 2014 6 次提交
  7. 18 11月, 2013 1 次提交
    • D
      xfs: increase inode cluster size for v5 filesystems · 8f80587b
      Dave Chinner 提交于
      v5 filesystems use 512 byte inodes as a minimum, so read inodes in
      clusters that are effectively half the size of a v4 filesystem with
      256 byte inodes. For v5 fielsystems, scale the inode cluster size
      with the size of the inode so that we keep a constant 32 inodes per
      cluster ratio for all inode IO.
      
      This only works if mkfs.xfs sets the inode alignment appropriately
      for larger inode clusters, so this functionality is made conditional
      on mkfs doing the right thing. xfs_repair needs to know about
      the inode alignment changes, too.
      
      Wall time:
      	create	bulkstat	find+stat	ls -R	unlink
      v4	237s	161s		173s		201s	299s
      v5	235s	163s		205s		 31s	356s
      patched	234s	160s		182s		 29s	317s
      
      System time:
      	create	bulkstat	find+stat	ls -R	unlink
      v4	2601s	2490s		1653s		1656s	2960s
      v5	2637s	2497s		1681s		  20s	3216s
      patched	2613s	2451s		1658s		  20s	3007s
      
      So, wall time same or down across the board, system time same or
      down across the board, and cache hit rates all improve except for
      the ls -R case which is a pure cold cache directory read workload
      on v5 filesystems...
      
      So, this patch removes most of the performance and CPU usage
      differential between v4 and v5 filesystems on traversal related
      workloads.
      
      Note: while this patch is currently for v5 filesystems only, there
      is no reason it can't be ported back to v4 filesystems.  This hasn't
      been done here because bringing the code back to v4 requires
      forwards and backwards kernel compatibility testing.  i.e. to
      deterine if older kernels(*) do the right thing with larger inode
      alignments but still only using 8k inode cluster sizes. None of this
      testing and validation on v4 filesystems has been done, so for the
      moment larger inode clusters is limited to v5 superblocks.
      
      (*) a current default config v4 filesystem should mount just fine on
      2.6.23 (when lazy-count support was introduced), and so if we change
      the alignment emitted by mkfs without a feature bit then we have to
      make sure it works properly on all kernels since 2.6.23. And if we
      allow it to be changed when the lazy-count bit is not set, then it's
      all kernels since v2 logs were introduced that need to be tested for
      compatibility...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      8f80587b
  8. 31 10月, 2013 2 次提交
    • D
      xfs: vectorise DA btree operations · 4bceb18f
      Dave Chinner 提交于
      The remaining non-vectorised code for the directory structure is the
      node format blocks. This is shared with the attribute tree, and so
      is slightly more complex to vectorise.
      
      Introduce a "non-directory" directory ops structure that is attached
      to all non-directory inodes so that attribute operations can be
      vectorised for all inodes.
      
      Once we do this, we can vectorise all the da btree operations.
      Because this patch adds more infrastructure than it removes the
      binary size does not decrease:
      
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
       792350   96802    1096  890248   d9588 fs/xfs/xfs.o.p2
       789293   96802    1096  887191   d8997 fs/xfs/xfs.o.p3
       789005   96802    1096  886903   d8997 fs/xfs/xfs.o.p4
       789061   96802    1096  886959   d88af fs/xfs/xfs.o.p5
       789733   96802    1096  887631   d8b4f fs/xfs/xfs.o.p6
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      4bceb18f
    • D
      xfs: abstract the differences in dir2/dir3 via an ops vector · 32c5483a
      Dave Chinner 提交于
      Lots of the dir code now goes through switches to determine what is
      the correct on-disk format to parse. It generally involves a
      "xfs_sbversion_hasfoo" check, deferencing the superblock version and
      feature fields and hence touching several cache lines per operation
      in the process. Some operations do multiple checks because they nest
      conditional operations and they don't pass the information in a
      direct fashion between each other.
      
      Hence, add an ops vector to the xfs_inode structure that is
      configured when the inode is initialised to point to all the correct
      decode and encoding operations.  This will significantly reduce the
      branchiness and cacheline footprint of the directory object decoding
      and encoding.
      
      This is the first patch in a series of conversion patches. It will
      introduce the ops structure, the setup of it and add the first
      operation to the vector. Subsequent patches will convert directory
      ops one at a time to keep the changes simple and obvious.
      
      Just this patch shows the benefit of such an approach on code size.
      Just converting the two shortform dir operations as this patch does
      decreases the built binary size by ~1500 bytes:
      
      $ size fs/xfs/xfs.o.orig fs/xfs/xfs.o.p1
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
      $
      
      That's a significant decrease in the instruction cache footprint of
      the directory code for such a simple change, and indicates that this
      approach is definitely worth pursuing further.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      32c5483a
  9. 13 8月, 2013 4 次提交
  10. 20 6月, 2013 1 次提交
    • J
      xfs: Remove XFS_MOUNT_RETERR · 39a45d84
      Jie Liu 提交于
      XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if
      mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)"
      branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so
      we always be emitting a warning and returning an error.
      
      Hence, we can remove it and get rid of a couple of redundant
      check up against it at xfs_upate_alignment().
      
      Thanks Dave Chinner for the suggestions of simplify the code
      in xfs_parseargs().
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      39a45d84
  11. 18 6月, 2013 1 次提交
  12. 28 4月, 2013 1 次提交
    • D
      xfs: add CRC checks to the superblock · 04a1e6c5
      Dave Chinner 提交于
      With the addition of CRCs, there is such a wide and varied change to
      the on disk format that it makes sense to bump the superblock
      version number rather than try to use feature bits for all the new
      functionality.
      
      This commit introduces all the new superblock fields needed for all
      the new functionality: feature masks similar to ext4, separate
      project quota inodes, a LSN field for recovery and the CRC field.
      
      This commit does not bump the superblock version number, however.
      That will be done as a separate commit at the end of the series
      after all the new functionality is present so we switch it all on in
      one commit. This means that we can slowly introduce the changes
      without them being active and hence maintain bisectability of the
      tree.
      
      This patch is based on a patch originally written by myself back
      from SGI days, which was subsequently modified by Christoph Hellwig.
      There is relatively little of that patch remaining, but the history
      of the patch still should be acknowledged here.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      04a1e6c5
  13. 15 3月, 2013 1 次提交
  14. 02 2月, 2013 7 次提交
  15. 16 11月, 2012 3 次提交
    • D
      xfs: convert buffer verifiers to an ops structure. · 1813dd64
      Dave Chinner 提交于
      To separate the verifiers from iodone functions and associate read
      and write verifiers at the same time, introduce a buffer verifier
      operations structure to the xfs_buf.
      
      This avoids the need for assigning the write verifier, clearing the
      iodone function and re-running ioend processing in the read
      verifier, and gets rid of the nasty "b_pre_io" name for the write
      verifier function pointer. If we ever need to, it will also be
      easier to add further content specific callbacks to a buffer with an
      ops structure in place.
      
      We also avoid needing to export verifier functions, instead we
      can simply export the ops structures for those that are needed
      outside the function they are defined in.
      
      This patch also fixes a directory block readahead verifier issue
      it exposed.
      
      This patch also adds ops callbacks to the inode/alloc btree blocks
      initialised by growfs. These will need more work before they will
      work with CRCs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NPhil White <pwhite@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      1813dd64
    • D
      xfs: connect up write verifiers to new buffers · b0f539de
      Dave Chinner 提交于
      Metadata buffers that are read from disk have write verifiers
      already attached to them, but newly allocated buffers do not. Add
      appropriate write verifiers to all new metadata buffers.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      b0f539de
    • D
      xfs: verify superblocks as they are read from disk · 98021821
      Dave Chinner 提交于
      Add a superblock verify callback function and pass it into the
      buffer read functions. Remove the now redundant verification code
      that is currently in use.
      
      Adding verification shows that secondary superblocks never have
      their "sb_inprogress" flag cleared by mkfs.xfs, so when validating
      the secondary superblocks during a grow operation we have to avoid
      checking this field. Even if we fix mkfs, we will still have to
      ignore this field for verification purposes unless a version of mkfs
      that does not have this bug was used.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NPhil White <pwhite@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      98021821
  16. 09 11月, 2012 1 次提交
  17. 18 10月, 2012 2 次提交