1. 28 11月, 2014 3 次提交
    • C
      xfs: merge xfs_ag.h into xfs_format.h · 4fb6e8ad
      Christoph Hellwig 提交于
      More on-disk format consolidation.  A few declarations that weren't on-disk
      format related move into better suitable spots.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      4fb6e8ad
    • B
      xfs: allow lazy sb counter sync during filesystem freeze sequence · 91ee575f
      Brian Foster 提交于
      The expectation since the introduction the lazy superblock counters is
      that the counters are synced and superblock logged appropriately as part
      of the filesystem freeze sequence. This does not occur, however, due to
      the logic in xfs_fs_writable() that prevents progress when the fs is in
      any state other than SB_UNFROZEN.
      
      While this is a bug, it has not been exposed to date because the last
      thing XFS does during freeze is dirty the log. The log recovery process
      recalculates the counters from AGI/AGF metadata to ensure everything is
      correct. Therefore should a crash occur while an fs is frozen, the
      subsequent log recovery puts everything back in order. See the following
      commit for reference:
      
      	92821e2b [XFS] Lazy Superblock Counters
      
      We might not always want to rely on dirtying the log on a frozen fs.
      Modify xfs_log_sbcount() to proceed when the filesystem is freezing but
      not once the freeze process has completed. Modify xfs_fs_writable() to
      accept the minimum freeze level for which modifications should be
      blocked to support various codepaths.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      91ee575f
    • B
      xfs: replace global xfslogd wq with per-mount wq · 78c931b8
      Brian Foster 提交于
      The xfslogd workqueue is a global, single-job workqueue for buffer ioend
      processing. This means we allow for a single work item at a time for all
      possible XFS mounts on a system. fsstress testing in loopback XFS over
      XFS configurations has reproduced xfslogd deadlocks due to the single
      threaded nature of the queue and dependencies introduced between the
      separate XFS instances by online discard (-o discard).
      
      Discard over a loopback device converts the discard request to a hole
      punch (fallocate) on the underlying file. Online discard requests are
      issued synchronously and from xfslogd context in XFS, hence the xfslogd
      workqueue is blocked in the upper fs waiting on a hole punch request to
      be servied in the lower fs. If the lower fs issues I/O that depends on
      xfslogd to complete, both filesystems end up hung indefinitely. This is
      reproduced reliabily by generic/013 on XFS->loop->XFS test devices with
      the '-o discard' mount option.
      
      Further, docker implementations appear to use this kind of configuration
      for container instance filesystems by default (container fs->dm->
      loop->base fs) and therefore are subject to this deadlock when running
      on XFS.
      
      Replace the global xfslogd workqueue with a per-mount variant. This
      guarantees each mount access to a single worker and prevents deadlocks
      due to inter-fs dependencies introduced by discard. Since the queue is
      only responsible for buffer iodone processing at this point in time,
      rename xfslogd to xfs-buf.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      78c931b8
  2. 15 7月, 2014 1 次提交
    • B
      xfs: add xfs_mount sysfs kobject · a31b1d3d
      Brian Foster 提交于
      Embed a base kobject into xfs_mount. This creates a kobject associated
      with each XFS mount and a subdirectory in sysfs with the name of the
      filesystem. The subdirectory lifecycle matches that of the mount. Also
      add the new xfs_sysfs.[c,h] source files with some XFS sysfs
      infrastructure to facilitate attribute creation.
      
      Note that there are currently no attributes exported as part of the
      xfs_mount kobject. It exists solely to serve as a per-mount container
      for child objects.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      a31b1d3d
  3. 06 6月, 2014 6 次提交
  4. 18 11月, 2013 1 次提交
    • D
      xfs: increase inode cluster size for v5 filesystems · 8f80587b
      Dave Chinner 提交于
      v5 filesystems use 512 byte inodes as a minimum, so read inodes in
      clusters that are effectively half the size of a v4 filesystem with
      256 byte inodes. For v5 fielsystems, scale the inode cluster size
      with the size of the inode so that we keep a constant 32 inodes per
      cluster ratio for all inode IO.
      
      This only works if mkfs.xfs sets the inode alignment appropriately
      for larger inode clusters, so this functionality is made conditional
      on mkfs doing the right thing. xfs_repair needs to know about
      the inode alignment changes, too.
      
      Wall time:
      	create	bulkstat	find+stat	ls -R	unlink
      v4	237s	161s		173s		201s	299s
      v5	235s	163s		205s		 31s	356s
      patched	234s	160s		182s		 29s	317s
      
      System time:
      	create	bulkstat	find+stat	ls -R	unlink
      v4	2601s	2490s		1653s		1656s	2960s
      v5	2637s	2497s		1681s		  20s	3216s
      patched	2613s	2451s		1658s		  20s	3007s
      
      So, wall time same or down across the board, system time same or
      down across the board, and cache hit rates all improve except for
      the ls -R case which is a pure cold cache directory read workload
      on v5 filesystems...
      
      So, this patch removes most of the performance and CPU usage
      differential between v4 and v5 filesystems on traversal related
      workloads.
      
      Note: while this patch is currently for v5 filesystems only, there
      is no reason it can't be ported back to v4 filesystems.  This hasn't
      been done here because bringing the code back to v4 requires
      forwards and backwards kernel compatibility testing.  i.e. to
      deterine if older kernels(*) do the right thing with larger inode
      alignments but still only using 8k inode cluster sizes. None of this
      testing and validation on v4 filesystems has been done, so for the
      moment larger inode clusters is limited to v5 superblocks.
      
      (*) a current default config v4 filesystem should mount just fine on
      2.6.23 (when lazy-count support was introduced), and so if we change
      the alignment emitted by mkfs without a feature bit then we have to
      make sure it works properly on all kernels since 2.6.23. And if we
      allow it to be changed when the lazy-count bit is not set, then it's
      all kernels since v2 logs were introduced that need to be tested for
      compatibility...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      8f80587b
  5. 31 10月, 2013 2 次提交
    • D
      xfs: vectorise DA btree operations · 4bceb18f
      Dave Chinner 提交于
      The remaining non-vectorised code for the directory structure is the
      node format blocks. This is shared with the attribute tree, and so
      is slightly more complex to vectorise.
      
      Introduce a "non-directory" directory ops structure that is attached
      to all non-directory inodes so that attribute operations can be
      vectorised for all inodes.
      
      Once we do this, we can vectorise all the da btree operations.
      Because this patch adds more infrastructure than it removes the
      binary size does not decrease:
      
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
       792350   96802    1096  890248   d9588 fs/xfs/xfs.o.p2
       789293   96802    1096  887191   d8997 fs/xfs/xfs.o.p3
       789005   96802    1096  886903   d8997 fs/xfs/xfs.o.p4
       789061   96802    1096  886959   d88af fs/xfs/xfs.o.p5
       789733   96802    1096  887631   d8b4f fs/xfs/xfs.o.p6
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      4bceb18f
    • D
      xfs: abstract the differences in dir2/dir3 via an ops vector · 32c5483a
      Dave Chinner 提交于
      Lots of the dir code now goes through switches to determine what is
      the correct on-disk format to parse. It generally involves a
      "xfs_sbversion_hasfoo" check, deferencing the superblock version and
      feature fields and hence touching several cache lines per operation
      in the process. Some operations do multiple checks because they nest
      conditional operations and they don't pass the information in a
      direct fashion between each other.
      
      Hence, add an ops vector to the xfs_inode structure that is
      configured when the inode is initialised to point to all the correct
      decode and encoding operations.  This will significantly reduce the
      branchiness and cacheline footprint of the directory object decoding
      and encoding.
      
      This is the first patch in a series of conversion patches. It will
      introduce the ops structure, the setup of it and add the first
      operation to the vector. Subsequent patches will convert directory
      ops one at a time to keep the changes simple and obvious.
      
      Just this patch shows the benefit of such an approach on code size.
      Just converting the two shortform dir operations as this patch does
      decreases the built binary size by ~1500 bytes:
      
      $ size fs/xfs/xfs.o.orig fs/xfs/xfs.o.p1
         text    data     bss     dec     hex filename
       794490   96802    1096  892388   d9de4 fs/xfs/xfs.o.orig
       792986   96802    1096  890884   d9804 fs/xfs/xfs.o.p1
      $
      
      That's a significant decrease in the instruction cache footprint of
      the directory code for such a simple change, and indicates that this
      approach is definitely worth pursuing further.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      32c5483a
  6. 13 8月, 2013 4 次提交
  7. 20 6月, 2013 1 次提交
    • J
      xfs: Remove XFS_MOUNT_RETERR · 39a45d84
      Jie Liu 提交于
      XFS_MOUNT_RETERR is going to be set at xfs_parseargs() if
      mp->m_dalign is enabled, so any time we enter "if (mp->m_dalign)"
      branch in xfs_update_alignment(), XFS_MOUNT_RETERR is set and so
      we always be emitting a warning and returning an error.
      
      Hence, we can remove it and get rid of a couple of redundant
      check up against it at xfs_upate_alignment().
      
      Thanks Dave Chinner for the suggestions of simplify the code
      in xfs_parseargs().
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      39a45d84
  8. 18 6月, 2013 1 次提交
  9. 28 4月, 2013 1 次提交
    • D
      xfs: add CRC checks to the superblock · 04a1e6c5
      Dave Chinner 提交于
      With the addition of CRCs, there is such a wide and varied change to
      the on disk format that it makes sense to bump the superblock
      version number rather than try to use feature bits for all the new
      functionality.
      
      This commit introduces all the new superblock fields needed for all
      the new functionality: feature masks similar to ext4, separate
      project quota inodes, a LSN field for recovery and the CRC field.
      
      This commit does not bump the superblock version number, however.
      That will be done as a separate commit at the end of the series
      after all the new functionality is present so we switch it all on in
      one commit. This means that we can slowly introduce the changes
      without them being active and hence maintain bisectability of the
      tree.
      
      This patch is based on a patch originally written by myself back
      from SGI days, which was subsequently modified by Christoph Hellwig.
      There is relatively little of that patch remaining, but the history
      of the patch still should be acknowledged here.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      04a1e6c5
  10. 15 3月, 2013 1 次提交
  11. 02 2月, 2013 7 次提交
  12. 16 11月, 2012 3 次提交
    • D
      xfs: convert buffer verifiers to an ops structure. · 1813dd64
      Dave Chinner 提交于
      To separate the verifiers from iodone functions and associate read
      and write verifiers at the same time, introduce a buffer verifier
      operations structure to the xfs_buf.
      
      This avoids the need for assigning the write verifier, clearing the
      iodone function and re-running ioend processing in the read
      verifier, and gets rid of the nasty "b_pre_io" name for the write
      verifier function pointer. If we ever need to, it will also be
      easier to add further content specific callbacks to a buffer with an
      ops structure in place.
      
      We also avoid needing to export verifier functions, instead we
      can simply export the ops structures for those that are needed
      outside the function they are defined in.
      
      This patch also fixes a directory block readahead verifier issue
      it exposed.
      
      This patch also adds ops callbacks to the inode/alloc btree blocks
      initialised by growfs. These will need more work before they will
      work with CRCs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NPhil White <pwhite@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      1813dd64
    • D
      xfs: connect up write verifiers to new buffers · b0f539de
      Dave Chinner 提交于
      Metadata buffers that are read from disk have write verifiers
      already attached to them, but newly allocated buffers do not. Add
      appropriate write verifiers to all new metadata buffers.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      b0f539de
    • D
      xfs: verify superblocks as they are read from disk · 98021821
      Dave Chinner 提交于
      Add a superblock verify callback function and pass it into the
      buffer read functions. Remove the now redundant verification code
      that is currently in use.
      
      Adding verification shows that secondary superblocks never have
      their "sb_inprogress" flag cleared by mkfs.xfs, so when validating
      the secondary superblocks during a grow operation we have to avoid
      checking this field. Even if we fix mkfs, we will still have to
      ignore this field for verification purposes unless a version of mkfs
      that does not have this bug was used.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NPhil White <pwhite@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      98021821
  13. 09 11月, 2012 1 次提交
  14. 18 10月, 2012 4 次提交
    • D
      xfs: rename xfs_sync.[ch] to xfs_icache.[ch] · 6d8b79cf
      Dave Chinner 提交于
      xfs_sync.c now only contains inode reclaim functions and inode cache
      iteration functions. It is not related to sync operations anymore.
      Rename to xfs_icache.c to reflect it's contents and prepare for
      consolidation with the other inode cache file that exists
      (xfs_iget.c).
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      6d8b79cf
    • D
      xfs: syncd workqueue is no more · 5889608d
      Dave Chinner 提交于
      With the syncd functions moved to the log and/or removed, the syncd
      workqueue is the only remaining bit left. It is used by the log
      covering/ail pushing work, as well as by the inode reclaim work.
      
      Given how cheap workqueues are these days, give the log and inode
      reclaim work their own work queues and kill the syncd work queue.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      5889608d
    • D
      xfs: xfs_sync_data is redundant. · 9aa05000
      Dave Chinner 提交于
      We don't do any data writeback from XFS any more - the VFS is
      completely responsible for that, including for freeze. We can
      replace the remaining caller with a VFS level function that
      achieves the same thing, but without conflicting with current
      writeback work.
      
      This means we can remove the flush_work and xfs_flush_inodes() - the
      VFS functionality completely replaces the internal flush queue for
      doing this writeback work in a separate context to avoid stack
      overruns.
      
      This does have one complication - it cannot be called with page
      locks held.  Hence move the flushing of delalloc space when ENOSPC
      occurs back up into xfs_file_aio_buffered_write when we don't hold
      any locks that will stall writeback.
      
      Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to
      trigger delalloc conversion fast enough to prevent spurious ENOSPC
      whent here are hundreds of writers, thousands of small files and GBs
      of free RAM.  Hence we need to use sync_sb_inodes() to block callers
      while we wait for writeback like the previous xfs_flush_inodes
      implementation did.
      
      That means we have to hold the s_umount lock here, but because this
      call can nest inside i_mutex (the parent directory in the create
      case, held by the VFS), we have to use down_read_trylock() to avoid
      potential deadlocks. In practice, this trylock will succeed on
      almost every attempt as unmount/remount type operations are
      exceedingly rare.
      
      Note: we always need to pass a count of zero to
      generic_file_buffered_write() as the previously written byte count.
      We only do this by accident before this patch by the virtue of ret
      always being zero when there are no errors. Make this explicit
      rather than needing to specifically zero ret in the ENOSPC retry
      case.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Tested-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      9aa05000
    • D
      xfs: sync work is now only periodic log work · f661f1e0
      Dave Chinner 提交于
      The only thing the periodic sync work does now is flush the AIL and
      idle the log. These are really functions of the log code, so move
      the work to xfs_log.c and rename it appropriately.
      
      The only wart that this leaves behind is the xfssyncd_centisecs
      sysctl, otherwise the xfssyncd is dead. Clean up any comments that
      related to xfssyncd to reflect it's passing.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f661f1e0
  15. 17 8月, 2012 1 次提交
    • A
      xfs: kill struct declarations in xfs_mount.h · 1ed845df
      Alex Elder 提交于
      I noticed that "struct xfs_mount_args" was still declared in
      "fs/xfs/xfs_mount.h".  That struct doesn't even exist any more (and
      is obviously not referenced elsewhere in that header file).  While
      in there, delete four other unneeded struct declarations in that
      file.
      
      Doing so highlights that "fs/xfs/xfs_trace.h" was relying indirectly
      on "xfs_mount.h" to be #included in order to declare "struct
      xfs_bmbt_irec", so add that declaration to resolve that issue.
      Signed-off-by: NAlex Elder <elder@inktank.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      1ed845df
  16. 31 7月, 2012 1 次提交
    • J
      xfs: Convert to new freezing code · d9457dc0
      Jan Kara 提交于
      Generic code now blocks all writers from standard write paths. So we add
      blocking of all writers coming from ioctl (we get a protection of ioctl against
      racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
      non-racy freeze protection. We also keep freeze protection on transaction
      start to block internal filesystem writes such as removal of preallocated
      blocks.
      
      CC: Ben Myers <bpm@sgi.com>
      CC: Alex Elder <elder@kernel.org>
      CC: xfs@oss.sgi.com
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d9457dc0
  17. 22 6月, 2012 2 次提交