1. 22 7月, 2012 2 次提交
  2. 15 5月, 2012 3 次提交
    • D
      xfs: move xfs_get_extsz_hint() and kill xfs_rw.h · 2a0ec1d9
      Dave Chinner 提交于
      The only thing left in xfs_rw.h is a function prototype for an inode
      function.  Move that to xfs_inode.h, and kill xfs_rw.h.
      
      Also move the function implementing the prototype from xfs_rw.c to
      xfs_inode.c so we only have one function left in xfs_rw.c
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      2a0ec1d9
    • C
      xfs: on-stack delayed write buffer lists · 43ff2122
      Christoph Hellwig 提交于
      Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
      and write back the buffers per-process instead of by waking up xfsbufd.
      
      This is now easily doable given that we have very few places left that write
      delwri buffers:
      
       - log recovery:
      	Only done at mount time, and already forcing out the buffers
      	synchronously using xfs_flush_buftarg
      
       - quotacheck:
      	Same story.
      
       - dquot reclaim:
      	Writes out dirty dquots on the LRU under memory pressure.  We might
      	want to look into doing more of this via xfsaild, but it's already
      	more optimal than the synchronous inode reclaim that writes each
      	buffer synchronously.
      
       - xfsaild:
      	This is the main beneficiary of the change.  By keeping a local list
      	of buffers to write we reduce latency of writing out buffers, and
      	more importably we can remove all the delwri list promotions which
      	were hitting the buffer cache hard under sustained metadata loads.
      
      The implementation is very straight forward - xfs_buf_delwri_queue now gets
      a new list_head pointer that it adds the delwri buffers to, and all callers
      need to eventually submit the list using xfs_buf_delwi_submit or
      xfs_buf_delwi_submit_nowait.  Buffers that already are on a delwri list are
      skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
      list.  The biggest change to pass down the buffer list was done to the AIL
      pushing. Now that we operate on buffers the trylock, push and pushbuf log
      item methods are merged into a single push routine, which tries to lock the
      item, and if possible add the buffer that needs writeback to the buffer list.
      This leads to much simpler code than the previous split but requires the
      individual IOP_PUSH instances to unlock and reacquire the AIL around calls
      to blocking routines.
      
      Given that xfsailds now also handle writing out buffers, the conditions for
      log forcing and the sleep times needed some small changes.  The most
      important one is that we consider an AIL busy as long we still have buffers
      to push, and the other one is that we do increment the pushed LSN for
      buffers that are under flushing at this moment, but still count them towards
      the stuck items for restart purposes.  Without this we could hammer on stuck
      items without ever forcing the log and not make progress under heavy random
      delete workloads on fast flash storage devices.
      
      [ Dave Chinner:
      	- rebase on previous patches.
      	- improved comments for XBF_DELWRI_Q handling
      	- fix XBF_ASYNC handling in queue submission (test 106 failure)
      	- rename delwri submit function buffer list parameters for clarity
      	- xfs_efd_item_push() should return XFS_ITEM_PINNED ]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      43ff2122
    • C
      xfs: do not write the buffer from xfs_iflush · 4c46819a
      Christoph Hellwig 提交于
      Instead of writing the buffer directly from inside xfs_iflush return it to
      the caller and let the caller decide what to do with the buffer.  Also
      remove the pincount check in xfs_iflush that all non-blocking callers already
      implement and the now unused flags parameter.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      4c46819a
  3. 27 3月, 2012 1 次提交
    • D
      xfs: don't cache inodes read through bulkstat · 5132ba8f
      Dave Chinner 提交于
      When we read inodes via bulkstat, we generally only read them once
      and then throw them away - they never get used again. If we retain
      them in cache, then it simply causes the working set of inodes and
      other cached items to be reclaimed just so the inode cache can grow.
      
      Avoid this problem by marking inodes read by bulkstat not to be
      cached and check this flag in .drop_inode to determine whether the
      inode should be added to the VFS LRU or not. If the inode lookup
      hits an already cached inode, then don't set the flag. If the inode
      lookup hits an inode marked with no cache flag, remove the flag and
      allow it to be cached once the current reference goes away.
      
      Inodes marked as not cached will get cleaned up by the background
      inode reclaim or via memory pressure, so they will still generate
      some short term cache pressure. They will, however, be reclaimed
      much sooner and in preference to cache hot inodes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      5132ba8f
  4. 14 3月, 2012 1 次提交
  5. 06 3月, 2012 1 次提交
  6. 23 2月, 2012 1 次提交
  7. 18 1月, 2012 6 次提交
    • C
      xfs: remove the i_new_size field in struct xfs_inode · 2813d682
      Christoph Hellwig 提交于
      Now that we use the VFS i_size field throughout XFS there is no need for the
      i_new_size field any more given that the VFS i_size field gets updated
      in ->write_end before unlocking the page, and thus is always uptodate when
      writeback could see a page.  Removing i_new_size also has the advantage that
      we will never have to trim back di_size during a failed buffered write,
      given that it never gets updated past i_size.
      
      Note that currently the generic direct I/O code only updates i_size after
      calling our end_io handler, which requires a small workaround to make
      sure di_size actually makes it to disk.  I hope to fix this properly in
      the generic code.
      
      A downside is that we lose the support for parallel non-overlapping O_DIRECT
      appending writes that recently was added.  I don't think keeping the complex
      and fragile i_new_size infrastructure for this is a good tradeoff - if we
      really care about parallel appending writers we should investigate turning
      the iolock into a range lock, which would also allow for parallel
      non-overlapping buffered writers.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      2813d682
    • C
      xfs: remove the i_size field in struct xfs_inode · ce7ae151
      Christoph Hellwig 提交于
      There is no fundamental need to keep an in-memory inode size copy in the XFS
      inode.  We already have the on-disk value in the dinode, and the separate
      in-memory copy that we need for regular files only in the XFS inode.
      
      Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
      VFS inode i_size field for regular files.  Switch code that was directly
      accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
      we are limited to regular files direct access of the VFS inode i_size field.
      
      This also allows dropping some fairly complicated code in the write path
      which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
      that is getting updated inside ->write_end.
      
      Note that we do not bother resetting the VFS i_size when truncating a file
      that gets freed to zero as there is no point in doing so because the VFS inode
      is no longer in use at this point.  Just relax the assert in xfs_ifree to
      only check the on-disk size instead.
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      ce7ae151
    • C
      xfs: replace i_pin_wait with a bit waitqueue · f392e631
      Christoph Hellwig 提交于
      Replace i_pin_wait, which is only used during synchronous inode flushing
      with a bit waitqueue.  This trades off a much smaller inode against
      slightly slower wakeup performance, and saves 12 (32-bit) or 20 (64-bit)
      bytes in the XFS inode.
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f392e631
    • C
      xfs: replace i_flock with a sleeping bitlock · 474fce06
      Christoph Hellwig 提交于
      We almost never block on i_flock, the exception is synchronous inode
      flushing.  Instead of bloating the inode with a 16/24-byte completion
      that we abuse as a semaphore just implement it as a bitlock that uses
      a bit waitqueue for the rare sleeping path.  This primarily is a
      tradeoff between a much smaller inode and a faster non-blocking
      path vs faster wakeups, and we are much better off with the former.
      
      A small downside is that we will lose lockdep checking for i_flock, but
      given that it's always taken inside the ilock that should be acceptable.
      
      Note that for example the inode writeback locking is implemented in a
      very similar way.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      474fce06
    • C
      xfs: make i_flags an unsigned long · 49e4c70e
      Christoph Hellwig 提交于
      To be used for bit wakeup i_flags needs to be an unsigned long or we'll
      run into trouble on big endian systems.  Because of the 1-byte i_update
      field right after it this actually causes a fairly large size increase
      on its own (4 or 8 bytes), but that increase will be more than offset
      by the next two patches.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      49e4c70e
    • C
      xfs: remove the if_ext_max field in struct xfs_ifork · 8096b1eb
      Christoph Hellwig 提交于
      We spent a lot of effort to maintain this field, but it always equals to the
      fork size divided by the constant size of an extent.  The prime use of it is
      to assert that the two stay in sync.  Just divide the fork size by the extent
      size in the few places that we actually use it and remove the overhead
      of maintaining it.  Also introduce a few helpers to consolidate the places
      where we actually care about the value.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      8096b1eb
  8. 14 1月, 2012 2 次提交
    • C
      xfs: remove the unused dm_attrs structure · 3d2b3129
      Christoph Hellwig 提交于
      .. and the just as dead bhv_desc forward declaration while we're at it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      3d2b3129
    • C
      xfs: remove xfs_itruncate_data · 673e8e59
      Christoph Hellwig 提交于
      This wrapper isn't overly useful, not to say rather confusing.
      
      Around the call to xfs_itruncate_extents it does:
      
       - add tracing
       - add a few asserts in debug builds
       - conditionally update the inode size in two places
       - log the inode
      
      Both the tracing and the inode logging can be moved to xfs_itruncate_extents
      as they are useful for the attribute fork as well - in fact the attr code
      already does an equivalent xfs_trans_log_inode call just after calling
      xfs_itruncate_extents.  The conditional size updates are a mess, and there
      was no reason to do them in two places anyway, as the first one was
      conditional on the inode having extents - but without extents we
      xfs_itruncate_extents would be a no-op and the placement wouldn't matter
      anyway.  Instead move the size assignments and the asserts that make sense
      to the callers that want it.
      
      As a side effect of this clean up xfs_setattr_size by introducing variables
      for the old and new inode size, and moving the size updates into a common
      place.
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      673e8e59
  9. 04 1月, 2012 1 次提交
  10. 30 11月, 2011 1 次提交
  11. 12 10月, 2011 1 次提交
  12. 27 7月, 2011 1 次提交
  13. 08 7月, 2011 3 次提交
    • C
      xfs: remove i_transp · f3ca8738
      Christoph Hellwig 提交于
      Remove the transaction pointer in the inode.  It's only used to avoid
      passing down an argument in the bmap code, and for a few asserts in
      the transaction code right now.
      
      Also use the local variable ip in a few more places in xfs_inode_item_unlock,
      so that it isn't only used for debug builds after the above change.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      f3ca8738
    • C
      xfs: split xfs_itruncate_finish · 8f04c47a
      Christoph Hellwig 提交于
      Split the guts of xfs_itruncate_finish that loop over the existing extents
      and calls xfs_bunmapi on them into a new helper, xfs_itruncate_externs.
      Make xfs_attr_inactive call it directly instead of xfs_itruncate_finish,
      which allows to simplify the latter a lot, by only letting it deal with
      the data fork.  As a result xfs_itruncate_finish is renamed to
      xfs_itruncate_data to make its use case more obvious.
      
      Also remove the sync parameter from xfs_itruncate_data, which has been
      unessecary since the introduction of the busy extent list in 2002, and
      completely dead code since 2003 when the XFS_BMAPI_ASYNC parameter was
      made a no-op.
      
      I can't actually see why the xfs_attr_inactive needs to set the transaction
      sync, but let's keep this patch simple and without changes in behaviour.
      
      Also avoid passing a useless argument to xfs_isize_check, and make it
      private to xfs_inode.c.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      8f04c47a
    • C
      xfs: kill xfs_itruncate_start · 857b9778
      Christoph Hellwig 提交于
      xfs_itruncate_start is a rather length wrapper that evaluates to a call
      to xfs_ioend_wait and xfs_tosspages, and only has two callers.
      
      Instead of using the complicated checks left over from IRIX where we
      can to truncate the pagecache just call xfs_tosspages
      (aka truncate_inode_pages) directly as we want to get rid of all data
      after i_size, and truncate_inode_pages handles incorrect alignments
      and too large offsets just fine.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      857b9778
  14. 24 6月, 2011 1 次提交
    • D
      xfs: reset inode per-lifetime state when recycling it · 778e24bb
      Dave Chinner 提交于
      XFS inodes has several per-lifetime state fields that determine the
      behaviour of the inode. These state fields are not all reset when an
      inode is reused from the reclaimable state.
      
      This can lead to unexpected behaviour of the new inode such as
      speculative preallocation not being truncated away in the expected
      manner for local files until the inode is subsequently truncated,
      freed or cycles out of the cache. It can also lead to an inode being
      considered to be a filestream inode or having been truncated when
      that is not the case.
      
      Rework the reinitialisation of the inode when it is recycled to
      ensure that it is pristine before it is reused. While there, also
      fix the resetting of state flags in the recycling error paths so the
      inode does not become unreclaimable.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      778e24bb
  15. 25 5月, 2011 1 次提交
    • C
      xfs: remove if_lastex · ec90c556
      Christoph Hellwig 提交于
      The if_lastex field in struct xfs_ifork is only used as a temporary
      index during xfs_bmapi and xfs_bunmapi.  Instead of using the inode
      fork to store it keep it local in the callchain.  Fortunately this
      is very easy as we already pass a stack copy of it down the whole
      chain which can simplify be changed to be passed by reference.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      ec90c556
  16. 31 3月, 2011 1 次提交
  17. 08 2月, 2011 1 次提交
  18. 23 12月, 2010 2 次提交
    • D
      xfs: don't truncate prealloc from frequently accessed inodes · 6e857567
      Dave Chinner 提交于
      A long standing problem for streaming writeѕ through the NFS server
      has been that the NFS server opens and closes file descriptors on an
      inode for every write. The result of this behaviour is that the
      ->release() function is called on every close and that results in
      XFS truncating speculative preallocation beyond the EOF.  This has
      an adverse effect on file layout when multiple files are being
      written at the same time - they interleave their extents and can
      result in severe fragmentation.
      
      To avoid this problem, keep track of ->release calls made on a dirty
      inode. For most cases, an inode is only going to be opened once for
      writing and then closed again during it's lifetime in cache. Hence
      if there are multiple ->release calls when the inode is dirty, there
      is a good chance that the inode is being accessed by the NFS server.
      Hence set a flag the first time ->release is called while there are
      delalloc blocks still outstanding on the inode.
      
      If this flag is set when ->release is next called, then do no
      truncate away the speculative preallocation - leave it there so that
      subsequent writes do not need to reallocate the delalloc space. This
      will prevent interleaving of extents of different inodes written
      concurrently to the same AG.
      
      If we get this wrong, it is not a big deal as we truncate
      speculative allocation beyond EOF anyway in xfs_inactive() when the
      inode is thrown out of the cache.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      6e857567
    • D
      xfs: provide a inode iolock lockdep class · dcfcf205
      Dave Chinner 提交于
      The XFS iolock needs to be re-initialised to a new lock class before
      it enters reclaim to prevent lockdep false positives. Unfortunately,
      this is not sufficient protection as inodes in the XFS_IRECLAIMABLE
      state can be recycled and not re-initialised before being reused.
      
      We need to re-initialise the lock state when transfering out of
      XFS_IRECLAIMABLE state to XFS_INEW, but we need to keep the same
      class as if the inode was just allocated. Hence we need a specific
      lockdep class variable for the iolock so that both initialisations
      use the same class.
      
      While there, add a specific class for inodes in the reclaim state so
      that it is easy to tell from lockdep reports what state the inode
      was in that generated the report.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      dcfcf205
  19. 26 10月, 2010 1 次提交
  20. 19 10月, 2010 3 次提交
    • A
      xfs: Extend project quotas to support 32bit project ids · 6743099c
      Arkadiusz Mi?kiewicz 提交于
      This patch adds support for 32bit project quota identifiers.
      
      On disk format is backward compatible with 16bit projid numbers. projid
      on disk is now kept in two 16bit values - di_projid_lo (which holds the
      same position as old 16bit projid value) and new di_projid_hi (takes
      existing padding) and converts from/to 32bit value on the fly.
      
      xfs_admin (for existing fs), mkfs.xfs (for new fs) needs to be used
      to enable PROJID32BIT support.
      Signed-off-by: NArkadiusz Miśkiewicz <arekm@maven.pl>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      6743099c
    • C
      xfs: remove xfs_cred.h · 6c77b0ea
      Christoph Hellwig 提交于
      We're not actually passing around credentials inside XFS for a while
      now, so remove all xfs_cred.h with it's cred_t typedef and all
      instances of it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      6c77b0ea
    • D
      xfs: don't use vfs writeback for pure metadata modifications · dcd79a14
      Dave Chinner 提交于
      Under heavy multi-way parallel create workloads, the VFS struggles
      to write back all the inodes that have been changed in age order.
      The bdi flusher thread becomes CPU bound, spending 85% of it's time
      in the VFS code, mostly traversing the superblock dirty inode list
      to separate dirty inodes old enough to flush.
      
      We already keep an index of all metadata changes in age order - in
      the AIL - and continued log pressure will do age ordered writeback
      without any extra overhead at all. If there is no pressure on the
      log, the xfssyncd will periodically write back metadata in ascending
      disk address offset order so will be very efficient.
      
      Hence we can stop marking VFS inodes dirty during transaction commit
      or when changing timestamps during transactions. This will keep the
      inodes in the superblock dirty list to those containing data or
      unlogged metadata changes.
      
      However, the timstamp changes are slightly more complex than this -
      there are a couple of places that do unlogged updates of the
      timestamps, and the VFS need to be informed of these. Hence add a
      new function xfs_trans_ichgtime() for transactional changes,
      and leave xfs_ichgtime() for the non-transactional changes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      dcd79a14
  21. 27 7月, 2010 3 次提交
  22. 24 6月, 2010 2 次提交
  23. 02 3月, 2010 1 次提交