1. 19 Oct, 2010 19 commits
  2. 07 Oct, 2010 1 commit
    • xfs: properly account for reclaimed inodes · 081003ff
      Committed by Johannes Weiner
      When marking an inode reclaimable, a per-AG counter is increased, the
      inode is tagged reclaimable in its per-AG tree, and, when this is the
      first reclaimable inode in the AG, the AG entry in the per-mount tree
      is also tagged.
      
      When an inode is finally reclaimed, however, it is only deleted from
      the per-AG tree.  Neither the counter is decreased, nor is the parent
      tree's AG entry untagged properly.
      
      Since the tags in the per-mount tree are not cleared, the inode
      shrinker iterates over all AGs that have had reclaimable inodes at one
      point in time.
      
      The counters on the other hand signal an increasing amount of slab
      objects to reclaim.  Since "70e60ce7 xfs: convert inode shrinker to
      per-filesystem context" this is not a real issue anymore because the
      shrinker bails out after one iteration.
      
      But the problem was observable on a machine running v2.6.34, where the
      reclaimable work increased and each process going into direct reclaim
      eventually got stuck on the xfs inode shrinking path, trying to scan
      several million objects.
      
      Fix this by properly unwinding the reclaimable-state tracking of an
      inode when it is reclaimed.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: stable@kernel.org
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
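      The accounting described in this message can be sketched as a pair of
      mirrored operations. This is a minimal illustration, not the XFS code:
      the struct names, the 64-bit tag mask standing in for the per-mount
      tree, and the demo_* helpers are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the per-AG and per-mount state. */
struct demo_ag {
    unsigned int reclaimable;   /* per-AG count of reclaimable inodes */
};

struct demo_mount {
    uint64_t ag_tags;           /* bit N set: AG N has reclaimable inodes */
    struct demo_ag ags[64];
};

/* Marking: bump the counter and tag the AG on the 0 -> 1 transition. */
static void demo_mark_reclaimable(struct demo_mount *mp, unsigned int agno)
{
    if (mp->ags[agno].reclaimable++ == 0)
        mp->ag_tags |= UINT64_C(1) << agno;
}

/* Reclaim: the mirror image, i.e. the unwinding the message asks for. */
static void demo_inode_reclaimed(struct demo_mount *mp, unsigned int agno)
{
    if (--mp->ags[agno].reclaimable == 0)
        mp->ag_tags &= ~(UINT64_C(1) << agno);
}

/* A shrinker-like walker would only visit AGs for which this is true. */
static bool demo_ag_tagged(const struct demo_mount *mp, unsigned int agno)
{
    return (mp->ag_tags >> agno) & 1;
}
```

      Without the second function, the counter only grows and the tag is
      never cleared, which matches the symptoms described above.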
  3. 10 Sep, 2010 2 commits
    • xfs: log IO completion workqueue is a high priority queue · 51749e47
      Committed by Dave Chinner
      The workqueue implementation in 2.6.36-rcX has changed, resulting
      in the workqueues no longer having dedicated threads for work
      processing. This has caused severe livelocks under heavy parallel
      create workloads because the log IO completions have been getting
      held up behind metadata IO completions.  Hence log commits would
      stall, memory allocation would stall because pages could not be
      cleaned, and lock contention on the AIL during inode IO completion
      processing was being seen to slow everything down even further.
      
      By making the log IO completion workqueue a high priority workqueue,
      log IO completions are queued ahead of all data/metadata IO
      completions and processed before them. Hence the log never
      gets stalled, and operations needed to clean memory can continue as
      quickly as possible. This avoids the livelock conditions and allows
      the system to keep running under heavy load as per normal.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: prevent reading uninitialized stack memory · a122eb2f
      Committed by Dan Rosenberg
      The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12
      bytes of uninitialized stack memory, because the fsxattr struct
      declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero)
      the 12-byte fsx_pad member before copying it back to the user.  This
      patch takes care of it.
      Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
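      The class of bug described here is usually closed by zeroing the whole
      structure before filling it. The sketch below assumes a hypothetical
      struct layout and helper name; only the 12-byte trailing pad is taken
      from the message, and the actual patch may differ in detail.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout; only the 12-byte pad mirrors the commit message. */
struct demo_fsxattr {
    uint32_t xflags;
    uint32_t extsize;
    uint32_t nextents;
    uint32_t projid;
    unsigned char pad[12];
};

/*
 * Fill a structure that will later be copied out to an untrusted reader.
 * Zeroing first guarantees that the pad never carries stale stack bytes.
 */
static void demo_fill_fsxattr(struct demo_fsxattr *fa, uint32_t xflags,
                              uint32_t extsize, uint32_t nextents,
                              uint32_t projid)
{
    memset(fa, 0, sizeof(*fa));
    fa->xflags = xflags;
    fa->extsize = extsize;
    fa->nextents = nextents;
    fa->projid = projid;
}
```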
  4. 03 Sep, 2010 1 commit
    • xfs: Make fiemap work with sparse files · 9af25465
      Committed by Tao Ma
      In xfs_vn_fiemap, we set bmv_count to fi_extent_max + 1 and want
      to return fi_extent_max extents, but this does not work for
      a sparse file. The reason is that xfs_getbmap also records
      holes in 'out', while 'out' is allocated with
      bmv_count (fi_extent_max + 1) entries, which does not account for
      holes. So in the worst case, if the 'out' vector looks like
      [hole, extent, hole, extent, hole, ... hole, extent, hole],
      we will only return half of fi_extent_max extents.
      
      This patch adds a new flag, BMV_IF_NO_HOLES, for bmv_iflags.
      With this flag set, xfs_getbmap does not use an 'out' entry for
      a hole. The solution is a bit ugly: it simply does not advance the
      index of the 'out' vector. Skipping holes at the very beginning is
      not easy because of the complicated checks and functions such as
      xfs_getbmapx_fix_eof_hole that adjust 'out'.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
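      The "do not advance the index" approach can be illustrated in
      isolation. This is a simplified sketch: demo_rec, DEMO_NO_HOLES and
      demo_getbmap are hypothetical names, and the real xfs_getbmap does far
      more work per record.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mapping record: either an extent or a hole. */
struct demo_rec {
    long start;
    long len;
    bool hole;
};

#define DEMO_NO_HOLES 0x1u  /* stand-in for the new "no holes" flag */

/*
 * Copy records from in[] to out[], storing at most max of them.
 * With DEMO_NO_HOLES set, a hole is written to the current slot but the
 * index is not advanced, so the next record overwrites it.
 * Returns the number of slots that count as filled.
 */
static size_t demo_getbmap(const struct demo_rec *in, size_t nin,
                           struct demo_rec *out, size_t max,
                           unsigned int flags)
{
    size_t cur = 0;

    for (size_t i = 0; i < nin && cur < max; i++) {
        out[cur] = in[i];
        if ((flags & DEMO_NO_HOLES) && in[i].hole)
            continue;   /* slot will be reused by the next record */
        cur++;
    }
    return cur;
}
```

      With an alternating hole/extent input and room for two records, the
      plain call yields one hole and one extent, while the flagged call
      yields two extents.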
  5. 02 Sep, 2010 2 commits
    • xfs: Disallow 32bit project quota id · 23963e54
      Committed by Arkadiusz Miśkiewicz
      Currently the on-disk structure can hold only a 16-bit project quota
      id, so disallow 32-bit ones. This fixes a problem where some
      kernel structures hold the project quota id in 32-bit variables while
      the on-disk ones are 16-bit, which makes project quota member
      files inaccessible for some operations (like mv/rm).
      Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
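      A range check of the kind this message describes might look like the
      following. The helper name is hypothetical and the choice of -EINVAL as
      the error code is an assumption for the sketch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Reject project ids that do not fit the 16-bit on-disk field.
 * Returns 0 when the id is representable, -EINVAL otherwise.
 */
static int demo_check_projid(uint32_t projid)
{
    if (projid > UINT16_MAX)
        return -EINVAL;
    return 0;
}
```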
    • xfs: improve buffer cache hash scalability · 9bc08a45
      Committed by Dave Chinner
      When doing large parallel file creates on a 16p machine, a large amount of
      time is being spent in _xfs_buf_find(). A system wide profile with perf top
      shows this:
      
                1134740.00 19.3% _xfs_buf_find
                 733142.00 12.5% __ticket_spin_lock
      
      The problem is that the hash contains 45,000 buffers, and the hash table width
      is only 256 buffers. That means we've got around 200 buffers per chain, and
      searching it is quite expensive. The hash table size needs to increase.
      
      Secondly, every time we do a lookup, we promote the buffer we find to the head
      of the hash chain. This is causing cachelines to be dirtied and causes
      invalidation of cachelines across all CPUs that may have walked the hash chain
      recently. Hence every walk of the hash chain is effectively a cold cache walk.
      Remove the promotion to avoid this invalidation.
      
      The results are:
      
                1045043.00 21.2% __ticket_spin_lock
                 326184.00  6.6% _xfs_buf_find
      
      A 70% drop in the CPU usage when looking up buffers. Unfortunately that does
      not result in an increase in performance under this workload as contention on
      the inode_lock soaks up most of the reduction in CPU usage.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
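      The second change, a lookup that no longer promotes the found buffer,
      can be sketched with a plain chained hash. Everything here is
      illustrative: the demo_* types, the 256-slot table and the modulo hash
      are assumptions, not the XFS buffer cache.

```c
#include <stddef.h>

#define DEMO_HASH_SIZE 256

struct demo_buf {
    unsigned long blkno;
    struct demo_buf *next;
};

struct demo_hash {
    struct demo_buf *chain[DEMO_HASH_SIZE];
};

/* Insert at the head of the chain selected by the block number. */
static void demo_insert(struct demo_hash *h, struct demo_buf *b)
{
    struct demo_buf **head = &h->chain[b->blkno % DEMO_HASH_SIZE];

    b->next = *head;
    *head = b;
}

/*
 * Read-only lookup: the chain is walked but never reordered, so a hit
 * does not dirty the cachelines that other CPUs have just read.
 */
static struct demo_buf *demo_find(const struct demo_hash *h,
                                  unsigned long blkno)
{
    for (struct demo_buf *b = h->chain[blkno % DEMO_HASH_SIZE]; b; b = b->next)
        if (b->blkno == blkno)
            return b;
    return NULL;
}
```

      Taking the table as a const pointer makes the "no writes on lookup"
      property visible in the function signature.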
  6. 24 Aug, 2010 4 commits
    • xfs: do not discard page cache data on EAGAIN · b5420f23
      Committed by Christoph Hellwig
      If xfs_map_blocks returns EAGAIN because of lock contention we must redirty the
      page rather than discard the pagecache content and return an error from
      writepage. We used to do this correctly, but the logic got lost during the
      recent reshuffle of the writepage code.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reported-by: Mike Gao <ygao.linux@gmail.com>
      Tested-by: Mike Gao <ygao.linux@gmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
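      The intended error handling can be sketched as a small decision
      function. The demo_page struct and demo_writepage_error helper are
      hypothetical; the sketch only models the redirty-versus-discard choice
      described above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page state, reduced to the two outcomes that matter here. */
struct demo_page {
    bool dirty;
    bool discarded;
};

/*
 * Error path of a writepage-like function.
 * -EAGAIN (lock contention) is transient: keep the data, mark the page
 * dirty again and report success so writeback retries later.
 * Any other error discards the page content and is returned to the caller.
 */
static int demo_writepage_error(struct demo_page *page, int err)
{
    if (err == -EAGAIN) {
        page->dirty = true;
        return 0;
    }
    page->discarded = true;
    return err;
}
```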
    • xfs: dummy transactions should not dirty VFS state · 1a387d3b
      Committed by Dave Chinner
      When we need to cover the log, we issue dummy transactions to ensure
      the current log tail is on disk. Unfortunately we currently use the
      root inode in the dummy transaction, and the act of committing the
      transaction dirties the inode at the VFS level.
      
      As a result, the VFS writeback of the dirty inode will prevent the
      filesystem from idling long enough for the log covering state
      machine to complete. The state machine gets stuck in a loop issuing
      new dummy transactions to cover the log and never makes progress.
      
      To avoid this problem, the dummy transactions should not cause
      externally visible state changes. To ensure this occurs, make sure
      that dummy transactions log an unchanging field in the superblock as
      its state is never propagated outside the filesystem. This allows
      the log covering state machine to complete successfully and the
      filesystem now correctly enters a fully idle state about 90s after
      the last modification was made.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: ensure f_ffree returned by statfs() is non-negative · 2fe33661
      Committed by Stuart Brodsky
      Because of delayed updates to the sb_icount field in the superblock, it
      is possible to allocate more than maxicount inodes.  This
      causes the arithmetic to calculate a negative number of free inodes
      in user commands like df or stat -f.
      
      Since maxicount is a somewhat arbitrary number, a slight over
      allocation is not critical, but the value shown by user commands should
      be 0 or greater and never go negative.  To do this, the f_ffree value in
      the stats buffer is capped so that it never goes negative.
      
      [ Modified to use max_t as per Christoph's comment. ]
      Signed-off-by: Stu Brodsky <sbrodsky@sgi.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
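      The clamping can be shown with signed arithmetic followed by a floor at
      zero, which is the effect a max_t-style comparison against 0 has. The
      function and parameter names are assumptions for the sketch.

```c
#include <stdint.h>

/*
 * Free inode count for a statfs-like report.
 * f_files is the reported inode limit; icount - ifree is the number of
 * inodes in use. The subtraction is done in a signed type so that an
 * over-allocation shows up as a negative value, which is then floored at 0.
 */
static uint64_t demo_free_inodes(uint64_t f_files, uint64_t icount,
                                 uint64_t ifree)
{
    int64_t ffree = (int64_t)f_files - (int64_t)(icount - ifree);

    return ffree > 0 ? (uint64_t)ffree : 0;
}
```

      For example, a limit of 100 with 110 inodes in use reports 0 free
      inodes instead of a huge unsigned value.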
    • xfs: handle negative wbc->nr_to_write during sync writeback · efceab1d
      Committed by Dave Chinner
      During data integrity (WB_SYNC_ALL) writeback, wbc->nr_to_write will
      go negative on inodes with more than 1024 dirty pages due to
      implementation details of write_cache_pages(). Currently XFS will
      abort page clustering in writeback once nr_to_write drops below
      zero, and so for data integrity writeback we will do very
      inefficient page-at-a-time allocation and IO submission for inodes
      with large numbers of dirty pages.
      
      Fix this by only aborting the page clustering code when
      wbc->nr_to_write is negative and the sync mode is WB_SYNC_NONE.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
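      The fixed condition combines both tests, so integrity writeback keeps
      clustering regardless of the counter. The types below are hypothetical
      stand-ins for the writeback control structure, and the "< 0" comparison
      follows the wording of the message rather than the actual source.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the writeback control state. */
enum demo_sync_mode { DEMO_WB_SYNC_NONE, DEMO_WB_SYNC_ALL };

struct demo_wbc {
    long nr_to_write;
    enum demo_sync_mode sync_mode;
};

/*
 * Stop page clustering only for background (WB_SYNC_NONE) writeback that
 * has used up its budget. Data integrity writeback ignores the budget.
 */
static bool demo_stop_clustering(const struct demo_wbc *wbc)
{
    return wbc->nr_to_write < 0 && wbc->sync_mode == DEMO_WB_SYNC_NONE;
}
```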
  7. 10 Aug, 2010 5 commits
    • convert remaining ->clear_inode() to ->evict_inode() · b57922d9
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • simplify checks for I_CLEAR/I_FREEING · a4ffdde6
      Committed by Al Viro
      Add I_CLEAR instead of replacing I_FREEING with it.  I_CLEAR is
      equivalent to I_FREEING for almost all code looking at either;
      it's there to keep track of having called clear_inode() exactly
      once per inode lifetime, at some point after having set I_FREEING.
      I_CLEAR and I_FREEING never get set at the same time with the
      current code, so we can switch to setting i_state to I_FREEING | I_CLEAR
      instead of I_CLEAR without loss of information.  As a result of
      this change, checks become simpler and the amount of code that needs
      to know about I_CLEAR shrinks a lot.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
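      The simplification can be seen with two flag bits. The bit values and
      helper names below are made up for the sketch; only the idea of keeping
      I_FREEING set when I_CLEAR is added comes from the message.

```c
#include <stdbool.h>

/* Hypothetical bit values, chosen only for the illustration. */
#define DEMO_I_FREEING 0x1u
#define DEMO_I_CLEAR   0x2u

/* New scheme: the cleared state keeps I_FREEING set alongside I_CLEAR. */
static unsigned int demo_cleared_state(void)
{
    return DEMO_I_FREEING | DEMO_I_CLEAR;
}

/*
 * With the new scheme most callers test a single bit. Under the old
 * scheme the same check had to be state & (I_FREEING | I_CLEAR).
 */
static bool demo_going_away(unsigned int state)
{
    return (state & DEMO_I_FREEING) != 0;
}
```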
    • xfs: new truncate sequence · fa9b227e
      Committed by Christoph Hellwig
      Convert XFS to the new truncate sequence.  We still can have errors after
      updating the file size in xfs_setattr, but these are real I/O errors and lead
      to a transaction abort and filesystem shutdown, so they are not an issue.
      
      Errors from ->write_begin and ->write_end can now be handled correctly because
      we can actually get rid of the delalloc extents, whereas previously the buffer
      state was stripped in block_invalidatepage.
      
      There is still no error handling for ->direct_IO, because doing so will need
      some major restructuring given that we only have the iolock shared and do not
      hold i_mutex at all.  Fortunately leaving the normally allocated blocks behind
      there is not a major issue and this will get cleaned up by xfs_free_eofblock
      later.
      
      Note: the patch is against Al's vfs.git tree as that contains the necessary
      preparations.  I'd prefer to get it applied there so that we can get some
      testing in linux-next.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of block_write_begin_newtrunc · 155130a4
      Committed by Christoph Hellwig
      Move the call to vmtruncate to get rid of excessive blocks to the callers
      in preparation for the new truncate sequence and rename the non-truncating
      version to block_write_begin.
      
      While we're at it also remove several unused arguments to block_write_begin.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • sort out blockdev_direct_IO variants · eafdc7d1
      Committed by Christoph Hellwig
      Move the call to vmtruncate to get rid of excessive blocks to the callers
      in preparation for the new truncate calling sequence.  This was only done
      for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
      was not needed anyway.  Get rid of blockdev_direct_IO_no_locking and
      its _newtrunc variant while at it, as just open-coding the two additional
      parameters is shorter than the name suffix.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  8. 27 Jul, 2010 6 commits
    • direct-io: move aio_complete into ->end_io · 552ef802
      Committed by Christoph Hellwig
      Filesystems with unwritten extent support must not complete an AIO request
      until the transaction to convert the extent has been committed.  That means
      the aio_complete call needs to be moved into the ->end_io callback so
      that the filesystem can control when to call it exactly.
      
      This makes a bit of a mess out of dio_complete and makes the ->end_io
      callback prototype even more complicated.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • xfs: simplify and speed up direct I/O completions · 209fb87a
      Committed by Christoph Hellwig
      Our current handling of direct I/O completions is rather suboptimal,
      because we defer it to a workqueue more often than needed, and we
      perform a much too aggressive flush of the workqueue in case unwritten
      extent conversions happen.
      
      This patch changes the direct I/O reads to not even use a completion
      handler, as we don't bother to use it at all, and to perform the unwritten
      extent conversions in caller context for synchronous direct I/O.
      
      For a small I/O size direct I/O workload on a consumer grade SSD, such as
      the untar of a kernel tree inside qemu, this patch gives speedups of
      about 5%, getting us much closer to the speed of a native block device
      or a fully allocated XFS file.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: move aio completion after unwritten extent conversion · fb511f21
      Committed by Christoph Hellwig
      If we write into an unwritten extent using AIO we need to complete the AIO
      request after the extent conversion has finished.  Without that a read could
      race to see the extent still unwritten and return zeros.  For synchronous
      I/O we already take care of that by flushing the xfsconvertd workqueue (which
      might be a bit of overkill).
      
      To do that add iocb and result fields to struct xfs_ioend, so that we can
      call aio_complete from xfs_end_io after the extent conversion has happened.
      Note that we need a new result field as io_error is used for positive errno
      values, while the AIO code can return negative error values and positive
      transfer sizes.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • direct-io: move aio_complete into ->end_io · 40e2e973
      Committed by Christoph Hellwig
      Filesystems with unwritten extent support must not complete an AIO request
      until the transaction to convert the extent has been committed.  That means
      the aio_complete call needs to be moved into the ->end_io callback so
      that the filesystem can control when to call it exactly.
      
      This makes a bit of a mess out of dio_complete and makes the ->end_io
      callback prototype even more complicated.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: kill the b_strat callback in xfs_buf · 939d723b
      Committed by Christoph Hellwig
      The b_strat callback is used by xfs_buf_iostrategy to perform additional
      checks before submitting a buffer.  It is used in xfs_bwrite and when
      writing out delayed buffers.  In xfs_bwrite we can de-virtualize the
      call easily as b_strat is set a few lines above the call to
      xfs_buf_iostrategy.  For the delayed buffers the rationale is a bit
      more complicated:
      
       - there are three callers of xfs_buf_delwri_queue, which places buffers
         on the delwri list:
          (1) xfs_bdwrite - this sets up b_strat, so it's fine
          (2) xfs_buf_iorequest.  None of the callers can have XBF_DELWRI set:
      	- xlog_bdstrat is only used for log buffers, which are never delwri
      	- _xfs_buf_read explicitly clears the delwri flag
      	- xfs_buf_iodone_work retries log buffers only
      	- xfsbdstrat - only used for reads, superblock writes without the
      	  delwri flag, log I/O and file zeroing with explicitly allocated
      	  buffers.
      	- xfs_buf_iostrategy - only calls xfs_buf_iorequest if b_strat is
      	  not set
          (3) xfs_buf_unlock
      	- only puts the buffer on the delwri list if the DELWRI flag is
      	  already set.  The DELWRI flag is only ever set in xfs_bwrite,
      	  xfs_buf_iodone_callbacks, or xfs_trans_log_buf.  For
      	  xfs_buf_iodone_callbacks and xfs_trans_log_buf we require
      	  an initialized buf item, which means b_strat was set to
      	  xfs_bdstrat_cb in xfs_buf_item_init.
      
      Conclusion: we can just get rid of the callback and replace it with
      explicit calls to xfs_bdstrat_cb.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: remove obsolete osyncisosync mount option · a64afb05
      Committed by Christoph Hellwig
      Since Linux 2.6.33 the kernel has support for real O_SYNC, which made
      the osyncisosync option a no-op.  Warn the users about this and remove
      the mount flag for it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>