1. 09 Sep 2010 - 7 commits
  2. 08 Sep 2010 - 1 commit
    • VFS: Sanity check mount flags passed to change_mnt_propagation() · 7a2e8a8f
      Valerie Aurora authored
      Sanity check the flags passed to change_mnt_propagation().  Exactly
      one flag should be set.  Return EINVAL otherwise.
      
      Userspace can pass in arbitrary combinations of MS_* flags to mount().
      do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE,
      or MS_UNBINDABLE is set.  do_change_type() clears MS_REC and then
      calls change_mnt_propagation() with the rest of the user-supplied
      flags.  change_mnt_propagation() clearly assumes only one flag is set
      but do_change_type() does not check that this is true.  For example,
      mount() with flags MS_SHARED | MS_RDONLY does not actually make the
      mount shared or read-only but does clear MNT_UNBINDABLE.
      Signed-off-by: Valerie Aurora <vaurora@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a2e8a8f
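      A self-contained approximation of the check described above (the
      flag values match the Linux MS_* constants, but the helper here is
      a sketch, not the exact fs/namespace.c code):

      /* Return the single propagation type, or 0 if the combination
       * is invalid (no flag, several flags, or stray flags). */
      #define MS_REC        (1 << 14)
      #define MS_UNBINDABLE (1 << 17)
      #define MS_PRIVATE    (1 << 18)
      #define MS_SLAVE      (1 << 19)
      #define MS_SHARED     (1 << 20)

      static int flags_to_propagation_type(int flags)
      {
          int type = flags & ~MS_REC;

          /* Fail if any non-propagation flags are present. */
          if (type & ~(MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
              return 0;
          /* Exactly one bit set <=> a power of two. */
          if (type == 0 || (type & (type - 1)) != 0)
              return 0;
          return type;
      }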
  3. 04 Sep 2010 - 1 commit
  4. 03 Sep 2010 - 2 commits
    • xfs: Make fiemap work with sparse files · 9af25465
      Tao Ma authored
      In xfs_vn_fiemap, we set bmv_count to fi_extent_max + 1 and want
      to return fi_extent_max extents, but this does not actually work
      for a sparse file. The reason is that xfs_getbmap also records
      holes in 'out', while 'out' is sized by bmv_count
      (fi_extent_max + 1), which does not account for holes. So in the
      worst case, if the 'out' vector looks like
      [hole, extent, hole, extent, hole, ... hole, extent, hole],
      we will only return half of fi_extent_max extents.
      
      This patch adds a new flag, BMV_IF_NO_HOLES, for bmv_iflags.
      With this flag set, we do not consume an 'out' entry in
      xfs_getbmap for a hole. The solution is a bit ugly: we simply do
      not increase the index of the 'out' vector for a hole (see the
      sketch after this entry). Skipping holes at the very beginning
      is not easy, since we have complicated checks and helpers like
      xfs_getbmapx_fix_eof_hole that adjust 'out'.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      9af25465
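      A minimal sketch of the "don't advance the index" trick described
      above, with simplified types (the real struct getbmapx and flag
      live in xfs_fs.h; the 0x10 value here is illustrative):

      #define BMV_IF_NO_HOLES 0x10   /* assumed flag value */

      struct bmapx_sketch {
          long long bmv_block;    /* starting block, -1 for a hole */
          long long bmv_offset;
          long long bmv_length;
      };

      /* Copy one segment into out[*idx]; advance *idx only when the
       * segment is a real extent or holes were requested. */
      static void emit_segment(struct bmapx_sketch *out, int *idx,
                               const struct bmapx_sketch *seg, int iflags)
      {
          out[*idx] = *seg;
          if ((iflags & BMV_IF_NO_HOLES) && seg->bmv_block == -1)
              return;     /* hole: this slot is overwritten next time */
          (*idx)++;
      }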
    • xfs: prevent 32bit overflow in space reservation · 72656c46
      Dave Chinner authored
      If we attempt to preallocate more than 2^32 blocks of space in a
      single syscall, the transaction block reservation will overflow,
      leading to a hang in the superblock block accounting code. This
      is trivially reproduced with xfs_io. Fix the problem by capping
      the allocation reservation to the maximum number of blocks a
      single xfs_bmapi() call can allocate (2^21 blocks).
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      72656c46
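      A tiny sketch of the capping described above (the constant name
      is illustrative; the 2^21 limit corresponds to the XFS
      extent-length maximum for one xfs_bmapi() call):

      #include <stdint.h>

      #define MAX_BMAPI_BLOCKS (1ULL << 21)   /* one xfs_bmapi() call */

      /* Clamp a per-transaction block reservation so it can never
       * overflow a 32-bit reservation field. */
      static uint64_t cap_reservation(uint64_t requested_blocks)
      {
          return requested_blocks > MAX_BMAPI_BLOCKS
                  ? MAX_BMAPI_BLOCKS : requested_blocks;
      }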
  5. 02 Sep 2010 - 2 commits
    • xfs: Disallow 32bit project quota id · 23963e54
      Arkadiusz Miśkiewicz authored
      Currently the on-disk structure can hold only a 16-bit project
      quota id, so disallow 32-bit ones. This fixes a problem where
      the in-kernel structures holding the project quota id are 32-bit
      while the on-disk field is 16-bit, which caused project quota
      member files to be inaccessible for some operations (like
      mv/rm).
      Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      23963e54
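      A one-function sketch of the range check implied above (the real
      check sits in the XFS ioctl path; the name is illustrative):

      #include <errno.h>
      #include <stdint.h>

      /* Reject a project quota id that cannot fit the 16-bit on-disk
       * field; otherwise it would be silently truncated. */
      static int validate_projid(uint32_t projid)
      {
          if (projid > UINT16_MAX)
              return -EINVAL;
          return 0;
      }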
    • xfs: improve buffer cache hash scalability · 9bc08a45
      Dave Chinner authored
      When doing large parallel file creates on a 16p machine, a large
      amount of time is spent in _xfs_buf_find(). A system-wide
      profile with perf top shows this:
      
                1134740.00 19.3% _xfs_buf_find
                 733142.00 12.5% __ticket_spin_lock
      
      The problem is that the hash contains 45,000 buffers, but the
      hash table is only 256 chains wide. That means around 200
      buffers per chain, and searching a chain is quite expensive. The
      hash table size needs to increase.
      
      Secondly, every time we do a lookup, we promote the buffer we
      find to the head of the hash chain. This dirties cachelines and
      invalidates cachelines across all CPUs that may have walked the
      hash chain recently, so every walk of the hash chain is
      effectively a cold-cache walk. Remove the promotion to avoid
      this invalidation.
      
      The results are:
      
                1045043.00 21.2% __ticket_spin_lock
                 326184.00  6.6% _xfs_buf_find
      
      That is a 70% drop in CPU usage when looking up buffers.
      Unfortunately it does not translate into better performance
      under this workload, as contention on the inode_lock soaks up
      most of the reduction in CPU usage.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      9bc08a45
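      A minimal sketch of a chain lookup with the move-to-front
      promotion removed, as described above (list handling is
      simplified; the kernel uses its own list primitives):

      #include <stddef.h>

      struct buf_sketch {
          unsigned long long b_blkno;
          struct buf_sketch *b_next;
      };

      /* Walk the chain read-only; finding a buffer no longer writes
       * to the chain, so other CPUs' cachelines stay clean. */
      static struct buf_sketch *
      find_buf(struct buf_sketch *chain, unsigned long long blkno)
      {
          for (struct buf_sketch *bp = chain; bp != NULL; bp = bp->b_next)
              if (bp->b_blkno == blkno)
                  return bp;
          return NULL;
      }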
  6. 30 Aug 2010 - 2 commits
  7. 28 Aug 2010 - 3 commits
  8. 27 Aug 2010 - 11 commits
  9. 26 Aug 2010 - 3 commits
  10. 25 Aug 2010 - 3 commits
  11. 24 Aug 2010 - 5 commits
    • xfs: do not discard page cache data on EAGAIN · b5420f23
      Christoph Hellwig authored
      If xfs_map_blocks returns EAGAIN because of lock contention, we
      must redirty the page rather than discard the pagecache content
      and return an error from writepage (see the sketch after this
      entry). We used to do this correctly, but the logic got lost
      during the recent reshuffle of the writepage code.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reported-by: Mike Gao <ygao.linux@gmail.com>
      Tested-by: Mike Gao <ygao.linux@gmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      b5420f23
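      A self-contained sketch of the corrected error path, with
      hypothetical stand-ins for the kernel page and block-mapping
      helpers:

      #include <errno.h>
      #include <stdbool.h>

      struct page_sketch { bool dirty; bool locked; };

      /* Stand-in for xfs_map_blocks(): fails with -EAGAIN here to
       * model lock contention during non-blocking writeback. */
      static int map_blocks(struct page_sketch *page)
      { (void)page; return -EAGAIN; }

      static int writepage_sketch(struct page_sketch *page)
      {
          int error = map_blocks(page);

          if (error == -EAGAIN) {
              /* Contention is transient: keep the pagecache data,
               * redirty the page so writeback retries later, and
               * report success instead of an error. */
              page->dirty = true;
              page->locked = false;
              return 0;
          }
          return error;   /* real errors still propagate */
      }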
    • xfs: don't do memory allocation under the CIL context lock · 3b93c7aa
      Dave Chinner authored
      Formatting items requires memory allocation when using delayed
      logging. Currently that memory allocation is done while holding the
      CIL context lock in read mode. This means that if memory allocation
      takes some time (e.g. enters reclaim), we cannot push on the CIL
      until the allocation(s) required by formatting complete. This can
      stall CIL pushes for some time, and once a push is stalled so are
      all new transaction commits.
      
      Fix this by splitting the item formatting into two steps, as
      sketched after this entry. The first step, which does the
      allocation and memcpy() into the allocated buffer, is now done
      outside the CIL context lock, and only the CIL insert is done
      inside the CIL context lock. This avoids the stall issue.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      3b93c7aa
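      A compact sketch of the two-phase approach, using a pthreads
      rwlock and mutex as stand-ins for the CIL context lock and the
      CIL list lock (types and locking are simplified):

      #include <pthread.h>
      #include <stdlib.h>
      #include <string.h>

      static pthread_rwlock_t cil_ctx_lock = PTHREAD_RWLOCK_INITIALIZER;
      static pthread_mutex_t  cil_list_lock = PTHREAD_MUTEX_INITIALIZER;

      struct log_vec {
          struct log_vec *next;
          void  *buf;
          size_t len;
      };
      static struct log_vec *cil_items;   /* toy CIL item list */

      /* Phase 1: allocate and format with no CIL lock held, so a
       * slow allocation (one that enters reclaim, say) can no longer
       * stall CIL pushes. */
      static struct log_vec *format_item(const void *src, size_t len)
      {
          struct log_vec *lv = calloc(1, sizeof(*lv));
          lv->buf = malloc(len);
          lv->len = len;
          memcpy(lv->buf, src, len);
          return lv;
      }

      /* Phase 2: only the quick list insert happens under the
       * context lock (read mode) plus a short list lock. */
      static void insert_item(struct log_vec *lv)
      {
          pthread_rwlock_rdlock(&cil_ctx_lock);
          pthread_mutex_lock(&cil_list_lock);
          lv->next = cil_items;
          cil_items = lv;
          pthread_mutex_unlock(&cil_list_lock);
          pthread_rwlock_unlock(&cil_ctx_lock);
      }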
    • xfs: Reduce log force overhead for delayed logging · a44f13ed
      Dave Chinner authored
      Delayed logging adds some serialisation to the log force process
      to ensure that it does not dereference a bad commit context
      structure when determining whether a CIL push is necessary. It
      does this by grabbing the CIL context lock exclusively, then
      dropping it before pushing the CIL if necessary. This serialises
      all log forces and pushes regardless of whether a force is
      necessary or not. As a result, fsync-heavy workloads (like
      dbench) can be significantly slower with delayed logging than
      without.
      
      To avoid this penalty, copy the current sequence from the context to
      the CIL structure when they are swapped. This allows us to do
      unlocked checks on the current sequence without having to worry
      about dereferencing context structures that may have already been
      freed. Hence we can remove the CIL context locking in the forcing
      code and only call into the push code if the current context matches
      the sequence we need to force.
      
      By passing the sequence into the push code, we can check the
      sequence again once we have the CIL lock held exclusive and abort if
      the sequence has already been pushed. This avoids a lock round-trip
      and unnecessary CIL pushes when we have racing push calls.
      
      The result is that the regression in dbench performance goes
      away - this change improves dbench performance on a ramdisk from
      ~2100MB/s to ~2500MB/s. This compares favourably to not using
      delayed logging, which returns ~2500MB/s for the same workload.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      a44f13ed
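      A schematic of the unlocked check-then-lock pattern described
      above; names are illustrative, not the kernel's:

      #include <stdint.h>

      struct cil_sketch {
          uint64_t current_sequence;   /* cached at context swap */
      };

      /* Stand-in for the locked push: re-checks the sequence under
       * the CIL lock and pushes only if it still matches. */
      static int push_locked(struct cil_sketch *cil, uint64_t seq)
      { (void)cil; (void)seq; return 0; }

      static int force_sequence(struct cil_sketch *cil, uint64_t seq)
      {
          /* Unlocked fast path: if the CIL has already moved past
           * this sequence, the push has happened and we are done. */
          if (cil->current_sequence > seq)
              return 0;
          return push_locked(cil, seq);
      }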
    • xfs: dummy transactions should not dirty VFS state · 1a387d3b
      Dave Chinner authored
      When we need to cover the log, we issue dummy transactions to
      ensure the current log tail is on disk. Unfortunately we
      currently use the root inode in the dummy transaction, and the
      act of committing the transaction dirties the inode at the VFS
      level.
      
      As a result, the VFS writeback of the dirty inode will prevent the
      filesystem from idling long enough for the log covering state
      machine to complete. The state machine gets stuck in a loop issuing
      new dummy transactions to cover the log and never makes progress.
      
      To avoid this problem, the dummy transactions should not cause
      externally visible state changes. To ensure this, make dummy
      transactions log an unchanging field in the superblock, as its
      state is never propagated outside the filesystem. This allows
      the log covering state machine to complete successfully, and the
      filesystem now correctly enters a fully idle state about 90s
      after the last modification was made.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      1a387d3b
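      A sketch of what such a dummy transaction might look like,
      assuming (an assumption; the text above only says "an unchanging
      field") that the field logged is the superblock UUID, with
      hypothetical stand-ins for the transaction API:

      #include <stdlib.h>

      struct tx_sketch { int synchronous; };

      static struct tx_sketch *trans_alloc(void)
      { return calloc(1, sizeof(struct tx_sketch)); }

      /* Stand-in: log the immutable superblock UUID field. */
      static void log_sb_uuid(struct tx_sketch *tp) { (void)tp; }

      static int trans_commit(struct tx_sketch *tp)
      { free(tp); return 0; }

      static int log_dummy_transaction(void)
      {
          struct tx_sketch *tp = trans_alloc();

          /* Logging an unchanging superblock field pushes the log
           * tail forward without dirtying any VFS-visible object,
           * so the idle state machine can finally complete. */
          log_sb_uuid(tp);
          tp->synchronous = 1;
          return trans_commit(tp);
      }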
    • xfs: ensure f_ffree returned by statfs() is non-negative · 2fe33661
      Stuart Brodsky authored
      Because of delayed updates to the sb_icount field in the
      superblock, it is possible to allocate more than maxicount
      inodes. This causes the arithmetic to produce a negative number
      of free inodes in user commands like df or stat -f.
      
      Since maxicount is a somewhat arbitrary number, a slight
      over-allocation is not critical, but the value shown to user
      commands should be 0 or greater and never negative. To achieve
      this, the f_ffree value in the stats buffer is capped so it
      never goes negative (see the sketch after this entry).
      
      [ Modified to use max_t as per Christoph's comment. ]
      Signed-off-by: Stu Brodsky <sbrodsky@sgi.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      2fe33661
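      A small sketch of the clamping arithmetic (the kernel uses
      max_t(); plain C is shown here, with illustrative names):

      #include <stdint.h>

      /* f_files is the reported inode limit; icount/ifree are the
       * (possibly over-allocated) totals from the superblock. */
      static int64_t ffree_sketch(int64_t f_files, uint64_t icount,
                                  uint64_t ifree)
      {
          int64_t ffree = f_files - (int64_t)(icount - ifree);

          /* Delayed sb_icount updates can push icount past the
           * limit; never report a negative free-inode count. */
          return ffree > 0 ? ffree : 0;
      }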