1. 19 Oct 2010, 33 commits
  2. 07 Oct 2010, 1 commit
    •
      xfs: properly account for reclaimed inodes · 081003ff
      Authored by Johannes Weiner
      When marking an inode reclaimable, a per-AG counter is increased, the
      inode is tagged reclaimable in its per-AG tree, and, when this is the
      first reclaimable inode in the AG, the AG entry in the per-mount tree
      is also tagged.
      
      When an inode is finally reclaimed, however, it is only deleted from
      the per-AG tree.  Neither is the counter decreased, nor is the parent
      tree's AG entry properly untagged.
      
      Since the tags in the per-mount tree are not cleared, the inode
      shrinker iterates over all AGs that have had reclaimable inodes at one
      point in time.
      
      The counters, on the other hand, signal an ever-increasing number of
      slab objects to reclaim.  Since "70e60ce7 xfs: convert inode shrinker
      to per-filesystem context" this is no longer a real issue because the
      shrinker bails out after one iteration.
      
      But the problem was observable on a machine running v2.6.34, where the
      reclaimable work increased and each process going into direct reclaim
      eventually got stuck on the xfs inode shrinking path, trying to scan
      several million objects.
      
      Fix this by properly unwinding the reclaimable-state tracking of an
      inode when it is reclaimed; a minimal sketch of the bookkeeping
      follows this entry.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: stable@kernel.org
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      081003ff
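      The unwinding described above amounts to keeping the per-AG counter
      and the parent tree's tag in sync on both the mark and reclaim paths.
      Below is a minimal userspace sketch of that bookkeeping; the struct
      and helper names (pag, ag_mark_reclaimable, ag_reclaim_inode) are
      hypothetical stand-ins and do not model XFS's actual per-AG radix
      trees.

        #include <assert.h>
        #include <stdbool.h>

        /* Hypothetical per-AG state; XFS tracks this with radix-tree tags. */
        struct pag {
            int  reclaimable;   /* reclaimable inodes in this AG */
            bool mount_tagged;  /* AG tagged in the per-mount tree */
        };

        /* Marking an inode reclaimable: bump the counter and, for the first
         * reclaimable inode in the AG, tag the AG in the per-mount tree. */
        static void ag_mark_reclaimable(struct pag *pag)
        {
            if (pag->reclaimable++ == 0)
                pag->mount_tagged = true;
        }

        /* Reclaiming an inode must undo both steps: decrement the counter
         * and, when the last reclaimable inode is gone, clear the per-mount
         * tag.  Before the fix, only the per-AG tree entry was deleted. */
        static void ag_reclaim_inode(struct pag *pag)
        {
            assert(pag->reclaimable > 0);
            if (--pag->reclaimable == 0)
                pag->mount_tagged = false;
        }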
  3. 29 Sep 2010, 1 commit
    •
      xfs: force background CIL push under sustained load · 80168676
      Authored by Dave Chinner
      I have been seeing occasional pauses in transaction throughput up to
      30s long under heavy parallel workloads. The only notable thing was
      that the xfsaild was trying to be active during the pauses, but
      making no progress. It was running exactly 20 times a second (on the
      50ms no-progress backoff), and the number of pushbuf events was
      constant across this time as well.  IOWs, the xfsaild appeared to be
      stuck on buffers that it could not push out.
      
      Further investigation indicated that it was trying to push out inode
      buffers that were pinned and/or locked. The xfsbufd was also getting
      woken at the same frequency (by the xfsaild, no doubt) to push out
      delayed write buffers. The xfsbufd was not making any progress
      because all the buffers in the delwri queue were pinned. This scan-
      and-make-no-progress dance went on in the trace for some seconds,
      before the xfssyncd came along and issued a log force, and then
      things started going again.
      
      However, I noticed something strange about the log force - there
      were way too many IOs issued. 516 log buffers were written, to be
      exact. That added up to 129MB of log IO, which got me very
      interested because it's almost exactly 25% of the size of the log.
      The delayed logging code is supposed to aggregate the minimum of 25%
      of the log or 8MB worth of changes before flushing. That's what
      really puzzled me - why did a log force write 129MB instead of only
      8MB?
      
      Essentially what has happened is that no CIL pushes had occurred
      since the previous tail push which cleared out 25% of the log space.
      That caused all the new transactions to block because there wasn't
      log space for them, so they kicked the xfsaild to push the tail.
      However, the xfsaild was not making progress because there were
      buffers it could not lock and flush, and the xfsbufd could not flush
      them because they were pinned. As a result, both the xfsaild and the
      xfsbufd could not move the tail of the log forward without the CIL
      first committing.
      
      The cause of the problem was that the background CIL push, which
      should happen when 8MB of aggregated changes have been committed, is
      being held off by the concurrent transaction commit load. The
      background push does a down_write_trylock() which will fail if there
      is a concurrent transaction commit holding the push lock in read
      mode. With 8 CPUs all doing transactions as fast as they can, there
      were enough concurrent transaction commits to hold off the background
      push until tail-pushing could no longer free log space, and the halt
      would occur.
      
      It should be noted that there is no reason why it would halt at 25%
      of log space used by a single CIL checkpoint. This bug could
      definitely violate the "no transaction should be larger than half
      the log" requirement and hence result in corruption if the system
      crashed under heavy load. This sort of bug is exactly the reason why
      delayed logging was tagged as experimental....
      
      The fix is to start blocking background pushes once the threshold
      has been exceeded, and to rework the threshold calculations so the
      amount of log space a CIL checkpoint can use stays below the AIL
      push threshold, avoiding the problem completely. A sketch of the
      trylock-versus-blocking pattern follows this entry.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      80168676
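      The failure mode above (an opportunistic trylock that keeps losing to
      committers holding the push lock in read mode) and the fix (block once
      a hard space threshold is crossed) can be modelled with a pthreads
      rwlock standing in for the CIL context lock. This is only a sketch of
      the pattern; the names (cil, background_limit, hard_limit,
      cil_push_background) are hypothetical and the real code operates on
      the XFS CIL structures.

        #include <pthread.h>

        /* Hypothetical CIL stand-in: aggregated change size plus the push
         * lock that transaction commits hold in read mode. */
        struct cil {
            pthread_rwlock_t push_lock;
            long space_used;        /* bytes of aggregated changes   */
            long background_limit;  /* e.g. 8MB: normal push trigger */
            long hard_limit;        /* kept below the AIL push threshold */
        };

        static void cil_push_background(struct cil *cil)
        {
            if (cil->space_used < cil->background_limit)
                return;                         /* nothing to push yet */

            if (cil->space_used < cil->hard_limit) {
                /* Opportunistic: back off if committers hold the lock in
                 * read mode.  Under sustained load this can lose forever,
                 * which is exactly the stall described above. */
                if (pthread_rwlock_trywrlock(&cil->push_lock) != 0)
                    return;
            } else {
                /* Past the hard threshold, block behind the committers so
                 * the checkpoint cannot keep growing without bound. */
                pthread_rwlock_wrlock(&cil->push_lock);
            }

            /* ... write the aggregated changes as a checkpoint ... */
            cil->space_used = 0;
            pthread_rwlock_unlock(&cil->push_lock);
        }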
  4. 10 Sep 2010, 2 commits
    •
      xfs: log IO completion workqueue is a high priority queue · 51749e47
      Authored by Dave Chinner
      The workqueue implementation in 2.6.36-rcX has changed, resulting
      in the workqueues no longer having dedicated threads for work
      processing. This has caused severe livelocks under heavy parallel
      create workloads because the log IO completions have been getting
      held up behind metadata IO completions.  Hence log commits would
      stall, memory allocation would stall because pages could not be
      cleaned, and lock contention on the AIL during inode IO completion
      processing was being seen to slow everything down even further.
      
      By making the log IO completion workqueue a high priority workqueue,
      log IO completions are queued ahead of all data/metadata IO
      completions and processed before them. Hence the log never gets
      stalled, and operations needed to clean memory can continue as
      quickly as possible. This avoids the livelock conditions and allows
      the system to keep running under heavy load as normal. The snippet
      after this entry shows the workqueue flag involved.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      51749e47
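      The change boils down to a flag at workqueue creation time. The
      snippet below illustrates the 2.6.36-era alloc_workqueue() API with
      WQ_HIGHPRI; the surrounding function and the exact flag/max_active
      combination used by the real commit are assumptions for illustration,
      not the literal diff.

        #include <linux/errno.h>
        #include <linux/workqueue.h>

        static struct workqueue_struct *xfslogd_workqueue;

        /* Log IO completions must not queue behind data/metadata
         * completions, so run them on a high-priority workqueue. */
        static int init_log_workqueue(void)
        {
            xfslogd_workqueue = alloc_workqueue("xfslogd", WQ_HIGHPRI, 1);
            if (!xfslogd_workqueue)
                return -ENOMEM;
            return 0;
        }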
    •
      xfs: prevent reading uninitialized stack memory · a122eb2f
      Authored by Dan Rosenberg
      The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12
      bytes of uninitialized stack memory, because the fsxattr struct
      declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero)
      the 12-byte fsx_pad member before copying it back to the user.  This
      patch takes care of it; a userspace model of the zeroing fix follows
      this entry.
      Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      a122eb2f
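      The usual fix for this class of leak is to zero the whole on-stack
      structure before filling in the fields that are copied out to
      userspace.  The sketch below models that; struct fake_fsxattr and
      fill_fsxattr are hypothetical stand-ins for fsxattr and
      xfs_ioc_fsgetxattr, not the kernel code itself.

        #include <stdint.h>
        #include <string.h>

        /* Stand-in for struct fsxattr: nothing ever writes the pad bytes,
         * so without the memset they leak whatever was on the stack. */
        struct fake_fsxattr {
            uint32_t      flags;
            uint32_t      extsize;
            uint32_t      nextents;
            uint32_t      projid;
            unsigned char pad[12];
        };

        static void fill_fsxattr(struct fake_fsxattr *fa, uint32_t flags,
                                 uint32_t extsize, uint32_t nextents,
                                 uint32_t projid)
        {
            /* The fix: clear the entire struct, pad included, before the
             * caller copies it back to an unprivileged user. */
            memset(fa, 0, sizeof(*fa));
            fa->flags = flags;
            fa->extsize = extsize;
            fa->nextents = nextents;
            fa->projid = projid;
        }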
  5. 03 Sep 2010, 2 commits
    •
      xfs: Make fiemap work with sparse files · 9af25465
      Authored by Tao Ma
      In xfs_vn_fiemap, we set bmv_count to fi_extent_max + 1 and want
      to return fi_extent_max extents, but this does not actually work
      for a sparse file. The reason is that xfs_getbmap also formats
      holes into 'out', while 'out' is allocated with bmv_count
      (fi_extent_max + 1) entries, which does not account for holes. So
      in the worst case, if the 'out' vector looks like
      [hole, extent, hole, extent, hole, ... hole, extent, hole],
      we will only return half of fi_extent_max extents.
      
      This patch adds a new flag, BMV_IF_NO_HOLES, to bmv_iflags. With
      this flag set, we do not consume an 'out' entry in xfs_getbmap for
      a hole. The solution is a bit ugly: we simply do not increment the
      index into the 'out' vector for holes (see the sketch after this
      entry). Skipping holes at the very beginning is not easy, since
      there are complicated checks and helpers such as
      xfs_getbmapx_fix_eof_hole that adjust 'out'.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      9af25465
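      The fix amounts to not consuming an entry of the 'out' vector when a
      mapping is a hole and the caller passed BMV_IF_NO_HOLES. The loop
      below is a simplified userspace model of that index handling; the
      types, the is_hole field and the flag value are assumptions for
      illustration, not the actual xfs_getbmap code.

        #include <stdbool.h>
        #include <stdint.h>

        #define BMV_IF_NO_HOLES 0x10    /* illustrative value only */

        struct out_rec {
            uint64_t offset;
            uint64_t length;
            bool     is_hole;
        };

        /* Copy mappings into 'out'; with BMV_IF_NO_HOLES set, a hole is
         * still formatted but the index is not advanced, so the next real
         * extent overwrites the slot.  Returns the entries kept. */
        static int fill_extents(const struct out_rec *maps, int nmaps,
                                struct out_rec *out, int max_out, int iflags)
        {
            int cur = 0;

            for (int i = 0; i < nmaps && cur < max_out; i++) {
                out[cur] = maps[i];
                if ((iflags & BMV_IF_NO_HOLES) && maps[i].is_hole)
                    continue;       /* do not keep the hole entry */
                cur++;
            }
            return cur;
        }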
    •
      xfs: prevent 32bit overflow in space reservation · 72656c46
      Authored by Dave Chinner
      If we attempt to preallocate more than 2^32 blocks of space in a
      single syscall, the transaction block reservation will overflow,
      leading to a hang in the superblock block accounting code. This
      is trivially reproduced with xfs_io. Fix the problem by capping the
      allocation reservation to the maximum number of blocks a single
      xfs_bmapi() call can allocate (2^21 blocks). A minimal sketch of
      the clamping follows this entry.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      72656c46
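      The pattern is to do the size-to-blocks conversion in 64-bit
      arithmetic and clamp the per-call reservation to what a single
      xfs_bmapi() call can allocate before it feeds into 32-bit reservation
      accounting. A minimal sketch, assuming a hypothetical
      MAX_BMAPI_BLOCKS constant and helper name (the 2^21 figure comes from
      the commit message above):

        #include <stdint.h>

        /* 2^21 blocks: the most a single allocation call handles. */
        #define MAX_BMAPI_BLOCKS (1ULL << 21)

        /* Convert a byte count to filesystem blocks in 64 bits and cap the
         * result so it cannot overflow a 32-bit reservation field. */
        static uint32_t prealloc_block_reservation(uint64_t bytes,
                                                   unsigned int blocklog)
        {
            uint64_t blocks = (bytes + (1ULL << blocklog) - 1) >> blocklog;

            if (blocks > MAX_BMAPI_BLOCKS)
                blocks = MAX_BMAPI_BLOCKS;
            return (uint32_t)blocks;
        }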
  6. 02 Sep 2010, 1 commit