1. 07 3月, 2011 9 次提交
  2. 02 3月, 2011 2 次提交
  3. 23 2月, 2011 3 次提交
  4. 22 2月, 2011 2 次提交
  5. 08 2月, 2011 3 次提交
  6. 28 1月, 2011 8 次提交
    • B
      xfs: xfs_bmap_add_extent_delay_real should init br_startblock · 24446fc6
      bpm@sgi.com 提交于
      When filling in the middle of a previous delayed allocation in
      xfs_bmap_add_extent_delay_real, set br_startblock of the new delay
      extent to the right to nullstartblock instead of 0 before inserting
      the extent into the ifork (xfs_iext_insert), rather than setting
      br_startblock afterward.
      
      Adding the extent into the ifork with br_startblock=0 can lead to
      the extent being copied into the btree by xfs_bmap_extent_to_btree
      if we happen to convert from extents format to btree format before
      updating br_startblock with the correct value.  The unexpected
      addition of this delay extent to the btree can cause subsequent
      XFS_WANT_CORRUPTED_GOTO filesystem shutdown in several
      xfs_bmap_add_extent_delay_real cases where we are converting a delay
      extent to real and unexpectedly find an extent already inserted.
      For example:
      
      911         case BMAP_LEFT_FILLING:
      912                 /*
      913                  * Filling in the first part of a previous delayed allocation.
      914                  * The left neighbor is not contiguous.
      915                  */
      916                 trace_xfs_bmap_pre_update(ip, idx, state, _THIS_IP_);
      917                 xfs_bmbt_set_startoff(ep, new_endoff);
      918                 temp = PREV.br_blockcount - new->br_blockcount;
      919                 xfs_bmbt_set_blockcount(ep, temp);
      920                 xfs_iext_insert(ip, idx, 1, new, state);
      921                 ip->i_df.if_lastex = idx;
      922                 ip->i_d.di_nextents++;
      923                 if (cur == NULL)
      924                         rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
      925                 else {
      926                         rval = XFS_ILOG_CORE;
      927                         if ((error = xfs_bmbt_lookup_eq(cur, new->br_startoff,
      928                                         new->br_startblock, new->br_blockcount,
      929                                         &i)))
      930                                 goto done;
      931                         XFS_WANT_CORRUPTED_GOTO(i == 0, done);
      
      With the bogus extent in the btree we shutdown the filesystem at
      931.  The conversion from extents to btree format happens when the
      number of extents in the inode increases above ip->i_df.if_ext_max.
      xfs_bmap_extent_to_btree copies extents from the ifork into the
      btree, ignoring all delalloc extents which are denoted by
      br_startblock having some value of nullstartblock.
      
      SGI-PV: 1013221
      Signed-off-by: NBen Myers <bpm@sgi.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      24446fc6
    • D
      xfs: fix dquot shaker deadlock · 0fbca4d1
      Dave Chinner 提交于
      Commit 368e1361 ("xfs: remove duplicate code from dquot reclaim") fails
      to unlock the dquot freelist when the number of loop restarts is
      exceeded in xfs_qm_dqreclaim_one(). This causes hangs in memory
      reclaim.
      
      Rework the loop control logic into an unwind stack that all the
      different cases jump into. This means there is only one set of code
      that processes the loop exit criteria, and simplifies the unlocking
      of all the items from different points in the loop. It also fixes a
      double increment of the restart counter from the qi_dqlist_lock
      case.
      Reported-by: NMalcolm Scott <lkml@malc.org.uk>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      0fbca4d1
    • D
      xfs: handle CIl transaction commit failures correctly · c6f990d1
      Dave Chinner 提交于
      Failure to commit a transaction into the CIL is not handled
      correctly. This currently can only happen when racing with a
      shutdown and requires an explicit shutdown check, so it rare and can
      be avoided. Remove the shutdown check and make the CIL commit a void
      function to indicate it will always succeed, thereby removing the
      incorrectly handled failure case.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      c6f990d1
    • D
      xfs: limit extsize to size of AGs and/or MAXEXTLEN · 5315837d
      Dave Chinner 提交于
      The extent size hint can be set to larger than an AG. This means
      that the alignment process can push the range to be allocated
      outside the bounds of the AG, resulting in assert failures or
      corrupted bmbt records. Similarly, if the extsize is larger than the
      maximum extent size supported, the alignment process will produce
      extents that are too large to fit into the bmbt records, resulting
      in a different type of assert/corruption failure.
      
      Fix this by limiting extsize at the time іt is set firstly to be
      less than MAXEXTLEN, then to be a maximum of half the size of the
      AGs in the filesystem for non-realtime inodes. Realtime inodes do
      not allocate out of AGs, so don't have to be restricted by the size
      of AGs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      5315837d
    • D
      xfs: prevent extsize alignment from exceeding maximum extent size · 4ce15989
      Dave Chinner 提交于
      When doing delayed allocation, if the allocation size is for a
      maximally sized extent, extent size alignment can push it over this
      limit. This results in an assert failure in xfs_bmbt_set_allf() as
      the extent length is too large to find in the extent record.
      
      Fix this by ensuring that we allow for space that extent size
      alignment requires (up to 2 * (extsize -1) blocks as we have to
      handle both head and tail alignment) when limiting the maximum size
      of the extent.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      4ce15989
    • D
      xfs: limit extent length for allocation to AG size · 14b064ce
      Dave Chinner 提交于
      Delayed allocation extents can be larger than AGs, so when trying to
      convert a large range we may scan every AG inside
      xfs_bmap_alloc_nullfb() trying to find an AG with a size larger than
      an AG. We should stop when we find the first AG with a maximum
      possible allocation size. This causes excessive CPU usage when there
      are lots of AGs.
      
      The same problem occurs when doing preallocation of a range larger
      than an AG.
      
      Fix the problem by limiting real allocation lengths to the maximum
      that an AG can support. This means if we have empty AGs, we'll stop
      the search at the first of them. If there are no empty AGs, we'll
      still scan them all, but that is a different problem....
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      14b064ce
    • D
      xfs: speculative delayed allocation uses rounddown_power_of_2 badly · b8fc8263
      Dave Chinner 提交于
      rounddown_power_of_2() returns an undefined result when passed a
      value of zero. The specualtive delayed allocation code is doing this
      when the inode is zero length. Hence occasionally the preallocation
      is much, much larger than is necessary (e.g. 8GB for a 270 _byte_
      file). Ensure we don't even pass a zero value to this function so
      the result of preallocation is always the desired size.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      b8fc8263
    • D
      xfs: fix efi item leak on forced shutdown · e34a314c
      Dave Chinner 提交于
      After test 139, kmemleak shows:
      
      unreferenced object 0xffff880078b405d8 (size 400):
        comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s)
        hex dump (first 32 bytes):
          60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff  `..y....`..y....
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
          [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
          [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
          [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
          [<ffffffff8147cd6b>] xfs_efi_init+0x4b/0xb0
          [<ffffffff814a4ee8>] xfs_trans_get_efi+0x58/0x90
          [<ffffffff81455fab>] xfs_bmap_finish+0x8b/0x1d0
          [<ffffffff814851b4>] xfs_itruncate_finish+0x2c4/0x5d0
          [<ffffffff814a970f>] xfs_setattr+0x8df/0xa70
          [<ffffffff814b5c7b>] xfs_vn_setattr+0x1b/0x20
          [<ffffffff8117dc00>] notify_change+0x170/0x2e0
          [<ffffffff81163bf6>] do_truncate+0x66/0xa0
          [<ffffffff81163d0b>] sys_ftruncate+0xdb/0xe0
          [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The cause of the leak is that the "remove" parameter of IOP_UNPIN()
      is never set when a CIL push is aborted. This means that the EFI
      item is never freed if it was in the push being cancelled. The
      problem is specific to delayed logging, but has uncovered a couple
      of problems with the handling of IOP_UNPIN(remove).
      
      Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN()
      in the CIL commit failure path or the iclog write failure path
      because for delayed loging we have no transaction context. Hence we
      must only call xfs_trans_del_item() if the log item being unpinned
      has an active log item descriptor.
      
      Secondly, xfs_trans_uncommit() does not handle log item descriptor
      freeing during the traversal of log items on a transaction. It can
      reference a freed log item descriptor when unpinning an EFI item.
      Hence it needs to use a safe list traversal method to allow items to
      be removed from the transaction during IOP_UNPIN().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      e34a314c
  7. 27 1月, 2011 1 次提交
    • D
      xfs: fix log ticket leak on forced shutdown. · 7db37c5e
      Dave Chinner 提交于
      The kmemleak detector shows this after test 139:
      
      unreferenced object 0xffff880079b88bb0 (size 264):
        comm "xfs_io", pid 4904, jiffies 4294909382 (age 276.824s)
        hex dump (first 32 bytes):
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
          ff ff ff ff ff ff ff ff 48 7b c9 82 ff ff ff ff  ........H{......
        backtrace:
          [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
          [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
          [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
          [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
          [<ffffffff8148f394>] xlog_ticket_alloc+0x34/0x170
          [<ffffffff81494444>] xlog_cil_push+0xa4/0x3f0
          [<ffffffff81494eca>] xlog_cil_force_lsn+0x15a/0x160
          [<ffffffff814933a5>] _xfs_log_force_lsn+0x75/0x2d0
          [<ffffffff814a264d>] _xfs_trans_commit+0x2bd/0x2f0
          [<ffffffff8148bfdd>] xfs_iomap_write_allocate+0x1ad/0x350
          [<ffffffff814ac17f>] xfs_map_blocks+0x21f/0x370
          [<ffffffff814ad1b7>] xfs_vm_writepage+0x1c7/0x550
          [<ffffffff8112200a>] __writepage+0x1a/0x50
          [<ffffffff81122df2>] write_cache_pages+0x1c2/0x4c0
          [<ffffffff81123117>] generic_writepages+0x27/0x30
          [<ffffffff814aba5d>] xfs_vm_writepages+0x5d/0x80
      
      By inspection, the leak occurs when xlog_write() returns and error
      and we jump to the abort path without dropping the reference on the
      active ticket.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      7db37c5e
  8. 18 1月, 2011 1 次提交
  9. 17 1月, 2011 2 次提交
    • C
      fallocate should be a file operation · 2fe17c10
      Christoph Hellwig 提交于
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always commiting the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and remove the already unnedded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2fe17c10
    • C
      make the feature checks in ->fallocate future proof · 64c23e86
      Christoph Hellwig 提交于
      Instead of various home grown checks that might need updates for new
      flags just check for any bit outside the mask of the features supported
      by the filesystem.  This makes the check future proof for any newly
      added flag.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      64c23e86
  10. 13 1月, 2011 1 次提交
  11. 12 1月, 2011 6 次提交
    • D
      xfs: prevent NMI timeouts in cmn_err · 73efe4a4
      Dave Chinner 提交于
      We currently have a global error message buffer in cmn_err that is
      protected by a spin lock that disables interrupts.  Recently there
      have been reports of NMI timeouts occurring when the console is
      being flooded by SCSI error reports due to cmn_err() getting stuck
      trying to print to the console while holding this lock (i.e. with
      interrupts disabled). The NMI watchdog is seeing this CPU as
      non-responding and so is triggering a panic.  While the trigger for
      the reported case is SCSI errors, pretty much anything that spams
      the kernel log could cause this to occur.
      
      Realistically the only reason that we have the intemediate message
      buffer is to prepend the correct kernel log level prefix to the log
      message. The only reason we have the lock is to protect the global
      message buffer and the only reason the message buffer is global is
      to keep it off the stack. Hence if we can avoid needing a global
      message buffer we avoid needing the lock, and we can do this with a
      small amount of cleanup and some preprocessor tricks:
      
      	1. clean up xfs_cmn_err() panic mask functionality to avoid
      	   needing debug code in xfs_cmn_err()
      	2. remove the couple of "!" message prefixes that still exist that
      	   the existing cmn_err() code steps over.
      	3. redefine CE_* levels directly to KERN_*
      	4. redefine cmn_err() and friends to use printk() directly
      	   via variable argument length macros.
      
      By doing this, we can completely remove the cmn_err() code and the
      lock that is causing the problems, and rely solely on printk()
      serialisation to ensure that we don't get garbled messages.
      
      A series of followup patches is really needed to clean up all the
      cmn_err() calls and related messages properly, but that results in a
      series that is not easily back portable to enterprise kernels. Hence
      this initial fix is only to address the direct problem in the lowest
      impact way possible.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      73efe4a4
    • A
      xfs: Add log level to assertion printk · 65a84a0f
      Anton Blanchard 提交于
      I received a ppc64 bug report involving xfs but the assertion was
      filtered out by the console log level. Use KERN_CRIT to ensure it
      makes it out.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      65a84a0f
    • J
      xfs: fix an assignment within an ASSERT() · 1884bd83
      Jesper Juhl 提交于
      In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out:
      label we have this:
      	ASSERT(error = 0);
      I believe a comparison was intended, not an assignment. If I'm
      right, the patch below fixes that up.
      Signed-off-by: NJesper Juhl <jj@chaosbits.net>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      1884bd83
    • C
      xfs: fix error handling for synchronous writes · bfc60177
      Christoph Hellwig 提交于
      If we get an IO error on a synchronous superblock write, we attach an
      error release function to it so that when the last reference goes away
      the release function is called and the buffer is invalidated and
      unlocked. The buffer is left locked until the release function is
      called so that other concurrent users of the buffer will be locked out
      until the buffer error is fully processed.
      
      Unfortunately, for the superblock buffer the filesyetm itself holds a
      reference to the buffer which prevents the reference count from
      dropping to zero and the release function being called. As a result,
      once an IO error occurs on a sync write, the buffer will never be
      unlocked and all future attempts to lock the buffer will hang.
      
      To make matters worse, this problems is not unique to such buffers;
      if there is a concurrent _xfs_buf_find() running, the lookup will grab
      a reference to the buffer and then wait on the buffer lock, preventing
      the reference count from ever falling to zero and hence unlocking the
      buffer.
      
      As such, the whole b_relse function implementation is broken because it
      cannot rely on the buffer reference count falling to zero to unlock the
      errored buffer. The synchronous write error path is the only path that
      uses this callback - it is used to ensure that the synchronous waiter
      gets the buffer error before the error state is cleared from the buffer
      by the release function.
      
      Given that the only sychronous buffer writes now go through xfs_bwrite
      and the error path in question can only occur for a write of a dirty,
      logged buffer, we can move most of the b_relse processing to happen
      inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
      In addition to that we make sure the error is not cleared in
      xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
      Given that xfs_bwrite keeps the buffer locked until it has waited for
      it and checked the error this allows to reliably propagate the error
      to the caller, and make sure that the buffer is reliably unlocked.
      
      Given that xfs_buf_iodone_callbacks was the only instance of the
      b_relse callback we can remove it entirely.
      
      Based on earlier patches by Dave Chinner and Ajeet Yadav.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reported-by: NAjeet Yadav <ajeet.yadav.77@gmail.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      bfc60177
    • C
      xfs: add FITRIM support · a46db608
      Christoph Hellwig 提交于
      Allow manual discards from userspace using the FITRIM ioctl.  This is not
      intended to be run during normal workloads, as the freepsace btree walks
      can cause large performance degradation.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      a46db608
    • D
      xfs: ensure log covering transactions are synchronous · c58efdb4
      Dave Chinner 提交于
      To ensure the log is covered and the filesystem idles correctly, we
      need to ensure that dummy transactions hit the disk and do not stay
      pinned in memory.  If the superblock is pinned in memory, it can't
      be flushed so the log covering cannot make progress. The result is
      dependent on timing - more oftent han not we continue to issues a
      log covering transaction every 36s rather than idling after ~90s.
      
      Fix this by making the log covering transaction synchronous. To
      avoid additional log force from xfssyncd, make the log covering
      transaction take the place of the existing log force in the xfssyncd
      background sync process.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      c58efdb4
  12. 11 1月, 2011 2 次提交
    • D
      xfs: serialise unaligned direct IOs · eda77982
      Dave Chinner 提交于
      When two concurrent unaligned, non-overlapping direct IOs are issued
      to the same block, the direct Io layer will race to zero the block.
      The result is that one of the concurrent IOs will overwrite data
      written by the other IO with zeros. This is demonstrated by the
      xfsqa test 240.
      
      To avoid this problem, serialise all unaligned direct IOs to an
      inode with a big hammer. We need a big hammer approach as we need to
      serialise AIO as well, so we can't just block writes on locks.
      Hence, the big hammer is calling xfs_ioend_wait() while holding out
      other unaligned direct IOs from starting.
      
      We don't bother trying to serialised aligned vs unaligned IOs as
      they are overlapping IO and the result of concurrent overlapping IOs
      is undefined - the result of either IO is a valid result so we let
      them race. Hence we only penalise unaligned IO, which already has a
      major overhead compared to aligned IO so this isn't a major problem.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      eda77982
    • D
      xfs: factor common write setup code · 4d8d1581
      Dave Chinner 提交于
      The buffered IO and direct IO write paths share a common set of
      checks and limiting code prior to issuing the write. Factor that
      into a common helper function.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      4d8d1581