1. 28 1月, 2011 1 次提交
    • D
      xfs: fix efi item leak on forced shutdown · e34a314c
      Dave Chinner 提交于
      After test 139, kmemleak shows:
      
      unreferenced object 0xffff880078b405d8 (size 400):
        comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s)
        hex dump (first 32 bytes):
          60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff  `..y....`..y....
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
          [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
          [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
          [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
          [<ffffffff8147cd6b>] xfs_efi_init+0x4b/0xb0
          [<ffffffff814a4ee8>] xfs_trans_get_efi+0x58/0x90
          [<ffffffff81455fab>] xfs_bmap_finish+0x8b/0x1d0
          [<ffffffff814851b4>] xfs_itruncate_finish+0x2c4/0x5d0
          [<ffffffff814a970f>] xfs_setattr+0x8df/0xa70
          [<ffffffff814b5c7b>] xfs_vn_setattr+0x1b/0x20
          [<ffffffff8117dc00>] notify_change+0x170/0x2e0
          [<ffffffff81163bf6>] do_truncate+0x66/0xa0
          [<ffffffff81163d0b>] sys_ftruncate+0xdb/0xe0
          [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The cause of the leak is that the "remove" parameter of IOP_UNPIN()
      is never set when a CIL push is aborted. This means that the EFI
      item is never freed if it was in the push being cancelled. The
      problem is specific to delayed logging, but has uncovered a couple
      of problems with the handling of IOP_UNPIN(remove).
      
      Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN()
      in the CIL commit failure path or the iclog write failure path
      because for delayed loging we have no transaction context. Hence we
      must only call xfs_trans_del_item() if the log item being unpinned
      has an active log item descriptor.
      
      Secondly, xfs_trans_uncommit() does not handle log item descriptor
      freeing during the traversal of log items on a transaction. It can
      reference a freed log item descriptor when unpinning an EFI item.
      Hence it needs to use a safe list traversal method to allow items to
      be removed from the transaction during IOP_UNPIN().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      e34a314c
  2. 27 1月, 2011 1 次提交
    • D
      xfs: fix log ticket leak on forced shutdown. · 7db37c5e
      Dave Chinner 提交于
      The kmemleak detector shows this after test 139:
      
      unreferenced object 0xffff880079b88bb0 (size 264):
        comm "xfs_io", pid 4904, jiffies 4294909382 (age 276.824s)
        hex dump (first 32 bytes):
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
          ff ff ff ff ff ff ff ff 48 7b c9 82 ff ff ff ff  ........H{......
        backtrace:
          [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
          [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
          [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
          [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
          [<ffffffff8148f394>] xlog_ticket_alloc+0x34/0x170
          [<ffffffff81494444>] xlog_cil_push+0xa4/0x3f0
          [<ffffffff81494eca>] xlog_cil_force_lsn+0x15a/0x160
          [<ffffffff814933a5>] _xfs_log_force_lsn+0x75/0x2d0
          [<ffffffff814a264d>] _xfs_trans_commit+0x2bd/0x2f0
          [<ffffffff8148bfdd>] xfs_iomap_write_allocate+0x1ad/0x350
          [<ffffffff814ac17f>] xfs_map_blocks+0x21f/0x370
          [<ffffffff814ad1b7>] xfs_vm_writepage+0x1c7/0x550
          [<ffffffff8112200a>] __writepage+0x1a/0x50
          [<ffffffff81122df2>] write_cache_pages+0x1c2/0x4c0
          [<ffffffff81123117>] generic_writepages+0x27/0x30
          [<ffffffff814aba5d>] xfs_vm_writepages+0x5d/0x80
      
      By inspection, the leak occurs when xlog_write() returns and error
      and we jump to the abort path without dropping the reference on the
      active ticket.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      7db37c5e
  3. 18 1月, 2011 1 次提交
  4. 17 1月, 2011 2 次提交
    • C
      fallocate should be a file operation · 2fe17c10
      Christoph Hellwig 提交于
      Currently all filesystems except XFS implement fallocate asynchronously,
      while XFS forced a commit.  Both of these are suboptimal - in case of O_SYNC
      I/O we really want our allocation on disk, especially for the !KEEP_SIZE
      case where we actually grow the file with user-visible zeroes.  On the
      other hand always commiting the transaction is a bad idea for fast-path
      uses of fallocate like for example in recent Samba versions.   Given
      that block allocation is a data plane operation anyway change it from
      an inode operation to a file operation so that we have the file structure
      available that lets us check for O_SYNC.
      
      This also includes moving the code around for a few of the filesystems,
      and remove the already unnedded S_ISDIR checks given that we only wire
      up fallocate for regular files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2fe17c10
    • C
      make the feature checks in ->fallocate future proof · 64c23e86
      Christoph Hellwig 提交于
      Instead of various home grown checks that might need updates for new
      flags just check for any bit outside the mask of the features supported
      by the filesystem.  This makes the check future proof for any newly
      added flag.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      64c23e86
  5. 13 1月, 2011 1 次提交
  6. 12 1月, 2011 6 次提交
    • D
      xfs: prevent NMI timeouts in cmn_err · 73efe4a4
      Dave Chinner 提交于
      We currently have a global error message buffer in cmn_err that is
      protected by a spin lock that disables interrupts.  Recently there
      have been reports of NMI timeouts occurring when the console is
      being flooded by SCSI error reports due to cmn_err() getting stuck
      trying to print to the console while holding this lock (i.e. with
      interrupts disabled). The NMI watchdog is seeing this CPU as
      non-responding and so is triggering a panic.  While the trigger for
      the reported case is SCSI errors, pretty much anything that spams
      the kernel log could cause this to occur.
      
      Realistically the only reason that we have the intemediate message
      buffer is to prepend the correct kernel log level prefix to the log
      message. The only reason we have the lock is to protect the global
      message buffer and the only reason the message buffer is global is
      to keep it off the stack. Hence if we can avoid needing a global
      message buffer we avoid needing the lock, and we can do this with a
      small amount of cleanup and some preprocessor tricks:
      
      	1. clean up xfs_cmn_err() panic mask functionality to avoid
      	   needing debug code in xfs_cmn_err()
      	2. remove the couple of "!" message prefixes that still exist that
      	   the existing cmn_err() code steps over.
      	3. redefine CE_* levels directly to KERN_*
      	4. redefine cmn_err() and friends to use printk() directly
      	   via variable argument length macros.
      
      By doing this, we can completely remove the cmn_err() code and the
      lock that is causing the problems, and rely solely on printk()
      serialisation to ensure that we don't get garbled messages.
      
      A series of followup patches is really needed to clean up all the
      cmn_err() calls and related messages properly, but that results in a
      series that is not easily back portable to enterprise kernels. Hence
      this initial fix is only to address the direct problem in the lowest
      impact way possible.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      73efe4a4
    • A
      xfs: Add log level to assertion printk · 65a84a0f
      Anton Blanchard 提交于
      I received a ppc64 bug report involving xfs but the assertion was
      filtered out by the console log level. Use KERN_CRIT to ensure it
      makes it out.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      65a84a0f
    • J
      xfs: fix an assignment within an ASSERT() · 1884bd83
      Jesper Juhl 提交于
      In fs/xfs/xfs_trans.c::xfs_trans_unreserve_and_mod_sb() at the out:
      label we have this:
      	ASSERT(error = 0);
      I believe a comparison was intended, not an assignment. If I'm
      right, the patch below fixes that up.
      Signed-off-by: NJesper Juhl <jj@chaosbits.net>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      1884bd83
    • C
      xfs: fix error handling for synchronous writes · bfc60177
      Christoph Hellwig 提交于
      If we get an IO error on a synchronous superblock write, we attach an
      error release function to it so that when the last reference goes away
      the release function is called and the buffer is invalidated and
      unlocked. The buffer is left locked until the release function is
      called so that other concurrent users of the buffer will be locked out
      until the buffer error is fully processed.
      
      Unfortunately, for the superblock buffer the filesyetm itself holds a
      reference to the buffer which prevents the reference count from
      dropping to zero and the release function being called. As a result,
      once an IO error occurs on a sync write, the buffer will never be
      unlocked and all future attempts to lock the buffer will hang.
      
      To make matters worse, this problems is not unique to such buffers;
      if there is a concurrent _xfs_buf_find() running, the lookup will grab
      a reference to the buffer and then wait on the buffer lock, preventing
      the reference count from ever falling to zero and hence unlocking the
      buffer.
      
      As such, the whole b_relse function implementation is broken because it
      cannot rely on the buffer reference count falling to zero to unlock the
      errored buffer. The synchronous write error path is the only path that
      uses this callback - it is used to ensure that the synchronous waiter
      gets the buffer error before the error state is cleared from the buffer
      by the release function.
      
      Given that the only sychronous buffer writes now go through xfs_bwrite
      and the error path in question can only occur for a write of a dirty,
      logged buffer, we can move most of the b_relse processing to happen
      inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
      In addition to that we make sure the error is not cleared in
      xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
      Given that xfs_bwrite keeps the buffer locked until it has waited for
      it and checked the error this allows to reliably propagate the error
      to the caller, and make sure that the buffer is reliably unlocked.
      
      Given that xfs_buf_iodone_callbacks was the only instance of the
      b_relse callback we can remove it entirely.
      
      Based on earlier patches by Dave Chinner and Ajeet Yadav.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reported-by: NAjeet Yadav <ajeet.yadav.77@gmail.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      bfc60177
    • C
      xfs: add FITRIM support · a46db608
      Christoph Hellwig 提交于
      Allow manual discards from userspace using the FITRIM ioctl.  This is not
      intended to be run during normal workloads, as the freepsace btree walks
      can cause large performance degradation.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      a46db608
    • D
      xfs: ensure log covering transactions are synchronous · c58efdb4
      Dave Chinner 提交于
      To ensure the log is covered and the filesystem idles correctly, we
      need to ensure that dummy transactions hit the disk and do not stay
      pinned in memory.  If the superblock is pinned in memory, it can't
      be flushed so the log covering cannot make progress. The result is
      dependent on timing - more oftent han not we continue to issues a
      log covering transaction every 36s rather than idling after ~90s.
      
      Fix this by making the log covering transaction synchronous. To
      avoid additional log force from xfssyncd, make the log covering
      transaction take the place of the existing log force in the xfssyncd
      background sync process.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      c58efdb4
  7. 11 1月, 2011 4 次提交
  8. 12 1月, 2011 1 次提交
    • D
      xfs: introduce xfs_rw_lock() helpers for locking the inode · 487f84f3
      Dave Chinner 提交于
      We need to obtain the i_mutex, i_iolock and i_ilock during the read
      and write paths. Add a set of wrapper functions to neatly
      encapsulate the lock ordering and shared/exclusive semantics to make
      the locking easier to follow and get right.
      
      Note that this changes some of the exclusive locking serialisation in
      that serialisation will occur against the i_mutex instead of the
      XFS_IOLOCK_EXCL. This does not change any behaviour, and it is
      arguably more efficient to use the mutex for such serialisation than
      the rw_sem.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      487f84f3
  9. 11 1月, 2011 3 次提交
  10. 07 1月, 2011 3 次提交
    • N
      xfs: provide simple rcu-walk ACL implementation · 880566e1
      Nick Piggin 提交于
      This simple implementation just checks for no ACLs on the inode, and
      if so, then the rcu-walk may proceed, otherwise fail it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      880566e1
    • N
      fs: provide rcu-walk aware permission i_ops · b74c79e9
      Nick Piggin 提交于
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b74c79e9
    • N
      fs: icache RCU free inodes · fa0d7e3d
      Nick Piggin 提交于
      RCU free the struct inode. This will allow:
      
      - Subsequent store-free path walking patch. The inode must be consulted for
        permissions when walking, so an RCU inode reference is a must.
      - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
        to take i_lock no longer need to take sb_inode_list_lock to walk the list in
        the first place. This will simplify and optimize locking.
      - Could remove some nested trylock loops in dcache code
      - Could potentially simplify things a bit in VM land. Do not need to take the
        page lock to follow page->mapping.
      
      The downsides of this is the performance cost of using RCU. In a simple
      creat/unlink microbenchmark, performance drops by about 10% due to inability to
      reuse cache-hot slab objects. As iterations increase and RCU freeing starts
      kicking over, this increases to about 20%.
      
      In cases where inode lifetimes are longer (ie. many inodes may be allocated
      during the average life span of a single inode), a lot of this cache reuse is
      not applicable, so the regression caused by this patch is smaller.
      
      The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
      however this adds some complexity to list walking and store-free path walking,
      so I prefer to implement this at a later date, if it is shown to be a win in
      real situations. I haven't found a regression in any non-micro benchmark so I
      doubt it will be a problem.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fa0d7e3d
  11. 21 12月, 2010 2 次提交
    • D
      xfs: convert grant head manipulations to lockless algorithm · d0eb2f38
      Dave Chinner 提交于
      The only thing that the grant lock remains to protect is the grant head
      manipulations when adding or removing space from the log. These calculations
      are already based on atomic variables, so we can already update them safely
      without locks. However, the grant head manpulations require atomic multi-step
      calculations to be executed, which the algorithms currently don't allow.
      
      To make these multi-step calculations atomic, convert the algorithms to
      compare-and-exchange loops on the atomic variables. That is, we sample the old
      value, perform the calculation and use atomic64_cmpxchg() to attempt to update
      the head with the new value. If the head has not changed since we sampled it,
      it will succeed and we are done. Otherwise, we rerun the calculation again from
      a new sample of the head.
      
      This allows us to remove the grant lock from around all the grant head space
      manipulations, and that effectively removes the grant lock from the log
      completely. Hence we can remove the grant lock completely from the log at this
      point.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      d0eb2f38
    • D
      xfs: introduce new locks for the log grant ticket wait queues · 3f16b985
      Dave Chinner 提交于
      The log grant ticket wait queues are currently protected by the log
      grant lock.  However, the queues are functionally independent from
      each other, and operations on them only require serialisation
      against other queue operations now that all of the other log
      variables they use are atomic values.
      
      Hence, we can make them independent of the grant lock by introducing
      new locks just to protect the lists operations. because the lists
      are independent, we can use a lock per list and ensure that reserve
      and write head queuing do not contend.
      
      To ensure forced shutdowns work correctly in conjunction with the
      new fast paths, ensure that we check whether the log has been shut
      down in the grant functions once we hold the relevant spin locks but
      before we go to sleep. This is needed to co-ordinate correctly with
      the wakeups that are issued on the ticket queues so we don't leave
      any processes sleeping on the queues during a shutdown.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      3f16b985
  12. 15 12月, 2010 1 次提交
    • T
      workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() · afe2c511
      Tejun Heo 提交于
      cancel_rearming_delayed_work[queue]() has been superceded by
      cancel_delayed_work_sync() quite some time ago.  Convert all the
      in-kernel users.  The conversions are completely equivalent and
      trivial.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      Acked-by: NEvgeniy Polyakov <zbr@ioremap.net>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: netdev@vger.kernel.org
      Cc: Anton Vorontsov <cbou@mail.ru>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: xfs-masters@oss.sgi.com
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: netfilter-devel@vger.kernel.org
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: linux-nfs@vger.kernel.org
      afe2c511
  13. 10 12月, 2010 1 次提交
    • C
      xfs: log timestamp changes to the source inode in rename · 05340d4a
      Christoph Hellwig 提交于
      Now that we don't mark VFS inodes dirty anymore for internal
      timestamp changes, but rely on the transaction subsystem to push
      them out, we need to explicitly log the source inode in rename after
      updating it's timestamps to make sure the changes actually get
      forced out by sync/fsync or an AIL push.
      
      We already account for the fourth inode in the log reservation, as a
      rename of directories needs to update the nlink field, so just
      adding the xfs_trans_log_inode call is enough.
      
      This fixes the xfsqa 065 regression introduced by:
      
      	"xfs: don't use vfs writeback for pure metadata modifications"
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      05340d4a
  14. 03 12月, 2010 1 次提交
  15. 21 12月, 2010 1 次提交
    • D
      xfs: convert l_tail_lsn to an atomic variable. · 1c3cb9ec
      Dave Chinner 提交于
      log->l_tail_lsn is currently protected by the log grant lock. The
      lock is only needed for serialising readers against writers, so we
      don't really need the lock if we make the l_tail_lsn variable an
      atomic. Converting the l_tail_lsn variable to an atomic64_t means we
      can start to peel back the grant lock from various operations.
      
      Also, provide functions to safely crack an atomic LSN variable into
      it's component pieces and to recombined the components into an
      atomic variable. Use them where appropriate.
      
      This also removes the need for explicitly holding a spinlock to read
      the l_tail_lsn on 32 bit platforms.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      
      1c3cb9ec
  16. 03 12月, 2010 1 次提交
    • D
      xfs: convert l_last_sync_lsn to an atomic variable · 84f3c683
      Dave Chinner 提交于
      log->l_last_sync_lsn is updated in only one critical spot - log
      buffer Io completion - and is protected by the grant lock here. This
      requires the grant lock to be taken for every log buffer IO
      completion. Converting the l_last_sync_lsn variable to an atomic64_t
      means that we do not need to take the grant lock in log buffer IO
      completion to update it.
      
      This also removes the need for explicitly holding a spinlock to read
      the l_last_sync_lsn on 32 bit platforms.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      84f3c683
  17. 21 12月, 2010 6 次提交
  18. 20 12月, 2010 3 次提交
  19. 03 12月, 2010 1 次提交
    • D
      xfs: consume iodone callback items on buffers as they are processed · c90821a2
      Dave Chinner 提交于
      To allow buffer iodone callbacks to consume multiple items off the
      callback list, first we need to convert the xfs_buf_do_callbacks()
      to consume items and always pull the next item from the head of the
      list.
      
      The means the item list walk is never dependent on knowing the
      next item on the list and hence allows callbacks to remove items
      from the list as well. This allows callbacks to do bulk operations
      by scanning the list for identical callbacks, consuming them all
      and then processing them in bulk, negating the need for multiple
      callbacks of that type.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      c90821a2