1. 16 Nov 2008, 1 commit
    • Fix inotify watch removal/umount races · 8f7b0ba1
      Authored by Al Viro
      Inotify watch removals suck violently.
      
      To kick the watch out we need (in this order) inode->inotify_mutex and
      ih->mutex.  That's fine if we have a hold on inode; however, for all
      other cases we need to make damn sure we don't race with umount.  We can
      *NOT* just grab a reference to a watch - inotify_unmount_inodes() will
      happily sail past it and we'll end up with a reference to an inode
      potentially outliving its superblock.
      
      Ideally we just want to grab an active reference to the superblock if
      we can; that will make sure we won't go into inotify_unmount_inodes()
      until we are done.  Cleanup is just deactivate_super().
      
      However, that leaves a messy case - what if we *are* racing with
      umount() and active references to superblock can't be acquired anymore?
      We can bump ->s_count, grab ->s_umount, which will almost certainly wait
      until the superblock is shut down and the watch in question is pining
      for fjords.  That's fine, but there is a problem - we might have hit the
      window between ->s_active getting to 0 / ->s_count dropping below
      S_BIAS (i.e. the moment when the superblock is past the point of no
      return and is heading
      for shutdown) and the moment when deactivate_super() acquires
      ->s_umount.
      
      We could just do drop_super(), yield(), and retry, but that's rather
      antisocial and this stuff is luser-triggerable.  OTOH, having grabbed
      ->s_umount and having found that we'd got there first (i.e.  that
      ->s_root is non-NULL) we know that we won't race with
      inotify_unmount_inodes().
      
      So we could grab a reference to watch and do the rest as above, just
      with drop_super() instead of deactivate_super(), right? Wrong.  We had
      to drop ih->mutex before we could grab ->s_umount.  So the watch
      could've been gone already.
      
      That still can be dealt with - we need to save watch->wd, do idr_find()
      and compare its result with our pointer.  If they match, we either have
      the damn thing still alive or we'd lost not one but two races at once,
      the watch had been killed and a new one got created with the same ->wd
      at the same address.  That couldn't have happened in inotify_destroy(),
      but inotify_rm_wd() could run into that.  Still, "new one got created"
      is not a problem - we have every right to kill it or leave it alone,
      whatever's more convenient.
      
      So we can use idr_find(...) == watch && watch->inode->i_sb == sb as
      "grab it and kill it" check.  If it's been our original watch, we are
      fine, if it's a newcomer - nevermind, just pretend that we'd won the
      race and kill the fscker anyway; we are safe since we know that its
      superblock won't be going away.
      
      And yes, this is far beyond mere "not very pretty"; so's the entire
      concept of inotify to start with.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Greg KH <greg@kroah.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f7b0ba1
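
      A rough C sketch of the final "grab it and kill it" check described
      above, assuming the 2.6.27-era inotify structures (struct
      inotify_handle with an idr member, struct inotify_watch with inode
      and wd fields); the helper name is made up for illustration and this
      is not the actual patch:

          /*
           * With the superblock pinned (so inotify_unmount_inodes() cannot
           * run under us) and ih->mutex re-acquired, decide whether the
           * watch we remembered may be killed.  'wd' was saved before
           * ih->mutex had to be dropped.
           */
          static int watch_still_killable(struct inotify_handle *ih,
                                          struct inotify_watch *watch,
                                          struct super_block *sb, s32 wd)
          {
                  struct inotify_watch *w = idr_find(&ih->idr, wd);

                  /*
                   * A match means either our original watch is still alive,
                   * or a new watch reused the same wd at the same address on
                   * the same superblock.  Either way it is safe to kill it.
                   */
                  return w == watch && w->inode->i_sb == sb;
          }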
  2. 14 Nov 2008, 1 commit
    • dlm: fix shutdown cleanup · 278afcbf
      Authored by David Teigland
      Fixes a regression from commit 0f8e0d9a,
      "dlm: allow multiple lockspace creates".
      
      An extraneous 'else' slipped into a code fragment being moved from
      release_lockspace() to dlm_release_lockspace().  The result of the
      unwanted 'else' is that dlm threads and structures are not stopped
      and cleaned up when the final dlm lockspace is removed.  Trying to
      create a new lockspace again afterward will fail with
      "kmem_cache_create: duplicate cache dlm_conn" because the cache
      was not previously destroyed.
      Signed-off-by: David Teigland <teigland@redhat.com>
      278afcbf
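
      A minimal illustration of the regression, assuming identifiers of that
      era (release_lockspace(), ls_count, threads_stop()); simplified, not
      the exact patch:

          /* Buggy fragment: the stray 'else' means the final-cleanup check
           * is skipped whenever release_lockspace() succeeds. */
          error = release_lockspace(ls, force);
          if (!error)
                  ls_count--;
          else if (!ls_count)             /* extraneous 'else' */
                  threads_stop();         /* never runs after the last successful release */

          /* Fixed fragment: check the count unconditionally, so dlm threads
           * and structures are torn down when the last lockspace is removed. */
          error = release_lockspace(ls, force);
          if (!error)
                  ls_count--;
          if (!ls_count)
                  threads_stop();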
  3. 13 Nov 2008, 2 commits
  4. 11 Nov 2008, 21 commits
  5. 10 Nov 2008, 6 commits
    • [XFS] XFS: Check for valid transaction headers in recovery · 220ca310
      Authored by David Chinner
      When we are about to add a new item to a transaction in recovery, we
      need to check that it is valid first. Currently we just assert that the
      header magic number matches, but in production builds the assert is
      compiled out, so we add a corrupted transaction to the list to be
      processed. This results in a kernel oops later when processing the
      corrupted transaction.
      
      Instead, if we detect a corrupted transaction, abort recovery and leave
      the user to clean up the mess that has occurred.
      
      SGI-PV: 988145
      
      SGI-Modid: xfs-linux-melb:xfs-kern:32356a
      Signed-off-by: David Chinner <david@fromorbit.com>
      Signed-off-by: Tim Shimmin <tes@sgi.com>
      Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
      Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
      220ca310
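
      A sketch of the pattern the fix describes, i.e. turning a debug-only
      assertion into a runtime check; the warning helper and the exact error
      code are assumptions, while XFS_TRANS_HEADER_MAGIC, ASSERT() and
      XFS_ERROR() are identifiers from XFS of that era:

          /* Previously recovery only did
           *      ASSERT(*(uint *)dp == XFS_TRANS_HEADER_MAGIC);
           * which compiles away in production builds.  Check it for real: */
          if (*(uint *)dp != XFS_TRANS_HEADER_MAGIC) {
                  xlog_warn("XFS: bad transaction header magic in recovery");
                  ASSERT(0);                      /* still trips in debug builds */
                  return XFS_ERROR(EIO);          /* abort recovery */
          }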
    • [XFS] handle memory allocation failures during log initialisation · 8f330f51
      Authored by Dave Chinner
      When there is no memory left in the system, xfs_buf_get_noaddr()
      can fail. If this happens at mount time during xlog_alloc_log()
      we fail to catch the error and oops.
      
      Catch the error from xfs_buf_get_noaddr(), and allow other memory
      allocations to fail and catch those errors too. Report the error
      to the console and fail the mount with ENOMEM.
      
      Tested by manually injecting errors into xfs_buf_get_noaddr() and
      xlog_alloc_log().
      
      Version 2:
      o remove unnecessary casts of the returned pointer from kmem_zalloc()
      
      SGI-PV: 987246
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
      8f330f51
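
      A sketch of the check-and-unwind pattern described above, using the
      functions named in the message; the labels, allocation sizes and
      cleanup order are illustrative rather than the actual patch:

          log = kmem_zalloc(sizeof(xlog_t), KM_MAYFAIL);
          if (!log)
                  goto out;                       /* fail the mount with ENOMEM */

          bp = xfs_buf_get_noaddr(BBTOB(1), mp->m_logdev_targp);
          if (!bp)
                  goto out_free_log;
          /* ... further allocations, each checked the same way ... */
          return log;

      out_free_log:
          kmem_free(log);
      out:
          return NULL;                            /* caller reports ENOMEM and fails the mount */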
    • [XFS] Account for allocated blocks when expanding directories · 6f9f51ad
      Authored by David Chinner
      When we create a directory, we reserve a number of blocks for the maximum
      possible expansion of the directory due to various btree splits,
      freespace allocation, etc. Unfortunately, each allocation is not reflected
      in the total number of blocks still available to the transaction, so the
      maximal reservation is used over and over again.
      
      This leads to problems where an allocation group has only enough blocks
      for *some* of the allocations required for the directory modification.
      After the first N allocations, the remaining blocks in the allocation
      group drop below the total reservation, and subsequent allocations fail
      because the allocator will not allow the allocation to proceed if the AG
      does not have enough blocks available for the entire allocation total.
      
      This results in an ENOSPC occurring after an allocation has already
      been made, which aborts the directory operation (leaving the directory
      in an inconsistent state) and cancels a dirty transaction, forcing a
      filesystem shutdown.
      
      Avoid the problem by reflecting the number of blocks allocated in any
      directory expansion in the total number of blocks available to the
      modification in progress. This prevents a directory modification from
      being aborted part way through with an ENOSPC.
      
      SGI-PV: 988144
      
      SGI-Modid: xfs-linux-melb:xfs-kern:32340a
      Signed-off-by: David Chinner <david@fromorbit.com>
      Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
      6f9f51ad
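
      A rough sketch of the accounting idea; dp->i_d.di_nblocks and
      args->total are real fields in XFS of that era, but expand_directory()
      is a hypothetical stand-in and the exact placement in the real patch
      may differ:

          xfs_filblks_t   nblks = dp->i_d.di_nblocks;     /* blocks owned before expanding */

          error = expand_directory(args);                 /* hypothetical stand-in for the
                                                           * actual dir/bmap grow calls */
          if (error)
                  return error;

          /* Reflect what was just allocated in the reservation still
           * available to this operation, so later allocations in the same
           * modification don't demand the full worst-case total again. */
          args->total -= dp->i_d.di_nblocks - nblks;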
    • [XFS] Wait for all I/O on truncate to zero file size · 2cf7f0da
      Authored by Lachlan McIlroy
      It's possible to have outstanding xfs_ioend_t's queued when the file size
      is zero. This can happen in the direct I/O path when a direct I/O write
      fails due to ENOSPC. In this case the xfs_ioend_t will still be queued
      (i.e. xfs_end_io_direct() does not know that the I/O failed, so it can't
      force the xfs_ioend_t to be flushed synchronously).
      
      When we truncate a file on unlink we don't know to wait for these
      xfs_ioend_t's, and we can have a use-after-free situation if the inode
      is reclaimed before the xfs_ioend_t is finally processed.
      
      As suggested by Dave Chinner, let's wait for all I/O to complete when
      truncating the file size to zero.
      
      SGI-PV: 981668
      
      SGI-Modid: xfs-linux-melb:xfs-kern:32216a
      Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
      Signed-off-by: Christoph Hellwig <hch@infradead.org>
      2cf7f0da
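
      A minimal sketch of the behaviour described above, assuming the
      vn_iowait() helper of that era; its placement in the truncate path is
      inferred from the message, not verified:

          if (new_size == 0) {
                  /* Truncating to zero (e.g. on unlink): wait for every
                   * outstanding xfs_ioend_t so none can be processed after
                   * the inode has been reclaimed. */
                  vn_iowait(ip);
          }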
    • [XFS] Fix use-after-free with log and quotas · 9ccbece5
      Authored by Lachlan McIlroy
      Destroying the quota stuff on unmount can access the log - i.e.
      XFS_QM_DONE() ends up in xfs_dqunlock(), which calls
      xfs_trans_unlocked_item() and then xfs_log_move_tail(). By this time the
      log has already been destroyed. Just move the cleanup of the quota code
      earlier in xfs_unmountfs(), before the call to xfs_log_unmount(). Moving
      XFS_QM_DONE() up near XFS_QM_DQPURGEALL() seems like a good spot.
      
      SGI-PV: 987086
      
      SGI-Modid: xfs-linux-melb:xfs-kern:32148a
      Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
      Signed-off-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Peter Leckie <pleckie@sgi.com>
      9ccbece5
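
      A sketch of the reordering in xfs_unmountfs() described above, using
      the macro names from the message; the purge flags are assumed and the
      surrounding unmount code is elided:

          XFS_QM_DQPURGEALL(mp, XFS_QMOPT_QUOTALL | XFS_QMOPT_UMOUNTING);
          /* Quota teardown may poke the log via xfs_dqunlock() ->
           * xfs_trans_unlocked_item() -> xfs_log_move_tail(), so do it
           * while the log still exists ... */
          XFS_QM_DONE(mp);

          /* ... and only then tear the log down. */
          xfs_log_unmount(mp);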
    • Fix nfsd truncation of readdir results · b726e923
      Authored by Doug Nazar
      Commit 8d7c4203 "nfsd: fix failure to set eof in readdir in some
      situations" introduced a bug: on a directory in an exported ext3
      filesystem with dir_index unset, a READDIR will only return about 250
      entries, even if the directory is larger.
      
      Bisected it back to this commit; reverting it fixes the problem.
      
      It turns out that in this case ext3 reads a block at a time, then
      returns from readdir, which means we can end up with buf.full==0 but
      with more entries in the directory still to be read.  Before 8d7c4203
      (but after c002a6c7 "Optimise NFS readdir hack slightly"), this would
      cause us to return the READDIR result immediately, but with the eof bit
      unset.  That could cause a performance regression (because the client
      would need more roundtrips to the server to read the whole directory),
      but no loss in correctness, since the cleared eof bit caused the client
      to send another readdir.  After 8d7c4203, the setting of the eof bit
      made this a correctness problem.
      
      So, move nfserr_eof into the loop and remove the buf.full check so that
      we loop until buf.used==0.  The following seems to do the right thing
      and reduces the network traffic since we don't return a READDIR result
      until the buffer is full.
      
      Tested on an empty directory & large directory; eof is properly sent and
      there are no more short buffers.
      Signed-off-by: Doug Nazar <nazard@dragoninc.ca>
      Cc: David Woodhouse <David.Woodhouse@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
      b726e923
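
      A sketch of the readdir loop after the change, simplified from the
      description above; the buffer bookkeeping uses the buf.used/buf.full
      fields mentioned in the message, and the filldir callback name is
      assumed rather than verified:

          for (;;) {
                  cdp->err = nfserr_eof;          /* assume EOF unless more entries turn up */
                  buf.used = 0;
                  buf.full = 0;

                  host_err = vfs_readdir(file, nfsd_buffered_filldir, &buf);
                  if (buf.full)
                          host_err = 0;
                  if (host_err < 0)
                          break;

                  if (!buf.used)                  /* loop until the filesystem has nothing
                                                   * left, not merely until one buffer fills */
                          break;

                  /* ... encode the buffered entries into the READDIR reply ... */
          }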
  6. 07 Nov 2008, 9 commits