1. 15 10月, 2009 1 次提交
    • F
      kill-the-bkl/reiserfs: always lock the ioctl path · ac78a078
      Frederic Weisbecker 提交于
      Reiserfs uses the ioctl callback for its file operations, which means
      that its ioctl path is still locked by the bkl, this was synchronizing
      with the rest of the filsystem operations. We have changed that by
      locking it with the new reiserfs lock but we do that only from the
      compat_ioctl callback.
      
      Fix that by locking reiserfs_ioctl() everytime.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      ac78a078
  2. 05 10月, 2009 1 次提交
    • F
      kill-the-bkl/reiserfs: fix reiserfs lock to cpu_add_remove_lock dependency · 48f6ba5e
      Frederic Weisbecker 提交于
      While creating the reiserfs workqueue during the journal
      initialization, we are holding the reiserfs lock, but
      create_workqueue() also holds the cpu_add_remove_lock, creating
      then the following dependency:
      
      - reiserfs lock -> cpu_add_remove_lock
      
      But we also have the following existing dependencies:
      
      - mm->mmap_sem -> reiserfs lock
      - cpu_add_remove_lock -> cpu_hotplug.lock -> slub_lock -> sysfs_mutex
      
      The merged dependency chain then becomes:
      
      - mm->mmap_sem -> reiserfs lock -> cpu_add_remove_lock ->
      	cpu_hotplug.lock -> slub_lock -> sysfs_mutex
      
      But when we fill a dir entry in sysfs_readir(), we are holding the
      sysfs_mutex and we also might fault while copying the directory entry
      to the user, leading to the following dependency:
      
      - sysfs_mutex -> mm->mmap_sem
      
      The end result is then a lock inversion between sysfs_mutex and
      mm->mmap_sem, as reported in the following lockdep warning:
      
      [ INFO: possible circular locking dependency detected ]
      2.6.31-07095-g25a3912 #4
      -------------------------------------------------------
      udevadm/790 is trying to acquire lock:
       (&mm->mmap_sem){++++++}, at: [<c1098942>] might_fault+0x72/0xc0
      
      but task is already holding lock:
       (sysfs_mutex){+.+.+.}, at: [<c110813c>] sysfs_readdir+0x7c/0x260
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #5 (sysfs_mutex){+.+.+.}:
            [...]
      
      -> #4 (slub_lock){+++++.}:
            [...]
      
      -> #3 (cpu_hotplug.lock){+.+.+.}:
            [...]
      
      -> #2 (cpu_add_remove_lock){+.+.+.}:
            [...]
      
      -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
            [...]
      
      -> #0 (&mm->mmap_sem){++++++}:
            [...]
      
      This can be fixed by relaxing the reiserfs lock while creating the
      workqueue.
      This is fine to relax the lock here, we just keep it around to pass
      through reiserfs lock checks and for paranoid reasons.
      Reported-by: NAlexander Beregalov <a.beregalov@gmail.com>
      Tested-by: NAlexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      48f6ba5e
  3. 17 9月, 2009 1 次提交
    • F
      kill-the-bkl/reiserfs: Fix induced mm->mmap_sem to sysfs_mutex dependency · 193be0ee
      Frederic Weisbecker 提交于
      Alexander Beregalov reported the following warning:
      
      	=======================================================
      	[ INFO: possible circular locking dependency detected ]
      	2.6.31-03149-gdcc030a #1
      	-------------------------------------------------------
      	udevadm/716 is trying to acquire lock:
      	 (&mm->mmap_sem){++++++}, at: [<c107249a>] might_fault+0x4a/0xa0
      
      	but task is already holding lock:
      	 (sysfs_mutex){+.+.+.}, at: [<c10cb9aa>] sysfs_readdir+0x5a/0x200
      
      	which lock already depends on the new lock.
      
      	the existing dependency chain (in reverse order) is:
      
      	-> #3 (sysfs_mutex){+.+.+.}:
      	       [...]
      
      	-> #2 (&bdev->bd_mutex){+.+.+.}:
      	       [...]
      
      	-> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
      	       [...]
      
      	-> #0 (&mm->mmap_sem){++++++}:
      	       [...]
      
      On reiserfs mount path, we take the reiserfs lock and while
      initializing the journal, we open the device, taking the
      bdev->bd_mutex. Then rescan_partition() may signal the change
      to sysfs.
      
      We have then the following dependency:
      
      	reiserfs_lock -> bd_mutex -> sysfs_mutex
      
      Later, while entering reiserfs_readpage() after a pagefault in an
      mmaped reiserfs file, we are holding the mm->mmap_sem, and we are going
      to take the reiserfs lock too.
      We have then the following dependency:
      
      	mm->mmap_sem -> reiserfs_lock
      
      which, expanded with the previous dependency gives us:
      
      	mm->mmap_sem -> reiserfs_lock -> bd_mutex -> sysfs_mutex
      
      Now while entering the sysfs readdir path, we are holding the
      sysfs_mutex. And when we copy a directory entry to the user buffer, we
      might fault and then take the mm->mmap_sem lock. Which leads to the
      circular locking dependency reported.
      
      We can fix that by relaxing the reiserfs lock during the call to
      journal_init_dev(), which is the place where we open the mounted
      device.
      
      This is fine to relax the lock here because we are in the begining of
      the reiserfs mount path and there is nothing to protect at this time,
      the journal is not intialized.
      We just keep this lock around for paranoid reasons.
      Reported-by: NAlexander Beregalov <a.beregalov@gmail.com>
      Tested-by: NAlexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      193be0ee
  4. 14 9月, 2009 22 次提交
    • F
      kill-the-bkl/reiserfs: panic in case of lock imbalance · 80503185
      Frederic Weisbecker 提交于
      Until now, trying to unlock the reiserfs write lock whereas the current
      task doesn't hold it lead to a simple warning.
      We should actually warn and panic in this case to avoid the user datas
      to reach an unstable state.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      80503185
    • F
      kill-the-bkl/reiserfs: fix recursive reiserfs write lock in reiserfs_commit_write() · 7e942770
      Frederic Weisbecker 提交于
      reiserfs_commit_write() is always called with the write lock held.
      Thus the current calls to reiserfs_write_lock() in this function are
      acquiring the lock recursively.
      We can safely drop them.
      
      This also solves further assumptions for this lock to be really
      released while calling reiserfs_write_unlock().
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      7e942770
    • F
      kill-the-bkl/reiserfs: fix recursive reiserfs lock in reiserfs_mkdir() · b10ab4c3
      Frederic Weisbecker 提交于
      reiserfs_mkdir() acquires the reiserfs lock, assuming it has been called
      from the dir inodes callbacks, without the lock held.
      
      But it can also be called from other internal sites such as
      reiserfs_xattr_init() which already holds the lock. This recursive
      locking leads to further wrong assumptions. For example, later calls
      to reiserfs_mutex_lock_safe() won't actually unlock the reiserfs lock
      the time we acquire a given mutex, creating unexpected lock inversions.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      b10ab4c3
    • F
      kill-the-bkl/reiserfs: fix "reiserfs lock" / "inode mutex" lock inversion dependency · ae635c0b
      Frederic Weisbecker 提交于
      reiserfs_xattr_init is called with the reiserfs write lock held, but
      if the ".reiserfs_priv" entry is not created, we take the superblock
      root directory inode mutex until .reiserfs_priv is created.
      
      This creates a lock dependency inversion against other sites such as
      reiserfs_file_release() which takes an inode mutex and the reiserfs
      lock after.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      ae635c0b
    • F
      kill-the-bkl/reiserfs: move the concurrent tree accesses checks per superblock · 08f14fc8
      Frederic Weisbecker 提交于
      When do_balance() balances the tree, a trick is performed to
      provide the ability for other tree writers/readers to check whether
      do_balance() is executing concurrently (requires CONFIG_REISERFS_CHECK).
      
      This is done to protect concurrent accesses to the tree. The trick
      is the following:
      
      When do_balance is called, a unique global variable called cur_tb
      takes a pointer to the current tree to be rebalanced.
      Once do_balance finishes its work, cur_tb takes the NULL value.
      
      Then, concurrent tree readers/writers just have to check the value
      of cur_tb to ensure do_balance isn't executing concurrently.
      If it is, then it proves that schedule() occured on do_balance(),
      which then relaxed the bkl that protected the tree.
      
      Now that the bkl has be turned into a mutex, this check is still
      fine even though do_balance() becomes preemptible: the write lock
      will not be automatically released on schedule(), so the tree is
      still protected.
      
      But this is only fine if we have a single reiserfs mountpoint.
      Indeed, because the bkl is a global lock, it didn't allowed
      concurrent executions between a tree reader/writer in a mount point
      and a do_balance() on another tree from another mountpoint.
      
      So assuming all these readers/writers weren't supposed to be
      reentrant, the current check now sometimes detect false positives with
      the current per-superblock mutex which allows this reentrancy.
      
      This patch keeps the concurrent tree accesses check but moves it
      per superblock, so that only trees from a same mount point are
      checked to be not accessed concurrently.
      
      [ Impact: fix spurious panic while running several reiserfs mount-points ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      08f14fc8
    • F
      kill-the-bkl/reiserfs: acquire the inode mutex safely · c72e0575
      Frederic Weisbecker 提交于
      While searching a pathname, an inode mutex can be acquired
      in do_lookup() which calls reiserfs_lookup() which in turn
      acquires the write lock.
      
      On the other side reiserfs_fill_super() can acquire the write_lock
      and then call reiserfs_lookup_privroot() which can acquire an
      inode mutex (the root of the mount point).
      
      So we theoretically risk an AB - BA lock inversion that could lead
      to a deadlock.
      
      As for other lock dependencies found since the bkl to mutex
      conversion, the fix is to use reiserfs_mutex_lock_safe() which
      drops the lock dependency to the write lock.
      
      [ Impact: fix a possible deadlock with reiserfs ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      c72e0575
    • F
      kill-the-bkl/reiserfs: unlock only when needed in search_by_key · 2ac62695
      Frederic Weisbecker 提交于
      search_by_key() is the site which most requires the lock.
      This is mostly because it is a very central function and also
      because it releases/reaqcuires the write lock at least once each
      time it is called.
      
      Such release/reacquire creates a lot of contention in this place and
      also opens more the window which let another thread changing the tree.
      When it happens, the current path searching over the tree must be
      retried from the beggining (the root) which is a wasteful and
      time consuming recovery.
      
      This patch factorizes two release/reacquire sequences:
      
      - reading leaf nodes blocks
      - reading current block
      
      The latter immediately follows the former.
      
      The whole sequence is safe as a single unlocked section because
      we check just after if the tree has changed during these operations.
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      2ac62695
    • F
      kill-the-bkl/reiserfs: use mutex_lock in reiserfs_mutex_lock_safe · c63e3c0b
      Frederic Weisbecker 提交于
      reiserfs_mutex_lock_safe() is a hack to avoid any dependency between
      an internal reiserfs mutex and the write lock, it has been proposed
      to follow the old bkl logic.
      
      The code does the following:
      
      while (!mutex_trylock(m)) {
      	reiserfs_write_unlock(s);
      	schedule();
      	reiserfs_write_lock(s);
      }
      
      It then imitate the implicit behaviour of the lock when it was
      a Bkl and hadn't such dependency:
      
      mutex_lock(m) {
      	if (fastpath)
      		let's go
      	else {
      		wait_for_mutex() {
      			schedule() {
      				unlock_kernel()
      				reacquire_lock_kernel()
      			}
      		}
      	}
      }
      
      The problem is that by using such explicit schedule(), we don't
      benefit of the adaptive mutex spinning on owner.
      
      The logic in use now is:
      
      reiserfs_write_unlock(s);
      mutex_lock(m); // -> possible adaptive spinning
      reiserfs_write_lock(s);
      
      [ Impact: restore the use of adaptive spinning mutexes in reiserfs ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      c63e3c0b
    • F
      kill-the-bkl/reiserfs: factorize the locking in reiserfs_write_end() · d6f5b0aa
      Frederic Weisbecker 提交于
      reiserfs_write_end() is a hot path in reiserfs.
      We have two wasteful write lock lock/release inside that can be gathered
      without changing the code logic.
      
      This patch factorizes them out in a single protected section, reducing the
      number of contentions inside.
      
      [ Impact: reduce lock contention in a reiserfs hotpath ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      d6f5b0aa
    • F
      kill-the-bkl/reiserfs: reduce number of contentions in search_by_key() · 09eb47a7
      Frederic Weisbecker 提交于
      search_by_key() is a central function in reiserfs which searches
      the patch in the fs tree from the root to a node given its key.
      
      It is the function that is most requesting the write lock
      because it's a path very often used.
      
      Also we forget to release the lock while reading the next tree node,
      making us holding the lock in a wasteful way.
      
      Then we release the lock while reading the current node and its childs,
      all-in-one. It should be safe because we have a reference to these
      blocks and even if we read a block that will be concurrently changed,
      we have an fs_changed check later that will make us retry the path from
      the root.
      
      [ Impact: release the write lock while unused in a hot path ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      09eb47a7
    • F
      kill-the-bkl/reiserfs: don't hold the write recursively in reiserfs_lookup() · b1c839bb
      Frederic Weisbecker 提交于
      The write lock can be acquired recursively in reiserfs_lookup(). But we may
      want to *really* release the lock before possible rescheduling from a
      reiserfs_lookup() callee.
      
      Hence we want to only acquire the lock once (ie: not recursively).
      
      [ Impact: prevent from possible false unreleased write lock on sleeping ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      b1c839bb
    • F
      kill-the-bkl/reiserfs: lock only once on reiserfs_get_block() · 26931309
      Frederic Weisbecker 提交于
      reiserfs_get_block() is one of these sites where the write lock might
      be acquired recursively.
      
      It's a particular problem because this function is called very often.
      It's a hot spot which needs to reschedule() periodically while converting
      direct items to indirect ones because it can take some time.
      
      Then if we are applying the write lock release/reacquire pattern on
      schedule() here, it may not produce the desired effect since we may have
      locked in more than one depth.
      
      The solution is to use reiserfs_write_lock_once() which won't try
      to reacquire the lock recursively. Then the lock will be *really*
      released before schedule().
      
      Also, we only release the lock if TIF_NEED_RESCHED is set to not
      create wasteful numerous contentions.
      
      [ Impact: fix a too long holded lock case in reiserfs_get_block() ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      26931309
    • F
      kill-the-BKL/reiserfs: release the write lock on flush_commit_list() · 6e3647ac
      Frederic Weisbecker 提交于
      flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
      ll_rw_block() might sleep, and the bkl was released at this point. Then
      we can also relax the write lock at this point.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      6e3647ac
    • F
      kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block() · 4c5eface
      Frederic Weisbecker 提交于
      reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block. This
      helper might sleep.
      
      Then, when the bkl was used, it was released at this point. We can then
      relax the write lock too here.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      4c5eface
    • F
      kill-the-BKL/reiserfs: release the write lock inside get_neighbors() · 148d3504
      Frederic Weisbecker 提交于
      get_neighbors() is used to get the left and/or right blocks
      against a given one in order to balance a tree.
      
      sb_bread() is used to read the buffer of these neighors blocks and
      while it waits for this operation, it might sleep.
      
      The bkl was released at this point, and then we can also release
      the write lock before calling sb_bread().
      
      This is safe because if the filesystem is changed after this
      lock release, the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED
      in the function header comments) in order to repeat the neighbhor
      research.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      148d3504
    • F
      kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut() · 5e69e3a4
      Frederic Weisbecker 提交于
      prepare_for_delete_or_cut() can process several types of items, including
      indirect items, ie: items which contain no file data but pointers to
      unformatted nodes scattering the datas of a file.
      
      In this case it has to zero out these pointers to block numbers of
      unformatted nodes and release the bitmap from these block numbers.
      
      It can take some time, so a rescheduling() is performed between each
      block processed. We can safely release the write lock while
      rescheduling(), like the bkl did, because the code checks just after
      if the item has moved after sleeping.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      5e69e3a4
    • F
      kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end() · e6950a4d
      Frederic Weisbecker 提交于
      When do_journal_end() copies data to the journal blocks buffers in memory,
      it reschedules if needed between each block copied and dirtyfied.
      
      We can also release the write lock at this rescheduling stage,
      like did the bkl implicitly.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      e6950a4d
    • F
      kill-the-BKL/reiserfs: only acquire the write lock once in reiserfs_dirty_inode · dc8f6d89
      Frederic Weisbecker 提交于
      Impact: fix a deadlock
      
      reiserfs_dirty_inode() is the super_operations::dirty_inode() callback
      of reiserfs. It can be called from different contexts where the write
      lock can be already held.
      
      But this function also grab the write lock (possibly recursively).
      Subsequent release of the lock before sleep will actually not release
      the lock if the caller of mark_inode_dirty() (which in turn calls
      reiserfs_dirty_inode()) already owns the lock.
      
      A typical case:
      
      reiserfs_write_end() {
      	acquire_write_lock()
      	mark_inode_dirty() {
      		reiserfs_dirty_inode() {
      			reacquire_write_lock() {
      				journal_begin() {
      					do_journal_begin_r() {
      						/*
      						 * fail to release, still
      						 * one depth of lock
      						 */
      						release_write_lock()
      						reiserfs_wait_on_write_block() {
      							wait_event()
      
      The event is usually provided by something which needs the write lock but
      it hasn't been released.
      
      We use reiserfs_write_lock_once() here to ensure we only grab the
      write lock in one level.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      LKML-Reference: <1239680065-25013-4-git-send-email-fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      dc8f6d89
    • F
      kill-the-BKL/reiserfs: lock only once in reiserfs_truncate_file · 22c963ad
      Frederic Weisbecker 提交于
      Impact: fix a deadlock
      
      reiserfs_truncate_file() can be called from multiple context where
      the write lock can be already hold or not.
      
      This function also acquire (possibly recursively) the write
      lock. Subsequent releases before sleeping will not actually release
      the lock because we may be in more than one lock depth degree.
      
      A typical case is:
      
      reiserfs_file_release {
      	acquire_the_lock()
      	reiserfs_truncate_file()
      		reacquire_the_lock()
      		journal_begin() {
      			do_journal_begin_r() {
      				reiserfs_wait_on_write_block() {
      					/*
      					 * Not released because still one
      					 * depth owned
      					 */
      					release_lock()
      					wait_for_event()
      
      At this stage the event never happen because the one which provides
      it needs the write lock.
      
      We use reiserfs_write_lock_once() here to ensure that we don't acquire the
      write lock recursively.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      LKML-Reference: <1239680065-25013-3-git-send-email-fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      22c963ad
    • F
      kill-the-BKL/reiserfs: provide a tool to lock only once the write lock · daf88c89
      Frederic Weisbecker 提交于
      Sometimes we don't want to recursively hold the per superblock write
      lock because we want to be sure it is actually released when we come
      to sleep.
      
      This patch introduces the necessary tools for that.
      
      reiserfs_write_lock_once() does the same job than reiserfs_write_lock()
      except that it won't try to acquire recursively the lock if the current
      task already owns it. Also the lock_depth before the call of this function
      is returned.
      
      reiserfs_write_unlock_once() unlock only if reiserfs_write_lock_once()
      returned a depth equal to -1, ie: only if it actually locked.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      LKML-Reference: <1239680065-25013-2-git-send-email-fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      daf88c89
    • F
      reiserfs, kill-the-BKL: fix unsafe j_flush_mutex lock · a412f9ef
      Frederic Weisbecker 提交于
      Impact: fix a deadlock
      
      The j_flush_mutex is acquired safely in journal.c:
      if we can't take it, we free the reiserfs per superblock lock
      and wait a bit.
      
      But we have a remaining place in kupdate_transactions() where
      j_flush_mutex is still acquired traditionnaly. Thus the following
      scenario (warned by lockdep) can happen:
      
      A						B
      
      mutex_lock(&write_lock)			mutex_lock(&write_lock)
      	mutex_lock(&j_flush_mutex)	mutex_lock(&j_flush_mutex) //block
      	mutex_unlock(&write_lock)
      	sleep...
      	mutex_lock(&write_lock) //deadlock
      
      Fix this by using reiserfs_mutex_lock_safe() in kupdate_transactions().
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      LKML-Reference: <1239660635-12940-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a412f9ef
    • F
      reiserfs: kill-the-BKL · 8ebc4232
      Frederic Weisbecker 提交于
      This patch is an attempt to remove the Bkl based locking scheme from
      reiserfs and is intended.
      
      It is a bit inspired from an old attempt by Peter Zijlstra:
      
         http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html
      
      The bkl is heavily used in this filesystem to prevent from
      concurrent write accesses on the filesystem.
      
      Reiserfs makes a deep use of the specific properties of the Bkl:
      
      - It can be acqquired recursively by a same task
      - It is released on the schedule() calls and reacquired when schedule() returns
      
      The two properties above are a roadmap for the reiserfs write locking so it's
      very hard to simply replace it with a common mutex.
      
      - We need a recursive-able locking unless we want to restructure several blocks
        of the code.
      - We need to identify the sites where the bkl was implictly relaxed
        (schedule, wait, sync, etc...) so that we can in turn release and
        reacquire our new lock explicitly.
        Such implicit releases of the lock are often required to let other
        resources producer/consumer do their job or we can suffer unexpected
        starvations or deadlocks.
      
      So the new lock that replaces the bkl here is a per superblock mutex with a
      specific property: it can be acquired recursively by a same task, like the
      bkl.
      
      For such purpose, we integrate a lock owner and a lock depth field on the
      superblock information structure.
      
      The first axis on this patch is to turn reiserfs_write_(un)lock() function
      into a wrapper to manage this mutex. Also some explicit calls to
      lock_kernel() have been converted to reiserfs_write_lock() helpers.
      
      The second axis is to find the important blocking sites (schedule...(),
      wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit
      release of the write lock on these locations before blocking. Then we can
      safely wait for those who can give us resources or those who need some.
      Typically this is a fight between the current writer, the reiserfs workqueue
      (aka the async commiter) and the pdflush threads.
      
      The third axis is a consequence of the second. The write lock is usually
      on top of a lock dependency chain which can include the journal lock, the
      flush lock or the commit lock. So it's dangerous to release and trying to
      reacquire the write lock while we still hold other locks.
      
      This is fine with the bkl:
      
            T1                       T2
      
      lock_kernel()
          mutex_lock(A)
          unlock_kernel()
          // do something
                                  lock_kernel()
                                      mutex_lock(A) -> already locked by T1
                                      schedule() (and then unlock_kernel())
          lock_kernel()
          mutex_unlock(A)
          ....
      
      This is not fine with a mutex:
      
            T1                       T2
      
      mutex_lock(write)
          mutex_lock(A)
          mutex_unlock(write)
          // do something
                                 mutex_lock(write)
                                    mutex_lock(A) -> already locked by T1
                                    schedule()
      
          mutex_lock(write) -> already locked by T2
          deadlock
      
      The solution in this patch is to provide a helper which releases the write
      lock and sleep a bit if we can't lock a mutex that depend on it. It's another
      simulation of the bkl behaviour.
      
      The last axis is to locate the fs callbacks that are called with the bkl held,
      according to Documentation/filesystem/Locking.
      
      Those are:
      
      - reiserfs_remount
      - reiserfs_fill_super
      - reiserfs_put_super
      
      Reiserfs didn't need to explicitly lock because of the context of these callbacks.
      But now we must take care of that with the new locking.
      
      After this patch, reiserfs suffers from a slight performance regression (for now).
      On UP, a high volume write with dd reports an average of 27 MB/s instead
      of 30 MB/s without the patch applied.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: NIngo Molnar <mingo@elte.hu>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Bron Gondwana <brong@fastmail.fm>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      LKML-Reference: <1239070789-13354-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8ebc4232
  5. 07 9月, 2009 1 次提交
  6. 06 9月, 2009 2 次提交
    • N
      ext2: fix unbalanced kmap()/kunmap() · 9de6886e
      Nicolas Pitre 提交于
      In ext2_rename(), dir_page is acquired through ext2_dotdot().  It is
      then released through ext2_set_link() but only if old_dir != new_dir.
      Failing that, the pkmap reference count is never decremented and the
      page remains pinned forever.  Repeat that a couple times with highmem
      pages and all pkmap slots get exhausted, and every further kmap() calls
      end up stalling on the pkmap_map_wait queue at which point the whole
      system comes to a halt.
      Signed-off-by: NNicolas Pitre <nico@marvell.com>
      Acked-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9de6886e
    • O
      exec: do not sleep in TASK_TRACED under ->cred_guard_mutex · a2a8474c
      Oleg Nesterov 提交于
      Tom Horsley reports that his debugger hangs when it tries to read
      /proc/pid_of_tracee/maps, this happens since
      
      	"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
      	04b836cbf19e885f8366bccb2e4b0474346c02d
      
      commit in 2.6.31.
      
      But the root of the problem lies in the fact that do_execve() path calls
      tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.
      
      The tracee must not sleep in TASK_TRACED holding this mutex.  Even if we
      remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
      another task doing PTRACE_ATTACH should not hang until it is killed or the
      tracee resumes.
      
      With this patch do_execve() does not use ->cred_guard_mutex directly and
      we do not hold it throughout, instead:
      
      	- introduce prepare_bprm_creds() helper, it locks the mutex
      	  and calls prepare_exec_creds() to initialize bprm->cred.
      
      	- install_exec_creds() drops the mutex after commit_creds(),
      	  and thus before tracehook_report_exec()->ptrace_stop().
      
      	  or, if exec fails,
      
      	  free_bprm() drops this mutex when bprm->cred != NULL which
      	  indicates install_exec_creds() was not called.
      Reported-by: NTom Horsley <tom.horsley@att.net>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2a8474c
  7. 05 9月, 2009 1 次提交
  8. 03 9月, 2009 1 次提交
  9. 02 9月, 2009 1 次提交
  10. 01 9月, 2009 1 次提交
  11. 31 8月, 2009 1 次提交
    • R
      nilfs2: fix preempt count underflow in nilfs_btnode_prepare_change_key · b1f1b8ce
      Ryusuke Konishi 提交于
      This will fix the following preempt count underflow reported from
      users with the title "[NILFS users] segctord problem" (Message-ID:
      <949415.6494.qm@web58808.mail.re1.yahoo.com> and Message-ID:
      <debc30fc0908270825v747c1734xa59126623cfd5b05@mail.gmail.com>):
      
       WARNING: at kernel/sched.c:4890 sub_preempt_count+0x95/0xa0()
       Hardware name: HP Compaq 6530b (KR980UT#ABC)
       Modules linked in: bridge stp llc bnep rfcomm l2cap xfs exportfs nilfs2 cowloop loop vboxnetadp vboxnetflt vboxdrv btusb bluetooth uvcvideo videodev v4l1_compat v4l2_compat_ioctl32 arc4 snd_hda_codec_analog ecb iwlagn iwlcore rfkill lib80211 mac80211 snd_hda_intel snd_hda_codec ehci_hcd uhci_hcd usbcore snd_hwdep snd_pcm tg3 cfg80211 psmouse snd_timer joydev libphy ohci1394 snd_page_alloc hp_accel lis3lv02d ieee1394 led_class i915 drm i2c_algo_bit video backlight output i2c_core dm_crypt dm_mod
       Pid: 4197, comm: segctord Not tainted 2.6.30-gentoo-r4-64 #7
       Call Trace:
        [<ffffffff8023fa05>] ? sub_preempt_count+0x95/0xa0
        [<ffffffff802470f8>] warn_slowpath_common+0x78/0xd0
        [<ffffffff8024715f>] warn_slowpath_null+0xf/0x20
        [<ffffffff8023fa05>] sub_preempt_count+0x95/0xa0
        [<ffffffffa04ce4db>] nilfs_btnode_prepare_change_key+0x11b/0x190 [nilfs2]
        [<ffffffffa04d01ad>] nilfs_btree_assign_p+0x19d/0x1e0 [nilfs2]
        [<ffffffffa04d10ad>] nilfs_btree_assign+0xbd/0x130 [nilfs2]
        [<ffffffffa04cead7>] nilfs_bmap_assign+0x47/0x70 [nilfs2]
        [<ffffffffa04d9bc6>] nilfs_segctor_do_construct+0x956/0x20f0 [nilfs2]
        [<ffffffff805ac8e2>] ? _spin_unlock_irqrestore+0x12/0x40
        [<ffffffff803c06e0>] ? __up_write+0xe0/0x150
        [<ffffffff80262959>] ? up_write+0x9/0x10
        [<ffffffffa04ce9f3>] ? nilfs_bmap_test_and_clear_dirty+0x43/0x60 [nilfs2]
        [<ffffffffa04cd627>] ? nilfs_mdt_fetch_dirty+0x27/0x60 [nilfs2]
        [<ffffffffa04db5fc>] nilfs_segctor_construct+0x8c/0xd0 [nilfs2]
        [<ffffffffa04dc3dc>] nilfs_segctor_thread+0x15c/0x3a0 [nilfs2]
        [<ffffffffa04dbe20>] ? nilfs_construction_timeout+0x0/0x10 [nilfs2]
        [<ffffffff80252633>] ? add_timer+0x13/0x20
        [<ffffffff802370da>] ? __wake_up_common+0x5a/0x90
        [<ffffffff8025e960>] ? autoremove_wake_function+0x0/0x40
        [<ffffffffa04dc280>] ? nilfs_segctor_thread+0x0/0x3a0 [nilfs2]
        [<ffffffffa04dc280>] ? nilfs_segctor_thread+0x0/0x3a0 [nilfs2]
        [<ffffffff8025e556>] kthread+0x56/0x90
        [<ffffffff8020cdea>] child_rip+0xa/0x20
        [<ffffffff8025e500>] ? kthread+0x0/0x90
        [<ffffffff8020cde0>] ? child_rip+0x0/0x20
      
      This problem was caused due to a missing radix_tree_preload() call in
      the retry path of nilfs_btnode_prepare_change_key() function.
      Reported-by: NEric A <eric225125@yahoo.com>
      Reported-by: NJerome Poulin <jeromepoulin@gmail.com>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: NJerome Poulin <jeromepoulin@gmail.com>
      Cc: stable@kernel.org
      b1f1b8ce
  12. 29 8月, 2009 1 次提交
  13. 28 8月, 2009 4 次提交
  14. 27 8月, 2009 2 次提交
    • E
      inotify: Ensure we alwasy write the terminating NULL. · 0db501bd
      Eric W. Biederman 提交于
      Before the rewrite copy_event_to_user always wrote a terqminating '\0'
      byte to user space after the filename.  Since the rewrite that
      terminating byte was skipped if your filename is exactly a multiple of
      event_size.  Ouch!
      
      So add one byte to name_size before we round up and use clear_user to
      set userspace to zero like /dev/zero does instead of copying the
      strange nul_inotify_event.  I can't quite convince myself len_to_zero
      will never exceed 16 and even if it doesn't clear_user should be more
      efficient and a more accurate reflection of what the code is trying to
      do.
      Signed-off-by: NEric W. Biederman <ebiederm@aristanetworks.com>
      Signed-off-by: NEric Paris <eparis@redhat.com>
      0db501bd
    • E
      inotify: fix locking around inotify watching in the idr · dead537d
      Eric Paris 提交于
      The are races around the idr storage of inotify watches.  It's possible
      that a watch could be found from sys_inotify_rm_watch() in the idr, but it
      could be removed from the idr before that code does it's removal.  Move the
      locking and the refcnt'ing so that these have to happen atomically.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      dead537d