1. 24 Sep 2013 (1 commit)
  2. 09 Aug 2013 (1 commit)
    • reiserfs: locking, handle nested locks properly · 278f6679
      Committed by Jeff Mahoney
      The reiserfs write lock replaced the BKL and uses similar semantics.
      
      Frederic's locking code makes a distinction between when the lock is nested
      and when it's being acquired/released, but I don't think that's the right
      distinction to make.
      
      The right distinction is between the lock being released at end-of-use and
      the lock being released for a schedule. The unlock should return the depth
      and the lock should restore it, rather than the other way around as it is now.
      
      This patch implements that and adds a number of places where the lock
      should be dropped.
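      A minimal sketch of the depth save/restore pattern described above. The
      helper names below are the ones this patch appears to introduce; treat the
      snippet as illustrative rather than the exact patch:
      
      	/* Around a call that may sleep, somewhere in fs/reiserfs. */
      	int depth;
      
      	/* Drop the write lock entirely, remembering how deep we were. */
      	depth = reiserfs_write_unlock_nested(sb);
      
      	schedule();	/* or wait_on_buffer(), vmalloc(), ... */
      
      	/* Reacquire the lock and restore the recorded nesting depth. */
      	reiserfs_write_lock_nested(sb, depth);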
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      278f6679
  3. 07 May 2013 (1 commit)
  4. 01 Jun 2012 (3 commits)
    • reiserfs: get rid of reiserfs_sync_super · 033369d1
      Committed by Artem Bityutskiy
      This patch stops reiserfs from using the VFS 'write_super()' method along with
      the s_dirt flag, because they are on their way out.
      
      The whole "superblock write-out" VFS infrastructure is served by the
      'sync_supers()' kernel thread, which wakes up every 5 seconds (by default) and
      writes out all dirty superblocks using the '->write_super()' call-back.  But the
      problem with this thread is that it wastes power by waking up the system every
      5 seconds, even if there are no dirty superblocks, or there are no client
      file-systems which would need this (e.g., btrfs does not use
      '->write_super()'). So we want to kill it completely and thus, we need to make
      file-systems stop using the '->write_super()' VFS service, and then remove
      it together with the kernel thread.
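      For illustration, a hedged sketch of the general replacement pattern (not the
      exact reiserfs change in this commit; the example_* names are hypothetical):
      instead of setting sb->s_dirt, a filesystem schedules its own delayed flush,
      so nothing wakes up when there is nothing to write:
      
      	#include <linux/fs.h>
      	#include <linux/jiffies.h>
      	#include <linux/workqueue.h>
      
      	/* Hypothetical per-superblock private info. */
      	struct example_sb_info {
      		struct super_block *sb;
      		struct delayed_work flush_work;	/* runs the fs's own flush */
      	};
      
      	static void example_mark_super_dirty(struct example_sb_info *info)
      	{
      		/* Replaces "sb->s_dirt = 1": queue a flush ~5s out; this is
      		 * a no-op if the work is already pending. */
      		queue_delayed_work(system_long_wq, &info->flush_work, 5 * HZ);
      	}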
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      033369d1
    • reiserfs: mark the superblock as dirty a bit later · 5c5fd819
      Committed by Artem Bityutskiy
      The 'journal_mark_dirty()' function currently first marks the superblock as
      dirty by setting 's_dirt' to 1, then does various sanity checks (possibly
      returning early), and only then actually does all the magic with the journal.
      
      This is not an ideal order, though. It makes more sense to first do all the
      checks, then do all the internal stuff, and at the end notify the VFS that the
      superblock is now dirty.
      
      This patch moves the 's_dirt = 1' assignment from the very beginning of this
      function to the very end.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      5c5fd819
    • reiserfs: clean-up function return type · 25729b0e
      Committed by Artem Bityutskiy
      Turn 'reiserfs_flush_old_commits()' into a void function because the callers
      do not care about what it returns anyway.
      
      We are going to remove the 'sb->s_dirt' field completely and this patch is a
      small step towards this direction.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      25729b0e
  5. 29 Mar 2012 (1 commit)
  6. 21 Mar 2012 (1 commit)
  7. 11 Jan 2012 (2 commits)
  8. 15 Sep 2011 (1 commit)
  9. 12 Jul 2011 (1 commit)
    • fixlet: Remove fs_excl from struct task. · 4aede84b
      Committed by Justin TerAvest
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Let's kill it.
      Signed-off-by: Justin TerAvest <teravest@google.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      4aede84b
  10. 31 Mar 2011 (1 commit)
  11. 01 Feb 2011 (1 commit)
  12. 18 Nov 2010 (1 commit)
  13. 13 Nov 2010 (2 commits)
    • block: clean up blkdev_get() wrappers and their users · d4d77629
      Committed by Tejun Heo
      After recent blkdev_get() modifications, open_by_devnum() and
      open_bdev_exclusive() are simple wrappers around blkdev_get().
      Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
      
      blkdev_get_by_dev() is identical to open_by_devnum().
      blkdev_get_by_path() is slightly different in that it doesn't
      automatically add %FMODE_EXCL to @mode.
      
      All users are converted.  Most conversions are mechanical and don't
      introduce any behavior difference.  There are several exceptions.
      
      * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
        reason to OR it explicitly on blkdev_put().
      
      * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
        sb->s_mode.
      
      * With the above changes, sb->s_mode now always should contain
        FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
        errors.
      
      The new blkdev_get_*() functions come with proper docbook comments.
      While at it, add a function description to blkdev_get() too.
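      A hedged usage sketch of the new helper described above (the mount-time
      context and names are illustrative, not taken from a specific caller): a
      filesystem opening an external journal device exclusively has to pass
      FMODE_EXCL itself, with the superblock as the holder:
      
      	#include <linux/fs.h>
      	#include <linux/blkdev.h>
      
      	static struct block_device *example_open_journal_dev(struct super_block *sb,
      							     const char *path)
      	{
      		/* Unlike open_bdev_exclusive(), FMODE_EXCL is not implied;
      		 * returns an ERR_PTR() on failure. */
      		return blkdev_get_by_path(path,
      					  FMODE_READ | FMODE_WRITE | FMODE_EXCL,
      					  sb);
      	}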
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Joern Engel <joern@lazybastard.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: xfs-masters@oss.sgi.com
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      d4d77629
    • block: make blkdev_get/put() handle exclusive access · e525fd89
      Committed by Tejun Heo
      Over time, the block layer has accumulated a set of APIs dealing with bdev
      open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() are combinations of open plus claim and
        the other way around, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing, and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access, as an
      in-kernel open + claim sequence can disturb an existing exclusive
      open even before the block layer knows the current open is for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        @holder argument is added and, if @FMODE_EXCL is specified, it will
        gain exclusive access atomically w.r.t. other exclusive accesses.
      
      * blkdev_put() is similarly extended.  It now takes @mode argument and
        if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at open time (as
      they should be).  This cleans up code and removes some corner cases
      that, valid or not, were unnecessary all the same.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straight-forward.  drbd conversion is a bit more
      involved as there was some reordering, but the logic should stay the
      same.
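      A hedged sketch of the reorganized open/claim/close flow (illustrative,
      not taken from any particular converted driver): the exclusive claim is
      made atomically inside blkdev_get() when FMODE_EXCL is passed, and
      released by the matching blkdev_put():
      
      	#include <linux/fs.h>
      	#include <linux/blkdev.h>
      
      	static int example_use_bdev_exclusively(struct block_device *bdev,
      						void *holder)
      	{
      		int err;
      
      		/* Open and claim in one step; fails with -EBUSY if another
      		 * holder already has an exclusive claim. */
      		err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL,
      				 holder);
      		if (err)
      			return err;
      
      		/* ... use the device ... */
      
      		/* @mode must include FMODE_EXCL so the claim is dropped too. */
      		blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
      		return 0;
      	}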
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Brown <neilb@suse.de>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      e525fd89
  14. 10 Sep 2010 (1 commit)
  15. 18 Aug 2010 (1 commit)
    • remove SWRITE* I/O types · 9cb569d6
      Committed by Christoph Hellwig
      These flags aren't real I/O types, but tell ll_rw_block to always
      lock the buffer instead of giving up on a failed trylock.
      
      Instead add a new write_dirty_buffer helper that implements this semantic
      and use it from the existing SWRITE* callers.  Note that the ll_rw_block
      code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
      this patch fixes.
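      A hedged sketch of the conversion this implies for one SWRITE* call site
      (assuming the new helper's calling form is write_dirty_buffer(bh, rw)):
      
      	#include <linux/buffer_head.h>
      
      	static void example_write_locked(struct buffer_head *bh)
      	{
      		/* Old: ll_rw_block(SWRITE, 1, &bh);
      		 * SWRITE meant "wait for the buffer lock instead of skipping
      		 * a busy buffer".  New helper with the same semantic: */
      		write_dirty_buffer(bh, WRITE);
      	}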
      
      In the ufs code clean up the helper that used to call ll_rw_block
      to mirror sync_dirty_buffer, which is the function it implements for
      compound buffers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9cb569d6
  16. 11 Aug 2010 (1 commit)
  17. 30 Mar 2010 (1 commit)
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Committed by Tejun Heo
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
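      For illustration, a file that previously relied on that implicit chain now
      carries its own includes (a trivial, hypothetical example):
      
      	#include <linux/gfp.h>	/* GFP_KERNEL */
      	#include <linux/slab.h>	/* kmalloc(), kfree() */
      
      	static void *example_alloc(size_t len)
      	{
      		return kmalloc(len, GFP_KERNEL);
      	}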
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch a large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  18. 25 Mar 2010 (1 commit)
    • reiserfs: properly honor read-only devices · 3f8b5ee3
      Committed by Jeff Mahoney
      The reiserfs journal behaves inconsistently when determining whether to
      allow a mount of a read-only device.
      
      This is due to the use of the continue_replay variable to short circuit
      the journal scanning.  If it's set, it's assumed that there are
      transactions to replay, but there may not be.  If it's unset, it's assumed
      that there aren't any, and that may not be the case either.
      
      I've observed two failure cases:
      1) Where a clean file system on a read-only device refuses to mount
      2) Where a clean file system on a read-only device passes the
         optimization and then tries writing the journal header to update
         the latest mount id.
      
      The former is easily observable by using a freshly created file system on
      a read-only loopback device.
      
      This patch moves the check into journal_read_transaction, where it can
      bail out before it's about to replay a transaction.  That way it can go
      through and skip transactions where appropriate, yet still refuse to mount
      a file system with outstanding transactions.
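      A hedged sketch of the kind of check this moves into journal_read_transaction()
      (the warning id and exact message here are made up for illustration):
      
      	/* Inside journal_read_transaction(), before replaying anything: */
      	if (bdev_read_only(sb->s_bdev)) {
      		reiserfs_warning(sb, "example-1",
      				 "device is read only, cannot replay transaction");
      		return -EPERM;
      	}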
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f8b5ee3
  19. 28 Jan 2010 (1 commit)
    • reiserfs: Fix vmalloc call under reiserfs lock · bbec9191
      Committed by Frederic Weisbecker
      Vmalloc is called to allocate journal->j_cnode_free_list but
      we hold the reiserfs lock at this time, which raises a
      {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} lock inversion.
      
      Just drop the reiserfs lock at this time, as it's not even
      needed but kept for paranoid reasons.
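      A hedged sketch of the fix in journal_init() (the allocation call shown is
      illustrative of where journal->j_cnode_free_list gets populated):
      
      	/* Drop the write lock around the vmalloc-backed allocation so it
      	 * can enter memory reclaim without the reiserfs lock held. */
      	reiserfs_write_unlock(sb);
      	journal->j_cnode_free_list = allocate_cnodes(num_cnodes);
      	reiserfs_write_lock(sb);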
      
      This fixes:
      
      [ INFO: inconsistent lock state ]
      2.6.33-rc5 #1
      ---------------------------------
      inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      kswapd0/313 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (&REISERFS_SB(s)->lock){+.+.?.}, at: [<c11118c8>]
      reiserfs_write_lock_once+0x28/0x50
      {RECLAIM_FS-ON-W} state was registered at:
        [<c104ee32>] mark_held_locks+0x62/0x90
        [<c104eefa>] lockdep_trace_alloc+0x9a/0xc0
        [<c108f7b6>] kmem_cache_alloc+0x26/0xf0
        [<c108621c>] __get_vm_area_node+0x6c/0xf0
        [<c108690e>] __vmalloc_node+0x7e/0xa0
        [<c1086aab>] vmalloc+0x2b/0x30
        [<c110e1fb>] journal_init+0x6cb/0xa10
        [<c10f90a2>] reiserfs_fill_super+0x342/0xb80
        [<c1095665>] get_sb_bdev+0x145/0x180
        [<c10f68e1>] get_super_block+0x21/0x30
        [<c1094520>] vfs_kern_mount+0x40/0xd0
        [<c1094609>] do_kern_mount+0x39/0xd0
        [<c10aaa97>] do_mount+0x2c7/0x6d0
        [<c10aaf06>] sys_mount+0x66/0xa0
        [<c16198a7>] mount_block_root+0xc4/0x245
        [<c1619a81>] mount_root+0x59/0x5f
        [<c1619b98>] prepare_namespace+0x111/0x14b
        [<c1619269>] kernel_init+0xcf/0xdb
        [<c100303a>] kernel_thread_helper+0x6/0x1c
      irq event stamp: 63236801
      hardirqs last  enabled at (63236801): [<c134e7fa>]
      __mutex_unlock_slowpath+0x9a/0x120
      hardirqs last disabled at (63236800): [<c134e799>]
      __mutex_unlock_slowpath+0x39/0x120
      softirqs last  enabled at (63218800): [<c102f451>] __do_softirq+0xc1/0x110
      softirqs last disabled at (63218789): [<c102f4ed>] do_softirq+0x4d/0x60
      
      other info that might help us debug this:
      2 locks held by kswapd0/313:
       #0:  (shrinker_rwsem){++++..}, at: [<c1074bb4>] shrink_slab+0x24/0x170
       #1:  (&type->s_umount_key#19){++++..}, at: [<c10a2edd>]
      shrink_dcache_memory+0xfd/0x1a0
      
      stack backtrace:
      Pid: 313, comm: kswapd0 Not tainted 2.6.33-rc5 #1
      Call Trace:
       [<c134db2c>] ? printk+0x18/0x1c
       [<c104e7ef>] print_usage_bug+0x15f/0x1a0
       [<c104ebcf>] mark_lock+0x39f/0x5a0
       [<c104d66b>] ? trace_hardirqs_off+0xb/0x10
       [<c1052c50>] ? check_usage_forwards+0x0/0xf0
       [<c1050c24>] __lock_acquire+0x214/0xa70
       [<c10438c5>] ? sched_clock_cpu+0x95/0x110
       [<c10514fa>] lock_acquire+0x7a/0xa0
       [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
       [<c134f03f>] mutex_lock_nested+0x5f/0x2b0
       [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
       [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
       [<c11118c8>] reiserfs_write_lock_once+0x28/0x50
       [<c10f05b0>] reiserfs_delete_inode+0x50/0x140
       [<c10a653f>] ? generic_delete_inode+0x5f/0x150
       [<c10f0560>] ? reiserfs_delete_inode+0x0/0x140
       [<c10a657c>] generic_delete_inode+0x9c/0x150
       [<c10a666d>] generic_drop_inode+0x3d/0x60
       [<c10a5597>] iput+0x47/0x50
       [<c10a2a4f>] dentry_iput+0x6f/0xf0
       [<c10a2af4>] d_kill+0x24/0x50
       [<c10a2d3d>] __shrink_dcache_sb+0x21d/0x2b0
       [<c10a2f0f>] shrink_dcache_memory+0x12f/0x1a0
       [<c1074c9e>] shrink_slab+0x10e/0x170
       [<c1075177>] kswapd+0x477/0x6a0
       [<c1072d10>] ? isolate_pages_global+0x0/0x1b0
       [<c103e160>] ? autoremove_wake_function+0x0/0x40
       [<c1074d00>] ? kswapd+0x0/0x6a0
       [<c103de6c>] kthread+0x6c/0x80
       [<c103de00>] ? kthread+0x0/0x80
       [<c100303a>] kernel_thread_helper+0x6/0x1c
      Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christian Kujau <lists@nerdbynature.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      bbec9191
  20. 02 Jan 2010 (1 commit)
    • reiserfs: Relax reiserfs lock while freeing the journal · 0523676d
      Committed by Frederic Weisbecker
      Keeping the reiserfs lock while freeing the journal on the
      umount path triggers a lock inversion between bdev->bd_mutex
      and the reiserfs lock.
      
      We don't need the reiserfs lock at this stage. The filesystem
      is not usable anymore, there are no more pending commits, and
      everything has been flushed (even this operation was done in parallel
      and didn't require the reiserfs lock from the current process).
      
      This fixes the following lockdep report:
      
      =======================================================
      [ INFO: possible circular locking dependency detected ]
      2.6.32-atom #172
      -------------------------------------------------------
      umount/3904 is trying to acquire lock:
       (&bdev->bd_mutex){+.+.+.}, at: [<c10de2c2>] __blkdev_put+0x22/0x160
      
      but task is already holding lock:
       (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1143279>] reiserfs_write_lock+0x29/0x40
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #3 (&REISERFS_SB(s)->lock){+.+.+.}:
             [<c105ea7f>] __lock_acquire+0x11ff/0x19e0
             [<c105f2c8>] lock_acquire+0x68/0x90
             [<c140199b>] mutex_lock_nested+0x5b/0x340
             [<c1143229>] reiserfs_write_lock_once+0x29/0x50
             [<c111c485>] reiserfs_get_block+0x85/0x1620
             [<c10e1040>] do_mpage_readpage+0x1f0/0x6d0
             [<c10e1640>] mpage_readpages+0xc0/0x100
             [<c1119b89>] reiserfs_readpages+0x19/0x20
             [<c108f1ec>] __do_page_cache_readahead+0x1bc/0x260
             [<c108f2b8>] ra_submit+0x28/0x40
             [<c1087e3e>] filemap_fault+0x40e/0x420
             [<c109b5fd>] __do_fault+0x3d/0x430
             [<c109d47e>] handle_mm_fault+0x12e/0x790
             [<c1022a65>] do_page_fault+0x135/0x330
             [<c1403663>] error_code+0x6b/0x70
             [<c10ef9ca>] load_elf_binary+0x82a/0x1a10
             [<c10ba130>] search_binary_handler+0x90/0x1d0
             [<c10bb70f>] do_execve+0x1df/0x250
             [<c1001746>] sys_execve+0x46/0x70
             [<c1002fa5>] syscall_call+0x7/0xb
      
      -> #2 (&mm->mmap_sem){++++++}:
             [<c105ea7f>] __lock_acquire+0x11ff/0x19e0
             [<c105f2c8>] lock_acquire+0x68/0x90
             [<c109b1ab>] might_fault+0x8b/0xb0
             [<c11b8f52>] copy_to_user+0x32/0x70
             [<c10c3b94>] filldir64+0xa4/0xf0
             [<c1109116>] sysfs_readdir+0x116/0x210
             [<c10c3e1d>] vfs_readdir+0x8d/0xb0
             [<c10c3ea9>] sys_getdents64+0x69/0xb0
             [<c1002ec4>] sysenter_do_call+0x12/0x32
      
      -> #1 (sysfs_mutex){+.+.+.}:
             [<c105ea7f>] __lock_acquire+0x11ff/0x19e0
             [<c105f2c8>] lock_acquire+0x68/0x90
             [<c140199b>] mutex_lock_nested+0x5b/0x340
             [<c110951c>] sysfs_addrm_start+0x2c/0xb0
             [<c1109aa0>] create_dir+0x40/0x90
             [<c1109b1b>] sysfs_create_dir+0x2b/0x50
             [<c11b2352>] kobject_add_internal+0xc2/0x1b0
             [<c11b2531>] kobject_add_varg+0x31/0x50
             [<c11b25ac>] kobject_add+0x2c/0x60
             [<c1258294>] device_add+0x94/0x560
             [<c11036ea>] add_partition+0x18a/0x2a0
             [<c110418a>] rescan_partitions+0x33a/0x450
             [<c10de5bf>] __blkdev_get+0x12f/0x2d0
             [<c10de76a>] blkdev_get+0xa/0x10
             [<c11034b8>] register_disk+0x108/0x130
             [<c11a87a9>] add_disk+0xd9/0x130
             [<c12998e5>] sd_probe_async+0x105/0x1d0
             [<c10528af>] async_thread+0xcf/0x230
             [<c104bfd4>] kthread+0x74/0x80
             [<c1003aab>] kernel_thread_helper+0x7/0x3c
      
      -> #0 (&bdev->bd_mutex){+.+.+.}:
             [<c105f176>] __lock_acquire+0x18f6/0x19e0
             [<c105f2c8>] lock_acquire+0x68/0x90
             [<c140199b>] mutex_lock_nested+0x5b/0x340
             [<c10de2c2>] __blkdev_put+0x22/0x160
             [<c10de40a>] blkdev_put+0xa/0x10
             [<c113ce22>] free_journal_ram+0xd2/0x130
             [<c113ea18>] do_journal_release+0x98/0x190
             [<c113eb2a>] journal_release+0xa/0x10
             [<c1128eb6>] reiserfs_put_super+0x36/0x130
             [<c10b776f>] generic_shutdown_super+0x4f/0xe0
             [<c10b7825>] kill_block_super+0x25/0x40
             [<c11255df>] reiserfs_kill_sb+0x7f/0x90
             [<c10b7f4a>] deactivate_super+0x7a/0x90
             [<c10cccd8>] mntput_no_expire+0x98/0xd0
             [<c10ccfcc>] sys_umount+0x4c/0x310
             [<c10cd2a9>] sys_oldumount+0x19/0x20
             [<c1002ec4>] sysenter_do_call+0x12/0x32
      
      other info that might help us debug this:
      
      2 locks held by umount/3904:
       #0:  (&type->s_umount_key#30){+++++.}, at: [<c10b7f45>] deactivate_super+0x75/0x90
       #1:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1143279>] reiserfs_write_lock+0x29/0x40
      
      stack backtrace:
      Pid: 3904, comm: umount Not tainted 2.6.32-atom #172
      Call Trace:
       [<c13ff903>] ? printk+0x18/0x1a
       [<c105d33a>] print_circular_bug+0xca/0xd0
       [<c105f176>] __lock_acquire+0x18f6/0x19e0
       [<c108b66f>] ? free_pcppages_bulk+0x1f/0x250
       [<c105f2c8>] lock_acquire+0x68/0x90
       [<c10de2c2>] ? __blkdev_put+0x22/0x160
       [<c10de2c2>] ? __blkdev_put+0x22/0x160
       [<c140199b>] mutex_lock_nested+0x5b/0x340
       [<c10de2c2>] ? __blkdev_put+0x22/0x160
       [<c105c932>] ? mark_held_locks+0x62/0x80
       [<c10afe12>] ? kfree+0x92/0xd0
       [<c10de2c2>] __blkdev_put+0x22/0x160
       [<c105cc3b>] ? trace_hardirqs_on+0xb/0x10
       [<c10de40a>] blkdev_put+0xa/0x10
       [<c113ce22>] free_journal_ram+0xd2/0x130
       [<c113ea18>] do_journal_release+0x98/0x190
       [<c113eb2a>] journal_release+0xa/0x10
       [<c1128eb6>] reiserfs_put_super+0x36/0x130
       [<c1050596>] ? up_write+0x16/0x30
       [<c10b776f>] generic_shutdown_super+0x4f/0xe0
       [<c10b7825>] kill_block_super+0x25/0x40
       [<c10f41e0>] ? vfs_quota_off+0x0/0x20
       [<c11255df>] reiserfs_kill_sb+0x7f/0x90
       [<c10b7f4a>] deactivate_super+0x7a/0x90
       [<c10cccd8>] mntput_no_expire+0x98/0xd0
       [<c10ccfcc>] sys_umount+0x4c/0x310
       [<c10cd2a9>] sys_oldumount+0x19/0x20
       [<c1002ec4>] sysenter_do_call+0x12/0x32
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      0523676d
  21. 30 Dec 2009 (1 commit)
  22. 05 Oct 2009 (1 commit)
    • kill-the-bkl/reiserfs: fix reiserfs lock to cpu_add_remove_lock dependency · 48f6ba5e
      Committed by Frederic Weisbecker
      While creating the reiserfs workqueue during the journal
      initialization, we are holding the reiserfs lock, but
      create_workqueue() also takes the cpu_add_remove_lock, thereby
      creating the following dependency:
      
      - reiserfs lock -> cpu_add_remove_lock
      
      But we also have the following existing dependencies:
      
      - mm->mmap_sem -> reiserfs lock
      - cpu_add_remove_lock -> cpu_hotplug.lock -> slub_lock -> sysfs_mutex
      
      The merged dependency chain then becomes:
      
      - mm->mmap_sem -> reiserfs lock -> cpu_add_remove_lock ->
      	cpu_hotplug.lock -> slub_lock -> sysfs_mutex
      
      But when we fill a dir entry in sysfs_readdir(), we are holding the
      sysfs_mutex, and we also might fault while copying the directory entry
      to the user, leading to the following dependency:
      
      - sysfs_mutex -> mm->mmap_sem
      
      The end result is then a lock inversion between sysfs_mutex and
      mm->mmap_sem, as reported in the following lockdep warning:
      
      [ INFO: possible circular locking dependency detected ]
      2.6.31-07095-g25a3912 #4
      -------------------------------------------------------
      udevadm/790 is trying to acquire lock:
       (&mm->mmap_sem){++++++}, at: [<c1098942>] might_fault+0x72/0xc0
      
      but task is already holding lock:
       (sysfs_mutex){+.+.+.}, at: [<c110813c>] sysfs_readdir+0x7c/0x260
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #5 (sysfs_mutex){+.+.+.}:
            [...]
      
      -> #4 (slub_lock){+++++.}:
            [...]
      
      -> #3 (cpu_hotplug.lock){+.+.+.}:
            [...]
      
      -> #2 (cpu_add_remove_lock){+.+.+.}:
            [...]
      
      -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
            [...]
      
      -> #0 (&mm->mmap_sem){++++++}:
            [...]
      
      This can be fixed by relaxing the reiserfs lock while creating the
      workqueue.
      It is fine to relax the lock here; we just keep it around to pass
      through the reiserfs lock checks and for paranoid reasons.
      Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
      Tested-by: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      48f6ba5e
  23. 17 Sep 2009 (1 commit)
    • kill-the-bkl/reiserfs: Fix induced mm->mmap_sem to sysfs_mutex dependency · 193be0ee
      Committed by Frederic Weisbecker
      Alexander Beregalov reported the following warning:
      
      	=======================================================
      	[ INFO: possible circular locking dependency detected ]
      	2.6.31-03149-gdcc030a #1
      	-------------------------------------------------------
      	udevadm/716 is trying to acquire lock:
      	 (&mm->mmap_sem){++++++}, at: [<c107249a>] might_fault+0x4a/0xa0
      
      	but task is already holding lock:
      	 (sysfs_mutex){+.+.+.}, at: [<c10cb9aa>] sysfs_readdir+0x5a/0x200
      
      	which lock already depends on the new lock.
      
      	the existing dependency chain (in reverse order) is:
      
      	-> #3 (sysfs_mutex){+.+.+.}:
      	       [...]
      
      	-> #2 (&bdev->bd_mutex){+.+.+.}:
      	       [...]
      
      	-> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
      	       [...]
      
      	-> #0 (&mm->mmap_sem){++++++}:
      	       [...]
      
      On the reiserfs mount path, we take the reiserfs lock and, while
      initializing the journal, we open the device, taking the
      bdev->bd_mutex. Then rescan_partitions() may signal the change
      to sysfs.
      
      We have then the following dependency:
      
      	reiserfs_lock -> bd_mutex -> sysfs_mutex
      
      Later, while entering reiserfs_readpage() after a pagefault in an
      mmaped reiserfs file, we are holding the mm->mmap_sem, and we are going
      to take the reiserfs lock too.
      We have then the following dependency:
      
      	mm->mmap_sem -> reiserfs_lock
      
      which, expanded with the previous dependency gives us:
      
      	mm->mmap_sem -> reiserfs_lock -> bd_mutex -> sysfs_mutex
      
      Now while entering the sysfs readdir path, we are holding the
      sysfs_mutex. And when we copy a directory entry to the user buffer, we
      might fault and then take the mm->mmap_sem lock. Which leads to the
      circular locking dependency reported.
      
      We can fix that by relaxing the reiserfs lock during the call to
      journal_init_dev(), which is the place where we open the mounted
      device.
      
      It is fine to relax the lock here because we are at the beginning of
      the reiserfs mount path and there is nothing to protect at this time;
      the journal is not initialized.
      We just keep this lock around for paranoid reasons.
      Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
      Tested-by: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Laurent Riffard <laurent.riffard@free.fr>
      193be0ee
  24. 14 Sep 2009 (6 commits)
    • kill-the-bkl/reiserfs: acquire the inode mutex safely · c72e0575
      Committed by Frederic Weisbecker
      While searching a pathname, an inode mutex can be acquired
      in do_lookup() which calls reiserfs_lookup() which in turn
      acquires the write lock.
      
      On the other side reiserfs_fill_super() can acquire the write_lock
      and then call reiserfs_lookup_privroot() which can acquire an
      inode mutex (the root of the mount point).
      
      So we theoretically risk an AB - BA lock inversion that could lead
      to a deadlock.
      
      As for other lock dependencies found since the bkl to mutex
      conversion, the fix is to use reiserfs_mutex_lock_safe() which
      drops the lock dependency to the write lock.
      
      [ Impact: fix a possible deadlock with reiserfs ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      c72e0575
    • kill-the-bkl/reiserfs: use mutex_lock in reiserfs_mutex_lock_safe · c63e3c0b
      Committed by Frederic Weisbecker
      reiserfs_mutex_lock_safe() is a hack to avoid any dependency between
      an internal reiserfs mutex and the write lock. It was proposed as a way
      to follow the old bkl logic.
      
      The code does the following:
      
      while (!mutex_trylock(m)) {
      	reiserfs_write_unlock(s);
      	schedule();
      	reiserfs_write_lock(s);
      }
      
      It then imitates the implicit behaviour the lock had when it was
      the Bkl and had no such dependency:
      
      mutex_lock(m) {
      	if (fastpath)
      		let's go
      	else {
      		wait_for_mutex() {
      			schedule() {
      				unlock_kernel()
      				reacquire_lock_kernel()
      			}
      		}
      	}
      }
      
      The problem is that by using such an explicit schedule(), we don't
      benefit from the adaptive mutex spinning on owner.
      
      The logic in use now is:
      
      reiserfs_write_unlock(s);
      mutex_lock(m); // -> possible adaptive spinning
      reiserfs_write_lock(s);
      
      [ Impact: restore the use of adaptive spinning mutexes in reiserfs ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      c63e3c0b
    • kill-the-BKL/reiserfs: release the write lock on flush_commit_list() · 6e3647ac
      Committed by Frederic Weisbecker
      flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
      ll_rw_block() might sleep, and the bkl was released at this point, so
      we can also relax the write lock here.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      6e3647ac
    • kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end() · e6950a4d
      Committed by Frederic Weisbecker
      When do_journal_end() copies data to the journal block buffers in memory,
      it reschedules if needed between each block copied and dirtied.
      
      We can also release the write lock at this rescheduling stage,
      as the bkl did implicitly.
      
      [ Impact: release the reiserfs write lock when it is not needed ]
      
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      e6950a4d
    • reiserfs, kill-the-BKL: fix unsafe j_flush_mutex lock · a412f9ef
      Committed by Frederic Weisbecker
      Impact: fix a deadlock
      
      The j_flush_mutex is acquired safely in journal.c:
      if we can't take it, we release the reiserfs per-superblock lock
      and wait a bit.
      
      But we have a remaining place in kupdate_transactions() where
      j_flush_mutex is still acquired the traditional way. Thus the following
      scenario (warned about by lockdep) can happen:
      
      A						B
      
      mutex_lock(&write_lock)			mutex_lock(&write_lock)
      	mutex_lock(&j_flush_mutex)	mutex_lock(&j_flush_mutex) //block
      	mutex_unlock(&write_lock)
      	sleep...
      	mutex_lock(&write_lock) //deadlock
      
      Fix this by using reiserfs_mutex_lock_safe() in kupdate_transactions().
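      That is, the plain mutex_lock() in kupdate_transactions() becomes the
      lock-dropping helper (a sketch; the helper is assumed to take the mutex
      and the super block, and variable names are illustrative):
      
      	/* Before: blocks on j_flush_mutex with the write lock still held. */
      	mutex_lock(&journal->j_flush_mutex);
      
      	/* After: drops the write lock while waiting for j_flush_mutex. */
      	reiserfs_mutex_lock_safe(&journal->j_flush_mutex, sb);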
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Jeff Mahoney <jeffm@suse.com>
      LKML-Reference: <1239660635-12940-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a412f9ef
    • reiserfs: kill-the-BKL · 8ebc4232
      Committed by Frederic Weisbecker
      This patch is an attempt to remove the Bkl-based locking scheme from
      reiserfs.
      
      It is a bit inspired by an old attempt from Peter Zijlstra:
      
         http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html
      
      The bkl is heavily used in this filesystem to prevent concurrent
      write accesses on the filesystem.
      
      Reiserfs makes deep use of the specific properties of the Bkl:
      
      - It can be acquired recursively by the same task
      - It is released on schedule() calls and reacquired when schedule() returns
      
      The two properties above are a roadmap for the reiserfs write locking so it's
      very hard to simply replace it with a common mutex.
      
      - We need a lock that can be acquired recursively unless we want to
        restructure several blocks of the code.
      - We need to identify the sites where the bkl was implicitly relaxed
        (schedule, wait, sync, etc...) so that we can in turn release and
        reacquire our new lock explicitly.
        Such implicit releases of the lock are often required to let other
        resource producers/consumers do their job or we can suffer unexpected
        starvations or deadlocks.
      
      So the new lock that replaces the bkl here is a per-superblock mutex with a
      specific property: it can be acquired recursively by the same task, like the
      bkl.
      
      For this purpose, we integrate a lock owner and a lock depth field in the
      superblock information structure.
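      A minimal sketch of such a recursively-acquirable write lock built on the
      owner/depth fields just described (the real fs/reiserfs/lock.c code differs
      in details and error checking):
      
      	void reiserfs_write_lock(struct super_block *s)
      	{
      		struct reiserfs_sb_info *sb_i = REISERFS_SB(s);
      
      		if (sb_i->lock_owner != current) {
      			/* First acquisition by this task: take the mutex. */
      			mutex_lock(&sb_i->lock);
      			sb_i->lock_owner = current;
      		}
      		/* Nested acquisition: just bump the depth; unlock only
      		 * releases the mutex when the depth drops back down. */
      		sb_i->lock_depth++;
      	}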
      
      The first axis of this patch is to turn the reiserfs_write_(un)lock() functions
      into wrappers that manage this mutex. Also some explicit calls to
      lock_kernel() have been converted to reiserfs_write_lock() helpers.
      
      The second axis is to find the important blocking sites (schedule...(),
      wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit
      release of the write lock on these locations before blocking. Then we can
      safely wait for those who can give us resources or those who need some.
      Typically this is a fight between the current writer, the reiserfs workqueue
      (aka the async committer) and the pdflush threads.
      
      The third axis is a consequence of the second. The write lock is usually
      on top of a lock dependency chain which can include the journal lock, the
      flush lock or the commit lock. So it's dangerous to release and then try to
      reacquire the write lock while we still hold other locks.
      
      This is fine with the bkl:
      
            T1                       T2
      
      lock_kernel()
          mutex_lock(A)
          unlock_kernel()
          // do something
                                  lock_kernel()
                                      mutex_lock(A) -> already locked by T1
                                      schedule() (and then unlock_kernel())
          lock_kernel()
          mutex_unlock(A)
          ....
      
      This is not fine with a mutex:
      
            T1                       T2
      
      mutex_lock(write)
          mutex_lock(A)
          mutex_unlock(write)
          // do something
                                 mutex_lock(write)
                                    mutex_lock(A) -> already locked by T1
                                    schedule()
      
          mutex_lock(write) -> already locked by T2
          deadlock
      
      The solution in this patch is to provide a helper which releases the write
      lock and sleeps a bit if we can't lock a mutex that depends on it. It's another
      simulation of the bkl behaviour.
      
      The last axis is to locate the fs callbacks that are called with the bkl held,
      according to Documentation/filesystems/Locking.
      
      Those are:
      
      - reiserfs_remount
      - reiserfs_fill_super
      - reiserfs_put_super
      
      Reiserfs didn't need to explicitly lock because of the context of these callbacks.
      But now we must take care of that with the new locking.
      
      After this patch, reiserfs suffers from a slight performance regression (for now).
      On UP, a high volume write with dd reports an average of 27 MB/s instead
      of 30 MB/s without the patch applied.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Bron Gondwana <brong@fastmail.fm>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      LKML-Reference: <1239070789-13354-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8ebc4232
  25. 11 Jul 2009 (1 commit)
  26. 31 Mar 2009 (6 commits)
    • reiserfs: rename p_s_sb to sb · a9dd3643
      Committed by Jeff Mahoney
      This patch is a simple s/p_s_sb/sb/g to the reiserfs code.  This is the
      first in a series of patches to rip out some of the awful variable
      naming in reiserfs.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9dd3643
    • reiserfs: strip trailing whitespace · 0222e657
      Committed by Jeff Mahoney
      This patch strips trailing whitespace from the reiserfs code.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0222e657
    • reiserfs: rearrange journal abort · 32e8b106
      Committed by Jeff Mahoney
      This patch kills off reiserfs_journal_abort as it is never called, and
      combines __reiserfs_journal_abort_{soft,hard} into one function called
      reiserfs_abort_journal, which performs the same work. It is silent
      as opposed to the old version, since the message was always issued
      after a regular 'abort' message.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32e8b106
    • reiserfs: rework reiserfs_panic · c3a9c210
      Committed by Jeff Mahoney
      ReiserFS panics can be somewhat inconsistent.
      In some cases:
       * a unique identifier may be associated with it
       * the function name may be included
       * the device may be printed separately
      
      This patch aims to make warnings more consistent. reiserfs_warning() prints
      the device name, so printing it a second time is not required. The function
      name for a warning is always helpful in debugging, so it is now automatically
      inserted into the output. Hans has stated that every warning should have
      a unique identifier. Some cases lack them, others really shouldn't have them.
      reiserfs_warning() now expects an id associated with each message. In the
      rare case where one isn't needed, "" will suffice.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3a9c210
    • reiserfs: rework reiserfs_warning · 45b03d5e
      Committed by Jeff Mahoney
      ReiserFS warnings can be somewhat inconsistent.
      In some cases:
       * a unique identifier may be associated with it
       * the function name may be included
       * the device may be printed separately
      
      This patch aims to make warnings more consistent. reiserfs_warning() prints
      the device name, so printing it a second time is not required. The function
      name for a warning is always helpful in debugging, so it is now automatically
      inserted into the output. Hans has stated that every warning should have
      a unique identifier. Some cases lack them, others really shouldn't have them.
      reiserfs_warning() now expects an id associated with each message. In the
      rare case where one isn't needed, "" will suffice.
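      An illustrative call with the new convention (the id string here is invented
      for the example; real ids follow the historical category-number scheme):
      
      	static void example_warn(struct super_block *sb, unsigned int blocknr)
      	{
      		/* The device name and calling function are added automatically. */
      		reiserfs_warning(sb, "example-1", "unable to read bitmap block %u",
      				 blocknr);
      	}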
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45b03d5e
    • reiserfs: audit transaction ids to always be unsigned ints · 600ed416
      Committed by Jeff Mahoney
      This patch fixes up the reiserfs code such that transaction ids are
      always unsigned ints.  In places they can currently be signed ints or
      unsigned longs.
      
      The former just causes an annoying clm-2200 warning and may join a
      transaction when it should wait.
      
      The latter is just for correctness since the disk format uses a 32-bit
      transaction id.  There aren't any runtime problems that result from it
      not wrapping at the correct location since the value is truncated
      correctly even on big endian systems.  The 0 value might make it to
      disk, but the mount-time checks will bump it to 10 itself.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      600ed416