1. 16 1月, 2010 10 次提交
    • E
      xfs: make several more functions static · 5d77c0dc
      Eric Sandeen 提交于
      Just minor housekeeping, a lot more functions can be trivially made
      static; others could if we reordered things a bit...
      Signed-off-by: NEric Sandeen <sandeen@sandeen.net>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      5d77c0dc
    • D
      xfs: clean up inconsistent variable naming in xfs_swap_extent · 6bded0f3
      Dave Chinner 提交于
      The swap extent ioctl passes in a target inode and a temporary inode
      which are clearly named in the ioctl structure. The code then
      assigns temp to target and vice versa, making it extremely difficult
      to work out which inode is which later in the code.  Make this
      consistent throughout the code.
      
      Also make xfs_swap_extent static as there are no external users of
      the function.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      6bded0f3
    • D
      xfs: add tracing to xfs_swap_extents · 3a85cd96
      Dave Chinner 提交于
      To be able to diagnose whether the swap extents function is
      detecting compatible inode data fork configurations for swapping
      extents, add tracing points to the code to allow us to see the
      format of the inode forks before and after the swap.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      3a85cd96
    • D
      xfs: xfs_swap_extents needs to handle dynamic fork offsets · e09f9860
      Dave Chinner 提交于
      When swapping extents, we can corrupt inodes by swapping data forks
      that are in incompatible formats.  This is caused by the two indoes
      having different fork offsets due to the presence of an attribute
      fork on an attr2 filesystem.  xfs_fsr tries to be smart about
      setting the fork offset, but the trick it plays only works on attr1
      (old fixed format attribute fork) filesystems.
      
      Changing the way xfs_fsr sets up the attribute fork will prevent
      this situation from ever occurring, so in the kernel code we can get
      by with a preventative fix - check that the data fork in the
      defragmented inode is in a format valid for the inode it is being
      swapped into.  This will lead to files that will silently and
      potentially repeatedly fail defragmentation, so issue a warning to
      the log when this particular failure occurs to let us know that
      xfs_fsr needs updating/fixing.
      
      To help identify how to improve xfs_fsr to avoid this issue, add
      trace points for the inodes being swapped so that we can determine
      why the swap was rejected and to confirm that the code is making the
      right decisions and modifications when swapping forks.
      
      A further complication is even when the swap is allowed to proceed
      when the fork offset is different between the two inodes then value
      for the maximum number of extents the data fork can hold can be
      wrong. Make sure these are also set correctly after the swap occurs.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      e09f9860
    • D
      xfs: fix missing error check in xfs_rtfree_range · 3daeb42c
      Dave Chinner 提交于
      When xfs_rtfind_forw() returns an error, the block is returned
      uninitialised.  xfs_rtfree_range() is not checking the error return,
      so could be using an uninitialised block number for modifying bitmap
      summary info.
      
      The problem was found by gcc when compiling the *userspace* libxfs
      code - it is an copy of the kernel code with the exact same bug.
      gcc gives an uninitialised variable warning on the userspace code
      but not on the kernel code. You gotta love the consistency (Mmmm,
      slightly chewy today!).
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      3daeb42c
    • D
      xfs: fix stale inode flush avoidance · 4b6a4688
      Dave Chinner 提交于
      When reclaiming stale inodes, we need to guarantee that inodes are
      unpinned before returning with a "clean" status. If we don't we can
      reclaim inodes that are pinned, leading to use after free in the
      transaction subsystem as transactions complete.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      4b6a4688
    • D
      xfs: Remove inode iolock held check during allocation · 126976c7
      Dave Chinner 提交于
      lockdep complains about a the lock not being initialised as we do an
      ASSERT based check that the lock is not held before we initialise it
      to catch inodes freed with the lock held.
      
      lockdep does this check for us in the lock initialisation code, so
      remove the ASSERT to stop the lockdep warning.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      126976c7
    • D
      xfs: reclaim all inodes by background tree walks · 57817c68
      Dave Chinner 提交于
      We cannot do direct inode reclaim without taking the flush lock to
      ensure that we do not reclaim an inode under IO. We check the inode
      is clean before doing direct reclaim, but this is not good enough
      because the inode flush code marks the inode clean once it has
      copied the in-core dirty state to the backing buffer.
      
      It is the flush lock that determines whether the inode is still
      under IO, even though it is marked clean, and the inode is still
      required at IO completion so we can't reclaim it even though it is
      clean in core. Hence the requirement that we need to take the flush
      lock even on clean inodes because this guarantees that the inode
      writeback IO has completed and it is safe to reclaim the inode.
      
      With delayed write inode flushing, we coul dend up waiting a long
      time on the flush lock even for a clean inode. The background
      reclaim already handles this efficiently, so avoid all the problems
      by killing the direct reclaim path altogether.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      57817c68
    • D
      xfs: Avoid inodes in reclaim when flushing from inode cache · 018027be
      Dave Chinner 提交于
      The reclaim code will handle flushing of dirty inodes before reclaim
      occurs, so avoid them when determining whether an inode is a
      candidate for flushing to disk when walking the radix trees.  This
      is based on a test patch from Christoph Hellwig.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      018027be
    • D
      xfs: reclaim inodes under a write lock · c8e20be0
      Dave Chinner 提交于
      Make the inode tree reclaim walk exclusive to avoid races with
      concurrent sync walkers and lookups. This is a version of a patch
      posted by Christoph Hellwig that avoids all the code duplication.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      c8e20be0
  2. 13 1月, 2010 1 次提交
  3. 12 1月, 2010 2 次提交
    • M
      smaps: fix wrong rss count · 7f53a09e
      Minchan Kim 提交于
      A long time ago we regarded zero page as file_rss and vm_normal_page
      doesn't return NULL.
      
      But now, we reinstated ZERO_PAGE and vm_normal_page's implementation can
      return NULL in case of zero page.  Also we don't count it with file_rss
      any more.
      
      Then, RSS and PSS can't be matched.  For consistency, Let's ignore zero
      page in smaps_pte_range.
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NMatt Mackall <mpm@selenic.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f53a09e
    • K
      proc: partially revert "procfs: provide stack information for threads" · 1306d603
      KOSAKI Motohiro 提交于
      Commit d899bf7b (procfs: provide stack information for threads) introduced
      to show stack information in /proc/{pid}/status.  But it cause large
      performance regression.  Unfortunately /proc/{pid}/status is used ps
      command too and ps is one of most important component.  Because both to
      take mmap_sem and page table walk are heavily operation.
      
      If many process run, the ps performance is,
      
      [before d899bf7b]
      
      % perf stat ps >/dev/null
      
       Performance counter stats for 'ps':
      
           4090.435806  task-clock-msecs         #      0.032 CPUs
                   229  context-switches         #      0.000 M/sec
                     0  CPU-migrations           #      0.000 M/sec
                   234  page-faults              #      0.000 M/sec
            8587565207  cycles                   #   2099.425 M/sec
            9866662403  instructions             #      1.149 IPC
            3789415411  cache-references         #    926.409 M/sec
              30419509  cache-misses             #      7.437 M/sec
      
         128.859521955  seconds time elapsed
      
      [after d899bf7b]
      
      % perf stat  ps  > /dev/null
      
       Performance counter stats for 'ps':
      
           4305.081146  task-clock-msecs         #      0.028 CPUs
                   480  context-switches         #      0.000 M/sec
                     2  CPU-migrations           #      0.000 M/sec
                   237  page-faults              #      0.000 M/sec
            9021211334  cycles                   #   2095.480 M/sec
           10605887536  instructions             #      1.176 IPC
            3612650999  cache-references         #    839.160 M/sec
              23917502  cache-misses             #      5.556 M/sec
      
         152.277819582  seconds time elapsed
      
      Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
      provide almost same information. we can use it.
      
      Commit d899bf7b introduced two features:
      
       1) Add the annotattion of [thread stack: xxxx] mark to
          /proc/{pid}/task/{tid}/maps.
       2) Add StackUsage field to /proc/{pid}/status.
      
      I only revert (2), because I haven't seen (1) cause regression.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1306d603
  4. 11 1月, 2010 6 次提交
    • J
      quota: Fix dquot_transfer for filesystems different from ext4 · 05b5d898
      Jan Kara 提交于
      Commit fd8fbfc1 modified the way we find amount of reserved space
      belonging to an inode. The amount of reserved space is checked
      from dquot_transfer and thus inode_reserved_space gets called
      even for filesystems that don't provide get_reserved_space callback
      which results in a BUG.
      
      Fix the problem by checking get_reserved_space callback and return 0 if
      the filesystem does not provide it.
      
      CC: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NJan Kara <jack@suse.cz>
      05b5d898
    • S
      GFS2: Use MAX_LFS_FILESIZE for meta inode size · ba198098
      Steven Whitehouse 提交于
      Using ~0ULL was cauing sign issues in filemap_fdatawrite_range, so
      use MAX_LFS_FILESIZE instead.
      Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
      ba198098
    • D
      xfs: Ensure we force all busy extents in range to disk · fd45e478
      Dave Chinner 提交于
      When we search for and find a busy extent during allocation we
      force the log out to ensure the extent free transaction is on
      disk before the allocation transaction. The current implementation
      has a subtle bug in it--it does not handle multiple overlapping
      ranges.
      
      That is, if we free lots of little extents into a single
      contiguous extent, then allocate the contiguous extent, the busy
      search code stops searching at the first extent it finds that
      overlaps the allocated range. It then uses the commit LSN of the
      transaction to force the log out to.
      
      Unfortunately, the other busy ranges might have more recent
      commit LSNs than the first busy extent that is found, and this
      results in xfs_alloc_search_busy() returning before all the
      extent free transactions are on disk for the range being
      allocated. This can lead to potential metadata corruption or
      stale data exposure after a crash because log replay won't replay
      all the extent free transactions that cover the allocation range.
      Modified-by: NAlex Elder <aelder@sgi.com>
      
      (Dropped the "found" argument from the xfs_alloc_busysearch trace
      event.)
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      fd45e478
    • D
      xfs: Don't flush stale inodes · 44e08c45
      Dave Chinner 提交于
      Because inodes remain in cache much longer than inode buffers do
      under memory pressure, we can get the situation where we have
      stale, dirty inodes being reclaimed but the backing storage has
      been freed.  Hence we should never, ever flush XFS_ISTALE inodes
      to disk as there is no guarantee that the backing buffer is in
      cache and still marked stale when the flush occurs.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      44e08c45
    • C
      xfs: fix timestamp handling in xfs_setattr · d6d59bad
      Christoph Hellwig 提交于
      We currently have some rather odd code in xfs_setattr for
      updating the a/c/mtime timestamps:
      
       - first we do a non-transaction update if all three are updated
         together
       - second we implicitly update the ctime for various changes
         instead of relying on the ATTR_CTIME flag
       - third we set the timestamps to the current time instead of the
         arguments in the iattr structure in many cases.
      
      This patch makes sure we update it in a consistent way:
      
       - always transactional
       - ctime is only updated if ATTR_CTIME is set or we do a size
         update, which is a special case
       - always to the times passed in from the caller instead of the
         current time
      
      The only non-size caller of xfs_setattr that doesn't come from
      the VFS is updated to set ATTR_CTIME and pass in a valid ctime
      value.
      Reported-by: NEric Blake <ebb9@byu.net>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      d6d59bad
    • C
      xfs: use DECLARE_EVENT_CLASS · ea9a4888
      Christoph Hellwig 提交于
      Using DECLARE_EVENT_CLASS allows us to to use trace event code
      instead of duplicating it in the binary.  This was not available
      before 2.6.33 so it had to be done as a separate step once the
      prerequisite was merged.
      
      This only requires changes to xfs_trace.h and the results are
      rather impressive:
      
      hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
      text	   data	    bss	    dec	    hex	filename
       607732	  41884	   3616	 653232	  9f7b0	fs/xfs/xfs.o
      1026732	  41884	   3808	1072424	 105d28	fs/xfs/xfs.o.old
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      ea9a4888
  5. 09 1月, 2010 1 次提交
  6. 08 1月, 2010 3 次提交
  7. 07 1月, 2010 6 次提交
  8. 05 1月, 2010 6 次提交
    • B
      exofs: simple_write_end does not mark_inode_dirty · efd124b9
      Boaz Harrosh 提交于
      exofs uses simple_write_end() for it's .write_end handler. But
      it is not enough because simple_write_end() does not call
      mark_inode_dirty() when it extends i_size. So even if we do
      call mark_inode_dirty at beginning of write out, with a very
      long IO and a saturated system we might get the .write_inode()
      called while still extend-writing to file and miss out on the last
      i_size updates.
      
      So override .write_end, call simple_write_end(), and afterwords if
      i_size was changed call mark_inode_dirty().
      
      It stands to logic that since simple_write_end() was the one extending
      i_size it should also call mark_inode_dirty(). But it looks like all
      users of simple_write_end() are memory-bound pseudo filesystems, who
      could careless about mark_inode_dirty(). I might submit a
      warning-comment patch to simple_write_end() in future.
      
      CC: Stable <stable@kernel.org>
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      efd124b9
    • B
      exofs: fix pnfs_osd re-definitions in pre-pnfs trees · 89be5030
      Boaz Harrosh 提交于
      Some on disk exofs constants and types are defined in the pnfs_osd_xdr.h
      file. Since we needed these types before the pnfs-objects code was
      accepted to mainline we duplicated the minimal needed definitions into
      an exofs local header. The definitions where conditionally included
      depending on !CONFIG_PNFS defined. So if PNFS was present in the tree
      definitions are taken from there and if not they are defined locally.
      
      That was all good but, the CONFIG_PNFS is planed to be included upstream
      before the pnfs-objects is also included. (The first pnfs batch might be
      pnfs-files only)
      
      So condition exofs local definitions on the absence of pnfs_osd_xdr.h
      inclusion (__PNFS_OSD_XDR_H__ not defined). User code must make sure
      that in future pnfs_osd_xdr.h will be included before fs/exofs/pnfs.h,
      which happens to be so in current code.
      
      Once pnfs-objects hits mainline, exofs's local header will be removed.
      Signed-off-by: NBoaz Harrosh <bharrosh@panasas.com>
      89be5030
    • F
      reiserfs: Relax lock on xattr removing · 4f3be1b5
      Frederic Weisbecker 提交于
      When we remove an xattr, we call lookup_and_delete_xattr()
      that takes some private xattr inodes mutexes. But we hold
      the reiserfs lock at this time, which leads to dependency
      inversions.
      
      We can safely call lookup_and_delete_xattr() without the
      reiserfs lock, where xattr inodes lookups only need the
      xattr inodes mutexes.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Christian Kujau <lists@nerdbynature.de>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      4f3be1b5
    • F
      reiserfs: Relax the lock before truncating pages · 108d3943
      Frederic Weisbecker 提交于
      While truncating a file, reiserfs_setattr() calls inode_setattr()
      that will truncate the mapping for the given inode, but for that
      it needs the pages locks.
      
      In order to release these, the owners need the reiserfs lock to
      complete their jobs. But they can't, as we don't release it before
      calling inode_setattr().
      
      We need to do that to fix the following softlockups:
      
      INFO: task flush-8:0:2149 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      flush-8:0     D f51af998     0  2149      2 0x00000000
       f51af9ac 00000092 00000002 f51af998 c2803304 00000000 c1894ad0 010f3000
       f51af9cc c1462604 c189ef80 f51af974 c1710304 f715b450 f715b5ec c2807c40
       00000000 0005bb00 c2803320 c102c55b c1710304 c2807c50 c2803304 00000246
      Call Trace:
       [<c1462604>] ? schedule+0x434/0xb20
       [<c102c55b>] ? resched_task+0x4b/0x70
       [<c106fa22>] ? mark_held_locks+0x62/0x80
       [<c146414d>] ? mutex_lock_nested+0x1fd/0x350
       [<c14640b9>] mutex_lock_nested+0x169/0x350
       [<c1178cde>] ? reiserfs_write_lock+0x2e/0x40
       [<c1178cde>] reiserfs_write_lock+0x2e/0x40
       [<c11719a2>] do_journal_end+0xc2/0xe70
       [<c1172912>] journal_end+0xb2/0x120
       [<c11686b3>] ? pathrelse+0x33/0xb0
       [<c11729e4>] reiserfs_end_persistent_transaction+0x64/0x70
       [<c1153caa>] reiserfs_get_block+0x12ba/0x15f0
       [<c106fa22>] ? mark_held_locks+0x62/0x80
       [<c1154b24>] reiserfs_writepage+0xa74/0xe80
       [<c1465a27>] ? _raw_spin_unlock_irq+0x27/0x50
       [<c11f3d25>] ? radix_tree_gang_lookup_tag_slot+0x95/0xc0
       [<c10b5377>] ? find_get_pages_tag+0x127/0x1a0
       [<c106fa22>] ? mark_held_locks+0x62/0x80
       [<c106fcd4>] ? trace_hardirqs_on_caller+0x124/0x170
       [<c10bc1e0>] __writepage+0x10/0x40
       [<c10bc9ab>] write_cache_pages+0x16b/0x320
       [<c10bc1d0>] ? __writepage+0x0/0x40
       [<c10bcb88>] generic_writepages+0x28/0x40
       [<c10bcbd5>] do_writepages+0x35/0x40
       [<c11059f7>] writeback_single_inode+0xc7/0x330
       [<c11067b2>] writeback_inodes_wb+0x2c2/0x490
       [<c1106a86>] wb_writeback+0x106/0x1b0
       [<c1106cf6>] wb_do_writeback+0x106/0x1e0
       [<c1106c18>] ? wb_do_writeback+0x28/0x1e0
       [<c1106e0a>] bdi_writeback_task+0x3a/0xb0
       [<c10cbb13>] bdi_start_fn+0x63/0xc0
       [<c10cbab0>] ? bdi_start_fn+0x0/0xc0
       [<c105d1f4>] kthread+0x74/0x80
       [<c105d180>] ? kthread+0x0/0x80
       [<c100327a>] kernel_thread_helper+0x6/0x10
      3 locks held by flush-8:0/2149:
       #0:  (&type->s_umount_key#30){+++++.}, at: [<c110676f>] writeback_inodes_wb+0x27f/0x490
       #1:  (&journal->j_mutex){+.+...}, at: [<c117199a>] do_journal_end+0xba/0xe70
       #2:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178cde>] reiserfs_write_lock+0x2e/0x40
      INFO: task fstest:3813 blocked for more than 120 seconds.
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      fstest        D 00000002     0  3813   3812 0x00000000
       f5103c94 00000082 f5103c40 00000002 f5ad5450 00000007 f5103c28 011f3000
       00000006 f5ad5450 c10bb005 00000480 c1710304 f5ad5450 f5ad55ec c2907c40
       00000001 f5ad5450 f5103c74 00000046 00000002 f5ad5450 00000007 f5103c6c
      Call Trace:
       [<c10bb005>] ? free_hot_cold_page+0x1d5/0x280
       [<c1462d64>] io_schedule+0x74/0xc0
       [<c10b5a45>] sync_page+0x35/0x60
       [<c146325a>] __wait_on_bit_lock+0x4a/0x90
       [<c10b5a10>] ? sync_page+0x0/0x60
       [<c10b59e5>] __lock_page+0x85/0x90
       [<c105d660>] ? wake_bit_function+0x0/0x60
       [<c10bf654>] truncate_inode_pages_range+0x1e4/0x2d0
       [<c10bf75f>] truncate_inode_pages+0x1f/0x30
       [<c10bf7cf>] truncate_pagecache+0x5f/0xa0
       [<c10bf86a>] vmtruncate+0x5a/0x70
       [<c10fdb7d>] inode_setattr+0x5d/0x190
       [<c1150117>] reiserfs_setattr+0x1f7/0x2f0
       [<c1464569>] ? down_write+0x49/0x70
       [<c10fde01>] notify_change+0x151/0x330
       [<c10e6f3d>] do_truncate+0x6d/0xa0
       [<c10f4ce2>] do_filp_open+0x9a2/0xcf0
       [<c1465aec>] ? _raw_spin_unlock+0x2c/0x50
       [<c10fec50>] ? alloc_fd+0xe0/0x100
       [<c10e602d>] do_sys_open+0x6d/0x130
       [<c1002cfb>] ? sysenter_exit+0xf/0x16
       [<c10e615e>] sys_open+0x2e/0x40
       [<c1002ccc>] sysenter_do_call+0x12/0x32
      3 locks held by fstest/3813:
       #0:  (&sb->s_type->i_mutex_key#4){+.+.+.}, at: [<c10e6f33>] do_truncate+0x63/0xa0
       #1:  (&sb->s_type->i_alloc_sem_key#3){+.+.+.}, at: [<c10fdf07>] notify_change+0x257/0x330
       #2:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178c8e>] reiserfs_write_lock_once+0x2e/0x50
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Christian Kujau <lists@nerdbynature.de>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      108d3943
    • F
      reiserfs: Fix recursive lock on lchown · 5fe1533f
      Frederic Weisbecker 提交于
      On chown, reiserfs will call reiserfs_setattr() to change the owner
      of the given inode, but it may also recursively call
      reiserfs_setattr() to propagate the owner change to the private xattr
      files for this inode.
      
      Hence, the reiserfs lock may be acquired twice which is not wanted
      as reiserfs_setattr() calls journal_begin() that is going to try to
      relax the lock in order to safely acquire the journal mutex.
      
      Using reiserfs_write_lock_once() from reiserfs_setattr() solves
      the problem.
      
      This fixes the following warning, that precedes a lockdep report.
      
      WARNING: at fs/reiserfs/lock.c:95 reiserfs_lock_check_recursive+0x3f/0x50()
      Hardware name: MS-7418
      Unwanted recursive reiserfs lock!
      Pid: 4189, comm: fsstress Not tainted 2.6.33-rc2-tip-atom+ #195
      Call Trace:
       [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
       [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
       [<c103f7ac>] warn_slowpath_common+0x6c/0xc0
       [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50
       [<c103f84b>] warn_slowpath_fmt+0x2b/0x30
       [<c1178bff>] reiserfs_lock_check_recursive+0x3f/0x50
       [<c1172ae3>] do_journal_begin_r+0x83/0x350
       [<c1172f2d>] journal_begin+0x7d/0x140
       [<c106509a>] ? in_group_p+0x2a/0x30
       [<c10fda71>] ? inode_change_ok+0x91/0x140
       [<c115007d>] reiserfs_setattr+0x15d/0x2e0
       [<c10f9bf3>] ? dput+0xe3/0x140
       [<c1465adc>] ? _raw_spin_unlock+0x2c/0x50
       [<c117831d>] chown_one_xattr+0xd/0x10
       [<c11780a3>] reiserfs_for_each_xattr+0x113/0x2c0
       [<c1178310>] ? chown_one_xattr+0x0/0x10
       [<c14641e9>] ? mutex_lock_nested+0x2a9/0x350
       [<c117826f>] reiserfs_chown_xattrs+0x1f/0x60
       [<c106509a>] ? in_group_p+0x2a/0x30
       [<c10fda71>] ? inode_change_ok+0x91/0x140
       [<c1150046>] reiserfs_setattr+0x126/0x2e0
       [<c1177c20>] ? reiserfs_getxattr+0x0/0x90
       [<c11b0d57>] ? cap_inode_need_killpriv+0x37/0x50
       [<c10fde01>] notify_change+0x151/0x330
       [<c10e659f>] chown_common+0x6f/0x90
       [<c10e67bd>] sys_lchown+0x6d/0x80
       [<c1002ccc>] sysenter_do_call+0x12/0x32
      ---[ end trace 7c2b77224c1442fc ]---
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Christian Kujau <lists@nerdbynature.de>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      5fe1533f
    • E
      sysfs: Add lockdep annotations for the sysfs active reference · 846f9974
      Eric W. Biederman 提交于
      Holding locks over device_del -> kobject_del -> sysfs_deactivate can
      cause deadlocks if those same locks are grabbed in sysfs show or store
      methods.
      
      The I model s_active count + completion as a sleeping read/write lock.
      I describe to lockdep sysfs_get_active as a read_trylock,
      sysfs_put_active as a read_unlock, and sysfs_deactivate as a
      write_lock and write_unlock pair.  This seems to capture the essence
      for purposes of finding deadlocks, and in my testing gives finds real
      issues and ignores non-issues.
      
      This brings us back to holding locks over kobject_del is a problem
      that ideally we should find a way of addressing, but at least lockdep
      can tell us about the problems instead of requiring developers to debug
      rare strange system deadlocks, that happen when sysfs files are removed
      while being written to.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      846f9974
  9. 04 1月, 2010 1 次提交
  10. 03 1月, 2010 2 次提交
  11. 02 1月, 2010 2 次提交
    • F
      reiserfs: Safely acquire i_mutex from xattr_rmdir · 835d5247
      Frederic Weisbecker 提交于
      Relax the reiserfs lock before taking the inode mutex from
      xattr_rmdir() to avoid the usual reiserfs lock <-> inode mutex
      bad dependency.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Tested-by: NChristian Kujau <lists@nerdbynature.de>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      835d5247
    • F
      reiserfs: Safely acquire i_mutex from reiserfs_for_each_xattr · 8b513f56
      Frederic Weisbecker 提交于
      Relax the reiserfs lock before taking the inode mutex from
      reiserfs_for_each_xattr() to avoid the usual bad dependencies:
      
      =======================================================
      [ INFO: possible circular locking dependency detected ]
      2.6.32-atom #179
      -------------------------------------------------------
      rm/3242 is trying to acquire lock:
       (&sb->s_type->i_mutex_key#4/3){+.+.+.}, at: [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290
      
      but task is already holding lock:
       (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1143389>] reiserfs_write_lock+0x29/0x40
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
             [<c105ea7f>] __lock_acquire+0x11ff/0x19e0
             [<c105f2c8>] lock_acquire+0x68/0x90
             [<c1401aab>] mutex_lock_nested+0x5b/0x340
             [<c1143339>] reiserfs_write_lock_once+0x29/0x50
             [<c1117022>] reiserfs_lookup+0x62/0x140
             [<c10bd85f>] __lookup_hash+0xef/0x110
             [<c10bf21d>] lookup_one_len+0x8d/0xc0
             [<c1141e3a>] open_xa_dir+0xea/0x1b0
             [<c1142720>] reiserfs_for_each_xattr+0x70/0x290
             [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60
             [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150
             [<c10c9c32>] generic_delete_inode+0xa2/0x170
             [<c10c9d4f>] generic_drop_inode+0x4f/0x70
             [<c10c8b07>] iput+0x47/0x50
             [<c10c0965>] do_unlinkat+0xd5/0x160
             [<c10c0b13>] sys_unlinkat+0x23/0x40
             [<c1002ec4>] sysenter_do_call+0x12/0x32
      
      -> #0 (&sb->s_type->i_mutex_key#4/3){+.+.+.}:
             [<c105f176>] __lock_acquire+0x18f6/0x19e0
             [<c105f2c8>] lock_acquire+0x68/0x90
             [<c1401aab>] mutex_lock_nested+0x5b/0x340
             [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290
             [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60
             [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150
             [<c10c9c32>] generic_delete_inode+0xa2/0x170
             [<c10c9d4f>] generic_drop_inode+0x4f/0x70
             [<c10c8b07>] iput+0x47/0x50
             [<c10c0965>] do_unlinkat+0xd5/0x160
             [<c10c0b13>] sys_unlinkat+0x23/0x40
             [<c1002ec4>] sysenter_do_call+0x12/0x32
      
      other info that might help us debug this:
      
      1 lock held by rm/3242:
       #0:  (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1143389>] reiserfs_write_lock+0x29/0x40
      
      stack backtrace:
      Pid: 3242, comm: rm Not tainted 2.6.32-atom #179
      Call Trace:
       [<c13ffa13>] ? printk+0x18/0x1a
       [<c105d33a>] print_circular_bug+0xca/0xd0
       [<c105f176>] __lock_acquire+0x18f6/0x19e0
       [<c105c932>] ? mark_held_locks+0x62/0x80
       [<c105cc3b>] ? trace_hardirqs_on+0xb/0x10
       [<c1401098>] ? mutex_unlock+0x8/0x10
       [<c105f2c8>] lock_acquire+0x68/0x90
       [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290
       [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290
       [<c1401aab>] mutex_lock_nested+0x5b/0x340
       [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290
       [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290
       [<c1143180>] ? delete_one_xattr+0x0/0x100
       [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60
       [<c1143339>] ? reiserfs_write_lock_once+0x29/0x50
       [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150
       [<c11b0d4f>] ? _atomic_dec_and_lock+0x4f/0x70
       [<c111e990>] ? reiserfs_delete_inode+0x0/0x150
       [<c10c9c32>] generic_delete_inode+0xa2/0x170
       [<c10c9d4f>] generic_drop_inode+0x4f/0x70
       [<c10c8b07>] iput+0x47/0x50
       [<c10c0965>] do_unlinkat+0xd5/0x160
       [<c1401098>] ? mutex_unlock+0x8/0x10
       [<c10c3e0d>] ? vfs_readdir+0x7d/0xb0
       [<c10c3af0>] ? filldir64+0x0/0xf0
       [<c1002ef3>] ? sysenter_exit+0xf/0x16
       [<c105cbe4>] ? trace_hardirqs_on_caller+0x124/0x170
       [<c10c0b13>] sys_unlinkat+0x23/0x40
       [<c1002ec4>] sysenter_do_call+0x12/0x32
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Tested-by: NChristian Kujau <lists@nerdbynature.de>
      Cc: Alexander Beregalov <a.beregalov@gmail.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      8b513f56