1. 23 Jun 2009, 4 commits
  2. 16 Jun 2009, 2 commits
    • ocfs2/net: Use wait_event() in o2net_send_message_vec() · 9af0b38f
      Authored by Sunil Mushran
      Replace wait_event_interruptible() with wait_event() in o2net_send_message_vec(),
      because this function is called by the dlm, which expects signals to be blocked
      (see the sketch after this entry).
      
      Fixes oss bugzilla#1126
      http://oss.oracle.com/bugzilla/show_bug.cgi?id=1126
      Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      9af0b38f
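      The behavioral difference: wait_event_interruptible() returns -ERESTARTSYS when a
      signal is pending, while wait_event() sleeps until the condition becomes true
      regardless of signals. A minimal sketch of the pattern using the generic
      wait-queue macros (the names below are illustrative, not the actual o2net code):

      	#include <linux/wait.h>

      	static DECLARE_WAIT_QUEUE_HEAD(demo_wq);
      	static int demo_done;	/* set by the reply handler in real code */

      	static void demo_wait_for_reply(void)
      	{
      		/*
      		 * wait_event_interruptible() would return -ERESTARTSYS on a
      		 * pending signal, forcing every caller to handle restarts:
      		 *
      		 *	err = wait_event_interruptible(demo_wq, demo_done);
      		 *
      		 * wait_event() ignores signals entirely, which is what a
      		 * caller such as the dlm (running with signals blocked)
      		 * expects.
      		 */
      		wait_event(demo_wq, demo_done);
      	}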
    • ocfs2: Adjust rightmost path in ocfs2_add_branch. · 6b791bcc
      Authored by Tao Ma
      In ocfs2_add_branch, we use the rightmost rec of the leaf extent block
      to generate the e_cpos for the newly added branch. In most cases this is
      fine, but if the parent extent block's rightmost rec covers more clusters
      than the leaf does, inserting clusters into that gap causes a kernel
      panic. The message looks like this:
      (7445,1):ocfs2_insert_at_leaf:3775 ERROR: bug expression:
      le16_to_cpu(el->l_next_free_rec) >= le16_to_cpu(el->l_count)
      (7445,1):ocfs2_insert_at_leaf:3775 ERROR: inode 66053, depth 0, count 28,
      next free 28, rec.cpos 270, rec.clusters 1, insert.cpos 275, insert.clusters 1
       [<fa7ad565>] ? ocfs2_do_insert_extent+0xb58/0xda0 [ocfs2]
       [<fa7b08f2>] ? ocfs2_insert_extent+0x5bd/0x6ba [ocfs2]
       [<fa7b1b8b>] ? ocfs2_add_clusters_in_btree+0x37f/0x564 [ocfs2]
      ...
      
      The panic can easily be reproduced with the following small test case
      (bs=512, cs=4K; all error handling has been removed to keep it readable).
      
      #include <fcntl.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
      	int fd, i;
      	char buf[5] = "test";

      	/* O_CREAT needs a mode argument. */
      	fd = open(argv[1], O_RDWR|O_CREAT, 0644);

      	/* 30 sparse writes, one every 40960 bytes (10 clusters). */
      	for (i = 0; i < 30; i++) {
      		lseek(fd, 40960 * i, SEEK_SET);
      		write(fd, buf, 5);
      	}

      	/* Truncate to 1146880 bytes, i.e. 280 clusters. */
      	ftruncate(fd, 1146880);

      	/* Write at offset 1126400, i.e. cluster 275. */
      	lseek(fd, 1126400, SEEK_SET);
      	write(fd, buf, 5);

      	close(fd);

      	return 0;
      }
      
      The reason for the panic is that the 30 writes and the ftruncate leave
      the file's extent list looking like this:
      
      	Tree Depth: 1   Count: 19   Next Free Rec: 1
      	## Offset        Clusters       Block#
      	0  0             280            86183
      	SubAlloc Bit: 7   SubAlloc Slot: 0
      	Blknum: 86183   Next Leaf: 0
      	CRC32: 00000000   ECC: 0000
      	Tree Depth: 0   Count: 28   Next Free Rec: 28
      	## Offset        Clusters       Block#          Flags
      	0  0             1              143368          0x0
      	1  10            1              143376          0x0
      	...
      	26 260           1              143576          0x0
      	27 270           1              143584          0x0
      
      Now another write at 1126400 (cluster 275), which lands in the gap
      between 271 and 280, triggers ocfs2_add_branch, but after that function
      the tree looks like this:
      	Tree Depth: 1   Count: 19   Next Free Rec: 2
      	## Offset        Clusters       Block#
      	0  0             280            86183
      	1  271           0             143592
      So the extent records now overlap, and the subsequent insert hits the bug.
      
      This patch removes the gap before adding the new branch, so that the
      root (branch) rightmost rec covers the same right edge as the leaf (a
      simplified sketch of the idea follows this entry). In the above case,
      before adding the branch the tree is changed to:
      	Tree Depth: 1   Count: 19   Next Free Rec: 1
      	## Offset        Clusters       Block#
      	0  0             271            86183
      	SubAlloc Bit: 7   SubAlloc Slot: 0
      	Blknum: 86183   Next Leaf: 0
      	CRC32: 00000000   ECC: 0000
      	Tree Depth: 0   Count: 28   Next Free Rec: 28
      	## Offset        Clusters       Block#          Flags
      	0  0             1              143368          0x0
      	1  10            1              143376          0x0
      	...
      	26 260           1              143576          0x0
      	27 270           1              143584          0x0
      After the branch is added, the tree looks like this:
      	Tree Depth: 1   Count: 19   Next Free Rec: 2
      	## Offset        Clusters       Block#
      	0  0             271            86183
      	1  271           0             143592
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      6b791bcc
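      Conceptually, the adjustment trims the rightmost record along the tree's
      rightmost path so that its right edge matches the leaf's right edge before
      the new branch's e_cpos is derived from the leaf. A rough, simplified
      sketch of the idea in plain C (illustrative structures only; the real code
      operates on ocfs2's little-endian on-disk extent records):

      	struct rec {
      		unsigned int cpos;	/* logical start cluster */
      		unsigned int clusters;	/* clusters covered */
      	};

      	struct elist {
      		int next_free;
      		struct rec recs[32];
      	};

      	/*
      	 * Shrink the parent's rightmost record so it ends where the leaf's
      	 * rightmost record ends (271 in the example above). The new branch's
      	 * e_cpos then starts right after the existing records instead of
      	 * overlapping them.
      	 */
      	static void adjust_rightmost(struct elist *parent, const struct elist *leaf)
      	{
      		const struct rec *lr = &leaf->recs[leaf->next_free - 1];
      		struct rec *pr = &parent->recs[parent->next_free - 1];
      		unsigned int leaf_end = lr->cpos + lr->clusters;

      		if (pr->cpos + pr->clusters > leaf_end)
      			pr->clusters = leaf_end - pr->cpos;
      	}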
  3. 12 Jun 2009, 3 commits
  4. 10 Jun 2009, 1 commit
    • ocfs2: fdatasync should skip unimportant metadata writeout · e04cc15f
      Authored by Hisashi Hifumi
      In ocfs2, fdatasync and fsync are currently identical. fdatasync should
      skip committing the transaction when inode->i_state has only I_DIRTY_SYNC
      set, since that indicates nothing but atime and/or mtime updates. The
      following patch improves fdatasync throughput (a sketch of the check
      appears after this entry).
      
      #sysbench --num-threads=16 --max-requests=300000 --test=fileio
      --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
      --file-fsync-mode=fdatasync run
      
      Results:
      -2.6.30-rc8
      Test execution summary:
          total time:                          107.1445s
          total number of events:              119559
          total time taken by event execution: 116.1050
          per-request statistics:
               min:                            0.0000s
               avg:                            0.0010s
               max:                            0.1220s
               approx.  95 percentile:         0.0016s
      
      Threads fairness:
          events (avg/stddev):           7472.4375/303.60
          execution time (avg/stddev):   7.2566/0.64
      
      -2.6.30-rc8-patched
      Test execution summary:
          total time:                          86.8529s
          total number of events:              300016
          total time taken by event execution: 24.3077
          per-request statistics:
               min:                            0.0000s
               avg:                            0.0001s
               max:                            0.0336s
               approx.  95 percentile:         0.0001s
      
      Threads fairness:
          events (avg/stddev):           18751.0000/718.75
          execution time (avg/stddev):   1.5192/0.05
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      e04cc15f
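      A minimal sketch of the check described above, assuming it sits in the
      filesystem's ->fsync handler (illustrative helper name; not the literal
      patch):

      	#include <linux/fs.h>

      	static int demo_can_skip_commit(struct inode *inode, int datasync)
      	{
      		/*
      		 * For fdatasync (datasync != 0), the journal commit can be
      		 * skipped when I_DIRTY_DATASYNC is not set: only timestamps
      		 * changed, which fdatasync is allowed to ignore.
      		 */
      		return datasync && !(inode->i_state & I_DIRTY_DATASYNC);
      	}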
  5. 04 Jun 2009, 9 commits
    • ocfs2: Remove redundant gotos in ocfs2_mount_volume() · 06c59bb8
      Authored by Tao Ma
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      06c59bb8
    • ocfs2: Add statistics for the checksum and ecc operations. · 73be192b
      Authored by Joel Becker
      It would be nice to know how often we get checksum failures. Even
      better, how many of them we can fix with the single-bit ECC. So we add
      a statistics structure. The structure can be installed into debugfs
      wherever the user wants (a rough sketch of such counters follows this
      entry).

      For ocfs2, we'll put it in the superblock-specific debugfs directory and
      pass it down from our higher-level functions. The stats are only
      registered with debugfs when the filesystem supports metadata ECC.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      73be192b
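      A rough sketch of what such counters and their debugfs registration could
      look like (the structure and file names here are illustrative, not the
      actual ocfs2 code):

      	#include <linux/debugfs.h>
      	#include <linux/types.h>

      	/* Illustrative per-filesystem checksum/ECC counters. */
      	struct demo_blockcheck_stats {
      		u64		check_count;	/* blocks checked */
      		u64		failure_count;	/* checksum failures seen */
      		u64		recover_count;	/* failures fixed by ECC */
      		struct dentry	*debugfs_dir;	/* caller-chosen directory */
      	};

      	/* Expose the counters under the directory the caller picked. */
      	static void demo_stats_install(struct demo_blockcheck_stats *stats,
      				       struct dentry *parent)
      	{
      		stats->debugfs_dir = parent;
      		debugfs_create_u64("blocks_checked", 0444, parent,
      				   &stats->check_count);
      		debugfs_create_u64("checksum_failures", 0444, parent,
      				   &stats->failure_count);
      		debugfs_create_u64("ecc_recoveries", 0444, parent,
      				   &stats->recover_count);
      	}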
    • ocfs2 patch to track delayed orphan scan timer statistics · 15633a22
      Authored by Srinivas Eeda
      Patch to track delayed orphan scan timer statistics.
      
      Modifies ocfs2_osb_dump to print the following:
        Orphan Scan=> Local: 10  Global: 21  Last Scan: 67 seconds ago
      Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      15633a22
    • ocfs2: timer to queue scan of all orphan slots · 83273932
      Authored by Srinivas Eeda
      When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
      before moving the dentry to the orphan directory. Other nodes that have
      this dentry in cache have a PR on the same dentry lock.  When the EX is
      requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
      during downconvert.  The inode is finally deleted when the last node to iput
      the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.
      
      A problem arises if a node is forced to free dentry locks because of memory
      pressure. If this happens, the node will no longer get downconvert
      notifications for the dentries that have been unlinked on another node.
      If that node also happens to be actively using the corresponding inode
      and ends up performing the last iput on it, it will fail to delete the
      inode because the MAYBE_ORPHANED flag was never set.
      
      This patch fixes this shortcoming by introducing a periodic scan of the
      orphan directories to delete such inodes (a skeleton of such a scan
      follows this entry). Care has been taken to distribute the workload
      across the cluster so that no one node has to perform the task all the
      time.
      Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      83273932
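      A skeleton of the kind of self-rescheduling scan described above, using
      the kernel's delayed-work API (the names, interval, and stub scan body
      are illustrative, not the actual ocfs2 implementation):

      	#include <linux/workqueue.h>
      	#include <linux/jiffies.h>

      	#define DEMO_ORPHAN_SCAN_INTERVAL	(300 * HZ)	/* made-up interval */

      	static struct delayed_work demo_orphan_scan_work;

      	static void demo_scan_orphan_slots(void)
      	{
      		/* Placeholder: walk each orphan slot and delete inodes
      		 * whose i_nlink has dropped to zero. */
      	}

      	static void demo_orphan_scan_worker(struct work_struct *work)
      	{
      		demo_scan_orphan_slots();
      		/* Re-arm so the scan keeps running periodically. */
      		schedule_delayed_work(&demo_orphan_scan_work,
      				      DEMO_ORPHAN_SCAN_INTERVAL);
      	}

      	static void demo_orphan_scan_start(void)
      	{
      		INIT_DELAYED_WORK(&demo_orphan_scan_work,
      				  demo_orphan_scan_worker);
      		schedule_delayed_work(&demo_orphan_scan_work,
      				      DEMO_ORPHAN_SCAN_INTERVAL);
      	}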
    • ocfs2: Correct ordering of ip_alloc_sem and localloc locks for directories · edd45c08
      Authored by Jan Kara
      ocfs2_write_begin() takes ip_alloc_sem before the local alloc locks, so
      change ocfs2_extend_dir() and ocfs2_expand_inline_dir() to use the same
      lock ordering.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      edd45c08
    • ocfs2: Fix possible deadlock in quota recovery · 80d73f15
      Authored by Jan Kara
      In ocfs2_finish_quota_recovery() we acquired the global quota file lock
      and started recovering the local quota file. During this process we need
      to get quota structures, which calls ocfs2_dquot_acquire(), and that
      takes the global quota file lock again. This second lock attempt can
      block if some other node has requested the quota file lock in the
      meantime. Fix the problem by moving the quota file locking down into the
      function where it is really needed, so that dqget() and dqput() are no
      longer called with the lock held.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      80d73f15
    • ocfs2: Fix possible deadlock with quotas in ocfs2_setattr() · 65bac575
      Authored by Jan Kara
      We called vfs_dq_transfer() with the global quota file lock held. This
      can deadlock: if vfs_dq_transfer() has to allocate a new quota
      structure, it calls ocfs2_dquot_acquire(), which tries to take the quota
      file lock again, and that can block if another node requested the lock
      in the meantime.

      Since we have to call vfs_dq_transfer() with the transaction already
      started, and the quota file lock ranks above the transaction start, we
      cannot rely on ocfs2_dquot_acquire() or ocfs2_dquot_release() taking the
      lock when they need it. Fix the problem by acquiring pointers to all
      quota structures needed by vfs_dq_transfer() before calling the function
      (see the sketch after this entry). This guarantees that all quota
      structures are already allocated and can only be freed after we drop our
      references to them, so the quota file lock is never needed inside
      vfs_dq_transfer().
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      65bac575
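      A rough sketch of the approach with the quota API of that era
      (dqget()/dqput()/vfs_dq_transfer()); the helper below and its error
      handling are illustrative, not the actual ocfs2_setattr() code:

      	#include <linux/fs.h>
      	#include <linux/quota.h>
      	#include <linux/quotaops.h>

      	static int demo_transfer_quota(struct inode *inode, struct iattr *attr)
      	{
      		struct dquot *uq = NULL, *gq = NULL;
      		int status;

      		/* Pin the dquots before the transaction starts, so the
      		 * transfer never has to allocate one (and thus never needs
      		 * the global quota file lock). */
      		if (attr->ia_valid & ATTR_UID)
      			uq = dqget(inode->i_sb, attr->ia_uid, USRQUOTA);
      		if (attr->ia_valid & ATTR_GID)
      			gq = dqget(inode->i_sb, attr->ia_gid, GRPQUOTA);

      		/* ... start the journal transaction here ... */

      		status = vfs_dq_transfer(inode, attr) ? -EDQUOT : 0;

      		/* ... commit the journal transaction here ... */

      		if (uq)
      			dqput(uq);
      		if (gq)
      			dqput(gq);
      		return status;
      	}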
    • ocfs2: Fix lock inversion in ocfs2_local_read_info() · b4c30de3
      Authored by Jan Kara
      This function is called with dqio_mutex held, but it has to acquire a
      lock on the global quota file, which ranks above dqio_mutex. This is not
      a deadlockable lock inversion, since this code path is taken only during
      mount when no one else can race with us, but let's clean it up to
      silence lockdep.

      We simply drop dqio_mutex at the beginning of the function and reacquire
      it at the end, since we don't need it there; no one can race with us at
      this point.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      b4c30de3
    • ocfs2: Fix possible deadlock in ocfs2_global_read_dquot() · 4e8a3019
      Authored by Jan Kara
      A thread must not take a read lock and then try to take the same lock
      for writing: the write request can block on a downconvert requested by
      another node, leading to deadlock. So first drop the quota lock held for
      reading and only then take it for writing (see the sketch after this
      entry).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      4e8a3019
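      The general shape of the fix, sketched with hypothetical lock helpers
      (lock_quota_read()/lock_quota_write() stand in for the real cluster-lock
      calls):

      	/* Hypothetical helpers for the cluster quota lock. The point is the
      	 * ordering: never upgrade read -> write while still holding the
      	 * read lock. */
      	void lock_quota_read(void);
      	void unlock_quota_read(void);
      	void lock_quota_write(void);
      	void unlock_quota_write(void);

      	static void demo_read_then_update(void)
      	{
      		lock_quota_read();
      		/* ... read the global quota entry ... */
      		unlock_quota_read();

      		/* Only after the read lock is fully dropped do we ask for
      		 * the write lock; a downconvert requested by another node
      		 * can now be served, so no deadlock is possible. */
      		lock_quota_write();
      		/* ... re-read and update the entry under the write lock ... */
      		unlock_quota_write();
      	}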
  6. 23 May 2009, 1 commit
  7. 09 May 2009, 1 commit
  8. 06 May 2009, 2 commits
    • ocfs2: update comments in masklog.h · 2b53bc7b
      Authored by Coly Li
      In the mainline ocfs2 code, the interface for masklog is in files under
      /sys/fs/o2cb/masklog, but the comments in fs/ocfs2/cluster/masklog.h
      reference the old /proc interface.  They are out of date.
      
      This patch updates the comments in cluster/masklog.h; the new comments
      also include a bash script example showing how to change the log mask
      bits.
      Signed-off-by: Coly Li <coly.li@suse.de>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      2b53bc7b
    • ocfs2: Don't printk the error when listing too many xattrs. · a46fa684
      Authored by Tao Ma
      Currently the kernel defines XATTR_LIST_MAX as 65536
      in include/linux/limits.h.  This is the largest buffer that is used for
      listing xattrs.
      
      But with the ocfs2 xattr tree there is effectively no limit on the
      number of xattrs. If the filesystem has more names than fit in the
      buffer, the kernel log gets polluted with messages like this when
      listing:
      
      (27738,0):ocfs2_iterate_xattr_buckets:3158 ERROR: status = -34
      (27738,0):ocfs2_xattr_tree_list_index_block:3264 ERROR: status = -34
      
      So don't print an "ERROR" message, as this is not an ocfs2 error.
      Signed-off-by: Tao Ma <tao.ma@oracle.com>
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      a46fa684
  9. 01 May 2009, 1 commit
    • ocfs2: Fix a missing credit when deleting from indexed directories. · dfa13f39
      Authored by Joel Becker
      The ocfs2 directory index updates two blocks when we remove an entry -
      the dx root and the dx leaf.  OCFS2_DELETE_INODE_CREDITS was only
      accounting for the dx leaf.  This shows up when ocfs2_delete_inode()
      runs out of credits in jbd2_journal_dirty_metadata() at
      "J_ASSERT_JH(jh, handle->h_buffer_credits > 0);".
      
      The test that caught this was running dirop_file_racer from the
      ocfs2-test suite with a 250-character filename PREFIX. Run on a 512B
      blocksize, it forces the orphan dir index to grow large enough to
      trigger the problem.
      Signed-off-by: Joel Becker <joel.becker@oracle.com>
      dfa13f39
  10. 30 Apr 2009, 1 commit
  11. 24 Apr 2009, 1 commit
  12. 22 Apr 2009, 2 commits
  13. 15 Apr 2009, 1 commit
  14. 08 Apr 2009, 1 commit
  15. 07 Apr 2009, 1 commit
    • splice: fix deadlock in splicing to file · 7bfac9ec
      Authored by Miklos Szeredi
      There's a possible deadlock in generic_file_splice_write(),
      splice_from_pipe() and ocfs2_file_splice_write():
      
       - task A calls generic_file_splice_write()
       - this calls inode_double_lock(), which locks i_mutex on both
         pipe->inode and target inode
       - ordering depends on inode pointers, can happen that pipe->inode is
         locked first
       - __splice_from_pipe() needs more data, calls pipe_wait()
       - this releases lock on pipe->inode, goes to interruptible sleep
       - task B calls generic_file_splice_write(), similarly to the first
       - this locks pipe->inode, then tries to lock inode, but that is
         already held by task A
       - task A is interrupted, it tries to lock pipe->inode, but fails, as
         it is already held by task B
       - ABBA deadlock
      
      Fix this by explicitly ordering the locks: the outer lock must be on the
      target inode, and the inner lock (which is later unlocked and relocked)
      must be on pipe->inode (see the sketch after this entry). This is safe
      because pipe inodes and target inodes form two non-overlapping sets;
      generic_file_splice_write() and friends are never called with a target
      that is a pipe.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Acked-by: Mark Fasheh <mfasheh@suse.com>
      Acked-by: Jens Axboe <jens.axboe@oracle.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7bfac9ec
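      A minimal illustration of the principle (generic pthread mutexes, not
      the splice code itself): as long as every path takes the target lock
      before the pipe lock, the ABBA interleaving described above cannot
      occur.

      	#include <pthread.h>

      	static pthread_mutex_t target_lock = PTHREAD_MUTEX_INITIALIZER;
      	static pthread_mutex_t pipe_lock   = PTHREAD_MUTEX_INITIALIZER;

      	static void demo_splice_write(void)
      	{
      		/* Outer lock: always the target inode's lock first. */
      		pthread_mutex_lock(&target_lock);

      		/* Inner lock: the pipe's lock. It may be dropped and
      		 * re-taken while waiting for more data, but the outer lock
      		 * is never released in between, so the ordering holds. */
      		pthread_mutex_lock(&pipe_lock);
      		/* ... move data from the pipe to the target file ... */
      		pthread_mutex_unlock(&pipe_lock);

      		pthread_mutex_unlock(&target_lock);
      	}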
  16. 04 Apr 2009, 9 commits