- 16 6月, 2009 2 次提交
-
-
由 Sunil Mushran 提交于
Replace wait_event_interruptible() with wait_event() in o2net_send_message_vec(). This is because this function is called by the dlm that expects signals to be blocked. Fixes oss bugzilla#1126 http://oss.oracle.com/bugzilla/show_bug.cgi?id=1126Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Tao Ma 提交于
In ocfs2_add_branch, we use the rightmost rec of the leaf extent block to generate the e_cpos for the newly added branch. In the most case, it is OK but if the parent extent block's rightmost rec covers more clusters than the leaf does, it will cause kernel panic if we insert some clusters in it. The message is something like: (7445,1):ocfs2_insert_at_leaf:3775 ERROR: bug expression: le16_to_cpu(el->l_next_free_rec) >= le16_to_cpu(el->l_count) (7445,1):ocfs2_insert_at_leaf:3775 ERROR: inode 66053, depth 0, count 28, next free 28, rec.cpos 270, rec.clusters 1, insert.cpos 275, insert.clusters 1 [<fa7ad565>] ? ocfs2_do_insert_extent+0xb58/0xda0 [ocfs2] [<fa7b08f2>] ? ocfs2_insert_extent+0x5bd/0x6ba [ocfs2] [<fa7b1b8b>] ? ocfs2_add_clusters_in_btree+0x37f/0x564 [ocfs2] ... The panic can be easily reproduced by the following small test case (with bs=512, cs=4K, and I remove all the error handling so that it looks clear enough for reading). int main(int argc, char **argv) { int fd, i; char buf[5] = "test"; fd = open(argv[1], O_RDWR|O_CREAT); for (i = 0; i < 30; i++) { lseek(fd, 40960 * i, SEEK_SET); write(fd, buf, 5); } ftruncate(fd, 1146880); lseek(fd, 1126400, SEEK_SET); write(fd, buf, 5); close(fd); return 0; } The reason of the panic is that: the 30 writes and the ftruncate makes the file's extent list looks like: Tree Depth: 1 Count: 19 Next Free Rec: 1 ## Offset Clusters Block# 0 0 280 86183 SubAlloc Bit: 7 SubAlloc Slot: 0 Blknum: 86183 Next Leaf: 0 CRC32: 00000000 ECC: 0000 Tree Depth: 0 Count: 28 Next Free Rec: 28 ## Offset Clusters Block# Flags 0 0 1 143368 0x0 1 10 1 143376 0x0 ... 26 260 1 143576 0x0 27 270 1 143584 0x0 Now another write at 1126400(275 cluster) whiich will write at the gap between 271 and 280 will trigger ocfs2_add_branch, but the result after the function looks like: Tree Depth: 1 Count: 19 Next Free Rec: 2 ## Offset Clusters Block# 0 0 280 86183 1 271 0 143592 So the extent record is intersected and make the following operation bug out. This patch just try to remove the gap before we add the new branch, so that the root(branch) rightmost rec will cover the same right position. So in the above case, before adding branch the tree will be changed to Tree Depth: 1 Count: 19 Next Free Rec: 1 ## Offset Clusters Block# 0 0 271 86183 SubAlloc Bit: 7 SubAlloc Slot: 0 Blknum: 86183 Next Leaf: 0 CRC32: 00000000 ECC: 0000 Tree Depth: 0 Count: 28 Next Free Rec: 28 ## Offset Clusters Block# Flags 0 0 1 143368 0x0 1 10 1 143376 0x0 ... 26 260 1 143576 0x0 27 270 1 143584 0x0 And after branch add, the tree looks like Tree Depth: 1 Count: 19 Next Free Rec: 2 ## Offset Clusters Block# 0 0 271 86183 1 271 0 143592 Signed-off-by: NTao Ma <tao.ma@oracle.com> Acked-by: NMark Fasheh <mfasheh@suse.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
- 12 6月, 2009 3 次提交
-
-
由 Alessio Igor Bogani 提交于
[xfs, btrfs, capifs, shmem don't need BKL, exempt] Signed-off-by: NAlessio Igor Bogani <abogani@texware.it> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
Move BKL into ->put_super from the only caller. A couple of filesystems had trivial enough ->put_super (only kfree and NULLing of s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs, hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most of them probably don't need it, but I'd rather sort that out individually. Preferably after all the other BKL pushdowns in that area. [AV: original used to move lock_super() down as well; these changes are removed since we don't do lock_super() at all in generic_shutdown_super() now] [AV: fuse, btrfs and xfs are known to need no damn BKL, exempt] Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NJoel Becker <joel.becker@oracle.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 10 6月, 2009 1 次提交
-
-
由 Hisashi Hifumi 提交于
In ocfs2, fdatasync and fsync are identical. I think fdatasync should skip committing transaction when inode->i_state is set just I_DIRTY_SYNC and this indicates only atime or/and mtime updates. Following patch improves fdatasync throughput. #sysbench --num-threads=16 --max-requests=300000 --test=fileio --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr --file-fsync-mode=fdatasync run Results: -2.6.30-rc8 Test execution summary: total time: 107.1445s total number of events: 119559 total time taken by event execution: 116.1050 per-request statistics: min: 0.0000s avg: 0.0010s max: 0.1220s approx. 95 percentile: 0.0016s Threads fairness: events (avg/stddev): 7472.4375/303.60 execution time (avg/stddev): 7.2566/0.64 -2.6.30-rc8-patched Test execution summary: total time: 86.8529s total number of events: 300016 total time taken by event execution: 24.3077 per-request statistics: min: 0.0000s avg: 0.0001s max: 0.0336s approx. 95 percentile: 0.0001s Threads fairness: events (avg/stddev): 18751.0000/718.75 execution time (avg/stddev): 1.5192/0.05 Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Acked-by: NMark Fasheh <mfasheh@suse.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
- 04 6月, 2009 9 次提交
-
-
由 Tao Ma 提交于
Signed-off-by: NTao Ma <tao.ma@oracle.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Joel Becker 提交于
It would be nice to know how often we get checksum failures. Even better, how many of them we can fix with the single bit ecc. So, we add a statistics structure. The structure can be installed into debugfs wherever the user wants. For ocfs2, we'll put it in the superblock-specific debugfs directory and pass it down from our higher-level functions. The stats are only registered with debugfs when the filesystem supports metadata ecc. Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Srinivas Eeda 提交于
Patch to track delayed orphan scan timer statistics. Modifies ocfs2_osb_dump to print the following: Orphan Scan=> Local: 10 Global: 21 Last Scan: 67 seconds ago Signed-off-by: NSrinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Srinivas Eeda 提交于
When a dentry is unlinked, the unlinking node takes an EX on the dentry lock before moving the dentry to the orphan directory. Other nodes that have this dentry in cache have a PR on the same dentry lock. When the EX is requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED during downconvert. The inode is finally deleted when the last node to iput the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set. A problem arises if a node is forced to free dentry locks because of memory pressure. If this happens, the node will no longer get downconvert notifications for the dentries that have been unlinked on another node. If it also happens that node is actively using the corresponding inode and happens to be the one performing the last iput on that inode, it will fail to delete the inode as it will not have the MAYBE_ORPHANED flag set. This patch fixes this shortcoming by introducing a periodic scan of the orphan directories to delete such inodes. Care has been taken to distribute the workload across the cluster so that no one node has to perform the task all the time. Signed-off-by: NSrinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Jan Kara 提交于
We use ordering ip_alloc_sem -> local alloc locks in ocfs2_write_begin(). So change lock ordering in ocfs2_extend_dir() and ocfs2_expand_inline_dir() to also use this lock ordering. Signed-off-by: NJan Kara <jack@suse.cz> Acked-by: NMark Fasheh <mfasheh@suse.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Jan Kara 提交于
In ocfs2_finish_quota_recovery() we acquired global quota file lock and started recovering local quota file. During this process we need to get quota structures, which calls ocfs2_dquot_acquire() which gets global quota file lock again. This second lock can block in case some other node has requested the quota file lock in the mean time. Fix the problem by moving quota file locking down into the function where it is really needed. Then dqget() or dqput() won't be called with the lock held. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Jan Kara 提交于
We called vfs_dq_transfer() with global quota file lock held. This can lead to deadlocks as if vfs_dq_transfer() has to allocate new quota structure, it calls ocfs2_dquot_acquire() which tries to get quota file lock again and this can block if another node requested the lock in the mean time. Since we have to call vfs_dq_transfer() with transaction already started and quota file lock ranks above the transaction start, we cannot just rely on ocfs2_dquot_acquire() or ocfs2_dquot_release() on getting the lock if they need it. We fix the problem by acquiring pointers to all quota structures needed by vfs_dq_transfer() already before calling the function. By this we are sure that all quota structures are properly allocated and they can be freed only after we drop references to them. Thus we don't need quota file lock anywhere inside vfs_dq_transfer(). Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Jan Kara 提交于
This function is called with dqio_mutex held but it has to acquire lock from global quota file which ranks above this lock. This is not deadlockable lock inversion since this code path is take only during mount when noone else can race with us but let's clean this up to silence lockdep. We just drop the dqio_mutex in the beginning of the function and reacquire it in the end since we don't need it - noone can race with us at this moment. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Jan Kara 提交于
It is not possible to get a read lock and then try to get the same write lock in one thread as that can block on downconvert being requested by other node leading to deadlock. So first drop the quota lock for reading and only after that get it for writing. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
- 23 5月, 2009 1 次提交
-
-
由 Martin K. Petersen 提交于
Until now we have had a 1:1 mapping between storage device physical block size and the logical block sized used when addressing the device. With SATA 4KB drives coming out that will no longer be the case. The sector size will be 4KB but the logical block size will remain 512-bytes. Hence we need to distinguish between the physical block size and the logical ditto. This patch renames hardsect_size to logical_block_size. Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 09 5月, 2009 1 次提交
-
-
由 Joel Becker 提交于
ocfs2 was hand-calling vfs_follow_link(), but there's no point to that. Let's use page_follow_link_light() and nd_set_link(). Signed-off-by: NJoel Becker <joel.becker@oracle.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 06 5月, 2009 2 次提交
-
-
由 Coly Li 提交于
In the mainline ocfs2 code, the interface for masklog is in files under /sys/fs/o2cb/masklog, but the comments in fs/ocfs2/cluster/masklog.h reference the old /proc interface. They are out of date. This patch modifies the comments in cluster/masklog.h, which also provides a bash script example on how to change the log mask bits. Signed-off-by: NColy Li <coly.li@suse.de> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Tao Ma 提交于
Currently the kernel defines XATTR_LIST_MAX as 65536 in include/linux/limits.h. This is the largest buffer that is used for listing xattrs. But with ocfs2 xattr tree, we actually have no limit for the number. If filesystem has more names than can fit in the buffer, the kernel logs will be pollluted with something like this when listing: (27738,0):ocfs2_iterate_xattr_buckets:3158 ERROR: status = -34 (27738,0):ocfs2_xattr_tree_list_index_block:3264 ERROR: status = -34 So don't print "ERROR" message as this is not an ocfs2 error. Signed-off-by: NTao Ma <tao.ma@oracle.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
- 01 5月, 2009 1 次提交
-
-
由 Joel Becker 提交于
The ocfs2 directory index updates two blocks when we remove an entry - the dx root and the dx leaf. OCFS2_DELETE_INODE_CREDITS was only accounting for the dx leaf. This shows up when ocfs2_delete_inode() runs out of credits in jbd2_journal_dirty_metadata() at "J_ASSERT_JH(jh, handle->h_buffer_credits > 0);". The test that caught this was running dirop_file_racer from the ocfs2-test suite with a 250-character filename PREFIX. Run on a 512B blocksize, it forces the orphan dir index to grow large enough to trigger. Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
- 30 4月, 2009 1 次提交
-
-
由 Tao Ma 提交于
With indexed dir enabled, now we use ocfs2_dir_lookup_result to wrap all the bh used for dir. So remove the 2 unused variables. Signed-off-by: NTao Ma <tao.ma@oracle.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
- 24 4月, 2009 1 次提交
-
-
由 Sunil Mushran 提交于
In ocfs2_dentry_attach_lock(), if unable to get the dentry lock, we need to call iput(inode) because a failure here means no d_instantiate(), which means the normally matching iput() will not be called during dput(dentry). This patch fixes the oops that accompanies the following message: (3996,1):dlm_empty_lockres:2708 ERROR: lockres W00000000000000000a1046b06a4382 still has local locks! kernel BUG in dlm_empty_lockres at /rpmbuild/smushran/BUILD/ocfs2-1.4.2/fs/ocfs2/dlm/dlmmaster.c:2709! Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
- 22 4月, 2009 2 次提交
-
-
由 Joel Becker 提交于
The old %llu vs u64 battle. Cast them correctly. Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
由 Tao Ma 提交于
fs/ocfs2/dir.c: In function ‘ocfs2_extend_dir’: fs/ocfs2/dir.c:2700: warning: ‘ret’ may be used uninitialized in this function fs/ocfs2/suballoc.c: In function ‘ocfs2_get_suballoc_slot_bit’: fs/ocfs2/suballoc.c:2216: warning: comparison is always true due to limited range of data type Signed-off-by: NTao Ma <tao.ma@oracle.com> Signed-off-by: NJoel Becker <joel.becker@oracle.com>
-
- 15 4月, 2009 1 次提交
-
-
由 Miklos Szeredi 提交于
Rearrange locking of i_mutex on destination and call to ocfs2_rw_lock() so locks are only held while buffers are copied with the pipe_to_file() actor, and not while waiting for more data on the pipe. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 08 4月, 2009 1 次提交
-
-
由 Tao Ma 提交于
In ocfs2_expand_inline_dir, we calculate whether we need 1 extra cluster if we can't store the dx inline the root and save it in dx_alloc. So add it when we call ocfs2_reserve_clusters. Signed-off-by: NTao Ma <tao.ma@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
- 07 4月, 2009 1 次提交
-
-
由 Miklos Szeredi 提交于
There's a possible deadlock in generic_file_splice_write(), splice_from_pipe() and ocfs2_file_splice_write(): - task A calls generic_file_splice_write() - this calls inode_double_lock(), which locks i_mutex on both pipe->inode and target inode - ordering depends on inode pointers, can happen that pipe->inode is locked first - __splice_from_pipe() needs more data, calls pipe_wait() - this releases lock on pipe->inode, goes to interruptible sleep - task B calls generic_file_splice_write(), similarly to the first - this locks pipe->inode, then tries to lock inode, but that is already held by task A - task A is interrupted, it tries to lock pipe->inode, but fails, as it is already held by task B - ABBA deadlock Fix this by explicitly ordering locks: the outer lock must be on target inode and the inner lock (which is later unlocked and relocked) must be on pipe->inode. This is OK, pipe inodes and target inodes form two nonoverlapping sets, generic_file_splice_write() and friends are not called with a target which is a pipe. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Acked-by: NMark Fasheh <mfasheh@suse.com> Acked-by: NJens Axboe <jens.axboe@oracle.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 4月, 2009 13 次提交
-
-
由 Srinivas Eeda 提交于
During recovery, a node recovers orphans in it's slot and the dead node(s). But if the dead nodes were holding orphans in offline slots, they will be left unrecovered. If the dead node is the last one to die and is holding orphans in other slots and is the first one to mount, then it only recovers it's own slot, which leaves orphans in offline slots. This patch queues complete_recovery to clean orphans for all offline slots during mount and node recovery. Signed-off-by: NSrinivas Eeda <srinivas.eeda@oracle.com> Acked-by: NJoel Becker <joel.becker@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Hisashi Hifumi 提交于
A page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate on pagesize != blocksize environment. This aops checks that all buffers which correspond to a part of a file that we want to read are uptodate. If so, we do not have to issue actual read IO to HDD even if a page is not uptodate because the portion we want to read are uptodate. "block_is_partially_uptodate" function is already used by ext2/3/4. With the following patch random read/write mixed workloads or random read after random write workloads can be optimized and we can get performance improvement. Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 wengang wang 提交于
For nfs exporting, ocfs2_get_dentry() returns the dentry for fh. ocfs2_get_dentry() may read from disk when the inode is not in memory, without any cross cluster lock. this leads to the file system loading a stale inode. This patch fixes above problem. Solution is that in case of inode is not in memory, we get the cluster lock(PR) of alloc inode where the inode in question is allocated from (this causes node on which deletion is done sync the alloc inode) before reading out the inode itsself. then we check the bitmap in the group (the inode in question allcated from) to see if the bit is clear. if it's clear then it's stale. if the bit is set, we then check generation as the existing code does. We have to read out the inode in question from disk first to know its alloc slot and allot bit. And if its not stale we read it out using ocfs2_iget(). The second read should then be from cache. And also we have to add a per superblock nfs_sync_lock to cover the lock for alloc inode and that for inode in question. this is because ocfs2_get_dentry() and ocfs2_delete_inode() lock on them in reverse order. nfs_sync_lock is locked in EX mode in ocfs2_get_dentry() and in PR mode in ocfs2_delete_inode(). so that mutliple ocfs2_delete_inode() can run concurrently in normal case. [mfasheh@suse.com: build warning fixes and comment cleanups] Signed-off-by: NWengang Wang <wen.gang.wang@oracle.com> Acked-by: NJoel Becker <joel.becker@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
The debugfs file, mle_state, now prints the number of largest number of mles in one hash link. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
This patch attempts to fix a fine race between purging and migration. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
This patch removes struct dlm_lock_name and adds the entries directly to struct dlm_master_list_entry. Under the new scheme, both mles that are backed by a lockres or not, will have the name populated in mle->mname. This allows us to get rid of code that was figuring out the location of the mle name. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
This patch shows the number of lockres' and mles in the debugfs file, dlm_state. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
This patch inlines dlm_set_lockres_owner() and dlm_change_lockres_owner(). Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
This patch replaces the lockres counts that tracked the number number of locally and remotely mastered lockres' with a current and total count. The total count is the number of lockres' that have been created since the dlm domain was created. The number of locally and remotely mastered counts can be computed using the locking_state output. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
The lifetime of a mle is limited to the duration of the lockres mastery process. While typically this lifetime is fairly short, we have noticed the number of mles explode under certain circumstances. This patch tracks the number of each different types of mles and should help us determine how best to speed up the mastery process. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
The previous patch explicitly did not indent dlm_cleanup_master_list() so as to make the patch readable. This patch properly indents the function. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
With this patch, the mles are stored in a hash and not a simple list. This should improve the mle lookup time when the number of outstanding masteries is large. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-
由 Sunil Mushran 提交于
This patch adds code to create and destroy the dlm->master_hash. Signed-off-by: NSunil Mushran <sunil.mushran@oracle.com> Signed-off-by: NMark Fasheh <mfasheh@suse.com>
-