- 04 June 2009, 4 commits
-
-
Committed by Jan Kara

In ocfs2_finish_quota_recovery() we acquired the global quota file lock and started recovering the local quota file. During this process we need to get quota structures, which calls ocfs2_dquot_acquire(), which takes the global quota file lock again. This second lock can block if some other node has requested the quota file lock in the meantime. Fix the problem by moving the quota file locking down into the function where it is really needed. Then dqget() or dqput() won't be called with the lock held.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Committed by Jan Kara

We called vfs_dq_transfer() with the global quota file lock held. This can lead to deadlocks: if vfs_dq_transfer() has to allocate a new quota structure, it calls ocfs2_dquot_acquire(), which tries to take the quota file lock again, and that can block if another node requested the lock in the meantime. Since we have to call vfs_dq_transfer() with the transaction already started, and the quota file lock ranks above the transaction start, we cannot just rely on ocfs2_dquot_acquire() or ocfs2_dquot_release() taking the lock when they need it. We fix the problem by acquiring pointers to all quota structures needed by vfs_dq_transfer() before calling the function. This way we are sure that all quota structures are properly allocated, and they can be freed only after we drop our references to them. Thus we don't need the quota file lock anywhere inside vfs_dq_transfer().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Committed by Jan Kara

This function is called with dqio_mutex held, but it has to take the lock of the global quota file, which ranks above dqio_mutex. This is not a deadlockable lock inversion, since this code path is taken only during mount when no one else can race with us, but let's clean it up to silence lockdep. We simply drop dqio_mutex at the beginning of the function and reacquire it at the end, since we don't need it there - no one can race with us at this point.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Committed by Jan Kara

It is not possible to take a read lock and then try to take the same lock for writing in one thread, as that can block on a downconvert requested by another node, leading to deadlock. So first drop the quota lock held for reading and only after that take it for writing.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
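The same rule applies to ordinary reader-writer locks: upgrading in place can deadlock, so the holder must drop the read side, take the write side, and then revalidate. Below is a minimal userspace sketch of that pattern using POSIX rwlocks rather than the ocfs2 cluster locks; the shared counter and the revalidation step are purely illustrative.

#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value;            /* stands in for the quota state */

void update_if_needed(void)
{
    pthread_rwlock_rdlock(&lock);
    int need_update = (shared_value < 10);
    /* Calling pthread_rwlock_wrlock() here, while still holding the
     * read side, can deadlock - just like taking the cluster lock for
     * writing while already holding it for reading.  Drop it first. */
    pthread_rwlock_unlock(&lock);

    if (!need_update)
        return;

    pthread_rwlock_wrlock(&lock);
    /* Revalidate: the world may have changed between the two locks. */
    if (shared_value < 10)
        shared_value = 10;
    pthread_rwlock_unlock(&lock);
}

int main(void)
{
    update_if_needed();
    printf("shared_value = %d\n", shared_value);
    return 0;
}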
-
- 06 May 2009, 2 commits
-
-
Committed by Coly Li

In the mainline ocfs2 code the masklog interface lives in files under /sys/fs/o2cb/masklog, but the comments in fs/ocfs2/cluster/masklog.h still reference the old /proc interface; they are out of date. This patch updates the comments in cluster/masklog.h and also provides a bash script example showing how to change the log mask bits.

Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
Committed by Tao Ma

Currently the kernel defines XATTR_LIST_MAX as 65536 in include/linux/limits.h. This is the largest buffer used for listing xattrs, but with the ocfs2 xattr tree there is actually no limit on the number of xattrs. If the filesystem has more names than fit in the buffer, the kernel log gets polluted with messages like these when listing:

(27738,0):ocfs2_iterate_xattr_buckets:3158 ERROR: status = -34
(27738,0):ocfs2_xattr_tree_list_index_block:3264 ERROR: status = -34

So don't print an "ERROR" message, as this is not an ocfs2 error.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
- 03 May 2009, 6 commits
-
-
Committed by Oleg Nesterov

->real_parent is the parent. ->parent may be the tracer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro

The Committed_AS field can underflow in certain situations:

> # while true; do cat /proc/meminfo | grep _AS; sleep 1; done | uniq -c
>  1 Committed_AS: 18446744073709323392 kB
> 11 Committed_AS: 18446744073709455488 kB
>  6 Committed_AS: 35136 kB
>  5 Committed_AS: 18446744073709454400 kB
>  7 Committed_AS: 35904 kB
>  3 Committed_AS: 18446744073709453248 kB
>  2 Committed_AS: 34752 kB
>  9 Committed_AS: 18446744073709453248 kB
>  8 Committed_AS: 34752 kB
>  3 Committed_AS: 18446744073709320960 kB
>  7 Committed_AS: 18446744073709454080 kB
>  3 Committed_AS: 18446744073709320960 kB
>  5 Committed_AS: 18446744073709454080 kB
>  6 Committed_AS: 18446744073709320960 kB

This happens because NR_CPUS can be greater than 1000 and meminfo_proc_show() does not check for underflow. But an NR_CPUS-proportional bound isn't a good calculation anyway: the likelihood of lock contention is proportional to the number of online CPUs, not the theoretical maximum (NR_CPUS). The kernel already has generic percpu-counter infrastructure, and using it is the right approach: it simplifies the code, and percpu_counter_read_positive() does not suffer from the underflow issue.

Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org> [All kernel versions]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
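To illustrate the symptom and the semantics of the fix - not the kernel API itself - here is a small userspace sketch: a counter whose folded-in total can transiently dip below zero while per-CPU deltas are still pending prints as a huge unsigned value, while clamping at zero (which is what percpu_counter_read_positive() does) keeps the report sane. All names and numbers are invented for illustration.

#include <stdio.h>

/* Userspace analogue, not the kernel API: a sharded counter whose
 * folded-in total can transiently go negative while per-CPU deltas
 * are still pending. */
static long folded_total;        /* what meminfo-style reporting reads */
static long pending_delta[4];    /* per-CPU contributions not yet folded in */

static long read_raw(void)
{
    return folded_total;                 /* may be negative for a moment */
}

static long read_positive(void)
{
    /* Same idea as percpu_counter_read_positive(): never report < 0. */
    return folded_total < 0 ? 0 : folded_total;
}

int main(void)
{
    /* CPU 1 charged 64 pages but has not folded them in yet;
     * CPU 0's uncharge of 64 pages already hit the total. */
    pending_delta[1] = 64;
    folded_total -= 64;

    printf("raw, printed unsigned: %lu kB\n", (unsigned long)(read_raw() * 4));
    printf("clamped:               %lu kB\n", (unsigned long)(read_positive() * 4));

    /* Once the pending delta is folded in, the true value appears. */
    folded_total += pending_delta[1];
    pending_delta[1] = 0;
    printf("after folding in:      %lu kB\n", (unsigned long)(read_raw() * 4));
    return 0;
}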
-
Committed by Ivan Kokshaysky

This fixes a problem introduced by commit 3bfacef4 ("get rid of special-casing the /sbin/loader on alpha"): an OSF/1 ECOFF binary segfaults when binfmt_aout is built as a module. That happens because the aout binary handler ends up at the top of the binfmt list due to late registration, and the kernel attempts to execute the binary without the preparatory work that must be done by binfmt_loader. Fix this by changing the registration order of the default binfmt handlers, using list_add_tail(), and introducing an insert_binfmt() function which places a new handler at the top of the binfmt list. This might be generally useful for installing arch-specific frontends for the default handlers, or just for overriding them.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Vitaly Mayatskikh

The intention of commit aae8679b ("pagemap: fix bug in add_to_pagemap, require aligned-length reads of /proc/pid/pagemap") was to force reads of /proc/pid/pagemap to be a multiple of 8 bytes, but it now allows a read of 0 bytes, which actually puts some data into the user's buffer. According to POSIX, if count is zero, read() should return zero and have no other results.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Cc: Thomas Tuttle <ttuttle@google.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
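The POSIX behaviour being restored is easy to check from userspace. A small sketch (the file path, sentinel string, and error handling are only for illustration) that issues a zero-length read and expects 0 back with the buffer untouched:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[16] = "sentinel";       /* must remain untouched */
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* POSIX: read(fd, buf, 0) must return 0 and have no other effect. */
    ssize_t n = read(fd, buf, 0);
    printf("read() returned %zd, buf = \"%s\"\n", n, buf);

    close(fd);
    return n == 0 ? 0 : 1;
}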
-
Committed by Nick Piggin

Change page_mkwrite to allow implementations to return with the page locked, and also change its callers (in the page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM.

Rather than simply holding the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow callers to return with it locked, so filesystems can avoid lock-order-reversal (LOR) conditions with the page lock.

The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page as long as the page is dirty will perform this manipulation in its ->page_mkwrite. It currently must then return with the page unlocked and may not hold any other locks (according to the existing page_mkwrite convention). In this window, the VM could write out the page, clearing page-dirty. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point the filesystem may manipulate the metadata to reflect that the page is no longer dirty.

It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example. And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the three critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes the races nicely, preventing page cleaning for writeout from being initiated in that window. This provides the filesystem with strong synchronisation against the VM here.

- Sage needs this race closed for the ceph filesystem.
- Trond needs it for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation under dirty pages might be susceptible to i_size changing under a partial page at the end of the file (we also have a buffer.c-side problem here, but it cannot be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should eventually allow page_mkwrite to be moved into ->fault, thus avoiding a filesystem calldown and page lock/unlock cycle in __do_fault. ]

[akpm@linux-foundation.org: fix derefs of NULL ->mapping]

Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ian Kent

Fix an obviously incorrect return status in autofs4_mount_busy().

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 01 May 2009, 1 commit
-
-
Committed by Joel Becker

The ocfs2 directory index updates two blocks when we remove an entry - the dx root and the dx leaf. OCFS2_DELETE_INODE_CREDITS was only accounting for the dx leaf. This shows up when ocfs2_delete_inode() runs out of credits in jbd2_journal_dirty_metadata() at "J_ASSERT_JH(jh, handle->h_buffer_credits > 0);". The test that caught this was running dirop_file_racer from the ocfs2-test suite with a 250-character filename PREFIX. Run on a 512B blocksize, it forces the orphan dir index to grow large enough to trigger the problem.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
- 30 April 2009, 5 commits
-
-
Committed by Christoph Hellwig

xfs_getbmap (or rather the formatters called by it) copies out the getbmap structures while holding the ilock, which can deadlock against mmap. This was reported via bugzilla a while ago (#717) and has recently also shown up via lockdep. So allocate a temporary buffer to format the kernel getbmap structures into, and copy them out after dropping the locks. A small downside is that we limit the number of extents we can copy out by the maximum allocation size, but I see no real way around that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
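The general shape of the fix - stage the records into a private buffer while the lock is held, then hand them to the caller only after the lock is dropped - looks roughly like the sketch below. This is a hedged userspace illustration with invented names, not the XFS code; in the kernel the final step would be copy_to_user() rather than memcpy().

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct record { long start; long len; };

static pthread_mutex_t ilock = PTHREAD_MUTEX_INITIALIZER;
static struct record table[4] = { {0, 8}, {8, 8}, {16, 4}, {20, 12} };

/* Fill 'out' with up to 'max' records.  The lock protects 'table'
 * but is NOT held while the results are handed back to the caller. */
int get_records(struct record *out, int max)
{
    struct record *tmp = calloc(max, sizeof(*tmp));
    int n;

    if (!tmp)
        return -1;

    pthread_mutex_lock(&ilock);
    for (n = 0; n < max && n < 4; n++)
        tmp[n] = table[n];              /* format under the lock */
    pthread_mutex_unlock(&ilock);

    memcpy(out, tmp, n * sizeof(*tmp)); /* copy out after unlocking */
    free(tmp);
    return n;
}

int main(void)
{
    struct record buf[4];
    return get_records(buf, 4) < 0;
}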
-
Committed by Christoph Hellwig

- reshuffle various conditionals for the data vs attr fork to make the code more readable
- do fine-grained goto-based error handling
- exit early from conditionals instead of keeping a long else branch around
- allow kmem_alloc to fail

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
-
Committed by Olaf Weber

There have been reports where an xfs filesystem was randomly corrupted with fsfuzzer and xfs failed to handle it gracefully. This patch fixes a couple of the reported problems by adding additional checks to the superblock validation routine.

Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
-
Committed by Lachlan McIlroy

We had some systems crash with this stack:

[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60

The problem here is that xfs_file_last_byte() does not acquire the inode lock and can therefore race with another thread that is modifying the extent list. While xfs_bmap_last_offset() was trying to look up what was the last extent, some extents were merged and the extent list shrank, so the index we looked up is now beyond the end of the extent list and potentially in a freed buffer.

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
-
Committed by Tao Ma

With the indexed dir feature enabled, we now use ocfs2_dir_lookup_result to wrap all the bhs used for the dir, so remove the two unused variables.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
-
- 29 April 2009, 1 commit
-
-
Committed by FUJITA Tomonori

The st driver uses blk_rq_map_user() in order to just build a request out of page frames. In this case, map_data->offset is a non-zero value and iov[0].iov_base is NULL. We need to increase nr_pages to account for that offset.

Cc: stable@kernel.org
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
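The page-count arithmetic behind this is simple: a transfer of len bytes that starts offset bytes into its first page spans ceil((offset + len) / PAGE_SIZE) pages, which is one more than len alone would suggest whenever the offset pushes the end past a page boundary. A small sketch of that calculation (the numbers are illustrative, not taken from the st driver):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Pages spanned by a transfer of 'len' bytes that starts 'offset'
 * bytes into its first page. */
static unsigned long nr_pages(unsigned long offset, unsigned long len)
{
    return (offset + len + PAGE_SIZE - 1) / PAGE_SIZE;
}

int main(void)
{
    /* Ignoring the offset: 4096 bytes fit in one page. */
    printf("offset 0,   len 4096 -> %lu page(s)\n", nr_pages(0, 4096));
    /* With a 512-byte offset the same 4096 bytes straddle two pages. */
    printf("offset 512, len 4096 -> %lu page(s)\n", nr_pages(512, 4096));
    return 0;
}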
-
- 28 April 2009, 4 commits
-
-
Committed by Tyler Hicks

This warning shows up on 64-bit builds:

fs/ecryptfs/inode.c:693: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
-
Committed by Randy Dunlap

fs/ecryptfs/inode.c:670: warning: format '%d' expects type 'int', but argument 3 has type 'size_t'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
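The usual fix for this class of warning is to print size_t with the dedicated %zu (or %zd for ssize_t) conversion rather than %d; a cast to int merely hides the mismatch on 64-bit targets. A minimal illustration with made-up variable names, not the ecryptfs code:

#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t name_len = strlen("hello.txt");

    /* printf("len = %d\n", name_len);  -- warns: %d expects int, got size_t */
    printf("len = %zu\n", name_len);     /* %zu is the size_t conversion */
    return 0;
}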
-
Committed by Chris Mason

This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items. If it is certain a given inode has no acls, it will set the in-memory acl fields to null to avoid acl lookups completely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Committed by Chris Mason

Linus noticed that the btrfs code to cache acls wasn't properly caching a NULL acl when the inode didn't have any acls. This meant the common case of no acls resulted in expensive btree searches every time the kernel checked permissions (which is quite often).

This is a modified version of Linus' original patch:

Properly set the initial acl fields to BTRFS_ACL_NOT_CACHED in the inode. This forces an acl lookup when permission checks are done.

Fix btrfs_get_acl to avoid lookups and locking when the inode acl fields are set to null.

Fix btrfs_get_acl to use the right return value from __btrfs_getxattr when deciding to cache a NULL acl. It was storing a NULL acl when __btrfs_getxattr returned -ENOENT, but __btrfs_getxattr was actually returning -ENODATA for this case.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
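The underlying pattern is a cached pointer with a "not yet looked up" sentinel that is distinct from a cached "definitely absent" NULL. A hedged sketch of that idea in plain C follows; the names and the fake lookup are illustrative, not the btrfs implementation.

#include <stdio.h>

#define ACL_NOT_CACHED ((struct acl *)-1)   /* sentinel: lookup not done yet */

struct acl { int mode; };

struct inode {
    struct acl *acl;        /* NULL = cached "no acl", sentinel = unknown */
};

/* Pretend backing-store lookup; returns NULL when no acl exists. */
static struct acl *slow_lookup(void)
{
    printf("expensive lookup\n");
    return NULL;
}

static struct acl *get_acl(struct inode *inode)
{
    if (inode->acl != ACL_NOT_CACHED)
        return inode->acl;              /* fast path, even for "no acl" */

    inode->acl = slow_lookup();         /* cache the result, NULL included */
    return inode->acl;
}

int main(void)
{
    struct inode in = { .acl = ACL_NOT_CACHED };

    get_acl(&in);   /* does the expensive lookup once */
    get_acl(&in);   /* cached NULL: no second lookup */
    return 0;
}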
-
- 27 April 2009, 8 commits
-
-
Committed by Dan Carpenter

The inode->i_mutex should be unlocked. Found by smatch (http://repo.or.cz/w/smatch.git). Compile tested.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Christoph Hellwig

Get rid of the useless comments and the equally useless obj-y initialization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
-
Committed by Joel Becker

I just happened to notice a bunch of %llu vs u64 warnings. Here's a patch to cast them all.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
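The warning exists because, depending on the architecture, a 64-bit integer type may be defined as unsigned long rather than unsigned long long, while %llu always expects the latter; the portable idiom is to cast at the printf call. A minimal example, using uint64_t in userspace as a stand-in for the kernel's u64:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t bytenr = 123456789ULL;

    /* On LP64 targets uint64_t is often 'unsigned long', so passing it
     * straight to %llu triggers a format warning.  Casting makes the
     * argument match the conversion on every architecture. */
    printf("bytenr = %llu\n", (unsigned long long)bytenr);
    return 0;
}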
-
Committed by Joel Becker

A small warning popped up on ia64 because inode-map.c was comparing a u64 object id with the ULL constant FIRST_FREE_OBJECTID. My first thought was that all the OBJECTID constants should contain the u64 cast, because btrfs code deals entirely in u64s. But then I saw how large that change was, and figured I'd just fix the max() call instead.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
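The reason a plain comparison turns into a warning here is that the kernel's max() macro compares the addresses of two temporaries purely to make the compiler check that both arguments have the same type; mixing u64 with an unsigned-long-long constant trips that check, and the usual fix is a cast on one side. Below is a self-contained sketch of the same trick, using a simplified macro in the style of the kernel's rather than the kernel header itself, and an invented constant:

#include <stdio.h>
#include <stdint.h>

/* Simplified, kernel-style type-checking max(): the (void)(&_x == &_y)
 * comparison does nothing at runtime, but makes the compiler warn
 * ("comparison of distinct pointer types lacks a cast") when the two
 * arguments have different types.  Requires GCC/clang extensions. */
#define max(x, y) ({                \
    typeof(x) _x = (x);             \
    typeof(y) _y = (y);             \
    (void)(&_x == &_y);             \
    _x > _y ? _x : _y; })

#define FIRST_FREE_ID 256ULL        /* illustrative constant, not btrfs' */

int main(void)
{
    uint64_t objectid = 300;        /* may be 'unsigned long' on LP64 */

    /* max(objectid, FIRST_FREE_ID) can warn when the types differ;
     * casting one argument keeps the type check happy. */
    uint64_t next = max(objectid, (uint64_t)FIRST_FREE_ID);

    printf("next objectid = %llu\n", (unsigned long long)next);
    return 0;
}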
-
Committed by Chris Mason

Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Committed by Chris Mason

Btrfs has printks for various IO errors, including bad checksums and mismatches between what we expect the block headers to contain and what we actually find on disk. Longer term we need a real reporting mechanism for this, but for now printk is going to have to do.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Committed by Chris Mason

Btrfs had some old code sitting around under #if 0; this drops it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Committed by Chris Ball

Previously, we updated a device's size prior to attempting a shrink operation. This patch moves the device-resizing logic so that it only happens if the shrink completes successfully. In the process, it introduces a new field to btrfs_device -- disk_total_bytes -- to track the on-disk size.

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
- 25 April 2009, 9 commits
-
-
Committed by Theodore Ts'o

The EXTENTS_FL flag should never be set on special files, but if it is, don't bother trying to validate that the extent tree is valid, since only files, directories, and non-fast symlinks will ever have an extent data structure. We perhaps should flag the filesystem as corrupted if we see a special file (named pipes, device nodes, Unix domain sockets, etc.) with the EXTENTS_FL flag, but e2fsck doesn't currently check this case, so we'll just ignore it for now, since it's harmless.

Without this fix, a special device with the extents flag is flagged as an error by the kernel, so it is impossible to access or delete the inode, but e2fsck doesn't see it as a problem, leading to confused and frustrated users.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
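The check being described boils down to a predicate on the inode mode: only regular files, directories, and (non-fast) symlinks can legitimately carry an extent tree, so validation is skipped for everything else. A hedged sketch of that predicate using the standard mode macros - the function name is made up, and the real ext4 code is organised differently:

#include <stdio.h>
#include <sys/stat.h>

/* Only these inode types ever carry an extent tree; for anything else
 * (FIFOs, sockets, device nodes) a set EXTENTS_FL is ignored rather
 * than treated as on-disk corruption. */
static int inode_can_have_extents(mode_t mode)
{
    return S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode);
}

int main(void)
{
    printf("regular file: %d\n", inode_can_have_extents(S_IFREG | 0644));
    printf("fifo:         %d\n", inode_can_have_extents(S_IFIFO | 0644));
    return 0;
}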
-
Committed by David Howells

RomFS should advance the destination buffer pointer when reading data from a blockdev source (the data may be split over multiple blocks, each requiring its own sb_read() call). Without this, all the data is copied to the beginning of the output buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
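The bug class is a chunked copy loop that forgets to advance its destination cursor, so every chunk lands at the start of the buffer. A minimal illustration of the correct loop shape in generic C, not the RomFS code; the block size and data are invented:

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4

/* Copy 'len' bytes of 'src' into 'dst' one block at a time,
 * advancing BOTH cursors each iteration. */
static void copy_blocks(char *dst, const char *src, size_t len)
{
    while (len > 0) {
        size_t chunk = len < BLOCK_SIZE ? len : BLOCK_SIZE;

        memcpy(dst, src, chunk);   /* forgetting 'dst += chunk' below would
                                    * overwrite the start of the buffer on
                                    * every pass - the bug being fixed */
        dst += chunk;
        src += chunk;
        len -= chunk;
    }
}

int main(void)
{
    char out[16] = { 0 };

    copy_blocks(out, "hello, blocks!", 14);
    printf("%s\n", out);
    return 0;
}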
-
Committed by David Howells

romfs_lookup() should be using a routine akin to strcmp() on the backing store, rather than one akin to strncmp(). If it uses the latter, it's liable to match /bin/shutdown when looking up /bin/sh.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
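The pitfall is easy to reproduce with the standard C functions: bounding the comparison by the length of the name being looked up turns it into a prefix match. A tiny demonstration:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *wanted = "sh";          /* name being looked up */
    const char *on_disk = "shutdown";   /* directory entry being examined */

    /* Bounding by strlen(wanted) only checks the prefix: "shutdown"
     * wrongly matches "sh". */
    printf("strncmp: %s\n",
           strncmp(on_disk, wanted, strlen(wanted)) == 0 ? "match" : "no match");

    /* A full comparison distinguishes the two names. */
    printf("strcmp:  %s\n",
           strcmp(on_disk, wanted) == 0 ? "match" : "no match");
    return 0;
}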
-
Committed by Theodore Ts'o

Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature bit is set. The field is normally zero, but older versions of e2fsck didn't automatically check to make sure of this, so in the spirit of "be liberal in what you accept", don't look at i_file_acl_high unless we are using a 64-bit filesystem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-
Committed by Chris Mason

After a transaction commit, the old roots of the subvol btrees are sent through snapshot removal. This is what actually frees up any blocks replaced by COW, and anything the old blocks pointed to. Snapshot deletion will pause when a transaction commit has started, which helps avoid a huge amount of delayed reference count updates piling up as the transaction is trying to close.

But this pause happens after the snapshot deletion process has asked other procs on the system to throttle back a bit so that it can make progress. We don't want to throttle everyone while we're waiting for the transaction commit; it leads to deadlocks in the user transaction ioctls used by Ceph and makes things slower in general. This patch changes things to avoid the throttling while we sleep.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Committed by Chris Mason

The btrfs fallocate call takes an extent lock on the entire range being fallocated, and then runs through insert_reserved_extent on each extent as it is allocated. The problem with this is that btrfs_drop_extents may decide to try to take the same extent lock fallocate is already holding. The solution used here is to push knowledge of the range that is already locked down into btrfs_drop_extents. It turns out that at least one other caller had the same bug.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Committed by Christoph Hellwig

Just use kmem_cache_create directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Committed by Christoph Hellwig

Currently the extent_map code is only used by btrfs, so don't export its symbols.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
-
Committed by Christoph Hellwig

Get rid of the hacks for building out of tree, and always use += for assigning to the object lists.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
-