- 07 9月, 2013 1 次提交
-
-
由 Milosz Tanski 提交于
Adding support for fscache to the Ceph filesystem. This would bring it to on par with some of the other network filesystems in Linux (like NFS, AFS, etc...) In order to mount the filesystem with fscache the 'fsc' mount option must be passed. Signed-off-by: NMilosz Tanski <milosz@adfin.com> Signed-off-by: NSage Weil <sage@inktank.com>
-
- 06 9月, 2013 3 次提交
-
-
由 Milosz Tanski 提交于
Currently the fscache code expect the netfs to call fscache_readpages_or_alloc inside the aops readpages callback. It marks all the pages in the list provided by readahead with PG_private_2. In the cases that the netfs fails to read all the pages (which is legal) it ends up returning to the readahead and triggering a BUG. This happens because the page list still contains marked pages. This patch implements a simple fscache_readpages_cancel function that the netfs should call before returning from readpages. It will revoke the pages from the underlying cache backend and unmark them. The problem was originally worked out in the Ceph devel tree, but it also occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages() will clean up the unprocessed pages in the case of an error. This can be used to address the following oops: [12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping: (null) index:0x0 [12410647.597298] page flags: 0x200000000001000(private_2) ... [12410647.597334] Call Trace: [12410647.597345] [<ffffffff815523f2>] dump_stack+0x19/0x1b [12410647.597356] [<ffffffff8111def7>] bad_page+0xc7/0x120 [12410647.597359] [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120 [12410647.597361] [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170 [12410647.597363] [<ffffffff81123507>] __put_single_page+0x27/0x30 [12410647.597365] [<ffffffff81123df5>] put_page+0x25/0x40 [12410647.597376] [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph] [12410647.597379] [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260 [12410647.597382] [<ffffffff81122ea1>] ra_submit+0x21/0x30 [12410647.597384] [<ffffffff81118f64>] filemap_fault+0x254/0x490 [12410647.597387] [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0 [12410647.597391] [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0 [12410647.597395] [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0 [12410647.597398] [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930 [12410647.597401] [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110 [12410647.597403] [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10 [12410647.597405] [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [12410647.597407] [<ffffffff8113f361>] handle_mm_fault+0x251/0x370 [12410647.597411] [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30 [12410647.597414] [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550 [12410647.597418] [<ffffffff8108011d>] ? up_write+0x1d/0x20 [12410647.597422] [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0 [12410647.597425] [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240 [12410647.597427] [<ffffffff8155c3ae>] do_page_fault+0xe/0x10 [12410647.597431] [<ffffffff81558818>] page_fault+0x28/0x30 Signed-off-by: NMilosz Tanski <milosz@adfin.com> Signed-off-by: NDavid Howells <dhowells@redhat.com>
-
由 David Howells 提交于
Implement the FS-Cache interface to check the consistency of a cache object in CacheFiles. Original-author: Hongyi Jia <jiayisuse@gmail.com> Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: Hongyi Jia <jiayisuse@gmail.com> cc: Milosz Tanski <milosz@adfin.com>
-
由 David Howells 提交于
Extend the fscache netfs API so that the netfs can ask as to whether a cache object is up to date with respect to its corresponding netfs object: int fscache_check_consistency(struct fscache_cookie *cookie) This will call back to the netfs to check whether the auxiliary data associated with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it may also return -ENOMEM and -ERESTARTSYS. The backends now have to implement a mandatory operation pointer: int (*check_consistency)(struct fscache_object *object) that corresponds to the above API call. FS-Cache takes care of pinning the object and the cookie in memory and managing this call with respect to the object state. Original-author: Hongyi Jia <jiayisuse@gmail.com> Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: Hongyi Jia <jiayisuse@gmail.com> cc: Milosz Tanski <milosz@adfin.com>
-
- 29 8月, 2013 3 次提交
-
-
由 Goldwyn Rodrigues 提交于
While using pacemaker/corosync, the node numbers are generated using IP address as opposed to serial node number generation. This may not fit in a 8-byte string. Use a bigger string to print the complete node number. Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
This just replaces the dentry count/lock combination with the lockref structure that contains both a count and a spinlock, and does the mechanical conversion to use the lockref infrastructure. There are no semantic changes here, it's purely syntactic. The reference lockref implementation uses the spinlock exactly the same way that the old dcache code did, and the bulk of this patch is just expanding the internal "d_count" use in the dcache code to use "d_lockref.count" instead. This is purely preparation for the real change to make the reference count updates be lockless during the 3.12 merge window. [ As with the previous commit, this is a rewritten version of a concept originally from Waiman, so credit goes to him, blame for any errors goes to me. Waiman's patch had some semantic differences for taking advantage of the lockless update in dget_parent(), while this patch is intentionally a pure search-and-replace change with no semantic changes. - Linus ] Signed-off-by: NWaiman Long <Waiman.Long@hp.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
This reverts commit bb2314b4. It wasn't necessarily wrong per se, but we're still busily discussing the exact details of this all, so I'm going to revert it for now. It's true that you can already do flink() through /proc and that flink() isn't new. But as Brad Spengler points out, some secure environments do not mount proc, and flink adds a new interface that can avoid path lookup of the source for those kinds of environments. We may re-do this (and even mark it for stable backporting back in 3.11 and possibly earlier) once the whole discussion about the interface is done. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Brad Spengler <spender@grsecurity.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 28 8月, 2013 5 次提交
-
-
由 Sha Zhengju 提交于
Following we will begin to add memcg dirty page accounting around __set_page_dirty_{buffers,nobuffers} in vfs layer, so we'd better use vfs interface to avoid exporting those details to filesystems. Since vfs set_page_dirty() should be called under page lock, here we don't need elaborate codes to handle racy anymore, and two WARN_ON() are added to detect such exceptions. Thanks very much for Sage and Yan Zheng's coaching! I tested it in a two server's ceph environment that one is client and the other is mds/osd/mon, and run the following fsx test from xfstests: ./fsx 1MB -N 50000 -p 10000 -l 1048576 ./fsx 10MB -N 50000 -p 10000 -l 10485760 ./fsx 100MB -N 50000 -p 10000 -l 104857600 The fsx does lots of mmap-read/mmap-write/truncate operations and the tests completed successfully without triggering any of WARN_ON. Signed-off-by: NSha Zhengju <handai.szj@taobao.com> Reviewed-by: NSage Weil <sage@inktank.com>
-
由 majianpeng 提交于
For sync_read/write, it may do multi stripe operations.If one of those met erro, we return the former successed size rather than a error value. There is a exception for write-operation met -EOLDSNAPC.If this occur,we retry the whole write again. Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
-
由 majianpeng 提交于
cephfs . show_layout >layyout.data_pool: 0 >layout.object_size: 4194304 >layout.stripe_unit: 4194304 >layout.stripe_count: 1 TestA: >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct >dd if=/dev/urandom of=test bs=1M count=2 seek=4 oflag=direct >dd if=test of=/dev/null bs=6M count=1 iflag=direct The messages from func striped_read are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT ceph: file.c:350 : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT ceph: file.c:381 : zero tail 4194304 ceph: file.c:390 : striped_read returns 6291456 The hole of file is from 2M--4M.But actualy it zero the last 4M include the last 2M area which isn't a hole. Using this patch, the messages are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT ceph: file.c:358 : zero gap 2097152 to 4194304 ceph: file.c:350 : striped_read 4194304~2097152 (read 4194304) got 2097152 ceph: file.c:384 : striped_read returns 6291456 TestB: >echo majianpeng > test >dd if=test of=/dev/null bs=2M count=1 iflag=direct The messages are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT ceph: file.c:350 : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT ceph: file.c:390 : striped_read returns 11 For this case,it did once more striped_read.It's no meaningless. Using this patch, the message are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT ceph: file.c:384 : striped_read returns 11 Big thanks to Yan Zheng for the patch. Reviewed-by: NYan, Zheng <zheng.z.yan@intel.com> Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
-
由 Li Wang 提交于
Cleanup in handle_cap_grant(). Signed-off-by: NLi Wang <liwang@ubuntukylin.com> Reviewed-by: NSage Weil <sage@inktank.com>
-
由 Sage Weil 提交于
We need to use do_div to divide by a 64-bit value. Signed-off-by: NSage Weil <sage@inktank.com> Reviewed-by: NJosh Durgin <josh.durgin@inktank.com>
-
- 25 8月, 2013 5 次提交
-
-
由 Dan Carpenter 提交于
This should actually be returning an ERR_PTR on error instead of NULL. That was how it was designed and all the callers expect it. [AV: actually, that's what "VFS: Make clone_mnt()/copy_tree()/collect_mounts() return errors" missed - originally collect_mounts() was expected to return NULL on failure] Cc: <stable@vger.kernel.org> # 3.10+ Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Dan Carpenter 提交于
iget_locked() returns a NULL on error, it doesn't return an ERR_PTR. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Dan Carpenter 提交于
The iget_locked() function returns NULL on error and never an ERR_PTR. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Oleg Nesterov 提交于
proc_readfd_common() does dir_emit_dots() twice in a row, we need to do this only once. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
dynamic_dname() is both too much and too little for those - the output may be well in excess of 64 bytes dynamic_dname() assumes to be enough (thanks to ashmem feeding really long names to shmem_file_setup()) and vsnprintf() is an overkill for those guys. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 24 8月, 2013 2 次提交
-
-
由 Vyacheslav Dubeyko 提交于
Fix the issue with improper counting number of flying bio requests for BIO_EOPNOTSUPP error detection case. The sb_nbio must be incremented exactly the same number of times as complete() function was called (or will be called) because nilfs_segbuf_wait() will call wail_for_completion() for the number of times set to sb_nbio: do { wait_for_completion(&segbuf->sb_bio_event); } while (--segbuf->sb_nbio > 0); Two functions complete() and wait_for_completion() must be called the same number of times for the same sb_bio_event. Otherwise, wait_for_completion() will hang or leak. Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vyacheslav Dubeyko 提交于
Remove double call of bio_put() in nilfs_end_bio_write() for the case of BIO_EOPNOTSUPP error detection. The issue was found by Dan Carpenter and he suggests first version of the fix too. Signed-off-by: NVyacheslav Dubeyko <slava@dubeyko.com> Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 8月, 2013 1 次提交
-
-
由 Roland Dreier 提交于
There is a nasty bug in the SCSI SG_IO ioctl that in some circumstances leads to one process writing data into the address space of some other random unrelated process if the ioctl is interrupted by a signal. What happens is the following: - A process issues an SG_IO ioctl with direction DXFER_FROM_DEV (ie the underlying SCSI command will transfer data from the SCSI device to the buffer provided in the ioctl) - Before the command finishes, a signal is sent to the process waiting in the ioctl. This will end up waking up the sg_ioctl() code: result = wait_event_interruptible(sfp->read_wait, (srp_done(sfp, srp) || sdp->detached)); but neither srp_done() nor sdp->detached is true, so we end up just setting srp->orphan and returning to userspace: srp->orphan = 1; write_unlock_irq(&sfp->rq_list_lock); return result; /* -ERESTARTSYS because signal hit process */ At this point the original process is done with the ioctl and blithely goes ahead handling the signal, reissuing the ioctl, etc. - Eventually, the SCSI command issued by the first ioctl finishes and ends up in sg_rq_end_io(). At the end of that function, we run through: write_lock_irqsave(&sfp->rq_list_lock, iflags); if (unlikely(srp->orphan)) { if (sfp->keep_orphan) srp->sg_io_owned = 0; else done = 0; } srp->done = done; write_unlock_irqrestore(&sfp->rq_list_lock, iflags); if (likely(done)) { /* Now wake up any sg_read() that is waiting for this * packet. */ wake_up_interruptible(&sfp->read_wait); kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN); kref_put(&sfp->f_ref, sg_remove_sfp); } else { INIT_WORK(&srp->ew.work, sg_rq_end_io_usercontext); schedule_work(&srp->ew.work); } Since srp->orphan *is* set, we set done to 0 (assuming the userspace app has not set keep_orphan via an SG_SET_KEEP_ORPHAN ioctl), and therefore we end up scheduling sg_rq_end_io_usercontext() to run in a workqueue. - In workqueue context we go through sg_rq_end_io_usercontext() -> sg_finish_rem_req() -> blk_rq_unmap_user() -> ... -> bio_uncopy_user() -> __bio_copy_iov() -> copy_to_user(). The key point here is that we are doing copy_to_user() on a workqueue -- that is, we're on a kernel thread with current->mm equal to whatever random previous user process was scheduled before this kernel thread. So we end up copying whatever data the SCSI command returned to the virtual address of the buffer passed into the original ioctl, but it's quite likely we do this copying into a different address space! As suggested by James Bottomley <James.Bottomley@hansenpartnership.com>, add a check for current->mm (which is NULL if we're on a kernel thread without a real userspace address space) in bio_uncopy_user(), and skip the copy if we're on a kernel thread. There's no reason that I can think of for any caller of bio_uncopy_user() to want to do copying on a kernel thread with a random active userspace address space. Huge thanks to Costa Sapuntzakis <costa@purestorage.com> for the original pointer to this bug in the sg code. Signed-off-by: NRoland Dreier <roland@purestorage.com> Tested-by: NDavid Milburn <dmilburn@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: NJames Bottomley <JBottomley@Parallels.com>
-
- 20 8月, 2013 2 次提交
-
-
由 Linus Torvalds 提交于
In the previous commit, Richard Genoud fixed proc_root_readdir(), which had lost the check for whether all of the non-process /proc entries had been returned or not. But that in turn exposed _another_ bug, namely that the original readdir conversion patch had yet another problem: it had lost the return value of proc_readdir_de(), so now checking whether it had completed successfully or not didn't actually work right anyway. This reinstates the non-zero return for the "end of base entries" that had also gotten lost in commit f0c3b509 ("[readdir] convert procfs"). So now you get all the base entries *and* you get all the process entries, regardless of getdents buffer size. (Side note: the Linux "getdents" manual page actually has a nice example application for testing getdents, which can be easily modified to use different buffers. Who knew? Man-pages can be useful) Reported-by: NEmmanuel Benisty <benisty.e@gmail.com> Reported-by: NMarc Dionne <marc.c.dionne@gmail.com> Cc: Richard Genoud <richard.genoud@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Richard Genoud 提交于
Commit f0c3b509 ("[readdir] convert procfs") introduced a bug on the listing of the proc file-system. The return value of proc_readdir() isn't tested anymore in the proc_root_readdir function. This lead to an "interesting" behaviour when we are using the getdents() system call with a buffer too small: instead of failing, it returns the first entries of /proc (enough to fill the given buffer), plus the PID directories. This is not triggered on glibc (as getdents is called with a 32KB buffer), but on uclibc, the buffer size is only 1KB, thus some proc entries are missing. See https://lkml.org/lkml/2013/8/12/288 for more background. Signed-off-by: NRichard Genoud <richard.genoud@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 8月, 2013 5 次提交
-
-
由 Steven Whitehouse 提交于
Since the introduction of atomic_open, gfs2_getxattr can be called with the glock already held, so we need to allow for this. Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com> Reported-by: NDavid Teigland <teigland@redhat.com> Tested-by: NDavid Teigland <teigland@redhat.com>
-
由 Dan Carpenter 提交于
alloc_workqueue() returns a NULL on error, it doesn't return an ERR_PTR. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Benjamin Marzinski 提交于
When run during fsync, a gfs2_log_flush could happen between the time when gfs2_ail_flush checked the number of blocks to revoke, and when it actually started the transaction to do those revokes. This occassionally caused it to need more revokes than it reserved, causing gfs2 to crash. Instead of just reserving enough revokes to handle the blocks that currently need them, this patch makes gfs2_ail_flush reserve the maximum number of revokes it can, without increasing the total number of reserved log blocks. This patch also passes the number of reserved revokes to __gfs2_ail_flush() so that it doesn't go over its limit and cause a crash like we're seeing. Non-fsync calls to __gfs2_ail_flush will still cause a BUG() necessary revokes are skipped. Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com> Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
由 Tejun Heo 提交于
dbf2576e ("workqueue: make all workqueues non-reentrant") made WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages. This patch doesn't introduce any behavior changes. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com> Cc: cluster-devel@redhat.com
-
由 Steven Whitehouse 提交于
PTR_RET should be PTR_ERR Reported-by: NSachin Kamat <sachin.kamat@linaro.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
-
- 17 8月, 2013 1 次提交
-
-
由 Jan Kara 提交于
Commit 0713ed0c added jbd2_journal_file_inode() call into ext4_block_zero_page_range(). However that function gets called from truncate path and thus inode needn't have jinode attached - that happens in ext4_file_open() but the file needn't be ever open since mount. Calling jbd2_journal_file_inode() without jinode attached results in the oops. We fix the problem by attaching jinode to inode also in ext4_truncate() and ext4_punch_hole() when we are going to zero out partial blocks. Reported-by: Nmajianpeng <majianpeng@gmail.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
- 16 8月, 2013 6 次提交
-
-
由 Linus Torvalds 提交于
Ben Tebulin reported: "Since v3.7.2 on two independent machines a very specific Git repository fails in 9/10 cases on git-fsck due to an SHA1/memory failures. This only occurs on a very specific repository and can be reproduced stably on two independent laptops. Git mailing list ran out of ideas and for me this looks like some very exotic kernel issue" and bisected the failure to the backport of commit 53a59fc6 ("mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT"). That commit itself is not actually buggy, but what it does is to make it much more likely to hit the partial TLB invalidation case, since it introduces a new case in tlb_next_batch() that previously only ever happened when running out of memory. The real bug is that the TLB gather virtual memory range setup is subtly buggered. It was introduced in commit 597e1c35 ("mm/mmu_gather: enable tlb flush range in generic mmu_gather"), and the range handling was already fixed at least once in commit e6c495a9 ("mm: fix the TLB range flushed when __tlb_remove_page() runs out of slots"), but that fix was not complete. The problem with the TLB gather virtual address range is that it isn't set up by the initial tlb_gather_mmu() initialization (which didn't get the TLB range information), but it is set up ad-hoc later by the functions that actually flush the TLB. And so any such case that forgot to update the TLB range entries would potentially miss TLB invalidates. Rather than try to figure out exactly which particular ad-hoc range setup was missing (I personally suspect it's the hugetlb case in zap_huge_pmd(), which didn't have the same logic as zap_pte_range() did), this patch just gets rid of the problem at the source: make the TLB range information available to tlb_gather_mmu(), and initialize it when initializing all the other tlb gather fields. This makes the patch larger, but conceptually much simpler. And the end result is much more understandable; even if you want to play games with partial ranges when invalidating the TLB contents in chunks, now the range information is always there, and anybody who doesn't want to bother with it won't introduce subtle bugs. Ben verified that this fixes his problem. Reported-bisected-and-tested-by: NBen Tebulin <tebulin@googlemail.com> Build-testing-by: NStephen Rothwell <sfr@canb.auug.org.au> Build-testing-by: NRichard Weinberger <richard.weinberger@gmail.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Kleikamp 提交于
NFSv4 reserves readdir cookie values 0-2 for special entries (. and ..), but jfs allows a value of 2 for a non-special entry. This incompatibility can result in the nfs client reporting a readdir loop. This patch doesn't change the value stored internally, but adds one to the value exposed to the iterate method. Signed-off-by: NDave Kleikamp <dave.kleikamp@oracle.com> Tested-by: NChristian Kujau <lists@nerdbynature.de>
-
由 Li Wang 提交于
This patch implements fallocate and punch hole support for Ceph kernel client. Signed-off-by: NLi Wang <liwang@ubuntukylin.com> Signed-off-by: NYunchuan Wen <yunchuanwen@ubuntukylin.com>
-
由 Yan, Zheng 提交于
ceph_check_caps() requests new max size only when there is Fw cap. If we call check_max_size() while there is no Fw cap. It updates i_wanted_max_size and calls ceph_check_caps(), but ceph_check_caps() does nothing. Later when Fw cap is issued, we call check_max_size() again. But i_wanted_max_size is equal to 'endoff' at this time, so check_max_size() doesn't call ceph_check_caps() and we end up with waiting for the new max size forever. The fix is duplicate ceph_check_caps()'s "request max size" code in check_max_size(), and make try_get_cap_refs() wait for the Fw cap before retry requesting new max size. This patch also removes the "endoff > (inode->i_size << 1)" check in check_max_size(). It's useless because there is no corresponding logic in ceph_check_caps(). Reviewed-by: NSage Weil <sage@inktank.com> Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
-
由 Yan, Zheng 提交于
I encountered below deadlock when running fsstress wmtruncate work truncate MDS --------------- ------------------ -------------------------- lock i_mutex <- truncate file lock i_mutex (blocked) <- revoking Fcb (filelock to MIX) send request -> handle request (xlock filelock) At the initial time, there are some dirty pages in the page cache. When the kclient receives the truncate message, it reduces inode size and creates some 'out of i_size' dirty pages. wmtruncate work can't truncate these dirty pages because it's blocked by the i_mutex. Later when the kclient receives the cap message that revokes Fcb caps, It can't flush all dirty pages because writepages() only flushes dirty pages within the inode size. When the MDS handles the 'truncate' request from kclient, it waits for the filelock to become stable. But the filelock is stuck in unstable state because it can't finish revoking kclient's Fcb caps. The truncate pagecache locking has already caused lots of trouble for use. I think it's time simplify it by introducing a new mutex. We use the new mutex to prevent concurrent truncate_inode_pages(). There is no need to worry about race between buffered write and truncate_inode_pages(), because our "get caps" mechanism prevents them from concurrent execution. Reviewed-by: NSage Weil <sage@inktank.com> Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
-
由 Milosz Tanski 提交于
The invalidatepage code bails if it encounters a non-zero page offset. The current logic that does is non-obvious with multiple if statements. This should be logically and functionally equivalent. Signed-off-by: NMilosz Tanski <milosz@adfin.com> Reviewed-by: NSage Weil <sage@inktank.com>
-
- 14 8月, 2013 6 次提交
-
-
由 yonghua zheng 提交于
Recently we met quite a lot of random kernel panic issues after enabling CONFIG_PROC_PAGE_MONITOR. After debuggind we found this has something to do with following bug in pagemap: In struct pagemapread: struct pagemapread { int pos, len; pagemap_entry_t *buffer; bool v2; }; pos is number of PM_ENTRY_BYTES in buffer, but len is the size of buffer, it is a mistake to compare pos and len in add_page_map() for checking buffer is full or not, and this can lead to buffer overflow and random kernel panic issue. Correct len to be total number of PM_ENTRY_BYTES in buffer. [akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition] Signed-off-by: NYonghua Zheng <younghua.zheng@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jeff Liu 提交于
Fix a NULL pointer deference while removing an empty directory, which was introduced by commit 3704412b ("[readdir] convert ocfs2"). BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<(null)>] (null) PGD 6da85067 PUD 6da89067 PMD 0 Oops: 0010 [#1] SMP CPU: 0 PID: 6564 Comm: rmdir Tainted: G O 3.11.0-rc1 #4 RIP: 0010:[<0000000000000000>] [< (null)>] (null) Call Trace: ocfs2_dir_foreach+0x49/0x50 [ocfs2] ocfs2_empty_dir+0x12c/0x3e0 [ocfs2] ocfs2_unlink+0x56e/0xc10 [ocfs2] vfs_rmdir+0xd5/0x140 do_rmdir+0x1cb/0x1e0 SyS_rmdir+0x16/0x20 system_call_fastpath+0x16/0x1b Code: Bad RIP value. RIP [< (null)>] (null) RSP <ffff88006daddc10> CR2: 0000000000000000 [dan.carpenter@oracle.com: fix pointer math] Signed-off-by: NJie Liu <jeff.liu@oracle.com> Reported-by: NDavid Weber <wb@munzinger.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tiger Yang 提交于
Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with a NULL as the struct file pointer, it finally result in a null pointer dereference in ocfs2_duplicate_clusters_by_page. This patch replace file pointer with inode pointer in cow_duplicate_clusters to fix this issue. [jeff.liu@oracle.com: rebased patch against linux-next tree] Signed-off-by: NTiger Yang <tiger.yang@oracle.com> Signed-off-by: NJie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: NTao Ma <tm@tao.ma> Tested-by: NDavid Weber <wb@munzinger.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jie Liu 提交于
Revert commit 40bd62eb ("fs/ocfs2/journal.h: add bits_wanted while calculating credits in ocfs2_calc_extend_credits"). Unfortunately this change broke fallocate even if there is insufficient disk space for the preallocation, which is a serious problem. # df -h /dev/sda8 22G 1.2G 21G 6% /ocfs2 # fallocate -o 0 -l 200M /ocfs2/testfile fallocate: /ocfs2/test: fallocate failed: No space left on device and a kernel warning: CPU: 3 PID: 3656 Comm: fallocate Tainted: G W O 3.11.0-rc3 #2 Call Trace: dump_stack+0x77/0x9e warn_slowpath_common+0xc4/0x110 warn_slowpath_null+0x2a/0x40 start_this_handle+0x6c/0x640 [jbd2] jbd2__journal_start+0x138/0x300 [jbd2] jbd2_journal_start+0x23/0x30 [jbd2] ocfs2_start_trans+0x166/0x300 [ocfs2] __ocfs2_extend_allocation+0x38f/0xdb0 [ocfs2] ocfs2_allocate_unwritten_extents+0x3c9/0x520 __ocfs2_change_file_space+0x5e0/0xa60 [ocfs2] ocfs2_fallocate+0xb1/0xe0 [ocfs2] do_fallocate+0x1cb/0x220 SyS_fallocate+0x6f/0xb0 system_call_fastpath+0x16/0x1b JBD2: fallocate wants too many credits (51216 > 4381) Signed-off-by: NJie Liu <jeff.liu@oracle.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Dave has reported the following lockdep splat: ================================= [ INFO: inconsistent lock state ] 3.11.0-rc1+ #9 Not tainted --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 {RECLAIM_FS-ON-W} state was registered at: mark_held_locks+0x81/0xe7 lockdep_trace_alloc+0x5e/0xbc __alloc_pages_nodemask+0x8b/0x9b6 __get_free_pages+0x20/0x31 get_zeroed_page+0x12/0x14 __pmd_alloc+0x1c/0x6b huge_pmd_share+0x265/0x283 huge_pte_alloc+0x5d/0x71 hugetlb_fault+0x7c/0x64a handle_mm_fault+0x255/0x299 __do_page_fault+0x142/0x55c do_page_fault+0xd/0x16 error_code+0x6c/0x74 irq event stamp: 3136917 hardirqs last enabled at (3136917): _raw_spin_unlock_irq+0x27/0x50 hardirqs last disabled at (3136916): _raw_spin_lock_irq+0x15/0x78 softirqs last enabled at (3136180): __do_softirq+0x137/0x30f softirqs last disabled at (3136175): irq_exit+0xa8/0xaa other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&mapping->i_mmap_mutex); <Interrupt> lock(&mapping->i_mmap_mutex); *** DEADLOCK *** no locks held by kswapd0/49. stack backtrace: CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 Call Trace: dump_stack+0x4b/0x79 print_usage_bug+0x1d9/0x1e3 mark_lock+0x1e0/0x261 __lock_acquire+0x623/0x17f2 lock_acquire+0x7d/0x195 mutex_lock_nested+0x6c/0x3a7 page_referenced+0x87/0x5e3 shrink_page_list+0x3d9/0x947 shrink_inactive_list+0x155/0x4cb shrink_lruvec+0x300/0x5ce shrink_zone+0x53/0x14e kswapd+0x517/0xa75 kthread+0xa8/0xaa ret_from_kernel_thread+0x1b/0x28 which is a false positive caused by hugetlb pmd sharing code which allocates a new pmd from withing mapping->i_mmap_mutex. If this allocation causes reclaim then the lockdep detector complains that we might self-deadlock. This is not correct though, because hugetlb pages are not reclaimable so their mapping will be never touched from the reclaim path. The patch tells lockup detector that hugetlb i_mmap_mutex is special by assigning it a separate lockdep class so it won't report possible deadlocks on unrelated mappings. [peterz@infradead.org: comment for annotation] Reported-by: NDave Jones <davej@redhat.com> Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cyrill Gorcunov 提交于
Andy reported that if file page get reclaimed we lose the soft-dirty bit if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get encoded into pte entry. Thus when #pf happens on such non-present pte we can restore it back. Reported-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-