  1. 07 Sep 2013, 1 commit
  2. 06 Sep 2013, 3 commits
    • fscache: Netfs function for cleanup post readpages · 5a6f282a
      Authored by Milosz Tanski
      Currently the fscache code expects the netfs to call fscache_readpages_or_alloc
      inside the aops readpages callback.  It marks all the pages in the list
      provided by readahead with PG_private_2.  In cases where the netfs fails to
      read all the pages (which is legal), it ends up returning to readahead and
      triggering a BUG, because the page list still contains marked pages.
      
      This patch implements a simple fscache_readpages_cancel function that the netfs
      should call before returning from readpages.  It will revoke the pages from the
      underlying cache backend and unmark them.
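
      A minimal sketch of how a netfs readpages error path might use the new call;
      the examplefs_* helpers and cookie lookup are assumptions, only
      fscache_readpages_cancel() comes from this patch:

        #include <linux/fs.h>
        #include <linux/fscache.h>

        /* Hypothetical netfs ->readpages() that bails out early. */
        static int examplefs_readpages(struct file *file, struct address_space *mapping,
                                       struct list_head *pages, unsigned nr_pages)
        {
                struct fscache_cookie *cookie = examplefs_cookie(mapping->host); /* assumed */
                int err;

                /* May leave PG_private_2 set on pages it has already marked. */
                err = examplefs_read_from_cache(cookie, mapping, pages, nr_pages); /* assumed */
                if (err < 0) {
                        /* Revoke and unmark the remaining pages before handing
                         * the list back to readahead, so it doesn't BUG. */
                        fscache_readpages_cancel(cookie, pages);
                        return err;
                }
                return 0;
        }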
      
      The problem was originally worked out in the Ceph devel tree, but it also
      occurs in CIFS.  It appears that NFS, AFS and 9P are okay as read_cache_pages()
      will clean up the unprocessed pages in the case of an error.
      
      This can be used to address the following oops:
      
      [12410647.597278] BUG: Bad page state in process petabucket  pfn:3d504e
      [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping:
      	(null) index:0x0
      [12410647.597298] page flags: 0x200000000001000(private_2)
      
      ...
      
      [12410647.597334] Call Trace:
      [12410647.597345]  [<ffffffff815523f2>] dump_stack+0x19/0x1b
      [12410647.597356]  [<ffffffff8111def7>] bad_page+0xc7/0x120
      [12410647.597359]  [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120
      [12410647.597361]  [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170
      [12410647.597363]  [<ffffffff81123507>] __put_single_page+0x27/0x30
      [12410647.597365]  [<ffffffff81123df5>] put_page+0x25/0x40
      [12410647.597376]  [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph]
      [12410647.597379]  [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260
      [12410647.597382]  [<ffffffff81122ea1>] ra_submit+0x21/0x30
      [12410647.597384]  [<ffffffff81118f64>] filemap_fault+0x254/0x490
      [12410647.597387]  [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0
      [12410647.597391]  [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0
      [12410647.597395]  [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0
      [12410647.597398]  [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930
      [12410647.597401]  [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110
      [12410647.597403]  [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10
      [12410647.597405]  [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
      [12410647.597407]  [<ffffffff8113f361>] handle_mm_fault+0x251/0x370
      [12410647.597411]  [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30
      [12410647.597414]  [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550
      [12410647.597418]  [<ffffffff8108011d>] ? up_write+0x1d/0x20
      [12410647.597422]  [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0
      [12410647.597425]  [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240
      [12410647.597427]  [<ffffffff8155c3ae>] do_page_fault+0xe/0x10
      [12410647.597431]  [<ffffffff81558818>] page_fault+0x28/0x30
      Signed-off-by: Milosz Tanski <milosz@adfin.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • CacheFiles: Implement interface to check cache consistency · 5002d7be
      Authored by David Howells
      Implement the FS-Cache interface to check the consistency of a cache object in
      CacheFiles.
      
      Original-author: Hongyi Jia <jiayisuse@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hongyi Jia <jiayisuse@gmail.com>
      cc: Milosz Tanski <milosz@adfin.com>
    • FS-Cache: Add interface to check consistency of a cached object · da9803bc
      Authored by David Howells
      Extend the fscache netfs API so that the netfs can ask whether a cache
      object is up to date with respect to its corresponding netfs object:
      
      	int fscache_check_consistency(struct fscache_cookie *cookie)
      
      This will call back to the netfs to check whether the auxiliary data associated
      with a cookie is correct.  It returns 0 if it is and -ESTALE if it isn't; it
      may also return -ENOMEM and -ERESTARTSYS.
      
      The backends now have to implement a mandatory operation pointer:
      
      	int (*check_consistency)(struct fscache_object *object)
      
      that corresponds to the above API call.  FS-Cache takes care of pinning the
      object and the cookie in memory and managing this call with respect to the
      object state.
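
      A minimal sketch of both sides of this interface; the examplefs/examplecache
      names are illustrative, only fscache_check_consistency() and the
      check_consistency operation come from this patch:

        #include <linux/fscache.h>

        /* Netfs side: decide whether locally cached data can be trusted. */
        static bool examplefs_cache_valid(struct fscache_cookie *cookie)
        {
                /* 0 means consistent; -ESTALE, -ENOMEM or -ERESTARTSYS otherwise. */
                return fscache_check_consistency(cookie) == 0;
        }

        /* Cache backend side: the now-mandatory operation, wired into the
         * backend's ops table. */
        static int examplecache_check_consistency(struct fscache_object *object)
        {
                /* Compare the stored auxiliary data with what the netfs reports. */
                return 0;       /* or -ESTALE if the object is out of date */
        }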
      
      Original-author: Hongyi Jia <jiayisuse@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hongyi Jia <jiayisuse@gmail.com>
      cc: Milosz Tanski <milosz@adfin.com>
  3. 29 Aug 2013, 3 commits
    • fs/ocfs2/super.c: Use bigger nodestr to accommodate 32-bit node numbers · 49fa8140
      Authored by Goldwyn Rodrigues
      While using pacemaker/corosync, the node numbers are generated from the IP
      address, as opposed to serial node number generation.  This may not fit
      in an 8-byte string.  Use a bigger string to print the complete node
      number.
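
      A rough illustration of the sizing problem (the buffer name and width here
      are assumptions for the example, not the exact values from the patch):

        #include <stdio.h>

        int main(void)
        {
                /* A 32-bit node number can need up to 10 digits plus the NUL. */
                unsigned int node_num = 3232235777u;    /* e.g. 192.168.1.1 packed into a u32 */
                char nodestr[12];                       /* an 8-byte buffer would truncate this */

                snprintf(nodestr, sizeof(nodestr), "%u", node_num);
                printf("%s\n", nodestr);
                return 0;
        }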
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vfs: make the dentry cache use the lockref infrastructure · 98474236
      Authored by Waiman Long
      This just replaces the dentry count/lock combination with the lockref
      structure that contains both a count and a spinlock, and does the
      mechanical conversion to use the lockref infrastructure.
      
      There are no semantic changes here, it's purely syntactic.  The
      reference lockref implementation uses the spinlock exactly the same way
      that the old dcache code did, and the bulk of this patch is just
      expanding the internal "d_count" use in the dcache code to use
      "d_lockref.count" instead.
      
      This is purely preparation for the real change to make the reference
      count updates be lockless during the 3.12 merge window.
      
      [ As with the previous commit, this is a rewritten version of a concept
        originally from Waiman, so credit goes to him, blame for any errors
        goes to me.
      
        Waiman's patch had some semantic differences for taking advantage of
        the lockless update in dget_parent(), while this patch is
        intentionally a pure search-and-replace change with no semantic
        changes.     - Linus ]
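
      A minimal sketch of the mechanical change described above (the dentry layout
      shown is an illustrative stand-in, not the full struct):

        #include <linux/lockref.h>
        #include <linux/spinlock.h>

        struct example_dentry {                 /* stand-in for struct dentry */
                struct lockref d_lockref;       /* replaces the old d_lock/d_count pair */
        };

        static void example_dget(struct example_dentry *dentry)
        {
                /* Same spinlock discipline as before; what used to be
                 * dentry->d_count++ under dentry->d_lock becomes: */
                spin_lock(&dentry->d_lockref.lock);
                dentry->d_lockref.count++;
                spin_unlock(&dentry->d_lockref.lock);
        }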
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink" · f0cc6ffb
      Authored by Linus Torvalds
      This reverts commit bb2314b4.
      
      It wasn't necessarily wrong per se, but we're still busily discussing
      the exact details of this all, so I'm going to revert it for now.
      
      It's true that you can already do flink() through /proc and that flink()
      isn't new.  But as Brad Spengler points out, some secure environments do
      not mount proc, and flink adds a new interface that can avoid path
      lookup of the source for those kinds of environments.
      
      We may re-do this (and even mark it for stable backporting back in 3.11
      and possibly earlier) once the whole discussion about the interface is done.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 28 Aug 2013, 5 commits
    • ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem · 7d6e1f54
      Authored by Sha Zhengju
      Following patches will begin to add memcg dirty page accounting around
      __set_page_dirty_{buffers,nobuffers} in the vfs layer, so we'd better use the
      vfs interface to avoid exporting those details to filesystems.
      
      Since vfs set_page_dirty() should be called under the page lock, we no longer
      need elaborate code here to handle races, and two WARN_ON()s are added to
      detect such exceptions.
      Thanks very much for Sage and Yan Zheng's coaching!
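
      Illustrative only: the real ceph_set_page_dirty() also does cap/snap
      bookkeeping, but the pattern is the vfs helper used under the page lock,
      with a WARN_ON() guarding that assumption:

        #include <linux/mm.h>
        #include <linux/pagemap.h>

        static int examplefs_set_page_dirty(struct page *page)
        {
                WARN_ON(!PageLocked(page));
                /* filesystem-specific dirty accounting would go here */
                return __set_page_dirty_nobuffers(page);
        }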
      
      I tested it in a two-server ceph environment, where one node is the client
      and the other is the mds/osd/mon, and ran the following fsx tests from
      xfstests:
      
        ./fsx   1MB -N 50000 -p 10000 -l 1048576
        ./fsx  10MB -N 50000 -p 10000 -l 10485760
        ./fsx 100MB -N 50000 -p 10000 -l 104857600
      
      fsx does lots of mmap-read/mmap-write/truncate operations, and the tests
      completed successfully without triggering any of the WARN_ON()s.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • ceph: allow sync_read/write to return the partial success size of a read/write · ee7289bf
      Authored by majianpeng
      A sync_read/write may perform multiple stripe operations.  If one of them
      hits an error, we return the size that already succeeded rather than an
      error value.  There is an exception for a write operation that hits
      -EOLDSNAPC: if this occurs, we retry the whole write.
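
      A schematic of that return-value convention (not the actual ceph code; the
      callback shape and the EOLDSNAPC placeholder value are assumptions):

        #include <sys/types.h>

        #define EOLDSNAPC 1000          /* placeholder: ceph-internal errno value */

        /* do_stripe() stands in for one stripe I/O; returns bytes done or -errno. */
        static ssize_t example_sync_write(ssize_t (*do_stripe)(int idx), int nr_stripes)
        {
                ssize_t total = 0, ret;
                int i;

                for (i = 0; i < nr_stripes; i++) {
                        ret = do_stripe(i);
                        if (ret == -EOLDSNAPC)
                                return ret;                 /* caller retries the whole write */
                        if (ret < 0)
                                return total ? total : ret; /* report the partial success */
                        total += ret;
                }
                return total;
        }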
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
    • ceph: fix short-read handling bugs in sync read mode · 02ae66d8
      Authored by majianpeng
      cephfs . show_layout
      >layout.data_pool:     0
      >layout.object_size:   4194304
      >layout.stripe_unit:   4194304
      >layout.stripe_count:  1
      
      TestA:
      >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct
      >dd if=/dev/urandom of=test bs=1M count=2 seek=4  oflag=direct
      >dd if=test of=/dev/null bs=6M count=1 iflag=direct
      The messages from func striped_read are:
      ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
      ceph:           file.c:350  : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT
      ceph:           file.c:381  : zero tail 4194304
      ceph:           file.c:390  : striped_read returns 6291456
      The hole in the file is from 2M to 4M, but it actually zeroes the last 4M,
      including the last 2M area which isn't a hole.
      Using this patch, the messages are:
      ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
      ceph:           file.c:358  :  zero gap 2097152 to 4194304
      ceph:           file.c:350  : striped_read 4194304~2097152 (read 4194304) got 2097152
      ceph:           file.c:384  : striped_read returns 6291456
      
      TestB:
      >echo majianpeng > test
      >dd if=test of=/dev/null bs=2M count=1 iflag=direct
      The messages are:
      ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
      ceph:           file.c:350  : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT
      ceph:           file.c:390  : striped_read returns 11
      For this case, it performed one more striped_read, which is pointless.
      Using this patch, the messages are:
      ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
      ceph:           file.c:384  : striped_read returns 11
      
      Big thanks to Yan Zheng for the patch.
      Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
    • ceph: remove useless variable revoked_rdcache · e9075743
      Authored by Li Wang
      Cleanup in handle_cap_grant().
      Signed-off-by: Li Wang <liwang@ubuntukylin.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
    • ceph: fix fallocate division · b314a90d
      Authored by Sage Weil
      We need to use do_div() when dividing a 64-bit value.
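
      As a reminder of the do_div() pattern (the surrounding function and values
      are illustrative): it divides the 64-bit dividend in place by a 32-bit
      divisor and returns the remainder, whereas a plain '/' on a u64 needs
      libgcc helpers on 32-bit architectures.

        #include <linux/kernel.h>
        #include <asm/div64.h>

        static u64 example_stripe_index(u64 file_offset, u32 object_size)
        {
                u64 index = file_offset;

                do_div(index, object_size);     /* index becomes the quotient */
                return index;
        }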
      Signed-off-by: Sage Weil <sage@inktank.com>
      Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
  5. 25 Aug 2013, 5 commits
  6. 24 Aug 2013, 2 commits
  7. 22 Aug 2013, 1 commit
    • [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal · 35dc2483
      Authored by Roland Dreier
      There is a nasty bug in the SCSI SG_IO ioctl that in some circumstances
      leads to one process writing data into the address space of some other
      random unrelated process if the ioctl is interrupted by a signal.
      What happens is the following:
      
       - A process issues an SG_IO ioctl with direction DXFER_FROM_DEV (ie the
         underlying SCSI command will transfer data from the SCSI device to
         the buffer provided in the ioctl)
      
       - Before the command finishes, a signal is sent to the process waiting
         in the ioctl.  This will end up waking up the sg_ioctl() code:
      
      		result = wait_event_interruptible(sfp->read_wait,
      			(srp_done(sfp, srp) || sdp->detached));
      
         but neither srp_done() nor sdp->detached is true, so we end up just
         setting srp->orphan and returning to userspace:
      
      		srp->orphan = 1;
      		write_unlock_irq(&sfp->rq_list_lock);
      		return result;	/* -ERESTARTSYS because signal hit process */
      
         At this point the original process is done with the ioctl and
         blithely goes ahead handling the signal, reissuing the ioctl, etc.
      
       - Eventually, the SCSI command issued by the first ioctl finishes and
         ends up in sg_rq_end_io().  At the end of that function, we run through:
      
      	write_lock_irqsave(&sfp->rq_list_lock, iflags);
      	if (unlikely(srp->orphan)) {
      		if (sfp->keep_orphan)
      			srp->sg_io_owned = 0;
      		else
      			done = 0;
      	}
      	srp->done = done;
      	write_unlock_irqrestore(&sfp->rq_list_lock, iflags);
      
      	if (likely(done)) {
      		/* Now wake up any sg_read() that is waiting for this
      		 * packet.
      		 */
      		wake_up_interruptible(&sfp->read_wait);
      		kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN);
      		kref_put(&sfp->f_ref, sg_remove_sfp);
      	} else {
      		INIT_WORK(&srp->ew.work, sg_rq_end_io_usercontext);
      		schedule_work(&srp->ew.work);
      	}
      
         Since srp->orphan *is* set, we set done to 0 (assuming the
         userspace app has not set keep_orphan via an SG_SET_KEEP_ORPHAN
         ioctl), and therefore we end up scheduling sg_rq_end_io_usercontext()
         to run in a workqueue.
      
       - In workqueue context we go through sg_rq_end_io_usercontext() ->
         sg_finish_rem_req() -> blk_rq_unmap_user() -> ... ->
         bio_uncopy_user() -> __bio_copy_iov() -> copy_to_user().
      
         The key point here is that we are doing copy_to_user() on a
         workqueue -- that is, we're on a kernel thread with current->mm
         equal to whatever random previous user process was scheduled before
         this kernel thread.  So we end up copying whatever data the SCSI
         command returned to the virtual address of the buffer passed into
         the original ioctl, but it's quite likely we do this copying into a
         different address space!
      
      As suggested by James Bottomley <James.Bottomley@hansenpartnership.com>,
      add a check for current->mm (which is NULL if we're on a kernel thread
      without a real userspace address space) in bio_uncopy_user(), and skip
      the copy if we're on a kernel thread.
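
      A sketch of the guard described above (simplified; the real bio_uncopy_user()
      also walks the bio_vec list and frees the bounce buffer, and the copy-back
      helper here is hypothetical):

        #include <linux/bio.h>
        #include <linux/sched.h>

        static int example_uncopy_user(struct bio *bio)
        {
                int ret = 0;

                /* On a kernel thread current->mm is NULL: there is no legitimate
                 * userspace mapping to copy into, so skip the copy-back. */
                if (current->mm)
                        ret = example_copy_back_to_user(bio);   /* hypothetical */

                return ret;
        }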
      
      There's no reason that I can think of for any caller of bio_uncopy_user()
      to want to do copying on a kernel thread with a random active userspace
      address space.
      
      Huge thanks to Costa Sapuntzakis <costa@purestorage.com> for the
      original pointer to this bug in the sg code.
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Tested-by: David Milburn <dmilburn@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: James Bottomley <JBottomley@Parallels.com>
  8. 20 Aug 2013, 2 commits
    • proc: more readdir conversion bug-fixes · fd3930f7
      Authored by Linus Torvalds
      In the previous commit, Richard Genoud fixed proc_root_readdir(), which
      had lost the check for whether all of the non-process /proc entries had
      been returned or not.
      
      But that in turn exposed _another_ bug, namely that the original readdir
      conversion patch had yet another problem: it had lost the return value
      of proc_readdir_de(), so now checking whether it had completed
      successfully or not didn't actually work right anyway.
      
      This reinstates the non-zero return for the "end of base entries" that
      had also gotten lost in commit f0c3b509 ("[readdir] convert
      procfs").  So now you get all the base entries *and* you get all the
      process entries, regardless of getdents buffer size.
      
      (Side note: the Linux "getdents" manual page actually has a nice example
      application for testing getdents, which can be easily modified to use
      different buffers.  Who knew? Man-pages can be useful)
      Reported-by: Emmanuel Benisty <benisty.e@gmail.com>
      Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
      Cc: Richard Genoud <richard.genoud@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: return on proc_readdir error · 94fc5d9d
      Authored by Richard Genoud
      Commit f0c3b509 ("[readdir] convert procfs") introduced a bug on the
      listing of the proc file-system.  The return value of proc_readdir()
      isn't tested anymore in the proc_root_readdir function.
      
      This led to an "interesting" behaviour when using the getdents() system
      call with a buffer that is too small: instead of failing, it returns the
      first entries of /proc (enough to fill the given buffer), plus the PID
      directories.
      
      This is not triggered on glibc (as getdents is called with a 32KB
      buffer), but on uclibc, the buffer size is only 1KB, thus some proc
      entries are missing.
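
      A simplified sketch of the check being reinstated (details of the real
      proc_root_readdir() may differ slightly):

        #include <linux/fs.h>

        static int example_proc_root_readdir(struct file *file, struct dir_context *ctx)
        {
                if (ctx->pos < FIRST_PROCESS_ENTRY) {
                        int error = proc_readdir(file, ctx);

                        /* Stop if the non-process pass didn't complete, e.g.
                         * because the getdents buffer filled up. */
                        if (unlikely(error <= 0))
                                return error;
                        ctx->pos = FIRST_PROCESS_ENTRY;
                }
                return proc_pid_readdir(file, ctx);
        }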
      
      See https://lkml.org/lkml/2013/8/12/288 for more background.
      Signed-off-by: Richard Genoud <richard.genoud@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 19 Aug 2013, 5 commits
  10. 17 Aug 2013, 1 commit
    • jbd2: Fix oops in jbd2_journal_file_inode() · a361293f
      Authored by Jan Kara
      Commit 0713ed0c added a
      jbd2_journal_file_inode() call to ext4_block_zero_page_range().
      However, that function also gets called from the truncate path, where the
      inode need not have a jinode attached - that attachment happens in
      ext4_file_open(), but the file need not have been opened since mount.
      Calling jbd2_journal_file_inode() without a jinode attached results in an oops.
      
      We fix the problem by also attaching a jinode to the inode in ext4_truncate()
      and ext4_punch_hole() when we are going to zero out partial blocks.
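
      A rough sketch of the shape of the fix; the example_* helpers are
      assumptions for illustration, only the idea of attaching the jinode before
      zeroing partial blocks comes from the commit:

        #include <linux/fs.h>

        static int example_truncate_tail(struct inode *inode, loff_t newsize)
        {
                int err = 0;

                /* Only partial-block zeroing ends up in jbd2_journal_file_inode(). */
                if (newsize & (inode->i_sb->s_blocksize - 1)) {
                        err = example_attach_jinode(inode);             /* hypothetical */
                        if (err)
                                return err;
                        err = example_zero_partial_block(inode, newsize); /* hypothetical */
                }
                return err;
        }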
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  11. 16 Aug 2013, 6 commits
    • Fix TLB gather virtual address range invalidation corner cases · 2b047252
      Authored by Linus Torvalds
      Ben Tebulin reported:
      
       "Since v3.7.2 on two independent machines a very specific Git
        repository fails in 9/10 cases on git-fsck due to an SHA1/memory
        failures.  This only occurs on a very specific repository and can be
        reproduced stably on two independent laptops.  Git mailing list ran
        out of ideas and for me this looks like some very exotic kernel issue"
      
      and bisected the failure to the backport of commit 53a59fc6 ("mm:
      limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
      
      That commit itself is not actually buggy, but what it does is to make it
      much more likely to hit the partial TLB invalidation case, since it
      introduces a new case in tlb_next_batch() that previously only ever
      happened when running out of memory.
      
      The real bug is that the TLB gather virtual memory range setup is subtly
      buggered.  It was introduced in commit 597e1c35 ("mm/mmu_gather:
      enable tlb flush range in generic mmu_gather"), and the range handling
      was already fixed at least once in commit e6c495a9 ("mm: fix the TLB
      range flushed when __tlb_remove_page() runs out of slots"), but that fix
      was not complete.
      
      The problem with the TLB gather virtual address range is that it isn't
      set up by the initial tlb_gather_mmu() initialization (which didn't get
      the TLB range information), but it is set up ad-hoc later by the
      functions that actually flush the TLB.  And so any such case that forgot
      to update the TLB range entries would potentially miss TLB invalidates.
      
      Rather than try to figure out exactly which particular ad-hoc range
      setup was missing (I personally suspect it's the hugetlb case in
      zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
      did), this patch just gets rid of the problem at the source: make the
      TLB range information available to tlb_gather_mmu(), and initialize it
      when initializing all the other tlb gather fields.
      
      This makes the patch larger, but conceptually much simpler.  And the end
      result is much more understandable; even if you want to play games with
      partial ranges when invalidating the TLB contents in chunks, now the
      range information is always there, and anybody who doesn't want to
      bother with it won't introduce subtle bugs.
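
      The interface change amounts to passing the range at setup time; a minimal
      sketch (the unmap-path caller shown here is illustrative):

        #include <asm/tlb.h>

        static void example_unmap_range(struct mm_struct *mm,
                                        unsigned long start, unsigned long end)
        {
                struct mmu_gather tlb;

                /* The flush range is now given up front instead of being
                 * filled in ad hoc by whichever path flushes later. */
                tlb_gather_mmu(&tlb, mm, start, end);
                /* ... zap_page_range()-style unmapping would go here ... */
                tlb_finish_mmu(&tlb, start, end);
        }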
      
      Ben verified that this fixes his problem.
      Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
      Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jfs: fix readdir cookie incompatibility with NFSv4 · 44512449
      Authored by Dave Kleikamp
      NFSv4 reserves readdir cookie values 0-2 for special entries (. and ..),
      but jfs allows a value of 2 for a non-special entry. This incompatibility
      can result in the nfs client reporting a readdir loop.
      
      This patch doesn't change the value stored internally, but adds one to
      the value exposed to the iterate method.
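
      A schematic of the off-by-one mapping described above (not the actual jfs
      code; the helper shape is an assumption):

        #include <linux/types.h>

        /* The internal index is unchanged on disk; only the cookie handed to
         * ->iterate() (and thus to NFS) is shifted past the reserved 0-2 range. */
        static inline loff_t example_exposed_cookie(u32 internal_index)
        {
                return (loff_t)internal_index + 1;
        }

        static inline u32 example_internal_index(loff_t pos)
        {
                return (u32)(pos - 1);
        }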
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Tested-by: Christian Kujau <lists@nerdbynature.de>
    • ceph: punch hole support · ad7a60de
      Authored by Li Wang
      This patch implements fallocate and punch hole support for the Ceph kernel client.
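
      From userspace the new support is exercised through the standard
      fallocate(2) flags; the path below is just an example:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/mnt/cephfs/testfile", O_RDWR);

                if (fd < 0) {
                        perror("open");
                        return 1;
                }
                /* Punch a 1 MiB hole at offset 4 MiB; KEEP_SIZE is required so
                 * the file length is left untouched. */
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              4 * 1024 * 1024, 1 * 1024 * 1024) != 0)
                        perror("fallocate");
                close(fd);
                return 0;
        }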
      Signed-off-by: Li Wang <liwang@ubuntukylin.com>
      Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
    • ceph: fix request max size · 3871cbb9
      Authored by Yan, Zheng
      ceph_check_caps() requests a new max size only when there is an Fw cap.
      If we call check_max_size() while there is no Fw cap, it updates
      i_wanted_max_size and calls ceph_check_caps(), but ceph_check_caps()
      does nothing.  Later, when the Fw cap is issued, we call check_max_size()
      again, but i_wanted_max_size is equal to 'endoff' at this point, so
      check_max_size() doesn't call ceph_check_caps() and we end up waiting
      for the new max size forever.
      
      The fix is to duplicate ceph_check_caps()'s "request max size" code in
      check_max_size(), and make try_get_cap_refs() wait for the Fw cap
      before retrying the request for a new max size.
      
      This patch also removes the "endoff > (inode->i_size << 1)" check
      in check_max_size(). It's useless because there is no corresponding
      logic in ceph_check_caps().
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • ceph: introduce i_truncate_mutex · b0d7c223
      Authored by Yan, Zheng
      I encountered the deadlock below when running fsstress:
      
      wmtruncate work      truncate                 MDS
      ---------------  ------------------  --------------------------
                         lock i_mutex
                                            <- truncate file
      lock i_mutex (blocked)
                                            <- revoking Fcb (filelock to MIX)
                         send request ->
                                               handle request (xlock filelock)
      
      Initially, there are some dirty pages in the page cache.  When the
      kclient receives the truncate message, it reduces the inode size and
      creates some 'out of i_size' dirty pages.  The wmtruncate work can't
      truncate these dirty pages because it's blocked by the i_mutex.  Later,
      when the kclient receives the cap message that revokes the Fcb caps, it
      can't flush all dirty pages because writepages() only flushes dirty
      pages within the inode size.
      
      When the MDS handles the 'truncate' request from the kclient, it waits
      for the filelock to become stable.  But the filelock is stuck in an
      unstable state because it can't finish revoking the kclient's Fcb caps.
      
      The truncate pagecache locking has already caused lots of trouble for
      us.  I think it's time to simplify it by introducing a new mutex.  We
      use the new mutex to prevent concurrent truncate_inode_pages().  There
      is no need to worry about a race between buffered write and
      truncate_inode_pages(), because our "get caps" mechanism prevents them
      from executing concurrently.
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    • ceph: cleanup the logic in ceph_invalidatepage · b150f5c1
      Authored by Milosz Tanski
      The invalidatepage code bails if it encounters a non-zero page offset.  The
      current logic that does this is non-obvious, with multiple if statements.
      
      This should be logically and functionally equivalent.
      Signed-off-by: Milosz Tanski <milosz@adfin.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
  12. 14 Aug 2013, 6 commits
    • fs/proc/task_mmu.c: fix buffer overflow in add_page_map() · 8c829622
      Authored by yonghua zheng
      Recently we met quite a lot of random kernel panic issues after enabling
      CONFIG_PROC_PAGE_MONITOR.  After debugging, we found this has something
      to do with the following bug in pagemap:
      
      In struct pagemapread:
      
        struct pagemapread {
            int pos, len;
            pagemap_entry_t *buffer;
            bool v2;
        };
      
      pos is the number of PM_ENTRY_BYTES-sized entries in buffer, but len is the
      byte size of buffer, so it is a mistake to compare pos and len in
      add_page_map() to check whether the buffer is full; this can lead to a
      buffer overflow and random kernel panics.
      
      Correct len to be the total number of PM_ENTRY_BYTES-sized entries in the buffer.
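
      A minimal sketch of the invariant after the fix (the simplified struct and
      entry type are stand-ins for the real ones):

        typedef unsigned long long pagemap_entry_t;     /* simplified stand-in */

        struct example_pagemapread {
                int pos, len;                           /* units: entries, not bytes */
                pagemap_entry_t *buffer;
        };

        static int example_add_entry(struct example_pagemapread *pm, pagemap_entry_t ent)
        {
                if (pm->pos >= pm->len)         /* same units on both sides now */
                        return -1;              /* buffer genuinely full, no overflow */
                pm->buffer[pm->pos++] = ent;
                return 0;
        }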
      
      [akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
      Signed-off-by: Yonghua Zheng <younghua.zheng@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id() · d6394b59
      Authored by Jeff Liu
      Fix a NULL pointer dereference while removing an empty directory, which
      was introduced by commit 3704412b ("[readdir] convert ocfs2").
      
        BUG: unable to handle kernel NULL pointer dereference at (null)
        IP: [<(null)>]           (null)
        PGD 6da85067 PUD 6da89067 PMD 0
        Oops: 0010 [#1] SMP
        CPU: 0 PID: 6564 Comm: rmdir Tainted: G           O 3.11.0-rc1 #4
        RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
        Call Trace:
          ocfs2_dir_foreach+0x49/0x50 [ocfs2]
          ocfs2_empty_dir+0x12c/0x3e0 [ocfs2]
          ocfs2_unlink+0x56e/0xc10 [ocfs2]
          vfs_rmdir+0xd5/0x140
          do_rmdir+0x1cb/0x1e0
          SyS_rmdir+0x16/0x20
          system_call_fastpath+0x16/0x1b
        Code:  Bad RIP value.
        RIP  [<          (null)>]           (null)
        RSP <ffff88006daddc10>
        CR2: 0000000000000000
      
      [dan.carpenter@oracle.com: fix pointer math]
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reported-by: David Weber <wb@munzinger.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page · c7dd3392
      Authored by Tiger Yang
      Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with a NULL
      struct file pointer, it finally results in a null pointer dereference
      in ocfs2_duplicate_clusters_by_page.
      
      This patch replaces the file pointer with an inode pointer in
      cow_duplicate_clusters to fix this issue.
      
      [jeff.liu@oracle.com: rebased patch against linux-next tree]
      Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Acked-by: Tao Ma <tm@tao.ma>
      Tested-by: David Weber <wb@munzinger.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: Revert 40bd62eb to avoid regression in extended allocation · 6115ea28
      Authored by Jie Liu
      Revert commit 40bd62eb ("fs/ocfs2/journal.h: add bits_wanted while
      calculating credits in ocfs2_calc_extend_credits").
      
      Unfortunately this change broke fallocate even when there is sufficient
      disk space for the preallocation, which is a serious problem.
      
        # df -h
        /dev/sda8        22G  1.2G   21G   6% /ocfs2
        # fallocate -o 0 -l 200M /ocfs2/testfile
        fallocate: /ocfs2/test: fallocate failed: No space left on device
      
      and a kernel warning:
      
        CPU: 3 PID: 3656 Comm: fallocate Tainted: G        W  O 3.11.0-rc3 #2
        Call Trace:
          dump_stack+0x77/0x9e
          warn_slowpath_common+0xc4/0x110
          warn_slowpath_null+0x2a/0x40
          start_this_handle+0x6c/0x640 [jbd2]
          jbd2__journal_start+0x138/0x300 [jbd2]
          jbd2_journal_start+0x23/0x30 [jbd2]
          ocfs2_start_trans+0x166/0x300 [ocfs2]
          __ocfs2_extend_allocation+0x38f/0xdb0 [ocfs2]
          ocfs2_allocate_unwritten_extents+0x3c9/0x520
          __ocfs2_change_file_space+0x5e0/0xa60 [ocfs2]
          ocfs2_fallocate+0xb1/0xe0 [ocfs2]
          do_fallocate+0x1cb/0x220
          SyS_fallocate+0x6f/0xb0
          system_call_fastpath+0x16/0x1b
        JBD2: fallocate wants too many credits (51216 > 4381)
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: fix lockdep splat caused by pmd sharing · b610ded7
      Authored by Michal Hocko
      Dave has reported the following lockdep splat:
      
        =================================
        [ INFO: inconsistent lock state ]
        3.11.0-rc1+ #9 Not tainted
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
        {RECLAIM_FS-ON-W} state was registered at:
           mark_held_locks+0x81/0xe7
           lockdep_trace_alloc+0x5e/0xbc
           __alloc_pages_nodemask+0x8b/0x9b6
           __get_free_pages+0x20/0x31
           get_zeroed_page+0x12/0x14
           __pmd_alloc+0x1c/0x6b
           huge_pmd_share+0x265/0x283
           huge_pte_alloc+0x5d/0x71
           hugetlb_fault+0x7c/0x64a
           handle_mm_fault+0x255/0x299
           __do_page_fault+0x142/0x55c
           do_page_fault+0xd/0x16
           error_code+0x6c/0x74
        irq event stamp: 3136917
        hardirqs last  enabled at (3136917):  _raw_spin_unlock_irq+0x27/0x50
        hardirqs last disabled at (3136916):  _raw_spin_lock_irq+0x15/0x78
        softirqs last  enabled at (3136180):  __do_softirq+0x137/0x30f
        softirqs last disabled at (3136175):  irq_exit+0xa8/0xaa
        other info that might help us debug this:
         Possible unsafe locking scenario:
               CPU0
               ----
          lock(&mapping->i_mmap_mutex);
          <Interrupt>
            lock(&mapping->i_mmap_mutex);
      
        *** DEADLOCK ***
        no locks held by kswapd0/49.
      
        stack backtrace:
        CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
        Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
        Call Trace:
          dump_stack+0x4b/0x79
          print_usage_bug+0x1d9/0x1e3
          mark_lock+0x1e0/0x261
          __lock_acquire+0x623/0x17f2
          lock_acquire+0x7d/0x195
          mutex_lock_nested+0x6c/0x3a7
          page_referenced+0x87/0x5e3
          shrink_page_list+0x3d9/0x947
          shrink_inactive_list+0x155/0x4cb
          shrink_lruvec+0x300/0x5ce
          shrink_zone+0x53/0x14e
          kswapd+0x517/0xa75
          kthread+0xa8/0xaa
          ret_from_kernel_thread+0x1b/0x28
      
      which is a false positive caused by the hugetlb pmd sharing code, which
      allocates a new pmd from within mapping->i_mmap_mutex.  If this
      allocation causes reclaim then the lockdep detector complains that we
      might self-deadlock.
      
      This is not correct, though, because hugetlb pages are not reclaimable, so
      their mapping will never be touched from the reclaim path.
      
      The patch tells lockdep that the hugetlb i_mmap_mutex is special by
      assigning it a separate lockdep class, so it won't report possible
      deadlocks on unrelated mappings.
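
      A minimal sketch of the annotation technique; the key name and the init
      site shown here are assumptions, the commit applies it to the hugetlbfs
      inode's i_mmap_mutex:

        #include <linux/fs.h>
        #include <linux/lockdep.h>

        /* A dedicated class key so lockdep tracks these mappings separately
         * from the generic address_space->i_mmap_mutex class. */
        static struct lock_class_key example_hugetlb_i_mmap_mutex_key;

        static void example_init_mapping_lockdep(struct address_space *mapping)
        {
                lockdep_set_class(&mapping->i_mmap_mutex,
                                  &example_hugetlb_i_mmap_mutex_key);
        }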
      
      [peterz@infradead.org: comment for annotation]
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: save soft-dirty bits on file pages · 41bb3476
      Authored by Cyrill Gorcunov
      Andy reported that if a file page gets reclaimed we lose the soft-dirty bit
      if it was there, so save the _PAGE_BIT_SOFT_DIRTY bit when the page address
      gets encoded into the pte entry.  Thus when a #pf happens on such a
      non-present pte we can restore it.
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>