1. 23 Oct, 2015 1 commit
  2. 22 Oct, 2015 1 commit
  3. 21 Oct, 2015 1 commit
    •
      btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size · 0f6925fa
      Qu Wenruo committed
      The current code always truncates the trailing page if alloc_start is
      smaller than the inode size.
      
      For example, the file extent layout is like:
      0	4K	8K	16K	32K
      |<-----Extent A---------------->|
      |<--Inode size: 18K---------->|
      
      But calling fallocate even for the range [0,4K) causes btrfs to
      re-truncate the range [16K,32K), triggering COW and creating a new extent.
      
      0	4K	8K	16K	32K
      |///////|	<- Fallocate call range
      |<-----Extent A-------->|<--B-->|
      
      The cause is simple: a careless btrfs_truncate_inode() call in an
      else branch with no further check.
      Fix it by checking whether the fallocate range extends beyond isize.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      0f6925fa
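As a rough illustration of the fallocate fix described above, the following Python sketch models the decision of whether the trailing page gets truncated. The function names and the exact comparisons are assumptions made for illustration; the commit message only describes the kernel change in prose, so this is not the kernel code.

```python
KIB = 1024


def truncates_tail_before_fix(alloc_start, alloc_len, isize):
    """Model of the old behaviour: any range starting below isize truncates."""
    # alloc_len is deliberately ignored, which is the bug being modelled.
    return alloc_start < isize


def truncates_tail_after_fix(alloc_start, alloc_len, isize):
    """Model of the fix: also require the range to extend beyond isize."""
    return alloc_start < isize and alloc_start + alloc_len > isize


if __name__ == "__main__":
    isize = 18 * KIB
    # fallocate on [0, 4K) of a file whose size is 18K, as in the example
    print(truncates_tail_before_fix(0, 4 * KIB, isize))  # True
    print(truncates_tail_after_fix(0, 4 * KIB, isize))   # False
```

With the extra check, a range that stays inside the file no longer touches the page containing isize, so extent A in the example would stay in one piece.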
  4. 17 Oct, 2015 2 commits
    •
      mm, dax: fix DAX deadlocks · 0f90cc66
      Ross Zwisler committed
      The following two locking commits in the DAX code:
      
      commit 84317297 ("dax: fix race between simultaneous faults")
      commit 46c043ed ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
      
      introduced a number of deadlocks and other issues which need to be fixed
      for the v4.3 kernel.  The list of issues in DAX after these commits
      (some newly introduced by the commits, some preexisting) can be found
      here:
      
        https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").
      
      This undoes most of the changes introduced by those two commits,
      essentially returning us to the DAX locking scheme that was used in
      v4.2.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f90cc66
    •
      mm, fs: obey gfp_mapping for add_to_page_cache() · 063d99b4
      Michal Hocko committed
      Commit 6afdb859 ("mm: do not ignore mapping_gfp_mask in page cache
      allocation paths") has caught some users of hardcoded GFP_KERNEL used in
      the page cache allocation paths.  This, however, wasn't complete and
      there were others which went unnoticed.
      
      Dave Chinner has reported the following deadlock for xfs on loop device:
      : With the recent merge of the loop device changes, I'm now seeing
      : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
      :
      : The deadlock is as follows:
      :
      : kloopd1: loop_queue_read_work
      :       xfs_file_iter_read
      :       lock XFS inode XFS_IOLOCK_SHARED (on image file)
      :       page cache read (GFP_KERNEL)
      :       radix tree alloc
      :       memory reclaim
      :       reclaim XFS inodes
      :       log force to unpin inodes
      :       <wait for log IO completion>
      :
      : xfs-cil/loop1: <does log force IO work>
      :       xlog_cil_push
      :       xlog_write
      :       <loop issuing log writes>
      :               xlog_state_get_iclog_space()
      :               <blocks due to all log buffers under write io>
      :               <waits for IO completion>
      :
      : kloopd1: loop_queue_write_work
      :       xfs_file_write_iter
      :       lock XFS inode XFS_IOLOCK_EXCL (on image file)
      :       <wait for inode to be unlocked>
      :
      : i.e. the kloopd, with its split read and write work queues, has
      : introduced a dependency through memory reclaim. i.e. that writes
      : need to be able to progress for reads to make progress.
      :
      : The problem, fundamentally, is that mpage_readpages() does a
      : GFP_KERNEL allocation, rather than paying attention to the inode's
      : mapping gfp mask, which is set to GFP_NOFS.
      :
      : This didn't used to happen, because the loop device used to issue
      : reads through the splice path and that does:
      :
      :       error = add_to_page_cache_lru(page, mapping, index,
      :                       GFP_KERNEL & mapping_gfp_mask(mapping));
      
      This was changed by commit aa4d8616 ("block: loop: switch to VFS
      ITER_BVEC").

      This patch changes mpage_readpage{s} to follow the gfp mask set for the
      mapping.  There are, however, other places which do basically the
      same.
      
      lustre:ll_dir_filler uses GFP_KERNEL in a function which apparently
      uses GFP_NOFS for its other allocations, so let's make this consistent.

      cifs:readpages_get_pages is called from cifs_readpages, and
      __cifs_readpages_from_fscache, called from the same path, obeys the
      mapping gfp.

      ramfs_nommu_expand_for_mapping hardcodes GFP_KERNEL as well, even
      though it uses mapping_gfp_mask for the page allocation.

      ext4_mpage_readpages is called from the page cache allocation path,
      the same as read_pages and read_cache_pages.

      As I noted in my previous post, I am not happy about sprinkling
      mapping_gfp_mask all over the place. It sounds like we should drop the
      gfp_mask argument altogether and use the mapping's mask internally in
      __add_to_page_cache_locked. That would require all the filesystems to
      use the mapping gfp consistently, which I am not sure is the case here.
      From a quick glance, it seems that some filesystems use it all the time
      while others are selective.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      063d99b4
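The expression `GFP_KERNEL & mapping_gfp_mask(mapping)` quoted in the commit above is a plain bitwise intersection. The sketch below models it in Python under stated assumptions: the bit values are illustrative only (not the kernel's), and the flag composition is simplified to three bits, with the filesystem-reentry bit being the one that GFP_NOFS lacks. `effective_gfp` is a hypothetical helper name.

```python
from enum import IntFlag


class Gfp(IntFlag):
    # Bit values are illustrative only, not the kernel's actual values.
    WAIT = 0x1
    IO = 0x2
    FS = 0x4


# Simplified compositions: GFP_NOFS is GFP_KERNEL without the FS bit.
GFP_KERNEL = Gfp.WAIT | Gfp.IO | Gfp.FS
GFP_NOFS = Gfp.WAIT | Gfp.IO


def effective_gfp(requested, mapping_gfp_mask):
    """Restrict the requested flags to those the mapping allows."""
    return requested & mapping_gfp_mask


if __name__ == "__main__":
    mask = effective_gfp(GFP_KERNEL, GFP_NOFS)
    print(mask == GFP_NOFS)     # True
    print(bool(mask & Gfp.FS))  # False
```

Because the FS bit is masked off, an allocation made this way cannot recurse into filesystem reclaim, which is the dependency the deadlock report above describes.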
  5. 14 Oct, 2015 2 commits
    •
      btrfs: fix use after free iterating extrefs · dc6c5fb3
      Chris Mason committed
      The code for btrfs inode-resolve has never worked properly for
      files with enough hard links to trigger extrefs.  It was trying to
      get the leaf out of a path after freeing the path:
      
      	btrfs_release_path(path);
      	leaf = path->nodes[0];
      	item_size = btrfs_item_size_nr(leaf, slot);
      
      The fix here is to use the extent buffer we cloned just a little higher
      up to avoid deadlocks caused by using the leaf in the path.
      Signed-off-by: Chris Mason <clm@fb.com>
      cc: stable@vger.kernel.org # v3.7+
      cc: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      dc6c5fb3
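The ordering problem in the extrefs commit above (release the path, then read the leaf out of it) can be mimicked with a toy model. This is only a sketch: `Path`, `item_size_buggy` and `item_size_fixed` are hypothetical names, the leaf is modelled as a list of byte strings, and releasing is modelled by dropping the reference, so the failure shows up as a Python exception rather than a memory error.

```python
class Path:
    """Toy stand-in for a btree path holding one leaf."""

    def __init__(self, leaf):
        self.nodes = [leaf]

    def release(self):
        # After release the path no longer refers to the leaf.
        self.nodes = [None]


def item_size_buggy(path, slot):
    path.release()
    leaf = path.nodes[0]    # too late: the leaf reference is gone
    return len(leaf[slot])  # raises TypeError in this model


def item_size_fixed(path, slot):
    clone = list(path.nodes[0])  # private copy taken before the release
    path.release()
    return len(clone[slot])      # read from the copy, not from the path


if __name__ == "__main__":
    print(item_size_fixed(Path([b"abc", b"de"]), 1))  # 2
```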
    •
      btrfs: check unsupported filters in balance arguments · 8eb93459
      David Sterba committed
      We don't verify that all the balance filter arguments supplemented by
      the flags are actually known to the kernel. Thus we let it silently pass
      and do nothing.
      
      At the moment this means only the 'limit' filter, but we're going to add
      a few more soon so it's better to have that fixed. The fix should also go
      to older stable kernels so that they work with newer userspace tools.
      
      Cc: stable@vger.kernel.org # 3.16+
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      8eb93459
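Rejecting unknown filter bits, as the balance commit above describes, usually comes down to masking the argument with the complement of the known set. A minimal sketch, assuming made-up flag names and bit positions and an assumed error code of -EINVAL (the commit message does not say which error is returned):

```python
import errno

# Hypothetical filter bits, chosen for illustration only.
FILTER_PROFILES = 1 << 0
FILTER_USAGE = 1 << 1
FILTER_LIMIT = 1 << 5
KNOWN_FILTERS = FILTER_PROFILES | FILTER_USAGE | FILTER_LIMIT


def check_balance_flags(flags):
    """Return 0 if every set bit is a known filter, else a negative errno."""
    if flags & ~KNOWN_FILTERS:
        # At least one bit is set that this "kernel" does not understand.
        return -errno.EINVAL
    return 0


if __name__ == "__main__":
    print(check_balance_flags(FILTER_USAGE | FILTER_LIMIT))  # 0
    print(check_balance_flags(FILTER_USAGE | (1 << 9)) < 0)  # True
```

Failing loudly lets newer userspace tools detect that a filter is unsupported instead of having it silently ignored.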
  6. 13 Oct, 2015 2 commits
    •
      writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Tejun Heo committed
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
      This patch adds an RCU-protected @bdi->wb_list which lists all wb's
      belonging to that bdi.  wb's are added on creation and removed on
      release rather than on the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b817525a
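The difference between walking the tree of active wb's and walking a list of all live wb's, as described in the commit above, can be shown with a small model. This is a sketch under assumptions: the class and method names are invented, a dict stands in for bdi->cgwb_tree, a Python list stands in for bdi->wb_list, and RCU protection is not modelled at all.

```python
class Writeback:
    def __init__(self, name, dirty_inodes):
        self.name = name
        self.dirty_inodes = list(dirty_inodes)


class Bdi:
    def __init__(self):
        self.active_tree = {}  # stands in for bdi->cgwb_tree (active only)
        self.wb_list = []      # stands in for bdi->wb_list (all live wb's)

    def create_wb(self, key, wb):
        self.active_tree[key] = wb
        self.wb_list.append(wb)

    def start_dying(self, key):
        # Leaves the tree when destruction starts, but stays on the list.
        return self.active_tree.pop(key)

    def release(self, wb):
        # Leaves the list only on final release.
        self.wb_list.remove(wb)


def inodes_synced_via_tree(bdi):
    return [i for wb in bdi.active_tree.values() for i in wb.dirty_inodes]


def inodes_synced_via_list(bdi):
    return [i for wb in bdi.wb_list for i in wb.dirty_inodes]


if __name__ == "__main__":
    bdi = Bdi()
    bdi.create_wb("memcg1", Writeback("old", ["inode-a"]))
    bdi.start_dying("memcg1")  # memcg moved to a different blkcg
    bdi.create_wb("memcg1", Writeback("new", ["inode-b"]))
    print(inodes_synced_via_tree(bdi))  # ['inode-b']
    print(inodes_synced_via_list(bdi))  # ['inode-a', 'inode-b']
```

The tree walk misses `inode-a`, which still sits on the dying wb; the list walk reaches it.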
    •
      writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() · 6fdf860f
      Tejun Heo committed
      wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
      unfortunately, it was always waking up bdi->wb instead of the wb being
      walked.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 001fe6f6 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6fdf860f
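The bug fixed in the commit above is the classic "loop variable not used in the loop body" mistake. A tiny Python model, with invented data structures (each bdi is a dict holding its embedded root wb and the list of all its wb's), shows the effect:

```python
def wakeup_buggy(bdis):
    """Walks every wb but always wakes the bdi's embedded root wb."""
    woken = []
    for bdi in bdis:
        for _wb in bdi["wbs"]:
            woken.append(bdi["root_wb"])  # bug: ignores the wb being walked
    return woken


def wakeup_fixed(bdis):
    """Wakes the wb that is actually being walked."""
    woken = []
    for bdi in bdis:
        for wb in bdi["wbs"]:
            woken.append(wb)
    return woken


if __name__ == "__main__":
    bdis = [{"root_wb": "root", "wbs": ["root", "cg1", "cg2"]}]
    print(wakeup_buggy(bdis))  # ['root', 'root', 'root']
    print(wakeup_fixed(bdis))  # ['root', 'cg1', 'cg2']
```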
  7. 11 Oct, 2015 1 commit
    •
      namei: results of d_is_negative() should be checked after dentry revalidation · daf3761c
      Trond Myklebust committed
      Leandro Awa writes:
       "After switching to version 4.1.6, our parallelized and distributed
        workflows now fail consistently with errors of the form:
      
        T34: ./regex.c:39:22: error: config.h: No such file or directory
      
        From our 'git bisect' testing, the following commit appears to be the
        possible cause of the behavior we've been seeing: commit 766c4cbf"
      
      Al Viro says:
       "What happens is that 766c4cbf got the things subtly wrong.
      
        We used to treat d_is_negative() after lookup_fast() as "fall with
        ENOENT".  That was wrong - checking ->d_flags outside of ->d_seq
        protection is unreliable and failing with hard error on what should've
        fallen back to non-RCU pathname resolution is a bug.
      
        Unfortunately, we'd pulled the test too far up and ran afoul of
        another kind of staleness.  The dentry might have been absolutely
        stable from the RCU point of view (and we might be on UP, etc), but
        stale from the remote fs point of view.  If ->d_revalidate() returns
        "it's actually stale", dentry gets thrown away and the original code
        wouldn't even have looked at its ->d_flags.
      
        What we need is to check ->d_flags where 766c4cbf does (prior to
        ->d_seq validation) but only use the result in cases where we do not
        discard this dentry outright"
      Reported-by: Leandro Awa <lawa@nvidia.com>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
      Fixes: 766c4cbf ("namei: d_is_negative() should be checked...")
      Tested-by: Leandro Awa <lawa@nvidia.com>
      Cc: stable@vger.kernel.org # v4.1+
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      daf3761c
  8. 10 Oct, 2015 1 commit
  9. 07 Oct, 2015 1 commit
  10. 06 Oct, 2015 5 commits
    •
      BTRFS: support NFSv2 export · 7d35199e
      NeilBrown committed
      The "fh_len" passed to ->fh_to_* is not guaranteed to be the same as
      the one returned by encode_fh - it may be larger.
      
      With NFSv2, the filehandle is fixed length, so it may appear longer
      than expected and be zero-padded.
      
      So we must test that fh_len is at least some value, not exactly equal
      to it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: David Sterba <dsterba@suse.cz>
      7d35199e
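The NFSv2 change above replaces an exact-length test with a minimum-length test. The sketch below illustrates why; the expected length of 5 and the padded length of 8 are assumed example values, and both function names are hypothetical.

```python
EXPECTED_FH_LEN = 5  # hypothetical length produced by the encode step
PADDED_FH_LEN = 8    # hypothetical fixed-size, zero-padded handle length


def fh_len_ok_exact(fh_len):
    """Old-style check: rejects handles that were padded by the protocol."""
    return fh_len == EXPECTED_FH_LEN


def fh_len_ok_at_least(fh_len):
    """New-style check: accepts any handle that is long enough."""
    return fh_len >= EXPECTED_FH_LEN


if __name__ == "__main__":
    print(fh_len_ok_exact(PADDED_FH_LEN))     # False
    print(fh_len_ok_at_least(PADDED_FH_LEN))  # True
```

A handle shorter than the expected length is still rejected by both checks.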
    •
      Btrfs: open_ctree: Fix possible memory leak · e5fffbac
      chandan committed
      After reading the root node of either the chunk tree or the tree root
      tree from disk, if the root node does not have the EXTENT_BUFFER_UPTODATE
      flag set, we fail to release the memory used by the root node. Fix this.
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      e5fffbac
    •
      Btrfs: fix deadlock when finalizing block group creation · d9a0540a
      Filipe Manana committed
      Josef ran into a deadlock while a transaction handle was finalizing the
      creation of its block groups, which produced the following trace:
      
        [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
        [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
        [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
        [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
        [260445.593137] Call Trace:
        [260445.593145]  [<ffffffff8175a437>] schedule+0x37/0x80
        [260445.593189]  [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [260445.593197]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [260445.593225]  [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [260445.593253]  [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [260445.593295]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593324]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593351]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593394]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593427]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593459]  [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
        [260445.593491]  [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
        [260445.593524]  [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
        [260445.593532]  [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
        [260445.593564]  [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [260445.593597]  [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
        [260445.593626]  [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [260445.593654]  [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [260445.593682]  [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [260445.593724]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593752]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593830]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593905]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593946]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593990]  [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
        [260445.594042]  [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [260445.594089]  [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
        [260445.594115]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [260445.594133]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [260445.594149]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [260445.594169]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [260445.594187]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [260445.594204]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      This happened because the same transaction handle created a large number
      of block groups and while finalizing their creation (inserting new items
      and updating existing items in the chunk and device trees) a new metadata
      extent had to be allocated and no free space was found in the current
      metadata block groups, which made find_free_extent() attempt to allocate
      a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
      ended up allocating a new system chunk too and exceeded the threshold
      of 2MB of reserved chunk bytes, which makes do_chunk_alloc() enter the
      final part of block group creation again (at
      btrfs_create_pending_block_groups()) and attempt to lock again the root
      of the chunk tree when it's already write locked by the same task.
      
      Similarly we can deadlock on extent tree nodes/leafs if while we are
      running delayed references we end up creating a new metadata block group
      in order to allocate a new node/leaf for the extent tree (as part of
      a CoW operation or growing the tree), as btrfs_create_pending_block_groups
      inserts items into the extent tree as well. In this case we get the
      following trace:
      
        [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
        [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
        [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
        [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
        [14242.773606] Call Trace:
        [14242.773613]  [<ffffffff8175a437>] schedule+0x37/0x80
        [14242.773656]  [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [14242.773664]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [14242.773692]  [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [14242.773720]  [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [14242.773750]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.773758]  [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
        [14242.773786]  [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
        [14242.773818]  [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
        [14242.773850]  [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
        [14242.773934]  [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
        [14242.773998]  [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
        [14242.774041]  [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [14242.774078]  [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [14242.774118]  [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [14242.774155]  [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [14242.774194]  [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
        [14242.774235]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.774274]  [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [14242.774318]  [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
        [14242.774358]  [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
        [14242.774391]  [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
        [14242.774432]  [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
        [14242.774474]  [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
        [14242.774516]  [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
        [14242.774558]  [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
        [14242.774599]  [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [14242.774642]  [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
        [14242.774650]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [14242.774657]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [14242.774663]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [14242.774669]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [14242.774675]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [14242.774681]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      Fix this by never recursing into the finalization phase of block group
      creation and making sure we never trigger the finalization of block group
      creation while running delayed references.
      Reported-by: Josef Bacik <jbacik@fb.com>
      Fixes: 00d80e34 ("Btrfs: fix quick exhaustion of the system array in the superblock")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      d9a0540a
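The first deadlock described above is a re-entrancy problem: finalizing pending block groups triggers a chunk allocation, which re-enters finalization while the chunk tree root is still locked by the same task. The following toy model sketches the "never recurse into the finalization phase" idea. Everything here is hypothetical (class, method and attribute names, and the way a deadlock is reported as an exception); it does not reproduce the kernel's data structures.

```python
class Transaction:
    def __init__(self, guard):
        self.guard = guard          # True models the fixed behaviour
        self.pending = ["bg0"]      # block groups awaiting finalization
        self.finalized = []
        self.root_locked = False    # models a non-recursive tree root lock
        self.finalizing = False

    def lock_root(self):
        if self.root_locked:
            # A real task would block forever here.
            raise RuntimeError("would deadlock: root already locked")
        self.root_locked = True

    def unlock_root(self):
        self.root_locked = False

    def alloc_chunk(self, name):
        self.pending.append(name)
        # With the guard, a chunk allocated during finalization is only
        # queued; the outer finalization loop will pick it up.
        if not (self.guard and self.finalizing):
            self.finalize_pending()

    def finalize_pending(self):
        self.finalizing = True
        while self.pending:
            bg = self.pending.pop(0)
            self.lock_root()
            if bg == "bg0":
                self.alloc_chunk("bg1")  # inserting items needs more space
            self.finalized.append(bg)
            self.unlock_root()
        self.finalizing = False


if __name__ == "__main__":
    fixed = Transaction(guard=True)
    fixed.finalize_pending()
    print(fixed.finalized)  # ['bg0', 'bg1']
```

Constructing the transaction with `guard=False` makes the nested `finalize_pending()` call try to take the root lock a second time, which this model reports as a `RuntimeError`.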
    •
      Btrfs: update fix for read corruption of compressed and shared extents · 808f80b4
      Filipe Manana committed
      My previous fix in commit 005efedf ("Btrfs: fix read corruption of
      compressed and shared extents") was effective only if the compressed
      extents cover a file range with a length that is not a multiple of 16
      pages. That's because the detection of when we reached a different range
      of the file that shares the same compressed extent as the previously
      processed range was done at extent_io.c:__do_contiguous_readpages(),
      which covers subranges with a length up to 16 pages, because
      extent_readpages() groups the pages in clusters no larger than 16 pages.
      So fix this by tracking the start of the previously processed file
      range's extent map at extent_readpages().
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        test_clone_and_read_compressed_extent()
        {
            local mount_opts=$1
      
            _scratch_mkfs >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # Create our test file with a single extent of 64Kb that is going to
            # be compressed no matter which compression algo is used (zlib/lzo).
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
                $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Now clone the compressed extent into an adjacent file offset.
            $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
                $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
            echo "File digest before unmount:"
            md5sum $SCRATCH_MNT/foo | _filter_scratch
      
            # Remount the fs or clear the page cache to trigger the bug in
            # btrfs. Because the extent has an uncompressed length that is a
            # multiple of 16 pages, all the pages belonging to the second range
            # of the file (64K to 128K), which points to the same extent as the
            # first range (0K to 64K), had their contents full of zeroes instead
            # of the byte 0xaa. This was a bug exclusively in the read path of
            # compressed extents, the correct data was stored on disk, btrfs
            # just failed to fill in the pages correctly.
            _scratch_remount
      
            echo "File digest after remount:"
            # Must match the digest we got before.
            md5sum $SCRATCH_MNT/foo | _filter_scratch
        }
      
        echo -e "\nTesting with zlib compression..."
        test_clone_and_read_compressed_extent "-o compress=zlib"
      
        _scratch_unmount
      
        echo -e "\nTesting with lzo compression..."
        test_clone_and_read_compressed_extent "-o compress=lzo"
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
      808f80b4
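The "multiple of 16 pages" condition in the commit above can be checked with a little arithmetic. The sketch below is a simplified model, not the btrfs read path: it assumes 4 KiB pages, clusters that start at page 0 and hold at most 16 pages, and two adjacent file ranges of equal length that refer to the same extent. It asks whether any single cluster contains pages from both ranges, which is the only situation in which a per-cluster check could notice the switch from one range to the next.

```python
PAGE_SIZE = 4096    # assumed page size
CLUSTER_PAGES = 16  # cluster limit mentioned in the commit message


def clusters(total_pages, cluster_pages=CLUSTER_PAGES):
    """Split page indexes [0, total_pages) into consecutive clusters."""
    return [range(start, min(start + cluster_pages, total_pages))
            for start in range(0, total_pages, cluster_pages)]


def boundary_seen_within_cluster(range_pages, copies=2):
    """True if some cluster holds pages from two adjacent file ranges."""
    total = range_pages * copies
    for cluster in clusters(total):
        owners = {page // range_pages for page in cluster}
        if len(owners) > 1:
            return True
    return False


if __name__ == "__main__":
    # 64K ranges are exactly 16 pages: every cluster sees only one range.
    print(boundary_seen_within_cluster(64 * 1024 // PAGE_SIZE))  # False
    # 60K ranges are 15 pages: the first cluster straddles both ranges.
    print(boundary_seen_within_cluster(60 * 1024 // PAGE_SIZE))  # True
```

This matches the test case above, which clones a 64K extent to offset 64K so that the range boundary lines up with a cluster boundary.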
    •
      Btrfs: send, fix corner case for reference overwrite detection · b786f16a
      Filipe Manana committed
      When the inode given to did_overwrite_ref() matches the current progress
      and has a reference that collides with the reference of another inode
      that has the same number as the current progress, we were always telling
      our caller that the inode's reference was overwritten. That is incorrect
      because the other inode might be a new inode (different generation
      number), in which case we must return false from did_overwrite_ref() so
      that its callers don't use an orphanized path for the inode (it will
      never be orphanized; instead it will be unlinked and the new inode
      created later).
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -fr $send_files_dir
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _need_to_be_root
      
        send_files_dir=$TEST_DIR/btrfs-test-$seq
      
        rm -f $seqres.full
        rm -fr $send_files_dir
        mkdir $send_files_dir
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount
      
        # Create our test file with a single extent of 64K.
        mkdir -p $SCRATCH_MNT/foo
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" $SCRATCH_MNT/foo/bar \
            | _filter_xfs_io
      
        _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
            $SCRATCH_MNT/mysnap1
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
            $SCRATCH_MNT/mysnap2
      
        echo "File digest before being replaced:"
        md5sum $SCRATCH_MNT/mysnap1/foo/bar | _filter_scratch
      
        # Remove the file and then create a new one in the same location with
        # the same name but with different content. This new file ends up
        # getting the same inode number as the previous one, because that inode
        # number was the highest inode number used by the snapshot's root and
        # therefore when attempting to find a new inode number for the new
        # file, we end up reusing the same inode number. This happens because
        # currently btrfs uses the highest inode number plus 1 for the
        # first inode created once a snapshot's root is loaded (done at
        # fs/btrfs/inode-map.c:btrfs_find_free_objectid in the linux kernel
        # tree).
        # Having these two different files in the snapshots with the same inode
        # number (but different generation numbers) caused the btrfs send code
        # to emit an incorrect path for the file when issuing an unlink
        # operation because it failed to realize they were different files.
        rm -f $SCRATCH_MNT/mysnap2/foo/bar
        $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 96K" \
            $SCRATCH_MNT/mysnap2/foo/bar | _filter_xfs_io
      
        _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT/mysnap2 \
            $SCRATCH_MNT/mysnap2_ro
      
        _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
        _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 \
            $SCRATCH_MNT/mysnap2_ro -f $send_files_dir/2.snap
      
        echo "File digest in the original filesystem after being replaced:"
        md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
      
        # Now recreate the filesystem by receiving both send streams and verify
        # we get the same file contents that the original filesystem had.
        _scratch_unmount
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount
      
        _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/1.snap
        _run_btrfs_util_prog receive -vv $SCRATCH_MNT -f $send_files_dir/2.snap
      
        echo "File digest in the new filesystem:"
        # Must match the digest from the new file.
        md5sum $SCRATCH_MNT/mysnap2_ro/foo/bar | _filter_scratch
      
        status=0
        exit
      Reported-by: Martin Raiber <martin@urbackup.org>
      Fixes: 8b191a68 ("Btrfs: incremental send, check if orphanized dir inode needs delayed rename")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      b786f16a
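The corner case in the send commit above hinges on the fact that an inode number alone does not identify a file across snapshots; the generation number has to match as well. A minimal sketch of that identity test, using an invented `Inode` tuple, a hypothetical helper name and made-up numbers:

```python
from collections import namedtuple

# Hypothetical minimal view of an inode: number plus generation.
Inode = namedtuple("Inode", ["ino", "gen"])


def is_same_inode(a, b):
    """Two inodes are the same file only if number and generation match."""
    return a.ino == b.ino and a.gen == b.gen


if __name__ == "__main__":
    old_bar = Inode(ino=258, gen=6)  # foo/bar in the first snapshot
    new_bar = Inode(ino=258, gen=9)  # recreated foo/bar, number reused
    print(old_bar.ino == new_bar.ino)       # True
    print(is_same_inode(old_bar, new_bar))  # False
```

Comparing only the inode numbers would treat the deleted file and its replacement as one inode, which is the confusion the test case above provokes.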
  11. 04 Oct, 2015 1 commit
  12. 03 Oct, 2015 6 commits
  13. 02 Oct, 2015 2 commits
    •
      [SMB3] Do not fall back to SMBWriteX in set_file_size error cases · 646200a0
      Steve French committed
      The error paths in set_file_size for cifs and smb3 are incorrect.
      
      In the unlikely event that a server did not support set file info
      of the file size, the code incorrectly falls back to trying SMBWriteX
      (note that only the original core SMB Write, used for example by DOS,
      can set the file size this way - this actually does not work for the more
      recent SMBWriteX).  The idea was that, since the old DOS SMB Write could
      set the file size if you write zero bytes at that offset, it could be
      used if the server rejects the normal set file info call.

      Fortunately the SMBWriteX will never be sent on the wire (except when
      file size is zero) since the length and offset fields were reversed
      in the two places in this function that call SMBWriteX, causing
      the fallback path to return an error. It is also important to never call
      an SMB request from an SMB2/SMB3 session (which theoretically would
      be possible, and can cause a brief session drop, although the client
      recovers), so this should be fixed.  In practice this path does not happen
      with modern servers, but the error fallback to SMBWriteX is clearly wrong.

      Remove the calls to SMBWriteX in the error paths in cifs_set_file_size.
      Pointed out by PaX/grsecurity team
      Signed-off-by: Steve French <steve.french@primarydata.com>
      Reported-by: PaX Team <pageexec@freemail.hu>
      CC: Emese Revfy <re.emese@gmail.com>
      CC: Brad Spengler <spender@grsecurity.net>
      CC: Stable <stable@vger.kernel.org>
      646200a0
    •
      dax: fix NULL pointer in __dax_pmd_fault() · 8346c416
      Ross Zwisler committed
      Commit 46c043ed ("mm: take i_mmap_lock in unmap_mapping_range() for
      DAX") moved some code in __dax_pmd_fault() that was responsible for
      zeroing newly allocated PMD pages.  The new location didn't properly set
      up 'kaddr', so when run this code resulted in a NULL pointer BUG.
      
      Fix this by getting the correct 'kaddr' via bdev_direct_access().
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8346c416
  14. 29 Sep, 2015 1 commit
    • R
      UBIFS: Kill unneeded locking in ubifs_init_security · cf6f54e3
      Committed by Richard Weinberger
      Fixes the following lockdep splat:
      [    1.244527] =============================================
      [    1.245193] [ INFO: possible recursive locking detected ]
      [    1.245193] 4.2.0-rc1+ #37 Not tainted
      [    1.245193] ---------------------------------------------
      [    1.245193] cp/742 is trying to acquire lock:
      [    1.245193]  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff812b3f69>] ubifs_init_security+0x29/0xb0
      [    1.245193]
      [    1.245193] but task is already holding lock:
      [    1.245193]  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff81198e7f>] path_openat+0x3af/0x1280
      [    1.245193]
      [    1.245193] other info that might help us debug this:
      [    1.245193]  Possible unsafe locking scenario:
      [    1.245193]
      [    1.245193]        CPU0
      [    1.245193]        ----
      [    1.245193]   lock(&sb->s_type->i_mutex_key#9);
      [    1.245193]   lock(&sb->s_type->i_mutex_key#9);
      [    1.245193]
      [    1.245193]  *** DEADLOCK ***
      [    1.245193]
      [    1.245193]  May be due to missing lock nesting notation
      [    1.245193]
      [    1.245193] 2 locks held by cp/742:
      [    1.245193]  #0:  (sb_writers#5){.+.+.+}, at: [<ffffffff811ad37f>] mnt_want_write+0x1f/0x50
      [    1.245193]  #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff81198e7f>] path_openat+0x3af/0x1280
      [    1.245193]
      [    1.245193] stack backtrace:
      [    1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37
      [    1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014
      [    1.245193]  ffffffff8252d530 ffff88007b023a38 ffffffff814f6f49 ffffffff810b56c5
      [    1.245193]  ffff88007c30cc80 ffff88007b023af8 ffffffff810a150d ffff88007b023a68
      [    1.245193]  000000008101302a ffff880000000000 00000008f447e23f ffffffff8252d500
      [    1.245193] Call Trace:
      [    1.245193]  [<ffffffff814f6f49>] dump_stack+0x4c/0x65
      [    1.245193]  [<ffffffff810b56c5>] ? console_unlock+0x1c5/0x510
      [    1.245193]  [<ffffffff810a150d>] __lock_acquire+0x1a6d/0x1ea0
      [    1.245193]  [<ffffffff8109fa78>] ? __lock_is_held+0x58/0x80
      [    1.245193]  [<ffffffff810a1a93>] lock_acquire+0xd3/0x270
      [    1.245193]  [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0
      [    1.245193]  [<ffffffff814fc83b>] mutex_lock_nested+0x6b/0x3a0
      [    1.245193]  [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0
      [    1.245193]  [<ffffffff812b3f69>] ? ubifs_init_security+0x29/0xb0
      [    1.245193]  [<ffffffff812b3f69>] ubifs_init_security+0x29/0xb0
      [    1.245193]  [<ffffffff8128e286>] ubifs_create+0xa6/0x1f0
      [    1.245193]  [<ffffffff81198e7f>] ? path_openat+0x3af/0x1280
      [    1.245193]  [<ffffffff81195d15>] vfs_create+0x95/0xc0
      [    1.245193]  [<ffffffff8119929c>] path_openat+0x7cc/0x1280
      [    1.245193]  [<ffffffff8109ffe3>] ? __lock_acquire+0x543/0x1ea0
      [    1.245193]  [<ffffffff81088f20>] ? sched_clock_cpu+0x90/0xc0
      [    1.245193]  [<ffffffff81088c00>] ? calc_global_load_tick+0x60/0x90
      [    1.245193]  [<ffffffff81088f20>] ? sched_clock_cpu+0x90/0xc0
      [    1.245193]  [<ffffffff811a9cef>] ? __alloc_fd+0xaf/0x180
      [    1.245193]  [<ffffffff8119ac55>] do_filp_open+0x75/0xd0
      [    1.245193]  [<ffffffff814ffd86>] ? _raw_spin_unlock+0x26/0x40
      [    1.245193]  [<ffffffff811a9cef>] ? __alloc_fd+0xaf/0x180
      [    1.245193]  [<ffffffff81189bd9>] do_sys_open+0x129/0x200
      [    1.245193]  [<ffffffff81189cc9>] SyS_open+0x19/0x20
      [    1.245193]  [<ffffffff81500717>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      While the lockdep splat is a false positive, because path_openat holds the
      i_mutex of the parent directory and ubifs_init_security() tries to acquire
      the i_mutex of a new inode, it reveals that taking i_mutex in
      ubifs_init_security() is pointless: the function is only called in the
      inode allocation path, so nobody else can see the inode yet.
      
      Cc: stable@vger.kernel.org # 3.20-
      Reported-and-tested-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Reviewed-and-tested-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Richard Weinberger <richard@nod.at>
      Signed-off-by: dedekind1@gmail.com
      cf6f54e3
  15. 26 Sep, 2015 1 commit
  16. 24 Sep, 2015 3 commits
  17. 23 Sep, 2015 7 commits
    • P
      NFS41: make close wait for layoutreturn · 500d701f
      Committed by Peng Tao
      If we send a layoutreturn asynchronously before close, the close
      might reach the server first and the layoutreturn would fail with
      BADSTATEID because nothing keeps the layout stateid alive.
      
      Also, do not pretend to send a layoutreturn when we are not sending one.
      Signed-off-by: Peng Tao <tao.peng@primarydata.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      500d701f
    • J
      ocfs2/dlm: fix deadlock when dispatch assert master · 012572d4
      Committed by Joseph Qi
      The order of the following three spinlocks should be:
      dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock
      
      But dlm_dispatch_assert_master() is called while holding
      dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
      dlm_grab() which will take dlm_domain_lock.
      
      If another thread (for example, dlm_query_join_handler) already holds
      dlm_domain_lock and then tries to take dlm_ctxt->spinlock, a deadlock
      occurs.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: "Junxiao Bi" <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      012572d4
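The lock ordering rule in the ocfs2/dlm commit above can be illustrated with a small userspace model. This is only a sketch in Python, not kernel code: the `RankedLocks` class, the `OrderViolation` exception, and both helper functions are hypothetical names invented for the illustration. Only the three lock names and their order come from the commit message.

```python
import threading

# Documented order from the commit message: a lower rank must be taken first.
LOCK_RANK = {
    "dlm_domain_lock": 0,
    "dlm_ctxt->spinlock": 1,
    "dlm_lock_resource->spinlock": 2,
}


class OrderViolation(Exception):
    """Raised when a lock is requested out of the documented order."""


class RankedLocks:
    """Hypothetical helper that refuses out-of-order acquisitions."""

    def __init__(self):
        self._locks = {name: threading.Lock() for name in LOCK_RANK}
        self._held = threading.local()

    def _stack(self):
        if not hasattr(self._held, "names"):
            self._held.names = []
        return self._held.names

    def acquire(self, name):
        held = self._stack()
        # The most recently taken lock has the highest rank held so far.
        if held and LOCK_RANK[name] <= LOCK_RANK[held[-1]]:
            raise OrderViolation(f"{name} requested while holding {held[-1]}")
        self._locks[name].acquire()
        held.append(name)

    def release(self, name):
        held = self._stack()
        assert held and held[-1] == name, "release in reverse order only"
        held.pop()
        self._locks[name].release()


def take_in_documented_order(locks):
    """Take all three locks in the documented order, then release them."""
    order = sorted(LOCK_RANK, key=LOCK_RANK.get)
    for name in order:
        locks.acquire(name)
    for name in reversed(order):
        locks.release(name)
    return order


def dispatch_while_holding_spinlocks(locks):
    """Model of the problematic path: domain lock requested last."""
    locks.acquire("dlm_ctxt->spinlock")
    locks.acquire("dlm_lock_resource->spinlock")
    try:
        locks.acquire("dlm_domain_lock")  # models what dlm_grab() would take
        locks.release("dlm_domain_lock")
    finally:
        locks.release("dlm_lock_resource->spinlock")
        locks.release("dlm_ctxt->spinlock")
```

Calling `dispatch_while_holding_spinlocks(RankedLocks())` raises `OrderViolation`, which is the model's stand-in for the inversion that can deadlock against a thread taking the locks in the documented order.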
    • A
      userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key" · ac5be6b4
      Committed by Andrea Arcangeli
      This reverts commit 51360155 and adapts
      fs/userfaultfd.c to use the old version of that function.
      
      It didn't look robust to call __wake_up_common with "nr == 1" when we
      absolutely require wakeall semantics, but we have full control of what
      we insert into the two waitqueue heads of the blocked userfaults.  No
      exclusive waiter can ever be inserted into those two waitqueue heads,
      so we may as well stick to the "nr == 1" of the old code and rely
      purely on the fact that no waiter inserted into either of the two
      waitqueue heads, which we must treat as wakeall, has
      WQ_FLAG_EXCLUSIVE set in wait->flags.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac5be6b4
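The reasoning in the userfaultfd revert above (that "nr == 1" still wakes everyone as long as no waiter is exclusive) can be checked with a toy model of the wake loop. This is a simplified Python sketch, not the kernel function: `wake_up_common` here is a hypothetical stand-in that assumes every wake callback succeeds, and the waiter names are made up.

```python
WQ_FLAG_EXCLUSIVE = 0x01  # flag bit used by this model for exclusive waiters


def wake_up_common(waiters, nr_exclusive):
    """Walk the queue and return the names of the waiters that get woken.

    `waiters` is a list of (name, flags) tuples. Non-exclusive waiters are
    always woken; the walk stops once `nr_exclusive` exclusive waiters have
    been woken. A value of 0 means "no limit" in this model.
    """
    woken = []
    for name, flags in waiters:
        woken.append(name)
        if flags & WQ_FLAG_EXCLUSIVE:
            nr_exclusive -= 1
            if nr_exclusive == 0:
                break
    return woken


# Queue with no exclusive waiters: nr == 1 behaves like wakeall.
plain_queue = [("fault-a", 0), ("fault-b", 0), ("fault-c", 0)]

# Queue with exclusive waiters: nr == 1 stops after the first exclusive one.
mixed_queue = [("fault-a", 0), ("excl-1", WQ_FLAG_EXCLUSIVE),
               ("excl-2", WQ_FLAG_EXCLUSIVE)]
```

With `plain_queue`, `wake_up_common(plain_queue, 1)` returns all three waiters; with `mixed_queue` the walk stops after `excl-1`. That is why the commit only needs the guarantee that no exclusive waiter is ever queued on the two userfault waitqueue heads.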
    • K
      NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set · 834e465b
      Committed by Kinglong Mee
      When the lseg's commit_through_mds is set, the pNFS client always WARNs
      once in nfs_direct_select_verf after checking ds_cinfo.nbuckets.
      
      NFS should use the DS verf unless commit_through_mds is set for the
      layout segment, in which case nbuckets is zero.
      
      [17844.666094] ------------[ cut here ]------------
      [17844.667071] WARNING: CPU: 0 PID: 21758 at /root/source/linux-pnfs/fs/nfs/direct.c:174 nfs_direct_select_verf+0x5a/0x70 [nfs]()
      [17844.668650] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c btrfs ppdev coretemp crct10dif_pclmul auth_rpcgss crc32_pclmul crc32c_intel nfs_acl ghash_clmulni_intel lockd vmw_balloon xor vmw_vmci grace raid6_pq shpchp sunrpc parport_pc i2c_piix4 parport vmwgfx drm_kms_helper ttm drm serio_raw mptspi e1000 scsi_transport_spi mptscsih mptbase ata_generic pata_acpi [last unloaded: fscache]
      [17844.686676] CPU: 0 PID: 21758 Comm: kworker/0:1 Tainted: G        W  OE   4.3.0-rc1-pnfs+ #245
      [17844.687352] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
      [17844.698502] Workqueue: nfsiod rpc_async_release [sunrpc]
      [17844.699212]  0000000000000009 0000000043e58010 ffff8800454fbc10 ffffffff813680c4
      [17844.699990]  ffff8800454fbc48 ffffffff8108b49d ffff88004eb20000 ffff88004eb20000
      [17844.700844]  ffff880062e26000 0000000000000000 0000000000000001 ffff8800454fbc58
      [17844.701637] Call Trace:
      [17844.725252]  [<ffffffff813680c4>] dump_stack+0x19/0x25
      [17844.732693]  [<ffffffff8108b49d>] warn_slowpath_common+0x7d/0xb0
      [17844.733855]  [<ffffffff8108b5da>] warn_slowpath_null+0x1a/0x20
      [17844.735015]  [<ffffffffa04a27ca>] nfs_direct_select_verf+0x5a/0x70 [nfs]
      [17844.735999]  [<ffffffffa04a2b83>] nfs_direct_set_hdr_verf+0x23/0x90 [nfs]
      [17844.736846]  [<ffffffffa04a2e17>] nfs_direct_write_completion+0x227/0x260 [nfs]
      [17844.737782]  [<ffffffffa04a433c>] nfs_pgio_release+0x1c/0x20 [nfs]
      [17844.738597]  [<ffffffffa0502df3>] pnfs_generic_rw_release+0x23/0x30 [nfsv4]
      [17844.739486]  [<ffffffffa01cbbea>] rpc_free_task+0x2a/0x70 [sunrpc]
      [17844.740326]  [<ffffffffa01cbcd5>] rpc_async_release+0x15/0x20 [sunrpc]
      [17844.741173]  [<ffffffff810a387c>] process_one_work+0x21c/0x4c0
      [17844.741984]  [<ffffffff810a37cd>] ? process_one_work+0x16d/0x4c0
      [17844.742837]  [<ffffffff810a3b6a>] worker_thread+0x4a/0x440
      [17844.743639]  [<ffffffff810a3b20>] ? process_one_work+0x4c0/0x4c0
      [17844.744399]  [<ffffffff810a3b20>] ? process_one_work+0x4c0/0x4c0
      [17844.745176]  [<ffffffff810a8d75>] kthread+0xf5/0x110
      [17844.745927]  [<ffffffff810a8c80>] ? kthread_create_on_node+0x240/0x240
      [17844.747105]  [<ffffffff8172ce1f>] ret_from_fork+0x3f/0x70
      [17844.747856]  [<ffffffff810a8c80>] ? kthread_create_on_node+0x240/0x240
      [17844.748642] ---[ end trace 336a2845d42b83f0 ]---
      Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      834e465b
    • P
      cifs: use server timestamp for ntlmv2 authentication · 98ce94c8
      Committed by Peter Seiderer
      A Linux cifs mount with ntlmssp against a Mac OS X (Yosemite
      10.10.5) share fails if the clocks differ by more than +/-2h:
      
      digest-service: digest-request: od failed with 2 proto=ntlmv2
      digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2
      
      Fix this by (re-)using the given server timestamp for the
      ntlmv2 authentication (as Windows 7 does).
      
      A related problem was also reported earlier by Namjae Jeon (see below):
      
      Windows machines have an extended security feature that refuses
      authentication when the server time and client time differ and
      ntlmv2 negotiation is used. This problem is prevalent in embedded
      environments where the system time defaults to 1970.
      
      Modern servers send the server timestamp in the TargetInfo Av_Pair
      structure in the challenge message [see MS-NLMP 2.2.2.1].
      In [MS-NLMP 3.1.5.1.2] it is explicitly stated that the client must
      use the server-provided timestamp if present, or the current time if
      it is not.
      Reported-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Peter Seiderer <ps.report@gmx.net>
      Signed-off-by: Steve French <smfrench@gmail.com>
      CC: Stable <stable@vger.kernel.org>
      98ce94c8
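The timestamp rule cited in the ntlmv2 commit above (use the server-provided timestamp if present, otherwise the current time) can be sketched as follows. This is a Python illustration, not the cifs.ko code: the function names are hypothetical, and the sketch assumes the TargetInfo blob is a sequence of AV_PAIR entries (16-bit little-endian AvId and AvLen followed by the value), with AvId 0x0007 carrying an 8-byte timestamp in 100 ns units since 1601 and AvId 0x0000 terminating the list.

```python
import struct
import time

MSV_AV_EOL = 0x0000        # end-of-list marker
MSV_AV_TIMESTAMP = 0x0007  # server timestamp entry
EPOCH_DIFF_SECONDS = 11644473600  # seconds between 1601-01-01 and 1970-01-01


def unix_to_filetime(unix_seconds):
    """Convert Unix seconds to 100 ns ticks since 1601-01-01."""
    return int((unix_seconds + EPOCH_DIFF_SECONDS) * 10_000_000)


def find_server_timestamp(target_info):
    """Return the server timestamp from the AV_PAIR list, or None."""
    offset = 0
    while offset + 4 <= len(target_info):
        av_id, av_len = struct.unpack_from("<HH", target_info, offset)
        offset += 4
        if av_id == MSV_AV_EOL:
            break
        value = target_info[offset:offset + av_len]
        if av_id == MSV_AV_TIMESTAMP and len(value) == 8:
            return struct.unpack("<Q", value)[0]
        offset += av_len
    return None


def ntlmv2_timestamp(target_info, now=None):
    """Prefer the server timestamp; fall back to the local clock."""
    server_time = find_server_timestamp(target_info)
    if server_time is not None:
        return server_time
    return unix_to_filetime(time.time() if now is None else now)
```

The `now` parameter exists only so the fallback can be exercised deterministically; a client whose clock is stuck at 1970 would still produce the server's time whenever the challenge carries a timestamp entry.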
    • S
      disabling oplocks/leases via module parm enable_oplocks broken for SMB3 · e0ddde9d
      Committed by Steve French
      Leases (oplocks) were always requested for SMB2/SMB3 even when oplocks
      were disabled in the cifs.ko module.
      Signed-off-by: Steve French <steve.french@primarydata.com>
      Reviewed-by: Chandrika Srinivasan <chandrika.srinivasan@citrix.com>
      CC: Stable <stable@vger.kernel.org>
      e0ddde9d
    • J
      Btrfs: keep dropped roots in cache until transaction commit · 2b9dbef2
      Committed by Josef Bacik
      When dropping a snapshot we need to account for the qgroup changes.  If we drop
      the snapshot all in one go then the backref code will fail to find blocks from
      the snapshot we dropped since it won't be able to find the root in the fs root
      cache.  This can lead to us failing to find refs from other roots that pointed
      at blocks in the now deleted root.  To handle this we need to not remove the fs
      roots from the cache until after we process the qgroup operations.  Do this by
      adding dropped roots to a list on the transaction, and letting the transaction
      remove the roots at the same time it drops the commit roots.  This will keep all
      of the backref searching code in sync properly, and fixes a problem Mark was
      seeing with snapshot delete and qgroups.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      2b9dbef2
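The deferral described in the Btrfs commit above (keep dropped roots findable until the transaction commits) can be modelled in a few lines. This is a toy Python sketch under stated assumptions, not the btrfs implementation: the `Transaction` class, its methods, and the integer root ids are all hypothetical.

```python
class Transaction:
    """Toy model: roots dropped in this transaction stay cached until commit."""

    def __init__(self, root_cache):
        self.root_cache = root_cache  # set of root ids visible to lookups
        self.dropped_roots = []       # roots dropped during this transaction

    def drop_snapshot(self, root_id):
        # Defer removal: only remember the root, do not touch the cache yet.
        self.dropped_roots.append(root_id)

    def roots_for_block(self, referencing_roots):
        # Backref-style lookup: only roots still in the cache can be resolved.
        return sorted(r for r in referencing_roots if r in self.root_cache)

    def commit(self):
        # Qgroup accounting would run before this point; now drop the roots.
        for root_id in self.dropped_roots:
            self.root_cache.discard(root_id)
        self.dropped_roots.clear()
```

In this model a block referenced by roots 256 and 257 still resolves to both after `drop_snapshot(257)`, and only resolves to 256 once `commit()` has run, which mirrors the window the qgroup accounting needs.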
  18. 22 Sep, 2015 1 commit
    • C
      Btrfs: Direct I/O: Fix space accounting · 50745b0a
      Committed by chandan
      The following call trace is seen when the generic/095 test is executed:
      
      WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
      Modules linked in:
      CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
       ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
       0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
       ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
      Call Trace:
       [<ffffffff81984058>] dump_stack+0x45/0x57
       [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
       [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
       [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
       [<ffffffff8117ce07>] destroy_inode+0x37/0x60
       [<ffffffff8117cf39>] evict+0x109/0x170
       [<ffffffff8117cfd5>] dispose_list+0x35/0x50
       [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
       [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
       [<ffffffff81165951>] kill_anon_super+0x11/0x20
       [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
       [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
       [<ffffffff811660cf>] deactivate_super+0x5f/0x70
       [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
       [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
       [<ffffffff81069c06>] task_work_run+0x96/0xb0
       [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
       [<ffffffff8198cbc2>] int_signal+0x12/0x17
      
      This means that the inode had non-zero "outstanding extents" during
      eviction. This occurs because, during direct I/O, a task that successfully
      used up its reserved data space sets the BTRFS_INODE_DIO_READY bit and does
      not clear it after finishing the DIO write. A later DIO write could
      fail, and its unused reserved space would not be freed because of the
      previously set BTRFS_INODE_DIO_READY bit.
      
      Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
      following issue:
      |-----------------------------------+-------------------------------------|
      | Task A                            | Task B                              |
      |-----------------------------------+-------------------------------------|
      | Start direct i/o write on inode X.|                                     |
      | reserve space                     |                                     |
      | Allocate ordered extent           |                                     |
      | release reserved space            |                                     |
      | Set BTRFS_INODE_DIO_READY bit.    |                                     |
      |                                   | splice()                            |
      |                                   | Transfer data from pipe buffer to   |
      |                                   | destination file.                   |
      |                                   | - kmap(pipe buffer page)            |
      |                                   | - Start direct i/o write on         |
      |                                   |   inode X.                          |
      |                                   |   - reserve space                   |
      |                                   |   - dio_refill_pages()              |
      |                                   |     - sdio->blocks_available == 0   |
      |                                   |     - Since a kernel address is     |
      |                                   |       being passed instead of a     |
      |                                   |       user space address,           |
      |                                   |       iov_iter_get_pages() returns  |
      |                                   |       -EFAULT.                      |
      |                                   |   - Since BTRFS_INODE_DIO_READY is  |
      |                                   |     set, we don't release reserved  |
      |                                   |     space.                          |
      |                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
      | -EIOCBQUEUED is returned.         |                                     |
      |-----------------------------------+-------------------------------------|
      
      Hence this commit introduces "struct btrfs_dio_data" to track the usage of
      reserved data space. The remaining unused reserved space can now be freed
      reliably.
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      50745b0a
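The idea behind the Direct I/O fix above, tracking the reservation per write instead of through a flag shared on the inode, can be sketched like this. It is a simplified Python model, not the btrfs code: `Inode`, `DioData`, `direct_write`, the 4096-byte step, and the `fail_after` knob are all hypothetical and exist only to simulate a write that fails partway through.

```python
class Inode:
    """Toy inode that only tracks reserved data space in bytes."""

    def __init__(self):
        self.reserved = 0


class DioData:
    """Per-write record of how much of this write's reservation is unused."""

    def __init__(self, reserve):
        self.reserve = reserve


def direct_write(inode, length, fail_after=None):
    """Simulate one direct I/O write and return the bytes written.

    If `fail_after` is set, the write stops once that many bytes have been
    written, modelling an error in the middle of the request.
    """
    inode.reserved += length  # reserve space up front
    dio = DioData(length)     # private to this call, not shared via the inode
    written = 0
    step = 4096
    while written < length:
        if fail_after is not None and written >= fail_after:
            break
        chunk = min(step, length - written)
        # Allocating an extent consumes part of the reservation.
        inode.reserved -= chunk
        dio.reserve -= chunk
        written += chunk
    # Free whatever this particular write reserved but never used.
    inode.reserved -= dio.reserve
    dio.reserve = 0
    return written
```

Because the leftover amount lives in the per-call `DioData`, a failing write releases its own unused reservation regardless of what an earlier or concurrent write on the same inode did.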
  19. 21 Sep, 2015 1 commit