1. 21 5月, 2015 1 次提交
  2. 20 5月, 2015 2 次提交
    • F
      Btrfs: fix racy system chunk allocation when setting block group ro · a9629596
      Filipe Manana 提交于
      If while setting a block group read-only we end up allocating a system
      chunk, through check_system_chunk(), we were not doing it while holding
      the chunk mutex which is a problem if a concurrent chunk allocation is
      happening, through do_chunk_alloc(), as it means both block groups can
      end up using the same logical addresses and physical regions in the
      device(s). So make sure we hold the chunk mutex.
      
      Cc: stable@vger.kernel.org  # 4.0+
      Fixes: 2f081088 ("btrfs: delete chunk allocation attemp when
                            setting block group ro")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a9629596
    • M
      btrfs: clear 'ret' in btrfs_check_shared() loop · 2c2ed5aa
      Mark Fasheh 提交于
      btrfs_check_shared() is leaking a return value of '1' from
      find_parent_nodes(). As a result, callers (in this case, extent_fiemap())
      are told extents are shared when they are not. This in turn broke fiemap on
      btrfs for kernels v3.18 and up.
      
      The fix is simple - we just have to clear 'ret' after we are done processing
      the results of find_parent_nodes().
      
      It wasn't clear to me at first what was happening with return values in
      btrfs_check_shared() and find_parent_nodes() - thanks to Josef for the help
      on irc. I added documentation to both functions to make things more clear
      for the next hacker who might come across them.
      
      If we could queue this up for -stable too that would be great.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c2ed5aa
  3. 11 5月, 2015 4 次提交
    • F
      Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON · 062c19e9
      Filipe Manana 提交于
      There's a race between releasing extent buffers that are flagged as stale
      and recycling them that makes us it the following BUG_ON at
      btrfs_release_extent_buffer_page:
      
          BUG_ON(extent_buffer_under_io(eb))
      
      The BUG_ON is triggered because the extent buffer has the flag
      EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
      dirty by another concurrent task.
      
      Here follows a sequence of steps that leads to the BUG_ON.
      
            CPU 0                                                    CPU 1                                                CPU 2
      
      path->nodes[0] == eb X
      X->refs == 2 (1 for the tree, 1 for the path)
      btrfs_header_generation(X) == current trans id
      flag EXTENT_BUFFER_DIRTY set on X
      
      btrfs_release_path(path)
          unlocks X
      
                                                            reads eb X
                                                               X->refs incremented to 3
                                                            locks eb X
                                                            btrfs_del_items(X)
                                                               X becomes empty
                                                               clean_tree_block(X)
                                                                   clear EXTENT_BUFFER_DIRTY from X
                                                               btrfs_del_leaf(X)
                                                                   unlocks X
                                                                   extent_buffer_get(X)
                                                                      X->refs incremented to 4
                                                                   btrfs_free_tree_block(X)
                                                                      X's range is not pinned
                                                                      X's range added to free
                                                                        space cache
                                                                   free_extent_buffer_stale(X)
                                                                      lock X->refs_lock
                                                                      set EXTENT_BUFFER_STALE on X
                                                                      release_extent_buffer(X)
                                                                          X->refs decremented to 3
                                                                          unlocks X->refs_lock
                                                            btrfs_release_path()
                                                               unlocks X
                                                               free_extent_buffer(X)
                                                                   X->refs becomes 2
      
                                                                                                            __btrfs_cow_block(Y)
                                                                                                                btrfs_alloc_tree_block()
                                                                                                                    btrfs_reserve_extent()
                                                                                                                        find_free_extent()
                                                                                                                            gets offset == X->start
                                                                                                                    btrfs_init_new_buffer(X->start)
                                                                                                                        btrfs_find_create_tree_block(X->start)
                                                                                                                            alloc_extent_buffer(X->start)
                                                                                                                                find_extent_buffer(X->start)
                                                                                                                                    finds eb X in radix tree
      
          free_extent_buffer(X)
              lock X->refs_lock
                  test X->refs == 2
                  test bit EXTENT_BUFFER_STALE is set
                  test !extent_buffer_under_io(eb)
      
                                                                                                                                    increments X->refs to 3
                                                                                                                                    mark_extent_buffer_accessed(X)
                                                                                                                                        check_buffer_tree_ref(X)
                                                                                                                                          --> does nothing,
                                                                                                                                              X->refs >= 2 and
                                                                                                                                              EXTENT_BUFFER_TREE_REF
                                                                                                                                              is set in X
                                                                                                                    clear EXTENT_BUFFER_STALE from X
                                                                                                                    locks X
                                                                                                                btrfs_mark_buffer_dirty()
                                                                                                                    set_extent_buffer_dirty(X)
                                                                                                                        check_buffer_tree_ref(X)
                                                                                                                           --> does nothing, X->refs >= 2 and
                                                                                                                               EXTENT_BUFFER_TREE_REF is set
                                                                                                                        sets EXTENT_BUFFER_DIRTY on X
      
                  test and clear EXTENT_BUFFER_TREE_REF
                  decrements X->refs to 2
              release_extent_buffer(X)
                  decrements X->refs to 1
                  unlock X->refs_lock
      
                                                                                                            unlock X
                                                                                                            free_extent_buffer(X)
                                                                                                                lock X->refs_lock
                                                                                                                release_extent_buffer(X)
                                                                                                                    decrements X->refs to 0
                                                                                                                    btrfs_release_extent_buffer_page(X)
                                                                                                                         BUG_ON(extent_buffer_under_io(X))
                                                                                                                             --> EXTENT_BUFFER_DIRTY set on X
      
      Fix this by making find_extent buffer wait for any ongoing task currently
      executing free_extent_buffer()/free_extent_buffer_stale() if the extent
      buffer has the stale flag set.
      A more clean alternative would be to always increment the extent buffer's
      reference count while holding its refs_lock spinlock but find_extent_buffer
      is a performance critical area and that would cause lock contention whenever
      multiple tasks search for the same extent buffer concurrently.
      
      A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
      btrfs patches backported from newer kernels) was hitting this often:
      
      [1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
      (...)
      [1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
      [1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
      [1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
      [1212302.540792] RIP: 0010:[<ffffffffa0507081>]  [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
      (...)
      [1212302.630008] Call Trace:
      [1212302.630008]  [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
      [1212302.630008]  [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
      [1212302.630008]  [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
      [1212302.630008]  [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
      [1212302.630008]  [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
      [1212302.630008]  [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
      [1212302.630008]  [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
      [1212302.630008]  [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
      [1212302.630008]  [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
      [1212302.630008]  [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
      [1212302.630008]  [<ffffffff811b9e28>] evict+0xa8/0x190
      [1212302.630008]  [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0
      
      I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
      integration branch from about a month ago, running the following stress
      test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):
      
        while true; do
           mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
           mount /dev/sdd /mnt
           snapshot_cmd="btrfs subvolume snapshot -r /mnt"
           snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
           fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
           umount /mnt
        done
      
      Which usually triggers the BUG_ON within less than 24 hours:
      
      [49558.618097] ------------[ cut here ]------------
      [49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
      (...)
      [49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G        W      3.19.0-btrfs-next-7+ #3
      [49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
      [49558.620031] RIP: 0010:[<ffffffffa0476b1a>]  [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
      (...)
      [49558.620031] Call Trace:
      [49558.620031]  [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
      [49558.620031]  [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
      [49558.620031]  [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
      [49558.620031]  [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
      [49558.620031]  [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
      [49558.620031]  [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
      [49558.620031]  [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
      [49558.728054]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
      [49558.728054]  [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
      [49558.728054]  [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
      [49558.728054]  [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
      [49558.728054]  [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
      [49558.728054]  [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
      [49558.728054]  [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
      [49558.728054]  [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
      [49558.728054]  [<ffffffff81155939>] iterate_supers+0x76/0xc4
      [49558.728054]  [<ffffffff811798e8>] sys_sync+0x55/0x83
      [49558.728054]  [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      062c19e9
    • F
      Btrfs: fix race between block group creation and their cache writeout · ff1f8250
      Filipe Manana 提交于
      So creating a block group has 2 distinct phases:
      
      Phase 1 - creates the btrfs_block_group_cache item and adds it to the
      rbtree fs_info->block_group_cache_tree and to the corresponding list
      space_info->block_groups[];
      
      Phase 2 - adds the block group item to the extent tree and corresponding
      items to the chunk tree.
      
      The first phase adds the block_group_cache_item to a list of pending block
      groups in the transaction handle, and phase 2 happens when
      btrfs_end_transaction() is called against the transaction handle.
      
      It happens that once phase 1 completes, other concurrent tasks that use
      their own transaction handle, but points to the same running transaction
      (struct btrfs_trans_handle->transaction), can use this block group for
      space allocations and therefore mark it dirty. Dirty block groups are
      tracked in a list belonging to the currently running transaction (struct
      btrfs_transaction) and not in the transaction handle (btrfs_trans_handle).
      
      This is a problem because once a task calls btrfs_commit_transaction(),
      it calls btrfs_start_dirty_block_groups() which will see all dirty block
      groups and attempt to start their writeout, including those that are
      still attached to the transaction handle of some concurrent task that
      hasn't called btrfs_end_transaction() yet - which means those block
      groups haven't gone through phase 2 yet and therefore when
      write_one_cache_group() is called, it won't find the block group items
      in the extent tree and abort the current transaction with -ENOENT,
      turning the fs into readonly mode and require a remount.
      
      Fix this by ignoring -ENOENT when looking for block group items in the
      extent tree when we attempt to start the writeout of the block group
      caches outside the critical section of the transaction commit. We will
      try again later during the critical section and if there we still don't
      find the block group item in the extent tree, we then abort the current
      transaction.
      
      This issue happened twice, once while running fstests btrfs/067 and once
      for btrfs/078, which produced the following trace:
      
      [ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [ 3278.707329] BTRFS: Transaction aborted (error -2)
      (...)
      [ 3278.731555] Call Trace:
      [ 3278.732396]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [ 3278.733860]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [ 3278.735312]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [ 3278.736874]  [<ffffffffa03ada6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [ 3278.738302]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [ 3278.739520]  [<ffffffffa03ada6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [ 3278.741222]  [<ffffffffa03b9e56>] write_one_cache_group+0xae/0xbf [btrfs]
      [ 3278.742797]  [<ffffffffa03c487b>] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs]
      [ 3278.744492]  [<ffffffffa03d309c>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [ 3278.746084]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [ 3278.747249]  [<ffffffffa03e5660>] btrfs_sync_file+0x313/0x387 [btrfs]
      [ 3278.748744]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
      [ 3278.749958]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [ 3278.751218]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
      [ 3278.754197]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
      [ 3278.755192]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
      [ 3278.756236]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [ 3278.757366] ---[ end trace 9a4d4df4969709aa ]---
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ff1f8250
    • F
      Btrfs: fix panic when starting bg cache writeout after IO error · 28aeeac1
      Filipe Manana 提交于
      When waiting for the writeback of block group cache we returned
      immediately if there was an error during writeback without waiting
      for the ordered extent to complete. This left a short time window
      where if some other task attempts to start the writeout for the same
      block group cache it can attempt to add a new ordered extent, starting
      at the same offset (0) before the previous one is removed from the
      ordered tree, causing an ordered tree panic (calls BUG()).
      
      This normally doesn't happen in other write paths, such as buffered
      writes or direct IO writes for regular files, since before marking
      page ranges dirty we lock the ranges and wait for any ordered extents
      within the range to complete first.
      
      Fix this by making btrfs_wait_ordered_range() not return immediately
      if it gets an error from the writeback, waiting for all ordered extents
      to complete first.
      
      This issue happened often when running the fstest btrfs/088 and it's
      easy to trigger it by running in a loop until the panic happens:
      
        for ((i = 1; i <= 10000; i++)) do ./check btrfs/088 ; done
      
      [17156.862573] BTRFS critical (device sdc): panic in ordered_data_tree_panic:70: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
      [17156.864052] ------------[ cut here ]------------
      [17156.864052] kernel BUG at fs/btrfs/ordered-data.c:70!
      (...)
      [17156.864052] Call Trace:
      [17156.864052]  [<ffffffffa03876e3>] btrfs_add_ordered_extent+0x12/0x14 [btrfs]
      [17156.864052]  [<ffffffffa03787e2>] run_delalloc_nocow+0x5bf/0x747 [btrfs]
      [17156.864052]  [<ffffffffa03789ff>] run_delalloc_range+0x95/0x353 [btrfs]
      [17156.864052]  [<ffffffffa038b7fe>] writepage_delalloc.isra.16+0xb9/0x13f [btrfs]
      [17156.864052]  [<ffffffffa038d75b>] __extent_writepage+0x129/0x1f7 [btrfs]
      [17156.864052]  [<ffffffffa038da5a>] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs]
      [17156.864052]  [<ffffffff810ad2af>] ? __module_text_address+0x12/0x59
      [17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [17156.864052]  [<ffffffffa038df76>] extent_writepages+0x4b/0x5c [btrfs]
      [17156.864052]  [<ffffffff81144431>] ? kmem_cache_free+0x9b/0xce
      [17156.864052]  [<ffffffffa0376a46>] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs]
      [17156.864052]  [<ffffffffa0389cd6>] ? free_extent_state+0x8c/0xc1 [btrfs]
      [17156.864052]  [<ffffffffa0374871>] btrfs_writepages+0x28/0x2a [btrfs]
      [17156.864052]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
      [17156.864052]  [<ffffffff81102f36>] __filemap_fdatawrite_range+0x5a/0x61
      [17156.864052]  [<ffffffff81102f6e>] filemap_fdatawrite_range+0x13/0x15
      [17156.864052]  [<ffffffffa0383ef7>] btrfs_fdatawrite_range+0x21/0x48 [btrfs]
      [17156.864052]  [<ffffffffa03ab89e>] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs]
      [17156.864052]  [<ffffffffa03ac1ab>] ? btrfs_write_out_cache+0x41/0xdc [btrfs]
      [17156.864052]  [<ffffffffa03ac1fd>] btrfs_write_out_cache+0x93/0xdc [btrfs]
      [17156.864052]  [<ffffffffa0363847>] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs]
      [17156.864052]  [<ffffffffa03638e6>] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs]
      [17156.864052]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [17156.864052]  [<ffffffffa037209e>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [17156.864052]  [<ffffffffa034c748>] btrfs_sync_fs+0xe1/0x12d [btrfs]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      28aeeac1
    • F
      Btrfs: fix crash after inode cache writeback failure · e43699d4
      Filipe Manana 提交于
      If the writeback of an inode cache failed we were unnecessarilly
      attempting to release again the delalloc metadata that we previously
      reserved. However attempting to do this a second time triggers an
      assertion at drop_outstanding_extent() because we have no more
      outstanding extents for our inode cache's inode. If we were able
      to start writeback of the cache the reserved metadata space is
      released at btrfs_finished_ordered_io(), even if an error happens
      during writeback.
      
      So make sure we don't repeat the metadata space release if writeback
      started for our inode cache.
      
      This issue was trivial to reproduce by running the fstest btrfs/088
      with "-o inode_cache", which triggered the assertion leading to a
      BUG() call and requiring a reboot in order to run the remaining
      fstests. Trace produced by btrfs/088:
      
      [255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276
      [255289.388094] ------------[ cut here ]------------
      [255289.389184] kernel BUG at fs/btrfs/ctree.h:4057!
      [255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      (...)
      [255289.392068] Call Trace:
      [255289.392068]  [<ffffffffa035e774>] drop_outstanding_extent+0x3d/0x6d [btrfs]
      [255289.392068]  [<ffffffffa0364988>] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs]
      [255289.392068]  [<ffffffffa03b4174>] btrfs_write_out_ino_cache+0x95/0xad [btrfs]
      [255289.392068]  [<ffffffffa036f5c4>] btrfs_save_ino_cache+0x275/0x2dc [btrfs]
      [255289.392068]  [<ffffffffa03e2d83>] commit_fs_roots.isra.12+0xaa/0x137 [btrfs]
      [255289.392068]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [255289.392068]  [<ffffffffa037841f>] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs]
      [255289.392068]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
      [255289.392068]  [<ffffffffa037842e>] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs]
      (...)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e43699d4
  4. 07 5月, 2015 1 次提交
  5. 30 4月, 2015 1 次提交
  6. 26 4月, 2015 9 次提交
    • Y
      Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b
      Yang Dongsheng 提交于
      We need to fill inode when we found a node for it in delayed_nodes_tree.
      But we did not fill the ->last_trans currently, it will cause the test
      of xfstest/generic/311 fail. Scenario of the 311 is shown as below:
      
      Problem:
      	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(2). pwrite(test_fd, buf, 4096, 0)
      	(3). close(test_fd)
      	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
      	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(6). fsync(test_fd);
      				<-------- we did not get the correct log entry for the file
      Reason:
      	When we re-open this file in (5), we would find a node
      in delayed_nodes_tree and fill the inode we are lookup with the
      information. But the ->last_trans is not filled, then the fsync()
      will check the ->last_trans and found it's 0 then say this inode
      is already in our tree which is commited, not recording the extents
      for it.
      
      Fix:
      	This patch fill the ->last_trans properly and set the
      runtime_flags if needed in this situation. Then we can get the
      log entries we expected after (6) and generic/311 passed.
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: NMiao Xie <miaoxie@huawei.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6e17d30b
    • O
      btrfs: unlock i_mutex after attempting to delete subvolume during send · 909e26dc
      Omar Sandoval 提交于
      Whenever the check for a send in progress introduced in commit
      521e0546 (btrfs: protect snapshots from deleting during send) is
      hit, we return without unlocking inode->i_mutex. This is easy to see
      with lockdep enabled:
      
      [  +0.000059] ================================================
      [  +0.000028] [ BUG: lock held when returning to user space! ]
      [  +0.000029] 4.0.0-rc5-00096-g3c435c1e #93 Not tainted
      [  +0.000026] ------------------------------------------------
      [  +0.000029] btrfs/211 is leaving the kernel with locks still held!
      [  +0.000029] 1 lock held by btrfs/211:
      [  +0.000023]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0
      
      Make sure we unlock it in the error path.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      909e26dc
    • O
      btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache · b8605454
      Omar Sandoval 提交于
      If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
      When we try to access them later, things will blow up in various ways.
      
      Also fix the comment about the return value, which is an errno on error,
      not -1, and update the cases where it was not.
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b8605454
    • O
      btrfs: fix race on ENOMEM in alloc_extent_buffer · 5ca64f45
      Omar Sandoval 提交于
      Consider the following interleaving of overlapping calls to
      alloc_extent_buffer:
      
      Call 1:
      
      - Successfully allocates a few pages with find_or_create_page
      - find_or_create_page fails, goto free_eb
      - Unlocks the allocated pages
      
      Call 2:
      - Calls find_or_create_page and gets a page in call 1's extent_buffer
      - Finds that the page is already associated with an extent_buffer
      - Grabs a reference to the half-written extent_buffer and calls
        mark_extent_buffer_accessed on it
      
      mark_extent_buffer_accessed will then try to call mark_page_accessed on
      a null page and panic.
      
      The fix is to decrement the reference count on the half-written
      extent_buffer before unlocking the pages so call 2 won't use it. We
      should also set exists = NULL in the case that we don't use exists to
      avoid accidentally returning a freed extent_buffer in an error case.
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5ca64f45
    • O
      btrfs: handle ENOMEM in btrfs_alloc_tree_block · 67b7859e
      Omar Sandoval 提交于
      This is one of the first places to give out when memory is tight. Handle
      it properly rather than with a BUG_ON.
      
      Also fix the comment about the return value, which is an ERR_PTR, not
      NULL, on error.
      Signed-off-by: NOmar Sandoval <osandov@osandov.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      67b7859e
    • F
      Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole · 1b984508
      Forrest Liu 提交于
      If device tree has hole, find_free_dev_extent() cannot find available
      address properly.
      
      The problem can be reproduce by following script.
      
          mntpath=/btrfs
          loopdev=/dev/loop0
          filepath=/home/forrest/image
      
          umount $mntpath
          losetup -d $loopdev
          truncate --size 100g $filepath
          losetup $loopdev $filepath
          mkfs.btrfs -f $loopdev
          mount $loopdev $mntpath
      
          # make device tree with one big hole
          for i in `seq 1 1 100`; do
              fallocate -l 1g $mntpath/$i
          done
          sync
          for i in `seq 1 1 95`; do
              rm $mntpath/$i
          done
          sync
      
          # wait cleaner thread remove unused block group
          sleep 300
      
          fallocate -l 1g $mntpath/aaa
      
          # failed to allocate new chunk
          fallocate -l 1g $mntpath/bbb
      
      Above script will make device tree with one big hole, and can only allocate
      just one chunk in a transaction, so failed to allocate new chunk for $mntpath/bbb
      
          item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
              dev extent chunk_tree 3
              chunk objectid 256 chunk offset 106292051968 length 1073741824
          item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
              dev extent chunk_tree 3
              chunk objectid 256 chunk offset 103108575232 length 1073741824
      Signed-off-by: NForrest Liu <forrestl@synology.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1b984508
    • C
      Btrfs: don't check for delalloc_bytes in cache_save_setup · e4c88f00
      Chris Mason 提交于
      Now that we're doing free space cache writeback outside the critical
      section in the commit, there is a bigger window for delalloc_bytes to
      be added after a cache has been written.  find_free_extent may do this
      without putting the block group back into the dirty list, and also
      without a transaction running.
      
      Checking for delalloc_bytes in cache_save_setup means we might leave the
      cache marked as written without invalidating it.  Consistency checks
      during mount will toss the cache, but it's better to get rid of the
      check in cache_save_setup and let it get invalidated by the checks
      already done during cache write out.
      Signed-off-by: NChris Mason <clm@fb.com>
      e4c88f00
    • F
      Btrfs: fix deadlock when starting writeback of bg caches · 24b89d08
      Filipe Manana 提交于
      While starting the writes of the dirty block group caches, if we don't
      find a block group item in the extent tree we were leaving without
      releasing our path, running delayed references and then looping again to
      process any new dirty block groups. However this second iteration of the
      loop could cause a deadlock because it tries to lock some other extent
      tree node/leaf which another task already locked and it's blocked because
      it's waiting for a lock on some node/leaf that is in our path that was not
      released before.
      We could also deadlock when running the delayed references - as we could
      end up trying to lock the same nodes/leafs that we have in our local path
      (with a different lock type).
      
      Got into such case when running xfstests:
      
      [20892.242791] ------------[ cut here ]------------
      [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [20892.245874] BTRFS: Transaction aborted (error -2)
      (...)
      [20892.269378] Call Trace:
      [20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
      [20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      (...)
      [20892.291316] ---[ end trace 597f77e664245373 ]---
      [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
      [20892.297390] BTRFS info (device sdg): forced readonly
      [20892.298222] ------------[ cut here ]------------
      [20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]()
      (...)
      [20892.326253] Call Trace:
      [20892.326904]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.329503]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.330815]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.332556]  [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs]
      [20892.333955]  [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c
      [20892.335562]  [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs]
      [20892.336849]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
      [20892.338222]  [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs]
      [20892.339823]  [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs]
      [20892.341275]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
      [20892.342810]  [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs]
      [20892.344184]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.347162]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      (...)
      [20892.361015] ---[ end trace 597f77e664245374 ]---
      [21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds.
      [21120.689881]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.703696] Call Trace:
      [21120.704310]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.705490]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
      [21120.706757]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.708156]  [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs]
      [21120.709892]  [<ffffffffa054bb86>] ? btree_write_cache_pages+0x273/0x385 [btrfs]
      [21120.711605]  [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs]
      [21120.723440]  [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs]
      [21120.724943]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
      [21120.726008]  [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa
      [21120.727230]  [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b
      [21120.728526]  [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b
      [21120.729701]  [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b
      (...)
      [21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds.
      [21120.749459]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.768457] Call Trace:
      [21120.769039]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.770107]  [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs]
      [21120.771558]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.773659]  [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs]
      [21120.776257]  [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs]
      [21120.777755]  [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs]
      [21120.779459]  [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs]
      [21120.781153]  [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs]
      [21120.783918]  [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [21120.785436]  [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f
      [21120.786434]  [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      (...)
      [21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds.
      [21120.890526]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      (...)
      [21120.899960] Call Trace:
      [21120.900743]  [<ffffffff8143107e>] schedule+0x74/0x83
      [21120.903004]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
      [21120.904383]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [21120.905608]  [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs]
      [21120.906812]  [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c
      [21120.907874]  [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
      [21120.909551]  [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
      [21120.910914]  [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs]
      [21120.912181]  [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs]
      [21120.913784]  [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs]
      [21120.915374]  [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [21120.916735]  [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [21120.917996]  [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [21120.919478]  [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs]
      [21120.921226]  [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs]
      [21120.923121]  [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs]
      [21120.924449]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [21120.926602]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
      [21120.927769]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
      [21120.929324]  [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
      [21120.930723]  [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs]
      [21120.931897]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
      [21120.934446]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
      [21120.935528]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
      (...)
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      24b89d08
    • F
      Btrfs: fix race between start dirty bg cache writeout and bg deletion · b58d1a9e
      Filipe Manana 提交于
      While running xfstests I ran into the following:
      
      [20892.242791] ------------[ cut here ]------------
      [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [20892.245874] BTRFS: Transaction aborted (error -2)
      [20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$
      [20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
      [20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [20892.264738]  0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2
      [20892.266244]  ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48
      [20892.267761]  ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0
      [20892.269378] Call Trace:
      [20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
      [20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
      [20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
      [20892.281781]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
      [20892.282873]  [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs]
      [20892.284111]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
      [20892.285203]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
      [20892.286290]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [20892.287469]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
      [20892.288412]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
      [20892.289348]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
      [20892.290255]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [20892.291316] ---[ end trace 597f77e664245373 ]---
      [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
      [20892.297390] BTRFS info (device sdg): forced readonly
      
      This happens because in btrfs_start_dirty_block_groups() we splice the
      transaction's list of dirty block groups into a local list and then we
      keep extracting the first element of the list without holding the
      cache_write_mutex mutex. This means that before we acquire that mutex
      the first block group on the list might be removed by a conurrent task
      running btrfs_remove_block_group(). So make sure we extract the first
      element (and test the list emptyness) while holding that mutex.
      
      Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                            outside critical section in commit")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b58d1a9e
  7. 25 4月, 2015 1 次提交
  8. 24 4月, 2015 1 次提交
    • C
      Btrfs: fix inode cache writeout · 85db36cf
      Chris Mason 提交于
      The code to fix stalls during free spache cache IO wasn't using
      the correct root when waiting on the IO for inode caches.  This
      is only a problem when the inode cache is enabled with
      
      mount -o inode_cache
      
      This fixes the inode cache writeout to preserve any error values and
      makes sure not to override the root when inode cache writeout is done.
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      85db36cf
  9. 13 4月, 2015 20 次提交