1. 25 8月, 2019 1 次提交
    • F
      Btrfs: fix deadlock between fiemap and transaction commits · f833deae
      Filipe Manana 提交于
      [ Upstream commit a6d155d2e363f26290ffd50591169cb96c2a609e ]
      
      The fiemap handler locks a file range that can have unflushed delalloc,
      and after locking the range, it tries to attach to a running transaction.
      If the running transaction started its commit, that is, it is in state
      TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
      flushoncommit option or the transaction is creating a snapshot for the
      subvolume that contains the file that fiemap is operating on, we end up
      deadlocking. This happens because fiemap is blocked on the transaction,
      waiting for it to complete, and the transaction is waiting for the flushed
      dealloc to complete, which requires locking the file range that the fiemap
      task already locked. The following stack traces serve as an example of
      when this deadlock happens:
      
        (...)
        [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [404571.515956] Call Trace:
        [404571.516360]  ? __schedule+0x3ae/0x7b0
        [404571.516730]  schedule+0x3a/0xb0
        [404571.517104]  lock_extent_bits+0x1ec/0x2a0 [btrfs]
        [404571.517465]  ? remove_wait_queue+0x60/0x60
        [404571.517832]  btrfs_finish_ordered_io+0x292/0x800 [btrfs]
        [404571.518202]  normal_work_helper+0xea/0x530 [btrfs]
        [404571.518566]  process_one_work+0x21e/0x5c0
        [404571.518990]  worker_thread+0x4f/0x3b0
        [404571.519413]  ? process_one_work+0x5c0/0x5c0
        [404571.519829]  kthread+0x103/0x140
        [404571.520191]  ? kthread_create_worker_on_cpu+0x70/0x70
        [404571.520565]  ret_from_fork+0x3a/0x50
        [404571.520915] kworker/u8:6    D    0 31651      2 0x80004000
        [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
        (...)
        [404571.537000] fsstress        D    0 13117  13115 0x00004000
        [404571.537263] Call Trace:
        [404571.537524]  ? __schedule+0x3ae/0x7b0
        [404571.537788]  schedule+0x3a/0xb0
        [404571.538066]  wait_current_trans+0xc8/0x100 [btrfs]
        [404571.538349]  ? remove_wait_queue+0x60/0x60
        [404571.538680]  start_transaction+0x33c/0x500 [btrfs]
        [404571.539076]  btrfs_check_shared+0xa3/0x1f0 [btrfs]
        [404571.539513]  ? extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.539866]  extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.540170]  do_vfs_ioctl+0x526/0x6f0
        [404571.540436]  ksys_ioctl+0x70/0x80
        [404571.540734]  __x64_sys_ioctl+0x16/0x20
        [404571.540997]  do_syscall_64+0x60/0x1d0
        [404571.541279]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
        [404571.543729] btrfs           D    0 14210  14208 0x00004000
        [404571.544023] Call Trace:
        [404571.544275]  ? __schedule+0x3ae/0x7b0
        [404571.544526]  ? wait_for_completion+0x112/0x1a0
        [404571.544795]  schedule+0x3a/0xb0
        [404571.545064]  schedule_timeout+0x1ff/0x390
        [404571.545351]  ? lock_acquire+0xa6/0x190
        [404571.545638]  ? wait_for_completion+0x49/0x1a0
        [404571.545890]  ? wait_for_completion+0x112/0x1a0
        [404571.546228]  wait_for_completion+0x131/0x1a0
        [404571.546503]  ? wake_up_q+0x70/0x70
        [404571.546775]  btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
        [404571.547159]  btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
        [404571.547449]  ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
        [404571.547703]  ? remove_wait_queue+0x60/0x60
        [404571.547969]  btrfs_mksubvol+0x605/0x640 [btrfs]
        [404571.548226]  ? __sb_start_write+0xd4/0x1c0
        [404571.548512]  ? mnt_want_write_file+0x24/0x50
        [404571.548789]  btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
        [404571.549048]  btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
        [404571.549307]  btrfs_ioctl+0x133f/0x3150 [btrfs]
        [404571.549549]  ? mem_cgroup_charge_statistics+0x4c/0xd0
        [404571.549792]  ? mem_cgroup_commit_charge+0x84/0x4b0
        [404571.550064]  ? __handle_mm_fault+0xe3e/0x11f0
        [404571.550306]  ? do_raw_spin_unlock+0x49/0xc0
        [404571.550608]  ? _raw_spin_unlock+0x24/0x30
        [404571.550976]  ? __handle_mm_fault+0xedf/0x11f0
        [404571.551319]  ? do_vfs_ioctl+0xa2/0x6f0
        [404571.551659]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [404571.552087]  do_vfs_ioctl+0xa2/0x6f0
        [404571.552355]  ksys_ioctl+0x70/0x80
        [404571.552621]  __x64_sys_ioctl+0x16/0x20
        [404571.552864]  do_syscall_64+0x60/0x1d0
        [404571.553104]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
      
      If we were joining the transaction instead of attaching to it, we would
      not risk a deadlock because a join only blocks if the transaction is in a
      state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
      flush performed by a transaction is done before it reaches that state,
      when it is in the state TRANS_STATE_COMMIT_START. However a transaction
      join is intended for use cases where we do modify the filesystem, and
      fiemap only needs to peek at delayed references from the current
      transaction in order to determine if extents are shared, and, besides
      that, when there is no current transaction or when it blocks to wait for
      a current committing transaction to complete, it creates a new transaction
      without reserving any space. Such unnecessary transactions, besides doing
      unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
      rotation of the precious backup roots.
      
      So fix this by adding a new transaction join variant, named join_nostart,
      which behaves like the regular join, but it does not create a transaction
      when none currently exists or after waiting for a committing transaction
      to complete.
      
      Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f833deae
  2. 06 8月, 2018 1 次提交
  3. 30 5月, 2018 1 次提交
  4. 18 4月, 2018 1 次提交
    • Q
      btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638
      Qu Wenruo 提交于
      Unlike previous method that tries to commit transaction inside
      qgroup_reserve(), this time we will try to commit transaction using
      fs_info->transaction_kthread to avoid nested transaction and no need to
      worry about locking context.
      
      Since it's an asynchronous function call and we won't wait for
      transaction commit, unlike previous method, we must call it before we
      hit the qgroup limit.
      
      So this patch will use the ratio and size of qgroup meta_pertrans
      reservation as indicator to check if we should trigger a transaction
      commit.  (meta_prealloc won't be cleaned in transaction committ, it's
      useless anyway)
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a514d638
  5. 12 4月, 2018 1 次提交
  6. 31 3月, 2018 1 次提交
  7. 26 3月, 2018 2 次提交
  8. 22 1月, 2018 6 次提交
  9. 18 4月, 2017 2 次提交
  10. 06 12月, 2016 3 次提交
  11. 27 9月, 2016 1 次提交
  12. 26 7月, 2016 1 次提交
  13. 18 6月, 2016 1 次提交
    • J
      btrfs: account for non-CoW'd blocks in btrfs_abort_transaction · 64c12921
      Jeff Mahoney 提交于
      The test for !trans->blocks_used in btrfs_abort_transaction is
      insufficient to determine whether it's safe to drop the transaction
      handle on the floor.  btrfs_cow_block, informed by should_cow_block,
      can return blocks that have already been CoW'd in the current
      transaction.  trans->blocks_used is only incremented for new block
      allocations. If an operation overlaps the blocks in the current
      transaction entirely and must abort the transaction, we'll happily
      let it clean up the trans handle even though it may have modified
      the blocks and will commit an incomplete operation.
      
      In the long-term, I'd like to do closer tracking of when the fs
      is actually modified so we can still recover as gracefully as possible,
      but that approach will need some discussion.  In the short term,
      since this is the only code using trans->blocks_used, let's just
      switch it to a bool indicating whether any blocks were used and set
      it when should_cow_block returns false.
      
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      64c12921
  14. 26 5月, 2016 1 次提交
  15. 07 1月, 2016 2 次提交
  16. 10 12月, 2015 1 次提交
    • F
      Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013
      Filipe Manana 提交于
      As of my previous change titled "Btrfs: fix scrub preventing unused block
      groups from being deleted", the following warning at
      extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
      filesysten with "-o discard":
      
       10263  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
       10264  {
       (...)
       10405                  if (trimming) {
       10406                          WARN_ON(!list_empty(&block_group->bg_list));
       10407                          spin_lock(&trans->transaction->deleted_bgs_lock);
       10408                          list_move(&block_group->bg_list,
       10409                                    &trans->transaction->deleted_bgs);
       10410                          spin_unlock(&trans->transaction->deleted_bgs_lock);
       10411                          btrfs_get_block_group(block_group);
       10412                  }
       (...)
      
      This happens because scrub can now add back the block group to the list of
      unused block groups (fs_info->unused_bgs). This is dangerous because we
      are moving the block group from the unused block groups list to the list
      of deleted block groups without holding the lock that protects the source
      list (fs_info->unused_bgs_lock).
      
      The following diagram illustrates how this happens:
      
                  CPU 1                                     CPU 2
      
       cleaner_kthread()
         btrfs_delete_unused_bgs()
      
           sees bg X in list
            fs_info->unused_bgs
      
           deletes bg X from list
            fs_info->unused_bgs
      
                                                  scrub_enumerate_chunks()
      
                                                    searches device tree using
                                                    its commit root
      
                                                    finds device extent for
                                                    block group X
      
                                                    gets block group X from the tree
                                                    fs_info->block_group_cache_tree
                                                    (via btrfs_lookup_block_group())
      
                                                    sets bg X to RO (again)
      
                                                    scrub_chunk(bg X)
      
                                                    sets bg X back to RW mode
      
                                                    adds bg X to the list
                                                    fs_info->unused_bgs again,
                                                    since it's still unused and
                                                    currently not in that list
      
           sets bg X to RO mode
      
           btrfs_remove_chunk(bg X)
      
           --> discard is enabled and bg X
               is in the fs_info->unused_bgs
               list again so the warning is
               triggered
           --> we move it from that list into
               the transaction's delete_bgs
               list, but we can have another
               task currently manipulating
               the first list (fs_info->unused_bgs)
      
      Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
      the list of unused block groups and the list of deleted block groups. This
      makes it safe and there's not much worry for more lock contention, as this
      lock is seldom used and only the cleaner kthread adds elements to the list
      of deleted block groups. The warning goes away too, as this was previously
      an impossible case (and would have been better a BUG_ON/ASSERT) but it's
      not impossible anymore.
      Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      348a0013
  17. 25 11月, 2015 1 次提交
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana 提交于
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eab77ff
  18. 22 10月, 2015 5 次提交
  19. 06 10月, 2015 1 次提交
    • F
      Btrfs: fix deadlock when finalizing block group creation · d9a0540a
      Filipe Manana 提交于
      Josef ran into a deadlock while a transaction handle was finalizing the
      creation of its block groups, which produced the following trace:
      
        [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
        [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
        [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
        [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
        [260445.593137] Call Trace:
        [260445.593145]  [<ffffffff8175a437>] schedule+0x37/0x80
        [260445.593189]  [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [260445.593197]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [260445.593225]  [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [260445.593253]  [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [260445.593295]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593324]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593351]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593394]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593427]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593459]  [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
        [260445.593491]  [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
        [260445.593524]  [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
        [260445.593532]  [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
        [260445.593564]  [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [260445.593597]  [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
        [260445.593626]  [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [260445.593654]  [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [260445.593682]  [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [260445.593724]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
        [260445.593752]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [260445.593830]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [260445.593905]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
        [260445.593946]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
        [260445.593990]  [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
        [260445.594042]  [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [260445.594089]  [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
        [260445.594115]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [260445.594133]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [260445.594149]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [260445.594169]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [260445.594187]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [260445.594204]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      This happened because the same transaction handle created a large number
      of block groups and while finalizing their creation (inserting new items
      and updating existing items in the chunk and device trees) a new metadata
      extent had to be allocated and no free space was found in the current
      metadata block groups, which made find_free_extent() attempt to allocate
      a new block group via do_chunk_alloc(). However at do_chunk_alloc() we
      ended up allocating a new system chunk too and exceeded the threshold
      of 2Mb of reserved chunk bytes, which makes do_chunk_alloc() enter the
      final part of block group creation again (at
      btrfs_create_pending_block_groups()) and attempt to lock again the root
      of the chunk tree when it's already write locked by the same task.
      
      Similarly we can deadlock on extent tree nodes/leafs if while we are
      running delayed references we end up creating a new metadata block group
      in order to allocate a new node/leaf for the extent tree (as part of
      a CoW operation or growing the tree), as btrfs_create_pending_block_groups
      inserts items into the extent tree as well. In this case we get the
      following trace:
      
        [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
        [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
        [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
        [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
        [14242.773606] Call Trace:
        [14242.773613]  [<ffffffff8175a437>] schedule+0x37/0x80
        [14242.773656]  [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
        [14242.773664]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
        [14242.773692]  [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
        [14242.773720]  [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
        [14242.773750]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.773758]  [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
        [14242.773786]  [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
        [14242.773818]  [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
        [14242.773850]  [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
        [14242.773934]  [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
        [14242.773998]  [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
        [14242.774041]  [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
        [14242.774078]  [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
        [14242.774118]  [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
        [14242.774155]  [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
        [14242.774194]  [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
        [14242.774235]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
        [14242.774274]  [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
        [14242.774318]  [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
        [14242.774358]  [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
        [14242.774391]  [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
        [14242.774432]  [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
        [14242.774474]  [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
        [14242.774516]  [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
        [14242.774558]  [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
        [14242.774599]  [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
        [14242.774642]  [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
        [14242.774650]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
        [14242.774657]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
        [14242.774663]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
        [14242.774669]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
        [14242.774675]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
        [14242.774681]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      Fix this by never recursing into the finalization phase of block group
      creation and making sure we never trigger the finalization of block group
      creation while running delayed references.
      Reported-by: NJosef Bacik <jbacik@fb.com>
      Fixes: 00d80e34 ("Btrfs: fix quick exhaustion of the system array in the superblock")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      d9a0540a
  20. 23 9月, 2015 1 次提交
    • J
      Btrfs: keep dropped roots in cache until transaction commit · 2b9dbef2
      Josef Bacik 提交于
      When dropping a snapshot we need to account for the qgroup changes.  If we drop
      the snapshot in all one go then the backref code will fail to find blocks from
      the snapshot we dropped since it won't be able to find the root in the fs root
      cache.  This can lead to us failing to find refs from other roots that pointed
      at blocks in the now deleted root.  To handle this we need to not remove the fs
      roots from the cache until after we process the qgroup operations.  Do this by
      adding dropped roots to a list on the transaction, and letting the transaction
      remove the roots at the same time it drops the commit roots.  This will keep all
      of the backref searching code in sync properly, and fixes a problem Mark was
      seeing with snapshot delete and qgroups.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2b9dbef2
  21. 29 7月, 2015 1 次提交
  22. 11 6月, 2015 1 次提交
  23. 03 6月, 2015 1 次提交
    • F
      Btrfs: fix -ENOSPC when finishing block group creation · 4fbcdf66
      Filipe Manana 提交于
      While creating a block group, we often end up getting ENOSPC while updating
      the chunk tree, which leads to a transaction abortion that produces a trace
      like the following:
      
      [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [30670.117777] BTRFS: Transaction aborted (error -28)
      (...)
      [30670.163567] Call Trace:
      [30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
      [30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
      [30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
      [30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
      [30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
      
      This is because we don't do proper space reservation for the chunk block
      reserve when we have multiple tasks allocating chunks in parallel.
      
      So block group creation has 2 phases, and the first phase essentially
      checks if there is enough space in the system space_info, allocating a
      new system chunk if there isn't, while the second phase updates the
      device, extent and chunk trees. However, because the updates to the
      chunk tree happen in the second phase, if we have N tasks, each with
      its own transaction handle, allocating new chunks in parallel and if
      there is only enough space in the system space_info to allocate M chunks,
      where M < N, none of the tasks ends up allocating a new system chunk in
      the first phase and N - M tasks will get -ENOSPC when attempting to
      update the chunk tree in phase 2 if they need to COW any nodes/leafs
      from the chunk tree.
      
      Fix this by doing proper reservation in the chunk block reserve.
      
      The issue could be reproduced by running fstests generic/038 in a loop,
      which eventually triggered the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4fbcdf66
  24. 11 4月, 2015 2 次提交
    • C
      Btrfs: allow block group cache writeout outside critical section in commit · 1bbc621e
      Chris Mason 提交于
      We loop through all of the dirty block groups during commit and write
      the free space cache.  In order to make sure the cache is currect, we do
      this while no other writers are allowed in the commit.
      
      If a large number of block groups are dirty, this can introduce long
      stalls during the final stages of the commit, which can block new procs
      trying to change the filesystem.
      
      This commit changes the block group cache writeout to take appropriate
      locks and allow it to run earlier in the commit.  We'll still have to
      redo some of the block groups, but it means we can get most of the work
      out of the way without blocking the entire FS.
      Signed-off-by: NChris Mason <clm@fb.com>
      1bbc621e
    • J
      Btrfs: reserve space for block groups · cb723e49
      Josef Bacik 提交于
      This changes our delayed refs calculations to include the space needed
      to write back dirty block groups.
      Signed-off-by: NChris Mason <clm@fb.com>
      cb723e49
  25. 27 3月, 2015 1 次提交
    • F
      Btrfs: fix metadata inconsistencies after directory fsync · 2f2ff0ee
      Filipe Manana 提交于
      We can get into inconsistency between inodes and directory entries
      after fsyncing a directory. The issue is that while a directory gets
      the new dentries persisted in the fsync log and replayed at mount time,
      the link count of the inode that directory entries point to doesn't
      get updated, staying with an incorrect link count (smaller then the
      correct value). This later leads to stale file handle errors when
      accessing (including attempt to delete) some of the links if all the
      other ones are removed, which also implies impossibility to delete the
      parent directories, since the dentries can not be removed.
      
      Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
      when fsyncing a directory, new files aren't logged (their metadata and
      dentries) nor any child directories. So this patch fixes this issue too,
      since it has the same resolution as the incorrect inode link count issue
      mentioned before.
      
      This is very easy to reproduce, and the following excerpt from my test
      case for xfstests shows how:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file and directory.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
        mkdir $SCRATCH_MNT/mydir
      
        # Make sure all metadata and data are durably persisted.
        sync
      
        # Add a hard link to 'foo' inside our test directory and fsync only the
        # directory. The btrfs fsync implementation had a bug that caused the new
        # directory entry to be visible after the fsync log replay but, the inode
        # of our file remained with a link count of 1.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
      
        # Add a few more links and new files.
        # This is just to verify nothing breaks or gives incorrect results after the
        # fsync log is replayed.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
        $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
        ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
      
        # Add some subdirectories and new files and links to them. This is to verify
        # that after fsyncing our top level directory 'mydir', all the subdirectories
        # and their files/links are registered in the fsync log and exist after the
        # fsync log is replayed.
        mkdir -p $SCRATCH_MNT/mydir/x/y/z
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
        touch $SCRATCH_MNT/mydir/x/y/z/qwerty
      
        # Now fsync only our top directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
      
        # And fsync now our new file named 'hello', just to verify later that it has
        # the expected content and that the previous fsync on the directory 'mydir' had
        # no bad influence on this fsync.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
        # all with the value 0xaa.
        echo "File 'foo' content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Remove the first name of our inode. Because of the directory fsync bug, the
        # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
        # deleting the inode and the other names became stale directory entries (still
        # visible to applications). Attempting to remove or access the remaining
        # dentries pointing to that inode resulted in stale file handle errors and
        # made it impossible to remove the parent directories since it was impossible
        # for them to become empty.
        echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
        rm -f $SCRATCH_MNT/foo
      
        # Now verify that all files, links and directories created before fsyncing our
        # directory exist after the fsync log was replayed.
        [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
        [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
        [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
        [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
            echo "Link mydir/x/y/foo_y_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
            echo "Link mydir/x/y/z/foo_z_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
            echo "File mydir/x/y/z/qwerty is missing"
      
        # We expect our file here to have a size of 64Kb and all the bytes having the
        # value 0xff.
        echo "file 'hello' content after log replay:"
        od -t x1 $SCRATCH_MNT/hello
      
        # Now remove all files/links, under our test directory 'mydir', and verify we
        # can remove all the directories.
        rm -f $SCRATCH_MNT/mydir/x/y/z/*
        rmdir $SCRATCH_MNT/mydir/x/y/z
        rm -f $SCRATCH_MNT/mydir/x/y/*
        rmdir $SCRATCH_MNT/mydir/x/y
        rmdir $SCRATCH_MNT/mydir/x
        rm -f $SCRATCH_MNT/mydir/*
        rmdir $SCRATCH_MNT/mydir
      
        # An fsck, run by the fstests framework everytime a test finishes, also detected
        # the inconsistency and printed the following error message:
        #
        # root 5 inode 257 errors 2001, no inode item, link count wrong
        #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
        #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
        status=0
        exit
      
      The expected golden output for the test is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 5
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
      
      Which is the output after this patch and when running the test against
      ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
      output is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 1
        Link mydir/foo_2 is missing
        Link mydir/foo_3 is missing
        Link mydir/x/y/foo_y_link is missing
        Link mydir/x/y/z/foo_z_link is missing
        File mydir/x/y/z/qwerty is missing
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
      
      Fsck, without this fix, also complains about the wrong link count:
      
        root 5 inode 257 errors 2001, no inode item, link count wrong
            unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
            unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
      So fix this by logging the inodes that the dentries point to when
      fsyncing a directory.
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2f2ff0ee