1. 25 May, 2020 3 commits
  2. 30 April, 2020 1 commit
  3. 24 March, 2020 10 commits
    • J
      btrfs: do not resolve backrefs for roots that are being deleted · 39dba873
      Josef Bacik committed
      Zygo reported a deadlock where a task was stuck in the inode logical
      resolve code.  The deadlock looks like this:
      
        Task 1
        btrfs_ioctl_logical_to_ino
        ->iterate_inodes_from_logical
         ->iterate_extent_inodes
          ->path->search_commit_root isn't set, so a transaction is started
            ->resolve_indirect_ref for a root that's being deleted
              ->search for our key, attempt to lock a node, DEADLOCK
      
        Task 2
        btrfs_drop_snapshot
        ->walk down to a leaf, lock it, walk up, lock node
         ->end transaction
          ->start transaction
            -> wait_cur_trans
      
        Task 3
        btrfs_commit_transaction
        ->wait_event(cur_trans->write_wait, num_writers == 1) DEADLOCK
      
      We are holding a transaction open in btrfs_ioctl_logical_to_ino while we
      try to resolve our references.  btrfs_drop_snapshot() holds onto its
      locks while it stops and starts transaction handles, because it assumes
      nobody is going to touch the root now.  Commit just does what commit
      does, waiting for the writers to finish, blocking any new trans handles
      from starting.
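
The three-way dependency described above can be checked mechanically with a wait-for graph. The following is a toy user-space model, not kernel code: the task labels and the helpers `find_cycle` and `without_deleted_root_walk` are hypothetical names, and an edge simply means "cannot proceed until the other task makes progress".

```python
# Toy wait-for graph for the deadlock described above (not kernel code).
# An edge X -> Y means "X cannot proceed until Y makes progress".
WAITS_FOR = {
    "task1_logical_ino": ["task2_drop_snapshot"],  # wants a node lock Task 2 holds
    "task2_drop_snapshot": ["task3_commit"],       # wait_cur_trans: blocked by the commit
    "task3_commit": ["task1_logical_ino"],         # waits for Task 1's open handle to end
}


def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in visiting:
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


def without_deleted_root_walk(graph):
    """Model the fix: Task 1 never walks into the root being deleted,
    so it never waits on a node lock held by Task 2."""
    fixed = dict(graph)
    fixed["task1_logical_ino"] = []
    return fixed
```

In this model the original graph contains a cycle through all three tasks, and removing Task 1's edge (the effect of skipping roots that are being deleted) leaves it acyclic.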
      
      Fix this by making the backref code not try to resolve backrefs of roots
      that are currently being deleted.  This will keep us from walking into a
      snapshot that's currently being deleted.
      
      This problem was harder to hit before because we rarely broke out of the
      snapshot delete halfway through, but with my delayed ref throttling code
      it happened much more often.  However we've always been able to do this,
      so it's not a new problem.
      
      Fixes: 8da6d581 ("Btrfs: added btrfs_find_all_roots()")
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      39dba873
    • J
      btrfs: kill the subvol_srcu · c75e8394
      Josef Bacik committed
      Now that we have proper root ref counting everywhere we can kill the
      subvol_srcu.
      
      * removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes
      
      * the refcount_t used for the references checks for accidental 0->1
        in cases where the root lifetime would not be properly protected
      
      * there's a leak detector for roots to catch unfreed roots at umount
        time
      
      * SRCU served us well over the years but it was not a proper
        synchronization mechanism for some cases
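
Two of the points above, the check for accidental 0->1 transitions and the leak detector at umount time, can be illustrated with a small user-space model. This is only a sketch: `RootRef` and `leaked_roots` are hypothetical names, and raising an exception stands in for whatever the kernel's refcount_t does when it detects the bad transition.

```python
class RootRef:
    """Toy stand-in for a refcounted root (hypothetical, not the kernel type)."""

    live = set()  # leak detector: every root that still holds a reference

    def __init__(self, root_id):
        self.root_id = root_id
        self.refs = 1
        RootRef.live.add(self)

    def grab(self):
        # Refuse to resurrect an object whose count already dropped to zero.
        if self.refs == 0:
            raise RuntimeError(f"root {self.root_id}: 0 -> 1 refcount transition")
        self.refs += 1
        return self

    def put(self):
        if self.refs == 0:
            raise RuntimeError(f"root {self.root_id}: refcount underflow")
        self.refs -= 1
        if self.refs == 0:
            RootRef.live.discard(self)


def leaked_roots():
    """Ids of roots that are still referenced, e.g. when checked at 'umount'."""
    return sorted(root.root_id for root in RootRef.live)
```

A root whose last reference was dropped can no longer be grabbed, and any root that was never fully released shows up in `leaked_roots()`.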
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      c75e8394
    • Q
      btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves · 19b546d7
      Qu Wenruo committed
      In relocation, we need to locate all parent tree leaves referring to one
      data extent, thus we have a complex mechanism to iterate through the extent
      tree and subvolume trees to locate the related leaves.
      
      However this is already done in backref.c, we have
      btrfs_find_all_leafs(), which can return a ulist containing all leaves
      referring to that data extent.
      
      Use btrfs_find_all_leafs() to replace find_data_references().
      
      There is special handling for v1 space cache data extents: we need to
      delete them so that they do not hang data relocation.
      
      In this patch, the special handling is done by re-iterating the root
      tree leaf.  Although it's a little less efficient than the old handling,
      considering we can reuse a lot of code, it should be acceptable.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      19b546d7
    • E
      btrfs: backref, use correct count to resolve normal data refs · b25b0b87
      ethanwu committed
      With the following patches:
      
      - btrfs: backref, only collect file extent items matching backref offset
      - btrfs: backref, not adding refs from shared block when resolving normal backref
      - btrfs: backref, only search backref entries from leaves of the same root
      
      we only collect the normal data refs we want, so the imprecise upper
      bound total_refs of that EXTENT_ITEM could now be changed to the count
      of the normal backref entry we want to search.
      
      Background and how the patches fit together:
      
      Btrfs has two types of data backref.
      For BTRFS_EXTENT_DATA_REF_KEY type of backref, we don't have the
      exact block number. Therefore, we need to call resolve_indirect_refs.
      It uses btrfs_search_slot to locate the leaf block. Then
      we need to walk through the leaves to search for the EXTENT_DATA items
      that have disk bytenr matching the extent item (add_all_parents).
      
      When resolving indirect refs, we could take entries that don't
      belong to the backref entry we are searching for right now.
      For that reason when searching backref entry, we always use total
      refs of that EXTENT_ITEM rather than individual count.
      
      For example:
      item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize
        extent refs 24 gen 7302 flags DATA
        shared data backref parent 394985472 count 10 #1
        extent data backref root 257 objectid 260 offset 1048576 count 3 #2
        extent data backref root 256 objectid 260 offset 65536 count 6 #3
        extent data backref root 257 objectid 260 offset 65536 count 5 #4
      
      For example, when searching backref entry #4, we'll use total_refs
      24, a very loose loop ending condition, instead of total_refs = 5.
      
      But using total_refs = 24 is not accurate. Sometimes we'll never find
      all the refs from a specific root.  As a result, the loop keeps going
      until we reach the end of that inode.
      
      The first 3 patches handle 3 different types of refs we might encounter.
      These refs do not belong to the normal backref we are searching for, and
      hence need to be skipped.
      
      This patch changes total_refs to the correct number so that we can
      end the loop as soon as we find all the refs we want.
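
The effect of the loop bound can be seen in a toy model of the leaf walk. This is a sketch under simplifying assumptions, not the kernel loop: `walk_inode_items` is a hypothetical helper, each item is reduced to a `(disk_bytenr, backref_offset)` pair, and the count of 995 unrelated items is made up for illustration. The values 24 and 5 are the total refs and the count of entry #4 from the example above.

```python
def walk_inode_items(items, wanted, limit):
    """Scan file extent items in key order; stop once `limit` refs are found.

    Returns (refs_found, items_visited).
    """
    found = visited = 0
    for item in items:
        if found >= limit:
            break
        visited += 1
        if item == wanted:
            found += 1
    return found, visited


# Entry #4 from the example: root 257, objectid 260, offset 65536, count 5.
WANTED = (40831553536, 65536)
# Hypothetical inode: the 5 matching items first, then 995 unrelated items.
ITEMS = [WANTED] * 5 + [(12345, 0)] * 995

loose = walk_inode_items(ITEMS, WANTED, limit=24)  # old bound: total refs of the extent
exact = walk_inode_items(ITEMS, WANTED, limit=5)   # new bound: count of this backref entry
```

With the loose bound the walk never reaches 24 matches and runs to the end of the inode; with the exact bound it stops right after the fifth match.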
      
      btrfs send uses backref to find possible clone sources, the following
      is a simple test to compare the results with and without this patch:
      
       $ btrfs subvolume create /sub1
       $ for i in `seq 1 163840`; do
           dd if=/dev/zero of=/sub1/file bs=64K count=1 seek=$((i-1)) conv=notrunc oflag=direct
         done
       $ btrfs subvolume snapshot /sub1 /sub2
       $ for i in `seq 1 163840`; do
           dd if=/dev/zero of=/sub1/file bs=4K count=1 seek=$(((i-1)*16+10)) conv=notrunc oflag=direct
         done
       $ btrfs subvolume snapshot -r /sub1 /snap1
       $ time btrfs send /snap1 | btrfs receive /volume2
      
      Without this patch:
      
      real 69m48.124s
      user 0m50.199s
      sys  70m15.600s
      
      With this patch:
      
      real    1m59.683s
      user    0m35.421s
      sys     2m42.684s
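
The numbers in the test above can be sanity-checked with a little arithmetic. This is only a worked calculation based on the dd parameters and the reported `real` times; the helper names are hypothetical.

```python
KIB = 1024

# First loop: 163840 direct writes of 64K each, placed back to back.
extents = 163840
file_size = extents * 64 * KIB


def second_write_offset(i):
    """Byte offset of the i-th 4K write (i starts at 1), from seek=(i-1)*16+10."""
    return ((i - 1) * 16 + 10) * 4 * KIB


def seconds(minutes, secs):
    """Convert a 'XmY.Zs' time as printed by `time` into seconds."""
    return minutes * 60 + secs


# Wall-clock ratio of the two `real` times reported above.
speedup = seconds(69, 48.124) / seconds(1, 59.683)
```

The file is 10 GiB, each 4K overwrite lands 40 KiB into the i-th 64 KiB extent, and the wall-clock time drops by a factor of roughly 35.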
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: ethanwu <ethanwu@synology.com>
      [ add patchset cover letter with background and numbers ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      b25b0b87
    • E
      btrfs: backref, only search backref entries from leaves of the same root · cfc0eed0
      ethanwu committed
      We could have some nodes/leaves in a subvolume whose owner is not that
      subvolume. In that case, when we resolve normal backrefs of that
      subvolume, we should avoid collecting references from these blocks.
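
A minimal sketch of the idea, assuming each tree block records the id of the root that owns it. `Leaf` and `leaves_owned_by` are hypothetical names and the block numbers are made up; this is not the kernel code.

```python
from collections import namedtuple

# owner: id of the root recorded as the owner of the block
Leaf = namedtuple("Leaf", "bytenr owner")


def leaves_owned_by(leaves, root_id):
    """Keep only the leaves whose recorded owner matches the root being resolved."""
    return [leaf for leaf in leaves if leaf.owner == root_id]


# Leaves reachable from root 257; the second one is still owned by root 256,
# for example because it is shared with the subvolume 257 was snapshotted from.
REACHABLE = [Leaf(30408704, 257), Leaf(30425088, 256), Leaf(30441472, 257)]
own = leaves_owned_by(REACHABLE, 257)
```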
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: ethanwu <ethanwu@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cfc0eed0
    • E
      btrfs: backref, don't add refs from shared block when resolving normal backref · ed58f2e6
      ethanwu committed
      All references from the block of SHARED_DATA_REF belong to that shared
      block backref.
      
      For example:
      
        item 11 key (40831553536 EXTENT_ITEM 4194304) itemoff 15460 itemsize 95
            extent refs 24 gen 7302 flags DATA
            extent data backref root 257 objectid 260 offset 65536 count 5
            extent data backref root 258 objectid 265 offset 0 count 9
            shared data backref parent 394985472 count 10
      
      Block 394985472 might be a leaf from root 257, and the item objectid and
      (file_pos - file_extent_item::offset) in that leaf just happen to be
      260 and 65536, which are equal to the first extent data backref entry.
      
      Before this patch, when we resolve backref:
      
        root 257 objectid 260 offset 65536
      
      we will add those refs in block 394985472 and wrongly treat those as the
      refs we want.
      
      Fix this by checking if the leaf we are processing is a shared data
      backref; if so, just skip this leaf.

      Shared data refs added into preftrees.direct have all entry values = 0
      (root_id = 0, key = NULL, level = 0) except the parent entry.

      Other refs from the indirect tree will have a key value and root id != 0,
      and these values won't be changed when their parent is resolved and added
      to preftrees.direct. Therefore, we can reuse preftrees.direct and search
      for a ref with all values = 0 except parent, to avoid matching the blocks
      of those resolved refs.
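
The lookup described above can be sketched as a membership test on simplified ref records. This is an illustration only: `Ref` keeps just the four fields the text mentions, the helper name and the second block number are hypothetical, and a plain list stands in for preftrees.direct.

```python
from collections import namedtuple

# Simplified prelim ref: only the fields the text mentions.
Ref = namedtuple("Ref", "root_id key level parent")


def is_shared_data_backref(direct_refs, leaf_bytenr):
    """True if `leaf_bytenr` was added as the parent of a shared data ref.

    Such entries have root_id = 0, key = None and level = 0; resolved indirect
    refs keep their non-zero root_id and their key, so they never match.
    """
    target = Ref(root_id=0, key=None, level=0, parent=leaf_bytenr)
    return target in direct_refs


DIRECT = [
    Ref(0, None, 0, 394985472),                         # shared data backref parent
    Ref(257, (260, "EXTENT_DATA", 65536), 0, 30408704),  # resolved indirect ref
]
```

A leaf at 394985472 is skipped while resolving the normal backref, whereas the block of the already resolved ref is not mistaken for a shared one.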
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: ethanwu <ethanwu@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed58f2e6
    • E
      btrfs: backref, only collect file extent items matching backref offset · 7ac8b88e
      ethanwu committed
      When resolving one backref of type EXTENT_DATA_REF, we collect all
      references that simply reference the EXTENT_ITEM even though their
      (file_pos - file_extent_item::offset) are not the same as the
      btrfs_extent_data_ref::offset we are searching for.
      
      This patch adds an additional check so that we only collect references whose
      (file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.
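
As a worked example of that condition, here is a small sketch with made-up file extent items. `matches_backref` is a hypothetical helper and the item values are invented for illustration.

```python
def matches_backref(file_pos, extent_offset, backref_offset):
    """Check (file_pos - file_extent_item::offset) == btrfs_extent_data_ref::offset.

    file_pos is the key offset of the file extent item, extent_offset is
    file_extent_item::offset, backref_offset is the offset being searched for.
    """
    return file_pos - extent_offset == backref_offset


BACKREF_OFFSET = 65536
# (file_pos, file_extent_item::offset) pairs that all point at the same extent.
CANDIDATES = [
    (65536, 0),        # 65536 - 0      = 65536   -> matches
    (110592, 45056),   # 110592 - 45056 = 65536   -> matches
    (1048576, 0),      # 1048576 - 0    = 1048576 -> belongs to another backref
]
kept = [c for c in CANDIDATES if matches_backref(c[0], c[1], BACKREF_OFFSET)]
```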
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: ethanwu <ethanwu@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7ac8b88e
    • J
      btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root · 00246528
      Josef Bacik committed
      We are now using these for all roots, rename them to btrfs_put_root()
      and btrfs_grab_root().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      00246528
    • J
      btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root · bc44d7c4
      Josef Bacik committed
      Now that all callers of btrfs_get_fs_root are subsequently calling
      btrfs_grab_fs_root and handling dropping the ref when they are done
      appropriately, go ahead and push btrfs_grab_fs_root up into
      btrfs_get_fs_root.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc44d7c4
    • J
      btrfs: hold a ref on the root in resolve_indirect_ref · 9326f76f
      Josef Bacik committed
      We're looking up a random root, we need to hold a ref on it while we're
      using it.
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9326f76f
  4. 31 July, 2019 1 commit
    • F
      Btrfs: fix deadlock between fiemap and transaction commits · a6d155d2
      Filipe Manana committed
      The fiemap handler locks a file range that can have unflushed delalloc,
      and after locking the range, it tries to attach to a running transaction.
      If the running transaction started its commit, that is, it is in state
      TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
      flushoncommit option or the transaction is creating a snapshot for the
      subvolume that contains the file that fiemap is operating on, we end up
      deadlocking. This happens because fiemap is blocked on the transaction,
      waiting for it to complete, and the transaction is waiting for the flushed
      delalloc to complete, which requires locking the file range that the fiemap
      task already locked. The following stack traces serve as an example of
      when this deadlock happens:
      
        (...)
        [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [404571.515956] Call Trace:
        [404571.516360]  ? __schedule+0x3ae/0x7b0
        [404571.516730]  schedule+0x3a/0xb0
        [404571.517104]  lock_extent_bits+0x1ec/0x2a0 [btrfs]
        [404571.517465]  ? remove_wait_queue+0x60/0x60
        [404571.517832]  btrfs_finish_ordered_io+0x292/0x800 [btrfs]
        [404571.518202]  normal_work_helper+0xea/0x530 [btrfs]
        [404571.518566]  process_one_work+0x21e/0x5c0
        [404571.518990]  worker_thread+0x4f/0x3b0
        [404571.519413]  ? process_one_work+0x5c0/0x5c0
        [404571.519829]  kthread+0x103/0x140
        [404571.520191]  ? kthread_create_worker_on_cpu+0x70/0x70
        [404571.520565]  ret_from_fork+0x3a/0x50
        [404571.520915] kworker/u8:6    D    0 31651      2 0x80004000
        [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
        (...)
        [404571.537000] fsstress        D    0 13117  13115 0x00004000
        [404571.537263] Call Trace:
        [404571.537524]  ? __schedule+0x3ae/0x7b0
        [404571.537788]  schedule+0x3a/0xb0
        [404571.538066]  wait_current_trans+0xc8/0x100 [btrfs]
        [404571.538349]  ? remove_wait_queue+0x60/0x60
        [404571.538680]  start_transaction+0x33c/0x500 [btrfs]
        [404571.539076]  btrfs_check_shared+0xa3/0x1f0 [btrfs]
        [404571.539513]  ? extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.539866]  extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.540170]  do_vfs_ioctl+0x526/0x6f0
        [404571.540436]  ksys_ioctl+0x70/0x80
        [404571.540734]  __x64_sys_ioctl+0x16/0x20
        [404571.540997]  do_syscall_64+0x60/0x1d0
        [404571.541279]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
        [404571.543729] btrfs           D    0 14210  14208 0x00004000
        [404571.544023] Call Trace:
        [404571.544275]  ? __schedule+0x3ae/0x7b0
        [404571.544526]  ? wait_for_completion+0x112/0x1a0
        [404571.544795]  schedule+0x3a/0xb0
        [404571.545064]  schedule_timeout+0x1ff/0x390
        [404571.545351]  ? lock_acquire+0xa6/0x190
        [404571.545638]  ? wait_for_completion+0x49/0x1a0
        [404571.545890]  ? wait_for_completion+0x112/0x1a0
        [404571.546228]  wait_for_completion+0x131/0x1a0
        [404571.546503]  ? wake_up_q+0x70/0x70
        [404571.546775]  btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
        [404571.547159]  btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
        [404571.547449]  ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
        [404571.547703]  ? remove_wait_queue+0x60/0x60
        [404571.547969]  btrfs_mksubvol+0x605/0x640 [btrfs]
        [404571.548226]  ? __sb_start_write+0xd4/0x1c0
        [404571.548512]  ? mnt_want_write_file+0x24/0x50
        [404571.548789]  btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
        [404571.549048]  btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
        [404571.549307]  btrfs_ioctl+0x133f/0x3150 [btrfs]
        [404571.549549]  ? mem_cgroup_charge_statistics+0x4c/0xd0
        [404571.549792]  ? mem_cgroup_commit_charge+0x84/0x4b0
        [404571.550064]  ? __handle_mm_fault+0xe3e/0x11f0
        [404571.550306]  ? do_raw_spin_unlock+0x49/0xc0
        [404571.550608]  ? _raw_spin_unlock+0x24/0x30
        [404571.550976]  ? __handle_mm_fault+0xedf/0x11f0
        [404571.551319]  ? do_vfs_ioctl+0xa2/0x6f0
        [404571.551659]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [404571.552087]  do_vfs_ioctl+0xa2/0x6f0
        [404571.552355]  ksys_ioctl+0x70/0x80
        [404571.552621]  __x64_sys_ioctl+0x16/0x20
        [404571.552864]  do_syscall_64+0x60/0x1d0
        [404571.553104]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
      
      If we were joining the transaction instead of attaching to it, we would
      not risk a deadlock because a join only blocks if the transaction is in a
      state greater than or equal to TRANS_STATE_COMMIT_DOING, and the delalloc
      flush performed by a transaction is done before it reaches that state,
      when it is in the state TRANS_STATE_COMMIT_START. However a transaction
      join is intended for use cases where we do modify the filesystem, and
      fiemap only needs to peek at delayed references from the current
      transaction in order to determine if extents are shared, and, besides
      that, when there is no current transaction or when it blocks to wait for
      a current committing transaction to complete, it creates a new transaction
      without reserving any space. Such unnecessary transactions, besides doing
      unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
      rotation of the precious backup roots.
      
      So fix this by adding a new transaction join variant, named join_nostart,
      which behaves like the regular join, but it does not create a transaction
      when none currently exists or after waiting for a committing transaction
      to complete.
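
The difference between the three variants, as described in this message, can be summarized in a small decision table. This is a simplified model, not the kernel implementation: only three transaction states are modeled, `first_step` and the outcome strings are hypothetical, and what attach does after it has waited is deliberately left out because the text does not describe it.

```python
RUNNING, COMMIT_START, COMMIT_DOING = range(3)  # simplified, ordered states

# First state at which each variant has to wait, per the description above.
BLOCKS_FROM = {
    "attach": COMMIT_START,
    "join": COMMIT_DOING,
    "join_nostart": COMMIT_DOING,
}
# Whether the variant creates a transaction when none is currently running.
CREATES = {"attach": False, "join": True, "join_nostart": False}


def first_step(variant, state):
    """What a caller does on entry; `state` is None when no transaction exists."""
    if state is None:
        return "create" if CREATES[variant] else "no-handle"
    if state >= BLOCKS_FROM[variant]:
        return "wait"
    return "use-current"
```

In this model a caller using attach waits as soon as the commit has started, which is the state in which the delalloc flush happens, while join_nostart keeps using the current transaction at that point and still never creates a new one.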
      
      Fixes: 03628cdb ("Btrfs: do not start a transaction during fiemap")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a6d155d2
  5. 01 July, 2019 1 commit
  6. 30 April, 2019 3 commits
    • F
      Btrfs: do not start a transaction during fiemap · 03628cdb
      Filipe Manana committed
      During fiemap, for regular extents (non inline) we need to check if they
      are shared and if they are, set the shared bit. Checking if an extent is
      shared requires checking the delayed references of the currently running
      transaction, since some reference might have not yet hit the extent tree
      and be only in the in-memory delayed references.
      
      However we were using a transaction join for this, which creates a new
      transaction when there is no transaction currently running. That means
      that two more potential failures can happen: creating the transaction and
      committing it. Further, if no write activity is currently happening in the
      system, and fiemap calls keep being done, we end up creating and
      committing transactions that do nothing.
      
      In some extreme cases this can result in the commit of the transaction
      created by fiemap failing with ENOSPC when updating the root item of a
      subvolume tree because a join does not reserve any space, leading to a
      trace like the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do
      a transaction join, to avoid the overhead of creating a new transaction (if
      there is currently no running transaction) and introducing a potential
      point of failure when the new transaction gets committed. Instead, use a
      transaction attach to grab a handle for the currently running transaction,
      if any.
      Reported-by: Christoph Anton Mitterer <calestyo@scientia.net>
      Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
      Fixes: afce772e ("btrfs: fix check_shared for fiemap ioctl")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      03628cdb
    • F
      Btrfs: do not start a transaction at iterate_extent_inodes() · bfc61c36
      Filipe Manana committed
      When finding out which inodes have references on a particular extent, done
      by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both
      v1 and v2) ioctl and from scrub we use the transaction join API to grab a
      reference on the currently running transaction, since in order to give
      accurate results we need to inspect the delayed references of the currently
      running transaction.
      
      However, if there is currently no running transaction, the join operation
      will create a new transaction. This is inefficient as the transaction will
      eventually be committed, doing unnecessary IO and introducing a potential
      point of failure that will lead to a transaction abort due to -ENOSPC, as
      recently reported [1].
      
      That's because the join creates the transaction but does not reserve any
      space, so when attempting to update the root item of the root passed to
      btrfs_join_transaction(), during the transaction commit, we can end up
      failing with -ENOSPC. Users of a join operation are supposed to actually
      do some filesystem changes and reserve space by some means, which is not
      the case of iterate_extent_inodes(); it is a read-only operation for all
      contexts from which it is called.
      
      The reported [1] -ENOSPC failure stack trace is the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      So fix that by using the attach API, which does not create a transaction
      when there is currently no running transaction.
      
      [1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bfc61c36
    • A
      btrfs: use BUG() instead of BUG_ON(1) · 290342f6
      Arnd Bergmann committed
      BUG_ON(1) leads to bogus warnings from clang when
      CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
      
      fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false
            [-Werror,-Wsometimes-uninitialized]
                      BUG_ON(1);
                      ^~~~~~~~~
      include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
       #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
                                         ^~~~~~~~~~~~~~~~~~~
      include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
       #  define unlikely(x)   (__branch_check__(x, 0, __builtin_constant_p(x)))
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
                                   max_chunk_size);
                                   ^~~~~~~~~~~~~~
      include/linux/kernel.h:860:36: note: expanded from macro 'min'
       #define min(x, y)       __careful_cmp(x, y, <)
                                               ^
      include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
                      __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
                                    ^
      include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
                      typeof(y) unique_y = (y);               \
                                            ^
      fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
                      BUG_ON(1);
                      ^
      include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
       #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
                                     ^
      fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
              u64 max_chunk_size;
                                ^
                                 = 0
      
      Change it to BUG() so clang can see that this code path can never
      continue.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      290342f6
  7. 25 February, 2019 2 commits
    • J
      btrfs: honor path->skip_locking in backref code · 38e3eebf
      Josef Bacik committed
      Qgroups will do the old roots lookup at delayed ref time, which could be
      while walking down the extent root while running a delayed ref.  This
      should be fine, except we specifically lock eb's in the backref walking
      code irrespective of path->skip_locking, which deadlocks the system.
      Fix up the backref code to honor path->skip_locking, nobody will be
      modifying the commit_root when we're searching so it's completely safe
      to do.
      
      Since fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time
      out of commit trans"), the kernel may lock up with quota enabled.
      
      There is one backref trace triggered by snapshot dropping along with
      write operation in the source subvolume.  The example can be reliably
      reproduced:
      
        btrfs-cleaner   D    0  4062      2 0x80000000
        Call Trace:
         schedule+0x32/0x90
         btrfs_tree_read_lock+0x93/0x130 [btrfs]
         find_parent_nodes+0x29b/0x1170 [btrfs]
         btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
         btrfs_find_all_roots+0x57/0x70 [btrfs]
         btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
         btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
         btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
         do_walk_down+0x541/0x5e3 [btrfs]
         walk_down_tree+0xab/0xe7 [btrfs]
         btrfs_drop_snapshot+0x356/0x71a [btrfs]
         btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
         cleaner_kthread+0x12b/0x160 [btrfs]
         kthread+0x112/0x130
         ret_from_fork+0x27/0x50
      
      When dropping snapshots with qgroup enabled, we will trigger a backref
      walk.

      However, such a backref walk at that time is pretty dangerous: if one
      of the parent nodes gets WRITE locked by another thread, we could cause a
      deadlock.
      
      For example:
      
                 FS 260     FS 261 (Dropped)
                  node A        node B
                 /      \      /      \
             node C      node D      node E
            /   \         /  \        /     \
        leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
      
      The lock sequence would be:
      
            Thread A (cleaner)             |       Thread B (other writer)
      -----------------------------------------------------------------------
      write_lock(B)                        |
      write_lock(D)                        |
      ^^^ called by walk_down_tree()       |
                                           |       write_lock(A)
                                           |       write_lock(D) << Stall
      read_lock(H) << for backref walk     |
      read_lock(D) << lock owner is        |
                      the same thread A    |
                      so read lock is OK   |
      read_lock(A) << Stall                |
      
      So thread A holds the write lock on D and needs the read lock on A
      before it can unlock, while thread B holds the write lock on A and
      needs the lock on D before it can unlock.
      
      This causes a deadlock.
      
      This is not limited to the snapshot dropping case.  The backref walk,
      even when it only happens on commit trees, breaks the normal top-down
      locking order, which makes it deadlock prone.
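
A minimal sketch of what "honor path->skip_locking" means, assuming simplified stand-in types. `eb_sketch`, `path_sketch` and the helper names are hypothetical and are not the kernel's structures or functions; the lock is modeled as a plain reader counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an extent buffer; only tracks a reader count. */
struct eb_sketch {
	int read_locks;
};

/* Hypothetical stand-in for the skip_locking bit of a search path. */
struct path_sketch {
	bool skip_locking;
};

/* Take the read lock only when the path asks for locking. */
static void maybe_read_lock(const struct path_sketch *path,
			    struct eb_sketch *eb)
{
	if (!path->skip_locking)
		eb->read_locks++;
}

/* Release must mirror acquire so the counter stays balanced. */
static void maybe_read_unlock(const struct path_sketch *path,
			      struct eb_sketch *eb)
{
	if (!path->skip_locking) {
		assert(eb->read_locks > 0);
		eb->read_locks--;
	}
}

/* Returns how many read locks were held while the buffer was "read". */
static int locks_held_during_read(const struct path_sketch *path,
				  struct eb_sketch *eb)
{
	int held;

	maybe_read_lock(path, eb);
	held = eb->read_locks;
	maybe_read_unlock(path, eb);
	return held;
}
```

With skip_locking set, the walk never touches the lock, so it cannot stall behind a writer that holds a parent node.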
      
      Fixes: fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
      CC: stable@vger.kernel.org # 4.14+
      Reported-and-tested-by: David Sterba <dsterba@suse.com>
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [ rebase to latest branch and fix lock assert bug in btrfs/007 ]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ copy logs and deadlock analysis from Qu's patch ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      38e3eebf
    • D
      btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpers · 300aa896
      David Sterba committed
      We can use the right helper where the lock type is a fixed parameter.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 17 Dec, 2018 3 commits
    • A
      btrfs: Fix typos in comments and strings · 52042d8e
      Andrea Gelmini committed
      The typos accumulate over time, so once in a while they get fixed in
      a large patch.
      Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove needless tree locking in iterate_inode_extrefs · 5c623d33
      Nikolay Borisov committed
      In iterate_inode_extrefs the eb is cloned via btrfs_clone_extent_buffer
      which creates a private extent buffer with the dummy flag set and ref
      count of 1. Then this buffer is locked for reading and its ref count is
      incremented by 1. Finally it's fed to the passed iterate_irefs_t
      function. The actual iterate call back is inode_to_path (coming from
      paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final
      function the passed eb is only read by first assigning it to the local
      eb variable. This variable is only modified in the case another eb was
      referenced from the passed path, that is, when the eb != eb_in check
      triggers.
      
      Considering this there is no point in locking the cloned eb in
      iterate_inode_extrefs since it's never being modified and is not published
      anywhere. Furthermore the cloned eb is completely fine having its ref
      count be 1.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove needless tree locking in iterate_inode_refs · e5bba0b0
      Nikolay Borisov committed
      In iterate_inode_refs the eb is cloned via btrfs_clone_extent_buffer
      which creates a private extent buffer with the dummy flag set and ref
      count of 1. Then this buffer is locked for reading and its ref count is
      incremented by 1. Finally it's fed to the passed iterate_irefs_t
      function. The actual iterate call back is inode_to_path (coming from
      paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final
      function the passed eb is only read by first assigning it to the local
      eb variable. This variable is only modified in the case another eb was
      referenced from the passed path, that is, when the eb != eb_in check
      triggers.
      
      Considering this there is no point in locking the cloned eb in
      iterate_inode_refs since it's never being modified and is not published
      anywhere. Furthermore the cloned eb is completely fine having its ref
      count be 1.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 15 Oct, 2018 3 commits
  10. 06 Aug, 2018 2 commits
  11. 12 Apr, 2018 1 commit
  12. 31 Mar, 2018 2 commits
    • Q
      btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo committed
      We have several reports of node pointers pointing to incorrect child
      tree blocks, which may even have the wrong owner and level while still
      having a valid generation and checksum.
      
      Although btrfs check can detect this and prints an error message like:
      leaf parent key incorrect 60670574592
      
      the kernel does not check enough to catch this type of corruption.
      At least add such a check to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters @level and @first_key to verify the
      child tree block.
      
      The new @level check is mandatory, and all call sites have already been
      modified to extract the expected level from their call chain.
      
      @first_key is optional; the following call sites skip that check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip the @first_key check.
      2) Direct backref
         Only the parent bytenr and level are known and we need to resolve the
         key all by ourselves, so skip the @first_key check.
      
      Another note on this verification: it needs extra info from the nodeptr
      or ROOT_ITEM, so it can't fit into the current tree-checker framework,
      which is limited to the node/leaf boundary.
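
A minimal sketch of the check described above, assuming simplified types. `key_sketch`, `block_sketch` and `verify_child` are hypothetical names and the -1 return value is a placeholder, not the kernel's error code. A NULL expected key models the call sites that skip the @first_key check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified key with the three fields a btrfs key carries. */
struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Hypothetical view of a tree block that was just read from disk. */
struct block_sketch {
	int level;
	struct key_sketch first_key;
};

static bool keys_equal(const struct key_sketch *a, const struct key_sketch *b)
{
	return a->objectid == b->objectid && a->type == b->type &&
	       a->offset == b->offset;
}

/*
 * Return 0 if the child matches what the parent promised, -1 otherwise.
 * The level check always runs; pass NULL as expected_first_key to skip
 * the key check (root blocks, direct backrefs).
 */
static int verify_child(const struct block_sketch *child, int expected_level,
			const struct key_sketch *expected_first_key)
{
	assert(child != NULL);
	if (child->level != expected_level)
		return -1;
	if (expected_first_key &&
	    !keys_equal(&child->first_key, expected_first_key))
		return -1;
	return 0;
}
```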
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove unused op_key var from add_delayed_refs · a6dbceaf
      Nikolay Borisov committed
      Added as part of 86d5f994 ("btrfs: convert prelimary reference
      tracking to use rbtrees") but never used. tmp_op_key essentially
      subsumed that variable.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 26 Mar, 2018 1 commit
    • D
      btrfs: add more __cold annotations · e67c718b
      David Sterba committed
      The __cold functions are placed in a special section, as they're
      expected to be called rarely. This could help i-cache prefetches or help
      the compiler decide which branches are more/less likely to be taken
      without any other annotations needed.
      
      Though we can't add more __exit annotations, it's still possible to add
      __cold (which __exit already includes). That way the following function
      categories are tagged:
      
      - printf wrappers, error messages
      - exit helpers
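
A minimal userspace sketch of the annotation, assuming GCC or Clang. `cold_sketch` is a local stand-in for the kernel's __cold macro, and both functions are hypothetical examples of an error-message wrapper and its caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Local stand-in for the kernel's __cold macro (GCC/Clang attribute). */
#define cold_sketch __attribute__((cold))

/* Rarely called error reporter, a candidate for the cold section. */
static cold_sketch int report_error(const char *what, int code)
{
	fprintf(stderr, "error: %s (%d)\n", what, code);
	return code;
}

/* Hot path: only the error branch calls the cold function. */
static int checked_divide(int a, int b, int *out)
{
	assert(out != NULL);
	if (b == 0)
		return report_error("division by zero", -1);
	*out = a / b;
	return 0;
}
```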
      Signed-off-by: David Sterba <dsterba@suse.com>
  14. 15 Mar, 2018 1 commit
    • E
      btrfs: add missing initialization in btrfs_check_shared · 18bf591b
      Edmund Nadolski committed
      This patch addresses an issue that causes fiemap to falsely
      report a shared extent.  The test case is as follows:
      
      xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
      sync
      xfs_io  -c "fiemap -v" /media/scratch/file5
      
      which gives the resulting output:
      
      wrote 65536/65536 bytes at offset 0
      64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128 0x2001
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      
      This is because btrfs_check_shared calls find_parent_nodes
      repeatedly in a loop, passing a share_check struct to report
      the count of shared extents. But btrfs_check_shared does not
      re-initialize the count to zero for subsequent calls in the
      loop, resulting in a false share count. This is a regression
      from 4.13.
      
      With proper re-initialization the test result is as follows:
      
      wrote 65536/65536 bytes at offset 0
      64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      
      which corrects the regression.
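
A loose model of the bug and the fix, assuming a simplified counter. `share_check_sketch`, `walk_sketch` and `any_shared` are hypothetical; each array entry stands for the number of references one walk finds, which is not how the kernel obtains them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the share_check state. */
struct share_check_sketch {
	int share_count;
};

/* Stand-in for one backref walk: adds the refs found by this walk. */
static void walk_sketch(struct share_check_sketch *sc, int refs_found)
{
	sc->share_count += refs_found;
}

/*
 * Return 1 if any single walk reports more than one reference.
 * With reset == 0 the count leaks across iterations (the bug);
 * with reset != 0 it is cleared before each walk (the fix).
 */
static int any_shared(const int *refs, size_t n, int reset)
{
	struct share_check_sketch sc = { 0 };
	size_t i;

	assert(refs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (reset)
			sc.share_count = 0;
		walk_sketch(&sc, refs[i]);
		if (sc.share_count > 1)
			return 1;
	}
	return 0;
}
```

Two walks that each find a single reference are reported as shared without the reset, and as unshared with it.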
      
      Fixes: 3ec4d323 ("btrfs: allow backref search checks for shared extents")
      Signed-off-by: Edmund Nadolski <enadolski@suse.com>
      [ add text from cover letter to changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  15. 02 Feb, 2018 1 commit
    • Z
      btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes · c8195a7b
      Zygo Blaxell committed
      Until v4.14, this warning was very infrequent:
      
      	WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0
      	Modules linked in: [...]
      	CPU: 3 PID: 18172 Comm: bees Tainted: G      D W    L  4.11.9-zb64+ #1
      	Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101    12/02/2014
      	Call Trace:
      	 dump_stack+0x85/0xc2
      	 __warn+0xd1/0xf0
      	 warn_slowpath_null+0x1d/0x20
      	 find_parent_nodes+0xc41/0x14e0
      	 __btrfs_find_all_roots+0xad/0x120
      	 ? extent_same_check_offsets+0x70/0x70
      	 iterate_extent_inodes+0x168/0x300
      	 iterate_inodes_from_logical+0x87/0xb0
      	 ? iterate_inodes_from_logical+0x87/0xb0
      	 ? extent_same_check_offsets+0x70/0x70
      	 btrfs_ioctl+0x8ac/0x2820
      	 ? lock_acquire+0xc2/0x200
      	 do_vfs_ioctl+0x91/0x700
      	 ? __fget+0x112/0x200
      	 SyS_ioctl+0x79/0x90
      	 entry_SYSCALL_64_fastpath+0x23/0xc6
      	 ? trace_hardirqs_off_caller+0x1f/0x140
      
      Starting with v4.14 (specifically 86d5f994 ("btrfs: convert prelimary
      reference tracking to use rbtrees")) the WARN_ON occurs three orders of
      magnitude more frequently--almost once per second while running workloads
      like bees.
      
      Replace the WARN_ON() with a comment rationale for its removal.
      The rationale is paraphrased from an explanation by Edmund Nadolski
      <enadolski@suse.de> on the linux-btrfs mailing list.
      
      Fixes: 8da6d581 ("Btrfs: added btrfs_find_all_roots()")
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  16. 22 Jan, 2018 1 commit
  17. 02 Nov, 2017 2 commits
    • J
      btrfs: track refs in a rb_tree instead of a list · 0e0adbcf
      Josef Bacik committed
      If we get a significant amount of delayed refs for a single block (think
      modifying multiple snapshots) we can end up spending an ungodly amount
      of time looping through all of the entries trying to see if they can be
      merged.  This is because we only add them to a list, so we have O(2n)
      for every ref head.  This doesn't make any sense as we likely have refs
      for different roots, and so they cannot be merged.  Tracking in a tree
      will allow us to break as soon as we hit an entry that doesn't match,
      making our worst case O(n).
      
      With this we can also merge entries more easily.  Before we had to hope
      that matching refs were on the ends of our list, but with the tree we
      can search down to exact matches and merge them at insert time.
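
A minimal sketch of merging at insert time in an ordered tree. This uses a plain unbalanced binary search tree rather than the kernel's rbtree, and keys refs by a single root id; `ref_sketch` and the helpers are hypothetical, and the real comparison involves more fields.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical ref: keyed by root id, carrying a signed modification count. */
struct ref_sketch {
	uint64_t root;
	int ref_mod;
	struct ref_sketch *left, *right;
};

/*
 * Insert (root, ref_mod) into the tree. If a node with the same key
 * already exists, merge into it instead of linking a new node.
 * Returns 1 if merged, 0 if a new node was linked, -1 if allocation failed.
 */
static int insert_or_merge(struct ref_sketch **tree, uint64_t root, int ref_mod)
{
	struct ref_sketch **link = tree;
	struct ref_sketch *node;

	assert(tree != NULL);
	while (*link) {
		if (root < (*link)->root) {
			link = &(*link)->left;
		} else if (root > (*link)->root) {
			link = &(*link)->right;
		} else {
			(*link)->ref_mod += ref_mod;
			return 1;
		}
	}
	node = calloc(1, sizeof(*node));
	if (!node)
		return -1;
	node->root = root;
	node->ref_mod = ref_mod;
	*link = node;
	return 0;
}

/* Count the nodes currently linked in the tree. */
static int count_nodes(const struct ref_sketch *n)
{
	return n ? 1 + count_nodes(n->left) + count_nodes(n->right) : 0;
}

/* Free every node of the tree. */
static void free_tree(struct ref_sketch *n)
{
	if (!n)
		return;
	free_tree(n->left);
	free_tree(n->right);
	free(n);
}
```

The search stops at the first exact match, so a duplicate ref never adds a node, and non-matching refs are skipped by the ordering instead of being compared one by one.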
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Z
      btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents · c995ab3c
      Zygo Blaxell committed
      The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
      offset (encoded as a single logical address) to a list of extent refs.
      LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
      (extent ref -> extent bytenr and offset, or logical address).  These are
      useful capabilities for programs that manipulate extents and extent
      references from userspace (e.g. dedup and defrag utilities).
      
      When the extents are uncompressed (and not encrypted and not other),
      check_extent_in_eb performs filtering of the extent refs to remove any
      extent refs which do not contain the same extent offset as the 'logical'
      parameter's extent offset.  This prevents LOGICAL_INO from returning
      references to more than a single block.
      
      To find the set of extent references to an uncompressed extent from [a, b),
      userspace has to run a loop like this pseudocode:
      
      	for (i = a; i < b; ++i)
      		extent_ref_set += LOGICAL_INO(i);
      
      At each iteration of the loop (up to 32768 iterations for a 128M extent),
      data we are interested in is collected in the kernel, then deleted by
      the filter in check_extent_in_eb.
      
      When the extents are compressed (or encrypted or other), the 'logical'
      parameter must be an extent bytenr (the 'a' parameter in the loop).
      No filtering by extent offset is done (or possible?) so the result is
      the complete set of extent refs for the entire extent.  This removes
      the need for the loop, since we get all the extent refs in one call.
      
      Add an 'ignore_offset' argument to iterate_inodes_from_logical,
      [...several levels of function call graph...], and check_extent_in_eb, so
      that we can disable the extent offset filtering for uncompressed extents.
      This flag can be set by an improved version of the LOGICAL_INO ioctl to
      get either behavior as desired.
      
      There is no functional change in this patch.  The new flag is always
      false.
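
A minimal sketch of the offset filter and the new flag, assuming a simplified ref layout. `extent_ref_sketch` and `ref_matches` are hypothetical and only model the decision, not the kernel's check_extent_in_eb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one file extent item that references an extent. */
struct extent_ref_sketch {
	uint64_t data_offset; /* offset into the extent where the ref starts */
	uint64_t data_len;    /* number of bytes of the extent the ref covers */
};

/*
 * Return true if the ref should be reported for extent_item_pos, the
 * offset of the queried logical address within the extent.
 * With ignore_offset set, every ref to the extent is reported.
 */
static bool ref_matches(const struct extent_ref_sketch *ref,
			uint64_t extent_item_pos, bool ignore_offset)
{
	assert(ref != NULL);
	if (ignore_offset)
		return true;
	return extent_item_pos >= ref->data_offset &&
	       extent_item_pos < ref->data_offset + ref->data_len;
}
```

With the flag false the filter behaves as before; with it true, one query returns refs that would otherwise need one call per block.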
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor coding style fixes ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  18. 30 Oct, 2017 1 commit
  19. 21 Aug, 2017 1 commit