1. 19 9月, 2019 1 次提交
    • F
      Btrfs: fix assertion failure during fsync and use of stale transaction · 7cbd49cf
      Filipe Manana 提交于
      commit 410f954cb1d1c79ae485dd83a175f21954fd87cd upstream.
      
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If after the eviction/iput completes the inode
         logging path hits an error or it decides that it must fallback to a
         transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction can be unpredictable, we are at
         least having a use-after-free on the transaction handle itself, since
         the transaction commit will call kmem_cache_free() against the handle
         regardless of its ->use_count value, or we can end up silently losing
         all the updates to the log tree after that iput() in the logging path,
         or using a transaction handle that in the meanwhile was allocated to
         another task for a new transaction, etc, pretty much unpredictable
         what can happen.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode, that step is offloaded to a safe context
      (usually the cleaner kthread).
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running, which lead to fsync failing while logging inodes with -EIO
      errors and then trying later to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7cbd49cf
  2. 16 9月, 2019 12 次提交
    • J
      btrfs: correctly validate compression type · 1c13c9c4
      Johannes Thumshirn 提交于
      [ Upstream commit aa53e3bfac7205fb3a8815ac1c937fd6ed01b41e ]
      
      Nikolay reported the following KASAN splat when running btrfs/048:
      
      [ 1843.470920] ==================================================================
      [ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0
      [ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979
      
      [ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536
      [ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      [ 1843.476322] Call Trace:
      [ 1843.476674]  dump_stack+0x7c/0xbb
      [ 1843.477132]  ? strncmp+0x66/0xb0
      [ 1843.477587]  print_address_description+0x114/0x320
      [ 1843.478256]  ? strncmp+0x66/0xb0
      [ 1843.478740]  ? strncmp+0x66/0xb0
      [ 1843.479185]  __kasan_report+0x14e/0x192
      [ 1843.479759]  ? strncmp+0x66/0xb0
      [ 1843.480209]  kasan_report+0xe/0x20
      [ 1843.480679]  strncmp+0x66/0xb0
      [ 1843.481105]  prop_compression_validate+0x24/0x70
      [ 1843.481798]  btrfs_xattr_handler_set_prop+0x65/0x160
      [ 1843.482509]  __vfs_setxattr+0x71/0x90
      [ 1843.483012]  __vfs_setxattr_noperm+0x84/0x130
      [ 1843.483606]  vfs_setxattr+0xac/0xb0
      [ 1843.484085]  setxattr+0x18c/0x230
      [ 1843.484546]  ? vfs_setxattr+0xb0/0xb0
      [ 1843.485048]  ? __mod_node_page_state+0x1f/0xa0
      [ 1843.485672]  ? _raw_spin_unlock+0x24/0x40
      [ 1843.486233]  ? __handle_mm_fault+0x988/0x1290
      [ 1843.486823]  ? lock_acquire+0xb4/0x1e0
      [ 1843.487330]  ? lock_acquire+0xb4/0x1e0
      [ 1843.487842]  ? mnt_want_write_file+0x3c/0x80
      [ 1843.488442]  ? debug_lockdep_rcu_enabled+0x22/0x40
      [ 1843.489089]  ? rcu_sync_lockdep_assert+0xe/0x70
      [ 1843.489707]  ? __sb_start_write+0x158/0x200
      [ 1843.490278]  ? mnt_want_write_file+0x3c/0x80
      [ 1843.490855]  ? __mnt_want_write+0x98/0xe0
      [ 1843.491397]  __x64_sys_fsetxattr+0xba/0xe0
      [ 1843.492201]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 1843.493201]  do_syscall_64+0x6c/0x230
      [ 1843.493988]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 1843.495041] RIP: 0033:0x7fa7a8a7707a
      [ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48
      [ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be
      [ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a
      [ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003
      [ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000
      [ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91
      [ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff
      
      [ 1843.505268] Allocated by task 3979:
      [ 1843.505771]  save_stack+0x19/0x80
      [ 1843.506211]  __kasan_kmalloc.constprop.5+0xa0/0xd0
      [ 1843.506836]  setxattr+0xeb/0x230
      [ 1843.507264]  __x64_sys_fsetxattr+0xba/0xe0
      [ 1843.507886]  do_syscall_64+0x6c/0x230
      [ 1843.508429]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [ 1843.509558] Freed by task 0:
      [ 1843.510188] (stack is not available)
      
      [ 1843.511309] The buggy address belongs to the object at ffff888111e369e0
                      which belongs to the cache kmalloc-8 of size 8
      [ 1843.514095] The buggy address is located 2 bytes inside of
                      8-byte region [ffff888111e369e0, ffff888111e369e8)
      [ 1843.516524] The buggy address belongs to the page:
      [ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0
      [ 1843.519993] flags: 0x4404000010200(slab|head)
      [ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300
      [ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000
      [ 1843.524281] page dumped because: kasan: bad access detected
      
      [ 1843.525936] Memory state around the buggy address:
      [ 1843.526975]  ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.528479]  ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc
      [ 1843.531877]                                                        ^
      [ 1843.533287]  ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.534874]  ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [ 1843.536468] ==================================================================
      
      This is caused by supplying a too short compression value ('lz') in the
      test-case and comparing it to 'lzo' with strncmp() and a length of 3.
      strncmp() read past the 'lz' when looking for the 'o' and thus caused an
      out-of-bounds read.
      
      Introduce a new check 'btrfs_compress_is_valid_type()' which not only
      checks the user-supplied value against known compression types, but also
      employs checks for too short values.
      Reported-by: NNikolay Borisov <nborisov@suse.com>
      Fixes: 272e5326c783 ("btrfs: prop: fix vanished compression property after failed set")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1c13c9c4
    • F
      Btrfs: fix race between block group removal and block group allocation · 1d064876
      Filipe Manana 提交于
      [ Upstream commit 8eaf40c0e24e98899a0f3ac9d25a33aafe13822a ]
      
      If a task is removing the block group that currently has the highest start
      offset amongst all existing block groups, there is a short time window
      where it races with a concurrent block group allocation, resulting in a
      transaction abort with an error code of EEXIST.
      
      The following diagram explains the race in detail:
      
            Task A                                                        Task B
      
       btrfs_remove_block_group(bg offset X)
      
         remove_extent_mapping(em offset X)
           -> removes extent map X from the
              tree of extent maps
              (fs_info->mapping_tree), so the
              next call to find_next_chunk()
              will return offset X
      
                                                         btrfs_alloc_chunk()
                                                           find_next_chunk()
                                                             --> returns offset X
      
                                                           __btrfs_alloc_chunk(offset X)
                                                             btrfs_make_block_group()
                                                               btrfs_create_block_group_cache()
                                                                 --> creates btrfs_block_group_cache
                                                                     object with a key corresponding
                                                                     to the block group item in the
                                                                     extent, the key is:
                                                                     (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
      
                                                               --> adds the btrfs_block_group_cache object
                                                                   to the list new_bgs of the transaction
                                                                   handle
      
                                                         btrfs_end_transaction(trans handle)
                                                           __btrfs_end_transaction()
                                                             btrfs_create_pending_block_groups()
                                                               --> sees the new btrfs_block_group_cache
                                                                   in the new_bgs list of the transaction
                                                                   handle
                                                               --> its call to btrfs_insert_item() fails
                                                                   with -EEXIST when attempting to insert
                                                                   the block group item key
                                                                   (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
                                                                   because task A has not removed that key yet
                                                               --> aborts the running transaction with
                                                                   error -EEXIST
      
         btrfs_del_item()
           -> removes the block group's key from
              the extent tree, key is
              (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
      
      A sample transaction abort trace:
      
        [78912.403537] ------------[ cut here ]------------
        [78912.403811] BTRFS: Transaction aborted (error -17)
        [78912.404082] WARNING: CPU: 2 PID: 20465 at fs/btrfs/extent-tree.c:10551 btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
        (...)
        [78912.405642] CPU: 2 PID: 20465 Comm: btrfs Tainted: G        W         5.0.0-btrfs-next-46 #1
        [78912.405941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [78912.406586] RIP: 0010:btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
        (...)
        [78912.407636] RSP: 0018:ffff9d3d4b7e3b08 EFLAGS: 00010282
        [78912.407997] RAX: 0000000000000000 RBX: ffff90959a3796f0 RCX: 0000000000000006
        [78912.408369] RDX: 0000000000000007 RSI: 0000000000000001 RDI: ffff909636b16860
        [78912.408746] RBP: ffff909626758a58 R08: 0000000000000000 R09: 0000000000000000
        [78912.409144] R10: ffff9095ff462400 R11: 0000000000000000 R12: ffff90959a379588
        [78912.409521] R13: ffff909626758ab0 R14: ffff9095036c0000 R15: ffff9095299e1158
        [78912.409899] FS:  00007f387f16f700(0000) GS:ffff909636b00000(0000) knlGS:0000000000000000
        [78912.410285] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [78912.410673] CR2: 00007f429fc87cbc CR3: 000000014440a004 CR4: 00000000003606e0
        [78912.411095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [78912.411496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [78912.411898] Call Trace:
        [78912.412318]  __btrfs_end_transaction+0x5b/0x1c0 [btrfs]
        [78912.412746]  btrfs_inc_block_group_ro+0xcf/0x160 [btrfs]
        [78912.413179]  scrub_enumerate_chunks+0x188/0x5b0 [btrfs]
        [78912.413622]  ? __mutex_unlock_slowpath+0x100/0x2a0
        [78912.414078]  btrfs_scrub_dev+0x2ef/0x720 [btrfs]
        [78912.414535]  ? __sb_start_write+0xd4/0x1c0
        [78912.414963]  ? mnt_want_write_file+0x24/0x50
        [78912.415403]  btrfs_ioctl+0x17fb/0x3120 [btrfs]
        [78912.415832]  ? lock_acquire+0xa6/0x190
        [78912.416256]  ? do_vfs_ioctl+0xa2/0x6f0
        [78912.416685]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [78912.417116]  do_vfs_ioctl+0xa2/0x6f0
        [78912.417534]  ? __fget+0x113/0x200
        [78912.417954]  ksys_ioctl+0x70/0x80
        [78912.418369]  __x64_sys_ioctl+0x16/0x20
        [78912.418812]  do_syscall_64+0x60/0x1b0
        [78912.419231]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [78912.419644] RIP: 0033:0x7f3880252dd7
        (...)
        [78912.420957] RSP: 002b:00007f387f16ed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [78912.421426] RAX: ffffffffffffffda RBX: 000055f5becc1df0 RCX: 00007f3880252dd7
        [78912.421889] RDX: 000055f5becc1df0 RSI: 00000000c400941b RDI: 0000000000000003
        [78912.422354] RBP: 0000000000000000 R08: 00007f387f16f700 R09: 0000000000000000
        [78912.422790] R10: 00007f387f16f700 R11: 0000000000000246 R12: 0000000000000000
        [78912.423202] R13: 00007ffda49c266f R14: 0000000000000000 R15: 00007f388145e040
        [78912.425505] ---[ end trace eb9bfe7c426fc4d3 ]---
      
      Fix this by calling remove_extent_mapping(), at btrfs_remove_block_group(),
      only at the very end, after removing the block group item key from the
      extent tree (and removing the free space tree entry if we are using the
      free space tree feature).
      
      Fixes: 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1d064876
    • D
      btrfs: init csum_list before possible free · 476ecc14
      Dan Robertson 提交于
      [ Upstream commit e49be14b8d80e23bb7c53d78c21717a474ade76b ]
      
      The scrub_ctx csum_list member must be initialized before scrub_free_ctx
      is called. If the csum_list is not initialized beforehand, the
      list_empty call in scrub_free_csums will result in a null deref if the
      allocation fails in the for loop.
      
      Fixes: a2de733c ("btrfs: scrub")
      CC: stable@vger.kernel.org # 3.0+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDan Robertson <dan@dlrobertson.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      476ecc14
    • A
      btrfs: scrub: fix circular locking dependency warning · 936690bd
      Anand Jain 提交于
      [ Upstream commit 1cec3f27168d7835ff3d23ab371cd548440131bb ]
      
      This fixes a longstanding lockdep warning triggered by
      fstests/btrfs/011.
      
      Circular locking dependency check reports warning[1], that's because the
      btrfs_scrub_dev() calls the stack #0 below with, the fs_info::scrub_lock
      held. The test case leading to this warning:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /btrfs
        $ btrfs scrub start -B /btrfs
      
      In fact we have fs_info::scrub_workers_refcnt to track if the init and destroy
      of the scrub workers are needed. So once we have incremented and decremented
      the fs_info::scrub_workers_refcnt value in the thread, its ok to drop the
      scrub_lock, and then actually do the btrfs_destroy_workqueue() part. So this
      patch drops the scrub_lock before calling btrfs_destroy_workqueue().
      
        [359.258534] ======================================================
        [359.260305] WARNING: possible circular locking dependency detected
        [359.261938] 5.0.0-rc6-default #461 Not tainted
        [359.263135] ------------------------------------------------------
        [359.264672] btrfs/20975 is trying to acquire lock:
        [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540
        [359.268416]
        [359.268416] but task is already holding lock:
        [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.272418]
        [359.272418] which lock already depends on the new lock.
        [359.272418]
        [359.274692]
        [359.274692] the existing dependency chain (in reverse order) is:
        [359.276671]
        [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}:
        [359.278187]        __mutex_lock+0x86/0x9c0
        [359.279086]        btrfs_scrub_pause+0x31/0x100 [btrfs]
        [359.280421]        btrfs_commit_transaction+0x1e4/0x9e0 [btrfs]
        [359.281931]        close_ctree+0x30b/0x350 [btrfs]
        [359.283208]        generic_shutdown_super+0x64/0x100
        [359.284516]        kill_anon_super+0x14/0x30
        [359.285658]        btrfs_kill_super+0x12/0xa0 [btrfs]
        [359.286964]        deactivate_locked_super+0x29/0x60
        [359.288242]        cleanup_mnt+0x3b/0x70
        [359.289310]        task_work_run+0x98/0xc0
        [359.290428]        exit_to_usermode_loop+0x83/0x90
        [359.291445]        do_syscall_64+0x15b/0x180
        [359.292598]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.294011]
        [359.294011] -> #2 (sb_internal#2){.+.+}:
        [359.295432]        __sb_start_write+0x113/0x1d0
        [359.296394]        start_transaction+0x369/0x500 [btrfs]
        [359.297471]        btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs]
        [359.298629]        normal_work_helper+0xcd/0x530 [btrfs]
        [359.299698]        process_one_work+0x246/0x610
        [359.300898]        worker_thread+0x3c/0x390
        [359.302020]        kthread+0x116/0x130
        [359.303053]        ret_from_fork+0x24/0x30
        [359.304152]
        [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}:
        [359.306100]        process_one_work+0x21f/0x610
        [359.307302]        worker_thread+0x3c/0x390
        [359.308465]        kthread+0x116/0x130
        [359.309357]        ret_from_fork+0x24/0x30
        [359.310229]
        [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
        [359.311812]        lock_acquire+0x90/0x180
        [359.312929]        flush_workqueue+0xaa/0x540
        [359.313845]        drain_workqueue+0xa1/0x180
        [359.314761]        destroy_workqueue+0x17/0x240
        [359.315754]        btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.317245]        scrub_workers_put+0x2c/0x60 [btrfs]
        [359.318585]        btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.319944]        btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.321622]        btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.322908]        do_vfs_ioctl+0xa2/0x6d0
        [359.324021]        ksys_ioctl+0x3a/0x70
        [359.325066]        __x64_sys_ioctl+0x16/0x20
        [359.326236]        do_syscall_64+0x54/0x180
        [359.327379]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.328772]
        [359.328772] other info that might help us debug this:
        [359.328772]
        [359.330990] Chain exists of:
        [359.330990]   (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock
        [359.330990]
        [359.334376]  Possible unsafe locking scenario:
        [359.334376]
        [359.336020]        CPU0                    CPU1
        [359.337070]        ----                    ----
        [359.337821]   lock(&fs_info->scrub_lock);
        [359.338506]                                lock(sb_internal#2);
        [359.339506]                                lock(&fs_info->scrub_lock);
        [359.341461]   lock((wq_completion)"%s-%s""btrfs", name);
        [359.342437]
        [359.342437]  *** DEADLOCK ***
        [359.342437]
        [359.343745] 1 lock held by btrfs/20975:
        [359.344788]  #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
        [359.346778]
        [359.346778] stack backtrace:
        [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461
        [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [359.350501] Call Trace:
        [359.350931]  dump_stack+0x67/0x90
        [359.351676]  print_circular_bug.isra.37.cold.56+0x15c/0x195
        [359.353569]  check_prev_add.constprop.44+0x4f9/0x750
        [359.354849]  ? check_prev_add.constprop.44+0x286/0x750
        [359.356505]  __lock_acquire+0xb84/0xf10
        [359.357505]  lock_acquire+0x90/0x180
        [359.358271]  ? flush_workqueue+0x87/0x540
        [359.359098]  flush_workqueue+0xaa/0x540
        [359.359912]  ? flush_workqueue+0x87/0x540
        [359.360740]  ? drain_workqueue+0x1e/0x180
        [359.361565]  ? drain_workqueue+0xa1/0x180
        [359.362391]  drain_workqueue+0xa1/0x180
        [359.363193]  destroy_workqueue+0x17/0x240
        [359.364539]  btrfs_destroy_workqueue+0x57/0x200 [btrfs]
        [359.365673]  scrub_workers_put+0x2c/0x60 [btrfs]
        [359.366618]  btrfs_scrub_dev+0x336/0x590 [btrfs]
        [359.367594]  ? start_transaction+0xa1/0x500 [btrfs]
        [359.368679]  btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
        [359.369545]  btrfs_ioctl+0x28a4/0x2e40 [btrfs]
        [359.370186]  ? __lock_acquire+0x263/0xf10
        [359.370777]  ? kvm_clock_read+0x14/0x30
        [359.371392]  ? kvm_sched_clock_read+0x5/0x10
        [359.372248]  ? sched_clock+0x5/0x10
        [359.372786]  ? sched_clock_cpu+0xc/0xc0
        [359.373662]  ? do_vfs_ioctl+0xa2/0x6d0
        [359.374552]  do_vfs_ioctl+0xa2/0x6d0
        [359.375378]  ? do_sigaction+0xff/0x250
        [359.376233]  ksys_ioctl+0x3a/0x70
        [359.376954]  __x64_sys_ioctl+0x16/0x20
        [359.377772]  do_syscall_64+0x54/0x180
        [359.378841]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [359.380422] RIP: 0033:0x7f5429296a97
      
      Backporting to older kernels: scrub_nocow_workers must be freed the same
      way as the others.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      [ update changelog ]
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      936690bd
    • D
      btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex · ff55333f
      David Sterba 提交于
      [ Upstream commit 0e94c4f45d14cf89d1f40c91b0a8517e791672a7 ]
      
      The scrub context is allocated with GFP_KERNEL and called from
      btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
      regarding reclaim that could try to flush filesystem data in order to
      get the memory. And the device_list_mutex is held during superblock
      commit, so this would cause a lockup.
      
      Move the alocation and initialization before any changes that require
      the mutex.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ff55333f
    • D
      btrfs: scrub: pass fs_info to scrub_setup_ctx · 8ba3169d
      David Sterba 提交于
      [ Upstream commit 92f7ba434f51e8e9317f1d166105889aa230abd2 ]
      
      We can pass fs_info directly as this is the only member of btrfs_device
      that's bing used inside scrub_setup_ctx.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      8ba3169d
    • Q
      btrfs: Use real device structure to verify dev extent · be77686f
      Qu Wenruo 提交于
      [ Upstream commit 1b3922a8bc74231f9a767d1be6d9a061a4d4eeab ]
      
      [BUG]
      Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel
      message:
      
        BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0
        BTRFS error (device dm-6): failed to verify dev extents against chunks: -117
        BTRFS error (device dm-6): open_ctree failed
      
      [CAUSE]
      Commit cf90d884 ("btrfs: Introduce mount time chunk <-> dev extent
      mapping check") introduced strict check on dev extents.
      
      We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and
      only dependent on @devid to find the real device.
      
      For seed devices, we call clone_fs_devices() in open_seed_devices() to
      allow us search seed devices directly.
      
      However clone_fs_devices() just populates devices with devid and dev
      uuid, without populating other essential members, like disk_total_bytes.
      
      This makes any device returned by btrfs_find_device(fs_info, devid,
      NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev
      extents on the seed device will not pass the device boundary check.
      
      [FIX]
      This patch will try to verify the device returned by btrfs_find_device()
      and if it's a dummy then re-search in seed devices.
      
      Fixes: cf90d884 ("btrfs: Introduce mount time chunk <-> dev extent mapping check")
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      be77686f
    • Q
      btrfs: volumes: Make sure no dev extent is beyond device boundary · a2790b99
      Qu Wenruo 提交于
      [ Upstream commit 05a37c48604c19b50873fd9663f9140c150469d1 ]
      
      Add extra dev extent end check against device boundary.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a2790b99
    • N
      btrfs: Fix error handling in btrfs_cleanup_ordered_extents · eb124aaa
      Nikolay Borisov 提交于
      [ Upstream commit d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 ]
      
      Running btrfs/124 in a loop hung up on me sporadically with the
      following call trace:
      
      	btrfs           D    0  5760   5324 0x00000000
      	Call Trace:
      	 ? __schedule+0x243/0x800
      	 schedule+0x33/0x90
      	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
      	 ? wait_woken+0xa0/0xa0
      	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
      	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
      	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
      	 btrfs_balance+0xbeb/0x1740 [btrfs]
      	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
      	 btrfs_ioctl+0x1691/0x3110 [btrfs]
      	 ? lockdep_hardirqs_on+0xed/0x180
      	 ? __handle_mm_fault+0x8e7/0xfb0
      	 ? _raw_spin_unlock+0x24/0x30
      	 ? __handle_mm_fault+0x8e7/0xfb0
      	 ? do_vfs_ioctl+0xa5/0x6e0
      	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      	 do_vfs_ioctl+0xa5/0x6e0
      	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
      	 ksys_ioctl+0x3a/0x70
      	 __x64_sys_ioctl+0x16/0x20
      	 do_syscall_64+0x60/0x1b0
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This happens because during page writeback it's valid for
      writepage_delalloc to instantiate a delalloc range which doesn't belong
      to the page currently being written back.
      
      The reason this case is valid is due to find_lock_delalloc_range
      returning any available range after the passed delalloc_start and
      ignoring whether the page under writeback is within that range.
      
      In turn ordered extents (OE) are always created for the returned range
      from find_lock_delalloc_range. If, however, a failure occurs while OE
      are being created then the clean up code in btrfs_cleanup_ordered_extents
      will be called.
      
      Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
      the case of such 'foreign' range being processed and instead it always
      assumes that the range OE are created for belongs to the page. This
      leads to the first page of such foregin range to not be cleaned up since
      it's deliberately missed and skipped by the current cleaning up code.
      
      Fix this by correctly checking whether the current page belongs to the
      range being instantiated and if so adjsut the range parameters passed
      for cleaning up. If it doesn't, then just clean the whole OE range
      directly.
      
      Fixes: 52427260 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      eb124aaa
    • N
      btrfs: Remove extent_io_ops::fill_delalloc · 1669d1d2
      Nikolay Borisov 提交于
      [ Upstream commit 5eaad97af8aeff38debe7d3c69ec3a0d71f8350f ]
      
      This callback is called only from writepage_delalloc which in turn is
      guaranteed to be called from the data page writeout path. In the end
      there is no reason to have the call to this function to be indrected via
      the extent_io_ops structure. This patch removes the callback definition,
      exports the function and calls it directly. No functional changes.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_run_delalloc_range ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1669d1d2
    • F
      Btrfs: fix deadlock with memory reclaim during scrub · 338a528b
      Filipe Manana 提交于
      [ Upstream commit a5fb11429167ee6ddeeacc554efaf5776b36433a ]
      
      When a transaction commit starts, it attempts to pause scrub and it blocks
      until the scrub is paused. So while the transaction is blocked waiting for
      scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
      otherwise we risk getting into a deadlock with reclaim.
      
      Checking for scrub pause requests is done early at the beginning of the
      while loop of scrub_stripe() and later in the loop, scrub_extent() and
      scrub_raid56_parity() are called, which in turn call scrub_pages() and
      scrub_pages_for_parity() respectively. These last two functions do memory
      allocations using GFP_KERNEL. Same problem could happen while scrubbing
      the super blocks, since it calls scrub_pages().
      
      We also can not have any of the worker tasks, created by the scrub task,
      doing GFP_KERNEL allocations, because before pausing, the scrub task waits
      for all the worker tasks to complete (also done at scrub_stripe()).
      
      So make sure GFP_NOFS is used for the memory allocations because at any
      time a scrub pause request can happen from another task that started to
      commit a transaction.
      
      Fixes: 58c4e173 ("btrfs: scrub: use GFP_KERNEL on the submission path")
      CC: stable@vger.kernel.org # 4.6+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      338a528b
    • O
      Btrfs: clean up scrub is_dev_replace parameter · fac80347
      Omar Sandoval 提交于
      [ Upstream commit 32934280967d00dc2b5c4d3b63b21a9c8638326e ]
      
      struct scrub_ctx has an ->is_dev_replace member, so there's no point in
      passing around is_dev_replace where sctx is available.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      fac80347
  3. 25 8月, 2019 1 次提交
    • F
      Btrfs: fix deadlock between fiemap and transaction commits · f833deae
      Filipe Manana 提交于
      [ Upstream commit a6d155d2e363f26290ffd50591169cb96c2a609e ]
      
      The fiemap handler locks a file range that can have unflushed delalloc,
      and after locking the range, it tries to attach to a running transaction.
      If the running transaction started its commit, that is, it is in state
      TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
      flushoncommit option or the transaction is creating a snapshot for the
      subvolume that contains the file that fiemap is operating on, we end up
      deadlocking. This happens because fiemap is blocked on the transaction,
      waiting for it to complete, and the transaction is waiting for the flushed
      dealloc to complete, which requires locking the file range that the fiemap
      task already locked. The following stack traces serve as an example of
      when this deadlock happens:
      
        (...)
        [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [404571.515956] Call Trace:
        [404571.516360]  ? __schedule+0x3ae/0x7b0
        [404571.516730]  schedule+0x3a/0xb0
        [404571.517104]  lock_extent_bits+0x1ec/0x2a0 [btrfs]
        [404571.517465]  ? remove_wait_queue+0x60/0x60
        [404571.517832]  btrfs_finish_ordered_io+0x292/0x800 [btrfs]
        [404571.518202]  normal_work_helper+0xea/0x530 [btrfs]
        [404571.518566]  process_one_work+0x21e/0x5c0
        [404571.518990]  worker_thread+0x4f/0x3b0
        [404571.519413]  ? process_one_work+0x5c0/0x5c0
        [404571.519829]  kthread+0x103/0x140
        [404571.520191]  ? kthread_create_worker_on_cpu+0x70/0x70
        [404571.520565]  ret_from_fork+0x3a/0x50
        [404571.520915] kworker/u8:6    D    0 31651      2 0x80004000
        [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
        (...)
        [404571.537000] fsstress        D    0 13117  13115 0x00004000
        [404571.537263] Call Trace:
        [404571.537524]  ? __schedule+0x3ae/0x7b0
        [404571.537788]  schedule+0x3a/0xb0
        [404571.538066]  wait_current_trans+0xc8/0x100 [btrfs]
        [404571.538349]  ? remove_wait_queue+0x60/0x60
        [404571.538680]  start_transaction+0x33c/0x500 [btrfs]
        [404571.539076]  btrfs_check_shared+0xa3/0x1f0 [btrfs]
        [404571.539513]  ? extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.539866]  extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.540170]  do_vfs_ioctl+0x526/0x6f0
        [404571.540436]  ksys_ioctl+0x70/0x80
        [404571.540734]  __x64_sys_ioctl+0x16/0x20
        [404571.540997]  do_syscall_64+0x60/0x1d0
        [404571.541279]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
        [404571.543729] btrfs           D    0 14210  14208 0x00004000
        [404571.544023] Call Trace:
        [404571.544275]  ? __schedule+0x3ae/0x7b0
        [404571.544526]  ? wait_for_completion+0x112/0x1a0
        [404571.544795]  schedule+0x3a/0xb0
        [404571.545064]  schedule_timeout+0x1ff/0x390
        [404571.545351]  ? lock_acquire+0xa6/0x190
        [404571.545638]  ? wait_for_completion+0x49/0x1a0
        [404571.545890]  ? wait_for_completion+0x112/0x1a0
        [404571.546228]  wait_for_completion+0x131/0x1a0
        [404571.546503]  ? wake_up_q+0x70/0x70
        [404571.546775]  btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
        [404571.547159]  btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
        [404571.547449]  ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
        [404571.547703]  ? remove_wait_queue+0x60/0x60
        [404571.547969]  btrfs_mksubvol+0x605/0x640 [btrfs]
        [404571.548226]  ? __sb_start_write+0xd4/0x1c0
        [404571.548512]  ? mnt_want_write_file+0x24/0x50
        [404571.548789]  btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
        [404571.549048]  btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
        [404571.549307]  btrfs_ioctl+0x133f/0x3150 [btrfs]
        [404571.549549]  ? mem_cgroup_charge_statistics+0x4c/0xd0
        [404571.549792]  ? mem_cgroup_commit_charge+0x84/0x4b0
        [404571.550064]  ? __handle_mm_fault+0xe3e/0x11f0
        [404571.550306]  ? do_raw_spin_unlock+0x49/0xc0
        [404571.550608]  ? _raw_spin_unlock+0x24/0x30
        [404571.550976]  ? __handle_mm_fault+0xedf/0x11f0
        [404571.551319]  ? do_vfs_ioctl+0xa2/0x6f0
        [404571.551659]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [404571.552087]  do_vfs_ioctl+0xa2/0x6f0
        [404571.552355]  ksys_ioctl+0x70/0x80
        [404571.552621]  __x64_sys_ioctl+0x16/0x20
        [404571.552864]  do_syscall_64+0x60/0x1d0
        [404571.553104]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
      
      If we were joining the transaction instead of attaching to it, we would
      not risk a deadlock because a join only blocks if the transaction is in a
      state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
      flush performed by a transaction is done before it reaches that state,
      when it is in the state TRANS_STATE_COMMIT_START. However a transaction
      join is intended for use cases where we do modify the filesystem, and
      fiemap only needs to peek at delayed references from the current
      transaction in order to determine if extents are shared, and, besides
      that, when there is no current transaction or when it blocks to wait for
      a current committing transaction to complete, it creates a new transaction
      without reserving any space. Such unnecessary transactions, besides doing
      unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
      rotation of the precious backup roots.
      
      So fix this by adding a new transaction join variant, named join_nostart,
      which behaves like the regular join, but it does not create a transaction
      when none currently exists or after waiting for a committing transaction
      to complete.
      
      Fixes: 03628cdbc64db6 ("Btrfs: do not start a transaction during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f833deae
  4. 07 8月, 2019 4 次提交
    • F
      Btrfs: fix race leading to fs corruption after transaction abort · 50d70040
      Filipe Manana 提交于
      commit cb2d3daddbfb6318d170e79aac1f7d5e4d49f0d7 upstream.
      
      When one transaction is finishing its commit, it is possible for another
      transaction to start and enter its initial commit phase as well. If the
      first ends up getting aborted, we have a small time window where the second
      transaction commit does not notice that the previous transaction aborted
      and ends up committing, writing a superblock that points to btrees that
      reference extent buffers (nodes and leafs) that were not persisted to disk.
      The consequence is that after mounting the filesystem again, we will be
      unable to load some btree nodes/leafs, either because the content on disk
      is either garbage (or just zeroes) or corresponds to the old content of a
      previouly COWed or deleted node/leaf, resulting in the well known error
      messages "parent transid verify failed on ...".
      The following sequence diagram illustrates how this can happen.
      
              CPU 1                                           CPU 2
      
       <at transaction N>
      
       btrfs_commit_transaction()
         (...)
         --> sets transaction state to
             TRANS_STATE_UNBLOCKED
         --> sets fs_info->running_transaction
             to NULL
      
                                                          (...)
                                                          btrfs_start_transaction()
                                                            start_transaction()
                                                              wait_current_trans()
                                                                --> returns immediately
                                                                    because
                                                                    fs_info->running_transaction
                                                                    is NULL
                                                              join_transaction()
                                                                --> creates transaction N + 1
                                                                --> sets
                                                                    fs_info->running_transaction
                                                                    to transaction N + 1
                                                                --> adds transaction N + 1 to
                                                                    the fs_info->trans_list list
                                                              --> returns transaction handle
                                                                  pointing to the new
                                                                  transaction N + 1
                                                          (...)
      
                                                          btrfs_sync_file()
                                                            btrfs_start_transaction()
                                                              --> returns handle to
                                                                  transaction N + 1
                                                            (...)
      
         btrfs_write_and_wait_transaction()
           --> writeback of some extent
               buffer fails, returns an
      	 error
         btrfs_handle_fs_error()
           --> sets BTRFS_FS_STATE_ERROR in
               fs_info->fs_state
         --> jumps to label "scrub_continue"
         cleanup_transaction()
           btrfs_abort_transaction(N)
             --> sets BTRFS_FS_STATE_TRANS_ABORTED
                 flag in fs_info->fs_state
             --> sets aborted field in the
                 transaction and transaction
      	   handle structures, for
                 transaction N only
           --> removes transaction from the
               list fs_info->trans_list
                                                            btrfs_commit_transaction(N + 1)
                                                              --> transaction N + 1 was not
      							    aborted, so it proceeds
                                                              (...)
                                                              --> sets the transaction's state
                                                                  to TRANS_STATE_COMMIT_START
                                                              --> does not find the previous
                                                                  transaction (N) in the
                                                                  fs_info->trans_list, so it
                                                                  doesn't know that transaction
                                                                  was aborted, and the commit
                                                                  of transaction N + 1 proceeds
                                                              (...)
                                                              --> sets transaction N + 1 state
                                                                  to TRANS_STATE_UNBLOCKED
                                                              btrfs_write_and_wait_transaction()
                                                                --> succeeds writing all extent
                                                                    buffers created in the
                                                                    transaction N + 1
                                                              write_all_supers()
                                                                 --> succeeds
                                                                 --> we now have a superblock on
                                                                     disk that points to trees
                                                                     that refer to at least one
                                                                     extent buffer that was
                                                                     never persisted
      
      So fix this by updating the transaction commit path to check if the flag
      BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting
      the transaction to the TRANS_STATE_COMMIT_START we do not find any previous
      transaction in the fs_info->trans_list. If the flag is set, just fail the
      transaction commit with -EROFS, as we do in other places. The exact error
      code for the previous transaction abort was already logged and reported.
      
      Fixes: 49b25e05 ("btrfs: enhance transaction abort infrastructure")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      50d70040
    • F
      Btrfs: fix incremental send failure after deduplication · 009d7a4e
      Filipe Manana 提交于
      commit b4f9a1a87a48c255bb90d8a6c3d555a1abb88130 upstream.
      
      When doing an incremental send operation we can fail if we previously did
      deduplication operations against a file that exists in both snapshots. In
      that case we will fail the send operation with -EIO and print a message
      to dmesg/syslog like the following:
      
        BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
        extent for inode 257 without updated inode item, send root is 258, \
        parent root is 257
      
      This requires that we deduplicate to the same file in both snapshots for
      the same amount of times on each snapshot. The issue happens because a
      deduplication only updates the iversion of an inode and does not update
      any other field of the inode, therefore if we deduplicate the file on
      each snapshot for the same amount of time, the inode will have the same
      iversion value (stored as the "sequence" field on the inode item) on both
      snapshots, therefore it will be seen as unchanged between in the send
      snapshot while there are new/updated/deleted extent items when comparing
      to the parent snapshot. This makes the send operation return -EIO and
      print an error message.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        # Create our first file. The first half of the file has several 64Kb
        # extents while the second half as a single 512Kb extent.
        $ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
      
        # Create the base snapshot and the parent send stream from it.
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
        $ btrfs send -f /tmp/1.snap /mnt/mysnap1
      
        # Create our second file, that has exactly the same data as the first
        # file.
        $ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
      
        # Create the second snapshot, used for the incremental send, before
        # doing the file deduplication.
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
        # Now before creating the incremental send stream:
        #
        # 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
        #    will drop several extent items and add a new one, also updating
        #    the inode's iversion (sequence field in inode item) by 1, but not
        #    any other field of the inode;
        #
        # 2) Deduplicate into a different subrange of file foo in snapshot
        #    mysnap2. This will replace an extent item with a new one, also
        #    updating the inode's iversion by 1 but not any other field of the
        #    inode.
        #
        # After these two deduplication operations, the inode items, for file
        # foo, are identical in both snapshots, but we have different extent
        # items for this inode in both snapshots. We want to check this doesn't
        # cause send to fail with an error or produce an incorrect stream.
      
        $ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
        $ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
      
        # Create the incremental send stream.
        $ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
        ERROR: send ioctl failed with -5: Input/output error
      
      This issue started happening back in 2015 when deduplication was updated
      to not update the inode's ctime and mtime and update only the iversion.
      Back then we would hit a BUG_ON() in send, but later in 2016 send was
      updated to return -EIO and print the error message instead of doing the
      BUG_ON().
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
      Fixes: 1c919a5e ("btrfs: don't update mtime/ctime on deduped inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      009d7a4e
    • Q
      btrfs: qgroup: Don't hold qgroup_ioctl_lock in btrfs_qgroup_inherit() · f96c70fa
      Qu Wenruo 提交于
      [ Upstream commit e88439debd0a7f969b3ddba6f147152cd0732676 ]
      
      [BUG]
      Lockdep will report the following circular locking dependency:
      
        WARNING: possible circular locking dependency detected
        5.2.0-rc2-custom #24 Tainted: G           O
        ------------------------------------------------------
        btrfs/8631 is trying to acquire lock:
        000000002536438c (&fs_info->qgroup_ioctl_lock#2){+.+.}, at: btrfs_qgroup_inherit+0x40/0x620 [btrfs]
      
        but task is already holding lock:
        000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (&fs_info->tree_log_mutex){+.+.}:
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_commit_transaction+0x475/0xa00 [btrfs]
               btrfs_commit_super+0x71/0x80 [btrfs]
               close_ctree+0x2bd/0x320 [btrfs]
               btrfs_put_super+0x15/0x20 [btrfs]
               generic_shutdown_super+0x72/0x110
               kill_anon_super+0x18/0x30
               btrfs_kill_super+0x16/0xa0 [btrfs]
               deactivate_locked_super+0x3a/0x80
               deactivate_super+0x51/0x60
               cleanup_mnt+0x3f/0x80
               __cleanup_mnt+0x12/0x20
               task_work_run+0x94/0xb0
               exit_to_usermode_loop+0xd8/0xe0
               do_syscall_64+0x210/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #1 (&fs_info->reloc_mutex){+.+.}:
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_commit_transaction+0x40d/0xa00 [btrfs]
               btrfs_quota_enable+0x2da/0x730 [btrfs]
               btrfs_ioctl+0x2691/0x2b40 [btrfs]
               do_vfs_ioctl+0xa9/0x6d0
               ksys_ioctl+0x67/0x90
               __x64_sys_ioctl+0x1a/0x20
               do_syscall_64+0x65/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #0 (&fs_info->qgroup_ioctl_lock#2){+.+.}:
               lock_acquire+0xa7/0x190
               __mutex_lock+0x76/0x940
               mutex_lock_nested+0x1b/0x20
               btrfs_qgroup_inherit+0x40/0x620 [btrfs]
               create_pending_snapshot+0x9d7/0xe60 [btrfs]
               create_pending_snapshots+0x94/0xb0 [btrfs]
               btrfs_commit_transaction+0x415/0xa00 [btrfs]
               btrfs_mksubvol+0x496/0x4e0 [btrfs]
               btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs]
               btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs]
               btrfs_ioctl+0xa90/0x2b40 [btrfs]
               do_vfs_ioctl+0xa9/0x6d0
               ksys_ioctl+0x67/0x90
               __x64_sys_ioctl+0x1a/0x20
               do_syscall_64+0x65/0x240
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        other info that might help us debug this:
      
        Chain exists of:
          &fs_info->qgroup_ioctl_lock#2 --> &fs_info->reloc_mutex --> &fs_info->tree_log_mutex
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&fs_info->tree_log_mutex);
                                       lock(&fs_info->reloc_mutex);
                                       lock(&fs_info->tree_log_mutex);
          lock(&fs_info->qgroup_ioctl_lock#2);
      
         *** DEADLOCK ***
      
        6 locks held by btrfs/8631:
         #0: 00000000ed8f23f6 (sb_writers#12){.+.+}, at: mnt_want_write_file+0x28/0x60
         #1: 000000009fb1597a (&type->i_mutex_dir_key#10/1){+.+.}, at: btrfs_mksubvol+0x70/0x4e0 [btrfs]
         #2: 0000000088c5ad88 (&fs_info->subvol_sem){++++}, at: btrfs_mksubvol+0x128/0x4e0 [btrfs]
         #3: 000000009606fc3e (sb_internal#2){.+.+}, at: start_transaction+0x37a/0x520 [btrfs]
         #4: 00000000f82bbdf5 (&fs_info->reloc_mutex){+.+.}, at: btrfs_commit_transaction+0x40d/0xa00 [btrfs]
         #5: 000000003d52cc23 (&fs_info->tree_log_mutex){+.+.}, at: create_pending_snapshot+0x8b6/0xe60 [btrfs]
      
      [CAUSE]
      Due to the delayed subvolume creation, we need to call
      btrfs_qgroup_inherit() inside commit transaction code, with a lot of
      other mutex hold.
      This hell of lock chain can lead to above problem.
      
      [FIX]
      On the other hand, we don't really need to hold qgroup_ioctl_lock if
      we're in the context of create_pending_snapshot().
      As in that context, we're the only one being able to modify qgroup.
      
      All other qgroup functions which needs qgroup_ioctl_lock are either
      holding a transaction handle, or will start a new transaction:
        Functions will start a new transaction():
        * btrfs_quota_enable()
        * btrfs_quota_disable()
        Functions hold a transaction handler:
        * btrfs_add_qgroup_relation()
        * btrfs_del_qgroup_relation()
        * btrfs_create_qgroup()
        * btrfs_remove_qgroup()
        * btrfs_limit_qgroup()
        * btrfs_qgroup_inherit() call inside create_subvol()
      
      So we have a higher level protection provided by transaction, thus we
      don't need to always hold qgroup_ioctl_lock in btrfs_qgroup_inherit().
      
      Only the btrfs_qgroup_inherit() call in create_subvol() needs to hold
      qgroup_ioctl_lock, while the btrfs_qgroup_inherit() call in
      create_pending_snapshot() is already protected by transaction.
      
      So the fix is to detect the context by checking
      trans->transaction->state.
      If we're at TRANS_STATE_COMMIT_DOING, then we're in commit transaction
      context and no need to get the mutex.
      Reported-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f96c70fa
    • D
      btrfs: fix minimum number of chunk errors for DUP · d9245dab
      David Sterba 提交于
      [ Upstream commit 0ee5f8ae082e1f675a2fb6db601c31ac9958a134 ]
      
      The list of profiles in btrfs_chunk_max_errors lists DUP as a profile
      DUP able to tolerate 1 device missing. Though this profile is special
      with 2 copies, it still needs the device, unlike the others.
      
      Looking at the history of changes, thre's no clear reason why DUP is
      there, functions were refactored and blocks of code merged to one
      helper.
      
      d20983b4 Btrfs: fix writing data into the seed filesystem
        - factor code to a helper
      
      de11cc12 Btrfs: don't pre-allocate btrfs bio
        - unrelated change, DUP still in the list with max errors 1
      
      a236aed1 Btrfs: Deal with failed writes in mirrored configurations
        - introduced the max errors, leaves DUP and RAID1 in the same group
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d9245dab
  5. 31 7月, 2019 1 次提交
    • Q
      btrfs: inode: Don't compress if NODATASUM or NODATACOW set · e3dc9ea5
      Qu Wenruo 提交于
      commit 42c16da6d684391db83788eb680accd84f6c2083 upstream.
      
      As btrfs(5) specified:
      
      	Note
      	If nodatacow or nodatasum are enabled, compression is disabled.
      
      If NODATASUM or NODATACOW set, we should not compress the extent.
      
      Normally NODATACOW is detected properly in run_delalloc_range() so
      compression won't happen for NODATACOW.
      
      However for NODATASUM we don't have any check, and it can cause
      compressed extent without csum pretty easily, just by:
        mkfs.btrfs -f $dev
        mount $dev $mnt -o nodatasum
        touch $mnt/foobar
        mount -o remount,datasum,compress $mnt
        xfs_io -f -c "pwrite 0 128K" $mnt/foobar
      
      And in fact, we have a bug report about corrupted compressed extent
      without proper data checksum so even RAID1 can't recover the corruption.
      (https://bugzilla.kernel.org/show_bug.cgi?id=199707)
      
      Running compression without proper checksum could cause more damage when
      corruption happens, as compressed data could make the whole extent
      unreadable, so there is no need to allow compression for
      NODATACSUM.
      
      The fix will refactor the inode compression check into two parts:
      
      - inode_can_compress()
        As the hard requirement, checked at btrfs_run_delalloc_range(), so no
        compression will happen for NODATASUM inode at all.
      
      - inode_need_compress()
        As the soft requirement, checked at btrfs_run_delalloc_range() and
        compress_file_range().
      Reported-by: NJames Harvey <jamespharvey20@gmail.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e3dc9ea5
  6. 26 7月, 2019 3 次提交
    • F
      Btrfs: add missing inode version, ctime and mtime updates when punching hole · 4bd95324
      Filipe Manana 提交于
      commit 179006688a7e888cbff39577189f2e034786d06a upstream.
      
      If the range for which we are punching a hole covers only part of a page,
      we end up updating the inode item but we skip the update of the inode's
      iversion, mtime and ctime. Fix that by ensuring we update those properties
      of the inode.
      
      A patch for fstests test case generic/059 that tests this as been sent
      along with this fix.
      
      Fixes: 2aaa6655 ("Btrfs: add hole punching")
      Fixes: e8c1c76e ("Btrfs: add missing inode update when punching hole")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4bd95324
    • F
      Btrfs: fix fsync not persisting dentry deletions due to inode evictions · fffedf5c
      Filipe Manana 提交于
      commit 803f0f64d17769071d7287d9e3e3b79a3e1ae937 upstream.
      
      In order to avoid searches on a log tree when unlinking an inode, we check
      if the inode being unlinked was logged in the current transaction, as well
      as the inode of its parent directory. When any of the inodes are logged,
      we proceed to delete directory items and inode reference items from the
      log, to ensure that if a subsequent fsync of only the inode being unlinked
      or only of the parent directory when the other is not fsync'ed as well,
      does not result in the entry still existing after a power failure.
      
      That check however is not reliable when one of the inodes involved (the
      one being unlinked or its parent directory's inode) is evicted, since the
      logged_trans field is transient, that is, it is not stored on disk, so it
      is lost when the inode is evicted and loaded into memory again (which is
      set to zero on load). As a consequence the checks currently being done by
      btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
      return true if the inode was evicted before, regardless of the inode
      having been logged or not before (and in the current transaction), this
      results in the dentry being unlinked still existing after a log replay
      if after the unlink operation only one of the inodes involved is fsync'ed.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Keep an open file descriptor on our directory while we evict inodes.
        # We just want to evict the file's inode, the directory's inode must not
        # be evicted.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time to background process to chdir to our test
        # directory.
        $ sleep 0.5
      
        # Trigger eviction of the file's inode.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Unlink our file and fsync the parent directory. After a power failure
        # we don't expect to see the file anymore, since we fsync'ed the parent
        # directory.
        $ rm -f $SCRATCH_MNT/dir/foo
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ ls /mnt/dir
        foo
        $
         --> file still there, unlink not persisted despite explicit fsync on dir
      
      Fix this by checking if the inode has the full_sync bit set in its runtime
      flags as well, since that bit is set everytime an inode is loaded from
      disk, or for other less common cases such as after a shrinking truncate
      or failure to allocate extent maps for holes, and gets cleared after the
      first fsync. Also consider the inode as possibly logged only if it was
      last modified in the current transaction (besides having the full_fsync
      flag set).
      
      Fixes: 3a5f1d45 ("Btrfs: Optimize btree walking while logging inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fffedf5c
    • F
      Btrfs: fix data loss after inode eviction, renaming it, and fsync it · 110850ff
      Filipe Manana 提交于
      commit d1d832a0b51dd9570429bb4b81b2a6c1759e681a upstream.
      
      When we log an inode, regardless of logging it completely or only that it
      exists, we always update it as logged (logged_trans and last_log_commit
      fields of the inode are updated). This is generally fine and avoids future
      attempts to log it from having to do repeated work that brings no value.
      
      However, if we write data to a file, then evict its inode after all the
      dealloc was flushed (and ordered extents completed), rename the file and
      fsync it, we end up not logging the new extents, since the rename may
      result in logging that the inode exists in case the parent directory was
      logged before. The following reproducer shows and explains how this can
      happen:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ touch /mnt/dir/bar
      
        # Do a direct IO write instead of a buffered write because with a
        # buffered write we would need to make sure dealloc gets flushed and
        # complete before we do the inode eviction later, and we can not do that
        # from user space with call to things such as sync(2) since that results
        # in a transaction commit as well.
        $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
      
        # Keep the directory dir in use while we evict inodes. We want our file
        # bar's inode to be evicted but we don't want our directory's inode to
        # be evicted (if it were evicted too, we would not be able to reproduce
        # the issue since the first fsync below, of file foo, would result in a
        # transaction commit.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time for the background process to chdir.
        $ sleep 0.1
      
        # Evict all inodes, except the inode for the directory dir because it is
        # currently in use by our background process.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # fsync file foo, which ends up persisting information about the parent
        # directory because it is a new inode.
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Rename bar, this results in logging that this inode exists (inode item,
        # names, xattrs) because the parent directory is in the log.
        $ mv /mnt/dir/bar /mnt/dir/baz
      
        # Now fsync baz, which ends up doing absolutely nothing because of the
        # rename operation which logged that the inode exists only.
        $ xfs_io -c fsync /mnt/dir/baz
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/dir/baz
        0000000
      
          --> Empty file, data we wrote is missing.
      
      Fix this by not updating last_sub_trans of an inode when we are logging
      only that it exists and the inode was not yet logged since it was loaded
      from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
      return false for this scenario and make us log the inode. The logged_trans
      of the inode is still always setsince that alone is used to track if names
      need to be deleted as part of unlink operations.
      
      Fixes: 257c62e1 ("Btrfs: avoid tree log commit when there are no changes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      110850ff
  7. 10 7月, 2019 1 次提交
    • N
      btrfs: Ensure replaced device doesn't have pending chunk allocation · fb814f21
      Nikolay Borisov 提交于
      commit debd1c065d2037919a7da67baf55cc683fee09f0 upstream.
      
      Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
      operations during transaction commit") combined the way certain
      operations are recoded in a transaction. As a result an ASSERT was added
      in dev_replace_finish to ensure the new code works correctly.
      Unfortunately I got reports that it's possible to trigger the assert,
      meaning that during a device replace it's possible to have an unfinished
      chunk allocation on the source device.
      
      This is supposed to be prevented by the fact that a transaction is
      committed before finishing the replace oepration and alter acquiring the
      chunk mutex. This is not sufficient since by the time the transaction is
      committed and the chunk mutex acquired it's possible to allocate a chunk
      depending on the workload being executed on the replaced device. This
      bug has been present ever since device replace was introduced but there
      was never code which checks for it.
      
      The correct way to fix is to ensure that there is no pending device
      modification operation when the chunk mutex is acquire and if there is
      repeat transaction commit. Unfortunately it's not possible to just
      exclude the source device from btrfs_fs_devices::dev_alloc_list since
      this causes ENOSPC to be hit in transaction commit.
      
      Fixing that in another way would need to add special cases to handle the
      last writes and forbid new ones. The looped transaction fix is more
      obvious, and can be easily backported. The runtime of dev-replace is
      long so there's no noticeable delay caused by that.
      Reported-by: NDavid Sterba <dsterba@suse.com>
      Fixes: 391cd9df ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      fb814f21
  8. 25 6月, 2019 1 次提交
  9. 09 6月, 2019 4 次提交
    • F
      Btrfs: incremental send, fix file corruption when no-holes feature is enabled · 8a652fd1
      Filipe Manana 提交于
      commit 6b1f72e5b82a5c2a4da4d1ebb8cc01913ddbea21 upstream.
      
      When using the no-holes feature, if we have a file with prealloc extents
      with a start offset beyond the file's eof, doing an incremental send can
      cause corruption of the file due to incorrect hole detection. Such case
      requires that the prealloc extent(s) exist in both the parent and send
      snapshots, and that a hole is punched into the file that covers all its
      extents that do not cross the eof boundary.
      
      Example reproducer:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
        $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      
        $ btrfs send -f /tmp/base.snap /mnt/sdb/base
      
        $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      
        $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
      
        $ md5sum /mnt/sdb/incr/foobar
        816df6f64deba63b029ca19d880ee10a   /mnt/sdb/incr/foobar
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/base.snap /mnt/sdc
        $ btrfs receive -f /tmp/incr.snap /mnt/sdc
      
        $ md5sum /mnt/sdc/incr/foobar
        cf2ef71f4a9e90c2f6013ba3b2257ed2   /mnt/sdc/incr/foobar
      
          --> Different checksum, because the prealloc extent beyond the
              file's eof confused the hole detection code and it assumed
              a hole starting at offset 0 and ending at the offset of the
              prealloc extent (1200Kb) instead of ending at the offset
              500Kb (the file's size).
      
      Fix this by ensuring we never cross the file's size when issuing the
      write operations for a hole.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      CC: stable@vger.kernel.org # 3.14+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a652fd1
    • F
      Btrfs: fix fsync not persisting changed attributes of a directory · a8107111
      Filipe Manana 提交于
      commit 60d9f50308e5df19bc18c2fefab0eba4a843900a upstream.
      
      While logging an inode we follow its ancestors and for each one we mark
      it as logged in the current transaction, even if we have not logged it.
      As a consequence if we change an attribute of an ancestor, such as the
      UID or GID for example, and then explicitly fsync it, we end up not
      logging the inode at all despite returning success to user space, which
      results in the attribute being lost if a power failure happens after
      the fsync.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ chown 6007:6007 /mnt/dir
      
        $ sync
      
        $ chown 9003:9003 /mnt/dir
        $ touch /mnt/dir/file
        $ xfs_io -c fsync /mnt/dir/file
      
        # fsync our directory after fsync'ing the new file, should persist the
        # new values for the uid and gid.
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ stat -c %u:%g /mnt/dir
        6007:6007
      
          --> should be 9003:9003, the uid and gid were not persisted, despite
              the explicit fsync on the directory prior to the power failure
      
      Fix this by not updating the logged_trans field of ancestor inodes when
      logging an inode, since we have not logged them. Let only future calls to
      btrfs_log_inode() to mark inodes as logged.
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8107111
    • F
      Btrfs: fix race updating log root item during fsync · 37fe0383
      Filipe Manana 提交于
      commit 06989c799f04810f6876900d4760c0edda369cf7 upstream.
      
      When syncing the log, the final phase of a fsync operation, we need to
      either create a log root's item or update the existing item in the log
      tree of log roots, and that depends on the current value of the log
      root's log_transid - if it's 1 we need to create the log root item,
      otherwise it must exist already and we update it. Since there is no
      synchronization between updating the log_transid and checking it for
      deciding whether the log root's item needs to be created or updated, we
      end up with a tiny race window that results in attempts to update the
      item to fail because the item was not yet created:
      
                    CPU 1                                    CPU 2
      
        btrfs_sync_log()
      
          lock root->log_mutex
      
          set log root's log_transid to 1
      
          unlock root->log_mutex
      
                                                     btrfs_sync_log()
      
                                                       lock root->log_mutex
      
                                                       sets log root's
                                                       log_transid to 2
      
                                                       unlock root->log_mutex
      
          update_log_root()
      
            sees log root's log_transid
            with a value of 2
      
              calls btrfs_update_root(),
              which fails with -EUCLEAN
              and causes transaction abort
      
      Until recently the race lead to a BUG_ON at btrfs_update_root(), but after
      the recent commit 7ac1e464c4d47 ("btrfs: Don't panic when we can't find a
      root key") we just abort the current transaction.
      
      A sample trace of the BUG_ON() on a SLE12 kernel:
      
        ------------[ cut here ]------------
        kernel BUG at ../fs/btrfs/root-tree.c:157!
        Oops: Exception in kernel mode, sig: 5 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        (...)
        Supported: Yes, External
        CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G                 X 4.4.156-94.57-default #1
        task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000
        NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000
        REGS: c00000ff42b0b860 TRAP: 0700   Tainted: G                 X  (4.4.156-94.57-default)
        MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 22444484  XER: 20000000
        CFAR: d000000036aba66c SOFTE: 1
        GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054
        GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000
        GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079
        GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023
        GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28
        GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001
        GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888
        GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20
        NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs]
        LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs]
        Call Trace:
        [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable)
        [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs]
        [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs]
        [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120
        [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0
        [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40
        [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100
        Instruction dump:
        7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0
        e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0
        ---[ end trace 8f2dc8f919cabab8 ]---
      
      So fix this by doing the check of log_transid and updating or creating the
      log root's item while holding the root's log_mutex.
      
      Fixes: 7237f183 ("Btrfs: fix tree logs parallel sync")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      37fe0383
    • F
      Btrfs: fix wrong ctime and mtime of a directory after log replay · 7301bbea
      Filipe Manana 提交于
      commit 5338e43abbab13791144d37fd8846847062351c6 upstream.
      
      When replaying a log that contains a new file or directory name that needs
      to be added to its parent directory, we end up updating the mtime and the
      ctime of the parent directory to the current time after we have set their
      values to the correct ones (set at fsync time), efectivelly losing them.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/file
      
        # fsync of the directory is optional, not needed
        $ xfs_io -c fsync /mnt/dir
        $ xfs_io -c fsync /mnt/dir/file
      
        $ stat -c %Y /mnt/dir
        1557856079
      
        <power failure>
      
        $ sleep 3
        $ mount /dev/sdb /mnt
        $ stat -c %Y /mnt/dir
        1557856082
      
          --> should have been 1557856079, the mtime is updated to the current
              time when replaying the log
      
      Fix this by not updating the mtime and ctime to the current time at
      btrfs_add_link() when we are replaying a log tree.
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: e02119d5 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7301bbea
  10. 31 5月, 2019 11 次提交
    • Q
      btrfs: Don't panic when we can't find a root key · bd3d8f4c
      Qu Wenruo 提交于
      [ Upstream commit 7ac1e464c4d473b517bb784f30d40da1f842482e ]
      
      When we failed to find a root key in btrfs_update_root(), we just panic.
      
      That's definitely not cool, fix it by outputting an unique error
      message, aborting current transaction and return -EUCLEAN. This should
      not normally happen as the root has been used by the callers in some
      way.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      bd3d8f4c
    • J
      btrfs: fix panic during relocation after ENOSPC before writeback happens · 431cbaec
      Josef Bacik 提交于
      [ Upstream commit ff612ba7849964b1898fd3ccd1f56941129c6aab ]
      
      We've been seeing the following sporadically throughout our fleet
      
      panic: kernel BUG at fs/btrfs/relocation.c:4584!
      netversion: 5.0-0
      Backtrace:
       #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8
       #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c
       #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad
       #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a
       #4 [ffffc90003adb9c0] do_trap at ffffffff81019114
       #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0
       #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b
          [exception RIP: btrfs_reloc_cow_block+692]
          RIP: ffffffff8143b614  RSP: ffffc90003adbb68  RFLAGS: 00010246
          RAX: fffffffffffffff7  RBX: ffff8806b9c32000  RCX: ffff8806aad00690
          RDX: ffff880850b295e0  RSI: ffff8806b9c32000  RDI: ffff88084f205bd0
          RBP: ffff880849415000   R8: ffffc90003adbbe0   R9: ffff88085ac90000
          R10: ffff8805f7369140  R11: 0000000000000000  R12: ffff880850b295e0
          R13: ffff88084f205bd0  R14: 0000000000000000  R15: 0000000000000000
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd
       #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3
       #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c
      
      The way relocation moves data extents is by creating a reloc inode and
      preallocating extents in this inode and then copying the data into these
      preallocated extents.  Once we've done this for all of our extents,
      we'll write out these dirty pages, which marks the extent written, and
      goes into btrfs_reloc_cow_block().  From here we get our current
      reloc_control, which _should_ match the reloc_control for the current
      block group we're relocating.
      
      However if we get an ENOSPC in this path at some point we'll bail out,
      never initiating writeback on this inode.  Not a huge deal, unless we
      happen to be doing relocation on a different block group, and this block
      group is now rc->stage == UPDATE_DATA_PTRS.  This trips the BUG_ON() in
      btrfs_reloc_cow_block(), because we expect to be done modifying the data
      inode.  We are in fact done modifying the metadata for the data inode
      we're currently using, but not the one from the failed block group, and
      thus we BUG_ON().
      
      (This happens when writeback finishes for extents from the previous
      group, when we are at btrfs_finish_ordered_io() which updates the data
      reloc tree (inode item, drops/adds extent items, etc).)
      
      Fix this by writing out the reloc data inode always, and then breaking
      out of the loop after that point to keep from tripping this BUG_ON()
      later.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [ add note from Filipe ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      431cbaec
    • R
      Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve · 1084fc9a
      Robbie Ko 提交于
      [ Upstream commit 39ad317315887c2cb9a4347a93a8859326ddf136 ]
      
      When doing fallocate, we first add the range to the reserve_list and
      then reserve the quota.  If quota reservation fails, we'll release all
      reserved parts of reserve_list.
      
      However, cur_offset is not updated to indicate that this range is
      already been inserted into the list.  Therefore, the same range is freed
      twice.  Once at list_for_each_entry loop, and once at the end of the
      function.  This will result in WARN_ON on bytes_may_use when we free the
      remaining space.
      
      At the end, under the 'out' label we have a call to:
      
         btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);
      
      The start offset, third argument, should be cur_offset.
      
      Everything from alloc_start to cur_offset was freed by the
      list_for_each_entry_safe_loop.
      
      Fixes: 18513091 ("btrfs: update btrfs_space_info's bytes_may_use timely")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1084fc9a
    • D
      Revert "btrfs: Honour FITRIM range constraints during free space trim" · 06a67c0f
      David Sterba 提交于
      This reverts commit 8b13bb91.
      
      There is currently no corresponding patch in master due to additional
      changes that would be significantly different from plain revert in the
      respective stable branch.
      
      The range argument was not handled correctly and could cause trim to
      overlap allocated areas or reach beyond the end of the device. The
      address space that fitrim normally operates on is in logical
      coordinates, while the discards are done on the physical device extents.
      This distinction cannot be made with the current ioctl interface and
      caused the confusion.
      
      The bug depends on the layout of block groups and does not always
      happen. The whole-fs trim (run by default by the fstrim tool) is not
      affected.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      06a67c0f
    • J
      btrfs: honor path->skip_locking in backref code · 9c0339dd
      Josef Bacik 提交于
      commit 38e3eebff643db725633657d1d87a3be019d1018 upstream.
      
      Qgroups will do the old roots lookup at delayed ref time, which could be
      while walking down the extent root while running a delayed ref.  This
      should be fine, except we specifically lock eb's in the backref walking
      code irrespective of path->skip_locking, which deadlocks the system.
      Fix up the backref code to honor path->skip_locking, nobody will be
      modifying the commit_root when we're searching so it's completely safe
      to do.
      
      This happens since fb235dc0 ("btrfs: qgroup: Move half of the qgroup
      accounting time out of commit trans"), kernel may lockup with quota
      enabled.
      
      There is one backref trace triggered by snapshot dropping along with
      write operation in the source subvolume.  The example can be reliably
      reproduced:
      
        btrfs-cleaner   D    0  4062      2 0x80000000
        Call Trace:
         schedule+0x32/0x90
         btrfs_tree_read_lock+0x93/0x130 [btrfs]
         find_parent_nodes+0x29b/0x1170 [btrfs]
         btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
         btrfs_find_all_roots+0x57/0x70 [btrfs]
         btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
         btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
         btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
         do_walk_down+0x541/0x5e3 [btrfs]
         walk_down_tree+0xab/0xe7 [btrfs]
         btrfs_drop_snapshot+0x356/0x71a [btrfs]
         btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
         cleaner_kthread+0x12b/0x160 [btrfs]
         kthread+0x112/0x130
         ret_from_fork+0x27/0x50
      
      When dropping snapshots with qgroup enabled, we will trigger backref
      walk.
      
      However such backref walk at that timing is pretty dangerous, as if one
      of the parent nodes get WRITE locked by other thread, we could cause a
      dead lock.
      
      For example:
      
                 FS 260     FS 261 (Dropped)
                  node A        node B
                 /      \      /      \
             node C      node D      node E
            /   \         /  \        /     \
        leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
      
      The lock sequence would be:
      
            Thread A (cleaner)             |       Thread B (other writer)
      -----------------------------------------------------------------------
      write_lock(B)                        |
      write_lock(D)                        |
      ^^^ called by walk_down_tree()       |
                                           |       write_lock(A)
                                           |       write_lock(D) << Stall
      read_lock(H) << for backref walk     |
      read_lock(D) << lock owner is        |
                      the same thread A    |
                      so read lock is OK   |
      read_lock(A) << Stall                |
      
      So thread A hold write lock D, and needs read lock A to unlock.
      While thread B holds write lock A, while needs lock D to unlock.
      
      This will cause a deadlock.
      
      This is not only limited to snapshot dropping case.  As the backref
      walk, even only happens on commit trees, is breaking the normal top-down
      locking order, makes it deadlock prone.
      
      Fixes: fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
      CC: stable@vger.kernel.org # 4.14+
      Reported-and-tested-by: NDavid Sterba <dsterba@suse.com>
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [ rebase to latest branch and fix lock assert bug in btrfs/007 ]
      [ backport to linux-4.19.y branch, solve minor conflicts ]
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      [ copy logs and deadlock analysis from Qu's patch ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c0339dd
    • T
      btrfs: sysfs: don't leak memory when failing add fsid · 94e1f966
      Tobin C. Harding 提交于
      commit e32773357d5cc271b1d23550b3ed026eb5c2a468 upstream.
      
      A failed call to kobject_init_and_add() must be followed by a call to
      kobject_put().  Currently in the error path when adding fs_devices we
      are missing this call.  This could be fixed by calling
      btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
      by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
      Here we choose the second option because it prevents the slightly
      unusual error path handling requirements of kobject from leaking out
      into btrfs functions.
      
      Add a call to kobject_put() in the error path of kobject_add_and_init().
      This causes the release method to be called if kobject_init_and_add()
      fails.  open_tree() is the function that calls btrfs_sysfs_add_fsid()
      and the error code in this function is already written with the
      assumption that the release method is called during the error path of
      open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
      fail_fsdev_sysfs label).
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NTobin C. Harding <tobin@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      94e1f966
    • T
      btrfs: sysfs: Fix error path kobject memory leak · 946ad2ec
      Tobin C. Harding 提交于
      commit 450ff8348808a89cc27436771aa05c2b90c0eef1 upstream.
      
      If a call to kobject_init_and_add() fails we must call kobject_put()
      otherwise we leak memory.
      
      Calling kobject_put() when kobject_init_and_add() fails drops the
      refcount back to 0 and calls the ktype release method (which in turn
      calls the percpu destroy and kfree).
      
      Add call to kobject_put() in the error path of call to
      kobject_init_and_add().
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NTobin C. Harding <tobin@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      946ad2ec
    • F
      Btrfs: fix race between ranged fsync and writeback of adjacent ranges · 92f907d7
      Filipe Manana 提交于
      commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream.
      
      When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
      inode) that happens to be ranged, which happens during a msync() or writes
      for files opened with O_SYNC for example, we can end up with a corrupt log,
      due to different file extent items representing ranges that overlap with
      each other, or hit some assertion failures.
      
      When doing a ranged fsync we only flush delalloc and wait for ordered
      exents within that range. If while we are logging items from our inode
      ordered extents for adjacent ranges complete, we end up in a race that can
      make us insert the file extent items that overlap with others we logged
      previously and the assertion failures.
      
      For example, if tree-log.c:copy_items() receives a leaf that has the
      following file extents items, all with a length of 4K and therefore there
      is an implicit hole in the range 68K to 72K - 1:
      
        (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
      
      It copies them to the log tree. However due to the need to detect implicit
      holes, it may release the path, in order to look at the previous leaf to
      detect an implicit hole, and then later it will search again in the tree
      for the first file extent item key, with the goal of locking again the
      leaf (which might have changed due to concurrent changes to other inodes).
      
      However when it locks again the leaf containing the first key, the key
      corresponding to the extent at offset 72K may not be there anymore since
      there is an ordered extent for that range that is finishing (that is,
      somewhere in the middle of btrfs_finish_ordered_io()), and it just
      removed the file extent item but has not yet replaced it with a new file
      extent item, so the part of copy_items() that does hole detection will
      decide that there is a hole in the range starting from 68K to 76K - 1,
      and therefore insert a file extent item to represent that hole, having
      a key offset of 68K. After that we now have a log tree with 2 different
      extent items that have overlapping ranges:
      
       1) The file extent item copied before copy_items() released the path,
          which has a key offset of 72K and a length of 4K, representing the
          file range 72K to 76K - 1.
      
       2) And a file extent item representing a hole that has a key offset of
          68K and a length of 8K, representing the range 68K to 76K - 1. This
          item was inserted after releasing the path, and overlaps with the
          extent item inserted before.
      
      The overlapping extent items can cause all sorts of unpredictable and
      incorrect behaviour, either when replayed or if a fast (non full) fsync
      happens later, which can trigger a BUG_ON() when calling
      btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
      trace like the following:
      
        [61666.783269] ------------[ cut here ]------------
        [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
        [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
        (...)
        [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
        [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
        [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
        [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
        [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
        [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
        [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
        [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
        [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
        [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
        [61666.786253] Call Trace:
        [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
        [61666.786253]  ? time_hardirqs_on+0x9/0x14
        [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
        [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
        [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
        [61666.786253]  ? lock_acquire+0x131/0x1c5
        [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
        [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
        [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
        [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
        [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
        [61666.786253]  vfs_fsync_range+0x8c/0x9e
        [61666.786253]  SyS_msync+0x13c/0x1c9
        [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad
      
      A sample of a corrupt log tree leaf with overlapping extents I got from
      running btrfs/072:
      
            item 14 key (295 108 200704) itemoff 2599 itemsize 53
                    extent data disk bytenr 0 nr 0
                    extent data offset 0 nr 458752 ram 458752
            item 15 key (295 108 659456) itemoff 2546 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 606208 nr 163840 ram 770048
            item 16 key (295 108 663552) itemoff 2493 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 610304 nr 155648 ram 770048
            item 17 key (295 108 819200) itemoff 2440 itemsize 53
                    extent data disk bytenr 4334788608 nr 4096
                    extent data offset 0 nr 4096 ram 4096
      
      The file extent item at offset 659456 (item 15) ends at offset 823296
      (659456 + 163840) while the next file extent item (item 16) starts at
      offset 663552.
      
      Another different problem that the race can trigger is a failure in the
      assertions at tree-log.c:copy_items(), which expect that the first file
      extent item key we found before releasing the path exists after we have
      released path and that the last key we found before releasing the path
      also exists after releasing the path:
      
        $ cat -n fs/btrfs/tree-log.c
        4080          if (need_find_last_extent) {
        4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
        4082                  btrfs_release_path(src_path);
        4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
        4084                                  src_path, 0, 0);
        4085                  if (ret < 0)
        4086                          return ret;
        4087                  ASSERT(ret == 0);
        (...)
        4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
        4104                          ret = btrfs_next_leaf(inode->root, src_path);
        4105                          if (ret < 0)
        4106                                  return ret;
        4107                          ASSERT(ret == 0);
        4108                          src = src_path->nodes[0];
        4109                          i = 0;
        4110                          need_find_last_extent = true;
        4111                  }
        (...)
      
      The second assertion implicitly expects that the last key before the path
      release still exists, because the surrounding while loop only stops after
      we have found that key. When this assertion fails it produces a stack like
      this:
      
        [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
        [139590.037406] ------------[ cut here ]------------
        [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
        [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
        (...)
        [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
        (...)
        [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
        [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
        [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
        [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
        [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
        [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
        [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
        [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
        [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [139590.044250] Call Trace:
        [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
        [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
        [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
        [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
        [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
        [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
        [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
        [139590.047592]  __vfs_write+0x129/0x1c0
        [139590.047932]  vfs_write+0xc2/0x1b0
        [139590.048270]  ksys_write+0x55/0xc0
        [139590.048608]  do_syscall_64+0x60/0x1b0
        [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [139590.049287] RIP: 0033:0x7f2efc4be190
        (...)
        [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
        [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
        [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
        [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
        [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
        (...)
        [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
      
      So fix this race between a full ranged fsync and writeback of adjacent
      ranges by flushing all delalloc and waiting for all ordered extents to
      complete before logging the inode. This is the simplest way to solve the
      problem because currently the full fsync path does not deal with ranges
      at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
      to look at adjacent ranges for hole detection. For use cases of ranged
      fsyncs this can make a few fsyncs slower but on the other hand it can
      make some following fsyncs to other ranges do less work or no need to do
      anything at all. A full fsync is rare anyway and happens only once after
      loading/creating an inode and once after less common operations such as a
      shrinking truncate.
      
      This is an issue that exists for a long time, and was often triggered by
      generic/127, because it does mmap'ed writes and msync (which triggers a
      ranged fsync). Adding support for the tree checker to detect overlapping
      extents (next patch in the series) and trigger a WARN() when such cases
      are found, and then calling btrfs_check_leaf_full() at the end of
      btrfs_insert_file_extent() made the issue much easier to detect. Running
      btrfs/072 with that change to the tree checker and making fsstress open
      files always with O_SYNC made it much easier to trigger the issue (as
      triggering it with generic/127 is very rare).
      
      CC: stable@vger.kernel.org # 3.16+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      92f907d7
    • F
      Btrfs: avoid fallback to transaction commit during fsync of files with holes · 4f9a774d
      Filipe Manana 提交于
      commit ebb929060aeb162417b4c1307e63daee47b208d9 upstream.
      
      When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
      file that has holes and has file extent items spanning two or more leafs,
      we can end up falling to back to a full transaction commit due to a logic
      bug that leads to failure to insert a duplicate file extent item that is
      meant to represent a hole between the last file extent item of a leaf and
      the first file extent item in the next leaf. The failure (EEXIST error)
      leads to a transaction commit (as most errors when logging an inode do).
      
      For example, we have the two following leafs:
      
      Leaf N:
      
        -----------------------------------------------
        | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
        -----------------------------------------------
        The file extent item at the end of leaf N has a length of 4Kb,
        representing the file range from 64K to 68K - 1.
      
      Leaf N + 1:
      
        -----------------------------------------------
        | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
        -----------------------------------------------
        The file extent item at the first slot of leaf N + 1 has a length of
        4Kb too, representing the file range from 72K to 76K - 1.
      
      During the full fsync path, when we are at tree-log.c:copy_items() with
      leaf N as a parameter, after processing the last file extent item, that
      represents the extent at offset 64K, we take a look at the first file
      extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
      between the two extents, and therefore we insert a file extent item
      representing that hole, starting at file offset 68K and ending at offset
      72K - 1. However we don't update the value of *last_extent, which is used
      to represent the end offset (plus 1, non-inclusive end) of the last file
      extent item inserted in the log, so it stays with a value of 68K and not
      with a value of 72K.
      
      Then, when copy_items() is called for leaf N + 1, because the value of
      *last_extent is smaller then the offset of the first extent item in the
      leaf (68K < 72K), we look at the last file extent item in the previous
      leaf (leaf N) and see it there's a 4K gap between it and our first file
      extent item (again, 68K < 72K), so we decide to insert a file extent item
      representing the hole, starting at file offset 68K and ending at offset
      72K - 1, this insertion will fail with -EEXIST being returned from
      btrfs_insert_file_extent() because we already inserted a file extent item
      representing a hole for this offset (68K) in the previous call to
      copy_items(), when processing leaf N.
      
      The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
      which falls back to a full transaction commit.
      
      Fix this by adjusting *last_extent after inserting a hole when we had to
      look at the next leaf.
      
      Fixes: 4ee3fad3 ("Btrfs: fix fsync after hole punching when using no-holes feature")
      Cc: stable@vger.kernel.org # 4.14+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f9a774d
    • F
      Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path · 7ec747c8
      Filipe Manana 提交于
      commit 72bd2323ec87722c115a5906bc6a1b31d11e8f54 upstream.
      
      Currently when we fail to COW a path at btrfs_update_root() we end up
      always aborting the transaction. However all the current callers of
      btrfs_update_root() are able to deal with errors returned from it, many do
      end up aborting the transaction themselves (directly or not, such as the
      transaction commit path), other BUG_ON() or just gracefully cancel whatever
      they were doing.
      
      When syncing the fsync log, we call btrfs_update_root() through
      tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
      sync code does not abort the transaction, instead it gracefully handles
      the error and returns -EAGAIN to the fsync handler, so that it falls back
      to a transaction commit. Any other error different from -ENOSPC, makes the
      log sync code abort the transaction.
      
      So remove the transaction abort from btrfs_update_log() when we fail to
      COW a path to update the root item, so that if an -ENOSPC failure happens
      we avoid aborting the current transaction and have a chance of the fsync
      succeeding after falling back to a transaction commit.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Cc: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ec747c8
    • J
      btrfs: don't double unlock on error in btrfs_punch_hole · ce21e658
      Josef Bacik 提交于
      commit 8fca955057b9c58467d1b231e43f19c4cf26ae8c upstream.
      
      If we have an error writing out a delalloc range in
      btrfs_punch_hole_lock_range we'll unlock the inode and then goto
      out_only_mutex, where we will again unlock the inode.  This is bad,
      don't do this.
      
      Fixes: f27451f2 ("Btrfs: add support for fallocate's zero range operation")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ce21e658
  11. 22 5月, 2019 1 次提交
    • F
      Btrfs: do not start a transaction at iterate_extent_inodes() · 8a8f671b
      Filipe Manana 提交于
      commit bfc61c36260ca990937539cd648ede3cd749bc10 upstream.
      
      When finding out which inodes have references on a particular extent, done
      by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both
      v1 and v2) ioctl and from scrub we use the transaction join API to grab a
      reference on the currently running transaction, since in order to give
      accurate results we need to inspect the delayed references of the currently
      running transaction.
      
      However, if there is currently no running transaction, the join operation
      will create a new transaction. This is inefficient as the transaction will
      eventually be committed, doing unnecessary IO and introducing a potential
      point of failure that will lead to a transaction abort due to -ENOSPC, as
      recently reported [1].
      
      That's because the join, creates the transaction but does not reserve any
      space, so when attempting to update the root item of the root passed to
      btrfs_join_transaction(), during the transaction commit, we can end up
      failling with -ENOSPC. Users of a join operation are supposed to actually
      do some filesystem changes and reserve space by some means, which is not
      the case of iterate_extent_inodes(), it is a read-only operation for all
      contextes from which it is called.
      
      The reported [1] -ENOSPC failure stack trace is the following:
      
       heisenberg kernel: ------------[ cut here ]------------
       heisenberg kernel: BTRFS: Transaction aborted (error -28)
       heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2
       heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018
       heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs]
      (...)
       heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286
       heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006
       heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0
       heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007
       heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078
       heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068
       heisenberg kernel: FS:  0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000
       heisenberg kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0
       heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       heisenberg kernel: Call Trace:
       heisenberg kernel:  commit_fs_roots+0x166/0x1d0 [btrfs]
       heisenberg kernel:  ? _cond_resched+0x15/0x30
       heisenberg kernel:  ? btrfs_run_delayed_refs+0xac/0x180 [btrfs]
       heisenberg kernel:  btrfs_commit_transaction+0x2bd/0x870 [btrfs]
       heisenberg kernel:  ? start_transaction+0x9d/0x3f0 [btrfs]
       heisenberg kernel:  transaction_kthread+0x147/0x180 [btrfs]
       heisenberg kernel:  ? btrfs_cleanup_transaction+0x530/0x530 [btrfs]
       heisenberg kernel:  kthread+0x112/0x130
       heisenberg kernel:  ? kthread_bind+0x30/0x30
       heisenberg kernel:  ret_from_fork+0x35/0x40
       heisenberg kernel: ---[ end trace 05de912e30e012d9 ]---
      
      So fix that by using the attach API, which does not create a transaction
      when there is currently no running transaction.
      
      [1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a8f671b