1. 09 9月, 2019 33 次提交
    • D
    • D
      f64ce7b8
    • D
      btrfs: tree-log: convert defines to enums · e13976cf
      David Sterba 提交于
      Used only for in-memory state tracking.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e13976cf
    • D
      btrfs: remove unused key type set/get helpers · 82253cb6
      David Sterba 提交于
      The switch to open coded set/get has happend long time ago in
      962a298f ("btrfs: kill the key type accessor helpers"), remove the
      stray helpers.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      82253cb6
    • D
      btrfs: remove unused btrfs_device::flush_bio_sent · adf4c0c5
      David Sterba 提交于
      The status of flush bio is tracked as a status bit, changed in commit
      1c3063b6 ("btrfs: cleanup device states define
      BTRFS_DEV_STATE_FLUSH_SENT"), the flush_bio_sent was forgotten.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      adf4c0c5
    • F
      Btrfs: remove unnecessary condition in btrfs_clone() to avoid too much nesting · b64119b5
      Filipe Manana 提交于
      The bulk of the work done when cloning extents, at ioctl.c:btrfs_clone(),
      is done inside an if statement that checks if the found key has the type
      BTRFS_EXTENT_DATA_KEY. That if statement is redundant however, because
      btrfs_search_slot() always leaves us in a leaf slot that points to a key
      that is always greater then or equals to the search key, and our search
      key here has that type, and because we bail out before that if statement
      if the key at the given leaf slot is greater then BTRFS_EXTENT_DATA_KEY.
      
      Therefore just remove that if statement, not only because it is useless
      but mostly because it increases the nesting/indentation level in this
      function which is quite big and makes things a bit awkward whenever I need
      to fix something that requires changing btrfs_clone() (and it has been
      like that for many years already).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b64119b5
    • N
      btrfs: Refactor btrfs_calc_avail_data_space · 559ca6ea
      Nikolay Borisov 提交于
      Simplify the code by removing variables that don't bring any real value
      as well as simplifying the checks when buidling the candidate list of
      devices. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      559ca6ea
    • N
      btrfs: Remove unnecessary check from join_running_log_trans · e678934c
      Nikolay Borisov 提交于
      join_running_log_trans checks btrfs_root::log_root outside of
      btrfs_root::log_mutex to avoid contention on the mutex. Turns out this
      check is not necessary because the two callers of join_running_log_trans
      (both of which deal with removing entries from the tree-log during
      unlink) explicitly check whether the respective inode has been logged in
      the current transaction.
      
      If it hasn't then it won't have any items in the tree-log and call path
      will return before calling join_running_log_trans. If the check passes,
      however, then it's guaranteed that btrfs_root::log_root is set because
      the inode is logged.
      
      Those guarantees allows us to remove the speculative as well as the
      implicity and tricky memory barrier.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e678934c
    • F
      Btrfs: wake up inode cache waiters sooner to reduce waiting time · 32e53440
      Filipe Manana 提交于
      If we need to start an inode caching thread, because none currently exists
      on disk, we can wake up all waiters as soon as we mark the range starting
      at root's highest objectid + 1 and ending at BTRFS_LAST_FREE_OBJECTID as
      free, so that they don't need to wait for the caching thread to start and
      do some progress. We follow the same approach within the caching thread,
      since as soon as it finds a free range and marks it as free space in the
      cache, it wakes up all waiters. So improve this by adding such a wakeup
      call after marking that initial range as free space.
      
      Fixes: a47d6b70 ("Btrfs: setup free ino caching in a more asynchronous way")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      32e53440
    • F
      Btrfs: fix inode cache waiters hanging on path allocation failure · 9d123a35
      Filipe Manana 提交于
      If the caching thread fails to allocate a path, it returns without waking
      up any cache waiters, leaving them hang forever. Fix this by following the
      same approach as when we fail to start the caching thread: print an error
      message, disable inode caching and make the wakers fallback to non-caching
      mode behaviour (calling btrfs_find_free_objectid()).
      
      Fixes: 581bb050 ("Btrfs: Cache free inode numbers in memory")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9d123a35
    • F
      Btrfs: fix inode cache waiters hanging on failure to start caching thread · a68ebe07
      Filipe Manana 提交于
      If we fail to start the inode caching thread, we print an error message
      and disable the inode cache, however we never wake up any waiters, so they
      hang forever waiting for the caching to finish. Fix this by waking them
      up and have them fallback to a call to btrfs_find_free_objectid().
      
      Fixes: e60efa84 ("Btrfs: avoid triggering bug_on() when we fail to start inode caching task")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a68ebe07
    • F
      Btrfs: fix inode cache block reserve leak on failure to allocate data space · 29d47d00
      Filipe Manana 提交于
      If we failed to allocate the data extent(s) for the inode space cache, we
      were bailing out without releasing the previously reserved metadata. This
      was triggering the following warnings when unmounting a filesystem:
      
        $ cat -n fs/btrfs/inode.c
        (...)
        9268  void btrfs_destroy_inode(struct inode *inode)
        9269  {
        (...)
        9276          WARN_ON(BTRFS_I(inode)->block_rsv.reserved);
        9277          WARN_ON(BTRFS_I(inode)->block_rsv.size);
        (...)
        9281          WARN_ON(BTRFS_I(inode)->csum_bytes);
        9282          WARN_ON(BTRFS_I(inode)->defrag_bytes);
        (...)
      
      Several fstests test cases triggered this often, such as generic/083,
      generic/102, generic/172, generic/269 and generic/300 at least, producing
      stack traces like the following in dmesg/syslog:
      
        [82039.079546] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9276 btrfs_destroy_inode+0x203/0x270 [btrfs]
        (...)
        [82039.081543] CPU: 2 PID: 13167 Comm: umount Tainted: G        W         5.2.0-rc4-btrfs-next-50 #1
        [82039.081912] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [82039.082673] RIP: 0010:btrfs_destroy_inode+0x203/0x270 [btrfs]
        (...)
        [82039.083913] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010206
        [82039.084320] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002
        [82039.084736] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8dde29b34660
        [82039.085156] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000
        [82039.085578] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0
        [82039.086000] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000
        [82039.086416] FS:  00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000
        [82039.086837] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [82039.087253] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0
        [82039.087672] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [82039.088089] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [82039.088504] Call Trace:
        [82039.088918]  destroy_inode+0x3b/0x70
        [82039.089340]  btrfs_free_fs_root+0x16/0xa0 [btrfs]
        [82039.089768]  btrfs_free_fs_roots+0xd8/0x160 [btrfs]
        [82039.090183]  ? wait_for_completion+0x65/0x1a0
        [82039.090607]  close_ctree+0x172/0x370 [btrfs]
        [82039.091021]  generic_shutdown_super+0x6c/0x110
        [82039.091427]  kill_anon_super+0xe/0x30
        [82039.091832]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [82039.092233]  deactivate_locked_super+0x3a/0x70
        [82039.092636]  cleanup_mnt+0x3b/0x80
        [82039.093039]  task_work_run+0x93/0xc0
        [82039.093457]  exit_to_usermode_loop+0xfa/0x100
        [82039.093856]  do_syscall_64+0x162/0x1d0
        [82039.094244]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [82039.094634] RIP: 0033:0x7f8db8fbab37
        (...)
        [82039.095876] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [82039.096290] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37
        [82039.096700] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240
        [82039.097110] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015
        [82039.097522] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64
        [82039.097937] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0
        [82039.098350] irq event stamp: 0
        [82039.098750] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [82039.099150] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.099545] softirqs last  enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.099925] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [82039.100292] ---[ end trace f2521afa616ddccc ]---
        [82039.100707] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9277 btrfs_destroy_inode+0x1ac/0x270 [btrfs]
        (...)
        [82039.103050] CPU: 2 PID: 13167 Comm: umount Tainted: G        W         5.2.0-rc4-btrfs-next-50 #1
        [82039.103428] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [82039.104203] RIP: 0010:btrfs_destroy_inode+0x1ac/0x270 [btrfs]
        (...)
        [82039.105461] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010206
        [82039.105866] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002
        [82039.106270] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8dde29b34660
        [82039.106673] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000
        [82039.107078] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0
        [82039.107487] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000
        [82039.107894] FS:  00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000
        [82039.108309] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [82039.108723] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0
        [82039.109146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [82039.109567] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [82039.109989] Call Trace:
        [82039.110405]  destroy_inode+0x3b/0x70
        [82039.110830]  btrfs_free_fs_root+0x16/0xa0 [btrfs]
        [82039.111257]  btrfs_free_fs_roots+0xd8/0x160 [btrfs]
        [82039.111675]  ? wait_for_completion+0x65/0x1a0
        [82039.112101]  close_ctree+0x172/0x370 [btrfs]
        [82039.112519]  generic_shutdown_super+0x6c/0x110
        [82039.112988]  kill_anon_super+0xe/0x30
        [82039.113439]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [82039.113861]  deactivate_locked_super+0x3a/0x70
        [82039.114278]  cleanup_mnt+0x3b/0x80
        [82039.114685]  task_work_run+0x93/0xc0
        [82039.115083]  exit_to_usermode_loop+0xfa/0x100
        [82039.115476]  do_syscall_64+0x162/0x1d0
        [82039.115863]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [82039.116254] RIP: 0033:0x7f8db8fbab37
        (...)
        [82039.117463] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [82039.117882] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37
        [82039.118330] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240
        [82039.118743] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015
        [82039.119159] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64
        [82039.119574] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0
        [82039.119987] irq event stamp: 0
        [82039.120387] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [82039.120787] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.121182] softirqs last  enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.121563] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [82039.121933] ---[ end trace f2521afa616ddccd ]---
        [82039.122353] WARNING: CPU: 2 PID: 13167 at fs/btrfs/inode.c:9278 btrfs_destroy_inode+0x1bc/0x270 [btrfs]
        (...)
        [82039.124606] CPU: 2 PID: 13167 Comm: umount Tainted: G        W         5.2.0-rc4-btrfs-next-50 #1
        [82039.125008] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [82039.125801] RIP: 0010:btrfs_destroy_inode+0x1bc/0x270 [btrfs]
        (...)
        [82039.126998] RSP: 0018:ffffac0b426a7d30 EFLAGS: 00010202
        [82039.127399] RAX: ffff8ddf77691158 RBX: ffff8dde29b34660 RCX: 0000000000000002
        [82039.127803] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8dde29b34660
        [82039.128206] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000
        [82039.128611] R10: ffffac0b426a7c90 R11: ffffffffb9aad768 R12: ffffac0b426a7db0
        [82039.129020] R13: ffff8ddf5fbec0a0 R14: dead000000000100 R15: 0000000000000000
        [82039.129428] FS:  00007f8db96d12c0(0000) GS:ffff8de036b00000(0000) knlGS:0000000000000000
        [82039.129846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [82039.130261] CR2: 0000000001416108 CR3: 00000002315cc001 CR4: 00000000003606e0
        [82039.130684] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [82039.131142] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [82039.131561] Call Trace:
        [82039.131990]  destroy_inode+0x3b/0x70
        [82039.132417]  btrfs_free_fs_root+0x16/0xa0 [btrfs]
        [82039.132844]  btrfs_free_fs_roots+0xd8/0x160 [btrfs]
        [82039.133262]  ? wait_for_completion+0x65/0x1a0
        [82039.133688]  close_ctree+0x172/0x370 [btrfs]
        [82039.134157]  generic_shutdown_super+0x6c/0x110
        [82039.134575]  kill_anon_super+0xe/0x30
        [82039.134997]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [82039.135415]  deactivate_locked_super+0x3a/0x70
        [82039.135832]  cleanup_mnt+0x3b/0x80
        [82039.136239]  task_work_run+0x93/0xc0
        [82039.136637]  exit_to_usermode_loop+0xfa/0x100
        [82039.137029]  do_syscall_64+0x162/0x1d0
        [82039.137418]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [82039.137812] RIP: 0033:0x7f8db8fbab37
        (...)
        [82039.139059] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [82039.139475] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37
        [82039.139890] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240
        [82039.140302] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015
        [82039.140719] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64
        [82039.141138] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0
        [82039.141597] irq event stamp: 0
        [82039.142043] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [82039.142443] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.142839] softirqs last  enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.143220] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [82039.143588] ---[ end trace f2521afa616ddcce ]---
        [82039.167472] WARNING: CPU: 3 PID: 13167 at fs/btrfs/extent-tree.c:10120 btrfs_free_block_groups+0x30d/0x460 [btrfs]
        (...)
        [82039.173800] CPU: 3 PID: 13167 Comm: umount Tainted: G        W         5.2.0-rc4-btrfs-next-50 #1
        [82039.174847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [82039.177031] RIP: 0010:btrfs_free_block_groups+0x30d/0x460 [btrfs]
        (...)
        [82039.180397] RSP: 0018:ffffac0b426a7dd8 EFLAGS: 00010206
        [82039.181574] RAX: ffff8de010a1db40 RBX: ffff8de010a1db40 RCX: 0000000000170014
        [82039.182711] RDX: ffff8ddff4380040 RSI: ffff8de010a1da58 RDI: 0000000000000246
        [82039.183817] RBP: ffff8ddf5fbec000 R08: 0000000000000000 R09: 0000000000000000
        [82039.184925] R10: ffff8de036404380 R11: ffffffffb8a5ea00 R12: ffff8de010a1b2b8
        [82039.186090] R13: ffff8de010a1b2b8 R14: 0000000000000000 R15: dead000000000100
        [82039.187208] FS:  00007f8db96d12c0(0000) GS:ffff8de036b80000(0000) knlGS:0000000000000000
        [82039.188345] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [82039.189481] CR2: 00007fb044005170 CR3: 00000002315cc006 CR4: 00000000003606e0
        [82039.190674] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [82039.191829] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [82039.192978] Call Trace:
        [82039.194160]  close_ctree+0x19a/0x370 [btrfs]
        [82039.195315]  generic_shutdown_super+0x6c/0x110
        [82039.196486]  kill_anon_super+0xe/0x30
        [82039.197645]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [82039.198696]  deactivate_locked_super+0x3a/0x70
        [82039.199619]  cleanup_mnt+0x3b/0x80
        [82039.200559]  task_work_run+0x93/0xc0
        [82039.201505]  exit_to_usermode_loop+0xfa/0x100
        [82039.202436]  do_syscall_64+0x162/0x1d0
        [82039.203339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [82039.204091] RIP: 0033:0x7f8db8fbab37
        (...)
        [82039.206360] RSP: 002b:00007ffdce35b468 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        [82039.207132] RAX: 0000000000000000 RBX: 0000560d20b00060 RCX: 00007f8db8fbab37
        [82039.207906] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000560d20b00240
        [82039.208621] RBP: 0000560d20b00240 R08: 0000560d20b00270 R09: 0000000000000015
        [82039.209285] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f8db94bce64
        [82039.209984] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffdce35b6f0
        [82039.210642] irq event stamp: 0
        [82039.211306] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [82039.211971] hardirqs last disabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.212643] softirqs last  enabled at (0): [<ffffffffb7884ff2>] copy_process.part.33+0x7f2/0x1f00
        [82039.213304] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [82039.213875] ---[ end trace f2521afa616ddccf ]---
      
      Fix this by releasing the reserved metadata on failure to allocate data
      extent(s) for the inode cache.
      
      Fixes: 69fe2d75 ("btrfs: make the delalloc block rsv per inode")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      29d47d00
    • F
      Btrfs: fix hang when loading existing inode cache off disk · 7764d56b
      Filipe Manana 提交于
      If we are able to load an existing inode cache off disk, we set the state
      of the cache to BTRFS_CACHE_FINISHED, but we don't wake up any one waiting
      for the cache to be available. This means that anyone waiting for the
      cache to be available, waiting on the condition that either its state is
      BTRFS_CACHE_FINISHED or its available free space is greather than zero,
      can hang forever.
      
      This could be observed running fstests with MOUNT_OPTIONS="-o inode_cache",
      in particular test case generic/161 triggered it very frequently for me,
      producing a trace like the following:
      
        [63795.739712] BTRFS info (device sdc): enabling inode map caching
        [63795.739714] BTRFS info (device sdc): disk space caching is enabled
        [63795.739716] BTRFS info (device sdc): has skinny extents
        [64036.653886] INFO: task btrfs-transacti:3917 blocked for more than 120 seconds.
        [64036.654079]       Not tainted 5.2.0-rc4-btrfs-next-50 #1
        [64036.654143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [64036.654232] btrfs-transacti D    0  3917      2 0x80004000
        [64036.654239] Call Trace:
        [64036.654258]  ? __schedule+0x3ae/0x7b0
        [64036.654271]  schedule+0x3a/0xb0
        [64036.654325]  btrfs_commit_transaction+0x978/0xae0 [btrfs]
        [64036.654339]  ? remove_wait_queue+0x60/0x60
        [64036.654395]  transaction_kthread+0x146/0x180 [btrfs]
        [64036.654450]  ? btrfs_cleanup_transaction+0x620/0x620 [btrfs]
        [64036.654456]  kthread+0x103/0x140
        [64036.654464]  ? kthread_create_worker_on_cpu+0x70/0x70
        [64036.654476]  ret_from_fork+0x3a/0x50
        [64036.654504] INFO: task xfs_io:3919 blocked for more than 120 seconds.
        [64036.654568]       Not tainted 5.2.0-rc4-btrfs-next-50 #1
        [64036.654617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [64036.654685] xfs_io          D    0  3919   3633 0x00000000
        [64036.654691] Call Trace:
        [64036.654703]  ? __schedule+0x3ae/0x7b0
        [64036.654716]  schedule+0x3a/0xb0
        [64036.654756]  btrfs_find_free_ino+0xa9/0x120 [btrfs]
        [64036.654764]  ? remove_wait_queue+0x60/0x60
        [64036.654809]  btrfs_create+0x72/0x1f0 [btrfs]
        [64036.654822]  lookup_open+0x6bc/0x790
        [64036.654849]  path_openat+0x3bc/0xc00
        [64036.654854]  ? __lock_acquire+0x331/0x1cb0
        [64036.654869]  do_filp_open+0x99/0x110
        [64036.654884]  ? __alloc_fd+0xee/0x200
        [64036.654895]  ? do_raw_spin_unlock+0x49/0xc0
        [64036.654909]  ? do_sys_open+0x132/0x220
        [64036.654913]  do_sys_open+0x132/0x220
        [64036.654926]  do_syscall_64+0x60/0x1d0
        [64036.654933]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this by adding a wake_up() call right after setting the cache state to
      BTRFS_CACHE_FINISHED, at start_caching(), when we are able to load the
      cache from disk.
      
      Fixes: 82d5902d ("Btrfs: Support reading/writing on disk free ino cache")
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7764d56b
    • Q
      btrfs: tree-checker: Add ROOT_ITEM check · 259ee775
      Qu Wenruo 提交于
      This patch will introduce ROOT_ITEM check, which includes:
      - Key->objectid and key->offset check
        Currently only some easy check, e.g. 0 as rootid is invalid.
      
      - Item size check
        Root item size is fixed.
      
      - Generation checks
        Generation, generation_v2 and last_snapshot should not be greater than
        super generation + 1
      
      - Level and alignment check
        Level should be in [0, 7], and bytenr must be aligned to sector size.
      
      - Flags check
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203261Reported-by: NJungyeon Yoon <jungyeon.yoon@gmail.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      259ee775
    • Q
      btrfs: extent-tree: Make sure we only allocate extents from block groups with the same type · 2a28468e
      Qu Wenruo 提交于
      [BUG]
      With fuzzed image and MIXED_GROUPS super flag, we can hit the following
      BUG_ON():
      
        kernel BUG at fs/btrfs/delayed-ref.c:491!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 0 PID: 1849 Comm: sync Tainted: G           O      5.2.0-custom #27
        RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs]
        Call Trace:
         add_delayed_ref_head+0x20c/0x2d0 [btrfs]
         btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs]
         btrfs_free_tree_block+0x123/0x380 [btrfs]
         __btrfs_cow_block+0x435/0x500 [btrfs]
         btrfs_cow_block+0x110/0x240 [btrfs]
         btrfs_search_slot+0x230/0xa00 [btrfs]
         ? __lock_acquire+0x105e/0x1e20
         btrfs_insert_empty_items+0x67/0xc0 [btrfs]
         alloc_reserved_file_extent+0x9e/0x340 [btrfs]
         __btrfs_run_delayed_refs+0x78e/0x1240 [btrfs]
         ? kvm_clock_read+0x18/0x30
         ? __sched_clock_gtod_offset+0x21/0x50
         btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs]
         btrfs_run_delayed_refs+0x23/0x30 [btrfs]
         btrfs_commit_transaction+0x53/0x9f0 [btrfs]
         btrfs_sync_fs+0x7c/0x1c0 [btrfs]
         ? __ia32_sys_fdatasync+0x20/0x20
         sync_fs_one_sb+0x23/0x30
         iterate_supers+0x95/0x100
         ksys_sync+0x62/0xb0
         __ia32_sys_sync+0xe/0x20
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      This situation is caused by several factors:
      - Fuzzed image
        The extent tree of this fs missed one backref for extent tree root.
        So we can allocated space from that slot.
      
      - MIXED_BG feature
        Super block has MIXED_BG flag.
      
      - No mixed block groups exists
        All block groups are just regular ones.
      
      This makes data space_info->block_groups[] contains metadata block
      groups.  And when we reserve space for data, we can use space in
      metadata block group.
      
      Then we hit the following file operations:
      
      - fallocate
        We need to allocate data extents.
        find_free_extent() choose to use the metadata block to allocate space
        from, and choose the space of extent tree root, since its backref is
        missing.
      
        This generate one delayed ref head with is_data = 1.
      
      - extent tree update
        We need to update extent tree at run_delayed_ref time.
      
        This generate one delayed ref head with is_data = 0, for the same
        bytenr of old extent tree root.
      
      Then we trigger the BUG_ON().
      
      [FIX]
      The quick fix here is to check block_group->flags before using it.
      
      The problem can only happen for MIXED_GROUPS fs. Regular filesystems
      won't have space_info with DATA|METADATA flag, and no way to hit the
      bug.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255Reported-by: NJungyeon Yoon <jungyeon.yoon@gmail.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2a28468e
    • Q
      btrfs: delayed-inode: Kill the BUG_ON() in btrfs_delete_delayed_dir_index() · 933c22a7
      Qu Wenruo 提交于
      There is one report of fuzzed image which leads to BUG_ON() in
      btrfs_delete_delayed_dir_index().
      
      Although that fuzzed image can already be addressed by enhanced
      extent-tree error handler, it's still better to hunt down more BUG_ON().
      
      This patch will hunt down two BUG_ON()s in
      btrfs_delete_delayed_dir_index():
      - One for error from btrfs_delayed_item_reserve_metadata()
        Instead of BUG_ON(), we output an error message and free the item.
        And return the error.
        All callers of this function handles the error by aborting current
        trasaction.
      
      - One for possible EEXIST from __btrfs_add_delayed_deletion_item()
        That function can return -EEXIST.
        We already have a good enough error message for that, only need to
        clean up the reserved metadata space and allocated item.
      
      To help above cleanup, also modifiy __btrfs_remove_delayed_item() called
      in btrfs_release_delayed_item(), to skip unassociated item.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203253Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      933c22a7
    • Q
      btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate() · 112974d4
      Qu Wenruo 提交于
      [BUG]
      Test case btrfs/156 fails since commit 302167c5 ("btrfs: don't end
      the transaction for delayed refs in throttle") with ENOSPC.
      
      [CAUSE]
      The ENOSPC is reported from btrfs_can_relocate().
      
      This function will check:
      - If this block group is empty, we can relocate
      - If we can enough free space, we can relocate
      
      Above checks are valid but the following check is vague due to its
      implementation:
      - If and only if we can allocated a new block group to contain all the
        used space, we can relocate
      
      This design itself is OK, but the way to determine if we can allocate a
      new block group is problematic.
      
      btrfs_can_relocate() uses find_free_dev_extent() to find free space on a
      device.
      However find_free_dev_extent() only searches commit root and excludes
      dev extents allocated in current trans, this makes it unable to use dev
      extent just freed in current transaction.
      
      So for the following example, btrfs_can_relocate() will report ENOSPC:
      The example block group layout:
      1M      129M        257M       385M      513M       550M
      |///////|///////////|//////////|         |          |
      // = Used bg, consider all bg is 100% used for easy calculation.
      And all block groups are SINGLE, on-disk bytenr is the same as the
      logical bytenr.
      
      1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100
      1M      129M        257M       385M      513M       550M
      |///////|           |//////////|/////////|
      In transid 100, bg in [129M, 257M) get relocated to [385M, 513M)
      
      However transid 100 is not committed yet, so in dev commit tree, we
      still have the old dev extents layout:
      1M      129M        257M       385M      513M       550M
      |///////|///////////|//////////|         |          |
      
      2) Try to relocate bg [257M, 385M)
      We goes into btrfs_can_relocate(), no free space in current bgs, so we
      check if we can find large enough free dev extents.
      
      The first slot is [385M, 513M), but that is already used by new bg at
      [385M, 513M), so we continue search.
      
      The remaining slot is [512M, 550M), smaller than the bg's length 128M.
      So btrfs_can_relocate report ENOSPC.
      
      However this is over killed, in fact if we just skip btrfs_can_relocate()
      check, and go into regular relocation routine, at extent reservation time,
      if we can't find free extent, then we fallback to commit transaction,
      which will free up the dev extents and allow new block group to be created.
      
      [FIX]
      The fix here is to remove btrfs_can_relocate() completely.
      
      If we hit the false ENOSPC case just like btrfs/156, extent allocator
      will push harder by committing transaction and we will have space for
      new block group, avoiding the false ENOSPC.
      
      If we really ran out of space, we will hit ENOSPC at
      relocate_block_group(), and btrfs will just reports the ENOSPC error as
      usual.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      112974d4
    • Q
      btrfs: extent-tree: Add comment for inc_block_group_ro() · e9138142
      Qu Wenruo 提交于
      inc_block_group_ro() is only designed to mark one block group read-only,
      it doesn't really care if other block groups have enough free space to
      contain the used space in the block group.
      
      However due to the close connection between this function and
      relocation, sometimes we can be confused and think this function is
      responsible for balance space reservation, which is not true.
      
      Add some comment to make the functionality clear.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e9138142
    • Q
      btrfs: volumes: Add comment for find_free_dev_extent_start() · 135da976
      Qu Wenruo 提交于
      Since commit 6df9a95e ("Btrfs: make the chunk allocator completely
      tree lockless") we search commit root of device tree to avoid deadlock.
      
      This introduced a safety feature, find_free_dev_extent_start() won't
      use dev extents which just get freed in current transaction.
      
      This safety feature makes sure we won't allocate new block group using
      just freed dev extents to break CoW.
      
      However, this feature also makes find_free_dev_extent_start() not
      reliable reporting free device space.  Just add such comment to make
      later viewer careful about this behavior.
      
      This behavior makes one caller, btrfs_can_relocate() unreliable
      determining the device free space.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      135da976
    • Q
      btrfs: volumes: Unexport find_free_dev_extent_start() · 9e3246a5
      Qu Wenruo 提交于
      This function is only used locally in find_free_dev_extent(), no
      external callers.
      
      So unexport it.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9e3246a5
    • D
      btrfs: assert tree mod log lock in __tree_mod_log_insert · 73e82fe4
      David Sterba 提交于
      The tree is going to be modified so it must be the exclusive lock.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      73e82fe4
    • D
      btrfs: assert extent map tree lock in add_extent_mapping · d23ea3fa
      David Sterba 提交于
      As add_extent_mapping is called from several functions, let's add the
      lock annotation. The tree is going to be modified so it must be the
      exclusive lock.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d23ea3fa
    • J
      btrfs: Add an assertion to warn incorrect case in insert_inline_extent() · 982f1f5d
      Jia-Ju Bai 提交于
      In insert_inline_extent(), the case that checks compressed_size > 0
      and compressed_pages = NULL cannot occur, otherwise a null-pointer
      dereference may occur on line 215:
      
           cpage = compressed_pages[i];
      
      To catch this incorrect case, an assertion is added.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJia-Ju Bai <baijiaju1990@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      982f1f5d
    • N
      btrfs: Remove leftover of in-band dedupe · 330a5827
      Nikolay Borisov 提交于
      It's unlikely in-band dedupe is going to land so just remove any
      leftovers - dedupe.h header as well as the 'dedupe' parameter to
      btrfs_set_extent_delalloc.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      330a5827
    • N
      btrfs: Remove delalloc_end argument from extent_clear_unlock_delalloc · 74e9194a
      Nikolay Borisov 提交于
      It was added in ba8b04c1 ("btrfs: extend btrfs_set_extent_delalloc
      and its friends to support in-band dedupe and subpage size patchset") as
      a preparatory patch for in-band and subapge block size patchsets.
      However neither of those are likely to be merged anytime soon and the
      code has diverged significantly from the last public post of either
      of those patchsets.
      
      It's unlikely either of the patchests are going to use those preparatory
      steps so just remove the variables. Since cow_file_range also took
      delalloc_end to pass it to extent_clear_unlock_delalloc remove the
      parameter from that function as well.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      74e9194a
    • N
      btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range · cecc8d90
      Nikolay Borisov 提交于
      This label is only executed if compress_file_range fails to create an
      inline extent. So move its code in the semantically related inline
      extent handling branch. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cecc8d90
    • N
      btrfs: Return number of compressed extents directly in compress_file_range · ac3e9933
      Nikolay Borisov 提交于
      compress_file_range returns a void, yet uses a function parameter as a
      return value. Make that more idiomatic by simply returning the number
      of compressed extents directly. Also track such extents in more aptly
      named variables. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ac3e9933
    • E
      btrfs: use common vfs LABEL ioctl definitions · 40cf931f
      Eric Sandeen 提交于
      I lifted the btrfs label get/set ioctls to the vfs some time ago, but
      never followed up to use those common definitions directly in btrfs.
      
      This patch does that.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      40cf931f
    • N
      btrfs: Remove unused locking functions · 5044ed4f
      Nikolay Borisov 提交于
      Those were split out of btrfs_clear_lock_blocking_rw by
      aa12c027 ("btrfs: split btrfs_clear_lock_blocking_rw to read and write helpers")
      however at that time this function was unused due to commit
      52398340 ("Btrfs: kill btrfs_clear_path_blocking"). Put the final
      nail in the coffin of those 2 functions.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5044ed4f
    • A
      btrfs: reduce stack usage for btrfsic_process_written_block · 8ddc3197
      Arnd Bergmann 提交于
      btrfsic_process_written_block() cals btrfsic_process_metablock(),
      which has a fairly large stack usage due to the btrfsic_stack_frame
      variable. It also calls btrfsic_test_for_metadata(), which now
      needs several hundreds of bytes for its SHASH_DESC_ON_STACK().
      
      In some configurations, we end up with both functions on the
      same stack, and gcc warns about the excessive stack usage that
      might cause the available stack space to run out:
      
      fs/btrfs/check-integrity.c:1743:13: error: stack frame size of 1152 bytes in function 'btrfsic_process_written_block' [-Werror,-Wframe-larger-than=]
      
      Marking both child functions as noinline_for_stack helps because
      this guarantees that the large variables are not on the same
      stack frame.
      
      Fixes: d5178578 ("btrfs: directly call into crypto framework for checksumming")
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ddc3197
    • Y
      btrfs: remove set but not used variable 'offset' · 99fccf33
      YueHaibing 提交于
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      fs/btrfs/volumes.c: In function __btrfs_map_block:
      fs/btrfs/volumes.c:6023:6: warning:
       variable offset set but not used [-Wunused-but-set-variable]
      
      It is not used any more since commit 343abd1c0ca9 ("btrfs: Use
      btrfs_get_io_geometry appropriately")
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      99fccf33
    • F
      Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents · 690a5dbf
      Filipe Manana 提交于
      When cloning extents (or deduplicating) we create a transaction with a
      space reservation that considers we will drop or update a single file
      extent item of the destination inode (that we modify a single leaf). That
      is fine for the vast majority of scenarios, however it might happen that
      we need to drop many file extent items, and adjust at most two file extent
      items, in the destination root, which can span multiple leafs. This will
      lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
      the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
      (called through clone_finish_inode_update()) to fail with ENOSPC. Such
      failure results in a transaction abort, leaving the filesystem in a
      read-only mode.
      
      In order to fix this we need to follow the same approach as the hole
      punching code, where we create a local reservation with 1 unit and keep
      ending and starting transactions, after balancing the btree inode,
      when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
      extent cloning call calls the recently added btrfs_punch_hole_range()
      helper, which is what does the mentioned work for hole punching, and
      make sure whenever we drop extent items in a transaction, we also add a
      replacing file extent item, to avoid corruption (a hole) if after ending
      a transaction and before starting a new one, the old transaction gets
      committed and a power failure happens before we finish cloning.
      
      A test case for fstests follows soon.
      Reported-by: NDavid Goodwin <david@codepoets.co.uk>
      Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/Reported-by: NSam Tygier <sam@tygier.co.uk>
      Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
      Fixes: b6f3409b ("Btrfs: reserve sufficient space for ioctl clone")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      690a5dbf
    • F
      Btrfs: factor out extent dropping code from hole punch handler · 9cba40a6
      Filipe Manana 提交于
      Move the code that is responsible for dropping extents in a range out of
      btrfs_punch_hole() into a new helper function, btrfs_punch_hole_range(),
      so that later it can be used by the reflinking (extent cloning and dedup)
      code to fix a ENOSPC bug.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9cba40a6
  2. 07 8月, 2019 2 次提交
    • Q
      btrfs: trim: Check the range passed into to prevent overflow · 07301df7
      Qu Wenruo 提交于
      Normally the range->len is set to default value (U64_MAX), but when it's
      not default value, we should check if the range overflows.
      
      And if it overflows, return -EINVAL before doing anything.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      07301df7
    • F
      Btrfs: fix sysfs warning and missing raid sysfs directories · d7cd4dd9
      Filipe Manana 提交于
      In the 5.3 merge window, commit 7c7e3014 ("btrfs: sysfs: Replace
      default_attrs in ktypes with groups"), we started using the member
      "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads
      to a series of warnings when running some test cases of fstests, such
      as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those
      warnings are like the following:
      
        [116648.059212] kernfs: can not remove 'total_bytes', no directory
        [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
        (...)
        [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
        [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000
        [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
        [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001
        [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
        [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
        [116648.077143] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
        [116648.077852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
        [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [116648.080585] Call Trace:
        [116648.081262]  remove_files+0x31/0x70
        [116648.081929]  sysfs_remove_group+0x38/0x80
        [116648.082596]  sysfs_remove_groups+0x34/0x70
        [116648.083258]  kobject_del+0x20/0x60
        [116648.083933]  btrfs_free_block_groups+0x405/0x430 [btrfs]
        [116648.084608]  close_ctree+0x19a/0x380 [btrfs]
        [116648.085278]  generic_shutdown_super+0x6c/0x110
        [116648.085951]  kill_anon_super+0xe/0x30
        [116648.086621]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [116648.087289]  deactivate_locked_super+0x3a/0x70
        [116648.087956]  cleanup_mnt+0xb4/0x160
        [116648.088620]  task_work_run+0x7e/0xc0
        [116648.089285]  exit_to_usermode_loop+0xfa/0x100
        [116648.089933]  do_syscall_64+0x1cb/0x220
        [116648.090567]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [116648.091197] RIP: 0033:0x7f9cdc073b37
        (...)
        [116648.100046] ---[ end trace 22e24db328ccadf8 ]---
        [116648.100618] ------------[ cut here ]------------
        [116648.101175] kernfs: can not remove 'used_bytes', no directory
        [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G        W         5.3.0-rc3-btrfs-next-54 #1
        (...)
        [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80
        (...)
        [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282
        [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000
        [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8
        [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001
        [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120
        [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100
        [116648.113268] FS:  00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000
        [116648.113939] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0
        [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [116648.116649] Call Trace:
        [116648.117326]  remove_files+0x31/0x70
        [116648.117997]  sysfs_remove_group+0x38/0x80
        [116648.118671]  sysfs_remove_groups+0x34/0x70
        [116648.119342]  kobject_del+0x20/0x60
        [116648.120022]  btrfs_free_block_groups+0x405/0x430 [btrfs]
        [116648.120707]  close_ctree+0x19a/0x380 [btrfs]
        [116648.121396]  generic_shutdown_super+0x6c/0x110
        [116648.122057]  kill_anon_super+0xe/0x30
        [116648.122702]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [116648.123335]  deactivate_locked_super+0x3a/0x70
        [116648.123961]  cleanup_mnt+0xb4/0x160
        [116648.124586]  task_work_run+0x7e/0xc0
        [116648.125210]  exit_to_usermode_loop+0xfa/0x100
        [116648.125830]  do_syscall_64+0x1cb/0x220
        [116648.126463]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [116648.127080] RIP: 0033:0x7f9cdc073b37
        (...)
        [116648.135923] ---[ end trace 22e24db328ccadf9 ]---
      
      These happen because, during the unmount path, we call kobject_del() for
      raid kobjects that are not fully initialized, meaning that we set their
      ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set
      their parent kobject, which is done through btrfs_add_raid_kobjects().
      
      We have this split raid kobject setup since commit 75cb379d
      ("btrfs: defer adding raid type kobject until after chunk relocation") in
      order to avoid triggering reclaim during contextes where we can not
      (either we are holding a transaction handle or some lock required by
      the transaction commit path), so that we do the calls to kobject_add(),
      which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects()
      in contextes where it is safe to trigger reclaim. That change expected
      that a new raid kobject can only be created either when mounting the
      filesystem or after raid profile conversion through the relocation path.
      However, we can have new raid kobject created in other two cases at least:
      
      1) During device replace (or scrub) after adding a device a to the
         filesystem. The replace procedure (and scrub) do calls to
         btrfs_inc_block_group_ro() which can allocate a new block group
         with a new raid profile (because we now have more devices). This
         can be triggered by test cases btrfs/027 and btrfs/176.
      
      2) During a degraded mount trough any write path. This can be triggered
         by test case btrfs/124.
      
      Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only
      makes things more complex and fragile, can also introduce deadlocks with
      reclaim the following way:
      
      1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or
         anywhere in the replace/scrub path will cause a deadlock with reclaim
         because if reclaim happens and a transaction commit is triggered,
         the transaction commit path will block at btrfs_scrub_pause().
      
      2) During degraded mounts it is essentially impossible to figure out where
         to add extra calls to btrfs_add_raid_kobjects(), because allocation of
         a block group with a new raid profile can happen anywhere, which means
         we can't safely figure out which contextes are safe for reclaim, as
         we can either hold a transaction handle or some lock needed by the
         transaction commit path.
      
      So it is too complex and error prone to have this split setup of raid
      kobjects. So fix the issue by consolidating the setup of the kobjects in a
      single place, at link_block_group(), and setup a nofs context there in
      order to prevent reclaim being triggered by the memory allocations done
      through the call chain of kobject_add().
      
      Besides fixing the sysfs warnings during kobject_del(), this also ensures
      the sysfs directories for the new raid profiles end up created and visible
      to users (a bug that existed before the 5.3 commit 7c7e3014
      ("btrfs: sysfs: Replace default_attrs in ktypes with groups")).
      
      Fixes: 75cb379d ("btrfs: defer adding raid type kobject until after chunk relocation")
      Fixes: 7c7e3014 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d7cd4dd9
  3. 31 7月, 2019 3 次提交
    • F
      Btrfs: fix deadlock between fiemap and transaction commits · a6d155d2
      Filipe Manana 提交于
      The fiemap handler locks a file range that can have unflushed delalloc,
      and after locking the range, it tries to attach to a running transaction.
      If the running transaction started its commit, that is, it is in state
      TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
      flushoncommit option or the transaction is creating a snapshot for the
      subvolume that contains the file that fiemap is operating on, we end up
      deadlocking. This happens because fiemap is blocked on the transaction,
      waiting for it to complete, and the transaction is waiting for the flushed
      dealloc to complete, which requires locking the file range that the fiemap
      task already locked. The following stack traces serve as an example of
      when this deadlock happens:
      
        (...)
        [404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [404571.515956] Call Trace:
        [404571.516360]  ? __schedule+0x3ae/0x7b0
        [404571.516730]  schedule+0x3a/0xb0
        [404571.517104]  lock_extent_bits+0x1ec/0x2a0 [btrfs]
        [404571.517465]  ? remove_wait_queue+0x60/0x60
        [404571.517832]  btrfs_finish_ordered_io+0x292/0x800 [btrfs]
        [404571.518202]  normal_work_helper+0xea/0x530 [btrfs]
        [404571.518566]  process_one_work+0x21e/0x5c0
        [404571.518990]  worker_thread+0x4f/0x3b0
        [404571.519413]  ? process_one_work+0x5c0/0x5c0
        [404571.519829]  kthread+0x103/0x140
        [404571.520191]  ? kthread_create_worker_on_cpu+0x70/0x70
        [404571.520565]  ret_from_fork+0x3a/0x50
        [404571.520915] kworker/u8:6    D    0 31651      2 0x80004000
        [404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
        (...)
        [404571.537000] fsstress        D    0 13117  13115 0x00004000
        [404571.537263] Call Trace:
        [404571.537524]  ? __schedule+0x3ae/0x7b0
        [404571.537788]  schedule+0x3a/0xb0
        [404571.538066]  wait_current_trans+0xc8/0x100 [btrfs]
        [404571.538349]  ? remove_wait_queue+0x60/0x60
        [404571.538680]  start_transaction+0x33c/0x500 [btrfs]
        [404571.539076]  btrfs_check_shared+0xa3/0x1f0 [btrfs]
        [404571.539513]  ? extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.539866]  extent_fiemap+0x2ce/0x650 [btrfs]
        [404571.540170]  do_vfs_ioctl+0x526/0x6f0
        [404571.540436]  ksys_ioctl+0x70/0x80
        [404571.540734]  __x64_sys_ioctl+0x16/0x20
        [404571.540997]  do_syscall_64+0x60/0x1d0
        [404571.541279]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
        [404571.543729] btrfs           D    0 14210  14208 0x00004000
        [404571.544023] Call Trace:
        [404571.544275]  ? __schedule+0x3ae/0x7b0
        [404571.544526]  ? wait_for_completion+0x112/0x1a0
        [404571.544795]  schedule+0x3a/0xb0
        [404571.545064]  schedule_timeout+0x1ff/0x390
        [404571.545351]  ? lock_acquire+0xa6/0x190
        [404571.545638]  ? wait_for_completion+0x49/0x1a0
        [404571.545890]  ? wait_for_completion+0x112/0x1a0
        [404571.546228]  wait_for_completion+0x131/0x1a0
        [404571.546503]  ? wake_up_q+0x70/0x70
        [404571.546775]  btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
        [404571.547159]  btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
        [404571.547449]  ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
        [404571.547703]  ? remove_wait_queue+0x60/0x60
        [404571.547969]  btrfs_mksubvol+0x605/0x640 [btrfs]
        [404571.548226]  ? __sb_start_write+0xd4/0x1c0
        [404571.548512]  ? mnt_want_write_file+0x24/0x50
        [404571.548789]  btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
        [404571.549048]  btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
        [404571.549307]  btrfs_ioctl+0x133f/0x3150 [btrfs]
        [404571.549549]  ? mem_cgroup_charge_statistics+0x4c/0xd0
        [404571.549792]  ? mem_cgroup_commit_charge+0x84/0x4b0
        [404571.550064]  ? __handle_mm_fault+0xe3e/0x11f0
        [404571.550306]  ? do_raw_spin_unlock+0x49/0xc0
        [404571.550608]  ? _raw_spin_unlock+0x24/0x30
        [404571.550976]  ? __handle_mm_fault+0xedf/0x11f0
        [404571.551319]  ? do_vfs_ioctl+0xa2/0x6f0
        [404571.551659]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [404571.552087]  do_vfs_ioctl+0xa2/0x6f0
        [404571.552355]  ksys_ioctl+0x70/0x80
        [404571.552621]  __x64_sys_ioctl+0x16/0x20
        [404571.552864]  do_syscall_64+0x60/0x1d0
        [404571.553104]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        (...)
      
      If we were joining the transaction instead of attaching to it, we would
      not risk a deadlock because a join only blocks if the transaction is in a
      state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
      flush performed by a transaction is done before it reaches that state,
      when it is in the state TRANS_STATE_COMMIT_START. However a transaction
      join is intended for use cases where we do modify the filesystem, and
      fiemap only needs to peek at delayed references from the current
      transaction in order to determine if extents are shared, and, besides
      that, when there is no current transaction or when it blocks to wait for
      a current committing transaction to complete, it creates a new transaction
      without reserving any space. Such unnecessary transactions, besides doing
      unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
      rotation of the precious backup roots.
      
      So fix this by adding a new transaction join variant, named join_nostart,
      which behaves like the regular join, but it does not create a transaction
      when none currently exists or after waiting for a committing transaction
      to complete.
      
      Fixes: 03628cdb ("Btrfs: do not start a transaction during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a6d155d2
    • F
      Btrfs: fix race leading to fs corruption after transaction abort · cb2d3dad
      Filipe Manana 提交于
      When one transaction is finishing its commit, it is possible for another
      transaction to start and enter its initial commit phase as well. If the
      first ends up getting aborted, we have a small time window where the second
      transaction commit does not notice that the previous transaction aborted
      and ends up committing, writing a superblock that points to btrees that
      reference extent buffers (nodes and leafs) that were not persisted to disk.
      The consequence is that after mounting the filesystem again, we will be
      unable to load some btree nodes/leafs, either because the content on disk
      is either garbage (or just zeroes) or corresponds to the old content of a
      previouly COWed or deleted node/leaf, resulting in the well known error
      messages "parent transid verify failed on ...".
      The following sequence diagram illustrates how this can happen.
      
              CPU 1                                           CPU 2
      
       <at transaction N>
      
       btrfs_commit_transaction()
         (...)
         --> sets transaction state to
             TRANS_STATE_UNBLOCKED
         --> sets fs_info->running_transaction
             to NULL
      
                                                          (...)
                                                          btrfs_start_transaction()
                                                            start_transaction()
                                                              wait_current_trans()
                                                                --> returns immediately
                                                                    because
                                                                    fs_info->running_transaction
                                                                    is NULL
                                                              join_transaction()
                                                                --> creates transaction N + 1
                                                                --> sets
                                                                    fs_info->running_transaction
                                                                    to transaction N + 1
                                                                --> adds transaction N + 1 to
                                                                    the fs_info->trans_list list
                                                              --> returns transaction handle
                                                                  pointing to the new
                                                                  transaction N + 1
                                                          (...)
      
                                                          btrfs_sync_file()
                                                            btrfs_start_transaction()
                                                              --> returns handle to
                                                                  transaction N + 1
                                                            (...)
      
         btrfs_write_and_wait_transaction()
           --> writeback of some extent
               buffer fails, returns an
      	 error
         btrfs_handle_fs_error()
           --> sets BTRFS_FS_STATE_ERROR in
               fs_info->fs_state
         --> jumps to label "scrub_continue"
         cleanup_transaction()
           btrfs_abort_transaction(N)
             --> sets BTRFS_FS_STATE_TRANS_ABORTED
                 flag in fs_info->fs_state
             --> sets aborted field in the
                 transaction and transaction
      	   handle structures, for
                 transaction N only
           --> removes transaction from the
               list fs_info->trans_list
                                                            btrfs_commit_transaction(N + 1)
                                                              --> transaction N + 1 was not
      							    aborted, so it proceeds
                                                              (...)
                                                              --> sets the transaction's state
                                                                  to TRANS_STATE_COMMIT_START
                                                              --> does not find the previous
                                                                  transaction (N) in the
                                                                  fs_info->trans_list, so it
                                                                  doesn't know that transaction
                                                                  was aborted, and the commit
                                                                  of transaction N + 1 proceeds
                                                              (...)
                                                              --> sets transaction N + 1 state
                                                                  to TRANS_STATE_UNBLOCKED
                                                              btrfs_write_and_wait_transaction()
                                                                --> succeeds writing all extent
                                                                    buffers created in the
                                                                    transaction N + 1
                                                              write_all_supers()
                                                                 --> succeeds
                                                                 --> we now have a superblock on
                                                                     disk that points to trees
                                                                     that refer to at least one
                                                                     extent buffer that was
                                                                     never persisted
      
      So fix this by updating the transaction commit path to check if the flag
      BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting
      the transaction to the TRANS_STATE_COMMIT_START we do not find any previous
      transaction in the fs_info->trans_list. If the flag is set, just fail the
      transaction commit with -EROFS, as we do in other places. The exact error
      code for the previous transaction abort was already logged and reported.
      
      Fixes: 49b25e05 ("btrfs: enhance transaction abort infrastructure")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cb2d3dad
    • F
      Btrfs: fix incremental send failure after deduplication · b4f9a1a8
      Filipe Manana 提交于
      When doing an incremental send operation we can fail if we previously did
      deduplication operations against a file that exists in both snapshots. In
      that case we will fail the send operation with -EIO and print a message
      to dmesg/syslog like the following:
      
        BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
        extent for inode 257 without updated inode item, send root is 258, \
        parent root is 257
      
      This requires that we deduplicate to the same file in both snapshots for
      the same amount of times on each snapshot. The issue happens because a
      deduplication only updates the iversion of an inode and does not update
      any other field of the inode, therefore if we deduplicate the file on
      each snapshot for the same amount of time, the inode will have the same
      iversion value (stored as the "sequence" field on the inode item) on both
      snapshots, therefore it will be seen as unchanged between in the send
      snapshot while there are new/updated/deleted extent items when comparing
      to the parent snapshot. This makes the send operation return -EIO and
      print an error message.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        # Create our first file. The first half of the file has several 64Kb
        # extents while the second half as a single 512Kb extent.
        $ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
      
        # Create the base snapshot and the parent send stream from it.
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
        $ btrfs send -f /tmp/1.snap /mnt/mysnap1
      
        # Create our second file, that has exactly the same data as the first
        # file.
        $ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
      
        # Create the second snapshot, used for the incremental send, before
        # doing the file deduplication.
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
        # Now before creating the incremental send stream:
        #
        # 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
        #    will drop several extent items and add a new one, also updating
        #    the inode's iversion (sequence field in inode item) by 1, but not
        #    any other field of the inode;
        #
        # 2) Deduplicate into a different subrange of file foo in snapshot
        #    mysnap2. This will replace an extent item with a new one, also
        #    updating the inode's iversion by 1 but not any other field of the
        #    inode.
        #
        # After these two deduplication operations, the inode items, for file
        # foo, are identical in both snapshots, but we have different extent
        # items for this inode in both snapshots. We want to check this doesn't
        # cause send to fail with an error or produce an incorrect stream.
      
        $ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
        $ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
      
        # Create the incremental send stream.
        $ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
        ERROR: send ioctl failed with -5: Input/output error
      
      This issue started happening back in 2015 when deduplication was updated
      to not update the inode's ctime and mtime and update only the iversion.
      Back then we would hit a BUG_ON() in send, but later in 2016 send was
      updated to return -EIO and print the error message instead of doing the
      BUG_ON().
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
      Fixes: 1c919a5e ("btrfs: don't update mtime/ctime on deduped inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b4f9a1a8
  4. 26 7月, 2019 1 次提交
  5. 25 7月, 2019 1 次提交