1. 09 Feb, 2021 5 commits
    • btrfs: zoned: verify device extent is aligned to zone · 381a696e
      Naohiro Aota committed
      Add a check in verify_one_dev_extent() to ensure that a device extent on
      a zoned block device is aligned to the respective zone boundary.
      
      If it isn't, mark the filesystem as unclean.
      
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
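
The alignment check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the kernel's verify_one_dev_extent(): the helper name and its parameters are hypothetical, and it assumes that both the start and the length of the device extent must be multiples of the zone size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel function): a device extent is treated
 * as zone-aligned when both its start and its length are multiples of the
 * zone size. No power-of-two zone size is assumed.
 */
bool dev_extent_is_zone_aligned(uint64_t physical_offset,
                                uint64_t physical_len,
                                uint64_t zone_size)
{
    assert(zone_size != 0);
    return physical_offset % zone_size == 0 &&
           physical_len % zone_size == 0;
}
```

A caller in this sketch would treat a false result as the "filesystem is unclean" case mentioned in the commit message.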
    • btrfs: zoned: implement zoned chunk allocator · 1cd6121f
      Naohiro Aota committed
      Implement a zoned chunk and device extent allocator. One device zone
      becomes a device extent so that a zone reset affects only this device
      extent and does not change the state of blocks in the neighbor device
      extents.
      
      To implement the allocator, we need to extend the following functions for
      a zoned filesystem.
      
      - init_alloc_chunk_ctl
      - dev_extent_search_start
      - dev_extent_hole_check
      - decide_stripe_size
      
      init_alloc_chunk_ctl_zoned() is mostly the same as the regular one. It
      always sets the stripe_size to the zone size and aligns the parameters
      to the zone size.
      
      dev_extent_search_start() only aligns the start offset to zone boundaries.
      Unlike on a regular filesystem, we don't care about the first 1MB,
      because the first two zones are reserved for superblock logging anyway.
      
      dev_extent_hole_check_zoned() checks if the zones in the given hole are
      either conventional or empty sequential zones. It also skips zones
      reserved for superblock logging.
      
      With the change to the hole, the new hole may now contain pending extents.
      So, in this case, loop again to check that.
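
The hole check described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the zone model, the type names and find_usable_zone_run() are hypothetical and do not mirror the kernel's data structures. The sketch scans a range of zones for the first run of usable ones, where "usable" means conventional or empty sequential, and not reserved for superblock logging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical zone model, for illustration only. */
enum zone_kind { ZONE_CONVENTIONAL, ZONE_SEQ_EMPTY, ZONE_SEQ_USED };

struct zone {
    enum zone_kind kind;
    bool reserved_for_sb;   /* reserved for superblock logging */
};

/* A zone can back a new device extent if it is not reserved and is
 * either conventional or an empty sequential zone. */
bool zone_usable(const struct zone *z)
{
    return !z->reserved_for_sb &&
           (z->kind == ZONE_CONVENTIONAL || z->kind == ZONE_SEQ_EMPTY);
}

/*
 * Look for the first run of `want` consecutive usable zones inside
 * [first, first + count). Returns the index of the first zone of that
 * run, or -1 if the range holds no such run.
 */
long find_usable_zone_run(const struct zone *zones, size_t first,
                          size_t count, size_t want)
{
    size_t run = 0;

    assert(zones != NULL);
    for (size_t i = first; i < first + count; i++) {
        run = zone_usable(&zones[i]) ? run + 1 : 0;
        if (want > 0 && run == want)
            return (long)(i + 1 - want);
    }
    return -1;
}
```

Because the returned start can differ from the start of the original hole, a caller would have to re-check the adjusted hole, which corresponds to the "loop again" remark in the commit message.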
      
      Finally, decide_stripe_size_zoned() should shrink the number of devices
      instead of stripe size because we need to honor stripe_size == zone_size.
      
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
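
The idea behind decide_stripe_size_zoned(), shrinking the number of devices rather than the stripe size, can be sketched as below. This is a deliberately simplified model and not the kernel code: zoned_pick_ndevs() is a hypothetical name, and the sketch assumes one stripe per device, a single copy and no parity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical, simplified model: stripe_size is pinned to zone_size, so
 * when ndevs * zone_size would exceed max_chunk_size the only value left
 * to shrink is the number of devices.
 */
uint64_t zoned_pick_ndevs(uint64_t ndevs, uint64_t zone_size,
                          uint64_t max_chunk_size)
{
    assert(zone_size != 0);
    /* Compare by division so the product cannot overflow. */
    if (ndevs > max_chunk_size / zone_size)
        ndevs = max_chunk_size / zone_size;
    return ndevs;
}
```

With 256MiB zones and a 1GiB chunk limit, for example, eight candidate devices would be reduced to four, while two devices would be kept as they are.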
    • btrfs: zoned: defer loading zone info after opening trees · 73651042
      Naohiro Aota committed
      This is a preparation patch to implement zone emulation on a regular
      device.
      
      To emulate a zoned filesystem on a regular (non-zoned) device, we need to
      decide an emulated zone size. Instead of making it a compile-time static
      value, we'll make it configurable at mkfs time. Since we have one zone ==
      one device extent restriction, we can determine the emulated zone size
      from the size of a device extent. We can extend btrfs_get_dev_zone_info()
      to show a regular device filled with conventional zones once the zone size
      is decided.
      
      The current call site of btrfs_get_dev_zone_info() during the mount process
      comes before the file system trees are loaded, so we don't know the size
      of a device extent at that point. Thus we can't slice a regular device
      into conventional zones.
      
      This patch introduces btrfs_get_dev_zone_info_all_devices() to load the
      zone info for all the devices, and calls it in open_ctree() after
      loading the trees.
      
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
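
Once the emulated zone size is known (taken from the size of a device extent, as the commit message explains), slicing a regular device into conventional zones is simple arithmetic. The sketch below is an illustration only: the struct and function names are hypothetical, and ignoring a trailing partial zone is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one emulated conventional zone. */
struct emulated_zone {
    uint64_t start;   /* byte offset on the device */
    uint64_t len;     /* equals the emulated zone size */
};

/* Number of whole zones on the device; a trailing partial zone is
 * ignored in this sketch (an assumption). */
uint64_t emulated_zone_count(uint64_t dev_size, uint64_t zone_size)
{
    assert(zone_size != 0);
    return dev_size / zone_size;
}

/* Zone number `index` simply starts at index * zone_size. */
struct emulated_zone emulated_zone_at(uint64_t index, uint64_t zone_size)
{
    struct emulated_zone z = { index * zone_size, zone_size };

    assert(zone_size != 0);
    return z;
}
```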
    • btrfs: let callers of btrfs_get_io_geometry pass the em · 42034313
      Michal Rostecki committed
      Before this change, the btrfs_get_io_geometry() function was calling
      btrfs_get_chunk_map() to get the extent mapping, necessary for
      calculating the I/O geometry. It was using that extent mapping only
      internally and freeing the pointer after its execution.
      
      That resulted in calling btrfs_get_chunk_map() de facto twice by the
      __btrfs_map_block() function. It was calling btrfs_get_io_geometry()
      first and then calling btrfs_get_chunk_map() directly to get the extent
      mapping, used by the rest of the function.
      
      Change that to passing the extent mapping to the btrfs_get_io_geometry()
      function as an argument.
      
      This could improve performance in some cases. For very large
      filesystems, i.e. with several thousand allocated chunks, this not only
      avoids searching the rbtree twice, saving time, it may also help reduce
      contention on the lock that protects the tree: think of writeback
      starting for multiple inodes, other tasks allocating or removing chunks,
      and anything else that requires access to the rbtree.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Michal Rostecki <mrostecki@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add Filipe's analysis ]
      Signed-off-by: David Sterba <dsterba@suse.com>
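
The shape of this change, looking the mapping up once and handing it to the geometry helper, can be sketched with a toy model. Everything here is hypothetical: struct chunk_map, the table of maps, the linear search and the function names stand in for the kernel's extent map, rbtree lookup and btrfs_get_io_geometry(), and a counter is used to make the number of lookups visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an extent mapping of one chunk. */
struct chunk_map {
    uint64_t start;
    uint64_t len;
    uint64_t stripe_len;
};

static const struct chunk_map chunk_maps[] = {
    { UINT64_C(1048576), UINT64_C(8388608), UINT64_C(65536) },
    { UINT64_C(9437184), UINT64_C(8388608), UINT64_C(65536) },
};

/* Counts lookups so the saving is observable. */
unsigned int chunk_map_lookups;

/* Stand-in for the (potentially expensive, lock-protected) tree search. */
const struct chunk_map *lookup_chunk_map(uint64_t logical)
{
    chunk_map_lookups++;
    for (size_t i = 0; i < sizeof(chunk_maps) / sizeof(chunk_maps[0]); i++) {
        const struct chunk_map *em = &chunk_maps[i];

        if (logical >= em->start && logical - em->start < em->len)
            return em;
    }
    return NULL;
}

/* The geometry helper receives the mapping instead of searching again. */
uint64_t io_stripe_offset(const struct chunk_map *em, uint64_t logical)
{
    assert(em != NULL && em->stripe_len != 0);
    return (logical - em->start) % em->stripe_len;
}

/* The caller performs a single lookup and reuses the result. */
int map_block(uint64_t logical, uint64_t *stripe_offset)
{
    const struct chunk_map *em = lookup_chunk_map(logical);

    if (em == NULL)
        return -1;
    *stripe_offset = io_stripe_offset(em, logical);
    /* ... the rest of the mapping would keep using em here ... */
    return 0;
}
```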
    • btrfs: consolidate btrfs_previous_item ret val handling in btrfs_shrink_device · 7056bf69
      Nikolay Borisov committed
      Instead of having three 'if' statements to handle a non-zero return
      value, consolidate them into one 'if (ret)'. That way the code is more
      obvious:
      
       - Always drop delete_unused_bgs_mutex if ret is non-zero
       - If ret is negative -> goto done
       - If it's 1 -> reset ret to 0, release the path and finish the loop.
      
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
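
The three rules listed in this commit message can be modelled as a single branch. The sketch below is not the code from btrfs_shrink_device(): the enum, the struct and handle_previous_item_ret() are hypothetical, and the mutex, the path and the control flow are reduced to plain fields so that the decision logic can be checked in isolation.

```c
#include <assert.h>

/* What the caller should do next (illustrative names). */
enum next_step { STEP_CONTINUE, STEP_GOTO_DONE, STEP_FINISH_LOOP };

struct outcome {
    enum next_step step;
    int ret;             /* value of ret after handling */
    int mutex_dropped;   /* 1 if delete_unused_bgs_mutex would be dropped */
};

/* One 'if (ret)' covers every non-zero return value. */
struct outcome handle_previous_item_ret(int ret)
{
    struct outcome o = { STEP_CONTINUE, ret, 0 };

    if (ret) {
        o.mutex_dropped = 1;            /* always drop the mutex */
        if (ret < 0) {
            o.step = STEP_GOTO_DONE;    /* error: bail out */
            return o;
        }
        o.ret = 0;                      /* ret == 1: nothing left to do */
        o.step = STEP_FINISH_LOOP;      /* release the path, end the loop */
    }
    /* After handling, ret is never left positive. */
    assert(o.ret <= 0);
    return o;
}
```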
  2. 26 Jan, 2021 1 commit
    • btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch · c41ec452
      Su Yue committed
      This effectively reverts commit d5c82388 ("btrfs: convert
      data_seqcount to seqcount_mutex_t").
      
      While running fstests on a 32-bit test box, many tests failed because of
      warnings in dmesg. One of those warnings (btrfs/003):
      
        [66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
        [66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G           O      5.11.0-rc4-custom+ #5
        [66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
        [66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
        [66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
        [66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
        [66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
        [66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
        [66.441485] Call Trace:
        [66.441510]  btrfs_relocate_chunk+0xb1/0x100 [btrfs]
        [66.441529]  ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
        [66.441562]  btrfs_balance+0x8ed/0x13b0 [btrfs]
        [66.441586]  ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
        [66.441619]  ? __this_cpu_preempt_check+0xf/0x11
        [66.441643]  btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
        [66.441664]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441683]  btrfs_ioctl+0x414/0x2ae0 [btrfs]
        [66.441700]  ? __lock_acquire+0x35f/0x2650
        [66.441717]  ? lockdep_hardirqs_on+0x87/0x120
        [66.441720]  ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
        [66.441724]  ? call_rcu+0x2d3/0x530
        [66.441731]  ? __might_fault+0x41/0x90
        [66.441736]  ? kvm_sched_clock_read+0x15/0x50
        [66.441740]  ? sched_clock+0x8/0x10
        [66.441745]  ? sched_clock_cpu+0x13/0x180
        [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441768]  __ia32_sys_ioctl+0x165/0x8a0
        [66.441773]  ? __this_cpu_preempt_check+0xf/0x11
        [66.441785]  ? __might_fault+0x89/0x90
        [66.441791]  __do_fast_syscall_32+0x54/0x80
        [66.441796]  do_fast_syscall_32+0x32/0x70
        [66.441801]  do_SYSENTER_32+0x15/0x20
        [66.441805]  entry_SYSENTER_32+0x9f/0xf2
        [66.441808] EIP: 0xab7b5549
        [66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
        [66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
        [66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
        [66.441833] irq event stamp: 42579
        [66.441835] hardirqs last  enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
        [66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
        [66.441840] softirqs last  enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
        [66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60
      
        ========================================================================
        btrfs_remove_chunk+0x58b/0x7b0:
        __seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
        (inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
        (inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
        ========================================================================
      
      The warning is produced by lockdep_assert_held() in
      __seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
      And volumes.c:2994 is btrfs_device_set_bytes_used(), called with the
      mutex fs_info->chunk_mutex already held.
      
      After adding some debug prints, the cause was found: many
      __alloc_device() calls are made with a NULL @fs_info (during the
      scanning ioctl). Inside the function, btrfs_device_data_ordered_init()
      expands to seqcount_mutex_init(). In this scenario, its second
      parameter info->chunk_mutex is &NULL->chunk_mutex, which unexpectedly
      equals offsetof(struct btrfs_fs_info, chunk_mutex). Thus,
      seqcount_mutex_init() is called incorrectly, and the
      btrfs_device_get/set helpers later trigger lockdep warnings.
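
The "&NULL->chunk_mutex equals offsetof(...)" observation can be reproduced outside the kernel with a small sketch. The structs below are made up for illustration and are not the real struct btrfs_fs_info. The member address is computed with integer arithmetic rather than by dereferencing through a NULL pointer, so that the sketch itself stays well-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up stand-ins for the real structures. */
struct fake_mutex {
    int locked;
};

struct fake_fs_info {
    char other_fields[64];
    struct fake_mutex chunk_mutex;
};

/* The demonstration only works if the member is not at offset 0. */
static_assert(offsetof(struct fake_fs_info, chunk_mutex) != 0,
              "chunk_mutex must not be the first member");

/*
 * Address that &info->chunk_mutex would evaluate to: the base address
 * plus the member offset. With a NULL base this is just the offset, a
 * small non-zero value that looks like a pointer but points at nothing.
 */
uintptr_t chunk_mutex_address(const struct fake_fs_info *info)
{
    return (uintptr_t)info + offsetof(struct fake_fs_info, chunk_mutex);
}
```

Since the resulting value is non-zero, a plain NULL check on the member pointer would not catch the problem, which fits the description above of the helpers only tripping over it later.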
      
      The device and filesystem object lifetimes are different and we'd have
      to synchronize initialization of the btrfs_device::data_seqcount with
      the fs_info, possibly using some additional synchronization. It would
      still not prevent concurrent access to the seqcount lock when it's used
      for read and initialization.
      
      Commit d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
      does not mention a particular problem being fixed so revert should not
      cause any harm and we'll get the lockdep warning fixed.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
      Reported-by: Erhard F <erhard_f@mailbox.org>
      Fixes: d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
      CC: stable@vger.kernel.org # 5.10
      CC: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Su Yue <l@damenly.su>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 18 Jan, 2021 1 commit
    • btrfs: fix lockdep splat in btrfs_recover_relocation · fb286100
      Josef Bacik committed
      While testing the error paths of relocation I hit the following lockdep
      splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.10.0-rc6+ #217 Not tainted
        ------------------------------------------------------
        mount/779 is trying to acquire lock:
        ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340
      
        but task is already holding lock:
        ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-root-00){++++}-{3:3}:
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x27/0x100
      	 btrfs_read_lock_root_node+0x31/0x40
      	 btrfs_search_slot+0x462/0x8f0
      	 btrfs_update_root+0x55/0x2b0
      	 btrfs_drop_snapshot+0x398/0x750
      	 clean_dirty_subvols+0xdf/0x120
      	 btrfs_recover_relocation+0x534/0x5a0
      	 btrfs_start_pre_rw_mount+0xcb/0x170
      	 open_ctree+0x151f/0x1726
      	 btrfs_mount_root.cold+0x12/0xea
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 vfs_kern_mount.part.0+0x71/0xb0
      	 btrfs_mount+0x10d/0x380
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x433/0xc10
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (sb_internal#2){.+.+}-{0:0}:
      	 start_transaction+0x444/0x700
      	 insert_balance_item.isra.0+0x37/0x320
      	 btrfs_balance+0x354/0xf40
      	 btrfs_ioctl_balance+0x2cf/0x380
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1120/0x1e10
      	 lock_acquire+0x116/0x370
      	 __mutex_lock+0x7e/0x7b0
      	 btrfs_recover_balance+0x2f0/0x340
      	 open_ctree+0x1095/0x1726
      	 btrfs_mount_root.cold+0x12/0xea
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 vfs_kern_mount.part.0+0x71/0xb0
      	 btrfs_mount+0x10d/0x380
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x433/0xc10
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-root-00);
      				 lock(sb_internal#2);
      				 lock(btrfs-root-00);
          lock(&fs_info->balance_mutex);
      
         *** DEADLOCK ***
      
        2 locks held by mount/779:
         #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
         #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100
      
        stack backtrace:
        CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         ? trace_call_bpf+0x139/0x260
         __lock_acquire+0x1120/0x1e10
         lock_acquire+0x116/0x370
         ? btrfs_recover_balance+0x2f0/0x340
         __mutex_lock+0x7e/0x7b0
         ? btrfs_recover_balance+0x2f0/0x340
         ? btrfs_recover_balance+0x2f0/0x340
         ? rcu_read_lock_sched_held+0x3f/0x80
         ? kmem_cache_alloc_trace+0x2c4/0x2f0
         ? btrfs_get_64+0x5e/0x100
         btrfs_recover_balance+0x2f0/0x340
         open_ctree+0x1095/0x1726
         btrfs_mount_root.cold+0x12/0xea
         ? rcu_read_lock_sched_held+0x3f/0x80
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         vfs_kern_mount.part.0+0x71/0xb0
         btrfs_mount+0x10d/0x380
         ? __kmalloc_track_caller+0x2f2/0x320
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         ? capable+0x3a/0x60
         path_mount+0x433/0xc10
         __x64_sys_mount+0xe3/0x120
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is straightforward to fix: simply release the path before we set up
      the balance_ctl.
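
The ordering problem and the fix can be illustrated with a toy lock tracker. This is a sketch under stated assumptions, not kernel code: struct lock_state and the functions below are hypothetical, "path" stands for the tree lock held through the btrfs path, and a violation is recorded whenever the balance mutex is taken while the path is still held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy tracker for the two locks involved (illustrative only). */
struct lock_state {
    bool path_held;
    bool balance_mutex_held;
    bool order_violation;
};

void path_lock(struct lock_state *s)    { assert(s != NULL); s->path_held = true; }
void path_release(struct lock_state *s) { assert(s != NULL); s->path_held = false; }

void balance_lock(struct lock_state *s)
{
    assert(s != NULL);
    /* Taking balance_mutex under the tree lock is the reported inversion. */
    if (s->path_held)
        s->order_violation = true;
    s->balance_mutex_held = true;
}

void balance_unlock(struct lock_state *s) { assert(s != NULL); s->balance_mutex_held = false; }

/* Old order: set up balance_ctl while the path is still held. */
bool recover_balance_old(struct lock_state *s)
{
    path_lock(s);        /* search for the balance item */
    balance_lock(s);     /* set up balance_ctl */
    balance_unlock(s);
    path_release(s);
    return !s->order_violation;
}

/* Fixed order: release the path first, then set up balance_ctl. */
bool recover_balance_fixed(struct lock_state *s)
{
    path_lock(s);        /* search for the balance item */
    path_release(s);
    balance_lock(s);     /* set up balance_ctl */
    balance_unlock(s);
    return !s->order_violation;
}
```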
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 18 Dec, 2020 1 commit
    • btrfs: fix race between RO remount and the cleaner task · a0a1db70
      Filipe Manana committed
      When we are remounting a filesystem in RO mode we can race with the cleaner
      task and result in leaking a transaction if the filesystem is unmounted
      shortly after, before the transaction kthread had a chance to commit that
      transaction. That also results in a crash during unmount, due to a
      use-after-free, if hardware acceleration is not available for crc32c.
      
      The following sequence of steps explains how the race happens.
      
      1) The filesystem is mounted in RW mode and the cleaner task is running.
         This means that currently BTRFS_FS_CLEANER_RUNNING is set at
         fs_info->flags;
      
      2) The cleaner task is currently running delayed iputs for example;
      
      3) A filesystem RO remount operation starts;
      
      4) The RO remount task calls btrfs_commit_super(), which commits any
         currently open transaction, and it finishes;
      
      5) At this point the cleaner task is still running and it creates a new
         transaction by doing one of the following things:
      
         * When running the delayed iput() for an inode with a 0 link count,
           in which case at btrfs_evict_inode() we start a transaction through
           the call to evict_refill_and_join(), use it and then release its
           handle through btrfs_end_transaction();
      
         * When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
           a transaction is started at btrfs_drop_snapshot() and then its handle
           is released through a call to btrfs_end_transaction_throttle();
      
         * When the remount task was still running, and before the remount task
           called btrfs_delete_unused_bgs(), the cleaner task also called
           btrfs_delete_unused_bgs() and it picked and removed one block group
           from the list of unused block groups. Before the cleaner task started
           a transaction, through btrfs_start_trans_remove_block_group() at
           btrfs_delete_unused_bgs(), the remount task had already called
           btrfs_commit_super();
      
      6) So at this point the filesystem is in RO mode and we have an open
         transaction that was started by the cleaner task;
      
      7) Shortly after, a filesystem unmount operation starts. At close_ctree()
         we stop the transaction kthread before it had a chance to commit the
         transaction, since less than 30 seconds (the default commit interval)
         have elapsed since the last transaction was committed;
      
      8) We end up calling iput() against the btree inode at close_ctree() while
         there is an open transaction, and since that transaction was used to
         update btrees by the cleaner, we have dirty pages in the btree inode
         due to COW operations on metadata extents, and therefore writeback is
         triggered for the btree inode.
      
         So btree_write_cache_pages() is invoked to flush those dirty pages
         during the final iput() on the btree inode. This results in creating a
         bio and submitting it, which makes us end up at
         btrfs_submit_metadata_bio();
      
      9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
         that calls btrfs_wq_submit_bio(), because check_async_write() returned
         a value of 1. This value of 1 is because we did not have hardware
         acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
         set in fs_info->flags;
      
      10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
          workqueue at fs_info->workers, which was already freed before by the
          call to btrfs_stop_all_workers() at close_ctree(). This results in an
          invalid memory access due to a use-after-free, leading to a crash.
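
The first part of the sequence above (steps 1 to 7) can be replayed with a toy state machine. This is only a model of the described interleaving, not kernel code: struct fs_model, its fields and all function names are hypothetical, and the transaction kthread is reduced to a comparison against the commit interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the state involved; field names are illustrative only. */
struct fs_model {
    bool read_only;
    bool cleaner_running;
    bool open_transaction;
    int secs_since_commit;
};

/* Step 4: commit whatever transaction is currently open. */
void commit_super(struct fs_model *fs)
{
    assert(fs != NULL);
    fs->open_transaction = false;
    fs->secs_since_commit = 0;
}

/* Steps 3, 4 and 6: the remount commits and flips to RO, but in this
 * model it does not wait for the cleaner task. */
void remount_ro(struct fs_model *fs)
{
    commit_super(fs);
    fs->read_only = true;
}

/* Step 5: a still-running cleaner starts a new transaction. */
void cleaner_does_work(struct fs_model *fs)
{
    assert(fs != NULL);
    if (fs->cleaner_running)
        fs->open_transaction = true;
}

/* Step 7: at unmount the transaction kthread only got to commit if the
 * commit interval had already elapsed. Returns true if a transaction
 * is still open, i.e. leaked. */
bool unmount_leaks_transaction(struct fs_model *fs, int commit_interval_secs)
{
    assert(fs != NULL);
    if (fs->secs_since_commit >= commit_interval_secs)
        fs->open_transaction = false;
    return fs->open_transaction;
}
```

In this model, a remount followed by cleaner activity and an unmount five seconds later leaves the transaction open, whereas the same sequence without a running cleaner does not.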
      
      When this happens, before the crash there are several warnings triggered,
      since we have reserved metadata space in a block group, the delayed refs
      reservation, etc:
      
        ------------[ cut here ]------------
        WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
        Code: f0 01 00 00 48 39 c2 75 (...)
        RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
        RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
        RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
        RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
        R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 01 48 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c6 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
        Code: 48 83 bb b0 03 00 00 00 (...)
        RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
        RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
        RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
        R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 01 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c7 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
        Code: ad de 49 be 22 01 00 (...)
        RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
        RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
        RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
        R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c8 ]---
        BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
        BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
        BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
      
      And the crash, which only happens when we do not have crc32c hardware
      acceleration, produces the following trace immediately after those
      warnings:
      
        stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
        Code: 54 55 53 48 89 f3 (...)
        RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
        RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
        RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
        R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
        FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
         btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
         submit_one_bio+0x61/0x70 [btrfs]
         btree_write_cache_pages+0x414/0x450 [btrfs]
         ? kobject_put+0x9a/0x1d0
         ? trace_hardirqs_on+0x1b/0xf0
         ? _raw_spin_unlock_irqrestore+0x3c/0x60
         ? free_debug_processing+0x1e1/0x2b0
         do_writepages+0x43/0xe0
         ? lock_acquired+0x199/0x490
         __writeback_single_inode+0x59/0x650
         writeback_single_inode+0xaf/0x120
         write_inode_now+0x94/0xd0
         iput+0x187/0x2b0
         close_ctree+0x2c6/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f3cfebabee7
        Code: ff 0b 00 f7 d8 64 89 01 (...)
        RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
        RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
        R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        ---[ end trace dd74718fef1ed5cc ]---
      
      Finally, when we remove the btrfs module (rmmod btrfs), there are several
      warnings about objects that were allocated from our slabs but were never
      freed, a consequence of the transaction that was never committed and got
      leaked:
      
        =============================================================================
        BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? lock_release+0x20e/0x4c0
         kmem_cache_destroy+0x55/0x120
         btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x0000000050cbdd61 @offset=12104
        INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
              btrfs_free_tree_block+0x128/0x360 [btrfs]
              __btrfs_cow_block+0x489/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              sync_filesystem+0x74/0x90
              generic_shutdown_super+0x22/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        INFO: Object 0x0000000086e9b0ff @offset=12776
        INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
              btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
        INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
              commit_cowonly_roots+0x248/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        =============================================================================
        BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
        CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? lock_release+0x20e/0x4c0
         kmem_cache_destroy+0x55/0x120
         btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x000000001a340018 @offset=4408
        INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
              btrfs_free_tree_block+0x128/0x360 [btrfs]
              __btrfs_cow_block+0x489/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              btrfs_commit_transaction+0x60/0xc40 [btrfs]
              create_subvol+0x56a/0x990 [btrfs]
              btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
              __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
              btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
              btrfs_ioctl+0x1a92/0x36f0 [btrfs]
              __x64_sys_ioctl+0x83/0xb0
              do_syscall_64+0x33/0x80
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        INFO: Object 0x000000002b46292a @offset=13648
        INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
              btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
        INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        =============================================================================
        BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? __mutex_unlock_slowpath+0x45/0x2a0
         kmem_cache_destroy+0x55/0x120
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x000000004cf95ea8 @offset=6264
        INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
        CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
      
      So fix this by making the remount path wait for the cleaner task before
      calling btrfs_commit_super(). The remount path now waits for the bit
      BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
      btrfs_commit_super(). This ensures the cleaner cannot start a
      transaction after that, because it sleeps when the filesystem is in RO
      mode and we have already flagged the filesystem as RO before waiting for
      BTRFS_FS_CLEANER_RUNNING to be cleared.
      
      This also introduces a new flag BTRFS_FS_STATE_RO to be used for
      fs_info->fs_state when the filesystem is in RO mode. This is because we
      were doing the RO check using the flags of the superblock and setting the
      RO mode simply by ORing into the superblock's flags - those operations are
      not atomic and could result in the cleaner not seeing the update from the
      remount task after it clears BTRFS_FS_CLEANER_RUNNING.
      Tested-by: Fabian Vogt <fvogt@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0a1db70
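The ordering described in this fix (flag the filesystem RO, wait for the cleaner's running bit to clear, then commit) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `fs_model`, `MODEL_CLEANER_RUNNING`, `MODEL_STATE_RO` and the helper functions are hypothetical stand-ins and not the kernel's code, and the busy-wait stands in for the sleep the kernel would perform.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers modelling BTRFS_FS_CLEANER_RUNNING and BTRFS_FS_STATE_RO. */
enum { MODEL_CLEANER_RUNNING = 0, MODEL_STATE_RO = 1 };

struct fs_model {
	atomic_ulong flags;    /* models fs_info->flags */
	atomic_ulong fs_state; /* models fs_info->fs_state */
};

static unsigned long bit_mask(int bit)
{
	assert(bit >= 0 && bit < (int)(sizeof(unsigned long) * 8));
	return 1UL << bit;
}

/* Atomic bit helpers, so an update made by one task is never lost by another. */
static bool model_test_bit(atomic_ulong *word, int bit)
{
	return (atomic_load(word) & bit_mask(bit)) != 0;
}

static void model_set_bit(atomic_ulong *word, int bit)
{
	atomic_fetch_or(word, bit_mask(bit));
}

static void model_clear_bit(atomic_ulong *word, int bit)
{
	atomic_fetch_and(word, ~bit_mask(bit));
}

/*
 * One simplified cleaner iteration. Returns true if the cleaner would do work
 * (and so could start a transaction), false if it goes back to sleep because
 * the filesystem is flagged RO.
 */
static bool cleaner_iteration(struct fs_model *fs)
{
	bool does_work;

	model_set_bit(&fs->flags, MODEL_CLEANER_RUNNING);
	does_work = !model_test_bit(&fs->fs_state, MODEL_STATE_RO);
	model_clear_bit(&fs->flags, MODEL_CLEANER_RUNNING);
	return does_work;
}

/* Remount RW -> RO: set the RO state first, then wait for the cleaner. */
static void remount_ro(struct fs_model *fs)
{
	model_set_bit(&fs->fs_state, MODEL_STATE_RO);
	while (model_test_bit(&fs->flags, MODEL_CLEANER_RUNNING))
		; /* the kernel would sleep here instead of spinning */
	/* the super commit would happen here, with no cleaner transaction in flight */
}
```

Because the RO bit is set before the wait, any cleaner iteration that begins afterwards sees RO and does no work, which is the property the commit message relies on.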
  5. 10 Dec, 2020 4 commits
  6. 08 Dec, 2020 8 commits
    • A
      btrfs: remove unused argument seed from btrfs_find_device · b2598edf
      Anand Jain committed
      Commit 343694eee8d8 ("btrfs: switch seed device to list api") failed to
      check whether the parameter seed is true in btrfs_find_device(). This
      parameter tells the function whether to traverse the seed device list.
      
      After this commit, the argument is unused and can be removed.
      
      In device_list_add() it's not necessary because fs_devices always points
      to the device's fs_devices. So with the devid+uuid matching, it will
      find the right device and return, thus not needing to traverse seed
      devices.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b2598edf
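To illustrate why a `seed` flag is redundant once the lookup always walks the seed list, here is a minimal userspace sketch of a devid+uuid lookup. All names (`dev_model`, `fs_devices_model`, `find_device_model`) are hypothetical, and the data layout is an assumption made for the example rather than the kernel's structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_UUID_SIZE 16

/* Hypothetical, simplified device record. */
struct dev_model {
	uint64_t devid;
	unsigned char uuid[MODEL_UUID_SIZE];
};

/* Hypothetical device set; next_seed chains to seed device sets (NULL ends it). */
struct fs_devices_model {
	const struct dev_model *devices;
	size_t num_devices;
	const struct fs_devices_model *next_seed;
};

/* Search a single device set for a devid+uuid match. */
static const struct dev_model *find_in_one(const struct fs_devices_model *fsd,
					   uint64_t devid,
					   const unsigned char *uuid)
{
	for (size_t i = 0; i < fsd->num_devices; i++) {
		const struct dev_model *d = &fsd->devices[i];

		if (d->devid == devid &&
		    memcmp(d->uuid, uuid, MODEL_UUID_SIZE) == 0)
			return d;
	}
	return NULL;
}

/*
 * Search the given set first, then every seed set in the chain. There is no
 * "seed" parameter: the seed chain is always traversed when the first set
 * does not contain the device.
 */
static const struct dev_model *find_device_model(const struct fs_devices_model *fsd,
						 uint64_t devid,
						 const unsigned char *uuid)
{
	assert(uuid != NULL);
	for (; fsd != NULL; fsd = fsd->next_seed) {
		const struct dev_model *d = find_in_one(fsd, devid, uuid);

		if (d)
			return d;
	}
	return NULL;
}
```

A caller whose device lives in the first set gets its match before any seed set is visited, so always traversing costs nothing extra in that case.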
    • A
      btrfs: drop never met disk total bytes check in verify_one_dev_extent · 3a160a93
      Anand Jain committed
      Drop the condition in verify_one_dev_extent():
      btrfs_device::disk_total_bytes is set even for a seed device. The
      comment is wrong; the size is properly set when cloning the device.
      
      Commit 1b3922a8 ("btrfs: Use real device structure to verify
      dev extent") introduced it but it's unclear why the total_disk_bytes
      was 0.
      
      Theoretically, all devices (including missing and seed) marked with the
      BTRFS_DEV_STATE_IN_FS_METADATA flag get the total_disk_bytes updated at
      fill_device_from_item():
      
        open_ctree()
          btrfs_read_chunk_tree()
            read_one_dev()
              open_seed_device()
              fill_device_from_item()
      
      Even if verify_one_dev_extent() reports total_disk_bytes == 0, it's
      a bug to be fixed somewhere else and not in verify_one_dev_extent(), as
      it's just a messenger. total_disk_bytes is never expected to be zero.
      
      The function fill_device_from_item() does the job of reading it from the
      item and updating btrfs_device::disk_total_bytes. So both the missing
      device and the seed devices do have their disk_total_bytes updated.
      btrfs_find_device can also return a device from fs_info->seed_list
      because it searches it as well.
      
      Furthermore, if there is a power loss while removing a device, we
      could have a device with total_bytes = 0, and that is still valid.
      
      Instead, introduce a check against maximum block device size in
      read_one_dev().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3a160a93
    • A
      btrfs: drop unused argument step from btrfs_free_extra_devids · bacce86a
      Anand Jain committed
      Commit cf89af14 ("btrfs: dev-replace: fail mount if we don't have
      replace item with target device") dropped the multi-stage operation of
      btrfs_free_extra_devids(). The function no longer needs to check the
      replace target, so we can remove the 'step' argument.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bacce86a
    • J
      btrfs: set the lockdep class for extent buffers on creation · e114c545
      Josef Bacik committed
      Both Filipe and Fedora QA recently hit the following lockdep splat:
      
        WARNING: possible recursive locking detected
        5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1 Not tainted
        --------------------------------------------
        rsync/2610 is trying to acquire lock:
        ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        but task is already holding lock:
        ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      	 CPU0
      	 ----
          lock(&eb->lock);
          lock(&eb->lock);
      
         *** DEADLOCK ***
         May be due to missing lock nesting notation
        2 locks held by rsync/2610:
         #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
         #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
      
        stack backtrace:
        CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3.57.fc34.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x8b/0xb0
         __lock_acquire.cold+0x12d/0x2a4
         ? kvm_sched_clock_read+0x14/0x30
         ? sched_clock+0x5/0x10
         lock_acquire+0xc8/0x400
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         ? read_block_for_search.isra.0+0xdd/0x320
         _raw_read_lock+0x3d/0xa0
         ? btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_tree_read_lock_atomic+0x34/0x140
         btrfs_search_slot+0x616/0x9a0
         btrfs_lookup_dir_item+0x6c/0xb0
         btrfs_lookup_dentry+0xa8/0x520
         ? lockdep_init_map_waits+0x4c/0x210
         btrfs_lookup+0xe/0x30
         __lookup_slow+0x10f/0x1e0
         walk_component+0x11b/0x190
         path_lookupat+0x72/0x1c0
         filename_lookup+0x97/0x180
         ? strncpy_from_user+0x96/0x1e0
         ? getname_flags.part.0+0x45/0x1a0
         vfs_statx+0x64/0x100
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         __do_sys_newlstat+0x26/0x40
         ? lockdep_hardirqs_on_prepare+0xff/0x180
         ? syscall_enter_from_user_mode+0x27/0x80
         ? syscall_enter_from_user_mode+0x27/0x80
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      I have also seen a report of lockdep complaining about the lock class
      that was looked up being the same as the lock class on the lock we were
      using, but I can't find the report.
      
      These are problems that occur because we do not have the lockdep class
      set on the extent buffer until _after_ we read the eb in properly.  This
      is problematic for concurrent readers, because we will create the extent
      buffer, lock it, and then attempt to read the extent buffer.
      
      If a second thread comes in and tries to do a search down the same path
      they'll get the above lockdep splat because the class isn't set properly
      on the extent buffer.
      
      There was a good reason for this: we generally didn't know the real
      owner of the eb until we read it, specifically in refcounted roots.
      
      However now all refcounted roots have the same class name, so we no
      longer need to worry about this.  For non-refcounted trees we know
      which root we're on based on the parent.
      
      Fix this by setting the lockdep class on the eb at creation time instead
      of read time.  This will fix the splat and the weirdness where the class
      changes in the middle of locking the block.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e114c545
    • J
      btrfs: pass the owner_root and level to alloc_extent_buffer · 3fbaf258
      Josef Bacik committed
      Now that we've plumbed all of the callers to have the owner root and the
      level, plumb it down into alloc_extent_buffer().
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3fbaf258
    • J
      btrfs: cleanup extent buffer readahead · bfb484d9
      Josef Bacik committed
      We're going to pass around more information when we allocate extent
      buffers. To make that cleaner, clean up how we do readahead.  Most of the
      callers have the parent node that we're getting our blockptr from, with
      the sole exception of relocation which simply has the bytenr it wants to
      read.
      
      Add a helper that takes the current arguments that we need (bytenr and
      gen), and add another helper for simply reading the slot out of a node.
      In followup patches the helper that takes all the extra arguments will
      be expanded, and the simpler helper won't need to have its arguments
      adjusted.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bfb484d9
    • A
      btrfs: create read policy framework · 33fd2f71
      Anand Jain committed
      As of now, we use the pid method to read striped mirrored data, which
      means the process id determines the stripe id to read. This type of routing
      typically helps in a system with many small independent processes trying
      to read random data. On the other hand, the pid based read IO policy is
      inefficient because if there is a single process trying to read a large
      file, the overall disk bandwidth remains underutilized.
      
      So this patch introduces a read policy framework so that we could add
      more read policies, such as IO routing based on the device's wait-queue
      or manual when we have a read-preferred device or a policy based on the
      target storage caching.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      33fd2f71
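The pid-based routing described above, and why a single reader leaves mirrors idle, can be shown with a small sketch. The enum and function names here are hypothetical, chosen for the example; only the "process id picks the stripe" rule comes from the commit message, and pids are assumed to be non-negative.

```c
#include <assert.h>

/* Hypothetical policy selector; only the pid policy is described in the commit. */
enum read_policy_model {
	READ_POLICY_MODEL_PID,
};

/*
 * Pick which of num_stripes mirrored stripes (starting at index first) a read
 * should go to. With the pid policy, the process id alone decides.
 */
static int pick_stripe(enum read_policy_model policy, int first,
		       int num_stripes, int pid)
{
	assert(num_stripes > 0 && pid >= 0);

	switch (policy) {
	case READ_POLICY_MODEL_PID:
	default:
		return first + (pid % num_stripes);
	}
}
```

Two processes with consecutive pids land on different mirrors, but one process always lands on the same mirror no matter how much it reads, which is the underutilization the commit points out. A framework like this leaves room for further `case` labels as new policies are added.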
    • J
      btrfs: introduce mount option rescue=ignorebadroots · 42437a63
      Josef Bacik committed
      In the face of extent root corruption, or any other core fs-wide root
      corruption, we will fail to mount the file system.  This makes recovery
      kind of a pain, because you need to fall back to userspace tools to
      scrape off data.  Instead provide a mechanism to gracefully handle bad
      roots, so we can at least mount read-only and possibly recover data from
      the file system.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42437a63
  7. 02 Dec, 2020 1 commit
  8. 24 Nov, 2020 1 commit
    • J
      btrfs: don't access possibly stale fs_info data for printing duplicate device · 0697d9a6
      Johannes Thumshirn committed
      Syzbot reported a possible use-after-free when printing a duplicate device
      warning in device_list_add().
      
      At this point it can happen that a btrfs_device::fs_info is not correctly
      set up yet, so we're accessing stale data when printing the warning
      message using the btrfs_printk() wrappers.
      
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
        Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
      
        CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x1d6/0x29e lib/dump_stack.c:118
         print_address_description+0x66/0x620 mm/kasan/report.c:383
         __kasan_report mm/kasan/report.c:513 [inline]
         kasan_report+0x132/0x1d0 mm/kasan/report.c:530
         btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
         device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
         btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
         btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x44840a
        RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
        RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
        RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
        R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
      
        Allocated by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track mm/kasan/common.c:56 [inline]
         __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
         kmalloc_node include/linux/slab.h:577 [inline]
         kvmalloc_node+0x81/0x110 mm/util.c:574
         kvmalloc include/linux/mm.h:757 [inline]
         kvzalloc include/linux/mm.h:765 [inline]
         btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
         kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
         __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
         __cache_free mm/slab.c:3418 [inline]
         kfree+0x113/0x200 mm/slab.c:3756
         deactivate_locked_super+0xa7/0xf0 fs/super.c:335
         btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff8880878e0000
         which belongs to the cache kmalloc-16k of size 16384
        The buggy address is located 1704 bytes inside of
         16384-byte region [ffff8880878e0000, ffff8880878e4000)
        The buggy address belongs to the page:
        page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
        head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
        flags: 0xfffe0000010200(slab|head)
        raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
        raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      				    ^
         ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      The syzkaller reproducer for this use-after-free crafts a filesystem image
      and loop mounts it twice in a loop. The mount will fail as the crafted
      image has an invalid chunk tree. When this happens btrfs_mount_root() will
      call deactivate_locked_super(), which then cleans up fs_info and
      fs_info::sb. If a second thread now adds the same block-device to the
      filesystem, it will get detected as a duplicate device and
      device_list_add() will reject the duplicate and print a warning. But as
      the fs_info pointer passed in is non-NULL this will result in a
      use-after-free.
      
      Instead of printing possibly uninitialized or already freed memory in
      btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
      device name will be skipped altogether.
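
The effect of passing a NULL fs_info can be sketched with a small Python
model. This is an illustration only, not the kernel code: the class names,
format_warning(), and the "<unknown>" placeholder are assumptions made for
the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuperBlock:
    s_id: str  # device name shown in log messages


@dataclass
class FsInfo:
    sb: SuperBlock


def format_warning(fs_info: Optional[FsInfo], msg: str) -> str:
    """Build a warning line, touching fs_info only when one was passed in."""
    # With fs_info=None (the model's NULL) nothing is dereferenced, so a
    # stale or freed fs_info can never be read while formatting the message.
    device = fs_info.sb.s_id if fs_info is not None else "<unknown>"
    return f"BTRFS warning (device {device}): {msg}"
```

Calling format_warning(None, "duplicate device") yields a message without a
device name, which is the trade-off the fix accepts.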
      
      There was a slightly different approach discussed in
      https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
      Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0697d9a6
  9. 05 Nov, 2020 1 commit
    • A
      btrfs: dev-replace: fail mount if we don't have replace item with target device · cf89af14
      Anand Jain committed
      If there is a device with devid BTRFS_DEV_REPLACE_DEVID but no device
      replace item, the filesystem is in an inconsistent state. This is either
      corruption or a crafted image. Fail the mount, as this needs a closer
      look at what is actually wrong.
      
      As of now, if BTRFS_DEV_REPLACE_DEVID is present without the replace
      item, in __btrfs_free_extra_devids() we determine that there is an
      extra device, free those extra devices, and continue to mount the
      device.
      However, we were wrong in keeping track of the rw_devices, so the syzbot
      testcase failed:
      
        WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x198/0x1fd lib/dump_stack.c:118
         panic+0x347/0x7c0 kernel/panic.c:231
         __warn.cold+0x20/0x46 kernel/panic.c:600
         report_bug+0x1bd/0x210 lib/bug.c:198
         handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
         exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
         asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
        RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
        RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
        RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
        RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
        RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
        R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
         close_fs_devices fs/btrfs/volumes.c:1193 [inline]
         btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
         open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
         btrfs_fill_super fs/btrfs/super.c:1316 [inline]
         btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672
      
      The fix here is, when we determine that there isn't a replace item
      then fail the mount if there is a replace target device (devid 0).
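
The rule described above can be modelled in a few lines of Python. This is a
sketch under assumptions, not the kernel implementation: check_replace_state()
and the numeric EUCLEAN value are hypothetical choices for the example; only
BTRFS_DEV_REPLACE_DEVID being devid 0 comes from the text.

```python
from typing import Iterable

BTRFS_DEV_REPLACE_DEVID = 0  # devid of a replace target device (from the text)
EUCLEAN = 117                # assumed error code, used here for illustration


def check_replace_state(devids: Iterable[int], has_replace_item: bool) -> int:
    """Return 0 if the mount may proceed, or a negative error code."""
    # A replace target device without a replace item means the on-disk
    # state is inconsistent, so the mount is refused instead of continuing.
    if not has_replace_item and BTRFS_DEV_REPLACE_DEVID in set(devids):
        return -EUCLEAN
    return 0
```

For example, check_replace_state([0, 1, 2], has_replace_item=False) returns a
negative value, while the same device list with a replace item returns 0.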
      
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cf89af14
  10. 27 Oct, 2020 1 commit
  11. 26 Oct, 2020 1 commit
    • F
      btrfs: fix readahead hang and use-after-free after removing a device · 66d204a1
      Filipe Manana committed
      Very sporadically I had test case btrfs/069 from fstests hanging (for
      years, it is not a recent regression), with the following traces in
      dmesg/syslog:
      
        [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
        [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
        [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
        [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
        [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
        [162513.514751] Call Trace:
        [162513.514761]  __schedule+0x5ce/0xd00
        [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.514771]  schedule+0x46/0xf0
        [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.514850]  ? finish_wait+0x90/0x90
        [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
        [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
        [162513.514894]  kthread+0x153/0x170
        [162513.514897]  ? kthread_stop+0x2c0/0x2c0
        [162513.514902]  ret_from_fork+0x22/0x30
        [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
        [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
        [162513.515682] Call Trace:
        [162513.515688]  __schedule+0x5ce/0xd00
        [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.515697]  schedule+0x46/0xf0
        [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.515716]  ? finish_wait+0x90/0x90
        [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.515761]  iterate_supers+0x87/0xf0
        [162513.515765]  ksys_sync+0x60/0xb0
        [162513.515768]  __do_sys_sync+0xa/0x10
        [162513.515771]  do_syscall_64+0x33/0x80
        [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.515781] RIP: 0033:0x7f5238f50bd7
        [162513.515782] Code: Bad RIP value.
        [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
        [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
        [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
        [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
        [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
        [162513.516620] Call Trace:
        [162513.516625]  __schedule+0x5ce/0xd00
        [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.516634]  schedule+0x46/0xf0
        [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.516650]  ? finish_wait+0x90/0x90
        [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
        [162513.516686]  __vfs_setxattr+0x66/0x80
        [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
        [162513.516697]  vfs_setxattr+0x6b/0x120
        [162513.516703]  setxattr+0x125/0x240
        [162513.516709]  ? lock_acquire+0xb1/0x480
        [162513.516712]  ? mnt_want_write+0x20/0x50
        [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
        [162513.516723]  ? preempt_count_add+0x49/0xa0
        [162513.516725]  ? __sb_start_write+0x19b/0x290
        [162513.516727]  ? preempt_count_add+0x49/0xa0
        [162513.516732]  path_setxattr+0xba/0xd0
        [162513.516739]  __x64_sys_setxattr+0x27/0x30
        [162513.516741]  do_syscall_64+0x33/0x80
        [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.516745] RIP: 0033:0x7f5238f56d5a
        [162513.516746] Code: Bad RIP value.
        [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
        [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
        [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
        [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
        [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
        [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
        [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
        [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
        [162513.517780] Call Trace:
        [162513.517786]  __schedule+0x5ce/0xd00
        [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.517796]  schedule+0x46/0xf0
        [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.517814]  ? finish_wait+0x90/0x90
        [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.517865]  iterate_supers+0x87/0xf0
        [162513.517869]  ksys_sync+0x60/0xb0
        [162513.517872]  __do_sys_sync+0xa/0x10
        [162513.517875]  do_syscall_64+0x33/0x80
        [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.517881] RIP: 0033:0x7f5238f50bd7
        [162513.517883] Code: Bad RIP value.
        [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
        [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
        [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
        [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
        [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
        [162513.519160] Call Trace:
        [162513.519165]  __schedule+0x5ce/0xd00
        [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.519174]  schedule+0x46/0xf0
        [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.519193]  ? finish_wait+0x90/0x90
        [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
        [162513.519230]  lookup_open+0x522/0x650
        [162513.519246]  path_openat+0x2b8/0xa50
        [162513.519270]  do_filp_open+0x91/0x100
        [162513.519275]  ? find_held_lock+0x32/0x90
        [162513.519280]  ? lock_acquired+0x33b/0x470
        [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
        [162513.519287]  ? _raw_spin_unlock+0x29/0x40
        [162513.519295]  do_sys_openat2+0x20d/0x2d0
        [162513.519300]  do_sys_open+0x44/0x80
        [162513.519304]  do_syscall_64+0x33/0x80
        [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.519309] RIP: 0033:0x7f5238f4a903
        [162513.519310] Code: Bad RIP value.
        [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
        [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
        [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
        [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
        [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
        [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
        [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
        [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
        [162513.520511] Call Trace:
        [162513.520516]  __schedule+0x5ce/0xd00
        [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.520525]  schedule+0x46/0xf0
        [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
        [162513.520548]  ? finish_wait+0x90/0x90
        [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
        [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
        [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
        [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
        [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
        [162513.520643]  ? do_sigaction+0xf3/0x240
        [162513.520645]  ? find_held_lock+0x32/0x90
        [162513.520648]  ? do_sigaction+0xf3/0x240
        [162513.520651]  ? lock_acquired+0x33b/0x470
        [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
        [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
        [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
        [162513.520662]  ? do_sigaction+0xf3/0x240
        [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
        [162513.520672]  __x64_sys_ioctl+0x83/0xb0
        [162513.520677]  do_syscall_64+0x33/0x80
        [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.520681] RIP: 0033:0x7fc3cd307d87
        [162513.520682] Code: Bad RIP value.
        [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
        [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
        [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
        [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
        [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
        [162513.520703]
      		  Showing all locks held in the system:
        [162513.520712] 1 lock held by khungtaskd/54:
        [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
        [162513.520728] 1 lock held by in:imklog/596:
        [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
        [162513.520782] 1 lock held by btrfs-transacti/1356167:
        [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
        [162513.520798] 1 lock held by btrfs/1356190:
        [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
        [162513.520805] 1 lock held by fsstress/1356184:
        [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520811] 3 locks held by fsstress/1356185:
        [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
        [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520833] 1 lock held by fsstress/1356196:
        [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520838] 3 locks held by fsstress/1356197:
        [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
        [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520858] 2 locks held by btrfs/1356211:
        [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
        [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
      
      This was weird because the stack traces show that a transaction commit,
      triggered by a device replace operation, is blocking trying to pause any
      running scrubs but there are no stack traces of blocked tasks doing a
      scrub.
      
      After poking around with drgn, I noticed there was a scrub task that was
      constantly running and blocking for short periods of time:
      
        >>> t = find_task(prog, 1356190)
        >>> prog.stack_trace(t)
        #0  __schedule+0x5ce/0xcfc
        #1  schedule+0x46/0xe4
        #2  schedule_timeout+0x1df/0x475
        #3  btrfs_reada_wait+0xda/0x132
        #4  scrub_stripe+0x2a8/0x112f
        #5  scrub_chunk+0xcd/0x134
        #6  scrub_enumerate_chunks+0x29e/0x5ee
        #7  btrfs_scrub_dev+0x2d5/0x91b
        #8  btrfs_ioctl+0x7f5/0x36e7
        #9  __x64_sys_ioctl+0x83/0xb0
        #10 do_syscall_64+0x33/0x77
        #11 entry_SYSCALL_64+0x7c/0x156
      
      Which corresponds to:
      
      int btrfs_reada_wait(void *handle)
      {
          struct reada_control *rc = handle;
          struct btrfs_fs_info *fs_info = rc->fs_info;
      
          while (atomic_read(&rc->elems)) {
              if (!atomic_read(&fs_info->reada_works_cnt))
                  reada_start_machine(fs_info);
              wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                                (HZ + 9) / 10);
          }
      (...)
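
A rough Python model of the loop quoted above may help when reading it. It is
a sketch under assumptions, not kernel code: timeout_jiffies() and
polls_until_done() are hypothetical names, each "poll" stands for one
wait_event_timeout() call, and max_polls exists only so the model terminates.

```python
from typing import Optional


def timeout_jiffies(hz: int) -> int:
    """Integer form of (HZ + 9) / 10: a tenth of a second in jiffies, rounded up."""
    return (hz + 9) // 10


def polls_until_done(elems: int, completed_per_poll: int,
                     max_polls: int) -> Optional[int]:
    """Count polls until the element counter reaches 0.

    Returns None if the counter is still non-zero after max_polls polls,
    which models a wait loop that never makes progress.
    """
    polls = 0
    while elems > 0:
        if polls >= max_polls:
            return None
        # Each poll, some readahead elements may complete and drop the counter.
        elems = max(0, elems - completed_per_poll)
        polls += 1
    return polls
```

With HZ set to 250, timeout_jiffies(250) is 25 jiffies, that is 100 ms per
poll. A counter of 1 with no completions (polls_until_done(1, 0, 1000))
returns None in the model.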
      
      So the counter "rc->elems" was set to 1 and never decreased to 0, causing
      the scrub task to loop forever in that function. Then I used the following
      script for drgn to check the readahead requests:
      
        $ cat dump_reada.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        def dump_re(re):
            nzones = re.nzones.value_()
            print(f're at {hex(re.value_())}')
            print(f'\t logical {re.logical.value_()}')
            print(f'\t refcnt {re.refcnt.value_()}')
            print(f'\t nzones {nzones}')
            for i in range(nzones):
                dev = re.zones[i].device
                name = dev.name.str.string_()
                print(f'\t\t dev id {dev.devid.value_()} name {name}')
            print()
      
        for _, e in radix_tree_for_each(fs_info.reada_tree):
            re = cast('struct reada_extent *', e)
            dump_re(re)
      
        $ drgn dump_reada.py
        re at 0xffff8f3da9d25ad8
                logical 38928384
                refcnt 1
                nzones 1
                       dev id 0 name b'/dev/sdd'
        $
      
      So there was one readahead extent with a single zone corresponding to the
      source device of that last device replace operation logged in dmesg/syslog.
      Also the ID of that zone's device was 0 which is a special value set in
      the source device of a device replace operation when the operation finishes
      (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
      confirming again that device /dev/sdd was the source of a device replace
      operation.
      
      Normally there should be as many zones in the readahead extent as there are
      devices, and I wasn't expecting the extent to be in a block group with a
      'single' profile, so I went and confirmed with the following drgn script
      that there weren't any single profile block groups:
      
        $ cat dump_block_groups.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        BTRFS_BLOCK_GROUP_DATA = (1 << 0)
        BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
        BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
        BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
        BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
        BTRFS_BLOCK_GROUP_DUP = (1 << 5)
        BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
        BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
        BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
        BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
        BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
      
        def bg_flags_string(bg):
            flags = bg.flags.value_()
            ret = ''
            if flags & BTRFS_BLOCK_GROUP_DATA:
                ret = 'data'
            if flags & BTRFS_BLOCK_GROUP_METADATA:
                if len(ret) > 0:
                    ret += '|'
                ret += 'meta'
            if flags & BTRFS_BLOCK_GROUP_SYSTEM:
                if len(ret) > 0:
                    ret += '|'
                ret += 'system'
            if flags & BTRFS_BLOCK_GROUP_RAID0:
                ret += ' raid0'
            elif flags & BTRFS_BLOCK_GROUP_RAID1:
                ret += ' raid1'
            elif flags & BTRFS_BLOCK_GROUP_DUP:
                ret += ' dup'
            elif flags & BTRFS_BLOCK_GROUP_RAID10:
                ret += ' raid10'
            elif flags & BTRFS_BLOCK_GROUP_RAID5:
                ret += ' raid5'
            elif flags & BTRFS_BLOCK_GROUP_RAID6:
                ret += ' raid6'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
                ret += ' raid1c3'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
                ret += ' raid1c4'
            else:
                ret += ' single'
      
            return ret
      
        def dump_bg(bg):
            print()
            print(f'block group at {hex(bg.value_())}')
            print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
            print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
      
        bg_root = fs_info.block_group_cache_tree.address_of_()
        for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
            dump_bg(bg)
      
        $ drgn dump_block_groups.py
      
        block group at 0xffff8f3d673b0400
               start 22020096 length 16777216
               flags 258 - system raid6
      
        block group at 0xffff8f3d53ddb400
               start 38797312 length 536870912
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4d9c00
               start 575668224 length 2147483648
               flags 257 - data raid6
      
        block group at 0xffff8f3d08189000
               start 2723151872 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3db70ff000
               start 2790260736 length 1073741824
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4dd800
               start 3864002560 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3d67037000
               start 3931111424 length 2147483648
               flags 257 - data raid6
        $
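
The numeric flags in the output above are bitmasks, for example
258 = (1 << 8) | (1 << 1). The table-driven decoder below is a compact
restatement of bg_flags_string() from the script, written as a standalone
sketch; decode_flags() and the two table names are introduced here for
illustration, while the bit values are the ones listed in the script.

```python
# (bit, name) pairs, in the same order the script checks them.
TYPE_BITS = [(1 << 0, "data"), (1 << 2, "meta"), (1 << 1, "system")]
PROFILE_BITS = [
    (1 << 3, "raid0"), (1 << 4, "raid1"), (1 << 5, "dup"),
    (1 << 6, "raid10"), (1 << 7, "raid5"), (1 << 8, "raid6"),
    (1 << 9, "raid1c3"), (1 << 10, "raid1c4"),
]


def decode_flags(flags: int) -> str:
    """Turn a block group flags value into 'type profile' text."""
    types = [name for bit, name in TYPE_BITS if flags & bit]
    # The first matching profile bit wins; no profile bit means 'single'.
    profile = next((name for bit, name in PROFILE_BITS if flags & bit), "single")
    return "|".join(types) + " " + profile
```

Here decode_flags(258), decode_flags(260) and decode_flags(257) give
"system raid6", "meta raid6" and "data raid6", matching the dump, so no
block group decodes to "single".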
      
      So there were only 2 reasons left for having a readahead extent with a
      single zone: reada_find_zone(), called when creating a readahead extent,
      returned NULL either because we failed to find the corresponding block
      group or because a memory allocation failed. With some additional and
      custom tracing I figured out that on every further occurrence of the
      problem the block group had just been deleted when we were looping to
      create the zones for the readahead extent (at reada_find_extent()), so we
      ended up with only one zone in the readahead extent, corresponding to a
      device that ends up getting replaced.
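
The way a readahead extent can end up with a single zone can be modelled as
follows. This is a simplified sketch, not the reada_find_extent() code:
build_zones() and its deleted_after parameter are hypothetical, and the
device names other than /dev/sdd are made up for the example.

```python
from typing import List


def build_zones(stripe_devices: List[str], deleted_after: int) -> List[str]:
    """Collect one zone per stripe while the block group can still be found.

    deleted_after is the number of loop iterations that complete before the
    block group is deleted by another task.
    """
    zones = []
    for i, dev in enumerate(stripe_devices):
        block_group_found = i < deleted_after
        if not block_group_found:
            # Models the zone lookup returning NULL: no zone for this stripe.
            continue
        zones.append(dev)
    return zones
```

With four stripes and the block group deleted after the first iteration,
build_zones(["/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdh"], 1) leaves only
the zone for /dev/sdd.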
      
      So after figuring that out it became obvious why the hang happens:
      
      1) Task A starts a scrub on any device of the filesystem, except for
         device /dev/sdd;
      
      2) Task B starts a device replace with /dev/sdd as the source device;
      
      3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
         starting to scrub a stripe from block group X. This call to
         btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
         calls reada_add_block(), it passes the logical address of the extent
         tree's root node as its 'logical' argument - a value of 38928384;
      
      4) Task A then enters reada_find_extent(), called from reada_add_block().
         It finds there isn't any existing readahead extent for the logical
         address 38928384, so it proceeds to the path of creating a new one.
      
         It calls btrfs_map_block() to find out which stripes exist for the block
         group X. On the first iteration of the for loop that iterates over the
         stripes, it finds the stripe for device /dev/sdd, so it creates one
         zone for that device and adds it to the readahead extent. Before getting
         into the second iteration of the loop, the cleanup kthread deletes block
         group X because it was empty. So in the iterations for the remaining
         stripes it does not add more zones to the readahead extent, because the
         calls to reada_find_zone() returned NULL because they couldn't find
         block group X anymore.
      
         As a result the new readahead extent has a single zone, corresponding to
         the device /dev/sdd;
      
      5) Before task A returns to btrfs_reada_add() and queues the readahead job
         for the readahead work queue, task B finishes the device replace and at
         btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
         device /dev/sdg;
      
      6) Task A returns to reada_add_block(), which increments the counter
         "->elems" of the reada_control structure allocated at btrfs_reada_add().
      
         Then it returns back to btrfs_reada_add() and calls
         reada_start_machine(). This queues a job in the readahead work queue to
         run the function reada_start_machine_worker(), which calls
         __reada_start_machine().
      
         At __reada_start_machine() we take the device list mutex and for each
         device found in the current device list, we call
         reada_start_machine_dev() to start the readahead work. However at this
         point the device /dev/sdd was already freed and is not in the device
         list anymore.
      
         This means the corresponding readahead for the extent at 38928384 is
         never started, and therefore the "->elems" counter of the reada_control
         structure allocated at btrfs_reada_add() never goes down to 0, causing
         the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
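
      The steps above can be modelled with a small single-threaded sketch in
      userspace C. Every name prefixed with sim_ is hypothetical and only
      stands in for the kernel structures; the point is that the counter is
      decremented only when the worker reaches the device of one of the
      extent's zones, so a replaced device leaves it stuck at 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sim_device { int id; };

struct sim_reada_extent {
	const struct sim_device *zone_dev; /* device of the extent's only zone */
	bool done;
};

struct sim_reada_control { int elems; };

/* Models reada_add_block(): one more element the waiter must see finish. */
static void sim_add_block(struct sim_reada_control *rc)
{
	rc->elems++;
}

/*
 * Models the worker: it only visits devices still in the device list and
 * completes the extent only when it reaches the device of its zone.
 */
static void sim_start_machine(struct sim_reada_control *rc,
			      struct sim_reada_extent *re,
			      const struct sim_device *const *dev_list,
			      size_t nr_devs)
{
	assert(rc && re);
	for (size_t i = 0; i < nr_devs; i++) {
		if (!re->done && dev_list[i] == re->zone_dev) {
			re->done = true;
			rc->elems--;
		}
	}
}

/* Returns the number of elements a waiter would still be blocked on. */
static int sim_pending_after_replace(bool replace_finished_first)
{
	struct sim_device sdd = { 1 }, sde = { 2 }, sdg = { 3 };
	struct sim_reada_control rc = { 0 };
	struct sim_reada_extent re = { &sdd, false };
	const struct sim_device *before[] = { &sdd, &sde };
	const struct sim_device *after[] = { &sdg, &sde };

	sim_add_block(&rc);
	sim_start_machine(&rc, &re, replace_finished_first ? after : before, 2);
	return rc.elems;
}
```

      When the device list still contains /dev/sdd the counter drops back to
      zero; when /dev/sdg has already taken its place, one element stays
      pending, which is the condition the scrub task waits on forever.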
      
      Note that the readahead request can be made either before or after the
      device replace starts. In practice, however, it is very unlikely that a
      device replace can start after a readahead request is made and still
      complete before that request completes, except perhaps on a very small
      and nearly empty filesystem.
      
      This hang however is not the only problem we can have with readahead and
      device removals. When the readahead extent has zones other than the
      one corresponding to the device that is being removed (either by a device
      replace or a device remove operation), we risk having a use-after-free on
      the device when dropping the last reference of the readahead extent.
      
      For example if we create a readahead extent with two zones, one for the
      device /dev/sdd and one for the device /dev/sde:
      
      1) Before the readahead worker starts, the device /dev/sdd is removed,
         and the corresponding btrfs_device structure is freed. However the
         readahead extent still has the zone pointing to the device structure;
      
      2) When the readahead worker starts, it only finds device /dev/sde in the
         current device list of the filesystem;
      
      3) It starts the readahead work, at reada_start_machine_dev(), using the
         device /dev/sde;
      
      4) Then when it finishes reading the extent from device /dev/sde, it calls
         __readahead_hook() which ends up dropping the last reference on the
         readahead extent through the last call to reada_extent_put();
      
      5) At reada_extent_put() it iterates over each zone of the readahead extent
         and attempts to delete an element from the device's 'reada_extents'
         radix tree, resulting in a use-after-free, as the device pointer of the
         zone for /dev/sdd is now stale. We can also access the device after
         dropping the last reference of a zone, through reada_zone_release(),
         also called by reada_extent_put().
      
      A device remove suffers from the same problem. However, since it shrinks
      the device size down to zero before removing the device, it is very
      unlikely that readahead requests are still pending by the time we free
      the device. The only possibility is a device with very little space
      allocated.
      
      While the hang problem is exclusive to scrub, since it is currently the
      only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
      problem affects any path that triggers readahead, which includes
      btree_readahead_hook() and __readahead_hook() (a readahead worker can
      trigger readahead for the children of a node), for example. Any path that
      ends up calling reada_add_block() can trigger the use-after-free after a
      device is removed.
      
      So fix this by waiting for any readahead requests for a device to complete
      before removing a device, ensuring that while waiting for existing ones no
      new ones can be made.
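
      The idea of the fix can be sketched as a "drain before free" pattern.
      This is a minimal single-threaded model in userspace C, not the actual
      patch: the sim_ names, the removing flag and the inflight counter are
      assumptions made for illustration, and real code would need locking or
      atomics around them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state; not the kernel's struct btrfs_device. */
struct sim_device {
	bool removing; /* set once removal starts: new requests are rejected */
	int inflight;  /* readahead requests queued but not yet completed */
};

/* Try to queue a readahead request; fails once removal has begun. */
static bool sim_reada_add(struct sim_device *dev)
{
	if (dev->removing)
		return false;
	dev->inflight++;
	return true;
}

/* Complete one previously queued request. */
static void sim_reada_done(struct sim_device *dev)
{
	assert(dev->inflight > 0);
	dev->inflight--;
}

/* First step of removal: block new requests for this device. */
static void sim_remove_begin(struct sim_device *dev)
{
	dev->removing = true;
}

/* Second step: the device may be freed only when nothing is in flight. */
static bool sim_can_free(const struct sim_device *dev)
{
	return dev->removing && dev->inflight == 0;
}
```

      Blocking new requests first is what makes the wait terminate: the
      in-flight count can then only go down.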
      
      This problem has been around for a very long time: the readahead code was
      added in 2011, device remove has existed since 2008 and device replace was
      introduced in 2013, so it is hard to pick a specific commit for a git
      Fixes tag.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      66d204a1
  12. 07 Oct, 2020 15 commits
    • A
      btrfs: skip devices without magic signature when mounting · 96c2e067
      Anand Jain committed
      Many things can happen after the device is scanned and before the device
      is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
      If that happens, we still don't free the device from memory, which
      confuses userland.
      
      For example, because BTRFS_IOC_DEV_INFO still carries the path of the
      device that no longer has the BTRFS_MAGIC, 'btrfs fi show' still lists a
      device that does not belong to the filesystem anymore:
      
        $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
        $ wipefs -a /dev/sdb
        # /dev/sdb does not contain magic signature
        $ mount -o degraded /dev/sda /btrfs
        $ btrfs fi show -m
        Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
      	  Total devices 2 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
      	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb
      
      We need to distinguish between a missing signature and an invalid
      superblock, so add a specific error code, ENODATA, for the former. This
      also fixes the failure of fstest btrfs/198.
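
      The distinction can be sketched as follows. This is an illustrative
      userspace model, not the kernel code: the sim_ names and the csum_ok
      flag are hypothetical, and the use of -EINVAL for "signature present
      but superblock invalid" is an assumption. Only ENODATA for the missing
      signature comes from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* The btrfs magic, ASCII "_BHRfS_M" read as a little-endian 64-bit value. */
#define SIM_BTRFS_MAGIC 0x4D5F53665248425FULL

/* Hypothetical, heavily reduced superblock. */
struct sim_super {
	uint64_t magic;
	int csum_ok; /* assumption: stands in for all other validity checks */
};

/* Returns 0 if usable, -ENODATA if the signature is gone, -EINVAL otherwise. */
static int sim_read_super(const struct sim_super *sb)
{
	assert(sb);
	if (sb->magic != SIM_BTRFS_MAGIC)
		return -ENODATA;
	if (!sb->csum_ok)
		return -EINVAL;
	return 0;
}

/* Returns 1 to keep the device, 0 to skip it, or a negative errno to fail. */
static int sim_scan_for_mount(const struct sim_super *sb)
{
	int ret = sim_read_super(sb);

	if (ret == -ENODATA)
		return 0; /* no longer a btrfs device: skip it */
	if (ret < 0)
		return ret;
	return 1;
}
```

      With a dedicated error code the caller can quietly skip a wiped device
      while still treating a damaged superblock as a real error.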
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      96c2e067
    • J
      btrfs: return error if we're unable to read device stats · 92e26df4
      Josef Bacik committed
      I noticed when fixing device stats for seed devices that we simply threw
      away the return value from btrfs_search_slot().  This is because we may
      not have stat items, but we could very well get an error, and thus miss
      reporting the error up the chain.
      
      Fix this by checking ret: if it's an actual error, stop initializing the
      stats of the remaining devices and return the error up the chain.
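
      A minimal sketch of that return-value handling, assuming the usual
      search convention of a negative value for an error, 0 for "found" and
      a positive value for "not found". The sim_ names are hypothetical and
      the search_ret field simply stands in for the result of the tree
      lookup; treating a missing item as "stats start from zero" is also an
      assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device: search_ret is the canned result of the lookup. */
struct sim_dev {
	int search_ret;  /* <0 error, 0 item found, >0 item not found */
	int stats_valid; /* set once the stats have been initialized */
};

/* Initialize the stats of one device, propagating real errors. */
static int sim_init_one(struct sim_dev *dev)
{
	int ret = dev->search_ret; /* stands in for the tree search */

	if (ret < 0)
		return ret; /* a real error: report it */

	/* ret > 0 means no stats item exists yet, which is not an error. */
	dev->stats_valid = 1;
	return 0;
}

/* Stop at the first device that fails and return its error. */
static int sim_init_all(struct sim_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret = sim_init_one(&devs[i]);

		if (ret)
			return ret;
	}
	return 0;
}
```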
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      92e26df4
    • J
      btrfs: init device stats for seed devices · 124604eb
      Josef Bacik committed
      We recently started recording device stats across the fleet, and noticed
      a large increase in messages such as this
      
        BTRFS warning (device dm-0): get dev_stats failed, not yet valid
      
      on our tiers that use seed devices for their root devices.  This is
      because we do not initialize the device stats for any seed devices if we
      have a sprout device and mount using that sprout device.  The basic
      steps for reproducing are:
      
        $ mkfs seed device
        $ mount seed device
        # fill seed device
        $ umount seed device
        $ btrfstune -S 1 seed device
        $ mount seed device
        $ btrfs device add -f sprout device /mnt/wherever
        $ umount /mnt/wherever
        $ mount sprout device /mnt/wherever
        $ btrfs device stats /mnt/wherever
      
      This will fail with the above message in dmesg.
      
      Fix this by also iterating over fs_devices->seed, if any exist, in
      btrfs_init_dev_stats.  This fixes the problem and properly reports the
      stats for both devices.
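
      The shape of the fix can be sketched as a walk along the seed chain.
      The structure below is a hypothetical reduction (sim_ names, an
      initialized flag and a device count); it only assumes that each
      fs_devices points to its seed through a ->seed pointer, as the text
      above suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of fs_devices: a sprout points to its seed. */
struct sim_fs_devices {
	int nr_devices;              /* devices owned by this fs_devices */
	int initialized;             /* set once its stats are initialized */
	struct sim_fs_devices *seed; /* next seed in the chain, or NULL */
};

/*
 * Initialize stats for the mounted fs_devices and every seed behind it.
 * Returns the number of devices covered.
 */
static int sim_init_dev_stats(struct sim_fs_devices *fs_devices)
{
	int covered = 0;

	assert(fs_devices);
	for (struct sim_fs_devices *cur = fs_devices; cur; cur = cur->seed) {
		cur->initialized = 1;
		covered += cur->nr_devices;
	}
	return covered;
}
```

      Starting the walk at the sprout and following ->seed covers both the
      sprout device and the seed device from the reproducer.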
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ rename to btrfs_device_init_dev_stats ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      124604eb
    • A
      btrfs: simplify gotos in open_seed_device · c83b60c0
      Anand Jain committed
      The function does not have a common exit block and returns immediately,
      so there's no point in having the gotos. Remove the two cases.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c83b60c0
    • A
      btrfs: remove unnecessary tmp variable in btrfs_assign_next_active_device() · e493e8f9
      Anand Jain committed
      We can check the argument value directly, no need for the temporary
      variable.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e493e8f9
    • A
      btrfs: use sprout device_list_mutex in btrfs_init_devices_late · e17125b5
      Anand Jain committed
      On a mounted sprout filesystem, all threads now use the
      sprout::device_list_mutex, and this is the only code still using the
      seed::device_list_mutex. Convert it to use the sprout's
      fs_info->fs_devices->device_list_mutex.
      
      The same reasoning holds true here: device delete holds the
      sprout::device_list_mutex.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e17125b5
    • A
      btrfs: split and refactor btrfs_sysfs_remove_devices_dir · 53f8a74c
      Anand Jain committed
      Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
      btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
      argument to indicate whether to free all devices or just one device.
      
      Export btrfs_sysfs_remove_device(), as device operations outside of
      sysfs.c now call this instead of btrfs_sysfs_remove_devices_dir().
      
      btrfs_sysfs_remove_devices_dir() is renamed to
      btrfs_sysfs_remove_fs_devices() to suit its new role.
      
      Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices(),
      so it is redeclared as static. The function also had to be moved
      before its first caller.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      53f8a74c
    • A
      btrfs: simplify parameters of btrfs_sysfs_add_devices_dir · cd36da2e
      Anand Jain committed
      When we add a device we need to add it to sysfs, so instead of using the
      btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
      add a device or all of fs_devices, call the helper function
      btrfs_sysfs_add_device() directly and thus make it non-static.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cd36da2e
    • A
      btrfs: improve device scanning messages · 79dae17d
      Anand Jain committed
      Systems booting without an initramfs seem to scan an unusual kind
      of device path (/dev/root), and at a later time the device is updated
      to the correct path. We generally print the process name and PID of the
      process scanning the device but we don't capture the same information if
      the device path is rescanned with a different pathname.
      
      The current message is too long, so drop the unnecessary UUID and add
      process name and PID.
      
      While at it, also update the duplicate device warning to include the
      process name and PID so the messages are consistent.
      
      CC: stable@vger.kernel.org # 4.19+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      79dae17d
    • G
      btrfs: enumerate the type of exclusive operation in progress · c3e1f96c
      Goldwyn Rodrigues committed
      Instead of using a flag bit for exclusive operation, use a variable to
      store which exclusive operation is being performed.  Introduce an API
      to start and finish an exclusive operation.
      
      This would enable another way for tools to check which operation is
      running or why starting an exclusive operation failed. The followup
      patch adds a sysfs_notify() to alert userspace when the state changes, so
      userspace can perform select() on it to get notified of the change.
      
      This would enable us to enqueue a command which will wait for current
      exclusive operation to complete before issuing the next exclusive
      operation. This has been done synchronously as opposed to a background
      process, or else error collection (if any) will become difficult.
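
      The start/finish API described above might look roughly like the
      sketch below. It is a single-threaded userspace model with
      hypothetical names (sim_exclop_*, the enum values and the field name
      are all assumptions); the real implementation would need a lock or an
      atomic compare-and-exchange around the state change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical set of exclusive operations; NONE means nothing is running. */
enum sim_exclop {
	SIM_EXCLOP_NONE = 0,
	SIM_EXCLOP_BALANCE,
	SIM_EXCLOP_DEV_ADD,
	SIM_EXCLOP_DEV_REPLACE,
	SIM_EXCLOP_RESIZE,
};

/* Hypothetical filesystem state: one variable instead of one flag bit. */
struct sim_fs_info {
	enum sim_exclop exclusive_operation;
};

/* Claim the exclusive slot; fails if another operation already holds it. */
static bool sim_exclop_start(struct sim_fs_info *fs, enum sim_exclop op)
{
	assert(op != SIM_EXCLOP_NONE);
	if (fs->exclusive_operation != SIM_EXCLOP_NONE)
		return false;
	fs->exclusive_operation = op;
	return true;
}

/* Release the exclusive slot. */
static void sim_exclop_finish(struct sim_fs_info *fs)
{
	fs->exclusive_operation = SIM_EXCLOP_NONE;
}

/* Report which operation is running, which a single flag bit cannot do. */
static enum sim_exclop sim_exclop_running(const struct sim_fs_info *fs)
{
	return fs->exclusive_operation;
}
```

      Storing the operation type rather than a bit is what lets a caller
      that failed to start learn which operation is in the way.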
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3e1f96c
    • J
      btrfs: sysfs: init devices outside of the chunk_mutex · ca10845a
      Josef Bacik committed
      While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #4 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_link+0x63/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x81/0x130
      	 btrfs_init_new_device+0x67f/0x1250
      	 btrfs_ioctl+0x1ef/0x2e20
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_insert_delayed_items+0x90/0x4f0
      	 btrfs_commit_inode_delayed_items+0x93/0x140
      	 btrfs_log_inode+0x5de/0x2020
      	 btrfs_log_inode_parent+0x429/0xc90
      	 btrfs_log_new_name+0x95/0x9b
      	 btrfs_rename2+0xbb9/0x1800
      	 vfs_rename+0x64f/0x9f0
      	 do_renameat2+0x320/0x4e0
      	 __x64_sys_rename+0x1f/0x30
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because we are holding the chunk_mutex at the time of
      adding in a new device.  However we only need to hold the
      device_list_mutex, as we're going to iterate over the fs_devices
      devices.  Move the sysfs init stuff outside of the chunk_mutex to get
      rid of this lockdep splat.
      
      CC: stable@vger.kernel.org # 4.4.x: f3cd2c58: btrfs: sysfs, rename device_link add/remove functions
      CC: stable@vger.kernel.org # 4.4.x
      Reported-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ca10845a
    • N
      btrfs: don't opencode sync_blockdev in btrfs_init_new_device · b9ba017f
      Nikolay Borisov committed
      Instead of open-coding filemap_write_and_wait(), simply call
      sync_blockdev(), as it makes it abundantly clear what's going on and why
      this is used. No semantic changes.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b9ba017f
    • N
      btrfs: remove redundant code from btrfs_free_stale_devices · 4ae312e9
      Nikolay Borisov committed
      Following the refactor of btrfs_free_stale_devices in
      7bcb8164 ("btrfs: use device_list_mutex when removing stale devices")
      fs_devices are freed after they have been iterated by the inner
      list_for_each so the use-after-free fixed by introducing the break in
      fd649f10 ("btrfs: Fix use-after-free when cleaning up fs_devs with
      a single stale device") is no longer necessary. Just remove it
      altogether. No functional changes.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4ae312e9
    • N
      btrfs: refactor locked condition in btrfs_init_new_device · 44cab9ba
      Nikolay Borisov committed
      Invert unlocked to locked and exploit the fact that it can only ever be
      modified if we are adding a new device to a seed filesystem. This allows
      simplifying the check at the error: label. No semantic changes.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      44cab9ba
    • N
      btrfs: use RCU for quick device check in btrfs_init_new_device · f4cfa9bd
      Nikolay Borisov committed
      When adding a new device there's a mandatory check to see if a device is
      being duplicated in the filesystem it's added to. Since this is a
      read-only operation, it is not necessary to take device_list_mutex; we
      can simply make do with an RCU read lock.
      
      Using just RCU is safe because there won't be another device add or
      delete running in parallel, as btrfs_init_new_device is called only from
      btrfs_ioctl_add_dev.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f4cfa9bd