1. 06 May 2022 (1 commit)
  2. 24 March 2022 (2 commits)
    • btrfs: zoned: remove left over ASSERT checking for single profile · 62ed0bf7
      Johannes Thumshirn committed
      With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block
      groups") we started allowing DUP on metadata block groups, so the
      ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are
      no longer valid and in fact even harmful.
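
      As a rough sketch of the removal (illustrative only; the exact expression
      of the leftover assertion is an assumption here, not the verbatim upstream
      diff):

        /*
         * Leftover single-profile assumption: SINGLE sets no profile bits,
         * but a DUP metadata block group legitimately does, so this
         * assertion now trips on valid setups and is simply dropped.
         */
        ASSERT((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0);   /* removed */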
      
      Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone · 0b9e6676
      Johannes Thumshirn committed
      btrfs_can_activate_zone() can be called with the device_list_mutex already
      held, which will lead to a deadlock:
      
      insert_dev_extents() // Takes device_list_mutex
      `-> insert_dev_extent()
       `-> btrfs_insert_empty_item()
        `-> btrfs_insert_empty_items()
         `-> btrfs_search_slot()
          `-> btrfs_cow_block()
           `-> __btrfs_cow_block()
            `-> btrfs_alloc_tree_block()
             `-> btrfs_reserve_extent()
              `-> find_free_extent()
               `-> find_free_extent_update_loop()
                `-> can_allocate_chunk()
                 `-> btrfs_can_activate_zone() // Takes device_list_mutex again
      
      Instead of using RCU on fs_devices->device_list, we can traverse the list
      of active devices via fs_devices->alloc_list, which is protected by the
      chunk_mutex.
      
      We are in the chunk allocation path here, and new chunk allocation picks
      its devices from fs_devices->alloc_list under the protection of the
      chunk_mutex:
      
        btrfs_create_chunk()
          lockdep_assert_held(&info->chunk_mutex);
          gather_device_info
            list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
      
      Also, a device that reappears after the mount has not joined the
      alloc_list yet; it only sits on the dev_list, which we don't want to
      consider in the context of chunk allocation.
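
      A minimal sketch of the resulting traversal in btrfs_can_activate_zone()
      (illustrative; the helper and field names follow the kernel sources, but
      this is not the verbatim upstream hunk):

        /* Rely on the chunk_mutex the allocation path already holds instead
         * of re-taking device_list_mutex. */
        lockdep_assert_held(&fs_info->chunk_mutex);

        list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
                if (!device->bdev)
                        continue;
                /* ... check this device's active-zone budget ... */
        }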
      
        [15.166572] WARNING: possible recursive locking detected
        [15.167117] 5.17.0-rc6-dennis #79 Not tainted
        [15.167487] --------------------------------------------
        [15.167733] kworker/u8:3/146 is trying to acquire lock:
        [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
        [15.167733]
        [15.167733] but task is already holding lock:
        [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
        [15.167733]
        [15.167733] other info that might help us debug this:
        [15.167733]  Possible unsafe locking scenario:
        [15.167733]
        [15.171834]        CPU0
        [15.171834]        ----
        [15.171834]   lock(&fs_devs->device_list_mutex);
        [15.171834]   lock(&fs_devs->device_list_mutex);
        [15.171834]
        [15.171834]  *** DEADLOCK ***
        [15.171834]
        [15.171834]  May be due to missing lock nesting notation
        [15.171834]
        [15.171834] 5 locks held by kworker/u8:3/146:
        [15.171834]  #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
        [15.171834]  #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
        [15.176244]  #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
        [15.176244]  #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
        [15.176244]  #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
        [15.179641]
        [15.179641] stack backtrace:
        [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
        [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
        [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
        [15.179641] Call Trace:
        [15.179641]  <TASK>
        [15.179641]  dump_stack_lvl+0x45/0x59
        [15.179641]  __lock_acquire.cold+0x217/0x2b2
        [15.179641]  lock_acquire+0xbf/0x2b0
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  __mutex_lock+0x8e/0x970
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? lock_is_held_type+0xd7/0x130
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? _raw_spin_unlock+0x24/0x40
        [15.183838]  ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
        [15.187601]  btrfs_reserve_extent+0x131/0x260 [btrfs]
        [15.187601]  btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
        [15.187601]  __btrfs_cow_block+0x138/0x600 [btrfs]
        [15.187601]  btrfs_cow_block+0x10f/0x230 [btrfs]
        [15.187601]  btrfs_search_slot+0x55f/0xbc0 [btrfs]
        [15.187601]  ? lock_is_held_type+0xd7/0x130
        [15.187601]  btrfs_insert_empty_items+0x2d/0x60 [btrfs]
        [15.187601]  btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
        [15.187601]  __btrfs_end_transaction+0x36/0x2a0 [btrfs]
        [15.192037]  flush_space+0x374/0x600 [btrfs]
        [15.192037]  ? find_held_lock+0x2b/0x80
        [15.192037]  ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
        [15.192037]  ? lock_release+0x131/0x2b0
        [15.192037]  btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
        [15.192037]  process_one_work+0x24c/0x5a0
        [15.192037]  worker_thread+0x4a/0x3d0
      
      Fixes: a85f05e5 ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 14 March 2022 (5 commits)
  4. 07 January 2022 (2 commits)
    • btrfs: zoned: fix chunk allocation condition for zoned allocator · 82187d2e
      Naohiro Aota committed
      The ZNS specification defines a limit on the number of "active" zones.
      That limit forces us to bound the number of block groups that can be used
      for an allocation at the same time. To stay under the limit, commit
      a85f05e5 ("btrfs: zoned: avoid chunk allocation if active block group has
      enough space") made us reuse the existing active block groups as much as
      possible when we can't activate any other zone without sacrificing an
      already activated block group.
      
      However, the check is wrong in two ways. First, it checks the condition
      for every raid index (ffe_ctl->index). Even if it reaches the condition
      and "ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size" is met, there
      can still be other block groups with enough space to hold
      ffe_ctl->num_bytes. (Actually, this cannot happen in the current zoned
      code, which only supports the SINGLE profile, but it can once other RAID
      types are enabled.)
      
      Second, it checks the active zone availability depending on the
      raid index. The raid index is just an index for
      space_info->block_groups, so it has nothing to do with chunk allocation.
      
      These mistakes cause a faulty allocation in a certain situation. Consider
      running zoned btrfs on a device whose max_active_zones == 0 (no limit),
      and suppose no block group has room to fit ffe_ctl->num_bytes but some
      have room to meet ffe_ctl->min_alloc_size (i.e. max_extent_size >
      num_bytes >= min_alloc_size).

      In this situation, the following occurs:
      
      - With SINGLE raid_index, it reaches the chunk allocation checking
        code
      - The check returns true because we can activate a new zone (no limit)
      - But, before allocating the chunk, it iterates to the next raid index
        (RAID5)
      - Since there are no RAID5 block groups on zoned mode, it again
        reaches the check code
      - The check returns false because of btrfs_can_activate_zone()'s "if
        (raid_index != BTRFS_RAID_SINGLE)" part
      - That results in returning -ENOSPC without allocating a new chunk
      
      As a result, we end up hitting -ENOSPC too early.
      
      Move the check to the right place in the can_allocate_chunk() hook,
      and do the active zone check depending on the allocation flag, not on
      the raid index.
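
      A rough sketch of the reworked hook (illustrative; the structure and
      field names follow the kernel sources, but this is not the verbatim
      upstream change):

        static bool can_allocate_chunk(struct btrfs_fs_info *fs_info,
                                       struct find_free_extent_ctl *ffe_ctl)
        {
                switch (ffe_ctl->policy) {
                case BTRFS_EXTENT_ALLOC_CLUSTERED:
                        return true;
                case BTRFS_EXTENT_ALLOC_ZONED:
                        /*
                         * Avoid a new chunk only when an existing block group
                         * still fits min_alloc_size and we cannot activate
                         * another zone for this allocation's profile flags.
                         */
                        if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size &&
                            !btrfs_can_activate_zone(fs_info->fs_devices,
                                                     ffe_ctl->flags))
                                return false;
                        return true;
                default:
                        BUG();
                }
        }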
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: simplify btrfs_check_meta_write_pointer · 8fdf54fe
      Johannes Thumshirn committed
      btrfs_check_meta_write_pointer() will always be called with a NULL
      'cache_ret' argument.
      
      As there's no need to check whether a valid block_group was passed in,
      remove these checks.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 03 January 2022 (2 commits)
  6. 08 December 2021 (1 commit)
  7. 27 October 2021 (16 commits)
  8. 26 October 2021 (1 commit)
  9. 23 August 2021 (2 commits)
  10. 22 July 2021 (1 commit)
  11. 22 June 2021 (1 commit)
  12. 21 June 2021 (3 commits)
  13. 04 June 2021 (1 commit)
  14. 20 May 2021 (1 commit)
  15. 04 May 2021 (1 commit)
    • btrfs: zoned: sanity check zone type · 784daf2b
      Naohiro Aota committed
      The fstests test case generic/475 creates a dm-linear device that gets
      changed to a dm-error device. This leads to errors in loading the block
      group's zone information when running on a zoned file system, ultimately
      resulting in list corruption. On a kernel with list debugging enabled
      this leads to the following crash.
      
       BTRFS: error (device dm-2) in cleanup_transaction:1953: errno=-5 IO failure
       kernel BUG at lib/list_debug.c:54!
       invalid opcode: 0000 [#1] SMP PTI
       CPU: 1 PID: 2433 Comm: umount Tainted: G        W         5.12.0+ #1018
       RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
       RSP: 0018:ffffc90001473df0 EFLAGS: 00010296
       RAX: 0000000000000054 RBX: ffff8881038fd000 RCX: ffffc90001473c90
       RDX: 0000000100001a31 RSI: 0000000000000003 RDI: 0000000000000003
       RBP: ffff888308871108 R08: 0000000000000003 R09: 0000000000000001
       R10: 3961373532383838 R11: 6666666620736177 R12: ffff888308871000
       R13: ffff8881038fd088 R14: ffff8881038fdc78 R15: dead000000000100
       FS:  00007f353c9b1540(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f353cc2c710 CR3: 000000018e13c000 CR4: 00000000000006a0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_free_block_groups+0xc9/0x310 [btrfs]
        close_ctree+0x2ee/0x31a [btrfs]
        ? call_rcu+0x8f/0x270
        ? mutex_lock+0x1c/0x40
        generic_shutdown_super+0x67/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x90
        cleanup_mnt+0x13e/0x1b0
        task_work_run+0x63/0xb0
        exit_to_user_mode_loop+0xd9/0xe0
        exit_to_user_mode_prepare+0x3e/0x60
        syscall_exit_to_user_mode+0x1d/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      As dm-error has no support for zones, btrfs will run its zone emulation
      mode on this device. The zone emulation mode emulates conventional zones,
      so bail out if the zone bitmap populated at mount time reports a zone as
      sequential while we consider it a conventional zone when creating a block
      group.
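
      A rough sketch of the added sanity check while loading a block group's
      zone information (illustrative; 'expected_conventional' is a hypothetical
      local standing in for the recorded zone type, and the surrounding
      variables are assumed from the loading loop's context, not the verbatim
      upstream code):

        /*
         * We recorded this zone as conventional when the block group was
         * created, but the per-device zone bitmap populated at mount time
         * says it is sequential. That is only possible with device-mapper
         * tricks such as swapping dm-linear for dm-error, so refuse the
         * block group instead of corrupting state later.
         */
        if (expected_conventional && btrfs_dev_is_sequential(device, physical)) {
                btrfs_err(fs_info,
                          "zoned: zone type mismatch for block group %llu",
                          cache->start);
                return -EIO;
        }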
      
      Note: this scenario is unlikely in a real-world application and can only
      happen through this (ab)use of device-mapper targets.
      
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>