1. 14 Feb, 2023 1 commit
  2. 12 Jan, 2023 1 commit
    • btrfs: zoned: enable metadata over-commit for non-ZNS setup · 85e79ec7
      Naohiro Aota committed
      The commit 79417d04 ("btrfs: zoned: disable metadata overcommit for
      zoned") disabled the metadata over-commit to track active zones properly.
      
      However, it also introduced a heavy overhead by allocating new metadata
      block groups and/or flushing dirty buffers to release the space
      reservations. Specifically, a write-only workload (without any sync
      operations) saw its performance drop from 343.77 MB/sec (v5.19) to
      182.89 MB/sec (v6.0).
      
      Performance is still poor on current misc-next, at 187.95 MB/sec. With
      this patch applied, it recovers to 326.70 MB/sec (+73.82%).
      
      This patch introduces a new fs_info->flags bit, BTRFS_FS_NO_OVERCOMMIT, to
      indicate that metadata over-commit must be disabled. The flag is set when
      a device with a max active zones limit is loaded into a filesystem.
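The flag-gated decision described above can be sketched in plain C. This is a simplified user-space model, not the kernel code: `struct fs_info_model`, `FS_NO_OVERCOMMIT_BIT`, `note_device_loaded()` and `may_overcommit_metadata()` are hypothetical names, and treating a limit of 0 as "no limit" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the BTRFS_FS_NO_OVERCOMMIT bit. */
#define FS_NO_OVERCOMMIT_BIT (1UL << 0)

struct fs_info_model {
	unsigned long flags;
};

/*
 * Called once per device added to the filesystem.
 * Assumption: max_active_zones == 0 means the device has no limit.
 */
static void note_device_loaded(struct fs_info_model *fs,
			       unsigned int max_active_zones)
{
	assert(fs);
	if (max_active_zones != 0)
		fs->flags |= FS_NO_OVERCOMMIT_BIT;
}

/* Metadata over-commit stays allowed unless some device set the flag. */
static bool may_overcommit_metadata(const struct fs_info_model *fs)
{
	assert(fs);
	return !(fs->flags & FS_NO_OVERCOMMIT_BIT);
}
```

In this model a filesystem made only of devices without an active zone limit keeps over-committing, and a single limited device is enough to turn it off.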
      
      Fixes: 79417d04 ("btrfs: zoned: disable metadata overcommit for zoned")
      CC: stable@vger.kernel.org # 6.0+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 06 Dec, 2022 5 commits
  4. 23 Nov, 2022 1 commit
    • btrfs: use kvcalloc in btrfs_get_dev_zone_info · 8fe97d47
      Christoph Hellwig committed
      Otherwise the kernel memory allocator seems to be unhappy about failing
      order 6 allocations for the zones array, which cause 100% reproducible
      mount failures in my qemu setup:
      
        [26.078981] mount: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
        [26.079741] CPU: 0 PID: 2965 Comm: mount Not tainted 6.1.0-rc5+ #185
        [26.080181] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [26.080950] Call Trace:
        [26.081132]  <TASK>
        [26.081291]  dump_stack_lvl+0x56/0x6f
        [26.081554]  warn_alloc+0x117/0x140
        [26.081808]  ? __alloc_pages_direct_compact+0x1b5/0x300
        [26.082174]  __alloc_pages_slowpath.constprop.0+0xd0e/0xde0
        [26.082569]  __alloc_pages+0x32a/0x340
        [26.082836]  __kmalloc_large_node+0x4d/0xa0
        [26.083133]  ? trace_kmalloc+0x29/0xd0
        [26.083399]  kmalloc_large+0x14/0x60
        [26.083654]  btrfs_get_dev_zone_info+0x1b9/0xc00
        [26.083980]  ? _raw_spin_unlock_irqrestore+0x28/0x50
        [26.084328]  btrfs_get_dev_zone_info_all_devices+0x54/0x80
        [26.084708]  open_ctree+0xed4/0x1654
        [26.084974]  btrfs_mount_root.cold+0x12/0xde
        [26.085288]  ? lock_is_held_type+0xe2/0x140
        [26.085603]  legacy_get_tree+0x28/0x50
        [26.085876]  vfs_get_tree+0x1d/0xb0
        [26.086139]  vfs_kern_mount.part.0+0x6c/0xb0
        [26.086456]  btrfs_mount+0x118/0x3a0
        [26.086728]  ? lock_is_held_type+0xe2/0x140
        [26.087043]  legacy_get_tree+0x28/0x50
        [26.087323]  vfs_get_tree+0x1d/0xb0
        [26.087587]  path_mount+0x2ba/0xbe0
        [26.087850]  ? _raw_spin_unlock_irqrestore+0x38/0x50
        [26.088217]  __x64_sys_mount+0xfe/0x140
        [26.088506]  do_syscall_64+0x35/0x80
        [26.088776]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
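The "order:6" in the trace means the allocator was asked for 2^6 physically contiguous pages. The arithmetic can be sketched as below. This is a user-space illustration, not kernel code: `alloc_order()` and `zones_array_order()` are hypothetical helpers, the 4 KiB page size is an assumption, and the descriptor size and zone count are left to the caller because the source does not state them.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_MODEL 4096UL	/* assumption: 4 KiB pages */

/* Smallest order such that (PAGE_SIZE_MODEL << order) >= size. */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE_MODEL << order) < size)
		order++;
	return order;
}

/* Order needed for one contiguous array of nr_zones descriptors. */
static unsigned int zones_array_order(size_t nr_zones, size_t zone_desc_size)
{
	/* Reject a zero element size and a multiplication overflow. */
	assert(zone_desc_size != 0 && nr_zones <= (size_t)-1 / zone_desc_size);
	return alloc_order(nr_zones * zone_desc_size);
}
```

For instance, a hypothetical array of 4096 descriptors of 64 bytes each is 256 KiB, which needs order 6 under these assumptions. A contiguous request of that size can fail on fragmented memory, which is the kind of failure a kvcalloc-style allocation with a vmalloc fallback avoids.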
      
      Fixes: 5b316468 ("btrfs: get zone information of zoned block devices")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 21 Nov, 2022 1 commit
  6. 07 Nov, 2022 1 commit
  7. 26 Sep, 2022 3 commits
  8. 13 Sep, 2022 1 commit
    • btrfs: zoned: wait for extent buffer IOs before finishing a zone · 2dd7e7bc
      Naohiro Aota committed
      Before sending REQ_OP_ZONE_FINISH to a zone, we need to ensure that
      ongoing IOs have already finished. Otherwise, we will see a "Zone Is Full"
      error for the IOs, as the ZONE_FINISH command makes the zone full.
      
      We ensure that with btrfs_wait_block_group_reservations() and
      btrfs_wait_ordered_roots() for a data block group. For a metadata
      block group, the comparison of alloc_offset vs meta_write_pointer mostly
      ensures that IOs for the allocated region have already been sent. However,
      there can still be a short window in which the IOs are sent but not yet
      completed.
      
      Introduce wait_eb_writebacks() to ensure such IOs are completed for a
      metadata block group. It walks the buffer_radix to find extent buffers in
      the block group and calls wait_on_extent_buffer_writeback() on them.
      
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      CC: stable@vger.kernel.org # 5.19+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 05 Sep, 2022 3 commits
  10. 25 Jul, 2022 7 commits
    • btrfs: zoned: wait until zone is finished when allocation didn't progress · 2ce543f4
      Naohiro Aota committed
      When the allocated position doesn't progress, we cannot submit IOs to
      finish a block group, but there should be ongoing IOs that will finish a
      block group. So, in that case, we wait for a zone to be finished and retry
      the allocation after that.
      
      Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
      indicate that a zone finish must complete before the allocation can
      proceed. The flag is set when the allocator detects that it cannot
      activate a new block group, and it is cleared once a zone is finished.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: activate metadata block group on flush_space · b0931513
      Naohiro Aota committed
      For metadata space on a zoned filesystem, reaching ALLOC_CHUNK{,_FORCE}
      means we don't have enough space left in the active_total_bytes. Before
      allocating a new chunk, we can try to activate an existing block group
      in this case.
      
      Also, allocating a chunk is not enough to grant a ticket for metadata
      space on a zoned filesystem; we also need to activate the block group to
      increase the active_total_bytes.
      
      btrfs_zoned_activate_one_bg() implements the activation feature. It will
      activate a block group by (maybe) finishing a block group. It will give up
      activating a block group if it cannot finish any block group.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce space_info->active_total_bytes · 6a921de5
      Naohiro Aota committed
      The active_total_bytes, like the total_bytes, accounts for the total bytes
      of active block groups in the space_info.
      
      With the introduction of active_total_bytes, we can check if the reserved
      bytes can be written to the block groups without activating a new block
      group. The check is necessary for metadata allocation on a zoned
      filesystem. We cannot finish a block group, which may require waiting
      for the current transaction, from the metadata allocation context.
      Instead, we need to ensure the ongoing allocation (reserved bytes) fits
      in active block groups.
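The "fits in active block groups" check can be illustrated with a small model. This is a sketch under stated assumptions, not the kernel's accounting: `struct space_info_model` and `fits_in_active()` are hypothetical, and the real space_info tracks more counters than the two shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of a space_info. */
struct space_info_model {
	uint64_t total_bytes;		/* bytes of all block groups */
	uint64_t active_total_bytes;	/* bytes of active block groups only */
	uint64_t bytes_used;		/* already written */
	uint64_t bytes_reserved;	/* reserved but not yet written */
};

/*
 * Return true if a new reservation of 'bytes' can be written without
 * activating another block group.
 */
static bool fits_in_active(const struct space_info_model *si, uint64_t bytes)
{
	uint64_t committed = si->bytes_used + si->bytes_reserved;

	assert(si->active_total_bytes <= si->total_bytes);
	if (committed > si->active_total_bytes)
		return false;
	return bytes <= si->active_total_bytes - committed;
}
```

The point of the model is that the comparison runs against active_total_bytes rather than total_bytes, so space in inactive block groups does not count toward what a metadata reservation may use.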
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: finish least available block group on data bg allocation · 393f646e
      Naohiro Aota committed
      When we run out of active zones and there is not sufficient space left in
      any block group, we need to finish one block group to make room to
      activate a new block group.
      
      However, we cannot do this for metadata block groups because we can cause a
      deadlock by waiting for a running transaction commit. So, do that only for
      a data block group.
      
      Furthermore, the block group to be finished has two requirements. First,
      the block group must not have reserved bytes left. Having reserved bytes
      means we have an allocated region but did not yet send bios for it. If that
      region is allocated by the thread calling btrfs_zone_finish(), it results
      in a deadlock.
      
      Second, the block group to be finished must not be a SYSTEM block
      group. Finishing a SYSTEM block group easily breaks further chunk
      allocation by nullifying the SYSTEM free space.
      
      In certain cases, we cannot find any zone finish candidate, or
      btrfs_zone_finish() may fail. In that case, we fall back to splitting the
      allocation bytes and filling the last space left in the block groups.
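The candidate selection rules above (data block groups only, no reserved bytes, never SYSTEM, least available space first) can be sketched as a simple scan. The types and the function `pick_bg_to_finish()` are hypothetical, and measuring "available" as zone capacity minus allocation offset is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum bg_type { BG_DATA, BG_METADATA, BG_SYSTEM };

/* Hypothetical, reduced view of an active block group. */
struct bg_model {
	enum bg_type type;
	uint64_t zone_capacity;
	uint64_t alloc_offset;
	uint64_t reserved;
};

/* Return the index of the block group to finish, or -1 if none qualifies. */
static int pick_bg_to_finish(const struct bg_model *bgs, size_t n)
{
	int best = -1;
	uint64_t best_avail = 0;

	for (size_t i = 0; i < n; i++) {
		const struct bg_model *bg = &bgs[i];
		uint64_t avail;

		assert(bg->alloc_offset <= bg->zone_capacity);
		/* Skip METADATA and SYSTEM, and anything with reserved bytes. */
		if (bg->type != BG_DATA || bg->reserved != 0)
			continue;
		avail = bg->zone_capacity - bg->alloc_offset;
		if (best < 0 || avail < best_avail) {
			best_avail = avail;
			best = (int)i;
		}
	}
	return best;
}
```

A return value of -1 corresponds to the "no candidate" case, where the caller would fall back to splitting the allocation across the space that remains.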
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size · f7b12a62
      Naohiro Aota committed
      On a zoned filesystem, data write-out is limited by max_zone_append_size,
      and a large ordered extent is split according to the size of a bio. OTOH,
      the number of extents to be written is calculated using
      BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
      metadata bytes to update and/or create the metadata items.
      
      The metadata reservation is done at, e.g., btrfs_buffered_write() and then
      released according to the estimation changes. Thus, if the number of
      extents increases massively, the reserved metadata can run out.
      
      The increase in the number of extents easily occurs on a zoned filesystem
      if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. It causes the
      following warning in a small-RAM environment with metadata over-commit
      disabled (in the following patch).
      
      [75721.498492] ------------[ cut here ]------------
      [75721.505624] BTRFS: block rsv 1 returned -28
      [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
      [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
      [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
      [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
      [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
      [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
      [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
      [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
      [75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
      [75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
      [75721.730499] Call Trace:
      [75721.735166]  <TASK>
      [75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
      [75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
      [75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
      [75721.776431]  ? memcpy+0x4e/0x60
      [75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
      [75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
      [75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
      [75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
      [75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
      [75721.818300]  ? lock_downgrade+0x7c0/0x7c0
      [75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
      [75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
      [75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
      [75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
      [75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
      [75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
      [75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.876085]  ? lock_release+0x552/0xf80
      [75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
      [75721.888886]  ? __kasan_check_write+0x14/0x20
      [75721.895152]  ? do_raw_read_unlock+0x44/0x80
      [75721.901323]  ? _raw_write_lock_irq+0x60/0x80
      [75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
      [75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
      [75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
      [75721.929166]  ? _raw_write_unlock+0x23/0x40
      [75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
      [75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
      [75721.949906]  ? try_to_wake_up+0x30/0x14a0
      [75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
      [75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.969111]  ? lock_acquire+0x41b/0x4c0
      [75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
      [75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
      [75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
      [75721.994643]  process_one_work+0x815/0x1460
      [75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
      [75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
      [75722.013086]  worker_thread+0x59a/0xeb0
      [75722.018511]  kthread+0x2ac/0x360
      [75722.023428]  ? process_one_work+0x1460/0x1460
      [75722.029431]  ? kthread_complete_and_exit+0x30/0x30
      [75722.036044]  ret_from_fork+0x22/0x30
      [75722.041255]  </TASK>
      [75722.045047] irq event stamp: 0
      [75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
      [75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
      [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [75722.085335] ---[ end trace 0000000000000000 ]---
      
      To fix the estimation, we need to introduce fs_info->max_extent_size to
      replace BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for
      regular vs zoned filesystems.
      
      Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On a
      zoned filesystem, it is set to fs_info->max_zone_append_size.
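The effect on the extent estimate is plain ceiling division, sketched below. `estimate_extents()` is a hypothetical helper, the 128 MiB default is an assumption of this sketch standing in for BTRFS_MAX_EXTENT_SIZE, and the 1 MiB append limit in the usage note is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default, standing in for BTRFS_MAX_EXTENT_SIZE. */
#define MAX_EXTENT_SIZE_DEFAULT (128ULL << 20)

/* Number of extents needed to cover 'size' bytes: ceil(size / max). */
static uint64_t estimate_extents(uint64_t size, uint64_t max_extent_size)
{
	assert(max_extent_size != 0);
	return size / max_extent_size + (size % max_extent_size != 0);
}
```

With these numbers, a 256 MiB write is estimated at 2 extents against the default but at 256 extents against a 1 MiB append limit. An estimate made with the larger constant therefore under-reserves metadata by a large factor once the write is actually split.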
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: revive max_zone_append_bytes · c2ae7b77
      Naohiro Aota committed
      This patch is basically a revert of commit 5a80d1c6 ("btrfs: zoned:
      remove max_zone_append_size logic"), but without the unnecessary ASSERT
      and check. The max_zone_append_size will be used as a hint to estimate the
      number of extents needed to cover a delalloc/writeback region in later
      commits.
      
      The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
      this commit considers it to calculate max_zone_append_size. Technically, a
      bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are
      contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE"
      as an upper limit of an extent size to calculate the number of extents
      needed to write data.
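The upper limit described above can be sketched as the smaller of two bounds. This is a user-space model with hypothetical names (`zone_append_hint()`, `PAGE_SIZE_MODEL`); the 4 KiB page size and 512-byte sector size are assumptions, and the two parameters stand in for the device's zone append limit and for the value queue_max_segments() would return.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_MODEL   4096ULL	/* assumption: 4 KiB pages */
#define SECTOR_SHIFT_MODEL 9		/* assumption: 512-byte sectors */

/*
 * Upper limit of an extent size used for estimating extent counts:
 * the smaller of the device's zone append limit and
 * max_segments * PAGE_SIZE.
 */
static uint64_t zone_append_hint(uint64_t max_zone_append_sectors,
				 unsigned int max_segments)
{
	uint64_t by_device = max_zone_append_sectors << SECTOR_SHIFT_MODEL;
	uint64_t by_segments = (uint64_t)max_segments * PAGE_SIZE_MODEL;

	assert(max_zone_append_sectors <= (UINT64_MAX >> SECTOR_SHIFT_MODEL));
	return by_device < by_segments ? by_device : by_segments;
}
```

Because a bio with contiguous pages can exceed max_segments * PAGE_SIZE, this value is a conservative hint for counting extents rather than a hard cap on bio size.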
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix comment description for sb_write_pointer logic · 31f37269
      Pankaj Raghav committed
      Fix the comment to represent the actual logic used for sb_write_pointer:
      
      - Empty[0] && In use[1] should be an invalid state instead of returning
        zone 0 wp
      - Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp
      - In use[0] && Empty[1] should be returning zone 0 wp instead of being an
        invalid state
      - In use[0] && Full[1] should be returning zone 0 wp instead of returning
        zone 1 wp
      - Full[0] && Empty[1] should be returning zone 1 wp instead of returning
        zone 0 wp
      - Full[0] && In use[1] should be returning zone 1 wp instead of returning
        zone 0 wp
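The six corrected cases can be written out as a lookup function. This is only an illustration of the list above: the enum and function names are hypothetical, and the three state pairs the list does not mention are reported as `SB_WP_NOT_LISTED` rather than guessed.

```c
#include <assert.h>

enum zone_state { ZONE_EMPTY, ZONE_IN_USE, ZONE_FULL };

enum sb_wp_result {
	SB_WP_ZONE0,		/* use write pointer of zone 0 */
	SB_WP_ZONE1,		/* use write pointer of zone 1 */
	SB_WP_INVALID,		/* invalid state */
	SB_WP_NOT_LISTED,	/* pair not covered by the list above */
};

/* Map the states of the two superblock log zones to a decision. */
static enum sb_wp_result pick_sb_wp(enum zone_state z0, enum zone_state z1)
{
	assert(z0 <= ZONE_FULL && z1 <= ZONE_FULL);

	if (z0 == ZONE_EMPTY && z1 == ZONE_IN_USE)
		return SB_WP_INVALID;
	if (z0 == ZONE_EMPTY && z1 == ZONE_FULL)
		return SB_WP_ZONE0;
	if (z0 == ZONE_IN_USE && z1 == ZONE_EMPTY)
		return SB_WP_ZONE0;
	if (z0 == ZONE_IN_USE && z1 == ZONE_FULL)
		return SB_WP_ZONE0;
	if (z0 == ZONE_FULL && z1 == ZONE_EMPTY)
		return SB_WP_ZONE1;
	if (z0 == ZONE_FULL && z1 == ZONE_IN_USE)
		return SB_WP_ZONE1;
	return SB_WP_NOT_LISTED;
}
```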
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 09 Jul, 2022 2 commits
    • btrfs: zoned: drop optimization of zone finish · b3a3b025
      Naohiro Aota committed
      We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only
      when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we
      wrote fully into the zone.
      
      The assumption is determined by "alloc_offset == capacity". This condition
      won't work if the last ordered extent is canceled due to some errors. In
      that case, we consider the zone deactivated without sending the finish
      command while it's still active.
      
      This inconsistency results in activating another block group while we
      cannot really activate the underlying zone, which causes "active zones
      exceeded" errors like the ones below.
      
          BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0
          nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
          active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
          nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
          active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
      
      Fix the issue by removing the optimization for now.
      
      Fixes: 8376d9e1 ("btrfs: zoned: finish superblock zone once no space left for new SB")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: fix a leaked bioc in read_zone_info · 29634578
      Christoph Hellwig committed
      The bioc would leak on the normal completion path and also on the RAID56
      check (but that one won't happen in practice due to the invalid
      combination with zoned mode).
      
      Fixes: 7db1c5d1 ("btrfs: zoned: support dev-replace in zoned filesystems")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 21 Jun, 2022 1 commit
    • btrfs: zoned: prevent allocation from previous data relocation BG · 343d8a30
      Naohiro Aota committed
      After commit 5f0addf7 ("btrfs: zoned: use dedicated lock for data
      relocation"), we observe IO errors on, e.g., btrfs/232 like below.
      
        [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs]
        <snip>
        [09.9][T4038707] Call Trace:
        [09.5][T4038707]  <TASK>
        [09.3][T4038707]  run_delalloc_nocow+0x7f1/0x11a0 [btrfs]
        [09.6][T4038707]  ? test_range_bit+0x174/0x320 [btrfs]
        [09.2][T4038707]  ? fallback_to_cow+0x980/0x980 [btrfs]
        [09.3][T4038707]  ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs]
        [09.5][T4038707]  btrfs_run_delalloc_range+0x445/0x1320 [btrfs]
        [09.2][T4038707]  ? test_range_bit+0x320/0x320 [btrfs]
        [09.4][T4038707]  ? lock_downgrade+0x6a0/0x6a0
        [09.2][T4038707]  ? orc_find.part.0+0x1ed/0x300
        [09.5][T4038707]  ? __module_address.part.0+0x25/0x300
        [09.0][T4038707]  writepage_delalloc+0x159/0x310 [btrfs]
        <snip>
        [09.4][    C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
        [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current]
        [09.9][    C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command
        [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00
        [09.4][    C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
        [09.9][    C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
      
      The IO errors occur when we allocate a regular extent in the previous data
      relocation block group.
      
      On zoned btrfs, we use a dedicated block group to relocate a data
      extent. Thus, we allocate relocating data extents (pre-alloc) only from
      the dedicated block group and vice versa. Once the free space in the
      dedicated block group gets tight, a relocating extent may not fit into
      the block group. In that case, we need to switch the dedicated block
      group to the next one. Then, the previous one is freed up for
      allocating a regular extent. The BG no longer has enough room for
      the relocating extent, but there is still room to allocate a smaller
      extent. This is where the problem happens. By allocating a regular extent
      while nocow IOs for the relocation are still ongoing, we will issue WRITE
      IOs (for relocation) and ZONE APPEND IOs (for the regular writes) at the
      same time. These mixed IOs confuse the write pointer and cause the
      unaligned write errors.
      
      This commit introduces a new bit 'zoned_data_reloc_ongoing' to the
      btrfs_block_group. We set this bit before releasing the dedicated block
      group, and no extents are allocated from a block group having this bit
      set. This bit is similar to setting block_group->ro, but differs from
      it by allowing nocow writes to start.
      
      Once all the nocow IO for relocation is done (hooked from
      btrfs_finish_ordered_io), we reset the bit to release the block group for
      further allocation.
      
      Fixes: c2707a25 ("btrfs: zoned: add a dedicated data relocation block group")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 18 May, 2022 2 commits
  14. 16 May, 2022 5 commits
  15. 06 May, 2022 2 commits
  16. 18 Apr, 2022 1 commit
  17. 24 Mar, 2022 2 commits
    • btrfs: zoned: remove left over ASSERT checking for single profile · 62ed0bf7
      Johannes Thumshirn committed
      With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block
      groups") we started allowing DUP on metadata block groups, so the
      ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are
      no longer valid and in fact even harmful.
      
      Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups")
      CC: stable@vger.kernel.org # 5.17
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone · 0b9e6676
      Johannes Thumshirn committed
      btrfs_can_activate_zone() can be called with the device_list_mutex already
      held, which will lead to a deadlock:
      
      insert_dev_extents() // Takes device_list_mutex
      `-> insert_dev_extent()
       `-> btrfs_insert_empty_item()
        `-> btrfs_insert_empty_items()
         `-> btrfs_search_slot()
          `-> btrfs_cow_block()
           `-> __btrfs_cow_block()
            `-> btrfs_alloc_tree_block()
             `-> btrfs_reserve_extent()
              `-> find_free_extent()
               `-> find_free_extent_update_loop()
                `-> can_allocate_chunk()
                 `-> btrfs_can_activate_zone() // Takes device_list_mutex again
      
      Instead of using the RCU on fs_devices->device_list we
      can use fs_devices->alloc_list, protected by the chunk_mutex to traverse
      the list of active devices.
      
      We are in the chunk allocation thread. The newer chunk allocation
      happens from the devices in the fs_devices->alloc_list protected by the
      chunk_mutex.
      
        btrfs_create_chunk()
          lockdep_assert_held(&info->chunk_mutex);
          gather_device_info
            list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
      
      Also, a device that reappears after the mount won't join the alloc_list
      yet; it will be in the dev_list, which we don't want to consider in
      the context of the chunk alloc.
      
        [15.166572] WARNING: possible recursive locking detected
        [15.167117] 5.17.0-rc6-dennis #79 Not tainted
        [15.167487] --------------------------------------------
        [15.167733] kworker/u8:3/146 is trying to acquire lock:
        [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
        [15.167733]
        [15.167733] but task is already holding lock:
        [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
        [15.167733]
        [15.167733] other info that might help us debug this:
        [15.167733]  Possible unsafe locking scenario:
        [15.167733]
        [15.171834]        CPU0
        [15.171834]        ----
        [15.171834]   lock(&fs_devs->device_list_mutex);
        [15.171834]   lock(&fs_devs->device_list_mutex);
        [15.171834]
        [15.171834]  *** DEADLOCK ***
        [15.171834]
        [15.171834]  May be due to missing lock nesting notation
        [15.171834]
        [15.171834] 5 locks held by kworker/u8:3/146:
        [15.171834]  #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
        [15.171834]  #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
        [15.176244]  #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
        [15.176244]  #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
        [15.176244]  #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
        [15.179641]
        [15.179641] stack backtrace:
        [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
        [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
        [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
        [15.179641] Call Trace:
        [15.179641]  <TASK>
        [15.179641]  dump_stack_lvl+0x45/0x59
        [15.179641]  __lock_acquire.cold+0x217/0x2b2
        [15.179641]  lock_acquire+0xbf/0x2b0
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  __mutex_lock+0x8e/0x970
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? lock_is_held_type+0xd7/0x130
        [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  find_free_extent+0x15a/0x14f0 [btrfs]
        [15.183838]  ? _raw_spin_unlock+0x24/0x40
        [15.183838]  ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
        [15.187601]  btrfs_reserve_extent+0x131/0x260 [btrfs]
        [15.187601]  btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
        [15.187601]  __btrfs_cow_block+0x138/0x600 [btrfs]
        [15.187601]  btrfs_cow_block+0x10f/0x230 [btrfs]
        [15.187601]  btrfs_search_slot+0x55f/0xbc0 [btrfs]
        [15.187601]  ? lock_is_held_type+0xd7/0x130
        [15.187601]  btrfs_insert_empty_items+0x2d/0x60 [btrfs]
        [15.187601]  btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
        [15.187601]  __btrfs_end_transaction+0x36/0x2a0 [btrfs]
        [15.192037]  flush_space+0x374/0x600 [btrfs]
        [15.192037]  ? find_held_lock+0x2b/0x80
        [15.192037]  ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
        [15.192037]  ? lock_release+0x131/0x2b0
        [15.192037]  btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
        [15.192037]  process_one_work+0x24c/0x5a0
        [15.192037]  worker_thread+0x4a/0x3d0
      
      Fixes: a85f05e5 ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  18. 14 Mar, 2022 1 commit