1. 07 January 2022, 2 commits
    • btrfs: zoned: fix chunk allocation condition for zoned allocator · 82187d2e
      Committed by Naohiro Aota
      The ZNS specification defines a limit on the number of "active"
      zones. That limit forces us to limit the number of block groups
      which can be used for an allocation at the same time. To avoid
      exceeding the limit, commit a85f05e5 ("btrfs: zoned: avoid chunk
      allocation if active block group has enough space") reuses the
      existing active block groups as much as possible when we cannot
      activate any other zone without sacrificing an already activated
      block group.
      
      However, the check is wrong in two ways. First, it checks the
      condition for every raid index (ffe_ctl->index). Even when the
      check is reached and "ffe_ctl->max_extent_size >=
      ffe_ctl->min_alloc_size" holds, there can still be other block
      groups with enough space to hold ffe_ctl->num_bytes. (This cannot
      actually happen in the current zoned code, as it only supports the
      SINGLE profile, but it can happen once other RAID types are
      enabled.)
      
      Second, it checks the active zone availability depending on the
      raid index. The raid index is just an index for
      space_info->block_groups, so it has nothing to do with chunk allocation.
      
      These mistakes cause a faulty allocation in a certain situation.
      Consider we are running zoned btrfs on a device whose
      max_active_zone == 0 (no limit), and suppose no block group has
      room to fit ffe_ctl->num_bytes but some block groups have room to
      satisfy ffe_ctl->min_alloc_size (i.e. max_extent_size > num_bytes
      >= min_alloc_size).
      
      In this situation, the following sequence occurs:
      
      - With SINGLE raid_index, it reaches the chunk allocation checking
        code
      - The check returns true because we can activate a new zone (no limit)
      - But, before allocating the chunk, it iterates to the next raid index
        (RAID5)
      - Since there are no RAID5 block groups on zoned mode, it again
        reaches the check code
      - The check returns false because of btrfs_can_activate_zone()'s "if
        (raid_index != BTRFS_RAID_SINGLE)" part
      - That results in returning -ENOSPC without allocating a new chunk
      
      As a result, we end up hitting -ENOSPC too early.
      
      Move the check to the right place in the can_allocate_chunk() hook,
      and do the active zone check depending on the allocation flag, not on
      the raid index.
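
      For illustration, a simplified userspace-style sketch of the
      corrected decision is below; the struct fields and the
      can_activate_new_zone flag are stand-ins for the real ffe_ctl
      members and the btrfs_can_activate_zone() plumbing, not the actual
      kernel code.

       #include <stdbool.h>

       /*
        * Sketch only: decide whether a new chunk should be allocated.
        * The decision depends on the allocation profile (can a new zone
        * be activated for it?) and on the free-space search result,
        * never on the raid index being iterated.
        */
       struct ffe_ctl_sketch {
               unsigned long long max_extent_size;  /* largest free extent found */
               unsigned long long min_alloc_size;   /* smallest acceptable size  */
       };

       static bool can_allocate_chunk_zoned(const struct ffe_ctl_sketch *ffe_ctl,
                                            bool can_activate_new_zone)
       {
               /* A new zone can still be activated: allocating a chunk is fine. */
               if (can_activate_new_zone)
                       return true;

               /*
                * No new zone can be activated, but an already active block
                * group still has room for min_alloc_size: do not allocate a
                * new chunk; the caller retries with a smaller allocation.
                */
               if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size)
                       return false;

               return true;
       }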
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: simplify btrfs_check_meta_write_pointer · 8fdf54fe
      Committed by Johannes Thumshirn
      btrfs_check_meta_write_pointer() will always be called with a NULL
      'cache_ret' argument.
      
      As there's no need to check if we have a valid block_group passed
      in, remove these checks.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 03 January 2022, 2 commits
  3. 08 December 2021, 1 commit
  4. 27 October 2021, 16 commits
  5. 26 October 2021, 1 commit
  6. 23 August 2021, 2 commits
  7. 22 July 2021, 1 commit
  8. 22 June 2021, 1 commit
  9. 21 June 2021, 3 commits
  10. 04 June 2021, 1 commit
  11. 20 May 2021, 1 commit
  12. 04 May 2021, 1 commit
    • btrfs: zoned: sanity check zone type · 784daf2b
      Committed by Naohiro Aota
      The fstests test case generic/475 creates a dm-linear device that
      gets changed to a dm-error device. This leads to errors in loading
      the block group's zone information when running on a zoned file
      system, ultimately resulting in list corruption. When running on a
      kernel with list debugging enabled, this leads to the following
      crash.
      
       BTRFS: error (device dm-2) in cleanup_transaction:1953: errno=-5 IO failure
       kernel BUG at lib/list_debug.c:54!
       invalid opcode: 0000 [#1] SMP PTI
       CPU: 1 PID: 2433 Comm: umount Tainted: G        W         5.12.0+ #1018
       RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47
       RSP: 0018:ffffc90001473df0 EFLAGS: 00010296
       RAX: 0000000000000054 RBX: ffff8881038fd000 RCX: ffffc90001473c90
       RDX: 0000000100001a31 RSI: 0000000000000003 RDI: 0000000000000003
       RBP: ffff888308871108 R08: 0000000000000003 R09: 0000000000000001
       R10: 3961373532383838 R11: 6666666620736177 R12: ffff888308871000
       R13: ffff8881038fd088 R14: ffff8881038fdc78 R15: dead000000000100
       FS:  00007f353c9b1540(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f353cc2c710 CR3: 000000018e13c000 CR4: 00000000000006a0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_free_block_groups+0xc9/0x310 [btrfs]
        close_ctree+0x2ee/0x31a [btrfs]
        ? call_rcu+0x8f/0x270
        ? mutex_lock+0x1c/0x40
        generic_shutdown_super+0x67/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x90
        cleanup_mnt+0x13e/0x1b0
        task_work_run+0x63/0xb0
        exit_to_user_mode_loop+0xd9/0xe0
        exit_to_user_mode_prepare+0x3e/0x60
        syscall_exit_to_user_mode+0x1d/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      As dm-error has no support for zones, btrfs will run its zone
      emulation mode on this device. The zone emulation mode emulates
      conventional zones, so bail out if the zone bitmap that gets
      populated on mount reports a zone as sequential while we expect it
      to be a conventional zone when creating a block group.
      
      Note: this scenario is unlikely in a real world application and can
      only be triggered by this (ab)use of device-mapper targets.
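
      For illustration, a minimal sketch of the added sanity check,
      assuming simplified boolean inputs instead of the real zone info
      structures:

       #include <errno.h>
       #include <stdbool.h>

       /*
        * Illustrative only: when loading a block group's zone info,
        * refuse to continue if the device (or the emulated zone bitmap
        * built on mount) reports the zone as sequential while the block
        * group was created assuming a conventional zone.
        */
       static int sanity_check_zone_type(bool bg_expects_conventional,
                                         bool zone_is_sequential)
       {
               if (bg_expects_conventional && zone_is_sequential)
                       return -EIO;    /* mismatch: bail out instead of corrupting state */
               return 0;
       }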
      
      CC: stable@vger.kernel.org # 5.12+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 19 April 2021, 1 commit
  14. 10 April 2021, 1 commit
    • btrfs: zoned: move superblock logging zone location · 53b74fa9
      Committed by Naohiro Aota
      Move the location of the superblock logging zones. The new
      locations of the logging zones are now determined by fixed block
      addresses instead of by fixed zone numbers.
      
      The old placement method based on fixed zone numbers causes
      problems when one needs to inspect a file system image without
      access to the drive zone information. In such a case, the super
      block locations cannot be reliably determined as the zone size is
      unknown. By locating the superblock logging zones using fixed
      addresses, we can scan a dumped file system image without the zone
      information, since a super block copy will always be present at or
      after the fixed known locations.
      
      Introduce the following three pairs of zones containing fixed offset
      locations, regardless of the device zone size.
      
        - primary superblock: offset   0B (and the following zone)
        - first copy:         offset 512G (and the following zone)
        - second copy:        offset   4T (4096G, and the following zone)
      
      If a logging zone is outside of the disk capacity, we do not record the
      superblock copy.
      
      The first copy position is much larger than on a non-zoned
      filesystem, where it is at 64M. This is to avoid overlapping with
      the log zones for the primary superblock. This higher location is
      arbitrary but allows supporting devices with very large zone sizes,
      plus some space in between.
      
      Such a large zone size is unrealistic and very unlikely to ever be
      seen in real devices. Currently, SMR disks have a zone size of
      256MB, and we are expecting ZNS drives to be in the 1-4GB range, so
      this limit gives us room to breathe. For now, we only allow zone
      sizes up to 8GB. The maximum zone size that would still fit in the
      space is 256G.
      
      The fixed location addresses are somewhat arbitrary, with the intent of
      maintaining superblock reliability for smaller and larger devices, with
      the preference for the latter. For this reason, there are two superblocks
      under the first 1T. This should cover use cases for physical devices and
      for emulated/device-mapper devices.
      
      The superblock logging zones are reserved for superblock logging and
      never used for data or metadata blocks. Note that we only reserve the
      two zones per primary/copy actually used for superblock logging. We do
      not reserve the ranges of zones possibly containing superblocks with the
      largest supported zone size (0-16GB, 512G-528GB, 4096G-4112G).
      
      The zones containing the fixed location offsets used to store
      superblocks on a non-zoned volume are also reserved to avoid confusion.
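
      For illustration, a minimal sketch of how the fixed byte offsets
      map to logging zone numbers regardless of the zone size; the
      constant and function names are illustrative, not the kernel
      identifiers:

       #include <stdint.h>

       #define SB_LOG_PRIMARY_OFFSET  (0ULL)
       #define SB_LOG_FIRST_OFFSET    (512ULL * 1024 * 1024 * 1024)  /* 512G */
       #define SB_LOG_SECOND_OFFSET   (4096ULL * 1024 * 1024 * 1024) /* 4T   */

       /* First zone of the logging pair for superblock copy 'mirror' (0-2). */
       static uint64_t sb_log_zone_number(int mirror, uint64_t zone_size)
       {
               static const uint64_t offsets[] = {
                       SB_LOG_PRIMARY_OFFSET,
                       SB_LOG_FIRST_OFFSET,
                       SB_LOG_SECOND_OFFSET,
               };

               /* The logging pair is this zone and the following one. */
               return offsets[mirror] / zone_size;
       }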
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  15. 04 March 2021, 1 commit
  16. 09 February 2021, 5 commits
    • btrfs: zoned: support dev-replace in zoned filesystems · 7db1c5d1
      Committed by Naohiro Aota
      This is patch 4/4 implementing device-replace on zoned filesystems.
      
      Even after the copying is done, the write pointers of the source
      device and the destination device may not be synchronized. For
      example, when the last allocated extent is freed before the
      device-replace process, the extent is not copied, leaving a hole
      there.
      
      Synchronize the write pointers by writing zeroes to the destination
      device.
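
      A minimal sketch of the synchronization step, assuming a generic
      write_zeroes callback rather than the real zoned device helpers:

       #include <stdint.h>

       /*
        * Sketch only: if the destination zone's write pointer lags behind
        * the source's after the copy, pad the destination with zeroes.
        */
       static int sync_zone_write_pointers(uint64_t src_wp, uint64_t dst_wp,
                                           int (*write_zeroes)(uint64_t off, uint64_t len))
       {
               if (dst_wp >= src_wp)
                       return 0;               /* already in sync */
               return write_zeroes(dst_wp, src_wp - dst_wp);
       }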
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: implement copying for zoned device-replace · de17addc
      Committed by Naohiro Aota
      This is patch 3/4 implementing device-replace on zoned filesystems.
      
      This commit implements copying. To do this, it tracks the write pointer
      during the device replace process. As device-replace's copy process is
      smart enough to only copy used extents on the source device, we have to
      fill the gap to honor the sequential write requirement in the target
      device.
      
      The device-replace process on zoned filesystems must copy or clone
      all the extents in the source device exactly once. So, we need to
      ensure that allocations started just before the dev-replace process
      have their corresponding extent information in the B-trees.
      finish_extent_writes_for_zoned() implements that functionality,
      which is basically the code removed in commit 042528f8 ("Btrfs: fix
      block group remaining RO forever after error during device
      replace").
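
      A rough sketch of the gap-filling idea, with write_zeroes and
      copy_data as stand-ins for the real scrub/replace helpers:

       #include <stdint.h>

       /*
        * Sketch only: used extents are copied in logical order, and any
        * hole between the current write pointer and the next extent is
        * filled with zeroes so the target zone stays sequential.
        */
       static int copy_extent_sequential(uint64_t *write_pointer,
                                         uint64_t extent_start, uint64_t extent_len,
                                         int (*write_zeroes)(uint64_t off, uint64_t len),
                                         int (*copy_data)(uint64_t off, uint64_t len))
       {
               int ret = 0;

               if (extent_start > *write_pointer)
                       ret = write_zeroes(*write_pointer, extent_start - *write_pointer);
               if (!ret)
                       ret = copy_data(extent_start, extent_len);
               if (!ret)
                       *write_pointer = extent_start + extent_len;
               return ret;
       }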
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: implement cloning for zoned device-replace · 6143c23c
      Committed by Naohiro Aota
      This is patch 2/4 implementing device-replace on zoned filesystems.
      
      In zoned mode, a block group must be either copied (from the source
      device to the target device) or cloned (to both devices).
      
      Implement the cloning part. If a block group targeted by an IO is marked
      to copy, we should not clone the IO to the destination device, because
      the block group is eventually copied by the replace process.
      
      This commit also handles cloning of device reset.
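
      A minimal sketch of the clone decision, using simplified boolean
      inputs instead of the real dev-replace and block group state:

       #include <stdbool.h>

       /*
        * Sketch only: while a replace is running, write IO is duplicated
        * to the target device only if the block group is NOT marked
        * "to copy"; marked block groups are copied later by the replace
        * process itself.
        */
       static bool should_clone_io_to_target(bool replace_running, bool bg_marked_to_copy)
       {
               return replace_running && !bg_marked_to_copy;
       }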
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: serialize metadata IO · 0bc09ca1
      Committed by Naohiro Aota
      We cannot use zone append for writing metadata, because the B-tree
      nodes have references to each other using logical addresses.
      Without knowing the address in advance, we cannot construct the
      tree in the first place. So we need to serialize write IOs for
      metadata.
      
      We cannot add a mutex around allocation and submission because metadata
      blocks are allocated in an earlier stage to build up B-trees.
      
      Add a zoned_meta_io_lock and hold it during metadata IO submission in
      btree_write_cache_pages() to serialize IOs.
      
      Furthermore, this adds a per-block group metadata IO submission
      pointer "meta_write_pointer" to ensure sequential writing, which
      can break when attempting to write back blocks in an unfinished
      transaction. If the write out fails because of a hole and the write
      is for data integrity (WB_SYNC_ALL), it returns EAGAIN.
      
      A caller like the fsync() code should handle this properly, e.g. by
      falling back to a full transaction commit.
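
      A simplified sketch of the submission-side check, assuming a
      stripped-down block group structure; the real code operates on
      extent buffers and the block group cache:

       #include <errno.h>
       #include <stdbool.h>
       #include <stdint.h>

       /*
        * Sketch only: a metadata block is submitted only when it sits
        * exactly at the block group's meta_write_pointer; otherwise it is
        * skipped for now, or -EAGAIN is returned for WB_SYNC_ALL writeback
        * so the caller can fall back to a full transaction commit.
        */
       struct meta_wp_sketch {
               uint64_t meta_write_pointer;    /* next expected metadata write position */
       };

       static int check_meta_write_pointer(struct meta_wp_sketch *bg,
                                           uint64_t eb_start, bool for_sync)
       {
               if (eb_start != bg->meta_write_pointer)
                       return for_sync ? -EAGAIN : 1;  /* 1: not in order, retry later */
               return 0;                               /* in order, OK to submit */
       }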
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: use ZONE_APPEND write for zoned mode · d8e3fb10
      Committed by Naohiro Aota
      Enable zone append writing for zoned mode. When using zone append,
      a bio is issued to the start of a target zone and the device
      decides where to place it inside the zone. Upon completion the
      device reports the actual written position back to the host.
      
      Three parts are necessary to enable zone append mode. First, modify
      the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and
      adjust the bi_sector to point to the beginning of the zone.
      
      Second, record the returned physical address (and disk/partno) to the
      ordered extent in end_bio_extent_writepage() after the bio has been
      completed. We cannot resolve the physical address to the logical address
      because we can neither take locks nor allocate a buffer in this end_bio
      context. So, we need to record the physical address to resolve it later
      in btrfs_finish_ordered_io().
      
      And finally, rewrite the logical addresses of the extent mapping and
      checksum data according to the physical address using btrfs_rmap_block.
      If the returned address matches the originally allocated address, we can
      skip this rewriting process.
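
      A userspace-style sketch of the three steps, with simplified
      stand-ins for struct bio and the block layer API (none of the names
      below are the kernel identifiers):

       #include <stdint.h>

       enum sketch_op { OP_WRITE, OP_ZONE_APPEND };

       struct sketch_bio {
               enum sketch_op op;
               uint64_t bi_sector;     /* where the bio is issued */
               uint64_t done_sector;   /* filled in by the device on completion */
       };

       /* 1) submit: issue the bio at the zone start, the device picks the spot */
       static void prepare_zone_append(struct sketch_bio *bio, uint64_t zone_start_sector)
       {
               bio->op = OP_ZONE_APPEND;
               bio->bi_sector = zone_start_sector;
       }

       /* 2) complete: remember the physical position reported by the device */
       static uint64_t record_physical(const struct sketch_bio *bio)
       {
               return bio->done_sector;
       }

       /* 3) finish: rewrite logical mappings only if the data actually moved */
       static int needs_rewrite(uint64_t allocated_sector, uint64_t reported_sector)
       {
               return reported_sector != allocated_sector;
       }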
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>