- 08 12月, 2021 1 次提交
-
-
由 Johannes Thumshirn 提交于
When finishing a zone that is used by a dedicated data relocation block group, also remove its reference from fs_info, so we're not trying to use a full block group for allocations during data relocation, which will always fail. The result is we're not making any forward progress and end up in a deadlock situation. Fixes: c2707a25 ("btrfs: zoned: add a dedicated data relocation block group") Reviewed-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 27 10月, 2021 16 次提交
-
-
由 Kai Song 提交于
Fix memdup.cocci warning: fs/btrfs/zoned.c:1198:23-30: WARNING opportunity for kmemdup Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NKai Song <songkai01@inspur.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The structure btrfs_bio is used by two different sites: - bio->bi_private for mirror based profiles For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records how many mirrors are still pending, and save the original endio function of the bio. - RAID56 code In that case, RAID56 only utilize the stripes info, and no long uses that to trace the pending mirrors. So btrfs_bio is not always bind to a bio, and contains more info for IO context, thus renaming it will make the naming less confusing. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
Now that we have a dedicated block group for relocation, we can use REQ_OP_WRITE instead of REQ_OP_ZONE_APPEND for writing out the data on relocation. Reviewed-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
Relocation in a zoned filesystem can fail with a transaction abort with error -22 (EINVAL). This happens because the relocation code assumes that the extents we relocated the data to have the same size the source extents had and ensures this by preallocating the extents. But in a zoned filesystem we currently can't preallocate the extents as this would break the sequential write required rule. Therefore it can happen that the writeback process kicks in while we're still adding pages to a delalloc range and starts writing out dirty pages. This then creates destination extents that are smaller than the source extents, triggering the following safety check in get_new_location(): 1034 if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) { 1035 ret = -EINVAL; 1036 goto out; 1037 } Temporarily create a dedicated block group for the relocation process, so no non-relocation data writes can interfere with the relocation writes. This is needed that we can switch the relocation process on a zoned filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme like in a non-zoned filesystem using REQ_OP_WRITE and preallocation. Fixes: 32430c61 ("btrfs: zoned: enable relocation on a zoned filesystem") Reviewed-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
If we have written to the zone capacity, the device automatically deactivates the zone. Sync up block group side (the active BG list and zone_is_active flag) with it. We need to do it both on data BGs and metadata BGs. On data side, we add a hook to btrfs_finish_ordered_io(). On metadata side, we use end_extent_buffer_writeback(). To reduce excess lookup of a block group, we mark the last extent buffer in a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for data (ordered_extent), because the address may change due to REQ_OP_ZONE_APPEND. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
The current extent allocator tries to allocate a new block group when the existing block groups do not have enough space. On a ZNS device, a new block group means a new active zone. If the number of active zones has already reached the max_active_zones, activating a new zone needs to finish an existing zone, leading to wasting the free space there. So, instead, it should reuse the existing active block groups as much as possible when we can't activate any other zones without sacrificing an already activated block group. While at it, I converted find_free_extent_update_loop() to check the found_extent() case early and made the other conditions simpler. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
Load activeness of underlying zones of a block group. When underlying zones are active, we add the block group to the fs_info->zone_active_bgs list. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
Add zone_is_active flag to btrfs_block_group. This flag indicates the underlying zones are all active. Such zone active block groups are tracked by fs_info->active_bg_list. btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying device part. They set/clear the bitmap to indicate zone activeness and count the number of zones we can activate left. btrfs_zone_{activate,finish}() take responsibility for the logical part and the list management. In addition, btrfs_zone_finish() wait for any writes on it and send REQ_OP_ZONE_FINISH to the zone. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
We will use a block group's physical location to track active zones and finish fully written zones in the following commits. Since the zone activation is done in the extent allocation context which already holding the tree locks, we can't query the chunk tree for the physical locations. So, copy the location info into a block group and use it for activation. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
The ZNS specification defines a limit on the number of zones that can be in the implicit open, explicit open or closed conditions. Any zone with such condition is defined as an active zone and correspond to any zone that is being written or that has been only partially written. If the maximum number of active zones is reached, we must either reset or finish some active zones before being able to chose other zones for storing data. Load queue_max_active_zones() and track the number of active zones left on the device. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
If there is no more space left for a new superblock in a superblock zone, then it is better to ZONE_FINISH the zone and frees up the active zone count. Since btrfs_advance_sb_log() can now issue REQ_OP_ZONE_FINISH, we also need to convert it to return int for the error case. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
sb_write_pointer() returns the write position of next superblock. For READ, we need a previous location. When the pointer is at the head, the previous one is the last one of the other zone. Calculate the last one's position from zone capacity. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
We cannot write beyond zone capacity. So, we should consider a zone as "full" when the write pointer goes beyond capacity - the size of super info. Also, take this opportunity to replace a subtle duplicated code with a loop and fix a typo in comment. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
Now that we introduced capacity in a block group, we need to calculate free space using the capacity instead of the length. Thus, bytes we account capacity - alloc_pointer as free, and account bytes [capacity, length] as zone unusable. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
btrfs_free_excluded_extents() is not neccessary for btrfs_calc_zone_unusable() and it makes btrfs_calc_zone_unusable() difficult to reuse. Move it out and call btrfs_free_excluded_extents() in proper context. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
The ZNS specification introduces the concept of a Zone Capacity. A zone capacity is an additional per-zone attribute that indicates the number of usable logical blocks within each zone, starting from the first logical block of each zone. It is always smaller or equal to the zone size. With the SINGLE profile, we can set a block group's "capacity" as the same as the underlying zone's Zone Capacity. We will limit the allocation not to exceed in a following commit. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 26 10月, 2021 1 次提交
-
-
由 Nikolay Borisov 提交于
The user facing function used to allocate new chunks is btrfs_chunk_alloc, unfortunately there is yet another similar sounding function - btrfs_alloc_chunk. This creates confusion, especially since the latter function can be considered "private" in the sense that it implements the first stage of chunk creation and as such is called by btrfs_chunk_alloc. To avoid the awkwardness that comes with having similarly named but distinctly different in their purpose function rename btrfs_alloc_chunk to btrfs_create_chunk, given that the main purpose of this function is to orchestrate the whole process of allocating a chunk - reserving space into devices, deciding on characteristics of the stripe size and creating the in-memory structures. Reviewed-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 23 8月, 2021 2 次提交
-
-
由 Marcos Paulo de Souza 提交于
After calling btrfs_search_slot is a common practice to check if the slot found isn't bigger than number of slots in the current leaf, and if so, search for the same key in the next leaf by calling btrfs_next_leaf, which calls btrfs_next_old_leaf to do the job. Calling btrfs_next_item in the same situation would end up in the same code flow, since * btrfs_next_item * btrfs_next_old_item * if slot >= nritems(curr_leaf) btrfs_next_old_leaf Change btrfs_verify_dev_extents and calculate_emulated_zone_size functions to use btrfs_next_leaf in the same situation. Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
There used to be a patch in the original series for zoned support which limited the extent size to max_zone_append_size, but this patch has been dropped somewhere around v9. We've decided to go the opposite direction, instead of limiting extents in the first place we split them before submission to comply with the device's limits. Remove the related code, btrfs_fs_info::max_zone_append_size and btrfs_zoned_device_info::max_zone_append_size. This also removes the workaround for dm-crypt introduced in 1d68128c ("btrfs: zoned: fail mount if the device does not support zone append") because the fix has been merged as f34ee1dc ("dm crypt: Fix zoned block device support"). Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 22 7月, 2021 1 次提交
-
-
由 Christoph Hellwig 提交于
Store the block device instead of the gendisk in the btrfs_ordered_extent structure instead of acquiring a reference to it later. Note: this is from series removing bdgrab/bdput, btrfs is one of the last users. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 22 6月, 2021 1 次提交
-
-
由 David Sterba 提交于
Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 21 6月, 2021 3 次提交
-
-
由 Johannes Thumshirn 提交于
To be able to construct a zone append bio we need to look up the btrfs_device. The code doing the chunk map lookup to get the device is present in btrfs_submit_compressed_write and submit_extent_page. Factor out the lookup calls into a helper and use it in the submission paths. Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
If we can't read a reliable write pointer from a sequential zone fail creating the block group with an I/O error. Also if the read write pointer is beyond the end of the respective zone, fail the creation of the block group on this zone with an I/O error. While this could also happen in real world scenarios with misbehaving drives, this issue addresses a problem uncovered by fstests' test case generic/475. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
This extends patch 784daf2b ("btrfs: zoned: sanity check zone type"), the message was supposed to be there but was lost during merge. We want to make the error noticeable so add it. Fixes: 784daf2b ("btrfs: zoned: sanity check zone type") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 04 6月, 2021 1 次提交
-
-
由 Naohiro Aota 提交于
In btrfs_get_dev_zone_info(), we have "u32 sb_zone" and calculate "sector_t sector" by shifting it. But, this "sector" is calculated in 32bit, leading it to be 0 for the 2nd superblock copy. Since zone number is u32, shifting it to sector (sector_t) or physical address (u64) can easily trigger a missing cast bug like this. This commit introduces helpers to convert zone number to sector/LBA, so we won't fall into the same pitfall again. Reported-by: NDmitry Fomichev <Dmitry.Fomichev@wdc.com> Fixes: 12659251 ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.11+ Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 20 5月, 2021 1 次提交
-
-
由 Johannes Thumshirn 提交于
btrfs_use_zone_append only needs the passed in extent_map's block_start member, so there's no need to pass in the full extent map. This also enables the use of btrfs_use_zone_append in places where we only have a start byte but no extent_map. Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 04 5月, 2021 1 次提交
-
-
由 Naohiro Aota 提交于
The fstests test case generic/475 creates a dm-linear device that gets changed to a dm-error device. This leads to errors in loading the block group's zone information when running on a zoned file system, ultimately resulting in a list corruption. When running on a kernel with list debugging enabled this leads to the following crash. BTRFS: error (device dm-2) in cleanup_transaction:1953: errno=-5 IO failure kernel BUG at lib/list_debug.c:54! invalid opcode: 0000 [#1] SMP PTI CPU: 1 PID: 2433 Comm: umount Tainted: G W 5.12.0+ #1018 RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47 RSP: 0018:ffffc90001473df0 EFLAGS: 00010296 RAX: 0000000000000054 RBX: ffff8881038fd000 RCX: ffffc90001473c90 RDX: 0000000100001a31 RSI: 0000000000000003 RDI: 0000000000000003 RBP: ffff888308871108 R08: 0000000000000003 R09: 0000000000000001 R10: 3961373532383838 R11: 6666666620736177 R12: ffff888308871000 R13: ffff8881038fd088 R14: ffff8881038fdc78 R15: dead000000000100 FS: 00007f353c9b1540(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f353cc2c710 CR3: 000000018e13c000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0xc9/0x310 [btrfs] close_ctree+0x2ee/0x31a [btrfs] ? call_rcu+0x8f/0x270 ? mutex_lock+0x1c/0x40 generic_shutdown_super+0x67/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x90 cleanup_mnt+0x13e/0x1b0 task_work_run+0x63/0xb0 exit_to_user_mode_loop+0xd9/0xe0 exit_to_user_mode_prepare+0x3e/0x60 syscall_exit_to_user_mode+0x1d/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae As dm-error has no support for zones, btrfs will run it's zone emulation mode on this device. The zone emulation mode emulates conventional zones, so bail out if the zone bitmap that gets populated on mount sees the zone as sequential while we're thinking it's a conventional zone when creating a block group. Note: this scenario is unlikely in a real wold application and can only happen by this (ab)use of device-mapper targets. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 19 4月, 2021 1 次提交
-
-
由 Johannes Thumshirn 提交于
For zoned btrfs, zone append is mandatory to write to a sequential write only zone, otherwise parallel writes to the same zone could result in unaligned write errors. If a zoned block device does not support zone append (e.g. a dm-crypt zoned device using a non-NULL IV cypher), fail to mount. CC: stable@vger.kernel.org # 5.12 Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 10 4月, 2021 1 次提交
-
-
由 Naohiro Aota 提交于
Moves the location of the superblock logging zones. The new locations of the logging zones are now determined based on fixed block addresses instead of on fixed zone numbers. The old placement method based on fixed zone numbers causes problems when one needs to inspect a file system image without access to the drive zone information. In such case, the super block locations cannot be reliably determined as the zone size is unknown. By locating the superblock logging zones using fixed addresses, we can scan a dumped file system image without the zone information since a super block copy will always be present at or after the fixed known locations. Introduce the following three pairs of zones containing fixed offset locations, regardless of the device zone size. - primary superblock: offset 0B (and the following zone) - first copy: offset 512G (and the following zone) - Second copy: offset 4T (4096G, and the following zone) If a logging zone is outside of the disk capacity, we do not record the superblock copy. The first copy position is much larger than for a non-zoned filesystem, which is at 64M. This is to avoid overlapping with the log zones for the primary superblock. This higher location is arbitrary but allows supporting devices with very large zone sizes, plus some space around in between. Such large zone size is unrealistic and very unlikely to ever be seen in real devices. Currently, SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be in the 1-4GB range, so this limit gives us room to breathe. For now, we only allow zone sizes up to 8GB. The maximum zone size that would still fit in the space is 256G. The fixed location addresses are somewhat arbitrary, with the intent of maintaining superblock reliability for smaller and larger devices, with the preference for the latter. For this reason, there are two superblocks under the first 1T. This should cover use cases for physical devices and for emulated/device-mapper devices. The superblock logging zones are reserved for superblock logging and never used for data or metadata blocks. Note that we only reserve the two zones per primary/copy actually used for superblock logging. We do not reserve the ranges of zones possibly containing superblocks with the largest supported zone size (0-16GB, 512G-528GB, 4096G-4112G). The zones containing the fixed location offsets used to store superblocks on a non-zoned volume are also reserved to avoid confusion. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 04 3月, 2021 1 次提交
-
-
由 Naohiro Aota 提交于
We need to use sector_t for zone_sectors, or it would set the zone size to zero when the size >= 4GB (= 2^24 sectors) by shifting the zone_sectors value by SECTOR_SHIFT. We're assuming zones sizes up to 8GiB. Fixes: 5b316468 ("btrfs: get zone information of zoned block devices") Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 09 2月, 2021 9 次提交
-
-
由 Naohiro Aota 提交于
This is 4/4 patch to implement device-replace on zoned filesystems. Even after the copying is done, the write pointers of the source device and the destination device may not be synchronized. For example, when the last allocated extent is freed before device-replace process, the extent is not copied, leaving a hole there. Synchronize the write pointers by writing zeroes to the destination device. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
This is 3/4 patch to implement device-replace on zoned filesystems. This commit implements copying. To do this, it tracks the write pointer during the device replace process. As device-replace's copy process is smart enough to only copy used extents on the source device, we have to fill the gap to honor the sequential write requirement in the target device. The device-replace process on zoned filesystems must copy or clone all the extents in the source device exactly once. So, we need to ensure allocations started just before the dev-replace process to have their corresponding extent information in the B-trees. finish_extent_writes_for_zoned() implements that functionality, which basically is the removed code in the commit 042528f8 ("Btrfs: fix block group remaining RO forever after error during device replace"). Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
This is 2/4 patch to implement device replace for zoned filesystems. In zoned mode, a block group must be either copied (from the source device to the target device) or cloned (to both devices). Implement the cloning part. If a block group targeted by an IO is marked to copy, we should not clone the IO to the destination device, because the block group is eventually copied by the replace process. This commit also handles cloning of device reset. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
We cannot use zone append for writing metadata, because the B-tree nodes have references to each other using logical address. Without knowing the address in advance, we cannot construct the tree in the first place. So we need to serialize write IOs for metadata. We cannot add a mutex around allocation and submission because metadata blocks are allocated in an earlier stage to build up B-trees. Add a zoned_meta_io_lock and hold it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this adds a per-block group metadata IO submission pointer "meta_write_pointer" to ensure sequential writing, which can break when attempting to write back blocks in an unfinished transaction. If the writing out failed because of a hole and the write out is for data integrity (WB_SYNC_ALL), it returns EAGAIN. A caller like fsync() code should handle this properly e.g. by falling back to a full transaction commit. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
Enable zone append writing for zoned mode. When using zone append, a bio is issued to the start of a target zone and the device decides to place it inside the zone. Upon completion the device reports the actual written position back to the host. Three parts are necessary to enable zone append mode. First, modify the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the bi_sector to point the beginning of the zone. Second, record the returned physical address (and disk/partno) to the ordered extent in end_bio_extent_writepage() after the bio has been completed. We cannot resolve the physical address to the logical address because we can neither take locks nor allocate a buffer in this end_bio context. So, we need to record the physical address to resolve it later in btrfs_finish_ordered_io(). And finally, rewrite the logical addresses of the extent mapping and checksum data according to the physical address using btrfs_rmap_block. If the returned address matches the originally allocated address, we can skip this rewriting process. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
On a zoned filesystem, cache if a block group is on a sequential write only zone. On sequential write only zones, we can use REQ_OP_ZONE_APPEND for writing data, therefore provide btrfs_use_zone_append() to figure out if IO is targeting a sequential write only zone and we can use REQ_OP_ZONE_APPEND for data writing. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
Tree manipulating operations like merging nodes often release once-allocated tree nodes. Such nodes are cleaned so that pages in the node are not uselessly written out. On zoned volumes, however, such optimization blocks the following IOs as the cancellation of the write out of the freed blocks breaks the sequential write sequence expected by the device. Introduce a list of clean and unwritten extent buffers that have been released in a transaction. Redirty the buffers so that btree_write_cache_pages() can send proper bios to the devices. Besides it clears the entire content of the extent buffer not to confuse raw block scanners e.g. 'btrfs check'. By clearing the content, csum_dirty_buffer() complains about bytenr mismatch, so avoid the checking and checksum using newly introduced buffer flag EXTENT_BUFFER_NO_CHECK. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
In a zoned filesystem a once written then freed region is not usable until the underlying zone has been reset. So we need to distinguish such unusable space from usable free space. Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But, when an allocated region is returned before using e.g., the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For the situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on regular filesystem. On zoned filesystems, it rewinds the allocation offset. Because the read-only bytes tracks free but unusable bytes when the block group is read-only, we need to migrate the zone_unusable bytes to read-only bytes when a block group is marked read-only. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
Conventional zones do not have a write pointer, so we cannot use it to determine the allocation offset for sequential allocation if a block group contains a conventional zone. But instead, we can consider the end of the highest addressed extent in the block group for the allocation offset. For new block group, we cannot calculate the allocation offset by consulting the extent tree, because it can cause deadlock by taking extent buffer lock after chunk mutex, which is already taken in btrfs_make_block_group(). Since it is a new block group anyways, we can simply set the allocation offset to 0. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-