- 07 10月, 2020 29 次提交
-
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
This makes reading the code a tad easier by decreasing the level of indirection by one. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It's always set to 0 by its sole caller - btrfs_readpage. Simply remove it. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It's always set to 0 from the sole caller - btrfs_readpage. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Now that btrfs_readpage is the only caller of extent_read_full_page the latter can be open coded in the former. Use the occassion to rename __extent_read_full_page to extent_read_full_page. To facillitate this change submit_one_bio has to be exported as well. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It's called only from btrfs_readpage which always passes 0 so just sink the argument into extent_read_full_page. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Now that this function is only responsible for reading data pages it's no longer necessary to pass get_extent_t parameter across several layers of functions. This patch removes this parameter from multiple functions: __get_extent_map/__do_readpage/__extent_read_full_page/ extent_read_full_page and simply calls btrfs_get_extent directly in __get_extent_map. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
The function btrfs_punch_hole_range() is now used to replace all the file extents in a given file range with an extent described in the given struct btrfs_replace_extent_info argument. This extent can either be an existing extent that is being cloned or it can be a new extent (namely a prealloc extent). When that argument is NULL it only punches a hole (drops all the existing extents) in the file range. So rename the function to btrfs_replace_file_extents(). Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
Now that we can use btrfs_clone_extent_info to convey information for a new prealloc extent as well, and not just for existing extents that are being cloned, rename it to btrfs_replace_extent_info, which reflects the fact that this is now more generic and it is used to replace all existing extents in a file range with the extent described by the structure. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
The value of item_size of struct btrfs_clone_extent_info is always set to the size of a non-inline file extent item, and in fact the infrastructure that uses this structure (btrfs_punch_hole_range()) does not work with inline file extents at all (and it is not supposed to). So just remove that field from the structure and use directly sizeof(struct btrfs_file_extent_item) instead. Also assert that the file extent type is not inline at btrfs_insert_clone_extent(). Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
When doing an fallocate(), specially a zero range operation, we assume that reserving 3 units of metadata space is enough, that at most we touch one leaf in subvolume/fs tree for removing existing file extent items and inserting a new file extent item. This assumption is generally true for most common use cases. However when we end up needing to remove file extent items from multiple leaves, we can end up failing with -ENOSPC and abort the current transaction, turning the filesystem to RO mode. When this happens a stack trace like the following is dumped in dmesg/syslog: [ 1500.620934] ------------[ cut here ]------------ [ 1500.620938] BTRFS: Transaction aborted (error -28) [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...) [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1 [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.621026] Code: 8b 40 50 f0 48 (...) [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286 [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000 [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001 [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000 [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60 [ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000 [ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0 [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1500.621049] Call Trace: [ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs] [ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs] [ 1500.621108] vfs_fallocate+0x14d/0x290 [ 1500.621112] ksys_fallocate+0x3a/0x70 [ 1500.621117] __x64_sys_fallocate+0x1a/0x20 [ 1500.621120] do_syscall_64+0x33/0x80 [ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1500.621126] RIP: 0033:0x7fb5b248c477 [ 1500.621128] Code: 89 7c 24 08 (...) [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477 [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003 [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000 [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010 [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003 [ 1500.621151] irq event stamp: 1026217 [ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0 [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0 [ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606 [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20 [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]--- [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left When we use fallocate() internally, for reserving an extent for a space cache, inode cache or relocation, we can't hit this problem since either there aren't any file extent items to remove from the subvolume tree or there is at most one. When using plain fallocate() it's very unlikely, since that would require having many file extent items representing holes for the target range and crossing multiple leafs - we attempt to increase the range (merge) of such file extent items when punching holes, so at most we end up with 2 file extent items for holes at leaf boundaries. However when using the zero range operation of fallocate() for a large range (100+ MiB for example) that's fairly easy to trigger. The following example reproducer triggers the issue: $ cat reproducer.sh #!/bin/bash umount /dev/sdj &> /dev/null mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null mount /dev/sdj /mnt/sdj # Create a 100M file with many file extent items. Punch a hole every 8K # just to speedup the file creation - we could do 4K sequential writes # followed by fsync (or O_SYNC) as well, but that takes a lot of time. file_size=$((100 * 1024 * 1024)) xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar for ((i = 0; i < $file_size; i += 8192)); do xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar done # Force a transaction commit, so the zero range operation will be forced # to COW all metadata extents it need to touch. sync xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar umount /mnt/sdj $ ./reproducer.sh wrote 104857600/104857600 bytes at offset 0 100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec) fallocate: No space left on device $ dmesg <shows the same stack trace pasted before> To fix this use the existing infrastructure that hole punching and extent cloning use for replacing a file range with another extent. This deals with doing the removal of file extent items and inserting the new one using an incremental approach, reserving more space when needed and always ensuring we don't leave an implicit hole in the range in case we need to do multiple iterations and a crash happens between iterations. A test case for fstests will follow up soon. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Goldwyn Rodrigues 提交于
Instead of using a flag bit for exclusive operation, use a variable to store which exclusive operation is being performed. Introduce an API to start and finish an exclusive operation. This would enable another way for tools to check which operation is running on why starting an exclusive operation failed. The followup patch adds a sysfs_notify() to alert userspace when the state changes, so userspace can perform select() on it to get notified of the change. This would enable us to enqueue a command which will wait for current exclusive operation to complete before issuing the next exclusive operation. This has been done synchronously as opposed to a background process, or else error collection (if any) will become difficult. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
Our current tree locking stuff allows us to recurse with read locks if we're already holding the write lock. This is necessary for the space cache inode, as we could be holding a lock on the root_tree root when we need to cache a block group, and thus need to be able to read down the root_tree to read in the inode cache. We can get away with this in our current locking, but we won't be able to with a rwsem. Handle this by purposefully annotating the places where we require recursion, so that in the future we can maybe come up with a way to avoid the recursion. In the case of the free space inode, this will be superseded by the free space tree. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] When quota is enabled for TEST_DEV, generic/013 sometimes fails like this: generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg) And with the following metadata leak: BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace a6cfd45ba80e4e06 ]--- BTRFS error (device dm-3): qgroup reserved space leaked BTRFS info (device dm-3): disk space caching is enabled BTRFS info (device dm-3): has skinny extents [CAUSE] The qgroup preallocated meta rsv operations of that offending root are: btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072 btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072 btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152 btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072 btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072 It's pretty obvious that, we reserve qgroup meta rsv in btrfs_subvolume_reserve_metadata(), but doesn't have corresponding release/convert calls in btrfs_subvolume_release_metadata(). This leads to the leakage. [FIX] To fix this bug, we should follow what we're doing in btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and add it to block_rsv->qgroup_rsv_reserved. And free the qgroup reserved metadata space when releasing the block_rsv. To do this, we need to change the btrfs_subvolume_release_metadata() to accept btrfs_root, and record the qgroup_to_release number, and call btrfs_qgroup_convert_reserved_meta() for it. Fixes: 733e03a0 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
There's no practical reason too use 'err' as a variable to convey errors. In fact it's value is either set explicitly in the beginning of the function or it simply takes the value of 'ret'. Not conforming to the usual pattern of having ret be the only variable used to convey errors makes the code more error prone to bugs. In fact one such bug was introduced by 6bf9e4bd ("btrfs: inode: Verify inode mode toi avoid NULL pointer dereference") by assigning the error value to 'ret' and not 'err'. Let's fix that issue and make the function less tricky by leaving only ret to convey error values. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
iomap dio will run generic_write_sync() for us if the iocb is DSYNC. This is problematic for us because of 2 reasons: 1. we hold the inode_lock() during this operation, and we take it in generic_write_sync() 2. we hold a read lock on the dio_sem but take the write lock in fsync Since we don't want to rip out this code right now, but reworking the locking is a bit much to do at this point, work around this problem with this masterpiece of a patch. First, we clear DSYNC on the iocb so that the iomap stuff doesn't know that it needs to handle the sync. We save this fact in current->journal_info, because we need to see do special things once we're in iomap_begin, and we have no way to pass private information into iomap_dio_rw(). Next we specify a separate iomap_dio_ops for sync, which implements an ->end_io() callback that gets called when the dio completes. This is important for AIO, because we really do need to run generic_write_sync() if we complete asynchronously. However if we're still in the submitting context when we enter ->end_io() we clear the flag so that the submitter knows they're the ones that needs to run generic_write_sync(). This is meant to be temporary. We need to work out how to eliminate the inode_lock() and the dio_sem in our fsync and use another mechanism to protect these operations. Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Goldwyn Rodrigues 提交于
We're using direct io implementation based on buffer heads. This patch switches to the new iomap infrastructure. Switch from __blockdev_direct_IO() to iomap_dio_rw(). Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as iomap_begin() for iomap direct I/O functions. This function allocates and locks all the blocks required for the I/O. btrfs_submit_direct() is used as the submit_io() hook for direct I/O ops. Since we need direct I/O reads to go through iomap_dio_rw(), we change file_operations.read_iter() to a btrfs_file_read_iter() which calls btrfs_direct_IO() for direct reads and falls back to generic_file_buffered_read() for incomplete reads and buffered reads. We don't need address_space.direct_IO() anymore: set it to noop. Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is capable of direct I/O reads from a hole, so we don't need to return -ENOENT. Btrfs direct I/O is now done under i_rwsem, shared in case of reads and exclusive in case of writes. This guards against simultaneous truncates. Use iomap->iomap_end() to check for failed or incomplete direct I/O: - for writes, call __endio_write_update_ordered() - for reads, unlock extents btrfs_dio_data is now hooked in iomap->private and not current->journal_info. It carries the reservation variable and the amount of data submitted, so we can calculate the amount of data to call __endio_write_update_ordered in case of an error. This patch removes last use of struct buffer_head from btrfs. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
Since commit d4682ba0 ("Btrfs: sync log after logging new name") we started to commit logs, and fallback to transaction commits when we failed to log the new names or commit the logs, after link and rename operations when the target inodes (or their parents) were previously logged in the current transaction. This was to avoid losing directories despite an explicit fsync on them when they are ancestors of some inode that got a new named logged, due to a link or rename operation. However that adds the cost of starting IO and waiting for it to complete, which can cause higher latencies for applications. Instead of doing that, just make sure that when we log a new name for an inode we don't mark any of its ancestors as logged, so that if any one does an fsync against any of them, without doing any other change on them, the fsync commits the log. This way we only pay the cost of a log commit (or a transaction commit if something goes wrong or a new block group was created) if the application explicitly asks to fsync any of the parent directories. Using dbench, which mixes several filesystems operations including renames, revealed some significant latency gains. The following script that uses dbench was used to test this: #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/btrfs MOUNT_OPTIONS="-o ssd -o space_cache=v2" MKFS_OPTIONS="-m single -d single" THREADS=16 echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -t 300 -D $MNT $THREADS umount $MNT The test was run on bare metal, no virtualization, on a box with 12 cores (Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel configuration that is the default of typical distributions (debian in this case), without debug options enabled (kasan, kmemleak, slub debug, debug of page allocations, lock debugging, etc). Results before this patch: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 10750455 0.011 155.088 Close 7896674 0.001 0.243 Rename 455222 2.158 1101.947 Unlink 2171189 0.067 121.638 Deltree 256 2.425 7.816 Mkdir 128 0.002 0.003 Qpathinfo 9744323 0.006 21.370 Qfileinfo 1707092 0.001 0.146 Qfsinfo 1786756 0.001 11.228 Sfileinfo 875612 0.003 21.263 Find 3767281 0.025 9.617 WriteX 5356924 0.011 211.390 ReadX 16852694 0.003 9.442 LockX 35008 0.002 0.119 UnlockX 35008 0.001 0.138 Flush 753458 4.252 1102.249 Throughput 1128.35 MB/sec 16 clients 16 procs max_latency=1102.255 ms Results after this patch: 16 clients, after Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 11471098 0.012 448.281 Close 8426396 0.001 0.925 Rename 485746 0.123 267.183 Unlink 2316477 0.080 63.433 Deltree 288 2.830 11.144 Mkdir 144 0.003 0.010 Qpathinfo 10397420 0.006 10.288 Qfileinfo 1822039 0.001 0.169 Qfsinfo 1906497 0.002 14.039 Sfileinfo 934433 0.004 2.438 Find 4019879 0.026 10.200 WriteX 5718932 0.011 200.985 ReadX 17981671 0.003 10.036 LockX 37352 0.002 0.076 UnlockX 37352 0.001 0.109 Flush 804018 5.015 778.033 Throughput 1201.98 MB/sec 16 clients 16 procs max_latency=778.036 ms (+6.5% throughput, -29.4% max latency, -75.8% rename latency) Test case generic/498 from fstests tests the scenario that the previously mentioned commit fixed. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
In btrfs_orphan_cleanup, there's another instance of fs_info, but it's the same as the one we already have. In btrfs_backref_finish_upper_links, rb_node is same type and used as temporary cursor to the tree. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We have btrfs_wait_ordered_roots() which takes a u64 for nr, but btrfs_start_delalloc_roots() that takes an int for nr, which makes using them in conjunction, especially for something like (u64)-1, annoying and inconsistent. Fix btrfs_start_delalloc_roots() to take a u64 for nr and adjust start_delalloc_inodes() and it's callers appropriately. This means we've adjusted start_delalloc_inodes() to take a pointer of nr since we want to preserve the ability for start-delalloc_inodes() to return an error, so simply make it do the nr adjusting as necessary. Part of adjusting the callers to this means changing btrfs_writeback_inodes_sb_nr() to take a u64 for items. This may be confusing because it seems unrelated, but the caller of btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the function variable that needs to be changed. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Tested-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
That BUG_ON cannot ever trigger because as the comment there states - 'err' is always set. Simply remove it as it brings no value. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 21 8月, 2020 1 次提交
-
-
由 Boris Burkov 提交于
can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for detecting a must cow condition which is not exactly accurate, but saves unnecessary tree traversal. The incorrect assumption is that if the extent was created in a generation smaller than the last snapshot generation, it must be referenced by that snapshot. That is true, except the snapshot could have since been deleted, without affecting the last snapshot generation. The original patch claimed a performance win from this check, but it also leads to a bug where you are unable to use a swapfile if you ever snapshotted the subvolume it's in. Make the check slower and more strict for the swapon case, without modifying the general cow checks as a compromise. Turning swap on does not seem to be a particularly performance sensitive operation, so incurring a possibly unnecessary btrfs_search_slot seems worthwhile for the added usability. Note: Until the snapshot is competely cleaned after deletion, check_committed_refs will still cause the logic to think that cow is necessary, so the user must until 'btrfs subvolu sync' finished before activating the swapfile swapon. CC: stable@vger.kernel.org # 5.4+ Suggested-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NBoris Burkov <boris@bur.io> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 20 8月, 2020 1 次提交
-
-
由 Johannes Thumshirn 提交于
Btrfs' async submit mechanism is able to handle errors in the submission path and the meta-data async submit function correctly passes the error code to the caller. In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're not handling the errors returned by btrfs_csum_one_bio() correctly though and simply call BUG_ON(). This is unnecessary as the caller of these two functions - run_one_async_start - correctly checks for the return values and sets the status of the async_submit_bio. The actual bio submission will be handled later on by run_one_async_done only if async_submit_bio::status is 0, so the data won't be written if we encountered an error in the checksum process. Simply return the error from btrfs_csum_one_bio() to the async submitters, like it's done in btree_submit_bio_start(). Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 11 8月, 2020 2 次提交
-
-
由 Pavel Machek 提交于
btrfs_get_extent() sets variable ret, but out: error path expect error to be in variable err so the error code is lost. Fixes: 6bf9e4bd ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NPavel Machek (CIP) <pavel@denx.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] There is a bug report of NULL pointer dereference caused in compress_file_extent(): Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs] NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs] LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] Call Trace: [c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable) [c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs] [c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs] [c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0 [c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660 [c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0 [c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c ---[ end trace f16954aa20d822f6 ]--- [CAUSE] For the following execution route of compress_file_range(), it's possible to hit NULL pointer dereference: compress_file_extent() |- pages = NULL; |- start = async_chunk->start = 0; |- end = async_chunk = 4095; |- nr_pages = 1; |- inode_need_compress() == false; <<< Possible, see later explanation | Now, we have nr_pages = 1, pages = NULL |- cont: |- ret = cow_file_range_inline(); |- if (ret <= 0) { |- for (i = 0; i < nr_pages; i++) { |- WARN_ON(pages[i]->mapping); <<< Crash To enter above call execution branch, we need the following race: Thread 1 (chattr) | Thread 2 (writeback) --------------------------+------------------------------ | btrfs_run_delalloc_range | |- inode_need_compress = true | |- cow_file_range_async() btrfs_ioctl_set_flag() | |- binode_flags |= | BTRFS_INODE_NOCOMPRESS | | compress_file_range() | |- inode_need_compress = false | |- nr_page = 1 while pages = NULL | | Then hit the crash [FIX] This patch will fix it by checking @pages before doing accessing it. This patch is only designed as a hot fix and easy to backport. More elegant fix may make btrfs only check inode_need_compress() once to avoid such race, but that would be another story. Reported-by: NLuciano Chavez <chavez@us.ibm.com> Fixes: 4d3a800e ("btrfs: merge nr_pages input and output parameter in compress_pages") CC: stable@vger.kernel.org # 4.14.x: cecc8d90: btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range CC: stable@vger.kernel.org # 4.14+ Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 27 7月, 2020 7 次提交
-
-
由 Filipe Manana 提交于
The possibility of extents being shared (through clone and deduplication operations) requires special care when logging data checksums, to avoid having a log tree with different checksum items that cover ranges which overlap (which resulted in missing checksums after replaying a log tree). Such problems were fixed in the past by the following commits: commit 40e046ac ("Btrfs: fix missing data checksums after replaying a log tree") commit e289f03e ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents") Test case generic/588 exercises the scenario solved by the first commit (purely sequential and deterministic) while test case generic/457 often triggered the case fixed by the second commit (not deterministic, requires specific timings under concurrency). The problems were addressed by deleting, from the log tree, any existing checksums before logging the new ones. And also by doing the deletion and logging of the cheksums while locking the checksum range in an extent io tree (root->log_csum_range), to deal with the case where we have concurrent fsyncs against files with shared extents. That however causes more contention on the leaves of a log tree where we store checksums (and all the nodes in the paths leading to them), even when we do not have shared extents, or all the shared extents were created by past transactions. It also adds a bit of contention on the spin lock of the log_csums_range extent io tree of the log root. This change adds a 'last_reflink_trans' field to the inode to keep track of the last transaction where a new extent was shared between inodes (through clone and deduplication operations). It is updated for both the source and destination inodes of reflink operations whenever a new extent (created in the current transaction) becomes shared by the inodes. This field is kept in memory only, not persisted in the inode item, similar to other existing fields (last_unlink_trans, logged_trans). When logging checksums for an extent, if the value of 'last_reflink_trans' is smaller then the current transaction's generation/id, we skip locking the extent range and deletion of checksums from the log tree, since we know we do not have new shared extents. This reduces contention on the log tree's leaves where checksums are stored. The following script, which uses fio, was used to measure the impact of this change: $ cat test-fsync.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-d single -m single" if [ $# -ne 3 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 cat <<EOF > /tmp/fio-job.ini [writers] rw=write fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=64k ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS EOF echo "Using config:" echo cat /tmp/fio-job.ini echo mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The tests were performed for different numbers of jobs, file sizes and fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has 12 cores, with cpu governance set to performance mode on all cores), 16GiB of ram (the host has 64GiB) and using a NVMe device directly (without an intermediary filesystem in the host). While running the tests, the host was not used for anything else, to avoid disturbing the tests. The obtained results were the following (the last line of fio's output was pasted). Starting with 16 jobs is where a significant difference is observable in this particular setup and hardware (differences highlighted below). The very small differences for tests with less than 16 jobs are possibly just noise and random. **** 1 job, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=23.8MiB/s (24.9MB/s), 23.8MiB/s-23.8MiB/s (24.9MB/s-24.9MB/s), io=1024MiB (1074MB), run=43075-43075msec after this change: WRITE: bw=24.4MiB/s (25.6MB/s), 24.4MiB/s-24.4MiB/s (25.6MB/s-25.6MB/s), io=1024MiB (1074MB), run=41938-41938msec **** 2 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=37.7MiB/s (39.5MB/s), 37.7MiB/s-37.7MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54351-54351msec after this change: WRITE: bw=37.7MiB/s (39.5MB/s), 37.6MiB/s-37.6MiB/s (39.5MB/s-39.5MB/s), io=2048MiB (2147MB), run=54428-54428msec **** 4 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=67.5MiB/s (70.8MB/s), 67.5MiB/s-67.5MiB/s (70.8MB/s-70.8MB/s), io=4096MiB (4295MB), run=60669-60669msec after this change: WRITE: bw=68.6MiB/s (71.0MB/s), 68.6MiB/s-68.6MiB/s (71.0MB/s-71.0MB/s), io=4096MiB (4295MB), run=59678-59678msec **** 8 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s), io=8192MiB (8590MB), run=64048-64048msec after this change: WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=8192MiB (8590MB), run=63405-63405msec **** 16 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=78.5MiB/s (82.3MB/s), 78.5MiB/s-78.5MiB/s (82.3MB/s-82.3MB/s), io=16.0GiB (17.2GB), run=208676-208676msec after this change: WRITE: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=16.0GiB (17.2GB), run=149295-149295msec (+40.1% throughput, -28.5% runtime) **** 32 jobs, file size 1G, fsync frequency 1 **** before this change: WRITE: bw=58.8MiB/s (61.7MB/s), 58.8MiB/s-58.8MiB/s (61.7MB/s-61.7MB/s), io=32.0GiB (34.4GB), run=557134-557134msec after this change: WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430550-430550msec (+29.4% throughput, -22.7% runtime) **** 64 jobs, file size 512M, fsync frequency 1 **** before this change: WRITE: bw=65.8MiB/s (68.0MB/s), 65.8MiB/s-65.8MiB/s (68.0MB/s-68.0MB/s), io=32.0GiB (34.4GB), run=498055-498055msec after this change: WRITE: bw=85.1MiB/s (89.2MB/s), 85.1MiB/s-85.1MiB/s (89.2MB/s-89.2MB/s), io=32.0GiB (34.4GB), run=385116-385116msec (+29.3% throughput, -22.7% runtime) **** 128 jobs, file size 256M, fsync frequency 1 **** before this change: WRITE: bw=54.7MiB/s (57.3MB/s), 54.7MiB/s-54.7MiB/s (57.3MB/s-57.3MB/s), io=32.0GiB (34.4GB), run=599373-599373msec after this change: WRITE: bw=121MiB/s (126MB/s), 121MiB/s-121MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=271907-271907msec (+121.2% throughput, -54.6% runtime) **** 256 jobs, file size 256M, fsync frequency 1 **** before this change: WRITE: bw=69.2MiB/s (72.5MB/s), 69.2MiB/s-69.2MiB/s (72.5MB/s-72.5MB/s), io=64.0GiB (68.7GB), run=947536-947536msec after this change: WRITE: bw=121MiB/s (127MB/s), 121MiB/s-121MiB/s (127MB/s-127MB/s), io=64.0GiB (68.7GB), run=541916-541916msec (+74.9% throughput, -42.8% runtime) **** 512 jobs, file size 128M, fsync frequency 1 **** before this change: WRITE: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=64.0GiB (68.7GB), run=767734-767734msec after this change: WRITE: bw=141MiB/s (147MB/s), 141MiB/s-141MiB/s (147MB/s-147MB/s), io=64.0GiB (68.7GB), run=466022-466022msec (+65.1% throughput, -39.3% runtime) **** 1024 jobs, file size 128M, fsync frequency 1 **** before this change: WRITE: bw=115MiB/s (120MB/s), 115MiB/s-115MiB/s (120MB/s-120MB/s), io=128GiB (137GB), run=1143775-1143775msec after this change: WRITE: bw=171MiB/s (180MB/s), 171MiB/s-171MiB/s (180MB/s-180MB/s), io=128GiB (137GB), run=764843-764843msec (+48.7% throughput, -33.1% runtime) Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Now that btrfs_io_bio have access to btrfs_device we can safely increment the device corruption counter on error. There is one notable exception - repair bios for raid. Since those don't go through the normal submit_stripe_bio callpath but through raid56_parity_recover thus repair bios won't have their device set. Scrub increments the corruption counter for checksum mismatch as well but does not call this function. Link: https://lore.kernel.org/linux-btrfs/4857863.FCrPRfMyHP@liv/Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] When a lot of subvolumes are created, there is a user report about transaction aborted caused by slow anonymous block device reclaim: BTRFS: Transaction aborted (error -24) WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs] RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs] Call Trace: create_pending_snapshots+0x82/0xa0 [btrfs] btrfs_commit_transaction+0x275/0x8c0 [btrfs] btrfs_mksubvol+0x4b9/0x500 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0x11a4/0x2da0 [btrfs] do_vfs_ioctl+0xa9/0x640 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 33f2f83f3d5250e9 ]--- BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown BTRFS info (device sda1): forced readonly BTRFS warning (device sda1): Skipping commit of aborted transaction. BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown [CAUSE] The anonymous device pool is shared and its size is 1M. It's possible to hit that limit if the subvolume deletion is not fast enough and the subvolumes to be cleaned keep the ids allocated. [WORKAROUND] We can't avoid the anon device pool exhaustion but we can shorten the time the id is attached to the subvolume root once the subvolume becomes invisible to the user. Reported-by: NGreed Rong <greedrong@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
vfs_inode is used only for the inode number everything else requires btrfs_inode. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ use btrfs_ino ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Instead of making multiple calls to BTRFS_I simply take btrfs_inode as an input paramter. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
All of its children functions use btrfs_inode. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
All of its children take btrfs_inode so bubble up this requirement to btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I internally. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-