- 03 6月, 2015 32 次提交
-
-
由 Filipe Manana 提交于
Zygo Blaxell and other users have reported occasional hangs while an inode is being evicted, leading to traces like the following: [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds. [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2 [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000 [ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001 [ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78 [ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123 [ 5281.981396] Call Trace: [ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83 [ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs] [ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs] [ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs] [ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs] [ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148 [ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5 [ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa [ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 [ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab [ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b [ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [ 5282.001299] 1 lock held by rm/20488: [ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b This happens when we have readahead, which calls readpages(), happening right before the inode eviction handler is invoked. So the reason is essentially: 1) readpages() is called while a reference on the inode is held, so eviction can not be triggered before readpages() returns. It also locks one or more ranges in the inode's io_tree (which is done at extent_io.c:__do_contiguous_readpages()); 2) readpages() submits several read bios, all with an end io callback that runs extent_io.c:end_bio_extent_readpage() and that is executed by other task when a bio finishes, corresponding to a work queue (fs_info->end_io_workers) worker kthread. This callback unlocks the ranges in the inode's io_tree that were previously locked in step 1; 3) readpages() returns, the reference on the inode is dropped; 4) One or more of the read bios previously submitted are still not complete (their end io callback was not yet invoked or has not yet finished execution); 5) Inode eviction is triggered (through an unlink call for example). The inode reference count was not incremented before submitting the read bios, therefore this is possible; 6) The eviction handler starts executing and enters the loop that iterates over all extent states in the inode's io_tree; 7) The loop picks one extent state record and uses its ->start and ->end fields, after releasing the inode's io_tree spinlock, to call lock_extent_bits() and clear_extent_bit(). The call to lock the range [state->start, state->end] blocks because the whole range or a part of it was locked by the previous call to readpages() and the corresponding end io callback, which unlocks the range was not yet executed; 8) The end io callback for the read bio is executed and unlocks the range [state->start, state->end] (or a superset of that range). And at clear_extent_bit() the extent_state record state is used as a second argument to split_state(), which sets state->start to a larger value; 9) The task executing the eviction handler is woken up by the task executing the bio's end io callback (through clear_state_bit) and the eviction handler locks the range [old value for state->start, state->end]. Shortly after, when calling clear_extent_bit(), it unlocks the range [new value for state->start, state->end], so it ends up unlocking only part of the range that it locked, leaving an extent state record in the io_tree that represents the unlocked subrange; 10) The eviction handler loop, in its next iteration, gets the extent_state record for the subrange that it did not unlock in the previous step and then tries to lock it, resulting in an hang. So fix this by not using the ->start and ->end fields of an existing extent_state record. This is a simple solution, and an alternative could be to bump the inode's reference count before submitting each read bio and having it dropped in the bio's end io callback. But that would be a more invasive/complex change and would not protect against other possible places that are not holding a reference on the inode as well. Something to consider in the future. Many thanks to Zygo Blaxell for reporting, in the mailing list, the issue, a set of scripts to trigger it and testing this fix. Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org> Tested-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Liu Bo 提交于
The return value of read_tree_block() can confuse callers as it always returns NULL for either -ENOMEM or -EIO, so it's likely that callers parse it to a wrong error, for instance, in btrfs_read_tree_root(). This fixes the above issue. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Liu Bo 提交于
read_tree_block may take a reference on the 'eb', a following free_extent_buffer is necessary. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Liu Bo 提交于
After commit 8407f553 ("Btrfs: fix data corruption after fast fsync and writeback error"), during wait_ordered_extents(), we wait for ordered extent setting BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've already got checksum information, so we don't need to check (csum_bytes_left == 0) in the whole logging path. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
Unlike when attempting to allocate a new block group, where we check that we have enough space in the system space_info to update the device items and insert a new chunk item in the chunk tree, we were not checking if the system space_info had enough space for updating the device items and deleting the chunk item in the chunk tree. This often lead to -ENOSPC error when attempting to allocate blocks for the chunk tree (during btree node/leaf COW operations) while updating the device items or deleting the chunk item, which resulted in the current transaction being aborted and turning the filesystem into read-only mode. While running fstests generic/038, which stresses allocation of block groups and removal of unused block groups, with a large scratch device (750Gb) this happened often, despite more than enough unallocated space, and resulted in the following trace: [68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [68663.600407] BTRFS: Transaction aborted (error -28) (...) [68663.730829] Call Trace: [68663.732585] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [68663.734334] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [68663.739980] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [68663.757153] [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.760925] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [68663.762854] [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs] [68663.764073] [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.765130] [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs] [68663.765998] [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs] [68663.767068] [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs] [68663.768227] [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c [68663.769081] [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs] [68663.799485] [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs] [68663.809208] [<ffffffff8105f367>] kthread+0xef/0xf7 [68663.828795] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28 [68663.844942] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.846486] [<ffffffff81435a88>] ret_from_fork+0x58/0x90 [68663.847760] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.849503] ---[ end trace 798477c6d6dbaad6 ]--- [68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left So fix this by verifying that enough space exists in system space_info, and reserving the space in the chunk block reserve, before attempting to delete the block group and allocate a new system chunk if we don't have enough space to perform the necessary updates and delete in the chunk tree. Like for the block group creation case, we don't error our if we fail to allocate a new system chunk, since we might end up not needing it (no node/leaf splits happen during the COW operations and/or we end up not needing to COW any btree nodes or leafs because they were already COWed in the current transaction and their writeback didn't start yet). Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
While creating a block group, we often end up getting ENOSPC while updating the chunk tree, which leads to a transaction abortion that produces a trace like the following: [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [30670.117777] BTRFS: Transaction aborted (error -28) (...) [30670.163567] Call Trace: [30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs] [30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs] [30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs] [30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15 [30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de [30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f [30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62 [30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [30670.187792] ---[ end trace 0373e6b491c4a8cc ]--- This is because we don't do proper space reservation for the chunk block reserve when we have multiple tasks allocating chunks in parallel. So block group creation has 2 phases, and the first phase essentially checks if there is enough space in the system space_info, allocating a new system chunk if there isn't, while the second phase updates the device, extent and chunk trees. However, because the updates to the chunk tree happen in the second phase, if we have N tasks, each with its own transaction handle, allocating new chunks in parallel and if there is only enough space in the system space_info to allocate M chunks, where M < N, none of the tasks ends up allocating a new system chunk in the first phase and N - M tasks will get -ENOSPC when attempting to update the chunk tree in phase 2 if they need to COW any nodes/leafs from the chunk tree. Fix this by doing proper reservation in the chunk block reserve. The issue could be reproduced by running fstests generic/038 in a loop, which eventually triggered the problem. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
We should be doing this, it's weird we hadn't been doing this. Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
Now that we're guaranteed to have a meaningful root dentry, we can just export seq_dentry() and use it in btrfs_show_options(). The subvolume ID is easy to get and can also be useful, so put that in there, too. Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
Currently, mounting a subvolume with subvolid= takes a different code path than mounting with subvol=. This isn't really a big deal except for the fact that mounts done with subvolid= or the default subvolume don't have a dentry that's connected to the dentry tree like in the subvol= case. To unify the code paths, when given subvolid= or using the default subvolume ID, translate it into a subvolume name by walking ROOT_BACKREFs in the root tree and INODE_REFs in the filesystem trees. Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
There's nothing to stop a user from passing both subvol= and subvolid= to mount, but if they don't refer to the same subvolume, someone is going to be surprised at some point. Error out on this case, but allow users to pass in both if they do match (which they could, for example, get out of /proc/mounts). Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
In preparation for new functionality in mount_subvol(), give it ownership of subvol_name and tidy up the error paths. Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
Currently, setup_root_args() substitutes 's/subvol=[^,]*/subvolid=0/'. But, this means that if the user passes both a subvol and subvolid for some reason, we won't actually mount the top-level when we recursively mount. For example, consider: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt btrfs subvol create /mnt/subvol1 # subvolid=257 btrfs subvol create /mnt/subvol2 # subvolid=258 umount /mnt mount -osubvol=/subvol1,subvolid=258 /dev/sdb /mnt In the final mount, subvol=/subvol1,subvolid=258 becomes subvolid=0,subvolid=258, and the last option takes precedence, so we mount subvol2 and try to look up subvol1 inside of it, which fails. So, instead, do a thorough scan through the argument list and remove any subvol= and subvolid= options, then append subvolid=0 to the end. This implicitly makes subvol= take precedence over subvolid=, but we're about to add a stricter check for that. This also makes setup_root_args() more generic, which we'll need soon. Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
Since commit 0723a047 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options"), when mounting a subvolume read/write when another subvolume has previously been mounted read-only, we first do a remount. However, this should be done with the superblock locked, as per sync_filesystem(): /* * We need to be protected against the filesystem going from * r/o to r/w or vice versa. */ WARN_ON(!rwsem_is_locked(&sb->s_umount)); This WARN_ON can easily be hit with: mkfs.btrfs -f /dev/vdb mount /dev/vdb /mnt btrfs subvol create /mnt/vol1 btrfs subvol create /mnt/vol2 umount /mnt mount -oro,subvol=/vol1 /dev/vdb /mnt mount -orw,subvol=/vol2 /dev/vdb /mnt2 Fixes: 0723a047 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options") Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits() through free_io_failure(), we weren't waking up any tasks waiting for the extent's state EXTENT_LOCKED bit, leading to an hang. So make sure clear_extent_bits() ends up waking up any waiters if the bit EXTENT_LOCKED is supplied by its callers. Zygo Blaxell was experiencing such hangs at inode eviction time after file unlinks. Thanks to him for a set of scripts to reproduce the issue. Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
With commit 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole") introduced in the kernel 4.1 merge window, we end up using part of a device hole for which there are already pending chunks or pinned chunks. Before that commit we didn't use the hole and would just move on to the next hole in the device. However when we adjust the start offset for the chunk allocation and we have pinned chunks, we set it blindly to the end offset of the pinned chunk we are currently processing, which is dangerous because we can have a pending chunk that has a start offset that matches the end offset of our pinned chunk - leading us to a case where we end up getting two pending chunks that start at the same physical device offset, which makes us later abort the current transaction with -EEXIST when finishing the chunk allocation at btrfs_create_pending_block_groups(): [194737.659017] ------------[ cut here ]------------ [194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [194737.662209] BTRFS: Transaction aborted (error -17) [194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse [194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [194737.682999] 0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2 [194737.684540] ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78 [194737.686017] ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40 [194737.687509] Call Trace: [194737.688068] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [194737.689027] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [194737.690095] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [194737.691198] [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.693789] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [194737.695065] [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [194737.696806] [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [194737.698683] [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs] [194737.700329] [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs] [194737.701924] [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [194737.703675] [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs] [194737.705417] [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [194737.707058] [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs] [194737.708560] [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs] [194737.710673] [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e [194737.712076] [<ffffffff811534c3>] new_sync_write+0x7c/0xa0 [194737.713293] [<ffffffff81153b58>] vfs_write+0xb2/0x117 [194737.714443] [<ffffffff81154424>] SyS_pwrite64+0x64/0x82 [194737.715646] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [194737.717175] ---[ end trace f2d5dc04e56d7e48 ]--- [194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by btrfs_create_pending_block_groups(), when it attempts to insert a duplicated device extent item via btrfs_alloc_dev_extent(). This issue was reproducible with fstests generic/038 running in a loop for several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard". Applying Jeff's recent patch titled "btrfs: add missing discards when unpinning extents with -o discard" makes the issue much easier to reproduce (usually within 4 to 5 hours), since it pins chunks for longer periods of time when an unused block group is deleted by the cleaner kthread. Fix this by making sure that we never adjust the start offset to a lower value than it currently has. Fixes: 1b984508 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole" Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Sasha Levin 提交于
__btrfs_close_devices() would call_rcu to free the device, which is racy with list_for_each_entry() accessing the memory to retrieve the next device on the list. Signed-off-by: NSasha Levin <sasha.levin@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
The INO_LOOKUP ioctl can lookup path for a given inode number and is thus restricted. As a sideefect it can find the root id of the containing subvolume and we're using this int the 'btrfs inspect rootid' command. The restriction is unnecessary in case we set the ioctl args args::treeid = 0 args::objectid = 256 (BTRFS_FIRST_FREE_OBJECTID) Then the path will be empty and the treeid is filled with the root id of the inode on which the ioctl is called. This behaviour is unchanged, after the root restriction is removed. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
When we create a block group we add it to the rbtree of block groups before setting its ->space_info field (while it's NULL). This is problematic since other tasks can access the block group from the rbtree and attempt to use its ->space_info before it is set by btrfs_make_block_group(). This can happen for example when a concurrent fitrim ioctl operation is ongoing, which produces a trace like the following when CONFIG_DEBUG_PAGEALLOC is set. [11509.604369] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [11509.606373] IP: [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] PGD 2296a8067 PUD 22f4a2067 PMD 0 [11509.608179] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [11509.608179] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq processor i2c_piix4 psmou [11509.608179] CPU: 10 PID: 8538 Comm: fstrim Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [11509.608179] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [11509.608179] task: ffff88009f5c46d0 ti: ffff8801b3edc000 task.ti: ffff8801b3edc000 [11509.608179] RIP: 0010:[<ffffffff8107d675>] [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] RSP: 0018:ffff8801b3edf9e8 EFLAGS: 00010002 [11509.608179] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000000 [11509.608179] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018 [11509.608179] RBP: ffff8801b3edfaa8 R08: 0000000000000001 R09: 0000000000000000 [11509.608179] R10: 0000000000000000 R11: ffff88009f5c4f98 R12: 0000000000000000 [11509.608179] R13: 0000000000000000 R14: 0000000000000018 R15: ffff88009f5c46d0 [11509.608179] FS: 00007f280a10e840(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000 [11509.608179] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [11509.608179] CR2: 0000000000000018 CR3: 00000002119bc000 CR4: 00000000000006e0 [11509.608179] Stack: [11509.608179] 0000000000000000 0000000000000000 0000000000000004 0000000000000000 [11509.608179] ffff880100000000 ffffffff00000000 0000000000000001 ffffffff00000000 [11509.608179] 0000000000000001 0000000000000000 ffff880100000000 00000000000006c4 [11509.608179] Call Trace: [11509.608179] [<ffffffff8107dc57>] ? __lock_acquire+0x696/0xf02 [11509.608179] [<ffffffff8107e806>] lock_acquire+0xa5/0x116 [11509.608179] [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffff81434f37>] _raw_spin_lock+0x34/0x44 [11509.608179] [<ffffffffa04cc876>] ? do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffffa04cc876>] do_trimming+0x51/0x145 [btrfs] [11509.608179] [<ffffffffa04cde7d>] btrfs_trim_block_group+0x201/0x491 [btrfs] [11509.608179] [<ffffffffa04849e2>] btrfs_trim_fs+0xe0/0x129 [btrfs] [11509.608179] [<ffffffffa04bb80a>] btrfs_ioctl_fitrim+0x138/0x167 [btrfs] [11509.608179] [<ffffffffa04c002f>] btrfs_ioctl+0x50d/0x21e8 [btrfs] [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81123bda>] ? might_fault+0x58/0xb5 [11509.608179] [<ffffffff81158050>] ? cp_new_stat+0x147/0x15e [11509.608179] [<ffffffff81163041>] do_vfs_ioctl+0x3c6/0x479 [11509.608179] [<ffffffff81158116>] ? SYSC_newfstat+0x25/0x2e [11509.608179] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 [11509.608179] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f [11509.608179] [<ffffffff8116314e>] SyS_ioctl+0x5a/0x7f [11509.608179] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [11509.608179] Code: f4 01 00 0f 85 c0 00 00 00 48 c7 c1 f3 1f 7d 81 48 c7 c2 aa cb 7c 81 be fc 0b 00 00 eb 70 83 3d 61 eb 9c 00 00 0f 84 a5 00 00 00 <49> 81 3e 40 a3 2b 82 b8 00 00 00 [11509.608179] RIP [<ffffffff8107d675>] __lock_acquire+0xb4/0xf02 [11509.608179] RSP <ffff8801b3edf9e8> [11509.608179] CR2: 0000000000000018 [11509.608179] ---[ end trace 570a5c6769f0e49a ]--- Which corresponds to the following access in fs/btrfs/free-space-cache.c: static int do_trimming(struct btrfs_block_group_cache *block_group, u64 *total_trimmed, u64 start, u64 bytes, u64 reserved_start, u64 reserved_bytes, struct btrfs_trim_range *trim_entry) { struct btrfs_space_info *space_info = block_group->space_info; (...) spin_lock(&space_info->lock); ^^^^^ - block_group->space_info is NULL... Fix this by ensuring the block group's ->space_info is set before adding the block group to the rbtree. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
Report missing device when add is successful, otherwise it would exit as ENOMEM. And add uuid to the report. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Qu Wenruo 提交于
Old csum type check is wrong and can't catch csum_type 1(not supported). Fix it to avoid hostile 0 division. Reported-by: NLukas Lueg <lukas.lueg@gmail.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
Marc reported a problem where the receiving end of an incremental send was performing clone operations that failed with -EINVAL. This happened because, unlike for uncompressed extents, we were not checking if the source clone offset and length, after summing the data offset, falls within the source file's boundaries. So make sure we do such checks when attempting to issue clone operations for compressed extents. Problem reproducible with the following steps: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount -o compress /dev/sdc /mnt2 # Create the file with a single extent of 128K. This creates a metadata file # extent item with a data start offset of 0 and a logical length of 128K. $ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo # Now rewrite the range 64K to 112K of our file. This will make the inode's # metadata continue to point to the 128K extent we created before, but now # with an extent item that points to the extent with a data start offset of # 112K and a logical length of 16K. # That metadata file extent item is associated with the logical file offset # at 176K and covers the logical file range 176K to 192K. $ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo # Now rewrite the range 180K to 12K. This will make the inode's metadata # continue to point the the 128K extent we created earlier, with a single # extent item that points to it with a start offset of 112K and a logical # length of 4K. # That metadata file extent item is associated with the logical file offset # at 176K and covers the logical file range 176K to 180K. $ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ touch /mnt/bar # Calls the btrfs clone ioctl. $ ~/xfstests/src/cloner -s $((176 * 1024)) -d $((176 * 1024)) \ -l $((4 * 1024)) /mnt/foo /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 At subvol /mnt/snap1 At subvol snap1 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2 At subvol /mnt/snap2 At snapshot snap2 ERROR: failed to clone extents to bar Invalid argument A test case for fstests follows soon. Reported-by: NMarc MERLIN <marc@merlins.org> Tested-by: NMarc MERLIN <marc@merlins.org> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Tested-by: NDavid Sterba <dsterba@suse.cz> Tested-by: NJan Alexander Steffens (heftig) <jan.steffens@gmail.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Christian Engelmayer 提交于
Commit 9c8b35b1 ("btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.") introduced the allocation of a temporary ulist in function btrfs_add_qgroup_relation() and added the corresponding cleanup to the out path. However, the allocation was introduced before the src/dst level check that directly returns. Fix the possible leakage of the ulist by moving the allocation after the input validation. Detected by Coverity CID 1295988. Signed-off-by: NChristian Engelmayer <cengelma@gmx.at> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
If the call to btrfs_truncate_inode_items() failed and we don't have a block group, we were unlocking the cache_write_mutex without having locked it (we do it only if we have a block group). Fixes: 1bbc621e ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’: fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, tree_root, ^ CC [M] fs/btrfs/ioctl.o fs/btrfs/ioctl.c: In function ‘create_subvol’: fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, root, PTR_ERR(new_root)); PTR_ERR returns long, but we're really using 'int' for the error codes everywhere so just set and use the local variable. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
The annotated functios will be placed into .text.unlikely section. The annotation also hints compiler to move the code out of the hot paths, and may implicitly mark if-statement leading to that block as unlikely. This is a heuristic, the impact on the generated code is not significant. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
WARN is called from a single location and all bugreports say that's in super.c __btrfs_abort_transaction. This is slightly confusing as we'd rather want to know the exact callsite. Whereas this information is printed in the syslog below the stacktrace, this requires further look and we usually see only the headline from WARNING. Moving the WARN into the macro has to inline some code and increases code by a few kilobytes: text data bss dec hex filename 835481 20305 14120 869906 d4612 btrfs.ko.before 842883 20305 14120 877308 d62fc btrfs.ko.after The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config. The increase is not small and could lead to worse icache use. The code is on error/exit paths that can be recognized by compiler as cold and moved out of the way so the impact is speculated to be low, if measurable at all. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
Long time ago (2008) the defrag was automatic for new b-tree writes but has been disabled after performance problems. There was a leftover in tree-defrag.c that effectively stops any defragmentation on b-trees. This is a bit unexpected and IMHO undesired. The SSD mode is an optimization and defrag is supposed to work if the users asks for it. Related commits: 6702ed49 Btrfs: Add run time btree defrag, and an ioctl to force btree defrag e18e4809 Btrfs: Add mount -o ssd, which includes optimizations for seek free storage b3236e68 Btrfs: Leave on the tree defragger in mount -o ssd, it still helps there 9afbb0b7 Btrfs: Disable tree defrag in SSD mode The last three commits switch the defrag+ssd off/on/off and the last one 3f157a2f Btrfs: Online btree defragmentation fixes misses the bits from tree-defrag.c to revert to the behaviour introduced in e18e4809. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
When we shrink the usable size of a device (its total_bytes), we go over all the device extent items in the device tree and attempt to relocate the chunk of any device extent that goes beyond the new usable size for the device. We do that after setting the new usable size (total_bytes) in the device object, so that all new allocations (and reallocations) don't use areas of the device that go beyond the new (shorter) size. However we were not considering that before setting the new size in the device, pending chunks might have been created that use device extents that go beyond the new size, and those device extents are not yet in the device tree after we search the device tree - they are still attached to the list of new block group for some ongoing transaction handle, and they are only added to the device tree when the transaction handle is ended (via btrfs_create_pending_block_groups()). So check for pending chunks with device extents that go beyond the new size and if any exists, commit the current transaction and repeat the search in the device tree. Not doing this it would mean we would return success to user space while still having extents that go beyond the new size, and later user space could override those locations on the device while the fs still references them, causing all sorts of corruption and unexpected events. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
Since commit bafc9b75 ("vfs: More precise tests in d_invalidate"), mounted subvolumes can be deleted because d_invalidate() won't fail. However, we run into problems when we attempt to delete the default subvolume while it is mounted as the root filesystem: # btrfs subvol list / ID 257 gen 306 top level 5 path rootvol ID 267 gen 334 top level 5 path snap1 # btrfs subvol get-default / ID 267 gen 334 top level 5 path snap1 # btrfs inspect-internal rootid / 267 # mount -o subvol=/ /dev/vda1 /mnt # btrfs subvol del /mnt/snap1 Delete subvolume (no-commit): '/mnt/snap1' ERROR: cannot delete '/mnt/snap1' - Operation not permitted # findmnt / findmnt: can't read /proc/mounts: No such file or directory # ls /proc # Markus reported that this same scenario simply led to a kernel oops. This happens because in btrfs_ioctl_snap_destroy(), we call d_invalidate() before we check may_destroy_subvol(), which means that we detach the submounts and drop the dentry before erroring out. Instead, we should only invalidate the dentry once the deletion has succeeded. Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate() will prune the dcache for the deleted subvolume. Cc: <stable@vger.kernel.org> Fixes: bafc9b75 ("vfs: More precise tests in d_invalidate") Reported-by: NMarkus Schauler <mschauler@gmail.com> Signed-off-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
If a directory inode is orphanized, because some inode previously processed has a new name that collides with the old name of the current inode, we need to check if it needs its rename operation delayed too, as its ancestor-descendent relationship with some other inode might have been reversed between the parent and send snapshots and therefore its rename operation needs to happen after that other inode is renamed. For example, for the following reproducer where this is needed (provided by Robbie Ko): $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ mkdir -p /mnt/data/n1/n2 $ mkdir /mnt/data/n4 $ mkdir -p /mnt/data/t6/t7 $ mkdir /mnt/data/t5 $ mkdir /mnt/data/t7 $ mkdir /mnt/data/n4/t2 $ mkdir /mnt/data/t4 $ mkdir /mnt/data/t3 $ mv /mnt/data/t7 /mnt/data/n4/t2 $ mv /mnt/data/t4 /mnt/data/n4/t2/t7 $ mv /mnt/data/t5 /mnt/data/n4/t2/t7/t4 $ mv /mnt/data/t6 /mnt/data/n4/t2/t7/t4/t5 $ mv /mnt/data/n1/n2 /mnt/data/n4/t2/t7/t4/t5/t6 $ mv /mnt/data/n1 /mnt/data/n4/t2/t7/t4/t5/t6 $ mv /mnt/data/n4/t2/t7/t4/t5/t6/t7 /mnt/data/n4/t2/t7/t4/t5/t6/n2 $ mv /mnt/data/t3 /mnt/data/n4/t2/t7/t4/t5/t6/n2/t7 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/n4/t2/t7/t4/t5/t6/n1 /mnt/data/n4 $ mv /mnt/data/n4/t2 /mnt/data/n4/n1 $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6/n2 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/n2/t7/t3 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/t7/t4/t5/t6 /mnt/data/n4/n1/t2 $ mv /mnt/data/n4/n1/t2/t7/t4 /mnt/data/n4/n1/t2/t6 $ mv /mnt/data/n4/n1/t2/t7 /mnt/data/n4/n1/t2/t3 $ mv /mnt/data/n4/n1/t2/n2/t7 /mnt/data/n4/n1/t2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2 ERROR: send ioctl failed with -12: Cannot allocate memory Where the parent snapshot directory hierarchy is the following: . (ino 256) |-- data/ (ino 257) |-- n4/ (ino 260) |-- t2/ (ino 265) |-- t7/ (ino 264) |-- t4/ (ino 266) |-- t5/ (ino 263) |-- t6/ (ino 261) |-- n1/ (ino 258) |-- n2/ (ino 259) |-- t7/ (ino 262) |-- t3/ (ino 267) And the send snapshot's directory hierarchy is the following: . (ino 256) |-- data/ (ino 257) |-- n4/ (ino 260) |-- n1/ (ino 258) |-- t2/ (ino 265) |-- n2/ (ino 259) |-- t3/ (ino 267) | |-- t7 (ino 264) | |-- t6/ (ino 261) | |-- t4/ (ino 266) | |-- t5/ (ino 263) | |-- t7/ (ino 262) While processing inode 262 we orphanize inode 264 and later attempt to rename inode 264 to its new name/location, which resulted in building an incorrect destination path string for the rename operation with the value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must have been done only after inode 267 is processed and renamed, as the ancestor-descendent relationship between inodes 264 and 267 was reversed between both snapshots, because otherwise it results in an infinite loop when building the path string for inode 264 when we are processing an inode with a number larger than 264. That loop is the following: start inode 264, send progress of 265 for example parent of 264 -> 267 parent of 267 -> 262 parent of 262 -> 259 parent of 259 -> 261 parent of 261 -> 263 parent of 263 -> 266 parent of 266 -> 264 |--> back to first iteration while current path string length is <= PATH_MAX, and fail with -ENOMEM otherwise So fix this by making the check if we need to delay a directory rename regardless of the current inode having been orphanized or not. A test case for fstests follows soon. Thanks to Robbie Ko for providing a reproducer for this problem. Reported-by: NRobbie Ko <robbieko@synology.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com>
-
由 Filipe Manana 提交于
Even though we delay the rename of directories when they become descendents of other directories that were also renamed in the send root to prevent infinite path build loops, we were doing it in cases where this was not needed and was actually harmful resulting in infinite path build loops as we ended up with a circular dependency of delayed directory renames. Consider the following reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt2 $ mkdir /mnt/data $ mkdir /mnt/data/n1 $ mkdir /mnt/data/n1/n2 $ mkdir /mnt/data/n4 $ mkdir /mnt/data/n1/n2/p1 $ mkdir /mnt/data/n1/n2/p1/p2 $ mkdir /mnt/data/t6 $ mkdir /mnt/data/t7 $ mkdir -p /mnt/data/t5/t7 $ mkdir /mnt/data/t2 $ mkdir /mnt/data/t4 $ mkdir -p /mnt/data/t1/t3 $ mkdir /mnt/data/p1 $ mv /mnt/data/t1 /mnt/data/p1 $ mkdir -p /mnt/data/p1/p2 $ mv /mnt/data/t4 /mnt/data/p1/p2/t1 $ mv /mnt/data/t5 /mnt/data/n4/t5 $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/n4/t5/p2 $ mv /mnt/data/t7 /mnt/data/n4/t5/p2/t7 $ mv /mnt/data/t2 /mnt/data/n4/t1 $ mv /mnt/data/p1 /mnt/data/n4/t5/p2/p1 $ mv /mnt/data/n1/n2 /mnt/data/n4/t5/p2/p1/p2/n2 $ mv /mnt/data/n4/t5/p2/p1/p2/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1 $ mv /mnt/data/n4/t5/t7 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7 $ mv /mnt/data/n4/t5/p2/p1/t1/t3 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3 $ mv /mnt/data/n4/t5/p2/p1/p2/n2/p1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1 $ mv /mnt/data/t6 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t5 $ mv /mnt/data/n4/t5/p2/p1/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t3/t1 $ mv /mnt/data/n1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/n1 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/data/n4/t1 /mnt/data/n4/t5/p2/p1/p2/n2/t1/t7/p1/t1 $ mv /mnt/data/n4/t5/p2/p1/p2/n2/t1 /mnt/data/n4/ $ mv /mnt/data/n4/t5/p2/p1/p2/n2 /mnt/data/n4/t1/n2 $ mv /mnt/data/n4/t1/t7/p1 /mnt/data/n4/t1/n2/p1 $ mv /mnt/data/n4/t1/t3/t1 /mnt/data/n4/t1/n2/t1 $ mv /mnt/data/n4/t1/t3 /mnt/data/n4/t1/n2/t1/t3 $ mv /mnt/data/n4/t5/p2/p1/p2 /mnt/data/n4/t1/n2/p1/p2 $ mv /mnt/data/n4/t1/t7 /mnt/data/n4/t1/n2/p1/t7 $ mv /mnt/data/n4/t5/p2/p1 /mnt/data/n4/t1/n2/p1/p2/p1 $ mv /mnt/data/n4/t1/n2/t1/t3/t5 /mnt/data/n4/t1/n2/p1/p2/t5 $ mv /mnt/data/n4/t5 /mnt/data/n4/t1/n2/p1/p2/p1/t5 $ mv /mnt/data/n4/t1/n2/p1/p2/p1/t5/p2 /mnt/data/n4/t1/n2/p1/p2/p1/p2 $ mv /mnt/data/n4/t1/n2/p1/p2/p1/p2/t7 /mnt/data/n4/t1/t7 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 | btrfs receive /mnt2 $ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive -vv /mnt2 ERROR: send ioctl failed with -12: Cannot allocate memory This reproducer resulted in an infinite path build loop when building the path for inode 266 because the following circular dependency of delayed directory renames was created: ino 272 <- ino 261 <- ino 259 <- ino 268 <- ino 267 <- ino 261 Where the notation "X <- Y" means the rename of inode X is delayed by the rename of inode Y (X will be renamed after Y is renamed). This resulted in an infinite path build loop of inode 266 because that inode has inode 261 as an ancestor in the send root and inode 261 is in the circular dependency of delayed renames listed above. Fix this by not delaying the rename of a directory inode if an ancestor of the inode in the send root, which has a delayed rename operation, is not also a descendent of the inode in the parent root. Thanks to Robbie Ko for sending the reproducer example. A test case for xfstests follows soon. Reported-by: NRobbie Ko <robbieko@synology.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com>
-
- 21 5月, 2015 1 次提交
-
-
由 Chris Mason 提交于
Commit 2f081088 changed btrfs_set_block_group_ro to avoid trying to allocate new chunks with the new raid profile during conversion. This fixed failures when there was no space on the drive to allocate a new chunk, but the metadata reserves were sufficient to continue the conversion. But this ended up causing a regression when the drive had plenty of space to allocate new chunks, mostly because reduce_alloc_profile isn't using the new raid profile. Fixing btrfs_reduce_alloc_profile is a bigger patch. For now, do a partial revert of 2f081088, and don't error out if we hit ENOSPC. Signed-off-by: NChris Mason <clm@fb.com> Tested-by: NDave Sterba <dsterba@suse.cz> Reported-by: NHolger Hoffstaette <holger.hoffstaette@googlemail.com>
-
- 20 5月, 2015 2 次提交
-
-
由 Filipe Manana 提交于
If while setting a block group read-only we end up allocating a system chunk, through check_system_chunk(), we were not doing it while holding the chunk mutex which is a problem if a concurrent chunk allocation is happening, through do_chunk_alloc(), as it means both block groups can end up using the same logical addresses and physical regions in the device(s). So make sure we hold the chunk mutex. Cc: stable@vger.kernel.org # 4.0+ Fixes: 2f081088 ("btrfs: delete chunk allocation attemp when setting block group ro") Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Mark Fasheh 提交于
btrfs_check_shared() is leaking a return value of '1' from find_parent_nodes(). As a result, callers (in this case, extent_fiemap()) are told extents are shared when they are not. This in turn broke fiemap on btrfs for kernels v3.18 and up. The fix is simple - we just have to clear 'ret' after we are done processing the results of find_parent_nodes(). It wasn't clear to me at first what was happening with return values in btrfs_check_shared() and find_parent_nodes() - thanks to Josef for the help on irc. I added documentation to both functions to make things more clear for the next hacker who might come across them. If we could queue this up for -stable too that would be great. Signed-off-by: NMark Fasheh <mfasheh@suse.de> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 11 5月, 2015 4 次提交
-
-
由 Filipe Manana 提交于
There's a race between releasing extent buffers that are flagged as stale and recycling them that makes us it the following BUG_ON at btrfs_release_extent_buffer_page: BUG_ON(extent_buffer_under_io(eb)) The BUG_ON is triggered because the extent buffer has the flag EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made dirty by another concurrent task. Here follows a sequence of steps that leads to the BUG_ON. CPU 0 CPU 1 CPU 2 path->nodes[0] == eb X X->refs == 2 (1 for the tree, 1 for the path) btrfs_header_generation(X) == current trans id flag EXTENT_BUFFER_DIRTY set on X btrfs_release_path(path) unlocks X reads eb X X->refs incremented to 3 locks eb X btrfs_del_items(X) X becomes empty clean_tree_block(X) clear EXTENT_BUFFER_DIRTY from X btrfs_del_leaf(X) unlocks X extent_buffer_get(X) X->refs incremented to 4 btrfs_free_tree_block(X) X's range is not pinned X's range added to free space cache free_extent_buffer_stale(X) lock X->refs_lock set EXTENT_BUFFER_STALE on X release_extent_buffer(X) X->refs decremented to 3 unlocks X->refs_lock btrfs_release_path() unlocks X free_extent_buffer(X) X->refs becomes 2 __btrfs_cow_block(Y) btrfs_alloc_tree_block() btrfs_reserve_extent() find_free_extent() gets offset == X->start btrfs_init_new_buffer(X->start) btrfs_find_create_tree_block(X->start) alloc_extent_buffer(X->start) find_extent_buffer(X->start) finds eb X in radix tree free_extent_buffer(X) lock X->refs_lock test X->refs == 2 test bit EXTENT_BUFFER_STALE is set test !extent_buffer_under_io(eb) increments X->refs to 3 mark_extent_buffer_accessed(X) check_buffer_tree_ref(X) --> does nothing, X->refs >= 2 and EXTENT_BUFFER_TREE_REF is set in X clear EXTENT_BUFFER_STALE from X locks X btrfs_mark_buffer_dirty() set_extent_buffer_dirty(X) check_buffer_tree_ref(X) --> does nothing, X->refs >= 2 and EXTENT_BUFFER_TREE_REF is set sets EXTENT_BUFFER_DIRTY on X test and clear EXTENT_BUFFER_TREE_REF decrements X->refs to 2 release_extent_buffer(X) decrements X->refs to 1 unlock X->refs_lock unlock X free_extent_buffer(X) lock X->refs_lock release_extent_buffer(X) decrements X->refs to 0 btrfs_release_extent_buffer_page(X) BUG_ON(extent_buffer_under_io(X)) --> EXTENT_BUFFER_DIRTY set on X Fix this by making find_extent buffer wait for any ongoing task currently executing free_extent_buffer()/free_extent_buffer_stale() if the extent buffer has the stale flag set. A more clean alternative would be to always increment the extent buffer's reference count while holding its refs_lock spinlock but find_extent_buffer is a performance critical area and that would cause lock contention whenever multiple tasks search for the same extent buffer concurrently. A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream btrfs patches backported from newer kernels) was hitting this often: [1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507! (...) [1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1 [1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006 [1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000 [1212302.540792] RIP: 0010:[<ffffffffa0507081>] [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs] (...) [1212302.630008] Call Trace: [1212302.630008] [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs] [1212302.630008] [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs] [1212302.630008] [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs] [1212302.630008] [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs] [1212302.630008] [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs] [1212302.630008] [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs] [1212302.630008] [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs] [1212302.630008] [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs] [1212302.630008] [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs] [1212302.630008] [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs] [1212302.630008] [<ffffffff811b9e28>] evict+0xa8/0x190 [1212302.630008] [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0 I was also able to reproduce this on a 3.19 kernel, corresponding to Chris' integration branch from about a month ago, running the following stress test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram): while true; do mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd mount /dev/sdd /mnt snapshot_cmd="btrfs subvolume snapshot -r /mnt" snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`" fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100 umount /mnt done Which usually triggers the BUG_ON within less than 24 hours: [49558.618097] ------------[ cut here ]------------ [49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551! (...) [49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G W 3.19.0-btrfs-next-7+ #3 [49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000 [49558.620031] RIP: 0010:[<ffffffffa0476b1a>] [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs] (...) [49558.620031] Call Trace: [49558.620031] [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs] [49558.620031] [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43 [49558.620031] [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs] [49558.620031] [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs] [49558.620031] [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs] [49558.620031] [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs] [49558.620031] [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs] [49558.728054] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18 [49558.728054] [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs] [49558.728054] [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs] [49558.728054] [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs] [49558.728054] [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs] [49558.728054] [<ffffffff81155923>] ? iterate_supers+0x60/0xc4 [49558.728054] [<ffffffff8117966a>] ? do_sync_work+0x91/0x91 [49558.728054] [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22 [49558.728054] [<ffffffff81155939>] iterate_supers+0x76/0xc4 [49558.728054] [<ffffffff811798e8>] sys_sync+0x55/0x83 [49558.728054] [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17 Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
So creating a block group has 2 distinct phases: Phase 1 - creates the btrfs_block_group_cache item and adds it to the rbtree fs_info->block_group_cache_tree and to the corresponding list space_info->block_groups[]; Phase 2 - adds the block group item to the extent tree and corresponding items to the chunk tree. The first phase adds the block_group_cache_item to a list of pending block groups in the transaction handle, and phase 2 happens when btrfs_end_transaction() is called against the transaction handle. It happens that once phase 1 completes, other concurrent tasks that use their own transaction handle, but points to the same running transaction (struct btrfs_trans_handle->transaction), can use this block group for space allocations and therefore mark it dirty. Dirty block groups are tracked in a list belonging to the currently running transaction (struct btrfs_transaction) and not in the transaction handle (btrfs_trans_handle). This is a problem because once a task calls btrfs_commit_transaction(), it calls btrfs_start_dirty_block_groups() which will see all dirty block groups and attempt to start their writeout, including those that are still attached to the transaction handle of some concurrent task that hasn't called btrfs_end_transaction() yet - which means those block groups haven't gone through phase 2 yet and therefore when write_one_cache_group() is called, it won't find the block group items in the extent tree and abort the current transaction with -ENOENT, turning the fs into readonly mode and require a remount. Fix this by ignoring -ENOENT when looking for block group items in the extent tree when we attempt to start the writeout of the block group caches outside the critical section of the transaction commit. We will try again later during the critical section and if there we still don't find the block group item in the extent tree, we then abort the current transaction. This issue happened twice, once while running fstests btrfs/067 and once for btrfs/078, which produced the following trace: [ 3278.703014] WARNING: CPU: 7 PID: 18499 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [ 3278.707329] BTRFS: Transaction aborted (error -2) (...) [ 3278.731555] Call Trace: [ 3278.732396] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [ 3278.733860] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [ 3278.735312] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [ 3278.736874] [<ffffffffa03ada6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [ 3278.738302] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [ 3278.739520] [<ffffffffa03ada6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [ 3278.741222] [<ffffffffa03b9e56>] write_one_cache_group+0xae/0xbf [btrfs] [ 3278.742797] [<ffffffffa03c487b>] btrfs_start_dirty_block_groups+0x170/0x2b2 [btrfs] [ 3278.744492] [<ffffffffa03d309c>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] [ 3278.746084] [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf [ 3278.747249] [<ffffffffa03e5660>] btrfs_sync_file+0x313/0x387 [btrfs] [ 3278.748744] [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4 [ 3278.749958] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 [ 3278.751218] [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e [ 3278.754197] [<ffffffff8117ae54>] do_fsync+0x34/0x4e [ 3278.755192] [<ffffffff8117b07c>] SyS_fsync+0x10/0x14 [ 3278.756236] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [ 3278.757366] ---[ end trace 9a4d4df4969709aa ]--- Fixes: 1bbc621e ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
When waiting for the writeback of block group cache we returned immediately if there was an error during writeback without waiting for the ordered extent to complete. This left a short time window where if some other task attempts to start the writeout for the same block group cache it can attempt to add a new ordered extent, starting at the same offset (0) before the previous one is removed from the ordered tree, causing an ordered tree panic (calls BUG()). This normally doesn't happen in other write paths, such as buffered writes or direct IO writes for regular files, since before marking page ranges dirty we lock the ranges and wait for any ordered extents within the range to complete first. Fix this by making btrfs_wait_ordered_range() not return immediately if it gets an error from the writeback, waiting for all ordered extents to complete first. This issue happened often when running the fstest btrfs/088 and it's easy to trigger it by running in a loop until the panic happens: for ((i = 1; i <= 10000; i++)) do ./check btrfs/088 ; done [17156.862573] BTRFS critical (device sdc): panic in ordered_data_tree_panic:70: Inconsistency in ordered tree at offset 0 (errno=-17 Object already exists) [17156.864052] ------------[ cut here ]------------ [17156.864052] kernel BUG at fs/btrfs/ordered-data.c:70! (...) [17156.864052] Call Trace: [17156.864052] [<ffffffffa03876e3>] btrfs_add_ordered_extent+0x12/0x14 [btrfs] [17156.864052] [<ffffffffa03787e2>] run_delalloc_nocow+0x5bf/0x747 [btrfs] [17156.864052] [<ffffffffa03789ff>] run_delalloc_range+0x95/0x353 [btrfs] [17156.864052] [<ffffffffa038b7fe>] writepage_delalloc.isra.16+0xb9/0x13f [btrfs] [17156.864052] [<ffffffffa038d75b>] __extent_writepage+0x129/0x1f7 [btrfs] [17156.864052] [<ffffffffa038da5a>] extent_write_cache_pages.isra.15.constprop.28+0x231/0x2f4 [btrfs] [17156.864052] [<ffffffff810ad2af>] ? __module_text_address+0x12/0x59 [17156.864052] [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf [17156.864052] [<ffffffffa038df76>] extent_writepages+0x4b/0x5c [btrfs] [17156.864052] [<ffffffff81144431>] ? kmem_cache_free+0x9b/0xce [17156.864052] [<ffffffffa0376a46>] ? btrfs_submit_direct+0x3fc/0x3fc [btrfs] [17156.864052] [<ffffffffa0389cd6>] ? free_extent_state+0x8c/0xc1 [btrfs] [17156.864052] [<ffffffffa0374871>] btrfs_writepages+0x28/0x2a [btrfs] [17156.864052] [<ffffffff8110c4c8>] do_writepages+0x23/0x2c [17156.864052] [<ffffffff81102f36>] __filemap_fdatawrite_range+0x5a/0x61 [17156.864052] [<ffffffff81102f6e>] filemap_fdatawrite_range+0x13/0x15 [17156.864052] [<ffffffffa0383ef7>] btrfs_fdatawrite_range+0x21/0x48 [btrfs] [17156.864052] [<ffffffffa03ab89e>] __btrfs_write_out_cache.isra.14+0x2d9/0x3a7 [btrfs] [17156.864052] [<ffffffffa03ac1ab>] ? btrfs_write_out_cache+0x41/0xdc [btrfs] [17156.864052] [<ffffffffa03ac1fd>] btrfs_write_out_cache+0x93/0xdc [btrfs] [17156.864052] [<ffffffffa0363847>] ? btrfs_start_dirty_block_groups+0x13a/0x2b2 [btrfs] [17156.864052] [<ffffffffa03638e6>] btrfs_start_dirty_block_groups+0x1d9/0x2b2 [btrfs] [17156.864052] [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf [17156.864052] [<ffffffffa037209e>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] [17156.864052] [<ffffffffa034c748>] btrfs_sync_fs+0xe1/0x12d [btrfs] Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
If the writeback of an inode cache failed we were unnecessarilly attempting to release again the delalloc metadata that we previously reserved. However attempting to do this a second time triggers an assertion at drop_outstanding_extent() because we have no more outstanding extents for our inode cache's inode. If we were able to start writeback of the cache the reserved metadata space is released at btrfs_finished_ordered_io(), even if an error happens during writeback. So make sure we don't repeat the metadata space release if writeback started for our inode cache. This issue was trivial to reproduce by running the fstest btrfs/088 with "-o inode_cache", which triggered the assertion leading to a BUG() call and requiring a reboot in order to run the remaining fstests. Trace produced by btrfs/088: [255289.385904] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5276 [255289.388094] ------------[ cut here ]------------ [255289.389184] kernel BUG at fs/btrfs/ctree.h:4057! [255289.390125] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC (...) [255289.392068] Call Trace: [255289.392068] [<ffffffffa035e774>] drop_outstanding_extent+0x3d/0x6d [btrfs] [255289.392068] [<ffffffffa0364988>] btrfs_delalloc_release_metadata+0x54/0xe3 [btrfs] [255289.392068] [<ffffffffa03b4174>] btrfs_write_out_ino_cache+0x95/0xad [btrfs] [255289.392068] [<ffffffffa036f5c4>] btrfs_save_ino_cache+0x275/0x2dc [btrfs] [255289.392068] [<ffffffffa03e2d83>] commit_fs_roots.isra.12+0xaa/0x137 [btrfs] [255289.392068] [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf [255289.392068] [<ffffffffa037841f>] ? btrfs_commit_transaction+0x4b1/0x9c9 [btrfs] [255289.392068] [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46 [255289.392068] [<ffffffffa037842e>] btrfs_commit_transaction+0x4c0/0x9c9 [btrfs] (...) Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 07 5月, 2015 1 次提交
-
-
由 Filipe Manana 提交于
We were passing a flags value that differed from the intention in commit 2b108268 ("Btrfs: don't use highmem for free space cache pages"). This caused problems in a ARM machine, leaving btrfs unusable there. Reported-by: NMerlijn Wajer <merlijn@wizzup.org> Tested-by: NMerlijn Wajer <merlijn@wizzup.org> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-