- 13 4月, 2015 19 次提交
-
-
由 Dongsheng Yang 提交于
Currently, for pre_alloc or delay_alloc, the bytes will be accounted in space_info by the three guys. space_info->bytes_may_use --- space_info->reserved --- space_info->used. But on the other hand, in qgroup, there are only two counters to account the bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts bytes in space_info->bytes_may_use and qg->excl accounts bytes in space_info->used. So the bytes in space_info->reserved is not accounted in qgroup. If so, there is a window we can exceed the quota limit when bytes is in space_info->reserved. Example: # btrfs quota enable /mnt # btrfs qgroup limit -e 10M /mnt # for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done # sync # btrfs qgroup show -pcre /mnt qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 20987904 20987904 0 10485760 --- --- qg->excl is 20987904 larger than max_excl 10485760. This patch introduce a new counter named may_use to qgroup, then there are three counters in qgroup to account bytes in space_info as below. space_info->bytes_may_use --- space_info->reserved --- space_info->used. qgroup->may_use --- qgroup->reserved --- qgroup->excl With this patch applied: # btrfs quota enable /mnt # btrfs qgroup limit -e 10M /mnt # for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done fallocate: /mnt/data9: fallocate failed: Disk quota exceeded fallocate: /mnt/data10: fallocate failed: Disk quota exceeded fallocate: /mnt/data11: fallocate failed: Disk quota exceeded fallocate: /mnt/data12: fallocate failed: Disk quota exceeded fallocate: /mnt/data13: fallocate failed: Disk quota exceeded fallocate: /mnt/data14: fallocate failed: Disk quota exceeded fallocate: /mnt/data15: fallocate failed: Disk quota exceeded fallocate: /mnt/data16: fallocate failed: Disk quota exceeded fallocate: /mnt/data17: fallocate failed: Disk quota exceeded fallocate: /mnt/data18: fallocate failed: Disk quota exceeded fallocate: /mnt/data19: fallocate failed: Disk quota exceeded # sync # btrfs qgroup show -pcre /mnt qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 9453568 9453568 0 10485760 --- --- Reported-by: NCyril SCETBON <cyril.scetbon@free.fr> Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dongsheng Yang 提交于
When we exceed quota limit in writing, we will free some reserved extent when we need to drop but not free account in qgroup. It means, each time we exceed quota in writing, there will be some remain space in qg->reserved we can not use any more. If things go on like this, the all space will be ate up. Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dongsheng Yang 提交于
Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dongsheng Yang 提交于
btrfs_limit_group use arg limit to override the old qgroup_limit of corresponding qgroup. However, we should override part of old qgroup_limit according to the bit which has been set in arg limit. Signed-off-by: NFan Chengniang <fancn.fnst@cn.fujitsu.com> Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dongsheng Yang 提交于
When we commit_transaction(), qgroups in btree should be updated. But, limit info is not considered currently. It will cause a problem when a qgroup of a snapshot inherit the limit info from srcqgroup, then there is an inconsistency. Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dongsheng Yang 提交于
Cleanup: Change the parameter of update_qgroup_limit_item() to the family of update_qgroup_xxx_item(). Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dongsheng Yang 提交于
When we call btrfs_qgroup_inherit() with BTRFS_QGROUP_INHERIT_SET_LIMITS, btrfs will update the limit info of qgroup in btree but forget to update the qgroup in rbtree at the same time. It obviousely will cause an inconsistency. This patch fix it by updating the rbtree at the same time. Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dongsheng Yang 提交于
Currently, when we snapshot a subvol, snapshot will not copy the limits from srcqgroup. This patch make the qgroup in snapshot inherit the limit info when create a snapshot. Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
Reproduce: while true; do dd if=/dev/zero of=/mnt/btrfs/file count=[75% fs_size] rm /mnt/btrfs/file done Then we can see above loop failed on NO_SPACE. It it long-term problem since very beginning, because delayed-iput after rm are not run. We already have commit_transaction() in alloc_space code, but it is not triggered in above case. This patch trigger commit_transaction() to run delayed-iput and reflash pinned-space to to make write success. It is based on previous fix of delayed-iput in commit_transaction(), need to be applied on top of: btrfs: Fix NO_SPACE bug caused by delayed-iput Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
Steps to reproduce: while true; do dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%] rm /btrfs_dir/file sync done And we'll see dd failed because btrfs return NO_SPACE. Reason: Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs() in end to free fs space for next write, but sometimes it hadn't done work on time, because btrfs-cleaner thread get delayed-iputs from list before, but do iput() after next write. This is log: [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs() [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs() [ 2569.087554] comm=sync func=btrfs_commit_transaction() end [ 2569.191081] comm=dd begin [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28 [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888 [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512 ... [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496 [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end Fix: Make btrfs_commit_transaction() wait current running btrfs-cleaner's delayed-iputs() done in end. Test: Use script similar to above(more complex), before patch: 7 failed in 100 * 20 loop. after patch: 0 failed in 100 * 20 loop. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
space_info's value calculation is some complex and easy to cause bug, add WARN_ON() to help debug. Changelog v1->v2: Put WARN_ON()s under the ENOSPC_DEBUG mount option. Suggested by: David Sterba <dsterba@suse.cz> Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
Bug1: space_info->bytes_readonly was set to very large(negative) value in btrfs_remove_block_group(). Reason: Current code set block_group_cache->pinned = 0 in btrfs_delete_unused_bgs(), but above space was not counted to space_info->bytes_readonly. Then in btrfs_remove_block_group(): block_group->space_info->bytes_readonly -= block_group->key.offset; We can see following value in trace: btrfs_remove_block_group: pid=2677 comm=btrfs-cleaner WARNING: bytes_readonly=12582912, key.offset=134217728 Bug2: space_info->total_bytes_pinned grow to value larger than fs size. In a 1.2G fs, we can get following trace log: at first: ZL_DEBUG: add_pinned_bytes: pid=2710 comm=sync change total_bytes_pinned flags=1 869793792 + 95944704 = 965738496 after some op: ZL_DEBUG: add_pinned_bytes: pid=2770 comm=sync change total_bytes_pinned flags=1 1780178944 + 95944704 = 1876123648 after some op: ZL_DEBUG: add_pinned_bytes: pid=3193 comm=sync change total_bytes_pinned flags=1 2924568576 + 95551488 = 3020120064 ... Reason: Similar to bug1, we also need to adjust space_info->total_bytes_pinned in above code block. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
If we have any chance to make a successful write, we should not give up. This patch adjust commit-transaction condition from: pinned >= wanted to left + pinned >= wanted Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
It is another reason for NO_SPACE case. When we found enough free space in loop and saved them to max_hole_start/size before, and tail space contains pending extent, origional innocent max_hole_start/size are reset in retry. As a result, find_free_dev_extent() returns less space than it can, and cause NO_SPACE in user program. Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
Old code bypass commit transaction when we don't have enough pinned space, but another case is there exist freed bgs in current transction, it have possibility to make alloc_chunk success. This patch modify the condition to: if (have_free_bg || have_pinned_space) commit_transaction() Confirmed above action by printk before and after patch. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chris Mason 提交于
Commit 0d97a64e0 creates a new variable but doesn't always set it up. This puts it back to the original method (key.offset + 1) for the cases not covered by Filipe's new logic. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
If we attempt to clone a 0 length region into a file we can end up inserting a range in the inode's extent_io tree with a start offset that is greater then the end offset, which triggers immediately the following warning: [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3914.620886] BTRFS: end < start 4095 4096 (...) [ 3914.638093] Call Trace: [ 3914.638636] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3914.639620] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3914.640789] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3914.642041] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3914.643236] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3914.644441] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3914.645711] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3914.646914] [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33 [ 3914.648058] [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs] [ 3914.650105] [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs] [ 3914.651361] [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs] [ 3914.652761] [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs] [ 3914.654128] [<ffffffff811226dd>] ? might_fault+0x58/0xb5 [ 3914.655320] [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs] (...) [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]--- This later makes the inode eviction handler enter an infinite loop that keeps dumping the following warning over and over: [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3915.119913] BTRFS: end < start 4095 4096 (...) [ 3915.137394] Call Trace: [ 3915.137913] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3915.139154] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3915.140316] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3915.141505] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3915.142709] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3915.143849] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3915.145120] [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs] [ 3915.146352] [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50 [ 3915.147565] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3915.148785] [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33 [ 3915.149931] [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs] [ 3915.151154] [<ffffffff81168904>] evict+0xa0/0x148 [ 3915.152094] [<ffffffff811689e5>] dispose_list+0x39/0x43 [ 3915.153081] [<ffffffff81169564>] evict_inodes+0xdc/0xeb [ 3915.154062] [<ffffffff81154418>] generic_shutdown_super+0x49/0xef [ 3915.155193] [<ffffffff811546d1>] kill_anon_super+0x13/0x1e [ 3915.156274] [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs] (...) [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]--- So just bail out of the clone ioctl if the length of the region to clone is zero, without locking any extent range, in order to prevent this issue (same behaviour as a pwrite with a 0 length for example). This is trivial to reproduce. For example, the steps for the test I just made for fstests: mkfs.btrfs -f SCRATCH_DEV mount SCRATCH_DEV $SCRATCH_MNT touch $SCRATCH_MNT/foo touch $SCRATCH_MNT/bar $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar umount $SCRATCH_MNT A test case for fstests follows soon. CC: <stable@vger.kernel.org> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
If we pass a length of 0 to the extent_same ioctl, we end up locking an extent range with a start offset greater then its end offset (if the destination file's offset is greater than zero). This results in a warning from extent_io.c:insert_state through the following call chain: btrfs_extent_same() btrfs_double_lock() lock_extent_range() lock_extent(inode->io_tree, offset, offset + len - 1) lock_extent_bits() __set_extent_bit() insert_state() --> WARN_ON(end < start) This leads to an infinite loop when evicting the inode. This is the same problem that my previous patch titled "Btrfs: fix inode eviction infinite loop after cloning into it" addressed but for the extent_same ioctl instead of the clone ioctl. CC: <stable@vger.kernel.org> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NOmar Sandoval <osandov@osandov.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
While searching for extents to clone we might find one where we only use a part of it coming from its tail. If our destination inode is the same the source inode, we end up removing the tail part of the extent item and insert after a new one that point to the same extent with an adjusted key file offset and data offset. After this we search for the next extent item in the fs/subvol tree with a key that has an offset incremented by one. But this second search leaves us at the new extent item we inserted previously, and since that extent item has a non-zero data offset, it it can make us call btrfs_drop_extents with an empty range (start == end) which causes the following warning: [23978.537119] WARNING: CPU: 6 PID: 16251 at fs/btrfs/file.c:550 btrfs_drop_extent_cache+0x43/0x385 [btrfs]() (...) [23978.557266] Call Trace: [23978.557978] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [23978.559191] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [23978.560699] [<ffffffffa047f0ea>] ? btrfs_drop_extent_cache+0x43/0x385 [btrfs] [23978.562389] [<ffffffff8104544d>] warn_slowpath_null+0x1a/0x1c [23978.563613] [<ffffffffa047f0ea>] btrfs_drop_extent_cache+0x43/0x385 [btrfs] [23978.565103] [<ffffffff810e3a18>] ? time_hardirqs_off+0x15/0x28 [23978.566294] [<ffffffff81079ff8>] ? trace_hardirqs_off+0xd/0xf [23978.567438] [<ffffffffa047f73d>] __btrfs_drop_extents+0x6b/0x9e1 [btrfs] [23978.568702] [<ffffffff8107c03f>] ? trace_hardirqs_on+0xd/0xf [23978.569763] [<ffffffff811441c0>] ? ____cache_alloc+0x69/0x2eb [23978.570817] [<ffffffff81142269>] ? virt_to_head_page+0x9/0x36 [23978.571872] [<ffffffff81143c15>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb [23978.573466] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18 [23978.574962] [<ffffffffa0480d07>] btrfs_drop_extents+0x66/0x7f [btrfs] [23978.576179] [<ffffffffa049aa35>] btrfs_clone+0x516/0xaf5 [btrfs] [23978.577311] [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs] [23978.578520] [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs] [23978.580282] [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs] (...) [23978.591887] ---[ end trace 988ec2a653d03ed3 ]--- Then we attempt to insert a new extent item with a key that already exists, which makes btrfs_insert_empty_item return -EEXIST resulting in abortion of the current transaction: [23978.594355] WARNING: CPU: 6 PID: 16251 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() (...) [23978.622589] Call Trace: [23978.623181] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [23978.624359] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [23978.625573] [<ffffffffa044ab6c>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [23978.626971] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [23978.628003] [<ffffffff8108a6c8>] ? vprintk_default+0x1d/0x1f [23978.629138] [<ffffffffa044ab6c>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [23978.630528] [<ffffffffa049ad1b>] btrfs_clone+0x7fc/0xaf5 [btrfs] [23978.631635] [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs] [23978.632886] [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs] [23978.634119] [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs] (...) [23978.647714] ---[ end trace 988ec2a653d03ed4 ]--- This is wrong because we should not process the extent item that we just inserted previously, and instead process the extent item that follows it in the tree For example for the test case I wrote for fstests: bs=$((64 * 1024)) mkfs.btrfs -f -l $bs -O ^no-holes /dev/sdc mount /dev/sdc /mnt xfs_io -f -c "pwrite -S 0xaa $(($bs * 2)) $(($bs * 2))" /mnt/foo $CLONER_PROG -s $((3 * $bs)) -d $((267 * $bs)) -l 0 /mnt/foo /mnt/foo $CLONER_PROG -s $((217 * $bs)) -d $((95 * $bs)) -l 0 /mnt/foo /mnt/foo The second clone call fails with -EEXIST, because when we process the first extent item (offset 262144), we drop part of it (counting from the end) and then insert a new extent item with a key greater then the key we found. The next time we search the tree we search for a key with offset 262144 + 1, which leaves us at the new extent item we have just inserted but we think it refers to an extent that we need to clone. Fix this by ensuring the next search key uses an offset corresponding to the offset of the key we found previously plus the data length of the corresponding extent item. This ensures we skip new extent items that we inserted and works for the case of implicit holes too (NO_HOLES feature). A test case for fstests follows soon. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 11 4月, 2015 11 次提交
-
-
由 Chris Mason 提交于
Near the end of close_ctree, we're calling btrfs_free_block_rsv to free up the orphan rsv. The problem is this call updates the space_info, which has already been freed. This adds a new __ function that directly calls kfree instead of trying to update the space infos. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chris Mason 提交于
We loop through all of the dirty block groups during commit and write the free space cache. In order to make sure the cache is currect, we do this while no other writers are allowed in the commit. If a large number of block groups are dirty, this can introduce long stalls during the final stages of the commit, which can block new procs trying to change the filesystem. This commit changes the block group cache writeout to take appropriate locks and allow it to run earlier in the commit. We'll still have to redo some of the block groups, but it means we can get most of the work out of the way without blocking the entire FS. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chris Mason 提交于
In order to create the free space cache concurrently with FS modifications, we need to take a few block group locks. The cache code also does kmap, which would schedule with the locks held. Instead of going through kmap_atomic, lets just use lowmem for the cache pages. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chris Mason 提交于
Block group cache writeout is currently waiting on the pages for each block group cache before moving on to writing the next one. This commit switches things around to send down all the caches and then wait on them in batches. The end result is much faster, since we're keeping the disk pipeline full. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chris Mason 提交于
We'll need to put the io_ctl into the block_group cache struct, so name it struct btrfs_io_ctl and move it into ctree.h Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
btrfs_evict_inode() needs to be more careful about stealing from the global_rsv. We dont' want to end up aborting commit with ENOSPC just because the evict_inode code was too greedy. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
We're triggering a huge number of commits from btrfs_async_reclaim_metadata_space. These aren't really requried, because everyone calling the async reclaim code is going to end up triggering a commit on their own. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
This changes our delayed refs calculations to include the space needed to write back dirty block groups. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chris Mason 提交于
When truncate starts, it allocates some space in the block reserves so that we'll have enough to update metadata along the way. For very large files, we can easily go through all of that space as we loop through the extents. This changes truncate to refill the space reservation as it progresses through the file. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
As we delete large extents, we end up doing huge amounts of COW in order to delete the corresponding crcs. This adds accounting so that we keep track of that space and flushing of delayed refs so that we don't build up too much delayed crc work. This helps limit the delayed work that must be done at commit time and tries to avoid ENOSPC aborts because the crcs eat all the global reserves. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chris Mason 提交于
When we are deleting large files with large extents, we are building up a huge set of delayed refs for processing. Truncate isn't checking often enough to see if we need to back off and process those, or let a commit proceed. The end result is long stalls after the rm, and very long commit times. During the commits, other processes back up waiting to start new transactions and we get into trouble. Signed-off-by: NChris Mason <clm@fb.com>
-
- 02 4月, 2015 1 次提交
-
-
由 Guenter Roeck 提交于
Building alpha:allmodconfig fails with fs/btrfs/inode.c: In function 'check_direct_IO': fs/btrfs/inode.c:8050:2: error: implicit declaration of function 'iov_iter_alignment' due to a missing include file. Fixes: 3737c63e1fb0 ("fs: move struct kiocb to fs.h") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NGuenter Roeck <linux@roeck-us.net> Acked-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
- 01 4月, 2015 1 次提交
-
-
由 Chris Mason 提交于
The error handling path for alloc_reserved_tree_block is calling btrfs_free_and_pin_reserved_extent with a spinning tree lock held. This might sleep as we allocate extent_state objects: BUG: sleeping function called from invalid context at mm/slub.c:1268 in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7 5 locks held by kworker/u4:7/11093: #0: ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520 #1: ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520 #2: (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs] #3: (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs] #4: (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs] CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G W 4.0.0-rc6-default+ #246 Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06 Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] 00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848 ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898 0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb Call Trace: [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c [<ffffffff8109d516>] ___might_sleep+0xf6/0x140 [<ffffffff8109d5b2>] __might_sleep+0x52/0x90 [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34 [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0 [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs] [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs] [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs] [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs] [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34 [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs] [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs] [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs] [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs] [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs] [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs] [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs] [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs] [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs] [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs] [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs] [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs] [<ffffffff81091dcb>] process_one_work+0x1cb/0x520 [<ffffffff81091d51>] ? process_one_work+0x151/0x520 [<ffffffff811c7abf>] ? seq_read+0x3f/0x400 [<ffffffff8109260b>] worker_thread+0x5b/0x4e0 [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0 [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450 [<ffffffff81098686>] kthread+0xf6/0x120 [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0 [<ffffffff81ab8088>] ret_from_fork+0x58/0x90 [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0 ------------[ cut here ]------------ This changes things to free the path first, which will also unlock the extent buffer. Signed-off-by: NChris Mason <clm@fb.com> Reported-by: NDave Sterba <dsterba@suse.cz> Tested-by: NDave Sterba <dsterba@suse.cz>
-
- 27 3月, 2015 8 次提交
-
-
由 Liu Bo 提交于
This was used to make sure that a fresh btrfs from an older mkfs.btrfs, but it also allows us to mount a buggy btrfs if this btrfs has the right superblock head part but has something wrong with chunk tree part[1], and after that we can hit BUG_ON()s set in the code to prevent something impossible. Since David has released "Btrfs progs v3.19-rc2", just remove the check, if anyone who wants to make a fresh btrfs, please use the latest one. [1]: http://www.spinics.net/lists/linux-btrfs/msg42358.htmlSigned-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NOmar Sandoval <osandov@osandov.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Jeff Mahoney 提交于
Orphans in the fs tree are cleaned up via open_ctree and subvolume orphans are cleaned via btrfs_lookup_dentry -- except when a default subvolume is in use. The name for the default subvolume uses a manual lookup that doesn't trigger orphan cleanup and needs to trigger it manually as well. This doesn't apply to the remount case since the subvolumes are cleaned up by walking the root radix tree. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Tom Van Braeckel 提交于
The private_data member of the Btrfs control device file (/dev/btrfs-control) is used to hold the current transaction and needs to be initialized to NULL to signify that no transaction is in progress. We explicitly set the control file's private_data to NULL to be independent of whatever value the misc subsystem initializes it to. Backstory: ---------- The misc subsystem (which is used by /dev/btrfs-control) initializes a file's private_data to point to the misc device when a driver has registered a custom open file operation and initializes it to NULL when a custom open file operation has *not* been provided. This subtle quirk is confusing, to the point where kernel code registers *empty* file open operations to have private_data point to the misc device structure. And it leads to bugs, where the addition or removal of a custom open file operation surprisingly changes the initial contents of a file's private_data structure. To simplify things in the misc subsystem, a patch [1] has been proposed to *always* set private_data to point to the misc device instead of only doing this when a custom open file operation has been registered. But before we can fix this in the misc subsystem itself, we need to modify the (few) drivers that rely on this very subtle behavior. [1] https://lkml.org/lkml/2014/12/4/939Signed-off-by: NMartin Kepplinger <martink@posteo.de> Signed-off-by: NTom Van Braeckel <tomvanbraeckel@gmail.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chengyu Song 提交于
fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was the last extent that will fit in user array. If 1 is returned, the return value may eventually returned to user space, which should not happen, according to manpage of ioctl. Signed-off-by: NChengyu Song <csong84@gatech.edu> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly works: $ touch file $ setfattr -n user. -v 1 file $ getfattr -d file user.="1" ie. the missing attribute name after the namespace. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291Reported-by: NWilliam Douglas <william.douglas@intel.com> CC: <stable@vger.kernel.org> # 2.6.29+ Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
While committing a transaction we free the log roots before we write the new super block. Freeing the log roots implies marking the disk location of every node/leaf (metadata extent) as pinned before the new super block is written. This is to prevent the disk location of log metadata extents from being reused before the new super block is written, otherwise we would have a corrupted log tree if before the new super block is written a crash/reboot happens and the location of any log tree metadata extent ended up being reused and rewritten. Even though we pinned the log tree's metadata extents, we were issuing a discard against them if the fs was mounted with the -o discard option, resulting in corruption of the log tree if a crash/reboot happened before writing the new super block - the next time the fs was mounted, during the log replay process we would find nodes/leafs of the log btree with a content full of zeroes, causing the process to fail and require the use of the tool btrfs-zero-log to wipeout the log tree (and all data previously fsynced becoming lost forever). Fix this by not doing a discard when pinning an extent. The discard will be done later when it's safe (after the new super block is committed) at extent-tree.c:btrfs_finish_extent_commit(). Fixes: e688b725 (Btrfs: fix extent pinning bugs in the tree log) CC: <stable@vger.kernel.org> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
We can get into inconsistency between inodes and directory entries after fsyncing a directory. The issue is that while a directory gets the new dentries persisted in the fsync log and replayed at mount time, the link count of the inode that directory entries point to doesn't get updated, staying with an incorrect link count (smaller then the correct value). This later leads to stale file handle errors when accessing (including attempt to delete) some of the links if all the other ones are removed, which also implies impossibility to delete the parent directories, since the dentries can not be removed. Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2), when fsyncing a directory, new files aren't logged (their metadata and dentries) nor any child directories. So this patch fixes this issue too, since it has the same resolution as the incorrect inode link count issue mentioned before. This is very easy to reproduce, and the following excerpt from my test case for xfstests shows how: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our main test file and directory. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io mkdir $SCRATCH_MNT/mydir # Make sure all metadata and data are durably persisted. sync # Add a hard link to 'foo' inside our test directory and fsync only the # directory. The btrfs fsync implementation had a bug that caused the new # directory entry to be visible after the fsync log replay but, the inode # of our file remained with a link count of 1. ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2 # Add a few more links and new files. # This is just to verify nothing breaks or gives incorrect results after the # fsync log is replayed. ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3 $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2 # Add some subdirectories and new files and links to them. This is to verify # that after fsyncing our top level directory 'mydir', all the subdirectories # and their files/links are registered in the fsync log and exist after the # fsync log is replayed. mkdir -p $SCRATCH_MNT/mydir/x/y/z ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link touch $SCRATCH_MNT/mydir/x/y/z/qwerty # Now fsync only our top directory. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir # And fsync now our new file named 'hello', just to verify later that it has # the expected content and that the previous fsync on the directory 'mydir' had # no bad influence on this fsync. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Verify the content of our file 'foo' remains the same as before, 8192 bytes, # all with the value 0xaa. echo "File 'foo' content after log replay:" od -t x1 $SCRATCH_MNT/foo # Remove the first name of our inode. Because of the directory fsync bug, the # inode's link count was 1 instead of 5, so removing the 'foo' name ended up # deleting the inode and the other names became stale directory entries (still # visible to applications). Attempting to remove or access the remaining # dentries pointing to that inode resulted in stale file handle errors and # made it impossible to remove the parent directories since it was impossible # for them to become empty. echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)" rm -f $SCRATCH_MNT/foo # Now verify that all files, links and directories created before fsyncing our # directory exist after the fsync log was replayed. [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing" [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing" [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing" [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing" [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \ echo "Link mydir/x/y/foo_y_link is missing" [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \ echo "Link mydir/x/y/z/foo_z_link is missing" [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \ echo "File mydir/x/y/z/qwerty is missing" # We expect our file here to have a size of 64Kb and all the bytes having the # value 0xff. echo "file 'hello' content after log replay:" od -t x1 $SCRATCH_MNT/hello # Now remove all files/links, under our test directory 'mydir', and verify we # can remove all the directories. rm -f $SCRATCH_MNT/mydir/x/y/z/* rmdir $SCRATCH_MNT/mydir/x/y/z rm -f $SCRATCH_MNT/mydir/x/y/* rmdir $SCRATCH_MNT/mydir/x/y rmdir $SCRATCH_MNT/mydir/x rm -f $SCRATCH_MNT/mydir/* rmdir $SCRATCH_MNT/mydir # An fsck, run by the fstests framework everytime a test finishes, also detected # the inconsistency and printed the following error message: # # root 5 inode 257 errors 2001, no inode item, link count wrong # unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref # unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref status=0 exit The expected golden output for the test is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 65536/65536 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File 'foo' content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 file 'foo' link count after log replay: 5 file 'hello' content after log replay: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0200000 Which is the output after this patch and when running the test against ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's output is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 65536/65536 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File 'foo' content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 file 'foo' link count after log replay: 1 Link mydir/foo_2 is missing Link mydir/foo_3 is missing Link mydir/x/y/foo_y_link is missing Link mydir/x/y/z/foo_z_link is missing File mydir/x/y/z/qwerty is missing file 'hello' content after log replay: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0200000 rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty Fsck, without this fix, also complains about the wrong link count: root 5 inode 257 errors 2001, no inode item, link count wrong unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref So fix this by logging the inodes that the dentries point to when fsyncing a directory. A test case for xfstests follows. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
After looking at Liu Bo's recent patch (titled "Btrfs: fix comp_oper to get right order") I realized the search made by qgroup_oper_exists() was buggy because its rbtree navigation comparison function, comp_oper_exist(), only looks at the fields bytenr and ref_root of a tree node, ignoring the seq field completely. This was wrong because when we insert a node into the rbtree we use comp_oper(), which takes a decision based first on bytenr, then on seq and then on the ref_root field. That means qgroup_oper_exists() could miss the fact that at least one operation with given bytenr and ref_root exists. Consider the following simple example of a 3 nodes qgroup operations rbtree (created using comp_oper before this patch), where each node's key is a tuple with the shape (bytenr, seq, ref_root, op): [ (4096, 2, 20, op X) ] / \ / \ [ (4096, 1, 5, op Y) ] [ (4096, 3, 10, op Z) ] qgroup_oper_exists() when called to search for an existing operation for bytenr 4096 and ref root 10 wouldn't find anything because it would go to the left subtree instead of the right subtree, since comp_oper_exits() ignores the seq field completely. Fix this by changing the insertion navigation function to use the ref_root field right after using the bytenr field and before using the seq field, so that qgroup_oper_exists() / comp_oper_exist() work as expected. This patch applies on top of the patch mentioned above from Liu. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-