- 07 1月, 2016 13 次提交
-
-
由 David Sterba 提交于
Replace the integers by enums for better readability. The value 2 does not have any meaning since a7175319 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
There are a few statically initialized arrays that can be made const. The remaining (like file_system_type, sysfs attributes or prop handlers) do not allow that due to type mismatch when passed to the APIs or because the structures are modified through other members. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
* struct extent_io_ops * struct btrfs_free_space_op Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Preparatory work for making btrfs_free_space_op constant. In test_steal_space_from_bitmap_to_extent, we substitute use_bitmap with own version thus preventing constification. We can rework it so we replace the whole structure with the correct function pointers. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Although we prefer to use separate caches for various structs, it seems better not to do that for struct btrfs_delalloc_work. Objects of this type are allocated rarely, when transaction commit calls btrfs_start_delalloc_roots, requesting delayed iputs. The objects are temporary (with some IO involved) but still allocated and freed within __start_delalloc_inodes. Memory allocation failure is handled. The slab cache is empty most of the time (observed on several systems), so if we need to allocate a new slab object, the first one has to allocate a full page. In a potential case of low memory conditions this might fail with higher probability compared to using the generic slab caches. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The helper btrfs_alloc_workqueue will add the "btrfs-" prefix. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
We can handle the special case of num_stripes == 0 directly inside btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to catch other unhandled cases where we fail to validate external data. A crafted or corrupted image crashes at mount time: BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0 BTRFS info (device loop0): disk space caching is enabled BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()! Kernel panic - not syncing: BUG! CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25 Stack: 637af890 60062489 602aeb2e 604192ba 60387961 00000011 637af8a0 6038a835 637af9c0 6038776b 634ef32b 00000000 Call Trace: [<6001c86d>] show_stack+0xfe/0x15b [<6038a835>] dump_stack+0x2a/0x2c [<6038776b>] panic+0x13e/0x2b3 [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff [<601cfbbe>] open_ctree+0x192d/0x27af [<6019c2c1>] btrfs_mount+0x8f5/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<6019bcb0>] btrfs_mount+0x2e4/0xb9a [<600bc9a7>] mount_fs+0x11/0xf3 [<600d5167>] vfs_kern_mount+0x75/0x11a [<600d710b>] do_mount+0xa35/0xbc9 [<600d7557>] SyS_mount+0x95/0xc8 [<6001e884>] handle_syscall+0x6b/0x8e Reported-by: NJiri Slaby <jslaby@suse.com> Reported-by: NVegard Nossum <vegard.nossum@oracle.com> CC: stable@vger.kernel.org # 3.19+ Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now and has 8 unused bytes. Reducing the level type to u8 makes it possible to squeeze it to the padding byte after key. The bitfields were switched to bool as there's space to store the full byte without increasing the whole structure, besides that the generated assembly is smaller. struct btrfs_delayed_extent_op { struct btrfs_disk_key key; /* 0 17 */ u8 level; /* 17 1 */ bool update_key; /* 18 1 */ bool update_flags; /* 19 1 */ bool is_data; /* 20 1 */ /* XXX 3 bytes hole, try to pack */ u64 flags_to_set; /* 24 8 */ /* size: 32, cachelines: 1, members: 6 */ /* sum members: 29, holes: 1, sum holes: 3 */ /* last cacheline: 32 bytes */ }; The final size is 32 bytes which gives +26 object per slab page. text data bss dec hex filename 938811 43670 23144 1005625 f5839 fs/btrfs/btrfs.ko.before 938747 43670 23144 1005561 f57f9 fs/btrfs/btrfs.ko.after Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Inodes for delayed iput allocate a trivial helper structure, let's place the list hook directly into the inode and save a kmalloc (killing a __GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode. The inode can be put into the delayed_iputs list more than once and we have to keep the count. This means we can't use the list_splice to process a bunch of inodes because we'd lost track of the count if the inode is put into the delayed iputs again while it's processed. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Since we will add support for -d dup for non-mixed filesystem, kernel need to support converting to this raid-type. This patch remove limitation of above case. Tested by following script: (combination of dup conversion with fsck): export TEST_DEV='/dev/vdc' export TEST_DIR='/var/ltf/tester/mnt' do_dup_test() { local m_from="$1" local d_from="$2" local m_to="$3" local d_to="$4" echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to" umount "$TEST_DIR" &>/dev/null ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1 mount "$TEST_DEV" "$TEST_DIR" || return 1 cp -a /sbin/* "$TEST_DIR" [[ "$m_from" != "$m_to" ]] && { ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1 } [[ "$d_from" != "$d_to" ]] && { local opt=() [[ "$d_to" == single ]] && opt+=("-f") ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1 } umount "$TEST_DIR" || return 1 ./btrfsck "$TEST_DEV" || return 1 echo return 0 } test_all() { for m_from in single dup; do for d_from in single dup; do for m_to in single dup; do for d_to in single dup; do do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1 done done done done } test_all Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We hit this panic on a few of our boxes this week where we have an ordered_extent with an NULL inode. We do an igrab() of the inode in writepages, but weren't doing it in writepage which can be called directly from the VM on dirty pages. If the inode has been unlinked then we could have I_FREEING set which means igrab() would return NULL and we get this panic. Fix this by trying to igrab in btrfs_writepage, and if it returns NULL then just redirty the page and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
Looks like oversight, call brelse() when checksum fails. Further down the code, in the non error path, we do call brelse() and so we don't see brelse() in the goto error paths. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 12月, 2015 2 次提交
-
-
由 Chris Mason 提交于
prepare_pages() may end up calling prepare_uptodate_page() twice if our write only spans a single page. But if the first call returns an error, our page will be unlocked and its not safe to call it again. This bug goes all the way back to 2011, and it's not something commonly hit. While we're here, add a more explicit check for the page being truncated away. The bare lock_page() alone is protected only by good thoughts and i_mutex, which we're sure to regret eventually. Reported-by: NDave Jones <dsj@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Chris Mason 提交于
Dave Jones found a warning from kasan in setup_cluster_bitmaps() ================================================================== BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at addr ffff88039bef6828 Read of size 8 by task nfsd/1009 page:ffffea000e6fbd80 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x8000000000000000() page dumped because: kasan: bad access detected CPU: 1 PID: 1009 Comm: nfsd Tainted: G W 4.4.0-rc3-backup-debug+ #1 ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e 0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0 ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280 Call Trace: [<ffffffffa680a43e>] dump_stack+0x4b/0x6d [<ffffffffa62638d1>] kasan_report_error+0x501/0x520 [<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0 [<ffffffffa6263948>] kasan_report+0x58/0x60 [<ffffffffa6814b00>] ? rb_last+0x10/0x40 [<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0 [<ffffffffa6262ead>] __asan_load8+0x5d/0x70 [<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0 [<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400 [<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640 [<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0 [<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0 [<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40 [<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520 Andrey noticed this was because we were doing list_first_entry on a list that might be empty. Rework the tests a bit so we don't do that. Signed-off-by: NChris Mason <clm@fb.com> Reprorted-by: NAndrey Ryabinin <ryabinin.a.a@gmail.com> Reported-by: NDave Jones <dsj@fb.com>
-
- 10 12月, 2015 3 次提交
-
-
由 Holger Hoffstätte 提交于
When an inconsistent space cache is detected during loading we log a warning that users frequently mistake as instruction to invalidate the cache manually, even though this is not required. Fix the message to indicate that the cache will be rebuilt automatically. Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com> Acked-by: NFilipe Manana <fdmanana@suse.com>
-
由 Filipe Manana 提交于
If we fail to allocate a new data chunk, we were jumping to the error path without release the transaction handle we got before. Fix this by always releasing it before doing the jump. Fixes: 2c9fe835 ("btrfs: Fix lost-data-profile caused by balance bg") Signed-off-by: NFilipe Manana <fdmanana@suse.com>
-
由 Filipe Manana 提交于
As of my previous change titled "Btrfs: fix scrub preventing unused block groups from being deleted", the following warning at extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a filesysten with "-o discard": 10263 void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) 10264 { (...) 10405 if (trimming) { 10406 WARN_ON(!list_empty(&block_group->bg_list)); 10407 spin_lock(&trans->transaction->deleted_bgs_lock); 10408 list_move(&block_group->bg_list, 10409 &trans->transaction->deleted_bgs); 10410 spin_unlock(&trans->transaction->deleted_bgs_lock); 10411 btrfs_get_block_group(block_group); 10412 } (...) This happens because scrub can now add back the block group to the list of unused block groups (fs_info->unused_bgs). This is dangerous because we are moving the block group from the unused block groups list to the list of deleted block groups without holding the lock that protects the source list (fs_info->unused_bgs_lock). The following diagram illustrates how this happens: CPU 1 CPU 2 cleaner_kthread() btrfs_delete_unused_bgs() sees bg X in list fs_info->unused_bgs deletes bg X from list fs_info->unused_bgs scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO (again) scrub_chunk(bg X) sets bg X back to RW mode adds bg X to the list fs_info->unused_bgs again, since it's still unused and currently not in that list sets bg X to RO mode btrfs_remove_chunk(bg X) --> discard is enabled and bg X is in the fs_info->unused_bgs list again so the warning is triggered --> we move it from that list into the transaction's delete_bgs list, but we can have another task currently manipulating the first list (fs_info->unused_bgs) Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both the list of unused block groups and the list of deleted block groups. This makes it safe and there's not much worry for more lock contention, as this lock is seldom used and only the cleaner kthread adds elements to the list of deleted block groups. The warning goes away too, as this was previously an impossible case (and would have been better a BUG_ON/ASSERT) but it's not impossible anymore. Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard"). Signed-off-by: NFilipe Manana <fdmanana@suse.com>
-
- 25 11月, 2015 14 次提交
-
-
由 Holger Hoffstätte 提交于
There's a regression in 4.4-rc since commit bc309467 (btrfs: extend balance filter usage to take minimum and maximum) in that existing (non-ranged) balance with -dusage=x no longer works; all chunks are skipped. After staring at the code for a while and wondering why a non-ranged balance would even need min and max thresholds (..which then were not set correctly, leading to the bug) I realized that the only problem was the fact that the filter functions were named wrong, thanks to patching copypasta. Simply renaming both functions lets the existing btrfs-progs call balance with -dusage=x and now the non-ranged filter function is invoked, properly using only a single chunk limit. Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com> Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum") Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Mark Fasheh 提交于
Commit 0ed4792a ('btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.') removed our qgroup accounting during btrfs_drop_snapshot(). Predictably, this results in qgroup numbers going bad shortly after a snapshot is removed. Fix this by adding a dirty extent record when we encounter extents during our shared subtree walk. This effectively restores the functionality we had with the original shared subtree walking code in 1152651a (btrfs: qgroup: account shared subtrees during snapshot delete). The idea with the original patch (and this one) is that shared subtrees can get skipped during drop_snapshot. The shared subtree walk then allows us a chance to visit those extents and add them to the qgroup work for later processing. This ultimately makes the accounting for drop snapshot work. The new qgroup code nicely handles all the other extents during the tree walk via the ref dec/inc functions so we don't have to add actions beyond what we had originally. Signed-off-by: NMark Fasheh <mfasheh@suse.de> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
The backref code will look up the fs_root we're trying to resolve our indirect refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT if the ref is 0. This isn't helpful for the qgroup stuff with snapshot delete as it won't be able to search down the snapshot we are deleting, which will cause us to miss roots. So use btrfs_get_fs_root and send false for check_ref so we can always get the root we're looking for. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NMark Fasheh <mfasheh@suse.de> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Justin Maggard 提交于
There's a race condition that leads to a NULL pointer dereference if you disable quotas while a quota rescan is running. To fix this, we just need to wait for the quota rescan worker to actually exit before tearing down the quota structures. Signed-off-by: NJustin Maggard <jmaggard@netgear.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
When a block group becomes unused and the cleaner kthread is currently running, we can end up getting the current transaction aborted with error -ENOENT when we try to commit the transaction, leading to the following trace: [59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]() [59779.272594] BTRFS: Transaction aborted (error -2) (...) [59779.291137] Call Trace: [59779.291621] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [59779.292543] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [59779.293435] [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [59779.295000] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [59779.296138] [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs] [59779.297663] [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [59779.299141] [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs] [59779.300359] [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs] [59779.301805] [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs] [59779.302893] [<ffffffff81196634>] sync_filesystem+0x7f/0x93 (...) [59779.318186] ---[ end trace 577e2daff90da33a ]--- The following diagram illustrates a sequence of steps leading to this problem: CPU 1 CPU 2 <at transaction N> adds bg A to list fs_info->unused_bgs adds bg B to list fs_info->unused_bgs <transaction kthread commits transaction N and wakes up the cleaner kthread> cleaner kthread delete_unused_bgs() sees bg A in list fs_info->unused_bgs btrfs_start_transaction() <transaction N + 1 starts> deletes bg A update_block_group(bg C) --> adds bg C to list fs_info->unused_bgs deletes bg B sees bg C in the list fs_info->unused_bgs btrfs_remove_chunk(bg C) btrfs_remove_block_group(bg C) --> checks if the block group is in a dirty list, and because it isn't now, it does nothing --> the block group item is deleted from the extent tree --> adds bg C to list transaction->dirty_bgs some task calls btrfs_commit_transaction(t N + 1) commit_cowonly_roots() btrfs_write_dirty_block_groups() --> sees bg C in cur_trans->dirty_bgs --> calls write_one_cache_group() which returns -ENOENT because it did not find the block group item in the extent tree --> transaction aborte with -ENOENT because write_one_cache_group() returned that error So fix this by adding a block group to the list of dirty block groups before adding it to the list of unused block groups. This happened on a stress test using fsstress plus concurrent calls to fallocate 20G and truncate (releasing part of the space allocated with fallocate). Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
Currently scrub can race with the cleaner kthread when the later attempts to delete an unused block group, and the result is preventing the cleaner kthread from ever deleting later the block group - unless the block group becomes used and unused again. The following diagram illustrates that race: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs and removes it from that list scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO sees the block group is already RO and therefore doesn't delete it nor adds it back to unused list So fix this by making scrub add the block group again to the list of unused block groups if the block group is still unused when it finished scrubbing it and it hasn't been removed already. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
Scrub can race with the cleaner kthread deleting block groups that are unused (and with relocation too) leading to a failure with error -EINVAL that gets returned to user space. The following diagram illustrates how it happens: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs sets block group to RO btrfs_remove_chunk(bg X) deletes device extents scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO (again) btrfs_remove_block_group(bg X) deletes block group from fs_info->block_group_cache_tree removes extent map from fs_info->mapping_tree scrub_chunk(offset X) searches fs_info->mapping_tree for extent map starting at offset X --> doesn't find any such extent map --> returns -EINVAL and scrub errors out to userspace with -EINVAL Fix this by dealing with an extent map lookup failure as an indicator of block group deletion. Issue reproduced with fstest btrfs/071. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
The test btrfs/011 triggers a rcu warning Reviewed-by: NAnand Jain <anand.jain@oracle.com> =============================== [ INFO: suspicious RCU usage. ] 4.4.0-rc1-default+ #286 Tainted: G W ------------------------------- fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 4 locks held by btrfs/28786: 0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs] 1: (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs] 2: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs] 3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs] stack backtrace: CPU: 0 PID: 28786 Comm: btrfs Tainted: G W 4.4.0-rc1-default+ #286 Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010 0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001 0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78 ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220 Call Trace: [<ffffffff8141d47b>] dump_stack+0x4f/0x74 [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140 [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs] [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80 [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs] [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs] [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs] [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs] [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs] [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0 [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs] [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0 [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0 [<ffffffff811e3336>] ? __fget_light+0x86/0xb0 [<ffffffff811e3369>] ? __fdget+0x9/0x20 [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80 [<ffffffff811d7483>] SyS_ioctl+0x53/0x80 [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f This is because of unprotected use of rcu_dereference in btrfs_scratch_superblocks. We can't add rcu locks around the whole function because we read the superblock. The fix will use the rcu string buffer directly without the rcu locking. Thi is safe as the device will not go away in the meantime. We're holding the device list mutexes. Restructuring the code to narrow down the rcu section turned out to be impossible, we need to call filp_open (through update_dev_time) on the buffer and this could call kmalloc/__might_sleep. We could call kstrdup with GFP_ATOMIC but it's not absolutely necessary. Fixes: 12b1c263 (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks) Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhaolei 提交于
xfstests/011 failed in node with small_size filesystem. Can be reproduced by following script: DEV_LIST="/dev/vdd /dev/vde" DEV_REPLACE="/dev/vdf" do_test() { local mkfs_opt="$1" local size="$2" dmesg -c >/dev/null umount $SCRATCH_MNT &>/dev/null echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}" mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1 mount "${DEV_LIST[0]}" $SCRATCH_MNT echo -n "Writing big files" dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1 for ((i = 1; i <= size; i++)); do echo -n . /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1 done echo echo Start replace btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || { dmesg return 1 } return 0 } # Set size to value near fs size # for example, 1897 can trigger this bug in 2.6G device. # ./do_test "-d raid1 -m raid1" 1897 System will report replace fail with following warning in dmesg: [ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started [ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28 [ 135.543505] ------------[ cut here ]------------ [ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440() [ 135.545276] Modules linked in: [ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256 [ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014 [ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000 [ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4 [ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70 [ 135.550758] Call Trace: [ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55 [ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0 [ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20 [ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440 [ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0 [ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0 [ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580 [ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70 [ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90 [ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80 [ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]--- Reason: When big data writen to fs, the whole free space will be allocated for data chunk. And operation as scrub need to set_block_ro(), and when there is only one metadata chunk in system(or other metadata chunks are all full), the function will try to allocate a new chunk, and failed because no space in device. Fix: When set_block_ro failed for metadata chunk, it is not a problem because scrub_lock paused commit_trancaction in same time, and metadata are always cowed, so the on-the-fly writepages will not write data into same place with scrub/replace. Let replace continue in this case is no problem. Tested by above script, and xfstests/011, plus 100 times xfstests/070. Changelog v1->v2: 1: Add detail comments in source and commit-message. 2: Add dmesg detail into commit-message. 3: Limit return value of -ENOSPC to be passed. All suggested by: Filipe Manana <fdmanana@gmail.com> Suggested-by: NFilipe Manana <fdmanana@gmail.com> Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
I've accidentally picked an already used number for the enhanced usage filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase, no backward compatibility issues. Reported-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com> Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum") Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
We were using only 1 transaction unit when attempting to delete an unused block group but in reality we need 3 + N units, where N corresponds to the number of stripes. We were accounting only for the addition of the orphan item (for the block group's free space cache inode) but we were not accounting that we need to delete one block group item from the extent tree, one free space item from the tree of tree roots and N device extent items from the device tree. While one unit is not enough, it worked most of the time because for each single unit we are too pessimistic and assume an entire tree path, with the highest possible heigth (8), needs to be COWed with eventual node splits at every possible level in the tree, so there was usually enough reserved space for removing all the items and adding the orphan item. However after adding the orphan item, writepages() can by called by the VM subsystem against the btree inode when we are under memory pressure, which causes writeback to start for the nodes we COWed before, this forces the operation to remove the free space item to COW again some (or all of) the same nodes (in the tree of tree roots). Even without writepages() being called, we could fail with ENOSPC because these items are located in multiple trees and one of them might have a higher heigth and require node/leaf splits at many levels, exhausting all the reserved space before removing all the items and adding the orphan. In the kernel 4.0 release, commit 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group"), we attempted to fix a BUG_ON due to ENOSPC when trying to add the orphan item by making the cleaner kthread reserve one transaction unit before attempting to remove the block group, but this was not enough. We had a couple user reports still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on a 4.2-rc6 kernel for example: http://www.spinics.net/lists/linux-btrfs/msg46070.html So fix this by reserving all the necessary units of metadata. Reported-by: NStefan Priebe <s.priebe@profihost.ag> Fixes: 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group") Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dan Carpenter 提交于
btrfs_alloc_dummy_root() return an error pointer on failure, it never returns NULL. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
The calculation of range length in btrfs_sync_file leads to signed overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin. https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 The fsync call passes 0 and LLONG_MAX, the range length does not fit to loff_t and overflows, but the value is converted to u64 so it silently works as expected. The minimal fix is a typecast to u64, switching functions to take (start, end) instead of (start, len) would be more intrusive. Coccinelle script found that there's one more opencoded calculation of the length. <smpl> @@ loff_t start, end; @@ * end - start </smpl> CC: stable@vger.kernel.org Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 11 11月, 2015 8 次提交
-
-
由 Zhao Lei 提交于
No need to use root->fs_info in btrfs_delete_unused_bgs(), use fs_info directly instead. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
Reproduce: (In integration-4.3 branch) TEST_DEV=(/dev/vdg /dev/vdh) TEST_DIR=/mnt/tmp umount "$TEST_DEV" >/dev/null mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" btrfs balance start -dusage=0 $TEST_DIR btrfs filesystem usage $TEST_DIR dd if=/dev/zero of="$TEST_DIR"/file count=100 btrfs filesystem usage $TEST_DIR Result: We can see "no data chunk" in first "btrfs filesystem usage": # btrfs filesystem usage $TEST_DIR Overall: ... Metadata,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB Metadata,RAID1: Size:122.88MiB, Used:112.00KiB /dev/vdg 122.88MiB /dev/vdh 122.88MiB System,single: Size:4.00MiB, Used:0.00B /dev/vdg 4.00MiB System,RAID1: Size:8.00MiB, Used:16.00KiB /dev/vdg 8.00MiB /dev/vdh 8.00MiB Unallocated: /dev/vdg 1.06GiB /dev/vdh 1.07GiB And "data chunks changed from raid1 to single" in second "btrfs filesystem usage": # btrfs filesystem usage $TEST_DIR Overall: ... Data,single: Size:256.00MiB, Used:0.00B /dev/vdh 256.00MiB Metadata,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB Metadata,RAID1: Size:122.88MiB, Used:112.00KiB /dev/vdg 122.88MiB /dev/vdh 122.88MiB System,single: Size:4.00MiB, Used:0.00B /dev/vdg 4.00MiB System,RAID1: Size:8.00MiB, Used:16.00KiB /dev/vdg 8.00MiB /dev/vdh 8.00MiB Unallocated: /dev/vdg 1.06GiB /dev/vdh 841.92MiB Reason: btrfs balance delete last data chunk in case of no data in the filesystem, then we can see "no data chunk" by "fi usage" command. And when we do write operation to fs, the only available data profile is 0x0, result is all new chunks are allocated single type. Fix: Allocate a data chunk explicitly to ensure we don't lose the raid profile for data. Test: Test by above script, and confirmed the logic by debug output. Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
Reproduce: (In integration-4.3 branch) TEST_DEV=(/dev/vdg /dev/vdh) TEST_DIR=/mnt/tmp umount "$TEST_DEV" >/dev/null mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" umount "$TEST_DEV" mount -o nospace_cache "$TEST_DEV" "$TEST_DIR" btrfs filesystem usage $TEST_DIR We can see the data chunk changed from raid1 to single: # btrfs filesystem usage $TEST_DIR Data,single: Size:8.00MiB, Used:0.00B /dev/vdg 8.00MiB # Reason: When a empty filesystem mount with -o nospace_cache, the last data blockgroup will be auto-removed in umount. Then if we mount it again, there is no data chunk in the filesystem, so the only available data profile is 0x0, result is all new chunks are created as single type. Fix: Don't auto-delete last blockgroup for a raid type. Test: Test by above script, and confirmed the logic by debug output. Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
It is useless. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
We don't need pass so many arguments for recheck sblock now, this patch cleans them. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
We can use existing scrub_checksum_data() and scrub_checksum_tree_block() for scrub_recheck_block_checksum(), instead of write duplicated code. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
We should reset sblock->xxx_error stats before calling scrub_recheck_block_checksum(). Current code run correctly because all sblock are allocated by k[cz]alloc(), and the error stats are not got changed. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Zhao Lei 提交于
scrub_setup_recheck_block() isn't setup all necessary fields for sblock_to_check because history reason. So current code need more arguments in severial functions, and more local variables, just to passing these lacked values to necessary place. This patch setup above fields to sblock_to_check in scrub_setup_recheck_block(), for: 1: more cleanup for function arg, local variable 2: to make sblock_to_check complete, then we can use sblock_to_check without concern about some uninitialized member. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-