- 01 3月, 2018 4 次提交
-
-
由 Hans van Kranenburg 提交于
In case of using DUP, we search for enough unallocated disk space on a device to hold two stripes. The devices_info[ndevs-1].max_avail that holds the amount of unallocated space found is directly assigned to stripe_size, while it's actually twice the stripe size. Later on in the code, an unconditional division of stripe_size by dev_stripes corrects the value, but in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size. The unconditional division later tries to correct stripe_size, but will actually make sure we can't allocate more than half the max_chunk_size. Fix this by moving the division by dev_stripes before the max chunk size check, so it always contains the right value, instead of putting a duct tape division in further on to get it fixed again. Since in all other cases than DUP, dev_stripes is 1, this change only affects DUP. Other attempts in the past were made to fix this: * 37db63a4 "Btrfs: fix max chunk size check in chunk allocator" tried to fix the same problem, but still resulted in part of the code acting on a wrongly doubled stripe_size value. * 86db2578 "Btrfs: fix max chunk size on raid5/6" unintentionally broke this fix again. The real problem was already introduced with the rest of the code in 73c5de00. The user visible result however will be that the max chunk size for DUP will suddenly double, while it's actually acting according to the limits in the code again like it was 5 years ago. Reported-by: NNaohiro Aota <naohiro.aota@wdc.com> Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html Fixes: 73c5de00 ("btrfs: quasi-round-robin for chunk allocation") Fixes: 86db2578 ("Btrfs: fix max chunk size on raid5/6") Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Essentially duplicate the error handling from the above block which handles the !PageUptodate(page) case and additionally clear EXTENT_BOUNDARY. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
add_pending_csums was added as part of the new data=ordered implementation in e6dcd2dc ("Btrfs: New data=ordered implementation"). Even back then it called the btrfs_csum_file_blocks which can fail but it never bothered handling the failure. In ENOMEM situation this could lead to the filesystem failing to write the checksums for a particular extent and not detect this. On read this could lead to the filesystem erroring out due to crc mismatch. Fix it by propagating failure from add_pending_csums and handling them. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
The srcu_struct in btrfs_fs_info scales in size with NR_CPUS. On kernels built with NR_CPUS=8192, this can result in kmalloc failures that prevent mounting. There is work in progress to try to resolve this for every user of srcu_struct but using kvzalloc will work around the failures until that is complete. As an example with NR_CPUS=512 on x86_64: the overall size of subvol_srcu is 3460 bytes, fs_info is 6496. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 06 2月, 2018 1 次提交
-
-
由 Nikolay Borisov 提交于
Commit 4fde46f0 ("Btrfs: free the stale device") introduced btrfs_free_stale_device which iterates the device lists for all registered btrfs filesystems and deletes those devices which aren't mounted. In a btrfs_devices structure has only 1 device attached to it and it is unused then btrfs_free_stale_devices will proceed to also free the btrfs_fs_devices struct itself. Currently this leads to a use after free since list_for_each_entry will try to perform a check on the already freed memory to see if it has to terminate the loop. The fix is to use 'break' when we know we are freeing the current fs_devs. Fixes: 4fde46f0 ("Btrfs: free the stale device") Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 02 2月, 2018 9 次提交
-
-
由 Filipe Manana 提交于
When we are replacing a missing device we mount the filesystem with the degraded mode option in which case we are allowed to have a btrfs device structure without a backing device member (its bdev member is NULL) and therefore we can't dereference that member. Commit 38b5f68e ("btrfs: drop btrfs_device::can_discard to query directly") started to dereference that member when discarding extents, resulting in a null pointer dereference: [ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing [ 3145.364116] BTRFS info (device sdf): dev_replace from <missing disk> (devid 2) to /dev/sdg started [ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0 [ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs] [ 3145.415085] PGD 0 P4D 0 [ 3145.415085] Oops: 0000 [#1] PREEMPT SMP PTI [ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw i2c_piix4 i2 [ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: G W 4.15.0-rc9-btrfs-next-55+ #1 [ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs] [ 3145.415085] RSP: 0018:ffffc90004813c60 EFLAGS: 00010293 [ 3145.415085] RAX: ffff88020d39cc00 RBX: ffff88020c4ea2a0 RCX: 0000000000000002 [ 3145.415085] RDX: 0000000000000000 RSI: ffff88020c4ea240 RDI: 0000000000000000 [ 3145.415085] RBP: 0000000000000000 R08: 0000000000004000 R09: 0000000000000000 [ 3145.415085] R10: ffffc90004813ae8 R11: 0000000000000000 R12: 0000000000000000 [ 3145.415085] R13: ffff88020c418000 R14: 0000000000000000 R15: 0000000000000000 [ 3145.415085] FS: 00007f565681f8c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 [ 3145.415085] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3145.415085] CR2: 00000000000000e0 CR3: 000000020d208006 CR4: 00000000001606f0 [ 3145.415085] Call Trace: [ 3145.415085] btrfs_finish_extent_commit+0x9a/0x1be [btrfs] [ 3145.415085] btrfs_commit_transaction+0x649/0x7a0 [btrfs] [ 3145.415085] ? start_transaction+0x2b0/0x3b3 [btrfs] [ 3145.415085] btrfs_dev_replace_start+0x274/0x30c [btrfs] [ 3145.415085] btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs] [ 3145.415085] btrfs_ioctl+0x1a91/0x1d62 [btrfs] [ 3145.415085] ? lock_acquire+0x16a/0x1af [ 3145.415085] ? vfs_ioctl+0x1b/0x28 [ 3145.415085] ? trace_hardirqs_on_caller+0x14c/0x1a6 [ 3145.415085] vfs_ioctl+0x1b/0x28 [ 3145.415085] do_vfs_ioctl+0x5a9/0x5e0 [ 3145.415085] ? _raw_spin_unlock_irq+0x34/0x46 [ 3145.415085] ? entry_SYSCALL_64_fastpath+0x5/0x8b [ 3145.415085] ? trace_hardirqs_on_caller+0x14c/0x1a6 [ 3145.415085] SyS_ioctl+0x52/0x76 [ 3145.415085] entry_SYSCALL_64_fastpath+0x1e/0x8b [ 3145.415085] RIP: 0033:0x7f56558b3c47 [ 3145.415085] RSP: 002b:00007ffdcfac4c58 EFLAGS: 00000202 [ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 48 8b 44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 87 e0 00 [ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: ffffc90004813c60 [ 3145.415085] CR2: 00000000000000e0 [ 3145.458185] ---[ end trace 06302e7ac31902bf ]--- This is trivially reproduced by running the test btrfs/027 from fstests like this: $ MOUNT_OPTIONS="-o discard" ./check btrfs/027 Fix this by skipping devices without a backing device before attempting to discard. Fixes: 38b5f68e ("btrfs: drop btrfs_device::can_discard to query directly") Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zygo Blaxell 提交于
Until v4.14, this warning was very infrequent: WARNING: CPU: 3 PID: 18172 at fs/btrfs/backref.c:1391 find_parent_nodes+0xc41/0x14e0 Modules linked in: [...] CPU: 3 PID: 18172 Comm: bees Tainted: G D W L 4.11.9-zb64+ #1 Hardware name: System manufacturer System Product Name/M5A78L-M/USB3, BIOS 2101 12/02/2014 Call Trace: dump_stack+0x85/0xc2 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 find_parent_nodes+0xc41/0x14e0 __btrfs_find_all_roots+0xad/0x120 ? extent_same_check_offsets+0x70/0x70 iterate_extent_inodes+0x168/0x300 iterate_inodes_from_logical+0x87/0xb0 ? iterate_inodes_from_logical+0x87/0xb0 ? extent_same_check_offsets+0x70/0x70 btrfs_ioctl+0x8ac/0x2820 ? lock_acquire+0xc2/0x200 do_vfs_ioctl+0x91/0x700 ? __fget+0x112/0x200 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc6 ? trace_hardirqs_off_caller+0x1f/0x140 Starting with v4.14 (specifically 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")) the WARN_ON occurs three orders of magnitude more frequently--almost once per second while running workloads like bees. Replace the WARN_ON() with a comment rationale for its removal. The rationale is paraphrased from an explanation by Edmund Nadolski <enadolski@suse.de> on the linux-btrfs mailing list. Fixes: 8da6d581 ("Btrfs: added btrfs_find_all_roots()") Signed-off-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Running generic/019 with qgroups on the scratch device enabled is almost guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed to trigger only on -ENOMEM, in reality, however, it's possible to get -EIO from btrfs_qgroup_trace_extent_post. This function just finds the roots of the extent being tracked and sets the qrecord->old_roots list. If this operation fails nothing critical happens except the quota accounting can be considered wrong. In such case just set the INCONSISTENT flag for the quota and print a warning, rather than killing off the system. Additionally, it's possible to trigger a BUG_ON in btrfs_truncate_inode_items as well. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NQu Wenruo <wqu@suse.com> [ error message adjustments ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
The highest objectid, which is assigned to new inode, is decided at the time of initializing fs roots. However, in cases where log replay gets processed, the btree which fs root owns might be changed, so we have to search it again for the highest objectid, otherwise creating new inode would end up with -EEXIST. cc: <stable@vger.kernel.org> v4.4-rc6+ Fixes: f32e48e9 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots") Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
I got these from running generic/475, WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs] Call Trace: btrfs_orphan_release_metadata+0x9f/0x200 [btrfs] btrfs_orphan_del+0x10d/0x170 [btrfs] btrfs_setattr+0x500/0x640 [btrfs] notify_change+0x7ae/0x870 do_truncate+0xca/0x130 vfs_truncate+0x2ee/0x3d0 do_sys_truncate+0xaf/0xf0 SyS_truncate+0xe/0x10 entry_SYSCALL_64_fastpath+0x1f/0x96 The race is between btrfs_orphan_commit_root and btrfs_orphan_del, t1 t2 btrfs_orphan_commit_root btrfs_orphan_del spin_lock check (&root->orphan_inodes) root->orphan_block_rsv = NULL; spin_unlock atomic_dec(&root->orphan_inodes); access root->orphan_block_rsv Accessing root->orphan_block_rsv must be done before decreasing root->orphan_inodes. cc: <stable@vger.kernel.org> v3.12+ Fixes: 703c88e0 ("Btrfs: fix tracking of orphan inode count") Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
This regression is introduced in commit 3d48d981 ("btrfs: Handle uninitialised inode eviction"). There are two problems, a) it is ->destroy_inode() that does the final free on inode, not ->evict_inode(), b) clear_inode() must be called before ->evict_inode() returns. This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); in evict() because I_CLEAR is set in clear_inode(). Fixes: commit 3d48d981 ("btrfs: Handle uninitialised inode eviction") Cc: <stable@vger.kernel.org> # v4.7-rc6+ Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
It's possible that btrfs_sync_log() bails out after one of the two btrfs_write_marked_extents() which convert extent state's state bit into EXTENT_NEED_WAIT from EXTENT_DIRTY/EXTENT_NEW, however only EXTENT_DIRTY and EXTENT_NEW are searched by free_log_tree() so that those extent states with EXTENT_NEED_WAIT lead to memory leak. cc: <stable@vger.kernel.org> Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
In cases that the whole fs flips into readonly status due to failures in critical sections, then log tree's blocks are still dirty, and this leads to a crash during umount time, the crash is about use-after-free, umount -> close_ctree -> stop workers -> iput(btree_inode) -> iput_final -> write_inode_now -> ... -> queue job on stop'd workers cc: <stable@vger.kernel.org> v3.12+ Fixes: 681ae509 ("Btrfs: cleanup reserved space when freeing tree log on error") Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
@cur_offset is not set back to what it should be (@cow_start) if btrfs_next_leaf() returns something wrong, and the range [cow_start, cur_offset) remains locked forever. cc: <stable@vger.kernel.org> Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 30 1月, 2018 2 次提交
-
-
由 Anand Jain 提交于
As struct btrfs_disk_super is being passed, so it can get devid the same way its parent does. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
Instead of pointer to btrfs_fs_devices as an arg in device_list_add() better to get pointer to btrfs_device as return value, then we have both, pointer to btrfs_device and btrfs_fs_devices. btrfs_device is needed to handle reappearing missing device. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 23 1月, 2018 7 次提交
-
-
由 Anand Jain 提交于
There is no other parent for device_list_add() except for btrfs_scan_one_device(), which would set btrfs_fs_devices::total_devices if device_list_add is successful and this can be done with in device_list_add() itself. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
Commit 60999ca4 ("btrfs: make device scan less noisy") adds return value 1 to device_list_add(), so that parent function can call pr_info only when new device is added. Move the pr_info() part into device_list_add() so that this function can be kept simple. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
The btrfs_free_stale_devices() is updated to match for the given device path and delete it. (It searches for only unmounted list of devices.) Also drop the comment about different path being used for the same device, since now we will have cli to clean any device that's not a concern any more. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
No functional changes. Rename btrfs_free_stale_devices() arg to skip_dev, so that it reflects what that arg for. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
This updates btrfs_free_stale_devices() helper function to delete all unmouted devices, when arg is NULL. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
Let the list iterator iterate further and find other stale devices and delete it. This is in preparation to add support for user land request-able stale devices cleanup. Also rename btrfs_free_stale_device() to btrfs_free_stale_devices(). Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
There is no need to check for btrfs_fs_devices::seeding when we have checked for btrfs_fs_devices::opened, because we can't sprout without its seed FS being opened. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 22 1月, 2018 17 次提交
-
-
由 Nikolay Borisov 提交于
No functional changes, just makes the code more readable Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
In order to debug subtle bugs around merge_extent_mapping(), perf probe can be used to check the arguments, but sometimes merge_extent_mapping() got inlined by compiler and couldn't be probed. This is adding noinline attribute to merge_extent_mapping(). Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
This is a subtle case, so in order to understand the problem, it'd be good to know the content of existing and em when any error occurs. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
This test case simulates the racy situation of dio write vs dio read, and see if btrfs_get_extent() would return -EEXIST. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
This test case simulates the racy situation of buffered write vs dio read, and see if btrfs_get_extent() would return -EEXIST. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
We've observed that btrfs_get_extent() and merge_extent_mapping() could return -EEXIST in several cases, and they are caused by some racy condition, e.g dio read vs dio write, which makes the problem very tricky to reproduce. This adds extent map selftests in order to simulate those racy situations. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> [ minor string adjustments ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
These helpers are extent map specific, move them to extent_map.c. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
This is a prepare work for the following extent map selftest, which runs tests against em merge logic. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
This fixes a corner case that is caused by a race of dio write vs dio read/write. Here is how the race could happen. Suppose that no extent map has been loaded into memory yet. There is a file extent [0, 32K), two jobs are running concurrently against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio read from [0, 4K) or [4K, 8K). t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K). ------------------------------------------------------ t1 t2 btrfs_get_blocks_direct() btrfs_get_blocks_direct() -> btrfs_get_extent() -> btrfs_get_extent() -> lookup_extent_mapping() -> add_extent_mapping() -> lookup_extent_mapping() # load [0, 32K) -> btrfs_new_extent_direct() -> btrfs_drop_extent_cache() # split [0, 32K) and # drop [8K, 32K) -> add_extent_mapping() # add [8K, 32K) -> add_extent_mapping() # handle -EEXIST when adding # [0, 32K) ------------------------------------------------------ About how t2(dio read/write) runs into -EEXIST: a) add_extent_mapping() gets -EEXIST for adding em [0, 32k), b) search_extent_mapping() then returns [0, 8k) as the existing em, even though start == existing->start, em is [0, 32k) so that extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k, c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k) (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree, d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write, which is confusing applications. Here I conclude all the possible situations, 1) start < existing->start +-----------+em+-----------+ +--prev---+ | +-------------+ | | | | | | | +---------+ + +---+existing++ ++ + | + start 2) start == existing->start +------------em------------+ | +-------------+ | | | | | + +----existing-+ + | | + start 3) start > existing->start && start < (existing->start + existing->len) +------------em------------+ | +-------------+ | | | | | + +----existing-+ + | | + start 4) start >= (existing->start + existing->len) +-----------+em+-----------+ | +-------------+ | +--next---+ | | | | | | + +---+existing++ + +---------+ + | + start As we can see, it turns out that if start is within existing em (front inclusive), then the existing em should be returned as is, otherwise, we try our best to merge candidate em with sibling ems to form a larger em (in order to reduce the total number of em). Reported-by: NDavid Vallender <david.vallender@landmark.co.uk> Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
%block_len could be checked on deciding if two em are mergeable. merge_extent_mapping() has only added the front pad if the front part of em gets truncated, but it's possible that the end part gets truncated. For both compressed extent and inline extent, em->block_len is not adjusted accordingly, and for regular extent, em->block_len always equals to em->len, hence this sets em->block_len with em->len. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Matthew Wilcox 提交于
The reada_lock in struct btrfs_device was only initialised, and not actually used. That's good because there's another lock also called reada_lock in the btrfs_fs_info that was quite heavily used. Remove this one. Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
Before rbio_orig_end_io() goes to free rbio, rbio may get merged with more bios from other rbios and rbio->bio_list becomes non-empty, in that case, these newly merged bios don't end properly. Once unlock_stripe() is done, rbio->bio_list will not be updated any more and we can call bio_endio() on all queued bios. It should only happen in error-out cases, the normal path of recover and full stripe write have already set RBIO_RMW_LOCKED_BIT to disable merge before doing IO, so rbio_orig_end_io() called by them doesn't have the above issue. Reported-by: NJérôme Carretero <cJ-ko@zougloub.eu> Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
Since raid6 recover tries all possible combinations of failed stripes, - when raid6 rebuild algorithm is used, i.e. raid6_datap_recov() and raid6_2data_recov(), it may change the in-memory content of failed stripes, if such a raid bio is cached, a later raid write rmw or recover can steal @stripe_pages from it instead of reading from disks, such that it carries the wrong content to do write rmw or recovery and ends up with corruption or recovery failures. - when raid5 rebuild algorithm is used, i.e. xor, raid bio can be cached because the only failed stripe which contains @rbio->bio_pages gets modified, others remain the same so that their in-memory content is consistent with their on-disk content. This adds a check to skip caching rbio if using raid6 recover. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
Bio iterated by set_bio_pages_uptodate() is raid56 internal one, so it will never be a BIO_CLONED bio, and since this is called by end_io functions, bio->bi_iter.bi_size is zero, we mustn't use bio_for_each_segment() as that is a no-op if bi_size is zero. Fixes: 6592e58c ("Btrfs: fix write corruption due to bio cloning on raid5/6") Cc: <stable@vger.kernel.org> # v4.12-rc6+ Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Su Yue 提交于
There is no function named btrfs_get_inode_index_count. Explanation for magic number index_cnt=2 in btrfs_new_inode() is actually located in btrfs_set_inode_index_count(). So replace 'btrfs_get_inode_index_count' in the comment by 'btrfs_set_inode_index_count'. Signed-off-by: NSu Yue <suy.fnst@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It's not used outside of extent-tree so there is no reason to not be static. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
We call btrfs_free_stale_device() only when we alloc a new struct btrfs_device (ret=1), so move it closer to where we alloc the new device. Also drop the comments. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-