- 14 11月, 2018 21 次提交
-
-
由 Filipe Manana 提交于
commit 7ed586d0a8241e81d58c656c5b315f781fa6fc97 upstream. When using the NO_HOLES feature and logging a regular file, we were expecting that if we find an inline extent, that either its size in RAM (uncompressed and unenconded) matches the size of the file or if it does not, that it matches the sector size and it represents compressed data. This assertion does not cover a case where the length of the inline extent is smaller than the sector size and also smaller the file's size, such case is possible through fallocate. Example: $ mkfs.btrfs -f -O no-holes /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xb60 0 21" /mnt/foobar $ xfs_io -c "falloc 40 40" /mnt/foobar $ xfs_io -c "fsync" /mnt/foobar In the above example we trigger the assertion because the inline extent's length is 21 bytes while the file size is 80 bytes. The fallocate() call merely updated the file's size and did not touch the existing inline extent, as expected. So fix this by adjusting the assertion so that an inline extent length smaller than the file size is valid if the file size is smaller than the filesystem's sector size. A test case for fstests follows soon. Reported-by: NAnatoly Trosinenko <anatoly.trosinenko@gmail.com> Fixes: a89ca6f2 ("Btrfs: fix fsync after truncate when no_holes feature is enabled") CC: stable@vger.kernel.org # 4.14+ Link: https://lore.kernel.org/linux-btrfs/CAE5jQCfRSBC7n4pUTFJcmHh109=gwyT9mFkCOL+NKfzswmR=_Q@mail.gmail.com/Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Filipe Manana 提交于
commit 3527a018c00e5dbada2f9d7ed5576437b6dd5cfb upstream. At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns 0 (success creating an inline extent) or an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Qu Wenruo 提交于
commit 9c7b0c2e8dbfbcd80a71e2cbfe02704f26c185c6 upstream. [BUG] In the following case, rescan won't zero out the number of qgroup 1/0: $ mkfs.btrfs -fq $DEV $ mount $DEV /mnt $ btrfs quota enable /mnt $ btrfs qgroup create 1/0 /mnt $ btrfs sub create /mnt/sub $ btrfs qgroup assign 0/257 1/0 /mnt $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000 $ btrfs sub snap /mnt/sub /mnt/snap $ btrfs quota rescan -w /mnt $ btrfs qgroup show -pcre /mnt qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16.00KiB 16.00KiB none none --- --- 0/257 1016.00KiB 16.00KiB none none 1/0 --- 0/258 1016.00KiB 16.00KiB none none --- --- 1/0 1016.00KiB 16.00KiB none none --- 0/257 So far so good, but: $ btrfs qgroup remove 0/257 1/0 /mnt WARNING: quotas may be inconsistent, rescan needed $ btrfs quota rescan -w /mnt $ btrfs qgroup show -pcre /mnt qgoupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16.00KiB 16.00KiB none none --- --- 0/257 1016.00KiB 16.00KiB none none --- --- 0/258 1016.00KiB 16.00KiB none none --- --- 1/0 1016.00KiB 16.00KiB none none --- --- ^^^^^^^^^^ ^^^^^^^^ not cleared [CAUSE] Before rescan we call qgroup_rescan_zero_tracking() to zero out all qgroups' accounting numbers. However we don't mark all qgroups dirty, but rely on rescan to do so. If we have any high level qgroup without children, it won't be marked dirty during rescan, since we cannot reach that qgroup. This will cause QGROUP_INFO items of childless qgroups never get updated in the quota tree, thus their numbers will stay the same in "btrfs qgroup show" output. [FIX] Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if we have childless qgroups, their QGROUP_INFO items will still get updated during rescan. Reported-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com> Tested-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Filipe Manana 提交于
commit 0f375eed upstream. In a scenario like the following: mkdir /mnt/A # inode 258 mkdir /mnt/B # inode 259 touch /mnt/B/bar # inode 260 sync mv /mnt/B/bar /mnt/A/bar mv -T /mnt/A /mnt/B fsync /mnt/B/bar <power fail> After replaying the log we end up with file bar having 2 hard links, both with the name 'bar' and one in the directory with inode number 258 and the other in the directory with inode number 259. Also, we end up with the directory inode 259 still existing and with the directory inode 258 still named as 'A', instead of 'B'. In this scenario, file 'bar' should only have one hard link, located at directory inode 258, the directory inode 259 should not exist anymore and the name for directory inode 258 should be 'B'. This incorrect behaviour happens because when attempting to log the old parents of an inode, we skip any parents that no longer exist. Fix this by forcing a full commit if an old parent no longer exists. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Filipe Manana 提交于
commit f2d72f42 upstream. When replaying a log which contains a tmpfile (which necessarily has a link count of 0) we end up calling inc_nlink(), at fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like the following: [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 inc_nlink+0x33/0x40 [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 4.19.0-rc6-btrfs-next-38 #1 [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [195191.943726] RIP: 0010:inc_nlink+0x33/0x40 [195191.943728] RSP: 0018:ffffb96e425e3870 EFLAGS: 00010246 [195191.943730] RAX: 0000000000000000 RBX: ffff8c0d1e6af4f0 RCX: 0000000000000006 [195191.943731] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8c0d1e6af4f0 [195191.943731] RBP: 0000000000000097 R08: 0000000000000001 R09: 0000000000000000 [195191.943732] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb96e425e3a60 [195191.943733] R13: ffff8c0d10cff0c8 R14: ffff8c0d0d515348 R15: ffff8c0d78a1b3f8 [195191.943735] FS: 00007f570ee24480(0000) GS:ffff8c0dfb200000(0000) knlGS:0000000000000000 [195191.943736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [195191.943737] CR2: 00005593286277c8 CR3: 00000000bb8f2006 CR4: 00000000003606f0 [195191.943739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [195191.943740] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [195191.943741] Call Trace: [195191.943778] replay_one_buffer+0x797/0x7d0 [btrfs] [195191.943802] walk_up_log_tree+0x1c1/0x250 [btrfs] [195191.943809] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943825] walk_log_tree+0xae/0x1d0 [btrfs] [195191.943840] btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs] [195191.943856] ? replay_dir_deletes+0x280/0x280 [btrfs] [195191.943870] open_ctree+0x1c3b/0x22a0 [btrfs] [195191.943887] btrfs_mount_root+0x6b4/0x800 [btrfs] [195191.943894] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943899] ? pcpu_alloc+0x55b/0x7c0 [195191.943906] ? mount_fs+0x3b/0x140 [195191.943908] mount_fs+0x3b/0x140 [195191.943912] ? __init_waitqueue_head+0x36/0x50 [195191.943916] vfs_kern_mount+0x62/0x160 [195191.943927] btrfs_mount+0x134/0x890 [btrfs] [195191.943936] ? rcu_read_lock_sched_held+0x3f/0x70 [195191.943938] ? pcpu_alloc+0x55b/0x7c0 [195191.943943] ? mount_fs+0x3b/0x140 [195191.943952] ? btrfs_remount+0x570/0x570 [btrfs] [195191.943954] mount_fs+0x3b/0x140 [195191.943956] ? __init_waitqueue_head+0x36/0x50 [195191.943960] vfs_kern_mount+0x62/0x160 [195191.943963] do_mount+0x1f9/0xd40 [195191.943967] ? memdup_user+0x4b/0x70 [195191.943971] ksys_mount+0x7e/0xd0 [195191.943974] __x64_sys_mount+0x21/0x30 [195191.943977] do_syscall_64+0x60/0x1b0 [195191.943980] entry_SYSCALL_64_after_hwframe+0x49/0xbe [195191.943983] RIP: 0033:0x7f570e4e524a [195191.943986] RSP: 002b:00007ffd83589478 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5 [195191.943989] RAX: ffffffffffffffda RBX: 0000563f335b2060 RCX: 00007f570e4e524a [195191.943990] RDX: 0000563f335b2240 RSI: 0000563f335b2280 RDI: 0000563f335b2260 [195191.943992] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020 [195191.943993] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000563f335b2260 [195191.943994] R13: 0000563f335b2240 R14: 0000000000000000 R15: 00000000ffffffff [195191.944002] irq event stamp: 8688 [195191.944010] hardirqs last enabled at (8687): [<ffffffff9cb004c3>] console_unlock+0x503/0x640 [195191.944012] hardirqs last disabled at (8688): [<ffffffff9ca037dd>] trace_hardirqs_off_thunk+0x1a/0x1c [195191.944018] softirqs last enabled at (8638): [<ffffffff9cc0a5d1>] __set_page_dirty_nobuffers+0x101/0x150 [195191.944020] softirqs last disabled at (8634): [<ffffffff9cc26bbe>] wb_wakeup_delayed+0x2e/0x60 [195191.944022] ---[ end trace 5d6e873a9a0b811a ]--- This happens because the inode does not have the flag I_LINKABLE set, which is a runtime only flag, not meant to be persisted, set when the inode is created through open(2) if the flag O_EXCL is not passed to it. Except for the warning, there are no other consequences (like corruptions or metadata inconsistencies). Since it's pointless to replay a tmpfile as it would be deleted in a later phase of the log replay procedure (it has a link count of 0), fix this by not logging tmpfiles and if a tmpfile is found in a log (created by a kernel without this change), skip the replay of the inode. A test case for fstests follows soon. Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay") CC: stable@vger.kernel.org # 4.18+ Reported-by: NMartin Steigerwald <martin@lichtvoll.de> Link: https://lore.kernel.org/linux-btrfs/3666619.NTnn27ZJZE@merkaba/Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Josef Bacik 提交于
commit 545e3366db823dc3342ca9d7fea803f829c9062f upstream. Allocating new chunks modifies both the extent and chunk tree, which can trigger new chunk allocations. So instead of doing list_for_each_safe, just do while (!list_empty()) so we make sure we don't exit with other pending bg's still on our list. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Josef Bacik 提交于
commit 553cceb49681d60975d00892877d4c871bf220f9 upstream. We need to clear the max_extent_size when we clear bits from a bitmap since it could have been from the range that contains the max_extent_size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Josef Bacik 提交于
commit 84de76a2fb217dc1b6bc2965cc397d1648aa1404 upstream. If we're allocating a new space cache inode it's likely going to be under a transaction handle, so we need to use memalloc_nofs_save() in order to avoid deadlocks, and more importantly lockdep messages that make xfstests fail. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Josef Bacik 提交于
commit f45c752b65af46bf42963295c332865d95f97fff upstream. We want to release the unused reservation we have since it refills the delayed refs reserve, which will make everything go smoother when running the delayed refs if we're short on our reservation. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Chris Mason 提交于
commit 7703bdd8d23e6ef057af3253958a793ec6066b28 upstream. During buffered writes, we follow this basic series of steps: again: lock all the pages wait for writeback on all the pages Take the extent range lock wait for ordered extents on the whole range clean all the pages if (copy_from_user_in_atomic() hits a fault) { drop our locks goto again; } dirty all the pages release all the locks The extra waiting, cleaning and locking are there to make sure we don't modify pages in flight to the drive, after they've been crc'd. If some of the pages in the range were already dirty when the write began, and we need to goto again, we create a window where a dirty page has been cleaned and unlocked. It may be reclaimed before we're able to lock it again, which means we'll read the old contents off the drive and lose any modifications that had been pending writeback. We don't actually need to clean the pages. All of the other locking in place makes sure we don't start IO on the pages, so we can just leave them dirty for the duration of the write. Fixes: 73d59314 (the original btrfs merge) CC: stable@vger.kernel.org # v4.4+ Signed-off-by: NChris Mason <clm@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Josef Bacik 提交于
commit 3aa7c7a31c26321696b92841d5103461c6f3f517 upstream. While testing my backport I noticed there was a panic if I ran generic/416 generic/417 generic/418 all in a row. This just happened to uncover a race where we had outstanding IO after we destroy all of our workqueues, and then we'd go to queue the endio work on those free'd workqueues. This is because we aren't waiting for the caching threads to be done before freeing everything up, so to fix this make sure we wait on any outstanding caching that's being done before we free up the block group, so we're sure to be done with all IO by the time we get to btrfs_stop_all_workers(). This fixes the panic I was seeing consistently in testing. ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:6112! SMP PTI Modules linked in: CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Workqueue: btrfs-cache btrfs_cache_helper RIP: 0010:btrfs_map_bio+0x346/0x370 RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000 RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000 RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000 R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00 FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btree_submit_bio_hook+0x8a/0xd0 submit_one_bio+0x5d/0x80 read_extent_buffer_pages+0x18a/0x320 btree_read_extent_buffer_pages+0xbc/0x200 ? alloc_extent_buffer+0x359/0x3e0 read_tree_block+0x3d/0x60 read_block_for_search.isra.30+0x1a5/0x360 btrfs_search_slot+0x41b/0xa10 btrfs_next_old_leaf+0x212/0x470 caching_thread+0x323/0x490 normal_work_helper+0xc5/0x310 process_one_work+0x141/0x340 worker_thread+0x44/0x3c0 kthread+0xf8/0x130 ? process_one_work+0x340/0x340 ? kthread_bind+0x10/0x10 ret_from_fork+0x35/0x40 RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0 ---[ end trace 827eb13e50846033 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jeff Mahoney 提交于
commit fee7acc361314df6561208c2d3c0882d663dd537 upstream. Commit 499f377f (btrfs: iterate over unused chunk space in FITRIM) fixed free space trimming, but introduced latency when it was running. This is due to it pinning the transaction using both a incremented refcount and holding the commit root sem for the duration of a single trim operation. This was to ensure safety but it's unnecessary. We already hold the the chunk mutex so we know that the chunk we're using can't be allocated while we're trimming it. In order to check against chunks allocated already in this transaction, we need to check the pending chunks list. To to that safely without joining the transaction (or attaching than then having to commit it) we need to ensure that the dev root's commit root doesn't change underneath us and the pending chunk lists stays around until we're done with it. We can ensure the former by holding the commit root sem and the latter by pinning the transaction. We do this now, but the critical section covers the trim operation itself and we don't need to do that. This patch moves the pinning and unpinning logic into helpers and unpins the transaction after performing the search and check for pending chunks. Limiting the critical section of the transaction pinning improves the latency substantially on slower storage (e.g. image files over NFS). Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jeff Mahoney 提交于
commit 0be88e367fd8fbdb45257615d691f4675dda062f upstream. We check whether any device the file system is using supports discard in the ioctl call, but then we attempt to trim free extents on every device regardless of whether discard is supported. Due to the way we mask off EOPNOTSUPP, we can end up issuing the trim operations on each free range on devices that don't support it, just wasting time. Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jeff Mahoney 提交于
commit d4e329de upstream. btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the device_list_mutex. The problem is that ->alloc_list is protected by the chunk mutex. We don't want to hold the chunk mutex over the trim of the entire file system. Fortunately, the ->dev_list list is protected by the dev_list mutex and while it will give us all devices, including read-only devices, we already just skip the read-only devices. Then we can continue to take and release the chunk mutex while scanning each device. Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Qu Wenruo 提交于
commit 6ba9fc8e upstream. [BUG] fstrim on some btrfs only trims the unallocated space, not trimming any space in existing block groups. [CAUSE] Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to range [0, super->total_bytes). So later btrfs_trim_fs() will only be able to trim block groups in range [0, super->total_bytes). While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs uses its logical address space, there is nothing limiting the location where we put block groups. For filesystem with frequent balance, it's quite easy to relocate all block groups and bytenr of block groups will start beyond super->total_bytes. In that case, btrfs will not trim existing block groups. [FIX] Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs() can get the unmodified range, which is normally set to [0, U64_MAX]. Reported-by: NChris Murphy <lists@colorremedies.com> Fixes: f4c697e6 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl") CC: <stable@vger.kernel.org> # v4.4+ Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Qu Wenruo 提交于
commit 93bba24d upstream. Function btrfs_trim_fs() doesn't handle errors in a consistent way. If error happens when trimming existing block groups, it will skip the remaining blocks and continue to trim unallocated space for each device. The return value will only reflect the final error from device trimming. This patch will fix such behavior by: 1) Recording the last error from block group or device trimming The return value will also reflect the last error during trimming. Make developer more aware of the problem. 2) Continuing trimming if possible If we failed to trim one block group or device, we could still try the next block group or device. 3) Report number of failures during block group and device trimming It would be less noisy, but still gives user a brief summary of what's going wrong. Such behavior can avoid confusion for cases like failure to trim the first block group and then only unallocated space is trimmed. Reported-by: NChris Murphy <lists@colorremedies.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ add bg_ret and dev_ret to the messages ] Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jeff Mahoney 提交于
commit 5c06147128fbbdf7a84232c5f0d808f53153defe upstream. When we fail to start a transaction in btrfs_dev_replace_start, we leave dev_replace->replace_start set to STARTED but clear ->srcdev and ->tgtdev. Later, that can result in an Oops in btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED implies that ->srcdev is valid. Also fix error handling when the state is already STARTED or SUSPENDED while starting. That, too, will clear ->srcdev and ->tgtdev even though it doesn't own them. This should be an impossible case to hit since we should be protected by the BTRFS_FS_EXCL_OP bit being set. Let's add an ASSERT there while we're at it. Fixes: e93c89c1 (Btrfs: add new sources for device replace code) CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jeff Mahoney 提交于
commit 374b0e2d upstream. When we hit an I/O error in free_log_tree->walk_log_tree during file system shutdown we can crash due to there not being a valid transaction handle. Use btrfs_handle_fs_error when there's no transaction handle to use. BUG: unable to handle kernel NULL pointer dereference at 0000000000000060 IP: free_log_tree+0xd2/0x140 [btrfs] PGD 0 P4D 0 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI Modules linked in: <modules> CPU: 2 PID: 23544 Comm: umount Tainted: G W 4.12.14-kvmsmall #9 SLE15 (unreleased) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 task: ffff96bfd3478880 task.stack: ffffa7cf40d78000 RIP: 0010:free_log_tree+0xd2/0x140 [btrfs] RSP: 0018:ffffa7cf40d7bd10 EFLAGS: 00010282 RAX: 00000000fffffffb RBX: 00000000fffffffb RCX: 0000000000000002 RDX: 0000000000000000 RSI: ffff96c02f07d4c8 RDI: 0000000000000282 RBP: ffff96c013cf1000 R08: ffff96c02f07d4c8 R09: ffff96c02f07d4d0 R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000 R13: ffff96c005e800c0 R14: ffffa7cf40d7bdb8 R15: 0000000000000000 FS: 00007f17856bcfc0(0000) GS:ffff96c03f600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000060 CR3: 0000000045ed6002 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? wait_for_writer+0xb0/0xb0 [btrfs] btrfs_free_log+0x17/0x30 [btrfs] btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs] btrfs_free_fs_roots+0xc0/0x130 [btrfs] ? wait_for_completion+0xf2/0x100 close_ctree+0xea/0x2e0 [btrfs] ? kthread_stop+0x161/0x260 generic_shutdown_super+0x6c/0x120 kill_anon_super+0xe/0x20 btrfs_kill_super+0x13/0x100 [btrfs] deactivate_locked_super+0x3f/0x70 cleanup_mnt+0x3b/0x70 task_work_run+0x78/0x90 exit_to_usermode_loop+0x77/0xa6 do_syscall_64+0x1c5/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x7f1784f90827 RSP: 002b:00007ffdeeb03118 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 0000556a60c62970 RCX: 00007f1784f90827 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556a60c62b50 RBP: 0000000000000000 R08: 0000000000000005 R09: 00000000ffffffff R10: 0000556a60c63900 R11: 0000000000000246 R12: 0000556a60c62b50 R13: 00007f17854a81c4 R14: 0000000000000000 R15: 0000000000000000 RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: ffffa7cf40d7bd10 CR2: 0000000000000060 Fixes: 681ae509 ("Btrfs: cleanup reserved space when freeing tree log on error") CC: <stable@vger.kernel.org> # v3.13 Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Qu Wenruo 提交于
commit b72c3aba upstream. [BUG] For certain crafted image, whose csum root leaf has missing backref, if we try to trigger write with data csum, it could cause deadlock with the following kernel WARN_ON(): WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400 CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8 Workqueue: btrfs-endio-write btrfs_endio_write_helper RIP: 0010:btrfs_tree_lock+0x3e2/0x400 Call Trace: btrfs_alloc_tree_block+0x39f/0x770 __btrfs_cow_block+0x285/0x9e0 btrfs_cow_block+0x191/0x2e0 btrfs_search_slot+0x492/0x1160 btrfs_lookup_csum+0xec/0x280 btrfs_csum_file_blocks+0x2be/0xa60 add_pending_csums+0xaf/0xf0 btrfs_finish_ordered_io+0x74b/0xc90 finish_ordered_fn+0x15/0x20 normal_work_helper+0xf6/0x500 btrfs_endio_write_helper+0x12/0x20 process_one_work+0x302/0x770 worker_thread+0x81/0x6d0 kthread+0x180/0x1d0 ret_from_fork+0x35/0x40 [CAUSE] That crafted image has missing backref for csum tree root leaf. And when we try to allocate new tree block, since there is no EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot and use it. The extent tree of the image looks like: Normal image | This fuzzed image ----------------------------------+-------------------------------- BG 29360128 | BG 29360128 One empty slot | One empty slot 29364224: backref to UUID tree | 29364224: backref to UUID tree Two empty slots | Two empty slots 29376512: backref to CSUM tree | One empty slot (bad type) <<< 29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree ... | ... Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to alloc tree block, it's an valid slot for btrfs. And for finish_ordered_write, when we need to insert csum, we try to CoW csum tree root. By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K, BG_OFFSET + 12K is already used by tree block COW for other trees, the next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM tree. But due to the bad type, btrfs can recognize it and still consider it as an empty slot, and will try to use it for csum tree CoW. Then in the following call trace, we will try to lock the new tree block, which turns out to be the old csum tree root which is already locked: btrfs_search_slot() called on csum tree root, which is at 29376512 |- btrfs_cow_block() |- btrfs_set_lock_block() | |- Now locks tree block 29376512 (old csum tree root) |- __btrfs_cow_block() |- btrfs_alloc_tree_block() |- btrfs_reserve_extent() | Now it returns tree block 29376512, which extent tree | shows its empty slot, but it's already hold by csum tree |- btrfs_init_new_buffer() |- btrfs_tree_lock() | Triggers WARN_ON(eb->lock_owner == current->pid) |- wait_event() Wait lock owner to release the lock, but it's locked by ourself, so it will deadlock [FIX] This patch will do the lock_owner and current->pid check at btrfs_init_new_buffer(). So above deadlock can be avoided. Since such problem can only happen in crafted image, we will still trigger kernel warning for later aborted transaction, but with a little more meaningful warning message. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405Reported-by: NXu Wen <wen.xu@gatech.edu> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Qu Wenruo 提交于
commit 65c6e82b upstream. [BUG] When mounting certain crafted image, btrfs will trigger kernel BUG_ON() when trying to recover balance: kernel BUG at fs/btrfs/extent-tree.c:8956! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10 RIP: 0010:walk_up_proc+0x336/0x480 [btrfs] RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202 Call Trace: walk_up_tree+0x172/0x1f0 [btrfs] btrfs_drop_snapshot+0x3a4/0x830 [btrfs] merge_reloc_roots+0xe1/0x1d0 [btrfs] btrfs_recover_relocation+0x3ea/0x420 [btrfs] open_ctree+0x1af3/0x1dd0 [btrfs] btrfs_mount_root+0x66b/0x740 [btrfs] mount_fs+0x3b/0x16a vfs_kern_mount.part.9+0x54/0x140 btrfs_mount+0x16d/0x890 [btrfs] mount_fs+0x3b/0x16a vfs_kern_mount.part.9+0x54/0x140 do_mount+0x1fd/0xda0 ksys_mount+0xba/0xd0 __x64_sys_mount+0x21/0x30 do_syscall_64+0x60/0x210 entry_SYSCALL_64_after_hwframe+0x49/0xbe [CAUSE] Extent tree corruption. In this particular case, reloc tree root's owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is corrupted and we failed the owner check in walk_up_tree(). [FIX] It's pretty hard to take care of every extent tree corruption, but at least we can remove such BUG_ON() and exit more gracefully. And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share the same root (which is obviously invalid), we needs to make __del_reloc_root() more robust to detect such invalid sharing to avoid possible NULL dereference as root->node can be NULL in this case. Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411Reported-by: NXu Wen <wen.xu@gatech.edu> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Qu Wenruo 提交于
commit 3628b4ca upstream. Some qgroup trace events like btrfs_qgroup_release_data() and btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is not enabled. This is caused by the lack of qgroup status check before calling some qgroup functions. Thankfully the functions can handle quota disabled case well and just do nothing for qgroup disabled case. This patch will do earlier check before triggering related trace events. And for enabled <-> disabled race case: 1) For enabled->disabled case Disable will wipe out all qgroups data including reservation and excl/rfer. Even if we leak some reservation or numbers, it will still be cleared, so nothing will go wrong. 2) For disabled -> enabled case Current btrfs_qgroup_release_data() will use extent_io tree to ensure we won't underflow reservation. And for delayed_ref we use head->qgroup_reserved to record the reserved space, so in that case head->qgroup_reserved should be 0 and we won't underflow. CC: stable@vger.kernel.org # 4.14+ Reported-by: NChris Murphy <lists@colorremedies.com> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 24 8月, 2018 1 次提交
-
-
由 Misono Tomohiro 提交于
Commit 672d5990 ("btrfs: Use wrapper macro for rcu string to remove duplicate code") replaces some open coded RCU string handling with macro. It turns out that btrfs_debug_in_rcu() is used for the first time and the macro lacks lock/unlock of RCU string for non-debug case (i.e. when the message is not printed), leading to suspicious RCU usage warning when CONFIG_PROVE_RCU is on. Fix this by adding a wrapper to call lock/unlock for the non-debug case too. Fixes: 672d5990 ("btrfs: Use wrapper macro for rcu string to remove duplicate code") Reported-by: NDavid Howells <dhowells@redhat.com> Tested-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 23 8月, 2018 5 次提交
-
-
由 Dan Carpenter 提交于
The issue here is that btrfs_commit_transaction() frees "trans" on both the error and the success path. So the problem would be if btrfs_commit_transaction() succeeds, and then qgroup_rescan_init() fails. That means that "ret" is non-zero and "trans" is non-NULL and it leads to a use after free inside the btrfs_end_transaction() macro. Fixes: 340f1aa2 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable") Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
Test case btrfs/164 reports use-after-free: [ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP .. [ 6712.195423] btrfs_update_commit_device_size+0x75/0xf0 [btrfs] [ 6712.201424] btrfs_commit_transaction+0x57d/0xa90 [btrfs] [ 6712.206999] btrfs_rm_device+0x627/0x850 [btrfs] [ 6712.211800] btrfs_ioctl+0x2b03/0x3120 [btrfs] Reason for this is that btrfs_shrink_device adds the resized device to the fs_devices::resized_devices after it has called the last commit transaction. So the list fs_devices::resized_devices is not empty when btrfs_shrink_device returns. Now the parent function btrfs_rm_device calls: btrfs_close_bdev(device); call_rcu(&device->rcu, free_device_rcu); and then does the transactio ncommit. It goes through the fs_devices::resized_devices in btrfs_update_commit_device_size and leads to use-after-free. Fix this by making sure btrfs_shrink_device calls the last needed btrfs_commit_transaction before the return. This is consistent with what the grow counterpart does and this makes sure the on-disk state is persistent when the function returns. Reported-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Tested-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Lu Fengqi 提交于
After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we can't properly free the num_bytes of the previous qgroup_reserve. Use a separate variable to store the num_bytes of the qgroup_reserve. Delete the comment for the qgroup_reserved that does not exist and add a comment about use_global_rsv. Fixes: c4c129db ("btrfs: drop unused parameter qgroup_reserved") CC: stable@vger.kernel.org # 4.18+ Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
If we deduplicate extents between two different files we can end up corrupting data if the source range ends at the size of the source file, the source file's size is not aligned to the filesystem's block size and the destination range does not go past the size of the destination file size. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo # The first byte with a value of 0xae starts at an offset (2518890) # which is not a multiple of the sector size. $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo # Confirm the file content is full of bytes with values 0x6b and 0xae. $ od -t x1 /mnt/foo 0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b * 11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae 11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae * 11777540 ae ae ae ae ae ae ae ae 11777550 # Create a second file with a length not aligned to the sector size, # whose bytes all have the value 0x6b, so that its extent(s) can be # deduplicated with the first file. $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar # Now deduplicate the entire second file into a range of the first file # that also has all bytes with the value 0x6b. The destination range's # end offset must not be aligned to the sector size and must be less # then the offset of the first byte with the value 0xae (byte at offset # 2518890). $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo # The bytes in the range starting at offset 2515659 (end of the # deduplication range) and ending at offset 2519040 (start offset # rounded up to the block size) must all have the value 0xae (and not # replaced with 0x00 values). In other words, we should have exactly # the same data we had before we asked for deduplication. $ od -t x1 /mnt/foo 0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b * 11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae 11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae * 11777540 ae ae ae ae ae ae ae ae 11777550 # Unmount the filesystem and mount it again. This guarantees any file # data in the page cache is dropped. $ umount /dev/sdb $ mount /dev/sdb /mnt $ od -t x1 /mnt/foo 0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b * 11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00 11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae * 11777540 ae ae ae ae ae ae ae ae 11777550 # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a # value of 0xae, data corruption happened due to the deduplication # operation. So fix this by rounding down, to the sector size, the length used for the deduplication when the following conditions are met: 1) Source file's range ends at its i_size; 2) Source file's i_size is not aligned to the sector size; 3) Destination range does not cross the i_size of the destination file. Fixes: e1d227a4 ("btrfs: Handle unaligned length in extent_same") CC: stable@vger.kernel.org # 4.2+ Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
When we add a new name for an inode which was logged in the current transaction, we update the inode in the log so that its new name and ancestors are added to the log. However when we do this we do not persist the log, so the changes remain in memory only, and as a consequence, any ancestors that were created in the current transaction are updated such that future calls to btrfs_inode_in_log() return true. This leads to a subsequent fsync against such new ancestor directories returning immediately, without persisting the log, therefore after a power failure the new ancestor directories do not exist, despite fsync being called against them explicitly. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/A/C $ touch /mnt/B/foo $ xfs_io -c "fsync" /mnt/B/foo $ ln /mnt/B/foo /mnt/A/C/foo $ xfs_io -c "fsync" /mnt/A <power failure> After the power failure, directory "A" does not exist, despite the explicit fsync on it. Instead of fixing this by changing the behaviour of the explicit fsync on directory "A" to persist the log instead of doing nothing, make the logging of the new file name (which happens when creating a hard link or renaming) persist the log. This approach not only is simpler, not requiring addition of new fields to the inode in memory structure, but also gives us the same behaviour as ext4, xfs and f2fs (possibly other filesystems too). A test case for fstests follows soon. Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes") Reported-by: NVijay Chidambaram <vvijay03@gmail.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 18 8月, 2018 2 次提交
-
-
由 Jens Axboe 提交于
a_ops->readpages() is only ever used for read-ahead. Ensure that we pass this information down to the block layer. Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dkSigned-off-by: NJens Axboe <axboe@kernel.dk> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Robbie Ko 提交于
Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced nocow writes to fallback to COW, during writeback, when a snapshot is created. This resulted in writes made before creating the snapshot to unexpectedly fail with ENOSPC during writeback when success (0) was returned to user space through the write system call. The steps leading to this problem are: 1. When it's not possible to allocate data space for a write, the buffered write path checks if a NOCOW write is possible. If it is, it will not reserve space and success (0) is returned to user space. 2. Then when a snapshot is created, the root's will_be_snapshotted atomic is incremented and writeback is triggered for all inode's that belong to the root being snapshotted. Incrementing that atomic forces all previous writes to fallback to COW during writeback (running delalloc). 3. This results in the writeback for the inodes to fail and therefore setting the ENOSPC error in their mappings, so that a subsequent fsync on them will report the error to user space. So it's not a completely silent data loss (since fsync will report ENOSPC) but it's a very unexpected and undesirable behaviour, because if a clean shutdown/unmount of the filesystem happens without previous calls to fsync, it is expected to have the data present in the files after mounting the filesystem again. So fix this by adding a new atomic named snapshot_force_cow to the root structure which prevents this behaviour and works the following way: 1. It is incremented when we start to create a snapshot after triggering writeback and before waiting for writeback to finish. 2. This new atomic is now what is used by writeback (running delalloc) to decide whether we need to fallback to COW or not. Because we incremented this new atomic after triggering writeback in the snapshot creation ioctl, we ensure that all buffered writes that happened before snapshot creation will succeed and not fallback to COW (which would make them fail with ENOSPC). 3. The existing atomic, will_be_snapshotted, is kept because it is used to force new buffered writes, that start after we started snapshotting, to reserve data space even when NOCOW is possible. This makes these writes fail early with ENOSPC when there's no available space to allocate, preventing the unexpected behaviour of writeback later failing with ENOSPC due to a fallback to COW mode. Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting") Signed-off-by: NRobbie Ko <robbieko@synology.com> Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 06 8月, 2018 11 次提交
-
-
由 Naohiro Aota 提交于
When btrfs hits error after modifying fs_devices in btrfs_init_new_device() (such as btrfs_add_dev_item() returns error), it leaves everything as is, but frees allocated btrfs_device. As a result, fs_devices->devices and fs_devices->alloc_list contain already freed btrfs_device, leading to later use-after-free bug. Error path also messes the things like ->num_devices. While they go back to the original value by unscanning btrfs devices, it is safe to revert them here. Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling") Signed-off-by: NNaohiro Aota <naota@elisp.net> Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
It's entirely possible that a crafted btrfs image contains overlapping chunks. Although we can't detect such problem by tree-checker, it's not a catastrophic problem, current extent map can already detect such problem and return -EEXIST. We just only need to exit gracefully and fail the mount. Reported-by: NXu Wen <wen.xu@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200409Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
This patch will introduce chunk <-> dev extent mapping check, to protect us against invalid dev extents or chunks. Since chunk mapping is the fundamental infrastructure of btrfs, extra check at mount time could prevent a lot of unexpected behavior (BUG_ON). Reported-by: NXu Wen <wen.xu@gatech.edu> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200403 Link: https://bugzilla.kernel.org/show_bug.cgi?id=200407Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NSu Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
If a crafted image has missing block group items, it could cause unexpected behavior and breaks the assumption of 1:1 chunk<->block group mapping. Although we have the block group -> chunk mapping check, we still need chunk -> block group mapping check. This patch will do extra check to ensure each chunk has its corresponding block group. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847Reported-by: NXu Wen <wen.xu@gatech.edu> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NGu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
A crafted btrfs image with incorrect chunk<->block group mapping will trigger a lot of unexpected things as the mapping is essential. Although the problem can be caught by block group item checker added in "btrfs: tree-checker: Verify block_group_item", it's still not sufficient. A sufficiently valid block group item can pass the check added by the mentioned patch but could fail to match the existing chunk. This patch will add extra block group -> chunk mapping check, to ensure we have a completely matching (start, len, flags) chunk for each block group at mount time. Here we reuse the original helper find_first_block_group(), which is already doing the basic bg -> chunk checks, adding further checks of the start/len and type flags. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837Reported-by: NXu Wen <wen.xu@gatech.edu> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NSu Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
When doing an incremental send, if we have a file in the parent snapshot that has prealloc extents beyond EOF and in the send snapshot it got a hole punch that partially covers the prealloc extents, the send stream, when replayed by a receiver, can result in a file that has a size bigger than it should and filled with zeroes past the correct EOF. For example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "falloc -k 0 4M" /mnt/foobar $ xfs_io -c "pwrite -S 0xea 0 1M" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send -f /tmp/1.send /mnt/snap1 $ xfs_io -c "fpunch 1M 2M" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -f /tmp/2.send -p /mnt/snap1 /mnt/snap2 $ stat --format %s /mnt/snap2/foobar 1048576 $ md5sum /mnt/snap2/foobar d31659e82e87798acd4669a1e0a19d4f /mnt/snap2/foobar $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive -f /mnt/1.snap /mnt $ btrfs receive -f /mnt/2.snap /mnt $ stat --format %s /mnt/snap2/foobar 3145728 # --> should be 1Mb and not 3Mb (which was the end offset of hole # punch operation) $ md5sum /mnt/snap2/foobar 117baf295297c2a995f92da725b0b651 /mnt/snap2/foobar # --> should be d31659e82e87798acd4669a1e0a19d4f as in the original fs This issue actually happens only since commit ffa7c429 ("Btrfs: send, do not issue unnecessary truncate operations"), but before that commit we were issuing a write operation full of zeroes (to "punch" a hole) which was extending the file size beyond the correct value and then immediately issue a truncate operation to the correct size and undoing the previous write operation. Since the send protocol does not support fallocate, for extent preallocation and hole punching, fix this by not even attempting to send a "hole" (regular write full of zeroes) if it starts at an offset greater then or equals to the file's size. This approach, besides being much more simple then making send issue the truncate operation, adds the benefit of avoiding the useless pair of write of zeroes and truncate operations, saving time and IO at the receiver and reducing the size of the send stream. A test case for fstests follows soon. Fixes: ffa7c429 ("Btrfs: send, do not issue unnecessary truncate operations") CC: stable@vger.kernel.org # 4.17+ Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Misono Tomohiro 提交于
Cleanup patch and no functional changes. Signed-off-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com> Reviewed-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Al Viro 提交于
Don't open-code iget_failed(), don't bother with btrfs_free_path(NULL), move handling of positive return values of btrfs_lookup_inode() from btrfs_read_locked_inode() to btrfs_iget() and kill now obviously pointless ASSERT() in there. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Al Viro 提交于
We don't need to check is_bad_inode() after the call of btrfs_read_locked_inode() - it's exactly the same as checking return value for being non-zero. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Reviewed-by: NDavid Sterba <dsterba@suse.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Al Viro 提交于
IS_ERR(p) && PTR_ERR(p) == n is a weird way to spell p == ERR_PTR(n). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Reviewed-by: NDavid Sterba <dsterba@suse.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> [ update changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Al Viro 提交于
Just get rid of pointless checks. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-