- 09 6月, 2019 1 次提交
-
-
由 Filipe Manana 提交于
commit 5338e43abbab13791144d37fd8846847062351c6 upstream. When replaying a log that contains a new file or directory name that needs to be added to its parent directory, we end up updating the mtime and the ctime of the parent directory to the current time after we have set their values to the correct ones (set at fsync time), efectivelly losing them. Sample reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/file # fsync of the directory is optional, not needed $ xfs_io -c fsync /mnt/dir $ xfs_io -c fsync /mnt/dir/file $ stat -c %Y /mnt/dir 1557856079 <power failure> $ sleep 3 $ mount /dev/sdb /mnt $ stat -c %Y /mnt/dir 1557856082 --> should have been 1557856079, the mtime is updated to the current time when replaying the log Fix this by not updating the mtime and ctime to the current time at btrfs_add_link() when we are replaying a log tree. This could be triggered by my recent fsync fuzz tester for fstests, for which an fstests patch exists titled "fstests: generic, fsync fuzz tester with fsstress". Fixes: e02119d5 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 13 2月, 2019 1 次提交
-
-
由 Ethan Lien 提交于
[ Upstream commit 3cd24c698004d2f7668e0eb9fc1f096f533c791b ] Snapshot is expected to be fast. But if there are writers steadily creating dirty pages in our subvolume, the snapshot may take a very long time to complete. To fix the problem, we use tagged writepage for snapshot flusher as we do in the generic write_cache_pages(), so we can omit pages dirtied after the snapshot command. This does not change the semantics regarding which data get to the snapshot, if there are pages being dirtied during the snapshotting operation. There's a sync called before snapshot is taken in old/new case, any IO in flight just after that may be in the snapshot but this depends on other system effects that might still sync the IO. We do a simple snapshot speed test on a Intel D-1531 box: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 1m58sec patched: 6.54sec This is the best case for this patch since for a sequential write case, we omit nearly all pages dirtied after the snapshot command. For a multi writers, random write test: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 15.83sec patched: 10.35sec The improvement is smaller compared to the sequential write case, since we omit only half of the pages dirtied after snapshot command. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NEthan Lien <ethanlien@synology.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 26 1月, 2019 1 次提交
-
-
由 Johannes Thumshirn 提交于
[ Upstream commit 1690dd41e0cb1dade80850ed8a3eb0121b96d22f ] In the error handling block, err holds the return value of either btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked since it's introduction with commit fe66a05a (Btrfs: improve error handling for btrfs_insert_dir_item callers) in 2012. If the error handling in the error handling fails, there's not much left to do and the abort either happened earlier in the callees or is necessary here. So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort the transaction, but still return the original code of the failure stored in 'ret' as this will be reported to the user. Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 23 1月, 2019 1 次提交
-
-
由 David Sterba 提交于
commit 77b7aad195099e7c6da11e94b7fa6ef5e6fb0025 upstream. This reverts commit e73e81b6. This patch causes a few problems: - adds latency to btrfs_finish_ordered_io - as btrfs_finish_ordered_io is used for free space cache, generating more work from btrfs_btree_balance_dirty_nodelay could end up in the same workque, effectively deadlocking 12260 kworker/u96:16+btrfs-freespace-write D [<0>] balance_dirty_pages+0x6e6/0x7ad [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90 [<0>] btrfs_finish_ordered_io+0x3da/0x770 [<0>] normal_work_helper+0x1c5/0x5a0 [<0>] process_one_work+0x1ee/0x5a0 [<0>] worker_thread+0x46/0x3d0 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff Transaction commit will wait on the freespace cache: 838 btrfs-transacti D [<0>] btrfs_start_ordered_extent+0x154/0x1e0 [<0>] btrfs_wait_ordered_range+0xbd/0x110 [<0>] __btrfs_wait_cache_io+0x49/0x1a0 [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0 [<0>] commit_cowonly_roots+0x215/0x2b0 [<0>] btrfs_commit_transaction+0x37e/0x910 [<0>] transaction_kthread+0x14d/0x180 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff And then writepages ends up waiting on transaction commit: 9520 kworker/u96:13+flush-btrfs-1 D [<0>] wait_current_trans+0xac/0xe0 [<0>] start_transaction+0x21b/0x4b0 [<0>] cow_file_range_inline+0x10b/0x6b0 [<0>] cow_file_range.isra.69+0x329/0x4a0 [<0>] run_delalloc_range+0x105/0x3c0 [<0>] writepage_delalloc+0x119/0x180 [<0>] __extent_writepage+0x10c/0x390 [<0>] extent_write_cache_pages+0x26f/0x3d0 [<0>] extent_writepages+0x4f/0x80 [<0>] do_writepages+0x17/0x60 [<0>] __writeback_single_inode+0x59/0x690 [<0>] writeback_sb_inodes+0x291/0x4e0 [<0>] __writeback_inodes_wb+0x87/0xb0 [<0>] wb_writeback+0x3bb/0x500 [<0>] wb_workfn+0x40d/0x610 [<0>] process_one_work+0x1ee/0x5a0 [<0>] worker_thread+0x1e0/0x3d0 [<0>] kthread+0xf5/0x130 [<0>] ret_from_fork+0x24/0x30 [<0>] 0xffffffffffffffff Eventually, we have every process in the system waiting on balance_dirty_pages(), and nobody is able to make progress on page writeback. The original patch tried to fix an OOM condition, that happened on 4.4 but no success reproducing that on later kernels (4.19 and 4.20). This is more likely a problem in OOM itself. Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/Reported-by: NChris Mason <clm@fb.com> CC: stable@vger.kernel.org # 4.18+ CC: ethanlien <ethanlien@synology.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 10 1月, 2019 2 次提交
-
-
由 Filipe Manana 提交于
commit 41bd60676923822de1df2c50b3f9a10171f4338a upstream. The log tree has a long standing problem that when a file is fsync'ed we only check for new ancestors, created in the current transaction, by following only the hard link for which the fsync was issued. We follow the ancestors using the VFS' dget_parent() API. This means that if we create a new link for a file in a directory that is new (or in an any other new ancestor directory) and then fsync the file using an old hard link, we end up not logging the new ancestor, and on log replay that new hard link and ancestor do not exist. In some cases, involving renames, the file will not exist at all. Example: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt mkdir /mnt/A touch /mnt/foo ln /mnt/foo /mnt/A/bar xfs_io -c fsync /mnt/foo <power failure> In this example after log replay only the hard link named 'foo' exists and directory A does not exist, which is unexpected. In other major linux filesystems, such as ext4, xfs and f2fs for example, both hard links exist and so does directory A after mounting again the filesystem. Checking if any new ancestors are new and need to be logged was added in 2009 by commit 12fcfd22 ("Btrfs: tree logging unlink/rename fixes"), however only for the ancestors of the hard link (dentry) for which the fsync was issued, instead of checking for all ancestors for all of the inode's hard links. So fix this by tracking the id of the last transaction where a hard link was created for an inode and then on fsync fallback to a full transaction commit when an inode has more than one hard link and at least one new hard link was created in the current transaction. This is the simplest solution since this is not a common use case (adding frequently hard links for which there's an ancestor created in the current transaction and then fsync the file). In case it ever becomes a common use case, a solution that consists of iterating the fs/subvol btree for each hard link and check if any ancestor is new, could be implemented. This solves many unexpected scenarios reported by Jayashree Mohan and Vijay Chidambaram, and for which there is a new test case for fstests under review. Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes") CC: stable@vger.kernel.org # 4.4+ Reported-by: NVijay Chidambaram <vvijay03@gmail.com> Reported-by: NJayashree Mohan <jayashree2912@gmail.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Lu Fengqi 提交于
commit 27a7ff554e8d349627a90bda275c527b7348adae upstream. The test case btrfs/001 with inode_cache mount option will encounter the following warning: WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs] CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G W O 4.20.0-rc4-custom+ #30 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs] Call Trace: ? free_extent_buffer+0x46/0x90 [btrfs] run_delalloc_nocow+0x455/0x900 [btrfs] btrfs_run_delalloc_range+0x1a7/0x360 [btrfs] writepage_delalloc+0xf9/0x150 [btrfs] __extent_writepage+0x125/0x3e0 [btrfs] extent_write_cache_pages+0x1b6/0x3e0 [btrfs] ? __wake_up_common_lock+0x63/0xc0 extent_writepages+0x50/0x80 [btrfs] do_writepages+0x41/0xd0 ? __filemap_fdatawrite_range+0x9e/0xf0 __filemap_fdatawrite_range+0xbe/0xf0 btrfs_fdatawrite_range+0x1b/0x50 [btrfs] __btrfs_write_out_cache+0x42c/0x480 [btrfs] btrfs_write_out_ino_cache+0x84/0xd0 [btrfs] btrfs_save_ino_cache+0x551/0x660 [btrfs] commit_fs_roots+0xc5/0x190 [btrfs] btrfs_commit_transaction+0x2bf/0x8d0 [btrfs] btrfs_mksubvol+0x48d/0x4d0 [btrfs] btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs] btrfs_ioctl+0x123f/0x3030 [btrfs] The file extent generation of the free space inode is equal to the last snapshot of the file root, so the inode will be passed to cow_file_rage. But the inode was created and its extents were preallocated in btrfs_save_ino_cache, there are no cow copies on disk. The preallocated extent is not yet in the extent tree, and btrfs_cross_ref_exist will ignore the -ENOENT returned by check_committed_ref, so we can directly write the inode to the disk. Fixes: 78d4295b ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path") CC: stable@vger.kernel.org # 4.18+ Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 21 11月, 2018 1 次提交
-
-
由 Robbie Ko 提交于
commit 506481b20e818db40b6198815904ecd2d6daee64 upstream. When the cow_file_range fails, the related resources are unlocked according to the range [start..end), so the unlock cannot be repeated in run_delalloc_nocow. In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is not updated correctly, so move the cur_offset update before cow_file_range. kernel BUG at mm/page-writeback.c:2663! Internal error: Oops - BUG: 0 [#1] SMP CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O Hardware name: Realtek_RTD1296 (DT) Workqueue: writeback wb_workfn (flush-btrfs-1) task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000 PC is at clear_page_dirty_for_io+0x1bc/0x1e8 LR is at clear_page_dirty_for_io+0x14/0x1e8 pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145 sp : ffffffc02e9af4f0 Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020) Call trace: [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8 [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs] [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs] [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs] [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs] [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs] [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs] [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs] [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs] [<ffffffc00033d758>] do_writepages+0x30/0x88 [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198 [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0 [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0 [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0 [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250 [<ffffffc0002b0014>] process_one_work+0x1dc/0x388 [<ffffffc0002b02f0>] worker_thread+0x130/0x500 [<ffffffc0002b6344>] kthread+0x10c/0x110 [<ffffffc000284590>] ret_from_fork+0x10/0x40 Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000) CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NRobbie Ko <robbieko@synology.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 14 11月, 2018 3 次提交
-
-
由 Filipe Manana 提交于
commit 421f0922a2cfb0c75acd9746454aaa576c711a65 upstream. At inode.c:evict_inode_truncate_pages(), when we iterate over the inode's extent states, we access an extent state record's "state" field after we unlocked the inode's io tree lock. This can lead to a use-after-free issue because after we unlock the io tree that extent state record might have been freed due to being merged into another adjacent extent state record (a previous inflight bio for a read operation finished in the meanwhile which unlocked a range in the io tree and cause a merge of extent state records, as explained in the comment before the while loop added in commit 6ca07097 ("Btrfs: fix hang during inode eviction due to concurrent readahead")). Fix this by keeping a copy of the extent state's flags in a local variable and using it after unlocking the io tree. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201189 Fixes: b9d0b389 ("btrfs: Add handler for invalidate page") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Josef Bacik 提交于
commit 49940bdd57779c78462da7aa5a8650b2fea8c2ff upstream. When we insert the file extent once the ordered extent completes we free the reserved extent reservation as it'll have been migrated to the bytes_used counter. However if we error out after this step we'll still clear the reserved extent reservation, resulting in a negative accounting of the reserved bytes for the block group and space info. Fix this by only doing the free if we didn't successfully insert a file extent for this extent. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Filipe Manana 提交于
commit 3527a018c00e5dbada2f9d7ed5576437b6dd5cfb upstream. At inode.c:compress_file_range(), under the "free_pages_out" label, we can end up dereferencing the "pages" pointer when it has a NULL value. This case happens when "start" has a value of 0 and we fail to allocate memory for the "pages" pointer. When that happens we jump to the "cont" label and then enter the "if (start == 0)" branch where we immediately call the cow_file_range_inline() function. If that function returns 0 (success creating an inline extent) or an error (like -ENOMEM for example) we jump to the "free_pages_out" label and then access "pages[i]" leading to a NULL pointer dereference, since "nr_pages" has a value greater than zero at that point. Fix this by setting "nr_pages" to 0 when we fail to allocate memory for the "pages" pointer. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201119 Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 23 8月, 2018 1 次提交
-
-
由 Filipe Manana 提交于
When we add a new name for an inode which was logged in the current transaction, we update the inode in the log so that its new name and ancestors are added to the log. However when we do this we do not persist the log, so the changes remain in memory only, and as a consequence, any ancestors that were created in the current transaction are updated such that future calls to btrfs_inode_in_log() return true. This leads to a subsequent fsync against such new ancestor directories returning immediately, without persisting the log, therefore after a power failure the new ancestor directories do not exist, despite fsync being called against them explicitly. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/A/C $ touch /mnt/B/foo $ xfs_io -c "fsync" /mnt/B/foo $ ln /mnt/B/foo /mnt/A/C/foo $ xfs_io -c "fsync" /mnt/A <power failure> After the power failure, directory "A" does not exist, despite the explicit fsync on it. Instead of fixing this by changing the behaviour of the explicit fsync on directory "A" to persist the log instead of doing nothing, make the logging of the new file name (which happens when creating a hard link or renaming) persist the log. This approach not only is simpler, not requiring addition of new fields to the inode in memory structure, but also gives us the same behaviour as ext4, xfs and f2fs (possibly other filesystems too). A test case for fstests follows soon. Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes") Reported-by: NVijay Chidambaram <vvijay03@gmail.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 18 8月, 2018 1 次提交
-
-
由 Robbie Ko 提交于
Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced nocow writes to fallback to COW, during writeback, when a snapshot is created. This resulted in writes made before creating the snapshot to unexpectedly fail with ENOSPC during writeback when success (0) was returned to user space through the write system call. The steps leading to this problem are: 1. When it's not possible to allocate data space for a write, the buffered write path checks if a NOCOW write is possible. If it is, it will not reserve space and success (0) is returned to user space. 2. Then when a snapshot is created, the root's will_be_snapshotted atomic is incremented and writeback is triggered for all inode's that belong to the root being snapshotted. Incrementing that atomic forces all previous writes to fallback to COW during writeback (running delalloc). 3. This results in the writeback for the inodes to fail and therefore setting the ENOSPC error in their mappings, so that a subsequent fsync on them will report the error to user space. So it's not a completely silent data loss (since fsync will report ENOSPC) but it's a very unexpected and undesirable behaviour, because if a clean shutdown/unmount of the filesystem happens without previous calls to fsync, it is expected to have the data present in the files after mounting the filesystem again. So fix this by adding a new atomic named snapshot_force_cow to the root structure which prevents this behaviour and works the following way: 1. It is incremented when we start to create a snapshot after triggering writeback and before waiting for writeback to finish. 2. This new atomic is now what is used by writeback (running delalloc) to decide whether we need to fallback to COW or not. Because we incremented this new atomic after triggering writeback in the snapshot creation ioctl, we ensure that all buffered writes that happened before snapshot creation will succeed and not fallback to COW (which would make them fail with ENOSPC). 3. The existing atomic, will_be_snapshotted, is kept because it is used to force new buffered writes, that start after we started snapshotting, to reserve data space even when NOCOW is possible. This makes these writes fail early with ENOSPC when there's no available space to allocate, preventing the unexpected behaviour of writeback later failing with ENOSPC due to a fallback to COW mode. Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting") Signed-off-by: NRobbie Ko <robbieko@synology.com> Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 06 8月, 2018 18 次提交
-
-
由 Al Viro 提交于
Don't open-code iget_failed(), don't bother with btrfs_free_path(NULL), move handling of positive return values of btrfs_lookup_inode() from btrfs_read_locked_inode() to btrfs_iget() and kill now obviously pointless ASSERT() in there. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Al Viro 提交于
We don't need to check is_bad_inode() after the call of btrfs_read_locked_inode() - it's exactly the same as checking return value for being non-zero. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Reviewed-by: NDavid Sterba <dsterba@suse.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Lu Fengqi 提交于
Although it is safe to call this on already released paths with no locks held or extent buffers, removing the redundant btrfs_release_path is reasonable. Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Lu Fengqi 提交于
All callers pass the root tree of dir, we can push that down to the function itself. Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Lu Fengqi 提交于
It can be referenced from the passed transaction handle. Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Lu Fengqi 提交于
It can be referenced from the passed transaction handle. Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Lu Fengqi 提交于
It can be referenced from the passed transaction handle. Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We recently ran into the following deadlock involving btrfs_write_inode(): [ +0.005066] __schedule+0x38e/0x8c0 [ +0.007144] schedule+0x36/0x80 [ +0.006447] bit_wait+0x11/0x60 [ +0.006446] __wait_on_bit+0xbe/0x110 [ +0.007487] ? bit_wait_io+0x60/0x60 [ +0.007319] __inode_wait_for_writeback+0x96/0xc0 [ +0.009568] ? autoremove_wake_function+0x40/0x40 [ +0.009565] inode_wait_for_writeback+0x21/0x30 [ +0.009224] evict+0xb0/0x190 [ +0.006099] iput+0x1a8/0x210 [ +0.006103] btrfs_run_delayed_iputs+0x73/0xc0 [ +0.009047] btrfs_commit_transaction+0x799/0x8c0 [ +0.009567] btrfs_write_inode+0x81/0xb0 [ +0.008008] __writeback_single_inode+0x267/0x320 [ +0.009569] writeback_sb_inodes+0x25b/0x4e0 [ +0.008702] wb_writeback+0x102/0x2d0 [ +0.007487] wb_workfn+0xa4/0x310 [ +0.006794] ? wb_workfn+0xa4/0x310 [ +0.007143] process_one_work+0x150/0x410 [ +0.008179] worker_thread+0x6d/0x520 [ +0.007490] kthread+0x12c/0x160 [ +0.006620] ? put_pwq_unlocked+0x80/0x80 [ +0.008185] ? kthread_park+0xa0/0xa0 [ +0.007484] ? do_syscall_64+0x53/0x150 [ +0.007837] ret_from_fork+0x29/0x40 Writeback calls: btrfs_write_inode btrfs_commit_transaction btrfs_run_delayed_iputs If iput() is called on that same inode, evict() will wait for writeback forever. btrfs_write_inode() was originally added way back in 4730a4bc ("btrfs_dirty_inode") to support O_SYNC writes. However, ->write_inode() hasn't been used for O_SYNC since 148f948b ("vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode"), so btrfs_write_inode() is actually unnecessary (and leads to a bunch of unnecessary commits). Get rid of it, which also gets rid of the deadlock. CC: stable@vger.kernel.org # 3.2+ Signed-off-by: NJosef Bacik <jbacik@fb.com> [Omar: new commit message] Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The data and metadata callback implementation both use the same function. We can remove the call indirection and intermediate helper completely. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The data and metadata callback implementation both use the same function. We can remove the call indirection completely. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
All implementations of the callback are trivial and do the same and there's only one user. Merge everything together. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The end_io callbacks passed to btrfs_wq_submit_bio (btrfs_submit_bio_done and btree_submit_bio_done) are effectively the same code, there's no point to do the indirection. Export btrfs_submit_bio_done and call it directly. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The helper is trivial and marked as deprecated. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Remove includes if none of the interfaces and exports is used in the given source file. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Functions that get btrfs inode can simply reach the fs_info by dereferencing the root and this looks a bit more straightforward compared to the btrfs_sb(...) indirection. If the transaction handle is available and not NULL it's used instead. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
There are several places when the btrfs inode is converted to the generic inode, back to btrfs and then passed to btrfs_ino. We can remove the extra back and forth conversions. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Arnd Bergmann 提交于
While the regular inode timestamps all use timespec64 now, the i_otime field is btrfs specific and still needs to be converted to correctly represent times beyond 2038. Signed-off-by: NArnd Bergmann <arnd@arndb.de> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
We used to call btrfs_file_extent_inline_len() to get the uncompressed data size of an inlined extent. However this function is hiding evil, for compressed extent, it has no choice but to directly read out ram_bytes from btrfs_file_extent_item. While for uncompressed extent, it uses item size to calculate the real data size, and ignoring ram_bytes completely. In fact, for corrupted ram_bytes, due to above behavior kernel btrfs_print_leaf() can't even print correct ram_bytes to expose the bug. Since we have the tree-checker to verify all EXTENT_DATA, such mismatch can be detected pretty easily, thus we can trust ram_bytes without the evil btrfs_file_extent_inline_len(). Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 04 8月, 2018 1 次提交
-
-
由 Al Viro 提交于
Make sure that no partially set up inodes can be returned by open-by-handle. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 6月, 2018 1 次提交
-
-
由 Chris Mason 提交于
The vm_fault_t conversion commit introduced a ret2 variable for tracking the integer return values from internal btrfs functions. It was sometimes returning VM_FAULT_LOCKED for pages that were actually invalid and had been removed from the radix. Something like this: ret2 = btrfs_delalloc_reserve_space() // returns zero on success lock_page(page) if (page->mapping != inode->i_mapping) goto out_unlock; ... out_unlock: if (!ret2) { ... return VM_FAULT_LOCKED; } This ends up triggering this WARNING in btrfs_destroy_inode() WARN_ON(BTRFS_I(inode)->block_rsv.size); xfstests generic/095 was able to reliably reproduce the errors. Since out_unlock: is only used for errors, this fix moves it below the if (!ret2) check we use to return VM_FAULT_LOCKED for success. Fixes: a528a241 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t) Signed-off-by: NChris Mason <clm@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 22 6月, 2018 1 次提交
-
-
由 Filipe Manana 提交于
If we failed during a rename exchange operation after starting/joining a transaction, we would end up replacing the return value, stored in the local 'ret' variable, with the return value from btrfs_end_transaction(). So this could end up returning 0 (success) to user space despite the operation having failed and aborted the transaction, because if there are multiple tasks having a reference on the transaction at the time btrfs_end_transaction() is called by the rename exchange, that function returns 0 (otherwise it returns -EIO and not the original error value). So fix this by not overwriting the return value on error after getting a transaction handle. Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 07 6月, 2018 1 次提交
-
-
由 Souptick Joarder 提交于
Use the new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Reference commit 1c8f4220 ("mm: change return type to vm_fault_t") vmf_error() is the newly introduced inline function in 4.17-rc6. Signed-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 06 6月, 2018 1 次提交
-
-
由 Deepa Dinamani 提交于
struct timespec is not y2038 safe. Transition vfs to use y2038 safe struct timespec64 instead. The change was made with the help of the following cocinelle script. This catches about 80% of the changes. All the header file and logic changes are included in the first 5 rules. The rest are trivial substitutions. I avoid changing any of the function signatures or any other filesystem specific data structures to keep the patch simple for review. The script can be a little shorter by combining different cases. But, this version was sufficient for my usecase. virtual patch @ depends on patch @ identifier now; @@ - struct timespec + struct timespec64 current_time ( ... ) { - struct timespec now = current_kernel_time(); + struct timespec64 now = current_kernel_time64(); ... - return timespec_trunc( + return timespec64_trunc( ... ); } @ depends on patch @ identifier xtime; @@ struct \( iattr \| inode \| kstat \) { ... - struct timespec xtime; + struct timespec64 xtime; ... } @ depends on patch @ identifier t; @@ struct inode_operations { ... int (*update_time) (..., - struct timespec t, + struct timespec64 t, ...); ... } @ depends on patch @ identifier t; identifier fn_update_time =~ "update_time$"; @@ fn_update_time (..., - struct timespec *t, + struct timespec64 *t, ...) { ... } @ depends on patch @ identifier t; @@ lease_get_mtime( ... , - struct timespec *t + struct timespec64 *t ) { ... } @te depends on patch forall@ identifier ts; local idexpression struct inode *inode_node; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn_update_time =~ "update_time$"; identifier fn; expression e, E3; local idexpression struct inode *node1; local idexpression struct inode *node2; local idexpression struct iattr *attr1; local idexpression struct iattr *attr2; local idexpression struct iattr attr; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; @@ ( ( - struct timespec ts; + struct timespec64 ts; | - struct timespec ts = current_time(inode_node); + struct timespec64 ts = current_time(inode_node); ) <+... when != ts ( - timespec_equal(&inode_node->i_xtime, &ts) + timespec64_equal(&inode_node->i_xtime, &ts) | - timespec_equal(&ts, &inode_node->i_xtime) + timespec64_equal(&ts, &inode_node->i_xtime) | - timespec_compare(&inode_node->i_xtime, &ts) + timespec64_compare(&inode_node->i_xtime, &ts) | - timespec_compare(&ts, &inode_node->i_xtime) + timespec64_compare(&ts, &inode_node->i_xtime) | ts = current_time(e) | fn_update_time(..., &ts,...) | inode_node->i_xtime = ts | node1->i_xtime = ts | ts = inode_node->i_xtime | <+... attr1->ia_xtime ...+> = ts | ts = attr1->ia_xtime | ts.tv_sec | ts.tv_nsec | btrfs_set_stack_timespec_sec(..., ts.tv_sec) | btrfs_set_stack_timespec_nsec(..., ts.tv_nsec) | - ts = timespec64_to_timespec( + ts = ... -) | - ts = ktime_to_timespec( + ts = ktime_to_timespec64( ...) | - ts = E3 + ts = timespec_to_timespec64(E3) | - ktime_get_real_ts(&ts) + ktime_get_real_ts64(&ts) | fn(..., - ts + timespec64_to_timespec(ts) ,...) ) ...+> ( <... when != ts - return ts; + return timespec64_to_timespec(ts); ...> ) | - timespec_equal(&node1->i_xtime1, &node2->i_xtime2) + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2) | - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2) + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2) | - timespec_compare(&node1->i_xtime1, &node2->i_xtime2) + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2) | node1->i_xtime1 = - timespec_trunc(attr1->ia_xtime1, + timespec64_trunc(attr1->ia_xtime1, ...) | - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2, + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2, ...) | - ktime_get_real_ts(&attr1->ia_xtime1) + ktime_get_real_ts64(&attr1->ia_xtime1) | - ktime_get_real_ts(&attr.ia_xtime1) + ktime_get_real_ts64(&attr.ia_xtime1) ) @ depends on patch @ struct inode *node; struct iattr *attr; identifier fn; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; expression e; @@ ( - fn(node->i_xtime); + fn(timespec64_to_timespec(node->i_xtime)); | fn(..., - node->i_xtime); + timespec64_to_timespec(node->i_xtime)); | - e = fn(attr->ia_xtime); + e = fn(timespec64_to_timespec(attr->ia_xtime)); ) @ depends on patch forall @ struct inode *node; struct iattr *attr; identifier i_xtime =~ "^i_[acm]time$"; identifier ia_xtime =~ "^ia_[acm]time$"; identifier fn; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); fn (..., - &attr->ia_xtime, + &ts, ...); ) ...+> } @ depends on patch forall @ struct inode *node; struct iattr *attr; struct kstat *stat; identifier ia_xtime =~ "^ia_[acm]time$"; identifier i_xtime =~ "^i_[acm]time$"; identifier xtime =~ "^[acm]time$"; identifier fn, ret; @@ { + struct timespec ts; <+... ( + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime, + &ts, ...); | + ts = timespec64_to_timespec(node->i_xtime); ret = fn (..., - &node->i_xtime); + &ts); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime, + &ts, ...); | + ts = timespec64_to_timespec(attr->ia_xtime); ret = fn (..., - &attr->ia_xtime); + &ts); | + ts = timespec64_to_timespec(stat->xtime); ret = fn (..., - &stat->xtime); + &ts); ) ...+> } @ depends on patch @ struct inode *node; struct inode *node2; identifier i_xtime1 =~ "^i_[acm]time$"; identifier i_xtime2 =~ "^i_[acm]time$"; identifier i_xtime3 =~ "^i_[acm]time$"; struct iattr *attrp; struct iattr *attrp2; struct iattr attr ; identifier ia_xtime1 =~ "^ia_[acm]time$"; identifier ia_xtime2 =~ "^ia_[acm]time$"; struct kstat *stat; struct kstat stat1; struct timespec64 ts; identifier xtime =~ "^[acmb]time$"; expression e; @@ ( ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ; | node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \); | stat->xtime = node2->i_xtime1; | stat1.xtime = node2->i_xtime1; | ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ; | ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2; | - e = node->i_xtime1; + e = timespec64_to_timespec( node->i_xtime1 ); | - e = attrp->ia_xtime1; + e = timespec64_to_timespec( attrp->ia_xtime1 ); | node->i_xtime1 = current_time(...); | node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | node->i_xtime1 = node->i_xtime3 = - e; + timespec_to_timespec64(e); | - node->i_xtime1 = e; + node->i_xtime1 = timespec_to_timespec64(e); ) Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Cc: <anton@tuxera.com> Cc: <balbi@kernel.org> Cc: <bfields@fieldses.org> Cc: <darrick.wong@oracle.com> Cc: <dhowells@redhat.com> Cc: <dsterba@suse.com> Cc: <dwmw2@infradead.org> Cc: <hch@lst.de> Cc: <hirofumi@mail.parknet.co.jp> Cc: <hubcap@omnibond.com> Cc: <jack@suse.com> Cc: <jaegeuk@kernel.org> Cc: <jaharkes@cs.cmu.edu> Cc: <jslaby@suse.com> Cc: <keescook@chromium.org> Cc: <mark@fasheh.com> Cc: <miklos@szeredi.hu> Cc: <nico@linaro.org> Cc: <reiserfs-devel@vger.kernel.org> Cc: <richard@nod.at> Cc: <sage@redhat.com> Cc: <sfrench@samba.org> Cc: <swhiteho@redhat.com> Cc: <tj@kernel.org> Cc: <trond.myklebust@primarydata.com> Cc: <tytso@mit.edu> Cc: <viro@zeniv.linux.org.uk>
-
- 31 5月, 2018 3 次提交
-
-
由 Omar Sandoval 提交于
btrfs_truncate() uses two variables for error handling, ret and err (if this sounds familiar, it's because btrfs_truncate_inode_items() did something similar). This is error prone, as was made evident by "Btrfs: fix error handling in btrfs_truncate()". We only have err because we don't want to mask an error if we call btrfs_update_inode() and btrfs_end_transaction(), so let's make that its own scoped return variable and use ret everywhere else. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Now that the read side is extracted into its own function, do the same to the write side. This leaves btrfs_get_blocks_direct_write with the sole purpose of handling common locking required. Also flip the condition in btrfs_get_blocks_direct_write so that the write case comes first and we check for if (Create) rather than if (!create). This is purely subjective but I believe makes reading a bit more "linear". No functional changes. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Currently this function handles both the READ and WRITE dio cases. This is facilitated by a bunch of 'if' statements, a goto short-circuit statement and a very perverse aliasing of "!created"(READ) case by setting lockstart = lockend and checking for lockstart < lockend for detecting the write. Let's simplify this mess by extracting the READ-only code into a separate __btrfs_get_block_direct_read function. This is only the first step, the next one will be to factor out the write side as well. The end goal will be to have the common locking/ unlocking code in btrfs_get_blocks_direct and then it will call either the read|write subvariants. No functional changes. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 30 5月, 2018 2 次提交
-
-
由 Su Yue 提交于
In cow_file_range(), create_io_em() may fail, but its return value is not recorded. Then return value may be 0 even it failed which is a wrong behavior. Let cow_file_range() return PTR_ERR(em) if create_io_em() failed. Fixes: 6f9994db ("Btrfs: create a helper to create em for IO") CC: stable@vger.kernel.org # 4.11+ Signed-off-by: NSu Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Gu JinXiang 提交于
Since commit 7775c818 ("btrfs: remove unused parameter from btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used by caller of function btrfs_subvolume_reserve_metadata. So remove it. Signed-off-by: NGu JinXiang <gujx@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-