- 18 11月, 2019 17 次提交
-
-
由 Omar Sandoval 提交于
Currently, reada_start_machine_worker() frees the reada_machine_work and then calls __reada_start_machine() to do readahead. This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". There _might_ already be a deadlock here: reada_start_machine_worker() can depend on itself through stacked filesystems (__read_start_machine() -> reada_start_machine_dev() -> reada_tree_block_flagged() -> read_extent_buffer_pages() -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() -> submit_bio() onto a loop device can trigger readahead on the lower filesystem). Either way, let's fix it by freeing the work at the end. Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds the work item) and then calls bio_endio(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". In particular, the endio call may depend on other work items. For example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() -> __btrfs_correct_data_nocsum() -> dio_read_error() -> submit_dio_repair_bio(), which submits a bio that is also completed through a end_workqueue_fn() work item. However, __btrfs_correct_data_nocsum() waits for the newly submitted bio to complete, thus it depends on another work item. This example currently usually works because we use different workqueue helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR. However, it may deadlock with stacked filesystems and is fragile overall. The proper fix is to free the work item at the very end of the work function, so let's do that. Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
We hit the following very strange deadlock on a system with Btrfs on a loop device backed by another Btrfs filesystem: 1. The top (loop device) filesystem queues an async_cow work item from cow_file_range_async(). We'll call this work X. 2. Worker thread A starts work X (normal_work_helper()). 3. Worker thread A executes the ordered work for the top filesystem (run_ordered_work()). 4. Worker thread A finishes the ordered work for work X and frees X (work->ordered_free()). 5. Worker thread A executes another ordered work and gets blocked on I/O to the bottom filesystem (still in run_ordered_work()). 6. Meanwhile, the bottom filesystem allocates and queues an async_cow work item which happens to be the recently-freed X. 7. The workqueue code sees that X is already being executed by worker thread A, so it schedules X to be executed _after_ worker thread A finishes (see the find_worker_executing_work() call in process_one_work()). Now, the top filesystem is waiting for I/O on the bottom filesystem, but the bottom filesystem is waiting for the top filesystem to finish, so we deadlock. This happens because we are breaking the workqueue assumption that a work item cannot be recycled while it still depends on other work. Fix it by waiting to free the work item until we are done with all of the related ordered work. P.S.: One might ask why the workqueue code doesn't try to detect a recycled work item. It actually does try by checking whether the work item has the same work function (find_worker_executing_work()), but in our case the function is the same. This is the only key that the workqueue code has available to compare, short of adding an additional, layer-violating "custom key". Considering that we're the only ones that have ever hit this, we should just play by the rules. Unfortunately, we haven't been able to create a minimal reproducer other than our full container setup using a compress-force=zstd filesystem on top of another compress-force=zstd filesystem. Suggested-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
Commit fc97fab0 ("btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue.") converted qgroup_rescan_work to be initialized with btrfs_init_work(), but it left behind an unnecessary memset(). Get rid of the memset(). Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
This needs to be cleaned up in the future, but for now it belongs to the extent-io-tree stuff since it uses the internal tree search code. Needed to export get_state_failrec and set_state_failrec as well since we're not going to move the actual IO part of the failrec stuff out at this point. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
This utilizes internal stuff to the extent_io_tree, so we need to export it before we move it. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
extent_io.c/h are huge, encompassing a bunch of different things. The extent_io_tree code can live on its own, so separate this out. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We are moving extent_io_tree into it's on file, so separate out the extent_state init stuff from extent_io_tree_init(). Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We check both extent buffer and extent state leaks in the same function, separate these two functions out so we can move them around. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The following comment shows up in btrfs_search_slot() with out much sense: /* * setup the path here so we can release it under lock * contention with the cow code */ if (cow) { /* code touching path->lock[] is far away from here */ } This comment hasn't been cleaned up after the relevant code has been removed. The original code is introduced in commit 65b51a00 ("btrfs_search_slot: reduce lock contention by cowing in two stages"): + + /* + * setup the path here so we can release it under lock + * contention with the cow code + */ + p->nodes[level] = b; + if (!p->skip_locking) + p->locks[level] = 1; + But in current code, we have different timing for modifying path lock, so just remove the comment. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Similar to btrfs_search_slot() done in previous patch, make a shortcut for the level 0 case and allow to reduce indentation for the remaining case. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
In btrfs_search_slot(), we something like: if (level != 0) { /* Do search inside tree nodes*/ } else { /* Do search inside tree leaves */ goto done; } This caused extra indent for tree node search code. Change it to something like: if (level == 0) { /* Do search inside tree leaves */ goto done' } /* Do search inside tree nodes */ So we have more space to maneuver our code, this is especially useful as the tree nodes search code is more complex than the leaves search code. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
For INODE_REF we will check: - Objectid (ino) against previous key To detect missing INODE_ITEM. - No overflow/padding in the data payload Much like DIR_ITEM, but with less members to check. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
For the following items, key->objectid is inode number: - DIR_ITEM - DIR_INDEX - XATTR_ITEM - EXTENT_DATA - INODE_REF So in the subvolume tree, such items must have its previous item share the same objectid, e.g.: (257 INODE_ITEM 0) (257 DIR_INDEX xxx) (257 DIR_ITEM xxx) (258 INODE_ITEM 0) (258 INODE_REF 0) (258 XATTR_ITEM 0) (258 EXTENT_DATA 0) But if we have the following sequence, then there is definitely something wrong, normally some INODE_ITEM is missing, like: (257 INODE_ITEM 0) (257 DIR_INDEX xxx) (257 DIR_ITEM xxx) (258 XATTR_ITEM 0) <<< objecitd suddenly changed to 258 (258 EXTENT_DATA 0) So just by checking the previous key for above inode based key types, we can detect a missing inode item. For INODE_REF key type, the check will be added along with INODE_REF checker. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
It's not used ouside of transaction.c Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
A recent patch to btrfs showed that there was at least 1 case where a nested transaction was committed. Nested transaction in this case means a code which has a transaction handle calls some function which in turn obtains a copy of the same transaction handle. In such cases the correct thing to do is for the lower callee to call btrfs_end_transaction which contains appropriate checks so as to not commit the transaction which will result in stale trans handler for the caller. To catch such cases add an assert in btrfs_commit_transaction ensuring btrfs_trans_handle::use_count is always 1. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Goldwyn Rodrigues 提交于
This is similar to 942491c9 ("xfs: fix AIM7 regression"). Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. This causes extra contention on the lock and can be measured eg. by AIM7 benchmark. So change our read/write methods to just do the trylock for the RWF_NOWAIT case. Fixes: edf064e7 ("btrfs: nowait aio support") Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 11月, 2019 1 次提交
-
-
由 David Howells 提交于
When a lookup is done, the afs filesystem will perform a bulk status-fetch operation on the requested vnode (file) plus the next 49 other vnodes from the directory list (in AFS, directory contents are downloaded as blobs and parsed locally). When the results are received, it will speculatively populate the inode cache from the extra data. However, if the lookup races with another lookup on the same directory, but for a different file - one that's in the 49 extra fetches, then if the bulk status-fetch operation finishes first, it will try and update the inode from the other lookup. If this other inode is still in the throes of being created, however, this will cause an assertion failure in afs_apply_status(): BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags)); on or about fs/afs/inode.c:175 because it expects data to be there already that it can compare to. Fix this by skipping the update if the inode is being created as the creator will presumably set up the inode with the same information. Fixes: 39db9815 ("afs: Fix application of the results of a inline bulk status fetch") Signed-off-by: NDavid Howells <dhowells@redhat.com> Reviewed-by: NMarc Dionne <marc.dionne@auristor.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 11月, 2019 2 次提交
-
-
由 Jeff Layton 提交于
Ceph can in some cases issue an async DIO request, in which case we can end up calling ceph_end_io_direct before the I/O is actually complete. That may allow buffered operations to proceed while DIO requests are still in flight. Fix this by incrementing the i_dio_count when issuing an async DIO request, and decrement it when tearing down the aio_req. Fixes: 321fe13c ("ceph: add buffered/direct exclusionary locking for reads and writes") Signed-off-by: NJeff Layton <jlayton@kernel.org> Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
由 Jeff Layton 提交于
Most of the time, we (or the vfs layer) takes the inode_lock and then acquires caps, but ceph_read_iter does the opposite, and that can lead to a deadlock. When there are multiple clients treading over the same data, we can end up in a situation where a reader takes caps and then tries to acquire the inode_lock. Another task holds the inode_lock and issues a request to the MDS which needs to revoke the caps, but that can't happen until the inode_lock is unwedged. Fix this by having ceph_read_iter take the inode_lock earlier, before attempting to acquire caps. Fixes: 321fe13c ("ceph: add buffered/direct exclusionary locking for reads and writes") Link: https://tracker.ceph.com/issues/36348Signed-off-by: NJeff Layton <jlayton@kernel.org> Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
- 14 11月, 2019 2 次提交
-
-
由 Jens Axboe 提交于
A test case was reported where two linked reads with registered buffers failed the second link always. This is because we set the expected value of a request in req->result, and if we don't get this result, then we fail the dependent links. For some reason the registered buffer import returned -ERROR/0, while the normal import returns -ERROR/length. This broke linked commands with registered buffers. Fix this by making io_import_fixed() correctly return the mapped length. Cc: stable@vger.kernel.org # v5.3 Reported-by: N李通洲 <carter.li@eoitek.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
For timeout requests io_uring tries to grab a file with specified fd, which is usually stdin/fd=0. Update io_op_needs_file() Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 12 11月, 2019 2 次提交
-
-
由 Jens Axboe 提交于
Currently we make sequence == 0 be the same as sequence == 1, but that's not super useful if the intent is really to have a timeout that's just a pure timeout. If the user passes in sqe->off == 0, then don't apply any sequence logic to the request, let it purely be driven by the timeout specified. Reported-by: N李通洲 <carter.li@eoitek.com> Reviewed-by: N李通洲 <carter.li@eoitek.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Filipe Manana 提交于
During rename exchange we might have successfully log the new name in the source root's log tree, in which case we leave our log context (allocated on stack) in the root's list of log contextes. However we might fail to log the new name in the destination root, in which case we fallback to a transaction commit later and never sync the log of the source root, which causes the source root log context to remain in the list of log contextes. This later causes invalid memory accesses because the context was allocated on stack and after rename exchange finishes the stack gets reused and overwritten for other purposes. The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can detect this and report something like the following: [ 691.489929] ------------[ cut here ]------------ [ 691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38). [ 691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0 (...) [ 691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1 [ 691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [ 691.490003] RIP: 0010:__list_add_valid+0x95/0xe0 (...) [ 691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282 [ 691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000 [ 691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0 [ 691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114 [ 691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff [ 691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b [ 691.490019] FS: 00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000 [ 691.490021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0 [ 691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 691.490029] Call Trace: [ 691.490058] btrfs_log_inode_parent+0x667/0x2730 [btrfs] [ 691.490083] ? join_transaction+0x24a/0xce0 [btrfs] [ 691.490107] ? btrfs_end_log_trans+0x80/0x80 [btrfs] [ 691.490111] ? dget_parent+0xb8/0x460 [ 691.490116] ? lock_downgrade+0x6b0/0x6b0 [ 691.490121] ? rwlock_bug.part.0+0x90/0x90 [ 691.490127] ? do_raw_spin_unlock+0x142/0x220 [ 691.490151] btrfs_log_dentry_safe+0x65/0x90 [btrfs] [ 691.490172] btrfs_sync_file+0x9f1/0xc00 [btrfs] [ 691.490195] ? btrfs_file_write_iter+0x1800/0x1800 [btrfs] [ 691.490198] ? rcu_read_lock_any_held.part.11+0x20/0x20 [ 691.490204] ? __do_sys_newstat+0x88/0xd0 [ 691.490207] ? cp_new_stat+0x5d0/0x5d0 [ 691.490218] ? do_fsync+0x38/0x60 [ 691.490220] do_fsync+0x38/0x60 [ 691.490224] __x64_sys_fdatasync+0x32/0x40 [ 691.490228] do_syscall_64+0x9f/0x540 [ 691.490233] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 691.490235] RIP: 0033:0x7f4b253ad5f0 (...) [ 691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b [ 691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0 [ 691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003 [ 691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c [ 691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4 [ 691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0 This started happening recently when running some test cases from fstests like btrfs/004 for example, because support for rename exchange was added last week to fsstress from fstests. So fix this by deleting the log context for the source root from the list if we have logged the new name in the source root. Reported-by: NSu Yue <Damenly_Su@gmx.com> Fixes: d4682ba0 ("Btrfs: sync log after logging new name") CC: stable@vger.kernel.org # 4.19+ Tested-by: NSu Yue <Damenly_Su@gmx.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 11 11月, 2019 4 次提交
-
-
由 Al Viro 提交于
We need to get the underlying dentry of parent; sure, absent the races it is the parent of underlying dentry, but there's nothing to prevent losing a timeslice to preemtion in the middle of evaluation of lower_dentry->d_parent->d_inode, having another process move lower_dentry around and have its (ex)parent not pinned anymore and freed on memory pressure. Then we regain CPU and try to fetch ->d_inode from memory that is freed by that point. dentry->d_parent *is* stable here - it's an argument of ->lookup() and we are guaranteed that it won't be moved anywhere until we feed it to d_add/d_splice_alias. So we safely go that way to get to its underlying dentry. Cc: stable@vger.kernel.org # since 2009 or so Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
lower_dentry can't go from positive to negative (we have it pinned), but it *can* go from negative to positive. So fetching ->d_inode into a local variable, doing a blocking allocation, checking that now ->d_inode is non-NULL and feeding the value we'd fetched earlier to a function that won't accept NULL is not a good idea. Cc: stable@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
A problem similar to the one caught in commit 74dd7c97 ("ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()") exists for unlink/rmdir as well. Instead of playing with dget_parent() of underlying dentry of victim and hoping it's the same as underlying dentry of our directory, do the following: * find the underlying dentry of victim * find the underlying directory of victim's parent (stable since the victim is ecryptfs dentry and inode of its parent is held exclusive by the caller). * lock the inode of dentry underlying the victim's parent * check that underlying dentry of victim is still hashed and has the right parent - it can be moved, but it can't be moved to/from the directory we are holding exclusive. So while ->d_parent itself might not be stable, the result of comparison is. If the check passes, everything is fine - underlying directory is locked, underlying victim is still a child of that directory and we can go ahead and feed them to vfs_unlink(). As in the current mainline we need to pin the underlying dentry of victim, so that it wouldn't go negative under us, but that's the only temporary reference that needs to be grabbed there. Underlying dentry of parent won't go away (it's pinned by the parent, which is held by caller), so there's no need to grab it. The same problem (with the same solution) exists for rmdir. Moreover, rename gets simpler and more robust with the same "don't bother with dget_parent()" approach. Fixes: 74dd7c97 "ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()" Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 09 11月, 2019 1 次提交
-
-
由 Tejun Heo 提交于
cgroup writeback tries to refresh the associated wb immediately if the current wb is dead. This is to avoid keeping issuing IOs on the stale wb after memcg - blkcg association has changed (ie. when blkcg got disabled / enabled higher up in the hierarchy). Unfortunately, the logic gets triggered spuriously on inodes which are associated with dead cgroups. When the logic is triggered on dead cgroups, the attempt fails only after doing quite a bit of work allocating and initializing a new wb. While c3aab9a0 ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages") alleviated the issue significantly as it now only triggers when the inode has dirty pages. However, the condition can still be triggered before the inode is switched to a different cgroup and the logic simply doesn't make sense. Skip the immediate switching if the associated memcg is dying. This is a simplified version of the following two patches: * https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/ * http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: e8a7abf5 ("writeback: disassociate inodes from dying bdi_writebacks") Acked-by: NDennis Zhou <dennis@kernel.org> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 08 11月, 2019 1 次提交
-
-
由 Jeff Layton 提交于
If someone requests fscache on the mount, and the kernel doesn't support it, it should fail the mount. [ Drop ceph prefix -- it's provided by pr_err. ] Signed-off-by: NJeff Layton <jlayton@kernel.org> Reviewed-by: NIlya Dryomov <idryomov@gmail.com> Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
- 07 11月, 2019 3 次提交
-
-
由 Pavel Shilovsky 提交于
When the client hits a network reconnect, it re-opens every open file with a create context to reconnect a persistent handle. All create context types should be 8-bytes aligned but the padding was missed for that one. As a result, some servers don't allow us to reconnect handles and return an error. The problem occurs when the problematic context is not at the end of the create request packet. Fix this by adding a proper padding at the end of the reconnect persistent handle context. Cc: Stable <stable@vger.kernel.org> # 4.19.x Signed-off-by: NPavel Shilovsky <pshilov@microsoft.com> Signed-off-by: NSteve French <stfrench@microsoft.com>
-
由 Honggang Li 提交于
When create symbolic link, create_link should calculate the depth of the parent item. However, both the first and second parameters of configfs_get_target_path had been set to the target. Broken symbolic link created. $ targetcli ls / o- / ............................................................. [...] o- backstores .................................................. [...] | o- block ...................................... [Storage Objects: 0] | o- fileio ..................................... [Storage Objects: 2] | | o- vdev0 .......... [/dev/ramdisk1 (16.0MiB) write-thru activated] | | | o- alua ....................................... [ALUA Groups: 1] | | | o- default_tg_pt_gp ........... [ALUA state: Active/optimized] | | o- vdev1 .......... [/dev/ramdisk2 (16.0MiB) write-thru activated] | | o- alua ....................................... [ALUA Groups: 1] | | o- default_tg_pt_gp ........... [ALUA state: Active/optimized] | o- pscsi ...................................... [Storage Objects: 0] | o- ramdisk .................................... [Storage Objects: 0] o- iscsi ................................................ [Targets: 0] o- loopback ............................................. [Targets: 0] o- srpt ................................................. [Targets: 2] | o- ib.e89a8f91cb3200000000000000000000 ............... [no-gen-acls] | | o- acls ................................................ [ACLs: 2] | | | o- ib.e89a8f91cb3200000000000000000000 ........ [Mapped LUNs: 2] | | | | o- mapped_lun0 ............................. [BROKEN LUN LINK] | | | | o- mapped_lun1 ............................. [BROKEN LUN LINK] | | | o- ib.e89a8f91cb3300000000000000000000 ........ [Mapped LUNs: 2] | | | o- mapped_lun0 ............................. [BROKEN LUN LINK] | | | o- mapped_lun1 ............................. [BROKEN LUN LINK] | | o- luns ................................................ [LUNs: 2] | | o- lun0 ...... [fileio/vdev0 (/dev/ramdisk1) (default_tg_pt_gp)] | | o- lun1 ...... [fileio/vdev1 (/dev/ramdisk2) (default_tg_pt_gp)] | o- ib.e89a8f91cb3300000000000000000000 ............... [no-gen-acls] | o- acls ................................................ [ACLs: 0] | o- luns ................................................ [LUNs: 0] o- vhost ................................................ [Targets: 0] Fixes: e9c03af2 ("configfs: calculate the symlink target only once") Signed-off-by: NHonggang Li <honli@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Shuning Zhang 提交于
When the extent tree is modified, it should be protected by inode cluster lock and ip_alloc_sem. The extent tree is accessed and modified in the ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem. The following is a case. The function ocfs2_fiemap is accessing the extent tree, which is modified at the same time. kernel BUG at fs/ocfs2/extent_map.c:475! invalid opcode: 0000 [#1] SMP Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...] CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2 Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018 task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000 RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2] Call Trace: ocfs2_fiemap+0x1e3/0x430 [ocfs2] do_vfs_ioctl+0x155/0x510 SyS_ioctl+0x81/0xa0 system_call_fastpath+0x18/0xd8 Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f RIP ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2] ---[ end trace c8aa0c8180e869dc ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled This issue can be reproduced every week in a production environment. This issue is related to the usage mode. If others use ocfs2 in this mode, the kernel will panic frequently. [akpm@linux-foundation.org: coding style fixes] [Fix new warning due to unused function by removing said function - Linus ] Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.comSigned-off-by: NShuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: NJunxiao Bi <junxiao.bi@oracle.com> Reviewed-by: NGang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 11月, 2019 4 次提交
-
-
由 Luis Henriques 提交于
copy_file_range tries to use the OSD 'copy-from' operation, which simply performs a full object copy. Unfortunately, the implementation of this system call assumes that stripe_count is always set to 1 and doesn't take into account that the data may be striped across an object set. If the file layout has stripe_count different from 1, then the destination file data will be corrupted. For example: Consider a 8 MiB file with 4 MiB object size, stripe_count of 2 and stripe_size of 2 MiB; the first half of the file will be filled with 'A's and the second half will be filled with 'B's: 0 4M 8M Obj1 Obj2 +------+------+ +----+ +----+ file: | AAAA | BBBB | | AA | | AA | +------+------+ |----| |----| | BB | | BB | +----+ +----+ If we copy_file_range this file into a new file (which needs to have the same file layout!), then it will start by copying the object starting at file offset 0 (Obj1). And then it will copy the object starting at file offset 4M -- which is Obj1 again. Unfortunately, the solution for this is to not allow remote object copies to be performed when the file layout stripe_count is not 1 and simply fallback to the default (VFS) copy_file_range implementation. Cc: stable@vger.kernel.org Signed-off-by: NLuis Henriques <lhenriques@suse.com> Reviewed-by: NJeff Layton <jlayton@kernel.org> Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
由 Jeff Layton 提交于
If ceph_atomic_open is handed a !d_in_lookup dentry, then that means that it already passed d_revalidate so we *know* that it's negative (or at least was very recently). Just return -ENOENT in that case. This also addresses a subtle bug in dentry handling. Non-O_CREAT opens call atomic_open with the parent's i_rwsem shared, but calling d_splice_alias on a hashed dentry requires the exclusive lock. If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT open, and another client were to race in and create the file before we issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on the dentry with the new inode with insufficient locks. Cc: stable@vger.kernel.org Reported-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NJeff Layton <jlayton@kernel.org> Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
由 David Sterba 提交于
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as deprecated and scheduled for removal but we actualy do use them for 'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351 should have been just the async flag for subvolume creation. The deprecation has been added in this development cycle, remove it until it's time. Fixes: ebc87351 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag") Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We hit a regression while rolling out 5.2 internally where we were hitting the following panic kernel BUG at mm/page-writeback.c:2659! RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0 Call Trace: __process_pages_contig+0x25a/0x350 ? extent_clear_unlock_delalloc+0x43/0x70 submit_compressed_extents+0x359/0x4d0 normal_work_helper+0x15a/0x330 process_one_work+0x1f5/0x3f0 worker_thread+0x2d/0x3d0 ? rescuer_thread+0x340/0x340 kthread+0x111/0x130 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x1f/0x30 This is happening because the page is not locked when doing clear_page_dirty_for_io. Looking at the core dump it was because our async_extent had a ram_size of 24576 but our async_chunk range only spanned 20480, so we had a whole extra page in our ram_size for our async_extent. This happened because we try not to compress pages outside of our i_size, however a cleanup patch changed us to do actual_end = min_t(u64, i_size_read(inode), end + 1); which is problematic because i_size_read() can evaluate to different values in between checking and assigning. So either an expanding truncate or a fallocate could increase our i_size while we're doing writeout and actual_end would end up being past the range we have locked. I confirmed this was what was happening by installing a debug kernel that had actual_end = min_t(u64, i_size_read(inode), end + 1); if (actual_end > end + 1) { printk(KERN_ERR "KABOOM\n"); actual_end = end + 1; } and installing it onto 500 boxes of the tier that had been seeing the problem regularly. Last night I got my debug message and no panic, confirming what I expected. [ dsterba: the assembly confirms a tiny race window: mov 0x20(%rsp),%rax cmp %rax,0x48(%r15) # read movl $0x0,0x18(%rsp) mov %rax,%r12 mov %r14,%rax cmovbe 0x48(%r15),%r12 # eval Where r15 is inode and 0x48 is offset of i_size. The original fix was to revert 62b37622 that would do an intermediate assignment and this would also avoid the doulble evaluation but is not future-proof, should the compiler merge the stores and call i_size_read anyway. There's a patch adding READ_ONCE to i_size_read but that's not being applied at the moment and we need to fix the bug. Instead, emulate READ_ONCE by two barrier()s that's what effectively happens. The assembly confirms single evaluation: mov 0x48(%rbp),%rax # read once mov 0x20(%rsp),%rcx mov $0x20,%edx cmp %rax,%rcx cmovbe %rcx,%rax mov %rax,(%rsp) mov %rax,%rcx mov %r14,%rax Where 0x48(%rbp) is inode->i_size stored to %eax. ] Fixes: 62b37622 ("btrfs: Remove isize local variable in compress_file_range") CC: stable@vger.kernel.org # v5.1+ Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ changelog updated ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 01 11月, 2019 2 次提交
-
-
由 Trond Myklebust 提交于
A typo in nfs4_refresh_delegation_stateid() means we're leaking an RCU lock, and always returning a value of 'false'. As the function description states, we were always supposed to return 'true' if a matching delegation was found. Fixes: 12f275cd ("NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Trond Myklebust 提交于
If the delegation is marked as being revoked, we must not use it for cached opens. Fixes: 869f9dfa ("NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return") Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 31 10月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
We use io_kiocb->result == -EAGAIN as a way to know if we need to re-submit a polled request, as -EAGAIN reporting happens out-of-line for IO submission failures. This field is cleared when we originally allocate the request, but it isn't reset when we retry the submission from async context. This can cause issues where we think something needs a re-issue, but we're really just reading stale data. Reset ->result whenever we re-prep a request for polled submission. Cc: stable@vger.kernel.org Fixes: 9e645e11 ("io_uring: add support for sqe links") Reported-by: NBijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-