- 04 June 2019, 12 commits
-
-
Committed by Joseph Qi
Since the variable 'flags' is no longer used, remove it as well; otherwise there is a compile warning: "unused variable 'flags'". Fixes: e1843f01 ("NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter") Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Yihao Wu
When a waiter is woken by CB_NOTIFY_LOCK, it retries nfs4_proc_setlk(). The retry may fail, in which case the waiter sleeps again. However, the waiter was already removed from clp->cl_lock_waitq when the CB_NOTIFY_LOCK was handled in nfs4_wake_lock_waiter(), so no subsequent CB_NOTIFY_LOCK will ever wake this waiter again. Put the waiter back on clp->cl_lock_waitq before retrying. Cc: stable@vger.kernel.org #4.9+ Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Yihao Wu
Commit b7dbcc0e ("NFSv4.1: Fix a race where CB_NOTIFY_LOCK fails to wake a waiter") found this bug but didn't fix it. This commit replaces schedule_timeout() with wait_woken() and default_wake_function() with woken_wake_function() in nfs4_retry_setlk() and nfs4_wake_lock_waiter(). wait_woken() uses memory barriers in its implementation to avoid the potential race when putting a process to sleep and then waking it up. Fixes: a1d617d8 ("nfs: allow blocking locks to be awoken by lock callbacks") Cc: stable@vger.kernel.org #4.9+ Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
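A rough sketch of the wait/wake shape these two CB_NOTIFY_LOCK fixes describe (not the exact upstream code; try_setlk() is an illustrative stand-in and the use of NFS4_LOCK_MAXTIMEOUT as the timeout is an assumption):

	DEFINE_WAIT_FUNC(wait, woken_wake_function);

	/* Stay on the waitqueue for the whole retry loop so a CB_NOTIFY_LOCK
	 * arriving between attempts is never missed. */
	add_wait_queue(&clp->cl_lock_waitq, &wait);
	for (;;) {
		status = try_setlk(state, cmd, request);   /* illustrative helper */
		if (status != -EAGAIN || signal_pending(current))
			break;
		/* wait_woken() consumes WQ_FLAG_WOKEN with the proper memory
		 * barriers, so a wakeup issued just before we sleep is not
		 * lost the way it could be with a bare schedule_timeout(). */
		wait_woken(&wait, TASK_INTERRUPTIBLE, NFS4_LOCK_MAXTIMEOUT);
	}
	remove_wait_queue(&clp->cl_lock_waitq, &wait);

	/* Waker side: CB_NOTIFY_LOCK handling ends up waking clp->cl_lock_waitq,
	 * so each queued entry's woken_wake_function() sets WQ_FLAG_WOKEN. */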
-
Committed by luanshi
There may be tons of files queued for writeback while the writeback's associated cgroup is already dead. In that case we try to reassociate the inode with another writeback, but we may not be able to, because the writeback associated with the dead cgroup is the only valid one. A new writeback is then allocated, initialized and associated with the inode over and over again until all data resident in the inode's page cache has been flushed to disk, which causes unnecessarily high system load. Fix this by forcing the inode over to the root cgroup when its previously bound cgroup becomes dead. With the fix, no more unnecessary writebacks are created and populated, and system load dropped by about 6x in the test case we carried out:
Without the patch: 30% system load
With the patch: 5% system load
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jiufei Xue
We found that setting the IMMUTABLE_FL flag on a file inside a Docker container returns success even though the container does not have the CAP_LINUX_IMMUTABLE capability. Commits d1d04ef8 ("ovl: stack file ops") and dab5ca8f ("ovl: add lsattr/chattr support") implemented chattr operations on a regular overlay file. ovl_real_ioctl() overrides the current process's subjective credentials with ofs->creator_cred, which has the CAP_LINUX_IMMUTABLE capability, so the check in vfs_ioctl()->cap_capable() succeeds. Fix this by checking the capability before the credentials are overridden. Here we only care about APPEND_FL and IMMUTABLE_FL, so take this information from the inode. [SzM: move check and call to underlying fs inside inode locked region to prevent two such calls from racing with each other] Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
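A minimal sketch of the check being described (simplified; flag and inode handling in the real ovl_ioctl() differs in detail, and "flags" here stands for the value copied in from the FS_IOC_SETFLAGS argument):

	if (cmd == FS_IOC_SETFLAGS) {
		unsigned int oldflags = 0;

		if (IS_IMMUTABLE(inode))
			oldflags |= FS_IMMUTABLE_FL;
		if (IS_APPEND(inode))
			oldflags |= FS_APPEND_FL;

		/* Check the caller's own capability before overlayfs switches
		 * to ofs->creator_cred, which always has CAP_LINUX_IMMUTABLE. */
		if (((flags ^ oldflags) & (FS_IMMUTABLE_FL | FS_APPEND_FL)) &&
		    !capable(CAP_LINUX_IMMUTABLE))
			return -EPERM;
	}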
-
Committed by Julio Montes
Cherry-picked from the kata-containers patches: https://github.com/kata-containers/packaging/tree/master/kernel/patches/0001-NO-UPSTREAM-9P-always-use-cached-inode-to-fill-in-v9.patch With this, in cache=none mode we don't have to look up the server, which might not support the open-unlink-fstat operation. fixes https://github.com/01org/cc-oci-runtime/issues/47 fixes https://github.com/01org/cc-oci-runtime/issues/1062 Signed-off-by: Julio Montes <julio.montes@intel.com> Signed-off-by: Peng Tao <bergwolf@gmail.com> Signed-off-by: Eryu Guan <eguan@linux.alibaba.com> Reviewed-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Eric Whitney
commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream. Add new code to count canceled pending cluster reservations on bigalloc file systems and to reduce the cluster reservation count on all file systems using delayed allocation. This replaces old code in ext4_da_page_release_reservations that was incorrect. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream. Modify ext4_ext_remove_space() and the code it calls to correct the reserved cluster count for pending reservations (delayed allocated clusters shared with allocated blocks) when a block range is removed from the extent tree. Pending reservations may be found for the clusters at the ends of written or unwritten extents when a block range is removed. If a physical cluster at the end of an extent is freed, it's necessary to increment the reserved cluster count to maintain correct accounting if the corresponding logical cluster is shared with at least one delayed and unwritten extent as found in the extents status tree. Add a new function, ext4_rereserve_cluster(), to reapply a reservation on a delayed allocated cluster sharing blocks with a freed allocated cluster. To avoid ENOSPC on reservation, a flag is applied to ext4_free_blocks() to briefly defer updating the freeclusters counter when an allocated cluster is freed. This prevents another thread from allocating the freed block before the reservation can be reapplied. Redefine the partial cluster object as a struct to carry more state information and to clarify the code using it. Adjust the conditional code structure in ext4_ext_remove_space to reduce the indentation level in the main body of the code to improve readability. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream. Ext4 does not always reduce the reserved cluster count by the number of clusters allocated when mapping a delayed extent. It sometimes adds back one or more clusters after allocation if delalloc blocks adjacent to the range allocated by ext4_ext_map_blocks() share the clusters newly allocated for that range. However, this overcounts the number of clusters needed to satisfy future mapping requests (holding one or more reservations for clusters that have already been allocated) and premature ENOSPC and quota failures, etc., result. Ext4 also does not reduce the reserved cluster count when allocating clusters for non-delayed allocated writes that have previously been reserved for delayed writes. This also results in overcounts. To make it possible to handle reserved cluster accounting for fallocated regions in the same manner as used for other non-delayed writes, do the reserved cluster accounting for them at the time of allocation. In the current code, this is only done later when a delayed extent sharing the fallocated region is finally mapped. Address comment correcting handling of unsigned long long constant from Jan Kara's review of RFC version of this patch. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream. The code in ext4_da_map_blocks sometimes reserves space for more delayed allocated clusters than it should, resulting in premature ENOSPC, exceeded quota, and inaccurate free space reporting. Fix this by checking for written and unwritten blocks shared in the same cluster with the newly delayed allocated block. A cluster reservation should not be made for a cluster for which physical space has already been allocated. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit 1dc0aa46e74a3366e12f426b7caaca477853e9c3 upstream. Add new pending reservation mechanism to help manage reserved cluster accounting. Its primary function is to avoid the need to read extents from the disk when invalidating pages as a result of a truncate, punch hole, or collapse range operation. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
Committed by Eric Whitney
commit ad431025aecda85d3ebef5e4a3aca5c1c681d0c7 upstream. Ext4 contains a few functions that are used to search for delayed extents or blocks in the extents status tree. Rather than duplicate code to add new functions to search for extents with different status values, such as written or a combination of delayed and unwritten, generalize the existing code to search for caller-specified extents status values. Also, move this code into extents_status.c where it is better associated with the data structures it operates upon, and where it can be more readily used to implement new extents status tree functions that might want a broader scope for i_es_lock. Three missing static specifiers in RFC version of patch reported and fixed by Fengguang Wu <fengguang.wu@intel.com>. Signed-off-by: NEric Whitney <enwlinux@gmail.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
-
- 31 May 2019, 24 commits
-
-
Committed by Benjamin Coddington
[ Upstream commit c260121a97a3e4df6536edbc2f26e166eff370ce ] Now that nfs_match_client drops the nfs_client_lock, we should be careful to always return it in the same condition: locked. Fixes: 950a578c6128 ("NFS: make nfs_match_client killable") Reported-by: syzbot+228a82b263b5da91883d@syzkaller.appspotmail.com Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Committed by Chengguang Xu
[ Upstream commit de36e16d1557a0b6eb328bc3516359a12ba5c25c ] The current overlap check cannot correctly handle the case where baseminor < existing baseminor && baseminor + minorct > existing baseminor + minorct. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
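The missed case is the new range fully containing an existing one; the fix amounts to a standard interval-overlap test. A self-contained sketch (not the exact __register_chrdev_region() code):

	#include <stdbool.h>

	/* Two minor ranges [start, start+count) overlap iff each one starts
	 * before the other ends. The old check missed the case where the new
	 * range completely contains an existing range. */
	static bool minor_ranges_overlap(unsigned int new_base, unsigned int new_count,
					 unsigned int old_base, unsigned int old_count)
	{
		return new_base < old_base + old_count &&
		       old_base < new_base + new_count;
	}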
-
Committed by Qu Wenruo
[ Upstream commit 7ac1e464c4d473b517bb784f30d40da1f842482e ] When we fail to find a root key in btrfs_update_root(), we just panic. That's definitely not cool; fix it by outputting a unique error message, aborting the current transaction and returning -EUCLEAN. This should not normally happen, as the root has already been used by the callers in some way. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Committed by Josef Bacik
[ Upstream commit ff612ba7849964b1898fd3ccd1f56941129c6aab ] We've been seeing the following sporadically throughout our fleet panic: kernel BUG at fs/btrfs/relocation.c:4584! netversion: 5.0-0 Backtrace: #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8 #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a #4 [ffffc90003adb9c0] do_trap at ffffffff81019114 #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0 #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b [exception RIP: btrfs_reloc_cow_block+692] RIP: ffffffff8143b614 RSP: ffffc90003adbb68 RFLAGS: 00010246 RAX: fffffffffffffff7 RBX: ffff8806b9c32000 RCX: ffff8806aad00690 RDX: ffff880850b295e0 RSI: ffff8806b9c32000 RDI: ffff88084f205bd0 RBP: ffff880849415000 R8: ffffc90003adbbe0 R9: ffff88085ac90000 R10: ffff8805f7369140 R11: 0000000000000000 R12: ffff880850b295e0 R13: ffff88084f205bd0 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3 #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c The way relocation moves data extents is by creating a reloc inode and preallocating extents in this inode and then copying the data into these preallocated extents. Once we've done this for all of our extents, we'll write out these dirty pages, which marks the extent written, and goes into btrfs_reloc_cow_block(). From here we get our current reloc_control, which _should_ match the reloc_control for the current block group we're relocating. However if we get an ENOSPC in this path at some point we'll bail out, never initiating writeback on this inode. Not a huge deal, unless we happen to be doing relocation on a different block group, and this block group is now rc->stage == UPDATE_DATA_PTRS. This trips the BUG_ON() in btrfs_reloc_cow_block(), because we expect to be done modifying the data inode. We are in fact done modifying the metadata for the data inode we're currently using, but not the one from the failed block group, and thus we BUG_ON(). (This happens when writeback finishes for extents from the previous group, when we are at btrfs_finish_ordered_io() which updates the data reloc tree (inode item, drops/adds extent items, etc).) Fix this by writing out the reloc data inode always, and then breaking out of the loop after that point to keep from tripping this BUG_ON() later. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NFilipe Manana <fdmanana@suse.com> [ add note from Filipe ] Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
Committed by Robbie Ko
[ Upstream commit 39ad317315887c2cb9a4347a93a8859326ddf136 ] When doing fallocate, we first add the range to the reserve_list and then reserve the quota. If quota reservation fails, we release all reserved parts of reserve_list. However, cur_offset is not updated to indicate that this range has already been inserted into the list, so the same range is freed twice: once in the list_for_each_entry loop, and once at the end of the function. This results in a WARN_ON on bytes_may_use when we free the remaining space. At the end, under the 'out' label, we have a call to: btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset); The start offset, the third argument, should be cur_offset: everything from alloc_start to cur_offset was already freed by the list_for_each_entry_safe loop. Fixes: 18513091 ("btrfs: update btrfs_space_info's bytes_may_use timely") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
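For reference, the corrected call the message describes (third argument changed from alloc_start to cur_offset) would read:

	btrfs_free_reserved_data_space(inode, data_reserved, cur_offset,
				       alloc_end - cur_offset);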
-
Committed by Andreas Gruenbacher
[ Upstream commit 9287c6452d2b1f24ea8e84bd3cf6f3c6f267f712 ] This patch has to do with the life cycle of glocks and buffers. When gfs2 metadata or journaled data is queued to be written, a gfs2_bufdata object is assigned to track the buffer, and that is queued to various lists, including the glock's gl_ail_list to indicate it's on the active items list. Once the page associated with the buffer has been written, it is removed from the ail list, but its life isn't over until a revoke has been successfully written. So after the block is written, its bufdata object is moved from the glock's gl_ail_list to a file-system-wide list of pending revokes, sd_log_le_revoke. At that point the glock still needs to track how many revokes it contributed to that list (in gl_revokes) so that things like glock go_sync can ensure all the metadata has been not only written, but also revoked before the glock is granted to a different node. This is to guarantee journal replay doesn't replay the block once the glock has been granted to another node. Ross Lagerwall recently discovered a race in which an inode could be evicted, and its glock freed after its ail list had been synced, but while it still had unwritten revokes on the sd_log_le_revoke list. The evict decremented the glock reference count to zero, which allowed the glock to be freed. After the revoke was written, function revoke_lo_after_commit tried to adjust the glock's gl_revokes counter and clear its GLF_LFLUSH flag, at which time it referenced the freed glock. This patch fixes the problem by incrementing the glock reference count in gfs2_add_revoke when the glock's first bufdata object is moved from the glock to the global revokes list. Later, when the glock's last such bufdata object is freed, the reference count is decremented. This guarantees that whichever process finishes last (the revoke writing or the evict) will properly free the glock, and neither will reference the glock after it has been freed. Reported-by: NRoss Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
Committed by Roberto Bergantinos Corpas
[ Upstream commit 950a578c6128c2886e295b9c7ecb0b6b22fcc92b ] Currently we don't do anything with the return value of nfs_wait_client_init_complete() in nfs_match_client(); as a consequence, if we get a fatal signal while the client is not fully initialised, we loop back to the "again" label. This has been proven to cause soft lockups in some scenarios (no-carrier but configured network interfaces). Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Committed by Ross Lagerwall
[ Upstream commit 7881ef3f33bb80f459ea6020d1e021fc524a6348 ] Under certain conditions, lru_count may drop below zero, resulting in a large amount of log spam like this:
vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \ negative objects to delete nr=-1
This happens as follows:
1) A glock is moved from lru_list to the dispose list and lru_count is decremented.
2) The dispose function calls cond_resched() and drops the lru lock.
3) Another thread takes the lru lock and tries to add the same glock to lru_list, checking if the glock is on an lru list.
4) It is on a list (actually the dispose list), so it avoids incrementing lru_count.
5) The glock is moved to lru_list.
6) The original thread doesn't dispose it because it has been re-added to the lru list, but lru_count has still decreased by one.
Fix by checking whether the LRU flag is set on the glock rather than whether the glock is on some list, and rearrange the code so that the LRU flag is added/removed precisely when the glock is added/removed from lru_list. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
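A sketch of the add/remove pairing this implies, with the GLF_LRU bit and lru_count kept in step under lru_lock (simplified; a loose reading of what gfs2_glock_add_to_lru()/gfs2_glock_remove_from_lru() would look like after the fix):

	/* add */
	spin_lock(&lru_lock);
	if (!test_and_set_bit(GLF_LRU, &gl->gl_flags)) {
		list_add_tail(&gl->gl_lru, &lru_list);
		atomic_inc(&lru_count);
	}
	spin_unlock(&lru_lock);

	/* remove */
	spin_lock(&lru_lock);
	if (test_and_clear_bit(GLF_LRU, &gl->gl_flags)) {
		list_del_init(&gl->gl_lru);
		atomic_dec(&lru_count);
	}
	spin_unlock(&lru_lock);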
-
Committed by David Sterba
This reverts commit 8b13bb91. There is currently no corresponding patch in master due to additional changes that would be significantly different from plain revert in the respective stable branch. The range argument was not handled correctly and could cause trim to overlap allocated areas or reach beyond the end of the device. The address space that fitrim normally operates on is in logical coordinates, while the discards are done on the physical device extents. This distinction cannot be made with the current ioctl interface and caused the confusion. The bug depends on the layout of block groups and does not always happen. The whole-fs trim (run by default by the fstrim tool) is not affected. Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Al Viro
commit 9419a3191dcb27f24478d288abaab697228d28e6 upstream. What happens there is that we are replacing file->path.mnt of a file we'd just opened with a clone and we need the write count contribution to be transferred from original mount to new one. That's it. We do *NOT* want any kind of freeze protection for the duration of switchover. IOW, we should just use __mnt_{want,drop}_write() for that switchover; no need to bother with mnt_{want,drop}_write() there. Tested-by: NAmir Goldstein <amir73il@gmail.com> Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Amir Goldstein
commit acf3062a7e1ccf67c6f7e7c28671a6708fde63b0 upstream. This nasty little syzbot repro: https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000 Creates overlay mounts where the same directory is both in upper and lower layers. Simplified example: mkdir foo work mount -t overlay none foo -o"lowerdir=.,upperdir=foo,workdir=work" The repro runs several threads in parallel that attempt to chdir into foo and attempt to symlink/rename/exec/mkdir the file bar. The repro hits a WARN_ON() I placed in ovl_instantiate(), which suggests that an overlay inode already exists in cache and is hashed by the pointer of the real upper dentry that ovl_create_real() has just created. At the point of the WARN_ON(), for overlay dir inode lock is held and upper dir inode lock, so at first, I did not see how this was possible. On a closer look, I see that after ovl_create_real(), because of the overlapping upper and lower layers, a lookup by another thread can find the file foo/bar that was just created in upper layer, at overlay path foo/foo/bar and hash the an overlay inode with the new real dentry as lower dentry. This is possible because the overlay directory foo/foo is not locked and the upper dentry foo/bar is in dcache, so ovl_lookup() can find it without taking upper dir inode shared lock. Overlapping layers is considered a wrong setup which would result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON() and leave a pr_warn() instead to cover all cases of failure to get an overlay inode. The error returned from failure to insert new inode to cache with inode_insert5() was changed to -EEXIST, to distinguish from the error -ENOMEM returned on failure to get/allocate inode with iget5_locked(). Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com Fixes: 01b39dcc ("ovl: use inode_insert5() to hash a newly...") Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Josef Bacik
commit 38e3eebff643db725633657d1d87a3be019d1018 upstream. Qgroups will do the old roots lookup at delayed ref time, which could be while walking down the extent root while running a delayed ref. This should be fine, except we specifically lock eb's in the backref walking code irrespective of path->skip_locking, which deadlocks the system. Fix up the backref code to honor path->skip_locking, nobody will be modifying the commit_root when we're searching so it's completely safe to do. This happens since fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans"), kernel may lockup with quota enabled. There is one backref trace triggered by snapshot dropping along with write operation in the source subvolume. The example can be reliably reproduced: btrfs-cleaner D 0 4062 2 0x80000000 Call Trace: schedule+0x32/0x90 btrfs_tree_read_lock+0x93/0x130 [btrfs] find_parent_nodes+0x29b/0x1170 [btrfs] btrfs_find_all_roots_safe+0xa8/0x120 [btrfs] btrfs_find_all_roots+0x57/0x70 [btrfs] btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs] btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs] btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs] do_walk_down+0x541/0x5e3 [btrfs] walk_down_tree+0xab/0xe7 [btrfs] btrfs_drop_snapshot+0x356/0x71a [btrfs] btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs] cleaner_kthread+0x12b/0x160 [btrfs] kthread+0x112/0x130 ret_from_fork+0x27/0x50 When dropping snapshots with qgroup enabled, we will trigger backref walk. However such backref walk at that timing is pretty dangerous, as if one of the parent nodes get WRITE locked by other thread, we could cause a dead lock. For example: FS 260 FS 261 (Dropped) node A node B / \ / \ node C node D node E / \ / \ / \ leaf F|leaf G|leaf H|leaf I|leaf J|leaf K The lock sequence would be: Thread A (cleaner) | Thread B (other writer) ----------------------------------------------------------------------- write_lock(B) | write_lock(D) | ^^^ called by walk_down_tree() | | write_lock(A) | write_lock(D) << Stall read_lock(H) << for backref walk | read_lock(D) << lock owner is | the same thread A | so read lock is OK | read_lock(A) << Stall | So thread A hold write lock D, and needs read lock A to unlock. While thread B holds write lock A, while needs lock D to unlock. This will cause a deadlock. This is not only limited to snapshot dropping case. As the backref walk, even only happens on commit trees, is breaking the normal top-down locking order, makes it deadlock prone. Fixes: fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans") CC: stable@vger.kernel.org # 4.14+ Reported-and-tested-by: NDavid Sterba <dsterba@suse.com> Reported-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NFilipe Manana <fdmanana@suse.com> [ rebase to latest branch and fix lock assert bug in btrfs/007 ] [ backport to linux-4.19.y branch, solve minor conflicts ] Signed-off-by: NQu Wenruo <wqu@suse.com> [ copy logs and deadlock analysis from Qu's patch ] Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Olga Kornievskaia
commit 0769663b4f580566ef6cdf366f3073dbe8022c39 upstream. According to the NFSv4.2 spec if the input and output file is the same file, operation should fail with EINVAL. However, linux copy_file_range() system call has no such restrictions. Therefore, in such case let's return EOPNOTSUPP and allow VFS to fallback to doing do_splice_direct(). Also when copy_file_range is called on an NFSv4.0 or 4.1 mount (ie., a server that doesn't support COPY functionality), we also need to return EOPNOTSUPP and fallback to a regular copy. Fixes xfstest generic/075, generic/091, generic/112, generic/263 for all NFSv4.x versions. Signed-off-by: NOlga Kornievskaia <kolga@netapp.com> Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com> Cc: Yu Xu <xuyu@linux.alibaba.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
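A hedged sketch of the early-return checks this describes (the helper and capability names, nfs_server_capable()/NFS_CAP_COPY, reflect my reading of the NFS code and should be treated as illustrative):

	/* inside the NFS copy_file_range handler */
	if (file_inode(file_in) == file_inode(file_out))
		return -EOPNOTSUPP;	/* same file: VFS falls back to do_splice_direct() */
	if (!nfs_server_capable(file_inode(file_in), NFS_CAP_COPY))
		return -EOPNOTSUPP;	/* NFSv4.0/4.1 server without COPY: regular copy fallback */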
-
Committed by Olga Kornievskaia
commit 45ac486ecf2dc998e25cf32f0cabf2deaad875be upstream. Currently nfs42_proc_copy_file_range() can not return EAGAIN. Fixes: e4648aa4 ("NFS recover from destination server reboot for copies") Signed-off-by: NOlga Kornievskaia <kolga@netapp.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com> Cc: Yu Xu <xuyu@linux.alibaba.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Tobin C. Harding
commit e32773357d5cc271b1d23550b3ed026eb5c2a468 upstream. A failed call to kobject_init_and_add() must be followed by a call to kobject_put(). Currently in the error path when adding fs_devices we are missing this call. This could be fixed by calling btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid(). Here we choose the second option because it prevents the slightly unusual error path handling requirements of kobject from leaking out into btrfs functions. Add a call to kobject_put() in the error path of kobject_add_and_init(). This causes the release method to be called if kobject_init_and_add() fails. open_tree() is the function that calls btrfs_sysfs_add_fsid() and the error code in this function is already written with the assumption that the release method is called during the error path of open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the fail_fsdev_sysfs label). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NTobin C. Harding <tobin@kernel.org> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Tobin C. Harding
commit 450ff8348808a89cc27436771aa05c2b90c0eef1 upstream. If a call to kobject_init_and_add() fails we must call kobject_put() otherwise we leak memory. Calling kobject_put() when kobject_init_and_add() fails drops the refcount back to 0 and calls the ktype release method (which in turn calls the percpu destroy and kfree). Add call to kobject_put() in the error path of call to kobject_init_and_add(). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NTobin C. Harding <tobin@kernel.org> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
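Both of these sysfs fixes apply the same general pattern; a self-contained sketch with placeholder names (obj, obj_ktype, parent_kobj are illustrative):

	ret = kobject_init_and_add(&obj->kobj, &obj_ktype, parent_kobj, "%s", name);
	if (ret) {
		/* kobject_init_and_add() leaves the refcount at 1 even on
		 * failure; dropping it here runs the ktype release method,
		 * which frees the object instead of leaking it. */
		kobject_put(&obj->kobj);
		return ret;
	}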
-
Committed by Filipe Manana
commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream. When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode) that happens to be ranged, which happens during a msync() or writes for files opened with O_SYNC for example, we can end up with a corrupt log, due to different file extent items representing ranges that overlap with each other, or hit some assertion failures. When doing a ranged fsync we only flush delalloc and wait for ordered exents within that range. If while we are logging items from our inode ordered extents for adjacent ranges complete, we end up in a race that can make us insert the file extent items that overlap with others we logged previously and the assertion failures. For example, if tree-log.c:copy_items() receives a leaf that has the following file extents items, all with a length of 4K and therefore there is an implicit hole in the range 68K to 72K - 1: (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... It copies them to the log tree. However due to the need to detect implicit holes, it may release the path, in order to look at the previous leaf to detect an implicit hole, and then later it will search again in the tree for the first file extent item key, with the goal of locking again the leaf (which might have changed due to concurrent changes to other inodes). However when it locks again the leaf containing the first key, the key corresponding to the extent at offset 72K may not be there anymore since there is an ordered extent for that range that is finishing (that is, somewhere in the middle of btrfs_finish_ordered_io()), and it just removed the file extent item but has not yet replaced it with a new file extent item, so the part of copy_items() that does hole detection will decide that there is a hole in the range starting from 68K to 76K - 1, and therefore insert a file extent item to represent that hole, having a key offset of 68K. After that we now have a log tree with 2 different extent items that have overlapping ranges: 1) The file extent item copied before copy_items() released the path, which has a key offset of 72K and a length of 4K, representing the file range 72K to 76K - 1. 2) And a file extent item representing a hole that has a key offset of 68K and a length of 8K, representing the range 68K to 76K - 1. This item was inserted after releasing the path, and overlaps with the extent item inserted before. The overlapping extent items can cause all sorts of unpredictable and incorrect behaviour, either when replayed or if a fast (non full) fsync happens later, which can trigger a BUG_ON() when calling btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a trace like the following: [61666.783269] ------------[ cut here ]------------ [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182! [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP (...) 
[61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000 [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs] [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246 [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000 [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937 [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000 [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418 [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000 [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000 [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0 [61666.786253] Call Trace: [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs] [61666.786253] ? time_hardirqs_on+0x9/0x14 [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs] [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs] [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs] [61666.786253] ? lock_acquire+0x131/0x1c5 [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs] [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? lockref_get_not_zero+0x2c/0x34 [61666.786253] ? rcu_read_unlock+0x3e/0x5d [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs] [61666.786253] vfs_fsync_range+0x8c/0x9e [61666.786253] SyS_msync+0x13c/0x1c9 [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad A sample of a corrupt log tree leaf with overlapping extents I got from running btrfs/072: item 14 key (295 108 200704) itemoff 2599 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 458752 ram 458752 item 15 key (295 108 659456) itemoff 2546 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 606208 nr 163840 ram 770048 item 16 key (295 108 663552) itemoff 2493 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 610304 nr 155648 ram 770048 item 17 key (295 108 819200) itemoff 2440 itemsize 53 extent data disk bytenr 4334788608 nr 4096 extent data offset 0 nr 4096 ram 4096 The file extent item at offset 659456 (item 15) ends at offset 823296 (659456 + 163840) while the next file extent item (item 16) starts at offset 663552. Another different problem that the race can trigger is a failure in the assertions at tree-log.c:copy_items(), which expect that the first file extent item key we found before releasing the path exists after we have released path and that the last key we found before releasing the path also exists after releasing the path: $ cat -n fs/btrfs/tree-log.c 4080 if (need_find_last_extent) { 4081 /* btrfs_prev_leaf could return 1 without releasing the path */ 4082 btrfs_release_path(src_path); 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key, 4084 src_path, 0, 0); 4085 if (ret < 0) 4086 return ret; 4087 ASSERT(ret == 0); (...) 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) { 4104 ret = btrfs_next_leaf(inode->root, src_path); 4105 if (ret < 0) 4106 return ret; 4107 ASSERT(ret == 0); 4108 src = src_path->nodes[0]; 4109 i = 0; 4110 need_find_last_extent = true; 4111 } (...) 
The second assertion implicitly expects that the last key before the path release still exists, because the surrounding while loop only stops after we have found that key. When this assertion fails it produces a stack like this: [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107 [139590.037406] ------------[ cut here ]------------ [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546! [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1 (...) [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs] (...) [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282 [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000 [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868 [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000 [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013 [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001 [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000 [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0 [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [139590.044250] Call Trace: [139590.044631] copy_items+0xa3f/0x1000 [btrfs] [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs] [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs] [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs] [139590.046143] ? do_raw_spin_unlock+0x49/0xc0 [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs] [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs] [139590.047592] __vfs_write+0x129/0x1c0 [139590.047932] vfs_write+0xc2/0x1b0 [139590.048270] ksys_write+0x55/0xc0 [139590.048608] do_syscall_64+0x60/0x1b0 [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe [139590.049287] RIP: 0033:0x7f2efc4be190 (...) [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190 [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003 [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60 [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003 [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000 (...) [139590.055128] ---[ end trace 193f35d0215cdeeb ]--- So fix this race between a full ranged fsync and writeback of adjacent ranges by flushing all delalloc and waiting for all ordered extents to complete before logging the inode. This is the simplest way to solve the problem because currently the full fsync path does not deal with ranges at all (it assumes a full range from 0 to LLONG_MAX) and it always needs to look at adjacent ranges for hole detection. For use cases of ranged fsyncs this can make a few fsyncs slower but on the other hand it can make some following fsyncs to other ranges do less work or no need to do anything at all. A full fsync is rare anyway and happens only once after loading/creating an inode and once after less common operations such as a shrinking truncate. 
This is an issue that exists for a long time, and was often triggered by generic/127, because it does mmap'ed writes and msync (which triggers a ranged fsync). Adding support for the tree checker to detect overlapping extents (next patch in the series) and trigger a WARN() when such cases are found, and then calling btrfs_check_leaf_full() at the end of btrfs_insert_file_extent() made the issue much easier to detect. Running btrfs/072 with that change to the tree checker and making fsstress open files always with O_SYNC made it much easier to trigger the issue (as triggering it with generic/127 is very rare). CC: stable@vger.kernel.org # 3.16+ Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Filipe Manana
commit ebb929060aeb162417b4c1307e63daee47b208d9 upstream. When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a file that has holes and has file extent items spanning two or more leafs, we can end up falling to back to a full transaction commit due to a logic bug that leads to failure to insert a duplicate file extent item that is meant to represent a hole between the last file extent item of a leaf and the first file extent item in the next leaf. The failure (EEXIST error) leads to a transaction commit (as most errors when logging an inode do). For example, we have the two following leafs: Leaf N: ----------------------------------------------- | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | ----------------------------------------------- The file extent item at the end of leaf N has a length of 4Kb, representing the file range from 64K to 68K - 1. Leaf N + 1: ----------------------------------------------- | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | ----------------------------------------------- The file extent item at the first slot of leaf N + 1 has a length of 4Kb too, representing the file range from 72K to 76K - 1. During the full fsync path, when we are at tree-log.c:copy_items() with leaf N as a parameter, after processing the last file extent item, that represents the extent at offset 64K, we take a look at the first file extent item at the next leaf (leaf N + 1), and notice there's a 4K hole between the two extents, and therefore we insert a file extent item representing that hole, starting at file offset 68K and ending at offset 72K - 1. However we don't update the value of *last_extent, which is used to represent the end offset (plus 1, non-inclusive end) of the last file extent item inserted in the log, so it stays with a value of 68K and not with a value of 72K. Then, when copy_items() is called for leaf N + 1, because the value of *last_extent is smaller then the offset of the first extent item in the leaf (68K < 72K), we look at the last file extent item in the previous leaf (leaf N) and see it there's a 4K gap between it and our first file extent item (again, 68K < 72K), so we decide to insert a file extent item representing the hole, starting at file offset 68K and ending at offset 72K - 1, this insertion will fail with -EEXIST being returned from btrfs_insert_file_extent() because we already inserted a file extent item representing a hole for this offset (68K) in the previous call to copy_items(), when processing leaf N. The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), which falls back to a full transaction commit. Fix this by adjusting *last_extent after inserting a hole when we had to look at the next leaf. Fixes: 4ee3fad3 ("Btrfs: fix fsync after hole punching when using no-holes feature") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Filipe Manana
commit 72bd2323ec87722c115a5906bc6a1b31d11e8f54 upstream. Currently when we fail to COW a path at btrfs_update_root() we end up always aborting the transaction. However all the current callers of btrfs_update_root() are able to deal with errors returned from it, many do end up aborting the transaction themselves (directly or not, such as the transaction commit path), other BUG_ON() or just gracefully cancel whatever they were doing. When syncing the fsync log, we call btrfs_update_root() through tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log sync code does not abort the transaction, instead it gracefully handles the error and returns -EAGAIN to the fsync handler, so that it falls back to a transaction commit. Any other error different from -ENOSPC, makes the log sync code abort the transaction. So remove the transaction abort from btrfs_update_log() when we fail to COW a path to update the root item, so that if an -ENOSPC failure happens we avoid aborting the current transaction and have a chance of the fsync succeeding after falling back to a transaction commit. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413 Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling") Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Josef Bacik
commit 8fca955057b9c58467d1b231e43f19c4cf26ae8c upstream. If we have an error writing out a delalloc range in btrfs_punch_hole_lock_range we'll unlock the inode and then goto out_only_mutex, where we will again unlock the inode. This is bad, don't do this. Fixes: f27451f2 ("Btrfs: add support for fallocate's zero range operation") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Andreas Gruenbacher
commit 5a5ec83d6ac974b12085cd99b196795f14079037 upstream. Commit 4d207133 changed the types of the statistic values in struct gfs2_lkstats from s64 to u64. Because of that, what should be a signed value in gfs2_update_stats turned into an unsigned value. When shifted right, we end up with a large positive value instead of a small negative value, which results in an incorrect variance estimate. Fixes: 4d207133 ("gfs2: Make statistics unsigned, suitable for use with do_div()") Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Cc: stable@vger.kernel.org # v4.4+ Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Damien Le Moal
commit 0916878da355650d7e77104a7ac0fa1784eca852 upstream. For a single-device mount using a zoned block device, the zone information for the device is stored in the sbi->devs single-entry array and sbi->s_ndevs is set to 1. This differs from a single-device mount using a regular block device, which does not allocate sbi->devs and sets sbi->s_ndevs to 0. However, the sbi->s_ndevs == 0 condition is used throughout the code to differentiate a single-device mount from a multi-device mount, where sbi->s_ndevs is always larger than 1. This results in problems with single zoned block device volumes, as these are treated as multi-device mounts but do not have the start_blk and end_blk information set. One of the problems observed is the skipping of zone discard issuing, resulting in write commands being issued to full zones or unaligned to a zone write pointer. Fix this problem by simply treating the cases sbi->s_ndevs == 0 (single regular block device mount) and sbi->s_ndevs == 1 (single zoned block device mount) in the same manner. This is done by introducing the helper function f2fs_is_multi_device() and using this helper in place of direct tests of the sbi->s_ndevs value, improving code readability. Fixes: 7bb3a371 ("f2fs: Fix zoned block device support") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
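The helper this introduces presumably boils down to a single comparison, along these lines:

	static inline bool f2fs_is_multi_device(struct f2fs_sb_info *sbi)
	{
		/* 0 = single regular bdev, 1 = single zoned bdev: both single-device */
		return sbi->s_ndevs > 1;
	}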
-
Committed by Jan Kara
commit 82a25b027ca48d7ef197295846b352345853dfa8 upstream. We didn't wait for outstanding direct IO during truncate in nojournal mode (as we skip orphan handling in that case). This can lead to fs corruption or stale data exposure if truncate ends up freeing blocks and these get reallocated before direct IO finishes. Fix the condition determining whether the wait is necessary. CC: stable@vger.kernel.org Fixes: 1c9114f9 ("ext4: serialize unlocked dio reads with truncate") Reviewed-by: NIra Weiny <ira.weiny@intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
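Roughly, the corrected condition ties the direct IO wait to the truncate itself rather than to orphan handling; a simplified sketch (not the exact ext4_setattr() hunk):

	/* wait for in-flight direct IO whenever the truncate can free blocks,
	 * even in nojournal mode where no orphan entry is ever added */
	if (attr->ia_size < oldsize)
		inode_dio_wait(inode);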
-
Committed by Jan Kara
commit ee0ed02ca93ef1ecf8963ad96638795d55af2c14 upstream. It is possible that unlinked inode enters ext4_setattr() (e.g. if somebody calls ftruncate(2) on unlinked but still open file). In such case we should not delete the inode from the orphan list if truncate fails. Note that this is mostly a theoretical concern as filesystem is corrupted if we reach this path anyway but let's be consistent in our orphan handling. Reviewed-by: NIra Weiny <ira.weiny@intel.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 26 May 2019, 4 commits
-
-
Committed by Al Viro
[ Upstream commit 4e9036042fedaffcd868d7f7aa948756c48c637d ] To choose whether to pick the GID from the old (16-bit) or new (32-bit) field, we should check if the old gid field is set to 0xffff. Mainline checks the old *UID* field instead - a cut'n'paste error from the corresponding code in ufs_get_inode_uid(). Fixes: 252e211e Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Committed by Kirill Smelkov
commit bbd84f33652f852ce5992d65db4d020aba21f882 upstream. Starting from commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") files opened even via nonseekable_open gate read and write via lock and do not allow them to be run simultaneously. This can create read vs write deadlock if a filesystem is trying to implement a socket-like file which is intended to be simultaneously used for both read and write from filesystem client. See commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") for details and e.g. commit 581d21a2 ("xenbus: fix deadlock on writes to /proc/xen/xenbus") for a similar deadlock example on /proc/xen/xenbus. To avoid such deadlock it was tempting to adjust fuse_finish_open to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the opened handler is having stream-like semantics; does not use file position and thus the kernel is free to issue simultaneous read and write request on opened file handle. This patch together with stream_open() should be added to stable kernels starting from v3.14+. This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all kernel versions. This should work because fuse_finish_open ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: NKirill Smelkov <kirr@nexedi.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
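Kernel-side, the change this describes is small; approximately (simplified from the commit's description of fuse_finish_open()):

	if (ff->open_flags & FOPEN_STREAM)
		stream_open(inode, file);	/* no f_pos use: reads and writes may run in parallel */
	else if (ff->open_flags & FOPEN_NONSEEKABLE)
		nonseekable_open(inode, file);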
-
Committed by Jeff Layton
commit 00abf69dd24f4444d185982379c5cc3bb7b6d1fc upstream. xfstest generic/452 was triggering a "Busy inodes after umount" warning: ceph was allowing the mount to go read-only without first flushing out dirty inodes in the cache. Ensure we sync out the filesystem before allowing a remount to proceed. Cc: stable@vger.kernel.org Link: http://tracker.ceph.com/issues/39571 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Amir Goldstein
commit 3428030da004a1128cbdcf93dc03e16f184d845b upstream. Generalize the helper ovl_open_maybe_copy_up() and use it to copy up file with data before FS_IOC_SETFLAGS ioctl. The FS_IOC_SETFLAGS ioctl is a bit of an odd ball in vfs, which probably caused the confusion. File may be open O_RDONLY, but ioctl modifies the file. VFS does not call mnt_want_write_file() nor lock inode mutex, but fs-specific code for FS_IOC_SETFLAGS does. So ovl_ioctl() calls mnt_want_write_file() for the overlay file, and fs-specific code calls mnt_want_write_file() for upper fs file, but there was no call for ovl_want_write() for copy up duration which prevents overlayfs from copying up on a frozen upper fs. Fixes: dab5ca8f ("ovl: add lsattr/chattr support") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Acked-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-