- 07 9月, 2021 1 次提交
-
-
由 Josef Bacik 提交于
When removing the device we call blkdev_put() on the device once we've removed it, and because we have an EXCL open we need to take the ->open_mutex on the block device to clean it up. Unfortunately during device remove we are holding the sb writers lock, which results in the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc2+ #407 Not tainted ------------------------------------------------------ losetup/11595 is trying to acquire lock: ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0 but task is already holding lock: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x25/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x161/0x390 path_openat+0x3cc/0xa20 do_filp_open+0x96/0x120 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0x7d/0x750 blkdev_put+0x3a/0x220 btrfs_rm_device.cold+0x62/0xe5 btrfs_ioctl+0x2a31/0x2e70 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#12){.+.+}-{0:0}: lo_write_bvec+0xc2/0x240 [loop] loop_process_work+0x238/0xd00 [loop] process_one_work+0x26b/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x245/0x560 worker_thread+0x55/0x3c0 kthread+0x140/0x160 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 flush_workqueue+0x91/0x5e0 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/11595: #0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop] stack backtrace: CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0xcf/0xf0 ? stack_trace_save+0x3b/0x50 __lock_acquire+0x10ea/0x1d90 lock_acquire+0xb5/0x2b0 ? flush_workqueue+0x67/0x5e0 ? lockdep_init_map_type+0x47/0x220 flush_workqueue+0x91/0x5e0 ? flush_workqueue+0x67/0x5e0 ? verify_cpu+0xf0/0x100 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x660 [loop] ? blkdev_ioctl+0x8d/0x2a0 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc21255d4cb So instead save the bdev and do the put once we've dropped the sb writers lock in order to avoid the lockdep recursion. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 23 8月, 2021 13 次提交
-
-
由 Christian Brauner 提交于
The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl and has the following restrictions. The main difference between the two is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER is scoped beneath the file descriptor passed with the ioctl. Specifically, INO_LOOKUP_USER must adhere to the following restrictions: - The caller must be privileged over each inode of each path component for the path they are trying to lookup. - The path for the subvolume the caller is trying to lookup must be reachable from the inode associated with the file descriptor passed with the ioctl. The second condition makes it possible to scope the lookup of the path to the mount identified by the file descriptor passed with the ioctl. This allows us to enable this ioctl on idmapped mounts. Specifically, this is possible because all child subvolumes of a parent subvolume are reachable when the parent subvolume is mounted. So if the user had access to open the parent subvolume or has been given the fd then they can lookup the path if they had access to it provided they were privileged over each path component. Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name of a subvolume even though they would otherwise be restricted from doing so via regular VFS-based lookup. So think about a parent subvolume with multiple child subvolumes. Someone could mount he parent subvolume and restrict access to the child subvolumes by overmounting them with empty directories. At this point the user can't traverse the child subvolumes and they can't open files in the child subvolumes. However, they can still learn the path of child subvolumes as long as they have access to the parent subvolume by using the INO_LOOKUP_USER ioctl. The underlying assumption here is that it's ok that the lookup ioctls can't really take mounts into account other than the original mount the fd belongs to during lookup. Since this assumption is baked into the original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christian Brauner 提交于
Setting flags on subvolumes or snapshots are core features of btrfs. The SUBVOL_SETFLAGS ioctl is especially important as it allows to make subvolumes and snapshots read-only or read-write. Allow setting flags on btrfs subvolumes and snapshots on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christian Brauner 提交于
The SET_RECEIVED_SUBVOL ioctls are used to set information about a received subvolume. Make it possible to set information about a received subvolume on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christian Brauner 提交于
So far we prevented the deletion of subvolumes and snapshots using subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag. This restriction is necessary on idmapped mounts as this allows filesystem wide subvolume and snapshot deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. Deletion by subvolume id works by looking for an alias of the parent of the subvolume or snapshot to be deleted. The parent alias can be anywhere in the filesystem. However, as long as the alias of the parent that is found is the same as the one identified by the file descriptor passed through the ioctl we can allow the deletion. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christian Brauner 提交于
Destroying subvolumes and snapshots are important features of btrfs. Both operations are available to unprivileged users if the filesystem has been mounted with the "user_subvol_rm_allowed" mount option. Allow subvolume and snapshot deletion on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Subvolumes and snapshots can either be deleted by specifying their name or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set. This feature is blocked on idmapped mounts as this allows filesystem wide subvolume deletions and thus can escape the scope of what's exposed under the mount identified by the fd passed with the ioctl. This means that even the root or CAP_SYS_ADMIN capable user can't delete a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional. The root user is currently already subject to permission checks in btrfs_may_delete() including whether the inode's i_uid/i_gid of the directory the subvolume is located in have a mapping in the caller's idmapping. For this to fail isn't currently possible since a btrfs filesystem can't be mounted with a non-initial idmapping but it shows that even the root user would fail to delete a subvolume if the relevant inode isn't mapped in their idmapping. The idmapped mount case is the same in principle. This isn't a huge problem a root user wanting to delete arbitrary subvolumes can just always create another (even detached) mount without an idmapping attached. In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the subvolume to delete is directly located under inode referenced by the fd passed for the ioctl() in a follow-up commit. Here is an example where a btrfs subvolume is deleted through a subvolume mount that does not expose the subvolume to be delete but it can still be deleted by using the subvolume id: /* Compile the following program as "delete_by_spec". */ #define _GNU_SOURCE #include <fcntl.h> #include <inttypes.h> #include <linux/btrfs.h> #include <stdio.h> #include <stdlib.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <unistd.h> static int rm_subvolume_by_id(int fd, uint64_t subvolid) { struct btrfs_ioctl_vol_args_v2 args = {}; int ret; args.flags = BTRFS_SUBVOL_SPEC_BY_ID; args.subvolid = subvolid; ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args); if (ret < 0) return -1; return 0; } int main(int argc, char *argv[]) { int subvolid = 0; if (argc < 3) exit(1); fprintf(stderr, "Opening %s\n", argv[1]); int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY); if (fd < 0) exit(2); subvolid = atoi(argv[2]); fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid); int ret = rm_subvolume_by_id(fd, subvolid); if (ret < 0) exit(3); exit(0); } #include <stdio.h>" #include <stdlib.h>" #include <linux/btrfs.h" truncate -s 10G btrfs.img mkfs.btrfs btrfs.img export LOOPDEV=$(sudo losetup -f --show btrfs.img) mount ${LOOPDEV} /mnt sudo chown $(id -u):$(id -g) /mnt btrfs subvolume create /mnt/A btrfs subvolume create /mnt/B/C # Get subvolume id via: sudo btrfs subvolume show /mnt/A # Save subvolid SUBVOLID=<nr> sudo umount /mnt sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt ./delete_by_spec /mnt ${SUBVOLID} With idmapped mounts this can potentially be used by users to delete subvolumes/snapshots they would otherwise not have access to as the idmapping would be applied to an inode that is not exposed in the mount of the subvolume. The fact that this is a filesystem wide operation suggests it might be a good idea to expose this under a separate ioctl that clearly indicates this. In essence, the file descriptor passed with the ioctl is merely used to identify the filesystem on which to operate when BTRFS_SUBVOL_SPEC_BY_ID is used. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christian Brauner 提交于
Creating subvolumes and snapshots is one of the core features of btrfs and is even available to unprivileged users. Make it possible to use subvolume and snapshot creation on idmapped mounts. This is a fairly straightforward operation since all the permission checking helpers are already capable of handling idmapped mounts. So we just need to pass down the mount's userns. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christian Brauner 提交于
When a new subvolume is created btrfs currently doesn't check whether the fsgid/fsuid of the caller actually have a mapping in the user namespace attached to the filesystem. The VFS always checks this to make sure that the caller's fsgid/fsuid can be represented on-disk. This is most relevant for filesystems that can be mounted inside user namespaces but it is in general a good hardening measure to prevent unrepresentable gid/uid from being written to disk. Since we want to support idmapped mounts for btrfs ioctls to create subvolumes in follow-up patches this becomes important since we want to make sure the fsgid/fsuid of the caller as mapped according to the idmapped mount can be represented on-disk. Simply add the missing fsuidgid_has_mapping() line from the VFS may_create() version to btrfs_may_create(). Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Goldwyn Rodrigues 提交于
Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args, allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_defrag_range_args) = 48 Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Goldwyn Rodrigues 提交于
Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args, allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably small and ioctls are called in process context. sizeof(btrfs_ioctl_quota_rescan_args) = 64 Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Marcos Paulo de Souza 提交于
It's a common practice to start a search using offset (u64)-1, which is the u64 maximum value, meaning that we want the search_slot function to be set in the last item with the same objectid and type. Once we are in this position, it's a matter to start a search backwards by calling btrfs_previous_item, which will check if we'll need to go to a previous leaf and other necessary checks, only to be sure that we are in last offset of the same object and type. The new btrfs_search_backwards function does the all these steps when necessary, and can be used to avoid code duplication. Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Boris Burkov 提交于
Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: NEric Biggers <ebiggers@google.com> Co-developed-by: NChris Mason <clm@fb.com> Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NBoris Burkov <boris@bur.io> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Boris Burkov 提交于
Currently, inode flags are fully backwards incompatible in btrfs. If we introduce a new inode flag, then tree-checker will detect it and fail. This can even cause us to fail to mount entirely. To make it possible to introduce new flags which can be read-only compatible, like VERITY, we add new ro flags to btrfs without treating them quite so harshly in tree-checker. A read-only file system can survive an unexpected flag, and can be mounted. As for the implementation, it unfortunately gets a little complicated. The on-disk representation of the inode, btrfs_inode_item, has an __le64 for flags but the in-memory representation, btrfs_inode, uses a u32. David Sterba had the nice idea that we could reclaim those wasted 32 bits on disk and use them for the new ro_compat flags. It turns out that the tree-checker code which checks for unknown flags is broken, and ignores the upper 32 bits we are hoping to use. The issue is that the flags use the literal 1 rather than 1ULL, so the flags are signed ints, and one of them is specifically (1 << 31). As a result, the mask which ORs the flags is a negative integer on machines where int is 32 bit twos complement. When tree-checker evaluates the expression: btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) The mask is something like 0x80000abc, which gets promoted to u64 with sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves all the upper bits zeroed, and we can't detect unexpected flags. This suggests that we can't use those bits after all. Luckily, we have good reason to believe that they are zero anyway. Inode flags are metadata, which is always checksummed, so any bit flips that would introduce 1s would cause a checksum failure anyway (excluding the improbable case of the checksum getting corrupted exactly badly). Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit inode flag should preserve its value and not add leading zeroes (at least for twos complement). The only place that flag (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in the root item, and indeed for that inode we see 0xffffffff80000000 as the flags on disk. However, that inode is never seen by tree checker, nor is it used in a context where verity might be meaningful. Theoretically, a future ro flag might cause trouble on that inode, so we should proactively clean up that mess before it does. With the introduction of the new ro flags, keep two separate unsigned masks and check them against the appropriate u32. Since we no longer run afoul of sign extension, this also stops writing out 0xffffffff80000000 in root_item inodes going forward. Signed-off-by: NBoris Burkov <boris@bur.io> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Since now we support data and metadata read-write for subpage, remove the RO requirement for subpage mount. There are some extra limitations though: - For now, subpage RW mount is still considered experimental Thus that mount warning will still be there. - No compression support There are still quite some PAGE_SIZE hard coded and quite some call sites use extent_clear_unlock_delalloc() to unlock locked_page. This will screw up subpage helpers. Now for subpage RW mount, no matter what mount option or inode attr is set, all writes will not be compressed. Although reading compressed data has no problem. - No defrag for subpage case The defrag support for subpage case will come in later patches, which will also rework the defrag workflow. - No inline extent will be created This is mostly due to the fact that filemap_fdatawrite_range() will trigger more write than the range specified. In fallocate calls, this behavior can make us to writeback which can be inlined, before we enlarge the i_size. This is a very special corner case, and even current btrfs check won't report error on such inline extent + regular extent. But considering how much effort has been put to prevent such inline + regular, I'd prefer to cut off inline extent completely until we have a good solution. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 22 6月, 2021 1 次提交
-
-
由 David Sterba 提交于
Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 21 6月, 2021 7 次提交
-
-
由 David Sterba 提交于
There's only one caller left btrfs_ioctl_start_sync that passes 0, so we can remove the switch in btrfs_commit_transaction_async. A cleanup 9babda9f ("btrfs: Remove async_transid from btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed 1, so this is a followup. As this removes last call of wait_current_trans_commit_start_and_unblock, remove the function as well. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Accept device name "cancel" as a request to cancel running device deletion operation. The string is literal, in case there's a real device named "cancel", pass it as full absolute path or as "./cancel" This works for v1 and v2 ioctls when the device is specified by name. Moving chunks from the device uses relocation, use the conditional exclusive operation start and cancellation helpers Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Accept literal string "cancel" as resize operation and interpret that as a request to cancel the running operation. If it's running, wait until it finishes current work and return ECANCELED. Shrinking resize uses relocation to move the chunks away, use the conditional exclusive operation start and cancellation helpers. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
To support optional cancellation of some operations, add helper that will wrap all the combinations. In normal mode it's same as btrfs_exclop_start, in cancellation mode it checks if it's already running and request cancellation and waits until completion. The error codes can be returned to to user space and semantics is not changed, adding ECANCELED. This should be evaluated as an error and that the operation has not completed and the operation should be restarted or the filesystem status reviewed. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Add try-lock for exclusive operation start to allow callers to do more checks. The same operation must already be running. The try-lock and unlock must pair and are a substitute for btrfs_exclop_start, thus it must also pair with btrfs_exclop_finish to release the exclop context. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The exclusive operation is now atomically checked and set using bit operations. Switch it to protection by spinlock. The super block lock is not frequently used and adding a new lock seems like an overkill so it should be safe to reuse it. The reason to use spinlock is to enhance the locking context so more checks can be done, eg. allowing the same exclusive operation enter the exclop section and cancel the running one. This will be used for resize and device delete. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Tian Tao 提交于
When inode defrag is canceled, the error is set to EAGAIN but then overwritten by number of defragmented bytes. As this would hide the error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it will print the error and continue to next file (as it does in for any other error). Signed-off-by: NTian Tao <tiantao6@hisilicon.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 11 5月, 2021 1 次提交
-
-
由 Ritesh Harjani 提交于
Add error handling in btrfs_fileattr_set in case of an error while starting a transaction. This fixes btrfs/232 which otherwise used to fail with below signature on Power. btrfs/232 [ 1119.474650] run fstests btrfs/232 at 2021-04-21 02:21:22 <...> [ 1366.638585] BUG: Unable to handle kernel data access on read at 0xffffffffffffff86 [ 1366.638768] Faulting instruction address: 0xc0000000009a5c88 cpu 0x0: Vector: 380 (Data SLB Access) at [c000000014f177b0] pc: c0000000009a5c88: btrfs_update_root_times+0x58/0xc0 lr: c0000000009a5c84: btrfs_update_root_times+0x54/0xc0 <...> pid = 24881, comm = fsstress btrfs_update_inode+0xa0/0x140 btrfs_fileattr_set+0x5d0/0x6f0 vfs_fileattr_set+0x2a8/0x390 do_vfs_ioctl+0x1290/0x1ac0 sys_ioctl+0x6c/0x120 system_call_exception+0x3d4/0x410 system_call_common+0xec/0x278 Fixes: 97fc2977 ("btrfs: convert to fileattr") Signed-off-by: NRitesh Harjani <riteshh@linux.ibm.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 29 4月, 2021 1 次提交
-
-
由 Filipe Manana 提交于
There are a few exceptional cases where cloning an inline extent needs to copy the inline extent data into a page of the destination inode. When this happens, we end up starting a transaction while having a dirty page for the destination inode and while having the range locked in the destination's inode iotree too. Because when reserving metadata space for a transaction we may need to flush existing delalloc in case there is not enough free space, we have a mechanism in place to prevent a deadlock, which was introduced in commit 3d45f221 ("btrfs: fix deadlock when cloning inline extent and low on free metadata space"). However when using qgroups, a transaction also reserves metadata qgroup space, which can also result in flushing delalloc in case there is not enough available space at the moment. When this happens we deadlock, since flushing delalloc requires locking the file range in the inode's iotree and the range was already locked at the very beginning of the clone operation, before attempting to start the transaction. When this issue happens, stack traces like the following are reported: [72747.556262] task:kworker/u81:9 state:D stack: 0 pid: 225 ppid: 2 flags:0x00004000 [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142) [72747.556271] Call Trace: [72747.556273] __schedule+0x296/0x760 [72747.556277] schedule+0x3c/0xa0 [72747.556279] io_schedule+0x12/0x40 [72747.556284] __lock_page+0x13c/0x280 [72747.556287] ? generic_file_readonly_mmap+0x70/0x70 [72747.556325] extent_write_cache_pages+0x22a/0x440 [btrfs] [72747.556331] ? __set_page_dirty_nobuffers+0xe7/0x160 [72747.556358] ? set_extent_buffer_dirty+0x5e/0x80 [btrfs] [72747.556362] ? update_group_capacity+0x25/0x210 [72747.556366] ? cpumask_next_and+0x1a/0x20 [72747.556391] extent_writepages+0x44/0xa0 [btrfs] [72747.556394] do_writepages+0x41/0xd0 [72747.556398] __writeback_single_inode+0x39/0x2a0 [72747.556403] writeback_sb_inodes+0x1ea/0x440 [72747.556407] __writeback_inodes_wb+0x5f/0xc0 [72747.556410] wb_writeback+0x235/0x2b0 [72747.556414] ? get_nr_inodes+0x35/0x50 [72747.556417] wb_workfn+0x354/0x490 [72747.556420] ? newidle_balance+0x2c5/0x3e0 [72747.556424] process_one_work+0x1aa/0x340 [72747.556426] worker_thread+0x30/0x390 [72747.556429] ? create_worker+0x1a0/0x1a0 [72747.556432] kthread+0x116/0x130 [72747.556435] ? kthread_park+0x80/0x80 [72747.556438] ret_from_fork+0x1f/0x30 [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [72747.566961] Call Trace: [72747.566964] __schedule+0x296/0x760 [72747.566968] ? finish_wait+0x80/0x80 [72747.566970] schedule+0x3c/0xa0 [72747.566995] wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs] [72747.566999] ? finish_wait+0x80/0x80 [72747.567024] lock_extent_bits+0x37/0x90 [btrfs] [72747.567047] btrfs_invalidatepage+0x299/0x2c0 [btrfs] [72747.567051] ? find_get_pages_range_tag+0x2cd/0x380 [72747.567076] __extent_writepage+0x203/0x320 [btrfs] [72747.567102] extent_write_cache_pages+0x2bb/0x440 [btrfs] [72747.567106] ? update_load_avg+0x7e/0x5f0 [72747.567109] ? enqueue_entity+0xf4/0x6f0 [72747.567134] extent_writepages+0x44/0xa0 [btrfs] [72747.567137] ? enqueue_task_fair+0x93/0x6f0 [72747.567140] do_writepages+0x41/0xd0 [72747.567144] __filemap_fdatawrite_range+0xc7/0x100 [72747.567167] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [72747.567195] btrfs_work_helper+0xc2/0x300 [btrfs] [72747.567200] process_one_work+0x1aa/0x340 [72747.567202] worker_thread+0x30/0x390 [72747.567205] ? create_worker+0x1a0/0x1a0 [72747.567208] kthread+0x116/0x130 [72747.567211] ? kthread_park+0x80/0x80 [72747.567214] ret_from_fork+0x1f/0x30 [72747.569686] task:fsstress state:D stack: 0 pid:841421 ppid:841417 flags:0x00000000 [72747.569689] Call Trace: [72747.569691] __schedule+0x296/0x760 [72747.569694] schedule+0x3c/0xa0 [72747.569721] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569725] ? finish_wait+0x80/0x80 [72747.569753] btrfs_qgroup_reserve_data+0x34/0x50 [btrfs] [72747.569781] btrfs_check_data_free_space+0x5f/0xa0 [btrfs] [72747.569804] btrfs_buffered_write+0x1f7/0x7f0 [btrfs] [72747.569810] ? path_lookupat.isra.48+0x97/0x140 [72747.569833] btrfs_file_write_iter+0x81/0x410 [btrfs] [72747.569836] ? __kmalloc+0x16a/0x2c0 [72747.569839] do_iter_readv_writev+0x160/0x1c0 [72747.569843] do_iter_write+0x80/0x1b0 [72747.569847] vfs_writev+0x84/0x140 [72747.569869] ? btrfs_file_llseek+0x38/0x270 [btrfs] [72747.569873] do_writev+0x65/0x100 [72747.569876] do_syscall_64+0x33/0x40 [72747.569879] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [72747.569899] task:fsstress state:D stack: 0 pid:841424 ppid:841417 flags:0x00004000 [72747.569903] Call Trace: [72747.569906] __schedule+0x296/0x760 [72747.569909] schedule+0x3c/0xa0 [72747.569936] try_flush_qgroup+0x95/0x140 [btrfs] [72747.569940] ? finish_wait+0x80/0x80 [72747.569967] __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs] [72747.569989] start_transaction+0x279/0x580 [btrfs] [72747.570014] clone_copy_inline_extent+0x332/0x490 [btrfs] [72747.570041] btrfs_clone+0x5b7/0x7a0 [btrfs] [72747.570068] ? lock_extent_bits+0x64/0x90 [btrfs] [72747.570095] btrfs_clone_files+0xfc/0x150 [btrfs] [72747.570122] btrfs_remap_file_range+0x3d8/0x4a0 [btrfs] [72747.570126] do_clone_file_range+0xed/0x200 [72747.570131] vfs_clone_file_range+0x37/0x110 [72747.570134] ioctl_file_clone+0x7d/0xb0 [72747.570137] do_vfs_ioctl+0x138/0x630 [72747.570140] __x64_sys_ioctl+0x62/0xc0 [72747.570143] do_syscall_64+0x33/0x40 [72747.570146] entry_SYSCALL_64_after_hwframe+0x44/0xa9 So fix this by skipping the flush of delalloc for an inode that is flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under such a special case of cloning an inline extent, when flushing delalloc during qgroup metadata reservation. The special cases for cloning inline extents were added in kernel 5.7 by by commit 05a5a762 ("Btrfs: implement full reflink support for inline extents"), while having qgroup metadata space reservation flushing delalloc when low on space was added in kernel 5.9 by commit c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable kernel backports. Reported-by: NWang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/ Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.9+ Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 21 4月, 2021 1 次提交
-
-
由 Filipe Manana 提交于
When creating a subvolume we allocate an extent buffer for its root node after starting a transaction. We setup a root item for the subvolume that points to that extent buffer and then attempt to insert the root item into the root tree - however if that fails, due to ENOMEM for example, we do not free the extent buffer previously allocated and we do not abort the transaction (as at that point we did nothing that can not be undone). This means that we effectively do not return the metadata extent back to the free space cache/tree and we leave a delayed reference for it which causes a metadata extent item to be added to the extent tree, in the next transaction commit, without having backreferences. When this happens 'btrfs check' reports the following: $ btrfs check /dev/sdi Opening filesystem to check... Checking filesystem on /dev/sdi UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb [1/7] checking root items [2/7] checking extents ref mismatch on [30425088 16384] extent item 1, found 0 backref 30425088 root 256 not referenced back 0x564a91c23d70 incorrect global backref count on 30425088 found 1 wanted 0 backpointer mismatch on [30425088 16384] owner ref check failed [30425088 16384] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space cache [4/7] checking fs roots [5/7] checking only csums items (without verifying data) [6/7] checking root refs [7/7] checking quota groups skipped (not enabled on this FS) found 212992 bytes used, error(s) found total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 124669 file data blocks allocated: 65536 referenced 65536 So fix this by freeing the metadata extent if btrfs_insert_root() returns an error. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 19 4月, 2021 2 次提交
-
-
由 Josef Bacik 提交于
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in create_subvol. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 12 4月, 2021 1 次提交
-
-
由 Miklos Szeredi 提交于
Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com> Cc: David Sterba <dsterba@suse.com>
-
- 02 3月, 2021 1 次提交
-
-
由 Dan Carpenter 提交于
The problem is we're copying "inherit" from user space but we don't necessarily know that we're copying enough data for a 64 byte struct. Then the next problem is that 'inherit' has a variable size array at the end, and we have to verify that array is the size we expected. Fixes: 6f72c7e2 ("Btrfs: add qgroup inheritance") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 09 2月, 2021 7 次提交
-
-
由 Naohiro Aota 提交于
The implementation of fitrim depends on space cache, which is not used and disabled for zoned extent allocator. So the current code does not work with zoned filesystem. In the future, we can implement fitrim for zoned filesystems by enabling space cache (but, only for fitrim) or scanning the extent tree at fitrim time. For now, disallow fitrim on zoned filesystems. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
To support subpage sector size, data also need extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after page is mapped to an inode. One call site attaches pages to inode manually, thus we have to modify the timing of set_page_extent_mapped() a bit. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simply the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page, data balance require this feature). So we will stick to btrfs specific bitmap for now. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It's currently u64 which gets instantly translated either to LONG_MAX (if U64_MAX is passed) or cast to an unsigned long (which is in fact, wrong because writeback_control::nr_to_write is a signed, long type). Just convert the function's argument to be long time which obviates the need to manually convert u64 value to a long. Adjust all call sites which pass U64_MAX to pass LONG_MAX. Finally ensure that in shrink_delalloc the u64 is converted to a long without overflowing, resulting in a negative number. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It's no longer used. While at it also remove new_dirid in create_subvol as it's used in a single place and open code it. No functional changes. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Adjust the way free_objectid is being initialized, it now stores BTRFS_FIRST_FREE_OBJECTID rather than the, somewhat arbitrary, BTRFS_FIRST_FREE_OBJECTID - 1. This change also has the added benefit that now it becomes unnecessary to explicitly initialize free_objectid for a newly create fs root. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
This reflects the true purpose of the member as it's being used solely in context where a new objectid is being allocated. Future changes will also change the way it's being used to closely follow this semantics. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
This better reflects the semantics of the function i.e no search is performed whatsoever. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 24 1月, 2021 3 次提交
-
-
由 Christian Brauner 提交于
The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Morris <jamorris@linux.microsoft.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Morris <jamorris@linux.microsoft.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Morris <jamorris@linux.microsoft.com> Acked-by: NSerge Hallyn <serge@hallyn.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 18 12月, 2020 1 次提交
-
-
由 Filipe Manana 提交于
When cloning an inline extent there are cases where we can not just copy the inline extent from the source range to the target range (e.g. when the target range starts at an offset greater than zero). In such cases we copy the inline extent's data into a page of the destination inode and then dirty that page. However, after that we will need to start a transaction for each processed extent and, if we are ever low on available metadata space, we may need to flush existing delalloc for all dirty inodes in an attempt to release metadata space - if that happens we may deadlock: * the async reclaim task queued a delalloc work to flush delalloc for the destination inode of the clone operation; * the task executing that delalloc work gets blocked waiting for the range with the dirty page to be unlocked, which is currently locked by the task doing the clone operation; * the async reclaim task blocks waiting for the delalloc work to complete; * the cloning task is waiting on the waitqueue of its reservation ticket while holding the range with the dirty page locked in the inode's io_tree; * if metadata space is not released by some other task (like delalloc for some other inode completing for example), the clone task waits forever and as a consequence the delalloc work and async reclaim tasks will hang forever as well. Releasing more space on the other hand may require starting a transaction, which will hang as well when trying to reserve metadata space, resulting in a deadlock between all these tasks. When this happens, traces like the following show up in dmesg/syslog: [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [87452.326136] Call Trace: [87452.326737] __schedule+0x5d1/0xcf0 [87452.327390] schedule+0x45/0xe0 [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs] [87452.328894] ? finish_wait+0x90/0x90 [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs] [87452.330133] ? __mod_memcg_state+0x8e/0x160 [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs] [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs] [87452.332007] ? lock_release+0x20e/0x4c0 [87452.332557] ? trace_hardirqs_on+0x1b/0xf0 [87452.333127] extent_writepages+0x43/0x90 [btrfs] [87452.333653] ? lock_acquire+0x1a3/0x490 [87452.334177] do_writepages+0x43/0xe0 [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100 [87452.335720] __filemap_fdatawrite_range+0xc5/0x100 [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs] [87452.337838] process_one_work+0x24e/0x5e0 [87452.338437] worker_thread+0x50/0x3b0 [87452.339137] ? process_one_work+0x5e0/0x5e0 [87452.339884] kthread+0x153/0x170 [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.341153] ret_from_fork+0x22/0x30 [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [87452.345655] Call Trace: [87452.346305] __schedule+0x5d1/0xcf0 [87452.346947] ? kvm_clock_read+0x14/0x30 [87452.347676] ? wait_for_completion+0x81/0x110 [87452.348389] schedule+0x45/0xe0 [87452.349077] schedule_timeout+0x30c/0x580 [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [87452.350340] ? lock_acquire+0x1a3/0x490 [87452.351006] ? try_to_wake_up+0x7a/0xa20 [87452.351541] ? lock_release+0x20e/0x4c0 [87452.352040] ? lock_acquired+0x199/0x490 [87452.352517] ? wait_for_completion+0x81/0x110 [87452.353000] wait_for_completion+0xab/0x110 [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs] [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] [87452.354455] flush_space+0x24f/0x660 [btrfs] [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] [87452.355565] process_one_work+0x24e/0x5e0 [87452.356024] worker_thread+0x20f/0x3b0 [87452.356487] ? process_one_work+0x5e0/0x5e0 [87452.356973] kthread+0x153/0x170 [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.357880] ret_from_fork+0x22/0x30 (...) < stack traces of several tasks waiting for the locks of the inodes of the clone operation > (...) [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97 [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960 [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000 [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000 [92867.447920] Call Trace: [92867.448435] __schedule+0x5d1/0xcf0 [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [92867.449423] schedule+0x45/0xe0 [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs] [92867.450576] ? finish_wait+0x90/0x90 [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs] [92867.452412] start_transaction+0x2d1/0x760 [btrfs] [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs] [92867.453848] ? lock_release+0x20e/0x4c0 [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs] [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs] [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs] [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs] [92867.457213] do_clone_file_range+0xd4/0x1f0 [92867.457828] vfs_clone_file_range+0x4d/0x230 [92867.458355] ? lock_release+0x20e/0x4c0 [92867.458890] ioctl_file_clone+0x8f/0xc0 [92867.459377] do_vfs_ioctl+0x342/0x750 [92867.459913] __x64_sys_ioctl+0x62/0xb0 [92867.460377] do_syscall_64+0x33/0x80 [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9 (...) < stack traces of more tasks blocked on metadata reservation like the clone task above, because the async reclaim task has deadlocked > (...) Another thing to notice is that the worker task that is deadlocked when trying to flush the destination inode of the clone operation is at btrfs_invalidatepage(). This is simply because the clone operation has a destination offset greater than the i_size and we only update the i_size of the destination file after cloning an extent (just like we do in the buffered write path). Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger the flushing of delalloc for all inodes that have delalloc, add a runtime flag to an inode to signal it should not be flushed, and for inodes with that flag set, start_delalloc_inodes() will simply skip them. When the cloning code needs to dirty a page to copy an inline extent, set that flag on the inode and then clear it when the clone operation finishes. This could be sporadically triggered with test case generic/269 from fstests, which exercises many fsstress processes running in parallel with several dd processes filling up the entire filesystem. CC: stable@vger.kernel.org # 5.9+ Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents") Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-