- 19 4月, 2021 1 次提交
-
-
由 Josef Bacik 提交于
A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 02 3月, 2021 1 次提交
-
-
由 Dan Carpenter 提交于
The problem is we're copying "inherit" from user space but we don't necessarily know that we're copying enough data for a 64 byte struct. Then the next problem is that 'inherit' has a variable size array at the end, and we have to verify that array is the size we expected. Fixes: 6f72c7e2 ("Btrfs: add qgroup inheritance") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 09 2月, 2021 7 次提交
-
-
由 Naohiro Aota 提交于
The implementation of fitrim depends on space cache, which is not used and disabled for zoned extent allocator. So the current code does not work with zoned filesystem. In the future, we can implement fitrim for zoned filesystems by enabling space cache (but, only for fitrim) or scanning the extent tree at fitrim time. For now, disallow fitrim on zoned filesystems. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
To support subpage sector size, data also need extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after page is mapped to an inode. One call site attaches pages to inode manually, thus we have to modify the timing of set_page_extent_mapped() a bit. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simply the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page, data balance require this feature). So we will stick to btrfs specific bitmap for now. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It's currently u64 which gets instantly translated either to LONG_MAX (if U64_MAX is passed) or cast to an unsigned long (which is in fact, wrong because writeback_control::nr_to_write is a signed, long type). Just convert the function's argument to be long time which obviates the need to manually convert u64 value to a long. Adjust all call sites which pass U64_MAX to pass LONG_MAX. Finally ensure that in shrink_delalloc the u64 is converted to a long without overflowing, resulting in a negative number. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It's no longer used. While at it also remove new_dirid in create_subvol as it's used in a single place and open code it. No functional changes. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Adjust the way free_objectid is being initialized, it now stores BTRFS_FIRST_FREE_OBJECTID rather than the, somewhat arbitrary, BTRFS_FIRST_FREE_OBJECTID - 1. This change also has the added benefit that now it becomes unnecessary to explicitly initialize free_objectid for a newly create fs root. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
This reflects the true purpose of the member as it's being used solely in context where a new objectid is being allocated. Future changes will also change the way it's being used to closely follow this semantics. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
This better reflects the semantics of the function i.e no search is performed whatsoever. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 24 1月, 2021 3 次提交
-
-
由 Christian Brauner 提交于
The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Morris <jamorris@linux.microsoft.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Morris <jamorris@linux.microsoft.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Morris <jamorris@linux.microsoft.com> Acked-by: NSerge Hallyn <serge@hallyn.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 18 12月, 2020 1 次提交
-
-
由 Filipe Manana 提交于
When cloning an inline extent there are cases where we can not just copy the inline extent from the source range to the target range (e.g. when the target range starts at an offset greater than zero). In such cases we copy the inline extent's data into a page of the destination inode and then dirty that page. However, after that we will need to start a transaction for each processed extent and, if we are ever low on available metadata space, we may need to flush existing delalloc for all dirty inodes in an attempt to release metadata space - if that happens we may deadlock: * the async reclaim task queued a delalloc work to flush delalloc for the destination inode of the clone operation; * the task executing that delalloc work gets blocked waiting for the range with the dirty page to be unlocked, which is currently locked by the task doing the clone operation; * the async reclaim task blocks waiting for the delalloc work to complete; * the cloning task is waiting on the waitqueue of its reservation ticket while holding the range with the dirty page locked in the inode's io_tree; * if metadata space is not released by some other task (like delalloc for some other inode completing for example), the clone task waits forever and as a consequence the delalloc work and async reclaim tasks will hang forever as well. Releasing more space on the other hand may require starting a transaction, which will hang as well when trying to reserve metadata space, resulting in a deadlock between all these tasks. When this happens, traces like the following show up in dmesg/syslog: [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [87452.326136] Call Trace: [87452.326737] __schedule+0x5d1/0xcf0 [87452.327390] schedule+0x45/0xe0 [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs] [87452.328894] ? finish_wait+0x90/0x90 [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs] [87452.330133] ? __mod_memcg_state+0x8e/0x160 [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs] [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs] [87452.332007] ? lock_release+0x20e/0x4c0 [87452.332557] ? trace_hardirqs_on+0x1b/0xf0 [87452.333127] extent_writepages+0x43/0x90 [btrfs] [87452.333653] ? lock_acquire+0x1a3/0x490 [87452.334177] do_writepages+0x43/0xe0 [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100 [87452.335720] __filemap_fdatawrite_range+0xc5/0x100 [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs] [87452.337838] process_one_work+0x24e/0x5e0 [87452.338437] worker_thread+0x50/0x3b0 [87452.339137] ? process_one_work+0x5e0/0x5e0 [87452.339884] kthread+0x153/0x170 [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.341153] ret_from_fork+0x22/0x30 [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [87452.345655] Call Trace: [87452.346305] __schedule+0x5d1/0xcf0 [87452.346947] ? kvm_clock_read+0x14/0x30 [87452.347676] ? wait_for_completion+0x81/0x110 [87452.348389] schedule+0x45/0xe0 [87452.349077] schedule_timeout+0x30c/0x580 [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [87452.350340] ? lock_acquire+0x1a3/0x490 [87452.351006] ? try_to_wake_up+0x7a/0xa20 [87452.351541] ? lock_release+0x20e/0x4c0 [87452.352040] ? lock_acquired+0x199/0x490 [87452.352517] ? wait_for_completion+0x81/0x110 [87452.353000] wait_for_completion+0xab/0x110 [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs] [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] [87452.354455] flush_space+0x24f/0x660 [btrfs] [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] [87452.355565] process_one_work+0x24e/0x5e0 [87452.356024] worker_thread+0x20f/0x3b0 [87452.356487] ? process_one_work+0x5e0/0x5e0 [87452.356973] kthread+0x153/0x170 [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.357880] ret_from_fork+0x22/0x30 (...) < stack traces of several tasks waiting for the locks of the inodes of the clone operation > (...) [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97 [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960 [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000 [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000 [92867.447920] Call Trace: [92867.448435] __schedule+0x5d1/0xcf0 [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [92867.449423] schedule+0x45/0xe0 [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs] [92867.450576] ? finish_wait+0x90/0x90 [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs] [92867.452412] start_transaction+0x2d1/0x760 [btrfs] [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs] [92867.453848] ? lock_release+0x20e/0x4c0 [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs] [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs] [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs] [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs] [92867.457213] do_clone_file_range+0xd4/0x1f0 [92867.457828] vfs_clone_file_range+0x4d/0x230 [92867.458355] ? lock_release+0x20e/0x4c0 [92867.458890] ioctl_file_clone+0x8f/0xc0 [92867.459377] do_vfs_ioctl+0x342/0x750 [92867.459913] __x64_sys_ioctl+0x62/0xb0 [92867.460377] do_syscall_64+0x33/0x80 [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9 (...) < stack traces of more tasks blocked on metadata reservation like the clone task above, because the async reclaim task has deadlocked > (...) Another thing to notice is that the worker task that is deadlocked when trying to flush the destination inode of the clone operation is at btrfs_invalidatepage(). This is simply because the clone operation has a destination offset greater than the i_size and we only update the i_size of the destination file after cloning an extent (just like we do in the buffered write path). Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger the flushing of delalloc for all inodes that have delalloc, add a runtime flag to an inode to signal it should not be flushed, and for inodes with that flag set, start_delalloc_inodes() will simply skip them. When the cloning code needs to dirty a page to copy an inline extent, set that flag on the inode and then clear it when the clone operation finishes. This could be sporadically triggered with test case generic/269 from fstests, which exercises many fsstress processes running in parallel with several dd processes filling up the entire filesystem. CC: stable@vger.kernel.org # 5.9+ Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents") Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 10 12月, 2020 2 次提交
-
-
由 Nikolay Borisov 提交于
It's been deprecated since commit b547a88e ("btrfs: start deprecation of mount option inode_cache") which enumerates the reasons. A filesystem that uses the feature (mount -o inode_cache) tracks the inode numbers in bitmaps, that data stay on the filesystem after this patch. The size is roughly 5MiB for 1M inodes [1], which is considered small enough to be left there. Removal of the change can be implemented in btrfs-progs if needed. [1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
NODATACOW implies overwriting the file data on a device, which is impossible in sequential required zones. Disable NODATACOW globally with mount option and per-file NODATACOW attribute by masking FS_NOCOW_FL. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 08 12月, 2020 4 次提交
-
-
由 Nikolay Borisov 提交于
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to check if the parameter seed is true in the function btrfs_find_device(). This tells it whether to traverse the seed device list or not. After this commit, the argument is unused and can be removed. In device_list_add() it's not necessary because fs_devices always points to the device's fs_devices. So with the devid+uuid matching, it will find the right device and return, thus not needing to traverse seed devices. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
When defragmenting we skip ranges that have holes or inline extents, so that we don't do unnecessary IO and waste space. We do this check when calling should_defrag_range() at btrfs_defrag_file(). However we do it without holding the inode's lock. The reason we do it like this is to avoid blocking other tasks for too long, that possibly want to operate on other file ranges, since after the call to should_defrag_range() and before locking the inode, we trigger a synchronous page cache readahead. However before we were able to lock the inode, some other task might have punched a hole in our range, or we may now have an inline extent there, in which case we should not set the range for defrag anymore since that would cause unnecessary IO and make us waste space (i.e. allocating extents to contain zeros for a hole). So after we locked the inode and the range in the iotree, check again if we have holes or an inline extent, and if we do, just skip the range. I hit this while testing my next patch that fixes races when updating an inode's number of bytes (subject "btrfs: update the number of bytes used by an inode atomically"), and it depends on this change in order to work correctly. Alternatively I could rework that other patch to detect holes and flag their range with the 'new delalloc' bit, but this itself fixes an efficiency problem due a race that from a functional point of view is not harmful (it could be triggered with btrfs/062 from fstests). CC: stable@vger.kernel.org # 5.4+ Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We no longer distinguish between blocking and spinning, so rip out all this code. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 05 11月, 2020 1 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
On 32-bit systems, this shift will overflow for files larger than 4GB as start_index is unsigned long while the calls to btrfs_delalloc_*_space expect u64. CC: stable@vger.kernel.org # 4.4+ Fixes: df480633 ("btrfs: extent-tree: Switch to new delalloc space reserve and release") Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ define the variable instead of repeating the shift ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 07 10月, 2020 7 次提交
-
-
由 Nikolay Borisov 提交于
The passed in ordered_extent struct is always well-formed and contains the inode making the explicit argument redundant. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We have this thing wrapped in an RCU lock, but it's really not needed. We create all the space_info's on mount, and we destroy them on unmount. The list never changes and we're protected from messing with it by the normal mount/umount path, so kill the RCU stuff around it. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Goldwyn Rodrigues 提交于
/sys/fs/<fsid>/exclusive_operation contains the currently executing exclusive operation. Add a sysfs_notify() when operation end, so userspace can be notified of exclusive operation is finished. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Goldwyn Rodrigues 提交于
Instead of using a flag bit for exclusive operation, use a variable to store which exclusive operation is being performed. Introduce an API to start and finish an exclusive operation. This would enable another way for tools to check which operation is running on why starting an exclusive operation failed. The followup patch adds a sysfs_notify() to alert userspace when the state changes, so userspace can perform select() on it to get notified of the change. This would enable us to enqueue a command which will wait for current exclusive operation to complete before issuing the next exclusive operation. This has been done synchronously as opposed to a background process, or else error collection (if any) will become difficult. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
When we COW a block we are holding a lock on the original block, and then we lock the new COW block. Because our lockdep maps are based on root + level, this will make lockdep complain. We need a way to indicate a subclass for locking the COW'ed block, so plumb through our btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer, and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks. The reason I've added all this extra infrastructure is because there will be need of different nesting classes in follow up patches. Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] When quota is enabled for TEST_DEV, generic/013 sometimes fails like this: generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg) And with the following metadata leak: BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace a6cfd45ba80e4e06 ]--- BTRFS error (device dm-3): qgroup reserved space leaked BTRFS info (device dm-3): disk space caching is enabled BTRFS info (device dm-3): has skinny extents [CAUSE] The qgroup preallocated meta rsv operations of that offending root are: btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072 btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072 btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152 btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072 btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072 It's pretty obvious that, we reserve qgroup meta rsv in btrfs_subvolume_reserve_metadata(), but doesn't have corresponding release/convert calls in btrfs_subvolume_release_metadata(). This leads to the leakage. [FIX] To fix this bug, we should follow what we're doing in btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and add it to block_rsv->qgroup_rsv_reserved. And free the qgroup reserved metadata space when releasing the block_rsv. To do this, we need to change the btrfs_subvolume_release_metadata() to accept btrfs_root, and record the qgroup_to_release number, and call btrfs_qgroup_convert_reserved_meta() for it. Fixes: 733e03a0 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We have btrfs_wait_ordered_roots() which takes a u64 for nr, but btrfs_start_delalloc_roots() that takes an int for nr, which makes using them in conjunction, especially for something like (u64)-1, annoying and inconsistent. Fix btrfs_start_delalloc_roots() to take a u64 for nr and adjust start_delalloc_inodes() and it's callers appropriately. This means we've adjusted start_delalloc_inodes() to take a pointer of nr since we want to preserve the ability for start-delalloc_inodes() to return an error, so simply make it do the nr adjusting as necessary. Part of adjusting the callers to this means changing btrfs_writeback_inodes_sb_nr() to take a u64 for items. This may be confusing because it seems unrelated, but the caller of btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the function variable that needs to be changed. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Tested-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 14 9月, 2020 1 次提交
-
-
由 Filipe Manana 提交于
When faulting in the pages for the user supplied buffer for the search ioctl, we are passing only the base address of the buffer to the function fault_in_pages_writeable(). This means that after the first iteration of the while loop that searches for leaves, when we have a non-zero offset, stored in 'sk_offset', we try to fault in a wrong page range. So fix this by adding the offset in 'sk_offset' to the base address of the user supplied buffer when calling fault_in_pages_writeable(). Several users have reported that the applications compsize and bees have started to operate incorrectly since commit a48b73ec ("btrfs: fix potential deadlock in the search ioctl") was added to stable trees, and these applications make heavy use of the search ioctls. This fixes their issues. Link: https://lore.kernel.org/linux-btrfs/632b888d-a3c3-b085-cdf5-f9bb61017d92@lechevalier.se/ Link: https://github.com/kilobyte/compsize/issues/34 Fixes: a48b73ec ("btrfs: fix potential deadlock in the search ioctl") CC: stable@vger.kernel.org # 4.4+ Tested-by: NA L <mail@lechevalier.se> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 27 8月, 2020 1 次提交
-
-
由 Josef Bacik 提交于
With the conversion of the tree locks to rwsem I got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted ------------------------------------------------------ compsize/11122 is trying to acquire lock: ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90 but task is already holding lock: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-fs-00){++++}-{3:3}: down_write_nested+0x3b/0x70 __btrfs_tree_lock+0x24/0x120 btrfs_search_slot+0x756/0x990 btrfs_lookup_inode+0x3a/0xb4 __btrfs_update_delayed_inode+0x93/0x270 btrfs_async_run_delayed_root+0x168/0x230 btrfs_work_helper+0xd4/0x570 process_one_work+0x2ad/0x5f0 worker_thread+0x3a/0x3d0 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 -> #1 (&delayed_node->mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_delayed_update_inode+0x50/0x440 btrfs_update_inode+0x8a/0xf0 btrfs_dirty_inode+0x5b/0xd0 touch_atime+0xa1/0xd0 btrfs_file_mmap+0x3f/0x60 mmap_region+0x3a4/0x640 do_mmap+0x376/0x580 vm_mmap_pgoff+0xd5/0x120 ksys_mmap_pgoff+0x193/0x230 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&mm->mmap_lock#2){++++}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 __might_fault+0x68/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-fs-00); lock(&delayed_node->mutex); lock(btrfs-fs-00); lock(&mm->mmap_lock#2); *** DEADLOCK *** 1 lock held by compsize/11122: #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 stack backtrace: CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 ? __might_fault+0x3e/0x90 ? find_held_lock+0x72/0x90 __might_fault+0x68/0x90 ? __might_fault+0x3e/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 ? btrfs_search_forward+0x2a6/0x360 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ? __do_sys_newfstat+0x5a/0x70 ? ksys_ioctl+0x83/0xc0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The problem is we're doing a copy_to_user() while holding tree locks, which can deadlock if we have to do a page fault for the copy_to_user(). This exists even without my locking changes, so it needs to be fixed. Rework the search ioctl to do the pre-fault and then copy_to_user_nofault for the copying. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 27 7月, 2020 8 次提交
-
-
由 David Sterba 提交于
User Forza reported on IRC that some invalid combinations of file attributes are accepted by chattr. The NODATACOW and compression file flags/attributes are mutually exclusive, but they could be set by 'chattr +c +C' on an empty file. The nodatacow will be in effect because it's checked first in btrfs_run_delalloc_range. Extend the flag validation to catch the following cases: - input flags are conflicting - old and new flags are conflicting - initialize the local variable with inode flags after inode ls locked Inode attributes take precedence over mount options and are an independent setting. Nocompress would be a no-op with nodatacow, but we don't want to mix any compression-related options with nodatacow. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl. This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in btrfs_ioctl_fs_info_args::flags. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
Add retrieval of the filesystem's generation to the fsinfo ioctl. This is driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in btrfs_ioctl_fs_info_args::flags. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
With the recent addition of filesystem checksum types other than CRC32c, it is not anymore hard-coded which checksum type a btrfs filesystem uses. Up to now there is no good way to read the filesystem checksum, apart from reading the filesystem UUID and then query sysfs for the checksum type. Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl command which usually is used to query filesystem features. Also add a flags member indicating that the kernel responded with a set csum_type and csum_size field. For compatibility reasons, only return the csum_type and csum_size if the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also clear any unknown flags so we don't pass false positives to user-space newer than the kernel. To simplify further additions to the ioctl, also switch the padding to a u8 array. Pahole was used to verify the result of this switch: The csum members are added before flags, which might look odd, but this is to keep the alignment requirements and not to introduce holes in the structure. $ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko struct btrfs_ioctl_fs_info_args { __u64 max_id; /* 0 8 */ __u64 num_devices; /* 8 8 */ __u8 fsid[16]; /* 16 16 */ __u32 nodesize; /* 32 4 */ __u32 sectorsize; /* 36 4 */ __u32 clone_alignment; /* 40 4 */ __u16 csum_type; /* 44 2 */ __u16 csum_size; /* 46 2 */ __u64 flags; /* 48 8 */ __u8 reserved[968]; /* 56 968 */ /* size: 1024, cachelines: 16, members: 10 */ }; Fixes: 3951e7f0 ("btrfs: add xxhash64 to checksumming algorithms") Fixes: 3831bf00 ("btrfs: add sha256 to checksumming algorithm") CC: stable@vger.kernel.org # 5.5+ Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] When the anonymous block device pool is exhausted, subvolume/snapshot creation fails with EMFILE (Too many files open). This has been reported by a user. The allocation happens in the second phase during transaction commit where it's only way out is to abort the transaction BTRFS: Transaction aborted (error -24) WARNING: CPU: 17 PID: 17041 at fs/btrfs/transaction.c:1576 create_pending_snapshot+0xbc4/0xd10 [btrfs] RIP: 0010:create_pending_snapshot+0xbc4/0xd10 [btrfs] Call Trace: create_pending_snapshots+0x82/0xa0 [btrfs] btrfs_commit_transaction+0x275/0x8c0 [btrfs] btrfs_mksubvol+0x4b9/0x500 [btrfs] btrfs_ioctl_snap_create_transid+0x174/0x180 [btrfs] btrfs_ioctl_snap_create_v2+0x11c/0x180 [btrfs] btrfs_ioctl+0x11a4/0x2da0 [btrfs] do_vfs_ioctl+0xa9/0x640 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 33f2f83f3d5250e9 ]--- BTRFS: error (device sda1) in create_pending_snapshot:1576: errno=-24 unknown BTRFS info (device sda1): forced readonly BTRFS warning (device sda1): Skipping commit of aborted transaction. BTRFS: error (device sda1) in cleanup_transaction:1831: errno=-24 unknown [CAUSE] When the global anonymous block device pool is exhausted, the following call chain will fail, and lead to transaction abort: btrfs_ioctl_snap_create_v2() |- btrfs_ioctl_snap_create_transid() |- btrfs_mksubvol() |- btrfs_commit_transaction() |- create_pending_snapshot() |- btrfs_get_fs_root() |- btrfs_init_fs_root() |- get_anon_bdev() [FIX] Although we can't enlarge the anonymous block device pool, at least we can preallocate anon_dev for subvolume/snapshot in the first phase, outside of transaction context and exactly at the moment the user calls the creation ioctl. Reported-by: NGreed Rong <greedrong@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
All of its children take btrfs_inode so bubble up this requirement to btrfs_delalloc_reserve_space's interface and stop calling BTRFS_I internally. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It needs btrfs_inode so take it as a parameter directly. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
It doesn't use the generic vfs inode for anything use btrfs_inode directly. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 17 6月, 2020 1 次提交
-
-
由 Waiman Long 提交于
In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc() was incorrectly paired with kzfree(). According to David Sterba, there isn't any sensitive information in the subvol_info that needs to be cleared before freeing. So kzfree() isn't really needed, use kfree() instead. Signed-off-by: NWaiman Long <longman@redhat.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 25 5月, 2020 2 次提交
-
-
由 David Sterba 提交于
The inode lookup starting at btrfs_iget takes the full location key, while only the objectid is used to match the inode, because the lookup happens inside the given root thus the inode number is unique. The entire location key is properly set up in btrfs_init_locked_inode. Simplify the helpers and pass only inode number, renaming it to 'ino' instead of 'objectid'. This allows to remove temporary variables key, saving some stack space. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The main function to lookup a root by its id btrfs_get_fs_root takes the whole key, while only using the objectid. The value of offset is preset to (u64)-1 but not actually used until btrfs_find_root that does the actual search. Switch btrfs_get_fs_root to use only objectid and remove all local variables that existed just for the lookup. The actual key for search is set up in btrfs_get_fs_root, reusing another key variable. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-