- 30 6月, 2021 5 次提交
-
-
由 Roman Gushchin 提交于
Currently there is no way to iterate over inodes attached to a specific cgwb structure. It limits the ability to efficiently reclaim the writeback structure itself and associated memory and block cgroup structures without scanning all inodes belonging to a sb, which can be prohibitively expensive. While dirty/in-active-writeback an inode belongs to one of the bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time. Once cleaned up, it's removed from all io lists. So the inode->i_io_list can be reused to maintain the list of inodes, attached to a bdi_writeback structure. This patch introduces a new wb->b_attached list, which contains all inodes which were dirty at least once and are attached to the given cgwb. Inodes attached to the root bdi_writeback structures are never placed on such list. The following patch will use this list to try to release cgwbs structures more efficiently. Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NJan Kara <jack@suse.cz> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NDennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Inode's wb switching requires two steps divided by an RCU grace period. It's currently implemented as an RCU callback inode_switch_wbs_rcu_fn(), which schedules inode_switch_wbs_work_fn() as a work. Switching to the rcu_work API allows to do the same in a cleaner and slightly shorter form. Link: https://lkml.kernel.org/r/20210608230225.2078447-5-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NDennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
isw_nr_in_flight is used to determine whether the inode switch queue should be flushed from the umount path. Currently it's increased after grabbing an inode and even scheduling the switch work. It means the umount path can walk past cleanup_offline_cgwb() with active inode references, which can result in a "Busy inodes after unmount." message and use-after-free issues (with inode->i_sb which gets freed). Fix it by incrementing isw_nr_in_flight before doing anything with the inode and decrementing in the case when switching wasn't scheduled. The problem hasn't yet been seen in the real life and was discovered by Jan Kara by looking into the code. Link: https://lkml.kernel.org/r/20210608230225.2078447-4-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NJan Kara <jack@suse.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
A full memory barrier is required between clearing SB_ACTIVE flag in generic_shutdown_super() and checking isw_nr_in_flight in cgroup_writeback_umount(), otherwise a new switch operation might be scheduled after atomic_read(&isw_nr_in_flight) returned 0. This would result in a non-flushed isw_wq, and a potential crash. The problem hasn't yet been seen in the real life and was discovered by Jan Kara by looking into the code. Link: https://lkml.kernel.org/r/20210608230225.2078447-3-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NJan Kara <jack@suse.cz> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Patch series "cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups", v9. When an inode is getting dirty for the first time it's associated with a wb structure (see __inode_attach_wb()). It can later be switched to another wb (if e.g. some other cgroup is writing a lot of data to the same inode), but otherwise stays attached to the original wb until being reclaimed. The problem is that the wb structure holds a reference to the original memory and blkcg cgroups. So if an inode has been dirty once and later is actively used in read-only mode, it has a good chance to pin down the original memory and blkcg cgroups forever. This is often the case with services bringing data for other services, e.g. updating some rpm packages. In the real life it becomes a problem due to a large size of the memcg structure, which can easily be 1000x larger than an inode. Also a really large number of dying cgroups can raise different scalability issues, e.g. making the memory reclaim costly and less effective. To solve the problem inodes should be eventually detached from the corresponding writeback structure. It's inefficient to do it after every writeback completion. Instead it can be done whenever the original memory cgroup is offlined and writeback structure is getting killed. Scanning over a (potentially long) list of inodes and detach them from the writeback structure can take quite some time. To avoid scanning all inodes, attached inodes are kept on a new list (b_attached). To make it less noticeable to a user, the scanning and switching is performed from a work context. Big thanks to Jan Kara, Dennis Zhou, Hillf Danton and Tejun Heo for their ideas and contribution to this patchset. This patch (of 8): If an inode's state has I_WILL_FREE flag set, the inode will be freed soon, so there is no point in trying to switch the inode to a different cgwb. I_WILL_FREE was ignored since the introduction of the inode switching, so it looks like it doesn't lead to any noticeable issues for a user. This is why the patch is not intended for a stable backport. Link: https://lkml.kernel.org/r/20210608230225.2078447-1-guro@fb.com Link: https://lkml.kernel.org/r/20210608230225.2078447-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NJan Kara <jack@suse.cz> Acked-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Acked-by: NDennis Zhou <dennis@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 1月, 2021 6 次提交
-
-
由 Eric Biggers 提交于
Some comments for writeback_single_inode() and __writeback_single_inode() are outdated or not very helpful, especially with regards to writeback list handling. Update them. Link: https://lore.kernel.org/r/20210112190253.64307-10-ebiggers@kernel.orgReviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Eric Biggers 提交于
wbc->for_sync implies wbc->sync_mode == WB_SYNC_ALL, so there's no need to check for both. Just check for WB_SYNC_ALL. Link: https://lore.kernel.org/r/20210112190253.64307-9-ebiggers@kernel.orgReviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Eric Biggers 提交于
Improve some comments, and don't bother checking for the I_DIRTY_TIME flag in the case where we just cleared it. Also, warn if I_DIRTY_TIME and I_DIRTY_PAGES are passed to __mark_inode_dirty() at the same time, as this case isn't handled. Link: https://lore.kernel.org/r/20210112190253.64307-8-ebiggers@kernel.orgReviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Eric Biggers 提交于
->dirty_inode is now only called when I_DIRTY_INODE (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) is set. However it may still be passed other dirty flags at the same time, provided that these other flags happened to be passed to __mark_inode_dirty() at the same time as I_DIRTY_INODE. This doesn't make sense because there is no reason for filesystems to care about these extra flags. Nor are filesystems notified about all updates to these other flags. Therefore, mask the flags before passing them to ->dirty_inode. Also properly document ->dirty_inode in vfs.rst. Link: https://lore.kernel.org/r/20210112190253.64307-7-ebiggers@kernel.orgReviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Eric Biggers 提交于
There is no need to call ->dirty_inode for lazytime timestamp updates (i.e. for __mark_inode_dirty(I_DIRTY_TIME)), since by the definition of lazytime, filesystems must ignore these updates. Filesystems only need to care about the updated timestamps when they expire. Therefore, only call ->dirty_inode when I_DIRTY_INODE is set. Based on a patch from Christoph Hellwig: https://lore.kernel.org/r/20200325122825.1086872-4-hch@lst.de Link: https://lore.kernel.org/r/20210112190253.64307-6-ebiggers@kernel.orgReviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Eric Biggers 提交于
When lazytime is enabled and an inode is being written due to its in-memory updated timestamps having expired, either due to a sync() or syncfs() system call or due to dirtytime_expire_interval having elapsed, the VFS needs to inform the filesystem so that the filesystem can copy the inode's timestamps out to the on-disk data structures. This is done by __writeback_single_inode() calling mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC). However, this occurs after __writeback_single_inode() has already cleared the dirty flags from ->i_state. This causes two bugs: - mark_inode_dirty_sync() redirties the inode, causing it to remain dirty. This wastefully causes the inode to be written twice. But more importantly, it breaks cases where sync_filesystem() is expected to clean dirty inodes. This includes the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (as reported at https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well as possibly filesystem freezing (freeze_super()). - Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is called from __writeback_single_inode() for lazytime expiration, xfs_fs_dirty_inode() ignores the notification. (XFS only cares about lazytime expirations, and it assumes that i_state will contain I_DIRTY_TIME during those.) Therefore, lazy timestamps aren't persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS. Fix this by moving the call to mark_inode_dirty_sync() to earlier in __writeback_single_inode(), before the dirty flags are cleared from i_state. This makes filesystems be properly notified of the timestamp expiration, and it avoids incorrectly redirtying the inode. This fixes xfstest generic/580 (which tests FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime enabled. It also fixes the new lazytime xfstest I've proposed, which reproduces the above-mentioned XFS bug (https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org). Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly. But due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the right thing to do because mark_inode_dirty_sync() now knows not to move the inode to a writeback list if it is currently queued for sync. Fixes: 0ae45f63 ("vfs: add support for a lazytime mount option") Cc: stable@vger.kernel.org Depends-on: 5afced3b ("writeback: Avoid skipping inode writeback") Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.orgSuggested-by: NJan Kara <jack@suse.cz> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
- 16 12月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
BDIs get unregistered during device removal, and this WARN can be trivially triggered by hot-removing a NVMe device while running fsx It is otherwise harmless as we still hold a BDI reference, and the writeback has been shut down already. Link: https://lore.kernel.org/r/20200928122613.434820-1-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz>
-
- 25 9月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
Replace the two negative flags that are always used together with a single positive flag that indicates the writeback capability instead of two related non-capabilities. Also remove the pointless wrappers to just check the flag. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 20 9月, 2020 1 次提交
-
-
由 Tobias Klauser 提交于
Commit 32927393 ("sysctl: pass kernel pointers to ->proc_handler") changed ctl_table.proc_handler to take a kernel pointer. Adjust the definition of dirtytime_interval_handler to match its prototype in linux/writeback.h which fixes the following sparse error/warning: fs/fs-writeback.c:2189:50: warning: incorrect type in argument 3 (different address spaces) fs/fs-writeback.c:2189:50: expected void * fs/fs-writeback.c:2189:50: got void [noderef] __user *buffer fs/fs-writeback.c:2184:5: error: symbol 'dirtytime_interval_handler' redeclared with different type (incompatible argument 3 (different address spaces)): fs/fs-writeback.c:2184:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... ) fs/fs-writeback.c: note: in included file: ./include/linux/writeback.h:374:5: note: previously declared as: ./include/linux/writeback.h:374:5: int extern [addressable] [signed] [toplevel] dirtytime_interval_handler( ... ) Fixes: 32927393 ("sysctl: pass kernel pointers to ->proc_handler") Signed-off-by: NTobias Klauser <tklauser@distanz.ch> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: https://lkml.kernel.org/r/20200907093140.13434-1-tklauser@distanz.chSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 6月, 2020 4 次提交
-
-
由 Jan Kara 提交于
The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that inode got there because flush worker decided it's time to writeback the dirty inode time stamps (either because we are syncing or because of age). However we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with I_DIRTY_TIME_EXPIRE flag. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
When we are processing writeback for sync(2), move_expired_inodes() didn't set any inode expiry value (older_than_this). This can result in writeback never completing if there's steady stream of inodes added to b_dirty_time list as writeback rechecks dirty lists after each writeback round whether there's more work to be done. Fix the problem by using sync(2) start time is inode expiry value when processing b_dirty_time list similarly as for ordinarily dirtied inodes. This requires some refactoring of older_than_this handling which simplifies the code noticeably as a bonus. Fixes: 0ae45f63 ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Inode's i_io_list list head is used to attach inode to several different lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker prepares a list of inodes to writeback e.g. for sync(2), it moves inodes to b_io list. Thus it is critical for sync(2) data integrity guarantees that inode is not requeued to any other writeback list when inode is queued for processing by flush worker. That's the reason why writeback_single_inode() does not touch i_io_list (unless the inode is completely clean) and why __mark_inode_dirty() does not touch i_io_list if I_SYNC flag is set. However there are two flaws in the current logic: 1) When inode has only I_DIRTY_TIME set but it is already queued in b_io list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC) can still move inode back to b_dirty list resulting in skipping writeback of inode time stamps during sync(2). 2) When inode is on b_dirty_time list and writeback_single_inode() races with __mark_inode_dirty() like: writeback_single_inode() __mark_inode_dirty(inode, I_DIRTY_PAGES) inode->i_state |= I_SYNC __writeback_single_inode() inode->i_state |= I_DIRTY_PAGES; if (inode->i_state & I_SYNC) bail if (!(inode->i_state & I_DIRTY_ALL)) - not true so nothing done We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus standard background writeback will not writeback this inode leading to possible dirty throttling stalls etc. (thanks to Martijn Coenen for this analysis). Fix these problems by tracking whether inode is queued in b_io or b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we know flush worker has queued inode and we should not touch i_io_list. On the other hand we also know that once flush worker is done with the inode it will requeue the inode to appropriate dirty list. When I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode to appropriate dirty list. Reported-by: NMartijn Coenen <maco@android.com> Reviewed-by: NMartijn Coenen <maco@android.com> Tested-by: NMartijn Coenen <maco@android.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Fixes: 0ae45f63 ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Signed-off-by: NJan Kara <jack@suse.cz>
-
由 Jan Kara 提交于
Currently, operations on inode->i_io_list are protected by wb->list_lock. In the following patches we'll need to maintain consistency between inode->i_state and inode->i_io_list so change the code so that inode->i_lock protects also all inode's i_io_list handling. Reviewed-by: NMartijn Coenen <maco@android.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback" Signed-off-by: NJan Kara <jack@suse.cz>
-
- 04 6月, 2020 1 次提交
-
-
由 Jan Kara 提交于
Ext4 needs to remove inode from writeback lists after it is out of visibility of its journalling machinery (which can still dirty the inode). Export inode_io_list_del() for it. Signed-off-by: NJan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.czSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 03 6月, 2020 1 次提交
-
-
由 NeilBrown 提交于
After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9 ("NFS: Ensure we commit after writeback is complete")) it makes more sense to treat them as writeback pages. So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (as there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this patch, there are fewer writes which are each larger on average. Where the NR_UNSTABLE_NFS count was included in statistics virtual-files, the entry is retained, but the value is hard-coded as zero. static trace points and warning printks which mentioned this counter no longer report it. [akpm@linux-foundation.org: re-layout comment] [akpm@linux-foundation.org: fix printk warning] Signed-off-by: NNeilBrown <neilb@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NTrond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Michal Hocko <mhocko@suse.com> [mm] Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.nameSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 5月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
The name is only printed for a not registered bdi in writeback. Use the device name there as is more useful anyway for the unlike case that the warning triggers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NBart Van Assche <bvanassche@acm.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 2月, 2020 1 次提交
-
-
由 Theodore Ts'o 提交于
Without memcg, there is a one-to-one mapping between the bdi and bdi_writeback structures. In this world, things are fairly straightforward; the first thing bdi_unregister() does is to shutdown the bdi_writeback structure (or wb), and part of that writeback ensures that no other work queued against the wb, and that the wb is fully drained. With memcg, however, there is a one-to-many relationship between the bdi and bdi_writeback structures; that is, there are multiple wb objects which can all point to a single bdi. There is a refcount which prevents the bdi object from being released (and hence, unregistered). So in theory, the bdi_unregister() *should* only get called once its refcount goes to zero (bdi_put will drop the refcount, and when it is zero, release_bdi gets called, which calls bdi_unregister). Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about the Brave New memcg World, and calls bdi_unregister directly. It does this without informing the file system, or the memcg code, or anything else. This causes the root wb associated with the bdi to be unregistered, but none of the memcg-specific wb's are shutdown. So when one of these wb's are woken up to do delayed work, they try to dereference their wb->bdi->dev to fetch the device name, but unfortunately bdi->dev is now NULL, thanks to the bdi_unregister() called by del_gendisk(). As a result, *boom*. Fortunately, it looks like the rest of the writeback path is perfectly happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to create a bdi_dev_name() function which can handle bdi->dev being NULL. This also allows us to bulletproof the writeback tracepoints to prevent them from dereferencing a NULL pointer and crashing the kernel if one is tracing with memcg's enabled, and an iSCSI device dies or a USB storage stick is pulled. The most common way of triggering this will be hotremoval of a device while writeback with memcg enabled is going on. It was triggering several times a day in a heavily loaded production environment. Google Bug Id: 145475544 Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.eduSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 11月, 2019 1 次提交
-
-
由 Tejun Heo 提交于
cgroup writeback tries to refresh the associated wb immediately if the current wb is dead. This is to avoid keeping issuing IOs on the stale wb after memcg - blkcg association has changed (ie. when blkcg got disabled / enabled higher up in the hierarchy). Unfortunately, the logic gets triggered spuriously on inodes which are associated with dead cgroups. When the logic is triggered on dead cgroups, the attempt fails only after doing quite a bit of work allocating and initializing a new wb. While c3aab9a0 ("mm/filemap.c: don't initiate writeback if mapping has no dirty pages") alleviated the issue significantly as it now only triggers when the inode has dirty pages. However, the condition can still be triggered before the inode is switched to a different cgroup and the logic simply doesn't make sense. Skip the immediate switching if the associated memcg is dying. This is a simplified version of the following two patches: * https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/ * http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: e8a7abf5 ("writeback: disassociate inodes from dying bdi_writebacks") Acked-by: NDennis Zhou <dennis@kernel.org> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 15 10月, 2019 1 次提交
-
-
由 Randy Dunlap 提交于
Fix kernel-doc warning in fs/fs-writeback.c: fs/fs-writeback.c:913: warning: Excess function parameter 'nr_pages' description in 'cgroup_writeback_by_id' Link: http://lkml.kernel.org/r/756645ac-0ce8-d47e-d30a-04d9e4923a4f@infradead.org Fixes: d62241c7 ("writeback, memcg: Implement cgroup_writeback_by_id()") Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 10月, 2019 1 次提交
-
-
由 Tejun Heo 提交于
finish_writeback_work() reads @done->waitq after decrementing @done->cnt. However, once @done->cnt reaches zero, @done may be freed (from stack) at any moment and @done->waitq can contain something unrelated by the time finish_writeback_work() tries to read it. This led to the following crash. "BUG: kernel NULL pointer dereference, address: 0000000000000002" #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP DEBUG_PAGEALLOC CPU: 40 PID: 555153 Comm: kworker/u98:50 Kdump: loaded Not tainted ... Workqueue: writeback wb_workfn (flush-btrfs-1) RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30 Code: 48 89 d8 5b c3 e8 50 db 6b ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 fe ca 6b ff eb f2 66 90 RSP: 0018:ffffc90049b27d98 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000002 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 R10: ffff889fff407600 R11: ffff88ba9395d740 R12: 000000000000e300 R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88bfdfa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000002 CR3: 0000000002409005 CR4: 00000000001606e0 Call Trace: __wake_up_common_lock+0x63/0xc0 wb_workfn+0xd2/0x3e0 process_one_work+0x1f5/0x3f0 worker_thread+0x2d/0x3d0 kthread+0x111/0x130 ret_from_fork+0x1f/0x30 Fix it by reading and caching @done->waitq before decrementing @done->cnt. Link: http://lkml.kernel.org/r/20190924010631.GH2233839@devbig004.ftw2.facebook.com Fixes: 5b9cce4c ("writeback: Generalize and expose wb_completion") Signed-off-by: NTejun Heo <tj@kernel.org> Debugged-by: NChris Mason <clm@fb.com> Reviewed-by: NJens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> [5.2+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 8月, 2019 1 次提交
-
-
由 Tejun Heo 提交于
cgroup foreign inode handling has quite a bit of heuristics and internal states which sometimes makes it difficult to understand what's going on. Add tracepoints to improve visibility. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 8月, 2019 2 次提交
-
-
由 Tejun Heo 提交于
Implement cgroup_writeback_by_id() which initiates cgroup writeback from bdi and memcg IDs. This will be used by memcg foreign inode flushing. v2: Use wb_get_lookup() instead of wb_get_create() to avoid creating spurious wbs. v3: Interpret 0 @nr as 1.25 * nr_dirty to implement best-effort flushing while avoding possible livelocks. Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
wb_completion is used to track writeback completions. We want to use it from memcg side for foreign inode flushes. This patch updates it to remember the target waitq instead of assuming bdi->wb_waitq and expose it outside of fs-writeback.c. Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 16 8月, 2019 2 次提交
-
-
由 Tejun Heo 提交于
As inode wb switching may make sync(2) miss some inodes, they're synchronized using wb_switch_rwsem so that no wb switching happens while sync(2) is in progress. In addition to synchronizing the actual switching, the rwsem is also used to prevent queueing new switch attempts while sync(2) is in progress. This is to avoid queueing too many instances while the rwsem is held by sync(2). Unfortunately, this is too agressive and can block wb switching for a long time if sync(2) is frequent. The goal is avoiding expolding the number of scheduled switches, not avoiding scheduling anything. Let's use wb_switch_rwsem only for synchronizing the actual switching and sync(2) and use isw_nr_in_flight instead for limiting the maximum number of scheduled switches. The limit is set to 1024 which should be more than enough while still avoiding extreme situations. Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic to ignore short writeback rounds to prevent getting confused by a burst of short writebacks. The parameter is currently 2 meaning that anything smaller than half of the running average writback duration will be ignored. This is unnecessarily aggressive. The detection logic uses 16 history slots and is already reasonably protected against some short bursts confusing it and the current parameter can lead to tens of seconds of missed detection depending on the writeback pattern. Let's change the parameter to 8, so that it only ignores writeback with are smaller than 12.5% of the current running average. v2: Add comment explaining what's going on with the foreign detection parameters. Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 10 7月, 2019 3 次提交
-
-
由 Tejun Heo 提交于
When writeback IOs are bounced through async layers, the IOs should only be accounted against the wbc from the original bdi writeback to avoid confusing cgroup inode ownership arbitration. Add wbc->no_cgroup_owner to allow disabling wbc cgroup owner accounting. This will be used make btrfs compression work well with cgroup IO control. v2: Renamed from no_wbc_acct to no_cgroup_owner and added comment as per Jan. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
wbc_account_io() does a very specific job - try to see which cgroup is actually dirtying an inode and transfer its ownership to the majority dirtier if needed. The name is too generic and confusing. Let's rename it to something more specific. Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tejun Heo 提交于
btrfs is going to use css_put() and wbc helpers to improve cgroup writeback support. Add dummy css_get() definition and export wbc helpers to prepare for module and !CONFIG_CGROUP builds. Reported-by: Nkbuild test robot <lkp@intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 16 6月, 2019 1 次提交
-
-
由 Tejun Heo 提交于
wbc_account_io() collects information on cgroup ownership of writeback pages to determine which cgroup should own the inode. Pages can stay associated with dead memcgs but we want to avoid attributing IOs to dead blkcgs as much as possible as the association is likely to be stale. However, currently, pages associated with dead memcgs contribute to the accounting delaying and/or confusing the arbitration. Fix it by ignoring pages associated with dead memcgs. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 5月, 2019 1 次提交
-
-
由 Thomas Gleixner 提交于
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 19 5月, 2019 1 次提交
-
-
由 Jiufei Xue 提交于
synchronize_rcu() didn't wait for call_rcu() callbacks, so inode wb switch may not go to the workqueue after synchronize_rcu(). Thus previous scheduled switches was not finished even flushing the workqueue, which will cause a NULL pointer dereferenced followed below. VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day... BUG: unable to handle kernel NULL pointer dereference at 0000000000000278 evict+0xb3/0x180 iput+0x1b0/0x230 inode_switch_wbs_work_fn+0x3c0/0x6a0 worker_thread+0x4e/0x490 ? process_one_work+0x410/0x410 kthread+0xe6/0x100 ret_from_fork+0x39/0x50 Replace the synchronize_rcu() call with a rcu_barrier() to wait for all pending callbacks to finish. And inc isw_nr_in_flight after call_rcu() in inode_switch_wbs() to make more sense. Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.comSigned-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Acked-by: NTejun Heo <tj@kernel.org> Suggested-by: NTejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 1月, 2019 1 次提交
-
-
由 Tejun Heo 提交于
sync_inodes_sb() can race against cgwb (cgroup writeback) membership switches and fail to writeback some inodes. For example, if an inode switches to another wb while sync_inodes_sb() is in progress, the new wb might not be visible to bdi_split_work_to_wbs() at all or the inode might jump from a wb which hasn't issued writebacks yet to one which already has. This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb switch path against sync_inodes_sb() so that sync_inodes_sb() is guaranteed to see all the target wbs and inodes can't jump wbs to escape syncing. v2: Fixed misplaced rwsem init. Spotted by Jiufei. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NJiufei Xue <xuejiufei@gmail.com> Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.comAcked-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 10月, 2018 1 次提交
-
-
由 Matthew Wilcox 提交于
A couple of short loops. Signed-off-by: NMatthew Wilcox <willy@infradead.org>
-
- 04 5月, 2018 1 次提交
-
-
由 Jan Kara 提交于
Syzbot has reported that it can hit a NULL pointer dereference in wb_workfn() due to wb->bdi->dev being NULL. This indicates that wb_workfn() was called for an already unregistered bdi which should not happen as wb_shutdown() called from bdi_unregister() should make sure all pending writeback works are completed before bdi is unregistered. Except that wb_workfn() itself can requeue the work with: mod_delayed_work(bdi_wq, &wb->dwork, 0); and if this happens while wb_shutdown() is waiting in: flush_delayed_work(&wb->dwork); the dwork can get executed after wb_shutdown() has finished and bdi_unregister() has cleared wb->bdi->dev. Make wb_workfn() use wakeup_wb() for requeueing the work which takes all the necessary precautions against racing with bdi unregistration. CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> CC: Tejun Heo <tj@kernel.org> Fixes: 839a8e86Reported-by: Nsyzbot <syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 4月, 2018 1 次提交
-
-
由 Greg Thelen 提交于
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if the page's memcg is undergoing move accounting, which occurs when a process leaves its memcg for a new one that has memory.move_charge_at_immigrate set. unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the given inode is switching writeback domains. Switches occur when enough writes are issued from a new domain. This existing pattern is thus suspicious: lock_page_memcg(page); unlocked_inode_to_wb_begin(inode, &locked); ... unlocked_inode_to_wb_end(inode, locked); unlock_page_memcg(page); If both inode switch and process memcg migration are both in-flight then unlocked_inode_to_wb_end() will unconditionally enable interrupts while still holding the lock_page_memcg() irq spinlock. This suggests the possibility of deadlock if an interrupt occurs before unlock_page_memcg(). truncate __cancel_dirty_page lock_page_memcg unlocked_inode_to_wb_begin unlocked_inode_to_wb_end <interrupts mistakenly enabled> <interrupt> end_page_writeback test_clear_page_writeback lock_page_memcg <deadlock> unlock_page_memcg Due to configuration limitations this deadlock is not currently possible because we don't mix cgroup writeback (a cgroupv2 feature) and memory.move_charge_at_immigrate (a cgroupv1 feature). If the kernel is hacked to always claim inode switching and memcg moving_account, then this script triggers lockup in less than a minute: cd /mnt/cgroup/memory mkdir a b echo 1 > a/memory.move_charge_at_immigrate echo 1 > b/memory.move_charge_at_immigrate ( echo $BASHPID > a/cgroup.procs while true; do dd if=/dev/zero of=/mnt/big bs=1M count=256 done ) & while true; do sync done & sleep 1h & SLEEP=$! while true; do echo $SLEEP > a/cgroup.procs echo $SLEEP > b/cgroup.procs done The deadlock does not seem possible, so it's debatable if there's any reason to modify the kernel. I suggest we should to prevent future surprises. And Wang Long said "this deadlock occurs three times in our environment", so there's more reason to apply this, even to stable. Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch see "[PATCH for-4.4] writeback: safer lock nesting" https://lkml.org/lkml/2018/4/11/146 Wang Long said "this deadlock occurs three times in our environment" [gthelen@google.com: v4] Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com [akpm@linux-foundation.org: comment tweaks, struct initialization simplification] Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613 Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Signed-off-by: NGreg Thelen <gthelen@google.com> Reported-by: NWang Long <wanglong19@meituan.com> Acked-by: NWang Long <wanglong19@meituan.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> [v4.2+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-