- 02 6月, 2015 4 次提交
-
-
由 Tejun Heo 提交于
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->wb_lock and ->worklist into wb. * The lock protects bdi->worklist and bdi->wb.dwork scheduling. While moving, rename it to wb->work_lock as wb->wb_lock is confusing. Also, move wb->dwork downwards so that it's colocated with the new ->work_lock and ->work_list fields. * bdi_writeback_workfn() -> wb_workfn() bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb) bdi_wakeup_thread(bdi) -> wb_wakeup(wb) bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...) __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...) get_next_work_item(bdi) -> get_next_work_item(wb) * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb. The function contained parts which belong to the containing bdi rather than the wb itself - testing cap_writeback_dirty and bdi_remove_from_list() invocation. Those are moved to bdi_unregister(). * bdi_wb_{init|exit}() are renamed to wb_{init|exit}(). Initializations of the moved bdi->wb_lock and ->work_list are relocated from bdi_init() to wb_init(). * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bandwidth related fields from backing_dev_info into bdi_writeback. * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp, write_bandwidth, avg_write_bandwidth, dirty_ratelimit, balanced_dirty_ratelimit, completions and dirty_exceeded. * writeback_chunk_size() and over_bground_thresh() now take @wb instead of @bdi. * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...) bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...) bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...) bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...) [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...) bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...) bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...) * Init/exits of the relocated fields are moved to bdi_wb_init/exit() respectively. Note that explicit zeroing is dropped in the process as wb's are cleared in entirety anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes. v2: Typo in description fixed as suggested by Jan. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->bdi_stat[] into wb. * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all enums is changed from BDI_ to WB_. * BDI_STAT_BATCH() -> WB_STAT_BATCH() * [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...) * bdi_stat[_error]() -> wb_stat[_error]() * bdi_writeout_inc() -> wb_writeout_inc() * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and frees stat. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->state into wb. * enum bdi_state is renamed to wb_state and the prefix of all enums is changed from BDI_ to WB_. * Explicit zeroing of bdi->state is removed without adding zeoring of wb->state as the whole data structure is zeroed on init anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: drbd-dev@lists.linbit.com Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 05 2月, 2015 1 次提交
-
-
由 Theodore Ts'o 提交于
Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 21 1月, 2015 4 次提交
-
-
由 Christoph Hellwig 提交于
Now that default_backing_dev_info is not used for writeback purposes we can git rid of it easily: - instead of using it's name for tracing unregistered bdi we just use "unknown" - btrfs and ceph can just assign the default read ahead window themselves like several other filesystems already do. - we can assign noop_backing_dev_info as the default one in alloc_super. All filesystems already either assigned their own or noop_backing_dev_info. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
If we have dirty inodes we need to call the filesystem for it, even if the device has been removed and the filesystem will error out early. The current code does that by reassining all dirty inodes to the default backing_dev_info when a bdi is unlinked, but that's pretty pointless given that the bdi must always outlive the super block. Instead of stopping writeback at unregister time and moving inodes to the default bdi just keep the current bdi alive until it is destroyed. The containing objects of the bdi ensure this doesn't happen until all writeback has finished by erroring out. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Killed the redundant WARN_ON(), as noticed by Jan. Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: NChristoph Hellwig <hch@lst.de> Acked-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reviewed-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Since "BDI: Provide backing device capability information [try #3]" the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NTejun Heo <tj@kernel.org> Acked-by: NBrian Norris <computersforpeace@gmail.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 10 10月, 2014 1 次提交
-
-
由 Johannes Weiner 提交于
Page reclaim tests zone_is_reclaim_dirty(), but the site that actually sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the reader through layers indirection just to track down a simple bit. Remove all zone flag wrappers and just use bitops against zone->flags directly. It's just as readable and the lines are barely any longer. Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and remove the zone_flags_t typedef. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 9月, 2014 4 次提交
-
-
由 Tejun Heo 提交于
A block_device may be attached to different gendisks and thus different bdis over time. bdev_inode_switch_bdi() is used to switch the associated bdi. The function assumes that the inode could be dirty and transfers it between bdis if so. This is a bit nasty in that it reaches into bdi internals. This patch reimplements the function so that it writes out the inode if dirty. This is a lot simpler and can be implemented without exposing bdi internals. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
bdi_destroy() has code to transfer the remaining dirty inodes to the default_backing_dev_info; however, given the shutdown sequence, it isn't clear how such condition would happen. Also, it isn't a full solution as the transferred inodes stlil point to the bdi which is being destroyed. Operations on those inodes can end up accessing already released fields such as the percpu stat fields. Digging through the history, it seems that the code was added as a quick workaround for a bug report without fully root-causing the issue. We probably want to remove the code in time but for now let's add a comment noting that it is a quick workaround. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
Canceling of bdi->wb.dwork is currently a bit mushy. bdi_wb_shutdown() performs cancel_delayed_work_sync() at the end after shutting down and flushing the delayed_work and bdi_destroy() tries yet again after bdi_unregister(). bdi->wb.dwork is queued only after checking BDI_registered while holding bdi->wb_lock and bdi_wb_shutdown() clears the flag while holding the same lock and then flushes the delayed_work. There's no way the delayed_work can be queued again after that. Replace the two unnecessary cancel_delayed_work_sync() invocations with WARNs on pending. This simplifies and clarifies the code a bit and will help future changes in further isolating bdi_writeback handling. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
The only places where NULL test on bdi->dev is used are bdi_[un]register(). The functions can't be called in parallel anyway and there's no point in protecting bdi->dev clearing with a lock. Remove bdi->wb_lock grabbing around bdi->dev clearing and move it after device_unregister() call so that bdi->dev doesn't have to be cached in a local variable. This patch shouldn't introduce any behavior difference. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 08 9月, 2014 2 次提交
-
-
由 Tejun Heo 提交于
Percpu allocator now supports allocation mask. Add @gfp to [flex_]proportions init functions so that !GFP_KERNEL allocation masks can be used with them too. This patch doesn't make any functional difference. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org>
-
由 Tejun Heo 提交于
Percpu allocator now supports allocation mask. Add @gfp to percpu_counter_init() so that !GFP_KERNEL allocation masks can be used with percpu_counters too. We could have left percpu_counter_init() alone and added percpu_counter_init_gfp(); however, the number of users isn't that high and introducing _gfp variants to all percpu data structures would be quite ugly, so let's just do the conversion. This is the one with the most users. Other percpu data structures are a lot easier to convert. This patch doesn't make any functional difference. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NJan Kara <jack@suse.cz> Acked-by: N"David S. Miller" <davem@davemloft.net> Cc: x86@kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org>
-
- 18 4月, 2014 1 次提交
-
-
由 Peter Zijlstra 提交于
Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 4月, 2014 2 次提交
-
-
由 Jan Kara 提交于
After commit 839a8e86 ("writeback: replace custom worker pool implementation with unbound workqueue") when device is removed while we are writing to it we crash in bdi_writeback_workfn() -> set_worker_desc() because bdi->dev is NULL. This can happen because even though bdi_unregister() cancels all pending flushing work, nothing really prevents new ones from being queued from balance_dirty_pages() or other places. Fix the problem by clearing BDI_registered bit in bdi_unregister() and checking it before scheduling of any flushing work. Fixes: 839a8e86Reviewed-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJan Kara <jack@suse.cz> Cc: Derek Basehore <dbasehore@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Derek Basehore 提交于
bdi_wakeup_thread_delayed() used the mod_delayed_work() function to schedule work to writeback dirty inodes. The problem with this is that it can delay work that is scheduled for immediate execution, such as the work from sync_inodes_sb(). This can happen since mod_delayed_work() can now steal work from a work_queue. This fixes the problem by using queue_delayed_work() instead. This is a regression caused by commit 839a8e86 ("writeback: replace custom worker pool implementation with unbound workqueue"). The reason that this causes a problem is that laptop-mode will change the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default. In the case that bdi_wakeup_thread_delayed() races with sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung task. Even if dirty_writeback_centisecs is not long enough to cause a hung task, we still don't want to delay sync for that long. We fix the problem by using queue_delayed_work() when we want to schedule writeback sometime in future. This function doesn't change the timer if it is already armed. For the same reason, we also change bdi_writeback_workfn() to immediately queue the work again in the case that the work_list is not empty. The same problem can happen if the sync work is run on the rescue worker. [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()] Signed-off-by: NDerek Basehore <dbasehore@chromium.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zento.linux.org.uk> Reviewed-by: NTejun Heo <tj@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Derek Basehore <dbasehore@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Benson Leung <bleung@chromium.org> Cc: Sonny Rao <sonnyrao@chromium.org> Cc: Luigi Semenzato <semenzato@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dave Chinner <david@fromorbit.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 9月, 2013 1 次提交
-
-
由 Chen Gang 提交于
'*lenp' may be less than "sizeof(kbuf)" so we must check this before the next copy_to_user(). pdflush_proc_obsolete() is called by sysctl which 'procname' is "nr_pdflush_threads", if the user passes buffer length less than "sizeof(kbuf)", it will cause issue. Signed-off-by: NChen Gang <gang.chen@asianux.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 8月, 2013 1 次提交
-
-
由 Greg Kroah-Hartman 提交于
The dev_attrs field of struct class is going away soon, dev_groups should be used instead. This converts the backing device class code to use the correct field. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 17 7月, 2013 1 次提交
-
-
由 Greg Kroah-Hartman 提交于
A number of parts of the kernel created their own version of this, might as well have the sysfs core provide it instead. Reviewed-by: NGuenter Roeck <linux@roeck-us.net> Tested-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 04 7月, 2013 1 次提交
-
-
由 Kees Cook 提交于
Calling dev_set_name with a single paramter causes it to be handled as a format string. Many callers are passing potentially dynamic string content, so use "%s" in those cases to avoid any potential accidents, including wrappers like device_create*() and bdi_register(). Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 4月, 2013 3 次提交
-
-
由 Tejun Heo 提交于
There are cases where userland wants to tweak the priority and affinity of writeback flushers. Expose bdi_wq to userland by setting WQ_SYSFS. It appears under /sys/bus/workqueue/devices/writeback/ and allows adjusting maximum concurrency level, cpumask and nice level. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Tejun Heo 提交于
Writeback implements its own worker pool - each bdi can be associated with a worker thread which is created and destroyed dynamically. The worker thread for the default bdi is always present and serves as the "forker" thread which forks off worker threads for other bdis. there's no reason for writeback to implement its own worker pool when using unbound workqueue instead is much simpler and more efficient. This patch replaces custom worker pool implementation in writeback with an unbound workqueue. The conversion isn't too complicated but the followings are worth mentioning. * bdi_writeback->last_active, task and wakeup_timer are removed. delayed_work ->dwork is added instead. Explicit timer handling is no longer necessary. Everything works by either queueing / modding / flushing / canceling the delayed_work item. * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off bdi_writeback->dwork. On each execution, it processes bdi->work_list and reschedules itself if there are more things to do. The function also handles low-mem condition, which used to be handled by the forker thread. If the function is running off a rescuer thread, it only writes out limited number of pages so that the rescuer can serve other bdis too. This preserves the flusher creation failure behavior of the forker thread. * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell bdi_writeback_workfn() about on-going bdi unregistration so that it always drains work_list even if it's running off the rescuer. Note that the original code was broken in this regard. Under memory pressure, a bdi could finish unregistration with non-empty work_list. * The default bdi is no longer special. It now is treated the same as any other bdi and bdi_cap_flush_forker() is removed. * BDI_pending is no longer used. Removed. * Some tracepoints become non-applicable. The following TPs are removed - writeback_nothread, writeback_wake_thread, writeback_wake_forker_thread, writeback_thread_start, writeback_thread_stop. Everything, including devices coming and going away and rescuer operation under simulated memory pressure, seems to work fine in my test setup. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com>
-
由 Tejun Heo 提交于
There's no user left. Remove it. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com>
-
- 22 2月, 2013 1 次提交
-
-
由 Darrick J. Wong 提交于
This patchset ("stable page writes, part 2") makes some key modifications to the original 'stable page writes' patchset. First, it provides creators (devices and filesystems) of a backing_dev_info a flag that declares whether or not it is necessary to ensure that page contents cannot change during writeout. It is no longer assumed that this is true of all devices (which was never true anyway). Second, the flag is used to relaxed the wait_on_page_writeback calls so that wait only occurs if the device needs it. Third, it fixes up the remaining disk-backed filesystems to use this improved conditional-wait logic to provide stable page writes on those filesystems. It is hoped that (for people not using checksumming devices, anyway) this patchset will give back unnecessary performance decreases since the original stable page write patchset went into 3.0. Sorry about not fixing it sooner. Complaints were registered by several people about the long write latencies introduced by the original stable page write patchset. Generally speaking, the kernel ought to allocate as little extra memory as possible to facilitate writeout, but for people who simply cannot wait, a second page stability strategy is (re)introduced: snapshotting page contents. The waiting behavior is still the default strategy; to enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be set. This flag is used to bandaid^Henable stable page writeback on ext3[1], and is not used anywhere else. Given that there are already a few storage devices and network FSes that have rolled their own page stability wait/page snapshot code, it would be nice to move towards consolidating all of these. It seems possible that iscsi and raid5 may wish to use the new stable page write support to enable zero-copy writeout. Thank you to Jan Kara for helping fix a couple more filesystems. Per Andrew Morton's request, here are the result of using dbench to measure latencies on ext2: 3.8.0-rc3: Operation Count AvgLat MaxLat ---------------------------------------- WriteX 109347 0.028 59.817 ReadX 347180 0.004 3.391 Flush 15514 29.828 287.283 Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms 3.8.0-rc3 + patches: WriteX 105556 0.029 4.273 ReadX 335004 0.005 4.112 Flush 14982 30.540 298.634 Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms As you can see, for ext2 the maximum write latency decreases from ~60ms on a laptop hard disk to ~4ms. I'm not sure why the flush latencies increase, though I suspect that being able to dirty pages faster gives the flusher more work to do. On ext4, the average write latency decreases as well as all the maximum latencies: 3.8.0-rc3: WriteX 85624 0.152 33.078 ReadX 272090 0.010 61.210 Flush 12129 36.219 168.260 Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms 3.8.0-rc3 + patches: WriteX 86082 0.141 30.928 ReadX 273358 0.010 36.124 Flush 12214 34.800 165.689 Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms XFS seems to exhibit similar latency improvements as ext2: 3.8.0-rc3: WriteX 125739 0.028 104.343 ReadX 399070 0.005 4.115 Flush 17851 25.004 131.390 Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms 3.8.0-rc3 + patches: WriteX 123529 0.028 6.299 ReadX 392434 0.005 4.287 Flush 17549 25.120 188.687 Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms ...and btrfs, just to round things out, also shows some latency decreases: 3.8.0-rc3: WriteX 67122 0.083 82.355 ReadX 212719 0.005 2.828 Flush 9547 47.561 147.418 Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms 3.8.0-rc3 + patches: WriteX 64898 0.101 71.631 ReadX 206673 0.005 7.123 Flush 9190 47.963 219.034 Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too. After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own wait code, or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes. This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same results as -rc3. [1] The alternative fixes to ext3 include fixing the locking order and page bit handling like we did for ext4 (but then why not just use ext4?), or setting PG_writeback so early that ext3 becomes extremely slow. I tried that, but the number of write()s I could initiate dropped by nearly an order of magnitude. That was a bit much even for the author of the stable page series! :) This patch: Creates a per-backing-device flag that tracks whether or not pages must be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nothing requires stable pages. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 12月, 2012 1 次提交
-
-
由 Linus Torvalds 提交于
This reverts commit 8fa72d23. People disagree about how this should be done, so let's revert this for now so that nobody starts using the new tuning interface. Tejun is thinking about a more generic interface for thread pool affinity. Requested-by: NTejun Heo <tj@kernel.org> Acked-by: NJeff Moyer <jmoyer@redhat.com> Acked-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 12月, 2012 1 次提交
-
-
由 Jeff Moyer 提交于
In realtime environments, it may be desirable to keep the per-bdi flusher threads from running on certain cpus. This patch adds a cpu_list file to /sys/class/bdi/* to enable this. The default is to tie the flusher threads to the same numa node as the backing device (though I could be convinced to make it a mask of all cpus to avoid a change in behaviour). Thanks to Jeremy Eder for the original idea. Signed-off-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 25 8月, 2012 1 次提交
-
-
由 Namjae Jeon 提交于
Fix checkpatch warnings: WARNING: consider using kstrto* in preference to simple_strtoul for the below sys entry parsers: /sys/block/<block device>/bdi/read_ahead_kb /sys/block/<block device>/bdi/max_ratio /sys/block/<block device>/bdi/min_ratio Signed-off-by: NNamjae Jeon <linkinjeon@gmail.com> Signed-off-by: NVivek Trivedi <vtrivedi018@gmail.com>
-
- 04 8月, 2012 1 次提交
-
-
由 Artem Bityutskiy 提交于
Finally we can kill the 'sync_supers' kernel thread along with the '->write_super()' superblock operation because all the users are gone. Now every file-system is supposed to self-manage own superblock and its dirty state. The nice thing about killing this thread is that it improves power management. Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up every 5 seconds no matter what - even if there were no dirty superblocks and even if there were no file-systems using this service (e.g., btrfs and journalled ext4 do not need it). So it was wasting power most of the time. And because the thread was in the core of the kernel, all systems had to have it. So I am quite happy to make it go away. Interestingly, this thread is a left-over from the pdflush kernel thread which was a self-forking kernel thread responsible for all the write-back in old Linux kernels. It was turned into per-block device BDI threads, and 'sync_supers' was a left-over. Thus, R.I.P, pdflush as well. Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 01 8月, 2012 1 次提交
-
-
由 Wanpeng Li 提交于
Since per-BDI flusher threads were introduced in 2.6, the pdflush mechanism is not used any more. But the old interface exported through /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless. For back-compatibility, printk warning information and return 2 to notify the users that the interface is removed. Signed-off-by: NWanpeng Li <liwp@linux.vnet.ibm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 6月, 2012 1 次提交
-
-
由 Jan Kara 提交于
Convert calculations of proportion of writeback each bdi does to new flexible proportion code. That allows us to use aging period of fixed wallclock time which gives better proportion estimates given the hugely varying throughput of different devices. Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
-
- 01 2月, 2012 1 次提交
-
-
由 Rabin Vincent 提交于
While 7a401a97 ("backing-dev: ensure wakeup_timer is deleted") addressed the problem of the bdi being freed with a queued wakeup timer, there are other races that could happen if the wakeup timer expires after/during bdi_unregister(), before bdi_destroy() is called. wakeup_timer_fn() could attempt to wakeup a task which has already has been freed, or could access a NULL bdi->dev via the wake_forker_thread tracepoint. Cc: <stable@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Reported-by: NChanho Min <chanho.min@lge.com> Reviewed-by: NNamjae Jeon <linkinjeon@gmail.com> Signed-off-by: NRabin Vincent <rabin@rab.in> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
-
- 22 11月, 2011 1 次提交
-
-
由 Tejun Heo 提交于
Writeback and thinkpad_acpi have been using thaw_process() to prevent deadlock between the freezer and kthread_stop(); unfortunately, this is inherently racy - nothing prevents freezing from happening between thaw_process() and kthread_stop(). This patch implements kthread_freezable_should_stop() which enters refrigerator if necessary but is guaranteed to return if kthread_stop() is invoked. Both thaw_process() users are converted to use the new function. Note that this deadlock condition exists for many of freezable kthreads. They need to be converted to use the new should_stop or freezable workqueue. Tested with synthetic test case. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NHenrique de Moraes Holschuh <ibm-acpi@hmh.eng.br> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com>
-
- 11 11月, 2011 1 次提交
-
-
由 Rabin Vincent 提交于
bdi_prune_sb() in bdi_unregister() attempts to removes the bdi links from all super_blocks and then del_timer_sync() the writeback timer. However, this can race with __mark_inode_dirty(), leading to bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi we're unregistering, after we've called del_timer_sync(). This can end up with the bdi being freed with an active timer inside it, as in the case of the following dump after the removal of an SD card. Fix this by redoing the del_timer_sync() in bdi_destory(). ------------[ cut here ]------------ WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8() ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180 Modules linked in: Backtrace: [<c00109dc>] (dump_backtrace+0x0/0x110) from [<c0236e4c>] (dump_stack+0x18/0x1c) r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000 [<c0236e34>] (dump_stack+0x0/0x1c) from [<c0025e6c>] (warn_slowpath_common+0x54/0x6c) [<c0025e18>] (warn_slowpath_common+0x0/0x6c) from [<c0025f28>] (warn_slowpath_fmt+0x38/0x40) r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29 r3:00000009 [<c0025ef0>] (warn_slowpath_fmt+0x0/0x40) from [<c015eb4c>] (debug_print_object+0x9c/0xc8) r3:c02b1b29 r2:c02bc662 [<c015eab0>] (debug_print_object+0x0/0xc8) from [<c015f574>] (debug_check_no_obj_freed+0xac/0x1dc) r6:c7964000 r5:00000001 r4:c7964000 [<c015f4c8>] (debug_check_no_obj_freed+0x0/0x1dc) from [<c00a9e38>] (kmem_cache_free+0x88/0x1f8) [<c00a9db0>] (kmem_cache_free+0x0/0x1f8) from [<c014286c>] (blk_release_queue+0x70/0x78) [<c01427fc>] (blk_release_queue+0x0/0x78) from [<c015290c>] (kobject_release+0x70/0x84) r5:c79641f0 r4:c796420c [<c015289c>] (kobject_release+0x0/0x84) from [<c0153ce4>] (kref_put+0x68/0x80) r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c [<c0153c7c>] (kref_put+0x0/0x80) from [<c01527d0>] (kobject_put+0x48/0x5c) r5:c79643b4 r4:c79641f0 [<c0152788>] (kobject_put+0x0/0x5c) from [<c013ddd8>] (blk_cleanup_queue+0x68/0x74) r4:c7964000 [<c013dd70>] (blk_cleanup_queue+0x0/0x74) from [<c01a6370>] (mmc_blk_put+0x78/0xe8) r5:00000000 r4:c794c400 [<c01a62f8>] (mmc_blk_put+0x0/0xe8) from [<c01a64b4>] (mmc_blk_release+0x24/0x38) r5:c794c400 r4:c0322824 [<c01a6490>] (mmc_blk_release+0x0/0x38) from [<c00de11c>] (__blkdev_put+0xe8/0x170) r5:c78d5e00 r4:c74083c0 [<c00de034>] (__blkdev_put+0x0/0x170) from [<c00de2c0>] (blkdev_put+0x11c/0x12c) r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0 r3:00000000 [<c00de1a4>] (blkdev_put+0x0/0x12c) from [<c00b0724>] (kill_block_super+0x60/0x6c) r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0 [<c00b06c4>] (kill_block_super+0x0/0x6c) from [<c00b0a94>] (deactivate_locked_super+0x44/0x70) r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4 [<c00b0a50>] (deactivate_locked_super+0x0/0x70) from [<c00b1358>] (deactivate_super+0x6c/0x70) r5:c794dc00 r4:c794dc00 [<c00b12ec>] (deactivate_super+0x0/0x70) from [<c00c88b0>] (mntput_no_expire+0x188/0x194) r5:c794dc00 r4:c7942300 [<c00c8728>] (mntput_no_expire+0x0/0x194) from [<c00c95e0>] (sys_umount+0x2e4/0x310) r6:c7942300 r5:00000000 r4:00000000 r3:00000000 [<c00c92fc>] (sys_umount+0x0/0x310) from [<c000d940>] (ret_fast_syscall+0x0/0x30) ---[ end trace e5c83c92ada51c76 ]--- Cc: stable@kernel.org Signed-off-by: NRabin Vincent <rabin.vincent@stericsson.com> Signed-off-by: NLinus Walleij <linus.walleij@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 11月, 2011 1 次提交
-
-
由 Andrew Morton 提交于
fiddle wording Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 10月, 2011 1 次提交
-
-
由 Curt Wohlgemuth 提交于
This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enumeration has been added to writeback.h, to enumerate the possible reasons. The 'writeback_work_class' and tracepoint event class and 'writeback_queue_io' tracepoints are updated to include the symbolic 'reason' in all trace events. And the 'writeback_inodes_sbXXX' family of routines has had a wb_stats parameter added to them, so callers can specify why writeback is being started. Acked-by: NJan Kara <jack@suse.cz> Signed-off-by: NCurt Wohlgemuth <curtw@google.com> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
-
- 03 10月, 2011 2 次提交
-
-
由 Wu Fengguang 提交于
There are some imperfections in balanced_dirty_ratelimit. 1) large fluctuations The dirty_rate used for computing balanced_dirty_ratelimit is merely averaged in the past 200ms (very small comparing to the 3s estimation period for write_bw), which makes rather dispersed distribution of balanced_dirty_ratelimit. It's pretty hard to average out the singular points by increasing the estimation period. Considering that the averaging technique will introduce very undesirable time lags, I give it up totally. (btw, the 3s write_bw averaging time lag is much more acceptable because its impact is one-way and therefore won't lead to oscillations.) The more practical way is filtering -- most singular balanced_dirty_ratelimit points can be filtered out by remembering some prev_balanced_rate and prev_prev_balanced_rate. However the more reliable way is to guard balanced_dirty_ratelimit with task_ratelimit. 2) due to truncates and fs redirties, the (write_bw <=> dirty_rate) match could become unbalanced, which may lead to large systematical errors in balanced_dirty_ratelimit. The truncates, due to its possibly bumpy nature, can hardly be compensated smoothly. So let's face it. When some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit high, dirty pages will go higher than the setpoint. task_ratelimit will in turn become lower than dirty_ratelimit. So if we consider both balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit only when they are on the same side of dirty_ratelimit, the systematical errors in balanced_dirty_ratelimit won't be able to bring dirty_ratelimit far away. The balanced_dirty_ratelimit estimation may also be inaccurate near @limit or @freerun, however is less an issue. 3) since we ultimately want to - keep the fluctuations of task ratelimit as small as possible - keep the dirty pages around the setpoint as long time as possible the update policy used for (2) also serves the above goals nicely: if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit), and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit), there is no point to bring up dirty_ratelimit in a hurry only to hurt both the above two goals. So, we make use of task_ratelimit to limit the update of dirty_ratelimit in two ways: 1) avoid changing dirty rate when it's against the position control target (the adjusted rate will slow down the progress of dirty pages going back to setpoint). 2) limit the step size. task_ratelimit is changing values step by step, leaving a consistent trace comparing to the randomly jumping balanced_dirty_ratelimit. task_ratelimit also has the nice smaller errors in stable state and typically larger errors when there are big errors in rate. So it's a pretty good limiting factor for the step size of dirty_ratelimit. Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit. task_ratelimit is merely used as a limiting factor. Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
-
由 Wu Fengguang 提交于
It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N) when there are N dd tasks. On write() syscall, use bdi->dirty_ratelimit ============================================ balance_dirty_pages(pages_dirtied) { task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio(); pause = pages_dirtied / task_ratelimit; sleep(pause); } On every 200ms, update bdi->dirty_ratelimit =========================================== bdi_update_dirty_ratelimit() { task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio(); balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate; bdi->dirty_ratelimit = balanced_dirty_ratelimit } Estimation of balanced bdi->dirty_ratelimit =========================================== balanced task_ratelimit ----------------------- balance_dirty_pages() needs to throttle tasks dirtying pages such that the total amount of dirty pages stays below the specified dirty limit in order to avoid memory deadlocks. Furthermore we desire fairness in that tasks get throttled proportionally to the amount of pages they dirty. IOW we want to throttle tasks such that we match the dirty rate to the writeout bandwidth, this yields a stable amount of dirty pages: dirty_rate == write_bw (1) The fairness requirement gives us: task_ratelimit = balanced_dirty_ratelimit == write_bw / N (2) where N is the number of dd tasks. We don't know N beforehand, but still can estimate balanced_dirty_ratelimit within 200ms. Start by throttling each dd task at rate task_ratelimit = task_ratelimit_0 (3) (any non-zero initial value is OK) After 200ms, we measured dirty_rate = # of pages dirtied by all dd's / 200ms write_bw = # of pages written to the disk / 200ms For the aggressive dd dirtiers, the equality holds dirty_rate == N * task_rate == N * task_ratelimit_0 (4) Or task_ratelimit_0 == dirty_rate / N (5) Now we conclude that the balanced task ratelimit can be estimated by write_bw balanced_dirty_ratelimit = task_ratelimit_0 * ---------- (6) dirty_rate Because with (4) and (5) we can get the desired equality (1): write_bw balanced_dirty_ratelimit == (dirty_rate / N) * ---------- dirty_rate == write_bw / N Then using the balanced task ratelimit we can compute task pause times like: task_pause = task->nr_dirtied / task_ratelimit task_ratelimit with position control ------------------------------------ However, while the above gives us means of matching the dirty rate to the writeout bandwidth, it at best provides us with a stable dirty page count (assuming a static system). In order to control the dirty page count such that it is high enough to provide performance, but does not exceed the specified limit we need another control. The dirty position control works by extending (2) to task_ratelimit = balanced_dirty_ratelimit * pos_ratio (7) where pos_ratio is a negative feedback function that subjects to 1) f(setpoint) = 1.0 2) df/dx < 0 That is, if the dirty pages are ABOVE the setpoint, we throttle each task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty pages are created less fast than they are cleaned, thus DROP to the setpoints (and the reverse). Based on (7) and the assumption that both dirty_ratelimit and pos_ratio remains CONSTANT for the past 200ms, we get task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio (8) Putting (8) into (6), we get the formula used in bdi_update_dirty_ratelimit(): write_bw balanced_dirty_ratelimit *= pos_ratio * ---------- (9) dirty_rate Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
-