提交 · b3bba872ddb0320a7ecb54decae53c13ceb2ed4c · openeuler / raspberrypi-kernel

18 12月, 2011 1 次提交

writeback: show writeback reason with __print_symbolic · b3bba872

由 Wu Fengguang 提交于 12月 08, 2011

This makes the binary trace understandable by trace-cmd.

CC: Dave Chinner <david@fromorbit.com>
CC: Curt Wohlgemuth <curtw@google.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

b3bba872

29 11月, 2011 1 次提交

writeback: Fix issue on make htmldocs · 786228ab

由 Marcos Paulo de Souza 提交于 11月 23, 2011

Document the @reason parameter to make "make htmldocs" happy.
Acked-by: NRandy Dunlap <rdunlap@xenotime.net>
Signed-off-by: NMarcos Paulo de Souza <marcos.mage@gmail.com>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

786228ab

31 10月, 2011 2 次提交

writeback: Add a 'reason' to wb_writeback_work · 0e175a18

由 Curt Wohlgemuth 提交于 10月 07, 2011

This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity.  A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.

The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.

And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

0e175a18

writeback: send work item to queue_io, move_expired_inodes · ad4e38dd

由 Curt Wohlgemuth 提交于 10月 07, 2011

Instead of sending ->older_than_this to queue_io() and
move_expired_inodes(), send the entire wb_writeback_work
structure.  There are other fields of a work item that are
useful in these routines and in tracepoints.
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

ad4e38dd

03 10月, 2011 2 次提交

writeback: per-bdi background threshold · b00949aa

由 Wu Fengguang 提交于 11月 18, 2010

One thing puzzled me is that in JBOD case, the per-disk writeout
performance is smaller than the corresponding single-disk case even
when they have comparable bdi_thresh. Tracing shows find that in single
disk case, bdi_writeback is always kept high while in JBOD case, it
could drop low from time to time and correspondingly bdi_reclaimable
could sometimes rush high.

The fix is to watch bdi_reclaimable and kick background writeback as
soon as it goes high. This resembles the global background threshold
but in per-bdi manner. The trick is, as long as bdi_reclaimable does
not go high, bdi_writeback naturally won't go low because
bdi_reclaimable+bdi_writeback ~= bdi_thresh.

With less fluctuated writeback pages, JBOD performance is observed to
increase noticeably in various cases.

vmstat:nr_written values before/after patch:

3.1.0-rc4-wo-underrun+ 3.1.0-rc4-bgthresh3+
------------------------ ------------------------
125596480 +25.9% 158179363 JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
61790815 +110.4% 130032231 JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
58853546 -0.1% 58823828 JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
110159811 +24.7% 137355377 JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
69544762 +10.8% 77080047 JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
50644862 +0.5% 50890006 JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
42677090 +28.0% 54643527 JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
47491324 +13.3% 53785605 JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
52548986 +0.9% 53001031 JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
26783091 +36.8% 36650248 JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
35526347 +14.0% 40492312 JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
44670723 -1.1% 44177606 JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
127996037 +22.4% 156719990 JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
57518856 +3.8% 59677625 JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
51919909 +12.2% 58269894 JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
86410514 +79.0% 154660433 JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
40132519 +38.6% 55617893 JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
48423248 +7.5% 52042927 JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
206041046 +44.1% 296846536 JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
72312903 -19.4% 58272885 JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
50635672 -0.5% 50384787 JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
68308534 +115.7% 147324758 JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
57882933 +14.5% 66269621 JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
52183472 +12.8% 58855181 JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
53788956 +94.2% 104460352 JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
44493342 +35.5% 60298210 JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
42641209 +18.9% 50681038 JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

b00949aa

W
writeback: add bg_threshold parameter to __bdi_update_bandwidth() · af6a3113
由 Wu Fengguang 提交于 10月 03, 2011
```
No behavior change.
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
```
af6a3113

31 7月, 2011 1 次提交

don't busy retry the inode on failed grab_super_passive() · 0e995816

由 Wu Fengguang 提交于 7月 29, 2011

This fixes a soft lockup on conditions

a) the flusher is working on a work by __bdi_start_writeback(), while

b) someone else calls writeback_inodes_sb*() or sync_inodes_sb(), which
grab sb->s_umount and enqueue a new work for the flusher to execute

The s_umount grabbed by (b) will fail the grab_super_passive() in (a).
Then if the inode is requeued, wb_writeback() will busy retry on it.
As a result, wb_writeback() loops for ever without releasing
wb->list_lock, which further blocks other tasks.

Fix the busy loop by redirtying the inode. This may undesirably delay
the writeback of the inode, however most likely it will be picked up
soon by the queued work by writeback_inodes_sb*(), sync_inodes_sb() or
even writeback_inodes_wb().

bug url: http://www.spinics.net/lists/linux-fsdevel/msg47292.htmlReported-by: NChristoph Hellwig <hch@infradead.org>
Tested-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

0e995816

24 7月, 2011 1 次提交

writeback: don't busy retry writeback on new/freeing inodes · fcc5c222

由 Wu Fengguang 提交于 7月 11, 2011

Fix a system hang bug introduced by commit b7a2441f ("writeback:
remove writeback_control.more_io") and e8dfc305 ("writeback: elevate
queue_io() into wb_writeback()") easily reproducible with high memory
pressure and lots of file creation/deletions, for example, a kernel
build in limited memory.

It hangs when some inode is in the I_NEW, I_FREEING or I_WILL_FREE
state, the flusher will get stuck busy retrying that inode, never
releasing wb->list_lock. The lock in turn blocks all kinds of other
tasks when they are trying to grab it.

As put by Jan, it's a safe change regarding data integrity. I_FREEING or
I_WILL_FREE inodes are written back by iput_final() and it is reclaim
code that is responsible for eventually removing them. So writeback code
can safely ignore them. I_NEW inodes should move out of this state when
they are fully set up and in the writeback round following that, we will
consider them for writeback. So the change makes sense.

CC: Jan Kara <jack@suse.cz>
Reported-by: NHugh Dickins <hughd@google.com>
Tested-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

fcc5c222

20 7月, 2011 1 次提交

superblock: move pin_sb_for_writeback() to fs/super.c · 12ad3ab6

由 Dave Chinner 提交于 7月 08, 2011

The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.

pin_sb_for_writeback() does this exactly, so move it to fs/super.c
and rename it to grab_super_passive() and exporting it via
fs/internal.h for all the VFS code to be able to use.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

12ad3ab6

10 7月, 2011 4 次提交

writeback: scale IO chunk size up to half device bandwidth · 1a12d8bd

由 Wu Fengguang 提交于 8月 29, 2010

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

So remove the MAX_WRITEBACK_PAGES constraint. The writeback pages
will adapt to as large as the storage device can write within 500ms.

XFS is observed to do IO completions in a batch, and the batch size is
equal to the write chunk size. To avoid dirty pages to suddenly drop
out of balance_dirty_pages()'s dirty control scope and create large
fluctuations, the chunk size is also limited to half the control scope.

The balance_dirty_pages() control scrope is

	[(background_thresh + dirty_thresh) / 2, dirty_thresh]

which is by default [15%, 20%] of global dirty pages, whose range size
is dirty_thresh / DIRTY_FULL_SCOPE.

The adpative write chunk size will be rounded to the nearest 4MB
boundary.

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

1a12d8bd

writeback: introduce smoothed global dirty limit · c42843f2

由 Wu Fengguang 提交于 3月 02, 2011

The start of a heavy weight application (ie. KVM) may instantly knock
down determine_dirtyable_memory() if the swap is not enabled or full.
global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
dirty thresholds that are _much_ lower than the global/bdi dirty pages.

balance_dirty_pages() will then heavily throttle all dirtiers including
the light ones, until the dirty pages drop below the new dirty thresholds.
During this _deep_ dirty-exceeded state, the system may appear rather
unresponsive to the users.

About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
threshold to heavy dirtiers than light ones, and the dirty pages will
be throttled around the heavy dirtiers' dirty threshold and reasonably
below the light dirtiers' dirty threshold. In this state, only the heavy
dirtiers will be throttled and the dirty pages are carefully controlled
to not exceed the light dirtiers' dirty threshold. However if the
threshold itself suddenly drops below the number of dirty pages, the
light dirtiers will get heavily throttled.

So introduce global_dirty_limit for tracking the global dirty threshold
with policies

- follow downwards slowly
- follow up in one shot

global_dirty_limit can effectively mask out the impact of sudden drop of
dirtyable memory. It will be used in the next patch for two new type of
dirty limits. Note that the new dirty limits are not going to avoid
throttling the light dirtiers, but could limit their sleep time to 200ms.
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

c42843f2

writeback: bdi write bandwidth estimation · e98be2d5

由 Wu Fengguang 提交于 8月 29, 2010

The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.

It tries to update the bandwidth only when disk is fully utilized.
Any inactive period of more than one second will be skipped.

The estimated bandwidth will be reflecting how fast the device can
writeout when _fully utilized_, and won't drop to 0 when it goes idle.
The value will remain constant at disk idle time. At busy write time, if
not considering fluctuations, it will also remain high unless be knocked
down by possible concurrent reads that compete for the disk time and
bandwidth with async writes.

The estimation is not done purely in the flusher because there is no
guarantee for write_cache_pages() to return timely to update bandwidth.

The bdi->avg_write_bandwidth smoothing is very effective for filtering
out sudden spikes, however may be a little biased in long term.

The overheads are low because the bdi bandwidth update only occurs at
200ms intervals.

The 200ms update interval is suitable, because it's not possible to get
the real bandwidth for the instance at all, due to large fluctuations.

The NFS commits can be as large as seconds worth of data. One XFS
completion may be as large as half second worth of data if we are going
to increase the write chunk to half second worth of data. In ext4,
fluctuations with time period of around 5 seconds is observed. And there
is another pattern of irregular periods of up to 20 seconds on SSD tests.

That's why we are not only doing the estimation at 200ms intervals, but
also averaging them over a period of 3 seconds and then go further to do
another level of smoothing in avg_write_bandwidth.

CC: Li Shaohua <shaohua.li@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

e98be2d5

writeback: make writeback_control.nr_to_write straight · d46db3d5

由 Wu Fengguang 提交于 5月 04, 2011

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.
Acked-by: NJan Kara <jack@suse.cz>
Proposed-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

d46db3d5

08 6月, 2011 12 次提交

writeback: trace event writeback_queue_io · e84d0a4f

由 Wu Fengguang 提交于 4月 23, 2011

Note that it adds a little overheads to account the moved/enqueued
inodes from b_dirty to b_io. The "moved" accounting may be later used to
limit the number of inodes that can be moved in one shot, in order to
keep spinlock hold time under control.
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

e84d0a4f

writeback: trace event writeback_single_inode · 251d6a47

由 Wu Fengguang 提交于 12月 01, 2010

It is valuable to know how the dirty inodes are iterated and their IO size.

"writeback_single_inode: bdi 8:0: ino=134246746 state=I_DIRTY_SYNC|I_SYNC age=414 index=0 to_write=1024 wrote=0"

- "state" reflects inode->i_state at the end of writeback_single_inode()
- "index" reflects mapping->writeback_index after the ->writepages() call
- "to_write" is the wbc->nr_to_write at entrance of writeback_single_inode()
- "wrote" is the number of pages actually written

v2: add trace event writeback_single_inode_requeue as proposed by Dave.

CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

251d6a47

writeback: remove writeback_control.more_io · b7a2441f

由 Wu Fengguang 提交于 7月 21, 2010

When wbc.more_io was first introduced, it indicates whether there are
at least one superblock whose s_more_io contains more IO work. Now with
the per-bdi writeback, it can be replaced with a simple b_more_io test.
Acked-by: NJan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

b7a2441f

writeback: avoid extra sync work at enqueue time · e185dda8

由 Wu Fengguang 提交于 4月 23, 2011

This removes writeback_control.wb_start and does more straightforward
sync livelock prevention by setting .older_than_this to prevent extra
inodes from being enqueued in the first place.
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

e185dda8

writeback: elevate queue_io() into wb_writeback() · e8dfc305

由 Wu Fengguang 提交于 4月 21, 2011

Code refactor for more logical code layout.
No behavior change.

- remove the mis-named __writeback_inodes_sb()

- wb_writeback()/writeback_inodes_wb() will decide when to queue_io()
  before calling __writeback_inodes_wb()
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

e8dfc305

writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab

由 Christoph Hellwig 提交于 4月 21, 2011

Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
as it's currently the most contended lock in the system for metadata
heavy workloads.  It won't help for single-filesystem workloads for
which we'll need the I/O-less balance_dirty_pages, but at least we
can dedicate a cpu to spinning on each bdi now for larger systems.

Based on earlier patches from Nick Piggin and Dave Chinner.

It reduces lock contentions to 1/4 in this test case:
10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram

lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 2.6.39-rc3:
                      inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                      ------------------
                      inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                      inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                      inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                      inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157

2.6.39-rc3 + patch:
                &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                ------------------------
                &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                ------------------------
                &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f

hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

f758eeab

writeback: refill b_io iff empty · 424b351f

由 Wu Fengguang 提交于 7月 21, 2010

There is no point to carry different refill policies between for_kupdate
and other type of works. Use a consistent "refill b_io iff empty" policy
which can guarantee fairness in an easy to understand way.

A b_io refill will setup a _fixed_ work set with all currently eligible
inodes and start a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done, new inodes that are
eligible at the time will be enqueued and the walk be started over.

This procedure provides fairness among the inodes because it guarantees
each inode to be synced once and only once at each round. So all inodes
will be free from starvations.

This change relies on wb_writeback() to keep retrying as long as we made
some progress on cleaning some pages and/or inodes. Without that ability,
the old logic on background works relies on aggressively queuing all
eligible inodes into b_io at every time. But that's not a guarantee.

The below test script completes a slightly faster now:

             2.6.39-rc3	  2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed     256.043      252.367
stddev           24.381       12.530

tar elapsed      30.097       28.808
dd  elapsed      13.214       11.782

	#!/bin/zsh

	cp /c/linux-2.6.38.3.tar.bz2 /dev/shm/

	umount /dev/sda7
	mkfs.xfs -f /dev/sda7
	mount /dev/sda7 /fs

	echo 3 > /proc/sys/vm/drop_caches

	tic=$(cat /proc/uptime|cut -d' ' -f2)

	cd /fs
	time tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 &
	time dd if=/dev/zero of=/fs/zero bs=1M count=1000 &

	wait
	sync
	tac=$(cat /proc/uptime|cut -d' ' -f2)
	echo elapsed: $((tac - tic))

It maintains roughly the same small vs. large file writeout shares, and
offers large files better chances to be written in nice 4M chunks.

Analyzes from Dave Chinner in great details:

Let's say we have lots of inodes with 100 dirty pages being created,
and one large writeback going on. We expire 8 new inodes for every
1024 pages we write back.

With the old code, we do:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)

	writeback  8 small inodes 800 pages
		   1 large inode 224 pages -> b_more_io

	b_more_io (large inode) -> b_io (8s, 1l)
	8 newly expired inodes -> b_io (8s, 1l, 8s)
	.....

Your new code:

	b_more_io (large inode) -> b_io (1l)
	8 newly expired inodes -> b_io (1l, 8s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 8s)
	writeback  8 small inodes 800 pages

	b_io empty: (1800 pages written)
		b_more_io (large inode) -> b_io (1l)
		14 newly expired inodes -> b_io (1l, 14s)

	writeback  large inode 1024 pages -> b_more_io
	(b_io == 14s)
	writeback  10 small inodes 1000 pages
		   1 small inode 24 pages -> b_more_io (1l, 1s(24))
	writeback  5 small inodes 500 pages
	b_io empty: (2548 pages written)
		b_more_io (large inode) -> b_io (1l, 1s(24))
		20 newly expired inodes -> b_io (1l, 1s(24), 20s)
	......

Rough progression of pages written at b_io refill:

Old code:

	total	large file	% of writeback
	1024	224		21.9% (fixed)

New code:
	total	large file	% of writeback
	1800	1024		~55%
	2550	1024		~40%
	3050	1024		~33%
	3500	1024		~29%
	3950	1024		~26%
	4250	1024		~24%
	4500	1024		~22.7%
	4700	1024		~21.7%
	4800	1024		~21.3%
	4800	1024		~21.3%
	(pretty much steady state from here)

Ok, so the steady state is reached with a similar percentage of
writeback to the large file as the existing code. Ok, that's good,
but providing some evidence that is doesn't change the shared of
writeback to the large should be in the commit message ;)

The other advantage to this is that we always write 1024 page chunks
to the large file, rather than smaller "whatever remains" chunks.

CC: Jan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

424b351f

writeback: the kupdate expire timestamp should be a moving target · ba9aa839

由 Wu Fengguang 提交于 7月 21, 2010

Dynamically compute the dirty expire timestamp at queue_io() time.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  delaying the expired dirty pages to the end of LRU lists, triggering
  the evil pageout(). Nevertheless this patch merely addresses part
  of the problem.

v2: keep policy changes inside wb_writeback() and keep the
wbc.older_than_this visibility as suggested by Dave.

CC: Dave Chinner <david@fromorbit.com>
Acked-by: NJan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NItaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

ba9aa839

writeback: try more writeback as long as something was written · e6fb6da2

由 Wu Fengguang 提交于 7月 22, 2010

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of eligible inodes into b_io at
entrance time. When the queued set of inodes are all synced, they just
return, possibly with all queued inode pages written but still
wbc.nr_to_write > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes are completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.

For example, imagine 100 inodes

        i0, i1, i2, ..., i90, i91, i99

At queue_io() time, i90-i99 happen to be expired and moved to s_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over
background threshold. This will be a fairly normal/frequent case I guess.

Now that we do tagged sync and update inode->dirtied_when after the sync,
this change won't livelock sync(1).  I actually tried to write 1 page
per 1ms with this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs.
Acked-by: NJan Kara <jack@suse.cz>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

e6fb6da2

writeback: introduce writeback_control.inodes_written · cb9bd115

由 Wu Fengguang 提交于 7月 21, 2010

The flusher works on dirty inodes in batches, and may quit prematurely
if the batch of inodes happen to be metadata-only dirtied: in this case
wbc->nr_to_write won't be decreased at all, which stands for "no pages
written" but also mis-interpreted as "no progress".

So introduce writeback_control.inodes_written to count the inodes get
cleaned from VFS POV.  A non-zero value means there are some progress on
writeback, in which case more writeback can be tried.
Acked-by: NJan Kara <jack@suse.cz>
Acked-by: NMel Gorman <mel@csn.ul.ie>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

cb9bd115

writeback: update dirtied_when for synced inode to prevent livelock · 94c3dcbb

由 Wu Fengguang 提交于 4月 27, 2011

Explicitly update .dirtied_when on synced inodes, so that they are no
longer considered for writeback in the next round.

It can prevent both of the following livelock schemes:

- while true; do echo data >> f; done
- while true; do touch f;        done (in theory)

The exact livelock condition is, during sync(1):

(1) no new inodes are dirtied
(2) an inode being actively dirtied

On (2), the inode will be tagged and synced with .nr_to_write=LONG_MAX.
When finished, it will be redirty_tail()ed because it's still dirty
and (.nr_to_write > 0). redirty_tail() won't update its ->dirtied_when
on condition (1). The sync work will then revisit it on the next
queue_io() and find it eligible again because its old ->dirtied_when
predates the sync work start time.

We'll do more aggressive "keep writeback as long as we wrote something"
logic in wb_writeback(). The "use LONG_MAX .nr_to_write" trick in commit
b9543dac ("writeback: avoid livelocking WB_SYNC_ALL writeback") will
no longer be enough to stop sync livelock.
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

94c3dcbb

writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6

由 Wu Fengguang 提交于 6月 06, 2010

sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
do livelock prevention for it, too.

Jan's commit f446daae ("mm: implement writeback livelock avoidance
using page tagging") is a partial fix in that it only fixed the
WB_SYNC_ALL phase livelock.

Although ext4 is tested to no longer livelock with commit f446daae,
it may due to some "redirty_tail() after pages_skipped" effect which
is by no means a guarantee for _all_ the file systems.

Note that writeback_inodes_sb() is called by not only sync(), they are
treated the same because the other callers also need livelock prevention.

Impact:  It changes the order in which pages/inodes are synced to disk.
Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
until finished with the current inode.
Acked-by: NJan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>

6e6938b6

27 5月, 2011 1 次提交

fs: pass exact type of data dirties to ->dirty_inode · aa385729

由 Christoph Hellwig 提交于 5月 27, 2011

Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.

This is just the prototype change with no user for it yet.  I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.

Also remove incorrect comments that ->dirty_inode can't block.  That
has been changed a long time ago, and many implementations rely on it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

aa385729

31 3月, 2011 1 次提交

Fix common misspellings · 25985edc

由 Lucas De Marchi 提交于 3月 30, 2011

Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>

25985edc

25 3月, 2011 4 次提交

fs: pull inode->i_lock up out of writeback_single_inode · 0f1b1fd8

由 Dave Chinner 提交于 3月 22, 2011

First thing we do in writeback_single_inode() is take the i_lock and
the last thing we do is drop it. A caller already holds the i_lock,
so pull the i_lock out of writeback_single_inode() to reduce the
round trips on this lock during inode writeback.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

0f1b1fd8

fs: move i_wb_list out from under inode_lock · a66979ab

由 Dave Chinner 提交于 3月 22, 2011

Protect the inode writeback list with a new global lock
inode_wb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a66979ab

fs: move i_sb_list out from under inode_lock · 55fa6091

由 Dave Chinner 提交于 3月 22, 2011

Protect the per-sb inode list with a new global lock
inode_sb_list_lock and use it to protect the list manipulations and
traversals. This lock replaces the inode_lock as the inodes on the
list can be validity checked while holding the inode->i_lock and
hence the inode_lock is no longer needed to protect the list.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

55fa6091

fs: protect inode->i_state with inode->i_lock · 250df6ed

由 Dave Chinner 提交于 3月 22, 2011

Protect inode state transitions and validity checks with the
inode->i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.

This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.

Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode->i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

250df6ed

14 1月, 2011 6 次提交

fs/fs-writeback.c: fix sync_inodes_sb() return value kernel-doc · cb9ef8d5

由 Stefan Hajnoczi 提交于 1月 13, 2011

The sync_inodes_sb() function does not have a return value.  Remove the
outdated documentation comment.
Signed-off-by: NStefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cb9ef8d5

sync_inode_metadata: fix comment · c691b9d9

由 Andrew Morton 提交于 1月 13, 2011

Use correct function name, remove incorrect apostrophe

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c691b9d9

writeback: avoid livelocking WB_SYNC_ALL writeback · b9543dac

由 Jan Kara 提交于 1月 13, 2011

When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX.  The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
easily end up with non-positive nr_to_write after the function returns, if
the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.

When nr_to_write is <= 0 wb_writeback() decides we need another round of
writeback but this is wrong in some cases!  For example when a single
large file is continuously dirtied, we would never finish syncing it
because each pass would be able to write MAX_WRITEBACK_PAGES and inode
dirty timestamp never gets updated (as inode is never completely clean).
Thus __writeback_inodes_sb() would write the redirtied inode again and
again.

Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode.  We
do not need nr_to_write in WB_SYNC_ALL mode anyway since
write_cache_pages() does livelock avoidance using page tagging in
WB_SYNC_ALL mode.

This makes wb_writeback() call __writeback_inodes_sb() only once on
WB_SYNC_ALL.  The latter function won't livelock because it works on

- a finite set of files by doing queue_io() once at the beginning
- a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging

After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.

[fengguang.wu@intel.com: fix locking comment]
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b9543dac

writeback: stop background/kupdate works from livelocking other works · aa373cf5

由 Jan Kara 提交于 1月 13, 2011

Background writeback is easily livelockable in a loop in wb_writeback() by
a process continuously re-dirtying pages (or continuously appending to a
file).  This is in fact intended as the target of background writeback is
to write dirty pages it can find as long as we are over
dirty_background_threshold.

But the above behavior gets inconvenient at times because no other work
queued in the flusher thread's queue gets processed.  In particular, since
e.g.  sync(1) relies on flusher thread to do all the IO for it, sync(1)
can hang forever waiting for flusher thread to do the work.

Generally, when a flusher thread has some work queued, someone submitted
the work to achieve a goal more specific than what background writeback
does.  Moreover by working on the specific work, we also reduce amount of
dirty pages which is exactly the target of background writeout.  So it
makes sense to give specific work a priority over a generic page cleaning.

Thus we interrupt background writeback if there is some other work to do.
We return to the background writeback after completing all the queued
work.

This may delay the writeback of expired inodes for a while, however the
expired inodes will eventually be flushed to disk as long as the other
works won't livelock.

[fengguang.wu@intel.com: update comment]
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aa373cf5

writeback: trace wakeup event for background writeback · 71927e84

由 Wu Fengguang 提交于 1月 13, 2011

This tracks when balance_dirty_pages() tries to wakeup the flusher thread
for background writeback (if it was not started already).
Suggested-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

71927e84

writeback: integrated background writeback work · 6585027a

由 Jan Kara 提交于 1月 13, 2011

Check whether background writeback is needed after finishing each work.

When bdi flusher thread finishes doing some work check whether any kind of
background writeback needs to be done (either because
dirty_background_ratio is exceeded or because we need to start flushing
old inodes).  If so, just do background write back.

This way, bdi_start_background_writeback() just needs to wake up the
flusher thread.  It will do background writeback as soon as there is no
other work.

This is a preparatory patch for the next patch which stops background
writeback as soon as there is other work to do.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Engelhardt <jengelh@medozas.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6585027a

30 10月, 2010 1 次提交

fs-writeback.c: unify some common code · cdf01dd5

由 Linus Torvalds 提交于 10月 30, 2010

The btrfs merge looks like hell, because it changes fs-writeback.c, and
the crazy code has this repeated "estimate number of dirty pages"
counting that involves three different helper functions.  And it's done
in two different places.

Just unify that whole calculation as a "get_nr_dirty_pages()" helper
function, and the merge result will look half-way decent.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cdf01dd5

29 10月, 2010 1 次提交

Add new functions for triggering inode writeback · 3259f8be

由 Chris Mason 提交于 10月 29, 2010

When btrfs is running low on metadata space, it needs to force delayed
allocation pages to disk. It currently does this with a suboptimal walk
of a private list of inodes with delayed allocation, and it would be
much better if we used the generic flusher threads.

writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
thread to start IO on all the dirty pages in the FS before it returns.
This adds variants of writeback_inodes_sb* that allow the caller to
control how many pages get sent down.
Signed-off-by: NChris Mason <chris.mason@oracle.com>

3259f8be

27 10月, 2010 1 次提交

kernel: remove PF_FLUSHER · 766f9164

由 Peter Zijlstra 提交于 10月 26, 2010

PF_FLUSHER is only ever set, not tested, remove it.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

766f9164