1. 10 Jul 2013, 1 commit
  2. 03 Jul 2013, 1 commit
    • sync: don't block the flusher thread waiting on IO · 7747bd4b
      Committed by Dave Chinner
      When sync does its WB_SYNC_ALL writeback, it issues data IO and
      then immediately waits for IO completion. This is done in the
      context of the flusher thread, and hence completely ties up the
      flusher thread for the backing device until all the dirty inodes
      have been synced. On filesystems that are dirtying inodes constantly
      and quickly, this means the flusher thread can be tied up for
      minutes per sync call and hence badly affects system-level write IO
      performance, as the page cache cannot be cleaned quickly.
      
      We already have a wait loop for IO completion for sync(2), so cut
      this out of the flusher thread and delegate it to wait_sb_inodes().
      Hence we can do rapid IO submission, and then wait for it all to
      complete.
      
      Effect of sync on fsmark before the patch:
      
      FSUse%        Count         Size    Files/sec     App Overhead
      .....
           0       640000         4096      35154.6          1026984
           0       720000         4096      36740.3          1023844
           0       800000         4096      36184.6           916599
           0       880000         4096       1282.7          1054367
           0       960000         4096       3951.3           918773
           0      1040000         4096      40646.2           996448
           0      1120000         4096      43610.1           895647
           0      1200000         4096      40333.1           921048
      
      And a single sync pass took:
      
        real    0m52.407s
        user    0m0.000s
        sys     0m0.090s
      
      After the patch, there is no impact on fsmark results, and each
      individual sync(2) operation run concurrently with the same fsmark
      workload takes roughly 7s:
      
        real    0m6.930s
        user    0m0.000s
        sys     0m0.039s
      
      IOWs, sync is 7-8x faster on a busy filesystem and does not have an
      adverse impact on ongoing async data write operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7747bd4b
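
      For illustration, a minimal user-space model of the split described above
      (toy print helpers, not the real fs/fs-writeback.c code): the flusher path
      only submits IO, and the single wait pass happens afterwards in the
      sync(2) caller, analogous to wait_sb_inodes().

      #include <stdio.h>

      /* Toy model: "submitting" and "waiting" are just prints; in the kernel the
       * submit step is WB_SYNC_ALL writeback and the wait step is wait_sb_inodes(). */
      static void submit_inode_io(int ino) { printf("submit IO for inode %d\n", ino); }
      static void wait_inode_io(int ino)   { printf("wait on inode %d\n", ino); }

      /* Before the patch: the flusher submitted and waited inode by inode,
       * tying the flusher thread up for the whole sync. */
      static void flusher_sync_old(int ninodes)
      {
          for (int i = 0; i < ninodes; i++) {
              submit_inode_io(i);
              wait_inode_io(i);       /* flusher blocked here */
          }
      }

      /* After the patch: the flusher only submits; the sync(2) caller does one
       * wait pass at the end, so the flusher is free again quickly. */
      static void flusher_sync_new(int ninodes)
      {
          for (int i = 0; i < ninodes; i++)
              submit_inode_io(i);     /* rapid submission, no per-inode wait */
      }

      static void sync_caller_wait(int ninodes)
      {
          for (int i = 0; i < ninodes; i++)
              wait_inode_io(i);       /* single wait pass, like wait_sb_inodes() */
      }

      int main(void)
      {
          flusher_sync_old(3);
          flusher_sync_new(3);
          sync_caller_wait(3);
          return 0;
      }
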
  3. 08 May 2013, 1 commit
  4. 12 Jan 2013, 1 commit
    • vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them · 10ee27a0
      Committed by Miao Xie
      writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
      with down_read_trylock() because
      
      - If ->s_umount is write-locked, then the sb is not idle, so
        writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.
      
      - writeback_inodes_sb(_nr)_if_idle() grabs the s_umount lock when it wants to
        start writeback, which can lead to a deadlock during umount. To work around
        this, ext4 and btrfs implemented their own writeback functions instead of
        writeback_inodes_sb(_nr)_if_idle(), but that introduced redundant code; it is
        better to implement a new writeback_inodes_sb(_nr)_if_idle().
      
      The names of these two functions are cumbersome, so rename them to
      try_to_writeback_inodes_sb(_nr).
      
      This idea came from Christoph Hellwig.
      Some code is from the patch of Kamal Mostafa.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      10ee27a0
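
      A user-space pthreads sketch of the trylock idea (illustrative only; the
      kernel uses an rw_semaphore, not a pthread rwlock): if the lock is
      write-held the sb is not idle, so writeback is simply skipped instead of
      blocking against umount.

      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_rwlock_t s_umount = PTHREAD_RWLOCK_INITIALIZER;

      /* Sketch of the try_to_writeback_inodes_sb() idea: if ->s_umount is
       * write-held the sb is not idle, so skip writeback instead of blocking
       * (blocking is what could deadlock against umount). */
      static int try_to_writeback_inodes(void)
      {
          if (pthread_rwlock_tryrdlock(&s_umount) != 0)
              return 0;                       /* write-locked: give up quietly */
          printf("starting writeback under the read lock\n");
          pthread_rwlock_unlock(&s_umount);
          return 1;
      }

      static void *umount_like_writer(void *arg)
      {
          (void)arg;
          pthread_rwlock_wrlock(&s_umount);   /* pretend umount holds the lock */
          sleep(1);
          pthread_rwlock_unlock(&s_umount);
          return NULL;
      }

      int main(void)
      {
          pthread_t t;

          try_to_writeback_inodes();          /* idle sb: writeback runs */

          pthread_create(&t, NULL, umount_like_writer, NULL);
          usleep(100 * 1000);                 /* let the writer grab the lock */
          if (!try_to_writeback_inodes())
              printf("sb busy, writeback skipped\n");

          pthread_join(t, NULL);
          return 0;
      }
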
  5. 12 Dec 2012, 1 commit
  6. 04 Aug 2012, 1 commit
  7. 01 Aug 2012, 1 commit
  8. 06 May 2012, 1 commit
    • writeback: Avoid iput() from flusher thread · 169ebd90
      Committed by Jan Kara
      Doing iput() from the flusher thread (writeback_sb_inodes()) can create problems
      because iput() can do a lot of work - for example truncating the inode if it is
      the last iput on an unlinked file. Some filesystems depend on the flusher thread
      making progress (e.g. because they need to flush delayed-allocation blocks to
      reduce allocation uncertainty), so the flusher thread doing truncation creates
      interesting dependencies and possibilities for deadlocks.
      
      We get rid of iput() in the flusher thread by using the fact that the I_SYNC
      inode flag effectively pins the inode in memory. So if we take care to either
      hold i_lock or have I_SYNC set, we can get away without taking an inode
      reference in writeback_sb_inodes().
      
      As a side effect of these changes, we also fix a possible use-after-free in
      wb_writeback(), where the inode_wait_for_writeback() call could try to
      reacquire i_lock on an inode that had already been freed.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      169ebd90
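
      A toy model of the pinning idea (made-up flags and helpers, not the
      kernel's inode locking): an object marked as under sync cannot be freed,
      so the writer does not need to take and drop a reference around writeback.

      #include <stdio.h>
      #include <stdbool.h>

      struct toy_inode {
          int  ino;
          bool i_sync;     /* stands in for the I_SYNC flag */
          bool freed;
      };

      /* Reclaim/eviction must not free an inode that is under writeback. */
      static bool try_free_inode(struct toy_inode *in)
      {
          if (in->i_sync)
              return false;        /* pinned by the sync flag, leave it alone */
          in->freed = true;
          return true;
      }

      /* Writeback sets the flag instead of taking a reference (no iget/iput),
       * so finishing writeback never has to do heavyweight iput() work. */
      static void writeback_inode(struct toy_inode *in)
      {
          in->i_sync = true;
          printf("writing inode %d; concurrent free %s\n",
                 in->ino, try_free_inode(in) ? "succeeded (bug!)" : "blocked");
          in->i_sync = false;
      }

      int main(void)
      {
          struct toy_inode in = { .ino = 7, .i_sync = false, .freed = false };

          writeback_inode(&in);
          printf("after writeback, free %s\n",
                 try_free_inode(&in) ? "succeeded" : "blocked");
          return 0;
      }
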
  9. 25 Apr 2012, 1 commit
  10. 07 Mar 2012, 1 commit
  11. 11 Jan 2012, 2 commits
    • mm: try to distribute dirty pages fairly across zones · a756cf59
      Committed by Johannes Weiner
      The maximum number of dirty pages that exist in the system at any time is
      determined by a number of pages considered dirtyable and a user-configured
      percentage of those, or an absolute number in bytes.
      
      This number of dirtyable pages is the sum of memory provided by all the
      zones in the system minus their lowmem reserves and high watermarks, so
      that the system can retain a healthy number of free pages without having
      to reclaim dirty pages.
      
      But there is a flaw in that we have a zoned page allocator which does not
      care about the global state but rather the state of individual memory
      zones.  And right now there is nothing that prevents one zone from filling
      up with dirty pages while other zones are spared, which frequently leads
      to situations where kswapd, in order to restore the watermark of free
      pages, does indeed have to write pages from that zone's LRU list.  This
      can interfere so badly with IO from the flusher threads that major
      filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
      already, taking away the VM's only possibility to keep such a zone
      balanced, aside from hoping the flushers will soon clean pages from that
      zone.
      
      Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
      the global limit is to the global amount of dirtyable memory, and try to
      make sure that no single zone receives more than its fair share of the
      globally allowed dirty pages in the first place.  As the number of pages
      considered dirtyable excludes the zones' lowmem reserves and high
      watermarks, the maximum number of dirty pages in a zone is such that the
      zone can always be balanced without requiring page cleaning.
      
      As this is a placement decision in the page allocator and pages are
      dirtied only after the allocation, this patch allows allocators to pass
      __GFP_WRITE when they know in advance that the page will be written to and
      become dirty soon.  The page allocator will then attempt to allocate from
      the first zone of the zonelist - which on NUMA is determined by the task's
      NUMA memory policy - that has not exceeded its dirty limit.
      
      At first glance, it would appear that the diversion to lower zones can
      increase pressure on them, but this is not the case.  With a full high
      zone, allocations will be diverted to lower zones eventually, so it is
      more of a shift in timing of the lower zone allocations.  Workloads that
      previously could fit their dirty pages completely in the higher zone may
      be forced to allocate from lower zones, but the amount of pages that
      "spill over" are limited themselves by the lower zones' dirty constraints,
      and thus unlikely to become a problem.
      
      For now, the problem of unfair dirty page distribution remains for NUMA
      configurations where the zones allowed for allocation are in sum not big
      enough to trigger the global dirty limits, wake up the flusher threads and
      remedy the situation.  Because of this, an allocation that could not
      succeed on any of the considered zones is allowed to ignore the dirty
      limits before going into direct reclaim or even failing the allocation,
      until a future patch changes the global dirty throttling and flusher
      thread activation so that they take individual zone states into account.
      
      			Test results
      
      15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
      40% dirty ratio
      16G USB thumb drive
      10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
      
      		seconds			nr_vmscan_write
      		        (stddev)	       min|     median|        max
      xfs
      vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
      patched:	 550.996( 3.802)	     0.000|      0.000|      0.000
      
      fuse-ntfs
      vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
      patched:	 558.049(17.914)	     0.000|      0.000|     43.000
      
      btrfs
      vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
      patched:	 563.365(11.368)	     0.000|      0.000|   1362.000
      
      ext4
      vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
      patched:	 568.806(17.496)	     0.000|      0.000|      0.000
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a756cf59
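
      To make the arithmetic concrete, a small self-contained sketch with
      made-up numbers (not the mm/page-writeback.c implementation): each zone's
      dirty limit is its dirtyable pages times the dirty ratio, and a
      __GFP_WRITE-style allocation takes the first zone that is still under its
      limit.

      #include <stdio.h>

      struct zone {
          const char   *name;
          unsigned long present_pages;
          unsigned long lowmem_reserve;
          unsigned long high_wmark;
          unsigned long nr_dirty;
      };

      /* Pages in this zone that may hold dirty data at all. */
      static unsigned long zone_dirtyable(const struct zone *z)
      {
          return z->present_pages - z->lowmem_reserve - z->high_wmark;
      }

      /* Per-zone dirty limit: the zone's fair share under a global dirty ratio. */
      static unsigned long zone_dirty_limit(const struct zone *z, unsigned ratio)
      {
          return zone_dirtyable(z) * ratio / 100;
      }

      /* Placement sketch for an allocation known to be dirtied soon: take the
       * first zone in the fallback list that has not exceeded its dirty limit. */
      static const struct zone *pick_zone_for_write(const struct zone *zl, int n,
                                                    unsigned ratio)
      {
          for (int i = 0; i < n; i++)
              if (zl[i].nr_dirty < zone_dirty_limit(&zl[i], ratio))
                  return &zl[i];
          return NULL;   /* all over their limits; the kernel then relaxes the check */
      }

      int main(void)
      {
          struct zone zones[] = {
              { "HighMem", 200000, 0,    2000, 45000 },  /* over its fair share */
              { "Normal",  500000, 8000, 4000, 10000 },
          };
          unsigned ratio = 20;   /* 20%, analogous to vm.dirty_ratio */

          for (int i = 0; i < 2; i++)
              printf("%-8s limit=%lu dirty=%lu\n", zones[i].name,
                     zone_dirty_limit(&zones[i], ratio), zones[i].nr_dirty);

          const struct zone *z = pick_zone_for_write(zones, 2, ratio);
          printf("allocate page-to-be-dirtied from: %s\n", z ? z->name : "none");
          return 0;
      }
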
    • mm/page-writeback.c: make determine_dirtyable_memory static again · 1edf2234
      Committed by Johannes Weiner
      The tracing ring-buffer used this function briefly, but not anymore.
      Make it local to the writeback code again.
      
      Also, move the function so that no forward declaration needs to be
      reintroduced.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1edf2234
  12. 08 Jan 2012, 1 commit
  13. 18 Dec 2011, 2 commits
  14. 31 Oct 2011, 1 commit
    • writeback: Add a 'reason' to wb_writeback_work · 0e175a18
      Committed by Curt Wohlgemuth
      This creates a new 'reason' field in a wb_writeback_work
      structure, which unambiguously identifies who initiates
      writeback activity.  A 'wb_reason' enumeration has been
      added to writeback.h, to enumerate the possible reasons.
      
      The 'writeback_work_class' tracepoint event class and the
      'writeback_queue_io' tracepoints are updated to include the
      symbolic 'reason' in all trace events.
      
      And the 'writeback_inodes_sbXXX' family of routines has had
      a wb_stats parameter added to them, so callers can specify
      why writeback is being started.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      0e175a18
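
      A minimal sketch of the mechanism (illustrative structures; the real
      enumeration lives in writeback.h as described above): the work item
      carries a reason, and the trace output prints it symbolically.

      #include <stdio.h>

      /* Illustrative subset of a wb_reason-style enumeration. */
      enum wb_reason_demo {
          WB_REASON_BACKGROUND,
          WB_REASON_SYNC,
          WB_REASON_PERIODIC,
          WB_REASON_FREE_MORE_MEM,
      };

      static const char *const wb_reason_name[] = {
          [WB_REASON_BACKGROUND]    = "background",
          [WB_REASON_SYNC]          = "sync",
          [WB_REASON_PERIODIC]      = "periodic",
          [WB_REASON_FREE_MORE_MEM] = "free_more_memory",
      };

      struct wb_work_demo {
          unsigned long       nr_pages;
          enum wb_reason_demo reason;     /* what kicked off this writeback */
      };

      /* Stand-in for a tracepoint: print the symbolic reason with the event. */
      static void trace_writeback_queue(const struct wb_work_demo *w)
      {
          printf("writeback_queue: nr_pages=%lu reason=%s\n",
                 w->nr_pages, wb_reason_name[w->reason]);
      }

      int main(void)
      {
          struct wb_work_demo bg = { 1024, WB_REASON_BACKGROUND };
          struct wb_work_demo sy = {  512, WB_REASON_SYNC };

          trace_writeback_queue(&bg);
          trace_writeback_queue(&sy);
          return 0;
      }
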
  15. 03 Oct 2011, 1 commit
  16. 19 Aug 2011, 1 commit
    • squeeze max-pause area and drop pass-good area · bb082295
      Committed by Wu Fengguang
      Revert the pass-good area introduced in ffd1f609 ("writeback:
      introduce max-pause and pass-good dirty limits") and make the max-pause
      area smaller and safer.
      
      This fixes ~30% performance regression in the ext3 data=writeback
      fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
      12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
      
      Using the deadline scheduler also shows a regression, though not as big as
      with CFQ, which suggests we have some write starvation.
      
      The test logs show that
      
      - the disks are sometimes underutilized
      
      - global dirty pages sometimes rush up into the pass-good area for
        several hundred seconds, while in the meantime some bdi dirty pages
        drop to a very low value (bdi_dirty << bdi_thresh).  Then suddenly the
        global dirty pages drop under the global dirty threshold and bdi_dirty
        rushes very high (for example, 2 times higher than bdi_thresh), during
        which time balance_dirty_pages() is not called at all.
      
      So the problems are
      
      1) The random writes progress so slowly that they break the assumption of
         the max-pause logic that "8 pages per 200ms is typically more than
         enough to curb heavy dirtiers".
      
      2) The max-pause logic ignores task_bdi_thresh and thus opens the possibility
         for some bdi's to dirty too many pages, leading to (bdi_dirty >> bdi_thresh)
         for them and then (bdi_thresh >> bdi_dirty) for others.
      
      3) The higher max-pause/pass-good thresholds somehow lead to the bad
         swing of dirty pages.
      
      The fix is to allow a task to dirty slightly over task_bdi_thresh, but
      never to exceed bdi_dirty and/or the global dirty_thresh.
      
      Tests show that it fixed the JBOD regression completely (both behavior
      and performance), while still being able to cut down large pause times
      in balance_dirty_pages() for single-disk cases.
      Reported-by: Li Shaohua <shaohua.li@intel.com>
      Tested-by: Li Shaohua <shaohua.li@intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      bb082295
  17. 10 Jul 2011, 5 commits
    • writeback: scale IO chunk size up to half device bandwidth · 1a12d8bd
      Committed by Wu Fengguang
      Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
      concern of not holding I_SYNC for too long.  (At least, that was the
      comment previously.)  This doesn't make sense now because the only
      time we wait for I_SYNC is if we are calling sync or fsync, and in
      that case we need to write out all of the data anyway.  Previously
      there may have been other code paths that waited on I_SYNC, but not
      any more.					    -- Theodore Ts'o
      
      So remove the MAX_WRITEBACK_PAGES constraint. The writeback chunk size
      will adapt to as much as the storage device can write within 500ms.
      
      XFS is observed to do IO completions in batches, and the batch size is
      equal to the write chunk size. To avoid dirty pages suddenly dropping
      out of balance_dirty_pages()'s dirty control scope and creating large
      fluctuations, the chunk size is also limited to half the control scope.
      
      The balance_dirty_pages() control scope is
      
      	[(background_thresh + dirty_thresh) / 2, dirty_thresh]
      
      which is by default [15%, 20%] of global dirty pages, whose range size
      is dirty_thresh / DIRTY_FULL_SCOPE.
      
      The adaptive write chunk size will be rounded to the nearest 4MB
      boundary.
      
      http://bugzilla.kernel.org/show_bug.cgi?id=13930
      
      CC: Theodore Ts'o <tytso@mit.edu>
      CC: Dave Chinner <david@fromorbit.com>
      CC: Chris Mason <chris.mason@oracle.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      1a12d8bd
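
      A back-of-the-envelope sketch of the sizing rule (simplified constants and
      made-up numbers, not the kernel code): the chunk targets half a second of
      estimated bandwidth, is capped to half the control scope, and is rounded
      to 4MB.

      #include <stdio.h>

      #define MB        (1024UL * 1024)
      #define PAGE_SIZE 4096UL

      /* Half a second worth of writeback at the estimated bandwidth, capped to
       * half of the dirty control scope, rounded to the nearest 4MB. */
      static unsigned long write_chunk_pages(unsigned long bw_bytes_per_sec,
                                             unsigned long dirty_thresh_pages)
      {
          unsigned long chunk = (bw_bytes_per_sec / 2) / PAGE_SIZE;  /* 500ms worth */
          unsigned long scope = dirty_thresh_pages / 4;  /* [(bg+dirty)/2, dirty]
                                                            with bg = dirty/2 */

          if (chunk > scope / 2)
              chunk = scope / 2;                 /* keep within half the scope */

          /* round to the nearest 4MB boundary (expressed in pages) */
          unsigned long four_mb = 4 * MB / PAGE_SIZE;
          chunk = (chunk + four_mb / 2) / four_mb * four_mb;
          return chunk ? chunk : four_mb;
      }

      int main(void)
      {
          /* e.g. a 120MB/s disk with a 1GB global dirty threshold */
          unsigned long pages = write_chunk_pages(120 * MB, 1024 * MB / PAGE_SIZE);
          printf("write chunk: %lu pages (%lu MB)\n",
                 pages, pages * PAGE_SIZE / MB);
          return 0;
      }
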
    • writeback: introduce max-pause and pass-good dirty limits · ffd1f609
      Committed by Wu Fengguang
      The max-pause limit helps to keep the sleep time inside
      balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
      a per-task rate limit of 8 pages/200ms = 160KB/s when the dirty limit is
      exceeded, which is normally enough to stop dirtiers from continuing to
      push the dirty pages high, unless there is a sufficiently large number
      of slow dirtiers (e.g. 500 tasks doing 160KB/s will still sum up to
      80MB/s, exceeding the write bandwidth of a slow disk and hence
      accumulating more and more dirty pages).
      
      The pass-good limit helps to let go of the good bdi's in the presence of
      a blocked bdi (i.e. an NFS server not responding) or a slow USB disk
      which for some reason builds up a large number of initial dirty pages
      that refuse to go away anytime soon.
      
      For example, given two bdi's A and B and the initial state
      
      	bdi_thresh_A = dirty_thresh / 2
      	bdi_thresh_B = dirty_thresh / 2
      	bdi_dirty_A  = dirty_thresh / 2
      	bdi_dirty_B  = dirty_thresh / 2
      
      Then A gets blocked; after a dozen seconds
      
      	bdi_thresh_A = 0
      	bdi_thresh_B = dirty_thresh
      	bdi_dirty_A  = dirty_thresh / 2
      	bdi_dirty_B  = dirty_thresh / 2
      
      The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
      will be effectively throttled by condition (nr_dirty < dirty_thresh).
      This has two problems:
      (1) we lose the protections for light dirtiers
      (2) balance_dirty_pages() effectively becomes IO-less because the
          (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
          for IO, but balance_dirty_pages() loses an important way to break
          out of the loop which leads to more spread out throttle delays.
      
      DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is that
      DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above
      example, while this patch uses the more conservative value 8 so as not to
      surprise people with more dirty pages than expected.
      
      The max-pause limit won't noticeably impact the speed at which dirty pages
      are knocked down when there is a sudden drop of the global/bdi dirty
      thresholds, because the heavy dirtiers will be throttled below 160KB/s,
      which is slow enough. It does help to avoid long dirty throttle delays and
      especially makes light dirtiers more responsive.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      ffd1f609
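
      The per-task arithmetic above can be checked with a tiny calculation
      (illustrative only):

      #include <stdio.h>

      int main(void)
      {
          const unsigned long page_size       = 4096;   /* bytes */
          const unsigned long pages_per_pause = 8;      /* pages allowed per pause */
          const double        max_pause       = 0.200;  /* MAX_PAUSE = 200ms */

          /* Per-task dirtying rate once the dirty limit is exceeded. */
          double per_task = pages_per_pause * page_size / max_pause;  /* bytes/s */
          printf("per-task rate: %.0f KB/s\n", per_task / 1024);

          /* Many slow dirtiers can still add up past a slow disk's bandwidth. */
          int ntasks = 500;
          printf("%d tasks: %.1f MB/s aggregate\n",
                 ntasks, ntasks * per_task / (1024 * 1024));
          return 0;
      }
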
    • writeback: introduce smoothed global dirty limit · c42843f2
      Committed by Wu Fengguang
      The start of a heavyweight application (e.g. KVM) may instantly knock
      down determine_dirtyable_memory() if swap is not enabled or is full.
      global_dirty_limits() and bdi_dirty_limit() will in turn produce global/bdi
      dirty thresholds that are _much_ lower than the global/bdi dirty pages.
      
      balance_dirty_pages() will then heavily throttle all dirtiers including
      the light ones, until the dirty pages drop below the new dirty thresholds.
      During this _deep_ dirty-exceeded state, the system may appear rather
      unresponsive to the users.
      
      About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
      threshold to heavy dirtiers than light ones, and the dirty pages will
      be throttled around the heavy dirtiers' dirty threshold and reasonably
      below the light dirtiers' dirty threshold. In this state, only the heavy
      dirtiers will be throttled and the dirty pages are carefully controlled
      to not exceed the light dirtiers' dirty threshold. However if the
      threshold itself suddenly drops below the number of dirty pages, the
      light dirtiers will get heavily throttled.
      
      So introduce global_dirty_limit for tracking the global dirty threshold
      with policies
      
      - follow downwards slowly
      - follow up in one shot
      
      global_dirty_limit effectively masks out the impact of a sudden drop in
      dirtyable memory. It will be used in the next patch for two new types of
      dirty limits. Note that the new dirty limits do not avoid throttling the
      light dirtiers, but can limit their sleep time to 200ms.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      c42843f2
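
      A small sketch of the "follow up in one shot, follow down slowly" policy
      (the 1/8 decay step is an arbitrary illustrative rate, not the kernel's
      actual update code):

      #include <stdio.h>

      /* The smoothed limit jumps up immediately with the computed threshold,
       * but only creeps down when the threshold collapses. */
      static unsigned long update_global_dirty_limit(unsigned long limit,
                                                     unsigned long thresh)
      {
          if (thresh >= limit)
              return thresh;                    /* follow up in one shot */
          return limit - (limit - thresh) / 8;  /* follow down slowly
                                                   (1/8 per step, illustrative) */
      }

      int main(void)
      {
          unsigned long limit = 0;
          unsigned long thresh_seq[] = { 100000, 100000, 20000, 20000, 20000, 120000 };

          for (unsigned i = 0; i < sizeof(thresh_seq) / sizeof(thresh_seq[0]); i++) {
              limit = update_global_dirty_limit(limit, thresh_seq[i]);
              printf("thresh=%6lu -> smoothed limit=%6lu\n", thresh_seq[i], limit);
          }
          return 0;
      }
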
    • writeback: bdi write bandwidth estimation · e98be2d5
      Committed by Wu Fengguang
      The estimated value will start at 100MB/s and adapt to the real
      bandwidth within seconds.
      
      It tries to update the bandwidth only when the disk is fully utilized.
      Any inactive period of more than one second will be skipped.
      
      The estimated bandwidth reflects how fast the device can write out when
      _fully utilized_, and won't drop to 0 when it goes idle. The value
      remains constant while the disk is idle. At busy write time, fluctuations
      aside, it will also remain high unless knocked down by concurrent reads
      that compete with the async writes for disk time and bandwidth.
      
      The estimation is not done purely in the flusher because there is no
      guarantee that write_cache_pages() will return in time to update the
      bandwidth.
      
      The bdi->avg_write_bandwidth smoothing is very effective for filtering
      out sudden spikes, however it may be a little biased in the long term.
      
      The overheads are low because the bdi bandwidth update only occurs at
      200ms intervals.
      
      The 200ms update interval is suitable, because it's not possible to get
      the real instantaneous bandwidth at all, due to large fluctuations.
      
      NFS commits can be as large as seconds' worth of data. One XFS
      completion may be as large as half a second's worth of data if we are
      going to increase the write chunk to half a second's worth of data. In
      ext4, fluctuations with a period of around 5 seconds are observed. And
      there is another pattern of irregular periods of up to 20 seconds on SSD
      tests.
      
      That's why we are not only doing the estimation at 200ms intervals, but
      also averaging the samples over a period of 3 seconds, and then going
      further to do another level of smoothing in avg_write_bandwidth.
      
      CC: Li Shaohua <shaohua.li@intel.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e98be2d5
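
      A self-contained sketch of the layered smoothing (made-up samples and
      weights, not the kernel's update code): raw 200ms samples are averaged
      over roughly 3 seconds, then smoothed again into a long-term average, and
      idle periods are skipped.

      #include <stdio.h>

      #define NSAMPLES 20

      int main(void)
      {
          /* Raw per-200ms throughput samples in MB/s, with spikes and a stall. */
          double sample[NSAMPLES] = { 100, 110, 95, 400, 90, 105, 0, 0, 100, 98,
                                      102, 97, 350, 99, 101, 96, 104, 100, 95, 103 };

          double write_bw = 100.0;   /* estimate, started at 100MB/s */
          double avg_bw   = 100.0;   /* second-level smoothing */

          for (int i = 0; i < NSAMPLES; i++) {
              if (sample[i] == 0.0)
                  continue;          /* idle/inactive period: skip, keep estimate */

              /* ~3s averaging window: 15 old parts + 1 new part per 200ms sample. */
              write_bw = (write_bw * 15 + sample[i]) / 16;

              /* Another level of smoothing to filter the remaining spikes. */
              avg_bw = (avg_bw * 7 + write_bw) / 8;

              printf("sample=%6.1f  write_bw=%6.1f  avg_bw=%6.1f\n",
                     sample[i], write_bw, avg_bw);
          }
          return 0;
      }
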
    • writeback: make writeback_control.nr_to_write straight · d46db3d5
      Committed by Wu Fengguang
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of a
      single file, but we keep abusing it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      This immediately cleans things up: e.g. suddenly wbc.nr_to_write vs.
      work->nr_pages starts to make sense, and instead of saving and restoring
      pages_skipped in writeback_sb_inodes() it can always start with a clean
      zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      quota remains in wbc->nr_to_write.
      Acked-by: Jan Kara <jack@suse.cz>
      Proposed-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      d46db3d5
  18. 08 Jun 2011, 6 commits
    • writeback: remove .nonblocking and .encountered_congestion · 846d5a09
      Committed by Wu Fengguang
      Remove two unused struct writeback_control fields:
      
      	.encountered_congestion	(completely unused)
      	.nonblocking		(never set; checked/shown in XFS, NFS and btrfs)
      
      The .for_background check in nfs_write_inode() is also removed btw,
      as .for_background implies WB_SYNC_NONE.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Proposed-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      846d5a09
    • writeback: remove writeback_control.more_io · b7a2441f
      Committed by Wu Fengguang
      When wbc.more_io was first introduced, it indicated whether there was at
      least one superblock whose s_more_io contained more IO work. Now with
      per-bdi writeback, it can be replaced with a simple b_more_io test.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      b7a2441f
    • writeback: avoid extra sync work at enqueue time · e185dda8
      Committed by Wu Fengguang
      This removes writeback_control.wb_start and does more straightforward
      sync livelock prevention by setting .older_than_this to prevent extra
      inodes from being enqueued in the first place.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e185dda8
    • writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Committed by Christoph Hellwig
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata-heavy
      workloads.  It won't help for single-filesystem workloads, for
      which we'll need the I/O-less balance_dirty_pages(), but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contentions to 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      f758eeab
    • writeback: introduce writeback_control.inodes_written · cb9bd115
      Committed by Wu Fengguang
      The flusher works on dirty inodes in batches, and may quit prematurely
      if the batch of inodes happens to be only metadata-dirtied: in this case
      wbc->nr_to_write won't be decreased at all, which stands for "no pages
      written" but is also misinterpreted as "no progress".
      
      So introduce writeback_control.inodes_written to count the inodes that get
      cleaned from the VFS point of view.  A non-zero value means there is some
      progress on writeback, in which case more writeback can be tried.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      cb9bd115
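
      A toy model of the progress check (hypothetical fields, not the real
      struct writeback_control): counting cleaned inodes distinguishes "wrote no
      pages because the inodes were metadata-only dirty" from "made no progress
      at all".

      #include <stdio.h>
      #include <stdbool.h>

      struct wbc_demo {
          long          nr_to_write;     /* page budget */
          unsigned long inodes_written;  /* inodes cleaned, even if 0 pages written */
      };

      struct inode_demo {
          int  ino;
          long dirty_pages;              /* 0 => metadata-only dirty */
      };

      static void writeback_one(struct wbc_demo *wbc, struct inode_demo *in)
      {
          long wrote = in->dirty_pages;  /* pretend we wrote all of them */

          wbc->nr_to_write -= wrote;
          in->dirty_pages = 0;
          wbc->inodes_written++;         /* this inode is clean from the VFS POV */
          printf("inode %d: wrote %ld pages\n", in->ino, wrote);
      }

      int main(void)
      {
          struct inode_demo batch[] = { {1, 0}, {2, 0}, {3, 0} };  /* metadata-only */
          struct wbc_demo wbc = { .nr_to_write = 1024, .inodes_written = 0 };

          for (int i = 0; i < 3; i++)
              writeback_one(&wbc, &batch[i]);

          /* Old heuristic: nr_to_write unchanged => "no progress", quit too early.
           * New heuristic: inodes_written != 0 => progress, keep going. */
          bool progress = (wbc.nr_to_write < 1024) || wbc.inodes_written;
          printf("pages written: %ld, inodes written: %lu, progress: %s\n",
                 1024 - wbc.nr_to_write, wbc.inodes_written, progress ? "yes" : "no");
          return 0;
      }
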
    • writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Committed by Wu Fengguang
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      that may be due to some "redirty_tail() after pages_skipped" effect,
      which is by no means a guarantee for _all_ the filesystems.
      
      Note that writeback_inodes_sb() is called not only by sync(); all callers
      are treated the same because the other callers also need livelock
      prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      Acked-by: Jan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      6e6938b6
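
      A small model of tag-then-write livelock avoidance (plain arrays instead
      of the radix-tree TOWRITE tag used by the real write_cache_pages()): only
      pages dirty at the start of the sweep are written, so pages dirtied during
      the sweep cannot prolong it.

      #include <stdio.h>
      #include <stdbool.h>

      #define NPAGES 8

      static bool dirty[NPAGES];
      static bool towrite[NPAGES];   /* stand-in for the TOWRITE radix-tree tag */

      static void tag_pages_for_writeback(void)
      {
          for (int i = 0; i < NPAGES; i++)
              towrite[i] = dirty[i];           /* snapshot what is dirty *now* */
      }

      static int write_tagged_pages(void)
      {
          int written = 0;
          for (int i = 0; i < NPAGES; i++) {
              if (!towrite[i])
                  continue;
              towrite[i] = false;
              dirty[i] = false;
              written++;
              if (i == 2)
                  dirty[5] = true;             /* concurrent re-dirty mid-sweep */
          }
          return written;
      }

      int main(void)
      {
          for (int i = 0; i < 4; i++)
              dirty[i] = true;                 /* initially dirty pages 0..3 */

          tag_pages_for_writeback();
          int written = write_tagged_pages();

          /* Page 5, dirtied mid-sweep, is left for the next round instead of
           * livelocking this one. */
          printf("wrote %d pages; page 5 still dirty: %s\n",
                 written, dirty[5] ? "yes" : "no");
          return 0;
      }
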
  19. 25 Mar 2011, 2 commits
  20. 29 Oct 2010, 1 commit
    • Add new functions for triggering inode writeback · 3259f8be
      Committed by Chris Mason
      When btrfs is running low on metadata space, it needs to force delayed
      allocation pages to disk.  It currently does this with a suboptimal walk
      of a private list of inodes with delayed allocation, and it would be
      much better if we used the generic flusher threads.
      
      writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
      thread to start IO on all the dirty pages in the FS before it returns.
      This adds variants of writeback_inodes_sb* that allow the caller to
      control how many pages get sent down.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      3259f8be
  21. 28 Oct 2010, 1 commit
    • ext4: implement writeback livelock avoidance using page tagging · 5b41d924
      Committed by Eric Sandeen
      This is analogous to Jan Kara's commit
      f446daae
      ("mm: implement writeback livelock avoidance using page tagging"),
      
      but since we forked write_cache_pages, we need to reimplement
      it there (and in ext4_da_writepages, since range_cyclic handling
      was moved there).
      
      If you start a large buffered IO to a file, and then set
      fsync after it, you'll find that fsync does not complete
      until the other IO stops.
      
      If you continue re-dirtying the file (say, putting dd
      with conv=notrunc in a loop), when fsync finally completes
      (after all IO is done), it reports via tracing that
      it has written many more pages than the file contains;
      in other words it has synced and re-synced pages in
      the file multiple times.
      
      This then leads to problems with our writeback_index
      update, since it advances it by pages written, and
      essentially sets writeback_index off the end of the
      file...
      
      With the following patch, we only sync as much as was
      dirty at the time of the sync.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      5b41d924
  22. 27 Oct 2010, 1 commit
  23. 26 Oct 2010, 1 commit
    • fs: Implement lazy LRU updates for inodes · 9e38d86f
      Committed by Nick Piggin
      Convert the inode LRU to use lazy updates to reduce lock and
      cacheline traffic.  We avoid moving inodes around in the LRU list
      during iget/iput operations so these frequent operations don't need
      to access the LRUs. Instead, we defer the refcount checks to
      reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
      reclaim that iget has touched the inode in the past. This means that
      only reclaim should be touching the LRU with any frequency, hence
      significantly reducing lock acquisitions and the amount of contention
      on LRU updates.
      
      This also removes the inode_in_use list, which means we now only
      have one list for tracking the inode LRU status. This makes it much
      simpler to split out the LRU list operations under its own lock.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9e38d86f
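
      A toy model of the lazy LRU idea (plain arrays, not the kernel's inode
      lists): releasing an inode only sets a referenced flag instead of touching
      the LRU, and reclaim gives flagged inodes a second chance.

      #include <stdio.h>
      #include <stdbool.h>

      #define NINODES 4

      struct toy_inode {
          int  ino;
          bool referenced;   /* stands in for I_REFERENCED */
      };

      /* iput()-time: no LRU manipulation, just mark the inode as recently used. */
      static void inode_put(struct toy_inode *in)
      {
          in->referenced = true;
      }

      /* Reclaim-time: a flagged inode gets its flag cleared and is skipped once;
       * only inodes not touched since the last pass are reclaimed. */
      static int reclaim_pass(struct toy_inode *lru, int n)
      {
          int reclaimed = 0;
          for (int i = 0; i < n; i++) {
              if (lru[i].referenced) {
                  lru[i].referenced = false;   /* second chance */
                  continue;
              }
              printf("reclaim inode %d\n", lru[i].ino);
              reclaimed++;
          }
          return reclaimed;
      }

      int main(void)
      {
          struct toy_inode lru[NINODES] = { {1, false}, {2, false}, {3, false}, {4, false} };

          inode_put(&lru[1]);                  /* inode 2 was recently used */
          printf("pass 1 reclaimed %d\n", reclaim_pass(lru, NINODES));
          printf("pass 2 reclaimed %d\n", reclaim_pass(lru, NINODES));
          return 0;
      }
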
  24. 12 Aug 2010, 1 commit
  25. 06 Jul 2010, 2 commits
  26. 09 Jun 2010, 1 commit
    • writeback: pay attention to wbc->nr_to_write in write_cache_pages · 0b564927
      Committed by Dave Chinner
      If a filesystem writes more than one page in ->writepage, write_cache_pages
      fails to notice this and continues to attempt writeback when wbc->nr_to_write
      has gone negative - this trace was captured from XFS:
      
          wbc_writeback_start: towrt=1024
          wbc_writepage: towrt=1024
          wbc_writepage: towrt=0
          wbc_writepage: towrt=-1
          wbc_writepage: towrt=-5
          wbc_writepage: towrt=-21
          wbc_writepage: towrt=-85
      
      This has adverse effects on filesystem writeback behaviour. write_cache_pages()
      needs to terminate after a certain number of pages are written, not after a
      certain number of calls to ->writepage are made.  This is a regression
      introduced by 17bc6c30 ("vfs: Add
      no_nrwrite_index_update writeback control flag"), but cannot be reverted
      directly due to subsequent bug fixes that have gone in on top of it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b564927
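
      A compact model of the fix (a simplified loop, not the real
      write_cache_pages()): the budget is decremented by the number of pages
      actually written and the walk stops once it is consumed, rather than
      counting ->writepage calls.

      #include <stdio.h>

      /* Pretend ->writepage sometimes writes a cluster of pages, as XFS does. */
      static long writepage_like(int call)
      {
          return (call % 3 == 0) ? 16 : 1;   /* occasionally write 16 pages at once */
      }

      static long write_cache_pages_model(long nr_to_write)
      {
          long total = 0;
          for (int call = 0; ; call++) {
              long wrote = writepage_like(call);

              total += wrote;
              nr_to_write -= wrote;           /* account pages, not ->writepage calls */
              if (nr_to_write <= 0)
                  break;                      /* stop once the budget is consumed */
          }
          return total;
      }

      int main(void)
      {
          long wrote = write_cache_pages_model(1024);
          printf("wrote %ld pages against a budget of 1024\n", wrote);
          return 0;
      }
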
  27. 01 Jun 2010, 1 commit