1. 04 8月, 2012 1 次提交
    • A
      vfs: kill write_super and sync_supers · f0cd2dbb
      Artem Bityutskiy 提交于
      Finally we can kill the 'sync_supers' kernel thread along with the
      '->write_super()' superblock operation because all the users are gone.
      Now every file-system is supposed to self-manage own superblock and
      its dirty state.
      
      The nice thing about killing this thread is that it improves power management.
      Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
      every 5 seconds no matter what - even if there were no dirty superblocks and
      even if there were no file-systems using this service (e.g., btrfs and
      journalled ext4 do not need it). So it was wasting power most of the time. And
      because the thread was in the core of the kernel, all systems had to have it.
      So I am quite happy to make it go away.
      
      Interestingly, this thread is a left-over from the pdflush kernel thread which
      was a self-forking kernel thread responsible for all the write-back in old
      Linux kernels. It was turned into per-block device BDI threads, and
      'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.
      Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      f0cd2dbb
  2. 09 6月, 2012 2 次提交
  3. 06 5月, 2012 1 次提交
    • F
      writeback: initialize global_dirty_limit · 68809c71
      Fengguang Wu 提交于
      This prevents global_dirty_limit from remaining 0 (the initial value)
      for long time, since it's only updated in update_dirty_limit() when
      above the dirty freerun area.
      
      It will avoid unexpected consequences when some random code use it as a
      convenient approximation of the global dirty threshold.
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      68809c71
  4. 14 4月, 2012 1 次提交
  5. 22 3月, 2012 2 次提交
    • A
      mm: export dirty_writeback_interval · 91913a29
      Artem Bityutskiy 提交于
      Export 'dirty_writeback_interval' to make it visible to
      file-systems. We are going to push superblock management down to
      file-systems and get rid of the 'sync_supers' kernel thread completly.
      Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      91913a29
    • F
      mm: use global_dirty_limit in throttle_vm_writeout() · 47a13333
      Fengguang Wu 提交于
      When starting a memory hog task, a desktop box w/o swap is found to go
      unresponsive for a long time.  It's solely caused by lots of congestion
      waits in throttle_vm_writeout():
      
       gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
       gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
                 gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
                 gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
                  Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
                  Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
      
      The root cause is, the dirty threshold is knocked down a lot by the memory
      hog task.  Fixed by using global_dirty_limit which decreases gradually on
      such events and can guarantee we stay above (the also decreasing) nr_dirty
      in the progress of following down to the new dirty threshold.
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47a13333
  6. 11 1月, 2012 4 次提交
    • J
      mm: try to distribute dirty pages fairly across zones · a756cf59
      Johannes Weiner 提交于
      The maximum number of dirty pages that exist in the system at any time is
      determined by a number of pages considered dirtyable and a user-configured
      percentage of those, or an absolute number in bytes.
      
      This number of dirtyable pages is the sum of memory provided by all the
      zones in the system minus their lowmem reserves and high watermarks, so
      that the system can retain a healthy number of free pages without having
      to reclaim dirty pages.
      
      But there is a flaw in that we have a zoned page allocator which does not
      care about the global state but rather the state of individual memory
      zones.  And right now there is nothing that prevents one zone from filling
      up with dirty pages while other zones are spared, which frequently leads
      to situations where kswapd, in order to restore the watermark of free
      pages, does indeed have to write pages from that zone's LRU list.  This
      can interfere so badly with IO from the flusher threads that major
      filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
      already, taking away the VM's only possibility to keep such a zone
      balanced, aside from hoping the flushers will soon clean pages from that
      zone.
      
      Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
      the global limit is to the global amount of dirtyable memory, and try to
      make sure that no single zone receives more than its fair share of the
      globally allowed dirty pages in the first place.  As the number of pages
      considered dirtyable excludes the zones' lowmem reserves and high
      watermarks, the maximum number of dirty pages in a zone is such that the
      zone can always be balanced without requiring page cleaning.
      
      As this is a placement decision in the page allocator and pages are
      dirtied only after the allocation, this patch allows allocators to pass
      __GFP_WRITE when they know in advance that the page will be written to and
      become dirty soon.  The page allocator will then attempt to allocate from
      the first zone of the zonelist - which on NUMA is determined by the task's
      NUMA memory policy - that has not exceeded its dirty limit.
      
      At first glance, it would appear that the diversion to lower zones can
      increase pressure on them, but this is not the case.  With a full high
      zone, allocations will be diverted to lower zones eventually, so it is
      more of a shift in timing of the lower zone allocations.  Workloads that
      previously could fit their dirty pages completely in the higher zone may
      be forced to allocate from lower zones, but the amount of pages that
      "spill over" are limited themselves by the lower zones' dirty constraints,
      and thus unlikely to become a problem.
      
      For now, the problem of unfair dirty page distribution remains for NUMA
      configurations where the zones allowed for allocation are in sum not big
      enough to trigger the global dirty limits, wake up the flusher threads and
      remedy the situation.  Because of this, an allocation that could not
      succeed on any of the considered zones is allowed to ignore the dirty
      limits before going into direct reclaim or even failing the allocation,
      until a future patch changes the global dirty throttling and flusher
      thread activation so that they take individual zone states into account.
      
      			Test results
      
      15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
      40% dirty ratio
      16G USB thumb drive
      10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
      
      		seconds			nr_vmscan_write
      		        (stddev)	       min|     median|        max
      xfs
      vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
      patched:	 550.996( 3.802)	     0.000|      0.000|      0.000
      
      fuse-ntfs
      vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
      patched:	 558.049(17.914)	     0.000|      0.000|     43.000
      
      btrfs
      vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
      patched:	 563.365(11.368)	     0.000|      0.000|   1362.000
      
      ext4
      vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
      patched:	 568.806(17.496)	     0.000|      0.000|      0.000
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a756cf59
    • J
      mm: writeback: cleanups in preparation for per-zone dirty limits · ccafa287
      Johannes Weiner 提交于
      The next patch will introduce per-zone dirty limiting functions in
      addition to the traditional global dirty limiting.
      
      Rename determine_dirtyable_memory() to global_dirtyable_memory() before
      adding the zone-specific version, and fix up its documentation.
      
      Also, move the functions to determine the dirtyable memory and the
      function to calculate the dirty limit based on that together so that their
      relationship is more apparent and that they can be commented on as a
      group.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NMel Gorman <mel@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ccafa287
    • J
      mm: exclude reserved pages from dirtyable memory · ab8fabd4
      Johannes Weiner 提交于
      Per-zone dirty limits try to distribute page cache pages allocated for
      writing across zones in proportion to the individual zone sizes, to reduce
      the likelihood of reclaim having to write back individual pages from the
      LRU lists in order to make progress.
      
      This patch:
      
      The amount of dirtyable pages should not include the full number of free
      pages: there is a number of reserved pages that the page allocator and
      kswapd always try to keep free.
      
      The closer (reclaimable pages - dirty pages) is to the number of reserved
      pages, the more likely it becomes for reclaim to run into dirty pages:
      
             +----------+ ---
             |   anon   |  |
             +----------+  |
             |          |  |
             |          |  -- dirty limit new    -- flusher new
             |   file   |  |                     |
             |          |  |                     |
             |          |  -- dirty limit old    -- flusher old
             |          |                        |
             +----------+                       --- reclaim
             | reserved |
             +----------+
             |  kernel  |
             +----------+
      
      This patch introduces a per-zone dirty reserve that takes both the lowmem
      reserve as well as the high watermark of the zone into account, and a
      global sum of those per-zone values that is subtracted from the global
      amount of dirtyable pages.  The lowmem reserve is unavailable to page
      cache allocations and kswapd tries to keep the high watermark free.  We
      don't want to end up in a situation where reclaim has to clean pages in
      order to balance zones.
      
      Not treating reserved pages as dirtyable on a global level is only a
      conceptual fix.  In reality, dirty pages are not distributed equally
      across zones and reclaim runs into dirty pages on a regular basis.
      
      But it is important to get this right before tackling the problem on a
      per-zone level, where the distance between reclaim and the dirty pages is
      mostly much smaller in absolute numbers.
      
      [akpm@linux-foundation.org: fix highmem build]
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab8fabd4
    • J
      mm/page-writeback.c: make determine_dirtyable_memory static again · 1edf2234
      Johannes Weiner 提交于
      The tracing ring-buffer used this function briefly, but not anymore.
      Make it local to the writeback code again.
      
      Also, move the function so that no forward declaration needs to be
      reintroduced.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1edf2234
  7. 04 1月, 2012 1 次提交
  8. 18 12月, 2011 8 次提交
    • W
      writeback: balanced_rate cannot exceed write bandwidth · bdaac490
      Wu Fengguang 提交于
      Add an upper limit to balanced_rate according to the below inequality.
      This filters out some rare but huge singular points, which at least
      enables more readable gnuplot figures.
      
      When there are N dd dirtiers,
      
      	balanced_dirty_ratelimit = write_bw / N
      
      So it holds that
      
      	balanced_dirty_ratelimit <= write_bw
      
      The singular points originate from dirty_rate in the below formular:
      
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
      where
      	dirty_rate = (number of page dirties in the past 200ms) / 200ms
      
      In the extreme case, if all dd tasks suddenly get blocked on something
      else and hence no pages are dirtied at all, dirty_rate will be 0 and
      balanced_dirty_ratelimit will be inf. This could happen in reality.
      
      Note that these huge singular points are not a real threat, since they
      are _guaranteed_ to be filtered out by the
      	min(balanced_dirty_ratelimit, task_ratelimit)
      line in bdi_update_dirty_ratelimit(). task_ratelimit is based on the
      number of dirty pages, which will never _suddenly_ fly away like
      balanced_dirty_ratelimit. So any weirdly large balanced_dirty_ratelimit
      will be cut down to the level of task_ratelimit.
      
      There won't be tiny singular points though, as long as the dirty pages
      lie inside the dirty throttling region (above the freerun region).
      Because there the dd tasks will be throttled by balanced_dirty_pages()
      and won't be able to suddenly dirty much more pages than average.
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      bdaac490
    • W
      writeback: do strict bdi dirty_exceeded · 82791940
      Wu Fengguang 提交于
      This helps to reduce dirty throttling polls and hence CPU overheads.
      
      bdi->dirty_exceeded typically only helps when suddenly starting 100+
      dd's on a disk, in which case the dd's may need to poll
      balance_dirty_pages() earlier than tsk->nr_dirtied_pause.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      82791940
    • W
      writeback: avoid tiny dirty poll intervals · 5b9b3574
      Wu Fengguang 提交于
      The LKP tests see big 56% regression for the case fio_mmap_randwrite_64k.
      Shaohua manages to root cause it to be the much smaller dirty pause times
      and hence much more frequent invocations to the IO-less balance_dirty_pages().
      Since fio_mmap_randwrite_64k effectively contains both reads and writes,
      the more frequent pauses triggered more idling in the cfq IO scheduler.
      
      The solution is to increase pause time all the way up to the max 200ms
      in this case, which is found to restore most performance. This will help
      reduce CPU overheads in other cases, too.
      
      Note that I don't expect many performance critical workloads to run this
      access pattern: the mmap read-on-write is rather inefficient and could
      be avoided by doing normal writes syscalls.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reported-by: NLi Shaohua <shaohua.li@intel.com>
      Tested-by: NLi Shaohua <shaohua.li@intel.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      5b9b3574
    • W
      writeback: max, min and target dirty pause time · 7ccb9ad5
      Wu Fengguang 提交于
      Control the pause time and the call intervals to balance_dirty_pages()
      with three parameters:
      
      1) max_pause, limited by bdi_dirty and MAX_PAUSE
      
      2) the target pause time, grows with the number of dd tasks
         and is normally limited by max_pause/2
      
      3) the minimal pause, set to half the target pause
         and is used to skip short sleeps and accumulate them into bigger ones
      
      The typical behaviors after patch:
      
      - if ever task_ratelimit is far below dirty_ratelimit, the pause time
        will remain constant at max_pause and nr_dirtied_pause will be
        fluctuating with task_ratelimit
      
      - in the normal cases, nr_dirtied_pause will remain stable (keep in the
        same pace with dirty_ratelimit) and the pause time will be fluctuating
        with task_ratelimit
      
      In summary, someone has to fluctuate with task_ratelimit, because
      
      	task_ratelimit = nr_dirtied_pause / pause
      
      We normally prefer a stable nr_dirtied_pause, until reaching max_pause.
      
      The notable behavior changes are:
      
      - in stable workloads, there will no longer be sudden big trajectory
        switching of nr_dirtied_pause as concerned by Peter. It will be as
        smooth as dirty_ratelimit and changing proportionally with it (as
        always, assuming bdi bandwidth does not fluctuate across 2^N lines,
        otherwise nr_dirtied_pause will show up in 2+ parallel trajectories)
      
      - in the rare cases when something keeps task_ratelimit far below
        dirty_ratelimit, the smoothness can no longer be retained and
        nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a
        (not that destructive but still not good) bug that
      	  dirty_ratelimit gets brought down undesirably
      	  <= balanced_dirty_ratelimit is under estimated
      	  <= weakly executed task_ratelimit
      	  <= pause goes too large and gets trimmed down to max_pause
      	  <= nr_dirtied_pause (based on dirty_ratelimit) is set too large
      	  <= dirty_ratelimit being much larger than task_ratelimit
      
      - introduce min_pause to avoid small pause sleeps
      
      - when pause is trimmed down to max_pause, try to compensate it at the
        next pause time
      
      The "refactor" type of changes are:
      
      The max_pause equation is slightly transformed to make it slightly more
      efficient.
      
      We now scale target_pause by (N * 10ms) on 2^N concurrent tasks, which
      is effectively equal to the original scaling max_pause by (N * 20ms)
      because the original code does implicit target_pause ~= max_pause / 2.
      Based on the same implicit ratio, target_pause starts with 10ms on 1 dd.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      7ccb9ad5
    • W
      writeback: dirty ratelimit - think time compensation · 83712358
      Wu Fengguang 提交于
      Compensate the task's think time when computing the final pause time,
      so that ->dirty_ratelimit can be executed accurately.
      
              think time := time spend outside of balance_dirty_pages()
      
      In the rare case that the task slept longer than the 200ms period time
      (result in negative pause time), the sleep time will be compensated in
      the following periods, too, if it's less than 1 second.
      
      Accumulated errors are carefully avoided as long as the max pause area
      is not hitted.
      
      Pseudo code:
      
              period = pages_dirtied / task_ratelimit;
              think = jiffies - dirty_paused_when;
              pause = period - think;
      
      1) normal case: period > think
      
              pause = period - think
              dirty_paused_when = jiffies + pause
              nr_dirtied = 0
      
                                   period time
                    |===============================>|
                        think time      pause time
                    |===============>|==============>|
              ------|----------------|---------------|------------------------
              dirty_paused_when   jiffies
      
      2) no pause case: period <= think
      
              don't pause; reduce future pause time by:
              dirty_paused_when += period
              nr_dirtied = 0
      
                                 period time
                    |===============================>|
                                        think time
                    |===================================================>|
              ------|--------------------------------+-------------------|----
              dirty_paused_when                                       jiffies
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      83712358
    • W
      writeback: fix dirtied pages accounting on redirty · 2f800fbd
      Wu Fengguang 提交于
      De-account the accumulative dirty counters on page redirty.
      
      Page redirties (very common in ext4) will introduce mismatch between
      counters (a) and (b)
      
      a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
      b) NR_WRITTEN, BDI_WRITTEN
      
      This will introduce systematic errors in balanced_rate and result in
      dirty page position errors (ie. the dirty pages are no longer balanced
      around the global/bdi setpoints).
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      2f800fbd
    • W
      writeback: fix dirtied pages accounting on sub-page writes · d3bc1fef
      Wu Fengguang 提交于
      When dd in 512bytes, generic_perform_write() calls
      balance_dirty_pages_ratelimited() 8 times for the same page, but
      obviously the page is only dirtied once.
      
      Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      d3bc1fef
    • W
      writeback: charge leaked page dirties to active tasks · 54848d73
      Wu Fengguang 提交于
      It's a years long problem that a large number of short-lived dirtiers
      (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
      (eg. dd) as well as pushing the dirty pages to the global hard limit.
      
      The solution is to charge the pages dirtied by the exited gcc to the
      other random dirtying tasks. It sounds not perfect, however should
      behave good enough in practice, seeing as that throttled tasks aren't
      actually running so those that are running are more likely to pick it up
      and get throttled, therefore promoting an equal spread.
      
      Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NRandy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      54848d73
  9. 08 12月, 2011 3 次提交
    • W
      writeback: set max_pause to lowest value on zero bdi_dirty · 82e230a0
      Wu Fengguang 提交于
      Some trace shows lots of bdi_dirty=0 lines where it's actually some
      small value if w/o the accounting errors in the per-cpu bdi stats.
      
      In this case the max pause time should really be set to the smallest
      (non-zero) value to avoid IO queue underrun and improve throughput.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      82e230a0
    • W
      writeback: permit through good bdi even when global dirty exceeded · c5c6343c
      Wu Fengguang 提交于
      On a system with 1 local mount and 1 NFS mount, if the NFS server
      becomes not responding when dd to the NFS mount, the NFS dirty pages may
      exceed the global dirty limit and _every_ task involving writing will be
      blocked. The whole system appears unresponsive.
      
      The workaround is to permit through the bdi's that only has a small
      number of dirty pages. The number chosen (bdi_stat_error pages) is not
      enough to enable the local disk to run in optimal throughput, however is
      enough to make the system responsive on a broken NFS mount. The user can
      then kill the dirtiers on the NFS mount and increase the global dirty
      limit to bring up the local disk's throughput.
      
      It risks allowing dirty pages to grow much larger than the global dirty
      limit when there are 1000+ mounts, however that's very unlikely to happen,
      especially in low memory profiles.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      c5c6343c
    • W
      writeback: comment on the bdi dirty threshold · aed21ad2
      Wu Fengguang 提交于
      We do "floating proportions" to let active devices to grow its target
      share of dirty pages and stalled/inactive devices to decrease its target
      share over time.
      
      It works well except in the case of "an inactive disk suddenly goes
      busy", where the initial target share may be too small. To mitigate
      this, bdi_position_ratio() has the below line to raise a small
      bdi_thresh when it's safe to do so, so that the disk be feed with enough
      dirty pages for efficient IO and in turn fast rampup of bdi_thresh:
      
              bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
      
      balance_dirty_pages() normally does negative feedback control which
      adjusts ratelimit to balance the bdi dirty pages around the target.
      In some extreme cases when that is not enough, it will have to block
      the tasks completely until the bdi dirty pages drop below bdi_thresh.
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      aed21ad2
  10. 17 11月, 2011 2 次提交
    • W
      writeback: remove vm_dirties and task->dirties · 468e6a20
      Wu Fengguang 提交于
      They are not used any more.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      468e6a20
    • W
      writeback: hard throttle 1000+ dd on a slow USB stick · 1df64719
      Wu Fengguang 提交于
      The sleep based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
      on every 1 4KB-page, which means it cannot throttle a task under
      4KB/200ms=20KB/s. So when there are more than 512 dd writing to a
      10MB/s USB stick, its bdi dirty pages could grow out of control.
      
      Even if we can increase MAX_PAUSE, the minimal (task_ratelimit = 1)
      means a limit of 4KB/s.
                                                             
      They can eventually be safeguarded by the global limit check 
      (nr_dirty < dirty_thresh). However if someone is also writing to an 
      HDD at the same time, it'll get poor HDD write performance.
                                                             
      We at least want to maintain good write performance for other devices
      when one device is attacked by some "massive parallel" workload, or
      suffers from slow write bandwidth, or somehow get stalled due to some 
      error condition (eg. NFS server not responding).
      
      For a stalled device, we need to completely block its dirtiers, too,
      before its bdi dirty pages grow all the way up to the global limit and
      leave no space for the other functional devices.
      
      So change the loop exit condition to
      
      	/*
      	 * Always enforce global dirty limit; also enforce bdi dirty limit
      	 * if the normal max_pause sleeps cannot keep things under control.
      	 */
      	if (nr_dirty < dirty_thresh &&
      	    (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
      		break;
      
      which can be further simplified to
      
      	if (task_ratelimit)
      		break;
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      1df64719
  11. 16 11月, 2011 1 次提交
  12. 07 11月, 2011 1 次提交
  13. 01 11月, 2011 1 次提交
  14. 31 10月, 2011 4 次提交
  15. 11 10月, 2011 1 次提交
  16. 03 10月, 2011 7 次提交
    • W
      writeback: dirty position control - bdi reserve area · 8927f66c
      Wu Fengguang 提交于
      Keep a minimal pool of dirty pages for each bdi, so that the disk IO
      queues won't underrun. Also gently increase a small bdi_thresh to avoid
      it stuck in 0 for some light dirtied bdi.
      
      It's particularly useful for JBOD and small memory system.
      
      It may result in (pos_ratio > 1) at the setpoint and push the dirty
      pages high. This is more or less intended because the bdi is in the
      danger of IO queue underflow.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      8927f66c
    • W
      writeback: control dirty pause time · 57fc978c
      Wu Fengguang 提交于
      The dirty pause time shall ultimately be controlled by adjusting
      nr_dirtied_pause, since there is relationship
      
      	pause = pages_dirtied / task_ratelimit
      
      Assuming
      
      	pages_dirtied ~= nr_dirtied_pause
      	task_ratelimit ~= dirty_ratelimit
      
      We get
      
      	nr_dirtied_pause ~= dirty_ratelimit * desired_pause
      
      Here dirty_ratelimit is preferred over task_ratelimit because it's
      more stable.
      
      It's also important to limit possible large transitional errors:
      
      - bw is changing quickly
      - pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
      - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
        separate fix, but still expect non-trivial errors)
      
      So we end up using the above formula inside clamp_val().
      
      The best test case for this code is to run 100 "dd bs=4M" tasks on
      btrfs and check its pause time distribution.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      57fc978c
    • W
      writeback: limit max dirty pause time · c8462cc9
      Wu Fengguang 提交于
      Apply two policies to scale down the max pause time for
      
      1) small number of concurrent dirtiers
      2) small memory system (comparing to storage bandwidth)
      
      MAX_PAUSE=200ms may only be suitable for high end servers with lots of
      concurrent dirtiers, where the large pause time can reduce much overheads.
      
      Otherwise, smaller pause time is desirable whenever possible, so as to
      get good responsiveness and smooth user experiences. It's actually
      required for good disk utilization in the case when all the dirty pages
      can be synced to disk within MAX_PAUSE=200ms.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      c8462cc9
    • W
      writeback: IO-less balance_dirty_pages() · 143dfe86
      Wu Fengguang 提交于
      As proposed by Chris, Dave and Jan, don't start foreground writeback IO
      inside balance_dirty_pages(). Instead, simply let it idle sleep for some
      time to throttle the dirtying task. In the mean while, kick off the
      per-bdi flusher thread to do background writeback IO.
      
      RATIONALS
      =========
      
      - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
      
        If every thread doing writes and being throttled start foreground
        writeback, it leads to N IO submitters from at least N different
        inodes at the same time, end up with N different sets of IO being
        issued with potentially zero locality to each other, resulting in
        much lower elevator sort/merge efficiency and hence we seek the disk
        all over the place to service the different sets of IO.
        OTOH, if there is only one submission thread, it doesn't jump between
        inodes in the same way when congestion clears - it keeps writing to
        the same inode, resulting in large related chunks of sequential IOs
        being issued to the disk. This is more efficient than the above
        foreground writeback because the elevator works better and the disk
        seeks less.
      
      - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
      
        With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
        from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
      
        * "CPU usage has dropped by ~55%", "it certainly appears that most of
          the CPU time saving comes from the removal of contention on the
          inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
          cacheline bouncing, because the new code is able to call much less
          frequently into balance_dirty_pages() and hence access the global
          page states)
      
        * the user space "App overhead" is reduced by 20%, by avoiding the
          cacheline pollution by the complex writeback code path
      
        * "for a ~5% throughput reduction", "the number of write IOs have
          dropped by ~25%", and the elapsed time reduced from 41:42.17 to
          40:53.23.
      
        * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
          and improves IO throughput from 38MB/s to 42MB/s.
      
      - IO size too small for fast arrays and too large for slow USB sticks
      
        The write_chunk used by current balance_dirty_pages() cannot be
        directly set to some large value (eg. 128MB) for better IO efficiency.
        Because it could lead to more than 1 second user perceivable stalls.
        Even the current 4MB write size may be too large for slow USB sticks.
        The fact that balance_dirty_pages() starts IO on itself couples the
        IO size to wait time, which makes it hard to do suitable IO size while
        keeping the wait time under control.
      
        Now it's possible to increase writeback chunk size proportional to the
        disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
        the larger writeback size dramatically reduces the seek count to 1/10
        (far beyond my expectation) and improves the write throughput by 24%.
      
      - long block time in balance_dirty_pages() hurts desktop responsiveness
      
        Many of us may have the experience: it often takes a couple of seconds
        or even long time to stop a heavy writing dd/cp/tar command with
        Ctrl-C or "kill -9".
      
      - IO pipeline broken by bumpy write() progress
      
        There are a broad class of "loop {read(buf); write(buf);}" applications
        whose read() pipeline will be under-utilized or even come to a stop if
        the write()s have long latencies _or_ don't progress in a constant rate.
        The current threshold based throttling inherently transfers the large
        low level IO completion fluctuations to bumpy application write()s,
        and further deteriorates with increasing number of dirtiers and/or bdi's.
      
        For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
        the rsync progresses very bumpy in legacy kernel, and throughput is
        improved by 67% by this patchset. (plus the larger write chunk size,
        it will be 93% speedup).
      
        The new rate based throttling can support 1000+ dd's with excellent
        smoothness, low latency and low overheads.
      
      For the above reasons, it's much better to do IO-less and low latency
      pauses in balance_dirty_pages().
      
      Jan Kara, Dave Chinner and me explored the scheme to let
      balance_dirty_pages() wait for enough writeback IO completions to
      safeguard the dirty limit. However it's found to have two problems:
      
      - in large NUMA systems, the per-cpu counters may have big accounting
        errors, leading to big throttle wait time and jitters.
      
      - NFS may kill large amount of unstable pages with one single COMMIT.
        Because NFS server serves COMMIT with expensive fsync() IOs, it is
        desirable to delay and reduce the number of COMMITs. So it's not
        likely to optimize away such kind of bursty IO completions, and the
        resulted large (and tiny) stall times in IO completion based throttling.
      
      So here is a pause time oriented approach, which tries to control the
      pause time in each balance_dirty_pages() invocations, by controlling
      the number of pages dirtied before calling balance_dirty_pages(), for
      smooth and efficient dirty throttling:
      
      - avoid useless (eg. zero pause time) balance_dirty_pages() calls
      - avoid too small pause time (less than   4ms, which burns CPU power)
      - avoid too large pause time (more than 200ms, which hurts responsiveness)
      - avoid big fluctuations of pause times
      
      It can control pause times at will. The default policy (in a followup
      patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
      in 1000-dd case.
      
      BEHAVIOR CHANGE
      ===============
      
      (1) dirty threshold
      
      Users will notice that the applications will get throttled once crossing
      the global (background + dirty)/2=15% threshold, and then balanced around
      17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
      memory in 1-dd case.
      
      Since the task will be soft throttled earlier than before, it may be
      perceived by end users as performance "slow down" if his application
      happens to dirty more than 15% dirtyable memory.
      
      (2) smoothness/responsiveness
      
      Users will notice a more responsive system during heavy writeback.
      "killall dd" will take effect instantly.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      143dfe86
    • W
      writeback: per task dirty rate limit · 9d823e8f
      Wu Fengguang 提交于
      Add two fields to task_struct.
      
      1) account dirtied pages in the individual tasks, for accuracy
      2) per-task balance_dirty_pages() call intervals, for flexibility
      
      The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
      scale near-sqrt to the safety gap between dirty pages and threshold.
      
      The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
      pages at exactly the same time, each task will be assigned a large
      initial nr_dirtied_pause, so that the dirty threshold will be exceeded
      long before each task reached its nr_dirtied_pause and hence call
      balance_dirty_pages().
      
      The solution is to watch for the number of pages dirtied on each CPU in
      between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
      (3% dirty threshold), force call balance_dirty_pages() for a chance to
      set bdi->dirty_exceeded. In normal situations, this safeguarding
      condition is not expected to trigger at all.
      
      On the sqrt in dirty_poll_interval():
      
      It will serve as an initial guess when dirty pages are still in the
      freerun area.
      
      When dirty pages are floating inside the dirty control scope [freerun,
      limit], a followup patch will use some refined dirty poll interval to
      get the desired pause time.
      
         thresh-dirty (MB)    sqrt
      		   1      16
      		   2      22
      		   4      32
      		   8      45
      		  16      64
      		  32      90
      		  64     128
      		 128     181
      		 256     256
      		 512     362
      		1024     512
      
      The above table means, given 1MB (or 1GB) gap and the dd tasks polling
      balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't
      be exceeded as long as there are less than 16 (or 512) concurrent dd's.
      
      So sqrt naturally leads to less overheads and more safe concurrent tasks
      for large memory servers, which have large (thresh-freerun) gaps.
      
      peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
      
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: NAndrea Righi <andrea@betterlinux.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      9d823e8f
    • W
      writeback: stabilize bdi->dirty_ratelimit · 7381131c
      Wu Fengguang 提交于
      There are some imperfections in balanced_dirty_ratelimit.
      
      1) large fluctuations
      
      The dirty_rate used for computing balanced_dirty_ratelimit is merely
      averaged in the past 200ms (very small comparing to the 3s estimation
      period for write_bw), which makes rather dispersed distribution of
      balanced_dirty_ratelimit.
      
      It's pretty hard to average out the singular points by increasing the
      estimation period. Considering that the averaging technique will
      introduce very undesirable time lags, I give it up totally. (btw, the 3s
      write_bw averaging time lag is much more acceptable because its impact
      is one-way and therefore won't lead to oscillations.)
      
      The more practical way is filtering -- most singular
      balanced_dirty_ratelimit points can be filtered out by remembering some
      prev_balanced_rate and prev_prev_balanced_rate. However the more
      reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.
      
      2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
      match could become unbalanced, which may lead to large systematical
      errors in balanced_dirty_ratelimit. The truncates, due to its possibly
      bumpy nature, can hardly be compensated smoothly. So let's face it. When
      some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
      high, dirty pages will go higher than the setpoint. task_ratelimit will
      in turn become lower than dirty_ratelimit.  So if we consider both
      balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
      only when they are on the same side of dirty_ratelimit, the systematical
      errors in balanced_dirty_ratelimit won't be able to bring
      dirty_ratelimit far away.
      
      The balanced_dirty_ratelimit estimation may also be inaccurate near
      @limit or @freerun, however is less an issue.
      
      3) since we ultimately want to
      
      - keep the fluctuations of task ratelimit as small as possible
      - keep the dirty pages around the setpoint as long time as possible
      
      the update policy used for (2) also serves the above goals nicely:
      if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
      and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
      there is no point to bring up dirty_ratelimit in a hurry only to hurt
      both the above two goals.
      
      So, we make use of task_ratelimit to limit the update of dirty_ratelimit
      in two ways:
      
      1) avoid changing dirty rate when it's against the position control target
         (the adjusted rate will slow down the progress of dirty pages going
         back to setpoint).
      
      2) limit the step size. task_ratelimit is changing values step by step,
         leaving a consistent trace comparing to the randomly jumping
         balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
         errors in stable state and typically larger errors when there are big
         errors in rate.  So it's a pretty good limiting factor for the step
         size of dirty_ratelimit.
      
      Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
      task_ratelimit is merely used as a limiting factor.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      7381131c
    • W
      writeback: dirty rate control · be3ffa27
      Wu Fengguang 提交于
      It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
      when there are N dd tasks.
      
      On write() syscall, use bdi->dirty_ratelimit
      ============================================
      
          balance_dirty_pages(pages_dirtied)
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              pause = pages_dirtied / task_ratelimit;
              sleep(pause);
          }
      
      On every 200ms, update bdi->dirty_ratelimit
      ===========================================
      
          bdi_update_dirty_ratelimit()
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
              bdi->dirty_ratelimit = balanced_dirty_ratelimit
          }
      
      Estimation of balanced bdi->dirty_ratelimit
      ===========================================
      
      balanced task_ratelimit
      -----------------------
      
      balance_dirty_pages() needs to throttle tasks dirtying pages such that
      the total amount of dirty pages stays below the specified dirty limit in
      order to avoid memory deadlocks. Furthermore we desire fairness in that
      tasks get throttled proportionally to the amount of pages they dirty.
      
      IOW we want to throttle tasks such that we match the dirty rate to the
      writeout bandwidth, this yields a stable amount of dirty pages:
      
              dirty_rate == write_bw                                          (1)
      
      The fairness requirement gives us:
      
              task_ratelimit = balanced_dirty_ratelimit
                             == write_bw / N                                  (2)
      
      where N is the number of dd tasks.  We don't know N beforehand, but
      still can estimate balanced_dirty_ratelimit within 200ms.
      
      Start by throttling each dd task at rate
      
              task_ratelimit = task_ratelimit_0                               (3)
                               (any non-zero initial value is OK)
      
      After 200ms, we measured
      
              dirty_rate = # of pages dirtied by all dd's / 200ms
              write_bw   = # of pages written to the disk / 200ms
      
      For the aggressive dd dirtiers, the equality holds
      
              dirty_rate == N * task_rate
                         == N * task_ratelimit_0                              (4)
      Or
              task_ratelimit_0 == dirty_rate / N                              (5)
      
      Now we conclude that the balanced task ratelimit can be estimated by
      
                                                            write_bw
              balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                            dirty_rate
      
      Because with (4) and (5) we can get the desired equality (1):
      
                                                             write_bw
              balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                             dirty_rate
                                       == write_bw / N
      
      Then using the balanced task ratelimit we can compute task pause times like:
      
              task_pause = task->nr_dirtied / task_ratelimit
      
      task_ratelimit with position control
      ------------------------------------
      
      However, while the above gives us means of matching the dirty rate to
      the writeout bandwidth, it at best provides us with a stable dirty page
      count (assuming a static system). In order to control the dirty page
      count such that it is high enough to provide performance, but does not
      exceed the specified limit we need another control.
      
      The dirty position control works by extending (2) to
      
              task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)
      
      where pos_ratio is a negative feedback function that subjects to
      
      1) f(setpoint) = 1.0
      2) df/dx < 0
      
      That is, if the dirty pages are ABOVE the setpoint, we throttle each
      task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty
      pages are created less fast than they are cleaned, thus DROP to the
      setpoints (and the reverse).
      
      Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
      remains CONSTANT for the past 200ms, we get
      
              task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)
      
      Putting (8) into (6), we get the formula used in
      bdi_update_dirty_ratelimit():
      
                                                      write_bw
              balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                      dirty_rate
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      be3ffa27