1. 08 Dec, 2011 3 commits
    • writeback: set max_pause to lowest value on zero bdi_dirty · 82e230a0
      Wu Fengguang committed
      Some traces show many bdi_dirty=0 lines where the real value is small
      but non-zero; the zeros come from accounting errors in the per-cpu bdi
      stats.
      
      In this case the max pause time should really be set to the smallest
      (non-zero) value to avoid IO queue underrun and improve throughput.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: permit through good bdi even when global dirty exceeded · c5c6343c
      Wu Fengguang committed
      On a system with 1 local mount and 1 NFS mount, if the NFS server
      stops responding during a dd to the NFS mount, the NFS dirty pages may
      exceed the global dirty limit and _every_ task that writes will be
      blocked. The whole system appears unresponsive.

      The workaround is to let through the bdi's that have only a small
      number of dirty pages. The number chosen (bdi_stat_error pages) is not
      enough for the local disk to reach optimal throughput, but it is
      enough to keep the system responsive on a broken NFS mount. The user can
      then kill the dirtiers on the NFS mount and increase the global dirty
      limit to bring up the local disk's throughput.

      This risks letting dirty pages grow much larger than the global dirty
      limit when there are 1000+ mounts, but that is very unlikely to happen,
      especially in low memory profiles.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: comment on the bdi dirty threshold · aed21ad2
      Wu Fengguang committed
      We do "floating proportions" to let active devices grow their target
      share of dirty pages and stalled/inactive devices shrink their target
      share over time.

      It works well except in the case of "an inactive disk suddenly goes
      busy", where the initial target share may be too small. To mitigate
      this, bdi_position_ratio() has the line below to raise a small
      bdi_thresh when it's safe to do so, so that the disk is fed enough
      dirty pages for efficient IO and, in turn, a fast ramp-up of bdi_thresh:
      
              bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
      
      balance_dirty_pages() normally does negative feedback control which
      adjusts ratelimit to balance the bdi dirty pages around the target.
      In some extreme cases when that is not enough, it will have to block
      the tasks completely until the bdi dirty pages drop below bdi_thresh.
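
The bdi_thresh floor quoted above can be sketched as a standalone function. This is only an illustration: raise_bdi_thresh() is a hypothetical name, and the guard for dirty >= limit is an assumption about what "when it's safe to do so" means.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel function: raise a small bdi_thresh
 * to at least 1/8 of the remaining global headroom (limit - dirty).
 */
static uint64_t raise_bdi_thresh(uint64_t bdi_thresh, uint64_t limit,
                                 uint64_t dirty)
{
    /* Assumption: with no headroom left, there is nothing safe to hand out. */
    uint64_t headroom = dirty < limit ? limit - dirty : 0;
    uint64_t floor = headroom / 8;

    return bdi_thresh > floor ? bdi_thresh : floor;
}
```

With limit = 1000 pages and dirty = 200 pages, a bdi_thresh of 10 is raised to 100, while a bdi_thresh of 500 is left alone.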
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  2. 17 Nov, 2011 2 commits
    • writeback: remove vm_dirties and task->dirties · 468e6a20
      Wu Fengguang committed
      They are not used any more.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: hard throttle 1000+ dd on a slow USB stick · 1df64719
      Wu Fengguang committed
      The sleep based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
      for every single 4KB page, which means it cannot throttle a task below
      4KB/200ms=20KB/s. So when more than 512 dd's are writing to a
      10MB/s USB stick, its bdi dirty pages could grow out of control.
      
      Even if we can increase MAX_PAUSE, the minimal (task_ratelimit = 1)
      means a limit of 4KB/s.
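
The arithmetic above can be checked with a small sketch. The function names are hypothetical; the constants are the ones quoted in the message (4KB pages, MAX_PAUSE=200ms, a 10MB/s device).

```c
#include <stdint.h>

#define PAGE_BYTES   4096u  /* 4KB page, as in the message above */
#define MAX_PAUSE_MS 200u   /* MAX_PAUSE, as in the message above */

/*
 * Lowest rate (bytes/s) a task can be held to when it sleeps at most
 * max_pause_ms for every dirtied page.
 */
static uint64_t min_throttle_rate(uint64_t page_bytes, uint64_t max_pause_ms)
{
    return page_bytes * 1000 / max_pause_ms;
}

/* Number of tasks running at that floor rate that saturate a device. */
static uint64_t max_throttleable_tasks(uint64_t device_bytes_per_sec,
                                       uint64_t floor_rate)
{
    return device_bytes_per_sec / floor_rate;
}
```

min_throttle_rate(PAGE_BYTES, MAX_PAUSE_MS) gives 20480 bytes/s (20KB/s), and a 10MB/s device is saturated by 512 such tasks, which matches the numbers in the text.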
                                                             
      They can eventually be safeguarded by the global limit check 
      (nr_dirty < dirty_thresh). However if someone is also writing to an 
      HDD at the same time, it'll get poor HDD write performance.
                                                             
      We at least want to maintain good write performance for other devices
      when one device is attacked by some "massively parallel" workload,
      suffers from slow write bandwidth, or gets stalled due to some
      error condition (eg. NFS server not responding).
      
      For a stalled device, we need to completely block its dirtiers, too,
      before its bdi dirty pages grow all the way up to the global limit and
      leave no space for the other functional devices.
      
      So change the loop exit condition to
      
      	/*
      	 * Always enforce global dirty limit; also enforce bdi dirty limit
      	 * if the normal max_pause sleeps cannot keep things under control.
      	 */
      	if (nr_dirty < dirty_thresh &&
      	    (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
      		break;
      
      which can be further simplified to
      
      	if (task_ratelimit)
      		break;
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  3. 16 Nov, 2011 1 commit
  4. 07 Nov, 2011 1 commit
  5. 01 Nov, 2011 1 commit
  6. 31 Oct, 2011 4 commits
  7. 11 Oct, 2011 1 commit
  8. 03 Oct, 2011 10 commits
    • writeback: dirty position control - bdi reserve area · 8927f66c
      Wu Fengguang committed
      Keep a minimal pool of dirty pages for each bdi, so that the disk IO
      queues won't underrun. Also gently increase a small bdi_thresh to keep
      it from getting stuck at 0 for a lightly dirtied bdi.

      It's particularly useful for JBOD and small memory systems.

      It may result in (pos_ratio > 1) at the setpoint and push the dirty
      pages high. This is more or less intended because the bdi is in
      danger of IO queue underflow.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: control dirty pause time · 57fc978c
      Wu Fengguang committed
      The dirty pause time shall ultimately be controlled by adjusting
      nr_dirtied_pause, since the following relationship holds
      
      	pause = pages_dirtied / task_ratelimit
      
      Assuming
      
      	pages_dirtied ~= nr_dirtied_pause
      	task_ratelimit ~= dirty_ratelimit
      
      We get
      
      	nr_dirtied_pause ~= dirty_ratelimit * desired_pause
      
      Here dirty_ratelimit is preferred over task_ratelimit because it's
      more stable.
      
      It's also important to limit possible large transitional errors:
      
      - bw is changing quickly
      - pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
      - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
        separate fix, but still expect non-trivial errors)
      
      So we end up using the above formula inside clamp_val().
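
A minimal sketch of that clamped formula follows. The function names are hypothetical, and the clamp bounds are left as parameters because the message does not state them.

```c
#include <stdint.h>

/* Plain clamp helper standing in for the kernel's clamp_val(). */
static uint64_t clamp_u64(uint64_t v, uint64_t lo, uint64_t hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/*
 * nr_dirtied_pause ~= dirty_ratelimit * desired_pause, clamped to limit
 * transitional errors. dirty_ratelimit is in pages/s, the pause in ms.
 * lo and hi are assumptions supplied by the caller.
 */
static uint64_t next_nr_dirtied_pause(uint64_t dirty_ratelimit,
                                      uint64_t desired_pause_ms,
                                      uint64_t lo, uint64_t hi)
{
    uint64_t target = dirty_ratelimit * desired_pause_ms / 1000;

    return clamp_u64(target, lo, hi);
}
```

For example, 2560 pages/s with a desired 100ms pause yields 256 pages between balance_dirty_pages() calls, unless the bounds cut it down.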
      
      The best test case for this code is to run 100 "dd bs=4M" tasks on
      btrfs and check its pause time distribution.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: limit max dirty pause time · c8462cc9
      Wu Fengguang committed
      Apply two policies to scale down the max pause time for
      
      1) small number of concurrent dirtiers
      2) small memory system (compared to storage bandwidth)

      MAX_PAUSE=200ms may only be suitable for high end servers with lots of
      concurrent dirtiers, where the large pause time can reduce overhead considerably.
      
      Otherwise, smaller pause time is desirable whenever possible, so as to
      get good responsiveness and smooth user experiences. It's actually
      required for good disk utilization in the case when all the dirty pages
      can be synced to disk within MAX_PAUSE=200ms.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: IO-less balance_dirty_pages() · 143dfe86
      Wu Fengguang committed
      As proposed by Chris, Dave and Jan, don't start foreground writeback IO
      inside balance_dirty_pages(). Instead, simply let it idle sleep for some
      time to throttle the dirtying task. In the meantime, kick off the
      per-bdi flusher thread to do background writeback IO.

      RATIONALE
      =========
      
      - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
      
        If every throttled writer thread starts foreground writeback, there
        are N IO submitters from at least N different inodes at the same
        time, which end up issuing N different sets of IO with potentially
        zero locality to each other. This results in much lower elevator
        sort/merge efficiency, and hence we seek the disk all over the place
        to service the different sets of IO.
        OTOH, if there is only one submission thread, it doesn't jump between
        inodes in the same way when congestion clears - it keeps writing to
        the same inode, resulting in large related chunks of sequential IOs
        being issued to the disk. This is more efficient than the above
        foreground writeback because the elevator works better and the disk
        seeks less.
      
      - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
      
        With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
        from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
      
        * "CPU usage has dropped by ~55%", "it certainly appears that most of
          the CPU time saving comes from the removal of contention on the
          inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
          cacheline bouncing, because the new code is able to call much less
          frequently into balance_dirty_pages() and hence access the global
          page states)
      
        * the user space "App overhead" is reduced by 20%, by avoiding the
          cacheline pollution by the complex writeback code path
      
        * "for a ~5% throughput reduction", "the number of write IOs have
          dropped by ~25%", and the elapsed time reduced from 41:42.17 to
          40:53.23.
      
        * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
          and improves IO throughput from 38MB/s to 42MB/s.
      
      - IO size too small for fast arrays and too large for slow USB sticks
      
        The write_chunk used by the current balance_dirty_pages() cannot be
        directly set to some large value (eg. 128MB) for better IO efficiency,
        because it could lead to user perceivable stalls of more than 1 second.
        Even the current 4MB write size may be too large for slow USB sticks.
        The fact that balance_dirty_pages() starts IO by itself couples the
        IO size to the wait time, which makes it hard to choose a suitable IO
        size while keeping the wait time under control.
      
        Now it's possible to increase writeback chunk size proportional to the
        disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
        the larger writeback size dramatically reduces the seek count to 1/10
        (far beyond my expectation) and improves the write throughput by 24%.
      
      - long block time in balance_dirty_pages() hurts desktop responsiveness
      
        Many of us may have had the experience: it often takes a couple of
        seconds or even longer to stop a heavy writing dd/cp/tar command with
        Ctrl-C or "kill -9".
      
      - IO pipeline broken by bumpy write() progress
      
        There is a broad class of "loop {read(buf); write(buf);}" applications
        whose read() pipeline will be under-utilized or even come to a stop if
        the write()s have long latencies _or_ don't progress at a constant rate.
        The current threshold based throttling inherently transfers the large
        low level IO completion fluctuations to bumpy application write()s,
        and further deteriorates with increasing number of dirtiers and/or bdi's.

        For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
        the rsync progress is very bumpy on the legacy kernel, and throughput is
        improved by 67% by this patchset (with the larger write chunk size,
        the speedup is 93%).
      
        The new rate based throttling can support 1000+ dd's with excellent
        smoothness, low latency and low overheads.
      
      For the above reasons, it's much better to do IO-less and low latency
      pauses in balance_dirty_pages().
      
      Jan Kara, Dave Chinner and I explored the scheme to let
      balance_dirty_pages() wait for enough writeback IO completions to
      safeguard the dirty limit. However it was found to have two problems:

      - in large NUMA systems, the per-cpu counters may have big accounting
        errors, leading to big throttle wait times and jitter.

      - NFS may clear a large number of unstable pages with one single COMMIT.
        Because the NFS server serves COMMIT with expensive fsync() IOs, it is
        desirable to delay and reduce the number of COMMITs. So such bursty
        IO completions are unlikely to be optimized away, and neither are the
        resulting large (and tiny) stall times in IO completion based throttling.
      
      So here is a pause time oriented approach, which tries to control the
      pause time of each balance_dirty_pages() invocation by controlling
      the number of pages dirtied before calling balance_dirty_pages(), for
      smooth and efficient dirty throttling:
      
      - avoid useless (eg. zero pause time) balance_dirty_pages() calls
      - avoid too small pause time (less than   4ms, which burns CPU power)
      - avoid too large pause time (more than 200ms, which hurts responsiveness)
      - avoid big fluctuations of pause times
      
      It can control pause times at will. The default policy (in a followup
      patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
      in 1000-dd case.
      
      BEHAVIOR CHANGE
      ===============
      
      (1) dirty threshold
      
      Users will notice that applications get throttled once they cross
      the global (background + dirty)/2=15% threshold, and are then balanced
      around 17.5%. Before the patch, the behavior was to just throttle at 20%
      dirtyable memory in the 1-dd case.

      Since the task will be soft throttled earlier than before, end users
      may perceive it as a performance "slow down" if their application
      happens to dirty more than 15% dirtyable memory.
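
The 15% and 17.5% figures can be reproduced with a short sketch. It assumes the background and dirty thresholds sit at 10% and 20% of dirtyable memory (implied by the numbers above), and that the balance point is the midpoint between the 15% threshold and the 20% limit; the struct and function names are hypothetical.

```c
#include <stdint.h>

struct dirty_points {
    uint64_t freerun;   /* throttling starts above this many pages */
    uint64_t setpoint;  /* level the dirty pages are balanced around */
};

static struct dirty_points compute_dirty_points(uint64_t background_thresh,
                                                uint64_t dirty_thresh)
{
    struct dirty_points p;

    p.freerun = (background_thresh + dirty_thresh) / 2;
    /* Assumption: the balance point is midway between freerun and the limit. */
    p.setpoint = (p.freerun + dirty_thresh) / 2;
    return p;
}
```

With 1,000,000 dirtyable pages, thresholds of 100,000 and 200,000 pages give a freerun point of 150,000 (15%) and a setpoint of 175,000 (17.5%).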
      
      (2) smoothness/responsiveness
      
      Users will notice a more responsive system during heavy writeback.
      "killall dd" will take effect instantly.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: per task dirty rate limit · 9d823e8f
      Wu Fengguang committed
      Add two fields to task_struct.
      
      1) account dirtied pages in the individual tasks, for accuracy
      2) per-task balance_dirty_pages() call intervals, for flexibility
      
      The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
      scale near-sqrt to the safety gap between dirty pages and threshold.
      
      The main problem of per-task nr_dirtied is that if 1k+ tasks start dirtying
      pages at exactly the same time, each task will be assigned a large
      initial nr_dirtied_pause, so that the dirty threshold will be exceeded
      long before each task reaches its nr_dirtied_pause and hence calls
      balance_dirty_pages().
      
      The solution is to watch for the number of pages dirtied on each CPU in
      between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
      (3% dirty threshold), force call balance_dirty_pages() for a chance to
      set bdi->dirty_exceeded. In normal situations, this safeguarding
      condition is not expected to trigger at all.
      
      On the sqrt in dirty_poll_interval():
      
      It will serve as an initial guess when dirty pages are still in the
      freerun area.
      
      When dirty pages are floating inside the dirty control scope [freerun,
      limit], a followup patch will use some refined dirty poll interval to
      get the desired pause time.
      
         thresh-dirty (MB)   sqrt (pages)
                         1     16
                         2     22
                         4     32
                         8     45
                        16     64
                        32     90
                        64    128
                       128    181
                       256    256
                       512    362
                      1024    512
      
      The above table means that, given a 1MB (or 1GB) gap and dd tasks polling
      balance_dirty_pages() every 16 (or 512) pages, the dirty limit won't
      be exceeded as long as there are fewer than 16 (or 512) concurrent dd's.

      So sqrt naturally leads to lower overhead and more safe concurrent tasks
      for large memory servers, which have large (thresh-freerun) gaps.
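
The table can be reproduced with an exact integer square root of the gap expressed in 4KB pages (256 pages per MB). This is only a sketch with hypothetical function names; the in-kernel dirty_poll_interval() may well use a cheaper approximation.

```c
#include <stdint.h>

/* Exact floor(sqrt(n)) by binary search; invariant: lo*lo <= n < hi*hi. */
static uint64_t isqrt_u64(uint64_t n)
{
    uint64_t lo = 0, hi = 1ull << 32;

    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;

        if (mid * mid <= n)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Poll interval in pages for a (thresh - dirty) gap given in MB. */
static uint64_t poll_interval_pages(uint64_t gap_mb)
{
    return isqrt_u64(gap_mb * 256);  /* assumes 4KB pages */
}
```

poll_interval_pages(1) is 16 and poll_interval_pages(1024) is 512, matching the first and last rows of the table.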
      
      peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
      
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Andrea Righi <andrea@betterlinux.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: stabilize bdi->dirty_ratelimit · 7381131c
      Wu Fengguang committed
      There are some imperfections in balanced_dirty_ratelimit.
      
      1) large fluctuations
      
      The dirty_rate used for computing balanced_dirty_ratelimit is merely
      averaged over the past 200ms (very small compared to the 3s estimation
      period for write_bw), which gives a rather dispersed distribution of
      balanced_dirty_ratelimit.
      
      It's pretty hard to average out the singular points by increasing the
      estimation period. Considering that the averaging technique will
      introduce very undesirable time lags, I give it up totally. (btw, the 3s
      write_bw averaging time lag is much more acceptable because its impact
      is one-way and therefore won't lead to oscillations.)
      
      The more practical way is filtering -- most singular
      balanced_dirty_ratelimit points can be filtered out by remembering some
      prev_balanced_rate and prev_prev_balanced_rate. However the more
      reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.
      
      2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
      match could become unbalanced, which may lead to large systematic
      errors in balanced_dirty_ratelimit. The truncates, due to their possibly
      bumpy nature, can hardly be compensated smoothly. So let's face it. When
      some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
      high, dirty pages will go higher than the setpoint. task_ratelimit will
      in turn become lower than dirty_ratelimit.  So if we consider both
      balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
      only when they are on the same side of dirty_ratelimit, the systematic
      errors in balanced_dirty_ratelimit won't be able to bring
      dirty_ratelimit far away.

      The balanced_dirty_ratelimit estimation may also be inaccurate near
      @limit or @freerun, but that is less of an issue.
      
      3) since we ultimately want to
      
      - keep the fluctuations of task ratelimit as small as possible
      - keep the dirty pages around the setpoint for as long as possible

      the update policy used for (2) also serves the above goals nicely:
      if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
      and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
      there is no point in bringing up dirty_ratelimit in a hurry only to hurt
      both of the above goals.
      
      So, we make use of task_ratelimit to limit the update of dirty_ratelimit
      in two ways:
      
      1) avoid changing dirty rate when it's against the position control target
         (the adjusted rate will slow down the progress of dirty pages going
         back to setpoint).
      
      2) limit the step size. task_ratelimit changes value step by step,
         leaving a consistent trace compared to the randomly jumping
         balanced_dirty_ratelimit. task_ratelimit also has nicely smaller
         errors in stable state and typically larger errors when there are big
         errors in rate.  So it's a pretty good limiting factor for the step
         size of dirty_ratelimit.
      
      Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
      task_ratelimit is merely used as a limiting factor.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: dirty rate control · be3ffa27
      Wu Fengguang committed
      It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
      when there are N dd tasks.
      
      On write() syscall, use bdi->dirty_ratelimit
      ============================================
      
          balance_dirty_pages(pages_dirtied)
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              pause = pages_dirtied / task_ratelimit;
              sleep(pause);
          }
      
      On every 200ms, update bdi->dirty_ratelimit
      ===========================================
      
          bdi_update_dirty_ratelimit()
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
              bdi->dirty_ratelimit = balanced_dirty_ratelimit;
          }
      
      Estimation of balanced bdi->dirty_ratelimit
      ===========================================
      
      balanced task_ratelimit
      -----------------------
      
      balance_dirty_pages() needs to throttle tasks dirtying pages such that
      the total amount of dirty pages stays below the specified dirty limit in
      order to avoid memory deadlocks. Furthermore we desire fairness in that
      tasks get throttled proportionally to the amount of pages they dirty.
      
      IOW we want to throttle tasks such that we match the dirty rate to the
      writeout bandwidth, this yields a stable amount of dirty pages:
      
              dirty_rate == write_bw                                          (1)
      
      The fairness requirement gives us:
      
              task_ratelimit = balanced_dirty_ratelimit
                             == write_bw / N                                  (2)
      
      where N is the number of dd tasks.  We don't know N beforehand, but
      still can estimate balanced_dirty_ratelimit within 200ms.
      
      Start by throttling each dd task at rate
      
              task_ratelimit = task_ratelimit_0                               (3)
                               (any non-zero initial value is OK)
      
      After 200ms, we measured
      
              dirty_rate = # of pages dirtied by all dd's / 200ms
              write_bw   = # of pages written to the disk / 200ms
      
      For the aggressive dd dirtiers, the equality holds
      
              dirty_rate == N * task_rate
                         == N * task_ratelimit_0                              (4)
      Or
              task_ratelimit_0 == dirty_rate / N                              (5)
      
      Now we conclude that the balanced task ratelimit can be estimated by
      
                                                            write_bw
              balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                            dirty_rate
      
      Because with (4) and (5) we can get the desired equality (1):
      
                                                             write_bw
              balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                             dirty_rate
                                       == write_bw / N
      
      Then using the balanced task ratelimit we can compute task pause times like:
      
              task_pause = task->nr_dirtied / task_ratelimit
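
Formula (6) and the pause computation can be sketched with floating point for readability. The function names are hypothetical, and the real code presumably works in integer pages rather than doubles.

```c
/*
 * Formula (6): estimate the balanced rate from one measurement period.
 * All rates are in pages per second.
 */
static double estimate_balanced_ratelimit(double task_ratelimit_0,
                                          double write_bw, double dirty_rate)
{
    return task_ratelimit_0 * write_bw / dirty_rate;
}

/* Pause (seconds) for a task that dirtied nr_dirtied pages. */
static double task_pause_seconds(double nr_dirtied, double task_ratelimit)
{
    return nr_dirtied / task_ratelimit;
}
```

With N = 4 dd tasks each throttled at task_ratelimit_0 = 1000 pages/s, the measured dirty_rate is 4000 pages/s; for write_bw = 2000 pages/s the estimate is 500 pages/s, that is write_bw / N, and a task that dirtied 125 pages pauses for 0.25s.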
      
      task_ratelimit with position control
      ------------------------------------
      
      However, while the above gives us means of matching the dirty rate to
      the writeout bandwidth, it at best provides us with a stable dirty page
      count (assuming a static system). In order to control the dirty page
      count such that it is high enough to provide performance, but does not
      exceed the specified limit we need another control.
      
      The dirty position control works by extending (2) to
      
              task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)
      
      where pos_ratio is a negative feedback function that satisfies
      
      1) f(setpoint) = 1.0
      2) df/dx < 0
      
      That is, if the dirty pages are ABOVE the setpoint, we throttle each
      task a bit more HEAVILY than balanced_dirty_ratelimit, so that the dirty
      pages are created more slowly than they are cleaned, and thus DROP to the
      setpoint (and the reverse).

      Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
      remain CONSTANT for the past 200ms, we get
      
              task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)
      
      Putting (8) into (6), we get the formula used in
      bdi_update_dirty_ratelimit():
      
                                                      write_bw
              balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                      dirty_rate
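
A sketch of update (9) shows why pos_ratio cancels out. The function names are hypothetical; measured_dirty_rate() models N aggressive dirtiers running exactly at their task rate limit, as assumed in (4) and (8).

```c
/*
 * Dirty rate observed when N tasks each dirty at
 * task_ratelimit = balanced * pos_ratio, per (4) and (8).
 */
static double measured_dirty_rate(double n_tasks, double balanced,
                                  double pos_ratio)
{
    return n_tasks * balanced * pos_ratio;
}

/* Formula (9): balanced *= pos_ratio * write_bw / dirty_rate. */
static double update_balanced_ratelimit(double balanced, double pos_ratio,
                                        double write_bw, double dirty_rate)
{
    return balanced * pos_ratio * write_bw / dirty_rate;
}
```

Starting from a badly off estimate of 3000 pages/s with pos_ratio = 0.5, N = 4 and write_bw = 2000 pages/s, the measured dirty rate is 6000 pages/s and one update lands on 500 pages/s, that is write_bw / N, regardless of pos_ratio.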
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: add bg_threshold parameter to __bdi_update_bandwidth() · af6a3113
      Wu Fengguang committed
      No behavior change.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: dirty position control · 6c14ae1e
      Wu Fengguang committed
      bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
      that the resulting task rate limit can drive the dirty pages back to the
      global/bdi setpoints.
      
      Old scheme is,
                                                |
                                 free run area  |  throttle area
        ----------------------------------------+---------------------------->
                                          thresh^                  dirty pages
      
      New scheme is,
      
        ^ task rate limit
        |
        |            *
        |             *
        |              *
        |[free run]      *      [smooth throttled]
        |                  *
        |                     *
        |                         *
        ..bdi->dirty_ratelimit..........*
        |                               .     *
        |                               .          *
        |                               .              *
        |                               .                 *
        |                               .                    *
        +-------------------------------.-----------------------*------------>
                                setpoint^                  limit^  dirty pages
      
      The slope of the bdi control line should be
      
      1) large enough to pull the dirty pages to setpoint reasonably fast
      
      2) small enough to avoid big fluctuations in the resulting pos_ratio and
         hence task ratelimit
      
      Since the fluctuation range of the bdi dirty pages is typically observed
      to be within 1-second worth of data, the bdi control line's slope is
      selected to be a linear function of bdi write bandwidth, so that it can
      adapt to slow/fast storage devices well.
      
      Assume the bdi control line
      
      	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
      
      where k is the negative slope.
      
      If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
      are fluctuating in range
      
      	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
      
      we get slope
      
      	k = - 1 / (8 * write_bw)
      
      Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
      
      	x_intercept = bdi_setpoint + 8 * write_bw
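
The bdi control line can be sketched as below. The function name is hypothetical, doubles are used for readability, and clamping at zero beyond x_intercept is an assumption rather than something stated above.

```c
/*
 * pos_ratio = 1.0 + k * (dirty - bdi_setpoint), with k = -1 / (8 * write_bw).
 * write_bw is the amount of data written in one second, in pages.
 */
static double bdi_pos_ratio(double dirty, double bdi_setpoint, double write_bw)
{
    double ratio = 1.0 - (dirty - bdi_setpoint) / (8.0 * write_bw);

    /* Assumption: no negative ratio past x_intercept = setpoint + 8*write_bw. */
    return ratio > 0.0 ? ratio : 0.0;
}
```

With bdi_setpoint = 1000 and write_bw = 128, the ratio is 1.0625 at 936 pages and 0.9375 at 1064 pages, a 12.5% range across [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2], and it reaches 0 at x_intercept = 2024.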
      
      The global/bdi slopes are nicely complementing each other when the
      system has only one major bdi (indicated by bdi_thresh ~= thresh):
      
      1) slope of global control line    => scaling to the control scope size
      2) slope of main bdi control line  => scaling to the writeout bandwidth
      
      so that
      
      - in memory tight systems, (1) becomes strong enough to squeeze dirty
        pages inside the control scope
      
      - in large memory systems where the "gravity" of (1) for pulling the
        dirty pages to setpoint is too weak, (2) can back (1) up and drive
        dirty pages to bdi_setpoint ~= setpoint reasonably fast.
      
      Unfortunately in JBOD setups, the fluctuation range of the bdi threshold
      is related to memory size due to the interference between disks.  In
      this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
      
      Given equations
      
              span = x_intercept - bdi_setpoint
              k = df/dx = - 1 / span
      
      and the extremum values
      
              span = bdi_thresh
              dx = bdi_thresh
      
      we get
      
              df = - dx / span = - 1.0
      
      That means, when bdi_dirty deviates upward by bdi_thresh, pos_ratio and
      hence task ratelimit will change by -100%.
      
      peter: use 3rd order polynomial for the global control line
      
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: account per-bdi accumulated dirtied pages · c8e28ce0
      Wu Fengguang committed
      Introduce the BDI_DIRTIED counter. It will be used for estimating the
      bdi's dirty bandwidth.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Michael Rubin <mrubin@google.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  9. 19 Aug, 2011 1 commit
    • squeeze max-pause area and drop pass-good area · bb082295
      Wu Fengguang committed
      Revert the pass-good area introduced in ffd1f609 ("writeback:
      introduce max-pause and pass-good dirty limits") and make the max-pause
      area smaller and safe.
      
      This fixes ~30% performance regression in the ext3 data=writeback
      fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
      12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
      
      Using the deadline scheduler also shows a regression, but not as big as
      with CFQ, so this suggests we have some write starvation.
      
      The test logs show that
      
      - the disks are sometimes under utilized
      
      - global dirty pages sometimes rush high into the pass-good area for
        several hundred seconds, while in the meantime some bdi dirty pages
        drop to a very low value (bdi_dirty << bdi_thresh).  Then suddenly the
        global dirty pages drop under the global dirty threshold and bdi_dirty
        rushes very high (for example, 2 times higher than bdi_thresh), during
        which time balance_dirty_pages() is not called at all.
      
      So the problems are
      
      1) The random writes progress so slowly that they break the assumption of
         the max-pause logic that "8 pages per 200ms is typically more than
         enough to curb heavy dirtiers".

      2) The max-pause logic ignored task_bdi_thresh and thus opened the possibility
         for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh)
         and then (bdi_thresh >> bdi_dirty) for others.

      3) The higher max-pause/pass-good thresholds somehow lead to the bad
         swing of dirty pages.

      The fix is to allow the task to dirty slightly over task_bdi_thresh, but
      never to exceed bdi_dirty and/or global dirty_thresh.
      
      Tests show that it fixed the JBOD regression completely (both behavior
      and performance), while still being able to cut down large pause times
      in balance_dirty_pages() for single-disk cases.
      Reported-by: Li Shaohua <shaohua.li@intel.com>
      Tested-by: Li Shaohua <shaohua.li@intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      bb082295
  10. 26 Jul, 2011 2 commits
  11. 24 Jul, 2011 1 commit
    • J
      mm: properly reflect task dirty limits in dirty_exceeded logic · bcff25fc
      Jan Kara authored
      We set bdi->dirty_exceeded (and thus ratelimiting code starts to
      call balance_dirty_pages() every 8 pages) when a per-bdi limit is
      exceeded or global limit is exceeded. But per-bdi limit also depends
      on the task. Thus different tasks reach the limit on that bdi at
      different levels of dirty pages. The result is that with current code
      bdi->dirty_exceeded ping-ponged between 1 and 0 depending on which task
      just got into balance_dirty_pages().
      
      We fix the issue by clearing bdi->dirty_exceeded only when per-bdi amount
      of dirty pages drops below the threshold (7/8 * bdi_dirty_limit) where task
      limits already do not have any influence.
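
      The hysteresis described above can be sketched roughly as follows. This
      is not the kernel's code: struct bdi_state, update_dirty_exceeded() and
      the task_over_limit flag are hypothetical names for illustration, and
      only the 7/8 factor is taken from the commit message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of the bdi. */
struct bdi_state {
	bool dirty_exceeded;
};

/*
 * Set the flag as soon as the current task is over its own limit, but
 * clear it only once the bdi's dirty pages fall below 7/8 of the bdi
 * limit, where task-specific limits no longer matter.
 */
static void update_dirty_exceeded(struct bdi_state *bdi,
				  unsigned long bdi_dirty,
				  unsigned long bdi_dirty_limit,
				  bool task_over_limit)
{
	if (task_over_limit)
		bdi->dirty_exceeded = true;
	else if (bdi_dirty < bdi_dirty_limit - bdi_dirty_limit / 8)
		bdi->dirty_exceeded = false;
	/* otherwise keep the previous value: no ping-pong between tasks */
}
```

      With a limit of 1000 pages, a light task arriving at 900 dirty pages
      leaves the flag set; it is cleared only once the count is below 875.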
      
      Impact: the end result is that dirty pages are kept more tightly under
      control, with the average number slightly lower than before.  This
      reduces the risk of throttling light dirtiers and hence makes the system
      more responsive.  However, it may add overhead by enforcing
      balance_dirty_pages() calls every 8 pages when there are 2+ heavy
      dirtiers.
      
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Christoph Hellwig <hch@infradead.org>
      CC: Dave Chinner <david@fromorbit.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      bcff25fc
  12. 10 Jul, 2011 7 commits
    • W
      writeback: trace global_dirty_state · e1cbe236
      Wu Fengguang authored
      Add trace event global_dirty_state for showing the global dirty page
      counts and thresholds at each global_dirty_limits() invocation.  This
      will cover the callers throttle_vm_writeout(), over_bground_thresh()
      and each balance_dirty_pages() loop.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e1cbe236
    • W
      writeback: introduce max-pause and pass-good dirty limits · ffd1f609
      Wu Fengguang authored
      The max-pause limit helps to keep the sleep time inside
      balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
      per task rate limit of 8pages/200ms=160KB/s when dirty exceeded, which
      normally is enough to stop dirtiers from continuing to push the dirty
      pages high, unless there is a sufficiently large number of slow dirtiers
      (eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the
      write bandwidth of a slow disk and hence accumulating more and more dirty
      pages).
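
      The arithmetic above can be checked with a small sketch. The 4KB page
      size is an assumption (the commit message implies it but does not state
      it), and the helper names are hypothetical.

```c
/* Assumption: 4KB pages, as implied by 8 pages / 200ms = 160KB/s. */
#define PAGE_SIZE_KB	4UL
/* Pages a task may dirty between balance_dirty_pages() calls when dirty exceeded. */
#define RATELIMIT_PAGES	8UL
/* Maximum sleep time per balance_dirty_pages() call, in milliseconds. */
#define MAX_PAUSE_MS	200UL

/* Per-task dirty rate in KB/s when every batch of pages costs one full pause. */
static unsigned long task_rate_kbps(void)
{
	return RATELIMIT_PAGES * PAGE_SIZE_KB * 1000UL / MAX_PAUSE_MS;
}

/* Aggregate dirty rate in KB/s for nr_tasks throttled dirtiers. */
static unsigned long aggregate_rate_kbps(unsigned long nr_tasks)
{
	return nr_tasks * task_rate_kbps();
}
```

      task_rate_kbps() gives 160, and aggregate_rate_kbps(500) gives 80000,
      i.e. the 80MB/s quoted above.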
      
      The pass-good limit helps to let the good bdi's through in the presence
      of a blocked bdi (e.g. NFS server not responding) or a slow USB disk which
      for some reason builds up a large number of initial dirty pages that
      refuse to go away anytime soon.
      
      For example, given two bdi's A and B and the initial state
      
      	bdi_thresh_A = dirty_thresh / 2
      	bdi_thresh_B = dirty_thresh / 2
      	bdi_dirty_A  = dirty_thresh / 2
      	bdi_dirty_B  = dirty_thresh / 2
      
      Then A gets blocked. After a dozen seconds:
      
      	bdi_thresh_A = 0
      	bdi_thresh_B = dirty_thresh
      	bdi_dirty_A  = dirty_thresh / 2
      	bdi_dirty_B  = dirty_thresh / 2
      
      The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
      will be effectively throttled by condition (nr_dirty < dirty_thresh).
      This has two problems:
      (1) we lose the protections for light dirtiers
      (2) balance_dirty_pages() effectively becomes IO-less because the
          (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
          for IO, but balance_dirty_pages() loses an important way to break
          out of the loop which leads to more spread out throttle delays.
      
      DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is,
      DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above
      example while this patch uses the more conservative value 8 so as not to
      surprise people with more dirty pages than expected.
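
      A rough sketch of why 2 would be needed to cover the example. It assumes
      the pass-good area extends the global limit to
      dirty_thresh + dirty_thresh / DIRTY_PASSGOOD_AREA; that formula and both
      helper names are assumptions for illustration, not quoted from the patch.

```c
/*
 * Hypothetical helper: the global ceiling up to which a good bdi is
 * still let through, for a given pass-good area divisor.
 */
static unsigned long passgood_ceiling(unsigned long dirty_thresh,
				      unsigned long area)
{
	return dirty_thresh + dirty_thresh / area;
}

/*
 * Total dirty pages in the example once B has grown to its new
 * threshold while A's pages remain stuck.
 */
static unsigned long example_total_dirty(unsigned long dirty_thresh)
{
	unsigned long bdi_dirty_a = dirty_thresh / 2;	/* blocked, cannot drain */
	unsigned long bdi_dirty_b = dirty_thresh;	/* grown to bdi_thresh_B */

	return bdi_dirty_a + bdi_dirty_b;
}
```

      For dirty_thresh = 1600 pages the example needs room for 2400 pages,
      which passgood_ceiling(1600, 2) provides, whereas the conservative
      divisor 8 stops at 1800.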
      
      The max-pause limit won't noticeably impact the speed at which dirty
      pages are knocked down when there is a sudden drop of global/bdi dirty
      thresholds, because the heavy dirtiers will be throttled below 160KB/s,
      which is slow enough. It does help to avoid long dirty throttle delays
      and especially will make light dirtiers more responsive.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      ffd1f609
    • W
      writeback: introduce smoothed global dirty limit · c42843f2
      Wu Fengguang authored
      The start of a heavyweight application (e.g. KVM) may instantly knock
      down determine_dirtyable_memory() if swap is not enabled or is full.
      global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
      dirty thresholds that are _much_ lower than the global/bdi dirty pages.
      
      balance_dirty_pages() will then heavily throttle all dirtiers including
      the light ones, until the dirty pages drop below the new dirty thresholds.
      During this _deep_ dirty-exceeded state, the system may appear rather
      unresponsive to the users.
      
      About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
      threshold to heavy dirtiers than light ones, and the dirty pages will
      be throttled around the heavy dirtiers' dirty threshold and reasonably
      below the light dirtiers' dirty threshold. In this state, only the heavy
      dirtiers will be throttled and the dirty pages are carefully controlled
      to not exceed the light dirtiers' dirty threshold. However if the
      threshold itself suddenly drops below the number of dirty pages, the
      light dirtiers will get heavily throttled.
      
      So introduce global_dirty_limit for tracking the global dirty threshold
      with policies
      
      - follow downwards slowly
      - follow up in one shot
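
      The two policies can be sketched as below. update_limit() is a
      hypothetical helper, and the 1/32 step used when following downwards is
      an assumption chosen for illustration; the commit message only says
      "slowly".

```c
/* Assumption: close 1/32 of the gap per update when following downwards. */
#define FOLLOW_DOWN_SHIFT	5

/*
 * Return the new smoothed limit given the previous smoothed limit and the
 * freshly computed global dirty threshold.
 */
static unsigned long update_limit(unsigned long limit, unsigned long thresh)
{
	if (limit < thresh)
		return thresh;		/* follow up in one shot */

	/* follow downwards slowly; a no-op when limit == thresh */
	return limit - ((limit - thresh) >> FOLLOW_DOWN_SHIFT);
}
```

      A threshold that jumps from 1000 to 2000 pages is adopted at once, while
      a drop from 3200 to 0 lowers the smoothed limit only to 3100 on the first
      update.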
      
      global_dirty_limit can effectively mask out the impact of a sudden drop
      of dirtyable memory. It will be used in the next patch for two new types
      of dirty limits. Note that the new dirty limits are not going to avoid
      throttling the light dirtiers, but could limit their sleep time to 200ms.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      c42843f2
    • W
      writeback: consolidate variable names in balance_dirty_pages() · 7762741e
      Wu Fengguang authored
      Introduce
      
      	nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS
      
      in order to simplify many tests in the following patches.
      
      balance_dirty_pages() will eventually care only about the dirty sums
      besides nr_writeback.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      7762741e
    • W
      writeback: bdi write bandwidth estimation · e98be2d5
      Wu Fengguang authored
      The estimation value will start from 100MB/s and adapt to the real
      bandwidth in seconds.
      
      It tries to update the bandwidth only when disk is fully utilized.
      Any inactive period of more than one second will be skipped.
      
      The estimated bandwidth reflects how fast the device can write out when
      _fully utilized_, and won't drop to 0 when it goes idle. The value
      remains constant while the disk is idle. At busy write time, ignoring
      fluctuations, it will also remain high unless it is knocked down by
      concurrent reads that compete with async writes for disk time and
      bandwidth.
      
      The estimation is not done purely in the flusher because there is no
      guarantee for write_cache_pages() to return timely to update bandwidth.
      
      The bdi->avg_write_bandwidth smoothing is very effective for filtering
      out sudden spikes, but may be a little biased in the long term.
      
      The overheads are low because the bdi bandwidth update only occurs at
      200ms intervals.
      
      The 200ms update interval is suitable because, due to large
      fluctuations, it is not possible to get the real instantaneous bandwidth
      at all.
      
      NFS commits can be as large as seconds' worth of data. One XFS
      completion may be as large as half a second's worth of data if we
      increase the write chunk to half a second's worth of data. In ext4,
      fluctuations with a period of around 5 seconds are observed. And there
      is another pattern of irregular periods of up to 20 seconds in SSD tests.
      
      That's why we not only do the estimation at 200ms intervals, but also
      average the estimates over a period of 3 seconds and then apply another
      level of smoothing in avg_write_bandwidth.
      
      CC: Li Shaohua <shaohua.li@intel.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e98be2d5
    • J
      writeback: account per-bdi accumulated written pages · f7d2b1ec
      Jan Kara authored
      Introduce the BDI_WRITTEN counter. It will be used for estimating the
      bdi's write bandwidth.
      
      Peter Zijlstra <a.p.zijlstra@chello.nl>:
      Move BDI_WRITTEN accounting into __bdi_writeout_inc().
      This will cover and fix fuse, which only calls bdi_writeout_inc().
      
      CC: Michael Rubin <mrubin@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      f7d2b1ec
    • W
      writeback: make writeback_control.nr_to_write straight · d46db3d5
      Wu Fengguang authored
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of a
      single file, but we keep abusing it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      This immediately cleans things up: e.g. wbc.nr_to_write vs
      work->nr_pages suddenly starts to make sense, and instead of saving and
      restoring pages_skipped in writeback_sb_inodes() it can always start with
      a clean zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      quota remained in wbc->nr_to_write.
      Acked-by: Jan Kara <jack@suse.cz>
      Proposed-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      d46db3d5
  13. 20 Jun, 2011 1 commit
  14. 08 Jun, 2011 3 commits
    • W
      writeback: skip balance_dirty_pages() for in-memory fs · 3efaf0fa
      Wu Fengguang authored
      This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
      
      Notes about the tmpfs/ramfs behavior changes:
      
      In 2.6.36 and older kernels, tmpfs writes will sleep inside
      balance_dirty_pages() as long as we are over the (dirty+background)/2
      global throttle threshold.  This is because both the dirty pages and
      threshold will be 0 for tmpfs/ramfs. Hence this test will always
      evaluate to TRUE:
      
                      dirty_exceeded =
                              (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                              || (nr_reclaimable + nr_writeback >= dirty_thresh);
      
      For 2.6.37, someone complained that the current logic does not allow the
      users to set vm.dirty_ratio=0.  So commit 4cbec4c8 changed the test to
      
                      dirty_exceeded =
                              (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                              || (nr_reclaimable + nr_writeback > dirty_thresh);
      
      So 2.6.37 will behave differently for tmpfs/ramfs: it will never get
      throttled unless the global dirty threshold is exceeded (which is very
      unlikely to happen; once it does, it will block many tasks).
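
      The difference between the two tests for a bdi with zero dirty pages and
      a zero threshold can be illustrated with a small sketch. The function
      names are hypothetical, and the parameters fold the reclaimable and
      writeback counts into single sums for brevity.

```c
#include <stdbool.h>

/* 2.6.36-style test: ">=" makes "0 >= 0" true for tmpfs/ramfs. */
static bool dirty_exceeded_ge(unsigned long bdi_dirty, unsigned long bdi_thresh,
			      unsigned long nr_dirty, unsigned long dirty_thresh)
{
	return bdi_dirty >= bdi_thresh || nr_dirty >= dirty_thresh;
}

/* 2.6.37-style test: ">" leaves only the global condition for tmpfs/ramfs. */
static bool dirty_exceeded_gt(unsigned long bdi_dirty, unsigned long bdi_thresh,
			      unsigned long nr_dirty, unsigned long dirty_thresh)
{
	return bdi_dirty > bdi_thresh || nr_dirty > dirty_thresh;
}
```

      With bdi_dirty = bdi_thresh = 0 and global dirty pages well under the
      limit, the first variant still reports dirty_exceeded while the second
      does not.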
      
      I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
      for a busy writing server, tmpfs write()s may get livelocked! The
      "inadvertent" throttling can hardly bring help to any workload because
      of its "either no throttling, or get throttled to death" property.
      
      So based on 2.6.37, this patch won't bring more noticeable changes.
      
      CC: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      3efaf0fa
    • W
      writeback: add bdi_dirty_limit() kernel-doc · 6f718656
      Wu Fengguang authored
      Clarify the bdi_dirty_limit() comment.
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      6f718656
    • W
      writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Wu Fengguang authored
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      that may be due to some "redirty_tail() after pages_skipped" effect,
      which is by no means a guarantee for _all_ file systems.
      
      Note that writeback_inodes_sb() is called not only by sync(); all
      callers are treated the same because the others also need livelock
      prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      Acked-by: Jan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      6e6938b6
  15. 23 Mar, 2011 2 commits
    • J
      writeback: make mapping->writeback_index to point to the last written page · cf15b07c
      Jun'ichi Nomura authored
      For range-cyclic writeback (e.g. kupdate), the writeback code sets the
      continuation point of the next writeback in mapping->writeback_index,
      which is set to the page after the last written page.  This is done so
      that we write the whole file evenly even if pages in it get continuously
      redirtied.
      
      However, in some cases a sequential writer is writing in the middle of
      a page and just redirties the last written page by continuing from there.
      For example, with an application that uses a file as a big ring buffer we
      see:
      
      [1st writeback session]
             ...
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898514 + 8
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898522 + 8
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898530 + 8
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898538 + 8
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898546 + 8
           kworker/0:1-11    4571: block_rq_issue: 8,0 W 0 () 94898514 + 40
      >>     flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898554 + 8
      >>     flush-8:0-2743  4571: block_rq_issue: 8,0 W 0 () 94898554 + 8
      
      [2nd writeback session after 35sec]
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898562 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898570 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898578 + 8
             ...
           kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94898562 + 640
           kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94899202 + 72
             ...
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899962 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899970 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899978 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899986 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899994 + 8
           kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94899962 + 40
      >>     flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898554 + 8
      >>     flush-8:0-2743  4606: block_rq_issue: 8,0 W 0 () 94898554 + 8
      
      So we sought back to 94898554 after we wrote all the pages at the end of
      the file.
      
      This extra seek seems unnecessary.  If we continue writeback from the last
      written page, we can avoid it and do not cause harm to other cases.  The
      original intent of even writeout over the whole file is preserved and if
      the page does not get redirtied pagevec_lookup_tag() just skips it.
      
      As an exceptional case, when an I/O error happens, set done_index to the
      next page as the comment in the code suggests.
      Tested-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf15b07c
    • M
      mm: reclaim invalidated page ASAP · 278df9f4
      Minchan Kim authored
      invalidate_mapping_pages is a very big hint to the reclaimer.  It means
      the user doesn't want to use the page any more.  So in order to prevent
      working set page eviction, this patch moves the page to the tail of the
      inactive list by using PG_reclaim.
      
      Please remember that pages in the inactive list are part of the working
      set, just like those in the active list.  If we don't move such pages to
      the inactive list's tail, pages near the tail of the inactive list can be
      evicted even though we have a big clue about useless pages.  That's
      totally bad.
      
      Now PG_readahead/PG_reclaim is shared.  fe3cba17 added ClearPageReclaim
      into clear_page_dirty_for_io to prevent fast reclaim of the readahead
      marker page.
      
      In this series, PG_reclaim is used by invalidated pages, too.  If the VM
      finds that a page is invalidated and dirty, it sets PG_reclaim to reclaim
      it ASAP.  Then, when the dirty page is written back,
      clear_page_dirty_for_io will clear PG_reclaim unconditionally.  That
      defeats this series' goal.
      
      I think it's okay to clear PG_readahead when the page is dirtied, not at
      writeback time.  So this patch moves ClearPageReadahead.  In v4,
      ClearPageReadahead in set_page_dirty had a problem, reported by
      Steven Barrett.  It was due to compound pages.  Some drivers (e.g. audio)
      call set_page_dirty with a compound page which isn't on the LRU, but my
      patch did ClearPageReclaim on the compound page.  Without
      CONFIG_PAGEFLAGS_EXTENDED, that breaks the PageTail flag.
      
      I think it doesn't affect THP, and it passes my test with THP enabled, but
      I Cc'ed Andrea for a double check.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reported-by: Steven Barrett <damentz@liquorix.net>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      278df9f4