1. 09 6月, 2012 1 次提交
  2. 31 10月, 2011 1 次提交
    • C
      writeback: Add a 'reason' to wb_writeback_work · 0e175a18
      Curt Wohlgemuth 提交于
      This creates a new 'reason' field in a wb_writeback_work
      structure, which unambiguously identifies who initiates
      writeback activity.  A 'wb_reason' enumeration has been
      added to writeback.h, to enumerate the possible reasons.
      
      The 'writeback_work_class' and tracepoint event class and
      'writeback_queue_io' tracepoints are updated to include the
      symbolic 'reason' in all trace events.
      
      And the 'writeback_inodes_sbXXX' family of routines has had
      a wb_stats parameter added to them, so callers can specify
      why writeback is being started.
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      0e175a18
  3. 03 10月, 2011 3 次提交
    • W
      writeback: stabilize bdi->dirty_ratelimit · 7381131c
      Wu Fengguang 提交于
      There are some imperfections in balanced_dirty_ratelimit.
      
      1) large fluctuations
      
      The dirty_rate used for computing balanced_dirty_ratelimit is merely
      averaged in the past 200ms (very small comparing to the 3s estimation
      period for write_bw), which makes rather dispersed distribution of
      balanced_dirty_ratelimit.
      
      It's pretty hard to average out the singular points by increasing the
      estimation period. Considering that the averaging technique will
      introduce very undesirable time lags, I give it up totally. (btw, the 3s
      write_bw averaging time lag is much more acceptable because its impact
      is one-way and therefore won't lead to oscillations.)
      
      The more practical way is filtering -- most singular
      balanced_dirty_ratelimit points can be filtered out by remembering some
      prev_balanced_rate and prev_prev_balanced_rate. However the more
      reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.
      
      2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
      match could become unbalanced, which may lead to large systematical
      errors in balanced_dirty_ratelimit. The truncates, due to its possibly
      bumpy nature, can hardly be compensated smoothly. So let's face it. When
      some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
      high, dirty pages will go higher than the setpoint. task_ratelimit will
      in turn become lower than dirty_ratelimit.  So if we consider both
      balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
      only when they are on the same side of dirty_ratelimit, the systematical
      errors in balanced_dirty_ratelimit won't be able to bring
      dirty_ratelimit far away.
      
      The balanced_dirty_ratelimit estimation may also be inaccurate near
      @limit or @freerun, however is less an issue.
      
      3) since we ultimately want to
      
      - keep the fluctuations of task ratelimit as small as possible
      - keep the dirty pages around the setpoint as long time as possible
      
      the update policy used for (2) also serves the above goals nicely:
      if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
      and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
      there is no point to bring up dirty_ratelimit in a hurry only to hurt
      both the above two goals.
      
      So, we make use of task_ratelimit to limit the update of dirty_ratelimit
      in two ways:
      
      1) avoid changing dirty rate when it's against the position control target
         (the adjusted rate will slow down the progress of dirty pages going
         back to setpoint).
      
      2) limit the step size. task_ratelimit is changing values step by step,
         leaving a consistent trace comparing to the randomly jumping
         balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
         errors in stable state and typically larger errors when there are big
         errors in rate.  So it's a pretty good limiting factor for the step
         size of dirty_ratelimit.
      
      Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
      task_ratelimit is merely used as a limiting factor.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      7381131c
    • W
      writeback: dirty rate control · be3ffa27
      Wu Fengguang 提交于
      It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
      when there are N dd tasks.
      
      On write() syscall, use bdi->dirty_ratelimit
      ============================================
      
          balance_dirty_pages(pages_dirtied)
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              pause = pages_dirtied / task_ratelimit;
              sleep(pause);
          }
      
      On every 200ms, update bdi->dirty_ratelimit
      ===========================================
      
          bdi_update_dirty_ratelimit()
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
              bdi->dirty_ratelimit = balanced_dirty_ratelimit
          }
      
      Estimation of balanced bdi->dirty_ratelimit
      ===========================================
      
      balanced task_ratelimit
      -----------------------
      
      balance_dirty_pages() needs to throttle tasks dirtying pages such that
      the total amount of dirty pages stays below the specified dirty limit in
      order to avoid memory deadlocks. Furthermore we desire fairness in that
      tasks get throttled proportionally to the amount of pages they dirty.
      
      IOW we want to throttle tasks such that we match the dirty rate to the
      writeout bandwidth, this yields a stable amount of dirty pages:
      
              dirty_rate == write_bw                                          (1)
      
      The fairness requirement gives us:
      
              task_ratelimit = balanced_dirty_ratelimit
                             == write_bw / N                                  (2)
      
      where N is the number of dd tasks.  We don't know N beforehand, but
      still can estimate balanced_dirty_ratelimit within 200ms.
      
      Start by throttling each dd task at rate
      
              task_ratelimit = task_ratelimit_0                               (3)
                               (any non-zero initial value is OK)
      
      After 200ms, we measured
      
              dirty_rate = # of pages dirtied by all dd's / 200ms
              write_bw   = # of pages written to the disk / 200ms
      
      For the aggressive dd dirtiers, the equality holds
      
              dirty_rate == N * task_rate
                         == N * task_ratelimit_0                              (4)
      Or
              task_ratelimit_0 == dirty_rate / N                              (5)
      
      Now we conclude that the balanced task ratelimit can be estimated by
      
                                                            write_bw
              balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                            dirty_rate
      
      Because with (4) and (5) we can get the desired equality (1):
      
                                                             write_bw
              balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                             dirty_rate
                                       == write_bw / N
      
      Then using the balanced task ratelimit we can compute task pause times like:
      
              task_pause = task->nr_dirtied / task_ratelimit
      
      task_ratelimit with position control
      ------------------------------------
      
      However, while the above gives us means of matching the dirty rate to
      the writeout bandwidth, it at best provides us with a stable dirty page
      count (assuming a static system). In order to control the dirty page
      count such that it is high enough to provide performance, but does not
      exceed the specified limit we need another control.
      
      The dirty position control works by extending (2) to
      
              task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)
      
      where pos_ratio is a negative feedback function that subjects to
      
      1) f(setpoint) = 1.0
      2) df/dx < 0
      
      That is, if the dirty pages are ABOVE the setpoint, we throttle each
      task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty
      pages are created less fast than they are cleaned, thus DROP to the
      setpoints (and the reverse).
      
      Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
      remains CONSTANT for the past 200ms, we get
      
              task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)
      
      Putting (8) into (6), we get the formula used in
      bdi_update_dirty_ratelimit():
      
                                                      write_bw
              balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                      dirty_rate
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      be3ffa27
    • W
      writeback: account per-bdi accumulated dirtied pages · c8e28ce0
      Wu Fengguang 提交于
      Introduce the BDI_DIRTIED counter. It will be used for estimating the
      bdi's dirty bandwidth.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Michael Rubin <mrubin@google.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      c8e28ce0
  4. 27 7月, 2011 1 次提交
  5. 10 7月, 2011 2 次提交
    • W
      writeback: bdi write bandwidth estimation · e98be2d5
      Wu Fengguang 提交于
      The estimation value will start from 100MB/s and adapt to the real
      bandwidth in seconds.
      
      It tries to update the bandwidth only when disk is fully utilized.
      Any inactive period of more than one second will be skipped.
      
      The estimated bandwidth will be reflecting how fast the device can
      writeout when _fully utilized_, and won't drop to 0 when it goes idle.
      The value will remain constant at disk idle time. At busy write time, if
      not considering fluctuations, it will also remain high unless be knocked
      down by possible concurrent reads that compete for the disk time and
      bandwidth with async writes.
      
      The estimation is not done purely in the flusher because there is no
      guarantee for write_cache_pages() to return timely to update bandwidth.
      
      The bdi->avg_write_bandwidth smoothing is very effective for filtering
      out sudden spikes, however may be a little biased in long term.
      
      The overheads are low because the bdi bandwidth update only occurs at
      200ms intervals.
      
      The 200ms update interval is suitable, because it's not possible to get
      the real bandwidth for the instance at all, due to large fluctuations.
      
      The NFS commits can be as large as seconds worth of data. One XFS
      completion may be as large as half second worth of data if we are going
      to increase the write chunk to half second worth of data. In ext4,
      fluctuations with time period of around 5 seconds is observed. And there
      is another pattern of irregular periods of up to 20 seconds on SSD tests.
      
      That's why we are not only doing the estimation at 200ms intervals, but
      also averaging them over a period of 3 seconds and then go further to do
      another level of smoothing in avg_write_bandwidth.
      
      CC: Li Shaohua <shaohua.li@intel.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      e98be2d5
    • J
      writeback: account per-bdi accumulated written pages · f7d2b1ec
      Jan Kara 提交于
      Introduce the BDI_WRITTEN counter. It will be used for estimating the
      bdi's write bandwidth.
      
      Peter Zijlstra <a.p.zijlstra@chello.nl>:
      Move BDI_WRITTEN accounting into __bdi_writeout_inc().
      This will cover and fix fuse, which only calls bdi_writeout_inc().
      
      CC: Michael Rubin <mrubin@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      f7d2b1ec
  6. 08 6月, 2011 1 次提交
    • C
      writeback: split inode_wb_list_lock into bdi_writeback.list_lock · f758eeab
      Christoph Hellwig 提交于
      Split the global inode_wb_list_lock into a per-bdi_writeback list_lock,
      as it's currently the most contended lock in the system for metadata
      heavy workloads.  It won't help for single-filesystem workloads for
      which we'll need the I/O-less balance_dirty_pages, but at least we
      can dedicate a cpu to spinning on each bdi now for larger systems.
      
      Based on earlier patches from Nick Piggin and Dave Chinner.
      
      It reduces lock contentions to 1/4 in this test case:
      10 HDD JBOD, 100 dd on each disk, XFS, 6GB ram
      
      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      vanilla 2.6.39-rc3:
                            inode_wb_list_lock:         42590          44433           0.12         147.74      144127.35         252274         886792           0.08         121.34      917211.23
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             34          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock          12893          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock          10702          [<ffffffff8115afef>] writeback_single_inode+0x16d/0x20a
                            ------------------
                            inode_wb_list_lock              2          [<ffffffff81165da5>] bdev_inode_switch_bdi+0x29/0x85
                            inode_wb_list_lock             19          [<ffffffff8115bd0b>] inode_wb_list_del+0x22/0x49
                            inode_wb_list_lock           5550          [<ffffffff8115bb53>] __mark_inode_dirty+0x170/0x1d0
                            inode_wb_list_lock           8511          [<ffffffff8115b4ad>] writeback_sb_inodes+0x10f/0x157
      
      2.6.39-rc3 + patch:
                      &(&wb->list_lock)->rlock:         11383          11657           0.14         151.69       40429.51          90825         527918           0.11         145.90      556843.37
                      ------------------------
                      &(&wb->list_lock)->rlock             10          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           1493          [<ffffffff8115b1ed>] writeback_inodes_wb+0x3d/0x150
                      &(&wb->list_lock)->rlock           3652          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
                      &(&wb->list_lock)->rlock           1412          [<ffffffff8115a38e>] writeback_single_inode+0x17f/0x223
                      ------------------------
                      &(&wb->list_lock)->rlock              3          [<ffffffff8110b5af>] bdi_lock_two+0x46/0x4b
                      &(&wb->list_lock)->rlock              6          [<ffffffff8115b189>] inode_wb_list_del+0x5f/0x86
                      &(&wb->list_lock)->rlock           2061          [<ffffffff8115af97>] __mark_inode_dirty+0x173/0x1cf
                      &(&wb->list_lock)->rlock           2629          [<ffffffff8115a8e9>] writeback_sb_inodes+0x123/0x16f
      
      hughd@google.com: fix recursive lock when bdi_lock_two() is called with new the same as old
      akpm@linux-foundation.org: cleanup bdev_inode_switch_bdi() comment
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      f758eeab
  7. 10 3月, 2011 1 次提交
  8. 27 10月, 2010 2 次提交
    • N
      mm: declare some external symbols · 92c09c04
      Namhyung Kim 提交于
      Declare 'bdi_pending_list' and 'tag_pages_for_writeback()' to remove
      following sparse warnings:
      
       mm/backing-dev.c:46:1: warning: symbol 'bdi_pending_list' was not declared. Should it be static?
       mm/page-writeback.c:825:6: warning: symbol 'tag_pages_for_writeback' was not declared. Should it be static?
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      92c09c04
    • M
      writeback: do not sleep on the congestion queue if there are no congested BDIs... · 0e093d99
      Mel Gorman 提交于
      writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
      
      If congestion_wait() is called with no BDI congested, the caller will
      sleep for the full timeout and this may be an unnecessary sleep.  This
      patch adds a wait_iff_congested() that checks congestion and only sleeps
      if a BDI is congested else, it calls cond_resched() to ensure the caller
      is not hogging the CPU longer than its quota but otherwise will not sleep.
      
      This is aimed at reducing some of the major desktop stalls reported during
      IO.  For example, while kswapd is operating, it calls congestion_wait()
      but it could just have been reclaiming clean page cache pages with no
      congestion.  Without this patch, it would sleep for a full timeout but
      after this patch, it'll just call schedule() if it has been on the CPU too
      long.  Similar logic applies to direct reclaimers that are not making
      enough progress.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e093d99
  9. 12 8月, 2010 1 次提交
    • J
      mm: fix writeback_in_progress() · 81d73a32
      Jan Kara 提交于
      Commit 83ba7b07 ("writeback: simplify the write back thread queue")
      broke writeback_in_progress() as in that commit we started to remove work
      items from the list at the moment we start working on them and not at the
      moment they are finished.  Thus if the flusher thread was doing some work
      but there was no other work queued, writeback_in_progress() returned
      false.  This could in particular cause unnecessary queueing of background
      writeback from balance_dirty_pages() or writeout work from
      writeback_sb_if_idle().
      
      This patch fixes the problem by introducing a bit in the bdi state which
      indicates that the flusher thread is processing some work and uses this
      bit for writeback_in_progress() test.
      
      NOTE: Both callsites of writeback_in_progress() (namely,
      writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually
      need a different information than what writeback_in_progress() provides.
      They would need to know whether *the kind of writeback they are going to
      submit* is already queued.  But this information isn't that simple to
      provide so let's fix writeback_in_progress() for the time being.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Acked-by: NJens Axboe <jaxboe@fusionio.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81d73a32
  10. 08 8月, 2010 6 次提交
    • A
      writeback: optimize periodic bdi thread wakeups · 6467716a
      Artem Bityutskiy 提交于
      Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which
      should take care of the periodic background write-out. However, the write-out
      will actually start only 'dirty_writeback_interval' centisecs later, so we can
      delay the wake-up.
      
      This change was requested by Nick Piggin who pointed out that if we delay the
      wake-up, we weed out 2 unnecessary contex switches, which matters because
      '__mark_inode_dirty()' is a hot-path function.
      
      This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
      sets up a timer to wake-up the bdi thread and returns. So the wake-up is
      delayed.
      
      We also delete the timer in bdi threads just before writing-back. And
      synchronously delete it when unregistering bdi. At the unregister point the bdi
      does not have any users, so no one can arm it again.
      
      Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
      context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
      this change as well.
      
      This patch also moves the 'bdi_wb_init()' function down in the file to avoid
      forward-declaration of 'bdi_wakeup_thread_delayed()'.
      Signed-off-by: NArtem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      6467716a
    • A
      writeback: move last_active to bdi · ecd58403
      Artem Bityutskiy 提交于
      Currently bdi threads use local variable 'last_active' which stores last time
      when the bdi thread did some useful work. Move this local variable to 'struct
      bdi_writeback'. This is just a preparation for the further patches which will
      make the forker thread decide when bdi threads should be killed.
      Signed-off-by: NArtem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      ecd58403
    • A
      writeback: simplify bdi code a little · 080dcec4
      Artem Bityutskiy 提交于
      This patch simplifies bdi code a little by removing the 'pending_list' which is
      redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is
      working like this:
      
      1. In a loop, fetch all bdi's which have works but have no writeback thread and
         move them to the 'pending_list'.
      2. If the list is empty, sleep for 5 sec.
      3. Otherwise, take one bdi from the list, fork the writeback thread for this
         bdi, and repeat the loop.
      
      IOW, it first moves everything to the 'pending_list', then process only one
      element, and so on. This patch simplifies the algorithm, which is now as
      follows.
      
      1. Find the first bdi which has a work and remove it from the global list of
         bdi's (bdi_list).
      2. If there was not such bdi, sleep 5 sec.
      3. Fork the writeback thread for this bdi and repeat the loop.
      
      IOW, now we find the first bdi to process, process it, and so on. This is
      simpler and involves less lists.
      
      The bonus now is that we can get rid of a couple of functions, as well as
      remove complications which involve 'rcu_call()' and 'bdi->rcu_head'.
      
      This patch also makes sure we use 'list_add_tail_rcu()', instead of plain
      'list_add_tail()', but this piece of code is going to be removed in the next
      patch anyway.
      Signed-off-by: NArtem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      080dcec4
    • A
      writeback: harmonize writeback threads naming · 6f904ff0
      Artem Bityutskiy 提交于
      The write-back code mixes words "thread" and "task" for the same things. This
      is not a big deal, but still an inconsistency.
      
      hch: a convention I tend to use and I've seen in various places
      is to always use _task for the storage of the task_struct pointer,
      and thread everywhere else.  This especially helps with having
      foo_thread for the actual thread and foo_task for a global
      variable keeping the task_struct pointer
      
      This patch renames:
      * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
      * 'bdi_forker_task()'              -> 'bdi_forker_thread()'
      
      because bdi threads are 'bdi_writeback_thread()', so these names are more
      consistent.
      
      This patch also amends commentaries and makes them refer the forker and bdi
      threads as "thread", not "task".
      
      Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
      'static void' instead of 'void static' and make checkpatch.pl happy.
      Signed-off-by: NArtem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      6f904ff0
    • C
      writeback: merge bdi_writeback_task and bdi_start_fn · 08243900
      Christoph Hellwig 提交于
      Move all code for the writeback thread into fs/fs-writeback.c instead of
      splitting it over two functions in two files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      08243900
    • C
      writeback: remove wb_list · c1955ce3
      Christoph Hellwig 提交于
      The wb_list member of struct backing_device_info always has exactly one
      element.  Just use the direct bdi->wb pointer instead and simplify some
      code.
      
      Also remove bdi_task_init which is now trivial to prepare for the next
      patch.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      c1955ce3
  11. 06 7月, 2010 1 次提交
    • C
      writeback: simplify the write back thread queue · 83ba7b07
      Christoph Hellwig 提交于
      First remove items from work_list as soon as we start working on them.  This
      means we don't have to track any pending or visited state and can get
      rid of all the RCU magic freeing the work items - we can simply free
      them once the operation has finished.  Second use a real completion for
      tracking synchronous requests - if the caller sets the completion pointer
      we complete it, otherwise use it as a boolean indicator that we can free
      the work item directly.  Third unify struct wb_writeback_args and struct
      bdi_work into a single data structure, wb_writeback_work.  Previous we
      set all parameters into a struct wb_writeback_args, copied it into
      struct bdi_work, copied it again on the stack to use it there.  Instead
      of just allocate one structure dynamically or on the stack and use it
      all the way through the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      83ba7b07
  12. 11 6月, 2010 1 次提交
  13. 01 6月, 2010 1 次提交
  14. 22 5月, 2010 1 次提交
    • J
      writeback: fixups for !dirty_writeback_centisecs · 6423104b
      Jens Axboe 提交于
      Commit 69b62d01 fixed up most of the places where we would enter
      busy schedule() spins when disabling the periodic background
      writeback. This fixes up the sb timer so that it doesn't get
      hammered on with the delay disabled, and ensures that it gets
      rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs
      gets modified.
      
      bdi_forker_task() also needs to check for !dirty_writeback_centisecs
      and use schedule() appropriately, fix that up too.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      6423104b
  15. 17 5月, 2010 1 次提交
    • J
      writeback: fix WB_SYNC_NONE writeback from umount · e913fc82
      Jens Axboe 提交于
      When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
      writeback to kick off writeback of pending dirty inodes, then follow
      that up with a WB_SYNC_ALL to wait for it. Since umount already holds
      the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
      writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
      since WB_SYNC_ALL writeback is a data integrity operation and thus
      a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
      it's a lot slower.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      e913fc82
  16. 25 4月, 2010 1 次提交
    • J
      Catch filesystems lacking s_bdi · 5129a469
      Jörn Engel 提交于
      noop_backing_dev_info is used only as a flag to mark filesystems that
      don't have any backing store, like tmpfs, procfs, spufs, etc.
      Signed-off-by: NJoern Engel <joern@logfs.org>
      
      Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes
      to the noop_backing_dev_info is not legal and will not result in
      them being flushed, but we already catch this condition in
      __mark_inode_dirty() when checking for a registered bdi.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      5129a469
  17. 22 4月, 2010 1 次提交
  18. 06 4月, 2010 1 次提交
    • M
      laptop-mode: Make flushes per-device · 31373d09
      Matthew Garrett 提交于
      One of the features of laptop-mode is that it forces a writeout of dirty
      pages if something else triggers a physical read or write from a device.
      The current implementation flushes pages on all devices, rather than only
      the one that triggered the flush. This patch alters the behaviour so that
      only the recently accessed block device is flushed, preventing other
      disks being spun up for no terribly good reason.
      Signed-off-by: NMatthew Garrett <mjg@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      31373d09
  19. 29 10月, 2009 1 次提交
  20. 26 9月, 2009 1 次提交
  21. 16 9月, 2009 2 次提交
  22. 11 9月, 2009 4 次提交
    • J
      writeback: check for registered bdi in flusher add and inode dirty · 500b067c
      Jens Axboe 提交于
      Also a debugging aid. We want to catch dirty inodes being added to
      backing devices that don't do writeback.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      500b067c
    • J
      writeback: add name to backing_dev_info · d993831f
      Jens Axboe 提交于
      This enables us to track who does what and print info. Its main use
      is catching dirty inodes on the default_backing_dev_info, so we can
      fix that up.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d993831f
    • J
      writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe 提交于
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as does random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      03ba3782
    • J
      writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe 提交于
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      66f3b8e2
  23. 12 7月, 2009 1 次提交
  24. 11 7月, 2009 1 次提交
  25. 06 4月, 2009 1 次提交
  26. 20 10月, 2008 1 次提交
    • R
      vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Rik van Riel 提交于
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f98a2fe
  27. 30 4月, 2008 1 次提交