1. Aug 19, 2011 (1 commit)
    • squeeze max-pause area and drop pass-good area · bb082295
      Authored by Wu Fengguang
      Revert the pass-good area introduced in ffd1f609 ("writeback:
      introduce max-pause and pass-good dirty limits") and make the max-pause
      area smaller and safe.
      
      This fixes ~30% performance regression in the ext3 data=writeback
      fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
      12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
      
      Using the deadline scheduler also shows a regression, though not as big
      as with CFQ, which suggests some write starvation.
      
      The test logs show that
      
      - the disks are sometimes under utilized
      
      - global dirty pages sometimes rush up into the pass-good area for
        several hundred seconds, while in the meantime some bdi dirty pages
        drop to very low values (bdi_dirty << bdi_thresh).  Then suddenly the
        global dirty pages drop under the global dirty threshold and bdi_dirty
        rushes very high (for example, 2 times higher than bdi_thresh), during
        which time balance_dirty_pages() is not called at all.
      
      So the problems are
      
      1) The random writes progress so slow that they break the assumption of
         the max-pause logic that "8 pages per 200ms is typically more than
         enough to curb heavy dirtiers".
      
      2) The max-pause logic ignores task_bdi_thresh and thus opens the
         possibility for some bdi's to dirty too many pages, leading to
         (bdi_dirty >> bdi_thresh) for some and (bdi_thresh >> bdi_dirty)
         for others.
      
      3) The higher max-pause/pass-good thresholds somehow lead to the bad
         swing of dirty pages.
      
      The fix is to allow the task to dirty slightly over task_bdi_thresh,
      but never to exceed bdi_dirty and/or the global dirty_thresh.
      
      Tests show that it fixed the JBOD regression completely (both behavior
      and performance), while still being able to cut down large pause times
      in balance_dirty_pages() for single-disk cases.
      Reported-by: Li Shaohua <shaohua.li@intel.com>
      Tested-by: Li Shaohua <shaohua.li@intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      bb082295
  2. Jul 26, 2011 (2 commits)
  3. Jul 24, 2011 (1 commit)
    • mm: properly reflect task dirty limits in dirty_exceeded logic · bcff25fc
      Authored by Jan Kara
      We set bdi->dirty_exceeded (and thus the ratelimiting code starts to
      call balance_dirty_pages() every 8 pages) when a per-bdi limit is
      exceeded or the global limit is exceeded. But the per-bdi limit also
      depends on the task. Thus different tasks reach the limit on that bdi
      at different levels of dirty pages. The result is that with the current
      code bdi->dirty_exceeded ping-pongs between 1 and 0 depending on which
      task just got into balance_dirty_pages().
      
      We fix the issue by clearing bdi->dirty_exceeded only when the per-bdi
      amount of dirty pages drops below the threshold (7/8 * bdi_dirty_limit),
      where task limits no longer have any influence.
      
      Impact:  The end result is that the dirty pages are kept more tightly
      under control, with the average number slightly lower than before.
      This reduces the risk of throttling light dirtiers and hence improves
      responsiveness.  However, it may add overhead by enforcing
      balance_dirty_pages() calls on every 8 pages when there are 2+ heavy
      dirtiers.
      
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Christoph Hellwig <hch@infradead.org>
      CC: Dave Chinner <david@fromorbit.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      bcff25fc
  4. Jul 10, 2011 (7 commits)
    • writeback: trace global_dirty_state · e1cbe236
      Authored by Wu Fengguang
      Add trace event balance_dirty_state for showing the global dirty page
      counts and thresholds at each global_dirty_limits() invocation.  This
      will cover the callers throttle_vm_writeout(), over_bground_thresh()
      and each balance_dirty_pages() loop.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e1cbe236
    • writeback: introduce max-pause and pass-good dirty limits · ffd1f609
      Authored by Wu Fengguang
      The max-pause limit helps keep the sleep time inside
      balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
      a per-task rate limit of 8 pages/200ms = 160KB/s when the dirty limit is
      exceeded, which is normally enough to stop dirtiers from continuing to
      push the dirty pages high, unless there is a sufficiently large number
      of slow dirtiers (eg. 500 tasks doing 160KB/s will still sum up to
      80MB/s, exceeding the write bandwidth of a slow disk and hence
      accumulating more and more dirty pages).
      
      The pass-good limit helps to let go of the good bdi's in the presence
      of a blocked bdi (ie. an NFS server not responding) or a slow USB disk
      which for some reason builds up a large number of initial dirty pages
      that refuse to go away anytime soon.
      
      For example, given two bdi's A and B and the initial state
      
      	bdi_thresh_A = dirty_thresh / 2
      	bdi_thresh_B = dirty_thresh / 2
      	bdi_dirty_A  = dirty_thresh / 2
      	bdi_dirty_B  = dirty_thresh / 2
      
      Then A gets blocked; after a dozen seconds:
      
      	bdi_thresh_A = 0
      	bdi_thresh_B = dirty_thresh
      	bdi_dirty_A  = dirty_thresh / 2
      	bdi_dirty_B  = dirty_thresh / 2
      
      The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
      will be effectively throttled by condition (nr_dirty < dirty_thresh).
      This has two problems:
      (1) we lose the protections for light dirtiers
      (2) balance_dirty_pages() effectively becomes IO-less because the
          (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
          for IO, but balance_dirty_pages() loses an important way to break
          out of the loop which leads to more spread out throttle delays.
      
      DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is
      that DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the
      above example, while this patch uses the more conservative value 8 so
      as not to surprise people with more dirty pages than expected.
      
      The max-pause limit won't noticeably impact the speed at which dirty
      pages are knocked down when there is a sudden drop of the global/bdi
      dirty thresholds, because the heavy dirtiers will be throttled below
      160KB/s, which is slow enough. It does help to avoid long dirty
      throttle delays and especially makes light dirtiers more responsive.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      ffd1f609
    • writeback: introduce smoothed global dirty limit · c42843f2
      Authored by Wu Fengguang
      The start of a heavyweight application (ie. KVM) may instantly knock
      down determine_dirtyable_memory() if swap is not enabled or is full.
      global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
      dirty thresholds that are _much_ lower than the global/bdi dirty pages.
      
      balance_dirty_pages() will then heavily throttle all dirtiers including
      the light ones, until the dirty pages drop below the new dirty thresholds.
      During this _deep_ dirty-exceeded state, the system may appear rather
      unresponsive to the users.
      
      About the "deep" dirty-exceeded state: task_dirty_limit() assigns a 1/8
      lower dirty threshold to heavy dirtiers than to light ones, and the dirty pages will
      be throttled around the heavy dirtiers' dirty threshold and reasonably
      below the light dirtiers' dirty threshold. In this state, only the heavy
      dirtiers will be throttled and the dirty pages are carefully controlled
      to not exceed the light dirtiers' dirty threshold. However if the
      threshold itself suddenly drops below the number of dirty pages, the
      light dirtiers will get heavily throttled.
      
      So introduce global_dirty_limit for tracking the global dirty threshold
      with policies
      
      - follow downwards slowly
      - follow up in one shot
      
      global_dirty_limit can effectively mask out the impact of a sudden drop
      in dirtyable memory. It will be used in the next patch for two new
      types of dirty limits. Note that the new dirty limits are not going to
      avoid throttling the light dirtiers, but could limit their sleep time
      to 200ms.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      c42843f2
    • writeback: consolidate variable names in balance_dirty_pages() · 7762741e
      Authored by Wu Fengguang
      Introduce
      
      	nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS
      
      in order to simplify many tests in the following patches.
      
      balance_dirty_pages() will eventually care only about the dirty sums
      besides nr_writeback.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      7762741e
    • writeback: bdi write bandwidth estimation · e98be2d5
      Authored by Wu Fengguang
      The estimation value will start from 100MB/s and adapt to the real
      bandwidth in seconds.
      
      It tries to update the bandwidth only when disk is fully utilized.
      Any inactive period of more than one second will be skipped.
      
      The estimated bandwidth reflects how fast the device can write out when
      _fully utilized_, and won't drop to 0 when it goes idle. The value
      remains constant while the disk is idle. During busy writes,
      fluctuations aside, it will also remain high unless knocked down by
      concurrent reads that compete with async writes for disk time and
      bandwidth.
      
      The estimation is not done purely in the flusher because there is no
      guarantee for write_cache_pages() to return timely to update bandwidth.
      
      The bdi->avg_write_bandwidth smoothing is very effective at filtering
      out sudden spikes, though it may be a little biased in the long term.
      
      The overheads are low because the bdi bandwidth update only occurs at
      200ms intervals.
      
      The 200ms update interval is suitable because it's not possible to
      measure the instantaneous bandwidth at all, due to large fluctuations.
      
      The NFS commits can be as large as seconds' worth of data. One XFS
      completion may be as large as half a second's worth of data if we are
      going to increase the write chunk to half a second's worth of data. In
      ext4, fluctuations with a period of around 5 seconds are observed. And
      there is another pattern of irregular periods of up to 20 seconds in
      SSD tests.
      
      That's why we are not only doing the estimation at 200ms intervals, but
      also averaging them over a period of 3 seconds and then going further
      to do another level of smoothing in avg_write_bandwidth.
      
      CC: Li Shaohua <shaohua.li@intel.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      e98be2d5
    • writeback: account per-bdi accumulated written pages · f7d2b1ec
      Authored by Jan Kara
      Introduce the BDI_WRITTEN counter. It will be used for estimating the
      bdi's write bandwidth.
      
      Peter Zijlstra <a.p.zijlstra@chello.nl>:
      Move BDI_WRITTEN accounting into __bdi_writeout_inc().
      This will cover and fix fuse, which only calls bdi_writeout_inc().
      
      CC: Michael Rubin <mrubin@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      f7d2b1ec
    • writeback: make writeback_control.nr_to_write straight · d46db3d5
      Authored by Wu Fengguang
      Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
      and initialize the struct writeback_control there.
      
      struct writeback_control is basically designed to control writeback of
      a single file, but we keep abusing it for writing multiple files in
      writeback_sb_inodes() and its callers.
      
      This immediately cleans things up; e.g. suddenly wbc.nr_to_write vs
      work->nr_pages starts to make sense, and instead of saving and
      restoring pages_skipped in writeback_sb_inodes() it can always start
      with a clean zero value.
      
      It also makes a neat IO pattern change: large dirty files are now
      written in the full 4MB writeback chunk size, rather than whatever
      quota remained in wbc->nr_to_write.
      Acked-by: Jan Kara <jack@suse.cz>
      Proposed-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      d46db3d5
  5. Jun 20, 2011 (1 commit)
  6. Jun 08, 2011 (3 commits)
    • writeback: skip balance_dirty_pages() for in-memory fs · 3efaf0fa
      Authored by Wu Fengguang
      This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
      
      Notes about the tmpfs/ramfs behavior changes:
      
      In 2.6.36 and older kernels, tmpfs writes will sleep inside
      balance_dirty_pages() as long as we are over the (dirty+background)/2
      global throttle threshold.  This is because both the dirty pages and
      the threshold will be 0 for tmpfs/ramfs, hence this test always
      evaluates to TRUE:
      
                      dirty_exceeded =
                              (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
                              || (nr_reclaimable + nr_writeback >= dirty_thresh);
      
      For 2.6.37, someone complained that the current logic does not allow the
      users to set vm.dirty_ratio=0.  So commit 4cbec4c8 changed the test to
      
                      dirty_exceeded =
                              (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
                              || (nr_reclaimable + nr_writeback > dirty_thresh);
      
      So 2.6.37 behaves differently for tmpfs/ramfs: it will never get
      throttled unless the global dirty threshold is exceeded (which is very
      unlikely to happen; once it does, it will block many tasks).
      
      I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means
      that on a busy writing server, tmpfs write()s may get livelocked! The
      "inadvertent" throttling can hardly help any workload because of its
      "either no throttling, or get throttled to death" property.
      
      So, relative to 2.6.37, this patch won't bring noticeable changes.
      
      CC: Hugh Dickins <hughd@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      3efaf0fa
    • writeback: add bdi_dirty_limit() kernel-doc · 6f718656
      Authored by Wu Fengguang
      Clarify the bdi_dirty_limit() comment.
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      6f718656
    • writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Authored by Wu Fengguang
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      that may be due to some "redirty_tail() after pages_skipped" effect,
      which is by no means a guarantee for _all_ file systems.
      
      Note that writeback_inodes_sb() is called not only by sync(); all
      callers are treated the same because the others also need livelock
      prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      Acked-by: Jan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      6e6938b6
  7. Mar 23, 2011 (2 commits)
    • writeback: make mapping->writeback_index to point to the last written page · cf15b07c
      Authored by Jun'ichi Nomura
      For range-cyclic writeback (e.g.  kupdate), the writeback code sets the
      continuation point of the next writeback to mapping->writeback_index,
      which is set to the page after the last written page.  This happens so
      that we evenly write the whole file even if pages in it get
      continuously redirtied.
      
      However, in some cases a sequential writer is writing in the middle of
      a page and just redirties the last written page by continuing from
      there.  For example, with an application which uses a file as a big
      ring buffer we see:
      
      [1st writeback session]
             ...
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898514 + 8
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898522 + 8
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898530 + 8
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898538 + 8
             flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898546 + 8
           kworker/0:1-11    4571: block_rq_issue: 8,0 W 0 () 94898514 + 40
      >>     flush-8:0-2743  4571: block_bio_queue: 8,0 W 94898554 + 8
      >>     flush-8:0-2743  4571: block_rq_issue: 8,0 W 0 () 94898554 + 8
      
      [2nd writeback session after 35sec]
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898562 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898570 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898578 + 8
             ...
           kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94898562 + 640
           kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94899202 + 72
             ...
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899962 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899970 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899978 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899986 + 8
             flush-8:0-2743  4606: block_bio_queue: 8,0 W 94899994 + 8
           kworker/0:1-11    4606: block_rq_issue: 8,0 W 0 () 94899962 + 40
      >>     flush-8:0-2743  4606: block_bio_queue: 8,0 W 94898554 + 8
      >>     flush-8:0-2743  4606: block_rq_issue: 8,0 W 0 () 94898554 + 8
      
      So we seek back to 94898554 after writing all the pages at the end of
      the file.
      
      This extra seek seems unnecessary.  If we continue writeback from the last
      written page, we can avoid it and do not cause harm to other cases.  The
      original intent of even writeout over the whole file is preserved and if
      the page does not get redirtied pagevec_lookup_tag() just skips it.
      
      As an exceptional case, when an I/O error happens, set done_index to
      the next page, as the comment in the code suggests.
      Tested-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf15b07c
    • mm: reclaim invalidated page ASAP · 278df9f4
      Authored by Minchan Kim
      invalidate_mapping_pages() is a very strong hint to the reclaimer: the
      user doesn't want to use the page any more.  So, in order to prevent
      working-set page eviction, this patch moves the page to the tail of
      the inactive list by setting PG_reclaim.
      
      Please remember that pages on the inactive list are part of the working
      set just like those on the active list.  If we don't move such pages to
      the inactive list's tail, pages near the tail of the inactive list can
      be evicted even though we have a strong clue about which pages are
      useless.  That's totally bad.
      
      Now PG_readahead/PG_reclaim is shared.  fe3cba17 added ClearPageReclaim
      to clear_page_dirty_for_io to prevent readahead marker pages from being
      reclaimed too fast.
      
      In this series, PG_reclaim is used for invalidated pages, too.  If the
      VM finds that a page is invalidated and dirty, it sets PG_reclaim to
      reclaim it asap.  Then, when the dirty page gets written back,
      clear_page_dirty_for_io will clear PG_reclaim unconditionally.  That
      disturbs this series' goal.
      
      I think it's okay to clear PG_readahead when the page is dirtied, not
      at writeback time.  So this patch moves ClearPageReadahead.  In v4,
      ClearPageReadahead in set_page_dirty had a problem, reported by Steven
      Barrett, due to compound pages.  Some drivers (ex. audio) call
      set_page_dirty with a compound page which isn't on the LRU, but my
      patch did ClearPageReclaim on the compound page.  With
      non-CONFIG_PAGEFLAGS_EXTENDED, that breaks the PageTail flag.
      
      I think it doesn't affect THP, and it passes my tests with THP enabled,
      but I've Cc'ed Andrea for a double check.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reported-by: Steven Barrett <damentz@liquorix.net>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      278df9f4
  8. Mar 17, 2011 (1 commit)
  9. Mar 10, 2011 (1 commit)
  10. Jan 14, 2011 (2 commits)
  11. Jan 04, 2011 (1 commit)
  12. Dec 23, 2010 (1 commit)
  13. Oct 27, 2010 (3 commits)
    • writeback: remove the internal 5% low bound on dirty_ratio · 4cbec4c8
      Authored by Wu Fengguang
      The dirty_ratio was silently limited in global_dirty_limits() to >= 5%.
      This is not behavior a user would expect, and it's inconsistent with
      calc_period_shift(), which uses the plain vm_dirty_ratio value.
      
      Let's remove the internal bound.
      
      At the same time, fix balance_dirty_pages() to work with the
      dirty_thresh=0 case.  This allows applications to proceed when
      dirty+writeback pages are all cleaned.
      
      And ">" fits the name "exceeded" better than ">=" does.  Neil thinks it
      is an aesthetic improvement as well as a functional one :)
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Proposed-by: Con Kolivas <kernel@kolivas.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Neil Brown <neilb@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cbec4c8
    • writeback: add nr_dirtied and nr_written to /proc/vmstat · ea941f0e
      Authored by Michael Rubin
      To help developers and applications gain visibility into writeback
      behaviour, add two entries to vm_stat_items and /proc/vmstat.  This
      will allow us to track the "written" and "dirtied" counts.
      
         # grep nr_dirtied /proc/vmstat
         nr_dirtied 3747
         # grep nr_written /proc/vmstat
         nr_written 3618
      Signed-off-by: Michael Rubin <mrubin@google.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea941f0e
    • mm: add account_page_writeback() · f629d1c9
      Authored by Michael Rubin
      To help developers and applications gain visibility into writeback
      behaviour this patch adds two counters to /proc/vmstat.
      
        # grep nr_dirtied /proc/vmstat
        nr_dirtied 3747
        # grep nr_written /proc/vmstat
        nr_written 3618
      
      These entries allow user apps to understand writeback behaviour over
      time and learn how it is impacting their performance.  Currently there
      is no way to inspect dirty and writeback speed over time;
      nr_dirty/nr_writeback alone cannot provide that.
      
      These entries are necessary to give visibility into writeback
      behaviour.  We have /proc/diskstats, which lets us understand the IO in
      the block layer.  We have blktrace for more in-depth understanding.  We
      have e2fsprogs and debugfs to give insight into the file systems'
      behaviour, but we don't offer our users the ability to understand what
      writeback is doing.  There is no way to know how active it is over the
      whole system, whether it's falling behind, or to quantify its efforts.
      With these values exported, users can easily see how much data
      applications are sending through writeback and also at what rate
      writeback is processing this data.  Comparing the rates of change
      between the two allows developers to see when writeback is not able to
      keep up with incoming traffic and the rate at which dirty memory is
      being sent to the IO back end.  This allows folks to understand their
      IO workloads and track kernel issues.  Non-kernel engineers at Google
      often use these counters to solve puzzling performance problems.
      
      Patch #4 adds a per-node vmstat file with nr_dirtied and nr_written
      
      Patch #5 adds writeback thresholds to /proc/vmstat
      
      Currently these values are in debugfs, but they should be promoted to
      /proc since they are useful for developers who are writing databases
      and file servers and are not debugging the kernel.
      
      The output is as below:
      
       # grep threshold /proc/vmstat
       nr_pages_dirty_threshold 409111
       nr_pages_dirty_background_threshold 818223
      
      This patch:
      
      This allows code outside of the mm core to safely manipulate page
      writeback state without worrying about the other accounting.  Not using
      these routines means that some code will lose track of the accounting,
      and we get bugs.
      
      Modify nilfs2 to use the interface.
      Signed-off-by: Michael Rubin <mrubin@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Jiro SEKIBA <jir@unicus.jp>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f629d1c9
  14. Aug 24, 2010 (1 commit)
    • writeback: write_cache_pages doesn't terminate at nr_to_write <= 0 · 546a1924
      Authored by Dave Chinner
      I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
      been. Enabling writeback tracing showed:
      
          flush-253:16-8516  [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
          flush-253:16-8516  [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
      
      Writeback is not terminating in background writeback if ->writepage is
      returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
      writeback on XFS.
      
      Fix the write_cache_pages loop to terminate correctly when this situation
      occurs and so prevent this sub-optimal background writeback pattern. This
      improves sustained sequential buffered write performance from around
      250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      546a1924
  15. Aug 23, 2010 (1 commit)
  16. Aug 21, 2010 (1 commit)
  17. Aug 15, 2010 (1 commit)
  18. Aug 12, 2010 (4 commits)
    • writeback: add comment to the dirty limit functions · 1babe183
      Authored by Wu Fengguang
      Document global_dirty_limits() and bdi_dirty_limit().
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1babe183
    • writeback: avoid unnecessary calculation of bdi dirty thresholds · 16c4042f
      Authored by Wu Fengguang
      Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
      that the latter can be avoided when under global dirty background
      threshold (which is the normal state for most systems).
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16c4042f
    • writeback: balance_dirty_pages(): reduce calls to global_page_state · e50e3720
      Authored by Wu Fengguang
      Reducing the number of times balance_dirty_pages calls global_page_state
      reduces the cache references and so improves write performance on a
      variety of workloads.
      
      'perf stat' of simple fio write tests shows the reduction in cache
      accesses.  The test is fio 'write,mmap,600Mb,pre_read' on an AMD
      AthlonX2 with 3Gb memory (dirty_threshold approx 600Mb), running each
      test 10 times, dropping the fastest & slowest values, then taking the
      average & standard deviation:
      
      		average (s.d.) in millions (10^6)
      2.6.31-rc8	648.6 (14.6)
      +patch		620.1 (16.5)
      
      This reduction is achieved by dropping clip_bdi_dirty_limit, which rereads
      the counters to apply the dirty_threshold, and by moving this check up into
      balance_dirty_pages, where the counters have already been read.
      
      Also, rearranging the for loop to contain only one copy of the limit tests
      allows the pdflush test after the loop to use the local copies of the
      counters rather than rereading them.
      
      In the common case with no throttling, it now calls global_page_state 5
      fewer times and bdi_stat 2 fewer times.
      
      Fengguang:
      
      This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
      with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to
      avoid exceeding the dirty limit.  Since the bdi dirty limit is mostly
      accurate, we don't need to clip routinely; a simple dirty limit check is
      enough.
      
      The check is necessary because, in principle, we should throttle everything
      calling balance_dirty_pages() when we're over the total limit, as Peter
      pointed out.
      
      We now set and clear dirty_exceeded not only based on the bdi dirty limits,
      but also on the global dirty limit.  The global limit check is added in
      place of clip_bdi_dirty_limit() for safety and is not intended as a behavior
      change.  The bdi limits should be tight enough to keep all dirty pages
      under the global limit most of the time; occasional small excesses should be
      OK though.  The change makes the logic more obvious: the global limit is
      the ultimate goal and shall always be imposed.
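      A hypothetical sketch of that two-level test, with all counters read
      once into cached copies (the struct fields and thresholds here are
      illustrative stand-ins, not the patch's actual variables):

      ```c
      /* Both limits are checked against cached counter copies; the explicit
       * global test stands in for the dropped clip_bdi_dirty_limit(). */
      struct wb_counters {
          long nr_reclaimable, nr_writeback;     /* global, read once */
          long bdi_reclaimable, bdi_writeback;   /* per-bdi, read once */
      };

      static int dirty_exceeded(const struct wb_counters *c,
                                long bdi_thresh, long dirty_thresh)
      {
          int over_bdi = c->bdi_reclaimable + c->bdi_writeback > bdi_thresh;
          /* Global limit is the ultimate goal: throttle even when this
           * bdi is under its own share. */
          int over_global = c->nr_reclaimable + c->nr_writeback > dirty_thresh;

          return over_bdi || over_global;
      }
      ```

      A task whose bdi is well under its share is still throttled once the
      system-wide total crosses dirty_thresh, which is exactly the safety
      property the clip helper used to provide.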
      
      We may now start background writeback work based on outdated conditions.
      That's safe because the bdi flush thread will (and has to) double-check
      the state.  It reduces overall overhead because tests based on the old
      state still have a good chance of being right.
      
      [akpm@linux-foundation.org] fix uninitialized dirty_exceeded
      Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      mm: fix fatal kernel-doc error · 3c111a07
      Authored by Randy Dunlap
      Fix a fatal kernel-doc error due to a #define coming between a function's
      kernel-doc notation and the function signature.  (kernel-doc cannot handle
      this)
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 10 Aug 2010, 1 commit
    • J
      mm: implement writeback livelock avoidance using page tagging · f446daae
      Authored by Jan Kara
      We try to avoid writeback livelocks when someone steadily creates dirty
      pages in a mapping we are writing out.  For memory-cleaning writeback,
      using nr_to_write works reasonably well, but we cannot really use it for
      data integrity writeback.  This patch tries to solve the problem.
      
      The idea is simple: Tag all pages that should be written back with a
      special tag (TOWRITE) in the radix tree.  This can be done rather quickly
      and thus livelocks should not happen in practice.  Then we start doing the
      hard work of locking pages and sending them to disk only for those pages
      that have TOWRITE tag set.
      
      Note: Adding the new radix tree tag grows the radix tree node from 288 to
      296 bytes on 32-bit archs and from 552 to 560 bytes on 64-bit archs.
      However, the number of slab/slub items per page remains the same (13 and 7
      respectively).
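      The two-pass idea can be sketched in userspace, with a plain array
      standing in for the radix tree and its tags (entirely illustrative: the
      real kernel tags radix tree slots in batches and clears TOWRITE as
      pages are claimed, none of which is modeled here):

      ```c
      #include <string.h>

      #define NPAGES 16

      /* Pass 1: snapshot which pages are dirty right now.  This is the
       * cheap tagging step, so later dirtiers cannot extend our workload. */
      static void tag_pages_for_writeback(const int dirty[NPAGES],
                                          int towrite[NPAGES])
      {
          memcpy(towrite, dirty, NPAGES * sizeof(int));
      }

      /* Pass 2: the hard work (locking, sending to disk) happens only for
       * tagged pages.  Returns the number of pages written back. */
      static int write_tagged_pages(int dirty[NPAGES],
                                    const int towrite[NPAGES])
      {
          int i, written = 0;

          for (i = 0; i < NPAGES; i++) {
              if (towrite[i] && dirty[i]) {
                  dirty[i] = 0;       /* "sent to disk" */
                  written++;
              }
          }
          return written;
      }
      ```

      A page dirtied between the two passes is simply left for the next
      writeback round, which is what bounds the work and kills the livelock.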
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 08 Aug 2010, 2 commits
  21. 06 Jul 2010, 1 commit
  22. 11 Jun 2010, 1 commit
  23. 09 Jun 2010, 1 commit
    • D
      writeback: limit write_cache_pages integrity scanning to current EOF · d87815cb
      Authored by Dave Chinner
      sync can currently take a really long time if a concurrent writer is
      extending a file. The problem is that the dirty pages on the address
      space grow in the same direction as write_cache_pages scans, so if
      the writer keeps ahead of writeback, the writeback will not
      terminate until the writer stops adding dirty pages.
      
      For a data integrity sync, we only need to write the pages dirty at
      the time we start the writeback, so we can stop scanning once we get
      to the page that was at the end of the file at the time the scan
      started.
      
      This prevents operations like copying a large file from blocking sync
      indefinitely, since sync will not write back pages that were dirtied
      after it was started.  This does not impact the existing integrity
      guarantees, as any dirty page (old or new) within the EOF range at the
      start of the scan will still be captured.
      
      This patch will not prevent sync from blocking on large writes into
      holes. That requires more complex intervention while this patch only
      addresses the common append-case of this sync holdoff.
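      A toy sketch of the bound (the index arithmetic is invented for
      illustration; the real patch captures the EOF page index before the
      write_cache_pages loop starts and stops the scan there):

      ```c
      /* Data-integrity scan bounded at the EOF seen when the scan started;
       * pages appended past that index are left for the next writeback,
       * so a concurrent appender cannot keep the scan running forever. */
      static int integrity_scan(const int dirty[], long end_at_start)
      {
          long i;
          int written = 0;

          for (i = 0; i < end_at_start; i++)
              if (dirty[i])
                  written++;      /* would be locked and written here */
          return written;
      }
      ```

      Even if the writer dirties new pages beyond the captured end index
      while the scan runs, the scan still terminates after end_at_start
      iterations, having written every page that was dirty inside EOF at the
      start.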
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>