1. 12 August 2010 (4 commits)
• writeback: add comment to the dirty limit functions · 1babe183
  Committed by Wu Fengguang
      Document global_dirty_limits() and bdi_dirty_limit().
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• writeback: avoid unnecessary calculation of bdi dirty thresholds · 16c4042f
  Committed by Wu Fengguang
      Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so
      that the latter can be avoided when under global dirty background
      threshold (which is the normal state for most systems).
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
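A minimal sketch of the fast path this split enables (the surrounding function and variable names are assumptions, not the actual mm/page-writeback.c code):

    static void throttle_check_sketch(struct backing_dev_info *bdi,
                                      unsigned long nr_dirty)
    {
            unsigned long background_thresh, dirty_thresh, bdi_thresh;

            /* Cheap global numbers first. */
            global_dirty_limits(&background_thresh, &dirty_thresh);

            /*
             * Normal state for most systems: below the background
             * threshold, so the per-bdi threshold is never computed.
             */
            if (nr_dirty <= background_thresh)
                    return;

            bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
            /* ... per-bdi throttling decisions use bdi_thresh ... */
    }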
• writeback: balance_dirty_pages(): reduce calls to global_page_state · e50e3720
  Committed by Wu Fengguang
      Reducing the number of times balance_dirty_pages calls global_page_state
      reduces the cache references and so improves write performance on a
      variety of workloads.
      
'perf stat' of simple fio write tests shows the reduction in cache
references.  The test is fio 'write,mmap,600Mb,pre_read' on an AMD Athlon X2
with 3GB memory (dirty_threshold approx. 600MB), running each test 10
times, dropping the fastest and slowest values, then taking the average and
standard deviation:
      
		cache references, average (s.d.) in millions (10^6)
      2.6.31-rc8	648.6 (14.6)
      +patch		620.1 (16.5)
      
The reduction is achieved by dropping clip_bdi_dirty_limit, which rereads
the counters to apply the dirty_threshold, and moving that check up into
balance_dirty_pages, where the counters have already been read.

Rearranging the for loop so that it contains only one copy of the limit
tests also allows the pdflush test after the loop to use the local copies
of the counters rather than rereading them.

In the common case with no throttling, global_page_state is now called 5
fewer times and bdi_stat 2 fewer times.
      
      Fengguang:
      
This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to
avoid exceeding the dirty limit.  Since the bdi dirty limit is mostly
accurate, we don't need to clip routinely; a simple dirty limit check is
enough.
      
The check is necessary because, in principle, we should throttle everything
calling balance_dirty_pages() when we're over the total limit, as Peter
pointed out.
      
We now set and clear dirty_exceeded based not only on the bdi dirty limits
but also on the global dirty limit.  The global limit check is added in
place of clip_bdi_dirty_limit() for safety and is not intended as a behavior
change.  The bdi limits should be tight enough to keep all dirty pages
under the global limit most of the time; occasional small excursions are
acceptable.  The change makes the logic more obvious: the global limit is
the ultimate goal and must always be imposed.
      
We may now start background writeback work based on outdated conditions.
That's safe because the bdi flush thread will (and has to) double-check
the state.  It reduces overall overhead because the test based on the old
state still has a good chance of being right.
      
      [akpm@linux-foundation.org] fix uninitialized dirty_exceeded
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
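A rough sketch of the loop shape described above (simplified, with assumed names; the real balance_dirty_pages() carries much more state):

    static void balance_sketch(void)
    {
            int dirty_exceeded = 0;

            for (;;) {
                    unsigned long nr_reclaimable, nr_writeback;
                    unsigned long background_thresh, dirty_thresh;

                    /* Read the global counters once per iteration... */
                    nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
                                     global_page_state(NR_UNSTABLE_NFS);
                    nr_writeback = global_page_state(NR_WRITEBACK);
                    global_dirty_limits(&background_thresh, &dirty_thresh);

                    /*
                     * ...and reuse them for the explicit global check that
                     * replaces clip_bdi_dirty_limit(): over the total limit,
                     * everyone calling balance_dirty_pages() is throttled.
                     */
                    dirty_exceeded = (nr_reclaimable + nr_writeback >= dirty_thresh);
                    if (!dirty_exceeded)
                            break;

                    /* ... per-bdi tests, queue writeback, sleep, retry ... */
            }
    }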
• mm: fix fatal kernel-doc error · 3c111a07
  Committed by Randy Dunlap
Fix a fatal kernel-doc error due to a #define coming between a function's
kernel-doc notation and the function signature (kernel-doc cannot handle
this).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
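An illustration of the problem and the fix (generic, with hypothetical names; not the exact function from the patch):

    /* Broken: a #define between the kernel-doc block and the function
     * hides the signature from the kernel-doc parser. */
    /**
     * writeout_one_page_sketch - write a single page (hypothetical)
     * @page: the page to write
     */
    #define WRITEOUT_BATCH 16
    int writeout_one_page_sketch(struct page *page);

    /* Fixed: keep the kernel-doc comment immediately above the function. */
    #define WRITEOUT_BATCH 16
    /**
     * writeout_one_page_sketch - write a single page (hypothetical)
     * @page: the page to write
     */
    int writeout_one_page_sketch(struct page *page);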
2. 10 August 2010 (1 commit)
• mm: implement writeback livelock avoidance using page tagging · f446daae
  Committed by Jan Kara
We try to avoid writeback livelocks when someone steadily creates dirty
pages in a mapping we are writing out.  For memory-cleaning writeback,
using nr_to_write works reasonably well, but we cannot really use it for
data integrity writeback.  This patch tries to solve the problem.
      
      The idea is simple: Tag all pages that should be written back with a
      special tag (TOWRITE) in the radix tree.  This can be done rather quickly
      and thus livelocks should not happen in practice.  Then we start doing the
      hard work of locking pages and sending them to disk only for those pages
      that have TOWRITE tag set.
      
Note: Adding the new radix tree tag grows the radix tree node from 288 to
296 bytes on 32-bit archs and from 552 to 560 bytes on 64-bit archs.
However, the number of slab/slub items per page remains the same (13 and 7
respectively).
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
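A sketch of the tag-then-write pattern described above (condensed from what write_cache_pages() does with this patch; error handling and the lookup loop are elided):

    static void integrity_writeback_sketch(struct address_space *mapping,
                                           struct writeback_control *wbc,
                                           pgoff_t start, pgoff_t end)
    {
            int tag;

            if (wbc->sync_mode == WB_SYNC_ALL) {
                    /* Snapshot: mark every page dirty right now as TOWRITE. */
                    tag_pages_for_writeback(mapping, start, end);
                    tag = PAGECACHE_TAG_TOWRITE;
            } else {
                    tag = PAGECACHE_TAG_DIRTY;
            }

            /*
             * Pages dirtied after the snapshot only carry the DIRTY tag,
             * so a steady dirtier can no longer livelock this scan.
             * ... look pages up by 'tag', lock them and write them out ...
             */
    }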
3. 08 August 2010 (2 commits)
4. 06 July 2010 (1 commit)
5. 11 June 2010 (1 commit)
6. 09 June 2010 (2 commits)
• writeback: limit write_cache_pages integrity scanning to current EOF · d87815cb
  Committed by Dave Chinner
      sync can currently take a really long time if a concurrent writer is
      extending a file. The problem is that the dirty pages on the address
      space grow in the same direction as write_cache_pages scans, so if
      the writer keeps ahead of writeback, the writeback will not
      terminate until the writer stops adding dirty pages.
      
      For a data integrity sync, we only need to write the pages dirty at
      the time we start the writeback, so we can stop scanning once we get
      to the page that was at the end of the file at the time the scan
      started.
      
This prevents operations such as copying a large file from keeping
sync from completing, since pages dirtied after the sync was started
will not be written back. This does not impact the existing integrity
guarantees, as any dirty page (old or new) within the EOF range at the
start of the scan will still be captured.
      
      This patch will not prevent sync from blocking on large writes into
      holes. That requires more complex intervention while this patch only
      addresses the common append-case of this sync holdoff.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
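The idea in miniature (assumed names; the real change lives inside write_cache_pages() and also handles the explicit-range and range-cyclic cases):

    static void sync_scan_sketch(struct address_space *mapping)
    {
            loff_t isize = i_size_read(mapping->host);
            pgoff_t index = 0;
            pgoff_t end;

            if (isize == 0)
                    return;
            /* Pin the scan end to EOF as it was when the sync started. */
            end = (isize - 1) >> PAGE_CACHE_SHIFT;

            while (index <= end) {
                    /* ... find and write back dirty pages up to 'end' ... */
                    index++;
            }
            /* Pages appended past 'end' after this point are left for later. */
    }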
• writeback: pay attention to wbc->nr_to_write in write_cache_pages · 0b564927
  Committed by Dave Chinner
      If a filesystem writes more than one page in ->writepage, write_cache_pages
      fails to notice this and continues to attempt writeback when wbc->nr_to_write
      has gone negative - this trace was captured from XFS:
      
          wbc_writeback_start: towrt=1024
          wbc_writepage: towrt=1024
          wbc_writepage: towrt=0
          wbc_writepage: towrt=-1
          wbc_writepage: towrt=-5
          wbc_writepage: towrt=-21
          wbc_writepage: towrt=-85
      
      This has adverse effects on filesystem writeback behaviour. write_cache_pages()
      needs to terminate after a certain number of pages are written, not after a
      certain number of calls to ->writepage are made.  This is a regression
      introduced by 17bc6c30 ("vfs: Add
      no_nrwrite_index_update writeback control flag"), but cannot be reverted
      directly due to subsequent bug fixes that have gone in on top of it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
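A sketch of the termination rule (the helper next_dirty_page() is hypothetical; the real loop uses pagevec lookups and page locking):

    static void write_pages_sketch(struct address_space *mapping,
                                   struct writeback_control *wbc)
    {
            struct page *page;

            while ((page = next_dirty_page(mapping)) != NULL) {  /* hypothetical helper */
                    /* A single call may clean several pages and push
                     * wbc->nr_to_write below zero. */
                    mapping->a_ops->writepage(page, wbc);

                    /*
                     * Terminate on the page budget, not on the number of
                     * ->writepage calls.  Integrity sync (WB_SYNC_ALL)
                     * must keep going until everything is written.
                     */
                    if (wbc->nr_to_write <= 0 &&
                        wbc->sync_mode == WB_SYNC_NONE)
                            break;
            }
    }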
7. 01 June 2010 (1 commit)
8. 22 May 2010 (3 commits)
9. 17 May 2010 (1 commit)
• writeback: fix WB_SYNC_NONE writeback from umount · e913fc82
  Committed by Jens Axboe
      When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
      writeback to kick off writeback of pending dirty inodes, then follow
      that up with a WB_SYNC_ALL to wait for it. Since umount already holds
      the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
      writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
      since WB_SYNC_ALL writeback is a data integrity operation and thus
a bigger hammer than simple WB_SYNC_NONE. For barrier-aware filesystems
it's a lot slower.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
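The two-pass pattern being described, in outline (a sketch only; the helper names reflect kernels of roughly this era, and the real code lives in fs/sync.c):

    static void sync_filesystem_sketch(struct super_block *sb)
    {
            /* Pass 1: WB_SYNC_NONE, kick off async writeback of dirty inodes. */
            writeback_inodes_sb(sb);

            /*
             * Pass 2: WB_SYNC_ALL, write everything and wait.  Before this
             * fix, holding s_umount made pass 1 a no-op on umount, so all
             * the work fell through to this slower pass.
             */
            sync_inodes_sb(sb);
    }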
10. 06 April 2010 (1 commit)
• laptop-mode: Make flushes per-device · 31373d09
  Committed by Matthew Garrett
      One of the features of laptop-mode is that it forces a writeout of dirty
      pages if something else triggers a physical read or write from a device.
      The current implementation flushes pages on all devices, rather than only
the one that triggered the flush. This patch alters the behaviour so that
only the recently accessed block device is flushed, preventing other
disks from being spun up for no particularly good reason.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
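The per-device mechanism in miniature (simplified sketch close to what the patch adds; laptop_mode_wb_timer on the bdi and the laptop_mode timeout are the moving parts):

    /* Called when a device completes I/O while laptop_mode is active:
     * (re)arm the flush timer for *this* bdi only. */
    static void laptop_io_completion_sketch(struct backing_dev_info *bdi)
    {
            mod_timer(&bdi->laptop_mode_wb_timer, jiffies + laptop_mode);
    }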
11. 03 December 2009 (1 commit)
• writeback: remove unused nonblocking and congestion checks · 0d99519e
  Committed by Wu Fengguang
      - no one is calling wb_writeback and write_cache_pages with
        wbc.nonblocking=1 any more
      - lumpy pageout will want to do nonblocking writeback without the
        congestion wait
      
      So remove the congestion checks as suggested by Chris.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Alex Elder <aelder@sgi.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
12. 09 October 2009 (1 commit)
• writeback: account IO throttling wait as iowait · d25105e8
  Committed by Wu Fengguang
It makes sense to account iowait when someone is blocked
by IO throttling, as suggested by Kame and Peter.

There is an old comment arguing against doing IOWAIT on throttle;
however, it has not matched the code for a long time.

If we stopped accounting IOWAIT in 2.6.32, it would be an
undesirable behavior change, so restore the io_schedule.
      
      CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
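The change amounts to sleeping through the I/O scheduler helpers so the stall is charged to iowait (a sketch with an assumed 'pause' value, not the verbatim throttle path):

    static void throttle_pause_sketch(long pause)
    {
            __set_current_state(TASK_UNINTERRUPTIBLE);
            /* io_schedule_timeout() accounts the sleep as iowait;
             * plain schedule_timeout() would count it as idle. */
            io_schedule_timeout(pause);
    }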
13. 26 September 2009 (4 commits)
14. 24 September 2009 (1 commit)
15. 22 September 2009 (1 commit)
16. 21 September 2009 (2 commits)
17. 16 September 2009 (4 commits)
• writeback: separate starting of sync vs opportunistic writeback · b6e51316
  Committed by Jens Axboe
      bdi_start_writeback() is currently split into two paths, one for
      WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
      for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
      only WB_SYNC_NONE.
      
      Push down the writeback_control allocation and only accept the
      parameters that make sense for each function. This cleans up
      the API considerably.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
• writeback: use RCU to protect bdi_list · cfc4ba53
  Committed by Jens Axboe
      Now that bdi_writeback_all() no longer handles integrity writeback,
      it doesn't have to block anymore. This means that we can switch
      bdi_list reader side protection to RCU.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
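The reader-side pattern this enables (generic RCU list iteration, not the patch's exact code; the work-queueing call is elided):

    static void for_each_bdi_sketch(void)
    {
            struct backing_dev_info *bdi;

            rcu_read_lock();
            list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
                    /* Queue WB_SYNC_NONE work for this bdi; nothing here
                     * may block, which is why dropping the blocking
                     * integrity path made RCU possible. */
            }
            rcu_read_unlock();
    }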
• writeback: get rid of wbc->for_writepages · 1fe06ad8
  Committed by Jens Axboe
      It's only set, it's never checked. Kill it.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
• HWPOISON: shmem: call set_page_dirty() with locked page · 6746aff7
  Committed by Wu Fengguang
The dirtying of the page and set_page_dirty() can be moved inside the page lock.

- In shmem_write_end(), the page was dirtied while the page lock was held,
  but it is marked dirty just after dropping the page lock.
- In shmem_symlink(), both the dirtying and the marking can be moved inside the page lock.
      
It's valuable for the hwpoison code to know whether a bad page can be dropped
without losing data. It mainly judges this by testing the PG_dirty bit after
taking the page lock, so it becomes important that the dirtying of the page
and the marking of dirtiness are both done inside the page lock. This is
common practice, but sadly not a rule.
      
The notable exceptions are
- mapped pages
- pages with buffer_heads
These pages can go dirty at any time. Fortunately hwpoison will unmap the
page and release the buffer_heads beforehand anyway.

Many other types of pages (e.g. metadata pages) can also be dirtied at will
by their owners, but the hwpoison code cannot do anything meaningful with
them anyway. Only the dirtiness of pagecache pages owned by regular files
is of interest here.
      
      v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
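The rule being enforced, as a generic pattern (not the shmem functions themselves; kmap/kunmap and memcpy stand in for however the data is actually written):

    static void dirty_page_sketch(struct page *page, const char *buf, size_t len)
    {
            void *kaddr;

            lock_page(page);
            kaddr = kmap(page);
            memcpy(kaddr, buf, len);        /* dirty the contents ...          */
            kunmap(page);
            set_page_dirty(page);           /* ... and mark it, still locked,  */
            unlock_page(page);              /* so hwpoison sees PG_dirty set.  */
    }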
18. 11 September 2009 (2 commits)
• writeback: switch to per-bdi threads for flushing data · 03ba3782
  Committed by Jens Axboe
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as do random mmap'ed
writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
• writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
  Committed by Jens Axboe
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
19. 11 July 2009 (1 commit)
20. 01 July 2009 (1 commit)
• mm: prevent balance_dirty_pages() from doing too much work · d7831a0b
  Committed by Richard Kennedy
      balance_dirty_pages can overreact and move all of the dirty pages to
      writeback unnecessarily.
      
balance_dirty_pages makes its decision to throttle based on the number of
dirty plus writeback pages that are over the calculated limit, so it will
continue to move pages even when there are plenty of pages already in
writeback and fewer than the threshold still dirty.
      
      This allows it to overshoot its limits and move all the dirty pages to
      writeback while waiting for the drives to catch up and empty the writeback
      list.
      
      A simple fio test easily demonstrates this problem.
      
      fio --name=f1 --directory=/disk1 --size=2G -rw=write --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10
      
      This is the simplest fix I could find, but I'm not entirely sure that it
      alone will be enough for all cases.  But it certainly is an improvement on
      my desktop machine writing to 2 disks.
      
Do we need something more for machines with large arrays where
bdi_threshold * number_of_drives is greater than the dirty_ratio?
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
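The shape of the fix, sketched with assumed variable names: stop queueing more writeback once this bdi has dropped back under its threshold, rather than pushing every remaining dirty page:

    static void throttle_bdi_sketch(unsigned long bdi_nr_reclaimable,
                                    unsigned long bdi_nr_writeback,
                                    unsigned long bdi_thresh)
    {
            /* Already back under the limit?  The disks only need time to
             * drain the writeback list; don't move yet more pages. */
            if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
                    return;

            /* ... otherwise queue some writeback and throttle ... */
    }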
21. 24 June 2009 (1 commit)
• percpu: clean up percpu variable definitions · 245b2e70
  Committed by Tejun Heo
      Percpu variable definition is about to be updated such that all percpu
      symbols including the static ones must be unique.  Update percpu
      variable definitions accordingly.
      
      * as,cfq: rename ioc_count uniquely
      
      * cpufreq: rename cpu_dbs_info uniquely
      
      * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
      
      * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
        rename it
      
      * ipv4,6: rename cookie_scratch uniquely
      
      * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
        pmc_irq_entry and nmi_entry to pmc_nmi_entry
      
      * perf_counter: rename disable_count to perf_disable_count
      
      * ftrace: rename test_event_disable to ftrace_test_event_disable
      
      * kmemleak: rename test_pointer to kmemleak_test_pointer
      
      * mce: rename next_interval to mce_next_interval
      
      [ Impact: percpu usage cleanups, no duplicate static percpu var names ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
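The mm item above, for example, amounts to a rename of this shape (illustrative only; the real definitions carry initializers and live in mm/page-writeback.c):

    /* Before: a function-local static percpu variable with a generic name
     * that could collide once all percpu symbols must be globally unique. */
    static DEFINE_PER_CPU(unsigned long, ratelimits);

    /* After: moved to file scope and given a unique, prefixed name. */
    static DEFINE_PER_CPU(unsigned long, bdp_ratelimits);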
22. 17 June 2009 (1 commit)
23. 18 May 2009 (1 commit)
24. 01 April 2009 (2 commits)