1. 22 5月, 2010 1 次提交
    • J
      writeback: fixups for !dirty_writeback_centisecs · 6423104b
      Jens Axboe 提交于
      Commit 69b62d01 fixed up most of the places where we would enter
      busy schedule() spins when disabling the periodic background
      writeback. This fixes up the sb timer so that it doesn't get
      hammered on with the delay disabled, and ensures that it gets
      rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs
      gets modified.
      
      bdi_forker_task() also needs to check for !dirty_writeback_centisecs
      and use schedule() appropriately, fix that up too.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      6423104b
  2. 17 5月, 2010 1 次提交
    • J
      writeback: fix WB_SYNC_NONE writeback from umount · e913fc82
      Jens Axboe 提交于
      When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
      writeback to kick off writeback of pending dirty inodes, then follow
      that up with a WB_SYNC_ALL to wait for it. Since umount already holds
      the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
      writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
      since WB_SYNC_ALL writeback is a data integrity operation and thus
      a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
      it's a lot slower.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      e913fc82
  3. 06 4月, 2010 1 次提交
    • M
      laptop-mode: Make flushes per-device · 31373d09
      Matthew Garrett 提交于
      One of the features of laptop-mode is that it forces a writeout of dirty
      pages if something else triggers a physical read or write from a device.
      The current implementation flushes pages on all devices, rather than only
      the one that triggered the flush. This patch alters the behaviour so that
      only the recently accessed block device is flushed, preventing other
      disks being spun up for no terribly good reason.
      Signed-off-by: NMatthew Garrett <mjg@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      31373d09
  4. 03 12月, 2009 1 次提交
    • W
      writeback: remove unused nonblocking and congestion checks · 0d99519e
      Wu Fengguang 提交于
      - no one is calling wb_writeback and write_cache_pages with
        wbc.nonblocking=1 any more
      - lumpy pageout will want to do nonblocking writeback without the
        congestion wait
      
      So remove the congestion checks as suggested by Chris.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Alex Elder <aelder@sgi.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      0d99519e
  5. 09 10月, 2009 1 次提交
    • W
      writeback: account IO throttling wait as iowait · d25105e8
      Wu Fengguang 提交于
      It makes sense to do IOWAIT when someone is blocked
      due to IO throttle, as suggested by Kame and Peter.
      
      There is an old comment for not doing IOWAIT on throttle,
      however it has been mismatching the code for a long time.
      
      If we stop accounting IOWAIT for 2.6.32, it could be an
      undesirable behavior change. So restore the io_schedule.
      
      CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      d25105e8
  6. 26 9月, 2009 4 次提交
  7. 24 9月, 2009 1 次提交
  8. 22 9月, 2009 1 次提交
  9. 21 9月, 2009 2 次提交
  10. 16 9月, 2009 4 次提交
    • J
      writeback: separate starting of sync vs opportunistic writeback · b6e51316
      Jens Axboe 提交于
      bdi_start_writeback() is currently split into two paths, one for
      WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
      for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
      only WB_SYNC_NONE.
      
      Push down the writeback_control allocation and only accept the
      parameters that make sense for each function. This cleans up
      the API considerably.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      b6e51316
    • J
      writeback: use RCU to protect bdi_list · cfc4ba53
      Jens Axboe 提交于
      Now that bdi_writeback_all() no longer handles integrity writeback,
      it doesn't have to block anymore. This means that we can switch
      bdi_list reader side protection to RCU.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      cfc4ba53
    • J
      writeback: get rid of wbc->for_writepages · 1fe06ad8
      Jens Axboe 提交于
      It's only set, it's never checked. Kill it.
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      1fe06ad8
    • W
      HWPOISON: shmem: call set_page_dirty() with locked page · 6746aff7
      Wu Fengguang 提交于
      The dirtying of page and set_page_dirty() can be moved into the page lock.
      
      - In shmem_write_end(), the page was dirtied while the page lock was held,
        but it's being marked dirty just after dropping the page lock.
      - In shmem_symlink(), both dirtying and marking can be moved into page lock.
      
      It's valuable for the hwpoison code to know whether one bad page can be dropped
      without losing data. It mainly judges by testing the PG_dirty bit after taking
      the page lock. So it becomes important that the dirtying of page and the
      marking of dirtiness are both done inside the page lock. Which is a common
      practice, but sadly not a rule.
      
      The noticeable exceptions are
      - mapped pages
      - pages with buffer_heads
      The above pages could go dirty at any time. Fortunately the hwpoison will
      unmap the page and release the buffer_heads beforehand anyway.
      
      Many other types of pages (eg. metadata pages) can also be dirtied at will by
      their owners, the hwpoison code cannot do meaningful things to them anyway.
      Only the dirtiness of pagecache pages owned by regular files are interested.
      
      v2: AK: Add comment about set_page_dirty rules (suggested by Peter Zijlstra)
      Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Reviewed-by: NWANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      6746aff7
  11. 11 9月, 2009 2 次提交
    • J
      writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe 提交于
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as does random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      03ba3782
    • J
      writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe 提交于
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      66f3b8e2
  12. 11 7月, 2009 1 次提交
  13. 01 7月, 2009 1 次提交
    • R
      mm: prevent balance_dirty_pages() from doing too much work · d7831a0b
      Richard Kennedy 提交于
      balance_dirty_pages can overreact and move all of the dirty pages to
      writeback unnecessarily.
      
      balance_dirty_pages makes its decision to throttle based on the number of
      dirty plus writeback pages that are over the calculated limit,so it will
      continue to move pages even when there are plenty of pages in writeback
      and less than the threshold still dirty.
      
      This allows it to overshoot its limits and move all the dirty pages to
      writeback while waiting for the drives to catch up and empty the writeback
      list.
      
      A simple fio test easily demonstrates this problem.
      
      fio --name=f1 --directory=/disk1 --size=2G -rw=write --name=f2 --directory=/disk2 --size=1G --rw=write --startdelay=10
      
      This is the simplest fix I could find, but I'm not entirely sure that it
      alone will be enough for all cases.  But it certainly is an improvement on
      my desktop machine writing to 2 disks.
      
      Do we need something more for machines with large arrays where
      bdi_threshold * number_of_drives is greater than the dirty_ratio ?
      Signed-off-by: NRichard Kennedy <richard@rsk.demon.co.uk>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7831a0b
  14. 24 6月, 2009 1 次提交
    • T
      percpu: clean up percpu variable definitions · 245b2e70
      Tejun Heo 提交于
      Percpu variable definition is about to be updated such that all percpu
      symbols including the static ones must be unique.  Update percpu
      variable definitions accordingly.
      
      * as,cfq: rename ioc_count uniquely
      
      * cpufreq: rename cpu_dbs_info uniquely
      
      * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it
      
      * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and
        rename it
      
      * ipv4,6: rename cookie_scratch uniquely
      
      * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to
        pmc_irq_entry and nmi_entry to pmc_nmi_entry
      
      * perf_counter: rename disable_count to perf_disable_count
      
      * ftrace: rename test_event_disable to ftrace_test_event_disable
      
      * kmemleak: rename test_pointer to kmemleak_test_pointer
      
      * mce: rename next_interval to mce_next_interval
      
      [ Impact: percpu usage cleanups, no duplicate static percpu var names ]
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      245b2e70
  15. 17 6月, 2009 1 次提交
  16. 18 5月, 2009 1 次提交
  17. 01 4月, 2009 2 次提交
  18. 27 3月, 2009 1 次提交
    • W
      writeback: double the dirty thresholds · 1b5e62b4
      Wu Fengguang 提交于
      Enlarge default dirty ratios from 5/10 to 10/20.  This fixes [Bug
      #12809] iozone regression with 2.6.29-rc6.
      
      The iozone benchmarks are performed on a 1200M file, with 8GB ram.
      
        iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
        iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
      
      The performance regression is triggered by commit 1cf6e7d8(mm: task
      dirty accounting fix), which makes more correct/thorough dirty
      accounting.
      
      The default 5/10 dirty ratios were picked (a) with the old dirty logic
      and (b) largely at random and (c) designed to be aggressive.  In
      particular, that (a) means that having fixed some of the dirty
      accounting, maybe the real bug is now that it was always too aggressive,
      just hidden by an accounting issue.
      
      The enlarged 10/20 dirty ratios are just about enough to fix the regression.
      
      [ We will have to look at how this affects the old fsync() latency issue,
        but that probably will need independent work.  - Linus ]
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reported-by: N"Lin, Ming M" <ming.m.lin@intel.com>
      Tested-by: N"Lin, Ming M" <ming.m.lin@intel.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b5e62b4
  19. 19 2月, 2009 1 次提交
  20. 13 2月, 2009 1 次提交
    • N
      Fix page writeback thinko, causing Berkeley DB slowdown · 3a4c6800
      Nick Piggin 提交于
      A bug was introduced into write_cache_pages cyclic writeout by commit
      31a12666 ("mm: write_cache_pages cyclic
      fix").  The intention (and comments) is that we should cycle back and
      look for more dirty pages at the beginning of the file if there is no
      more work to be done.
      
      But the !done condition was dropped from the test.  This means that any
      time the page writeout loop breaks (eg.  due to nr_to_write == 0), we
      will set index to 0, then goto again.  This will set done_index to
      index, then find done is set, so will proceed to the end of the
      function.  When updating mapping->writeback_index for cyclic writeout,
      we now use done_index == 0, so we're always cycling back to 0.
      
      This seemed to be causing random mmap writes (slapadd and iozone) to
      start writing more pages from the LRU and writeout would slowdown, and
      caused bugzilla entry
      
      	http://bugzilla.kernel.org/show_bug.cgi?id=12604
      
      about Berkeley DB slowing down dramatically.
      
      With this patch, iozone random write performance is increased nearly
      5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Reported-and-tested-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a4c6800
  21. 12 2月, 2009 2 次提交
  22. 04 2月, 2009 1 次提交
    • A
      write-back: fix nr_to_write counter · dcf6a79d
      Artem Bityutskiy 提交于
      Commit 05fe478d introduced some
      @wbc->nr_to_write breakage.
      
      It made the following changes:
       1. Decrement wbc->nr_to_write instead of nr_to_write
       2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
       3. If synced nr_to_write pages, stop only if if wbc->sync_mode ==
          WB_SYNC_NONE, otherwise keep going.
      
      However, according to the commit message, the intention was to only make
      change 3.  Change 1 is a bug.  Change 2 does not seem to be necessary,
      and it breaks UBIFS expectations, so if needed, it should be done
      separately later.  And change 2 does not seem to be documented in the
      commit message.
      
      This patch does the following:
       1. Undo changes 1 and 2
       2. Add a comment explaining change 3 (it very useful to have comments
          in _code_, not only in the commit).
      Signed-off-by: NArtem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dcf6a79d
  23. 07 1月, 2009 8 次提交
    • D
      mm: add dirty_background_bytes and dirty_bytes sysctls · 2da02997
      David Rientjes 提交于
      This change introduces two new sysctls to /proc/sys/vm:
      dirty_background_bytes and dirty_bytes.
      
      dirty_background_bytes is the counterpart to dirty_background_ratio and
      dirty_bytes is the counterpart to dirty_ratio.
      
      With growing memory capacities of individual machines, it's no longer
      sufficient to specify dirty thresholds as a percentage of the amount of
      dirtyable memory over the entire system.
      
      dirty_background_bytes and dirty_bytes specify quantities of memory, in
      bytes, that represent the dirty limits for the entire system.  If either
      of these values is set, its value represents the amount of dirty memory
      that is needed to commence either background or direct writeback.
      
      When a `bytes' or `ratio' file is written, its counterpart becomes a
      function of the written value.  For example, if dirty_bytes is written to
      be 8096, 8K of memory is required to commence direct writeback.
      dirty_ratio is then functionally equivalent to 8K / the amount of
      dirtyable memory:
      
      	dirtyable_memory = free pages + mapped pages + file cache
      
      	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
      		-or-
      	dirty_background_ratio = dirty_background_bytes / dirtyable_memory
      
      		AND
      
      	dirty_bytes = dirty_ratio * dirtyable_memory
      		-or-
      	dirty_ratio = dirty_bytes / dirtyable_memory
      
      Only one of dirty_background_bytes and dirty_background_ratio may be
      specified at a time, and only one of dirty_bytes and dirty_ratio may be
      specified.  When one sysctl is written, the other appears as 0 when read.
      
      The `bytes' files operate on a page size granularity since dirty limits
      are compared with ZVC values, which are in page units.
      
      Prior to this change, the minimum dirty_ratio was 5 as implemented by
      get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
      written value between 0 and 100.  This restriction is maintained, but
      dirty_bytes has a lower limit of only one page.
      
      Also prior to this change, the dirty_background_ratio could not equal or
      exceed dirty_ratio.  This restriction is maintained in addition to
      restricting dirty_background_bytes.  If either background threshold equals
      or exceeds that of the dirty threshold, it is implicitly set to half the
      dirty threshold.
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2da02997
    • D
      mm: change dirty limit type specifiers to unsigned long · 364aeb28
      David Rientjes 提交于
      The background dirty and dirty limits are better defined with type
      specifiers of unsigned long since negative writeback thresholds are not
      possible.
      
      These values, as returned by get_dirty_limits(), are normally compared
      with ZVC values to determine whether writeback shall commence or be
      throttled.  Such page counts cannot be negative, so declaring the page
      limits as signed is unnecessary.
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      364aeb28
    • A
      mm: write_cache_pages more terminate quickly · 82fd1a9a
      Andrew Morton 提交于
      Now that we have the early-termination logic in place, it makes sense to
      bail out early in all other cases where done is set to 1.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82fd1a9a
    • N
      mm: write_cache_pages terminate quickly · d5482cdf
      Nick Piggin 提交于
      Terminate the write_cache_pages loop upon encountering the first page past
      end, without locking the page.  Pages cannot have their index change when
      we have a reference on them (truncate, eg truncate_inode_pages_range
      performs the same check without the page lock).
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5482cdf
    • N
      mm: write_cache_pages optimise page cleaning · 515f4a03
      Nick Piggin 提交于
      In write_cache_pages, if we get stuck behind another process that is
      cleaning pages, we will be forced to wait for them to finish, then perform
      our own writeout (if it was redirtied during the long wait), then wait for
      that.
      
      If a page under writeout is still clean, we can skip waiting for it (if
      we're part of a data integrity sync, we'll be waiting for all writeout
      pages afterwards, so we'll still be waiting for the other guy's write
      that's cleaned the page).
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      515f4a03
    • N
      mm: write_cache_pages cleanups · 5a3d5c98
      Nick Piggin 提交于
      Get rid of some complex expressions from flow control statements, add a
      comment, remove some duplicate code.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a3d5c98
    • N
      mm: write_cache_pages integrity fix · 05fe478d
      Nick Piggin 提交于
      In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
      so the function will return success after writing out nr_to_write pages,
      even if that was not sufficient to guarantee data integrity.
      
      The callers tend to set it to values that could break data interity
      semantics easily in practice.  For example, nr_to_write can be set to
      mapping->nr_pages * 2, however if a file has a single, dirty page, then
      fsync is called, subsequent pages might be concurrently added and dirtied,
      then write_cache_pages might writeout two of these newly dirty pages,
      while not writing out the old page that should have been written out.
      
      Fix this by ignoring nr_to_write if it is a data integrity sync.
      
      This is a data integrity bug.
      
      The reason this has been done in the past is to avoid stalling sync
      operations behind page dirtiers.
      
       "If a file has one dirty page at offset 1000000000000000 then someone
        does an fsync() and someone else gets in first and starts madly writing
        pages at offset 0, we want to write that page at 1000000000000000.
        Somehow."
      
      What we do today is return success after an arbitrary amount of pages are
      written, whether or not we have provided the data-integrity semantics that
      the caller has asked for.  Even this doesn't actually fix all stall cases
      completely: in the above situation, if the file has a huge number of pages
      in pagecache (but not dirty), then mapping->nrpages is going to be huge,
      even if pages are being dirtied.
      
      This change does indeed make the possibility of long stalls lager, and
      that's not a good thing, but lying about data integrity is even worse.  We
      have to either perform the sync, or return -ELINUXISLAME so at least the
      caller knows what has happened.
      
      There are subsequent competing approaches in the works to solve the stall
      problems properly, without compromising data integrity.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05fe478d
    • N
      mm: write_cache_pages writepage error fix · 00266770
      Nick Piggin 提交于
      In write_cache_pages, if ret signals a real error, but we still have some
      pages left in the pagevec, done would be set to 1, but the remaining pages
      would continue to be processed and ret will be overwritten in the process.
      
      It could easily be overwritten with success, and thus success will be
      returned even if there is an error.  Thus the caller is told all writes
      succeeded, wheras in reality some did not.
      
      Fix this by bailing immediately if there is an error, and retaining the
      first error code.
      
      This is a data integrity bug.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00266770