1. 13 Feb 2009, 1 commit
    • Fix page writeback thinko, causing Berkeley DB slowdown · 3a4c6800
      Authored by Nick Piggin
      A bug was introduced into write_cache_pages cyclic writeout by commit
      31a12666 ("mm: write_cache_pages cyclic
      fix").  The intention (and comments) is that we should cycle back and
      look for more dirty pages at the beginning of the file if there is no
      more work to be done.
      
      But the !done condition was dropped from the test.  This means that any
      time the page writeout loop breaks (eg.  due to nr_to_write == 0), we
      will set index to 0, then goto again.  This will set done_index to
      index, then find done is set, so will proceed to the end of the
      function.  When updating mapping->writeback_index for cyclic writeout,
      we now use done_index == 0, so we're always cycling back to 0.
      
      This seemed to be causing random mmap writes (slapadd and iozone) to
      start writing more pages from the LRU, writeout would slow down, and it
      caused bugzilla entry
      
      	http://bugzilla.kernel.org/show_bug.cgi?id=12604
      
      about Berkeley DB slowing down dramatically.
      
      With this patch, iozone random write performance is increased nearly
      5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
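      
      Roughly, the retry logic at the bottom of write_cache_pages after the fix
      has the shape sketched below (illustrative, not the literal diff; it
      assumes the `cycled'/`done_index' bookkeeping introduced by the earlier
      patches in this series):
      
      	if (!cycled && !done) {
      		/*
      		 * range_cyclic: we hit the last page and there is more
      		 * work to be done; wrap back to the start of the file.
      		 * Without the !done check, an early break (e.g. because
      		 * nr_to_write reached 0) would also take this branch and
      		 * reset the cycle position to 0.
      		 */
      		cycled = 1;
      		index = 0;
      		end = writeback_index - 1;
      		goto retry;
      	}
      	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
      		mapping->writeback_index = done_index;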
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reported-and-tested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a4c6800
  2. 12 Feb 2009, 2 commits
  3. 04 Feb 2009, 1 commit
    • write-back: fix nr_to_write counter · dcf6a79d
      Authored by Artem Bityutskiy
      Commit 05fe478d introduced some
      @wbc->nr_to_write breakage.
      
      It made the following changes:
       1. Decrement wbc->nr_to_write instead of nr_to_write
       2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
       3. If nr_to_write pages have been synced, stop only if wbc->sync_mode ==
          WB_SYNC_NONE, otherwise keep going.
      
      However, according to the commit message, the intention was to only make
      change 3.  Change 1 is a bug.  Change 2 does not seem to be necessary,
      and it breaks UBIFS expectations, so if needed, it should be done
      separately later.  And change 2 does not seem to be documented in the
      commit message.
      
      This patch does the following:
       1. Undo changes 1 and 2
       2. Add a comment explaining change 3 (it is very useful to have comments
          in the _code_, not only in the commit message).
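      
      A minimal sketch of the resulting accounting (illustrative; the decrement
      is unconditional again, and only WB_SYNC_NONE writeback stops when the
      budget runs out):
      
      	if (wbc->nr_to_write > 0) {
      		wbc->nr_to_write--;
      		if (wbc->nr_to_write == 0 &&
      		    wbc->sync_mode == WB_SYNC_NONE) {
      			/*
      			 * We stop writing back only if we are not doing
      			 * integrity sync.  For integrity sync we have to
      			 * keep going because someone may be concurrently
      			 * dirtying pages, and we need to write out all of
      			 * the old dirty pages, not just the recent ones.
      			 */
      			done = 1;
      			break;
      		}
      	}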
      Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcf6a79d
  4. 07 Jan 2009, 10 commits
    • mm: add dirty_background_bytes and dirty_bytes sysctls · 2da02997
      Authored by David Rientjes
      This change introduces two new sysctls to /proc/sys/vm:
      dirty_background_bytes and dirty_bytes.
      
      dirty_background_bytes is the counterpart to dirty_background_ratio and
      dirty_bytes is the counterpart to dirty_ratio.
      
      With growing memory capacities of individual machines, it's no longer
      sufficient to specify dirty thresholds as a percentage of the amount of
      dirtyable memory over the entire system.
      
      dirty_background_bytes and dirty_bytes specify quantities of memory, in
      bytes, that represent the dirty limits for the entire system.  If either
      of these values is set, its value represents the amount of dirty memory
      that is needed to commence either background or direct writeback.
      
      When a `bytes' or `ratio' file is written, its counterpart becomes a
      function of the written value.  For example, if dirty_bytes is written to
      be 8096, 8K of memory is required to commence direct writeback.
      dirty_ratio is then functionally equivalent to 8K / the amount of
      dirtyable memory:
      
      	dirtyable_memory = free pages + mapped pages + file cache
      
      	dirty_background_bytes = dirty_background_ratio * dirtyable_memory
      		-or-
      	dirty_background_ratio = dirty_background_bytes / dirtyable_memory
      
      		AND
      
      	dirty_bytes = dirty_ratio * dirtyable_memory
      		-or-
      	dirty_ratio = dirty_bytes / dirtyable_memory
      
      Only one of dirty_background_bytes and dirty_background_ratio may be
      specified at a time, and only one of dirty_bytes and dirty_ratio may be
      specified.  When one sysctl is written, the other appears as 0 when read.
      
      The `bytes' files operate on a page size granularity since dirty limits
      are compared with ZVC values, which are in page units.
      
      Prior to this change, the minimum dirty_ratio was 5 as implemented by
      get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user
      written value between 0 and 100.  This restriction is maintained, but
      dirty_bytes has a lower limit of only one page.
      
      Also prior to this change, the dirty_background_ratio could not equal or
      exceed dirty_ratio.  This restriction is maintained in addition to
      restricting dirty_background_bytes.  If either background threshold equals
      or exceeds that of the dirty threshold, it is implicitly set to half the
      dirty threshold.
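      
      A sketch of how the limits might be derived once both interfaces exist
      (illustrative; variable names follow the sysctls, the surrounding helper
      usage is an assumption):
      
      	unsigned long available = determine_dirtyable_memory();  /* pages */
      	unsigned long dirty, background;
      
      	if (vm_dirty_bytes)
      		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
      	else
      		dirty = (vm_dirty_ratio * available) / 100;
      
      	if (dirty_background_bytes)
      		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
      	else
      		background = (dirty_background_ratio * available) / 100;
      
      	/* keep the background threshold below the direct-writeback one */
      	if (background >= dirty)
      		background = dirty / 2;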
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2da02997
    • mm: change dirty limit type specifiers to unsigned long · 364aeb28
      Authored by David Rientjes
      The background dirty and dirty limits are better defined with type
      specifiers of unsigned long since negative writeback thresholds are not
      possible.
      
      These values, as returned by get_dirty_limits(), are normally compared
      with ZVC values to determine whether writeback shall commence or be
      throttled.  Such page counts cannot be negative, so declaring the page
      limits as signed is unnecessary.
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Righi <righi.andrea@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      364aeb28
    • mm: write_cache_pages more terminate quickly · 82fd1a9a
      Authored by Andrew Morton
      Now that we have the early-termination logic in place, it makes sense to
      bail out early in all other cases where done is set to 1.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82fd1a9a
    • mm: write_cache_pages terminate quickly · d5482cdf
      Authored by Nick Piggin
      Terminate the write_cache_pages loop upon encountering the first page past
      end, without locking the page.  Pages cannot have their index change when
      we have a reference on them (truncate, eg truncate_inode_pages_range
      performs the same check without the page lock).
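      
      Illustratively, the end-of-range check moves ahead of lock_page(),
      something like:
      
      	struct page *page = pvec.pages[i];
      
      	if (page->index > end) {
      		/*
      		 * The page is past the writeback range; no need to lock
      		 * it just to find that out.  (Cannot be the first pass of
      		 * range_cyclic writeback, because end == -1 there.)
      		 */
      		done = 1;
      		break;
      	}
      
      	lock_page(page);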
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5482cdf
    • mm: write_cache_pages optimise page cleaning · 515f4a03
      Authored by Nick Piggin
      In write_cache_pages, if we get stuck behind another process that is
      cleaning pages, we will be forced to wait for them to finish, then perform
      our own writeout (if it was redirtied during the long wait), then wait for
      that.
      
      If a page under writeout is still clean, we can skip waiting for it (if
      we're part of a data integrity sync, we'll be waiting for all writeout
      pages afterwards, so we'll still be waiting for the other guy's write
      that's cleaned the page).
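      
      A sketch of the writeback-wait logic this describes (illustrative):
      
      	if (PageWriteback(page)) {
      		if (wbc->sync_mode != WB_SYNC_NONE)
      			wait_on_page_writeback(page);	/* integrity sync */
      		else
      			goto continue_unlock;	/* someone else is cleaning it */
      	}
      
      	if (!clear_page_dirty_for_io(page))
      		goto continue_unlock;		/* already clean: nothing to do */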
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      515f4a03
    • mm: write_cache_pages cleanups · 5a3d5c98
      Authored by Nick Piggin
      Get rid of some complex expressions from flow control statements, add a
      comment, remove some duplicate code.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a3d5c98
    • mm: write_cache_pages integrity fix · 05fe478d
      Authored by Nick Piggin
      In write_cache_pages, nr_to_write is heeded even for data-integrity syncs,
      so the function will return success after writing out nr_to_write pages,
      even if that was not sufficient to guarantee data integrity.
      
      The callers tend to set it to values that could break data integrity
      semantics easily in practice.  For example, nr_to_write can be set to
      mapping->nrpages * 2; however, if a file has a single dirty page and
      fsync is called, subsequent pages might be concurrently added and dirtied,
      and write_cache_pages might write out two of those newly dirtied pages
      while not writing out the old page that should have been written out.
      
      Fix this by ignoring nr_to_write if it is a data integrity sync.
      
      This is a data integrity bug.
      
      The reason this has been done in the past is to avoid stalling sync
      operations behind page dirtiers.
      
       "If a file has one dirty page at offset 1000000000000000 then someone
        does an fsync() and someone else gets in first and starts madly writing
        pages at offset 0, we want to write that page at 1000000000000000.
        Somehow."
      
      What we do today is return success after an arbitrary amount of pages are
      written, whether or not we have provided the data-integrity semantics that
      the caller has asked for.  Even this doesn't actually fix all stall cases
      completely: in the above situation, if the file has a huge number of pages
      in pagecache (but not dirty), then mapping->nrpages is going to be huge,
      even if pages are being dirtied.
      
      This change does indeed make the possibility of long stalls larger, and
      that's not a good thing, but lying about data integrity is even worse.  We
      have to either perform the sync, or return -ELINUXISLAME so at least the
      caller knows what has happened.
      
      There are subsequent competing approaches in the works to solve the stall
      problems properly, without compromising data integrity.
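      
      The effect on the loop can be sketched as follows (illustrative; the
      nr_to_write accounting is adjusted again by dcf6a79d, listed above):
      
      	if (wbc->sync_mode == WB_SYNC_NONE) {
      		wbc->nr_to_write--;
      		if (wbc->nr_to_write <= 0) {
      			done = 1;
      			break;
      		}
      	}
      	/*
      	 * WB_SYNC_ALL: keep going until every dirty page in the range
      	 * has been written, regardless of nr_to_write.
      	 */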
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05fe478d
    • mm: write_cache_pages writepage error fix · 00266770
      Authored by Nick Piggin
      In write_cache_pages, if ret signals a real error, but we still have some
      pages left in the pagevec, done would be set to 1, but the remaining pages
      would continue to be processed and ret will be overwritten in the process.
      
      It could easily be overwritten with success, and thus success will be
      returned even if there is an error.  Thus the caller is told all writes
      succeeded, whereas in reality some did not.
      
      Fix this by bailing immediately if there is an error, and retaining the
      first error code.
      
      This is a data integrity bug.
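      
      A sketch of the corrected error handling (illustrative):
      
      	ret = (*writepage)(page, wbc, data);
      	if (unlikely(ret)) {
      		if (ret == AOP_WRITEPAGE_ACTIVATE) {
      			/*
      			 * Not a real error: the filesystem wants the page
      			 * left dirty; keep going.
      			 */
      			unlock_page(page);
      			ret = 0;
      		} else {
      			/*
      			 * A real error: stop immediately, so later pages
      			 * cannot overwrite ret with success.
      			 */
      			done = 1;
      			break;
      		}
      	}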
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00266770
    • mm: write_cache_pages early loop termination · bd19e012
      Authored by Nick Piggin
      We'd like to break out of the loop early in many situations, however the
      existing code has been setting mapping->writeback_index past the final
      page in the pagevec lookup for cyclic writeback.  This is a problem if we
      don't process all pages up to the final page.
      
      Currently the code mostly keeps writeback_index reasonable and hacks
      around this by not breaking out of the loop or writing pages outside the
      range in these cases.  Keep track of a real "done index" that enables us
      to terminate the loop in a much more flexible manner.
      
      Needed by the subsequent patch to preserve writepage errors, and then
      further patches to break out of the loop early for other reasons.  However
      there are no functional changes with this patch alone.
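      
      In outline (illustrative, with most of the loop elided):
      
      	pgoff_t done_index = index;
      
      	while (!done && index <= end) {
      		/* ... pagevec_lookup_tag(), then for each page: ... */
      		done_index = page->index + 1;	/* where to resume next time */
      		/* ... lock, write out, possibly set done and break ... */
      	}
      
      	if (wbc->range_cyclic)
      		mapping->writeback_index = done_index;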
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd19e012
    • mm: write_cache_pages cyclic fix · 31a12666
      Authored by Nick Piggin
      In write_cache_pages, scanned == 1 is supposed to mean that cyclic
      writeback has circled through zero, thus we should not circle again.
      However it gets set to 1 after the first successful pagevec lookup.  This
      leads to cases where not enough data gets written.
      
      Counterexample: file with first 10 pages dirty, writeback_index == 5,
      nr_to_write == 10.  Then the last 5 pages will be found, and scanned will
      be set to 1; after writing those out, we will not cycle back to get the
      first 5.
      
      Rework this logic: now we'll always cycle unless we started off from index
      0.  When cycling, only write out as far as one page before the start page
      of the first pass (so we don't write parts of the file twice).
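      
      Roughly, the reworked setup and retry look like this (illustrative; note
      that the missing `!done' in the retry test here is the thinko later fixed
      by 3a4c6800, the first entry above):
      
      	if (wbc->range_cyclic) {
      		writeback_index = mapping->writeback_index; /* prev offset */
      		index = writeback_index;
      		cycled = (index == 0);	/* starting at 0: nothing to wrap to */
      		end = -1;
      	} else {
      		index = wbc->range_start >> PAGE_CACHE_SHIFT;
      		end = wbc->range_end >> PAGE_CACHE_SHIFT;
      		cycled = 1;		/* ignore range_cyclic tests */
      	}
      retry:
      	/* ... main writeout loop ... */
      
      	if (!cycled) {
      		cycled = 1;
      		index = 0;
      		/* stop one page short of where the first pass started */
      		end = writeback_index - 1;
      		goto retry;
      	}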
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31a12666
  5. 20 Oct 2008, 1 commit
    • vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Authored by Rik van Riel
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
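      
      A rough sketch of the split (illustrative; the unevictable list was added
      by later patches in the same series):
      
      	enum lru_list {
      		LRU_INACTIVE_ANON,	/* anonymous and swap-backed, incl. tmpfs */
      		LRU_ACTIVE_ANON,
      		LRU_INACTIVE_FILE,	/* backed by real filesystems */
      		LRU_ACTIVE_FILE,
      		NR_LRU_LISTS
      	};
      
      	#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)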
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f98a2fe
  6. 17 Oct 2008, 1 commit
  7. 16 Oct 2008, 1 commit
  8. 14 Oct 2008, 1 commit
  9. 27 Jul 2008, 1 commit
  10. 12 Jul 2008, 1 commit
    • mm: Add range_cont mode for writeback · 06d6cf69
      Authored by Aneesh Kumar K.V
      Filesystems like ext4 need to start a new transaction in
      writepages for block allocation. This happens with delayed
      allocation, and there is a limit to how many credits we can request
      from the journal layer. So we call write_cache_pages multiple
      times with wbc->nr_to_write set to the maximum possible value
      limited by the max journal credits available.
      
      Add a new mode to writeback that enables us to handle this
      behaviour. In the new mode we update wbc->range_start
      to point to the new offset to be written. The next call to
      write_cache_pages will start writeout from the specified
      range_start offset. In the new mode we also limit writing
      to the specified wbc->range_end.
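      
      Conceptually, the new mode amounts to something like this in
      write_cache_pages (illustrative, not the exact diff):
      
      	/* after the writeout loop, remember where the next pass resumes */
      	if (wbc->range_cont)
      		wbc->range_start = index << PAGE_CACHE_SHIFT;
      
      A caller such as a delayed-allocation writepages implementation can then
      loop: size each journal transaction for a bounded nr_to_write, call
      write_cache_pages, commit, and repeat; each pass picks up at the updated
      range_start until range_end is reached.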
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Mingming Cao <cmm@us.ibm.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      06d6cf69
  11. 24 May 2008, 1 commit
    • ftrace: limit trace entries · 3eefae99
      Authored by Steven Rostedt
      Currently there is no protection from the root user to use up all of
      memory for trace buffers. If the root user allocates too many entries,
      the OOM killer might start killing off all tasks.
      
      This patch adds an algorithm to check the following condition:
      
       pages_requested > (freeable_memory + current_trace_buffer_pages) / 4
      
      If the above is met then the allocation fails. The above prevents more
      than 1/4th of freeable memory from being used by trace buffers.
      
      To determine the freeable_memory, I made determine_dirtyable_memory in
      mm/page-writeback.c global.
      
      Special thanks goes to Peter Zijlstra for suggesting the above calculation.
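      
      Expressed as code, the check is roughly (illustrative; the real check
      lives in the trace-buffer resize path, and `trace_buffer_pages' is just a
      placeholder name):
      
      	unsigned long freeable = determine_dirtyable_memory();
      
      	if (pages_requested > (freeable + trace_buffer_pages) / 4)
      		return -ENOMEM;		/* refuse the allocation */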
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3eefae99
  12. 30 Apr 2008, 6 commits
  13. 06 Feb 2008, 4 commits
    • writeback: speed up writeback of big dirty files · 8bc3be27
      Authored by Fengguang Wu
      After making a 100M file dirty, the normal behavior is to start
      writeback for all of its data after a 30s delay.  But sometimes the
      following happens instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analysis shows that the internal io dispatch queues go like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      
      1) initial state with a 100M file and a 1K file
      2) 4M written, nr_to_write <= 0, so write more
      3) 1K written, nr_to_write > 0, no more writes (BUG)
      
      nr_to_write > 0 in (3) fools the upper layer into thinking that all data
      has been written out.  The big dirty file is actually still sitting in
      s_more_io.  We cannot simply splice s_more_io back to s_io as soon as s_io
      becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
      may starve newly expired inodes in s_dirty.  It is also not an option to
      draw inodes from both s_more_io and s_dirty, and let the loop go on: this
      might lead to livelocks, and might also starve other superblocks in sync
      time (well, kupdate may still starve some superblocks; that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0
      does not necessarily mean that "all data are written".  This patch
      introduces a flag writeback_control.more_io to indicate that more io should
      be done.  With it, the big dirty file no longer has to wait for the next
      kupdate invocation 5s later.
      
      In sync_sb_inodes() we only set more_io on super_blocks we actually
      visited.  This avoids interaction between two pdflush daemons.
      
      Also in __sync_single_inode() we don't blindly keep requeuing the io if the
      filesystem cannot progress.  Failing to do so may lead to 100% iowait.
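      
      The flag is set roughly like this in the per-superblock loop
      (illustrative):
      
      	if (wbc->nr_to_write <= 0) {
      		wbc->more_io = 1;	/* ran out of budget, not out of work */
      		break;
      	}
      	if (!list_empty(&sb->s_more_io))
      		wbc->more_io = 1;	/* expired inodes are still queued */
      
      The caller (e.g. the kupdate path) can then keep writing, or back off
      briefly, instead of concluding that everything has already been written.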
      Tested-by: Mike Snitzer <snitzer@gmail.com>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8bc3be27
    • mm: remove fastcall from mm/ · 920c7a5d
      Authored by Harvey Harrison
      fastcall is always defined to be empty, remove it
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      920c7a5d
    • mm/page-writeback: highmem_is_dirtyable option · 195cf453
      Authored by Bron Gondwana
      Add vm.highmem_is_dirtyable toggle
      
      A 32-bit machine with HIGHMEM64 enabled running DCC has an mmapped file of
      approximately 2GB which contains a hash format that is written
      randomly by the dbclean process.  On 2.6.16 this process took a few
      minutes.  With lowmem-only accounting of dirty ratios, this takes about 12
      hours of 100% disk IO, all random writes.
      
      Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
      add the highmem back to the total available memory count.
      
      [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
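      
      A sketch of the resulting accounting in determine_dirtyable_memory()
      (illustrative; the exact set of counters varies by kernel version):
      
      	unsigned long x;
      
      	x = global_page_state(NR_FREE_PAGES)
      		+ global_page_state(NR_INACTIVE)
      		+ global_page_state(NR_ACTIVE);
      
      	if (!vm_highmem_is_dirtyable)
      		x -= highmem_dirtyable_memory(x);
      
      	return x + 1;	/* make sure the result is never zero */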
      Signed-off-by: Bron Gondwana <brong@fastmail.fm>
      Cc: Ethan Solomita <solo@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      195cf453
    • mm/page-writeback.c: make a function static · f61eaf9f
      Authored by Adrian Bunk
      task_dirty_limit() can become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f61eaf9f
  14. 15 Jan 2008, 1 commit
  15. 16 Nov 2007, 1 commit
    • dirty page balancing: Get rid of broken unmapped_ratio logic · 8c086340
      Authored by Linus Torvalds
      This code harks back to the days when we didn't count dirty mapped
      pages, which led us to try to balance the number of dirty unmapped pages
      by how much unmapped memory there was in the system.
      
      That makes no sense any more, since now the dirty counts include the
      mapped pages.  Not to mention that the math doesn't work with HIGHMEM
      machines anyway, and causes the unmapped_ratio to potentially turn
      negative (which we do catch thanks to clamping it at a minimum value,
      but I mention that as an indication of how broken the code is).
      
      The code also was written at a time when the default dirty ratio was
      much larger, and the unmapped_ratio logic effectively capped that large
      dirty ratio a bit.  Again, we've since lowered the dirty ratio rather
      aggressively, further lessening the point of that code.
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c086340
  16. 15 Nov 2007, 1 commit
  17. 20 Oct 2007, 1 commit
  18. 17 Oct 2007, 5 commits