1. 17 Oct, 2013 1 commit
    • writeback: fix negative bdi max pause · e3b6c655
      Fengguang Wu authored
      Toralf runs trinity on UML/i386.  After some time it hangs and the last
      message line is
      
      	BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]
      
      It's found that pages_dirtied becomes very large.  More than 1000000000
      pages in this case:
      
      	period = HZ * pages_dirtied / task_ratelimit;
      	BUG_ON(pages_dirtied > 2000000000);
      	BUG_ON(pages_dirtied > 1000000000);      <---------
      
      UML debug printf shows that we got negative pause here:
      
      	ick: pause : -984
      	ick: pages_dirtied : 0
      	ick: task_ratelimit: 0
      
      	 pause:
      	+       if (pause < 0)  {
      	+               extern int printf(char *, ...);
      	+               printf("ick : pause : %li\n", pause);
      	+               printf("ick: pages_dirtied : %lu\n", pages_dirtied);
      	+               printf("ick: task_ratelimit: %lu\n", task_ratelimit);
      	+               BUG_ON(1);
      	+       }
      	        trace_balance_dirty_pages(bdi,
      
      Since pause is bounded by [min_pause, max_pause], where min_pause is also
      bounded by max_pause, it was suspected and then demonstrated that the
      max_pause calculation goes wrong:
      
      	ick: pause : -717
      	ick: min_pause : -177
      	ick: max_pause : -717
      	ick: pages_dirtied : 14
      	ick: task_ratelimit: 0
      
      The problem lies in the two "long = unsigned long" assignments in
      bdi_max_pause(), which might go negative if the highest bit is 1, and the
      min_t(long, ...) check failed to protect it from falling under 0.  Fix all
      of them by using "unsigned long" throughout the function.
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Reported-by: Toralf Förster <toralf.foerster@gmx.de>
      Tested-by: Toralf Förster <toralf.foerster@gmx.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 13 Sep, 2013 1 commit
    • memcg: add per cgroup writeback pages accounting · 3ea67d06
      Sha Zhengju authored
      Add memcg routines to count writeback pages; dirty pages will also be
      accounted later.
      
      After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
      accounting"), we can use 'struct page' flag to test page state instead
      of per page_cgroup flag.  But memcg has a feature to move a page from a
      cgroup to another one and may have race between "move" and "page stat
      accounting".  So in order to avoid the race we have designed a new lock:
      
               mem_cgroup_begin_update_page_stat()
               modify page information        -->(a)
               mem_cgroup_update_page_stat()  -->(b)
               mem_cgroup_end_update_page_stat()
      
      It requires both (a) and (b) (writeback pages accounting) to be protected
      in mem_cgroup_{begin/end}_update_page_stat().  It's a full no-op for
      !CONFIG_MEMCG, almost a no-op if memcg is disabled (but compiled in), an rcu
      read lock in most cases (no task is moving), and spin_lock_irqsave
      on top in the slow path.
      
      There are two writeback interfaces to modify: test_{clear/set}_page_writeback().
      And the lock order is:
      	--> memcg->move_lock
      	  --> mapping->tree_lock
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 12 Sep, 2013 3 commits
    • mm/page-writeback.c: add strictlimit feature · 5a537485
      Maxim Patlasov authored
      The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
      unprivileged users) from growing a large number of dirty pages before
      throttling.  For such filesystems balance_dirty_pages always checks bdi
      counters against bdi limits.  I.e.  even if global "nr_dirty" is under
      "freerun", it's not allowed to skip bdi checks.  The only use case for now
      is fuse: it sets bdi max_ratio to 1% by default and system administrators
      are supposed to expect that this limit won't be exceeded.
      
      The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
      filesystem may set the flag when it initializes its BDI.
      
      The problematic scenario comes from the fact that nobody pays attention to
      the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
      writeback).  The implementation of fuse writeback releases original page
      (by calling end_page_writeback) almost immediately.  A fuse request queued
      for real processing bears a copy of original page.  Hence, if userspace
      fuse daemon doesn't finalize write requests in timely manner, an
      aggressive mmap writer can pollute virtually all memory by those temporary
      fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
      nobody cares.
      
      To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
      problem" as a shortcut for "the possibility of uncontrolled growth of the
      amount of RAM consumed by temporary pages allocated by kernel fuse to
      process writeback".
      
      The problem was very easy to reproduce.  There is a trivial example
      filesystem implementation in the fuse userspace distribution: fusexmp_fh.c.  I
      added "sleep(1);" to the write methods, then recompiled and mounted it.
      Then I created a huge file on the mount point and ran a simple program which
      mmap-ed the file to a memory region, then wrote data to the region.  An
      hour later I observed almost all RAM consumed by fuse writeback.  Since
      then some unrelated changes in kernel fuse made it more difficult to
      reproduce, but it is still possible now.
      
      Putting this theoretical happens-in-the-lab thing aside, there is another
      thing that really hurts real world (FUSE) users.  This is the write-through
      page cache policy FUSE currently uses.  I.e.  when handling write(2), kernel
      fuse populates the page cache and flushes user data to the server
      synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
      ("writeback cache policy") solve the problem, but they also make resolving
      the NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
      a huge file to a fuse mount would result in memory starvation.  Miklos,
      the maintainer of FUSE, believes the strictlimit feature is the way to go.
      
      And finally, putting FUSE topics aside, there is one more use case for
      the strictlimit feature.  Using a slow USB stick (mass storage) in a machine
      with a huge amount of RAM installed is a well-known pain.  Let's make simple
      computations.  Assuming 64GB of RAM installed, the existing implementation of
      balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
      dirty (freerun == 15% of total RAM).  So, the command "cp 9GB_file
      /media/my-usb-storage/" may return in a few seconds, but the subsequent
      "umount /media/my-usb-storage/" will take more than two hours if the effective
      throughput of the storage is, say, 1MB/sec.
      
      After inclusion of the strictlimit feature, it will be trivial to add a knob
      (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand,
      manually or via a udev rule.  Maybe I'm wrong, but it seems to be quite a
      natural desire to limit the amount of dirty memory for some devices we do
      not fully trust (in the sense of sustainable throughput).
      
      [akpm@linux-foundation.org: fix warning in page-writeback.c]
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Lisa Du authored
      This patch is based on KOSAKI's work and I added a little more description;
      please refer to https://lkml.org/lkml/2012/6/14/74.
      
      Currently, I found the system can enter a state where there are lots of free
      pages in a zone but only order-0 and order-1 pages, which means the zone is
      heavily fragmented.  A high-order allocation could then make the direct reclaim
      path stall for a long time (e.g. 60 seconds), especially in a no-swap and
      no-compaction environment.  This problem happened on v3.4, but it seems the issue
      still lives in the current tree.  The reason is that do_try_to_free_pages enters
      a livelock:
      
      kswapd will go to sleep if the zones have been fully scanned and are still
      not balanced, as kswapd thinks there's little point in trying all over again
      and wants to avoid an infinite loop.  Instead it changes order from high-order
      to order-0 because kswapd thinks order-0 is the most important.  Look at
      73ce02e9 for details.  If watermarks are ok, kswapd will go back to sleep
      and may leave zone->all_unreclaimable = 0.  It assumes high-order users
      can still perform direct reclaim if they wish.
      
      Direct reclaim continues to reclaim for a high order which is not a
      COSTLY_ORDER without invoking the oom-killer until kswapd turns on
      zone->all_unreclaimable.  This is done to avoid a too-early oom-kill.
      So it means direct reclaim depends on kswapd to break this loop.
      
      In the worst case, direct reclaim may continue to reclaim pages forever when
      kswapd sleeps forever, until someone like a watchdog detects this and finally
      kills the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from the direct reclaim path because
      the direct reclaim path doesn't take any lock, so doing this would be racy.  Thus
      this patch removes the zone->all_unreclaimable field completely and recalculates
      the zone reclaimable state every time.
      
      Note: we can't take the idea that direct reclaim looks at zone->pages_scanned
      directly while kswapd continues to use zone->all_unreclaimable, because it
      is racy.  Commit 929bea7c (vmscan: all_unreclaimable() use
      zone->all_unreclaimable as a name) describes the details.
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lisa Du <cldu@marvell.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: revert "page-writeback.c: subtract min_free_kbytes from dirtyable memory" · 72457c0a
      Johannes Weiner authored
      This reverts commit 75f7ad8e.  It was the result of a problem
      observed with a 3.2 kernel and merged in 3.9, while the issue had been
      resolved upstream in 3.3 (commit ab8fabd4: "mm: exclude reserved
      pages from dirtyable memory").
      
      The "reserved pages" are a superset of min_free_kbytes, thus this change
      is redundant and confusing.  Revert it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Paul Szabo <psz@maths.usyd.edu.au>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 15 Jul, 2013 1 commit
    • kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Paul Gortmaker authored
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  5. 30 Apr, 2013 1 commit
    • mm: make snapshotting pages for stable writes a per-bio operation · 71368511
      Darrick J. Wong authored
      Walking a bio's page mappings has proved problematic, so create a new
      bio flag to indicate that a bio's data needs to be snapshotted in order
      to guarantee stable pages during writeback.  Next, for the one user
      (ext3/jbd) of snapshotting, hook all the places where writes can be
      initiated without PG_writeback set, and set BIO_SNAP_STABLE there.
      
      We must also flag journal "metadata" bios for stable writeout, since
      file data can be written through the journal.  Finally, the
      MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
      rid of it.
      
      [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
      [akpm@linux-foundation.org: teeny cleanup]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 24 Feb, 2013 1 commit
  7. 22 Feb, 2013 2 commits
    • block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Darrick J. Wong authored
      This provides a band-aid to provide stable page writes on jbd without
      needing to backport the fixed locking and page writeback bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting.
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize their page snapshotting thing to one place.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong authored
      Create a helper function to check if a backing device requires stable
      page writes and, if so, perform the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 08 Feb, 2013 1 commit
  9. 24 Jan, 2013 1 commit
  10. 14 Jan, 2013 1 commit
    • writeback: add more tracepoints · 9fb0a7da
      Tejun Heo authored
      Add tracepoints for page dirtying, writeback_single_inode start, inode
      dirtying and writeback.  For the latter two inode events, a pair of
      events are defined to denote start and end of the operations (the
      starting one has _start suffix and the one w/o suffix happens after
      the operation is complete).  These inode ops are FS specific and can
      be non-trivial and having enclosing tracepoints is useful for external
      tracers.
      
      This is part of tracepoint additions to improve visibility into
      dirtying / writeback operations for io tracer and userland.
      
      v2: writeback_dirty_inode[_start] TPs may be called for files on
          pseudo FSes w/ unregistered bdi.  Check whether bdi->dev is %NULL
          before dereferencing.
      
      v3: buffer dirtying moved to a block TP.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 21 Dec, 2012 1 commit
  12. 12 Dec, 2012 1 commit
  13. 28 Sep, 2012 1 commit
  14. 04 Aug, 2012 1 commit
    • vfs: kill write_super and sync_supers · f0cd2dbb
      Artem Bityutskiy authored
      Finally we can kill the 'sync_supers' kernel thread along with the
      '->write_super()' superblock operation because all the users are gone.
      Now every file-system is supposed to manage its own superblock and
      its dirty state.
      
      The nice thing about killing this thread is that it improves power management.
      Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
      every 5 seconds no matter what - even if there were no dirty superblocks and
      even if there were no file-systems using this service (e.g., btrfs and
      journalled ext4 do not need it). So it was wasting power most of the time. And
      because the thread was in the core of the kernel, all systems had to have it.
      So I am quite happy to make it go away.
      
      Interestingly, this thread is a left-over from the pdflush kernel thread which
      was a self-forking kernel thread responsible for all the write-back in old
      Linux kernels. It was turned into per-block device BDI threads, and
      'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 09 Jun, 2012 2 commits
  16. 06 May, 2012 1 commit
    • writeback: initialize global_dirty_limit · 68809c71
      Fengguang Wu authored
      This prevents global_dirty_limit from remaining 0 (the initial value)
      for a long time, since it's only updated in update_dirty_limit() when
      above the dirty freerun area.
      
      It will avoid unexpected consequences when some random code uses it as a
      convenient approximation of the global dirty threshold.
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
  17. 14 Apr, 2012 1 commit
  18. 22 Mar, 2012 2 commits
    • mm: export dirty_writeback_interval · 91913a29
      Artem Bityutskiy authored
      Export 'dirty_writeback_interval' to make it visible to
      file-systems. We are going to push superblock management down to
      file-systems and get rid of the 'sync_supers' kernel thread completely.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • mm: use global_dirty_limit in throttle_vm_writeout() · 47a13333
      Fengguang Wu authored
      When starting a memory hog task, a desktop box w/o swap is found to go
      unresponsive for a long time.  It's solely caused by lots of congestion
      waits in throttle_vm_writeout():
      
       gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
       gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
                 gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
                 gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
                  Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
                  Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
      
      The root cause is that the dirty threshold is knocked down a lot by the memory
      hog task.  Fixed by using global_dirty_limit, which decreases gradually on
      such events and can guarantee we stay above (the also decreasing) nr_dirty
      in the process of following down to the new dirty threshold.
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 11 Jan, 2012 4 commits
    • mm: try to distribute dirty pages fairly across zones · a756cf59
      Johannes Weiner authored
      The maximum number of dirty pages that exist in the system at any time is
      determined by a number of pages considered dirtyable and a user-configured
      percentage of those, or an absolute number in bytes.
      
      This number of dirtyable pages is the sum of memory provided by all the
      zones in the system minus their lowmem reserves and high watermarks, so
      that the system can retain a healthy number of free pages without having
      to reclaim dirty pages.
      
      But there is a flaw in that we have a zoned page allocator which does not
      care about the global state but rather the state of individual memory
      zones.  And right now there is nothing that prevents one zone from filling
      up with dirty pages while other zones are spared, which frequently leads
      to situations where kswapd, in order to restore the watermark of free
      pages, does indeed have to write pages from that zone's LRU list.  This
      can interfere so badly with IO from the flusher threads that major
      filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
      already, taking away the VM's only possibility to keep such a zone
      balanced, aside from hoping the flushers will soon clean pages from that
      zone.
      
      Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
      the global limit is to the global amount of dirtyable memory, and try to
      make sure that no single zone receives more than its fair share of the
      globally allowed dirty pages in the first place.  As the number of pages
      considered dirtyable excludes the zones' lowmem reserves and high
      watermarks, the maximum number of dirty pages in a zone is such that the
      zone can always be balanced without requiring page cleaning.
      
      As this is a placement decision in the page allocator and pages are
      dirtied only after the allocation, this patch allows allocators to pass
      __GFP_WRITE when they know in advance that the page will be written to and
      become dirty soon.  The page allocator will then attempt to allocate from
      the first zone of the zonelist - which on NUMA is determined by the task's
      NUMA memory policy - that has not exceeded its dirty limit.
      
      At first glance, it would appear that the diversion to lower zones can
      increase pressure on them, but this is not the case.  With a full high
      zone, allocations will be diverted to lower zones eventually, so it is
      more of a shift in timing of the lower zone allocations.  Workloads that
      previously could fit their dirty pages completely in the higher zone may
      be forced to allocate from lower zones, but the amount of pages that
      "spill over" are limited themselves by the lower zones' dirty constraints,
      and thus unlikely to become a problem.
      
      For now, the problem of unfair dirty page distribution remains for NUMA
      configurations where the zones allowed for allocation are in sum not big
      enough to trigger the global dirty limits, wake up the flusher threads and
      remedy the situation.  Because of this, an allocation that could not
      succeed on any of the considered zones is allowed to ignore the dirty
      limits before going into direct reclaim or even failing the allocation,
      until a future patch changes the global dirty throttling and flusher
      thread activation so that they take individual zone states into account.
      
      			Test results
      
      15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
      40% dirty ratio
      16G USB thumb drive
      10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
      
      		seconds			nr_vmscan_write
      		        (stddev)	       min|     median|        max
      xfs
      vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
      patched:	 550.996( 3.802)	     0.000|      0.000|      0.000
      
      fuse-ntfs
      vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
      patched:	 558.049(17.914)	     0.000|      0.000|     43.000
      
      btrfs
      vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
      patched:	 563.365(11.368)	     0.000|      0.000|   1362.000
      
      ext4
      vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
      patched:	 568.806(17.496)	     0.000|      0.000|      0.000
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: writeback: cleanups in preparation for per-zone dirty limits · ccafa287
      Johannes Weiner authored
      The next patch will introduce per-zone dirty limiting functions in
      addition to the traditional global dirty limiting.
      
      Rename determine_dirtyable_memory() to global_dirtyable_memory() before
      adding the zone-specific version, and fix up its documentation.
      
      Also, move the functions to determine the dirtyable memory and the
      function to calculate the dirty limit based on that together so that their
      relationship is more apparent and that they can be commented on as a
      group.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NMel Gorman <mel@suse.de>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: exclude reserved pages from dirtyable memory · ab8fabd4
      Johannes Weiner committed
      Per-zone dirty limits try to distribute page cache pages allocated for
      writing across zones in proportion to the individual zone sizes, to reduce
      the likelihood of reclaim having to write back individual pages from the
      LRU lists in order to make progress.
      
      This patch:
      
      The amount of dirtyable pages should not include the full number of free
      pages: there is a number of reserved pages that the page allocator and
      kswapd always try to keep free.
      
      The closer (reclaimable pages - dirty pages) is to the number of reserved
      pages, the more likely it becomes for reclaim to run into dirty pages:
      
             +----------+ ---
             |   anon   |  |
             +----------+  |
             |          |  |
             |          |  -- dirty limit new    -- flusher new
             |   file   |  |                     |
             |          |  |                     |
             |          |  -- dirty limit old    -- flusher old
             |          |                        |
             +----------+                       --- reclaim
             | reserved |
             +----------+
             |  kernel  |
             +----------+
      
      This patch introduces a per-zone dirty reserve that takes both the lowmem
      reserve as well as the high watermark of the zone into account, and a
      global sum of those per-zone values that is subtracted from the global
      amount of dirtyable pages.  The lowmem reserve is unavailable to page
      cache allocations and kswapd tries to keep the high watermark free.  We
      don't want to end up in a situation where reclaim has to clean pages in
      order to balance zones.
      
      Not treating reserved pages as dirtyable on a global level is only a
      conceptual fix.  In reality, dirty pages are not distributed equally
      across zones and reclaim runs into dirty pages on a regular basis.
      
      But it is important to get this right before tackling the problem on a
      per-zone level, where the distance between reclaim and the dirty pages is
      mostly much smaller in absolute numbers.
      
      [akpm@linux-foundation.org: fix highmem build]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
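The reserve accounting described in commit ab8fabd4 can be sketched in plain C as follows. This is a minimal user-space illustration, not the kernel code: `struct zone_sketch`, its fields, and both function names are hypothetical, and the per-zone reserve is assumed to be simply the lowmem reserve plus the high watermark, as the message describes.

```c
#include <assert.h>

/* Hypothetical, simplified per-zone counters (not the kernel's struct zone). */
struct zone_sketch {
    unsigned long free_pages;
    unsigned long reclaimable_pages;
    unsigned long max_lowmem_reserve; /* pages unavailable to page cache */
    unsigned long high_wmark;         /* pages kswapd tries to keep free */
};

/* Per-zone dirty reserve: lowmem reserve plus high watermark. */
static unsigned long zone_dirty_reserve(const struct zone_sketch *z)
{
    return z->max_lowmem_reserve + z->high_wmark;
}

/* Global dirtyable memory: free + reclaimable pages minus the summed reserves. */
static unsigned long global_dirtyable_sketch(const struct zone_sketch *zones, int nr)
{
    unsigned long avail = 0, reserve = 0;

    assert(nr >= 0);
    for (int i = 0; i < nr; i++) {
        avail += zones[i].free_pages + zones[i].reclaimable_pages;
        reserve += zone_dirty_reserve(&zones[i]);
    }
    /* Clamp at zero so the unsigned subtraction cannot wrap. */
    return avail > reserve ? avail - reserve : 0;
}
```

Clamping at zero is a design choice of this sketch: with unsigned counters, a reserve larger than the available pages would otherwise wrap around to a huge dirtyable value.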
    • J
      mm/page-writeback.c: make determine_dirtyable_memory static again · 1edf2234
      Johannes Weiner committed
      The tracing ring-buffer used this function briefly, but not anymore.
      Make it local to the writeback code again.
      
      Also, move the function so that no forward declaration needs to be
      reintroduced.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 04 Jan, 2012 1 commit
  21. 18 Dec, 2011 8 commits
    • W
      writeback: balanced_rate cannot exceed write bandwidth · bdaac490
      Wu Fengguang committed
      Add an upper limit to balanced_rate according to the below inequality.
      This filters out some rare but huge singular points, which at least
      enables more readable gnuplot figures.
      
      When there are N dd dirtiers,
      
      	balanced_dirty_ratelimit = write_bw / N
      
      So it holds that
      
      	balanced_dirty_ratelimit <= write_bw
      
      The singular points originate from dirty_rate in the below formula:
      
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
      where
      	dirty_rate = (number of page dirties in the past 200ms) / 200ms
      
      In the extreme case, if all dd tasks suddenly get blocked on something
      else and hence no pages are dirtied at all, dirty_rate will be 0 and
      balanced_dirty_ratelimit will be inf. This could happen in reality.
      
      Note that these huge singular points are not a real threat, since they
      are _guaranteed_ to be filtered out by the
      	min(balanced_dirty_ratelimit, task_ratelimit)
      line in bdi_update_dirty_ratelimit(). task_ratelimit is based on the
      number of dirty pages, which will never _suddenly_ fly away like
      balanced_dirty_ratelimit. So any weirdly large balanced_dirty_ratelimit
      will be cut down to the level of task_ratelimit.
      
      There won't be tiny singular points though, as long as the dirty pages
      lie inside the dirty throttling region (above the freerun region),
      because there the dd tasks will be throttled by balance_dirty_pages()
      and won't be able to suddenly dirty many more pages than average.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
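The upper limit described in commit bdaac490 can be sketched as a small C helper. This is an illustration under assumptions rather than the kernel source: the function name is hypothetical, and OR-ing 1 into the divisor is one assumed way of surviving the dirty_rate == 0 case that the message says produces an infinite rate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate,
 * capped so that it never exceeds write_bw.
 */
static unsigned long balanced_rate_sketch(unsigned long task_ratelimit,
                                          unsigned long write_bw,
                                          unsigned long dirty_rate)
{
    /* Assumption: (dirty_rate | 1) keeps the divisor non-zero. */
    uint64_t rate = (uint64_t)task_ratelimit * write_bw / (dirty_rate | 1);

    /* Filter out singular points: with N dirtiers the rate is write_bw / N. */
    if (rate > write_bw)
        rate = write_bw;

    assert(rate <= write_bw);
    return (unsigned long)rate;
}
```

With task_ratelimit = 10 and write_bw = 100, a stalled workload (dirty_rate = 0) yields the cap of 100 instead of an unbounded value, while dirty_rate = 25 yields 40.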
    • W
      writeback: do strict bdi dirty_exceeded · 82791940
      Wu Fengguang committed
      This helps to reduce dirty throttling polls and hence CPU overheads.
      
      bdi->dirty_exceeded typically only helps when suddenly starting 100+
      dd's on a disk, in which case the dd's may need to poll
      balance_dirty_pages() earlier than tsk->nr_dirtied_pause.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • W
      writeback: avoid tiny dirty poll intervals · 5b9b3574
      Wu Fengguang committed
      The LKP tests see a big 56% regression for the case fio_mmap_randwrite_64k.
      Shaohua root-caused it to the much smaller dirty pause times and hence
      much more frequent invocations of the IO-less balance_dirty_pages().
      Since fio_mmap_randwrite_64k effectively contains both reads and writes,
      the more frequent pauses triggered more idling in the cfq IO scheduler.
      
      The solution is to increase pause time all the way up to the max 200ms
      in this case, which is found to restore most performance. This will help
      reduce CPU overheads in other cases, too.
      
      Note that I don't expect many performance critical workloads to run this
      access pattern: the mmap read-on-write is rather inefficient and could
      be avoided by doing normal write syscalls.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reported-by: Li Shaohua <shaohua.li@intel.com>
      Tested-by: Li Shaohua <shaohua.li@intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • W
      writeback: max, min and target dirty pause time · 7ccb9ad5
      Wu Fengguang committed
      Control the pause time and the call intervals to balance_dirty_pages()
      with three parameters:
      
      1) max_pause, limited by bdi_dirty and MAX_PAUSE
      
      2) the target pause time, grows with the number of dd tasks
         and is normally limited by max_pause/2
      
      3) the minimal pause, set to half the target pause
         and is used to skip short sleeps and accumulate them into bigger ones
      
      The typical behaviors after patch:
      
      - if ever task_ratelimit is far below dirty_ratelimit, the pause time
        will remain constant at max_pause and nr_dirtied_pause will be
        fluctuating with task_ratelimit
      
      - in the normal cases, nr_dirtied_pause will remain stable (keep in the
        same pace with dirty_ratelimit) and the pause time will be fluctuating
        with task_ratelimit
      
      In summary, someone has to fluctuate with task_ratelimit, because
      
      	task_ratelimit = nr_dirtied_pause / pause
      
      We normally prefer a stable nr_dirtied_pause, until reaching max_pause.
      
      The notable behavior changes are:
      
      - in stable workloads, there will no longer be sudden big trajectory
        switching of nr_dirtied_pause as concerned by Peter. It will be as
        smooth as dirty_ratelimit and changing proportionally with it (as
        always, assuming bdi bandwidth does not fluctuate across 2^N lines,
        otherwise nr_dirtied_pause will show up in 2+ parallel trajectories)
      
      - in the rare cases when something keeps task_ratelimit far below
        dirty_ratelimit, the smoothness can no longer be retained and
        nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a
        (not that destructive but still not good) bug that
      	  dirty_ratelimit gets brought down undesirably
      	  <= balanced_dirty_ratelimit is under estimated
      	  <= weakly executed task_ratelimit
      	  <= pause goes too large and gets trimmed down to max_pause
      	  <= nr_dirtied_pause (based on dirty_ratelimit) is set too large
      	  <= dirty_ratelimit being much larger than task_ratelimit
      
      - introduce min_pause to avoid small pause sleeps
      
      - when pause is trimmed down to max_pause, try to compensate it at the
        next pause time
      
      The "refactor" type of changes are:
      
      The max_pause equation is slightly transformed to make it slightly more
      efficient.
      
      We now scale target_pause by (N * 10ms) on 2^N concurrent tasks, which
      is effectively equal to the original scaling max_pause by (N * 20ms)
      because the original code does implicit target_pause ~= max_pause / 2.
      Based on the same implicit ratio, target_pause starts with 10ms on 1 dd.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
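The relationship between the three pause parameters in commit 7ccb9ad5 can be illustrated with a small C sketch. It is an interpretation of the message, not the kernel implementation: the function names and the millisecond units are hypothetical (the kernel works in jiffies), and the number of concurrent dirtiers is assumed to be estimated from the ratio of write bandwidth to dirty_ratelimit, so that 2^N tasks add roughly N * 10ms.

```c
#include <assert.h>

/* Integer log2, rounding down; returns 0 for inputs 0 and 1. */
static unsigned int ilog2_sketch(unsigned long v)
{
    unsigned int r = 0;

    while (v >>= 1)
        r++;
    return r;
}

/* Target pause: 10ms for one dirtier, plus 10ms per doubling of dirtiers. */
static long target_pause_ms(unsigned long write_bw,
                            unsigned long dirty_ratelimit,
                            long max_pause_ms)
{
    long t = 10;
    unsigned int hi = ilog2_sketch(write_bw);
    unsigned int lo = ilog2_sketch(dirty_ratelimit);

    assert(max_pause_ms > 0);
    if (hi > lo)
        t += (long)(hi - lo) * 10;
    /* The target is normally limited by max_pause / 2. */
    if (t > max_pause_ms / 2)
        t = max_pause_ms / 2;
    return t;
}

/* Minimal pause: half the target; shorter sleeps are skipped and accumulated. */
static long min_pause_ms(long target_ms)
{
    return target_ms / 2;
}
```

For example, with a 200ms max_pause, eight dirtiers sharing the bandwidth equally (a ratio of 2^3) give a 40ms target and a 20ms minimum pause.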
    • W
      writeback: dirty ratelimit - think time compensation · 83712358
      Wu Fengguang committed
      Compensate the task's think time when computing the final pause time,
      so that ->dirty_ratelimit can be executed accurately.
      
              think time := time spent outside of balance_dirty_pages()
      
      In the rare case that the task slept longer than the 200ms period time
      (resulting in a negative pause time), the sleep time will be compensated
      in the following periods, too, if it's less than 1 second.
      
      Accumulated errors are carefully avoided as long as the max pause area
      is not hit.
      
      Pseudo code:
      
              period = pages_dirtied / task_ratelimit;
              think = jiffies - dirty_paused_when;
              pause = period - think;
      
      1) normal case: period > think
      
              pause = period - think
              dirty_paused_when = jiffies + pause
              nr_dirtied = 0
      
                                   period time
                    |===============================>|
                        think time      pause time
                    |===============>|==============>|
              ------|----------------|---------------|------------------------
              dirty_paused_when   jiffies
      
      2) no pause case: period <= think
      
              don't pause; reduce future pause time by:
              dirty_paused_when += period
              nr_dirtied = 0
      
                                 period time
                    |===============================>|
                                        think time
                    |===================================================>|
              ------|--------------------------------+-------------------|----
              dirty_paused_when                                       jiffies
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
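The pseudo code in commit 83712358 translates fairly directly into C. The sketch below is a user-space illustration: `struct dirtier_sketch` and `think_time_pause()` are hypothetical names, times are plain integers standing in for jiffies, the period is scaled by an `hz` parameter so that task_ratelimit can be given in pages per second, and the 1-second bound and the min/max pause clamping are deliberately left out.

```c
#include <assert.h>

/* Hypothetical per-task state mirroring the fields named in the message. */
struct dirtier_sketch {
    long dirty_paused_when; /* when the previous period ended */
    long nr_dirtied;        /* pages dirtied since then */
};

/* Returns the pause to sleep for, or 0 if the task should not pause. */
static long think_time_pause(struct dirtier_sketch *t, long now,
                             long hz, long task_ratelimit)
{
    long period, think, pause;

    assert(task_ratelimit > 0);
    period = hz * t->nr_dirtied / task_ratelimit;
    think = now - t->dirty_paused_when;
    pause = period - think;
    t->nr_dirtied = 0;

    if (pause > 0) {
        /* 1) normal case: sleep for whatever the think time did not cover */
        t->dirty_paused_when = now + pause;
        return pause;
    }
    /* 2) no pause case: advance by one period to shorten future pauses */
    t->dirty_paused_when += period;
    return 0;
}
```

In both branches dirty_paused_when advances by exactly one period relative to its old value, which is how the scheme avoids accumulating errors.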
    • W
      writeback: fix dirtied pages accounting on redirty · 2f800fbd
      Wu Fengguang committed
      De-account the accumulative dirty counters on page redirty.
      
      Page redirties (very common in ext4) will introduce mismatch between
      counters (a) and (b)
      
      a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
      b) NR_WRITTEN, BDI_WRITTEN
      
      This will introduce systematic errors in balanced_rate and result in
      dirty page position errors (ie. the dirty pages are no longer balanced
      around the global/bdi setpoints).
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • W
      writeback: fix dirtied pages accounting on sub-page writes · d3bc1fef
      Wu Fengguang committed
      When dd writes in 512-byte chunks, generic_perform_write() calls
      balance_dirty_pages_ratelimited() 8 times for the same page, but
      obviously the page is only dirtied once.
      
      Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
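The over-accounting fixed by commit d3bc1fef can be reproduced with a small simulation. The sketch assumes a 4096-byte page (consistent with 8 calls per page at 512 bytes) and sequential writes starting at offset 0; both function names are hypothetical and none of this is kernel code.

```c
#include <assert.h>

#define PAGE_SIZE_SKETCH 4096UL /* assumed page size */

/* Old behaviour: one charge per write call, however small the write is. */
static unsigned long charges_per_call(unsigned long total, unsigned long chunk)
{
    assert(chunk > 0);
    return (total + chunk - 1) / chunk;
}

/* Fixed behaviour: charge only when a page is dirtied for the first time. */
static unsigned long charges_at_dirty_time(unsigned long total, unsigned long chunk)
{
    unsigned long charged = 0, off = 0;
    unsigned long next_clean = 0; /* pages below this index are already dirty */

    assert(chunk > 0);
    while (off < total) {
        unsigned long end = off + chunk < total ? off + chunk : total;
        unsigned long first = off / PAGE_SIZE_SKETCH;
        unsigned long last = (end - 1) / PAGE_SIZE_SKETCH;

        for (unsigned long page = first; page <= last; page++) {
            if (page >= next_clean) {
                charged++;
                next_clean = page + 1;
            }
        }
        off = end;
    }
    return charged;
}
```

Writing 8192 bytes in 512-byte chunks produces 16 charges under per-call accounting but only 2 when charging at page dirty time.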
    • W
      writeback: charge leaked page dirties to active tasks · 54848d73
      Wu Fengguang committed
      It's a years long problem that a large number of short-lived dirtiers
      (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
      (eg. dd) as well as pushing the dirty pages to the global hard limit.
      
      The solution is to charge the pages dirtied by the exited gcc to the
      other random dirtying tasks. It does not sound perfect, but it should
      behave well enough in practice: throttled tasks aren't actually running,
      so those that are running are more likely to pick up the charge and get
      throttled, which promotes an equal spread.
      
      Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
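The charging scheme from commit 54848d73 can be sketched in user-space C. The single global `leaked_dirties` counter is a simplified stand-in for the `dirty_throttle_leaks` variable named in the message (assumed here to be per-CPU in the kernel); `struct task_sketch` and both functions are hypothetical.

```c
#include <assert.h>

/* Simplified stand-in for dirty_throttle_leaks. */
static unsigned long leaked_dirties;

struct task_sketch {
    unsigned long nr_dirtied; /* pages dirtied but not yet throttled for */
};

/* On exit, unthrottled dirties are leaked into the shared pool. */
static void task_exit_sketch(struct task_sketch *t)
{
    leaked_dirties += t->nr_dirtied;
    t->nr_dirtied = 0;
}

/* A running dirtier picks up leaked pages, up to its own ratelimit. */
static void pick_up_leaks(struct task_sketch *t, unsigned long ratelimit)
{
    unsigned long room, take;

    if (leaked_dirties == 0 || t->nr_dirtied >= ratelimit)
        return;
    room = ratelimit - t->nr_dirtied;
    take = leaked_dirties < room ? leaked_dirties : room;
    leaked_dirties -= take;
    t->nr_dirtied += take;
}
```

Capping the pickup at the remaining room below the ratelimit means a single task is never charged more than what brings it to its next throttling point; whatever is left stays in the pool for other running dirtiers.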
  22. 08 Dec, 2011 3 commits
    • W
      writeback: set max_pause to lowest value on zero bdi_dirty · 82e230a0
      Wu Fengguang committed
      Some traces show lots of bdi_dirty=0 lines where the real value would be
      some small number were it not for the accounting errors in the per-cpu
      bdi stats.
      
      In this case the max pause time should really be set to the smallest
      (non-zero) value to avoid IO queue underrun and improve throughput.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • W
      writeback: permit through good bdi even when global dirty exceeded · c5c6343c
      Wu Fengguang committed
      On a system with 1 local mount and 1 NFS mount, if the NFS server
      stops responding while dd writes to the NFS mount, the NFS dirty pages
      may exceed the global dirty limit and _every_ task that writes will be
      blocked. The whole system appears unresponsive.
      
      The workaround is to permit through the bdis that only have a small
      number of dirty pages. The number chosen (bdi_stat_error pages) is not
      enough for the local disk to run at optimal throughput, but it is
      enough to keep the system responsive on a broken NFS mount. The user can
      then kill the dirtiers on the NFS mount and increase the global dirty
      limit to bring up the local disk's throughput.
      
      It risks allowing dirty pages to grow much larger than the global dirty
      limit when there are 1000+ mounts, however that's very unlikely to happen,
      especially in low memory profiles.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
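The escape hatch described in commit c5c6343c amounts to one extra condition in the blocking decision. The predicate below is a hypothetical simplification: the function and parameter names are invented, `bdi_error` stands for the bdi_stat_error page count mentioned in the message, and ordinary rate-based throttling below the global limit is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a writer must be blocked because the global dirty limit
 * is exceeded. Assumption: rate-based throttling happens elsewhere.
 */
static bool blocked_by_global_limit(unsigned long nr_dirty,
                                    unsigned long dirty_thresh,
                                    unsigned long bdi_dirty,
                                    unsigned long bdi_error)
{
    if (nr_dirty <= dirty_thresh)
        return false;
    /* Good bdi: only a handful of dirty pages, so let the task through. */
    if (bdi_dirty <= bdi_error)
        return false;
    return true;
}
```

A writer to the local disk with 5 dirty pages and an error margin of 8 is let through even when the NFS mount has pushed the global count past the limit, while a writer to the stuck NFS bdi stays blocked.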
    • W
      writeback: comment on the bdi dirty threshold · aed21ad2
      Wu Fengguang committed
      We do "floating proportions" to let active devices grow their target
      share of dirty pages and stalled/inactive devices decrease their target
      share over time.
      
      It works well except in the case of "an inactive disk suddenly goes
      busy", where the initial target share may be too small. To mitigate
      this, bdi_position_ratio() has the below line to raise a small
      bdi_thresh when it's safe to do so, so that the disk is fed with enough
      dirty pages for efficient IO and in turn fast rampup of bdi_thresh:
      
              bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
      
      balance_dirty_pages() normally does negative feedback control which
      adjusts ratelimit to balance the bdi dirty pages around the target.
      In some extreme cases when that is not enough, it will have to block
      the tasks completely until the bdi dirty pages drop below bdi_thresh.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
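The bdi_thresh adjustment quoted in commit aed21ad2 can be tried out in isolation. The helper below is a stand-alone sketch with a hypothetical name; the guard against dirty exceeding limit is an addition of this sketch, needed because the subtraction is done on unsigned values here.

```c
#include <assert.h>

/* bdi_thresh = max(bdi_thresh, (limit - dirty) / 8), guarded against wrap. */
static unsigned long raise_bdi_thresh(unsigned long bdi_thresh,
                                      unsigned long limit,
                                      unsigned long dirty)
{
    /* Assumption of this sketch: no headroom means no raise. */
    unsigned long headroom = limit > dirty ? limit - dirty : 0;
    unsigned long floor = headroom / 8;

    return bdi_thresh > floor ? bdi_thresh : floor;
}
```

With limit = 1000 and dirty = 200, a nearly idle disk whose share had decayed to 10 pages is raised to 100, while a busy disk already at 500 is left unchanged.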
  23. 17 Nov, 2011 1 commit