1. 02 Jun 2015, 13 commits
    • writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's · a06fd6b1
      Committed by Tejun Heo
      For cgroup writeback support, all bdi-wide operations should be
      distributed to all its wb's (bdi_writeback's).
      
      This patch updates laptop_mode_timer_fn() so that it invokes
      wb_start_writeback() on all wb's rather than just the root one.  As
      the intent is writing out all dirty data, there's no reason to split
      the number of pages to write.
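
      For illustration, a user-space sketch of the loop this implies.  The
      list handling and types below are simplified stand-ins, not the
      kernel's bdi/wb structures or the real wb_start_writeback():

	struct wb_sketch { struct wb_sketch *next; };
	struct bdi_sketch { struct wb_sketch *wb_list; };

	static void wb_start_writeback_stub(struct wb_sketch *wb)
	{
		/* queue writeback work for this wb */
	}

	static void laptop_mode_timer_fn_sketch(struct bdi_sketch *bdi)
	{
		struct wb_sketch *wb;

		/* writing out all dirty data: no need to split nr_pages per wb */
		for (wb = bdi->wb_list; wb; wb = wb->next)
			wb_start_writeback_stub(wb);
	}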
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a06fd6b1
    • writeback: remove bdi_start_writeback() · c00ddad3
      Committed by Tejun Heo
      bdi_start_writeback() is a thin wrapper on top of
      __wb_start_writeback() which is used only by laptop_mode_timer_fn().
      This patch removes bdi_start_writeback(), renames
      __wb_start_writeback() to wb_start_writeback() and makes
      laptop_mode_timer_fn() use it instead.
      
      This doesn't cause any functional difference and will ease making
      laptop_mode_timer_fn() cgroup writeback aware.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c00ddad3
    • writeback: make bdi->min/max_ratio handling cgroup writeback aware · 693108a8
      Committed by Tejun Heo
      bdi->min/max_ratio are user-configurable per-bdi knobs which regulate
      the dirty limit of each bdi.  For cgroup writeback, they need to be
      further distributed across wb's (bdi_writeback's) belonging to the
      configured bdi.
      
      This patch introduces wb_min_max_ratio() which distributes
      bdi->min/max_ratio according to a wb's proportion in the total active
      bandwidth of its bdi.
      
      v2: Update wb_min_max_ratio() to fix a bug where both min and max were
          assigned the min value and avoid calculations when possible.
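
      As an illustration of the proportional split described above, here is
      a small self-contained C sketch with made-up numbers (the kernel
      version operates on the wb's avg_write_bandwidth and the bdi's
      tot_write_bandwidth):

	#include <stdio.h>

	static void wb_min_max_ratio_sketch(unsigned long wb_bw,
					    unsigned long tot_bw,
					    unsigned long bdi_min,
					    unsigned long bdi_max,
					    unsigned long *minp,
					    unsigned long *maxp)
	{
		unsigned long min = bdi_min, max = bdi_max;

		if (wb_bw < tot_bw) {		/* wb owns only part of the bdi */
			if (min)
				min = min * wb_bw / tot_bw;
			if (max < 100)		/* 100 means "no limit", leave as-is */
				max = max * wb_bw / tot_bw;
		}
		*minp = min;
		*maxp = max;
	}

	int main(void)
	{
		unsigned long min, max;

		/* a wb with 25% of its bdi's bandwidth, bdi min/max = 4%/40% */
		wb_min_max_ratio_sketch(25, 100, 4, 40, &min, &max);
		printf("min=%lu max=%lu\n", min, max);	/* min=1 max=10 */
		return 0;
	}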
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      693108a8
    • writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account · 95a46c65
      Committed by Tejun Heo
      bdi_has_dirty_io() used to only reflect whether the root wb
      (bdi_writeback) has dirty inodes.  For cgroup writeback support, it
      needs to take all active wb's into account.  If any wb on the bdi has
      dirty inodes, bdi_has_dirty_io() should return true.
      
      To achieve that, as inode_wb_list_{move|del}_locked() now keep track
      of each wb's dirty state transitions, the number of dirty wbs could
      be counted in the bdi; however, the bdi already aggregates
      wb->avg_write_bandwidth into bdi->tot_write_bandwidth, which can
      easily be guaranteed to be > 0 whenever there are any dirty inodes by
      ensuring that wb->avg_write_bandwidth never dips below one.
      bdi_has_dirty_io() can then simply test whether
      bdi->tot_write_bandwidth is zero or not.
      
      While this bumps the value of wb->avg_write_bandwidth to one when it
      used to be zero, this shouldn't cause any meaningful behavior
      difference.
      
      bdi_has_dirty_io() is made an inline function which tests whether
      ->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
      value are added to inode_wb_list_{move|del}_locked().
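
      A minimal sketch of the resulting check, using a mock struct rather
      than the kernel's backing-dev header:

	#include <stdbool.h>

	struct bdi_dirty_sketch {
		long tot_write_bandwidth;	/* sum over wb's with dirty inodes */
	};

	static inline bool bdi_has_dirty_io_sketch(const struct bdi_dirty_sketch *bdi)
	{
		/* every dirty wb contributes at least one, so non-zero == dirty */
		return bdi->tot_write_bandwidth != 0;
	}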
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      95a46c65
    • writeback: implement backing_dev_info->tot_write_bandwidth · 766a9d6e
      Committed by Tejun Heo
      cgroup writeback support needs to keep track of the sum of
      avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
      distribute write workload.  This patch adds bdi->tot_write_bandwidth
      and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
      and wb_update_write_bandwidth() to adjust it as wb's gain and lose
      dirty inodes and its avg_write_bandwidth gets updated.
      
      As the update events are not synchronized with each other,
      bdi->tot_write_bandwidth is an atomic_long_t.
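
      A rough model of the bookkeeping using C11 atomics and mock fields
      (the kernel uses atomic_long_t and its own helpers):

	#include <stdatomic.h>

	struct bdi_model { atomic_long tot_write_bandwidth; };
	struct wb_model { long avg_write_bandwidth; struct bdi_model *bdi; };

	/* called when the wb gains its first dirty inode */
	static void wb_becomes_dirty(struct wb_model *wb)
	{
		atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
				 wb->avg_write_bandwidth);
	}

	/* called when the wb loses its last dirty inode */
	static void wb_becomes_clean(struct wb_model *wb)
	{
		atomic_fetch_sub(&wb->bdi->tot_write_bandwidth,
				 wb->avg_write_bandwidth);
	}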
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      766a9d6e
    • writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback · dfb8ae56
      Committed by Tejun Heo
      Currently, balance_dirty_pages() always works on bdi->wb.  This patch
      updates it to work on the wb (bdi_writeback) matching the memcg and
      blkcg of the current task, as that's what the inode is being dirtied
      against.
      
      balance_dirty_pages_ratelimited() now pins the current wb and passes
      it to balance_dirty_pages().
      
      As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
      visible behavior differences.
      
      v2: Updated for per-inode wb association.
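
      A shape-only sketch of the new call path; inode_to_wb_pin() and
      wb_unpin() are hypothetical stand-ins for the real lookup and pinning
      helpers:

	struct wb_sk;
	struct inode_sk;

	struct wb_sk *inode_to_wb_pin(struct inode_sk *inode);
	void wb_unpin(struct wb_sk *wb);
	void balance_dirty_pages_sk(struct wb_sk *wb, unsigned long pages_dirtied);

	static void balance_dirty_pages_ratelimited_sk(struct inode_sk *inode,
						       unsigned long pages_dirtied)
	{
		/* the wb matching the memcg and blkcg the inode is dirtied against */
		struct wb_sk *wb = inode_to_wb_pin(inode);

		balance_dirty_pages_sk(wb, pages_dirtied);
		wb_unpin(wb);
	}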
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dfb8ae56
    • writeback: attribute stats to the matching per-cgroup bdi_writeback · 91018134
      Committed by Tejun Heo
      Until now, all WB_* stats were accounted against the root wb
      (bdi_writeback).  Now that multiple wb (bdi_writeback) support is in
      place, let's attribute the stats to the respective per-cgroup wb's.
      
      As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
      visible behavior differences.
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      91018134
    • writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Committed by Tejun Heo
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
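
      A toy model of the on-demand lookup, using a plain array where the
      kernel uses a radix tree keyed by memcg id (types and names here are
      illustrative only):

	#include <stdlib.h>

	#define MAX_MEMCG_IDS 1024

	struct cgwb_toy { int memcg_id; };
	struct bdi_toy { struct cgwb_toy *cgwb_tree[MAX_MEMCG_IDS]; };

	static struct cgwb_toy *wb_get_create_toy(struct bdi_toy *bdi, int memcg_id)
	{
		struct cgwb_toy *wb = bdi->cgwb_tree[memcg_id];

		if (wb)
			return wb;		/* already created, reuse it */
		wb = calloc(1, sizeof(*wb));
		if (!wb)
			return NULL;
		wb->memcg_id = memcg_id;
		bdi->cgwb_tree[memcg_id] = wb;	/* indexed by memcg id */
		return wb;
	}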
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      52ebea74
    • writeback: s/bdi/wb/ in mm/page-writeback.c · de1fff37
      Committed by Tejun Heo
      Writeback operations will now be per wb (bdi_writeback) instead of
      bdi.  Replace the relevant bdi references in symbol names and comments
      with wb.  This patch is purely cosmetic and doesn't make any
      functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      de1fff37
    • writeback: move bandwidth related fields from backing_dev_info into bdi_writeback · a88a341a
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bandwidth related fields from backing_dev_info into
      bdi_writeback.
      
      * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
        write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
        balanced_dirty_ratelimit, completions and dirty_exceeded.
      
      * writeback_chunk_size() and over_bground_thresh() now take @wb
        instead of @bdi.
      
      * bdi_writeout_fraction(bdi, ...)	-> wb_writeout_fraction(wb, ...)
        bdi_dirty_limit(bdi, ...)		-> wb_dirty_limit(wb, ...)
        bdi_position_ratio(bdi, ...)		-> wb_position_ratio(wb, ...)
        bdi_update_write_bandwidth(bdi, ...)	-> wb_update_write_bandwidth(wb, ...)
        [__]bdi_update_bandwidth(bdi, ...)	-> [__]wb_update_bandwidth(wb, ...)
        bdi_{max|min}_pause(bdi, ...)		-> wb_{max|min}_pause(wb, ...)
        bdi_dirty_limits(bdi, ...)		-> wb_dirty_limits(wb, ...)
      
      * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
        respectively.  Note that explicit zeroing is dropped in the process
        as wb's are cleared in entirety anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      
      v2: Typo in description fixed as suggested by Jan.
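
      For reference, a shape-only sketch of where the fields end up; the
      list follows the commit message above, but the types are simplified
      (the real structure uses percpu counters and fprop state for some of
      them):

	struct bdi_writeback_sketch {
		unsigned long bw_time_stamp;		/* last bandwidth update */
		unsigned long dirtied_stamp;
		unsigned long written_stamp;
		unsigned long write_bandwidth;		/* measured bandwidth */
		unsigned long avg_write_bandwidth;	/* smoothed estimate */
		unsigned long dirty_ratelimit;
		unsigned long balanced_dirty_ratelimit;
		unsigned long completions;		/* simplified here */
		int dirty_exceeded;
	};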
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a88a341a
    • writeback: move backing_dev_info->bdi_stat[] into bdi_writeback · 93f78d88
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->bdi_stat[] into wb.
      
      * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
        enums is changed from BDI_ to WB_.
      
      * BDI_STAT_BATCH() -> WB_STAT_BATCH()
      
      * [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc|dec|sum}_wb_stat(wb, ...)
      
      * bdi_stat[_error]() -> wb_stat[_error]()
      
      * bdi_writeout_inc() -> wb_writeout_inc()
      
      * Stat init is moved to bdi_wb_init(), and bdi_wb_exit() is added to
        free the stats.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
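
      A rough sketch of the renamed per-wb stats; the items shown are the
      usual reclaimable/writeback/dirtied/written counters with the new WB_
      prefix, and the counter type is simplified to a plain array instead of
      percpu counters:

	enum wb_stat_item_sketch {
		WB_RECLAIMABLE_SK,	/* was BDI_RECLAIMABLE */
		WB_WRITEBACK_SK,	/* was BDI_WRITEBACK */
		WB_DIRTIED_SK,		/* was BDI_DIRTIED */
		WB_WRITTEN_SK,		/* was BDI_WRITTEN */
		NR_WB_STAT_ITEMS_SK,
	};

	struct wb_stat_sketch {
		long stat[NR_WB_STAT_ITEMS_SK];
	};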
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      93f78d88
    • memcg: add per cgroup dirty page accounting · c4843a75
      Committed by Greg Thelen
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat(), which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c4843a75
    • page_writeback: revive cancel_dirty_page() in a restricted form · 11f81bec
      Committed by Tejun Heo
      cancel_dirty_page() had some issues and b9ea2515 ("page_writeback:
      clean up mess around cancel_dirty_page()") replaced it with
      account_page_cleaned() which makes the caller responsible for clearing
      the dirty bit; unfortunately, the planned changes for cgroup writeback
      support require synchronization between dirty bit manipulation and
      stat updates.  While we could open-code such synchronization in each
      account_page_cleaned() callsite, that would be unnecessarily awkward
      and verbose.
      
      This patch revives cancel_dirty_page() but in a more restricted form.
      All it does is TestClearPageDirty() followed by account_page_cleaned()
      invocation if the page was dirty.  This helper covers all
      account_page_cleaned() usages except for __delete_from_page_cache()
      which is a special case anyway and left alone.  As this leaves no
      module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
      from it.
      
      This patch just revives cancel_dirty_page() as a trivial wrapper to
      replace equivalent usages and doesn't introduce any functional
      changes.
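
      A minimal sketch of the revived helper's behaviour, with a mock page
      type and a stub in place of the real accounting:

	#include <stdbool.h>

	struct page_sketch { bool dirty; };

	static bool test_clear_page_dirty(struct page_sketch *page)
	{
		bool was_dirty = page->dirty;

		page->dirty = false;
		return was_dirty;
	}

	static void account_page_cleaned_stub(struct page_sketch *page)
	{
		/* decrement the dirty counters for this page */
	}

	static void cancel_dirty_page_sketch(struct page_sketch *page)
	{
		if (test_clear_page_dirty(page))
			account_page_cleaned_stub(page);
	}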
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      11f81bec
  2. 16 Apr 2015, 1 commit
  3. 15 Apr 2015, 1 commit
    • page_writeback: clean up mess around cancel_dirty_page() · b9ea2515
      Committed by Konstantin Khlebnikov
      This patch replaces cancel_dirty_page() with a helper function
      account_page_cleaned() which only updates counters.  It's called from
      truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
      Page is locked in both cases, page-lock protects against concurrent
      dirtiers: see commit 2d6d7f98 ("mm: protect set_page_dirty() from
      ongoing truncation").
      
      delete_from_page_cache() shouldn't be called for dirty pages; they must
      be handled by the caller (either written back or truncated).  This
      patch treats the final dirty accounting fixup at the end of
      __delete_from_page_cache() as a debug check and adds WARN_ON_ONCE()
      around it.  If something removes dirty pages without proper handling,
      that might be a bug and unwritten data might be lost.
      
      Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
      here.
      
      cancel_dirty_page() in nfs_wb_page_cancel() is redundant.  This is a
      helper for nfs_invalidate_page() and it's called only in the case of
      complete invalidation.
      
      The mess started in v2.6.20 with commits 46d2277c ("Clean up
      and make try_to_free_buffers() not race with dirty pages") and
      3e67c098 ("truncate: clear page dirtiness before running
      try_to_free_buffers()").  The first was reverted in v2.6.20 itself by
      commit ecdfc978 ("Resurrect 'try_to_free_buffers()' VM hackery"), the
      second in v2.6.25 by commit a2b34564 ("Fix dirty page accounting leak
      with ext3 data=journal").
      
      Custom fixes were introduced between these points.  NFS in v2.6.23, commit
      1b3b4a1a ("NFS: Fix a write request leak in nfs_invalidate_page()").
      Kludge in __delete_from_page_cache() in v2.6.24, commit 3a692790 ("Do
      dirty page accounting when removing a page from the page cache").  Since
      v2.6.25 all of them are redundant.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9ea2515
  4. 23 Mar 2015, 1 commit
    • writeback: fix possible underflow in write bandwidth calculation · c72efb65
      Committed by Tejun Heo
      2f800fbd ("writeback: fix dirtied pages accounting on redirty")
      introduced account_page_redirty() which reverts stat updates for a
      redirtied page, making BDI_DIRTIED no longer monotonically increasing.
      
      bdi_update_write_bandwidth() uses the delta in BDI_DIRTIED as the
      basis for bandwidth calculation.  While unlikely, since the above
      patch, the newer value may be lower than the recorded past value and
      underflow the bandwidth calculation leading to a wild result.
      
      Fix it by subtracting the min of the old and new values when
      calculating the delta.  AFAIK, there hasn't been any report of it happening but the
      resulting erratic behavior would be non-critical and temporary, so
      it's possible that the issue is happening without being reported.  The
      risk of the fix is very low, so tagged for -stable.
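
      The arithmetic of the fix, as a self-contained sketch with made-up
      counter values:

	#include <stdio.h>

	int main(void)
	{
		unsigned long dirtied_stamp = 1000;	/* recorded last period */
		unsigned long dirtied = 990;		/* redirty made it go backwards */

		/* old: unsigned subtraction wraps to a huge bogus delta */
		unsigned long bad_delta = dirtied - dirtied_stamp;

		/* new: subtract min(old, new), so the delta bottoms out at 0 */
		unsigned long ref = dirtied < dirtied_stamp ? dirtied : dirtied_stamp;
		unsigned long good_delta = dirtied - ref;

		printf("bad=%lu good=%lu\n", bad_delta, good_delta);
		return 0;
	}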
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Fixes: 2f800fbd ("writeback: fix dirtied pages accounting on redirty")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c72efb65
  5. 04 Mar 2015, 1 commit
    • writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth() · 7d70e154
      Committed by Tejun Heo
      global_update_bandwidth() uses the static variable update_time as the
      timestamp of the last update but forgets to initialize it to
      INITIAL_JIFFIES.
      
      This means that global_dirty_limit will be 5 mins into the future on
      32bit and a large number of jiffies into the past on 64bit.  This
      isn't critical as the only effect is that global_dirty_limit won't be
      updated for the first 5 mins after booting on 32bit machines,
      especially given the auxiliary nature of global_dirty_limit's role -
      protecting against global dirty threshold's sudden dips; however, it
      does lead to unintended suboptimal behavior.  Fix it.
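
      The shape of the fix is a one-line initializer; INITIAL_JIFFIES is
      sketched here with HZ hard-coded to 100, whereas the kernel derives it
      from the real HZ so that jiffies wraps shortly after boot:

	#define HZ_SKETCH 100
	#define INITIAL_JIFFIES_SKETCH \
		((unsigned long)(unsigned int)(-300 * HZ_SKETCH))

	static unsigned long update_time = INITIAL_JIFFIES_SKETCH;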
      
      Fixes: c42843f2 ("writeback: introduce smoothed global dirty limit")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7d70e154
  6. 12 Feb 2015, 2 commits
  7. 21 Jan 2015, 1 commit
  8. 09 Jan 2015, 1 commit
    • mm: protect set_page_dirty() from ongoing truncation · 2d6d7f98
      Committed by Johannes Weiner
      Tejun, while reviewing the code, spotted the following race condition
      between the dirtying and truncation of a page:
      
      __set_page_dirty_nobuffers()       __delete_from_page_cache()
        if (TestSetPageDirty(page))
                                           page->mapping = NULL
      				     if (PageDirty())
      				       dec_zone_page_state(page, NR_FILE_DIRTY);
      				       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
          if (page->mapping)
            account_page_dirtied(page)
              __inc_zone_page_state(page, NR_FILE_DIRTY);
      	__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
      
      which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
      
      Dirtiers usually lock out truncation, either by holding the page lock
      directly, or in case of zap_pte_range(), by pinning the mapcount with
      the page table lock held.  The notable exception to this rule, though,
      is do_wp_page(), for which this race exists.  However, do_wp_page()
      already waits for a locked page to unlock before setting the dirty bit,
      in order to prevent a race where clear_page_dirty() misses the page bit
      in the presence of dirty ptes.  Upgrade that wait to a fully locked
      set_page_dirty() to also cover the situation explained above.
      
      Afterwards, the code in set_page_dirty() dealing with a truncation race
      is no longer needed.  Remove it.
      Reported-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d6d7f98
  9. 11 Dec 2014, 1 commit
    • mm, memcg: fix potential undefined behaviour in page stat accounting · e4bd6a02
      Committed by Michal Hocko
      Since commit d7365e78 ("mm: memcontrol: fix missed end-writeback
      page accounting") mem_cgroup_end_page_stat consumes locked and flags
      variables directly rather than via pointers which might trigger C
      undefined behavior as those variables are initialized only in the slow
      path of mem_cgroup_begin_page_stat.
      
      Although mem_cgroup_end_page_stat handles the parameters correctly and
      touches them only when they hold a sensible value, it is the caller
      which loads a potentially uninitialized value, which then might allow
      the compiler to do crazy things.
      
      I haven't seen any warning from gcc and it seems that the current
      version (4.9) doesn't exploit this type of undefined behavior, but
      Sasha has reported the following:
      
        UBSan: Undefined behaviour in mm/rmap.c:1084:2
        load of value 255 is not a valid value for type '_Bool'
        CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
        Call Trace:
          dump_stack (lib/dump_stack.c:52)
          ubsan_epilogue (lib/ubsan.c:159)
          __ubsan_handle_load_invalid_value (lib/ubsan.c:482)
          page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
          unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
          unmap_single_vma (mm/memory.c:1348)
          unmap_vmas (mm/memory.c:1377 (discriminator 3))
          exit_mmap (mm/mmap.c:2837)
          mmput (kernel/fork.c:659)
          do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
          do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
          SyS_exit_group (kernel/exit.c:901)
          tracesys_phase2 (arch/x86/kernel/entry_64.S:529)
      
      Fix this by using pointer parameters for both locked and flags, which
      is more robust against future compiler changes even though the current
      code is implemented correctly.
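
      A minimal model of the interface shape (stub types, not the real memcg
      code): locked and flags travel by pointer in both directions, so only
      the function that actually set them ever reads them back:

	#include <stdbool.h>

	struct memcg_sketch { int moving; };

	static struct memcg_sketch *begin_page_stat_sketch(struct memcg_sketch *memcg,
							   bool *locked,
							   unsigned long *flags)
	{
		*locked = false;
		if (memcg && memcg->moving) {
			*flags = 1;	/* pretend we took an irq-saving lock */
			*locked = true;
		}
		return memcg;
	}

	static void end_page_stat_sketch(struct memcg_sketch *memcg,
					 bool *locked, unsigned long *flags)
	{
		if (memcg && *locked)
			(void)*flags;	/* unlock using the saved flags */
	}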
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4bd6a02
  10. 30 Oct 2014, 2 commits
    • mm: memcontrol: fix missed end-writeback page accounting · d7365e78
      Committed by Johannes Weiner
      Commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API") changed
      page migration to uncharge the old page right away.  The page is locked,
      unmapped, truncated, and off the LRU, but it could race with writeback
      ending, which then doesn't unaccount the page properly:
      
      test_clear_page_writeback()              migration
                                                 wait_on_page_writeback()
        TestClearPageWriteback()
                                                 mem_cgroup_migrate()
                                                   clear PCG_USED
        mem_cgroup_update_page_stat()
          if (PageCgroupUsed(pc))
            decrease memcg pages under writeback
      
        release pc->mem_cgroup->move_lock
      
      The per-page statistics interface is heavily optimized to avoid a
      function call and a lookup_page_cgroup() in the file unmap fast path,
      which means it doesn't verify whether a page is still charged before
      clearing PageWriteback() and it has to do it in the stat update later.
      
      Rework it so that it looks up the page's memcg once at the beginning of
      the transaction and then uses it throughout.  The charge will be
      verified before clearing PageWriteback() and migration can't uncharge
      the page as long as that is still set.  The RCU lock will protect the
      memcg past uncharge.
      
      As far as losing the optimization goes, the following test results are
      from a microbenchmark that maps, faults, and unmaps a 4GB sparse file
      three times in a nested fashion, so that there are two negative passes
      that don't account but still go through the new transaction overhead.
      There is no actual difference:
      
       old:     33.195102545 seconds time elapsed       ( +-  0.01% )
       new:     33.199231369 seconds time elapsed       ( +-  0.03% )
      
      The time spent in page_remove_rmap()'s callees still adds up to the
      same, but the time spent in the function itself seems reduced:
      
           # Children      Self  Command        Shared Object       Symbol
       old:     0.12%     0.11%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
       new:     0.12%     0.08%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7365e78
    • mm: page-writeback: inline account_page_dirtied() into single caller · 3a3c02ec
      Committed by Johannes Weiner
      A follow-up patch would have changed the call signature.  To save the
      trouble, just fold it instead.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a3c02ec
  11. 10 Oct 2014, 1 commit
  12. 08 Sep 2014, 1 commit
  13. 07 Aug 2014, 1 commit
  14. 31 Jul 2014, 1 commit
  15. 07 Jun 2014, 1 commit
  16. 05 Jun 2014, 2 commits
  17. 12 May 2014, 1 commit
    • ext4: fix data integrity sync in ordered mode · 1c8349a1
      Committed by Namjae Jeon
      When we perform a data integrity sync we tag all the dirty pages with
      PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages.  Later we
      check for this tag in write_cache_pages_da, create a struct
      mpage_da_data containing contiguously indexed pages tagged with this
      tag, and sync these pages with a call to mpage_da_map_and_submit.
      This process is repeated in a while loop until all the
      PAGECACHE_TAG_TOWRITE pages are synced.  We also do journal start and
      stop in each iteration.  journal_stop could initiate a journal commit,
      which would call ext4_writepage, which in turn calls
      ext4_bio_write_page even for delayed or unwritten buffers.  When
      ext4_bio_write_page is called for such buffers, even though it does
      not sync them, it clears the PAGECACHE_TAG_TOWRITE of the
      corresponding pages, and hence these pages are not synced by the
      currently running data integrity sync.  We end up with dirty pages
      even though the sync has completed.
      
      This could cause a potential data loss when the sync call is followed
      by a truncate_pagecache call, which is exactly the case in
      collapse_range.  (It will cause generic/127 failure in xfstests)
      
      To avoid this issue, we can use set_page_writeback_keepwrite instead of
      set_page_writeback, which doesn't clear TOWRITE tag.
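
      An intent-only sketch with stub functions (not the ext4/mm API): pages
      written out as a side effect of a journal commit start writeback
      without dropping the TOWRITE tag, so the data integrity sync that
      tagged them still finds and waits for them:

	struct page_wb;

	void set_page_writeback_stub(struct page_wb *page);		/* clears TOWRITE */
	void set_page_writeback_keepwrite_stub(struct page_wb *page);	/* keeps TOWRITE */

	static void ext4_bio_write_page_sketch(struct page_wb *page, int keep_towrite)
	{
		if (keep_towrite)
			set_page_writeback_keepwrite_stub(page);
		else
			set_page_writeback_stub(page);
	}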
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      1c8349a1
  18. 07 May 2014, 1 commit
  19. 08 Apr 2014, 1 commit
  20. 07 Feb 2014, 1 commit
    • mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq() · a85d9df1
      Committed by KOSAKI Motohiro
      During an aio stress test, we observed the following lockdep warning.
      This means AIO+numa_balancing is currently deadlockable.
      
      The problem is that aio_migratepage disables interrupts, but
      __set_page_dirty_nobuffers unintentionally enables them again.
      
      Generally, all helper functions should use spin_lock_irqsave() instead
      of spin_lock_irq() because they don't know their callers at all.
      
         other info that might help us debug this:
          Possible unsafe locking scenario:
      
                CPU0
                ----
           lock(&(&ctx->completion_lock)->rlock);
           <Interrupt>
             lock(&(&ctx->completion_lock)->rlock);
      
          *** DEADLOCK ***
      
            dump_stack+0x19/0x1b
            print_usage_bug+0x1f7/0x208
            mark_lock+0x21d/0x2a0
            mark_held_locks+0xb9/0x140
            trace_hardirqs_on_caller+0x105/0x1d0
            trace_hardirqs_on+0xd/0x10
            _raw_spin_unlock_irq+0x2c/0x50
            __set_page_dirty_nobuffers+0x8c/0xf0
            migrate_page_copy+0x434/0x540
            aio_migratepage+0xb1/0x140
            move_to_new_page+0x7d/0x230
            migrate_pages+0x5e5/0x700
            migrate_misplaced_page+0xbc/0xf0
            do_numa_page+0x102/0x190
            handle_pte_fault+0x241/0x970
            handle_mm_fault+0x265/0x370
            __do_page_fault+0x172/0x5a0
            do_page_fault+0x1a/0x70
            page_fault+0x28/0x30
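
      A toy user-space model of the bug class (a global flag stands in for
      the CPU interrupt-enable state; there is no real locking here):

	#include <stdbool.h>
	#include <stdio.h>

	static bool irqs_enabled = true;

	static bool lock_irqsave(void)
	{
		bool old = irqs_enabled;

		irqs_enabled = false;
		return old;
	}

	static void unlock_irqrestore(bool old) { irqs_enabled = old; }
	static void lock_irq(void)   { irqs_enabled = false; }
	static void unlock_irq(void) { irqs_enabled = true; }	/* blindly re-enables */

	int main(void)
	{
		irqs_enabled = false;		/* caller already disabled irqs */
		lock_irq();
		unlock_irq();			/* helper breaks the caller's state */
		printf("after _irq helper:     irqs_enabled=%d\n", irqs_enabled);

		irqs_enabled = false;
		bool flags = lock_irqsave();
		unlock_irqrestore(flags);	/* caller's state preserved */
		printf("after _irqsave helper: irqs_enabled=%d\n", irqs_enabled);
		return 0;
	}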
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a85d9df1
  21. 30 Jan 2014, 2 commits
    • mm/page-writeback.c: do not count anon pages as dirtyable memory · a1c3bfb2
      Committed by Johannes Weiner
      The VM is currently heavily tuned to avoid swapping.  Whether that is
      good or bad is a separate discussion, but as long as the VM won't swap
      to make room for dirty cache, we can not consider anonymous pages when
      calculating the amount of dirtyable memory, the baseline to which
      dirty_background_ratio and dirty_ratio are applied.
      
      A simple workload that occupies a significant size (40+%, depending on
      memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
      uses the remainder for a streaming writer demonstrates this problem.  In
      that case, the actual cache pages are a small fraction of what is
      considered dirtyable overall, which results in a relatively large
      portion of the cache pages being dirtied.  As kswapd starts rotating
      these, random tasks enter direct reclaim and stall on IO.
      
      Only consider free pages and file pages dirtyable.
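
      The resulting arithmetic, as a small sketch with made-up page counts:

	#include <stdio.h>

	int main(void)
	{
		unsigned long nr_free = 100000;
		unsigned long nr_file = 300000;	/* file-backed LRU pages */
		unsigned long dirty_ratio = 20;	/* percent */

		/* anon pages are no longer counted as dirtyable */
		unsigned long dirtyable = nr_free + nr_file;
		unsigned long dirty_thresh = dirtyable * dirty_ratio / 100;

		printf("dirtyable=%lu dirty_thresh=%lu\n", dirtyable, dirty_thresh);
		return 0;
	}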
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Tested-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1c3bfb2
    • mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memory · a804552b
      Committed by Johannes Weiner
      Tejun reported stuttering and latency spikes on a system where random
      tasks would enter direct reclaim and get stuck on dirty pages.  Around
      50% of memory was occupied by tmpfs backed by an SSD, and another disk
      (rotating) was reading and writing at max speed to shrink a partition.
      
      : The problem was pretty ridiculous.  It's a 8gig machine w/ one ssd and 10k
      : rpm harddrive and I could reliably reproduce constant stuttering every
      : several seconds for as long as buffered IO was going on on the hard drive
      : either with tmpfs occupying somewhere above 4gig or a test program which
      : allocates about the same amount of anon memory.  Although swap usage was
      : zero, turning off swap also made the problem go away too.
      :
      : The trigger conditions seem quite plausible - high anon memory usage w/
      : heavy buffered IO and swap configured - and it's highly likely that this
      : is happening in the wild too.  (this can happen with copying large files
      : to usb sticks too, right?)
      
      This patch (of 2):
      
      The dirty_balance_reserve is an approximation of the fraction of free
      pages that the page allocator does not make available for page cache
      allocations.  As a result, it has to be taken into account when
      calculating the amount of "dirtyable memory", the baseline to which
      dirty_background_ratio and dirty_ratio are applied.
      
      However, currently the reserve is subtracted from the sum of free and
      reclaimable pages, which is non-sensical and leads to erroneous results
      when the system is dominated by unreclaimable pages and the
      dirty_balance_reserve is bigger than free+reclaimable.  In that case, at
      least the already allocated cache should be considered dirtyable.
      
      Fix the calculation by subtracting the reserve from the amount of free
      pages, then adding the reclaimable pages on top.
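
      A before/after sketch of the calculation with made-up numbers chosen
      so that the reserve exceeds free + reclaimable:

	#include <stdio.h>

	int main(void)
	{
		unsigned long nr_free = 1000;
		unsigned long nr_file = 2000;	/* reclaimable file pages */
		unsigned long reserve = 4000;	/* dirty_balance_reserve */

		/* old: subtract the reserve from the whole sum (goes negative) */
		long old_way = (long)(nr_free + nr_file) - (long)reserve;

		/* new: subtract from free pages only (clamped), then add file pages */
		unsigned long free_part = nr_free > reserve ? nr_free - reserve : 0;
		unsigned long new_way = free_part + nr_file;

		printf("old=%ld new=%lu\n", old_way, new_way);	/* old=-1000 new=2000 */
		return 0;
	}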
      
      [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Tested-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a804552b
  22. 17 Oct 2013, 1 commit
    • writeback: fix negative bdi max pause · e3b6c655
      Committed by Fengguang Wu
      Toralf runs trinity on UML/i386.  After some time it hangs and the last
      message line is
      
      	BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]
      
      It's found that pages_dirtied becomes very large.  More than 1000000000
      pages in this case:
      
      	period = HZ * pages_dirtied / task_ratelimit;
      	BUG_ON(pages_dirtied > 2000000000);
      	BUG_ON(pages_dirtied > 1000000000);      <---------
      
      UML debug printf shows that we got negative pause here:
      
      	ick: pause : -984
      	ick: pages_dirtied : 0
      	ick: task_ratelimit: 0
      
      	 pause:
      	+       if (pause < 0)  {
      	+               extern int printf(char *, ...);
      	+               printf("ick : pause : %li\n", pause);
      	+               printf("ick: pages_dirtied : %lu\n", pages_dirtied);
      	+               printf("ick: task_ratelimit: %lu\n", task_ratelimit);
      	+               BUG_ON(1);
      	+       }
      	        trace_balance_dirty_pages(bdi,
      
      pause is bounded by [min_pause, max_pause], where min_pause is itself
      bounded by max_pause.  It's suspected and demonstrated that the
      max_pause calculation goes wrong:
      
      	ick: pause : -717
      	ick: min_pause : -177
      	ick: max_pause : -717
      	ick: pages_dirtied : 14
      	ick: task_ratelimit: 0
      
      The problem lies in the two "long = unsigned long" assignments in
      bdi_max_pause(), which might go negative if the highest bit is 1, and
      the min_t(long, ...) check fails to protect against falling under 0.
      Fix all of them by using "unsigned long" throughout the function.
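
      A toy reproduction of the sign problem (made-up value; the unsigned to
      signed conversion shown is what happens in practice on common
      platforms):

	#include <stdio.h>

	int main(void)
	{
		unsigned long big = (unsigned long)-717;	/* highest bit set */

		long as_long = (long)big;			/* wraps to -717 */
		long lclamp = as_long < 200 ? as_long : 200;	/* "min" keeps -717 */

		unsigned long uclamp = big < 200 ? big : 200;	/* stays 200 */

		printf("long path: %ld, unsigned path: %lu\n", lclamp, uclamp);
		return 0;
	}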
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Reported-by: Toralf Förster <toralf.foerster@gmx.de>
      Tested-by: Toralf Förster <toralf.foerster@gmx.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3b6c655
  23. 13 Sep 2013, 1 commit
    • memcg: add per cgroup writeback pages accounting · 3ea67d06
      Committed by Sha Zhengju
      Add memcg routines to count writeback pages; later, dirty pages will
      also be accounted.
      
      After Kame's commit 89c06bd5 ("memcg: use new logic for page stat
      accounting"), we can use 'struct page' flag to test page state instead
      of per page_cgroup flag.  But memcg has a feature to move a page from a
      cgroup to another one and may have race between "move" and "page stat
      accounting".  So in order to avoid the race we have designed a new lock:
      
               mem_cgroup_begin_update_page_stat()
               modify page information        -->(a)
               mem_cgroup_update_page_stat()  -->(b)
               mem_cgroup_end_update_page_stat()
      
      It requires both (a) and (b) (writeback page accounting) to be
      protected by mem_cgroup_{begin/end}_update_page_stat().  It's a full
      no-op for !CONFIG_MEMCG, almost a no-op if memcg is disabled (but
      compiled in), an rcu read lock in most cases (no task is moving), and
      spin_lock_irqsave on top in the slow path.
      
      There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
      And the lock order is:
      	--> memcg->move_lock
      	  --> mapping->tree_lock
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ea67d06
  24. 12 Sep 2013, 1 commit
    • mm/page-writeback.c: add strictlimit feature · 5a537485
      Committed by Maxim Patlasov
      The feature prevents untrusted filesystems (i.e. FUSE mounts created by
      unprivileged users) from growing a large number of dirty pages before
      throttling.  For such filesystems balance_dirty_pages always checks bdi
      counters against bdi limits, i.e. even if the global "nr_dirty" is
      under "freerun", it's not allowed to skip the bdi checks.  The only use
      case for now is fuse: it sets bdi max_ratio to 1% by default and system
      administrators are supposed to expect that this limit won't be exceeded.
      
      The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
      filesystem may set the flag when it initializes its BDI.
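
      A flag-check sketch with a mock bdi struct and capability bit (the
      real flag lives in the kernel's backing-dev definitions):

	#include <stdbool.h>

	#define BDI_CAP_STRICTLIMIT_SKETCH (1u << 0)

	struct bdi_caps_sketch { unsigned int capabilities; };

	/* a filesystem opts in when it initializes its BDI */
	static void fs_bdi_init_sketch(struct bdi_caps_sketch *bdi)
	{
		bdi->capabilities |= BDI_CAP_STRICTLIMIT_SKETCH;
	}

	/* balance_dirty_pages then always applies the per-bdi limits */
	static bool strictlimit_sketch(const struct bdi_caps_sketch *bdi)
	{
		return bdi->capabilities & BDI_CAP_STRICTLIMIT_SKETCH;
	}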
      
      The problematic scenario comes from the fact that nobody pays attention to
      the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
      writeback).  The implementation of fuse writeback releases original page
      (by calling end_page_writeback) almost immediately.  A fuse request queued
      for real processing bears a copy of original page.  Hence, if userspace
      fuse daemon doesn't finalize write requests in timely manner, an
      aggressive mmap writer can pollute virtually all memory by those temporary
      fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
      nobody cares.
      
      To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
      problem" as a shortcut for "a possibility of uncontrolled grow of amount
      of RAM consumed by temporary pages allocated by kernel fuse to process
      writeback".
      
      The problem was very easy to reproduce.  There is a trivial example
      filesystem implementation in fuse userspace distribution: fusexmp_fh.c.  I
      added "sleep(1);" to the write methods, then recompiled and mounted it.
      Then created a huge file on the mount point and run a simple program which
      mmap-ed the file to a memory region, then wrote a data to the region.  An
      hour later I observed almost all RAM consumed by fuse writeback.  Since
      then some unrelated changes in kernel fuse made it more difficult to
      reproduce, but it is still possible now.
      
      Putting this theoretical happens-in-the-lab thing aside, there is another
      thing that really hurts real world (FUSE) users.  This is write-through
      page cache policy FUSE currently uses.  I.e.  handling write(2), kernel
      fuse populates page cache and flushes user data to the server
      synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
      ("writeback cache policy") solve the problem, but they also make resolving
      NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
      a huge file to a fuse mount would result in memory starvation.  Miklos,
      the maintainer of FUSE, believes the strictlimit feature is the way to go.
      
      And eventually putting FUSE topics aside, there is one more use-case for
      strictlimit feature.  Using a slow USB stick (mass storage) in a machine
      with huge amount of RAM installed is a well-known pain.  Let's make simple
      computations.  Assuming 64GB of RAM installed, existing implementation of
      balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
      dirty (freerun == 15% of total RAM).  So, the command "cp 9GB_file
      /media/my-usb-storage/" may return in a few seconds, but subsequent
      "umount /media/my-usb-storage/" will take more than two hours if effective
      throughput of the storage is, to say, 1MB/sec.
      
      After inclusion of the strictlimit feature, it will be trivial to add a
      knob (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
      demand, manually or via a udev rule.  Maybe I'm wrong, but it seems to
      be quite a natural desire to limit the amount of dirty memory for
      devices we do not fully trust (in the sense of sustainable throughput).
      
      [akpm@linux-foundation.org: fix warning in page-writeback.c]
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a537485