1. 02 Jun, 2015 40 commits
    • T
      writeback: add dirty_throttle_control->pos_ratio · daddfa3c
      Tejun Heo committed
      wb_position_ratio() is used to calculate pos_ratio, which is used for
      two purposes.  wb_update_dirty_ratelimit() uses it to adjust
      wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to
      immediately adjust dirty_ratelimit right before applying it to
      determine pause duration.
      
      While wb_update_dirty_ratelimit() is separately rate limited from
      balance_dirty_pages(), on the run where the ratelimit is updated, we
      end up calculating pos_ratio twice with the same parameters.
      
      This patch adds dirty_throttle_control->pos_ratio.
      balance_dirty_pages() calculates it once per run and
      wb_update_dirty_ratelimit() uses the value stored in
      dirty_throttle_control.
      
      This removes the duplicate calculation and also will help implementing
      memcg wb_domain.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      daddfa3c
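The caching idea in this commit can be illustrated with a small userspace sketch. This is not the kernel code: `struct pr_dtc`, `pr_calc_pos_ratio()`, `pr_update_ratelimit()` and `pr_throttle_run()` are hypothetical names, and the formula is a placeholder rather than the real control curve. It only shows one computation per run being stored in the control structure and read by both consumers.

```c
/* Hypothetical, pared-down stand-in for dirty_throttle_control. */
struct pr_dtc {
	unsigned long dirty;     /* number of dirty pages */
	unsigned long thresh;    /* dirty threshold */
	unsigned long pos_ratio; /* cached result, fixed point (1024 == 1.0) */
};

static int pr_calc_calls; /* counts how often the costly step runs */

/* Placeholder formula, NOT the kernel's control curve. */
static void pr_calc_pos_ratio(struct pr_dtc *dtc)
{
	pr_calc_calls++;
	if (dtc->dirty >= dtc->thresh)
		dtc->pos_ratio = 0;
	else
		dtc->pos_ratio = 1024UL * (dtc->thresh - dtc->dirty) / dtc->thresh;
}

/* First consumer: the gradual ratelimit update reads the cached value. */
static unsigned long pr_update_ratelimit(const struct pr_dtc *dtc,
					 unsigned long ratelimit)
{
	return ratelimit * dtc->pos_ratio / 1024;
}

/*
 * One throttling run: compute pos_ratio once, optionally update the
 * stored ratelimit, then apply the same cached value again to get the
 * ratelimit used for this run.
 */
static unsigned long pr_throttle_run(struct pr_dtc *dtc,
				     unsigned long *ratelimit, int update)
{
	pr_calc_pos_ratio(dtc);
	if (update)
		*ratelimit = pr_update_ratelimit(dtc, *ratelimit);
	return *ratelimit * dtc->pos_ratio / 1024;
}
```

In this sketch, a run that also updates the ratelimit still increments `pr_calc_calls` only once, which is the duplicate calculation the commit message says it removes.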
    • T
      writeback: make __wb_calc_thresh() take dirty_throttle_control · b1cbc6d4
      Tejun Heo committed
      wb_calc_thresh() calculates wb_thresh by scaling thresh according to
      the wb's portion in the system-wide write bandwidth.  cgroup writeback
      support would need to calculate wb_thresh against memcg domain too.
      This patch renames wb_calc_thresh() to __wb_calc_thresh() and makes it
      take dirty_throttle_control so that the function can later be updated
      to calculate against different domains according to
      dirty_throttle_control.
      
      wb_calc_thresh() is now a thin wrapper around __wb_calc_thresh().
      
      v2: The original version was incorrectly scaling dtc->dirty instead of
          dtc->thresh.  This was due to the extremely confusing function and
          variable names.  Added a rename patch and fixed this one.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b1cbc6d4
    • T
      writeback: add dirty_throttle_control->wb_bg_thresh · 970fb01a
      Tejun Heo committed
      wb_bg_thresh is currently treated as a second-class citizen.  It's
      only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages()
      doesn't calculate it unless the cap is set.  When the cap is set, the
      calculated value is not passed around but instead recalculated
      whenever it's used.
      
      wb_position_ratio() calculates it by scaling wb_thresh proportional to
      bg_thresh / thresh.  wb_update_dirty_ratelimit() uses wb_dirty_limit()
      on bg_thresh, which should generally lead to a similar result as the
      proportional scaling but can also be way off in the presence of
      max/min_ratio settings.
      
      Avoiding wb_bg_thresh calculation saves us one u64 multiplication and
      division when BDI_CAP_STRICTLIMIT is not set.  Given that
      balance_dirty_pages() is already ratelimited, this doesn't justify the
      incurred extra complexity.
      
      This patch adds wb_bg_thresh to dirty_throttle_control and makes
      wb_dirty_limits() always calculate it and updates the users to use the
      pre-calculated value.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      970fb01a
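The proportional scaling described above (wb_thresh scaled by bg_thresh / thresh) amounts to one 64-bit multiplication and one division. A minimal sketch follows, assuming the zero-threshold guard and the hypothetical function name `scale_wb_bg_thresh()`; it illustrates the arithmetic and is not the kernel implementation.

```c
#include <stdint.h>

/*
 * Scale a per-wb threshold by the global background/foreground ratio:
 *   wb_bg_thresh = wb_thresh * bg_thresh / thresh
 * The product is widened to 64 bits so it cannot overflow where
 * unsigned long is 32 bits wide.
 */
static unsigned long scale_wb_bg_thresh(unsigned long wb_thresh,
					unsigned long bg_thresh,
					unsigned long thresh)
{
	if (thresh == 0) /* assumed guard against division by zero */
		return 0;
	return (unsigned long)(((uint64_t)wb_thresh * bg_thresh) / thresh);
}
```

For example, with wb_thresh = 400, bg_thresh = 1000 and thresh = 2000, the background threshold for that wb comes out as 200 pages.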
    • T
      writeback: consolidate dirty throttle parameters into dirty_throttle_control · 2bc00aef
      Tejun Heo committed
      Dirty throttling implemented in balance_dirty_pages() and its
      subroutines makes use of a number of parameters which are passed
      around individually.  This renders these functions somewhat unwieldy
      and makes it difficult to add or change the involved parameters.  Also
      some functions use different or conflicting naming schemes for the
      same parameters making the code confusing to follow.
      
      This patch consolidates the main parameters into struct
      dirty_throttle_control so that they can be passed around easily and
      adding new parameters isn't painful.  This also unifies how a given
      parameter is named and accessed.  The drawback of using this type of
      control structure rather than explicit parameters is that it isn't
      immediately obvious which function accesses and modifies what;
      however, it's fairly clear that the benefits outweigh in this case.
      
      GDTC_INIT() macro is provided to ease initializing
      dirty_throttle_control for the global_wb_domain and
      balance_dirty_pages() uses a separate pointer to point to its global
      dirty_throttle_control.  This is to make it uniform with memcg domain
      handling which will be added later.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2bc00aef
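The consolidation pattern can be sketched as follows. The names `struct param_dtc`, `PARAM_DTC_INIT()`, `param_over_bg()` and `param_over_limit()` are hypothetical and the field set is reduced for illustration; the real dirty_throttle_control carries more state and the real checks are more involved.

```c
/* Hypothetical reduced control structure holding the shared parameters. */
struct param_dtc {
	unsigned long dirty;     /* dirty pages */
	unsigned long thresh;    /* dirty threshold */
	unsigned long bg_thresh; /* background dirty threshold */
};

/* Initializer macro in the spirit of GDTC_INIT(); an illustrative assumption. */
#define PARAM_DTC_INIT(d, t, b) \
	{ .dirty = (d), .thresh = (t), .bg_thresh = (b) }

/* Helpers take one pointer instead of several individually named values. */
static int param_over_bg(const struct param_dtc *dtc)
{
	return dtc->dirty > dtc->bg_thresh;
}

static int param_over_limit(const struct param_dtc *dtc)
{
	return dtc->dirty > dtc->thresh;
}
```

Adding a parameter then means adding one field, rather than changing the signature of every function in the call chain.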
    • T
      writeback: move global_dirty_limit into wb_domain · dcc25ae7
      Tejun Heo committed
      This patch is a part of the series to define wb_domain which
      represents a domain that wb's (bdi_writeback's) belong to and are
      measured against each other in.  This will enable IO backpressure
      propagation for cgroup writeback.
      
      global_dirty_limit exists to regulate the global dirty threshold which
      is a property of the wb_domain.  This patch moves hard_dirty_limit,
      dirty_lock, and update_time into wb_domain.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dcc25ae7
    • T
      writeback: implement wb_domain · 380c27ca
      Tejun Heo committed
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportionately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      Currently, the states that constitute the global writeback domain are
      scattered across a number of globals.  This patch starts collecting them
      into struct wb_domain.
      
      * fprop_global which serves as the basis for proportional bandwidth
        measurement and its period timer are moved into struct wb_domain.
      
      * global_wb_domain hosts the states for the global domain.
      
      * While at it, flatten wb_writeout_fraction() into its callers.  This
        thin wrapper doesn't provide any actual benefits while getting in
        the way.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      380c27ca
    • T
      writeback: reorganize [__]wb_update_bandwidth() · 8a731799
      Tejun Heo committed
      __wb_update_bandwidth() is called from two places -
      mm/page-writeback.c::balance_dirty_pages() and
      fs/fs-writeback.c::wb_writeback().  The latter updates only the
      write bandwidth while the former also deals with the dirty ratelimit.
      The two callsites are distinguished by whether @thresh parameter is
      zero or not, which is cryptic.  In addition, the two files define
      their own different versions of wb_update_bandwidth() on top of
      __wb_update_bandwidth(), which is confusing to say the least.  This
      patch cleans up [__]wb_update_bandwidth() in the following ways.
      
      * __wb_update_bandwidth() now takes explicit @update_ratelimit
        parameter to gate dirty ratelimit handling.
      
      * mm/page-writeback.c::wb_update_bandwidth() is flattened into its
        caller - balance_dirty_pages().
      
      * fs/fs-writeback.c::wb_update_bandwidth() is moved to
        mm/page-writeback.c and __wb_update_bandwidth() is made static.
      
      * While at it, add a lockdep assertion to __wb_update_bandwidth().
      
      Except for the lockdep addition, this is pure reorganization and
      doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8a731799
    • T
      writeback: clean up wb_dirty_limit() · 0d960a38
      Tejun Heo committed
      The function name wb_dirty_limit(), its argument @dirty and the local
      variable @wb_dirty are mortally confusing given that the function
      calculates per-wb threshold value not dirty pages, especially given
      that @dirty and @wb_dirty are used elsewhere for dirty pages.
      
      Let's rename the function to wb_calc_thresh() and wb_dirty to
      wb_thresh.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0d960a38
    • T
      memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online · 733a572e
      Tejun Heo committed
      cpu_possible_mask represents the CPUs which are actually possible
      during that boot instance.  For systems which don't support CPU
      hotplug, this will match cpu_online_mask exactly in most cases.  Even
      for systems which support CPU hotplug, the number of possible CPU
      slots is highly unlikely to diverge greatly from the number of online
      CPUs.  The only cases where the difference between possible and online
      caused problems were when the boot code failed to initialize the
      possible mask and left it fully set at NR_CPUS - 1.
      
      As such, most per-cpu constructs allocate for all possible CPUs and
      often iterate over the possibles, which also has the benefit of
      avoiding the blocking CPU hotplug synchronization.
      
      memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
      mem_cgroup_read_events(), which iterates over online CPUs and handles
      CPU hotplug operations explicitly.  This complexity doesn't actually
      buy anything.  Switch to iterating over the possibles and drop the
      explicit CPU hotplug handling.
      
      Eventually, we want to convert memcg to use percpu_counter instead of
      its own custom implementation which also benefits from quick access
      w/o summing for cases where larger error margin is acceptable.
      
      This will allow mem_cgroup_read_stat() to be called from non-sleepable
      contexts which will be used by cgroup writeback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      733a572e
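The difference between summing over online and possible CPUs can be modelled in userspace. In this sketch the CPU masks are plain bitmasks and `struct pcpu_stat_sketch` and `pcpu_sum_over_mask()` are hypothetical names; the kernel uses per-cpu allocations and cpumask iterators instead.

```c
#define PCPU_SLOTS 8 /* illustrative upper bound on CPU slots */

/* One counter per CPU slot, standing in for a per-cpu variable. */
struct pcpu_stat_sketch {
	long count[PCPU_SLOTS];
};

/* Sum the counters of every CPU whose bit is set in @mask. */
static long pcpu_sum_over_mask(const struct pcpu_stat_sketch *s,
			       unsigned int mask)
{
	long sum = 0;

	for (int cpu = 0; cpu < PCPU_SLOTS; cpu++)
		if (mask & (1u << cpu))
			sum += s->count[cpu];
	return sum;
}
```

If CPU 2 has gone offline while still holding a count, a sum over the online mask misses it unless a hotplug callback has folded that count elsewhere. A sum over the possible mask includes it with no extra handling, which is the simplification the commit message describes.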
    • T
      ext2: enable cgroup writeback support · 108dad65
      Tejun Heo committed
      Writeback now supports cgroup writeback and the generic writeback,
      buffer, libfs, and mpage helpers that ext2 uses are all updated to
      work with cgroup writeback.
      
      This patch enables cgroup writeback for ext2 by adding
      FS_CGROUP_WRITEBACK to its ->fs_flags.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-ext4@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      108dad65
    • T
      mpage: make __mpage_writepage() honor cgroup writeback · 429b3fb0
      Tejun Heo committed
      __mpage_writepage() is used to implement mpage_writepages() which in
      turn is used for ->writepages() of various filesystems.  All writeback
      logic is now updated to handle cgroup writeback and the block cgroup
      to issue IOs for is encoded in writeback_control and can be retrieved
      from the inode; however, __mpage_writepage() currently ignores the
      blkcg indicated by the inode and issues all bio's without explicit
      blkcg association.
      
      This patch updates __mpage_writepage() so that the issued bio's are
      associated with inode_to_wb_blkcg_css(inode).
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      429b3fb0
    • T
      buffer, writeback: make __block_write_full_page() honor cgroup writeback · bafc0dba
      Tejun Heo committed
      [__]block_write_full_page() is used to implement ->writepage in
      various filesystems.  All writeback logic is now updated to handle
      cgroup writeback and the block cgroup to issue IOs for is encoded in
      writeback_control and can be retrieved from the inode; however,
      [__]block_write_full_page() currently ignores the blkcg indicated by
      inode and issues all bio's without explicit blkcg association.
      
      This patch adds submit_bh_blkcg() which associates the bio with the
      specified blkio cgroup before issuing and uses it in
      __block_write_full_page() so that the issued bio's are associated with
      inode_to_wb_blkcg_css(inode).
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      bafc0dba
    • T
      writeback: dirty inodes against their matching cgroup bdi_writeback's · 0747259d
      Tejun Heo committed
      __mark_inode_dirty() always dirtied the inode against the root wb
      (bdi_writeback).  The previous patches added all the infrastructure
      necessary to attribute an inode against the wb of the dirtying cgroup.
      
      This patch updates __mark_inode_dirty() so that it uses the wb
      associated with the inode instead of unconditionally using the root
      one.
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being dirtied against the root wb.
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0747259d
    • T
      writeback: make writeback initiation functions handle multiple bdi_writeback's · db125360
      Tejun Heo committed
      [try_]writeback_inodes_sb[_nr]() and sync_inodes_sb() currently only
      handle dirty inodes on the root wb (bdi_writeback) of the target bdi.
      This patch implements bdi_split_work_to_wbs() and uses it to make these
      functions handle multiple wb's.
      
      bdi_split_work_to_wbs() takes a base wb_writeback_work, creates
      clones of it and issues them to the wb's of the target bdi.  The base
      work's nr_pages is distributed using wb_split_bdi_pages() -
      i.e. according to each wb's write bandwidth's proportion in the bdi.
      
      Cloning a work item involves memory allocation which may fail.  In such
      cases, bdi_split_work_to_wbs() issues the base work directly and waits
      for its completion before proceeding to the next wb to guarantee
      forward progress and correctness under memory pressure.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      db125360
    • T
      writeback: restructure try_writeback_inodes_sb[_nr]() · f30a7d0c
      Tejun Heo committed
      try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it
      handles s_umount locking and skips if writeback is already in
      progress.  The in progress test is performed on the root wb
      (bdi_writeback) which isn't sufficient for cgroup writeback support.
      The test must be done per-wb.
      
      To prepare for the change, this patch factors out
      __writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds
      @skip_if_busy and moves the in progress test right before queueing the
      wb_writeback_work.  try_writeback_inodes_sb_nr() now just grabs
      s_umount and invokes __writeback_inodes_sb_nr() with asserted
      @skip_if_busy.  This way, later addition of multiple wb handling can
      skip only the wb's which already have writeback in progress.
      
      This swaps the order between in progress test and s_umount test which
      can flip the return value when writeback is in progress and s_umount
      is being held by someone else but this shouldn't cause any meaningful
      difference.  It's a fringe condition and the return value is an
      unsynchronized hint anyway.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f30a7d0c
    • T
      writeback: implement wb_wait_for_single_work() · 98754bf7
      Tejun Heo committed
      For cgroup writeback, multiple wb_writeback_work items may need to be
      issued to accomplish a single task.  The previous patch updated the
      waiting mechanism such that wb_wait_for_completion() can wait for
      multiple work items.
      
      Issuing multiple work items involves memory allocation which may fail.
      As most writeback operations can't fail or block on memory
      allocation, in such cases, we'll fall back to sequential issuing of an
      on-stack work item, which would need to be waited upon sequentially.
      
      This patch implements wb_wait_for_single_work() which waits for a
      single work item independently from wb_completion waiting so that such
      fallback mechanism can be used without getting tangled with the usual
      issuing / completion operation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      98754bf7
    • T
      writeback: implement bdi_wait_for_completion() · cc395d7f
      Tejun Heo committed
      The completion of a wb_writeback_work can be waited upon by setting
      its ->done to a struct completion and waiting on it; however, for
      cgroup writeback support, it's necessary to issue multiple work items
      to multiple bdi_writebacks and wait for the completion of all.
      
      This patch implements wb_completion which can wait for multiple work
      items and replaces the struct completion with it.  It can be defined
      using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
      waited for by wb_wait_for_completion().
      
      Nobody currently issues multiple work items and this patch doesn't
      introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      cc395d7f
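One common way to wait for several work items with a single object is a reference-style counter. The sketch below uses C11 atomics and assumes that scheme; `struct wbc_sketch` and the `wbc_*()` helpers are hypothetical names, and the blocking wait is replaced by a polling predicate so the example stays single-threaded.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Counts outstanding work items plus one reference held by the issuer. */
struct wbc_sketch {
	atomic_int cnt;
};

/* Start at 1 so the count cannot reach zero while work is still being issued. */
static void wbc_init(struct wbc_sketch *c)
{
	atomic_init(&c->cnt, 1);
}

/* Called once for each work item attached to the completion. */
static void wbc_work_issued(struct wbc_sketch *c)
{
	atomic_fetch_add(&c->cnt, 1);
}

/* Called when a work item finishes. */
static void wbc_work_finished(struct wbc_sketch *c)
{
	atomic_fetch_sub(&c->cnt, 1);
}

/* The issuer drops its own reference once it has issued everything. */
static void wbc_issuer_done(struct wbc_sketch *c)
{
	atomic_fetch_sub(&c->cnt, 1);
}

/* A real waiter would sleep until this becomes true. */
static bool wbc_all_complete(struct wbc_sketch *c)
{
	return atomic_load(&c->cnt) == 0;
}
```

The initial reference is the design point: without it, the first work item could finish and drive the count to zero before the second one has been issued.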
    • T
      writeback: add wb_writeback_work->auto_free · ac7b19a3
      Tejun Heo committed
      Currently, a wb_writeback_work is freed automatically on completion if
      it doesn't have ->done set.  Add wb_writeback_work->auto_free to make
      the switch explicit.  This will help cgroup writeback support where
      waiting for completion and whether to free automatically don't
      necessarily move together.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ac7b19a3
    • T
      writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's · 001fe6f6
      Tejun Heo committed
      wakeup_dirtytime_writeback() currently only starts writeback on the
      root wb (bdi_writeback).  For cgroup writeback support, update the
      function to check all wbs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      001fe6f6
    • T
      writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's · f2b65121
      Tejun Heo committed
      wakeup_flusher_threads() currently only starts writeback on the root
      wb (bdi_writeback).  For cgroup writeback support, update the function
      to wake up all wbs and distribute the number of pages to write
      according to the proportion of each wb's write bandwidth, which is
      implemented in wb_split_bdi_pages().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f2b65121
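Splitting a page count by bandwidth share is a proportional division. The sketch below is a userspace illustration with the hypothetical name `split_pages_sketch()`; rounding up and the special cases (a LONG_MAX "write everything" request, an unknown total) are assumptions of this sketch and are not stated in the commit message.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Return this wb's share of @nr_pages, proportional to
 * this_bw / tot_bw and rounded up so a small share is never zero.
 */
static long split_pages_sketch(long nr_pages, unsigned long this_bw,
			       unsigned long tot_bw)
{
	if (nr_pages == LONG_MAX) /* "write everything" is not split */
		return LONG_MAX;
	if (tot_bw == 0 || this_bw >= tot_bw) /* no meaningful ratio */
		return nr_pages;
	return (long)(((uint64_t)nr_pages * this_bw + tot_bw - 1) / tot_bw);
}
```

With 1000 pages to write, a wb contributing 30 of 100 bandwidth units would be asked for 300 pages, and one contributing 1 of 3 would be asked for 334.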
    • T
      writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info · 9ecf4866
      Tejun Heo committed
      bdi_start_background_writeback() currently takes @bdi and kicks the
      root wb (bdi_writeback).  In preparation for cgroup writeback support,
      make it take wb instead.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9ecf4866
    • T
      writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info · bc05873d
      Tejun Heo committed
      writeback_in_progress() currently takes @bdi and returns whether
      writeback is in progress on its root wb (bdi_writeback).  In
      preparation for cgroup writeback support, make it take wb instead.
      While at it, make it an inline function.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      bc05873d
    • T
      writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's · a06fd6b1
      Tejun Heo committed
      For cgroup writeback support, all bdi-wide operations should be
      distributed to all its wb's (bdi_writeback's).
      
      This patch updates laptop_mode_timer_fn() so that it invokes
      wb_start_writeback() on all wb's rather than just the root one.  As
      the intent is writing out all dirty data, there's no reason to split
      the number of pages to write.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a06fd6b1
    • T
      writeback: remove bdi_start_writeback() · c00ddad3
      Tejun Heo committed
      bdi_start_writeback() is a thin wrapper on top of
      __wb_start_writeback() which is used only by laptop_mode_timer_fn().
      This patch removes bdi_start_writeback(), renames
      __wb_start_writeback() to wb_start_writeback() and makes
      laptop_mode_timer_fn() use it instead.
      
      This doesn't cause any functional difference and will ease making
      laptop_mode_timer_fn() cgroup writeback aware.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c00ddad3
    • T
      writeback: implement bdi_for_each_wb() · ebe41ab0
      Tejun Heo committed
      This will be used to implement bdi-wide operations which should be
      distributed across all its cgroup bdi_writebacks.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ebe41ab0
    • T
      writeback: make bdi->min/max_ratio handling cgroup writeback aware · 693108a8
      Tejun Heo committed
      bdi->min/max_ratio are user-configurable per-bdi knobs which regulate
      dirty limit of each bdi.  For cgroup writeback, they need to be
      further distributed across wb's (bdi_writeback's) belonging to the
      configured bdi.
      
      This patch introduces wb_min_max_ratio() which distributes
      bdi->min/max_ratio according to a wb's proportion in the total active
      bandwidth of its bdi.
      
      v2: Update wb_min_max_ratio() to fix a bug where both min and max were
          assigned the min value and avoid calculations when possible.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      693108a8
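Distributing a bdi's min/max ratio by bandwidth share can be sketched like this. `min_max_ratio_sketch()` is a hypothetical userspace function; skipping the calculation when min is 0 or max is 100 reflects the "avoid calculations when possible" remark in the v2 note, and the exact conditions used here are an assumption.

```c
#include <stdint.h>

/*
 * Scale the bdi-wide @min_ratio and @max_ratio (percentages) down to
 * one wb's share, this_bw / tot_bw.  Results go to @minp and @maxp.
 */
static void min_max_ratio_sketch(unsigned long this_bw, unsigned long tot_bw,
				 unsigned long min_ratio,
				 unsigned long max_ratio,
				 unsigned long *minp, unsigned long *maxp)
{
	uint64_t min = min_ratio;
	uint64_t max = max_ratio;

	/* Scale only when this wb is a strict fraction of the total. */
	if (this_bw < tot_bw) {
		if (min)       /* 0 scales to 0, nothing to compute */
			min = min * this_bw / tot_bw;
		if (max < 100) /* 100% means "no limit configured" */
			max = max * this_bw / tot_bw;
	}
	*minp = (unsigned long)min;
	*maxp = (unsigned long)max;
}
```

Note that min and max are scaled and stored separately; the v2 note above mentions an earlier bug where both outputs received the min value.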
    • T
      writeback: don't issue wb_writeback_work if clean · e7972912
      Tejun Heo committed
      There are several places in fs/fs-writeback.c which queue
      wb_writeback_work without checking whether the target wb
      (bdi_writeback) has dirty inodes or not.  The only thing
      wb_writeback_work does is writing back the dirty inodes for the target
      wb and queueing a work item for a clean wb is essentially noop.  There
      are some side effects such as bandwidth stats being updated and
      triggering tracepoints but these don't affect the operation in any
      meaningful way.
      
      This patch makes all writeback_inodes_sb_nr() and sync_inodes_sb()
      skip wb_queue_work() if the target bdi is clean.  Also, it moves
      dirtiness check from wakeup_flusher_threads() to
      __wb_start_writeback() so that all its callers benefit from the check.
      
      While the overhead incurred by scheduling a noop work isn't currently
      significant, the overhead may be higher with cgroup writeback support
      as we may end up issuing noop work items to a lot of clean wb's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e7972912
    • T
      writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account · 95a46c65
      Tejun Heo committed
      bdi_has_dirty_io() used to only reflect whether the root wb
      (bdi_writeback) has dirty inodes.  For cgroup writeback support, it
      needs to take all active wb's into account.  If any wb on the bdi has
      dirty inodes, bdi_has_dirty_io() should return true.
      
      To achieve that, as inode_wb_list_{move|del}_locked() now keep track
      of the dirty state transition of each wb, the number of dirty wbs can
      be counted in the bdi; however, bdi is already aggregating
      wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
      there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
      dip below 1.  bdi_has_dirty_io() can simply test whether
      bdi->tot_write_bandwidth is zero or not.
      
      While this bumps the value of wb->avg_write_bandwidth to one when it
      used to be zero, this shouldn't cause any meaningful behavior
      difference.
      
      bdi_has_dirty_io() is made an inline function which tests whether
      ->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
      value are added to inode_wb_list_{move|del}_locked().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      95a46c65
    • T
      writeback: implement backing_dev_info->tot_write_bandwidth · 766a9d6e
      Tejun Heo committed
      cgroup writeback support needs to keep track of the sum of
      avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
      distribute write workload.  This patch adds bdi->tot_write_bandwidth
      and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
      and wb_update_write_bandwidth() to adjust it as wb's gain and lose
      dirty inodes and its avg_write_bandwidth gets updated.
      
      As the update events are not synchronized with each other,
      bdi->tot_write_bandwidth is an atomic_long_t.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      766a9d6e
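The bookkeeping described above can be modelled with an atomic running total. In this sketch `struct bdi_bw_sketch`, `struct wb_bw_sketch` and the `bw_*()` helpers are hypothetical names; it only shows the three events that adjust the total (a wb gaining dirty inodes, losing them, and having its average updated).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Device-wide sum of the bandwidth of wbs that currently have dirty inodes. */
struct bdi_bw_sketch {
	atomic_long tot_write_bandwidth;
};

struct wb_bw_sketch {
	struct bdi_bw_sketch *bdi;
	long avg_write_bandwidth;
	bool has_dirty_io;
};

/* The wb got its first dirty inode: start counting its bandwidth. */
static void bw_wb_became_dirty(struct wb_bw_sketch *wb)
{
	wb->has_dirty_io = true;
	atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
			 wb->avg_write_bandwidth);
}

/* The wb lost its last dirty inode: stop counting its bandwidth. */
static void bw_wb_became_clean(struct wb_bw_sketch *wb)
{
	wb->has_dirty_io = false;
	atomic_fetch_sub(&wb->bdi->tot_write_bandwidth,
			 wb->avg_write_bandwidth);
}

/* The average changed: apply only the delta, and only while counted. */
static void bw_wb_set_avg(struct wb_bw_sketch *wb, long new_avg)
{
	if (wb->has_dirty_io)
		atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
				 new_avg - wb->avg_write_bandwidth);
	wb->avg_write_bandwidth = new_avg;
}
```

Because each event applies a single atomic add or subtract to the total, the updaters need no common lock for it, which matches the reason given for making the field an atomic_long_t.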
    • T
      writeback: implement WB_has_dirty_io wb_state flag · d6c10f1f
      Tejun Heo committed
      Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
      has any dirty inode by testing all three IO lists on each invocation
      without actively keeping track.  For cgroup writeback support, a
      single bdi will host multiple wb's each of which will host dirty
      inodes separately and we'll need to make bdi_has_dirty_io(), which
      currently only represents the root wb, aggregate has_dirty_io from all
      member wb's, which requires tracking transitions in has_dirty_io state
      on each wb.
      
      This patch introduces inode_wb_list_{move|del}_locked() to consolidate
      IO list operations leaving queue_io() the only other function which
      directly manipulates IO lists (via move_expired_inodes()).  All three
      functions are updated to call wb_io_lists_[de]populated() which keep
      track of whether the wb has dirty inodes or not and record it using
      the new WB_has_dirty_io flag.  inode_wb_list_move_locked()'s return
      value indicates whether the wb had no dirty inodes before.
      
      mark_inode_dirty() is restructured so that the return value of
      inode_wb_list_move_locked() can be used for deciding whether to wake
      up the wb.
      
      While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
      These functions were returning 0 and 1 before.  Also, add a comment
      explaining the synchronization of wb_state flags.
      
      v2: Updated to accommodate b_dirty_time.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d6c10f1f
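The state tracking described above can be sketched by reducing each IO list to a length counter. `struct wb_lists_sketch`, `lists_add_sketch()` and `lists_del_sketch()` are hypothetical names for this illustration; the kernel operates on real list heads under a lock.

```c
#include <stdbool.h>

enum io_list_sketch { IO_DIRTY, IO_IO, IO_MORE_IO, IO_NR };

/* Each IO list is modelled by its length, plus the tracked flag. */
struct wb_lists_sketch {
	unsigned int len[IO_NR];
	bool has_dirty_io;
};

/*
 * Add an inode to one list.  Returns true only on the transition from
 * "no dirty inodes" to "has dirty inodes", so the caller can tell
 * whether the wb needs to be woken up.
 */
static bool lists_add_sketch(struct wb_lists_sketch *wb,
			     enum io_list_sketch which)
{
	wb->len[which]++;
	if (wb->has_dirty_io)
		return false;
	wb->has_dirty_io = true;
	return true;
}

/* Remove an inode; clear the flag once every list is empty. */
static void lists_del_sketch(struct wb_lists_sketch *wb,
			     enum io_list_sketch which)
{
	if (wb->len[which] == 0)
		return;
	wb->len[which]--;
	for (int i = 0; i < IO_NR; i++)
		if (wb->len[i])
			return;
	wb->has_dirty_io = false;
}
```

A reader of the flag then needs a single test instead of checking every list on each call.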
    • T
      writeback: implement and use inode_congested() · 703c2708
      Tejun Heo committed
      In several places, bdi_congested() and its wrappers are used to
      determine whether more IOs should be issued.  With cgroup writeback
      support, this question can't be answered solely based on the bdi
      (backing_dev_info).  It's dependent on whether the filesystem and bdi
      support cgroup writeback and the blkcg the inode is associated with.
      
      This patch implements inode_congested() and its wrappers, which take
      @inode and determine the congestion state considering cgroup
      writeback.  The new functions replace bdi_*congested() calls in places
      where the query is about a specific inode and task.
      
      There are several filesystem users which also fit these criteria but
      they should be updated when each filesystem implements cgroup
      writeback support.
      
      v2: Now that a given inode is associated with only one wb, congestion
          state can be determined independent from the asking task.  Drop
          @task.  Spotted by Vivek.  Also, converted to take @inode instead
          of @mapping and renamed to inode_congested().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      703c2708
    • T
      writeback, blkcg: propagate non-root blkcg congestion state · 482cf79c
      Tejun Heo committed
      Now that the bdi layer can handle per-blkcg bdi_writeback_congested state,
      blk_{set|clear}_congested() can propagate non-root blkcg congestion
      state to them.
      
      This can be easily achieved by disabling the root_rl tests in
      blk_{set|clear}_congested().  Note that we still need those tests when
      !CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
      wb's congestion state for events happening on other blkcgs.
      
      v2: Updated for bdi_writeback_congested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      482cf79c
    • T
      writeback, blkcg: restructure blk_{set|clear}_queue_congested() · d40f75a0
      Tejun Heo committed
      blk_{set|clear}_queue_congested() take @q and set or clear,
      respectively, the congestion state of its bdi's root wb.  Because bdi
      used to be able to handle congestion state only on the root wb, the
      callers of those functions tested whether the congestion is on the
      root blkcg and skipped if not.
      
      This is cumbersome and makes implementation of per cgroup
      bdi_writeback congestion state propagation difficult.  This patch
      renames blk_{set|clear}_queue_congested() to
      blk_{set|clear}_congested(), and makes them take request_list instead
      of request_queue and test whether the specified request_list is the
      root one before updating bdi_writeback congestion state.  This makes
      the tests in the callers unnecessary and simplifies them.
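
      As a rough illustration of moving the root test from the callers into
      the helpers, the following user-space sketch may help.  All model_*
      types and fields are hypothetical stand-ins, not the block layer's
      real definitions; only the shape of the check follows the description
      above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the root wb's congestion state. */
struct model_congested {
	bool sync;
	bool async;
};

struct model_queue;

/* Hypothetical stand-in for request_list; each one knows its queue. */
struct model_request_list {
	struct model_queue *q;
};

/* Hypothetical stand-in for request_queue with its root request_list. */
struct model_queue {
	struct model_request_list root_rl;
	struct model_congested root_congested;
};

/* Only the root request_list updates the congestion state. */
static void model_blk_set_congested(struct model_request_list *rl, bool sync)
{
	assert(rl && rl->q);
	if (rl != &rl->q->root_rl)
		return;
	if (sync)
		rl->q->root_congested.sync = true;
	else
		rl->q->root_congested.async = true;
}

static void model_blk_clear_congested(struct model_request_list *rl, bool sync)
{
	assert(rl && rl->q);
	if (rl != &rl->q->root_rl)
		return;
	if (sync)
		rl->q->root_congested.sync = false;
	else
		rl->q->root_congested.async = false;
}
```

      With the check inside the helpers, a caller in this model can pass
      whichever request_list it is working on without testing for the root
      first.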
      
      As there are no external users of these functions, the definitions are
      moved from include/linux/blkdev.h to block/blk-core.c.
      
      This patch doesn't introduce any noticeable behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d40f75a0
    • T
      writeback: make congestion functions per bdi_writeback · ec8a6f26
      Tejun Heo committed
      Currently, all congestion functions take bdi (backing_dev_info) and
      always operate on the root wb (bdi->wb) and the congestion state from
      the block layer is propagated only for the root blkcg.  This patch
      introduces {set|clear}_wb_congested() and wb_congested() which take a
      bdi_writeback_congested and bdi_writeback respectively.  The bdi
      counterparts are now wrappers invoking the wb based functions on
      @bdi->wb.
      
      While converting clear_bdi_congested() to clear_wb_congested(), the
      local variable declaration order between @wqh and @bit is swapped for
      cosmetic reasons.
      
      This patch just adds the new wb based functions.  The following
      patches will apply them.
      
      v2: Updated for bdi_writeback_congested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ec8a6f26
    • T
      writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback · dfb8ae56
      Tejun Heo committed
      Currently, balance_dirty_pages() always works on bdi->wb.  This patch
      updates it to work on the wb (bdi_writeback) matching memcg and blkcg
      of the current task as that's what the inode is being dirtied against.
      
      balance_dirty_pages_ratelimited() now pins the current wb and passes
      it to balance_dirty_pages().
      
      As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
      visible behavior differences.
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dfb8ae56
    • T
      writeback: attribute stats to the matching per-cgroup bdi_writeback · 91018134
      Tejun Heo committed
      Until now, all WB_* stats were accounted against the root wb
      (bdi_writeback).  Now that multiple wb (bdi_writeback) support is in
      place, attribute the stats to the respective per-cgroup wb's.
      
      As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
      visible behavior differences.
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      91018134
    • T
      writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested · ce7acfea
      Tejun Heo committed
      A blkg (blkcg_gq) can be congested and decongested independently from
      other blkgs on the same request_queue.  Accordingly, for cgroup
      writeback support, the congestion status at bdi (backing_dev_info)
      should be split and updated separately from matching blkg's.
      
      This patch prepares by adding blkg->wb_congested and associating a
      blkg with its matching per-blkcg bdi_writeback_congested on creation.
      
      v2: Updated to associate bdi_writeback_congested instead of
          bdi_writeback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ce7acfea
    • T
      writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Tejun Heo committed
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      52ebea74
    • T
      writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK · 89e9b9e0
      Tejun Heo committed
      cgroup writeback requires support from both bdi and filesystem sides.
      Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
      support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
      default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
      both MEMCG and BLK_CGROUP are enabled.
      
      inode_cgwb_enabled(), which determines whether both the bdi and the fs
      of a given inode support cgroup writeback, is added.
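
      The two-sided check can be pictured with the user-space sketch below.
      The model_* types and the flag bit values are hypothetical; the sketch
      covers only the two conditions named in this message (bdi capability
      and filesystem flag) and makes no claim about any further conditions
      the real helper may test.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real flag definitions may differ. */
#define MODEL_BDI_CAP_CGROUP_WRITEBACK	(1u << 0)
#define MODEL_FS_CGROUP_WRITEBACK	(1u << 0)

struct model_bdi {
	unsigned int capabilities;
};

struct model_fs_type {
	unsigned int fs_flags;
};

struct model_inode {
	const struct model_bdi *bdi;
	const struct model_fs_type *fs;
};

/* True only when both the bdi and the filesystem opt in. */
static bool model_inode_cgwb_enabled(const struct model_inode *inode)
{
	assert(inode && inode->bdi && inode->fs);
	return (inode->bdi->capabilities & MODEL_BDI_CAP_CGROUP_WRITEBACK) &&
	       (inode->fs->fs_flags & MODEL_FS_CGROUP_WRITEBACK);
}
```
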
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      89e9b9e0
    • T
      bdi: separate out congested state into a separate struct · 4aa9c692
      Tejun Heo committed
      Currently, a wb's (bdi_writeback) congestion state is carried in its
      ->state field; however, cgroup writeback support will require multiple
      wb's sharing the same congestion state.  This patch separates out
      congestion state into its own struct - struct bdi_writeback_congested.
      A new wb field, wb_congested, points to its associated congested
      struct.  The default wb, bdi->wb, always points to bdi->wb_congested.
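
      The indirection can be sketched in user space as below.  The model_*
      names, the field name congested, and the bit values are hypothetical
      and chosen only for illustration; the point is that several wb's can
      point at one shared congested struct while the default wb points at
      the one embedded in its bdi.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for the two congestion kinds. */
#define MODEL_ASYNC_CONGESTED	(1ul << 0)
#define MODEL_SYNC_CONGESTED	(1ul << 1)

/* Congestion state split out into its own struct. */
struct model_wb_congested {
	unsigned long state;
};

/* A wb no longer carries the state; it points at a shared struct. */
struct model_wb {
	struct model_wb_congested *congested;
};

struct model_bdi {
	struct model_wb wb;			/* default wb */
	struct model_wb_congested wb_congested;	/* its congested struct */
};

/* The default wb always points to the bdi's own congested struct. */
static void model_bdi_init(struct model_bdi *bdi)
{
	bdi->wb_congested.state = 0;
	bdi->wb.congested = &bdi->wb_congested;
}

static void model_set_congested(struct model_wb_congested *c, unsigned long bits)
{
	c->state |= bits;
}

static void model_clear_congested(struct model_wb_congested *c, unsigned long bits)
{
	c->state &= ~bits;
}

/* Query goes through the pointer, so sharers see the same answer. */
static bool model_wb_is_congested(const struct model_wb *wb, unsigned long bits)
{
	assert(wb->congested);
	return (wb->congested->state & bits) != 0;
}
```
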
      
      While this patch adds a layer of indirection, it doesn't introduce any
      behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4aa9c692