1. 02 Jun, 2015 40 commits
    • writeback: implement foreign cgroup inode detection · 2a814908
      Tejun Heo authored
      As concurrent write sharing of an inode is expected to be very rare
      and memcg only tracks page ownership on first-use basis severely
      confining the usefulness of such sharing, cgroup writeback tracks
      ownership per-inode.  While the support for concurrent write sharing
      of an inode is deemed unnecessary, an inode being written to by
      different cgroups at different points in time is a lot more common,
      and, more importantly, charging only by first-use can too readily lead
      to grossly incorrect behaviors (single foreign page can lead to
      gigabytes of writeback to be incorrectly attributed).
      
      To resolve this issue, cgroup writeback detects the majority dirtier
      of an inode and will transfer the ownership to it.  To avoid
      unnecessary oscillation, the detection mechanism keeps track of
      history and gives out the switch verdict only if the foreign usage
      pattern is stable over a certain amount of time and/or writeback
      attempts.
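
One way to picture the detection described above is a per-round majority vote combined with a bit history of past rounds. The following is a simplified sketch, not the kernel code: the struct, the function names, the 16-bit history and the switch threshold are all assumptions made for this illustration, and the vote only yields a candidate, which is the true majority only when one exists.

```c
#include <stdbool.h>
#include <stdint.h>

#define SWITCH_THR 8  /* assumed: switch once more than 8 of 16 rounds were foreign */

struct detect_state {
	int      owner;     /* cgroup id that currently owns the inode */
	int      cand;      /* majority candidate within the current round */
	unsigned cand_cnt;  /* vote counter for the candidate */
	uint16_t history;   /* one bit per past round, 1 = foreign candidate won */
};

/* Account one unit of IO issued on behalf of cgroup @id (majority vote step). */
static void account_io(struct detect_state *st, int id)
{
	if (st->cand_cnt == 0) {
		st->cand = id;
		st->cand_cnt = 1;
	} else if (st->cand == id) {
		st->cand_cnt++;
	} else {
		st->cand_cnt--;
	}
}

/* Count the set bits in a 16-bit history word. */
static unsigned popcount16(uint16_t v)
{
	unsigned n = 0;

	for (; v; v &= (uint16_t)(v - 1))
		n++;
	return n;
}

/*
 * Close one writeback round.  Returns true and stores the winner in
 * @new_owner only when this round had a foreign candidate and the
 * history shows that the foreign pattern is stable.
 */
static bool end_round(struct detect_state *st, int *new_owner)
{
	bool foreign = st->cand_cnt > 0 && st->cand != st->owner;
	bool verdict;

	st->history = (uint16_t)((st->history << 1) | (foreign ? 1u : 0u));
	verdict = foreign && popcount16(st->history) > SWITCH_THR;
	if (verdict)
		*new_owner = st->cand;
	st->cand_cnt = 0;
	return verdict;
}
```

With these assumed parameters, a cgroup that dominates the IO of every round is reported as the new owner on the ninth consecutive round, while a single foreign round never triggers a switch.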
      
      The detection mechanism has fairly low space and computation overhead.
      It adds 8 bytes to struct inode (one int and two u16's) and minimal
      amount of calculation per IO.  The detection mechanism converges to
      the correct answer usually in several seconds of IO time when there's
      a clear majority dirtier.  Even when there isn't, it can reach an
      acceptable answer fairly quickly under most circumstances.
      
      Please see wbc_detach_inode() for more details.
      
      This patch only implements detection.  Following patches will
      implement actual switching.
      
      v2: wbc_account_io() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make writeback_control track the inode being written back · b16b1deb
      Tejun Heo authored
      Currently, for cgroup writeback, the IO submission paths directly
      associate the bio's with the blkcg from inode_to_wb_blkcg_css();
      however, it'd be necessary to keep more writeback context to implement
      foreign inode writeback detection.  wbc (writeback_control) is the
      natural fit for the extra context - it persists throughout the
      writeback of each inode and is passed all the way down to IO
      submission paths.
      
      This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
      wbc_attach_fdatawrite_inode() which are used to associate wbc with the
      inode being written back.  IO submission paths now use wbc_init_bio()
      instead of directly associating bio's with blkcg themselves.  This
      leaves inode_to_wb_blkcg_css() w/o any user.  The function is removed.
      
      wbc currently only tracks the associated wb (bdi_writeback).  Future
      patches will add more for foreign inode detection.  The association is
      established under i_lock which will be depended upon when migrating
      foreign inodes to other wb's.
      
      As currently, once established, inode to wb association never changes,
      going through wbc when initializing bio's doesn't cause any behavior
      changes.
      
      v2: submit_blk_blkcg() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() · 21c6321f
      Tejun Heo authored
      Currently, the majority of cgroup writeback support, including all the
      above functions, is implemented in include/linux/backing-dev.h and
      mm/backing-dev.c; however, the portion closely related to writeback
      logic implemented in include/linux/writeback.h and mm/page-writeback.c
      will expand to support foreign writeback detection and correction.
      
      This patch moves wb[_try]_get() and wb_put() to
      include/linux/backing-dev-defs.h so that they can be used from
      writeback.h and inode_{attach|detach}_wb() to writeback.h and
      page-writeback.c.
      
      This is pure reorganization and doesn't introduce any functional
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement memcg writeback domain based throttling · c2aa723a
      Tejun Heo authored
      While cgroup writeback support now connects memcg and blkcg so that
      writeback IOs are properly attributed and controlled, the IO back
      pressure propagation mechanism implemented in balance_dirty_pages()
      and its subroutines wasn't aware of cgroup writeback.
      
      Processes belonging to a memcg may have access to only a subset of total
      memory available in the system and not factoring this into dirty
      throttling rendered it completely ineffective for processes under
      memcg limits and memcg ended up building a separate ad-hoc degenerate
      mechanism directly into vmscan code to limit page dirtying.
      
      The previous patches updated balance_dirty_pages() and its subroutines
      so that they can deal with multiple wb_domain's (writeback domains)
      and defined per-memcg wb_domain.  Processes belonging to a non-root
      memcg are bound to two wb_domains, global wb_domain and memcg
      wb_domain, and should be throttled according to IO pressures from both
      domains.  This patch updates dirty throttling code so that it repeats
      similar calculations for the two domains - the differences between the
      two are few and minor - and applies the lower of the two sets of
      resulting constraints.
      
      wb_over_bg_thresh(), which controls when background writeback
      terminates, is also updated to consider both global and memcg
      wb_domains.  It returns true if dirty is over bg_thresh for either
      domain.
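
The two rules above (apply the lower of the two sets of constraints, and treat background writeback as needed when either domain is over its bg_thresh) can be sketched as follows. This is a minimal illustration under stated assumptions: the struct and the function names are hypothetical, and the real throttling code computes far more than a simple headroom.

```c
#include <stdbool.h>

/* Hypothetical per-domain counters; not the kernel's data structures. */
struct domain_ctl {
	unsigned long dirty;      /* dirty pages counted in this domain */
	unsigned long thresh;     /* hard dirty threshold */
	unsigned long bg_thresh;  /* background writeback threshold */
};

/* Pages that may still be dirtied in one domain before hitting its limit. */
static unsigned long headroom(const struct domain_ctl *d)
{
	return d->dirty < d->thresh ? d->thresh - d->dirty : 0;
}

/* A task in a non-root memcg is bound by both domains: the tighter one wins. */
static unsigned long allowed_dirty(const struct domain_ctl *global,
				   const struct domain_ctl *memcg)
{
	unsigned long g = headroom(global);
	unsigned long m = headroom(memcg);

	return g < m ? g : m;
}

/* Background writeback keeps going while either domain is over bg_thresh. */
static bool over_bg_thresh(const struct domain_ctl *global,
			   const struct domain_ctl *memcg)
{
	return global->dirty > global->bg_thresh ||
	       memcg->dirty > memcg->bg_thresh;
}
```

A memcg close to its own limit is throttled even when the system as a whole has plenty of dirtyable memory left, which is the case the global-only calculation could not handle.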
      
      This makes the dirty throttling mechanism operational for memcg
      domains including writeback-bandwidth-proportional dirty page
      distribution inside them but the ad-hoc memcg throttling mechanism in
      vmscan is still in place.  The next patch will rip it out.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes · 2529bb3a
      Tejun Heo authored
      The amount of available memory to a memcg wb_domain can change as
      memcg configuration changes.  A domain's ->dirty_limit exists to
      smooth out sudden drops in dirty threshold; however, when a domain's
      size actually drops significantly, it hinders the dirty throttling
      from adjusting to the new configuration leading to unexpected
      behaviors including unnecessary OOM kills.
      
      This patch resolves the issue by adding wb_domain_size_changed() which
      resets ->dirty_limit[_tstmp] and making memcg call it on configuration
      changes.
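
To see why the reset matters, consider a smoothed limit that follows the threshold up immediately but decays toward it only slowly. The sketch below is an assumption-laden illustration: the struct, the function names and the 1/32 decay step are chosen for this example and are not taken from the patch.

```c
/* Hypothetical smoothed-limit state; time is counted in arbitrary ticks. */
struct dom_limit {
	unsigned long dirty_limit;         /* smoothed dirty threshold */
	unsigned long dirty_limit_tstamp;  /* time of the last update */
};

/* Follow @thresh up at once, but decay toward it slowly when it drops. */
static void update_limit(struct dom_limit *d, unsigned long thresh,
			 unsigned long now)
{
	if (d->dirty_limit < thresh)
		d->dirty_limit = thresh;
	else
		d->dirty_limit -= (d->dirty_limit - thresh) >> 5; /* assumed step */
	d->dirty_limit_tstamp = now;
}

/* Forget the smoothed value so the next update adopts the new threshold. */
static void size_changed(struct dom_limit *d, unsigned long now)
{
	d->dirty_limit = 0;
	d->dirty_limit_tstamp = now;
}
```

Without the reset, a domain shrunk from 3200 to 320 pages would still see a limit of 3110 after one update; after the reset the limit is 320 right away.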
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement memcg wb_domain · 841710aa
      Tejun Heo authored
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportionately sized portion of the memory in the
      whole system.
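
The proportional distribution described above amounts to scaling the dirtyable memory by a wb's fraction of the total writeout bandwidth. A minimal sketch, assuming a hypothetical helper name and plain integer bandwidth figures (the kernel tracks the proportions with its own fraction machinery):

```c
#include <stdint.h>

/*
 * Share of @dirtyable pages for a wb writing at @wb_bw when all wb's in
 * the domain together write at @total_bw.  Overflow of the intermediate
 * product is ignored in this sketch.
 */
static uint64_t wb_share(uint64_t dirtyable, uint64_t wb_bw, uint64_t total_bw)
{
	if (total_bw == 0)
		return 0;
	return dirtyable * wb_bw / total_bw;
}
```

For cgroup writeback the same calculation would be carried out once against the global domain and once against the memcg domain, each with its own dirtyable memory and bandwidth totals.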
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      The previous patches laid the groundwork to support the two wb_domains
      and this patch implements memcg wb_domain.  memcg->cgwb_domain is
      initialized on css online and destroyed on css release,
      wb->memcg_completions is added, and __wb_writeout_inc() is updated to
      increment completions against both global and memcg wb_domains.
      
      The following patches will update balance_dirty_pages() and its
      subroutines to actually consider memcg wb_domain for throttling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: move over_bground_thresh() to mm/page-writeback.c · aa661bbe
      Tejun Heo authored
      and rename it to wb_over_bg_thresh().  The function is closely tied to
      the dirty throttling mechanism implemented in page-writeback.c.  This
      relocation will allow future updates necessary for cgroup writeback
      support.
      
      While at it, add function comment.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: move global_dirty_limit into wb_domain · dcc25ae7
      Tejun Heo authored
      This patch is a part of the series to define wb_domain which
      represents a domain that wb's (bdi_writeback's) belong to and are
      measured against each other in.  This will enable IO backpressure
      propagation for cgroup writeback.
      
      global_dirty_limit exists to regulate the global dirty threshold which
      is a property of the wb_domain.  This patch moves hard_dirty_limit,
      dirty_lock, and update_time into wb_domain.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement wb_domain · 380c27ca
      Tejun Heo authored
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportionately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      Currently, what constitutes the global writeback domain are scattered
      across a number of global states.  This patch starts collecting them
      into struct wb_domain.
      
      * fprop_global which serves as the basis for proportional bandwidth
        measurement and its period timer are moved into struct wb_domain.
      
      * global_wb_domain hosts the states for the global domain.
      
      * While at it, flatten wb_writeout_fraction() into its callers.  This
        thin wrapper doesn't provide any actual benefits while getting in
        the way.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: reorganize [__]wb_update_bandwidth() · 8a731799
      Tejun Heo authored
      __wb_update_bandwidth() is called from two places -
      mm/page-writeback.c::balance_dirty_pages() and
      fs/fs-writeback.c::wb_writeback().  The latter updates only the
      write bandwidth while the former also deals with the dirty ratelimit.
      The two callsites are distinguished by whether @thresh parameter is
      zero or not, which is cryptic.  In addition, the two files define
      their own different versions of wb_update_bandwidth() on top of
      __wb_update_bandwidth(), which is confusing to say the least.  This
      patch cleans up [__]wb_update_bandwidth() in the following ways.
      
      * __wb_update_bandwidth() now takes explicit @update_ratelimit
        parameter to gate dirty ratelimit handling.
      
      * mm/page-writeback.c::wb_update_bandwidth() is flattened into its
        caller - balance_dirty_pages().
      
      * fs/fs-writeback.c::wb_update_bandwidth() is moved to
        mm/page-writeback.c and __wb_update_bandwidth() is made static.
      
      * While at it, add a lockdep assertion to __wb_update_bandwidth().
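
The first bullet replaces a sentinel argument (a zero @thresh) with an explicit flag. The pattern can be sketched as below; the struct, the counters and the wrapper names are hypothetical stand-ins, and only the idea of an explicit @update_ratelimit parameter comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical counters standing in for the real bandwidth state. */
struct wb_stats {
	unsigned long write_bw_updates;
	unsigned long ratelimit_updates;
};

/* Core helper: the caller states explicitly whether to touch the ratelimit. */
static void update_bandwidth_core(struct wb_stats *s, bool update_ratelimit)
{
	s->write_bw_updates++;
	if (update_ratelimit)
		s->ratelimit_updates++;
}

/* Writeback path: only the write bandwidth is refreshed. */
static void writeback_update(struct wb_stats *s)
{
	update_bandwidth_core(s, false);
}

/* Throttling path: the dirty ratelimit is refreshed as well. */
static void throttle_update(struct wb_stats *s)
{
	update_bandwidth_core(s, true);
}
```

Each call site now says what it wants instead of encoding the request in an unrelated parameter.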
      
      Except for the lockdep addition, this is pure reorganization and
      doesn't introduce any behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: clean up wb_dirty_limit() · 0d960a38
      Tejun Heo authored
      The function name wb_dirty_limit(), its argument @dirty and the local
      variable @wb_dirty are mortally confusing given that the function
      calculates per-wb threshold value not dirty pages, especially given
      that @dirty and @wb_dirty are used elsewhere for dirty pages.
      
      Let's rename the function to wb_calc_thresh() and wb_dirty to
      wb_thresh.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • buffer, writeback: make __block_write_full_page() honor cgroup writeback · bafc0dba
      Tejun Heo authored
      [__]block_write_full_page() is used to implement ->writepage in
      various filesystems.  All writeback logic is now updated to handle
      cgroup writeback and the block cgroup to issue IOs for is encoded in
      writeback_control and can be retrieved from the inode; however,
      [__]block_write_full_page() currently ignores the blkcg indicated by
      inode and issues all bio's without explicit blkcg association.
      
      This patch adds submit_bh_blkcg() which associates the bio with the
      specified blkio cgroup before issuing and uses it in
      __block_write_full_page() so that the issued bio's are associated with
      inode_to_wb_blkcg_css(inode).
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: restructure try_writeback_inodes_sb[_nr]() · f30a7d0c
      Tejun Heo authored
      try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it
      handles s_umount locking and skips if writeback is already in
      progress.  The in progress test is performed on the root wb
      (bdi_writeback) which isn't sufficient for cgroup writeback support.
      The test must be done per-wb.
      
      To prepare for the change, this patch factors out
      __writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds
      @skip_if_busy and moves the in progress test right before queueing the
      wb_writeback_work.  try_writeback_inodes_sb_nr() now just grabs
      s_umount and invokes __writeback_inodes_sb_nr() with asserted
      @skip_if_busy.  This way, later addition of multiple wb handling can
      skip only the wb's which already have writeback in progress.
      
      This swaps the order between in progress test and s_umount test which
      can flip the return value when writeback is in progress and s_umount
      is being held by someone else but this shouldn't cause any meaningful
      difference.  It's a fringe condition and the return value is an
      unsynchronized hint anyway.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement bdi_wait_for_completion() · cc395d7f
      Tejun Heo authored
      The completion of a wb_writeback_work can be waited upon by setting
      its ->done to a struct completion and waiting on it; however, for
      cgroup writeback support, it's necessary to issue multiple work items
      to multiple bdi_writebacks and wait for the completion of all.
      
      This patch implements wb_completion which can wait for multiple work
      items and replaces the struct completion with it.  It can be defined
      using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
      waited for by wb_wait_for_completion().
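
A completion that covers several work items can be built around a counter: the waiter holds one initial reference, each queued work item adds one, and the wait ends when the count reaches zero. The following is a user-space sketch using C11 atomics; every name in it is hypothetical, and it only reports whether waiting would be over instead of actually sleeping.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical multi-item completion; not the kernel's wb_completion. */
struct multi_completion {
	atomic_int cnt;
};

/* Start with one reference held by the waiter itself. */
static void mc_init(struct multi_completion *mc)
{
	atomic_init(&mc->cnt, 1);
}

/* Called when a work item tied to this completion is queued. */
static void mc_add_work(struct multi_completion *mc)
{
	atomic_fetch_add(&mc->cnt, 1);
}

/* Called when a work item finishes. */
static void mc_work_done(struct multi_completion *mc)
{
	atomic_fetch_sub(&mc->cnt, 1);
}

/* The waiter drops its initial reference before it starts waiting. */
static void mc_waiter_put(struct multi_completion *mc)
{
	atomic_fetch_sub(&mc->cnt, 1);
}

/* True once every queued work item has finished. */
static bool mc_all_done(struct multi_completion *mc)
{
	return atomic_load(&mc->cnt) == 0;
}
```

The initial reference keeps the count from hitting zero while work items are still being queued, so an early finisher cannot end the wait prematurely.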
      
      Nobody currently issues multiple work items and this patch doesn't
      introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info · 9ecf4866
      Tejun Heo authored
      bdi_start_background_writeback() currently takes @bdi and kicks the
      root wb (bdi_writeback).  In preparation for cgroup writeback support,
      make it take wb instead.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info · bc05873d
      Tejun Heo authored
      writeback_in_progress() currently takes @bdi and returns whether
      writeback is in progress on its root wb (bdi_writeback).  In
      preparation for cgroup writeback support, make it take wb instead.
      While at it, make it an inline function.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: remove bdi_start_writeback() · c00ddad3
      Tejun Heo authored
      bdi_start_writeback() is a thin wrapper on top of
      __wb_start_writeback() which is used only by laptop_mode_timer_fn().
      This patch removes bdi_start_writeback(), renames
      __wb_start_writeback() to wb_start_writeback() and makes
      laptop_mode_timer_fn() use it instead.
      
      This doesn't cause any functional difference and will ease making
      laptop_mode_timer_fn() cgroup writeback aware.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement bdi_for_each_wb() · ebe41ab0
      Tejun Heo authored
      This will be used to implement bdi-wide operations which should be
      distributed across all its cgroup bdi_writebacks.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account · 95a46c65
      Tejun Heo authored
      bdi_has_dirty_io() used to only reflect whether the root wb
      (bdi_writeback) has dirty inodes.  For cgroup writeback support, it
      needs to take all active wb's into account.  If any wb on the bdi has
      dirty inodes, bdi_has_dirty_io() should return true.
      
      To achieve that, as inode_wb_list_{move|del}_locked() now keep track
      of the dirty state transition of each wb, the number of dirty wbs can
      be counted in the bdi; however, bdi is already aggregating
      wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
      there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
      dip below 1.  bdi_has_dirty_io() can simply test whether
      bdi->tot_write_bandwidth is zero or not.
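
The trick described above is that a per-wb bandwidth which can never drop below 1 makes the bdi-wide sum double as a "has dirty IO" indicator. A single-threaded sketch with hypothetical struct and function names follows; a plain long is used here for brevity where a concurrent implementation would need an atomic type.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for backing_dev_info and bdi_writeback. */
struct bdi_sketch {
	long tot_write_bandwidth;  /* sum over wb's that have dirty inodes */
};

struct wb_sketch {
	struct bdi_sketch *bdi;
	long avg_write_bandwidth;  /* kept >= 1 at all times */
	bool has_dirty_io;
};

/* Update a wb's bandwidth estimate, clamping it so it never dips below 1. */
static void wb_set_bandwidth(struct wb_sketch *wb, long bw)
{
	if (bw < 1)
		bw = 1;
	if (wb->has_dirty_io)
		wb->bdi->tot_write_bandwidth += bw - wb->avg_write_bandwidth;
	wb->avg_write_bandwidth = bw;
}

/* Record a wb gaining its first or losing its last dirty inode. */
static void wb_set_dirty(struct wb_sketch *wb, bool dirty)
{
	if (dirty == wb->has_dirty_io)
		return;
	wb->has_dirty_io = dirty;
	if (dirty)
		wb->bdi->tot_write_bandwidth += wb->avg_write_bandwidth;
	else
		wb->bdi->tot_write_bandwidth -= wb->avg_write_bandwidth;
}

/* Any dirty wb contributes at least 1, so a non-zero sum means dirty IO. */
static bool bdi_has_dirty_io_sketch(const struct bdi_sketch *bdi)
{
	return bdi->tot_write_bandwidth != 0;
}
```

No separate counter of dirty wb's is needed, which is the saving the commit message points out.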
      
      While this bumps the value of wb->avg_write_bandwidth to one when it
      used to be zero, this shouldn't cause any meaningful behavior
      difference.
      
      bdi_has_dirty_io() is made an inline function which tests whether
      ->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
      value are added to inode_wb_list_{move|del}_locked().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement backing_dev_info->tot_write_bandwidth · 766a9d6e
      Tejun Heo authored
      cgroup writeback support needs to keep track of the sum of
      avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
      distribute write workload.  This patch adds bdi->tot_write_bandwidth
      and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
      and wb_update_write_bandwidth() to adjust it as wb's gain and lose
      dirty inodes and its avg_write_bandwidth gets updated.
      
      As the update events are not synchronized with each other,
      bdi->tot_write_bandwidth is an atomic_long_t.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement WB_has_dirty_io wb_state flag · d6c10f1f
      Tejun Heo authored
      Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
      has any dirty inode by testing all three IO lists on each invocation
      without actively keeping track.  For cgroup writeback support, a
      single bdi will host multiple wb's each of which will host dirty
      inodes separately and we'll need to make bdi_has_dirty_io(), which
      currently only represents the root wb, aggregate has_dirty_io from all
      member wb's, which requires tracking transitions in has_dirty_io state
      on each wb.
      
      This patch introduces inode_wb_list_{move|del}_locked() to consolidate
      IO list operations leaving queue_io() the only other function which
      directly manipulates IO lists (via move_expired_inodes()).  All three
      functions are updated to call wb_io_lists_[de]populated() which keep
      track of whether the wb has dirty inodes or not and record it using
      the new WB_has_dirty_io flag.  inode_wb_list_move_locked()'s return
      value indicates whether the wb had no dirty inodes before.
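
Tracking the transition instead of rescanning the lists can be sketched as below. The struct, the flag bit and the helper names are hypothetical, and a simple inode count stands in for the three IO lists.

```c
#include <stdbool.h>

#define WB_HAS_DIRTY_IO_BIT (1u << 0)  /* hypothetical flag bit */

/* Hypothetical wb with a counter standing in for its IO lists. */
struct wb_lists {
	unsigned state;
	unsigned nr_inodes;
};

/* Add an inode; returns true if the wb had no dirty inodes before. */
static bool list_add_inode(struct wb_lists *wb)
{
	wb->nr_inodes++;
	if (wb->state & WB_HAS_DIRTY_IO_BIT)
		return false;
	wb->state |= WB_HAS_DIRTY_IO_BIT;
	return true;
}

/* Remove an inode; the flag is cleared when the last one goes away. */
static void list_del_inode(struct wb_lists *wb)
{
	if (wb->nr_inodes && --wb->nr_inodes == 0)
		wb->state &= ~WB_HAS_DIRTY_IO_BIT;
}

/* Constant-time query that no longer walks any list. */
static bool has_dirty_io(const struct wb_lists *wb)
{
	return (wb->state & WB_HAS_DIRTY_IO_BIT) != 0;
}
```

The return value of the add helper is what lets a caller decide whether the wb needs to be woken up.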
      
      mark_inode_dirty() is restructured so that the return value of
      inode_wb_list_move_locked() can be used for deciding whether to wake
      up the wb.
      
      While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
      These functions were returning 0 and 1 before.  Also, add a comment
      explaining the synchronization of wb_state flags.
      
      v2: Updated to accommodate b_dirty_time.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement and use inode_congested() · 703c2708
      Tejun Heo authored
      In several places, bdi_congested() and its wrappers are used to
      determine whether more IOs should be issued.  With cgroup writeback
      support, this question can't be answered solely based on the bdi
      (backing_dev_info).  It's dependent on whether the filesystem and bdi
      support cgroup writeback and the blkcg the inode is associated with.
      
      This patch implements inode_congested() and its wrappers which take
      @inode and determines the congestion state considering cgroup
      writeback.  The new functions replace bdi_*congested() calls in places
      where the query is about specific inode and task.
      
      There are several filesystem users which also fit these criteria but
      they should be updated when each filesystem implements cgroup
      writeback support.
      
      v2: Now that a given inode is associated with only one wb, congestion
          state can be determined independent from the asking task.  Drop
          @task.  Spotted by Vivek.  Also, converted to take @inode instead
          of @mapping and renamed to inode_congested().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback, blkcg: restructure blk_{set|clear}_queue_congested() · d40f75a0
      Tejun Heo authored
      blk_{set|clear}_queue_congested() take @q and set or clear,
      respectively, the congestion state of its bdi's root wb.  Because bdi
      used to be able to handle congestion state only on the root wb, the
      callers of those functions tested whether the congestion is on the
      root blkcg and skipped if not.
      
      This is cumbersome and makes implementation of per cgroup
      bdi_writeback congestion state propagation difficult.  This patch
      renames blk_{set|clear}_queue_congested() to
      blk_{set|clear}_congested(), and makes them take request_list instead
      of request_queue and test whether the specified request_list is the
      root one before updating bdi_writeback congestion state.  This makes
      the tests in the callers unnecessary and simplifies them.
      
      As there are no external users of these functions, the definitions are
      moved from include/linux/blkdev.h to block/blk-core.c.
      
      This patch doesn't introduce any noticeable behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make congestion functions per bdi_writeback · ec8a6f26
      Tejun Heo authored
      Currently, all congestion functions take bdi (backing_dev_info) and
      always operate on the root wb (bdi->wb) and the congestion state from
      the block layer is propagated only for the root blkcg.  This patch
      introduces {set|clear}_wb_congested() and wb_congested() which take a
      bdi_writeback_congested and bdi_writeback respectively.  The bdi
      counterparts are now wrappers invoking the wb based functions on
      @bdi->wb.
      
      While converting clear_bdi_congested() to clear_wb_congested(), the
      local variable declaration order between @wqh and @bit is swapped for
      cosmetic reasons.
      
      This patch just adds the new wb based functions.  The following
      patches will apply them.
      
      v2: Updated for bdi_writeback_congested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested · ce7acfea
      Tejun Heo authored
      A blkg (blkcg_gq) can be congested and decongested independently from
      other blkgs on the same request_queue.  Accordingly, for cgroup
      writeback support, the congestion status at bdi (backing_dev_info)
      should be split and updated separately from matching blkg's.
      
      This patch prepares by adding blkg->wb_congested and associating a
      blkg with its matching per-blkcg bdi_writeback_congested on creation.
      
      v2: Updated to associate bdi_writeback_congested instead of
          bdi_writeback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Tejun Heo authored
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
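
The sharing described above can be pictured as a lookup keyed by blkcg: every wb mapped to the same blkcg receives a pointer to the same congested record. The sketch below uses a small fixed table and hypothetical names throughout; it makes no claim about how the kernel stores or reference-counts these records.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLKCG 8  /* assumed table size for this sketch */

/* Hypothetical congested record, one per blkcg on a given bdi. */
struct congested {
	int  blkcg_id;
	bool in_use;
	bool is_congested;
};

struct bdi_tbl {
	struct congested slots[MAX_BLKCG];
};

/* Hypothetical wb: one per memcg, pointing at its blkcg's shared record. */
struct wb_ref {
	int memcg_id;
	struct congested *congested;
};

/* Look up the record for @blkcg_id, creating it in a free slot if needed. */
static struct congested *congested_get(struct bdi_tbl *bdi, int blkcg_id)
{
	struct congested *free_slot = NULL;

	for (size_t i = 0; i < MAX_BLKCG; i++) {
		struct congested *c = &bdi->slots[i];

		if (c->in_use && c->blkcg_id == blkcg_id)
			return c;
		if (!c->in_use && !free_slot)
			free_slot = c;
	}
	if (free_slot) {
		free_slot->in_use = true;
		free_slot->blkcg_id = blkcg_id;
	}
	return free_slot;  /* NULL when the table is full */
}
```

Marking the record congested through one wb is then immediately visible through every other wb that maps to the same blkcg.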
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      52ebea74
    • T
      writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK · 89e9b9e0
      Tejun Heo committed
      cgroup writeback requires support from both bdi and filesystem sides.
      Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
      support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
      default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
      both MEMCG and BLK_CGROUP are enabled.
      
      Also add inode_cgwb_enabled(), which determines whether both the bdi
      and the fs of a given inode support cgroup writeback.
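
As a rough illustration of the two-sided check, the following userspace sketch tests one capability bit on the bdi and one flag on the filesystem type. The struct layouts, the `SKETCH_*` flag values and the function name are assumptions for this example, and the kernel helper may check more than this.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real ones live in the kernel headers. */
#define SKETCH_BDI_CAP_CGROUP_WRITEBACK 0x1u
#define SKETCH_FS_CGROUP_WRITEBACK      0x1u

struct sketch_bdi {
    unsigned int capabilities;
};

struct sketch_fs_type {
    unsigned int fs_flags;
};

struct sketch_inode {
    const struct sketch_bdi *bdi;
    const struct sketch_fs_type *fs;
};

/* True only when both the bdi and the filesystem advertise support. */
bool sketch_inode_cgwb_enabled(const struct sketch_inode *inode)
{
    return (inode->bdi->capabilities & SKETCH_BDI_CAP_CGROUP_WRITEBACK) &&
           (inode->fs->fs_flags & SKETCH_FS_CGROUP_WRITEBACK);
}
```

With this shape, a block-based bdi that has the capability set still reports cgroup writeback as disabled until its filesystem also sets the fs flag.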
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      89e9b9e0
    • T
      bdi: separate out congested state into a separate struct · 4aa9c692
      Tejun Heo committed
      Currently, a wb's (bdi_writeback) congestion state is carried in its
      ->state field; however, cgroup writeback support will require multiple
      wb's sharing the same congestion state.  This patch separates out
      congestion state into its own struct - struct bdi_writeback_congested.
      A new wb field, wb_congested, points to its associated congested
      struct.  The default wb, bdi->wb, always points to bdi->wb_congested.
      
      While this patch adds a layer of indirection, it doesn't introduce any
      behavior changes.
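
The indirection can be sketched in userspace as follows. The `sk_*` names and the bit numbering are assumptions for illustration, not the kernel's definitions. The point is only that congestion bits are read and written through a pointer, so several wb's can observe the same state.

```c
/* Hypothetical stand-in for struct bdi_writeback_congested. */
struct sk_congested {
    unsigned long state;
};

/* Hypothetical stand-in for struct bdi_writeback. */
struct sk_wb {
    struct sk_congested *congested; /* was an embedded ->state bit before */
};

/* Hypothetical congestion bits. */
enum sk_congested_bit {
    SK_ASYNC_CONGESTED = 0,
    SK_SYNC_CONGESTED = 1
};

/* Set a congestion bit through the indirection. */
void sk_set_congested(struct sk_wb *wb, enum sk_congested_bit bit)
{
    wb->congested->state |= 1UL << bit;
}

/* Clear a congestion bit through the indirection. */
void sk_clear_congested(struct sk_wb *wb, enum sk_congested_bit bit)
{
    wb->congested->state &= ~(1UL << bit);
}

/* Test a congestion bit; returns 1 if set, 0 otherwise. */
int sk_test_congested(const struct sk_wb *wb, enum sk_congested_bit bit)
{
    return (int)((wb->congested->state >> bit) & 1UL);
}
```

When each wb points to its own struct, behavior matches the old embedded field. Sharing only appears once two wb's are given the same pointer.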
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4aa9c692
    • T
      bdi: make inode_to_bdi() inline · a212b105
      Tejun Heo committed
      Now that bdi definitions are moved to backing-dev-defs.h,
      backing-dev.h can include blkdev.h and inline inode_to_bdi() without
      worrying about introducing a circular include dependency.  The function
      is called from hot paths and is fairly trivial.
      
      This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the
      function calls inline.  blockdev_superblock and noop_backing_dev_info
      are EXPORT_GPL'd to allow the inline functions to be used from
      modules.
      
      While at it, make sb_is_blkdev_sb() return bool instead of int.
      
      v2: Fixed typo in description as suggested by Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a212b105
    • T
      writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo committed
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      66114cad
    • T
      writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback · f0054bb1
      Tejun Heo committed
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->wb_lock and ->worklist into wb.
      
      * The lock protects bdi->worklist and bdi->wb.dwork scheduling.  While
        moving, rename it to wb->work_lock as wb->wb_lock is confusing.
        Also, move wb->dwork downwards so that it's colocated with the new
        ->work_lock and ->work_list fields.
      
      * bdi_writeback_workfn()		-> wb_workfn()
        bdi_wakeup_thread_delayed(bdi)	-> wb_wakeup_delayed(wb)
        bdi_wakeup_thread(bdi)		-> wb_wakeup(wb)
        bdi_queue_work(bdi, ...)		-> wb_queue_work(wb, ...)
        __bdi_start_writeback(bdi, ...)	-> __wb_start_writeback(wb, ...)
        get_next_work_item(bdi)		-> get_next_work_item(wb)
      
      * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
        The function contained parts which belong to the containing bdi
        rather than the wb itself - testing cap_writeback_dirty and
        bdi_remove_from_list() invocation.  Those are moved to
        bdi_unregister().
      
      * bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
        Initializations of the moved bdi->wb_lock and ->work_list are
        relocated from bdi_init() to wb_init().
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f0054bb1
    • T
      writeback: move bandwidth related fields from backing_dev_info into bdi_writeback · a88a341a
      Tejun Heo committed
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bandwidth related fields from backing_dev_info into
      bdi_writeback.
      
      * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
        write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
        balanced_dirty_ratelimit, completions and dirty_exceeded.
      
      * writeback_chunk_size() and over_bground_thresh() now take @wb
        instead of @bdi.
      
      * bdi_writeout_fraction(bdi, ...)	-> wb_writeout_fraction(wb, ...)
        bdi_dirty_limit(bdi, ...)		-> wb_dirty_limit(wb, ...)
        bdi_position_ratio(bdi, ...)		-> wb_position_ratio(wb, ...)
        bdi_update_write_bandwidth(bdi, ...)	-> wb_update_write_bandwidth(wb, ...)
        [__]bdi_update_bandwidth(bdi, ...)	-> [__]wb_update_bandwidth(wb, ...)
        bdi_{max|min}_pause(bdi, ...)		-> wb_{max|min}_pause(wb, ...)
        bdi_dirty_limits(bdi, ...)		-> wb_dirty_limits(wb, ...)
      
      * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
        respectively.  Note that explicit zeroing is dropped in the process
        as wb's are cleared in entirety anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      
      v2: Typo in description fixed as suggested by Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a88a341a
    • T
      writeback: move backing_dev_info->bdi_stat[] into bdi_writeback · 93f78d88
      Tejun Heo committed
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->bdi_stat[] into wb.
      
      * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
        enums is changed from BDI_ to WB_.
      
      * BDI_STAT_BATCH() -> WB_STAT_BATCH()
      
      * [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)
      
      * bdi_stat[_error]() -> wb_stat[_error]()
      
      * bdi_writeout_inc() -> wb_writeout_inc()
      
      * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
        frees stat.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      93f78d88
    • T
      writeback: move backing_dev_info->state into bdi_writeback · 4452226e
      Tejun Heo committed
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->state into wb.
      
      * enum bdi_state is renamed to wb_state and the prefix of all enums is
        changed from BDI_ to WB_.
      
      * Explicit zeroing of bdi->state is removed without adding zeroing of
        wb->state as the whole data structure is zeroed on init anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: drbd-dev@lists.linbit.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4452226e
    • T
      memcg: implement mem_cgroup_css_from_page() · ad7fa852
      Tejun Heo committed
      Implement mem_cgroup_css_from_page() which returns the
      cgroup_subsys_state of the memcg associated with a given page on the
      default hierarchy.  This will be used by cgroup writeback support.
      
      This function assumes that page->mem_cgroup association doesn't change
      until the page is released, which is true on the default hierarchy as
      long as replace_page_cache_page() is not used.  As the only user of
      replace_page_cache_page() is FUSE which won't support cgroup writeback
      for the time being, this works for now, and replace_page_cache_page()
      will soon be updated so that the invariant actually holds.
      
      Note that the RCU protected page->mem_cgroup access is consistent with
      other usages across memcg but ultimately incorrect.  These unlocked
      accesses are missing required barriers.  page->mem_cgroup should be
      made an RCU pointer and updated and accessed using RCU operations.
      
      v4: Instead of triggering WARN, return the root css on the traditional
          hierarchies.  This makes the function a lot easier to deal with
          especially as there's no light way to synchronize against
          hierarchy rebinding.
      
      v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/
      
      v2: Trigger WARN if the function is used on the traditional
          hierarchies and add comment about the assumed invariant.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ad7fa852
    • T
      blkcg: implement bio_associate_blkcg() · 1d933cf0
      Tejun Heo committed
      Currently, a bio can only be associated with the io_context and blkcg
      of %current using bio_associate_current().  This is too restrictive
      for cgroup writeback support.  Implement bio_associate_blkcg() which
      associates a bio with the specified blkcg.
      
      bio_associate_blkcg() leaves the io_context unassociated.
      bio_associate_current() is updated so that it considers a bio as
      already associated if it has a blkcg_css, instead of an io_context,
      associated with it.
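
A minimal userspace sketch of the two association paths is given below. The `sk_*` types, the plain integer reference counts and the specific error codes (`-EBUSY`, `-ENOENT`) are assumptions for illustration, and the kernel functions differ in detail. It shows that associating an explicit blkcg leaves the io_context unset, and that the "already associated" test looks at the css rather than the io_context.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins with plain reference counts. */
struct sk_css { int refcnt; };
struct sk_ioc { int refcnt; };

struct sk_bio {
    struct sk_ioc *ioc; /* io_context, set only by the "current" path */
    struct sk_css *css; /* blkcg css, set by either path */
};

struct sk_task {
    struct sk_ioc *ioc;
    struct sk_css *blkcg_css;
};

/* Associate a bio with an explicit blkcg css; the io_context stays unset. */
int sk_bio_associate_blkcg(struct sk_bio *bio, struct sk_css *css)
{
    if (bio->css)
        return -EBUSY; /* assumed error code for "already associated" */
    css->refcnt++;
    bio->css = css;
    return 0;
}

/* Associate a bio with the io_context and blkcg css of the given task. */
int sk_bio_associate_current(struct sk_bio *bio, struct sk_task *cur)
{
    if (bio->css)
        return -EBUSY; /* the css, not the ioc, marks prior association */
    if (!cur->ioc)
        return -ENOENT; /* assumed error code for "no io_context" */
    cur->ioc->refcnt++;
    bio->ioc = cur->ioc;
    cur->blkcg_css->refcnt++;
    bio->css = cur->blkcg_css;
    return 0;
}
```

Keying the check on the css means a bio that already went through the explicit-blkcg path is not silently re-associated with the submitting task.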
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1d933cf0
    • T
      blkcg: implement task_get_blkcg_css() · fd383c2d
      Tejun Heo committed
      Implement a wrapper around task_get_css() to acquire the blkcg css for
      a given task.  The wrapper is necessary for cgroup writeback support
      as there will be places outside blkcg proper trying to acquire
      blkcg_css and blkio_cgrp_id will be undefined when !CONFIG_BLK_CGROUP.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      fd383c2d
    • T
      cgroup, block: implement task_get_css() and use it in bio_associate_current() · ec438699
      Tejun Heo committed
      bio_associate_current() currently open codes task_css() and
      css_tryget_online() to find and pin $current's blkcg css.  Abstract it
      into task_get_css() which is implemented from cgroup side.  As a task
      is always associated with an online css for every subsystem except
      while the css_set update is propagating, task_get_css() retries till
      css_tryget_online() succeeds.
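
The retry loop can be sketched in single-threaded userspace C as below. Everything here is an assumption for illustration: the `sk_*` types, the plain integer refcount, and especially `sk_relax()`, which stands in for "the concurrent css_set update finishes propagating" by switching the task to its pending css. The kernel version relies on RCU and real concurrency instead.

```c
#include <stddef.h>

/* Hypothetical css with an online flag and a plain refcount. */
struct sk_css {
    int online;
    int refcnt;
};

/* Hypothetical task; pending_css models an in-flight css_set update. */
struct sk_task {
    struct sk_css *css;
    struct sk_css *pending_css;
};

/* Take a reference only if the css is still online. */
static int sk_css_tryget_online(struct sk_css *css)
{
    if (!css->online)
        return 0;
    css->refcnt++;
    return 1;
}

/* Stand-in for waiting: here the pending update simply lands. */
static void sk_relax(struct sk_task *task)
{
    if (task->pending_css) {
        task->css = task->pending_css;
        task->pending_css = NULL;
    }
}

/*
 * Re-read the task's css and retry until a reference on an online css
 * is obtained.  In this sketch the loop only terminates if the task
 * ends up pointing at an online css, mirroring the stated invariant.
 */
struct sk_css *sk_task_get_css(struct sk_task *task)
{
    struct sk_css *css;

    for (;;) {
        css = task->css;
        if (sk_css_tryget_online(css))
            break;
        sk_relax(task);
    }
    return css;
}
```

The caller gets back a pinned css and is responsible for dropping the reference later. The sketch omits the put side.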
      
      This is a cleanup and shouldn't lead to noticeable behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ec438699
    • T
      blkcg: add blkcg_root_css · 496d5e75
      Tejun Heo committed
      Add global constant blkcg_root_css which points to &blkcg_root.css.
      This will be used by cgroup writeback support.  If blkcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
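
The disabled-config fallback follows the kernel's error-pointer idiom, which can be approximated in userspace as below. The `sk_*` helpers are simplified re-implementations written for this example, and `SK_CONFIG_BLK_CGROUP` is a hypothetical macro standing in for the real config option. The sketch encodes a small negative errno in the top of the pointer range so callers can tell it apart from a valid pointer.

```c
#include <errno.h>
#include <stdint.h>

#define SK_MAX_ERRNO 4095

struct sk_css { int id; };

/* Encode a negative errno value as a pointer. */
static inline void *sk_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long sk_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the range reserved for error codes. */
static inline int sk_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-SK_MAX_ERRNO);
}

#ifdef SK_CONFIG_BLK_CGROUP
/* Enabled: the constant points at the real root css. */
static struct sk_css sk_blkcg_root = { 0 };
#define sk_blkcg_root_css (&sk_blkcg_root)
#else
/* Disabled: the constant is an error pointer, never dereferenced. */
#define sk_blkcg_root_css ((struct sk_css *)sk_err_ptr(-EINVAL))
#endif
```

Code that compares a css against the root constant keeps compiling in both configurations, and any accidental use in the disabled case can be caught with `sk_is_err()`.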
      
      v2: The declarations moved to include/linux/blk-cgroup.h as suggested
          by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      496d5e75
    • T
      memcg: add mem_cgroup_root_css · 56161634
      Tejun Heo committed
      Add global mem_cgroup_root_css which points to the root memcg css.
      This will be used by cgroup writeback support.  If memcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      56161634