1. 02 6月, 2015 40 次提交
    • T
      writeback: implement foreign cgroup inode bdi_writeback switching · d10c8095
      Tejun Heo 提交于
      As concurrent write sharing of an inode is expected to be very rare
      and memcg only tracks page ownership on first-use basis severely
      confining the usefulness of such sharing, cgroup writeback tracks
      ownership per-inode.  While the support for concurrent write sharing
      of an inode is deemed unnecessary, an inode being written to by
      different cgroups at different points in time is a lot more common,
      and, more importantly, charging only by first-use can too readily lead
      to grossly incorrect behaviors (single foreign page can lead to
      gigabytes of writeback to be incorrectly attributed).
      
      To resolve this issue, cgroup writeback detects the majority dirtier
      of an inode and transfers the ownership to it.  The previous patches
      implemented the foreign condition detection mechanism and laid the
      groundwork.  This patch implements the actual switching.
      
      With the previously implemented [unlocked_]inode_to_wb_and_list_lock()
      and wb stat transaction, grabbing wb->list_lock, inode->i_lock and
      mapping->tree_lock gives us full exclusion against all wb operations
      on the target inode.  inode_switch_wb_work_fn() grabs all the locks
      and transfers the inode atomically along with its RECLAIMABLE and
      WRITEBACK stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d10c8095
    • T
      writeback: add lockdep annotation to inode_to_wb() · aaa2cacf
      Tejun Heo 提交于
      With the previous three patches, all operations which acquire wb from
      inode are either under one of inode->i_lock, mapping->tree_lock or
      wb->list_lock or protected by unlocked_inode_to_wb transaction.  This
      will be depended upon by foreign inode wb switching.
      
      This patch adds lockdep assertion to inode_to_wb() so that usages
      outside the above list locks can be caught easily.  There are three
      exceptions.
      
      * locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the
        wb may not be the inode's.  Ensuring that is the function's role
        after all.  Updated to deref inode->i_wb directly.
      
      * inode_wb_stat_unlocked_begin() is usually protected by combination
        of !I_WB_SWITCH and rcu_read_lock().  Updated to deref inode->i_wb
        directly.
      
      * inode_congested() wants to test whether inode->i_wb is set before
        starting the transaction.  Added inode_to_wb_is_valid() which tests
        inode->i_wb directly.
      
      v5: might_lock() removed.  It annotates that the lock is grabbed w/
          irq enabled which isn't the case and triggering lockdep warning
          spuriously.
      
      v4: might_lock() added to unlocked_inode_to_wb_begin().
      
      v3: inode_congested() conversion added.
      
      v2: locked_inode_to_wb_and_lock_list() was missing in the first
          version.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      aaa2cacf
    • T
      writeback: use unlocked_inode_to_wb transaction in inode_congested() · 5cb8b824
      Tejun Heo 提交于
      Similar to wb stat updates, inode_congested() accesses the associated
      wb of an inode locklessly, which will break with foreign inode wb
      switching.  This path updates inode_congested() to use unlocked inode
      wb access transaction introduced by the previous patch.
      
      Combined with the previous two patches, this makes all wb list and
      access operations to be protected by either of inode->i_lock,
      wb->list_lock, or mapping->tree_lock while wb switching is in
      progress.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5cb8b824
    • T
      writeback: implement unlocked_inode_to_wb transaction and use it for stat updates · 682aa8e1
      Tejun Heo 提交于
      The mechanism for detecting whether an inode should switch its wb
      (bdi_writeback) association is now in place.  This patch build the
      framework for the actual switching.
      
      This patch adds a new inode flag I_WB_SWITCHING, which has two
      functions.  First, the easy one, it ensures that there's only one
      switching in progress for a give inode.  Second, it's used as a
      mechanism to synchronize wb stat updates.
      
      The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
      but track the current number of dirty pages and pages under writeback
      respectively.  As such, when an inode is moved from one wb to another,
      the inode's portion of those stats have to be transferred together;
      unfortunately, this is a bit tricky as those stat updates are percpu
      operations which are performed without holding any lock in some
      places.
      
      This patch solves the problem in a similar way as memcg.  Each such
      lockless stat updates are wrapped in transaction surrounded by
      unlocked_inode_to_wb_begin/end().  During normal operation, they map
      to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
      mapping->tree_lock is grabbed across the transaction.
      
      In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
      grace period to pass before actually starting to switch, which
      guarantees that all stat update paths are synchronizing against
      mapping->tree_lock.
      
      This patch still doesn't implement the actual switching.
      
      v3: Updated on top of the recent cancel_dirty_page() updates.
          unlocked_inode_to_wb_begin() now nests inside
          mem_cgroup_begin_page_stat() to match the locking order.
      
      v2: The i_wb access transaction will be used for !stat accesses too.
          Function names and comments updated accordingly.
      
          s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
          s/switch_wb/switch_wbs/
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      682aa8e1
    • T
      writeback: implement [locked_]inode_to_wb_and_lock_list() · 87e1d789
      Tejun Heo 提交于
      cgroup writeback currently assumes that inode to wb association
      doesn't change; however, with the planned foreign inode wb switching
      mechanism, the association will change dynamically.
      
      When an inode needs to be put on one of the IO lists of its wb, the
      current code simply calls inode_to_wb() and locks the returned wb;
      however, with the planned wb switching, the association may change
      before locking the wb and may even get released.
      
      This patch implements [locked_]inode_to_wb_and_lock_list() which pins
      the associated wb while holding i_lock, releases it, acquires
      wb->list_lock and verifies that the association hasn't changed
      inbetween.  As the association will be protected by both locks among
      other things, this guarantees that the wb is the inode's associated wb
      until the list_lock is released.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      87e1d789
    • T
      writeback: implement foreign cgroup inode detection · 2a814908
      Tejun Heo 提交于
      As concurrent write sharing of an inode is expected to be very rare
      and memcg only tracks page ownership on first-use basis severely
      confining the usefulness of such sharing, cgroup writeback tracks
      ownership per-inode.  While the support for concurrent write sharing
      of an inode is deemed unnecessary, an inode being written to by
      different cgroups at different points in time is a lot more common,
      and, more importantly, charging only by first-use can too readily lead
      to grossly incorrect behaviors (single foreign page can lead to
      gigabytes of writeback to be incorrectly attributed).
      
      To resolve this issue, cgroup writeback detects the majority dirtier
      of an inode and will transfer the ownership to it.  To avoid
      unnnecessary oscillation, the detection mechanism keeps track of
      history and gives out the switch verdict only if the foreign usage
      pattern is stable over a certain amount of time and/or writeback
      attempts.
      
      The detection mechanism has fairly low space and computation overhead.
      It adds 8 bytes to struct inode (one int and two u16's) and minimal
      amount of calculation per IO.  The detection mechanism converges to
      the correct answer usually in several seconds of IO time when there's
      a clear majority dirtier.  Even when there isn't, it can reach an
      acceptable answer fairly quickly under most circumstances.
      
      Please see wb_detach_inode() for more details.
      
      This patch only implements detection.  Following patches will
      implement actual switching.
      
      v2: wbc_account_io() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2a814908
    • T
      writeback: make writeback_control track the inode being written back · b16b1deb
      Tejun Heo 提交于
      Currently, for cgroup writeback, the IO submission paths directly
      associate the bio's with the blkcg from inode_to_wb_blkcg_css();
      however, it'd be necessary to keep more writeback context to implement
      foreign inode writeback detection.  wbc (writeback_control) is the
      natural fit for the extra context - it persists throughout the
      writeback of each inode and is passed all the way down to IO
      submission paths.
      
      This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
      wbc_attach_fdatawrite_inode() which are used to associate wbc with the
      inode being written back.  IO submission paths now use wbc_init_bio()
      instead of directly associating bio's with blkcg themselves.  This
      leaves inode_to_wb_blkcg_css() w/o any user.  The function is removed.
      
      wbc currently only tracks the associated wb (bdi_writeback).  Future
      patches will add more for foreign inode detection.  The association is
      established under i_lock which will be depended upon when migrating
      foreign inodes to other wb's.
      
      As currently, once established, inode to wb association never changes,
      going through wbc when initializing bio's doesn't cause any behavior
      changes.
      
      v2: submit_blk_blkcg() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b16b1deb
    • T
      writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() · 21c6321f
      Tejun Heo 提交于
      Currently, majority of cgroup writeback support including all the
      above functions are implemented in include/linux/backing-dev.h and
      mm/backing-dev.c; however, the portion closely related to writeback
      logic implemented in include/linux/writeback.h and mm/page-writeback.c
      will expand to support foreign writeback detection and correction.
      
      This patch moves wb[_try]_get() and wb_put() to
      include/linux/backing-dev-defs.h so that they can be used from
      writeback.h and inode_{attach|detach}_wb() to writeback.h and
      page-writeback.c.
      
      This is pure reorganization and doesn't introduce any functional
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      21c6321f
    • T
      mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use · 97c9341f
      Tejun Heo 提交于
      Because writeback wasn't cgroup aware before, the usual dirty
      throttling mechanism in balance_dirty_pages() didn't work for
      processes under memcg limit.  The writeback path didn't know how much
      memory is available or how fast the dirty pages are being written out
      for a given memcg and balance_dirty_pages() didn't have any measure of
      IO back pressure for the memcg.
      
      To work around the issue, memcg implemented an ad-hoc dirty throttling
      mechanism in the direct reclaim path by stalling on pages under
      writeback which are encountered during direct reclaim scan.  This is
      rather ugly and crude - none of the configurability, fairness, or
      bandwidth-proportional distribution of the normal path.
      
      The previous patches implemented proper memcg aware dirty throttling
      when cgroup writeback is in use making the ad-hoc mechanism
      unnecessary.  This patch disables direct reclaim stalling for such
      case.
      
      Note: I disabled the parts which seemed obvious and it behaves fine
            while testing but my understanding of this code path is
            rudimentary and it's quite possible that I got something wrong.
            Please let me know if I got some wrong or more global_reclaim()
            sites should be updated.
      
      v2: The original patch removed the direct stalling mechanism which
          breaks legacy hierarchies.  Conditionalize instead of removing.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      97c9341f
    • T
      writeback: implement memcg writeback domain based throttling · c2aa723a
      Tejun Heo 提交于
      While cgroup writeback support now connects memcg and blkcg so that
      writeback IOs are properly attributed and controlled, the IO back
      pressure propagation mechanism implemented in balance_dirty_pages()
      and its subroutines wasn't aware of cgroup writeback.
      
      Processes belonging to a memcg may have access to only subset of total
      memory available in the system and not factoring this into dirty
      throttling rendered it completely ineffective for processes under
      memcg limits and memcg ended up building a separate ad-hoc degenerate
      mechanism directly into vmscan code to limit page dirtying.
      
      The previous patches updated balance_dirty_pages() and its subroutines
      so that they can deal with multiple wb_domain's (writeback domains)
      and defined per-memcg wb_domain.  Processes belonging to a non-root
      memcg are bound to two wb_domains, global wb_domain and memcg
      wb_domain, and should be throttled according to IO pressures from both
      domains.  This patch updates dirty throttling code so that it repeats
      similar calculations for the two domains - the differences between the
      two are few and minor - and applies the lower of the two sets of
      resulting constraints.
      
      wb_over_bg_thresh(), which controls when background writeback
      terminates, is also updated to consider both global and memcg
      wb_domains.  It returns true if dirty is over bg_thresh for either
      domain.
      
      This makes the dirty throttling mechanism operational for memcg
      domains including writeback-bandwidth-proportional dirty page
      distribution inside them but the ad-hoc memcg throttling mechanism in
      vmscan is still in place.  The next patch will rip it out.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c2aa723a
    • T
      writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes · 2529bb3a
      Tejun Heo 提交于
      The amount of available memory to a memcg wb_domain can change as
      memcg configuration changes.  A domain's ->dirty_limit exists to
      smooth out sudden drops in dirty threshold; however, when a domain's
      size actually drops significantly, it hinders the dirty throttling
      from adjusting to the new configuration leading to unexpected
      behaviors including unnecessary OOM kills.
      
      This patch resolves the issue by adding wb_domain_size_changed() which
      resets ->dirty_limit[_tstmp] and making memcg call it on configuration
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2529bb3a
    • T
      writeback: implement memcg wb_domain · 841710aa
      Tejun Heo 提交于
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportinately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      The previous patches laid the groundwork to support the two wb_domains
      and this patch implements memcg wb_domain.  memcg->cgwb_domain is
      initialized on css online and destroyed on css release,
      wb->memcg_completions is added, and __wb_writeout_inc() is updated to
      increment completions against both global and memcg wb_domains.
      
      The following patches will update balance_dirty_pages() and its
      subroutines to actually consider memcg wb_domain for throttling.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      841710aa
    • T
      writeback: update wb_over_bg_thresh() to use wb_domain aware operations · 947e9762
      Tejun Heo 提交于
      wb_over_bg_thresh() currently uses global_dirty_limits() and
      wb_dirty_limit() both of which are wrappers around operations which
      take dirty_throttle_control.  For cgroup writeback support, the
      function will be updated to also consider memcg wb_domains which
      requires the context information carried in dirty_throttle_control.
      
      This patch updates wb_over_bg_thresh() so that it uses the underlying
      wb_domain aware operations directly and builds the global
      dirty_throttle_control in the process.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      947e9762
    • T
      writeback: move over_bground_thresh() to mm/page-writeback.c · aa661bbe
      Tejun Heo 提交于
      and rename it to wb_over_bg_thresh().  The function is closely tied to
      the dirty throttling mechanism implemented in page-writeback.c.  This
      relocation will allow future updates necessary for cgroup writeback
      support.
      
      While at it, add function comment.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      aa661bbe
    • T
      writeback: separate out domain_dirty_limits() · 9fc3a43e
      Tejun Heo 提交于
      global_dirty_limits() calculates thresh and bg_thresh (confusingly
      called *pdirty and *pbackground in the function) assuming
      global_wb_domain; however, cgroup writeback support requires
      considering per-memcg wb_domain too.
      
      This patch separates out domain_dirty_limits() which takes
      dirty_throttle_control out of global_dirty_limits().  As thresh and
      bg_thresh calculation needs the amount of dirtyable memory in the
      domain, dirty_throttle_control->avail is added.  The new function
      calculates the two thresholds and store them directly in the
      dirty_throttle_control.
      
      Also, as memcg domains can't follow vm_dirty_bytes and
      dirty_background_bytes settings directly.  If those are set and
      domain_dirty_limits() is invoked for a !global domain, the settings
      are translated to ratios by scaling them against globally available
      memory.  dirty_throttle_control->gdtc is added to enable this when
      CONFIG_CGROUP_WRITEBACK.
      
      global_dirty_limits() is now a thin wrapper around
      domain_dirty_limits() and balance_dirty_pages() is updated to use the
      new function too.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9fc3a43e
    • T
      writeback: make __wb_writeout_inc() and hard_dirty_limit() take wb_domaas a parameter · c7981433
      Tejun Heo 提交于
      Currently __wb_writeout_inc() and hard_dirty_limit() assume
      global_wb_domain; however, cgroup writeback support requires
      considering per-memcg wb_domain too.
      
      This patch separates out domain-specific part of __wb_writeout_inc()
      into wb_domain_writeout_inc() which takes wb_domain as a parameter and
      adds the parameter to hard_dirty_limit().  This will allow these two
      functions to handle per-memcg wb_domains.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c7981433
    • T
      writeback: add dirty_throttle_control->dom · e9f07dfd
      Tejun Heo 提交于
      Currently all dirty throttle operations use global_wb_domain; however,
      cgroup writeback support requires considering per-memcg wb_domain too.
      This patch adds dirty_throttle_control->dom and updates functions
      which are directly using globabl_wb_domain to use it instead.
      
      As this makes global_update_bandwidth() a misnomer, the function is
      renamed to domain_update_bandwidth().
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e9f07dfd
    • T
      writeback: add dirty_throttle_control->wb_completions · e9770b34
      Tejun Heo 提交于
      wb->completions measures the wb's proportional write bandwidth in
      global_wb_domain and thus naturally tied to the wb_domain.  This patch
      adds dirty_throttle_control->wb_completions which is initialized to
      wb->completions by GDTC_INIT() and updates __wb_dirty_limits() to use
      it instead of dereferencing wb->completions directly.
      
      This will allow dirty_throttle_control to represent different
      wb_domains and the matching wb completions.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e9770b34
    • T
      writeback: add dirty_throttle_control->pos_ratio · daddfa3c
      Tejun Heo 提交于
      wb_position_ratio() is used to calculate pos_ratio, which is used for
      two purposes.  wb_update_dirty_ratelimit() uses it to adjust
      wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to
      immediately adjust dirty_ratelimit right before applying it to
      determine pause duration.
      
      While wb_update_dirty_ratelimit() is separately rate limited from
      balance_dirty_pages(), on the run where the ratelimit is updated, we
      end up calculating pos_ratio twice with the same parameters.
      
      This patch adds dirty_throttle_control->pos_ratio.
      balance_dirty_pages() calculates it once per run and
      wb_update_dirty_ratelimit() uses the value stored in
      dirty_throttle_control.
      
      This removes the duplicate calculation and also will help implementing
      memcg wb_domain.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      daddfa3c
    • T
      writeback: make __wb_calc_thresh() take dirty_throttle_control · b1cbc6d4
      Tejun Heo 提交于
      wb_calc_thresh() calculates wb_thresh by scaling thresh according to
      the wb's portion in the system-wide write bandwidth.  cgroup writeback
      support would need to calculate wb_thresh against memcg domain too.
      This patch renames wb_calc_thresh() to __wb_calc_thresh() and makes it
      take dirty_throttle_control so that the function can later be updated
      to calculate against different domains according to
      dirty_throttle_control.
      
      wb_calc_thresh() is now a thin wrapper around __wb_calc_thresh().
      
      v2: The original version was incorrectly scaling dtc->dirty instead of
          dtc->thresh.  This was due to the extremely confusing function and
          variable names.  Added a rename patch and fixed this one.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b1cbc6d4
    • T
      writeback: add dirty_throttle_control->wb_bg_thresh · 970fb01a
      Tejun Heo 提交于
      wb_bg_thresh is currently treated as a second-class citizen.  It's
      only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages()
      doesn't calculate it unless the cap is set.  When the cap is set, the
      calculated value is not passed around but instead recalculated
      whenever it's used.
      
      wb_position_ratio() calculates it by scaling wb_thresh proportional to
      bg_thresh / thresh.  wb_update_dirty_ratelimit() uses wb_dirty_limit()
      on bg_thresh, which should generally lead to a similar result as the
      proportional scaling but can also be way off in the presence of
      max/min_ratio settings.
      
      Avoiding wb_bg_thresh calculation saves us one u64 multiplication and
      divsion when BDI_CAP_STRICTLIMIT is not set.  Given that
      balance_dirty_pages() is already ratelimited, this doesn't justify the
      incurred extra complexity.
      
      This patch adds wb_bg_thresh to dirty_throttle_control and makes
      wb_dirty_limits() always calculate it and updates the users to use the
      pre-calculated value.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      970fb01a
    • T
      writeback: consolidate dirty throttle parameters into dirty_throttle_control · 2bc00aef
      Tejun Heo 提交于
      Dirty throttling implemented in balance_dirty_pages() and its
      subroutines makes use of a number of parameters which are passed
      around individually.  This renders these functions somewhat unwieldy
      and makes it difficult to add or change the involved parameters.  Also
      some functions use different or conflicting naming schemes for the
      same parameters making the code confusing to follow.
      
      This patch consolidates the main parameters into struct
      dirty_throttle_control so that they can be passed around easily and
      adding new paramters isn't painful.  This also unifies how a given
      parameter is named and accessed.  The drawback of using this type of
      control structure rather than explicit paramters is that it isn't
      immediately obvious which function accesses and modifies what;
      however, it's fairly clear that the benefits outweigh in this case.
      
      GDTC_INIT() macro is provided to ease initializing
      dirty_throttle_control for the global_wb_domain and
      balance_dirty_pages() uses a separate pointer to point to its global
      dirty_throttle_control.  This is to make it uniform with memcg domain
      handling which will be added later.
      
      This patch doesn't introduce any behavioral changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2bc00aef
    • T
      writeback: move global_dirty_limit into wb_domain · dcc25ae7
      Tejun Heo 提交于
      This patch is a part of the series to define wb_domain which
      represents a domain that wb's (bdi_writeback's) belong to and are
      measured against each other in.  This will enable IO backpressure
      propagation for cgroup writeback.
      
      global_dirty_limit exists to regulate the global dirty threshold which
      is a property of the wb_domain.  This patch moves hard_dirty_limit,
      dirty_lock, and update_time into wb_domain.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dcc25ae7
    • T
      writeback: implement wb_domain · 380c27ca
      Tejun Heo 提交于
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportinately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      Currently, what constitutes the global writeback domain are scattered
      across a number of global states.  This patch starts collecting them
      into struct wb_domain.
      
      * fprop_global which serves as the basis for proportional bandwidth
        measurement and its period timer are moved into struct wb_domain.
      
      * global_wb_domain hosts the states for the global domain.
      
      * While at it, flatten wb_writeout_fraction() into its callers.  This
        thin wrapper doesn't provide any actual benefits while getting in
        the way.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      380c27ca
    • T
      writeback: reorganize [__]wb_update_bandwidth() · 8a731799
      Tejun Heo 提交于
      __wb_update_bandwidth() is called from two places -
      fs/fs-writeback.c::balance_dirty_pages() and
      mm/page-writeback.c::wb_writeback().  The latter updates only the
      write bandwidth while the former also deals with the dirty ratelimit.
      The two callsites are distinguished by whether @thresh parameter is
      zero or not, which is cryptic.  In addition, the two files define
      their own different versions of wb_update_bandwidth() on top of
      __wb_update_bandwidth(), which is confusing to say the least.  This
      patch cleans up [__]wb_update_bandwidth() in the following ways.
      
      * __wb_update_bandwidth() now takes explicit @update_ratelimit
        parameter to gate dirty ratelimit handling.
      
      * mm/page-writeback.c::wb_update_bandwidth() is flattened into its
        caller - balance_dirty_pages().
      
      * fs/fs-writeback.c::wb_update_bandwidth() is moved to
        mm/page-writeback.c and __wb_update_bandwidth() is made static.
      
      * While at it, add a lockdep assertion to __wb_update_bandwidth().
      
      Except for the lockdep addition, this is pure reorganization and
      doesn't introduce any behavioral changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8a731799
    • T
      writeback: clean up wb_dirty_limit() · 0d960a38
      Tejun Heo 提交于
      The function name wb_dirty_limit(), its argument @dirty and the local
      variable @wb_dirty are mortally confusing given that the function
      calculates per-wb threshold value not dirty pages, especially given
      that @dirty and @wb_dirty are used elsewhere for dirty pages.
      
      Let's rename the function to wb_calc_thresh() and wb_dirty to
      wb_thresh.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0d960a38
    • T
      memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online · 733a572e
      Tejun Heo 提交于
      cpu_possible_mask represents the CPUs which are actually possible
      during that boot instance.  For systems which don't support CPU
      hotplug, this will match cpu_online_mask exactly in most cases.  Even
      for systems which support CPU hotplug, the number of possible CPU
      slots is highly unlikely to diverge greatly from the number of online
      CPUs.  The only cases where the difference between possible and online
      caused problems were when the boot code failed to initialize the
      possible mask and left it fully set at NR_CPUS - 1.
      
      As such, most per-cpu constructs allocate for all possible CPUs and
      often iterate over the possibles, which also has the benefit of
      avoiding the blocking CPU hotplug synchronization.
      
      memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
      mem_cgroup_read_events(), which iterates over online CPUs and handles
      CPU hotplug operations explicitly.  This complexity doesn't actually
      buy anything.  Switch to iterating over the possibles and drop the
      explicit CPU hotplug handling.
      
      Eventually, we want to convert memcg to use percpu_counter instead of
      its own custom implementation which also benefits from quick access
      w/o summing for cases where larger error margin is acceptable.
      
      This will allow mem_cgroup_read_stat() to be called from non-sleepable
      contexts which will be used by cgroup writeback.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      733a572e
    • T
      ext2: enable cgroup writeback support · 108dad65
      Tejun Heo 提交于
      Writeback now supports cgroup writeback and the generic writeback,
      buffer, libfs, and mpage helpers that ext2 uses are all updated to
      work with cgroup writeback.
      
      This patch enables cgroup writeback for ext2 by adding
      FS_CGROUP_WRITEBACK to its ->fs_flags.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-ext4@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      108dad65
    • T
      mpage: make __mpage_writepage() honor cgroup writeback · 429b3fb0
      Tejun Heo 提交于
      __mpage_writepage() is used to implement mpage_writepages() which in
      turn is used for ->writepages() of various filesystems.  All writeback
      logic is now updated to handle cgroup writeback and the block cgroup
      to issue IOs for is encoded in writeback_control and can be retrieved
      from the inode; however, __mpage_writepage() currently ignores the
      blkcg indicated by the inode and issues all bio's without explicit
      blkcg association.
      
      This patch updates __mpage_writepage() so that the issued bio's are
      associated with inode_to_writeback_blkcg_css(inode).
      
      v2: Updated for per-inode wb association.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      429b3fb0
    • T
      buffer, writeback: make __block_write_full_page() honor cgroup writeback · bafc0dba
      Tejun Heo 提交于
      [__]block_write_full_page() is used to implement ->writepage in
      various filesystems.  All writeback logic is now updated to handle
      cgroup writeback and the block cgroup to issue IOs for is encoded in
      writeback_control and can be retrieved from the inode; however,
      [__]block_write_full_page() currently ignores the blkcg indicated by
      inode and issues all bio's without explicit blkcg association.
      
      This patch adds submit_bh_blkcg() which associates the bio with the
      specified blkio cgroup before issuing and uses it in
      __block_write_full_page() so that the issued bio's are associated with
      inode_to_wb_blkcg_css(inode).
      
      v2: Updated for per-inode wb association.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bafc0dba
    • T
      writeback: dirty inodes against their matching cgroup bdi_writeback's · 0747259d
      Tejun Heo 提交于
      __mark_inode_dirty() always dirtied the inode against the root wb
      (bdi_writeback).  The previous patches added all the infrastructure
      necessary to attribute an inode against the wb of the dirtying cgroup.
      
      This patch updates __mark_inode_dirty() so that it uses the wb
      associated with the inode instead of unconditionally using the root
      one.
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being dirtied against the root wb.
      
      v2: Updated for per-inode wb association.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0747259d
    • T
      writeback: make writeback initiation functions handle multiple bdi_writeback's · db125360
      Tejun Heo 提交于
      [try_]writeback_inodes_sb[_nr]() and sync_inodes_sb() currently only
      handle dirty inodes on the root wb (bdi_writeback) of the target bdi.
      This patch implements bdi_split_work_to_wbs() and use it to make these
      functions handle multiple wb's.
      
      bdi_split_work_to_wbs() takes a base wb_writeback_work and create
      clones of it and issue them to the wb's of the target bdi.  The base
      work's nr_pages is distributed using wb_split_bdi_pages() -
      ie. according to each wb's write bandwidth's proportion in the bdi.
      
      Cloning a bdi involves memory allocation which may fail.  In such
      cases, bdi_split_work_to_wbs() issues the base work directly and waits
      for its completion before proceeding to the next wb to guarantee
      forward progress and correctness under memory pressure.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      db125360
    • T
      writeback: restructure try_writeback_inodes_sb[_nr]() · f30a7d0c
      Tejun Heo 提交于
      try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it
      handles s_umount locking and skips if writeback is already in
      progress.  The in progress test is performed on the root wb
      (bdi_writeback) which isn't sufficient for cgroup writeback support.
      The test must be done per-wb.
      
      To prepare for the change, this patch factors out
      __writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds
      @skip_if_busy and moves the in progress test right before queueing the
      wb_writeback_work.  try_writeback_inodes_sb_nr() now just grabs
      s_umount and invokes __writeback_inodes_sb_nr() with asserted
      @skip_if_busy.  This way, later addition of multiple wb handling can
      skip only the wb's which already have writeback in progress.
      
      This swaps the order between in progress test and s_umount test which
      can flip the return value when writeback is in progress and s_umount
      is being held by someone else but this shouldn't cause any meaningful
      difference.  It's a fringe condition and the return value is an
      unsynchronized hint anyway.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f30a7d0c
    • T
      writeback: implement wb_wait_for_single_work() · 98754bf7
      Tejun Heo 提交于
      For cgroup writeback, multiple wb_writeback_work items may need to be
      issuedto accomplish a single task.  The previous patch updated the
      waiting mechanism such that wb_wait_for_completion() can wait for
      multiple work items.
      
      Issuing mulitple work items involves memory allocation which may fail.
      As most writeback operations can't fail or blocked on memory
      allocation, in such cases, we'll fall back to sequential issuing of an
      on-stack work item, which would need to be waited upon sequentially.
      
      This patch implements wb_wait_for_single_work() which waits for a
      single work item independently from wb_completion waiting so that such
      fallback mechanism can be used without getting tangled with the usual
      issuing / completion operation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      98754bf7
    • T
      writeback: implement bdi_wait_for_completion() · cc395d7f
      Tejun Heo 提交于
      If the completion of a wb_writeback_work can be waited upon by setting
      its ->done to a struct completion and waiting on it; however, for
      cgroup writeback support, it's necessary to issue multiple work items
      to multiple bdi_writebacks and wait for the completion of all.
      
      This patch implements wb_completion which can wait for multiple work
      items and replaces the struct completion with it.  It can be defined
      using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
      waited for by wb_wait_for_completion().
      
      Nobody currently issues multiple work items and this patch doesn't
      introduce any behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      cc395d7f
    • T
      writeback: add wb_writeback_work->auto_free · ac7b19a3
      Tejun Heo 提交于
      Currently, a wb_writeback_work is freed automatically on completion if
      it doesn't have ->done set.  Add wb_writeback_work->auto_free to make
      the switch explicit.  This will help cgroup writeback support where
      waiting for completion and whether to free automatically don't
      necessarily move together.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ac7b19a3
    • T
      writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's · 001fe6f6
      Tejun Heo 提交于
      wakeup_dirtytime_writeback() currently only starts writeback on the
      root wb (bdi_writeback).  For cgroup writeback support, update the
      function to check all wbs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      001fe6f6
    • T
      writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's · f2b65121
      Tejun Heo 提交于
      wakeup_flusher_threads() currently only starts writeback on the root
      wb (bdi_writeback).  For cgroup writeback support, update the function
      to wake up all wbs and distribute the number of pages to write
      according to the proportion of each wb's write bandwidth, which is
      implemented in wb_split_bdi_pages().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f2b65121
    • T
      writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info · 9ecf4866
      Tejun Heo 提交于
      bdi_start_background_writeback() currently takes @bdi and kicks the
      root wb (bdi_writeback).  In preparation for cgroup writeback support,
      make it take wb instead.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9ecf4866
    • T
      writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info · bc05873d
      Tejun Heo 提交于
      writeback_in_progress() currently takes @bdi and returns whether
      writeback is in progress on its root wb (bdi_writeback).  In
      preparation for cgroup writeback support, make it take wb instead.
      While at it, make it an inline function.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bc05873d