1. 25 Sep, 2015 (1 commit)
  2. 19 Aug, 2015 (2 commits)
    • blkcg: rename subsystem name from blkio to io · c165b3e3
      Committed by Tejun Heo
      blkio interface has become messy over time and is currently the
      largest.  In addition to the inconsistent naming scheme, it has
      multiple stat files which report more or less the same thing, a number
      of debug stat files which expose internal details which shouldn't have
      been part of the public interface in the first place, recursive and
      non-recursive stats and leaf and non-leaf knobs.
      
      Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
      don't make any sense on the unified hierarchy as only leaf cgroups can
      contain processes.  cgroups is going through a major interface
      revision with the unified hierarchy involving significant fundamental
      usage changes and given that a significant portion of the interface
      doesn't make sense anymore, it's a good time to reorganize the
      interface.
      
      As the first step, this patch renames the external visible subsystem
      name from "blkio" to "io".  This is more concise, matches the other
      two major subsystem names, "cpu" and "memory", and better suited as
      blkcg will be involved in anything writeback related too whether an
      actual block device is involved or not.
      
      As the subsystem legacy_name is set to "blkio", the only userland
      visible change outside the unified hierarchy is that blkcg is reported
      as "io" instead of "blkio" in the subsystem initialized message during
      boot.  On the unified hierarchy, blkcg now appears as "io".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c165b3e3
    • writeback: bdi_for_each_wb() iteration is memcg ID based not blkcg · 1ed8d48c
      Committed by Tejun Heo
      wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
      earlier implementation, wb's were keyed by blkcg ID.
      bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
      allows iterations to start from an arbitrary ID which is used to
      interrupt and resume iterations.
      
      Unfortunately, while changing wb to be keyed by memcg ID instead of
      blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
      are keyed by blkcg ID.  This doesn't affect iterations which don't get
      interrupted but bdi_split_work_to_wbs() makes use of iteration
      resuming on allocation failures and thus may incorrectly skip or
      repeat wb's.
      
      Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
      blkcg IDs and updating bdi_split_work_to_wbs() accordingly.
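
      The resumable, ID-keyed walk can be modeled in a few lines of
      plain C (a sorted array stands in for the bdi->cgwb_tree radix
      tree; all names here are illustrative, not the kernel API):

      #include <stdio.h>

      struct wb { int memcg_id; };

      /* wb's sorted by memcg ID, standing in for bdi->cgwb_tree */
      static struct wb wbs[] = { {1}, {3}, {4}, {7} };
      #define NR_WBS (sizeof(wbs) / sizeof(wbs[0]))

      int main(void)
      {
          int next_id = 0;   /* resume cursor: a memcg ID, not a blkcg ID */

      restart:
          for (size_t i = 0; i < NR_WBS; i++) {
              struct wb *wb = &wbs[i];

              if (wb->memcg_id < next_id)
                  continue;  /* already visited before the restart */

              if (wb->memcg_id == 3 && next_id == 0) {
                  /* simulated allocation failure: remember where to
                   * resume, then restart the walk from there */
                  next_id = wb->memcg_id;
                  goto restart;
              }
              printf("visiting wb for memcg %d\n", wb->memcg_id);
          }
          return 0;
      }
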
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1ed8d48c
  3. 02 Jul, 2015 (1 commit)
    • writeback: don't embed root bdi_writeback_congested in bdi_writeback · a13f35e8
      Committed by Tejun Heo
      52ebea74 ("writeback: make backing_dev_info host cgroup-specific
      bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
      (bdi_writeback's).  As the congested state needs to be per-wb and
      referenced from blkcg side and multiple wbs, the patch made all
      non-root cong's (bdi_writeback_congested's) reference counted and
      indexed on bdi.
      
      When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
      non-root cong's; however, this can hang indefinitely because wb's can
      also be referenced from blkcg_gq's which are destroyed after bdi
      destruction is complete.
      
      To fix the bug, bdi destruction will be updated to not wait for cong's
      to drain, which naturally means that cong's may outlive the associated
      bdi.  This is fine for non-root cong's but is problematic for the root
      cong's which are embedded in their bdi's as they may end up getting
      dereferenced after the containing bdi's are freed.
      
      This patch makes root cong's behave the same as non-root cong's.  They
      are no longer embedded in their bdi's but allocated separately during
      bdi initialization, indexed and reference counted the same way.
      
      * As cong handling is the same for all wb's, wb->congested
        initialization is moved into wb_init().
      
      * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
        bdi->wb_congested is now a pointer pointing to the root cong
        allocated during bdi init and minimal refcnting operations are
        implemented.
      
      * The above makes root wb init paths diverge depending on
        CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().
      
      This patch in itself shouldn't cause any consequential behavior
      differences but prepares for the actual fix.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jon Christopherson <jon@jons.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
      Tested-by: Jon Christopherson <jon@jons.org>

      Added <linux/slab.h> include to backing-dev.h for kfree() definition.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a13f35e8
  4. 18 Jun, 2015 (1 commit)
    • vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB · 46b15caa
      Committed by Tejun Heo
      FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
      cgroup writeback; however, different super_blocks of the same
      file_system_type may or may not support cgroup writeback depending on
      filesystem options.  This patch replaces FS_CGROUP_WRITEBACK with a
      per-super_block flag.
      
      super_block->s_flags carries some internal flags in the high bits, but
      it's exposed to userland through the uapi header and is running out of
      space anyway.  This patch adds a new field, super_block->s_iflags, to carry
      kernel-internal flags.  It is currently only used by the new
      SB_I_CGROUPWB flag whose concatenated and abbreviated name is for
      consistency with other super_block flags.
      
      ext2_fill_super() is updated to set SB_I_CGROUPWB.
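
      A minimal sketch of the new split (stand-in types; the flag value
      is an assumption, only the field and flag names follow the patch):

      #include <stdio.h>

      #define SB_I_CGROUPWB 0x00000001  /* assumed bit value */

      struct super_block {
          unsigned long s_flags;   /* exposed to userland via uapi */
          unsigned long s_iflags;  /* kernel-internal only */
      };

      static void ext2_like_fill_super(struct super_block *sb)
      {
          sb->s_iflags |= SB_I_CGROUPWB;  /* opt into cgroup writeback */
      }

      int main(void)
      {
          struct super_block sb = {0};

          ext2_like_fill_super(&sb);
          printf("cgroup wb: %s\n",
                 (sb.s_iflags & SB_I_CGROUPWB) ? "yes" : "no");
          return 0;
      }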
      
      v2: Added super_block->s_iflags instead of stealing another high bit
          from sb->s_flags as suggested by Christoph and Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-ext4@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      46b15caa
  5. 02 Jun, 2015 (22 commits)
    • writeback: add lockdep annotation to inode_to_wb() · aaa2cacf
      Committed by Tejun Heo
      With the previous three patches, all operations which acquire wb from
      inode are either under one of inode->i_lock, mapping->tree_lock or
      wb->list_lock or protected by unlocked_inode_to_wb transaction.  This
      will be depended upon by foreign inode wb switching.
      
      This patch adds a lockdep assertion to inode_to_wb() so that usages
      outside the above listed locks can be caught easily.  There are three
      exceptions.
      
      * locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the
        wb may not be the inode's.  Ensuring that is the function's role
        after all.  Updated to deref inode->i_wb directly.
      
      * inode_wb_stat_unlocked_begin() is usually protected by combination
        of !I_WB_SWITCH and rcu_read_lock().  Updated to deref inode->i_wb
        directly.
      
      * inode_congested() wants to test whether inode->i_wb is set before
        starting the transaction.  Added inode_to_wb_is_valid() which tests
        inode->i_wb directly.
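
      A rough user-space model of what the assertion enforces (assert()
      stands in for lockdep_assert_held(); the lock booleans and types
      are stand-ins, not kernel structures):

      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct bdi_writeback { int id; };
      struct inode {
          bool i_lock_held, tree_lock_held, list_lock_held;
          struct bdi_writeback *i_wb;
      };

      /* deref is only legal while one of the covering locks is held */
      static struct bdi_writeback *inode_to_wb(const struct inode *inode)
      {
          assert(inode->i_lock_held || inode->tree_lock_held ||
                 inode->list_lock_held);
          return inode->i_wb;
      }

      int main(void)
      {
          struct bdi_writeback wb = { 1 };
          struct inode inode = { .i_lock_held = true, .i_wb = &wb };

          printf("wb id %d\n", inode_to_wb(&inode)->id);
          return 0;
      }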
      
      v5: might_lock() removed.  It annotates that the lock is grabbed w/
          irq enabled, which isn't the case, and triggers lockdep warnings
          spuriously.
      
      v4: might_lock() added to unlocked_inode_to_wb_begin().
      
      v3: inode_congested() conversion added.
      
      v2: locked_inode_to_wb_and_lock_list() was missing in the first
          version.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      aaa2cacf
    • writeback: implement unlocked_inode_to_wb transaction and use it for stat updates · 682aa8e1
      Committed by Tejun Heo
      The mechanism for detecting whether an inode should switch its wb
      (bdi_writeback) association is now in place.  This patch builds the
      framework for the actual switching.
      
      This patch adds a new inode flag I_WB_SWITCHING, which has two
      functions.  First, the easy one, it ensures that there's only one
      switching in progress for a given inode.  Second, it's used as a
      mechanism to synchronize wb stat updates.
      
      The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
      but track the current number of dirty pages and pages under writeback
      respectively.  As such, when an inode is moved from one wb to another,
      the inode's portion of those stats have to be transferred together;
      unfortunately, this is a bit tricky as those stat updates are percpu
      operations which are performed without holding any lock in some
      places.
      
      This patch solves the problem in a similar way as memcg.  Each such
      lockless stat update is wrapped in a transaction surrounded by
      unlocked_inode_to_wb_begin/end().  During normal operation, they map
      to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
      mapping->tree_lock is grabbed across the transaction.
      
      In turn, the switching path sets I_WB_SWITCHING and waits for an RCU
      grace period to pass before actually starting to switch, which
      guarantees that all stat update paths are synchronizing against
      mapping->tree_lock.
      
      This patch still doesn't implement the actual switching.
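
      The transaction pattern can be modeled in user space as follows
      (a pthread mutex stands in for mapping->tree_lock and an atomic
      flag for I_WB_SWITCHING; the lock-free path corresponds to
      rcu_read_lock() in the kernel):

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      static atomic_bool wb_switching;   /* models I_WB_SWITCHING */
      static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;

      static bool wb_txn_begin(void)
      {
          bool locked = atomic_load(&wb_switching);

          if (locked)
              pthread_mutex_lock(&tree_lock);  /* mapping->tree_lock */
          /* else: rcu_read_lock() in the real code */
          return locked;
      }

      static void wb_txn_end(bool locked)
      {
          if (locked)
              pthread_mutex_unlock(&tree_lock);
      }

      int main(void)
      {
          bool locked = wb_txn_begin();

          printf("stat update under %s\n", locked ? "tree_lock" : "rcu");
          wb_txn_end(locked);
          return 0;
      }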
      
      v3: Updated on top of the recent cancel_dirty_page() updates.
          unlocked_inode_to_wb_begin() now nests inside
          mem_cgroup_begin_page_stat() to match the locking order.
      
      v2: The i_wb access transaction will be used for !stat accesses too.
          Function names and comments updated accordingly.
      
          s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
          s/switch_wb/switch_wbs/
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      682aa8e1
    • writeback: make writeback_control track the inode being written back · b16b1deb
      Committed by Tejun Heo
      Currently, for cgroup writeback, the IO submission paths directly
      associate the bio's with the blkcg from inode_to_wb_blkcg_css();
      however, it'd be necessary to keep more writeback context to implement
      foreign inode writeback detection.  wbc (writeback_control) is the
      natural fit for the extra context - it persists throughout the
      writeback of each inode and is passed all the way down to IO
      submission paths.
      
      This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
      wbc_attach_fdatawrite_inode() which are used to associate wbc with the
      inode being written back.  IO submission paths now use wbc_init_bio()
      instead of directly associating bio's with blkcg themselves.  This
      leaves inode_to_wb_blkcg_css() w/o any user.  The function is removed.
      
      wbc currently only tracks the associated wb (bdi_writeback).  Future
      patches will add more for foreign inode detection.  The association is
      established under i_lock which will be depended upon when migrating
      foreign inodes to other wb's.
      
      As currently, once established, inode to wb association never changes,
      going through wbc when initializing bio's doesn't cause any behavior
      changes.
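
      A simplified model of the association flow (stand-in types; only
      the function roles follow the patch):

      #include <stdio.h>

      struct bdi_writeback { int id; };
      struct inode { struct bdi_writeback *i_wb; };
      struct bio { struct bdi_writeback *wb; };
      struct writeback_control { struct bdi_writeback *wb; };

      /* models wbc_attach_and_unlock_inode(): record the inode's wb in
       * the wbc for the duration of the writeback */
      static void wbc_attach(struct writeback_control *wbc,
                             struct inode *inode)
      {
          wbc->wb = inode->i_wb;
      }

      /* models wbc_init_bio(): submission paths associate bios via the
       * wbc instead of reaching back into the inode */
      static void wbc_init_bio(struct writeback_control *wbc,
                               struct bio *bio)
      {
          if (wbc->wb)  /* may be unset on the direct pageout() path */
              bio->wb = wbc->wb;
      }

      int main(void)
      {
          struct bdi_writeback wb = { 7 };
          struct inode inode = { &wb };
          struct writeback_control wbc = { 0 };
          struct bio bio = { 0 };

          wbc_attach(&wbc, &inode);
          wbc_init_bio(&wbc, &bio);
          printf("bio issued for wb %d\n", bio.wb ? bio.wb->id : -1);
          return 0;
      }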
      
      v2: submit_blk_blkcg() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b16b1deb
    • writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() · 21c6321f
      Committed by Tejun Heo
      Currently, the majority of cgroup writeback support, including all
      the above functions, is implemented in include/linux/backing-dev.h
      and mm/backing-dev.c; however, the portion closely related to writeback
      logic implemented in include/linux/writeback.h and mm/page-writeback.c
      will expand to support foreign writeback detection and correction.
      
      This patch moves wb[_try]_get() and wb_put() to
      include/linux/backing-dev-defs.h so that they can be used from
      writeback.h, and moves inode_{attach|detach}_wb() to writeback.h and
      page-writeback.c.
      
      This is pure reorganization and doesn't introduce any functional
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      21c6321f
    • buffer, writeback: make __block_write_full_page() honor cgroup writeback · bafc0dba
      Committed by Tejun Heo
      [__]block_write_full_page() is used to implement ->writepage in
      various filesystems.  All writeback logic is now updated to handle
      cgroup writeback and the block cgroup to issue IOs for is encoded in
      writeback_control and can be retrieved from the inode; however,
      [__]block_write_full_page() currently ignores the blkcg indicated by
      inode and issues all bio's without explicit blkcg association.
      
      This patch adds submit_bh_blkcg() which associates the bio with the
      specified blkio cgroup before issuing and uses it in
      __block_write_full_page() so that the issued bio's are associated with
      inode_to_wb_blkcg_css(inode).
      
      v2: Updated for per-inode wb association.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      bafc0dba
    • writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info · 9ecf4866
      Committed by Tejun Heo
      bdi_start_background_writeback() currently takes @bdi and kicks the
      root wb (bdi_writeback).  In preparation for cgroup writeback support,
      make it take wb instead.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9ecf4866
    • writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info · bc05873d
      Committed by Tejun Heo
      writeback_in_progress() currently takes @bdi and returns whether
      writeback is in progress on its root wb (bdi_writeback).  In
      preparation for cgroup writeback support, make it take wb instead.
      While at it, make it an inline function.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      bc05873d
    • writeback: remove bdi_start_writeback() · c00ddad3
      Committed by Tejun Heo
      bdi_start_writeback() is a thin wrapper on top of
      __wb_start_writeback() which is used only by laptop_mode_timer_fn().
      This patch removes bdi_start_writeback(), renames
      __wb_start_writeback() to wb_start_writeback() and makes
      laptop_mode_timer_fn() use it instead.
      
      This doesn't cause any functional difference and will ease making
      laptop_mode_timer_fn() cgroup writeback aware.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c00ddad3
    • writeback: implement bdi_for_each_wb() · ebe41ab0
      Committed by Tejun Heo
      This will be used to implement bdi-wide operations which should be
      distributed across all its cgroup bdi_writebacks.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ebe41ab0
    • writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account · 95a46c65
      Committed by Tejun Heo
      bdi_has_dirty_io() used to only reflect whether the root wb
      (bdi_writeback) has dirty inodes.  For cgroup writeback support, it
      needs to take all active wb's into account.  If any wb on the bdi has
      dirty inodes, bdi_has_dirty_io() should return true.
      
      To achieve that, as inode_wb_list_{move|del}_locked() now keep track
      of the dirty state transition of each wb, the number of dirty wbs can
      be counted in the bdi; however, bdi is already aggregating
      wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
      there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
      dip below 1.  bdi_has_dirty_io() can simply test whether
      bdi->tot_write_bandwidth is zero or not.
      
      While this bumps the value of wb->avg_write_bandwidth to one when it
      used to be zero, this shouldn't cause any meaningful behavior
      difference.
      
      bdi_has_dirty_io() is made an inline function which tests whether
      ->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
      value are added to inode_wb_list_{move|del}_locked().
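
      The bookkeeping trick can be modeled as follows (illustrative
      types and helper names):

      #include <stdbool.h>
      #include <stdio.h>

      struct wb { long avg_write_bandwidth; };
      struct bdi { long tot_write_bandwidth; };

      static void wb_becomes_dirty(struct bdi *bdi, struct wb *wb)
      {
          if (wb->avg_write_bandwidth < 1)
              wb->avg_write_bandwidth = 1;  /* never allowed to hit 0 */
          bdi->tot_write_bandwidth += wb->avg_write_bandwidth;
      }

      static void wb_becomes_clean(struct bdi *bdi, struct wb *wb)
      {
          bdi->tot_write_bandwidth -= wb->avg_write_bandwidth;
      }

      /* the whole test: any dirty wb keeps the sum above zero */
      static bool bdi_has_dirty_io(const struct bdi *bdi)
      {
          return bdi->tot_write_bandwidth != 0;
      }

      int main(void)
      {
          struct bdi bdi = { 0 };
          struct wb wb = { 0 };

          wb_becomes_dirty(&bdi, &wb);
          printf("dirty: %d\n", bdi_has_dirty_io(&bdi));  /* 1 */
          wb_becomes_clean(&bdi, &wb);
          printf("dirty: %d\n", bdi_has_dirty_io(&bdi));  /* 0 */
          return 0;
      }
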
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      95a46c65
    • writeback: implement WB_has_dirty_io wb_state flag · d6c10f1f
      Committed by Tejun Heo
      Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
      has any dirty inode by testing all three IO lists on each invocation
      without actively keeping track.  For cgroup writeback support, a
      single bdi will host multiple wb's each of which will host dirty
      inodes separately and we'll need to make bdi_has_dirty_io(), which
      currently only represents the root wb, aggregate has_dirty_io from all
      member wb's, which requires tracking transitions in has_dirty_io state
      on each wb.
      
      This patch introduces inode_wb_list_{move|del}_locked() to consolidate
      IO list operations leaving queue_io() the only other function which
      directly manipulates IO lists (via move_expired_inodes()).  All three
      functions are updated to call wb_io_lists_[de]populated() which keep
      track of whether the wb has dirty inodes or not and record it using
      the new WB_has_dirty_io flag.  inode_wb_list_move_locked()'s return
      value indicates whether the wb had no dirty inodes before.
      
      mark_inode_dirty() is restructured so that the return value of
      inode_wb_list_move_locked() can be used for deciding whether to wake
      up the wb.
      
      While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
      These functions were returning 0 and 1 before.  Also, add a comment
      explaining the synchronization of wb_state flags.
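
      A small model of the flag maintenance and the clean-to-dirty
      return value (illustrative types and simplified list handling):

      #include <stdbool.h>
      #include <stdio.h>

      struct wb { int nr_dirty, nr_io, nr_more_io; bool has_dirty_io; };

      /* called when an inode lands on one of the IO lists; the return
       * value tells the caller a clean->dirty transition happened */
      static bool wb_io_lists_populated(struct wb *wb)
      {
          bool was_clean = !wb->has_dirty_io;

          wb->has_dirty_io = true;
          return was_clean;  /* true: caller should wake the wb */
      }

      static void wb_io_lists_depopulated(struct wb *wb)
      {
          if (!wb->nr_dirty && !wb->nr_io && !wb->nr_more_io)
              wb->has_dirty_io = false;
      }

      int main(void)
      {
          struct wb wb = { 0 };

          wb.nr_dirty++;  /* models inode_wb_list_move_locked() */
          if (wb_io_lists_populated(&wb))
              printf("wake up wb\n");
          wb.nr_dirty--;
          wb_io_lists_depopulated(&wb);
          printf("has_dirty_io: %d\n", wb.has_dirty_io);
          return 0;
      }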
      
      v2: Updated to accommodate b_dirty_time.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d6c10f1f
    • writeback: implement and use inode_congested() · 703c2708
      Committed by Tejun Heo
      In several places, bdi_congested() and its wrappers are used to
      determine whether more IOs should be issued.  With cgroup writeback
      support, this question can't be answered solely based on the bdi
      (backing_dev_info).  It's dependent on whether the filesystem and bdi
      support cgroup writeback and the blkcg the inode is associated with.
      
      This patch implements inode_congested() and its wrappers which take
      @inode and determine the congestion state considering cgroup
      writeback.  The new functions replace bdi_*congested() calls in places
      where the query is about a specific inode and task.
      
      There are several filesystem users which also fit this criterion, but
      they should be updated when each filesystem implements cgroup
      writeback support.
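
      A sketch of the per-inode query's shape (stand-in types;
      inode_congested() is the real name, the routing logic is
      simplified):

      #include <stdbool.h>
      #include <stdio.h>

      struct wb { bool congested; };
      struct bdi { struct wb root_wb; };
      struct inode { struct wb *i_wb; struct bdi *bdi; bool cgwb; };

      static bool inode_congested(const struct inode *inode)
      {
          if (inode->cgwb && inode->i_wb)
              return inode->i_wb->congested;    /* per-cgroup answer */
          return inode->bdi->root_wb.congested; /* bdi-wide answer */
      }

      int main(void)
      {
          struct bdi bdi = { { false } };
          struct wb cg_wb = { true };
          struct inode inode = { &cg_wb, &bdi, true };

          printf("congested: %d\n", inode_congested(&inode));
          return 0;
      }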
      
      v2: Now that a given inode is associated with only one wb, congestion
          state can be determined independent from the asking task.  Drop
          @task.  Spotted by Vivek.  Also, converted to take @inode instead
          of @mapping and renamed to inode_congested().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      703c2708
    • writeback: make congestion functions per bdi_writeback · ec8a6f26
      Committed by Tejun Heo
      Currently, all congestion functions take bdi (backing_dev_info) and
      always operate on the root wb (bdi->wb) and the congestion state from
      the block layer is propagated only for the root blkcg.  This patch
      introduces {set|clear}_wb_congested() and wb_congested() which take a
      bdi_writeback_congested and bdi_writeback respectively.  The bdi
      counterparts are now wrappers invoking the wb based functions on
      @bdi->wb.
      
      While converting clear_bdi_congested() to clear_wb_congested(), the
      local variable declaration order between @wqh and @bit is swapped for
      cosmetic reasons.
      
      This patch just adds the new wb based functions.  The following
      patches will apply them.
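
      The layering can be sketched as follows (simplified types; the
      WB_*_congested bit names follow the kernel, the rest is
      illustrative):

      #include <stdbool.h>
      #include <stdio.h>

      enum { WB_sync_congested = 1 << 0, WB_async_congested = 1 << 1 };

      struct wb_congested { unsigned state; };
      struct wb { struct wb_congested *congested; };
      struct bdi { struct wb wb; };  /* root wb */

      static void set_wb_congested(struct wb_congested *c, int bit)
      {
          c->state |= bit;
      }

      static bool wb_congested(const struct wb *wb, int bit)
      {
          return wb->congested->state & bit;
      }

      /* the bdi counterpart just forwards to the root wb */
      static bool bdi_congested(const struct bdi *bdi, int bit)
      {
          return wb_congested(&bdi->wb, bit);
      }

      int main(void)
      {
          struct wb_congested c = { 0 };
          struct bdi bdi = { { &c } };

          set_wb_congested(&c, WB_sync_congested);
          printf("bdi sync congested: %d\n",
                 bdi_congested(&bdi, WB_sync_congested));
          return 0;
      }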
      
      v2: Updated for bdi_writeback_congested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ec8a6f26
    • writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Committed by Tejun Heo
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
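
      A toy model of the on-demand, memcg-ID-keyed lookup (a fixed
      array stands in for the radix tree; names are illustrative):

      #include <stdio.h>
      #include <stdlib.h>

      #define MAX_MEMCG 16

      struct wb { int memcg_id; };
      struct bdi { struct wb *cgwb[MAX_MEMCG]; };

      /* models looking up / creating the cgwb while dirtying an inode */
      static struct wb *wb_get_create(struct bdi *bdi, int memcg_id)
      {
          if (!bdi->cgwb[memcg_id]) {
              struct wb *wb = calloc(1, sizeof(*wb));

              wb->memcg_id = memcg_id;
              bdi->cgwb[memcg_id] = wb;  /* created on first dirtying */
          }
          return bdi->cgwb[memcg_id];
      }

      int main(void)
      {
          struct bdi bdi = { { 0 } };
          struct wb *wb = wb_get_create(&bdi, 3);

          printf("inode served by wb for memcg %d\n", wb->memcg_id);
          return 0;
      }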
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      52ebea74
    • writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK · 89e9b9e0
      Committed by Tejun Heo
      cgroup writeback requires support from both bdi and filesystem sides.
      Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
      support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
      default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
      both MEMCG and BLK_CGROUP are enabled.
      
      inode_cgwb_enabled(), which determines whether both the bdi and the
      filesystem of a given inode support cgroup writeback, is added.
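
      A sketch of the test's logic (stand-in types and assumed flag
      values; only the function name and the two capability flags
      follow the patch):

      #include <stdbool.h>
      #include <stdio.h>

      #define BDI_CAP_CGROUP_WRITEBACK 0x1  /* assumed bit values */
      #define FS_CGROUP_WRITEBACK      0x1

      struct bdi { unsigned capabilities; };
      struct fs_type { unsigned fs_flags; };
      struct inode { struct bdi *bdi; struct fs_type *fs; };

      /* both the bdi and the filesystem must opt in */
      static bool inode_cgwb_enabled(const struct inode *inode)
      {
          return (inode->bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
                 (inode->fs->fs_flags & FS_CGROUP_WRITEBACK);
      }

      int main(void)
      {
          struct bdi bdi = { BDI_CAP_CGROUP_WRITEBACK };  /* block bdi */
          struct fs_type fs = { 0 };      /* fs hasn't opted in */
          struct inode inode = { &bdi, &fs };

          printf("cgroup wb: %d\n", inode_cgwb_enabled(&inode));
          return 0;
      }
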
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      89e9b9e0
    • bdi: separate out congested state into a separate struct · 4aa9c692
      Committed by Tejun Heo
      Currently, a wb's (bdi_writeback) congestion state is carried in its
      ->state field; however, cgroup writeback support will require multiple
      wb's sharing the same congestion state.  This patch separates out
      congestion state into its own struct - struct bdi_writeback_congested.
      A new wb field, wb_congested, points to its associated congested
      struct.  The default wb, bdi->wb, always points to bdi->wb_congested.
      
      While this patch adds a layer of indirection, it doesn't introduce any
      behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4aa9c692
    • bdi: make inode_to_bdi() inline · a212b105
      Committed by Tejun Heo
      Now that bdi definitions are moved to backing-dev-defs.h,
      backing-dev.h can include blkdev.h and inline inode_to_bdi() without
      worrying about introducing a circular include dependency.  The
      function gets called from hot paths and is fairly trivial.

      This patch makes inode_to_bdi(), and sb_is_blkdev_sb() which it
      calls, inline.  blockdev_superblock and noop_backing_dev_info
      are EXPORT_GPL'd to allow the inline functions to be used from
      modules.
      
      While at it, make sb_is_blkdev_sb() return bool instead of int.
      
      v2: Fixed typo in description as suggested by Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a212b105
    • writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Committed by Tejun Heo
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  C files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      66114cad
    • writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback · f0054bb1
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->wb_lock and ->worklist into wb.
      
      * The lock protects bdi->worklist and bdi->wb.dwork scheduling.  While
        moving, rename it to wb->work_lock as wb->wb_lock is confusing.
        Also, move wb->dwork downwards so that it's colocated with the new
        ->work_lock and ->work_list fields.
      
      * bdi_writeback_workfn()		-> wb_workfn()
        bdi_wakeup_thread_delayed(bdi)	-> wb_wakeup_delayed(wb)
        bdi_wakeup_thread(bdi)		-> wb_wakeup(wb)
        bdi_queue_work(bdi, ...)		-> wb_queue_work(wb, ...)
        __bdi_start_writeback(bdi, ...)	-> __wb_start_writeback(wb, ...)
        get_next_work_item(bdi)		-> get_next_work_item(wb)
      
      * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
        The function contained parts which belong to the containing bdi
        rather than the wb itself - testing cap_writeback_dirty and
        bdi_remove_from_list() invocation.  Those are moved to
        bdi_unregister().
      
      * bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
        Initializations of the moved bdi->wb_lock and ->work_list are
        relocated from bdi_init() to wb_init().
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f0054bb1
    • writeback: move bandwidth related fields from backing_dev_info into bdi_writeback · a88a341a
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bandwidth related fields from backing_dev_info into
      bdi_writeback.
      
      * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
        write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
        balanced_dirty_ratelimit, completions and dirty_exceeded.
      
      * writeback_chunk_size() and over_bground_thresh() now take @wb
        instead of @bdi.
      
      * bdi_writeout_fraction(bdi, ...)	-> wb_writeout_fraction(wb, ...)
        bdi_dirty_limit(bdi, ...)		-> wb_dirty_limit(wb, ...)
        bdi_position_ratio(bdi, ...)		-> wb_position_ratio(wb, ...)
        bdi_update_write_bandwidth(bdi, ...)	-> wb_update_write_bandwidth(wb, ...)
        [__]bdi_update_bandwidth(bdi, ...)	-> [__]wb_update_bandwidth(wb, ...)
        bdi_{max|min}_pause(bdi, ...)		-> wb_{max|min}_pause(wb, ...)
        bdi_dirty_limits(bdi, ...)		-> wb_dirty_limits(wb, ...)
      
      * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
        respectively.  Note that explicit zeroing is dropped in the process
        as wb's are cleared in entirety anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of the moved fields are mechanically replaced with their
        bdi->wb counterparts, introducing no behavior changes.
      
      v2: Typo in description fixed as suggested by Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a88a341a
    • writeback: move backing_dev_info->bdi_stat[] into bdi_writeback · 93f78d88
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->bdi_stat[] into wb.
      
      * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
        enums is changed from BDI_ to WB_.
      
      * BDI_STAT_BATCH() -> WB_STAT_BATCH()
      
      * [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc|dec|sum}_wb_stat(wb, ...)
      
      * bdi_stat[_error]() -> wb_stat[_error]()
      
      * bdi_writeout_inc() -> wb_writeout_inc()
      
      * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
        frees stat.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      93f78d88
    • writeback: move backing_dev_info->state into bdi_writeback · 4452226e
      Committed by Tejun Heo
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bdi->state into wb.
      
      * enum bdi_state is renamed to wb_state and the prefix of all enums is
        changed from BDI_ to WB_.
      
      * Explicit zeroing of bdi->state is removed without adding zeroing of
        wb->state as the whole data structure is zeroed on init anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->state are mechanically replaced with bdi->wb.state
        introducing no behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: drbd-dev@lists.linbit.com
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4452226e
  6. 29 May, 2015 (1 commit)
  7. 05 Feb, 2015 (1 commit)
    • vfs: add support for a lazytime mount option · 0ae45f63
      Committed by Theodore Ts'o
      Add a new mount option which enables a new "lazytime" mode.  This mode
      causes atime, mtime, and ctime updates to only be made to the
      in-memory version of the inode.  The on-disk times will only get
      updated when (a) the inode needs to be updated for some non-time
      related change, (b) userspace calls fsync(), syncfs() or sync(), or
      (c) just before an undeleted inode is evicted from memory.
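
      The three triggers can be summarized in a tiny decision model
      (illustrative flags, not the kernel's dirty bits):

      #include <stdbool.h>
      #include <stdio.h>

      struct inode_state {
          bool non_time_dirty;  /* (a) some non-time related change */
          bool sync_requested;  /* (b) fsync(), syncfs() or sync()  */
          bool being_evicted;   /* (c) undeleted inode on its way out */
      };

      static bool lazytime_write_times(const struct inode_state *s)
      {
          return s->non_time_dirty || s->sync_requested ||
                 s->being_evicted;
      }

      int main(void)
      {
          struct inode_state s = { .sync_requested = true };

          printf("write on-disk times: %d\n", lazytime_write_times(&s));
          return 0;
      }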
      
      This is OK according to POSIX because there are no guarantees after a
      crash unless userspace explicitly requests it via an fsync(2) call.
      
      For workloads which feature a large number of random writes to a
      preallocated file, the lazytime mount option significantly reduces
      writes to the inode table.  The repeated 4k writes to a single block
      will result in undesirable stress on flash devices and SMR disk
      drives.  Even on conventional HDD's, the repeated writes to the inode
      table block will trigger Adjacent Track Interference (ATI) remediation
      latencies, which very negatively impact long tail latencies --- which
      is a very big deal for web serving tiers (for example).
      
      Google-Bug-Id: 18297052
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0ae45f63
  8. 21 Jan, 2015 (4 commits)
  9. 09 Sep, 2014 (2 commits)
    • bdi: reimplement bdev_inode_switch_bdi() · 018a17bd
      Committed by Tejun Heo
      A block_device may be attached to different gendisks and thus
      different bdis over time.  bdev_inode_switch_bdi() is used to switch
      the associated bdi.  The function assumes that the inode could be
      dirty and transfers it between bdis if so.  This is a bit nasty in
      that it reaches into bdi internals.
      
      This patch reimplements the function so that it writes out the inode
      if dirty.  This is a lot simpler and can be implemented without
      exposing bdi internals.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      018a17bd
    • bdi: remove unused stuff · e36f1dfc
      Committed by Tejun Heo
      Two flags and one bdi_writeback field are no longer used.  Remove
      them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e36f1dfc
  10. 04 Apr, 2014 (1 commit)
    • bdi: avoid oops on device removal · 5acda9d1
      Committed by Jan Kara
      After commit 839a8e86 ("writeback: replace custom worker pool
      implementation with unbound workqueue") when device is removed while we
      are writing to it we crash in bdi_writeback_workfn() ->
      set_worker_desc() because bdi->dev is NULL.
      
      This can happen because even though bdi_unregister() cancels all pending
      flushing work, nothing really prevents new ones from being queued from
      balance_dirty_pages() or other places.
      
      Fix the problem by clearing BDI_registered bit in bdi_unregister() and
      checking it before scheduling of any flushing work.
      
      Fixes: 839a8e86 ("writeback: replace custom worker pool implementation with unbound workqueue")
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Derek Basehore <dbasehore@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5acda9d1
  11. 08 Nov, 2013 (1 commit)
  12. 12 Sep, 2013 (1 commit)
    • mm/page-writeback.c: add strictlimit feature · 5a537485
      Committed by Maxim Patlasov
      The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
      unprivileged users) from growing a large number of dirty pages before
      throttling.  For such filesystems balance_dirty_pages always checks bdi
      counters against bdi limits.  I.e. even if the global "nr_dirty" is under
      "freerun", it's not allowed to skip bdi checks.  The only use case for now
      is fuse: it sets bdi max_ratio to 1% by default and system administrators
      are supposed to expect that this limit won't be exceeded.
      
      The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
      filesystem may set the flag when it initializes its BDI.
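
      A model of the check ordering described above (illustrative
      names and numbers):

      #include <stdbool.h>
      #include <stdio.h>

      static bool needs_throttling(long nr_dirty, long freerun,
                                   long bdi_dirty, long bdi_thresh,
                                   bool strictlimit)
      {
          if (nr_dirty <= freerun && !strictlimit)
              return false;  /* global freerun: bdi checks skipped */
          return bdi_dirty > bdi_thresh;  /* always run w/ strictlimit */
      }

      int main(void)
      {
          /* globally under freerun, but this bdi is over its own limit */
          printf("default:     %d\n",
                 needs_throttling(100, 1000, 50, 10, false));  /* 0 */
          printf("strictlimit: %d\n",
                 needs_throttling(100, 1000, 50, 10, true));   /* 1 */
          return 0;
      }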
      
      The problematic scenario comes from the fact that nobody pays attention to
      the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
      writeback).  The implementation of fuse writeback releases original page
      (by calling end_page_writeback) almost immediately.  A fuse request queued
      for real processing bears a copy of the original page.  Hence, if the
      userspace fuse daemon doesn't finalize write requests in a timely manner,
      an aggressive mmap writer can pollute virtually all memory with those temporary
      fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
      nobody cares.
      
      To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
      problem" as a shortcut for "a possibility of uncontrolled growth of the
      amount of RAM consumed by temporary pages allocated by kernel fuse to process
      writeback".
      
      The problem was very easy to reproduce.  There is a trivial example
      filesystem implementation in the fuse userspace distribution: fusexmp_fh.c.
      I added "sleep(1);" to the write methods, then recompiled and mounted it.
      Then I created a huge file on the mount point and ran a simple program which
      mmap-ed the file to a memory region, then wrote data to the region.  An
      hour later I observed almost all RAM consumed by fuse writeback.  Since
      then some unrelated changes in kernel fuse made it more difficult to
      reproduce, but it is still possible now.
      
      Putting this theoretical happens-in-the-lab thing aside, there is another
      thing that really hurts real world (FUSE) users.  This is the write-through
      page cache policy FUSE currently uses.  I.e. when handling write(2), kernel
      fuse populates the page cache and flushes user data to the server
      synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
      ("writeback cache policy") solve the problem, but they also make resolving
      the NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
      a huge file to a fuse mount would result in memory starvation.  Miklos,
      the maintainer of FUSE, believes the strictlimit feature is the way to go.
      
      And eventually putting FUSE topics aside, there is one more use-case for
      the strictlimit feature.  Using a slow USB stick (mass storage) in a machine
      with a huge amount of RAM installed is a well-known pain.  Let's do some
      simple computations.  Assuming 64GB of RAM installed, the existing
      implementation of balance_dirty_pages will start throttling only after 9.6GB
      of RAM becomes dirty (freerun == 15% of total RAM).  So, the command "cp
      9GB_file /media/my-usb-storage/" may return in a few seconds, but the
      subsequent "umount /media/my-usb-storage/" will take more than two hours if
      the effective throughput of the storage is, say, 1MB/sec.
      
      After inclusion of the strictlimit feature, it will be trivial to add a knob
      (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand,
      manually or via a udev rule.  Maybe I'm wrong, but it seems quite a
      natural desire to limit the amount of dirty memory for some devices we do
      not fully trust (in the sense of sustainable throughput).
      
      [akpm@linux-foundation.org: fix warning in page-writeback.c]
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a537485
  13. 02 Apr, 2013 (2 commits)
    • writeback: replace custom worker pool implementation with unbound workqueue · 839a8e86
      Committed by Tejun Heo
      Writeback implements its own worker pool - each bdi can be associated
      with a worker thread which is created and destroyed dynamically.  The
      worker thread for the default bdi is always present and serves as the
      "forker" thread which forks off worker threads for other bdis.
      
      There's no reason for writeback to implement its own worker pool when
      using an unbound workqueue instead is much simpler and more efficient.
      This patch replaces the custom worker pool implementation in writeback
      with an unbound workqueue.
      
      The conversion isn't too complicated but the following are worth
      mentioning.
      
      * bdi_writeback->last_active, task and wakeup_timer are removed.
        delayed_work ->dwork is added instead.  Explicit timer handling is
        no longer necessary.  Everything works by either queueing / modding
        / flushing / canceling the delayed_work item.
      
      * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
        bdi_writeback->dwork.  On each execution, it processes
        bdi->work_list and reschedules itself if there are more things to
        do.
      
        The function also handles the low-mem condition, which used to be
        handled by the forker thread.  If the function is running off a
        rescuer thread, it only writes out a limited number of pages so that
        the rescuer can serve other bdis too.  This preserves the flusher
        creation failure behavior of the forker thread.
      
      * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
        bdi_writeback_workfn() about on-going bdi unregistration so that it
        always drains work_list even if it's running off the rescuer.  Note
        that the original code was broken in this regard.  Under memory
        pressure, a bdi could finish unregistration with non-empty
        work_list.
      
      * The default bdi is no longer special.  It now is treated the same as
        any other bdi and bdi_cap_flush_forker() is removed.
      
      * BDI_pending is no longer used.  Removed.
      
      * Some tracepoints become non-applicable.  The following TPs are
        removed - writeback_nothread, writeback_wake_thread,
        writeback_wake_forker_thread, writeback_thread_start,
        writeback_thread_stop.
      
      Everything, including devices coming and going away and rescuer
      operation under simulated memory pressure, seems to work fine in my
      test setup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      839a8e86
    • writeback: remove unused bdi_pending_list · 181387da
      Committed by Tejun Heo
      There's no user left.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      181387da