1. 19 8月, 2015 31 次提交
    • T
      blk-throttle: improve queue bypass handling · c9589f03
      Tejun Heo 提交于
      If a queue is bypassing, all blkcg policies should become noops but
      blk-throttle wasn't.  It only became noop if the queue was dying.
      While this wouldn't lead to an oops as falling back to the root blkg
      is safe in this case, this can be a bit surprising - a bypassing queue
      could still be applying throttle limits.
      
      Fix it by removing blk_queue_dying() test in throtl_lookup_create_tg()
      and testing blk_queue_bypass() in blk_throtl_bio() and bypassing
      before doing anything else.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c9589f03
    • T
      blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup() · 85b6bc9d
      Tejun Heo 提交于
      Currently, both throttle and cfq policies implement their own root
      blkg (blkcg_gq) lookup fast path.  This patch moves root blkg
      optimization from throtl_lookup_tg() to __blkg_lookup().  cfq-iosched
      currently doesn't use blkg_lookup() but will be converted and drop the
      optimization too.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      85b6bc9d
    • T
      blkcg: inline [__]blkg_lookup() · 24f29046
      Tejun Heo 提交于
      blkg_lookup() checks whether the target queue is bypassing and, if
      not, calls __blkg_lookup() which first checks the lookup hint and then
      performs radix tree walk.  The operations upto hint checking are
      trivial and there are many users of this function.  This patch inlines
      blkg_lookup() and the fast path part of __blkg_lookup().  The radix
      tree lookup and hint update are now in blkg_lookup_slowpath().
      
      This will help consolidating blkg handling by easing moving root blkcg
      short-circuit to inlined lookup fast path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      24f29046
    • T
      blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods · e4a9bde9
      Tejun Heo 提交于
      Each active policy has a cpd (blkcg_policy_data) on each blkcg.  The
      cpd's were allocated by blkcg core and each policy could request to
      allocate extra space at the end by setting blkcg_policy->cpd_size
      larger than the size of cpd.
      
      This is a bit unusual but blkg (blkcg_gq) policy data used to be
      handled this way too so it made sense to be consistent; however, blkg
      policy data switched to alloc/free callbacks.
      
      This patch makes similar changes to cpd handling.
      blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size.  As
      cpd allocation is now done from policy side, it can simply allocate a
      larger area which embeds cpd at the beginning.
      
      As ->cpd_alloc_fn() may be able to perform all necessary
      initializations, this patch makes ->cpd_init_fn() optional.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e4a9bde9
    • T
      blkcg: minor updates around blkcg_policy_data · 81437648
      Tejun Heo 提交于
      * Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
        for blkcg_policy_data.
      
      * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
        blkcg.  This makes it consistent with blkg_policy_data methods and
        to-be-added cpd alloc/free methods.
      
      * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
        cpd_init_fn() can determine the associated blkcg from
        blkcg_policy_data.
      
      v2: blkcg_policy_data->blkcg initializations were missing.  Added.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      81437648
    • T
      blkcg: make blkcg_policy methods take a pointer to blkcg_policy_data · a9520cd6
      Tejun Heo 提交于
      The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
      (blkg_policy_data) while the older ones use blkg (blkcg_gq).  As using
      blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
      can always be mapped to blkg and given that these are policy-specific
      methods, it makes sense to converge on pd.
      
      This patch makes all methods deal with pd instead of blkg.  Most
      conversions are trivial.  In blk-cgroup.c, a couple method invocation
      sites now test whether pd exists instead of policy state for
      consistency.  This shouldn't cause any behavioral differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a9520cd6
    • T
      blk-throttle: clean up blkg_policy_data alloc/init/exit/free methods · b2ce2643
      Tejun Heo 提交于
      With the recent addition of alloc and free methods, things became
      messier.  This patch reorganizes them according to the followings.
      
      * ->pd_alloc_fn()
      
        Responsible for allocation and static initializations - the ones
        which can be done independent of where the pd might be attached.
      
      * ->pd_init_fn()
      
        Initializations which require the knowledge of where the pd is
        attached.
      
      * ->pd_free_fn()
      
        The counter part of pd_alloc_fn().  Static de-init and freeing.
      
      This leaves ->pd_exit_fn() without any users.  Removed.
      
      While at it, collapse an one liner function throtl_pd_exit(), which
      has only one user, into its user.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b2ce2643
    • T
      blk-throttle: remove asynchrnous percpu stats allocation mechanism · 4fb72036
      Tejun Heo 提交于
      Because percpu allocator couldn't do non-blocking allocations,
      blk-throttle was forced to implement an ad-hoc asynchronous allocation
      mechanism for its percpu stats for cases where blkg's (blkcg_gq's) are
      allocated from an IO path without sleepable context.
      
      Now that percpu allocator can handle gfp_mask and blkg_policy_data
      alloc / free are handled by policy methods, the ad-hoc asynchronous
      allocation mechanism can be replaced with direct allocation from
      tg_stats_alloc_fn().  Rit it out.
      
      This ensures that an active throtl_grp always has valid non-NULL
      ->stats_cpu.  Remove checks on it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4fb72036
    • T
      blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methods · 001bea73
      Tejun Heo 提交于
      A blkg (blkcg_gq) represents the relationship between a cgroup and
      request_queue.  Each active policy has a pd (blkg_policy_data) on each
      blkg.  The pd's were allocated by blkcg core and each policy could
      request to allocate extra space at the end by setting
      blkcg_policy->pd_size larger than the size of pd.
      
      This is a bit unusual but was done this way mostly to simplify error
      handling and all the existing use cases could be handled this way;
      however, this is becoming too restrictive now that percpu memory can
      be allocated without blocking.
      
      This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
      and pd_free_fn() - which are used to allocate and release pd for a
      given policy.  As pd allocation is now done from policy side, it can
      simply allocate a larger area which embeds pd at the beginning.  This
      change makes ->pd_size pointless.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      001bea73
    • T
      blkcg: make blkcg_activate_policy() allow NULL ->pd_init_fn · 3e418710
      Tejun Heo 提交于
      blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
      doesn't.  As both in-kernel policies implement ->pd_init_fn, it
      currently doesn't break anything.  Update blkcg_activate_policy() so
      that its behavior is consistent with blkg_create().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      3e418710
    • T
      blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy() · 4c55f4f9
      Tejun Heo 提交于
      When a policy gets activated, it needs to allocate and install its
      policy data on all existing blkg's (blkcg_gq's).  Because blkg
      iteration is protected by a spinlock, it currently counts the total
      number of blkg's in the system, allocates the matching number of
      policy data on a list and installs them during a single iteration.
      
      This can be simplified by using speculative GFP_NOWAIT allocations
      while iterating and falling back to a preallocated policy data on
      failure.  If the preallocated one has already been consumed, it
      releases the lock, preallocate with GFP_KERNEL and then restarts the
      iteration.  This can be a bit more expensive than before but policy
      activation is a very cold path and shouldn't matter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4c55f4f9
    • T
      blkcg: remove unnecessary blkcg_root handling from css_alloc/free paths · bc915e61
      Tejun Heo 提交于
      blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
      bypasses policy data and blkcg freeing for blkcg_root.  There's no
      reason to to treat policy data any differently for blkcg_root.  If the
      root css gets allocated after policies are registered, policy
      registration path will add policy data; otherwise, the alloc path
      will.  The free path isn't never invoked for root csses.
      
      This patch removes the unnecessary special handling of blkcg_root from
      css_alloc/free paths.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bc915e61
    • T
      blkcg: use blkg_free() in blkcg_init_queue() failure path · 994b7832
      Tejun Heo 提交于
      When blkcg_init_queue() fails midway after creating a new blkg, it
      performs kfree() directly; however, this doesn't free the policy data
      areas.  Make it use blkg_free() instead.  In turn, blkg_free() is
      updated to handle root request_list special case.
      
      While this fixes a possible memory leak, it's on an unlikely failure
      path of an already cold path and the size leaked per occurrence is
      miniscule too.  I don't think it needs to be tagged for -stable.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      994b7832
    • T
      blkcg: remove unnecessary request_list->blkg NULL test in blk_put_rl() · 401efbf8
      Tejun Heo 提交于
      Since ec13b1d6 ("blkcg: always create the blkcg_gq for the root
      blkcg"), a request_list always has its blkg associated.  Drop
      unnecessary rl->blkg NULL test from blk_put_rl().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      401efbf8
    • T
      cfq-iosched: charge async IOs to the appropriate blkcg's instead of the root · 60a83707
      Tejun Heo 提交于
      Up until now, all async IOs were queued to async queues which are
      shared across the whole request_queue, which means that blkcg resource
      control is completely void on async IOs including all writeback IOs.
      It was done this way because writeback didn't support writeback and
      there was no way of telling which writeback IO belonged to which
      cgroup; however, writeback recently became cgroup aware and writeback
      bio's are sent down properly tagged with the blkcg's to charge them
      against.
      
      This patch makes async cfq_queues per-cfq_cgroup instead of
      per-cfq_data so that each async IO is charged to the blkcg that it was
      tagged for instead of unconditionally attributing it to root.
      
      * cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group
        and alloc / destroy paths are updated accordingly.
      
      * cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async
        queues.
      
      * check_blkcg_changed() now also invalidates async queues as they no
        longer stay the same across cgroups.
      
      After this patch, cfq's proportional IO control through blkio.weight
      works correctly when cgroup writeback is in use.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      60a83707
    • T
      cfq-iosched: fold cfq_find_alloc_queue() into cfq_get_queue() · d4aad7ff
      Tejun Heo 提交于
      cfq_find_alloc_queue() checks whether a queue actually needs to be
      allocated, which is unnecessary as its sole caller, cfq_get_queue(),
      only calls it if so.  Also, the oom queue fallback logic is scattered
      between cfq_get_queue() and cfq_find_alloc_queue().  There really
      isn't much going on in the latter and things can be made simpler by
      folding it into cfq_get_queue().
      
      This patch collapses cfq_find_alloc_queue() into cfq_get_queue().  The
      change is fairly straight-forward with one exception - async_cfqq is
      now initialized to NULL and the "!is_sync" test in the last if
      conditional is replaced with "async_cfqq" test.  This is because gcc
      (5.1.1) gets confused for some reason and warns that async_cfqq may be
      used uninitialized otherwise.  Oh well, the code isn't necessarily
      worse this way.
      
      This patch doesn't cause any functional difference.
      
      v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d4aad7ff
    • T
      cfq-iosched: move cfq_group determination from cfq_find_alloc_queue() to cfq_get_queue() · 322731ed
      Tejun Heo 提交于
      This is necessary for making async cfq_cgroups per-cfq_group instead
      of per-cfq_data.  While this change makes cfq_get_queue() perform RCU
      locking and look up cfq_group even when it reuses async queue, the
      extra overhead is extremely unlikely to be noticeable given that this
      is already sitting behind cic->cfqq[] cache and the overall cost of
      cfq operation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      322731ed
    • T
      cfq-iosched: remove @gfp_mask from cfq_find_alloc_queue() · 2da8de0b
      Tejun Heo 提交于
      Even when allocations fail, cfq_find_alloc_queue() always returns a
      valid cfq_queue by falling back to the oom cfq_queue.  As such, there
      isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT
      is set.  GFP_NOWAIT allocations don't fail often and even when they do
      the degraded behavior is acceptable and temporary.
      
      After all, the only reason get_request(), which ultimately determines
      the gfp_mask, cares about __GFP_WAIT is to guarantee request
      allocation, assuming IO forward progress, for callers which are
      willing to wait.  There's no reason for cfq_find_alloc_queue() to
      behave differently on __GFP_WAIT when it already has a fallback
      mechanism.
      
      Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes
      to its callers.  This simplifies the function quite a bit and will
      help making async queues per-cfq_group.
      
      v2: Updated to reflect GFP_ATOMIC -> GPF_NOWAIT.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2da8de0b
    • T
      blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical allocations · d93a11f1
      Tejun Heo 提交于
      blkcg performs several allocations to track IOs per cgroup and enforce
      resource control.  Most of these allocations are performed lazily on
      demand in the IO path and thus can't involve reclaim path.  Currently,
      these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
      with occassional failures of these allocations by punting IOs to the
      root cgroup and there's no reason to reach into the emergency reserve.
      
      This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
      allocations.
      
      * bdi_writeback_congested and blkcg_gq allocations in blkg_create().
      
      * radix tree node allocations for blkcg->blkg_tree.
      
      * cfq_queue allocation on ioprio changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-and-Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Suggested-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d93a11f1
    • T
      cfq-iosched: minor cleanups · 563180a4
      Tejun Heo 提交于
      * Some were accessing cic->cfqq[] directly.  Always use cic_to_cfqq()
        and cic_set_cfqq().
      
      * check_ioprio_changed() doesn't need to verify cfq_get_queue()'s
        return for NULL.  It's always non-NULL.  Simplify accordingly.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      563180a4
    • T
      cfq-iosched: fix oom cfq_queue ref leak in cfq_set_request() · bce6133b
      Tejun Heo 提交于
      If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request()
      replaces it by invoking cfq_get_queue() again without putting the oom
      queue leaking the reference it was holding.  While oom queues are not
      released through reference counting, they're still reference counted
      and this can theoretically lead to the reference count overflowing and
      incorrectly invoke the usual release path on it.
      
      Fix it by making cfq_set_request() put the ref it was holding.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      bce6133b
    • T
      cfq-iosched: fix async oom queue handling · 95e5d6f6
      Tejun Heo 提交于
      Async cfqq's (cfq_queue's) are shared across cfq_data.  When
      cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it
      stashes the pointer in cfq_data and reuses it from then on; however,
      the function doesn't consider that cfq_find_alloc_queue() may return
      the oom_cfqq under memory pressure and installs the returned queue
      unconditionally.
      
      If the oom_cfqq is installed as an async cfqq, cfq_set_request() will
      continue calling cfq_get_queue() hoping to replace it with a proper
      queue; however, cfq_get_queue() will keep returning the cached queue
      for the slot - the oom_cfqq.
      
      Fix it by skipping caching if the queue is the oom one.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      95e5d6f6
    • T
      cfq-iosched: simplify control flow in cfq_get_queue() · 4ebc1c61
      Tejun Heo 提交于
      cfq_get_queue()'s control flow looks like the following.
      
      	async_cfqq = NULL;
      	cfqq = NULL;
      
      	if (!is_sync) {
      		...
      		async_cfqq = ...;
      		cfqq = *async_cfqq;
      	}
      
      	if (!cfqq)
      		cfqq = ...;
      
      	if (!is_sync && !(*async_cfqq))
      		...;
      
      The only thing the local variable init, the second if, and the
      async_cfqq test in the third if achieves is to skip cfqq creation and
      installation if *async_cfqq was already non-NULL.  This is needlessly
      complicated with different tests examining the same condition.
      Simplify it to the following.
      
      	if (!is_sync) {
      		...
      		async_cfqq = ...;
      		cfqq = *async_cfqq;
      		if (cfqq)
      			goto out;
      	}
      
      	cfqq = ...;
      
      	if (!is_sync)
      		...;
       out:
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4ebc1c61
    • T
      writeback: update writeback tracepoints to report cgroup · 5634cc2a
      Tejun Heo 提交于
      The following tracepoints are updated to report the cgroup used during
      cgroup writeback.
      
      * writeback_write_inode[_start]
      * writeback_queue
      * writeback_exec
      * writeback_start
      * writeback_written
      * writeback_wait
      * writeback_nowork
      * writeback_wake_background
      * wbc_writepage
      * writeback_queue_io
      * bdi_dirty_ratelimit
      * balance_dirty_pages
      * writeback_sb_inodes_requeue
      * writeback_single_inode[_start]
      
      Note that writeback_bdi_register is separated out from writeback_class
      as reporting cgroup doesn't make sense to it.  Tracepoints which take
      bdi are updated to take bdi_writeback instead.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5634cc2a
    • T
      kernfs: implement kernfs_path_len() · 9acee9c5
      Tejun Heo 提交于
      Add a function to determine the path length of a kernfs node.  This
      for now will be used by writeback tracepoint updates.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9acee9c5
    • T
      60292bcc
    • T
      writeback: remove wb_writeback_work->single_wait/done · 8a1270cd
      Tejun Heo 提交于
      wb_writeback_work->single_wait/done are used for the wait mechanism
      for synchronous wb_work (wb_writeback_work) items which are issued
      when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
      wb_work items; however, there's no reason to use a separate wait
      mechanism for this.  bdi_split_work_to_wbs() can simply use on-stack
      fallback wb_work item and separate wb_completion to wait for it.
      
      This patch removes wb_work->single_wait/done and the related code and
      make bdi_split_work_to_wbs() use on-stack fallback wb_work and
      wb_completion instead.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8a1270cd
    • T
      writeback: bdi_for_each_wb() iteration is memcg ID based not blkcg · 1ed8d48c
      Tejun Heo 提交于
      wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
      earlier implementation, wb's were keyed by blkcg ID.
      bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
      allows iterations to start from an arbitrary ID which is used to
      interrupt and resume iterations.
      
      Unfortunately, while changing wb to be keyed by memcg ID instead of
      blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
      are keyed by blkcg ID.  This doesn't affect iterations which don't get
      interrupted but bdi_split_work_to_wbs() makes use of iteration
      resuming on allocation failures and thus may incorrectly skip or
      repeat wb's.
      
      Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
      blkcg IDs and updating bdi_split_work_to_wbs() accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1ed8d48c
    • J
      Merge branch 'for-4.3-unified-base' of... · 11743ee0
      Jens Axboe 提交于
      Merge branch 'for-4.3-unified-base' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-4.3/blkcg
      11743ee0
    • T
      cgroup: introduce cgroup_subsys->legacy_name · 3e1d2eed
      Tejun Heo 提交于
      This allows cgroup subsystems to use a different name on the unified
      hierarchy.  cgroup_subsys->name is used on the unified hierarchy,
      ->legacy_name elsewhere.  If ->legacy_name is not explicitly set, it's
      automatically set to ->name and the userland visible behavior remains
      unchanged.
      
      v2: Make parse_cgroupfs_options() only consider ->legacy_name as mount
          options are used only on legacy hierarchies.  Suggested by Li
          Zefan.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      3e1d2eed
    • T
      cgroup: don't print subsystems for the default hierarchy · d98817d4
      Tejun Heo 提交于
      It doesn't make sense to print subsystems on mount option or
      /proc/PID/cgroup for the default hierarchy.
      
      * cgroup.controllers file at the root of the default hierarchy lists
        the currently attached controllers.
      
      * The default hierarchy is catch-all for unmounted subsystems.
      
      * The default hierarchy doesn't accept any mount options.
      
      Suppress subsystem printing on mount options and /proc/PID/cgroup for
      the default hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      d98817d4
  2. 12 8月, 2015 1 次提交
  3. 06 8月, 2015 1 次提交
  4. 05 8月, 2015 1 次提交
    • T
      cgroup: define controller file conventions · 6abc8ca1
      Tejun Heo 提交于
      Traditionally, each cgroup controller implemented whatever interface
      it wanted leading to interfaces which are widely inconsistent.
      Examining the requirements of the controllers readily yield that there
      are only a few control schemes shared among all.
      
      Two major controllers already had to implement new interface for the
      unified hierarchy due to significant structural changes.  Let's take
      the chance to establish common conventions throughout all controllers.
      
      This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight
      based control knobs and documents the conventions that controllers
      should follow on the unified hierarchy.  Except for io.weight knob,
      all existing unified hierarchy knobs are already compliant.  A
      follow-up patch will update io.weight.
      
      v2: Added descriptions of min, low and high knobs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      6abc8ca1
  5. 28 7月, 2015 1 次提交
  6. 27 7月, 2015 3 次提交
    • L
      Linux 4.2-rc4 · cbfe8fa6
      Linus Torvalds 提交于
      cbfe8fa6
    • L
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2579d019
      Linus Torvalds 提交于
      Pull perf fix from Thomas Gleixner:
       "A single fix for the intel cqm perf facility to prevent IPIs from
        interrupt context"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel/cqm: Return cached counter value from IRQ context
      2579d019
    • L
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 28003486
      Linus Torvalds 提交于
      Pull x86 fixes from Thomas Gleixner:
       "This update contains:
      
         - the manual revert of the SYSCALL32 changes which caused a
           regression
      
         - a fix for the MPX vma handling
      
         - three fixes for the ioremap 'is ram' checks.
      
         - PAT warning fixes
      
         - a trivial fix for the size calculation of TLB tracepoints
      
         - handle old EFI structures gracefully
      
        This also contains a PAT fix from Jan plus a revert thereof.  Toshi
        explained why the code is correct"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mm/pat: Revert 'Adjust default caching mode translation tables'
        x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit
        x86/mm: Fix newly introduced printk format warnings
        mm: Fix bugs in region_is_ram()
        x86/mm: Remove region_is_ram() call from ioremap
        x86/mm: Move warning from __ioremap_check_ram() to the call site
        x86/mm/pat, drivers/media/ivtv: Move the PAT warning and replace WARN() with pr_warn()
        x86/mm/pat, drivers/infiniband/ipath: Replace WARN() with pr_warn()
        x86/mm/pat: Adjust default caching mode translation tables
        x86/fpu: Disable dependent CPU features on "noxsave"
        x86/mpx: Do not set ->vm_ops on MPX VMAs
        x86/mm: Add parenthesis for TLB tracepoint size calculation
        efi: Handle memory error structures produced based on old versions of standard
      28003486
  7. 26 7月, 2015 2 次提交
    • T
      x86/mm/pat: Revert 'Adjust default caching mode translation tables' · 1a4e8795
      Thomas Gleixner 提交于
      Toshi explains:
      
      "No, the default values need to be set to the fallback types,
       i.e. minimal supported mode.  For WC and WT, UC is the fallback type.
      
       When PAT is disabled, pat_init() does update the tables below to
       enable WT per the default BIOS setup.  However, when PAT is enabled,
       but CPU has PAT -errata, WT falls back to UC per the default values."
      
      Revert: ca1fec58 'x86/mm/pat: Adjust default caching mode translation tables'
      Requested-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Jan Beulich <jbeulich@suse.de>
      Link: http://lkml.kernel.org/r/1437577776.3214.252.camel@hp.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      1a4e8795
    • M
      perf/x86/intel/cqm: Return cached counter value from IRQ context · 2c534c0d
      Matt Fleming 提交于
      Peter reported the following potential crash which I was able to
      reproduce with his test program,
      
      [  148.765788] ------------[ cut here ]------------
      [  148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
      [  148.765797] Modules linked in:
      [  148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
      [  148.765803]  ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007
      [  148.765805]  0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000
      [  148.765807]  ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640
      [  148.765809] Call Trace:
      [  148.765810]  <NMI>  [<ffffffff818bdfd5>] dump_stack+0x45/0x57
      [  148.765818]  [<ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0
      [  148.765822]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765824]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765825]  [<ffffffff810e422a>] warn_slowpath_null+0x1a/0x20
      [  148.765827]  [<ffffffff811613f6>] smp_call_function_many+0xb6/0x260
      [  148.765829]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765831]  [<ffffffff81161748>] on_each_cpu_mask+0x28/0x60
      [  148.765832]  [<ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0
      [  148.765836]  [<ffffffff811cdd35>] perf_output_read+0x2a5/0x400
      [  148.765839]  [<ffffffff811d2e5a>] perf_output_sample+0x31a/0x590
      [  148.765840]  [<ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380
      [  148.765841]  [<ffffffff811d3497>] perf_event_output+0x47/0x60
      [  148.765843]  [<ffffffff811d36c5>] __perf_event_overflow+0x215/0x240
      [  148.765844]  [<ffffffff811d4124>] perf_event_overflow+0x14/0x20
      [  148.765847]  [<ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440
      [  148.765849]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765853]  [<ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0
      [  148.765854]  [<ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20
      [  148.765859]  [<ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
      [  148.765863]  [<ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30
      [  148.765865]  [<ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20
      [  148.765869]  [<ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40
      [  148.765872]  [<ffffffff811c8d86>] ? irq_work_queue+0x66/0x80
      [  148.765875]  [<ffffffff81075306>] perf_event_nmi_handler+0x26/0x40
      [  148.765877]  [<ffffffff81063ed9>] nmi_handle+0x79/0x100
      [  148.765879]  [<ffffffff81064422>] default_do_nmi+0x42/0x100
      [  148.765880]  [<ffffffff81064563>] do_nmi+0x83/0xb0
      [  148.765884]  [<ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e
      [  148.765886]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765888]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765890]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765891]  <<EOE>>  [<ffffffff8110ab66>] finish_task_switch+0x156/0x210
      [  148.765898]  [<ffffffff818c1671>] __schedule+0x341/0x920
      [  148.765899]  [<ffffffff818c1c87>] schedule+0x37/0x80
      [  148.765903]  [<ffffffff810ae1af>] ? do_page_fault+0x2f/0x80
      [  148.765905]  [<ffffffff818c1f4a>] schedule_user+0x1a/0x50
      [  148.765907]  [<ffffffff818c666c>] retint_careful+0x14/0x32
      [  148.765908] ---[ end trace e33ff2be78e14901 ]---
      
      The CQM task events are not safe to be called from within interrupt
      context because they require performing an IPI to read the counter value
      on all sockets. And performing IPIs from within IRQ context is a
      "no-no".
      
      Make do with the last read counter value currently event in
      event->count when we're invoked in this context.
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@intel.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Will Auld <will.auld@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      2c534c0d