1. 19 Aug 2015 (12 commits)
    • blkcg: consolidate blkg creation in blkcg_bio_issue_check() · ae118896
      Committed by Tejun Heo
      blkg (blkcg_gq) currently is created by blkcg policies invoking
      blkg_lookup_create() which ends up repeating about the same code in
      different policies.  Theoretically, this can avoid the overhead of
      looking up and/or creating blkg's if blkcg is enabled but no policy is in
      use; however, the cost of blkg lookup / creation is very low
      especially if only the root blkcg is in use which is highly likely if
      no blkcg policy is in active use - it boils down to a single very
      predictable conditional and surrounding RCU protection.
      
      This patch consolidates blkg creation to a new function
      blkcg_bio_issue_check() which is called during bio issue from
      generic_make_request_checks().  blkcg_bio_issue_check() is now the
      only function which tries to create missing blkg's.  The subsequent
      policy and request_list operations just perform blkg_lookup() and if
      missing falls back to the root.
      
      * blk_get_rl() no longer tries to create blkg.  It uses blkg_lookup()
        instead of blkg_lookup_create().
      
      * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
        read locked and blkg already looked up.  Both throtl_lookup_tg() and
        throtl_lookup_create_tg() are dropped.
      
      * cfq is similarly updated.  cfq_lookup_create_cfqg() is replaced with
        cfq_lookup_cfqg(), which uses blkg_lookup().
      
      This consolidates blkg handling and avoids unnecessary blkg creation
      retries under memory pressure.  In addition, this provides a common
      bio entry point into blkcg where things like common accounting can be
      performed.
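
      For illustration, a simplified sketch of the consolidated entry
      point (paraphrased from the description above; the real code also
      handles lookup failures and bypassing queues):

        static inline bool blkcg_bio_issue_check(struct request_queue *q,
                                                 struct bio *bio)
        {
                struct blkcg *blkcg;
                struct blkcg_gq *blkg;
                bool throtl = false;

                rcu_read_lock();
                blkcg = bio_blkcg(bio);

                blkg = blkg_lookup(blkcg, q);
                if (unlikely(!blkg))
                        blkg = blkg_lookup_create(blkcg, q); /* sole creation site */

                throtl = blk_throtl_bio(q, blkg, bio); /* RCU held, blkg resolved */

                rcu_read_unlock();
                return !throtl;
        }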
      
      v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
          !CONFIG_BLK_DEV_THROTTLING.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: inline [__]blkg_lookup() · 24f29046
      Committed by Tejun Heo
      blkg_lookup() checks whether the target queue is bypassing and, if
      not, calls __blkg_lookup() which first checks the lookup hint and then
      performs the radix tree walk.  The operations up to hint checking are
      trivial and there are many users of this function.  This patch inlines
      blkg_lookup() and the fast path part of __blkg_lookup().  The radix
      tree lookup and hint update are now in blkg_lookup_slowpath().
      
      This will help consolidating blkg handling by easing moving root blkcg
      short-circuit to inlined lookup fast path.
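
      A sketch of the resulting split (simplified; the real fast path also
      checks whether the queue is bypassing):

        static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
                                                   struct request_queue *q)
        {
                struct blkcg_gq *blkg;

                blkg = rcu_dereference(blkcg->blkg_hint); /* trivial, stays inline */
                if (blkg && blkg->q == q)
                        return blkg;                      /* hint hit */

                return blkg_lookup_slowpath(blkcg, q);    /* radix walk + hint update */
        }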
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods · e4a9bde9
      Committed by Tejun Heo
      Each active policy has a cpd (blkcg_policy_data) on each blkcg.  The
      cpd's were allocated by blkcg core and each policy could request to
      allocate extra space at the end by setting blkcg_policy->cpd_size
      larger than the size of cpd.
      
      This is a bit unusual but blkg (blkcg_gq) policy data used to be
      handled this way too so it made sense to be consistent; however, blkg
      policy data switched to alloc/free callbacks.
      
      This patch makes similar changes to cpd handling.
      blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size.  As
      cpd allocation is now done from policy side, it can simply allocate a
      larger area which embeds cpd at the beginning.
      
      As ->cpd_alloc_fn() may be able to perform all necessary
      initializations, this patch makes ->cpd_init_fn() optional.
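
      For illustration, a policy-side cpd allocator might look like this
      (a sketch; cfq's actual fields and init differ in detail):

        struct cfq_group_data {
                struct blkcg_policy_data cpd;   /* embedded at the beginning */
                unsigned int weight;
                unsigned int leaf_weight;
        };

        static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
        {
                struct cfq_group_data *cgd = kzalloc(sizeof(*cgd), gfp);

                return cgd ? &cgd->cpd : NULL;  /* one policy-sized allocation */
        }

        static void cfq_cpd_free(struct blkcg_policy_data *cpd)
        {
                kfree(container_of(cpd, struct cfq_group_data, cpd));
        }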
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: minor updates around blkcg_policy_data · 81437648
      Committed by Tejun Heo
      * Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
        for blkcg_policy_data.
      
      * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
        blkcg.  This makes it consistent with blkg_policy_data methods and
        to-be-added cpd alloc/free methods.
      
      * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
        cpd_init_fn() can determine the associated blkcg from
        blkcg_policy_data.
      
      v2: blkcg_policy_data->blkcg initializations were missing.  Added.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: make blkcg_policy methods take a pointer to blkg_policy_data · a9520cd6
      Committed by Tejun Heo
      The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
      (blkg_policy_data) while the older ones use blkg (blkcg_gq).  As using
      blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
      can always be mapped to blkg and given that these are policy-specific
      methods, it makes sense to converge on pd.
      
      This patch makes all methods deal with pd instead of blkg.  Most
      conversions are trivial.  In blk-cgroup.c, a couple of method invocation
      sites now test whether pd exists instead of the policy state for
      consistency.  This shouldn't cause any behavioral differences.
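
      A sketch of the converged signature (simplified; pd_to_cfqg() is the
      usual container_of() helper over the embedded pd):

        static void cfq_pd_init(struct blkg_policy_data *pd)
        {
                struct cfq_group *cfqg = pd_to_cfqg(pd); /* policy struct embeds pd */
                struct blkcg_gq *blkg = pd->blkg;        /* blkg recovered on demand */

                /* ... initialize cfqg using blkg->q, blkg->blkcg ... */
        }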
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-throttle: clean up blkg_policy_data alloc/init/exit/free methods · b2ce2643
      Committed by Tejun Heo
      With the recent addition of alloc and free methods, things became
      messier.  This patch reorganizes them as follows.
      
      * ->pd_alloc_fn()
      
        Responsible for allocation and static initializations - the ones
        which can be done independent of where the pd might be attached.
      
      * ->pd_init_fn()
      
        Initializations which require the knowledge of where the pd is
        attached.
      
      * ->pd_free_fn()
      
        The counterpart of pd_alloc_fn().  Static de-init and freeing.
      
      This leaves ->pd_exit_fn() without any users.  Removed.
      
      While at it, collapse the one-liner function throtl_pd_exit(), which
      has only one user, into its caller.
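
      The resulting blk-throttle skeleton, roughly (details elided):

        static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
        {
                struct throtl_grp *tg = kzalloc_node(sizeof(*tg), gfp, node);

                if (!tg)
                        return NULL;
                /* static init: timers, service queue, default limits ... */
                return &tg->pd;
        }

        static void throtl_pd_init(struct blkg_policy_data *pd)
        {
                /* init that depends on where the pd is attached:
                 * parent, queue, position in the hierarchy ... */
        }

        static void throtl_pd_free(struct blkg_policy_data *pd)
        {
                kfree(pd_to_tg(pd));            /* counterpart of throtl_pd_alloc() */
        }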
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methods · 001bea73
      Committed by Tejun Heo
      A blkg (blkcg_gq) represents the relationship between a cgroup and
      request_queue.  Each active policy has a pd (blkg_policy_data) on each
      blkg.  The pd's were allocated by blkcg core and each policy could
      request to allocate extra space at the end by setting
      blkcg_policy->pd_size larger than the size of pd.
      
      This is a bit unusual but was done this way mostly to simplify error
      handling and all the existing use cases could be handled this way;
      however, this is becoming too restrictive now that percpu memory can
      be allocated without blocking.
      
      This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
      and pd_free_fn() - which are used to allocate and release pd for a
      given policy.  As pd allocation is now done from policy side, it can
      simply allocate a larger area which embeds pd at the beginning.  This
      change makes ->pd_size pointless.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: make blkcg_activate_policy() allow NULL ->pd_init_fn · 3e418710
      Committed by Tejun Heo
      blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
      doesn't.  As both in-kernel policies implement ->pd_init_fn, it
      currently doesn't break anything.  Update blkcg_activate_policy() so
      that its behavior is consistent with blkg_create().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: restructure blkg_policy_data allocation in blkcg_activate_policy() · 4c55f4f9
      Committed by Tejun Heo
      When a policy gets activated, it needs to allocate and install its
      policy data on all existing blkg's (blkcg_gq's).  Because blkg
      iteration is protected by a spinlock, it currently counts the total
      number of blkg's in the system, allocates the matching number of
      policy data on a list and installs them during a single iteration.
      
      This can be simplified by using speculative GFP_NOWAIT allocations
      while iterating and falling back to a preallocated policy data on
      failure.  If the preallocated one has already been consumed, it
      releases the lock, preallocates with GFP_KERNEL and then restarts the
      iteration.  This can be a bit more expensive than before but policy
      activation is a very cold path and shouldn't matter.
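
      Sketched out (simplified from the description; error unwinding is
      abbreviated):

        static int blkcg_activate_policy_sketch(struct request_queue *q,
                                                const struct blkcg_policy *pol)
        {
                struct blkg_policy_data *pd_prealloc = NULL;
                struct blkcg_gq *blkg;

        pd_prealloc:
                if (!pd_prealloc) {
                        pd_prealloc = kzalloc_node(pol->pd_size, GFP_KERNEL,
                                                   q->node);
                        if (!pd_prealloc)
                                return -ENOMEM;
                }

                spin_lock_irq(q->queue_lock);
                list_for_each_entry(blkg, &q->blkg_list, q_node) {
                        struct blkg_policy_data *pd;

                        if (blkg->pd[pol->plid])
                                continue;

                        pd = kzalloc_node(pol->pd_size, GFP_NOWAIT, q->node);
                        if (!pd)
                                swap(pd, pd_prealloc); /* consume the fallback */
                        if (!pd) {
                                spin_unlock_irq(q->queue_lock);
                                goto pd_prealloc;      /* refill, restart walk */
                        }
                        blkg->pd[pol->plid] = pd;
                }
                spin_unlock_irq(q->queue_lock);

                kfree(pd_prealloc);
                return 0;
        }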
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: remove unnecessary blkcg_root handling from css_alloc/free paths · bc915e61
      Committed by Tejun Heo
      blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
      bypasses policy data and blkcg freeing for blkcg_root.  There's no
      reason to treat policy data any differently for blkcg_root.  If the
      root css gets allocated after policies are registered, the policy
      registration path will add policy data; otherwise, the alloc path
      will.  The free path is never invoked for root csses.
      
      This patch removes the unnecessary special handling of blkcg_root from
      css_alloc/free paths.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: use blkg_free() in blkcg_init_queue() failure path · 994b7832
      Committed by Tejun Heo
      When blkcg_init_queue() fails midway after creating a new blkg, it
      performs kfree() directly; however, this doesn't free the policy data
      areas.  Make it use blkg_free() instead.  In turn, blkg_free() is
      updated to handle the root request_list special case.
      
      While this fixes a possible memory leak, it's on an unlikely failure
      path of an already cold path and the size leaked per occurrence is
      minuscule too.  I don't think it needs to be tagged for -stable.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical allocations · d93a11f1
      Committed by Tejun Heo
      blkcg performs several allocations to track IOs per cgroup and enforce
      resource control.  Most of these allocations are performed lazily on
      demand in the IO path and thus can't involve reclaim path.  Currently,
      these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
      with occasional failures of these allocations by punting IOs to the
      root cgroup and there's no reason to reach into the emergency reserve.
      
      This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
      allocations.
      
      * bdi_writeback_congested and blkcg_gq allocations in blkg_create().
      
      * radix tree node allocations for blkcg->blkg_tree.
      
      * cfq_queue allocation on ioprio changes.
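
      The change itself is mechanical; illustratively (blkg_alloc() shown
      as one representative call site):

        /* before: dips into the emergency reserve */
        blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);

        /* after: plain no-wait allocation; on failure the IO is simply
         * punted to the root cgroup instead */
        blkg = blkg_alloc(blkcg, q, GFP_NOWAIT);
        if (unlikely(!blkg))
                return ERR_PTR(-ENOMEM);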
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Suggested-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 23 Jul 2015 (1 commit)
  3. 10 Jul 2015 (4 commits)
    • blkcg: fix blkcg_policy_data allocation bug · 06b285bd
      Committed by Tejun Heo
      e48453c3 ("block, cgroup: implement policy-specific per-blkcg
      data") updated per-blkcg policy data to be dynamically allocated.
      When a policy is registered, its policy data aren't created.  Instead,
      when the policy is activated on a queue, the policy data are allocated
      if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
      This is buggy.  Consider the following scenario.
      
      1. A blkcg is created.  No blkg's attached yet.
      
      2. The policy is registered.  No policy data is allocated.
      
      3. The policy is activated on a queue.  As the above blkcg doesn't
         have any blkg's, it won't allocate the matching blkcg_policy_data.
      
      4. An IO is issued from the blkcg; a blkg is created, but the blkcg
         still doesn't have the matching policy data allocated.
      
      With cfq-iosched, this leads to an oops.
      
      It also doesn't free policy data on policy unregistration assuming
      that freeing of all policy data on blkcg destruction should take care
      of it; however, this also is incorrect.
      
      1. A blkcg has policy data.
      
      2. The policy gets unregistered but the policy data remains.
      
      3. Another policy gets registered on the same slot.
      
      4. Later, the new policy tries to allocate policy data on the previous
         blkcg but the slot is already occupied and gets skipped.  The
         policy ends up operating on the policy data of the previous policy.
      
      There's no reason to manage blkcg_policy_data lazily.  The reason we
      do lazy allocation of blkg's is that the number of all possible blkg's
      is the product of cgroups and block devices which can reach a
      surprising level.  blkcg_policy_data is constrained by the number of
      cgroups and shouldn't be a problem.
      
      This patch makes blkcg_policy_data be allocated for all existing
      blkcg's on policy registration, freed on unregistration, and removes
      blkcg_policy_data handling from the policy [de]activation paths.
      This ensures that blkcg_policy_data is created and removed together
      with the policy it belongs to and fixes the problems described above.
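
      A sketch of the registration-time allocation (simplified; it walks
      the all_blkcgs list introduced by the commit below):

        mutex_lock(&blkcg_pol_mutex);
        list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
                struct blkcg_policy_data *cpd;

                cpd = kzalloc(pol->cpd_size, GFP_KERNEL);
                if (!cpd)
                        goto err_free_cpds;     /* unwind partial allocations */

                blkcg->pd[pol->plid] = cpd;
                cpd->plid = pol->plid;
                pol->cpd_init_fn(blkcg);
        }
        mutex_unlock(&blkcg_pol_mutex);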
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: implement all_blkcgs list · 7876f930
      Committed by Tejun Heo
      Add an all_blkcgs list which goes through blkcg->all_blkcgs_node and
      is protected by blkcg_pol_mutex.  This will be used to fix the
      blkcg_policy_data allocation bug.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: blkcg_css_alloc() should grab blkcg_pol_mutex while iterating blkcg_policy[] · 144232b3
      Committed by Tejun Heo
      An entry in blkcg_policy[] is stable while there are non-bypassing
      in-flight IOs on a request_queue which has the policy activated.  This
      is why most derefs of blkcg_policy[] don't need explicit locking;
      however, blkcg_css_alloc() isn't invoked from IO path and thus doesn't
      have this protection and may race against policies being added and removed.
      
      Fix it by adding explicit blkcg_pol_mutex protection around
      blkcg_policy[] iteration in blkcg_css_alloc().
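
      The fix, in sketch form:

        int i;

        mutex_lock(&blkcg_pol_mutex);   /* blkcg_policy[] is stable now */
        for (i = 0; i < BLKCG_MAX_POLS; i++) {
                struct blkcg_policy *pol = blkcg_policy[i];

                if (!pol || !pol->cpd_size)
                        continue;
                /* ... allocate and install this policy's per-blkcg data ... */
        }
        mutex_unlock(&blkcg_pol_mutex);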
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: allow blkcg_pol_mutex to be grabbed from cgroup [file] methods · 838f13bf
      Committed by Tejun Heo
      blkcg_pol_mutex primarily protects the blkcg_policy array.  It also
      protects cgroup file type [un]registration during policy addition /
      removal.  This puts blkcg_pol_mutex outside cgroup internal
      synchronization and in turn makes it impossible to grab from blkcg's
      cgroup methods as that leads to cyclic dependency.
      
      Another problematic dependency arising from this is through cgroup
      interface file deactivation.  Removing a cftype requires removing all
      files of the type which in turn involves draining all on-going
      invocations of the file methods.  This means that an interface file
      implementation can't grab blkcg_pol_mutex as draining can lead to AA
      deadlock.
      
      blkcg_reset_stats() is already in this situation.  It currently
      trylocks blkcg_pol_mutex and then unwinds and retries the whole
      operation on failure, which is cumbersome at best.  It has a lengthy
      comment explaining how cgroup internal synchronization is involved and
      expected to be updated but as explained above this doesn't need cgroup
      internal locking to deadlock.  It's a self-contained AA deadlock.
      
      The described circular dependencies can be easily broken by moving
      cftype [un]registration out of blkcg_pol_mutex and protect them with
      an outer mutex.  This patch introduces blkcg_pol_register_mutex which
      wraps entire policy [un]registration including cftype operations and
      shrinks blkcg_pol_mutex critical section.  This also makes the trylock
      dancing in blkcg_reset_stats() unnecessary.  Removed.
      
      This patch is necessary for the following blkcg_policy_data allocation
      bug fixes.
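
      The resulting locking shape, roughly (slot-finding and error paths
      elided):

        static DEFINE_MUTEX(blkcg_pol_register_mutex);  /* outer, covers cftypes */
        static DEFINE_MUTEX(blkcg_pol_mutex);           /* blkcg_policy[] only */

        int blkcg_policy_register(struct blkcg_policy *pol)
        {
                mutex_lock(&blkcg_pol_register_mutex);
                mutex_lock(&blkcg_pol_mutex);

                /* ... find a free slot in blkcg_policy[] and install pol ... */

                mutex_unlock(&blkcg_pol_mutex); /* dropped before cftype ops */

                if (pol->cftypes)
                        WARN_ON(cgroup_add_legacy_cftypes(&blkio_cgrp_subsys,
                                                          pol->cftypes));

                mutex_unlock(&blkcg_pol_register_mutex);
                return 0;
        }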
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  4. 07 Jul 2015 (1 commit)
  5. 07 Jun 2015 (1 commit)
    • block, cgroup: implement policy-specific per-blkcg data · e48453c3
      Committed by Arianna Avanzini
      The block IO (blkio) controller enables the block layer to provide service
      guarantees in a hierarchical fashion. Specifically, service guarantees
      are provided by registered request-accounting policies. As of now, a
      proportional-share and a throttling policy are available. They are
      implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
      subsystem. Unfortunately, as for adding new policies, the current
      implementation of the block IO controller is only halfway ready to allow
      new policies to be plugged in. This commit provides a solution to make
      the block IO controller fully ready to handle new policies.
      In what follows, we first briefly describe the current state and then
      list the changes made by this commit.
      
      The throttling policy does not need any per-cgroup information to perform
      its task. In contrast, the proportional share policy uses, for each cgroup,
      both the weight assigned by the user to the cgroup, and a set of dynamically-
      computed weights, one for each device.
      
      The first, user-defined weight is stored in the blkcg data structure: the
      block IO controller allocates a private blkcg data structure for each
      cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
      In other words, the block IO controller internally mirrors the blkio cgroups
      with private blkcg data structures.
      
      On the other hand, for each cgroup and device, the corresponding dynamically-
      computed weight is maintained in the following, different way. For each device,
      the block IO controller keeps a private blkcg_gq structure for each cgroup in
      blkio. In other words, block IO also keeps one private mirror copy of the blkio
      cgroups hierarchy for each device, made of blkcg_gq structures.
      Each blkcg_gq structure keeps per-policy information in a generic array of
      dynamically-allocated 'dedicated' data structures, one for each registered
      policy (so currently the array contains two elements). To be inserted into the
      generic array, each dedicated data structure embeds a generic blkg_policy_data
      structure. Consider now the array contained in the blkcg_gq structure
      corresponding to a given pair of cgroup and device: one of the elements
      of the array contains the dedicated data structure for the proportional-share
      policy, and this dedicated data structure contains the dynamically-computed
      weight for that pair of cgroup and device.
      
      The generic strategy adopted for storing per-policy data in blkcg_gq structures
      is already capable of handling new policies, whereas the one adopted with blkcg
      structures is not, because per-policy data are hard-coded in the blkcg
      structures themselves (currently only data related to the proportional-
      share policy).
      
      This commit addresses the above issues through the following changes:
      . It generalizes blkcg structures so that per-policy data are stored in the same
        way as in blkcg_gq structures.
        Specifically, it lets also the blkcg structure store per-policy data in a
        generic array of dynamically-allocated dedicated data structures. We will
        refer to these data structures as blkcg dedicated data structures, to
        distinguish them from the dedicated data structures inserted in the generic
        arrays kept by blkcg_gq structures.
        To allow blkcg dedicated data structures to be inserted in the generic array
        inside a blkcg structure, this commit also introduces a new blkcg_policy_data
        structure, which is the equivalent of blkg_policy_data for blkcg dedicated
        data structures.
      . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
        cpd_size field and a cpd_init field, to be initialized by the policy with,
        respectively, the size of the blkcg dedicated data structures, and the
        address of a constructor function for blkcg dedicated data structures.
      . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
        the fields related to the proportional-share policy), into a new blkcg
        dedicated data structure called cfq_group_data.
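
      In sketch form, the new pieces described above (names follow the
      text; details simplified):

        struct blkcg_policy_data {
                int plid;                       /* index into blkcg's array */
        };

        struct cfq_group_data {                 /* blkcg dedicated data structure */
                struct blkcg_policy_data pd;    /* must be the first member */
                unsigned int weight;            /* user-defined, moved out of blkcg */
                unsigned int leaf_weight;
        };

        struct blkcg_policy {
                /* ... */
                size_t cpd_size;                /* size of the dedicated structure */
                blkcg_pol_init_cpd_fn *cpd_init_fn;     /* its constructor */
        };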
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 02 Jun 2015 (5 commits)
    • writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested · ce7acfea
      Committed by Tejun Heo
      A blkg (blkcg_gq) can be congested and decongested independently from
      other blkgs on the same request_queue.  Accordingly, for cgroup
      writeback support, the congestion status at bdi (backing_dev_info)
      should be split and updated separately from matching blkg's.
      
      This patch prepares by adding blkg->wb_congested and associating a
      blkg with its matching per-blkcg bdi_writeback_congested on creation.
      
      v2: Updated to associate bdi_writeback_congested instead of
          bdi_writeback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: make backing_dev_info host cgroup-specific bdi_writebacks · 52ebea74
      Committed by Tejun Heo
      For the planned cgroup writeback support, on each bdi
      (backing_dev_info), each memcg will be served by a separate wb
      (bdi_writeback).  This patch updates bdi so that a bdi can host
      multiple wbs (bdi_writebacks).
      
      On the default hierarchy, blkcg implicitly enables memcg.  This allows
      using memcg's page ownership for attributing writeback IOs, and every
      memcg - blkcg combination can be served by its own wb by assigning a
      dedicated wb to each memcg.  This means that there may be multiple
      wb's of a bdi mapped to the same blkcg.  As congested state is per
      blkcg - bdi combination, those wb's should share the same congested
      state.  This is achieved by tracking congested state via
      bdi_writeback_congested structs which are keyed by blkcg.
      
      bdi->wb remains unchanged and will keep serving the root cgroup.
      cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
      looked up while dirtying an inode according to the memcg of the page
      being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
      by its memcg id.  Once an inode is associated with its wb, it can be
      retrieved using inode_to_wb().
      
      Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
      pages will keep being associated with bdi->wb.
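
      Conceptually (a sketch; the helper names here are illustrative
      rather than the exact upstream ones):

        struct backing_dev_info {
                /* ... */
                struct bdi_writeback wb;          /* root wb, as before */
                struct radix_tree_root cgwb_tree; /* cgwb's keyed by memcg id */
        };

        /* dirtying path: find or create the wb serving this memcg on @bdi */
        wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
        if (!wb)
                wb = cgwb_create(bdi, memcg_css, gfp); /* on-demand creation */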
      
      v3: inode_attach_wb() in account_page_dirtied() moved inside
          mapping_cap_account_dirty() block where it's known to be !NULL.
          Also, an unnecessary NULL check before kfree() removed.  Both
          detected by the kbuild bot.
      
      v2: Updated so that wb association is per inode and wb is per memcg
          rather than blkcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: add blkcg_root_css · 496d5e75
      Committed by Tejun Heo
      Add global constant blkcg_root_css which points to &blkcg_root.css.
      This will be used by cgroup writeback support.  If blkcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
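
      The declaration amounts to (sketch):

        /* include/linux/blk-cgroup.h */
        #ifdef CONFIG_BLK_CGROUP
        extern struct cgroup_subsys_state * const blkcg_root_css;
        #else
        #define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
        #endif

        /* block/blk-cgroup.c */
        struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;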
      
      v2: The declarations moved to include/linux/blk-cgroup.h as suggested
          by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: always create the blkcg_gq for the root blkcg · ec13b1d6
      Committed by Tejun Heo
      Currently, blkcg does a minor optimization where the root blkcg is
      created when the first blkcg policy is activated on a queue and
      destroyed on the deactivation of the last.  On systems where blkcg is
      configured but not used, this saves one blkcg_gq struct per queue.  On
      systems where blkcg is actually used, there's no difference.  The only
      case where this can lead to any meaningful, albeit still minute, saving
      in memory consumption is when all blkcg policies are deactivated after
      being widely used in the system, which is a highly unlikely scenario.
      
      The conditional existence of root blkcg_gq has already created several
      bugs in blkcg and became an issue once again for the new per-cgroup
      wb_congested mechanism for cgroup writeback support leading to a NULL
      dereference when no blkcg policy is active.  This is really not worth
      bothering with.  This patch makes blkcg always allocate and link the
      root blkcg_gq and release it only on queue destruction.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h · eea8f41c
      Committed by Tejun Heo
      cgroup aware writeback support will require exposing some blkcg
      details.  In preparation, move block/blk-cgroup.h to
      include/linux/blk-cgroup.h.  This patch is a pure file move.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 08 Sep 2014 (1 commit)
    • blkcg: remove blkcg->id · f4da8072
      Committed by Tejun Heo
      blkcg->id is a unique id given to each blkcg; however, the
      cgroup_subsys_state which each blkcg embeds already has ->serial_nr
      which can be used for the same purpose.  Drop blkcg->id and replace
      its uses with blkcg->css.serial_nr.  Rename cfq_cgroup->blkcg_id to
      ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for
      consistency.
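
      For illustration, the cgroup-change check becomes something like
      (simplified):

        static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
        {
                uint64_t serial_nr;

                rcu_read_lock();
                serial_nr = bio_blkcg(bio)->css.serial_nr;
                rcu_read_unlock();

                /* nothing to do if the blkcg didn't change since last time */
                if (likely(cic->blkcg_serial_nr == serial_nr))
                        return;

                /* ... drop cached cfqq's so they reattach to the new group ... */
                cic->blkcg_serial_nr = serial_nr;
        }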
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 16 Jul 2014 (1 commit)
    • blkcg: don't call into policy draining if root_blkg is already gone · 2a1b4cf2
      Committed by Tejun Heo
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
      
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
        Stack:
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      
      776687bc ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      
      Fix it by skipping calling into policy draining if all the blkgs are
      already gone.
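
      The fix itself is small; roughly:

        void blkcg_drain_queue(struct request_queue *q)
        {
                lockdep_assert_held(q->queue_lock);

                /*
                 * @q could be exiting and already have destroyed all blkgs,
                 * as indicated by NULL root_blkg.  If so, don't confuse
                 * policies by calling into their drain methods.
                 */
                if (!q->root_blkg)
                        return;

                blk_throtl_drain(q);
        }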
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reported-by: Jet Chen <jet.chen@intel.com>
      Cc: stable@vger.kernel.org
      Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  9. 15 Jul 2014 (2 commits)
    • cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() · 2cf669a5
      Committed by Tejun Heo
      Currently, cftypes added by cgroup_add_cftypes() are used for both the
      unified default hierarchy and legacy ones and subsystems can mark each
      file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
      appear only on one of them.  This is quite hairy and error-prone.
      Also, we may end up exposing interface files to the default hierarchy
      without thinking it through.
      
      cgroup_subsys will grow two separate cftype addition functions and
      apply each only on the hierarchies of the matching type.  This will
      allow organizing cftypes in a lot clearer way and encourage subsystems
      to scrutinize the interface which is being exposed in the new default
      hierarchy.
      
      In preparation, this patch adds cgroup_add_legacy_cftypes() which
      currently is a simple wrapper around cgroup_add_cftypes() and replaces
      all cgroup_add_cftypes() usages with it.
      
      While at it, this patch drops a completely spurious return from
      __hugetlb_cgroup_file_init().
      
      This patch doesn't introduce any functional differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    • cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes · 5577964e
      Committed by Tejun Heo
      Currently, cgroup_subsys->base_cftypes is used for both the unified
      default hierarchy and legacy ones and subsystems can mark each file
      with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
      only on one of them.  This is quite hairy and error-prone.  Also, we
      may end up exposing interface files to the default hierarchy without
      thinking it through.
      
      cgroup_subsys will grow two separate cftype arrays and apply each only
      on the hierarchies of the matching type.  This will allow organizing
      cftypes in a lot clearer way and encourage subsystems to scrutinize
      the interface which is being exposed in the new default hierarchy.
      
      In preparation, this patch renames cgroup_subsys->base_cftypes to
      cgroup_subsys->legacy_cftypes.  This patch is a pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
  10. 12 Jul 2014 (1 commit)
    • blkcg: don't call into policy draining if root_blkg is already gone · 0b462c89
      Committed by Tejun Heo
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
      
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
        Stack:
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      
      776687bc ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      
      Fix it by skipping calling into policy draining if all the blkgs are
      already gone.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reported-by: Jet Chen <jet.chen@intel.com>
      Cc: stable@vger.kernel.org
      Tested-by: Shirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 09 Jul 2014 (1 commit)
    • blkcg, memcg: make blkcg depend on memcg on the default hierarchy · 1ced953b
      Committed by Tejun Heo
      Currently, the blkio subsystem attributes all writeback IOs to the
      root.  One of the issues is that there's no way to tell who originated
      a writeback IO from block layer.  Those IOs are usually issued
      asynchronously from a task which didn't have anything to do with
      actually generating the dirty pages.  The memory subsystem, when
      enabled, already keeps track of the ownership of each dirty page and
      it's desirable for blkio to piggyback instead of adding its own
      per-page tag.
      
      cgroup now has a mechanism to express such dependency -
      cgroup_subsys->depends_on.  This patch declares that blkcg depends on
      memcg so that memcg is enabled automatically on the default hierarchy
      when available.  Future changes will make blkcg map the memcg tag to
      find out the cgroup to blame for writeback IOs.
      
      As this means that a memcg may be made invisible, this patch also
      implements css_reset() for memcg which resets its basic
      configurations.  This implementation will probably need to be expanded
      to cover other states which are used in the default hierarchy.
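
      In sketch form, the dependency declaration (the CONFIG_MEMCG guard
      is the v2 fix noted below):

        struct cgroup_subsys blkio_cgrp_subsys = {
                .css_alloc      = blkcg_css_alloc,
                /* ... */
        #ifdef CONFIG_MEMCG
                .depends_on     = 1 << memory_cgrp_id, /* piggyback on memcg */
        #endif
        };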
      
      v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
          build failure.  Reported by kbuild test robot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
  12. 23 Jun 2014 (2 commits)
    • Revert "block: add __init to blkcg_policy_register" · d5bf0291
      Committed by Jens Axboe
      This reverts commit a2d445d4.
      
      The original commit is buggy; we do use the registration functions
      at runtime for modular builds.
    • blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t · a5049a8a
      Committed by Tejun Heo
      Hello,
      
      So, this patch should do.  Joe, Vivek, can one of you guys please
      verify that the oops goes away with this patch?
      
      Jens, the original thread can be read at
      
        http://thread.gmane.org/gmane.linux.kernel/1720729
      
      The fix converts blkg->refcnt from int to atomic_t.  It adds some
      overhead but it should be minute compared to everything else which is
      going on and the involved cacheline bouncing, so I think it's highly
      unlikely to cause any noticeable difference.  Also, the refcnt in
      question should be converted to a percpu_ref for blk-mq anyway, so the
      atomic_t is likely to go away pretty soon anyway.
      
      Thanks.
      
      ------- 8< -------
      __blkg_release_rcu() may be invoked after the associated request_queue
      is released with a RCU grace period inbetween.  As such, the function
      and callbacks invoked from it must not dereference the associated
      request_queue.  This is clearly indicated in the comment above the
      function.
      
      Unfortunately, while trying to fix a different issue, 2a4fd070
      ("blkcg: move bulk of blkcg_gq release operations to the RCU
      callback") ignored this and added [un]locking of @blkg->q->queue_lock
      to __blkg_release_rcu().  This of course can cause oops as the
      request_queue may be long gone by the time this code gets executed.
      
        general protection fault: 0000 [#1] SMP
        CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
        Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
        task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
        RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
        RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
        RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
        RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
        RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
        R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
        R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
        Stack:
         ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
         ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
         ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
        Call Trace:
         [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
         [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
         [<ffffffff81091d81>] kthread+0xe1/0x100
         [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
        Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
        +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
        +b7
        RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
         RSP <ffff88085403fdf0>
      
      The request_queue locking was added because blkcg_gq->refcnt is an int
      protected with the queue lock and __blkg_release_rcu() needs to put
      the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
      dropping queue locking in the function.
      
      Given the general heavy weight of the current request_queue and blkcg
      operations, this is unlikely to cause any noticeable overhead.
      Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
      the near future, so whatever (most likely negligible) overhead it may
      add is temporary.
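
      The resulting helpers, roughly:

        static inline void blkg_get(struct blkcg_gq *blkg)
        {
                WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
                atomic_inc(&blkg->refcnt);
        }

        static inline void blkg_put(struct blkcg_gq *blkg)
        {
                WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
                if (atomic_dec_and_test(&blkg->refcnt))
                        call_rcu(&blkg->rcu_head, __blkg_release_rcu);
        }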
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  13. 11 Jun 2014 (1 commit)
  14. 14 May 2014 (1 commit)
    • cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Committed by Tejun Heo
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they test onliness.
      
      I thought about keeping the existing names as-are and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is a pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
  15. 06 May 2014 (1 commit)
    • blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() · 36c38fb7
      Committed by Tejun Heo
      During the recent conversion of cgroup to kernfs, cgroup_tree_mutex,
      which nests above both the kernfs s_active protection and cgroup_mutex,
      was added to synchronize cgroup file type operations, as cgroup_mutex
      needed to be grabbed from some file operations and thus couldn't be put
      above s_active protection.
      
      While this arrangement mostly worked for cgroup, this triggered the
      following lockdep warning.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G        W
        -------------------------------------------------------
        trinity-c173/9024 is trying to acquire lock:
        (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
      
        but task is already holding lock:
        (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (s_active#89){++++.+}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
        kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
        cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
        cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
        rebind_subsystems (kernel/cgroup.c:1144)
        cgroup_setup_root (kernel/cgroup.c:1568)
        cgroup_mount (kernel/cgroup.c:1716)
        mount_fs (fs/super.c:1094)
        vfs_kern_mount (fs/namespace.c:899)
        do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
        SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        -> #1 (cgroup_tree_mutex){+.+.+.}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
        blkcg_policy_register (block/blk-cgroup.c:1106)
        throtl_init (block/blk-throttle.c:1694)
        do_one_initcall (init/main.c:789)
        kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
        kernel_init (init/main.c:935)
        ret_from_fork (arch/x86/kernel/entry_64.S:552)
      
        -> #0 (blkcg_pol_mutex){+.+.+.}:
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        other info that might help us debug this:
      
        Chain exists of:
        blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(s_active#89);
      				 lock(cgroup_tree_mutex);
      				 lock(s_active#89);
          lock(blkcg_pol_mutex);
      
         *** DEADLOCK ***
      
        4 locks held by trinity-c173/9024:
        #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
        #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
        #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
        #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        stack backtrace:
        CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G        W     3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
         ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
         ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
         ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
        Call Trace:
        dump_stack (lib/dump_stack.c:52)
        print_circular_bug (kernel/locking/lockdep.c:1216)
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
      
      This is a highly unlikely but valid circular dependency between "echo
      1 > blkcg.reset_stats" and cfq module [un]loading.  cgroup is going
      through further locking update which will remove this complication but
      for now let's use trylock on blkcg_pol_mutex and retry the file
      operation if the trylock fails.
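
      The stopgap, in sketch form (restart_syscall() bounces the write(2)
      back to userspace to be retried):

        static int blkcg_reset_stats(struct cgroup_subsys_state *css,
                                     struct cftype *cftype, u64 val)
        {
                struct blkcg *blkcg = css_to_blkcg(css);

                /* can't sleep on blkcg_pol_mutex under s_active */
                if (!mutex_trylock(&blkcg_pol_mutex))
                        return restart_syscall();

                /* ... clear per-policy stats for each blkg ... */

                mutex_unlock(&blkcg_pol_mutex);
                return 0;
        }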
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
  16. 19 Feb 2014 (1 commit)
  17. 13 Feb 2014 (1 commit)
    • cgroup: drop @skip_css from cgroup_taskset_for_each() · 924f0d9a
      Committed by Tejun Heo
      If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
      css.  The intention of the interface is to make it easy to skip css's
      (cgroup_subsys_states) which already match the migration target;
      however, this is entirely unnecessary as migration taskset doesn't
      include tasks which are already in the target cgroup.  Drop @skip_css
      from cgroup_taskset_for_each().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
  18. 08 Feb 2014 (2 commits)
    • cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Committed by Tejun Heo
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems; it doesn't and shouldn't
        have a claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the following:
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot
        (see the sketch below).
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through the serial console, and the trap
        handler doesn't even link the stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
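      
      As an illustration of the rename (a sketch using the freezer
      controller; the field layout is abbreviated):
      
        /* before: generic _subsys postfix, id and name set by hand */
        struct cgroup_subsys freezer_subsys = {
                .name           = "freezer",
                .subsys_id      = freezer_subsys_id,
                /* ... callbacks ... */
        };
      
        /* after: _cgrp_subsys postfix; cgroup core generates the id and
         * name during boot, so the controller leaves them out */
        struct cgroup_subsys freezer_cgrp_subsys = {
                /* ... callbacks only ... */
        };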
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      073219e9
    • T
      cgroup: drop module support · 3ed80a62
      Committed by Tejun Heo
      With module support dropped from net_prio, no controller is using
      cgroup module support.  None of the actual resource controllers can
      be built as a module, and we aren't going to add new controllers
      which don't control resources.  This patch drops module support from
      cgroup.
      
      * cgroup_[un]load_subsys() and cgroup_subsys->module removed.
      
      * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
        cgroup_subsys.h now uses IS_ENABLED() directly (see the sketch
        after this list).
      
      * enum cgroup_subsys_id now exactly matches the list of enabled
        controllers as ordered in cgroup_subsys.h.
      
      * cgroup_subsys[] is now a contiguously occupied array.  Size
        specification is no longer necessary and dropped.
      
      * for_each_builtin_subsys() is removed and for_each_subsys() is
        updated to not require any locking.
      
      * module ref handling is removed from rebind_subsystems().
      
      * Module related comments dropped.
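      
      The resulting cgroup_subsys.h pattern, sketched with the freezer
      entry as an example:
      
        #if IS_ENABLED(CONFIG_CGROUP_FREEZER)
        SUBSYS(freezer)
        #endif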
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      
      v3: Added {} around the if (need_forkexit_callback) block in
          cgroup_post_fork() for readability as suggested by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3ed80a62
  19. 12 Sep, 2013 1 commit
    • T
      blkcg: relocate root_blkg setting and clearing · 577cee1e
      Committed by Tejun Heo
      Hello, Jens.
      
      The original thread can be read from
      
       http://thread.gmane.org/gmane.linux.kernel.cgroups/8937
      
      While it leads to an oops, it only triggers under specific
      configurations which aren't common, so I don't think it's necessary
      to backport it through -stable; merging it during the coming merge
      window should be enough.
      
      Thanks!
      
      ----- 8< -----
      Currently, q->root_blkg and q->root_rl.blkg are set from
      blkcg_activate_policy() and cleared from blkg_destroy_all().  This
      doesn't necessarily coincide with the lifetime of the root blkcg_gq,
      leading to the following oops when blkcg is enabled but no policy is
      activated, because __blk_queue_next_rl() malfunctions expecting the
      root_blkg pointers to be set.
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90
        PGD 60f7a9067 PUD 60f4c9067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        gsmi: Log Shutdown Reason 0x03
        Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d
        CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19
        Hardware name: Intel XXX
        task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000
        RIP: 0010:[<ffffffff810c58cb>] [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90
        RSP: 0000:ffff88060d171818  EFLAGS: 00010096
        RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60
        RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98
        R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
        FS:  00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0
        Stack:
         0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60
         0000000000000082 0000000000000003 0000000000000000 0000000000000000
         ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8
        Call Trace:
         [<ffffffff810c7848>] __wake_up+0x48/0x70
         [<ffffffff8131da53>] __blk_drain_queue+0x123/0x190
         [<ffffffff8131dbb5>] blk_cleanup_queue+0xf5/0x210
         [<ffffffff8141877a>] __scsi_remove_device+0x5a/0xd0
         [<ffffffff81418824>] scsi_remove_device+0x34/0x50
         [<ffffffff814189cb>] scsi_remove_target+0x16b/0x220
         [<ffffffff814210f1>] __iscsi_unbind_session+0xd1/0x1b0
         [<ffffffff814212b2>] iscsi_remove_session+0xe2/0x1c0
         [<ffffffff814213a6>] iscsi_destroy_session+0x16/0x60
         [<ffffffff81423a59>] iscsi_session_teardown+0xd9/0x100
         [<ffffffff8142b75a>] iscsi_sw_tcp_session_destroy+0x5a/0xb0
         [<ffffffff81420948>] iscsi_if_rx+0x10e8/0x1560
         [<ffffffff81573335>] netlink_unicast+0x145/0x200
         [<ffffffff815736f3>] netlink_sendmsg+0x303/0x410
         [<ffffffff81528196>] sock_sendmsg+0xa6/0xd0
         [<ffffffff815294bc>] ___sys_sendmsg+0x38c/0x3a0
         [<ffffffff811ea840>] ? fget_light+0x40/0x160
         [<ffffffff811ea899>] ? fget_light+0x99/0x160
         [<ffffffff811ea840>] ? fget_light+0x40/0x160
         [<ffffffff8152bc79>] __sys_sendmsg+0x49/0x90
         [<ffffffff8152bcd2>] SyS_sendmsg+0x12/0x20
         [<ffffffff815fb642>] system_call_fastpath+0x16/0x1b
        Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 <4c> 8b 2a 48 8d 42 e8 49
      
      Fix it by moving the q->root_blkg and q->root_rl.blkg setting to
      blkg_create() and the clearing to blkg_destroy() so that they are
      initialized when a root blkg is created and cleared when destroyed.
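      
      A condensed sketch of the relocation (surrounding locking and error
      handling omitted; based on the description above):
      
        /* blkg_create(): creating the root blkg installs the pointers */
        if (blkcg == &blkcg_root) {
                q->root_blkg = blkg;
                q->root_rl.blkg = blkg;
        }
      
        /* blkg_destroy(): destroying the root blkg clears them again */
        if (blkg->q->root_blkg == blkg) {
                blkg->q->root_blkg = NULL;
                blkg->q->root_rl.blkg = NULL;
        }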
      Reported-and-tested-by: Anatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      577cee1e