1. 12 9月, 2013 4 次提交
  2. 24 8月, 2013 2 次提交
  3. 09 8月, 2013 9 次提交
    • T
      cgroup: make css_for_each_descendant() and friends include the origin css in the iteration · bd8815a6
      Tejun Heo 提交于
      Previously, all css descendant iterators didn't include the origin
      (root of subtree) css in the iteration.  The reasons were maintaining
      consistency with css_for_each_child() and that at the time of
      introduction more use cases needed skipping the origin anyway;
      however, given that css_is_descendant() considers self to be a
      descendant, omitting the origin css has become more confusing and
      looking at the accumulated use cases rather clearly indicates that
      including origin would result in simpler code overall.
      
      While this is a change which can easily lead to subtle bugs, cgroup
      API including the iterators has recently gone through major
      restructuring and no out-of-tree changes will be applicable without
      adjustments making this a relatively acceptable opportunity for this
      type of change.
      
      The conversions are mostly straight-forward.  If the iteration block
      had explicit origin handling before or after, it's moved inside the
      iteration.  If not, if (pos == origin) continue; is added.  Some
      conversions add extra reference get/put around origin handling by
      consolidating origin handling and the rest.  While the extra ref
      operations aren't strictly necessary, this shouldn't cause any
      noticeable difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      bd8815a6
    • T
      cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroup · d99c8727
      Tejun Heo 提交于
      cgroup is in the process of converting to css (cgroup_subsys_state)
      from cgroup as the principal subsystem interface handle.  This is
      mostly to prepare for the unified hierarchy support where css's will
      be created and destroyed dynamically but also helps cleaning up
      subsystem implementations as css is usually what they are interested
      in anyway.
      
      cgroup_taskset which is used by the subsystem attach methods is the
      last cgroup subsystem API which isn't using css as the handle.  Update
      cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
      cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.
      
      The conversions are pretty mechanical.  One exception is
      cpuset::cgroup_cs(), which lost its last user and got removed.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      d99c8727
    • T
      cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup · 492eb21b
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using css
      (cgroup_subsys_state) as the primary handle instead of cgroup in
      subsystem API.  For hierarchy iterators, this is beneficial because
      
      * In most cases, css is the only thing subsystems care about anyway.
      
      * On the planned unified hierarchy, iterations for different
        subsystems will need to skip over different subtrees of the
        hierarchy depending on which subsystems are enabled on each cgroup.
        Passing around css makes it unnecessary to explicitly specify the
        subsystem in question as css is intersection between cgroup and
        subsystem
      
      * For the planned unified hierarchy, css's would need to be created
        and destroyed dynamically independent from cgroup hierarchy.  Having
        cgroup core manage css iteration makes enforcing deref rules a lot
        easier.
      
      Most subsystem conversions are straight-forward.  Noteworthy changes
      are
      
      * blkio: cgroup_to_blkcg() is no longer used.  Removed.
      
      * freezer: cgroup_freezer() is no longer used.  Removed.
      
      * devices: cgroup_to_devcgroup() is no longer used.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      492eb21b
    • T
      cgroup: pass around cgroup_subsys_state instead of cgroup in file methods · 182446d0
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup.
      Please see the previous commit which converts the subsystem methods
      for rationale.
      
      This patch converts all cftype file operations to take @css instead of
      @cgroup.  cftypes for the cgroup core files don't have their subsytem
      pointer set.  These will automatically use the dummy_css added by the
      previous patch and can be converted the same way.
      
      Most subsystem conversions are straight forwards but there are some
      interesting ones.
      
      * freezer: update_if_frozen() is also converted to take @css instead
        of @cgroup for consistency.  This will make the code look simpler
        too once iterators are converted to use css.
      
      * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
        vmpressure while mem_cgroup_from_cont() can be made static.
        Updated accordingly.
      
      * cpu: cgroup_tg() doesn't have any user left.  Removed.
      
      * cpuacct: cgroup_ca() doesn't have any user left.  Removed.
      
      * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
        Removed.
      
      * net_cls: cgrp_cls_state() doesn't have any user left.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      182446d0
    • T
      cgroup: add subsys backlink pointer to cftype · 2bb566cb
      Tejun Heo 提交于
      cgroup is transitioning to using css (cgroup_subsys_state) instead of
      cgroup as the primary subsystem handle.  The cgroupfs file interface
      will be converted to use css's which requires finding out the
      subsystem from cftype so that the matching css can be determined from
      the cgroup.
      
      This patch adds cftype->ss which points to the subsystem the file
      belongs to.  The field is initialized while a cftype is being
      registered.  This makes it unnecessary to explicitly specify the
      subsystem for other cftype handling functions.  @ss argument dropped
      from various cftype handling functions.
      
      This patch shouldn't introduce any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      2bb566cb
    • T
      cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods · eb95419b
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup *
      in subsystem implementations for the following reasons.
      
      * With unified hierarchy, subsystems will be dynamically bound and
        unbound from cgroups and thus css's (cgroup_subsys_state) may be
        created and destroyed dynamically over the lifetime of a cgroup,
        which is different from the current state where all css's are
        allocated and destroyed together with the associated cgroup.  This
        in turn means that cgroup_css() should be synchronized and may
        return NULL, making it more cumbersome to use.
      
      * Differing levels of per-subsystem granularity in the unified
        hierarchy means that the task and descendant iterators should behave
        differently depending on the specific subsystem the iteration is
        being performed for.
      
      * In majority of the cases, subsystems only care about its part in the
        cgroup hierarchy - ie. the hierarchy of css's.  Subsystem methods
        often obtain the matching css pointer from the cgroup and don't
        bother with the cgroup pointer itself.  Passing around css fits
        much better.
      
      This patch converts all cgroup_subsys methods to take @css instead of
      @cgroup.  The conversions are mostly straight-forward.  A few
      noteworthy changes are
      
      * ->css_alloc() now takes css of the parent cgroup rather than the
        pointer to the new cgroup as the css for the new cgroup doesn't
        exist yet.  Knowing the parent css is enough for all the existing
        subsystems.
      
      * In kernel/cgroup.c::offline_css(), unnecessary open coded css
        dereference is replaced with local variable access.
      
      This patch shouldn't cause any behavior differences.
      
      v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
          with local variable @css as suggested by Li Zefan.
      
          Rebased on top of new for-3.12 which includes for-3.11-fixes so
          that ->css_free() invocation added by da0a12ca ("cgroup: fix a
          leak when percpu_ref_init() fails") is converted too.  Suggested
          by Li Zefan.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      eb95419b
    • T
      cgroup: add css_parent() · 63876986
      Tejun Heo 提交于
      Currently, controllers have to explicitly follow the cgroup hierarchy
      to find the parent of a given css.  cgroup is moving towards using
      cgroup_subsys_state as the main controller interface construct, so
      let's provide a way to climb the hierarchy using just csses.
      
      This patch implements css_parent() which, given a css, returns its
      parent.  The function is guarnateed to valid non-NULL parent css as
      long as the target css is not at the top of the hierarchy.
      
      freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
      are converted to use css_parent() instead of accessing cgroup->parent
      directly.
      
      * __parent_ca() is dropped from cpuacct and its usage is replaced with
        parent_ca().  The only difference between the two was NULL test on
        cgroup->parent which is now embedded in css_parent() making the
        distinction moot.  Note that eventually a css->parent field will be
        added to css and the NULL check in css_parent() will go away.
      
      This patch shouldn't cause any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      63876986
    • T
      cgroup: add/update accessors which obtain subsys specific data from css · a7c6d554
      Tejun Heo 提交于
      css (cgroup_subsys_state) is usually embedded in a subsys specific
      data structure.  Subsystems either use container_of() directly to cast
      from css to such data structure or has an accessor function wrapping
      such cast.  As cgroup as whole is moving towards using css as the main
      interface handle, add and update such accessors to ease dealing with
      css's.
      
      All accessors explicitly handle NULL input and return NULL in those
      cases.  While this looks like an extra branch in the code, as all
      controllers specific data structures have css as the first field, the
      casting doesn't involve any offsetting and the compiler can trivially
      optimize out the branch.
      
      * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
        accessor.  Added.
      
      * memory, hugetlb and devices already had one but didn't explicitly
        handle NULL input.  Updated.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      a7c6d554
    • T
      cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ · 8af01f56
      Tejun Heo 提交于
      The names of the two struct cgroup_subsys_state accessors -
      cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
      The former clashes with the type name and the latter doesn't even
      indicate it's somehow related to cgroup.
      
      We're about to revamp large portion of cgroup API, so, let's rename
      them so that they're less awkward.  Most per-controller usages of the
      accessors are localized in accessor wrappers and given the amount of
      scheduled changes, this isn't gonna add any noticeable headache.
      
      Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
      to task_css().  This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      8af01f56
  4. 15 7月, 2013 1 次提交
    • P
      block: delete __cpuinit usage from all block files · 0b776b06
      Paul Gortmaker 提交于
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the drivers/block uses of the __cpuinit macros
      from all C files.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      0b776b06
  5. 10 7月, 2013 3 次提交
  6. 04 7月, 2013 2 次提交
  7. 03 7月, 2013 1 次提交
    • J
      elevator: Fix a race in elevator switching · d50235b7
      Jianpeng Ma 提交于
      There's a race between elevator switching and normal io operation.
          Because the allocation of struct elevator_queue and struct elevator_data
          don't in a atomic operation.So there are have chance to use NULL
          ->elevator_data.
          For example:
              Thread A:                               Thread B
              blk_queu_bio                            elevator_switch
              spin_lock_irq(q->queue_block)           elevator_alloc
              elv_merge                               elevator_init_fn
      
          Because call elevator_alloc, it can't hold queue_lock and the
          ->elevator_data is NULL.So at the same time, threadA call elv_merge and
          nedd some info of elevator_data.So the crash happened.
      
          Move the elevator_alloc into func elevator_init_fn, it make the
          operations in a atomic operation.
      
          Using the follow method can easy reproduce this bug
          1:dd if=/dev/sdb of=/dev/null
          2:while true;do echo noop > scheduler;echo deadline > scheduler;done
      
          The test method also use this method.
      Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d50235b7
  8. 01 7月, 2013 2 次提交
  9. 29 6月, 2013 1 次提交
    • J
      block: Reserve only one queue tag for sync IO if only 3 tags are available · a6b3f761
      Jan Kara 提交于
      In case a device has three tags available we still reserve two of them
      for sync IO. That leaves only a single tag for async IO such as
      writeback from flusher thread which results in poor performance.
      
      Allow async IO to consume two tags in case queue has three tag availabe
      to get a decent async write performance.
      
      This patch improves streaming write performance on a machine with such disk
      from ~21 MB/s to ~52 MB/s. Also postmark throughput in presence of
      streaming writer improves from 8 to 12 transactions per second so sync
      IO doesn't seem to be harmed in presence of heavy async writer.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a6b3f761
  10. 17 5月, 2013 1 次提交
  11. 15 5月, 2013 14 次提交
    • T
      blk-throttle: implement proper hierarchy support · 9138125b
      Tejun Heo 提交于
      With the recent updates, blk-throttle is finally ready for proper
      hierarchy support.  Dispatching now honors service_queue->parent_sq
      and propagates correctly.  The only thing missing is setting
      ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
      hierarchy.
      
      This patch updates throtl_pd_init() such that service_queues form the
      same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
      As this concludes proper hierarchy support for blkcg, the shameful
      .broken_hierarchy tag is removed from blkio_subsys.
      
      v2: Updated blkio-controller.txt as suggested by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      9138125b
    • T
      blk-throttle: implement throtl_grp->has_rules[] · 693e751e
      Tejun Heo 提交于
      blk_throtl_bio() has a quick exit path for throtl_grps without limits
      configured.  It looks at the bps and iops limits and if both are not
      configured, the bio is issued immediately.  While this is correct in
      the current flat hierarchy as each throtl_grp behaves completely
      independently, it would become wrong in proper hierarchy mode.  A
      group without any limits could still be limited by one of its
      ancestors and bio's queued for such group should not bypass
      blk-throtl.
      
      As having a quick bypass mechanism is beneficial, this patch
      reimplements the mechanism such that it's correct even with proper
      hierarchy.  throtl_grp->has_rules[] is added.  These booleans are
      updated for the whole subtree whenever a config is updated so that
      has_rules[] of the whole subtree stays synchronized.  They're also
      updated when a new throtl_grp comes online so that it can't escape the
      limits of its ancestors.
      
      As no throtl_grp has another throtl_grp as parent now, this patch
      doesn't yet make any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      693e751e
    • V
      blk-throttle: Account for child group's start time in parent while bio climbs up · 32ee5bc4
      Vivek Goyal 提交于
      With the planned proper hierarchy support, a bio will climb up the
      tree before actually being dispatched. This makes sure bio is also
      subjected to parent's throttling limits, if any.
      
      It might happen that parent is idle and when bio is transferred to
      parent, a new slice starts fresh. But that is incorrect as parents
      wait time should have started when bio was queued in child group and
      causes IOs to be throttled more than configured as they climb the
      hierarchy.
      
      Given the fact that we have not written hierarchical algorithm in a
      way where child's and parents time slices are synchronized, we
      transfer the child's start time to parent if parent was idling.  If
      parent was busy doing dispatch of other bios all this while, this is
      not an issue.
      
      Child's slice start time is passed to parent. Parent looks at its
      last expired slice start time. If child's start time is after parents
      old start time, that means parent had been idle and after parent
      went idle, child had an IO queued. So use child's start time as
      parent start time.
      
      If parent's start time is after child's start time, that means,
      when IO got queued in child group, parent was not idle. But later
      it dispatched some IO, its slice got trimmed and then it went idle.
      After a while child's request got shifted in parent group. In this
      case use parent's old start time as new start time as that's the
      duration of slice we did not use.
      
      This logic is far from perfect as if there are multiple childs
      then first child transferring the bio decides the start time while
      a bio might have queued up even earlier in other child, which is
      yet to be transferred up to parent. In that case we will lose
      time and bandwidth in parent. This patch is just an approximation
      to make situation somewhat better.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      32ee5bc4
    • T
      blk-throttle: add throtl_qnode for dispatch fairness · c5cc2070
      Tejun Heo 提交于
      With flat hierarchy, there's only single level of dispatching
      happening and fairness beyond that point is the responsibility of the
      rest of the block layer and driver, which usually works out okay;
      however, with the planned hierarchy support,
      service_queue->bio_lists[] can be filled up by bios from a single
      source.  While the limits would still be honored, it'd be very easy to
      starve IOs from siblings or children.
      
      To avoid such starvation, this patch implements throtl_qnode and
      converts service_queue->bio_lists[] to lists of per-source qnodes
      which in turn contains the bio's.  For example, when a bio is
      dispatched from a child group, the bio doesn't get queued on
      ->bio_lists[] directly but it first gets queued on the group's qnode
      which in turn gets queued on service_queue->queued[].  When
      dispatching for the upper level, the ->queued[] list is consumed in
      round-robing order so that the dispatch windows is consumed fairly by
      all IO sources.
      
      There are two ways a bio can come to a throtl_grp - directly queued to
      the group or dispatched from a child.  For the former
      throtl_grp->qnode_on_self[rw] is used.  For the latter, the child's
      ->qnode_on_parent[rw].
      
      Note that this means that the child which is contributing a bio to its
      parent should stay pinned until all its bios are dispatched to its
      grand-parent.  This patch moves blkg refcnting from bio add/remove
      spots to qnode activation/deactivation so that the blkg containing an
      active qnode is always pinned.  As child pins the parent, this is
      sufficient for keeping the relevant sub-tree pinned while bios are in
      flight.
      
      The starvation issue was spotted by Vivek Goyal.
      
      v2: The original patch used the same throtl_grp->qnode_on_self/parent
          for reads and writes causing RWs to be queued incorrectly if there
          already are outstanding IOs in the other direction.  They should
          be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
          can use different qnodes.  Spotted by Vivek Goyal.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      c5cc2070
    • T
      blk-throttle: make throtl_pending_timer_fn() ready for hierarchy · 2e48a530
      Tejun Heo 提交于
      throtl_pending_timer_fn() currently assumes that the parent_sq is the
      top level one and the bio's dispatched are ready to be issued;
      however, this assumption will be wrong with proper hierarchy support.
      This patch makes the following changes to make
      throtl_pending_timer_fn() ready for hiearchy.
      
      * If the parent_sq isn't the top-level one, update the parent
        throtl_grp's dispatch time and schedule the next dispatch as
        necessary.  If the parent's dispatch time is now, repeat the
        function for the parent throtl_grp.
      
      * If the parent_sq is the top-level one, kick issue work_item as
        before.
      
      * The debug message printed by throtl_log() now prints out the
        service_queue's nr_queued[] instead of the total nr_queued as the
        latter becomes uninteresting and misleading with hierarchical
        dispatch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      2e48a530
    • T
      blk-throttle: make tg_dispatch_one_bio() ready for hierarchy · 6bc9c2b4
      Tejun Heo 提交于
      tg_dispatch_one_bio() currently assumes that the parent_sq is the top
      level one and the bio being dispatched is ready to be issued; however,
      this assumption will be wrong with proper hierarchy support.  This
      patch makes the following changes to make tg_dispatch_on_bio() ready
      for hiearchy.
      
      * throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
        of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
        transfer a bio from a child tg to its parent.
      
      * tg_dispatch_one_bio() is updated to distinguish whether its parent
        is another throtl_grp or the throtl_data.  If former, the bio is
        transferred to the parent throtl_grp using throtl_add_bio_tg().  If
        latter, the bio is ready to be issued and put on the top-level
        service_queue's bio_lists[] and throtl_data->nr_queued is
        decremented.
      
      As all throtl_grps currently have the top level service_queue as their
      ->parent_sq, this patch in itself doesn't make any behavior
      difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      6bc9c2b4
    • T
      blk-throttle: make blk_throtl_bio() ready for hierarchy · 9e660acf
      Tejun Heo 提交于
      Currently, blk_throtl_bio() issues the passed in bio directly if it's
      within limits of its associated tg (throtl_grp).  This behavior
      becomes incorrect with hierarchy support as the bio should be
      accounted to and throttled by the ancestor throtl_grps too.
      
      This patch makes the direct issue path of blk_throtl_bio() to loop
      until it reaches the top-level service_queue or gets throttled.  If
      the former, the bio can be issued directly; otherwise, it gets queued
      at the first layer it was above limits.
      
      As tg->parent_sq is always the top-level service queue currently, this
      patch in itself doesn't make any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      9e660acf
    • T
      blk-throttle: make blk_throtl_drain() ready for hierarchy · 2a12f0dc
      Tejun Heo 提交于
      The current blk_throtl_drain() assumes that all active throtl_grps are
      queued on throtl_data->service_queue, which won't be true once
      hierarchy support is implemented.
      
      This patch makes blk_throtl_drain() perform post-order walk of the
      blkg hierarchy draining each associated throtl_grp, which guarantees
      that all bios will eventually be pushed to the top-level service_queue
      in throtl_data.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      2a12f0dc
    • T
      blk-throttle: dispatch from throtl_pending_timer_fn() · 6e1a5704
      Tejun Heo 提交于
      Currently, blk_throtl_dispatch_work_fn() is responsible for both
      dispatching bio's from throtl_grp's according to their limits and then
      issuing the dispatched bios.
      
      This patch moves the dispatch part to throtl_pending_timer_fn() so
      that the work item is kicked iff there are bio's to issue.  This is to
      avoid work item execution at each step when hierarchy support is
      enabled.  bio's will be dispatched towards the top-level service_queue
      from the timers at each layer and the work item will only be used to
      issue the bio's which reached the top-level service_queue.
      
      While fetching bio's to issue from bio_lists[],
      blk_throtl_dispatch_work_fn() fetches all READs before WRITEs.  While
      the original code also dispatched READs first, if multiple throtl_grps
      are dispatched on the same run, WRITEs from throtl_grp which is
      dispatched first would precede READs from throtl_grps which are
      dispatched later.  While this is a behavior change, given that the
      previous code already prioritized READs and block layer generally
      prioritizes and segregates READs from WRITEs, this isn't likely to
      make any noticeable differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      6e1a5704
    • T
      blk-throttle: implement dispatch looping · 7f52f98c
      Tejun Heo 提交于
      throtl_select_dispatch() only dispatches throtl_quantum bios on each
      invocation.  blk_throtl_dispatch_work_fn() in turn depends on
      throtl_schedule_next_dispatch() scheduling the next dispatch window
      immediately so that undue delays aren't incurred.  This effectively
      chains multiple dispatch work item executions back-to-back when there
      are more than throtl_quantum bios to dispatch on a given tick.
      
      There is no reason to finish the current work item just to repeat it
      immediately.  This patch makes throtl_schedule_next_dispatch() return
      %false without doing anything if the current dispatch window is still
      open and updates blk_throtl_dispatch_work_fn() repeat dispatching
      after cpu_relax() on %false return.
      
      This change will help implementing hierarchy support as dispatching
      will be done from pending_timer and immediate reschedule of timer
      function isn't supported and doesn't make much sense.
      
      While this patch changes how dispatch behaves when there are more than
      throtl_quantum bios to dispatch on a single tick, the behavior change
      is immaterial.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      7f52f98c
    • T
      blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work · 69df0ab0
      Tejun Heo 提交于
      Currently, throtl_data->dispatch_work is a delayed_work item which
      handles both delayed dispatch and issuing bios.  The two tasks will be
      separated to support proper hierarchy.  To prepare for that, this
      patch separates out the timer into throtl_service_queue->pending_timer
      from throtl_data->dispatch_work and make the latter a work_struct.
      
      * As the timer is now per-service_queue, it's initialized and
        del_sync'd as its corresponding service_queue is created and
        destroyed.  The timer, when triggered, simply schedules
        throtl_data->dispathc_work for execution.
      
      * throtl_schedule_delayed_work() is renamed to
        throtl_schedule_pending_timer() and takes @sq and @expires now.
      
      * Simiarly, throtl_schedule_next_dispatch() now takes @sq, which
        should be the parent_sq of the service_queue which just got a new
        bio or updated.  As the parent_sq is always the top-level
        service_queue now, this doesn't change anything at this point.
      
      This patch doesn't introduce any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      69df0ab0
    • T
      blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it · 2a0f61e6
      Tejun Heo 提交于
      With proper hierarchy support, a bio can be dispatched multiple times
      until it reaches the top-level service_queue and we don't want to
      update dispatch stats at each step.  They are local stats and will be
      kept local.  If recursive stats are necessary, they should be
      implemented separately and definitely not by updating counters
      recursively on each dispatch.
      
      This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gate
      stats update with it so that dispatch stats are updated only on the
      first time the bio is charged to a throtl_grp, which will always be
      the throtl_grp the bio was originally queued to.
      
      This means that REQ_THROTTLED would be set even for bios which don't
      get throttled.  As we don't want bios to leave blk-throtl with the
      flag set, move REQ_THROTLLED clearing to the end of blk_throtl_bio()
      and clear if the bio is being issued directly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      2a0f61e6
    • T
      blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() · fda6f272
      Tejun Heo 提交于
      Now that both throtl_data and throtl_grp embed throtl_service_queue,
      we can unify throtl_log() and throtl_log_tg().
      
      * sq_to_tg() is added.  This returns the throtl_grp a service_queue is
        embedded in.  If the service_queue is the top-level one embedded in
        throtl_data, NULL is returned.
      
      * sq_to_td() is added.  A service_queue is always associated with a
        throtl_data.  This function finds the associated td and returns it.
      
      * throtl_log() is updated to take throtl_service_queue instead of
        throtl_data.  If the service_queue is one embedded in throtl_grp, it
        prints the same header as throtl_log_tg() did.  If it's one embedded
        in throtl_data, it behaves the same as before.  This renders
        throtl_log_tg() unnecessary.  Removed.
      
      This change is necessary for hierarchy support as we're gonna be using
      the same code paths to dispatch bios to intermediate service_queues
      embedded in throtl_grps and the top-level service_queue embedded in
      throtl_data.
      
      This patch doesn't make any behavior changes.
      
      v2: throtl_log() didn't print a space after blkg path.  Updated so
          that it prints a space after throtl_grp path.  Spotted by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      fda6f272
    • T
      blk-throttle: add throtl_service_queue->parent_sq · 77216b04
      Tejun Heo 提交于
      To prepare for hierarchy support, this patch adds
      throtl_service_queue->service_sq which points to the arent
      service_queue.  Currently, for all service_queues embedded in
      throtl_grps, it points to throtl_data->service_queue.  As
      throtl_data->service_queue doesn't have a parent its parent_sq is set
      to NULL.
      
      There are a number of functions which take both throtl_grp *tg and
      throtl_service_queue *parent_sq.  With this patch, the parent
      service_queue can be determined from @tg and the @parent_sq arguments
      are removed.
      
      This patch doesn't make any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      77216b04