1. 15 May, 2013 30 commits
    • T
      blk-throttle: add throtl_qnode for dispatch fairness · c5cc2070
      Tejun Heo committed
      With flat hierarchy, there's only a single level of dispatching
      happening and fairness beyond that point is the responsibility of the
      rest of the block layer and driver, which usually works out okay;
      however, with the planned hierarchy support,
      service_queue->bio_lists[] can be filled up by bios from a single
      source.  While the limits would still be honored, it'd be very easy to
      starve IOs from siblings or children.
      
      To avoid such starvation, this patch implements throtl_qnode and
      converts service_queue->bio_lists[] to lists of per-source qnodes
      which in turn contain the bios.  For example, when a bio is
      dispatched from a child group, the bio doesn't get queued on
      ->bio_lists[] directly but it first gets queued on the group's qnode
      which in turn gets queued on service_queue->queued[].  When
      dispatching for the upper level, the ->queued[] list is consumed in
      round-robin order so that the dispatch window is consumed fairly by
      all IO sources.
      
      There are two ways a bio can come to a throtl_grp - directly queued to
      the group or dispatched from a child.  For the former
      throtl_grp->qnode_on_self[rw] is used.  For the latter, the child's
      ->qnode_on_parent[rw].
      
      Note that this means that the child which is contributing a bio to its
      parent should stay pinned until all its bios are dispatched to its
      grand-parent.  This patch moves blkg refcnting from bio add/remove
      spots to qnode activation/deactivation so that the blkg containing an
      active qnode is always pinned.  As child pins the parent, this is
      sufficient for keeping the relevant sub-tree pinned while bios are in
      flight.
      
      The starvation issue was spotted by Vivek Goyal.
      
      v2: The original patch used the same throtl_grp->qnode_on_self/parent
          for reads and writes causing RWs to be queued incorrectly if there
          already are outstanding IOs in the other direction.  They should
          be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
          can use different qnodes.  Spotted by Vivek Goyal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      c5cc2070
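      The round-robin consumption of per-source qnodes described in the
      throtl_qnode commit above can be illustrated with a small userspace
      sketch.  It is not the kernel code: `struct qnode`, `struct queued_list`,
      `qnode_add_bio()` and `pop_queued()` are hypothetical stand-ins, bios are
      plain integers, and each qnode holds at most MAX_BIOS entries over its
      lifetime.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BIOS 8

/* Hypothetical stand-in for a per-source qnode: a small FIFO of bio ids. */
struct qnode {
    int bios[MAX_BIOS];
    size_t head, tail;     /* FIFO read and write positions */
    struct qnode *next;    /* link on the parent's queued list */
    int on_list;           /* non-zero while the qnode is active */
};

/* Hypothetical stand-in for one direction of service_queue->queued[]. */
struct queued_list {
    struct qnode *first, *last;
};

static void list_add_tail(struct queued_list *q, struct qnode *qn)
{
    qn->next = NULL;
    if (q->last)
        q->last->next = qn;
    else
        q->first = qn;
    q->last = qn;
    qn->on_list = 1;
}

/* Queue a bio on its source's qnode and activate the qnode if it was idle. */
static void qnode_add_bio(struct queued_list *q, struct qnode *qn, int bio)
{
    assert(qn->tail < MAX_BIOS);
    qn->bios[qn->tail++] = bio;
    if (!qn->on_list)
        list_add_tail(q, qn);
}

/*
 * Pop one bio from the first qnode, then move that qnode to the tail if it
 * still has bios, so that sources take turns.  Returns -1 when empty.
 */
static int pop_queued(struct queued_list *q)
{
    struct qnode *qn = q->first;
    int bio;

    if (!qn)
        return -1;
    bio = qn->bios[qn->head++];

    /* Unlink the qnode from the head of the list. */
    q->first = qn->next;
    if (!q->first)
        q->last = NULL;
    qn->on_list = 0;

    /* Re-activate at the tail only if bios remain. */
    if (qn->head < qn->tail)
        list_add_tail(q, qn);
    return bio;
}
```

      With three bios queued from one source and one from another, the pops
      alternate between the two sources until the shorter one runs dry, which
      is the fairness property the commit message is after.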
    • T
      blk-throttle: make throtl_pending_timer_fn() ready for hierarchy · 2e48a530
      Tejun Heo committed
      throtl_pending_timer_fn() currently assumes that the parent_sq is the
      top level one and the bio's dispatched are ready to be issued;
      however, this assumption will be wrong with proper hierarchy support.
      This patch makes the following changes to make
      throtl_pending_timer_fn() ready for hierarchy.
      
      * If the parent_sq isn't the top-level one, update the parent
        throtl_grp's dispatch time and schedule the next dispatch as
        necessary.  If the parent's dispatch time is now, repeat the
        function for the parent throtl_grp.
      
      * If the parent_sq is the top-level one, kick issue work_item as
        before.
      
      * The debug message printed by throtl_log() now prints out the
        service_queue's nr_queued[] instead of the total nr_queued as the
        latter becomes uninteresting and misleading with hierarchical
        dispatch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      2e48a530
    • T
      blk-throttle: make tg_dispatch_one_bio() ready for hierarchy · 6bc9c2b4
      Tejun Heo committed
      tg_dispatch_one_bio() currently assumes that the parent_sq is the top
      level one and the bio being dispatched is ready to be issued; however,
      this assumption will be wrong with proper hierarchy support.  This
      patch makes the following changes to make tg_dispatch_one_bio() ready
      for hierarchy.
      
      * throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
        of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
        transfer a bio from a child tg to its parent.
      
      * tg_dispatch_one_bio() is updated to distinguish whether its parent
        is another throtl_grp or the throtl_data.  If former, the bio is
        transferred to the parent throtl_grp using throtl_add_bio_tg().  If
        latter, the bio is ready to be issued and put on the top-level
        service_queue's bio_lists[] and throtl_data->nr_queued is
        decremented.
      
      As all throtl_grps currently have the top level service_queue as their
      ->parent_sq, this patch in itself doesn't make any behavior
      difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      6bc9c2b4
    • T
      blk-throttle: make blk_throtl_bio() ready for hierarchy · 9e660acf
      Tejun Heo committed
      Currently, blk_throtl_bio() issues the passed in bio directly if it's
      within limits of its associated tg (throtl_grp).  This behavior
      becomes incorrect with hierarchy support as the bio should be
      accounted to and throttled by the ancestor throtl_grps too.
      
      This patch makes the direct issue path of blk_throtl_bio() loop
      until it reaches the top-level service_queue or gets throttled.  If
      the former, the bio can be issued directly; otherwise, it gets queued
      at the first layer where it was above limits.
      
      As tg->parent_sq is always the top-level service queue currently, this
      patch in itself doesn't make any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      9e660acf
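      The upward loop described in the blk_throtl_bio() commit above can be
      sketched as follows.  This is a simplified userspace model, not the
      kernel function: `struct tgrp`, its single `budget` field and
      `submit_bio()` are assumptions made for illustration, and a NULL parent
      stands for the top-level service_queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a throtl_grp with a single byte budget. */
struct tgrp {
    long budget;            /* bytes still allowed in the current window */
    int nr_queued;          /* bios already waiting at this level */
    struct tgrp *parent;    /* NULL means the top level comes next */
};

/*
 * Walk from the bio's own group towards the top.  At each level the bio is
 * either charged and passed upwards, or queued if the level is over its
 * limit or already has bios waiting.  Returns the group where the bio was
 * queued, or NULL if it reached the top and can be issued directly.
 */
static struct tgrp *submit_bio(struct tgrp *tg, long size)
{
    assert(size >= 0);
    while (tg) {
        if (tg->nr_queued || size > tg->budget) {
            tg->nr_queued++;
            return tg;          /* throttled at this layer */
        }
        tg->budget -= size;     /* within limits: charge and climb */
        tg = tg->parent;
    }
    return NULL;                /* reached the top level */
}
```

      Note that a bio queued at an ancestor has already been charged to every
      level below it, which matches the idea that it is held at the first
      layer where it exceeded the limits.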
    • T
      blk-throttle: make blk_throtl_drain() ready for hierarchy · 2a12f0dc
      Tejun Heo committed
      The current blk_throtl_drain() assumes that all active throtl_grps are
      queued on throtl_data->service_queue, which won't be true once
      hierarchy support is implemented.
      
      This patch makes blk_throtl_drain() perform post-order walk of the
      blkg hierarchy draining each associated throtl_grp, which guarantees
      that all bios will eventually be pushed to the top-level service_queue
      in throtl_data.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      2a12f0dc
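      Why a post-order walk guarantees that every bio ends up at the top level
      can be seen in a small model.  The sketch below is an assumption-laden
      illustration, not blk_throtl_drain() itself: `struct grp` and
      `drain_post_order()` are hypothetical, and queued bios are reduced to a
      counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a throtl_grp in a blkg hierarchy. */
struct grp {
    int nr_queued;              /* bios waiting in this group */
    struct grp *parent;         /* NULL for the root group */
    struct grp *first_child;
    struct grp *next_sibling;
};

/*
 * Post-order drain: children are drained before their parent, so by the
 * time a group is visited it already holds everything from its subtree.
 * The root pushes its bios to the top-level queue counter.
 */
static void drain_post_order(struct grp *g, int *top_queue)
{
    struct grp *c;

    assert(g && top_queue);
    for (c = g->first_child; c; c = c->next_sibling)
        drain_post_order(c, top_queue);

    if (g->parent)
        g->parent->nr_queued += g->nr_queued;
    else
        *top_queue += g->nr_queued;
    g->nr_queued = 0;
}
```

      A pre-order walk would drain a parent before its children had handed
      their bios up, leaving some behind; visiting children first avoids
      that.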
    • T
      blk-throttle: dispatch from throtl_pending_timer_fn() · 6e1a5704
      Tejun Heo committed
      Currently, blk_throtl_dispatch_work_fn() is responsible for both
      dispatching bio's from throtl_grp's according to their limits and then
      issuing the dispatched bios.
      
      This patch moves the dispatch part to throtl_pending_timer_fn() so
      that the work item is kicked iff there are bio's to issue.  This is to
      avoid work item execution at each step when hierarchy support is
      enabled.  bio's will be dispatched towards the top-level service_queue
      from the timers at each layer and the work item will only be used to
      issue the bio's which reached the top-level service_queue.
      
      While fetching bio's to issue from bio_lists[],
      blk_throtl_dispatch_work_fn() fetches all READs before WRITEs.  While
      the original code also dispatched READs first, if multiple throtl_grps
      are dispatched on the same run, WRITEs from throtl_grp which is
      dispatched first would precede READs from throtl_grps which are
      dispatched later.  While this is a behavior change, given that the
      previous code already prioritized READs and block layer generally
      prioritizes and segregates READs from WRITEs, this isn't likely to
      make any noticeable differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      6e1a5704
    • T
      blk-throttle: implement dispatch looping · 7f52f98c
      Tejun Heo committed
      throtl_select_dispatch() only dispatches throtl_quantum bios on each
      invocation.  blk_throtl_dispatch_work_fn() in turn depends on
      throtl_schedule_next_dispatch() scheduling the next dispatch window
      immediately so that undue delays aren't incurred.  This effectively
      chains multiple dispatch work item executions back-to-back when there
      are more than throtl_quantum bios to dispatch on a given tick.
      
      There is no reason to finish the current work item just to repeat it
      immediately.  This patch makes throtl_schedule_next_dispatch() return
      %false without doing anything if the current dispatch window is still
      open and updates blk_throtl_dispatch_work_fn() to repeat dispatching
      after cpu_relax() on %false return.
      
      This change will help implementing hierarchy support as dispatching
      will be done from pending_timer and immediate reschedule of timer
      function isn't supported and doesn't make much sense.
      
      While this patch changes how dispatch behaves when there are more than
      throtl_quantum bios to dispatch on a single tick, the behavior change
      is immaterial.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      7f52f98c
    • T
      blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work · 69df0ab0
      Tejun Heo committed
      Currently, throtl_data->dispatch_work is a delayed_work item which
      handles both delayed dispatch and issuing bios.  The two tasks will be
      separated to support proper hierarchy.  To prepare for that, this
      patch separates out the timer into throtl_service_queue->pending_timer
      from throtl_data->dispatch_work and makes the latter a work_struct.
      
      * As the timer is now per-service_queue, it's initialized and
        del_sync'd as its corresponding service_queue is created and
        destroyed.  The timer, when triggered, simply schedules
        throtl_data->dispatch_work for execution.
      
      * throtl_schedule_delayed_work() is renamed to
        throtl_schedule_pending_timer() and takes @sq and @expires now.
      
      * Similarly, throtl_schedule_next_dispatch() now takes @sq, which
        should be the parent_sq of the service_queue which just got a new
        bio or updated.  As the parent_sq is always the top-level
        service_queue now, this doesn't change anything at this point.
      
      This patch doesn't introduce any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      69df0ab0
    • T
      blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it · 2a0f61e6
      Tejun Heo committed
      With proper hierarchy support, a bio can be dispatched multiple times
      until it reaches the top-level service_queue and we don't want to
      update dispatch stats at each step.  They are local stats and will be
      kept local.  If recursive stats are necessary, they should be
      implemented separately and definitely not by updating counters
      recursively on each dispatch.
      
      This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gates
      stats update with it so that dispatch stats are updated only on the
      first time the bio is charged to a throtl_grp, which will always be
      the throtl_grp the bio was originally queued to.
      
      This means that REQ_THROTTLED would be set even for bios which don't
      get throttled.  As we don't want bios to leave blk-throtl with the
      flag set, move REQ_THROTTLED clearing to the end of blk_throtl_bio()
      and clear if the bio is being issued directly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      2a0f61e6
    • T
      blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() · fda6f272
      Tejun Heo committed
      Now that both throtl_data and throtl_grp embed throtl_service_queue,
      we can unify throtl_log() and throtl_log_tg().
      
      * sq_to_tg() is added.  This returns the throtl_grp a service_queue is
        embedded in.  If the service_queue is the top-level one embedded in
        throtl_data, NULL is returned.
      
      * sq_to_td() is added.  A service_queue is always associated with a
        throtl_data.  This function finds the associated td and returns it.
      
      * throtl_log() is updated to take throtl_service_queue instead of
        throtl_data.  If the service_queue is one embedded in throtl_grp, it
        prints the same header as throtl_log_tg() did.  If it's one embedded
        in throtl_data, it behaves the same as before.  This renders
        throtl_log_tg() unnecessary.  Removed.
      
      This change is necessary for hierarchy support as we're gonna be using
      the same code paths to dispatch bios to intermediate service_queues
      embedded in throtl_grps and the top-level service_queue embedded in
      throtl_data.
      
      This patch doesn't make any behavior changes.
      
      v2: throtl_log() didn't print a space after blkg path.  Updated so
          that it prints a space after throtl_grp path.  Spotted by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      fda6f272
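      The two helpers described in the commit above rely on the service_queue
      being embedded in either a throtl_grp or the throtl_data.  A minimal
      userspace sketch of that idea follows.  The struct layouts are
      simplified stand-ins (the `_sketch` types and the `id` fields are
      assumptions), and `container_of` is defined locally with `offsetof`
      rather than taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Recover a pointer to the enclosing struct from a pointer to a member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct service_queue {
    struct service_queue *parent_sq;   /* NULL for the top-level queue */
};

/* Simplified stand-ins for throtl_data and throtl_grp. */
struct throtl_data_sketch {
    int id;
    struct service_queue service_queue;
};

struct throtl_grp_sketch {
    int id;
    struct service_queue service_queue;
    struct throtl_data_sketch *td;     /* backlink to the owning td */
};

/* Returns the group embedding @sq, or NULL if @sq is the top-level one. */
static struct throtl_grp_sketch *sq_to_tg(struct service_queue *sq)
{
    if (sq && sq->parent_sq)
        return container_of(sq, struct throtl_grp_sketch, service_queue);
    return NULL;
}

/* Every service_queue belongs to exactly one td; find it either way. */
static struct throtl_data_sketch *sq_to_td(struct service_queue *sq)
{
    struct throtl_grp_sketch *tg = sq_to_tg(sq);

    assert(sq != NULL);
    if (tg)
        return tg->td;
    return container_of(sq, struct throtl_data_sketch, service_queue);
}
```

      The sketch uses a NULL `parent_sq` to recognise the top-level queue,
      which is consistent with the parent_sq commit further down this list;
      whether the kernel helpers test exactly this condition is an
      assumption here.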
    • T
      blk-throttle: add throtl_service_queue->parent_sq · 77216b04
      Tejun Heo committed
      To prepare for hierarchy support, this patch adds
      throtl_service_queue->parent_sq which points to the parent
      service_queue.  Currently, for all service_queues embedded in
      throtl_grps, it points to throtl_data->service_queue.  As
      throtl_data->service_queue doesn't have a parent, its parent_sq is set
      to NULL.
      
      There are a number of functions which take both throtl_grp *tg and
      throtl_service_queue *parent_sq.  With this patch, the parent
      service_queue can be determined from @tg and the @parent_sq arguments
      are removed.
      
      This patch doesn't make any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      77216b04
    • T
      blk-throttle: generalize update_disptime optimization in blk_throtl_bio() · 0e9f4164
      Tejun Heo committed
      When blk_throtl_bio() wants to queue a bio to a tg (throtl_grp), it
      avoids invoking tg_update_disptime() and
      throtl_schedule_next_dispatch() if the tg already has bios queued in
      that direction.  As a new bio is appended after the existing ones, it
      can't change the tg's next dispatch time or the parent's dispatch
      schedule.
      
      This optimization is currently open coded in blk_throtl_bio().
      Whether the target biolist was occupied was recorded in a local
      variable and later used to skip disptime update.  This patch
      generalizes it so that throtl_add_bio_tg() sets a new flag
      THROTL_TG_WAS_EMPTY if the biolist was empty before the new bio was
      added.  tg_update_disptime() clears the flag automatically.
      blk_throtl_bio() is updated to simply test the flag before updating
      disptime.
      
      This patch doesn't make any functional differences now but will enable
      using the same optimization for recursive dispatch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      0e9f4164
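      The flag-based optimization described in the commit above can be
      sketched in a few lines.  This is an illustrative model under stated
      assumptions: `struct tg_sketch`, the `TG_WAS_EMPTY` bit value and the
      three functions are hypothetical, and only one I/O direction is
      modelled, whereas the commit tracks the bio list per direction.

```c
#include <assert.h>

/* Hypothetical flag bit, standing in for THROTL_TG_WAS_EMPTY. */
enum { TG_WAS_EMPTY = 1 << 0 };

struct tg_sketch {
    unsigned flags;
    int nr_queued;          /* bios queued in the modelled direction */
    int disptime_updates;   /* how often the dispatch time was recomputed */
};

/* Adding a bio records whether the list was empty beforehand. */
static void add_bio(struct tg_sketch *tg)
{
    assert(tg);
    if (!tg->nr_queued)
        tg->flags |= TG_WAS_EMPTY;
    tg->nr_queued++;
}

/* Recomputing the dispatch time clears the flag automatically. */
static void update_disptime(struct tg_sketch *tg)
{
    tg->disptime_updates++;
    tg->flags &= ~(unsigned)TG_WAS_EMPTY;
}

/* The caller only tests the flag instead of keeping a local variable. */
static void queue_bio(struct tg_sketch *tg)
{
    add_bio(tg);
    if (tg->flags & TG_WAS_EMPTY)
        update_disptime(tg);
}
```

      Queuing several bios back to back triggers a single dispatch-time
      update, because a bio appended behind existing ones cannot change when
      the first one becomes eligible.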
    • T
      blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] · 651930bc
      Tejun Heo committed
      throtl_service_queues will eventually form a tree which is anchored at
      throtl_data->service_queue and queued bios will climb the tree to the
      top service_queue to be executed.
      
      This patch makes the dispatch paths in blk_throtl_dispatch_work_fn()
      and blk_throtl_drain() dispatch bios to
      throtl_data->service_queue.bio_lists[] instead of the on-stack
      bio_lists.  This lets the final dispatch to the top level
      service_queue share the same mechanism as dispatches through the rest
      of the hierarchy.
      
      As bio's should be issued in a sleepable context,
      blk_throtl_dispatch_work_fn() transfers all dispatched bio's from the
      service_queue bio_lists[] into an onstack one before dropping
      queue_lock and issuing the bio's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      651930bc
    • T
      blk-throttle: move bio_lists[] and friends to throtl_service_queue · 73f0d49a
      Tejun Heo committed
      throtl_service_queues will eventually form a tree which is anchored at
      throtl_data->service_queue and queued bios will climb the tree to the
      top service_queue to be executed.
      
      This patch moves bio_lists[] and nr_queued[] from throtl_grp to its
      service_queue to prepare for that.  As currently only the
      throtl_data->service_queue is in use, this patch just ends up moving
      throtl_grp->bio_lists[] and ->nr_queued[] to
      throtl_grp->service_queue.bio_lists[] and ->nr_queued[] without making
      any functional differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      73f0d49a
    • T
      blk-throttle: add throtl_grp->service_queue · 49a2f1e3
      Tejun Heo committed
      Currently, there's a single service_queue per queue -
      throtl_data->service_queue.  All active throtl_grp's are queued on the
      queue and dispatched according to their limits.  To support hierarchy,
      this will be expanded such that active throtl_grp's form a tree
      anchored at throtl_data->service_queue and chained through each
      intermediate throtl_grp's service_queue.
      
      This patch adds throtl_grp->service_queue to prepare for hierarchy
      support.  The initialization function - throtl_service_queue_init() -
      is added and replaces the macro initializer.  The newly added
      tg->service_queue isn't used yet.  Following patches will do.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      49a2f1e3
    • T
      blk-throttle: reorganize throtl_service_queue passed around as argument · 0049af73
      Tejun Heo committed
      throtl_service_queue will be the building block of hierarchy support
      and will form a tree.  This patch updates its usages as arguments to
      reduce confusion.
      
      * When a service queue is used as the parent role - the host of the
        rbtree - use @parent_sq instead of @sq.
      
      * For functions taking both @tg and @parent_sq, reorder them so that
        the order is (@tg, @parent_sq) not the other way around.  This makes
        the code follow the usual convention of specifying the primary
        target of the operation as the first argument.
      
      This patch doesn't make any functional differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      0049af73
    • T
      blk-throttle: pass around throtl_service_queue instead of throtl_data · e2d57e60
      Tejun Heo committed
      throtl_service_queue will be used as the basic block to implement
      hierarchy support.  Pass around throtl_service_queue *sq instead of
      throtl_data *td in the following functions which will be used across
      multiple levels of hierarchy.
      
      * [__]throtl_enqueue/dequeue_tg()
      
      * throtl_add_bio_tg()
      
      * tg_update_disptime()
      
      * throtl_select_dispatch()
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      e2d57e60
    • T
      blk-throttle: add backlink pointer from throtl_grp to throtl_data · 0f3457f6
      Tejun Heo committed
      Add throtl_grp->td so that the td (throtl_data) a given tg
      (throtl_grp) belongs to can be determined, and remove @td argument
      from functions which take both @td and @tg as the former now can be
      determined from the latter.
      
      This generally simplifies the code and removes a number of cases where
      @td is passed as an argument without being actually used.  This will
      also help hierarchy support implementation.
      
      While at it, in multi-line conditions, move the logical operators
      leading broken lines to the end of the previous line.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      0f3457f6
    • T
      blk-throttle: simplify throtl_grp flag handling · 5b2c16aa
      Tejun Heo committed
      blk-throttle is still using function-defining macros to define flag
      handling functions, which went out of style at least a decade ago.
      
      Just define the flag as bitmask and use direct bit operations.
      
      This patch doesn't make any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      5b2c16aa
    • T
      blk-throttle: rename throtl_rb_root to throtl_service_queue · c9e0332e
      Tejun Heo committed
      throtl_rb_root will be expanded to cover more roles for hierarchy
      support.  Rename it to throtl_service_queue and make its fields more
      descriptive.
      
      * rb		-> pending_tree
      * left		-> first_pending
      * count		-> nr_pending
      * min_disptime	-> first_pending_disptime
      
      This patch is purely cosmetic.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      c9e0332e
    • T
      blk-throttle: remove pointless throtl_nr_queued() optimizations · 6a525600
      Tejun Heo committed
      throtl_nr_queued() is used in several places to avoid performing
      certain operations when the throtl_data is empty.  This usually is
      useless as those paths usually aren't traveled if there's no bio
      queued.
      
      * throtl_schedule_delayed_work() skips scheduling dispatch work item
        if @td doesn't have any bios queued; however, the only case it can
        be called when @td is empty is from tg_set_conf() which isn't
        something we should be optimizing for.
      
      * throtl_schedule_next_dispatch() takes a quick exit if @td is empty;
        however, right after that it triggers BUG if the service tree is
        empty.  The two conditions are equivalent and it can just test
        @st->count for the quick exit.
      
      * blk_throtl_dispatch_work_fn() skips dispatch if @td is empty.  This
        work function isn't usually invoked when @td is empty.  The only
        possibility is from tg_set_conf() and when it happens the normal
        dispatching path can handle empty @td fine.  No need to add special
        skip path.
      
      This patch removes the above three unnecessary optimizations, which
      leaves the throtl_log() call in blk_throtl_dispatch_work_fn() as the only user
      of throtl_nr_queued().  Remove throtl_nr_queued() and open code it in
      throtl_log().  I don't think we need td->nr_queued[] at all.  Maybe we
      can remove it later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      6a525600
    • T
      blk-throttle: relocate throtl_schedule_delayed_work() · a9131a27
      Tejun Heo committed
      Move throtl_schedule_delayed_work() above its first user so that the
      forward declaration can be removed.
      
      This patch is pure relocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      a9131a27
    • T
      blk-throttle: collapse throtl_dispatch() into the work function · cb76199c
      Tejun Heo committed
      blk-throttle is about to go through major restructuring to support
      hierarchy.  Do cosmetic updates in preparation.
      
      * s/throtl_data->throtl_work/throtl_data->dispatch_work/
      
      * s/blk_throtl_work()/blk_throtl_dispatch_work_fn()/
      
      * Collapse throtl_dispatch() into blk_throtl_dispatch_work_fn()
      
      This patch is purely cosmetic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      cb76199c
    • T
      blk-throttle: remove deferred config application mechanism · 632b4493
      Tejun Heo committed
      When bps or iops configuration changes, blk-throttle records the new
      configuration and sets a flag indicating that the config has changed.
      The flag is checked in the bio dispatch path and applied.  This
      deferred config application was necessary due to limitations in blkcg
      framework, which haven't existed for quite a while now.
      
      This patch removes the deferred config application mechanism and
      applies new configurations directly from tg_set_conf(), which is
      simpler.
      
      v2: Dropped unnecessary throtl_schedule_delayed_work() call from
          tg_set_conf() as suggested by Vivek Goyal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      632b4493
    • T
      blk-throttle: remove spurious throtl_enqueue_tg() call from throtl_select_dispatch() · 2db6314c
      Tejun Heo committed
      throtl_select_dispatch() calls throtl_enqueue_tg() right after
      tg_update_disptime(), which always calls the function anyway.  The
      call is, while harmless, unnecessary.  Remove it.
      
      This patch doesn't introduce any behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      2db6314c
    • T
      blkcg: move bulk of blkcg_gq release operations to the RCU callback · 2a4fd070
      Tejun Heo committed
      Currently, when the last reference of a blkcg_gq is put, all the
      release operations sans the actual freeing happen directly in
      blkg_put().  As blkg_put() may be called under queue_lock, all
      pd_exit_fn()s may be too.  This makes it impossible for pd_exit_fn()s
      to use del_timer_sync() on timers which grab the queue_lock which is
      an irq-safe lock due to the deadlock possibility described in the
      comment on top of del_timer_sync().
      
      This can be easily avoided by performing the release operations in the
      RCU callback instead of directly from blkg_put().  This patch moves
      the blkcg_gq release operations to the RCU callback.
      
      As this leaves __blkg_release() with only call_rcu() invocation,
      blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and
      call_rcu() invocation is now done directly from blkg_put() instead of
      going through __blkg_release() which is removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      2a4fd070
    • T
      blkcg: invoke blkcg_policy->pd_init() after parent is linked · db613670
      Tejun Heo committed
      Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is
      invoked in blkg_alloc() before the parent is linked.  This makes it
      difficult for policies to perform initializations which are dependent
      on the parent.
      
      This patch moves pd_init_fn() invocations to blkg_create() after the
      parent blkg is linked where the new blkg is fully initialized.  As
      this means that blkg_free() can't assume that pd's are initialized,
      pd_exit_fn() invocations are moved to __blkg_release().  This
      guarantees that pd_exit_fn() is also invoked with fully initialized
      blkgs with valid parent pointers.
      
      This will help implementing hierarchy support in blk-throttle.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      db613670
    • T
      blkcg: implement blkg_for_each_descendant_post() · aa539cb3
      Tejun Heo committed
      This will be used by blk-throttle hierarchy support.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      aa539cb3
    • T
      blkcg: move blkg_for_each_descendant_pre() to block/blk-cgroup.h · dd4a4ffc
      Tejun Heo committed
      blk-throttle hierarchy support will make use of it.  Move
      blkg_for_each_descendant_pre() from block/blk-cgroup.c to
      block/blk-cgroup.h.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      dd4a4ffc
    • T
      blkcg: fix error return path in blkg_create() · 2423c9c3
      Tejun Heo committed
      In blkg_create(), after lookup of parent fails, the control jumps to
      error path with the error code encoded into @blkg.  The error path
      doesn't use @blkg for the return value.  It returns ERR_PTR(ret).
      Make lookup fail path set @ret instead of @blkg.
      
      Note that the parent lookup is guaranteed to succeed at that point and
      the condition check is purely for sanity and triggers WARN when it fails.
      As such, I don't think it's necessary to mark it for -stable.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      2423c9c3
  2. 08 May, 2013 1 commit
  3. 30 April, 2013 1 commit
  4. 19 April, 2013 1 commit
  5. 12 April, 2013 1 commit
  6. 09 April, 2013 1 commit
    • J
      blkcg: fix "scheduling while atomic" in blk_queue_bypass_start · e5072664
      Jun'ichi Nomura committed
      Since 749fefe6 in v3.7 ("block: lift the initial queue bypass mode
      on blk_register_queue() instead of blk_init_allocated_queue()"),
      the following warning appears when multipath is used with CONFIG_PREEMPT=y.
      
      This patch moves blk_queue_bypass_start() before radix_tree_preload()
      to avoid the sleeping call while preemption is disabled.
      
        BUG: scheduling while atomic: multipath/2460/0x00000002
        1 lock held by multipath/2460:
         #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
        Modules linked in: ...
        Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
        Call Trace:
         [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
         [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
         [<ffffffff814291e6>] schedule+0x64/0x66
         [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
         [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
         [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
         [<ffffffff814289e3>] wait_for_common+0x9d/0xee
         [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
         [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
         [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
         [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
         [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
         [<ffffffff8106fb18>] ? complete+0x21/0x53
         [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
         [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
         [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
         [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
         [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
         [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
         [<ffffffff811d8c41>] elevator_init+0xe1/0x115
         [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
         [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
         [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
         [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
         [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
         [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
         [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
         [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
         [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
         [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
         [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
         [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e5072664
  7. 08 April, 2013 2 commits
    • K
      driver core: add uid and gid to devtmpfs · 3c2670e6
      Kay Sievers committed
      Some drivers want to tell userspace what uid and gid should be used for
      their device nodes, so allow that information to percolate through the
      driver core to userspace in order to make this happen.  This means that
      some systems (i.e.  Android and friends) will not need to even run a
      udev-like daemon for their device node manager and can just rely on
      devtmpfs fully, reducing their footprint even more.
      Signed-off-by: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c2670e6
    • J
      Revert "loop: cleanup partitions when detaching loop device" · c2fccc1c
      Jens Axboe committed
      This reverts commit 8761a3dc.
      
      There are situations where the destruction path is called
      with the bdev->bd_mutex already held, which then deadlocks in
      loop_clr_fd(). The normal partition cleanup does a trylock()
      on the mutex, but it'd be nice to have a more bullet proof
      method in loop. So punt this more involved fix to the next
      merge window, and just back out this buggy fix for now.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c2fccc1c
  8. 04 April, 2013 1 commit
    • A
      block: avoid using uninitialized value in from queue_var_store · c678ef52
      Arnd Bergmann committed
      As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
      that use a value generated by queue_var_store independent of whether
      that value was set or not.
      
      block/blk-sysfs.c: In function 'queue_store_nonrot':
      block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      Unlike most other such warnings, this one is not a false positive,
      writing any non-number string into the sysfs files indeed has
      an undefined result, rather than returning an error.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c678ef52
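      The class of bug described in the commit above, using a parsed value
      without checking whether parsing succeeded, can be shown with a
      userspace sketch.  It is not the kernel patch: `var_store()` and
      `store_flag()` are hypothetical helpers, and `strtoul()` stands in for
      whatever parser the sysfs code uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse an unsigned number from @page.  On success *var is set and @count
 * is returned; on failure *var is left untouched and -EINVAL is returned.
 */
static long var_store(unsigned long *var, const char *page, size_t count)
{
    char *end;
    unsigned long v;

    assert(var && page);
    errno = 0;
    v = strtoul(page, &end, 10);
    if (end == page || errno == ERANGE)
        return -EINVAL;     /* no digits consumed, or out of range */
    *var = v;
    return (long)count;
}

/* Set or clear @bit in *flags, but only if the input actually parsed. */
static long store_flag(unsigned *flags, unsigned bit,
                       const char *page, size_t count)
{
    unsigned long val;
    long ret = var_store(&val, page, count);

    if (ret < 0)
        return ret;         /* val was never written: do not read it */
    if (val)
        *flags |= bit;
    else
        *flags &= ~bit;
    return ret;
}
```

      Without the `ret < 0` check, a non-numeric write would fall through and
      test an uninitialized `val`, which is the undefined result the warning
      pointed at.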
  9. 24 March, 2013 2 commits
    • K
      block: Add bio_end_sector() · f73a1c7d
      Kent Overstreet committed
      Just a little convenience macro - main reason to add it now is preparing
      for immutable bio vecs, it'll reduce the size of the patch that puts
      bi_sector/bi_size/bi_idx into a struct bvec_iter.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      f73a1c7d
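      A convenience macro of the kind the commit above describes might look
      like the sketch below.  The struct is a cut-down stand-in
      (`struct bio_sketch` and `sector_sketch_t` are assumptions), sectors
      are taken to be 512 bytes, and `bios_are_contiguous()` is a
      hypothetical caller added only to show the macro in use.

```c
#include <assert.h>

typedef unsigned long long sector_sketch_t;

/* Cut-down stand-in for struct bio: start sector and size in bytes. */
struct bio_sketch {
    sector_sketch_t bi_sector;
    unsigned int bi_size;
};

/* Size of the bio in 512-byte sectors. */
#define bio_sectors(bio)    ((bio)->bi_size >> 9)

/* First sector past the end of the bio. */
#define bio_end_sector(bio) ((bio)->bi_sector + bio_sectors((bio)))

/* Hypothetical caller: does @b start exactly where @a ends? */
static int bios_are_contiguous(const struct bio_sketch *a,
                               const struct bio_sketch *b)
{
    assert(a && b);
    return bio_end_sector(a) == b->bi_sector;
}
```

      Hiding the arithmetic behind one macro means callers no longer touch
      the individual fields, which is what would shrink a later patch that
      moves those fields into another struct.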
    • K
      block: Refactor blk_update_request() · f79ea416
      Kent Overstreet committed
      Converts it to use bio_advance(), simplifying it quite a bit in the
      process.
      
      Note that req_bio_endio() now always calls bio_advance() - which means
      it always loops over the biovec, not just on partial completions. Don't
      expect it to affect performance, but worth noting.
      
      Tested it by forcing partial updates, and dumping before and after on
      various bio/bvec fields when doing a partial update.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      f79ea416