1. 22 2月, 2013 4 次提交
    • G
      block: use i_size_write() in bd_set_size() · d646a02a
      Guo Chao 提交于
      blkdev_ioctl(GETBLKSIZE) uses i_size_read() to read size of block device.
      If we update block size directly, reader may see intermediate result in
      some machines and configurations.  Use i_size_write() instead.
      Signed-off-by: NGuo Chao <yan@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: M. Hindess <hindessm@uk.ibm.com>
      Cc: Nikanth Karthikesan <knikanth@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d646a02a
    • G
      cfq: fix lock imbalance with failed allocations · a3cc86c2
      Glauber Costa 提交于
      While stress-running very-small container scenarios with the Kernel Memory
      Controller, I've run into a lockdep-detected lock imbalance in
      cfq-iosched.c.
      
      I'll apologize beforehand for not posting a backlog: I didn't anticipate
      it would be so hard to reproduce, so I didn't save my serial output and
      went directly on debugging.  Turns out that it did not happen again in
      more than 20 runs, making it a quite rare pattern.
      
      But here is my analysis:
      
      When we are in very low-memory situations, we will arrive at
      cfq_find_alloc_queue and may not find a queue, having to resort to the oom
      queue, in an rcu-locked condition:
      
        if (!cfqq || cfqq == &cfqd->oom_cfqq)
            [ ... ]
      
      Next, we will release the rcu lock, and try to allocate a queue, retrying
      if we succeed:
      
        rcu_read_unlock();
        spin_unlock_irq(cfqd->queue->queue_lock);
        new_cfqq = kmem_cache_alloc_node(cfq_pool,
                        gfp_mask | __GFP_ZERO,
                        cfqd->queue->node);
         spin_lock_irq(cfqd->queue->queue_lock);
         if (new_cfqq)
             goto retry;
      
      We are unlocked at this point, but it should be fine, since we will
      reacquire the rcu_read_lock when we retry.
      
      Except of course, that we may not retry: the allocation may very well fail
      and we'll keep on going through the flow:
      
      The next branch is:
      
          if (cfqq) {
      	[ ... ]
          } else
              cfqq = &cfqd->oom_cfqq;
      
      And right before exiting, we'll issue rcu_read_unlock().
      
      Being already unlocked, this is the likely source of our imbalance.  Since
      cfqq is either already NULL or made NULL in the first statement of the
      outter branch, the only viable alternative here seems to be to return the
      oom queue right away in case of allocation failure.
      
      Please review the following patch and apply if you agree with my analysis.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a3cc86c2
    • C
      drivers/block/swim3.c: fix null pointer dereference · 7414d4f6
      Cong Ding 提交于
      The use of pointer fs should be after the null check.
      Signed-off-by: NCong Ding <dinggnu@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7414d4f6
    • M
      block: don't select PERCPU_RWSEM · 79d0b7f0
      Mikulas Patocka 提交于
      The block device doesn't use percpu rw-semaphore anymore, so don't select
      it for compilation.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      79d0b7f0
  2. 15 2月, 2013 2 次提交
    • V
      block: account iowait time when waiting for completion of IO request · 5577022f
      Vladimir Davydov 提交于
      Using wait_for_completion() for waiting for a IO request to be executed
      results in wrong iowait time accounting. For example, a system having
      the only task doing write() and fdatasync() on a block device can be
      reported being idle instead of iowaiting as it should because
      blkdev_issue_flush() calls wait_for_completion() which in turn calls
      schedule() that does not increment the iowait proc counter and thus does
      not turn on iowait time accounting.
      
      The patch makes block layer use wait_for_completion_io() instead of
      wait_for_completion() where appropriate to account iowait time
      correctly.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5577022f
    • V
      sched: add wait_for_completion_io[_timeout] · 686855f5
      Vladimir Davydov 提交于
      The only difference between wait_for_completion[_timeout]() and
      wait_for_completion_io[_timeout]() is that the latter calls
      io_schedule_timeout() instead of schedule_timeout() so that the caller
      is accounted as waiting for IO, not just sleeping.
      
      These functions can be used for correct iowait time accounting when the
      completion struct is actually used for waiting for IO (e.g. completion
      of a bio request in the block layer).
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      686855f5
  3. 14 1月, 2013 5 次提交
    • T
      writeback: add more tracepoints · 9fb0a7da
      Tejun Heo 提交于
      Add tracepoints for page dirtying, writeback_single_inode start, inode
      dirtying and writeback.  For the latter two inode events, a pair of
      events are defined to denote start and end of the operations (the
      starting one has _start suffix and the one w/o suffix happens after
      the operation is complete).  These inode ops are FS specific and can
      be non-trivial and having enclosing tracepoints is useful for external
      tracers.
      
      This is part of tracepoint additions to improve visiblity into
      dirtying / writeback operations for io tracer and userland.
      
      v2: writeback_dirty_inode[_start] TPs may be called for files on
          pseudo FSes w/ unregistered bdi.  Check whether bdi->dev is %NULL
          before dereferencing.
      
      v3: buffer dirtying moved to a block TP.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9fb0a7da
    • T
      block: add block_{touch|dirty}_buffer tracepoint · 5305cb83
      Tejun Heo 提交于
      The former is triggered from touch_buffer() and the latter
      mark_buffer_dirty().
      
      This is part of tracepoint additions to improve visiblity into
      dirtying / writeback operations for io tracer and userland.
      
      v2: Transformed writeback_dirty_buffer to block_dirty_buffer and made
          it share TP definition with block_touch_buffer.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5305cb83
    • T
      buffer: make touch_buffer() an exported function · f0059afd
      Tejun Heo 提交于
      We want to add a trace point to touch_buffer() but macros and inline
      functions defined in header files can't have tracing points.  Move
      touch_buffer() to fs/buffer.c and make it a proper function.
      
      The new exported function is also declared inline.  As most uses of
      touch_buffer() are inside buffer.c with nilfs2 as the only other user,
      the effect of this change should be negligible.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f0059afd
    • T
      block: add @req to bio_{front|back}_merge tracepoints · 8c1cf6bb
      Tejun Heo 提交于
      bio_{front|back}_merge tracepoints report a bio merging into an
      existing request but didn't specify which request the bio is being
      merged into.  Add @req to it.  This makes it impossible to share the
      event template with block_bio_queue - split it out.
      
      @req isn't used or exported to userland at this point and there is no
      userland visible behavior change.  Later changes will make use of the
      extra parameter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8c1cf6bb
    • T
      block: add missing block_bio_complete() tracepoint · 3a366e61
      Tejun Heo 提交于
      bio completion didn't kick block_bio_complete TP.  Only dm was
      explicitly triggering the TP on IO completion.  This makes
      block_bio_complete TP useless for tracers which want to know about
      bios, and all other bio based drivers skip generating blktrace
      completion events.
      
      This patch makes all bio completions via bio_endio() generate
      block_bio_complete TP.
      
      * Explicit trace_block_bio_complete() invocation removed from dm and
        the trace point is unexported.
      
      * @rq dropped from trace_block_bio_complete().  bios may fly around
        w/o queue associated.  Verifying and accessing the assocaited queue
        belongs to TP probes.
      
      * blktrace now gets both request and bio completions.  Make it ignore
        bio completions if request completion path is happening.
      
      This makes all bio based drivers generate blktrace completion events
      properly and makes the block_bio_complete TP actually useful.
      
      v2: With this change, block_bio_complete TP could be invoked on sg
          commands which have bio's with %NULL bi_bdev.  Update TP
          assignment code to check whether bio->bi_bdev is %NULL before
          dereferencing.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Original-patch-by: NNamhyung Kim <namhyung@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3a366e61
  4. 12 1月, 2013 1 次提交
  5. 11 1月, 2013 2 次提交
  6. 10 1月, 2013 26 次提交
    • L
      Linux 3.8-rc3 · 9931faca
      Linus Torvalds 提交于
      9931faca
    • L
      Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm · 5c49985c
      Linus Torvalds 提交于
      Pull ARM fixes from Russell King.
      
      * 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
        ARM: 7616/1: cache-l2x0: aurora: Use writel_relaxed instead of writel
        ARM: 7615/1: cache-l2x0: aurora: Invalidate during clean operation with WT enable
        ARM: 7614/1: mm: fix wrong branch from Cortex-A9 to PJ4b
        ARM: 7612/1: imx: Do not select some errata that depends on !ARCH_MULTIPLATFORM
        ARM: 7611/1: VIC: fix bug in VIC irqdomain code
        ARM: 7610/1: versatile: bump IRQ numbers
        ARM: 7609/1: disable errata work-arounds which access secure registers
        ARM: 7608/1: l2x0: Only set .set_debug on PL310 r3p0 and earlier
      5c49985c
    • L
      Merge tag 'edac_fixes_for_3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp · 57a0c1e2
      Linus Torvalds 提交于
      Pull EDAC fixes from Borislav Petkov:
       "Two error path fixes causing a crash and a Kconfig fix for an issue
        which spilled all EDAC suboptions into the 'Device Drivers' menu."
      
      * tag 'edac_fixes_for_3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
        EDAC: Cleanup device deregistering path
        EDAC: Fix EDAC Kconfig menu
        EDAC: Fix kernel panic on module unloading
      57a0c1e2
    • L
      mm: reinstante dropped pmd_trans_splitting() check · e53289c0
      Linus Torvalds 提交于
      The check for a pmd being in the process of being split was dropped by
      mistake by commit d10e63f2 ("mm: numa: Create basic numa page
      hinting infrastructure"). Put it back.
      Reported-by: NDave Jones <davej@redhat.com>
      Debugged-by: NHillf Danton <dhillf@gmail.com>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e53289c0
    • M
      cred: Remove tgcred pointer from struct cred · 08c097fc
      Marc Dionne 提交于
      Commit 3a50597d ("KEYS: Make the session and process keyrings
      per-thread") removed the definition of the thread_group_cred structure,
      but left a now unused pointer in struct cred.
      Signed-off-by: NMarc Dionne <marc.c.dionne@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08c097fc
    • T
      cfq-iosched: add hierarchical cfq_group statistics · 43114018
      Tejun Heo 提交于
      Unfortunately, at this point, there's no way to make the existing
      statistics hierarchical without creating nasty surprises for the
      existing users.  Just create recursive counterpart of the existing
      stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      43114018
    • T
      cfq-iosched: collect stats from dead cfqgs · 0b39920b
      Tejun Heo 提交于
      To support hierarchical stats, it's necessary to remember stats from
      dead children.  Add cfqg->dead_stats and make a dying cfqg transfer
      its stats to the parent's dead-stats.
      
      The transfer happens form ->pd_offline_fn() and it is possible that
      there are some residual IOs completing afterwards.  Currently, we lose
      these stats.  Given that cgroup removal isn't a very high frequency
      operation and the amount of residual IOs on offline are likely to be
      nil or small, this shouldn't be a big deal and the complexity needed
      to handle residual IOs - another callback and rather elaborate
      synchronization to reach and lock the matching q - doesn't seem
      justified.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      0b39920b
    • T
      cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() · 689665af
      Tejun Heo 提交于
      Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
      cfq_pd_reset_stats() and move the latter to where other pd methods are
      defined.  cfqg_stats_reset() will be used to implement hierarchical
      stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      689665af
    • T
      blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock · 810ecfa7
      Tejun Heo 提交于
      Instead of holding blkcg->lock while walking ->blkg_list and executing
      prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
      executing prfill().  This makes prfill() implementations easier as
      stats are mostly protected by queue lock.
      
      This will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      810ecfa7
    • T
      block: RCU free request_queue · 548bc8e1
      Tejun Heo 提交于
      RCU free request_queue so that blkcg_gq->q can be dereferenced under
      RCU lock.  This will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      548bc8e1
    • T
      blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() · 16b3de66
      Tejun Heo 提交于
      Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
      The former two collect the [rw]stats designated by the target policy
      data and offset from the pd's subtree.  The latter two add one
      [rw]stat to another.
      
      Note that the recursive sum functions require the queue lock to be
      held on entry to make blkg online test reliable.  This is necessary to
      properly handle stats of a dying blkg.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      16b3de66
    • T
      blkcg: export __blkg_prfill_rwstat() · b50da39f
      Tejun Heo 提交于
      Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
      Export it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      b50da39f
    • T
      blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/ · 4d5e80a7
      Tejun Heo 提交于
      Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
      summing up stats from multiple blkgs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      4d5e80a7
    • T
      blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online · f427d909
      Tejun Heo 提交于
      Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
      which are invoked as the policy_data gets activated and deactivated
      while holding both blkcg and q locks.
      
      Also, add blkcg_gq->online bool, which is set and cleared as the
      blkcg_gq gets activated and deactivated.  This flag also is toggled
      while holding both blkcg and q locks.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      f427d909
    • T
      blkcg: add blkg_policy_data->plid · b276a876
      Tejun Heo 提交于
      Add pd->plid so that the policy a pd belongs to can be identified
      easily.  This will be used to implement hierarchical blkg_[rw]stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      b276a876
    • T
      cfq-iosched: enable full blkcg hierarchy support · d02f7aa8
      Tejun Heo 提交于
      With the previous two patches, all cfqg scheduling decisions are based
      on vfraction and ready for hierarchy support.  The only thing which
      keeps the behavior flat is cfqg_flat_parent() which makes vfraction
      calculation consider all non-root cfqgs children of the root cfqg.
      
      Replace it with cfqg_parent() which returns the real parent.  This
      enables full blkcg hierarchy support for cfq-iosched.  For example,
      consider the following hierarchy.
      
              root
            /      \
         A:500      B:250
        /     \
       AA:500  AB:1000
      
      For simplicity, let's say all the leaf nodes have active tasks and are
      on service tree.  For each leaf node, vfraction would be
      
       AA: (500  / 1500) * (500 / 750) =~ 0.2222
       AB: (1000 / 1500) * (500 / 750) =~ 0.4444
        B:                 (250 / 750) =~ 0.3333
      
      and vdisktime will be distributed accordingly.  For more detail,
      please refer to Documentation/block/cfq-iosched.txt.
      
      v2: cfq-iosched.txt updated to describe group scheduling as suggested
          by Vivek.
      
      v3: blkio-controller.txt updated.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      d02f7aa8
    • T
      cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction · 41cad6ab
      Tejun Heo 提交于
      cfq_group_slice() calculates slice by taking a fraction of
      cfq_target_latency according to the ratio of cfqg->weight against
      service_tree->total_weight.  This currently works only because all
      cfqgs are treated to be at the same level.
      
      To prepare for proper hierarchy support, convert cfq_group_slice() to
      base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
      a fraction of 1 and represents the fraction allocated to the cfqg with
      hierarchy considered, the slice can be simply calculated by
      multiplying cfqg->vfraction to cfq_target_latency (with fixed point
      shift factored in).
      
      As vfraction calculation currently treats all non-root cfqgs as
      children of the root cfqg, this patch doesn't introduce noticeable
      behavior difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      41cad6ab
    • T
      cfq-iosched: implement hierarchy-ready cfq_group charge scaling · 1d3650f7
      Tejun Heo 提交于
      Currently, cfqg charges are scaled directly according to cfqg->weight.
      Regardless of the number of active cfqgs or the amount of active
      weights, a given weight value always scales charge the same way.  This
      works fine as long as all cfqgs are treated equally regardless of
      their positions in the hierarchy, which is what cfq currently
      implements.  It can't work in hierarchical settings because the
      interpretation of a given weight value depends on where the weight is
      located in the hierarchy.
      
      This patch reimplements cfqg charge scaling so that it can be used to
      support hierarchy properly.  The scheme is fairly simple and
      light-weight.
      
      * When a cfqg is added to the service tree, v(disktime)weight is
        calculated.  It walks up the tree to root calculating the fraction
        it has in the hierarchy.  At each level, the fraction can be
        calculated as
      
          cfqg->weight / parent->level_weight
      
        By compounding these, the global fraction of vdisktime the cfqg has
        claim to - vfraction - can be determined.
      
      * When the cfqg needs to be charged, the charge is scaled inversely
        proportionally to the vfraction.
      
      The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
      representation as before; however, the smallest scaling factor is now
      1 (ie. 1 << CFQ_SERVICE_SHIFT).  This is different from before where 1
      was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
      scaling factor.
      
      While this shifts the global scale of vdisktime a bit, it doesn't
      change the relative relationships among cfqgs and the scheduling
      result isn't different.
      
      cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
      new cfqg to the service tree.  The specific value of CFQ_IDLE_DELAY
      didn't have any relevance to vdisktime before and is unlikely to cause
      any visible behavior difference now especially as the scale shift
      isn't that large.
      
      As the new scheme now makes proper distinction between cfqg->weight
      and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
      root, both weights are now mapped to ->leaf_weight instead of the
      other way around.
      
      Because we're still using cfqg_flat_parent(), this patch shouldn't
      change the scheduling behavior in any noticeable way.
      
      v2: Beefed up comments on vfraction as requested by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      1d3650f7
    • T
      cfq-iosched: implement cfq_group->nr_active and ->children_weight · 7918ffb5
      Tejun Heo 提交于
      To prepare for blkcg hierarchy support, add cfqg->nr_active and
      ->children_weight.  cfqg->nr_active counts the number of active cfqgs
      at the cfqg's level and ->children_weight is sum of weights of those
      cfqgs.  The level covers itself (cfqg->leaf_weight) and immediate
      children.
      
      The two values are updated when a cfqg enters and leaves the group
      service tree.  Unless the hierarchy is very deep, the added overhead
      should be negligible.
      
      Currently, the parent is determined using cfqg_flat_parent() which
      makes the root cfqg the parent of all other cfqgs.  This is to make
      the transition to hierarchy-aware scheduling gradual.  Scheduling
      logic will be converted to use cfqg->children_weight without actually
      changing the behavior.  When everything is ready,
      blkcg_weight_parent() will be replaced with proper parent function.
      
      This patch doesn't introduce any behavior chagne.
      
      v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      7918ffb5
    • T
      cfq-iosched: add leaf_weight · e71357e1
      Tejun Heo 提交于
      cfq blkcg is about to grow proper hierarchy handling, where a child
      blkg's weight would nest inside the parent's.  This makes tasks in a
      blkg to compete against both tasks in the sibling blkgs and the tasks
      of child blkgs.
      
      We're gonna use the existing weight as the group weight which decides
      the blkg's weight against its siblings.  This patch introduces a new
      weight - leaf_weight - which decides the weight of a blkg against the
      child blkgs.
      
      It's named leaf_weight because another way to look at it is that each
      internal blkg nodes have a hidden child leaf node which contains all
      its tasks and leaf_weight is the weight of the leaf node and handled
      the same as the weight of the child blkgs.
      
      This patch only adds leaf_weight fields and exposes it to userland.
      The new weight isn't actually used anywhere yet.  Note that
      cfq-iosched currently offcially supports only single level hierarchy
      and root blkgs compete with the first level blkgs - ie. root weight is
      basically being used as leaf_weight.  For root blkgs, the two weights
      are kept in sync for backward compatibility.
      
      v2: cfqd->root_group->leaf_weight initialization was missing from
          cfq_init_queue() causing divide by zero when
          !CONFIG_CFQ_GROUP_SCHED.  Fix it.  Reported by Fengguang.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      e71357e1
    • T
      blkcg: make blkcg_gq's hierarchical · 3c547865
      Tejun Heo 提交于
      Currently a child blkg (blkcg_gq) can be created even if its parent
      doesn't exist.  ie. Given a blkg, it's not guaranteed that its
      ancestors will exist.  This makes it difficult to implement proper
      hierarchy support for blkcg policies.
      
      Always create blkgs recursively and make a child blkg hold a reference
      to its parent.  blkg->parent is added so that finding the parent is
      easy.  blkcg_parent() is also added in the process.
      
      This change can be visible to userland.  e.g. while issuing IO in a
      nested cgroup didn't affect the ancestors at all, now it will
      initialize all ancestor blkgs and zero stats for the request_queue
      will always appear on them.  While this is userland visible, this
      shouldn't cause any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      3c547865
    • T
      blkcg: cosmetic updates to blkg_create() · 93e6d5d8
      Tejun Heo 提交于
      * Rename out_* labels to err_*.
      
      * Do ERR_PTR() conversion once in the error return path.
      
      This patch is cosmetic and to prepare for the hierarchy support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      93e6d5d8
    • T
      blkcg: reorganize blkg_lookup_create() and friends · 86cde6b6
      Tejun Heo 提交于
      Reorganize such that
      
      * __blkg_lookup() takes bool param @update_hint to determine whether
        to update hint.
      
      * __blkg_lookup_create() no longer performs lookup before trying to
        create.  Renamed to blkg_create().
      
      * blkg_lookup_create() now performs lookup and then invokes
        blkg_create() if lookup fails.
      
      * root_blkg creation in blkcg_activate_policy() updated accordingly.
        Note that blkcg_activate_policy() no longer updates lookup hint if
        root_blkg already exists.
      
      Except for the last lookup hint bit which is immaterial, this is pure
      reorganization and doesn't introduce any visible behavior change.
      This is to prepare for proper hierarchy support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      86cde6b6
    • T
      blkcg: fix minor bug in blkg_alloc() · 356d2e58
      Tejun Heo 提交于
      blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
      The latter test should have been on whether pol->pd_init_fn() exists.
      This doesn't cause actual problems because both blkcg policies
      implement pol->pd_init_fn().  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      356d2e58
    • V
      cfq-iosched: Print sync-noidle information in blktrace messages · b226e5c4
      Vivek Goyal 提交于
      Currently we attach a character "S" or "A" to the cfqq<pid>, to represent
      whether queues is sync or async. Add one more character "N" to represent
      whether it is sync-noidle queue or sync queue. So now three different
      type of queues will look as follows.
      
      cfq1234S   --> sync queus
      cfq1234SN  --> sync noidle queue
      cfq1234A   --> Async queue
      
      Previously S/A classification was being printed only if group scheduling
      was enabled. This patch also makes sure that this classification is
      displayed even if group idling is disabled.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b226e5c4
    • V
      cfq-iosched: Get rid of unnecessary local variable · 1f23f121
      Vivek Goyal 提交于
      Use of local varibale "n" seems to be unnecessary. Remove it. This brings
      it inline with function __cfq_group_st_add(), which is also doing the
      similar operation of adding a group to a rb tree.
      
      No functionality change here.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      1f23f121