1. 10 Jan, 2013 · 10 commits
    • cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction · 41cad6ab
      Committed by Tejun Heo
      cfq_group_slice() calculates slice by taking a fraction of
      cfq_target_latency according to the ratio of cfqg->weight against
      service_tree->total_weight.  This currently works only because all
      cfqgs are treated to be at the same level.
      
      To prepare for proper hierarchy support, convert cfq_group_slice() to
      base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
      a fraction of 1 and represents the fraction allocated to the cfqg with
      hierarchy considered, the slice can be simply calculated by
      multiplying cfqg->vfraction to cfq_target_latency (with fixed point
      shift factored in).
      
      As vfraction calculation currently treats all non-root cfqgs as
      children of the root cfqg, this patch doesn't introduce noticeable
      behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
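The slice calculation described in the commit above can be sketched as follows. This is a minimal illustration, not the kernel's cfq_group_slice(): the function name group_slice is hypothetical, the value of CFQ_SERVICE_SHIFT is an assumption (the message only names the constant), and the time unit is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for this sketch; the commit message only names the constant. */
#define CFQ_SERVICE_SHIFT 12

/*
 * vfraction is a fixed-point fraction of 1, so 1 << CFQ_SERVICE_SHIFT
 * stands for the whole device. The slice is the target latency
 * multiplied by that fraction, with the fixed-point shift removed.
 */
uint64_t group_slice(uint64_t target_latency, uint32_t vfraction)
{
    assert(vfraction <= (1u << CFQ_SERVICE_SHIFT));
    return (target_latency * vfraction) >> CFQ_SERVICE_SHIFT;
}
```

Under these assumptions, a group holding half of the device (vfraction of 1 << 11) gets half of the target latency as its slice.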
    • cfq-iosched: implement hierarchy-ready cfq_group charge scaling · 1d3650f7
      Committed by Tejun Heo
      Currently, cfqg charges are scaled directly according to cfqg->weight.
      Regardless of the number of active cfqgs or the amount of active
      weights, a given weight value always scales charge the same way.  This
      works fine as long as all cfqgs are treated equally regardless of
      their positions in the hierarchy, which is what cfq currently
      implements.  It can't work in hierarchical settings because the
      interpretation of a given weight value depends on where the weight is
      located in the hierarchy.
      
      This patch reimplements cfqg charge scaling so that it can be used to
      support hierarchy properly.  The scheme is fairly simple and
      light-weight.
      
      * When a cfqg is added to the service tree, v(disktime)weight is
        calculated.  It walks up the tree to root calculating the fraction
        it has in the hierarchy.  At each level, the fraction can be
        calculated as
      
          cfqg->weight / parent->level_weight
      
        By compounding these, the global fraction of vdisktime the cfqg has
        claim to - vfraction - can be determined.
      
      * When the cfqg needs to be charged, the charge is scaled inversely
        proportionally to the vfraction.
      
      The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
      representation as before; however, the smallest scaling factor is now
      1 (ie. 1 << CFQ_SERVICE_SHIFT).  This is different from before where 1
      was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
      scaling factor.
      
      While this shifts the global scale of vdisktime a bit, it doesn't
      change the relative relationships among cfqgs and the scheduling
      result isn't different.
      
      cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
      new cfqg to the service tree.  The specific value of CFQ_IDLE_DELAY
      didn't have any relevance to vdisktime before and is unlikely to cause
      any visible behavior difference now especially as the scale shift
      isn't that large.
      
      As the new scheme now makes proper distinction between cfqg->weight
      and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
      root, both weights are now mapped to ->leaf_weight instead of the
      other way around.
      
      Because we're still using cfqg_flat_parent(), this patch shouldn't
      change the scheduling behavior in any noticeable way.
      
      v2: Beefed up comments on vfraction as requested by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
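The two steps in the commit above (compounding per-level fractions into vfraction, then scaling a charge inversely to it) can be sketched as below. The struct and function names are hypothetical, the value of CFQ_SERVICE_SHIFT is an assumption, and the sketch ignores leaf_weight, so it is a simplified model rather than the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFQ_SERVICE_SHIFT 12   /* assumed value for this sketch */

/* Hypothetical, stripped-down group node; not the kernel's struct cfq_group. */
struct group {
    struct group *parent;      /* NULL for the root */
    uint32_t weight;           /* weight of this group among its siblings */
    uint32_t level_weight;     /* sum of active weights one level below this group */
};

/* Compound weight / parent->level_weight while walking up to the root. */
uint32_t compute_vfraction(const struct group *g)
{
    uint64_t vfr = 1u << CFQ_SERVICE_SHIFT;   /* start from a fraction of 1 */

    for (; g->parent != NULL; g = g->parent) {
        assert(g->parent->level_weight > 0);
        vfr = vfr * g->weight / g->parent->level_weight;
    }
    return (uint32_t)vfr;
}

/* Scale a charge inversely to vfraction; the smallest factor is 1. */
uint64_t scale_charge(uint64_t charge, uint32_t vfraction)
{
    assert(vfraction > 0);
    return (charge << CFQ_SERVICE_SHIFT) / vfraction;
}
```

In this model a group that owns the whole device is charged exactly what it used, while a group that owns half of it is charged twice as much, which keeps the relative ordering among groups the same as a weight-based scale would.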
    • cfq-iosched: implement cfq_group->nr_active and ->children_weight · 7918ffb5
      Committed by Tejun Heo
      To prepare for blkcg hierarchy support, add cfqg->nr_active and
      ->children_weight.  cfqg->nr_active counts the number of active cfqgs
      at the cfqg's level and ->children_weight is sum of weights of those
      cfqgs.  The level covers itself (cfqg->leaf_weight) and immediate
      children.
      
      The two values are updated when a cfqg enters and leaves the group
      service tree.  Unless the hierarchy is very deep, the added overhead
      should be negligible.
      
      Currently, the parent is determined using cfqg_flat_parent() which
      makes the root cfqg the parent of all other cfqgs.  This is to make
      the transition to hierarchy-aware scheduling gradual.  Scheduling
      logic will be converted to use cfqg->children_weight without actually
      changing the behavior.  When everything is ready,
      blkcg_weight_parent() will be replaced with proper parent function.
      
      This patch doesn't introduce any behavior change.
      
      v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
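The bookkeeping described in the commit above can be sketched as follows. This is one possible reading of the message, with hypothetical struct and function names: a group adds its leaf_weight to its own level when it becomes active, and its weight is added to the parent's level only on the first activation, so the upward walk stops at the first ancestor that was already active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down group node; not the kernel's struct cfq_group. */
struct group {
    struct group *parent;          /* NULL for the root */
    unsigned int weight;           /* weight against sibling groups */
    unsigned int leaf_weight;      /* weight of the group's own tasks */
    unsigned int nr_active;        /* active entities at this group's level */
    unsigned int children_weight;  /* sum of their weights */
};

/* Called when a group enters the service tree. */
void group_activate(struct group *g)
{
    struct group *pos = g;
    bool propagate = (pos->nr_active++ == 0);

    pos->children_weight += pos->leaf_weight;
    while (propagate && pos->parent != NULL) {
        struct group *parent = pos->parent;

        propagate = (parent->nr_active++ == 0);
        parent->children_weight += pos->weight;
        pos = parent;
    }
}

/* Called when a group leaves the service tree; mirrors group_activate(). */
void group_deactivate(struct group *g)
{
    struct group *pos = g;
    bool propagate;

    assert(pos->nr_active > 0);
    propagate = (--pos->nr_active == 0);
    pos->children_weight -= pos->leaf_weight;
    while (propagate && pos->parent != NULL) {
        struct group *parent = pos->parent;

        assert(parent->nr_active > 0);
        propagate = (--parent->nr_active == 0);
        parent->children_weight -= pos->weight;
        pos = parent;
    }
}
```

With a flat parent function, every walk is at most one step long, which is consistent with the message's remark that the overhead stays negligible unless the hierarchy is very deep.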
    • cfq-iosched: add leaf_weight · e71357e1
      Committed by Tejun Heo
      cfq blkcg is about to grow proper hierarchy handling, where a child
      blkg's weight would nest inside the parent's.  This makes tasks in a
      blkg compete against both tasks in the sibling blkgs and the tasks
      of child blkgs.
      
      We're gonna use the existing weight as the group weight which decides
      the blkg's weight against its siblings.  This patch introduces a new
      weight - leaf_weight - which decides the weight of a blkg against the
      child blkgs.
      
      It's named leaf_weight because another way to look at it is that each
      internal blkg node has a hidden child leaf node which contains all
      its tasks; leaf_weight is the weight of that leaf node and is handled
      the same as the weight of the child blkgs.
      
      This patch only adds leaf_weight fields and exposes it to userland.
      The new weight isn't actually used anywhere yet.  Note that
      cfq-iosched currently officially supports only single level hierarchy
      and root blkgs compete with the first level blkgs - ie. root weight is
      basically being used as leaf_weight.  For root blkgs, the two weights
      are kept in sync for backward compatibility.
      
      v2: cfqd->root_group->leaf_weight initialization was missing from
          cfq_init_queue() causing divide by zero when
          !CONFIG_CFQ_GROUP_SCHED.  Fix it.  Reported by Fengguang.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
    • cfq-iosched: Print sync-noidle information in blktrace messages · b226e5c4
      Committed by Vivek Goyal
      Currently we attach a character "S" or "A" to the cfqq<pid>, to represent
      whether the queue is sync or async. Add one more character "N" to represent
      whether it is a sync-noidle queue or a sync queue. So now the three different
      types of queues will look as follows.
      
      cfq1234S   --> sync queue
      cfq1234SN  --> sync noidle queue
      cfq1234A   --> async queue
      
      Previously S/A classification was being printed only if group scheduling
      was enabled. This patch also makes sure that this classification is
      displayed even if group idling is disabled.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
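The naming scheme listed in the commit above can be sketched in user-space C as follows. The function name format_cfqq_name and its parameters are hypothetical; only the resulting strings (cfq1234S, cfq1234SN, cfq1234A) come from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Build "cfq<pid>" followed by "S" (sync), "SN" (sync-noidle) or "A" (async).
 * Returns the number of characters snprintf() would have written.
 */
int format_cfqq_name(char *buf, size_t len, int pid, bool sync, bool sync_noidle)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "cfq%d%c%s", pid, sync ? 'S' : 'A',
                    (sync && sync_noidle) ? "N" : "");
}
```

The "N" suffix is only appended for sync queues in this sketch, since the message describes sync-noidle as a subtype of sync.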
    • cfq-iosched: Get rid of unnecessary local variable · 1f23f121
      Committed by Vivek Goyal
      Use of the local variable "n" seems to be unnecessary. Remove it. This brings
      it in line with the function __cfq_group_st_add(), which also performs the
      similar operation of adding a group to an rb tree.
      
      No functionality change here.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cfq-iosched: Rename few functions related to selecting workload · 6d816ec7
      Committed by Vivek Goyal
      choose_service_tree() selects/sets both wl_class and wl_type.  Rename it to
      choose_wl_class_and_type() to make it very clear.
      
      cfq_choose_wl() only selects and sets wl_type. It is easy to confuse
      it with choose_st(). So rename it to cfq_choose_wl_type() to make
      it clear what it does.
      
      Just renaming. No functionality change.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cfq-iosched: Rename "service_tree" to "st" at some places · 34b98d03
      Committed by Vivek Goyal
      In quite a few places we use the keyword "service_tree". In some places,
      especially local variables, I have abbreviated it to "st".
      
      Also, in a couple of places, moved the binary operator "+" from the beginning
      of a line to the end of the previous line, as per Tejun's feedback.
      
      v2:
       Reverted most of the service tree name change based on Jeff Moyer's feedback.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cfq-iosched: More renaming to better represent wl_class and wl_type · 4d2ceea4
      Committed by Vivek Goyal
      Some more renaming. Again making the code uniform w.r.t use of
      wl_class/class to represent IO class (RT, BE, IDLE) and using
      wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC).
      
      At places this patch shortens the string "workload" to "wl".
      Renamed "saved_workload" to "saved_wl_type". Renamed
      "saved_serving_class" to "saved_wl_class".
      
      For uniformity with "saved_wl_*" variables, renamed "serving_class"
      to "serving_wl_class" and renamed "serving_type" to "serving_wl_type".
      
      Again, just trying to improve upon code uniformity and improve
      readability. No functional change.
      
      v2:
      - Restored the usage of keyword "service" based on Jeff Moyer's feedback.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cfq-iosched: Properly name all references to IO class · 3bf10fea
      Committed by Vivek Goyal
      Currently CFQ has three IO classes: RT, BE and IDLE. In many places we
      refer to workloads belonging to these classes as "prio". This gets
      very confusing as one starts to associate it with ioprio.
      
      So this patch just does a bunch of renaming so that reading the code becomes
      easier. All references to RT, BE and IDLE workloads are made using the keyword
      "class" and all references to the subclasses SYNC, SYNC-IDLE and ASYNC are made
      using the keyword "type".
      
      This makes me feel much better while I am reading the code. There is no
      functionality change due to this patch.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  2. 06 Nov, 2012 · 1 commit
    • block CFQ: avoid moving request to different queue · 3d106fba
      Committed by Shaohua Li
      Requests are queued in the cfqq->fifo list. It looks possible that we move a
      request from one cfqq to another in the request merge case. In such a case, adjusting
      the fifo list order doesn't make sense and is impossible if we don't iterate
      the whole fifo list.
      
      My test does hit one case where the two cfqqs are different, but it didn't cause a
      kernel crash, maybe because the fifo list isn't used frequently. Anyway, from the
      code logic, this is buggy.
      
      I thought we can re-enable the recursive merge logic after this is fixed.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 04 Jun, 2012 · 2 commits
    • block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED · ffea73fc
      Committed by Tejun Heo
      cfq may be built w/ or w/o blkcg support depending on
      CONFIG_CFQ_GROUP_IOSCHED.  If blkcg support is disabled, most of
      related code is ifdef'd out but some part is left dangling -
      blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
      calls are made on it.
      
      Feeding zero filled policy to blkcg_policy_register() is incorrect and
      triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
      !CONFIG_CFQ_GROUP_IOSCHED.
      
       ------------[ cut here ]------------
       WARNING: at block/blk-cgroup.c:867
       Modules linked in:
       Modules linked in:
       CPU: 3 Not tainted 3.4.0-09547-gfb21affa #1
       Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
       Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
      	    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
       Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
      	    000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
      	    000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
      	    00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
       Krnl Code: 00000000003d76be: a7280001           lhi     %r2,1
      	    00000000003d76c2: a7f4ffdf           brc     15,3d7680
      	   #00000000003d76c6: a7f40001           brc     15,3d76c8
      	   >00000000003d76ca: a7c8ffea           lhi     %r12,-22
      	    00000000003d76ce: a7f4ffce           brc     15,3d766a
      	    00000000003d76d2: a7f40001           brc     15,3d76d4
      	    00000000003d76d6: a7c80000           lhi     %r12,0
      	    00000000003d76da: a7f4ffc2           brc     15,3d765e
       Call Trace:
       ([<0000000000b6f000>] initcall_debug+0x0/0x4)
        [<0000000000989e8a>] cfq_init+0x62/0xd4
        [<00000000001000ba>] do_one_initcall+0x3a/0x170
        [<000000000096fb60>] kernel_init+0x214/0x2bc
        [<0000000000623202>] kernel_thread_starter+0x6/0xc
        [<00000000006231fc>] kernel_thread_starter+0x0/0xc
       no locks held by swapper/0/1.
       Last Breaking-Event-Address:
        [<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
       ---[ end trace b8ef4903fcbf9dd3 ]---
      
      This patch fixes the problem by ensuring all blkcg support code is
      inside CONFIG_CFQ_GROUP_IOSCHED.
      
      * blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
        inside the first CONFIG_CFQ_GROUP_IOSCHED block.  __maybe_unused is
        dropped from blkcg_policy_cfq decl.
      
      * blkcg_deactivate_policy() invocation is moved inside ifdef.  This
        also makes the activation logic match cfq_init_queue().
      
      * All blkcg_policy_[un]register() invocations are moved inside ifdef.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix return value on cfq_init() failure · fd794956
      Committed by Tejun Heo
      cfq_init() would return zero after kmem cache creation failure.  Fix
      so that it returns -ENOMEM.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 20 Apr, 2012 · 11 commits
    • blkcg: collapse blkcg_policy_ops into blkcg_policy · f9fcc2d3
      Committed by Tejun Heo
      There's no reason to keep blkcg_policy_ops separate.  Collapse it into
      blkcg_policy.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: embed struct blkg_policy_data in policy specific data · f95a04af
      Committed by Tejun Heo
      Currently blkg_policy_data carries policy specific data as char flex
      array instead of being embedded in policy specific data.  This was
      forced by oddities around blkg allocation which are all gone now.
      
      This patch makes blkg_policy_data embedded in policy specific data -
      throtl_grp and cfq_group so that it's more conventional and consistent
      with how io_cq is handled.
      
      * blkcg_policy->pdata_size is renamed to ->pd_size.
      
      * Functions which used to take void *pdata now take struct
        blkg_policy_data *pd.
      
      * blkg_to_pdata/pdata_to_blkg() updated to blkg_to_pd/pd_to_blkg().
      
      * Dummy struct blkg_policy_data definition added.  Dummy
        pdata_to_blkg() definition was unused and inconsistent with the
        non-dummy version - correct dummy pd_to_blkg() added.
      
      * throtl and cfq updated accordingly.
      
      * As dummy blkg_to_pd/pd_to_blkg() are provided,
        blkg_to_cfqg/cfqg_to_blkg() don't need to be ifdef'd.  Moved outside
        ifdef block.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
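The embedding pattern described in the commit above can be sketched as follows. The struct layouts and the helper name pd_to_cfqg_sketch are hypothetical stand-ins (the real structs carry many more fields); the point is only that the policy-specific struct contains the generic one and is recovered from it by pointer arithmetic, in the style of the kernel's container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generic per-policy data. */
struct blkg_policy_data {
    int plid;                      /* illustrative field only */
};

/* Hypothetical stand-in for a policy-specific struct such as cfq_group. */
struct cfq_group_sketch {
    struct blkg_policy_data pd;    /* embedded generic part */
    unsigned int weight;           /* policy-specific field */
};

/* Recover the enclosing policy struct from a pointer to its embedded member. */
struct cfq_group_sketch *pd_to_cfqg_sketch(struct blkg_policy_data *pd)
{
    assert(pd != NULL);
    return (struct cfq_group_sketch *)
        ((char *)pd - offsetof(struct cfq_group_sketch, pd));
}
```

Compared with a trailing char flex array, embedding lets generic code pass a typed struct blkg_policy_data pointer around while the policy keeps a single allocation.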
    • blkcg: mass rename of blkcg API · 3c798398
      Committed by Tejun Heo
      During the recent blkcg cleanup, most of the blkcg API has changed to such
      an extent that mass renaming wouldn't cause any noticeable pain.  Take
      the chance and clean up the naming.
      
      * Rename blkio_cgroup to blkcg.
      
      * Drop blkio / blkiocg prefixes and consistently use blkcg.
      
      * Rename blkio_group to blkcg_gq, which is consistent with io_cq but
        keep the blkg prefix / variable name.
      
      * Rename policy method type and field names to signify they're dealing
        with policy data.
      
      * Rename blkio_policy_type to blkcg_policy.
      
      This patch doesn't cause any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove blkio_group->path[] · 54e7ed12
      Committed by Tejun Heo
      blkio_group->path[] stores the path of the associated cgroup and is
      used only for debug messages.  Just format the path from blkg->cgroup
      when printing debug messages.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: drop stuff unused after per-queue policy activation update · 3c96cb32
      Committed by Tejun Heo
      * all_q_list is unused.  Drop all_q_{mutex|list}.
      
      * @for_root of blkg_lookup_create() is always %false when called from
        outside blk-cgroup.c proper.  Factor out __blkg_lookup_create() so
        that it doesn't check whether @q is bypassing and use the
        underscored version for the @for_root callsite.
      
      * blkg_destroy_all() is used only from blkcg proper and @destroy_root
        is always %true.  Make it static and drop @destroy_root.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: implement per-queue policy activation · a2b1693b
      Committed by Tejun Heo
      All blkcg policies were assumed to be enabled on all request_queues.
      Due to various implementation obstacles, during the recent blkcg core
      updates, this was temporarily implemented as shooting down all !root
      blkgs on elevator switch and policy [de]registration combined with
      half-broken in-place root blkg updates.  In addition to being buggy
      and racy, this meant losing all blkcg configurations across those
      events.
      
      Now that blkcg is cleaned up enough, this patch replaces the temporary
      implementation with proper per-queue policy activation.  Each blkcg
      policy should call the new blkcg_[de]activate_policy() to enable and
      disable the policy on a specific queue.  blkcg_activate_policy()
      allocates and installs policy data for the policy for all existing
      blkgs.  blkcg_deactivate_policy() does the reverse.  If a policy is
      not enabled for a given queue, blkg printing / config functions skip
      the respective blkg for the queue.
      
      blkcg_activate_policy() also takes care of root blkg creation, and
      cfq_init_queue() and blk_throtl_init() are updated accordingly.
      
      This makes blkcg_bypass_{start|end}() and update_root_blkg_pd()
      unnecessary.  Dropped.
      
      v2: cfq_init_queue() was returning uninitialized @ret on root_group
          alloc failure if !CONFIG_CFQ_GROUP_IOSCHED.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: add request_queue->root_blkg · 03d8e111
      Committed by Tejun Heo
      With per-queue policy activation, root blkg creation will be moved to
      blkcg core.  Add q->root_blkg in preparation.  For blk-throtl, this
      replaces throtl_data->root_tg; however, cfq needs to keep
      cfqd->root_group for !CONFIG_CFQ_GROUP_IOSCHED.
      
      This is to prepare for per-queue policy activation and doesn't cause
      any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: make blkg_conf_prep() take @pol and return with queue lock held · da8b0662
      Committed by Tejun Heo
      Add @pol to blkg_conf_prep() and let it return with queue lock held
      (to be released by blkg_conf_finish()).  Note that @pol isn't used
      yet.
      
      This is to prepare for per-queue policy activation and doesn't cause
      any visible difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove static policy ID enums · 8bd435b3
      Committed by Tejun Heo
      Remove BLKIO_POLICY_* enums and let blkio_policy_register() allocate
      @pol->plid dynamically on registration.  The maximum number of blkcg
      policies which can be registered at the same time is defined by
      BLKCG_MAX_POLS constant added to include/linux/blkdev.h.
      
      Note that blkio_policy_register() now may fail.  Policy init functions
      updated accordingly and unnecessary ifdefs removed from cfq_init().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
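Dynamic policy ID allocation of the kind described in the commit above can be sketched as a fixed-size slot table. Everything here is hypothetical: the struct, the function names, the table, the value chosen for BLKCG_MAX_POLS, and the -ENOSPC error code are assumptions made for illustration, not the blkcg implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define BLKCG_MAX_POLS 2   /* assumed value for this sketch */

/* Hypothetical policy descriptor; plid is assigned at registration time. */
struct policy {
    int plid;
};

static struct policy *policy_table[BLKCG_MAX_POLS];

/* Take the first free slot. Returns 0 on success, -ENOSPC if the table is full. */
int policy_register(struct policy *pol)
{
    assert(pol != NULL);
    for (int i = 0; i < BLKCG_MAX_POLS; i++) {
        if (policy_table[i] == NULL) {
            policy_table[i] = pol;
            pol->plid = i;
            return 0;
        }
    }
    return -ENOSPC;
}

/* Release the slot so a later registration can reuse the ID. */
void policy_unregister(struct policy *pol)
{
    assert(pol != NULL);
    assert(pol->plid >= 0 && pol->plid < BLKCG_MAX_POLS);
    assert(policy_table[pol->plid] == pol);
    policy_table[pol->plid] = NULL;
    pol->plid = -1;
}
```

Because registration can now fail, callers such as policy init functions have to check the return value, which is the behavior change the message points out.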
    • blkcg: use @pol instead of @plid in update_root_blkg_pd() and blkcg_print_blkgs() · ec399347
      Committed by Tejun Heo
      The two functions were taking "enum blkio_policy_id plid".  Make them
      take "const struct blkio_policy_type *pol" instead.
      
      This is to prepare for per-queue policy activation and doesn't cause
      any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • cfq: fix build breakage & warnings · f48ec1d7
      Committed by Tejun Heo
      * CFQ_WEIGHT_* defined inside CONFIG_BLK_CGROUP causes cfq-iosched.c
        compile failure when the config is disabled.  Move it outside the
        ifdef block.
      
      * Dummy cfqg_stats_*() definitions were lacking inline modifiers
        causing unused functions warning if !CONFIG_CFQ_GROUP_IOSCHED.  Add
        them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 02 Apr, 2012 · 10 commits
    • blkcg: drop BLKCG_STAT_{PRIV|POL|OFF} macros · 5bc4afb1
      Committed by Tejun Heo
      Now that all stat handling code lives in policy implementations,
      there's no need to encode policy ID in cft->private.
      
      * Export blkcg_prfill_[rw]stat() from blkcg, remove
        blkcg_print_[rw]stat(), and implement cfqg_print_[rw]stat() which
        use hard-coded BLKIO_POLICY_PROP.
      
      * Use cft->private for offset of the target field directly and drop
        BLKCG_STAT_{PRIV|POL|OFF}().
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: pass around pd->pdata instead of pd itself in prfill functions · d366e7ec
      Committed by Tejun Heo
      Now that all conf and stat fields are moved into policy specific
      blkio_policy_data->pdata areas, there's no reason to use
      blkio_policy_data itself in prfill functions.  Pass around @pd->pdata
      instead of @pd.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: move blkio_group_conf->weight to cfq · 3381cb8d
      Committed by Tejun Heo
      blkio_group_conf->weight is owned by cfq and has no reason to be
      defined in blkcg core.  Replace it with cfq_group->dev_weight and let
      conf setting functions directly set it.  If dev_weight is zero, the
      cfqg doesn't have device specific weight configured.
      
      Also, rename BLKIO_WEIGHT_* constants to CFQ_WEIGHT_* and rename
      blkio_cgroup->weight to blkio_cgroup->cfq_weight.  We eventually want
      per-policy storage in blkio_cgroup but just mark the ownership of the
      field for now.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: move blkio_group_stats to cfq-iosched.c · 155fead9
      Committed by Tejun Heo
      blkio_group_stats contains only fields used by cfq and has no reason
      to be defined in blkcg core.
      
      * Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats.
      
      * blkg_policy_data->stats is replaced with cfq_group->stats.
        blkg_prfill_[rw]stat() are updated to use offset against pd->pdata
        instead.
      
      * All related macros / functions are renamed so that they have cfqg_
        prefix and the unnecessary @pol arguments are dropped.
      
      * All stat functions now take cfq_group * instead of blkio_group *.
      
      * lockdep assertion on queue lock dropped.  Elevator runs under queue
        lock by default.  There isn't much to be gained by adding lockdep
        assertions at stat function level.
      
      * cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method
        so that cfqg->stats can be reset.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: cfq doesn't need per-cpu dispatch stats · 41b38b6d
      Committed by Tejun Heo
      blkio_group_stats_cpu is used to count dispatch stats using per-cpu
      counters.  This is used by both blk-throtl and cfq-iosched but the
      sharing is rather silly.
      
      * cfq-iosched doesn't need per-cpu dispatch stats.  cfq always updates
        those stats while holding queue_lock.
      
      * blk-throtl needs per-cpu dispatch stats but only service_bytes and
        serviced.  It doesn't make use of sectors.
      
      This patch makes cfq add and use global stats for service_bytes,
      serviced and sectors, removes per-cpu sectors counter and moves
      per-cpu stat printing code to blk-throttle.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: move statistics update code to policies · 629ed0b1
      Committed by Tejun Heo
      As with conf/stats file handling code, there's no reason for stat
      update code to live in blkcg core with policies calling into it to
      update them.  The current organization is both inflexible and complex.
      
      This patch moves stat update code to specific policies.  All
      blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP
      stats are collapsed into their cfq_blkiocg_update_*_stats()
      counterparts.  blkiocg_update_dispatch_stats() is used by both
      policies and duplicated as throtl_update_dispatch_stats() and
      cfq_blkiocg_update_dispatch_stats().  This will be cleaned up later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cfq: collapse cfq.h into cfq-iosched.c · 2ce4d50f
      Committed by Tejun Heo
      block/cfq.h contains some functions which interact with blkcg;
      however, this is only part of it and cfq-iosched.c already has quite
      some #ifdef CONFIG_CFQ_GROUP_IOSCHED.  With conf/stat handling being
      moved to specific policies, having these relay functions isolated in
      cfq.h doesn't make much sense.  Collapse cfq.h into cfq-iosched.c for
      now.  Let's split blkcg support properly later if necessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: move conf/stat file handling code to policies · 60c2bc2d
      Committed by Tejun Heo
      blkcg conf/stat handling is convoluted in that details which belong to
      specific policy implementations are all out in blkcg core and then
      policies hook into core layer to access and manipulate confs and
      stats.  This sadly achieves both inflexibility (confs/stats can't be
      modified without messing with blkcg core) and complexity (all the
      call-ins and call-backs).
      
      The previous patches restructured conf and stat handling code such
      that they can be separated out.  This patch relocates the file
      handling part.  All conf/stat file handling code which belongs to
      BLKIO_POLICY_PROP is moved to cfq-iosched.c and all
      BLKIO_POLICY_THROTL code to blk-throttle.c.
      
      The move is verbatim except for blkio_update_group_{weight|bps|iops}()
      callbacks which relay conf changes to policies.  The configuration
      settings are handled in policies themselves so the relaying isn't
      necessary.  Conf setting functions are modified to directly call
      per-policy update functions and the relaying mechanism is dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: remove unused @pol and @plid parameters · aaec55a0
      Committed by Tejun Heo
      @pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no
      longer necessary.  Drop them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • block: Make cfq_target_latency tunable through sysfs. · 5bf14c07
      Committed by Tao Ma
      In cfq, when we calculate a time slice for a process (or a cfqq to
      be precise), we have to consider the cfq_target_latency so that all the
      sync requests have an estimated latency (300ms), which is controlled by
      cfq_target_latency. But in some hadoop tests, we have found that if
      there are many processes doing sequential reads (24 for example), the
      throughput is bad because every process can only work for about 25ms
      before the cfqq is switched. That leads to more disk seeks. We can
      achieve good throughput by setting low_latency=0, but then some
      reads' latency is too high for the application.
      
      So this patch makes cfq_target_latency tunable through sysfs so that
      we can tune it and find some magic number which is not bad for both
      the throughput and the read latency.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 23 Mar, 2012 · 1 commit
    • cfq: fix cfqg ref handling when BLK_CGROUP && !CFQ_GROUP_IOSCHED · eb7d8c07
      Committed by Tejun Heo
      When BLK_CGROUP is enabled but CFQ_GROUP_IOSCHED isn't, cfq ends up
      calling blkg_get/put() on a dummy cfqg, leading to the following crash.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
        IP: [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU 0
        Modules linked in:
      
        Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc6-work+ #125 Bochs Bochs
        RIP: 0010:[<ffffffff813d44d8>]  [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
        RSP: 0018:ffff88001f9dfd80  EFLAGS: 00010046
        RAX: ffff88001aefbbf0 RBX: ffff88001aeedbf0 RCX: 0000000000000100
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820ffd40
        RBP: ffff88001f9dfdd0 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
        R13: 0000000000000009 R14: ffff88001aefbc30 R15: 0000000000000003
        FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00000000000000b0 CR3: 000000000206f000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process swapper/0 (pid: 1, threadinfo ffff88001f9de000, task ffff88001f9dc040)
        Stack:
         ffff88001aeedbf0 ffff88001aefbdb0 ffff88001aef1548 ffff88001aefbbf0
         ffff88001f9dfdd0 ffff88001aef1548 ffffffff820d6320 ffffffff8165ce30
         ffffffff82c555e0 ffff88001aeebbf0 ffff88001f9dfe00 ffffffff813b0507
        Call Trace:
         [<ffffffff813b0507>] elevator_init+0xd7/0x140
         [<ffffffff813b83d5>] blk_init_allocated_queue+0x125/0x150
         [<ffffffff813b94d3>] blk_init_queue_node+0x43/0x80
         [<ffffffff813b9523>] blk_init_queue+0x13/0x20
         [<ffffffff821aec00>] floppy_init+0x82/0xec7
         [<ffffffff810001d2>] do_one_initcall+0x42/0x170
         [<ffffffff821835fc>] kernel_init+0xcb/0x14f
         [<ffffffff81b40b24>] kernel_thread_helper+0x4/0x10
        Code: 00 e8 1d 9e 76 00 48 8b 43 48 48 85 c0 48 89 83 28 03 00 00 74 07 4c 8b a0 10 ff ff ff 8b 15 b0 2e d0 00 85 d2 0f 85 49 01 00 00 <41> 8b 84 24 b0 00 00 00 85 c0 0f 8e 8c 01 00 00 83 e8 01 85 c0
        RIP  [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
      
      Because cfq's blkcg support has an on/off switch, CFQ_GROUP_IOSCHED,
      separate from BLK_CGROUP, blkg access through cfqg needs to be
      conditioned on it.
      
      * Make blkg_to_cfqg() and cfqg_to_blkg() conditioned on
        CFQ_GROUP_IOSCHED.  If disabled, they always return %NULL.
      
      * Introduce cfqg_get() and cfqg_put() conditioned on
        CFQ_GROUP_IOSCHED.  If disabled, they are noops.
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb7d8c07
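The two bullet points above can be sketched in stand-alone C as follows. The struct definitions and the `DEMO_CFQ_GROUP_IOSCHED` macro are hypothetical stand-ins for the kernel types and config option; only the pattern of compiling the accessors down to %NULL returns and noops is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct demo_blkg { int refcnt; };
struct demo_cfqg { struct demo_blkg *blkg; };

#ifdef DEMO_CFQ_GROUP_IOSCHED
/* Group scheduling enabled: the cfqg is backed by a real blkg. */
static struct demo_blkg *cfqg_to_blkg(struct demo_cfqg *cfqg)
{
	return cfqg->blkg;
}

static void cfqg_get(struct demo_cfqg *cfqg)
{
	cfqg_to_blkg(cfqg)->refcnt++;
}

static void cfqg_put(struct demo_cfqg *cfqg)
{
	struct demo_blkg *blkg = cfqg_to_blkg(cfqg);

	assert(blkg->refcnt > 0);
	blkg->refcnt--;
}
#else
/* Group scheduling disabled: never touch a blkg behind a dummy cfqg. */
static struct demo_blkg *cfqg_to_blkg(struct demo_cfqg *cfqg)
{
	(void)cfqg;
	return NULL;
}

static void cfqg_get(struct demo_cfqg *cfqg) { (void)cfqg; }
static void cfqg_put(struct demo_cfqg *cfqg) { (void)cfqg; }
#endif
```

With the macro undefined, calling `cfqg_get()` or `cfqg_put()` on a dummy group is harmless because no blkg pointer is ever dereferenced.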
  7. 20 Mar, 2012 2 commits
    •
      cfq: don't use icq_get_changed() · 598971bf
      Tejun Heo committed
      cfq caches the associated cfqq's for a given cic.  The cache needs to
      be flushed if the cic's ioprio or blkcg has changed.  It is currently
      done by requiring the changing action to set the respective
      ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(),
      which involves iterating through all the affected icqs.
      
      All cfq wants to know is whether ioprio and/or blkcg have changed
      since the last flush, which can be easily achieved by just remembering
      the current ioprio and blkcg ID in cic.
      
      This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to
      use the remembered value instead, and updates cfq_set_request() path
      such that, instead of using icq_get_changed(), the current values are
      compared against the remembered ones and the appropriate flush
      action is triggered if they differ.  Condition tests are moved inside both _changed
      functions which are now named check_ioprio_changed() and
      check_blkcg_changed().
      
      ioprio.h::task_ioprio*() can't be used anymore and is replaced with
      open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      598971bf
    •
      cfq: pass around cfq_io_cq instead of io_context · abede6da
      Tejun Heo committed
      Now that io_cq is managed by block core and guaranteed to exist for
      any in-flight request, it is easier and carries more information to
      pass around cfq_io_cq than io_context.
      
      This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and
      cfq_get_queue() to take @cic instead of @ioc.  This change removes a
      duplicate cfq_cic_lookup() from cfq_find_alloc_queue().
      
      This change enables the use of cic-cached ioprio in the next patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      abede6da
  8. 07 Mar, 2012 3 commits
    •
      block: make block cgroup policies follow bio task association · 4f85cb96
      Tejun Heo committed
      Implement bio_blkio_cgroup() which returns the blkcg associated with
      the bio if one exists or %current's blkcg otherwise, and use it in
      blk-throttle and cfq-iosched propio.  This makes both cgroup policies honor task
      association for the bio instead of always assuming %current.
      
      As nobody is using bio_set_task() yet, this doesn't introduce any
      behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4f85cb96
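The lookup rule this commit describes is a simple fallback, sketched below in stand-alone C. The types and the function name `demo_bio_blkcg()` are hypothetical; the real bio_blkio_cgroup() works on kernel structures and reads %current itself rather than taking it as a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types. */
struct demo_blkcg { int id; };
struct demo_bio { struct demo_blkcg *blkcg; };	/* NULL if not associated */

/*
 * Prefer the blkcg recorded in the bio; fall back to the blkcg of the
 * task currently running (passed in explicitly in this sketch).
 */
static struct demo_blkcg *demo_bio_blkcg(const struct demo_bio *bio,
					 struct demo_blkcg *current_blkcg)
{
	assert(current_blkcg != NULL);
	if (bio != NULL && bio->blkcg != NULL)
		return bio->blkcg;
	return current_blkcg;
}
```

As long as no bio carries an association, the function always returns the current task's blkcg, which matches the statement that the patch introduces no behavior change yet.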
    •
      block: implement bio_associate_current() · 852c788f
      Tejun Heo committed
      IO scheduling and cgroup are tied to the issuing task via io_context
      and cgroup of %current.  Unfortunately, there are cases where IOs need
      to be routed via a different task, which causes scheduling and cgroup
      limit enforcement to be applied completely incorrectly.
      
      For example, all bios delayed by blk-throttle end up being issued by a
      delayed work item, get assigned the io_context of the worker task
      which happens to serve the work item, and are dumped to the default block
      cgroup.  This is doubly confusing as bios which aren't delayed end up
      in the correct cgroup, and it makes using blk-throttle and cfq propio
      together impossible.
      
      Any code which punts IO issuing to another task is affected which is
      getting more and more common (e.g. btrfs).  As both io_context and
      cgroup are firmly tied to task including userland visible APIs to
      manipulate them, it makes a lot of sense to match up tasks to bios.
      
      This patch implements bio_associate_current() which associates the
      specified bio with %current.  The bio will record the associated ioc
      and blkcg at that point and block layer will use the recorded ones
      regardless of which task actually ends up issuing the bio.  bio
      release puts the associated ioc and blkcg.
      
      It grabs and remembers ioc and blkcg instead of the task itself
      because task may already be dead by the time the bio is issued making
      ioc and blkcg inaccessible and those are all block layer cares about.
      
      elevator_set_req_fn() is updated such that the bio for which elvdata is
      being allocated is available to the elevator.
      
      This doesn't update block cgroup policies yet.  Further patches will
      implement the support.
      
      -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
           rq_ioc() to fix build breakage.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      852c788f
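The association lifecycle described in this commit (record ioc and blkcg at association time, put them on bio release) can be modelled with plain reference counts. Everything in this C sketch is a simplified assumption: the types, the function names, and the choice to reject a second association with -1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for both ioc and blkcg. */
struct demo_ref { int refs; };

struct demo_assoc_bio {
	struct demo_ref *ioc;	/* recorded io_context, or NULL */
	struct demo_ref *blkcg;	/* recorded blkcg, or NULL */
};

/* Record the issuer's ioc and blkcg in the bio, taking a ref on each. */
static int demo_bio_associate(struct demo_assoc_bio *bio,
			      struct demo_ref *cur_ioc,
			      struct demo_ref *cur_blkcg)
{
	assert(bio != NULL && cur_ioc != NULL && cur_blkcg != NULL);
	if (bio->ioc != NULL)
		return -1;	/* already associated (demo choice) */
	cur_ioc->refs++;
	cur_blkcg->refs++;
	bio->ioc = cur_ioc;
	bio->blkcg = cur_blkcg;
	return 0;
}

/* On bio release, drop the references taken at association time. */
static void demo_bio_release(struct demo_assoc_bio *bio)
{
	if (bio->ioc != NULL) {
		bio->ioc->refs--;
		bio->ioc = NULL;
	}
	if (bio->blkcg != NULL) {
		bio->blkcg->refs--;
		bio->blkcg = NULL;
	}
}
```

Holding references on the ioc and blkcg, rather than on the task, is what keeps both reachable even if the issuing task has exited by the time the bio is actually submitted.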
    •
      block: add io_context->active_ref · f6e8d01b
      Tejun Heo committed
      Currently ioc->nr_tasks is used to decide two things - whether an ioc
      is done issuing IOs and whether it's shared by multiple tasks.  This
      patch separates out the first into ioc->active_ref, which is acquired
      and released using {get|put}_io_context_active() respectively.
      
      This will be used to associate bio's with a given task.  This patch
      doesn't introduce any visible behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f6e8d01b
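A rough stand-alone C model of the split between the two counters. The struct and function names are hypothetical and the real helpers use atomic operations; the sketch only shows that "shared by multiple tasks" and "done issuing IOs" become independent questions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced io_context with the two separate counters. */
struct demo_ioc {
	int nr_tasks;	/* tasks sharing this ioc */
	int active_ref;	/* users that may still issue IOs through it */
};

/* Take an active reference, e.g. on behalf of an associated bio. */
static void demo_get_active(struct demo_ioc *ioc)
{
	assert(ioc->active_ref > 0);
	ioc->active_ref++;
}

/* Drop an active reference; returns true once the ioc is done issuing IOs. */
static bool demo_put_active(struct demo_ioc *ioc)
{
	assert(ioc->active_ref > 0);
	return --ioc->active_ref == 0;
}

/* Sharing is still answered by nr_tasks alone. */
static bool demo_ioc_shared(const struct demo_ioc *ioc)
{
	return ioc->nr_tasks > 1;
}
```

In this model an ioc owned by a single task can stay active after that task drops its reference, as long as some other holder still has an active reference.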