1. 23 May, 2017 1 commit
  2. 20 Apr, 2017 1 commit
  3. 28 Mar, 2017 17 commits
    • S
      blk-throttle: add latency target support · 53696b8d
      Shaohua Li committed
      One hard problem in adding the .low limit is detecting idle cgroups. If
      one cgroup doesn't dispatch enough IO against its low limit, we must
      have a mechanism to determine whether other cgroups can dispatch more
      IO. We added the think time detection mechanism before, but it doesn't
      work for all workloads. Here we add a latency based approach.
      
      We already have a mechanism to calculate the latency threshold for each
      IO size. For every IO dispatched from a cgroup, we compare its latency
      against its threshold and record the info. If most IO latency is below
      the threshold (in the code I use 75%), the cgroup could be treated as
      idle and other cgroups can dispatch more IO.
      
      Currently this latency target check is only for SSD as we can't
      calculate the latency target for hard disk. And this is only for cgroup
      leaf nodes so far.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      53696b8d
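The 75% rule described in the commit above can be illustrated with a small sketch. This is not the kernel code: the function name `looks_idle`, the list-based inputs, and the choice to count an IO exactly at its threshold as "below" are all assumptions made for illustration.

```python
def looks_idle(latencies_us, thresholds_us, ratio=0.75):
    """Return True when at least `ratio` of the IOs met their latency threshold.

    latencies_us[i] is the measured latency of IO i, thresholds_us[i] is the
    threshold for that IO's size. Both are hypothetical inputs.
    """
    if not latencies_us:
        return False  # no samples: this sketch makes no idle claim
    good = sum(1 for lat, thr in zip(latencies_us, thresholds_us) if lat <= thr)
    return good >= ratio * len(latencies_us)


# Three of four IOs are under a 140us threshold, so the group counts as idle.
print(looks_idle([100, 120, 130, 300], [140] * 4))
```

With a lower share of fast IOs (for example two out of four) the same call returns False, meaning other cgroups would stay throttled to their low limits.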
    • S
      blk-throttle: add a mechanism to estimate IO latency · b9147dd1
      Shaohua Li committed
      The user configures a latency target, but the latency threshold for each
      request size isn't fixed. For an SSD, the IO latency highly depends on
      request size. To calculate the latency threshold, we sample some data, e.g.,
      average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
      threshold of each request size will be the sample latency (I'll call it
      base latency) plus latency target. For example, the base latency for
      request size 4k is 80us and user configures latency target 60us. The 4k
      latency threshold will be 80 + 60 = 140us.
      
      To sample data, we calculate the order base 2 of rounded up IO sectors.
      If the IO size is bigger than 1M, it will be accounted as 1M. Since the
      calculation does round up, the base latency will be slightly smaller
      than actual value. Also if there isn't any IO dispatched for a specific
      IO size, we will use the base latency of smaller IO size for this IO
      size.
      
      But we shouldn't sample data at arbitrary times. The base latency is
      supposed to be the latency when the disk isn't congested, because we use
      the latency threshold to schedule IOs between cgroups. If the disk is
      congested, the latency is higher and using it for scheduling is
      meaningless. Hence we only do the sampling when block throttling is in
      the LOW limit, with the assumption that the disk isn't congested in such
      a state. If the assumption isn't true, e.g., the low limit is too high,
      the calculated latency threshold will be higher.
      
      Hard disk is completely different. Latency depends on spindle seek
      instead of request size. Currently this feature is SSD only; we could
      probably use a fixed threshold like 4ms for hard disk though.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b9147dd1
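A minimal sketch of the bucketing and threshold arithmetic described in the commit above. The names `bucket_index` and `latency_threshold`, the dict of sampled base latencies, and the choice to treat IOs smaller than 4k as 4k are assumptions for illustration, not the kernel's data structures.

```python
MIN_BYTES = 4 * 1024       # smallest bucket (4k)
MAX_BYTES = 1024 * 1024    # larger IOs are accounted as 1M


def bucket_index(size_bytes):
    """Round the IO size up to a power of two: 0 -> 4k, 1 -> 8k, ..., 8 -> 1M."""
    clamped = min(max(size_bytes, MIN_BYTES), MAX_BYTES)
    units = -(-clamped // MIN_BYTES)      # ceiling division into 4k units
    return (units - 1).bit_length()       # ceil(log2(units))


def latency_threshold(base_us, size_bytes, target_us):
    """Base latency of the bucket (or of the nearest smaller sampled one) plus target.

    base_us maps bucket index -> sampled average latency in microseconds.
    Returns None when no bucket at or below this size has been sampled.
    """
    idx = bucket_index(size_bytes)
    while idx >= 0 and idx not in base_us:
        idx -= 1                          # fall back to a smaller IO size
    if idx < 0:
        return None
    return base_us[idx] + target_us


# The example from the commit message: 80us base for 4k, 60us target.
print(latency_threshold({0: 80}, 4096, 60))
```

A 16k request with only the 4k bucket sampled falls back to the 4k base latency, which mirrors the "use the base latency of smaller IO size" rule.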
    • S
      blk-throttle: add interface for per-cgroup target latency · ec80991d
      Shaohua Li committed
      Here we introduce a per-cgroup latency target. The target determines how
      much latency increase a cgroup can afford. We will use the target latency
      to calculate a threshold and use it to schedule IO for cgroups. If a
      cgroup's bandwidth is below its low limit but its average latency is
      below the threshold, other cgroups can safely dispatch more IO even if
      their bandwidth is higher than their low limits. On the other hand, if
      the first cgroup's latency is higher than the threshold, other cgroups
      are throttled to their low limits. So the target latency determines how
      efficiently we utilize free disk resources without sacrificing the
      workload's IO latency.
      
      For example, assume 4k IO average latency is 50us when disk isn't
      congested. A cgroup sets the target latency to 30us. Then the cgroup can
      accept 50+30=80us IO latency. If the cgroup's average IO latency is
      90us and its bandwidth is below low limit, other cgroups are throttled
      to their low limit. If the cgroup's average IO latency is 60us, other
      cgroups are allowed to dispatch more IO. When other cgroups dispatch
      more IO, the first cgroup's IO latency will increase. If it increases to
      81us, we then throttle other cgroups.
      
      User will configure the interface in this way:
      echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low
      
      latency is in microseconds.
      
      By default, the latency target is 0, which means to guarantee IO latency.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ec80991d
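The worked numbers in the commit above (50us base, 30us target, 80us threshold) can be checked with a short sketch. The function name and parameters are hypothetical, and the branch for a cgroup that already meets its low limit is an assumption: the commit message only discusses the case where the cgroup is below it.

```python
def others_may_exceed_low(avg_latency_us, base_latency_us, target_us,
                          below_low_limit):
    """Decide whether other cgroups may dispatch beyond their low limits."""
    if not below_low_limit:
        # Assumption: a cgroup that meets its low limit needs no protection.
        return True
    threshold_us = base_latency_us + target_us
    return avg_latency_us <= threshold_us


for latency in (60, 80, 81, 90):
    print(latency, others_may_exceed_low(latency, 50, 30, below_low_limit=True))
```

At 60us and 80us the other cgroups may keep dispatching more IO; at 81us and 90us they are throttled back to their low limits.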
    • S
      blk-throttle: ignore idle cgroup limit · fa6fb5aa
      Shaohua Li committed
      The last patch introduced a way to detect idle cgroups. We use it to make
      upgrade/downgrade decisions. The new algorithm can detect completely
      idle cgroups too, so we can delete the corresponding code.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      fa6fb5aa
    • S
      blk-throttle: add interface to configure idle time threshold · ada75b6e
      Shaohua Li committed
      Add an interface to configure the threshold. The io.low interface will
      look like:
      echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low
      
      idle is in microseconds.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ada75b6e
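To make the key=value part of the io.low line above concrete, here is a hedged userspace sketch that turns it into a dictionary. `parse_limits` is a hypothetical helper, and mapping "max" to infinity is an illustrative choice rather than how the kernel stores the value.

```python
import math


def parse_limits(body):
    """Parse 'rbps=2097152 wbps=max idle=2000' into {key: number}."""
    limits = {}
    for token in body.split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed token: {token!r}")
        limits[key] = math.inf if value == "max" else int(value)
    return limits


print(parse_limits("rbps=2097152 wbps=max idle=2000"))
```

The idle value stays in microseconds, as stated in the commit message, so 2000 here means 2ms.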
    • S
      blk-throttle: add a simple idle detection · 9e234eea
      Shaohua Li committed
      A cgroup gets assigned a low limit, but the cgroup might never dispatch
      enough IO to cross the low limit. In such a case, the queue state machine
      will remain in LIMIT_LOW state and all other cgroups will be throttled
      according to their low limits. This is unfair to other cgroups. We should
      treat the cgroup as idle and upgrade the state machine to the higher state.
      
      We also have downgrade logic. If the state machine upgrades because a
      cgroup is idle (real idle), the state machine will downgrade soon, as the
      cgroup is below its low limit. This isn't what we want. A more
      complicated case is when a cgroup isn't idle while the queue is in
      LIMIT_LOW. But when the queue gets upgraded to the higher state, other
      cgroups could dispatch more IO and this cgroup can't dispatch enough IO,
      so the cgroup is below its low limit and looks idle (fake idle). In this
      case, the queue should downgrade soon. The key to determining whether we
      should downgrade is to detect if the cgroup is truly idle.
      
      Unfortunately it's very hard to determine if a cgroup is really idle. This
      patch uses the 'think time check' idea from CFQ for the purpose. Please
      note, the idea doesn't work for all workloads. For example, a workload
      with io depth 8 has disk utilization 100%, hence think time is 0, i.e.,
      not idle. But the workload can run higher bandwidth with io depth 16.
      Compared to io depth 16, the io depth 8 workload is idle. We use the
      idea to roughly determine if a cgroup is idle.
      
      We treat a cgroup idle if its think time is above a threshold (by
      default 1ms for SSD and 100ms for HD). The idea is think time above the
      threshold will start to harm performance. HD is much slower so a longer
      think time is ok.
      
      The patch (and the later patches) uses 'unsigned long' to track time.
      We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
      precision, which should not be a big deal.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9e234eea
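The 'ns >> 10' conversion mentioned in the commit above divides by 1024 instead of 1000, so the result reads about 2.3% low. The sketch below shows the effect and applies the default think time thresholds from the commit message (1ms for SSD, 100ms for HD). The function names are hypothetical.

```python
def ns_to_us_fast(ns):
    """Approximate ns -> us with a shift (divides by 1024, not 1000)."""
    return ns >> 10


def is_idle(avg_think_time_ns, rotational):
    """Treat a cgroup as idle when its think time is above the threshold."""
    threshold_us = 100_000 if rotational else 1_000   # 100ms for HD, 1ms for SSD
    return ns_to_us_fast(avg_think_time_ns) > threshold_us


print(ns_to_us_fast(1_000_000))   # exact answer would be 1000
print(is_idle(2_000_000, rotational=False))
print(is_idle(2_000_000, rotational=True))
```

A 2ms think time is enough to call an SSD-backed cgroup idle but not one on a hard disk, where a longer think time is tolerated.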
    • S
      blk-throttle: make bandwidth change smooth · 7394e31f
      Shaohua Li committed
      When cgroups all reach low limit, cgroups can dispatch more IO. This
      could make some cgroups dispatch more IO but others not, and some
      cgroups could even dispatch less IO than their low limit. For example,
      cg1 low limit 10MB/s, cg2 low limit 80MB/s, assume disk maximum bandwidth
      is 120MB/s for the workload. Their bps could be something like this:
      
      cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80
      
      At T1, all cgroups reach low limit, so they can dispatch more IO later.
      Then cg1 dispatches more IO and cg2 has no room to dispatch enough IO. At
      T2, cg2 only dispatches 60MB/s. Since we detect cg2 dispatches less IO
      than its low limit 80MB/s, we downgrade the queue from LIMIT_MAX to
      LIMIT_LOW, then all cgroups are throttled to their low limit (T3). cg2
      will have bandwidth below its low limit most of the time.
      
      The big problem here is we don't know the maximum bandwidth of the
      workload, so we can't make a smart decision to avoid the situation. This
      patch makes cgroup bandwidth change smooth. After the disk upgrades from
      LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all bandwidth up to
      their max limit immediately. Their bandwidth limit will be increased
      gradually to avoid the above situation. So the above example will become
      something like:
      
      cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
      -> 45/75 -> 22/98
      
      In this way cgroups' bandwidth will be above their limit the majority of
      the time. This still doesn't fully utilize disk bandwidth, but that's
      something we pay for sharing.
      
      Scale up is linear. The limit scales up by 1/2 of the .low limit every
      throtl_slice after upgrade. The scale up will stop if the adjusted limit
      hits the .max limit. Scale down is exponential. We cut the scale value in
      half if a cgroup doesn't hit its .low limit. If the scale becomes 0, we
      then fully downgrade the queue to LIMIT_LOW state.
      
      Note this doesn't completely avoid cgroup running under its low limit.
      The best way to guarantee cgroup doesn't run under its limit is to set
      max limit. For example, if we set cg1 max limit to 40, cg2 will never
      run under its low limit.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7394e31f
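The linear scale up and exponential scale down described in the commit above can be sketched as follows. The helper names and the integer `scale` counter (incremented once per throtl_slice after an upgrade) are assumptions for illustration.

```python
def adjusted_limit(low, max_limit, scale):
    """Limit after `scale` slices: low + scale * (low / 2), capped at max."""
    return min(low + (low // 2) * scale, max_limit)


def scale_down(scale):
    """Halve the scale; a result of 0 means a full downgrade to LIMIT_LOW."""
    return scale // 2


# cg1 from the commit message (low limit 10): 10 -> 15 -> 20 -> 25 ...
print([adjusted_limit(10, 100, s) for s in range(4)])

# Scale down from 7 reaches 0 in three steps.
s, steps = 7, []
while s:
    s = scale_down(s)
    steps.append(s)
print(steps)
```

The first line reproduces the cg1 column of the smoothed sequence in the commit message, where the limit grows by 5 (half of the 10 low limit) per step.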
    • S
      blk-throttle: detect completed idle cgroup · aec24246
      Shaohua Li committed
      A cgroup could be assigned a limit, but doesn't dispatch enough IO, e.g., the
      cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
      we can't move the state machine to higher level and all cgroups will be
      throttled to their lower limit, so we waste bandwidth. Detecting idle
      cgroup is hard. This patch handles a simple case, a cgroup doesn't
      dispatch any IO. We ignore such cgroup's limit, so other cgroups can use
      the bandwidth.
      
      Please note this will be replaced with a more sophisticated algorithm
      later, but this demonstrates the idea how we handle idle cgroups, so I
      leave it here.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      aec24246
    • S
      blk-throttle: choose a small throtl_slice for SSD · d61fcfa4
      Shaohua Li committed
      The throtl_slice is 100ms by default. This is a long time for an SSD,
      during which a lot of IO can run. To make cgroups have smoother
      throughput, we choose a small value (20ms) for SSD.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d61fcfa4
    • S
      blk-throttle: make throtl_slice tunable · 297e3d85
      Shaohua Li committed
      throtl_slice is important for blk-throttling. It's called slice
      internally but it really is a time window in which blk-throttling
      samples data. blk-throttling makes decisions based on the samples. An
      example is bandwidth measurement. A cgroup's bandwidth is measured in
      the time interval of throtl_slice.
      
      A small throtl_slice means cgroups have smoother throughput but burn
      more CPU. It has a 100ms default value, which is not appropriate for all
      disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
      it tunable.
      
      Since throtl_slice isn't a time slice, the sysfs name
      'throttle_sample_time' reflects its character better.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      297e3d85
    • S
      blk-throttle: make sure expire time isn't too big · 06cceedc
      Shaohua Li committed
      A cgroup could be throttled to a limit, but when all cgroups cross the
      high limit, the queue enters a higher state and so the group should be
      throttled to a higher limit. It's possible the cgroup is sleeping because
      of throttling and other cgroups don't dispatch IO any more. In this case,
      nobody can trigger the current downgrade/upgrade logic. To fix this
      issue, we could either set up a timer to wake up the cgroup if other
      cgroups are idle or make sure this cgroup doesn't sleep too long. Setting
      up a timer means we must change the timer very frequently. This patch
      chooses the latter. Capping the cgroup's sleep time wouldn't change its
      bps/iops, but could make it wake up more frequently, which isn't a big
      issue because throtl_slice * 8 is already quite big.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      06cceedc
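A minimal sketch of the sleep cap described in the commit above, using milliseconds instead of kernel jiffies. `clamp_expire` and its parameters are hypothetical names; only the "throtl_slice * 8" bound comes from the commit message.

```python
def clamp_expire(now_ms, wanted_expire_ms, throtl_slice_ms):
    """Never let a throttled group sleep past now + 8 * throtl_slice."""
    max_expire_ms = now_ms + 8 * throtl_slice_ms
    return min(wanted_expire_ms, max_expire_ms)


print(clamp_expire(0, 5000, 100))   # capped
print(clamp_expire(0, 300, 100))    # already short enough, unchanged
```

The group simply wakes up earlier, rechecks its limits (which may have been upgraded in the meantime), and goes back to sleep if it still has to wait.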
    • S
      blk-throttle: add downgrade logic · 3f0abd80
      Shaohua Li committed
      When the queue state machine is in LIMIT_MAX state but a cgroup is below
      its low limit for some time, the queue should be downgraded to the lower
      state as one cgroup's low limit isn't met.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      3f0abd80
    • S
      blk-throttle: add upgrade logic for LIMIT_LOW state · c79892c5
      Shaohua Li committed
      When the queue is in LIMIT_LOW state and all cgroups with a low limit
      cross the bps/iops limitation, we will upgrade the queue's state to
      LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if
      the cgroup has pending requests. Since the cgroup is throttled according
      to the limit, a pending request means the cgroup reaches the limit.
      
      If a cgroup has limits set for both read and write, we consider the
      combination of them for upgrade. The reason is read IO and write IO can
      interfere with each other. If we do the upgrade based on IO in one
      direction, the IO in the other direction could be severely harmed.
      
      For a cgroup hierarchy, there are two cases. In the first, children have
      a lower low limit than the parent. The parent's low limit is meaningless.
      If the children's bps/iops cross the low limit, we can upgrade the queue
      state. In the other case, children have a higher low limit than the
      parent. The children's low limit is meaningless. As long as the parent's
      bps/iops (which is a sum of the children's bps/iops) cross the low limit,
      we can upgrade the queue state.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c79892c5
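The upgrade condition in the commit above can be sketched for a flat set of cgroups. The `Group` class and both functions are hypothetical, the hierarchy cases are left out, and reading "the combination of them" as "both directions must have pending requests" is an assumption about the commit message rather than a statement about the kernel code.

```python
from dataclasses import dataclass


@dataclass
class Group:
    has_read_low: bool        # a read bps/iops low limit is configured
    has_write_low: bool       # a write bps/iops low limit is configured
    queued_reads: int = 0     # pending (throttled) read requests
    queued_writes: int = 0    # pending (throttled) write requests


def group_can_upgrade(g):
    """A group is ready when every direction with a low limit has pending IO."""
    if not g.has_read_low and not g.has_write_low:
        return True           # no low limit configured, nothing to wait for
    read_ok = (not g.has_read_low) or g.queued_reads > 0
    write_ok = (not g.has_write_low) or g.queued_writes > 0
    return read_ok and write_ok


def queue_can_upgrade(groups):
    """Upgrade LIMIT_LOW -> LIMIT_MAX only when all groups are ready."""
    return all(group_can_upgrade(g) for g in groups)


print(queue_can_upgrade([Group(True, False, queued_reads=1),
                         Group(True, True, queued_reads=1, queued_writes=2)]))
```

A group with both limits set but only reads pending would hold the whole queue in LIMIT_LOW under this reading.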
    • S
      blk-throttle: configure bps/iops limit for cgroup in low limit · b22c417c
      Shaohua Li 提交于
      each queue will have a state machine. Initially queue is in LIMIT_LOW
      state, which means all cgroups will be throttled according to their low
      limit. After all cgroups with low limit cross the limit, the queue state
      gets upgraded to LIMIT_MAX state.
      For max limit, cgroup will use the limit configured by user.
      For low limit, cgroup will use the minimal value between low limit and
      max limit configured by user. If the minimal value is 0, which means the
      cgroup doesn't configure low limit, we will use max limit to throttle
      the cgroup and the cgroup is ready to upgrade to LIMIT_MAX
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b22c417c
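The rule for the limit applied in LIMIT_LOW state, as described in the commit above, fits in a few lines. `effective_low` is a hypothetical name, and representing "not configured" as 0 follows the commit message.

```python
def effective_low(low_conf, max_conf):
    """Limit used while the queue is in LIMIT_LOW state."""
    low = min(low_conf, max_conf)
    # 0 means no low limit was configured: throttle by the max limit instead.
    return max_conf if low == 0 else low


print(effective_low(10, 40))   # low limit applies
print(effective_low(50, 40))   # low configured above max, max wins
print(effective_low(0, 40))    # no low limit, fall back to max
```

This also shows why a user-visible low limit above the max limit is harmless: the value used for throttling is clamped before it is applied.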
    • S
      blk-throttle: add .low interface · cd5ab1b0
      Shaohua Li committed
      Add a low limit for cgroups and the corresponding cgroup interface. To be
      consistent with memcg, we allow users to configure a .low limit higher
      than the .max limit. But the internal logic always assumes the .low limit
      is lower than the .max limit. So we add extra bps/iops_conf fields in
      throtl_grp for userspace configuration. The old bps/iops fields in
      throtl_grp will be the actual limit we use for throttling.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      cd5ab1b0
    • S
      blk-throttle: prepare support multiple limits · 9f626e37
      Shaohua Li committed
      We are going to support low/max limits; each cgroup will have 2 limits
      after that. This patch prepares for the multiple limits change.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9f626e37
    • S
      blk-throttle: use U64_MAX/UINT_MAX to replace -1 · 2ab5492d
      Shaohua Li committed
      Clean up the code to avoid using -1.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2ab5492d
  4. 28 Feb, 2017 1 commit
  5. 23 Jan, 2017 1 commit
  6. 28 Oct, 2016 1 commit
  7. 20 Sep, 2016 1 commit
    • V
      blk-throttle: Extend slice if throttle group is not empty · 164c80ed
      Vivek Goyal committed
      Right now, if slice is expired, we start a new slice. If a bio is
      queued, we keep on extending slice by throtl_slice interval (100ms).
      
      This worked well as long as the pending timer function got executed
      within a few milliseconds of the scheduled time. But it looks like with
      recent changes in the timer subsystem, slack can be much longer depending
      on the expiry time of the scheduled timer.
      
      commit 500462a9 ("timers: Switch to a non-cascading wheel")
      
      This means, by the time timer function gets executed, it is possible the
      delay from scheduled time is more than 100ms. That means current code
      will conclude that existing slice has expired and a new one needs to
      be started. New slice will be 100ms by default and that will not be
      sufficient to meet rate requirement of group given the bio size and
      bio will not be dispatched and we will start a new timer function to
      wait. And when that timer expires, same process will repeat and we
      will wait again and this can easily be an infinite loop.
      
      Solve this issue by starting a new slice only if throttle group is
      empty. If it is not empty, that means there should be an active slice
      going on. Ideally it should not be expired but given the slack, it is
      possible that it has expired.
      Reported-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      164c80ed
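The fix described in the commit above can be sketched with plain millisecond timestamps. `update_slice` and its tuple return value are hypothetical; the point is only that a non-empty group keeps its slice start, so time already waited still counts toward the bio's budget even when the timer fires late.

```python
def update_slice(now, start, end, nr_queued, throtl_slice=100):
    """Return the (start, end) of the slice to use at time `now`."""
    if now > end and nr_queued == 0:
        # Expired and nothing queued: begin a fresh slice, accounting restarts.
        return now, now + throtl_slice
    # Bios are queued (or the slice is still active): keep the start and
    # make sure the slice reaches at least one throtl_slice past now.
    return start, max(end, now + throtl_slice)


# Timer fired 150ms late with a bio still queued: the slice is extended.
print(update_slice(now=250, start=0, end=100, nr_queued=1))
# Same lateness with an empty group: a new slice starts.
print(update_slice(now=250, start=0, end=100, nr_queued=0))
```

In the first case the slice spans 350ms from its original start, so a large bio can eventually fit its rate budget instead of restarting the wait each time the timer is late.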
  8. 08 Aug, 2016 1 commit
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe committed
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1eff9d32
  9. 05 Aug, 2016 1 commit
  10. 10 May, 2016 1 commit
  11. 18 Sep, 2015 1 commit
    • T
      cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
      Tejun Heo committed
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      9e10a130
  12. 19 Aug, 2015 13 commits
    • T
      blkcg: implement interface for the unified hierarchy · 2ee867dc
      Tejun Heo committed
      blkcg interface grew to be the biggest of all controllers and
      unfortunately most inconsistent too.  The interface files are
      inconsistent with a number of close duplicates.  Some files have
      recursive variants while others don't.  There's distinction between
      normal and leaf weights which isn't intuitive and there are a lot of
      stat knobs which don't make much sense outside of debugging and expose
      too much implementation details to userland.
      
      In the unified hierarchy, everything is always hierarchical and
      internal nodes can't have tasks, rendering moot the two structural
      issues twisting the current interface.  The interface has to be updated
      in a significant way anyway and this is a good chance to revamp it as a
      whole.
      This patch implements blkcg interface for the unified hierarchy.
      
      * (from a previous patch) blkcg is identified by "io" instead of
        "blkio" on the unified hierarchy.  Given that the whole interface is
        updated anyway, the rename shouldn't carry noticeable conversion
        overhead.
      
      * The original interface, which consisted of 27 files, is replaced with
        the following three files.
      
        blkio.stat	: per-blkcg stats
        blkio.weight	: per-cgroup and per-cgroup-queue weight settings
        blkio.max	: per-cgroup-queue bps and iops max limits
      
      Documentation/cgroups/unified-hierarchy.txt updated accordingly.
      
      v2: blkcg_policy->dfl_cftypes wasn't removed on
          blkcg_policy_unregister() corrupting the cftypes list.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2ee867dc
    • T
      blkcg: separate out tg_conf_updated() from tg_set_conf() · 69948b07
      Tejun Heo committed
      tg_set_conf() largely consists of parsing and setting the new
      config and the follow-up application and propagation.  This patch
      separates out the latter part into tg_conf_updated().  This will be
      used to implement interface for the unified hierarchy.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      69948b07
    • T
      blkcg: move body parsing from blkg_conf_prep() to its callers · 36aa9e5f
      Tejun Heo committed
      Currently, blkg_conf_prep() expects input to be of the following form
      
       MAJ:MIN NUM
      
      and reads the NUM part into blkg_conf_ctx->v.  This is quite
      restrictive and gets in the way in implementing blkcg interface for
      the unified hierarchy.  This patch updates blkg_conf_prep() so that it
      expects
      
       MAJ:MIN BODY_STR
      
      where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
      with ->body which is a char pointer pointing to the start of BODY_STR.
      Parsing of the body is moved to blkg_conf_prep()'s callers.
      
      To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
      non-const pointer and to accommodate that const is dropped from @input
      too.
      
      This doesn't cause any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      36aa9e5f
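The split between the device prefix and a free-form body, as described in the commit above, can be illustrated in userspace terms. `split_conf_line` is a hypothetical helper that only mirrors the input shape "MAJ:MIN BODY_STR"; it is not the kernel's blkg_conf_prep().

```python
def split_conf_line(line):
    """Split 'MAJ:MIN BODY_STR' into (major, minor, body)."""
    dev, _, body = line.strip().partition(" ")
    major_s, sep, minor_s = dev.partition(":")
    if not sep:
        raise ValueError(f"expected MAJ:MIN, got {dev!r}")
    # The body is returned untouched; each caller parses it as it needs.
    return int(major_s), int(minor_s), body.strip()


print(split_conf_line("8:16 1048576"))
print(split_conf_line("8:16 rbps=2097152 wbps=max"))
```

Leaving the body as an opaque string is what lets the old single-number format and a newer key=value format share the same device lookup step.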
    • T
      blkcg: mark existing cftypes as legacy · 880f50e2
      Tejun Heo committed
      blkcg is about to grow interface for the unified hierarchy.  Add
      legacy to existing cftypes.
      
      * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
      * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
      * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
      * blk-throttle.c:throtl_files -> throtl_legacy_files
      
      Pure renames.  No functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      880f50e2
    • T
      blkcg: move io_service_bytes and io_serviced stats into blkcg_gq · 77ea7338
      Tejun Heo committed
      Currently, both cfq-iosched and blk-throttle keep track of
      io_service_bytes and io_serviced stats.  While keeping track of them
      separately may be useful during development, it doesn't make much
      sense otherwise.  Also, blk-throttle was counting bios as IOs while
      cfq-iosched was counting requests, which is more confusing than
      informative.
      
      This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
      removes the counterparts from cfq-iosched and blk-throttle and lets
      them print from the common blkg counters.  The common counters are
      incremented during bio issue in blkcg_bio_issue_check().
      
      The outputs are still filtered by whether the policy has
      blkg_policy_data on a given blkg, so cfq's output won't show up if it
      has never been used for a given blkg.  The only times when the outputs
      would differ significantly are when policies are attached on the fly
      or elevators are switched back and forth.  Those are quite exceptional
      operations and I don't think they warrant keeping separate counters.
      
      v3: Update blkio-controller.txt accordingly.
      
      v2: Account IOs during bio issues instead of request completions so
          that bio-based drivers can be handled the same way.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      77ea7338
    • T
      blkcg: make blkcg_[rw]stat per-cpu · 24bdb8ef
      Tejun Heo committed
      blkcg_[rw]stat are used as stat counters for blkcg policies.  They aren't
      per-cpu by themselves and blk-throttle makes them per-cpu by wrapping
      around them.  This patch makes blkcg_[rw]stat per-cpu and drops the
      ad-hoc per-cpu wrapping in blk-throttle.
      
      * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
        percpu_counter.  This makes syncp unnecessary as remote accesses are
        handled by percpu_counter itself.
      
      * blkg_[rw]stat_init() can now fail due to percpu allocation failure
        and thus are updated to return int.
      
      * percpu_counters need explicit freeing.  blkg_[rw]stat_exit() added.
      
      * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
        and summing results are stored in ->aux_cnt[] instead.
      
      * Custom per-cpu stat implementation in blk-throttle is removed.
      
      This makes all blkcg stat counters per-cpu without complicating policy
      implementations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      24bdb8ef
    • T
      blkcg: consolidate blkg creation in blkcg_bio_issue_check() · ae118896
      Tejun Heo committed
      blkg (blkcg_gq) currently is created by blkcg policies invoking
      blkg_lookup_create() which ends up repeating about the same code in
      different policies.  Theoretically, this can avoid the overhead of
      looking up and/or creating blkg's if blkcg is enabled but no policy is in
      use; however, the cost of blkg lookup / creation is very low
      especially if only the root blkcg is in use which is highly likely if
      no blkcg policy is in active use - it boils down to a single very
      predictable conditional and surrounding RCU protection.
      
      This patch consolidates blkg creation to a new function
      blkcg_bio_issue_check() which is called during bio issue from
      generic_make_request_checks().  blkcg_bio_issue_check() is now the
      only function which tries to create missing blkg's.  The subsequent
      policy and request_list operations just perform blkg_lookup() and, if
      the blkg is missing, fall back to the root.
      
      * blk_get_rl() no longer tries to create blkg.  It uses blkg_lookup()
        instead of blkg_lookup_create().
      
      * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
        read locked and blkg already looked up.  Both throtl_lookup_tg() and
        throtl_lookup_create_tg() are dropped.
      
      * cfq is similarly updated.  cfq_lookup_create_cfqg() is replaced with
        cfq_lookup_cfqg() which uses blkg_lookup().
      
      This consolidates blkg handling and avoids unnecessary blkg creation
      retries under memory pressure.  In addition, this provides a common
      bio entry point into blkcg where things like common accounting can be
      performed.
      
      v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
          !CONFIG_BLK_DEV_THROTTLING.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ae118896
    • T
      blk-throttle: improve queue bypass handling · c9589f03
      Tejun Heo committed
      If a queue is bypassing, all blkcg policies should become noops but
      blk-throttle didn't.  It only became noop if the queue was dying.
      While this wouldn't lead to an oops as falling back to the root blkg
      is safe in this case, this can be a bit surprising - a bypassing queue
      could still be applying throttle limits.
      
      Fix it by removing blk_queue_dying() test in throtl_lookup_create_tg()
      and testing blk_queue_bypass() in blk_throtl_bio() and bypassing
      before doing anything else.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c9589f03
    • T
      blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup() · 85b6bc9d
      Tejun Heo committed
      Currently, both throttle and cfq policies implement their own root
      blkg (blkcg_gq) lookup fast path.  This patch moves root blkg
      optimization from throtl_lookup_tg() to __blkg_lookup().  cfq-iosched
      currently doesn't use blkg_lookup() but will be converted and drop the
      optimization too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      85b6bc9d
    • T
      blkcg: make blkcg_policy methods take a pointer to blkcg_policy_data · a9520cd6
      Tejun Heo committed
      The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
      (blkg_policy_data) while the older ones use blkg (blkcg_gq).  As using
      blkg doesn't make sense for ->pd_alloc_fn() and after allocation pd
      can always be mapped to blkg and given that these are policy-specific
      methods, it makes sense to converge on pd.
      
      This patch makes all methods deal with pd instead of blkg.  Most
      conversions are trivial.  In blk-cgroup.c, a couple method invocation
      sites now test whether pd exists instead of policy state for
      consistency.  This shouldn't cause any behavioral differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a9520cd6
    • T
      blk-throttle: clean up blkg_policy_data alloc/init/exit/free methods · b2ce2643
      Tejun Heo committed
      With the recent addition of alloc and free methods, things became
      messier.  This patch reorganizes them as follows.
      
      * ->pd_alloc_fn()
      
        Responsible for allocation and static initializations - the ones
        which can be done independent of where the pd might be attached.
      
      * ->pd_init_fn()
      
        Initializations which require the knowledge of where the pd is
        attached.
      
      * ->pd_free_fn()
      
        The counterpart of pd_alloc_fn().  Static de-init and freeing.
      
      This leaves ->pd_exit_fn() without any users.  Removed.
      
      While at it, collapse a one-liner function throtl_pd_exit(), which
      has only one user, into its user.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b2ce2643
    • T
      blk-throttle: remove asynchrnous percpu stats allocation mechanism · 4fb72036
      Tejun Heo committed
      Because percpu allocator couldn't do non-blocking allocations,
      blk-throttle was forced to implement an ad-hoc asynchronous allocation
      mechanism for its percpu stats for cases where blkg's (blkcg_gq's) are
      allocated from an IO path without sleepable context.
      
      Now that percpu allocator can handle gfp_mask and blkg_policy_data
      alloc / free are handled by policy methods, the ad-hoc asynchronous
      allocation mechanism can be replaced with direct allocation from
      tg_stats_alloc_fn().  Rip it out.
      
      This ensures that an active throtl_grp always has valid non-NULL
      ->stats_cpu.  Remove checks on it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4fb72036
    • T
      blkcg: replace blkcg_policy->pd_size with ->pd_alloc/free_fn() methods · 001bea73
      Tejun Heo committed
      A blkg (blkcg_gq) represents the relationship between a cgroup and
      request_queue.  Each active policy has a pd (blkg_policy_data) on each
      blkg.  The pd's were allocated by blkcg core and each policy could
      request to allocate extra space at the end by setting
      blkcg_policy->pd_size larger than the size of pd.
      
      This is a bit unusual but was done this way mostly to simplify error
      handling and all the existing use cases could be handled this way;
      however, this is becoming too restrictive now that percpu memory can
      be allocated without blocking.
      
      This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
      and pd_free_fn() - which are used to allocate and release pd for a
      given policy.  As pd allocation is now done from policy side, it can
      simply allocate a larger area which embeds pd at the beginning.  This
      change makes ->pd_size pointless.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      001bea73