1. 19 Aug, 2015 6 commits
    • cfq-iosched: remove @gfp_mask from cfq_find_alloc_queue() · 2da8de0b
      Committed by Tejun Heo
      Even when allocations fail, cfq_find_alloc_queue() always returns a
      valid cfq_queue by falling back to the oom cfq_queue.  As such, there
      isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT
      is set.  GFP_NOWAIT allocations don't fail often and even when they do
      the degraded behavior is acceptable and temporary.
      
      After all, the only reason get_request(), which ultimately determines
      the gfp_mask, cares about __GFP_WAIT is to guarantee request
      allocation, assuming IO forward progress, for callers which are
      willing to wait.  There's no reason for cfq_find_alloc_queue() to
      behave differently on __GFP_WAIT when it already has a fallback
      mechanism.
      
      Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes
      to its callers.  This simplifies the function quite a bit and will
      help making async queues per-cfq_group.
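
The fallback behaviour described above can be illustrated with a small user-space sketch. Every name in it (`demo_queue`, `demo_oom_queue`, `demo_find_alloc_queue`, `demo_fail_alloc`) is a hypothetical stand-in, not the kernel's code. It only shows why a caller that always receives a usable queue has no need to pass allocation flags down.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cfq_queue. */
struct demo_queue {
	int id;
};

/* Statically allocated fallback, playing the role of the oom cfq_queue. */
static struct demo_queue demo_oom_queue = { .id = -1 };

/* Knob invented for this sketch to simulate an allocation failure. */
static bool demo_fail_alloc;

/* Never returns NULL: falls back to the shared oom queue on failure. */
static struct demo_queue *demo_find_alloc_queue(void)
{
	struct demo_queue *q = demo_fail_alloc ? NULL : calloc(1, sizeof(*q));

	if (!q)
		return &demo_oom_queue;
	return q;
}
```

Because the function cannot fail from the caller's point of view, a "try harder" mode selected by a gfp mask would add complexity without changing what the caller can rely on.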
      
      v2: Updated to reflect GFP_ATOMIC -> GFP_NOWAIT.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blkcg, cfq-iosched: use GFP_NOWAIT instead of GFP_ATOMIC for non-critical allocations · d93a11f1
      Committed by Tejun Heo
      blkcg performs several allocations to track IOs per cgroup and enforce
      resource control.  Most of these allocations are performed lazily on
      demand in the IO path and thus can't involve the reclaim path.  Currently,
      these allocations use GFP_ATOMIC; however, blkcg can gracefully deal
      with occasional failures of these allocations by punting IOs to the
      root cgroup and there's no reason to reach into the emergency reserve.
      
      This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
      allocations.
      
      * bdi_writeback_congested and blkcg_gq allocations in blkg_create().
      
      * radix tree node allocations for blkcg->blkg_tree.
      
      * cfq_queue allocation on ioprio changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-and-Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Suggested-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: minor cleanups · 563180a4
      Committed by Tejun Heo
      * Some were accessing cic->cfqq[] directly.  Always use cic_to_cfqq()
        and cic_set_cfqq().
      
      * check_ioprio_changed() doesn't need to verify cfq_get_queue()'s
        return for NULL.  It's always non-NULL.  Simplify accordingly.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: fix oom cfq_queue ref leak in cfq_set_request() · bce6133b
      Committed by Tejun Heo
      If the cfq_queue cached in cfq_io_cq is the oom one, cfq_set_request()
      replaces it by invoking cfq_get_queue() again without putting the oom
      queue, leaking the reference it was holding.  While oom queues are not
      released through reference counting, they're still reference counted
      and this can theoretically lead to the reference count overflowing and
      incorrectly invoking the usual release path on it.
      
      Fix it by making cfq_set_request() put the ref it was holding.
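
The leak and its fix can be sketched in user space as follows. The types and helpers (`demo_queue`, `demo_get_queue`, `demo_put_queue`, `demo_replace_cached`) are hypothetical and only model the reference counting; they are not the functions touched by the patch.

```c
#include <stddef.h>

/* Hypothetical reference-counted queue. */
struct demo_queue {
	int ref;
};

static struct demo_queue *demo_get_queue(struct demo_queue *q)
{
	q->ref++;
	return q;
}

static void demo_put_queue(struct demo_queue *q)
{
	q->ref--;
}

/*
 * Replace the queue cached in *slot with new_q.  The reference held on
 * the previously cached queue is dropped; omitting the put is the kind
 * of leak the commit message describes.
 */
static void demo_replace_cached(struct demo_queue **slot,
				struct demo_queue *new_q)
{
	struct demo_queue *old = *slot;

	*slot = demo_get_queue(new_q);
	if (old)
		demo_put_queue(old);
}
```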
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: fix async oom queue handling · 95e5d6f6
      Committed by Tejun Heo
      Async cfqq's (cfq_queue's) are shared across cfq_data.  When
      cfq_get_queue() obtains a new queue from cfq_find_alloc_queue(), it
      stashes the pointer in cfq_data and reuses it from then on; however,
      the function doesn't consider that cfq_find_alloc_queue() may return
      the oom_cfqq under memory pressure and installs the returned queue
      unconditionally.
      
      If the oom_cfqq is installed as an async cfqq, cfq_set_request() will
      continue calling cfq_get_queue() hoping to replace it with a proper
      queue; however, cfq_get_queue() will keep returning the cached queue
      for the slot - the oom_cfqq.
      
      Fix it by skipping caching if the queue is the oom one.
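
A minimal sketch of "skip caching if the queue is the oom one", using hypothetical names (`demo_queue`, `demo_oom_queue`, `demo_get_async_queue`). The `fresh` argument stands in for whatever the allocation step returned, which may be the fallback queue.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct cfq_queue. */
struct demo_queue {
	int id;
};

/* Shared fallback queue, playing the role of oom_cfqq. */
static struct demo_queue demo_oom_queue = { .id = -1 };

/*
 * Return the queue to use for an async slot.  A previously cached queue
 * wins; otherwise the freshly obtained queue is used, but it is only
 * installed in the slot when it is not the fallback, so a later call
 * can still install a proper queue.
 */
static struct demo_queue *demo_get_async_queue(struct demo_queue **slot,
					       struct demo_queue *fresh)
{
	if (*slot)
		return *slot;
	if (fresh != &demo_oom_queue)
		*slot = fresh;
	return fresh;
}
```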
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • cfq-iosched: simplify control flow in cfq_get_queue() · 4ebc1c61
      Committed by Tejun Heo
      cfq_get_queue()'s control flow looks like the following.
      
      	async_cfqq = NULL;
      	cfqq = NULL;
      
      	if (!is_sync) {
      		...
      		async_cfqq = ...;
      		cfqq = *async_cfqq;
      	}
      
      	if (!cfqq)
      		cfqq = ...;
      
      	if (!is_sync && !(*async_cfqq))
      		...;
      
      The only thing the local variable init, the second if, and the
      async_cfqq test in the third if achieves is to skip cfqq creation and
      installation if *async_cfqq was already non-NULL.  This is needlessly
      complicated with different tests examining the same condition.
      Simplify it to the following.
      
      	if (!is_sync) {
      		...
      		async_cfqq = ...;
      		cfqq = *async_cfqq;
      		if (cfqq)
      			goto out;
      	}
      
      	cfqq = ...;
      
      	if (!is_sync)
      		...;
       out:
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 21 Jun, 2015 1 commit
  3. 20 Jun, 2015 2 commits
  4. 10 Jun, 2015 1 commit
    • cfq-iosched: fix the setting of IOPS mode on SSDs · 0bb97947
      Committed by Jens Axboe
      A previous commit wanted to make CFQ default to IOPS mode on
      non-rotational storage, however it did so when the queue was
      initialized and the non-rotational flag is only set later on
      in the probe.
      
      Add an elevator hook that gets called off the add_disk() path,
      at that point we know that feature probing has finished, and
      we can reliably check for the various flags that drivers can
      set.
      
      Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
      Tested-by: Romain Francoise <romain@orebokech.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 07 Jun, 2015 1 commit
    • block, cgroup: implement policy-specific per-blkcg data · e48453c3
      Committed by Arianna Avanzini
      The block IO (blkio) controller enables the block layer to provide service
      guarantees in a hierarchical fashion. Specifically, service guarantees
      are provided by registered request-accounting policies. As of now, a
      proportional-share and a throttling policy are available. They are
      implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
      subsystem. Unfortunately, as for adding new policies, the current
      implementation of the block IO controller is only halfway ready to allow
      new policies to be plugged in. This commit provides a solution to make
      the block IO controller fully ready to handle new policies.
      In what follows, we first describe briefly the current state, and then
      list the changes made by this commit.
      
      The throttling policy does not need any per-cgroup information to perform
      its task. In contrast, the proportional share policy uses, for each cgroup,
      both the weight assigned by the user to the cgroup, and a set of dynamically-
      computed weights, one for each device.
      
      The first, user-defined weight is stored in the blkcg data structure: the
      block IO controller allocates a private blkcg data structure for each
      cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
      In other words, the block IO controller internally mirrors the blkio cgroups
      with private blkcg data structures.
      
      On the other hand, for each cgroup and device, the corresponding dynamically-
      computed weight is maintained in the following, different way. For each device,
      the block IO controller keeps a private blkcg_gq structure for each cgroup in
      blkio. In other words, block IO also keeps one private mirror copy of the blkio
      cgroups hierarchy for each device, made of blkcg_gq structures.
      Each blkcg_gq structure keeps per-policy information in a generic array of
      dynamically-allocated 'dedicated' data structures, one for each registered
      policy (so currently the array contains two elements). To be inserted into the
      generic array, each dedicated data structure embeds a generic blkg_policy_data
      structure. Consider now the array contained in the blkcg_gq structure
      corresponding to a given pair of cgroup and device: one of the elements
      of the array contains the dedicated data structure for the proportional-share
      policy, and this dedicated data structure contains the dynamically-computed
      weight for that pair of cgroup and device.
      
      The generic strategy adopted for storing per-policy data in blkcg_gq structures
      is already capable of handling new policies, whereas the one adopted with blkcg
      structures is not, because per-policy data are hard-coded in the blkcg
      structures themselves (currently only data related to the proportional-
      share policy).
      
      This commit addresses the above issues through the following changes:
      . It generalizes blkcg structures so that per-policy data are stored in the same
        way as in blkcg_gq structures.
        Specifically, it lets also the blkcg structure store per-policy data in a
        generic array of dynamically-allocated dedicated data structures. We will
        refer to these data structures as blkcg dedicated data structures, to
        distinguish them from the dedicated data structures inserted in the generic
        arrays kept by blkcg_gq structures.
        To allow blkcg dedicated data structures to be inserted in the generic array
        inside a blkcg structure, this commit also introduces a new blkcg_policy_data
        structure, which is the equivalent of blkg_policy_data for blkcg dedicated
        data structures.
      . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
        cpd_size field and a cpd_init field, to be initialized by the policy with,
        respectively, the size of the blkcg dedicated data structures, and the
        address of a constructor function for blkcg dedicated data structures.
      . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
        the fields related to the proportional-share policy), into a new blkcg
        dedicated data structure called cfq_group_data.
      Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 06 Jun, 2015 1 commit
    • block: Make CFQ default to IOPS mode on SSDs · 41c0126b
      Committed by Tahsin Erdogan
      CFQ idling causes reduced IOPS throughput on non-rotational disks.
      Since disk head seeking is not applicable to SSDs, it doesn't really
      help performance by anticipating future near-by IO requests.
      
      By turning off idling (and switching to IOPS mode), we allow other
      processes to dispatch IO requests down to the driver and so increase IO
      throughput.
      
      The following fio benchmark results were taken on a cloud SSD offering
      with idling on and off:
      
      Idling     iops    avg-lat(ms)    stddev            bw
      ------------------------------------------------------
          On     7054    90.107         38.697     28217KB/s
         Off    29255    21.836         11.730    117022KB/s
      
      fio --name=temp --size=100G --time_based --ioengine=libaio \
          --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
          --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
          --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10
      
      And the following is from a local SSD run:
      
      Idling     iops    avg-lat(ms)    stddev            bw
      ------------------------------------------------------
          On    19320    33.043         14.068     77281KB/s
         Off    21626    29.465         12.662     86507KB/s
      
      fio --name=temp --size=5G --time_based --ioengine=libaio \
          --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
          --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
          --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10
      Reviewed-by: Nauman Rafique <nauman@google.com>
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 02 Jun, 2015 1 commit
  8. 10 Feb, 2015 1 commit
  9. 22 Jan, 2015 1 commit
    • cfq-iosched: fix incorrect filing of rt async cfqq · c6ce1943
      Committed by Jeff Moyer
      Hi,
      
      If you can manage to submit an async write as the first async I/O from
      the context of a process with realtime scheduling priority, then a
      cfq_queue is allocated, but filed into the wrong async_cfqq bucket.  It
      ends up in the best effort array, but actually has realtime I/O
      scheduling priority set in cfqq->ioprio.
      
      The reason is that cfq_get_queue assumes the default scheduling class and
      priority when there is no information present (i.e. when the async cfqq
      is created):
      
      static struct cfq_queue *
      cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
      	      struct bio *bio, gfp_t gfp_mask)
      {
      	const int ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
      	const int ioprio = IOPRIO_PRIO_DATA(cic->ioprio);
      
      cic->ioprio starts out as 0, which is "invalid".  So, class of 0
      (IOPRIO_CLASS_NONE) is passed to cfq_async_queue_prio like so:
      
      		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
      
      static struct cfq_queue **
      cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
      {
              switch (ioprio_class) {
              case IOPRIO_CLASS_RT:
                      return &cfqd->async_cfqq[0][ioprio];
              case IOPRIO_CLASS_NONE:
                      ioprio = IOPRIO_NORM;
                      /* fall through */
              case IOPRIO_CLASS_BE:
                      return &cfqd->async_cfqq[1][ioprio];
              case IOPRIO_CLASS_IDLE:
                      return &cfqd->async_idle_cfqq;
              default:
                      BUG();
              }
      }
      
      Here, instead of returning a class mapped from the process' scheduling
      priority, we get back the bucket associated with IOPRIO_CLASS_BE.
      
      Now, there is no queue allocated there yet, so we create it:
      
      		cfqq = cfq_find_alloc_queue(cfqd, is_sync, cic, bio, gfp_mask);
      
      That function ends up doing this:
      
      			cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
      			cfq_init_prio_data(cfqq, cic);
      
      cfq_init_cfqq marks the priority as having changed.  Then,
      cfq_init_prio_data does this:
      
      	ioprio_class = IOPRIO_PRIO_CLASS(cic->ioprio);
      	switch (ioprio_class) {
      	default:
      		printk(KERN_ERR "cfq: bad prio %x\n", ioprio_class);
      	case IOPRIO_CLASS_NONE:
      		/*
      		 * no prio set, inherit CPU scheduling settings
      		 */
      		cfqq->ioprio = task_nice_ioprio(tsk);
      		cfqq->ioprio_class = task_nice_ioclass(tsk);
      		break;
      
      So we basically have two code paths that treat IOPRIO_CLASS_NONE
      differently, which results in an RT async cfqq filed into a best effort
      bucket.
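
The mismatch between the two code paths can be reduced to the following sketch. The enum and both helpers are hypothetical simplifications (the `task_is_rt` flag stands in for the CPU scheduling class lookup); only the diverging treatment of the "none" class is taken from the description above.

```c
#include <stdbool.h>

/* Hypothetical mirror of the I/O priority classes discussed above. */
enum demo_class {
	DEMO_CLASS_NONE,
	DEMO_CLASS_RT,
	DEMO_CLASS_BE,
	DEMO_CLASS_IDLE,
};

/* Bucket selection path: "none" is filed under best effort. */
static enum demo_class demo_bucket_class(enum demo_class c)
{
	return c == DEMO_CLASS_NONE ? DEMO_CLASS_BE : c;
}

/* Priority init path: "none" inherits from the CPU scheduling class. */
static enum demo_class demo_effective_class(enum demo_class c, bool task_is_rt)
{
	if (c == DEMO_CLASS_NONE)
		return task_is_rt ? DEMO_CLASS_RT : DEMO_CLASS_BE;
	return c;
}
```

For a realtime task whose I/O priority is still unset, the first helper picks the best effort bucket while the second yields the realtime class, which is the inconsistency the commit describes.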
      
      Attached is a patch which fixes the problem.  I'm not sure how to make
      it cleaner.  Suggestions would be welcome.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Tested-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  10. 08 Sep, 2014 1 commit
    • blkcg: remove blkcg->id · f4da8072
      Committed by Tejun Heo
      blkcg->id is a unique id given to each blkcg; however, the
      cgroup_subsys_state which each blkcg embeds already has ->serial_nr
      which can be used for the same purpose.  Drop blkcg->id and replace
      its uses with blkcg->css.serial_nr.  Rename cfq_cgroup->blkcg_id to
      ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for
      consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 28 Aug, 2014 1 commit
  12. 27 Aug, 2014 1 commit
    • cfq-iosched: Fix wrong children_weight calculation · e15693ef
      Committed by Toshiaki Makita
      cfq_group_service_tree_add() is applying new_weight at the beginning of
      the function via cfq_update_group_weight().
      This actually allows weight to change between adding it to and subtracting
      it from children_weight, and triggers WARN_ON_ONCE() in
      cfq_group_service_tree_del(), or even causes oops by divide error during
      vfr calculation in cfq_group_service_tree_add().
      
      The detailed scenario is as follows:
      1. Create blkio cgroups X and Y as a child of X.
         Set X's weight to 500 and perform some I/O to apply new_weight.
         This X's I/O completes before starting Y's I/O.
      2. Y starts I/O and cfq_group_service_tree_add() is called with Y.
      3. cfq_group_service_tree_add() walks up the tree during children_weight
         calculation and adds parent X's weight (500) to children_weight of root.
         children_weight becomes 500.
      4. Set X's weight to 1000.
      5. X starts I/O and cfq_group_service_tree_add() is called with X.
      6. cfq_group_service_tree_add() applies its new_weight (1000).
      7. I/O of Y completes and cfq_group_service_tree_del() is called with Y.
      8. I/O of X completes and cfq_group_service_tree_del() is called with X.
      9. cfq_group_service_tree_del() subtracts X's weight (1000) from
         children_weight of root. children_weight becomes -500.
         This triggers WARN_ON_ONCE().
      10. Set X's weight to 500.
      11. X starts I/O and cfq_group_service_tree_add() is called with X.
      12. cfq_group_service_tree_add() applies its new_weight (500) and adds it
          to children_weight of root. children_weight becomes 0. Calculation of
          vfr triggers oops by divide error.
      
      weight should be updated right before adding it to children_weight.
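
The arithmetic of the scenario above can be replayed with plain integers. This is only a sketch: `demo_trace` and `demo_children_weight_scenario` are hypothetical, and a signed `int` is used so that the intermediate value matches the -500 quoted in step 9.

```c
/* Values of root's children_weight at three points of the scenario. */
struct demo_trace {
	int after_add;    /* after step 3 */
	int after_del;    /* after step 9 */
	int after_readd;  /* after step 12 */
};

static struct demo_trace demo_children_weight_scenario(void)
{
	struct demo_trace t;
	int x_weight = 500;
	int children_weight = 0;

	children_weight += x_weight;	/* step 3: X counted with weight 500 */
	t.after_add = children_weight;

	x_weight = 1000;		/* steps 4-6: weight changes while counted */

	children_weight -= x_weight;	/* step 9: subtracts 1000, not 500 */
	t.after_del = children_weight;

	x_weight = 500;			/* steps 10-12: new weight applied, then added */
	children_weight += x_weight;
	t.after_readd = children_weight;

	return t;
}
```

The final value of 0 is what makes a later division by children_weight fail, as described in step 12.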
      Reported-by: Ruki Sekiya <sekiya.ruki@lab.ntt.co.jp>
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  13. 14 May, 2014 1 commit
    • cgroup: replace cftype->write_string() with cftype->write() · 451af504
      Committed by Tejun Heo
      Convert all cftype->write_string() users to the new cftype->write()
      which maps directly to kernfs write operation and has full access to
      kernfs and cgroup contexts.  The conversions are mostly mechanical.
      
      * @css and @cft are accessed using of_css() and of_cft() accessors
        respectively instead of being specified as arguments.
      
      * Should return @nbytes on success instead of 0.
      
      * @buf is not trimmed automatically.  Trim if necessary.  Note that
        blkcg and netprio don't need this as the parsers already handle
        whitespaces.
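
Two of the conventions listed above (return @nbytes on success, trim @buf yourself if needed) can be shown with a user-space sketch. `demo_write` is a hypothetical handler with an invented signature, not the cftype->write() prototype, and it assumes the buffer is NUL-terminated.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical write handler: trims surrounding whitespace in place,
 * parses a decimal number into *out, and returns nbytes on success or
 * a negative errno value on failure.
 */
static ssize_t demo_write(char *buf, size_t nbytes, long *out)
{
	char *start = buf;
	char *end;
	size_t len;

	while (isspace((unsigned char)*start))
		start++;
	len = strlen(start);
	while (len > 0 && isspace((unsigned char)start[len - 1]))
		start[--len] = '\0';
	if (len == 0)
		return -EINVAL;

	errno = 0;
	*out = strtol(start, &end, 10);
	if (errno || *end != '\0')
		return -EINVAL;

	return (ssize_t)nbytes;	/* the whole write was consumed */
}
```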
      
      cftype->write_string() has no user left after the conversions and is
      removed.
      
      While at it, remove unnecessary local variable @p in
      cgroup_subtree_control_write() and stale comment about
      CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
      
      This patch doesn't introduce any visible behavior changes.
      
      v2: netprio was missing from conversion.  Converted.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
  14. 01 May, 2014 1 commit
  15. 10 Apr, 2014 1 commit
  16. 19 Mar, 2014 1 commit
    • cgroup: drop const from @buffer of cftype->write_string() · 4d3bb511
      Committed by Tejun Heo
      cftype->write_string() just passes on the writeable buffer from kernfs
      and there's no reason to add const restriction on the buffer.  The
      only thing const achieves is unnecessarily complicating parsing of the
      buffer.  Drop const from @buffer.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
  17. 25 Feb, 2014 1 commit
  18. 12 Feb, 2014 1 commit
    • cgroup: update the meaning of cftype->max_write_len · 5f469907
      Committed by Tejun Heo
      cftype->max_write_len is used to extend the maximum size of writes.
      It's interpreted in such a way that the actual maximum size is one
      less than the specified value.  The default size is defined by
      CGROUP_LOCAL_BUFFER_SIZE.  Its interpretation is quite confusing - its
      value is decremented by 1 and then compared for equality with max
      size, which means that the actual default size is
      CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.
      
      There's no point in having a limit that low.  Update its definition so
      that it means the actual string length sans termination and anything
      below PAGE_SIZE-1 is treated as PAGE_SIZE-1.
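
The old and new interpretations can be written out as small helpers. This is a sketch under stated assumptions: `DEMO_PAGE_SIZE` assumes 4 KiB pages, `DEMO_LOCAL_BUFFER_SIZE` is set to 64 because the text above gives "CGROUP_LOCAL_BUFFER_SIZE - 2" as 62, and the function names are hypothetical.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE		4096u	/* assumption: 4 KiB pages */
#define DEMO_LOCAL_BUFFER_SIZE	64u	/* implied by "size - 2 == 62" */

/* Old behaviour: the effective default limit was two less than the define. */
static size_t demo_old_default_max(void)
{
	return DEMO_LOCAL_BUFFER_SIZE - 2;
}

/*
 * New behaviour: max_write_len is the string length without the
 * terminator, and anything below PAGE_SIZE-1 is raised to PAGE_SIZE-1.
 */
static size_t demo_effective_max_write_len(size_t max_write_len)
{
	size_t floor = DEMO_PAGE_SIZE - 1;

	return max_write_len < floor ? floor : max_write_len;
}
```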
      
      .max_write_len for "release_agent" is updated to PATH_MAX-1 and
      cgroup_release_agent_write() is updated so that the redundant strlen()
      check is removed and it uses strlcpy() instead of strcpy().
      .max_write_len initializations in blk-throttle.c and cfq-iosched.c are
      no longer necessary and removed.  The one in cpuset is kept unchanged
      as it's an approximated value to begin with.
      
      This will also make transition to kernfs smoother.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  19. 06 Dec, 2013 1 commit
    • cgroup: replace cftype->read_seq_string() with cftype->seq_show() · 2da8ca82
      Committed by Tejun Heo
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch
      replaces cftype->read_seq_string() with cftype->seq_show() which is
      not limited to single_open() operation and will map directly to
      kernfs seq_file interface.
      
      The conversions are mechanical.  As ->seq_show() doesn't have @css and
      @cft, the functions which make use of them are converted to use
      seq_css() and seq_cft() respectively.  On several occasions, e.g. if
      it has seq_string in its name, the function name is updated to fit the
      new method better.
      
      This patch does not introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
  20. 13 Nov, 2013 1 commit
    • block: Use u64_stats_init() to initialize seqcounts · 90d3839b
      Committed by Peter Zijlstra
      Now that seqcounts are lockdep enabled objects, we need to explicitly
      initialize runtime allocated seqcounts so that lockdep can track them.
      
      Without this patch, Fengguang was seeing:
      
        [    4.127282] INFO: trying to register non-static key.
        [    4.128027] the code is fine but needs lockdep annotation.
        [    4.128027] turning off the locking correctness validator.
        [    4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2
        [    4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        [    ...     ]
        [    4.128027] Call Trace:
        [    4.128027]  [<7908e744>] ? console_unlock+0x353/0x380
        [    4.128027]  [<79dc7cf2>] dump_stack+0x48/0x60
        [    4.128027]  [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb
        [    4.128027]  [<7908a1c5>] lock_acquire+0x71/0x9a
        [    4.128027]  [<794079aa>] ? blk_throtl_bio+0x1c3/0x485
        [    4.128027]  [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153
        [    4.128027]  [<794079aa>] ? blk_throtl_bio+0x1c3/0x485
        [    4.128027]  [<794079aa>] blk_throtl_bio+0x1c3/0x485
        ...
      
      Use u64_stats_init() for all affected data structures, which initializes
      the seqcount.
      Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org
      [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 23 Sep, 2013 1 commit
  22. 12 Sep, 2013 1 commit
  23. 09 Aug, 2013 1 commit
    • cgroup: pass around cgroup_subsys_state instead of cgroup in file methods · 182446d0
      Committed by Tejun Heo
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup.
      Please see the previous commit which converts the subsystem methods
      for rationale.
      
      This patch converts all cftype file operations to take @css instead of
      @cgroup.  cftypes for the cgroup core files don't have their subsystem
      pointer set.  These will automatically use the dummy_css added by the
      previous patch and can be converted the same way.
      
      Most subsystem conversions are straightforward but there are some
      interesting ones.
      
      * freezer: update_if_frozen() is also converted to take @css instead
        of @cgroup for consistency.  This will make the code look simpler
        too once iterators are converted to use css.
      
      * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
        vmpressure while mem_cgroup_from_cont() can be made static.
        Updated accordingly.
      
      * cpu: cgroup_tg() doesn't have any user left.  Removed.
      
      * cpuacct: cgroup_ca() doesn't have any user left.  Removed.
      
      * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
        Removed.
      
      * net_cls: cgrp_cls_state() doesn't have any user left.  Removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
  24. 03 Jul, 2013 1 commit
    • elevator: Fix a race in elevator switching · d50235b7
      Committed by Jianpeng Ma
      There's a race between elevator switching and normal I/O operation.
      The allocation of struct elevator_queue and the setup of its
      ->elevator_data are not done in one atomic operation, so there is a
      chance of using a NULL ->elevator_data.
      For example:
          Thread A:                               Thread B
          blk_queue_bio                           elevator_switch
          spin_lock_irq(q->queue_lock)            elevator_alloc
          elv_merge                               elevator_init_fn
      
      Thread B can't hold queue_lock while calling elevator_alloc, and
      ->elevator_data is still NULL at that point.  If thread A calls
      elv_merge at the same time and needs some info from elevator_data,
      the crash happens.
      
      Move elevator_alloc into elevator_init_fn so that both steps are done
      in one atomic operation.
      
      The bug can be reproduced easily with the following method:
      1: dd if=/dev/sdb of=/dev/null
      2: while true; do echo noop > scheduler; echo deadline > scheduler; done
      
      The fix was tested with the same method.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  25. 24 Mar, 2013 1 commit
    • block: Add bio_end_sector() · f73a1c7d
      Committed by Kent Overstreet
      Just a little convenience macro - main reason to add it now is preparing
      for immutable bio vecs, it'll reduce the size of the patch that puts
      bi_sector/bi_size/bi_idx into a struct bvec_iter.
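
As an illustration of what such a convenience macro computes, here is a user-space sketch. `demo_bio`, `demo_bio_sectors` and `demo_bio_end_sector` are hypothetical: the struct keeps only the two fields needed, and 512-byte sectors are assumed. The macro added by this commit lives in the kernel headers and may differ in detail.

```c
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for struct bio. */
struct demo_bio {
	uint64_t bi_sector;	/* first sector of the I/O */
	uint32_t bi_size;	/* remaining I/O size in bytes */
};

/* Number of 512-byte sectors covered by the bio (assumed sector size). */
#define demo_bio_sectors(bio)		((bio)->bi_size >> 9)

/* First sector past the end of the bio. */
#define demo_bio_end_sector(bio)	((bio)->bi_sector + demo_bio_sectors(bio))
```

Hiding the field accesses behind one macro means callers do not have to change when the fields move into another structure, which is the motivation given above.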
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      f73a1c7d
  26. 28 Feb, 2013 1 commit
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin committed
      I'm not sure why, but the hlist for each entry iterators were conceived
      differently from the list ones.  The list ones look like this:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
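
To make the three-parameter form concrete, here is a minimal userspace sketch of an hlist iterator that needs no separate node cursor. The type and helper names (`struct item`, `item_add_head`, `item_sum`) are hypothetical, the macros are simplified stand-ins rather than the definitions in linux/list.h, and `__typeof__` assumes a GCC- or Clang-compatible compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's hlist types. */
struct hlist_node { struct hlist_node *next; };
struct hlist_head { struct hlist_node *first; };

/* Recover the containing object from a pointer to its embedded node. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* NULL-safe variant so the loop can stop when the node pointer is NULL. */
#define hlist_entry_safe(ptr, type, member) \
	((ptr) ? container_of(ptr, type, member) : NULL)

/* Three parameters, the same shape as list_for_each_entry(pos, head, member). */
#define hlist_for_each_entry(pos, head, member)                                 \
	for (pos = hlist_entry_safe((head)->first, __typeof__(*(pos)), member); \
	     pos;                                                               \
	     pos = hlist_entry_safe((pos)->member.next, __typeof__(*(pos)), member))

/* Hypothetical payload type, used only for this illustration. */
struct item {
	int value;
	struct hlist_node node;
};

/* Push an entry at the front of the list. */
static void item_add_head(struct item *it, struct hlist_head *head)
{
	assert(it != NULL && head != NULL);
	it->node.next = head->first;
	head->first = &it->node;
}

/* Sum all values; the entry pointer itself is the only cursor. */
static int item_sum(struct hlist_head *head)
{
	struct item *pos;
	int sum = 0;

	hlist_for_each_entry(pos, head, node)
		sum += pos->value;
	return sum;
}
```

The loop advances through `pos->member.next`, which is why the extra `struct hlist_node *` cursor is not needed.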
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
         were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
         properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  27. 22 Feb, 2013 1 commit
    • G
      cfq: fix lock imbalance with failed allocations · a3cc86c2
      Glauber Costa committed
      While stress-running very-small container scenarios with the Kernel Memory
      Controller, I've run into a lockdep-detected lock imbalance in
      cfq-iosched.c.
      
      I'll apologize beforehand for not posting a backlog: I didn't anticipate
      it would be so hard to reproduce, so I didn't save my serial output and
      went directly on debugging.  Turns out that it did not happen again in
      more than 20 runs, making it a quite rare pattern.
      
      But here is my analysis:
      
      When we are in very low-memory situations, we will arrive at
      cfq_find_alloc_queue and may not find a queue, having to resort to the oom
      queue, in an rcu-locked condition:
      
        if (!cfqq || cfqq == &cfqd->oom_cfqq)
            [ ... ]
      
      Next, we will release the rcu lock, and try to allocate a queue, retrying
      if we succeed:
      
        rcu_read_unlock();
        spin_unlock_irq(cfqd->queue->queue_lock);
        new_cfqq = kmem_cache_alloc_node(cfq_pool,
                        gfp_mask | __GFP_ZERO,
                        cfqd->queue->node);
        spin_lock_irq(cfqd->queue->queue_lock);
        if (new_cfqq)
            goto retry;
      
      We are unlocked at this point, but it should be fine, since we will
      reacquire the rcu_read_lock when we retry.
      
      Except of course, that we may not retry: the allocation may very well fail
      and we'll keep on going through the flow:
      
      The next branch is:
      
          if (cfqq) {
              [ ... ]
          } else
              cfqq = &cfqd->oom_cfqq;
      
      And right before exiting, we'll issue rcu_read_unlock().
      
      Being already unlocked, this is the likely source of our imbalance.  Since
      cfqq is either already NULL or made NULL in the first statement of the
      outer branch, the only viable alternative here seems to be to return the
      oom queue right away in case of allocation failure.
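
The imbalance can be modelled in userspace with a plain counter standing in for the RCU read-side nesting depth. This is only a toy sketch of the flow analysed above: every `toy_` name is hypothetical, the queue_lock handling is left out, and the `return_early` flag exists solely to compare the old and the proposed behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the RCU read-side nesting depth; not the kernel API. */
static int rcu_depth;

static void toy_rcu_read_lock(void)   { rcu_depth++; }
static void toy_rcu_read_unlock(void) { rcu_depth--; }

struct toy_queue { int id; };

static struct toy_queue oom_queue;   /* fallback queue that always exists */
static struct toy_queue real_queue;  /* what a successful allocation yields */

/* Pretend allocator: returns NULL when alloc_ok is false. */
static struct toy_queue *toy_alloc(bool alloc_ok)
{
	return alloc_ok ? &real_queue : NULL;
}

/*
 * With return_early == false a failed allocation falls through to the
 * final unlock and drives the depth below zero.  With return_early == true
 * the oom queue is returned right away, while the lock is already dropped.
 */
static struct toy_queue *toy_find_alloc_queue(bool alloc_ok, bool return_early)
{
	struct toy_queue *q = NULL, *new_q = NULL;

retry:
	toy_rcu_read_lock();
	if (new_q) {
		q = new_q;                  /* second pass: use the allocation */
	} else {
		toy_rcu_read_unlock();      /* drop the lock to allocate */
		new_q = toy_alloc(alloc_ok);
		if (new_q)
			goto retry;         /* the lock is retaken at retry */
		if (return_early)
			return &oom_queue;  /* already unlocked: leave now */
		q = &oom_queue;
	}
	toy_rcu_read_unlock();
	assert(q != NULL);                  /* a valid queue is always returned */
	return q;
}
```

A failed allocation with `return_early` set to false leaves `rcu_depth` at -1, which mirrors the double unlock that lockdep reported; with the early return the counter stays at 0.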
      
      Please review the following patch and apply if you agree with my analysis.
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a3cc86c2
  28. 10 Jan, 2013 7 commits
    • T
      cfq-iosched: add hierarchical cfq_group statistics · 43114018
      Tejun Heo committed
      Unfortunately, at this point, there's no way to make the existing
      statistics hierarchical without creating nasty surprises for the
      existing users.  Just create recursive counterpart of the existing
      stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      43114018
    • T
      cfq-iosched: collect stats from dead cfqgs · 0b39920b
      Tejun Heo committed
      To support hierarchical stats, it's necessary to remember stats from
      dead children.  Add cfqg->dead_stats and make a dying cfqg transfer
      its stats to the parent's dead-stats.
      
      The transfer happens from ->pd_offline_fn() and it is possible that
      there are some residual IOs completing afterwards.  Currently, we lose
      these stats.  Given that cgroup removal isn't a very high frequency
      operation and the amount of residual IOs on offline is likely to be
      nil or small, this shouldn't be a big deal and the complexity needed
      to handle residual IOs - another callback and rather elaborate
      synchronization to reach and lock the matching q - doesn't seem
      justified.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      0b39920b
    • T
      cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() · 689665af
      Tejun Heo committed
      Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
      cfq_pd_reset_stats() and move the latter to where other pd methods are
      defined.  cfqg_stats_reset() will be used to implement hierarchical
      stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      689665af
    • T
      blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/ · 4d5e80a7
      Tejun Heo committed
      Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
      summing up stats from multiple blkgs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      4d5e80a7
    • T
      cfq-iosched: enable full blkcg hierarchy support · d02f7aa8
      Tejun Heo committed
      With the previous two patches, all cfqg scheduling decisions are based
      on vfraction and ready for hierarchy support.  The only thing which
      keeps the behavior flat is cfqg_flat_parent() which makes vfraction
      calculation consider all non-root cfqgs children of the root cfqg.
      
      Replace it with cfqg_parent() which returns the real parent.  This
      enables full blkcg hierarchy support for cfq-iosched.  For example,
      consider the following hierarchy.
      
              root
            /      \
         A:500      B:250
        /     \
       AA:500  AB:1000
      
      For simplicity, let's say all the leaf nodes have active tasks and are
      on service tree.  For each leaf node, vfraction would be
      
       AA: (500  / 1500) * (500 / 750) =~ 0.2222
       AB: (1000 / 1500) * (500 / 750) =~ 0.4444
        B:                 (250 / 750) =~ 0.3333
      
      and vdisktime will be distributed accordingly.  For more detail,
      please refer to Documentation/block/cfq-iosched.txt.
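
The per-leaf numbers above can be reproduced with a short sketch that walks from a leaf up to the root and compounds weight / parent->level_weight at each level. `struct toy_group` and `toy_vfraction` are hypothetical names, and floating point is used only for readability; this is not how the kernel represents vfraction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for illustration; not the kernel's struct cfq_group. */
struct toy_group {
	struct toy_group *parent;   /* NULL for the root */
	unsigned int weight;        /* this group's weight among its siblings */
	unsigned int level_weight;  /* sum of the weights of its active children */
};

/* Walk from a group up to the root, compounding weight / parent->level_weight. */
static double toy_vfraction(const struct toy_group *g)
{
	double frac = 1.0;

	assert(g != NULL);
	for (; g->parent; g = g->parent) {
		assert(g->parent->level_weight > 0);
		frac *= (double)g->weight / g->parent->level_weight;
	}
	return frac;
}
```

For the hierarchy in the example, root has a level_weight of 750 and A has a level_weight of 1500, so the fractions of AA, AB and B add up to 1.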
      
      v2: cfq-iosched.txt updated to describe group scheduling as suggested
          by Vivek.
      
      v3: blkio-controller.txt updated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      d02f7aa8
    • T
      cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction · 41cad6ab
      Tejun Heo committed
      cfq_group_slice() calculates slice by taking a fraction of
      cfq_target_latency according to the ratio of cfqg->weight against
      service_tree->total_weight.  This currently works only because all
      cfqgs are treated to be at the same level.
      
      To prepare for proper hierarchy support, convert cfq_group_slice() to
      base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
      a fraction of 1 and represents the fraction allocated to the cfqg with
      hierarchy considered, the slice can be simply calculated by
      multiplying cfqg->vfraction to cfq_target_latency (with fixed point
      shift factored in).
      
      As vfraction calculation currently treats all non-root cfqgs as
      children of the root cfqg, this patch doesn't introduce noticeable
      behavior difference.
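
As a rough sketch of the calculation described above, the slice is the target latency multiplied by a fixed-point vfraction, with the shift removed afterwards. `toy_group_slice` and `TOY_SERVICE_SHIFT` are hypothetical names, and the shift value of 12 is an assumption made for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift for this sketch; 1 << TOY_SERVICE_SHIFT means 1.0. */
#define TOY_SERVICE_SHIFT 12

/* slice = target_latency * vfraction, with the fixed-point shift factored out. */
static uint64_t toy_group_slice(uint64_t target_latency, unsigned int vfraction)
{
	/* vfraction is always a fraction of 1, so it never exceeds 1.0. */
	assert(vfraction <= (1u << TOY_SERVICE_SHIFT));
	return (target_latency * vfraction) >> TOY_SERVICE_SHIFT;
}
```

A group whose vfraction is one half (1 << 11 under this assumption) gets half of the target latency as its slice.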
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      41cad6ab
    • T
      cfq-iosched: implement hierarchy-ready cfq_group charge scaling · 1d3650f7
      Tejun Heo committed
      Currently, cfqg charges are scaled directly according to cfqg->weight.
      Regardless of the number of active cfqgs or the amount of active
      weights, a given weight value always scales charge the same way.  This
      works fine as long as all cfqgs are treated equally regardless of
      their positions in the hierarchy, which is what cfq currently
      implements.  It can't work in hierarchical settings because the
      interpretation of a given weight value depends on where the weight is
      located in the hierarchy.
      
      This patch reimplements cfqg charge scaling so that it can be used to
      support hierarchy properly.  The scheme is fairly simple and
      light-weight.
      
      * When a cfqg is added to the service tree, v(disktime)weight is
        calculated.  It walks up the tree to root calculating the fraction
        it has in the hierarchy.  At each level, the fraction can be
        calculated as
      
          cfqg->weight / parent->level_weight
      
        By compounding these, the global fraction of vdisktime the cfqg has
        claim to - vfraction - can be determined.
      
      * When the cfqg needs to be charged, the charge is scaled inversely
        proportionally to the vfraction.
      
      The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
      representation as before; however, the smallest scaling factor is now
      1 (ie. 1 << CFQ_SERVICE_SHIFT).  This is different from before where 1
      was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
      scaling factor.
      
      While this shifts the global scale of vdisktime a bit, it doesn't
      change the relative relationships among cfqgs and the scheduling
      result isn't different.
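
The inverse scaling can be sketched as a fixed-point division of the charge by vfraction. `toy_scale_charge` and `TOY_SERVICE_SHIFT` are hypothetical names, and the shift value of 12 is an assumption made for this example; the sketch only illustrates why the smallest scaling factor is 1 << shift.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift for this sketch; 1 << TOY_SERVICE_SHIFT means 1.0. */
#define TOY_SERVICE_SHIFT 12

/* Scale a charge inversely proportionally to a fixed-point vfraction. */
static uint64_t toy_scale_charge(uint64_t charge, unsigned int vfraction)
{
	uint64_t c;

	assert(vfraction > 0 && vfraction <= (1u << TOY_SERVICE_SHIFT));
	c = charge << TOY_SERVICE_SHIFT;  /* represent the charge as fixed point */
	c <<= TOY_SERVICE_SHIFT;          /* pre-shift so the division keeps that scale */
	return c / vfraction;
}
```

With a vfraction of 1.0 the result is simply charge << TOY_SERVICE_SHIFT, the smallest possible scaling; halving the vfraction doubles the scaled charge.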
      
      cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
      new cfqg to the service tree.  The specific value of CFQ_IDLE_DELAY
      didn't have any relevance to vdisktime before and is unlikely to cause
      any visible behavior difference now especially as the scale shift
      isn't that large.
      
      As the new scheme now makes proper distinction between cfqg->weight
      and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
      root, both weights are now mapped to ->leaf_weight instead of the
      other way around.
      
      Because we're still using cfqg_flat_parent(), this patch shouldn't
      change the scheduling behavior in any noticeable way.
      
      v2: Beefed up comments on vfraction as requested by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      1d3650f7