1. 22 Oct, 2015 7 commits
  2. 10 Oct, 2015 2 commits
    • C
      3380f458
    • K
      blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c · 8ee1b7b9
      Committed by Kosuke Tatsukawa
      blk_mq_tag_update_depth() seems to be missing a memory barrier which
      might cause the waker to not notice the waiter and fail to send a
      wake_up as in the following figure.
      
      	blk_mq_tag_update_depth			bt_get
      ------------------------------------------------------------------------
      if (waitqueue_active(&bs->wait))
      /* The CPU might reorder the test for
         the waitqueue up here, before
         prior writes complete */
      					prepare_to_wait(&bs->wait, &wait,
      					  TASK_UNINTERRUPTIBLE);
      					tag = __bt_get(hctx, bt, last_tag,
      					  tags);
      					/* Value set in bt_update_count not
      					   visible yet */
      bt_update_count(&tags->bitmap_tags, tdepth);
      /* blk_mq_tag_wakeup_all(tags, false); */
       bt = &tags->bitmap_tags;
       wake_index = atomic_read(&bt->wake_index);
      					...
      					io_schedule();
      ------------------------------------------------------------------------
      
      This patch adds the missing memory barrier.
      
      I found this issue when I was looking through the linux source code
      for places calling waitqueue_active() before wake_up*(), but without
      preceding memory barriers, after sending a patch to fix a similar
      issue in drivers/tty/n_tty.c  (Details about the original issue can be
      found here: https://lkml.org/lkml/2015/9/28/849).
      Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8ee1b7b9
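
The ordering problem in the figure above can be sketched in userspace with C11 atomics. This is only an analogy under stated assumptions, not the kernel patch: `struct tag_state`, `update_depth_and_wake()` and `waiter_should_sleep()` are hypothetical names, `atomic_thread_fence(memory_order_seq_cst)` stands in for a full kernel barrier such as `smp_mb()`, and the `has_waiter` flag stands in for `waitqueue_active()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared state touched by waker and waiter. */
struct tag_state {
    atomic_uint depth;      /* value the waker updates (cf. bt_update_count) */
    atomic_bool has_waiter; /* stands in for waitqueue_active() */
    atomic_uint wakeups;    /* number of wake-ups the waker has issued */
};

/* Waker side: publish the new depth, then check for waiters. */
static bool update_depth_and_wake(struct tag_state *s, unsigned int new_depth)
{
    assert(s != NULL);
    atomic_store_explicit(&s->depth, new_depth, memory_order_relaxed);

    /* Full fence: the store above may not be reordered after the load below. */
    atomic_thread_fence(memory_order_seq_cst);

    if (atomic_load_explicit(&s->has_waiter, memory_order_relaxed)) {
        atomic_fetch_add_explicit(&s->wakeups, 1, memory_order_relaxed);
        return true;
    }
    return false;
}

/* Waiter side: announce ourselves, then re-check the condition. */
static bool waiter_should_sleep(struct tag_state *s, unsigned int wanted)
{
    assert(s != NULL);
    atomic_store_explicit(&s->has_waiter, true, memory_order_relaxed);

    /* Matching full fence on the waiter side. */
    atomic_thread_fence(memory_order_seq_cst);

    return atomic_load_explicit(&s->depth, memory_order_relaxed) < wanted;
}
```

With a full fence between the store and the load on both sides, at least one side must observe the other's store: either the waker sees the waiter and wakes it, or the waiter sees the new depth and does not sleep. Without the waker-side fence, both loads could miss, which is the lost wake-up the commit message describes.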
  3. 01 Oct, 2015 2 commits
  4. 30 Sep, 2015 6 commits
    • A
      blk-mq: fix deadlock when reading cpu_list · 60de074b
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
      all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
      entries by blk_mq_sysfs_unregister().  Removing a sysfs entry blocks
      until the active reference count of the kernfs_node drops to zero.
      
      On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
      /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
      blk_mq_hw_sysfs_cpus_show().
      
      If these happen at the same time, a deadlock can occur: one side
      waits for the active reference to drop to zero while holding
      all_q_mutex, and the other tries to acquire all_q_mutex while holding
      the active reference.
      
      The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
      is to avoid reading an incomplete hctx->cpumask.  Since reading a sysfs
      entry for blk-mq needs to acquire q->sysfs_lock, we can avoid both the
      deadlock and reading an incomplete hctx->cpumask by holding
      q->sysfs_lock while hctx->cpumask is being updated.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      60de074b
    • A
      blk-mq: avoid inserting requests before establishing new mapping · 5778322e
      Committed by Akinobu Mita
      Notifier callbacks for the CPU_ONLINE action can run on a CPU other
      than the one that was just onlined.  So it is possible for a
      process running on the just-onlined CPU to insert a request and run
      the hw queue before the new mapping is established by
      blk_mq_queue_reinit_notify().
      
      This can cause a problem when the CPU has just been onlined for the
      first time since the request queue was initialized.  At this time
      ctx->index_hw for the CPU, which is the index in hctx->ctxs[] for this
      ctx, is still zero before blk_mq_queue_reinit_notify() is called by
      the notifier callbacks for the CPU_ONLINE action.
      
      For example, there is a single hw queue (hctx) and two CPU queues
      (ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined and
      a request is inserted into ctx1->rq_list, which sets bit0 in the
      pending bitmap because ctx1->index_hw is still zero.
      
      Then, while running the hw queue, flush_busy_ctxs() finds bit0 set
      in the pending bitmap and tries to retrieve requests from
      hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0, so the
      request in ctx1->rq_list is ignored.
      
      Fix it by ensuring that the new mapping is established before the
      onlined CPU starts running.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5778322e
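
The ctx0/ctx1 example in this commit message can be reproduced with a small userspace model. It is a simplified sketch, not the kernel data structures: `struct sw_ctx`, `struct hw_ctx`, `map_ctx()`, `insert_request()` and `flush_busy()` are hypothetical names, the pending bitmap is a single 32-bit word, and each request list is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CTX 8

/* Hypothetical software queue: one per CPU. */
struct sw_ctx {
    unsigned int index_hw;  /* slot in hw_ctx.ctxs[]; 0 until mapped */
    unsigned int nr_queued; /* stands in for rq_list */
};

/* Hypothetical hardware queue with a pending bitmap over its slots. */
struct hw_ctx {
    struct sw_ctx *ctxs[MAX_CTX];
    uint32_t pending;
    unsigned int nr_ctx;
};

/* Establish the mapping: give the software queue its own slot. */
static void map_ctx(struct hw_ctx *h, struct sw_ctx *c)
{
    assert(h != NULL && c != NULL && h->nr_ctx < MAX_CTX);
    c->index_hw = h->nr_ctx;
    h->ctxs[h->nr_ctx++] = c;
}

/* Queue one request and mark the slot named by index_hw as pending. */
static void insert_request(struct hw_ctx *h, struct sw_ctx *c)
{
    c->nr_queued++;
    h->pending |= UINT32_C(1) << c->index_hw;
}

/* Drain every slot whose pending bit is set; return requests retrieved. */
static unsigned int flush_busy(struct hw_ctx *h)
{
    unsigned int flushed = 0;

    for (unsigned int bit = 0; bit < h->nr_ctx; bit++) {
        if (!(h->pending & (UINT32_C(1) << bit)))
            continue;
        flushed += h->ctxs[bit]->nr_queued;
        h->ctxs[bit]->nr_queued = 0;
        h->pending &= ~(UINT32_C(1) << bit);
    }
    return flushed;
}
```

If `insert_request()` runs for ctx1 before `map_ctx()` has been called for it, the request sets bit0, `flush_busy()` drains ctx0's (empty) list, and the request stays stranded in ctx1 until a later insert marks the correct slot.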
    • A
      blk-mq: fix q->mq_usage_counter access race · 0e626368
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
      q->mq_usage_counter while freezing all request queues in all_q_list.
      On the other hand, q->mq_usage_counter is deinitialized in
      blk_mq_free_queue() before deleting the queue from all_q_list.
      
      So if a CPU hotplug event occurs in that window, percpu_ref_kill() is
      called with a q->mq_usage_counter which has already been marked dead,
      and it triggers a warning.  Fix it by deleting the queue from all_q_list
      before destroying q->mq_usage_counter.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0e626368
    • A
      blk-mq: Fix use after free of q->mq_map · a723bab3
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
      q->mq_map by blk_mq_update_queue_map() for all request queues in
      all_q_list.  On the other hand, q->mq_map is released before deleting
      the queue from all_q_list.
      
      So if a CPU hotplug event occurs in that window, an invalid memory
      access can happen.  Fix it by releasing q->mq_map in blk_mq_release()
      so that it happens later than removal from all_q_list.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Suggested-by: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a723bab3
    • A
      blk-mq: fix sysfs registration/unregistration race · 4593fdbe
      Committed by Akinobu Mita
      There is a race between cpu hotplug handling and adding/deleting
      gendisk for blk-mq, where both are trying to register and unregister
      the same sysfs entries.
      
      null_add_dev
          --> blk_mq_init_queue
              --> blk_mq_init_allocated_queue
                  --> add to 'all_q_list' (*)
          --> add_disk
              --> blk_register_queue
                  --> blk_mq_register_disk (++)
      
      null_del_dev
          --> del_gendisk
              --> blk_unregister_queue
                  --> blk_mq_unregister_disk (--)
          --> blk_cleanup_queue
              --> blk_mq_free_queue
                  --> del from 'all_q_list' (*)
      
      blk_mq_queue_reinit
          --> blk_mq_sysfs_unregister (-)
          --> blk_mq_sysfs_register (+)
      
      While the request queue is on 'all_q_list' (*),
      blk_mq_queue_reinit() can be called for the queue at any time by the
      CPU hotplug callback.  But blk_mq_sysfs_unregister (-) and
      blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
      before blk_mq_register_disk (++) has finished or after
      blk_mq_unregister_disk (--) has finished, because '/sys/block/*/mq/'
      does not exist at those times.
      
      There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
      be used to track this sysfs state, but it only fixes this issue
      partially.
      
      In order to fix it completely, we just need a per-queue flag instead
      of a per-hctx flag, with appropriate locking.  So this introduces
      q->mq_sysfs_init_done, which is properly protected by all_q_mutex.
      
      Also, we need to ensure that blk_mq_map_swqueue() is called with
      all_q_mutex held.  Since hctx->nr_ctx is reset temporarily and
      updated in blk_mq_map_swqueue(), we should avoid
      blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
      in CPU hotplug handling or when adding/deleting a gendisk.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4593fdbe
    • A
      blk-mq: avoid setting hctx->tags->cpumask before allocation · 1356aae0
      Committed by Akinobu Mita
      When an unmapped hw queue is remapped after the CPU topology changes,
      hctx->tags->cpumask has to be set after hctx->tags is set up in
      blk_mq_map_swqueue(); otherwise it causes a NULL pointer dereference.
      
      Fixes: f26cdc85 ("blk-mq: Shared tag enhancements")
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1356aae0
  5. 24 Sep, 2015 1 commit
  6. 18 Sep, 2015 1 commit
    • M
      block: fix bounce_end_io · 99451879
      Committed by Ming Lei
      When bio bounce is involved, one new bio and its biovecs are
      cloned from the incoming bio, which can be a fast-cloned bio
      from an upper layer (such as dm).
      
      So it is obviously wrong to assume the start index of the incoming
      (original) bio's io vector is zero; it can be any value between
      0 and (bi_max_vecs - 1), especially in case of bio split.
      
      This patch fixes a boot-time oops seen on Fedora i386, often
      accompanied by the following kernel log:
      
      > [    9.026738] systemd[1]: Switching root.
      > [    9.036467] systemd-journald[149]: Received SIGTERM from PID 1 (systemd).
      > [    9.082262] BUG: Bad page state in process kworker/u5:1  pfn:372ac
      > [    9.083989] page:f3d32ae0 count:0 mapcount:0 mapping:f2252178 index:0x16a
      > [    9.085755] flags: 0x40020021(locked|lru|mappedtodisk)
      > [    9.087284] page dumped because: page still charged to cgroup
      > [    9.088772] bad because of flags:
      > [    9.089731] flags: 0x21(locked|lru)
      > [    9.090818] page->mem_cgroup:f2c3e400
      Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
      Tested-by: Adam Williamson <awilliam@redhat.com>
      Cc: Ming Lin <mlin@kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      99451879
  7. 17 Sep, 2015 1 commit
    • M
      block: blk-merge: fast-clone bio when splitting rw bios · 52cc6eea
      Committed by Ming Lei
      biovecs have been immutable since v3.13, so it isn't necessary
      to allocate biovecs for the new cloned bios.  This saves
      one extra biovecs allocation/copy; that allocation is often
      not fixed-length and a bit more expensive.
      
      For example, if the 'max_sectors_kb' of null_blk's queue is set
      to 16 (32 sectors) via sysfs just to force more splits, this patch
      can increase throughput by about 70% in the sequential read test over
      null_blk (direct io, bs: 1M).
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Ming Lin <ming.l@ssi.samsung.com>
      Cc: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      
      This fixes a performance regression introduced by commit 54efd50b,
      and allows us to take full advantage of the fact that we have immutable
      bio_vecs. Hand applied, as it rejected violently with commit
      5014c311.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      52cc6eea
  8. 11 Sep, 2015 4 commits
  9. 04 Sep, 2015 1 commit
  10. 03 Sep, 2015 1 commit
  11. 02 Sep, 2015 1 commit
  12. 20 Aug, 2015 1 commit
  13. 19 Aug, 2015 12 commits
    • T
      blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy · 69d7fde5
      Committed by Tejun Heo
      cgroup is trying to make interfaces consistent across different
      controllers.  For weight based resource control, the knob should have
      the range [1, 10000] and default to 100.  This patch updates
      cfq-iosched so that the weight range conforms.  The internal
      calculations have enough range and the widening of the weight range
      shouldn't cause any problem.
      
      * blkcg_policy->cpd_bind_fn() is added.  If present, this is invoked
        when blkcg is attached to a hierarchy.
      
      * cfq_cpd_init() is updated to use the new default value on the
        unified hierarchy.
      
      * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
        apply the default config matching the hierarchy type.
      
      * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
        is moved into !CONFIG_CFQ_GROUP_IOSCHED block.  cfq_cpd_bind() is
        now responsible for initializing the initial weights when blkcg is
        enabled.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      69d7fde5
    • T
      blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/ · 3ecca629
      Committed by Tejun Heo
      blkcg is going to switch to the cgroup common weight range as defined by
      CGROUP_WEIGHT_* on the unified hierarchy.  In preparation, rename
      CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      3ecca629
    • T
      blkcg: implement interface for the unified hierarchy · 2ee867dc
      Committed by Tejun Heo
      The blkcg interface grew to be the biggest of all controllers and
      unfortunately the most inconsistent too.  The interface files are
      inconsistent, with a number of close duplicates.  Some files have
      recursive variants while others don't.  There's a distinction between
      normal and leaf weights which isn't intuitive, and there are a lot of
      stat knobs which don't make much sense outside of debugging and expose
      too many implementation details to userland.
      
      In the unified hierarchy, everything is always hierarchical and
      internal nodes can't have tasks, rendering moot the two structural
      issues twisting the current interface.  The interface has to be updated
      in a significant way anyway and this is a good chance to revamp it as
      a whole.  This patch implements the blkcg interface for the unified
      hierarchy.
      
      * (from a previous patch) blkcg is identified by "io" instead of
        "blkio" on the unified hierarchy.  Given that the whole interface is
        updated anyway, the rename shouldn't carry noticeable conversion
        overhead.
      
      * The original interface, which consisted of 27 files, is replaced
        with the following three files.
      
        blkio.stat	: per-blkcg stats
        blkio.weight	: per-cgroup and per-cgroup-queue weight settings
        blkio.max	: per-cgroup-queue bps and iops max limits
      
      Documentation/cgroups/unified-hierarchy.txt updated accordingly.
      
      v2: blkcg_policy->dfl_cftypes wasn't removed on
          blkcg_policy_unregister() corrupting the cftypes list.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2ee867dc
    • T
      blkcg: misc preparations for unified hierarchy interface · dd165eb3
      Committed by Tejun Heo
      * Export blkg_dev_name()
      
      * Drop unnecessary @cft from __cfq_set_weight().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dd165eb3
    • T
      blkcg: separate out tg_conf_updated() from tg_set_conf() · 69948b07
      Committed by Tejun Heo
      tg_set_conf() largely consists of parsing and setting the new
      config and the follow-up application and propagation.  This patch
      separates out the latter part into tg_conf_updated().  This will be
      used to implement interface for the unified hierarchy.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      69948b07
    • T
      blkcg: move body parsing from blkg_conf_prep() to its callers · 36aa9e5f
      Committed by Tejun Heo
      Currently, blkg_conf_prep() expects input to be of the following form
      
       MAJ:MIN NUM
      
      and reads the NUM part into blkg_conf_ctx->v.  This is quite
      restrictive and gets in the way in implementing blkcg interface for
      the unified hierarchy.  This patch updates blkg_conf_prep() so that it
      expects
      
       MAJ:MIN BODY_STR
      
      where BODY_STR is an arbitrary string.  blkg_conf_ctx->v is replaced
      with ->body which is a char pointer pointing to the start of BODY_STR.
      Parsing of the body is moved to blkg_conf_prep()'s callers.
      
      To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
      non-const pointer, and to accommodate that, const is dropped from
      @input too.
      
      This doesn't cause any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      36aa9e5f
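
The split between key parsing and body parsing described in this commit can be sketched in userspace C. This is an illustration under assumptions, not the kernel function: `struct conf_ctx`, `conf_prep()` and `parse_num_body()` are hypothetical names, and the device lookup that the real code performs is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical context: the parsed device number plus the unparsed body. */
struct conf_ctx {
    unsigned int major;
    unsigned int minor;
    char *body; /* points into the caller's input buffer */
};

/* Parse only the "MAJ:MIN" key and leave BODY_STR to the caller. */
static int conf_prep(char *input, struct conf_ctx *ctx)
{
    int key_len = 0;
    char *body;

    assert(input != NULL && ctx != NULL);

    /* %n records how many characters the key consumed. */
    if (sscanf(input, "%u:%u%n", &ctx->major, &ctx->minor, &key_len) != 2)
        return -EINVAL;

    body = input + key_len;
    if (!isspace((unsigned char)*body))
        return -EINVAL; /* key must be followed by whitespace and a body */
    while (isspace((unsigned char)*body))
        body++;

    ctx->body = body;
    return 0;
}

/* One possible caller: the old "MAJ:MIN NUM" form, parsed outside conf_prep(). */
static int parse_num_body(const char *body, unsigned long long *value)
{
    char *end;

    if (!isdigit((unsigned char)*body))
        return -EINVAL;
    errno = 0;
    *value = strtoull(body, &end, 10);
    if (*end != '\0')
        return -EINVAL;
    if (errno == ERANGE)
        return -ERANGE;
    return 0;
}
```

Because `body` is a non-const pointer into the input buffer, a different caller could tokenize it in place (for example into `key=value` pairs) without `conf_prep()` knowing anything about that format.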
    • T
      blkcg: mark existing cftypes as legacy · 880f50e2
      Committed by Tejun Heo
      blkcg is about to grow interface for the unified hierarchy.  Add
      legacy to existing cftypes.
      
      * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
      * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
      * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
      * blk-throttle.c:throtl_files -> throtl_legacy_files
      
      Pure renames.  No functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      880f50e2
    • T
      blkcg: rename subsystem name from blkio to io · c165b3e3
      Committed by Tejun Heo
      blkio interface has become messy over time and is currently the
      largest.  In addition to the inconsistent naming scheme, it has
      multiple stat files which report more or less the same thing, a number
      of debug stat files which expose internal details which shouldn't have
      been part of the public interface in the first place, recursive and
      non-recursive stats and leaf and non-leaf knobs.
      
      Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
      don't make any sense on the unified hierarchy as only leaf cgroups can
      contain processes.  cgroups is going through a major interface
      revision with the unified hierarchy involving significant fundamental
      usage changes and given that a significant portion of the interface
      doesn't make sense anymore, it's a good time to reorganize the
      interface.
      
      As the first step, this patch renames the externally visible subsystem
      name from "blkio" to "io".  This is more concise, matches the other
      two major subsystem names, "cpu" and "memory", and is better suited as
      blkcg will be involved in anything writeback related too whether an
      actual block device is involved or not.
      
      As the subsystem legacy_name is set to "blkio", the only userland
      visible change outside the unified hierarchy is that blkcg is reported
      as "io" instead of "blkio" in the subsystem initialized message during
      boot.  On the unified hierarchy, blkcg now appears as "io".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c165b3e3
    • T
      blkcg: refine error codes returned during blkcg configuration · 20386ce0
      Committed by Tejun Heo
      blkcg currently returns -EINVAL for most errors which can be pretty
      confusing given that the failure modes are quite varied.  Update the
      error returns so that
      
      * -EINVAL only for syntactic errors.
      * -ERANGE if the value is out of range.
      * -ENODEV if the target device can't be found.
      * -EOPNOTSUPP if the policy is not enabled on the target device.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      20386ce0
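
The four error classes listed in this commit can be illustrated with a small userspace sketch. Everything here is assumed for illustration: `struct dev_entry`, the `known_devs` table, `find_dev()` and `set_weight()` are hypothetical, the weight limits 1 and 10000 are borrowed from the io.weight commit earlier in this list, and the order of the checks is a choice made for the example rather than the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

enum { WEIGHT_MIN = 1, WEIGHT_MAX = 10000 }; /* assumed limits for the sketch */

/* Hypothetical device table standing in for the real device lookup. */
struct dev_entry {
    unsigned int major;
    unsigned int minor;
    int policy_enabled;
};

static const struct dev_entry known_devs[] = {
    { 8, 0, 1 },  /* policy enabled */
    { 8, 16, 0 }, /* device exists, policy not enabled */
};

static const struct dev_entry *find_dev(unsigned int major, unsigned int minor)
{
    for (size_t i = 0; i < sizeof(known_devs) / sizeof(known_devs[0]); i++)
        if (known_devs[i].major == major && known_devs[i].minor == minor)
            return &known_devs[i];
    return NULL;
}

/* Accepts "MAJ:MIN WEIGHT"; returns 0 or a negative errno value. */
static int set_weight(const char *input, unsigned int *weight_out)
{
    unsigned int major, minor;
    long weight;
    int used = 0;
    const struct dev_entry *dev;

    assert(input != NULL && weight_out != NULL);

    /* Syntactic problems: wrong shape or trailing garbage. */
    if (sscanf(input, "%u:%u %ld%n", &major, &minor, &weight, &used) != 3 ||
        input[used] != '\0')
        return -EINVAL;

    dev = find_dev(major, minor);
    if (dev == NULL)
        return -ENODEV;     /* target device can't be found */
    if (!dev->policy_enabled)
        return -EOPNOTSUPP; /* policy not enabled on the device */
    if (weight < WEIGHT_MIN || weight > WEIGHT_MAX)
        return -ERANGE;     /* well-formed but out of range */

    *weight_out = (unsigned int)weight;
    return 0;
}
```

A caller can then distinguish a typo (`-EINVAL`) from a valid request aimed at the wrong device (`-ENODEV`) or a value outside the permitted range (`-ERANGE`), which is the point of the commit.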
    • T
      blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device() · 5332dfc3
      Committed by Tejun Heo
      blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
      enabled are guaranteed to return non-NULL and the counterpart in
      blk-throttle doesn't have these checks either.  Remove the spurious
      NULL checks.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5332dfc3
    • T
      blkcg: reduce stack usage of blkg_rwstat_recursive_sum() · 3a7faead
      Committed by Tejun Heo
      The recent percpu conversion of blkg_rwstat triggered the following
      warning in certain configurations.
      
       block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes
      
      This is because blkg_rwstat now contains four percpu_counter which can
      be pretty big depending on debug options although it shouldn't be a
      problem in production configs.  This patch removes one of the two
      local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
      reduce stack usage.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835
      Signed-off-by: Jens Axboe <axboe@fb.com>
      3a7faead
    • T
      blkcg: remove cfqg_stats->sectors · 702747ca
      Committed by Tejun Heo
      cfqg_stats->sectors is a blkg_stat which keeps track of the total
      number of sectors serviced; however, this can be trivially calculated
      from blkcg_gq->stat_bytes.  The only thing necessary is adding up
      READs and WRITEs and then dividing by sector size.
      
      Remove cfqg_stats->sectors and make cfq print "sectors" and
      "sectors_recursive" from stat_bytes.
      
      While this is a bit more code, it removes duplicate stat allocations
      and updates and ensures that the reported stats stay in tune with each
      other.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      702747ca
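
The derivation mentioned in this commit (add READ and WRITE bytes, then divide by the sector size) is simple enough to show directly. This is a minimal sketch assuming 512-byte sectors; `struct rw_bytes` and `sectors_from_bytes()` are hypothetical names and reduce the kernel's per-direction counters to two plain integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* assumes 512-byte sectors */

/* Hypothetical, simplified view of the byte counters. */
struct rw_bytes {
    uint64_t read;
    uint64_t write;
};

/* Total sectors serviced, derived from byte counts instead of a separate stat. */
static uint64_t sectors_from_bytes(const struct rw_bytes *b)
{
    assert(b != NULL);
    return (b->read + b->write) >> SECTOR_SHIFT;
}
```

For instance, 4096 bytes read plus 1024 bytes written gives 5120 bytes, which is 10 sectors. Deriving the figure this way means the sector count cannot drift away from the byte counters, since there is only one source of truth.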