1. 03 Nov 2015 (1 commit)
    • blk-mq: avoid excessive boot delays with large lun counts · 2404e607
      Committed by Jeff Moyer
      Zhangqing Luo reported long boot times on a system with thousands of
      LUNs when scsi-mq was enabled.  He narrowed the problem down to
      blk_mq_add_queue_tag_set, where every queue is frozen in order to set
      the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
      added before it in sequence, which involves waiting for an RCU grace
      period for each one.  We don't need to do this.  After the second queue
      is added, only new queues need to be initialized with the shared tag.
      We can do that by percolating the flag up to the blk_mq_tag_set, and
      updating the newly added queue's hctxs if the flag is set.
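
      A minimal sketch of that approach (the helper name is illustrative,
      not the verbatim patch; BLK_MQ_F_TAG_SHARED and
      queue_for_each_hw_ctx() are the kernel's own names):

          /* With the shared state tracked in the tag set itself, a newly
           * added queue can be flagged at init time; only the transition
           * from one queue to two has to freeze anything. */
          static void queue_set_hctx_shared(struct request_queue *q, bool shared)
          {
                  struct blk_mq_hw_ctx *hctx;
                  unsigned int i;

                  queue_for_each_hw_ctx(q, hctx, i) {
                          if (shared)
                                  hctx->flags |= BLK_MQ_F_TAG_SHARED;
                          else
                                  hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
                  }
          }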
      
      This problem was introduced by commit 0d2602ca (blk-mq: improve
      support for shared tags maps).
      Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 28 Oct 2015 (1 commit)
  3. 22 Oct 2015 (19 commits)
  4. 15 Oct 2015 (2 commits)
    • block: don't release bdi while request_queue has live references · b02176f3
      Committed by Tejun Heo
      bdi's are initialized in two steps, bdi_init() and bdi_register(),
      but destroyed in a single step by bdi_destroy(), which, for a bdi
      embedded in a request_queue, is called during blk_cleanup_queue();
      that function makes the queue invisible and starts draining the
      remaining usages.
      
      A request_queue's user can access the congestion state of the embedded
      bdi as long as it holds a reference to the queue.  As such, it may
      access the congested state of a queue which finished
      blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
      Because the congested state was embedded in backing_dev_info which in
      turn is embedded in request_queue, accessing the congested state after
      bdi_destroy() was called was fine.  The bdi was destroyed but the
      memory region for the congested state remained accessible till the
      queue got released.
      
      a13f35e8 ("writeback: don't embed root bdi_writeback_congested in
      bdi_writeback") changed the situation.  Now, the root congested state
      which is expected to be pinned while request_queue remains accessible
      is separately reference counted and the base ref is put during
      bdi_destroy().  This means that the root congested state may go away
      prematurely while the queue is between bdi_destroy() and
      blk_release_queue(), which was detected by Andrey's KASAN tests.
      
      The root cause of this problem is that bdi doesn't distinguish the two
      steps of destruction, unregistration and release, and now the root
      congested state actually requires a separate release step.  To fix the
      issue, this patch separates out bdi_unregister() and bdi_exit() from
      bdi_destroy().  bdi_unregister() is called from blk_cleanup_queue()
      and bdi_exit() from blk_release_queue().  bdi_destroy() is now just a
      simple wrapper calling the two steps back-to-back.
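
      A sketch of the resulting split, per the description above:

          /* Users outside the request_queue path still get both steps
           * back-to-back. */
          void bdi_destroy(struct backing_dev_info *bdi)
          {
                  bdi_unregister(bdi);  /* unregistration step, from blk_cleanup_queue() */
                  bdi_exit(bdi);        /* release step, from blk_release_queue() */
          }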
      
      While at it, the prototype of bdi_destroy() is moved right below
      bdi_setup_and_register() so that the counterpart operations are
      located together.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
      Cc: stable@vger.kernel.org # v4.2+
      Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com>
      Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
      Reviewed-by: Jan Kara <jack@suse.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: fix use-after-free in blk_mq_free_tag_set() · f42d79ab
      Committed by Junichi Nomura
      tags is freed in blk_mq_free_rq_map() and should not be used after that.
      The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false,
      because free_cpumask_var() is a nop there.
      
      tags->cpumask is allocated in blk_mq_init_tags(), so it's natural to
      free the cpumask in its counterpart, blk_mq_free_tags().
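
      A hedged sketch of the resulting pairing (the bt_free() calls and
      field names are assumptions based on the v4.3-era code):

          void blk_mq_free_tags(struct blk_mq_tags *tags)
          {
                  bt_free(&tags->bitmap_tags);
                  bt_free(&tags->breserved_tags);
                  free_cpumask_var(tags->cpumask);  /* allocated in blk_mq_init_tags() */
                  kfree(tags);
          }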
      
      Fixes: f26cdc85 ("blk-mq: Shared tag enhancements")
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 10 Oct 2015 (2 commits)
    • 3380f458
    • blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c · 8ee1b7b9
      Committed by Kosuke Tatsukawa
      blk_mq_tag_update_depth() seems to be missing a memory barrier which
      might cause the waker to not notice the waiter and fail to send a
      wake_up as in the following figure.
      
      	blk_mq_tag_update_depth			bt_get
      ------------------------------------------------------------------------
      if (waitqueue_active(&bs->wait))
      /* The CPU might reorder the test for
         the waitqueue up here, before
         prior writes complete */
      					prepare_to_wait(&bs->wait, &wait,
      					  TASK_UNINTERRUPTIBLE);
      					tag = __bt_get(hctx, bt, last_tag,
      					  tags);
      					/* Value set in bt_update_count not
      					   visible yet */
      bt_update_count(&tags->bitmap_tags, tdepth);
      /* blk_mq_tag_wakeup_all(tags, false); */
       bt = &tags->bitmap_tags;
       wake_index = atomic_read(&bt->wake_index);
      					...
      					io_schedule();
      ------------------------------------------------------------------------
      
      This patch adds the missing memory barrier.
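
      The general shape of the fix, sketched on the waker side (the actual
      placement follows the call chain in the figure):

          bt_update_count(&tags->bitmap_tags, tdepth);
          /*
           * Pairs with the barrier implied by prepare_to_wait() on the
           * waiter side: the updated count must be visible before the
           * waitqueue emptiness test below.
           */
          smp_mb();
          if (waitqueue_active(&bs->wait))
                  wake_up(&bs->wait);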
      
      I found this issue while looking through the Linux source code for
      places that call waitqueue_active() before wake_up*() without a
      preceding memory barrier, after sending a patch to fix a similar
      issue in drivers/tty/n_tty.c.  (Details about the original issue can
      be found here: https://lkml.org/lkml/2015/9/28/849.)
      Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 01 Oct 2015 (2 commits)
  7. 30 Sep 2015 (6 commits)
    • blk-mq: fix deadlock when reading cpu_list · 60de074b
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
      all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
      entries via blk_mq_sysfs_unregister().  Removing a sysfs entry blocks
      until the active reference count of its kernfs_node drops to zero.
      
      On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
      /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
      blk_mq_hw_sysfs_cpus_show().
      
      If these happen at the same time, a deadlock can occur, because one
      side waits for the active reference to drop to zero while holding
      all_q_mutex, and the other tries to acquire all_q_mutex while holding
      the active reference.
      
      The reason all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show() is
      to avoid reading an incomplete hctx->cpumask.  Since reading a blk-mq
      sysfs entry already requires holding q->sysfs_lock, we can avoid both
      the deadlock and reading an incomplete hctx->cpumask by holding
      q->sysfs_lock while hctx->cpumask is being updated.
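
      A sketch of that protection (assuming the update site in
      blk_mq_map_swqueue()):

          mutex_lock(&q->sysfs_lock);
          cpumask_clear(hctx->cpumask);
          /* ... rebuild hctx->cpumask for the new CPU topology ... */
          mutex_unlock(&q->sysfs_lock);

      Readers of /sys/block/*/mq/*/cpu_list already hold q->sysfs_lock, so
      they can no longer observe a half-built mask.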
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: avoid inserting requests before establishing new mapping · 5778322e
      Committed by Akinobu Mita
      Notifier callbacks for the CPU_ONLINE action can run on a CPU other
      than the one that was just onlined.  So it is possible for a process
      running on the just-onlined CPU to insert a request and run the hw
      queue before the new mapping is established by
      blk_mq_queue_reinit_notify().
      
      This can cause a problem when the CPU has been onlined for the first
      time since the request queue was initialized.  At that point
      ctx->index_hw for the CPU, which is the index in hctx->ctxs[] for
      this ctx, is still zero, because blk_mq_queue_reinit_notify() has not
      been run yet by the CPU_ONLINE notifier callbacks.
      
      For example, suppose there is a single hw queue (hctx) and two CPU
      queues (ctx0 for CPU0, ctx1 for CPU1).  Now CPU1 is onlined, a
      request is inserted into ctx1->rq_list, and bit 0 is set in the
      pending bitmap, since ctx1->index_hw is still zero.
      
      And then, while running the hw queue, flush_busy_ctxs() finds bit 0
      set in the pending bitmap and tries to retrieve requests from
      hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0, so
      the request in ctx1->rq_list is ignored.
      
      Fix it by ensuring that the new mapping is established before the
      onlined CPU starts running.
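
      A simplified illustration (not verbatim kernel code; the helper is
      hypothetical) of why the stale index misroutes requests, since the
      dispatch side trusts the pending bit to name the software queue:

          LIST_HEAD(rq_list);
          unsigned long pending = get_pending_bits(hctx);  /* hypothetical helper */
          unsigned int bit;

          for_each_set_bit(bit, &pending, hctx->nr_ctx) {
                  /* trusts bit == ctx->index_hw for the inserting ctx */
                  struct blk_mq_ctx *ctx = hctx->ctxs[bit];

                  list_splice_tail_init(&ctx->rq_list, &rq_list);
          }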
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: fix q->mq_usage_counter access race · 0e626368
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
      q->mq_usage_counter while freezing all request queues in all_q_list.
      On the other hand, q->mq_usage_counter is deinitialized in
      blk_mq_free_queue() before deleting the queue from all_q_list.
      
      So if a CPU hotplug event occurs in this window, percpu_ref_kill() is
      called on a q->mq_usage_counter that has already been marked dead,
      which triggers a warning.  Fix it by deleting the queue from
      all_q_list before destroying q->mq_usage_counter.
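
      A sketch of the reordering in blk_mq_free_queue() (names per the
      changelog):

          mutex_lock(&all_q_mutex);
          list_del_init(&q->all_q_node);  /* hotplug reinit can no longer see q */
          mutex_unlock(&all_q_mutex);

          percpu_ref_exit(&q->mq_usage_counter);  /* now safe to tear down */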
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: fix use-after-free of q->mq_map · a723bab3
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
      q->mq_map by blk_mq_update_queue_map() for all request queues in
      all_q_list.  On the other hand, q->mq_map is released before deleting
      the queue from all_q_list.
      
      So if a CPU hotplug event occurs in this window, invalid memory
      accesses can happen.  Fix it by releasing q->mq_map in
      blk_mq_release(), so that it happens later than the removal from
      all_q_list.
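
      A sketch of the move (blk_mq_release() runs from the queue's release
      path, after the queue has left all_q_list):

          void blk_mq_release(struct request_queue *q)
          {
                  /* ... existing hctx teardown ... */

                  kfree(q->mq_map);
                  q->mq_map = NULL;
          }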
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Suggested-by: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: fix sysfs registration/unregistration race · 4593fdbe
      Committed by Akinobu Mita
      There is a race between cpu hotplug handling and adding/deleting
      gendisk for blk-mq, where both are trying to register and unregister
      the same sysfs entries.
      
      null_add_dev
          --> blk_mq_init_queue
              --> blk_mq_init_allocated_queue
                  --> add to 'all_q_list' (*)
          --> add_disk
              --> blk_register_queue
                  --> blk_mq_register_disk (++)
      
      null_del_dev
          --> del_gendisk
              --> blk_unregister_queue
                  --> blk_mq_unregister_disk (--)
          --> blk_cleanup_queue
              --> blk_mq_free_queue
                  --> del from 'all_q_list' (*)
      
      blk_mq_queue_reinit
          --> blk_mq_sysfs_unregister (-)
          --> blk_mq_sysfs_register (+)
      
      While the request queue is on 'all_q_list' (*),
      blk_mq_queue_reinit() can be called for the queue at any time by the
      CPU hotplug callback.  But blk_mq_sysfs_unregister (-) and
      blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called
      before blk_mq_register_disk (++) has finished, nor after
      blk_mq_unregister_disk (--) has finished, because '/sys/block/*/mq/'
      does not exist then.
      
      There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags that can be
      used to track this sysfs state, but it only fixes the issue
      partially.

      To fix it completely, we need a per-queue flag instead of a per-hctx
      flag, with appropriate locking.  So this introduces
      q->mq_sysfs_init_done, which is properly protected by all_q_mutex.
      
      Also, we need to ensure that blk_mq_map_swqueue() is called with
      all_q_mutex held.  Since hctx->nr_ctx is temporarily reset and then
      updated in blk_mq_map_swqueue(), this prevents blk_mq_register_hctx()
      from seeing the temporary hctx->nr_ctx value during CPU hotplug
      handling or while adding/deleting a gendisk.
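
      A sketch of the per-queue gate (assuming q->mq_sysfs_init_done is set
      at the end of blk_mq_register_disk() and cleared by
      blk_mq_unregister_disk(), both under all_q_mutex):

          void blk_mq_sysfs_unregister(struct request_queue *q)
          {
                  if (!q->mq_sysfs_init_done)
                          return;  /* '/sys/block/*/mq/' does not exist yet */

                  /* ... remove the hctx/ctx sysfs entries ... */
          }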
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: avoid setting hctx->tags->cpumask before allocation · 1356aae0
      Committed by Akinobu Mita
      When an unmapped hw queue is remapped after the CPU topology has
      changed, hctx->tags->cpumask has to be set after hctx->tags is set up
      in blk_mq_map_swqueue(); otherwise it causes a null pointer
      dereference.
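
      A sketch of the ordering fix in blk_mq_map_swqueue():

          hctx->tags = set->tags[i];  /* may have just been allocated */
          /* only now is it safe to touch the tag cpumask */
          cpumask_set_cpu(cpu, hctx->tags->cpumask);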
      
      Fixes: f26cdc85 ("blk-mq: Shared tag enhancements")
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 24 Sep 2015 (1 commit)
  9. 18 Sep 2015 (2 commits)
    • cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
      Committed by Tejun Heo
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
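
      A hedged usage sketch (the subsystem name is illustrative; the real
      conversion touches each controller's call sites):

          /* static_key-backed: compiles to a patched branch instead of
           * chasing the cgroup's root at runtime */
          if (cgroup_subsys_on_dfl(blkio_cgrp_subsys)) {
                  /* controller is attached to the default hierarchy */
          }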
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
    • block: fix bounce_end_io · 99451879
      Committed by Ming Lei
      When bio bounce is involved, a new bio and its biovecs are cloned
      from the incoming bio, which may itself be a fast-cloned bio from an
      upper layer (such as dm).

      So it is obviously wrong to assume that the start index of the
      incoming (original) bio's io vector is zero; it can be any value
      between 0 and (bi_max_vecs - 1), especially in the case of a bio
      split.
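
      A hedged sketch of the corrected iteration; the point is that
      bio_for_each_segment() starts from the clone's own bi_iter (including
      bi_idx) rather than from index zero:

          struct bio_vec from;
          struct bvec_iter iter;

          bio_for_each_segment(from, bio_orig, iter) {
                  /* visits only the segments this (possibly split,
                   * fast-cloned) bio actually covers */
          }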
      
      This patch fixes Fedora's boot oops on i386, which often appears
      together with the following kernel log:
      
      > [    9.026738] systemd[1]: Switching root.
      > [    9.036467] systemd-journald[149]: Received SIGTERM from PID 1
      > (systemd).
      > [    9.082262] BUG: Bad page state in process kworker/u5:1  pfn:372ac
      > [    9.083989] page:f3d32ae0 count:0 mapcount:0 mapping:f2252178
      > index:0x16a
      > [    9.085755] flags: 0x40020021(locked|lru|mappedtodisk)
      > [    9.087284] page dumped because: page still charged to cgroup
      > [    9.088772] bad because of flags:
      > [    9.089731] flags: 0x21(locked|lru)
      > [    9.090818] page->mem_cgroup:f2c3e400
      Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
      Tested-by: Adam Williamson <awilliam@redhat.com>
      Cc: Ming Lin <mlin@kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  10. 17 Sep 2015 (1 commit)
    • block: blk-merge: fast-clone bio when splitting rw bios · 52cc6eea
      Committed by Ming Lei
      Biovecs have been immutable since v3.13, so it isn't necessary to
      allocate biovecs for the new cloned bios; we can save one extra
      biovec allocation/copy, and that allocation is often not fixed-length
      and a bit more expensive.
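
      A sketch of the fast-clone split (a fragment; 'sectors' and 'bs' are
      illustrative):

          split = bio_clone_fast(bio, gfp, bs);      /* shares bio->bi_io_vec */
          split->bi_iter.bi_size = sectors << 9;     /* first part of the split */
          bio_advance(bio, split->bi_iter.bi_size);  /* remainder stays in 'bio' */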
      
      For example, if 'max_sectors_kb' of the null_blk queue is set to
      16 (32 sectors) via sysfs just to force more splits, this patch
      increases throughput by about 70% in a sequential read test over
      null_blk (direct io, bs: 1M).
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Ming Lin <ming.l@ssi.samsung.com>
      Cc: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      
      This fixes a performance regression introduced by commit 54efd50b,
      and allows us to take full advantage of the fact that we have immutable
      bio_vecs. Hand applied, as it rejected violently with commit
      5014c311.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 11 Sep 2015 (3 commits)
    • block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg · 6fe810bd
      Committed by Tejun Heo
      While making the root blkg unconditional, ec13b1d6 ("blkcg: always
      create the blkcg_gq for the root blkcg") removed the part which clears
      q->root_blkg and ->root_rl.blkg during q exit.  This leaves the two
      pointers dangling after blkg_destroy_all().  The blk-throttle exit
      path performs blkg traversals and dereferences ->root_blkg, which can
      lead to the following oops.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
       IP: [<ffffffff81389746>] __blkg_lookup+0x26/0x70
       ...
       task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
       RIP: 0010:[<ffffffff81389746>]  [<ffffffff81389746>] __blkg_lookup+0x26/0x70
       ...
       Call Trace:
        [<ffffffff8138d14a>] blk_throtl_drain+0x5a/0x110
        [<ffffffff8138a108>] blkcg_drain_queue+0x18/0x20
        [<ffffffff81369a70>] __blk_drain_queue+0xc0/0x170
        [<ffffffff8136a101>] blk_queue_bypass_start+0x61/0x80
        [<ffffffff81388c59>] blkcg_deactivate_policy+0x39/0x100
        [<ffffffff8138d328>] blk_throtl_exit+0x38/0x50
        [<ffffffff8138a14e>] blkcg_exit_queue+0x3e/0x50
        [<ffffffff8137016e>] blk_release_queue+0x1e/0xc0
       ...
      
      While the bug is a straightforward use-after-free, it is tricky to
      reproduce because blkg release is RCU protected and the rest of the
      exit path usually finishes before the RCU grace period.

      This patch fixes the bug by updating blkg_destroy_all() to clear
      q->root_blkg and ->root_rl.blkg.
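
      The fix itself is small (per the changelog, at the end of
      blkg_destroy_all()):

          q->root_blkg = NULL;
          q->root_rl.blkg = NULL;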
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: "Richard W.M. Jones" <rjones@redhat.com>
      Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
      Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
      Fixes: ec13b1d6 ("blkcg: always create the blkcg_gq for the root blkcg")
      Cc: stable@vger.kernel.org # v4.2+
      Tested-by: Richard W.M. Jones <rjones@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Copy a user iovec if it includes gaps · 46348456
      Committed by Sagi Grimberg
      For drivers that don't support gaps in the SG lists handed to them,
      we must bounce (copy the user buffers) and pass a bio that does not
      include gaps.  This doesn't matter for any current user, but will
      allow iser, which can't handle gaps, to use the block virtual
      boundary instead of driver-local bounce buffering when handling
      SG_IO commands.
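
      A hedged sketch of the gap test between consecutive user iovecs (the
      helper name is illustrative; queue_virt_boundary() is the kernel's
      own accessor):

          static bool iovec_gap_to_prev(struct request_queue *q,
                                        const struct iovec *prev,
                                        const struct iovec *cur)
          {
                  unsigned long mask = queue_virt_boundary(q);

                  if (!mask)
                          return false;
                  /* a gap exists unless prev ends, and cur starts, on the
                   * virtual boundary */
                  return (((unsigned long)prev->iov_base + prev->iov_len) |
                          (unsigned long)cur->iov_base) & mask;
          }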
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Refuse appending a gapped integrity page to a bio · 87a816df
      Committed by Sagi Grimberg
      This is only theoretical at the moment, given that the only
      subsystems generating integrity payloads are the block layer itself
      and the scsi target (both of which generate well-aligned integrity
      payloads).  But when we expose integrity metadata to user space,
      we'll need to refuse appending a page with a gap (if the queue
      virtual boundary is set).
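
      A sketch of the refusal in bio_integrity_add_page(), assuming the
      v4.3-era bvec_gap_to_prev() signature:

          if (bip->bip_vcnt &&
              bvec_gap_to_prev(bdev_get_queue(bio->bi_bdev),
                               &bip->bip_vec[bip->bip_vcnt - 1], offset))
                  return 0;  /* page not added */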
      Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>