1. 08 Aug, 2016 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Committed by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
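The layout this commit message describes, flags in the low bits and the op code in the high bits of a single field, can be sketched in user-space C. This is only an illustration: the 3-bit op width, the `SKETCH_*` constants, and the helper names are assumptions made for the example, not the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical layout for illustration only: the top SKETCH_OP_BITS bits
 * hold the operation, the remaining low bits hold flags. */
#define SKETCH_OP_BITS   3u
#define SKETCH_OP_SHIFT  (sizeof(unsigned int) * CHAR_BIT - SKETCH_OP_BITS)
#define SKETCH_FLAG_MASK ((1u << SKETCH_OP_SHIFT) - 1u)

enum sketch_op { SKETCH_OP_READ = 0, SKETCH_OP_WRITE = 1, SKETCH_OP_DISCARD = 2 };

#define SKETCH_FLAG_SYNC (1u << 0)
#define SKETCH_FLAG_META (1u << 1)

struct sketch_bio {
	unsigned int bi_opf;	/* op in the high bits, flags in the low bits */
};

/* Pack an op and a set of flags into the single field. */
static void sketch_set_op_attrs(struct sketch_bio *bio, enum sketch_op op,
				unsigned int flags)
{
	/* Flags must not spill into the bits reserved for the op. */
	assert((flags & ~SKETCH_FLAG_MASK) == 0);
	bio->bi_opf = ((unsigned int)op << SKETCH_OP_SHIFT) | flags;
}

static enum sketch_op sketch_op_of(const struct sketch_bio *bio)
{
	return (enum sketch_op)(bio->bi_opf >> SKETCH_OP_SHIFT);
}

static unsigned int sketch_flags_of(const struct sketch_bio *bio)
{
	return bio->bi_opf & SKETCH_FLAG_MASK;
}
```

In this model, legacy-style code that stores a small value directly in the field (for example `bio.bi_opf = 1u`) sets a flag bit and leaves the op as `SKETCH_OP_READ`. That is the kind of silent runtime breakage the rename turns into a compile error.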
  2. 05 Aug, 2016 1 commit
    • blk-mq: Allow timeouts to run while queue is freezing · 71f79fb3
      Committed by Gabriel Krisman Bertazi
      In case a submitted request gets stuck for some reason, the block layer
      can prevent the request starvation by starting the scheduled timeout work.
      If this stuck request occurs at the same time another thread has started
      a queue freeze, the blk_mq_timeout_work will not be able to acquire the
      queue reference and will return silently, thus not issuing the timeout.
      But since the request is already holding a q_usage_counter reference and
      is unable to complete, it will never release its reference, preventing
      the queue from completing the freeze started by the first thread.  This puts
      the request_queue in a hung state, forever waiting for the freeze
      completion.
      
      This was observed while running IO to an NVMe device at the same time we
      toggled the CPU hotplug code. Eventually, once a request got stuck
      requiring a timeout during a queue freeze, we saw the CPU Hotplug
      notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
      the trace below.
      
      [c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
      [c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
      [c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
      [c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
      [c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
      [c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
      [c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
      [c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
      [c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
      [c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
      [c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
      [c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
      [c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
      [c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
      [c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
      [c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
      [c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
      [c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
      [c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
      [c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
      
      The fix is to allow the timeout work to execute in the window between
      dropping the initial refcount reference and the release of the last
      reference, which actually marks the freeze completion.  This can be
      achieved with percpu_ref_tryget, which does not require the counter
      to be alive.  This way the timeout work can do its job and terminate a
      stuck request even during a freeze, returning its reference and avoiding
      the deadlock.
      
      Allowing the timeout to run is just a part of the fix, since for some
      devices, we might get stuck again inside the device driver's timeout
      handler, should it attempt to allocate a new request in that path -
      which is a quite common action for Abort commands, which need to be sent
      after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
      from inside the timeout handler, which will fail during a freeze, since
      it also tries to acquire a queue reference.
      
      I considered a similar change to blk_mq_alloc_request as a generic
      solution for further device driver hangs, but we can't do that, since it
      would allow new requests to disturb the freeze process.  I thought about
      creating a new function in the block layer to support unfreezable
      requests for these occasions, but after working on it for a while, I
      feel like this should be handled on a per-driver basis.  I'm now
      experimenting with changes to the NVMe timeout path, but I'm open to
      suggestions of ways to make this generic.
      Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
      Cc: Brian King <brking@linux.vnet.ibm.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-block@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
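The difference between a "live" reference grab and a plain one, which the fix above relies on, can be modelled with a small single-threaded sketch. The `sketch_*` names and the plain counter are assumptions for illustration; the kernel's per-CPU reference counter is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified single-threaded stand-in for a queue usage counter. */
struct sketch_ref {
	long count;	/* outstanding references, including the initial one */
	bool dead;	/* set once a freeze has started */
};

/* Succeeds only while no freeze is in progress. */
static bool sketch_tryget_live(struct sketch_ref *ref)
{
	if (ref->dead || ref->count == 0)
		return false;
	ref->count++;
	return true;
}

/* Succeeds as long as at least one reference is still held. */
static bool sketch_tryget(struct sketch_ref *ref)
{
	if (ref->count == 0)
		return false;
	ref->count++;
	return true;
}

static void sketch_put(struct sketch_ref *ref)
{
	assert(ref->count > 0);
	ref->count--;
}

/* Start a freeze: mark the counter dead and drop the initial reference. */
static void sketch_kill(struct sketch_ref *ref)
{
	ref->dead = true;
	sketch_put(ref);
}

/* The freeze completes once the last reference is gone. */
static bool sketch_frozen(const struct sketch_ref *ref)
{
	return ref->dead && ref->count == 0;
}
```

With a stuck request holding one reference, `sketch_kill()` leaves the count at one. `sketch_tryget_live()` then fails, which corresponds to the timeout work returning silently, while `sketch_tryget()` succeeds and lets the timeout path run and eventually release the stuck request's reference.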
  3. 21 Jul, 2016 1 commit
  4. 06 Jul, 2016 1 commit
  5. 09 Jun, 2016 1 commit
    • blk-mq: actually hook up defer list when running requests · 52b9c330
      Committed by Omar Sandoval
      If ->queue_rq() returns BLK_MQ_RQ_QUEUE_OK, we use continue and skip
      over the rest of the loop body. However, dptr is assigned later in the
      loop body, and the BLK_MQ_RQ_QUEUE_OK case is exactly the case that we'd
      want it for.
      
      NVMe isn't actually using BLK_MQ_F_DEFER_ISSUE yet, nor is any other
      in-tree driver, but if the code's going to be there, it might as well
      work.
      
      Fixes: 74c45052 ("blk-mq: add a 'list' parameter to ->queue_rq()")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
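The control-flow problem described in this commit, a `continue` that skips an assignment later in the loop body, can be reproduced in a small model. Everything here is hypothetical scaffolding (`sketch_queue_rq`, the `buggy` switch, integers standing in for requests); only the shape of the loop follows the commit message.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_ret { SKETCH_OK, SKETCH_BUSY };

/* Hypothetical driver hook: records whether it was handed a defer list. */
static int sketch_calls_with_list;

static enum sketch_ret sketch_queue_rq(int rq, int *defer_list)
{
	(void)rq;
	if (defer_list)
		sketch_calls_with_list++;
	return SKETCH_OK;
}

/* Dispatch nr requests; a non-zero buggy selects the pre-fix control flow. */
static int sketch_dispatch(int nr, int buggy)
{
	int driver_list = 0;
	int *dptr = NULL;
	int queued = 0;

	assert(nr >= 0);
	sketch_calls_with_list = 0;
	for (int i = 0; i < nr; i++) {
		enum sketch_ret ret = sketch_queue_rq(i, dptr);

		if (ret == SKETCH_OK) {
			queued++;
			if (buggy)
				continue;	/* skips the dptr assignment below */
		} else {
			break;			/* busy: stop dispatching */
		}
		/* After the first request, pass the defer list if more remain. */
		if (!dptr && i + 1 < nr)
			dptr = &driver_list;
	}
	return queued;
}
```

Both variants queue every request, but only the fixed flow ever hands the driver a non-NULL defer list.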
  6. 08 Jun, 2016 3 commits
  7. 03 Jun, 2016 1 commit
  8. 26 May, 2016 1 commit
  9. 16 May, 2016 1 commit
  10. 03 May, 2016 1 commit
  11. 20 Mar, 2016 1 commit
  12. 16 Mar, 2016 1 commit
  13. 04 Mar, 2016 1 commit
  14. 15 Feb, 2016 1 commit
    • blk-mq: mark request queue as mq asap · 66841672
      Committed by Ming Lei
      Currently q->mq_ops is used widely to decide if the queue
      is mq or not, so we should set the 'flag' asap so that both
      block core and drivers can get the correct mq info.
      
      For example, commit 868f2f0b ("blk-mq: dynamic h/w context count")
      moves the hctx's initialization before setting q->mq_ops in
      blk_mq_init_allocated_queue(), which causes blk_alloc_flush_queue()
      to think the queue is non-mq and to skip allocating the command size
      for the per-hctx flush rq.
      
      This patch should fix the problem reported by Sasha.
      
      Cc: Keith Busch <keith.busch@intel.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Fixes: 868f2f0b ("blk-mq: dynamic h/w context count")
      Signed-off-by: Jens Axboe <axboe@fb.com>
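The ordering problem in this commit can be illustrated with a toy model in which a sizing step consults `mq_ops` to decide whether driver command space is needed. The structure names, `SKETCH_RQ_SIZE`, and the `mark_first` switch are assumptions for the sketch and do not mirror the kernel's actual structures.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_ops { int unused; };

struct sketch_queue {
	const struct sketch_ops *mq_ops;	/* non-NULL marks the queue as mq */
	size_t flush_rq_size;
};

#define SKETCH_RQ_SIZE 64u	/* hypothetical base request size */

/* Sizes the flush request; driver command space is added only for mq queues. */
static void sketch_alloc_flush(struct sketch_queue *q, size_t cmd_size)
{
	q->flush_rq_size = SKETCH_RQ_SIZE + (q->mq_ops ? cmd_size : 0);
}

/* A non-zero mark_first selects the post-fix ordering. */
static void sketch_init_queue(struct sketch_queue *q,
			      const struct sketch_ops *ops,
			      size_t cmd_size, int mark_first)
{
	assert(ops != NULL);
	q->mq_ops = NULL;
	if (mark_first)
		q->mq_ops = ops;		/* fixed: mark as mq before hctx setup */
	sketch_alloc_flush(q, cmd_size);	/* runs as part of hctx initialization */
	q->mq_ops = ops;			/* pre-fix code only set it here */
}
```

With the late assignment the flush request is sized without the driver's command area; marking the queue first gives the expected size.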
  15. 12 Feb, 2016 1 commit
  16. 10 Feb, 2016 1 commit
    • blk-mq: dynamic h/w context count · 868f2f0b
      Committed by Keith Busch
      The hardware's provided queue count may change at runtime with resource
      provisioning. This patch allows a block driver to alter the number of
      h/w queues available when its resource count changes.
      
      The main part is a new blk-mq API to request a new number of h/w queues
      for a given live tag set. The new API freezes all queues using that set,
      then adjusts the allocated count prior to remapping these to CPUs.
      
      The bulk of the rest just shifts where h/w contexts and all their
      artifacts are allocated and freed.
      
      The number of max h/w contexts is capped to the number of possible cpus
      since there is no use for more than that. As such, all pre-allocated
      memory for pointers needs to account for the max possible rather than
      the initial number of queues.
      
      A side effect of this is that blk-mq will proceed successfully as
      long as it can allocate at least one h/w context. Previously it would
      fail request queue initialization if fewer than the requested number
      were allocated.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
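Two behaviours from this commit message lend themselves to a short sketch: capping the queue count at the number of possible CPUs, and treating a partial allocation as success as long as at least one context was obtained. The helper names and the `fail_at` knob are hypothetical and exist only to make the example testable.

```c
#include <assert.h>
#include <stdlib.h>

/* Cap the requested hardware queue count at the number of possible CPUs. */
static unsigned int sketch_cap_queues(unsigned int requested,
				      unsigned int possible_cpus)
{
	return requested < possible_cpus ? requested : possible_cpus;
}

/*
 * Try to allocate up to `wanted` contexts, stopping at the first failure.
 * fail_at is a hypothetical knob that simulates running out of memory at
 * that index (pass a value >= wanted for no simulated failure).
 * Returns the number of contexts obtained; 0 means initialization fails.
 */
static unsigned int sketch_alloc_hctxs(void **slots, unsigned int wanted,
				       unsigned int fail_at)
{
	unsigned int n;

	assert(slots != NULL);
	for (n = 0; n < wanted; n++) {
		slots[n] = (n == fail_at) ? NULL : malloc(16);
		if (!slots[n])
			break;
	}
	return n;
}

/* Release the first n contexts. */
static void sketch_free_hctxs(void **slots, unsigned int n)
{
	while (n--)
		free(slots[n]);
}
```

The `slots` array is meant to be sized for the maximum possible count rather than the initial one, matching the note above about pre-allocated pointer memory.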
  17. 23 Dec, 2015 2 commits
    • block: remove REQ_NO_TIMEOUT flag · bbc758ec
      Committed by Christoph Hellwig
      This was added for the 'magic' AEN requests in the NVMe driver that never
      return.  We now handle them purely inside the driver and don't need this
      core hack any more.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: defer timeouts to a workqueue · 287922eb
      Committed by Christoph Hellwig
      Timer context is not very useful for drivers to perform any meaningful abort
      action from.  So instead of calling the driver from this useless context
      defer it to a workqueue as soon as possible.
      
      Note that while a delayed_work item would seem the right thing here I didn't
      dare to use it due to the magic in blk_add_timer that pokes deep into timer
      internals.  But maybe this encourages Tejun to add a sensible API for that to
      the workqueue API and we'll all be fine in the end :)
      
      Contains a major update from Keith Busch:
      
      "This patch removes synchronizing the timeout work so that the timer can
       start a freeze on its own queue. The timer enters the queue, so timer
       context can only start a freeze, but not wait for frozen."
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  18. 04 Dec, 2015 2 commits
  19. 02 Dec, 2015 1 commit
  20. 21 Nov, 2015 1 commit
    • blk-mq: fix calling unplug callbacks with preempt disabled · b094f89c
      Committed by Jens Axboe
      Liu reported that running certain parts of xfstests threw the
      following error:
      
      BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
      in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
      3 locks held by kworker/u16:0/6:
       #0:  ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
       #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
       #2:  (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60
      CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G           OE   4.3.0+ #3
      Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
      Workqueue: writeback wb_workfn (flush-btrfs-108)
       ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
       0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
       ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
      Call Trace:
       [<ffffffff8130191b>] dump_stack+0x4f/0x74
       [<ffffffff8108ed95>] ___might_sleep+0x185/0x240
       [<ffffffff8108eea2>] __might_sleep+0x52/0x90
       [<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410
       [<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90
       [<ffffffff8109a6d1>] ? local_clock+0x21/0x40
       [<ffffffff810b9eb0>] ? __lock_release+0x420/0x510
       [<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0
       [<ffffffff811ca265>] alloc_pages_current+0xc5/0x210
       [<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs]
       [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
       [<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60
       [<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs]
       [<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs]
       [<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs]
       [<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs]
       [<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0
       [<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740
       [<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0
       [<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310
       [<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310
       [<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0
       [<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0
       [<ffffffff812d0ca0>] submit_bio+0x70/0x140
       [<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs]
       [<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs]
       [<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs]
       [<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs]
       [<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs]
       [<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
       [<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs]
       [<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs]
       [<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs]
       [<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
       [<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs]
       [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
       [<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
       [<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs]
       [<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
       [<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs]
       [<ffffffff81184df3>] do_writepages+0x23/0x40
       [<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0
       [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
       [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
       [<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480
       [<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480
       [<ffffffff810b1397>] ? down_read_trylock+0x57/0x60
       [<ffffffff811e6805>] ? trylock_super+0x25/0x60
       [<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90
       [<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0
       [<ffffffff812130b5>] wb_writeback+0x2b5/0x500
       [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
       [<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0
       [<ffffffff81213362>] ? wb_do_writeback+0x62/0x310
       [<ffffffff812133c1>] wb_do_writeback+0xc1/0x310
       [<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90
       [<ffffffff81213842>] wb_workfn+0x92/0x330
       [<ffffffff8107f133>] process_one_work+0x223/0x730
       [<ffffffff8107f083>] ? process_one_work+0x173/0x730
       [<ffffffff8108035f>] ? worker_thread+0x18f/0x430
       [<ffffffff810802ed>] worker_thread+0x11d/0x430
       [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
       [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
       [<ffffffff810858df>] kthread+0xef/0x110
       [<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0
       [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
       [<ffffffff816673bf>] ret_from_fork+0x3f/0x70
       [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
      
      The issue is that we've got the software context pinned while
      calling blk_flush_plug_list(), which flushes callbacks that
      are allowed to sleep. btrfs and raid have such callbacks.
      
      Flip the checks around a bit, so we can enable preempt a bit
      earlier and flush plugs without having preempt disabled.
      
      This only affects blk-mq driven devices, and only those that
      register a single queue.
      Reported-by: Liu Bo <bo.li.liu@oracle.com>
      Tested-by: Liu Bo <bo.li.liu@oracle.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
  21. 12 Nov, 2015 1 commit
  22. 08 Nov, 2015 2 commits
  23. 07 Nov, 2015 2 commits
    • mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Committed by Mel Gorman
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      it prevents it.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd · d0164adc
      Committed by Mel Gorman
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
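The flag relationships described in the two GFP commits above can be summarised in a small bitmask model. The bit positions and the `SK_*` and `sketch_*` names are invented for the example, and the composite masks are an assumption about how the pieces combine; only the relationships between the flags are taken from the commit text.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int sketch_gfp_t;

/* Bit positions are hypothetical; only the relationships matter here. */
#define SK_GFP_HIGH		(1u << 0)	/* high priority reserve */
#define SK_GFP_ATOMIC		(1u << 1)	/* cannot sleep, no fallback */
#define SK_GFP_DIRECT_RECLAIM	(1u << 2)	/* caller may sleep in direct reclaim */
#define SK_GFP_KSWAPD_RECLAIM	(1u << 3)	/* caller wants kswapd woken */
#define SK_GFP_IO		(1u << 4)
#define SK_GFP_FS		(1u << 5)

/* The renamed __GFP_WAIT: both kinds of reclaim are allowed. */
#define SK_GFP_RECLAIM	(SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)

/* Assumed composite masks in the spirit of GFP_ATOMIC and GFP_KERNEL. */
#define SK_ATOMIC	(SK_GFP_HIGH | SK_GFP_ATOMIC | SK_GFP_KSWAPD_RECLAIM)
#define SK_KERNEL	(SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS)

static_assert((SK_ATOMIC & SK_GFP_DIRECT_RECLAIM) == 0,
	      "atomic callers must never enter direct reclaim");

/* Modelled on the gfpflags_allow_blocking() helper named in the commit. */
static bool sketch_allow_blocking(sketch_gfp_t flags)
{
	return (flags & SK_GFP_DIRECT_RECLAIM) != 0;
}

/* A mempool-backed caller: never sleep in reclaim, but still wake kswapd. */
static sketch_gfp_t sketch_mempool_flags(sketch_gfp_t flags)
{
	return flags & ~SK_GFP_DIRECT_RECLAIM;
}
```

Clearing only the direct-reclaim bit, as `sketch_mempool_flags()` does, yields a non-blocking allocation that still wakes kswapd and does not gain access to the atomic reserve, which is the bio-allocation case described above.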
  24. 03 Nov, 2015 1 commit
    • blk-mq: avoid excessive boot delays with large lun counts · 2404e607
      Committed by Jeff Moyer
      Hi,
      
      Zhangqing Luo reported long boot times on a system with thousands of
      LUNs when scsi-mq was enabled.  He narrowed the problem down to
      blk_mq_add_queue_tag_set, where every queue is frozen in order to set
      the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
      added before it in sequence, which involves waiting for an RCU grace
      period for each one.  We don't need to do this.  After the second queue
      is added, only new queues need to be initialized with the shared tag.
      We can do that by percolating the flag up to the blk_mq_tag_set, and
      updating the newly added queue's hctxs if the flag is set.
      
      This problem was introduced by commit 0d2602ca (blk-mq: improve
      support for shared tags maps).
      Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
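The approach described in this commit, keeping the shared flag on the tag set so that only the one-to-two transition touches an existing queue, can be sketched as follows. The structures and the `freezes` counter are illustrative assumptions; the counter stands in for the expensive per-queue freeze and its RCU grace period.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_QUEUES 64

struct sketch_queue {
	bool shared;	/* stands in for BLK_MQ_F_TAG_SHARED on the queue's hctxs */
};

struct sketch_tag_set {
	bool shared;	/* the flag percolated up to the tag set */
	int nr_queues;
	struct sketch_queue *queues[SKETCH_MAX_QUEUES];
	long freezes;	/* how many queue freezes were needed so far */
};

/* Updating an existing, live queue requires freezing it first. */
static void sketch_update_live_queue(struct sketch_tag_set *set,
				     struct sketch_queue *q)
{
	set->freezes++;
	q->shared = true;
}

static void sketch_add_queue(struct sketch_tag_set *set, struct sketch_queue *q)
{
	assert(set->nr_queues < SKETCH_MAX_QUEUES);

	/* Transition from one queue to two: update the existing queue once. */
	if (set->nr_queues > 0 && !set->shared) {
		set->shared = true;
		for (int i = 0; i < set->nr_queues; i++)
			sketch_update_live_queue(set, set->queues[i]);
	}
	/* A queue that is not live yet can take the flag without a freeze. */
	q->shared = set->shared;
	set->queues[set->nr_queues++] = q;
}
```

In this model, adding n queues costs a single freeze in total, whereas freezing every previously added queue on each addition would cost n(n-1)/2 freezes.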
  25. 22 Oct, 2015 5 commits
  26. 15 Oct, 2015 1 commit
  27. 01 Oct, 2015 2 commits
  28. 30 Sep, 2015 2 commits
    • blk-mq: fix deadlock when reading cpu_list · 60de074b
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
      all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
      entries by blk_mq_sysfs_unregister().  Removing a sysfs entry blocks
      until the active reference count of the kernfs_node drops to zero.
      
      On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
      /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
      blk_mq_hw_sysfs_cpus_show().
      
      If these happen at the same time, a deadlock can occur: one side
      waits for the active reference to drop to zero while holding
      all_q_mutex, and the other tries to acquire all_q_mutex while holding
      the active reference.
      
      The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
      is to avoid reading an incomplete hctx->cpumask.  Since reading a sysfs
      entry for blk-mq needs to acquire q->sysfs_lock, we can avoid both the
      deadlock and reading an incomplete hctx->cpumask by holding
      q->sysfs_lock while hctx->cpumask is being updated.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: avoid inserting requests before establishing new mapping · 5778322e
      Committed by Akinobu Mita
      Notifier callbacks for the CPU_ONLINE action can run on a CPU other
      than the one that was just onlined.  So it is possible for a
      process running on the just-onlined CPU to insert a request and run
      the hw queue before the new mapping is established by
      blk_mq_queue_reinit_notify().
      
      This can cause a problem when the CPU has just been onlined for the
      first time since the request queue was initialized.  At this time
      ctx->index_hw for the CPU, which is the index in hctx->ctxs[] for this
      ctx, is still zero before blk_mq_queue_reinit_notify() is called by
      notifier callbacks for the CPU_ONLINE action.
      
      For example, there is a single hw queue (hctx) and two CPU queues
      (ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined and
      a request is inserted into ctx1->rq_list, setting bit0 in the pending
      bitmap because ctx1->index_hw is still zero.
      
      Then, while running the hw queue, flush_busy_ctxs() finds bit0 set
      in the pending bitmap and tries to retrieve requests from
      hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0, so the
      request in ctx1->rq_list is ignored.
      
      Fix it by ensuring that the new mapping is established before the
      onlined cpu starts running.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
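The ctx0/ctx1 example in this commit message can be replayed with a toy model of the pending bitmap. The structures below are simplified assumptions (a counter replaces `rq_list`, an `unsigned int` replaces the bitmap); only the indexing logic follows the description above.

```c
#include <assert.h>

#define SKETCH_MAX_CTX 8

struct sketch_ctx {
	unsigned int index_hw;	/* this ctx's slot in hctx->ctxs[] */
	int nr_pending;		/* stands in for ctx->rq_list */
};

struct sketch_hctx {
	struct sketch_ctx *ctxs[SKETCH_MAX_CTX];
	unsigned int nr_ctx;
	unsigned int pending;	/* one bit per mapped ctx */
};

/* Establish the mapping for a ctx (the job of the reinit notifier). */
static void sketch_map_ctx(struct sketch_hctx *hctx, struct sketch_ctx *ctx)
{
	assert(hctx->nr_ctx < SKETCH_MAX_CTX);
	ctx->index_hw = hctx->nr_ctx;
	hctx->ctxs[hctx->nr_ctx++] = ctx;
}

/* Queue one request on a ctx and mark it pending in the hw queue. */
static void sketch_insert(struct sketch_hctx *hctx, struct sketch_ctx *ctx)
{
	ctx->nr_pending++;
	hctx->pending |= 1u << ctx->index_hw;
}

/* Collect requests from every ctx whose pending bit is set. */
static int sketch_flush_busy_ctxs(struct sketch_hctx *hctx)
{
	int found = 0;

	for (unsigned int i = 0; i < hctx->nr_ctx; i++) {
		if (!(hctx->pending & (1u << i)))
			continue;
		found += hctx->ctxs[i]->nr_pending;
		hctx->ctxs[i]->nr_pending = 0;
		hctx->pending &= ~(1u << i);
	}
	return found;
}
```

Inserting on an unmapped ctx1 sets bit0, so the flush looks at ctx0 and finds nothing, leaving the request stranded. Once ctx1 is mapped before any insert, its own bit is set and the flush picks up its requests.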