1. 21 Nov 2015, 1 commit
    • blk-mq: fix calling unplug callbacks with preempt disabled · b094f89c
      Committed by Jens Axboe
      Liu reported that running certain parts of xfstests threw the
      following error:
      
      BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
      in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
      3 locks held by kworker/u16:0/6:
       #0:  ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
       #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
       #2:  (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60
      CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G           OE   4.3.0+ #3
      Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
      Workqueue: writeback wb_workfn (flush-btrfs-108)
       ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
       0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
       ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
      Call Trace:
       [<ffffffff8130191b>] dump_stack+0x4f/0x74
       [<ffffffff8108ed95>] ___might_sleep+0x185/0x240
       [<ffffffff8108eea2>] __might_sleep+0x52/0x90
       [<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410
       [<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90
       [<ffffffff8109a6d1>] ? local_clock+0x21/0x40
       [<ffffffff810b9eb0>] ? __lock_release+0x420/0x510
       [<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0
       [<ffffffff811ca265>] alloc_pages_current+0xc5/0x210
       [<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs]
       [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
       [<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60
       [<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs]
       [<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs]
       [<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs]
       [<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs]
       [<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0
       [<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740
       [<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0
       [<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310
       [<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310
       [<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0
       [<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0
       [<ffffffff812d0ca0>] submit_bio+0x70/0x140
       [<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs]
       [<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs]
       [<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs]
       [<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs]
       [<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs]
       [<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
       [<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs]
       [<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs]
       [<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs]
       [<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
       [<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs]
       [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
       [<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
       [<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs]
       [<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
       [<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs]
       [<ffffffff81184df3>] do_writepages+0x23/0x40
       [<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0
       [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
       [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
       [<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480
       [<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480
       [<ffffffff810b1397>] ? down_read_trylock+0x57/0x60
       [<ffffffff811e6805>] ? trylock_super+0x25/0x60
       [<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90
       [<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0
       [<ffffffff812130b5>] wb_writeback+0x2b5/0x500
       [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
       [<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0
       [<ffffffff81213362>] ? wb_do_writeback+0x62/0x310
       [<ffffffff812133c1>] wb_do_writeback+0xc1/0x310
       [<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90
       [<ffffffff81213842>] wb_workfn+0x92/0x330
       [<ffffffff8107f133>] process_one_work+0x223/0x730
       [<ffffffff8107f083>] ? process_one_work+0x173/0x730
       [<ffffffff8108035f>] ? worker_thread+0x18f/0x430
       [<ffffffff810802ed>] worker_thread+0x11d/0x430
       [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
       [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
       [<ffffffff810858df>] kthread+0xef/0x110
       [<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0
       [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
       [<ffffffff816673bf>] ret_from_fork+0x3f/0x70
       [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
      
      The issue is that we have the software queue context pinned (with
      preemption disabled) while calling blk_flush_plug_list(), which flushes
      callbacks that are allowed to sleep; btrfs and raid have such callbacks.
      
      Flip the checks around a bit, so we can re-enable preemption a bit
      earlier and flush plugs without having preemption disabled.
      
      This only affects blk-mq driven devices, and only those that
      register a single queue.
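      
      A rough sketch of the resulting order in the plug path of
      blk_sq_make_request() (illustrative only, not the literal patch; the
      names below are from the 4.3-era blk-mq code):
      
          if (plug) {
              blk_mq_bio_to_request(rq, bio);
              if (!request_count)
                  trace_block_plug(q);
      
              /* Drop the software queue context (re-enabling preemption)
               * before touching the plug list: its callbacks (btrfs raid56,
               * MD, ...) are allowed to sleep. */
              blk_mq_put_ctx(data.ctx);
      
              if (request_count >= BLK_MAX_REQUEST_COUNT)
                  blk_flush_plug_list(plug, false);
      
              list_add_tail(&rq->queuelist, &plug->mq_list);
          }
      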
      Reported-by: Liu Bo <bo.li.liu@oracle.com>
      Tested-by: Liu Bo <bo.li.liu@oracle.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b094f89c
  2. 12 Nov 2015, 1 commit
  3. 08 Nov 2015, 2 commits
  4. 07 Nov 2015, 2 commits
    • mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Committed by Mel Gorman
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      it prevents it.
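      
      For reference, a sketch of the resulting flag relationship (the exact
      composition is my reading of the 4.4-era <linux/gfp.h> and should be
      treated as an assumption):
      
          /* __GFP_RECLAIM allows both kinds of reclaim ... */
          #define __GFP_RECLAIM  (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
      
          /* ... so a caller that must not sleep, but still wants kswapd
           * woken, clears only the direct-reclaim half: */
          gfp_t gfp = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM;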
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71baba4b
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Committed by Mel Gorman
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be
      referred to as the "atomic reserve".  __GFP_HIGH users get access to the
      first lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
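      
      A small sketch of the recommended check (gfpflags_allow_blocking() is the
      helper named above; its body here is my reading of the 4.4-era
      <linux/gfp.h>):
      
          static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
          {
                  return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
          }
      
          /* e.g. in a driver, instead of testing __GFP_WAIT directly: */
          if (!gfpflags_allow_blocking(gfp))
                  return -EAGAIN;  /* caller cannot sleep: take the non-blocking path */
      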
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  5. 03 Nov 2015, 1 commit
    • blk-mq: avoid excessive boot delays with large lun counts · 2404e607
      Committed by Jeff Moyer
      Hi,
      
      Zhangqing Luo reported long boot times on a system with thousands of
      LUNs when scsi-mq was enabled.  He narrowed the problem down to
      blk_mq_add_queue_tag_set, where every queue is frozen in order to set
      the BLK_MQ_F_TAG_SHARED flag.  Each added device will freeze all queues
      added before it in sequence, which involves waiting for an RCU grace
      period for each one.  We don't need to do this.  After the second queue
      is added, only new queues need to be initialized with the shared tag.
      We can do that by percolating the flag up to the blk_mq_tag_set, and
      updating the newly added queue's hctxs if the flag is set.
      
      This problem was introduced by commit 0d2602ca (blk-mq: improve
      support for shared tags maps).
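      
      A sketch of the approach described above (not the literal patch; the
      helpers queue_set_hctx_shared() and blk_mq_update_tag_set_depth() are
      assumed names here):
      
          static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
                                               struct request_queue *q)
          {
                  mutex_lock(&set->tag_list_lock);
      
                  /* Going from one queue to two: mark the set shared and update
                   * the already-registered queues once, not once per new device. */
                  if (!list_empty(&set->tag_list) &&
                      !(set->flags & BLK_MQ_F_TAG_SHARED)) {
                          set->flags |= BLK_MQ_F_TAG_SHARED;
                          blk_mq_update_tag_set_depth(set);
                  }
      
                  /* Later queues simply inherit the flag from the set. */
                  if (set->flags & BLK_MQ_F_TAG_SHARED)
                          queue_set_hctx_shared(q, true);
      
                  list_add_tail(&q->tag_set_list, &set->tag_list);
                  mutex_unlock(&set->tag_list_lock);
          }
      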
      Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2404e607
  6. 22 Oct 2015, 5 commits
  7. 15 Oct 2015, 1 commit
  8. 01 Oct 2015, 2 commits
  9. 30 Sep 2015, 6 commits
    • blk-mq: fix deadlock when reading cpu_list · 60de074b
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires
      all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs
      entries via blk_mq_sysfs_unregister().  Removing a sysfs entry blocks
      until the active reference of the kernfs_node drops to zero.
      
      On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g.
      /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in
      blk_mq_hw_sysfs_cpus_show().
      
      If these happen at the same time, a deadlock can occur, because one side
      waits for the active reference to drop to zero while holding all_q_mutex,
      and the other tries to acquire all_q_mutex while holding the active
      reference.
      
      The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show()
      is to avoid reading an incomplete hctx->cpumask.  Since reading a sysfs
      entry for blk-mq needs to acquire q->sysfs_lock, we can avoid the deadlock
      and reading an incomplete hctx->cpumask by holding q->sysfs_lock
      while hctx->cpumask is being updated.
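      
      A minimal sketch of that idea (illustrative, not the literal patch): the
      remap path takes q->sysfs_lock around the cpumask update, so the cpu_list
      show path can rely on q->sysfs_lock alone instead of all_q_mutex:
      
          mutex_lock(&q->sysfs_lock);
          queue_for_each_hw_ctx(q, hctx, i)
                  cpumask_clear(hctx->cpumask);
          /* ... re-map ctxs to hctxs, setting hctx->cpumask bits ... */
          mutex_unlock(&q->sysfs_lock);
      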
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      60de074b
    • blk-mq: avoid inserting requests before establishing new mapping · 5778322e
      Committed by Akinobu Mita
      Notifier callbacks for the CPU_ONLINE action can run on a CPU other than
      the one that was just onlined.  So it is possible for a process running
      on the just-onlined CPU to insert a request and run the hw queue before
      the new mapping is established by blk_mq_queue_reinit_notify().
      
      This can cause a problem when the CPU has just been onlined for the first
      time since the request queue was initialized.  At this point ctx->index_hw
      for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
      zero before blk_mq_queue_reinit_notify() is called by the notifier
      callbacks for the CPU_ONLINE action.
      
      For example, suppose there is a single hw queue (hctx) and two CPU queues
      (ctx0 for CPU0, and ctx1 for CPU1).  Now CPU1 is just onlined, a request
      is inserted into ctx1->rq_list, and bit0 is set in the pending bitmap, as
      ctx1->index_hw is still zero.
      
      And then while running the hw queue, flush_busy_ctxs() finds bit0 set
      in the pending bitmap and tries to retrieve requests from
      hctx->ctxs[0]->rq_list.  But hctx->ctxs[0] is a pointer to ctx0, so the
      request in ctx1->rq_list is ignored.
      
      Fix it by ensuring that the new mapping is established before the onlined
      CPU starts running.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5778322e
    • blk-mq: fix q->mq_usage_counter access race · 0e626368
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses
      q->mq_usage_counter while freezing all request queues in all_q_list.
      On the other hand, q->mq_usage_counter is deinitialized in
      blk_mq_free_queue() before deleting the queue from all_q_list.
      
      So if a CPU hotplug event occurs in that window, percpu_ref_kill() is
      called on a q->mq_usage_counter that has already been marked dead, which
      triggers a warning.  Fix it by deleting the queue from all_q_list before
      destroying q->mq_usage_counter.
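      
      A sketch of the resulting teardown order (illustrative; based on the
      4.2-era blk_mq_free_queue() internals):
      
          void blk_mq_free_queue(struct request_queue *q)
          {
                  /* Make the queue invisible to CPU hotplug reinit first ... */
                  mutex_lock(&all_q_mutex);
                  list_del_init(&q->all_q_node);
                  mutex_unlock(&all_q_mutex);
      
                  /* ... and only then tear down what that path would touch. */
                  percpu_ref_exit(&q->mq_usage_counter);
                  /* ... remaining hctx/ctx teardown ... */
          }
      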
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0e626368
    • blk-mq: Fix use-after-free of q->mq_map · a723bab3
      Committed by Akinobu Mita
      CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates
      q->mq_map via blk_mq_update_queue_map() for all request queues in
      all_q_list.  On the other hand, q->mq_map is released before the queue is
      deleted from all_q_list.
      
      So if a CPU hotplug event occurs in that window, an invalid memory access
      can happen.  Fix it by releasing q->mq_map in blk_mq_release(), so that it
      happens later than the removal from all_q_list.
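      
      A sketch of where the map is freed now (illustrative; blk_mq_release()
      runs from the queue's final release, well after the all_q_list removal):
      
          void blk_mq_release(struct request_queue *q)
          {
                  /* ... free the hctx structures ... */
      
                  kfree(q->mq_map);       /* moved here from blk_mq_free_queue() */
                  q->mq_map = NULL;
      
                  kfree(q->queue_hw_ctx);
                  free_percpu(q->queue_ctx);
          }
      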
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Suggested-by: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a723bab3
    • blk-mq: fix sysfs registration/unregistration race · 4593fdbe
      Committed by Akinobu Mita
      There is a race between cpu hotplug handling and adding/deleting
      gendisk for blk-mq, where both are trying to register and unregister
      the same sysfs entries.
      
      null_add_dev
          --> blk_mq_init_queue
              --> blk_mq_init_allocated_queue
                  --> add to 'all_q_list' (*)
          --> add_disk
              --> blk_register_queue
                  --> blk_mq_register_disk (++)
      
      null_del_dev
          --> del_gendisk
              --> blk_unregister_queue
                  --> blk_mq_unregister_disk (--)
          --> blk_cleanup_queue
              --> blk_mq_free_queue
                  --> del from 'all_q_list' (*)
      
      blk_mq_queue_reinit
          --> blk_mq_sysfs_unregister (-)
          --> blk_mq_sysfs_register (+)
      
      While the request queue is on 'all_q_list' (*), blk_mq_queue_reinit() can
      be called for the queue at any time by the CPU hotplug callback.  But
      blk_mq_sysfs_unregister (-) and blk_mq_sysfs_register (+) in
      blk_mq_queue_reinit must not be called before blk_mq_register_disk (++)
      or after blk_mq_unregister_disk (--) has finished, because
      '/sys/block/*/mq/' does not exist then.
      
      There is already a BLK_MQ_F_SYSFS_UP flag in hctx->flags which can
      be used to track this sysfs state, but it only fixes this issue
      partially.
      
      In order to fix it completely, we need a per-queue flag instead of a
      per-hctx flag, with appropriate locking.  So this introduces
      q->mq_sysfs_init_done, which is properly protected by all_q_mutex.
      
      Also, we need to ensure that blk_mq_map_swqueue() is called with
      all_q_mutex held.  Since hctx->nr_ctx is reset temporarily and
      updated in blk_mq_map_swqueue(), we should avoid
      blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value
      during CPU hotplug handling or while adding/deleting a gendisk.
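      
      A sketch of how the per-queue flag guards the hotplug-time sysfs work
      (illustrative only; blk_mq_register_disk()/blk_mq_unregister_disk() would
      set and clear the flag under all_q_mutex):
      
          void blk_mq_sysfs_unregister(struct request_queue *q)
          {
                  struct blk_mq_hw_ctx *hctx;
                  int i;
      
                  /* '/sys/block/*/mq/' not created yet, or already gone */
                  if (!q->mq_sysfs_init_done)
                          return;
      
                  queue_for_each_hw_ctx(q, hctx, i)
                          blk_mq_unregister_hctx(hctx);
          }
      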
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reviewed-by: Ming Lei <tom.leiming@gmail.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4593fdbe
    • blk-mq: avoid setting hctx->tags->cpumask before allocation · 1356aae0
      Committed by Akinobu Mita
      When an unmapped hw queue is remapped after the CPU topology is changed,
      hctx->tags->cpumask has to be set after hctx->tags is set up in
      blk_mq_map_swqueue(); otherwise it causes a NULL pointer dereference.
      
      Fixes: f26cdc85 ("blk-mq: Shared tag enhancements")
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1356aae0
  10. 24 Sep 2015, 1 commit
  11. 15 Aug 2015, 1 commit
    • blk-mq: fix race between timeout and freeing request · 0048b483
      Committed by Ming Lei
      Inside the timeout handler, blk_mq_tag_to_rq() is called
      to retrieve the request from a tag. This is obviously
      wrong because the request can be freed at any time and some
      fields of the request can't be trusted, so a kernel oops
      might be triggered [1].
      
      Currently, wrt. blk_mq_tag_to_rq(), the only special case is
      that the flush request can share the same tag as the request it was
      cloned from, and the two requests can't be active at the same
      time, so this patch fixes the above issue by updating tags->rqs[tag]
      with the active request (either the flush rq or the request it was
      cloned from) for the tag.
      
      Also blk_mq_tag_to_rq() gets much simplified with this patch.
      
      Given that blk_mq_tag_to_rq() is mainly for drivers, and the caller must
      make sure the request can't be freed, this helper is replaced with
      tags->rqs[tag] in bt_for_each().
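      
      The resulting helper is then essentially just the array lookup (sketch):
      
          struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
          {
                  /* tags->rqs[tag] always points at whichever request (flush rq
                   * or the request it was cloned from) currently owns the tag. */
                  return tags->rqs[tag];
          }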
      
      [1] kernel oops log
      [  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158
      [  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0
      [  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [  439.700653] Dumping ftrace buffer:
      [  439.700653]    (ftrace buffer empty)
      [  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
      [  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265
      [  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000
      [  439.730500] RIP: 0010:[<ffffffff812d89ba>]  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283
      [  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002
      [  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
      [  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000
      [  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202
      [  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00
      [  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000
      [  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0
      [  439.730500] Stack:
      [  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104
      [  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80
      [  439.755663] Call Trace:
      [  439.755663]  <IRQ>
      [  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a
      [  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38
      [  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4
      [  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284
      [  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38
      [  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8
      [  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4
      [  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94
      [  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e
      [  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90
      [  439.755663]  <EOI>
      [  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a
      [  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163
      [  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163
      [  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd
      [  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a
      [  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a
      [  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47
      [  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f
      [  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74
      [  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6
      [  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f
      [  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
      f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
      53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
      [  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e
      [  439.790911]  RSP <ffff880819203da0>
      [  439.790911] CR2: 0000000000000158
      [  439.790911] ---[ end trace d40af58949325661 ]---
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0048b483
  12. 14 Aug 2015, 1 commit
    • block: make generic_make_request handle arbitrarily sized bios · 54efd50b
      Committed by Kent Overstreet
      The way the block layer is currently written, it goes to great lengths
      to avoid having to split bios; upper layer code (such as bio_add_page())
      checks what the underlying device can handle and tries to always create
      bios that don't need to be split.
      
      But this approach becomes unwieldy and eventually breaks down with
      stacked devices and devices with dynamic limits, and it adds a lot of
      complexity. If the block layer could split bios as needed, we could
      eliminate a lot of complexity elsewhere - particularly in stacked
      drivers. Code that creates bios can then create whatever size bios are
      convenient, and more importantly stacked drivers don't have to deal with
      both their own bio size limitations and the limitations of the
      (potentially multiple) devices underneath them.  In the future this will
      let us delete merge_bvec_fn and a bunch of other code.
      
      We do this by adding calls to blk_queue_split() to the various
      make_request functions that need it - a few can already handle arbitrary
      size bios. Note that we add the call _after_ any call to
      blk_queue_bounce(); this means that blk_queue_split() and
      blk_recalc_rq_segments() don't need to be concerned with bouncing
      affecting segment merging.
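      
      The pattern added to each affected make_request function looks roughly
      like this (a sketch; "example_make_request" is a made-up name and the
      three-argument blk_queue_split() signature is the one from this series):
      
          static void example_make_request(struct request_queue *q, struct bio *bio)
          {
                  blk_queue_bounce(q, &bio);               /* bounce first, as noted above */
                  blk_queue_split(q, &bio, q->bio_split);  /* then split to fit the queue limits */
      
                  /* ... driver-specific handling of the (possibly smaller) bio ... */
          }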
      
      Some make_request_fn() callbacks were simple enough to audit and verify
      they don't need blk_queue_split() calls. The skipped ones are:
      
       * nfhd_make_request (arch/m68k/emu/nfblock.c)
       * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
       * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
       * brd_make_request (ramdisk - drivers/block/brd.c)
       * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
       * loop_make_request
       * null_queue_bio
       * bcache's make_request fns
      
      Some others are almost certainly safe to remove now, but will be left
      for future patches.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: drbd-user@lists.linbit.com
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits)
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
      [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
      Signed-off-by: Dongsu Park <dpark@posteo.net>
      Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      54efd50b
  13. 29 Jul 2015, 1 commit
    • block: add a bi_error field to struct bio · 4246a0b6
      Committed by Christoph Hellwig
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up, and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
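      
      For drivers and stacking code, completing a bio then looks like this
      (sketch of the new convention; my_complete_bio is a made-up helper):
      
          static void my_complete_bio(struct bio *bio, int error)
          {
                  bio->bi_error = error;  /* 0 on success, negative errno on failure */
                  bio_endio(bio);         /* replaces bio_endio(bio, error) and the
                                           * BIO_UPTODATE flag games */
          }
      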
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4246a0b6
  14. 16 Jul 2015, 1 commit
  15. 10 Jun 2015, 1 commit
  16. 02 Jun 2015, 1 commit
    • blk-mq: Shared tag enhancements · f26cdc85
      Committed by Keith Busch
      Storage controllers may expose multiple block devices that share hardware
      resources managed by blk-mq. This patch enhances the shared tags so a
      low-level driver can access the shared resources not tied to the unshared
      h/w contexts. This way the LLD can dynamically add and delete disks and
      request queues without having to track all the request_queue hctx's to
      iterate outstanding tags.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f26cdc85
  17. 19 May 2015, 1 commit
  18. 09 May 2015, 4 commits
    • blk-mq: make plug work for multiple disks and queues · 5b3f341f
      Committed by Shaohua Li
      The last patch makes plugging work for the multiple-queue case. However,
      it only works for the single-disk case, because it assumes there is only
      one request in the plug list. If a task is accessing multiple disks, e.g.
      MD/DM, the assumption is wrong. Let blk_attempt_plug_merge() record the
      request from the same queue.
      
      V2: use a NULL parameter in the !mq case. Fix a bug. Add comments in
      blk_attempt_plug_merge to make it (hopefully) less confusing.
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5b3f341f
    • blk-mq: do limited block plug for multiple queue case · f984df1f
      Committed by Shaohua Li
      Plugging is still helpful for workloads with IO merging, but it can be
      harmful otherwise, especially with multiple hardware queues, as there is
      (supposedly) no lock contention in this case and plugging can introduce
      latency. For multiple queues, we do limited plugging, e.g. plug only if
      there is a request merge. If a request doesn't merge with the following
      request, the request will be dispatched immediately.
      
      V2: check blk_queue_nomerges() as suggested by Jeff.
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f984df1f
    • blk-mq: avoid re-initialize request which is failed in direct dispatch · 239ad215
      Committed by Shaohua Li
      If we directly issue a request and it fails, we use
      blk_mq_merge_queue_io(). But we already assigned the bio to a request in
      blk_mq_bio_to_request(), so blk_mq_merge_queue_io() shouldn't run
      blk_mq_bio_to_request() again.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      239ad215
    • blk-mq: fix plugging in blk_sq_make_request · e6c4438b
      Committed by Jeff Moyer
      The following appears in blk_sq_make_request:
      
      	/*
      	 * If we have multiple hardware queues, just go directly to
      	 * one of those for sync IO.
      	 */
      
      We clearly don't have multiple hardware queues, here!  This comment was
      introduced by commit 07068d5b (blk-mq: split make request
      handler for multi and single queue):
      
          We want slightly different behavior from them:
      
          - On single queue devices, we currently use the per-process plug
            for deferred IO and for merging.
      
          - On multi queue devices, we don't use the per-process plug, but
            we want to go straight to hardware for SYNC IO.
      
      The old code had this:
      
              use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
      
      and that was converted to:
      
      	use_plug = !is_flush_fua && !is_sync;
      
      which is not equivalent.  For the single queue case, that second half of
      the && expression is always true.  So, what I think was actually intended
      follows (and this more closely matches what is done in blk_queue_bio).
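      
      Worked through for the single queue path, where q->nr_hw_queues == 1:
      
          use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
                  /* == !is_flush_fua && (true || !is_sync) */
                  /* == !is_flush_fua                       */
      
      i.e. blk_sq_make_request should use the plug for everything except
      flush/FUA requests, whereas the broken conversion also bypassed the plug
      for all sync IO.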
      
      V2: delete the 'likely', which should not be a big deal
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e6c4438b
  19. 05 May 2015, 1 commit
    • blk-mq: don't lose requests if a stopped queue restarts · 9ba52e58
      Committed by Shaohua Li
      Normally, if the driver is too busy to dispatch a request, the logic is like below:
      block layer:					driver:
      	__blk_mq_run_hw_queue
      a.						blk_mq_stop_hw_queue
      b.	rq add to ctx->dispatch
      
      later:
      1.						blk_mq_start_hw_queue
      2.	__blk_mq_run_hw_queue
      
      But it's possible that steps 1-2 run between a and b. And since the rq
      isn't in ctx->dispatch yet, step 2 will not run the rq. The rq might get
      lost if no subsequent requests kick in.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9ba52e58
  20. 24 Apr 2015, 2 commits
  21. 17 Apr 2015, 1 commit
  22. 16 Apr 2015, 1 commit
    • blk-mq: reduce unnecessary software queue looping · 889fa31f
      Committed by Chong Yuan
      In flush_busy_ctxs() and blk_mq_hctx_has_pending(), regardless of how many
      ctxs are assigned to one hctx, they will all loop hctx->ctx_map.map_size
      times. Here hctx->ctx_map.map_size is a constant, ALIGN(nr_cpu_ids, 8) / 8.
      Especially, flush_busy_ctxs() is in the hot code path, and this is
      unnecessary. Change ->map_size to contain the actually mapped software
      queues, so we only loop for as many iterations as we have to.
      
      And remove the cpumask setting and nr_ctx count in blk_mq_init_cpu_queues()
      since they are all redone in blk_mq_map_swqueue().
      Signed-off-by: Chong Yuan <chong.yuan@memblaze.com>
      Reviewed-by: Wenbo Wang <wenbo.wang@memblaze.com>
      
      Updated by me for formatting and commenting.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      889fa31f
  23. 12 Apr 2015, 1 commit
    • blk-mq: initialize 'struct request' and associated data to zero · ac211175
      Committed by Linus Torvalds
      Jan Engelhardt reports a strange oops with an invalid ->sense_buffer
      pointer in scsi_init_cmd_errh() with the blk-mq code.
      
      The sense_buffer pointer should have been initialized by the call to
      scsi_init_request() from blk_mq_init_rq_map(), but there seems to be
      some non-repeatable memory corruptor.
      
      This patch makes sure we initialize the whole struct request allocation
      (and the associated 'struct scsi_cmnd' for the SCSI case) to zero, by
      using __GFP_ZERO in the allocation.  The old code initialized a couple
      of individual fields, leaving the rest undefined (although many of them
      are then initialized in later phases, like blk_mq_rq_ctx_init() etc.).
      
      It's not entirely clear why this matters, but it's the right thing to do
      regardless, and with 4.0 imminent this is the defensive "let's just make
      sure everything is initialized properly" patch.
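      
      A sketch of the change (illustrative; the companion flags follow the
      existing allocation in blk_mq_init_rq_map() and are an assumption here):
      
          page = alloc_pages_node(set->numa_node,
                                  GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
                                  this_order);
      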
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac211175
  24. 30 Mar 2015, 1 commit