1. 09 7月, 2018 4 次提交
    • M
      blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set() · 97889f9a
      Ming Lei 提交于
      We have to remove synchronize_rcu() from blk_queue_cleanup(),
      otherwise long delay can be caused during lun probe. For removing
      it, we have to avoid to iterate the set->tag_list in IO path, eg,
      blk_mq_sched_restart().
      
      This patch reverts 5b79413946d (Revert "blk-mq: don't handle
      TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
      and there isn't any reason to restart all queues in one tags any more,
      see the following reasons:
      
      1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
      which can wake up queues waiting for driver tag.
      
      2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
      target or host is ready, but SCSI built-in restart can cover all these well,
      see scsi_end_request(), queue will be rerun after any request initiated from
      this host/target is completed.
      
      In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
      when running I/O on these 8 luns concurrently.
      
      Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: linux-scsi@vger.kernel.org
      Reported-by: NAndrew Jones <drjones@redhat.com>
      Tested-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      97889f9a
    • M
      blk-mq: introduce new lock for protecting hctx->dispatch_wait · 5815839b
      Ming Lei 提交于
      Now hctx->lock is only acquired when adding hctx->dispatch_wait to
      one wait queue, but not held when removing it from the wait queue.
      
      IO hang can be observed easily if SCHED RESTART is disabled, that means
      now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait().
      
      This patch fixes the issue by introducing hctx->dispatch_wait_lock and
      holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(),
      since we need to avoid acquiring hctx->lock in irq context.
      
      Fixes: eb619fdb ("blk-mq: fix issue with shared tag queue re-running")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Tested-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5815839b
    • M
      blk-mq: don't pass **hctx to blk_mq_mark_tag_wait() · 2278d69f
      Ming Lei 提交于
      'hctx' won't be changed at all, so not necessary to pass
      '**hctx' to blk_mq_mark_tag_wait().
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Tested-by: NAndrew Jones <drjones@redhat.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2278d69f
    • M
      blk-mq: cleanup blk_mq_get_driver_tag() · 8ab6bb9e
      Ming Lei 提交于
      We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
      we never change '**hctx' as well. The last use of these went away
      with the flush cleanup, commit 0c2a6fe4.
      
      So cleanup the usage and remove the two extra parameters.
      
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Tested-by: NAndrew Jones <drjones@redhat.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8ab6bb9e
  2. 29 6月, 2018 1 次提交
    • J
      blk-mq: don't queue more if we get a busy return · 1f57f8d4
      Jens Axboe 提交于
      Some devices have different queue limits depending on the type of IO. A
      classic case is SATA NCQ, where some commands can queue, but others
      cannot. If we have NCQ commands inflight and encounter a non-queueable
      command, the driver returns busy. Currently we attempt to dispatch more
      from the scheduler, if we were able to queue some commands. But for the
      case where we ended up stopping due to BUSY, we should not attempt to
      retrieve more from the scheduler. If we do, we can get into a situation
      where we attempt to queue a non-queueable command, get BUSY, then
      successfully retrieve more commands from that scheduler and queue those.
      This can repeat forever, starving the non-queuable command indefinitely.
      
      Fix this by NOT attempting to pull more commands from the scheduler, if
      we get a BUSY return. This should also be more optimal in terms of
      letting requests stay in the scheduler for as long as possible, if we
      get a BUSY due to the regular out-of-tags condition.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1f57f8d4
  3. 24 6月, 2018 1 次提交
  4. 14 6月, 2018 1 次提交
  5. 13 6月, 2018 1 次提交
    • K
      treewide: kzalloc_node() -> kcalloc_node() · 590b5b7d
      Kees Cook 提交于
      The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
      patch replaces cases of:
      
              kzalloc_node(a * b, gfp, node)
      
      with:
              kcalloc_node(a * b, gfp, node)
      
      as well as handling cases of:
      
              kzalloc_node(a * b * c, gfp, node)
      
      with:
      
              kzalloc_node(array3_size(a, b, c), gfp, node)
      
      as it's slightly less ugly than:
      
              kcalloc_node(array_size(a, b), c, gfp, node)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc_node(4 * 1024, gfp, node)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc_node(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc_node(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc_node(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc_node
      + kcalloc_node
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc_node(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc_node(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc_node(C1 * C2 * C3, ...)
      |
        kzalloc_node(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc_node(sizeof(THING) * C2, ...)
      |
        kzalloc_node(sizeof(TYPE) * C2, ...)
      |
        kzalloc_node(C1 * C2 * C3, ...)
      |
        kzalloc_node(C1 * C2, ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      590b5b7d
  6. 11 6月, 2018 1 次提交
    • R
      blk-mq: reinit q->tag_set_list entry only after grace period · a347c7ad
      Roman Pen 提交于
      It is not allowed to reinit q->tag_set_list list entry while RCU grace
      period has not completed yet, otherwise the following soft lockup in
      blk_mq_sched_restart() happens:
      
      [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
      [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
      [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
      [ 1064.256510] Call Trace:
      [ 1064.256664]  <IRQ>
      [ 1064.256824]  blk_mq_free_request+0xea/0x100
      [ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
      [ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
      [ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
      [ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
      [ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
      [ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
      [ 1064.258007]  irq_poll_softirq+0xb7/0xe0
      [ 1064.258165]  __do_softirq+0x106/0x2a2
      [ 1064.258328]  irq_exit+0x92/0xa0
      [ 1064.258509]  do_IRQ+0x4a/0xd0
      [ 1064.258660]  common_interrupt+0x7a/0x7a
      [ 1064.258818]  </IRQ>
      
      Meanwhile another context frees other queue but with the same set of
      shared tags:
      
      [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
      [ 1288.201833] bash            D    0  5910   5820 0x00000000
      [ 1288.202016] Call Trace:
      [ 1288.202315]  schedule+0x32/0x80
      [ 1288.202462]  schedule_timeout+0x1e5/0x380
      [ 1288.203838]  wait_for_completion+0xb0/0x120
      [ 1288.204137]  __wait_rcu_gp+0x125/0x160
      [ 1288.204287]  synchronize_sched+0x6e/0x80
      [ 1288.204770]  blk_mq_free_queue+0x74/0xe0
      [ 1288.204922]  blk_cleanup_queue+0xc7/0x110
      [ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
      [ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
      [ 1288.205548]  kernfs_fop_write+0x109/0x180
      [ 1288.206328]  vfs_write+0xb3/0x1a0
      [ 1288.206476]  SyS_write+0x52/0xc0
      [ 1288.206624]  do_syscall_64+0x68/0x1d0
      [ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      What happened is the following:
      
      1. There are several MQ queues with shared tags.
      2. One queue is about to be freed and now task is in
         blk_mq_del_queue_tag_set().
      3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
         tag list in order to find hctx to restart.
      
      Because linked list entry was modified in blk_mq_del_queue_tag_set()
      without proper waiting for a grace period, blk_mq_sched_restart()
      never ends, spining in list_for_each_entry_rcu_rr(), thus soft lockup.
      
      Fix is simple: reinit list entry after an RCU grace period elapsed.
      
      Fixes: Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: stable@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-block@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: NRoman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a347c7ad
  7. 05 6月, 2018 1 次提交
  8. 01 6月, 2018 2 次提交
  9. 29 5月, 2018 5 次提交
  10. 22 5月, 2018 1 次提交
  11. 18 5月, 2018 1 次提交
    • H
      blk-mq: clear hctx->dispatch_from when mappings change · d416c92c
      huhai 提交于
      When the number of hardware queues is changed, the drivers will call
      blk_mq_update_nr_hw_queues() to remap hardware queues. This changes
      the ctx mappings, but the current code doesn't clear the
      ->dispatch_from hint. This can result in dispatch_from pointing to
      a ctx that isn't mapped to the hctx anymore.
      
      Fixes: b347689f ("blk-mq-sched: improve dispatching from sw queue")
      Signed-off-by: Nhuhai <huhai@kylinos.cn>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      
      Moved the placement of the clearing to where we clear other items
      pertaining to the existing mapping, added Fixes line, and reworded
      the commit message.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d416c92c
  12. 16 5月, 2018 1 次提交
  13. 11 5月, 2018 1 次提交
  14. 09 5月, 2018 4 次提交
  15. 26 4月, 2018 2 次提交
  16. 25 4月, 2018 1 次提交
  17. 17 4月, 2018 1 次提交
    • J
      blk-mq: start request gstate with gen 1 · f4560231
      Jianchao Wang 提交于
      rq->gstate and rq->aborted_gstate both are zero before rqs are
      allocated. If we have a small timeout, when the timer fires,
      there could be rqs that are never allocated, and also there could
      be rq that has been allocated but not initialized and started. At
      the moment, the rq->gstate and rq->aborted_gstate both are 0, thus
      the blk_mq_terminate_expired will identify the rq is timed out and
      invoke .timeout early.
      
      For scsi, this will cause scsi_times_out to be invoked before the
      scsi_cmnd is not initialized, scsi_cmnd->device is still NULL at
      the moment, then we will get crash.
      
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Martin Steigerwald <Martin@Lichtvoll.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f4560231
  18. 10 4月, 2018 7 次提交
  19. 09 3月, 2018 3 次提交
  20. 01 3月, 2018 1 次提交