  1. 09 Jul, 2018 9 commits
  2. 29 Jun, 2018 1 commit
    • blk-mq: don't queue more if we get a busy return · 1f57f8d4
      Committed by Jens Axboe
      Some devices have different queue limits depending on the type of IO. A
      classic case is SATA NCQ, where some commands can queue, but others
      cannot. If we have NCQ commands inflight and encounter a non-queueable
      command, the driver returns busy. Currently we attempt to dispatch more
      from the scheduler, if we were able to queue some commands. But for the
      case where we ended up stopping due to BUSY, we should not attempt to
      retrieve more from the scheduler. If we do, we can get into a situation
      where we attempt to queue a non-queueable command, get BUSY, then
      successfully retrieve more commands from that scheduler and queue those.
      This can repeat forever, starving the non-queueable command indefinitely.
      
      Fix this by NOT attempting to pull more commands from the scheduler, if
      we get a BUSY return. This should also be more optimal in terms of
      letting requests stay in the scheduler for as long as possible, if we
      get a BUSY due to the regular out-of-tags condition.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
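The stop-on-BUSY rule described in this commit can be sketched as a small userspace model. Everything here (`toy_hctx`, `toy_queue_rq`, `toy_dispatch`, the slot counter) is a hypothetical stand-in and not the kernel's blk-mq code; it only illustrates that once the driver reports BUSY, the rejected request is parked for a later retry and nothing further is pulled from the scheduler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not kernel types. */
enum toy_status { TOY_OK, TOY_BUSY };

struct toy_hctx {
	int sched_pending;  /* requests still held by the scheduler */
	int parked;         /* requests the driver rejected, kept for retry */
	int free_slots;     /* requests the device accepts before BUSY */
};

/* The toy driver accepts a request only while it has a free slot. */
static enum toy_status toy_queue_rq(struct toy_hctx *h)
{
	if (h->free_slots == 0)
		return TOY_BUSY;
	h->free_slots--;
	return TOY_OK;
}

/*
 * Pull requests from the scheduler one at a time and stop on the first
 * BUSY. Returns the number of requests the driver accepted.
 */
static int toy_dispatch(struct toy_hctx *h)
{
	int queued = 0;

	assert(h != NULL);
	while (h->sched_pending > 0) {
		h->sched_pending--;          /* take one request */
		if (toy_queue_rq(h) == TOY_BUSY) {
			h->parked++;         /* keep it for a later retry */
			break;               /* and do not pull anything else */
		}
		queued++;
	}
	return queued;
}
```

With five pending requests and two free device slots, `toy_dispatch()` queues two requests, parks the third, and leaves the remaining two in the scheduler.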
  3. 24 Jun, 2018 1 commit
  4. 14 Jun, 2018 1 commit
  5. 13 Jun, 2018 1 commit
    • treewide: kzalloc_node() -> kcalloc_node() · 590b5b7d
      Committed by Kees Cook
      The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
      patch replaces cases of:
      
              kzalloc_node(a * b, gfp, node)
      
      with:

              kcalloc_node(a, b, gfp, node)
      
      as well as handling cases of:
      
              kzalloc_node(a * b * c, gfp, node)
      
      with:
      
              kzalloc_node(array3_size(a, b, c), gfp, node)
      
      as it's slightly less ugly than:
      
              kcalloc_node(array_size(a, b), c, gfp, node)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc_node(4 * 1024, gfp, node)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc_node(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc_node(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc_node(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc_node(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc_node
      + kcalloc_node
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc_node(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc_node(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc_node(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc_node(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc_node(C1 * C2 * C3, ...)
      |
        kzalloc_node(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc_node(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc_node(sizeof(THING) * C2, ...)
      |
        kzalloc_node(sizeof(TYPE) * C2, ...)
      |
        kzalloc_node(C1 * C2 * C3, ...)
      |
        kzalloc_node(C1 * C2, ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc_node
      + kcalloc_node
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
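The motivation for moving the multiplication out of the size argument can be illustrated in userspace. The sketch below assumes a GCC or Clang compiler for `__builtin_mul_overflow`; `toy_array3_size` and `toy_calloc` are hypothetical names that imitate overflow-checked sizing (saturate to `SIZE_MAX`, or fail the allocation), and they are not the kernel helpers themselves.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Product of three size factors that saturates to SIZE_MAX instead of
 * wrapping around. An allocator handed SIZE_MAX will fail cleanly rather
 * than return a buffer that is too small.
 */
static size_t toy_array3_size(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;
	if (__builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}

/*
 * Zeroed allocation of n elements of `size` bytes each. Because the two
 * factors arrive separately, the overflow check can happen here instead
 * of being the caller's responsibility.
 */
static void *toy_calloc(size_t n, size_t size)
{
	size_t bytes;

	assert(size > 0);
	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return calloc(1, bytes);
}
```

An open-coded `a * b` in the size argument wraps silently on overflow, which is the case the two-factor form is meant to catch.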
  6. 11 Jun, 2018 1 commit
    • blk-mq: reinit q->tag_set_list entry only after grace period · a347c7ad
      Committed by Roman Pen
      The q->tag_set_list list entry must not be reinitialized before the RCU
      grace period has completed; otherwise the following soft lockup in
      blk_mq_sched_restart() happens:
      
      [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
      [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
      [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
      [ 1064.256510] Call Trace:
      [ 1064.256664]  <IRQ>
      [ 1064.256824]  blk_mq_free_request+0xea/0x100
      [ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
      [ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
      [ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
      [ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
      [ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
      [ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
      [ 1064.258007]  irq_poll_softirq+0xb7/0xe0
      [ 1064.258165]  __do_softirq+0x106/0x2a2
      [ 1064.258328]  irq_exit+0x92/0xa0
      [ 1064.258509]  do_IRQ+0x4a/0xd0
      [ 1064.258660]  common_interrupt+0x7a/0x7a
      [ 1064.258818]  </IRQ>
      
      Meanwhile, another context frees a different queue that uses the same
      set of shared tags:
      
      [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
      [ 1288.201833] bash            D    0  5910   5820 0x00000000
      [ 1288.202016] Call Trace:
      [ 1288.202315]  schedule+0x32/0x80
      [ 1288.202462]  schedule_timeout+0x1e5/0x380
      [ 1288.203838]  wait_for_completion+0xb0/0x120
      [ 1288.204137]  __wait_rcu_gp+0x125/0x160
      [ 1288.204287]  synchronize_sched+0x6e/0x80
      [ 1288.204770]  blk_mq_free_queue+0x74/0xe0
      [ 1288.204922]  blk_cleanup_queue+0xc7/0x110
      [ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
      [ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
      [ 1288.205548]  kernfs_fop_write+0x109/0x180
      [ 1288.206328]  vfs_write+0xb3/0x1a0
      [ 1288.206476]  SyS_write+0x52/0xc0
      [ 1288.206624]  do_syscall_64+0x68/0x1d0
      [ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      What happened is the following:
      
      1. There are several MQ queues with shared tags.
      2. One queue is about to be freed, and its task is now in
         blk_mq_del_queue_tag_set().
      3. Another CPU is in blk_mq_sched_restart() and loops over all queues in
         the tag list in order to find an hctx to restart.
      
      Because the linked list entry was modified in blk_mq_del_queue_tag_set()
      without waiting for a grace period, blk_mq_sched_restart() never ends:
      it spins in list_for_each_entry_rcu_rr(), hence the soft lockup.
      
      The fix is simple: reinit the list entry only after an RCU grace period
      has elapsed.
      
      Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: stable@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-block@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
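Why an early reinit traps a concurrent reader can be shown with a single-threaded model of a circular doubly linked list. The `toy_*` names are hypothetical and there is no real RCU here; the "reader" is simply a walk that starts from the entry being removed, and `max_steps` is an artificial bound so that a walk which would never terminate can be reported instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_list {
	struct toy_list *next, *prev;
};

/* Make the entry point at itself, like a freshly initialized list head. */
static void toy_init(struct toy_list *e)
{
	e->next = e;
	e->prev = e;
}

static void toy_add_tail(struct toy_list *e, struct toy_list *head)
{
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

/* Unlink e but leave e->next intact so an in-flight reader can move on. */
static void toy_del_keep_next(struct toy_list *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

/* Walk forward from pos; true if head is reached within max_steps. */
static bool toy_walk_reaches_head(const struct toy_list *pos,
				  const struct toy_list *head, int max_steps)
{
	assert(pos != NULL && head != NULL);
	for (int i = 0; i < max_steps; i++) {
		pos = pos->next;
		if (pos == head)
			return true;
	}
	return false;
}
```

After `toy_del_keep_next(&a)`, a walk that is still standing on `a` follows `a.next` back into the list and reaches the head. Once `toy_init(&a)` runs, `a.next` points at `a` itself, so the same walk never leaves the entry, which mirrors the lockup described in the commit message.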
  7. 05 Jun, 2018 1 commit
  8. 01 Jun, 2018 2 commits
  9. 29 May, 2018 5 commits
  10. 22 May, 2018 1 commit
  11. 18 May, 2018 1 commit
    • blk-mq: clear hctx->dispatch_from when mappings change · d416c92c
      Committed by huhai
      When the number of hardware queues is changed, the drivers will call
      blk_mq_update_nr_hw_queues() to remap hardware queues. This changes
      the ctx mappings, but the current code doesn't clear the
      ->dispatch_from hint. This can result in dispatch_from pointing to
      a ctx that isn't mapped to the hctx anymore.
      
      Fixes: b347689f ("blk-mq-sched: improve dispatching from sw queue")
      Signed-off-by: huhai <huhai@kylinos.cn>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      
      Moved the placement of the clearing to where we clear other items
      pertaining to the existing mapping, added Fixes line, and reworded
      the commit message.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
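The stale-hint problem can be modelled in a few lines. This is only a sketch under assumed names (`toy_ctx`, `toy_hctx`, `toy_remap`, `toy_hint_is_valid`, and the fixed-size array are all hypothetical); it shows a cached pointer being dropped at the same point where the mapping it refers to is rebuilt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CTX 8

struct toy_ctx {
	int cpu;
};

struct toy_hctx {
	struct toy_ctx *ctxs[TOY_MAX_CTX];  /* software queues mapped here */
	int nr_ctx;
	struct toy_ctx *dispatch_from;      /* hint: where to resume dispatch */
};

/* True if the hint is empty or names a ctx that is currently mapped. */
static bool toy_hint_is_valid(const struct toy_hctx *h)
{
	if (h->dispatch_from == NULL)
		return true;
	for (int i = 0; i < h->nr_ctx; i++)
		if (h->ctxs[i] == h->dispatch_from)
			return true;
	return false;
}

/* Replace the mapping and drop the hint together with the old mapping. */
static void toy_remap(struct toy_hctx *h, struct toy_ctx **ctxs, int n)
{
	assert(n >= 0 && n <= TOY_MAX_CTX);
	h->nr_ctx = 0;
	h->dispatch_from = NULL;
	for (int i = 0; i < n; i++)
		h->ctxs[h->nr_ctx++] = ctxs[i];
}
```

If `toy_remap()` left `dispatch_from` untouched, `toy_hint_is_valid()` could return false after a remap that removed the hinted ctx.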
  12. 16 May, 2018 1 commit
  13. 11 May, 2018 1 commit
  14. 09 May, 2018 4 commits
  15. 26 Apr, 2018 2 commits
  16. 25 Apr, 2018 1 commit
  17. 17 Apr, 2018 1 commit
    • blk-mq: start request gstate with gen 1 · f4560231
      Committed by Jianchao Wang
      rq->gstate and rq->aborted_gstate are both zero before requests are
      allocated. With a small timeout, when the timer fires there may be
      requests that were never allocated, and also requests that have been
      allocated but not yet initialized and started. At that moment
      rq->gstate and rq->aborted_gstate are both 0, so
      blk_mq_terminate_expired() identifies the request as timed out and
      invokes .timeout early.
      
      For SCSI, this causes scsi_times_out() to be invoked before the
      scsi_cmnd is initialized; scsi_cmnd->device is still NULL at that
      point, and we crash.
      
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Martin Steigerwald <Martin@Lichtvoll.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
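The false timeout can be reproduced with a simplified model of the comparison described in this commit. The field names follow the commit message, but the bit layout (`TOY_STATE_BITS`, `TOY_GEN_INC`) and the helper functions are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: low bits hold the state, the rest hold the generation. */
#define TOY_STATE_BITS 2
#define TOY_GEN_INC    (UINT64_C(1) << TOY_STATE_BITS)

struct toy_rq {
	uint64_t gstate;          /* generation and state of the request */
	uint64_t aborted_gstate;  /* gstate recorded by the timeout path */
};

/*
 * Initialize a request. With start_at_gen1 false, both fields are zero,
 * as they would be in freshly zeroed memory.
 */
static void toy_rq_init(struct toy_rq *rq, bool start_at_gen1)
{
	assert(rq != NULL);
	rq->aborted_gstate = 0;
	rq->gstate = start_at_gen1 ? TOY_GEN_INC : 0;
}

/* The timeout path treats a request as expired when the values match. */
static bool toy_rq_looks_expired(const struct toy_rq *rq)
{
	return rq->gstate == rq->aborted_gstate;
}
```

A request that was never started matches when both fields are 0, so it is wrongly treated as expired; starting `gstate` at generation 1 breaks the accidental match.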
  18. 10 Apr, 2018 6 commits