1. 31 Jan 2018 (1 commit)
    • blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Authored by Ming Lei
      This status is returned from the driver to the block layer if a
      device-related resource is unavailable, but the driver can guarantee
      that IO dispatch will be triggered again once the resource becomes
      available.
      
      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if the
      driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the
      queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.
      BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc already
      use that magic value.
      
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      
      1) If all in-flight IOs complete before examining SCHED_RESTART in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
      is run immediately in this case by blk_mq_dispatch_rq_list();
      
      2) if there is any in-flight IO after/when examining SCHED_RESTART
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
      - otherwise, this request will be dispatched after any in-flight IO is
        completed via blk_mq_sched_restart()
      
      3) if SCHED_RESTART is set concurrently in this context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() covers the above two
      cases and makes sure an IO hang is avoided.
      
      One invariant is that the queue will be rerun if SCHED_RESTART is set.
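
      For illustration only, here is a sketch of how a driver's ->queue_rq
      might choose between the two statuses.  struct my_dev, the my_dev_*()
      helpers and the inflight counter are hypothetical stand-ins for
      driver-specific state, not code from the patch:

          #include <linux/blk-mq.h>

          static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                          const struct blk_mq_queue_data *bd)
          {
                  struct my_dev *dev = hctx->queue->queuedata;

                  blk_mq_start_request(bd->rq);

                  if (!my_dev_alloc_cmd(dev, bd->rq)) {
                          /*
                           * Out of device resources.  If other IO is known to
                           * be in flight, its completion is guaranteed to rerun
                           * the queue, so BLK_STS_DEV_RESOURCE is safe; otherwise
                           * return BLK_STS_RESOURCE and let blk-mq rerun the
                           * queue after a short delay once SCHED_RESTART is set.
                           */
                          return atomic_read(&dev->inflight) ?
                                  BLK_STS_DEV_RESOURCE : BLK_STS_RESOURCE;
                  }

                  my_dev_submit(dev, bd->rq);
                  return BLK_STS_OK;
          }
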
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 18 Jan 2018 (1 commit)
    • blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Authored by Ming Lei
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      
      1) This approach isn't efficient because the hctx spinlock is always
      taken.
      
      2) With blk_insert_cloned_request(), we completely bypass the underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to the underlying queue being busy), the dm-rq elevator
      cannot provide effective IO merging (as a side-effect, dm-rq currently
      blindly destages a request from its elevator only to requeue it after a
      delay, which kills any opportunity for merging).  This obviously causes
      very bad sequential IO performance.
      
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_STS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request, thereby preserving the opportunity to
      merge IO.
      
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
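
      A rough sketch of the consumer side, simplified from dm-rq; the
      function name and the surrounding error handling are illustrative,
      not the exact code from the patch:

          #include <linux/blkdev.h>
          #include <linux/device-mapper.h>

          static int my_dispatch_clone(struct request_queue *under_q,
                                       struct request *clone)
          {
                  blk_status_t ret = blk_insert_cloned_request(under_q, clone);

                  if (ret == BLK_STS_RESOURCE)
                          /*
                           * Underlying queue is busy: return DM_MAPIO_REQUEUE so
                           * the original request stays in dm-rq's elevator and
                           * can still be merged with later IO, instead of being
                           * destaged and requeued after a delay.
                           */
                          return DM_MAPIO_REQUEUE;

                  return DM_MAPIO_REMAPPED;
          }
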
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 15 Jan 2018 (1 commit)
    • dm: fix incomplete request_queue initialization · c100ec49
      Authored by Mike Snitzer
      With these changes, DM is no longer prone to having its request_queue
      improperly initialized.
      
      Summary of changes:
      
      - defer DM's blk_register_queue() from add_disk()-time until
        dm_setup_md_queue() by using add_disk_no_queue_reg() in alloc_dev().
      
      - dm_setup_md_queue() is updated to fully initialize DM's request_queue
        (_after_ all table loads have occurred and the request_queue's type,
        features and limits are known).
      
      A very welcome side-effect of these changes is that DM no longer needs to:
      1) backfill the "mq" sysfs entry (because historically DM didn't
      initialize the request_queue to use blk-mq until _after_
      blk_register_queue() was called via add_disk());
      2) call elv_register_queue() to get a .request_fn request-based DM
      device's "iosched" exposed in sysfs.
      
      In addition, blk-mq debugfs support is now made available because
      request-based DM's blk-mq request_queue is now properly initialized
      before dm_setup_md_queue() calls blk_register_queue().
      
      These changes also stave off the need to introduce new DM-specific
      workarounds in block core, e.g. this proposal:
      https://patchwork.kernel.org/patch/10067961/
      
      In the end DM devices should be less unicorn in nature (relative to
      initialization and availability of block core infrastructure provided by
      the request_queue).
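
      A condensed sketch of the resulting ordering; everything here other
      than add_disk_no_queue_reg() and blk_register_queue() is a placeholder
      for DM internals:

          #include <linux/genhd.h>
          #include <linux/blkdev.h>

          static void my_alloc_dev(struct mapped_device *md)
          {
                  /* create the gendisk but defer sysfs queue registration */
                  add_disk_no_queue_reg(my_md_disk(md));  /* placeholder accessor */
          }

          static int my_setup_md_queue(struct mapped_device *md, struct dm_table *t)
          {
                  /* only now are the queue's type, features and limits known */
                  my_init_queue_from_table(md, t);        /* placeholder */

                  /* register once the request_queue is fully initialized */
                  blk_register_queue(my_md_disk(md));
                  return 0;
          }
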
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 06 Oct 2017 (1 commit)
  5. 28 Aug 2017 (2 commits)
  6. 19 Jun 2017 (1 commit)
  7. 09 Jun 2017 (3 commits)
    • block: switch bios to blk_status_t · 4e4cbee9
      Authored by Christoph Hellwig
      Replace bi_error with a new bi_status to allow for a clear conversion.
      Note that device mapper overloaded bi_error with a private value, which
      we'll have to keep around at least for now and thus propagate to a
      proper blk_status_t value.
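
      For illustration (not part of the patch), a completion callback under
      the new scheme reads and propagates bi_status instead of an errno in
      bi_error; the clone/bi_private wiring below is hypothetical:

          #include <linux/bio.h>

          static void my_clone_endio(struct bio *clone)
          {
                  struct bio *orig = clone->bi_private;   /* set when the clone was made */
                  blk_status_t status = clone->bi_status; /* block-layer status, not an errno */

                  bio_put(clone);

                  orig->bi_status = status;               /* propagate, then complete */
                  bio_endio(orig);
          }
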
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: switch ->queue_rq return value to blk_status_t · fc17b653
      Authored by Christoph Hellwig
      Use the same values for request completion errors as for the return
      value from ->queue_rq.  BLK_STS_RESOURCE is special cased to cause
      a requeue, and all the others are completed as-is.
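
      Illustration only: on the completion side a driver now passes the same
      blk_status_t values; my_complete_cmd() is a hypothetical helper:

          #include <linux/blk-mq.h>

          static void my_complete_cmd(struct request *rq, bool hw_error)
          {
                  /* same value space as the ->queue_rq return */
                  blk_mq_end_request(rq, hw_error ? BLK_STS_IOERR : BLK_STS_OK);
          }
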
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: introduce new block status code type · 2a842aca
      Authored by Christoph Hellwig
      Currently we use normal Linux errno values in the block layer, and while
      we accept any error, a few have overloaded magic meanings.  This patch
      instead introduces a new blk_status_t value that holds block layer specific
      status codes and explicitly explains their meaning.  Helpers to convert from
      and to the previous special meanings are provided for now, but I suspect
      we want to get rid of them in the long run - those drivers that have an
      errno input (e.g. networking) usually get errnos that don't know about
      the special block layer overloads, and similarly returning them to userspace
      will usually return something that strictly speaking isn't correct
      for file system operations, but that's left as an exercise for later.
      
      For now the set of errors is a very limited set that closely corresponds
      to the previous overloaded errno values, but there is some low-hanging
      fruit to improve it.
      
      blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
      typechecking, so that we can easily catch places passing the wrong values.
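
      The core pieces look roughly like this (abridged; treat the exact
      numeric codes as approximate):

          /* include/linux/blk_types.h (abridged) */
          typedef u8 __bitwise blk_status_t;

          #define BLK_STS_OK      0
          #define BLK_STS_NOTSUPP ((__force blk_status_t)1)
          #define BLK_STS_IOERR   ((__force blk_status_t)10)

          /* transition helpers to and from errno values */
          int blk_status_to_errno(blk_status_t status);
          blk_status_t errno_to_blk_status(int errno);
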
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 16 May 2017 (1 commit)
  9. 02 May 2017 (3 commits)
  10. 28 Apr 2017 (1 commit)
  11. 25 Apr 2017 (1 commit)
  12. 21 Apr 2017 (2 commits)
  13. 09 Apr 2017 (1 commit)
  14. 08 Apr 2017 (1 commit)
    • dm rq: Avoid that request processing stalls sporadically · 6077c2d7
      Authored by Bart Van Assche
      While running the srp-test software I noticed that request
      processing stalls sporadically at the beginning of a test, namely
      when mkfs is run against a dm-mpath device. Every time that
      happened, the following command was sufficient to resume request
      processing:
      
          echo run >/sys/kernel/debug/block/dm-0/state
      
      This patch avoids such request processing stalls. The
      test I ran is as follows:
      
          while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Signed-off-by: Jens Axboe <axboe@fb.com>
  15. 31 Mar 2017 (1 commit)
  16. 25 Feb 2017 (1 commit)
  17. 03 Feb 2017 (1 commit)
  18. 28 Jan 2017 (3 commits)
  19. 09 Dec 2016 (1 commit)
  20. 21 Nov 2016 (1 commit)
  21. 15 Nov 2016 (1 commit)
  22. 03 Nov 2016 (5 commits)
    • dm: Fix a race condition related to stopping and starting queues · 7b17c2f7
      Authored by Bart Van Assche
      Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request()
      calls have stopped before setting the "queue stopped" flag. This
      makes it possible to remove the "queue stopped" test from
      dm_mq_queue_rq() and dm_mq_requeue_request(). This patch fixes a race
      condition: because dm_mq_queue_rq() is called without holding the
      queue lock, BLK_MQ_S_STOPPED can be set at any time while
      dm_mq_queue_rq() is in progress. This patch prevents the following
      hang, which occurred sporadically when using dm-mq:
      
      INFO: task systemd-udevd:10111 blocked for more than 480 seconds.
      Call Trace:
       [<ffffffff8161f397>] schedule+0x37/0x90
       [<ffffffff816239ef>] schedule_timeout+0x27f/0x470
       [<ffffffff8161e76f>] io_schedule_timeout+0x9f/0x110
       [<ffffffff8161fb36>] bit_wait_io+0x16/0x60
       [<ffffffff8161f929>] __wait_on_bit_lock+0x49/0xa0
       [<ffffffff8114fe69>] __lock_page+0xb9/0xc0
       [<ffffffff81165d90>] truncate_inode_pages_range+0x3e0/0x760
       [<ffffffff81166120>] truncate_inode_pages+0x10/0x20
       [<ffffffff81212a20>] kill_bdev+0x30/0x40
       [<ffffffff81213d41>] __blkdev_put+0x71/0x360
       [<ffffffff81214079>] blkdev_put+0x49/0x170
       [<ffffffff812141c0>] blkdev_close+0x20/0x30
       [<ffffffff811d48e8>] __fput+0xe8/0x1f0
       [<ffffffff811d4a29>] ____fput+0x9/0x10
       [<ffffffff810842d3>] task_work_run+0x83/0xb0
       [<ffffffff8106606e>] do_exit+0x3ee/0xc40
       [<ffffffff8106694b>] do_group_exit+0x4b/0xc0
       [<ffffffff81073d9a>] get_signal+0x2ca/0x940
       [<ffffffff8101bf43>] do_signal+0x23/0x660
       [<ffffffff810022b3>] exit_to_usermode_loop+0x73/0xb0
       [<ffffffff81002cb0>] syscall_return_slowpath+0xb0/0xc0
       [<ffffffff81624e33>] entry_SYSCALL_64_fastpath+0xa6/0xa8
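
      The shape of the fix is to wait for ongoing ->queue_rq calls before
      relying on the stopped state.  A sketch, assuming the
      blk_mq_quiesce_queue() helper added in the same patch series; the
      actual patch may differ in detail:

          #include <linux/blk-mq.h>

          static void my_stop_queue(struct request_queue *q)
          {
                  if (blk_mq_queue_stopped(q))
                          return;

                  /*
                   * Waits until ongoing dm_mq_queue_rq() calls have finished,
                   * so the stopped state cannot change underneath them.
                   */
                  blk_mq_quiesce_queue(q);
          }
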
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code · f0d33ab7
      Authored by Bart Van Assche
      Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
      in the dm start and stop queue functions, only manipulate the latter
      flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request() · 2b053aca
      Authored by Bart Van Assche
      Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
      are followed by kicking the requeue list. Hence add an argument to
      these two functions that allows the caller to kick the requeue list.
      This was proposed by Christoph Hellwig.
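
      Illustration of the combined form (the requeue context around it is
      elided):

          #include <linux/blk-mq.h>

          static void my_requeue(struct request *rq)
          {
                  /* 'true' replaces a separate blk_mq_kick_requeue_list() call */
                  blk_mq_requeue_request(rq, true);
          }
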
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Remove blk_mq_cancel_requeue_work() · 9b7dd572
      Authored by Bart Van Assche
      Since blk_mq_requeue_work() no longer restarts stopped queues,
      canceling requeue work is no longer needed to prevent a
      stopped queue from being restarted. Hence remove this function.
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: Avoid that requeueing starts stopped queues · 52d7f1b5
      Authored by Bart Van Assche
      Since blk_mq_requeue_work() starts stopped queues and since
      execution of this function can be scheduled after a queue has
      been stopped it is not possible to stop queues without using
      an additional state variable to track whether or not the queue
      has been stopped. Hence modify blk_mq_requeue_work() such that it
      does not start stopped queues. My conclusion after a review of
      the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
      callers is as follows:
      * In the dm driver starting and stopping queues should only happen
        if __dm_suspend() or __dm_resume() is called and not if the
        requeue list is processed.
      * In the SCSI core queue stopping and starting should only be
        performed by the scsi_internal_device_block() and
        scsi_internal_device_unblock() functions but not by any other
        function. Although the blk_mq_stop_hw_queue() call in
        scsi_queue_rq() may help to reduce CPU load if an LLD queue is
        full, figuring out whether or not a queue should be restarted
        when requeueing a command would require introducing additional
        locking in scsi_mq_requeue_cmd() to avoid a race with
        scsi_internal_device_block(). Avoid this complexity by removing
        the blk_mq_stop_hw_queue() call from scsi_queue_rq().
      * In the NVMe core only the functions that call
        blk_mq_start_stopped_hw_queues() explicitly should start stopped
        queues.
      * A blk_mq_start_stopped_hw_queues() call must be added in the
        xen-blkfront driver in its blkif_recover() function.
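
      For the xen-blkfront point, the added call would look roughly like
      this; struct my_blkfront_info, my_info_queue() and the recovery steps
      are placeholders:

          #include <linux/blk-mq.h>

          static int my_blkif_recover(struct my_blkfront_info *info)
          {
                  /* ... re-grant and requeue the in-flight requests ... */

                  /* requeue work no longer restarts stopped queues, so do it here */
                  blk_mq_start_stopped_hw_queues(my_info_queue(info), true);
                  blk_mq_kick_requeue_list(my_info_queue(info));
                  return 0;
          }
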
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Roger Pau Monné <roger.pau@citrix.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: James Bottomley <jejb@linux.vnet.ibm.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  23. 28 Oct 2016 (1 commit)
  24. 19 Oct 2016 (1 commit)
  25. 12 Oct 2016 (1 commit)
    • kthread: kthread worker API cleanup · 3989144f
      Authored by Petr Mladek
      A good practice is to prefix the names of functions with the name
      of the subsystem.
      
      The kthread worker API is a mix of classic kthreads and workqueues.  Each
      worker has a dedicated kthread.  It runs a generic function that processes
      queued work items.  It is implemented as part of the kthread subsystem.
      
      This patch renames the existing kthread worker API to use
      the corresponding name from the workqueues API prefixed by
      kthread_:
      
      __init_kthread_worker()		-> __kthread_init_worker()
      init_kthread_worker()		-> kthread_init_worker()
      init_kthread_work()		-> kthread_init_work()
      insert_kthread_work()		-> kthread_insert_work()
      queue_kthread_work()		-> kthread_queue_work()
      flush_kthread_work()		-> kthread_flush_work()
      flush_kthread_worker()		-> kthread_flush_worker()
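
      A small usage sketch with the renamed API; work_fn, the worker objects
      and the thread name are placeholders:

          #include <linux/err.h>
          #include <linux/kthread.h>

          static struct kthread_worker worker;
          static struct kthread_work work;

          static void work_fn(struct kthread_work *w)
          {
                  /* process one queued item */
          }

          static int start_worker(void)
          {
                  struct task_struct *task;

                  kthread_init_worker(&worker);         /* was init_kthread_worker() */
                  kthread_init_work(&work, work_fn);    /* was init_kthread_work()   */

                  task = kthread_run(kthread_worker_fn, &worker, "my-worker");
                  if (IS_ERR(task))
                          return PTR_ERR(task);

                  kthread_queue_work(&worker, &work);   /* was queue_kthread_work()  */
                  kthread_flush_work(&work);            /* was flush_kthread_work()  */
                  return 0;
          }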
      
      Note that the names of DEFINE_KTHREAD_WORK*() macros stay
      as they are. It is common that the "DEFINE_" prefix has
      precedence over the subsystem names.
      
      Note that the INIT() macros and the init() functions use different
      naming schemes. There is no perfect solution; there are several
      reasons for this choice:
      
        + "init" in the function names stands for the verb "initialize"
          aka "initialize worker". While "INIT" in the macro names
          stands for the noun "INITIALIZER" aka "worker initializer".
      
        + INIT() macros are used only in DEFINE() macros
      
        + init() functions are used close to the other kthread()
          functions. It looks much better if all the functions
          use the same scheme.
      
        + There will be also kthread_destroy_worker() that will
          be used close to kthread_cancel_work(). It is related
          to the init() function. Again it looks better if all
          functions use the same naming scheme.
      
        + there are several precedents for such init() function
          names, e.g. amd_iommu_init_device(), free_area_init_node(),
          jump_label_init_type() and regmap_init_mmio_clk().
      
        + It is not an argument but it was inconsistent even before.
      
      [arnd@arndb.de: fix linux-next merge conflict]
       Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 21 Sep 2016 (1 commit)
  27. 15 Sep 2016 (2 commits)