1. 04 9月, 2020 2 次提交
  2. 02 9月, 2020 1 次提交
  3. 29 7月, 2020 1 次提交
  4. 01 7月, 2020 1 次提交
  5. 30 6月, 2020 1 次提交
  6. 29 6月, 2020 1 次提交
  7. 24 6月, 2020 2 次提交
  8. 30 5月, 2020 2 次提交
    • M
      blk-mq: drain I/O when all CPUs in a hctx are offline · bf0beec0
      Ming Lei 提交于
      Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
      up queue mapping. Thomas mentioned the following point[1]:
      
      "That was the constraint of managed interrupts from the very beginning:
      
       The driver/subsystem has to quiesce the interrupt line and the associated
       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
       until it's restarted by the core when the CPU is plugged in again."
      
      However, current blk-mq implementation doesn't quiesce hw queue before
      the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
      cpuhp state handled after the CPU is down, so there isn't any chance to
      quiesce the hctx before shutting down the CPU.
      
      Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
      where the last CPU goes away, and wait for completion of in-flight
      requests.  This guarantees that there is no inflight I/O before shutting
      down the managed IRQ.
      
      Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
      to wait for completion of in-flight requests from these drivers to avoid
      a potential dead-lock. It is safe to do this for stacking drivers as those
      do not use interrupts at all and their I/O completions are triggered by
      underlying devices I/O completion.
      
      [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
      
      [hch: different retry mechanism, merged two patches, minor cleanups]
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NDaniel Wagner <dwagner@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bf0beec0
    • K
      blk-mq: blk-mq: provide forced completion method · 7b11eab0
      Keith Busch 提交于
      Drivers may need to bypass error injection for error recovery. Rename
      __blk_mq_complete_request() to blk_mq_force_complete_rq() and export
      that function so drivers may skip potential fake timeouts after they've
      reclaimed lost requests.
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      Reviewed-by: NDaniel Wagner <dwagner@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7b11eab0
  9. 25 4月, 2020 1 次提交
  10. 21 4月, 2020 1 次提交
  11. 19 4月, 2020 1 次提交
    • G
      blk-mq: Replace zero-length array with flexible-array member · f36aaf8b
      Gustavo A. R. Silva 提交于
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that, dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      f36aaf8b
  12. 28 3月, 2020 1 次提交
  13. 10 3月, 2020 1 次提交
  14. 14 11月, 2019 1 次提交
  15. 01 11月, 2019 1 次提交
  16. 26 10月, 2019 1 次提交
  17. 07 10月, 2019 2 次提交
  18. 06 9月, 2019 1 次提交
    • D
      block: Delay default elevator initialization · 737eb78e
      Damien Le Moal 提交于
      When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
      the only information known about the device is the number of hardware
      queues as the block device scan by the device driver is not completed
      yet for most drivers. The device type and elevator required features
      are not set yet, preventing to correctly select the default elevator
      most suitable for the device.
      
      This currently affects all multi-queue zoned block devices which default
      to the "none" elevator instead of the required "mq-deadline" elevator.
      These drives currently include host-managed SMR disks connected to a
      smartpqi HBA and null_blk block devices with zoned mode enabled.
      Upcoming NVMe Zoned Namespace devices will also be affected.
      
      Fix this by adding the boolean elevator_init argument to
      blk_mq_init_allocated_queue() to control the execution of
      elevator_init_mq(). Two cases exist:
      1) elevator_init = false is used for calls to
         blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
         case, a call to elevator_init_mq() is added to __device_add_disk(),
         resulting in the delayed initialization of the queue elevator
         after the device driver finished probing the device information. This
         effectively allows elevator_init_mq() access to more information
         about the device.
      2) elevator_init = true preserves the current behavior of initializing
         the elevator directly from blk_mq_init_allocated_queue(). This case
         is used for the special request based DM devices where the device
         gendisk is created before the queue initialization and device
         information (e.g. queue limits) is already known when the queue
         initialization is executed.
      
      Additionally, to make sure that the elevator initialization is never
      done while requests are in-flight (there should be none when the device
      driver calls device_add_disk()), freeze and quiesce the device request
      queue before calling blk_mq_init_sched() in elevator_init_mq().
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      737eb78e
  19. 28 8月, 2019 1 次提交
  20. 05 8月, 2019 4 次提交
    • M
      blk-mq: add callback of .cleanup_rq · 226b4fc7
      Ming Lei 提交于
      SCSI maintains its own driver private data hooked off of each SCSI
      request, and the pridate data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      226b4fc7
    • M
      blk-mq: remove blk_mq_complete_request_sync · a87ccce0
      Ming Lei 提交于
      blk_mq_tagset_wait_completed_request() has been applied for waiting
      for completed request's fn, so not necessary to use
      blk_mq_complete_request_sync() any more.
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a87ccce0
    • M
      blk-mq: introduce blk_mq_tagset_wait_completed_request() · f9934a80
      Ming Lei 提交于
      blk-mq may schedule to call queue's complete function on remote CPU via
      IPI, but doesn't provide any way to synchronize the request's complete
      fn. The current queue freeze interface can't provide the synchonization
      because aborted requests stay at blk-mq queues during EH.
      
      In some driver's EH(such as NVMe), hardware queue's resource may be freed &
      re-allocated. If the completed request's complete fn is run finally after the
      hardware queue's resource is released, kernel crash will be triggered.
      
      Prepare for fixing this kind of issue by introducing
      blk_mq_tagset_wait_completed_request().
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f9934a80
    • M
      blk-mq: introduce blk_mq_request_completed() · aa306ab7
      Ming Lei 提交于
      NVMe needs this function to decide if one request to be aborted has
      been completed in normal IO path already.
      
      So introduce it.
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      aa306ab7
  21. 21 6月, 2019 1 次提交
    • C
      block: remove the bi_phys_segments field in struct bio · 14ccb66b
      Christoph Hellwig 提交于
      We only need the number of segments in the blk-mq submission path.
      Remove the field from struct bio, and return it from a variant of
      blk_queue_split instead of that it can passed as an argument to
      those functions that need the value.
      
      This also means we stop recounting segments except for cloning
      and partial segments.
      
      To keep the number of arguments in this how path down remove
      pointless struct request_queue arguments from any of the functions
      that had it and grew a nr_segs argument.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      14ccb66b
  22. 04 5月, 2019 1 次提交
    • M
      blk-mq: always free hctx after request queue is freed · 2f8f1336
      Ming Lei 提交于
      In normal queue cleanup path, hctx is released after request queue
      is freed, see blk_mq_release().
      
      However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
      of hw queues shrinking. This way is easy to cause use-after-free,
      because: one implicit rule is that it is safe to call almost all block
      layer APIs if the request queue is alive; and one hctx may be retrieved
      by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues();
      finally use-after-free is triggered.
      
      Fixes this issue by always freeing hctx after releasing request queue.
      If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce
      a per-queue list to hold them, then try to resuse these hctxs if numa
      node is matched.
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org,
      Cc: Martin K . Petersen <martin.petersen@oracle.com>,
      Cc: Christoph Hellwig <hch@lst.de>,
      Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Tested-by: NJames Smart <james.smart@broadcom.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2f8f1336
  23. 10 4月, 2019 1 次提交
    • M
      blk-mq: introduce blk_mq_complete_request_sync() · 1b8f21b7
      Ming Lei 提交于
      In NVMe's error handler, follows the typical steps of tearing down
      hardware for recovering controller:
      
      1) stop blk_mq hw queues
      2) stop the real hw queues
      3) cancel in-flight requests via
      	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
      cancel_request():
      	mark the request as abort
      	blk_mq_complete_request(req);
      4) destroy real hw queues
      
      However, there may be race between #3 and #4, because blk_mq_complete_request()
      may run q->mq_ops->complete(rq) remotelly and asynchronously, and
      ->complete(rq) may be run after #4.
      
      This patch introduces blk_mq_complete_request_sync() for fixing the
      above race.
      
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: linux-nvme@lists.infradead.org
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1b8f21b7
  24. 21 3月, 2019 1 次提交
  25. 19 3月, 2019 1 次提交
  26. 15 2月, 2019 1 次提交
  27. 19 12月, 2018 1 次提交
  28. 18 12月, 2018 1 次提交
  29. 05 12月, 2018 1 次提交
  30. 30 11月, 2018 1 次提交
    • J
      blk-mq: add mq_ops->commit_rqs() · d666ba98
      Jens Axboe 提交于
      blk-mq passes information to the hardware about any given request being
      the last that we will issue in this sequence. The point is that hardware
      can defer costly doorbell type writes to the last request. But if we run
      into errors issuing a sequence of requests, we may never send the request
      with bd->last == true set. For that case, we need a hook that tells the
      hardware that nothing else is coming right now.
      
      For failures returned by the drivers ->queue_rq() hook, the driver is
      responsible for flushing pending requests, if it uses bd->last to
      optimize that part. This works like before, no changes there.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d666ba98
  31. 27 11月, 2018 2 次提交
  32. 26 11月, 2018 1 次提交