1. 25 Feb 2020, 1 commit
    • blk-mq: insert passthrough request into hctx->dispatch directly · 01e99aec
      Authored by Ming Lei
      For some reason, a device may get into a state in which it cannot
      handle FS requests, so BLK_STS_RESOURCE is always returned and the
      FS request keeps being re-added to hctx->dispatch. However, a
      passthrough request may be required at that point to fix the
      problem. If the passthrough request is added to the scheduler
      queue, blk-mq never gets a chance to dispatch it, because requests
      in hctx->dispatch are prioritized. The FS I/O request may then
      never complete, causing an I/O hang.
      
      So the passthrough request has to be added to hctx->dispatch
      directly to fix the I/O hang.
      
      Fix this issue by inserting passthrough requests into hctx->dispatch
      directly, together with adding FS requests to the tail of
      hctx->dispatch in blk_mq_dispatch_rq_list(). FS requests are added
      to the tail of hctx->dispatch by default anyway; see
      blk_mq_request_bypass_insert().
      
      This makes the behaviour consistent with the original legacy I/O
      request path, in which passthrough requests were always added to
      q->queue_head.
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      01e99aec
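
      A minimal sketch of the resulting insert decision. It assumes the
      post-patch three-argument form of blk_mq_request_bypass_insert()
      (this commit added the at_head parameter); that helper is
      block-layer internal (block/blk-mq.h), the wrapper name is
      hypothetical, and the elevator path is elided:

          /* Sketch: where a freshly inserted request should go. */
          #include <linux/blkdev.h>
          #include <linux/blk-mq.h>

          static void sched_insert_sketch(struct request *rq, bool at_head,
                                          bool run_queue)
          {
                  if (blk_rq_is_passthrough(rq)) {
                          /*
                           * Bypass the I/O scheduler: park the request on
                           * hctx->dispatch so it can still be issued while
                           * FS requests spin on BLK_STS_RESOURCE retries.
                           */
                          blk_mq_request_bypass_insert(rq, at_head, run_queue);
                          return;
                  }
                  /* FS requests keep going through the elevator as before. */
          }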
  2. 21 Dec 2019, 1 commit
    • block: Fix a lockdep complaint triggered by request queue flushing · b3c6a599
      Authored by Bart Van Assche
      Avoid a false-positive lockdep complaint that is triggered by
      running test nvme/012 from the blktests suite:
      
      ============================================
      WARNING: possible recursive locking detected
      5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
      --------------------------------------------
      ksoftirqd/1/16 is trying to acquire lock:
      000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0
      
      but task is already holding lock:
      00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&fq->mq_flush_lock)->rlock);
        lock(&(&fq->mq_flush_lock)->rlock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      1 lock held by ksoftirqd/1/16:
       #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0
      
      stack backtrace:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       dump_stack+0x67/0x90
       __lock_acquire.cold.45+0x2b4/0x313
       lock_acquire+0x98/0x160
       _raw_spin_lock_irqsave+0x3b/0x80
       flush_end_io+0x4e/0x1d0
       blk_mq_complete_request+0x76/0x110
       nvmet_req_complete+0x15/0x110 [nvmet]
       nvmet_bio_done+0x27/0x50 [nvmet]
       blk_update_request+0xd7/0x2d0
       blk_mq_end_request+0x1a/0x100
       blk_flush_complete_seq+0xe5/0x350
       flush_end_io+0x12f/0x1d0
       blk_done_softirq+0x9f/0xd0
       __do_softirq+0xca/0x440
       run_ksoftirqd+0x24/0x50
       smpboot_thread_fn+0x113/0x1e0
       kthread+0x121/0x140
       ret_from_fork+0x3a/0x50
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b3c6a599
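
      The fix itself is not quoted above: it gives each flush queue its
      own lockdep class, so nested acquisitions of two different
      fq->mq_flush_lock instances are no longer reported as recursion.
      A sketch under that assumption, using the dynamic lockdep key API;
      the struct is a hypothetical mirror of the private blk_flush_queue:

          #include <linux/spinlock.h>
          #include <linux/lockdep.h>

          struct flush_queue_sketch {
                  spinlock_t            mq_flush_lock;
                  struct lock_class_key key;    /* one class per queue */
          };

          static void fq_init_lock(struct flush_queue_sketch *fq)
          {
                  lockdep_register_key(&fq->key);
                  spin_lock_init(&fq->mq_flush_lock);
                  /* Distinct class: locking queue A's lock while holding
                   * queue B's no longer looks recursive to lockdep. */
                  lockdep_set_class(&fq->mq_flush_lock, &fq->key);
          }

          static void fq_exit_lock(struct flush_queue_sketch *fq)
          {
                  lockdep_unregister_key(&fq->key);
          }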
  3. 22 Nov 2019, 1 commit
    • block: add iostat counters for flush requests · b6866318
      Authored by Konstantin Khlebnikov
      Requests that trigger flushing of the volatile writeback cache to
      disk (barriers) have a significant effect on overall performance.
      
      The block layer has a sophisticated engine for combining several
      flush requests into one, but there are no statistics on the actual
      flushes executed by the disk. Requests which trigger flushes are
      usually barriers, i.e. zero-size writes.
      
      This patch adds two iostat counters to /sys/class/block/$dev/stat
      and /proc/diskstats: the count of completed flush requests and
      their total time.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b6866318
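
      For consumers, the two counters appear as two extra trailing
      fields. A hedged user-space sketch that reads them, assuming the
      post-4.19 extended stat layout (11 classic fields, 4 discard
      fields, then flush count and flush milliseconds) and an
      illustrative device name:

          #include <stdio.h>

          int main(void)
          {
                  unsigned long long v[17];
                  FILE *f = fopen("/sys/class/block/sda/stat", "r");
                  int n = 0;

                  if (!f)
                          return 1;
                  while (n < 17 && fscanf(f, "%llu", &v[n]) == 1)
                          n++;
                  fclose(f);
                  if (n == 17)    /* kernel new enough to report flushes */
                          printf("flush reqs: %llu, flush ms: %llu\n",
                                 v[15], v[16]);
                  return 0;
          }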
  4. 27 Sep 2019, 1 commit
    • block: fix null pointer dereference in blk_mq_rq_timed_out() · 8d699663
      Authored by Yufen Yu
      We hit a null pointer dereference BUG in blk_mq_rq_timed_out(),
      as follows:
      
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      
      The bug is caused by a race between the timeout handler and the
      completion of a flush request.
      
      When the timeout handler blk_mq_rq_timed_out() tries to read
      'req->q->mq_ops', 'req' may already have been completed and
      reinitialized by the next flush request, which calls blk_rq_init()
      to zero out 'req'.
      
      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"), the
      lifetime of a normal request is protected by a refcount: until
      'rq->ref' drops to zero, the request cannot really be freed, so it
      cannot be reused before the timeout handler finishes.
      
      However, a flush request defines .end_io, and rq->end_io() is
      still called even if 'rq->ref' hasn't dropped to zero. After that,
      'flush_rq' can be reused by the next flush request, resulting in
      the null pointer dereference BUG above.
      
      We fix this problem by covering the flush request with 'rq->ref':
      if the refcount is not zero, flush_end_io() returns and waits for
      the last holder to invoke it again. To record the request status,
      we add a new field, 'rq_status', which is used in flush_end_io().
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      
      -------
      v2:
       - move rq_status from struct request to struct blk_flush_queue
      v3:
       - remove unnecessary '{}' pair.
      v4:
       - let a spinlock protect 'fq->rq_status'
      v5:
       - move rq_status after flush_running_idx member of struct blk_flush_queue
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8d699663
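
      A sketch of the resulting guard at the top of flush_end_io(),
      matching the mechanism described above; 'rq->ref' is assumed to be
      a refcount_t, the fq accessor is hypothetical, and the rest of the
      flush sequence is elided:

          /* Sketch: only the last reference holder runs the sequence. */
          static void flush_end_io_sketch(struct request *flush_rq,
                                          blk_status_t error)
          {
                  struct blk_flush_queue *fq = get_fq_sketch(flush_rq);
                  unsigned long flags;

                  spin_lock_irqsave(&fq->mq_flush_lock, flags);
                  if (!refcount_dec_and_test(&flush_rq->ref)) {
                          /* A timeout handler still holds a reference:
                           * park the status; the last holder re-invokes
                           * us once it drops 'rq->ref'. */
                          fq->rq_status = error;
                          spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
                          return;
                  }
                  /* ... flush_rq can no longer be reused under us ... */
                  spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
          }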
  5. 01 May 2019, 1 commit
  6. 25 Mar 2019, 1 commit
  7. 30 Jan 2019, 1 commit
    • blk-mq: fix a hung issue when fsync · 85bd6e61
      Authored by Jianchao Wang
      Florian reported an I/O hang when calling fsync(). It is triggered
      by the following race condition.
      
      data + post flush         a flush
      
      blk_flush_complete_seq
        case REQ_FSEQ_DATA
          blk_flush_queue_rq
          issued to driver      blk_mq_dispatch_rq_list
                                  try to issue a flush req
                                  failed due to NON-NCQ command
                                  .queue_rq return BLK_STS_DEV_RESOURCE
      
      request completion
        req->end_io // doesn't check RESTART
        mq_flush_data_end_io
          case REQ_FSEQ_POSTFLUSH
            blk_kick_flush
              do nothing because previous flush
              has not been completed
           blk_mq_run_hw_queue
                                    insert rq to hctx->dispatch
                                    due to RESTART is still set, do nothing
      
      To fix this, replace the blk_mq_run_hw_queue() in
      mq_flush_data_end_io() with blk_mq_sched_restart(), which checks
      and clears the RESTART flag.
      
      Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers)
      Reported-by: Florian Stecker <m19@florianstecker.de>
      Tested-by: Florian Stecker <m19@florianstecker.de>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      85bd6e61
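
      A sketch of the changed completion step, assuming the 5.0-era
      helpers (blk_mq_sched_restart() is block-layer internal); the
      flush-sequence bookkeeping around it is elided:

          #include <linux/blk-mq.h>

          static void mq_flush_data_end_io_sketch(struct blk_mq_hw_ctx *hctx)
          {
                  /*
                   * blk_mq_run_hw_queue(hctx, true) does nothing while
                   * the RESTART flag is still set; blk_mq_sched_restart()
                   * checks and clears the flag and then reruns the hw
                   * queue, so a flush request parked on hctx->dispatch
                   * gets issued again.
                   */
                  blk_mq_sched_restart(hctx);
          }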
  8. 16 Nov 2018, 2 commits
  9. 08 Nov 2018, 4 commits
  10. 14 Oct 2018, 1 commit
  11. 09 Jun 2018, 1 commit
  12. 06 Jun 2018, 1 commit
  13. 05 Nov 2017, 3 commits
    • blk-mq: don't allocate driver tag upfront for flush rq · 923218f6
      Authored by Ming Lei
      The idea behind it is simple:
      
      1) for the none scheduler, the driver tag has to be borrowed for
         the flush rq, otherwise we may run out of tags and cause an I/O
         hang. And get/put of the driver tag is actually a noop for
         none, so reordering tags isn't necessary at all.
      
      2) for a real I/O scheduler, we need not allocate a driver tag
         upfront for the flush rq. It works just fine to follow the same
         approach as for normal requests: allocate the driver tag for
         each rq just before calling ->queue_rq().
      
      One driver-visible change is that the driver tag isn't shared
      across the flush request sequence. That won't be a problem, since
      we have always done it this way in the legacy path.
      
      The flush rq then need not be treated specially with respect to
      get/put of the driver tag. This cleans up the code: for instance,
      reorder_tags_to_front() can be removed, and we needn't worry about
      request ordering in the dispatch list to avoid I/O deadlock.
      
      Also we have to put the driver tag before requeueing.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      923218f6
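
      A sketch of the just-in-time tag policy this enables. It assumes
      the one-argument form of the block-internal helper
      blk_mq_get_driver_tag() (its exact signature has varied across
      releases); the dispatch wrapper name is hypothetical:

          static bool dispatch_rq_sketch(struct blk_mq_hw_ctx *hctx,
                                         struct request *rq)
          {
                  struct blk_mq_queue_data bd = { .rq = rq, .last = true };

                  /* Flush rqs are no longer special: grab the driver tag
                   * just before ->queue_rq(), as for normal requests. */
                  if (!blk_mq_get_driver_tag(rq))
                          return false;   /* out of tags, try again later */

                  return hctx->queue->mq_ops->queue_rq(hctx, &bd) == BLK_STS_OK;
          }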
    • blk-flush: use blk_mq_request_bypass_insert() · 598906f8
      Authored by Ming Lei
      In the following patch, we will use RQF_FLUSH_SEQ to decide:
      
      1) if the flag isn't set, the flush rq needs to be inserted via
      blk_insert_flush();
      
      2) otherwise, the flush rq needs to be dispatched directly, since
      it is already inside the flush machinery.
      
      So we use blk_mq_request_bypass_insert() for requests that bypass
      the flush machinery, just like the legacy path did.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      598906f8
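
      A sketch of that decision, assuming the two-argument
      (rq, run_queue) form of blk_mq_request_bypass_insert() from this
      era; both helpers are block-layer internal:

          /* Sketch: route a flush-class request at insert time. */
          static void insert_flush_sketch(struct request *rq)
          {
                  if (rq->rq_flags & RQF_FLUSH_SEQ) {
                          /* Already owned by the flush state machine:
                           * go straight to the dispatch list, bypassing
                           * the scheduler. */
                          blk_mq_request_bypass_insert(rq, false);
                          return;
                  }
                  /* First sighting: hand it to the flush machinery. */
                  blk_insert_flush(rq);
          }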
    • blk-flush: don't run queue for requests bypassing flush · 9c71c83c
      Authored by Ming Lei
      blk_insert_flush() should only insert the request, since a queue
      run always follows it.
      
      In the case of bypassing the flush machinery, we don't need to run
      the queue here, because every blk_insert_flush() is already
      followed by one queue run.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9c71c83c
  14. 26 Aug 2017, 1 commit
  15. 24 Aug 2017, 1 commit
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Authored by Christoph Hellwig
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue, and is usually only available when the block device
      node is open.  Other callers need to explicitly create one (e.g. the
      lightnvm passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      74d46992
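
      A sketch of the new addressing, assuming the bio fields this patch
      introduces (bi_disk, bi_partno) and modeling the bio_set_dev()-style
      helper that replaced direct assignments to bi_bdev:

          #include <linux/bio.h>
          #include <linux/genhd.h>

          /* Sketch: point a bio at disk + partition, not a block_device. */
          static void bio_set_dev_sketch(struct bio *bio,
                                         struct block_device *bdev)
          {
                  bio->bi_disk   = bdev->bd_disk;   /* one per block device */
                  bio->bi_partno = bdev->bd_partno; /* used for remapping in
                                                       generic_make_request */
          }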
  16. 21 Jun 2017, 1 commit
  17. 09 Jun 2017, 1 commit
    • block: introduce new block status code type · 2a842aca
      Authored by Christoph Hellwig
      Currently we use normal Linux errno values in the block layer, and
      while we accept any error, a few have overloaded magic meanings.
      This patch instead introduces a new blk_status_t value that holds
      block layer specific status codes and explicitly documents their
      meaning.  Helpers to convert from and to the previous special
      meanings are provided for now, but I suspect we want to get rid of
      them in the long run - those drivers that have an errno input
      (e.g. networking) usually get errnos that don't know about the
      special block layer overloads, and similarly returning them to
      userspace will usually return something that strictly speaking
      isn't correct for file system operations; but that's left as an
      exercise for later.
      
      For now the set of errors is a very limited set that closely
      corresponds to the previous overloaded errno values, but there is
      some low-hanging fruit to improve it.
      
      blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
      typechecking, so that we can easily catch places passing the wrong values.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2a842aca
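
      A sketch of the type and a conversion helper; the shape matches
      the upstream definitions of the era (u8 plus sparse __bitwise),
      but the mapping shown is deliberately simplified:

          #include <linux/types.h>
          #include <linux/errno.h>

          typedef u8 __bitwise blk_status_t;   /* sparse-checked status */

          #define BLK_STS_OK      0
          #define BLK_STS_IOERR   ((__force blk_status_t)10)

          /* Legacy bridge: translate a block status back to an errno. */
          static inline int blk_status_to_errno_sketch(blk_status_t status)
          {
                  return status == BLK_STS_OK ? 0 : -EIO;  /* simplified */
          }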
  18. 20 Apr 2017, 1 commit
  19. 25 Mar 2017, 1 commit
  20. 18 Feb 2017, 1 commit
    • block: don't defer flushes on blk-mq + scheduling · 7520872c
      Authored by Jens Axboe
      For blk-mq with scheduling, we can potentially end up with ALL
      driver tags assigned and sitting on the flush queues. If we defer
      because of an in-flight data request, then we can deadlock if that
      data request doesn't already have a tag assigned.
      
      This fixes a deadlock when running the xfs/297 xfstest, where
      thousands of syncs can cause the drive queue to stall.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      7520872c
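
      The mechanism isn't shown in the message; the fix is to skip the
      "hold back the flush while data requests are in flight" heuristic
      when blk-mq runs with a scheduler. A sketch under that assumption,
      with field names mirroring blk_kick_flush() and an illustrative
      timeout constant:

          /* Sketch: in blk_kick_flush(), only defer when it is safe. */
          static bool kick_flush_sketch(struct request_queue *q,
                                        struct blk_flush_queue *fq)
          {
                  /*
                   * Deferring batches more flushes together, but with
                   * blk-mq + an I/O scheduler the in-flight data request
                   * may itself be waiting for a driver tag, so deferring
                   * can deadlock: never defer in that configuration.
                   */
                  if (!list_empty(&fq->flush_data_in_flight) &&
                      !(q->mq_ops && q->elevator) &&
                      time_before(jiffies,
                                  fq->flush_pending_since + HZ / 100))
                          return false;   /* defer: flush not kicked yet */

                  /* ... build and issue the flush request ... */
                  return true;
          }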
  21. 01 Feb 2017, 1 commit
    • block: fold cmd_type into the REQ_OP_ space · aebf526b
      Authored by Christoph Hellwig
      Instead of keeping two levels of indirection for request types, fold
      it all into the operations.  The little caveat here is that
      previously cmd_type only applied to struct request, while the
      request and bio op fields were set to plain REQ_OP_READ/WRITE even
      for passthrough operations.
      
      Instead this patch adds new REQ_OP_* values for SCSI passthrough and
      driver private requests, although it has to add two for each so that
      we can communicate the data in/out nature of the request.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      aebf526b
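
      A sketch of the op space after the fold, using the upstream names;
      the values match that era's blk_types.h but should be treated as
      illustrative, and the predicate is a simplified stand-in for
      blk_rq_is_passthrough():

          /* Sketch: ops cover passthrough too; _OUT means data to device. */
          enum req_opf_sketch {
                  REQ_OP_READ     = 0,
                  REQ_OP_WRITE    = 1,
                  /* ... regular FS ops ... */
                  REQ_OP_SCSI_IN  = 32,  /* SCSI passthrough, data in */
                  REQ_OP_SCSI_OUT = 33,  /* SCSI passthrough, data out */
                  REQ_OP_DRV_IN   = 34,  /* driver private, data in */
                  REQ_OP_DRV_OUT  = 35,  /* driver private, data out */
          };

          /* cmd_type is gone: "not an FS request" is a property of the op. */
          static inline bool is_passthrough_sketch(unsigned int op)
          {
                  return op >= REQ_OP_SCSI_IN;
          }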
  22. 28 Jan 2017, 2 commits
  23. 18 Jan 2017, 1 commit
  24. 10 Dec 2016, 1 commit
  25. 09 Nov 2016, 1 commit
  26. 03 Nov 2016, 1 commit
  27. 01 Nov 2016, 1 commit
  28. 28 Oct 2016, 2 commits
    • block: better op and flags encoding · ef295ecf
      Authored by Christoph Hellwig
      Now that we don't need the common flags to overflow outside the
      range of a 32-bit type, we can encode them the same way for both
      the bio and request fields.  This additionally allows us to place
      the operation first (and make some room for more ops while we're at
      it) and to stop having to shift the operation values around.
      
      In addition this allows passing around only one value in the block
      layer instead of two (and eventually also in the file systems, but
      we can do that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32 bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for
      now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ef295ecf
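
      A sketch of the unified encoding, matching the shape of the
      upstream macros (low byte = operation, remaining bits = flags),
      with sketch-suffixed names to mark it as illustrative:

          #include <linux/types.h>

          #define REQ_OP_BITS_SKETCH  8
          #define REQ_OP_MASK_SKETCH  ((1u << REQ_OP_BITS_SKETCH) - 1)

          /* One value carries both: bi_opf / cmd_flags = op | flags. */
          static inline unsigned int op_from_opf(u32 opf)
          {
                  return opf & REQ_OP_MASK_SKETCH;  /* like bio_op()/req_op() */
          }

          static inline u32 make_opf(unsigned int op, u32 flags)
          {
                  return op | flags;                /* flags start at bit 8 */
          }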
    • block: split out request-only flags into a new namespace · e8064021
      Authored by Christoph Hellwig
      A lot of the REQ_* flags are only used on struct request, and only
      of use to the block layer and a few drivers that dig into struct
      request internals.
      
      This patch adds a new req_flags_t rq_flags field to struct request
      for them, and thus dramatically shrinks the number of common
      request flags.  It also removes the unfortunate situation where we
      have to fit the fields from the same enum into 32 bits for struct
      bio and 64 bits for struct request.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e8064021
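
      A sketch of the new namespace, following the upstream pattern; the
      flags listed are a small sample and the struct is a hypothetical
      cut-down of struct request:

          #include <linux/types.h>

          typedef __u32 __bitwise req_flags_t;  /* request-only flag space */

          #define RQF_SORTED     ((__force req_flags_t)(1 << 0))
          #define RQF_STARTED    ((__force req_flags_t)(1 << 1))
          #define RQF_FLUSH_SEQ  ((__force req_flags_t)(1 << 4))

          struct request_sketch {
                  unsigned int  cmd_flags;  /* op + common REQ_* flags */
                  req_flags_t   rq_flags;   /* internal RQF_* flags */
          };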
  29. 26 Oct 2016, 1 commit
  30. 15 Sep 2016, 1 commit
  31. 08 Jun 2016, 2 commits