1. 18 1月, 2018 5 次提交
    • M
      blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request · 9e97d295
      Mike Snitzer 提交于
      After commit:
      
      923218f6 ("blk-mq: don't allocate driver tag upfront for flush rq")
      
      we no longer use the 'can_block' argument in
      blk_mq_sched_insert_request(). Kill it.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      
      Added actual commit message as to why it's being removed.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9e97d295
    • M
      blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Ming Lei 提交于
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      
      1) This way isn't efficient enough because the hctx spinlock is always
      used.
      
      2) With blk_insert_cloned_request(), we completely bypass underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to underlying queue being busy) the dm-rq elevator will
      not be able to provide effective IO merging (as a side-effect of dm-rq
      currently blindly destaging a request from its elevator only to requeue
      it after a delay, which kills any opportunity for merging).  This
      obviously causes very bad sequential IO performance.
      
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request.  Whereby preserving the opportunity to
      merge IO.
      
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      396eaf21
    • M
      blk-mq: factor out a few helpers from __blk_mq_try_issue_directly · 0f95549c
      Mike Snitzer 提交于
      No functional change.  Just makes code flow more logically.
      
      In following commit, __blk_mq_try_issue_directly() will be used to
      return the dispatch result (blk_status_t) to DM.  DM needs this
      information to improve IO merging.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0f95549c
    • M
      blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk · 7df938fb
      Ming Lei 提交于
      We know this WARN_ON is harmless and in reality it may be trigged,
      so convert it to printk() and dump_stack() to avoid to confusing
      people.
      
      Also add comment about two releated races here.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7df938fb
    • M
      blk-mq: make sure hctx->next_cpu is set correctly · 7bed4595
      Ming Lei 提交于
      When hctx->next_cpu is set from possible online CPUs, there is one
      race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
      break workqueue.
      
      The race can be triggered in the following two sitations:
      
      1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
      to dispatch requests from the DEAD cpu context, but at that
      time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
      CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
      a bad value.
      
      2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
      should be run on the other CPU A, then CPU A may become offline at the
      same time and all CPUs in hctx->cpumask become offline.
      
      This patch deals with this issue by re-selecting next CPU, and making
      sure it is set correctly.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: N"jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: N"jianchao.wang" <jianchao.w.wang@oracle.com>
      Fixes: 20e4d813 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7bed4595
  2. 15 1月, 2018 5 次提交
    • D
      blk_rq_map_user_iov: fix error override · 69e0927b
      Douglas Gilbert 提交于
      During stress tests by syzkaller on the sg driver the block layer
      infrequently returns EINVAL. Closer inspection shows the block
      layer was trying to return ENOMEM (which is much more
      understandable) but for some reason overroad that useful error.
      
      Patch below does not show this (unchanged) line:
         ret =__blk_rq_map_user_iov(rq, map_data, &i, gfp_mask, copy);
      That 'ret' was being overridden when that function failed.
      Signed-off-by: NDouglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      69e0927b
    • M
      block: allow gendisk's request_queue registration to be deferred · fa70d2e2
      Mike Snitzer 提交于
      Since I can remember DM has forced the block layer to allow the
      allocation and initialization of the request_queue to be distinct
      operations.  Reason for this is block/genhd.c:add_disk() has requires
      that the request_queue (and associated bdi) be tied to the gendisk
      before add_disk() is called -- because add_disk() also deals with
      exposing the request_queue via blk_register_queue().
      
      DM's dynamic creation of arbitrary device types (and associated
      request_queue types) requires the DM device's gendisk be available so
      that DM table loads can establish a master/slave relationship with
      subordinate devices that are referenced by loaded DM tables -- using
      bd_link_disk_holder().  But until these DM tables, and their associated
      subordinate devices, are known DM cannot know what type of request_queue
      it needs -- nor what its queue_limits should be.
      
      This chicken and egg scenario has created all manner of problems for DM
      and, at times, the block layer.
      
      Summary of changes:
      
      - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
        that drivers may use to add a disk without also calling
        blk_register_queue().  Driver must call blk_register_queue() once its
        request_queue is fully initialized.
      
      - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
        is not set.  It won't be set if driver used add_disk_no_queue_reg()
        but driver encounters an error and must del_gendisk() before calling
        blk_register_queue().
      
      - Export blk_register_queue().
      
      These changes allow DM to use add_disk_no_queue_reg() to anchor its
      gendisk as the "master" for master/slave relationships DM must establish
      with subordinate devices referenced in DM tables that get loaded.  Once
      all "slave" devices for a DM device are known its request_queue can be
      properly initialized and then advertised via sysfs -- important
      improvement being that no request_queue resource initialization
      performed by blk_register_queue() is missed for DM devices anymore.
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fa70d2e2
    • M
      block: properly protect the 'queue' kobj in blk_unregister_queue · 667257e8
      Mike Snitzer 提交于
      The original commit e9a823fb (block: fix warning when I/O elevator
      is changed as request_queue is being removed) is pretty conflated.
      "conflated" because the resource being protected by q->sysfs_lock isn't
      the queue_flags (it is the 'queue' kobj).
      
      q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
      from racing with blk_unregister_queue():
      1) By holding q->sysfs_lock first, __elevator_change() can complete
      before a racing blk_unregister_queue().
      2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
      in case elv_iosched_store() loses the race with blk_unregister_queue(),
      it needs a way to know the 'queue' kobj isn't there.
      
      Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
      held until after the 'queue' kobj is removed.
      
      To do so blk_mq_unregister_dev() must not also take q->sysfs_lock.  So
      rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().
      
      Also, blk_unregister_queue() should use q->queue_lock to protect against
      any concurrent writes to q->queue_flags -- even though chances are the
      queue is being cleaned up so no concurrent writes are likely.
      
      Fixes: e9a823fb ("block: fix warning when I/O elevator is changed as request_queue is being removed")
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      667257e8
    • M
      block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN · bc8d062c
      Mike Snitzer 提交于
      device_add_disk() will only call bdi_register_owner() if
      !GENHD_FL_HIDDEN, so it follows that del_gendisk() should only call
      bdi_unregister() if !GENHD_FL_HIDDEN.
      
      Found with code inspection.  bdi_unregister() won't do any harm if
      bdi_register_owner() wasn't used but best to avoid the unnecessary
      call to bdi_unregister().
      
      Fixes: 8ddcd653 ("block: introduce GENHD_FL_HIDDEN")
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bc8d062c
    • J
      blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() · bf9ae8c5
      Jens Axboe 提交于
      A previous commit moved the clearing of rq->rq_flags later,
      but we may have already set RQF_MQ_INFLIGHT when that happens.
      Ensure that we correctly initialize rq->rq_flags to the
      right value.
      
      This is based on an original fix by Ming, just rewritten to not
      require a conditional.
      
      Fixes: 7c3fb70f ("block: rearrange a few request fields for better cache layout")
      Reviewed-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bf9ae8c5
  3. 13 1月, 2018 2 次提交
  4. 12 1月, 2018 1 次提交
  5. 11 1月, 2018 9 次提交
  6. 10 1月, 2018 14 次提交
    • C
      block, bfq: fix occurrences of request finish method's old name · 8993d445
      Chiara Bruschi 提交于
      Commit '7b9e9361' ("blk-mq-sched: unify request finished methods")
      changed the old name of current bfq_finish_request method, but left it
      unchanged elsewhere in the code (related comments, part of function
      name bfq_put_rq_priv_body).
      
      This commit fixes all occurrences of the old name of this method by
      changing them into the current name.
      
      Fixes: 7b9e9361 ("blk-mq-sched: unify request finished methods")
      Reviewed-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NFederico Motta <federico@willer.it>
      Signed-off-by: NChiara Bruschi <bruschi.chiara@outlook.it>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8993d445
    • M
      Revert "block: blk-merge: try to make front segments in full size" · b4b6cb61
      Ming Lei 提交于
      This reverts commit a2d37968.
      
      If max segment size isn't 512-aligned, this patch won't work well.
      
      Also once multipage bvec is enabled, adjacent bvecs won't be physically
      contiguous if page is added via bio_add_page(), so we don't need this
      kind of complicated logic.
      Reported-by: NDmitry Osipenko <digetx@gmail.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b4b6cb61
    • J
      bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED · 8abef10b
      Jens Axboe 提交于
      It's not available if we don't have group io scheduling set, and
      there's no need to call it.
      
      Fixes: 0d52af59 ("block, bfq: release oom-queue ref to root group on exit")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8abef10b
    • B
      block: Fix kernel-doc warnings reported when building with W=1 · aa98192d
      Bart Van Assche 提交于
      Commit 3a025e1d ("Add optional check for bad kernel-doc comments")
      causes W=1 the kernel-doc script to be run and thereby causes several
      new warnings to appear when building the kernel with W=1. Fix the
      block layer kernel-doc headers such that the block layer again builds
      cleanly with W=1.
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      aa98192d
    • B
      blk-mq: Fix spelling in a source code comment · ee3e4de5
      Bart Van Assche 提交于
      Change "nedeing" into "needing" and "caes" into "cases".
      
      Fixes: commit f906a6a0 ("blk-mq: improve tag waiting setup for non-shared tags")
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ee3e4de5
    • J
      blk-mq: silence false positive warnings in hctx_unlock() · 08b5a6e2
      Jens Axboe 提交于
      In some stupider versions of gcc, it complains:
      
      block/blk-mq.c: In function ‘blk_mq_complete_request’:
      ./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        __srcu_read_unlock(sp, idx);
        ^
      block/blk-mq.c:620:6: note: ‘srcu_idx’ was declared here
        int srcu_idx;
            ^
      
      which is completely bogus, since we only use srcu_idx when
      hctx->flags & BLK_MQ_F_BLOCKING is set, and that's the case where
      hctx_lock() has initialized it.
      
      Just set it to '0' in the normal path in hctx_lock() to silence
      this annoying warning.
      
      Fixes: 04ced159 ("blk-mq: move hctx lock/unlock into a helper")
      Fixes: 5197c05e ("blk-mq: protect completion path with RCU")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      08b5a6e2
    • T
      blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu · 05707b64
      Tejun Heo 提交于
      The RCU protection has been expanded to cover both queueing and
      completion paths making ->queue_rq_srcu a misnomer.  Rename it to
      ->srcu as suggested by Bart.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      05707b64
    • T
      blk-mq: remove REQ_ATOM_STARTED · 5a61c363
      Tejun Heo 提交于
      After the recent updates to use generation number and state based
      synchronization, we can easily replace REQ_ATOM_STARTED usages by
      adding an extra state to distinguish completed but not yet freed
      state.
      
      Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
      blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
      left and is removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5a61c363
    • T
      blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq · 634f9e46
      Tejun Heo 提交于
      After the recent updates to use generation number and state based
      synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
      to avoid firing the same timeout multiple times.
      
      Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
      RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
      times.  This removes atomic bitops from hot paths too.
      
      v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
      
      v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      634f9e46
    • T
      blk-mq: make blk_abort_request() trigger timeout path · 358f70da
      Tejun Heo 提交于
      With issue/complete and timeout paths now using the generation number
      and state based synchronization, blk_abort_request() is the only one
      which depends on REQ_ATOM_COMPLETE for arbitrating completion.
      
      There's no reason for blk_abort_request() to be a completely separate
      path.  This patch makes blk_abort_request() piggyback on the timeout
      path instead of trying to terminate the request directly.
      
      This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.
      
      Note that this makes blk_abort_request() asynchronous - it initiates
      abortion but the actual termination will happen after a short while,
      even when the caller owns the request.  AFAICS, SCSI and ATA should be
      fine with that and I think mtip32xx and dasd should be safe but not
      completely sure.  It'd be great if people who know the drivers take a
      look.
      
      v2: - Add comment explaining the lack of synchronization around
            ->deadline update as requested by Bart.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Asai Thambi SP <asamymuthupa@micron.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      358f70da
    • T
      blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE · 67818d25
      Tejun Heo 提交于
      blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
      REQ_ATOM_COMPLETE to determine the request state.  Both uses are
      speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
      equivalent results.  Replace the tests.  This will allow removing
      REQ_ATOM_COMPLETE usages from blk-mq.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      67818d25
    • T
      blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Tejun Heo 提交于
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1d9bd516
    • T
      blk-mq: protect completion path with RCU · 5197c05e
      Tejun Heo 提交于
      Currently, blk-mq protects only the issue path with RCU.  This patch
      puts the completion path under the same RCU protection.  This will be
      used to synchronize issue/completion against timeout by later patches,
      which will also add the comments.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5197c05e
    • J
      blk-mq: move hctx lock/unlock into a helper · 04ced159
      Jens Axboe 提交于
      Move the RCU vs SRCU logic into lock/unlock helpers, which makes
      the actual functional bits within the locked region much easier
      to read.
      
      tj: Reordered in front of timeout revamp patches and added the missing
          blk_mq_run_hw_queue() conversion.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      04ced159
  7. 09 1月, 2018 3 次提交
    • P
      block, bfq: release oom-queue ref to root group on exit · 0d52af59
      Paolo Valente 提交于
      On scheduler init, a reference to the root group, and a reference to
      its corresponding blkg are taken for the oom queue. Yet these
      references are not released on scheduler exit, which prevents these
      objects from be freed. This commit adds the missing reference
      releases.
      Reported-by: NDavide Ferrari <davideferrari8@gmail.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0d52af59
    • P
      block, bfq: put async queues for root bfq groups too · 52257ffb
      Paolo Valente 提交于
      For each pair [device for which bfq is selected as I/O scheduler,
      group in blkio/io], bfq maintains a corresponding bfq group. Each such
      bfq group contains a set of async queues, with each async queue
      created on demand, i.e., when some I/O request arrives for it.  On
      creation, an async queue gets an extra reference, to make sure that
      the queue is not freed as long as its bfq group exists.  Accordingly,
      to allow the queue to be freed after the group exited, this extra
      reference must released on group exit.
      
      The above holds also for a bfq root group, i.e., for the bfq group
      corresponding to the root blkio/io root for a given device. Yet, by
      mistake, the references to the existing async queues of a root group
      are not released when the latter exits. This causes a memory leak when
      the instance of bfq for a given device exits. In a similar vein,
      bfqg_stats_xfer_dead is not executed for a root group.
      
      This commit fixes bfq_pd_offline so that the latter executes the above
      missing operations for a root group too.
      Reported-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reported-by: NGuoqing Jiang <gqjiang@suse.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NDavide Ferrari <davideferrari8@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      52257ffb
    • M
      blk-mq: fix kernel oops in blk_mq_tag_idle() · 8ab0b7dc
      Ming Lei 提交于
      HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
      then we need to check it before calling blk_mq_tag_idle(), otherwise
      the following kernel oops can be triggered, so fix it by checking if
      the hw queue is unmapped since it doesn't make sense to idle the tags
      any more after hw queues are unmapped.
      
      [  440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
      [  440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
      [  440.788359] RIP: 0010:[<ffffffffb730e2b4>]  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  440.798697] RSP: 0018:ffff893bf9bcbd10  EFLAGS: 00010286
      [  440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
      [  440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
      [  440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
      [  440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
      [  440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
      [  440.849988] FS:  0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
      [  440.859955] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
      [  440.876169] Call Trace:
      [  440.879818]  [<ffffffffb7309d68>] blk_mq_exit_hctx+0xd8/0xe0
      [  440.887051]  [<ffffffffb730dc40>] blk_mq_free_queue+0xf0/0x160
      [  440.894465]  [<ffffffffb72ff679>] blk_cleanup_queue+0xd9/0x150
      [  440.901881]  [<ffffffffc08a802b>] nvme_ns_remove+0x5b/0xb0 [nvme_core]
      [  440.910068]  [<ffffffffc08a811b>] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
      [  440.919026]  [<ffffffffc08b817b>] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
      [  440.928079]  [<ffffffffc08b8237>] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
      [  440.937126]  [<ffffffffb70ab58a>] process_one_work+0x17a/0x440
      [  440.944517]  [<ffffffffb70ac3a8>] worker_thread+0x278/0x3c0
      [  440.951607]  [<ffffffffb70ac130>] ? manage_workers.isra.24+0x2a0/0x2a0
      [  440.959760]  [<ffffffffb70b352f>] kthread+0xcf/0xe0
      [  440.966055]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.973715]  [<ffffffffb76d8658>] ret_from_fork+0x58/0x90
      [  440.980586]  [<ffffffffb70b3460>] ? insert_kthread_work+0x40/0x40
      [  440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 <f0> ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
      [  441.011620] RIP  [<ffffffffb730e2b4>] __blk_mq_tag_idle+0x24/0x40
      [  441.019301]  RSP <ffff893bf9bcbd10>
      [  441.024052] CR2: 0000000000000008
      Reported-by: NZhang Yi <yizhan@redhat.com>
      Tested-by: NZhang Yi <yizhan@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8ab0b7dc
  8. 07 1月, 2018 1 次提交