1. 01 11月, 2017 7 次提交
    • M
      blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE · 1f460b63
      Ming Lei 提交于
      SCSI restarts its queue in scsi_end_request() automatically, so we don't
      need to handle this case in blk-mq.
      
      Especailly any request won't be dequeued in this case, we needn't to
      worry about IO hang caused by restart vs. dispatch.
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1f460b63
    • M
      blk-mq: don't handle TAG_SHARED in restart · 358a3a6b
      Ming Lei 提交于
      Now restart is used in the following cases, and TAG_SHARED is for
      SCSI only.
      
      1) .get_budget() returns BLK_STS_RESOURCE
      - if resource in target/host level isn't satisfied, this SCSI device
      will be added in shost->starved_list, and the whole queue will be rerun
      (via SCSI's built-in RESTART) in scsi_end_request() after any request
      initiated from this host/targe is completed. Forget to mention, host level
      resource can't be an issue for blk-mq at all.
      
      - the same is true if resource in the queue level isn't satisfied.
      
      - if there isn't outstanding request on this queue, then SCSI's RESTART
      can't work(blk-mq's can't work too), and the queue will be run after
      SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's
      RESTART when this request is finished
      
      2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
      - if there isn't onprogressing request on this queue, the queue
      will be run after SCSI_QUEUE_DELAY
      
      - otherwise, SCSI's RESTART covers the rerun.
      
      3) blk_mq_get_driver_tag() failed
      - BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver
      allocation.
      
      In one word, SCSI's built-in RESTART is enough to cover the queue
      rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart.
      
      In my test on scsi_debug(8 luns), this patch improves IOPS by 20% ~ 30% when
      running I/O on these 8 luns concurrently.
      
      Aslo Roman Pen reported the current RESTART is very expensive especialy
      when there are lots of LUNs attached in one host, such as in his
      test, RESTART causes half of IOPS be cut.
      
      Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
      Fixes: 6d8c6c0f ("blk-mq: Restart a single queue if tag sets are shared")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      358a3a6b
    • M
      blk-mq-sched: improve dispatching from sw queue · b347689f
      Ming Lei 提交于
      SCSI devices use host-wide tagset, and the shared driver tag space is
      often quite big. However, there is also a queue depth for each lun(
      .cmd_per_lun), which is often small, for example, on both lpfc and
      qla2xxx, .cmd_per_lun is just 3.
      
      So lots of requests may stay in sw queue, and we always flush all
      belonging to same hw queue and dispatch them all to driver.
      Unfortunately it is easy to cause queue busy because of the small
      .cmd_per_lun.  Once these requests are flushed out, they have to stay in
      hctx->dispatch, and no bio merge can happen on these requests, and
      sequential IO performance is harmed.
      
      This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request
      from a sw queue, so that we can dispatch them in scheduler's way. We can
      then avoid dequeueing too many requests from sw queue, since we don't
      flush ->dispatch completely.
      
      This patch improves dispatching from sw queue by using the .get_budget
      and .put_budget callbacks.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b347689f
    • M
      blk-mq: introduce .get_budget and .put_budget in blk_mq_ops · de148297
      Ming Lei 提交于
      For SCSI devices, there is often a per-request-queue depth, which needs
      to be respected before queuing one request.
      
      Currently blk-mq always dequeues the request first, then calls
      .queue_rq() to dispatch the request to lld. One obvious issue with this
      approach is that I/O merging may not be successful, because when the
      per-request-queue depth can't be respected, .queue_rq() has to return
      BLK_STS_RESOURCE, and then this request has to stay in hctx->dispatch
      list. This means it never gets a chance to be merged with other IO.
      
      This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
      then we can try to get reserved budget first before dequeuing request.
      If the budget for queueing I/O can't be satisfied, we don't need to
      dequeue request at all. Hence the request can be left in the IO
      scheduler queue, for more merging opportunities.
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      de148297
    • M
      block: kyber: check if there are requests in ctx in kyber_has_work() · 63ba8e31
      Ming Lei 提交于
      There may be request in sw queue, and not fetched to domain queue
      yet, so check it in kyber_has_work().
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      63ba8e31
    • M
      blk-mq-sched: move actual dispatching into one helper · caf8eb0d
      Ming Lei 提交于
      So that it becomes easy to support to dispatch from sw queue in the
      following patch.
      
      No functional change.
      Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      caf8eb0d
    • M
      blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch · 5e3d02bb
      Ming Lei 提交于
      When the hw queue is busy, we shouldn't take requests from the scheduler
      queue any more, otherwise it is difficult to do IO merge.
      
      This patch fixes the awful IO performance on some SCSI devices(lpfc,
      qla2xxx, ...) when mq-deadline/kyber is used by not taking requests if
      hw queue is busy.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5e3d02bb
  2. 31 10月, 2017 1 次提交
  3. 26 10月, 2017 6 次提交
  4. 25 10月, 2017 1 次提交
  5. 18 10月, 2017 1 次提交
    • O
      kyber: fix hang on domain token wait queue · 8cf46660
      Omar Sandoval 提交于
      When we're getting a domain token, if we fail to get a token on our
      first attempt, we put the current hardware queue on a wait queue and
      then try again just in case a token was freed after our initial attempt
      but before we got on the wait queue. If this second attempt succeeds, we
      currently leave the hardware queue on the wait queue. Usually this is
      okay; we'll just run the hardware queue one extra time when another
      token is freed. However, if the hardware queue doesn't have any other
      requests waiting, then when it it gets the extra wakeup, it won't have
      anything to free and therefore won't wake up any other hardware queues.
      If tokens are limited, then we won't make forward progress and the
      device will hang.
      Reported-by: NBin Zha <zhabin.zb@alibaba-inc.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8cf46660
  6. 17 10月, 2017 1 次提交
  7. 11 10月, 2017 3 次提交
  8. 10 10月, 2017 1 次提交
  9. 09 10月, 2017 2 次提交
    • P
      block, bfq: fix unbalanced decrements of burst size · 99fead8d
      Paolo Valente 提交于
      The commit "block, bfq: decrease burst size when queues in burst
      exit" introduced the decrement of burst_size on the removal of a
      bfq_queue from the burst list. Unfortunately, this decrement can
      happen to be performed even when burst size is already equal to 0,
      because of unbalanced decrements. A description follows of the cause
      of these unbalanced decrements, namely a wrong assumption, and of the
      way how this wrong assumption leads to unbalanced decrements.
      
      The wrong assumption is that a bfq_queue can exit only if the process
      associated with the bfq_queue has exited. This is false, because a
      bfq_queue, say Q, may exit also as a consequence of a merge with
      another bfq_queue. In this case, Q exits because the I/O of its
      associated process has been redirected to another bfq_queue.
      
      The decrement unbalance occurs because Q may then be re-created after
      a split, and added back to the current burst list, *without*
      incrementing burst_size. burst_size is not incremented because Q is
      not a new bfq_queue added to the burst list, but a bfq_queue only
      temporarily removed from the list, and, before the commit "bfq-sq,
      bfq-mq: decrease burst size when queues in burst exit", burst_size was
      not decremented when Q was removed.
      
      This commit addresses this issue by just checking whether the exiting
      bfq_queue is a merged bfq_queue, and, in that case, not decrementing
      burst_size. Unfortunately, this still leaves room for unbalanced
      decrements, in the following rarer case: on a split, the bfq_queue
      happens to be inserted into a different burst list than that it was
      removed from when merged. If this happens, the number of elements in
      the new burst list becomes higher than burst_size (by one). When the
      bfq_queue then exits, it is of course not in a merged state any
      longer, thus burst_size is decremented, which results in an unbalanced
      decrement.  To handle this sporadic, unlucky case in a simple way,
      this commit also checks that burst_size is larger than 0 before
      decrementing it.
      
      Finally, this commit removes an useless, extra check: the check that
      the bfq_queue is sync, performed before checking whether the bfq_queue
      is in the burst list. This extra check is redundant, because only sync
      bfq_queues can be inserted into the burst list.
      
      Fixes: 7cb04004 ("block, bfq: decrease burst size when queues in burst exit")
      Reported-by: NPhilip Müller <philm@manjaro.org>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: NPhilip Müller <philm@manjaro.org>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NLee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      99fead8d
    • L
      block,bfq: Disable writeback throttling · b5dc5d4d
      Luca Miccio 提交于
      Similarly to CFQ, BFQ has its write-throttling heuristics, and it
      is better not to combine them with further write-throttling
      heuristics of a different nature.
      So this commit disables write-back throttling for a device if BFQ
      is used as I/O scheduler for that device.
      Signed-off-by: NLuca Miccio <lucmiccio@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NLee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b5dc5d4d
  10. 07 10月, 2017 2 次提交
  11. 06 10月, 2017 1 次提交
  12. 05 10月, 2017 2 次提交
    • J
      blk-mq: document the need to have STARTED and COMPLETED share a byte · fc13457f
      Jens Axboe 提交于
      For memory ordering guarantees on stores, we need to ensure that
      these two bits share the same byte of storage in the unsigned
      long. Add a comment as to why, and a BUILD_BUG_ON() to ensure that
      we don't violate this requirement.
      Suggested-by: NBoqun Feng <boqun.feng@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fc13457f
    • P
      blk-mq: attempt to fix atomic flag memory ordering · a7af0af3
      Peter Zijlstra 提交于
      Attempt to untangle the ordering in blk-mq. The patch introducing the
      single smp_mb__before_atomic() is obviously broken in that it doesn't
      clearly specify a pairing barrier and an obtained guarantee.
      
      The comment is further misleading in that it hints that the
      deadline store and the COMPLETE store also need to be ordered, but
      AFAICT there is no such dependency. However what does appear to be
      important is the clear happening _after_ the store, and that worked by
      pure accident.
      
      This clarifies blk_mq_start_request() -- we should not get there with
      STARTING set -- this simplifies the code and makes the barrier usage
      sane (the old code could be read to allow not having _any_ atomic after
      the barrier, in which case the barrier hasn't got anything to order). We
      then also introduce the missing pairing barrier for it.
      
      Also down-grade the barrier to smp_wmb(), this is cheaper for
      PowerPC/ARM and doesn't cost anything extra on x86.
      
      And it documents the STARTING vs COMPLETE ordering. Although I've not
      been entirely successful in reverse engineering the blk-mq state
      machine so there might still be more funnies around timeout vs
      requeue.
      
      If I got anything wrong, feel free to educate me by adding comments to
      clarify things ;-)
      
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Fixes: 538b7534 ("blk-mq: request deadline must be visible before marking rq as started")
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a7af0af3
  13. 03 10月, 2017 6 次提交
    • C
      block: move __elv_next_request to blk-core.c · 9c988374
      Christoph Hellwig 提交于
      No need to have this helper inline in a header.  Also drop the __ prefix.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9c988374
    • P
      block, bfq: decrease burst size when queues in burst exit · 7cb04004
      Paolo Valente 提交于
      If many queues belonging to the same group happen to be created
      shortly after each other, then the concurrent processes associated
      with these queues have typically a common goal, and they get it done
      as soon as possible if not hampered by device idling.  Examples are
      processes spawned by git grep, or by systemd during boot. As for
      device idling, this mechanism is currently necessary for weight
      raising to succeed in its goal: privileging I/O.  In view of these
      facts, BFQ does not provide the above queues with either weight
      raising or device idling.
      
      On the other hand, a burst of queue creations may be caused also by
      the start-up of a complex application. In this case, these queues need
      usually to be served one after the other, and as quickly as possible,
      to maximise responsiveness. Therefore, in this case the best strategy
      is to weight-raise all the queues created during the burst, i.e., the
      exact opposite of the strategy for the above case.
      
      To distinguish between the two cases, BFQ uses an empirical burst-size
      threshold, found through extensive tests and monitoring of daily
      usage. Only large bursts, i.e., burst with a size above this
      threshold, are considered as generated by a high number of parallel
      processes. In this respect, upstart-based boot proved to be rather
      hard to detect as generating a large burst of queue creations, because
      with upstart most of the queues created in a burst exit *before* the
      next queues in the same burst are created. To address this issue, I
      changed the burst-detection mechanism so as to not decrease the size
      of the current burst even if one of the queues in the burst is
      eliminated.
      
      Unfortunately, this missing decrease causes false positives on very
      fast systems: on the start-up of a complex application, such as
      libreoffice writer, so many queues are created, served and exited
      shortly after each other, that a large burst of queue creations is
      wrongly detected as occurring. These false positives just disappear if
      the size of a burst is decreased when one of the queues in the burst
      exits. This commit restores the missing burst-size decrease, relying
      of the fact that upstart is apparently unlikely to be used on systems
      running this and future versions of the kernel.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NMauro Andreolini <mauro.andreolini@unimore.it>
      Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: NMirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NLee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7cb04004
    • P
      block, bfq: let early-merged queues be weight-raised on split too · 894df937
      Paolo Valente 提交于
      A just-created bfq_queue, say Q, may happen to be merged with another
      bfq_queue on the very first invocation of the function
      __bfq_insert_request. In such a case, even if Q would clearly deserve
      interactive weight raising (as it has just been created), the function
      bfq_add_request does not make it to be invoked for Q, and thus to
      activate weight raising for Q. As a consequence, when the state of Q
      is saved for a possible future restore, after a split of Q from the
      other bfq_queue(s), such a state happens to be (unjustly)
      non-weight-raised. Then the bfq_queue will not enjoy any weight
      raising on the split, even if should still be in an interactive
      weight-raising period when the split occurs.
      
      This commit solves this problem as follows, for a just-created
      bfq_queue that is being early-merged: it stores directly, in the saved
      state of the bfq_queue, the weight-raising state that would have been
      assigned to the bfq_queue if not early-merged.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Tested-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: NMirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NLee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      894df937
    • P
      block, bfq: check and switch back to interactive wr also on queue split · 3e2bdd6d
      Paolo Valente 提交于
      As already explained in the message of commit "block, bfq: fix
      wrong init of saved start time for weight raising", if a soft
      real-time weight-raising period happens to be nested in a larger
      interactive weight-raising period, then BFQ restores the interactive
      weight raising at the end of the soft real-time weight raising. In
      particular, BFQ checks whether the latter has ended only on request
      dispatches.
      
      Unfortunately, the above scheme fails to restore interactive weight
      raising in the following corner case: if a bfq_queue, say Q,
      1) Is merged with another bfq_queue while it is in a nested soft
      real-time weight-raising period. The weight-raising state of Q is
      then saved, and not considered any longer until a split occurs.
      2) Is split from the other bfq_queue(s) at a time instant when its
      soft real-time weight raising is already finished.
      On the split, while resuming the previous, soft real-time
      weight-raised state of the bfq_queue Q, BFQ checks whether the
      current soft real-time weight-raising period is actually over. If so,
      BFQ switches weight raising off for Q, *without* checking whether the
      soft real-time period was actually nested in a non-yet-finished
      interactive weight-raising period.
      
      This commit addresses this issue by adding the above missing check in
      bfq_queue splits, and restoring interactive weight raising if needed.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Tested-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: NMirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NLee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3e2bdd6d
    • P
      block, bfq: fix wrong init of saved start time for weight raising · 4baa8bb1
      Paolo Valente 提交于
      This commit fixes a bug that causes bfq to fail to guarantee a high
      responsiveness on some drives, if there is heavy random read+write I/O
      in the background. More precisely, such a failure allowed this bug to
      be found [1], but the bug may well cause other yet unreported
      anomalies.
      
      BFQ raises the weight of the bfq_queues associated with soft real-time
      applications, to privilege the I/O, and thus reduce latency, for these
      applications. This mechanism is named soft-real-time weight raising in
      BFQ. A soft real-time period may happen to be nested into an
      interactive weight raising period, i.e., it may happen that, when a
      bfq_queue switches to a soft real-time weight-raised state, the
      bfq_queue is already being weight-raised because deemed interactive
      too. In this case, BFQ saves in a special variable
      wr_start_at_switch_to_srt, the time instant when the interactive
      weight-raising period started for the bfq_queue, i.e., the time
      instant when BFQ started to deem the bfq_queue interactive. This value
      is then used to check whether the interactive weight-raising period
      would still be in progress when the soft real-time weight-raising
      period ends.  If so, interactive weight raising is restored for the
      bfq_queue. This restore is useful, in particular, because it prevents
      bfq_queues from losing their interactive weight raising prematurely,
      as a consequence of spurious, short-lived soft real-time
      weight-raising periods caused by wrong detections as soft real-time.
      
      If, instead, a bfq_queue switches to soft-real-time weight raising
      while it *is not* already in an interactive weight-raising period,
      then the variable wr_start_at_switch_to_srt has no meaning during the
      following soft real-time weight-raising period. Unfortunately the
      handling of this case is wrong in BFQ: not only the variable is not
      flagged somehow as meaningless, but it is also set to the time when
      the switch to soft real-time weight-raising occurs. This may cause an
      interactive weight-raising period to be considered mistakenly as still
      in progress, and thus a spurious interactive weight-raising period to
      start for the bfq_queue, at the end of the soft-real-time
      weight-raising period. In particular the spurious interactive
      weight-raising period will be considered as still in progress, if the
      soft-real-time weight-raising period does not last very long. The
      bfq_queue will then be wrongly privileged and, if I/O bound, will
      unjustly steal bandwidth to truly interactive or soft real-time
      bfq_queues, harming responsiveness and low latency.
      
      This commit fixes this issue by just setting wr_start_at_switch_to_srt
      to minus infinity (farthest past time instant according to jiffies
      macros): when the soft-real-time weight-raising period ends, certainly
      no interactive weight-raising period will be considered as still in
      progress.
      
      [1] Background I/O Type: Random - Background I/O mix: Reads and writes
      - Application to start: LibreOffice Writer in
      http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-LaptopSigned-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: NMirko Montanari <mirkomontanari91@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4baa8bb1
    • J
      blk-mq: wire up completion notifier for laptop mode · 7beb2f84
      Jens Axboe 提交于
      For some reason, the laptop mode IO completion notified was never wired
      up for blk-mq. Ensure that we trigger the callback appropriately, to arm
      the laptop mode flush timer.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7beb2f84
  14. 01 10月, 2017 1 次提交
  15. 30 9月, 2017 1 次提交
  16. 26 9月, 2017 1 次提交
  17. 25 9月, 2017 3 次提交
    • S
      block: fix a crash caused by wrong API · f5c156c4
      Shaohua Li 提交于
      part_stat_show takes a part device not a disk, so we should use
      part_to_disk.
      
      Fixes: d62e26b3("block: pass in queue to inflight accounting")
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f5c156c4
    • W
      blktrace: Fix potential deadlock between delete & sysfs ops · 5acb3cc2
      Waiman Long 提交于
      The lockdep code had reported the following unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(s_active#228);
                                     lock(&bdev->bd_mutex/1);
                                     lock(s_active#228);
        lock(&bdev->bd_mutex);
      
       *** DEADLOCK ***
      
      The deadlock may happen when one task (CPU1) is trying to delete a
      partition in a block device and another task (CPU0) is accessing
      tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
      partition.
      
      The s_active isn't an actual lock. It is a reference count (kn->count)
      on the sysfs (kernfs) file. Removal of a sysfs file, however, require
      a wait until all the references are gone. The reference count is
      treated like a rwsem using lockdep instrumentation code.
      
      The fact that a thread is in the sysfs callback method or in the
      ioctl call means there is a reference to the opended sysfs or device
      file. That should prevent the underlying block structure from being
      removed.
      
      Instead of using bd_mutex in the block_device structure, a new
      blk_trace_mutex is now added to the request_queue structure to protect
      access to the blk_trace structure.
      Suggested-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      
      Fix typo in patch subject line, and prune a comment detailing how
      the code used to work.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5acb3cc2
    • C
      bsg-lib: don't free job in bsg_prepare_job · f507b54d
      Christoph Hellwig 提交于
      The job structure is allocated as part of the request, so we should not
      free it in the error path of bsg_prepare_job.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f507b54d