1. 01 Dec 2019 (2 commits)
  2. 21 Nov 2019 (2 commits)
    • block, bfq: do not plug I/O if all queues are weight-raised · 89f4d27c
      Committed by Paolo Valente
      [ Upstream commit c8765de0adfcaaf4ffb2d951e07444f00ffa9453 ]
      
      To reduce latency for interactive and soft real-time applications, bfq
      privileges the bfq_queues containing the I/O of these
      applications. These privileged queues, referred to as weight-raised
      queues, get a much higher share of the device throughput
      w.r.t. non-privileged queues. To preserve this higher share, the I/O
      of any non-weight-raised queue must be plugged whenever a sync
      weight-raised queue, while being served, remains temporarily empty. To
      attain this goal, bfq simply plugs any I/O (from any queue), if a sync
      weight-raised queue remains empty while in service.
      
      Unfortunately, this plugging typically lowers throughput with random
      I/O, on devices with internal queueing (because it reduces the filling
      level of the internal queues of the device).
      
      This commit addresses this issue by restricting the cases where
      plugging is performed: if a sync weight-raised queue remains empty
      while in service, then I/O plugging is performed only if some of the
      active bfq_queues are *not* weight-raised (which is actually the only
      circumstance where plugging is needed to preserve the higher share of
      the throughput of weight-raised queues). This restriction proved able
      to boost throughput in a large number of use cases that need only
      maximum throughput.
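      As a rough illustration of the new condition (a standalone model, not
      the actual bfq code; the counter names are only assumed to mirror bfq's
      busy/weight-raised bookkeeping):
      
          #include <stdbool.h>
      
          /*
           * Plug I/O for an empty, in-service, sync weight-raised queue only
           * if at least one active queue is NOT weight-raised; if every active
           * queue is weight-raised, plugging preserves nothing and only hurts
           * random-I/O throughput on devices with internal queueing.
           */
          static bool plug_for_empty_wr_queue(int busy_queues, int wr_busy_queues)
          {
                  return wr_busy_queues < busy_queues;
          }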
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      89f4d27c
    • block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash · 6c9a7965
      Committed by Paolo Valente
      [ Upstream commit d0edc2473be9d70f999282e1ca7863ad6ae704dc ]
      
      The Achilles' heel of BFQ is its failure to reach a high throughput
      with sync random I/O on flash storage with internal queueing, in case
      the processes doing I/O have differentiated weights.
      
      The cause of this failure is as follows. If at least two processes do
      sync I/O, and have a different weight from each other, then BFQ plugs
      I/O dispatching every time one of these processes, while it is being
      served, remains temporarily without pending I/O requests. This
      plugging is necessary to guarantee that every process enjoys a
      bandwidth proportional to its weight; but it empties the internal
      queue(s) of the drive. And this kills throughput with random I/O. So,
      if some processes have differentiated weights and do both sync and
      random I/O, the end result is a throughput collapse.
      
      This commit tries to counter this problem by injecting the service of
      other processes, in a controlled way, while the process in service
      happens to have no I/O. This injection is performed only if the medium
      is non-rotational and performs internal queueing, and the process in
      service does random I/O (service injection might be beneficial for
      sequential I/O too; we'll work on that).
      
      As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
      SSD, and with five processes having differentiated weights and doing
      sync random 4KB I/O, this commit makes the throughput with bfq grow by
      400%, from 25 to 100 MB/s. This higher throughput is 10 MB/s lower than
      that reached with the none I/O scheduler. As some less random I/O is
      added to the mix, the throughput becomes equal to or higher than that
      with none.
      
      This commit is a very first attempt to recover throughput without
      losing control, and certainly has many limitations. One is, e.g., that
      the processes whose service is injected are not chosen so as to
      distribute the extra bandwidth they receive in accordance with their
      weights. Thus there might be loss of weighted fairness in some
      cases. Anyway, this loss concerns extra service, which would not have
      been received at all without this commit. Other limitations and issues
      will probably show up with usage.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6c9a7965
  3. 13 Nov 2019 (1 commit)
    • blkcg: make blkcg_print_stat() print stats only for online blkgs · 52212812
      Committed by Tejun Heo
      [ Upstream commit b0814361a25cba73a224548843ed92d8ea78715a ]
      
      blkcg_print_stat() iterates blkgs under RCU and doesn't test whether
      the blkg is online.  This can call into pd_stat_fn() on a pd which is
      still being initialized leading to an oops.
      
      The heaviest operation - recursively summing up rwstat counters - is
      already done while holding the queue_lock.  Expand queue_lock to cover
      the other operations and skip the blkg if it isn't online yet.  The
      online state is protected by both blkcg and queue locks, so this
      guarantees that only online blkgs are processed.
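      A rough sketch of the reworked loop (illustrative only, not the exact
      upstream hunk; the field and lock names are assumed from the blk-cgroup
      code described above):
      
          hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
                  spin_lock_irq(&blkg->q->queue_lock);
                  if (!blkg->online) {
                          /* pd may still be initializing; skip pd_stat_fn() */
                          spin_unlock_irq(&blkg->q->queue_lock);
                          continue;
                  }
                  /* recursive rwstat summing and pd_stat_fn() both run here,
                   * under the queue_lock */
                  spin_unlock_irq(&blkg->q->queue_lock);
          }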
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Roman Gushchin <guro@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Fixes: 903d23f0 ("blk-cgroup: allow controllers to output their own stats")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      52212812
  4. 29 Oct 2019 (1 commit)
  5. 18 Oct 2019 (1 commit)
  6. 08 Oct 2019 (1 commit)
    • block: mq-deadline: Fix queue restart handling · dbb7339c
      Committed by Damien Le Moal
      [ Upstream commit cb8acabbe33b110157955a7425ee876fb81e6bbc ]
      
      Commit 7211aef86f79 ("block: mq-deadline: Fix write completion
      handling") added a call to blk_mq_sched_mark_restart_hctx() in
      dd_dispatch_request() to make sure that write request dispatching does
      not stall when all target zones are locked. This fix left a subtle race
      when a write completion happens during a dispatch execution on another
      CPU:
      
      CPU 0: Dispatch			CPU1: write completion
      
      dd_dispatch_request()
          lock(&dd->lock);
          ...
          lock(&dd->zone_lock);	dd_finish_request()
          rq = find request		lock(&dd->zone_lock);
          unlock(&dd->zone_lock);
          				zone write unlock
      				unlock(&dd->zone_lock);
      				...
      				__blk_mq_free_request
                                            check restart flag (not set)
      				      -> queue not run
          ...
          if (!rq && have writes)
              blk_mq_sched_mark_restart_hctx()
          unlock(&dd->lock)
      
      Since the dispatch context finishes after the write request completion
      handling, the queue being marked as needing a restart is not seen by
      __blk_mq_free_request(), blk_mq_sched_restart() is not executed, and
      the dispatch stalls under 100% write workloads.
      
      Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from
      dd_dispatch_request() into dd_finish_request() under the zone lock to
      ensure full mutual exclusion between write request dispatch selection
      and zone unlock on write request completion.
      
      Fixes: 7211aef86f79 ("block: mq-deadline: Fix write completion handling")
      Cc: stable@vger.kernel.org
      Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
      Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      dbb7339c
  7. 05 Oct 2019 (1 commit)
    • block: fix null pointer dereference in blk_mq_rq_timed_out() · 82652c06
      Committed by Yufen Yu
      commit 8d6996630c03d7ceeabe2611378fea5ca1c3f1b3 upstream.
      
      We got a NULL pointer dereference BUG_ON in blk_mq_rq_timed_out(),
      as follows:
      
      [  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
      [  108.827059] PGD 0 P4D 0
      [  108.827313] Oops: 0000 [#1] SMP PTI
      [  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
      [  108.829503] Workqueue: kblockd blk_mq_timeout_work
      [  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
      [  108.838191] Call Trace:
      [  108.838406]  bt_iter+0x74/0x80
      [  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
      [  108.839074]  ? __switch_to_asm+0x34/0x70
      [  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
      [  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
      [  108.840732]  blk_mq_timeout_work+0x74/0x200
      [  108.841151]  process_one_work+0x297/0x680
      [  108.841550]  worker_thread+0x29c/0x6f0
      [  108.841926]  ? rescuer_thread+0x580/0x580
      [  108.842344]  kthread+0x16a/0x1a0
      [  108.842666]  ? kthread_flush_work+0x170/0x170
      [  108.843100]  ret_from_fork+0x35/0x40
      
      The bug is caused by a race between timeout handling and completion of
      a flush request.
      
      When the timeout handler blk_mq_rq_timed_out() tries to read
      'req->q->mq_ops', the 'req' has already been completed and reinitialized
      by the next flush request, which calls blk_rq_init() to zero 'req'.
      
      After commit 12f5b931 ("blk-mq: Remove generation seqeunce"), the
      lifetime of a normal request is protected by a refcount: the request
      can only really be freed once 'rq->ref' drops to zero, so it cannot be
      reused before timeout handling finishes.
      
      However, a flush request defines .end_io, and rq->end_io() is still
      called even if 'rq->ref' hasn't dropped to zero. After that, 'flush_rq'
      can be reused by the next flush request, resulting in the NULL pointer
      dereference BUG_ON above.
      
      We fix this problem by covering the flush request with 'rq->ref' as
      well: if the refcount is not zero, flush_end_io() returns and lets the
      last holder call it again. To record the request status, we add a new
      field 'rq_status', which is used in flush_end_io().
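      The guard can be modelled in plain C11 as follows (a minimal standalone
      model; the kernel uses refcount_t and a spinlock-protected
      'fq->rq_status', not this code):
      
          #include <stdatomic.h>
          #include <stdbool.h>
      
          static atomic_int flush_rq_ref;
      
          /* Returns true only for the last holder; anyone else, e.g. a timeout
           * handler still holding a reference, bails out, so the flush request
           * cannot be reinitialized underneath it. */
          static bool flush_end_io_last_holder(void)
          {
                  return atomic_fetch_sub(&flush_rq_ref, 1) == 1;
          }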
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org # v4.18+
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      -------
      v2:
       - move rq_status from struct request to struct blk_flush_queue
      v3:
       - remove unnecessary '{}' pair.
      v4:
       - let a spinlock protect 'fq->rq_status'
      v5:
       - move rq_status after flush_running_idx member of struct blk_flush_queue
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      82652c06
  8. 01 Oct 2019 (2 commits)
  9. 16 Sep 2019 (3 commits)
    • blk-mq: free hw queue's resource in hctx's release handler · e238e6dc
      Committed by Ming Lei
      [ Upstream commit c7e2d94b3d1634988a95ac4d77a72dc7487ece06 ]
      
      Once blk_cleanup_queue() returns, tags shouldn't be used any more,
      because blk_mq_free_tag_set() may be called. Commit 45a9c9d9
      ("blk-mq: Fix a use-after-free") fixes this issue exactly.
      
      However, that commit introduces another issue. Before 45a9c9d9,
      we were allowed to run the queue during queue cleanup as long as the
      queue's kobj refcount was held. After that commit, the queue can't be
      run during queue cleanup, otherwise an oops is easily triggered because
      some fields of the hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
      
      We have invented ways for addressing this kind of issue before, such as:
      
      	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
      	c2856ae2 ("blk-mq: quiesce queue before freeing queue")
      
      But these still can't cover all cases; recently James reported another
      issue of this kind:
      
      	https://marc.info/?l=linux-scsi&m=155389088124782&w=2
      
      This issue is quite hard to address with the previous approach, given
      that scsi_run_queue() may run requeues for other LUNs.
      
      Fix the above issue by freeing the hctx's resources in its release
      handler; this is safe because tags aren't needed to free those hctx
      resources.
      
      This approach follows the typical design pattern for a kobject's
      release handler.
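      The pattern is roughly the following (field names are illustrative, not
      the exact blk-mq hunk):
      
          static void hctx_kobj_release(struct kobject *kobj)
          {
                  struct blk_mq_hw_ctx *hctx =
                          container_of(kobj, struct blk_mq_hw_ctx, kobj);
      
                  /* Runs only after the last reference is dropped, so it cannot
                   * race with a queue run, and no tags are needed here. */
                  free_cpumask_var(hctx->cpumask);
                  kfree(hctx->ctxs);
                  kfree(hctx);
          }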
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reported-by: James Smart <james.smart@broadcom.com>
      Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
      Cc: stable@vger.kernel.org
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e238e6dc
    • blk-iolatency: fix STS_AGAIN handling · 178d1337
      Committed by Dennis Zhou
      [ Upstream commit c9b3007feca018d3f7061f5d5a14cb00766ffe9b ]
      
      The iolatency controller is based on rq_qos. It increments on
      rq_qos_throttle() and decrements on either rq_qos_cleanup() or
      rq_qos_done_bio(). a3fb01ba5af0 fixes the double accounting issue where
      blk_mq_make_request() may call both rq_qos_cleanup() and
      rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the
      double decrement.
      
      The above works upstream as the only way we can get STS_AGAIN is from
      blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
      problem as bio_endio() skipping only happens on reserved tag allocation
      failures which can only be caused by driver bugs and already triggers
      WARN.
      
      However, the fix creates a not-so-great dependency on how STS_AGAIN can
      be propagated. Internally, we (Facebook) carry a patch that kills
      readahead if a cgroup is I/O congested or a fatal signal is pending.
      Combined with the fact that chained bios propagate their bi_status to
      the parent only if it is not already set, this can cause the parent bio
      to not clean up properly even though it was successful. This
      consequently leaks the inflight counter and can hang all I/Os under
      that blkg.
      
      To nip the adverse interaction early, this removes the rq_qos_cleanup()
      callback in iolatency in favor of cleaning up always on the
      rq_qos_done_bio() path.
      
      Fixes: a3fb01ba5af0 ("blk-iolatency: only account submitted bios")
      Debugged-by: Tejun Heo <tj@kernel.org>
      Debugged-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      178d1337
    • Blk-iolatency: warn on negative inflight IO counter · 5f33e812
      Committed by Liu Bo
      [ Upstream commit 391f552af213985d3d324c60004475759a7030c5 ]
      
      This is to catch any unexpected negative value of inflight IO counter.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5f33e812
  10. 29 Aug 2019 (1 commit)
  11. 04 Aug 2019 (1 commit)
  12. 31 Jul 2019 (2 commits)
    • block/bio-integrity: fix a memory leak bug · af50d6a1
      Committed by Wenwen Wang
      [ Upstream commit e7bf90e5afe3aa1d1282c1635a49e17a32c4ecec ]
      
      In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to
      hold integrity metadata. Later on, the buffer will be attached to the bio
      structure through bio_integrity_add_page(), which returns the number of
      bytes of integrity metadata attached. Due to unexpected situations,
      bio_integrity_add_page() may return 0. As a result, bio_integrity_prep()
      needs to be terminated with 'false' returned to indicate this error.
      However, the allocated kernel buffer is not freed on this execution path,
      leading to a memory leak.
      
      To fix this issue, free the allocated buffer before returning from
      bio_integrity_prep().
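      In shape, the fix is the classic free-on-error-path pattern (a
      simplified sketch, not the exact bio_integrity_prep() hunk):
      
          if (bio_integrity_add_page(bio, virt_to_page(buf), len, offset) == 0) {
                  kfree(buf);     /* previously leaked on this early return */
                  return false;
          }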
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      af50d6a1
    • block: init flush rq ref count to 1 · 8a1a3d38
      Committed by Josef Bacik
      [ Upstream commit b554db147feea39617b533ab6bca247c91c6198a ]
      
      We discovered a problem in newer kernels where a disconnect of a NBD
      device while the flush request was pending would result in a hang.  This
      is because the blk mq timeout handler does
      
              if (!refcount_inc_not_zero(&rq->ref))
                      return true;
      
      to determine if it's ok to run the timeout handler for the request.
      Flush_rq's don't have a ref count set, so we'd skip running the timeout
      handler for this request and it would just sit there in limbo forever.
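      A userspace model of that check (illustrative only; the kernel's real
      primitive is refcount_inc_not_zero()): with the count left at 0, the
      increment never succeeds, so the timeout handler returns early every
      time and the request sits in limbo.
      
          #include <stdatomic.h>
          #include <stdbool.h>
      
          static bool ref_inc_not_zero(atomic_int *ref)
          {
                  int old = atomic_load(ref);
      
                  while (old != 0) {
                          /* On failure 'old' is refreshed and the loop retries. */
                          if (atomic_compare_exchange_weak(ref, &old, old + 1))
                                  return true;
                  }
                  return false;   /* count was 0: the object is considered dead */
          }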
      
      Fix this by always setting the refcount of any request going through
      blk_rq_init() to 1.  I tested this with an nbd-server that dropped flush
      requests to verify that it hung, and then tested with this patch to
      verify I got the timeout as expected and the error handling kicked in.
      Thanks,
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8a1a3d38
  13. 26 Jul 2019 (4 commits)
  14. 14 Jul 2019 (1 commit)
    • block, bfq: NULL out the bic when it's no longer valid · 018524b7
      Committed by Douglas Anderson
      commit dbc3117d4ca9e17819ac73501e914b8422686750 upstream.
      
      In reboot tests on several devices we were seeing a "use after free"
      when slub_debug or KASAN was enabled.  The kernel complained about:
      
        Unable to handle kernel paging request at virtual address 6b6b6c2b
      
      ...which is a classic sign of use after free under slub_debug.  The
      stack crawl in kgdb looked like:
      
       0  test_bit (addr=<optimized out>, nr=<optimized out>)
       1  bfq_bfqq_busy (bfqq=<optimized out>)
       2  bfq_select_queue (bfqd=<optimized out>)
       3  __bfq_dispatch_request (hctx=<optimized out>)
       4  bfq_dispatch_request (hctx=<optimized out>)
       5  0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
       6  0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
       7  0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
       8  0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>)
       9  0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
       10 0xc024cff4 in worker_thread (__worker=0xec6d4640)
      
      Digging in kgdb, it could be found that, though bfqq looked fine,
      bfqq->bic had been freed.
      
      Through further digging, I postulated that perhaps it is illegal to
      access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
      because the "bic" can be freed at some point in time after this call
      is made.  I confirmed that there certainly were cases where the exact
      crashing code path would access the "bic" after bfq_exit_icq() had
      been called.  Specifically, I set "bfqq->bic" to (void *)0x7 and
      saw that the bic was 0x7 at the time of the crash.
      
      To understand a bit more about why this crash was fairly uncommon (I
      saw it only once in a few hundred reboots), you can see that much of
      the time bfq_exit_icq_bfqq() fully frees the bfqq and thus it can't
      access the ->bic anymore.  The only case it doesn't is if
      bfq_put_queue() sees a reference still held.
      
      However, even in the case when bfqq isn't freed, the crash is still
      rare.  Why?  I tracked what happened to the "bic" after the exit
      routine.  It doesn't get freed right away.  Rather,
      put_io_context_active() eventually called put_io_context() which
      queued up freeing on a workqueue.  The freeing then actually happened
      later than that through call_rcu().  Despite all these delays, some
      extra debugging showed that all the hoops could be jumped through in
      time and the memory could be freed causing the original crash.  Phew!
      
      To make a long story short, assuming it truly is illegal to access an
      icq after the "exit_icq" callback is finished, this patch is needed.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Paolo Valente <paolo.valente@unimore.it>
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      018524b7
  15. 10 Jul 2019 (1 commit)
    • block: Fix a NULL pointer dereference in generic_make_request() · c9d8d3e9
      Committed by Guilherme G. Piccoli
      --------------------------------------------------------------
      This patch is not in mainline and is meant for 4.19 stable *only*.
      After the patch description there is a reasoning about that.
      --------------------------------------------------------------
      
      Commit 37f9579f ("blk-mq: Avoid that submitting a bio concurrently
      with device removal triggers a crash") introduced a NULL pointer
      dereference in generic_make_request(). The patch sets q to NULL and
      enter_succeeded to false; right after, there's an 'if (enter_succeeded)'
      which is not taken, and then the 'else' will dereference q in
      blk_queue_dying(q).
      
      This patch just moves the 'q = NULL' to a point at which it won't trigger
      the oops, while the semantics of this NULLification remain untouched.
      
      A simple test case/reproducer is as follows:
      a) Build kernel v4.19.56-stable with CONFIG_BLK_CGROUP=n.
      
      b) Create a raid0 md array with 2 NVMe devices as members, and mount
      it with an ext4 filesystem.
      
      c) Run the following oneliner (supposing the raid0 is mounted in /mnt):
      (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;
      echo 1 > /sys/block/nvme1n1/device/device/remove
      (whereas nvme1n1 is the 2nd array member)
      
      This will trigger the following oops:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      RIP: 0010:generic_make_request+0x32b/0x400
      Call Trace:
       submit_bio+0x73/0x140
       ext4_io_submit+0x4d/0x60
       ext4_writepages+0x626/0xe90
       do_writepages+0x4b/0xe0
      [...]
      
      This patch has no functional changes and preserves the md/raid0 behavior
      when a member is removed before kernel v4.17.
      
      ----------------------------
      Why this is not on mainline?
      ----------------------------
      
      The patch was originally submitted upstream in linux-raid and
      linux-block mailing-lists - it was initially accepted by Song Liu,
      but Christoph Hellwig[0] observed that there was a clean-up series
      ready to be accepted from Ming Lei[1] that fixed the same issue.
      
      The accepted patches from Ming's series in upstream are: commit
      47cdee29ef9d ("block: move blk_exit_queue into __blk_release_queue") and
      commit fe2008640ae3 ("block: don't protect generic_make_request_checks
      with blk_queue_enter"). Those patches basically do a clean-up in the
      block layer involving:
      
      1) Putting back blk_exit_queue() logic into __blk_release_queue(); that
      path was changed in the past and the logic from blk_exit_queue() was
      added to blk_cleanup_queue().
      
      2) Removing the guard/protection in generic_make_request_checks() with
      blk_queue_enter().
      
      The problem with Ming's series for -stable is that it relies on the
      removal of the legacy request IO path. So it is "backport-able" to
      v5.0+, but doing that for earlier versions (like 4.19) would incur
      complex code changes. Hence, Christoph and Song Liu suggested that
      this patch be submitted to stable only; otherwise merging it upstream
      would add code to fix a path removed in a subsequent commit.
      
      [0] lore.kernel.org/linux-block/20190521172258.GA32702@infradead.org
      [1] lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.com
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Tested-by: Eric Ren <renzhengeek@gmail.com>
      Fixes: 37f9579f ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash")
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9d8d3e9
  16. 15 Jun 2019 (2 commits)
    • block, bfq: increase idling for weight-raised queues · b5a185ee
      Committed by Paolo Valente
      [ Upstream commit 778c02a236a8728bb992de10ed1f12c0be5b7b0e ]
      
      If a sync bfq_queue has a higher weight than some other queue, and
      remains temporarily empty while in service, then, to preserve the
      bandwidth share of the queue, it is necessary to plug I/O dispatching
      until a new request arrives for the queue. In addition, a timeout
      needs to be set, to avoid waiting forever if the process associated
      with the queue has actually finished its I/O.
      
      Even with the above timeout, the device is however not fed with new
      I/O for a while, if the process has finished its I/O. If this happens
      often, then throughput drops and latencies grow. For this reason, the
      timeout is kept rather low: 8 ms is the current default.
      
      Unfortunately, such a low value may cause, on the opposite end, a
      violation of bandwidth guarantees for a process that happens to issue
      new I/O too late. The higher the system load, the higher the
      probability that this happens to some process. This is a problem in
      scenarios where service guarantees matter more than throughput. One
      important case are weight-raised queues, which need to be granted a
      very high fraction of the bandwidth.
      
      To address this issue, this commit lower-bounds the plugging timeout
      for weight-raised queues to 20 ms. This simple change provides
      relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
      gnome-terminal starts in 0.6 seconds if there is no other I/O in
      progress, the same application starts in
      - 0.8 seconds, instead of 1.2 seconds, if ten files are being read
        sequentially in parallel
      - 1 second, instead of 2 seconds, if, in parallel, five files are
        being read sequentially, and five more files are being written
        sequentially
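      The lower-bounding itself is tiny; roughly (a sketch, not the actual
      bfq code, with the 20 ms constant taken from the description above):
      
          #define NSEC_PER_MSEC 1000000ULL
      
          static unsigned long long wr_idle_slice(unsigned long long slice_ns, int wr_coeff)
          {
                  /* Weight-raised queues (wr_coeff > 1) idle for at least 20 ms. */
                  if (wr_coeff > 1 && slice_ns < 20 * NSEC_PER_MSEC)
                          return 20 * NSEC_PER_MSEC;
                  return slice_ns;
          }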
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b5a185ee
    • blk-mq: move cancel of requeue_work into blk_mq_release · 525b5265
      Committed by Ming Lei
      [ Upstream commit fbc2a15e3433058582e5635aabe48a3011a644a8 ]
      
      While the queue's kobject refcount is held, it is safe for a driver
      to schedule a requeue. However, blk_mq_kick_requeue_list() may
      be called after blk_sync_queue() is done because of concurrent
      requeue activity, so the requeue work may not be completed when
      the queue is freed, and a kernel oops is triggered.
      
      So move the cancel of requeue_work into blk_mq_release() to avoid
      the race between requeue and freeing the queue.
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      525b5265
  17. 31 May 2019 (2 commits)
    • block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR · 2b18febc
      Committed by David Kozub
      [ Upstream commit 78bf47353b0041865564deeed257a54f047c2fdc ]
      
      The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value
      opal_mbr_data.enable_disable incorrectly: enable_disable is expected
      to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable
      was passed directly to set_mbr_done and set_mbr_enable_disable where
      it was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end
      result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE
      actually disabled the shadow MBR and vice versa.
      
      This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to
      OPAL_FALSE/TRUE. The change affects existing programs using
      IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when
      setting up an Opal drive.
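      Using the values given above, the added conversion boils down to the
      following (illustrative only, not the exact sed-opal hunk):
      
          enum { OPAL_MBR_ENABLE = 0, OPAL_MBR_DISABLE = 1 };    /* ioctl argument */
          enum { OPAL_FALSE = 0, OPAL_TRUE = 1 };                /* token expected below */
      
          static int opal_mbr_to_token(int enable_disable)
          {
                  /* Without this, ENABLE (0) was read as OPAL_FALSE and DISABLE (1)
                   * as OPAL_TRUE, i.e. the shadow MBR setting was inverted. */
                  return enable_disable == OPAL_MBR_ENABLE ? OPAL_TRUE : OPAL_FALSE;
          }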
      Acked-by: Jon Derrick <jonathan.derrick@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
      Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      2b18febc
    • block: fix use-after-free on gendisk · ad393793
      Committed by Yufen Yu
      [ Upstream commit 2c88e3c7ec32d7a40cc7c9b4a487cf90e4671bdd ]
      
      Commit 2da78092 ("block: Fix dev_t minor allocation lifetime")
      specifically moved the blk_free_devt(dev->devt) call to part_release()
      to avoid reallocating the device number before the device is fully
      shut down.
      
      However, it can cause use-after-free on gendisk in get_gendisk().
      We use md device as example to show the race scenes:
      
      Process1		Worker			Process2
      md_free
      						blkdev_open
      del_gendisk
        add delete_partition_work_fn() to wq
        						__blkdev_get
      						get_gendisk
      put_disk
        disk_release
          kfree(disk)
          						find part from ext_devt_idr
      						get_disk_and_module(disk)
          					  	cause use after free
      
          			delete_partition_work_fn
      			put_device(part)
          		  	part_release
      		    	remove part from ext_devt_idr
      
      Before the <devt, hd_struct pointer> pair is removed from ext_devt_idr
      by delete_partition_work_fn(), we can find the devt and then access the
      gendisk through the hd_struct pointer. But if we access the gendisk
      after it has been freed, this causes a use-after-free on the gendisk in
      get_gendisk().
      
      We fix this by adding a new helper, blk_invalidate_devt(), called from
      delete_partition() and del_gendisk(). It replaces the hd_struct pointer
      in the idr with 'NULL'; the entry is still deleted from the idr in
      part_release(), as we do now.
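      The helper looks roughly like this (a sketch; the lock and idr names are
      assumed from the existing ext_devt code rather than copied from the
      patch). get_gendisk() then treats a NULL entry like a missing one, so it
      can no longer hand out a pointer to a dying gendisk.
      
          void blk_invalidate_devt(dev_t devt)
          {
                  if (MAJOR(devt) == BLOCK_EXT_MAJOR) {
                          spin_lock_bh(&ext_devt_lock);
                          /* Keep the devt reserved, but make lookups fail from now on. */
                          idr_replace(&ext_devt_idr, NULL, blk_mangle_minor(MINOR(devt)));
                          spin_unlock_bh(&ext_devt_lock);
                  }
          }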
      
      Thanks to Jan Kara for providing the solution and more clear comments
      for the code.
      
      Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ad393793
  18. 17 May 2019 (1 commit)
  19. 08 May 2019 (2 commits)
  20. 20 Apr 2019 (1 commit)
  21. 17 Apr 2019 (1 commit)
  22. 06 Apr 2019 (1 commit)
    • block, bfq: fix in-service-queue check for queue merging · 06666a19
      Committed by Paolo Valente
      [ Upstream commit 058fdecc6de7cdecbf4c59b851e80eb2d6c5295f ]
      
      When a new I/O request arrives for a bfq_queue, say Q, bfq checks
      whether that request is close to
      (a) the head request of some other queue waiting to be served, or
      (b) the last request dispatched for the in-service queue (in case Q
      itself is not the in-service queue)
      
      If a queue, say Q2, is found for which the above condition holds, then
      bfq merges Q and Q2, to hopefully get a more sequential I/O in the
      resulting merged queue, and thus a possibly higher throughput.
      
      Case (b) is checked by comparing the new request for Q with the last
      request dispatched, assuming that the latter necessarily belonged to the
      in-service queue. Unfortunately, this assumption is no longer always
      correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
      into seeky idle queues on NCQ flash").
      
      When the assumption does not hold, queues that must not be merged may be
      merged, causing unexpected loss of control on per-queue service
      guarantees.
      
      This commit solves this problem by adding an extra field, which stores
      the actual last request dispatched for the in-service queue, and by
      using this new field to correctly check case (b).
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      06666a19
  23. 24 Mar 2019 (1 commit)
    • blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue · 29452f66
      Committed by Jianchao Wang
      [ Upstream commit aef1897cd36dcf5e296f1d2bae7e0d268561b685 ]
      
      On requeue, if RQF_DONTPREP is set, the rq already contains some
      driver-specific data, so insert it into the hctx dispatch list to avoid
      any merge. Taking scsi as an example, here is the trace event log (no
      io scheduler, because RQF_STARTED would prevent merging):
      
         kworker/0:1H-339   [000] ...1  2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
      scsi_inert_test-1987  [000] ....  2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
      scsi_inert_test-1987  [000] ...2  2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
         kworker/0:1H-339   [000] ....  2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
      scsi_inert_test-1996  [000] ..s1  2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
      scsi_inert_test-1996  [000] .Ns1  2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
         kworker/0:1H-339   [000] ...1  2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
         kworker/0:1H-339   [000] ...1  2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
      scsi_inert_test-1986  [000] ..s1  2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
      
      (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP.
      Then it was merged with (32776 + 8) and issued. Due to RQF_DONTPREP,
      the sdb only contained the part of (32768 + 8), then only that part
      was completed. The lucky thing was that scsi_io_completion detected
      it and requeued the remaining part. So we didn't get corrupted data.
      However, the requeue of (32776 + 8) is not expected.
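      The requeue-path change is roughly (a sketch of the idea; the argument
      order follows the blk-mq helpers of that era and may differ from the
      exact patch):
      
          if (rq->rq_flags & RQF_DONTPREP)
                  /* Driver data is already prepared: keep the rq whole, no merging. */
                  blk_mq_request_bypass_insert(rq, false);
          else
                  blk_mq_sched_insert_request(rq, true, false, false);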
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      29452f66
  24. 14 Mar 2019 (1 commit)
    • blk-iolatency: fix IO hang due to negative inflight counter · 6d482bc5
      Committed by Liu Bo
      [ Upstream commit 8c772a9bfc7c07c76f4a58b58910452fbb20843b ]
      
      Our test reported the following stack, and vmcore showed that
      ->inflight counter is -1.
      
      [ffffc9003fcc38d0] __schedule at ffffffff8173d95d
      [ffffc9003fcc3958] schedule at ffffffff8173de26
      [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
      [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
      [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
      [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
      [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
      [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
      [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
      [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
      [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
      [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
      [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
      [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
      [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
      [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
      [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
      [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
      [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e
      
      The ->inflight counter may be negative (-1) if
      
      1) blk-iolatency was disabled when the IO was issued,
      
      2) blk-iolatency was enabled before this IO reached its endio,
      
      3) the ->inflight counter is decreased from 0 to -1 in endio()
      
      In fact the hang can be easily reproduced by the below script,
      
      H=/sys/fs/cgroup/unified/
      P=/sys/fs/cgroup/unified/test
      
      echo "+io" > $H/cgroup.subtree_control
      mkdir -p $P
      
      echo $$ > $P/cgroup.procs
      
      xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
      
      echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency
      
      xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
      
      This fixes the problem by freezing the queue so that while
      enabling/disabling iolatency, there is no inflight rq running.
      
      Note that quiescing the queue is not needed, as this only updates the
      iolatency configuration, which the dispatch path of the request_queue
      does not care about.
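      In shape, the fix is the following (a sketch; blk_mq_freeze_queue() and
      blk_mq_unfreeze_queue() are the real APIs, the surrounding function is
      illustrative):
      
          static void iolatency_set_enabled(struct request_queue *q, bool *enabled, bool enable)
          {
                  blk_mq_freeze_queue(q);         /* waits for all in-flight requests */
                  *enabled = enable;              /* no bio can straddle the change now */
                  blk_mq_unfreeze_queue(q);
          }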
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6d482bc5
  25. 20 Feb 2019 (1 commit)
    • blk-mq: fix a hung issue when fsync · 80c8452a
      Committed by Jianchao Wang
      [ Upstream commit 85bd6e61f34dffa8ec2dc75ff3c02ee7b2f1cbce ]
      
      Florian reported an I/O hang issue when calling fsync(). It should be
      triggered by the following race condition.
      
      data + post flush         a flush
      
      blk_flush_complete_seq
        case REQ_FSEQ_DATA
          blk_flush_queue_rq
          issued to driver      blk_mq_dispatch_rq_list
                                  try to issue a flush req
                                  failed due to NON-NCQ command
                                  .queue_rq return BLK_STS_DEV_RESOURCE
      
      request completion
        req->end_io // doesn't check RESTART
        mq_flush_data_end_io
          case REQ_FSEQ_POSTFLUSH
            blk_kick_flush
              do nothing because previous flush
              has not been completed
           blk_mq_run_hw_queue
                                    insert rq to hctx->dispatch
                                    due to RESTART is still set, do nothing
      
      To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
      with blk_mq_sched_restart to check and clear the RESTART flag.
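      The change itself is a one-liner; schematically (a sketch, not the
      literal diff):
      
          /* mq_flush_data_end_io() */
          blk_mq_sched_restart(hctx);     /* was: blk_mq_run_hw_queue(hctx, true); */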
      
      Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers)
      Reported-by: Florian Stecker <m19@florianstecker.de>
      Tested-by: Florian Stecker <m19@florianstecker.de>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      80c8452a
  26. 23 Jan 2019 (1 commit)
    • block: use rcu_work instead of call_rcu to avoid sleep in softirq · 4cc66cc4
      Committed by Yufen Yu
      commit 94a2c3a32b62e868dc1e3d854326745a7f1b8c7a upstream.
      
      We recently got a stack by syzkaller like this:
      
      BUG: sleeping function called from invalid context at mm/slab.h:361
      in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
      INFO: lockdep is turned off.
      CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
       0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
       0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
       0000000000000000 0000000000000001 0000000000000004 0000000000000000
      Call Trace:
       <IRQ>  [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline]
       <IRQ>  [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
       [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
       [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
       [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline]
       [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline]
       [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline]
       [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
       [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline]
       [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline]
       [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
       [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
       [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline]
       [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675
       [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline]
       [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline]
       [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692
       [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237
       [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
       [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
       [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
       [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
       [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
       [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
       [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273
       [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline]
       [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391
       [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
       [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
       [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
       <EOI>  [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180
       [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626
       [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043
       [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline]
       [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050
       [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a
      
      In softirq context, we call the RCU callback delete_partition_rcu_cb(),
      which may allocate memory via kzalloc with the GFP_KERNEL flag. If the
      allocation cannot be satisfied immediately, it may sleep. However, that
      is not allowed in softirq context.
      
      Although we found this problem on linux 4.4, the latest kernel version
      seems to have this problem as well. And it is very similar to the
      previous one:
      	https://lkml.org/lkml/2018/7/9/391
      
      Fix it by using an RCU workqueue, in which sleeping is allowed.
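      The conversion follows the standard rcu_work pattern (a sketch with a
      hypothetical holder struct; INIT_RCU_WORK(), queue_rcu_work() and
      to_rcu_work() are the real APIs):
      
          struct part_release_work {
                  struct rcu_work rwork;
                  struct hd_struct *part;
          };
      
          static void delete_partition_work_fn(struct work_struct *work)
          {
                  struct part_release_work *w =
                          container_of(to_rcu_work(work), struct part_release_work, rwork);
      
                  /* Process context: put_device() -> kobject_uevent() may now sleep. */
                  put_device(part_to_dev(w->part));
                  kfree(w);
          }
      
          static void schedule_partition_release(struct part_release_work *w)
          {
                  INIT_RCU_WORK(&w->rwork, delete_partition_work_fn);
                  queue_rcu_work(system_wq, &w->rwork);   /* runs after an RCU grace period */
          }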
      Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4cc66cc4
  27. 13 Jan 2019 (2 commits)
    • block: mq-deadline: Fix write completion handling · 6353c0a0
      Committed by Damien Le Moal
      commit 7211aef86f79583e59b88a0aba0bc830566f7e8e upstream.
      
      For a zoned block device using mq-deadline, if a write request for a
      zone is received while another write was already dispatched for the same
      zone, dd_dispatch_request() will return NULL and the newly inserted
      write request is kept in the scheduler queue waiting for the ongoing
      zone write to complete. With this behavior, when no other request has
      been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
      and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to
      __blk_mq_free_request() call of blk_mq_sched_restart() to not run the
      queue when the already dispatched write request completes. The newly
      inserted request stays stuck in the scheduler queue until eventually
      another request is submitted.
      
      This problem does not affect SCSI disks, as the SCSI stack handles queue
      restarts on request completion. However, the problem can be triggered
      with the null_blk driver with zoned mode enabled.
      
      Fix this by always requesting a queue restart in dd_dispatch_request()
      if no request was dispatched while WRITE requests are queued.
      
      Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Add missing export of blk_mq_sched_restart()
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6353c0a0
    • block: deactivate blk_stat timer in wbt_disable_default() · 69e9b285
      Committed by Ming Lei
      commit 544fbd16a461a318cd80537d1331c0df5c6cf930 upstream.
      
      rwb_enabled() can't be changed when there is any inflight IO.
      
      wbt_disable_default() may set rwb->wb_normal to zero, however the
      blk_stat timer may still be pending, and the timer function will update
      rwb->wb_normal again.
      
      This patch introduces blk_stat_deactivate() and applies it in
      wbt_disable_default(), then the following IO hang triggered when running
      parted & switching io scheduler can be fixed:
      
      [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
      [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
      [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  369.940768] parted          D    0  3645   3239 0x00000000
      [  369.941500] Call Trace:
      [  369.941874]  ? __schedule+0x6d9/0x74c
      [  369.942392]  ? wbt_done+0x5e/0x5e
      [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
      [  369.943404]  ? wbt_done+0x5e/0x5e
      [  369.943874]  schedule+0x67/0x78
      [  369.944298]  io_schedule+0x12/0x33
      [  369.944771]  rq_qos_wait+0xb5/0x119
      [  369.945193]  ? karma_partition+0x1c2/0x1c2
      [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
      [  369.946151]  wbt_wait+0x85/0xb6
      [  369.946540]  __rq_qos_throttle+0x23/0x2f
      [  369.947014]  blk_mq_make_request+0xe6/0x40a
      [  369.947518]  generic_make_request+0x192/0x2fe
      [  369.948042]  ? submit_bio+0x103/0x11f
      [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
      [  369.949011]  submit_bio+0x103/0x11f
      [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
      [  369.949962]  submit_bio_wait+0x53/0x7f
      [  369.950469]  blkdev_issue_flush+0x8a/0xae
      [  369.951032]  blkdev_fsync+0x2f/0x3a
      [  369.951502]  do_fsync+0x2e/0x47
      [  369.951887]  __x64_sys_fsync+0x10/0x13
      [  369.952374]  do_syscall_64+0x89/0x149
      [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  369.953492] RIP: 0033:0x7f95a1e729d4
      [  369.953996] Code: Bad RIP value.
      [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
      [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
      [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
      [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
      [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
      
      Cc: stable@vger.kernel.org
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      69e9b285