1. 01 Mar 2018 (2 commits)
  2. 27 Feb 2018 (4 commits)
    • genhd: Fix BUG in blkdev_open() · 56c0908c
      Committed by Jan Kara
      When two blkdev_open() calls for a partition race with device removal
      and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
      blkdev_open(). The race can happen as follows:
      
      CPU0				CPU1			CPU2
      							del_gendisk()
      							  bdev_unhash_inode(part1);
      
      blkdev_open(part1, O_EXCL)	blkdev_open(part1, O_EXCL)
        bdev = bd_acquire()		  bdev = bd_acquire()
        blkdev_get(bdev)
          bd_start_claiming(bdev)
            - finds old inode 'whole'
            bd_prepare_to_claim() -> 0
      							  bdev_unhash_inode(whole);
      							<device removed>
      							<new device under same
      							 number created>
      				  blkdev_get(bdev);
      				    bd_start_claiming(bdev)
      				      - finds new inode 'whole'
      				      bd_prepare_to_claim()
      					- this also succeeds as we have
      					  different 'whole' here...
      					- bad things happen now as we
      					  have two exclusive openers of
      					  the same bdev
      
      The problem here is that block device opens can see various intermediate
      states while gendisk is shutting down and then being recreated.
      
      We fix the problem by introducing a new lookup_sem in gendisk that
      synchronizes gendisk deletion with get_gendisk(), and furthermore by
      making sure that get_gendisk() does not return a gendisk that is being
      (or has been) deleted. This makes sure that once we manage to look up
      the newly created bdev inode, we are also guaranteed that the following
      get_gendisk() will either return failure (and we fail the open) or
      return the gendisk for the new device, and the following bdget_disk()
      will return the new bdev inode (i.e., blkdev_open() follows the same
      path as if it ran entirely after the new device was created).
      Reported-and-analyzed-by: Hou Tao <houtao1@huawei.com>
      Tested-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      56c0908c
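      A minimal sketch of the synchronization this commit describes, using
      the names from the message (a simplification for illustration, not
      the patch itself):

          /* del_gendisk(): mark the disk dead under the new semaphore */
          down_write(&disk->lookup_sem);
          disk->flags &= ~GENHD_FL_UP;
          up_write(&disk->lookup_sem);

          /* get_gendisk(): refuse to hand out a dying or dead gendisk */
          down_read(&disk->lookup_sem);
          if (!(disk->flags & GENHD_FL_UP)) {
                  put_disk_and_module(disk);
                  disk = NULL;    /* caller sees the lookup fail */
          }
          up_read(&disk->lookup_sem);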
    • genhd: Add helper put_disk_and_module() · 9df6c299
      Committed by Jan Kara
      Add a proper counterpart to get_disk_and_module():
      put_disk_and_module(). Currently it is open-coded in several places.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9df6c299
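      Based on the description, the helper plausibly reduces to dropping
      the gendisk reference and then the module reference taken by
      get_disk_and_module() (a sketch, not the verbatim patch):

          void put_disk_and_module(struct gendisk *disk)
          {
                  if (disk) {
                          struct module *owner = disk->fops->owner;

                          put_disk(disk);
                          module_put(owner);
                  }
          }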
    • genhd: Rename get_disk() to get_disk_and_module() · 3079c22e
      Committed by Jan Kara
      Rename get_disk() to get_disk_and_module() to make clear what the
      function does. It's not a great name but at least it is now clear that
      put_disk() is not its counterpart.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3079c22e
    • genhd: Fix leaked module reference for NVME devices · d52987b5
      Committed by Jan Kara
      Commit 8ddcd653 ("block: introduce GENHD_FL_HIDDEN") added handling of
      hidden devices to get_gendisk() but forgot to drop the module reference
      which is also acquired by get_disk(). Drop the reference as necessary.
      
      Arguably the function naming here is misleading as put_disk() is *not*
      the counterpart of get_disk(), but let's fix that in the follow-up
      commit since that will be more intrusive.
      
      Fixes: 8ddcd653
      CC: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d52987b5
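      A sketch of the shape of the fix in get_gendisk() (a hypothetical
      excerpt; the essential point is saving the module owner before the
      disk reference is dropped):

          if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
                  struct module *owner = disk->fops->owner;

                  put_disk(disk);         /* ref taken by get_disk() */
                  module_put(owner);      /* module ref it also took */
                  disk = NULL;
          }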
  3. 25 Feb 2018 (2 commits)
  4. 24 Feb 2018 (1 commit)
  5. 14 Feb 2018 (1 commit)
  6. 12 Feb 2018 (1 commit)
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
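      For illustration, a hypothetical driver ->poll() method after the
      scripted rewrite (driver names are made up; the script's only change
      is prefixing the returned constants with E):

          static __poll_t mydrv_poll(struct file *file, poll_table *wait)
          {
                  __poll_t mask = 0;

                  poll_wait(file, &mydrv_waitq, wait);
                  if (mydrv_data_ready())
                          mask |= EPOLLIN | EPOLLRDNORM;  /* was POLLIN | POLLRDNORM */
                  return mask;
          }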
  7. 08 Feb 2018 (1 commit)
    • block, bfq: add requeue-request hook · a7877390
      Committed by Paolo Valente
      Commit a6a252e6 ("blk-mq-sched: decide how to handle flush rq via
      RQF_FLUSH_SEQ") makes all non-flush re-prepared requests for a device
      be re-inserted into the active I/O scheduler for that device. As a
      consequence, I/O schedulers may get the same request inserted again,
      even several times, without a finish_request invoked on that request
      before each re-insertion.
      
      This fact is the cause of the failure reported in [1]. For an I/O
      scheduler, every re-insertion of the same re-prepared request is
      equivalent to the insertion of a new request. For schedulers like
      mq-deadline or kyber, this fact causes no harm. In contrast, it
      confuses a stateful scheduler like BFQ, which keeps state for an I/O
      request, until the finish_request hook is invoked on the request. In
      particular, BFQ may get stuck, waiting forever for the number of
      request dispatches, of the same request, to be balanced by an equal
      number of request completions (while there will be one completion for
      that request). In this state, BFQ may refuse to serve I/O requests
      from other bfq_queues. The hang reported in [1] then follows.
      
      However, the above re-prepared requests undergo a requeue, thus the
      requeue_request hook of the active elevator is invoked for these
      requests, if set. This commit then addresses the above issue by
      properly implementing the hook requeue_request in BFQ.
      
      [1] https://marc.info/?l=linux-block&m=151211117608676
      Reported-by: Ivan Kozik <ivan@ludios.org>
      Reported-by: Alban Browaeys <alban.browaeys@gmail.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a7877390
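      A sketch of the shape of the fix: BFQ can point both the requeue and
      the finish hook at one teardown function, so per-request scheduler
      state is released on requeue just as on completion (the handler name
      is an assumption):

          /* excerpt from BFQ's elevator_mq_ops: both hooks funnel into
           * the same per-request teardown path */
          .requeue_request = bfq_finish_requeue_request,
          .finish_request  = bfq_finish_requeue_request,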
  8. 07 Feb 2018 (2 commits)
    • block: Add should_fail_bio() for bpf error injection · 30abb3a6
      Committed by Howard McLauchlan
      The classic error injection mechanism, should_fail_request(), does not
      support use cases where more information is required (from the entire
      struct bio, for example).
      
      To that end, this patch introduces should_fail_bio(), which calls
      should_fail_request() under the hood but provides a convenient
      place for kprobes to hook into if they require the entire struct bio.
      This patch also replaces some existing calls to should_fail_request()
      with should_fail_bio() with no degradation in performance.
      Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      30abb3a6
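      A sketch consistent with the description: a small noinline wrapper
      that gives BPF a stable attach point and is whitelisted for error
      injection (the exact arguments passed to should_fail_request() are
      an assumption):

          static noinline int should_fail_bio(struct bio *bio)
          {
                  if (should_fail_request(&bio->bi_disk->part0,
                                          bio->bi_iter.bi_size))
                          return -EIO;
                  return 0;
          }
          ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);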
    • blk-wbt: account flush requests correctly · 5235553d
      Committed by Jens Axboe
      Mikulas reported a workload that saw bad performance, and figured
      out that it was due to various other types of requests being
      accounted as reads. Flush requests, for instance. Due to the
      high latency of those, we heavily throttle the writes to keep
      the latencies in balance. But they really should be accounted
      as writes.
      
      Fix this by checking the exact type of the request. If it's a
      read, account as a read, if it's a write or a flush, account
      as a write. Any other request we disregard. Previously everything
      would have been mistakenly accounted as reads.
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5235553d
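      The classification described boils down to a small helper; a sketch,
      assuming blk-wbt keys accounting off the request op (a negative
      return meaning "don't account"):

          static int wbt_data_dir(const struct request *rq)
          {
                  const int op = req_op(rq);

                  if (op == REQ_OP_READ)
                          return READ;
                  else if (op_is_write(op))
                          return WRITE;   /* writes and flushes alike */

                  /* don't account */
                  return -1;
          }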
  9. 02 Feb 2018 (2 commits)
    • bea99a50
    • blk-mq: fix discard merge with scheduler attached · 445251d0
      Committed by Jens Axboe
      I ran into an issue on my laptop that triggered a bug on the
      discard path:
      
      WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
       Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
       CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G     U           4.15.0+ #176
       Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
       RIP: 0010:nvme_setup_cmd+0x3d3/0x430
       RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
       RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
       RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
       RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
       R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
       R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
       FS:  0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        nvme_queue_rq+0x40/0xa00
        ? __sbitmap_queue_get+0x24/0x90
        ? blk_mq_get_tag+0xa3/0x250
        ? wait_woken+0x80/0x80
        ? blk_mq_get_driver_tag+0x97/0xf0
        blk_mq_dispatch_rq_list+0x7b/0x4a0
        ? deadline_remove_request+0x49/0xb0
        blk_mq_do_dispatch_sched+0x4f/0xc0
        blk_mq_sched_dispatch_requests+0x106/0x170
        __blk_mq_run_hw_queue+0x53/0xa0
        __blk_mq_delay_run_hw_queue+0x83/0xa0
        blk_mq_run_hw_queue+0x6c/0xd0
        blk_mq_sched_insert_request+0x96/0x140
        __blk_mq_try_issue_directly+0x3d/0x190
        blk_mq_try_issue_directly+0x30/0x70
        blk_mq_make_request+0x1a4/0x6a0
        generic_make_request+0xfd/0x2f0
        ? submit_bio+0x5c/0x110
        submit_bio+0x5c/0x110
        ? __blkdev_issue_discard+0x152/0x200
        submit_bio_wait+0x43/0x60
        ext4_process_freed_data+0x1cd/0x440
        ? account_page_dirtied+0xe2/0x1a0
        ext4_journal_commit_callback+0x4a/0xc0
        jbd2_journal_commit_transaction+0x17e2/0x19e0
        ? kjournald2+0xb0/0x250
        kjournald2+0xb0/0x250
        ? wait_woken+0x80/0x80
        ? commit_timeout+0x10/0x10
        kthread+0x111/0x130
        ? kthread_create_worker_on_cpu+0x50/0x50
        ? do_group_exit+0x3a/0xa0
        ret_from_fork+0x1f/0x30
       Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
       ---[ end trace 50d361cc444506c8 ]---
       print_req_error: I/O error, dev nvme0n1, sector 847167488
      
      Decoding the assembly, the request claims to have 0xffff segments,
      while nvme counts two. This turns out to be because we don't check
      for a data carrying request on the mq scheduler path, and since
      blk_phys_contig_segment() returns true for a non-data request,
      we decrement the initial segment count of 0 and end up with
      0xffff in the unsigned short.
      
      There are a few issues here:
      
      1) We should initialize the segment count for a discard to 1.
      2) The discard merging is currently using the data limits for
         segments and sectors.
      
      Fix this up by having attempt_merge() correctly identify the
      request, and by initializing the segment count correctly
      for discards.
      
      This can only be triggered with mq-deadline on discard capable
      devices right now, which isn't a common configuration.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      445251d0
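      A sketch of the attempt_merge() guard described in point 2
      (simplified; the real patch also initializes the discard segment
      count per point 1):

          /* excerpt from attempt_merge(): only requests of the same
           * operation type may be merged, so a discard is no longer
           * judged by the segment/sector limits of data requests */
          if (req_op(req) != req_op(next))
                  return NULL;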
  10. 31 Jan 2018 (1 commit)
    • blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Committed by Ming Lei
      This status is returned from a driver to the block layer if a
      device-related resource is unavailable, but the driver can guarantee
      that IO dispatch will be triggered in the future when the resource is
      available.
      
      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if a driver
      returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
      a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
      3 ms because both scsi-mq and nvmefc are using that magic value.
      
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      
      1) If all in-flight IOs complete before examining SCHED_RESTART in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so the queue
      is run immediately in this case by blk_mq_dispatch_rq_list();
      
      2) if there is any in-flight IO after/when examining SCHED_RESTART
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
      - otherwise, this request will be dispatched after any in-flight IO is
        completed via blk_mq_sched_restart()
      
      3) if SCHED_RESTART is set concurrently in this context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
      cases and make sure an IO hang is avoided.
      
      One invariant is that queue will be rerun if SCHED_RESTART is set.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      86ff7c2a
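      A hypothetical ->queue_rq() tail showing the intended division of
      labor (driver and helper names are made up):

          int ret = mydrv_submit_cmd(cmd);

          if (ret == -EBUSY) {
                  /*
                   * A per-device resource is exhausted, but completion of
                   * our own in-flight IO is guaranteed to rerun the queue,
                   * so BLK_STS_DEV_RESOURCE is safe here.
                   */
                  return BLK_STS_DEV_RESOURCE;
          }
          return ret ? BLK_STS_IOERR : BLK_STS_OK;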
  11. 25 Jan 2018 (2 commits)
  12. 24 Jan 2018 (1 commit)
  13. 20 Jan 2018 (4 commits)
  14. 19 Jan 2018 (7 commits)
  15. 18 Jan 2018 (9 commits)
    • block, bfq: limit sectors served with interactive weight raising · 8a8747dc
      Committed by Paolo Valente
      To maximise responsiveness, BFQ raises the weight, and performs device
      idling, for bfq_queues associated with processes deemed as
      interactive. In particular, weight raising has a maximum duration,
      equal to the time needed to start a large application. If a
      weight-raised process goes on doing I/O beyond this maximum duration,
      it loses weight-raising.
      
      This mechanism is evidently vulnerable to the following false
      positives: I/O-bound applications that will go on doing I/O for much
      longer than the duration of weight-raising. These applications have
      basically no benefit from being weight-raised at the beginning of
      their I/O. On the opposite end, while being weight-raised, these
      applications
      a) unjustly steal throughput from applications that may truly need
      low latency;
      b) make BFQ uselessly perform device idling; device idling results
      in loss of device throughput with most flash-based storage, and may
      increase latencies when used purposelessly.
      
      This commit adds a countermeasure to reduce both the above
      problems. To introduce this countermeasure, we provide the following
      extra piece of information (full details in the comments added by this
      commit). During the start-up of the large application used as a
      reference to set the duration of weight-raising, involved processes
      transfer at most ~110K sectors each. Accordingly, a process initially
      deemed as interactive has no right to be weight-raised any longer
      once it has transferred 110K sectors or more.
      
      Based on this consideration, this commit early-ends weight-raising
      for a bfq_queue if the latter happens to have received an amount of
      service at least equal to 110K sectors (actually, a little bit more,
      to keep a safety margin). I/O-bound applications that reach a high
      throughput, such as file copy, get to this threshold much before the
      allowed weight-raising period finishes. Thus this early ending of
      weight-raising reduces the amount of time during which these
      applications cause the problems described above.
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8a8747dc
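      A sketch of the early-ending check, with field names assumed to
      track the service a queue has received while weight-raised:

          /* end interactive weight-raising once the queue has been served
           * ~110K sectors (plus a safety margin) while raised */
          if (bfqq->service_from_wr > max_service_from_wr)
                  bfq_bfqq_end_wr(bfqq);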
    • block, bfq: limit tags for writes and async I/O · a52a69ea
      Committed by Paolo Valente
      Asynchronous I/O can easily starve synchronous I/O (both sync reads
      and sync writes), by consuming all request tags. Similarly, storms of
      synchronous writes, such as those that sync(2) may trigger, can starve
      synchronous reads. In their turn, these two problems may also cause
      BFQ to lose control over latency for interactive and soft real-time
      applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
      Writer takes 0.6 seconds to start if the device is idle, but it takes
      more than 45 seconds (!) if there are sequential writes in the
      background.
      
      This commit addresses this issue by limiting the maximum percentage of
      tags that asynchronous I/O requests and synchronous write requests can
      consume. In particular, this commit grants a higher threshold to
      synchronous writes, to prevent the latter from being starved by
      asynchronous I/O.
      
      According to the above test, LibreOffice Writer now starts in about
      1.2 seconds on average, regardless of the background workload, and
      apart from some rare outlier. To check this improvement, run, e.g.,
      sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
      for the comm_startup_lat benchmark in the S suite [1].
      
      [1] https://github.com/Algodev-github/S
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      a52a69ea
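      The limiting plausibly hooks tag allocation through the scheduler's
      depth-limiting callback; a sketch assuming field names along these
      lines (the per-class thresholds are elided):

          /* cap the tag depth seen by async IO and sync writes; sync
           * reads keep the full depth so they cannot be starved of tags */
          static void bfq_limit_depth(unsigned int op,
                                      struct blk_mq_alloc_data *data)
          {
                  struct bfq_data *bfqd = data->q->elevator->elevator_data;

                  if (op_is_sync(op) && !op_is_write(op))
                          return;         /* sync reads: never limited */

                  data->shallow_depth =
                          bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
          }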
    • blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy · 23d4ee19
      Committed by Ming Lei
      When the queue is busy, blk_mq_request_direct_issue() should not
      dispatch the request onto hctx->dispatch_list; what we need to do
      instead is return the queue-busy status to the caller, so that the
      caller can deal with it properly.
      
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Reported-by: Laurence Oberman <loberman@redhat.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      23d4ee19
    • block: Fix __bio_integrity_endio() documentation · de99a346
      Committed by Bart Van Assche
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      de99a346
    • blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request · 9e97d295
      Committed by Mike Snitzer
      After commit:
      
      923218f6 ("blk-mq: don't allocate driver tag upfront for flush rq")
      
      we no longer use the 'can_block' argument in
      blk_mq_sched_insert_request(). Kill it.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      
      Added the actual commit message as to why it's being removed.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9e97d295
    • blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Committed by Ming Lei
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      
      1) This way isn't efficient enough because the hctx spinlock is always
      used.
      
      2) With blk_insert_cloned_request(), we completely bypass underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to underlying queue being busy) the dm-rq elevator will
      not be able to provide effective IO merging (as a side-effect of dm-rq
      currently blindly destaging a request from its elevator only to requeue
      it after a delay, which kills any opportunity for merging).  This
      obviously causes very bad sequential IO performance.
      
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_STS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request, thereby preserving the opportunity to
      merge IO.
      
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      396eaf21
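      A sketch of the consumer side as described, reduced to the decision
      that matters in dm-rq's map path (surrounding clone setup and error
      handling elided):

          /* issue the clone directly to the underlying queue and let its
           * feedback drive requeueing instead of blindly destaging */
          blk_status_t ret = blk_insert_cloned_request(clone->q, clone);

          if (ret == BLK_STS_RESOURCE)
                  return DM_MAPIO_REQUEUE;  /* keep the request in dm-rq's
                                               elevator so it can still be
                                               merged with future IO */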
    • blk-mq: factor out a few helpers from __blk_mq_try_issue_directly · 0f95549c
      Committed by Mike Snitzer
      No functional change.  Just makes the code flow more logically.
      
      In a following commit, __blk_mq_try_issue_directly() will be used to
      return the dispatch result (blk_status_t) to DM.  DM needs this
      information to improve IO merging.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0f95549c
    • blk-mq: turn WARN_ON in __blk_mq_run_hw_queue into printk · 7df938fb
      Committed by Ming Lei
      We know this WARN_ON is harmless and in reality it may be triggered,
      so convert it to printk() and dump_stack() to avoid confusing
      people.
      
      Also add a comment about the two related races here.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7df938fb
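      A sketch of the converted check (condition simplified; the point is
      reporting the benign race without the heavier WARN_ON machinery):

          /* excerpt from __blk_mq_run_hw_queue(): running on a CPU outside
           * hctx->cpumask can happen in rare hotplug races and is benign */
          if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
                  printk(KERN_WARNING "blk-mq: run queue from wrong CPU %d\n",
                         raw_smp_processor_id());
                  dump_stack();
          }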
    • blk-mq: make sure hctx->next_cpu is set correctly · 7bed4595
      Committed by Ming Lei
      When hctx->next_cpu is selected from the possible online CPUs, there
      is a race in which hctx->next_cpu may end up set to a value
      >= nr_cpu_ids, which eventually breaks the workqueue.
      
      The race can be triggered in the following two situations:
      
      1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
      to dispatch requests from the DEAD cpu context, but at that
      time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
      CPUs in hctx->cpumask may become offline, causing hctx->next_cpu to be
      set to a bad value.
      
      2) blk_mq_delay_run_hw_queue() is called from CPU B and finds the queue
      should be run on the other CPU A; then CPU A may become offline at the
      same time and all CPUs in hctx->cpumask become offline.
      
      This patch deals with this issue by re-selecting the next CPU, and
      making sure it is set correctly.
      
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: N"jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: N"jianchao.wang" <jianchao.w.wang@oracle.com>
      Fixes: 20e4d813 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7bed4595
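      A sketch of the re-selection, assuming helpers along these lines (the
      real patch also retries the selection and falls back to an unbound
      workqueue when every CPU in the mask went offline):

          /* pick the next online CPU in hctx->cpumask, wrapping around */
          next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
                                      cpu_online_mask);
          if (next_cpu >= nr_cpu_ids)
                  next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
          if (next_cpu >= nr_cpu_ids)
                  return WORK_CPU_UNBOUND;        /* all offline: unbound */
          hctx->next_cpu = next_cpu;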