1. 11 November 2022, 1 commit
    • sbitmap: Use single per-bitmap counting to wake up queued tags · 4f8126bb
      Authored by Gabriel Krisman Bertazi
      sbitmap suffers from code complexity, as demonstrated by recent fixes,
      and eventual lost wakeups on nested I/O completion.  The latter happens,
      from what I understand, due to the non-atomic nature of the updates to
      wait_cnt, which needs to be subtracted and eventually reset when equal
      to zero.  This two-step process can miss an update when a nested
      completion happens to interrupt the CPU in between the wait_cnt
      updates.  This is very hard to fix, as shown by the recent changes to
      this code.
      
      The code complexity arises mostly from the corner cases to avoid missed
      wakes in this scenario.  In addition, the handling of wake_batch
      recalculation plus the synchronization with sbq_queue_wake_up is
      non-trivial.
      
      This patchset implements the idea originally proposed by Jan [1], which
      removes the need for the two-step updates of wait_cnt.  This is done by
      tracking the number of completions and wakeups in always increasing,
      per-bitmap counters.  Instead of having to reset the wait_cnt when it
      reaches zero, we simply keep counting, and attempt to wake up N threads
      in a single wait queue whenever there is enough space for a batch.
      Waking up fewer than wake_batch shouldn't be a problem, because we
      haven't changed the conditions for wake up, and the existing batch
      calculation guarantees at least enough remaining completions to wake up
      a batch for each queue at any time.
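      
      As a minimal sketch (simplified, assumed field names; kernel-style C,
      not the exact upstream code), the scheme boils down to two
      monotonically increasing counters and a cmpxchg loop that claims one
      batch at a time:
      
      struct sbq_wake_counts {
              atomic_t completion_cnt;  /* total completions, only ever grows */
              atomic_t wakeup_cnt;      /* total wakeups issued, only ever grows */
      };
      
      static void sbq_wake_batch_sketch(struct sbq_wake_counts *c,
                                        int wake_batch,
                                        wait_queue_head_t *wq)
      {
              int done, woken;
      
              do {
                      done = atomic_read(&c->completion_cnt);
                      woken = atomic_read(&c->wakeup_cnt);
                      /* No reset to zero: just compare the ever-growing counters. */
                      if (done - woken < wake_batch)
                              return;
              } while (atomic_cmpxchg(&c->wakeup_cnt, woken,
                                      woken + wake_batch) != woken);
      
              /* This caller claimed a full batch; wake up to wake_batch waiters. */
              wake_up_nr(wq, wake_batch);
      }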
      
      Performance-wise, one should expect very similar performance to the
      original algorithm for the case where there is no queueing.  In both the
      old algorithm and this implementation, the first thing is to check
      ws_active, which bails out if there is no queueing to be managed. In the
      new code, we took care to avoid accounting completions and wakeups when
      there is no queueing, so as not to pay the cost of atomic operations
      unnecessarily; skipping the accounting there doesn't skew the counters.
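      
      In sketch form (assumed field and helper names, continuing the sketch
      above), the completion-side entry point is:
      
      static void sbitmap_queue_wake_up_sketch(struct sbitmap_queue *sbq, int nr)
      {
              if (!atomic_read(&sbq->ws_active))
                      return;         /* no waiters: pay no atomic cost at all */
      
              atomic_add(nr, &sbq->completion_cnt);   /* assumed field */
              __sbitmap_queue_wake_up(sbq, nr);       /* wake a batch if available */
      }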
      
      For more interesting cases, where there is queueing, we need to take
      into account the cross-communication of the atomic operations.  I've
      been benchmarking by running parallel fio jobs against a single hctx
      nullb in different hardware queue depth scenarios, and verifying both
      IOPS and queueing.
      
      Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
      jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
      varying only the hardware queue length per test.
      
      IOPS, mean (stddev) over the 5 runs:
      
      queue size   2                 4                 8                 16                 32                 64
      6.1-rc2      1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
      patched      1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
      
      The following is a similar experiment, run against a nullb with a single
      bitmap shared by 20 hctxs spread across 2 NUMA nodes, with 40 parallel
      fio jobs operating on the same device.
      
      queue size   2                 4                 8                 16                 32                  64
      6.1-rc2      1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K)    6178.2K (124.6K)   12227.9K (37.7K)    13286.6K (92.9K)
      patched      1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K)    6151.4K (20.0K)    11893.6K (17.5K)    12385.6K (18.4K)
      
      It has also survived blktests and a 12h stress run against nullb. I also
      ran the code against nvme and a scsi SSD, and I didn't observe
      performance regression in those. If there are other tests you think I
      should run, please let me know and I will follow up with results.
      
      [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/
      
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Liu Song <liusong@linux.alibaba.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 12 October 2022, 2 commits
    • treewide: use prandom_u32_max() when possible, part 2 · 8b3ccbc1
      Authored by Jason A. Donenfeld
      Rather than incurring a division or requesting too many random bytes for
      the given range, use the prandom_u32_max() function, which only takes
      the minimum required bytes from the RNG and avoids divisions. This was
      done by hand, covering things that coccinelle could not do on its own.
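      
      For reference, the helper avoids the division by scaling a 32-bit
      random value into the range with a multiply and a shift, essentially
      (as defined in include/linux/prandom.h at the time):
      
      static inline u32 prandom_u32_max(u32 ep_ro)
      {
              /* Map a uniform 32-bit value into [0, ep_ro) without dividing. */
              return (u32)(((u64) prandom_u32() * ep_ro) >> 32);
      }
      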
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext2, ext4, and sbitmap
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • treewide: use prandom_u32_max() when possible, part 1 · 81895a65
      Authored by Jason A. Donenfeld
      Rather than incurring a division or requesting too many random bytes for
      the given range, use the prandom_u32_max() function, which only takes
      the minimum required bytes from the RNG and avoids divisions. This was
      done mechanically with this coccinelle script:
      
      @basic@
      expression E;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u64;
      @@
      (
      - ((T)get_random_u32() % (E))
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ((E) - 1))
      + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
      |
      - ((u64)(E) * get_random_u32() >> 32)
      + prandom_u32_max(E)
      |
      - ((T)get_random_u32() & ~PAGE_MASK)
      + prandom_u32_max(PAGE_SIZE)
      )
      
      @multi_line@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      identifier RAND;
      expression E;
      @@
      
      -       RAND = get_random_u32();
              ... when != RAND
      -       RAND %= (E);
      +       RAND = prandom_u32_max(E);
      
      // Find a potential literal
      @literal_mask@
      expression LITERAL;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      position p;
      @@
      
              ((T)get_random_u32()@p & (LITERAL))
      
      // Add one to the literal.
      @script:python add_one@
      literal << literal_mask.LITERAL;
      RESULT;
      @@
      
      value = None
      if literal.startswith('0x'):
              value = int(literal, 16)
      elif literal[0] in '123456789':
              value = int(literal, 10)
      if value is None:
              print("I don't know how to handle %s" % (literal))
              cocci.include_match(False)
      elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
              print("Skipping 0x%x for cleanup elsewhere" % (value))
              cocci.include_match(False)
      elif value & (value + 1) != 0:
              print("Skipping 0x%x because it's not a power of two minus one" % (value))
              cocci.include_match(False)
      elif literal.startswith('0x'):
              coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
      else:
              coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
      
      // Replace the literal mask with the calculated result.
      @plus_one@
      expression literal_mask.LITERAL;
      position literal_mask.p;
      expression add_one.RESULT;
      identifier FUNC;
      @@
      
      -       (FUNC()@p & (LITERAL))
      +       prandom_u32_max(RESULT)
      
      @collapse_ret@
      type T;
      identifier VAR;
      expression E;
      @@
      
       {
      -       T VAR;
      -       VAR = (E);
      -       return VAR;
      +       return E;
       }
      
      @drop_var@
      type T;
      identifier VAR;
      @@
      
       {
      -       T VAR;
              ... when != VAR
       }
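      
      For instance, the @basic@ rule rewrites a modulo-based call site into
      the bounded helper (illustrative C, not taken from the patch):
      
      /* Before: */
      slot = prandom_u32() % nr_slots;
      
      /* After: */
      slot = prandom_u32_max(nr_slots);
      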
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Reviewed-by: KP Singh <kpsingh@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
      Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
      Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
      Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  3. 30 September 2022, 1 commit
    • sbitmap: fix lockup while swapping · 30514bd2
      Authored by Hugh Dickins
      Commit 4acb8341 ("sbitmap: fix batched wait_cnt accounting")
      is a big improvement: without it, I had to revert to before commit
      040b83fc ("sbitmap: fix possible io hung due to lost wakeup")
      to avoid the high system time and freezes which that had introduced.
      
      Now okay on the NVME laptop, but 4acb8341 is a disaster for heavy
      swapping (kernel builds in low memory) on another: soon locking up in
      sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
      around with waitqueue_active() but wait_cnt 0.  Here is a backtrace,
      showing the common pattern of outer sbitmap_queue_wake_up() interrupted
      before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
      are idle, in other cases they're spinning for a lock in dd_bio_merge()):
      
      sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
      __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
      scsi_end_request < scsi_io_completion < scsi_finish_command <
      scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
      __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
      _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
      sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
      __blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
      blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
      __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
      submit_bio < __swap_writepage < swap_writepage < pageout <
      shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
      shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
      __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
      do_anonymous_page < __handle_mm_fault < handle_mm_fault <
      do_user_addr_fault < exc_page_fault < asm_exc_page_fault
      
      See how the process-context sbitmap_queue_wake_up() has been interrupted,
      after bringing wait_cnt down to 0 (and in this example, after doing its
      wakeups), before advancing wake_index and refilling wake_cnt: an
      interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
      
      I have almost no grasp of all the possible sbitmap races, and their
      consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
      so it is better if sbq_wake_ptr() skips on to the next ws in that case:
      which fixes the lockup and shows no adverse consequence for me.
      
      The check for wait_cnt being 0 is obviously racy, and ultimately can lead
      to lost wakeups: for example, when there is only a single waitqueue with
      waiters.  However, lost wakeups are unlikely to matter in these cases,
      and a proper fix requires redesign (and benchmarking) of the batched
      wakeup code: so let's plug the hole with this bandaid for now.
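      
      In sketch form, the bandaid makes the waitqueue scan refuse to return
      a ws whose wait_cnt already hit 0 (simplified from the actual change):
      
      static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
      {
              int i, wake_index;
      
              if (!atomic_read(&sbq->ws_active))
                      return NULL;
      
              wake_index = atomic_read(&sbq->wake_index);
              for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
                      struct sbq_wait_state *ws = &sbq->ws[wake_index];
      
                      /* The added wait_cnt check is the fix. */
                      if (waitqueue_active(&ws->wait) &&
                          atomic_read(&ws->wait_cnt) > 0) {
                              if (wake_index != atomic_read(&sbq->wake_index))
                                      atomic_set(&sbq->wake_index, wake_index);
                              return ws;
                      }
      
                      wake_index = sbq_index_inc(wake_index);
              }
      
              return NULL;
      }
      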
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 12 September 2022, 1 commit
  5. 08 September 2022, 2 commits
  6. 04 September 2022, 1 commit
  7. 02 September 2022, 1 commit
  8. 26 August 2022, 1 commit
  9. 23 August 2022, 1 commit
    • sbitmap: fix possible io hung due to lost wakeup · 040b83fc
      Authored by Yu Kuai
      There are two problems that can lead to lost wakeups:
      
      1) invalid wakeup on the wrong waitqueue:
      
      For example, 2 * wake_batch tags are put, while only wake_batch threads
      are woken:
      
      __sbq_wake_up
       atomic_cmpxchg -> reset wait_cnt
      			__sbq_wake_up -> decrease wait_cnt
      			...
      			__sbq_wake_up -> wait_cnt is decreased to 0 again
      			 atomic_cmpxchg
      			 sbq_index_atomic_inc -> increase wake_index
      			 wake_up_nr -> wake up and waitqueue might be empty
       sbq_index_atomic_inc -> increase again, one waitqueue is skipped
       wake_up_nr -> invalid wake up because the old waitqueue might be empty
      
      To fix this problem, increase 'wake_index' before resetting 'wait_cnt'.
      
      2) 'wait_cnt' can be decreased while the waitqueue is empty
      
      As pointed out by Jan Kara, the following race is possible:
      
      CPU1				CPU2
      __sbq_wake_up			 __sbq_wake_up
       sbq_wake_ptr()			 sbq_wake_ptr() -> the same
       wait_cnt = atomic_dec_return()
       /* decreased to 0 */
       sbq_index_atomic_inc()
       /* move to next waitqueue */
       atomic_set()
       /* reset wait_cnt */
       wake_up_nr()
       /* wake up on the old waitqueue */
      				 wait_cnt = atomic_dec_return()
      				 /*
      				  * decrease wait_cnt in the old
      				  * waitqueue, while it can be
      				  * empty.
      				  */
      
      Fix the problem by waking up before updating 'wake_index' and
      'wait_cnt'.
      
      With this patch, note that 'wait_cnt' is still decreased in the old
      empty waitqueue; however, the wakeup is redirected to an active
      waitqueue, and the extra decrement on the old empty waitqueue is not
      handled.
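      
      A sketch of the resulting order in __sbq_wake_up() (simplified,
      omitting the wake_batch recalculation and error paths):
      
      wait_cnt = atomic_dec_return(&ws->wait_cnt);
      if (wait_cnt > 0)
              return false;
      
      wake_batch = READ_ONCE(sbq->wake_batch);
      /* Wake on this waitqueue first, while waiters are still parked on it... */
      wake_up_nr(&ws->wait, wake_batch);
      /* ...then direct subsequent wakers to the next waitqueue... */
      sbq_index_atomic_inc(&sbq->wake_index);
      /* ...and only then re-arm the count for this waitqueue. */
      atomic_set(&ws->wait_cnt, wake_batch);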
      
      Fixes: 88459642 ("blk-mq: abstract tag allocation out into sbitmap library")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220803121504.212071-1-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 26 June 2022, 1 commit
  11. 22 March 2022, 1 commit
  12. 08 February 2022, 2 commits
  13. 28 January 2022, 1 commit
  14. 14 January 2022, 1 commit
  15. 26 October 2021, 1 commit
    • sbitmap: silence data race warning · 9f8b93a7
      Authored by Jens Axboe
      KCSAN complains about the sbitmap hint update:
      
      ==================================================================
      BUG: KCSAN: data-race in sbitmap_queue_clear / sbitmap_queue_clear
      
      write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 1:
       sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
       blk_mq_put_tag+0x82/0x90
       __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
       blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
       __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
       blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
       lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
       blk_complete_reqs block/blk-mq.c:584 [inline]
       blk_done_softirq+0x69/0x90 block/blk-mq.c:589
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
       smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
       kthread+0x262/0x280 kernel/kthread.c:319
       ret_from_fork+0x1f/0x30
      
      write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 0:
       sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
       blk_mq_put_tag+0x82/0x90
       __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
       blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
       __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
       blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
       lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
       blk_complete_reqs block/blk-mq.c:584 [inline]
       blk_done_softirq+0x69/0x90 block/blk-mq.c:589
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
       smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
       kthread+0x262/0x280 kernel/kthread.c:319
       ret_from_fork+0x1f/0x30
      
      value changed: 0x00000035 -> 0x00000044
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.15.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      ==================================================================
      
      which is a data race, but not an important one. This is just updating the
      percpu alloc hint, and the reader of that hint doesn't ever require it to
      be valid.
      
      Just annotate it with data_race() to silence this one.
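      
      The annotation amounts to wrapping the percpu hint store (sketch,
      assuming the field layout of that kernel version):
      
      /* In sbitmap_queue_clear(): the alloc hint update is benign. */
      if (likely(!sbq->sb.round_robin && nr < sbq->sb.depth))
              data_race(*per_cpu_ptr(sbq->sb.alloc_hint, cpu) = nr);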
      
      Reported-by: syzbot+4f8bfd804b4a1f95b8f6@syzkaller.appspotmail.com
      Acked-by: Marco Elver <elver@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 19 October 2021, 1 commit
  17. 18 October 2021, 1 commit
    • sbitmap: add __sbitmap_queue_get_batch() · 9672b0d4
      Authored by Jens Axboe
      The block layer tag allocation batching still calls into sbitmap to get
      each tag, but we can improve on that. Add __sbitmap_queue_get_batch(),
      which returns a mask of tags all at once, along with an offset for
      those tags.
      
      An example return would be 0xff, where bits 0..7 are set, with
      tag_offset == 128. The valid tags in this case would be 128..135.
      
      A batch is specific to an individual sbitmap_map, hence it cannot be
      larger than that. The requested number of tags is automatically reduced
      to the max that can be satisfied with a single map.
      
      On failure, 0 is returned. The caller should fall back to single tag
      allocation at that point.
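      
      An illustrative caller (sketch; use_tag() is a hypothetical consumer):
      
      unsigned int offset;
      unsigned long mask = __sbitmap_queue_get_batch(sbq, nr_tags, &offset);
      
      if (!mask) {
              /* Batch failed: fall back to single tag allocation. */
              int tag = __sbitmap_queue_get(sbq);
      
              if (tag >= 0)
                      use_tag(tag);
      } else {
              /* Bit i set in mask means tag (offset + i) was allocated. */
              while (mask) {
                      unsigned int i = __ffs(mask);
      
                      use_tag(offset + i);
                      mask &= mask - 1;       /* clear the lowest set bit */
              }
      }
      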
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 09 July 2021, 1 commit
  19. 05 March 2021, 5 commits
  20. 08 December 2020, 4 commits
  21. 02 July 2020, 1 commit
  22. 21 December 2019, 1 commit
    • sbitmap: only queue kyber's wait callback if not already active · df034c93
      Authored by David Jeffery
      Under heavy loads where the kyber I/O scheduler hits the token limits for
      its scheduling domains, kyber can become stuck.  When active requests
      complete, kyber may not be woken up, leaving the I/O requests queued in
      kyber stuck.
      
      This stuck state is due to a race condition with kyber and the sbitmap
      functions it uses to run a callback when enough requests have completed.
      The running of a sbt_wait callback can race with the attempt to insert the
      sbt_wait.  Since sbitmap_del_wait_queue removes the sbt_wait from the list
      first, then sets the sbq field to NULL, kyber can see the item as not on a
      list while the call to sbitmap_add_wait_queue still sees sbq as non-NULL. This
      results in the sbt_wait being inserted onto the wait list but ws_active
      doesn't get incremented.  So the sbitmap queue does not know there is a
      waiter on a wait list.
      
      Since sbitmap doesn't think there is a waiter, kyber may never be
      informed that there are domain tokens available and the I/O never advances.
      With the sbt_wait on a wait list, kyber believes it has an active waiter
      so cannot insert a new waiter when reaching the domain's full state.
      
      This race can be fixed by only adding the sbt_wait to the queue if the
      sbq field is NULL.  If sbq is not NULL, there is already an action active
      which will trigger the re-running of kyber.  Let it run and add the
      sbt_wait to the wait list if still needing to wait.
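      
      A sketch of the fixed insertion (close to the actual change): the
      enqueue moves inside the NULL check, so an sbt_wait already associated
      with an sbq is left alone.
      
      void sbitmap_add_wait_queue(struct sbitmap_queue *sbq,
                                  struct sbq_wait_state *ws,
                                  struct sbq_wait *sbq_wait)
      {
              /* Only insert if this waiter isn't already tied to an sbq. */
              if (!sbq_wait->sbq) {
                      sbq_wait->sbq = sbq;
                      atomic_inc(&sbq->ws_active);
                      add_wait_queue(&ws->wait, &sbq_wait->wait);
              }
      }
      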
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Reported-by: John Pittman <jpittman@redhat.com>
      Tested-by: John Pittman <jpittman@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. 14 November 2019, 1 commit
  24. 02 July 2019, 1 commit
  25. 05 June 2019, 1 commit
  26. 24 May 2019, 1 commit
  27. 26 March 2019, 1 commit
    • sbitmap: order READ/WRITE freed instance and setting clear bit · e6d1fa58
      Authored by Ming Lei
      Inside sbitmap_queue_clear(), once the clear bit is set, it is
      visible to the allocation path immediately. Meanwhile, READs/WRITEs
      on the old associated instance (such as the request in the blk-mq
      case) may be reordered against setting the clear bit, so a race with
      re-allocation may be triggered.
      
      Add one memory barrier to order READs/WRITEs of the freed associated
      instance against setting the clear bit, avoiding the race with
      re-allocation.
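      
      A sketch of where the barrier lands (assumed shape of the deferred
      clear path):
      
      static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb,
                                                    unsigned int bitnr)
      {
              unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)].cleared;
      
              /*
               * Order all prior READs/WRITEs on the freed instance (e.g. the
               * request) before the bit becomes visible to the allocation path.
               */
              smp_mb__before_atomic();
              set_bit(SB_NR_TO_BIT(sb, bitnr), addr);
      }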
      
      The following kernel oops triggered by block/006 on aarch64 may be fixed:
      
      [  142.330954] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000330
      [  142.338794] Mem abort info:
      [  142.341554]   ESR = 0x96000005
      [  142.344632]   Exception class = DABT (current EL), IL = 32 bits
      [  142.350500]   SET = 0, FnV = 0
      [  142.353544]   EA = 0, S1PTW = 0
      [  142.356678] Data abort info:
      [  142.359528]   ISV = 0, ISS = 0x00000005
      [  142.363343]   CM = 0, WnR = 0
      [  142.366305] user pgtable: 64k pages, 48-bit VAs, pgdp = 000000002a3c51c0
      [  142.372983] [0000000000000330] pgd=0000000000000000, pud=0000000000000000
      [  142.379777] Internal error: Oops: 96000005 [#1] SMP
      [  142.384613] Modules linked in: null_blk ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp vfat fat rpcrdma sunrpc rdma_ucm ib_iser rdma_cm iw_cm libiscsi ib_umad scsi_transport_iscsi ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core sbsa_gwdt crct10dif_ce ghash_ce ipmi_ssif sha2_ce ipmi_devintf sha256_arm64 sg sha1_ce ipmi_msghandler ip_tables xfs libcrc32c mlx5_core sdhci_acpi mlxfw ahci_platform at803x sdhci libahci_platform qcom_emac mmc_core hdma hdma_mgmt i2c_dev [last unloaded: null_blk]
      [  142.429753] CPU: 7 PID: 1983 Comm: fio Not tainted 5.0.0.cki #2
      [  142.449458] pstate: 00400005 (nzcv daif +PAN -UAO)
      [  142.454239] pc : __blk_mq_free_request+0x4c/0xa8
      [  142.458830] lr : blk_mq_free_request+0xec/0x118
      [  142.463344] sp : ffff00003360f6a0
      [  142.466646] x29: ffff00003360f6a0 x28: ffff000010e70000
      [  142.471941] x27: ffff801729a50048 x26: 0000000000010000
      [  142.477232] x25: ffff00003360f954 x24: ffff7bdfff021440
      [  142.482529] x23: 0000000000000000 x22: 00000000ffffffff
      [  142.487830] x21: ffff801729810000 x20: 0000000000000000
      [  142.493123] x19: ffff801729a50000 x18: 0000000000000000
      [  142.498413] x17: 0000000000000000 x16: 0000000000000001
      [  142.503709] x15: 00000000000000ff x14: ffff7fe000000000
      [  142.509003] x13: ffff8017dcde09a0 x12: 0000000000000000
      [  142.514308] x11: 0000000000000001 x10: 0000000000000008
      [  142.519597] x9 : ffff8017dcde09a0 x8 : 0000000000002000
      [  142.524889] x7 : ffff8017dcde0a00 x6 : 000000015388f9be
      [  142.530187] x5 : 0000000000000001 x4 : 0000000000000000
      [  142.535478] x3 : 0000000000000000 x2 : 0000000000000000
      [  142.540777] x1 : 0000000000000001 x0 : ffff00001041b194
      [  142.546071] Process fio (pid: 1983, stack limit = 0x000000006460a0ea)
      [  142.552500] Call trace:
      [  142.554926]  __blk_mq_free_request+0x4c/0xa8
      [  142.559181]  blk_mq_free_request+0xec/0x118
      [  142.563352]  blk_mq_end_request+0xfc/0x120
      [  142.567444]  end_cmd+0x3c/0xa8 [null_blk]
      [  142.571434]  null_complete_rq+0x20/0x30 [null_blk]
      [  142.576194]  blk_mq_complete_request+0x108/0x148
      [  142.580797]  null_handle_cmd+0x1d4/0x718 [null_blk]
      [  142.585662]  null_queue_rq+0x60/0xa8 [null_blk]
      [  142.590171]  blk_mq_try_issue_directly+0x148/0x280
      [  142.594949]  blk_mq_try_issue_list_directly+0x9c/0x108
      [  142.600064]  blk_mq_sched_insert_requests+0xb0/0xd0
      [  142.604926]  blk_mq_flush_plug_list+0x16c/0x2a0
      [  142.609441]  blk_flush_plug_list+0xec/0x118
      [  142.613608]  blk_finish_plug+0x3c/0x4c
      [  142.617348]  blkdev_direct_IO+0x3b4/0x428
      [  142.621336]  generic_file_read_iter+0x84/0x180
      [  142.625761]  blkdev_read_iter+0x50/0x78
      [  142.629579]  aio_read.isra.6+0xf8/0x190
      [  142.633409]  __io_submit_one.isra.8+0x148/0x738
      [  142.637912]  io_submit_one.isra.9+0x88/0xb8
      [  142.642078]  __arm64_sys_io_submit+0xe0/0x238
      [  142.646428]  el0_svc_handler+0xa0/0x128
      [  142.650238]  el0_svc+0x8/0xc
      [  142.653104] Code: b9402a63 f9000a7f 3100047f 540000a0 (f9419a81)
      [  142.659202] ---[ end trace 467586bc175eb09d ]---
      
      Fixes: ea86ea2c ("sbitmap: ammortize cost of clearing bits")
      Reported-and-bisected-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
      Cc: Yi Zhang <yi.zhang@redhat.com>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  28. 15 January 2019, 2 commits
    • sbitmap: Protect swap_lock from hardirq · fe76fc6a
      Authored by Ming Lei
      Because blk_mq_get_driver_tag() may be called directly from
      blk_mq_dispatch_rq_list() without holding any lock, a hardirq can
      arrive in the middle and deadlock on swap_lock.
      
      Commit ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq") tries to
      fix this issue by using 'spin_lock_bh', which isn't enough because we
      complete requests directly from hardirq context in the multiqueue case.
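      
      A sketch of the stronger exclusion (simplified): take swap_lock with
      interrupts disabled rather than only disabling softirqs.
      
      static bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
      {
              unsigned long mask, val, flags;
              bool ret = false;
      
              spin_lock_irqsave(&sb->map[index].swap_lock, flags);
      
              if (!sb->map[index].cleared)
                      goto out_unlock;
      
              /* First grab a stable snapshot of the cleared bits, zeroing them. */
              mask = xchg(&sb->map[index].cleared, 0);
      
              /* Then fold the cleared bits back into the allocation word. */
              do {
                      val = sb->map[index].word;
              } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
      
              ret = true;
      out_unlock:
              spin_unlock_irqrestore(&sb->map[index].swap_lock, flags);
              return ret;
      }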
      
      Cc: Clark Williams <williams@redhat.com>
      Fixes: ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq")
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sbitmap: Protect swap_lock from softirqs · 37198768
      Authored by Steven Rostedt (VMware)
      The swap_lock used by sbitmap has a chain with locks taken from softirq,
      but the swap_lock is not protected from being preempted by softirqs.
      
      A chain exists of:
      
       sbq->ws[i].wait -> dispatch_wait_lock -> swap_lock
      
      Where the sbq->ws[i].wait lock can be taken from softirq context, which
      means all locks below it in the chain must also be protected from
      softirqs.
      Reported-by: Clark Williams <williams@redhat.com>
      Fixes: 58ab5e32 ("sbitmap: silence bogus lockdep IRQ warning")
      Fixes: ea86ea2c ("sbitmap: amortize cost of clearing bits")
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  29. 21 December 2018, 1 commit