1. 12 December 2018, 1 commit
    • block: deactivate blk_stat timer in wbt_disable_default() · 544fbd16
      Committed by Ming Lei
      rwb_enabled() can't be changed when there is any inflight IO.
      
      wbt_disable_default() may set rwb->wb_normal to zero, but the
      blk_stat timer may still be pending, and the timer function will update
      rwb->wb_normal again.
      
      This patch introduces blk_stat_deactivate() and applies it in
      wbt_disable_default(), which fixes the following IO hang triggered when
      running parted & switching the io scheduler:
      
      [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
      [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
      [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  369.940768] parted          D    0  3645   3239 0x00000000
      [  369.941500] Call Trace:
      [  369.941874]  ? __schedule+0x6d9/0x74c
      [  369.942392]  ? wbt_done+0x5e/0x5e
      [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
      [  369.943404]  ? wbt_done+0x5e/0x5e
      [  369.943874]  schedule+0x67/0x78
      [  369.944298]  io_schedule+0x12/0x33
      [  369.944771]  rq_qos_wait+0xb5/0x119
      [  369.945193]  ? karma_partition+0x1c2/0x1c2
      [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
      [  369.946151]  wbt_wait+0x85/0xb6
      [  369.946540]  __rq_qos_throttle+0x23/0x2f
      [  369.947014]  blk_mq_make_request+0xe6/0x40a
      [  369.947518]  generic_make_request+0x192/0x2fe
      [  369.948042]  ? submit_bio+0x103/0x11f
      [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
      [  369.949011]  submit_bio+0x103/0x11f
      [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
      [  369.949962]  submit_bio_wait+0x53/0x7f
      [  369.950469]  blkdev_issue_flush+0x8a/0xae
      [  369.951032]  blkdev_fsync+0x2f/0x3a
      [  369.951502]  do_fsync+0x2e/0x47
      [  369.951887]  __x64_sys_fsync+0x10/0x13
      [  369.952374]  do_syscall_64+0x89/0x149
      [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  369.953492] RIP: 0033:0x7f95a1e729d4
      [  369.953996] Code: Bad RIP value.
      [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
      [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
      [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
      [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
      [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
      
      Cc: stable@vger.kernel.org
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 10 December 2018, 4 commits
  3. 08 December 2018, 18 commits
    • blk-mq: re-build queue map in case of kdump kernel · 59388702
      Committed by Ming Lei
      Now almost all .map_queues() implementations based on managed irq
      affinity don't update the queue mapping; they just retrieve the
      previously built mapping, so if nr_hw_queues changes, the mapping table
      contains stale entries. Only blk_mq_map_queues() can rebuild the
      mapping table.
      
      One case is that we limit .nr_hw_queues to 1 for the kdump kernel.
      However, drivers often build the queue mapping before allocating the
      tagset via pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues may
      be set to 1 for the kdump kernel, so the wrong queue mapping is used and
      a kernel panic[1] is observed during boot.
      
      This patch fixes the kernel panic triggered on nvme by rebuilding the
      mapping table via blk_mq_map_queues().
      
      [1] kernel panic log
      [    4.438371] nvme nvme0: 16/0/0 default/read/poll queues
      [    4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
      [    4.444681] PGD 0 P4D 0
      [    4.445367] Oops: 0000 [#1] SMP NOPTI
      [    4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
      [    4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
      [    4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [    4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
      [    4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
      [    4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286
      [    4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001
      [    4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001
      [    4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002
      [    4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538
      [    4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001
      [    4.469220] FS:  0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000
      [    4.471554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0
      [    4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    4.477061] PKRU: 55555554
      [    4.477464] Call Trace:
      [    4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
      [    4.479595]  blk_mq_init_queue+0x32/0x4e
      [    4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
      [    4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
      [    4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
      [    4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
      [    4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
      [    4.483930]  ? try_to_wake_up+0x38d/0x3b3
      [    4.484478]  ? process_one_work+0x179/0x2fc
      [    4.485118]  process_one_work+0x1d3/0x2fc
      [    4.485655]  ? rescuer_thread+0x2ae/0x2ae
      [    4.486196]  worker_thread+0x1e9/0x2be
      [    4.486841]  kthread+0x115/0x11d
      [    4.487294]  ? kthread_park+0x76/0x76
      [    4.487784]  ret_from_fork+0x3a/0x50
      [    4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
      [    4.489428] Dumping ftrace buffer:
      [    4.489939]    (ftrace buffer empty)
      [    4.490492] CR2: 0000000000000098
      [    4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: linux-nvme@lists.infradead.org
      Cc: David Milburn <dmilburn@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: convert io-latency to use rq_qos_wait · d3fcdff1
      Committed by Josef Bacik
      Now that we have this common helper, convert io-latency over to use it
      as well.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: convert wbt_wait() to use rq_qos_wait() · b6c7b58f
      Committed by Josef Bacik
      Now that we have rq_qos_wait() in place, convert wbt_wait() over to
      using it with its specific callbacks.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add rq_qos_wait to rq_qos · 84f60324
      Committed by Josef Bacik
      Originally when I split out the common code from blk-wbt into rq_qos I
      left the wbt_wait() where it was and simply copied and modified it
      slightly to work for io-latency.  However they are both basically the
      same thing, and as time has gone on wbt_wait() has ended up much smarter
      and kinder than it was when I copied it into io-latency, which means
      io-latency has lost out on these improvements.
      
      Since they are the same thing essentially except for a few minor things,
      create rq_qos_wait() that replicates what wbt_wait() currently does with
      callbacks that can be passed in for the snowflakes to do their own thing
      as appropriate.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: rename blkg_try_get() to blkg_tryget() · 7754f669
      Committed by Dennis Zhou
      blkg reference counting now uses percpu_ref rather than atomic_t. Let's
      make this consistent with css_tryget. This renames blkg_try_get to
      blkg_tryget, which now returns a bool rather than the blkg or %NULL.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: change blkg reference counting to use percpu_ref · 7fcf2b03
      Committed by Dennis Zhou
      Every bio is now associated with a blkg putting blkg_get, blkg_try_get,
      and blkg_put on the hot path. Switch over the refcnt in blkg to use
      percpu_ref.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove bio_disassociate_task() · 6f70fb66
      Committed by Dennis Zhou
      Now that a bio only holds a blkg reference, cleanup is simply putting
      back that reference. Remove bio_disassociate_task() as it just calls
      bio_disassociate_blkg(), and call the latter directly.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove additional reference to the css · fc5a828b
      Committed by Dennis Zhou
      The previous patch in this series removed carrying around a pointer to
      the css in blkg. However, the blkg association logic still relied on
      taking a reference on the css to ensure we wouldn't fail in getting a
      reference for the blkg.
      
      Here the implicit dependency on the css is removed. The association
      continues to rely on the tryget logic walking up the blkg tree. This
      streamlines the three ways that association can happen: normal, swap,
      and writeback.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove bio->bi_css and instead use bio->bi_blkg · db6638d7
      Committed by Dennis Zhou
      Prior patches ensured that any bio that interacts with a request_queue
      is properly associated with a blkg. This makes bio->bi_css unnecessary
      as blkg maintains a reference to blkcg already.
      
      This removes the bio field bi_css and transfers corresponding uses to
      access via bi_blkg.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: associate writeback bios with a blkg · fd42df30
      Committed by Dennis Zhou
      One of the goals of this series is to remove a separate reference to
      the css of the bio. This can and should be accessed via bio_blkcg(). In
      this patch, wbc_init_bio() now requires a bio to have a device
      associated with it.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: associate a blkg for pages being evicted by swap · 6a7f6d86
      Committed by Dennis Zhou
      A prior patch in this series added blkg association to bios issued by
      cgroups. There are two other paths that we want to attribute work back
      to the appropriate cgroup: swap and writeback. Here we modify the way
      swap tags bios to include the blkg. Writeback will be tackled in the
      next patch.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: consolidate bio_issue_init() to be a part of core · e439bedf
      Committed by Dennis Zhou
      bio_issue_init among other things initializes the timestamp for an IO.
      Rather than have this logic handled by policies, this consolidates it to
      be on the init paths (normal, clone, bounce clone).
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: associate blkg when associating a device · 5cdf2e3f
      Committed by Dennis Zhou
      Previously, blkg association was handled by controller specific code in
      blk-throttle and blk-iolatency. However, because a blkg represents a
      relationship between a blkcg and a request_queue, it makes sense to keep
      the blkg->q and bio->bi_disk->queue consistent.
      
      This patch moves association into the bio_set_dev() macro. This should
      cover the majority of cases where the device is set/changed keeping the
      two pointers consistent. Fallback code is added to
      blkcg_bio_issue_check() to catch any missing paths.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • dm: set the static flush bio device on demand · 892ad71f
      Committed by Dennis Zhou
      The next patch changes the macro bio_set_dev() to associate a bio with a
      blkg based on the device set. However, dm creates a static bio to be
      used as the basis for cloning empty flush bios on creation. The
      bio_set_dev() call in alloc_dev() will cause problems with the next
      patch adding association to bio_set_dev() because the call is before the
      bdev is associated with a gendisk (bd_disk is %NULL). To get around
      this, set the device on the static bio every time and use that to clone
      to the other bios.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: introduce common blkg association logic · 2268c0fe
      Committed by Dennis Zhou
      There are 3 ways blkg association can happen: association with the
      current css, with the page css (swap), or from the wbc css (writeback).
      
      This patch handles how association is done for the first case, where we
      are associating based on the current css. If there is already a blkg
      associated, the css will be reused and association will be redone as the
      request_queue may have changed.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: convert blkg_lookup_create() to find closest blkg · beea9da0
      Committed by Dennis Zhou
      There are several scenarios where blkg_lookup_create() can fail, such as
      the blkcg dying, the request_queue dying, or simply being OOM. Most
      handle this by simply falling back to the q->root_blkg and calling it a
      day.
      
      This patch implements the notion of closest blkg. During
      blkg_lookup_create(), if it fails to create, return the closest blkg
      found or the q->root_blkg. blkg_try_get_closest() is introduced and used
      during association so a bio is always attached to a blkg.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: update blkg_lookup_create() to do locking · b978962a
      Committed by Dennis Zhou
      To know when to create a blkg, the general pattern is to do a
      blkg_lookup() and if that fails, lock and do the lookup again, and if
      that fails finally create. It doesn't make much sense for everyone who
      wants to do creation to write this themselves.
      
      This changes blkg_lookup_create() to do locking and implement this
      pattern. The old blkg_lookup_create() is renamed to
      __blkg_lookup_create().  If a call site wants to do its own error
      handling or already owns the queue lock, they can use
      __blkg_lookup_create(). This will be used in upcoming patches.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: fix ref count issue with bio_blkcg() using task_css · 0fe061b9
      Committed by Dennis Zhou
      The bio_blkcg() function turns out to be inconsistent and consequently
      dangerous to use. The first part returns a blkcg where a reference is
      owned by the bio meaning it does not need to be rcu protected. However,
      the third case, the last line, is problematic:
      
      	return css_to_blkcg(task_css(current, io_cgrp_id));
      
      This can race against task migration and the cgroup dying. It is also
      semantically different as it must be called rcu protected and is
      susceptible to failure when trying to get a reference to it.
      
      This patch adds association ahead of calling bio_blkcg() rather than
      after. This makes association a required and explicit step along the
      code paths for calling bio_blkcg(). In blk-iolatency, association is
      moved above the bio_blkcg() call to ensure it will not return %NULL.
      
      BFQ uses the old bio_blkcg() function, but I do not want to address it
      in this series due to the complexity. I have created a private version
      documenting the inconsistency and noting not to use it.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 07 December 2018, 2 commits
    • blk-mq: punt failed direct issue to dispatch list · c616cbee
      Committed by Jens Axboe
      After the direct dispatch corruption fix, we permanently disallow direct
      dispatch of non read/write requests. This works fine off the normal IO
      path, as they will be retried like any other failed direct dispatch
      request. But for the blk_insert_cloned_request() that only DM uses to
      bypass the bottom level scheduler, we always first attempt direct
      dispatch. For some types of requests, that's now a permanent failure,
      and no amount of retrying will make that succeed. This results in a
      livelock.
      
      Instead of making special cases for what we can direct issue, and now
      having to deal with DM solving the livelock while still retaining a BUSY
      condition feedback loop, always just add a request that has been through
      ->queue_rq() to the hardware queue dispatch list. These are safe to use
      as no merging can take place there. Additionally, if requests do have
      prepped data from drivers, we aren't dependent on them not sharing space
      in the request structure to safely add them to the IO scheduler lists.
      
      This basically reverts ffe81d45 and is based on a patch from Ming,
      but with the list insert case covered as well.
      
      Fixes: ffe81d45 ("blk-mq: fix corruption with direct issue")
      Cc: stable@vger.kernel.org
      Suggested-by: Ming Lei <ming.lei@redhat.com>
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: fix decrement of num_active_groups · ba7aeae5
      Committed by Paolo Valente
      Since commit '2d29c9f8 ("block, bfq: improve asymmetric scenarios
      detection")', if there are process groups with I/O requests waiting for
      completion, then BFQ tags the scenario as 'asymmetric'. This detection
      is needed for preserving service guarantees (for details, see the
      comments on the computation of the variable asymmetric_scenario in the
      function bfq_better_to_idle).
      
      Unfortunately, commit '2d29c9f8 ("block, bfq: improve asymmetric
      scenarios detection")' contains an error exactly in the updating of
      the number of groups with I/O requests waiting for completion: if a
      group has more than one descendant process, then the above number of
      groups, which is renamed from num_active_groups to a more appropriate
      num_groups_with_pending_reqs by this commit, may happen to be wrongly
      decremented multiple times, namely every time one of the descendant
      processes gets all its pending I/O requests completed.
      
      A correct, complete solution should work as follows. Consider a group
      that is inactive, i.e., that has no descendant process with pending
      I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
      is still accounting for this group, because the group still has some
      descendant process with some I/O request still in
      flight. num_groups_with_pending_reqs should be decremented when the
      in-flight request of the last descendant process is finally completed
      (assuming that nothing else has changed for the group in the meantime,
      in terms of composition of the group and active/inactive state of
      child groups and processes). To accomplish this, an additional
      pending-request counter must be added to entities, and must be
      updated correctly.
      
      To avoid this additional field and operations, this commit resorts to
      the following tradeoff between simplicity and accuracy: for an
      inactive group that is still counted in num_groups_with_pending_reqs,
      this commit decrements num_groups_with_pending_reqs when the first
      descendant process of the group remains with no request waiting for
      completion.
      
      This simplified scheme provides a fix to the unbalanced decrements
      introduced by 2d29c9f8. Since this error was also caused by lack
      of comments on this non-trivial issue, this commit also adds related
      comments.
      
      Fixes: 2d29c9f8 ("block, bfq: improve asymmetric scenarios detection")
      Reported-by: Steven Barrett <steven@liquorix.net>
      Tested-by: Steven Barrett <steven@liquorix.net>
      Tested-by: Lucjan Lucjanov <lucjan.lucjanov@gmail.com>
      Reviewed-by: Federico Motta <federico@willer.it>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 05 December 2018, 5 commits
  6. 04 December 2018, 1 commit
  7. 02 December 2018, 1 commit
  8. 01 December 2018, 2 commits
    • sbitmap: optimize wakeup check · 5d2ee712
      Committed by Jens Axboe
      Even if we have no waiters on any of the sbitmap_queue wait states, we
      still have to loop every entry to check. We do this for every IO, so
      the cost adds up.
      
      Shift a bit of the cost to the slow path, when we actually have waiters.
      Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain
      an internal count of how many are currently active. Then we can simply
      check this count in sbq_wake_ptr() and not have to loop if we don't
      have any sleepers.
      
      Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix single range discard merge · 2a5cf35c
      Committed by Ming Lei
      There are actually two kinds of discard merge:
      
      - one is the normal discard merge, just like normal read/write request,
      and call it single-range discard
      
      - another is the multi-range discard, queue_max_discard_segments(rq->q) > 1
      
      For the former case, queue_max_discard_segments(rq->q) is 1, and we
      should handle this kind of discard merge like the normal read/write
      request.
      
      This patch fixes the following kernel panic issue[1], which is caused by
      not removing the single-range discard request from the elevator queue.
      Guangwu has one raid discard test case, in which this issue is a bit
      easier to trigger, and I verified that this patch can fix the kernel
      panic issue in Guangwu's test case.
      
      [1] kernel panic log from Jens's report
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
       PGD 0 P4D 0.
       Oops: 0000 [#1] SMP PTI
        CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14
        Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017
        Workqueue: kblockd blk_mq_run_work_fn
        RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
        Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02
        RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246
        RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
       RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
       RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
       R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
       R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
       FS:  0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        blk_mq_dispatch_rq_list+0xec/0x480
        ? elv_rb_del+0x11/0x30
        blk_mq_do_dispatch_sched+0x6e/0xf0
        blk_mq_sched_dispatch_requests+0xfa/0x170
        __blk_mq_run_hw_queue+0x5f/0xe0
        process_one_work+0x154/0x350
        worker_thread+0x46/0x3c0
        kthread+0xf5/0x130
        ? process_one_work+0x350/0x350
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x1f/0x30
        Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4
        CR2: 0000000000000148
      
       ---[ end trace 340a1fb996df1b9b ]---
       RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
        Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 f6 87 b0 00 00 00 02
      
      Fixes: 445251d0 ("blk-mq: fix discard merge with scheduler attached")
      Reported-by: NJens Axboe <axboe@kernel.dk>
      Cc: Guangwu Zhang <guazhang@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2a5cf35c
  9. 30 November 2018, 4 commits
  10. 29 November 2018, 2 commits