1. 26 9月, 2018 1 次提交
  2. 21 8月, 2018 1 次提交
    • J
      blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter · f5bbbbe4
      Jianchao Wang 提交于
      For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
      account the inflight requests. It will access the queue_hw_ctx and
      nr_hw_queues w/o any protection. When updating nr_hw_queues and
      blk_mq_in_flight/rw occur concurrently, panic comes up.
      
      Before update nr_hw_queues, the q will be frozen. So we could use
      q_usage_counter to avoid the race. percpu_ref_is_zero is used here
      so that we will not miss any in-flight request. The access to
      nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
      under rcu critical section, __blk_mq_update_nr_hw_queues could use
      synchronize_rcu to ensure the zeroed q_usage_counter to be globally
      visible.
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f5bbbbe4
  3. 09 8月, 2018 1 次提交
    • J
      blk-mq: count the hctx as active before allocating tag · d263ed99
      Jianchao Wang 提交于
      Currently, we count the hctx as active after allocate driver tag
      successfully. If a previously inactive hctx try to get tag first
      time, it may fails and need to wait. However, due to the stale tag
      ->active_queues, the other shared-tags users are still able to
      occupy all driver tags while there is someone waiting for tag.
      Consequently, even if the previously inactive hctx is waked up, it
      still may not be able to get a tag and could be starved.
      
      To fix it, we count the hctx as active before try to allocate driver
      tag, then when it is waiting the tag, the other shared-tag users
      will reserve budget for it.
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d263ed99
  4. 03 8月, 2018 2 次提交
    • M
      blk-mq: fix blk_mq_tagset_busy_iter · 2d5ba0e2
      Ming Lei 提交于
      Commit d250bf4e("blk-mq: only iterate over inflight requests
      in blk_mq_tagset_busy_iter") uses 'blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT'
      to replace 'blk_mq_request_started(req)', this way is wrong, and causes
      lots of test system hang during booting.
      
      Fix the issue by using blk_mq_request_started(req) inside bt_tags_iter().
      
      Fixes: d250bf4e ("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter")
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Matt Hart <matthew.hart@linaro.org>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: John Garry <john.garry@huawei.com>
      Cc: Hannes Reinecke <hare@suse.com>,
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: NBart Van Assche <bart.vanassche@wdc.com>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Reported-by: NMark Brown <broonie@kernel.org>
      Reported-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2d5ba0e2
    • M
      blk-mq: fix updating tags depth · 75d6e175
      Ming Lei 提交于
      The passed 'nr' from userspace represents the total depth, meantime
      inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth,
      and 'nr_reserved_tags' stores the reserved part.
      
      There are two issues in blk_mq_tag_update_depth() now:
      
      1) for growing tags, we should have used the passed 'nr', and keep the
      number of reserved tags not changed.
      
      2) the passed 'nr' should have been used for checking against
      'tags->nr_tags', instead of number of the normal part.
      
      This patch fixes the above two cases, and avoids kernel crash caused
      by wrong resizing sbitmap queue.
      
      Cc: "Ewan D. Milne" <emilne@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Tested by: Marco Patalano <mpatalan@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      75d6e175
  5. 14 6月, 2018 1 次提交
  6. 31 5月, 2018 1 次提交
  7. 25 5月, 2018 1 次提交
    • M
      blk-mq: avoid starving tag allocation after allocating process migrates · e6fc4649
      Ming Lei 提交于
      When the allocation process is scheduled back and the mapped hw queue is
      changed, fake one extra wake up on previous queue for compensating wake
      up miss, so other allocations on the previous queue won't be starved.
      
      This patch fixes one request allocation hang issue, which can be
      triggered easily in case of very low nr_request.
      
      The race is as follows:
      
      1) 2 hw queues, nr_requests are 2, and wake_batch is one
      
      2) there are 3 waiters on hw queue 0
      
      3) two in-flight requests in hw queue 0 are completed, and only two
         waiters of 3 are waken up because of wake_batch, but both the two
         waiters can be scheduled to another CPU and cause to switch to hw
         queue 1
      
      4) then the 3rd waiter will wait for ever, since no in-flight request
         is in hw queue 0 any more.
      
      5) this patch fixes it by the fake wakeup when waiter is scheduled to
         another hw queue
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      
      Modified commit message to make it clearer, and make it apply on
      top of the 4.18 branch.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e6fc4649
  8. 23 12月, 2017 1 次提交
    • J
      blk-mq: improve heavily contended tag case · 4e5dff41
      Jens Axboe 提交于
      Even with a number of waitqueues, we can get into a situation where we
      are heavily contended on the waitqueue lock. I got a report on spc1
      where we're spending seconds doing this. Arguably the use case is nasty,
      I reproduce it with one device and 1000 threads banging on the device.
      But that doesn't mean we shouldn't be handling it better.
      
      What ends up happening is that a thread will fail to get a tag, add
      itself to the waitqueue, and subsequently get woken up when a tag is
      freed - only to find itself going back to sleep on the waitqueue.
      
      Instead of waking all threads, use an exclusive wait and wake up our
      sbitmap batch count instead. This seems to work well for me (massive
      improvement for this use case), and it survives basic testing. But I
      haven't fully verified it yet.
      
      An additional improvement is running the queue and checking for a new
      tag BEFORE needing to add ourselves to the waitqueue.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4e5dff41
  9. 19 10月, 2017 2 次提交
  10. 18 8月, 2017 1 次提交
  11. 10 8月, 2017 1 次提交
  12. 15 4月, 2017 1 次提交
  13. 13 3月, 2017 1 次提交
  14. 02 3月, 2017 1 次提交
  15. 28 1月, 2017 1 次提交
  16. 27 1月, 2017 1 次提交
  17. 25 1月, 2017 1 次提交
  18. 21 1月, 2017 1 次提交
    • J
      blk-mq: allow resize of scheduler requests · 70f36b60
      Jens Axboe 提交于
      Add support for growing the tags associated with a hardware queue, for
      the scheduler tags. Currently we only support resizing within the
      limits of the original depth, change that so we can grow it as well by
      allocating and replacing the existing scheduler tag set.
      
      This is similar to how we could increase the software queue depth with
      the legacy IO stack and schedulers.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      70f36b60
  19. 19 1月, 2017 1 次提交
  20. 18 1月, 2017 2 次提交
  21. 17 9月, 2016 4 次提交
  22. 15 9月, 2016 2 次提交
  23. 08 7月, 2016 1 次提交
  24. 13 4月, 2016 2 次提交
  25. 02 12月, 2015 1 次提交
  26. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  27. 15 10月, 2015 1 次提交
  28. 10 10月, 2015 1 次提交
    • K
      blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c · 8ee1b7b9
      Kosuke Tatsukawa 提交于
      blk_mq_tag_update_depth() seems to be missing a memory barrier which
      might cause the waker to not notice the waiter and fail to send a
      wake_up as in the following figure.
      
      	blk_mq_tag_update_depth			bt_get
      ------------------------------------------------------------------------
      if (waitqueue_active(&bs->wait))
      /* The CPU might reorder the test for
         the waitqueue up here, before
         prior writes complete */
      					prepare_to_wait(&bs->wait, &wait,
      					  TASK_UNINTERRUPTIBLE);
      					tag = __bt_get(hctx, bt, last_tag,
      					  tags);
      					/* Value set in bt_update_count not
      					   visible yet */
      bt_update_count(&tags->bitmap_tags, tdepth);
      /* blk_mq_tag_wakeup_all(tags, false); */
       bt = &tags->bitmap_tags;
       wake_index = atomic_read(&bt->wake_index);
      					...
      					io_schedule();
      ------------------------------------------------------------------------
      
      This patch adds the missing memory barrier.
      
      I found this issue when I was looking through the linux source code
      for places calling waitqueue_active() before wake_up*(), but without
      preceding memory barriers, after sending a patch to fix a similar
      issue in drivers/tty/n_tty.c  (Details about the original issue can be
      found here: https://lkml.org/lkml/2015/9/28/849).
      Signed-off-by: NKosuke Tatsukawa <tatsu@ab.jp.nec.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ee1b7b9
  29. 01 10月, 2015 1 次提交
  30. 15 8月, 2015 1 次提交
    • M
      blk-mq: fix race between timeout and freeing request · 0048b483
      Ming Lei 提交于
      Inside timeout handler, blk_mq_tag_to_rq() is called
      to retrieve the request from one tag. This way is obviously
      wrong because the request can be freed any time and some
      fiedds of the request can't be trusted, then kernel oops
      might be triggered[1].
      
      Currently wrt. blk_mq_tag_to_rq(), the only special case is
      that the flush request can share same tag with the request
      cloned from, and the two requests can't be active at the same
      time, so this patch fixes the above issue by updating tags->rqs[tag]
      with the active request(either flush rq or the request cloned
      from) of the tag.
      
      Also blk_mq_tag_to_rq() gets much simplified with this patch.
      
      Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
      make sure the request can't be freed, so in bt_for_each() this
      helper is replaced with tags->rqs[tag].
      
      [1] kernel oops log
      [  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
      [  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
      [  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
      [  439.700653] Dumping ftrace buffer:^M
      [  439.700653]    (ftrace buffer empty)^M
      [  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
      [  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
      [  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
      [  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
      [  439.730500] RIP: 0010:[<ffffffff812d89ba>]  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283^M
      [  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
      [  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
      [  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
      [  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
      [  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
      [  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
      [  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
      [  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
      [  439.730500] Stack:^M
      [  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
      [  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
      [  439.755663] Call Trace:^M
      [  439.755663]  <IRQ> ^M
      [  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
      [  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
      [  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M
      [  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M
      [  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
      [  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M
      [  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M
      [  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94^M
      [  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M
      [  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M
      [  439.755663]  <EOI> ^M
      [  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M
      [  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M
      [  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M
      [  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd^M
      [  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a^M
      [  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M
      [  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M
      [  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M
      [  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M
      [  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M
      [  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f^M
      [  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M
      [  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
      f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
      53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
      ^M
      [  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.790911]  RSP <ffff880819203da0>^M
      [  439.790911] CR2: 0000000000000158^M
      [  439.790911] ---[ end trace d40af58949325661 ]---^M
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0048b483
  31. 02 6月, 2015 1 次提交
    • K
      blk-mq: Shared tag enhancements · f26cdc85
      Keith Busch 提交于
      Storage controllers may expose multiple block devices that share hardware
      resources managed by blk-mq. This patch enhances the shared tags so a
      low-level driver can access the shared resources not tied to the unshared
      h/w contexts. This way the LLD can dynamically add and delete disks and
      request queues without having to track all the request_queue hctx's to
      iterate outstanding tags.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f26cdc85
  32. 19 3月, 2015 1 次提交