1. 21 1月, 2017 1 次提交
    • J
      blk-mq: allow resize of scheduler requests · 70f36b60
      Jens Axboe 提交于
      Add support for growing the tags associated with a hardware queue, for
      the scheduler tags. Currently we only support resizing within the
      limits of the original depth, change that so we can grow it as well by
      allocating and replacing the existing scheduler tag set.
      
      This is similar to how we could increase the software queue depth with
      the legacy IO stack and schedulers.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      70f36b60
  2. 19 1月, 2017 1 次提交
  3. 18 1月, 2017 2 次提交
  4. 17 9月, 2016 4 次提交
  5. 15 9月, 2016 2 次提交
  6. 08 7月, 2016 1 次提交
  7. 13 4月, 2016 2 次提交
  8. 02 12月, 2015 1 次提交
  9. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  10. 15 10月, 2015 1 次提交
  11. 10 10月, 2015 1 次提交
    • K
      blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c · 8ee1b7b9
      Kosuke Tatsukawa 提交于
      blk_mq_tag_update_depth() seems to be missing a memory barrier which
      might cause the waker to not notice the waiter and fail to send a
      wake_up as in the following figure.
      
      	blk_mq_tag_update_depth			bt_get
      ------------------------------------------------------------------------
      if (waitqueue_active(&bs->wait))
      /* The CPU might reorder the test for
         the waitqueue up here, before
         prior writes complete */
      					prepare_to_wait(&bs->wait, &wait,
      					  TASK_UNINTERRUPTIBLE);
      					tag = __bt_get(hctx, bt, last_tag,
      					  tags);
      					/* Value set in bt_update_count not
      					   visible yet */
      bt_update_count(&tags->bitmap_tags, tdepth);
      /* blk_mq_tag_wakeup_all(tags, false); */
       bt = &tags->bitmap_tags;
       wake_index = atomic_read(&bt->wake_index);
      					...
      					io_schedule();
      ------------------------------------------------------------------------
      
      This patch adds the missing memory barrier.
      
      I found this issue when I was looking through the linux source code
      for places calling waitqueue_active() before wake_up*(), but without
      preceding memory barriers, after sending a patch to fix a similar
      issue in drivers/tty/n_tty.c  (Details about the original issue can be
      found here: https://lkml.org/lkml/2015/9/28/849).
      Signed-off-by: NKosuke Tatsukawa <tatsu@ab.jp.nec.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ee1b7b9
  12. 01 10月, 2015 1 次提交
  13. 15 8月, 2015 1 次提交
    • M
      blk-mq: fix race between timeout and freeing request · 0048b483
      Ming Lei 提交于
      Inside timeout handler, blk_mq_tag_to_rq() is called
      to retrieve the request from one tag. This way is obviously
      wrong because the request can be freed any time and some
      fiedds of the request can't be trusted, then kernel oops
      might be triggered[1].
      
      Currently wrt. blk_mq_tag_to_rq(), the only special case is
      that the flush request can share same tag with the request
      cloned from, and the two requests can't be active at the same
      time, so this patch fixes the above issue by updating tags->rqs[tag]
      with the active request(either flush rq or the request cloned
      from) of the tag.
      
      Also blk_mq_tag_to_rq() gets much simplified with this patch.
      
      Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
      make sure the request can't be freed, so in bt_for_each() this
      helper is replaced with tags->rqs[tag].
      
      [1] kernel oops log
      [  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
      [  439.697162] IP: [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
      [  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
      [  439.700653] Dumping ftrace buffer:^M
      [  439.700653]    (ftrace buffer empty)^M
      [  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
      [  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
      [  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
      [  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
      [  439.730500] RIP: 0010:[<ffffffff812d89ba>]  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283^M
      [  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
      [  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
      [  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
      [  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
      [  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
      [  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
      [  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
      [  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
      [  439.730500] Stack:^M
      [  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
      [  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
      [  439.755663] Call Trace:^M
      [  439.755663]  <IRQ> ^M
      [  439.755663]  [<ffffffff812dc104>] bt_for_each+0x6e/0xc8^M
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
      [  439.755663]  [<ffffffff812d9f5e>] ? blk_mq_rq_timed_out+0x6a/0x6a^M
      [  439.755663]  [<ffffffff812dc1b3>] blk_mq_tag_busy_iter+0x55/0x5e^M
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
      [  439.755663]  [<ffffffff812d8911>] blk_mq_rq_timer+0x5d/0xd4^M
      [  439.755663]  [<ffffffff810a3e10>] call_timer_fn+0xf7/0x284^M
      [  439.755663]  [<ffffffff810a3d1e>] ? call_timer_fn+0x5/0x284^M
      [  439.755663]  [<ffffffff812d88b4>] ? blk_mq_bio_to_request+0x38/0x38^M
      [  439.755663]  [<ffffffff810a46d6>] run_timer_softirq+0x1ce/0x1f8^M
      [  439.755663]  [<ffffffff8104c367>] __do_softirq+0x181/0x3a4^M
      [  439.755663]  [<ffffffff8104c76e>] irq_exit+0x40/0x94^M
      [  439.755663]  [<ffffffff81031482>] smp_apic_timer_interrupt+0x33/0x3e^M
      [  439.755663]  [<ffffffff815559a4>] apic_timer_interrupt+0x84/0x90^M
      [  439.755663]  <EOI> ^M
      [  439.755663]  [<ffffffff81554350>] ? _raw_spin_unlock_irq+0x32/0x4a^M
      [  439.755663]  [<ffffffff8106a98b>] finish_task_switch+0xe0/0x163^M
      [  439.755663]  [<ffffffff8106a94d>] ? finish_task_switch+0xa2/0x163^M
      [  439.755663]  [<ffffffff81550066>] __schedule+0x469/0x6cd^M
      [  439.755663]  [<ffffffff8155039b>] schedule+0x82/0x9a^M
      [  439.789267]  [<ffffffff8119b28b>] signalfd_read+0x186/0x49a^M
      [  439.790911]  [<ffffffff8106d86a>] ? wake_up_q+0x47/0x47^M
      [  439.790911]  [<ffffffff811618c2>] __vfs_read+0x28/0x9f^M
      [  439.790911]  [<ffffffff8117a289>] ? __fget_light+0x4d/0x74^M
      [  439.790911]  [<ffffffff811620a7>] vfs_read+0x7a/0xc6^M
      [  439.790911]  [<ffffffff8116292b>] SyS_read+0x49/0x7f^M
      [  439.790911]  [<ffffffff81554c17>] entry_SYSCALL_64_fastpath+0x12/0x6f^M
      [  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
      f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
      53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
      ^M
      [  439.790911] RIP  [<ffffffff812d89ba>] blk_mq_tag_to_rq+0x21/0x6e^M
      [  439.790911]  RSP <ffff880819203da0>^M
      [  439.790911] CR2: 0000000000000158^M
      [  439.790911] ---[ end trace d40af58949325661 ]---^M
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0048b483
  14. 02 6月, 2015 1 次提交
    • K
      blk-mq: Shared tag enhancements · f26cdc85
      Keith Busch 提交于
      Storage controllers may expose multiple block devices that share hardware
      resources managed by blk-mq. This patch enhances the shared tags so a
      low-level driver can access the shared resources not tied to the unshared
      h/w contexts. This way the LLD can dynamically add and delete disks and
      request queues without having to track all the request_queue hctx's to
      iterate outstanding tags.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f26cdc85
  15. 19 3月, 2015 1 次提交
  16. 12 2月, 2015 1 次提交
  17. 24 1月, 2015 1 次提交
    • S
      blk-mq: add tag allocation policy · 24391c0d
      Shaohua Li 提交于
      This is the blk-mq part to support tag allocation policy. The default
      allocation policy isn't changed (though it's not a strict FIFO). The new
      policy is round-robin for libata. But it's a try-best implementation. If
      multiple tasks are competing, the tags returned will be mixed (which is
      unavoidable even with !mq, as requests from different tasks can be
      mixed in queue)
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      24391c0d
  18. 14 1月, 2015 1 次提交
    • J
      blk-mq: fix false negative out-of-tags condition · 0bf36498
      Jens Axboe 提交于
      The blk-mq tagging tries to maintain some locality between CPUs and
      the tags issued. The tags are split into groups of words, and the
      words may not be fully populated. When searching for a new free tag,
      blk-mq may look at partial words, hence it passes in an offset/size
      to find_next_zero_bit(). However, it does that wrong, the size must
      always be the full length of the number of tags in that word,
      otherwise we'll potentially miss some near the end.
      
      Another issue is when __bt_get() goes from one word set to the next.
      It bumps the index, but not the last_tag associated with the
      previous index. Bump that to be in the range of the new word.
      
      Finally, clean up __bt_get() and __bt_get_word() a bit and get
      rid of the goto in there, and the unnecessary 'wrap' variable.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0bf36498
  19. 01 1月, 2015 1 次提交
  20. 15 12月, 2014 1 次提交
  21. 10 12月, 2014 3 次提交
    • B
      blk-mq: Micro-optimize bt_get() · 52f7eb94
      Bart Van Assche 提交于
      Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
      calls into a single call.
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      52f7eb94
    • B
      blk-mq: Fix a race between bt_clear_tag() and bt_get() · c38d185d
      Bart Van Assche 提交于
      What we need is the following two guarantees:
      * Any thread that observes the effect of the test_and_set_bit() by
        __bt_get_word() also observes the preceding addition of 'current'
        to the appropriate wait list. This is guaranteed by the semantics
        of the spin_unlock() operation performed by prepare_and_wait().
        Hence the conversion of test_and_set_bit_lock() into
        test_and_set_bit().
      * The wait lists are examined by bt_clear() after the tag bit has
        been cleared. clear_bit_unlock() guarantees that any thread that
        observes that the bit has been cleared also observes the store
        operations preceding clear_bit_unlock(). However,
        clear_bit_unlock() does not prevent that the wait lists are examined
        before that the tag bit is cleared. Hence the addition of a memory
        barrier between clear_bit() and the wait list examination.
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c38d185d
    • B
      blk-mq: Avoid that __bt_get_word() wraps multiple times · 9e98e9d7
      Bart Van Assche 提交于
      If __bt_get_word() is called with last_tag != 0, if the first
      find_next_zero_bit() fails, if after wrap-around the
      test_and_set_bit() call fails and find_next_zero_bit() succeeds,
      if the next test_and_set_bit() call fails and subsequently
      find_next_zero_bit() does not find a zero bit, then another
      wrap-around will occur. Avoid this by introducing an additional
      local variable.
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      9e98e9d7
  22. 08 12月, 2014 2 次提交
  23. 25 11月, 2014 1 次提交
  24. 12 11月, 2014 1 次提交
  25. 07 10月, 2014 2 次提交
  26. 23 9月, 2014 1 次提交
  27. 18 6月, 2014 3 次提交
    • A
      blk-mq: bitmap tag: fix races in bt_get() function · 86fb5c56
      Alexander Gordeev 提交于
      This update fixes few issues in bt_get() function:
      
      - list_empty(&wait.task_list) check is not protected;
      
      - was_empty check is always true which results in *every* thread
        entering the loop resets bt_wait_state::wait_cnt counter rather
        than every bt->wake_cnt'th thread;
      
      - 'bt_wait_state::wait_cnt' counter update is redundant, since
        it also gets reset in bt_clear_tag() function;
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      86fb5c56
    • A
      blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt · 2971c35f
      Alexander Gordeev 提交于
      This piece of code in bt_clear_tag() function is racy:
      
      	bs = bt_wake_ptr(bt);
      	if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
      		atomic_set(&bs->wait_cnt, bt->wake_cnt);
       		wake_up(&bs->wait);
      	}
      
      Since nothing prevents bt_wake_ptr() from returning the very
      same 'bs' address on multiple CPUs, the following scenario is
      possible:
      
          CPU1                                CPU2
          ----                                ----
      
      0.  bs = bt_wake_ptr(bt);               bs = bt_wake_ptr(bt);
      1.  atomic_dec_and_test(&bs->wait_cnt)
      2.                                      atomic_dec_and_test(&bs->wait_cnt)
      3.  atomic_set(&bs->wait_cnt, bt->wake_cnt);
      
      If the decrement in [1] yields zero then for some amount of time
      the decrement in [2] results in a negative/overflow value, which
      is not expected. The follow-up assignment in [3] overwrites the
      invalid value with the batch value (and likely prevents the issue
      from being severe) which is still incorrect and should be a lesser.
      
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2971c35f
    • A
      blk-mq: bitmap tag: fix races on shared ::wake_index fields · 8537b120
      Alexander Gordeev 提交于
      Fix racy updates of shared blk_mq_bitmap_tags::wake_index
      and blk_mq_hw_ctx::wake_index fields.
      
      Cc: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8537b120
  28. 04 6月, 2014 1 次提交