1. 28 8月, 2019 1 次提交
  2. 01 5月, 2019 1 次提交
  3. 25 1月, 2019 1 次提交
    • B
      blk-wbt: Declare local functions static · c83f536a
      Bart Van Assche 提交于
      This patch avoids that sparse reports the following warnings:
      
        CHECK   block/blk-wbt.c
      block/blk-wbt.c:600:6: warning: symbol 'wbt_issue' was not declared. Should it be static?
      block/blk-wbt.c:620:6: warning: symbol 'wbt_requeue' was not declared. Should it be static?
        CC      block/blk-wbt.o
      block/blk-wbt.c:600:6: warning: no previous prototype for wbt_issue [-Wmissing-prototypes]
       void wbt_issue(struct rq_qos *rqos, struct request *rq)
            ^~~~~~~~~
      block/blk-wbt.c:620:6: warning: no previous prototype for wbt_requeue [-Wmissing-prototypes]
       void wbt_requeue(struct rq_qos *rqos, struct request *rq)
            ^~~~~~~~~~~
      Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c83f536a
  4. 17 12月, 2018 1 次提交
  5. 12 12月, 2018 1 次提交
    • M
      block: deactivate blk_stat timer in wbt_disable_default() · 544fbd16
      Ming Lei 提交于
      rwb_enabled() can't be changed when there is any inflight IO.
      
      wbt_disable_default() may set rwb->wb_normal as zero, however the
      blk_stat timer may still be pending, and the timer function will update
      wrb->wb_normal again.
      
      This patch introduces blk_stat_deactivate() and applies it in
      wbt_disable_default(), then the following IO hang triggered when running
      parted & switching io scheduler can be fixed:
      
      [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
      [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
      [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  369.940768] parted          D    0  3645   3239 0x00000000
      [  369.941500] Call Trace:
      [  369.941874]  ? __schedule+0x6d9/0x74c
      [  369.942392]  ? wbt_done+0x5e/0x5e
      [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
      [  369.943404]  ? wbt_done+0x5e/0x5e
      [  369.943874]  schedule+0x67/0x78
      [  369.944298]  io_schedule+0x12/0x33
      [  369.944771]  rq_qos_wait+0xb5/0x119
      [  369.945193]  ? karma_partition+0x1c2/0x1c2
      [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
      [  369.946151]  wbt_wait+0x85/0xb6
      [  369.946540]  __rq_qos_throttle+0x23/0x2f
      [  369.947014]  blk_mq_make_request+0xe6/0x40a
      [  369.947518]  generic_make_request+0x192/0x2fe
      [  369.948042]  ? submit_bio+0x103/0x11f
      [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
      [  369.949011]  submit_bio+0x103/0x11f
      [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
      [  369.949962]  submit_bio_wait+0x53/0x7f
      [  369.950469]  blkdev_issue_flush+0x8a/0xae
      [  369.951032]  blkdev_fsync+0x2f/0x3a
      [  369.951502]  do_fsync+0x2e/0x47
      [  369.951887]  __x64_sys_fsync+0x10/0x13
      [  369.952374]  do_syscall_64+0x89/0x149
      [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  369.953492] RIP: 0033:0x7f95a1e729d4
      [  369.953996] Code: Bad RIP value.
      [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
      [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
      [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
      [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
      [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
      
      Cc: stable@vger.kernel.org
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      544fbd16
  6. 08 12月, 2018 1 次提交
  7. 16 11月, 2018 4 次提交
  8. 08 11月, 2018 1 次提交
  9. 12 10月, 2018 1 次提交
  10. 28 8月, 2018 3 次提交
  11. 23 8月, 2018 4 次提交
  12. 15 8月, 2018 1 次提交
  13. 08 8月, 2018 1 次提交
    • A
      blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait · 2887e41b
      Anchal Agarwal 提交于
      I am currently running a large bare metal instance (i3.metal)
      on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
      4.18 kernel. I have a workload that simulates a database
      workload and I am running into lockup issues when writeback
      throttling is enabled,with the hung task detector also
      kicking in.
      
      Crash dumps show that most CPUs (up to 50 of them) are
      all trying to get the wbt wait queue lock while trying to add
      themselves to it in __wbt_wait (see stack traces below).
      
      [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
      [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
      [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
      [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
      [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
      [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
      [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
      [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
      [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
      [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
      [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
      [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    0.948138] Call Trace:
      [    0.948139]  <IRQ>
      [    0.948142]  do_raw_spin_lock+0xad/0xc0
      [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
      [    0.948149]  ? __wake_up_common_lock+0x53/0x90
      [    0.948150]  __wake_up_common_lock+0x53/0x90
      [    0.948155]  wbt_done+0x7b/0xa0
      [    0.948158]  blk_mq_free_request+0xb7/0x110
      [    0.948161]  __blk_mq_complete_request+0xcb/0x140
      [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
      [    0.948169]  nvme_irq+0x23/0x50 [nvme]
      [    0.948173]  __handle_irq_event_percpu+0x46/0x300
      [    0.948176]  handle_irq_event_percpu+0x20/0x50
      [    0.948179]  handle_irq_event+0x34/0x60
      [    0.948181]  handle_edge_irq+0x77/0x190
      [    0.948185]  handle_irq+0xaf/0x120
      [    0.948188]  do_IRQ+0x53/0x110
      [    0.948191]  common_interrupt+0x87/0x87
      [    0.948192]  </IRQ>
      ....
      [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
      [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
      [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
      [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
      [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
      [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
      [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
      [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
      [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
      [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
      [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
      [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
      [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    0.311154] Call Trace:
      [    0.311157]  do_raw_spin_lock+0xad/0xc0
      [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
      [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
      [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
      [    0.311167]  wbt_wait+0x127/0x330
      [    0.311169]  ? finish_wait+0x80/0x80
      [    0.311172]  ? generic_make_request+0xda/0x3b0
      [    0.311174]  blk_mq_make_request+0xd6/0x7b0
      [    0.311176]  ? blk_queue_enter+0x24/0x260
      [    0.311178]  ? generic_make_request+0xda/0x3b0
      [    0.311181]  generic_make_request+0x10c/0x3b0
      [    0.311183]  ? submit_bio+0x5c/0x110
      [    0.311185]  submit_bio+0x5c/0x110
      [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
      [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
      [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
      [    0.311229]  ? do_writepages+0x3c/0xd0
      [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
      [    0.311240]  do_writepages+0x3c/0xd0
      [    0.311243]  ? _raw_spin_unlock+0x24/0x30
      [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
      [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
      [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
      [    0.311253]  file_write_and_wait_range+0x34/0x90
      [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
      [    0.311267]  do_fsync+0x38/0x60
      [    0.311270]  SyS_fsync+0xc/0x10
      [    0.311272]  do_syscall_64+0x6f/0x170
      [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      In the original patch, wbt_done is waking up all the exclusive
      processes in the wait queue, which can cause a thundering herd
      if there is a large number of writer threads in the queue. The
      original intention of the code seems to be to wake up one thread
      only however, it uses wake_up_all() in __wbt_done(), and then
      uses the following check in __wbt_wait to have only one thread
      actually get out of the wait loop:
      
      if (waitqueue_active(&rqw->wait) &&
                  rqw->wait.head.next != &wait->entry)
                      return false;
      
      The problem with this is that the wait entry in wbt_wait is
      define with DEFINE_WAIT, which uses the autoremove wakeup function.
      That means that the above check is invalid - the wait entry will
      have been removed from the queue already by the time we hit the
      check in the loop.
      
      Secondly, auto-removing the wait entries also means that the wait
      queue essentially gets reordered "randomly" (e.g. threads re-add
      themselves in the order they got to run after being woken up).
      Additionally, new requests entering wbt_wait might overtake requests
      that were queued earlier, because the wait queue will be
      (temporarily) empty after the wake_up_all, so the waitqueue_active
      check will not stop them. This can cause certain threads to starve
      under high load.
      
      The fix is to leave the woken up requests in the queue and remove
      them in finish_wait() once the current thread breaks out of the
      wait loop in __wbt_wait. This will ensure new requests always
      end up at the back of the queue, and they won't overtake requests
      that are already in the wait queue. With that change, the loop
      in wbt_wait is also in line with many other wait loops in the kernel.
      Waking up just one thread drastically reduces lock contention, as
      does moving the wait queue add/remove out of the loop.
      
      A significant drop in lockdep's lock contention numbers is seen when
      running the test application on the patched kernel.
      Signed-off-by: NAnchal Agarwal <anchalag@amazon.com>
      Signed-off-by: NFrank van der Linden <fllinden@amazon.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2887e41b
  14. 09 7月, 2018 2 次提交
  15. 09 5月, 2018 6 次提交
  16. 07 2月, 2018 1 次提交
    • J
      blk-wbt: account flush requests correctly · 5235553d
      Jens Axboe 提交于
      Mikulas reported a workload that saw bad performance, and figured
      out what it was due to various other types of requests being
      accounted as reads. Flush requests, for instance. Due to the
      high latency of those, we heavily throttle the writes to keep
      the latencies in balance. But they really should be accounted
      as writes.
      
      Fix this by checking the exact type of the request. If it's a
      read, account as a read, if it's a write or a flush, account
      as a write. Any other request we disregard. Previously everything
      would have been mistakenly accounted as reads.
      Reported-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5235553d
  17. 24 11月, 2017 3 次提交
  18. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  19. 09 10月, 2017 1 次提交
  20. 20 6月, 2017 2 次提交
    • I
      sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar 提交于
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... is was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2055da97
    • I
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar 提交于
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ac6424b9
  21. 21 4月, 2017 1 次提交
  22. 19 4月, 2017 1 次提交
  23. 11 4月, 2017 1 次提交
    • J
      block: Fix list corruption of blk stats callback list · 3f19cd23
      Jan Kara 提交于
      When CFQ calls wbt_disable_default(), it will call
      blk_stat_remove_callback() to stop gathering IO statistics for the
      purposes of writeback throttling. Later, when request_queue is
      unregistered, wbt_exit() will call blk_stat_remove_callback() again
      which will try to delete callback from the list again and possibly cause
      list corruption.
      
      Fix the problem by making wbt_disable_default() called wbt_exit() which
      is properly guarded against being called multiple times.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      3f19cd23