1. 28 8月, 2018 3 次提交
  2. 23 8月, 2018 4 次提交
  3. 15 8月, 2018 1 次提交
  4. 08 8月, 2018 1 次提交
    • A
      blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait · 2887e41b
      Anchal Agarwal 提交于
      I am currently running a large bare metal instance (i3.metal)
      on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
      4.18 kernel. I have a workload that simulates a database
      workload and I am running into lockup issues when writeback
      throttling is enabled,with the hung task detector also
      kicking in.
      
      Crash dumps show that most CPUs (up to 50 of them) are
      all trying to get the wbt wait queue lock while trying to add
      themselves to it in __wbt_wait (see stack traces below).
      
      [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
      [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
      [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
      [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
      [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
      [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
      [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
      [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
      [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
      [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
      [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
      [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    0.948138] Call Trace:
      [    0.948139]  <IRQ>
      [    0.948142]  do_raw_spin_lock+0xad/0xc0
      [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
      [    0.948149]  ? __wake_up_common_lock+0x53/0x90
      [    0.948150]  __wake_up_common_lock+0x53/0x90
      [    0.948155]  wbt_done+0x7b/0xa0
      [    0.948158]  blk_mq_free_request+0xb7/0x110
      [    0.948161]  __blk_mq_complete_request+0xcb/0x140
      [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
      [    0.948169]  nvme_irq+0x23/0x50 [nvme]
      [    0.948173]  __handle_irq_event_percpu+0x46/0x300
      [    0.948176]  handle_irq_event_percpu+0x20/0x50
      [    0.948179]  handle_irq_event+0x34/0x60
      [    0.948181]  handle_edge_irq+0x77/0x190
      [    0.948185]  handle_irq+0xaf/0x120
      [    0.948188]  do_IRQ+0x53/0x110
      [    0.948191]  common_interrupt+0x87/0x87
      [    0.948192]  </IRQ>
      ....
      [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
      [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
      [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
      [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
      [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
      [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
      [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
      [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
      [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
      [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
      [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
      [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
      [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    0.311154] Call Trace:
      [    0.311157]  do_raw_spin_lock+0xad/0xc0
      [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
      [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
      [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
      [    0.311167]  wbt_wait+0x127/0x330
      [    0.311169]  ? finish_wait+0x80/0x80
      [    0.311172]  ? generic_make_request+0xda/0x3b0
      [    0.311174]  blk_mq_make_request+0xd6/0x7b0
      [    0.311176]  ? blk_queue_enter+0x24/0x260
      [    0.311178]  ? generic_make_request+0xda/0x3b0
      [    0.311181]  generic_make_request+0x10c/0x3b0
      [    0.311183]  ? submit_bio+0x5c/0x110
      [    0.311185]  submit_bio+0x5c/0x110
      [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
      [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
      [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
      [    0.311229]  ? do_writepages+0x3c/0xd0
      [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
      [    0.311240]  do_writepages+0x3c/0xd0
      [    0.311243]  ? _raw_spin_unlock+0x24/0x30
      [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
      [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
      [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
      [    0.311253]  file_write_and_wait_range+0x34/0x90
      [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
      [    0.311267]  do_fsync+0x38/0x60
      [    0.311270]  SyS_fsync+0xc/0x10
      [    0.311272]  do_syscall_64+0x6f/0x170
      [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      In the original patch, wbt_done is waking up all the exclusive
      processes in the wait queue, which can cause a thundering herd
      if there is a large number of writer threads in the queue. The
      original intention of the code seems to be to wake up one thread
      only however, it uses wake_up_all() in __wbt_done(), and then
      uses the following check in __wbt_wait to have only one thread
      actually get out of the wait loop:
      
      if (waitqueue_active(&rqw->wait) &&
                  rqw->wait.head.next != &wait->entry)
                      return false;
      
      The problem with this is that the wait entry in wbt_wait is
      define with DEFINE_WAIT, which uses the autoremove wakeup function.
      That means that the above check is invalid - the wait entry will
      have been removed from the queue already by the time we hit the
      check in the loop.
      
      Secondly, auto-removing the wait entries also means that the wait
      queue essentially gets reordered "randomly" (e.g. threads re-add
      themselves in the order they got to run after being woken up).
      Additionally, new requests entering wbt_wait might overtake requests
      that were queued earlier, because the wait queue will be
      (temporarily) empty after the wake_up_all, so the waitqueue_active
      check will not stop them. This can cause certain threads to starve
      under high load.
      
      The fix is to leave the woken up requests in the queue and remove
      them in finish_wait() once the current thread breaks out of the
      wait loop in __wbt_wait. This will ensure new requests always
      end up at the back of the queue, and they won't overtake requests
      that are already in the wait queue. With that change, the loop
      in wbt_wait is also in line with many other wait loops in the kernel.
      Waking up just one thread drastically reduces lock contention, as
      does moving the wait queue add/remove out of the loop.
      
      A significant drop in lockdep's lock contention numbers is seen when
      running the test application on the patched kernel.
      Signed-off-by: NAnchal Agarwal <anchalag@amazon.com>
      Signed-off-by: NFrank van der Linden <fllinden@amazon.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2887e41b
  5. 09 7月, 2018 2 次提交
  6. 09 5月, 2018 6 次提交
  7. 07 2月, 2018 1 次提交
    • J
      blk-wbt: account flush requests correctly · 5235553d
      Jens Axboe 提交于
      Mikulas reported a workload that saw bad performance, and figured
      out what it was due to various other types of requests being
      accounted as reads. Flush requests, for instance. Due to the
      high latency of those, we heavily throttle the writes to keep
      the latencies in balance. But they really should be accounted
      as writes.
      
      Fix this by checking the exact type of the request. If it's a
      read, account as a read, if it's a write or a flush, account
      as a write. Any other request we disregard. Previously everything
      would have been mistakenly accounted as reads.
      Reported-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5235553d
  8. 24 11月, 2017 3 次提交
  9. 25 10月, 2017 1 次提交
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland 提交于
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6aa7de05
  10. 09 10月, 2017 1 次提交
  11. 20 6月, 2017 2 次提交
    • I
      sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar 提交于
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... is was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2055da97
    • I
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar 提交于
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ac6424b9
  12. 21 4月, 2017 1 次提交
  13. 19 4月, 2017 1 次提交
  14. 11 4月, 2017 1 次提交
    • J
      block: Fix list corruption of blk stats callback list · 3f19cd23
      Jan Kara 提交于
      When CFQ calls wbt_disable_default(), it will call
      blk_stat_remove_callback() to stop gathering IO statistics for the
      purposes of writeback throttling. Later, when request_queue is
      unregistered, wbt_exit() will call blk_stat_remove_callback() again
      which will try to delete callback from the list again and possibly cause
      list corruption.
      
      Fix the problem by making wbt_disable_default() called wbt_exit() which
      is properly guarded against being called multiple times.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      3f19cd23
  15. 22 3月, 2017 2 次提交
    • O
      blk-stat: convert to callback-based statistics reporting · 34dbad5d
      Omar Sandoval 提交于
      Currently, statistics are gathered in ~0.13s windows, and users grab the
      statistics whenever they need them. This is not ideal for both in-tree
      users:
      
      1. Writeback throttling wants its own dynamically sized window of
         statistics. Since the blk-stats statistics are reset after every
         window and the wbt windows don't line up with the blk-stats windows,
         wbt doesn't see every I/O.
      2. Polling currently grabs the statistics on every I/O. Again, depending
         on how the window lines up, we may miss some I/Os. It's also
         unnecessary overhead to get the statistics on every I/O; the hybrid
         polling heuristic would be just as happy with the statistics from the
         previous full window.
      
      This reworks the blk-stats infrastructure to be callback-based: users
      register a callback that they want called at a given time with all of
      the statistics from the window during which the callback was active.
      Users can dynamically bucketize the statistics. wbt and polling both
      currently use read vs. write, but polling can be extended to further
      subdivide based on request size.
      
      The callbacks are kept on an RCU list, and each callback has percpu
      stats buffers. There will only be a few users, so the overhead on the
      I/O completion side is low. The stats flushing is also simplified
      considerably: since the timer function is responsible for clearing the
      statistics, we don't have to worry about stale statistics.
      
      wbt is a trivial conversion. After the conversion, the windowing problem
      mentioned above is fixed.
      
      For polling, we register an extra callback that caches the previous
      window's statistics in the struct request_queue for the hybrid polling
      heuristic to use.
      
      Since we no longer have a single stats buffer for the request queue,
      this also removes the sysfs and debugfs stats entries. To replace those,
      we add a debugfs entry for the poll statistics.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      34dbad5d
    • O
      blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE} · fa2e39cb
      Omar Sandoval 提交于
      The stats buckets will become generic soon, so make the existing users
      use the common READ and WRITE definitions instead of one internal to
      blk-stat.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fa2e39cb
  16. 02 2月, 2017 1 次提交
  17. 03 1月, 2017 2 次提交
  18. 09 12月, 2016 1 次提交
  19. 01 12月, 2016 1 次提交
  20. 29 11月, 2016 3 次提交
  21. 16 11月, 2016 1 次提交
  22. 12 11月, 2016 1 次提交