1. 20 Jun 2017, 2 commits
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Authored by Ingo Molnar
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
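      
      For reference, a minimal sketch of the two structures after the rename, as I
      read the post-rename <linux/wait.h> (layout trimmed to the renamed fields):
      
      	struct wait_queue_entry {
      		unsigned int		flags;
      		void			*private;	/* usually the waiting task */
      		wait_queue_func_t	func;
      		struct list_head	entry;		/* was ->task_list */
      	};
      
      	struct wait_queue_head {
      		spinlock_t		lock;
      		struct list_head	head;		/* was ->task_list */
      	};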
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Authored by Ingo Molnar
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
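      
      In header terms, the rename boils down to this (a sketch; the real patch also
      updates every user of the old names):
      
      	/* before: the entry type carried the generic name */
      	typedef struct __wait_queue		wait_queue_t;
      
      	/* after: the name says what the object actually is */
      	typedef struct wait_queue_entry		wait_queue_entry_t;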
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 07 Jun 2017, 2 commits
    • blk-mq: fix direct issue · d964f04a
      Authored by Ming Lei
      If the queue is stopped, we shouldn't dispatch the request to the driver
      and hardware; unfortunately that check was removed by commit bd166ef1
      ("blk-mq-sched: add framework for MQ capable IO schedulers").
      
      This patch fixes the issue by moving the check back into
      __blk_mq_try_issue_directly().
      
      This patch fixes a request use-after-free[1][2] seen while canceling
      requests of NVMe in nvme_dev_disable(), which can be triggered easily
      during NVMe reset & remove tests.
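      
      A hedged sketch of the shape of the fix in __blk_mq_try_issue_directly()
      (parameter list and insert path as I read the 4.12-era code; details may
      differ from the actual patch):
      
      	static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
      						struct request *rq,
      						blk_qc_t *cookie, bool may_sleep)
      	{
      		/*
      		 * Don't dispatch to a stopped queue; insert the request instead
      		 * so it can be requeued or cancelled safely, e.g. by
      		 * nvme_dev_disable() during reset/remove.
      		 */
      		if (blk_mq_hctx_stopped(hctx))
      			goto insert;
      
      		/* ... get a driver tag, map the request, call ->queue_rq() ... */
      		return;
      	insert:
      		blk_mq_sched_insert_request(rq, false, true, false, may_sleep);
      	}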
      
      [1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on
      [  103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
      [  103.412980] IP: bio_integrity_advance+0x48/0xf0
      [  103.412981] PGD 275a88067
      [  103.412981] P4D 275a88067
      [  103.412982] PUD 276c43067
      [  103.412983] PMD 0
      [  103.412984]
      [  103.412986] Oops: 0000 [#1] SMP
      [  103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1
      [  103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
      [  103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000
      [  103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0
      [  103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202
      [  103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240
      [  103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00
      [  103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5
      [  103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18
      [  103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb
      [  103.413053] FS:  0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000
      [  103.413054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0
      [  103.413056] Call Trace:
      [  103.413063]  bio_advance+0x2a/0xe0
      [  103.413067]  blk_update_request+0x76/0x330
      [  103.413072]  blk_mq_end_request+0x1a/0x70
      [  103.413074]  blk_mq_dispatch_rq_list+0x370/0x410
      [  103.413076]  ? blk_mq_flush_busy_ctxs+0x94/0xe0
      [  103.413080]  blk_mq_sched_dispatch_requests+0x173/0x1a0
      [  103.413083]  __blk_mq_run_hw_queue+0x8e/0xa0
      [  103.413085]  __blk_mq_delay_run_hw_queue+0x9d/0xa0
      [  103.413088]  blk_mq_start_hw_queue+0x17/0x20
      [  103.413090]  blk_mq_start_hw_queues+0x32/0x50
      [  103.413095]  nvme_kill_queues+0x54/0x80 [nvme_core]
      [  103.413097]  nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme]
      [  103.413103]  process_one_work+0x149/0x360
      [  103.413105]  worker_thread+0x4d/0x3c0
      [  103.413109]  kthread+0x109/0x140
      [  103.413111]  ? rescuer_thread+0x380/0x380
      [  103.413113]  ? kthread_park+0x60/0x60
      [  103.413120]  ret_from_fork+0x2c/0x40
      [  103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8
      [  103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10
      [  103.413146] CR2: 000000000000000a
      [  103.413157] ---[ end trace cd6875d16eb5a11e ]---
      [  103.455368] Kernel panic - not syncing: Fatal exception
      [  103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  103.850916] ---[ end Kernel panic - not syncing: Fatal exception
      [  103.857637] sched: Unexpected reschedule of offline CPU#1!
      [  103.863762] ------------[ cut here ]------------
      
      [2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off
      [  247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds.
      [  247.137311]       Not tainted 4.12.0-rc2.upstream+ #4
      [  247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  247.151704] Call Trace:
      [  247.154445]  __schedule+0x28a/0x880
      [  247.158341]  schedule+0x36/0x80
      [  247.161850]  blk_mq_freeze_queue_wait+0x4b/0xb0
      [  247.166913]  ? remove_wait_queue+0x60/0x60
      [  247.171485]  blk_freeze_queue+0x1a/0x20
      [  247.175770]  blk_cleanup_queue+0x7f/0x140
      [  247.180252]  nvme_ns_remove+0xa3/0xb0 [nvme_core]
      [  247.185503]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
      [  247.191532]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
      [  247.196977]  nvme_remove+0x70/0x110 [nvme]
      [  247.201545]  pci_device_remove+0x39/0xc0
      [  247.205927]  device_release_driver_internal+0x141/0x200
      [  247.211761]  device_release_driver+0x12/0x20
      [  247.216531]  pci_stop_bus_device+0x8c/0xa0
      [  247.221104]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
      [  247.227420]  remove_store+0x7c/0x90
      [  247.231320]  dev_attr_store+0x18/0x30
      [  247.235409]  sysfs_kf_write+0x3a/0x50
      [  247.239497]  kernfs_fop_write+0xff/0x180
      [  247.243867]  __vfs_write+0x37/0x160
      [  247.247757]  ? selinux_file_permission+0xe5/0x120
      [  247.253011]  ? security_file_permission+0x3b/0xc0
      [  247.258260]  vfs_write+0xb2/0x1b0
      [  247.261964]  ? syscall_trace_enter+0x1d0/0x2b0
      [  247.266924]  SyS_write+0x55/0xc0
      [  247.270540]  do_syscall_64+0x67/0x150
      [  247.274636]  entry_SYSCALL64_slow_path+0x25/0x25
      [  247.279794] RIP: 0033:0x7f5c96740840
      [  247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840
      [  247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001
      [  247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740
      [  247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400
      [  247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
      [  370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds.
      
      Fixes: 12d70958 ("blk-mq: don't fail allocating driver tag for stopped hw queue")
      Cc: stable@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: pass correct hctx to blk_mq_try_issue_directly · dad7a3be
      Authored by Ming Lei
      When direct issue is done on a request picked up from the plug list,
      the hctx needs to be updated to the actual hw queue; otherwise the
      wrong hctx is used, which may hurt performance, especially when the
      wrong SRCU read lock is acquired/released.
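      
      A hedged sketch of the idea in the plug-flush path of blk_mq_make_request()
      (names per my reading of the 4.12-era code; treat it as illustrative):
      
      	if (same_queue_rq) {
      		/*
      		 * The request came off the plug list and may belong to a
      		 * different software context; remap to its real hw queue
      		 * before issuing it directly.
      		 */
      		data.hctx = blk_mq_map_queue(q, same_queue_rq->mq_ctx->cpu);
      		blk_mq_try_issue_directly(data.hctx, same_queue_rq, &cookie);
      	}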
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 31 May 2017, 1 commit
  4. 23 May 2017, 1 commit
  5. 10 May 2017, 1 commit
    • blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op · f36ea50c
      Authored by Wen Xiong
      When formatting NVMe to 512B/4K + T10 DIF/DIX, dd with a split op
      returns "Input/output error". It looks like the block layer splits the
      bio after calling bio_integrity_prep(bio). This patch fixes the issue.
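      
      A hedged sketch of the reordering in blk_mq_make_request(), assuming the
      4.12-era helpers (the real patch may differ in detail): split first, then
      attach the integrity payload to the bio that will actually be issued:
      
      	blk_queue_bounce(q, &bio);
      
      	/* split before integrity prep, so the PI state matches each bio */
      	blk_queue_split(q, &bio, q->bio_split);
      
      	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
      		bio_io_error(bio);
      		return BLK_QC_T_NONE;
      	}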
      
      Below is how we debugged this issue:
      (1) format nvme to 4K block size with type 2 DIF
      (2) dd with a block size bigger than 1024k and oflag=direct
      dd: error writing '/dev/nvme0n1': Input/output error
      
      We added some debug code in the nvme device driver. It showed that the
      first op and the second op have the same bi and pi address, which is
      not correct.
      
      1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
      	dsmgmt=0x0, AT=0x0 & RT=0x505
      	Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828
      
      2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
      	AT=0x0 & RT=0x605  ==> This op fails, as do the subsequent 5 retries.
      	Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828
      
      With the fix, it showed that both the first op and the second op have
      the correct bi and pi addresses.
      
      1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
      	dsmgmt=0x0, AT=0x0 & RT=0x505
      	Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
      	0x00002828
      2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
      	AT=0x0 & RT=0x605
      	Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
      	0x00003028
      Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  6. 08 May 2017, 2 commits
    • blk-mq: make __blk_mq_stop_hw_queues static · ebd76857
      Authored by Colin Ian King
      Making __blk_mq_stop_hw_queues static fixes the following sparse warning:
      
        block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
        declared. Should it be static?
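      
      The fix itself is a one-liner: give the helper internal linkage (signature as
      I read the 4.12-era code):
      
      	/* only used within block/blk-mq.c */
      	static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync);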
      
      Fixes: 2719aa21 ("blk-mq: don't use sync workqueue flushing from drivers")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block/mq: fix potential deadlock during cpu hotplug · 51d638b1
      Authored by Wanpeng Li
      This can be triggered by hot-unplugging one CPU.
      
      ======================================================
       [ INFO: possible circular locking dependency detected ]
       4.11.0+ #17 Not tainted
       -------------------------------------------------------
       step_after_susp/2640 is trying to acquire lock:
        (all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110
      
       but task is already holding lock:
        (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (cpu_hotplug.lock){+.+.+.}:
              lock_acquire+0x11c/0x230
              __mutex_lock+0x92/0x990
              mutex_lock_nested+0x1b/0x20
              get_online_cpus+0x64/0x80
              blk_mq_init_allocated_queue+0x3a0/0x4e0
              blk_mq_init_queue+0x3a/0x60
              loop_add+0xe5/0x280
              loop_init+0x124/0x177
              do_one_initcall+0x53/0x1c0
              kernel_init_freeable+0x1e3/0x27f
              kernel_init+0xe/0x100
              ret_from_fork+0x31/0x40
      
       -> #0 (all_q_mutex){+.+...}:
              __lock_acquire+0x189a/0x18a0
              lock_acquire+0x11c/0x230
              __mutex_lock+0x92/0x990
              mutex_lock_nested+0x1b/0x20
              blk_mq_queue_reinit_work+0x18/0x110
              blk_mq_queue_reinit_dead+0x1c/0x20
              cpuhp_invoke_callback+0x1f2/0x810
              cpuhp_down_callbacks+0x42/0x80
              _cpu_down+0xb2/0xe0
              freeze_secondary_cpus+0xb6/0x390
              suspend_devices_and_enter+0x3b3/0xa40
              pm_suspend+0x129/0x490
              state_store+0x82/0xf0
              kobj_attr_store+0xf/0x20
              sysfs_kf_write+0x45/0x60
              kernfs_fop_write+0x135/0x1c0
              __vfs_write+0x37/0x160
              vfs_write+0xcd/0x1d0
              SyS_write+0x58/0xc0
              do_syscall_64+0x8f/0x710
              return_from_SYSCALL_64+0x0/0x7a
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(cpu_hotplug.lock);
                                      lock(all_q_mutex);
                                      lock(cpu_hotplug.lock);
         lock(all_q_mutex);
      
        *** DEADLOCK ***
      
       8 locks held by step_after_susp/2640:
        #0:  (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0
        #1:  (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0
        #2:  (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0
        #3:  (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490
        #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20
        #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390
        #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0
        #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
      
       stack backtrace:
       CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
       Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
       Call Trace:
        dump_stack+0x99/0xce
        print_circular_bug+0x1fa/0x270
        __lock_acquire+0x189a/0x18a0
        lock_acquire+0x11c/0x230
        ? lock_acquire+0x11c/0x230
        ? blk_mq_queue_reinit_work+0x18/0x110
        ? blk_mq_queue_reinit_work+0x18/0x110
        __mutex_lock+0x92/0x990
        ? blk_mq_queue_reinit_work+0x18/0x110
        ? kmem_cache_free+0x2cb/0x330
        ? anon_transport_class_unregister+0x20/0x20
        ? blk_mq_queue_reinit_work+0x110/0x110
        mutex_lock_nested+0x1b/0x20
        ? mutex_lock_nested+0x1b/0x20
        blk_mq_queue_reinit_work+0x18/0x110
        blk_mq_queue_reinit_dead+0x1c/0x20
        cpuhp_invoke_callback+0x1f2/0x810
        ? __flow_cache_shrink+0x160/0x160
        cpuhp_down_callbacks+0x42/0x80
        _cpu_down+0xb2/0xe0
        freeze_secondary_cpus+0xb6/0x390
        suspend_devices_and_enter+0x3b3/0xa40
        ? rcu_read_lock_sched_held+0x79/0x80
        pm_suspend+0x129/0x490
        state_store+0x82/0xf0
        kobj_attr_store+0xf/0x20
        sysfs_kf_write+0x45/0x60
        kernfs_fop_write+0x135/0x1c0
        __vfs_write+0x37/0x160
        ? rcu_read_lock_sched_held+0x79/0x80
        ? rcu_sync_lockdep_assert+0x2f/0x60
        ? __sb_start_write+0xd9/0x1c0
        ? vfs_write+0x1ad/0x1d0
        vfs_write+0xcd/0x1d0
        SyS_write+0x58/0xc0
        ? rcu_read_lock_sched_held+0x79/0x80
        do_syscall_64+0x8f/0x710
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      The CPU hotplug path holds cpu_hotplug.lock and then reinits all existing
      blk-mq queues under all_q_mutex; however, blk_mq_init_allocated_queue()
      acquires these two locks in the inverse order. This is due to commit
      eabe0659 ("block/mq: Cure cpu hotplug lock inversion"), which fixes a cpu
      hotplug lock inversion caused by the hotplug rework; but that rework is
      still work in progress, lives in a -tip branch, and mainline cannot yet
      trigger its splat. The commit therefore breaks Linus's tree in the merge
      window, so this patch reverts the lock order and avoids the splat in
      Linus's tree.
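      
      A hedged sketch of the restored ordering at the end of
      blk_mq_init_allocated_queue(), matching the hotplug path's
      cpu_hotplug.lock -> all_q_mutex order (per my reading; details may differ):
      
      	get_online_cpus();
      	mutex_lock(&all_q_mutex);
      
      	list_add_tail(&q->all_q_node, &all_q_list);
      	blk_mq_add_queue_tag_set(set, q);
      	blk_mq_map_swqueue(q, cpu_online_mask);
      
      	mutex_unlock(&all_q_mutex);
      	put_online_cpus();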
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  7. 04 May 2017, 3 commits
    • blk-mq: untangle debugfs and sysfs · 9c1051aa
      Authored by Omar Sandoval
      Originally, I tied debugfs registration/unregistration together with
      sysfs. There's no reason to do this, and it's getting in the way of
      letting schedulers define their own debugfs attributes. Instead, tie the
      debugfs registration to the lifetime of the structures themselves.
      
      The saner lifetimes mean we can also get rid of the extra mq directory
      and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
      just nvme0n1/hctx0/tags.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block/mq: Cure cpu hotplug lock inversion · eabe0659
      Authored by Peter Zijlstra
      By poking at /debug/sched_features I triggered the following splat:
      
       [] ======================================================
       [] WARNING: possible circular locking dependency detected
       [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
       [] ------------------------------------------------------
       [] bash/2109 is trying to acquire lock:
       []  (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50
       []
       [] but task is already holding lock:
       []  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
       []
       [] which lock already depends on the new lock.
       []
       []
       [] the existing dependency chain (in reverse order) is:
       []
       [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
       []        lock_acquire+0x100/0x210
       []        down_write+0x28/0x60
       []        start_creating+0x5e/0xf0
       []        debugfs_create_dir+0x13/0x110
       []        blk_mq_debugfs_register+0x21/0x70
       []        blk_mq_register_dev+0x64/0xd0
       []        blk_register_queue+0x6a/0x170
       []        device_add_disk+0x22d/0x440
       []        loop_add+0x1f3/0x280
       []        loop_init+0x104/0x142
       []        do_one_initcall+0x43/0x180
       []        kernel_init_freeable+0x1de/0x266
       []        kernel_init+0xe/0x100
       []        ret_from_fork+0x31/0x40
       []
       [] -> #1 (all_q_mutex){+.+.+.}:
       []        lock_acquire+0x100/0x210
       []        __mutex_lock+0x6c/0x960
       []        mutex_lock_nested+0x1b/0x20
       []        blk_mq_init_allocated_queue+0x37c/0x4e0
       []        blk_mq_init_queue+0x3a/0x60
       []        loop_add+0xe5/0x280
       []        loop_init+0x104/0x142
       []        do_one_initcall+0x43/0x180
       []        kernel_init_freeable+0x1de/0x266
       []        kernel_init+0xe/0x100
       []        ret_from_fork+0x31/0x40
      
       []  *** DEADLOCK ***
       []
       [] 3 locks held by bash/2109:
       []  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0
       []  #1:  (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0
       []  #2:  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
       []
       [] stack backtrace:
       [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
       [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
       [] Call Trace:
      
       []  lock_acquire+0x100/0x210
       []  get_online_cpus+0x2a/0x90
       []  static_key_slow_dec+0x1b/0x50
       []  static_key_disable+0x20/0x30
       []  sched_feat_write+0x131/0x170
       []  full_proxy_write+0x97/0xd0
       []  __vfs_write+0x28/0x120
       []  vfs_write+0xb5/0x1a0
       []  SyS_write+0x49/0xa0
       []  entry_SYSCALL_64_fastpath+0x23/0xc2
      
      This is because of the cpu hotplug lock rework. Break the chain at #1
      by reversing the lock acquisition order. This way i_mutex_key#4 no
      longer depends on cpu_hotplug_lock and things are good.
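      
      Concretely, the chain is broken by swapping the nested acquisition in
      blk_mq_init_allocated_queue() (a hedged sketch; note that the 51d638b1 entry
      earlier in this list reverts this order again for mainline):
      
      	/* take all_q_mutex first, then the hotplug read lock */
      	mutex_lock(&all_q_mutex);
      	get_online_cpus();
      
      	/* ... add the queue to all_q_list and map the software queues ... */
      
      	put_online_cpus();
      	mutex_unlock(&all_q_mutex);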
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: don't use sync workqueue flushing from drivers · 2719aa21
      Authored by Jens Axboe
      A previous commit introduced the sync flush, which we need from
      internal callers like blk_mq_quiesce_queue(). However, we also
      call the stop helpers from drivers, particularly from ->queue_rq()
      when we have to stop processing for a bit. We can't block from
      those locations, and we don't have to guarantee that we're
      fully flushed.
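      
      A hedged sketch of the resulting split (helper name and fields as I read the
      4.12-era blk-mq code; illustrative only). Internal callers that must know the
      work has finished pass sync = true; ->queue_rq() callers pass false:
      
      	static void __blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx, bool sync)
      	{
      		if (sync)
      			cancel_delayed_work_sync(&hctx->run_work);
      		else
      			cancel_delayed_work(&hctx->run_work);
      
      		set_bit(BLK_MQ_S_STOPPED, &hctx->state);
      	}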
      
      Fixes: 9f993737 ("blk-mq: unify hctx delayed_run_work and run_work")
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 03 May 2017, 1 commit
  9. 02 May 2017, 1 commit
  10. 28 Apr 2017, 2 commits
    • blk-mq: unify hctx delay_work and run_work · 21c6e939
      Authored by Jens Axboe
      The only difference between ->run_work and ->delay_work, is that
      the latter is used to defer running a queue. This is done by
      marking the queue stopped, and scheduling ->delay_work to run
      sometime in the future. While the queue is stopped, direct runs
      or runs through ->run_work will not run the queue.
      
      If we combine the handlers, then we need to handle two things:
      
      1) If a delayed/stopped run is scheduled, then we should not run
         the queue before that has been completed.
      2) If a queue is delayed/stopped, the handler needs to restart
         the queue. Normally a run of a queue with the stopped bit set
         would be a no-op.
      
      Case 1 is handled by modifying a currently pending queue run
      to the deadline set by the caller of blk_mq_delay_queue().
      Subsequent attempts to queue a queue run will find the work
      item already pending, and direct runs will see a stopped queue
      as before.
      
      Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
      that tells the work handler that it should clear a stopped
      queue and run the handler.
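      
      A hedged sketch of the combined work handler covering both cases (close to
      how I read the resulting code; details may differ):
      
      	static void blk_mq_run_work_fn(struct work_struct *work)
      	{
      		struct blk_mq_hw_ctx *hctx =
      			container_of(work, struct blk_mq_hw_ctx, run_work.work);
      
      		/*
      		 * Case 2: if stopped, only run when a delayed start was
      		 * requested via BLK_MQ_S_START_ON_RUN, and clear both bits.
      		 */
      		if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) {
      			if (!test_bit(BLK_MQ_S_START_ON_RUN, &hctx->state))
      				return;
      
      			clear_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
      			clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
      		}
      
      		__blk_mq_run_hw_queue(hctx);
      	}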
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: unify hctx delayed_run_work and run_work · 9f993737
      Authored by Jens Axboe
      They serve the exact same purpose. Get rid of the non-delayed
      work variant, and just run it without delay for the normal case.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 22 Apr 2017, 1 commit
    • blk-mq: Fix preempt count imbalance · abc25a69
      Authored by Bart Van Assche
      Avoid that the following kernel bug gets triggered:
      
      BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:349
      in_atomic(): 1, irqs_disabled(): 0, pid: 8019, name: find
      CPU: 10 PID: 8019 Comm: find Tainted: G        W I     4.11.0-rc4-dbg+ #2
      Call Trace:
       dump_stack+0x68/0x93
       ___might_sleep+0x16e/0x230
       __might_sleep+0x4a/0x80
       __ext4_get_inode_loc+0x1e0/0x4e0
       ext4_iget+0x70/0xbc0
       ext4_iget_normal+0x2f/0x40
       ext4_lookup+0xb6/0x1f0
       lookup_slow+0x104/0x1e0
       walk_component+0x19a/0x330
       path_lookupat+0x4b/0x100
       filename_lookup+0x9a/0x110
       user_path_at_empty+0x36/0x40
       vfs_statx+0x67/0xc0
       SYSC_newfstatat+0x20/0x40
       SyS_newfstatat+0xe/0x10
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      This happens because the big if/else in blk_mq_make_request() doesn't
      have a final else section that also drops the ctx. Add that.
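      
      A hedged sketch of the added branch at the end of blk_mq_make_request()
      (call names per my reading of the surrounding 4.11-era code):
      
      	} else {
      		/* the final branch must also drop the ctx taken earlier */
      		blk_mq_put_ctx(data.ctx);
      		blk_mq_bio_to_request(rq, bio);
      		blk_mq_sched_insert_request(rq, false, true, true, true);
      	}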
      
      Fixes: b00c53e8 ("blk-mq: fix schedule-while-atomic with scheduler attached")
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Omar Sandoval <osandov@fb.com>
      
      Added a bit more to the commit log.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  12. 21 Apr 2017, 9 commits
  13. 20 Apr 2017, 1 commit
  14. 15 Apr 2017, 2 commits
  15. 08 Apr 2017, 6 commits
  16. 07 Apr 2017, 4 commits
    • blk-mq: remap queues when adding/removing hardware queues · ebe8bddb
      Authored by Omar Sandoval
      blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
      behavior that drivers expect. However, commit 4e68a011 changed
      blk_mq_queue_reinit() to not remap queues for the case of CPU
      hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
      queues as well. This breaks, for example, NBD's multi-connection mode,
      leaving the added hardware queues unused. Fix it by making
      blk_mq_update_nr_hw_queues() explicitly remap the queues.
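      
      A hedged sketch of the essence of the fix (helper names as I understand the
      4.12-era code; illustrative):
      
      	set->nr_hw_queues = nr_hw_queues;
      	blk_mq_update_queue_map(set);	/* recompute the cpu -> hctx mapping */
      	list_for_each_entry(q, &set->tag_list, tag_set_list) {
      		blk_mq_realloc_hw_ctxs(set, q);
      		blk_mq_queue_reinit(q, cpu_online_mask);	/* remap sw queues */
      	}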
      
      Fixes: 4e68a011 ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq-sched: fix crash in switch error path · 54d5329d
      Authored by Omar Sandoval
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback like the legacy elevator path is much harder for mq, so fix
      it by just falling back to none, instead.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632
      Authored by Omar Sandoval
      If a new hardware queue is added at runtime, we don't allocate scheduler
      tags for it, leading to a crash. This hooks up the scheduler framework
      to blk_mq_{init,exit}_hctx() to make sure everything gets properly
      initialized/freed.
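      
      A hedged sketch of the hook-up; the helper names are how I understand the
      patch and should be treated as illustrative:
      
      	/* in blk_mq_init_hctx(): also set up the scheduler's per-hctx tags/state */
      	if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
      		goto err;	/* unwind the earlier setup steps (label illustrative) */
      
      	/* in blk_mq_exit_hctx(): release it again */
      	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);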
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: use the right hctx when getting a driver tag fails · 81380ca1
      Authored by Omar Sandoval
      While dispatching requests, if we fail to get a driver tag, we mark the
      hardware queue as waiting for a tag and put the requests on a
      hctx->dispatch list to be run later when a driver tag is freed. However,
      blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
      queues if using a single-queue scheduler with a multiqueue device. If
      blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
      are processing. This means we end up using the hardware queue of the
      previous request, which may or may not be the same as that of the
      current request. If it isn't, the wrong hardware queue will end up
      waiting for a tag, and the requests will be on the wrong dispatch list,
      leading to a hang.
      
      The fix is twofold:
      
      1. Make sure we save which hardware queue we were trying to get a
         request for in blk_mq_get_driver_tag() regardless of whether it
         succeeds or not.
      2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
         blk_mq_hw_queue to make it clear that it must handle multiple
         hardware queues, since I've already messed this up on a couple of
         occasions.
      
      This didn't appear in testing with nvme and mq-deadline because nvme has
      more driver tags than the default number of scheduler tags. However,
      with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
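      
      A hedged sketch of point 1 (shape per my reading of the 4.12-era code; the
      real function also handles shared tags and the wait path):
      
      	static bool blk_mq_get_driver_tag(struct request *rq,
      					  struct blk_mq_hw_ctx **hctx, bool wait)
      	{
      		struct blk_mq_alloc_data data = {
      			.q = rq->q,
      			.hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
      			.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
      		};
      
      		rq->tag = blk_mq_get_tag(&data);
      
      		/*
      		 * Report back which hw queue the attempt was for, whether or
      		 * not a tag was obtained, so the caller marks the right queue
      		 * as waiting and requeues onto its dispatch list.
      		 */
      		if (hctx)
      			*hctx = data.hctx;
      		return rq->tag != -1;
      	}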
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  17. 05 Apr 2017, 1 commit