1. 19 Jun, 2017 14 commits
  2. 13 Jun, 2017 1 commit
  3. 09 Jun, 2017 2 commits
    • blk-mq: switch ->queue_rq return value to blk_status_t · fc17b653
      Christoph Hellwig committed
      Use the same values for the return value from ->queue_rq as for
      request completion errors.  BLK_STS_RESOURCE is special cased to cause
      a requeue, and all the others are completed as-is.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      fc17b653
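The return-value handling described above can be sketched in userspace C. This is a minimal model, not the kernel's code: the `demo_` types, the three status values and the two stub drivers are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the kernel's definitions. */
typedef enum {
    DEMO_STS_OK = 0,
    DEMO_STS_RESOURCE,   /* driver is temporarily out of resources */
    DEMO_STS_IOERR       /* stands in for every other error */
} demo_status_t;

struct demo_request {
    int requeued;          /* set when the request goes back on the queue */
    int completed;         /* set when the request is ended */
    demo_status_t result;  /* completion status seen by the submitter */
};

/* Callback type standing in for ->queue_rq. */
typedef demo_status_t (*demo_queue_rq_fn)(struct demo_request *rq);

/* Dispatch one request and act on the driver's return value. */
static void demo_dispatch(struct demo_request *rq, demo_queue_rq_fn queue_rq)
{
    assert(rq != NULL && queue_rq != NULL);

    demo_status_t ret = queue_rq(rq);

    switch (ret) {
    case DEMO_STS_OK:
        /* The driver owns the request now and completes it later. */
        break;
    case DEMO_STS_RESOURCE:
        /* Special case: retry later instead of failing the I/O. */
        rq->requeued = 1;
        break;
    default:
        /* Any other status is passed through as the completion error. */
        rq->completed = 1;
        rq->result = ret;
        break;
    }
}

/* Two stub drivers used to exercise both error paths. */
static demo_status_t demo_busy_driver(struct demo_request *rq)
{
    (void)rq;
    return DEMO_STS_RESOURCE;
}

static demo_status_t demo_failing_driver(struct demo_request *rq)
{
    (void)rq;
    return DEMO_STS_IOERR;
}
```

Because the driver's return value and the completion error share one type, the default branch can forward the status unchanged.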
    • block: introduce new block status code type · 2a842aca
      Christoph Hellwig committed
      Currently we use normal Linux errno values in the block layer, and while
      we accept any error a few have overloaded magic meanings.  This patch
      instead introduces a new blk_status_t value that holds block layer specific
      status codes and explicitly explains their meaning.  Helpers to convert from
      and to the previous special meanings are provided for now, but I suspect
      we want to get rid of them in the long run - those drivers that have an
      errno input (e.g. networking) usually get errnos that don't know about
      the special block layer overloads, and similarly returning them to userspace
      will usually return something that, strictly speaking, isn't correct
      for file system operations, but that's left as an exercise for later.
      
      For now the set of errors is a very limited set that closely corresponds
      to the previous overloaded errno values, but there is some low hanging
      fruit to improve it.
      
      blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
      typechecking, so that we can easily catch places passing the wrong values.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2a842aca
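The conversion helpers mentioned above could look roughly like the following table-driven sketch. The `demo_` names, the subset of codes, their numeric values and the chosen errno pairings are assumptions made for illustration; the sparse `__bitwise` checking is not reproduced, since it needs the sparse tool rather than a plain C compiler.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative subset of status codes; names and values are assumptions. */
typedef unsigned char demo_blk_status_t;

#define DEMO_STS_OK        ((demo_blk_status_t)0)
#define DEMO_STS_NOTSUPP   ((demo_blk_status_t)1)
#define DEMO_STS_TIMEOUT   ((demo_blk_status_t)2)
#define DEMO_STS_NOSPC     ((demo_blk_status_t)3)
#define DEMO_STS_RESOURCE  ((demo_blk_status_t)4)
#define DEMO_STS_IOERR     ((demo_blk_status_t)5)

/* One row per status code, paired with a kernel-style negative errno. */
static const struct demo_status_map {
    demo_blk_status_t status;
    int err;
} demo_map[] = {
    { DEMO_STS_OK,        0 },
    { DEMO_STS_NOTSUPP,   -EOPNOTSUPP },
    { DEMO_STS_TIMEOUT,   -ETIMEDOUT },
    { DEMO_STS_NOSPC,     -ENOSPC },
    { DEMO_STS_RESOURCE,  -ENOMEM },
    { DEMO_STS_IOERR,     -EIO },
};

#define DEMO_MAP_LEN (sizeof(demo_map) / sizeof(demo_map[0]))

/* Status code to errno; unknown codes degrade to -EIO. */
static int demo_status_to_errno(demo_blk_status_t status)
{
    for (size_t i = 0; i < DEMO_MAP_LEN; i++)
        if (demo_map[i].status == status)
            return demo_map[i].err;
    return -EIO;
}

/* Errno to status code; errnos without a special meaning become IOERR. */
static demo_blk_status_t demo_errno_to_status(int err)
{
    assert(err <= 0); /* this sketch expects kernel-style negative errnos */

    for (size_t i = 0; i < DEMO_MAP_LEN; i++)
        if (demo_map[i].err == err)
            return demo_map[i].status;
    return DEMO_STS_IOERR;
}
```

The fallback to a generic I/O error shows why such helpers lose information: an errno with no block layer meaning cannot be converted back to what it was.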
  4. 07 Jun, 2017 2 commits
    • blk-mq: fix direct issue · d964f04a
      Ming Lei committed
      If the queue is stopped, we shouldn't dispatch requests into the driver
      and hardware. Unfortunately, the check was removed in bd166ef1 (blk-mq-sched:
      add framework for MQ capable IO schedulers).
      
      This patch fixes the issue by moving the check back into
      __blk_mq_try_issue_directly().
      
      This patch fixes a request use-after-free[1][2] while canceling requests
      of NVMe in nvme_dev_disable(), which can be triggered easily during
      NVMe reset & remove tests.
      
      [1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on
      [  103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
      [  103.412980] IP: bio_integrity_advance+0x48/0xf0
      [  103.412981] PGD 275a88067
      [  103.412981] P4D 275a88067
      [  103.412982] PUD 276c43067
      [  103.412983] PMD 0
      [  103.412984]
      [  103.412986] Oops: 0000 [#1] SMP
      [  103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1
      [  103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
      [  103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000
      [  103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0
      [  103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202
      [  103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240
      [  103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00
      [  103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5
      [  103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18
      [  103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb
      [  103.413053] FS:  0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000
      [  103.413054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0
      [  103.413056] Call Trace:
      [  103.413063]  bio_advance+0x2a/0xe0
      [  103.413067]  blk_update_request+0x76/0x330
      [  103.413072]  blk_mq_end_request+0x1a/0x70
      [  103.413074]  blk_mq_dispatch_rq_list+0x370/0x410
      [  103.413076]  ? blk_mq_flush_busy_ctxs+0x94/0xe0
      [  103.413080]  blk_mq_sched_dispatch_requests+0x173/0x1a0
      [  103.413083]  __blk_mq_run_hw_queue+0x8e/0xa0
      [  103.413085]  __blk_mq_delay_run_hw_queue+0x9d/0xa0
      [  103.413088]  blk_mq_start_hw_queue+0x17/0x20
      [  103.413090]  blk_mq_start_hw_queues+0x32/0x50
      [  103.413095]  nvme_kill_queues+0x54/0x80 [nvme_core]
      [  103.413097]  nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme]
      [  103.413103]  process_one_work+0x149/0x360
      [  103.413105]  worker_thread+0x4d/0x3c0
      [  103.413109]  kthread+0x109/0x140
      [  103.413111]  ? rescuer_thread+0x380/0x380
      [  103.413113]  ? kthread_park+0x60/0x60
      [  103.413120]  ret_from_fork+0x2c/0x40
      [  103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8
      [  103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10
      [  103.413146] CR2: 000000000000000a
      [  103.413157] ---[ end trace cd6875d16eb5a11e ]---
      [  103.455368] Kernel panic - not syncing: Fatal exception
      [  103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  103.850916] ---[ end Kernel panic - not syncing: Fatal exception
      [  103.857637] sched: Unexpected reschedule of offline CPU#1!
      [  103.863762] ------------[ cut here ]------------
      
      [2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off
      [  247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds.
      [  247.137311]       Not tainted 4.12.0-rc2.upstream+ #4
      [  247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  247.151704] Call Trace:
      [  247.154445]  __schedule+0x28a/0x880
      [  247.158341]  schedule+0x36/0x80
      [  247.161850]  blk_mq_freeze_queue_wait+0x4b/0xb0
      [  247.166913]  ? remove_wait_queue+0x60/0x60
      [  247.171485]  blk_freeze_queue+0x1a/0x20
      [  247.175770]  blk_cleanup_queue+0x7f/0x140
      [  247.180252]  nvme_ns_remove+0xa3/0xb0 [nvme_core]
      [  247.185503]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
      [  247.191532]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
      [  247.196977]  nvme_remove+0x70/0x110 [nvme]
      [  247.201545]  pci_device_remove+0x39/0xc0
      [  247.205927]  device_release_driver_internal+0x141/0x200
      [  247.211761]  device_release_driver+0x12/0x20
      [  247.216531]  pci_stop_bus_device+0x8c/0xa0
      [  247.221104]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
      [  247.227420]  remove_store+0x7c/0x90
      [  247.231320]  dev_attr_store+0x18/0x30
      [  247.235409]  sysfs_kf_write+0x3a/0x50
      [  247.239497]  kernfs_fop_write+0xff/0x180
      [  247.243867]  __vfs_write+0x37/0x160
      [  247.247757]  ? selinux_file_permission+0xe5/0x120
      [  247.253011]  ? security_file_permission+0x3b/0xc0
      [  247.258260]  vfs_write+0xb2/0x1b0
      [  247.261964]  ? syscall_trace_enter+0x1d0/0x2b0
      [  247.266924]  SyS_write+0x55/0xc0
      [  247.270540]  do_syscall_64+0x67/0x150
      [  247.274636]  entry_SYSCALL64_slow_path+0x25/0x25
      [  247.279794] RIP: 0033:0x7f5c96740840
      [  247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840
      [  247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001
      [  247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740
      [  247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400
      [  247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
      [  370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds.
      
      Fixes: 12d70958 ("blk-mq: don't fail allocating driver tag for stopped hw queue")
      Cc: stable@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      d964f04a
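The check restored by this fix can be illustrated with a small userspace model. The struct, its fields and the function name are hypothetical; the sketch only shows the idea that a stopped hardware queue must divert the request to the normal insert path instead of reaching the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hardware queue context. */
struct demo_hctx {
    bool stopped;     /* the queue has been stopped */
    int dispatched;   /* requests handed straight to the driver */
    int inserted;     /* requests parked for a later queue run */
};

/*
 * Try to issue a request directly. Returns true when the request reached
 * the driver, false when it was inserted for later because the queue is
 * stopped.
 */
static bool demo_try_issue_directly(struct demo_hctx *hctx)
{
    assert(hctx != NULL);

    if (hctx->stopped) {
        /* Never touch the driver while the queue is stopped. */
        hctx->inserted++;
        return false;
    }

    hctx->dispatched++;
    return true;
}
```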
    • blk-mq: pass correct hctx to blk_mq_try_issue_directly · dad7a3be
      Ming Lei committed
      When direct issue is done on a request picked up from the plug list,
      the hctx needs to be updated with the actual hw queue; otherwise the
      wrong hctx is used, which may hurt performance, especially when the
      wrong SRCU read lock is acquired/released.
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      dad7a3be
  5. 31 May, 2017 1 commit
  6. 27 May, 2017 2 commits
    • blk-mq: make per-sw-queue bio merge as default .bio_merge · 9bddeb2a
      Ming Lei committed
      Because the per-sw-queue bio merge does basically the same thing as the
      scheduler's .bio_merge(), this patch makes the per-sw-queue bio merge
      the default .bio_merge if no scheduler is used or the io scheduler
      doesn't provide .bio_merge().
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9bddeb2a
    • blk-mq: merge bio into sw queue before plugging · ab42f35d
      Ming Lei committed
      Before blk-mq was introduced, I/O was merged into the elevator
      before being put into the plug queue, but blk-mq changed the
      order and made merging into the sw queue basically impossible.
      As a result, throughput of sequential I/O was observed to degrade
      by about 10% to 20% on virtio-blk in the test[1] if mq-deadline isn't used.
      
      This patch moves the per-sw-queue bio merging before plugging,
      like what blk_queue_bio() does, and the performance regression is
      fixed in this situation.
      
      [1]. test script:
      sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=16 --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/vdb --name=virtio_blk-test-$RW --rw=$RW --output-format=json
      
      RW=read or write
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ab42f35d
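The ordering this commit restores (attempt a merge first, queue a new request only if that fails) can be sketched as follows. The fixed-size queue, the `demo_` names and the back-merge-only rule are simplifying assumptions; real merging also considers front merges, limits and request flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SWQ_DEPTH 8

/* A queued request covering [sector, sector + nr_sectors). */
struct demo_rq {
    unsigned long long sector;
    unsigned int nr_sectors;
};

/* Hypothetical per-CPU software queue. */
struct demo_swq {
    struct demo_rq rqs[DEMO_SWQ_DEPTH];
    int count;
};

/* Back merge: extend a queued request that ends where the new I/O starts. */
static bool demo_attempt_merge(struct demo_swq *q, unsigned long long sector,
                               unsigned int nr_sectors)
{
    for (int i = q->count - 1; i >= 0; i--) {
        struct demo_rq *rq = &q->rqs[i];

        if (rq->sector + rq->nr_sectors == sector) {
            rq->nr_sectors += nr_sectors;
            return true;
        }
    }
    return false;
}

/*
 * Submit new I/O. The merge attempt comes first; only when it fails is a
 * new request queued. Returns true when the I/O was merged.
 */
static bool demo_submit(struct demo_swq *q, unsigned long long sector,
                        unsigned int nr_sectors)
{
    assert(q != NULL);

    if (demo_attempt_merge(q, sector, nr_sectors))
        return true;

    assert(q->count < DEMO_SWQ_DEPTH); /* the sketch does not handle overflow */
    q->rqs[q->count].sector = sector;
    q->rqs[q->count].nr_sectors = nr_sectors;
    q->count++;
    return false;
}
```

For sequential I/O, each submission lands directly after the previous one, so most of them merge instead of becoming separate requests.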
  7. 23 May, 2017 1 commit
  8. 10 May, 2017 1 commit
    • blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op · f36ea50c
      Wen Xiong committed
      When formatting NVMe to 512B/4K + T10 DIF/DIX, dd with split op returns
      "Input/output error". It looks like the block layer splits the bio after
      calling bio_integrity_prep(bio). This patch fixes the issue.
      
      Below is how we debugged this issue:
      (1) Format the NVMe device to a 4K block size with type 2 DIF.
      (2) Run dd with a block size bigger than 1024k and oflag=direct:
      dd: error writing '/dev/nvme0n1': Input/output error
      
      We added some debug code in the nvme device driver. It showed us that the
      first op and the second op have the same bi and pi address. This is not
      correct.
      
      1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
      	dsmgmt=0x0, AT=0x0 & RT=0x505
      	Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828
      
      2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
      	AT=0x0 & RT=0x605  ==> This op fails, as do the subsequent 5 retries.
      	Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828
      
      With the fix, both the first op and the second op have
      correct bi and pi addresses.
      
      1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
      	dsmgmt=0x0, AT=0x0 & RT=0x505
      	Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
      	0x00002828
      2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
      	AT=0x0 & RT=0x605
      	Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
      	0x00003028
      Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f36ea50c
  9. 08 May, 2017 2 commits
    • blk-mq: make __blk_mq_stop_hw_queues static · ebd76857
      Colin Ian King committed
      Making __blk_mq_stop_hw_queues static fixes sparse warning:
      
        block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
        declared. Should it be static?
      
      Fixes: 2719aa21 ("blk-mq: don't use sync workqueue flushing from drivers")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ebd76857
    • block/mq: fix potential deadlock during cpu hotplug · 51d638b1
      Wanpeng Li committed
      This can be triggered by hot-unplugging one CPU.
      
      ======================================================
       [ INFO: possible circular locking dependency detected ]
       4.11.0+ #17 Not tainted
       -------------------------------------------------------
       step_after_susp/2640 is trying to acquire lock:
        (all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110
      
       but task is already holding lock:
        (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (cpu_hotplug.lock){+.+.+.}:
              lock_acquire+0x11c/0x230
              __mutex_lock+0x92/0x990
              mutex_lock_nested+0x1b/0x20
              get_online_cpus+0x64/0x80
              blk_mq_init_allocated_queue+0x3a0/0x4e0
              blk_mq_init_queue+0x3a/0x60
              loop_add+0xe5/0x280
              loop_init+0x124/0x177
              do_one_initcall+0x53/0x1c0
              kernel_init_freeable+0x1e3/0x27f
              kernel_init+0xe/0x100
              ret_from_fork+0x31/0x40
      
       -> #0 (all_q_mutex){+.+...}:
              __lock_acquire+0x189a/0x18a0
              lock_acquire+0x11c/0x230
              __mutex_lock+0x92/0x990
              mutex_lock_nested+0x1b/0x20
              blk_mq_queue_reinit_work+0x18/0x110
              blk_mq_queue_reinit_dead+0x1c/0x20
              cpuhp_invoke_callback+0x1f2/0x810
              cpuhp_down_callbacks+0x42/0x80
              _cpu_down+0xb2/0xe0
              freeze_secondary_cpus+0xb6/0x390
              suspend_devices_and_enter+0x3b3/0xa40
              pm_suspend+0x129/0x490
              state_store+0x82/0xf0
              kobj_attr_store+0xf/0x20
              sysfs_kf_write+0x45/0x60
              kernfs_fop_write+0x135/0x1c0
              __vfs_write+0x37/0x160
              vfs_write+0xcd/0x1d0
              SyS_write+0x58/0xc0
              do_syscall_64+0x8f/0x710
              return_from_SYSCALL_64+0x0/0x7a
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(cpu_hotplug.lock);
                                      lock(all_q_mutex);
                                      lock(cpu_hotplug.lock);
         lock(all_q_mutex);
      
        *** DEADLOCK ***
      
       8 locks held by step_after_susp/2640:
        #0:  (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0
        #1:  (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0
        #2:  (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0
        #3:  (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490
        #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20
        #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390
        #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0
        #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
      
       stack backtrace:
       CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
       Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
       Call Trace:
        dump_stack+0x99/0xce
        print_circular_bug+0x1fa/0x270
        __lock_acquire+0x189a/0x18a0
        lock_acquire+0x11c/0x230
        ? lock_acquire+0x11c/0x230
        ? blk_mq_queue_reinit_work+0x18/0x110
        ? blk_mq_queue_reinit_work+0x18/0x110
        __mutex_lock+0x92/0x990
        ? blk_mq_queue_reinit_work+0x18/0x110
        ? kmem_cache_free+0x2cb/0x330
        ? anon_transport_class_unregister+0x20/0x20
        ? blk_mq_queue_reinit_work+0x110/0x110
        mutex_lock_nested+0x1b/0x20
        ? mutex_lock_nested+0x1b/0x20
        blk_mq_queue_reinit_work+0x18/0x110
        blk_mq_queue_reinit_dead+0x1c/0x20
        cpuhp_invoke_callback+0x1f2/0x810
        ? __flow_cache_shrink+0x160/0x160
        cpuhp_down_callbacks+0x42/0x80
        _cpu_down+0xb2/0xe0
        freeze_secondary_cpus+0xb6/0x390
        suspend_devices_and_enter+0x3b3/0xa40
        ? rcu_read_lock_sched_held+0x79/0x80
        pm_suspend+0x129/0x490
        state_store+0x82/0xf0
        kobj_attr_store+0xf/0x20
        sysfs_kf_write+0x45/0x60
        kernfs_fop_write+0x135/0x1c0
        __vfs_write+0x37/0x160
        ? rcu_read_lock_sched_held+0x79/0x80
        ? rcu_sync_lockdep_assert+0x2f/0x60
        ? __sb_start_write+0xd9/0x1c0
        ? vfs_write+0x1ad/0x1d0
        vfs_write+0xcd/0x1d0
        SyS_write+0x58/0xc0
        ? rcu_read_lock_sched_held+0x79/0x80
        do_syscall_64+0x8f/0x710
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      The CPU hotplug path holds cpu_hotplug.lock and then reinitializes all
      existing blk-mq queues under all_q_mutex; blk_mq_init_allocated_queue(),
      however, takes these two locks in the inverse order. This is due to commit
      eabe0659 (blk/mq: Cure cpu hotplug lock inversion), which fixes a CPU hotplug
      lock inversion caused by the hotplug rework. That rework is still work in
      progress and lives in a -tip branch, so mainline cannot yet trigger that
      splat. The commit breaks Linus's tree in the merge window, so this patch
      reverts the lock order to avoid the splat in Linus's tree.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      51d638b1
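The circular dependency reported in the splat above can be reproduced with a toy lock-order tracker. This is a much-simplified sketch of the idea behind lockdep, not its implementation: the `demo_` names, the adjacency matrix and the two lock IDs are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_MAX_LOCKS 8

/* Hypothetical lock IDs named after the two locks in the report. */
enum { DEMO_CPU_HOTPLUG_LOCK = 0, DEMO_ALL_Q_MUTEX = 1 };

/* edge[a][b] is true when lock b was taken while lock a was held. */
struct demo_lockdep {
    bool edge[DEMO_MAX_LOCKS][DEMO_MAX_LOCKS];
};

/* Depth-first search: is there a recorded dependency path from -> to? */
static bool demo_reachable(const struct demo_lockdep *ld, int from, int to,
                           bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < DEMO_MAX_LOCKS; next++)
        if (ld->edge[from][next] && !seen[next] &&
            demo_reachable(ld, next, to, seen))
            return true;
    return false;
}

/*
 * Record "inner acquired while holding outer". Returns false, recording
 * nothing, when the new edge would close a cycle (a possible deadlock).
 */
static bool demo_acquire_order(struct demo_lockdep *ld, int outer, int inner)
{
    bool seen[DEMO_MAX_LOCKS];

    assert(outer >= 0 && outer < DEMO_MAX_LOCKS);
    assert(inner >= 0 && inner < DEMO_MAX_LOCKS);

    memset(seen, 0, sizeof(seen));
    if (demo_reachable(ld, inner, outer, seen))
        return false;

    ld->edge[outer][inner] = true;
    return true;
}
```

Recording the queue-init order (all_q_mutex, then the hotplug lock) and then the hotplug-callback order (hotplug lock, then all_q_mutex) makes the second call return false, which corresponds to the inversion in the report.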
  10. 04 May, 2017 3 commits
    • blk-mq: untangle debugfs and sysfs · 9c1051aa
      Omar Sandoval committed
      Originally, I tied debugfs registration/unregistration together with
      sysfs. There's no reason to do this, and it's getting in the way of
      letting schedulers define their own debugfs attributes. Instead, tie the
      debugfs registration to the lifetime of the structures themselves.
      
      The saner lifetimes mean we can also get rid of the extra mq directory
      and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
      just nvme0n1/hctx0/tags.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9c1051aa
    • block/mq: Cure cpu hotplug lock inversion · eabe0659
      Peter Zijlstra committed
      By poking at /debug/sched_features I triggered the following splat:
      
       [] ======================================================
       [] WARNING: possible circular locking dependency detected
       [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
       [] ------------------------------------------------------
       [] bash/2109 is trying to acquire lock:
       []  (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50
       []
       [] but task is already holding lock:
       []  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
       []
       [] which lock already depends on the new lock.
       []
       []
       [] the existing dependency chain (in reverse order) is:
       []
       [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
       []        lock_acquire+0x100/0x210
       []        down_write+0x28/0x60
       []        start_creating+0x5e/0xf0
       []        debugfs_create_dir+0x13/0x110
       []        blk_mq_debugfs_register+0x21/0x70
       []        blk_mq_register_dev+0x64/0xd0
       []        blk_register_queue+0x6a/0x170
       []        device_add_disk+0x22d/0x440
       []        loop_add+0x1f3/0x280
       []        loop_init+0x104/0x142
       []        do_one_initcall+0x43/0x180
       []        kernel_init_freeable+0x1de/0x266
       []        kernel_init+0xe/0x100
       []        ret_from_fork+0x31/0x40
       []
       [] -> #1 (all_q_mutex){+.+.+.}:
       []        lock_acquire+0x100/0x210
       []        __mutex_lock+0x6c/0x960
       []        mutex_lock_nested+0x1b/0x20
       []        blk_mq_init_allocated_queue+0x37c/0x4e0
       []        blk_mq_init_queue+0x3a/0x60
       []        loop_add+0xe5/0x280
       []        loop_init+0x104/0x142
       []        do_one_initcall+0x43/0x180
       []        kernel_init_freeable+0x1de/0x266
       []        kernel_init+0xe/0x100
       []        ret_from_fork+0x31/0x40
      
       []  *** DEADLOCK ***
       []
       [] 3 locks held by bash/2109:
       []  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0
       []  #1:  (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0
       []  #2:  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
       []
       [] stack backtrace:
       [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
       [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
       [] Call Trace:
      
       []  lock_acquire+0x100/0x210
       []  get_online_cpus+0x2a/0x90
       []  static_key_slow_dec+0x1b/0x50
       []  static_key_disable+0x20/0x30
       []  sched_feat_write+0x131/0x170
       []  full_proxy_write+0x97/0xd0
       []  __vfs_write+0x28/0x120
       []  vfs_write+0xb5/0x1a0
       []  SyS_write+0x49/0xa0
       []  entry_SYSCALL_64_fastpath+0x23/0xc2
      
      This is because of the cpu hotplug lock rework. Break the chain at #1
      by reversing the lock acquisition order. This way i_mutex_key#4 no
      longer depends on cpu_hotplug_lock and things are good.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      eabe0659
    • blk-mq: don't use sync workqueue flushing from drivers · 2719aa21
      Jens Axboe committed
      A previous commit introduced the sync flush, which we need from
      internal callers like blk_mq_quiesce_queue(). However, we also
      call the stop helpers from drivers, particularly from ->queue_rq()
      when we have to stop processing for a bit. We can't block from
      those locations, and we don't have to guarantee that we're
      fully flushed.
      
      Fixes: 9f993737 ("blk-mq: unify hctx delayed_run_work and run_work")
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2719aa21
  11. 03 May, 2017 1 commit
  12. 02 May, 2017 1 commit
  13. 28 Apr, 2017 2 commits
    • blk-mq: unify hctx delay_work and run_work · 21c6e939
      Jens Axboe committed
      The only difference between ->run_work and ->delay_work is that
      the latter is used to defer running a queue. This is done by
      marking the queue stopped, and scheduling ->delay_work to run
      sometime in the future. While the queue is stopped, direct runs
      or runs through ->run_work will not run the queue.
      
      If we combine the handlers, then we need to handle two things:
      
      1) If a delayed/stopped run is scheduled, then we should not run
         the queue before that has been completed.
      2) If a queue is delayed/stopped, the handler needs to restart
         the queue. Normally a run of a queue with the stopped bit set
         would be a no-op.
      
      Case 1 is handled by modifying a currently pending queue run
      to the deadline set by the caller of blk_mq_delay_queue().
      Subsequent attempts to queue a queue run will find the work
      item already pending, and direct runs will see a stopped queue
      as before.
      
      Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
      that tells the work handler that it should clear a stopped
      queue and run the handler.
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      21c6e939
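The two cases described in this commit message can be modelled as a small state machine. The bit names mirror the message, but the `demo_` struct, the helper functions and the synchronous "run" counter are assumptions; real work items run asynchronously and use atomic bit operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_S_STOPPED       (1u << 0)
#define DEMO_S_START_ON_RUN  (1u << 1)

/* Hypothetical hardware queue state. */
struct demo_hctx {
    unsigned int state;
    int runs;   /* number of times the queue was actually run */
};

/* Plain stop: later work-handler invocations are no-ops. */
static void demo_stop_queue(struct demo_hctx *h)
{
    h->state |= DEMO_S_STOPPED;
}

/* Delayed run: stop now and ask the handler to restart the queue. */
static void demo_delay_queue(struct demo_hctx *h)
{
    h->state |= DEMO_S_STOPPED | DEMO_S_START_ON_RUN;
}

/* Direct run: refuses to run a stopped queue (case 1). */
static bool demo_direct_run(struct demo_hctx *h)
{
    if (h->state & DEMO_S_STOPPED)
        return false;
    h->runs++;
    return true;
}

/* The single unified work handler (case 2). */
static void demo_run_work(struct demo_hctx *h)
{
    assert(h != NULL);

    if (h->state & DEMO_S_STOPPED) {
        if (!(h->state & DEMO_S_START_ON_RUN))
            return;                 /* plainly stopped: do nothing */
        /* Delayed run has come due: clear both bits and restart. */
        h->state &= ~(DEMO_S_STOPPED | DEMO_S_START_ON_RUN);
    }
    h->runs++;
}
```

The extra bit is what lets one handler tell a queue stopped by a driver apart from a queue that is only waiting for its delayed run.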
    • blk-mq: unify hctx delayed_run_work and run_work · 9f993737
      Jens Axboe committed
      They serve the exact same purpose. Get rid of the non-delayed
      work variant, and just run it without delay for the normal case.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9f993737
  14. 22 Apr, 2017 1 commit
    • blk-mq: Fix preempt count imbalance · abc25a69
      Bart Van Assche committed
      Avoid that the following kernel bug gets triggered:
      
      BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:349
      in_atomic(): 1, irqs_disabled(): 0, pid: 8019, name: find
      CPU: 10 PID: 8019 Comm: find Tainted: G        W I     4.11.0-rc4-dbg+ #2
      Call Trace:
       dump_stack+0x68/0x93
       ___might_sleep+0x16e/0x230
       __might_sleep+0x4a/0x80
       __ext4_get_inode_loc+0x1e0/0x4e0
       ext4_iget+0x70/0xbc0
       ext4_iget_normal+0x2f/0x40
       ext4_lookup+0xb6/0x1f0
       lookup_slow+0x104/0x1e0
       walk_component+0x19a/0x330
       path_lookupat+0x4b/0x100
       filename_lookup+0x9a/0x110
       user_path_at_empty+0x36/0x40
       vfs_statx+0x67/0xc0
       SYSC_newfstatat+0x20/0x40
       SyS_newfstatat+0xe/0x10
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      This happens since the big if/else in blk_mq_make_request() doesn't
      have a final else section that also drops the ctx. Add that.
      
      Fixes: b00c53e8 ("blk-mq: fix schedule-while-atomic with scheduler attached")
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Omar Sandoval <osandov@fb.com>
      
      Added a bit more to the commit log.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      abc25a69
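The imbalance described above comes down to one branch of an if/else chain forgetting to release what the function acquired on entry. A minimal sketch, in which the counter, the path names and the helpers are all hypothetical stand-ins for the preempt count and the ctx get/put pair:

```c
#include <assert.h>

/* Stand-in for the preempt count: get raises it, put lowers it. */
static int demo_preempt_count;

static void demo_get_ctx(void) { demo_preempt_count++; }
static void demo_put_ctx(void) { demo_preempt_count--; }

/* Hypothetical submission paths through the function. */
enum demo_path { DEMO_PATH_PLUG, DEMO_PATH_SCHED, DEMO_PATH_OTHER };

/* Every branch must drop the ctx before the function returns. */
static void demo_make_request(enum demo_path path)
{
    demo_get_ctx();

    if (path == DEMO_PATH_PLUG) {
        demo_put_ctx();
        /* ... add the request to the plug list ... */
    } else if (path == DEMO_PATH_SCHED) {
        demo_put_ctx();
        /* ... insert through the scheduler, which may block ... */
    } else {
        /* The final else that was missing: drop the ctx here too. */
        demo_put_ctx();
    }

    /* Invariant: the count is back to where it was on entry. */
    assert(demo_preempt_count >= 0);
}
```

Without the last branch, the count stays raised after the function returns, so a later sleeping call in unrelated code trips the "sleeping function called from invalid context" check.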
  15. 21 Apr, 2017 6 commits
    • blk-stat: kill blk_stat_rq_ddir() · 99c749a4
      Jens Axboe committed
      No point in providing and exporting this helper. There's just
      one (real) user of it, just use rq_data_dir().
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      99c749a4
    • blk-mq: add might_sleep check to blk_mq_get_driver_tag() · 5feeacdd
      Jens Axboe committed
      If the caller passes in wait=true, it has to be able to block
      for a driver tag. We just had a bug where flush insertion
      would block on tag allocation, while we had preempt disabled.
      Ensure that we catch cases like that earlier next time.
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5feeacdd
    • blk-mq: Fix poll_stat for new size-based bucketing. · 0206319f
      Stephen Bates committed
      Fixes an issue where the size of the poll_stat array in request_queue
      does not match the size expected by the new size-based bucketing for
      IO completion polling.
      
      Fixes: 720b8ccc ("blk-mq: Add a polling specific stats function")
      Signed-off-by: Stephen Bates <sbates@raithlin.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      0206319f
    • blk-mq: fix schedule-while-atomic with scheduler attached · b00c53e8
      Jens Axboe committed
      We must have dropped the ctx before we call
      blk_mq_sched_insert_request() with can_block=true, otherwise we risk
      that a flush request can block on insertion if we are currently out of
      tags.
      
      [   47.667190] BUG: scheduling while atomic: jbd2/sda2-8/2089/0x00000002
      [   47.674493] Modules linked in: x86_pkg_temp_thermal btrfs xor zlib_deflate raid6_pq sr_mod cdre
      [   47.690572] Preemption disabled at:
      [   47.690584] [<ffffffff81326c7c>] blk_mq_sched_get_request+0x6c/0x280
      [   47.701764] CPU: 1 PID: 2089 Comm: jbd2/sda2-8 Not tainted 4.11.0-rc7+ #271
      [   47.709630] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
      [   47.718081] Call Trace:
      [   47.720903]  dump_stack+0x4f/0x73
      [   47.724694]  ? blk_mq_sched_get_request+0x6c/0x280
      [   47.730137]  __schedule_bug+0x6c/0xc0
      [   47.734314]  __schedule+0x559/0x780
      [   47.738302]  schedule+0x3b/0x90
      [   47.741899]  io_schedule+0x11/0x40
      [   47.745788]  blk_mq_get_tag+0x167/0x2a0
      [   47.750162]  ? remove_wait_queue+0x70/0x70
      [   47.754901]  blk_mq_get_driver_tag+0x92/0xf0
      [   47.759758]  blk_mq_sched_insert_request+0x134/0x170
      [   47.765398]  ? blk_account_io_start+0xd0/0x270
      [   47.770679]  blk_mq_make_request+0x1b2/0x850
      [   47.775766]  generic_make_request+0xf7/0x2d0
      [   47.780860]  submit_bio+0x5f/0x120
      [   47.784979]  ? submit_bio+0x5f/0x120
      [   47.789631]  submit_bh_wbc.isra.46+0x10d/0x130
      [   47.794902]  submit_bh+0xb/0x10
      [   47.798719]  journal_submit_commit_record+0x190/0x210
      [   47.804686]  ? _raw_spin_unlock+0x13/0x30
      [   47.809480]  jbd2_journal_commit_transaction+0x180a/0x1d00
      [   47.815925]  kjournald2+0xb6/0x250
      [   47.820022]  ? kjournald2+0xb6/0x250
      [   47.824328]  ? remove_wait_queue+0x70/0x70
      [   47.829223]  kthread+0x10e/0x140
      [   47.833147]  ? commit_timeout+0x10/0x10
      [   47.837742]  ? kthread_create_on_node+0x40/0x40
      [   47.843122]  ret_from_fork+0x29/0x40
      
      Fixes: a4d907b6 ("blk-mq: streamline blk_mq_make_request")
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b00c53e8
    • blk-mq: Add a polling specific stats function · 720b8ccc
      Stephen Bates committed
      Rather than bucketing IO statistics based on direction only, we also
      bucket based on the IO size. This leads to improved polling
      performance. Update the bucket callback function and use it in the
      polling latency estimation.
      Signed-off-by: Stephen Bates <sbates@raithlin.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      720b8ccc
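Bucketing by both direction and size can be sketched as below. The layout is an assumption made for illustration and is not stated in the commit message: 16 buckets, reads and writes interleaved, one pair of buckets per power-of-two size starting at 512 bytes, with larger sizes clamped into the last pair.

```c
#include <assert.h>

#define DEMO_POLL_STATS_BKTS 16   /* assumed bucket count */

/* Floor of log2(v); returns -1 for v == 0. */
static int demo_ilog2(unsigned int v)
{
    int log = -1;

    while (v) {
        v >>= 1;
        log++;
    }
    return log;
}

/*
 * Map an I/O to a stats bucket.
 * ddir: 0 for read, 1 for write. bytes: size of the I/O.
 * Returns the bucket index, or -1 for I/O smaller than 512 bytes.
 */
static int demo_poll_stats_bkt(int ddir, unsigned int bytes)
{
    assert(ddir == 0 || ddir == 1);

    /* 512 bytes is 2^9, so the first size class starts at log2 == 9. */
    int bucket = ddir + 2 * (demo_ilog2(bytes) - 9);

    if (bucket < 0)
        return -1;
    if (bucket >= DEMO_POLL_STATS_BKTS)
        return ddir + DEMO_POLL_STATS_BKTS - 2;   /* clamp to the last pair */
    return bucket;
}
```

Keeping small and large I/O in separate buckets means the latency estimate used for polling is not an average over very different request sizes.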
    • blk-mq: fix potential oops with polling and blk-mq scheduler · 3a07bb1d
      Jens Axboe committed
      If we have a scheduler attached, blk_mq_tag_to_rq() on the
      scheduled tags will return NULL if a request is no longer
      in flight. This is different than using the normal tags,
      where it will always return the fixed request. Check for
      this condition for polling, in case we happen to enter
      polling for a completed request.
      
      The request address remains valid, so this check and return
      should be perfectly safe.
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Tested-by: Stephen Bates <sbates@raithlin.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      3a07bb1d