1. 02 Aug 2017, 1 commit
  2. 04 Jul 2017, 1 commit
  3. 29 Jun 2017, 1 commit
  4. 28 Jun 2017, 2 commits
  5. 22 Jun 2017, 3 commits
  6. 21 Jun 2017, 5 commits
  7. 20 Jun 2017, 3 commits
    • block: return on congested block device · 03a07c92
      Committed by Goldwyn Rodrigues
      A new bio operation flag, REQ_NOWAIT, is introduced to identify bios
      originating from an iocb with IOCB_NOWAIT set. The flag tells the
      block layer to return immediately if a request cannot be made,
      instead of retrying.
      
      Stacked devices such as md (the ones with make_request_fn hooks) are
      not supported yet, because they may block for housekeeping; for
      example, part of an md device can be suspended. For this reason, only
      request-based devices are supported for now. In the future, this
      feature can be expanded to stacked devices by teaching them how to
      handle the REQ_NOWAIT flag.
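      A minimal sketch of the intended usage (the helper name is hypothetical;
      this is not the exact upstream hunk): the submitter propagates
      IOCB_NOWAIT from the iocb onto the bio, and the block layer then fails
      such bios early instead of sleeping when the device is congested:

      	/* Propagate the caller's nowait intent onto the bio. */
      	static void dio_bio_set_nowait(struct kiocb *iocb, struct bio *bio)
      	{
      		if (iocb->ki_flags & IOCB_NOWAIT)
      			bio->bi_opf |= REQ_NOWAIT;	/* don't block on congestion */
      	}
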
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Committed by Ingo Molnar
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there are a number of wait-queue users where the lists are
      not for 'tasks' but for other entities (poll tables, etc.), in which
      case the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... it was pretty unclear (to me) what it was doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating; during review it's
      hard to tell whether the code is trying to walk a wait-queue entry (which
      would be a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
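
      For reference, a sketch of the resulting structures (fields abridged):

      	struct wait_queue_head {
      		spinlock_t		lock;
      		struct list_head	head;	/* was ->task_list */
      	};

      	struct wait_queue_entry {
      		unsigned int		flags;
      		void			*private;
      		wait_queue_func_t	func;
      		struct list_head	entry;	/* was ->task_list */
      	};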
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Committed by Ingo Molnar
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
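
      A short sketch of the resulting naming (assuming the usual typedef pattern):

      	/* before */
      	typedef struct __wait_queue		wait_queue_t;
      	/* after */
      	typedef struct wait_queue_entry		wait_queue_entry_t;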
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 19 Jun 2017, 16 commits
  9. 13 Jun 2017, 1 commit
  10. 09 Jun 2017, 2 commits
    • blk-mq: switch ->queue_rq return value to blk_status_t · fc17b653
      Committed by Christoph Hellwig
      Use the same values for request completion errors as for the return
      value from ->queue_rq.  BLK_STS_RESOURCE is special-cased to cause a
      requeue; all the others complete the request as-is.
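      A hedged sketch of a driver's ->queue_rq under the new convention (the
      driver name and the resource check are hypothetical):

      	static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
      					 const struct blk_mq_queue_data *bd)
      	{
      		struct request *rq = bd->rq;

      		if (!foo_hw_has_free_slot(hctx->driver_data))	/* hypothetical check */
      			return BLK_STS_RESOURCE;	/* core requeues the request */

      		blk_mq_start_request(rq);
      		/* ... issue rq to the hardware ... */
      		return BLK_STS_OK;	/* any other status completes the request as-is */
      	}
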
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: introduce new block status code type · 2a842aca
      Committed by Christoph Hellwig
      Currently we use normal Linux errno values in the block layer, and while
      we accept any error, a few of them have overloaded magic meanings.  This
      patch instead introduces a new blk_status_t value that holds block layer
      specific status codes and explicitly documents their meaning.  Helpers to
      convert to and from the previous special meanings are provided for now,
      but I suspect we want to get rid of them in the long run - drivers that
      have an errno input (e.g. networking) usually get errnos that don't know
      about the special block layer overloads, and similarly returning them to
      userspace will usually return something that, strictly speaking, isn't
      correct for file system operations, but that's left as an exercise for later.
      
      For now the set of errors is very limited and closely corresponds to the
      previous overloaded errno values, but there is some low-hanging fruit for
      improving it.
      
      blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
      typechecking, so that we can easily catch places passing the wrong values.
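
      A sketch of the new type and its conversion helpers (the non-zero code
      value shown is illustrative):

      	typedef u8 __bitwise blk_status_t;

      	#define BLK_STS_OK	0
      	#define BLK_STS_NOTSUPP	((__force blk_status_t)1)

      	/* map to/from errnos at the boundaries that still need them */
      	int blk_status_to_errno(blk_status_t status);
      	blk_status_t errno_to_blk_status(int errno);
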
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 07 Jun 2017, 2 commits
    • blk-mq: fix direct issue · d964f04a
      Committed by Ming Lei
      If the queue is stopped, we shouldn't dispatch requests to the driver
      and hardware; unfortunately that check was removed in bd166ef1
      (blk-mq-sched: add framework for MQ capable IO schedulers).

      This patch fixes the issue by moving the check back into
      __blk_mq_try_issue_directly() (a sketch of the restored check follows
      the logs below).

      It fixes a request use-after-free[1][2] when canceling NVMe requests
      in nvme_dev_disable(), which can be triggered easily during an NVMe
      reset & remove test.
      
      [1] oops kernel log when CONFIG_BLK_DEV_INTEGRITY is on
      [  103.412969] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
      [  103.412980] IP: bio_integrity_advance+0x48/0xf0
      [  103.412981] PGD 275a88067
      [  103.412981] P4D 275a88067
      [  103.412982] PUD 276c43067
      [  103.412983] PMD 0
      [  103.412984]
      [  103.412986] Oops: 0000 [#1] SMP
      [  103.412989] Modules linked in: vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support mxm_wmi glue_helper dcdbas ipmi_si mei_me pcspkr mei sg ipmi_devintf lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel nvme ahci nvme_core libahci libata tg3 i2c_core megaraid_sas ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  103.413035] CPU: 0 PID: 102 Comm: kworker/0:2 Not tainted 4.11.0+ #1
      [  103.413036] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  103.413041] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
      [  103.413043] task: ffff9cc8775c8000 task.stack: ffffc033c252c000
      [  103.413045] RIP: 0010:bio_integrity_advance+0x48/0xf0
      [  103.413046] RSP: 0018:ffffc033c252fc10 EFLAGS: 00010202
      [  103.413048] RAX: 0000000000000000 RBX: ffff9cc8720a8cc0 RCX: ffff9cca72958240
      [  103.413049] RDX: ffff9cca72958000 RSI: 0000000000000008 RDI: ffff9cc872537f00
      [  103.413049] RBP: ffffc033c252fc28 R08: 0000000000000000 R09: ffffffffb963a0d5
      [  103.413050] R10: 000000000000063e R11: 0000000000000000 R12: ffff9cc8720a8d18
      [  103.413051] R13: 0000000000001000 R14: ffff9cc872682e00 R15: 00000000fffffffb
      [  103.413053] FS:  0000000000000000(0000) GS:ffff9cc877c00000(0000) knlGS:0000000000000000
      [  103.413054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  103.413055] CR2: 000000000000000a CR3: 0000000276c41000 CR4: 00000000001406f0
      [  103.413056] Call Trace:
      [  103.413063]  bio_advance+0x2a/0xe0
      [  103.413067]  blk_update_request+0x76/0x330
      [  103.413072]  blk_mq_end_request+0x1a/0x70
      [  103.413074]  blk_mq_dispatch_rq_list+0x370/0x410
      [  103.413076]  ? blk_mq_flush_busy_ctxs+0x94/0xe0
      [  103.413080]  blk_mq_sched_dispatch_requests+0x173/0x1a0
      [  103.413083]  __blk_mq_run_hw_queue+0x8e/0xa0
      [  103.413085]  __blk_mq_delay_run_hw_queue+0x9d/0xa0
      [  103.413088]  blk_mq_start_hw_queue+0x17/0x20
      [  103.413090]  blk_mq_start_hw_queues+0x32/0x50
      [  103.413095]  nvme_kill_queues+0x54/0x80 [nvme_core]
      [  103.413097]  nvme_remove_dead_ctrl_work+0x1f/0x40 [nvme]
      [  103.413103]  process_one_work+0x149/0x360
      [  103.413105]  worker_thread+0x4d/0x3c0
      [  103.413109]  kthread+0x109/0x140
      [  103.413111]  ? rescuer_thread+0x380/0x380
      [  103.413113]  ? kthread_park+0x60/0x60
      [  103.413120]  ret_from_fork+0x2c/0x40
      [  103.413121] Code: 08 4c 8b 63 50 48 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 31 c0 48 83 ba 40 02 00 00 00 48 8d 8a 40 02 00 00 48 0f 45 c1 c1 ee 09 <0f> b6 48 0a 0f b6 40 09 41 89 f5 83 e9 09 41 d3 ed 44 0f af e8
      [  103.413145] RIP: bio_integrity_advance+0x48/0xf0 RSP: ffffc033c252fc10
      [  103.413146] CR2: 000000000000000a
      [  103.413157] ---[ end trace cd6875d16eb5a11e ]---
      [  103.455368] Kernel panic - not syncing: Fatal exception
      [  103.459826] Kernel Offset: 0x37600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  103.850916] ---[ end Kernel panic - not syncing: Fatal exception
      [  103.857637] sched: Unexpected reschedule of offline CPU#1!
      [  103.863762] ------------[ cut here ]------------
      
      [2] kernel hang in blk_mq_freeze_queue_wait() when CONFIG_BLK_DEV_INTEGRITY is off
      [  247.129825] INFO: task nvme-test:1772 blocked for more than 120 seconds.
      [  247.137311]       Not tainted 4.12.0-rc2.upstream+ #4
      [  247.142954] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  247.151704] Call Trace:
      [  247.154445]  __schedule+0x28a/0x880
      [  247.158341]  schedule+0x36/0x80
      [  247.161850]  blk_mq_freeze_queue_wait+0x4b/0xb0
      [  247.166913]  ? remove_wait_queue+0x60/0x60
      [  247.171485]  blk_freeze_queue+0x1a/0x20
      [  247.175770]  blk_cleanup_queue+0x7f/0x140
      [  247.180252]  nvme_ns_remove+0xa3/0xb0 [nvme_core]
      [  247.185503]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
      [  247.191532]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
      [  247.196977]  nvme_remove+0x70/0x110 [nvme]
      [  247.201545]  pci_device_remove+0x39/0xc0
      [  247.205927]  device_release_driver_internal+0x141/0x200
      [  247.211761]  device_release_driver+0x12/0x20
      [  247.216531]  pci_stop_bus_device+0x8c/0xa0
      [  247.221104]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
      [  247.227420]  remove_store+0x7c/0x90
      [  247.231320]  dev_attr_store+0x18/0x30
      [  247.235409]  sysfs_kf_write+0x3a/0x50
      [  247.239497]  kernfs_fop_write+0xff/0x180
      [  247.243867]  __vfs_write+0x37/0x160
      [  247.247757]  ? selinux_file_permission+0xe5/0x120
      [  247.253011]  ? security_file_permission+0x3b/0xc0
      [  247.258260]  vfs_write+0xb2/0x1b0
      [  247.261964]  ? syscall_trace_enter+0x1d0/0x2b0
      [  247.266924]  SyS_write+0x55/0xc0
      [  247.270540]  do_syscall_64+0x67/0x150
      [  247.274636]  entry_SYSCALL64_slow_path+0x25/0x25
      [  247.279794] RIP: 0033:0x7f5c96740840
      [  247.283785] RSP: 002b:00007ffd00e87ee8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  247.292238] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5c96740840
      [  247.300194] RDX: 0000000000000002 RSI: 00007f5c97060000 RDI: 0000000000000001
      [  247.308159] RBP: 00007f5c97060000 R08: 000000000000000a R09: 00007f5c97059740
      [  247.316123] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f5c96a14400
      [  247.324087] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
      [  370.016340] INFO: task nvme-test:1772 blocked for more than 120 seconds.
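
      A hedged sketch of the restored check described above (placement per
      this changelog; not the exact upstream hunk):

      	/* in __blk_mq_try_issue_directly(): never dispatch to a stopped hw queue */
      	if (blk_mq_hctx_stopped(hctx))
      		goto insert;	/* fall back to the normal insert path */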
      
      Fixes: 12d70958 (blk-mq: don't fail allocating driver tag for stopped hw queue)
      Cc: stable@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: pass correct hctx to blk_mq_try_issue_directly · dad7a3be
      Committed by Ming Lei
      When direct issue is done on a request picked up from the plug list,
      the hctx needs to be updated to the actual hw queue; otherwise the
      wrong hctx is used, which may hurt performance, especially when the
      wrong SRCU read lock is acquired/released.
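      A hedged sketch of the fix described above (variable names and the
      blk_mq_map_queue() signature approximate the blk-mq code of that era;
      not the exact upstream hunk):

      	/* look up the hw queue that actually owns the plugged request
      	 * instead of reusing the hctx of the request being plugged */
      	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, old_rq->mq_ctx->cpu);

      	blk_mq_try_issue_directly(hctx, old_rq, &cookie);
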
      Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  12. 31 May 2017, 1 commit
  13. 27 May 2017, 2 commits
    • blk-mq: make per-sw-queue bio merge as default .bio_merge · 9bddeb2a
      Committed by Ming Lei
      Because what the per-sw-queue bio merge does is basically the same as
      the scheduler's .bio_merge(), this patch makes the per-sw-queue bio
      merge the default .bio_merge() when no scheduler is used or the I/O
      scheduler doesn't provide .bio_merge().
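      A hedged sketch of the fallback order described above (the .bio_merge
      hook signature is approximated and the per-sw-queue helper name is
      hypothetical):

      	static bool sched_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
      	{
      		struct elevator_queue *e = hctx->queue->elevator;

      		if (e && e->type->ops.mq.bio_merge)	/* scheduler has its own hook */
      			return e->type->ops.mq.bio_merge(hctx, bio);

      		/* no scheduler, or no .bio_merge(): merge against the sw queue */
      		return blk_mq_attempt_sw_queue_merge(hctx, bio);	/* hypothetical */
      	}
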
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • blk-mq: merge bio into sw queue before plugging · ab42f35d
      Committed by Ming Lei
      Before blk-mq was introduced, I/O was merged into the elevator before
      being put into the plug queue, but blk-mq changed that order and makes
      merging into the sw queue basically impossible. It was then observed
      that sequential I/O throughput is degraded by about 10%~20% on
      virtio-blk in the test[1] when mq-deadline isn't used.

      This patch moves the per-sw-queue bio merging before plugging, like
      what blk_queue_bio() does, which fixes the performance regression in
      this situation.
      
      [1]. test script:
      sudo fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=16 --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/vdb --name=virtio_blk-test-$RW --rw=$RW --output-format=json
      
      RW=read or write
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>