1. 10 12月, 2014 1 次提交
    • B
      blk-mq: Fix a use-after-free · 45a9c9d9
      Bart Van Assche 提交于
      blk-mq users are allowed to free the memory request_queue.tag_set
      points at after blk_cleanup_queue() has finished but before
      blk_release_queue() has started. This can happen e.g. in the SCSI
      core. The SCSI core namely embeds the tag_set structure in a SCSI
      host structure. The SCSI host structure is freed by
      scsi_host_dev_release(). This function is called after
      blk_cleanup_queue() finished but can be called before
      blk_release_queue().
      
      This means that it is not safe to access request_queue.tag_set from
      inside blk_release_queue(). Hence remove the blk_sync_queue() call
      from blk_release_queue(). This call is not necessary - outstanding
      requests must have finished before blk_release_queue() is
      called. Additionally, move the blk_mq_free_queue() call from
      blk_release_queue() to blk_cleanup_queue() to avoid that struct
      request_queue.tag_set gets accessed after it has been freed.
      
      This patch avoids that the following kernel oops can be triggered
      when deleting a SCSI host for which scsi-mq was enabled:
      
      Call Trace:
       [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270
       [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380
       [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180
       [<ffffffff8124d654>] blk_release_queue+0x84/0xd0
       [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
       [<ffffffff8126c140>] kobject_put+0x30/0x70
       [<ffffffff81245895>] blk_put_queue+0x15/0x20
       [<ffffffff8125c409>] disk_release+0x99/0xd0
       [<ffffffff8133d056>] device_release+0x36/0xb0
       [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
       [<ffffffff8126c140>] kobject_put+0x30/0x70
       [<ffffffff8125a78a>] put_disk+0x1a/0x20
       [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0
       [<ffffffff811d56a0>] blkdev_put+0x50/0x160
       [<ffffffff81199eb4>] kill_block_super+0x44/0x70
       [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60
       [<ffffffff8119a87e>] deactivate_super+0x4e/0x70
       [<ffffffff811b9833>] cleanup_mnt+0x43/0x90
       [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20
       [<ffffffff8107252c>] task_work_run+0xac/0xe0
       [<ffffffff81002c01>] do_notify_resume+0x61/0xa0
       [<ffffffff814d2c58>] int_signal+0x12/0x17
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Robert Elliott <elliott@hp.com>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.13+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      45a9c9d9
  2. 04 12月, 2014 1 次提交
  3. 12 11月, 2014 1 次提交
    • C
      scsi: add new scsi-command flag for tagged commands · 125c99bc
      Christoph Hellwig 提交于
      Currently scsi piggy backs on the block layer to define the concept
      of a tagged command.  But we want to be able to have block-level host-wide
      tags assigned even for untagged commands like the initial INQUIRY, so add
      a new SCSI-level flag for commands that are tagged at the scsi level, so
      that even commands without that set can have tags assigned to them.  Note
      that this alredy is the case for the blk-mq code path, and this just lets
      the old path catch up with it.
      
      We also set this flag based upon sdev->simple_tags instead of the block
      queue flag, so that it is entirely independent of the block layer tagging,
      and thus always correct even if a driver doesn't use block level tagging
      yet.
      
      Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and
      removing it forces them to look for the proper replacement.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMike Christie <michaelc@cs.wisc.edu>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      125c99bc
  4. 13 10月, 2014 2 次提交
  5. 04 10月, 2014 1 次提交
  6. 01 10月, 2014 1 次提交
  7. 26 9月, 2014 6 次提交
  8. 09 9月, 2014 2 次提交
  9. 29 8月, 2014 1 次提交
    • J
      block,scsi: fixup blk_get_request dead queue scenarios · a492f075
      Joe Lawrence 提交于
      The blk_get_request function may fail in low-memory conditions or during
      device removal (even if __GFP_WAIT is set). To distinguish between these
      errors, modify the blk_get_request call stack to return the appropriate
      ERR_PTR. Verify that all callers check the return status and consider
      IS_ERR instead of a simple NULL pointer check.
      
      For consistency, make a similar change to the blk_mq_alloc_request leg
      of blk_get_request.  It may fail if the queue is dead, or the caller was
      unwilling to wait.
      Signed-off-by: NJoe Lawrence <joe.lawrence@stratus.com>
      Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd]
      Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd]
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a492f075
  10. 23 8月, 2014 1 次提交
  11. 02 7月, 2014 2 次提交
    • T
      blk-mq: decouble blk-mq freezing from generic bypassing · 780db207
      Tejun Heo 提交于
      blk_mq freezing is entangled with generic bypassing which bypasses
      blkcg and io scheduler and lets IO requests fall through the block
      layer to the drivers in FIFO order.  This allows forward progress on
      IOs with the advanced features disabled so that those features can be
      configured or altered without worrying about stalling IO which may
      lead to deadlock through memory allocation.
      
      However, generic bypassing doesn't quite fit blk-mq.  blk-mq currently
      doesn't make use of blkcg or ioscheds and it maps bypssing to
      freezing, which blocks request processing and drains all the in-flight
      ones.  This causes problems as bypassing assumes that request
      processing is online.  blk-mq works around this by conditionally
      allowing request processing for the problem case - during queue
      initialization.
      
      Another weirdity is that except for during queue cleanup, bypassing
      started on the generic side prevents blk-mq from processing new
      requests but doesn't drain the in-flight ones.  This shouldn't break
      anything but again highlights that something isn't quite right here.
      
      The root cause is conflating blk-mq freezing and generic bypassing
      which are two different mechanisms.  The only intersecting purpose
      that they serve is during queue cleanup.  Let's properly separate
      blk-mq freezing from generic bypassing and simply use it where
      necessary.
      
      * request_queue->mq_freeze_depth is added and
        blk_mq_[un]freeze_queue() now operate on this counter instead of
        ->bypass_depth.  The replacement for QUEUE_FLAG_BYPASS isn't added
        but the counter is tested directly.  This will be further updated by
        later changes.
      
      * blk_mq_drain_queue() is dropped and "__" prefix is dropped from
        blk_mq_freeze_queue().  Queue cleanup path now calls
        blk_mq_freeze_queue() directly.
      
      * blk_queue_enter()'s fast path condition is simplified to simply
        check @q->mq_freeze_depth.  Previously, the condition was
      
      	!blk_queue_dying(q) &&
      	    (!blk_queue_bypass(q) || !blk_queue_init_done(q))
      
        mq_freeze_depth is incremented right after dying is set and
        blk_queue_init_done() exception isn't necessary as blk-mq doesn't
        start frozen, which only leaves the blk_queue_bypass() test which
        can be replaced by @q->mq_freeze_depth test.
      
      This change simplifies the code and reduces confusion in the area.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      780db207
    • T
      block, blk-mq: draining can't be skipped even if bypass_depth was non-zero · 776687bc
      Tejun Heo 提交于
      Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
      skip queue draining if bypass_depth was already above zero.  The
      assumption is that the one which bumped the bypass_depth should have
      performed draining already; however, there's nothing which prevents a
      new instance of bypassing/freezing from starting before the previous
      one finishes draining.  The current code may allow the later
      bypassing/freezing instances to complete while there still are
      in-flight requests which haven't finished draining.
      
      Fix it by draining regardless of bypass_depth.  We still skip draining
      from blk_queue_bypass_start() while the queue is initializing to avoid
      introducing excessive delays during boot.  INIT_DONE setting is moved
      above the initial blk_queue_bypass_end() so that bypassing attempts
      can't slip inbetween.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      776687bc
  12. 12 6月, 2014 1 次提交
    • M
      block: remove WQ_POWER_EFFICIENT from kblockd · 28747fcd
      Matias Bjørling 提交于
      blk-mq issues async requests through kblockd. To issue a work request on
      a specific CPU, kblockd_schedule_delayed_work_on is used. However, the
      specific CPU choice may not be honored, if the power_efficient option
      for workqueues is set. blk-mq requires that we have strict per-cpu
      scheduling, so it wont work properly if kblockd is marked
      POWER_EFFICIENT and power_efficient is set.
      
      Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior.
      This essentially reverts part of commit 695588f9, which added
      the WQ_POWER_EFFICIENT marker to kblockd.
      Signed-off-by: NMatias Bjørling <m@bjorling.me>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      28747fcd
  13. 06 6月, 2014 1 次提交
    • J
      block: add blk_rq_set_block_pc() · f27b087b
      Jens Axboe 提交于
      With the optimizations around not clearing the full request at alloc
      time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC
      up to the user allocating the request.
      
      Add a blk_rq_set_block_pc() that sets the command type to
      REQ_TYPE_BLOCK_PC, and properly initializes the members associated
      with this type of request. Update callers to use this function instead
      of manipulating rq->cmd_type directly.
      
      Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed
      attempt.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f27b087b
  14. 29 5月, 2014 1 次提交
  15. 28 5月, 2014 1 次提交
  16. 27 5月, 2014 1 次提交
  17. 21 5月, 2014 2 次提交
  18. 10 5月, 2014 1 次提交
  19. 05 5月, 2014 1 次提交
  20. 17 4月, 2014 2 次提交
  21. 16 4月, 2014 1 次提交
    • J
      block: remove struct request buffer member · b4f42e28
      Jens Axboe 提交于
      This was used in the olden days, back when onions were proper
      yellow. Basically it mapped to the current buffer to be
      transferred. With highmem being added more than a decade ago,
      most drivers map pages out of a bio, and rq->buffer isn't
      pointing at anything valid.
      
      Convert old style drivers to just use bio_data().
      
      For the discard payload use case, just reference the page
      in the bio.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b4f42e28
  22. 11 4月, 2014 1 次提交
  23. 10 4月, 2014 3 次提交
    • J
      block: fix regression with block enabled tagging · 360f92c2
      Jens Axboe 提交于
      Martin reported that his test system would not boot with
      current git, it oopsed with this:
      
      BUG: unable to handle kernel paging request at ffff88046c6c9e80
      IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
      PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
      Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
      rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
      CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
      Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
      Workqueue: events_unbound async_run_entry_fn
      task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
      RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
      blk_queue_start_tag+0x90/0x150
      RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
      RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
      RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
      RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
      R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
      R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
      FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
      Stack:
       ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
       ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
       ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
      Call Trace:
       [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
       [<ffffffff81293167>] __blk_run_queue+0x37/0x50
       [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
       [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
       [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
       [<ffffffff813bba35>] scsi_execute+0xe5/0x180
       [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
       [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
       [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
       [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
       [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
       [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
       [<ffffffff8106a1c5>] process_one_work+0x175/0x410
       [<ffffffff8106b373>] worker_thread+0x123/0x400
       [<ffffffff8106b250>] ? manage_workers+0x160/0x160
       [<ffffffff8107104e>] kthread+0xce/0xf0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
       [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
      Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
      df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
      58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
      RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
       RSP <ffff880273d03a58>
      CR2: ffff88046c6c9e80
      
      Martin bisected and found this to be the problem patch;
      
      	commit 6d113398
      	Author: Jan Kara <jack@suse.cz>
      	Date:   Mon Feb 24 16:39:54 2014 +0100
      
      	    block: Stop abusing rq->csd.list in blk-softirq
      
      and the problem was immediately apparent. The patch states that
      it is safe to reuse queuelist at completion time, since it is
      no longer used. However, that is not true if a device is using
      block enabled tagging. If that is the case, then the queuelist
      is reused to keep track of busy tags. If a device also ended
      up using softirq completions, we'd reuse ->queuelist for the
      IPI handling while block tagging was still using it. Boom.
      
      Fix this by adding a new ipi_list list head, and share the
      memory used with the request hash table. The hash table is
      never used after the request is moved to the dispatch list,
      which happens long before any potential completion of the
      request. Add a new request bit for this, so we don't have
      cases that check rq->hash while it could potentially have
      been reused for the IPI completion.
      Reported-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Tested-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      360f92c2
    • J
      block: add kblockd_schedule_delayed_work_on() · 8ab14595
      Jens Axboe 提交于
      Same function as kblockd_schedule_delayed_work(), but allow the
      caller to pass in a CPU that the work should be executed on. This
      just directly extends and maps into the workqueue API, and will
      be used to make the blk-mq mappings more strict.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ab14595
    • J
      block: remove 'q' parameter from kblockd_schedule_*_work() · 59c3d45e
      Jens Axboe 提交于
      The queue parameter is never used, just get rid of it.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      59c3d45e
  24. 21 3月, 2014 1 次提交
  25. 09 3月, 2014 1 次提交
    • M
      block: fix q->flush_rq NULL pointer crash on dm-mpath flush · 7982e90c
      Mike Snitzer 提交于
      Commit 18741986 ("blk-mq: rework flush sequencing logic") switched
      ->flush_rq from being an embedded member of the request_queue structure
      to being dynamically allocated in blk_init_queue_node().
      
      Request-based DM multipath doesn't use blk_init_queue_node(), instead it
      uses blk_alloc_queue_node() + blk_init_allocated_queue().  Because
      commit 18741986 placed the dynamic allocation of ->flush_rq in
      blk_init_queue_node() any flush issued to a dm-mpath device would crash
      with a NULL pointer, e.g.:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0
      PGD bb3c7067 PUD bb01d067 PMD 0
      Oops: 0002 [#1] SMP
      ...
      CPU: 5 PID: 5028 Comm: dt Tainted: G        W  O 3.14.0-rc3.snitm+ #10
      ...
      task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000
      RIP: 0010:[<ffffffff8125037e>]  [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0
      RSP: 0018:ffff880079565c98  EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030
      RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001
      R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000
      R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000
      FS:  00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0
      Stack:
       0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47
       ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000
       ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8
      Call Trace:
       [<ffffffff81257a47>] blk_flush_complete_seq+0x2d7/0x2e0
       [<ffffffff81257c2d>] blk_insert_flush+0x1dd/0x210
       [<ffffffff8124ec59>] __elv_add_request+0x1f9/0x320
       [<ffffffff81250681>] ? blk_account_io_start+0x111/0x190
       [<ffffffff81253a4b>] blk_queue_bio+0x25b/0x330
       [<ffffffffa0020bf5>] dm_request+0x35/0x40 [dm_mod]
       [<ffffffff812530c0>] generic_make_request+0xc0/0x100
       [<ffffffff81253173>] submit_bio+0x73/0x140
       [<ffffffff811becdd>] submit_bio_wait+0x5d/0x80
       [<ffffffff81257528>] blkdev_issue_flush+0x78/0xa0
       [<ffffffff811c1f6f>] blkdev_fsync+0x3f/0x60
       [<ffffffff811b7fde>] vfs_fsync_range+0x1e/0x20
       [<ffffffff811b7ffc>] vfs_fsync+0x1c/0x20
       [<ffffffff811b81f1>] do_fsync+0x41/0x80
       [<ffffffff8118874e>] ? SyS_lseek+0x7e/0x80
       [<ffffffff811b8260>] SyS_fsync+0x10/0x20
       [<ffffffff8154c2d2>] system_call_fastpath+0x16/0x1b
      
      Fix this by moving the ->flush_rq allocation from blk_init_queue_node()
      to blk_init_allocated_queue().  blk_init_queue_node() also calls
      blk_init_allocated_queue() so this change is functionality equivalent
      for all blk_init_queue_node() callers.
      Reported-by: NHannes Reinecke <hare@suse.de>
      Reported-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7982e90c
  26. 06 3月, 2014 1 次提交
    • R
      blktrace: fix accounting of partially completed requests · af5040da
      Roman Pen 提交于
      trace_block_rq_complete does not take into account that request can
      be partially completed, so we can get the following incorrect output
      of blkparser:
      
        C   R 232 + 240 [0]
        C   R 240 + 232 [0]
        C   R 248 + 224 [0]
        C   R 256 + 216 [0]
      
      but should be:
      
        C   R 232 + 8 [0]
        C   R 240 + 8 [0]
        C   R 248 + 8 [0]
        C   R 256 + 8 [0]
      
      Also, the whole output summary statistics of completed requests and
      final throughput will be incorrect.
      
      This patch takes into account real completion size of the request and
      fixes wrong completion accounting.
      Signed-off-by: NRoman Pen <r.peniaev@gmail.com>
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      CC: Ingo Molnar <mingo@redhat.com>
      CC: linux-kernel@vger.kernel.org
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      af5040da
  27. 19 2月, 2014 1 次提交
  28. 11 2月, 2014 1 次提交
    • C
      blk-mq: rework flush sequencing logic · 18741986
      Christoph Hellwig 提交于
      Witch to using a preallocated flush_rq for blk-mq similar to what's done
      with the old request path.  This allows us to set up the request properly
      with a tag from the actually allowed range and ->rq_disk as needed by
      some drivers.  To make life easier we also switch to dynamic allocation
      of ->flush_rq for the old path.
      
      This effectively reverts most of
      
          "blk-mq: fix for flush deadlock"
      
      and
      
          "blk-mq: Don't reserve a tag for flush request"
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      18741986