1. 28 Aug 2019, 2 commits
    • block: split .sysfs_lock into two locks · cecf5d87
      Authored by Ming Lei
      The kernfs built-in lock 'kn->count' is held in the sysfs .show/.store
      path. Meanwhile, inside block's .show/.store callbacks, q->sysfs_lock
      is required.
      
      However, when the mq and iosched kobjects are removed via
      blk_mq_unregister_dev() and elv_unregister_queue(), q->sysfs_lock is
      held too. This causes an AB-BA deadlock, because the kernfs built-in
      lock 'kn->count' is also required inside kobject_del(); see the
      lockdep warning [1].
      
      On the other hand, it isn't necessary to acquire q->sysfs_lock for
      either blk_mq_unregister_dev() or elv_unregister_queue(), because
      clearing the REGISTERED flag prevents stores to 'queue/scheduler'
      from happening. Also, sysfs writes (stores) are exclusive, so there
      is no need to hold the lock for elv_unregister_queue() when it is
      called on the elevator-switch path.
      
      So split .sysfs_lock in two: one keeps the name .sysfs_lock and
      covers synchronizing .store; the other, named .sysfs_dir_lock, covers
      the kobjects and related status changes.
      
      sysfs itself can handle the race between adding/removing kobjects and
      showing/storing attributes under those kobjects. For switching the
      scheduler via a store to 'queue/scheduler', we use the queue flag
      QUEUE_FLAG_REGISTERED together with .sysfs_lock to avoid the race;
      then we can avoid holding .sysfs_lock while removing/adding kobjects.
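
      In outline, the post-patch locking looks as follows. This is a
      simplified sketch based on the description above, not the actual
      diff; the _sketch suffixes mark illustrative helpers, and error
      paths are omitted:

        struct request_queue {
                /* ... */
                struct mutex sysfs_lock;     /* serializes sync .store */
                struct mutex sysfs_dir_lock; /* covers kobject add/remove */
        };

        /* store path: take .sysfs_lock, bail out once unregistered */
        static ssize_t queue_attr_store_sketch(struct request_queue *q,
                                               const char *buf, size_t count)
        {
                mutex_lock(&q->sysfs_lock);
                if (!blk_queue_registered(q)) {
                        mutex_unlock(&q->sysfs_lock);
                        return -ENOENT;
                }
                /* ... perform the store, e.g. switch the elevator ... */
                mutex_unlock(&q->sysfs_lock);
                return count;
        }

        /* teardown path: only .sysfs_dir_lock is held around kobject_del(),
         * so kernfs' 'kn->count' is never nested inside .sysfs_lock */
        static void blk_unregister_queue_sketch(struct request_queue *q)
        {
                mutex_lock(&q->sysfs_lock);
                blk_queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
                mutex_unlock(&q->sysfs_lock);

                mutex_lock(&q->sysfs_dir_lock);
                /* blk_mq_unregister_dev() / elv_unregister_queue() run here */
                kobject_del(&q->kobj);
                mutex_unlock(&q->sysfs_dir_lock);
        }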
      
      [1]  lockdep warning
          ======================================================
          WARNING: possible circular locking dependency detected
          5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
          ------------------------------------------------------
          rmmod/777 is trying to acquire lock:
          00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
      
          but task is already holding lock:
          00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #1 (&q->sysfs_lock){+.+.}:
                 __lock_acquire+0x95f/0xa2f
                 lock_acquire+0x1b4/0x1e8
                 __mutex_lock+0x14a/0xa9b
                 blk_mq_hw_sysfs_show+0x63/0xb6
                 sysfs_kf_seq_show+0x11f/0x196
                 seq_read+0x2cd/0x5f2
                 vfs_read+0xc7/0x18c
                 ksys_read+0xc4/0x13e
                 do_syscall_64+0xa7/0x295
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          -> #0 (kn->count#202){++++}:
                 check_prev_add+0x5d2/0xc45
                 validate_chain+0xed3/0xf94
                 __lock_acquire+0x95f/0xa2f
                 lock_acquire+0x1b4/0x1e8
                 __kernfs_remove+0x237/0x40b
                 kernfs_remove_by_name_ns+0x59/0x72
                 remove_files+0x61/0x96
                 sysfs_remove_group+0x81/0xa4
                 sysfs_remove_groups+0x3b/0x44
                 kobject_del+0x44/0x94
                 blk_mq_unregister_dev+0x83/0xdd
                 blk_unregister_queue+0xa0/0x10b
                 del_gendisk+0x259/0x3fa
                 null_del_dev+0x8b/0x1c3 [null_blk]
                 null_exit+0x5c/0x95 [null_blk]
                 __se_sys_delete_module+0x204/0x337
                 do_syscall_64+0xa7/0x295
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          other info that might help us debug this:
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(&q->sysfs_lock);
                                         lock(kn->count#202);
                                         lock(&q->sysfs_lock);
            lock(kn->count#202);
      
           *** DEADLOCK ***
      
          2 locks held by rmmod/777:
           #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
           #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
      
          stack backtrace:
          CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
          Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
          Call Trace:
           dump_stack+0x9a/0xe6
           check_noncircular+0x207/0x251
           ? print_circular_bug+0x32a/0x32a
           ? find_usage_backwards+0x84/0xb0
           check_prev_add+0x5d2/0xc45
           validate_chain+0xed3/0xf94
           ? check_prev_add+0xc45/0xc45
           ? mark_lock+0x11b/0x804
           ? check_usage_forwards+0x1ca/0x1ca
           __lock_acquire+0x95f/0xa2f
           lock_acquire+0x1b4/0x1e8
           ? kernfs_remove_by_name_ns+0x59/0x72
           __kernfs_remove+0x237/0x40b
           ? kernfs_remove_by_name_ns+0x59/0x72
           ? kernfs_next_descendant_post+0x7d/0x7d
           ? strlen+0x10/0x23
           ? strcmp+0x22/0x44
           kernfs_remove_by_name_ns+0x59/0x72
           remove_files+0x61/0x96
           sysfs_remove_group+0x81/0xa4
           sysfs_remove_groups+0x3b/0x44
           kobject_del+0x44/0x94
           blk_mq_unregister_dev+0x83/0xdd
           blk_unregister_queue+0xa0/0x10b
           del_gendisk+0x259/0x3fa
           ? disk_events_poll_msecs_store+0x12b/0x12b
           ? check_flags+0x1ea/0x204
           ? mark_held_locks+0x1f/0x7a
           null_del_dev+0x8b/0x1c3 [null_blk]
           null_exit+0x5c/0x95 [null_blk]
           __se_sys_delete_module+0x204/0x337
           ? free_module+0x39f/0x39f
           ? blkcg_maybe_throttle_current+0x8a/0x718
           ? rwlock_bug+0x62/0x62
           ? __blkcg_punt_bio_submit+0xd0/0xd0
           ? trace_hardirqs_on_thunk+0x1a/0x20
           ? mark_held_locks+0x1f/0x7a
           ? do_syscall_64+0x4c/0x295
           do_syscall_64+0xa7/0x295
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
          RIP: 0033:0x7fb696cdbe6b
          Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
          RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
          RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
          RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
          RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
          R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
          R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Remove blk_mq_register_dev() · 9685b227
      Authored by Bart Van Assche
      This function has no callers. Hence remove it.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 04 May 2019, 2 commits
    • blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release · 1b97871b
      Authored by Ming Lei
      The hctx is always released after the request queue is freed.

      While the queue's kobject refcount is held, it is safe for the driver
      to run the queue, so a queue run might still be scheduled after
      blk_sync_queue() is done.

      So move the cancellation of hctx->run_work into
      blk_mq_hw_sysfs_release() to avoid running a released queue.
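
      Concretely, the cancellation moves into the hctx kobject's release
      handler, along these lines (a sketch of the idea rather than the
      literal diff):

        static void blk_mq_hw_sysfs_release(struct kobject *kobj)
        {
                struct blk_mq_hw_ctx *hctx = container_of(kobj,
                                struct blk_mq_hw_ctx, kobj);

                /* the last kobject reference is gone, so no further queue
                 * run can be scheduled against this hctx */
                cancel_delayed_work_sync(&hctx->run_work);
                /* ... existing release work ... */
        }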
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: free hw queue's resource in hctx's release handler · c7e2d94b
      Authored by Ming Lei
      Once blk_cleanup_queue() returns, tags shouldn't be used any more,
      because blk_mq_free_tag_set() may be called. Commit 45a9c9d9
      ("blk-mq: Fix a use-after-free") fixes exactly this issue.

      However, that commit introduces another issue. Before 45a9c9d9, we
      were allowed to run the queue during queue cleanup as long as the
      queue's kobject refcount was held. After that commit, the queue can't
      be run during cleanup; otherwise an oops is easily triggered, because
      some hctx fields are freed by blk_mq_free_queue() in
      blk_cleanup_queue().
      
      We have devised ways of addressing this kind of issue before, such as:
      
      	8dc765d4 ("SCSI: fix queue cleanup race before queue initialization is done")
      	c2856ae2 ("blk-mq: quiesce queue before freeing queue")
      
      But these still can't cover all cases; recently James reported
      another such issue:
      
      	https://marc.info/?l=linux-scsi&m=155389088124782&w=2
      
      This issue is quite hard to address in the previous way, given
      scsi_run_queue() may requeue I/O for other LUNs.

      Fix the above issue by freeing the hctx's resources in its release
      handler; this is safe because the tags aren't needed for freeing
      those hctx resources.

      This approach follows the typical design pattern for a kobject's
      release handler.
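
      In sketch form (the exact set of freed fields follows the patch and
      is illustrative here):

        static void blk_mq_hw_sysfs_release(struct kobject *kobj)
        {
                struct blk_mq_hw_ctx *hctx = container_of(kobj,
                                struct blk_mq_hw_ctx, kobj);

                /* runs only once the last reference is dropped, so nothing
                 * can still be dispatching on this hctx, and none of this
                 * freeing needs the tags */
                free_cpumask_var(hctx->cpumask);
                kfree(hctx->ctxs);
                kfree(hctx);
        }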
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reported-by: James Smart <james.smart@broadcom.com>
      Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
      Cc: stable@vger.kernel.org
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 01 May 2019, 1 commit
  4. 26 Apr 2019, 1 commit
  5. 17 Dec 2018, 1 commit
    • blk-mq: export hctx->type in debugfs instead of sysfs · 346fc108
      Authored by Ming Lei
      Currently we only export hctx->type via sysfs, and there is no such
      info in the hctx entry under debugfs. We often use only debugfs to
      diagnose queue-mapping issues, so add the support there.

      Queue mapping has become a bit more complicated now that multiple
      queue maps are supported; we may write a blktests case that verifies
      the queue mapping based on blk-mq-debugfs.

      Since it isn't necessary to export hctx->type twice, remove the
      export from sysfs.
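
      The debugfs side is a small attribute in the style of the other
      blk-mq-debugfs hctx entries (a sketch of such an attribute, following
      the blk-mq-debugfs show-callback convention; the exact output format
      follows the patch):

        static int hctx_type_show(void *data, struct seq_file *m)
        {
                struct blk_mq_hw_ctx *hctx = data;

                seq_printf(m, "%u\n", hctx->type);
                return 0;
        }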
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 05 Dec 2018, 1 commit
  7. 21 Nov 2018, 1 commit
  8. 16 Nov 2018, 1 commit
  9. 08 Nov 2018, 1 commit
  10. 25 May 2018, 1 commit
  11. 15 Jan 2018, 1 commit
    • block: properly protect the 'queue' kobj in blk_unregister_queue · 667257e8
      Authored by Mike Snitzer
      The original commit e9a823fb ("block: fix warning when I/O elevator
      is changed as request_queue is being removed") is pretty conflated:
      the resource being protected by q->sysfs_lock isn't the queue_flags,
      it is the 'queue' kobj.

      q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
      against blk_unregister_queue():
      1) By holding q->sysfs_lock first, __elevator_change() can complete
      before a racing blk_unregister_queue().
      2) Conversely, __elevator_change() tests for QUEUE_FLAG_REGISTERED
      because, in case elv_iosched_store() loses the race with
      blk_unregister_queue(), it needs a way to know the 'queue' kobj isn't
      there.
      
      Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
      held until after the 'queue' kobj is removed.
      
      To do so, blk_mq_unregister_dev() must not also take q->sysfs_lock.
      So rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().
      
      Also, blk_unregister_queue() should use q->queue_lock to protect against
      any concurrent writes to q->queue_flags -- even though chances are the
      queue is being cleaned up so no concurrent writes are likely.
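
      The resulting shape of blk_unregister_queue() (a simplified sketch of
      the post-patch flow; helpers and ordering approximate the 4.15-era
      code):

        void blk_unregister_queue(struct gendisk *disk)
        {
                struct request_queue *q = disk->queue;

                spin_lock_irq(q->queue_lock);   /* q->queue_flags writes */
                queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
                spin_unlock_irq(q->queue_lock);

                /* now held until the 'queue' kobj is gone */
                mutex_lock(&q->sysfs_lock);
                if (q->mq_ops)
                        blk_mq_unregister_dev(disk_to_dev(disk), q);
                /* ... elv_unregister_queue(), etc. ... */
                kobject_del(&q->kobj);
                mutex_unlock(&q->sysfs_lock);
        }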
      
      Fixes: e9a823fb ("block: fix warning when I/O elevator is changed as request_queue is being removed")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 04 May 2017, 2 commits
  13. 27 Apr 2017, 4 commits
  14. 09 Mar 2017, 4 commits
  15. 03 Feb 2017, 1 commit
  16. 27 Jan 2017, 5 commits
  17. 18 Jan 2017, 1 commit
  18. 11 Nov 2016, 1 commit
    • block: add scalable completion tracking of requests · cf43e6be
      Authored by Jens Axboe
      For legacy block, we simply track the completion stats in the request
      queue. For blk-mq, we track them on a per-sw-queue basis, which we
      can then sum up through the hardware queues and finally into a
      per-device state.
      
      The stats are tracked in, roughly, 0.1s interval windows.
      
      Add sysfs files to display the stats.
      
      The feature is off by default, to avoid any extra overhead. In-kernel
      users can turn it on by setting QUEUE_FLAG_STATS in the queue flags.
      We currently don't turn it on if someone just reads any of the stats
      files; that is something we could add as well.
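
      As a rough standalone model, each sw queue keeps a small per-window
      stat that is summed up the chain; the struct and field names below
      are illustrative, not the kernel's:

        #include <stdint.h>

        struct rq_stat_sketch {
                uint64_t nr;        /* samples in the current ~0.1s window */
                uint64_t total_ns;  /* summed completion time */
                uint64_t min_ns, max_ns;
        };

        /* fold one sw-queue window into a hw-queue/device total */
        static void rq_stat_sum(struct rq_stat_sketch *dst,
                                const struct rq_stat_sketch *src)
        {
                if (!src->nr)
                        return;
                if (!dst->nr || src->min_ns < dst->min_ns)
                        dst->min_ns = src->min_ns;
                if (src->max_ns > dst->max_ns)
                        dst->max_ns = src->max_ns;
                dst->total_ns += src->total_ns;
                dst->nr += src->nr;
        }
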
      Signed-off-by: Jens Axboe <axboe@fb.com>
  19. 21 Sep 2016, 1 commit
  20. 17 Sep 2016, 1 commit
    • blk-mq: account higher order dispatch · 703fd1c0
      Authored by Jens Axboe
      We currently account a '0' dispatch, and anything above that up to
      the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more than
      that, we don't account it at all.
      
      Change the last bucket to be inclusive of anything above the range we
      track, and have the sysfs file reflect that by including a '+' in the
      output:
      
      $ cat /sys/block/nvme0n1/mq/0/dispatched
              0	1006
              1	20229
              2	1
              4	0
              8	0
             16	0
             32+	0
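
      The clamping can be modeled in standalone C like this (a sketch
      mirroring the described behavior; BLK_MQ_MAX_DISPATCH_ORDER is 7,
      giving the seven buckets shown above):

        #include <stdio.h>

        #define BLK_MQ_MAX_DISPATCH_ORDER 7  /* buckets 0,1,2,4,8,16,32+ */

        /* integer log2 for n >= 1 (the kernel has a real ilog2()) */
        static unsigned int ilog2u(unsigned int n)
        {
                unsigned int r = 0;

                while (n >>= 1)
                        r++;
                return r;
        }

        /* map a dispatch batch size to its bucket; anything at or above
         * 32 now lands in the last, inclusive "32+" bucket */
        static unsigned int queued_to_index(unsigned int queued)
        {
                unsigned int idx;

                if (!queued)
                        return 0;
                idx = ilog2u(queued) + 1;
                if (idx > BLK_MQ_MAX_DISPATCH_ORDER - 1)
                        idx = BLK_MQ_MAX_DISPATCH_ORDER - 1;
                return idx;
        }

        int main(void)
        {
                unsigned int samples[] = { 0, 1, 2, 5, 16, 32, 1000 };
                unsigned int i;

                for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                        printf("%4u -> bucket %u\n", samples[i],
                               queued_to_index(samples[i]));
                return 0;
        }
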
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
  21. 14 Sep 2016, 2 commits
  22. 05 Aug 2016, 1 commit
  23. 20 Mar 2016, 1 commit
  24. 10 Feb 2016, 1 commit
    • blk-mq: dynamic h/w context count · 868f2f0b
      Authored by Keith Busch
      The hardware's provided queue count may change at runtime with resource
      provisioning. This patch allows a block driver to alter the number of
      h/w queues available when its resource count changes.
      
      The main part is a new blk-mq API to request a new number of h/w queues
      for a given live tag set. The new API freezes all queues using that set,
      then adjusts the allocated count prior to remapping these to CPUs.
      
      The bulk of the rest just shifts where h/w contexts and all their
      artifacts are allocated and freed.
      
      The number of max h/w contexts is capped to the number of possible
      CPUs, since there is no use for more than that. As such, all
      pre-allocated memory for pointers needs to account for the max
      possible rather than the initial number of queues.
      
      A side effect of this is that blk-mq will proceed successfully as
      long as it can allocate at least one h/w context. Previously it would
      fail request queue initialization if fewer than the requested number
      were allocated.
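
      From a driver's point of view, the new API is a single call; the
      'mydrv' names below are hypothetical driver code, not part of the
      patch:

        /* called when the device reports a changed queue count */
        static void mydrv_set_queue_count(struct mydrv_dev *dev, int nr)
        {
                /*
                 * Freezes every request queue using this tag set, adjusts
                 * the allocated hw context count, then remaps the contexts
                 * to CPUs.
                 */
                blk_mq_update_nr_hw_queues(&dev->tag_set, nr);
        }
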
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  25. 08 Nov 2015, 1 commit
    • block: add block polling support · 05229bee
      Authored by Jens Axboe
      Add basic support for polling for specific IO to complete. This uses
      the cookie that blk-mq passes back, which enables the block layer
      to pass this cookie to the driver to spin for a specific request.
      
      This will be combined with request latency tracking, so we can make
      qualified decisions about when to poll and when not to. For now, for
      benchmark purposes, we add a sysfs file that controls whether polling
      is enabled or not.
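
      On the consumer side, the cookie from submission is handed back to
      the new poll hook, roughly as below (a sketch assuming the 4.4-era
      signatures, where submit_bio() returned a blk_qc_t cookie; 'done' is
      a hypothetical completion flag):

        blk_qc_t qc = submit_bio(bio);

        /* spin on this specific I/O instead of sleeping for the irq */
        while (!READ_ONCE(done)) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (READ_ONCE(done))
                        break;
                if (!blk_poll(bdev_get_queue(bdev), qc))
                        io_schedule();  /* poll found nothing; yield */
        }
        __set_current_state(TASK_RUNNING);
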
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Keith Busch <keith.busch@intel.com>
  26. 22 Oct 2015, 1 commit
    • block: generic request_queue reference counting · 3ef28e83
      Authored by Dan Williams
      Allow pmem, and other synchronous/bio-based block drivers, to fall
      back on a per-cpu reference count managed by the core for tracking
      queue live/dead state.
      
      The existing per-cpu reference count for the blk_mq case is promoted
      to be used in all block I/O scenarios.  This involves initializing it by
      default, waiting for it to drop to zero at exit, and holding a live
      reference over the invocation of q->make_request_fn() in
      generic_make_request().  The blk_mq code continues to take its own
      reference per blk_mq request and retains the ability to freeze the
      queue, but the check that the queue is frozen is moved to
      generic_make_request().
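
      The pattern reduces to an enter/exit pair held across submission (a
      simplified sketch; the real helpers also handle waiting on a frozen
      queue):

        int blk_queue_enter(struct request_queue *q)  /* simplified */
        {
                /* fails once the ref has been killed, i.e. the queue is
                 * dying or frozen */
                if (!percpu_ref_tryget_live(&q->q_usage_counter))
                        return -ENODEV;
                return 0;
        }

        void blk_queue_exit(struct request_queue *q)
        {
                percpu_ref_put(&q->q_usage_counter);
        }

        /* generic_make_request() holds such a live reference across the
         * q->make_request_fn() invocation */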
      
      This fixes crash signatures like the following:
      
       BUG: unable to handle kernel paging request at ffff880140000000
       [..]
       Call Trace:
        [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
        [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
        [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
        [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
        [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
        [<ffffffff8141f966>] submit_bio+0x76/0x170
        [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
        [<ffffffff81286e62>] submit_bh+0x12/0x20
        [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
        [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
        [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
        [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
        [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
        [<ffffffff81303494>] ext4_put_super+0x64/0x360
        [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>