1. 16 Jun 2019, 3 commits
    • blkcg: perpcu_ref init/exit should be done from blkg_alloc/free() · ef069b97
      Committed by Tejun Heo
      blkg allocation is performed as a separate step from the rest of blkg
      creation so that GFP_KERNEL allocations can be used when creating
      blkgs from configuration file writes; otherwise, user actions could
      fail due to failures of opportunistic GFP_NOWAIT allocations.
      
      While making blkgs use percpu_ref, 7fcf2b03 ("blkcg: change blkg
      reference counting to use percpu_ref") incorrectly added an unconditional
      opportunistic percpu_ref_init() to blkg_create(), breaking this
      guarantee.
      
      This patch moves percpu_ref_init() to blkg_alloc() so that it uses the
      @gfp_mask that blkg_alloc() is called with.  Also, percpu_ref_exit()
      is moved to blkg_free() for consistency.
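
      A minimal sketch of the resulting init/exit pairing, not the exact patch:
      blkg_release() as the percpu_ref release callback and the remaining
      per-policy setup are assumptions based on the surrounding code.

      /* simplified sketch: percpu_ref life cycle follows blkg_alloc()/blkg_free() */
      static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
                                         gfp_t gfp_mask)
      {
              struct blkcg_gq *blkg;

              blkg = kzalloc_node(sizeof(*blkg), gfp_mask, q->node);
              if (!blkg)
                      return NULL;

              /* honour the caller's gfp_mask instead of an unconditional GFP_NOWAIT */
              if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask)) {
                      kfree(blkg);
                      return NULL;
              }

              /* remaining per-policy initialization elided */
              return blkg;
      }

      static void blkg_free(struct blkcg_gq *blkg)
      {
              if (!blkg)
                      return;

              percpu_ref_exit(&blkg->refcnt);   /* pairs with the init in blkg_alloc() */
              kfree(blkg);
      }
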
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 7fcf2b03 ("blkcg: change blkg reference counting to use percpu_ref")
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: update blkcg_print_stat() to handle larger outputs · f539da82
      Committed by Tejun Heo
      Depending on the number of devices, blkcg stats can go over the
      default seqfile buffer size.  seqfile normally retries with a larger
      buffer, but since the ->pd_stat() addition, blkcg_print_stat() doesn't
      tell seqfile that an overflow has happened, so the output gets
      truncated.  Fix it by calling seq_commit() with -1 on possible
      overflows.
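
      A hedged sketch of the seq_file overflow-signalling pattern involved;
      the variable names (sf, buf, off, dname) are illustrative, not the exact
      blkcg_print_stat() code:

      /* illustrative seq_file overflow handling; not the exact blkcg code */
      char *buf;
      size_t size, off = 0;

      size = seq_get_buf(sf, &buf);                   /* borrow seq_file's buffer */
      off += scnprintf(buf + off, size - off, "%s ", dname);
      /* per-policy ->pd_stat() output is appended into the same buffer */

      if (off >= size - 1)
              seq_commit(sf, -1);             /* overflow: seq_file retries with a bigger buffer */
      else
              seq_commit(sf, off);            /* commit what was actually written */
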
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 903d23f0 ("blk-cgroup: allow controllers to output their own stats")
      Cc: stable@vger.kernel.org # v4.19+
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iolatency: clear use_delay when io.latency is set to zero · 5de0073f
      Committed by Tejun Heo
      If use_delay was non-zero when the latency target of a cgroup was set
      to zero, it stays stuck until io.latency is enabled on the cgroup
      again.  This keeps readahead disabled for the cgroup, hurting
      performance.
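
      A hedged sketch of the idea inside the limit-setting path; the variable
      names (oldval, iolat) and the use of blkcg_clear_delay() are assumptions
      about the surrounding code, and the actual patch may differ in detail:

      /* sketch only: when the target goes from non-zero to zero, drop the delay state */
      if (oldval && !iolat->min_lat_nsec) {
              /*
               * io.latency was just disabled for this cgroup: clear use_delay
               * so readahead is no longer suppressed for it.
               */
              blkcg_clear_delay(blkg);
      }
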
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Fixes: d7067512 ("block: introduce blk-iolatency io controller")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 15 Jun 2019, 3 commits
  3. 13 Jun 2019, 3 commits
  4. 10 Jun 2019, 1 commit
  5. 07 Jun 2019, 2 commits
    • block, bfq: add weight symlink to the bfq.weight cgroup parameter · 19e9da9e
      Committed by Angelo Ruocco
      Many userspace tools and services use the proportional-share policy of
      the blkio/io cgroups controller. The CFQ I/O scheduler implemented
      this policy for the legacy block layer: when CFQ was in charge, the
      weight of a group was changed by modifying the group's 'weight'
      parameter. The BFQ I/O scheduler implements the same policy in blk-mq,
      but with BFQ the parameter to modify has a different name: bfq.weight
      (a forced choice as long as legacy block was present, because two
      different policies cannot share a common parameter in cgroups).
      
      Due to the CFQ legacy, most if not all userspace configurations still
      use the parameter 'weight' and, for the moment, do not seem likely to
      change. But when CFQ went away together with legacy block, that
      parameter ceased to exist.
      
      So, a simple workaround has been proposed [1] to make all
      configurations work: add a symlink, named weight, to bfq.weight. This
      commit adds such a symlink.
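
      A hedged sketch of how such a symlink could be declared, assuming the
      companion cgroup change that adds a CFTYPE_SYMLINKED flag and a
      link_name field to struct cftype; those two names, and the
      bfq_io_show_weight/bfq_io_set_weight handlers, are assumptions here,
      not verified against the final code:

      /* sketch: expose bfq.weight plus a "weight" symlink pointing at it */
      static struct cftype bfq_blkcg_files[] = {
              {
                      .name = "bfq.weight",
                      .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_SYMLINKED,
                      .link_name = "weight",
                      .seq_show = bfq_io_show_weight,
                      .write = bfq_io_set_weight,
              },
              { }     /* terminate */
      };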
      
      [1] https://lkml.org/lkml/2019/4/8/555

      Suggested-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: free sched's request pool in blk_cleanup_queue · c3e22192
      Committed by Ming Lei
      In theory, the IO scheduler belongs to the request queue, and the
      request pool of sched tags belongs to the request queue too.
      
      However, the current tags allocation interfaces are re-used for both
      driver tags and sched tags. Driver tags are definitely host-wide and
      don't belong to any request queue, and the same holds for their request
      pool. So we need the tagset instance when freeing the requests of sched
      tags.
      
      Meanwhile, blk_mq_free_tag_set() often follows blk_cleanup_queue() in
      the non-BLK_MQ_F_TAG_SHARED case, which requires that the request pool
      of sched tags be freed before blk_mq_free_tag_set() is called.
      
      Commit 47cdee29 ("block: move blk_exit_queue into __blk_release_queue")
      moved blk_exit_queue into __blk_release_queue to simplify the fast
      path in generic_make_request(), which then causes an oops when freeing
      the requests of sched tags in __blk_release_queue().
      
      Fix the above issue by moving the freeing of the sched-tag request pool
      into blk_cleanup_queue(); this is safe because the queue has been frozen
      and there are no in-queue requests at that time. Freeing the sched tags
      themselves has to stay in the queue's release handler because there
      might be un-completed dispatch activity that still refers to them.
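
      An abbreviated sketch of the resulting ordering in blk_cleanup_queue();
      blk_mq_sched_free_requests() is the freeing step described above, and
      the surrounding teardown is elided:

      /* abbreviated sketch of blk_cleanup_queue() after this change */
      blk_freeze_queue(q);                    /* queue frozen: no in-queue requests */
      blk_queue_flag_set(QUEUE_FLAG_DEAD, q);

      /*
       * Free the sched-tag request pool while the tag_set is still valid;
       * blk_mq_free_tag_set() may run right after blk_cleanup_queue() returns.
       */
      if (queue_is_mq(q))
              blk_mq_sched_free_requests(q);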
      
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Fixes: 47cdee29 ("block: move blk_exit_queue into __blk_release_queue")
      Tested-by: Yi Zhang <yi.zhang@redhat.com>
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 05 Jun 2019, 1 commit
  7. 01 Jun 2019, 9 commits
  8. 30 May 2019, 1 commit
  9. 29 May 2019, 2 commits
  10. 24 May 2019, 5 commits
    • blk-mq: fix hang caused by freeze/unfreeze sequence · 7996a8b5
      Committed by Bob Liu
      The following is a description of a hang in blk_mq_freeze_queue_wait().
      The hang happens on an attempt to freeze a queue while another task
      does a queue unfreeze.
      
      The root cause is an incorrect sequence of percpu_ref_resurrect() and
      percpu_ref_kill(); as a result, the two can be swapped:
      
       CPU#0                         CPU#1
       ----------------              -----------------
       q1 = blk_mq_init_queue(shared_tags)
      
                                      q2 = blk_mq_init_queue(shared_tags):
                                        blk_mq_add_queue_tag_set(shared_tags):
                                          blk_mq_update_tag_set_depth(shared_tags):
      				     list_for_each_entry()
                                            blk_mq_freeze_queue(q1)
                                             > percpu_ref_kill()
                                             > blk_mq_freeze_queue_wait()
      
       blk_cleanup_queue(q1)
        blk_mq_freeze_queue(q1)
         > percpu_ref_kill()
                       ^^^^^^ freeze_depth can't guarantee the order
      
                                            blk_mq_unfreeze_queue()
                                              > percpu_ref_resurrect()
      
         > blk_mq_freeze_queue_wait()
                       ^^^^^^ Hang here!!!!
      
      This wrong sequence raises a kernel warning:
      percpu_ref_kill_and_confirm called more than once on blk_queue_usage_counter_release!
      WARNING: CPU: 0 PID: 11854 at lib/percpu-refcount.c:336 percpu_ref_kill_and_confirm+0x99/0xb0
      
      But the most unpleasant effect is a hang of blk_mq_freeze_queue_wait(),
      which waits for q_usage_counter to reach zero; that never happens,
      because the percpu-ref was reinitialized (instead of being killed) and
      stays in PERCPU state forever.
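
      The fix serializes kill/resurrect on a per-queue mutex; a condensed
      sketch of that approach follows (field names such as mq_freeze_lock
      follow the patch description, but the code here is abbreviated rather
      than the exact diff):

      /* condensed sketch: kill/resurrect are serialized by q->mq_freeze_lock */
      void blk_freeze_queue_start(struct request_queue *q)
      {
              mutex_lock(&q->mq_freeze_lock);
              if (++q->mq_freeze_depth == 1) {
                      percpu_ref_kill(&q->q_usage_counter);
                      mutex_unlock(&q->mq_freeze_lock);
                      if (queue_is_mq(q))
                              blk_mq_run_hw_queues(q, false);
              } else {
                      mutex_unlock(&q->mq_freeze_lock);
              }
      }

      void blk_mq_unfreeze_queue(struct request_queue *q)
      {
              mutex_lock(&q->mq_freeze_lock);
              q->mq_freeze_depth--;
              WARN_ON_ONCE(q->mq_freeze_depth < 0);
              if (!q->mq_freeze_depth) {
                      percpu_ref_resurrect(&q->q_usage_counter);
                      wake_up_all(&q->mq_freeze_wq);
              }
              mutex_unlock(&q->mq_freeze_lock);
      }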
      
      How to reproduce:
       - "insmod null_blk.ko shared_tags=1 nr_devices=0 queue_mode=2"
       - cpu0: python Script.py 0; taskset the corresponding process running on cpu0
       - cpu1: python Script.py 1; taskset the corresponding process running on cpu1
      
       Script.py:
       ------
       #!/usr/bin/python3
      
      import os
      import sys
      
      while True:
          on = "echo 1 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
          off = "echo 0 > /sys/kernel/config/nullb/%s/power" % sys.argv[1]
          os.system(on)
          os.system(off)
      ------
      
      This bug was first reported and fixed by Roman, previous discussion:
      [1] Message id: 1443287365-4244-7-git-send-email-akinobu.mita@gmail.com
      [2] Message id: 1443563240-29306-6-git-send-email-tj@kernel.org
      [3] https://patchwork.kernel.org/patch/9268199/

      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove the bi_seg_{front,back}_size fields in struct bio · 6869875f
      Committed by Christoph Hellwig
      At this point these fields aren't used for anything, so we can remove
      them.
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: remove the segment size check in bio_will_gap · 200a9aff
      Committed by Christoph Hellwig
      We fundamentally do not have a maximum segment size for devices with a
      virt boundary.  So don't bother checking it, especially given that the
      existing checks didn't work properly to start with, as we never fully
      update the front/back segment size and miss the bi_seg_front_size update
      that would have been required for some cases.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: force an unlimited segment size on queues with a virt boundary · 09324d32
      Committed by Christoph Hellwig
      We currently fail to update the front/back segment size in the bio when
      deciding to allow an otherwise gappy segment on a device with a
      virt boundary.  The reason why this did not cause problems is that
      devices with a virt boundary fundamentally don't use segments as we
      know them and thus don't care.  Make that assumption formal by forcing
      an unlimited segment size in this case.
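
      Roughly, the change amounts to the following in blk_queue_virt_boundary()
      (a condensed sketch, not the exact diff):

      /* sketch of the effect on blk_queue_virt_boundary() */
      void blk_queue_virt_boundary(struct request_queue *q, unsigned long mask)
      {
              q->limits.virt_boundary_mask = mask;

              /*
               * Devices with a virt boundary don't use segments in the usual
               * sense, so make the "unlimited segment size" assumption explicit.
               */
              if (mask)
                      q->limits.max_segment_size = UINT_MAX;
      }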
      
      Fixes: f6970f83 ("block: don't check if adjacent bvecs in one bio can be mergeable")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't decrement nr_phys_segments for physically contigous segments · eded341c
      Committed by Christoph Hellwig
      Currently ll_merge_requests_fn, unlike all other merge functions,
      reduces nr_phys_segments by one if the last segment of the previous
      request and the first segment of the next request are contiguous.  While
      this seems like a nice solution to avoid building smaller-than-possible
      requests, it causes a mismatch between the segments actually present
      in the request and those iterated over by the bvec iterators, including
      __rq_for_each_bio.  This can, for example, mistrigger the single-segment
      optimization in the nvme-pci driver, and might lead to a mismatching
      nr_phys_segments count when the number of segments is recalculated for
      an inserted cloned request.
      
      We could possibly work around this by making the bvec iterators take
      the front and back segment size into account, but that would require
      moving them from the bio to the bio_iter and spreading this mess
      over all users of bvecs.  Or we could simply remove this optimization
      under the assumption that most users already build good enough bvecs,
      and that the bio merge path never cared about this optimization
      either.  The latter is what this patch does.
      
      dff824b2 ("nvme-pci: optimize mapping of small single segment requests").
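
      For illustration, the kind of special case being removed from
      ll_merge_requests_fn() looked roughly like this; a paraphrase under that
      assumption, not the exact removed hunk:

      /* paraphrased sketch of the removed special case */
      if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
              /* last segment of req and first segment of next are physically contiguous */
              total_phys_segments--;          /* no longer done after this patch */
      }
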
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 16 May 2019, 1 commit
  12. 04 May 2019, 7 commits
    • block: don't drain in-progress dispatch in blk_cleanup_queue() · 66215664
      Committed by Ming Lei
      Now that freeing the hw queue resource has been moved to hctx's release
      handler, we no longer need to worry about the race between
      blk_cleanup_queue and run queue.
      
      So don't drain in-progress dispatch in blk_cleanup_queue().
      
      This is basically a revert of c2856ae2 ("blk-mq: quiesce queue before
      freeing queue").
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release · 1b97871b
      Committed by Ming Lei
      hctx is always released after the request queue is freed.
      
      While the queue's kobject refcount is held, it is safe for the driver to
      run the queue, so one run queue might be scheduled after blk_sync_queue()
      is done.
      
      So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release()
      to avoid running a released queue.
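
      A condensed sketch of the release handler with the cancellation moved in;
      the rest of the teardown is abbreviated:

      static void blk_mq_hw_sysfs_release(struct kobject *kobj)
      {
              struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj);

              /* no run-queue work may fire once the hctx kobject is released */
              cancel_delayed_work_sync(&hctx->run_work);

              /* the remaining hw-queue teardown (see the related patches) follows here */
      }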
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: always free hctx after request queue is freed · 2f8f1336
      Committed by Ming Lei
      In the normal queue cleanup path, the hctx is released after the request
      queue is freed; see blk_mq_release().
      
      However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because
      the hw queues are shrinking. This can easily cause a use-after-free,
      because one implicit rule is that it is safe to call almost all block
      layer APIs while the request queue is alive; so one hctx may be retrieved
      by one API, then that hctx can be freed by blk_mq_update_nr_hw_queues(),
      and finally a use-after-free is triggered.
      
      Fix this issue by always freeing hctx after the request queue is
      released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
      introduce a per-queue list to hold them, then try to reuse these hctxs
      if the NUMA node matches.
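
      A hedged sketch of the reuse path; the list and field names
      (unused_hctx_list, unused_hctx_lock, hctx_list) follow the description
      above, and error handling is trimmed:

      /* sketch: try to reuse a previously shrunk hctx whose NUMA node matches */
      static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(struct blk_mq_tag_set *set,
                      struct request_queue *q, int hctx_idx, int node)
      {
              struct blk_mq_hw_ctx *hctx = NULL, *tmp;

              spin_lock(&q->unused_hctx_lock);
              list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
                      if (tmp->numa_node == node) {
                              hctx = tmp;
                              break;
                      }
              }
              if (hctx)
                      list_del_init(&hctx->hctx_list);
              spin_unlock(&q->unused_hctx_lock);

              if (!hctx)
                      hctx = blk_mq_alloc_hctx(q, set, node);
              if (!hctx)
                      return NULL;

              if (blk_mq_init_hctx(q, set, hctx, hctx_idx)) {
                      kobject_put(&hctx->kobj);
                      return NULL;
              }

              return hctx;
      }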
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: split blk_mq_alloc_and_init_hctx into two parts · 7c6c5b7c
      Committed by Ming Lei
      Split blk_mq_alloc_and_init_hctx into two parts: blk_mq_alloc_hctx(),
      which allocates all hctx resources, and blk_mq_init_hctx(), which
      initializes the hctx and serves as the counterpart of
      blk_mq_exit_hctx().
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: free hw queue's resource in hctx's release handler · c7e2d94b
      Committed by Ming Lei
      Once blk_cleanup_queue() returns, tags shouldn't be used any more,
      because blk_mq_free_tag_set() may be called. Commit 45a9c9d9
      ("blk-mq: Fix a use-after-free") fixed exactly this issue.
      
      However, that commit introduces another issue. Before 45a9c9d9,
      we were allowed to run the queue while cleaning it up, as long as the
      queue's kobj refcount was held. After that commit, the queue can't be
      run during queue cleanup, otherwise an oops can be triggered easily,
      because some fields of hctx are freed by blk_mq_free_queue() in
      blk_cleanup_queue().
      
      We have invented ways for addressing this kind of issue before, such as:
      
      	8dc765d4 ("SCSI: fix queue cleanup race before queue initialization is done")
      	c2856ae2 ("blk-mq: quiesce queue before freeing queue")
      
      But they still can't cover all cases; recently James reported another
      issue of this kind:
      
      	https://marc.info/?l=linux-scsi&m=155389088124782&w=2
      
      This issue can be quite hard to address with the previous approach,
      given that scsi_run_queue() may run the queues of other LUNs.
      
      Fix the above issue by freeing the hctx's resources in its release
      handler; this is safe because tags aren't needed for freeing those hctx
      resources.
      
      This approach follows the typical design pattern wrt. kobject release
      handlers.
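
      A hedged sketch of the hw-queue resources that can move into the hctx
      kobject release handler (none of them needs the tags; the exact list is
      abbreviated):

      /* sketch: tag-free hw-queue resources released with the hctx kobject */
      static void blk_mq_hw_sysfs_release(struct kobject *kobj)
      {
              struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj);

              blk_free_flush_queue(hctx->fq);         /* flush machinery */
              sbitmap_free(&hctx->ctx_map);           /* sw-queue bitmap */
              free_cpumask_var(hctx->cpumask);
              kfree(hctx->ctxs);
              kfree(hctx);
      }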
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reported-by: James Smart <james.smart@broadcom.com>
      Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
      Cc: stable@vger.kernel.org
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: move cancel of requeue_work into blk_mq_release · fbc2a15e
      Committed by Ming Lei
      While the queue's kobject refcount is held, it is safe for the driver
      to schedule a requeue. However, blk_mq_kick_requeue_list() may be
      called after blk_sync_queue() is done because of concurrent requeue
      activity, so the requeue work may not be completed when the queue is
      freed, and a kernel oops is triggered.
      
      So move the cancel of requeue_work into blk_mq_release() to avoid the
      race between requeue and freeing the queue.
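
      A condensed sketch of blk_mq_release() with the cancellation moved in;
      the existing hctx/ctx teardown is abbreviated:

      void blk_mq_release(struct request_queue *q)
      {
              /*
               * requeue_work may still be scheduled by late
               * blk_mq_kick_requeue_list() callers; stop it before the queue
               * memory goes away.
               */
              cancel_delayed_work_sync(&q->requeue_work);

              /* existing hctx/ctx teardown continues here */
      }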
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: grab .q_usage_counter when queuing request from plug code path · e87eb301
      Committed by Ming Lei
      Just like aio/io_uring, we need to grab two refcounts for queuing one
      request: one for submission and another for completion.
      
      If the request isn't queued from the plug code path, the refcount
      grabbed in generic_make_request() serves for submission. In theory,
      this refcount should have been released after the submission (async run
      queue) is done. blk_freeze_queue() works together with blk_sync_queue()
      to avoid the race between queue cleanup and IO submission, given that
      async run-queue activities are canceled; because hctx->run_work is
      scheduled with the refcount held, it is fine not to hold the refcount
      when running the run-queue work function to dispatch IO.
      
      However, if a request is staged onto the plug list and finally queued
      from the plug code path, the submission-side refcount is actually
      missing. We may then start to run a queue after that queue has been
      removed, because the queue's kobject refcount isn't guaranteed to be
      held in the plug-flushing context, and a kernel oops is triggered. See
      the following race:
      
      blk_mq_flush_plug_list():
              blk_mq_sched_insert_requests()
                      insert requests to sw queue or scheduler queue
                      blk_mq_run_hw_queue
      
      Because of a concurrent run queue, all requests inserted above may be
      completed before the above blk_mq_run_hw_queue is called. Then the
      queue can be freed during the above blk_mq_run_hw_queue().
      
      Fix the issue by grabbing .q_usage_counter before calling
      blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is
      safe because the queue is definitely alive before inserting the request.
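
      A hedged sketch of the idea in the plug-flush path; the local names
      (hctx, ctx, rq_list, from_schedule) are approximations and the exact
      argument list of blk_mq_sched_insert_requests() differs across versions:

      /* sketch: hold a submission-side reference across insert + run queue */
      percpu_ref_get(&q->q_usage_counter);

      /* insert the plugged requests and kick the hw queue */
      blk_mq_sched_insert_requests(hctx, ctx, &rq_list, from_schedule);

      percpu_ref_put(&q->q_usage_counter);    /* the queue stayed alive for the run */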
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 03 May 2019, 1 commit
  14. 01 May 2019, 1 commit