1. 01 Aug, 2020 1 commit
  2. 18 Jul, 2020 1 commit
    • blk-cgroup: show global disk stats in root cgroup io.stat · ef45fe47
      Committed by Boris Burkov
      In order to improve consistency and usability in cgroup stat accounting,
      we would like to support the root cgroup's io.stat.
      
      Since the root cgroup has processes doing io even if the system has no
      explicitly created cgroups, we need to be careful to avoid overhead in
      that case.  For that reason, the rstat algorithms don't handle the root
      cgroup, so just turning the file on wouldn't give correct statistics.
      
      To get around this, we simulate flushing the iostat struct by filling it
      out directly from global disk stats. The result is a root cgroup io.stat
      file consistent with both /proc/diskstats and io.stat.
      
      Note that in order to collect the disk stats, we needed to iterate over
      devices. To facilitate that, we had to change the linkage of a disk_type
      to external so that it can be used from blk-cgroup.c to iterate over
      disks.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
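
The idea of filling io.stat counters from global disk stats can be illustrated from user space. The following is a minimal sketch, not the kernel code from this commit: it converts one /proc/diskstats line into the six io.stat counters. The struct and function names are hypothetical, and the field order and 512-byte sector unit are assumptions taken from the kernel's iostats documentation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace mirror of the six io.stat counters. */
struct iostat_sketch {
    unsigned int major, minor;
    unsigned long long rbytes, wbytes, dbytes; /* bytes transferred */
    unsigned long long rios, wios, dios;       /* completed requests */
};

/*
 * Fill *out from one /proc/diskstats line. Sector counts in that file
 * are assumed to be in 512-byte units, hence the shift by 9. Discard
 * fields are optional because older kernels do not print them.
 * Returns 0 on success, -1 if the line is too short.
 */
static int diskstats_to_iostat(const char *line, struct iostat_sketch *out)
{
    char name[64];
    unsigned long long f[14] = {0};
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line,
               "%u %u %63s %llu %llu %llu %llu %llu %llu %llu"
               " %llu %llu %llu %llu %llu %llu %llu",
               &out->major, &out->minor, name,
               &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6],
               &f[7], &f[8], &f[9], &f[10], &f[11], &f[12], &f[13]);
    if (n < 10)                 /* need fields up to "sectors written" */
        return -1;

    out->rios = f[0];
    out->rbytes = f[2] << 9;
    out->wios = f[4];
    out->wbytes = f[6] << 9;
    out->dios = n >= 15 ? f[11] : 0;
    out->dbytes = n >= 17 ? f[13] << 9 : 0;
    return 0;
}

/* Format the counters the way an io.stat line is laid out. */
static int iostat_format(const struct iostat_sketch *s, char *buf, size_t len)
{
    assert(s != NULL && buf != NULL);
    return snprintf(buf, len,
                    "%u:%u rbytes=%llu wbytes=%llu rios=%llu wios=%llu"
                    " dbytes=%llu dios=%llu",
                    s->major, s->minor, s->rbytes, s->wbytes,
                    s->rios, s->wios, s->dbytes, s->dios);
}
```

A caller would read /proc/diskstats line by line, pass each line to diskstats_to_iostat(), and print the result of iostat_format() to compare it with the root cgroup's io.stat.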
  3. 09 Jul, 2020 1 commit
  4. 24 Jun, 2020 3 commits
    • block: revert back to synchronous request_queue removal · e8c7d14a
      Committed by Luis Chamberlain
      Commit dc9edc44 ("block: Fix a blk_exit_rl() regression") merged on
      v4.12 moved the work behind blk_release_queue() into a workqueue after a
      splat floated around which indicated some work on blk_release_queue()
      could sleep in blk_exit_rl(). This splat would be possible when a driver
      called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
      as its final call) from an atomic context.
      
      blk_put_queue() decrements the refcount for the request_queue kobject, and
      upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
      now removed through commit db6d9952 ("block: remove request_list code")
      on v5.0, we reserve the right to be able to sleep within
      blk_release_queue() context.
      
      The last reference for the request_queue must not be called from atomic
      context. *When* the last reference to the request_queue reaches 0 varies,
      and so let's take the opportunity to document when that is expected to
      happen and also document the context of the related calls as best as
      possible so we can avoid future issues, and with the hopes that the
      synchronous request_queue removal sticks.
      
      We revert back to synchronous request_queue removal because asynchronous
      removal creates a regression with expected userspace interaction with
      several drivers. An example is when removing the loopback driver, one
      uses ioctls from userspace to do so, but upon return and if successful,
      one expects the device to be removed. Likewise if one races to add another
      device the new one may not be added as it is still being removed. This was
      expected behavior before and it now fails as the device is still present
      and busy still. Moving to asynchronous request_queue removal could have
      broken many scripts which relied on the removal to have been completed if
      there was no error. Document this expectation as well so that this
      doesn't regress userspace again.
      
      Using asynchronous request_queue removal however has helped us find
      other bugs. In the future we can test what could break with this
      arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
      
      While at it, update the docs with the context expectations for the
      request_queue / gendisk refcount decrement, and make these
      expectations explicit by using might_sleep().
      
      Fixes: dc9edc44 ("block: Fix a blk_exit_rl() regression")
      Suggested-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: yu kuai <yukuai3@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: clarify context for refcount increment helpers · 763b5892
      Committed by Luis Chamberlain
      Let us clarify the context under which the helpers to increment the
      refcount for the gendisk and request_queue can be called. We make
      this explicit with might_sleep() in the places where we may sleep.
      
      We don't address the decrement context yet, as that needs some extra
      work and fixes, but will be addressed in the next patch.
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add docs for gendisk / request_queue refcount helpers · b5bd357c
      Committed by Luis Chamberlain
      This adds documentation for the gendisk / request_queue refcount
      helpers.
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 27 May, 2020 2 commits
  6. 19 May, 2020 2 commits
  7. 13 May, 2020 3 commits
  8. 10 May, 2020 1 commit
  9. 21 Apr, 2020 3 commits
  10. 27 Mar, 2020 1 commit
  11. 25 Mar, 2020 7 commits
  12. 24 Mar, 2020 3 commits
  13. 19 Mar, 2020 1 commit
  14. 12 Mar, 2020 1 commit
  15. 22 Nov, 2019 1 commit
    • block: add iostat counters for flush requests · b6866318
      Committed by Konstantin Khlebnikov
      Requests that trigger flushing of the volatile writeback cache to disk
      (barriers) have a significant effect on overall performance.
      
      The block layer has a sophisticated engine for combining several flush
      requests into one, but there are no statistics for the flushes actually
      executed by the disk. Requests which trigger flushes are usually
      barriers, i.e. zero-size writes.
      
      This patch adds two iostat counters into /sys/class/block/$dev/stat and
      /proc/diskstats - count of completed flush requests and their total time.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
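
To show how the two new counters might be consumed, here is a small user-space sketch that reads them from the contents of /sys/class/block/$dev/stat. It assumes the flush count and flush time are the 16th and 17th fields of that file and that the time is in milliseconds, as described in the kernel's block stat documentation; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the two flush counters. */
struct flush_stat_sketch {
    unsigned long long ios;       /* completed flush requests */
    unsigned long long ticks_ms;  /* total flush time, assumed ms */
};

/*
 * Parse one line in the format of /sys/class/block/$dev/stat and pick
 * out fields 16 and 17. Returns 0 on success, -1 if the line has fewer
 * than 17 numeric fields (e.g. a kernel without this patch).
 */
static int parse_flush_stat(const char *line, struct flush_stat_sketch *out)
{
    unsigned long long v[17];
    int off = 0;

    assert(line != NULL && out != NULL);
    for (int i = 0; i < 17; i++) {
        int used = 0;

        /* %n records how many characters this conversion consumed. */
        if (sscanf(line + off, "%llu%n", &v[i], &used) != 1)
            return -1;
        off += used;
    }
    out->ios = v[15];
    out->ticks_ms = v[16];
    return 0;
}

/* Mean time per flush in milliseconds; 0 when no flush completed. */
static double mean_flush_ms(const struct flush_stat_sketch *s)
{
    assert(s != NULL);
    return s->ios ? (double)s->ticks_ms / (double)s->ios : 0.0;
}
```

Sampling the file twice and subtracting the counters would give the flush rate and mean flush latency over the interval.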
  16. 06 Sep, 2019 1 commit
    • block: Delay default elevator initialization · 737eb78e
      Committed by Damien Le Moal
      When elevator_init_mq() is called from blk_mq_init_allocated_queue(),
      the only information known about the device is the number of hardware
      queues as the block device scan by the device driver is not completed
      yet for most drivers. The device type and elevator required features
      are not set yet, which prevents correct selection of the default
      elevator most suitable for the device.
      
      This currently affects all multi-queue zoned block devices which default
      to the "none" elevator instead of the required "mq-deadline" elevator.
      These drives currently include host-managed SMR disks connected to a
      smartpqi HBA and null_blk block devices with zoned mode enabled.
      Upcoming NVMe Zoned Namespace devices will also be affected.
      
      Fix this by adding the boolean elevator_init argument to
      blk_mq_init_allocated_queue() to control the execution of
      elevator_init_mq(). Two cases exist:
      1) elevator_init = false is used for calls to
         blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this
         case, a call to elevator_init_mq() is added to __device_add_disk(),
         resulting in the delayed initialization of the queue elevator
         after the device driver finished probing the device information. This
         effectively allows elevator_init_mq() access to more information
         about the device.
      2) elevator_init = true preserves the current behavior of initializing
         the elevator directly from blk_mq_init_allocated_queue(). This case
         is used for the special request based DM devices where the device
         gendisk is created before the queue initialization and device
         information (e.g. queue limits) is already known when the queue
         initialization is executed.
      
      Additionally, to make sure that the elevator initialization is never
      done while requests are in-flight (there should be none when the device
      driver calls device_add_disk()), freeze and quiesce the device request
      queue before calling blk_mq_init_sched() in elevator_init_mq().
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. 17 Jul, 2019 1 commit
  18. 15 Jun, 2019 1 commit
    • block: genhd: Use struct_size() helper · 78b90a2c
      Committed by Gustavo A. R. Silva
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes, in particular in the
      context in which this code is being used.
      
      So, replace the following form:
      
      sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0])
      
      with:
      
      struct_size(new_ptbl, part, target)
      
      Also, notice that variable size is unnecessary, hence it is removed.
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
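
For readers outside the kernel, the benefit of a struct_size()-style helper can be sketched in user-space C. The code below is an illustration rather than the kernel macro: flex_size_sketch() and the two structs are hypothetical, and the overflow checks rely on the GCC/Clang builtins __builtin_mul_overflow and __builtin_add_overflow. Like the kernel helper, it saturates to SIZE_MAX instead of silently wrapping.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct hd_sketch { int partno; };

/* Table header followed by a flexible array of partition pointers. */
struct ptbl_sketch {
    int len;
    struct hd_sketch *part[];
};

/* base + elem * count, saturating to SIZE_MAX on overflow. */
static size_t flex_size_sketch(size_t base, size_t elem, size_t count)
{
    size_t bytes;

    if (__builtin_mul_overflow(elem, count, &bytes) ||
        __builtin_add_overflow(bytes, base, &bytes))
        return SIZE_MAX;
    return bytes;
}

/* Allocate a zeroed table with room for `target` partition pointers. */
static struct ptbl_sketch *ptbl_alloc_sketch(size_t target)
{
    size_t sz = flex_size_sketch(sizeof(struct ptbl_sketch),
                                 sizeof(struct hd_sketch *), target);
    struct ptbl_sketch *p;

    if (sz == SIZE_MAX || target > INT_MAX)
        return NULL;            /* refuse sizes that cannot be honest */
    p = calloc(1, sz);
    if (p)
        p->len = (int)target;
    return p;
}

/* Bounds-checked store into the flexible array. */
static void ptbl_set_sketch(struct ptbl_sketch *p, int idx,
                            struct hd_sketch *part)
{
    assert(p != NULL && idx >= 0 && idx < p->len);
    p->part[idx] = part;
}
```

The open-coded form computes the same value for small counts; the difference only shows when the multiplication or addition would overflow, which is the kind of mistake the helper guards against.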
  19. 01 Jun, 2019 1 commit
  20. 01 May, 2019 1 commit
  21. 22 Apr, 2019 1 commit
    • block: fix use-after-free on gendisk · 6fcc44d1
      Committed by Yufen Yu
      commit 2da78092 "block: Fix dev_t minor allocation lifetime"
      specifically moved blk_free_devt(dev->devt) call to part_release()
      to avoid reallocating device number before the device is fully
      shutdown.
      
      However, it can cause use-after-free on gendisk in get_gendisk().
      We use an md device as an example to show the race scenario:
      
      Process1                Worker                  Process2
      md_free
                                                      blkdev_open
      del_gendisk
        add delete_partition_work_fn() to wq
                                                      __blkdev_get
                                                      get_gendisk
      put_disk
        disk_release
          kfree(disk)
                                                      find part from ext_devt_idr
                                                      get_disk_and_module(disk)
                                                      cause use after free
      
                              delete_partition_work_fn
                              put_device(part)
                              part_release
                              remove part from ext_devt_idr
      
      Before <devt, hd_struct pointer> is removed from ext_devt_idr by
      delete_partition_work_fn(), we can find the devt and then access
      gendisk by hd_struct pointer. But if we access the gendisk after
      it has been freed, it causes a use-after-free on gendisk in
      get_gendisk().
      
      We fix this by adding a new helper blk_invalidate_devt() in
      delete_partition() and del_gendisk(). It replaces hd_struct
      pointer in idr with value 'NULL', and deletes the entry from
      idr in part_release() as we do now.
      
      Thanks to Jan Kara for providing the solution and clearer comments
      for the code.
      
      Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
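
The two-step teardown described in the fix (first replace the pointer with NULL, later delete the entry) can be modeled with a small fixed-size table. This is a user-space sketch with hypothetical names, not the kernel's idr or blk_invalidate_devt(): it only shows that a lookup stops returning the stale pointer immediately, while the number itself stays reserved until the final release.

```c
#include <assert.h>
#include <stddef.h>

#define DEVT_SLOTS_SKETCH 8

struct hd_sketch { int partno; };

/* Toy stand-in for ext_devt_idr: id -> partition pointer. */
struct devt_table_sketch {
    struct hd_sketch *ptr[DEVT_SLOTS_SKETCH];
    unsigned char reserved[DEVT_SLOTS_SKETCH];
};

/* Reserve the lowest free id and publish the pointer; -1 if full. */
static int devt_alloc_sketch(struct devt_table_sketch *t, struct hd_sketch *p)
{
    for (int i = 0; i < DEVT_SLOTS_SKETCH; i++) {
        if (!t->reserved[i]) {
            t->reserved[i] = 1;
            t->ptr[i] = p;
            return i;
        }
    }
    return -1;
}

/* Step 1 (at delete time): hide the object but keep the id reserved. */
static void devt_invalidate_sketch(struct devt_table_sketch *t, int id)
{
    assert(id >= 0 && id < DEVT_SLOTS_SKETCH);
    t->ptr[id] = NULL;
}

/* Lookup returns NULL for unknown or invalidated ids. */
static struct hd_sketch *
devt_lookup_sketch(const struct devt_table_sketch *t, int id)
{
    if (id < 0 || id >= DEVT_SLOTS_SKETCH || !t->reserved[id])
        return NULL;
    return t->ptr[id];
}

/* Step 2 (at release time): give the id back for reuse. */
static void devt_free_sketch(struct devt_table_sketch *t, int id)
{
    assert(id >= 0 && id < DEVT_SLOTS_SKETCH);
    t->ptr[id] = NULL;
    t->reserved[id] = 0;
}
```

Keeping the id reserved between the two steps preserves the goal of the earlier commit (no device number reuse before full shutdown), while the NULL entry closes the window in which a racing open could reach a freed gendisk.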
  22. 16 Apr, 2019 1 commit
    • block: fix use-after-free on gendisk · 2c88e3c7
      Committed by Yufen Yu
      commit 2da78092 "block: Fix dev_t minor allocation lifetime"
      specifically moved blk_free_devt(dev->devt) call to part_release()
      to avoid reallocating device number before the device is fully
      shutdown.
      
      However, it can cause use-after-free on gendisk in get_gendisk().
      We use an md device as an example to show the race scenario:
      
      Process1                Worker                  Process2
      md_free
                                                      blkdev_open
      del_gendisk
        add delete_partition_work_fn() to wq
                                                      __blkdev_get
                                                      get_gendisk
      put_disk
        disk_release
          kfree(disk)
                                                      find part from ext_devt_idr
                                                      get_disk_and_module(disk)
                                                      cause use after free
      
                              delete_partition_work_fn
                              put_device(part)
                              part_release
                              remove part from ext_devt_idr
      
      Before <devt, hd_struct pointer> is removed from ext_devt_idr by
      delete_partition_work_fn(), we can find the devt and then access
      gendisk by hd_struct pointer. But if we access the gendisk after
      it has been freed, it causes a use-after-free on gendisk in
      get_gendisk().
      
      We fix this by adding a new helper blk_invalidate_devt() in
      delete_partition() and del_gendisk(). It replaces hd_struct
      pointer in idr with value 'NULL', and deletes the entry from
      idr in part_release() as we do now.
      
      Thanks to Jan Kara for providing the solution and clearer comments
      for the code.
      
      Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. 13 Apr, 2019 2 commits
    • block: check_events: don't bother with events if unsupported · cdf3e3de
      Committed by Martin Wilck
      Drivers now report to the block layer if they support media change
      events. If this is not the case, there's no need to allocate the event
      structure, and all event handling code can effectively be skipped. This
      simplifies code flow in particular for non-removable sd devices.
      
      This effectively reverts commit 75e3f3ee ("block: always allocate
      genhd->ev if check_events is implemented").
      
      The sysfs files for the events are kept in place even if no events are
      supported, as user space may rely on them being present. The only
      difference is that an error code is now returned if the user tries to
      set poll_msecs.
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Martin Wilck <mwilck@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: disk_events: introduce event flags · c92e2f04
      Committed by Martin Wilck
      Currently, an empty disk->events field tells the block layer not to
      forward media change events to user space. This was done in commit
      7c88a168 ("block: don't propagate unlisted DISK_EVENTs to userland")
      in order to avoid forwarding events from "fringe" drivers to user
      space. By doing so, the block layer lost the information which events
      were supported by a particular block device, and most importantly,
      whether or not a given device supports media change events at all.
      
      Prepare for not interpreting the "events" field this way in the future
      any more. This is done by adding an additional field "event_flags" to
      struct gendisk, and two flag bits that can be set to have the device
      treated like one that had the "events" field set to a non-zero value
      before. This applies only to the sd and sr drivers, which are changed to
      set the new flags.
      
      The new flags are DISK_EVENT_FLAG_POLL to enforce polling of the device
      for synchronous events, and DISK_EVENT_FLAG_UEVENT to tell the
      block layer to generate udev events from kernel events.
      
      In order to add the event_flags field to struct gendisk, the events
      field is converted to an "unsigned short"; it doesn't need to hold
      values bigger than 2 anyway.
      
      This patch doesn't change behavior.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Martin Wilck <mwilck@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
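
The separation between "which events a device supports" and "how they are handled" can be sketched with two small bit fields. The code below is illustrative only: the struct and helper names are hypothetical, the bit positions chosen for the two flags are assumptions, and the decision logic is a simplified reading of the description above rather than the block layer's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bit positions; the commit text only names the flags. */
#define DISK_EVENT_FLAG_POLL_SKETCH   (1u << 0)
#define DISK_EVENT_FLAG_UEVENT_SKETCH (1u << 1)

/* Two 16-bit fields, matching the "unsigned short" mentioned above. */
struct disk_events_sketch {
    unsigned short events;       /* mask of supported events */
    unsigned short event_flags;  /* how supported events are handled */
};

/* Poll only if the device supports some event and asks for polling. */
static int should_poll_sketch(const struct disk_events_sketch *d)
{
    assert(d != NULL);
    return d->events != 0 &&
           (d->event_flags & DISK_EVENT_FLAG_POLL_SKETCH) != 0;
}

/* Events to forward to user space: none unless UEVENT is set. */
static unsigned int uevent_mask_sketch(const struct disk_events_sketch *d,
                                       unsigned int pending)
{
    assert(d != NULL);
    if (!(d->event_flags & DISK_EVENT_FLAG_UEVENT_SKETCH))
        return 0;
    return pending & d->events;
}
```

With this split, a driver can advertise media change support in events without implying that the events are polled or forwarded, which is the information the earlier scheme lost.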