1. 01 7月, 2020 8 次提交
  2. 29 6月, 2020 1 次提交
  3. 24 6月, 2020 4 次提交
    • L
      block: create the request_queue debugfs_dir on registration · 85e0cbbb
      Luis Chamberlain 提交于
      We were only creating the request_queue debugfs_dir only
      for make_request block drivers (multiqueue), but never for
      request-based block drivers. We did this as we were only
      creating non-blktrace additional debugfs files on that directory
      for make_request drivers. However, since blktrace *always* creates
      that directory anyway, we special-case the use of that directory
      on blktrace. Other than this being an eye-sore, this exposes
      request-based block drivers to the same debugfs fragile
      race that used to exist with make_request block drivers
      where if we start adding files onto that directory we can later
      run a race with a double removal of dentries on the directory
      if we don't deal with this carefully on blktrace.
      
      Instead, just simplify things by always creating the request_queue
      debugfs_dir on request_queue registration. Rename the mutex also to
      reflect the fact that this is used outside of the blktrace context.
      Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      85e0cbbb
    • L
      block: revert back to synchronous request_queue removal · e8c7d14a
      Luis Chamberlain 提交于
      Commit dc9edc44 ("block: Fix a blk_exit_rl() regression") merged on
      v4.12 moved the work behind blk_release_queue() into a workqueue after a
      splat floated around which indicated some work on blk_release_queue()
      could sleep in blk_exit_rl(). This splat would be possible when a driver
      called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
      as its final call) from an atomic context.
      
      blk_put_queue() decrements the refcount for the request_queue kobject, and
      upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
      now removed through commit db6d9952 ("block: remove request_list code")
      on v5.0, we reserve the right to be able to sleep within
      blk_release_queue() context.
      
      The last reference for the request_queue must not be called from atomic
      context. *When* the last reference to the request_queue reaches 0 varies,
      and so let's take the opportunity to document when that is expected to
      happen and also document the context of the related calls as best as
      possible so we can avoid future issues, and with the hopes that the
      synchronous request_queue removal sticks.
      
      We revert back to synchronous request_queue removal because asynchronous
      removal creates a regression with expected userspace interaction with
      several drivers. An example is when removing the loopback driver, one
      uses ioctls from userspace to do so, but upon return and if successful,
      one expects the device to be removed. Likewise if one races to add another
      device the new one may not be added as it is still being removed. This was
      expected behavior before and it now fails as the device is still present
      and busy still. Moving to asynchronous request_queue removal could have
      broken many scripts which relied on the removal to have been completed if
      there was no error. Document this expectation as well so that this
      doesn't regress userspace again.
      
      Using asynchronous request_queue removal however has helped us find
      other bugs. In the future we can test what could break with this
      arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
      
      While at it, update the docs with the context expectations for the
      request_queue / gendisk refcount decrement, and make these
      expectations explicit by using might_sleep().
      
      Fixes: dc9edc44 ("block: Fix a blk_exit_rl() regression")
      Suggested-by: NNicolai Stange <nstange@suse.de>
      Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: yu kuai <yukuai3@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e8c7d14a
    • L
      block: clarify context for refcount increment helpers · 763b5892
      Luis Chamberlain 提交于
      Let us clarify the context under which the helpers to increment the
      refcount for the gendisk and request_queue can be called under. We
      make this explicit on the places where we may sleep with might_sleep().
      
      We don't address the decrement context yet, as that needs some extra
      work and fixes, but will be addressed in the next patch.
      Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      763b5892
    • L
      block: add docs for gendisk / request_queue refcount helpers · b5bd357c
      Luis Chamberlain 提交于
      This adds documentation for the gendisk / request_queue refcount
      helpers.
      Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b5bd357c
  4. 03 6月, 2020 1 次提交
  5. 29 5月, 2020 1 次提交
  6. 27 5月, 2020 4 次提交
  7. 19 5月, 2020 3 次提交
  8. 14 5月, 2020 2 次提交
    • S
      block: Inline encryption support for blk-mq · a892c8d5
      Satya Tangirala 提交于
      We must have some way of letting a storage device driver know what
      encryption context it should use for en/decrypting a request. However,
      it's the upper layers (like the filesystem/fscrypt) that know about and
      manages encryption contexts. As such, when the upper layer submits a bio
      to the block layer, and this bio eventually reaches a device driver with
      support for inline encryption, the device driver will need to have been
      told the encryption context for that bio.
      
      We want to communicate the encryption context from the upper layer to the
      storage device along with the bio, when the bio is submitted to the block
      layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
      represent an encryption context (note that we can't use the bi_private
      field in struct bio to do this because that field does not function to pass
      information across layers in the storage stack). We also introduce various
      functions to manipulate the bio_crypt_ctx and make the bio/request merging
      logic aware of the bio_crypt_ctx.
      
      We also make changes to blk-mq to make it handle bios with encryption
      contexts. blk-mq can merge many bios into the same request. These bios need
      to have contiguous data unit numbers (the necessary changes to blk-merge
      are also made to ensure this) - as such, it suffices to keep the data unit
      number of just the first bio, since that's all a storage driver needs to
      infer the data unit number to use for each data block in each bio in a
      request. blk-mq keeps track of the encryption context to be used for all
      the bios in a request with the request's rq_crypt_ctx. When the first bio
      is added to an empty request, blk-mq will program the encryption context
      of that bio into the request_queue's keyslot manager, and store the
      returned keyslot in the request's rq_crypt_ctx. All the functions to
      operate on encryption contexts are in blk-crypto.c.
      
      Upper layers only need to call bio_crypt_set_ctx with the encryption key,
      algorithm and data_unit_num; they don't have to worry about getting a
      keyslot for each encryption context, as blk-mq/blk-crypto handles that.
      Blk-crypto also makes it possible for request-based layered devices like
      dm-rq to make use of inline encryption hardware by cloning the
      rq_crypt_ctx and programming a keyslot in the new request_queue when
      necessary.
      
      Note that any user of the block layer can submit bios with an
      encryption context, such as filesystems, device-mapper targets, etc.
      Signed-off-by: NSatya Tangirala <satyat@google.com>
      Reviewed-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a892c8d5
    • M
      block: move blk_io_schedule() out of header file · 71ac860a
      Ming Lei 提交于
      blk_io_schedule() isn't called from performance sensitive code path, and
      it is easier to maintain by exporting it as symbol.
      
      Also blk_io_schedule() is only called by CONFIG_BLOCK code, so it is safe
      to do this way. Meantime fixes build failure when CONFIG_BLOCK is off.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Fixes: e6249cdd ("block: add blk_io_schedule() for avoiding task hung in sync dio")
      Reported-by: NSatya Tangirala <satyat@google.com>
      Tested-by: NSatya Tangirala <satyat@google.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      71ac860a
  9. 13 5月, 2020 2 次提交
    • K
      block: Introduce REQ_OP_ZONE_APPEND · 0512a75b
      Keith Busch 提交于
      Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
      block device. This is a no-merge write operation.
      
      A zone append write BIO must:
      * Target a zoned block device
      * Have a sector position indicating the start sector of the target zone
      * The target zone must be a sequential write zone
      * The BIO must not cross a zone boundary
      * The BIO size must not be split to ensure that a single range of LBAs
        is written with a single command.
      
      Implement these checks in generic_make_request_checks() using the
      helper function blk_check_zone_append(). To avoid write append BIO
      splitting, introduce the new max_zone_append_sectors queue limit
      attribute and ensure that a BIO size is always lower than this limit.
      Export this new limit through sysfs and check these limits in bio_full().
      
      Also when a LLDD can't dispatch a request to a specific zone, it
      will return BLK_STS_ZONE_RESOURCE indicating this request needs to
      be delayed, e.g.  because the zone it will be dispatched to is still
      write-locked. If this happens set the request aside in a local list
      to continue trying dispatching requests such as READ requests or a
      WRITE/ZONE_APPEND requests targetting other zones. This way we can
      still keep a high queue depth without starving other requests even if
      one request can't be served due to zone write-locking.
      
      Finally, make sure that the bio sector position indicates the actual
      write position as indicated by the device on completion.
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      [ jth: added zone-append specific add_page and merge_page helpers ]
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0512a75b
    • M
      block: fix use-after-free on cached last_lookup partition · b7d6c303
      Ming Lei 提交于
      delete_partition() clears the cached last_lookup partition. However the
      .last_lookup cache may be overwritten by one IO path after it is cleared
      from delete_partition(). Then another IO path may use the cached deleting
      partition after hd_struct_free() is called, then use-after-free is triggered
      on the cached partition.
      
      Fixes the issue by the following approach:
      
      1) always get the partition's refcount via hd_struct_try_get() before
      setting .last_lookup
      
      2) move clearing .last_lookup from delete_partition() to hd_struct_free()
      which is the release handle of the partition's percpu-refcount, so that no
      IO path can cache deleteing partition via .last_lookup.
      
      It is one candidate approach of Yufen's patch[1] which adds overhead
      in fast path by indirect lookup which may introduce one extra cacheline
      in IO path. Also this patch relies on percpu-refcount's protection, and
      it is easier to understand and verify.
      
      [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#tReported-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b7d6c303
  10. 10 5月, 2020 2 次提交
  11. 29 4月, 2020 3 次提交
  12. 25 4月, 2020 2 次提交
  13. 23 4月, 2020 1 次提交
  14. 30 3月, 2020 1 次提交
  15. 28 3月, 2020 1 次提交
    • C
      block: simplify queue allocation · 3d745ea5
      Christoph Hellwig 提交于
      Current make_request based drivers use either blk_alloc_queue_node or
      blk_alloc_queue to allocate a queue, and then set up the make_request_fn
      function pointer and a few parameters using the blk_queue_make_request
      helper.  Simplify this by passing the make_request pointer to
      blk_alloc_queue, and while at it merge the _node variant into the main
      helper by always passing a node_id, and remove the superfluous gfp_mask
      parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3d745ea5
  16. 25 3月, 2020 2 次提交
    • K
      block/diskstats: replace time_in_queue with sum of request times · 8cd5b8fc
      Konstantin Khlebnikov 提交于
      Column "time_in_queue" in diskstats is supposed to show total waiting time
      of all requests. I.e. value should be equal to the sum of times from other
      columns. But this is not true, because column "time_in_queue" is counted
      separately in jiffies rather than in nanoseconds as other times.
      
      This patch removes redundant counter for "time_in_queue" and shows total
      time of read, write, discard and flush requests.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8cd5b8fc
    • K
      block/diskstats: more accurate approximation of io_ticks for slow disks · 2b8bd423
      Konstantin Khlebnikov 提交于
      Currently io_ticks is approximated by adding one at each start and end of
      requests if jiffies counter has changed. This works perfectly for requests
      shorter than a jiffy or if one of requests starts/ends at each jiffy.
      
      If disk executes just one request at a time and they are longer than two
      jiffies then only first and last jiffies will be accounted.
      
      Fix is simple: at the end of request add up into io_ticks jiffies passed
      since last update rather than just one jiffy.
      
      Example: common HDD executes random read 4k requests around 12ms.
      
      fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
      iostat -x 10 sdb
      
      Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
      
      Before:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
      
      After:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
      
      Now io_ticks does not loose time between start and end of requests, but
      for queue-depth > 1 some I/O time between adjacent starts might be lost.
      
      For load estimation "%util" is not as useful as average queue length,
      but it clearly shows how often disk queue is completely empty.
      
      Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting")
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b8bd423
  17. 12 3月, 2020 2 次提交