1. 29 June 2020, 1 commit
  2. 24 June 2020, 4 commits
    • block: create the request_queue debugfs_dir on registration · 85e0cbbb
      Committed by Luis Chamberlain
      We were creating the request_queue debugfs_dir only for
      make_request block drivers (multiqueue), but never for
      request-based block drivers. We did this because we were only
      creating additional non-blktrace debugfs files in that directory
      for make_request drivers. However, since blktrace *always* creates
      that directory anyway, we special-cased the use of that directory
      for blktrace. Besides being an eyesore, this exposes
      request-based block drivers to the same fragile debugfs race
      that used to exist with make_request block drivers: if we start
      adding files to that directory, we can later run into a double
      removal of dentries in the directory if we don't deal with this
      carefully in blktrace.
      
      Instead, just simplify things by always creating the request_queue
      debugfs_dir on request_queue registration. Rename the mutex also to
      reflect the fact that this is used outside of the blktrace context.
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
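      As a rough sketch of the idea (names follow the kernel's debugfs
      API; the exact placement inside blk_register_queue() is
      illustrative, not a quote of the patch):

          /* Always create the per-queue debugfs directory at
           * registration time, regardless of driver type. */
          mutex_lock(&q->debugfs_mutex);
          q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
                                              blk_debugfs_root);
          mutex_unlock(&q->debugfs_mutex);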
    • block: revert back to synchronous request_queue removal · e8c7d14a
      Committed by Luis Chamberlain
      Commit dc9edc44 ("block: Fix a blk_exit_rl() regression"), merged in
      v4.12, moved the work behind blk_release_queue() into a workqueue after
      a splat floated around indicating that some work in blk_release_queue()
      could sleep in blk_exit_rl(). This splat was possible when a driver
      called blk_put_queue() or blk_cleanup_queue() (which calls
      blk_put_queue() as its final call) from an atomic context.
      
      blk_put_queue() decrements the refcount of the request_queue kobject,
      and upon reaching 0 blk_release_queue() is called. Although
      blk_exit_rl() was removed by commit db6d9952 ("block: remove
      request_list code") in v5.0, we reserve the right to be able to sleep
      within the blk_release_queue() context.
      
      The last reference to the request_queue must not be dropped from
      atomic context. *When* the last reference to the request_queue reaches
      0 varies, so let's take the opportunity to document when that is
      expected to happen, and also document the context of the related calls
      as best as possible, so we can avoid future issues and with the hope
      that the synchronous request_queue removal sticks.
      
      We revert back to synchronous request_queue removal because
      asynchronous removal creates a regression in the userspace interaction
      expected by several drivers. An example is removing the loopback
      driver: one uses ioctls from userspace to do so, and upon a successful
      return one expects the device to be gone. Likewise, if one races to
      add another device, the new one may not be added while the old one is
      still being removed. This was expected behavior before, and it now
      fails because the device is still present and busy. Moving to
      asynchronous request_queue removal could have broken many scripts that
      relied on the removal having completed if there was no error. Document
      this expectation as well so that this doesn't regress userspace again.
      
      Using asynchronous request_queue removal however has helped us find
      other bugs. In the future we can test what could break with this
      arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
      
      While at it, update the docs with the context expectations for the
      request_queue / gendisk refcount decrement, and make these
      expectations explicit by using might_sleep().
      
      Fixes: dc9edc44 ("block: Fix a blk_exit_rl() regression")
      Suggested-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: yu kuai <yukuai3@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
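      A minimal sketch of what making the expectation explicit looks like
      (the placement inside blk_release_queue() is illustrative):

          static void blk_release_queue(struct kobject *kobj)
          {
                  struct request_queue *q =
                          container_of(kobj, struct request_queue, kobj);

                  /* Dropping the last reference tears the queue down
                   * synchronously, so the caller must be able to sleep. */
                  might_sleep();

                  /* ... synchronous teardown, no deferral to a workqueue ... */
          }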
    • block: clarify context for refcount increment helpers · 763b5892
      Committed by Luis Chamberlain
      Let us clarify the context under which the helpers that increment the
      refcount for the gendisk and request_queue can be called. We make this
      explicit with might_sleep() at the places where we may sleep.
      
      We don't address the decrement context yet, as that needs some extra
      work and fixes, but will be addressed in the next patch.
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
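      As a purely hypothetical illustration of the pattern (my_get_queue is
      made up; the patch applies this to the real gendisk / request_queue
      helpers):

          /* Hypothetical helper: assert up front that the caller may
           * sleep, so misuse from atomic context warns immediately
           * under CONFIG_DEBUG_ATOMIC_SLEEP. */
          static void my_get_queue(struct request_queue *q)
          {
                  might_sleep();
                  kobject_get(&q->kobj);
          }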
    • block: add docs for gendisk / request_queue refcount helpers · b5bd357c
      Committed by Luis Chamberlain
      This adds documentation for the gendisk / request_queue refcount
      helpers.
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 03 June 2020, 1 commit
  4. 29 May 2020, 1 commit
  5. 27 May 2020, 4 commits
  6. 19 May 2020, 3 commits
  7. 14 May 2020, 2 commits
    • block: Inline encryption support for blk-mq · a892c8d5
      Committed by Satya Tangirala
      We must have some way of letting a storage device driver know what
      encryption context it should use for en/decrypting a request. However,
      it's the upper layers (like the filesystem/fscrypt) that know about and
      manage encryption contexts. As such, when an upper layer submits a bio
      to the block layer, and this bio eventually reaches a device driver with
      support for inline encryption, the device driver will need to have been
      told the encryption context for that bio.
      
      We want to communicate the encryption context from the upper layer to the
      storage device along with the bio, when the bio is submitted to the block
      layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
      represent an encryption context (note that we can't use the bi_private
      field in struct bio to do this because that field does not function to pass
      information across layers in the storage stack). We also introduce various
      functions to manipulate the bio_crypt_ctx and make the bio/request merging
      logic aware of the bio_crypt_ctx.
      
      We also make changes to blk-mq to make it handle bios with encryption
      contexts. blk-mq can merge many bios into the same request. These bios need
      to have contiguous data unit numbers (the necessary changes to blk-merge
      are also made to ensure this) - as such, it suffices to keep the data unit
      number of just the first bio, since that's all a storage driver needs to
      infer the data unit number to use for each data block in each bio in a
      request. blk-mq keeps track of the encryption context to be used for all
      the bios in a request with the request's rq_crypt_ctx. When the first bio
      is added to an empty request, blk-mq will program the encryption context
      of that bio into the request_queue's keyslot manager, and store the
      returned keyslot in the request's rq_crypt_ctx. All the functions to
      operate on encryption contexts are in blk-crypto.c.
      
      Upper layers only need to call bio_crypt_set_ctx with the encryption key,
      algorithm and data_unit_num; they don't have to worry about getting a
      keyslot for each encryption context, as blk-mq/blk-crypto handles that.
      Blk-crypto also makes it possible for request-based layered devices like
      dm-rq to make use of inline encryption hardware by cloning the
      rq_crypt_ctx and programming a keyslot in the new request_queue when
      necessary.
      
      Note that any user of the block layer can submit bios with an
      encryption context, such as filesystems, device-mapper targets, etc.
      Signed-off-by: Satya Tangirala <satyat@google.com>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
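      A hypothetical upper-layer usage sketch; blk_key, first_dun, bio, and
      the surrounding setup are placeholders, and error handling is omitted:

          u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { first_dun };

          /* Attach the encryption context; blk-mq/blk-crypto take care
           * of keyslot programming when the bio reaches the hardware. */
          bio_crypt_set_ctx(bio, &blk_key, dun, GFP_NOIO);
          submit_bio(bio);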
    • block: move blk_io_schedule() out of header file · 71ac860a
      Committed by Ming Lei
      blk_io_schedule() isn't called from a performance-sensitive code path,
      and it is easier to maintain by exporting it as a symbol.

      Also, blk_io_schedule() is only called by CONFIG_BLOCK code, so it is
      safe to do it this way. Meanwhile, this fixes a build failure when
      CONFIG_BLOCK is off.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Fixes: e6249cdd ("block: add blk_io_schedule() for avoiding task hung in sync dio")
      Reported-by: Satya Tangirala <satyat@google.com>
      Tested-by: Satya Tangirala <satyat@google.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
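      The out-of-line shape this describes, roughly (a sketch; the
      hang-check halving mirrors the helper's stated intent):

          void blk_io_schedule(void)
          {
                  /* Prevent the hang_check timer from firing during a
                   * very long sync I/O wait. */
                  unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;

                  if (timeout)
                          io_schedule_timeout(timeout);
                  else
                          io_schedule();
          }
          EXPORT_SYMBOL_GPL(blk_io_schedule);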
  8. 13 May 2020, 2 commits
    • block: Introduce REQ_OP_ZONE_APPEND · 0512a75b
      Committed by Keith Busch
      Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
      block device. This is a no-merge write operation.
      
      A zone append write BIO must:
      * Target a zoned block device
      * Have a sector position indicating the start sector of the target zone
      * Target a sequential write zone
      * Not cross a zone boundary
      * Not be split, to ensure that a single range of LBAs is written with
        a single command
      
      Implement these checks in generic_make_request_checks() using the
      helper function blk_check_zone_append(). To avoid write append BIO
      splitting, introduce the new max_zone_append_sectors queue limit
      attribute and ensure that a BIO size is always lower than this limit.
      Export this new limit through sysfs and check these limits in bio_full().
      
      Also, when an LLDD can't dispatch a request to a specific zone, it
      will return BLK_STS_ZONE_RESOURCE, indicating that this request needs
      to be delayed, e.g. because the zone it would be dispatched to is
      still write-locked. If this happens, set the request aside on a local
      list and continue trying to dispatch other requests, such as READ
      requests or WRITE/ZONE_APPEND requests targeting other zones. This way
      we can still keep a high queue depth without starving other requests,
      even if one request can't be served due to zone write-locking.
      
      Finally, make sure that the bio sector position indicates the actual
      write position as indicated by the device on completion.
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      [ jth: added zone-append specific add_page and merge_page helpers ]
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
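      A sketch of how a submitter might use the new operation (bdev, page,
      len, and zone_start_sector are placeholders; error handling omitted).
      The key property is that bi_sector is set to the zone start on
      submission and reports the actual write position on completion:

          struct bio *bio = bio_alloc(GFP_KERNEL, 1);

          bio_set_dev(bio, bdev);
          bio->bi_iter.bi_sector = zone_start_sector;  /* zone start */
          bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC;
          bio_add_page(bio, page, len, 0);

          submit_bio_wait(bio);
          /* bio->bi_iter.bi_sector now holds where the data landed */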
    • block: fix use-after-free on cached last_lookup partition · b7d6c303
      Committed by Ming Lei
      delete_partition() clears the cached last_lookup partition. However,
      the .last_lookup cache may be overwritten by one IO path after
      delete_partition() clears it. Another IO path may then use the cached,
      already-deleted partition after hd_struct_free() is called, triggering
      a use-after-free on the cached partition.
      
      Fix the issue with the following approach:
      
      1) always get the partition's refcount via hd_struct_try_get() before
      setting .last_lookup
      
      2) move clearing .last_lookup from delete_partition() to
      hd_struct_free(), which is the release handler of the partition's
      percpu-refcount, so that no IO path can cache a deleted partition via
      .last_lookup.
      
      This is an alternative to Yufen's patch [1], which adds overhead in
      the fast path via an indirect lookup that may introduce one extra
      cacheline in the IO path. This patch instead relies on
      percpu-refcount's protection, and it is easier to understand and
      verify.
      
      [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#t
      Reported-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
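      A sketch of the lookup-side half of the fix, modeled on the
      sector-to-partition lookup (illustrative, not a quote of the patch):

          part = rcu_dereference(ptbl->part[i]);
          if (part && sector_in_part(part, sector)) {
                  /* Only cache the partition after taking a reference,
                   * so a concurrent delete cannot free it from under
                   * the .last_lookup cache. */
                  if (!hd_struct_try_get(part))
                          break;
                  rcu_assign_pointer(ptbl->last_lookup, part);
                  return part;
          }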
  9. 10 May 2020, 2 commits
  10. 29 April 2020, 3 commits
  11. 25 April 2020, 2 commits
  12. 23 April 2020, 1 commit
  13. 30 March 2020, 1 commit
  14. 28 March 2020, 1 commit
    • block: simplify queue allocation · 3d745ea5
      Committed by Christoph Hellwig
      Current make_request based drivers use either blk_alloc_queue_node or
      blk_alloc_queue to allocate a queue, and then set up the make_request_fn
      function pointer and a few parameters using the blk_queue_make_request
      helper.  Simplify this by passing the make_request pointer to
      blk_alloc_queue, and while at it merge the _node variant into the main
      helper by always passing a node_id, and remove the superfluous gfp_mask
      parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
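      Roughly, for a make_request driver the change looks like this
      (my_make_request is a placeholder; names follow the helpers mentioned
      above):

          /* before: allocate, then wire up the make_request_fn */
          q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
          blk_queue_make_request(q, my_make_request);

          /* after: one call, make_request_fn passed directly */
          q = blk_alloc_queue(my_make_request, NUMA_NO_NODE);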
  15. 25 March 2020, 2 commits
    • block/diskstats: replace time_in_queue with sum of request times · 8cd5b8fc
      Committed by Konstantin Khlebnikov
      Column "time_in_queue" in diskstats is supposed to show total waiting time
      of all requests. I.e. value should be equal to the sum of times from other
      columns. But this is not true, because column "time_in_queue" is counted
      separately in jiffies rather than in nanoseconds as other times.
      
      This patch removes redundant counter for "time_in_queue" and shows total
      time of read, write, discard and flush requests.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
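      A sketch of the resulting arithmetic when printing the stat (field
      names follow the part_stat interface; stat is a local copy of the
      counters, and the exact print site is illustrative):

          /* time_in_queue is now derived, not separately counted */
          (unsigned int)div_u64(stat.nsecs[STAT_READ] +
                                stat.nsecs[STAT_WRITE] +
                                stat.nsecs[STAT_DISCARD] +
                                stat.nsecs[STAT_FLUSH],
                                NSEC_PER_MSEC)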
    • block/diskstats: more accurate approximation of io_ticks for slow disks · 2b8bd423
      Committed by Konstantin Khlebnikov
      Currently io_ticks is approximated by adding one at each start and end
      of a request if the jiffies counter has changed. This works perfectly
      for requests shorter than a jiffy, or if one request starts/ends in
      each jiffy.

      If the disk executes just one request at a time and requests are
      longer than two jiffies, then only the first and last jiffies will be
      accounted.
      
      The fix is simple: at the end of a request, add to io_ticks the
      jiffies that have passed since the last update, rather than just one
      jiffy.

      Example: a common HDD executes random 4k read requests in around 12ms.
      
      fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
      iostat -x 10 sdb
      
      Note the change in iostat's "%util" from 8,43% to 99,99% before/after
      the patch:
      
      Before:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
      
      After:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
      
      Now io_ticks does not lose time between the start and end of a
      request, but for queue depth > 1 some I/O time between adjacent starts
      might be lost.

      For load estimation "%util" is not as useful as the average queue
      length, but it clearly shows how often the disk queue is completely
      empty.
      
      Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
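      The core of the fix, roughly (the in-tree helper also walks up to the
      whole-disk part0; that is omitted here for brevity):

          static void update_io_ticks(struct hd_struct *part,
                                      unsigned long now, bool end)
          {
                  unsigned long stamp = READ_ONCE(part->stamp);

                  if (unlikely(stamp != now)) {
                          if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
                                  /* at request end, credit all jiffies since
                                   * the last stamp, not just one */
                                  __part_stat_add(part, io_ticks,
                                                  end ? now - stamp : 1);
                  }
          }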
  16. 12 March 2020, 4 commits
  17. 02 March 2020, 1 commit
  18. 18 December 2019, 1 commit
  19. 12 December 2019, 1 commit
    • block: fix NULL pointer dereference in account statistics with IDE · ecb6186c
      Committed by Logan Gunthorpe
      The IDE driver creates some passthru requests which never get
      submitted to the block layer in such a way that blk_account_io_start()
      gets called. However, the driver still calls __blk_mq_end_request() in
      ide_end_rq(), which calls blk_account_io_completion(), which in turn
      tries to dereference req->part, which is never set. See
      ide_prep_sense() for an example of where these requests come from.
      
      To fix this, blk_account_io_completion() and blk_account_io_done()
      should do nothing if req->part is not set.
      
      The back trace of this bug is:
      
          BUG: kernel NULL pointer dereference, address: 000002ac
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          *pde = 00000000
          Oops: 0002 [#1]
          CPU: 0 PID: 237 Comm: kworker/0:1H Not tainted
          5.4.0-rc2-00011-g48d9b0d4 #1
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1
          04/01/2014
          Workqueue: kblockd drive_rq_insert_work
          EIP: blk_account_io_completion+0x7a/0xf0
          Code: 89 54 24 08 31 d2 89 4c 24 04 31 c9 c7 04 24 02 00 00 00 c1 ee
          09 e8 f5 21 a6 ff e8 70 5c a7 ff 8b 53 60 8d 04 bd 00 00 00 00 <01> b4
          02 ac 02 00 00 8b 9a 88 02 00 00 85 db 74 11 85 d2 74 51 8b
          EAX: 00000000 EBX: f5b80000 ECX: 00000000 EDX: 00000000
          ESI: 00000000 EDI: 00000000 EBP: f3031e70 ESP: f3031e54
          DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010046
          CR0: 80050033 CR2: 000002ac CR3: 03c25000 CR4: 000406d0
          Call Trace:
           <IRQ>
            blk_update_request+0x85/0x420
            ide_end_rq+0x38/0xa0
            ide_complete_rq+0x3d/0x70
            cdrom_newpc_intr+0x258/0xba0
            ide_intr+0x135/0x250
            __handle_irq_event_percpu+0x3e/0x250
            handle_irq_event_percpu+0x1f/0x50
            handle_irq_event+0x32/0x60
            handle_level_irq+0x6c/0x110
            handle_irq+0x72/0xa0
            </IRQ>
            do_IRQ+0x45/0xad
            common_interrupt+0x115/0x11c
      
      Fixes: 48d9b0d4 ("block: account statistics for passthrough requests")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
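      A sketch of the guard described above (the body is trimmed; the real
      accounting is more involved):

          static void blk_account_io_completion(struct request *req,
                                                unsigned int bytes)
          {
                  /* Passthru requests that skipped blk_account_io_start()
                   * have no partition; do nothing for them. */
                  if (req->part && blk_do_io_stat(req)) {
                          const int sgrp = op_stat_group(req_op(req));

                          part_stat_lock();
                          part_stat_add(req->part, sectors[sgrp], bytes >> 9);
                          part_stat_unlock();
                  }
          }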
  20. 10 December 2019, 1 commit
  21. 13 November 2019, 1 commit
    • block: Remove partition support for zoned block devices · 5eac3eb3
      Committed by Damien Le Moal
      No known partitioning tool supports zoned block devices, especially
      the host-managed flavor with its strong sequential write constraints.
      Furthermore, there are no known users nor use cases for partitioned
      zoned block devices.
      
      This patch removes partition device creation for zoned block devices,
      which allows simplifying the processing of zone commands for zoned
      block devices. A warning is added if a partition table is found on the
      device.
      
      For report zones operations, no zone sector information remapping is
      necessary anymore, simplifying the code. Of note is that remapping of
      zone reports for DM targets is still necessary, as done by
      dm_remap_zone_report().

      Similarly, remapping of a zone reset bio is no longer necessary.
      Testing for the applicability of the zone reset all request also
      becomes simpler and only needs to check that the number of sectors of
      the requested zone range equals the disk capacity.
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
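      The simplified reset-all applicability check then reduces to something
      like this (an illustrative sketch, not a quote of the patch):

          static bool blkdev_allow_reset_all_zones(struct block_device *bdev,
                                                   sector_t sector,
                                                   sector_t nr_sectors)
          {
                  if (!blk_queue_zone_resetall(bdev_get_queue(bdev)))
                          return false;

                  /* With no partitions, "all zones" simply means the range
                   * covers the whole disk starting at sector 0. */
                  return !sector && nr_sectors == get_capacity(bdev->bd_disk);
          }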
  22. 07 November 2019, 1 commit