1. 24 6月, 2020 1 次提交
  2. 03 6月, 2020 1 次提交
  3. 29 5月, 2020 1 次提交
  4. 27 5月, 2020 4 次提交
  5. 19 5月, 2020 3 次提交
  6. 14 5月, 2020 2 次提交
    • S
      block: Inline encryption support for blk-mq · a892c8d5
      Satya Tangirala 提交于
      We must have some way of letting a storage device driver know what
      encryption context it should use for en/decrypting a request. However,
      it's the upper layers (like the filesystem/fscrypt) that know about and
      manages encryption contexts. As such, when the upper layer submits a bio
      to the block layer, and this bio eventually reaches a device driver with
      support for inline encryption, the device driver will need to have been
      told the encryption context for that bio.
      
      We want to communicate the encryption context from the upper layer to the
      storage device along with the bio, when the bio is submitted to the block
      layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
      represent an encryption context (note that we can't use the bi_private
      field in struct bio to do this because that field does not function to pass
      information across layers in the storage stack). We also introduce various
      functions to manipulate the bio_crypt_ctx and make the bio/request merging
      logic aware of the bio_crypt_ctx.
      
      We also make changes to blk-mq to make it handle bios with encryption
      contexts. blk-mq can merge many bios into the same request. These bios need
      to have contiguous data unit numbers (the necessary changes to blk-merge
      are also made to ensure this) - as such, it suffices to keep the data unit
      number of just the first bio, since that's all a storage driver needs to
      infer the data unit number to use for each data block in each bio in a
      request. blk-mq keeps track of the encryption context to be used for all
      the bios in a request with the request's rq_crypt_ctx. When the first bio
      is added to an empty request, blk-mq will program the encryption context
      of that bio into the request_queue's keyslot manager, and store the
      returned keyslot in the request's rq_crypt_ctx. All the functions to
      operate on encryption contexts are in blk-crypto.c.
      
      Upper layers only need to call bio_crypt_set_ctx with the encryption key,
      algorithm and data_unit_num; they don't have to worry about getting a
      keyslot for each encryption context, as blk-mq/blk-crypto handles that.
      Blk-crypto also makes it possible for request-based layered devices like
      dm-rq to make use of inline encryption hardware by cloning the
      rq_crypt_ctx and programming a keyslot in the new request_queue when
      necessary.
      
      Note that any user of the block layer can submit bios with an
      encryption context, such as filesystems, device-mapper targets, etc.
      Signed-off-by: NSatya Tangirala <satyat@google.com>
      Reviewed-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a892c8d5
    • M
      block: move blk_io_schedule() out of header file · 71ac860a
      Ming Lei 提交于
      blk_io_schedule() isn't called from performance sensitive code path, and
      it is easier to maintain by exporting it as symbol.
      
      Also blk_io_schedule() is only called by CONFIG_BLOCK code, so it is safe
      to do this way. Meantime fixes build failure when CONFIG_BLOCK is off.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Fixes: e6249cdd ("block: add blk_io_schedule() for avoiding task hung in sync dio")
      Reported-by: NSatya Tangirala <satyat@google.com>
      Tested-by: NSatya Tangirala <satyat@google.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      71ac860a
  7. 13 5月, 2020 2 次提交
    • K
      block: Introduce REQ_OP_ZONE_APPEND · 0512a75b
      Keith Busch 提交于
      Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
      block device. This is a no-merge write operation.
      
      A zone append write BIO must:
      * Target a zoned block device
      * Have a sector position indicating the start sector of the target zone
      * The target zone must be a sequential write zone
      * The BIO must not cross a zone boundary
      * The BIO size must not be split to ensure that a single range of LBAs
        is written with a single command.
      
      Implement these checks in generic_make_request_checks() using the
      helper function blk_check_zone_append(). To avoid write append BIO
      splitting, introduce the new max_zone_append_sectors queue limit
      attribute and ensure that a BIO size is always lower than this limit.
      Export this new limit through sysfs and check these limits in bio_full().
      
      Also when a LLDD can't dispatch a request to a specific zone, it
      will return BLK_STS_ZONE_RESOURCE indicating this request needs to
      be delayed, e.g.  because the zone it will be dispatched to is still
      write-locked. If this happens set the request aside in a local list
      to continue trying dispatching requests such as READ requests or a
      WRITE/ZONE_APPEND requests targetting other zones. This way we can
      still keep a high queue depth without starving other requests even if
      one request can't be served due to zone write-locking.
      
      Finally, make sure that the bio sector position indicates the actual
      write position as indicated by the device on completion.
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      [ jth: added zone-append specific add_page and merge_page helpers ]
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0512a75b
    • M
      block: fix use-after-free on cached last_lookup partition · b7d6c303
      Ming Lei 提交于
      delete_partition() clears the cached last_lookup partition. However the
      .last_lookup cache may be overwritten by one IO path after it is cleared
      from delete_partition(). Then another IO path may use the cached deleting
      partition after hd_struct_free() is called, then use-after-free is triggered
      on the cached partition.
      
      Fixes the issue by the following approach:
      
      1) always get the partition's refcount via hd_struct_try_get() before
      setting .last_lookup
      
      2) move clearing .last_lookup from delete_partition() to hd_struct_free()
      which is the release handle of the partition's percpu-refcount, so that no
      IO path can cache deleteing partition via .last_lookup.
      
      It is one candidate approach of Yufen's patch[1] which adds overhead
      in fast path by indirect lookup which may introduce one extra cacheline
      in IO path. Also this patch relies on percpu-refcount's protection, and
      it is easier to understand and verify.
      
      [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#tReported-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b7d6c303
  8. 10 5月, 2020 2 次提交
  9. 29 4月, 2020 3 次提交
  10. 25 4月, 2020 2 次提交
  11. 23 4月, 2020 1 次提交
  12. 30 3月, 2020 1 次提交
  13. 28 3月, 2020 1 次提交
    • C
      block: simplify queue allocation · 3d745ea5
      Christoph Hellwig 提交于
      Current make_request based drivers use either blk_alloc_queue_node or
      blk_alloc_queue to allocate a queue, and then set up the make_request_fn
      function pointer and a few parameters using the blk_queue_make_request
      helper.  Simplify this by passing the make_request pointer to
      blk_alloc_queue, and while at it merge the _node variant into the main
      helper by always passing a node_id, and remove the superfluous gfp_mask
      parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3d745ea5
  14. 25 3月, 2020 2 次提交
    • K
      block/diskstats: replace time_in_queue with sum of request times · 8cd5b8fc
      Konstantin Khlebnikov 提交于
      Column "time_in_queue" in diskstats is supposed to show total waiting time
      of all requests. I.e. value should be equal to the sum of times from other
      columns. But this is not true, because column "time_in_queue" is counted
      separately in jiffies rather than in nanoseconds as other times.
      
      This patch removes redundant counter for "time_in_queue" and shows total
      time of read, write, discard and flush requests.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8cd5b8fc
    • K
      block/diskstats: more accurate approximation of io_ticks for slow disks · 2b8bd423
      Konstantin Khlebnikov 提交于
      Currently io_ticks is approximated by adding one at each start and end of
      requests if jiffies counter has changed. This works perfectly for requests
      shorter than a jiffy or if one of requests starts/ends at each jiffy.
      
      If disk executes just one request at a time and they are longer than two
      jiffies then only first and last jiffies will be accounted.
      
      Fix is simple: at the end of request add up into io_ticks jiffies passed
      since last update rather than just one jiffy.
      
      Example: common HDD executes random read 4k requests around 12ms.
      
      fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
      iostat -x 10 sdb
      
      Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
      
      Before:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
      
      After:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
      
      Now io_ticks does not loose time between start and end of requests, but
      for queue-depth > 1 some I/O time between adjacent starts might be lost.
      
      For load estimation "%util" is not as useful as average queue length,
      but it clearly shows how often disk queue is completely empty.
      
      Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting")
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b8bd423
  15. 12 3月, 2020 4 次提交
  16. 02 3月, 2020 1 次提交
  17. 18 12月, 2019 1 次提交
  18. 12 12月, 2019 1 次提交
    • L
      block: fix NULL pointer dereference in account statistics with IDE · ecb6186c
      Logan Gunthorpe 提交于
      The IDE driver creates some passthru requests which never get
      submitted to the block layer in such a way that blk_account_io_start()
      gets called. However, the driver still calls __blk_mq_end_request() in
      ide_end_rq() which will call blk_account_io_completion() which tries
      to dereferences req->part which is never set. See ide_prep_sense() for
      an example of where these requests come from.
      
      To fix this, blk_account_io_completion() and blk_account_io_done()
      should do nothing if req->part is not set.
      
      The back trace of this bug is:
      
          BUG: kernel NULL pointer dereference, address: 000002ac
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          *pde = 00000000
          Oops: 0002 [#1]
          CPU: 0 PID: 237 Comm: kworker/0:1H Not tainted
          5.4.0-rc2-00011-g48d9b0d4 #1
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1
          04/01/2014
          Workqueue: kblockd drive_rq_insert_work
          EIP: blk_account_io_completion+0x7a/0xf0
          Code: 89 54 24 08 31 d2 89 4c 24 04 31 c9 c7 04 24 02 00 00 00 c1 ee
          09 e8 f5 21 a6 ff e8 70 5c a7 ff 8b 53 60 8d 04 bd 00 00 00 00 <01> b4
          02 ac 02 00 00 8b 9a 88 02 00 00 85 db 74 11 85 d2 74 51 8b
          EAX: 00000000 EBX: f5b80000 ECX: 00000000 EDX: 00000000
          ESI: 00000000 EDI: 00000000 EBP: f3031e70 ESP: f3031e54
          DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010046
          CR0: 80050033 CR2: 000002ac CR3: 03c25000 CR4: 000406d0
          Call Trace:
           <IRQ>
            blk_update_request+0x85/0x420
            ide_end_rq+0x38/0xa0
            ide_complete_rq+0x3d/0x70
            cdrom_newpc_intr+0x258/0xba0
            ide_intr+0x135/0x250
            __handle_irq_event_percpu+0x3e/0x250
            handle_irq_event_percpu+0x1f/0x50
            handle_irq_event+0x32/0x60
            handle_level_irq+0x6c/0x110
            handle_irq+0x72/0xa0
            </IRQ>
            do_IRQ+0x45/0xad
            common_interrupt+0x115/0x11c
      
      Fixes: 48d9b0d4 ("block: account statistics for passthrough requests")
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ecb6186c
  19. 10 12月, 2019 1 次提交
  20. 13 11月, 2019 1 次提交
    • D
      block: Remove partition support for zoned block devices · 5eac3eb3
      Damien Le Moal 提交于
      No known partitioning tool supports zoned block devices, especially the
      host managed flavor with strong sequential write constraints.
      Furthermore, there are also no known user nor use cases for partitioned
      zoned block devices.
      
      This patch removes partition device creation for zoned block devices,
      which allows simplifying the processing of zone commands for zoned
      block devices. A warning is added if a partition table is found on the
      device.
      
      For report zones operations no zone sector information remapping is
      necessary anymore, simplifying the code. Of note is that remapping of
      zone reports for DM targets is still necessary as done by
      dm_remap_zone_report().
      
      Similarly, remaping of a zone reset bio is not necessary anymore.
      Testing for the applicability of the zone reset all request also becomes
      simpler and only needs to check that the number of sectors of the
      requested zone range is equal to the disk capacity.
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5eac3eb3
  21. 07 11月, 2019 1 次提交
  22. 07 10月, 2019 2 次提交
    • B
      block: Reduce sysfs_lock locking inside blk_cleanup_queue() · 73f1c77e
      Bart Van Assche 提交于
      Since blk_cleanup_queue() is called after blk_unregister_queue() and
      since that last function removes all sysfs attributes, serializing
      any code in blk_cleanup_queue() against sysfs callback methods nor against
      I/O scheduler changes is necessary. Hence remove the syfs_lock locking
      calls from the start of blk_cleanup_queue().
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      73f1c77e
    • B
      block: Remove "dying" checks from sysfs callbacks · bae85c15
      Bart Van Assche 提交于
      Block drivers must call del_gendisk() before blk_cleanup_queue().
      del_gendisk() calls kobject_del() and kobject_del() waits until any
      ongoing sysfs callback functions have finished. In other words, the
      sysfs callback functions won't be called for a queue in the dying
      state. Hence remove the "dying" checks from the sysfs callback
      functions.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bae85c15
  23. 18 9月, 2019 1 次提交
  24. 29 8月, 2019 1 次提交