1. 19 5月, 2020 2 次提交
  2. 14 5月, 2020 2 次提交
    • S
      block: Inline encryption support for blk-mq · a892c8d5
      Satya Tangirala 提交于
      We must have some way of letting a storage device driver know what
      encryption context it should use for en/decrypting a request. However,
      it's the upper layers (like the filesystem/fscrypt) that know about and
      manages encryption contexts. As such, when the upper layer submits a bio
      to the block layer, and this bio eventually reaches a device driver with
      support for inline encryption, the device driver will need to have been
      told the encryption context for that bio.
      
      We want to communicate the encryption context from the upper layer to the
      storage device along with the bio, when the bio is submitted to the block
      layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
      represent an encryption context (note that we can't use the bi_private
      field in struct bio to do this because that field does not function to pass
      information across layers in the storage stack). We also introduce various
      functions to manipulate the bio_crypt_ctx and make the bio/request merging
      logic aware of the bio_crypt_ctx.
      
      We also make changes to blk-mq to make it handle bios with encryption
      contexts. blk-mq can merge many bios into the same request. These bios need
      to have contiguous data unit numbers (the necessary changes to blk-merge
      are also made to ensure this) - as such, it suffices to keep the data unit
      number of just the first bio, since that's all a storage driver needs to
      infer the data unit number to use for each data block in each bio in a
      request. blk-mq keeps track of the encryption context to be used for all
      the bios in a request with the request's rq_crypt_ctx. When the first bio
      is added to an empty request, blk-mq will program the encryption context
      of that bio into the request_queue's keyslot manager, and store the
      returned keyslot in the request's rq_crypt_ctx. All the functions to
      operate on encryption contexts are in blk-crypto.c.
      
      Upper layers only need to call bio_crypt_set_ctx with the encryption key,
      algorithm and data_unit_num; they don't have to worry about getting a
      keyslot for each encryption context, as blk-mq/blk-crypto handles that.
      Blk-crypto also makes it possible for request-based layered devices like
      dm-rq to make use of inline encryption hardware by cloning the
      rq_crypt_ctx and programming a keyslot in the new request_queue when
      necessary.
      
      Note that any user of the block layer can submit bios with an
      encryption context, such as filesystems, device-mapper targets, etc.
      Signed-off-by: NSatya Tangirala <satyat@google.com>
      Reviewed-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a892c8d5
    • M
      block: move blk_io_schedule() out of header file · 71ac860a
      Ming Lei 提交于
      blk_io_schedule() isn't called from performance sensitive code path, and
      it is easier to maintain by exporting it as symbol.
      
      Also blk_io_schedule() is only called by CONFIG_BLOCK code, so it is safe
      to do this way. Meantime fixes build failure when CONFIG_BLOCK is off.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Fixes: e6249cdd ("block: add blk_io_schedule() for avoiding task hung in sync dio")
      Reported-by: NSatya Tangirala <satyat@google.com>
      Tested-by: NSatya Tangirala <satyat@google.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      71ac860a
  3. 13 5月, 2020 2 次提交
    • K
      block: Introduce REQ_OP_ZONE_APPEND · 0512a75b
      Keith Busch 提交于
      Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
      block device. This is a no-merge write operation.
      
      A zone append write BIO must:
      * Target a zoned block device
      * Have a sector position indicating the start sector of the target zone
      * The target zone must be a sequential write zone
      * The BIO must not cross a zone boundary
      * The BIO size must not be split to ensure that a single range of LBAs
        is written with a single command.
      
      Implement these checks in generic_make_request_checks() using the
      helper function blk_check_zone_append(). To avoid write append BIO
      splitting, introduce the new max_zone_append_sectors queue limit
      attribute and ensure that a BIO size is always lower than this limit.
      Export this new limit through sysfs and check these limits in bio_full().
      
      Also when a LLDD can't dispatch a request to a specific zone, it
      will return BLK_STS_ZONE_RESOURCE indicating this request needs to
      be delayed, e.g.  because the zone it will be dispatched to is still
      write-locked. If this happens set the request aside in a local list
      to continue trying dispatching requests such as READ requests or a
      WRITE/ZONE_APPEND requests targetting other zones. This way we can
      still keep a high queue depth without starving other requests even if
      one request can't be served due to zone write-locking.
      
      Finally, make sure that the bio sector position indicates the actual
      write position as indicated by the device on completion.
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      [ jth: added zone-append specific add_page and merge_page helpers ]
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0512a75b
    • M
      block: fix use-after-free on cached last_lookup partition · b7d6c303
      Ming Lei 提交于
      delete_partition() clears the cached last_lookup partition. However the
      .last_lookup cache may be overwritten by one IO path after it is cleared
      from delete_partition(). Then another IO path may use the cached deleting
      partition after hd_struct_free() is called, then use-after-free is triggered
      on the cached partition.
      
      Fixes the issue by the following approach:
      
      1) always get the partition's refcount via hd_struct_try_get() before
      setting .last_lookup
      
      2) move clearing .last_lookup from delete_partition() to hd_struct_free()
      which is the release handle of the partition's percpu-refcount, so that no
      IO path can cache deleteing partition via .last_lookup.
      
      It is one candidate approach of Yufen's patch[1] which adds overhead
      in fast path by indirect lookup which may introduce one extra cacheline
      in IO path. Also this patch relies on percpu-refcount's protection, and
      it is easier to understand and verify.
      
      [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#tReported-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b7d6c303
  4. 10 5月, 2020 2 次提交
  5. 29 4月, 2020 3 次提交
  6. 25 4月, 2020 2 次提交
  7. 23 4月, 2020 1 次提交
  8. 30 3月, 2020 1 次提交
  9. 28 3月, 2020 1 次提交
    • C
      block: simplify queue allocation · 3d745ea5
      Christoph Hellwig 提交于
      Current make_request based drivers use either blk_alloc_queue_node or
      blk_alloc_queue to allocate a queue, and then set up the make_request_fn
      function pointer and a few parameters using the blk_queue_make_request
      helper.  Simplify this by passing the make_request pointer to
      blk_alloc_queue, and while at it merge the _node variant into the main
      helper by always passing a node_id, and remove the superfluous gfp_mask
      parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3d745ea5
  10. 25 3月, 2020 2 次提交
    • K
      block/diskstats: replace time_in_queue with sum of request times · 8cd5b8fc
      Konstantin Khlebnikov 提交于
      Column "time_in_queue" in diskstats is supposed to show total waiting time
      of all requests. I.e. value should be equal to the sum of times from other
      columns. But this is not true, because column "time_in_queue" is counted
      separately in jiffies rather than in nanoseconds as other times.
      
      This patch removes redundant counter for "time_in_queue" and shows total
      time of read, write, discard and flush requests.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8cd5b8fc
    • K
      block/diskstats: more accurate approximation of io_ticks for slow disks · 2b8bd423
      Konstantin Khlebnikov 提交于
      Currently io_ticks is approximated by adding one at each start and end of
      requests if jiffies counter has changed. This works perfectly for requests
      shorter than a jiffy or if one of requests starts/ends at each jiffy.
      
      If disk executes just one request at a time and they are longer than two
      jiffies then only first and last jiffies will be accounted.
      
      Fix is simple: at the end of request add up into io_ticks jiffies passed
      since last update rather than just one jiffy.
      
      Example: common HDD executes random read 4k requests around 12ms.
      
      fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
      iostat -x 10 sdb
      
      Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:
      
      Before:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
      
      After:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
      
      Now io_ticks does not loose time between start and end of requests, but
      for queue-depth > 1 some I/O time between adjacent starts might be lost.
      
      For load estimation "%util" is not as useful as average queue length,
      but it clearly shows how often disk queue is completely empty.
      
      Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting")
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2b8bd423
  11. 12 3月, 2020 4 次提交
  12. 02 3月, 2020 1 次提交
  13. 18 12月, 2019 1 次提交
  14. 12 12月, 2019 1 次提交
    • L
      block: fix NULL pointer dereference in account statistics with IDE · ecb6186c
      Logan Gunthorpe 提交于
      The IDE driver creates some passthru requests which never get
      submitted to the block layer in such a way that blk_account_io_start()
      gets called. However, the driver still calls __blk_mq_end_request() in
      ide_end_rq() which will call blk_account_io_completion() which tries
      to dereferences req->part which is never set. See ide_prep_sense() for
      an example of where these requests come from.
      
      To fix this, blk_account_io_completion() and blk_account_io_done()
      should do nothing if req->part is not set.
      
      The back trace of this bug is:
      
          BUG: kernel NULL pointer dereference, address: 000002ac
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          *pde = 00000000
          Oops: 0002 [#1]
          CPU: 0 PID: 237 Comm: kworker/0:1H Not tainted
          5.4.0-rc2-00011-g48d9b0d4 #1
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1
          04/01/2014
          Workqueue: kblockd drive_rq_insert_work
          EIP: blk_account_io_completion+0x7a/0xf0
          Code: 89 54 24 08 31 d2 89 4c 24 04 31 c9 c7 04 24 02 00 00 00 c1 ee
          09 e8 f5 21 a6 ff e8 70 5c a7 ff 8b 53 60 8d 04 bd 00 00 00 00 <01> b4
          02 ac 02 00 00 8b 9a 88 02 00 00 85 db 74 11 85 d2 74 51 8b
          EAX: 00000000 EBX: f5b80000 ECX: 00000000 EDX: 00000000
          ESI: 00000000 EDI: 00000000 EBP: f3031e70 ESP: f3031e54
          DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010046
          CR0: 80050033 CR2: 000002ac CR3: 03c25000 CR4: 000406d0
          Call Trace:
           <IRQ>
            blk_update_request+0x85/0x420
            ide_end_rq+0x38/0xa0
            ide_complete_rq+0x3d/0x70
            cdrom_newpc_intr+0x258/0xba0
            ide_intr+0x135/0x250
            __handle_irq_event_percpu+0x3e/0x250
            handle_irq_event_percpu+0x1f/0x50
            handle_irq_event+0x32/0x60
            handle_level_irq+0x6c/0x110
            handle_irq+0x72/0xa0
            </IRQ>
            do_IRQ+0x45/0xad
            common_interrupt+0x115/0x11c
      
      Fixes: 48d9b0d4 ("block: account statistics for passthrough requests")
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ecb6186c
  15. 10 12月, 2019 1 次提交
  16. 13 11月, 2019 1 次提交
    • D
      block: Remove partition support for zoned block devices · 5eac3eb3
      Damien Le Moal 提交于
      No known partitioning tool supports zoned block devices, especially the
      host managed flavor with strong sequential write constraints.
      Furthermore, there are also no known user nor use cases for partitioned
      zoned block devices.
      
      This patch removes partition device creation for zoned block devices,
      which allows simplifying the processing of zone commands for zoned
      block devices. A warning is added if a partition table is found on the
      device.
      
      For report zones operations no zone sector information remapping is
      necessary anymore, simplifying the code. Of note is that remapping of
      zone reports for DM targets is still necessary as done by
      dm_remap_zone_report().
      
      Similarly, remaping of a zone reset bio is not necessary anymore.
      Testing for the applicability of the zone reset all request also becomes
      simpler and only needs to check that the number of sectors of the
      requested zone range is equal to the disk capacity.
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5eac3eb3
  17. 07 11月, 2019 1 次提交
  18. 07 10月, 2019 2 次提交
    • B
      block: Reduce sysfs_lock locking inside blk_cleanup_queue() · 73f1c77e
      Bart Van Assche 提交于
      Since blk_cleanup_queue() is called after blk_unregister_queue() and
      since that last function removes all sysfs attributes, serializing
      any code in blk_cleanup_queue() against sysfs callback methods nor against
      I/O scheduler changes is necessary. Hence remove the syfs_lock locking
      calls from the start of blk_cleanup_queue().
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      73f1c77e
    • B
      block: Remove "dying" checks from sysfs callbacks · bae85c15
      Bart Van Assche 提交于
      Block drivers must call del_gendisk() before blk_cleanup_queue().
      del_gendisk() calls kobject_del() and kobject_del() waits until any
      ongoing sysfs callback functions have finished. In other words, the
      sysfs callback functions won't be called for a queue in the dying
      state. Hence remove the "dying" checks from the sysfs callback
      functions.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bae85c15
  19. 18 9月, 2019 1 次提交
  20. 29 8月, 2019 1 次提交
  21. 28 8月, 2019 1 次提交
    • M
      block: split .sysfs_lock into two locks · cecf5d87
      Ming Lei 提交于
      The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
      path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
      required.
      
      However, when mq & iosched kobjects are removed via
      blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
      too. This way causes AB-BA lock because the kernfs built-in lock of
      'kn-count' is required inside kobject_del() too, see the lockdep warning[1].
      
      On the other hand, it isn't necessary to acquire q->sysfs_lock for
      both blk_mq_unregister_dev() & elv_unregister_queue() because
      clearing REGISTERED flag prevents storing to 'queue/scheduler'
      from being happened. Also sysfs write(store) is exclusive, so no
      necessary to hold the lock for elv_unregister_queue() when it is
      called in switching elevator path.
      
      So split .sysfs_lock into two: one is still named as .sysfs_lock for
      covering sync .store, the other one is named as .sysfs_dir_lock
      for covering kobjects and related status change.
      
      sysfs itself can handle the race between add/remove kobjects and
      showing/storing attributes under kobjects. For switching scheduler
      via storing to 'queue/scheduler', we use the queue flag of
      QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, then
      we can avoid to hold .sysfs_lock during removing/adding kobjects.
      
      [1]  lockdep warning
          ======================================================
          WARNING: possible circular locking dependency detected
          5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
          ------------------------------------------------------
          rmmod/777 is trying to acquire lock:
          00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72
      
          but task is already holding lock:
          00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #1 (&q->sysfs_lock){+.+.}:
                 __lock_acquire+0x95f/0xa2f
                 lock_acquire+0x1b4/0x1e8
                 __mutex_lock+0x14a/0xa9b
                 blk_mq_hw_sysfs_show+0x63/0xb6
                 sysfs_kf_seq_show+0x11f/0x196
                 seq_read+0x2cd/0x5f2
                 vfs_read+0xc7/0x18c
                 ksys_read+0xc4/0x13e
                 do_syscall_64+0xa7/0x295
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          -> #0 (kn->count#202){++++}:
                 check_prev_add+0x5d2/0xc45
                 validate_chain+0xed3/0xf94
                 __lock_acquire+0x95f/0xa2f
                 lock_acquire+0x1b4/0x1e8
                 __kernfs_remove+0x237/0x40b
                 kernfs_remove_by_name_ns+0x59/0x72
                 remove_files+0x61/0x96
                 sysfs_remove_group+0x81/0xa4
                 sysfs_remove_groups+0x3b/0x44
                 kobject_del+0x44/0x94
                 blk_mq_unregister_dev+0x83/0xdd
                 blk_unregister_queue+0xa0/0x10b
                 del_gendisk+0x259/0x3fa
                 null_del_dev+0x8b/0x1c3 [null_blk]
                 null_exit+0x5c/0x95 [null_blk]
                 __se_sys_delete_module+0x204/0x337
                 do_syscall_64+0xa7/0x295
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          other info that might help us debug this:
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(&q->sysfs_lock);
                                         lock(kn->count#202);
                                         lock(&q->sysfs_lock);
            lock(kn->count#202);
      
           *** DEADLOCK ***
      
          2 locks held by rmmod/777:
           #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
           #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b
      
          stack backtrace:
          CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
          Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
          Call Trace:
           dump_stack+0x9a/0xe6
           check_noncircular+0x207/0x251
           ? print_circular_bug+0x32a/0x32a
           ? find_usage_backwards+0x84/0xb0
           check_prev_add+0x5d2/0xc45
           validate_chain+0xed3/0xf94
           ? check_prev_add+0xc45/0xc45
           ? mark_lock+0x11b/0x804
           ? check_usage_forwards+0x1ca/0x1ca
           __lock_acquire+0x95f/0xa2f
           lock_acquire+0x1b4/0x1e8
           ? kernfs_remove_by_name_ns+0x59/0x72
           __kernfs_remove+0x237/0x40b
           ? kernfs_remove_by_name_ns+0x59/0x72
           ? kernfs_next_descendant_post+0x7d/0x7d
           ? strlen+0x10/0x23
           ? strcmp+0x22/0x44
           kernfs_remove_by_name_ns+0x59/0x72
           remove_files+0x61/0x96
           sysfs_remove_group+0x81/0xa4
           sysfs_remove_groups+0x3b/0x44
           kobject_del+0x44/0x94
           blk_mq_unregister_dev+0x83/0xdd
           blk_unregister_queue+0xa0/0x10b
           del_gendisk+0x259/0x3fa
           ? disk_events_poll_msecs_store+0x12b/0x12b
           ? check_flags+0x1ea/0x204
           ? mark_held_locks+0x1f/0x7a
           null_del_dev+0x8b/0x1c3 [null_blk]
           null_exit+0x5c/0x95 [null_blk]
           __se_sys_delete_module+0x204/0x337
           ? free_module+0x39f/0x39f
           ? blkcg_maybe_throttle_current+0x8a/0x718
           ? rwlock_bug+0x62/0x62
           ? __blkcg_punt_bio_submit+0xd0/0xd0
           ? trace_hardirqs_on_thunk+0x1a/0x20
           ? mark_held_locks+0x1f/0x7a
           ? do_syscall_64+0x4c/0x295
           do_syscall_64+0xa7/0x295
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
          RIP: 0033:0x7fb696cdbe6b
          Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
          RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
          RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
          RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
          RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
          R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
          R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cecf5d87
  22. 19 8月, 2019 1 次提交
  23. 14 8月, 2019 1 次提交
  24. 05 8月, 2019 2 次提交
  25. 11 7月, 2019 1 次提交
    • D
      block: Disable write plugging for zoned block devices · b49773e7
      Damien Le Moal 提交于
      Simultaneously writing to a sequential zone of a zoned block device
      from multiple contexts requires mutual exclusion for BIO issuing to
      ensure that writes happen sequentially. However, even for a well
      behaved user correctly implementing such synchronization, BIO plugging
      may interfere and result in BIOs from the different contextx to be
      reordered if plugging is done outside of the mutual exclusion section,
      e.g. the plug was started by a function higher in the call chain than
      the function issuing BIOs.
      
               Context A                     Context B
      
         | blk_start_plug()
         | ...
         | seq_write_zone()
           | mutex_lock(zone)
           | bio-0->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-0)
           | submit_bio(bio-0)
           | bio-1->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-1)
           | submit_bio(bio-1)
           | mutex_unlock(zone)
           | return
         | -----------------------> | seq_write_zone()
        				| mutex_lock(zone)
           				| bio-2->bi_iter.bi_sector = zone->wp
           				| zone->wp += bio_sectors(bio-2)
      				| submit_bio(bio-2)
      				| mutex_unlock(zone)
         | <------------------------- |
         | blk_finish_plug()
      
      In the above example, despite the mutex synchronization ensuring the
      correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
      issued after BIO 2 of context B, when the plug is released with
      blk_finish_plug().
      
      While this problem can be addressed using the blk_flush_plug_list()
      function (in the above example, the call must be inserted before the
      zone mutex lock is released), a simple generic solution in the block
      layer avoid this additional code in all zoned block device user code.
      The simple generic solution implemented with this patch is to introduce
      the internal helper function blk_mq_plug() to access the current
      context plug on BIO submission. This helper returns the current plug
      only if the target device is not a zoned block device or if the BIO to
      be plugged is not a write operation. Otherwise, the caller context plug
      is ignored and NULL returned, resulting is all writes to zoned block
      device to never be plugged.
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b49773e7
  26. 10 7月, 2019 2 次提交
    • T
      blkcg: implement REQ_CGROUP_PUNT · d3f77dfd
      Tejun Heo 提交于
      When a shared kthread needs to issue a bio for a cgroup, doing so
      synchronously can lead to priority inversions as the kthread can be
      trapped waiting for that cgroup.  This patch implements
      REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
      to a dedicated per-blkcg work item to avoid such priority inversions.
      
      This will be used to fix priority inversions in btrfs compression and
      should be generally useful as we grow filesystem support for
      comprehensive IO control.
      
      Cc: Chris Mason <clm@fb.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d3f77dfd
    • J
      block: init flush rq ref count to 1 · b554db14
      Josef Bacik 提交于
      We discovered a problem in newer kernels where a disconnect of a NBD
      device while the flush request was pending would result in a hang.  This
      is because the blk mq timeout handler does
      
              if (!refcount_inc_not_zero(&rq->ref))
                      return true;
      
      to determine if it's ok to run the timeout handler for the request.
      Flush_rq's don't have a ref count set, so we'd skip running the timeout
      handler for this request and it would just sit there in limbo forever.
      
      Fix this by always setting the refcount of any request going through
      blk_init_rq() to 1.  I tested this with a nbd-server that dropped flush
      requests to verify that it hung, and then tested with this patch to
      verify I got the timeout as expected and the error handling kicked in.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b554db14