1. 24 6月, 2020 5 次提交
  2. 18 6月, 2020 1 次提交
    • W
      block: update hctx map when use multiple maps · fe35ec58
      Weiping Zhang 提交于
      There is an issue when tune the number for read and write queues,
      if the total queue count was not changed. The hctx->type cannot
      be updated, since __blk_mq_update_nr_hw_queues will return directly
      if the total queue count has not been changed.
      
      Reproduce:
      
      dmesg | grep "default/read/poll"
      [    2.607459] nvme nvme0: 48/0/0 default/read/poll queues
      cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
           48 default
      
      tune the write queues to 24:
      echo 24 > /sys/module/nvme/parameters/write_queues
      echo 1 > /sys/block/nvme0n1/device/reset_controller
      
      dmesg | grep "default/read/poll"
      [  433.547235] nvme nvme0: 24/24/0 default/read/poll queues
      
      cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
           48 default
      
      The driver's hardware queue mapping is not same as block layer.
      Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fe35ec58
  3. 07 6月, 2020 1 次提交
  4. 30 5月, 2020 8 次提交
  5. 27 5月, 2020 1 次提交
  6. 19 5月, 2020 4 次提交
  7. 14 5月, 2020 1 次提交
    • S
      block: Inline encryption support for blk-mq · a892c8d5
      Satya Tangirala 提交于
      We must have some way of letting a storage device driver know what
      encryption context it should use for en/decrypting a request. However,
      it's the upper layers (like the filesystem/fscrypt) that know about and
      manages encryption contexts. As such, when the upper layer submits a bio
      to the block layer, and this bio eventually reaches a device driver with
      support for inline encryption, the device driver will need to have been
      told the encryption context for that bio.
      
      We want to communicate the encryption context from the upper layer to the
      storage device along with the bio, when the bio is submitted to the block
      layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
      represent an encryption context (note that we can't use the bi_private
      field in struct bio to do this because that field does not function to pass
      information across layers in the storage stack). We also introduce various
      functions to manipulate the bio_crypt_ctx and make the bio/request merging
      logic aware of the bio_crypt_ctx.
      
      We also make changes to blk-mq to make it handle bios with encryption
      contexts. blk-mq can merge many bios into the same request. These bios need
      to have contiguous data unit numbers (the necessary changes to blk-merge
      are also made to ensure this) - as such, it suffices to keep the data unit
      number of just the first bio, since that's all a storage driver needs to
      infer the data unit number to use for each data block in each bio in a
      request. blk-mq keeps track of the encryption context to be used for all
      the bios in a request with the request's rq_crypt_ctx. When the first bio
      is added to an empty request, blk-mq will program the encryption context
      of that bio into the request_queue's keyslot manager, and store the
      returned keyslot in the request's rq_crypt_ctx. All the functions to
      operate on encryption contexts are in blk-crypto.c.
      
      Upper layers only need to call bio_crypt_set_ctx with the encryption key,
      algorithm and data_unit_num; they don't have to worry about getting a
      keyslot for each encryption context, as blk-mq/blk-crypto handles that.
      Blk-crypto also makes it possible for request-based layered devices like
      dm-rq to make use of inline encryption hardware by cloning the
      rq_crypt_ctx and programming a keyslot in the new request_queue when
      necessary.
      
      Note that any user of the block layer can submit bios with an
      encryption context, such as filesystems, device-mapper targets, etc.
      Signed-off-by: NSatya Tangirala <satyat@google.com>
      Reviewed-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a892c8d5
  8. 13 5月, 2020 2 次提交
    • K
      block: Introduce REQ_OP_ZONE_APPEND · 0512a75b
      Keith Busch 提交于
      Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
      block device. This is a no-merge write operation.
      
      A zone append write BIO must:
      * Target a zoned block device
      * Have a sector position indicating the start sector of the target zone
      * The target zone must be a sequential write zone
      * The BIO must not cross a zone boundary
      * The BIO size must not be split to ensure that a single range of LBAs
        is written with a single command.
      
      Implement these checks in generic_make_request_checks() using the
      helper function blk_check_zone_append(). To avoid write append BIO
      splitting, introduce the new max_zone_append_sectors queue limit
      attribute and ensure that a BIO size is always lower than this limit.
      Export this new limit through sysfs and check these limits in bio_full().
      
      Also when a LLDD can't dispatch a request to a specific zone, it
      will return BLK_STS_ZONE_RESOURCE indicating this request needs to
      be delayed, e.g.  because the zone it will be dispatched to is still
      write-locked. If this happens set the request aside in a local list
      to continue trying dispatching requests such as READ requests or a
      WRITE/ZONE_APPEND requests targetting other zones. This way we can
      still keep a high queue depth without starving other requests even if
      one request can't be served due to zone write-locking.
      
      Finally, make sure that the bio sector position indicates the actual
      write position as indicated by the device on completion.
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      [ jth: added zone-append specific add_page and merge_page helpers ]
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0512a75b
    • W
      block: reset mapping if failed to update hardware queue count · aa880ad6
      Weiping Zhang 提交于
      When we increase hardware queue count, blk_mq_update_queue_map will
      reset the mapping between cpu and hardware queue base on the hardware
      queue count(set->nr_hw_queues). The mapping cannot be reset if it
      encounters error in blk_mq_realloc_hw_ctxs, but the fallback flow will
      continue using it, then blk_mq_map_swqueue will touch a invalid memory,
      because the mapping points to a wrong hctx.
      
      blktest block/030:
      
      null_blk: module loaded
      Increasing nr_hw_queues to 8 fails, fallback to 1
      ==================================================================
      BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
      Read of size 8 at addr 0000000000000128 by task nproc/8541
      
      CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
      Call Trace:
      dump_stack+0xa5/0xe6
      __kasan_report.cold+0x65/0xbb
      kasan_report+0x45/0x60
      check_memory_region+0x15e/0x1c0
      __kasan_check_read+0x15/0x20
      blk_mq_map_swqueue+0x2f2/0x830
      __blk_mq_update_nr_hw_queues+0x3df/0x690
      blk_mq_update_nr_hw_queues+0x32/0x50
      nullb_device_submit_queues_store+0xde/0x160 [null_blk]
      configfs_write_file+0x1c4/0x250 [configfs]
      __vfs_write+0x4c/0x90
      vfs_write+0x14b/0x2d0
      ksys_write+0xdd/0x180
      __x64_sys_write+0x47/0x50
      do_syscall_64+0x6f/0x310
      entry_SYSCALL_64_after_hwframe+0x49/0xb3
      Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
      Tested-by: NBart van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      aa880ad6
  9. 10 5月, 2020 5 次提交
    • W
      block: rename blk_mq_alloc_rq_maps · 79fab528
      Weiping Zhang 提交于
      rename blk_mq_alloc_rq_maps to blk_mq_alloc_map_and_requests,
      this function allocs both map and request, make function name align
      with funtion.
      Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      79fab528
    • W
      block: rename __blk_mq_alloc_rq_map · 03b63b02
      Weiping Zhang 提交于
      rename __blk_mq_alloc_rq_map to __blk_mq_alloc_map_and_request,
      actually it alloc both map and request, make function name
      align with function.
      Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      03b63b02
    • M
      block: alloc map and request for new hardware queue · fd689871
      Ming Lei 提交于
      Alloc new map and request for new hardware queue when increse
      hardware queue count. Before this patch, it will show a
      warning for each new hardware queue, but it's not enough, these
      hctx have no maps and reqeust, when a bio was mapped to these
      hardware queue, it will trigger kernel panic when get request
      from these hctx.
      
      Test environment:
       * A NVMe disk supports 128 io queues
       * 96 cpus in system
      
      A corner case can always trigger this panic, there are 96
      io queues allocated for HCTX_TYPE_DEFAULT type, the corresponding kernel
      log: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
      queues to 96, then nvme will alloc others(32) queues for read, but
      blk_mq_update_nr_hw_queues does not alloc map and request for these new
      added io queues. So when process read nvme disk, it will trigger kernel
      panic when get request from these hardware context.
      
      Reproduce script:
      
      nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
      echo $nr > /sys/module/nvme/parameters/write_queues
      echo 1 > /sys/block/nvme0n1/device/reset_controller
      dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
      
      [ 8040.805626] ------------[ cut here ]------------
      [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
      [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
      ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
      tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
      cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
      rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
      [ 8040.805637]  ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
      [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G        W         5.6.0-rc5.78317c+ #2
      [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
      [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
      [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
      [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
      [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
      [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
      [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
      [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
      [ 8040.805645] FS:  0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
      [ 8040.805646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
      [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 8040.805647] PKRU: 55555554
      [ 8040.805647] Call Trace:
      [ 8040.805649]  blk_mq_update_nr_hw_queues+0x31b/0x390
      [ 8040.805650]  nvme_reset_work+0xb4b/0xeab [nvme]
      [ 8040.805651]  process_one_work+0x1a7/0x370
      [ 8040.805652]  worker_thread+0x1c9/0x380
      [ 8040.805653]  ? max_active_store+0x80/0x80
      [ 8040.805655]  kthread+0x112/0x130
      [ 8040.805656]  ? __kthread_parkme+0x70/0x70
      [ 8040.805657]  ret_from_fork+0x35/0x40
      [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
      [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
      [ 8229.365165] #PF: supervisor read access in kernel mode
      [ 8229.365178] #PF: error_code(0x0000) - not-present page
      [ 8229.365191] PGD 0 P4D 0
      [ 8229.365201] Oops: 0000 [#1] SMP PTI
      [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G        W         5.6.0-rc5.78317c+ #2
      [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
      [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
      [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
      [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
      [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
      [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
      [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
      [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
      [ 8229.365397] FS:  00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
      [ 8229.365415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
      [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 8229.365476] PKRU: 55555554
      [ 8229.365484] Call Trace:
      [ 8229.365498]  ? finish_wait+0x80/0x80
      [ 8229.365512]  blk_mq_get_request+0xcb/0x3f0
      [ 8229.365525]  blk_mq_make_request+0x143/0x5d0
      [ 8229.365538]  generic_make_request+0xcf/0x310
      [ 8229.365553]  ? scan_shadow_nodes+0x30/0x30
      [ 8229.365564]  submit_bio+0x3c/0x150
      [ 8229.365576]  mpage_readpages+0x163/0x1a0
      [ 8229.365588]  ? blkdev_direct_IO+0x490/0x490
      [ 8229.365601]  read_pages+0x6b/0x190
      [ 8229.365612]  __do_page_cache_readahead+0x1c1/0x1e0
      [ 8229.365626]  ondemand_readahead+0x182/0x2f0
      [ 8229.365639]  generic_file_buffered_read+0x590/0xab0
      [ 8229.365655]  new_sync_read+0x12a/0x1c0
      [ 8229.365666]  vfs_read+0x8a/0x140
      [ 8229.365676]  ksys_read+0x59/0xd0
      [ 8229.365688]  do_syscall_64+0x55/0x1d0
      [ 8229.365700]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
      Tested-by: NWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fd689871
    • W
      block: save previous hardware queue count before udpate · a2584e43
      Weiping Zhang 提交于
      blk_mq_realloc_tag_set_tags will update set->nr_hw_queues, so
      save old set->nr_hw_queues before call this function.
      Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a2584e43
    • W
      block: free both rq_map and request · 2e194422
      Weiping Zhang 提交于
      Allocation:
      
      __blk_mq_alloc_rq_map
      	blk_mq_alloc_rq_map
      		blk_mq_alloc_rq_map
      			tags = blk_mq_init_tags : kzalloc_node:
      			tags->rqs = kcalloc_node
      			tags->static_rqs = kcalloc_node
      	blk_mq_alloc_rqs
      		p = alloc_pages_node
      		tags->static_rqs[i] = p + offset;
      
      Free:
      
      blk_mq_free_rq_map
      	kfree(tags->rqs);
      	kfree(tags->static_rqs);
      	blk_mq_free_tags
      		kfree(tags);
      
      The page allocated in blk_mq_alloc_rqs cannot be released,
      so we should use blk_mq_free_map_and_requests here.
      
      blk_mq_free_map_and_requests
      	blk_mq_free_rqs
      		__free_pages : cleanup for blk_mq_alloc_rqs
      	blk_mq_free_rq_map : cleanup for blk_mq_alloc_rq_map
      Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2e194422
  10. 25 4月, 2020 1 次提交
  11. 23 4月, 2020 2 次提交
  12. 21 4月, 2020 2 次提交
  13. 16 4月, 2020 1 次提交
  14. 07 4月, 2020 1 次提交
  15. 28 3月, 2020 2 次提交
  16. 25 3月, 2020 1 次提交
  17. 12 3月, 2020 1 次提交
  18. 10 3月, 2020 1 次提交
    • B
      blk-mq: Fix a recently introduced regression in blk_mq_realloc_hw_ctxs() · d0930bb8
      Bart Van Assche 提交于
      q->nr_hw_queues must only be updated once it is known that
      blk_mq_realloc_hw_ctxs() has succeeded. Otherwise it can happen that
      reallocation fails and that q->nr_hw_queues is larger than the number of
      allocated hardware queues. This patch fixes the following crash if
      increasing the number of hardware queues fails:
      
      BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x775/0x810
      Write of size 8 at addr 0000000000000118 by task check/977
      
      CPU: 3 PID: 977 Comm: check Not tainted 5.6.0-rc1-dbg+ #8
      Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      Call Trace:
       dump_stack+0xa5/0xe6
       __kasan_report.cold+0x65/0x99
       kasan_report+0x16/0x20
       check_memory_region+0x140/0x1b0
       memset+0x28/0x40
       blk_mq_map_swqueue+0x775/0x810
       blk_mq_update_nr_hw_queues+0x468/0x710
       nullb_device_submit_queues_store+0xf7/0x1a0 [null_blk]
       configfs_write_file+0x1c4/0x250 [configfs]
       __vfs_write+0x4c/0x90
       vfs_write+0x145/0x2c0
       ksys_write+0xd7/0x180
       __x64_sys_write+0x47/0x50
       do_syscall_64+0x6f/0x2f0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: ac0d6b92 ("block: Reduce the amount of memory required per request queue")
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Johannes Thumshirn <jth@kernel.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d0930bb8