1. 13 May, 2020 3 commits
    • M
      block: only define 'nr_sects_seq' in hd_part for 32bit SMP · 07c4e1e8
      Ming Lei committed
      The seqcount of 'nr_sects_seq' is only needed in case of 32bit SMP,
      so define it just for 32bit SMP.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Cc: Yufen Yu <yuyufen@huawei.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      07c4e1e8
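To illustrate why commit 07c4e1e8 above ties the sequence counter to 32-bit SMP, here is a minimal user-space sketch using C11 atomics. It assumes a 64-bit sector count that has to be stored as two 32-bit halves, which is the situation where a reader could observe a torn value. The names `part_sketch`, `part_nr_sects_write_sketch` and `part_nr_sects_read_sketch` are hypothetical and are not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a partition's size field (not the kernel's hd_struct). */
struct part_sketch {
    _Atomic unsigned int seq;   /* even: stable, odd: write in progress */
    _Atomic uint32_t lo, hi;    /* 64-bit sector count kept as two 32-bit halves */
};

/* Writer: bump the counter before and after updating the two halves. */
static void part_nr_sects_write_sketch(struct part_sketch *p, uint64_t v)
{
    atomic_fetch_add(&p->seq, 1u);               /* counter becomes odd */
    atomic_store(&p->lo, (uint32_t)v);
    atomic_store(&p->hi, (uint32_t)(v >> 32));
    atomic_fetch_add(&p->seq, 1u);               /* counter becomes even again */
}

/* Reader: retry until both halves were read under one stable, even counter value. */
static uint64_t part_nr_sects_read_sketch(struct part_sketch *p)
{
    unsigned int start;
    uint32_t lo, hi;

    do {
        start = atomic_load(&p->seq);
        lo = atomic_load(&p->lo);
        hi = atomic_load(&p->hi);
    } while ((start & 1u) || atomic_load(&p->seq) != start);

    return ((uint64_t)hi << 32) | lo;
}
```

On a 64-bit machine the value fits in one naturally atomic load, so the counter and the retry loop would only add overhead, which is consistent with the commit defining the field for 32-bit SMP only.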
    • M
      block: fix use-after-free on cached last_lookup partition · b7d6c303
      Ming Lei committed
      delete_partition() clears the cached last_lookup partition. However, the
      .last_lookup cache may be overwritten by one IO path after it is cleared
      in delete_partition(). Another IO path may then use the cached partition,
      which is being deleted, after hd_struct_free() is called, triggering a
      use-after-free on the cached partition.
      
      Fix the issue with the following approach:
      
      1) always get the partition's refcount via hd_struct_try_get() before
      setting .last_lookup
      
      2) move clearing .last_lookup from delete_partition() to hd_struct_free(),
      which is the release handler of the partition's percpu-refcount, so that no
      IO path can cache a partition that is being deleted via .last_lookup.
      
      This is an alternative to Yufen's patch [1], which adds overhead
      in the fast path through an indirect lookup that may introduce one extra
      cacheline in the IO path. This patch relies on percpu-refcount's protection,
      and it is easier to understand and verify.
      
      [1] https://lore.kernel.org/linux-block/20200109013551.GB9655@ming.t460p/T/#t
      
      Reported-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b7d6c303
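The two steps described in commit b7d6c303 above can be sketched in user space as follows. This is a simplified illustration, assuming a plain atomic counter in place of the kernel's percpu-refcount and ignoring RCU; `part_sketch`, `table_sketch`, `part_try_get_sketch`, `lookup_and_cache_sketch` and `part_put_sketch` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct part_sketch {
    atomic_int refs;                  /* 0 means the release handler has run */
};

struct table_sketch {
    struct part_sketch *last_lookup;  /* one-entry lookup cache */
};

/* Take a reference only if the partition is still live. */
static bool part_try_get_sketch(struct part_sketch *p)
{
    int old = atomic_load(&p->refs);

    while (old > 0) {
        if (atomic_compare_exchange_weak(&p->refs, &old, old + 1))
            return true;
    }
    return false;
}

/* Step 1: cache the partition only after a reference was obtained. */
static struct part_sketch *lookup_and_cache_sketch(struct table_sketch *t,
                                                   struct part_sketch *p)
{
    if (!part_try_get_sketch(p))
        return NULL;                  /* being deleted: never cache it */
    t->last_lookup = p;
    return p;                         /* caller now owns one reference */
}

/* Step 2: the cache is cleared in the release path, not in the delete path. */
static void part_put_sketch(struct table_sketch *t, struct part_sketch *p)
{
    if (atomic_fetch_sub(&p->refs, 1) == 1) {
        if (t->last_lookup == p)
            t->last_lookup = NULL;
    }
}
```

Because the cache is only written while a reference is held and only cleared once the count reaches zero, a lookup that arrives after the release can no longer repopulate the cache with a dying partition.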
    • W
      block: reset mapping if failed to update hardware queue count · aa880ad6
      Weiping Zhang committed
      When we increase the hardware queue count, blk_mq_update_queue_map will
      reset the mapping between cpu and hardware queue based on the hardware
      queue count (set->nr_hw_queues). The mapping is not reset if an error
      is encountered in blk_mq_realloc_hw_ctxs, but the fallback flow will
      continue using it, and then blk_mq_map_swqueue will touch invalid memory,
      because the mapping points to a wrong hctx.
      
      blktest block/030:
      
      null_blk: module loaded
      Increasing nr_hw_queues to 8 fails, fallback to 1
      ==================================================================
      BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
      Read of size 8 at addr 0000000000000128 by task nproc/8541
      
      CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
      Call Trace:
      dump_stack+0xa5/0xe6
      __kasan_report.cold+0x65/0xbb
      kasan_report+0x45/0x60
      check_memory_region+0x15e/0x1c0
      __kasan_check_read+0x15/0x20
      blk_mq_map_swqueue+0x2f2/0x830
      __blk_mq_update_nr_hw_queues+0x3df/0x690
      blk_mq_update_nr_hw_queues+0x32/0x50
      nullb_device_submit_queues_store+0xde/0x160 [null_blk]
      configfs_write_file+0x1c4/0x250 [configfs]
      __vfs_write+0x4c/0x90
      vfs_write+0x14b/0x2d0
      ksys_write+0xdd/0x180
      __x64_sys_write+0x47/0x50
      do_syscall_64+0x6f/0x310
      entry_SYSCALL_64_after_hwframe+0x49/0xb3
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Tested-by: Bart van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      aa880ad6
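The failure mode in commit aa880ad6 above, a CPU-to-queue map built for the new count but used with the old one, can be sketched as follows. The round-robin mapping and the `realloc_fails` flag are assumptions made for illustration; `update_queue_map_sketch` and `update_nr_hw_queues_sketch` are hypothetical names, not blk-mq functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mapping policy: spread CPUs round-robin over the queues.
 * nr_queues must be non-zero. */
static void update_queue_map_sketch(unsigned int *map, unsigned int nr_cpus,
                                    unsigned int nr_queues)
{
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
        map[cpu] = cpu % nr_queues;
}

/* Try to switch to new_nr queues; on failure fall back to the previous count
 * and rebuild the map so that no CPU points at a queue that does not exist. */
static unsigned int update_nr_hw_queues_sketch(unsigned int *map,
                                               unsigned int nr_cpus,
                                               unsigned int *nr_hw_queues,
                                               unsigned int new_nr,
                                               bool realloc_fails)
{
    unsigned int prev = *nr_hw_queues;   /* saved before it is overwritten */

    *nr_hw_queues = new_nr;
    update_queue_map_sketch(map, nr_cpus, new_nr);

    if (realloc_fails) {
        *nr_hw_queues = prev;
        update_queue_map_sketch(map, nr_cpus, prev);   /* reset the mapping */
    }
    return *nr_hw_queues;
}
```

Without the second `update_queue_map_sketch()` call, a failed increase from 1 to 8 queues would leave CPUs mapped to queue indices 1..7 even though only queue 0 exists.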
  2. 10 May, 2020 9 commits
    • C
      bdi: remove the name field in struct backing_dev_info · 1cd925d5
      Christoph Hellwig committed
      The name is only printed for an unregistered bdi in writeback.  Use the
      device name there, as it is more useful anyway in the unlikely case that
      the warning triggers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1cd925d5
    • C
      bdi: simplify bdi_alloc · aef33c2f
      Christoph Hellwig committed
      Merge the _node vs normal version and drop the superfluous gfp_t argument.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      aef33c2f
    • C
      bdi: remove bdi_register_owner · 3c5d202b
      Christoph Hellwig committed
      Split out a new bdi_set_owner helper to set the owner, and move the policy
      for creating the bdi name back into genhd.c, where it belongs.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3c5d202b
    • W
      block: rename blk_mq_alloc_rq_maps · 79fab528
      Weiping Zhang committed
      Rename blk_mq_alloc_rq_maps to blk_mq_alloc_map_and_requests.
      This function allocates both the map and the requests, so make the
      function name match what the function does.
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      79fab528
    • W
      block: rename __blk_mq_alloc_rq_map · 03b63b02
      Weiping Zhang committed
      Rename __blk_mq_alloc_rq_map to __blk_mq_alloc_map_and_request.
      It actually allocates both the map and the request, so make the
      function name match what the function does.
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      03b63b02
    • M
      block: alloc map and request for new hardware queue · fd689871
      Ming Lei committed
      Allocate a new map and requests for each new hardware queue when the
      hardware queue count is increased. Before this patch, a warning was
      shown for each new hardware queue, but that is not enough: these
      hctxs have no map and no requests, so when a bio is mapped to one of
      these hardware queues, getting a request from the hctx triggers a
      kernel panic.
      
      Test environment:
       * A NVMe disk supports 128 io queues
       * 96 cpus in system
      
      A corner case can always trigger this panic. There are 96 io queues
      allocated for the HCTX_TYPE_DEFAULT type; the corresponding kernel
      log is: nvme nvme0: 96/0/0 default/read/poll queues. Now we set nvme write
      queues to 96, so nvme allocates the other (32) queues for read, but
      blk_mq_update_nr_hw_queues does not allocate a map and requests for these
      newly added io queues. So when a process reads the nvme disk, a kernel
      panic is triggered when getting a request from these hardware contexts.
      
      Reproduce script:
      
      nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
      echo $nr > /sys/module/nvme/parameters/write_queues
      echo 1 > /sys/block/nvme0n1/device/reset_controller
      dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1
      
      [ 8040.805626] ------------[ cut here ]------------
      [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
      [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfnetlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
      [ 8040.805637]  ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
      [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G        W         5.6.0-rc5.78317c+ #2
      [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
      [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 <0f> 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
      [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
      [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
      [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
      [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
      [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
      [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
      [ 8040.805645] FS:  0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
      [ 8040.805646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
      [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 8040.805647] PKRU: 55555554
      [ 8040.805647] Call Trace:
      [ 8040.805649]  blk_mq_update_nr_hw_queues+0x31b/0x390
      [ 8040.805650]  nvme_reset_work+0xb4b/0xeab [nvme]
      [ 8040.805651]  process_one_work+0x1a7/0x370
      [ 8040.805652]  worker_thread+0x1c9/0x380
      [ 8040.805653]  ? max_active_store+0x80/0x80
      [ 8040.805655]  kthread+0x112/0x130
      [ 8040.805656]  ? __kthread_parkme+0x70/0x70
      [ 8040.805657]  ret_from_fork+0x35/0x40
      [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
      [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
      [ 8229.365165] #PF: supervisor read access in kernel mode
      [ 8229.365178] #PF: error_code(0x0000) - not-present page
      [ 8229.365191] PGD 0 P4D 0
      [ 8229.365201] Oops: 0000 [#1] SMP PTI
      [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G        W         5.6.0-rc5.78317c+ #2
      [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
      [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef <44> 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
      [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
      [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
      [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
      [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
      [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
      [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
      [ 8229.365397] FS:  00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
      [ 8229.365415] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
      [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 8229.365476] PKRU: 55555554
      [ 8229.365484] Call Trace:
      [ 8229.365498]  ? finish_wait+0x80/0x80
      [ 8229.365512]  blk_mq_get_request+0xcb/0x3f0
      [ 8229.365525]  blk_mq_make_request+0x143/0x5d0
      [ 8229.365538]  generic_make_request+0xcf/0x310
      [ 8229.365553]  ? scan_shadow_nodes+0x30/0x30
      [ 8229.365564]  submit_bio+0x3c/0x150
      [ 8229.365576]  mpage_readpages+0x163/0x1a0
      [ 8229.365588]  ? blkdev_direct_IO+0x490/0x490
      [ 8229.365601]  read_pages+0x6b/0x190
      [ 8229.365612]  __do_page_cache_readahead+0x1c1/0x1e0
      [ 8229.365626]  ondemand_readahead+0x182/0x2f0
      [ 8229.365639]  generic_file_buffered_read+0x590/0xab0
      [ 8229.365655]  new_sync_read+0x12a/0x1c0
      [ 8229.365666]  vfs_read+0x8a/0x140
      [ 8229.365676]  ksys_read+0x59/0xd0
      [ 8229.365688]  do_syscall_64+0x55/0x1d0
      [ 8229.365700]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Tested-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fd689871
    • W
      block: save previous hardware queue count before udpate · a2584e43
      Weiping Zhang committed
      blk_mq_realloc_tag_set_tags will update set->nr_hw_queues, so
      save the old set->nr_hw_queues before calling this function.
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a2584e43
    • W
      block: free both rq_map and request · 2e194422
      Weiping Zhang committed
      Allocation:
      
      __blk_mq_alloc_rq_map
      	blk_mq_alloc_rq_map
      		blk_mq_alloc_rq_map
      			tags = blk_mq_init_tags : kzalloc_node:
      			tags->rqs = kcalloc_node
      			tags->static_rqs = kcalloc_node
      	blk_mq_alloc_rqs
      		p = alloc_pages_node
      		tags->static_rqs[i] = p + offset;
      
      Free:
      
      blk_mq_free_rq_map
      	kfree(tags->rqs);
      	kfree(tags->static_rqs);
      	blk_mq_free_tags
      		kfree(tags);
      
      The pages allocated in blk_mq_alloc_rqs are not released by
      blk_mq_free_rq_map, so we should use blk_mq_free_map_and_requests here.
      
      blk_mq_free_map_and_requests
      	blk_mq_free_rqs
      		__free_pages : cleanup for blk_mq_alloc_rqs
      	blk_mq_free_rq_map : cleanup for blk_mq_alloc_rq_map
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2e194422
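The allocation and free trees in commit 2e194422 above can be mirrored in a small user-space sketch. It assumes `calloc`/`free` in place of the kernel allocators and adds a counter so the pairing can be checked; all `*_sketch` names are hypothetical, and the field names simply follow the commit message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the tags object described in the commit message. */
struct tags_sketch {
    void **rqs;
    void **static_rqs;
    char *pages;               /* backing memory handed out through static_rqs[] */
    unsigned int depth;
};

static int live_allocs_sketch; /* outstanding allocations, for the demo only */

static void *alloc_counted_sketch(size_t n, size_t size)
{
    void *p = calloc(n, size);
    if (p)
        live_allocs_sketch++;
    return p;
}

static void free_counted_sketch(void *p)
{
    if (p) {
        live_allocs_sketch--;
        free(p);
    }
}

/* Frees the map side only: the two pointer arrays and the tags object. */
static void free_rq_map_sketch(struct tags_sketch *t)
{
    free_counted_sketch(t->rqs);
    free_counted_sketch(t->static_rqs);
    free_counted_sketch(t);
}

/* Map side: the tags object plus its two pointer arrays. */
static struct tags_sketch *alloc_rq_map_sketch(unsigned int depth)
{
    struct tags_sketch *t = alloc_counted_sketch(1, sizeof(*t));
    if (!t)
        return NULL;
    t->depth = depth;
    t->rqs = alloc_counted_sketch(depth, sizeof(void *));
    t->static_rqs = alloc_counted_sketch(depth, sizeof(void *));
    if (!t->rqs || !t->static_rqs) {
        free_rq_map_sketch(t);
        return NULL;
    }
    return t;
}

/* Request side: one backing allocation carved into per-request slots. */
static int alloc_rqs_sketch(struct tags_sketch *t, size_t rq_size)
{
    t->pages = alloc_counted_sketch(t->depth, rq_size);
    if (!t->pages)
        return -1;
    for (unsigned int i = 0; i < t->depth; i++)
        t->static_rqs[i] = t->pages + i * rq_size;
    return 0;
}

/* Frees the request side only. */
static void free_rqs_sketch(struct tags_sketch *t)
{
    free_counted_sketch(t->pages);
    t->pages = NULL;
}

/* The combined teardown: requests first, then the map. */
static void free_map_and_requests_sketch(struct tags_sketch *t)
{
    free_rqs_sketch(t);
    free_rq_map_sketch(t);
}
```

Calling only `free_rq_map_sketch()` after both allocation steps would leave the `pages` allocation outstanding, which is the leak the commit describes.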
    • Y
      bdi: use bdi_dev_name() to get device name · d51cfc53
      Yufen Yu committed
      Use the common interface bdi_dev_name() to get device name.
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      
      Add missing <linux/backing-dev.h> include BFQ
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d51cfc53
  3. 05 May, 2020 1 commit
    • T
      iocost: protect iocg->abs_vdebt with iocg->waitq.lock · 0b80f986
      Tejun Heo committed
      abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
      is and controls the activation of use_delay mechanism. Once a cgroup goes
      over budget from forced IOs, it has to pay it back with its future budget.
      The progress guarantee on debt paying comes from the iocg being active -
      active iocgs are processed by the periodic timer, which ensures that as time
      passes the debts dissipate and the iocg returns to normal operation.
      
      However, both iocg activation and vdebt handling are asynchronous and a
      sequence like the following may happen.
      
      1. The iocg is in the process of being deactivated by the periodic timer.
      
      2. A bio enters ioc_rqos_throttle() and calls iocg_activate(), which returns
         without doing anything because it still sees the iocg as active.
      
      3. The iocg is deactivated.
      
      4. The bio from #2 is over budget but needs to be forced. It increases
         abs_vdebt and goes over the threshold and enables use_delay.
      
      5. IO control is enabled for the iocg's subtree and now IOs are attributed
         to the descendant cgroups and the iocg itself no longer issues IOs.
      
      This leaves the iocg with stuck abs_vdebt - it has debt but is inactive and
      has no further IOs which can activate it. This can end up unduly punishing
      all the descendant cgroups.
      
      The usual throttling path has the same issue - the iocg must be active while
      throttled to ensure that future events will wake it up - and solves the
      problem by synchronizing the throttling path with a spinlock. abs_vdebt
      handling is another form of overage handling and shares a lot of
      characteristics including the fact that it isn't in the hottest path.
      
      This patch fixes the above and other possible races by strictly
      synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Vlad Dmitriev <vvd@fb.com>
      Cc: stable@vger.kernel.org # v5.4+
      Fixes: e1518f63 ("blk-iocost: Don't let merges push vtime into the future")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b80f986
  4. 01 May, 2020 3 commits
    • T
      blk-iocost: account for IO size when testing latencies · cd006509
      Tejun Heo committed
      On each IO completion, iocost decides whether the IO met or missed its latency
      target. Currently, the targets are fixed numbers per IO type. While this can be
      good enough for loose latency targets way higher than typical completion
      latencies, the effect of IO size makes it difficult to tighten the latency
      target - a target adequate for 4k IOs might be too tight for 512k IOs and
      vice-versa.
      
      iocost already has all the necessary information to account for different IO
      sizes when testing whether the latency target is met as iocost can calculate the
      size vtime cost of a given IO. This patch updates the completion path to
      calculate the size vtime cost of the IO, deduct the nsec equivalent from the
      observed latency and use the adjusted value to decide whether the target is met.
      
      This makes latency targets independent from IO size and enables determining
      adequate latency targets with fixed size fio runs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cd006509
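The size adjustment described in commit cd006509 above comes down to one subtraction before the comparison. The sketch below assumes the size cost has already been converted to nanoseconds and clamps the adjusted latency at zero; `latency_target_met_sketch` and the example numbers are hypothetical and not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether an IO met its latency target after deducting the part of
 * the observed latency that is explained by the IO's size.
 * All values are in nanoseconds.
 */
static bool latency_target_met_sketch(uint64_t observed_ns,
                                      uint64_t size_cost_ns,
                                      uint64_t target_ns)
{
    /* Clamp at zero so a large size cost cannot wrap the unsigned value. */
    uint64_t adjusted_ns = observed_ns > size_cost_ns
                               ? observed_ns - size_cost_ns
                               : 0;

    return adjusted_ns <= target_ns;
}
```

With an illustrative 95 us target, a large IO observed at 600 us whose size accounts for 520 us counts as meeting the target, while the same observation with no size deduction would count as a miss.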
    • T
      blk-iocost: switch to fixed non-auto-decaying use_delay · 54c52e10
      Tejun Heo committed
      The use_delay mechanism was introduced by blk-iolatency to hold memory
      allocators accountable for the reclaim and other shared IOs they cause. The
      duration of the delay is dynamically balanced between iolatency increasing the
      value on each target miss and it auto-decaying as time passes and threads get
      delayed on it.
      
      While this works well for iolatency, iocost's control model isn't compatible
      with it. There are no repeated "violation" events which can be balanced against
      auto-decaying. iocost instead knows how much a given cgroup is over budget and
      wants to prevent that cgroup from issuing IOs while over budget. Until now,
      iocost has been adding the cost of force-issued IOs. However, this doesn't
      reflect the amount which is already over budget and is simply not enough to
      counter the auto-decaying, allowing an anon-memory-leaking low-priority cgroup
      to go over its allotted share of IOs.
      
      As auto-decaying doesn't make much sense for iocost, this patch introduces a
      different mode of operation for use_delay - when blkcg_set_delay() is used
      instead of blkcg_add/use_delay(), the delay duration is not auto-decayed until it
      is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the
      delay duration synchronized to the budget overage amount.
      
      With this change, iocost can effectively police cgroups which generate
      a significant amount of force-issued IOs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      54c52e10
    • C
      block: remove the bd_openers checks in blk_drop_partitions · 10c70d95
      Christoph Hellwig committed
      When replacing the bd_super check with a bd_openers check I followed a
      logical conclusion, which turns out to be utterly wrong.  When a block device
      has bd_super set it has a mounted file system on it (although not every
      mounted file system sets bd_super), but that also implies it doesn't even
      have partitions to start with.
      
      So instead of trying to come up with a logical check for all openers,
      just remove the check entirely.
      
      Fixes: d3ef5536 ("block: fix busy device checking in blk_drop_partitions")
      Fixes: cb6b771b ("block: fix busy device checking in blk_drop_partitions again")
      Reported-by: Michal Koutný <mkoutny@suse.com>
      Reported-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      10c70d95
  5. 29 April, 2020 5 commits
  6. 25 April, 2020 2 commits
  7. 24 April, 2020 1 commit
    • S
      block: Limit number of items taken from the I/O scheduler in one go · 28d65729
      Salman Qazi committed
      Flushes bypass the I/O scheduler and get added to hctx->dispatch
      in blk_mq_sched_bypass_insert.  This can happen while a kworker is running
      hctx->run_work work item and is past the point in
      blk_mq_sched_dispatch_requests where hctx->dispatch is checked.
      
      The blk_mq_do_dispatch_sched call is not guaranteed to end in bounded time,
      because the I/O scheduler can feed an arbitrary number of commands.
      
      Since we have only one hctx->run_work, the commands waiting in
      hctx->dispatch will wait an arbitrary length of time for run_work to be
      rerun.
      
      A similar phenomenon exists with dispatches from the software queue.
      
      The solution is to poll hctx->dispatch in blk_mq_do_dispatch_sched and
      blk_mq_do_dispatch_ctx and return from the run_work handler and let it
      rerun.
      Signed-off-by: Salman Qazi <sqazi@google.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      28d65729
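The solution described in commit 28d65729 above, checking hctx->dispatch inside the scheduler dispatch loop and returning so the work item reruns, can be sketched as below. The counters stand in for the real lists, and `hctx_sketch` and `do_dispatch_sched_sketch` are hypothetical names; the loop is a simplified model, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a hardware context with two sources of requests. */
struct hctx_sketch {
    int sched_pending;     /* requests still held by the I/O scheduler */
    int dispatch_pending;  /* requests parked on the dispatch list, e.g. flushes */
};

/*
 * Pull requests from the scheduler, but stop as soon as something shows up
 * on the dispatch list. Returns true when the caller should rerun the work
 * item so that the dispatch list is drained first.
 */
static bool do_dispatch_sched_sketch(struct hctx_sketch *h, int *dispatched)
{
    while (h->sched_pending > 0) {
        if (h->dispatch_pending > 0)
            return true;           /* bail out and let run_work rerun */
        h->sched_pending--;
        (*dispatched)++;
    }
    return false;
}
```

The check bounds how long requests on the dispatch list can wait: at most one more scheduler request is handled after they arrive, rather than an arbitrary number.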
  8. 23 April, 2020 4 commits
  9. 21 April, 2020 12 commits
    • W
      blk-iocost: Fix error on iocost_ioc_vrate_adj · d6c8e949
      Waiman Long committed
      Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
      argument of the iocost_ioc_vrate_adj trace entry defined in
      include/trace/events/iocost.h leading to the following error:
      
        /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
        error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
         , u32[]* __tracepoint_arg_missed_ppm
      
      That argument type is indeed rather complex and hard to read. Looking
      at block/blk-iocost.c, it is just a 2-entry u32 array. By simplifying
      the argument to a simple "u32 *missed_ppm" and adjusting the trace
      entry accordingly, the compilation error goes away.
      
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d6c8e949
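The two argument types discussed in commit d6c8e949 above differ only in how the same two-element array is passed. A minimal sketch, with hypothetical helper names and a local `u32` typedef standing in for the kernel type:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;   /* local stand-in for the kernel's u32 */

/* Before: a pointer to a whole 2-entry array, written "u32 (*missed_ppm)[2]". */
static u32 sum_missed_ppm_array_ptr_sketch(u32 (*missed_ppm)[2])
{
    return (*missed_ppm)[0] + (*missed_ppm)[1];
}

/* After: a plain pointer to the first element, written "u32 *missed_ppm". */
static u32 sum_missed_ppm_ptr_sketch(u32 *missed_ppm)
{
    return missed_ppm[0] + missed_ppm[1];
}
```

A caller holding `u32 m[2]` passes `&m` to the first form and `m` to the second; both read the same memory, but the second declaration is far easier for tooling that parses the prototype textually.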
    • C
      block: fold bdev_unhash_inode into invalidate_partition · 9bc5c397
      Christoph Hellwig committed
      invalidate_partition and bdev_unhash_inode are always paired, and
      invalidate_partition already does an icache lookup for the block device
      inode.  Piggy back on that to remove the inode from the hash.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9bc5c397
    • C
      block: mark invalidate_partition static · 02d33b67
      Christoph Hellwig committed
      invalidate_partition is only used in genhd.c, so mark it static.  Also
      drop the return value given that it is always ignored.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      02d33b67
    • C
      block: simplify block device syncing in bdev_del_partition · d5f3178e
      Christoph Hellwig committed
      We just checked a little above that the block device for the partition
      is not busy.  That implies no file system is mounted, and thus the only
      thing in fsync_bdev that actually is used is sync_blockdev.  Just call
      sync_blockdev directly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d5f3178e
    • C
      block: don't call invalidate_partition from blk_drop_partitions · e669c1da
      Christoph Hellwig committed
      Given that the device must not be busy, most of the calls from
      invalidate_partition that are related to file system metadata are
      guaranteed to not happen.  Just open code the calls to sync_blockdev
      and invalidate_bdev instead.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e669c1da
    • C
      dasd: use blk_drop_partitions instead of badly reimplementing it · 21be6cdc
      Christoph Hellwig committed
      Use the blk_drop_partitions function instead of messing around with
      ioctls that get kernel pointers.  For this blk_drop_partitions needs
      to be exported, which it normally shouldn't - make an exception for
      s390 only.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      21be6cdc
    • C
      block: remove the disk argument from blk_drop_partitions · d46430bf
      Christoph Hellwig committed
      The gendisk can be trivially deduced from the block_device.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d46430bf
    • C
      block: remove hd_struct_kill · 4377b48d
      Christoph Hellwig committed
      The function has a single caller, so just open code it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4377b48d
    • C
      block: cleanup hd_struct freeing · 8da2892e
      Christoph Hellwig committed
      Move hd_ref_init out of line as it isn't anywhere near a fast path,
      and rename the rcu ref freeing callbacks to be more descriptive.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8da2892e
    • C
      block: pass a hd_struct to delete_partition · cddae808
      Christoph Hellwig committed
      All callers have the hd_struct at hand, so pass it instead of performing
      another lookup.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cddae808
    • C
      block: refactor blkpg_ioctl · fa9156ae
      Christoph Hellwig committed
      Split each sub-command out into a separate helper, and move those helpers
      to block/partitions/core.c instead of having a lot of partition
      manipulation logic open coded in block/ioctl.c.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fa9156ae
    • D
      blk-mq: Rerun dispatching in the case of budget contention · a0823421
      Douglas Anderson committed
      If ever a thread running blk-mq code tries to get budget and fails it
      immediately stops doing work and assumes that whenever budget is freed
      up that queues will be kicked and whatever work the thread was trying
      to do will be tried again.
      
      One path where budget is freed and queues are kicked in the normal
      case can be seen in scsi_finish_command().  Specifically:
      - scsi_finish_command()
        - scsi_device_unbusy()
          - # Decrement "device_busy", AKA release budget
        - scsi_io_completion()
          - scsi_end_request()
            - blk_mq_run_hw_queues()
      
      The above is all well and good.  The problem comes up when a thread
      claims the budget but then releases it without actually dispatching
      any work.  Since we didn't schedule any work we'll never run the path
      of finishing work / kicking the queues.
      
      This isn't often actually a problem which is why this issue has
      existed for a while and nobody noticed.  Specifically we only get into
      this situation when we unexpectedly found that we weren't going to do
      any work.  Code that later receives new work kicks the queues.  All
      good, right?
      
      The problem shows up, however, if timing is just wrong and we hit a
      race.  To see this race let's think about the case where we only have
      a budget of 1 (only one thread can hold budget).  Now imagine that a
      thread got budget and then decided not to dispatch work.  It's about
      to call put_budget() but then the thread gets context switched out for
      a long, long time.  While in this state, any and all kicks of the
      queue (like the one when we received new work) will be no-ops because
      nobody can get budget.  Finally the thread holding budget gets to run
      again and returns.  All the normal kicks will have been no-ops and we
      have an I/O stall.
      
      As you can see from the above, you need just the right timing to see
      the race.  To start with, the only case where it happens is if we thought we
      had work, actually managed to get the budget, but then actually didn't
      have work.  That's pretty rare to start with.  Even then, there's
      usually a very small amount of time between realizing that there's no
      work and putting the budget.  During this small amount of time new
      work has to come in and the queue kick has to make it all the way to
      trying to get the budget and fail.  It's pretty unlikely.
      
      One case where this could have failed is illustrated by an example of
      threads running blk_mq_do_dispatch_sched():
      
      * Threads A and B both run has_work() at the same time with the same
        "hctx".  Imagine has_work() is exact.  There's no lock, so it's OK
        if Thread A and B both get back true.
      * Thread B gets interrupted for a long time right after it decides
        that there is work.  Maybe its CPU gets an interrupt and the
        interrupt handler is slow.
      * Thread A runs, gets budget, dispatches work.
      * Thread A's work finishes and budget is released.
      * Thread B finally runs again and gets budget.
      * Since Thread A already took care of the work and no new work has
        come in, Thread B will get NULL from dispatch_request().  I believe
        this is specifically why dispatch_request() is allowed to return
        NULL in the first place if has_work() must be exact.
      * Thread B will now be holding the budget and is about to call
        put_budget(), but hasn't called it yet.
      * Thread B gets interrupted for a long time (again).  Dang interrupts.
      * Now Thread C (maybe with a different "hctx" but the same queue)
        comes along and runs blk_mq_do_dispatch_sched().
      * Thread C won't do anything because it can't get budget.
      * Finally Thread B will run again and put the budget without kicking
        any queues.
      
      Even though the example above is with blk_mq_do_dispatch_sched() I
      believe the race is possible any time someone is holding budget but
      doesn't do work.
      
      Unfortunately, the unlikely has become more likely if you happen to be
      using the BFQ I/O scheduler.  BFQ, by design, sometimes returns "true"
      for has_work() but then NULL for dispatch_request() and stays in this
      state for a while (currently up to 9 ms).  Suddenly you only need one
      race to hit, not two races in a row.  With my current setup this is
      easy to reproduce in reboot tests and traces have actually shown that
      we hit a race similar to the one described above.
      
      Note that we only need to fix blk_mq_do_dispatch_sched() and
      blk_mq_do_dispatch_ctx() and not the other places that put budget.  In
      other cases we know that we have work to do on at least one "hctx" and
      code already exists to kick that "hctx"'s queue.  When that work
      finally finishes all the queues will be kicked using the normal flow.
      
      One last note is that (at least in the SCSI case) budget is shared by
      all "hctx"s that have the same queue.  Thus we need to make sure to
      kick the whole queue, not just re-run dispatching on a single "hctx".
      Signed-off-by: Douglas Anderson <dianders@chromium.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a0823421
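The fix described in commit a0823421 above, kicking the whole queue whenever budget is released without any work having been dispatched, can be modelled with two counters. This is a single-threaded sketch under simplifying assumptions; `queue_sketch`, `get_budget_sketch`, `put_budget_sketch` and `dispatch_once_sketch` are hypothetical names, and `kicks` merely counts how often a rerun would be requested.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a queue whose budget is shared by all its hctxs. */
struct queue_sketch {
    int budget;   /* remaining budget; 1 means a single holder at a time */
    int kicks;    /* number of times a rerun of the whole queue was requested */
};

static bool get_budget_sketch(struct queue_sketch *q)
{
    if (q->budget <= 0)
        return false;
    q->budget--;
    return true;
}

static void put_budget_sketch(struct queue_sketch *q)
{
    q->budget++;
}

/*
 * One dispatch attempt. have_request models whether dispatch_request()
 * returned a request after has_work() said there was work.
 */
static void dispatch_once_sketch(struct queue_sketch *q, bool have_request)
{
    if (!get_budget_sketch(q))
        return;              /* someone else holds budget; rely on their kick */

    if (!have_request) {
        put_budget_sketch(q);
        /* Budget was held without dispatching anything, so other threads
         * may have failed to get it in the meantime: kick the whole queue. */
        q->kicks++;
        return;
    }

    /* A request was dispatched; its budget is released on completion,
     * where the normal flow already kicks the queues. */
}
```

The kick is counted per queue rather than per hardware context, matching the commit's last note that budget is shared by all "hctx"s of the same queue.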