1. 02 September 2020, 8 commits
  2. 01 September 2020, 3 commits
  3. 24 August 2020, 1 commit
  4. 22 August 2020, 3 commits
    • blkcg: fix memleak for iolatency · 27029b4b
      Committed by Yufen Yu
      Normally, blkcg_iolatency_exit() frees the iolatency-related memory when
      the queue is cleaned up. But if blk_throtl_init() returns an error and
      queue initialization fails, blkcg_iolatency_exit() is never called, so
      that memory is leaked.
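
      A minimal userspace sketch of the pattern being fixed, assuming purely
      hypothetical helpers (this is not the blkcg code itself): when a later
      init step fails, the error path has to release what earlier steps already
      allocated, because the normal exit path never runs for a queue that
      failed to initialize.

        #include <stdlib.h>

        struct queue {
            void *iolatency;   /* stands in for the iolatency state */
        };

        /* hypothetical helpers, for illustration only */
        static int iolatency_init(struct queue *q)
        {
            q->iolatency = malloc(64);
            return q->iolatency ? 0 : -1;
        }

        static void iolatency_exit(struct queue *q)
        {
            free(q->iolatency);
            q->iolatency = NULL;
        }

        static int throtl_init(struct queue *q)
        {
            (void)q;
            return -1;         /* pretend this later step fails */
        }

        int queue_init(struct queue *q)
        {
            if (iolatency_init(q))
                return -1;
            if (throtl_init(q)) {
                /* the fix: undo the earlier step on this error path */
                iolatency_exit(q);
                return -1;
            }
            return 0;
        }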
      
      Fixes: d7067512 ("block: introduce blk-iolatency io controller")
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix get_max_io_size() · e4b469c6
      Committed by Keith Busch
      A previous commit aligning splits to physical block sizes inadvertently
      modified one return case such that it now returns 0-length splits when
      the number of sectors doesn't exceed the physical offset. This later hits
      a BUG in bio_split(). Restore the previous working behavior.
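
      An illustrative userspace sketch of the failure mode, with made-up names
      rather than the real block-layer helpers: when the split size is rounded
      down to the physical block boundary, a start sector that is not
      physically aligned can round the result down to 0, and the guard below
      falls back to logical-block alignment instead of returning a zero-length
      split. Both block sizes are assumed to be powers of two, in sectors.

        /* Illustrative only; not the kernel's get_max_io_size(). */
        static unsigned int max_io_sectors(unsigned int start_sector,
                                           unsigned int max_sectors,
                                           unsigned int pbs_sectors,
                                           unsigned int lbs_sectors)
        {
            unsigned int phys_offset = start_sector & (pbs_sectors - 1);
            unsigned int aligned = 0;

            if (max_sectors > phys_offset)
                aligned = (max_sectors - phys_offset) & ~(pbs_sectors - 1);

            /* The bug: returning 0 here produced zero-length splits and the
             * BUG() in bio_split(); fall back to logical-block alignment. */
            if (aligned == 0)
                aligned = max_sectors & ~(lbs_sectors - 1);

            return aligned;
        }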
      
      Fixes: 9cc5169c ("block: Improve physical block alignment of split bios")
      Reported-by: Eric Deal <eric.deal@wdc.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: insert request not through ->queue_rq into sw/scheduler queue · db03f88f
      Committed by Ming Lei
      c616cbee ("blk-mq: punt failed direct issue to dispatch list") supposed
      to add request which has been through ->queue_rq() to the hw queue dispatch
      list, however it adds request running out of budget or driver tag to hw queue
      too. This way basically bypasses request merge, and causes too many request
      dispatched to LLD, and system% is unnecessary increased.
      
      Fix this by inserting requests that have not been through ->queue_rq()
      into the sw/scheduler queue instead. This is safe because ->queue_rq()
      has not been called on such a request yet.
      
      High %system, and even soft lockups, can be observed on Azure storvsc
      devices. This patch reduces %system during heavy sequential IO and
      decreases the risk of soft lockups.
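
      A simplified decision sketch of the placement policy described above, in
      plain C with invented names (this is not the blk-mq API): only a request
      that actually went through ->queue_rq() and came back busy belongs on the
      hw dispatch list; a request that never reached ->queue_rq() can go back
      to the sw/scheduler queue, where it stays eligible for merging.

        /* Illustrative stand-ins, not real blk-mq identifiers. */
        enum issue_result {
            ISSUED_OK,             /* ->queue_rq() accepted the request      */
            ISSUED_BUSY,           /* ->queue_rq() returned a resource error */
            NOT_ISSUED_NO_BUDGET,  /* gave up before calling ->queue_rq()    */
            NOT_ISSUED_NO_TAG,
        };

        static void place_request(enum issue_result res,
                                  void (*add_to_dispatch_list)(void),
                                  void (*insert_into_sched_queue)(void))
        {
            switch (res) {
            case ISSUED_BUSY:
                /* Went through ->queue_rq(): keep it on the hw dispatch
                 * list so ordering with the driver is preserved. */
                add_to_dispatch_list();
                break;
            case NOT_ISSUED_NO_BUDGET:
            case NOT_ISSUED_NO_TAG:
                /* Never reached ->queue_rq(): safe to re-insert into the
                 * sw/scheduler queue, keeping it mergeable. */
                insert_into_sched_queue();
                break;
            case ISSUED_OK:
            default:
                break;
            }
        }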
      
      Fixes: c616cbee ("blk-mq: punt failed direct issue to dispatch list")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 18 August 2020, 2 commits
    • bfq: fix blkio cgroup leakage v4 · 2de791ab
      Committed by Dmitry Monakhov
      Changes from v1:
          - update commit description with proper ref-accounting justification
      
      Commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      introduces a leak of bfq_group and blkcg_gq objects because of a get/put
      imbalance.
      In fact, the whole idea of the original commit is wrong: the bfq_group
      entity cannot disappear under us, because it is referenced by its child
      bfq_queue entities from here:
       -> bfq_init_entity()
          ->bfqg_and_blkg_get(bfqg);
          ->entity->parent = bfqg->my_entity
      
       -> bfq_put_queue(bfqq)
          FINAL_PUT
          ->bfqg_and_blkg_put(bfqq_group(bfqq))
          ->kmem_cache_free(bfq_pool, bfqq);
      
      So the parent entity cannot disappear while a child entity is in the tree,
      and the child entities already have proper protection.
      This patch reverts commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree").
      
      bfq_group leak trace caused by bad commit:
      -> blkg_alloc
         -> bfq_pq_alloc
           -> bfqg_get (+1)
      ->bfq_activate_bfqq
        ->bfq_activate_requeue_entity
          -> __bfq_activate_entity
             ->bfq_get_entity
               ->bfqg_and_blkg_get (+1)  <==== : Note1
      ->bfq_del_bfqq_busy
        ->bfq_deactivate_entity+0x53/0xc0 [bfq]
          ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
            -> bfq_forget_entity(is_in_service = true)
      	 entity->on_st_or_in_serv = false   <=== :Note2
      	 if (is_in_service)
      	     return;  ==> do not touch reference
      -> blkcg_css_offline
       -> blkcg_destroy_blkgs
        -> blkg_destroy
         -> bfq_pd_offline
          -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because (Note2)
      		return false;
       -> bfq_pd_free
          -> bfqg_put() (-1, but bfqg->ref == 2) because of (Note2)
      So bfq_group and blkcg_gq will leak forever; see the test case below.
      
      ##TESTCASE_BEGIN:
      #!/bin/bash
      
      max_iters=${1:-100}
      #prep cgroup mounts
      mount -t tmpfs cgroup_root /sys/fs/cgroup
      mkdir /sys/fs/cgroup/blkio
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      
      # Prepare blkdev
      grep blkio /proc/cgroups
      truncate -s 1M img
      losetup /dev/loop0 img
      echo bfq > /sys/block/loop0/queue/scheduler
      
      grep blkio /proc/cgroups
      for ((i=0;i<max_iters;i++))
      do
          mkdir -p /sys/fs/cgroup/blkio/a
          echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
          dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
          echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
          rmdir /sys/fs/cgroup/blkio/a
          grep blkio /proc/cgroups
      done
      ##TESTCASE_END:
      
      Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Fix page_is_mergeable() for compound pages · d8166519
      Committed by Matthew Wilcox (Oracle)
      If we pass in an offset which is larger than PAGE_SIZE, then
      page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
      leading to a large number of bio_vecs being used.  Use a slightly more
      obvious test that the two pages are compatible with each other.
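
      A userspace sketch of the idea, with invented types rather than the real
      struct page machinery: instead of requiring the new segment to land in
      the very same page, the check only needs the new byte to be physically
      contiguous with the end of the previous segment, which also holds when a
      large offset walks into a later page of a compound page.

        #include <stdbool.h>

        /* Toy model: a segment is described by the physical address of its
         * first page plus a byte offset that may exceed the page size for
         * compound pages. All names are illustrative. */
        struct seg {
            unsigned long page_phys;  /* physical address of the first page */
            unsigned long offset;     /* byte offset from that page         */
            unsigned long len;
        };

        /* Mergeable if the new byte starts exactly where the previous
         * segment ends physically, whichever page that happens to be in. */
        static bool segs_mergeable(const struct seg *prev,
                                   unsigned long page_phys,
                                   unsigned long offset)
        {
            unsigned long prev_end = prev->page_phys + prev->offset + prev->len;

            return page_phys + offset == prev_end;
        }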
      
      Fixes: 52d52d1c ("block: only allow contiguous page structs in a bio_vec")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 17 August 2020, 4 commits
    • block: respect queue limit of max discard segment · 943b40c8
      Committed by Ming Lei
      When queue_max_discard_segments(q) is 1, blk_discard_mergable() returns
      false for a discard request, so the normal request-merge path is applied.
      However, that path only checks queue_max_segments(), so the max discard
      segment limit isn't respected.

      Check the max discard segment limit in the request merge code to fix the
      issue.

      This fixes discard request failures on virtio_blk.
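
      A minimal sketch of the limit selection described above, using an
      invented limits struct rather than the real request_queue: when counting
      segments during a merge, a discard request must be checked against the
      discard segment limit instead of the generic one.

        #include <stdbool.h>

        /* Illustrative stand-in for the queue limits involved. */
        struct limits {
            unsigned int max_segments;
            unsigned int max_discard_segments;
        };

        static unsigned int max_segments_for(const struct limits *l,
                                             bool is_discard)
        {
            /* The fix: bound a discard request by the discard segment
             * limit; other requests keep the generic segment limit. */
            return is_discard ? l->max_discard_segments : l->max_segments;
        }

        static bool merge_allowed(const struct limits *l, bool is_discard,
                                  unsigned int nr_segments_after_merge)
        {
            return nr_segments_after_merge <= max_segments_for(l, is_discard);
        }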
      
      Fixes: 69840466 ("block: fix the DISCARD request merge")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART · d7d8535f
      Committed by Ming Lei
      The SCHED_RESTART code path is relied on to re-run the queue for requests
      left in hctx->dispatch. Meanwhile, the SCHED_RESTART flag is checked when
      adding requests to hctx->dispatch.

      Memory barriers have to be used to order the following two pairs of
      operations:

      1) adding requests to hctx->dispatch and checking SCHED_RESTART in
      blk_mq_dispatch_rq_list()

      2) clearing SCHED_RESTART and checking whether there is a request in
      hctx->dispatch in blk_mq_sched_restart().

      Without the added memory barriers, either:

      1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
      blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not re-run the
      queue on the dispatch side,

      or

      2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not re-run
      the queue on the dispatch side, while the check for requests in
      hctx->dispatch in blk_mq_sched_restart() misses them.
      
      An IO hang in the ltp/fs_fill test was reported by the kernel test robot:

      	https://lkml.org/lkml/2020/7/26/77

      It turns out to be caused by the above out-of-order operations, and the
      IO hang can no longer be observed after applying this patch.
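
      The ordering problem is the classic store-buffer pattern. Here is a
      self-contained C11 sketch (not the kernel code; the names are invented)
      of why each side needs a full barrier between its store and its
      subsequent load:

        #include <stdatomic.h>
        #include <stdbool.h>

        /* Toy model of the two pieces of state involved. */
        static atomic_bool dispatch_list_nonempty;
        static atomic_bool sched_restart_set = true;

        /* Side 1: roughly blk_mq_dispatch_rq_list() adding leftovers. */
        static bool dispatch_side_reruns_queue(void)
        {
            atomic_store_explicit(&dispatch_list_nonempty, true,
                                  memory_order_relaxed);
            /* Full barrier: order the list update before reading the flag. */
            atomic_thread_fence(memory_order_seq_cst);
            /* If SCHED_RESTART is still set, rely on the restart path. */
            return !atomic_load_explicit(&sched_restart_set,
                                         memory_order_relaxed);
        }

        /* Side 2: roughly blk_mq_sched_restart() clearing the flag. */
        static bool restart_side_reruns_queue(void)
        {
            atomic_store_explicit(&sched_restart_set, false,
                                  memory_order_relaxed);
            /* Full barrier: order the flag clear before checking the list. */
            atomic_thread_fence(memory_order_seq_cst);
            return atomic_load_explicit(&dispatch_list_nonempty,
                                        memory_order_relaxed);
        }

      With both fences in place, at least one of the two sides observes the
      other's store and re-runs the queue; without them, both loads may see
      stale values and the requests left in hctx->dispatch are never run.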
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bsg-lib: convert comma to semicolon · 03ef5941
      Committed by Xu Wang
      Replace a comma between expression statements by a semicolon.
      Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: blk-mq.c: fix @at_head kernel-doc warning · 26bfeb26
      Committed by Randy Dunlap
      Fix a kernel-doc warning in block/blk-mq.c:
      
      ../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'
      
      Fixes: 01e99aec ("blk-mq: insert passthrough request into hctx->dispatch directly")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: André Almeida <andrealmeid@collabora.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 12 August 2020, 1 commit
  8. 06 August 2020, 1 commit
    • block: check queue's limits.discard_granularity in __blkdev_issue_discard() · b35fd742
      Committed by Coly Li
      When creating a loop device with a backing NVMe SSD, the current loop
      device driver doesn't correctly set its queue's limits.discard_granularity
      and leaves it as 0. For a discard request at LBA 0 on this loop device,
      the req_sects calculated in __blkdev_issue_discard() will be 0, and the
      resulting zero-length discard request triggers a BUG() panic in generic
      block layer code at block/blk-mq.c:563.
      
      [  955.565006][   C39] ------------[ cut here ]------------
      [  955.559660][   C39] invalid opcode: 0000 [#1] SMP NOPTI
      [  955.622171][   C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G            E     5.8.0-default+ #40
      [  955.622171][   C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
      [  955.622175][   C39] RIP: 0010:blk_mq_end_request+0x107/0x110
      [  955.622177][   C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
      [  955.622179][   C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
      [  955.749277][   C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
      [  955.749278][   C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
      [  955.749279][   C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [  955.749279][   C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
      [  955.749280][   C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
      [  955.749281][   C39] FS:  0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
      [  955.749282][   C39] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  955.749282][   C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
      [  955.749283][   C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  955.749284][   C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  955.749284][   C39] PKRU: 55555554
      [  955.749285][   C39] Call Trace:
      [  955.749290][   C39]  blk_done_softirq+0x99/0xc0
      [  957.550669][   C39]  __do_softirq+0xd3/0x45f
      [  957.550677][   C39]  ? smpboot_thread_fn+0x2f/0x1e0
      [  957.550679][   C39]  ? smpboot_thread_fn+0x74/0x1e0
      [  957.550680][   C39]  ? smpboot_thread_fn+0x14e/0x1e0
      [  957.550684][   C39]  run_ksoftirqd+0x30/0x60
      [  957.550687][   C39]  smpboot_thread_fn+0x149/0x1e0
      [  957.886225][   C39]  ? sort_range+0x20/0x20
      [  957.886226][   C39]  kthread+0x137/0x160
      [  957.886228][   C39]  ? kthread_park+0x90/0x90
      [  957.886231][   C39]  ret_from_fork+0x22/0x30
      [  959.117120][   C39] ---[ end trace 3dacdac97e2ed164 ]---
      
      This is the procedure to reproduce the panic:
        # modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
        # losetup -f /dev/nvme0n1 --direct-io=on
        # blkdiscard /dev/loop0 -o 0 -l 0x200
      
      This patch fixes the issue by checking q->limits.discard_granularity in
      __blkdev_issue_discard() before composing the discard bio. If the value
      is 0, it prints a warning and returns -EOPNOTSUPP to the caller to
      indicate that this buggy device driver doesn't support discard requests.
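
      A hedged userspace sketch of the added guard, with invented names rather
      than the real __blkdev_issue_discard() (the error value matches the
      description above; everything else is illustrative):

        #include <errno.h>
        #include <stdio.h>

        /* Illustrative only; not the kernel helper. */
        static int issue_discard_sketch(unsigned long long sector,
                                        unsigned long long nr_sects,
                                        unsigned int granularity_sects)
        {
            if (nr_sects == 0)
                return -EINVAL;

            /* The added guard: a driver that left discard_granularity at 0
             * cannot be handed a discard, otherwise the granularity-based
             * rounding can compute a zero-length request and trip the
             * BUG() in the block layer. */
            if (granularity_sects == 0) {
                fprintf(stderr, "queue reports discard_granularity 0\n");
                return -EOPNOTSUPP;
            }

            /* ... with a sane granularity, split and submit the range ... */
            (void)sector;
            return 0;
        }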
      
      Fixes: 9b15d109 ("block: improve discard bio alignment in __blkdev_issue_discard()")
      Fixes: c52abf56 ("loop: Better discard support for block devices")
      Reported-and-suggested-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Enzo Matsumiya <ematsumiya@suse.com>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Xiao Ni <xni@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 03 August 2020, 1 commit
    • block: don't do revalidate zones on invalid devices · 1a1206dc
      Committed by Johannes Thumshirn
      When we lose a device for whatever reason while (re)scanning zones, we
      trip over a NULL pointer in blk_revalidate_zone_cb, as in the following
      log:
      
      sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB)
      sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks
      sd 0:0:0:0: [sda] Write Protect is off
      sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
      sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed
      sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08
      sd 0:0:0:0: [sda] Sense Key : 0xb [current]
      sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
      sda: failed to revalidate zones
      sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B)
      sda: detected capacity change from 14000519643136 to 0
      ==================================================================
      BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550
      Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58
      
      CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
      Workqueue: events_unbound async_run_entry_fn
      Call Trace:
       dump_stack+0x7d/0xb0
       ? blk_revalidate_zone_cb+0x1b7/0x550
       kasan_report.cold+0x5/0x37
       ? blk_revalidate_zone_cb+0x1b7/0x550
       check_memory_region+0x145/0x1a0
       blk_revalidate_zone_cb+0x1b7/0x550
       sd_zbc_parse_report+0x1f1/0x370
       ? blk_req_zone_write_trylock+0x200/0x200
       ? sectors_to_logical+0x60/0x60
       ? blk_req_zone_write_trylock+0x200/0x200
       ? blk_req_zone_write_trylock+0x200/0x200
       sd_zbc_report_zones+0x3c4/0x5e0
       ? sd_dif_config_host+0x500/0x500
       blk_revalidate_disk_zones+0x231/0x44d
       ? _raw_write_lock_irqsave+0xb0/0xb0
       ? blk_queue_free_zone_bitmaps+0xd0/0xd0
       sd_zbc_read_zones+0x8cf/0x11a0
       sd_revalidate_disk+0x305c/0x64e0
       ? __device_add_disk+0x776/0xf20
       ? read_capacity_16.part.0+0x1080/0x1080
       ? blk_alloc_devt+0x250/0x250
       ? create_object.isra.0+0x595/0xa20
       ? kasan_unpoison_shadow+0x33/0x40
       sd_probe+0x8dc/0xcd2
       really_probe+0x20e/0xaf0
       __driver_attach_async_helper+0x249/0x2d0
       async_run_entry_fn+0xbe/0x560
       process_one_work+0x764/0x1290
       ? _raw_read_unlock_irqrestore+0x30/0x30
       worker_thread+0x598/0x12f0
       ? __kthread_parkme+0xc6/0x1b0
       ? schedule+0xed/0x2c0
       ? process_one_work+0x1290/0x1290
       kthread+0x36b/0x440
       ? kthread_create_worker_on_cpu+0xa0/0xa0
       ret_from_fork+0x22/0x30
      ==================================================================
      
      When the device is already gone, we end up with the following scenario:
      the device's capacity is 0 and thus the number of zones is 0 as well.
      When allocating the bitmap for the conventional zones, we then trip over
      a NULL pointer.

      So if we encounter a zoned block device with a capacity of 0, don't
      revalidate the zone sizes.
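
      A minimal sketch of that early-out, with invented names and an error
      code chosen only for illustration (this is not the actual
      blk_revalidate_disk_zones()):

        #include <errno.h>

        /* Illustrative stand-in for the zoned-disk state being revalidated. */
        struct zoned_disk {
            unsigned long long capacity_sectors;
            unsigned int nr_zones;
        };

        static int revalidate_zones_sketch(struct zoned_disk *d)
        {
            /* The fix: a capacity of 0 means the device is gone, so there
             * is nothing valid to revalidate; bailing out here avoids
             * walking a zero-zone layout into a NULL zone bitmap. */
            if (d->capacity_sectors == 0)
                return -ENODEV;

            /* ... normal zone revalidation would proceed here ... */
            return 0;
        }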
      
      Fixes: 6c6b3549 ("block: set the zone size in blk_revalidate_disk_zones atomically")
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 01 August 2020, 7 commits
  11. 31 July 2020, 1 commit
  12. 29 July 2020, 1 commit
  13. 28 July 2020, 1 commit
  14. 25 July 2020, 1 commit
    • scsi: block: pm: Simplify resume handling · 8f38f8e0
      Committed by Alan Stern
      Commit 05d18ae1 ("scsi: pm: Balance pm_only counter of request queue
      during system resume") fixed a problem in the block layer's runtime-PM
      code: blk_set_runtime_active() failed to call blk_clear_pm_only().
      However, the commit's implementation was awkward; it forced the SCSI
      system-resume handler to choose whether to call blk_post_runtime_resume()
      or blk_set_runtime_active(), depending on whether or not the SCSI device
      had previously been runtime suspended.
      
      This patch simplifies the situation considerably by adding the missing
      function call directly into blk_set_runtime_active() (under the condition
      that the queue is not already in the RPM_ACTIVE state).  This allows the
      SCSI routine to revert back to its original form.  Furthermore, making this
      change reveals that blk_post_runtime_resume() (in its success pathway) does
      exactly the same thing as blk_set_runtime_active().  The duplicate code is
      easily removed by making one routine call the other.
      
      No functional changes are intended.
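
      A rough sketch of the resulting shape, with simplified types, no locking,
      and a stubbed pm-only counter (the real functions also manage the queue's
      device pointer and PM state under the queue lock):

        /* Simplified, illustrative types; not the block layer's. */
        enum rpm_status { RPM_ACTIVE, RPM_SUSPENDED };

        struct queue_pm {
            enum rpm_status rpm_status;
            int pm_only;               /* stand-in for the pm_only counter */
        };

        static void set_runtime_active(struct queue_pm *q)
        {
            if (q->rpm_status == RPM_ACTIVE)
                return;                /* nothing to do */
            q->rpm_status = RPM_ACTIVE;
            /* the previously missing step: drop the pm-only reference */
            if (q->pm_only > 0)
                q->pm_only--;
        }

        static void post_runtime_resume(struct queue_pm *q, int resume_err)
        {
            if (!resume_err)
                /* success path: identical to set_runtime_active(),
                 * so just call it instead of duplicating the code */
                set_runtime_active(q);
            else
                q->rpm_status = RPM_SUSPENDED;
        }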
      
      Link: https://lore.kernel.org/r/20200706151436.GA702867@rowland.harvard.edu
      CC: Can Guo <cang@codeaurora.org>
      CC: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
  15. 21 July 2020, 3 commits
  16. 18 July 2020, 2 commits
    • blk-cgroup: show global disk stats in root cgroup io.stat · ef45fe47
      Committed by Boris Burkov
      In order to improve consistency and usability in cgroup stat accounting,
      we would like to support the root cgroup's io.stat.
      
      Since the root cgroup has processes doing io even if the system has no
      explicitly created cgroups, we need to be careful to avoid overhead in
      that case.  For that reason, the rstat algorithms don't handle the root
      cgroup, so just turning the file on wouldn't give correct statistics.
      
      To get around this, we simulate flushing the iostat struct by filling it
      out directly from global disk stats. The result is a root cgroup io.stat
      file consistent with both /proc/diskstats and io.stat.
      
      Note that in order to collect the disk stats, we needed to iterate over
      devices. To facilitate that, we had to change the linkage of a disk_type
      to external so that it can be used from blk-cgroup.c to iterate over
      disks.
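
      A hedged sketch of that simulated flush, with invented structures and a
      plain array standing in for the real per-disk iteration: instead of
      running the rstat machinery for the root cgroup, the per-field totals are
      summed straight out of global per-disk counters.

        #include <stddef.h>

        /* Illustrative stand-ins, not the kernel's blkg_iostat/disk stats. */
        struct disk_counters {
            unsigned long long rbytes, wbytes, rios, wios;
        };

        struct iostat_totals {
            unsigned long long rbytes, wbytes, rios, wios;
        };

        /* Fill the root cgroup's io.stat totals directly from global disk
         * stats rather than flushing rstat, which skips the root cgroup. */
        static void fill_root_iostat(struct iostat_totals *out,
                                     const struct disk_counters *disks,
                                     size_t nr_disks)
        {
            size_t i;

            out->rbytes = out->wbytes = out->rios = out->wios = 0;
            for (i = 0; i < nr_disks; i++) {
                out->rbytes += disks[i].rbytes;
                out->wbytes += disks[i].wbytes;
                out->rios   += disks[i].rios;
                out->wios   += disks[i].wios;
            }
        }
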
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-cgroup: make iostat functions visible to stat printing · cd1fc4b9
      Committed by Boris Burkov
      Previously, the code which printed io.stat only needed access to the
      generic rstat flushing code, but since we plan to write some more
      specific code for preparing root cgroup stats, we need to manipulate
      iostat structs directly. Since declaring static functions ahead does not
      seem like common practice in this file, simply move the iostat functions
      up. We only plan to use blkg_iostat_set, but it seems better to keep them
      all together.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>