1. 24 Aug, 2020 1 commit
  2. 12 Aug, 2020 1 commit
  3. 06 Aug, 2020 1 commit
    • C
      block: check queue's limits.discard_granularity in __blkdev_issue_discard() · b35fd742
      Committed by Coly Li
      If a loop device is created with a backing NVMe SSD, the current loop
      device driver doesn't correctly set its queue's
      limits.discard_granularity and leaves it as 0. If a discard request is
      then issued at LBA 0 on this loop device, the req_sects calculated in
      __blkdev_issue_discard() will be 0, and the resulting zero-length
      discard request will trigger a BUG() panic in the generic block layer
      code at block/blk-mq.c:563.
      
      [  955.565006][   C39] ------------[ cut here ]------------
      [  955.559660][   C39] invalid opcode: 0000 [#1] SMP NOPTI
      [  955.622171][   C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G            E     5.8.0-default+ #40
      [  955.622171][   C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
      [  955.622175][   C39] RIP: 0010:blk_mq_end_request+0x107/0x110
      [  955.622177][   C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
      [  955.622179][   C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
      [  955.749277][   C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
      [  955.749278][   C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
      [  955.749279][   C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [  955.749279][   C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
      [  955.749280][   C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
      [  955.749281][   C39] FS:  0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
      [  955.749282][   C39] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  955.749282][   C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
      [  955.749283][   C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  955.749284][   C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  955.749284][   C39] PKRU: 55555554
      [  955.749285][   C39] Call Trace:
      [  955.749290][   C39]  blk_done_softirq+0x99/0xc0
      [  957.550669][   C39]  __do_softirq+0xd3/0x45f
      [  957.550677][   C39]  ? smpboot_thread_fn+0x2f/0x1e0
      [  957.550679][   C39]  ? smpboot_thread_fn+0x74/0x1e0
      [  957.550680][   C39]  ? smpboot_thread_fn+0x14e/0x1e0
      [  957.550684][   C39]  run_ksoftirqd+0x30/0x60
      [  957.550687][   C39]  smpboot_thread_fn+0x149/0x1e0
      [  957.886225][   C39]  ? sort_range+0x20/0x20
      [  957.886226][   C39]  kthread+0x137/0x160
      [  957.886228][   C39]  ? kthread_park+0x90/0x90
      [  957.886231][   C39]  ret_from_fork+0x22/0x30
      [  959.117120][   C39] ---[ end trace 3dacdac97e2ed164 ]---
      
      The panic can be reproduced with the following procedure:
        # modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
        # losetup -f /dev/nvme0n1 --direct-io=on
        # blkdiscard /dev/loop0 -o 0 -l 0x200
      
      This patch fixes the issue by checking q->limits.discard_granularity in
      __blkdev_issue_discard() before composing the discard bio. If the value
      is 0, it prints a warning oops and returns -EOPNOTSUPP to the caller to
      indicate that this buggy device driver doesn't support discard
      requests.
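
As a rough illustration of the check described above, here is a minimal user-space sketch in C. It is not the kernel code: `struct queue_limits_model` and `check_discard_granularity()` are hypothetical names standing in for `q->limits` and the guard added to `__blkdev_issue_discard()`, and the warning that the patch prints is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the queue limits consulted by the block layer. */
struct queue_limits_model {
    uint32_t discard_granularity; /* bytes; 0 means the driver never set it */
};

/*
 * Returns 0 when a discard bio may be composed, or -EOPNOTSUPP when the
 * granularity is 0, so that no zero-length discard is ever built.
 */
int check_discard_granularity(const struct queue_limits_model *lim)
{
    assert(lim != NULL);

    if (lim->discard_granularity == 0)
        return -EOPNOTSUPP;
    return 0;
}
```

A caller would test the return value before building any bio. With a granularity of 0 the sketch reports -EOPNOTSUPP instead of letting a zero-length request reach the completion path.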
      
      Fixes: 9b15d109 ("block: improve discard bio alignment in __blkdev_issue_discard()")
      Fixes: c52abf56 ("loop: Better discard support for block devices")
      Reported-and-suggested-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Enzo Matsumiya <ematsumiya@suse.com>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Xiao Ni <xni@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 03 Aug, 2020 1 commit
    • J
      block: don't do revalidate zones on invalid devices · 1a1206dc
      Committed by Johannes Thumshirn
      When we lose a device for whatever reason while (re)scanning zones, we
      trip over a NULL pointer in blk_revalidate_zone_cb(), as in the
      following log:
      
      sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB)
      sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks
      sd 0:0:0:0: [sda] Write Protect is off
      sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
      sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed
      sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08
      sd 0:0:0:0: [sda] Sense Key : 0xb [current]
      sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
      sda: failed to revalidate zones
      sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B)
      sda: detected capacity change from 14000519643136 to 0
      ==================================================================
      BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550
      Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58
      
      CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
      Workqueue: events_unbound async_run_entry_fn
      Call Trace:
       dump_stack+0x7d/0xb0
       ? blk_revalidate_zone_cb+0x1b7/0x550
       kasan_report.cold+0x5/0x37
       ? blk_revalidate_zone_cb+0x1b7/0x550
       check_memory_region+0x145/0x1a0
       blk_revalidate_zone_cb+0x1b7/0x550
       sd_zbc_parse_report+0x1f1/0x370
       ? blk_req_zone_write_trylock+0x200/0x200
       ? sectors_to_logical+0x60/0x60
       ? blk_req_zone_write_trylock+0x200/0x200
       ? blk_req_zone_write_trylock+0x200/0x200
       sd_zbc_report_zones+0x3c4/0x5e0
       ? sd_dif_config_host+0x500/0x500
       blk_revalidate_disk_zones+0x231/0x44d
       ? _raw_write_lock_irqsave+0xb0/0xb0
       ? blk_queue_free_zone_bitmaps+0xd0/0xd0
       sd_zbc_read_zones+0x8cf/0x11a0
       sd_revalidate_disk+0x305c/0x64e0
       ? __device_add_disk+0x776/0xf20
       ? read_capacity_16.part.0+0x1080/0x1080
       ? blk_alloc_devt+0x250/0x250
       ? create_object.isra.0+0x595/0xa20
       ? kasan_unpoison_shadow+0x33/0x40
       sd_probe+0x8dc/0xcd2
       really_probe+0x20e/0xaf0
       __driver_attach_async_helper+0x249/0x2d0
       async_run_entry_fn+0xbe/0x560
       process_one_work+0x764/0x1290
       ? _raw_read_unlock_irqrestore+0x30/0x30
       worker_thread+0x598/0x12f0
       ? __kthread_parkme+0xc6/0x1b0
       ? schedule+0xed/0x2c0
       ? process_one_work+0x1290/0x1290
       kthread+0x36b/0x440
       ? kthread_create_worker_on_cpu+0xa0/0xa0
       ret_from_fork+0x22/0x30
      ==================================================================
      
      When the device is already gone we end up with the following scenario:
      the device's capacity is 0 and thus the number of zones will be 0 as
      well. When allocating the bitmap for the conventional zones, we then
      trip over a NULL pointer.
      
      So if we encounter a zoned block device with a capacity of 0, don't
      attempt to revalidate the zone sizes.
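
The guard described above can be sketched in user-space C. This is a simplified model under stated assumptions: `revalidate_zones_model()` is a hypothetical name, the error code is an assumption, and the real revalidation does considerably more than count zones.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the early exit: refuse to revalidate when the
 * capacity is 0, because a device with 0 zones has nothing to validate
 * and the later per-zone bookkeeping must not be set up for it.
 */
int revalidate_zones_model(uint64_t capacity_sectors, uint64_t zone_sectors,
                           uint64_t *nr_zones)
{
    assert(nr_zones != NULL && zone_sectors != 0);

    if (capacity_sectors == 0)
        return -EIO; /* assumed error code: the device is gone */

    /* Round up so that a smaller trailing zone is still counted. */
    *nr_zones = (capacity_sectors + zone_sectors - 1) / zone_sectors;
    return 0;
}
```

The output parameter is left untouched on the error path, so a caller cannot mistake a stale zone count for a fresh one.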
      
      Fixes: 6c6b3549 ("block: set the zone size in blk_revalidate_disk_zones atomically")
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 01 Aug, 2020 7 commits
  6. 31 Jul, 2020 1 commit
  7. 29 Jul, 2020 1 commit
  8. 28 Jul, 2020 1 commit
  9. 25 Jul, 2020 1 commit
    • A
      scsi: block: pm: Simplify resume handling · 8f38f8e0
      Committed by Alan Stern
      Commit 05d18ae1 ("scsi: pm: Balance pm_only counter of request queue
      during system resume") fixed a problem in the block layer's runtime-PM
      code: blk_set_runtime_active() failed to call blk_clear_pm_only().
      However, the commit's implementation was awkward; it forced the SCSI
      system-resume handler to choose whether to call blk_post_runtime_resume()
      or blk_set_runtime_active(), depending on whether or not the SCSI device
      had previously been runtime suspended.
      
      This patch simplifies the situation considerably by adding the missing
      function call directly into blk_set_runtime_active() (under the condition
      that the queue is not already in the RPM_ACTIVE state).  This allows the
      SCSI routine to revert back to its original form.  Furthermore, making this
      change reveals that blk_post_runtime_resume() (in its success pathway) does
      exactly the same thing as blk_set_runtime_active().  The duplicate code is
      easily removed by making one routine call the other.
      
      No functional changes are intended.
      
      Link: https://lore.kernel.org/r/20200706151436.GA702867@rowland.harvard.edu
      CC: Can Guo <cang@codeaurora.org>
      CC: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
  10. 21 Jul, 2020 3 commits
  11. 18 Jul, 2020 2 commits
    • B
      blk-cgroup: show global disk stats in root cgroup io.stat · ef45fe47
      Committed by Boris Burkov
      In order to improve consistency and usability in cgroup stat accounting,
      we would like to support the root cgroup's io.stat.
      
      Since the root cgroup has processes doing io even if the system has no
      explicitly created cgroups, we need to be careful to avoid overhead in
      that case.  For that reason, the rstat algorithms don't handle the root
      cgroup, so just turning the file on wouldn't give correct statistics.
      
      To get around this, we simulate flushing the iostat struct by filling it
      out directly from global disk stats. The result is a root cgroup io.stat
      file consistent with both /proc/diskstats and io.stat.
      
      Note that in order to collect the disk stats, we needed to iterate over
      devices. To facilitate that, we had to change the linkage of a disk_type
      to external so that it can be used from blk-cgroup.c to iterate over
      disks.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • B
      blk-cgroup: make iostat functions visible to stat printing · cd1fc4b9
      Committed by Boris Burkov
      Previously, the code which printed io.stat only needed access to the
      generic rstat flushing code, but since we plan to write some more
      specific code for preparing root cgroup stats, we need to manipulate
      iostat structs directly. Since declaring static functions ahead does not
      seem like common practice in this file, simply move the iostat functions
      up. We only plan to use blkg_iostat_set, but it seems better to keep them
      all together.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 17 Jul, 2020 6 commits
    • C
      block: improve discard bio alignment in __blkdev_issue_discard() · 9b15d109
      Committed by Coly Li
      This patch improves discard bio splitting for address and size alignment
      in __blkdev_issue_discard(). Aligned discard bios may help the underlying
      device controller perform better discard and internal garbage
      collection, and avoid unnecessary internal fragments.
      
      The current discard bio split algorithm in __blkdev_issue_discard() may
      leave non-discarded fragments on the device even when the discard bio LBA
      and size are both aligned to the device's discard granularity.
      
      Here are example steps to reproduce the above problem.
      - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
        in thin mode and give it to a Linux virtual machine.
      - Inside the Linux virtual machine, if the virtual disk shows up as
        /dev/sdb, fill data into the first 50GB by,
              # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
      - Discard the 50GB range from offset 0 on /dev/sdb,
              # blkdiscard /dev/sdb -o 0 -l 53687091200
      - Observe the underlying mapping status of the device
              # sg_get_lba_status /dev/sdb -m 1048 --lba=0
        descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
        descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
        descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
        descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
        descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
        descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
        descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
        descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
        descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
        descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
        descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
        descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
        descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
        descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
      
      Although the discard bio starts at LBA 0 and has a size of 50<<30 bytes,
      both perfectly aligned to the discard granularity, the above list shows
      that many 1MB (2048 sectors) internal fragments unexpectedly exist.
      
      The problem is in __blkdev_issue_discard(): an improper algorithm
      produces bio sizes that are not aligned.
      
       25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
       26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
       27                 struct bio **biop)
       28 {
       29         struct request_queue *q = bdev_get_queue(bdev);
         [snipped]
       56
       57         while (nr_sects) {
       58                 sector_t req_sects = min_t(sector_t, nr_sects,
       59                                 bio_allowed_max_sectors(q));
       60
       61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
       62
       63                 bio = blk_next_bio(bio, 0, gfp_mask);
       64                 bio->bi_iter.bi_sector = sector;
       65                 bio_set_dev(bio, bdev);
       66                 bio_set_op_attrs(bio, op, 0);
       67
       68                 bio->bi_iter.bi_size = req_sects << 9;
       69                 sector += req_sects;
       70                 nr_sects -= req_sects;
         [snipped]
       79         }
       80
       81         *biop = bio;
       82         return 0;
       83 }
       84 EXPORT_SYMBOL(__blkdev_issue_discard);
      
      At lines 58-59, to discard a 50GB range, req_sects is set to the return
      value of bio_allowed_max_sectors(q), which is 8388607 sectors. In the
      above case the discard granularity is 2048 sectors, so although the start
      LBA and discard length are aligned to the discard granularity, req_sects
      never has a chance to be aligned to it. This is why a still-mapped
      2048-sector fragment remains in every 4 or 8 GB range.
      
      If req_sects at line 58 is set to a value aligned to discard_granularity
      and close to UINT_MAX, then all subsequent split bios inside the device
      driver are (almost always) aligned to the discard_granularity of the
      device queue. The 2048-sector still-mapped fragments will disappear.
      
      This patch introduces bio_aligned_discard_max_sectors() to return the
      value which is aligned to q->limits.discard_granularity and closest
      to UINT_MAX. This patch then replaces bio_allowed_max_sectors() with
      the new routine to choose a more suitable split bio length.
      
      But we still need to handle the situation where the discard start LBA is
      not aligned to q->limits.discard_granularity; otherwise, even if the
      length is aligned, the current code may still leave a 2048-sector
      fragment around every 4GB range. Therefore, to calculate req_sects, the
      start LBA of the discard range (including the partition offset) is
      checked first. If it is not aligned to the discard granularity, the first
      split location should make sure the following bio has its bi_sector
      aligned to the discard granularity. Then there won't be any still-mapped
      fragment in the middle of the discard range.
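
The two rules above (cap an aligned bio at the largest granularity-aligned size, and cut an unaligned first bio at the next granularity boundary) can be modelled in user-space C. This is a sketch under assumptions, not the patch itself: `aligned_discard_max_sectors()` and `next_discard_sectors()` are hypothetical names, `start` is assumed to already include the partition offset, and the granularity is assumed to be a non-zero multiple of 512 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors */

/*
 * Largest bio length in sectors that is a multiple of the discard
 * granularity and whose byte size still fits in 32 bits.
 */
uint64_t aligned_discard_max_sectors(uint32_t granularity_bytes)
{
    uint64_t max_bytes;

    assert(granularity_bytes != 0);
    max_bytes = UINT32_MAX - (UINT32_MAX % granularity_bytes);
    return max_bytes >> SECTOR_SHIFT;
}

/*
 * Length in sectors of the next split bio for a discard that starts at
 * absolute sector `start` and still has `nr_sects` sectors to cover.
 */
uint64_t next_discard_sectors(uint64_t start, uint64_t nr_sects,
                              uint32_t granularity_bytes)
{
    uint64_t gran_sects = granularity_bytes >> SECTOR_SHIFT;
    uint64_t misalign, limit;

    assert(gran_sects != 0);
    misalign = start % gran_sects;

    if (misalign)
        limit = gran_sects - misalign; /* stop at the next boundary */
    else
        limit = aligned_discard_max_sectors(granularity_bytes);

    return nr_sects < limit ? nr_sects : limit;
}
```

With the 1MB (2048 sectors) granularity from the example, the aligned cap works out to 8386560 sectors rather than 8388607, so every subsequent bio in the loop starts on a granularity boundary.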
      
      The above is how this patch improves discard bio alignment in
      __blkdev_issue_discard(). With this patch applied, after a discard with
      the same command line mentioned previously, sg_get_lba_status returns,
      descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
      descriptor LBA: 0x0000000006600000  blocks: 0  deallocated
      
      We can see there is no 2048-sector fragment anymore; everything is clean.
      Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Enzo Matsumiya <ematsumiya@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Y
      block: defer flush request no matter whether we have elevator · b5718d6c
      Committed by Yufen Yu
      Commit 7520872c ("block: don't defer flushes on blk-mq + scheduling")
      tried to fix a deadlock caused by a circular wait between flush requests
      and the data requests on flush_data_in_flight. The former held all driver
      tags and waited for data request completion, but the latter could not
      complete because they were waiting for free driver tags.
      
      After commit 923218f6 ("blk-mq: don't allocate driver tag upfront
      for flush rq"), flush requests do not get a driver tag before being
      queued into the flush queue.
      
      * With an elevator, a flush request only gets sched_tags before being
        inserted into the flush queue. It does not get a driver tag until it
        is issued to the driver, so data requests on the list
        fq->flush_data_in_flight will complete in the end.
      
      * Without an elevator, each flush request gets a driver tag when the
        request is allocated. Data requests on fq->flush_data_in_flight then
        need not worry about lacking a driver tag.
      
      In both cases a circular wait cannot occur, so we may allow flush
      requests to be deferred.
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • W
      block: make blk_timeout_init() static · 943c4d90
      Committed by Wei Yongjun
      The sparse tool complains as follows:
      
      block/blk-timeout.c:93:12: warning:
       symbol 'blk_timeout_init' was not declared. Should it be static?
      
      Function blk_timeout_init() is not used outside of blk-timeout.c, so
      mark it static.
      
      Fixes: 9054650f ("block: relax jiffies rounding for timeouts")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • K
      treewide: Remove uninitialized_var() usage · 3f649ab7
      Committed by Kees Cook
      Using uninitialized_var() is dangerous as it papers over real bugs[1]
      (or can in the future), and suppresses unrelated compiler warnings
      (e.g. "unused variable"). If the compiler thinks it is uninitialized,
      either simply initialize the variable or make compiler changes.
      
      In preparation for removing[2] the[3] macro[4], remove all remaining
      needless uses with the following script:
      
      git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
      	xargs perl -pi -e \
      		's/\buninitialized_var\(([^\)]+)\)/\1/g;
      		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
      
      drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
      pathological white-space.
      
      No outstanding warnings were found building allmodconfig with GCC 9.3.0
      for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
      alpha, and m68k.
      
      [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
      [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
      [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
      [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
      
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
      Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
      Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
      Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • J
      block: remove retry loop in ioc_release_fn() · ab96bbab
      Committed by John Ogness
      The reverse-order double lock dance in ioc_release_fn() is using a
      retry loop. This is a problem on PREEMPT_RT because it could preempt
      the task that would release q->queue_lock and thus live lock in the
      retry loop.
      
      RCU is already managing the freeing of the request queue and icq. If
      the trylock fails, use RCU to guarantee that the request queue and
      icq are not freed and re-acquire the locks in the correct order,
      allowing forward progress.
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Reviewed-by: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      block: remove unnecessary ioc nested locking · a43f085f
      Committed by John Ogness
      The legacy CFQ IO scheduler could call put_io_context() in its exit_icq()
      elevator callback. This led to a lockdep warning, which was fixed in
      commit d8c66c5d ("block: fix lockdep warning on io_context release
      put_io_context()") by using a nested subclass for the ioc spinlock.
      However, with commit f382fb0b ("block: remove legacy IO schedulers")
      the CFQ IO scheduler no longer exists.
      
      The BFQ IO scheduler also implements the exit_icq() elevator callback but
      does not call put_io_context().
      
      The nested subclass for the ioc spinlock is no longer needed. Since it
      existed as an exception and no longer applies, remove the nested subclass
      usage.
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Reviewed-by: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 16 Jul, 2020 2 commits
  14. 15 Jul, 2020 3 commits
    • J
      Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait." · e791ee68
      Committed by Jens Axboe
      This reverts commit 826f2f48.
      
      Qian Cai reports that this commit causes stalls with swap. Revert until
      the reason can be figured out.
      Reported-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • M
      block: always remove partitions from blk_drop_partitions() · d0f0f1b4
      Committed by Ming Lei
      In theory, when GENHD_FL_NO_PART_SCAN is set, no partitions can be created
      on a disk. However, ioctl(BLKPG, BLKPG_ADD_PARTITION) doesn't check
      GENHD_FL_NO_PART_SCAN, so partitions can still be added even though
      GENHD_FL_NO_PART_SCAN is set.
      
      So far blk_drop_partitions() only removes partitions when
      disk_part_scan_enabled() returns true. This can leave ghost partitions on
      a loop device after changing/clearing the FD when PARTSCAN is disabled,
      since partitions can be added via 'parted' on a loop disk even though
      GENHD_FL_NO_PART_SCAN is set.
      
      Fix this issue by always removing partitions in blk_drop_partitions().
      This is correct because the current code assumes that no partitions can
      be added when GENHD_FL_NO_PART_SCAN is set.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      block: relax jiffies rounding for timeouts · 9054650f
      Committed by Jens Axboe
      In doing high IOPS testing, blk-mq is generally pretty well optimized.
      There are a few things that stuck out as using more CPU than what is
      really warranted, and one thing is the round_jiffies_up() that we do
      twice for each request. That accounts for about 0.8% of the CPU in
      my testing.
      
      We can make this cheaper by avoiding an integer division, by just adding
      a rough HZ mask that we can AND with instead. The timeouts are only on a
      second granularity already, we don't have to be that accurate here and
      this patch barely changes that. All we care about is nice grouping.
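
The general idea of trading a division for an AND with a power-of-two mask can be sketched in user-space C. This is an illustration of the technique only, not the expression used in the patch: `timeout_mask()` and `round_up_masked()` are hypothetical names, and the tick rate of 250 in the usage note is an assumed example value.

```c
#include <assert.h>

/*
 * Build a mask from the tick rate: the smallest power of two that is
 * greater than or equal to hz, minus one (for example 250 -> 255).
 */
unsigned long timeout_mask(unsigned long hz)
{
    unsigned long pow2 = 1;

    assert(hz > 0);
    while (pow2 < hz)
        pow2 <<= 1;
    return pow2 - 1;
}

/*
 * Round j up to the next multiple of (mask + 1) with one add and one
 * AND. The result is only roughly on a one-second boundary, which is
 * enough to group timeouts together.
 */
unsigned long round_up_masked(unsigned long j, unsigned long mask)
{
    return (j + mask) & ~mask;
}
```

With an assumed tick rate of 250, the mask is 255 and a timeout at tick 1000 is grouped to tick 1024. The mask would be computed once at initialisation, so the per-request cost is the add and the AND.
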
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 10 Jul, 2020 2 commits
  16. 09 Jul, 2020 4 commits
  17. 08 Jul, 2020 3 commits