1. 22 8月, 2020 17 次提交
    • K
      nvme: skip noiob for zoned devices · c41ad98b
      Keith Busch 提交于
      Zoned block devices reuse the chunk_sectors queue limit to define zone
      boundaries. If a such a device happens to also report an optimal
      boundary, do not use that to define the chunk_sectors as that may
      intermittently interfere with io splitting and zone size queries.
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c41ad98b
    • C
      nvme-pci: fix PRP pool size · c61b82c7
      Christoph Hellwig 提交于
      All operations are based on the controller, not the host page size.
      Switch the dma pool to use the controller page size as well to avoid
      massive overallocations on large page size systems.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c61b82c7
    • J
      nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth · 7442ddce
      John Garry 提交于
      Recently nvme_dev.q_depth was changed from an int to u16 type.
      
      This falls over for the queue depth calculation in nvme_pci_enable(),
      where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as
      NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the
      result:
      
      root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
      dereference at virtual address 0000000000000010
      Mem abort info:
      ESR = 0x96000004
      EC = 0x25: DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
      Data abort info:
      ISV = 0, ISS = 0x00000004
      CM = 0, WnR = 0
      user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
      [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      Modules linked in: nvme nvme_core
      CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
      5.8.0-next-20200812 #27
      Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
      V1.16.01 03/15/2019
      Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
      pc : __sg_alloc_table_from_pages+0xec/0x238
      lr : __sg_alloc_table_from_pages+0xc8/0x238
      sp : ffff800013ccbad0
      x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
      x27: 0000000000000000 x26: 0000000000002dc2
      x25: 0000000000000dc0 x24: 0000000000000000
      x23: 0000000000000000 x22: ffff800013ccbbe8
      x21: 0000000000000010 x20: 0000000000000000
      x19: 00000000fffff000 x18: ffffffffffffffff
      x17: 00000000000000c0 x16: fffffe289eaf6380
      x15: ffff800011b59948 x14: ffff002bc8fe98f8
      x13: ff00000000000000 x12: ffff8000114ca000
      x11: 0000000000000000 x10: ffffffffffffffff
      x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
      x7 : 0000000000000000 x6 : 0000000000000001
      x5 : ffff0a27b5f9b680 x4 : 0000000000000000
      x3 : ffff0a27b5f9b680 x2 : 0000000000000000
       x1 : 0000000000000001 x0 : 0000000000000000
       Call trace:
      __sg_alloc_table_from_pages+0xec/0x238
      sg_alloc_table_from_pages+0x18/0x28
      iommu_dma_alloc+0x474/0x678
      dma_alloc_attrs+0xd8/0xf0
      nvme_alloc_queue+0x114/0x160 [nvme]
      nvme_reset_work+0xb34/0x14b4 [nvme]
      process_one_work+0x1e8/0x360
      worker_thread+0x44/0x478
      kthread+0x150/0x158
      ret_from_fork+0x10/0x34
       Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
      ---[ end trace 89bb2b72d59bf925 ]---
      
      Fix by making onto a u32.
      
      Also use u32 for nvme_dev.q_depth, as we assign this value from
      nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
      avoids the same crash as above.
      
      Fixes: 61f3b896 ("nvme-pci: use unsigned for io queue depth")
      Signed-off-by: NJohn Garry <john.garry@huawei.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7442ddce
    • L
      nvme: Use spin_lock_irq() when taking the ctrl->lock · ecbcdf0c
      Logan Gunthorpe 提交于
      When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a
      dead lock. The new spin_lock() calls recently added produce the
      following lockdep warning when running the blktest nvme/003:
      
          ================================
          WARNING: inconsistent lock state
          --------------------------------
          inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
          ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
          ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
          {SOFTIRQ-ON-W} state was registered at:
            lock_acquire+0x164/0x500
            _raw_spin_lock+0x28/0x40
            nvme_get_effects_log+0x37/0x1c0
            nvme_init_identify+0x9e4/0x14f0
            nvme_reset_work+0xadd/0x2360
            process_one_work+0x66b/0xb70
            worker_thread+0x6e/0x6c0
            kthread+0x1e7/0x210
            ret_from_fork+0x22/0x30
          irq event stamp: 1449221
          hardirqs last  enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140
          hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60
          softirqs last  enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595
          softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50
      
          other info that might help us debug this:
           Possible unsafe locking scenario:
      
                 CPU0
                 ----
            lock(&ctrl->lock);
            <Interrupt>
              lock(&ctrl->lock);
      
           *** DEADLOCK ***
      
          no locks held by ksoftirqd/2/22.
      
          stack backtrace:
          CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c #1450
          Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
          Call Trace:
           dump_stack+0xc8/0x11a
           print_usage_bug.cold.63+0x235/0x23e
           mark_lock+0xa9c/0xcf0
           __lock_acquire+0xd9a/0x2b50
           lock_acquire+0x164/0x500
           _raw_spin_lock_irqsave+0x40/0x60
           nvme_keep_alive_end_io+0x50/0xc0
           blk_mq_end_request+0x158/0x210
           nvme_complete_rq+0x146/0x500
           nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
           blk_done_softirq+0x187/0x1e0
           __do_softirq+0x118/0x595
           run_ksoftirqd+0x35/0x50
           smpboot_thread_fn+0x1d3/0x310
           kthread+0x1e7/0x210
           ret_from_fork+0x22/0x30
      
      Fixes: be93e87e ("nvme: support for multiple Command Sets Supported and Effects log pages")
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Tested-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ecbcdf0c
    • C
      nvmet: call blk_mq_free_request() directly · 7ee51cf6
      Chaitanya Kulkarni 提交于
      Instead of calling blk_put_request() which calls blk_mq_free_request(),
      call blk_mq_free_request() directly for NVMeOF passthru. This is to
      mainly avoid an extra function call in the completion path
      nvmet_passthru_req_done().
      Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7ee51cf6
    • C
      nvmet: fix oops in pt cmd execution · a2138fd4
      Chaitanya Kulkarni 提交于
      In the existing NVMeOF Passthru core command handling on failure of
      nvme_alloc_request() it errors out with rq value set to NULL. In the
      error handling path it calls blk_put_request() without checking if
      rq is set to NULL or not which produces following Oops:-
      
      [ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [ 1457.347838] #PF: supervisor read access in kernel mode
      [ 1457.348464] #PF: error_code(0x0000) - not-present page
      [ 1457.349085] PGD 0 P4D 0
      [ 1457.349402] Oops: 0000 [#1] SMP NOPTI
      [ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G           OE     5.8.0-rc4nvme-5.9+ #35
      [ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
      [ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop]
      [ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110
      [ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
      [ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
      [ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
      [ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
      [ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
      [ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
      [ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
      [ 1457.361322] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
      [ 1457.362337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
      [ 1457.363973] Call Trace:
      [ 1457.364296]  nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet]
      [ 1457.364990]  process_one_work+0x24e/0x5a0
      [ 1457.365493]  ? __schedule+0x353/0x840
      [ 1457.365957]  worker_thread+0x3c/0x380
      [ 1457.366426]  ? process_one_work+0x5a0/0x5a0
      [ 1457.366948]  kthread+0x135/0x150
      [ 1457.367362]  ? kthread_create_on_node+0x60/0x60
      [ 1457.367934]  ret_from_fork+0x22/0x30
      [ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer
      [ 1457.368414]  ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_]
      [ 1457.380849] CR2: 0000000000000000
      [ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]---
      [ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110
      [ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
      [ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
      [ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
      [ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
      [ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
      [ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
      [ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
      [ 1457.389798] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
      [ 1457.390796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
      [ 1457.392525] Kernel panic - not syncing: Fatal exception
      [ 1457.394138] Kernel Offset: disabled
      [ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      We fix this Oops by adding a new goto label out_put_req and reordering
      the blk_put_request call to avoid calling blk_put_request() with rq
      value is set to NULL. Here we also update the rest of the code
      accordingly.
      
      Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands")
      Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a2138fd4
    • C
      nvmet: add ns tear down label for pt-cmd handling · 4db69a3d
      Chaitanya Kulkarni 提交于
      In the current implementation before submitting the passthru cmd we
      may come across error e.g. getting ns from passthru controller,
      allocating a request from passthru controller, etc. For all the failure
      cases it only uses single goto label fail_out.
      
      In the target code, we follow the pattern to have a separate label for
      each error out the case when setting up multiple things before the actual
      action.
      
      This patch follows the same pattern and renames generic fail_out label
      to out_put_ns and updates the error out cases in the
      nvmet_passthru_execute_cmd() where it is needed.
      Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4db69a3d
    • M
      nvme: multipath: round-robin: eliminate "fallback" variable · e398863b
      Martin Wilck 提交于
      If we find an optimized path, we quit the loop immediately. Thus we can use
      just one variable for the next path, slighly simplifying the code.
      Signed-off-by: NMartin Wilck <mwilck@suse.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e398863b
    • M
      nvme: multipath: round-robin: fix single non-optimized path case · 93eb0381
      Martin Wilck 提交于
      If there's only one usable, non-optimized path, nvme_round_robin_path()
      returns NULL, which is wrong. Fix it by falling back to "old", like in
      the single optimized path case. Also, if the active path isn't changed,
      there's no need to re-assign the pointer.
      
      Fixes: 3f6e3246 ("nvme-multipath: fix logic for non-optimized paths")
      Signed-off-by: NMartin Wilck <mwilck@suse.com>
      Signed-off-by: NMartin George <marting@netapp.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      93eb0381
    • T
      nvme-fc: Fix wrong return value in __nvme_fc_init_request() · f34448cd
      Tianjia Zhang 提交于
      On an error exit path, a negative error code should be returned
      instead of a positive return value.
      
      Fixes: e399441d ("nvme-fabrics: Add host support for FC transport")
      Cc: James Smart <jsmart2021@gmail.com>
      Signed-off-by: NTianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f34448cd
    • L
      nvmet-passthru: Reject commands with non-sgl flags set · 0ceeab96
      Logan Gunthorpe 提交于
      Any command with a non-SGL flag set (like fuse flags) should be
      rejected.
      
      Fixes: c1fef73f ("nvmet: add passthru code to process commands")
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0ceeab96
    • S
      nvmet: fix a memory leak · 382fee1a
      Sagi Grimberg 提交于
      We forgot to free new_model_number
      
      Fixes: 013b7ebe ("nvmet: make ctrl model configurable")
      Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      382fee1a
    • Y
      blkcg: fix memleak for iolatency · 27029b4b
      Yufen Yu 提交于
      Normally, blkcg_iolatency_exit() will free related memory in iolatency
      when cleanup queue. But if blk_throtl_init() return error and queue init
      fail, blkcg_iolatency_exit() will not do that for us. Then it cause
      memory leak.
      
      Fixes: d7067512 ("block: introduce blk-iolatency io controller")
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      27029b4b
    • G
      MAINTAINERS: Add missing header files to BLOCK LAYER section · 0c8b9c35
      Geert Uytterhoeven 提交于
      The various <linux/blk*.h> header files are part of the Block Layer.
      Add them to the corresponding section in the MAINTAINERS file, so
      scripts/get_maintainer.pl will pick them up.
      Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0c8b9c35
    • K
      block: fix get_max_io_size() · e4b469c6
      Keith Busch 提交于
      A previous commit aligning splits to physical block sizes inadvertently
      modified one return case such that that it now returns 0 length splits
      when the number of sectors doesn't exceed the physical offset. This
      later hits a BUG in bio_split(). Restore the previous working behavior.
      
      Fixes: 9cc5169c ("block: Improve physical block alignment of split bios")
      Reported-by: NEric Deal <eric.deal@wdc.com>
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e4b469c6
    • M
      blk-mq: insert request not through ->queue_rq into sw/scheduler queue · db03f88f
      Ming Lei 提交于
      c616cbee ("blk-mq: punt failed direct issue to dispatch list") supposed
      to add request which has been through ->queue_rq() to the hw queue dispatch
      list, however it adds request running out of budget or driver tag to hw queue
      too. This way basically bypasses request merge, and causes too many request
      dispatched to LLD, and system% is unnecessary increased.
      
      Fixes this issue by adding request not through ->queue_rq into sw/scheduler
      queue, and this way is safe because no ->queue_rq is called on this request
      yet.
      
      High %system can be observed on Azure storvsc device, and even soft lock
      is observed. This patch reduces %system during heavy sequential IO,
      meantime decreases soft lockup risk.
      
      Fixes: c616cbee ("blk-mq: punt failed direct issue to dispatch list")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      db03f88f
    • N
      block/rnbd: Ensure err is always initialized in process_rdma · 17bc1030
      Nathan Chancellor 提交于
      Clang warns:
      
      drivers/block/rnbd/rnbd-srv.c:150:6: warning: variable 'err' is used
      uninitialized whenever 'if' condition is true
      [-Wsometimes-uninitialized]
              if (IS_ERR(bio)) {
                  ^~~~~~~~~~~
      drivers/block/rnbd/rnbd-srv.c:177:9: note: uninitialized use occurs here
              return err;
                     ^~~
      drivers/block/rnbd/rnbd-srv.c:150:2: note: remove the 'if' if its
      condition is always false
              if (IS_ERR(bio)) {
              ^~~~~~~~~~~~~~~~~~
      drivers/block/rnbd/rnbd-srv.c:126:9: note: initialize the variable 'err'
      to silence this warning
              int err;
                     ^
                      = 0
      1 warning generated.
      
      err is indeed uninitialized when this statement is taken. Ensure that it
      is assigned the error value of bio before jumping to the error handling
      label.
      
      Fixes: 735d77d4 ("rnbd: remove rnbd_dev_submit_io")
      Reported-by: NBrooke Basile <brookebasile@gmail.com>
      Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
      Acked-by: NJack Wang <jinpu.wang@cloud.ionos.com>
      Link: https://github.com/ClangBuiltLinux/linux/issues/1134Signed-off-by: NJens Axboe <axboe@kernel.dk>
      17bc1030
  2. 18 8月, 2020 2 次提交
    • D
      bfq: fix blkio cgroup leakage v4 · 2de791ab
      Dmitry Monakhov 提交于
      Changes from v1:
          - update commit description with proper ref-accounting justification
      
      commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      introduce leak forbfq_group and blkcg_gq objects because of get/put
      imbalance.
      In fact whole idea of original commit is wrong because bfq_group entity
      can not dissapear under us because it is referenced by child bfq_queue's
      entities from here:
       -> bfq_init_entity()
          ->bfqg_and_blkg_get(bfqg);
          ->entity->parent = bfqg->my_entity
      
       -> bfq_put_queue(bfqq)
          FINAL_PUT
          ->bfqg_and_blkg_put(bfqq_group(bfqq))
          ->kmem_cache_free(bfq_pool, bfqq);
      
      So parent entity can not disappear while child entity is in tree,
      and child entities already has proper protection.
      This patch revert commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      
      bfq_group leak trace caused by bad commit:
      -> blkg_alloc
         -> bfq_pq_alloc
           -> bfqg_get (+1)
      ->bfq_activate_bfqq
        ->bfq_activate_requeue_entity
          -> __bfq_activate_entity
             ->bfq_get_entity
               ->bfqg_and_blkg_get (+1)  <==== : Note1
      ->bfq_del_bfqq_busy
        ->bfq_deactivate_entity+0x53/0xc0 [bfq]
          ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
            -> bfq_forget_entity(is_in_service = true)
      	 entity->on_st_or_in_serv = false   <=== :Note2
      	 if (is_in_service)
      	     return;  ==> do not touch reference
      -> blkcg_css_offline
       -> blkcg_destroy_blkgs
        -> blkg_destroy
         -> bfq_pd_offline
          -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because (Note2)
      		return false;
       -> bfq_pd_free
          -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
      So bfq_group and blkcg_gq  will leak forever, see test-case below.
      
      ##TESTCASE_BEGIN:
      #!/bin/bash
      
      max_iters=${1:-100}
      #prep cgroup mounts
      mount -t tmpfs cgroup_root /sys/fs/cgroup
      mkdir /sys/fs/cgroup/blkio
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      
      # Prepare blkdev
      grep blkio /proc/cgroups
      truncate -s 1M img
      losetup /dev/loop0 img
      echo bfq > /sys/block/loop0/queue/scheduler
      
      grep blkio /proc/cgroups
      for ((i=0;i<max_iters;i++))
      do
          mkdir -p /sys/fs/cgroup/blkio/a
          echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
          dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
          echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
          rmdir /sys/fs/cgroup/blkio/a
          grep blkio /proc/cgroups
      done
      ##TESTCASE_END:
      
      Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NDmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2de791ab
    • M
      block: Fix page_is_mergeable() for compound pages · d8166519
      Matthew Wilcox (Oracle) 提交于
      If we pass in an offset which is larger than PAGE_SIZE, then
      page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
      leading to a large number of bio_vecs being used.  Use a slightly more
      obvious test that the two pages are compatible with each other.
      
      Fixes: 52d52d1c ("block: only allow contiguous page structs in a bio_vec")
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d8166519
  3. 17 8月, 2020 9 次提交
    • M
      block: virtio_blk: fix handling single range discard request · af822aa6
      Ming Lei 提交于
      1f23816b ("virtio_blk: add discard and write zeroes support") starts
      to support multi-range discard for virtio-blk. However, the virtio-blk
      disk may report max discard segment as 1, at least that is exactly what
      qemu is doing.
      
      So far, block layer switches to normal request merge if max discard segment
      limit is 1, and multiple bios can be merged to single segment. This way may
      cause memory corruption in virtblk_setup_discard_write_zeroes().
      
      Fix the issue by handling single max discard segment in straightforward
      way.
      
      Fixes: 1f23816b ("virtio_blk: add discard and write zeroes support")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Changpeng Liu <changpeng.liu@intel.com>
      Cc: Daniel Verkamp <dverkamp@chromium.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      af822aa6
    • M
      block: respect queue limit of max discard segment · 943b40c8
      Ming Lei 提交于
      When queue_max_discard_segments(q) is 1, blk_discard_mergable() will
      return false for discard request, then normal request merge is applied.
      However, only queue_max_segments() is checked, so max discard segment
      limit isn't respected.
      
      Check max discard segment limit in the request merge code for fixing
      the issue.
      
      Discard request failure of virtio_blk is fixed.
      
      Fixes: 69840466 ("block: fix the DISCARD request merge")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      943b40c8
    • M
      block: loop: set discard granularity and alignment for block device backed loop · bcb21c8c
      Ming Lei 提交于
      In case of block device backend, if the backend supports write zeros, the
      loop device will set queue flag of QUEUE_FLAG_DISCARD. However,
      limits.discard_granularity isn't setup, and this way is wrong,
      see the following description in Documentation/ABI/testing/sysfs-block:
      
      	A discard_granularity of 0 means that the device does not support
      	discard functionality.
      
      Especially 9b15d109 ("block: improve discard bio alignment in
      __blkdev_issue_discard()") starts to take q->limits.discard_granularity
      for computing max discard sectors. And zero discard granularity may cause
      kernel oops, or fail discard request even though the loop queue claims
      discard support via QUEUE_FLAG_DISCARD.
      
      Fix the issue by setup discard granularity and alignment.
      
      Fixes: c52abf56 ("loop: Better discard support for block devices")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NColy Li <colyli@suse.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Xiao Ni <xni@redhat.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Gwendal Grignou <gwendal@chromium.org>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bcb21c8c
    • M
      blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART · d7d8535f
      Ming Lei 提交于
      SCHED_RESTART code path is relied to re-run queue for dispatch requests
      in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding
      requests to hctx->dispatch.
      
      memory barriers have to be used for ordering the following two pair of OPs:
      
      1) adding requests to hctx->dispatch and checking SCHED_RESTART in
      blk_mq_dispatch_rq_list()
      
      2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
      in blk_mq_sched_restart().
      
      Without the added memory barrier, either:
      
      1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime
      blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in
      dispatch side
      
      or
      
      2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue
      in dispatch side, meantime checking if there is request in
      hctx->dispatch from blk_mq_sched_restart() is missed.
      
      IO hang in ltp/fs_fill test is reported by kernel test robot:
      
      	https://lkml.org/lkml/2020/7/26/77
      
      Turns out it is caused by the above out-of-order OPs. And the IO hang
      can't be observed any more after applying this patch.
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Reported-by: Nkernel test robot <rong.a.chen@intel.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d7d8535f
    • X
      bsg-lib: convert comma to semicolon · 03ef5941
      Xu Wang 提交于
      Replace a comma between expression statements by a semicolon.
      Signed-off-by: NXu Wang <vulab@iscas.ac.cn>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      03ef5941
    • R
      block: blk-mq.c: fix @at_head kernel-doc warning · 26bfeb26
      Randy Dunlap 提交于
      Fix a kernel-doc warning in block/blk-mq.c:
      
      ../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'
      
      Fixes: 01e99aec ("blk-mq: insert passthrough request into hctx->dispatch directly")
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: André Almeida <andrealmeid@collabora.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      26bfeb26
    • L
      Linux 5.9-rc1 · 9123e3a7
      Linus Torvalds 提交于
      9123e3a7
    • L
      Merge tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block · 2cc3c4b3
      Linus Torvalds 提交于
      Pull io_uring fixes from Jens Axboe:
       "A few differerent things in here.
      
        Seems like syzbot got some more io_uring bits wired up, and we got a
        handful of reports and the associated fixes are in here.
      
        General fixes too, and a lot of them marked for stable.
      
        Lastly, a bit of fallout from the async buffered reads, where we now
        more easily trigger short reads. Some applications don't really like
        that, so the io_read() code now handles short reads internally, and
        got a cleanup along the way so that it's now easier to read (and
        documented). We're now passing tests that failed before"
      
      * tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
        io_uring: short circuit -EAGAIN for blocking read attempt
        io_uring: sanitize double poll handling
        io_uring: internally retry short reads
        io_uring: retain iov_iter state over io_read/io_write calls
        task_work: only grab task signal lock when needed
        io_uring: enable lookup of links holding inflight files
        io_uring: fail poll arm on queue proc failure
        io_uring: hold 'ctx' reference around task_work queue + execute
        fs: RWF_NOWAIT should imply IOCB_NOIO
        io_uring: defer file table grabbing request cleanup for locked requests
        io_uring: add missing REQ_F_COMP_LOCKED for nested requests
        io_uring: fix recursive completion locking on oveflow flush
        io_uring: use TWA_SIGNAL for task_work uncondtionally
        io_uring: account locked memory before potential error case
        io_uring: set ctx sq/cq entry count earlier
        io_uring: Fix NULL pointer dereference in loop_rw_iter()
        io_uring: add comments on how the async buffered read retry works
        io_uring: io_async_buf_func() need not test page bit
      2cc3c4b3
    • M
      parisc: fix PMD pages allocation by restoring pmd_alloc_one() · 6f6aea7e
      Mike Rapoport 提交于
      Commit 1355c31e ("asm-generic: pgalloc: provide generic pmd_alloc_one()
      and pmd_free_one()") converted parisc to use generic version of
      pmd_alloc_one() but it missed the fact that parisc uses order-1 pages for
      PMD.
      
      Restore the original version of pmd_alloc_one() for parisc, just use
      GFP_PGTABLE_KERNEL that implies __GFP_ZERO instead of GFP_KERNEL and
      memset.
      
      Fixes: 1355c31e ("asm-generic: pgalloc: provide generic pmd_alloc_one() and pmd_free_one()")
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Tested-by: NMeelis Roos <mroos@linux.ee>
      Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Link: https://lkml.kernel.org/r/9f2b5ebd-e4a4-0fa1-6cd3-4b9f6892d1ad@linux.eeSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f6aea7e
  4. 16 8月, 2020 10 次提交
    • L
      Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block · 4b6c093e
      Linus Torvalds 提交于
      Pull block fixes from Jens Axboe:
       "A few fixes on the block side of things:
      
         - Discard granularity fix (Coly)
      
         - rnbd cleanups (Guoqing)
      
         - md error handling fix (Dan)
      
         - md sysfs fix (Junxiao)
      
         - Fix flush request accounting, which caused an IO slowdown for some
           configurations (Ming)
      
         - Properly propagate loop flag for partition scanning (Lennart)"
      
      * tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
        block: fix double account of flush request's driver tag
        loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
        rnbd: no need to set bi_end_io in rnbd_bio_map_kern
        rnbd: remove rnbd_dev_submit_io
        md-cluster: Fix potential error pointer dereference in resize_bitmaps()
        block: check queue's limits.discard_granularity in __blkdev_issue_discard()
        md: get sysfs entry after redundancy attr group create
      4b6c093e
    • L
      Merge tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · d84835b1
      Linus Torvalds 提交于
      Pull RISC-V fix from Palmer Dabbelt:
       "I collected a single fix during the merge window: we managed to break
        the early trap setup on !MMU, this fixes it"
      
      * tag 'riscv-for-linus-5.9-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Setup exception vector for nommu platform
      d84835b1
    • L
      Merge tag 'sh-for-5.9' of git://git.libc.org/linux-sh · 5bbec3cf
      Linus Torvalds 提交于
      Pull arch/sh updates from Rich Felker:
       "Cleanup, SECCOMP_FILTER support, message printing fixes, and other
        changes to arch/sh"
      
      * tag 'sh-for-5.9' of git://git.libc.org/linux-sh: (34 commits)
        sh: landisk: Add missing initialization of sh_io_port_base
        sh: bring syscall_set_return_value in line with other architectures
        sh: Add SECCOMP_FILTER
        sh: Rearrange blocks in entry-common.S
        sh: switch to copy_thread_tls()
        sh: use the generic dma coherent remap allocator
        sh: don't allow non-coherent DMA for NOMMU
        dma-mapping: consolidate the NO_DMA definition in kernel/dma/Kconfig
        sh: unexport register_trapped_io and match_trapped_io_handler
        sh: don't include <asm/io_trapped.h> in <asm/io.h>
        sh: move the ioremap implementation out of line
        sh: move ioremap_fixed details out of <asm/io.h>
        sh: remove __KERNEL__ ifdefs from non-UAPI headers
        sh: sort the selects for SUPERH alphabetically
        sh: remove -Werror from Makefiles
        sh: Replace HTTP links with HTTPS ones
        arch/sh/configs: remove obsolete CONFIG_SOC_CAMERA*
        sh: stacktrace: Remove stacktrace_ops.stack()
        sh: machvec: Modernize printing of kernel messages
        sh: pci: Modernize printing of kernel messages
        ...
      5bbec3cf
    • J
      io_uring: short circuit -EAGAIN for blocking read attempt · f91daf56
      Jens Axboe 提交于
      One case was missed in the short IO retry handling, and that's hitting
      -EAGAIN on a blocking attempt read (eg from io-wq context). This is a
      problem on sockets that are marked as non-blocking when created, they
      don't carry any REQ_F_NOWAIT information to help us terminate them
      instead of perpetually retrying.
      
      Fixes: 227c0c96 ("io_uring: internally retry short reads")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f91daf56
    • J
      io_uring: sanitize double poll handling · d4e7cd36
      Jens Axboe 提交于
      There's a bit of confusion on the matching pairs of poll vs double poll,
      depending on if the request is a pure poll (IORING_OP_POLL_ADD) or
      poll driven retry.
      
      Add io_poll_get_double() that returns the double poll waitqueue, if any,
      and io_poll_get_single() that returns the original poll waitqueue. With
      that, remove the argument to io_poll_remove_double().
      
      Finally ensure that wait->private is cleared once the double poll handler
      has run, so that remove knows it's already been seen.
      
      Cc: stable@vger.kernel.org # v5.8
      Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d4e7cd36
    • L
      Merge tag 'perf-tools-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux · 713eee84
      Linus Torvalds 提交于
      Pull more perf tools updates from Arnaldo Carvalho de Melo:
       "Fixes:
         - Fixes for 'perf bench numa'.
      
         - Always memset source before memcpy in 'perf bench mem'.
      
         - Quote CC and CXX for their arguments to fix build in environments
           using those variables to pass more than just the compiler names.
      
         - Fix module symbol processing, addressing regression detected via
           "perf test".
      
         - Allow multiple probes in record+script_probe_vfs_getname.sh 'perf
           test' entry.
      
        Improvements:
         - Add script to autogenerate socket family name id->string table from
           copy of kernel header, used so far in 'perf trace'.
      
         - 'perf ftrace' improvements to provide similar options for this
           utility so that one can go from 'perf record', 'perf trace', etc to
           'perf ftrace' just by changing the name of the subcommand.
      
         - Prefer new "sched:sched_waking" trace event when it exists in 'perf
           sched' post processing.
      
         - Update POWER9 metrics to utilize other metrics.
      
         - Fall back to querying debuginfod if debuginfo not found locally.
      
        Miscellaneous:
         - Sync various kvm headers with kernel sources"
      
      * tag 'perf-tools-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (40 commits)
        perf ftrace: Make option description initials all capital letters
        perf build-ids: Fall back to debuginfod query if debuginfo not found
        perf bench numa: Remove dead code in parse_nodes_opt()
        perf stat: Update POWER9 metrics to utilize other metrics
        perf ftrace: Add change log
        perf: ftrace: Add set_tracing_options() to set all trace options
        perf ftrace: Add option --tid to filter by thread id
        perf ftrace: Add option -D/--delay to delay tracing
        perf: ftrace: Allow set graph depth by '--graph-opts'
        perf ftrace: Add support for trace option tracing_thresh
        perf ftrace: Add option 'verbose' to show more info for graph tracer
        perf ftrace: Add support for tracing option 'irq-info'
        perf ftrace: Add support for trace option funcgraph-irqs
        perf ftrace: Add support for trace option sleep-time
        perf ftrace: Add support for tracing option 'func_stack_trace'
        perf tools: Add general function to parse sublevel options
        perf ftrace: Add option '--inherit' to trace children processes
        perf ftrace: Show trace column header
        perf ftrace: Add option '-m/--buffer-size' to set per-cpu buffer size
        perf ftrace: Factor out function write_tracing_file_int()
        ...
      713eee84
    • L
      Merge tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 50f6c7db
      Linus Torvalds 提交于
      Pull x86 fixes from Ingo Molnar:
       "Misc fixes and small updates all around the place:
      
         - Fix mitigation state sysfs output
      
         - Fix an FPU xstate/sxave code assumption bug triggered by
           Architectural LBR support
      
         - Fix Lightning Mountain SoC TSC frequency enumeration bug
      
         - Fix kexec debug output
      
         - Fix kexec memory range assumption bug
      
         - Fix a boundary condition in the crash kernel code
      
         - Optimize porgatory.ro generation a bit
      
         - Enable ACRN guests to use X2APIC mode
      
         - Reduce a __text_poke() IRQs-off critical section for the benefit of
           PREEMPT_RT"
      
      * tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/alternatives: Acquire pte lock with interrupts enabled
        x86/bugs/multihit: Fix mitigation reporting when VMX is not in use
        x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs
        x86/purgatory: Don't generate debug info for purgatory.ro
        x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoC
        kexec_file: Correctly output debugging information for the PT_LOAD ELF header
        kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges
        x86/crash: Correct the address boundary of function parameters
        x86/acrn: Remove redundant chars from ACRN signature
        x86/acrn: Allow ACRN guest to use X2APIC mode
      50f6c7db
    • L
      Merge tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1195d58f
      Linus Torvalds 提交于
      Pull scheduler fixes from Ingo Molnar:
       "Two fixes: fix a new tracepoint's output value, and fix the formatting
        of show-state syslog printouts"
      
      * tag 'sched-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/debug: Fix the alignment of the show-state debug output
        sched: Fix use of count for nr_running tracepoint
      1195d58f
    • L
      Merge tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7f5faaaa
      Linus Torvalds 提交于
      Pull perf fixes from Ingo Molnar:
       "Misc fixes, an expansion of perf syscall access to CAP_PERFMON
        privileged tools, plus a RAPL HW-enablement for Intel SPR platforms"
      
      * tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/rapl: Add support for Intel SPR platform
        perf/x86/rapl: Support multiple RAPL unit quirks
        perf/x86/rapl: Fix missing psys sysfs attributes
        hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration
        kprobes: Remove show_registers() function prototype
        perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
      7f5faaaa
    • L
      Merge tag 'locking-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · eb1319af
      Linus Torvalds 提交于
      Pull locking fixlets from Ingo Molnar:
       "A documentation fix and a 'fallthrough' macro update"
      
      * tag 'locking-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        futex: Convert to use the preferred 'fallthrough' macro
        Documentation/locking/locktypes: Fix a typo
      eb1319af
  5. 15 8月, 2020 2 次提交