1. 12 July, 2019 5 commits
    • sd_zbc: Fix report zones buffer allocation · b091ac61
      Committed by Damien Le Moal
      During disk scan and revalidation done with sd_revalidate(), the zones
      of a zoned disk are checked using the helper function
      blk_revalidate_disk_zones() if a configuration change is detected
      (change in the number of zones or zone size). The function
      blk_revalidate_disk_zones() issues report_zones calls that are very
      large, that is, to obtain zone information for all zones of the disk
      with a single command. The size of the report zones command buffer
      necessary for such a large request is generally lower than the disk
      max_hw_sectors and KMALLOC_MAX_SIZE (4MB), so the allocation succeeds
      on boot (no memory fragmentation) but often fails at run time (e.g. on
      a hot-plug event). This causes the disk revalidation to fail and the
      disk capacity to be changed to 0.
      
      This problem can be avoided by using vmalloc() instead of kmalloc() for
      the buffer allocation. To limit the amount of memory to be allocated,
      this patch also introduces the arbitrary SD_ZBC_REPORT_MAX_ZONES
      maximum number of zones to report with a single report zones command.
      This limit may be lowered further to satisfy the disk max_hw_sectors
      limit. Finally, to ensure that the vmalloc-ed buffer can always be
      mapped in a request, the buffer size is further limited to at most
      queue_max_segments() pages, allowing successful mapping of the buffer
      even in the worst case scenario where none of the buffer pages are
      contiguous.
      
      Fixes: 515ce606 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
      Fixes: e76239a3 ("block: add a report_zones method")
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
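The three size limits described in this commit message can be sketched in user space as follows. This is only an illustration of the arithmetic, not the driver code: the function name `report_zones_bufsize()`, the cap value `REPORT_MAX_ZONES`, and the 4 KiB page size are assumptions made for the example. The 64-byte header and 64-byte zone descriptor sizes follow the ZBC report zones reply layout.

```c
#include <stddef.h>

#define SECTOR_SHIFT     9      /* 512-byte sectors */
#define PAGE_SHIFT       12     /* assumption: 4 KiB pages */
#define ZONE_DESC_SIZE   64     /* report header and each zone descriptor: 64 B */
#define REPORT_MAX_ZONES 8192u  /* hypothetical stand-in for SD_ZBC_REPORT_MAX_ZONES */

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Compute a report zones buffer size that respects all three limits. */
size_t report_zones_bufsize(unsigned int nr_zones,
                            unsigned int max_hw_sectors,
                            unsigned int max_segments)
{
    size_t bufsize;

    /* 1. Cap the number of zones reported by a single command. */
    if (nr_zones > REPORT_MAX_ZONES)
        nr_zones = REPORT_MAX_ZONES;

    /* One header plus one descriptor per zone, rounded up to 512 B. */
    bufsize = ((size_t)nr_zones + 1) * ZONE_DESC_SIZE;
    bufsize = (bufsize + 511) & ~(size_t)511;

    /* 2. Never exceed the largest transfer the device accepts. */
    bufsize = min_size(bufsize, (size_t)max_hw_sectors << SECTOR_SHIFT);

    /* 3. Worst case: every page of the buffer needs its own segment. */
    bufsize = min_size(bufsize, (size_t)max_segments << PAGE_SHIFT);

    return bufsize;
}
```

With these assumed values, a request for 100000 zones on a queue with 128 segments is bounded by the segment limit (128 pages), while a small request for 100 zones is bounded only by its own rounded size.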
    • block: Kill gfp_t argument of blkdev_report_zones() · bd976e52
      Committed by Damien Le Moal
      Only GFP_KERNEL and GFP_NOIO are used with blkdev_report_zones(). In
      preparation for using vmalloc() for large report buffer and zone array
      allocations used by this function, remove its "gfp_t gfp_mask" argument
      and rely on the caller context to use memalloc_noio_save/restore() where
      necessary (block layer zone revalidation and dm-zoned I/O error path).
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Allow mapping of vmalloc-ed buffers · b4c5875d
      Committed by Damien Le Moal
      To allow the SCSI subsystem scsi_execute_req() function to issue
      requests using large buffers that are better allocated with vmalloc()
      rather than kmalloc(), modify bio_map_kern() to allow passing a buffer
      allocated with vmalloc().
      
      To do so, detect vmalloc-ed buffers using is_vmalloc_addr(). For
      vmalloc-ed buffers, flush the buffer using flush_kernel_vmap_range(),
      use vmalloc_to_page() instead of virt_to_page() to obtain the pages of
      the buffer, and invalidate the buffer addresses with
      invalidate_kernel_vmap_range() on completion of read BIOs. This last
      point is executed using the function bio_invalidate_vmalloc_pages()
      which is defined only if the architecture defines
      ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE, that is, if the architecture
      actually needs the invalidation done.
      
      Fixes: 515ce606 ("scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation")
      Fixes: e76239a3 ("block: add a report_zones method")
      Cc: stable@vger.kernel.org
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/bio-integrity: fix a memory leak bug · e7bf90e5
      Committed by Wenwen Wang
      In bio_integrity_prep(), a kernel buffer is allocated through kmalloc() to
      hold integrity metadata. Later on, the buffer will be attached to the bio
      structure through bio_integrity_add_page(), which returns the number of
      bytes of integrity metadata attached. Due to unexpected situations,
      bio_integrity_add_page() may return 0. As a result, bio_integrity_prep()
      needs to be terminated with 'false' returned to indicate this error.
      However, the allocated kernel buffer is not freed on this execution path,
      leading to a memory leak.
      
      To fix this issue, free the allocated buffer before returning from
      bio_integrity_prep().
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
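The leak pattern and its fix can be illustrated with a small user-space sketch. Everything here is a hypothetical stand-in: `attach_metadata()` plays the role of bio_integrity_add_page() (returning the number of bytes attached, 0 on failure), `prep_integrity()` plays the role of bio_integrity_prep(), and `malloc()`/`free()` replace the kernel allocator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for bio_integrity_add_page(): on success it stores
 * the buffer in *slot and returns the number of bytes attached; on failure
 * it returns 0 and does not take ownership of the buffer.
 */
static size_t attach_metadata(void **slot, void *buf, size_t len, bool fail)
{
    if (fail)
        return 0;
    *slot = buf;
    return len;
}

/* Hypothetical stand-in for bio_integrity_prep(). */
bool prep_integrity(void **slot, size_t len, bool fail)
{
    void *buf = malloc(len);

    if (!buf)
        return false;

    if (attach_metadata(slot, buf, len, fail) == 0) {
        /* The fix: release the buffer before taking the error path. */
        free(buf);
        return false;
    }

    /* Success: ownership of buf has moved to *slot. */
    return true;
}
```

The point of the sketch is that the buffer has exactly one owner at every exit: the caller's slot on success, nobody (freed) on failure.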
    • nvme: fix NULL deref for fabrics options · 7d30c81b
      Committed by Minwoo Im
      The nvme-5.3 branch of git://git.infradead.org/nvme.git now causes the
      following NULL dereference oops. Check ctrl->opts before dereferencing
      it.
      
      [   16.337581] BUG: kernel NULL pointer dereference, address: 0000000000000056
      [   16.338551] #PF: supervisor read access in kernel mode
      [   16.338551] #PF: error_code(0x0000) - not-present page
      [   16.338551] PGD 0 P4D 0
      [   16.338551] Oops: 0000 [#1] SMP PTI
      [   16.338551] CPU: 2 PID: 1035 Comm: kworker/u16:5 Not tainted 5.2.0-rc6+ #1
      [   16.338551] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
      [   16.338551] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [   16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core]
      [   16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 <80> 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf
      [   16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283
      [   16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007
      [   16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720
      [   16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840
      [   16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8
      [   16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001
      [   16.338551] FS:  0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000
      [   16.338551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0
      [   16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   16.338551] Call Trace:
      [   16.338551]  nvme_scan_work+0x2c0/0x340 [nvme_core]
      [   16.338551]  ? __switch_to_asm+0x40/0x70
      [   16.338551]  ? _raw_spin_unlock_irqrestore+0x18/0x30
      [   16.338551]  ? try_to_wake_up+0x408/0x450
      [   16.338551]  process_one_work+0x20b/0x3e0
      [   16.338551]  worker_thread+0x1f9/0x3d0
      [   16.338551]  ? cancel_delayed_work+0xa0/0xa0
      [   16.338551]  kthread+0x117/0x120
      [   16.338551]  ? kthread_stop+0xf0/0xf0
      [   16.338551]  ret_from_fork+0x3a/0x50
      [   16.338551] Modules linked in: nvme nvme_core
      [   16.338551] CR2: 0000000000000056
      [   16.338551] ---[ end trace b9bf761a93e62d84 ]---
      [   16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core]
      [   16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 <80> 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf
      [   16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283
      [   16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007
      [   16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720
      [   16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840
      [   16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8
      [   16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001
      [   16.338551] FS:  0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000
      [   16.338551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0
      [   16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fixes: 958f2a0f ("nvme-tcp: set the STABLE_WRITES flag when data digests are enabled")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
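In the oops above, RDX is 0 and the faulting bytes `<80> 7a 56 00` decode to a byte compare at offset 0x56 from RDX, which matches the reported address 0x56: a field is being read through a NULL pointer. A guarded check of the kind the message describes might look like the sketch below. The struct layouts and the function name `needs_stable_writes()` are simplified assumptions for illustration; only the idea of testing the options pointer before reading through it comes from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the fabrics options and controller. */
struct fabrics_opts {
    bool data_digest;
};

struct ctrl {
    /* Assumed to be NULL for controllers that have no fabrics options. */
    struct fabrics_opts *opts;
};

/* Test the pointer first; && short-circuits, so opts is never read if NULL. */
bool needs_stable_writes(const struct ctrl *ctrl)
{
    return ctrl->opts != NULL && ctrl->opts->data_digest;
}
```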
  2. 11 July, 2019 7 commits
    • Merge branch 'nvme-5.3' of git://git.infradead.org/nvme into for-linus · b7403066
      Committed by Jens Axboe
      Pull NVMe fixes from Christoph:
      
      "Lots of fixes all over the place, and two very minor features that
       were in the nvme tree by the end of the merge window, but hadn't made
       it out to Jens yet."
      
      * 'nvme-5.3' of git://git.infradead.org/nvme:
        nvme: fix regression upon hot device removal and insertion
        nvme-fc: fix module unloads while lports still pending
        nvme-tcp: don't use sendpage for SLAB pages
        nvme-tcp: set the STABLE_WRITES flag when data digests are enabled
        nvmet: print a hint while rejecting NSID 0 or 0xffffffff
        nvme-multipath: do not select namespaces which are about to be removed
        nvme-multipath: also check for a disabled path if there is a single sibling
        nvme-multipath: factor out a nvme_path_is_disabled helper
        nvme: set physical block size and optimal I/O size
        nvme: add I/O characteristics fields
        nvmet: export I/O characteristics attributes in Identify
        nvme-trace: add delete completion and submission queue to admin cmds tracer
        nvme-trace: fix spelling mistake "spcecific" -> "specific"
        nvme-pci: limit max_hw_sectors based on the DMA max mapping size
        nvme-pci: check for NULL return from pci_alloc_p2pmem()
        nvme-pci: don't create a read hctx mapping without read queues
        nvme-pci: don't fall back to a 32-bit DMA mask
        nvme-pci: make nvme_dev_pm_ops static
        nvme-fcloop: resolve warnings on RCU usage and sleep warnings
        nvme-fcloop: fix inconsistent lock state warnings
    • nbd: add netlink reconfigure resize support · 4ddeaae8
      Committed by Mike Christie
      If the device is set up with ioctl, we can resize the device after the
      initial setup, but if the device is set up with netlink, we cannot use
      the resize-related ioctls and there is no netlink reconfigure size ATTR
      handling code.
      
      This patch adds netlink reconfigure resize support to match the ioctl
      interface.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Mike Christie <mchristi@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nbd: fix crash when the blksize is zero · 553768d1
      Committed by Xiubo Li
      Allow the blksize to be set to zero, in which case 1024 is used as the
      default.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Xiubo Li <xiubli@redhat.com>
      [fix to use goto out instead of return in genl_connect]
      Signed-off-by: Mike Christie <mchristi@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Disable write plugging for zoned block devices · b49773e7
      Committed by Damien Le Moal
      Simultaneously writing to a sequential zone of a zoned block device
      from multiple contexts requires mutual exclusion for BIO issuing to
      ensure that writes happen sequentially. However, even for a
      well-behaved user correctly implementing such synchronization, BIO
      plugging may interfere and cause BIOs from the different contexts to be
      reordered if plugging is done outside of the mutual exclusion section,
      e.g. if the plug was started by a function higher in the call chain
      than the function issuing BIOs.
      
               Context A                     Context B
      
         | blk_start_plug()
         | ...
         | seq_write_zone()
           | mutex_lock(zone)
           | bio-0->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-0)
           | submit_bio(bio-0)
           | bio-1->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-1)
           | submit_bio(bio-1)
           | mutex_unlock(zone)
           | return
         | -----------------------> | seq_write_zone()
                                      | mutex_lock(zone)
                                      | bio-2->bi_iter.bi_sector = zone->wp
                                      | zone->wp += bio_sectors(bio-2)
                                      | submit_bio(bio-2)
                                      | mutex_unlock(zone)
         | <------------------------- |
         | blk_finish_plug()
      
      In the above example, despite the mutex synchronization ensuring the
      correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
      issued after BIO 2 of context B, when the plug is released with
      blk_finish_plug().
      
      While this problem can be addressed using the blk_flush_plug_list()
      function (in the above example, the call must be inserted before the
      zone mutex lock is released), a simple generic solution in the block
      layer avoids the need for this additional code in all zoned block
      device user code. The solution implemented with this patch is to
      introduce the internal helper function blk_mq_plug() to access the
      current context plug on BIO submission. This helper returns the current
      plug only if the target device is not a zoned block device or if the
      BIO to be plugged is not a write operation. Otherwise, the caller
      context plug is ignored and NULL is returned, so that writes to zoned
      block devices are never plugged.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
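The decision rule of the helper described above can be modeled in user space as follows. This is a sketch of the rule only: `struct plug`, `struct queue`, `struct bio_desc`, and `pick_plug()` are hypothetical simplified types and names, and the current plug is passed in explicitly instead of being read from the running task.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the block layer objects. */
struct plug {
    int pending;        /* number of BIOs held back by the plug */
};

struct queue {
    bool zoned;         /* true for a zoned block device */
};

struct bio_desc {
    bool is_write;      /* true for write operations */
};

/*
 * Return the caller's plug unless the BIO is a write to a zoned device,
 * in which case NULL is returned and the BIO is issued without plugging.
 */
struct plug *pick_plug(const struct queue *q, const struct bio_desc *bio,
                       struct plug *current_plug)
{
    if (!q->zoned || !bio->is_write)
        return current_plug;
    return NULL;
}
```

Because the rule lives in one helper, callers that already hold a plug do not need to know whether the target is zoned; reads and writes to regular devices keep the batching benefit of plugging.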
    • block: Fix elevator name declaration · 9305d5d7
      Committed by Damien Le Moal
      The elevator_name field in struct elevator_type is declared as an array
      of characters (ELV_NAME_MAX size) but in practice is used as a string
      pointer, with its initialization done statically within each
      elevator's elevator_type structure declaration.
      
      Change the declaration of elevator_name to the more appropriate
      "const char *" type.
      Acked-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
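The difference between the two declarations can be seen in a reduced example. The structs below keep only the name field, and the value 16 for `ELV_NAME_MAX`, the scheduler name, and the helper `elv_name_matches()` are assumptions made for illustration.

```c
#include <string.h>

#define ELV_NAME_MAX 16  /* assumed value, for illustration only */

/* Before: storage for the name is embedded in every structure. */
struct elevator_type_old {
    char elevator_name[ELV_NAME_MAX];
};

/* After: the field simply points at a statically allocated string. */
struct elevator_type_new {
    const char *elevator_name;
};

/* Static initialization looks the same in both cases. */
static const struct elevator_type_old sched_old = {
    .elevator_name = "mq-deadline",  /* copied into the array */
};

static const struct elevator_type_new sched_new = {
    .elevator_name = "mq-deadline",  /* points at the string literal */
};

/* Lookup by name works on the pointer form without any size limit. */
int elv_name_matches(const struct elevator_type_new *e, const char *name)
{
    return strcmp(e->elevator_name, name) == 0;
}
```

Since the name is only ever set by a static initializer and then read, the pointer form expresses the actual usage and removes the fixed length limit on the name.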
    • block: Remove unused definitions · 36847a00
      Committed by Damien Le Moal
      The ELV_MQUEUE_XXX definitions in include/linux/elevator.h are unused
      since the removal of elevator_may_queue_fn in kernel 5.0. Remove these
      definitions and also remove the documentation of elevator_may_queue_fn
      in Documentation/block/biodoc.txt.
      Acked-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme: fix regression upon hot device removal and insertion · 420dc733
      Committed by Sagi Grimberg
      When we validate the new controller id, we want to skip
      controllers that are either deleting or dead. Fix the check
      so that it tests the state of the existing controllers rather than
      that of the newly added controller.
      
      Fixes: 1b1031ca ("nvme: validate cntlid during controller initialisation")
      Reported-by: Jon Derrick <jonathan.derrick@intel.com>
      Tested-by: Jon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
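A minimal sketch of such a duplicate controller id check is shown below, assuming a flat array of existing controllers. The types, state names, and the function `cntlid_is_unique()` are hypothetical; the sketch only shows that the state test belongs on the existing controller being compared, not on the new one.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified controller model. */
enum ctrl_state { CTRL_LIVE, CTRL_DELETING, CTRL_DEAD };

struct sketch_ctrl {
    unsigned short cntlid;
    enum ctrl_state state;
};

/* Return true if new_ctrl's id does not clash with any usable controller. */
bool cntlid_is_unique(const struct sketch_ctrl *existing, size_t count,
                      const struct sketch_ctrl *new_ctrl)
{
    for (size_t i = 0; i < count; i++) {
        const struct sketch_ctrl *tmp = &existing[i];

        /* Skip controllers on their way out: test tmp, not new_ctrl. */
        if (tmp->state == CTRL_DELETING || tmp->state == CTRL_DEAD)
            continue;

        if (tmp->cntlid == new_ctrl->cntlid)
            return false;
    }
    return true;
}
```

With this shape, a device that is hot-removed and re-inserted can reuse the id of its previous controller instance while that instance is still being torn down.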
  3. 10 July, 2019 28 commits