  1. 24 Jun 2020, 1 commit
  2. 05 Jun 2020, 1 commit
  3. 27 May 2020, 6 commits
  4. 10 May 2020, 3 commits
  5. 26 Mar 2020, 2 commits
  6. 27 Nov 2019, 1 commit
    • nvme-rdma: Avoid preallocating big SGL for data · 38e18002
      Authored by Israel Rukshin
      nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
      on SG_CHUNK_SIZE.
      
      Modern DMA engines are often capable of dealing with very big segments,
      so SG_CHUNK_SIZE is frequently larger than necessary; it results in a
      static 4KB SGL allocation per command.
      
      If a controller has lots of deep queues, preallocation for the sg list can
      consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
      128 and each queue's depth 128. This means the resulting preallocation
      for the data SGL is 128*128*4K = 64MB per controller.
      
      Switch to runtime allocation for SGLs with more than 2 entries. This is
      the approach used by NVMe PCI, so it should be reasonable for NVMeOF as
      well. Runtime SGL allocation has always been the case for the legacy I/O
      path, so this is nothing new.
      
      The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
      support SG_CHAIN, use only runtime allocation for the SGL.
      
      We did not notice any performance degradation: small IOs use the inline
      SGL, and for bigger IOs allocating a larger SGL from the slab is fast
      enough.
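
      A minimal sketch of the scheme (hedged: the helper bodies are
      simplified, though NVME_INLINE_SG_CNT and sg_alloc_table_chained()
      follow the upstream naming). A two-entry SGL lives inline in the
      request, and anything larger is chained from the slab at runtime:
      --
      /* Illustrative sketch, not the verbatim patch. */
      #include <linux/blk-mq.h>
      #include <linux/scatterlist.h>

      #define NVME_INLINE_SG_CNT 2 /* small SGL kept inline in the request */

      struct nvme_rdma_request {
          struct sg_table sg_table;
          struct scatterlist first_sgl[NVME_INLINE_SG_CNT];
      };

      static int nvme_rdma_map_data(struct nvme_rdma_request *req,
                                    struct request *rq)
      {
          req->sg_table.sgl = req->first_sgl;
          /*
           * Up to NVME_INLINE_SG_CNT segments use the inline SGL; longer
           * lists are chained from the slab at runtime instead of being
           * statically preallocated per command (the SG_CHUNK_SIZE scheme).
           */
          return sg_alloc_table_chained(&req->sg_table,
                                        blk_rq_nr_phys_segments(rq),
                                        req->sg_table.sgl,
                                        NVME_INLINE_SG_CNT);
      }

      static void nvme_rdma_unmap_data(struct nvme_rdma_request *req)
      {
          sg_free_table_chained(&req->sg_table, NVME_INLINE_SG_CNT);
      }
      --
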
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
  7. 22 Nov 2019, 1 commit
  8. 12 Nov 2019, 1 commit
    • nvme: Add hardware monitoring support · 400b6a7b
      Authored by Guenter Roeck
      nvme devices report temperature information in the controller information
      (for limits) and in the smart log. Currently, the only means to retrieve
      this information is the nvme command line interface, which requires
      super-user privileges.
      
      At the same time, it would be desirable to be able to use NVMe temperature
      information for thermal control.
      
      This patch adds support to read NVMe temperatures from the kernel using the
      hwmon API and adds temperature zones for NVMe drives. The thermal subsystem
      can use this information to set thermal policies, and userspace can access
      it using libsensors and/or the "sensors" command.
      
      Example output from the "sensors" command:
      
      nvme0-pci-0100
      Adapter: PCI adapter
      Composite:    +39.0°C  (high = +85.0°C, crit = +85.0°C)
      Sensor 1:     +39.0°C
      Sensor 2:     +41.0°C
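
      As a hedged sketch of the plumbing (the nvme_hwmon_* names and data
      layout here are illustrative; hwmon_device_register_with_info() and
      the hwmon_chip_info structures are the standard hwmon API), a driver
      exposes such readings roughly like this:
      --
      /* Illustrative sketch of hwmon registration, not the verbatim patch. */
      #include <linux/err.h>
      #include <linux/hwmon.h>

      struct nvme_hwmon_data {
          long temp_mdegc; /* composite temperature, millidegrees Celsius */
      };

      static int nvme_hwmon_read(struct device *dev,
                                 enum hwmon_sensor_types type,
                                 u32 attr, int channel, long *val)
      {
          struct nvme_hwmon_data *data = dev_get_drvdata(dev);

          if (type != hwmon_temp)
              return -EOPNOTSUPP;
          switch (attr) {
          case hwmon_temp_input:
              *val = data->temp_mdegc; /* refreshed from the smart log */
              return 0;
          case hwmon_temp_max:
          case hwmon_temp_crit:
              *val = 85000; /* would come from the controller's WCTEMP/CCTEMP */
              return 0;
          default:
              return -EOPNOTSUPP;
          }
      }

      static umode_t nvme_hwmon_is_visible(const void *data,
                                           enum hwmon_sensor_types type,
                                           u32 attr, int channel)
      {
          return 0444; /* everything read-only */
      }

      static const struct hwmon_ops nvme_hwmon_ops = {
          .is_visible = nvme_hwmon_is_visible,
          .read = nvme_hwmon_read,
      };

      static const struct hwmon_channel_info *nvme_hwmon_info[] = {
          HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT),
          NULL
      };

      static const struct hwmon_chip_info nvme_hwmon_chip_info = {
          .ops = &nvme_hwmon_ops,
          .info = nvme_hwmon_info,
      };

      /* Called at controller init; dev is the controller's struct device. */
      static int nvme_hwmon_init(struct device *dev,
                                 struct nvme_hwmon_data *data)
      {
          struct device *hwmon;

          hwmon = devm_hwmon_device_register_with_info(dev, "nvme", data,
                                                       &nvme_hwmon_chip_info,
                                                       NULL);
          return PTR_ERR_OR_ZERO(hwmon);
      }
      --
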
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
  9. 05 Nov 2019, 3 commits
  10. 14 Oct 2019, 2 commits
  11. 27 Sep 2019, 1 commit
  12. 30 Aug 2019, 6 commits
  13. 21 Aug 2019, 1 commit
  14. 01 Aug 2019, 2 commits
    • nvme: fix controller removal race with scan work · 0157ec8d
      Authored by Sagi Grimberg
      With multipath enabled, nvme_scan_work() can read from the device
      (through nvme_mpath_add_disk()) and hang [1]. With fabrics, once
      ctrl->state is set to NVME_CTRL_DELETING, those reads hang (see
      nvmf_check_ready()), and the mpath stack device's make_request blocks
      as long as head->list is not empty. However, when head->list consists
      only of DELETING/DEAD controllers, we should not block but rather fail
      immediately.
      
      In addition, before we go ahead and remove the namespaces, make sure
      to clear the current path and kick the requeue list so that the
      request will fast fail upon requeuing.
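
      A sketch of the fail-fast check (close in shape to the helper this fix
      introduces, simplified here; the surrounding types come from
      drivers/nvme/host/nvme.h):
      --
      /* Illustrative sketch; types come from drivers/nvme/host/nvme.h. */
      static bool nvme_available_path(struct nvme_ns_head *head)
      {
          struct nvme_ns *ns;

          list_for_each_entry_rcu(ns, &head->list, siblings) {
              switch (ns->ctrl->state) {
              case NVME_CTRL_LIVE:
              case NVME_CTRL_RESETTING:
              case NVME_CTRL_CONNECTING:
                  /* This path may still become usable: keep requeueing. */
                  return true;
              default:
                  break;
              }
          }
          return false;
      }

      /*
       * In the mpath make_request path, when no live path is found:
       * requeue if nvme_available_path(head), otherwise end the bio
       * with BLK_STS_IOERR so it fails fast instead of blocking.
       */
      --
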
      
      [1]:
      --
        INFO: task kworker/u4:3:166 blocked for more than 120 seconds.
              Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:3    D    0   166      2 0x80004000
        Workqueue: nvme-wq nvme_scan_work
        Call Trace:
         __schedule+0x851/0x1400
         schedule+0x99/0x210
         io_schedule+0x21/0x70
         do_read_cache_page+0xa57/0x1330
         read_cache_page+0x4a/0x70
         read_dev_sector+0xbf/0x380
         amiga_partition+0xc4/0x1230
         check_partition+0x30f/0x630
         rescan_partitions+0x19a/0x980
         __blkdev_get+0x85a/0x12f0
         blkdev_get+0x2a5/0x790
         __device_add_disk+0xe25/0x1250
         device_add_disk+0x13/0x20
         nvme_mpath_set_live+0x172/0x2b0
         nvme_update_ns_ana_state+0x130/0x180
         nvme_set_ns_ana_state+0x9a/0xb0
         nvme_parse_ana_log+0x1c3/0x4a0
         nvme_mpath_add_disk+0x157/0x290
         nvme_validate_ns+0x1017/0x1bd0
         nvme_scan_work+0x44d/0x6a0
         process_one_work+0x7d7/0x1240
         worker_thread+0x8e/0xff0
         kthread+0x2c3/0x3b0
         ret_from_fork+0x35/0x40
      
         INFO: task kworker/u4:1:1034 blocked for more than 120 seconds.
              Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:1    D    0  1034      2 0x80004000
        Workqueue: nvme-delete-wq nvme_delete_ctrl_work
        Call Trace:
         __schedule+0x851/0x1400
         schedule+0x99/0x210
         schedule_timeout+0x390/0x830
         wait_for_completion+0x1a7/0x310
         __flush_work+0x241/0x5d0
         flush_work+0x10/0x20
         nvme_remove_namespaces+0x85/0x3d0
         nvme_do_delete_ctrl+0xb4/0x1e0
         nvme_delete_ctrl_work+0x15/0x20
         process_one_work+0x7d7/0x1240
         worker_thread+0x8e/0xff0
         kthread+0x2c3/0x3b0
         ret_from_fork+0x35/0x40
      --
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme: fix a possible deadlock when passthru commands sent to a multipath device · b9156dae
      Authored by Sagi Grimberg
      When the user issues a command with side effects, we will end up freezing
      the namespace request queue when updating disk info (and the same for
      the corresponding mpath disk node).
      
      However, we are not freezing the mpath node request queue,
      which means that mpath I/O can still come in and block on blk_queue_enter
      (called from nvme_ns_head_make_request -> direct_make_request).
      
      This is a deadlock: blk_queue_enter will block until the inner
      namespace request queue is unfrozen, but that will never happen,
      because the namespace revalidation is trying to update the mpath disk
      info and freeze its request queue, which in turn cannot complete
      because of the I/O blocked on blk_queue_enter.
      
      Fix this by freezing all of the subsystem's nshead request queues
      before executing the passthru command. Given that these commands are
      infrequent, this temporary I/O freeze is an acceptable price for
      keeping things sane.
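
      A sketch of the freeze/unfreeze helpers (mirroring the shape of what
      the fix adds, with details simplified): walk every namespace head in
      the subsystem and freeze, wait on, or unfreeze its queue.
      --
      /* Illustrative sketch; subsys->lock is held by the caller. */
      #include <linux/blk-mq.h>

      static void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
      {
          struct nvme_ns_head *h;

          lockdep_assert_held(&subsys->lock);
          list_for_each_entry(h, &subsys->nsheads, entry)
              if (h->disk)
                  blk_freeze_queue_start(h->disk->queue);
      }

      static void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys)
      {
          struct nvme_ns_head *h;

          lockdep_assert_held(&subsys->lock);
          list_for_each_entry(h, &subsys->nsheads, entry)
              if (h->disk)
                  blk_mq_freeze_queue_wait(h->disk->queue);
      }

      static void nvme_mpath_unfreeze(struct nvme_subsystem *subsys)
      {
          struct nvme_ns_head *h;

          lockdep_assert_held(&subsys->lock);
          list_for_each_entry(h, &subsys->nsheads, entry)
              if (h->disk)
                  blk_mq_unfreeze_queue(h->disk->queue);
      }
      --
      The passthru path brackets the command with start_freeze and
      wait_freeze before issuing it and unfreeze afterwards, so no mpath I/O
      can enter the frozen inner queues mid-update.
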
      
      Here are the matching hang traces:
      --
      [ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds.
      [ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42
      [ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 374.487274] systemd-udevd D 0 17994 1 0x00000000
      [ 374.493407] Call Trace:
      [ 374.496145] __schedule+0x2ef/0x620
      [ 374.500047] schedule+0x38/0xa0
      [ 374.503569] blk_queue_enter+0x139/0x220
      [ 374.507959] ? remove_wait_queue+0x60/0x60
      [ 374.512540] direct_make_request+0x60/0x130
      [ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core]
      [ 374.523740] ? generic_make_request_checks+0x307/0x6f0
      [ 374.529484] generic_make_request+0x10d/0x2e0
      [ 374.534356] submit_bio+0x75/0x140
      [ 374.538163] ? guard_bio_eod+0x32/0xe0
      [ 374.542361] submit_bh_wbc+0x171/0x1b0
      [ 374.546553] block_read_full_page+0x1ed/0x330
      [ 374.551426] ? check_disk_change+0x70/0x70
      [ 374.556008] ? scan_shadow_nodes+0x30/0x30
      [ 374.560588] blkdev_readpage+0x18/0x20
      [ 374.564783] do_read_cache_page+0x301/0x860
      [ 374.569463] ? blkdev_writepages+0x10/0x10
      [ 374.574037] ? prep_new_page+0x88/0x130
      [ 374.578329] ? get_page_from_freelist+0xa2f/0x1280
      [ 374.583688] ? __alloc_pages_nodemask+0x179/0x320
      [ 374.588947] read_cache_page+0x12/0x20
      [ 374.593142] read_dev_sector+0x2d/0xd0
      [ 374.597337] read_lba+0x104/0x1f0
      [ 374.601046] find_valid_gpt+0xfa/0x720
      [ 374.605243] ? string_nocheck+0x58/0x70
      [ 374.609534] ? find_valid_gpt+0x720/0x720
      [ 374.614016] efi_partition+0x89/0x430
      [ 374.618113] ? string+0x48/0x60
      [ 374.621632] ? snprintf+0x49/0x70
      [ 374.625339] ? find_valid_gpt+0x720/0x720
      [ 374.629828] check_partition+0x116/0x210
      [ 374.634214] rescan_partitions+0xb6/0x360
      [ 374.638699] __blkdev_reread_part+0x64/0x70
      [ 374.643377] blkdev_reread_part+0x23/0x40
      [ 374.647860] blkdev_ioctl+0x48c/0x990
      [ 374.651956] block_ioctl+0x41/0x50
      [ 374.655766] do_vfs_ioctl+0xa7/0x600
      [ 374.659766] ? locks_lock_inode_wait+0xb1/0x150
      [ 374.664832] ksys_ioctl+0x67/0x90
      [ 374.668539] __x64_sys_ioctl+0x1a/0x20
      [ 374.672732] do_syscall_64+0x5a/0x1c0
      [ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds.
      [ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42
      [ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 374.760170] nvmeadm D 0 49141 36333 0x00004080
      [ 374.766301] Call Trace:
      [ 374.769038] __schedule+0x2ef/0x620
      [ 374.772939] schedule+0x38/0xa0
      [ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100
      [ 374.781614] ? remove_wait_queue+0x60/0x60
      [ 374.786192] blk_mq_freeze_queue+0x1a/0x20
      [ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core]
      [ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core]
      [ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core]
      [ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core]
      [ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core]
      [ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core]
      [ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core]
      [ 374.833114] do_vfs_ioctl+0xa7/0x600
      [ 374.837117] ? __audit_syscall_entry+0xdd/0x130
      [ 374.842184] ksys_ioctl+0x67/0x90
      [ 374.845891] __x64_sys_ioctl+0x1a/0x20
      [ 374.850082] do_syscall_64+0x5a/0x1c0
      [ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      --
      Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Tested-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  15. 23 Jul 2019, 1 commit
    • nvme: fix multipath crash when ANA is deactivated · 66b20ac0
      Authored by Marta Rybczynska
      Fix a crash with multipath activated. It happens when the ANA log page
      is larger than MDTS, which causes ANA to be disabled. The driver then
      tries to access an unallocated buffer when connecting to an nvme
      target. The signature is as follows:
      
      [  300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
      [  300.435387] nvme nvme0: disabling ANA support.
      [  300.437835] nvme nvme0: creating 4 I/O queues.
      [  300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
      [  300.464609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  300.466342] #PF error: [normal kernel read fault]
      [  300.467385] PGD 0 P4D 0
      [  300.467987] Oops: 0000 [#1] SMP PTI
      [  300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
      [  300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [  300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
      [  300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
      [  300.477374] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296
      [  300.478334] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000
      [  300.479784] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258
      [  300.481488] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
      [  300.483203] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0
      [  300.484928] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0
      [  300.486626] FS:  0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000
      [  300.488538] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  300.489907] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0
      [  300.491612] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  300.493303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  300.494991] Call Trace:
      [  300.495645]  nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
      [  300.496880]  nvme_validate_ns+0x2ef/0x550 [nvme_core]
      [  300.498105]  ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
      [  300.499539]  nvme_scan_work+0x2b4/0x370 [nvme_core]
      [  300.500717]  ? __switch_to_asm+0x35/0x70
      [  300.501663]  process_one_work+0x171/0x380
      [  300.502340]  worker_thread+0x49/0x3f0
      [  300.503079]  kthread+0xf8/0x130
      [  300.503795]  ? max_active_store+0x80/0x80
      [  300.504690]  ? kthread_bind+0x10/0x10
      [  300.505502]  ret_from_fork+0x35/0x40
      [  300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net virtio_console net_failover virtio_blk failover ata_piix serio_raw libata virtio_pci virtio_ring virtio
      [  300.514984] CR2: 0000000000000008
      [  300.515569] ---[ end trace faa2eefad7e7f218 ]---
      [  300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
      [  300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
      [  300.520353] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296
      [  300.521229] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000
      [  300.522399] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258
      [  300.523560] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
      [  300.524734] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0
      [  300.525915] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0
      [  300.527084] FS:  0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000
      [  300.528396] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  300.529440] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0
      [  300.530739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  300.531989] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  300.533264] Kernel panic - not syncing: Fatal exception
      [  300.534338] Kernel Offset: 0x17c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Condition check refactoring from Christoph Hellwig.
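
      A sketch of the resulting guard (close to the refactored condition
      check in shape, simplified here): only parse the ANA log when its
      buffer was actually allocated, and otherwise mark the path optimized
      directly.
      --
      /* Illustrative sketch of the guard, not the verbatim patch. */
      static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
      {
          /* NULL when ANA was disabled, e.g. log page size > MDTS. */
          return ctrl->ana_log_buf != NULL;
      }

      static void nvme_mpath_add_disk(struct nvme_ns *ns,
                                      struct nvme_id_ns *id)
      {
          if (nvme_ctrl_use_ana(ns->ctrl)) {
              mutex_lock(&ns->ctrl->ana_lock);
              ns->ana_grpid = le32_to_cpu(id->anagrpid);
              nvme_parse_ana_log(ns->ctrl, ns, nvme_set_ns_ana_state);
              mutex_unlock(&ns->ctrl->ana_lock);
          } else {
              /* No ANA log to walk; treat the path as plainly optimized. */
              ns->ana_state = NVME_ANA_OPTIMIZED;
              nvme_mpath_set_live(ns);
          }
      }
      --
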
      Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
      Tested-by: Jean-Baptiste Riaux <jbriaux@kalray.eu>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  16. 10 Jul 2019, 1 commit
    • nvme: set physical block size and optimal I/O size · 81adb863
      Authored by Bart Van Assche
      From the NVMe 1.4 spec:
      
      NSFEAT bit 4 if set to 1: indicates that the fields NPWG, NPWA, NPDG, NPDA,
      and NOWS are defined for this namespace and should be used by the host for
      I/O optimization;
      [ ... ]
      Namespace Preferred Write Granularity (NPWG): This field indicates the
      smallest recommended write granularity in logical blocks for this namespace.
      This is a 0's based value. The size indicated should be less than or equal
      to Maximum Data Transfer Size (MDTS) that is specified in units of minimum
      memory page size. The value of this field may change if the namespace is
      reformatted. The size should be a multiple of Namespace Preferred Write
      Alignment (NPWA). Refer to section 8.25 for how this field is utilized to
      improve performance and endurance.
      [ ... ]
      Each Write, Write Uncorrectable, or Write Zeroes commands should address a
      multiple of Namespace Preferred Write Granularity (NPWG) (refer to Figure
      245) and Stream Write Size (SWS) (refer to Figure 515) logical blocks (as
      expressed in the NLB field), and the SLBA field of the command should be
      aligned to Namespace Preferred Write Alignment (NPWA) (refer to Figure 245)
      for best performance.
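
      Mapped to the block layer, this amounts to deriving the physical block
      size and optimal I/O size from NPWG and NOWS. A hedged sketch (field
      handling simplified from nvme_update_disk_info(); both fields are 0's
      based, so a stored N means N + 1 logical blocks):
      --
      /* Illustrative sketch, not the verbatim patch. */
      #include <linux/blkdev.h>

      static void nvme_set_write_limits(struct gendisk *disk,
                                        struct nvme_id_ns *id,
                                        unsigned int bs) /* logical block size */
      {
          u32 phys_bs = bs, io_opt = bs;

          if (id->nsfeat & (1 << 4)) { /* NSFEAT bit 4: NPWG/NPWA/NOWS valid */
              /* 0's based fields: stored value N means N + 1 logical blocks. */
              phys_bs = bs * (1 + le16_to_cpu(id->npwg));
              io_opt = bs * (1 + le16_to_cpu(id->nows));
          }

          blk_queue_physical_block_size(disk->queue, phys_bs);
          blk_queue_io_min(disk->queue, phys_bs);
          blk_queue_io_opt(disk->queue, io_opt);
      }
      --
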
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  17. 21 Jun 2019, 3 commits
  18. 18 May 2019, 1 commit
  19. 01 May 2019, 1 commit
  20. 14 Mar 2019, 1 commit
  21. 20 Feb 2019, 1 commit