1. 01 Aug 2019, 1 commit
    • nvme-core: Fix extra device_put() call on error path · 8c36e66f
      Authored by Logan Gunthorpe
      In the error path for nvme_init_subsystem(), nvme_put_subsystem()
      will call put_device(), but put_device() will then get called
      again after the mutex_unlock().
      
      put_device() only needs to be called here if device_add() fails.
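
      A condensed sketch of the corrected flow, assuming the function takes
      nvme_subsystems_lock and registers the device (the helper name, labels,
      and list handling below are illustrative, not the driver's exact code):

        static int nvme_init_subsystem_sketch(struct nvme_subsystem *subsys)
        {
                int ret;

                mutex_lock(&nvme_subsystems_lock);

                ret = device_add(&subsys->dev);
                if (ret) {
                        /* device_add() failed: this is now the only
                         * place the error path drops the reference */
                        put_device(&subsys->dev);
                        goto out_unlock;
                }
                list_add_tail(&subsys->entry, &nvme_subsystems);

        out_unlock:
                mutex_unlock(&nvme_subsystems_lock);
                /* the unconditional put_device() that used to sit here
                 * is gone, so the reference can no longer be dropped a
                 * second time after nvme_put_subsystem() already ran */
                return ret;
        }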
      
      This bug caused a KASAN use-after-free error when adding and
      removing subsystems in a loop:
      
        BUG: KASAN: use-after-free in device_del+0x8d9/0x9a0
        Read of size 8 at addr ffff8883cdaf7120 by task multipathd/329
      
        CPU: 0 PID: 329 Comm: multipathd Not tainted 5.2.0-rc6-vmlocalyes-00019-g70a2b39005fd-dirty #314
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        Call Trace:
         dump_stack+0x7b/0xb5
         print_address_description+0x6f/0x280
         ? device_del+0x8d9/0x9a0
         __kasan_report+0x148/0x199
         ? device_del+0x8d9/0x9a0
         ? class_release+0x100/0x130
         ? device_del+0x8d9/0x9a0
         kasan_report+0x12/0x20
         __asan_report_load8_noabort+0x14/0x20
         device_del+0x8d9/0x9a0
         ? device_platform_notify+0x70/0x70
         nvme_destroy_subsystem+0xf9/0x150
         nvme_free_ctrl+0x280/0x3a0
         device_release+0x72/0x1d0
         kobject_put+0x144/0x410
         put_device+0x13/0x20
         nvme_free_ns+0xc4/0x100
         nvme_release+0xb3/0xe0
         __blkdev_put+0x549/0x6e0
         ? kasan_check_write+0x14/0x20
         ? bd_set_size+0xb0/0xb0
         ? kasan_check_write+0x14/0x20
         ? mutex_lock+0x8f/0xe0
         ? __mutex_lock_slowpath+0x20/0x20
         ? locks_remove_file+0x239/0x370
         blkdev_put+0x72/0x2c0
         blkdev_close+0x8d/0xd0
         __fput+0x256/0x770
         ? _raw_read_lock_irq+0x40/0x40
         ____fput+0xe/0x10
         task_work_run+0x10c/0x180
         ? filp_close+0xf7/0x140
         exit_to_usermode_loop+0x151/0x170
         do_syscall_64+0x240/0x2e0
         ? prepare_exit_to_usermode+0xd5/0x190
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f5a79af05d7
        Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f c3 66 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 c4 fb ff ff 89 df 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 06 fc ff ff 8b 44 24
        RSP: 002b:00007f5a7799c810 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
        RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007f5a79af05d7
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000008
        RBP: 00007f5a58000f98 R08: 0000000000000002 R09: 00007f5a7935ee80
        R10: 0000000000000000 R11: 0000000000000293 R12: 000055e432447240
        R13: 0000000000000000 R14: 0000000000000001 R15: 000055e4324a9cf0
      
        Allocated by task 1236:
         save_stack+0x21/0x80
         __kasan_kmalloc.constprop.6+0xab/0xe0
         kasan_kmalloc+0x9/0x10
         kmem_cache_alloc_trace+0x102/0x210
         nvme_init_identify+0x13c3/0x3820
         nvme_loop_configure_admin_queue+0x4fa/0x5e0
         nvme_loop_create_ctrl+0x469/0xf40
         nvmf_dev_write+0x19a3/0x21ab
         __vfs_write+0x66/0x120
         vfs_write+0x154/0x490
         ksys_write+0x104/0x240
         __x64_sys_write+0x73/0xb0
         do_syscall_64+0xa5/0x2e0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 329:
         save_stack+0x21/0x80
         __kasan_slab_free+0x129/0x190
         kasan_slab_free+0xe/0x10
         kfree+0xa7/0x200
         nvme_release_subsystem+0x49/0x60
         device_release+0x72/0x1d0
         kobject_put+0x144/0x410
         put_device+0x13/0x20
         klist_class_dev_put+0x31/0x40
         klist_put+0x8f/0xf0
         klist_del+0xe/0x10
         device_del+0x3a7/0x9a0
         nvme_destroy_subsystem+0xf9/0x150
         nvme_free_ctrl+0x280/0x3a0
         device_release+0x72/0x1d0
         kobject_put+0x144/0x410
         put_device+0x13/0x20
         nvme_free_ns+0xc4/0x100
         nvme_release+0xb3/0xe0
         __blkdev_put+0x549/0x6e0
         blkdev_put+0x72/0x2c0
         blkdev_close+0x8d/0xd0
         __fput+0x256/0x770
         ____fput+0xe/0x10
         task_work_run+0x10c/0x180
         exit_to_usermode_loop+0x151/0x170
         do_syscall_64+0x240/0x2e0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 32fd90c4 ("nvme: change locking for the per-subsystem controller list")
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  2. 30 Jul 2019, 1 commit
  3. 23 Jul 2019, 1 commit
    • nvme: fix memory leak caused by incorrect subsystem free · e654dfd3
      Authored by Logan Gunthorpe
      When freeing the subsystem after finding another match with
      __nvme_find_get_subsystem(), use put_device() instead of
      __nvme_release_subsystem(), which calls kfree() directly.
      
      Per the documentation, put_device() should always be used
      after device_initialize() is called. Otherwise, leaks like
      the one below, which was detected by kmemleak, may occur.
      
      Once the call to __nvme_release_subsystem() is removed, it no
      longer makes sense to keep the helper, so fold it back
      into nvme_release_subsystem().
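
      A minimal sketch of the ownership rule being restored (struct and
      helper names are taken from the driver, but the bodies here are
      illustrative): once device_initialize() has run, the embedded kobject
      owns the object's lifetime, so the only correct way to free it is
      put_device(), which ends up in the ->release callback.

        static void nvme_release_subsystem(struct device *dev)
        {
                struct nvme_subsystem *subsys =
                        container_of(dev, struct nvme_subsystem, dev);

                /* reached only when the last reference is dropped */
                kfree(subsys);
        }

        static void discard_duplicate_subsystem(struct nvme_subsystem *subsys)
        {
                /*
                 * Wrong: kfree(subsys) frees the struct directly and
                 * leaks the name string dev_set_name() allocated --
                 * the kmemleak hit shown below.
                 *
                 * Right: drop the reference and let ->release run.
                 */
                put_device(&subsys->dev);
        }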
      
      unreferenced object 0xffff8883d12bfbc0 (size 16):
        comm "nvme", pid 2635, jiffies 4294933602 (age 739.952s)
        hex dump (first 16 bytes):
          6e 76 6d 65 2d 73 75 62 73 79 73 32 00 88 ff ff  nvme-subsys2....
        backtrace:
          [<000000007d8fc208>] __kmalloc_track_caller+0x16d/0x2a0
          [<0000000081169e5f>] kvasprintf+0xad/0x130
          [<0000000025626f25>] kvasprintf_const+0x47/0x120
          [<00000000fa66ad36>] kobject_set_name_vargs+0x44/0x120
          [<000000004881f8b3>] dev_set_name+0x98/0xc0
          [<000000007124dae3>] nvme_init_identify+0x1995/0x38e0
          [<000000009315020a>] nvme_loop_configure_admin_queue+0x4fa/0x5e0
          [<000000001a63e766>] nvme_loop_create_ctrl+0x489/0xf80
          [<00000000a46ecc23>] nvmf_dev_write+0x1a12/0x2220
          [<000000002259b3d5>] __vfs_write+0x66/0x120
          [<000000002f6df81e>] vfs_write+0x154/0x490
          [<000000007e8cfc19>] ksys_write+0x10a/0x240
          [<00000000ff5c7b85>] __x64_sys_write+0x73/0xb0
          [<00000000fee6d692>] do_syscall_64+0xaa/0x470
          [<00000000997e1ede>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: ab9e00cc ("nvme: track subsystems")
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  4. 12 Jul 2019, 1 commit
    • nvme: fix NULL deref for fabrics options · 7d30c81b
      Authored by Minwoo Im
      The git://git.infradead.org/nvme.git nvme-5.3 branch now triggers the
      following NULL pointer dereference oops.  Check ctrl->opts before
      dereferencing it.
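
      A sketch of the guard, wrapped in a hypothetical helper (the real hunk
      lives in the namespace scanning path that the trace below walks
      through):

        static void nvme_set_stable_writes(struct nvme_ns *ns)
        {
                /*
                 * ctrl->opts is only populated for fabrics controllers;
                 * PCIe controllers reach namespace scanning with
                 * opts == NULL, so test the pointer before the deref.
                 */
                if (ns->ctrl->opts && ns->ctrl->opts->data_digest)
                        ns->queue->backing_dev_info->capabilities |=
                                BDI_CAP_STABLE_WRITES;
        }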
      
      [   16.337581] BUG: kernel NULL pointer dereference, address: 0000000000000056
      [   16.338551] #PF: supervisor read access in kernel mode
      [   16.338551] #PF: error_code(0x0000) - not-present page
      [   16.338551] PGD 0 P4D 0
      [   16.338551] Oops: 0000 [#1] SMP PTI
      [   16.338551] CPU: 2 PID: 1035 Comm: kworker/u16:5 Not tainted 5.2.0-rc6+ #1
      [   16.338551] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
      [   16.338551] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [   16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core]
      [   16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 <80> 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf
      [   16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283
      [   16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007
      [   16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720
      [   16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840
      [   16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8
      [   16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001
      [   16.338551] FS:  0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000
      [   16.338551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0
      [   16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   16.338551] Call Trace:
      [   16.338551]  nvme_scan_work+0x2c0/0x340 [nvme_core]
      [   16.338551]  ? __switch_to_asm+0x40/0x70
      [   16.338551]  ? _raw_spin_unlock_irqrestore+0x18/0x30
      [   16.338551]  ? try_to_wake_up+0x408/0x450
      [   16.338551]  process_one_work+0x20b/0x3e0
      [   16.338551]  worker_thread+0x1f9/0x3d0
      [   16.338551]  ? cancel_delayed_work+0xa0/0xa0
      [   16.338551]  kthread+0x117/0x120
      [   16.338551]  ? kthread_stop+0xf0/0xf0
      [   16.338551]  ret_from_fork+0x3a/0x50
      [   16.338551] Modules linked in: nvme nvme_core
      [   16.338551] CR2: 0000000000000056
      [   16.338551] ---[ end trace b9bf761a93e62d84 ]---
      [   16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core]
      [   16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 <80> 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf
      [   16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283
      [   16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007
      [   16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720
      [   16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840
      [   16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8
      [   16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001
      [   16.338551] FS:  0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000
      [   16.338551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0
      [   16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fixes: 958f2a0f ("nvme-tcp: set the STABLE_WRITES flag when data digests are enabled")
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 11 Jul 2019, 1 commit
  6. 10 Jul 2019, 2 commits
    • nvme-tcp: set the STABLE_WRITES flag when data digests are enabled · 958f2a0f
      Authored by Mikhail Skorzhinskii
      There were a few false alarms on the target side about a wrong data
      digest while running a high-throughput load against an XFS filesystem
      shared through NVMoF TCP.
      
      This flag tells the rest of the kernel to ensure that the data buffer
      does not change while the write is in flight.  It incurs a performance
      penalty, so only enable it when it is actually needed, i.e. when we are
      calculating data digests.
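
      The change itself is a one-liner; here it is as a sketch wrapped in a
      hypothetical helper (this is also the dereference that commit 7d30c81b
      above later guards with a ctrl->opts NULL check):

        static void nvme_tcp_mark_stable_writes(struct nvme_ns *ns)
        {
                /* the data digest (DDGST) is computed over the payload
                 * after the block layer hands it off, so the pages must
                 * not change while the write is in flight */
                if (ns->ctrl->opts->data_digest)
                        ns->queue->backing_dev_info->capabilities |=
                                BDI_CAP_STABLE_WRITES;
        }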
      
      Although even with this change in place, ext2 users can still
      experience false positives, as ext2 does not respect this flag.
      The same may apply to vfat as well.
      Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Signed-off-by: Mike Playle <mplayle@solarflare.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: set physical block size and optimal I/O size · 81adb863
      Authored by Bart Van Assche
      From the NVMe 1.4 spec:
      
      NSFEAT bit 4 if set to 1: indicates that the fields NPWG, NPWA, NPDG, NPDA,
      and NOWS are defined for this namespace and should be used by the host for
      I/O optimization;
      [ ... ]
      Namespace Preferred Write Granularity (NPWG): This field indicates the
      smallest recommended write granularity in logical blocks for this namespace.
      This is a 0's based value. The size indicated should be less than or equal
      to Maximum Data Transfer Size (MDTS) that is specified in units of minimum
      memory page size. The value of this field may change if the namespace is
      reformatted. The size should be a multiple of Namespace Preferred Write
      Alignment (NPWA). Refer to section 8.25 for how this field is utilized to
      improve performance and endurance.
      [ ... ]
      Each Write, Write Uncorrectable, or Write Zeroes commands should address a
      multiple of Namespace Preferred Write Granularity (NPWG) (refer to Figure
      245) and Stream Write Size (SWS) (refer to Figure 515) logical blocks (as
      expressed in the NLB field), and the SLBA field of the command should be
      aligned to Namespace Preferred Write Alignment (NPWA) (refer to Figure 245)
      for best performance.
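
      A sketch of how these 0's based fields could translate into block
      layer queue limits (condensed; the local variable names and helper
      are illustrative, the id fields match struct nvme_id_ns):

        static void nvme_set_queue_limits_sketch(struct gendisk *disk,
                        struct nvme_ns *ns, struct nvme_id_ns *id)
        {
                u32 bs = 1 << ns->lba_shift;    /* logical block size */
                u32 phys_bs = bs, io_opt = bs;

                /* NSFEAT bit 4: NPWG/NPWA/NPDG/NPDA/NOWS are valid */
                if (id->nsfeat & (1 << 4)) {
                        /* 0's based fields: stored value + 1 blocks */
                        phys_bs = bs * (1 + le16_to_cpu(id->npwg));
                        io_opt  = bs * (1 + le16_to_cpu(id->nows));
                }

                blk_queue_logical_block_size(disk->queue, bs);
                blk_queue_physical_block_size(disk->queue, phys_bs);
                blk_queue_io_min(disk->queue, phys_bs);
                blk_queue_io_opt(disk->queue, io_opt);
        }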
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  7. 21 Jun 2019, 4 commits
  8. 07 Jun 2019, 1 commit
  9. 21 May 2019, 1 commit
  10. 18 May 2019, 7 commits
  11. 14 May 2019, 3 commits
  12. 13 May 2019, 1 commit
  13. 01 May 2019, 2 commits
  14. 25 Apr 2019, 1 commit
  15. 10 Apr 2019, 1 commit
  16. 05 Apr 2019, 2 commits
  17. 27 Mar 2019, 1 commit
    • srcu: Remove cleanup_srcu_struct_quiesced() · f5ad3991
      Authored by Paul E. McKenney
      The cleanup_srcu_struct_quiesced() function was added because NVME
      used WQ_MEM_RECLAIM workqueues and SRCU did not, which meant that
      NVME workqueues waiting on SRCU workqueues could result in deadlocks
      during low-memory conditions.  However, SRCU now also has WQ_MEM_RECLAIM
      workqueues, so there is no longer a potential for deadlock.  Furthermore,
      it turns out to be extremely hard to use cleanup_srcu_struct_quiesced()
      correctly due to the fact that SRCU callback invocation accesses the
      srcu_struct structure's per-CPU data area just after callbacks are
      invoked.  Therefore, the usual practice of using srcu_barrier() to wait
      for callbacks to be invoked before invoking cleanup_srcu_struct_quiesced()
      fails because SRCU's callback-invocation workqueue handler might be
      delayed, which can result in cleanup_srcu_struct_quiesced() being invoked
      (and thus freeing the per-CPU data) before the SRCU's callback-invocation
      workqueue handler is finished using that per-CPU data.  Nor is this a
      theoretical problem: KASAN emitted use-after-free warnings because of
      this problem on actual runs.
      
      In short, NVME can now safely invoke cleanup_srcu_struct(), which
      avoids the use-after-free scenario.  And cleanup_srcu_struct_quiesced()
      is quite difficult to use safely.  This commit therefore removes
      cleanup_srcu_struct_quiesced(), switching its sole user back to
      cleanup_srcu_struct().  This effectively reverts the following pair
      of commits:
      
      f7194ac3 ("srcu: Add cleanup_srcu_struct_quiesced()")
      4317228a ("nvme: Avoid flush dependency in delete controller flow")
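
      The resulting usage pattern, as a minimal sketch with a hypothetical
      SRCU domain standing in for the one nvme tears down on controller
      delete:

        #include <linux/srcu.h>

        static struct srcu_struct demo_srcu;

        static int demo_init(void)
        {
                return init_srcu_struct(&demo_srcu);
        }

        static void demo_delete(void)
        {
                /*
                 * cleanup_srcu_struct() may sleep while SRCU's
                 * workqueues drain; since those workqueues are now
                 * WQ_MEM_RECLAIM, this cannot deadlock under memory
                 * pressure, and the racy _quiesced() variant is no
                 * longer needed.
                 */
                cleanup_srcu_struct(&demo_srcu);
        }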
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: Bart Van Assche <bvanassche@acm.org>
  18. 14 Mar 2019, 7 commits
  19. 20 Feb 2019, 2 commits