1. 24 Apr, 2021 1 commit
    • H
      md-cluster: fix use-after-free issue when removing rdev · f7c7a2f9
      Heming Zhao committed
      md_kick_rdev_from_array will remove rdev, so we should
      use rdev_for_each_safe to search list.
      
      How to trigger:
      
      env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).
      
      ```
      node2=192.168.0.3
      
      for i in {1..20}; do
          echo ==== $i `date` ====;
      
          mdadm -Ss && ssh ${node2} "mdadm -Ss"
          wipefs -a /dev/sda /dev/sdb
      
          mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
             /dev/sdb --assume-clean
          ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
          mdadm --wait /dev/md0
          ssh ${node2} "mdadm --wait /dev/md0"
      
          mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
          sleep 1
      done
      ```
      
      Crash stack:
      
      ```
      stack segment: 0000 [#1] SMP
      ... ...
      RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
      ... ...
      RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
      RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
      RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
      RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
      FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       raid1d+0x5c/0xd40 [raid1]
       ? finish_task_switch+0x75/0x2a0
       ? lock_timer_base+0x67/0x80
       ? try_to_del_timer_sync+0x4d/0x80
       ? del_timer_sync+0x41/0x50
       ? schedule_timeout+0x254/0x2d0
       ? md_start_sync+0xe0/0xe0 [md_mod]
       ? md_thread+0x127/0x160 [md_mod]
       md_thread+0x127/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       kthread+0x10d/0x130
       ? kthread_park+0xa0/0xa0
       ret_from_fork+0x1f/0x40
      ```
      
      Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
      Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
      Cc: stable@vger.kernel.org
      Reviewed-by: Gang He <ghe@suse.com>
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
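      The md-cluster fix above (f7c7a2f9) relies on a general rule: when a loop body may remove the current list entry, the successor must be saved before the entry is freed. The following is a minimal userspace sketch of that idea, not the kernel code itself. The `struct node` type and the `push`, `kick_faulty`, `count_nodes` and `free_all` helpers are hypothetical names invented for this illustration; the kernel uses its own list macros such as `rdev_for_each_safe`.

      ```c
      #include <assert.h>
      #include <stdlib.h>

      /* Hypothetical stand-in for an rdev linked on an array's disk list. */
      struct node {
          int id;
          int faulty;
          struct node *next;
      };

      /* Push a new node on the front of the list and return the new head. */
      struct node *push(struct node *head, int id, int faulty)
      {
          struct node *n = malloc(sizeof(*n));

          assert(n != NULL);
          n->id = id;
          n->faulty = faulty;
          n->next = head;
          return n;
      }

      /*
       * Unlink and free every faulty node, returning the new head.
       * The successor is cached in 'tmp' before the current node is freed.
       * Reading cur->next after free(cur) would be a use-after-free, which
       * is the class of bug the commit message describes.
       */
      struct node *kick_faulty(struct node *head, int *removed)
      {
          struct node **link = &head;
          struct node *cur, *tmp;

          *removed = 0;
          for (cur = head; cur != NULL; cur = tmp) {
              tmp = cur->next;        /* save the successor first */
              if (cur->faulty) {
                  *link = tmp;        /* unlink the current node */
                  free(cur);          /* cur must not be touched after this */
                  (*removed)++;
              } else {
                  link = &cur->next;
              }
          }
          return head;
      }

      /* Count the nodes left on the list. */
      int count_nodes(const struct node *head)
      {
          int n = 0;

          for (; head != NULL; head = head->next)
              n++;
          return n;
      }

      /* Free the whole list, again caching the successor before each free. */
      void free_all(struct node *head)
      {
          struct node *tmp;

          for (; head != NULL; head = tmp) {
              tmp = head->next;
              free(head);
          }
      }
      ```

      A caller would build a list with `push`, call `kick_faulty` once, and then walk the surviving nodes. The same caching pattern is what a `_safe` iterator provides.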
  2. 23 Apr, 2021 1 commit
    • J
      Merge tag 'nvme-5.13-2021-04-22' of git://git.infradead.org/nvme into for-5.13/drivers · 87d9ad02
      Jens Axboe committed
      Pull NVMe updates from Christoph:
      
      "- add support for a per-namespace character device (Minwoo Im)
       - various KATO fixes and cleanups (Hou Pu, Hannes Reinecke)
       - APST fix and cleanup"
      
      * tag 'nvme-5.13-2021-04-22' of git://git.infradead.org/nvme:
        nvme: introduce generic per-namespace chardev
        nvme: cleanup nvme_configure_apst
        nvme: do not try to reconfigure APST when the controller is not live
        nvme: add 'kato' sysfs attribute
        nvme: sanitize KATO setting
        nvmet: avoid queuing keep-alive timer if it is disabled
  3. 22 Apr, 2021 7 commits
    • M
      nvme: introduce generic per-namespace chardev · 2637baed
      Minwoo Im committed
      Userspace has not been able to issue I/O to a device that has failed to
      be initialized.  This patch introduces a generic per-namespace character
      device that lets userspace issue I/O regardless of whether the block
      device is there or not.
      
      The chardev naming convention will be similar to the existing blkdev naming,
      using an ng prefix instead of nvme, i.e.
      
      	- /dev/ngXnY
      
      It also supports multipath which means it will not expose chardev for the
      hidden namespace blkdevs (e.g., nvmeXcYnZ).  If /dev/ngXnY is created for
      a ns_head, then I/O request will be routed to a specific controller
      selected by the iopolicy of the subsystem.
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Signed-off-by: Javier González <javier.gonz@samsung.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Tested-by: Kanchan Joshi <joshi.k@samsung.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
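      The naming convention described in the chardev commit above (2637baed) can be illustrated with a small formatting helper. This is only a sketch: `ns_node_name` and its parameters are hypothetical names for this example, and `x` and `y` simply stand for the X and Y placeholders in /dev/ngXnY.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdio.h>

      /*
       * Hypothetical helper: build a device node path of the form
       * /dev/<prefix>XnY. With prefix "nvme" this gives the familiar block
       * device name, with prefix "ng" the character device name.
       * Returns the number of characters snprintf() would have written.
       */
      int ns_node_name(char *buf, size_t len, const char *prefix, int x, int y)
      {
          assert(buf != NULL && prefix != NULL && len > 0);
          return snprintf(buf, len, "/dev/%s%dn%d", prefix, x, y);
      }
      ```

      Calling the helper with the same `x` and `y` and the two prefixes shows that the block and character device names differ only in the prefix.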
    • C
      nvme: cleanup nvme_configure_apst · 60df5de9
      Christoph Hellwig committed
      Remove a level of indentation from the main code implementing the table
      search by using a goto for the APST not supported case.  Also move the
      main comment above the function.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
    • C
      nvme: do not try to reconfigure APST when the controller is not live · 53fe2a30
      Christoph Hellwig committed
      Do not call nvme_configure_apst when the controller is not live, given
      that nvme_configure_apst will fail due to the lack of an admin queue when
      the controller is being torn down and nvme_set_latency_tolerance is
      called from dev_pm_qos_hide_latency_tolerance.
      
      Fixes: 510a405d ("nvme: fix memory leak for power latency tolerance")
      Reported-by: Peng Liu <liupeng17@lenovo.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
    • H
      nvme: add 'kato' sysfs attribute · 74c22990
      Hannes Reinecke committed
      Add a 'kato' controller sysfs attribute to display the current
      keep-alive timeout value (if any). This allows userspace to identify
      persistent discovery controllers, as these will have a non-zero
      KATO value.
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • H
      nvme: sanitize KATO setting · a70b81bd
      Hannes Reinecke committed
      According to the NVMe base spec the KATO commands should be sent
      at half of the KATO interval, to properly account for round-trip
      times.
      As we now will only ever send one KATO command per connection we
      can easily use the recommended values.
      This also fixes a potential issue where the request timeout for
      the KATO command does not match the value in the connect command,
      which might be causing spurious connection drops from the target.
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
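      The rule quoted in the KATO commit above (a70b81bd), sending keep-alive commands at half of the KATO interval, comes down to simple arithmetic. The sketch below is an illustration only: `keep_alive_period_ms` is a hypothetical helper, and passing the timeout in seconds is an assumption of this example rather than something the commit message states.

      ```c
      #include <assert.h>
      #include <limits.h>

      /*
       * Hypothetical helper: given a keep-alive timeout in seconds (an
       * assumption of this sketch), return the delay in milliseconds between
       * two keep-alive commands, which is half of the timeout.
       */
      unsigned int keep_alive_period_ms(unsigned int kato_s)
      {
          /* Guard against overflow when converting seconds to milliseconds. */
          assert(kato_s <= UINT_MAX / 1000u);
          return kato_s * 1000u / 2u;
      }
      ```

      With a 5 second timeout this gives a 2500 ms period, which leaves the second half of the interval as slack for round-trip delays.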
    • H
      nvmet: avoid queuing keep-alive timer if it is disabled · 8f864c59
      Hou Pu committed
      Issuing the following commands:
      nvme set-feature -f 0xf -v 0 /dev/nvme1n1 # disable keep-alive timer
      nvme admin-passthru -o 0x18 /dev/nvme1n1  # send keep-alive command
      makes the keep-alive timer fire and thus deletes the controller, as shown
      below:
      
      [247459.907635] nvmet: ctrl 1 keep-alive timer (0 seconds) expired!
      [247459.930294] nvmet: ctrl 1 fatal error occurred!
      
      Avoid this by not queuing delayed keep-alive if it is disabled when
      keep-alive command is received from the admin queue.
      Signed-off-by: Hou Pu <houpu.main@gmail.com>
      Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
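      The guard described in the nvmet commit above (8f864c59) can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the driver code: `struct tgt_ctrl`, its fields and `on_keep_alive` are hypothetical names, and "arming the timer" is reduced to setting two fields.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Hypothetical, simplified view of a target-side controller. */
      struct tgt_ctrl {
          unsigned int kato;          /* keep-alive timeout, 0 means disabled */
          int timer_armed;            /* 1 once the expiry timer is queued */
          unsigned int timer_delay;   /* delay the timer was queued with */
      };

      /*
       * Handle a received keep-alive command. If keep-alive is disabled
       * (kato == 0) the timer is left alone: queuing it with a zero delay
       * would make it expire at once, as in the log excerpt in the commit
       * message. Returns 1 if the timer was (re)armed, 0 otherwise.
       */
      int on_keep_alive(struct tgt_ctrl *ctrl)
      {
          assert(ctrl != NULL);
          if (ctrl->kato == 0)
              return 0;
          ctrl->timer_armed = 1;
          ctrl->timer_delay = ctrl->kato;
          return 1;
      }
      ```

      The early return is the whole point: a keep-alive command on a controller with keep-alive disabled becomes a no-op for the timer.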
    • C
      brd: expose number of allocated pages in debugfs · f4be591f
      Calvin Owens committed
      While the maximum size of each ramdisk is defined either as a module
      parameter, or compile time default, it's impossible to know how many pages
      have currently been allocated by each ram%d device, since they're
      allocated when used and never freed.
      
      This patch creates a new directory at this location:
      
      /sys/kernel/debug/ramdisk_pages/
      
      which will contain a file named "ram%d" for each instantiated ramdisk on
      the system. The file is read-only, and read() will output the number of
      pages currently held by that ramdisk.
      
      We lose track of how much memory a ramdisk is using, as pages once used are
      simply recycled but never freed.
      
      In instances where we exhaust the size of the ramdisk with a file that
      exceeds it, encounter ENOSPC and delete the file for mitigation, df would
      show a decrease in used and an increase in available blocks, but since we
      have touched all pages, the memory footprint of the ramdisk does not
      reflect the blocks used/available count.
      
      ...
      [root@localhost ~]# mkfs.ext2 /dev/ram15
      mke2fs 1.45.6 (20-Mar-2020)
      Creating filesystem with 4096 1k blocks and 1024 inodes
      [root@localhost ~]# mount /dev/ram15 /mnt/ram15/
      
      [root@localhost ~]# cat /sys/kernel/debug/ramdisk_pages/ram15
      58
      [root@kerneltest008.06.prn3 ~]# df /dev/ram15
      Filesystem     1K-blocks  Used Available Use% Mounted on
      /dev/ram15          3963    31      3728   1% /mnt/ram15
      [root@kerneltest008.06.prn3 ~]# dd if=/dev/urandom of=/mnt/ram15/test2 bs=1M count=5
      dd: error writing '/mnt/ram15/test2': No space left on device
      4+0 records in
      3+0 records out
      4005888 bytes (4.0 MB, 3.8 MiB) copied, 0.0446614 s, 89.7 MB/s
      [root@kerneltest008.06.prn3 ~]# df /mnt/ram15/
      Filesystem     1K-blocks  Used Available Use% Mounted on
      /dev/ram15          3963  3960         0 100% /mnt/ram15
      [root@kerneltest008.06.prn3 ~]# cat /sys/kernel/debug/ramdisk_pages/ram15
      1024
      [root@kerneltest008.06.prn3 ~]# rm /mnt/ram15/test2
      rm: remove regular file '/mnt/ram15/test2'? y
      [root@kerneltest008.06.prn3 /var]# df /dev/ram15
      Filesystem     1K-blocks  Used Available Use% Mounted on
      /dev/ram15          3963    31      3728   1% /mnt/ram15
      
      # Actual memory footprint
      [root@kerneltest008.06.prn3 /var]# cat /sys/kernel/debug/ramdisk_pages/ram15
      1024
      ...
      
      This debugfs counter will always reveal the accurate number of
      permanently allocated pages to the ramdisk.
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      [cleaned up the !CONFIG_DEBUG_FS case and API changes for HEAD]
      Signed-off-by: Kyle McMartin <jkkm@fb.com>
      [rebased]
      Signed-off-by: Saravanan D <saravanand@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
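      The counter described in the brd commit above (f4be591f) is a page count, so turning it into a memory footprint needs the page size. The sketch below parses the text read from such a file and converts it to bytes. `ramdisk_bytes` is a hypothetical helper, and the page size is passed in by the caller because the commit message does not state it; 4096 bytes is only an assumption that happens to match the transcript (1024 pages for a ramdisk of 4096 1k blocks).

      ```c
      #include <assert.h>
      #include <stdlib.h>

      /*
       * Hypothetical helper: 'text' is the content read from a file such as
       * /sys/kernel/debug/ramdisk_pages/ram15 (for example "1024\n").
       * Returns the footprint in bytes, or -1 if the text does not start
       * with a non-negative number or the page size is not positive.
       */
      long long ramdisk_bytes(const char *text, long page_size)
      {
          char *end;
          long long pages;

          assert(text != NULL);
          pages = strtoll(text, &end, 10);
          if (end == text || pages < 0 || page_size <= 0)
              return -1;
          return pages * page_size;
      }
      ```

      Under the 4096-byte assumption, the 1024 pages reported after the file was deleted correspond to 4 MiB still held by the ramdisk, even though df shows only 31 blocks in use.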
  4. 21 Apr, 2021 3 commits
  5. 20 Apr, 2021 25 commits
  6. 16 Apr, 2021 3 commits
    • J
      Merge branch 'md-next' of... · 455abda6
      Jens Axboe committed
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.13/drivers
      
      Pull MD updates from Song:
      
      "1. mddev_find_or_alloc() clean up, from Christoph.
       2. Fix NULL pointer deref with external bitmap, from Sudhakar."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/bitmap: wait for external bitmap writes to complete during tear down
        md: do not return existing mddevs from mddev_find_or_alloc
        md: refactor mddev_find_or_alloc
        md: factor out a mddev_alloc_unit helper from mddev_find
    • J
      Merge tag 'nvme-5.13-2021-04-15' of git://git.infradead.org/nvme into for-5.13/drivers · e63c8eb1
      Jens Axboe committed
      Pull NVMe updates from Christoph:
      
      "nvme updates for Linux 5.13
      
       - refactor the ioctl code
       - fix a segmentation fault during io parsing error in nvmet-tcp
         (Elad Grupi)
       - fix NULL derefence in nvme_ctrl_fast_io_fail_tmo_show/store
         (Gopal Tiwari)
       - properly respect the sgl_threshold flag in nvme-pci (Niklas Cassel)
       - misc cleanups (Niklas Cassel, Amit Engel, Minwoo Im, Colin Ian King)"
      
      * tag 'nvme-5.13-2021-04-15' of git://git.infradead.org/nvme:
        nvme: fix NULL derefence in nvme_ctrl_fast_io_fail_tmo_show/store
        nvme: let namespace probing continue for unsupported features
        nvme: factor out nvme_ns_open and nvme_ns_release helpers
        nvme: move nvme_ns_head_ops to multipath.c
        nvme: factor out a nvme_tryget_ns_head helper
        nvme: move the ioctl code to a separate file
        nvme: don't bother to look up a namespace for controller ioctls
        nvme: simplify block device ioctl handling for the !multipath case
        nvme: simplify the compat ioctl handling
        nvme: factor out a nvme_ns_ioctl helper
        nvme: pass a user pointer to nvme_nvm_ioctl
        nvme: cleanup setting the disk name
        nvme: add a nvme_ns_head_multipath helper
        nvme: remove single trailing whitespace
        nvme-multipath: remove single trailing whitespace
        nvme-pci: remove single trailing whitespace
        nvme-pci: don't simple map sgl when sgls are disabled
        nvmet: fix a spelling mistake "nubmer" -> "number"
        nvmet-fc: simplify nvmet_fc_alloc_hostport
        nvmet-tcp: fix a segmentation fault during io parsing error
    • S
      md/bitmap: wait for external bitmap writes to complete during tear down · 404a8ef5
      Sudhakar Panneerselvam committed
      NULL pointer dereference was observed in super_written() when it tries
      to access the mddev structure.
      
      [The below stack trace is from an older kernel, but the problem described
      in this patch applies to the mainline kernel.]
      
      [ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000
      [ 1194.488000] RIP: 0010:super_written+0x29/0xe1
      [ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046
      [ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048
      [ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200
      [ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000
      [ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200
      [ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [ 1194.588117] FS:  0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000
      [ 1194.604264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0
      [ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1194.663316] PKRU: 55555554
      [ 1194.674090] Call Trace:
      [ 1194.683735]  <IRQ>
      [ 1194.692948]  bio_endio+0xae/0x135
      [ 1194.703580]  blk_update_request+0xad/0x2fa
      [ 1194.714990]  blk_update_bidi_request+0x20/0x72
      [ 1194.726578]  __blk_end_bidi_request+0x2c/0x4d
      [ 1194.738373]  __blk_end_request_all+0x31/0x49
      [ 1194.749344]  blk_flush_complete_seq+0x377/0x383
      [ 1194.761550]  flush_end_io+0x1dd/0x2a7
      [ 1194.772910]  blk_finish_request+0x9f/0x13c
      [ 1194.784544]  scsi_end_request+0x180/0x25c
      [ 1194.796149]  scsi_io_completion+0xc8/0x610
      [ 1194.807503]  scsi_finish_command+0xdc/0x125
      [ 1194.818897]  scsi_softirq_done+0x81/0xde
      [ 1194.830062]  blk_done_softirq+0xa4/0xcc
      [ 1194.841008]  __do_softirq+0xd9/0x29f
      [ 1194.851257]  irq_exit+0xe6/0xeb
      [ 1194.861290]  do_IRQ+0x59/0xe3
      [ 1194.871060]  common_interrupt+0x1c6/0x382
      [ 1194.881988]  </IRQ>
      [ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5
      [ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43
      [ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f
      [ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
      [ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2
      [ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003
      [ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a
      [ 1194.988607]  ? cpuidle_enter_state+0xcc/0x2a5
      [ 1194.999077]  cpuidle_enter+0x17/0x19
      [ 1195.008395]  call_cpuidle+0x23/0x3a
      [ 1195.017718]  do_idle+0x172/0x1d5
      [ 1195.026358]  cpu_startup_entry+0x73/0x75
      [ 1195.035769]  start_secondary+0x1b9/0x20b
      [ 1195.044894]  secondary_startup_64+0xa5/0xa5
      [ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78
      [ 1195.096354] CR2: 00000000000002b8
      
      bio in the above stack is a bitmap write whose completion is invoked after
      the tear down sequence sets the mddev structure to NULL in rdev.
      
      During tear down, there is an attempt to flush the bitmap writes, but for
      external bitmaps, there is no explicit wait for all the bitmap writes to
      complete. For instance, md_bitmap_flush() is called to flush the bitmap
      writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush()
      could generate new bitmap writes for which there is no explicit wait to
      complete those writes. The call to md_bitmap_update_sb() will return
      simply for external bitmaps and the follow-up call to md_update_sb() is
      conditional and may not get called for external bitmaps. This results in a
      kernel panic when the completion routine, super_written(), is called and
      tries to reference mddev in the rdev, which has been set to
      NULL (in unbind_rdev_from_array() by the tear down sequence).
      
      The solution is to call md_super_wait() for external bitmaps after the
      last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there
      are no pending bitmap writes before proceeding with the tear down.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
      Reviewed-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
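      The ordering problem described in the md/bitmap commit above (404a8ef5) can be modeled with a single-threaded sketch: a completion handler dereferences an owner pointer, so tear down must not clear that pointer while writes are still in flight. All names here (`struct array`, `struct member`, `submit_write`, `write_done`, `try_teardown`) are hypothetical. The real fix waits with md_super_wait(), whereas this model only reports that the caller has to wait.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Hypothetical stand-ins for mddev and rdev. */
      struct array {
          int pending_writes;         /* writes issued but not yet completed */
      };

      struct member {
          struct array *owner;        /* cleared by tear down */
      };

      /* Issue a write and account for it until its completion runs. */
      void submit_write(struct member *m)
      {
          assert(m != NULL && m->owner != NULL);
          m->owner->pending_writes++;
      }

      /*
       * Completion handler. It needs the owner, so it returns -1 in the
       * situation where the kernel would dereference a NULL pointer.
       */
      int write_done(struct member *m)
      {
          assert(m != NULL);
          if (m->owner == NULL)
              return -1;
          m->owner->pending_writes--;
          return 0;
      }

      /*
       * Tear down only when nothing is in flight. Returns 0 once the owner
       * has been cleared, or -1 if writes are pending and the caller must
       * wait for them first.
       */
      int try_teardown(struct member *m)
      {
          assert(m != NULL);
          if (m->owner != NULL && m->owner->pending_writes > 0)
              return -1;
          m->owner = NULL;
          return 0;
      }
      ```

      In this model, a tear down attempted between `submit_write` and `write_done` is refused, which mirrors the requirement that no bitmap writes are pending before the rdev is unbound.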