1. 20 Apr, 2023 1 commit
  2. 14 Apr, 2023 2 commits
  3. 07 Apr, 2023 1 commit
    • Y
      block: don't set GD_NEED_PART_SCAN if scan partition failed · 3723091e
      Yu Kuai committed
      Currently, if disk_scan_partitions() fails, GD_NEED_PART_SCAN is still
      set, and the partition scan will proceed again when blkdev_get_by_dev()
      is called. However, this causes a problem: re-assembling a partitioned
      raid device will create partitions for the underlying disk.
      
      Test procedure:
      
      mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
      sgdisk -n 0:0:+100MiB /dev/md0
      blockdev --rereadpt /dev/sda
      blockdev --rereadpt /dev/sdb
      mdadm -S /dev/md0
      mdadm -A /dev/md0 /dev/sda /dev/sdb
      
      Test result: underlying disk partition and raid partition can be
      observed at the same time
      
      Note that this can still happen in some corner cases where
      GD_NEED_PART_SCAN can be set for the underlying disk while re-assembling
      the raid device.
      
      Fixes: e5cfefa9 ("block: fix scan partition for exclusively open device again")
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3723091e
  4. 06 Apr, 2023 3 commits
  5. 05 Apr, 2023 2 commits
  6. 31 Mar, 2023 2 commits
  7. 30 Mar, 2023 2 commits
    • S
      nvme-tcp: fix a possible UAF when failing to allocate an io queue · 88eaba80
      Sagi Grimberg committed
      When we allocate a nvme-tcp queue, we set the data_ready callback before
      we actually need to use it. This creates the potential that if a stray
      controller sends us data on the socket before we connect, we can trigger
      the io_work and start consuming the socket.
      
      In the reported case, we failed to allocate one of the io queues, and
      as we started releasing the queues that we had already allocated, we got
      a UAF [1] from the io_work, which was running before it really should.
      
      Fix this by setting the socket ops callbacks only right before we start
      the queue, so that we can't accidentally schedule the io_work in the
      initialization phase before the queue is started. While we are at it,
      rename nvme_tcp_restore_sock_calls to pair with nvme_tcp_setup_sock_ops.
      
      [1]:
      [16802.107284] nvme nvme4: starting error recovery
      [16802.109166] nvme nvme4: Reconnecting in 10 seconds...
      [16812.173535] nvme nvme4: failed to connect socket: -111
      [16812.173745] nvme nvme4: Failed reconnect attempt 1
      [16812.173747] nvme nvme4: Reconnecting in 10 seconds...
      [16822.413555] nvme nvme4: failed to connect socket: -111
      [16822.413762] nvme nvme4: Failed reconnect attempt 2
      [16822.413765] nvme nvme4: Reconnecting in 10 seconds...
      [16832.661274] nvme nvme4: creating 32 I/O queues.
      [16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088
      [16833.920068] nvme nvme4: Failed reconnect attempt 3
      [16833.920094] #PF: supervisor write access in kernel mode
      [16833.920261] nvme nvme4: Reconnecting in 10 seconds...
      [16833.920368] #PF: error_code(0x0002) - not-present page
      [16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
      [16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30
      ...
      [16833.923138] Call Trace:
      [16833.923271]  <TASK>
      [16833.923402]  lock_sock_nested+0x1e/0x50
      [16833.923545]  nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp]
      [16833.923685]  nvme_tcp_io_work+0x68/0xa0 [nvme_tcp]
      [16833.923824]  process_one_work+0x1e8/0x390
      [16833.923969]  worker_thread+0x53/0x3d0
      [16833.924104]  ? process_one_work+0x390/0x390
      [16833.924240]  kthread+0x124/0x150
      [16833.924376]  ? set_kthread_struct+0x50/0x50
      [16833.924518]  ret_from_fork+0x1f/0x30
      [16833.924655]  </TASK>
      Reported-by: Yanjun Zhang <zhangyanjun@cestc.cn>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Yanjun Zhang <zhangyanjun@cestc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      88eaba80
    • Y
      md: fix regression for null-ptr-deference in __md_stop() · 433279be
      Yu Kuai committed
      Commit 3e453522 ("md: Free resources in __md_stop") tried to fix a
      null-ptr-dereference for 'active_io' by moving percpu_ref_exit() to
      __md_stop(). However, the commit also moved the exit of 'writes_pending'
      to __md_stop(), which breaks the mdadm tests:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000038
      Oops: 0000 [#1] PREEMPT SMP
      CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37
      RIP: 0010:free_percpu+0x465/0x670
      Call Trace:
       <TASK>
       __percpu_ref_exit+0x48/0x70
       percpu_ref_exit+0x1a/0x90
       __md_stop+0xe9/0x170
       do_md_stop+0x1e1/0x7b0
       md_ioctl+0x90c/0x1aa0
       blkdev_ioctl+0x19b/0x400
       vfs_ioctl+0x20/0x50
       __x64_sys_ioctl+0xba/0xe0
       do_syscall_64+0x6c/0xe0
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The problem can be reproduced 100% of the time with the following test:
      
      mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force
      echo inactive > /sys/block/md0/md/array_state
      echo read-auto  > /sys/block/md0/md/array_state
      echo inactive > /sys/block/md0/md/array_state
      
      Root cause:
      
      // start raid
      raid1_run
       mddev_init_writes_pending
        percpu_ref_init
      
      // inactive raid
      array_state_store
       do_md_stop
        __md_stop
         percpu_ref_exit
      
      // start raid again
      array_state_store
       do_md_run
        raid1_run
         mddev_init_writes_pending
          if (mddev->writes_pending.percpu_count_ptr)
          // won't reinit
      
      // inactive raid again
      ...
      percpu_ref_exit
      -> null-ptr-dereference
      
      Before the commit, 'writes_pending' was exited when the mddev was freed,
      and it was safe to restart the raid because mddev_init_writes_pending()
      already makes sure that 'writes_pending' is only initialized once.
      
      Fix the problem by moving the exit of 'writes_pending' back. This makes it
      a little harder to see the relationship between allocating and freeing the
      memory; however, the code change is much smaller and we have lived with
      this for a long time already.
      
      Fixes: 3e453522 ("md: Free resources in __md_stop")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com
      433279be
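
The root cause above comes down to an init-once guard that still sees a non-zero marker after the ref has been torn down. The following is a minimal user-space sketch of that interaction, not the kernel's percpu_ref implementation: `struct ref_model`, `ref_init_once()` and `ref_exit()` are hypothetical names, and the assumption that the marker stays non-zero after exit is taken from the "won't reinit" step in the trace.

```c
#include <stdlib.h>

/* Hypothetical user-space stand-in for a percpu_ref; not the kernel struct. */
struct ref_model {
    unsigned long marker; /* plays the role of percpu_count_ptr */
    int *data;            /* backing allocation, NULL after exit */
};

/* Init-once guard, shaped like the check quoted in the root-cause trace. */
static int ref_init_once(struct ref_model *r)
{
    if (r->marker)
        return 0; /* looks initialised already: skip the allocation */
    r->data = malloc(sizeof(*r->data));
    if (!r->data)
        return -1;
    *r->data = 1;
    r->marker = 1;
    return 0;
}

/*
 * Tear down the ref. The marker stays non-zero afterwards (an assumption
 * of this model), so a later ref_init_once() will not reallocate.
 * Returns 0 on success, -1 if there is no backing data left to free.
 */
static int ref_exit(struct ref_model *r)
{
    if (!r->data)
        return -1; /* the model reports an error where the kernel oopsed */
    free(r->data);
    r->data = NULL;
    r->marker = 2; /* "dead", but still non-zero */
    return 0;
}
```

Calling init, exit, init, exit in that order mirrors the four steps of the reproducer: the second init is skipped by the guard, so the second exit finds no backing data. Exiting only when the owning object is freed, as the fix restores, means the guard never meets a torn-down ref.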
  8. 28 Mar, 2023 2 commits
    • J
      nvme-pci: mark Lexar NM760 as IGNORE_DEV_SUBNQN · 1231363a
      Juraj Pecigos committed
      A system with more than one of these SSDs will only have one usable.
      The kernel fails to detect more than one nvme device due to duplicate
      cntlids.
      
      before:
      [    9.395229] nvme 0000:01:00.0: platform quirk: setting simple suspend
      [    9.395262] nvme nvme0: pci function 0000:01:00.0
      [    9.395282] nvme 0000:03:00.0: platform quirk: setting simple suspend
      [    9.395305] nvme nvme1: pci function 0000:03:00.0
      [    9.409873] nvme nvme0: Duplicate cntlid 1 with nvme1, subsys nqn.2022-07.com.siliconmotion:nvm-subsystem-sn-                    , rejecting
      [    9.409982] nvme nvme0: Removing after probe failure status: -22
      [    9.427487] nvme nvme1: allocated 64 MiB host memory buffer.
      [    9.445088] nvme nvme1: 16/0/0 default/read/poll queues
      [    9.449898] nvme nvme1: Ignoring bogus Namespace Identifiers
      
      after:
      [    1.161890] nvme 0000:01:00.0: platform quirk: setting simple suspend
      [    1.162660] nvme nvme0: pci function 0000:01:00.0
      [    1.162684] nvme 0000:03:00.0: platform quirk: setting simple suspend
      [    1.162707] nvme nvme1: pci function 0000:03:00.0
      [    1.191354] nvme nvme0: allocated 64 MiB host memory buffer.
      [    1.193378] nvme nvme1: allocated 64 MiB host memory buffer.
      [    1.211044] nvme nvme1: 16/0/0 default/read/poll queues
      [    1.211080] nvme nvme0: 16/0/0 default/read/poll queues
      [    1.216145] nvme nvme0: Ignoring bogus Namespace Identifiers
      [    1.216261] nvme nvme1: Ignoring bogus Namespace Identifiers
      
      Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk resolves the issue.
      Signed-off-by: Juraj Pecigos <kernel@juraj.dev>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      1231363a
    • A
      loop: LOOP_CONFIGURE: send uevents for partitions · bb430b69
      Alyssa Ross committed
      LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to
      combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall.  When
      using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for
      each partition found on the loop device after the second ioctl(), but
      when using LOOP_CONFIGURE, no such uevent was being sent.
      
      In the old setup, uevents are disabled for LOOP_SET_FD, but not for
      LOOP_SET_STATUS64.  This makes sense, as it prevents uevents being
      sent for a partially configured device during LOOP_SET_FD - they're
      only sent at the end of LOOP_SET_STATUS64.  But for LOOP_CONFIGURE,
      uevents were disabled for the entire operation, so that final
      notification was never issued.  To fix this, reduce the critical
      section so that the loop_reread_partitions() call, which causes the
      uevents to be issued, runs after uevents are re-enabled, matching
      the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination.
      
      I noticed this because Busybox's losetup program recently changed from
      using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke
      my setup, for which I want a notification from the kernel any time a
      new partition becomes available.
      Signed-off-by: Alyssa Ross <hi@alyssa.is>
      [hch: reduced the critical section]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Fixes: 3448914e ("loop: Add LOOP_CONFIGURE ioctl")
      Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bb430b69
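
The effect of shrinking the critical section can be shown with a small user-space model. This is only a sketch under stated assumptions: `struct loop_model`, `reread_partitions()`, `configure_before()` and `configure_after()` are hypothetical names, and the model reduces uevent suppression to a single boolean rather than reproducing the driver's real locking or ioctl paths.

```c
#include <stdbool.h>

/* Hypothetical model of a loop device; not the driver's data structure. */
struct loop_model {
    bool uevents_suppressed; /* true while inside the critical section */
    int partitions;          /* partitions found on the backing file */
    int uevents_sent;        /* notifications delivered to userspace */
};

/* A partition rescan emits one uevent per partition unless suppressed. */
static void reread_partitions(struct loop_model *lo)
{
    if (!lo->uevents_suppressed)
        lo->uevents_sent += lo->partitions;
}

/* Before the fix: the rescan runs while uevents are still suppressed. */
static void configure_before(struct loop_model *lo)
{
    lo->uevents_suppressed = true;
    /* ... bind the backing file and apply the status ... */
    reread_partitions(lo); /* nothing is sent here */
    lo->uevents_suppressed = false;
}

/* After the fix: suppression ends first, then the rescan runs. */
static void configure_after(struct loop_model *lo)
{
    lo->uevents_suppressed = true;
    /* ... bind the backing file and apply the status ... */
    lo->uevents_suppressed = false;
    reread_partitions(lo); /* one uevent per partition */
}
```

With two partitions in the model, `configure_before()` leaves the uevent counter untouched while `configure_after()` raises it by two, which is the per-partition notification the commit message says was missing.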
  9. 24 Mar, 2023 1 commit
  10. 22 Mar, 2023 2 commits
  11. 21 Mar, 2023 1 commit
  12. 18 Mar, 2023 1 commit
  13. 16 Mar, 2023 4 commits
  14. 15 Mar, 2023 15 commits
  15. 14 Mar, 2023 1 commit
    • J
      block: do not reverse request order when flushing plug list · 34e0a279
      Jan Kara committed
      Commit 26fed4ac ("block: flush plug based on hardware and software
      queue order") changed flushing of the plug list to submit requests one
      device at a time. However, while doing that it also started using
      list_add_tail() instead of the list_add() used previously, thus
      effectively submitting requests in reverse order. Also, when forming a
      rq_list with the remaining requests (in case two or more devices are
      used), we effectively reverse the ordering of the plug list for each
      device we process. Submitting requests in reverse order has a negative
      impact on performance for rotational disks (when BFQ is not in use). We
      observe a 10-25% regression in random 4k write throughput, as well as a
      ~20% regression in the MariaDB OLTP benchmark on rotational storage on a
      btrfs filesystem.
      
      Fix the problem by preserving ordering of the plug list when inserting
      requests into the queuelist as well as by appending to requeue_list
      instead of prepending to it.
      
      Fixes: 26fed4ac ("block: flush plug based on hardware and software queue order")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      34e0a279
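
Why swapping list_add() for list_add_tail() reverses the submission order can be seen in a small user-space model. This sketch assumes the plug list accumulates requests at its head, so draining it yields the newest request first. `struct req`, `push_head()` and `drain()` are hypothetical names for illustration, not the block layer's API.

```c
#include <stddef.h>

/* Hypothetical stand-in for a request: only the submission order matters. */
struct req {
    int seq;
    struct req *next;
};

/* Push at the head, the way a LIFO plug list accumulates requests. */
static void push_head(struct req **head, struct req *rq)
{
    rq->next = *head;
    *head = rq;
}

/*
 * Pop every request from *src and build a new list from them.
 * prepend != 0 models list_add() (insert at the head),
 * prepend == 0 models list_add_tail() (insert at the tail).
 * Returns the head of the new list; *src is left empty.
 */
static struct req *drain(struct req **src, int prepend)
{
    struct req *head = NULL, *tail = NULL;

    while (*src) {
        struct req *rq = *src;

        *src = rq->next;
        rq->next = NULL;
        if (prepend || !head) {
            rq->next = head;
            head = rq;
            if (!tail)
                tail = rq;
        } else {
            tail->next = rq;
            tail = rq;
        }
    }
    return head;
}
```

Pushing requests 1, 2, 3 and draining with `prepend` set returns them as 1, 2, 3, because prepending undoes the LIFO order of the source list. Draining the same input with `prepend` clear returns 3, 2, 1, which is the reversed order the commit describes.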