1. 20 Jan 2022, 1 commit
      md: Fix undefined behaviour in is_mddev_idle · 406295a3
      zhangwensheng committed
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4QXS1?from=project-issue
      CVE: NA
      
      --------------------------------
      
      UBSAN reports this problem:
      
      [ 5984.281385] UBSAN: Undefined behaviour in drivers/md/md.c:8175:15
      [ 5984.281390] signed integer overflow:
      [ 5984.281393] -2147483291 - 2072033152 cannot be represented in type 'int'
      [ 5984.281400] CPU: 25 PID: 1854 Comm: md101_resync Kdump: loaded Not tainted 4.19.90
      [ 5984.281404] Hardware name: Huawei TaiShan 200 (Model 5280)/BC82AMDDA
      [ 5984.281406] Call trace:
      [ 5984.281415]  dump_backtrace+0x0/0x310
      [ 5984.281418]  show_stack+0x28/0x38
      [ 5984.281425]  dump_stack+0xec/0x15c
      [ 5984.281430]  ubsan_epilogue+0x18/0x84
      [ 5984.281434]  handle_overflow+0x14c/0x19c
      [ 5984.281439]  __ubsan_handle_sub_overflow+0x34/0x44
      [ 5984.281445]  is_mddev_idle+0x338/0x3d8
      [ 5984.281449]  md_do_sync+0x1bb8/0x1cf8
      [ 5984.281452]  md_thread+0x220/0x288
      [ 5984.281457]  kthread+0x1d8/0x1e0
      [ 5984.281461]  ret_from_fork+0x10/0x18
      
      When the stat accum of the disk is greater than INT_MAX, its value
      becomes negative after casting to 'int', which may lead to overflow
      after subtracting a positive number. In the same way, when the value
      of sync_io is greater than INT_MAX, overflow may also occur. These
      situations lead to undefined behavior.
      
      Moreover, if the stat accum of the disk is close to INT_MAX when
      creating the raid array, the initial value of last_events will be set
      close to INT_MAX when mddev initializes its IO event counters, so
      'curr_events - rdev->last_events > 64' will always be false during
      synchronization. If all the disks of the mddev are in this state,
      is_mddev_idle() will always return 1, which may make non-sync IO
      very slow.
      
      To address these problems, use a 64-bit signed integer type for
      sync_io, last_events, and curr_events.
      Signed-off-by: zhangwensheng <zhangwensheng5@huawei.com>
      Reviewed-by: Tao Hou <houtao1@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  2. 14 Jan 2022, 2 commits
  3. 12 Jan 2022, 7 commits
  4. 10 Jan 2022, 7 commits
  5. 23 Dec 2021, 1 commit
      md/raid1: fix a race between removing rdev and access conf->mirrors[i].rdev · ceff49d9
      Yufen Yu committed
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4JYYO?from=project-issue
      CVE: NA
      
      ---------------------------
      
      We get a NULL pointer dereference oops when test raid1 as follow:
      
      mdadm -CR /dev/md1 -l 1 -n 2 /dev/sd[ab]
      
      mdadm /dev/md1 -f /dev/sda
      mdadm /dev/md1 -r /dev/sda
      mdadm /dev/md1 -a /dev/sda
      sleep 5
      mdadm /dev/md1 -f /dev/sdb
      mdadm /dev/md1 -r /dev/sdb
      mdadm /dev/md1 -a /dev/sdb
      
      After a disk (/dev/sda) has been removed, we add the disk to the
      raid array again, which triggers the recovery action.
      Since the rdev's current state is 'spare', read/write bio can
      be issued to the disk.
      
      Then we set the other disk (/dev/sdb) faulty. Since the raid
      array is now in a degraded state and /dev/sdb is the only
      'In_sync' disk, raid1_error() will return without successfully
      setting the disk faulty.
      
      However, that can interrupt the recovery action, and
      md_check_recovery() will try to call remove_and_add_spares()
      to remove the spare disk. The race condition between
      remove_and_add_spares() and raid1_write_request() shown below
      can then cause a NULL pointer dereference on conf->mirrors[i].rdev:
      
      raid1_write_request()   md_check_recovery    raid1_error()
      rcu_read_lock()
      rdev != NULL
      !test_bit(Faulty, &rdev->flags)
      
                                                 conf->recovery_disabled=
                                                   mddev->recovery_disabled;
                                                  return busy
      
                              remove_and_add_spares
                              raid1_remove_disk
                              rdev->nr_pending == 0
      
      atomic_inc(&rdev->nr_pending);
      rcu_read_unlock()
      
                              p->rdev=NULL
      
      conf->mirrors[i].rdev->data_offset
      NULL pointer deref!!!
      
                              if (!test_bit(RemoveSynchronized,
                                &rdev->flags))
                               synchronize_rcu();
                               p->rdev=rdev
      
      To fix the race condition, we add a new flag 'WantRemove' for rdev.
      Before accessing conf->mirrors[i].rdev, we need to ensure that the
      rdev does not have the 'WantRemove' bit set.
      
      Link: https://marc.info/?l=linux-raid&m=156412052717709&w=2
      Reported-by: Zou Wei <zou_wei@huawei.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Conflict:
              drivers/md/md.h
      Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
      Reviewed-by: yuyufen <yuyufen@huawei.com>
      Reviewed-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  6. 06 Dec 2021, 1 commit
  7. 15 Nov 2021, 3 commits
  8. 21 Oct 2021, 1 commit
      dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() · 29975cf5
      Arne Welzel committed
      stable inclusion
      from stable-5.10.67
      commit 7509c4cb7c8050177da9ee5e053c0c3d55bb66b7
      bugzilla: 182619 https://gitee.com/openeuler/kernel/issues/I4EWO7
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7509c4cb7c8050177da9ee5e053c0c3d55bb66b7
      
      --------------------------------
      
      commit 528b16bf upstream.
      
      On systems with many cores using dm-crypt, heavy spinlock contention in
      percpu_counter_compare() can be observed when the page allocation limit
      for a given device is reached or close to being reached. This is due
      to percpu_counter_compare() taking a spinlock to compute an exact
      result on potentially many CPUs at the same time.
      
      Switch to non-exact comparison of allocated and allowed pages by using
      the value returned by percpu_counter_read_positive() to avoid taking
      the percpu_counter spinlock.
      
      This may over/under estimate the actual number of allocated pages by at
      most (batch-1) * num_online_cpus().
      
      Currently, batch is bounded by 32. The system on which this issue was
      first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
      change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
      allocations, this seems an acceptable error. Certainly preferred over
      running into the spinlock contention.
      
      This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
      and 192GB RAM as follows, but can be provoked on systems with fewer CPUs
      as well.
      
       * Disable swap
       * Tune vm settings to promote regular writeback
           $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
           $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
           $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
      
       * Create 8 dmcrypt devices based on files on a tmpfs
       * Create and mount an ext4 filesystem on each crypt devices
       * Run stress-ng --hdd 8 within one of above filesystems
      
      Total %system usage collected from sysstat goes to ~35%. Write throughput
      on the underlying loop device is ~2GB/s. perf profiling an individual
      kworker kcryptd thread shows the following profile, indicating spinlock
      contention in percpu_counter_compare():
      
          99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
            |
            --ret_from_fork
              kthread
              worker_thread
              |
               --99.92%--process_one_work
                  |
                  |--80.52%--kcryptd_crypt
                  |    |
                  |    |--62.58%--mempool_alloc
                  |    |  |
                  |    |   --62.24%--crypt_page_alloc
                  |    |     |
                  |    |      --61.51%--__percpu_counter_compare
                  |    |        |
                  |    |         --61.34%--__percpu_counter_sum
                  |    |           |
                  |    |           |--58.68%--_raw_spin_lock_irqsave
                  |    |           |  |
                  |    |           |   --58.30%--native_queued_spin_lock_slowpath
                  |    |           |
                  |    |            --0.69%--cpumask_next
                  |    |                |
                  |    |                 --0.51%--_find_next_bit
                  |    |
                  |    |--10.61%--crypt_convert
                  |    |          |
                  |    |          |--6.05%--xts_crypt
                  ...
      
      After applying this patch and running the same test, %system usage is
      lowered to ~7% and write throughput on the loop device increases
      to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
      in the profile and not hitting the percpu_counter() spinlock anymore.
      
          |--8.15%--mempool_alloc
          |    |
          |    |--3.93%--crypt_page_alloc
          |    |    |
          |    |     --3.75%--__alloc_pages
          |    |         |
          |    |          --3.62%--get_page_from_freelist
          |    |              |
          |    |               --3.22%--rmqueue_bulk
          |    |                   |
          |    |                    --2.59%--_raw_spin_lock
          |    |                      |
          |    |                       --2.57%--native_queued_spin_lock_slowpath
          |    |
          |     --3.05%--_raw_spin_lock_irqsave
          |               |
          |                --2.49%--native_queued_spin_lock_slowpath
      Suggested-by: DJ Gregor <dj@corelight.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
      Fixes: 5059353d ("dm crypt: limit the number of allocated pages")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  9. 19 Oct 2021, 2 commits
  10. 15 Oct 2021, 8 commits
  11. 03 Jul 2021, 2 commits
  12. 15 Jun 2021, 1 commit
  13. 03 Jun 2021, 4 commits
      dm verity: allow only one error handling mode · 4ae8420c
      JeongHyeon Lee committed
      mainline inclusion
      from mainline-v5.13-rc1
      commit 219a9b5e
      category: bugfix
      bugzilla: 51874
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=219a9b5e738b75a6a5e9effe1d72f60037a2f131
      
      -----------------------------------------------
      
      If more than one error handling mode is requested during DM verity
      table load, the last requested mode will be used.
      
      Change this to impose stricter checking so that the table load will
      fail if more than one error handling mode is requested.
      Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Luo Meng <luomeng12@huawei.com>
      Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      dm snapshot: fix crash with transient storage and zero chunk size · b712cd09
      Mikulas Patocka committed
      stable inclusion
      from stable-5.10.40
      commit 2a61f0ccb756f966f7d04aa149635c843f821ad3
      bugzilla: 51882
      CVE: NA
      
      --------------------------------
      
      commit c699a0db upstream.
      
      The following commands will crash the kernel:
      
      modprobe brd rd_size=1048576
      dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0"
      dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0"
      
      The reason is that when we test for zero chunk size, we jump to the label
      bad_read_metadata without setting the "r" variable. The function
      snapshot_ctr destroys all the structures and then exits with "r == 0". The
      kernel then crashes because it falsely believes that snapshot_ctr
      succeeded.
      
      In order to fix the bug, we set the variable "r" to -EINVAL.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      md: Fix missing unused status line of /proc/mdstat · 1e099dfb
      Jan Glauber committed
      stable inclusion
      from stable-5.10.37
      commit 0035a4704557ba66824c08d5759d6e743747410b
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 7abfabaf upstream.
      
      Reading /proc/mdstat with a read buffer size that would not
      fit the unused status line in the first read will skip this
      line from the output.
      
      So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
      like: unused devices: <none>
      
      Don't return NULL immediately in start() for v=2 but call
      show() once to print the status line also for multiple reads.
      
      Cc: stable@vger.kernel.org
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
      Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      md: md_open returns -EBUSY when entering racing area · 640134e4
      Zhao Heming committed
      stable inclusion
      from stable-5.10.37
      commit b70b7ec500892f8bc12ffc6f60a3af6fd61d3a8b
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 6a4db2a6 upstream.
      
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creating & removing.
      md_open shouldn't create an mddev when the all_mddevs list doesn't
      contain it. With the current code logic, it is very easy to trigger a
      soft lockup in a non-preempt env.
      
      This patch changes md_open's return value from -ERESTARTSYS to -EBUSY,
      which breaks the infinite retry when md_open enters the racing area.
      
      This patch only partly fixes the soft lockup issue; the full fix needs
      mddev_find to be split into two functions: mddev_find &
      mddev_find_or_alloc, with md_open calling the new mddev_find (which
      only does the searching job).
      
      For more detail, please refer to Christoph's "split mddev_find" patch
      in later commits.
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      The issue triggers almost every time with the below script:
      
      ```
      1  node1="mdcluster1"
      2  node2="mdcluster2"
      3
      4  mdadm -Ss
      5  ssh ${node2} "mdadm -Ss"
      6  wipefs -a /dev/sda /dev/sdb
      7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
         /dev/sdb --assume-clean
      8
      9  for i in {1..10}; do
      10    echo ==== $i ====;
      11
      12    echo "test  ...."
      13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14    sleep 1
      15
      16    echo "clean  ....."
      17    ssh ${node2} "mdadm -Ss"
      18 done
      ```
      
      I use an mdcluster env to trigger the soft lockup, but it isn't an
      mdcluster-special bug. Stopping an md array in an mdcluster env does
      more work than for a non-cluster array, which leaves a large enough
      time gap to allow the kernel to run md_open.
      
      *** stack ***
      
      ```
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```
      
      *** rootcause ***
      
      "mdadm -A" (or other array assemble commands) will start a daemon
      "mdadm --monitor" by default. When "mdadm -Ss" is running, the stop
      action will wake up "mdadm --monitor". The "--monitor" daemon will
      immediately get info from /proc/mdstat. At this time the mddev still
      exists in the kernel, so /proc/mdstat still shows the md device, which
      makes "mdadm --monitor" open /dev/md0.
      
      The previous "mdadm -Ss" is a removing action, while the "mdadm
      --monitor" open action triggers md_open, which is a creating action.
      A race occurs.
      
      ```
      <thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      ```
      
      In a non-preempt kernel, <thread 2> occupies the current CPU, and
      mddev_delayed_delete, which was created in <thread 1>, can't be
      scheduled either.
      
      In a preempt kernel, the above racing can also trigger, but the kernel
      doesn't allow one thread to run on a CPU all the time: after <thread 2>
      has run for some time, the later "mdadm -A" (refer to script line 13
      above) will call md_alloc to alloc a new gendisk for the mddev. This
      breaks the md_open statement "if (mddev->gendisk != bdev->bd_disk)"
      and returns 0 to the caller, so the soft lockup is broken.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>