1. 06 Dec, 2021 1 commit
  2. 15 Nov, 2021 3 commits
  3. 21 Oct, 2021 1 commit
    • dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() · 29975cf5
      Authored by Arne Welzel
      stable inclusion
      from stable-5.10.67
      commit 7509c4cb7c8050177da9ee5e053c0c3d55bb66b7
      bugzilla: 182619 https://gitee.com/openeuler/kernel/issues/I4EWO7
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7509c4cb7c8050177da9ee5e053c0c3d55bb66b7
      
      --------------------------------
      
      commit 528b16bf upstream.
      
      On systems with many cores using dm-crypt, heavy spinlock contention in
      percpu_counter_compare() can be observed when the page allocation limit
      for a given device is reached or close to being reached. This is because
      percpu_counter_compare() takes a spinlock to compute an exact result on
      potentially many CPUs at the same time.
      
      Switch to non-exact comparison of allocated and allowed pages by using
      the value returned by percpu_counter_read_positive() to avoid taking
      the percpu_counter spinlock.
      
      This may over/under estimate the actual number of allocated pages by at
      most (batch-1) * num_online_cpus().
      
      Currently, batch is bounded by 32. The system on which this issue was
      first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
      change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
      allocations, this seems an acceptable error. Certainly preferred over
      running into the spinlock contention.
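
      As a sketch of the idea (following crypt_page_alloc() in
      drivers/md/dm-crypt.c; the exact upstream context may differ slightly),
      the mempool page allocator now uses the lock-free read instead of the
      exact comparison:

      ```
      static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
      {
              struct crypt_config *cc = pool_data;
              struct page *page;

              /*
               * percpu_counter_read_positive() only reads the cached sum, so
               * it may over/under estimate by (batch - 1) * num_online_cpus(),
               * but it never takes the percpu_counter spinlock the way
               * percpu_counter_compare() does.
               */
              if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) >=
                           dm_crypt_pages_per_client) &&
                  likely(gfp_mask & __GFP_DIRECT_RECLAIM))
                      return NULL;

              page = alloc_page(gfp_mask);
              if (likely(page != NULL))
                      percpu_counter_add(&cc->n_allocated_pages, 1);

              return page;
      }
      ```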
      
      This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
      and 192GB RAM as follows, but can be provoked on systems with fewer CPUs
      as well.
      
       * Disable swap
       * Tune vm settings to promote regular writeback
           $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
           $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
           $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
      
       * Create 8 dmcrypt devices based on files on a tmpfs
       * Create and mount an ext4 filesystem on each crypt device
       * Run stress-ng --hdd 8 within one of the above filesystems
      
      Total %system usage collected from sysstat goes to ~35%. Write throughput
      on the underlying loop device is ~2GB/s. perf profiling an individual
      kworker kcryptd thread shows the following profile, indicating spinlock
      contention in percpu_counter_compare():
      
          99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
            |
            --ret_from_fork
              kthread
              worker_thread
              |
               --99.92%--process_one_work
                  |
                  |--80.52%--kcryptd_crypt
                  |    |
                  |    |--62.58%--mempool_alloc
                  |    |  |
                  |    |   --62.24%--crypt_page_alloc
                  |    |     |
                  |    |      --61.51%--__percpu_counter_compare
                  |    |        |
                  |    |         --61.34%--__percpu_counter_sum
                  |    |           |
                  |    |           |--58.68%--_raw_spin_lock_irqsave
                  |    |           |  |
                  |    |           |   --58.30%--native_queued_spin_lock_slowpath
                  |    |           |
                  |    |            --0.69%--cpumask_next
                  |    |                |
                  |    |                 --0.51%--_find_next_bit
                  |    |
                  |    |--10.61%--crypt_convert
                  |    |          |
                  |    |          |--6.05%--xts_crypt
                  ...
      
      After applying this patch and running the same test, %system usage is
      lowered to ~7% and write throughput on the loop device increases to
      ~2.7GB/s. perf report shows mempool_alloc() at ~8% rather than ~62% of
      the profile, no longer hitting the percpu_counter spinlock.
      
          |--8.15%--mempool_alloc
          |    |
          |    |--3.93%--crypt_page_alloc
          |    |    |
          |    |     --3.75%--__alloc_pages
          |    |         |
          |    |          --3.62%--get_page_from_freelist
          |    |              |
          |    |               --3.22%--rmqueue_bulk
          |    |                   |
          |    |                    --2.59%--_raw_spin_lock
          |    |                      |
          |    |                       --2.57%--native_queued_spin_lock_slowpath
          |    |
          |     --3.05%--_raw_spin_lock_irqsave
          |               |
          |                --2.49%--native_queued_spin_lock_slowpath
      Suggested-by: DJ Gregor <dj@corelight.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
      Fixes: 5059353d ("dm crypt: limit the number of allocated pages")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  4. 19 Oct, 2021 2 commits
  5. 15 Oct, 2021 8 commits
  6. 03 Jul, 2021 2 commits
  7. 15 Jun, 2021 1 commit
  8. 03 Jun, 2021 14 commits
    • dm verity: allow only one error handling mode · 4ae8420c
      Authored by JeongHyeon Lee
      mainline inclusion
      from mainline-v5.13-rc1
      commit 219a9b5e
      category: bugfix
      bugzilla: 51874
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=219a9b5e738b75a6a5e9effe1d72f60037a2f131
      
      -----------------------------------------------
      
      If more than one error handling mode is requested during DM verity table
      load, the last requested mode will be used.
      
      Change this to impose more strict checking so that the table load will
      fail if more than one error handling mode is requested.
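
      A minimal sketch of the stricter check (hypothetical helper and layout;
      the upstream patch may structure the parsing differently):

      ```
      /*
       * Hypothetical sketch: fail the table load if a second, conflicting
       * error handling option is given on the same table line.
       */
      static int verity_set_mode(struct dm_verity *v, unsigned int mode)
      {
              if (v->mode != DM_VERITY_MODE_EIO) {    /* a mode was already requested */
                      v->ti->error = "Conflicting error handling parameters";
                      return -EINVAL;
              }
              v->mode = mode;
              return 0;
      }
      ```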
      Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Luo Meng <luomeng12@huawei.com>
      Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm snapshot: fix crash with transient storage and zero chunk size · b712cd09
      Authored by Mikulas Patocka
      stable inclusion
      from stable-5.10.40
      commit 2a61f0ccb756f966f7d04aa149635c843f821ad3
      bugzilla: 51882
      CVE: NA
      
      --------------------------------
      
      commit c699a0db upstream.
      
      The following commands will crash the kernel:
      
      modprobe brd rd_size=1048576
      dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0"
      dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0"
      
      The reason is that when we test for zero chunk size, we jump to the label
      bad_read_metadata without setting the "r" variable. The function
      snapshot_ctr destroys all the structures and then exits with "r == 0". The
      kernel then crashes because it falsely believes that snapshot_ctr
      succeeded.
      
      In order to fix the bug, we set the variable "r" to -EINVAL.
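
      The fix is a one-liner in snapshot_ctr() (sketch, drivers/md/dm-snap.c):

      ```
      if (!s->store->chunk_size) {
              ti->error = "Chunk size not set";
              r = -EINVAL;            /* previously missing: "r" was still 0 here */
              goto bad_read_metadata;
      }
      ```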
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • md: Fix missing unused status line of /proc/mdstat · 1e099dfb
      Authored by Jan Glauber
      stable inclusion
      from stable-5.10.37
      commit 0035a4704557ba66824c08d5759d6e743747410b
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 7abfabaf upstream.
      
      Reading /proc/mdstat with a read buffer too small to fit the unused
      status line in the first read will drop this line from the output.
      
      So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
      like: unused devices: <none>
      
      Don't return NULL immediately in start() for v=2 but call
      show() once to print the status line also for multiple reads.
      
      Cc: stable@vger.kernel.org
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
      Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • md: md_open returns -EBUSY when entering racing area · 640134e4
      Authored by Zhao Heming
      stable inclusion
      from stable-5.10.37
      commit b70b7ec500892f8bc12ffc6f60a3af6fd61d3a8b
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 6a4db2a6 upstream.
      
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creation & removal. The
      md_open shouldn't create an mddev when the all_mddevs list doesn't
      contain it. With the current code logic, it is very easy to trigger a
      soft lockup in a non-preempt environment.

      This patch changes md_open to return -EBUSY instead of -ERESTARTSYS,
      which breaks the infinite retry when md_open enters the racing area.

      This patch only partly fixes the soft lockup issue; the full fix needs
      mddev_find to be split into two functions: mddev_find &
      mddev_find_or_alloc, with md_open calling the new mddev_find (which
      only does the lookup).

      For more detail, please refer to Christoph's "split mddev_find" patch
      in later commits.
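
      Roughly, the change in md_open() amounts to the following (sketch; the
      surrounding context in drivers/md/md.c is abbreviated):

      ```
      if (mddev->gendisk != bdev->bd_disk) {
              /* we are racing with mddev_put which is discarding this bd_disk */
              mddev_put(mddev);
              /* Wait until bdev->bd_disk is definitely gone */
              flush_workqueue(md_misc_wq);
              /*
               * Returning -ERESTARTSYS made the caller retry the open forever
               * on a non-preempt kernel; -EBUSY simply fails the open and
               * breaks the loop.
               */
              return -EBUSY;
      }
      ```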
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      The issue triggers almost every time with the script below:
      
      ```
      1  node1="mdcluster1"
      2  node2="mdcluster2"
      3
      4  mdadm -Ss
      5  ssh ${node2} "mdadm -Ss"
      6  wipefs -a /dev/sda /dev/sdb
      7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
         /dev/sdb --assume-clean
      8
      9  for i in {1..10}; do
      10    echo ==== $i ====;
      11
      12    echo "test  ...."
      13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14    sleep 1
      15
      16    echo "clean  ....."
      17    ssh ${node2} "mdadm -Ss"
      18 done
      ```
      
      I used an mdcluster environment to trigger the soft lockup, but it is
      not an mdcluster-specific bug. Stopping an md array in an mdcluster
      environment does more work than stopping a non-cluster array, which
      leaves a large enough time window for the kernel to run md_open.
      
      *** stack ***
      
      ```
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```
      
      *** rootcause ***
      
      "mdadm -A" (or other array assemble commands) will start a daemon "mdadm
      --monitor" by default. When "mdadm -Ss" is running, the stop action
      wakes up "mdadm --monitor". The "--monitor" daemon immediately reads
      /proc/mdstat. At this point the mddev still exists in the kernel, so
      /proc/mdstat still shows the md device, which makes "mdadm --monitor"
      open /dev/md0.

      The earlier "mdadm -Ss" is the removing action, while the "mdadm
      --monitor" open triggers md_open, which is the creating action: the two
      race with each other.
      
      ```
      <thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      ```
      
      On a non-preempt kernel, <thread 2> keeps occupying the current CPU, so
      mddev_delayed_delete, which was queued by <thread 1>, never gets
      scheduled.

      On a preempt kernel, the same race can be triggered, but the kernel
      doesn't let one thread run on a CPU indefinitely. After <thread 2> has
      run for a while, the later "mdadm -A" (see script line 13 above) calls
      md_alloc to allocate a new gendisk for the mddev. That makes the md_open
      check "if (mddev->gendisk != bdev->bd_disk)" no longer trigger, md_open
      returns 0 to the caller, and the soft lockup is broken.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • md: factor out a mddev_find_locked helper from mddev_find · f80d4b29
      Authored by Christoph Hellwig
      stable inclusion
      from stable-5.10.37
      commit cdcfa77a332a57962ee3af255f8769fd5cdf97ad
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 8b57251f upstream.
      
      Factor out a self-contained helper to just look up an mddev by the dev_t
      "unit".
      
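      The helper is essentially the following (sketch; callers must hold
      all_mddevs_lock):

      ```
      static struct mddev *mddev_find_locked(dev_t unit)
      {
              struct mddev *mddev;

              list_for_each_entry(mddev, &all_mddevs, all_mddevs)
                      if (mddev->unit == unit)
                              return mddev;

              return NULL;
      }
      ```
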
      Cc: stable@vger.kernel.org
      Reviewed-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • md: split mddev_find · 69eae441
      Authored by Christoph Hellwig
      stable inclusion
      from stable-5.10.37
      commit 07e73740850299e39f1737aff4811e79021f72e5
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 65aa97c4 upstream.
      
      Split mddev_find into a simple mddev_find that just finds an existing
      mddev by the unit number, and a more complicated mddev_find_or_alloc
      that deals with finding or allocating an mddev.

      This turns out to fix the bug reported by Zhao Heming:
      
      ----------------------------- snip ------------------------------
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creation & removal. The
      md_open shouldn't create an mddev when the all_mddevs list doesn't
      contain it. With the current code logic, it is very easy to trigger a
      soft lockup in a non-preempt environment.
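
      For reference, the lookup-only mddev_find() after the split is roughly
      (sketch, building on the mddev_find_locked() helper from the previous
      patch):

      ```
      static struct mddev *mddev_find(dev_t unit)
      {
              struct mddev *mddev;

              if (MAJOR(unit) != MD_MAJOR)
                      unit &= ~((1 << MdpMinorShift) - 1);

              spin_lock(&all_mddevs_lock);
              mddev = mddev_find_locked(unit);
              if (mddev)
                      mddev_get(mddev);
              spin_unlock(&all_mddevs_lock);

              return mddev;
      }
      ```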
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • md-cluster: fix use-after-free issue when removing rdev · 152be1b9
      Authored by Heming Zhao
      stable inclusion
      from stable-5.10.37
      commit 61b8c6efbe87c445c3907fc36a9644ed705228f8
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit f7c7a2f9 upstream.
      
      md_kick_rdev_from_array() removes the rdev from the list, so we should
      use rdev_for_each_safe() to walk the list.
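
      For example, the cluster-remove handling in md_check_recovery() becomes
      (sketch):

      ```
      if (mddev_is_clustered(mddev)) {
              struct md_rdev *rdev, *tmp;
              /*
               * md_kick_rdev_from_array() unlinks the rdev from the list,
               * so the _safe iterator must be used here.
               */
              rdev_for_each_safe(rdev, tmp, mddev) {
                      if (test_and_clear_bit(ClusterRemove, &rdev->flags) &&
                          rdev->raid_disk < 0)
                              md_kick_rdev_from_array(rdev);
              }
      }
      ```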
      
      How to trigger:
      
      env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).
      
      ```
      node2=192.168.0.3
      
      for i in {1..20}; do
          echo ==== $i `date` ====;
      
          mdadm -Ss && ssh ${node2} "mdadm -Ss"
          wipefs -a /dev/sda /dev/sdb
      
          mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
             /dev/sdb --assume-clean
          ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
          mdadm --wait /dev/md0
          ssh ${node2} "mdadm --wait /dev/md0"
      
          mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
          sleep 1
      done
      ```
      
      Crash stack:
      
      ```
      stack segment: 0000 [#1] SMP
      ... ...
      RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
      ... ...
      RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
      RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
      RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
      RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
      FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       raid1d+0x5c/0xd40 [raid1]
       ? finish_task_switch+0x75/0x2a0
       ? lock_timer_base+0x67/0x80
       ? try_to_del_timer_sync+0x4d/0x80
       ? del_timer_sync+0x41/0x50
       ? schedule_timeout+0x254/0x2d0
       ? md_start_sync+0xe0/0xe0 [md_mod]
       ? md_thread+0x127/0x160 [md_mod]
       md_thread+0x127/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       kthread+0x10d/0x130
       ? kthread_park+0xa0/0xa0
       ret_from_fork+0x1f/0x40
      ```
      
      Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
      Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
      Cc: stable@vger.kernel.org
      Reviewed-by: Gang He <ghe@suse.com>
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • md/bitmap: wait for external bitmap writes to complete during tear down · 5b4ff7e6
      Authored by Sudhakar Panneerselvam
      stable inclusion
      from stable-5.10.37
      commit 569885ad7518421d76e4fc1b71b6b6eb8f3bedc7
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 404a8ef5 upstream.
      
      NULL pointer dereference was observed in super_written() when it tries
      to access the mddev structure.
      
      [The below stack trace is from an older kernel, but the problem described
      in this patch applies to the mainline kernel.]
      
      [ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000
      [ 1194.488000] RIP: 0010:super_written+0x29/0xe1
      [ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046
      [ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048
      [ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200
      [ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000
      [ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200
      [ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [ 1194.588117] FS:  0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000
      [ 1194.604264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0
      [ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1194.663316] PKRU: 55555554
      [ 1194.674090] Call Trace:
      [ 1194.683735]  <IRQ>
      [ 1194.692948]  bio_endio+0xae/0x135
      [ 1194.703580]  blk_update_request+0xad/0x2fa
      [ 1194.714990]  blk_update_bidi_request+0x20/0x72
      [ 1194.726578]  __blk_end_bidi_request+0x2c/0x4d
      [ 1194.738373]  __blk_end_request_all+0x31/0x49
      [ 1194.749344]  blk_flush_complete_seq+0x377/0x383
      [ 1194.761550]  flush_end_io+0x1dd/0x2a7
      [ 1194.772910]  blk_finish_request+0x9f/0x13c
      [ 1194.784544]  scsi_end_request+0x180/0x25c
      [ 1194.796149]  scsi_io_completion+0xc8/0x610
      [ 1194.807503]  scsi_finish_command+0xdc/0x125
      [ 1194.818897]  scsi_softirq_done+0x81/0xde
      [ 1194.830062]  blk_done_softirq+0xa4/0xcc
      [ 1194.841008]  __do_softirq+0xd9/0x29f
      [ 1194.851257]  irq_exit+0xe6/0xeb
      [ 1194.861290]  do_IRQ+0x59/0xe3
      [ 1194.871060]  common_interrupt+0x1c6/0x382
      [ 1194.881988]  </IRQ>
      [ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5
      [ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43
      [ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f
      [ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
      [ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2
      [ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003
      [ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a
      [ 1194.988607]  ? cpuidle_enter_state+0xcc/0x2a5
      [ 1194.999077]  cpuidle_enter+0x17/0x19
      [ 1195.008395]  call_cpuidle+0x23/0x3a
      [ 1195.017718]  do_idle+0x172/0x1d5
      [ 1195.026358]  cpu_startup_entry+0x73/0x75
      [ 1195.035769]  start_secondary+0x1b9/0x20b
      [ 1195.044894]  secondary_startup_64+0xa5/0xa5
      [ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78
      [ 1195.096354] CR2: 00000000000002b8
      
      The bio in the above stack is a bitmap write whose completion is invoked
      after the tear-down sequence sets the mddev pointer in the rdev to NULL.
      
      During tear down, there is an attempt to flush the bitmap writes, but for
      external bitmaps, there is no explicit wait for all the bitmap writes to
      complete. For instance, md_bitmap_flush() is called to flush the bitmap
      writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush()
      could generate new bitmap writes for which there is no explicit wait to
      complete those writes. The call to md_bitmap_update_sb() will return
      simply for external bitmaps and the follow-up call to md_update_sb() is
      conditional and may not get called for external bitmaps. This results in a
      kernel panic when the completion routine super_written() is called and
      tries to reference the mddev in the rdev that has been set to
      NULL (in unbind_rdev_from_array(), by the tear-down sequence).
      
      The solution is to call md_super_wait() for external bitmaps after the
      last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there
      are no pending bitmap writes before proceeding with the tear down.
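
      In md_bitmap_flush() this amounts to roughly (sketch):

      ```
      /* ... after the last round of md_bitmap_daemon_work() ... */
      bitmap->daemon_lastrun -= sleep;
      md_bitmap_daemon_work(mddev);
      if (mddev->bitmap_info.external)
              md_super_wait(mddev);   /* wait for outstanding bitmap bios */
      md_bitmap_update_sb(bitmap);
      ```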
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
      Reviewed-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails · 41266ba0
      Authored by Benjamin Block
      stable inclusion
      from stable-5.10.36
      commit 1cb02dc76f4c0a2749a02b26469512d6984252e9
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      commit 8e947c8f upstream.
      
      When loading a device-mapper table for a request-based mapped device,
      and the allocation/initialization of the blk_mq_tag_set for the device
      fails, a following device remove will cause a double free.
      
      E.g. (dmesg):
        device-mapper: core: Cannot initialize queue for request-based dm-mq mapped device
        device-mapper: ioctl: unable to set up device queue for new table.
        Unable to handle kernel pointer dereference in virtual kernel address space
        Failing address: 0305e098835de000 TEID: 0305e098835de803
        Fault in home space mode while using kernel ASCE.
        AS:000000025efe0007 R3:0000000000000024
        Oops: 0038 ilc:3 [#1] SMP
        Modules linked in: ... lots of modules ...
        Supported: Yes, External
        CPU: 0 PID: 7348 Comm: multipathd Kdump: loaded Tainted: G        W      X    5.3.18-53-default #1 SLE15-SP3
        Hardware name: IBM 8561 T01 7I2 (LPAR)
        Krnl PSW : 0704e00180000000 000000025e368eca (kfree+0x42/0x330)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
        Krnl GPRS: 000000000000004a 000000025efe5230 c1773200d779968d 0000000000000000
                   000000025e520270 000000025e8d1b40 0000000000000003 00000007aae10000
                   000000025e5202a2 0000000000000001 c1773200d779968d 0305e098835de640
                   00000007a8170000 000003ff80138650 000000025e5202a2 000003e00396faa8
        Krnl Code: 000000025e368eb8: c4180041e100       lgrl    %r1,25eba50b8
                   000000025e368ebe: ecba06b93a55       risbg   %r11,%r10,6,185,58
                  #000000025e368ec4: e3b010000008       ag      %r11,0(%r1)
                  >000000025e368eca: e310b0080004       lg      %r1,8(%r11)
                   000000025e368ed0: a7110001           tmll    %r1,1
                   000000025e368ed4: a7740129           brc     7,25e369126
                   000000025e368ed8: e320b0080004       lg      %r2,8(%r11)
                   000000025e368ede: b904001b           lgr     %r1,%r11
        Call Trace:
         [<000000025e368eca>] kfree+0x42/0x330
         [<000000025e5202a2>] blk_mq_free_tag_set+0x72/0xb8
         [<000003ff801316a8>] dm_mq_cleanup_mapped_device+0x38/0x50 [dm_mod]
         [<000003ff80120082>] free_dev+0x52/0xd0 [dm_mod]
         [<000003ff801233f0>] __dm_destroy+0x150/0x1d0 [dm_mod]
         [<000003ff8012bb9a>] dev_remove+0x162/0x1c0 [dm_mod]
         [<000003ff8012a988>] ctl_ioctl+0x198/0x478 [dm_mod]
         [<000003ff8012ac8a>] dm_ctl_ioctl+0x22/0x38 [dm_mod]
         [<000000025e3b11ee>] ksys_ioctl+0xbe/0xe0
         [<000000025e3b127a>] __s390x_sys_ioctl+0x2a/0x40
         [<000000025e8c15ac>] system_call+0xd8/0x2c8
        Last Breaking-Event-Address:
         [<000000025e52029c>] blk_mq_free_tag_set+0x6c/0xb8
        Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      When allocation/initialization of the blk_mq_tag_set fails in
      dm_mq_init_request_queue(), it is uninitialized/freed, but the pointer
      is not reset to NULL; so when dev_remove() later gets into
      dm_mq_cleanup_mapped_device() it sees the pointer and tries to
      uninitialize and free it again.
      
      Fix this by setting the pointer to NULL in dm_mq_init_request_queue()
      error-handling. Also set it to NULL in dm_mq_cleanup_mapped_device().
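
      Sketch of the relevant parts of drivers/md/dm-rq.c after the fix:

      ```
      /* error path of dm_mq_init_request_queue() */
      out_tag_set:
              blk_mq_free_tag_set(md->tag_set);
      out_kfree_tag_set:
              kfree(md->tag_set);
              md->tag_set = NULL;     /* reset so a later cleanup sees nothing to free */

              return err;

      /* and the cleanup path called from dev_remove() */
      void dm_mq_cleanup_mapped_device(struct mapped_device *md)
      {
              if (md->tag_set) {
                      blk_mq_free_tag_set(md->tag_set);
                      kfree(md->tag_set);
                      md->tag_set = NULL;
              }
      }
      ```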
      
      Cc: <stable@vger.kernel.org> # 4.6+
      Fixes: 1c357a1e ("dm: allocate blk_mq_tag_set rather than embed in mapped_device")
      Signed-off-by: Benjamin Block <bblock@linux.ibm.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm integrity: fix missing goto in bitmap_flush_interval error handling · 10f497cf
      Authored by Tian Tao
      stable inclusion
      from stable-5.10.36
      commit 06141465e37251d3f4cb830fc762f5685d2fa4c3
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      commit 17e9e134 upstream.
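
      The message is terse; the change itself (sketch of the option parsing in
      dm_integrity_ctr(), drivers/md/dm-integrity.c) adds the jump to the error
      path that was missing when the bitmap_flush_interval value is rejected:

      ```
      } else if (sscanf(opt_string, "bitmap_flush_interval:%u%c", &val, &dummy) == 1) {
              if (val >= (uint64_t)UINT_MAX * 1000 / HZ) {
                      r = -EINVAL;
                      ti->error = "Invalid bitmap_flush_interval argument";
                      goto bad;       /* previously missing */
              }
              ic->bitmap_flush_interval = msecs_to_jiffies(val);
      }
      ```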
      
      Fixes: 468dfca3 ("dm integrity: add a bitmap mode")
      Cc: stable@vger.kernel.org
      Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm space map common: fix division bug in sm_ll_find_free_block() · 43c6f524
      Authored by Joe Thornber
      stable inclusion
      from stable-5.10.36
      commit df893916b33026117051fdff492dcd2aa21c38a3
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      commit 5208692e upstream.
      
      This division bug meant the search for free metadata space could skip
      the final allocation bitmap's worth of entries. Fix affects DM thinp,
      cache and era targets.
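
      For reference, the fix boils down to handling an end index that falls
      exactly on a bitmap-block boundary (sketch of sm_ll_find_free_block() in
      drivers/md/persistent-data/dm-space-map-common.c):

      ```
      begin = do_div(index_begin, ll->entries_per_block);
      end = do_div(end, ll->entries_per_block);
      /*
       * Without this, an "end" that is an exact multiple of entries_per_block
       * becomes 0 and the final bitmap block's entries are never searched.
       */
      if (end == 0)
              end = ll->entries_per_block;
      ```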
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Tested-by: Ming-Hung Tsai <mtsai@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm persistent data: packed struct should have an aligned() attribute too · c1394603
      Authored by Joe Thornber
      stable inclusion
      from stable-5.10.36
      commit fcf763a80e0ea051057926527ded16b1933d50e6
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      commit a88b2358 upstream.
      
      Otherwise most non-x86 architectures (e.g. riscv, arm) will resort to
      byte-by-byte access.
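
      A hypothetical example of the attribute combination (not the actual
      persistent-data structs):

      ```
      /*
       * __packed keeps the on-disk layout exact; __aligned(8) tells the
       * compiler the struct still starts on a natural boundary, so riscv/arm
       * can use word loads instead of byte-by-byte access.
       */
      struct example_disk_header {
              __le32 csum;
              __le32 flags;
              __le64 blocknr;
      } __packed __aligned(8);
      ```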
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences · 3e997379
      Authored by Heinz Mauelshagen
      stable inclusion
      from stable-5.10.36
      commit 0cd2d2577a982863a65d1f7546771bb6547d92c5
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      commit f99a8e43 upstream.
      
      If fast table reloads occur during an ongoing reshape of raid4/5/6
      devices, the target may race reading a superblock against the MD resync
      thread, causing an inconclusive reshape state to be read in its
      constructor.
      
      lvm2 test lvconvert-raid-reshape-stripes-load-reload.sh can cause
      BUG_ON() to trigger in md_run(), e.g.:
      "kernel BUG at drivers/md/raid5.c:7567!".
      
      Scenario triggering the bug:
      
      1. the MD sync thread calls end_reshape() from raid5_sync_request()
         when done reshaping. However end_reshape() _only_ updates the
         reshape position to MaxSector keeping the changed layout
         configuration though (i.e. any delta disks, chunk sector or RAID
         algorithm changes). That inconclusive configuration is stored in
         the superblock.
      
      2. dm-raid constructs a mapping, loading named inconsistent superblock
         as of step 1 before step 3 is able to finish resetting the reshape
         state completely, and calls md_run() which leads to mentioned bug
         in raid5.c.
      
      3. the MD RAID personality's finish_reshape() is called; which resets
         the reshape information on chunk sectors, delta disks, etc. This
         explains why the bug is rarely seen on multi-core machines, as MD's
         finish_reshape() superblock update races with the dm-raid
         constructor's superblock load in step 2.
      
      The fix identifies inconclusive superblock content in the dm-raid
      constructor and resets it before calling md_run(), factoring out the
      identifying checks into rs_is_layout_change(), shared by the existing
      rs_reshape_requested() and the new rs_reset_inconclusive_reshape(). Also
      enhance a comment and remove an empty line.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • md/raid1: properly indicate failure when ending a failed write request · aeefa199
      Authored by Paul Clements
      stable inclusion
      from stable-5.10.36
      commit 661061a45e32d8b2cc0e306da9f169ad44011382
      bugzilla: 51867
      CVE: NA
      
      --------------------------------
      
      commit 2417b986 upstream.
      
      This patch addresses a data corruption bug in raid1 arrays using bitmaps.
      Without this fix, the bitmap bits for the failed I/O end up being cleared.
      
      Since we are in the failure leg of raid1_end_write_request, the request
      either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded).
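
      Sketch of the failure leg in raid1_end_write_request() after the fix:

      ```
      if (!test_bit(Faulty, &rdev->flags))
              set_bit(R1BIO_WriteError, &r1_bio->state);      /* retry later */
      else {
              /*
               * Fail the request so the bitmap bits stay dirty and the blocks
               * are resynced later, instead of being silently cleared.
               */
              set_bit(R1BIO_Degraded, &r1_bio->state);
              /* Finished with this branch */
              r1_bio->bios[mirror] = NULL;
              to_put = bio;
      }
      ```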
      
      Fixes: eeba6809 ("md/raid1: end bio when the device faulty")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Paul Clements <paul.clements@us.sios.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  9. 26 Apr, 2021 1 commit
  10. 19 Apr, 2021 3 commits
  11. 09 Apr, 2021 4 commits
    • dm verity: fix FEC for RS roots unaligned to block size · f179e678
      Authored by Milan Broz
      stable inclusion
      from stable-5.10.22
      commit ce1cca17381f9395c2b27d24fcfe553efa0bf466
      bugzilla: 50796
      
      --------------------------------
      
      commit df7b59ba upstream.
      
      Optional Forward Error Correction (FEC) code in dm-verity uses
      Reed-Solomon code and should support roots from 2 to 24.
      
      The error correction parity bytes (of roots lengths per RS block) are
      stored on a separate device in sequence without any padding.
      
      Currently, to access FEC device, the dm-verity-fec code uses dm-bufio
      client with block size set to verity data block (usually 4096 or 512
      bytes).
      
      Because this block size is not divisible by some (most!) of the roots
      supported lengths, data repair cannot work for partially stored parity
      bytes.
      
      This fix changes FEC device dm-bufio block size to "roots << SECTOR_SHIFT"
      where we can be sure that the full parity data is always available.
      (There cannot be partial FEC blocks because parity must cover whole
      sectors.)
      
      Because the optional FEC starting offset could be unaligned to this
      new block size, we have to use dm_bufio_set_sector_offset() to
      configure it.
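
      Sketch of the changed client setup in verity_fec_ctr()
      (drivers/md/dm-verity-fec.c; surrounding code abbreviated):

      ```
      /*
       * The FEC bufio block size is now one RS block's worth of parity
       * (roots sectors) instead of the verity data block size, so a buffer
       * always holds the complete parity for its RS block.
       */
      f->bufio = dm_bufio_client_create(f->dev->bdev,
                                        f->roots << SECTOR_SHIFT,
                                        1, 0, NULL, NULL);
      if (IS_ERR(f->bufio)) {
              ti->error = "Cannot initialize FEC bufio client";
              return PTR_ERR(f->bufio);
      }

      /*
       * The FEC area may start at an offset that is not a multiple of the new
       * block size, so tell bufio where block 0 really begins.
       */
      dm_bufio_set_sector_offset(f->bufio,
                                 f->start << (v->data_dev_block_bits - SECTOR_SHIFT));
      ```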
      
      The problem is easily reproduced using veritysetup, e.g. for roots=13:
      
        # create verity device with RS FEC
        dd if=/dev/urandom of=data.img bs=4096 count=8 status=none
        veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash
      
        # create an erasure that should be always repairable with this roots setting
        dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none
      
        # try to read it through dm-verity
        veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash)
        dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer
        # wait for possible recursive recovery in kernel
        udevadm settle
        veritysetup close test
      
      With this fix, errors are properly repaired.
        device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors
        ...
      
      Without it, FEC code usually ends on unrecoverable failure in RS decoder:
        device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74
        ...
      
      This problem is present in all kernels since the FEC code's
      introduction (kernel 4.5).
      
      It is thought that this problem is not visible in Android ecosystem
      because it always uses a default RS roots=2.
      
      Depends-on: a14e5ec6 ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size")
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Tested-by: Jérôme Carretero <cJ-ko@zougloub.eu>
      Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Cc: stable@vger.kernel.org # 4.5+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size · 921a0a3b
      Authored by Mikulas Patocka
      stable inclusion
      from stable-5.10.22
      commit 7bda53f46387bc2b40a80befb1a5648dd9840e9c
      bugzilla: 50796
      
      --------------------------------
      
      commit a14e5ec6 upstream.
      
      dm_bufio_get_device_size returns the device size in blocks. Before
      returning the value, we must subtract the number of starting
      sectors. The number of starting sectors may not be divisible by the
      block size.
      
      Note that currently, no target is using dm_bufio_set_sector_offset and
      dm_bufio_get_device_size simultaneously, so this change has no effect.
      However, an upcoming dm-verity-fec fix needs this change.
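
      Sketch of the adjusted helper in drivers/md/dm-bufio.c:

      ```
      sector_t dm_bufio_get_device_size(struct dm_bufio_client *c)
      {
              sector_t s = i_size_read(c->bdev->bd_inode) >> SECTOR_SHIFT;

              /* Drop the sectors before the bufio start offset first ... */
              if (s >= c->start)
                      s -= c->start;
              else
                      s = 0;

              /* ... then convert the remaining sectors to blocks. */
              if (likely(c->sectors_per_block_bits >= 0))
                      s >>= c->sectors_per_block_bits;
              else
                      sector_div(s, c->block_size >> SECTOR_SHIFT);
              return s;
      }
      ```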
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Milan Broz <gmazyland@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm era: only resize metadata in preresume · 48c4990e
      Authored by Nikos Tsironis
      stable inclusion
      from stable-5.10.20
      commit 9bfb6d528467d8d40cc0091f314be6a7718f2c04
      bugzilla: 50608
      
      --------------------------------
      
      commit cca2c6ae upstream.
      
      Metadata resize shouldn't happen in the ctr. The ctr loads a temporary
      (inactive) table that will only become active upon resume. That is why
      the resize should always be tied to resume. Otherwise a load (ctr)
      whose inactive table never becomes active will incorrectly resize the
      metadata.
      
      Also, perform the resize directly in preresume, instead of using the
      worker to do it.
      
      The worker might run other metadata operations, e.g., it could start
      digestion, before resizing the metadata. These operations will end up
      using the old size.
      
      This could lead to errors, like:
      
        device-mapper: era: metadata_digest_transcribe_writeset: dm_array_set_value failed
        device-mapper: era: process_old_eras: digest step failed, stopping digestion
      
      The reason of the above error is that the worker started the digestion
      of the archived writeset using the old, larger size.
      
      As a result, metadata_digest_transcribe_writeset tried to write beyond
      the end of the era array.
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • dm era: Reinitialize bitset cache before digesting a new writeset · dc55e080
      Authored by Nikos Tsironis
      stable inclusion
      from stable-5.10.20
      commit a46ab7c3a411620db9f18a11b19896b4bfbbec50
      bugzilla: 50608
      
      --------------------------------
      
      commit 25249333 upstream.
      
      In case of devices with at most 64 blocks, the digestion of consecutive
      eras uses the writeset of the first era as the writeset of all eras to
      digest, leading to lost writes. That is, we lose the information about
      what blocks were written during the affected eras.
      
      The digestion code uses a dm_disk_bitset object to access the archived
      writesets. This structure includes a one word (64-bit) cache to reduce
      the number of array lookups.
      
      This structure is initialized only once, in metadata_digest_start(),
      when we kick off digestion.
      
      But, when we insert a new writeset into the writeset tree, before the
      digestion of the previous writeset is done, or equivalently when there
      are multiple writesets in the writeset tree to digest, then all these
      writesets are digested using the same cache and the cache is not
      re-initialized when moving from one writeset to the next.
      
      For devices with more than 64 blocks, i.e., the size of the cache, the
      cache is indirectly invalidated when we move to a next set of blocks, so
      we avoid the bug.
      
      But for devices with at most 64 blocks we end up using the same cached
      data for digesting all archived writesets, i.e., the cache is loaded
      when digesting the first writeset and it never gets reloaded, until the
      digestion is done.
      
      As a result, the writeset of the first era to digest is used as the
      writeset of all the following archived eras, leading to lost writes.
      
      Fix this by reinitializing the dm_disk_bitset structure, and thus
      invalidating the cache, every time the digestion code starts digesting a
      new writeset.
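
      Sketch of the fix in metadata_digest_lookup_writeset()
      (drivers/md/dm-era-target.c):

      ```
      ws_unpack(&disk, &d->writeset);
      d->value = cpu_to_le32(key);

      /*
       * Re-initialise the on-disk bitset info for every writeset we digest,
       * so the one-word cache of the previous writeset cannot leak into the
       * next one.
       */
      dm_disk_bitset_init(md->tm, &d->info);

      d->nr_bits = min(d->writeset.nr_bits, md->nr_blocks);
      d->current_bit = 0;
      d->step = metadata_digest_transcribe_writeset;
      ```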
      
      Fixes: eec40579 ("dm: add era target")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>