1. 23 Feb 2022, 1 commit
  2. 07 Jan 2022, 3 commits
    • md: Move alloc/free acct bioset in to personality · 0c031fd3
      Authored by Xiao Ni
      The acct bioset is only needed for raid0 and raid5, so md_run only
      allocates it for those two levels. However, this does not cover
      personality takeover, which may leave the bioset uninitialized. For
      example, the following repro steps:
      
        mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1
        mdadm --wait /dev/md0
        mkfs.xfs /dev/md0
        mdadm /dev/md0 --grow -l5
        mount /dev/md0 /mnt
      
      cause a panic like:
      
      [  225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [  225.934903] #PF: supervisor instruction fetch in kernel mode
      [  225.935639] #PF: error_code(0x0010) - not-present page
      [  225.936361] PGD 0 P4D 0
      [  225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
      [  225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706
      [  225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014
      [  225.939922] RIP: 0010:0x0
      [  225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
      [  225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246
      [  225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39
      [  225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800
      [  225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01
      [  225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58
      [  225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040
      [  225.946709] FS:  00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000
      [  225.947814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0
      [  225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  225.951414] Call Trace:
      [  225.951787]  <TASK>
      [  225.952120]  mempool_alloc+0xe5/0x250
      [  225.952625]  ? mempool_resize+0x370/0x370
      [  225.953187]  ? rcu_read_lock_sched_held+0xa1/0xd0
      [  225.953862]  ? rcu_read_lock_bh_held+0xb0/0xb0
      [  225.954464]  ? sched_clock_cpu+0x15/0x120
      [  225.955019]  ? find_held_lock+0xac/0xd0
      [  225.955564]  bio_alloc_bioset+0x1ed/0x2a0
      [  225.956080]  ? lock_downgrade+0x3a0/0x3a0
      [  225.956644]  ? bvec_alloc+0xc0/0xc0
      [  225.957135]  bio_clone_fast+0x19/0x80
      [  225.957651]  raid5_make_request+0x1370/0x1b70
      [  225.958286]  ? sched_clock_cpu+0x15/0x120
      [  225.958797]  ? __lock_acquire+0x8b2/0x3510
      [  225.959339]  ? raid5_get_active_stripe+0xce0/0xce0
      [  225.959986]  ? lock_is_held_type+0xd8/0x130
      [  225.960528]  ? rcu_read_lock_sched_held+0xa1/0xd0
      [  225.961135]  ? rcu_read_lock_bh_held+0xb0/0xb0
      [  225.961703]  ? sched_clock_cpu+0x15/0x120
      [  225.962232]  ? lock_release+0x27a/0x6c0
      [  225.962746]  ? do_wait_intr_irq+0x130/0x130
      [  225.963302]  ? lock_downgrade+0x3a0/0x3a0
      [  225.963815]  ? lock_release+0x6c0/0x6c0
      [  225.964348]  md_handle_request+0x342/0x530
      [  225.964888]  ? set_in_sync+0x170/0x170
      [  225.965397]  ? blk_queue_split+0x133/0x150
      [  225.965988]  ? __blk_queue_split+0x8b0/0x8b0
      [  225.966524]  ? submit_bio_checks+0x3b2/0x9d0
      [  225.967069]  md_submit_bio+0x127/0x1c0
      [...]
      
      Fix this by moving alloc/free of acct bioset to pers->run and pers->free.
      
      While we are at it, properly handle md_integrity_register() errors in
      raid0_run().
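
      A minimal sketch of the shape of this fix, assuming helpers along the
      lines of acct_bioset_init()/acct_bioset_exit() (names here are
      illustrative and may not match the patch exactly):

        /* md.c: helpers that own the acct bioset lifetime */
        int acct_bioset_init(struct mddev *mddev)
        {
                return bioset_init(&mddev->io_acct_set, BIO_POOL_SIZE,
                                   offsetof(struct md_io_acct, bio_clone), 0);
        }

        void acct_bioset_exit(struct mddev *mddev)
        {
                bioset_exit(&mddev->io_acct_set);
        }

        /* raid0.c: pers->run sets up the bioset it needs and unwinds on error */
        static int raid0_run(struct mddev *mddev)
        {
                int ret = acct_bioset_init(mddev);

                if (ret)
                        return ret;
                /* ... existing raid0 setup ... */
                ret = md_integrity_register(mddev);
                if (ret)
                        acct_bioset_exit(mddev);  /* free the raid0 conf here too */
                return ret;
        }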
      
      Fixes: daee2024 ("md: check level before create and exit io_acct_set")
      Cc: stable@vger.kernel.org
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: raid456 add nowait support · bf2c411b
      Authored by Vishal Verma
      Return EAGAIN when the raid456 driver would block waiting for a reshape.
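
      The general nowait pattern (a sketch of the idea, not the exact hunk;
      the reshape condition below is a placeholder) is to fail fast instead of
      sleeping:

        /* in the make_request path: bail out instead of blocking */
        if ((bio->bi_opf & REQ_NOWAIT) && reshape_would_block) {
                bio_wouldblock_error(bio);  /* ends the bio with BLK_STS_AGAIN */
                return true;                /* i.e. EAGAIN back to the caller */
        }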
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Vishal Verma <vverma@digitalocean.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid5: play nice with PREEMPT_RT · 770b1d21
      Authored by Davidlohr Bueso
      raid_run_ops() relies on the implicitly disabled preemption for
      its percpu ops, although this is really about CPU locality. This
      breaks RT semantics as it can take regular (and thus sleeping)
      spinlocks, such as stripe_lock.
      
      Add a local_lock such that non-RT behaviour does not change and it
      continues to simply map to preempt_disable/enable, while RT becomes
      happy because the region will use a per-CPU spinlock and thus be
      preemptible while still guaranteeing CPU locality.
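
      A sketch of the local_lock pattern (simplified; the per-cpu struct and
      field layout here follow the description, not necessarily the patch):

        struct raid5_percpu {
                local_lock_t    lock;   /* new: marks the per-CPU region */
                /* ... existing per-cpu scratch buffers ... */
        };

        /* raid_run_ops() */
        local_lock(&conf->percpu->lock);        /* preempt_disable() on !RT */
        percpu = this_cpu_ptr(conf->percpu);
        /* ... run the stripe ops using the per-cpu scratch space ... */
        local_unlock(&conf->percpu->lock);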
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  3. 19 Oct 2021, 2 commits
  4. 28 Aug 2021, 1 commit
  5. 15 Jun 2021, 5 commits
    • md/raid5: avoid device_lock in read_one_chunk() · 97ae2725
      Authored by Gal Ofri
      There is lock contention on device_lock in read_one_chunk().
      device_lock is taken to sync conf->active_aligned_reads and
      conf->quiesce.
      read_one_chunk() takes the lock, then waits for quiesce=0 (resumed)
      before incrementing active_aligned_reads.
      raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits
      for active_aligned_reads to be zero before setting quiesce=1
      (suspended).
      
      Introduce a fast (lockless) path in read_one_chunk(): activate the
      aligned read without taking device_lock.  In case a quiesce starts while
      activating the aligned read in the fast path, deactivate it and revert
      to the old behavior (take device_lock and wait for quiesce to finish).
      
      Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to
      guarantee that read_one_chunk() does not miss an ongoing quiesce.
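
      A simplified sketch of the fast/slow path split described above:

        /* fast path: activate the aligned read without device_lock */
        did_inc = false;
        if (smp_load_acquire(&conf->quiesce) == 0) {
                atomic_inc(&conf->active_aligned_reads);
                did_inc = true;
        }
        /* re-check: a quiesce may have started while we were activating */
        if (!did_inc || smp_load_acquire(&conf->quiesce) != 0) {
                if (did_inc &&
                    atomic_dec_and_test(&conf->active_aligned_reads))
                        wake_up(&conf->wait_for_quiescent);
                /* slow path: old behavior under device_lock */
                spin_lock_irq(&conf->device_lock);
                wait_event_lock_irq(conf->wait_for_quiescent,
                                    conf->quiesce == 0, conf->device_lock);
                atomic_inc(&conf->active_aligned_reads);
                spin_unlock_irq(&conf->device_lock);
        }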
      
      My setups:
      1. 8 local nvme drives (each up to 250k iops).
      2. 8 ram disks (brd).
      
      Each setup uses raid6 (6+2) and 1024 io threads on a 96-core (48 per
      socket) system. Record both iops and the cpu spent on this contention
      with rand-read-4k. Record bw with sequential-read-128k.  Note: in most
      cases the cpu is still busy, but due to "new" bottlenecks.
      
      nvme:
                    | iops           | cpu  | bw
      -----------------------------------------------
      without patch | 1.6M           | ~50% | 5.5GB/s
      with patch    | 2M (throttled) | 0%   | 16GB/s (throttled)
      
      ram (brd):
                    | iops           | cpu  | bw
      -----------------------------------------------
      without patch | 2M             | ~80% | 24GB/s
      with patch    | 4M             | 0%   | 55GB/s
      
      CC: Song Liu <song@kernel.org>
      CC: Neil Brown <neilb@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Gal Ofri <gal.ofri@storing.io>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: Constify attribute_group structs · c32dc040
      Authored by Rikard Falkeborn
      The attribute_group structs are never modified; they're only passed to
      sysfs_create_group() and sysfs_remove_group(). Make them const to allow
      the compiler to put them in read-only memory.
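
      For illustration, the change is of this shape (the group and attrs
      names below are hypothetical, not necessarily ones touched by this
      patch):

        static const struct attribute_group md_example_group = {
                .name  = "md",
                .attrs = md_example_attrs,
        };

        /* sysfs_create_group()/sysfs_remove_group() already take a
         * const struct attribute_group *, so no caller changes are needed */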
      Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid5: avoid redundant bio clone in raid5_read_one_chunk · 1147f58e
      Authored by Guoqing Jiang
      After enabling io accounting, the chunk read bio could be cloned twice,
      which is not good. To avoid this inefficiency, clone align_bio from
      io_acct_set too; then we only need to call md_account_bio in
      make_request unconditionally.
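
      A sketch of the single-clone idea (simplified from the description;
      error handling omitted):

        /* clone the aligned read from io_acct_set so the clone itself
         * doubles as the accounting bio */
        align_bio = bio_clone_fast(raid_bio, GFP_NOIO, &mddev->io_acct_set);
        md_io_acct = container_of(align_bio, struct md_io_acct, bio_clone);
        md_io_acct->orig_bio = raid_bio;
        md_io_acct->start_time = bio_start_io_acct(raid_bio);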
      Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid5: move checking badblock before clone bio in raid5_read_one_chunk · c82aa1b7
      Authored by Guoqing Jiang
      We don't need to clone the bio if the relevant region has a badblock.
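
      Roughly, the reordering looks like this (a sketch, not the exact hunk):

        /* check for badblocks first, before bothering to clone */
        if (is_badblock(rdev, sector, bio_sectors(raid_bio),
                        &first_bad, &bad_sectors))
                return 0;       /* fall back to the normal stripe path */

        align_bio = bio_clone_fast(raid_bio, GFP_NOIO, &mddev->bio_set);
        /* ... set up and submit the aligned read ... */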
      Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: add io accounting for raid0 and raid5 · 10764815
      Authored by Guoqing Jiang
      We introduce a new bioset (io_acct_set) for raid0 and raid5 since they
      don't have their own clone infrastructure for accounting io. The bioset
      is added to mddev instead of to the raid0 and raid5 layers, because this
      way we can put the common functions in md.h and reuse them in raid0 and
      raid5.

      Also, struct md_io_acct is added accordingly; it includes the io
      start_time, the original bio and the cloned bio. Then we can call
      bio_{start,end}_io_acct to get the related io status.
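
      A sketch of the pieces described above (field and helper names follow
      the description; details may differ from the patch):

        struct md_io_acct {
                struct bio      *orig_bio;
                unsigned long   start_time;
                struct bio      bio_clone;  /* last: embedded via bioset front_pad */
        };

        static void md_end_io_acct(struct bio *bio)
        {
                struct md_io_acct *md_io_acct = bio->bi_private;
                struct bio *orig_bio = md_io_acct->orig_bio;

                bio_end_io_acct(orig_bio, md_io_acct->start_time);
                bio_put(bio);
                bio_endio(orig_bio);
        }

        /* called from the raid0/raid5 make_request paths */
        void md_account_bio(struct mddev *mddev, struct bio **bio)
        {
                struct md_io_acct *md_io_acct;
                struct bio *clone;

                if (!blk_queue_io_stat((*bio)->bi_bdev->bd_disk->queue))
                        return;

                clone = bio_clone_fast(*bio, GFP_NOIO, &mddev->io_acct_set);
                md_io_acct = container_of(clone, struct md_io_acct, bio_clone);
                md_io_acct->orig_bio = *bio;
                md_io_acct->start_time = bio_start_io_acct(*bio);

                clone->bi_end_io = md_end_io_acct;
                clone->bi_private = md_io_acct;
                *bio = clone;
        }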
      Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
      Signed-off-by: Song Liu <song@kernel.org>
  6. 26 May 2021, 1 commit
  7. 09 Apr 2021, 1 commit
  8. 04 Feb 2021, 1 commit
  9. 28 Jan 2021, 1 commit
  10. 25 Jan 2021, 1 commit
  11. 05 Dec 2020, 1 commit
  12. 09 Oct 2020, 1 commit
    • md/raid5: fix oops during stripe resizing · b44c018c
      Authored by Song Liu
      KoWei reported a crash during raid5 reshape:
      
      [ 1032.252932] Oops: 0002 [#1] SMP PTI
      [...]
      [ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10
      [...]
      [ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286
      [ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000
      [ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000
      [ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8
      [ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8
      [ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
      [ 1032.252958] FS:  0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000
      [ 1032.252959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0
      [ 1032.252962] Call Trace:
      [ 1032.252969]  ? async_memcpy+0x179/0x1000 [async_memcpy]
      [ 1032.252977]  ? raid5_release_stripe+0x8e/0x110 [raid456]
      [ 1032.252982]  handle_stripe_expansion+0x15a/0x1f0 [raid456]
      [ 1032.252988]  handle_stripe+0x592/0x1270 [raid456]
      [ 1032.252993]  handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456]
      [ 1032.252999]  raid5d+0x35c/0x550 [raid456]
      [ 1032.253002]  ? schedule+0x42/0xb0
      [ 1032.253006]  ? schedule_timeout+0x10e/0x160
      [ 1032.253011]  md_thread+0x97/0x160
      [ 1032.253015]  ? wait_woken+0x80/0x80
      [ 1032.253019]  kthread+0x104/0x140
      [ 1032.253022]  ? md_start_sync+0x60/0x60
      [ 1032.253024]  ? kthread_park+0x90/0x90
      [ 1032.253027]  ret_from_fork+0x35/0x40
      
      This is because cache_size_mutex was unlocked too early in
      resize_stripes(), which races with grow_one_stripe() so that
      grow_one_stripe() allocates a stripe with the wrong pool_size.
      
      Fix this issue by unlocking cache_size_mutex after updating pool_size.
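
      A minimal sketch of the new ordering in resize_stripes():

        mutex_lock(&conf->cache_size_mutex);
        /* ... install the newly sized stripe heads ... */
        conf->pool_size = newsize;
        /* only drop the mutex after pool_size is updated, so a racing
         * grow_one_stripe() allocates stripes of the new size */
        mutex_unlock(&conf->cache_size_mutex);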
      
      Cc: <stable@vger.kernel.org> # v4.4+
      Reported-by: KoWei Sung <winders@amazon.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  13. 25 Sep 2020, 11 commits
  14. 28 Aug 2020, 1 commit
    • md/raid5: make sure stripe_size as power of two · 6af10a33
      Authored by Yufen Yu
      Commit 3b5408b9 ("md/raid5: support config stripe_size by sysfs
      entry") made stripe_size a configurable value, but it only requires
      stripe_size to be a multiple of 4KB.

      In fact, we should make sure stripe_size is a power of two. Otherwise
      stripe_shift, which is the result of ilog2, cannot represent the real
      stripe_size, and stripe_hash() and stripe_hash_locks_hash() may get
      unexpected values.
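
      Conceptually, the store path gains a power-of-two check on top of the
      existing constraints; a sketch (the exact expression may differ):

        if (new % DEFAULT_STRIPE_SIZE != 0 || new > PAGE_SIZE ||
            !is_power_of_2(new))
                return -EINVAL;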
      
      Fixes: 3b5408b9 ("md/raid5: support config stripe_size by sysfs entry")
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  15. 24 Aug 2020, 1 commit
  16. 03 Aug 2020, 4 commits
  17. 29 Jul 2020, 1 commit
  18. 23 Jul 2020, 1 commit
  19. 22 Jul 2020, 2 commits
    • md/raid5: support config stripe_size by sysfs entry · 3b5408b9
      Authored by Yufen Yu
      Add a new 'stripe_size' sysfs entry to set and show stripe_size.
      stripe_size should not be bigger than PAGE_SIZE, and it is required to
      be a multiple of 4096. We can adjust stripe_size by writing a value into
      the sysfs entry; for example, to set stripe_size to 16KB:
      
                echo 16384 > /sys/block/md1/md/stripe_size
      
      Show current stripe_size value:
      
                cat /sys/block/md1/md/stripe_size
      
      When PAGE_SIZE is equal to 4096, 'stripe_size' can only be read.
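
      A rough sketch of the shape of the new md_sysfs_entry (simplified;
      locking and the actual stripe resize are omitted):

        static ssize_t
        raid5_show_stripe_size(struct mddev *mddev, char *page)
        {
                struct r5conf *conf = mddev->private;

                return conf ? sprintf(page, "%lu\n", conf->stripe_size) : 0;
        }

        static ssize_t
        raid5_store_stripe_size(struct mddev *mddev, const char *page, size_t len)
        {
                unsigned long new;

                if (kstrtoul(page, 10, &new))
                        return -EINVAL;
                if (new == 0 || new % 4096 != 0 || new > PAGE_SIZE)
                        return -EINVAL;
                /* ... quiesce the array, resize stripes, set conf->stripe_size ... */
                return len;
        }

        static struct md_sysfs_entry raid5_stripe_size =
        __ATTR(stripe_size, 0644, raid5_show_stripe_size, raid5_store_stripe_size);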
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md/raid5: set default stripe_size as 4096 · e2368582
      Authored by Yufen Yu
      In RAID5, if the issued bio size is bigger than stripe_size, it will be
      split in units of stripe_size and the pieces processed one by one. Even
      for sizes smaller than stripe_size, RAID5 still requests data from disk
      in units of at least stripe_size.

      Nowadays stripe_size is equal to the value of PAGE_SIZE. Since
      filesystems usually issue bios in units of 4KB, there is no problem when
      PAGE_SIZE is 4KB. But for a 64KB PAGE_SIZE, a bio from the filesystem
      requests 4KB of data while RAID5 issues IO of at least stripe_size
      (64KB) each time. That wastes disk bandwidth and xor computation.

      To avoid this waste, we want to make stripe_size configurable. This
      patch just sets the default stripe_size to 4096. Users can also set a
      value bigger than 4KB for special requirements, such as when we know
      the issued io size is more than 4KB.
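
      For reference, the default can be expressed as a constant in raid5.h;
      roughly (the macro names below are illustrative and may differ from the
      patch):

        #define DEFAULT_STRIPE_SIZE     4096

        /* while stripe_size is not yet configurable, the accessors collapse
         * to compile-time constants */
        #define RAID5_STRIPE_SIZE(conf)     DEFAULT_STRIPE_SIZE
        #define RAID5_STRIPE_SECTORS(conf)  (DEFAULT_STRIPE_SIZE >> 9)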
      
      To evaluate the new feature, we create a raid5 device '/dev/md5' with
      4 SSD disks and test it on an arm64 machine with 64KB PAGE_SIZE.

      1) We format /dev/md5 with mkfs.ext4 and mount ext4 with the default
       configuration on the /mnt directory. Then we test it with dbench using
       the command: dbench -D /mnt -t 1000 10. The results are:
      
       'stripe_size = 64KB'
      
        Operation      Count    AvgLat    MaxLat
        ----------------------------------------
        NTCreateX    9805011     0.021    64.728
        Close        7202525     0.001     0.120
        Rename        415213     0.051    44.681
        Unlink       1980066     0.079    93.147
        Deltree          240     1.793     6.516
        Mkdir            120     0.004     0.007
        Qpathinfo    8887512     0.007    37.114
        Qfileinfo    1557262     0.001     0.030
        Qfsinfo      1629582     0.012     0.152
        Sfileinfo     798756     0.040    57.641
        Find         3436004     0.019    57.782
        WriteX       4887239     0.021    57.638
        ReadX        15370483     0.005    37.818
        LockX          31934     0.003     0.022
        UnlockX        31933     0.001     0.021
        Flush         687205    13.302   530.088
      
       Throughput 307.799 MB/sec  10 clients  10 procs  max_latency=530.091 ms
       -------------------------------------------------------
      
       'stripe_size = 4KB'
      
        Operation      Count    AvgLat    MaxLat
        ----------------------------------------
        NTCreateX    11999166     0.021    36.380
        Close        8814128     0.001     0.122
        Rename        508113     0.051    29.169
        Unlink       2423242     0.070    38.141
        Deltree          300     1.885     7.155
        Mkdir            150     0.004     0.006
        Qpathinfo    10875921     0.007    35.485
        Qfileinfo    1905837     0.001     0.032
        Qfsinfo      1994304     0.012     0.125
        Sfileinfo     977450     0.029    26.489
        Find         4204952     0.019     9.361
        WriteX       5981890     0.019    27.804
        ReadX        18809742     0.004    33.491
        LockX          39074     0.003     0.025
        UnlockX        39074     0.001     0.014
        Flush         841022    10.712   458.848
      
       Throughput 376.777 MB/sec  10 clients  10 procs  max_latency=458.852 ms
       -------------------------------------------------------
      
       It shows that setting stripe_size to 4KB gives higher throughput
       (376.777 vs 307.799 MB/sec) and lower latency than setting it to 64KB.
      
       2) We evaluate IO throughput for /dev/md5 with fio using the config:
      
       [4KB randwrite]
       direct=1
       numjob=2
       iodepth=64
       ioengine=libaio
       filename=/dev/md5
       bs=4KB
       rw=randwrite
      
       [1MB write]
       direct=1
       numjob=2
       iodepth=64
       ioengine=libaio
       filename=/dev/md5
       bs=1MB
       rw=write
      
       The results are as follows:
      
                      | stripe_size(64KB) | stripe_size(4KB)
        --------------+-------------------+-----------------
        4KB randwrite |     15MB/s        |      100MB/s
        1MB write     |   1000MB/s        |      700MB/s
      
       The results show that when the issued io size is bigger than 4KB
       (the 1MB write case), the 64KB stripe_size gives much higher
       throughput. But for 4KB randwrite, that is, when the io issued to the
       device is smaller, the 4KB stripe_size performs better.
      
      Normally, the default value (4096) gives relatively good performance.
      But if each issued io is bigger than 4096, setting a value larger than
      4096 may give better performance.

      Here we just set the default stripe_size to 4096; support for setting a
      different stripe_size via a sysfs interface follows in the next patch.
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>