1. 15 11月, 2022 11 次提交
    • J
      Merge branch 'md-next' of... · 5626196a
      Jens Axboe 提交于
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.2/block
      
      Pull MD fixes from Song.
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid1: stop mdx_raid1 thread when raid1 array run failed
        md/raid5: use bdev_write_cache instead of open coding it
        md: fix a crash in mempool_free
        md/raid0, raid10: Don't set discard sectors for request queue
        md/bitmap: Fix bitmap chunk size overflow issues
        md: introduce md_ro_state
        md: factor out __md_set_array_info()
        lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE
        raid5-cache: use try_cmpxchg in r5l_wake_reclaim
        drivers/md/md-bitmap: check the return value of md_bitmap_get_counter()
      5626196a
    • J
      md/raid1: stop mdx_raid1 thread when raid1 array run failed · b611ad14
      Jiang Li 提交于
      fail run raid1 array when we assemble array with the inactive disk only,
      but the mdx_raid1 thread were not stop, Even if the associated resources
      have been released. it will caused a NULL dereference when we do poweroff.
      
      This causes the following Oops:
          [  287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070
          [  287.594762] #PF: supervisor read access in kernel mode
          [  287.599912] #PF: error_code(0x0000) - not-present page
          [  287.605061] PGD 0 P4D 0
          [  287.607612] Oops: 0000 [#1] SMP NOPTI
          [  287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G     U            5.10.146 #0
          [  287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022
          [  287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod]
          [  287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ......
          [  287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202
          [  287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000
          [  287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800
          [  287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff
          [  287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800
          [  287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500
          [  287.692052] FS:  0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000
          [  287.700149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [  287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0
          [  287.713033] Call Trace:
          [  287.715498]  raid1d+0x6c/0xbbb [raid1]
          [  287.719256]  ? __schedule+0x1ff/0x760
          [  287.722930]  ? schedule+0x3b/0xb0
          [  287.726260]  ? schedule_timeout+0x1ed/0x290
          [  287.730456]  ? __switch_to+0x11f/0x400
          [  287.734219]  md_thread+0xe9/0x140 [md_mod]
          [  287.738328]  ? md_thread+0xe9/0x140 [md_mod]
          [  287.742601]  ? wait_woken+0x80/0x80
          [  287.746097]  ? md_register_thread+0xe0/0xe0 [md_mod]
          [  287.751064]  kthread+0x11a/0x140
          [  287.754300]  ? kthread_park+0x90/0x90
          [  287.757974]  ret_from_fork+0x1f/0x30
      
      In fact, when raid1 array run fail, we need to do
      md_unregister_thread() before raid1_free().
      Signed-off-by: NJiang Li <jiang.li@ugreen.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      b611ad14
    • C
      md/raid5: use bdev_write_cache instead of open coding it · ad831a16
      Christoph Hellwig 提交于
      Use the bdev_write_cache instead of two equivalent open coded checks.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NSong Liu <song@kernel.org>
      ad831a16
    • M
      md: fix a crash in mempool_free · 341097ee
      Mikulas Patocka 提交于
      There's a crash in mempool_free when running the lvm test
      shell/lvchange-rebuild-raid.sh.
      
      The reason for the crash is this:
      * super_written calls atomic_dec_and_test(&mddev->pending_writes) and
        wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
        and bio_put(bio).
      * so, the process that waited on sb_wait and that is woken up is racing
        with bio_put(bio).
      * if the process wins the race, it calls bioset_exit before bio_put(bio)
        is executed.
      * bio_put(bio) attempts to free a bio into a destroyed bio set - causing
        a crash in mempool_free.
      
      We fix this bug by moving bio_put before atomic_dec_and_test.
      
      We also move rdev_dec_pending before atomic_dec_and_test as suggested by
      Neil Brown.
      
      The function md_end_flush has a similar bug - we must call bio_put before
      we decrement the number of in-progress bios.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0002) - not-present page
       PGD 11557f0067 P4D 11557f0067 PUD 0
       Oops: 0002 [#1] PREEMPT SMP
       CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
       Workqueue: kdelayd flush_expired_bios [dm_delay]
       RIP: 0010:mempool_free+0x47/0x80
       Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
       RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
       RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
       RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
       RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
       R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
       R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
       FS:  0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
       Call Trace:
        <TASK>
        clone_endio+0xf4/0x1c0 [dm_mod]
        clone_endio+0xf4/0x1c0 [dm_mod]
        __submit_bio+0x76/0x120
        submit_bio_noacct_nocheck+0xb6/0x2a0
        flush_expired_bios+0x28/0x2f [dm_delay]
        process_one_work+0x1b4/0x300
        worker_thread+0x45/0x3e0
        ? rescuer_thread+0x380/0x380
        kthread+0xc2/0x100
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
       Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
       CR2: 0000000000000000
       ---[ end trace 0000000000000000 ]---
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSong Liu <song@kernel.org>
      341097ee
    • X
      md/raid0, raid10: Don't set discard sectors for request queue · 8e1a2279
      Xiao Ni 提交于
      It should use disk_stack_limits to get a proper max_discard_sectors
      rather than setting a value by stack drivers.
      
      And there is a bug. If all member disks are rotational devices,
      raid0/raid10 set max_discard_sectors. So the member devices are
      not ssd/nvme, but raid0/raid10 export the wrong value. It reports
      warning messages in function __blkdev_issue_discard when mkfs.xfs
      like this:
      
      [ 4616.022599] ------------[ cut here ]------------
      [ 4616.027779] WARNING: CPU: 4 PID: 99634 at block/blk-lib.c:50 __blkdev_issue_discard+0x16a/0x1a0
      [ 4616.140663] RIP: 0010:__blkdev_issue_discard+0x16a/0x1a0
      [ 4616.146601] Code: 24 4c 89 20 31 c0 e9 fe fe ff ff c1 e8 09 8d 48 ff 4c 89 f0 4c 09 e8 48 85 c1 0f 84 55 ff ff ff b8 ea ff ff ff e9 df fe ff ff <0f> 0b 48 8d 74 24 08 e8 ea d6 00 00 48 c7 c6 20 1e 89 ab 48 c7 c7
      [ 4616.167567] RSP: 0018:ffffaab88cbffca8 EFLAGS: 00010246
      [ 4616.173406] RAX: ffff9ba1f9e44678 RBX: 0000000000000000 RCX: ffff9ba1c9792080
      [ 4616.181376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ba1c9792080
      [ 4616.189345] RBP: 0000000000000cc0 R08: ffffaab88cbffd10 R09: 0000000000000000
      [ 4616.197317] R10: 0000000000000012 R11: 0000000000000000 R12: 0000000000000000
      [ 4616.205288] R13: 0000000000400000 R14: 0000000000000cc0 R15: ffff9ba1c9792080
      [ 4616.213259] FS:  00007f9a5534e980(0000) GS:ffff9ba1b7c80000(0000) knlGS:0000000000000000
      [ 4616.222298] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4616.228719] CR2: 000055a390a4c518 CR3: 0000000123e40006 CR4: 00000000001706e0
      [ 4616.236689] Call Trace:
      [ 4616.239428]  blkdev_issue_discard+0x52/0xb0
      [ 4616.244108]  blkdev_common_ioctl+0x43c/0xa00
      [ 4616.248883]  blkdev_ioctl+0x116/0x280
      [ 4616.252977]  __x64_sys_ioctl+0x8a/0xc0
      [ 4616.257163]  do_syscall_64+0x5c/0x90
      [ 4616.261164]  ? handle_mm_fault+0xc5/0x2a0
      [ 4616.265652]  ? do_user_addr_fault+0x1d8/0x690
      [ 4616.270527]  ? do_syscall_64+0x69/0x90
      [ 4616.274717]  ? exc_page_fault+0x62/0x150
      [ 4616.279097]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      [ 4616.284748] RIP: 0033:0x7f9a55398c6b
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Reported-by: NYi Zhang <yi.zhang@redhat.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      8e1a2279
    • F
      md/bitmap: Fix bitmap chunk size overflow issues · 45552111
      Florian-Ewald Mueller 提交于
      - limit bitmap chunk size internal u64 variable to values not overflowing
        the u32 bitmap superblock structure variable stored on persistent media
      - assign bitmap chunk size internal u64 variable from unsigned values to
        avoid possible sign extension artifacts when assigning from a s32 value
      
      The bug has been there since at least kernel 4.0.
      Steps to reproduce it:
      1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2
      -n2 /dev/rnbd1 /dev/rnbd2
      2 resize member device rnbd1 and rnbd2 to 8 TB
      3 mdadm --grow /dev/mdx --size=max
      
      The bitmap_chunksize will overflow without patch.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFlorian-Ewald Mueller <florian-ewald.mueller@ionos.com>
      Signed-off-by: NJack Wang <jinpu.wang@ionos.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      45552111
    • Y
      md: introduce md_ro_state · f97a5528
      Ye Bin 提交于
      Introduce md_ro_state for mddev->ro, so it is easy to understand.
      Signed-off-by: NYe Bin <yebin10@huawei.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      f97a5528
    • Y
      md: factor out __md_set_array_info() · 2f6d261e
      Ye Bin 提交于
      Factor out __md_set_array_info(). No functional change.
      Signed-off-by: NYe Bin <yebin10@huawei.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      2f6d261e
    • G
      lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE · 42271ca3
      Giulio Benetti 提交于
      RAID6_USE_EMPTY_ZERO_PAGE is unused and hardcoded to 0, so let's drop it.
      Signed-off-by: NGiulio Benetti <giulio.benetti@benettiengineering.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NSong Liu <song@kernel.org>
      42271ca3
    • U
      raid5-cache: use try_cmpxchg in r5l_wake_reclaim · 9487a0f6
      Uros Bizjak 提交于
      Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
      r5l_wake_reclaim. 86 CMPXCHG instruction returns success in ZF flag, so
      this change saves a compare after cmpxchg (and related move instruction in
      front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
      fails. There is no need to re-read the value in the loop.
      
      Note that the value from *ptr should be read using READ_ONCE to prevent
      the compiler from merging, refetching or reordering the read.
      
      No functional change intended.
      
      Cc: Song Liu <song@kernel.org>
      Signed-off-by: NUros Bizjak <ubizjak@gmail.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      9487a0f6
    • L
      drivers/md/md-bitmap: check the return value of md_bitmap_get_counter() · 3bd548e5
      Li Zhong 提交于
      Check the return value of md_bitmap_get_counter() in case it returns
      NULL pointer, which will result in a null pointer dereference.
      
      v2: update the check to include other dereference
      Signed-off-by: NLi Zhong <floridsleeves@gmail.com>
      Signed-off-by: NSong Liu <song@kernel.org>
      3bd548e5
  2. 11 11月, 2022 3 次提交
    • G
      sbitmap: Use single per-bitmap counting to wake up queued tags · 4f8126bb
      Gabriel Krisman Bertazi 提交于
      sbitmap suffers from code complexity, as demonstrated by recent fixes,
      and eventual lost wake ups on nested I/O completion.  The later happens,
      from what I understand, due to the non-atomic nature of the updates to
      wait_cnt, which needs to be subtracted and eventually reset when equal
      to zero.  This two step process can eventually miss an update when a
      nested completion happens to interrupt the CPU in between the wait_cnt
      updates.  This is very hard to fix, as shown by the recent changes to
      this code.
      
      The code complexity arises mostly from the corner cases to avoid missed
      wakes in this scenario.  In addition, the handling of wake_batch
      recalculation plus the synchronization with sbq_queue_wake_up is
      non-trivial.
      
      This patchset implements the idea originally proposed by Jan [1], which
      removes the need for the two-step updates of wait_cnt.  This is done by
      tracking the number of completions and wakeups in always increasing,
      per-bitmap counters.  Instead of having to reset the wait_cnt when it
      reaches zero, we simply keep counting, and attempt to wake up N threads
      in a single wait queue whenever there is enough space for a batch.
      Waking up less than batch_wake shouldn't be a problem, because we
      haven't changed the conditions for wake up, and the existing batch
      calculation guarantees at least enough remaining completions to wake up
      a batch for each queue at any time.
      
      Performance-wise, one should expect very similar performance to the
      original algorithm for the case where there is no queueing.  In both the
      old algorithm and this implementation, the first thing is to check
      ws_active, which bails out if there is no queueing to be managed. In the
      new code, we took care to avoid accounting completions and wakeups when
      there is no queueing, to not pay the cost of atomic operations
      unnecessarily, since it doesn't skew the numbers.
      
      For more interesting cases, where there is queueing, we need to take
      into account the cross-communication of the atomic operations.  I've
      been benchmarking by running parallel fio jobs against a single hctx
      nullb in different hardware queue depth scenarios, and verifying both
      IOPS and queueing.
      
      Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
      jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
      varying only the hardware queue length per test.
      
      queue size 2                 4                 8                 16                 32                 64
      6.1-rc2    1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)   8391.7K (367.1K)   8606.1K (351.2K)
      patched    1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)   8324.2K (230.6K)   8401.8K (284.7K)
      
      The following is a similar experiment, ran against a nullb with a single
      bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
      parallel fio jobs operating on the same device
      
      queue size 2 	             4                 8              	16             	    32		       64
      6.1-rc2	   1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K) 	6178.2K (124.6K)    12227.9K (37.7K)   13286.6K (92.9K)
      patched	   1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K) 	6151.4K  (20.0K)    11893.6K (17.5K)   12385.6K (18.4K)
      
      It has also survived blktests and a 12h-stress run against nullb. I also
      ran the code against nvme and a scsi SSD, and I didn't observe
      performance regression in those. If there are other tests you think I
      should run, please let me know and I will follow up with results.
      
      [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/
      
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Liu Song <liusong@linux.alibaba.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
      4f8126bb
    • C
      blk-mq: simplify blk_mq_realloc_tag_set_tags · ee9d5521
      Christoph Hellwig 提交于
      Use set->nr_hw_queues for the current number of tags, and remove the
      duplicate set->nr_hw_queues update in the caller.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
      Link: https://lore.kernel.org/r/20221109100811.2413423-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
      ee9d5521
    • C
      blk-mq: remove blk_mq_alloc_tag_set_tags · 5ee20298
      Christoph Hellwig 提交于
      There is no point in trying to share any code with the realloc case when
      all that is needed by the initial tagset allocation is a simple
      kcalloc_node.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
      Link: https://lore.kernel.org/r/20221109100811.2413423-1-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
      5ee20298
  3. 10 11月, 2022 14 次提交
  4. 07 11月, 2022 1 次提交
  5. 02 11月, 2022 11 次提交