1. 15 Nov, 2022 16 commits
    • C
      nvme: implement the DEAC bit for the Write Zeroes command · 1b96f862
      Christoph Hellwig committed
      While the specification allows devices to either deallocate data
      or to actually write zeroes on any Write Zeroes command, many SSDs
      only do the sensible thing and deallocate data when the DEAC bit
      is specified.  Set it when it is supported and the caller doesn't
      explicitly opt out of deallocation.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
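The rule in the last sentence of this message (set DEAC when the device supports it and the caller has not opted out) can be sketched in plain C. This is a user-space model, not the driver code: the function, macro and parameter names are hypothetical, and the DEAC position (bit 25 of command dword 12) is an assumption to check against the NVMe specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position of DEAC in command dword 12 (verify against the spec). */
#define WZ_DEAC_SKETCH (1u << 25)

/*
 * Build a hypothetical CDW12 for Write Zeroes.
 * nlb_zero_based:   number of logical blocks minus one (low 16 bits)
 * dev_supports_deac: device advertises deallocating Write Zeroes
 * caller_no_unmap:  caller explicitly opted out of deallocation
 */
uint32_t write_zeroes_cdw12(uint16_t nlb_zero_based,
                            bool dev_supports_deac,
                            bool caller_no_unmap)
{
    uint32_t cdw12 = nlb_zero_based;

    if (dev_supports_deac && !caller_no_unmap)
        cdw12 |= WZ_DEAC_SKETCH;
    return cdw12;
}
```

The bit is left clear in two cases: the device does not support it, or the caller asked for the blocks to stay allocated.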
    • K
      nvme: identify-namespace without CAP_SYS_ADMIN · e4fbcf32
      Kanchan Joshi committed
      Allow all identify-namespace variants (CNS 00h, 05h and 08h) without
      requiring CAP_SYS_ADMIN. The information (retrieved using id-ns) is
      needed to form IO commands for the passthrough interface.
      Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • K
      nvme: fine-granular CAP_SYS_ADMIN for nvme io commands · 855b7717
      Kanchan Joshi committed
      Currently both io and admin commands are kept under a
      coarse-granular CAP_SYS_ADMIN check, disregarding file mode completely.
      
      $ ls -l /dev/ng*
      crw-rw-rw- 1 root root 242, 0 Sep  9 19:20 /dev/ng0n1
      crw------- 1 root root 242, 1 Sep  9 19:20 /dev/ng0n2
      
      In the example above, ng0n1 appears as if it may allow unprivileged
      read/write operation, but it does not and behaves the same as ng0n2.
      
      This patch implements a shift from CAP_SYS_ADMIN to more fine-granular
      control for io-commands.
      If CAP_SYS_ADMIN is present, nothing else is checked, as before.
      Otherwise, the following rules are in place:
      - admin commands are not allowed
      - vendor-specific and fabrics commands are not allowed
      - io-commands that can write are allowed if the matching FMODE_WRITE
        permission is present
      - io-commands that read are allowed
      
      Add a helper nvme_cmd_allowed that implements the above policy.
      Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed
      for any decision making.
      Since the file open mode is taken into account for any approval/denial,
      change various places to keep file-mode information handy.
      Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
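The policy listed in this commit message can be modelled as a small decision function. The sketch below is a user-space illustration only: the enum, the struct and the function name `cmd_allowed_sketch` are hypothetical, and the real helper classifies commands by opcode rather than by a ready-made class.

```c
#include <stdbool.h>

/* Hypothetical command classes; the driver derives these from opcodes. */
enum cmd_class {
    CMD_ADMIN,
    CMD_VENDOR_OR_FABRICS,
    CMD_IO_READ,   /* I/O command that only reads */
    CMD_IO_WRITE,  /* I/O command that can write */
};

/* Hypothetical view of the caller's privileges. */
struct caller {
    bool has_cap_sys_admin;
    bool opened_for_write;  /* stands in for FMODE_WRITE */
};

bool cmd_allowed_sketch(const struct caller *c, enum cmd_class cls)
{
    /* With CAP_SYS_ADMIN nothing else is checked. */
    if (c->has_cap_sys_admin)
        return true;

    switch (cls) {
    case CMD_ADMIN:
    case CMD_VENDOR_OR_FABRICS:
        return false;
    case CMD_IO_WRITE:
        return c->opened_for_write;
    case CMD_IO_READ:
        return true;
    }
    return false;
}
```

Under this model an unprivileged caller that opened the device read-only can read but not write, which matches the file modes shown for ng0n1 and ng0n2.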
    • C
      nvme-fc: improve memory usage in nvme_fc_rcv_ls_req() · cf3d0084
      Christophe JAILLET committed
      sizeof( struct nvmefc_ls_rcv_op ) = 64
      sizeof( union nvmefc_ls_requests ) = 1024
      sizeof( union nvmefc_ls_responses ) = 128
      
      So, in nvme_fc_rcv_ls_req(), 1216 bytes of memory are requested when
      kzalloc() is called.
      
      Because of the way memory allocations are performed, 2048 bytes are
      allocated. So about 800 bytes are wasted for each request.
      
      Switch to 3 distinct memory allocations, in order to:
         - save these 800 bytes
         - avoid zeroing this extra memory
         - make sure that memory is properly aligned in case of DMA access
          ("fc_dma_map_single(lsop->rspbuf)" just a few lines below)
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
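The arithmetic in this message can be checked with a short sketch. It assumes, as a simplification, that the allocator rounds every request up to the next power of two; real kmalloc size classes also include a few non-power-of-two sizes, so treat the helper names and the rounding rule as illustrative.

```c
#include <stddef.h>

/* Round n up to the next power of two (simplified model of size classes). */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes allocated but not requested under the simplified model. */
size_t wasted_bytes(size_t requested)
{
    return round_up_pow2(requested) - requested;
}
```

With the sizes quoted above, one combined request of 64 + 1024 + 128 = 1216 bytes lands in the 2048-byte class and wastes 832 bytes, while three separate requests of 64, 1024 and 128 bytes each fit their class exactly.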
    • C
      nvmet: only allocate a single slab for bvecs · fa8f9ac4
      Christoph Hellwig committed
      There is no need to have a separate slab cache for each namespace,
      and having separate ones creates duplicate debugfs file names as well.
      
      Fixes: d5eff33e ("nvmet: add simple file backed ns support")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    • D
      nvmet: force reconnect when number of queue changes · 2be2cd52
      Daniel Wagner committed
      In order to test queue number changes we need to make sure that the
      host reconnects, because according to the spec the number of queues is
      only allowed to change when the host disconnects from the target.
      
      The initial idea was to disable and re-enable the ports and have the
      host wait until the KATO timer expires, triggering error
      recovery. However, the host would see a DNR reply when trying to
      reconnect. Because of the DNR bit the connection is dropped
      completely: according to the spec there is no point in trying to
      reconnect with the same parameters.
      
      We can force the host to reconnect by deleting all controllers. The
      host will observe any newly posted request failing and thus start
      error recovery, but this time without the DNR bit set.
      Signed-off-by: Daniel Wagner <dwagner@suse.de>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Acked-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • U
      nvmet: use try_cmpxchg in nvmet_update_sq_head · bbf5410b
      Uros Bizjak committed
      Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
      nvmet_update_sq_head.  x86 CMPXCHG instruction returns success in ZF flag, so
      this change saves a compare after cmpxchg (and related move instruction in
      front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
      fails. There is no need to re-read the value in the loop.
      
      Note that the value from *ptr should be read using READ_ONCE to prevent
      the compiler from merging, refetching or reordering the read.
      
      No functional change intended.
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
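The loop shape this message describes has a close user-space analogue in C11, where `atomic_compare_exchange_weak` also writes the observed value back into its "expected" argument on failure. The sketch below is that analogue, not the kernel code: `advance_head` is a hypothetical name, and a relaxed atomic load stands in for READ_ONCE.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Advance a ring head by one, modulo size, and return the new value.
 * On a failed exchange, `old` is refreshed with the current head, so the
 * loop never re-reads *head explicitly.
 */
uint32_t advance_head(_Atomic uint32_t *head, uint32_t size)
{
    uint32_t old = atomic_load_explicit(head, memory_order_relaxed);
    uint32_t new_head;

    do {
        new_head = (old + 1) % size;
    } while (!atomic_compare_exchange_weak(head, &old, new_head));

    return new_head;
}
```

The single read before the loop plus the implicit refresh on failure is what removes the extra compare and reload that the `cmpxchg(...) == old` form needs.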
    • J
      md/raid1: stop mdx_raid1 thread when raid1 array run failed · b611ad14
      Jiang Li committed
      Running a raid1 array fails when the array is assembled with only an
      inactive disk, but the mdx_raid1 thread is not stopped even though the
      associated resources have been released. This causes a NULL pointer
      dereference at poweroff.
      
      This causes the following Oops:
          [  287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070
          [  287.594762] #PF: supervisor read access in kernel mode
          [  287.599912] #PF: error_code(0x0000) - not-present page
          [  287.605061] PGD 0 P4D 0
          [  287.607612] Oops: 0000 [#1] SMP NOPTI
          [  287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G     U            5.10.146 #0
          [  287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022
          [  287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod]
          [  287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ......
          [  287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202
          [  287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000
          [  287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800
          [  287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff
          [  287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800
          [  287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500
          [  287.692052] FS:  0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000
          [  287.700149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [  287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0
          [  287.713033] Call Trace:
          [  287.715498]  raid1d+0x6c/0xbbb [raid1]
          [  287.719256]  ? __schedule+0x1ff/0x760
          [  287.722930]  ? schedule+0x3b/0xb0
          [  287.726260]  ? schedule_timeout+0x1ed/0x290
          [  287.730456]  ? __switch_to+0x11f/0x400
          [  287.734219]  md_thread+0xe9/0x140 [md_mod]
          [  287.738328]  ? md_thread+0xe9/0x140 [md_mod]
          [  287.742601]  ? wait_woken+0x80/0x80
          [  287.746097]  ? md_register_thread+0xe0/0xe0 [md_mod]
          [  287.751064]  kthread+0x11a/0x140
          [  287.754300]  ? kthread_park+0x90/0x90
          [  287.757974]  ret_from_fork+0x1f/0x30
      
      When running a raid1 array fails, md_unregister_thread() must be
      called before raid1_free().
      Signed-off-by: Jiang Li <jiang.li@ugreen.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • C
      md/raid5: use bdev_write_cache instead of open coding it · ad831a16
      Christoph Hellwig committed
      Use the bdev_write_cache() helper instead of two equivalent open-coded checks.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • M
      md: fix a crash in mempool_free · 341097ee
      Mikulas Patocka committed
      There's a crash in mempool_free when running the lvm test
      shell/lvchange-rebuild-raid.sh.
      
      The reason for the crash is this:
      * super_written calls atomic_dec_and_test(&mddev->pending_writes) and
        wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
        and bio_put(bio).
      * so, the process that waited on sb_wait and that is woken up is racing
        with bio_put(bio).
      * if the process wins the race, it calls bioset_exit before bio_put(bio)
        is executed.
      * bio_put(bio) attempts to free a bio into a destroyed bio set - causing
        a crash in mempool_free.
      
      We fix this bug by moving bio_put before atomic_dec_and_test.
      
      We also move rdev_dec_pending before atomic_dec_and_test as suggested by
      Neil Brown.
      
      The function md_end_flush has a similar bug - we must call bio_put before
      we decrement the number of in-progress bios.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0002) - not-present page
       PGD 11557f0067 P4D 11557f0067 PUD 0
       Oops: 0002 [#1] PREEMPT SMP
       CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
       Workqueue: kdelayd flush_expired_bios [dm_delay]
       RIP: 0010:mempool_free+0x47/0x80
       Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
       RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
       RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
       RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
       RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
       R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
       R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
       FS:  0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
       Call Trace:
        <TASK>
        clone_endio+0xf4/0x1c0 [dm_mod]
        clone_endio+0xf4/0x1c0 [dm_mod]
        __submit_bio+0x76/0x120
        submit_bio_noacct_nocheck+0xb6/0x2a0
        flush_expired_bios+0x28/0x2f [dm_delay]
        process_one_work+0x1b4/0x300
        worker_thread+0x45/0x3e0
        ? rescuer_thread+0x380/0x380
        kthread+0xc2/0x100
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
       Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
       CR2: 0000000000000000
       ---[ end trace 0000000000000000 ]---
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Song Liu <song@kernel.org>
    • X
      md/raid0, raid10: Don't set discard sectors for request queue · 8e1a2279
      Xiao Ni committed
      Stacking drivers should use disk_stack_limits to obtain a proper
      max_discard_sectors rather than setting a value themselves.
      
      There is also a bug. If all member disks are rotational devices,
      raid0/raid10 still set max_discard_sectors. The member devices are
      not SSD/NVMe, yet raid0/raid10 export the wrong value. This triggers
      warning messages in __blkdev_issue_discard when running mkfs.xfs,
      like this:
      
      [ 4616.022599] ------------[ cut here ]------------
      [ 4616.027779] WARNING: CPU: 4 PID: 99634 at block/blk-lib.c:50 __blkdev_issue_discard+0x16a/0x1a0
      [ 4616.140663] RIP: 0010:__blkdev_issue_discard+0x16a/0x1a0
      [ 4616.146601] Code: 24 4c 89 20 31 c0 e9 fe fe ff ff c1 e8 09 8d 48 ff 4c 89 f0 4c 09 e8 48 85 c1 0f 84 55 ff ff ff b8 ea ff ff ff e9 df fe ff ff <0f> 0b 48 8d 74 24 08 e8 ea d6 00 00 48 c7 c6 20 1e 89 ab 48 c7 c7
      [ 4616.167567] RSP: 0018:ffffaab88cbffca8 EFLAGS: 00010246
      [ 4616.173406] RAX: ffff9ba1f9e44678 RBX: 0000000000000000 RCX: ffff9ba1c9792080
      [ 4616.181376] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ba1c9792080
      [ 4616.189345] RBP: 0000000000000cc0 R08: ffffaab88cbffd10 R09: 0000000000000000
      [ 4616.197317] R10: 0000000000000012 R11: 0000000000000000 R12: 0000000000000000
      [ 4616.205288] R13: 0000000000400000 R14: 0000000000000cc0 R15: ffff9ba1c9792080
      [ 4616.213259] FS:  00007f9a5534e980(0000) GS:ffff9ba1b7c80000(0000) knlGS:0000000000000000
      [ 4616.222298] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4616.228719] CR2: 000055a390a4c518 CR3: 0000000123e40006 CR4: 00000000001706e0
      [ 4616.236689] Call Trace:
      [ 4616.239428]  blkdev_issue_discard+0x52/0xb0
      [ 4616.244108]  blkdev_common_ioctl+0x43c/0xa00
      [ 4616.248883]  blkdev_ioctl+0x116/0x280
      [ 4616.252977]  __x64_sys_ioctl+0x8a/0xc0
      [ 4616.257163]  do_syscall_64+0x5c/0x90
      [ 4616.261164]  ? handle_mm_fault+0xc5/0x2a0
      [ 4616.265652]  ? do_user_addr_fault+0x1d8/0x690
      [ 4616.270527]  ? do_syscall_64+0x69/0x90
      [ 4616.274717]  ? exc_page_fault+0x62/0x150
      [ 4616.279097]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      [ 4616.284748] RIP: 0033:0x7f9a55398c6b
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • F
      md/bitmap: Fix bitmap chunk size overflow issues · 45552111
      Florian-Ewald Mueller committed
      - limit bitmap chunk size internal u64 variable to values not overflowing
        the u32 bitmap superblock structure variable stored on persistent media
      - assign bitmap chunk size internal u64 variable from unsigned values to
        avoid possible sign extension artifacts when assigning from a s32 value
      
      The bug has been there since at least kernel 4.0.
      Steps to reproduce it:
      1. mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2 -n2 /dev/rnbd1 /dev/rnbd2
      2. resize member devices rnbd1 and rnbd2 to 8 TB
      3. mdadm --grow /dev/mdx --size=max
      
      Without this patch, bitmap_chunksize overflows.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com>
      Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
      Signed-off-by: Song Liu <song@kernel.org>
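Both hazards named in this message, truncation when a 64-bit value is stored in a 32-bit field and sign extension when a signed 32-bit value is widened, can be reproduced in a few lines of standalone C. The helper names below are hypothetical and only illustrate the arithmetic, not the md-bitmap code.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a 64-bit chunk size can be stored in a 32-bit field unchanged. */
bool chunksize_fits_u32(uint64_t chunksize)
{
    return chunksize <= UINT32_MAX;
}

/* Widening from a signed 32-bit value copies the sign bit upward. */
uint64_t widen_signed(int32_t v)
{
    return (uint64_t)(int64_t)v;
}

/* Widening from an unsigned 32-bit value zero-extends. */
uint64_t widen_unsigned(uint32_t v)
{
    return (uint64_t)v;
}
```

A chunk size of 2^32 does not fit and would truncate to 0 in a 32-bit field, and widening the bit pattern 0x80000000 as a signed value yields 0xFFFFFFFF80000000 rather than 0x80000000.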
    • Y
      md: introduce md_ro_state · f97a5528
      Ye Bin committed
      Introduce md_ro_state for mddev->ro, so it is easy to understand.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • Y
      md: factor out __md_set_array_info() · 2f6d261e
      Ye Bin committed
      Factor out __md_set_array_info(). No functional change.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • U
      raid5-cache: use try_cmpxchg in r5l_wake_reclaim · 9487a0f6
      Uros Bizjak committed
      Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
      r5l_wake_reclaim. x86 CMPXCHG instruction returns success in ZF flag, so
      this change saves a compare after cmpxchg (and related move instruction in
      front of cmpxchg).
      
      Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
      fails. There is no need to re-read the value in the loop.
      
      Note that the value from *ptr should be read using READ_ONCE to prevent
      the compiler from merging, refetching or reordering the read.
      
      No functional change intended.
      
      Cc: Song Liu <song@kernel.org>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
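A compare-exchange loop that refreshes its expected value on failure also suits updates that may bail out early. The C11 sketch below raises a shared target only when the candidate is not smaller than the current value; it assumes that is roughly the shape of the loop in question, and `raise_target` is a hypothetical name rather than kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Raise *target to candidate unless a larger value is already stored.
 * Returns true if the store happened, false on early exit.
 */
bool raise_target(_Atomic unsigned long *target, unsigned long candidate)
{
    unsigned long cur = atomic_load_explicit(target, memory_order_relaxed);

    do {
        /* cur holds the latest observed value, also after a failed exchange. */
        if (candidate < cur)
            return false;
    } while (!atomic_compare_exchange_weak(target, &cur, candidate));

    return true;
}
```

Because the early-exit test runs against the refreshed value on every iteration, a concurrent writer that stores a larger target makes the loop give up instead of overwriting it.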
    • L
      drivers/md/md-bitmap: check the return value of md_bitmap_get_counter() · 3bd548e5
      Li Zhong committed
      Check the return value of md_bitmap_get_counter() in case it returns
      a NULL pointer, which would result in a NULL pointer dereference.
      
      v2: update the check to include other dereference
      Signed-off-by: Li Zhong <floridsleeves@gmail.com>
      Signed-off-by: Song Liu <song@kernel.org>
  2. 10 Nov, 2022 4 commits
  3. 02 Nov, 2022 11 commits
  4. 25 Oct, 2022 4 commits
  5. 24 Oct, 2022 1 commit
  6. 22 Oct, 2022 1 commit
  7. 21 Oct, 2022 3 commits
    • A
      efi: runtime: Don't assume virtual mappings are missing if VA == PA == 0 · 37926f96
      Ard Biesheuvel committed
      The generic EFI stub can be instructed to avoid SetVirtualAddressMap(),
      and simply run with the firmware's 1:1 mapping. In this case, it
      populates the virtual address fields of the runtime regions in the
      memory map with the physical address of each region, so that the mapping
      code has to be none the wiser. Only if SetVirtualAddressMap() fails, the
      virtual addresses are wiped and the kernel code knows that the regions
      cannot be mapped.
      
      However, wiping amounts to setting it to zero, and if a runtime region
      happens to live at physical address 0, its valid 1:1 mapped virtual
      address could be mistaken for a wiped field, resulting in loss of access
      to the EFI services at runtime.
      
      So let's only assume that VA == 0 means 'no runtime services' if the
      region in question does not live at PA 0x0.
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
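The rule stated at the end of this message reduces to a two-field test. The sketch below models it with a hypothetical `struct region` and helper name; it illustrates the condition only and is not the EFI runtime code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one runtime region in the memory map. */
struct region {
    uint64_t phys_addr;
    uint64_t virt_addr;
};

/*
 * A zero virtual address means "wiped, cannot be mapped" only when the
 * region does not itself live at physical address 0, where a 1:1 mapping
 * legitimately has VA == PA == 0.
 */
bool region_is_unmapped(const struct region *r)
{
    return r->virt_addr == 0 && r->phys_addr != 0;
}
```

A region at PA 0 with VA 0 is treated as validly 1:1 mapped, while a region at any other PA with VA 0 is treated as having no runtime mapping.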
    • A
      efi: libstub: Fix incorrect payload size in zboot header · 53a7ea28
      Ard Biesheuvel committed
      The linker script symbol definition that captures the size of the
      compressed payload inside the zboot decompressor (which is exposed via
      the image header) refers to '.' for the end of the region, which does
      not give the correct result as the expression is not placed at the end
      of the payload. So use the symbol name explicitly.
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
    • A
      efi: libstub: Give efi_main() asmlinkage qualification · db14655a
      Ard Biesheuvel committed
      To stop the bots from sending sparse warnings to me and the list about
      efi_main() not having a prototype, decorate it with asmlinkage so that
      it is clear that it is called from assembly, and therefore needs to
      remain external, even if it is never declared in a header file.
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>