1. 16 6月, 2020 32 次提交
    • E
      xfs: clear PF_MEMALLOC before exiting xfsaild thread · 0ff2b809
      Eric Biggers 提交于
      task #28557760
      
      commit 10a98cb16d80be3595fdb165fad898bb28b8b6d2 upstream.
      
      Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
      during do_exit().  That can confuse things.  In particular, if BSD
      process accounting is enabled, then do_exit() writes data to an
      accounting file.  If that file has FS_SYNC_FL set, then this write
      occurs synchronously and can misbehave if PF_MEMALLOC is set.
      
      For example, if the accounting file is located on an XFS filesystem,
      then a WARN_ON_ONCE() in iomap_do_writepage() is triggered and the data
      doesn't get written when it should.  Or if the accounting file is
      located on an ext4 filesystem without a journal, then a WARN_ON_ONCE()
      in ext4_write_inode() is triggered and the inode doesn't get written.
      
      Fix this in xfsaild() by using the helper functions to save and restore
      PF_MEMALLOC.
      
      This can be reproduced as follows in the kvm-xfstests test appliance
      modified to add the 'acct' Debian package, and with kvm-xfstests's
      recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:
      
              mkfs.xfs -f /dev/vdb
              mount /vdb
              touch /vdb/file
              chattr +S /vdb/file
              accton /vdb/file
              mkfs.xfs -f /dev/vdc
              mount /vdc
              umount /vdc
      
      It causes:
      	WARNING: CPU: 1 PID: 336 at fs/iomap/buffered-io.c:1534
      	CPU: 1 PID: 336 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #3
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
      	RIP: 0010:iomap_do_writepage+0x16b/0x1f0 fs/iomap/buffered-io.c:1534
      	[...]
      	Call Trace:
      	 write_cache_pages+0x189/0x4d0 mm/page-writeback.c:2238
      	 iomap_writepages+0x1c/0x33 fs/iomap/buffered-io.c:1642
      	 xfs_vm_writepages+0x65/0x90 fs/xfs/xfs_aops.c:578
      	 do_writepages+0x41/0xe0 mm/page-writeback.c:2344
      	 __filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:421
      	 file_write_and_wait_range+0x71/0xc0 mm/filemap.c:760
      	 xfs_file_fsync+0x7a/0x2b0 fs/xfs/xfs_file.c:114
      	 generic_write_sync include/linux/fs.h:2867 [inline]
      	 xfs_file_buffered_aio_write+0x379/0x3b0 fs/xfs/xfs_file.c:691
      	 call_write_iter include/linux/fs.h:1901 [inline]
      	 new_sync_write+0x130/0x1d0 fs/read_write.c:483
      	 __kernel_write+0x54/0xe0 fs/read_write.c:515
      	 do_acct_process+0x122/0x170 kernel/acct.c:522
      	 slow_acct_process kernel/acct.c:581 [inline]
      	 acct_process+0x1d4/0x27c kernel/acct.c:607
      	 do_exit+0x83d/0xbc0 kernel/exit.c:791
      	 kthread+0xf1/0x140 kernel/kthread.c:257
      	 ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352
      
      This bug was originally reported by syzbot at
      https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.
      
      Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      0ff2b809
    • B
      xfs: acquire superblock freeze protection on eofblocks scans · 29c0b559
      Brian Foster 提交于
      task #28557760
      
      commit 4b674b9ac852937af1f8c62f730c325fb6eadcdb upstream.
      
      The filesystem freeze sequence in XFS waits on any background
      eofblocks or cowblocks scans to complete before the filesystem is
      quiesced. At this point, the freezer has already stopped the
      transaction subsystem, however, which means a truncate or cowblock
      cancellation in progress is likely blocked in transaction
      allocation. This results in a deadlock between freeze and the
      associated scanner.
      
      Fix this problem by holding superblock write protection across calls
      into the block reapers. Since protection for background scans is
      acquired from the workqueue task context, trylock to avoid a similar
      deadlock between freeze and blocking on the write lock.
      
      Fixes: d6b636eb ("xfs: halt auto-reclamation activities while rebuilding rmap")
      Reported-by: NPaul Furtado <paulfurtado91@gmail.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChandan Rajendra <chandanrlinux@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAllison Collins <allison.henderson@oracle.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      29c0b559
    • K
      xfs: Fix deadlock between AGI and AGF with RENAME_WHITEOUT · d637cc8f
      kaixuxia 提交于
      task #28557760
      
      commit bc56ad8c74b8588685c2875de0df8ab6974828ef upstream.
      
      When performing rename operation with RENAME_WHITEOUT flag, we will
      hold AGF lock to allocate or free extents in manipulating the dirents
      firstly, and then doing the xfs_iunlink_remove() call last to hold
      AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
      
      The big problem here is that we have an ordering constraint on AGF
      and AGI locking - inode allocation locks the AGI, then can allocate
      a new extent for new inodes, locking the AGF after the AGI. Hence
      the ordering that is imposed by other parts of the code is AGI before
      AGF. So we get an ABBA deadlock between the AGI and AGF here.
      
      Process A:
      Call trace:
       ? __schedule+0x2bd/0x620
       schedule+0x33/0x90
       schedule_timeout+0x17d/0x290
       __down_common+0xef/0x125
       ? xfs_buf_find+0x215/0x6c0 [xfs]
       down+0x3b/0x50
       xfs_buf_lock+0x34/0xf0 [xfs]
       xfs_buf_find+0x215/0x6c0 [xfs]
       xfs_buf_get_map+0x37/0x230 [xfs]
       xfs_buf_read_map+0x29/0x190 [xfs]
       xfs_trans_read_buf_map+0x13d/0x520 [xfs]
       xfs_read_agf+0xa6/0x180 [xfs]
       ? schedule_timeout+0x17d/0x290
       xfs_alloc_read_agf+0x52/0x1f0 [xfs]
       xfs_alloc_fix_freelist+0x432/0x590 [xfs]
       ? down+0x3b/0x50
       ? xfs_buf_lock+0x34/0xf0 [xfs]
       ? xfs_buf_find+0x215/0x6c0 [xfs]
       xfs_alloc_vextent+0x301/0x6c0 [xfs]
       xfs_ialloc_ag_alloc+0x182/0x700 [xfs]
       ? _xfs_trans_bjoin+0x72/0xf0 [xfs]
       xfs_dialloc+0x116/0x290 [xfs]
       xfs_ialloc+0x6d/0x5e0 [xfs]
       ? xfs_log_reserve+0x165/0x280 [xfs]
       xfs_dir_ialloc+0x8c/0x240 [xfs]
       xfs_create+0x35a/0x610 [xfs]
       xfs_generic_create+0x1f1/0x2f0 [xfs]
       ...
      
      Process B:
      Call trace:
       ? __schedule+0x2bd/0x620
       ? xfs_bmapi_allocate+0x245/0x380 [xfs]
       schedule+0x33/0x90
       schedule_timeout+0x17d/0x290
       ? xfs_buf_find+0x1fd/0x6c0 [xfs]
       __down_common+0xef/0x125
       ? xfs_buf_get_map+0x37/0x230 [xfs]
       ? xfs_buf_find+0x215/0x6c0 [xfs]
       down+0x3b/0x50
       xfs_buf_lock+0x34/0xf0 [xfs]
       xfs_buf_find+0x215/0x6c0 [xfs]
       xfs_buf_get_map+0x37/0x230 [xfs]
       xfs_buf_read_map+0x29/0x190 [xfs]
       xfs_trans_read_buf_map+0x13d/0x520 [xfs]
       xfs_read_agi+0xa8/0x160 [xfs]
       xfs_iunlink_remove+0x6f/0x2a0 [xfs]
       ? current_time+0x46/0x80
       ? xfs_trans_ichgtime+0x39/0xb0 [xfs]
       xfs_rename+0x57a/0xae0 [xfs]
       xfs_vn_rename+0xe4/0x150 [xfs]
       ...
      
      In this patch we move the xfs_iunlink_remove() call to
      before acquiring the AGF lock to preserve correct AGI/AGF locking
      order.
      
      [Minor massage required due to upstream change making xfs_bumplink() a
      void function where as in the 4.19.y tree the return value is checked,
      even though it is always zero. Only change was to the last code block
      removed by the patch. Functionally equivalent to upstream.]
      Signed-off-by: Nkaixuxia <kaixuxia@tencent.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NSuraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      d637cc8f
    • D
      xfs: periodically yield scrub threads to the scheduler · 8718f7b4
      Darrick J. Wong 提交于
      task #28557760
      
      [ Upstream commit 5d1116d4c6af3e580f1ed0382ca5a94bd65a34cf ]
      
      Christoph Hellwig complained about the following soft lockup warning
      when running scrub after generic/175 when preemption is disabled and
      slub debugging is enabled:
      
      watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [xfs_scrub:161]
      Modules linked in:
      irq event stamp: 41692326
      hardirqs last  enabled at (41692325): [<ffffffff8232c3b7>] _raw_0
      hardirqs last disabled at (41692326): [<ffffffff81001c5a>] trace0
      softirqs last  enabled at (41684994): [<ffffffff8260031f>] __do_e
      softirqs last disabled at (41684987): [<ffffffff81127d8c>] irq_e0
      CPU: 3 PID: 16189 Comm: xfs_scrub Not tainted 5.4.0-rc3+ #30
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.124
      RIP: 0010:_raw_spin_unlock_irqrestore+0x39/0x40
      Code: 89 f3 be 01 00 00 00 e8 d5 3a e5 fe 48 89 ef e8 ed 87 e5 f2
      RSP: 0018:ffffc9000233f970 EFLAGS: 00000286 ORIG_RAX: ffffffffff3
      RAX: ffff88813b398040 RBX: 0000000000000286 RCX: 0000000000000006
      RDX: 0000000000000006 RSI: ffff88813b3988c0 RDI: ffff88813b398040
      RBP: ffff888137958640 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00042b0c00
      R13: 0000000000000001 R14: ffff88810ac32308 R15: ffff8881376fc040
      FS:  00007f6113dea700(0000) GS:ffff88813bb80000(0000) knlGS:00000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6113de8ff8 CR3: 000000012f290000 CR4: 00000000000006e0
      Call Trace:
       free_debug_processing+0x1dd/0x240
       __slab_free+0x231/0x410
       kmem_cache_free+0x30e/0x360
       xchk_ag_btcur_free+0x76/0xb0
       xchk_ag_free+0x10/0x80
       xchk_bmap_iextent_xref.isra.14+0xd9/0x120
       xchk_bmap_iextent+0x187/0x210
       xchk_bmap+0x2e0/0x3b0
       xfs_scrub_metadata+0x2e7/0x500
       xfs_ioc_scrub_metadata+0x4a/0xa0
       xfs_file_ioctl+0x58a/0xcd0
       do_vfs_ioctl+0xa0/0x6f0
       ksys_ioctl+0x5b/0x90
       __x64_sys_ioctl+0x11/0x20
       do_syscall_64+0x4b/0x1a0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      If preemption is disabled, all metadata buffers needed to perform the
      scrub are already in memory, and there are a lot of records to check,
      it's possible that the scrub thread will run for an extended period of
      time without sleeping for IO or any other reason.  Then the watchdog
      timer or the RCU stall timeout can trigger, producing the backtrace
      above.
      
      To fix this problem, call cond_resched() from the scrub thread so that
      we back out to the scheduler whenever necessary.
      Reported-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8718f7b4
    • O
      xfs: don't check for AG deadlock for realtime files in bunmapi · 1bb971df
      Omar Sandoval 提交于
      task #28557760
      
      commit 69ffe5960df16938bccfe1b65382af0b3de51265 upstream.
      
      Commit 5b094d6d ("xfs: fix multi-AG deadlock in xfs_bunmapi") added
      a check in __xfs_bunmapi() to stop early if we would touch multiple AGs
      in the wrong order. However, this check isn't applicable for realtime
      files. In most cases, it just makes us do unnecessary commits. However,
      without the fix from the previous commit ("xfs: fix realtime file data
      space leak"), if the last and second-to-last extents also happen to have
      different "AG numbers", then the break actually causes __xfs_bunmapi()
      to return without making any progress, which sends
      xfs_itruncate_extents_flags() into an infinite loop.
      
      Fixes: 5b094d6d ("xfs: fix multi-AG deadlock in xfs_bunmapi")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      1bb971df
    • B
      xfs: fix mount failure crash on invalid iclog memory access · e1a4d741
      Brian Foster 提交于
      task #28557760
      
      [ Upstream commit 798a9cada4694ca8d970259f216cec47e675bfd5 ]
      
      syzbot (via KASAN) reports a use-after-free in the error path of
      xlog_alloc_log(). Specifically, the iclog freeing loop doesn't
      handle the case of a fully initialized ->l_iclog linked list.
      Instead, it assumes that the list is partially constructed and NULL
      terminated.
      
      This bug manifested because there was no possible error scenario
      after iclog list setup when the original code was added.  Subsequent
      code and associated error conditions were added some time later,
      while the original error handling code was never updated. Fix up the
      error loop to terminate either on a NULL iclog or reaching the end
      of the list.
      
      Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e1a4d741
    • A
      ovl: fix value of i_ino for lower hardlink corner case · 9c6a4685
      Amir Goldstein 提交于
      task #28557782
      
      commit 300b124fcf6ad2cd99a7b721e0f096785e0a3134 upstream.
      
      Commit 6dde1e42f497 ("ovl: make i_ino consistent with st_ino in more
      cases"), relaxed the condition nfs_export=on in order to set the value of
      i_ino to xino map of real ino.
      
      Specifically, it also relaxed the pre-condition that index=on for
      consistent i_ino. This opened the corner case of lower hardlink in
      ovl_get_inode(), which calls ovl_fill_inode() with ino=0 and then
      ovl_init_inode() is called to set i_ino to lower real ino without the xino
      mapping.
      
      Pass the correct values of ino;fsid in this case to ovl_fill_inode(), so it
      can initialize i_ino correctly.
      
      Fixes: 6dde1e42f497 ("ovl: make i_ino consistent with st_ino in more ...")
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      9c6a4685
    • M
      ovl: fix lseek overflow on 32bit · 6d77067b
      Miklos Szeredi 提交于
      task #28557782
      
      [ Upstream commit a4ac9d45c0cd14a2adc872186431c79804b77dbf ]
      
      ovl_lseek() is using ssize_t to return the value from vfs_llseek().  On a
      32-bit kernel ssize_t is a 32-bit signed int, which overflows above 2 GB.
      
      Assign the return value of vfs_llseek() to loff_t to fix this.
      Reported-by: NBoris Gjenero <boris.gjenero@gmail.com>
      Fixes: 9e46b840c705 ("ovl: support stacked SEEK_HOLE/SEEK_DATA")
      Cc: <stable@vger.kernel.org> # v4.19
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6d77067b
    • Q
      jbd2: fix data races at struct journal_head · 8d6725cc
      Qian Cai 提交于
      task #28557737
      
      [ Upstream commit 6c5d911249290f41f7b50b43344a7520605b1acb ]
      
      journal_head::b_transaction and journal_head::b_next_transaction could
      be accessed concurrently as noticed by KCSAN,
      
       LTP: starting fsync04
       /dev/zero: Can't open blockdev
       EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
       EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
       ==================================================================
       BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]
      
       write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
        __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
        __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
        jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
        (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
        kjournald2+0x13b/0x450 [jbd2]
        kthread+0x1cd/0x1f0
        ret_from_fork+0x27/0x50
      
       read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
        jbd2_write_access_granted+0x1b2/0x250 [jbd2]
        jbd2_write_access_granted at fs/jbd2/transaction.c:1155
        jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
        __ext4_journal_get_write_access+0x50/0x90 [ext4]
        ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
        ext4_mb_new_blocks+0x54f/0xca0 [ext4]
        ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
        ext4_map_blocks+0x3b4/0x950 [ext4]
        _ext4_get_block+0xfc/0x270 [ext4]
        ext4_get_block+0x3b/0x50 [ext4]
        __block_write_begin_int+0x22e/0xae0
        __block_write_begin+0x39/0x50
        ext4_write_begin+0x388/0xb50 [ext4]
        generic_perform_write+0x15d/0x290
        ext4_buffered_write_iter+0x11f/0x210 [ext4]
        ext4_file_write_iter+0xce/0x9e0 [ext4]
        new_sync_write+0x29c/0x3b0
        __vfs_write+0x92/0xa0
        vfs_write+0x103/0x260
        ksys_write+0x9d/0x130
        __x64_sys_write+0x4c/0x60
        do_syscall_64+0x91/0xb05
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       5 locks held by fsync04/25724:
        #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
        #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
        #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
        #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
        #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
       irq event stamp: 1407125
       hardirqs last  enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
       hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
       softirqs last  enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
       softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      The plain reads are outside of jh->b_state_lock critical section which result
      in data races. Fix them by adding pairs of READ|WRITE_ONCE().
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pwSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8d6725cc
    • W
      jbd2: fix ocfs2 corrupt when clearing block group bits · cfc67fd4
      wangyan 提交于
      task #28557737
      
      commit 8eedabfd66b68a4623beec0789eac54b8c9d0fb6 upstream.
      
      I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
      The running environment:
      	kernel version: 4.19
      	A cluster with two nodes, 5 luns mounted on two nodes, and do some
      	file operations like dd/fallocate/truncate/rm on every lun with storage
      	network disconnection.
      
      The fallocate operation on dm-23-45 caused an null pointer dereference.
      
      The information of NULL pointer dereference as follows:
      	[577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45.
      	[577992.878290] Aborting journal on device dm-23-45.
      	...
      	[577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46.
      	[577992.890908] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30
      	[577992.890918] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30
      	[577992.890922] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30
      	[577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30
      	[577992.890928] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30
      	[577992.890933] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890939] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
      	[577992.890950] Mem abort info:
      	[577992.890951]   ESR = 0x96000004
      	[577992.890952]   Exception class = DABT (current EL), IL = 32 bits
      	[577992.890952]   SET = 0, FnV = 0
      	[577992.890953]   EA = 0, S1PTW = 0
      	[577992.890954] Data abort info:
      	[577992.890955]   ISV = 0, ISS = 0x00000004
      	[577992.890956]   CM = 0, WnR = 0
      	[577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9
      	[577992.890960] [0000000000000020] pgd=0000000000000000
      	[577992.890964] Internal error: Oops: 96000004 [#1] SMP
      	[577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd)
      	[577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G        W  OE     4.19.36 #1
      	[577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
      	[577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO)
      	[577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
      	[577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2]
      	[577992.891084] sp : ffff0000c8e2b810
      	[577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000
      	[577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70
      	[577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2
      	[577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30
      	[577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000
      	[577992.891097] x19: ffff000001681638 x18: ffffffffffffffff
      	[577992.891098] x17: 0000000000000000 x16: ffff000080a03df0
      	[577992.891100] x15: ffff0000811d9708 x14: 203d207375746174
      	[577992.891101] x13: 73203a524f525245 x12: 20373439343a6565
      	[577992.891103] x11: 0000000000000038 x10: 0101010101010101
      	[577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f
      	[577992.891109] x7 : 0000000000000000 x6 : 0000000000000080
      	[577992.891110] x5 : 0000000000000000 x4 : 0000000000000002
      	[577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00
      	[577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000
      	[577992.891116] Call trace:
      	[577992.891139]  _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
      	[577992.891162]  _ocfs2_free_clusters+0x100/0x290 [ocfs2]
      	[577992.891185]  ocfs2_free_clusters+0x50/0x68 [ocfs2]
      	[577992.891206]  ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2]
      	[577992.891227]  ocfs2_add_inode_data+0x94/0xc8 [ocfs2]
      	[577992.891248]  ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2]
      	[577992.891269]  ocfs2_allocate_extents+0x14c/0x338 [ocfs2]
      	[577992.891290]  __ocfs2_change_file_space+0x3f8/0x610 [ocfs2]
      	[577992.891309]  ocfs2_fallocate+0xe4/0x128 [ocfs2]
      	[577992.891316]  vfs_fallocate+0x11c/0x250
      	[577992.891317]  ksys_fallocate+0x54/0x88
      	[577992.891319]  __arm64_sys_fallocate+0x28/0x38
      	[577992.891323]  el0_svc_common+0x78/0x130
      	[577992.891325]  el0_svc_handler+0x38/0x78
      	[577992.891327]  el0_svc+0x8/0xc
      
      My analysis process as follows:
      ocfs2_fallocate
        __ocfs2_change_file_space
          ocfs2_allocate_extents
            ocfs2_extend_allocation
              ocfs2_add_inode_data
                ocfs2_add_clusters_in_btree
                  ocfs2_insert_extent
                    ocfs2_do_insert_extent
                      ocfs2_rotate_tree_right
                        ocfs2_extend_rotate_transaction
                          ocfs2_extend_trans
                            jbd2_journal_restart
                              jbd2__journal_restart
                                /* handle->h_transaction is NULL,
                                 * is_handle_aborted(handle) is true
                                 */
                                handle->h_transaction = NULL;
                                start_this_handle
                                  return -EROFS;
                  ocfs2_free_clusters
                    _ocfs2_free_clusters
                      _ocfs2_free_suballoc_bits
                        ocfs2_block_group_clear_bits
                          ocfs2_journal_access_gd
                            __ocfs2_journal_access
                              jbd2_journal_get_undo_access
                                /* I think jbd2_write_access_granted() will
                                 * return true, because do_get_write_access()
                                 * will return -EROFS.
                                 */
                                if (jbd2_write_access_granted(...)) return 0;
                                do_get_write_access
                                  /* handle->h_transaction is NULL, it will
                                   * return -EROFS here, so do_get_write_access()
                                   * was not called.
                                   */
                                  if (is_handle_aborted(handle)) return -EROFS;
                          /* bh2jh(group_bh) is NULL, caused NULL
                             pointer dereference */
                          undo_bg = (struct ocfs2_group_desc *)
                                      bh2jh(group_bh)->b_committed_data;
      
      If handle->h_transaction == NULL, then jbd2_write_access_granted()
      does not really guarantee that journal_head will stay around,
      not even speaking of its b_committed_data. The bh2jh(group_bh)
      can be removed after ocfs2_journal_access_gd() and before call
      "bh2jh(group_bh)->b_committed_data". So, we should move
      is_handle_aborted() check from do_get_write_access() into
      jbd2_journal_get_undo_access() and jbd2_journal_get_write_access()
      before the call to jbd2_write_access_granted().
      
      Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.comSigned-off-by: NYan Wang <wangyan122@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJun Piao <piaojun@huawei.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      cfc67fd4
    • Z
      jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer · 9d80cc40
      zhangyi (F) 提交于
      task #28557737
      
      [ Upstream commit c96dceeabf765d0b1b1f29c3bf50a5c01315b820 ]
      
      Commit 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from
      an older transaction") set the BH_Freed flag when forgetting a metadata
      buffer which belongs to the committing transaction, it indicate the
      committing process clear dirty bits when it is done with the buffer. But
      it also clear the BH_Mapped flag at the same time, which may trigger
      below NULL pointer oops when block_size < PAGE_SIZE.
      
      rmdir 1             kjournald2                 mkdir 2
                          jbd2_journal_commit_transaction
      		    commit transaction N
      jbd2_journal_forget
      set_buffer_freed(bh1)
                          jbd2_journal_commit_transaction
                           commit transaction N+1
                           ...
                           clear_buffer_mapped(bh1)
                                                     ext4_getblk(bh2 ummapped)
                                                     ...
                                                     grow_dev_page
                                                      init_page_buffers
                                                       bh1->b_private=NULL
                                                       bh2->b_private=NULL
                           jbd2_journal_put_journal_head(jh1)
                            __journal_remove_journal_head(hb1)
      		       jh1 is NULL and trigger oops
      
      *) Dir entry block bh1 and bh2 belongs to one page, and the bh2 has
         already been unmapped.
      
      For the metadata buffer we forgetting, we should always keep the mapped
      flag and clear the dirty flags is enough, so this patch pick out the
      these buffers and keep their BH_Mapped flag.
      
      Link: https://lore.kernel.org/r/20200213063821.30455-3-yi.zhang@huawei.com
      Fixes: 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from an older transaction")
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      9d80cc40
    • Z
      jbd2: move the clearing of b_modified flag to the journal_unmap_buffer() · cdf29ec3
      zhangyi (F) 提交于
      task #28557737
      
      [ Upstream commit 6a66a7ded12baa6ebbb2e3e82f8cb91382814839 ]
      
      There is no need to delay the clearing of b_modified flag to the
      transaction committing time when unmapping the journalled buffer, so
      just move it to the journal_unmap_buffer().
      
      Link: https://lore.kernel.org/r/20200213063821.30455-2-yi.zhang@huawei.comReviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      cdf29ec3
    • V
      jbd2_seq_info_next should increase position index · 49f0705e
      Vasily Averin 提交于
      task #28557737
      
      commit 1a8e9cf40c9a6a2e40b1e924b13ed303aeea4418 upstream.
      
      if seq_file .next fuction does not change position index,
      read after some lseek can generate unexpected output.
      
      Script below generates endless output
       $ q=;while read -r r;do echo "$((++q)) $r";done </proc/fs/jbd2/DEV/info
      
      https://bugzilla.kernel.org/show_bug.cgi?id=206283
      
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
      Cc: stable@kernel.org
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/d13805e5-695e-8ac3-b678-26ca2313629f@virtuozzo.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      49f0705e
    • E
      ext4: fix race between writepages and enabling EXT4_EXTENTS_FL · c6efddb2
      Eric Biggers 提交于
      task #28557685
      
      commit cb85f4d23f794e24127f3e562cb3b54b0803f456 upstream.
      
      If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
      on it, the following warning in ext4_add_complete_io() can be hit:
      
      WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120
      
      Here's a minimal reproducer (not 100% reliable) (root isn't required):
      
              while true; do
                      sync
              done &
              while true; do
                      rm -f file
                      touch file
                      chattr -e file
                      echo X >> file
                      chattr +e file
              done
      
      The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
      (which only returns true on extent-based files) is checked once to set
      the number of reserved journal credits, and also again later to select
      the flags for ext4_map_blocks() and copy the reserved journal handle to
      ext4_io_end::handle.  But if EXT4_EXTENTS_FL is being concurrently set,
      the first check can see dioread_nolock disabled while the later one can
      see it enabled, causing the reserved handle to unexpectedly be NULL.
      
      Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
      related to doing so as well, fix this by synchronizing changing
      EXT4_EXTENTS_FL with ext4_writepages() via the existing
      s_writepages_rwsem (previously called s_journal_flag_rwsem).
      
      This was originally reported by syzbot without a reproducer at
      https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
      but now that dioread_nolock is the default I also started seeing this
      when running syzkaller locally.
      
      Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
      Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
      Fixes: 6b523df4 ("ext4: use transaction reservation for extent conversion in ext4_end_io")
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      c6efddb2
    • E
      ext4: rename s_journal_flag_rwsem to s_writepages_rwsem · 1b7c46e9
      Eric Biggers 提交于
      task #28557685
      
      commit bbd55937de8f2754adc5792b0f8e5ff7d9c0420e upstream.
      
      In preparation for making s_journal_flag_rwsem synchronize
      ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
      flags (rather than just JOURNAL_DATA as it does currently), rename it to
      s_writepages_rwsem.
      
      Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.orgSigned-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      1b7c46e9
    • D
      ext4: potential crash on allocation error in ext4_alloc_flex_bg_array() · 98770662
      Dan Carpenter 提交于
      task #28557685
      
      commit 37b0b6b8b99c0e1c1f11abbe7cf49b6d03795b3f upstream.
      
      If sbi->s_flex_groups_allocated is zero and the first allocation fails
      then this code will crash.  The problem is that "i--" will set "i" to
      -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
      is type promoted to unsigned and becomes UINT_MAX.  Since UINT_MAX
      is more than zero, the condition is true so we call kvfree(new_groups[-1]).
      The loop will carry on freeing invalid memory until it crashes.
      
      Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access")
      Reviewed-by: NSuraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountainSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      98770662
    • S
      ext4: fix potential race between s_flex_groups online resizing and access · e0b1f982
      Suraj Jitindar Singh 提交于
      task #28557685
      
      commit 7c990728b99ed6fbe9c75fc202fce1172d9916da upstream.
      
      During an online resize an array of s_flex_groups structures gets replaced
      so it can get enlarged. If there is a concurrent access to the array and
      this memory has been reused then this can lead to an invalid memory access.
      
      The s_flex_group array has been converted into an array of pointers rather
      than an array of structures. This is to ensure that the information
      contained in the structures cannot get out of sync during a resize due to
      an accessor updating the value in the old structure after it has been
      copied but before the array pointer is updated. Since the structures them-
      selves are no longer copied but only the pointers to them this case is
      mitigated.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.eduSigned-off-by: NSuraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e0b1f982
    • S
      ext4: fix potential race between s_group_info online resizing and access · f12237c7
      Suraj Jitindar Singh 提交于
      task #28557685
      
      commit df3da4ea5a0fc5d115c90d5aa6caa4dd433750a7 upstream.
      
      During an online resize an array of pointers to s_group_info gets replaced
      so it can get enlarged. If there is a concurrent access to the array in
      ext4_get_group_info() and this memory has been reused then this can lead to
      an invalid memory access.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.eduSigned-off-by: NSuraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NBalbir Singh <sblbir@amazon.com>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f12237c7
    • T
      ext4: fix potential race between online resizing and write operations · 615a6d13
      Theodore Ts'o 提交于
      task #28557685
      
      commit 1d0c3924a92e69bfa91163bda83c12a994b4d106 upstream.
      
      During an online resize an array of pointers to buffer heads gets
      replaced so it can get enlarged.  If there is a racing block
      allocation or deallocation which uses the old array, and the old array
      has gotten reused this can lead to a GPF or some other random kernel
      memory getting modified.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.eduReported-by: NSuraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      615a6d13
    • J
      ext4: fix checksum errors with indexed dirs · 20bd22e8
      Jan Kara 提交于
      task #28557685
      
      commit 48a34311953d921235f4d7bbd2111690d2e469cf upstream.
      
      DIR_INDEX has been introduced as a compat ext4 feature. That means that
      even kernels / tools that don't understand the feature may modify the
      filesystem. This works because for kernels not understanding indexed dir
      format, internal htree nodes appear just as empty directory entries.
      Index dir aware kernels then check the htree structure is still
      consistent before using the data. This all worked reasonably well until
      metadata checksums were introduced. The problem is that these
      effectively made DIR_INDEX only ro-compatible because internal htree
      nodes store checksums in a different place than normal directory blocks.
      Thus any modification ignorant to DIR_INDEX (or just clearing
      EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
      and trigger kernel errors. So we have to be more careful when dealing
      with indexed directories on filesystems with checksumming enabled.
      
      1) We just disallow loading any directory inodes with EXT4_INDEX_FL when
      DIR_INDEX is not enabled. This is harsh but it should be very rare (it
      means someone disabled DIR_INDEX on existing filesystem and didn't run
      e2fsck), e2fsck can fix the problem, and we don't want to answer the
      difficult question: "Should we rather corrupt the directory more or
      should we ignore that DIR_INDEX feature is not set?"
      
      2) When we find out htree structure is corrupted (but the filesystem and
      the directory should in support htrees), we continue just ignoring htree
      information for reading but we refuse to add new entries to the
      directory to avoid corrupting it more.
      
      Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz
      Fixes: dbe89444 ("ext4: Calculate and verify checksums for htree nodes")
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      20bd22e8
    • T
      ext4: fix support for inode sizes > 1024 bytes · 88f03258
      Theodore Ts'o 提交于
      task #28557685
      
      commit 4f97a68192bd33b9963b400759cef0ca5963af00 upstream.
      
      A recent commit, 9803387c55f7 ("ext4: validate the
      debug_want_extra_isize mount option at parse time"), moved mount-time
      checks around.  One of those changes moved the inode size check before
      the blocksize variable was set to the blocksize of the file system.
      After 9803387c55f7 was set to the minimum allowable blocksize, which
      in practice on most systems would be 1024 bytes.  This cuased file
      systems with inode sizes larger than 1024 bytes to be rejected with a
      message:
      
      EXT4-fs (sdXX): unsupported inode size: 4096
      
      Fixes: 9803387c55f7 ("ext4: validate the debug_want_extra_isize mount option at parse time")
      Link: https://lore.kernel.org/r/20200206225252.GA3673@mit.eduReported-by: NHerbert Poetzl <herbert@13thfloor.at>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      88f03258
    • T
      ext4: validate the debug_want_extra_isize mount option at parse time · 3d55e0b6
      Theodore Ts'o 提交于
      task #28557685
      
      commit 9803387c55f7d2ce69aa64340c5fdc6b3027dbc8 upstream.
      
      Instead of setting s_want_extra_size and then making sure that it is a
      valid value afterwards, validate the field before we set it.  This
      avoids races and other problems when remounting the file system.
      
      Link: https://lore.kernel.org/r/20191215063020.GA11512@mit.edu
      Cc: stable@kernel.org
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reported-and-tested-by: syzbot+4a39a025912b265cacef@syzkaller.appspotmail.com
      Signed-off-by: NZubin Mithra <zsm@chromium.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      3d55e0b6
    • A
      ext4: don't assume that mmp_nodename/bdevname have NUL · 879afab1
      Andreas Dilger 提交于
      task #28557685
      
      commit 14c9ca0583eee8df285d68a0e6ec71053efd2228 upstream.
      
      Don't assume that the mmp_nodename and mmp_bdevname strings are NUL
      terminated, since they are filled in by snprintf(), which is not
      guaranteed to do so.
      
      Link: https://lore.kernel.org/r/1580076215-1048-1-git-send-email-adilger@dilger.caSigned-off-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      879afab1
    • S
      ext4: add cond_resched() to __ext4_find_entry() · fe7cd0c7
      Shijie Luo 提交于
      task #28557685
      
      commit 9424ef56e13a1f14c57ea161eed3ecfdc7b2770e upstream.
      
      We tested a soft lockup problem in linux 4.19 which could also
      be found in linux 5.x.
      
      When dir inode takes up a large number of blocks, and if the
      directory is growing when we are searching, it's possible the
      restart branch could be called many times, and the do while loop
      could hold cpu a long time.
      
      Here is the call trace in linux 4.19.
      
      [  473.756186] Call trace:
      [  473.756196]  dump_backtrace+0x0/0x198
      [  473.756199]  show_stack+0x24/0x30
      [  473.756205]  dump_stack+0xa4/0xcc
      [  473.756210]  watchdog_timer_fn+0x300/0x3e8
      [  473.756215]  __hrtimer_run_queues+0x114/0x358
      [  473.756217]  hrtimer_interrupt+0x104/0x2d8
      [  473.756222]  arch_timer_handler_virt+0x38/0x58
      [  473.756226]  handle_percpu_devid_irq+0x90/0x248
      [  473.756231]  generic_handle_irq+0x34/0x50
      [  473.756234]  __handle_domain_irq+0x68/0xc0
      [  473.756236]  gic_handle_irq+0x6c/0x150
      [  473.756238]  el1_irq+0xb8/0x140
      [  473.756286]  ext4_es_lookup_extent+0xdc/0x258 [ext4]
      [  473.756310]  ext4_map_blocks+0x64/0x5c0 [ext4]
      [  473.756333]  ext4_getblk+0x6c/0x1d0 [ext4]
      [  473.756356]  ext4_bread_batch+0x7c/0x1f8 [ext4]
      [  473.756379]  ext4_find_entry+0x124/0x3f8 [ext4]
      [  473.756402]  ext4_lookup+0x8c/0x258 [ext4]
      [  473.756407]  __lookup_hash+0x8c/0xe8
      [  473.756411]  filename_create+0xa0/0x170
      [  473.756413]  do_mkdirat+0x6c/0x140
      [  473.756415]  __arm64_sys_mkdirat+0x28/0x38
      [  473.756419]  el0_svc_common+0x78/0x130
      [  473.756421]  el0_svc_handler+0x38/0x78
      [  473.756423]  el0_svc+0x8/0xc
      [  485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149]
      
      Add cond_resched() to avoid soft lockup and to provide a better
      system responding.
      
      Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.comSigned-off-by: NShijie Luo <luoshijie1@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      fe7cd0c7
    • Q
      ext4: fix a data race at inode->i_blocks · 7b067e44
      Qian Cai 提交于
      task #28557685
      
      commit 28936b62e71e41600bab319f262ea9f9b1027629 upstream.
      
      inode->i_blocks could be accessed concurrently as noticed by KCSAN,
      
       BUG: KCSAN: data-race in ext4_do_update_inode [ext4] / inode_add_bytes
      
       write to 0xffff9a00d4b982d0 of 8 bytes by task 22100 on cpu 118:
        inode_add_bytes+0x65/0xf0
        __inode_add_bytes at fs/stat.c:689
        (inlined by) inode_add_bytes at fs/stat.c:702
        ext4_mb_new_blocks+0x418/0xca0 [ext4]
        ext4_ext_map_blocks+0x1a6b/0x27b0 [ext4]
        ext4_map_blocks+0x1a9/0x950 [ext4]
        _ext4_get_block+0xfc/0x270 [ext4]
        ext4_get_block_unwritten+0x33/0x50 [ext4]
        __block_write_begin_int+0x22e/0xae0
        __block_write_begin+0x39/0x50
        ext4_write_begin+0x388/0xb50 [ext4]
        ext4_da_write_begin+0x35f/0x8f0 [ext4]
        generic_perform_write+0x15d/0x290
        ext4_buffered_write_iter+0x11f/0x210 [ext4]
        ext4_file_write_iter+0xce/0x9e0 [ext4]
        new_sync_write+0x29c/0x3b0
        __vfs_write+0x92/0xa0
        vfs_write+0x103/0x260
        ksys_write+0x9d/0x130
        __x64_sys_write+0x4c/0x60
        do_syscall_64+0x91/0xb05
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       read to 0xffff9a00d4b982d0 of 8 bytes by task 8 on cpu 65:
        ext4_do_update_inode+0x4a0/0xf60 [ext4]
        ext4_inode_blocks_set at fs/ext4/inode.c:4815
        ext4_mark_iloc_dirty+0xaf/0x160 [ext4]
        ext4_mark_inode_dirty+0x129/0x3e0 [ext4]
        ext4_convert_unwritten_extents+0x253/0x2d0 [ext4]
        ext4_convert_unwritten_io_end_vec+0xc5/0x150 [ext4]
        ext4_end_io_rsv_work+0x22c/0x350 [ext4]
        process_one_work+0x54f/0xb90
        worker_thread+0x80/0x5f0
        kthread+0x1cd/0x1f0
        ret_from_fork+0x27/0x50
      
       4 locks held by kworker/u256:0/8:
        #0: ffff9a025abc4328 ((wq_completion)ext4-rsv-conversion){+.+.}, at: process_one_work+0x443/0xb90
        #1: ffffab5a862dbe20 ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}, at: process_one_work+0x443/0xb90
        #2: ffff9a025a9d0f58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
        #3: ffff9a00d4b985d8 (&(&ei->i_raw_lock)->rlock){+.+.}, at: ext4_do_update_inode+0xaa/0xf60 [ext4]
       irq event stamp: 3009267
       hardirqs last  enabled at (3009267): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
       hardirqs last disabled at (3009266): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
       softirqs last  enabled at (3009230): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
       softirqs last disabled at (3009223): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 65 PID: 8 Comm: kworker/u256:0 Tainted: G L 5.6.0-rc2-next-20200221+ #7
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
       Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work [ext4]
      
      The plain read is outside of inode->i_lock critical section which
      results in a data race. Fix it by adding READ_ONCE() there.
      
      Link: https://lore.kernel.org/r/20200222043258.2279-1-cai@lca.pwSigned-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      7b067e44
    • Q
      ext4: fix a data race in EXT4_I(inode)->i_disksize · ff11a028
      Qian Cai 提交于
      task #28557685
      
      commit 35df4299a6487f323b0aca120ea3f485dfee2ae3 upstream.
      
      EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by
      KCSAN,
      
       BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4]
      
       write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127:
        ext4_write_end+0x4e3/0x750 [ext4]
        ext4_update_i_disksize at fs/ext4/ext4.h:3032
        (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046
        (inlined by) ext4_write_end at fs/ext4/inode.c:1287
        generic_perform_write+0x208/0x2a0
        ext4_buffered_write_iter+0x11f/0x210 [ext4]
        ext4_file_write_iter+0xce/0x9e0 [ext4]
        new_sync_write+0x29c/0x3b0
        __vfs_write+0x92/0xa0
        vfs_write+0x103/0x260
        ksys_write+0x9d/0x130
        __x64_sys_write+0x4c/0x60
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37:
        ext4_writepages+0x10ac/0x1d00 [ext4]
        mpage_map_and_submit_extent at fs/ext4/inode.c:2468
        (inlined by) ext4_writepages at fs/ext4/inode.c:2772
        do_writepages+0x5e/0x130
        __writeback_single_inode+0xeb/0xb20
        writeback_sb_inodes+0x429/0x900
        __writeback_inodes_wb+0xc4/0x150
        wb_writeback+0x4bd/0x870
        wb_workfn+0x6b4/0x960
        process_one_work+0x54c/0xbe0
        worker_thread+0x80/0x650
        kthread+0x1e0/0x200
        ret_from_fork+0x27/0x50
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G        W  O L 5.5.0-next-20200204+ #5
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
       Workqueue: writeback wb_workfn (flush-7:0)
      
      Since only the read is operating as lockless (outside of the
      "i_data_sem"), load tearing could introduce a logic bug. Fix it by
      adding READ_ONCE() for the read and WRITE_ONCE() for the write.
      Signed-off-by: NQian Cai <cai@lca.pw>
      Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pwSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ff11a028
    • S
      ext4: add cond_resched() to ext4_protect_reserved_inode · b6063100
      Shijie Luo 提交于
      task #28557685
      
      commit af133ade9a40794a37104ecbcc2827c0ea373a3c upstream.
      
      When journal size is set too big by "mkfs.ext4 -J size=", or when
      we mount a crafted image to make journal inode->i_size too big,
      the loop, "while (i < num)", holds cpu too long. This could cause
      soft lockup.
      
      [  529.357541] Call trace:
      [  529.357551]  dump_backtrace+0x0/0x198
      [  529.357555]  show_stack+0x24/0x30
      [  529.357562]  dump_stack+0xa4/0xcc
      [  529.357568]  watchdog_timer_fn+0x300/0x3e8
      [  529.357574]  __hrtimer_run_queues+0x114/0x358
      [  529.357576]  hrtimer_interrupt+0x104/0x2d8
      [  529.357580]  arch_timer_handler_virt+0x38/0x58
      [  529.357584]  handle_percpu_devid_irq+0x90/0x248
      [  529.357588]  generic_handle_irq+0x34/0x50
      [  529.357590]  __handle_domain_irq+0x68/0xc0
      [  529.357593]  gic_handle_irq+0x6c/0x150
      [  529.357595]  el1_irq+0xb8/0x140
      [  529.357599]  __ll_sc_atomic_add_return_acquire+0x14/0x20
      [  529.357668]  ext4_map_blocks+0x64/0x5c0 [ext4]
      [  529.357693]  ext4_setup_system_zone+0x330/0x458 [ext4]
      [  529.357717]  ext4_fill_super+0x2170/0x2ba8 [ext4]
      [  529.357722]  mount_bdev+0x1a8/0x1e8
      [  529.357746]  ext4_mount+0x44/0x58 [ext4]
      [  529.357748]  mount_fs+0x50/0x170
      [  529.357752]  vfs_kern_mount.part.9+0x54/0x188
      [  529.357755]  do_mount+0x5ac/0xd78
      [  529.357758]  ksys_mount+0x9c/0x118
      [  529.357760]  __arm64_sys_mount+0x28/0x38
      [  529.357764]  el0_svc_common+0x78/0x130
      [  529.357766]  el0_svc_handler+0x38/0x78
      [  529.357769]  el0_svc+0x8/0xc
      [  541.356516] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mount:18674]
      
      Link: https://lore.kernel.org/r/20200211011752.29242-1-luoshijie1@huawei.comReviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NShijie Luo <luoshijie1@huawei.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b6063100
    • E
      ext4: fix deadlock allocating crypto bounce page from mempool · 4141548c
      Eric Biggers 提交于
      task #28557685
      
      [ Upstream commit 547c556f4db7c09447ecf5f833ab6aaae0c5ab58 ]
      
      ext4_writepages() on an encrypted file has to encrypt the data, but it
      can't modify the pagecache pages in-place, so it encrypts the data into
      bounce pages and writes those instead.  All bounce pages are allocated
      from a mempool using GFP_NOFS.
      
      This is not correct use of a mempool, and it can deadlock.  This is
      because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never
      fail" mode for mempool_alloc() where a failed allocation will fall back
      to waiting for one of the preallocated elements in the pool.
      
      But since this mode is used for all a bio's pages and not just the
      first, it can deadlock waiting for pages already in the bio to be freed.
      
      This deadlock can be reproduced by patching mempool_alloc() to pretend
      that pool->alloc() always fails (so that it always falls back to the
      preallocations), and then creating an encrypted file of size > 128 KiB.
      
      Fix it by only using GFP_NOFS for the first page in the bio.  For
      subsequent pages just use GFP_NOWAIT, and if any of those fail, just
      submit the bio and start a new one.
      
      This will need to be fixed in f2fs too, but that's less straightforward.
      
      Fixes: c9af28fd ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
      Cc: stable@vger.kernel.org
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      4141548c
    • C
      ext4: set error return correctly when ext4_htree_store_dirent fails · 219c5eb7
      Colin Ian King 提交于
      task #28557685
      
      [ Upstream commit 7a14826ede1d714f0bb56de8167c0e519041eeda ]
      
      Currently when the call to ext4_htree_store_dirent fails the error return
      variable 'ret' is is not being set to the error code and variable count is
      instead, hence the error code is not being returned.  Fix this by assigning
      ret to the error return code.
      
      Addresses-Coverity: ("Unused value")
      Fixes: 8af0f082 ("ext4: fix readdir error in the case of inline_data+dir_index")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      219c5eb7
    • D
      ext4: unlock on error in ext4_expand_extra_isize() · b23f8091
      Dan Carpenter 提交于
      task #28557685
      
      commit 7f420d64a08c1dcd65b27be82a27cf2bdb2e7847 upstream.
      
      We need to unlock the xattr before returning on this error path.
      
      Cc: stable@kernel.org # 4.13
      Fixes: c03b45b8 ("ext4, project: expand inode extra size if possible")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Link: https://lore.kernel.org/r/20191213185010.6k7yl2tck3wlsdkt@kili.mountainSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b23f8091
    • J
      ext4: check for directory entries too close to block end · 3ac53bd1
      Jan Kara 提交于
      task #28557685
      
      commit 109ba779d6cca2d519c5dd624a3276d03e21948e upstream.
      
      ext4_check_dir_entry() currently does not catch a case when a directory
      entry ends so close to the block end that the header of the next
      directory entry would not fit in the remaining space. This can lead to
      directory iteration code trying to access address beyond end of current
      buffer head leading to oops.
      
      CC: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.czSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      3ac53bd1
    • J
      ext4: fix ext4_empty_dir() for directories with holes · 824b5c3c
      Jan Kara 提交于
      task #28557685
      
      commit 64d4ce892383b2ad6d782e080d25502f91bf2a38 upstream.
      
      Function ext4_empty_dir() doesn't correctly handle directories with
      holes and crashes on bh->b_data dereference when bh is NULL. Reorganize
      the loop to use 'offset' variable all the times instead of comparing
      pointers to current direntry with bh->b_data pointer. Also add more
      strict checking of '.' and '..' directory entries to avoid entering loop
      in possibly invalid state on corrupted filesystems.
      
      References: CVE-2019-19037
      CC: stable@vger.kernel.org
      Fixes: 4e19d6b65fb4 ("ext4: allow directory holes")
      Signed-off-by: NJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.czSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      824b5c3c
  2. 15 6月, 2020 8 次提交
    • A
      nvme: retain split access workaround for capability reads · c2fe0cbc
      Ard Biesheuvel 提交于
      task #28557808
      
      [ Upstream commit 3a8ecc935efabdad106b5e06d07b150c394b4465 ]
      
      Commit 7fd8930f
      
        "nvme: add a common helper to read Identify Controller data"
      
      has re-introduced an issue that we have attempted to work around in the
      past, in commit a310acd7 ("NVMe: use split lo_hi_{read,write}q").
      
      The problem is that some PCIe NVMe controllers do not implement 64-bit
      outbound accesses correctly, which is why the commit above switched
      to using lo_hi_[read|write]q for all 64-bit BAR accesses occuring in
      the code.
      
      In the mean time, the NVMe subsystem has been refactored, and now calls
      into the PCIe support layer for NVMe via a .reg_read64() method, which
      fails to use lo_hi_readq(), and thus reintroduces the problem that the
      workaround above aimed to address.
      
      Given that, at the moment, .reg_read64() is only used to read the
      capability register [which is known to tolerate split reads], let's
      switch .reg_read64() to lo_hi_readq() as well.
      
      This fixes a boot issue on some ARM boxes with NVMe behind a Synopsys
      DesignWare PCIe host controller.
      
      Fixes: 7fd8930f ("nvme: add a common helper to read Identify Controller data")
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      c2fe0cbc
    • E
      nvme: Discard workaround for non-conformant devices · fd00911e
      Eduard Hasenleithner 提交于
      task #28557808
      
      [ Upstream commit 530436c45ef2e446c12538a400e465929a0b3ade ]
      
      Users observe IOMMU related errors when performing discard on nvme from
      non-compliant nvme devices reading beyond the end of the DMA mapped
      ranges to discard.
      
      Two different variants of this behavior have been observed: SM22XX
      controllers round up the read size to a multiple of 512 bytes, and Phison
      E12 unconditionally reads the maximum discard size allowed by the spec
      (256 segments or 4kB).
      
      Make nvme_setup_discard unconditionally allocate the maximum DSM buffer
      so the driver DMA maps a memory range that will always succeed.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202665 many
      Signed-off-by: NEduard Hasenleithner <eduard@hasenleithner.at>
      [changelog, use existing define, kernel coding style]
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      fd00911e
    • G
      dm multipath: use updated MPATHF_QUEUE_IO on mapping for bio-based mpath · 533c5ba8
      Gabriel Krisman Bertazi 提交于
      task #28557827
      
      commit 5686dee34dbfe0238c0274e0454fa0174ac0a57a upstream.
      
      When adding devices that don't have a scsi_dh on a BIO based multipath,
      I was able to consistently hit the warning below and lock-up the system.
      
      The problem is that __map_bio reads the flag before it potentially being
      modified by choose_pgpath, and ends up using the older value.
      
      The WARN_ON below is not trivially linked to the issue. It goes like
      this: The activate_path delayed_work is not initialized for non-scsi_dh
      devices, but we always set MPATHF_QUEUE_IO, asking for initialization.
      That is fine, since MPATHF_QUEUE_IO would be cleared in choose_pgpath.
      Nevertheless, only for BIO-based mpath, we cache the flag before calling
      choose_pgpath, and use the older version when deciding if we should
      initialize the path.  Therefore, we end up trying to initialize the
      paths, and calling the non-initialized activate_path work.
      
      [   82.437100] ------------[ cut here ]------------
      [   82.437659] WARNING: CPU: 3 PID: 602 at kernel/workqueue.c:1624
        __queue_delayed_work+0x71/0x90
      [   82.438436] Modules linked in:
      [   82.438911] CPU: 3 PID: 602 Comm: systemd-udevd Not tainted 5.6.0-rc6+ #339
      [   82.439680] RIP: 0010:__queue_delayed_work+0x71/0x90
      [   82.440287] Code: c1 48 89 4a 50 81 ff 00 02 00 00 75 2a 4c 89 cf e9
      94 d6 07 00 e9 7f e9 ff ff 0f 0b eb c7 0f 0b 48 81 7a 58 40 74 a8 94 74
      a7 <0f> 0b 48 83 7a 48 00 74 a5 0f 0b eb a1 89 fe 4c 89 cf e9 c8 c4 07
      [   82.441719] RSP: 0018:ffffb738803977c0 EFLAGS: 00010007
      [   82.442121] RAX: ffffa086389f9740 RBX: 0000000000000002 RCX: 0000000000000000
      [   82.442718] RDX: ffffa086350dd930 RSI: ffffa0863d76f600 RDI: 0000000000000200
      [   82.443484] RBP: 0000000000000200 R08: 0000000000000000 R09: ffffa086350dd970
      [   82.444128] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa086350dd930
      [   82.444773] R13: ffffa0863d76f600 R14: 0000000000000000 R15: ffffa08636738008
      [   82.445427] FS:  00007f6abfe9dd40(0000) GS:ffffa0863dd80000(0000) knlGS:00000
      [   82.446040] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   82.446478] CR2: 0000557d288db4e8 CR3: 0000000078b36000 CR4: 00000000000006e0
      [   82.447104] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   82.447561] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   82.448012] Call Trace:
      [   82.448164]  queue_delayed_work_on+0x6d/0x80
      [   82.448472]  __pg_init_all_paths+0x7b/0xf0
      [   82.448714]  pg_init_all_paths+0x26/0x40
      [   82.448980]  __multipath_map_bio.isra.0+0x84/0x210
      [   82.449267]  __map_bio+0x3c/0x1f0
      [   82.449468]  __split_and_process_non_flush+0x14a/0x1b0
      [   82.449775]  __split_and_process_bio+0xde/0x340
      [   82.450045]  ? dm_get_live_table+0x5/0xb0
      [   82.450278]  dm_process_bio+0x98/0x290
      [   82.450518]  dm_make_request+0x54/0x120
      [   82.450778]  generic_make_request+0xd2/0x3e0
      [   82.451038]  ? submit_bio+0x3c/0x150
      [   82.451278]  submit_bio+0x3c/0x150
      [   82.451492]  mpage_readpages+0x129/0x160
      [   82.451756]  ? bdev_evict_inode+0x1d0/0x1d0
      [   82.452033]  read_pages+0x72/0x170
      [   82.452260]  __do_page_cache_readahead+0x1ba/0x1d0
      [   82.452624]  force_page_cache_readahead+0x96/0x110
      [   82.452903]  generic_file_read_iter+0x84f/0xae0
      [   82.453192]  ? __seccomp_filter+0x7c/0x670
      [   82.453547]  new_sync_read+0x10e/0x190
      [   82.453883]  vfs_read+0x9d/0x150
      [   82.454172]  ksys_read+0x65/0xe0
      [   82.454466]  do_syscall_64+0x4e/0x210
      [   82.454828]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [...]
      [   82.462501] ---[ end trace bb39975e9cf45daa ]---
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      533c5ba8
    • M
      dm: fix potential for q->make_request_fn NULL pointer · e572e842
      Mike Snitzer 提交于
      task #28557827
      
      commit 47ace7e012b9f7ad71d43ac9063d335ea3d6820b upstream.
      
      Move blk_queue_make_request() to dm.c:alloc_dev() so that
      q->make_request_fn is never NULL during the lifetime of a DM device
      (even one that is created without a DM table).
      
      Otherwise generic_make_request() will crash simply by doing:
        dmsetup create -n test
        mount /dev/dm-N /mnt
      
      While at it, move ->congested_data initialization out of
      dm.c:alloc_dev() and into the bio-based specific init method.
      Reported-by: NStefan Bader <stefan.bader@canonical.com>
      BugLink: https://bugs.launchpad.net/bugs/1860231
      Fixes: ff36ab34 ("dm: remove request-based logic from make_request_fn wrapper")
      Depends-on: c12c9a3c ("dm: various cleanups to md->queue initialization code")
      Cc: stable@vger.kernel.org
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e572e842
    • M
      dm crypt: fix benbi IV constructor crash if used in authenticated mode · e76e922c
      Milan Broz 提交于
      task #28557827
      
      commit 4ea9471fbd1addb25a4d269991dc724e200ca5b5 upstream.
      
      If benbi IV is used in AEAD construction, for example:
        cryptsetup luksFormat <device> --cipher twofish-xts-benbi --key-size 512 --integrity=hmac-sha256
      the constructor uses wrong skcipher function and crashes:
      
       BUG: kernel NULL pointer dereference, address: 00000014
       ...
       EIP: crypt_iv_benbi_ctr+0x15/0x70 [dm_crypt]
       Call Trace:
        ? crypt_subkey_size+0x20/0x20 [dm_crypt]
        crypt_ctr+0x567/0xfc0 [dm_crypt]
        dm_table_add_target+0x15f/0x340 [dm_mod]
      
      Fix this by properly using crypt_aead_blocksize() in this case.
      
      Fixes: ef43aa38 ("dm crypt: add cryptographic data integrity protection (authenticated encryption)")
      Cc: stable@vger.kernel.org # v4.12+
      Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941051Reported-by: NJerad Simpson <jbsimpson@gmail.com>
      Signed-off-by: NMilan Broz <gmazyland@gmail.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e76e922c
    • J
      dm space map common: fix to ensure new block isn't already in use · d78b9658
      Joe Thornber 提交于
      task #28557827
      
      commit 4feaef830de7ffdd8352e1fe14ad3bf13c9688f8 upstream.
      
      The space-maps track the reference counts for disk blocks allocated by
      both the thin-provisioning and cache targets.  There are variants for
      tracking metadata blocks and data blocks.
      
      Transactionality is implemented by never touching blocks from the
      previous transaction, so we can rollback in the event of a crash.
      
      When allocating a new block we need to ensure the block is free (has
      reference count of 0) in both the current and previous transaction.
      Prior to this fix we were doing this by searching for a free block in
      the previous transaction, and relying on a 'begin' counter to track
      where the last allocation in the current transaction was.  This
      'begin' field was not being updated in all code paths (eg, increment
      of a data block reference count due to breaking sharing of a neighbour
      block in the same btree leaf).
      
      This fix keeps the 'begin' field, but now it's just a hint to speed up
      the search.  Instead the current transaction is searched for a free
      block, and then the old transaction is double checked to ensure it's
      free.  Much simpler.
      
      This fixes reports of sm_disk_new_block()'s BUG_ON() triggering when
      DM thin-provisioning's snapshots are heavily used.
      Reported-by: NEric Wheeler <dm-devel@lists.ewheeler.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      d78b9658
    • S
      virtio-blk: handle block_device_operations callbacks after hot unplug · ef05ab79
      Stefan Hajnoczi 提交于
      task #28557821
      
      [ Upstream commit 90b5feb8c4bebc76c27fcaf3e1a0e5ca2d319e9e ]
      
      A userspace process holding a file descriptor to a virtio_blk device can
      still invoke block_device_operations after hot unplug.  This leads to a
      use-after-free accessing vblk->vdev in virtblk_getgeo() when
      ioctl(HDIO_GETGEO) is invoked:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
        IP: [<ffffffffc00e5450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
        PGD 800000003a92f067 PUD 3a930067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 1310 Comm: hdio-getgeo Tainted: G           OE  ------------   3.10.0-1062.el7.x86_64 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        task: ffff9be5fbfb8000 ti: ffff9be5fa890000 task.ti: ffff9be5fa890000
        RIP: 0010:[<ffffffffc00e5450>]  [<ffffffffc00e5450>] virtio_check_driver_offered_feature+0x10/0x90 [virtio]
        RSP: 0018:ffff9be5fa893dc8  EFLAGS: 00010246
        RAX: ffff9be5fc3f3400 RBX: ffff9be5fa893e30 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff9be5fbc10b40
        RBP: ffff9be5fa893dc8 R08: 0000000000000301 R09: 0000000000000301
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff9be5fdc24680
        R13: ffff9be5fbc10b40 R14: ffff9be5fbc10480 R15: 0000000000000000
        FS:  00007f1bfb968740(0000) GS:ffff9be5ffc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000090 CR3: 000000003a894000 CR4: 0000000000360ff0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         [<ffffffffc016ac37>] virtblk_getgeo+0x47/0x110 [virtio_blk]
         [<ffffffff8d3f200d>] ? handle_mm_fault+0x39d/0x9b0
         [<ffffffff8d561265>] blkdev_ioctl+0x1f5/0xa20
         [<ffffffff8d488771>] block_ioctl+0x41/0x50
         [<ffffffff8d45d9e0>] do_vfs_ioctl+0x3a0/0x5a0
         [<ffffffff8d45dc81>] SyS_ioctl+0xa1/0xc0
      
      A related problem is that virtblk_remove() leaks the vd_index_ida index
      when something still holds a reference to vblk->disk during hot unplug.
      This causes virtio-blk device names to be lost (vda, vdb, etc).
      
      Fix these issues by protecting vblk->vdev with a mutex and reference
      counting vblk so the vd_index_ida index can be removed in all cases.
      
      Fixes: 48e4043d ("virtio: add virtio disk geometry feature")
      Reported-by: NLance Digby <ldigby@redhat.com>
      Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
      Link: https://lore.kernel.org/r/20200430140442.171016-1-stefanha@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NStefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ef05ab79
    • H
      virtio-blk: improve virtqueue error to BLK_STS · e322fee8
      Halil Pasic 提交于
      task #28557821
      
      [ Upstream commit 3d973b2e9a625996ee997c7303cd793b9d197c65 ]
      
      Let's change the mapping between virtqueue_add errors to BLK_STS
      statuses, so that -ENOSPC, which indicates virtqueue full is still
      mapped to BLK_STS_DEV_RESOURCE, but -ENOMEM which indicates non-device
      specific resource outage is mapped to BLK_STS_RESOURCE.
      Signed-off-by: NHalil Pasic <pasic@linux.ibm.com>
      Link: https://lore.kernel.org/r/20200213123728.61216-3-pasic@linux.ibm.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e322fee8