1. 02 10月, 2020 1 次提交
    • L
      pipe: remove pipe_wait() and fix wakeup race with splice · 472e5b05
      Linus Torvalds 提交于
      The pipe splice code still used the old model of waiting for pipe IO by
      using a non-specific "pipe_wait()" that waited for any pipe event to
      happen, which depended on all pipe IO being entirely serialized by the
      pipe lock.  So by checking the state you were waiting for, and then
      adding yourself to the wait queue before dropping the lock, you were
      guaranteed to see all the wakeups.
      
      Strictly speaking, the actual wakeups were not done under the lock, but
      the pipe_wait() model still worked, because since the waiter held the
      lock when checking whether it should sleep, it would always see the
      current state, and the wakeup was always done after updating the state.
      
      However, commit 0ddad21d ("pipe: use exclusive waits when reading or
      writing") split the single wait-queue into two, and in the process also
      made the "wait for event" code wait for _two_ wait queues, and that then
      showed a race with the wakers that were not serialized by the pipe lock.
      
      It's only splice that used that "pipe_wait()" model, so the problem
      wasn't obvious, but Josef Bacik reports:
      
       "I hit a hang with fstest btrfs/187, which does a btrfs send into
        /dev/null. This works by creating a pipe, the write side is given to
        the kernel to write into, and the read side is handed to a thread that
        splices into a file, in this case /dev/null.
      
        The box that was hung had the write side stuck here [pipe_write] and
        the read side stuck here [splice_from_pipe_next -> pipe_wait].
      
        [ more details about pipe_wait() scenario ]
      
        The problem is we're doing the prepare_to_wait, which sets our state
        each time, however we can be woken up either with reads or writes. In
        the case above we race with the WRITER waking us up, and re-set our
        state to INTERRUPTIBLE, and thus never break out of schedule"
      
      Josef had a patch that avoided the issue in pipe_wait() by just making
      it set the state only once, but the deeper problem is that pipe_wait()
      depends on a level of synchonization by the pipe mutex that it really
      shouldn't.  And the whole "wait for any pipe state change" model really
      isn't very good to begin with.
      
      So rather than trying to work around things in pipe_wait(), remove that
      legacy model of "wait for arbitrary pipe event" entirely, and actually
      create functions that wait for the pipe actually being readable or
      writable, and can do so without depending on the pipe lock serializing
      everything.
      
      Fixes: 0ddad21d ("pipe: use exclusive waits when reading or writing")
      Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/Reported-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-and-tested-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      472e5b05
  2. 01 10月, 2020 2 次提交
    • F
      btrfs: fix filesystem corruption after a device replace · 4c8f3532
      Filipe Manana 提交于
      We use a device's allocation state tree to track ranges in a device used
      for allocated chunks, and we set ranges in this tree when allocating a new
      chunk. However after a device replace operation, we were not setting the
      allocated ranges in the new device's allocation state tree, so that tree
      is empty after a device replace.
      
      This means that a fitrim operation after a device replace will trim the
      device ranges that have allocated chunks and extents, as we trim every
      range for which there is not a range marked in the device's allocation
      state tree. It is also important during chunk allocation, since the
      device's allocation state is used to determine if a range is already
      allocated when allocating a new chunk.
      
      This is trivial to reproduce and the following script triggers the bug:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        DEV1="/dev/sdg"
        DEV2="/dev/sdh"
        DEV3="/dev/sdi"
      
        wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null
      
        # Create a raid1 test fs on 2 devices.
        mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
        mount $DEV1 /mnt/btrfs
      
        xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo
      
        echo "Starting to replace $DEV1 with $DEV3"
        btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
        echo
      
        echo "Running fstrim"
        fstrim /mnt/btrfs
        echo
      
        echo "Unmounting filesystem"
        umount /mnt/btrfs
      
        echo "Mounting filesystem in degraded mode using $DEV3 only"
        wipefs -a $DEV1 $DEV2 &> /dev/null
        mount -o degraded $DEV3 /mnt/btrfs
        if [ $? -ne 0 ]; then
                dmesg | tail
                echo
                echo "Failed to mount in degraded mode"
                exit 1
        fi
      
        echo
        echo "File foo data (expected all bytes = 0xab):"
        od -A d -t x1 /mnt/btrfs/foo
      
        umount /mnt/btrfs
      
      When running the reproducer:
      
        $ ./replace-test.sh
        wrote 10485760/10485760 bytes at offset 0
        10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
        Starting to replace /dev/sdg with /dev/sdi
      
        Running fstrim
      
        Unmounting filesystem
        Mounting filesystem in degraded mode using /dev/sdi only
        mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
        [19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
        [19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
        [19582.208293] BTRFS info (device sdi): allowing degraded mounts
        [19582.208298] BTRFS info (device sdi): disk space caching is enabled
        [19582.208301] BTRFS info (device sdi): has skinny extents
        [19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
        [19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
        [19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
        [19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
        [19582.231576] BTRFS error (device sdi): open_ctree failed
      
        Failed to mount in degraded mode
      
      So fix by setting all allocated ranges in the replace target device when
      the replace operation is finishing, when we are holding the chunk mutex
      and we can not race with new chunk allocations.
      
      A test case for fstests follows soon.
      
      Fixes: 1c11b63e ("btrfs: replace pending/pinned chunks lists with io tree")
      CC: stable@vger.kernel.org # 5.2+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4c8f3532
    • J
      btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · a466c85e
      Josef Bacik 提交于
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      
      There's a report from syzbot probably hitting one of the cases where
      the bd_mutex and device_list_mutex are taken in the wrong order, however
      it's not with device replace, like this patch fixes. As there's no
      reproducer available so far, we can't verify the fix.
      
      https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/
      dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419
      
        WARNING: possible circular locking dependency detected
        5.9.0-rc5-syzkaller #0 Not tainted
        ------------------------------------------------------
        syz-executor.0/6878 is trying to acquire lock:
        ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804
      
        but task is already holding lock:
        ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
      	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
      	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
      	 btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255
      	 btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109
      	 __btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916
      	 find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline]
      	 find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127
      	 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
      	 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
      	 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
      	 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
      	 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
      	 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
      	 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
      	 do_writepages+0xec/0x290 mm/page-writeback.c:2352
      	 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
      	 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
      	 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
      	 wb_do_writeback fs/fs-writeback.c:2039 [inline]
      	 wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
      	 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
      	 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
      	 kthread+0x3b5/0x4a0 kernel/kthread.c:292
      	 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
        -> #3 (sb_internal#2){.+.+}-{0:0}:
      	 percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
      	 __sb_start_write+0x234/0x470 fs/super.c:1672
      	 sb_start_intwrite include/linux/fs.h:1690 [inline]
      	 start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624
      	 find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline]
      	 find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127
      	 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
      	 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
      	 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
      	 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
      	 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
      	 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
      	 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
      	 do_writepages+0xec/0x290 mm/page-writeback.c:2352
      	 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
      	 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
      	 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
      	 wb_do_writeback fs/fs-writeback.c:2039 [inline]
      	 wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
      	 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
      	 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
      	 kthread+0x3b5/0x4a0 kernel/kthread.c:292
      	 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
      
        -> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
      	 __flush_work+0x60e/0xac0 kernel/workqueue.c:3041
      	 wb_shutdown+0x180/0x220 mm/backing-dev.c:355
      	 bdi_unregister+0x174/0x590 mm/backing-dev.c:872
      	 del_gendisk+0x820/0xa10 block/genhd.c:933
      	 loop_remove drivers/block/loop.c:2192 [inline]
      	 loop_control_ioctl drivers/block/loop.c:2291 [inline]
      	 loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257
      	 vfs_ioctl fs/ioctl.c:48 [inline]
      	 __do_sys_ioctl fs/ioctl.c:753 [inline]
      	 __se_sys_ioctl fs/ioctl.c:739 [inline]
      	 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
      	 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (loop_ctl_mutex){+.+.}-{3:3}:
      	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
      	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
      	 lo_open+0x19/0xd0 drivers/block/loop.c:1893
      	 __blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507
      	 blkdev_get fs/block_dev.c:1639 [inline]
      	 blkdev_open+0x227/0x300 fs/block_dev.c:1753
      	 do_dentry_open+0x4b9/0x11b0 fs/open.c:817
      	 do_open fs/namei.c:3251 [inline]
      	 path_openat+0x1b9a/0x2730 fs/namei.c:3368
      	 do_filp_open+0x17e/0x3c0 fs/namei.c:3395
      	 do_sys_openat2+0x16d/0x420 fs/open.c:1168
      	 do_sys_open fs/open.c:1184 [inline]
      	 __do_sys_open fs/open.c:1192 [inline]
      	 __se_sys_open fs/open.c:1188 [inline]
      	 __x64_sys_open+0x119/0x1c0 fs/open.c:1188
      	 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&bdev->bd_mutex){+.+.}-{3:3}:
      	 check_prev_add kernel/locking/lockdep.c:2496 [inline]
      	 check_prevs_add kernel/locking/lockdep.c:2601 [inline]
      	 validate_chain kernel/locking/lockdep.c:3218 [inline]
      	 __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
      	 lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
      	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
      	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
      	 blkdev_put+0x30/0x520 fs/block_dev.c:1804
      	 btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
      	 btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
      	 btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
      	 close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
      	 close_fs_devices fs/btrfs/volumes.c:1193 [inline]
      	 btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
      	 close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
      	 generic_shutdown_super+0x144/0x370 fs/super.c:464
      	 kill_anon_super+0x36/0x60 fs/super.c:1108
      	 btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
      	 deactivate_locked_super+0x94/0x160 fs/super.c:335
      	 deactivate_super+0xad/0xd0 fs/super.c:366
      	 cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
      	 task_work_run+0xdd/0x190 kernel/task_work.c:141
      	 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
      	 exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
      	 exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
      	 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_devs->device_list_mutex);
      				 lock(sb_internal#2);
      				 lock(&fs_devs->device_list_mutex);
          lock(&bdev->bd_mutex);
      
         *** DEADLOCK ***
      
        3 locks held by syz-executor.0/6878:
         #0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365
         #1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178
         #2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159
      
        stack backtrace:
        CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x198/0x1fd lib/dump_stack.c:118
         check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
         check_prev_add kernel/locking/lockdep.c:2496 [inline]
         check_prevs_add kernel/locking/lockdep.c:2601 [inline]
         validate_chain kernel/locking/lockdep.c:3218 [inline]
         __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
         lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
         __mutex_lock_common kernel/locking/mutex.c:956 [inline]
         __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
         blkdev_put+0x30/0x520 fs/block_dev.c:1804
         btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
         btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
         btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
         close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
         close_fs_devices fs/btrfs/volumes.c:1193 [inline]
         btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
         close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
         generic_shutdown_super+0x144/0x370 fs/super.c:464
         kill_anon_super+0x36/0x60 fs/super.c:1108
         btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
         deactivate_locked_super+0x94/0x160 fs/super.c:335
         deactivate_super+0xad/0xd0 fs/super.c:366
         cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
         task_work_run+0xdd/0x190 kernel/task_work.c:141
         tracehook_notify_resume include/linux/tracehook.h:188 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
         exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
         syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x460027
        RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027
        RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0
        RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b
        R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460
        R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add syzbot reference ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a466c85e
  3. 30 9月, 2020 1 次提交
    • L
      autofs: use __kernel_write() for the autofs pipe writing · 90fb7027
      Linus Torvalds 提交于
      autofs got broken in some configurations by commit 13c164b1
      ("autofs: switch to kernel_write") because there is now an extra LSM
      permission check done by security_file_permission() in rw_verify_area().
      
      autofs is one if the few places that really does want the much more
      limited __kernel_write(), because the write is an internal kernel one
      that shouldn't do any user permission checks (it also doesn't need the
      file_start_write/file_end_write logic, since it's just a pipe).
      
      There are a couple of other cases like that - accounting, core dumping,
      and splice - but autofs stands out because it can be built as a module.
      
      As a result, we need to export this internal __kernel_write() function
      again.
      
      We really don't want any other module to use this, but we don't have a
      "EXPORT_SYMBOL_FOR_AUTOFS_ONLY()".  But we can mark it GPL-only to at
      least approximate that "internal use only" for licensing.
      
      While in this area, make autofs pass in NULL for the file position
      pointer, since it's always a pipe, and we now use a NULL file pointer
      for streaming file descriptors (see file_ppos() and commit 438ab720:
      "vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files")
      
      This effectively reverts commits 9db97752 ("fs: unexport
      __kernel_write") and 13c164b1 ("autofs: switch to kernel_write").
      
      Fixes: 13c164b1 ("autofs: switch to kernel_write")
      Reported-by: NOndrej Mosnacek <omosnace@redhat.com>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NAcked-by: Ian Kent <raven@themaw.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90fb7027
  4. 26 9月, 2020 1 次提交
  5. 25 9月, 2020 4 次提交
  6. 22 9月, 2020 1 次提交
    • A
      btrfs: fix put of uninitialized kobject after seed device delete · b5ddcffa
      Anand Jain 提交于
      The following test case leads to NULL kobject free error:
      
        mount seed /mnt
        add sprout to /mnt
        umount /mnt
        mount sprout to /mnt
        delete seed
      
        kobject: '(null)' (00000000dd2b87e4): is not initialized, yet kobject_put() is being called.
        WARNING: CPU: 1 PID: 15784 at lib/kobject.c:736 kobject_put+0x80/0x350
        RIP: 0010:kobject_put+0x80/0x350
        ::
        Call Trace:
        btrfs_sysfs_remove_devices_dir+0x6e/0x160 [btrfs]
        btrfs_rm_device.cold+0xa8/0x298 [btrfs]
        btrfs_ioctl+0x206c/0x22a0 [btrfs]
        ksys_ioctl+0xe2/0x140
        __x64_sys_ioctl+0x1e/0x29
        do_syscall_64+0x96/0x150
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f4047c6288b
        ::
      
      This is because, at the end of the seed device-delete, we try to remove
      the seed's devid sysfs entry. But for the seed devices under the sprout
      fs, we don't initialize the devid kobject yet. So add a kobject state
      check, which takes care of the bug.
      
      Fixes: 668e48af ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
      CC: stable@vger.kernel.org # 5.6+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b5ddcffa
  7. 21 9月, 2020 5 次提交
    • J
      io_uring: fix openat/openat2 unified prep handling · 4eb8dded
      Jens Axboe 提交于
      A previous commit unified how we handle prep for these two functions,
      but this means that we check the allowed context (SQPOLL, specifically)
      later than we should. Move the ring type checking into the two parent
      functions, instead of doing it after we've done some setup work.
      
      Fixes: ec65fea5 ("io_uring: deduplicate io_openat{,2}_prep()")
      Reported-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4eb8dded
    • J
      io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL · 6ca56f84
      Jens Axboe 提交于
      These will naturally fail when attempted through SQPOLL, but either
      with -EFAULT or -EBADF. Make it explicit that these are not workable
      through SQPOLL and return -EINVAL, just like other ops that need to
      use ->files.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6ca56f84
    • J
      io_uring: don't use retry based buffered reads for non-async bdev · f5cac8b1
      Jens Axboe 提交于
      Some block devices, like dm, bubble back -EAGAIN through the completion
      handler. We check for this in io_read(), but don't honor it for when
      we have copied the iov. Return -EAGAIN for this case before retrying,
      to force punt to io-wq.
      
      Fixes: bcf5a063 ("io_uring: support true async buffered reads, if file provides it")
      Reported-by: NZorro Lang <zlang@redhat.com>
      Tested-by: NZorro Lang <zlang@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f5cac8b1
    • J
      io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there · 8f3d7496
      Jens Axboe 提交于
      If we already have mapped the necessary data for retry, then don't set
      it up again. It's a pointless operation, and we leak the iovec if it's
      a large (non-stack) vec.
      
      Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8f3d7496
    • J
      btrfs: fix overflow when copying corrupt csums for a message · 35be8851
      Johannes Thumshirn 提交于
      Syzkaller reported a buffer overflow in btree_readpage_end_io_hook()
      when loop mounting a crafted image:
      
        detected buffer overflow in memcpy
        ------------[ cut here ]------------
        kernel BUG at lib/string.c:1129!
        invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 26 Comm: kworker/u4:2 Not tainted 5.9.0-rc4-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Workqueue: btrfs-endio-meta btrfs_work_helper
        RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
        RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
        RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
        RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
        RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
        R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
        R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
        FS:  0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000557335f440d0 CR3: 000000009647d000 CR4: 00000000001506e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         memcpy include/linux/string.h:405 [inline]
         btree_readpage_end_io_hook.cold+0x206/0x221 fs/btrfs/disk-io.c:642
         end_bio_extent_readpage+0x4de/0x10c0 fs/btrfs/extent_io.c:2854
         bio_endio+0x3cf/0x7f0 block/bio.c:1449
         end_workqueue_fn+0x114/0x170 fs/btrfs/disk-io.c:1695
         btrfs_work_helper+0x221/0xe20 fs/btrfs/async-thread.c:318
         process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
         worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
         kthread+0x3b5/0x4a0 kernel/kthread.c:292
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
        Modules linked in:
        ---[ end trace b68924293169feef ]---
        RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
        RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
        RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
        RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
        RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
        R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
        R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
        FS:  0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f95b7c4d008 CR3: 000000009647d000 CR4: 00000000001506e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      The overflow happens, because in btree_readpage_end_io_hook() we assume
      that we have found a 4 byte checksum instead of the real possible 32
      bytes we have for the checksums.
      
      With the fix applied:
      
      [   35.726623] BTRFS: device fsid 815caf9a-dc43-4d2a-ac54-764b8333d765 devid 1 transid 5 /dev/loop0 scanned by syz-repro (215)
      [   35.738994] BTRFS info (device loop0): disk space caching is enabled
      [   35.738998] BTRFS info (device loop0): has skinny extents
      [   35.743337] BTRFS warning (device loop0): loop0 checksum verify failed on 1052672 wanted 0xf9c035fc8d239a54 found 0x67a25c14b7eabcf9 level 0
      [   35.743420] BTRFS error (device loop0): failed to read chunk root
      [   35.745899] BTRFS error (device loop0): open_ctree failed
      
      Reported-by: syzbot+e864a35d361e1d4e29a5@syzkaller.appspotmail.com
      Fixes: d5178578 ("btrfs: directly call into crypto framework for checksumming")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      35be8851
  8. 20 9月, 2020 1 次提交
  9. 18 9月, 2020 3 次提交
  10. 17 9月, 2020 2 次提交
    • O
      NFSv4.2: fix client's attribute cache management for copy_file_range · 16abd2a0
      Olga Kornievskaia 提交于
      After client is done with the COPY operation, it needs to invalidate
      its pagecache (as it did no reading or writing of the data locally)
      and it needs to invalidate it's attributes just like it would have
      for a read on the source file and write on the destination file.
      
      Once the linux server started giving out read delegations to
      read+write opens, the destination file of the copy_file range
      started having delegations and not doing syncup on close of the
      file leading to xfstest failures for generic/430,431,432,433,565.
      
      v2: changing cache_validity needs to be protected by the i_lock.
      Reported-by: NMurphy Zhou <jencce.kernel@gmail.com>
      Fixes: 2e72448b ("NFS: Add COPY nfs operation")
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      16abd2a0
    • J
      nfs: Fix security label length not being reset · d33030e2
      Jeffrey Mitchell 提交于
      nfs_readdir_page_filler() iterates over entries in a directory, reusing
      the same security label buffer, but does not reset the buffer's length.
      This causes decode_attr_security_label() to return -ERANGE if an entry's
      security label is longer than the previous one's. This error, in
      nfs4_decode_dirent(), only gets passed up as -EAGAIN, which causes another
      failed attempt to copy into the buffer. The second error is ignored and
      the remaining entries do not show up in ls, specifically the getdents64()
      syscall.
      
      Reproduce by creating multiple files in NFS and giving one of the later
      files a longer security label. ls will not see that file nor any that are
      added afterwards, though they will exist on the backend.
      
      In nfs_readdir_page_filler(), reset security label buffer length before
      every reuse
      Signed-off-by: NJeffrey Mitchell <jeffrey.mitchell@starlab.io>
      Fixes: b4487b93 ("nfs: Fix getxattr kernel panic and memory overflow")
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      d33030e2
  11. 15 9月, 2020 2 次提交
  12. 14 9月, 2020 2 次提交
  13. 11 9月, 2020 1 次提交
    • A
      epoll: EPOLL_CTL_ADD: close the race in decision to take fast path · fe0a916c
      Al Viro 提交于
      Checking for the lack of epitems refering to the epoll we want to insert into
      is not enough; we might have an insertion of that epoll into another one that
      has already collected the set of files to recheck for excessive reverse paths,
      but hasn't gotten to creating/inserting the epitem for it.
      
      However, any such insertion in progress can be detected - it will update the
      generation count in our epoll when it's done looking through it for files
      to check.  That gets done under ->mtx of our epoll and that allows us to
      detect that safely.
      
      We are *not* holding epmutex here, so the generation count is not stable.
      However, since both the update of ep->gen by loop check and (later)
      insertion into ->f_ep_link are done with ep->mtx held, we are fine -
      the sequence is
      	grab epmutex
      	bump loop_check_gen
      	...
      	grab tep->mtx		// 1
      	tep->gen = loop_check_gen
      	...
      	drop tep->mtx		// 2
      	...
      	grab tep->mtx		// 3
      	...
      	insert into ->f_ep_link
      	...
      	drop tep->mtx		// 4
      	bump loop_check_gen
      	drop epmutex
      and if the fastpath check in another thread happens for that
      eventpoll, it can come
      	* before (1) - in that case fastpath is just fine
      	* after (4) - we'll see non-empty ->f_ep_link, slow path
      taken
      	* between (2) and (3) - loop_check_gen is stable,
      with ->mtx providing barriers and we end up taking slow path.
      
      Note that ->f_ep_link emptiness check is slightly racy - we are protected
      against insertions into that list, but removals can happen right under us.
      Not a problem - in the worst case we'll end up taking a slow path for
      no good reason.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe0a916c
  14. 10 9月, 2020 2 次提交
  15. 09 9月, 2020 3 次提交
    • G
      f2fs: Return EOF on unaligned end of file DIO read · 20d0a107
      Gabriel Krisman Bertazi 提交于
      Reading past end of file returns EOF for aligned reads but -EINVAL for
      unaligned reads on f2fs.  While documentation is not strict about this
      corner case, most filesystem returns EOF on this case, like iomap
      filesystems.  This patch consolidates the behavior for f2fs, by making
      it return EOF(0).
      
      it can be verified by a read loop on a file that does a partial read
      before EOF (A file that doesn't end at an aligned address).  The
      following code fails on an unaligned file on f2fs, but not on
      btrfs, ext4, and xfs.
      
        while (done < total) {
          ssize_t delta = pread(fd, buf + done, total - done, off + done);
          if (!delta)
            break;
          ...
        }
      
      It is arguable whether filesystems should actually return EOF or
      -EINVAL, but since iomap filesystems support it, and so does the
      original DIO code, it seems reasonable to consolidate on that.
      Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      20d0a107
    • S
      f2fs: fix indefinite loop scanning for free nid · e2cab031
      Sahitya Tummala 提交于
      If the sbi->ckpt->next_free_nid is not NAT block aligned and if there
      are free nids in that NAT block between the start of the block and
      next_free_nid, then those free nids will not be scanned in scan_nat_page().
      This results into mismatch between nm_i->available_nids and the sum of
      nm_i->free_nid_count of all NAT blocks scanned. And nm_i->available_nids
      will always be greater than the sum of free nids in all the blocks.
      Under this condition, if we use all the currently scanned free nids,
      then it will loop forever in f2fs_alloc_nid() as nm_i->available_nids
      is still not zero but nm_i->free_nid_count of that partially scanned
      NAT block is zero.
      
      Fix this to align the nm_i->next_scan_nid to the first nid of the
      corresponding NAT block.
      Signed-off-by: NSahitya Tummala <stummala@codeaurora.org>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      e2cab031
    • S
      f2fs: Fix type of section block count variables · 123aaf77
      Shin'ichiro Kawasaki 提交于
      Commit da52f8ad ("f2fs: get the right gc victim section when section
      has several segments") added code to count blocks of each section using
      variables with type 'unsigned short', which has 2 bytes size in many
      systems. However, the counts can be larger than the 2 bytes range and
      type conversion results in wrong values. Especially when the f2fs
      sections have blocks as many as USHRT_MAX + 1, the count is handled as 0.
      This triggers eternal loop in init_dirty_segmap() at mount system call.
      Fix this by changing the type of the variables to block_t.
      
      Fixes: da52f8ad ("f2fs: get the right gc victim section when section has several segments")
      Signed-off-by: NShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      123aaf77
  16. 08 9月, 2020 1 次提交
    • F
      btrfs: fix NULL pointer dereference after failure to create snapshot · 2d892ccd
      Filipe Manana 提交于
      When trying to get a new fs root for a snapshot during the transaction
      at transaction.c:create_pending_snapshot(), if btrfs_get_new_fs_root()
      fails we leave "pending->snap" pointing to an error pointer, and then
      later at ioctl.c:create_snapshot() we dereference that pointer, resulting
      in a crash:
      
        [12264.614689] BUG: kernel NULL pointer dereference, address: 00000000000007c4
        [12264.615650] #PF: supervisor write access in kernel mode
        [12264.616487] #PF: error_code(0x0002) - not-present page
        [12264.617436] PGD 0 P4D 0
        [12264.618328] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        [12264.619150] CPU: 0 PID: 2310635 Comm: fsstress Tainted: G        W         5.9.0-rc3-btrfs-next-67 #1
        [12264.619960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [12264.621769] RIP: 0010:btrfs_mksubvol+0x438/0x4a0 [btrfs]
        [12264.622528] Code: bc ef ff ff (...)
        [12264.624092] RSP: 0018:ffffaa6fc7277cd8 EFLAGS: 00010282
        [12264.624669] RAX: 00000000fffffff4 RBX: ffff9d3e8f151a60 RCX: 0000000000000000
        [12264.625249] RDX: 0000000000000001 RSI: ffffffff9d56c9be RDI: fffffffffffffff4
        [12264.625830] RBP: ffff9d3e8f151b48 R08: 0000000000000000 R09: 0000000000000000
        [12264.626413] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffff4
        [12264.626994] R13: ffff9d3ede380538 R14: ffff9d3ede380500 R15: ffff9d3f61b2eeb8
        [12264.627582] FS:  00007f140d5d8200(0000) GS:ffff9d3fb5e00000(0000) knlGS:0000000000000000
        [12264.628176] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [12264.628773] CR2: 00000000000007c4 CR3: 000000020f8e8004 CR4: 00000000003706f0
        [12264.629379] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [12264.629994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [12264.630594] Call Trace:
        [12264.631227]  btrfs_mksnapshot+0x7b/0xb0 [btrfs]
        [12264.631840]  __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
        [12264.632458]  btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
        [12264.633078]  btrfs_ioctl+0x1864/0x3130 [btrfs]
        [12264.633689]  ? do_sys_openat2+0x1a7/0x2d0
        [12264.634295]  ? kmem_cache_free+0x147/0x3a0
        [12264.634899]  ? __x64_sys_ioctl+0x83/0xb0
        [12264.635488]  __x64_sys_ioctl+0x83/0xb0
        [12264.636058]  do_syscall_64+0x33/0x80
        [12264.636616]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        (gdb) list *(btrfs_mksubvol+0x438)
        0x7c7b8 is in btrfs_mksubvol (fs/btrfs/ioctl.c:858).
        853		ret = 0;
        854		pending_snapshot->anon_dev = 0;
        855	fail:
        856		/* Prevent double freeing of anon_dev */
        857		if (ret && pending_snapshot->snap)
        858			pending_snapshot->snap->anon_dev = 0;
        859		btrfs_put_root(pending_snapshot->snap);
        860		btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv);
        861	free_pending:
        862		if (pending_snapshot->anon_dev)
      
      So fix this by setting "pending->snap" to NULL if we get an error from the
      call to btrfs_get_new_fs_root() at transaction.c:create_pending_snapshot().
      
      Fixes: 2dfb1e43 ("btrfs: preallocate anon block device at first phase of snapshot creation")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2d892ccd
  17. 07 9月, 2020 4 次提交
    • J
      btrfs: free data reloc tree on failed mount · 9e3aa805
      Josef Bacik 提交于
      While testing a weird problem with -o degraded, I noticed I was getting
      leaked root errors
      
        BTRFS warning (device loop0): writable mount is not allowed due to too many missing devices
        BTRFS error (device loop0): open_ctree failed
        BTRFS error (device loop0): leaked root -9-0 refcount 1
      
      This is the DATA_RELOC root, which gets read before the other fs roots,
      but is included in the fs roots radix tree.  Handle this by adding a
      btrfs_drop_and_free_fs_root() on the data reloc root if it exists.  This
      is ok to do here if we fail further up because we will only drop the ref
      if we delete the root from the radix tree, and all other cleanup won't
      be duplicated.
      
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9e3aa805
    • Q
      btrfs: require only sector size alignment for parent eb bytenr · ea57788e
      Qu Wenruo 提交于
      [BUG]
      A completely sane converted fs will cause kernel warning at balance
      time:
      
        [ 1557.188633] BTRFS info (device sda7): relocating block group 8162107392 flags data
        [ 1563.358078] BTRFS info (device sda7): found 11722 extents
        [ 1563.358277] BTRFS info (device sda7): leaf 7989321728 gen 95 total ptrs 213 free space 3458 owner 2
        [ 1563.358280] 	item 0 key (7984947200 169 0) itemoff 16250 itemsize 33
        [ 1563.358281] 		extent refs 1 gen 90 flags 2
        [ 1563.358282] 		ref#0: tree block backref root 4
        [ 1563.358285] 	item 1 key (7985602560 169 0) itemoff 16217 itemsize 33
        [ 1563.358286] 		extent refs 1 gen 93 flags 258
        [ 1563.358287] 		ref#0: shared block backref parent 7985602560
        [ 1563.358288] 			(parent 7985602560 is NOT ALIGNED to nodesize 16384)
        [ 1563.358290] 	item 2 key (7985635328 169 0) itemoff 16184 itemsize 33
        ...
        [ 1563.358995] BTRFS error (device sda7): eb 7989321728 invalid extent inline ref type 182
        [ 1563.358996] ------------[ cut here ]------------
        [ 1563.359005] WARNING: CPU: 14 PID: 2930 at 0xffffffff9f231766
      
      Then with transaction abort, and obviously failed to balance the fs.
      
      [CAUSE]
      That mentioned inline ref type 182 is completely sane, it's
      BTRFS_SHARED_BLOCK_REF_KEY, it's some extra check making kernel to
      believe it's invalid.
      
      Commit 64ecdb64 ("Btrfs: add one more sanity check for shared ref
      type") introduced extra checks for backref type.
      
      One of the requirement is, parent bytenr must be aligned to node size,
      which is not correct.
      
      One example is like this:
      
      0	1G  1G+4K		2G 2G+4K
      	|   |///////////////////|//|  <- A chunk starts at 1G+4K
                  |   |	<- A tree block get reserved at bytenr 1G+4K
      
      Then we have a valid tree block at bytenr 1G+4K, but not aligned to
      nodesize (16K).
      
      Such chunk is not ideal, but current kernel can handle it pretty well.
      We may warn about such tree block in the future, but should not reject
      them.
      
      [FIX]
      Change the alignment requirement from node size alignment to sector size
      alignment.
      
      Also, to make our lives a little easier, also output @iref when
      btrfs_get_extent_inline_ref_type() failed, so we can locate the item
      easier.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205475
      Fixes: 64ecdb64 ("Btrfs: add one more sanity check for shared ref type")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      [ update comments and messages ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ea57788e
    • J
      btrfs: fix lockdep splat in add_missing_dev · fccc0007
      Josef Bacik 提交于
      Nikolay reported a lockdep splat in generic/476 that I could reproduce
      with btrfs/187.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc2+ #1 Tainted: G        W
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc_trace+0x3a/0x1a0
      	 btrfs_alloc_device+0x43/0x210
      	 add_missing_dev+0x20/0x90
      	 read_one_chunk+0x301/0x430
      	 btrfs_read_sys_array+0x17b/0x1b0
      	 open_ctree+0xa62/0x1896
      	 btrfs_mount_root.cold+0x12/0xea
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 vfs_kern_mount.part.0+0x71/0xb0
      	 btrfs_mount+0x10d/0x379
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_lookup_inode+0x2a/0x8f
      	 __btrfs_update_delayed_inode+0x80/0x240
      	 btrfs_commit_inode_delayed_inode+0x119/0x120
      	 btrfs_evict_inode+0x357/0x500
      	 evict+0xcf/0x1f0
      	 vfs_rmdir.part.0+0x149/0x160
      	 do_rmdir+0x136/0x1a0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1184/0x1fa0
      	 lock_acquire+0xa4/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(&fs_info->chunk_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 1 PID: 100 Comm: kswapd0 Tainted: G        W         5.9.0-rc2+ #1
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x92/0xc8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x1184/0x1fa0
         lock_acquire+0xa4/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa4/0x3d0
         ? btrfs_evict_inode+0x11e/0x500
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x46/0x60
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This is because we are holding the chunk_mutex when we call
      btrfs_alloc_device, which does a GFP_KERNEL allocation.  We don't want
      to switch that to a GFP_NOFS lock because this is the only place where
      it matters.  So instead use memalloc_nofs_save() around the allocation
      in order to avoid the lockdep splat.
      Reported-by: NNikolay Borisov <nborisov@suse.com>
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fccc0007
    • R
      cifs: fix DFS mount with cifsacl/modefromsid · 01ec372c
      Ronnie Sahlberg 提交于
      RHBZ: 1871246
      
      If during cifs_lookup()/get_inode_info() we encounter a DFS link
      and we use the cifsacl or modefromsid mount options we must suppress
      any -EREMOTE errors that triggers or else we will not be able to follow
      the DFS link and automount the target.
      
      This fixes an issue with modefromsid/cifsacl where these mountoptions
      would break DFS and we would no longer be able to access the share.
      Signed-off-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: NPaulo Alcantara (SUSE) <pc@cjr.nz>
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      01ec372c
  18. 06 9月, 2020 4 次提交
    • P
      io_uring: fix linked deferred ->files cancellation · c127a2a1
      Pavel Begunkov 提交于
      While looking for ->files in ->defer_list, consider that requests there
      may actually be links.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c127a2a1
    • P
      io_uring: fix cancel of deferred reqs with ->files · b7ddce3c
      Pavel Begunkov 提交于
      While trying to cancel requests with ->files, it also should look for
      requests in ->defer_list, otherwise it might end up hanging a thread.
      
      Cancel all requests in ->defer_list up to the last request there with
      matching ->files, that's needed to follow drain ordering semantics.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b7ddce3c
    • M
      xfs: don't update mtime on COW faults · b17164e2
      Mikulas Patocka 提交于
      When running in a dax mode, if the user maps a page with MAP_PRIVATE and
      PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime
      when the user hits a COW fault.
      
      This breaks building of the Linux kernel.  How to reproduce:
      
       1. extract the Linux kernel tree on dax-mounted xfs filesystem
       2. run make clean
       3. run make -j12
       4. run make -j12
      
      at step 4, make would incorrectly rebuild the whole kernel (although it
      was already built in step 3).
      
      The reason for the breakage is that almost all object files depend on
      objtool.  When we run objtool, it takes COW page fault on its .data
      section, and these faults will incorrectly update the timestamp of the
      objtool binary.  The updated timestamp causes make to rebuild the whole
      tree.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b17164e2
    • M
      ext2: don't update mtime on COW faults · 1ef6ea0e
      Mikulas Patocka 提交于
      When running in a dax mode, if the user maps a page with MAP_PRIVATE and
      PROT_WRITE, the ext2 filesystem would incorrectly update ctime and mtime
      when the user hits a COW fault.
      
      This breaks building of the Linux kernel.  How to reproduce:
      
       1. extract the Linux kernel tree on dax-mounted ext2 filesystem
       2. run make clean
       3. run make -j12
       4. run make -j12
      
      at step 4, make would incorrectly rebuild the whole kernel (although it
      was already built in step 3).
      
      The reason for the breakage is that almost all object files depend on
      objtool.  When we run objtool, it takes COW page fault on its .data
      section, and these faults will incorrectly update the timestamp of the
      objtool binary.  The updated timestamp causes make to rebuild the whole
      tree.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1ef6ea0e