1. 19 7月, 2022 1 次提交
  2. 14 1月, 2022 1 次提交
  3. 06 12月, 2021 2 次提交
    • J
      btrfs: do not take the uuid_mutex in btrfs_rm_device · 3c911db8
      Josef Bacik 提交于
      stable inclusion
      from stable-5.10.80
      commit b917f9b94633bde8982f98965aa1fa534b9e8f46
      bugzilla: 185821 https://gitee.com/openeuler/kernel/issues/I4L7CG
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b917f9b94633bde8982f98965aa1fa534b9e8f46
      
      --------------------------------
      
      [ Upstream commit 8ef9dc0f ]
      
      We got the following lockdep splat while running fstests (specifically
      btrfs/003 and btrfs/020 in a row) with the new rc.  This was uncovered
      by 87579e9b ("loop: use worker per cgroup instead of kworker") which
      converted loop to using workqueues, which comes with lockdep
      annotations that don't exist with kworkers.  The lockdep splat is as
      follows:
      
        WARNING: possible circular locking dependency detected
        5.14.0-rc2-custom+ #34 Not tainted
        ------------------------------------------------------
        losetup/156417 is trying to acquire lock:
        ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600
      
        but task is already holding lock:
        ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #5 (&lo->lo_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 lo_open+0x28/0x60 [loop]
      	 blkdev_get_whole+0x28/0xf0
      	 blkdev_get_by_dev.part.0+0x168/0x3c0
      	 blkdev_open+0xd2/0xe0
      	 do_dentry_open+0x163/0x3a0
      	 path_openat+0x74d/0xa40
      	 do_filp_open+0x9c/0x140
      	 do_sys_openat2+0xb1/0x170
      	 __x64_sys_openat+0x54/0x90
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #4 (&disk->open_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 blkdev_get_by_dev.part.0+0xd1/0x3c0
      	 blkdev_get_by_path+0xc0/0xd0
      	 btrfs_scan_one_device+0x52/0x1f0 [btrfs]
      	 btrfs_control_ioctl+0xac/0x170 [btrfs]
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #3 (uuid_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 btrfs_rm_device+0x48/0x6a0 [btrfs]
      	 btrfs_ioctl+0x2d1c/0x3110 [btrfs]
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #2 (sb_writers#11){.+.+}-{0:0}:
      	 lo_write_bvec+0x112/0x290 [loop]
      	 loop_process_work+0x25f/0xcb0 [loop]
      	 process_one_work+0x28f/0x5d0
      	 worker_thread+0x55/0x3c0
      	 kthread+0x140/0x170
      	 ret_from_fork+0x22/0x30
      
        -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
      	 process_one_work+0x266/0x5d0
      	 worker_thread+0x55/0x3c0
      	 kthread+0x140/0x170
      	 ret_from_fork+0x22/0x30
      
        -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
      	 __lock_acquire+0x1130/0x1dc0
      	 lock_acquire+0xf5/0x320
      	 flush_workqueue+0xae/0x600
      	 drain_workqueue+0xa0/0x110
      	 destroy_workqueue+0x36/0x250
      	 __loop_clr_fd+0x9a/0x650 [loop]
      	 lo_ioctl+0x29d/0x780 [loop]
      	 block_ioctl+0x3f/0x50
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        other info that might help us debug this:
        Chain exists of:
          (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
         Possible unsafe locking scenario:
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&lo->lo_mutex);
      				 lock(&disk->open_mutex);
      				 lock(&lo->lo_mutex);
          lock((wq_completion)loop0);
      
         *** DEADLOCK ***
        1 lock held by losetup/156417:
         #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
      
        stack backtrace:
        CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x57/0x72
         check_noncircular+0x10a/0x120
         __lock_acquire+0x1130/0x1dc0
         lock_acquire+0xf5/0x320
         ? flush_workqueue+0x84/0x600
         flush_workqueue+0xae/0x600
         ? flush_workqueue+0x84/0x600
         drain_workqueue+0xa0/0x110
         destroy_workqueue+0x36/0x250
         __loop_clr_fd+0x9a/0x650 [loop]
         lo_ioctl+0x29d/0x780 [loop]
         ? __lock_acquire+0x3a0/0x1dc0
         ? update_dl_rq_load_avg+0x152/0x360
         ? lock_is_held_type+0xa5/0x120
         ? find_held_lock.constprop.0+0x2b/0x80
         block_ioctl+0x3f/0x50
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f645884de6b
      
      Usually the uuid_mutex exists to protect the fs_devices that map
      together all of the devices that match a specific uuid.  In rm_device
      we're messing with the uuid of a device, so it makes sense to protect
      that here.
      
      However in doing that it pulls in a whole host of lockdep dependencies,
      as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
      we end up with the dependency chain under the uuid_mutex being added
      under the normal sb write dependency chain, which causes problems with
      loop devices.
      
      We don't need the uuid mutex here however.  If we call
      btrfs_scan_one_device() before we scratch the super block we will find
      the fs_devices and not find the device itself and return EBUSY because
      the fs_devices is open.  If we call it after the scratch happens it will
      not appear to be a valid btrfs file system.
      
      We do not need to worry about other fs_devices modifying operations here
      because we're protected by the exclusive operations locking.
      
      So drop the uuid_mutex here in order to fix the lockdep splat.
      
      A more detailed explanation from the discussion:
      
      We are worried about rm and scan racing with each other, before this
      change we'll zero the device out under the UUID mutex so when scan does
      run it'll make sure that it can go through the whole device scan thing
      without rm messing with us.
      
      We aren't worried if the scratch happens first, because the result is we
      don't think this is a btrfs device and we bail out.
      
      The only case we are concerned with is we scratch _after_ scan is able
      to read the superblock and gets a seemingly valid super block, so lets
      consider this case.
      
      Scan will call device_list_add() with the device we're removing.  We'll
      call find_fsid_with_metadata_uuid() and get our fs_devices for this
      UUID.  At this point we lock the fs_devices->device_list_mutex.  This is
      what protects us in this case, but we have two cases here.
      
      1. We aren't to the device removal part of the RM.  We found our device,
         and device name matches our path, we go down and we set total_devices
         to our super number of devices, which doesn't affect anything because
         we haven't done the remove yet.
      
      2. We are past the device removal part, which is protected by the
         device_list_mutex.  Scan doesn't find the device, it goes down and
         does the
      
         if (fs_devices->opened)
      	   return -EBUSY;
      
         check and we bail out.
      
      Nothing about this situation is ideal, but the lockdep splat is real,
      and the fix is safe, tho admittedly a bit scary looking.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ copy more from the discussion ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Reviewed-by: NWeilong Chen <chenweilong@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      3c911db8
    • L
      btrfs: clear MISSING device status bit in btrfs_close_one_device · 08a4f14e
      Li Zhang 提交于
      stable inclusion
      from stable-5.10.80
      commit 8992aab294cb7c70d430e7ad6671ccb0002de5b7
      bugzilla: 185821 https://gitee.com/openeuler/kernel/issues/I4L7CG
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8992aab294cb7c70d430e7ad6671ccb0002de5b7
      
      --------------------------------
      
      commit 5d03dbeb upstream.
      
      Reported bug: https://github.com/kdave/btrfs-progs/issues/389
      
      There's a problem with scrub reporting aborted status but returning
      error code 0, on a filesystem with missing and readded device.
      
      Roughly these steps:
      
      - mkfs -d raid1 dev1 dev2
      - fill with data
      - unmount
      - make dev1 disappear
      - mount -o degraded
      - copy more data
      - make dev1 appear again
      
      Running scrub afterwards reports that the command was aborted, but the
      system log message says the exit code was 0.
      
      It seems that the cause of the error is decrementing
      fs_devices->missing_devices but not clearing device->dev_state.  Every
      time we umount filesystem, it would call close_ctree, And it would
      eventually involve btrfs_close_one_device to close the device, but it
      only decrements fs_devices->missing_devices but does not clear the
      device BTRFS_DEV_STATE_MISSING bit. Worse, this bug will cause Integer
      Overflow, because every time umount, fs_devices->missing_devices will
      decrease. If fs_devices->missing_devices value hit 0, it would overflow.
      
      With added debugging:
      
         loop1: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 1 transid 21 /dev/loop1 scanned by systemd-udevd (2311)
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 56ad51f1-5523-463b-8547-c19486c51ebb devid 2 transid 17 /dev/loop2 scanned by systemd-udevd (2313)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 0
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): using free space tree
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000f706684d /dev/loop1 18446744073709551615
         BTRFS warning (device loop1): devid 2 uuid 6635ac31-56dd-4852-873b-c60f5e2d53d2 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 18446744073709551615
      
      If fs_devices->missing_devices is 0, next time it would be 18446744073709551615
      
      After apply this patch, the fs_devices->missing_devices seems to be
      right:
      
        $ truncate -s 10g test1
        $ truncate -s 10g test2
        $ losetup /dev/loop1 test1
        $ losetup /dev/loop2 test2
        $ mkfs.btrfs -draid1 -mraid1 /dev/loop1 /dev/loop2 -f
        $ losetup -d /dev/loop2
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ mount -o degraded /dev/loop1 /mnt/1
        $ umount /mnt/1
        $ dmesg
      
         loop1: detected capacity change from 0 to 20971520
         loop2: detected capacity change from 0 to 20971520
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 1 transid 5 /dev/loop1 scanned by mkfs.btrfs (1863)
         BTRFS: device fsid 15aa1203-98d3-4a66-bcae-ca82f629c2cd devid 2 transid 5 /dev/loop2 scanned by mkfs.btrfs (1863)
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): checking UUID tree
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
         BTRFS info (device loop1): flagging fs with big metadata feature
         BTRFS info (device loop1): allowing degraded mounts
         BTRFS info (device loop1): disk space caching is enabled
         BTRFS info (device loop1): has skinny extents
         BTRFS info (device loop1):  before clear_missing.00000000975bd577 /dev/loop1 0
         BTRFS warning (device loop1): devid 2 uuid 8b333791-0b3f-4f57-b449-1c1ab6b51f38 is missing
         BTRFS info (device loop1):  before clear_missing.0000000000000000 /dev/loop2 1
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NLi Zhang <zhanglikernel@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Reviewed-by: NWeilong Chen <chenweilong@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      08a4f14e
  4. 15 11月, 2021 4 次提交
    • F
      btrfs: fix mount failure due to past and transient device flush error · 09fa6f6e
      Filipe Manana 提交于
      stable inclusion
      from stable-5.10.72
      commit 63c89930d4b5f6205b4f4f13498a37cee86d80fa
      bugzilla: 182982 https://gitee.com/openeuler/kernel/issues/I4I3L1
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=63c89930d4b5f6205b4f4f13498a37cee86d80fa
      
      --------------------------------
      
      [ Upstream commit 6b225baa ]
      
      When we get an error flushing one device, during a super block commit, we
      record the error in the device structure, in the field 'last_flush_error'.
      This is used to later check if we should error out the super block commit,
      depending on whether the number of flush errors is greater than or equals
      to the maximum tolerated device failures for a raid profile.
      
      However if we get a transient device flush error, unmount the filesystem
      and later try to mount it, we can fail the mount because we treat that
      past error as critical and consider the device is missing. Even if it's
      very likely that the error will happen again, as it's probably due to a
      hardware related problem, there may be cases where the error might not
      happen again. One example is during testing, and a test case like the
      new generic/648 from fstests always triggers this. The test cases
      generic/019 and generic/475 also trigger this scenario, but very
      sporadically.
      
      When this happens we get an error like this:
      
        $ mount /dev/sdc /mnt
        mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.
      
        $ dmesg
        (...)
        [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
        [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
        [12918.890853] BTRFS error (device sdc): open_ctree failed
      
      The failure happens because when btrfs_check_rw_degradable() is called at
      mount time, or at remount from RO to RW time, is sees a non zero value in
      a device's ->last_flush_error attribute, and therefore considers that the
      device is 'missing'.
      
      Fix this by setting a device's ->last_flush_error to zero when we close a
      device, making sure the error is not seen on the next mount attempt. We
      only need to track flush errors during the current mount, so that we never
      commit a super block if such errors happened.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      09fa6f6e
    • S
      treewide: Change list_sort to use const pointers · 962e9cbe
      Sami Tolvanen 提交于
      stable inclusion
      from stable-5.10.70
      commit 55e6f8b3c0f5cc600df12ddd0371d2703b910fd7
      bugzilla: 182949 https://gitee.com/openeuler/kernel/issues/I4I3GQ
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=55e6f8b3c0f5cc600df12ddd0371d2703b910fd7
      
      --------------------------------
      
      [ Upstream commit 4f0f586b ]
      
      list_sort() internally casts the comparison function passed to it
      to a different type with constant struct list_head pointers, and
      uses this pointer to call the functions, which trips indirect call
      Control-Flow Integrity (CFI) checking.
      
      Instead of removing the consts, this change defines the
      list_cmp_func_t type and changes the comparison function types of
      all list_sort() callers to use const pointers, thus avoiding type
      mismatches.
      Suggested-by: NNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: NSami Tolvanen <samitolvanen@google.com>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Tested-by: NNick Desaulniers <ndesaulniers@google.com>
      Tested-by: NNathan Chancellor <nathan@kernel.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      962e9cbe
    • A
      btrfs: fix lockdep warning while mounting sprout fs · c9e65b98
      Anand Jain 提交于
      stable inclusion
      from stable-5.10.69
      commit aa1af89a6697ed8f1f0d3d62029036158a4486d9
      bugzilla: 182675 https://gitee.com/openeuler/kernel/issues/I4I3ED
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=aa1af89a6697ed8f1f0d3d62029036158a4486d9
      
      --------------------------------
      
      [ Upstream commit c1247069 ]
      
      Following test case reproduces lockdep warning.
      
        Test case:
      
        $ mkfs.btrfs -f <dev1>
        $ btrfstune -S 1 <dev1>
        $ mount <dev1> <mnt>
        $ btrfs device add <dev2> <mnt> -f
        $ umount <mnt>
        $ mount <dev2> <mnt>
        $ umount <mnt>
      
      The warning claims a possible ABBA deadlock between the threads
      initiated by [#1] btrfs device add and [#0] the mount.
      
        [ 540.743122] WARNING: possible circular locking dependency detected
        [ 540.743129] 5.11.0-rc7+ #5 Not tainted
        [ 540.743135] ------------------------------------------------------
        [ 540.743142] mount/2515 is trying to acquire lock:
        [ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs]
        [ 540.743458] but task is already holding lock:
        [ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
        [ 540.743541] which lock already depends on the new lock.
        [ 540.743543] the existing dependency chain (in reverse order) is:
      
        [ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}:
        [ 540.743566] down_read_nested+0x48/0x2b0
        [ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs]
        [ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs]
        [ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs]
        [ 540.743785] btrfs_update_device+0x83/0x260 [btrfs]
        [ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex
        [ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs]
        [ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs]
        [ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs]
        [ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs]
        [ 540.744166] __x64_sys_ioctl+0xe4/0x140
        [ 540.744170] do_syscall_64+0x4b/0x80
        [ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
        [ 540.744184] __lock_acquire+0x155f/0x2360
        [ 540.744188] lock_acquire+0x10b/0x5c0
        [ 540.744190] __mutex_lock+0xb1/0xf80
        [ 540.744193] mutex_lock_nested+0x27/0x30
        [ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs]
        [ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs]
        [ 540.744336] open_ctree+0xf6e/0x2074 [btrfs]
        [ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs]
        [ 540.744472] legacy_get_tree+0x38/0x90
        [ 540.744475] vfs_get_tree+0x30/0x140
        [ 540.744478] fc_mount+0x16/0x60
        [ 540.744482] vfs_kern_mount+0x91/0x100
        [ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs]
        [ 540.744536] legacy_get_tree+0x38/0x90
        [ 540.744537] vfs_get_tree+0x30/0x140
        [ 540.744539] path_mount+0x8d8/0x1070
        [ 540.744541] do_mount+0x8d/0xc0
        [ 540.744543] __x64_sys_mount+0x125/0x160
        [ 540.744545] do_syscall_64+0x4b/0x80
        [ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [ 540.744551] other info that might help us debug this:
        [ 540.744552] Possible unsafe locking scenario:
      
        [ 540.744553] CPU0 				CPU1
        [ 540.744554] ---- 				----
        [ 540.744555] lock(btrfs-chunk-00);
        [ 540.744557] 					lock(&fs_devs->device_list_mutex);
        [ 540.744560] 					lock(btrfs-chunk-00);
        [ 540.744562] lock(&fs_devs->device_list_mutex);
        [ 540.744564]
         *** DEADLOCK ***
      
        [ 540.744565] 3 locks held by mount/2515:
        [ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450
        [ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs]
        [ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
        [ 540.744708]
         stack backtrace:
        [ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5
      
      But the device_list_mutex in clone_fs_devices() is redundant, as
      explained below.  Two threads [1]  and [2] (below) could lead to
      clone_fs_device().
      
        [1]
        open_ctree <== mount sprout fs
         btrfs_read_chunk_tree()
          mutex_lock(&uuid_mutex) <== global lock
          read_one_dev()
           open_seed_devices()
            clone_fs_devices() <== seed fs_devices
             mutex_lock(&orig->device_list_mutex) <== seed fs_devices
      
        [2]
        btrfs_init_new_device() <== sprouting
         mutex_lock(&uuid_mutex); <== global lock
         btrfs_prepare_sprout()
           lockdep_assert_held(&uuid_mutex)
           clone_fs_devices(seed_fs_device) <== seed fs_devices
      
      Both of these threads hold uuid_mutex which is sufficient to protect
      getting the seed device(s) freed while we are trying to clone it for
      sprouting [2] or mounting a sprout [1] (as above). A mounted seed device
      can not free/write/replace because it is read-only. An unmounted seed
      device can be freed by btrfs_free_stale_devices(), but it needs
      uuid_mutex.  So this patch removes the unnecessary device_list_mutex in
      clone_fs_devices().  And adds a lockdep_assert_held(&uuid_mutex) in
      clone_fs_devices().
      Reported-by: NSu Yue <l@damenly.su>
      Tested-by: NSu Yue <l@damenly.su>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      c9e65b98
    • J
      btrfs: update the bdev time directly when closing · 2f472c80
      Josef Bacik 提交于
      stable inclusion
      from stable-5.10.69
      commit c43803c1aa76f2cc15ee641564412741a18028ff
      bugzilla: 182675 https://gitee.com/openeuler/kernel/issues/I4I3ED
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c43803c1aa76f2cc15ee641564412741a18028ff
      
      --------------------------------
      
      [ Upstream commit 8f96a5bf ]
      
      We update the ctime/mtime of a block device when we remove it so that
      blkid knows the device changed.  However we do this by re-opening the
      block device and calling filp_update_time.  This is more correct because
      it'll call the inode->i_op->update_time if it exists, but the block dev
      inodes do not do this.  Instead call generic_update_time() on the
      bd_inode in order to avoid the blkdev_open path and get rid of the
      following lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #406 Not tainted
      ------------------------------------------------------
      losetup/11596 is trying to acquire lock:
      ffff939640d2f538 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_get_by_dev.part.0+0x56/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             file_open_name+0xc7/0x170
             filp_open+0x2c/0x50
             btrfs_scratch_superblocks.part.0+0x10f/0x170
             btrfs_rm_device.cold+0xe8/0xed
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11596:
       #0: ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 1 PID: 11596 Comm: losetup Not tainted 5.14.0-rc2+ #406
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      2f472c80
  5. 21 10月, 2021 1 次提交
    • D
      btrfs: reset replace target device to allocation state on close · 442fd787
      Desmond Cheong Zhi Xi 提交于
      stable inclusion
      from stable-5.10.67
      commit c1b249e02a80347fe519b761a8c66008e5e7dcbf
      bugzilla: 182619 https://gitee.com/openeuler/kernel/issues/I4EWO7
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c1b249e02a80347fe519b761a8c66008e5e7dcbf
      
      --------------------------------
      
      commit 0d977e0e upstream.
      
      This crash was observed with a failed assertion on device close:
      
        BTRFS: Transaction aborted (error -28)
        WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
        Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
        CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
        RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
        RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
        RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
        R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
        FS:  0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
        Call Trace:
         flush_space+0x197/0x2f0 [btrfs]
         btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
         process_one_work+0x262/0x5e0
         worker_thread+0x4c/0x320
         ? process_one_work+0x5e0/0x5e0
         kthread+0x144/0x170
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x1f/0x30
        irq event stamp: 19334989
        hardirqs last  enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
        hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
        softirqs last  enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
        softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
        ---[ end trace 45939e308e0dd3c7 ]---
        BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
        BTRFS info (device vdd): forced readonly
        BTRFS warning (device vdd): failed setting block group ro: -30
        BTRFS info (device vdd): suspending dev_replace for unmount
        assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3431!
        invalid opcode: 0000 [#1] PREEMPT SMP
        CPU: 1 PID: 3982 Comm: umount Tainted: G        W         5.14.0-rc5-default+ #1532
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
        RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
        RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
        RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
        R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
        FS:  00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
        Call Trace:
         btrfs_close_one_device.cold+0x11/0x55 [btrfs]
         close_fs_devices+0x44/0xb0 [btrfs]
         btrfs_close_devices+0x48/0x160 [btrfs]
         generic_shutdown_super+0x69/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x2c/0xa0
         cleanup_mnt+0x144/0x1b0
         task_work_run+0x59/0xa0
         exit_to_user_mode_loop+0xe7/0xf0
         exit_to_user_mode_prepare+0xaf/0xf0
         syscall_exit_to_user_mode+0x19/0x50
         do_syscall_64+0x4a/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This happens when close_ctree is called while a dev_replace hasn't
      completed. In close_ctree, we suspend the dev_replace, but keep the
      replace target around so that we can resume the dev_replace procedure
      when we mount the root again. This is the call trace:
      
        close_ctree():
          btrfs_dev_replace_suspend_for_unmount();
          btrfs_close_devices():
            btrfs_close_fs_devices():
              btrfs_close_one_device():
                ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
                       &device->dev_state));
      
      However, since the replace target sticks around, there is a device
      with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
      assertion in btrfs_close_one_device.
      
      To fix this, if we come across the replace target device when
      closing, we should properly reset it back to allocation state. This
      fix also ensures that if a non-target device has a corrupted state and
      has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
      catch the error.
      Reported-by: NDavid Sterba <dsterba@suse.com>
      Fixes: b2a61667 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDesmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      442fd787
  6. 19 10月, 2021 1 次提交
  7. 15 10月, 2021 1 次提交
    • D
      btrfs: fix rw device counting in __btrfs_free_extra_devids · 30fe79cf
      Desmond Cheong Zhi Xi 提交于
      stable inclusion
      from stable-5.10.56
      commit 4e1a57d75264dd4f10f3497c35dda521947368df
      bugzilla: 176004 https://gitee.com/openeuler/kernel/issues/I4DYZ4
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4e1a57d75264dd4f10f3497c35dda521947368df
      
      --------------------------------
      
      commit b2a61667 upstream.
      
      When removing a writeable device in __btrfs_free_extra_devids, the rw
      device count should be decremented.
      
      This error was caught by Syzbot which reported a warning in
      close_fs_devices:
      
        WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
        Modules linked in:
        CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
        RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
        RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
        RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
        R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
        R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
        FS:  00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
         open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
         btrfs_fill_super fs/btrfs/super.c:1382 [inline]
         btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x86/0x270 fs/super.c:1498
         fc_mount fs/namespace.c:993 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
         btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x86/0x270 fs/super.c:1498
         do_new_mount fs/namespace.c:2905 [inline]
         path_mount+0x196f/0x2be0 fs/namespace.c:3235
         do_mount fs/namespace.c:3248 [inline]
         __do_sys_mount fs/namespace.c:3456 [inline]
         __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
         do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Because fs_devices->rw_devices was not 0 after
      closing all devices. Here is the call trace that was observed:
      
        btrfs_mount_root():
          btrfs_scan_one_device():
            device_list_add();   <---------------- device added
          btrfs_open_devices():
            open_fs_devices():
              btrfs_open_one_device();   <-------- writable device opened,
      	                                     rw device count ++
          btrfs_fill_super():
            open_ctree():
              btrfs_free_extra_devids():
      	  __btrfs_free_extra_devids();  <--- writable device removed,
      	                              rw device count not decremented
      	  fail_tree_roots:
      	    btrfs_close_devices():
      	      close_fs_devices();   <------- rw device count off by 1
      
      As a note, prior to commit cf89af14 ("btrfs: dev-replace: fail
      mount if we don't have replace item with target device"), rw_devices
      was decremented on removing a writable device in
      __btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
      was not set for the device. However, this check does not need to be
      reinstated as it is now redundant and incorrect.
      
      In __btrfs_free_extra_devids, we skip removing the device if it is the
      target for replacement. This is done by checking whether device->devid
      == BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
      only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
      should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
      and so it's redundant to test for that bit.
      
      Additionally, following commit 82372bc8 ("Btrfs: make
      the logic of source device removing more clear"), rw_devices is
      incremented whenever a writeable device is added to the alloc
      list (including the target device in btrfs_dev_replace_finishing), so
      all removals of writable devices from the alloc list should also be
      accompanied by a decrement to rw_devices.
      
      Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
      Fixes: cf89af14 ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
      CC: stable@vger.kernel.org # 5.10+
      Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDesmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NWeilong Chen <chenweilong@huawei.com>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      30fe79cf
  8. 09 2月, 2021 1 次提交
    • S
      btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch · 8d385b73
      Su Yue 提交于
      stable inclusion
      from stable-5.10.13
      commit f343bf1aaf5517683d77be1f78da0d7117770464
      bugzilla: 47995
      
      --------------------------------
      
      commit c41ec452 upstream.
      
      This effectively reverts commit d5c82388 ("btrfs: convert
      data_seqcount to seqcount_mutex_t").
      
      While running fstests on 32 bits test box, many tests failed because of
      warnings in dmesg. One of those warnings (btrfs/003):
      
        [66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
        [66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G           O      5.11.0-rc4-custom+ #5
        [66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
        [66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
        [66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
        [66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
        [66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
        [66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
        [66.441485] Call Trace:
        [66.441510]  btrfs_relocate_chunk+0xb1/0x100 [btrfs]
        [66.441529]  ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
        [66.441562]  btrfs_balance+0x8ed/0x13b0 [btrfs]
        [66.441586]  ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
        [66.441619]  ? __this_cpu_preempt_check+0xf/0x11
        [66.441643]  btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
        [66.441664]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441683]  btrfs_ioctl+0x414/0x2ae0 [btrfs]
        [66.441700]  ? __lock_acquire+0x35f/0x2650
        [66.441717]  ? lockdep_hardirqs_on+0x87/0x120
        [66.441720]  ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
        [66.441724]  ? call_rcu+0x2d3/0x530
        [66.441731]  ? __might_fault+0x41/0x90
        [66.441736]  ? kvm_sched_clock_read+0x15/0x50
        [66.441740]  ? sched_clock+0x8/0x10
        [66.441745]  ? sched_clock_cpu+0x13/0x180
        [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [66.441768]  __ia32_sys_ioctl+0x165/0x8a0
        [66.441773]  ? __this_cpu_preempt_check+0xf/0x11
        [66.441785]  ? __might_fault+0x89/0x90
        [66.441791]  __do_fast_syscall_32+0x54/0x80
        [66.441796]  do_fast_syscall_32+0x32/0x70
        [66.441801]  do_SYSENTER_32+0x15/0x20
        [66.441805]  entry_SYSENTER_32+0x9f/0xf2
        [66.441808] EIP: 0xab7b5549
        [66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
        [66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
        [66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
        [66.441833] irq event stamp: 42579
        [66.441835] hardirqs last  enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
        [66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
        [66.441840] softirqs last  enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
        [66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60
      
        ========================================================================
        btrfs_remove_chunk+0x58b/0x7b0:
        __seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
        (inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
        (inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
        ========================================================================
      
      The warning is produced by lockdep_assert_held() in
      __seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
      And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
      fs_info->chunk_mutex held already.
      
      After adding some debug prints, the cause was found that many
      __alloc_device() are called with NULL @fs_info (during scanning ioctl).
      Inside the function, btrfs_device_data_ordered_init() is expanded to
      seqcount_mutex_init().  In this scenario, its second
      parameter info->chunk_mutex  is &NULL->chunk_mutex which equals
      to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
      seqcount_mutex_init() is called in wrong way. And later
      btrfs_device_get/set helpers trigger lockdep warnings.
      
      The device and filesystem object lifetimes are different and we'd have
      to synchronize initialization of the btrfs_device::data_seqcount with
      the fs_info, possibly using some additional synchronization. It would
      still not prevent concurrent access to the seqcount lock when it's used
      for read and initialization.
      
      Commit d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
      does not mention a particular problem being fixed so revert should not
      cause any harm and we'll get the lockdep warning fixed.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139Reported-by: NErhard F <erhard_f@mailbox.org>
      Fixes: d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
      CC: stable@vger.kernel.org # 5.10
      CC: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: NSu Yue <l@damenly.su>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      8d385b73
  9. 08 2月, 2021 1 次提交
    • J
      btrfs: fix lockdep splat in btrfs_recover_relocation · d26e08a4
      Josef Bacik 提交于
      stable inclusion
      from stable-5.10.11
      commit 14e17e90bfaaf0392d8a48744f91d81ea121fd10
      bugzilla: 47621
      
      --------------------------------
      
      commit fb286100 upstream.
      
      While testing the error paths of relocation I hit the following lockdep
      splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.10.0-rc6+ #217 Not tainted
        ------------------------------------------------------
        mount/779 is trying to acquire lock:
        ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340
      
        but task is already holding lock:
        ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-root-00){++++}-{3:3}:
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x27/0x100
      	 btrfs_read_lock_root_node+0x31/0x40
      	 btrfs_search_slot+0x462/0x8f0
      	 btrfs_update_root+0x55/0x2b0
      	 btrfs_drop_snapshot+0x398/0x750
      	 clean_dirty_subvols+0xdf/0x120
      	 btrfs_recover_relocation+0x534/0x5a0
      	 btrfs_start_pre_rw_mount+0xcb/0x170
      	 open_ctree+0x151f/0x1726
      	 btrfs_mount_root.cold+0x12/0xea
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 vfs_kern_mount.part.0+0x71/0xb0
      	 btrfs_mount+0x10d/0x380
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x433/0xc10
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (sb_internal#2){.+.+}-{0:0}:
      	 start_transaction+0x444/0x700
      	 insert_balance_item.isra.0+0x37/0x320
      	 btrfs_balance+0x354/0xf40
      	 btrfs_ioctl_balance+0x2cf/0x380
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x1120/0x1e10
      	 lock_acquire+0x116/0x370
      	 __mutex_lock+0x7e/0x7b0
      	 btrfs_recover_balance+0x2f0/0x340
      	 open_ctree+0x1095/0x1726
      	 btrfs_mount_root.cold+0x12/0xea
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 vfs_kern_mount.part.0+0x71/0xb0
      	 btrfs_mount+0x10d/0x380
      	 legacy_get_tree+0x30/0x50
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x433/0xc10
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        other info that might help us debug this:
      
        Chain exists of:
          &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-root-00);
      				 lock(sb_internal#2);
      				 lock(btrfs-root-00);
          lock(&fs_info->balance_mutex);
      
         *** DEADLOCK ***
      
        2 locks held by mount/779:
         #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
         #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100
      
        stack backtrace:
        CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         ? trace_call_bpf+0x139/0x260
         __lock_acquire+0x1120/0x1e10
         lock_acquire+0x116/0x370
         ? btrfs_recover_balance+0x2f0/0x340
         __mutex_lock+0x7e/0x7b0
         ? btrfs_recover_balance+0x2f0/0x340
         ? btrfs_recover_balance+0x2f0/0x340
         ? rcu_read_lock_sched_held+0x3f/0x80
         ? kmem_cache_alloc_trace+0x2c4/0x2f0
         ? btrfs_get_64+0x5e/0x100
         btrfs_recover_balance+0x2f0/0x340
         open_ctree+0x1095/0x1726
         btrfs_mount_root.cold+0x12/0xea
         ? rcu_read_lock_sched_held+0x3f/0x80
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         vfs_kern_mount.part.0+0x71/0xb0
         btrfs_mount+0x10d/0x380
         ? __kmalloc_track_caller+0x2f2/0x320
         legacy_get_tree+0x30/0x50
         vfs_get_tree+0x28/0xc0
         ? capable+0x3a/0x60
         path_mount+0x433/0xc10
         __x64_sys_mount+0xe3/0x120
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is straightforward to fix, simply release the path before we setup
      the balance_ctl.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      d26e08a4
  10. 24 11月, 2020 1 次提交
    • J
      btrfs: don't access possibly stale fs_info data for printing duplicate device · 0697d9a6
      Johannes Thumshirn 提交于
      Syzbot reported a possible use-after-free when printing a duplicate device
      warning device_list_add().
      
      At this point it can happen that a btrfs_device::fs_info is not correctly
      setup yet, so we're accessing stale data, when printing the warning
      message using the btrfs_printk() wrappers.
      
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
        Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
      
        CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x1d6/0x29e lib/dump_stack.c:118
         print_address_description+0x66/0x620 mm/kasan/report.c:383
         __kasan_report mm/kasan/report.c:513 [inline]
         kasan_report+0x132/0x1d0 mm/kasan/report.c:530
         btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
         device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
         btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
         btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x44840a
        RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
        RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
        RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
        R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
      
        Allocated by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track mm/kasan/common.c:56 [inline]
         __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
         kmalloc_node include/linux/slab.h:577 [inline]
         kvmalloc_node+0x81/0x110 mm/util.c:574
         kvmalloc include/linux/mm.h:757 [inline]
         kvzalloc include/linux/mm.h:765 [inline]
         btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
         kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
         __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
         __cache_free mm/slab.c:3418 [inline]
         kfree+0x113/0x200 mm/slab.c:3756
         deactivate_locked_super+0xa7/0xf0 fs/super.c:335
         btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff8880878e0000
         which belongs to the cache kmalloc-16k of size 16384
        The buggy address is located 1704 bytes inside of
         16384-byte region [ffff8880878e0000, ffff8880878e4000)
        The buggy address belongs to the page:
        page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
        head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
        flags: 0xfffe0000010200(slab|head)
        raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
        raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      				    ^
         ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      The syzkaller reproducer for this use-after-free crafts a filesystem image
      and loop mounts it twice in a loop. The mount will fail as the crafted
      image has an invalid chunk tree. When this happens btrfs_mount_root() will
      call deactivate_locked_super(), which then cleans up fs_info and
      fs_info::sb. If a second thread now adds the same block-device to the
      filesystem, it will get detected as a duplicate device and
      device_list_add() will reject the duplicate and print a warning. But as
      the fs_info pointer passed in is non-NULL this will result in a
      use-after-free.
      
      Instead of printing possibly uninitialized or already freed memory in
      btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
      device name will be skipped altogether.
      
      There was a slightly different approach discussed in
      https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
      Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0697d9a6
  11. 05 11月, 2020 1 次提交
    • A
      btrfs: dev-replace: fail mount if we don't have replace item with target device · cf89af14
      Anand Jain 提交于
      If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace
      item, then it means the filesystem is inconsistent state. This is either
      corruption or a crafted image.  Fail the mount as this needs a closer
      look what is actually wrong.
      
      As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace
      item, in __btrfs_free_extra_devids() we determine that there is an
      extra device, and free those extra devices but continue to mount the
      device.
      However, we were wrong in keeping tack of the rw_devices so the syzbot
      testcase failed:
      
        WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
        Kernel panic - not syncing: panic_on_warn set ...
        CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x198/0x1fd lib/dump_stack.c:118
         panic+0x347/0x7c0 kernel/panic.c:231
         __warn.cold+0x20/0x46 kernel/panic.c:600
         report_bug+0x1bd/0x210 lib/bug.c:198
         handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
         exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
         asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
        RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166
        RSP: 0018:ffffc900091777e0 EFLAGS: 00010246
        RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000
        RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007
        RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130
        R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050
         close_fs_devices fs/btrfs/volumes.c:1193 [inline]
         btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
         open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434
         btrfs_fill_super fs/btrfs/super.c:1316 [inline]
         btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672
      
      The fix here is, when we determine that there isn't a replace item
      then fail the mount if there is a replace target device (devid 0).
      
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cf89af14
  12. 27 10月, 2020 1 次提交
  13. 26 10月, 2020 1 次提交
    • F
      btrfs: fix readahead hang and use-after-free after removing a device · 66d204a1
      Filipe Manana 提交于
      Very sporadically I had test case btrfs/069 from fstests hanging (for
      years, it is not a recent regression), with the following traces in
      dmesg/syslog:
      
        [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
        [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
        [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
        [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
        [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
        [162513.514751] Call Trace:
        [162513.514761]  __schedule+0x5ce/0xd00
        [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.514771]  schedule+0x46/0xf0
        [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.514850]  ? finish_wait+0x90/0x90
        [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
        [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
        [162513.514894]  kthread+0x153/0x170
        [162513.514897]  ? kthread_stop+0x2c0/0x2c0
        [162513.514902]  ret_from_fork+0x22/0x30
        [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
        [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
        [162513.515682] Call Trace:
        [162513.515688]  __schedule+0x5ce/0xd00
        [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.515697]  schedule+0x46/0xf0
        [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.515716]  ? finish_wait+0x90/0x90
        [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.515761]  iterate_supers+0x87/0xf0
        [162513.515765]  ksys_sync+0x60/0xb0
        [162513.515768]  __do_sys_sync+0xa/0x10
        [162513.515771]  do_syscall_64+0x33/0x80
        [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.515781] RIP: 0033:0x7f5238f50bd7
        [162513.515782] Code: Bad RIP value.
        [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
        [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
        [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
        [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
        [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
        [162513.516620] Call Trace:
        [162513.516625]  __schedule+0x5ce/0xd00
        [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.516634]  schedule+0x46/0xf0
        [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.516650]  ? finish_wait+0x90/0x90
        [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
        [162513.516686]  __vfs_setxattr+0x66/0x80
        [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
        [162513.516697]  vfs_setxattr+0x6b/0x120
        [162513.516703]  setxattr+0x125/0x240
        [162513.516709]  ? lock_acquire+0xb1/0x480
        [162513.516712]  ? mnt_want_write+0x20/0x50
        [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
        [162513.516723]  ? preempt_count_add+0x49/0xa0
        [162513.516725]  ? __sb_start_write+0x19b/0x290
        [162513.516727]  ? preempt_count_add+0x49/0xa0
        [162513.516732]  path_setxattr+0xba/0xd0
        [162513.516739]  __x64_sys_setxattr+0x27/0x30
        [162513.516741]  do_syscall_64+0x33/0x80
        [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.516745] RIP: 0033:0x7f5238f56d5a
        [162513.516746] Code: Bad RIP value.
        [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
        [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
        [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
        [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
        [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
        [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
        [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
        [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
        [162513.517780] Call Trace:
        [162513.517786]  __schedule+0x5ce/0xd00
        [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.517796]  schedule+0x46/0xf0
        [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.517814]  ? finish_wait+0x90/0x90
        [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.517865]  iterate_supers+0x87/0xf0
        [162513.517869]  ksys_sync+0x60/0xb0
        [162513.517872]  __do_sys_sync+0xa/0x10
        [162513.517875]  do_syscall_64+0x33/0x80
        [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.517881] RIP: 0033:0x7f5238f50bd7
        [162513.517883] Code: Bad RIP value.
        [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
        [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
        [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
        [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
        [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
        [162513.519160] Call Trace:
        [162513.519165]  __schedule+0x5ce/0xd00
        [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.519174]  schedule+0x46/0xf0
        [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.519193]  ? finish_wait+0x90/0x90
        [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
        [162513.519230]  lookup_open+0x522/0x650
        [162513.519246]  path_openat+0x2b8/0xa50
        [162513.519270]  do_filp_open+0x91/0x100
        [162513.519275]  ? find_held_lock+0x32/0x90
        [162513.519280]  ? lock_acquired+0x33b/0x470
        [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
        [162513.519287]  ? _raw_spin_unlock+0x29/0x40
        [162513.519295]  do_sys_openat2+0x20d/0x2d0
        [162513.519300]  do_sys_open+0x44/0x80
        [162513.519304]  do_syscall_64+0x33/0x80
        [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.519309] RIP: 0033:0x7f5238f4a903
        [162513.519310] Code: Bad RIP value.
        [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
        [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
        [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
        [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
        [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
        [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
        [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
        [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
        [162513.520511] Call Trace:
        [162513.520516]  __schedule+0x5ce/0xd00
        [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.520525]  schedule+0x46/0xf0
        [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
        [162513.520548]  ? finish_wait+0x90/0x90
        [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
        [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
        [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
        [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
        [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
        [162513.520643]  ? do_sigaction+0xf3/0x240
        [162513.520645]  ? find_held_lock+0x32/0x90
        [162513.520648]  ? do_sigaction+0xf3/0x240
        [162513.520651]  ? lock_acquired+0x33b/0x470
        [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
        [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
        [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
        [162513.520662]  ? do_sigaction+0xf3/0x240
        [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
        [162513.520672]  __x64_sys_ioctl+0x83/0xb0
        [162513.520677]  do_syscall_64+0x33/0x80
        [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.520681] RIP: 0033:0x7fc3cd307d87
        [162513.520682] Code: Bad RIP value.
        [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
        [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
        [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
        [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
        [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
        [162513.520703]
      		  Showing all locks held in the system:
        [162513.520712] 1 lock held by khungtaskd/54:
        [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
        [162513.520728] 1 lock held by in:imklog/596:
        [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
        [162513.520782] 1 lock held by btrfs-transacti/1356167:
        [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
        [162513.520798] 1 lock held by btrfs/1356190:
        [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
        [162513.520805] 1 lock held by fsstress/1356184:
        [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520811] 3 locks held by fsstress/1356185:
        [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
        [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520833] 1 lock held by fsstress/1356196:
        [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520838] 3 locks held by fsstress/1356197:
        [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
        [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520858] 2 locks held by btrfs/1356211:
        [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
        [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
      
      This was weird because the stack traces show that a transaction commit,
      triggered by a device replace operation, is blocking trying to pause any
      running scrubs but there are no stack traces of blocked tasks doing a
      scrub.
      
      After poking around with drgn, I noticed there was a scrub task that was
      constantly running and blocking for shorts periods of time:
      
        >>> t = find_task(prog, 1356190)
        >>> prog.stack_trace(t)
        #0  __schedule+0x5ce/0xcfc
        #1  schedule+0x46/0xe4
        #2  schedule_timeout+0x1df/0x475
        #3  btrfs_reada_wait+0xda/0x132
        #4  scrub_stripe+0x2a8/0x112f
        #5  scrub_chunk+0xcd/0x134
        #6  scrub_enumerate_chunks+0x29e/0x5ee
        #7  btrfs_scrub_dev+0x2d5/0x91b
        #8  btrfs_ioctl+0x7f5/0x36e7
        #9  __x64_sys_ioctl+0x83/0xb0
        #10 do_syscall_64+0x33/0x77
        #11 entry_SYSCALL_64+0x7c/0x156
      
      Which corresponds to:
      
      int btrfs_reada_wait(void *handle)
      {
          struct reada_control *rc = handle;
          struct btrfs_fs_info *fs_info = rc->fs_info;
      
          while (atomic_read(&rc->elems)) {
              if (!atomic_read(&fs_info->reada_works_cnt))
                  reada_start_machine(fs_info);
              wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                                (HZ + 9) / 10);
          }
      (...)
      
      So the counter "rc->elems" was set to 1 and never decreased to 0, causing
      the scrub task to loop forever in that function. Then I used the following
      script for drgn to check the readahead requests:
      
        $ cat dump_reada.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        def dump_re(re):
            nzones = re.nzones.value_()
            print(f're at {hex(re.value_())}')
            print(f'\t logical {re.logical.value_()}')
            print(f'\t refcnt {re.refcnt.value_()}')
            print(f'\t nzones {nzones}')
            for i in range(nzones):
                dev = re.zones[i].device
                name = dev.name.str.string_()
                print(f'\t\t dev id {dev.devid.value_()} name {name}')
            print()
      
        for _, e in radix_tree_for_each(fs_info.reada_tree):
            re = cast('struct reada_extent *', e)
            dump_re(re)
      
        $ drgn dump_reada.py
        re at 0xffff8f3da9d25ad8
                logical 38928384
                refcnt 1
                nzones 1
                       dev id 0 name b'/dev/sdd'
        $
      
      So there was one readahead extent with a single zone corresponding to the
      source device of that last device replace operation logged in dmesg/syslog.
      Also the ID of that zone's device was 0 which is a special value set in
      the source device of a device replace operation when the operation finishes
      (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
      confirming again that device /dev/sdd was the source of a device replace
      operation.
      
      Normally there should be as many zones in the readahead extent as there are
      devices, and I wasn't expecting the extent to be in a block group with a
      'single' profile, so I went and confirmed with the following drgn script
      that there weren't any single profile block groups:
      
        $ cat dump_block_groups.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        BTRFS_BLOCK_GROUP_DATA = (1 << 0)
        BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
        BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
        BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
        BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
        BTRFS_BLOCK_GROUP_DUP = (1 << 5)
        BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
        BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
        BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
        BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
        BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
      
        def bg_flags_string(bg):
            flags = bg.flags.value_()
            ret = ''
            if flags & BTRFS_BLOCK_GROUP_DATA:
                ret = 'data'
            if flags & BTRFS_BLOCK_GROUP_METADATA:
                if len(ret) > 0:
                    ret += '|'
                ret += 'meta'
            if flags & BTRFS_BLOCK_GROUP_SYSTEM:
                if len(ret) > 0:
                    ret += '|'
                ret += 'system'
            if flags & BTRFS_BLOCK_GROUP_RAID0:
                ret += ' raid0'
            elif flags & BTRFS_BLOCK_GROUP_RAID1:
                ret += ' raid1'
            elif flags & BTRFS_BLOCK_GROUP_DUP:
                ret += ' dup'
            elif flags & BTRFS_BLOCK_GROUP_RAID10:
                ret += ' raid10'
            elif flags & BTRFS_BLOCK_GROUP_RAID5:
                ret += ' raid5'
            elif flags & BTRFS_BLOCK_GROUP_RAID6:
                ret += ' raid6'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
                ret += ' raid1c3'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
                ret += ' raid1c4'
            else:
                ret += ' single'
      
            return ret
      
        def dump_bg(bg):
            print()
            print(f'block group at {hex(bg.value_())}')
            print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
            print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
      
        bg_root = fs_info.block_group_cache_tree.address_of_()
        for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
            dump_bg(bg)
      
        $ drgn dump_block_groups.py
      
        block group at 0xffff8f3d673b0400
               start 22020096 length 16777216
               flags 258 - system raid6
      
        block group at 0xffff8f3d53ddb400
               start 38797312 length 536870912
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4d9c00
               start 575668224 length 2147483648
               flags 257 - data raid6
      
        block group at 0xffff8f3d08189000
               start 2723151872 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3db70ff000
               start 2790260736 length 1073741824
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4dd800
               start 3864002560 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3d67037000
               start 3931111424 length 2147483648
               flags 257 - data raid6
        $
      
      So there were only 2 reasons left for having a readahead extent with a
      single zone: reada_find_zone(), called when creating a readahead extent,
      returned NULL either because we failed to find the corresponding block
      group or because a memory allocation failed. With some additional and
      custom tracing I figured out that on every further ocurrence of the
      problem the block group had just been deleted when we were looping to
      create the zones for the readahead extent (at reada_find_extent()), so we
      ended up with only one zone in the readahead extent, corresponding to a
      device that ends up getting replaced.
      
      So after figuring that out it became obvious why the hang happens:
      
      1) Task A starts a scrub on any device of the filesystem, except for
         device /dev/sdd;
      
      2) Task B starts a device replace with /dev/sdd as the source device;
      
      3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
         starting to scrub a stripe from block group X. This call to
         btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
         calls reada_add_block(), it passes the logical address of the extent
         tree's root node as its 'logical' argument - a value of 38928384;
      
      4) Task A then enters reada_find_extent(), called from reada_add_block().
         It finds there isn't any existing readahead extent for the logical
         address 38928384, so it proceeds to the path of creating a new one.
      
         It calls btrfs_map_block() to find out which stripes exist for the block
         group X. On the first iteration of the for loop that iterates over the
         stripes, it finds the stripe for device /dev/sdd, so it creates one
         zone for that device and adds it to the readahead extent. Before getting
         into the second iteration of the loop, the cleanup kthread deletes block
         group X because it was empty. So in the iterations for the remaining
         stripes it does not add more zones to the readahead extent, because the
         calls to reada_find_zone() returned NULL because they couldn't find
         block group X anymore.
      
         As a result the new readahead extent has a single zone, corresponding to
         the device /dev/sdd;
      
      4) Before task A returns to btrfs_reada_add() and queues the readahead job
         for the readahead work queue, task B finishes the device replace and at
         btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
         device /dev/sdg;
      
      5) Task A returns to reada_add_block(), which increments the counter
         "->elems" of the reada_control structure allocated at btrfs_reada_add().
      
         Then it returns back to btrfs_reada_add() and calls
         reada_start_machine(). This queues a job in the readahead work queue to
         run the function reada_start_machine_worker(), which calls
         __reada_start_machine().
      
         At __reada_start_machine() we take the device list mutex and for each
         device found in the current device list, we call
         reada_start_machine_dev() to start the readahead work. However at this
         point the device /dev/sdd was already freed and is not in the device
         list anymore.
      
         This means the corresponding readahead for the extent at 38928384 is
         never started, and therefore the "->elems" counter of the reada_control
         structure allocated at btrfs_reada_add() never goes down to 0, causing
         the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
      
      Note that the readahead request can be made either after the device replace
      started or before it started, however in pratice it is very unlikely that a
      device replace is able to start after a readahead request is made and is
      able to complete before the readahead request completes - maybe only on a
      very small and nearly empty filesystem.
      
      This hang however is not the only problem we can have with readahead and
      device removals. When the readahead extent has other zones other than the
      one corresponding to the device that is being removed (either by a device
      replace or a device remove operation), we risk having a use-after-free on
      the device when dropping the last reference of the readahead extent.
      
      For example if we create a readahead extent with two zones, one for the
      device /dev/sdd and one for the device /dev/sde:
      
      1) Before the readahead worker starts, the device /dev/sdd is removed,
         and the corresponding btrfs_device structure is freed. However the
         readahead extent still has the zone pointing to the device structure;
      
      2) When the readahead worker starts, it only finds device /dev/sde in the
         current device list of the filesystem;
      
      3) It starts the readahead work, at reada_start_machine_dev(), using the
         device /dev/sde;
      
      4) Then when it finishes reading the extent from device /dev/sde, it calls
         __readahead_hook() which ends up dropping the last reference on the
         readahead extent through the last call to reada_extent_put();
      
      5) At reada_extent_put() it iterates over each zone of the readahead extent
         and attempts to delete an element from the device's 'reada_extents'
         radix tree, resulting in a use-after-free, as the device pointer of the
         zone for /dev/sdd is now stale. We can also access the device after
         dropping the last reference of a zone, through reada_zone_release(),
         also called by reada_extent_put().
      
      And a device remove suffers the same problem, however since it shrinks the
      device size down to zero before removing the device, it is very unlikely to
      still have readahead requests not completed by the time we free the device,
      the only possibility is if the device has a very little space allocated.
      
      While the hang problem is exclusive to scrub, since it is currently the
      only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
      problem affects any path that triggers readhead, which includes
      btree_readahead_hook() and __readahead_hook() (a readahead worker can
      trigger readahed for the children of a node) for example - any path that
      ends up calling reada_add_block() can trigger the use-after-free after a
      device is removed.
      
      So fix this by waiting for any readahead requests for a device to complete
      before removing a device, ensuring that while waiting for existing ones no
      new ones can be made.
      
      This problem has been around for a very long time - the readahead code was
      added in 2011, device remove exists since 2008 and device replace was
      introduced in 2013, hard to pick a specific commit for a git Fixes tag.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      66d204a1
  14. 07 10月, 2020 23 次提交
    • A
      btrfs: skip devices without magic signature when mounting · 96c2e067
      Anand Jain 提交于
      Many things can happen after the device is scanned and before the device
      is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
      If it happens we still won't free that device from the memory and cause
      the userland confusion.
      
      For example: As the BTRFS_IOC_DEV_INFO still carries the device path
      which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
      device which does not belong to the filesystem anymore:
      
        $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
        $ wipefs -a /dev/sdb
        # /dev/sdb does not contain magic signature
        $ mount -o degraded /dev/sda /btrfs
        $ btrfs fi show -m
        Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
      	  Total devices 2 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
      	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb
      
      We need to distinguish the missing signature and invalid superblock, so
      add a specific error code ENODATA for that. This also fixes failure of
      fstest btrfs/198.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      96c2e067
    • J
      btrfs: return error if we're unable to read device stats · 92e26df4
      Josef Bacik 提交于
      I noticed when fixing device stats for seed devices that we simply threw
      away the return value from btrfs_search_slot().  This is because we may
      not have stat items, but we could very well get an error, and thus miss
      reporting the error up the chain.
      
      Fix this by returning ret if it's an actual error, and then stop trying
      to init the rest of the devices stats and return the error up the chain.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92e26df4
    • J
      btrfs: init device stats for seed devices · 124604eb
      Josef Bacik 提交于
      We recently started recording device stats across the fleet, and noticed
      a large increase in messages such as this
      
        BTRFS warning (device dm-0): get dev_stats failed, not yet valid
      
      on our tiers that use seed devices for their root devices.  This is
      because we do not initialize the device stats for any seed devices if we
      have a sprout device and mount using that sprout device.  The basic
      steps for reproducing are:
      
        $ mkfs seed device
        $ mount seed device
        # fill seed device
        $ umount seed device
        $ btrfstune -S 1 seed device
        $ mount seed device
        $ btrfs device add -f sprout device /mnt/wherever
        $ umount /mnt/wherever
        $ mount sprout device /mnt/wherever
        $ btrfs device stats /mnt/wherever
      
      This will fail with the above message in dmesg.
      
      Fix this by iterating over the fs_devices->seed if they exist in
      btrfs_init_dev_stats.  This fixed the problem and properly reports the
      stats for both devices.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_device_init_dev_stats ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      124604eb
    • A
      btrfs: simplify gotos in open_seed_device · c83b60c0
      Anand Jain 提交于
      The function does not have a common exit block and returns immediatelly
      so there's no point having the goto. Remove the two cases.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c83b60c0
    • A
      btrfs: remove unnecessary tmp variable in btrfs_assign_next_active_device() · e493e8f9
      Anand Jain 提交于
      We can check the argument value directly, no need for the temporary
      variable.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e493e8f9
    • A
      btrfs: use sprout device_list_mutex in btrfs_init_devices_late · e17125b5
      Anand Jain 提交于
      On a mounted sprout filesystem, all threads now are using the
      sprout::device_list_mutex, and this is the only code using the
      seed::device_list_mutex. This patch converts to use the sprouts
      fs_info->fs_devices->device_list_mutex.
      
      The same reasoning holds true here, that device delete is holding
      the sprout::device_list_mutex.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e17125b5
    • A
      btrfs: split and refactor btrfs_sysfs_remove_devices_dir · 53f8a74c
      Anand Jain 提交于
      Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
      btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
      argument to indicate whether to free all devices or just one device.
      
      Export btrfs_sysfs_remove_device() as device operations outside of
      sysfs.c now calls this instead of btrfs_sysfs_remove_devices_dir().
      
      btrfs_sysfs_remove_devices_dir() is renamed to
      btrfs_sysfs_remove_fs_devices() to suite its new role.
      
      Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices()
      so it is redeclared s static. And the same function had to be moved
      before its first caller.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      53f8a74c
    • A
      btrfs: simplify parameters of btrfs_sysfs_add_devices_dir · cd36da2e
      Anand Jain 提交于
      When we add a device we need to add it to sysfs, so instead of using the
      btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
      add a device or all of fs_devices, call the helper function directly
      btrfs_sysfs_add_device() and thus make it non-static.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cd36da2e
    • A
      btrfs: improve device scanning messages · 79dae17d
      Anand Jain 提交于
      Systems booting without the initramfs seems to scan an unusual kind
      of device path (/dev/root). And at a later time, the device is updated
      to the correct path. We generally print the process name and PID of the
      process scanning the device but we don't capture the same information if
      the device path is rescanned with a different pathname.
      
      The current message is too long, so drop the unnecessary UUID and add
      process name and PID.
      
      While at this also update the duplicate device warning to include the
      process name and PID so the messages are consistent
      
      CC: stable@vger.kernel.org # 4.19+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79dae17d
    • G
      btrfs: enumerate the type of exclusive operation in progress · c3e1f96c
      Goldwyn Rodrigues 提交于
      Instead of using a flag bit for exclusive operation, use a variable to
      store which exclusive operation is being performed.  Introduce an API
      to start and finish an exclusive operation.
      
      This would enable another way for tools to check which operation is
      running on why starting an exclusive operation failed. The followup
      patch adds a sysfs_notify() to alert userspace when the state changes, so
      userspace can perform select() on it to get notified of the change.
      
      This would enable us to enqueue a command which will wait for current
      exclusive operation to complete before issuing the next exclusive
      operation. This has been done synchronously as opposed to a background
      process, or else error collection (if any) will become difficult.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c3e1f96c
    • J
      btrfs: sysfs: init devices outside of the chunk_mutex · ca10845a
      Josef Bacik 提交于
      While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #4 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_link+0x63/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x81/0x130
      	 btrfs_init_new_device+0x67f/0x1250
      	 btrfs_ioctl+0x1ef/0x2e20
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_insert_delayed_items+0x90/0x4f0
      	 btrfs_commit_inode_delayed_items+0x93/0x140
      	 btrfs_log_inode+0x5de/0x2020
      	 btrfs_log_inode_parent+0x429/0xc90
      	 btrfs_log_new_name+0x95/0x9b
      	 btrfs_rename2+0xbb9/0x1800
      	 vfs_rename+0x64f/0x9f0
      	 do_renameat2+0x320/0x4e0
      	 __x64_sys_rename+0x1f/0x30
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because we are holding the chunk_mutex at the time of
      adding in a new device.  However we only need to hold the
      device_list_mutex, as we're going to iterate over the fs_devices
      devices.  Move the sysfs init stuff outside of the chunk_mutex to get
      rid of this lockdep splat.
      
      CC: stable@vger.kernel.org # 4.4.x: f3cd2c58: btrfs: sysfs, rename device_link add/remove functions
      CC: stable@vger.kernel.org # 4.4.x
      Reported-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ca10845a
    • N
      btrfs: don't opencode sync_blockdev in btrfs_init_new_device · b9ba017f
      Nikolay Borisov 提交于
      Instead of opencoding filemap_write_and_wait simply call syncblockdev as
      it makes it abundantly clear what's going on and why this is used. No
      semantics changes.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b9ba017f
    • N
      btrfs: remove redundant code from btrfs_free_stale_devices · 4ae312e9
      Nikolay Borisov 提交于
      Following the refactor of btrfs_free_stale_devices in
      7bcb8164 ("btrfs: use device_list_mutex when removing stale devices")
      fs_devices are freed after they have been iterated by the inner
      list_for_each so the use-after-free fixed by introducing the break in
      fd649f10 ("btrfs: Fix use-after-free when cleaning up fs_devs with
      a single stale device") is no longer necessary. Just remove it
      altogether. No functional changes.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4ae312e9
    • N
      btrfs: refactor locked condition in btrfs_init_new_device · 44cab9ba
      Nikolay Borisov 提交于
      Invert unlocked to locked and exploit the fact it can only ever be
      modified if we are adding a new device to a seed filesystem. This allows
      to simplify the check in error: label. No semantics changes.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      44cab9ba
    • N
      btrfs: use RCU for quick device check in btrfs_init_new_device · f4cfa9bd
      Nikolay Borisov 提交于
      When adding a new device there's a mandatory check to see if a device is
      being duplicated to the filesystem it's added to. Since this is a
      read-only operations not necessary to take device_list_mutex and can simply
      make do with an rcu-readlock.
      
      Using just RCU is safe because there won't be another device add delete
      running in parallel as btrfs_init_new_device is called only from
      btrfs_ioctl_add_dev.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f4cfa9bd
    • J
      btrfs: do not hold device_list_mutex when closing devices · 425c6ed6
      Josef Bacik 提交于
      The following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
      ------------------------------------------------------
      fsstress/8739 is trying to acquire lock:
      ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70
      
      but task is already holding lock:
      ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #10 (sb_pagefaults){.+.+}-{0:0}:
             __sb_start_write+0x129/0x210
             btrfs_page_mkwrite+0x6a/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
             __might_fault+0x68/0x90
             _copy_to_user+0x1e/0x80
             perf_read+0x141/0x2c0
             vfs_read+0xad/0x1b0
             ksys_read+0x5f/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x88/0x150
             perf_event_init+0x1db/0x20b
             start_kernel+0x3ae/0x53c
             secondary_startup_64+0xa4/0xb0
      
      -> #7 (pmus_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             perf_event_init_cpu+0x4f/0x150
             cpuhp_invoke_callback+0xb1/0x900
             _cpu_up.constprop.26+0x9f/0x130
             cpu_up+0x7b/0xc0
             bringup_nonboot_cpus+0x4f/0x60
             smp_init+0x26/0x71
             kernel_init_freeable+0x110/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #6 (cpu_hotplug_lock){++++}-{0:0}:
             cpus_read_lock+0x39/0xb0
             kmem_cache_create_usercopy+0x28/0x230
             kmem_cache_create+0x12/0x20
             bioset_init+0x15e/0x2b0
             init_bio+0xa3/0xaa
             do_one_initcall+0x5a/0x2e0
             kernel_init_freeable+0x1f4/0x258
             kernel_init+0xa/0x103
             ret_from_fork+0x1f/0x30
      
      -> #5 (bio_slab_lock){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             bioset_init+0xbc/0x2b0
             __blk_alloc_queue+0x6f/0x2d0
             blk_mq_init_queue_data+0x1b/0x70
             loop_add+0x110/0x290 [loop]
             fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
             do_one_initcall+0x5a/0x2e0
             do_init_module+0x5a/0x220
             load_module+0x2459/0x26e0
             __do_sys_finit_module+0xba/0xe0
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             lo_open+0x18/0x50 [loop]
             __blkdev_get+0xec/0x570
             blkdev_get+0xe8/0x150
             do_dentry_open+0x167/0x410
             path_openat+0x7c9/0xa80
             do_filp_open+0x93/0x100
             do_sys_openat2+0x22a/0x2e0
             do_sys_open+0x4b/0x80
             do_syscall_64+0x50/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             blkdev_put+0x1d/0x120
             close_fs_devices.part.31+0x84/0x130
             btrfs_close_devices+0x44/0xb0
             close_ctree+0x296/0x2b2
             generic_shutdown_super+0x69/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_run_dev_stats+0x49/0x480
             commit_cowonly_roots+0xb5/0x2a0
             btrfs_commit_transaction+0x516/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
             __mutex_lock+0x9f/0x930
             btrfs_commit_transaction+0x4bb/0xa60
             sync_filesystem+0x6b/0x90
             generic_shutdown_super+0x22/0x100
             kill_anon_super+0xe/0x30
             btrfs_kill_super+0x12/0x20
             deactivate_locked_super+0x29/0x60
             cleanup_mnt+0xb8/0x140
             task_work_run+0x6d/0xb0
             __prepare_exit_to_usermode+0x1cc/0x1e0
             do_syscall_64+0x5c/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __lock_acquire+0x1272/0x2310
             lock_acquire+0x9e/0x360
             __mutex_lock+0x9f/0x930
             btrfs_record_root_in_trans+0x43/0x70
             start_transaction+0xd1/0x5d0
             btrfs_dirty_inode+0x42/0xd0
             file_update_time+0xc8/0x110
             btrfs_page_mkwrite+0x10c/0x4a0
             do_page_mkwrite+0x4d/0xc0
             handle_mm_fault+0x103c/0x1730
             exc_page_fault+0x340/0x660
             asm_exc_page_fault+0x1e/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sb_pagefaults);
                                     lock(&mm->mmap_lock#2);
                                     lock(sb_pagefaults);
        lock(&fs_info->reloc_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by fsstress/8739:
       #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
       #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
       #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0
      
      stack backtrace:
      CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
      Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
      Call Trace:
       dump_stack+0x78/0xa0
       check_noncircular+0x165/0x180
       __lock_acquire+0x1272/0x2310
       ? btrfs_get_alloc_profile+0x150/0x210
       lock_acquire+0x9e/0x360
       ? btrfs_record_root_in_trans+0x43/0x70
       __mutex_lock+0x9f/0x930
       ? btrfs_record_root_in_trans+0x43/0x70
       ? lock_acquire+0x9e/0x360
       ? join_transaction+0x5d/0x450
       ? find_held_lock+0x2d/0x90
       ? btrfs_record_root_in_trans+0x43/0x70
       ? join_transaction+0x3d5/0x450
       ? btrfs_record_root_in_trans+0x43/0x70
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       ? handle_mm_fault+0x5e/0x1730
       do_page_mkwrite+0x4d/0xc0
       ? __do_fault+0x32/0x150
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       ? asm_exc_page_fault+0x8/0x30
       asm_exc_page_fault+0x1e/0x30
      RIP: 0033:0x7faa6c9969c4
      
      Was seen in testing.  The fix is similar to that of
      
        btrfs: open device without device_list_mutex
      
      where we're holding the device_list_mutex and then grab the bd_mutex,
      which pulls in a bunch of dependencies under the bd_mutex.  We only ever
      call btrfs_close_devices() on mount failure or unmount, so we're save to
      not have the device_list_mutex here.  We're already holding the
      uuid_mutex which keeps us safe from any external modification of the
      fs_devices.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      425c6ed6
    • J
      btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks · 62cf5391
      Josef Bacik 提交于
      When closing and freeing the source device we could end up doing our
      final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
      we want to be holding as few locks as possible, so move this call
      outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
      we're modifying the fs_devices we need to make sure we're holding the
      uuid_mutex here, so take that as well.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      62cf5391
    • N
      btrfs: remove alloc_list splice in btrfs_prepare_sprout · 68abf360
      Nikolay Borisov 提交于
      btrfs_prepare_sprout is called when the first rw device is added to a
      seed filesystem. This means the filesystem can't have its alloc_list
      be non-empty, since seed filesystems are read only. Simply remove the
      code altogether.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      68abf360
    • N
      btrfs: document some invariants of seed code · 427c8fdd
      Nikolay Borisov 提交于
      Without good understanding of how seed devices works it's hard to grok
      some of what the code in open_seed_devices or btrfs_prepare_sprout does.
      
      Add comments hopefully reducing some of the cognitive load.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      427c8fdd
    • N
      btrfs: switch seed device to list api · 944d3f9f
      Nikolay Borisov 提交于
      While this patch touches a bunch of files the conversion is
      straighforward. Instead of using the implicit linked list anchored at
      btrfs_fs_devices::seed the code is switched to using
      list_for_each_entry.
      
      Previous patches in the series already factored out code that processed
      both main and seed devices so in those cases the factored out functions
      are called on the main fs_devices and then on every seed dev inside
      list_for_each_entry.
      
      Using list api also allows to simplify deletion from the seed dev list
      performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
      substituting a while() loop with a simple list_del_init.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      944d3f9f
    • N
      btrfs: simplify setting/clearing fs_info to btrfs_fs_devices · c4989c2f
      Nikolay Borisov 提交于
      It makes no sense to have sysfs-related routines be responsible for
      properly initialising the fs_info pointer of struct btrfs_fs_device.
      Instead this can be streamlined by making it the responsibility of
      btrfs_init_devices_late to initialize it. That function already
      initializes fs_info of every individual device in btrfs_fs_devices.
      
      As far as clearing it is concerned it makes sense to move it to
      close_fs_devices. That function is only called when struct
      btrfs_fs_devices is no longer in use - either for holding seeds or
      main devices for a mounted filesystem.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c4989c2f
    • N
      btrfs: make close_fs_devices return void · 54eed6ae
      Nikolay Borisov 提交于
      The return value of this function conveys absolutely no information.
      All callers already check the state of fs_devices->opened to decide how
      to proceed. So convert the function to returning void. While at it make
      btrfs_close_devices also return void.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      54eed6ae
    • N
      btrfs: factor out loop logic from btrfs_free_extra_devids · 3712ccb7
      Nikolay Borisov 提交于
      This prepares the code to switching seeds devices to a proper list.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3712ccb7