1. 07 10月, 2020 40 次提交
    • A
      btrfs: skip devices without magic signature when mounting · 96c2e067
      Anand Jain 提交于
      Many things can happen after the device is scanned and before the device
      is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
      If it happens we still won't free that device from the memory and cause
      the userland confusion.
      
      For example: As the BTRFS_IOC_DEV_INFO still carries the device path
      which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
      device which does not belong to the filesystem anymore:
      
        $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
        $ wipefs -a /dev/sdb
        # /dev/sdb does not contain magic signature
        $ mount -o degraded /dev/sda /btrfs
        $ btrfs fi show -m
        Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
      	  Total devices 2 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
      	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb
      
      We need to distinguish the missing signature and invalid superblock, so
      add a specific error code ENODATA for that. This also fixes failure of
      fstest btrfs/198.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      96c2e067
    • J
      btrfs: cleanup cow block on error · 572c83ac
      Josef Bacik 提交于
      In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
      to a system lockup. It gets stuck trying to write back inodes, and the
      write back thread was trying to lock an extent buffer:
      
        $ cat /proc/2143497/stack
        [<0>] __btrfs_tree_lock+0x108/0x250
        [<0>] lock_extent_buffer_for_io+0x35e/0x3a0
        [<0>] btree_write_cache_pages+0x15a/0x3b0
        [<0>] do_writepages+0x28/0xb0
        [<0>] __writeback_single_inode+0x54/0x5c0
        [<0>] writeback_sb_inodes+0x1e8/0x510
        [<0>] wb_writeback+0xcc/0x440
        [<0>] wb_workfn+0xd7/0x650
        [<0>] process_one_work+0x236/0x560
        [<0>] worker_thread+0x55/0x3c0
        [<0>] kthread+0x13a/0x150
        [<0>] ret_from_fork+0x1f/0x30
      
      This is because we got an error while COWing a block, specifically here
      
              if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
                      ret = btrfs_reloc_cow_block(trans, root, buf, cow);
                      if (ret) {
                              btrfs_abort_transaction(trans, ret);
                              return ret;
                      }
              }
      
        [16402.241552] BTRFS: Transaction aborted (error -2)
        [16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
        [16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
        [16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        [16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
        [16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
        [16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
        [16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
        [16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
        [16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
        [16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
        [16402.256321] FS:  00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
        [16402.256973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
        [16402.257867] Call Trace:
        [16402.258072]  btrfs_cow_block+0x109/0x230
        [16402.258356]  btrfs_search_slot+0x530/0x9d0
        [16402.258655]  btrfs_lookup_file_extent+0x37/0x40
        [16402.259155]  __btrfs_drop_extents+0x13c/0xd60
        [16402.259628]  ? btrfs_block_rsv_migrate+0x4f/0xb0
        [16402.259949]  btrfs_replace_file_extents+0x190/0x820
        [16402.260873]  btrfs_clone+0x9ae/0xc00
        [16402.261139]  btrfs_extent_same_range+0x66/0x90
        [16402.261771]  btrfs_remap_file_range+0x353/0x3b1
        [16402.262333]  vfs_dedupe_file_range_one.part.0+0xd5/0x140
        [16402.262821]  vfs_dedupe_file_range+0x189/0x220
        [16402.263150]  do_vfs_ioctl+0x552/0x700
        [16402.263662]  __x64_sys_ioctl+0x62/0xb0
        [16402.264023]  do_syscall_64+0x33/0x40
        [16402.264364]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [16402.264862] RIP: 0033:0x7f542c7d15cb
        [16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
        [16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
        [16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
        [16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
        [16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
        [16402.271498] irq event stamp: 0
        [16402.271846] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
        [16402.273343] softirqs last  enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
        [16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [16402.274338] ---[ end trace 737874a5a41a8236 ]---
        [16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.276179] BTRFS info (device dm-9): forced readonly
        [16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
        [16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
      
      The problem here is that as soon as we allocate the new block it is
      locked and marked dirty in the btree inode.  This means that we could
      attempt to writeback this block and need to lock the extent buffer.
      However we're not unlocking it here and thus we deadlock.
      
      Fix this by unlocking the cow block if we have any errors inside of
      __btrfs_cow_block, and also free it so we do not leak it.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      572c83ac
    • G
      btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK · e3c57805
      Goldwyn Rodrigues 提交于
      Since we now perform direct reads using i_rwsem, we can remove this
      inode flag used to co-ordinate unlocked reads.
      
      The truncate call takes i_rwsem. This means it is correctly synchronized
      with concurrent direct reads.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <jth@kernel.org>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e3c57805
    • G
      fs: remove no longer used dio_end_io() · c33fe275
      Goldwyn Rodrigues 提交于
      Since we removed the last user of dio_end_io() when btrfs got converted
      to iomap infrastructure ("btrfs: switch to iomap for direct IO"), remove
      the helper function dio_end_io().
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c33fe275
    • J
      btrfs: return error if we're unable to read device stats · 92e26df4
      Josef Bacik 提交于
      I noticed when fixing device stats for seed devices that we simply threw
      away the return value from btrfs_search_slot().  This is because we may
      not have stat items, but we could very well get an error, and thus miss
      reporting the error up the chain.
      
      Fix this by returning ret if it's an actual error, and then stop trying
      to init the rest of the devices stats and return the error up the chain.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92e26df4
    • J
      btrfs: init device stats for seed devices · 124604eb
      Josef Bacik 提交于
      We recently started recording device stats across the fleet, and noticed
      a large increase in messages such as this
      
        BTRFS warning (device dm-0): get dev_stats failed, not yet valid
      
      on our tiers that use seed devices for their root devices.  This is
      because we do not initialize the device stats for any seed devices if we
      have a sprout device and mount using that sprout device.  The basic
      steps for reproducing are:
      
        $ mkfs seed device
        $ mount seed device
        # fill seed device
        $ umount seed device
        $ btrfstune -S 1 seed device
        $ mount seed device
        $ btrfs device add -f sprout device /mnt/wherever
        $ umount /mnt/wherever
        $ mount sprout device /mnt/wherever
        $ btrfs device stats /mnt/wherever
      
      This will fail with the above message in dmesg.
      
      Fix this by iterating over the fs_devices->seed if they exist in
      btrfs_init_dev_stats.  This fixed the problem and properly reports the
      stats for both devices.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ rename to btrfs_device_init_dev_stats ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      124604eb
    • N
      btrfs: remove struct extent_io_ops · 905eb88b
      Nikolay Borisov 提交于
      It's no longer used just remove the function and any related code which
      was initialising it for inodes. No functional changes.
      
      Removing 8 bytes from extent_io_tree in turn reduces size of other
      structures where it is embedded, notably btrfs_inode where it reduces
      size by 24 bytes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      905eb88b
    • N
      btrfs: call submit_bio_hook directly for metadata pages · 1b36294a
      Nikolay Borisov 提交于
      No need to go through a function pointer indirection simply call
      submit_bio_hook directly by exporting and renaming the helper to
      btrfs_submit_metadata_bio. This makes the code more readable and should
      result in somewhat faster code due to no longer paying the price for
      specualtive attack mitigations that come with indirect function calls.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1b36294a
    • N
      btrfs: stop calling submit_bio_hook for data inodes · 908930f3
      Nikolay Borisov 提交于
      Instead export and rename the function to btrfs_submit_data_bio and
      call it directly in submit_one_bio. This avoids paying the cost for
      speculative attacks mitigations and improves code readability.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      908930f3
    • N
      btrfs: don't opencode is_data_inode in end_bio_extent_readpage · be17b3af
      Nikolay Borisov 提交于
      Use the is_data_inode helper.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      be17b3af
    • N
      btrfs: call submit_bio_hook directly in submit_one_bio · cd053744
      Nikolay Borisov 提交于
      BTRFS has 2 inode types (for the purposes of the code in submit_one_bio)
      - ordinary data inodes (including the freespace inode) and the btree
      inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can
      never be called from submit_one_bio so just remove it.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cd053744
    • N
      btrfs: remove extent_io_ops::readpage_end_io_hook · 1f03d9cf
      Nikolay Borisov 提交于
      It's no longer used so let's remove it.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1f03d9cf
    • N
      btrfs: replace readpage_end_io_hook with direct calls · 9a446d6a
      Nikolay Borisov 提交于
      Don't call readpage_end_io_hook for the btree inode.  Instead of relying
      on indirect calls to implement metadata buffer validation simply check
      if the inode whose page we are processing equals the btree inode. If it
      does call the necessary function.
      
      This is an improvement in 2 directions:
      
      1. We aren't paying the penalty of indirect calls in a post-speculation
         attacks world.
      
      2. The function is now named more explicitly so it's obvious what's
         going on
      
      This is in preparation to removing struct extent_io_ops altogether.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9a446d6a
    • F
      btrfs: send, recompute reference path after orphanization of a directory · 9c2b4e03
      Filipe Manana 提交于
      During an incremental send, when an inode has multiple new references we
      might end up emitting rename operations for orphanizations that have a
      source path that is no longer valid due to a previous orphanization of
      some directory inode. This causes the receiver to fail since it tries
      to rename a path that does not exists.
      
      Example reproducer:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/f1
        touch /mnt/sdi/f2
        mkdir /mnt/sdi/d1
        mkdir /mnt/sdi/d1/d2
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- f1                   (ino 257)
        # |----- f2                   (ino 258)
        # |----- d1/                  (ino 259)
        #        |----- d2/           (ino 260)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now do a series of changes such that:
        #
        # *) inode 258 has one new hardlink and the previous name changed
        #
        # *) both names conflict with the old names of two other inodes:
        #
        #    1) the new name "d1" conflicts with the old name of inode 259,
        #       under directory inode 256 (root)
        #
        #    2) the new name "d2" conflicts with the old name of inode 260
        #       under directory inode 259
        #
        # *) inodes 259 and 260 now have the old names of inode 258
        #
        # *) inode 257 is now located under inode 260 - an inode with a number
        #    smaller than the inode (258) for which we created a second hard
        #    link and swapped its names with inodes 259 and 260
        #
        ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
        mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
      
        # Swap d1 and f2.
        mv /mnt/sdi/d1 /mnt/sdi/tmp
        mv /mnt/sdi/f2 /mnt/sdi/d1
        mv /mnt/sdi/tmp /mnt/sdi/f2
      
        # Swap d2 and f2_link
        mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
        mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
        mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
      
        # Filesystem now looks like:
        #
        # .                                (ino 256)
        # |----- d1                        (ino 258)
        # |----- f2/                       (ino 259)
        #        |----- f2_link/           (ino 260)
        #        |       |----- f1         (ino 257)
        #        |
        #        |----- d2                 (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When executed the receive of the incremental stream fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
      
      This happens because:
      
      1) When processing inode 257 we end up computing the name for inode 259
         because it is an ancestor in the send snapshot, and at that point it
         still has its old name, "d1", from the parent snapshot because inode
         259 was not yet processed. We then cache that name, which is valid
         until we start processing inode 259 (or set the progress to 260 after
         processing its references);
      
      2) Later we start processing inode 258 and collecting all its new
         references into the list sctx->new_refs. The first reference in the
         list happens to be the reference for name "d1" while the reference for
         name "d2" is next (the last element of the list).
         We compute the full path "d1/d2" for this second reference and store
         it in the reference (its ->full_path member). The path used for the
         new parent directory was "d1" and not "f2" because inode 259, the
         new parent, was not yet processed;
      
      3) When we start processing the new references at process_recorded_refs()
         we start with the first reference in the list, for the new name "d1".
         Because there is a conflicting inode that was not yet processed, which
         is directory inode 259, we orphanize it, renaming it from "d1" to
         "o259-6-0";
      
      4) Then we start processing the new reference for name "d2", and we
         realize it conflicts with the reference of inode 260 in the parent
         snapshot. So we issue an orphanization operation for inode 260 by
         emitting a rename operation with a destination path of "o260-6-0"
         and a source path of "d1/d2" - this source path is the value we
         stored in the reference earlier at step 2), corresponding to the
         ->full_path member of the reference, however that path is no longer
         valid due to the orphanization of the directory inode 259 in step 3).
         This makes the receiver fail since the path does not exists, it should
         have been "o259-6-0/d2".
      
      Fix this by recomputing the full path of a reference before emitting an
      orphanization if we previously orphanized any directory, since that
      directory could be a parent in the new path. This is a rare scenario so
      keeping it simple and not checking if that previously orphanized directory
      is in fact an ancestor of the inode we are trying to orphanize.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c2b4e03
    • F
      btrfs: send, orphanize first all conflicting inodes when processing references · 98272bb7
      Filipe Manana 提交于
      When doing an incremental send it is possible that when processing the new
      references for an inode we end up issuing rename or link operations that
      have an invalid path, which contains the orphanized name of a directory
      before we actually orphanized it, causing the receiver to fail.
      
      The following reproducer triggers such scenario:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/a
        touch /mnt/sdi/b
        mkdir /mnt/sdi/testdir
        # We want "a" to have a lower inode number then "testdir" (257 vs 259).
        mv /mnt/sdi/a /mnt/sdi/testdir/a
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- testdir/             (ino 259)
        # |          |----- a         (ino 257)
        # |
        # |----- b                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now rename 259 to "testdir_2", then change the name of 257 to
        # "testdir" and make it a direct descendant of the root inode (256).
        # Also create a new link for inode 257 with the old name of inode 258.
        # By swapping the names and location of several inodes and create a
        # nasty dependency chain of rename and link operations.
        mv /mnt/sdi/testdir/a /mnt/sdi/a2
        touch /mnt/sdi/testdir/a
        mv /mnt/sdi/b /mnt/sdi/b2
        ln /mnt/sdi/a2 /mnt/sdi/b
        mv /mnt/sdi/testdir /mnt/sdi/testdir_2
        mv /mnt/sdi/a2 /mnt/sdi/testdir
      
        # Filesystem now looks like:
        #
        # .                            (ino 256)
        # |----- testdir_2/            (ino 259)
        # |          |----- a          (ino 260)
        # |
        # |----- testdir               (ino 257)
        # |----- b                     (ino 257)
        # |----- b2                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When running the reproducer, the receive of the incremental send stream
      fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: link b -> o259-6-0/a failed: No such file or directory
      
      The problem happens because of the following:
      
      1) Before we start iterating the list of new references for inode 257,
         we generate its current path and store it at @valid_path, done at
         the very beginning of process_recorded_refs(). The generated path
         is "o259-6-0/a", containing the orphanized name for inode 259;
      
      2) Then we iterate over the list of new references, which has the
         references "b" and "testdir" in that specific order;
      
      3) We process reference "b" first, because it is in the list before
         reference "testdir". We then issue a link operation to create
         the new reference "b" using a target path corresponding to the
         content at @valid_path, which corresponds to "o259-6-0/a".
         However we haven't yet orphanized inode 259, its name is still
         "testdir", and not "o259-6-0". The orphanization of 259 did not
         happen yet because we will process the reference named "testdir"
         for inode 257 only in the next iteration of the loop that goes
         over the list of new references.
      
      Fix the issue by having a preliminar iteration over all the new references
      at process_recorded_refs(). This iteration is responsible only for doing
      the orphanization of other inodes that have and old reference that
      conflicts with one of the new references of the inode we are currently
      processing. The emission of rename and link operations happen now in the
      next iteration of the new references.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98272bb7
    • Q
      btrfs: tree-checker: fix false alert caused by legacy btrfs root item · 1465af12
      Qu Wenruo 提交于
      Commit 259ee775 ("btrfs: tree-checker: Add ROOT_ITEM check")
      introduced btrfs root item size check, however btrfs root item has two
      versions, the legacy one which just ends before generation_v2 member, is
      smaller than current btrfs root item size.
      
      This caused btrfs kernel to reject valid but old tree root leaves.
      
      Fix this problem by also allowing legacy root item, since kernel can
      already handle them pretty well and upgrade to newer root item format
      when needed.
      Reported-by: NMartin Steigerwald <martin@lichtvoll.de>
      Fixes: 259ee775 ("btrfs: tree-checker: Add ROOT_ITEM check")
      CC: stable@vger.kernel.org # 5.4+
      Tested-By: NMartin Steigerwald <martin@lichtvoll.de>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1465af12
    • D
      btrfs: use unaligned helpers for stack and header set/get helpers · e97659ce
      David Sterba 提交于
      In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
      pointer assignment but we should use the helpers for unaligned access
      for clarity. It hasn't been a problem so far because of the natural
      alignment.
      
      Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
      from stack that has an aligned start but some members may not be aligned
      due to packing. This as well hasn't caused problems so far.
      
      Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e97659ce
    • D
      btrfs: free-space-cache: use unaligned helpers to access data · 6994ca36
      David Sterba 提交于
      The free space inode stores the tracking data, checksums etc, using the
      io_ctl structure and moving the pointers. The data are generally aligned
      to at least 4 bytes (u32 for CRC) so it's not completely unaligned but
      for clarity we should use the proper helpers whenever a struct is
      initialized from io_ctl->cur pointer.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6994ca36
    • D
      btrfs: send: use helpers for unaligned access to header members · e2f896b3
      David Sterba 提交于
      The header is mapped onto the send buffer and thus its members may be
      potentially unaligned so use the helpers instead of directly assigning
      the pointers. This has worked so far but let's use the helpers to make
      that clear.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e2f896b3
    • Q
      btrfs: use own btree inode io_tree owner id · 2c53a14d
      Qu Wenruo 提交于
      Btree inode is special compared to all other inode extent io_trees,
      although it has a btrfs inode, it doesn't have the track_uptodate bit at
      all.
      
      This means a lot of things like extent locking doesn't even need to be
      applied to btree io tree.
      
      Since it's so special, adds a new owner value for it to make debuging a
      little easier.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2c53a14d
    • J
      btrfs: reschedule when cloning lots of extents · 6b613cc9
      Johannes Thumshirn 提交于
      We have several occurrences of a soft lockup from fstest's generic/175
      testcase, which look more or less like this one:
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030]
        Kernel panic - not syncing: softlockup: hung tasks
        CPU: 0 PID: 10030 Comm: xfs_io Tainted: G             L    5.9.0-rc5+ #768
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack+0x77/0xa0
         panic+0xfa/0x2cb
         watchdog_timer_fn.cold+0x85/0xa5
         ? lockup_detector_update_enable+0x50/0x50
         __hrtimer_run_queues+0x99/0x4c0
         ? recalibrate_cpu_khz+0x10/0x10
         hrtimer_run_queues+0x9f/0xb0
         update_process_times+0x28/0x80
         tick_handle_periodic+0x1b/0x60
         __sysvec_apic_timer_interrupt+0x76/0x210
         asm_call_on_stack+0x12/0x20
         </IRQ>
         sysvec_apic_timer_interrupt+0x7f/0x90
         asm_sysvec_apic_timer_interrupt+0x12/0x20
        RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs]
        RSP: 0018:ffffc90007123a58 EFLAGS: 00000282
        RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000
        RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0
        RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032
        R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000
        R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0
         ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs]
         btrfs_release_path+0x67/0x80 [btrfs]
         btrfs_insert_replace_extent+0x177/0x2c0 [btrfs]
         btrfs_replace_file_extents+0x472/0x7c0 [btrfs]
         btrfs_clone+0x9ba/0xbd0 [btrfs]
         btrfs_clone_files.isra.0+0xeb/0x140 [btrfs]
         ? file_update_time+0xcd/0x120
         btrfs_remap_file_range+0x322/0x3b0 [btrfs]
         do_clone_file_range+0xb7/0x1e0
         vfs_clone_file_range+0x30/0xa0
         ioctl_file_clone+0x8a/0xc0
         do_vfs_ioctl+0x5b2/0x6f0
         __x64_sys_ioctl+0x37/0xa0
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f87977fc247
        RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247
        RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003
        RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000
        R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000
        R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000
        Kernel Offset: disabled
        ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---
      
      All of these lockup reports have the call chain btrfs_clone_files() ->
      btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with
      both source and destination extents locked and loops over the source
      extent to create the clones.
      
      Conditionally reschedule in the btrfs_clone() loop, to give some time back
      to other processes.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b613cc9
    • D
      btrfs: use kvcalloc for allocation in btrfs_ioctl_send() · bae12df9
      Denis Efremov 提交于
      Replace kvzalloc() call with kvcalloc() that also checks the size
      internally. There's a standalone overflow check in the function so we
      can return invalid parameter combination.  Use array_size() helper to
      compute the memory size for clone_sources_tmp.
      
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bae12df9
    • D
      btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send() · 8eb2fd00
      Denis Efremov 提交于
      btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
      The code was accidentally replaced with kzalloc() call [1]. Restore
      the original code by using kvzalloc() to allocate sctx->clone_roots.
      
      [1] https://patchwork.kernel.org/patch/9757891/#20529627
      
      Fixes: 818e010b ("btrfs: replace opencoded kvzalloc with the helper")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8eb2fd00
    • N
      btrfs: remove inode argument from btrfs_start_ordered_extent · c0a43603
      Nikolay Borisov 提交于
      The passed in ordered_extent struct is always well-formed and contains
      the inode making the explicit argument redundant.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c0a43603
    • N
      btrfs: remove inode argument from add_pending_csums · 510f85ed
      Nikolay Borisov 提交于
      It's used to reference the csum root which can be done from the trans
      handle as well. Simplify the signature and while at it also remove the
      noinline attribute as the function uses only at most 16 bytes of stack
      space.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      510f85ed
    • N
    • N
    • N
      633cc816
    • N
      btrfs: open code extent_read_full_page to its sole caller · 0f208812
      Nikolay Borisov 提交于
      This makes reading the code a tad easier by decreasing the level of
      indirection by one.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0f208812
    • N
      btrfs: sink mirror_num argument in __do_readpage · fd513000
      Nikolay Borisov 提交于
      It's always set to 0 by the 2 callers so move it inside __do_readpage.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd513000
    • N
      btrfs: sink read_flags argument into extent_read_full_page · 6f15af60
      Nikolay Borisov 提交于
      It's always set to 0 by its sole caller - btrfs_readpage. Simply remove
      it.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6f15af60
    • N
      btrfs: sink mirror_num argument in extent_read_full_page · 003c286a
      Nikolay Borisov 提交于
      It's always set to 0 from the sole caller - btrfs_readpage.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      003c286a
    • N
      btrfs: promote extent_read_full_page to btrfs_readpage · c1be9c1a
      Nikolay Borisov 提交于
      Now that btrfs_readpage is the only caller of extent_read_full_page the
      latter can be open coded in the former. Use the occassion to rename
      __extent_read_full_page to extent_read_full_page. To facillitate this
      change submit_one_bio has to be exported as well.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c1be9c1a
    • N
      btrfs: remove mirror_num argument from extent_read_full_page · 72cffee4
      Nikolay Borisov 提交于
      It's called only from btrfs_readpage which always passes 0 so just sink
      the argument into extent_read_full_page.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      72cffee4
    • N
      btrfs: remove btrfs_get_extent indirection from __do_readpage · 1a5ee1e6
      Nikolay Borisov 提交于
      Now that this function is only responsible for reading data pages it's
      no longer necessary to pass get_extent_t parameter across several
      layers of functions. This patch removes this parameter from multiple
      functions: __get_extent_map/__do_readpage/__extent_read_full_page/
      extent_read_full_page and simply calls btrfs_get_extent directly in
      __get_extent_map.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a5ee1e6
    • N
      btrfs: remove btree_get_extent · 208d6341
      Nikolay Borisov 提交于
      The sole purpose of this function was to satisfy the requirements of
      __do_readpage. Since that function is no longer used to read metadata
      pages the need to keep btree_get_extent around has also disappeared.
      Simply remove it.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      208d6341
    • N
      btrfs: simplify metadata pages reading · 0420177c
      Nikolay Borisov 提交于
      Metadata pages currently use __do_readpage to read metadata pages,
      unfortunately this function is also used to deal with ordinary data
      pages. This makes the metadata pages reading code to go through multiple
      hoops in order to adhere to __do_readpage invariants. Most of these are
      necessary for data pages which could be compressed. For metadata it's
      enough to simply build a bio and submit it.
      
      To this effect simply call submit_extent_page directly from
      read_extent_buffer_pages which is the only callpath used to populate
      extent_buffers with data. This in turn enables further cleanups.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0420177c
    • N
      btrfs: remove btree_readpage · 2f1d3e4b
      Nikolay Borisov 提交于
      There is no way for this function to be called as ->readpage() since
      it's called from
      generic_file_buffered_read/filemap_fault/do_read_cache_page/readhead
      code. BTRFS doesn't utilize the first 3 for the btree inode and
      implements it's owon readhead mechanism. So simply remove the function.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2f1d3e4b
    • F
      btrfs: reschedule if necessary when logging directory items · bb56f02f
      Filipe Manana 提交于
      Logging directories with many entries can take a significant amount of
      time, and in some cases monopolize a cpu/core for a long time if the
      logging task doesn't happen to block often enough.
      
      Johannes and Lu Fengqi reported test case generic/041 triggering a soft
      lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
      case we log an inode with 3002 hard links, and because the test removed
      one hard link before fsyncing the file, the inode logging causes the
      parent directory do be logged as well, which has 6004 directory items to
      log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
      so it can take a significant amount of time and trigger the soft lockup.
      
      So just make tree-log.c:log_dir_items() reschedule when necessary,
      releasing the current search path before doing so and then resume from
      where it was before the reschedule.
      
      The stack trace produced when the soft lockup happens is the following:
      
      [10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172]
      [10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...)
      [10480.284915] irq event stamp: 29646366
      [10480.285987] hardirqs last  enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60
      [10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50
      [10480.290856] softirqs last  enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c
      [10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20
      [10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248
      [10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
      [10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60
      [10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...)
      [10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282
      [10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006
      [10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66
      [10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001
      [10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000
      [10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70
      [10480.321104] FS:  00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000
      [10480.323559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0
      [10480.327259] Call Trace:
      [10480.328286]  ? overwrite_item+0x1f0/0x5a0 [btrfs]
      [10480.329784]  __kmalloc+0x831/0xa20
      [10480.331009]  ? btrfs_get_32+0xb0/0x1d0 [btrfs]
      [10480.332464]  overwrite_item+0x1f0/0x5a0 [btrfs]
      [10480.333948]  log_dir_items+0x2ee/0x570 [btrfs]
      [10480.335413]  log_directory_changes+0x82/0xd0 [btrfs]
      [10480.336926]  btrfs_log_inode+0xc9b/0xda0 [btrfs]
      [10480.338374]  ? init_once+0x20/0x20 [btrfs]
      [10480.339711]  btrfs_log_inode_parent+0x8d3/0xd10 [btrfs]
      [10480.341257]  ? dget_parent+0x97/0x2e0
      [10480.342480]  btrfs_log_dentry_safe+0x3a/0x50 [btrfs]
      [10480.343977]  btrfs_sync_file+0x24b/0x5e0 [btrfs]
      [10480.345381]  do_fsync+0x38/0x70
      [10480.346483]  __x64_sys_fsync+0x10/0x20
      [10480.347703]  do_syscall_64+0x2d/0x70
      [10480.348891]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [10480.350444] RIP: 0033:0x7fa1dc80970b
      [10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...)
      [10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
      [10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b
      [10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003
      [10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40
      [10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001
      [10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60
      
      Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/Reported-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      CC: stable@vger.kernel.org # 4.4+
      Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bb56f02f
    • J
      btrfs: do not create raid sysfs entries under any locks · 49ea112d
      Josef Bacik 提交于
      While running xfstests btrfs/177 I got the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #5 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_dir_ns+0x7a/0xb0
      	 sysfs_create_dir_ns+0x60/0xb0
      	 kobject_add_internal+0xc0/0x2c0
      	 kobject_add+0x6e/0x90
      	 btrfs_sysfs_add_block_group_type+0x102/0x160
      	 btrfs_make_block_group+0x167/0x230
      	 btrfs_alloc_chunk+0x54f/0xb80
      	 btrfs_chunk_alloc+0x18e/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_new_inode+0x225/0x730
      	 btrfs_create+0xab/0x1f0
      	 lookup_open.isra.0+0x52d/0x690
      	 path_openat+0x2a7/0x9e0
      	 do_filp_open+0x75/0x100
      	 do_sys_openat2+0x7b/0x130
      	 __x64_sys_openat+0x46/0x70
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_lookup_inode+0x2a/0x8f
      	 __btrfs_update_delayed_inode+0x80/0x240
      	 btrfs_commit_inode_delayed_inode+0x119/0x120
      	 btrfs_evict_inode+0x357/0x500
      	 evict+0xcf/0x1f0
      	 do_unlinkat+0x1a9/0x2b0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because when we link in a block group with a new raid index
      type we'll create the corresponding sysfs entries for it.  This is
      problematic because while restriping we're holding the chunk_mutex, and
      while mounting we're holding the tree locks.
      
      Fixing this isn't pretty, we move the call to the sysfs stuff into the
      btrfs_create_pending_block_groups() work, where we're not holding any
      locks.  This creates a slight race where other threads could see that
      there's no sysfs kobj for that raid type, and race to create the
      sysfs dir.  Fix this by wrapping the creation in space_info->lock, so we
      only get one thread calling kobject_add() for the new directory.  We
      don't worry about the lock on cleanup as it only gets deleted on
      unmount.
      
      On mount it's more straightforward, we loop through the space_infos
      already, just check every raid index in each space_info and added the
      sysfs entries for the corresponding block groups.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49ea112d