1. 27 1月, 2021 1 次提交
    • F
      btrfs: send: fix wrong file path when there is an inode with a pending rmdir · e3bc7d18
      Filipe Manana 提交于
      stable inclusion
      from stable-5.10.7
      commit 5e84c99055eb84a2a6226bf6164ee70bdcfb996e
      bugzilla: 47429
      
      --------------------------------
      
      commit 0b3f407e upstream.
      
      When doing an incremental send, if we have a new inode that happens to
      have the same number that an old directory inode had in the base snapshot
      and that old directory has a pending rmdir operation, we end up computing
      a wrong path for the new inode, causing the receiver to fail.
      
      Example reproducer:
      
        $ cat test-send-rmdir.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        mkdir $MNT/dir
        touch $MNT/dir/file1
        touch $MNT/dir/file2
        touch $MNT/dir/file3
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- dir/                           (ino 257)
        #         |----- file1                  (ino 258)
        #         |----- file2                  (ino 259)
        #         |----- file3                  (ino 260)
        #
      
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Now remove our directory and all its files.
        rm -fr $MNT/dir
      
        # Unmount the filesystem and mount it again. This is to ensure that
        # the next inode that is created ends up with the same inode number
        # that our directory "dir" had, 257, which is the first free "objectid"
        # available after mounting again the filesystem.
        umount $MNT
        mount $DEV $MNT
      
        # Now create a new file (it could be a directory as well).
        touch $MNT/newfile
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- newfile                        (ino 257)
        #
      
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to apply
        # both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        btrfs receive -f /tmp/snap1.send $MNT
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, the receive operation for the incremental stream
      fails:
      
        $ ./test-send-rmdir.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: chown o257-9-0 failed: No such file or directory
      
      So fix this by tracking directories that have a pending rmdir by inode
      number and generation number, instead of only inode number.
      
      A test case for fstests follows soon.
      Reported-by: NMassimo B. <massimo.b@gmx.net>
      Tested-by: NMassimo B. <massimo.b@gmx.net>
      Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Acked-by: NXie XiuQi <xiexiuqi@huawei.com>
      e3bc7d18
  2. 07 10月, 2020 9 次提交
    • F
      btrfs: send, recompute reference path after orphanization of a directory · 9c2b4e03
      Filipe Manana 提交于
      During an incremental send, when an inode has multiple new references we
      might end up emitting rename operations for orphanizations that have a
      source path that is no longer valid due to a previous orphanization of
      some directory inode. This causes the receiver to fail since it tries
      to rename a path that does not exists.
      
      Example reproducer:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/f1
        touch /mnt/sdi/f2
        mkdir /mnt/sdi/d1
        mkdir /mnt/sdi/d1/d2
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- f1                   (ino 257)
        # |----- f2                   (ino 258)
        # |----- d1/                  (ino 259)
        #        |----- d2/           (ino 260)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now do a series of changes such that:
        #
        # *) inode 258 has one new hardlink and the previous name changed
        #
        # *) both names conflict with the old names of two other inodes:
        #
        #    1) the new name "d1" conflicts with the old name of inode 259,
        #       under directory inode 256 (root)
        #
        #    2) the new name "d2" conflicts with the old name of inode 260
        #       under directory inode 259
        #
        # *) inodes 259 and 260 now have the old names of inode 258
        #
        # *) inode 257 is now located under inode 260 - an inode with a number
        #    smaller than the inode (258) for which we created a second hard
        #    link and swapped its names with inodes 259 and 260
        #
        ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
        mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
      
        # Swap d1 and f2.
        mv /mnt/sdi/d1 /mnt/sdi/tmp
        mv /mnt/sdi/f2 /mnt/sdi/d1
        mv /mnt/sdi/tmp /mnt/sdi/f2
      
        # Swap d2 and f2_link
        mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
        mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
        mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
      
        # Filesystem now looks like:
        #
        # .                                (ino 256)
        # |----- d1                        (ino 258)
        # |----- f2/                       (ino 259)
        #        |----- f2_link/           (ino 260)
        #        |       |----- f1         (ino 257)
        #        |
        #        |----- d2                 (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When executed the receive of the incremental stream fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
      
      This happens because:
      
      1) When processing inode 257 we end up computing the name for inode 259
         because it is an ancestor in the send snapshot, and at that point it
         still has its old name, "d1", from the parent snapshot because inode
         259 was not yet processed. We then cache that name, which is valid
         until we start processing inode 259 (or set the progress to 260 after
         processing its references);
      
      2) Later we start processing inode 258 and collecting all its new
         references into the list sctx->new_refs. The first reference in the
         list happens to be the reference for name "d1" while the reference for
         name "d2" is next (the last element of the list).
         We compute the full path "d1/d2" for this second reference and store
         it in the reference (its ->full_path member). The path used for the
         new parent directory was "d1" and not "f2" because inode 259, the
         new parent, was not yet processed;
      
      3) When we start processing the new references at process_recorded_refs()
         we start with the first reference in the list, for the new name "d1".
         Because there is a conflicting inode that was not yet processed, which
         is directory inode 259, we orphanize it, renaming it from "d1" to
         "o259-6-0";
      
      4) Then we start processing the new reference for name "d2", and we
         realize it conflicts with the reference of inode 260 in the parent
         snapshot. So we issue an orphanization operation for inode 260 by
         emitting a rename operation with a destination path of "o260-6-0"
         and a source path of "d1/d2" - this source path is the value we
         stored in the reference earlier at step 2), corresponding to the
         ->full_path member of the reference, however that path is no longer
         valid due to the orphanization of the directory inode 259 in step 3).
         This makes the receiver fail since the path does not exists, it should
         have been "o259-6-0/d2".
      
      Fix this by recomputing the full path of a reference before emitting an
      orphanization if we previously orphanized any directory, since that
      directory could be a parent in the new path. This is a rare scenario so
      keeping it simple and not checking if that previously orphanized directory
      is in fact an ancestor of the inode we are trying to orphanize.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c2b4e03
    • F
      btrfs: send, orphanize first all conflicting inodes when processing references · 98272bb7
      Filipe Manana 提交于
      When doing an incremental send it is possible that when processing the new
      references for an inode we end up issuing rename or link operations that
      have an invalid path, which contains the orphanized name of a directory
      before we actually orphanized it, causing the receiver to fail.
      
      The following reproducer triggers such scenario:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/a
        touch /mnt/sdi/b
        mkdir /mnt/sdi/testdir
        # We want "a" to have a lower inode number then "testdir" (257 vs 259).
        mv /mnt/sdi/a /mnt/sdi/testdir/a
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- testdir/             (ino 259)
        # |          |----- a         (ino 257)
        # |
        # |----- b                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now rename 259 to "testdir_2", then change the name of 257 to
        # "testdir" and make it a direct descendant of the root inode (256).
        # Also create a new link for inode 257 with the old name of inode 258.
        # By swapping the names and location of several inodes and create a
        # nasty dependency chain of rename and link operations.
        mv /mnt/sdi/testdir/a /mnt/sdi/a2
        touch /mnt/sdi/testdir/a
        mv /mnt/sdi/b /mnt/sdi/b2
        ln /mnt/sdi/a2 /mnt/sdi/b
        mv /mnt/sdi/testdir /mnt/sdi/testdir_2
        mv /mnt/sdi/a2 /mnt/sdi/testdir
      
        # Filesystem now looks like:
        #
        # .                            (ino 256)
        # |----- testdir_2/            (ino 259)
        # |          |----- a          (ino 260)
        # |
        # |----- testdir               (ino 257)
        # |----- b                     (ino 257)
        # |----- b2                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When running the reproducer, the receive of the incremental send stream
      fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: link b -> o259-6-0/a failed: No such file or directory
      
      The problem happens because of the following:
      
      1) Before we start iterating the list of new references for inode 257,
         we generate its current path and store it at @valid_path, done at
         the very beginning of process_recorded_refs(). The generated path
         is "o259-6-0/a", containing the orphanized name for inode 259;
      
      2) Then we iterate over the list of new references, which has the
         references "b" and "testdir" in that specific order;
      
      3) We process reference "b" first, because it is in the list before
         reference "testdir". We then issue a link operation to create
         the new reference "b" using a target path corresponding to the
         content at @valid_path, which corresponds to "o259-6-0/a".
         However we haven't yet orphanized inode 259, its name is still
         "testdir", and not "o259-6-0". The orphanization of 259 did not
         happen yet because we will process the reference named "testdir"
         for inode 257 only in the next iteration of the loop that goes
         over the list of new references.
      
      Fix the issue by having a preliminar iteration over all the new references
      at process_recorded_refs(). This iteration is responsible only for doing
      the orphanization of other inodes that have and old reference that
      conflicts with one of the new references of the inode we are currently
      processing. The emission of rename and link operations happen now in the
      next iteration of the new references.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98272bb7
    • D
      btrfs: send: use helpers for unaligned access to header members · e2f896b3
      David Sterba 提交于
      The header is mapped onto the send buffer and thus its members may be
      potentially unaligned so use the helpers instead of directly assigning
      the pointers. This has worked so far but let's use the helpers to make
      that clear.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e2f896b3
    • D
      btrfs: use kvcalloc for allocation in btrfs_ioctl_send() · bae12df9
      Denis Efremov 提交于
      Replace kvzalloc() call with kvcalloc() that also checks the size
      internally. There's a standalone overflow check in the function so we
      can return invalid parameter combination.  Use array_size() helper to
      compute the memory size for clone_sources_tmp.
      
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bae12df9
    • D
      btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send() · 8eb2fd00
      Denis Efremov 提交于
      btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
      The code was accidentally replaced with kzalloc() call [1]. Restore
      the original code by using kvzalloc() to allocate sctx->clone_roots.
      
      [1] https://patchwork.kernel.org/patch/9757891/#20529627
      
      Fixes: 818e010b ("btrfs: replace opencoded kvzalloc with the helper")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8eb2fd00
    • O
      btrfs: send: use btrfs_file_extent_end() in send_write_or_clone() · c9a949af
      Omar Sandoval 提交于
      send_write_or_clone() basically has an open-coded copy of
      btrfs_file_extent_end() except that it (incorrectly) aligns to PAGE_SIZE
      instead of sectorsize. Fix and simplify the code by using
      btrfs_file_extent_end().
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c9a949af
    • O
      btrfs: send: avoid copying file data · 8c7d9fe0
      Omar Sandoval 提交于
      send_write() currently copies from the page cache to sctx->read_buf, and
      then from sctx->read_buf to sctx->send_buf. Similarly, send_hole()
      zeroes sctx->read_buf and then copies from sctx->read_buf to
      sctx->send_buf. However, if we write the TLV header manually, we can
      copy to sctx->send_buf directly and get rid of sctx->read_buf.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8c7d9fe0
    • O
      btrfs: send: get rid of i_size logic in send_write() · a9b2e0de
      Omar Sandoval 提交于
      send_write()/fill_read_buf() have some logic for avoiding reading past
      i_size. However, everywhere that we call
      send_write()/send_extent_data(), we've already clamped the length down
      to i_size. Get rid of the i_size handling, which simplifies the next
      change.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a9b2e0de
    • D
      btrfs: send: remove indirect callback parameter for changed_cb · 1b51d6fc
      David Sterba 提交于
      There's a custom callback passed to btrfs_compare_trees which happens to
      be named exactly same as the existing function implementing it. This is
      confusing and the indirection is not necessary for our needs. Compiler
      is clever enough to call it directly so there's effectively no change.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1b51d6fc
  3. 25 5月, 2020 3 次提交
    • D
      btrfs: simplify iget helpers · 0202e83f
      David Sterba 提交于
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root thus the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only inode number, renaming it to 'ino'
      instead of 'objectid'. This allows to remove temporary variables key,
      saving some stack space.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0202e83f
    • D
      btrfs: simplify root lookup by id · 56e9357a
      David Sterba 提交于
      The main function to lookup a root by its id btrfs_get_fs_root takes the
      whole key, while only using the objectid. The value of offset is preset
      to (u64)-1 but not actually used until btrfs_find_root that does the
      actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      56e9357a
    • M
      btrfs: send: emit file capabilities after chown · 89efda52
      Marcos Paulo de Souza 提交于
      Whenever a chown is executed, all capabilities of the file being touched
      are lost.  When doing incremental send with a file with capabilities,
      there is a situation where the capability can be lost on the receiving
      side. The sequence of actions bellow shows the problem:
      
        $ mount /dev/sda fs1
        $ mount /dev/sdb fs2
      
        $ touch fs1/foo.bar
        $ setcap cap_sys_nice+ep fs1/foo.bar
        $ btrfs subvolume snapshot -r fs1 fs1/snap_init
        $ btrfs send fs1/snap_init | btrfs receive fs2
      
        $ chgrp adm fs1/foo.bar
        $ setcap cap_sys_nice+ep fs1/foo.bar
      
        $ btrfs subvolume snapshot -r fs1 fs1/snap_complete
        $ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
      
        $ btrfs send fs1/snap_complete | btrfs receive fs2
        $ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
      
      At this point, only a chown was emitted by "btrfs send" since only the
      group was changed. This makes the cap_sys_nice capability to be dropped
      from fs2/snap_incremental/foo.bar
      
      To fix that, only emit capabilities after chown is emitted. The current
      code first checks for xattrs that are new/changed, emits them, and later
      emit the chown. Now, __process_new_xattr skips capabilities, letting
      only finish_inode_if_needed to emit them, if they exist, for the inode
      being processed.
      
      This behavior was being worked around in "btrfs receive" side by caching
      the capability and only applying it after chown. Now, xattrs are only
      emmited _after_ chown, making that workaround not needed anymore.
      
      Link: https://github.com/kdave/btrfs-progs/issues/202
      CC: stable@vger.kernel.org # 4.4+
      Suggested-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      89efda52
  4. 10 5月, 2020 1 次提交
  5. 24 3月, 2020 6 次提交
  6. 31 1月, 2020 1 次提交
    • F
      Btrfs: send, fix emission of invalid clone operations within the same file · 9722b101
      Filipe Manana 提交于
      When doing an incremental send and a file has extents shared with itself
      at different file offsets, it's possible for send to emit clone operations
      that will fail at the destination because the source range goes beyond the
      file's current size. This happens when the file size has increased in the
      send snapshot, there is a hole between the shared extents and both shared
      extents are at file offsets which are greater the file's size in the
      parent snapshot.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
        $ btrfs send -f /tmp/1.snap /mnt/sdb/base
      
        # Create a 320K extent at file offset 512K.
        $ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
      
        # Clone part of that 320K extent into a lower file offset (192K).
        # This file offset is greater than the file's size in the parent
        # snapshot (64K). Also the clone range is a bit behind the offset of
        # the 320K extent so that we leave a hole between the shared extents.
        $ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
        $ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/1.snap /mnt/sdc
        $ btrfs receive -f /tmp/2.snap /mnt/sdc
        ERROR: failed to clone extents to foobar: Invalid argument
      
      The problem is that after processing the extent at file offset 256K, which
      refers to the first 128K of the 320K extent created by the buffered write
      operations, we have 'cur_inode_next_write_offset' set to 384K, which
      corresponds to the end offset of the partially shared extent (256K + 128K)
      and to the current file size in the receiver. Then when we process the
      extent at offset 512K, we do extent backreference iteration to figure out
      if we can clone the extent from some other inode or from the same inode,
      and we consider the extent at offset 256K of the same inode as a valid
      source for a clone operation, which is not correct because at that point
      the current file size in the receiver is 384K, which corresponds to the
      end of last processed extent (at file offset 256K), so using a clone
      source range from 256K to 256K + 320K is invalid because that goes past
      the current size of the file (384K) - this makes the receiver get an
      -EINVAL error when attempting the clone operation.
      
      So fix this by excluding clone sources that have a range that goes beyond
      the current file size in the receiver when iterating extent backreferences.
      
      A test case for fstests follows soon.
      
      Fixes: 11f2069c ("Btrfs: send, allow clone operations within the same file")
      CC: stable@vger.kernel.org # 5.5+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9722b101
  7. 13 12月, 2019 1 次提交
    • A
      btrfs: send: remove WARN_ON for readonly mount · fbd54297
      Anand Jain 提交于
      We log warning if root::orphan_cleanup_state is not set to
      ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
      mounted as readonly we skip the orphan item cleanup during the lookup
      and root::orphan_cleanup_state remains at the init state 0 instead of
      ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
      warning as below.
      
        WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);
      
      WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
      ::
      RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
      ::
      Call Trace:
      ::
      _btrfs_ioctl_send+0x7b/0x110 [btrfs]
      btrfs_ioctl+0x150a/0x2b00 [btrfs]
      ::
      do_vfs_ioctl+0xa9/0x620
      ? __fget+0xac/0xe0
      ksys_ioctl+0x60/0x90
      __x64_sys_ioctl+0x16/0x20
      do_syscall_64+0x49/0x130
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reproducer:
        mkfs.btrfs -fq /dev/sdb
        mount /dev/sdb /btrfs
        btrfs subvolume create /btrfs/sv1
        btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
        umount /btrfs
        mount -o ro /dev/sdb /btrfs
        btrfs send /btrfs/ss1 -f /tmp/f
      
      The warning exists because having orphan inodes could confuse send and
      cause it to fail or produce incorrect streams.  The two cases that would
      cause such send failures, which are already fixed are:
      
      1) Inodes that were unlinked - these are orphanized and remain with a
         link count of 0. These caused send operations to fail because it
         expected to always find at least one path for an inode. However this
         is no longer a problem since send is now able to deal with such
         inodes since commit 46b2f459 ("Btrfs: fix send failure when root
         has deleted files still open") and treats them as having been
         completely removed (the state after an orphan cleanup is performed).
      
      2) Inodes that were in the process of being truncated. These resulted in
         send not knowing about the truncation and potentially issue write
         operations full of zeroes for the range from the new file size to the
         old file size. This is no longer a problem because we no longer
         create orphan items for truncation since commit f7e9e8fc ("Btrfs:
         stop creating orphan items for truncate").
      
      As such before these commits, the WARN_ON here provided a clue in case
      something went wrong. Instead of being a warning against the
      root::orphan_cleanup_state value, it could have been more accurate by
      checking if there were actually any orphan items, and then issue a
      warning only if any exists, but that would be more expensive to check.
      Since orphanized inodes no longer cause problems for send, just remove
      the warning.
      Reported-by: NChristoph Anton Mitterer <calestyo@scientia.net>
      Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
      CC: stable@vger.kernel.org # 4.19+
      Suggested-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbd54297
  8. 19 11月, 2019 2 次提交
    • F
      Btrfs: send, skip backreference walking for extents with many references · fd0ddbe2
      Filipe Manana 提交于
      Backreference walking, which is used by send to figure if it can issue
      clone operations instead of write operations, can be very slow and use
      too much memory when extents have many references. This change simply
      skips backreference walking when an extent has more than 64 references,
      in which case we fallback to a write operation instead of a clone
      operation. This limit is conservative and in practice I observed no
      signicant slowdown with up to 100 references and still low memory usage
      up to that limit.
      
      This is a temporary workaround until there are speedups in the backref
      walking code, and as such it does not attempt to add extra interfaces or
      knobs to tweak the threshold.
      Reported-by: NAtemu <atemu.main@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd0ddbe2
    • F
      Btrfs: send, allow clone operations within the same file · 11f2069c
      Filipe Manana 提交于
      For send we currently skip clone operations when the source and
      destination files are the same. This is so because clone didn't support
      this case in its early days, but support for it was added back in May
      2013 by commit a96fbc72 ("Btrfs: allow file data clone within a
      file"). This change adds support for it.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt/sdd
      
        $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
        $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap
      
        $ mkfs.btrfs -f /dev/sde
        $ mount /dev/sde /mnt/sde
      
        $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde
      
      Without this change file foobar at the destination has a single 128Kb
      extent:
      
        $ filefrag -v /mnt/sde/snap/foobar
        Filesystem type is: 9123683e
        File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..      31:          0..        31:     32:             last,unknown_loc,delalloc,eof
        /mnt/sde/snap/foobar: 1 extent found
      
      With this we get a single 64Kb extent that is shared at file offsets 0
      and 64K, just like in the source filesystem:
      
        $ filefrag -v /mnt/sde/snap/foobar
        Filesystem type is: 9123683e
        File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..      15:       3328..      3343:     16:             shared
           1:       16..      31:       3328..      3343:     16:       3344: last,shared,eof
        /mnt/sde/snap/foobar: 2 extents found
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11f2069c
  9. 18 11月, 2019 1 次提交
  10. 08 10月, 2019 1 次提交
    • A
      btrfs: silence maybe-uninitialized warning in clone_range · 431d3988
      Austin Kim 提交于
      GCC throws warning message as below:
      
      ‘clone_src_i_size’ may be used uninitialized in this function
      [-Wmaybe-uninitialized]
       #define IS_ALIGNED(x, a)  (((x) & ((typeof(x))(a) - 1)) == 0)
                             ^
      fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
       u64 clone_src_i_size;
         ^
      The clone_src_i_size is only used as call-by-reference
      in a call to get_inode_info().
      
      Silence the warning by initializing clone_src_i_size to 0.
      
      Note that the warning is a false positive and reported by older versions
      of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the
      patch is applied. Setting clone_src_i_size to 0 does not otherwise make
      sense and would not do any action in case the code changes in the future.
      Signed-off-by: NAustin Kim <austindh.kim@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      431d3988
  11. 09 9月, 2019 2 次提交
    • N
      btrfs: Relinquish CPUs in btrfs_compare_trees · 6af112b1
      Nikolay Borisov 提交于
      When doing any form of incremental send the parent and the child trees
      need to be compared via btrfs_compare_trees. This  can result in long
      loop chains without ever relinquishing the CPU. This causes softlockup
      detector to trigger when comparing trees with a lot of items. Example
      report:
      
      watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
      CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      pstate: 40000005 (nZcv daif -PAN -UAO)
      pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
      lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
      sp : ffff00001273b7e0
      Call trace:
       __ll_sc_arch_atomic_sub_return+0x14/0x20
       release_extent_buffer+0xdc/0x120 [btrfs]
       free_extent_buffer.part.0+0xb0/0x118 [btrfs]
       free_extent_buffer+0x24/0x30 [btrfs]
       btrfs_release_path+0x4c/0xa0 [btrfs]
       btrfs_free_path.part.0+0x20/0x40 [btrfs]
       btrfs_free_path+0x24/0x30 [btrfs]
       get_inode_info+0xa8/0xf8 [btrfs]
       finish_inode_if_needed+0xe0/0x6d8 [btrfs]
       changed_cb+0x9c/0x410 [btrfs]
       btrfs_compare_trees+0x284/0x648 [btrfs]
       send_subvol+0x33c/0x520 [btrfs]
       btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
       btrfs_ioctl+0x199c/0x2288 [btrfs]
       do_vfs_ioctl+0x4b0/0x820
       ksys_ioctl+0x84/0xb8
       __arm64_sys_ioctl+0x28/0x38
       el0_svc_common.constprop.0+0x7c/0x188
       el0_svc_handler+0x34/0x90
       el0_svc+0x8/0xc
      
      Fix this by adding a call to cond_resched at the beginning of the main
      loop in btrfs_compare_trees.
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6af112b1
    • D
      btrfs: move functions for tree compare to send.c · 18d0f5c6
      David Sterba 提交于
      Send is the only user of tree_compare, we can move it there along with
      the other helpers and definitions.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      18d0f5c6
  12. 31 7月, 2019 1 次提交
    • F
      Btrfs: fix incremental send failure after deduplication · b4f9a1a8
      Filipe Manana 提交于
      When doing an incremental send operation we can fail if we previously did
      deduplication operations against a file that exists in both snapshots. In
      that case we will fail the send operation with -EIO and print a message
      to dmesg/syslog like the following:
      
        BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
        extent for inode 257 without updated inode item, send root is 258, \
        parent root is 257
      
      This requires that we deduplicate to the same file in both snapshots for
      the same amount of times on each snapshot. The issue happens because a
      deduplication only updates the iversion of an inode and does not update
      any other field of the inode, therefore if we deduplicate the file on
      each snapshot for the same amount of time, the inode will have the same
      iversion value (stored as the "sequence" field on the inode item) on both
      snapshots, therefore it will be seen as unchanged between in the send
      snapshot while there are new/updated/deleted extent items when comparing
      to the parent snapshot. This makes the send operation return -EIO and
      print an error message.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        # Create our first file. The first half of the file has several 64Kb
        # extents while the second half as a single 512Kb extent.
        $ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
      
        # Create the base snapshot and the parent send stream from it.
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
        $ btrfs send -f /tmp/1.snap /mnt/mysnap1
      
        # Create our second file, that has exactly the same data as the first
        # file.
        $ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
      
        # Create the second snapshot, used for the incremental send, before
        # doing the file deduplication.
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
        # Now before creating the incremental send stream:
        #
        # 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
        #    will drop several extent items and add a new one, also updating
        #    the inode's iversion (sequence field in inode item) by 1, but not
        #    any other field of the inode;
        #
        # 2) Deduplicate into a different subrange of file foo in snapshot
        #    mysnap2. This will replace an extent item with a new one, also
        #    updating the inode's iversion by 1 but not any other field of the
        #    inode.
        #
        # After these two deduplication operations, the inode items, for file
        # foo, are identical in both snapshots, but we have different extent
        # items for this inode in both snapshots. We want to check this doesn't
        # cause send to fail with an error or produce an incorrect stream.
      
        $ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
        $ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
      
        # Create the incremental send stream.
        $ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
        ERROR: send ioctl failed with -5: Input/output error
      
      This issue started happening back in 2015 when deduplication was updated
      to not update the inode's ctime and mtime and update only the iversion.
      Back then we would hit a BUG_ON() in send, but later in 2016 send was
      updated to return -EIO and print the error message instead of doing the
      BUG_ON().
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
      Fixes: 1c919a5e ("btrfs: don't update mtime/ctime on deduped inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b4f9a1a8
  13. 02 7月, 2019 1 次提交
    • F
      Btrfs: prevent send failures and crashes due to concurrent relocation · 9e967495
      Filipe Manana 提交于
      Send always operates on read-only trees and always expected that while it
      is in progress, nothing changes in those trees. Due to that expectation
      and the fact that send is a read-only operation, it operates on commit
      roots and does not hold transaction handles. However relocation can COW
      nodes and leafs from read-only trees, which can cause unexpected failures
      and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
      COWed, the transaction used to COW it is committed, a new transaction
      starts, the extent previously used for that node/leaf gets allocated,
      possibly for another tree, and the respective extent buffer' content
      changes while send is still using it. When this happens send normally
      fails with EIO being returned to user space and messages like the
      following are found in dmesg/syslog:
      
        [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
        [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
      
      Other times, less often, we hit a BUG_ON() because an extent buffer that
      send is using used to be a node, and while send is still using it, it
      got COWed and got reused as a leaf while send is still using, producing
      the following trace:
      
       [ 3478.466280] ------------[ cut here ]------------
       [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
       [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
       [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
       [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
       [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
       [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
       [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
       [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       [ 3478.475840] FS:  00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
       [ 3478.476489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
       [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [ 3478.479003] Call Trace:
       [ 3478.479600]  ? do_raw_spin_unlock+0x49/0xc0
       [ 3478.480202]  tree_advance+0x173/0x1d0 [btrfs]
       [ 3478.480810]  btrfs_compare_trees+0x30c/0x690 [btrfs]
       [ 3478.481388]  ? process_extent+0x1280/0x1280 [btrfs]
       [ 3478.481954]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [ 3478.482510]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [ 3478.483062]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [ 3478.483581]  ? rq_clock_task+0x2e/0x60
       [ 3478.484086]  ? wake_up_new_task+0x1f3/0x370
       [ 3478.484582]  ? do_vfs_ioctl+0xa2/0x6f0
       [ 3478.485075]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [ 3478.485552]  do_vfs_ioctl+0xa2/0x6f0
       [ 3478.486016]  ? __fget+0x113/0x200
       [ 3478.486467]  ksys_ioctl+0x70/0x80
       [ 3478.486911]  __x64_sys_ioctl+0x16/0x20
       [ 3478.487337]  do_syscall_64+0x60/0x1b0
       [ 3478.487751]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
       (...)
       [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
       [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
       [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
       [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
       [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
      
      Another possibility, much less likely to happen, is that send will not
      fail but the contents of the stream it produces may not be correct.
      
      To avoid this, do not allow send and relocation (balance) to run in
      parallel. In the long term the goal is to allow for both to be able to
      run concurrently without any problems, but that will take a significant
      effort in development and testing.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9e967495
  14. 01 7月, 2019 1 次提交
  15. 29 5月, 2019 2 次提交
    • F
      Btrfs: incremental send, fix emission of invalid clone operations · 3c850b45
      Filipe Manana 提交于
      When doing an incremental send we can now issue clone operations with a
      source range that ends at the source's file eof and with a destination
      range that ends at an offset smaller then the destination's file eof.
      If the eof of the source file is not aligned to the sector size of the
      filesystem, the receiver will get a -EINVAL error when trying to do the
      operation or, on older kernels, silently corrupt the destination file.
      The corruption happens on kernels without commit ac765f83
      ("Btrfs: fix data corruption due to cloning of eof block"), while the
      failure to clone happens on kernels with that commit.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xb1 0 2M" /mnt/sdb/foo
        $ xfs_io -f -c "pwrite -S 0xc7 0 2M" /mnt/sdb/bar
        $ xfs_io -f -c "pwrite -S 0x4d 0 2M" /mnt/sdb/baz
        $ xfs_io -f -c "pwrite -S 0xe2 0 2M" /mnt/sdb/zoo
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      
        $ btrfs send -f /tmp/base.send /mnt/sdb/base
      
        $ xfs_io -c "reflink /mnt/sdb/bar 1560K 500K 100K" /mnt/sdb/bar
        $ xfs_io -c "reflink /mnt/sdb/bar 1560K 0 100K" /mnt/sdb/zoo
        $ xfs_io -c "truncate 550K" /mnt/sdb/bar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      
        $ btrfs send -f /tmp/incr.send -p /mnt/sdb/base /mnt/sdb/incr
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/base.send /mnt/sdc
        $ btrfs receive -vv -f /tmp/incr.send /mnt/sdc
        (...)
        truncate bar size=563200
        utimes bar
        clone zoo - source=bar source offset=512000 offset=0 length=51200
        ERROR: failed to clone extents to zoo
        Invalid argument
      
      The failure happens because the clone source range ends at the eof of file
      bar, 563200, which is not aligned to the filesystems sector size (4Kb in
      this case), and the destination range ends at offset 0 + 51200, which is
      less then the size of the file zoo (2Mb).
      
      So fix this by detecting such case and instead of issuing a clone
      operation for the whole range, do a clone operation for smaller range
      that is sector size aligned followed by a write operation for the block
      containing the eof. Here we will always be pessimistic and assume the
      destination filesystem of the send stream has the largest possible sector
      size (64Kb), since we have no way of determining it.
      
      This fixes a recent regression introduced in kernel 5.2-rc1.
      
      Fixes: 040ee612 ("Btrfs: send, improve clone range")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3c850b45
    • F
      Btrfs: incremental send, fix file corruption when no-holes feature is enabled · 6b1f72e5
      Filipe Manana 提交于
      When using the no-holes feature, if we have a file with prealloc extents
      with a start offset beyond the file's eof, doing an incremental send can
      cause corruption of the file due to incorrect hole detection. Such case
      requires that the prealloc extent(s) exist in both the parent and send
      snapshots, and that a hole is punched into the file that covers all its
      extents that do not cross the eof boundary.
      
      Example reproducer:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xab 0 500K" /mnt/sdb/foobar
        $ xfs_io -c "falloc -k 1200K 800K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
      
        $ btrfs send -f /tmp/base.snap /mnt/sdb/base
      
        $ xfs_io -c "fpunch 0 500K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
      
        $ btrfs send -p /mnt/sdb/base -f /tmp/incr.snap /mnt/sdb/incr
      
        $ md5sum /mnt/sdb/incr/foobar
        816df6f64deba63b029ca19d880ee10a   /mnt/sdb/incr/foobar
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/base.snap /mnt/sdc
        $ btrfs receive -f /tmp/incr.snap /mnt/sdc
      
        $ md5sum /mnt/sdc/incr/foobar
        cf2ef71f4a9e90c2f6013ba3b2257ed2   /mnt/sdc/incr/foobar
      
          --> Different checksum, because the prealloc extent beyond the
              file's eof confused the hole detection code and it assumed
              a hole starting at offset 0 and ending at the offset of the
              prealloc extent (1200Kb) instead of ending at the offset
              500Kb (the file's size).
      
      Fix this by ensuring we never cross the file's size when issuing the
      write operations for a hole.
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      CC: stable@vger.kernel.org # 3.14+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b1f72e5
  16. 30 4月, 2019 3 次提交
    • F
      Btrfs: fix race between send and deduplication that lead to failures and crashes · 62d54f3a
      Filipe Manana 提交于
      Send operates on read only trees and expects them to never change while it
      is using them. This is part of its initial design, and this expection is
      due to two different reasons:
      
      1) When it was introduced, no operations were allowed to modifiy read-only
         subvolumes/snapshots (including defrag for example).
      
      2) It keeps send from having an impact on other filesystem operations.
         Namely send does not need to keep locks on the trees nor needs to hold on
         to transaction handles and delay transaction commits. This ends up being
         a consequence of the former reason.
      
      However the deduplication feature was introduced later (on September 2013,
      while send was introduced in July 2012) and it allowed for deduplication
      with destination files that belong to read-only trees (subvolumes and
      snapshots).
      
      That means that having a send operation (either full or incremental) running
      in parallel with a deduplication that has the destination inode in one of
      the trees used by the send operation, can result in tree nodes and leaves
      getting freed and reused while send is using them. This problem is similar
      to the problem solved for the root nodes getting freed and reused when a
      snapshot is made against one tree that is currenly being used by a send
      operation, fixed in commits [1] and [2]. These commits explain in detail
      how the problem happens and the explanation is valid for any node or leaf
      that is not the root of a tree as well. This problem was also discussed
      and explained recently in a thread [3].
      
      The problem is very easy to reproduce when using send with large trees
      (snapshots) and just a few concurrent deduplication operations that target
      files in the trees used by send. A stress test case is being sent for
      fstests that triggers the issue easily. The most common error to hit is
      the send ioctl return -EIO with the following messages in dmesg/syslog:
      
       [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
       [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
      
      The first one is very easy to hit while the second one happens much less
      frequently, except for very large trees (in that test case, snapshots
      with 100000 files having large xattrs to get deep and wide trees).
      Less frequently, at least one BUG_ON can be hit:
      
       [1631742.130080] ------------[ cut here ]------------
       [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
       [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
       [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G    B D W         5.0.0-rc8-btrfs-next-45 #1
       [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
       [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
       [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
       [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
       [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
       [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       [1631742.138455] FS:  00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
       [1631742.139010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
       [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [1631742.141272] Call Trace:
       [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
       [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
       [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
       [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
       [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
       [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
       [1631742.146662]  ? reweight_entity+0x154/0x1a0
       [1631742.147135]  ? update_curr+0x20/0x2a0
       [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
       [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
       [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
       [1631742.149361]  ? __fget+0x113/0x200
       [1631742.149767]  ksys_ioctl+0x70/0x80
       [1631742.150159]  __x64_sys_ioctl+0x16/0x20
       [1631742.150543]  do_syscall_64+0x60/0x1b0
       [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [1631742.151326] RIP: 0033:0x7f4cd9f5add7
       (...)
       [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
       [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
       [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
       [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
       [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
      
      That BUG_ON happens because while send is using a node, that node is COWed
      by a concurrent deduplication, gets freed and gets reused as a leaf (because
      a transaction commit happened in between), so when it attempts to read a
      slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
      contents were wiped out and it now matches a leaf (which can even belong to
      some other tree now), hitting the BUG_ON(level == 0).
      
      Fix this concurrency issue by not allowing send and deduplication to run
      in parallel if both operate on the same readonly trees, returning EAGAIN
      to user space and logging an exlicit warning in dmesg/syslog.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
      [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      62d54f3a
    • F
      Btrfs: send, flush dellaloc in order to avoid data loss · 9f89d5de
      Filipe Manana 提交于
      When we set a subvolume to read-only mode we do not flush dellaloc for any
      of its inodes (except if the filesystem is mounted with -o flushoncommit),
      since it does not affect correctness for any subsequent operations - except
      for a future send operation. The send operation will not be able to see the
      delalloc data since the respective file extent items, inode item updates,
      backreferences, etc, have not hit yet the subvolume and extent trees.
      
      Effectively this means data loss, since the send stream will not contain
      any data from existing delalloc. Another problem from this is that if the
      writeback starts and finishes while the send operation is in progress, we
      have the subvolume tree being being modified concurrently which can result
      in send failing unexpectedly with EIO or hitting runtime errors, assertion
      failures or hitting BUG_ONs, etc.
      
      Simple reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ btrfs subvolume create /mnt/sv
        $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
      
        $ btrfs property set /mnt/sv ro true
        $ btrfs send -f /tmp/send.stream /mnt/sv
      
        $ od -t x1 -A d /mnt/sv/foo
        0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
        *
        0110592
      
        $ umount /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ btrfs receive -f /tmp/send.stream /mnt
        $ echo $?
        0
        $ od -t x1 -A d /mnt/sv/foo
        0000000
        # ---> empty file
      
      Since this a problem that affects send only, fix it in send by flushing
      dellaloc for all the roots used by the send operation before send starts
      to process the commit roots.
      
      This is a problem that affects send since it was introduced (commit
      31db9f7c ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
      but backporting it to older kernels has some dependencies:
      
      - For kernels between 3.19 and 4.20, it depends on commit 3cd24c69
        ("btrfs: use tagged writepage to mitigate livelock of snapshot") because
        the function btrfs_start_delalloc_snapshot() does not exist before that
        commit. So one has to either pick that commit or replace the calls to
        btrfs_start_delalloc_snapshot() in this patch with calls to
        btrfs_start_delalloc_inodes().
      
      - For kernels older than 3.19 it also requires commit e5fa8f86
        ("Btrfs: ensure send always works on roots without orphans") because
        it depends on the function ensure_commit_roots_uptodate() which that
        commits introduced.
      
      - No dependencies for 5.0+ kernels.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 3.19+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9f89d5de
    • R
      Btrfs: send, improve clone range · 040ee612
      Robbie Ko 提交于
      Improve clone_range in two scenarios.
      
      1. Remove the limit of inode size when find clone inodes We can do
         partial clone, so there is no need to limit the size of the candidate
         inode.  When clone a range, we clone the legal range only by bytenr,
         offset, len, inode size.
      
      2. In the scenarios of rewrite or clone_range, data_offset rarely
         matches exactly, so the chance of a clone is missed.
      
      e.g.
          1. Write a 1M file
              dd if=/dev/zero of=1M bs=1M count=1
      
          2. Clone 1M file
             cp --reflink 1M clone
      
          3. Rewrite 4k on the clone file
             dd if=/dev/zero of=clone bs=4k count=1 conv=notrunc
      
          The disk layout is as follows:
          item 16 key (257 EXTENT_DATA 0) itemoff 15353 itemsize 53
      	extent data disk byte 1103101952 nr 1048576
      	extent data offset 0 nr 1048576 ram 1048576
      	extent compression(none)
          ...
          item 22 key (258 EXTENT_DATA 0) itemoff 14959 itemsize 53
      	extent data disk byte 1104150528 nr 4096
      	extent data offset 0 nr 4096 ram 4096
      	extent compression(none)
          item 23 key (258 EXTENT_DATA 4096) itemoff 14906 itemsize 53
      	extent data disk byte 1103101952 nr 1048576
      	extent data offset 4096 nr 1044480 ram 1048576
      	extent compression(none)
      
      When send, inode 258 file offset 4096~1048576 (item 23) has a chance to
      clone_range, but because data_offset does not match inode 257 (item 16),
      it causes missed clone and can only transfer actual data.
      
      Improve the problem by judging whether the current data_offset has
      overlap with the file extent item, and if so, adjusting offset and
      extent_len so that we can clone correctly.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      040ee612
  17. 04 1月, 2019 1 次提交
    • L
      Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds 提交于
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96d4f267
  18. 17 12月, 2018 2 次提交
  19. 22 11月, 2018 1 次提交
    • R
      Btrfs: send, fix infinite loop due to directory rename dependencies · a4390aee
      Robbie Ko 提交于
      When doing an incremental send, due to the need of delaying directory move
      (rename) operations we can end up in infinite loop at
      apply_children_dir_moves().
      
      An example scenario that triggers this problem is described below, where
      directory names correspond to the numbers of their respective inodes.
      
      Parent snapshot:
      
       .
       |--- 261/
             |--- 271/
                   |--- 266/
                         |--- 259/
                         |--- 260/
                         |     |--- 267
                         |
                         |--- 264/
                         |     |--- 258/
                         |           |--- 257/
                         |
                         |--- 265/
                         |--- 268/
                         |--- 269/
                         |     |--- 262/
                         |
                         |--- 270/
                         |--- 272/
                         |     |--- 263/
                         |     |--- 275/
                         |
                         |--- 274/
                               |--- 273/
      
      Send snapshot:
      
       .
       |-- 275/
            |-- 274/
                 |-- 273/
                      |-- 262/
                           |-- 269/
                                |-- 258/
                                     |-- 271/
                                          |-- 268/
                                               |-- 267/
                                                    |-- 270/
                                                         |-- 259/
                                                         |    |-- 265/
                                                         |
                                                         |-- 272/
                                                              |-- 257/
                                                                   |-- 260/
                                                                   |-- 264/
                                                                        |-- 263/
                                                                             |-- 261/
                                                                                  |-- 266/
      
      When processing inode 257 we delay its move (rename) operation because its
      new parent in the send snapshot, inode 272, was not yet processed. Then
      when processing inode 272, we delay the move operation for that inode
      because inode 274 is its ancestor in the send snapshot. Finally we delay
      the move operation for inode 274 when processing it because inode 275 is
      its new parent in the send snapshot and was not yet moved.
      
      When finishing processing inode 275, we start to do the move operations
      that were previously delayed (at apply_children_dir_moves()), resulting in
      the following iterations:
      
      1) We issue the move operation for inode 274;
      
      2) Because inode 262 depended on the move operation of inode 274 (it was
         delayed because 274 is its ancestor in the send snapshot), we issue the
         move operation for inode 262;
      
      3) We issue the move operation for inode 272, because it was delayed by
         inode 274 too (ancestor of 272 in the send snapshot);
      
      4) We issue the move operation for inode 269 (it was delayed by 262);
      
      5) We issue the move operation for inode 257 (it was delayed by 272);
      
      6) We issue the move operation for inode 260 (it was delayed by 272);
      
      7) We issue the move operation for inode 258 (it was delayed by 269);
      
      8) We issue the move operation for inode 264 (it was delayed by 257);
      
      9) We issue the move operation for inode 271 (it was delayed by 258);
      
      10) We issue the move operation for inode 263 (it was delayed by 264);
      
      11) We issue the move operation for inode 268 (it was delayed by 271);
      
      12) We verify if we can issue the move operation for inode 270 (it was
          delayed by 271). We detect a path loop in the current state, because
          inode 267 needs to be moved first before we can issue the move
          operation for inode 270. So we delay again the move operation for
          inode 270, this time we will attempt to do it after inode 267 is
          moved;
      
      13) We issue the move operation for inode 261 (it was delayed by 263);
      
      14) We verify if we can issue the move operation for inode 266 (it was
          delayed by 263). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12);
      
      15) We issue the move operation for inode 267 (it was delayed by 268);
      
      16) We verify if we can issue the move operation for inode 266 (it was
          delayed by 270). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12). So here we added
          again the same delayed move operation that we added in step 14;
      
      17) We attempt again to see if we can issue the move operation for inode
          266, and as in step 16, we realize we can not due to a path loop in
          the current state due to a dependency on inode 270. Again we delay
          inode's 266 rename to happen after inode's 270 move operation, adding
          the same dependency to the empty stack that we did in steps 14 and 16.
          The next iteration will pick the same move dependency on the stack
          (the only entry) and realize again there is still a path loop and then
          again the same dependency to the stack, over and over, resulting in
          an infinite loop.
      
      So fix this by preventing adding the same move dependency entries to the
      stack by removing each pending move record from the red black tree of
      pending moves. This way the next call to get_pending_dir_moves() will
      not return anything for the current parent inode.
      
      A test case for fstests, with this reproducer, follows soon.
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [Wrote changelog with example and more clear explanation]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a4390aee