1. 22 May, 2019 1 commit
    • F
      Btrfs: send, flush dellaloc in order to avoid data loss · 74ca0a76
      Filipe Manana committed
      commit 9f89d5de8631c7930898a601b6612e271aa2261c upstream.
      
      When we set a subvolume to read-only mode we do not flush delalloc for any
      of its inodes (except if the filesystem is mounted with -o flushoncommit),
      since it does not affect correctness for any subsequent operations - except
      for a future send operation. The send operation will not be able to see the
      delalloc data since the respective file extent items, inode item updates,
      backreferences, etc, have not yet hit the subvolume and extent trees.
      
      Effectively this means data loss, since the send stream will not contain
      any data from existing delalloc. Another problem from this is that if the
      writeback starts and finishes while the send operation is in progress, we
      have the subvolume tree being modified concurrently which can result
      in send failing unexpectedly with EIO or hitting runtime errors, assertion
      failures or hitting BUG_ONs, etc.
      
      Simple reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ btrfs subvolume create /mnt/sv
        $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo
      
        $ btrfs property set /mnt/sv ro true
        $ btrfs send -f /tmp/send.stream /mnt/sv
      
        $ od -t x1 -A d /mnt/sv/foo
        0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
        *
        0110592
      
        $ umount /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ btrfs receive -f /tmp/send.stream /mnt
        $ echo $?
        0
        $ od -t x1 -A d /mnt/sv/foo
        0000000
        # ---> empty file
      
      Since this is a problem that affects send only, fix it in send by flushing
      delalloc for all the roots used by the send operation before send starts
      to process the commit roots.
      
      This is a problem that affects send since it was introduced (commit
      31db9f7c ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"))
      but backporting it to older kernels has some dependencies:
      
      - For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2
        ("btrfs: use tagged writepage to mitigate livelock of snapshot") because
        the function btrfs_start_delalloc_snapshot() does not exist before that
        commit. So one has to either pick that commit or replace the calls to
        btrfs_start_delalloc_snapshot() in this patch with calls to
        btrfs_start_delalloc_inodes().
      
      - For kernels older than 3.19 it also requires commit e5fa8f86
        ("Btrfs: ensure send always works on roots without orphans") because
        it depends on the function ensure_commit_roots_uptodate() which that
        commit introduced.
      
      - No dependencies for 5.0+ kernels.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 3.19+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 17 Dec, 2018 1 commit
    • R
      Btrfs: send, fix infinite loop due to directory rename dependencies · 91f6a9aa
      Robbie Ko committed
      [ Upstream commit a4390aee ]
      
      When doing an incremental send, due to the need of delaying directory move
      (rename) operations we can end up in an infinite loop at
      apply_children_dir_moves().
      
      An example scenario that triggers this problem is described below, where
      directory names correspond to the numbers of their respective inodes.
      
      Parent snapshot:
      
       .
       |--- 261/
             |--- 271/
                   |--- 266/
                         |--- 259/
                         |--- 260/
                         |     |--- 267
                         |
                         |--- 264/
                         |     |--- 258/
                         |           |--- 257/
                         |
                         |--- 265/
                         |--- 268/
                         |--- 269/
                         |     |--- 262/
                         |
                         |--- 270/
                         |--- 272/
                         |     |--- 263/
                         |     |--- 275/
                         |
                         |--- 274/
                               |--- 273/
      
      Send snapshot:
      
       .
       |-- 275/
            |-- 274/
                 |-- 273/
                      |-- 262/
                           |-- 269/
                                |-- 258/
                                     |-- 271/
                                          |-- 268/
                                               |-- 267/
                                                    |-- 270/
                                                         |-- 259/
                                                         |    |-- 265/
                                                         |
                                                         |-- 272/
                                                              |-- 257/
                                                                   |-- 260/
                                                                   |-- 264/
                                                                        |-- 263/
                                                                             |-- 261/
                                                                                  |-- 266/
      
      When processing inode 257 we delay its move (rename) operation because its
      new parent in the send snapshot, inode 272, was not yet processed. Then
      when processing inode 272, we delay the move operation for that inode
      because inode 274 is its ancestor in the send snapshot. Finally we delay
      the move operation for inode 274 when processing it because inode 275 is
      its new parent in the send snapshot and was not yet moved.
      
      When finishing processing inode 275, we start to do the move operations
      that were previously delayed (at apply_children_dir_moves()), resulting in
      the following iterations:
      
      1) We issue the move operation for inode 274;
      
      2) Because inode 262 depended on the move operation of inode 274 (it was
         delayed because 274 is its ancestor in the send snapshot), we issue the
         move operation for inode 262;
      
      3) We issue the move operation for inode 272, because it was delayed by
         inode 274 too (ancestor of 272 in the send snapshot);
      
      4) We issue the move operation for inode 269 (it was delayed by 262);
      
      5) We issue the move operation for inode 257 (it was delayed by 272);
      
      6) We issue the move operation for inode 260 (it was delayed by 272);
      
      7) We issue the move operation for inode 258 (it was delayed by 269);
      
      8) We issue the move operation for inode 264 (it was delayed by 257);
      
      9) We issue the move operation for inode 271 (it was delayed by 258);
      
      10) We issue the move operation for inode 263 (it was delayed by 264);
      
      11) We issue the move operation for inode 268 (it was delayed by 271);
      
      12) We verify if we can issue the move operation for inode 270 (it was
          delayed by 271). We detect a path loop in the current state, because
          inode 267 needs to be moved first before we can issue the move
          operation for inode 270. So we delay again the move operation for
          inode 270, this time we will attempt to do it after inode 267 is
          moved;
      
      13) We issue the move operation for inode 261 (it was delayed by 263);
      
      14) We verify if we can issue the move operation for inode 266 (it was
          delayed by 263). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12);
      
      15) We issue the move operation for inode 267 (it was delayed by 268);
      
      16) We verify if we can issue the move operation for inode 266 (it was
          delayed by 270). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12). So here we added
          again the same delayed move operation that we added in step 14;
      
      17) We attempt again to see if we can issue the move operation for inode
          266, and as in step 16, we realize we can not due to a path loop in
          the current state due to a dependency on inode 270. Again we delay
          inode 266's rename to happen after inode 270's move operation, adding
          the same dependency to the empty stack that we did in steps 14 and 16.
          The next iteration will pick the same move dependency on the stack
          (the only entry), realize again there is still a path loop and then
          add again the same dependency to the stack, over and over, resulting
          in an infinite loop.
      
      So fix this by preventing the same move dependency entries from being
      added to the stack again: remove each pending move record from the
      red-black tree of pending moves. This way the next call to
      get_pending_dir_moves() will not return anything for the current parent
      inode.
      
      A test case for fstests, with this reproducer, follows soon.
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [Wrote changelog with example and more clear explanation]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 06 Aug, 2018 4 commits
    • F
      Btrfs: send, fix incorrect file layout after hole punching beyond eof · 22d3151c
      Filipe Manana committed
      When doing an incremental send, if we have a file in the parent snapshot
      that has prealloc extents beyond EOF and in the send snapshot it got a
      hole punch that partially covers the prealloc extents, the send stream,
      when replayed by a receiver, can result in a file that has a size bigger
      than it should and filled with zeroes past the correct EOF.
      
      For example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "falloc -k 0 4M" /mnt/foobar
        $ xfs_io -c "pwrite -S 0xea 0 1M" /mnt/foobar
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
        $ btrfs send -f /tmp/1.send /mnt/snap1
      
        $ xfs_io -c "fpunch 1M 2M" /mnt/foobar
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
        $ btrfs send -f /tmp/2.send -p /mnt/snap1 /mnt/snap2
      
        $ stat --format %s /mnt/snap2/foobar
        1048576
        $ md5sum /mnt/snap2/foobar
        d31659e82e87798acd4669a1e0a19d4f  /mnt/snap2/foobar
      
        $ umount /mnt
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ btrfs receive -f /tmp/1.send /mnt
        $ btrfs receive -f /tmp/2.send /mnt
      
        $ stat --format %s /mnt/snap2/foobar
        3145728
        # --> should be 1MiB and not 3MiB (which was the end offset of hole
        #     punch operation)
        $ md5sum /mnt/snap2/foobar
        117baf295297c2a995f92da725b0b651  /mnt/snap2/foobar
        # --> should be d31659e82e87798acd4669a1e0a19d4f as in the original fs
      
      This issue actually happens only since commit ffa7c429 ("Btrfs: send,
      do not issue unnecessary truncate operations"), but before that commit we
      were issuing a write operation full of zeroes (to "punch" a hole) which
      was extending the file size beyond the correct value and then immediately
      issuing a truncate operation to the correct size, undoing the previous
      write operation. Since the send protocol does not support fallocate, for
      extent preallocation and hole punching, fix this by not even attempting
      to send a "hole" (regular write full of zeroes) if it starts at an offset
      greater than or equal to the file's size. This approach, besides being
      much simpler than making send issue the truncate operation, adds the
      benefit of avoiding the useless pair of write of zeroes and truncate
      operations, saving time and IO at the receiver and reducing the size of
      the send stream.
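
The effect on the receiver can be sketched with a small model. This is an illustration only, not the kernel code: the operation tuples, `final_size()` and `hole_ops()` are hypothetical names, and the model assumes a write extends the file to `offset + length` while a truncate sets the size directly.

```python
MIB = 1024 * 1024


def final_size(ops, size):
    """Return the file size after replaying (kind, offset, length) ops."""
    for kind, offset, length in ops:
        if kind == "write":
            # A write past EOF extends the file.
            size = max(size, offset + length)
        elif kind == "truncate":
            # For a truncate, 'offset' carries the new size.
            size = offset
    return size


def hole_ops(hole_start, hole_len, file_size, skip_beyond_eof):
    """Ops a sender would emit for a hole (hypothetical helper)."""
    if skip_beyond_eof and hole_start >= file_size:
        return []  # the fix: no zero-filled write at or past EOF
    return [("write", hole_start, hole_len)]


# Values from the example: 1 MiB file, hole punched over [1M, 3M).
size_without_fix = final_size(hole_ops(MIB, 2 * MIB, MIB, False), MIB)
size_with_fix = final_size(hole_ops(MIB, 2 * MIB, MIB, True), MIB)
```

With the example's numbers the unguarded path ends at 3145728 bytes, the size reported on the receiving filesystem, while the guarded path keeps the file at 1048576 bytes.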
      
      A test case for fstests follows soon.
      
      Fixes: ffa7c429 ("Btrfs: send, do not issue unnecessary truncate operations")
      CC: stable@vger.kernel.org # 4.17+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: fix send failure when root has deleted files still open · 46b2f459
      Filipe Manana committed
      The more common use case of send involves creating a RO snapshot and then
      using it for a send operation. In this case it's not possible to have inodes
      in the snapshot that have a link count of zero (inode with an orphan item)
      since during snapshot creation we do the orphan cleanup. However, other
      less common use cases for send can end up seeing inodes with a link count
      of zero and in this case the send operation fails with an ENOENT error
      because any attempt to generate a path for the inode, with the purpose
      of creating it or updating it at the receiver, fails since there are no
      inode reference items. One use case is to use a regular subvolume for
      a send operation after turning it to RO mode or turning a RW snapshot
      into RO mode and then using it for a send operation. In both cases, if a
      file gets all its hard links deleted while there is an open file
      descriptor before turning the subvolume/snapshot into RO mode, the send
      operation will encounter an inode with a link count of zero and then
      fail with errno ENOENT.
      
      Example using a full send with a subvolume:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ btrfs subvolume create /mnt/sv1
        $ touch /mnt/sv1/foo
        $ touch /mnt/sv1/bar
      
        # keep an open file descriptor on file bar
        $ exec 73</mnt/sv1/bar
        $ unlink /mnt/sv1/bar
      
        # Turn the subvolume to RO mode and use it for a full send, while
        # holding the open file descriptor.
        $ btrfs property set /mnt/sv1 ro true
      
        $ btrfs send -f /tmp/full.send /mnt/sv1
        At subvol /mnt/sv1
        ERROR: send ioctl failed with -2: No such file or directory
      
      Example using an incremental send with snapshots:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ btrfs subvolume create /mnt/sv1
        $ touch /mnt/sv1/foo
        $ touch /mnt/sv1/bar
      
        $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
      
        $ echo "hello world" >> /mnt/sv1/bar
      
        $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap2
      
        # Turn the second snapshot to RW mode and delete file foo while
        # holding an open file descriptor on it.
        $ btrfs property set /mnt/snap2 ro false
        $ exec 73</mnt/snap2/foo
        $ unlink /mnt/snap2/foo
      
        # Set the second snapshot back to RO mode and do an incremental send.
        $ btrfs property set /mnt/snap2 ro true
      
        $ btrfs send -f /tmp/inc.send -p /mnt/snap1 /mnt/snap2
        At subvol /mnt/snap2
        ERROR: send ioctl failed with -2: No such file or directory
      
      So fix this by ignoring inodes with a link count of zero if we are either
      doing a full send or if they do not exist in the parent snapshot (they
      are new in the send snapshot), and unlink all paths found in the parent
      snapshot when doing an incremental send (and ignoring all other inode
      items, such as xattrs and extents).
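
The rule described above can be written down as a small decision function. This is a sketch of the stated policy only; the function name and the returned labels are hypothetical and do not correspond to identifiers in send.c.

```python
def plan_for_inode(nlink, full_send, exists_in_parent):
    """Decide how send treats an inode, following the rule in the text."""
    if nlink != 0:
        return "process"  # normal inode, handled as usual
    if full_send or not exists_in_parent:
        # Nothing exists at the receiver for this inode, so skip it.
        return "ignore"
    # Incremental send and the inode existed in the parent snapshot:
    # remove its old paths, ignore its xattrs, extents and other items.
    return "unlink_parent_paths"


# First example: full send, file "bar" unlinked while held open.
full_send_case = plan_for_inode(0, full_send=True, exists_in_parent=False)
# Second example: incremental send, "foo" exists in snap1.
incremental_case = plan_for_inode(0, full_send=False, exists_in_parent=True)
```

In both reproducers the inode no longer produces a path lookup, which is the operation that failed with ENOENT.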
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: Martin Wilck <martin.wilck@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: remove unused key assignment when doing a full send · ca5d2ba1
      Filipe Manana committed
      At send.c:full_send_tree() we were setting the 'key' variable in the loop
      while never using it later. We were also using two btrfs_key variables
      to store the initial key for search and the key found in every iteration
      of the loop. So remove this useless key assignment and use the same
      btrfs_key variable to store the initial search key and the key found in
      each iteration. This was introduced in the initial send commit but was
      never used (commit 31db9f7c ("Btrfs: introduce BTRFS_IOC_SEND for
      btrfs send/receive")).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: Get rid of the confusing btrfs_file_extent_inline_len · e41ca589
      Qu Wenruo committed
      We used to call btrfs_file_extent_inline_len() to get the uncompressed
      data size of an inlined extent.
      
      However, this function is hiding evil: for a compressed extent, it has no
      choice but to read ram_bytes directly from btrfs_file_extent_item, while
      for an uncompressed extent it uses the item size to calculate the real
      data size and ignores ram_bytes completely.
      
      In fact, for corrupted ram_bytes, due to the above behavior the kernel's
      btrfs_print_leaf() can't even print the correct ram_bytes to expose the bug.
      
      Since we have the tree-checker to verify all EXTENT_DATA, such a mismatch
      can be detected pretty easily, thus we can trust ram_bytes without the
      evil btrfs_file_extent_inline_len().
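
A small model may make the hidden mismatch concrete. It is a sketch under assumptions: the 21-byte header size, the function names and the consistency check are illustrative and are not taken from this commit.

```python
# Assumption: bytes of the file extent item that precede the inline payload.
INLINE_DATA_START = 21


def old_inline_len(item_size, ram_bytes, compressed):
    """Model of the removed helper's two code paths."""
    if compressed:
        return ram_bytes  # no other source for the uncompressed size
    # Uncompressed: derived from the item size, ram_bytes is never read.
    return item_size - INLINE_DATA_START


def inline_extent_consistent(item_size, ram_bytes, compressed):
    """Model of a tree-checker style sanity check (assumed form)."""
    if compressed:
        return True  # on-disk payload size legitimately differs
    return item_size == INLINE_DATA_START + ram_bytes


# An uncompressed inline extent with 100 payload bytes and a corrupted
# ram_bytes field.
item_size = INLINE_DATA_START + 100
reported_by_old_helper = old_inline_len(item_size, 9999, compressed=False)
caught_by_checker = not inline_extent_consistent(item_size, 9999, False)
```

The old helper reports 100 for this item, so the bogus value 9999 never surfaces, whereas a check that compares the item size with ram_bytes flags it.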
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 29 May, 2018 3 commits
  5. 02 May, 2018 1 commit
    • F
      Btrfs: send, fix missing truncate for inode with prealloc extent past eof · a6aa10c7
      Filipe Manana committed
      An incremental send operation can miss a truncate operation when an inode
      has an increased size in the send snapshot and a prealloc extent beyond
      its size.
      
      Consider the following scenario where a necessary truncate operation is
      missing in the incremental send stream:
      
      1) In the parent snapshot an inode has a size of 1282957 bytes and it has
         no prealloc extents beyond its size;
      
      2) In the send snapshot it has a size of 5738496 bytes and has a new
         extent at offset 1884160 (length of 106496 bytes) and a prealloc
         extent beyond eof at offset 6729728 (and a length of 339968 bytes);
      
      3) When processing the prealloc extent, at offset 6729728, we end up at
         send.c:send_write_or_clone() and set the @len variable to a value of
         18446744073708560384 because @offset plus the original @len value is
         larger than the inode's size (6729728 + 339968 > 5738496). We then
         call send_extent_data(), with that @offset and @len, which in turn
         calls send_write(), and then the latter calls fill_read_buf(). Because
         the offset passed to fill_read_buf() is greater than the inode's i_size,
         this function returns 0 immediately, which makes send_write() and
         send_extent_data() do nothing and return immediately as well. When
         we get back to send.c:send_write_or_clone() we adjust the value
         of sctx->cur_inode_next_write_offset to @offset plus @len, which
         corresponds to 6729728 + 18446744073708560384 = 5738496, which is
         precisely the size of the inode in the send snapshot;
      
      4) Later when at send.c:finish_inode_if_needed() we determine that
         we don't need to issue a truncate operation because the value of
         sctx->cur_inode_next_write_offset corresponds to the inode's new
         size, 5738496 bytes. This is wrong because the last write operation
         that was issued started at offset 1884160 with a length of 106496
         bytes, so the correct value for sctx->cur_inode_next_write_offset
         should be 1990656 (1884160 + 106496), so that a truncate operation
         with a value of 5738496 bytes would have been sent to insert a
         trailing hole at the destination.
      
      So fix the issue by making send.c:send_write_or_clone() not attempt
      to send write or clone operations for extents that start beyond the
      inode's size, since such attempts do nothing but waste time by
      calling helper functions and allocating path structures, and send
      currently has no fallocate command in order to create prealloc extents
      at the destination (either beyond a file's eof or not).
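
The arithmetic in step 3 relies on unsigned 64-bit wraparound, which can be reproduced with a mask. This is a sketch and not the kernel code: `clamp_len()` and `next_write_offset()` are hypothetical names, and the `offset >= i_size` guard is an assumed reading of "extents that start beyond the inode's size".

```python
U64_MASK = (1 << 64) - 1  # emulate the kernel's unsigned 64-bit arithmetic


def clamp_len(offset, length, i_size):
    """Clamp an extent to i_size as in step 3 (wraps when offset > i_size)."""
    if offset + length > i_size:
        length = (i_size - offset) & U64_MASK
    return length


def next_write_offset(prev, offset, length, i_size, skip_beyond_eof):
    """Return the tracked next-write offset after handling one extent."""
    if skip_beyond_eof and offset >= i_size:
        return prev  # nothing is sent, so the tracked offset is untouched
    length = clamp_len(offset, length, i_size)
    return (offset + length) & U64_MASK


I_SIZE = 5738496
# The regular extent at 1884160 is written first.
after_data_extent = next_write_offset(0, 1884160, 106496, I_SIZE, False)
# Then the prealloc extent beyond EOF is processed.
buggy = next_write_offset(after_data_extent, 6729728, 339968, I_SIZE, False)
fixed = next_write_offset(after_data_extent, 6729728, 339968, I_SIZE, True)
```

Without the guard the tracked offset wraps around to 5738496, which equals the inode size and suppresses the truncate. With the guard it stays at 1990656, so the size mismatch remains visible.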
      
      The issue was found running the test btrfs/007 from fstests using a seed
      value of 1524346151 for fsstress.
      Reported-by: Gu, Jinxiang <gujx@cn.fujitsu.com>
      Fixes: ffa7c429 ("Btrfs: send, do not issue unnecessary truncate operations")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 12 Apr, 2018 1 commit
  7. 26 Mar, 2018 4 commits
    • L
      Btrfs: send: fix typo in TLV_PUT · 895a72be
      Liu Bo committed
      According to tlv_put()'s prototype, data and attrlen need to be
      exchanged in the macro, but it seems all callers are already aware of
      this misorder and are therefore not affected.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: send, do not issue unnecessary truncate operations · ffa7c429
      Filipe Manana committed
      When send finishes processing an inode representing a regular file, it
      always issues a truncate operation for that file, even if its size did
      not change or the last write sets the file size correctly. In the most
      common cases, the issued write operations set the file to the correct size
      (either full or incremental sends) or the file size did not change (for
      incremental sends), so the only case where a truncate operation is needed
      is when a file size becomes smaller in the send snapshot when compared
      to the parent snapshot.
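
One possible reading of that rule as a predicate is sketched below. It is an assumption-laden illustration: `needs_truncate()` and its parameters are hypothetical, and the actual check in send.c may be structured differently.

```python
def needs_truncate(new_size, last_write_end, parent_size=None):
    """Decide whether a truncate must follow the writes for a regular file.

    parent_size is None for a full send (there is no parent snapshot).
    """
    if last_write_end == new_size:
        return False  # the last write already left the file at its size
    if parent_size is not None and parent_size == new_size:
        return False  # incremental send and the size did not change
    return True


# Full send of a 5000 byte file written by a single write.
full_send_small_file = needs_truncate(5000, 5000)
# Incremental send where the file shrank from 8192 to 4096 bytes.
shrunk_file = needs_truncate(4096, 0, parent_size=8192)
# Incremental send where the size stayed at 8192 bytes.
unchanged_size = needs_truncate(8192, 0, parent_size=8192)
```

Only the shrunk file yields a truncate here, which matches the case the text singles out.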
      
      By not issuing unnecessary truncate operations we reduce the stream size
      and save time in the receiver. Currently truncating a file to the same
      size triggers writeback of its last page (if it's dirty) and waits for it
      to complete (only if the file size is not aligned with the filesystem's
      sector size). This is being fixed by another patch and is independent of
      this change (that patch's title is "Btrfs: skip writeback of last page
      when truncating file to same size").
      
      The following script was used to measure time spent by a receiver without
      this change applied, with this change applied, and without this change and
      with the truncate fix applied (the fix to not make it start and wait for
      writeback to complete).
      
        $ cat test_send.sh
        #!/bin/bash
      
        SRC_DEV=/dev/sdc
        DST_DEV=/dev/sdd
        SRC_MNT=/mnt/sdc
        DST_MNT=/mnt/sdd
      
        mkfs.btrfs -f $SRC_DEV >/dev/null
        mkfs.btrfs -f $DST_DEV >/dev/null
        mount $SRC_DEV $SRC_MNT
        mount $DST_DEV $DST_MNT
      
        echo "Creating source filesystem"
        for ((t = 0; t < 10; t++)); do
            (
                for ((i = 1; i <= 20000; i++)); do
                    xfs_io -f -c "pwrite -S 0xab 0 5000" \
                        $SRC_MNT/file_$i > /dev/null
                done
            ) &
           worker_pids[$t]=$!
        done
        wait ${worker_pids[@]}
      
        echo "Creating and sending snapshot"
        btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
        /usr/bin/time -f "send took %e seconds"    \
               btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
        /usr/bin/time -f "receive took %e seconds" \
               btrfs receive -f $SRC_MNT/send_file $DST_MNT
      
        umount $SRC_MNT
        umount $DST_MNT
      
      The results, which are averages for 5 runs for each case, were the
      following:
      
      * Without this change
      
      average receive time was 26.49 seconds
      standard deviation of 2.53 seconds
      
      * Without this change and with the truncate fix
      
      average receive time was 12.51 seconds
      standard deviation of 0.32 seconds
      
      * With this change and without the truncate fix
      
      average receive time was 10.02 seconds
      standard deviation of 1.11 seconds
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: add more __cold annotations · e67c718b
      David Sterba committed
      The __cold functions are placed in a special section, as they're
      expected to be called rarely. This could help i-cache prefetches or help
      the compiler decide which branches are more/less likely to be taken
      without any other annotations needed.
      
      Though we can't add more __exit annotations, it's still possible to add
      __cold (that's also added with __exit). That way the following function
      categories are tagged:
      
      - printf wrappers, error messages
      - exit helpers
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove custom crc32c init code · 9678c543
      Nikolay Borisov committed
      The custom crc32 init code was introduced in
      14a958e6 ("Btrfs: fix btrfs boot when compiled as built-in") to
      enable using btrfs as a built-in. However, later as pointed out by
      60efa5eb ("Btrfs: use late_initcall instead of module_init") this
      wasn't enough and finally btrfs was switched to late_initcall which
      comes after the generic crc32c implementation is initialised. The
      latter commit superseded the former. Now that we don't have to
      maintain our own code let's just remove it and switch to using the
      generic implementation.
      
      Despite touching a lot of files the patch is really simple. Here is the gist of
      the changes:
      
      1. Select LIBCRC32C rather than the low-level modules.
      2. s/btrfs_crc32c/crc32c/g
      3. replace hash.h with linux/crc32c.h
      4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
      
      I've tested this with btrfs being both a module and a built-in and xfstest
      doesn't complain.
      
      This does seem to fix the longstanding problem of not automatically
      selecting the crc32c module when btrfs is used. Possibly there is a
      workaround in dracut.
      
      The modinfo confirms that now all the module dependencies are there:
      
      before:
      depends:        zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
      
      after:
      depends:        libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add more info to changelog from mails ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 01 Mar, 2018 1 commit
    • F
      Btrfs: send, fix issuing write op when processing hole in no data mode · d4dfc0f4
      Filipe Manana committed
      When doing an incremental send of a filesystem with the no-holes feature
      enabled, we end up issuing a write operation when using the no data mode
      send flag, instead of issuing an update extent operation. Fix this by
      issuing the update extent operation instead.
      
      Trivial reproducer:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdc
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdc /mnt/sdc
        $ mount /dev/sdd /mnt/sdd
      
        $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
        $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
      
        $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
        $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
      
        $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
        $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
             | btrfs receive -vv /mnt/sdd
      
      Before this change the output of the second receive command is:
      
        receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
        utimes
        write foobar, offset 8192, len 8192
        utimes foobar
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
      
      After this change it is:
      
        receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
        utimes
        update_extent foobar: offset=8192, len=8192
        utimes foobar
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 22 Jan, 2018 1 commit
  10. 29 Nov, 2017 1 commit
    • F
      Btrfs: incremental send, fix wrong unlink path after renaming file · ea37d599
      Filipe Manana committed
      Under some circumstances, an incremental send operation can issue wrong
      paths for unlink commands related to files that have multiple hard links
      and some (or all) of those links were renamed between the parent and send
      snapshots. Consider the following example:
      
      Parent snapshot
      
       .                                                      (ino 256)
       |---- a/                                               (ino 257)
       |     |---- b/                                         (ino 259)
       |     |     |---- c/                                   (ino 260)
       |     |     |---- f2                                   (ino 261)
       |     |
       |     |---- f2l1                                       (ino 261)
       |
       |---- d/                                               (ino 262)
             |---- f1l1_2                                     (ino 258)
             |---- f2l2                                       (ino 261)
             |---- f1_2                                       (ino 258)
      
      Send snapshot
      
       .                                                      (ino 256)
       |---- a/                                               (ino 257)
       |     |---- f2l1/                                      (ino 263)
       |             |---- b2/                                (ino 259)
       |                   |---- c/                           (ino 260)
       |                   |     |---- d3                     (ino 262)
       |                   |           |---- f1l1_2           (ino 258)
       |                   |           |---- f2l2_2           (ino 261)
       |                   |           |---- f1_2             (ino 258)
       |                   |
       |                   |---- f2                           (ino 261)
       |                   |---- f1l2                         (ino 258)
       |
       |---- d                                                (ino 261)
      
      When computing the incremental send stream the following steps happen:
      
      1) When processing inode 261, a rename operation is issued that renames
         inode 262, which currently has a path of "d", to an orphan name of
         "o262-7-0". This is done because in the send snapshot, inode 261 has
         one of its hard links with a path of "d" as well.
      
      2) Two link operations are issued that create the new hard links for
         inode 261, whose names are "d" and "f2l2_2", at paths "/" and
         "o262-7-0/" respectively.
      
      3) Still while processing inode 261, unlink operations are issued to
         remove the old hard links of inode 261, with names "f2l1" and "f2l2",
         at paths "a/" and "d/". However path "d/" does not correspond anymore
         to the directory inode 262 but corresponds instead to a hard link of
         inode 261 (link command issued in the previous step). This makes the
         receiver fail with a ENOTDIR error when attempting the unlink
         operation.
      
      The problem happens because before sending the unlink operation, we failed
      to detect that inode 262 was one of the ancestors of inode 261 in the parent
      snapshot, and therefore we didn't recompute the path for inode 262 before
      issuing the unlink operation for the link named "f2l2" of inode 261. The
      detection failed because the function "is_ancestor()" only follows the
      first hard link it finds for an inode instead of all of its hard links
      (as it was originally created for being used with directories only, for
      which only one hard link exists). So fix this by making "is_ancestor()"
      follow all hard links of the input inode.
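
      The difference between the two lookups can be sketched in Python over
      the parent snapshot above. This is a toy model, not the kernel code:
      the PARENT_REFS table, ROOT_INO and both function names are made up for
      illustration, and the order of the links of inode 261 (lowest parent
      inode first) is an assumption.

```python
ROOT_INO = 256  # inode number of the subvolume root in the example

# Hypothetical model of the parent snapshot:
# inode -> list of (parent directory inode, link name).
PARENT_REFS = {
    257: [(256, "a")],
    258: [(262, "f1l1_2"), (262, "f1_2")],
    259: [(257, "b")],
    260: [(259, "c")],
    261: [(257, "f2l1"), (259, "f2"), (262, "f2l2")],
    262: [(256, "d")],
}


def is_ancestor_first_link(refs, ino, candidate):
    """Old behaviour: walk up following only the first link of each inode."""
    while ino != ROOT_INO:
        parent = refs[ino][0][0]
        if parent == candidate:
            return True
        ino = parent
    return False


def is_ancestor_all_links(refs, ino, candidate):
    """Fixed behaviour: walk up through every hard link that is found."""
    seen = set()
    stack = [ino]
    while stack:
        cur = stack.pop()
        if cur == ROOT_INO or cur in seen:
            continue
        seen.add(cur)
        for parent, _name in refs[cur]:
            if parent == candidate:
                return True
            stack.append(parent)
    return False


print(is_ancestor_first_link(PARENT_REFS, 261, 262))
print(is_ancestor_all_links(PARENT_REFS, 261, 262))
```

      In this model the first-link walk goes 261 -> 257 -> 256 and never sees
      directory 262, while the all-links walk finds it through "f2l2".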
      
      A test case for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 02 Nov, 2017 2 commits
    • Z
      btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents · c995ab3c
      Zygo Blaxell committed
      The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
      offset (encoded as a single logical address) to a list of extent refs.
      LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
      (extent ref -> extent bytenr and offset, or logical address).  These are
      useful capabilities for programs that manipulate extents and extent
      references from userspace (e.g. dedup and defrag utilities).
      
      When the extents are uncompressed (and not encrypted and not other),
      check_extent_in_eb performs filtering of the extent refs to remove any
      extent refs which do not contain the same extent offset as the 'logical'
      parameter's extent offset.  This prevents LOGICAL_INO from returning
      references to more than a single block.
      
      To find the set of extent references to an uncompressed extent from [a, b),
      userspace has to run a loop like this pseudocode:
      
      	for (i = a; i < b; ++i)
      		extent_ref_set += LOGICAL_INO(i);
      
      At each iteration of the loop (up to 32768 iterations for a 128M extent),
      data we are interested in is collected in the kernel, then deleted by
      the filter in check_extent_in_eb.
      
      When the extents are compressed (or encrypted or other), the 'logical'
      parameter must be an extent bytenr (the 'a' parameter in the loop).
      No filtering by extent offset is done (or possible?) so the result is
      the complete set of extent refs for the entire extent.  This removes
      the need for the loop, since we get all the extent refs in one call.
      
      Add an 'ignore_offset' argument to iterate_inodes_from_logical,
      [...several levels of function call graph...], and check_extent_in_eb, so
      that we can disable the extent offset filtering for uncompressed extents.
      This flag can be set by an improved version of the LOGICAL_INO ioctl to
      get either behavior as desired.
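
      To make the cost of the loop concrete, here is a small Python model of
      the behaviour described above. It is only a sketch: the Ref tuple, the
      4 KiB BLOCK size and both functions are hypothetical stand-ins, and the
      filter merely imitates what the text says check_extent_in_eb does.

```python
from collections import namedtuple

BLOCK = 4096  # assumed block size in bytes

# Hypothetical extent reference: which inode maps which slice of the extent.
Ref = namedtuple("Ref", "ino file_offset extent_offset length")


def logical_ino(refs, extent_bytenr, logical, ignore_offset=False):
    """Toy stand-in for LOGICAL_INO: return the refs that cover 'logical'."""
    if ignore_offset:
        return set(refs)  # no filtering: every ref to the extent
    off = logical - extent_bytenr
    return {r for r in refs
            if r.extent_offset <= off < r.extent_offset + r.length}


def all_refs_by_loop(refs, extent_bytenr, extent_len):
    """Userspace loop from the message above: one call per block."""
    found, calls = set(), 0
    for logical in range(extent_bytenr, extent_bytenr + extent_len, BLOCK):
        found |= logical_ino(refs, extent_bytenr, logical)
        calls += 1
    return found, calls


refs = [Ref(257, 0, 0, 8192), Ref(258, 4096, 8192, 4096)]
found, calls = all_refs_by_loop(refs, 1 << 20, 12288)
print(calls, found == logical_ino(refs, 1 << 20, 1 << 20, ignore_offset=True))
```

      For a 128M extent the loop in this model makes 128M / 4K = 32768 calls,
      while a single call with the filter disabled returns the same set.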
      
      There is no functional change in this patch.  The new flag is always
      false.
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor coding style fixes ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: send: remove unused code · eb7b9d6a
      Nikolay Borisov committed
      This code was first introduced in 31db9f7c ("Btrfs: introduce
      BTRFS_IOC_SEND for btrfs send/receive") and it was not functional, then
      it got slightly refactored in e938c8ad ("Btrfs: code cleanups for
      send/receive"), alas it was still dead. So let's remove it for good!
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 30 Oct, 2017 4 commits
  13. 26 Sep, 2017 1 commit
  14. 05 Sep, 2017 1 commit
  15. 21 Aug, 2017 1 commit
    • F
      Btrfs: incremental send, fix emission of invalid clone operations · 72610b1b
      Filipe Manana committed
      When doing an incremental send it's possible that the computed send stream
      contains clone operations that will fail on the receiver if the receiver
      has compression enabled and the clone operations target a sector sized
      extent that starts at a zero file offset, is not compressed on the source
      filesystem but ends up being compressed and inlined at the destination
      filesystem.
      
      Example scenario:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
      
        # By doing a direct IO write, the data is not compressed.
        $ xfs_io -f -d -c "pwrite -S 0xab 0 4K" /mnt/foobar
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
      
        $ xfs_io -c "reflink /mnt/foobar 0 8K 4K" /mnt/foobar
        $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
        $ btrfs send -f /tmp/1.snap /mnt/mysnap1
        $ btrfs send -f /tmp/2.snap -p /mnt/mysnap1 /mnt/mysnap2
        $ umount /mnt
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount -o compress /dev/sdc /mnt
        $ btrfs receive -f /tmp/1.snap /mnt
        $ btrfs receive -f /tmp/2.snap /mnt
        ERROR: failed to clone extents to foobar
        Operation not supported
      
      The same could be achieved by mounting the source filesystem without
      compression and doing a buffered IO write instead of a direct IO one,
      and mounting the destination filesystem with compression enabled.
      
      So fix this by issuing regular write operations in the send stream
      instead of clone operations when the source offset is zero and the
      range has a length matching the sector size.
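
      The rule can be written as a tiny decision function. This Python sketch
      only restates the condition above for a range that would otherwise be
      cloned; the function name and the 4 KiB default sector size are
      assumptions made for the example.

```python
SECTOR_SIZE = 4096  # assumed sector size in bytes


def choose_send_op(source_offset, length, sector_size=SECTOR_SIZE):
    """Pick the send command for a range that would otherwise be cloned."""
    # A single sector at source offset 0 may end up compressed and inlined
    # on the receiver, where cloning it fails, so send the data instead.
    if source_offset == 0 and length == sector_size:
        return "write"
    return "clone"


# The reflink from the scenario above: source offset 0, length 4K.
print(choose_send_op(0, 4096))
```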
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  16. 16 Aug, 2017 1 commit
  17. 07 Jul, 2017 2 commits
    • F
      Btrfs: incremental send, fix invalid memory access · 24e52b11
      Filipe Manana committed
      When doing an incremental send, while processing an extent that changed
      between the parent and send snapshots and that extent was an inline extent
      in the parent snapshot, it's possible to access a memory region beyond
      the end of leaf if the inline extent is very small and it is the first
      item in a leaf.
      
      An example scenario is described below.
      
      The send snapshot has the following leaf:
      
       leaf 33865728 items 33 free space 773 generation 46 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              (...)
              item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 12791808 nr 4096
                      extent data offset 0 nr 4096 ram 4096
                      extent compression 0 (none)
              item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 138170368 nr 225280
                      extent data offset 0 nr 225280 ram 225280
                      extent compression 0 (none)
              (...)
      
      And the parent snapshot has the following leaf:
      
       leaf 31272960 items 17 free space 17 generation 31 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
                      generation 31 type 0 (inline)
                      inline extent data size 23 ram_bytes 613 compression 1 (zlib)
              (...)
      
      When computing the send stream, it is detected that the extent of inode
      335, at file offset 0, has changed, and at
      fs/btrfs/send.c:is_extent_unchanged() we grab the leaf from the parent
      snapshot and access the inline extent item.
      However, before jumping to the 'out' label, we access the 'offset' and
      'disk_bytenr' fields of the extent item, which should not be done for
      inline extents since the inlined data starts at the offset of the
      'disk_bytenr' field and can be very small. For example accessing the
      'offset' field of the file extent item results in the following trace:
      
      [  599.705368] general protection fault: 0000 [#1] PREEMPT SMP
      [  599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
      [  599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
      [  599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
      [  599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
      [  599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
      [  599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
      [  599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
      [  599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
      [  599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
      [  599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
      [  599.709340] FS:  00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
      [  599.709340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
      [  599.709340] Call Trace:
      [  599.709340]  btrfs_get_token_64+0x93/0xce [btrfs]
      [  599.709340]  ? printk+0x48/0x50
      [  599.709340]  btrfs_get_64+0xb/0xd [btrfs]
      [  599.709340]  process_extent+0x3a1/0x1106 [btrfs]
      [  599.709340]  ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
      [  599.709340]  changed_cb+0xb03/0xb3d [btrfs]
      [  599.709340]  ? btrfs_get_token_32+0x7a/0xcc [btrfs]
      [  599.709340]  btrfs_compare_trees+0x432/0x53d [btrfs]
      [  599.709340]  ? process_extent+0x1106/0x1106 [btrfs]
      [  599.709340]  btrfs_ioctl_send+0x960/0xe26 [btrfs]
      [  599.709340]  btrfs_ioctl+0x181b/0x1fed [btrfs]
      [  599.709340]  ? trace_hardirqs_on_caller+0x150/0x1ac
      [  599.709340]  vfs_ioctl+0x21/0x38
      [  599.709340]  ? vfs_ioctl+0x21/0x38
      [  599.709340]  do_vfs_ioctl+0x611/0x645
      [  599.709340]  ? rcu_read_unlock+0x5b/0x5d
      [  599.709340]  ? __fget+0x6d/0x79
      [  599.709340]  SyS_ioctl+0x57/0x7b
      [  599.709340]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  599.709340] RIP: 0033:0x7f51945eec47
      [  599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [  599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
      [  599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
      [  599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
      [  599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
      [  599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
      [  599.709340]  ? trace_hardirqs_off_caller+0x43/0xb1
      [  599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
      [  599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
      [  599.762057] ---[ end trace fe00d7af61b9f49e ]---
      
      This is because the 'offset' field starts at an offset of 37 bytes
      (offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
      bytes and therefore attempting to read it causes a 1 byte access beyond
      the end of the leaf, as the first item's content in a leaf is located
      at the tail of the leaf, the item size is 44 bytes and the offset of
      that field plus its length (37 + 8 = 45) goes beyond the item's size
      by 1 byte.
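
      The arithmetic can be checked with a short Python sketch. The field
      layout below is an assumption about the packed, little-endian
      struct btrfs_file_extent_item (it reproduces the offset of 37 quoted
      above); HEADER_FMT, FIELD_OFFSETS and overrun() are names invented for
      this example.

```python
import struct

# Assumed fixed header of struct btrfs_file_extent_item:
# generation (u64), ram_bytes (u64), compression (u8), encryption (u8),
# other_encoding (u16), type (u8). "<" means little-endian, no padding.
HEADER_FMT = "<QQBBHB"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # inline data starts here

# Fields that exist only for regular (non-inline) extents, 8 bytes each.
FIELD_OFFSETS = {
    "disk_bytenr": HEADER_SIZE,
    "disk_num_bytes": HEADER_SIZE + 8,
    "offset": HEADER_SIZE + 16,
    "num_bytes": HEADER_SIZE + 24,
}


def overrun(item_size, field, field_len=8):
    """Bytes read past the end of the item when accessing 'field'."""
    return max(0, FIELD_OFFSETS[field] + field_len - item_size)


print(FIELD_OFFSETS["offset"])               # offset of the 'offset' field
print(overrun(44, "offset"))                 # item from the parent leaf
print(overrun(HEADER_SIZE + 7, "disk_bytenr"))  # 7 bytes of inline data
```

      With an item size of 44 the 'offset' field overruns the item by one
      byte, and in this model an inline extent with fewer than 8 bytes of data
      makes even 'disk_bytenr' overrun it.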
      
      So fix this by accessing the 'offset' and 'disk_bytenr' fields after
      jumping to the 'out' label if we are processing an inline extent. We
      move the reading operation of the 'disk_bytenr' field too because we
      have the same problem as for the 'offset' field explained above when
      the inline data is less than 8 bytes. The access to the 'generation'
      field is also moved but just for the sake of grouping access to all
      the fields.
      
      Fixes: e1cbfd7b ("Btrfs: send, fix file hole not being preserved due to inline extent")
      Cc: <stable@vger.kernel.org>  # v4.12+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • F
      Btrfs: incremental send, fix invalid path for link commands · f5962781
      Filipe Manana committed
      In some scenarios an incremental send stream can contain link commands
      with an invalid target path. Such scenarios happen after moving some
      directory inode A, renaming a regular file inode B into the old name of
      inode A and finally creating a new hard link for inode B at directory
      inode A.
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
        .                                                      (ino 256)
        |
        |--- dir1/                                             (ino 257)
        |      |--- dir2/                                      (ino 258)
        |             |--- dir3/                               (ino 259)
        |                   |--- file1                         (ino 261)
        |                   |--- dir4/                         (ino 262)
        |
        |--- dir5/                                             (ino 260)
      
      Send snapshot:
      
        .                                                      (ino 256)
        |
        |--- dir1/                                             (ino 257)
               |--- dir2/                                      (ino 258)
               |      |--- dir3/                               (ino 259)
               |            |--- dir4                          (ino 261)
               |
               |--- dir6/                                      (ino 263)
                      |--- dir44/                              (ino 262)
                             |--- file11                       (ino 261)
                             |--- dir55/                       (ino 260)
      
      When attempting to apply the corresponding incremental send stream, a
      link command contains an invalid target path which makes the receiver
      fail. The following is the verbose output of the btrfs receive command:
      
        receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
        utimes
        utimes dir1
        utimes dir1/dir2/dir3
        utimes
        rename dir1/dir2/dir3/dir4 -> o262-7-0
        link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
        link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
        ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
      
      The following steps happen during the computation of the incremental send
      stream that lead to this issue:
      
      1) When processing inode 261, we orphanize inode 262 due to a name/location
         collision with one of the new hard links for inode 261 (created in the
         second step below).
      
      2) We create one of the 2 new hard links for inode 261, the one whose
         location is at "dir1/dir2/dir3/dir4".
      
      3) We then attempt to create the other new hard link for inode 261, which
         has inode 262 as its parent directory. Because the path for this new
         hard link was computed before we started processing the new references
         (hard links), it reflects the old name/location of inode 262, that is,
         it does not account for the orphanization step that happened when
         we started processing the new references for inode 261, whence it is
         no longer valid, causing the receiver to fail.
      
      So fix this issue by recomputing the full path of new references if we
      ended up orphanizing other inodes which are directories.
      
      A test case for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  18. 22 Jun, 2017 1 commit
  19. 21 Jun, 2017 2 commits
    • F
      Btrfs: incremental send, fix invalid path for unlink commands · fdb13889
      Filipe Manana committed
      An incremental send can contain unlink operations with an invalid target
      path when we rename some directory inode A, then rename some file inode B
      to the old name of inode A and directory inode A is an ancestor of inode B
      in the parent snapshot (but not anymore in the send snapshot).
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |     |--- file1                                 (ino 259)
             |     |--- file3                                 (ino 261)
             |
             |--- dir3/                                       (ino 262)
                   |--- file22                                (ino 260)
                   |--- dir4/                                 (ino 263)
      
      Send snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |--- dir3                                        (ino 260)
             |--- file3/                                      (ino 262)
                   |--- dir4/                                 (ino 263)
                         |--- file11                          (ino 269)
                         |--- file33                          (ino 261)
      
      When attempting to apply the corresponding incremental send stream, an
      unlink operation contains an invalid path which makes the receiver fail.
      The following is verbose output of the btrfs receive command:
      
       receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
       utimes
       utimes dir1
       utimes dir1/dir2
       link dir1/dir3/dir4/file11 -> dir1/dir2/file1
       unlink dir1/dir2/file1
       utimes dir1/dir2
       truncate dir1/dir3/dir4/file11 size=0
       utimes dir1/dir3/dir4/file11
       rename dir1/dir3 -> o262-7-0
       link dir1/dir3 -> o262-7-0/file22
       unlink dir1/dir3/file22
       ERROR: unlink dir1/dir3/file22 failed. Not a directory
      
      The following steps happen during the computation of the incremental send
      stream that lead to this issue:
      
      1) Before we start processing the new and deleted references for inode
         260, we compute the full path of the deleted reference
         ("dir1/dir3/file22") and cache it in the list of deleted references
         for our inode.
      
      2) We then start processing the new references for inode 260, for which
         there is only one new, located at "dir1/dir3". When processing this
         new reference, we check that inode 262, which was not yet processed,
         collides with the new reference and because of that we orphanize
         inode 262 so its new full path becomes "o262-7-0".
      
      3) After the orphanization of inode 262, we create the new reference for
         inode 260 by issuing a link command with a target path of "dir1/dir3"
         and a source path of "o262-7-0/file22".
      
      4) We then start processing the deleted references for inode 260, for
         which there is only one with the base name of "file22", and issue
         an unlink operation containing the target path computed at step 1,
         which is wrong because that path no longer exists and should be
         replaced with "o262-7-0/file22".
      
      So fix this issue by recomputing the full path of deleted references if,
      while processing the new references for an inode, we ended up orphanizing
      any other inode that is an ancestor of our inode in the parent snapshot.
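
      The stale path can be reproduced with a toy Python model of the example
      above. Everything here is illustrative: the names table and the helper
      functions are hypothetical, and the "o<ino>-<gen>-<seq>" orphan name
      format is an assumption inferred from the name "o262-7-0".

```python
ROOT_INO = 256


def full_path(names, ino):
    """Build a path by walking parent links; names maps ino -> (parent, name)."""
    parts = []
    while ino != ROOT_INO:
        parent, name = names[ino]
        parts.append(name)
        ino = parent
    return "/".join(reversed(parts))


def orphan_name(ino, gen, seq=0):
    """Assumed orphan name format, e.g. o262-7-0."""
    return f"o{ino}-{gen}-{seq}"


def orphanize(names, ino, gen):
    """Move the inode to the root directory under its orphan name."""
    names[ino] = (ROOT_INO, orphan_name(ino, gen))


# Relevant part of the parent snapshot: dir1/dir3/file22 (inode 260).
names = {257: (256, "dir1"), 262: (257, "dir3"), 260: (262, "file22")}

cached = full_path(names, 260)        # step 1: path cached up front
orphanize(names, 262, gen=7)          # step 2: inode 262 gets orphanized
recomputed = full_path(names, 260)    # the fix: recompute after step 2

print(cached)
print(recomputed)
```

      The cached value still names "dir1/dir3/file22", while the recomputed
      one is "o262-7-0/file22", the path the unlink command needs.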
      
      A test case for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      [ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: send, fix invalid path after renaming and linking file · 72c3668f
      Filipe Manana committed
      Currently an incremental send can generate link operations which
      contain an invalid target path. Such a case happens when in the send
      snapshot a file was renamed, a new hard link added for it and some
      other inode (with a lower number) got renamed to the former name of
      that file. Example:
      
      Parent snapshot
      
       .                  (ino 256)
       |
       |--- f1            (ino 257)
       |--- f2            (ino 258)
       |--- f3            (ino 259)
      
      Send snapshot
      
       .                  (ino 256)
       |
       |--- f2            (ino 257)
       |--- f3            (ino 258)
       |--- f4            (ino 259)
       |--- f5            (ino 258)
      
      The following steps happen when computing the incremental send stream:
      
      1) When processing inode 257, inode 258 is orphanized (renamed to
         "o258-7-0"), because its current reference has the same name as the
         new reference for inode 257;
      
      2) When processing inode 258, we iterate over all its new references,
         which have the names "f3" and "f5". The first iteration sees name
         "f5" and renames the inode from its orphan name ("o258-7-0") to
         "f5", while the second iteration sees the name "f3" and, incorrectly,
         issues a link operation with a target name matching the orphan name,
         which no longer exists. The first iteration had reset the current
         valid path of the inode to "f5", but in the second iteration we lost
         it because we found another inode, with a higher number of 259, which
         has a reference named "f3" as well, so we orphanized inode 259 and
         recomputed the current valid path of inode 258 to its old orphan
         name because inode 259 could be an ancestor of inode 258 and therefore
         the current valid path could contain the pre-orphanization name of
         inode 259. However in this case inode 259 is not an ancestor of inode
         258 so the current valid path should not be recomputed.
         This makes the receiver fail with the following error:
      
         ERROR: link f3 -> o258-7-0 failed: No such file or directory
      
      So fix this by not recomputing the current valid path for an inode
      whenever we find a colliding reference from some not yet processed inode
      (inode number higher than the one currently being processed), unless
      that other inode is an ancestor of the one we are currently processing.
      
      A test case for fstests will follow soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  20. 20 Jun, 2017 3 commits
  21. 09 May, 2017 1 commit
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko committed
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 26 Apr, 2017 1 commit
    • F
      Btrfs: send, fix file hole not being preserved due to inline extent · e1cbfd7b
      Filipe Manana committed
      Normally we don't have inline extents followed by regular extents, but
      there's currently at least one harmless case where this happens. For
      example, when the page size is 4Kb and compression is enabled:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
        $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
        $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
      
      In this case we get a compressed inline extent, representing 4Kb of
      data, followed by a hole extent and then a regular data extent. The
      inline extent was not expanded/converted to a regular extent exactly
      because it represents 4Kb of data. This does not cause any apparent
      problem (such as the issue solved by commit e1699d2d
      ("btrfs: add missing memset while reading compressed inline extents"))
      except trigger an unexpected case in the incremental send code path
      that makes us issue an operation to write a hole when it's not needed,
      resulting in more writes at the receiver and wasting space at the
      receiver.
      
      So teach the incremental send code to deal with this particular case.
      
      The issue can be currently triggered by running fstests btrfs/137 with
      compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  23. 29 Mar, 2017 1 commit
    • D
      Btrfs: fix an integer overflow check · 457ae726
      Dan Carpenter committed
      This isn't super serious because you need CAP_SYS_ADMIN to run this code.
      
      I added this integer overflow check last year but apparently I am
      rubbish at writing integer overflow checks...  There are two issues.
      First, access_ok() works on unsigned long type and not u64 so on 32 bit
      systems the access_ok() could be checking a truncated size.  The other
      issue is that we should be using a stricter limit so we don't overflow
      the kzalloc() setting sctx->clone_roots later in the function after the
      access_ok():
      
      	alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
      	sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
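
      The effect of the stricter limit can be modelled in Python, where the
      wrap-around of unsigned long has to be simulated explicitly. This is a
      sketch under assumptions: the element size of 40 bytes and both
      function names are made up, and the bound is one way to guarantee that
      the multiplication above cannot wrap.

```python
def alloc_size_wrapped(count, elem_size, ulong_bits):
    """elem_size * (count + 1) as computed in unsigned long (wraps)."""
    return (elem_size * (count + 1)) % (1 << ulong_bits)


def count_is_safe(count, elem_size, ulong_bits):
    """True when elem_size * (count + 1) still fits in unsigned long."""
    ulong_max = (1 << ulong_bits) - 1
    return count <= ulong_max // elem_size - 1


ELEM_SIZE = 40        # hypothetical sizeof(struct clone_root)
count = 107374182     # first count that overflows with 32-bit unsigned long

print(count_is_safe(count, ELEM_SIZE, 32))
print(alloc_size_wrapped(count, ELEM_SIZE, 32))
print(count_is_safe(count, ELEM_SIZE, 64))
```

      In the 32-bit case the wrapped size is only a few bytes, which is the
      kind of undersized allocation the stricter check is meant to rule out.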
      
      Fixes: f5ecec3c ("btrfs: send: silence an integer overflow warning")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ added comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  24. 24 Feb, 2017 1 commit
    • F
      Btrfs: incremental send, fix unnecessary hole writes for sparse files · 82bfb2e7
      Filipe Manana committed
      When using the NO_HOLES feature, during an incremental send we often issue
      write operations for holes when we should not, because that range is already
      a hole in the destination snapshot. While that does not change the contents
      of the file at the receiver, it prevents preservation of file holes, leading
      to wasted disk space and extra IO during send/receive.
      
      A couple of examples where the holes are not preserved follow.
      
       $ mkfs.btrfs -O no-holes -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
       # Now add one new extent to our first test file, increasing its size and
       # leaving a 1Mb hole between the first extent and this new extent.
       $ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo
      
       # Now overwrite the last extent of our second test file.
       $ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar
      
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
       $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
       /mnt/snap2/foo:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          25088..25095         8 0x2000
         1: [8..2055]:       hole              2048
         2: [2056..2063]:    24576..24583         8 0x2001
      
       $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
       /mnt/snap2/bar:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          25096..25103         8 0x2000
         1: [8..2055]:       hole              2048
         2: [2056..2063]:    24584..24591         8 0x2001
      
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
      
        $ umount /mnt
        # It's not relevant to enable no-holes in the new filesystem.
        $ mkfs.btrfs -O no-holes -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ btrfs receive /mnt -f /tmp/1.snap
        $ btrfs receive /mnt -f /tmp/2.snap
      
        $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
        /mnt/snap2/foo:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [0..7]:          24576..24583         8 0x2000
          1: [8..2063]:       25624..27679      2056   0x1
      
        $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
        /mnt/snap2/bar:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [0..7]:          24584..24591         8 0x2000
          1: [8..2063]:       27680..29735      2056   0x1
      
      The holes do not exist in the second filesystem and they were replaced
      with extents filled with the byte 0x00, making each file take 1032Kb of
      space instead of 8Kb.
      
      So fix this by not issuing the write operations consisting of buffers
      filled with the byte 0x00 when the destination snapshot already has a
      hole for the respective range.
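
      The check can be pictured with a small Python sketch. It is a toy model
      and not how send.c inspects file extent items: extents are plain
      (start, end) byte ranges and both function names are hypothetical.

```python
K = 1024


def is_hole(data_extents, start, end):
    """True if [start, end) overlaps none of the (start, end) data extents."""
    return all(e_end <= start or e_start >= end
               for e_start, e_end in data_extents)


def need_zero_write(dest_extents, start, end):
    """Zeroes only need to be sent where the destination has data."""
    return not is_hole(dest_extents, start, end)


# Data extents of the two test files in the first snapshot (snap1).
foo_snap1 = [(0, 4 * K)]
bar_snap1 = [(0, 4 * K), (1028 * K, 1032 * K)]

# Range that is a hole in both files in snap2.
print(need_zero_write(foo_snap1, 4 * K, 1028 * K))
print(need_zero_write(bar_snap1, 4 * K, 1028 * K))
```

      In both cases the destination has no data in the range, so no write is
      needed and each file can stay at 8Kb of space instead of 1032Kb.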
      
      A test case for fstests will follow soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>