1. 21 6月, 2017 2 次提交
    • F
      Btrfs: incremental send, fix invalid path for unlink commands · fdb13889
      Filipe Manana 提交于
      An incremental send can contain unlink operations with an invalid target
      path when we rename some directory inode A, then rename some file inode B
      to the old name of inode A and directory inode A is an ancestor of inode B
      in the parent snapshot (but not anymore in the send snapshot).
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |     |--- file1                                 (ino 259)
             |     |--- file3                                 (ino 261)
             |
             |--- dir3/                                       (ino 262)
                   |--- file22                                (ino 260)
                   |--- dir4/                                 (ino 263)
      
      Send snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |--- dir3                                        (ino 260)
             |--- file3/                                      (ino 262)
                   |--- dir4/                                 (ino 263)
                         |--- file11                          (ino 269)
                         |--- file33                          (ino 261)
      
      When attempting to apply the corresponding incremental send stream, an
      unlink operation contains an invalid path which makes the receiver fail.
      The following is verbose output of the btrfs receive command:
      
       receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
       utimes
       utimes dir1
       utimes dir1/dir2
       link dir1/dir3/dir4/file11 -> dir1/dir2/file1
       unlink dir1/dir2/file1
       utimes dir1/dir2
       truncate dir1/dir3/dir4/file11 size=0
       utimes dir1/dir3/dir4/file11
       rename dir1/dir3 -> o262-7-0
       link dir1/dir3 -> o262-7-0/file22
       unlink dir1/dir3/file22
       ERROR: unlink dir1/dir3/file22 failed. Not a directory
      
      The following steps happen during the computation of the incremental send
      stream the lead to this issue:
      
      1) Before we start processing the new and deleted references for inode
         260, we compute the full path of the deleted reference
         ("dir1/dir3/file22") and cache it in the list of deleted references
         for our inode.
      
      2) We then start processing the new references for inode 260, for which
         there is only one new, located at "dir1/dir3". When processing this
         new reference, we check that inode 262, which was not yet processed,
         collides with the new reference and because of that we orphanize
         inode 262 so its new full path becomes "o262-7-0".
      
      3) After the orphanization of inode 262, we create the new reference for
         inode 260 by issuing a link command with a target path of "dir1/dir3"
         and a source path of "o262-7-0/file22".
      
      4) We then start processing the deleted references for inode 260, for
         which there is only one with the base name of "file22", and issue
         an unlink operation containing the target path computed at step 1,
         which is wrong because that path no longer exists and should be
         replaced with "o262-7-0/file22".
      
      So fix this issue by recomputing the full path of deleted references if
      when we processed the new references for an inode we ended up orphanizing
      any other inode that is an ancestor of our inode in the parent snapshot.
      
      A test case for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      [ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fdb13889
    • F
      Btrfs: send, fix invalid path after renaming and linking file · 72c3668f
      Filipe Manana 提交于
      Currently an incremental snapshot can generate link operations which
      contain an invalid target path. Such case happens when in the send
      snapshot a file was renamed, a new hard link added for it and some
      other inode (with a lower number) got renamed to the former name of
      that file. Example:
      
      Parent snapshot
      
       .                  (ino 256)
       |
       |--- f1            (ino 257)
       |--- f2            (ino 258)
       |--- f3            (ino 259)
      
      Send snapshot
      
       .                  (ino 256)
       |
       |--- f2            (ino 257)
       |--- f3            (ino 258)
       |--- f4            (ino 259)
       |--- f5            (ino 258)
      
      The following steps happen when computing the incremental send stream:
      
      1) When processing inode 257, inode 258 is orphanized (renamed to
         "o258-7-0"), because its current reference has the same name as the
         new reference for inode 257;
      
      2) When processing inode 258, we iterate over all its new references,
         which have the names "f3" and "f5". The first iteration sees name
         "f5" and renames the inode from its orphan name ("o258-7-0") to
         "f5", while the second iteration sees the name "f3" and, incorrectly,
         issues a link operation with a target name matching the orphan name,
         which no longer exists. The first iteration had reset the current
         valid path of the inode to "f5", but in the second iteration we lost
         it because we found another inode, with a higher number of 259, which
         has a reference named "f3" as well, so we orphanized inode 259 and
         recomputed the current valid path of inode 258 to its old orphan
         name because inode 259 could be an ancestor of inode 258 and therefore
         the current valid path could contain the pre-orphanization name of
         inode 259. However in this case inode 259 is not an ancestor of inode
         258 so the current valid path should not be recomputed.
         This makes the receiver fail with the following error:
      
         ERROR: link f3 -> o258-7-0 failed: No such file or directory
      
      So fix this by not recomputing the current valid path for an inode
      whenever we find a colliding reference from some not yet processed inode
      (inode number higher then the one currently being processed), unless
      that other inode is an ancestor of the one we are currently processing.
      
      A test case for fstests will follow soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      72c3668f
  2. 20 6月, 2017 3 次提交
  3. 09 5月, 2017 1 次提交
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko 提交于
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      752ade68
  4. 26 4月, 2017 1 次提交
    • F
      Btrfs: send, fix file hole not being preserved due to inline extent · e1cbfd7b
      Filipe Manana 提交于
      Normally we don't have inline extents followed by regular extents, but
      there's currently at least one harmless case where this happens. For
      example, when the page size is 4Kb and compression is enabled:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount -o compress /dev/sdb /mnt
        $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "fsync" /mnt/foobar
        $ xfs_io -c "pwrite -S 0xbb 8K 4K" -c "fsync" /mnt/foobar
      
      In this case we get a compressed inline extent, representing 4Kb of
      data, followed by a hole extent and then a regular data extent. The
      inline extent was not expanded/converted to a regular extent exactly
      because it represents 4Kb of data. This does not cause any apparent
      problem (such as the issue solved by commit e1699d2d
      ("btrfs: add missing memset while reading compressed inline extents"))
      except trigger an unexpected case in the incremental send code path
      that makes us issue an operation to write a hole when it's not needed,
      resulting in more writes at the receiver and wasting space at the
      receiver.
      
      So teach the incremental send code to deal with this particular case.
      
      The issue can be currently triggered by running fstests btrfs/137 with
      compression enabled (MOUNT_OPTIONS="-o compress" ./check btrfs/137).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      e1cbfd7b
  5. 29 3月, 2017 1 次提交
    • D
      Btrfs: fix an integer overflow check · 457ae726
      Dan Carpenter 提交于
      This isn't super serious because you need CAP_ADMIN to run this code.
      
      I added this integer overflow check last year but apparently I am
      rubbish at writing integer overflow checks...  There are two issues.
      First, access_ok() works on unsigned long type and not u64 so on 32 bit
      systems the access_ok() could be checking a truncated size.  The other
      issue is that we should be using a stricter limit so we don't overflow
      the kzalloc() setting ctx->clone_roots later in the function after the
      access_ok():
      
      	alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1);
      	sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
      
      Fixes: f5ecec3c ("btrfs: send: silence an integer overflow warning")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ added comment ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      457ae726
  6. 24 2月, 2017 4 次提交
    • F
      Btrfs: incremental send, fix unnecessary hole writes for sparse files · 82bfb2e7
      Filipe Manana 提交于
      When using the NO_HOLES feature, during an incremental send we often issue
      write operations for holes when we should not, because that range is already
      a hole in the destination snapshot. While that does not change the contents
      of the file at the receiver, it avoids preservation of file holes, leading
      to wasted disk space and extra IO during send/receive.
      
      A couple examples where the holes are not preserved follows.
      
       $ mkfs.btrfs -O no-holes -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
       # Now add one new extent to our first test file, increasing its size and
       # leaving a 1Mb hole between the first extent and this new extent.
       $ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo
      
       # Now overwrite the last extent of our second test file.
       $ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar
      
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
       $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
       /mnt/snap2/foo:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          25088..25095         8 0x2000
         1: [8..2055]:       hole              2048
         2: [2056..2063]:    24576..24583         8 0x2001
      
       $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
       /mnt/snap2/bar:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          25096..25103         8 0x2000
         1: [8..2055]:       hole              2048
         2: [2056..2063]:    24584..24591         8 0x2001
      
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
      
        $ umount /mnt
        # It's not relevant to enable no-holes in the new filesystem.
        $ mkfs.btrfs -O no-holes -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ btrfs receive /mnt -f /tmp/1.snap
        $ btrfs receive /mnt -f /tmp/2.snap
      
        $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo
        /mnt/snap2/foo:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [0..7]:          24576..24583         8 0x2000
          1: [8..2063]:       25624..27679      2056   0x1
      
        $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar
        /mnt/snap2/bar:
        EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
          0: [0..7]:          24584..24591         8 0x2000
          1: [8..2063]:       27680..29735      2056   0x1
      
      The holes do not exist in the second filesystem and they were replaced
      with extents filled with the byte 0x00, making each file take 1032Kb of
      space instead of 8Kb.
      
      So fix this by not issuing the write operations consisting of buffers
      filled with the byte 0x00 when the destination snapshot already has a
      hole for the respective range.
      
      A test case for fstests will follow soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      82bfb2e7
    • R
      Btrfs: incremental send, do not issue invalid rmdir operations · 01914101
      Robbie Ko 提交于
      When both the parent and send snapshots have a directory inode with the
      same number but different generations (therefore they are different
      inodes) and both have an entry with the same name, an incremental send
      stream will contain an invalid rmdir operation that refers to the
      orphanized name of the inode from the parent snapshot.
      
      The following example scenario shows how this happens.
      
      Parent snapshot:
      
       .
       |---- d259_old/               (ino 259, gen 9)
       |         |---- d1/           (ino 258, gen 9)
       |
       |---- f                       (ino 257, gen 9)
      
      Send snapshot:
      
       .
       |---- d258/                   (ino 258, gen 7)
       |---- d259/                   (ino 259, gen 7)
               |---- d1/             (ino 257, gen 7)
      
      When the kernel is processing inode 258 it notices that in both snapshots
      there is an inode numbered 259 that is a parent of an inode 258. However
      it ignores the fact that the inodes numbered 259 have different generations
      in both snapshots, which means they are effectively different inodes.
      Then it checks that both inodes 259 have a dentry named "d1" and because
      of that it issues a rmdir operation with orphanized name of the inode 258
      from the parent snapshot. This happens at send.c:process_record_refs(),
      which calls send.c:did_overwrite_first_ref() that returns true and because
      of that later on at process_recorded_refs() such rmdir operation is issued
      because the inode being currently processed (258) is a directory and it
      was deleted in the send snapshot (and replaced with another inode that has
      the same number and is a directory too).
      Fix this issue by comparing the generations of parent directory inodes
      that have the same number and make send.c:did_overwrite_first_ref() when
      the generations are different.
      
      The following steps reproduce the problem.
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ touch /mnt/f
       $ mkdir /mnt/d1
       $ mkdir /mnt/d259_old
       $ mv /mnt/d1 /mnt/d259_old/d1
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ umount /mnt
      
       $ mkfs.btrfs -f /dev/sdc
       $ mount /dev/sdc /mnt
       $ mkdir /mnt/d1
       $ mkdir /mnt/dir258
       $ mkdir /mnt/dir259
       $ mv /mnt/d1 /mnt/dir259/d1
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
       $ btrfs receive /mnt/ -f /tmp/1.snap
       # Take note that once the filesystem is created, its current
       # generation has value 7 so the inodes from the second snapshot all have
       # a generation value of 7. And after receiving the first snapshot
       # the filesystem is at a generation value of 10, because the call to
       # create the second snapshot bumps the generation to 8 (the snapshot
       # creation ioctl does a transaction commit), the receive command calls
       # the snapshot creation ioctl to create the first snapshot, which bumps
       # the filesystem's generation to 9, and finally when the receive
       # operation finishes it calls an ioctl to transition the first snapshot
       # (snap1) from RW mode to RO mode, which does another transaction commit
       # and bumps the filesystem's generation to 10. This means all the inodes
       # in the first snapshot (snap1) have a generation value of 9.
       $ rm -f /tmp/1.snap
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
       $ umount /mnt
      
       $ mkfs.btrfs -f /dev/sdd
       $ mount /dev/sdd /mnt
       $ btrfs receive /mnt -f /tmp/1.snap
       $ btrfs receive -vv /mnt -f /tmp/2.snap
       receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9
       utimes
       unlink f
       mkdir o257-7-0
       mkdir o259-7-0
       rename o257-7-0 -> o259-7-0/d1
       chown o259-7-0/d1 - uid=0, gid=0
       chmod o259-7-0/d1 - mode=0755
       utimes o259-7-0/d1
       rmdir o258-9-0
       ERROR: rmdir o258-9-0 failed: No such file or directory
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [Rewrote changelog to be more precise and clear]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      01914101
    • F
      Btrfs: incremental send, do not delay rename when parent inode is new · fe9c798d
      Filipe Manana 提交于
      When we are checking if we need to delay the rename operation for an
      inode we not checking if a parent inode that exists in the send and
      parent snapshots is really the same inode or not, that is, we are not
      comparing the generation number of the parent inode in the send and
      parent snapshots. Not only this results in unnecessarily delaying a
      rename operation but also can later on make us generate an incorrect
      name for a new inode in the send snapshot that has the same number
      as another inode in the parent snapshot but a different generation.
      
      Here follows an example where this happens.
      
      Parent snapshot:
      
       .                                                  (ino 256, gen 3)
       |--- dir258/                                       (ino 258, gen 7)
       |       |--- dir257/                               (ino 257, gen 7)
       |
       |--- dir259/                                       (ino 259, gen 7)
      
      Send snapshot:
      
       .                                                  (ino 256, gen 3)
       |--- file258                                       (ino 258, gen 10)
       |
       |--- new_dir259/                                   (ino 259, gen 10)
                |--- dir257/                              (ino 257, gen 7)
      
      The following steps happen when computing the incremental send stream:
      
      1) When processing inode 257, its new parent is created using its orphan
         name (o257-21-0), and the rename operation for inode 257 is delayed
         because its new parent (inode 259) was not yet processed - this
         decision to delay the rename operation does not make much sense
         because the inode 259 in the send snapshot is a new inode, it's not
         the same as inode 259 in the parent snapshot.
      
      2) When processing inode 258 we end up delaying its rmdir operation,
         because inode 257 was not yet renamed (moved away from the directory
         inode 258 represents). We also create the new inode 258 using its
         orphan name "o258-10-0", then rename it to its final name of "file258"
         and then issue a truncate operation for it. However this truncate
         operation contains an incorrect name, which corresponds to the orphan
         name and not to the final name, which makes the receiver fail. This
         happens because when we attempt to compute the inode's current name
         we verify that there's another inode with the same number (258) that
         has its rmdir operation pending and because of that we generate an
         orphan name for the new inode 258 (we do this in the function
         get_cur_path()).
      
      Fix this by not delayed the rename operation of an inode if it has parents
      with the same number but different generations in both snapshots.
      
      The following steps reproduce this example scenario.
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ mkdir /mnt/dir257
       $ mkdir /mnt/dir258
       $ mkdir /mnt/dir259
       $ mv /mnt/dir257 /mnt/dir258/dir257
       $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
       $ mv /mnt/dir258/dir257 /mnt/dir257
       $ rmdir /mnt/dir258
       $ rmdir /mnt/dir259
      
       # Remount the filesystem so that the next created inodes will have the
       # numbers 258 and 259. This is because when a filesystem is mounted,
       # btrfs sets the subvolume's inode counter to a value corresponding to
       # the highest inode number in the subvolume plus 1. This inode counter
       # is used to assign a unique number to each new inode and it's
       # incremented by 1 after very inode creation.
       # Note: we unmount and then mount instead of doing a mount with
       # "-o remount" because otherwise the inode counter remains at value 260.
       $ umount /mnt
       $ mount /dev/sdb /mnt
       $ touch /mnt/file258
       $ mkdir /mnt/new_dir259
       $ mv /mnt/dir257 /mnt/new_dir259/dir257
       $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
       $ btrfs send /mnt/snap1 -f /tmp/1.snap
       $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
      
       $ umount /mnt
       $ mkfs.btrfs -f /dev/sdc
       $ mount /dev/sdc /mnt
       $ btrfs receive /mnt -f /tmo/1.snap
       $ btrfs receive /mnt -f /tmo/2.snap -vv
       receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7
       utimes
       mkdir o259-10-0
       rename dir258 -> o258-7-0
       utimes
       mkfile o258-10-0
       rename o258-10-0 -> file258
       utimes
       truncate o258-10-0 size=0
       ERROR: truncate o258-10-0 failed: No such file or directory
      Reported-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      fe9c798d
    • R
      Btrfs: send, fix failure to rename top level inode due to name collision · 4dd9920d
      Robbie Ko 提交于
      Under certain situations, an incremental send operation can fail due to a
      premature attempt to create a new top level inode (a direct child of the
      subvolume/snapshot root) whose name collides with another inode that was
      removed from the send snapshot.
      
      Consider the following example scenario.
      
      Parent snapshot:
      
        .                 (ino 256, gen 8)
        |---- a1/         (ino 257, gen 9)
        |---- a2/         (ino 258, gen 9)
      
      Send snapshot:
      
        .                 (ino 256, gen 3)
        |---- a2/         (ino 257, gen 7)
      
      In this scenario, when receiving the incremental send stream, the btrfs
      receive command fails like this (ran in verbose mode, -vv argument):
      
        rmdir a1
        mkfile o257-7-0
        rename o257-7-0 -> a2
        ERROR: rename o257-7-0 -> a2 failed: Is a directory
      
      What happens when computing the incremental send stream is:
      
      1) An operation to remove the directory with inode number 257 and
         generation 9 is issued.
      
      2) An operation to create the inode with number 257 and generation 7 is
         issued. This creates the inode with an orphanized name of "o257-7-0".
      
      3) An operation rename the new inode 257 to its final name, "a2", is
         issued. This is incorrect because inode 258, which has the same name
         and it's a child of the same parent (root inode 256), was not yet
         processed and therefore no rmdir operation for it was yet issued.
         The rename operation is issued because we fail to detect that the
         name of the new inode 257 collides with inode 258, because their
         parent, a subvolume/snapshot root (inode 256) has a different
         generation in both snapshots.
      
      So fix this by ignoring the generation value of a parent directory that
      matches a root inode (number 256) when we are checking if the name of the
      inode currently being processed collides with the name of some other
      inode that was not yet processed.
      
      We can achieve this scenario of different inodes with the same number but
      different generation values either by mounting a filesystem with the inode
      cache option (-o inode_cache) or by creating and sending snapshots across
      different filesystems, like in the following example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/a1
        $ mkdir /mnt/a2
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ umount /mnt
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ touch /mnt/a2
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
        $ btrfs receive /mnt -f /tmp/1.snap
        # Take note that once the filesystem is created, its current
        # generation has value 7 so the inode from the second snapshot has
        # a generation value of 7. And after receiving the first snapshot
        # the filesystem is at a generation value of 10, because the call to
        # create the second snapshot bumps the generation to 8 (the snapshot
        # creation ioctl does a transaction commit), the receive command calls
        # the snapshot creation ioctl to create the first snapshot, which bumps
        # the filesystem's generation to 9, and finally when the receive
        # operation finishes it calls an ioctl to transition the first snapshot
        # (snap1) from RW mode to RO mode, which does another transaction commit
        # and bumps the filesystem's generation to 10.
        $ rm -f /tmp/1.snap
        $ btrfs send /mnt/snap1 -f /tmp/1.snap
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
        $ umount /mnt
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt
        $ btrfs receive /mnt /tmp/1.snap
        # Receive of snapshot snap2 used to fail.
        $ btrfs receive /mnt /tmp/2.snap
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [Rewrote changelog to be more precise and clear]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      4dd9920d
  7. 06 12月, 2016 4 次提交
  8. 12 10月, 2016 1 次提交
    • F
      Btrfs: fix incremental send failure caused by balance · d5e84fd8
      Filipe Manana 提交于
      Commit 95155585 ("Btrfs: send, don't bug on inconsistent snapshots")
      removed some BUG_ON() statements (replacing them with returning errors
      to user space and logging error messages) when a snapshot is in an
      inconsistent state due to failures to update a delayed inode item (ENOMEM
      or ENOSPC) after adding/updating/deleting references, xattrs or file
      extent items.
      
      However there is a case, when no errors happen, where a file extent item
      can be modified without having the corresponding inode item updated. This
      case happens during balance under very specific timings, when relocation
      is in the stage where it updates data pointers and a leaf that contains
      file extent items is COWed. When that happens file extent items get their
      disk_bytenr field updated to a new value that reflects the post relocation
      logical address of the extent, without updating their respective inode
      items (as there is nothing that needs to be updated on them). This is
      performed at relocation.c:replace_file_extents() through
      relocation.c:btrfs_reloc_cow_block().
      
      So make an incremental send deal with this case and don't do any processing
      for a file extent item that got its disk_bytenr field updated by relocation,
      since the extent's data is the same as the one pointed by the file extent
      item in the parent snapshot.
      
      After the recent commit mentioned above this case resulted in EIO errors
      returned to user space (and an error message logged to dmesg/syslog) when
      doing an incremental send, while before it, it resulted in hitting a
      BUG_ON leading to the following trace:
      
      [  952.206705] ------------[ cut here ]------------
      [  952.206714] kernel BUG at ../fs/btrfs/send.c:5653!
      [  952.206719] Internal error: Oops - BUG: 0 [#1] SMP
      [  952.209854] Modules linked in: st dm_mod nls_utf8 isofs fuse nf_log_ipv6 xt_pkttype xt_physdev br_netfilter nf_log_ipv4 nf_log_common xt_LOG xt_limit ebtable_filter ebtables af_packet bridge stp llc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables xfs libcrc32c nls_iso8859_1 nls_cp437 vfat fat joydev aes_ce_blk ablk_helper cryptd snd_intel8x0 aes_ce_cipher snd_ac97_codec ac97_bus snd_pcm ghash_ce sha2_ce sha1_ce snd_timer snd virtio_net soundcore btrfs xor sr_mod cdrom hid_generic usbhid raid6_pq virtio_blk virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_mmio xhci_pci xhci_hcd usbcore usb_common virtio_pci virtio_ring virtio drm sg efivarfs
      [  952.228333] Supported: Yes
      [  952.228908] CPU: 0 PID: 12779 Comm: snapperd Not tainted 4.4.14-50-default #1
      [  952.230329] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [  952.231683] task: ffff800058e94100 ti: ffff8000d866c000 task.ti: ffff8000d866c000
      [  952.233279] PC is at changed_cb+0x9f4/0xa48 [btrfs]
      [  952.234375] LR is at changed_cb+0x58/0xa48 [btrfs]
      [  952.236552] pc : [<ffff7ffffc39de7c>] lr : [<ffff7ffffc39d4e0>] pstate: 80000145
      [  952.238049] sp : ffff8000d866fa20
      [  952.238732] x29: ffff8000d866fa20 x28: 0000000000000019
      [  952.239840] x27: 00000000000028d5 x26: 00000000000024a2
      [  952.241008] x25: 0000000000000002 x24: ffff8000e66e92f0
      [  952.242131] x23: ffff8000b8c76800 x22: ffff800092879140
      [  952.243238] x21: 0000000000000002 x20: ffff8000d866fb78
      [  952.244348] x19: ffff8000b8f8c200 x18: 0000000000002710
      [  952.245607] x17: 0000ffff90d42480 x16: ffff800000237dc0
      [  952.246719] x15: 0000ffff90de7510 x14: ab000c000a2faf08
      [  952.247835] x13: 0000000000577c2b x12: ab000c000b696665
      [  952.248981] x11: 2e65726f632f6966 x10: 652d34366d72612f
      [  952.250101] x9 : 32627572672f746f x8 : ab000c00092f1671
      [  952.251352] x7 : 8000000000577c2b x6 : ffff800053eadf45
      [  952.252468] x5 : 0000000000000000 x4 : ffff80005e169494
      [  952.253582] x3 : 0000000000000004 x2 : ffff8000d866fb78
      [  952.254695] x1 : 000000000003e2a3 x0 : 000000000003e2a4
      [  952.255803]
      [  952.256150] Process snapperd (pid: 12779, stack limit = 0xffff8000d866c020)
      [  952.257516] Stack: (0xffff8000d866fa20 to 0xffff8000d8670000)
      [  952.258654] fa20: ffff8000d866fae0 ffff7ffffc308fc0 ffff800092879140 ffff8000e66e92f0
      [  952.260219] fa40: 0000000000000035 ffff800055de6000 ffff8000b8c76800 ffff8000d866fb78
      [  952.261745] fa60: 0000000000000002 00000000000024a2 00000000000028d5 0000000000000019
      [  952.263269] fa80: ffff8000d866fae0 ffff7ffffc3090f0 ffff8000d866fae0 ffff7ffffc309128
      [  952.264797] faa0: ffff800092879140 ffff8000e66e92f0 0000000000000035 ffff800055de6000
      [  952.268261] fac0: ffff8000b8c76800 ffff8000d866fb78 0000000000000002 0000000000001000
      [  952.269822] fae0: ffff8000d866fbc0 ffff7ffffc39ecfc ffff8000b8f8c200 ffff8000b8f8c368
      [  952.271368] fb00: ffff8000b8f8c378 ffff800055de6000 0000000000000001 ffff8000ecb17500
      [  952.272893] fb20: ffff8000b8c76800 ffff800092879140 ffff800062b6d000 ffff80007a9e2470
      [  952.274420] fb40: ffff8000b8f8c208 0000000005784000 ffff8000580a8000 ffff8000b8f8c200
      [  952.276088] fb60: ffff7ffffc39d488 00000002b8f8c368 0000000000000000 000000000003e2a4
      [  952.280275] fb80: 000000000000006c ffff7ffffc39ec00 000000000003e2a4 000000000000006c
      [  952.283219] fba0: ffff8000b8f8c300 0000000000000100 0000000000000001 ffff8000ecb17500
      [  952.286166] fbc0: ffff8000d866fcd0 ffff7ffffc3643c0 ffff8000f8842700 0000ffff8ffe9278
      [  952.289136] fbe0: 0000000040489426 ffff800055de6000 0000ffff8ffe9278 0000000040489426
      [  952.292083] fc00: 000000000000011d 000000000000001d ffff80007a9e4598 ffff80007a9e43e8
      [  952.294959] fc20: ffff8000b8c7693f 0000000000003b24 0000000000000019 ffff8000b8f8c218
      [  952.301161] fc40: 00000001d866fc70 ffff8000b8c76800 0000000000000128 ffffffffffffff84
      [  952.305749] fc60: ffff800058e941ff 0000000000003a58 ffff8000d866fcb0 ffff8000000f7390
      [  952.308875] fc80: 000000000000012a 0000000000010290 ffff8000d866fc00 000000000000007b
      [  952.311915] fca0: 0000000000010290 ffff800046c1b100 74732d7366727462 000001006d616572
      [  952.314937] fcc0: ffff8000fffc4100 cb88537fdc8ba60e ffff8000d866fe10 ffff8000002499e8
      [  952.318008] fce0: 0000000040489426 ffff8000f8842700 0000ffff8ffe9278 ffff80007a9e4598
      [  952.321321] fd00: 0000ffff8ffe9278 0000000040489426 000000000000011d 000000000000001d
      [  952.324280] fd20: ffff80000072c000 ffff8000d866c000 ffff8000d866fda0 ffff8000000e997c
      [  952.327156] fd40: ffff8000fffc4180 00000000000031ed ffff8000fffc4180 ffff800046c1b7d4
      [  952.329895] fd60: 0000000000000140 0000ffff907ea170 000000000000011d 00000000000000dc
      [  952.334641] fd80: ffff80000072c000 ffff8000d866c000 0000000000000000 0000000000000002
      [  952.338002] fda0: ffff8000d866fdd0 ffff8000000ebacc ffff800046c1b080 ffff800046c1b7d4
      [  952.340724] fdc0: ffff8000d866fdf0 ffff8000000db67c 0000000000000040 ffff800000e69198
      [  952.343415] fde0: 0000ffff8ffea790 00000000000031ed ffff8000d866fe20 ffff800000254000
      [  952.346101] fe00: 000000000000001d 0000000000000004 ffff8000d866fe90 ffff800000249d3c
      [  952.348980] fe20: ffff8000f8842700 0000000000000000 ffff8000f8842701 0000000000000008
      [  952.351696] fe40: ffff8000d866fe70 0000000000000008 ffff8000d866fe90 ffff800000249cf8
      [  952.354387] fe60: ffff8000f8842700 0000ffff8ffe9170 ffff8000f8842701 0000000000000008
      [  952.357083] fe80: 0000ffff8ffe9278 ffff80008ff85500 0000ffff8ffe90c0 ffff800000085c84
      [  952.359800] fea0: 0000000000000000 0000ffff8ffe9170 ffffffffffffffff 0000ffff90d473bc
      [  952.365351] fec0: 0000000000000000 0000000000000015 0000000000000008 0000000040489426
      [  952.369550] fee0: 0000ffff8ffe9278 0000ffff907ea790 0000ffff907ea170 0000ffff907ea790
      [  952.372416] ff00: 0000ffff907ea170 0000000000000000 000000000000001d 0000000000000004
      [  952.375223] ff20: 0000ffff90a32220 00000000003d0f00 0000ffff907ea0a0 0000ffff8ffe8f30
      [  952.378099] ff40: 0000ffff9100f554 0000ffff91147000 0000ffff91117bc0 0000ffff90d473b0
      [  952.381115] ff60: 0000ffff9100f620 0000ffff880069b0 0000ffff8ffe9170 0000ffff8ffe91a0
      [  952.384003] ff80: 0000ffff8ffe9160 0000ffff8ffe9140 0000ffff88006990 0000ffff8ffe9278
      [  952.386860] ffa0: 0000ffff88008a60 0000ffff8ffe9480 0000ffff88014ca0 0000ffff8ffe90c0
      [  952.389654] ffc0: 0000ffff910be8e8 0000ffff8ffe90c0 0000ffff90d473bc 0000000000000000
      [  952.410986] ffe0: 0000000000000008 000000000000001d 6e2079747265706f 72616d223d656d61
      [  952.415497] Call trace:
      [  952.417403] [<ffff7ffffc39de7c>] changed_cb+0x9f4/0xa48 [btrfs]
      [  952.420023] [<ffff7ffffc308fc0>] btrfs_compare_trees+0x500/0x6b0 [btrfs]
      [  952.422759] [<ffff7ffffc39ecfc>] btrfs_ioctl_send+0xb4c/0xe10 [btrfs]
      [  952.425601] [<ffff7ffffc3643c0>] btrfs_ioctl+0x374/0x29a4 [btrfs]
      [  952.428031] [<ffff8000002499e8>] do_vfs_ioctl+0x33c/0x600
      [  952.430360] [<ffff800000249d3c>] SyS_ioctl+0x90/0xa4
      [  952.432552] [<ffff800000085c84>] el0_svc_naked+0x38/0x3c
      [  952.434803] Code: 2a1503e0 17fffdac b9404282 17ffff28 (d4210000)
      [  952.437457] ---[ end trace 9afd7090c466cf15 ]---
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      d5e84fd8
  9. 28 9月, 2016 1 次提交
  10. 27 9月, 2016 3 次提交
  11. 01 9月, 2016 1 次提交
    • J
      Btrfs: kill invalid ASSERT() in process_all_refs() · 3dc09ec8
      Josef Bacik 提交于
      Suppose you have the following tree in snap1 on a file system mounted with -o
      inode_cache so that inode numbers are recycled
      
      └── [    258]  a
          └── [    257]  b
      
      and then you remove b, rename a to c, and then re-create b in c so you have the
      following tree
      
      └── [    258]  c
          └── [    257]  b
      
      and then you try to do an incremental send you will hit
      
      ASSERT(pending_move == 0);
      
      in process_all_refs().  This is because we assume that any recycling of inodes
      will not have a pending change in our path, which isn't the case.  This is the
      case for the DELETE side, since we want to remove the old file using the old
      path, but on the create side we could have a pending move and need to do the
      normal pending rename dance.  So remove this ASSERT() and put a comment about
      why we ignore pending_move.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3dc09ec8
  12. 01 8月, 2016 8 次提交
    • F
      Btrfs: send, don't bug on inconsistent snapshots · 95155585
      Filipe Manana 提交于
      When doing an incremental send, if we find a new/modified/deleted extent,
      reference or xattr without having previously processed the corresponding
      inode item we end up exexuting a BUG_ON(). This is because whenever an
      extent, xattr or reference is added, modified or deleted, we always expect
      to have the corresponding inode item updated. However there are situations
      where this will not happen due to transient -ENOMEM or -ENOSPC errors when
      doing delayed inode updates.
      
      For example, when punching holes we can succeed in deleting and modifying
      (shrinking) extents but later fail to do the delayed inode update. So after
      such failure we close our transaction handle and right after a snapshot of
      the fs/subvol tree can be made and used later for a send operation. The
      same thing can happen during truncate, link, unlink, and xattr related
      operations.
      
      So instead of executing a BUG_ON, make send return an -EIO error and print
      an informative error message do dmesg/syslog.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      95155585
    • F
      Btrfs: send, avoid incorrect leaf accesses when sending utimes operations · 15b253ea
      Filipe Manana 提交于
      The caller of send_utimes() is supposed to be sure that the inode number
      it passes to this function does actually exists in the send snapshot.
      However due to logic/algorithm bugs (such as the one fixed by the patch
      titled "Btrfs: send, fix invalid leaf accesses due to incorrect utimes
      operations"), this might not be the case and when that happens it makes
      send_utimes() access use an unrelated leaf item as the target inode item
      or access beyond a leaf's boundaries (when the leaf is full and
      path->slots[0] matches the number of items in the leaf).
      
      So if the call to btrfs_search_slot() done by send_utimes() does not find
      the inode item, just make sure send_utimes() returns -ENOENT and does not
      silently accesses unrelated leaf items or does invalid leaf accesses, also
      allowing us to easialy and deterministically catch such algorithmic/logic
      bugs.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      15b253ea
    • R
      Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations · 764433a1
      Robbie Ko 提交于
      During an incremental send, if we have delayed rename operations for inodes
      that were children of directories which were removed in the send snapshot,
      we can end up accessing incorrect items in a leaf or accessing beyond the
      last item of the leaf due to issuing utimes operations for the removed
      inodes. Consider the following example:
      
        Parent snapshot:
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- c/                                                  (ino 262)
        |
        |--- b/                                                       (ino 258)
        |    |--- d/                                                  (ino 263)
        |
        |--- del/                                                     (ino 261)
              |--- x/                                                 (ino 259)
              |--- y/                                                 (ino 260)
      
        Send snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |
        |--- b/                                                       (ino 258)
        |
        |--- c/                                                       (ino 262)
        |    |--- y/                                                  (ino 260)
        |
        |--- d/                                                       (ino 263)
             |--- x/                                                  (ino 259)
      
      1) When processing inodes 259 and 260, we end up delaying their rename
         operations because their parents, inodes 263 and 262 respectively, were
         not yet processed and therefore not yet renamed;
      
      2) When processing inode 262, its rename operation is issued and right
         after the rename operation for inode 260 is issued. However right after
         issuing the rename operation for inode 260, at send.c:apply_dir_move(),
         we issue utimes operations for all current and past parents of inode
         260. This means we try to send a utimes operation for its old parent,
         inode 261 (deleted in the send snapshot), which does not cause any
         immediate and deterministic failure, because when the target inode is
         not found in the send snapshot, the send.c:send_utimes() function
         ignores it and uses the leaf region pointed to by path->slots[0],
         which can be any unrelated item (belonging to other inode) or it can
         be a region outside the leaf boundaries, if the leaf is full and
         path->slots[0] matches the number of items in the leaf. So we end
         up either successfully sending a utimes operation, which is fine
         and irrelevant because the old parent (inode 261) will end up being
         deleted later, or we end up doing an invalid memory access tha
         crashes the kernel.
      
      So fix this by making apply_dir_move() issue utimes operations only for
      parents that still exist in the send snapshot. In a separate patch we
      will make send_utimes() return an error (-ENOENT) if the given inode
      does not exists in the send snapshot.
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      [Rewrote change log to be more detailed and better organized]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      764433a1
    • R
      Btrfs: send, fix warning due to late freeing of orphan_dir_info structures · 443f9d26
      Robbie Ko 提交于
      Under certain situations, when doing an incremental send, we can end up
      not freeing orphan_dir_info structures as soon as they are no longer
      needed. Instead we end up freeing them only after finishing the send
      stream, which causes a warning to be emitted:
      
      [282735.229200] ------------[ cut here ]------------
      [282735.229968] WARNING: CPU: 9 PID: 10588 at fs/btrfs/send.c:6298 btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
      [282735.231282] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
      [282735.237130] CPU: 9 PID: 10588 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
      [282735.239309] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [282735.240160]  0000000000000000 ffff880224273ca8 ffffffff8126b42c 0000000000000000
      [282735.240160]  0000000000000000 ffff880224273ce8 ffffffff81052b14 0000189a24273ac8
      [282735.240160]  ffff8802210c9800 0000000000000000 0000000000000001 0000000000000000
      [282735.240160] Call Trace:
      [282735.240160]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
      [282735.240160]  [<ffffffff81052b14>] __warn+0xc2/0xdd
      [282735.240160]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
      [282735.240160]  [<ffffffffa03c99d5>] btrfs_ioctl_send+0xe2f/0xe51 [btrfs]
      [282735.240160]  [<ffffffffa0398358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
      [282735.240160]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
      [282735.240160]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
      [282735.240160]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
      [282735.240160]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
      [282735.240160]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
      [282735.240160]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
      [282735.240160]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [282735.240160]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
      [282735.240160]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
      [282735.256343] ---[ end trace a4539270c8056f93 ]---
      
      Consider the following example:
      
        Parent snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- c/                                                  (ino 260)
        |
        |--- del/                                                     (ino 259)
              |--- tmp/                                               (ino 258)
              |--- x/                                                 (ino 261)
              |--- y/                                                 (ino 262)
      
        Send snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- x/                                                  (ino 261)
        |    |--- y/                                                  (ino 262)
        |
        |--- c/                                                       (ino 260)
             |--- tmp/                                                (ino 258)
      
      1) When processing inode 258, we end up delaying its rename operation
         because it has an ancestor (in the send snapshot) that has a higher
         inode number (inode 260) which was also renamed in the send snapshot,
         therefore we delay the rename of inode 258 so that it happens after
         inode 260 is renamed;
      
      2) When processing inode 259, we end up delaying its deletion (rmdir
         operation) because it has a child inode (258) that has its rename
         operation delayed. At this point we allocate an orphan_dir_info
         structure and tag inode 258 so that we later attempt to see if we
         can delete (rmdir) inode 259 once inode 258 is renamed;
      
      3) When we process inode 260, after renaming it we finally do the rename
         operation for inode 258. Once we issue the rename operation for inode
         258 we notice that this inode was tagged so that we attempt to see
         if at this point we can delete (rmdir) inode 259. But at this point
         we can not still delete inode 259 because it has 2 children, inodes
         261 and 262, that were not yet processed and therefore not yet
         moved (renamed) away from inode 259. We end up not freeing the
         orphan_dir_info structure allocated in step 2;
      
      4) We process inodes 261 and 262, and once we move/rename inode 262
         we issue the rmdir operation for inode 260;
      
      5) We finish the send stream and notice that red black tree that
         contains orphan_dir_info structures is not empty, so we emit
         a warning and then free any orphan_dir_structures left.
      
      So fix this by freeing an orphan_dir_info structure once we try to
      apply a pending rename operation if we can not delete yet the tagged
      directory.
      
      A test case for fstests follows soon.
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      [Modified changelog to be more detailed and easier to understand]
      443f9d26
    • R
      Btrfs: incremental send, fix premature rmdir operations · 99ea42dd
      Robbie Ko 提交于
      Under certain situations, an incremental send operation can contain
      a rmdir operation that will make the receiving end fail when attempting
      to execute it, because the target directory is not yet empty.
      
      Consider the following example:
      
        Parent snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- c/                                                  (ino 260)
        |
        |--- del/                                                     (ino 259)
              |--- tmp/                                               (ino 258)
              |--- x/                                                 (ino 261)
      
        Send snapshot:
      
        .                                                             (ino 256)
        |--- a/                                                       (ino 257)
        |    |--- x/                                                  (ino 261)
        |
        |--- c/                                                       (ino 260)
             |--- tmp/                                                (ino 258)
      
      1) When processing inode 258, we delay its rename operation because inode
         260 is its new parent in the send snapshot and it was not yet renamed
         (since 260 > 258, that is, beyond the current progress);
      
      2) When processing inode 259, we realize we can not yet send an rmdir
         operation (against inode 259) because inode 258 was still not yet
         renamed/moved away from inode 259. Therefore we update data structures
         so that after inode 258 is renamed, we try again to see if we can
         finally send an rmdir operation for inode 259;
      
      3) When we process inode 260, we send a rename operation for it followed
         by a rename operation for inode 258. Once we send the rename operation
         for inode 258 we then check if we can finally issue an rmdir for its
         previous parent, inode 259, by calling the can_rmdir() function with
         a value of sctx->cur_ino + 1 (260 + 1 = 261) for its "progress"
         argument. This makes can_rmdir() return true (value 1) because even
         though there's still a child inode of inode 259 that was not yet
         renamed/moved, which is inode 261, the given value of progress (261)
         is not lower then 261 (that is, not lower than the inode number of
         some child of inode 259). So we end up sending a rmdir operation for
         inode 259 before its child inode 261 is processed and renamed.
      
      So fix this by passing the correct progress value to the call to
      can_rmdir() from within apply_dir_move() (where we issue delayed rename
      operations), which should match stcx->cur_ino (the number of the inode
      currently being processed) and not sctx->cur_ino + 1.
      
      A test case for fstests follows soon.
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      [Rewrote change log to be more detailed, clear and well formatted]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      99ea42dd
    • F
      Btrfs: incremental send, fix invalid paths for rename operations · 4122ea64
      Filipe Manana 提交于
      Example scenario:
      
        Parent snapshot:
      
        .                                                       (ino 277)
        |---- tmp/                                              (ino 278)
        |---- pre/                                              (ino 280)
        |      |---- wait_dir/                                  (ino 281)
        |
        |---- desc/                                             (ino 282)
        |---- ance/                                             (ino 283)
        |       |---- below_ance/                               (ino 279)
        |
        |---- other_dir/                                        (ino 284)
      
        Send snapshot:
      
        .                                                       (ino 277)
        |---- tmp/                                              (ino 278)
               |---- other_dir/                                 (ino 284)
                         |---- below_ance/                      (ino 279)
                         |            |---- pre/                (ino 280)
                         |
                         |---- wait_dir/                        (ino 281)
                                    |---- desc/                 (ino 282)
                                            |---- ance/         (ino 283)
      
      While computing the send stream the following steps happen:
      
      1) While processing inode 279 we end up delaying its rename operation
         because its new parent in the send snapshot, inode 284, was not
         yet processed and therefore not yet renamed;
      
      2) Later when processing inode 280 we end up renaming it immediately to
         "ance/below_once/pre" and not delay its rename operation because its
         new parent (inode 279 in the send snapshot) has its rename operation
         delayed and inode 280 is not an encestor of inode 279 (its parent in
         the send snapshot) in the parent snapshot;
      
      3) When processing inode 281 we end up delaying its rename operation
         because its new parent in the send snapshot, inode 284, was not yet
         processed and therefore not yet renamed;
      
      4) When processing inode 282 we do not delay its rename operation because
         its parent in the send snapshot, inode 281, already has its own rename
         operation delayed and our current inode (282) is not an ancestor of
         inode 281 in the parent snapshot. Therefore inode 282 is renamed to
         "ance/below_ance/pre/wait_dir";
      
      5) When processing inode 283 we realize that we can rename it because one
         of its ancestors in the send snapshot, inode 281, has its rename
         operation delayed and inode 283 is not an ancestor of inode 281 in the
         parent snapshot. So a rename operation to rename inode 283 to
         "ance/below_ance/pre/wait_dir/desc/ance" is issued. This path is
         invalid due to a missing path building loop that was undetected by
         the incremental send implementation, as inode 283 ends up getting
         included twice in the path (once with its path in the parent snapshot).
         Therefore its rename operation must wait before the ancestor inode 284
         is renamed.
      
      Fix this by not terminating the rename dependency checks when we find an
      ancestor, in the send snapshot, that has its rename operation delayed. So
      that we continue doing the same checks if the current inode is not an
      ancestor, in the parent snapshot, of an ancestor in the send snapshot we
      are processing in the loop.
      
      The problem and reproducer were reported by Robbie Ko, as part of a patch
      titled "Btrfs: incremental send, avoid ancestor rename to descendant".
      However the fix was unnecessarily complicated and can be addressed with
      much less code and effort.
      Reported-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      4122ea64
    • F
      Btrfs: send, add missing error check for calls to path_loop() · 7969e77a
      Filipe Manana 提交于
      The function path_loop() can return a negative integer, signaling an
      error, 0 if there's no path loop and 1 if there's a path loop. We were
      treating any non zero values as meaning that a path loop exists. Fix
      this by explicitly checking for errors and gracefully return them to
      user space.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      7969e77a
    • R
      Btrfs: send, fix failure to move directories with the same name around · 801bec36
      Robbie Ko 提交于
      When doing an incremental send we can end up not moving directories that
      have the same name. This happens when the same parent directory has
      different child directories with the same name in the parent and send
      snapshots.
      
      For example, consider the following scenario:
      
        Parent snapshot:
      
        .                   (ino 256)
        |---- d/            (ino 257)
        |     |--- p1/      (ino 258)
        |
        |---- p1/           (ino 259)
      
        Send snapshot:
      
        .                    (ino 256)
        |--- d/              (ino 257)
             |--- p1/        (ino 259)
                   |--- p1/  (ino 258)
      
      The directory named "d" (inode 257) has in both snapshots an entry with
      the name "p1" but it refers to different inodes in both snapshots (inode
      258 in the parent snapshot and inode 259 in the send snapshot). When
      attempting to move inode 258, the operation is delayed because its new
      parent, inode 259, was not yet moved/renamed (as the stream is currently
      processing inode 258). Then when processing inode 259, we also end up
      delaying its move/rename operation so that it happens after inode 258 is
      moved/renamed. This decision to delay the move/rename rename operation
      of inode 259 is due to the fact that the new parent inode (257) still
      has inode 258 as its child, which has the same name has inode 259. So
      we end up with inode 258 move/rename operation waiting for inode's 259
      move/rename operation, which in turn it waiting for inode's 258
      move/rename. This results in ending the send stream without issuing
      move/rename operations for inodes 258 and 259 and generating the
      following warnings in syslog/dmesg:
      
      [148402.979747] ------------[ cut here ]------------
      [148402.980588] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6177 btrfs_ioctl_send+0xe03/0xe51 [btrfs]
      [148402.981928] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
      [148402.986999] CPU: 14 PID: 4117 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
      [148402.988136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [148402.988136]  0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
      [148402.988136]  0000000000000000 ffff88022139fce8 ffffffff81052b14 000018212139fac8
      [148402.988136]  ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
      [148402.988136] Call Trace:
      [148402.988136]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
      [148402.988136]  [<ffffffff81052b14>] __warn+0xc2/0xdd
      [148402.988136]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
      [148402.988136]  [<ffffffffa04bc831>] btrfs_ioctl_send+0xe03/0xe51 [btrfs]
      [148402.988136]  [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
      [148402.988136]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
      [148402.988136]  [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
      [148402.988136]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
      [148402.988136]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
      [148402.988136]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
      [148402.988136]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
      [148402.988136]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
      [148402.988136]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [148402.988136]  [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
      [148403.011373] ---[ end trace a4539270c8056f8b ]---
      [148403.012296] ------------[ cut here ]------------
      [148403.013071] WARNING: CPU: 14 PID: 4117 at fs/btrfs/send.c:6194 btrfs_ioctl_send+0xe19/0xe51 [btrfs]
      [148403.014447] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis ppdev tpm parport_pc psmouse parport sg pcspkr i2c_piix4 i2c_core evdev processor serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
      [148403.019708] CPU: 14 PID: 4117 Comm: btrfs Tainted: G        W       4.6.0-rc7-btrfs-next-31+ #1
      [148403.020104] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [148403.020104]  0000000000000000 ffff88022139fca8 ffffffff8126b42c 0000000000000000
      [148403.020104]  0000000000000000 ffff88022139fce8 ffffffff81052b14 000018322139fac8
      [148403.020104]  ffff88022b0db400 0000000000000000 0000000000000001 0000000000000000
      [148403.020104] Call Trace:
      [148403.020104]  [<ffffffff8126b42c>] dump_stack+0x67/0x90
      [148403.020104]  [<ffffffff81052b14>] __warn+0xc2/0xdd
      [148403.020104]  [<ffffffff81052beb>] warn_slowpath_null+0x1d/0x1f
      [148403.020104]  [<ffffffffa04bc847>] btrfs_ioctl_send+0xe19/0xe51 [btrfs]
      [148403.020104]  [<ffffffffa048b358>] btrfs_ioctl+0x14f/0x1f81 [btrfs]
      [148403.020104]  [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
      [148403.020104]  [<ffffffff8108eb51>] ? __lock_is_held+0x3c/0x57
      [148403.020104]  [<ffffffff8118da05>] vfs_ioctl+0x18/0x34
      [148403.020104]  [<ffffffff8118e00c>] do_vfs_ioctl+0x550/0x5be
      [148403.020104]  [<ffffffff81196f0c>] ? __fget+0x6b/0x77
      [148403.020104]  [<ffffffff81196fa1>] ? __fget_light+0x62/0x71
      [148403.020104]  [<ffffffff8118e0d1>] SyS_ioctl+0x57/0x79
      [148403.020104]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [148403.020104]  [<ffffffff8108e89d>] ? trace_hardirqs_off_caller+0x3f/0xaa
      [148403.038981] ---[ end trace a4539270c8056f8c ]---
      
      There's another issue caused by similar (but more complex) changes in the
      directory hierarchy that makes move/rename operations fail, described with
      the following example:
      
        Parent snapshot:
      
        .
        |---- a/                                                   (ino 262)
        |     |---- c/                                             (ino 268)
        |
        |---- d/                                                   (ino 263)
              |---- ance/                                          (ino 267)
                      |---- e/                                     (ino 264)
                      |---- f/                                     (ino 265)
                      |---- ance/                                  (ino 266)
      
        Send snapshot:
      
        .
        |---- a/                                                   (ino 262)
        |---- c/                                                   (ino 268)
        |     |---- ance/                                          (ino 267)
        |
        |---- d/                                                   (ino 263)
        |     |---- ance/                                          (ino 266)
        |
        |---- f/                                                   (ino 265)
              |---- e/                                             (ino 264)
      
      When the inode 265 is processed, the path for inode 267 is computed, which
      at that time corresponds to "d/ance", and it's stored in the names cache.
      Later on when processing inode 266, we end up orphanizing (renaming to a
      name matching the pattern o<ino>-<gen>-<seq>) inode 267 because it has
      the same name as inode 266 and it's currently a child of the new parent
      directory (inode 263) for inode 266. After the orphanization and while we
      are still processing inode 266, a rename operation for inode 266 is
      generated. However the source path for that rename operation is incorrect
      because it ends up using the old, pre-orphanization, name of inode 267.
      The no longer valid name for inode 267 was previously cached when
      processing inode 265 and it remains usable and considered valid until
      the inode currently being processed has a number greater than 267.
      This resulted in the receiving side failing with the following error:
      
        ERROR: rename d/ance/ance -> d/ance failed: No such file or directory
      
      So fix these issues by detecting such circular dependencies for rename
      operations and by clearing the cached name of an inode once the inode
      is orphanized.
      
      A test case for fstests will follow soon.
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      [Rewrote change log to be more detailed and organized, and improved
       comments]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      801bec36
  13. 26 5月, 2016 1 次提交
  14. 06 5月, 2016 6 次提交
  15. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  16. 12 3月, 2016 1 次提交
  17. 11 2月, 2016 1 次提交