1. 29 5月, 2018 22 次提交
  2. 28 5月, 2018 3 次提交
  3. 24 5月, 2018 1 次提交
    • O
      Btrfs: fix error handling in btrfs_truncate() · d5014738
      Omar Sandoval 提交于
      Jun Wu at Facebook reported that an internal service was seeing a return
      value of 1 from ftruncate() on Btrfs in some cases. This is coming from
      the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
      
      btrfs_truncate() uses two variables for error handling, ret and err.
      When btrfs_truncate_inode_items() returns non-zero, we set err to the
      return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
      only set err if ret is an error (i.e., negative).
      
      To reproduce the issue: mount a filesystem with -o compress-force=zstd
      and the following program will encounter return value of 1 from
      ftruncate:
      
      int main(void) {
              char buf[256] = { 0 };
              int ret;
              int fd;
      
              fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
              if (fd == -1) {
                      perror("open");
                      return EXIT_FAILURE;
              }
      
              if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                      perror("write");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              if (fsync(fd) == -1) {
                      perror("fsync");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              ret = ftruncate(fd, 128);
              if (ret) {
                      printf("ftruncate() returned %d\n", ret);
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              close(fd);
              return EXIT_SUCCESS;
      }
      
      Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
      CC: stable@vger.kernel.org # 4.15+
      Reported-by: NJun Wu <quark@fb.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d5014738
  4. 17 5月, 2018 6 次提交
    • A
      btrfs: fix crash when trying to resume balance without the resume flag · 02ee654d
      Anand Jain 提交于
      We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance()
      only, which isn't called during the remount. So when resuming from
      the paused balance we hit the bug:
      
       kernel: kernel BUG at fs/btrfs/volumes.c:3890!
       ::
       kernel:  balance_kthread+0x51/0x60 [btrfs]
       kernel:  kthread+0x111/0x130
       ::
       kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
      
      Reproducer:
        On a mounted filesystem:
      
        btrfs balance start --full-balance /btrfs
        btrfs balance pause /btrfs
        mount -o remount,ro /dev/sdb /btrfs
        mount -o remount,rw /dev/sdb /btrfs
      
      To fix this set the BTRFS_BALANCE_RESUME flag in
      btrfs_resume_balance_async().
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      02ee654d
    • N
      btrfs: Fix delalloc inodes invalidation during transaction abort · fe816d0f
      Nikolay Borisov 提交于
      When a transaction is aborted btrfs_cleanup_transaction is called to
      cleanup all the various in-flight bits and pieces which migth be
      active. One of those is delalloc inodes - inodes which have dirty
      pages which haven't been persisted yet. Currently the process of
      freeing such delalloc inodes in exceptional circumstances such as
      transaction abort boiled down to calling btrfs_invalidate_inodes whose
      sole job is to invalidate the dentries for all inodes related to a
      root. This is in fact wrong and insufficient since such delalloc inodes
      will likely have pending pages or ordered-extents and will be linked to
      the sb->s_inode_list. This means that unmounting a btrfs instance with
      an aborted transaction could potentially lead inodes/their pages
      visible to the system long after their superblock has been freed. This
      in turn leads to a "use-after-free" situation once page shrink is
      triggered. This situation could be simulated by running generic/019
      which would cause such inodes to be left hanging, followed by
      generic/176 which causes memory pressure and page eviction which lead
      to touching the freed super block instance. This situation is
      additionally detected by the unmount code of VFS with the following
      message:
      
      "VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice day..."
      
      Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      in free_fs_root for the same reason.
      
      This patch aims to rectify the sitaution by doing the following:
      
      1. Change btrfs_destroy_delalloc_inodes so that it calls
      invalidate_inode_pages2 for every inode on the delalloc list, this
      ensures that all the pages of the inode are released. This function
      boils down to calling btrfs_releasepage. During test I observed cases
      where inodes on the delalloc list were having an i_count of 0, so this
      necessitates using igrab to be sure we are working on a non-freed inode.
      
      2. Since calling btrfs_releasepage might queue delayed iputs move the
      call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
      calling run_delayed_iputs for the last time. This is necessary to ensure
      that delayed iputs are run.
      
      Note: this patch is tagged for 4.14 stable but the fix applies to older
      versions too but needs to be backported manually due to conflicts.
      
      CC: stable@vger.kernel.org # 4.14.x: 2b877331: btrfs: Split btrfs_del_delalloc_inode into 2 functions
      CC: stable@vger.kernel.org # 4.14.x
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add comment to igrab ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fe816d0f
    • N
      btrfs: Split btrfs_del_delalloc_inode into 2 functions · 2b877331
      Nikolay Borisov 提交于
      This is in preparation of fixing delalloc inodes leakage on transaction
      abort. Also export the new function.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2b877331
    • L
      btrfs: fix reading stale metadata blocks after degraded raid1 mounts · 02a3307a
      Liu Bo 提交于
      If a btree block, aka. extent buffer, is not available in the extent
      buffer cache, it'll be read out from the disk instead, i.e.
      
      btrfs_search_slot()
        read_block_for_search()  # hold parent and its lock, go to read child
          btrfs_release_path()
          read_tree_block()  # read child
      
      Unfortunately, the parent lock got released before reading child, so
      commit 5bdd3536 ("Btrfs: Fix block generation verification race") had
      used 0 as parent transid to read the child block.  It forces
      read_tree_block() not to check if parent transid is different with the
      generation id of the child that it reads out from disk.
      
      A simple PoC is included in btrfs/124,
      
      0. A two-disk raid1 btrfs,
      
      1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
      
      2. Mount this filesystem and put it in use, after a while, device tree's
         root got COW but block A hasn't been allocated/overwritten yet.
      
      3. Umount it and reload the btrfs module to remove both disks from the
         global @fs_devices list.
      
      4. mount -odegraded dev1 and write some data, so now block A is allocated
         to be a leaf in checksum tree.  Note that only dev1 has the latest
         metadata of this filesystem.
      
      5. Umount it and mount it again normally (with both disks), since raid1
         can pick up one disk by the writer task's pid, if btrfs_search_slot()
         needs to read block A, dev2 which does NOT have the latest metadata
         might be read for block A, then we got a stale block A.
      
      6. As parent transid is not checked, block A is marked as uptodate and
         put into the extent buffer cache, so the future search won't bother
         to read disk again, which means it'll make changes on this stale
         one and make it dirty and flush it onto disk.
      
      To avoid the problem, parent transid needs to be passed to
      read_tree_block().
      
      In order to get a valid parent transid, we need to hold the parent's
      lock until finishing reading child.
      
      This patch needs to be slightly adapted for stable kernels, the
      &first_key parameter added to read_tree_block() is from 4.16+
      (581c1760). The fix is to replace 0 by 'gen'.
      
      Fixes: 5bdd3536 ("Btrfs: Fix block generation verification race")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      02a3307a
    • M
      btrfs: property: Set incompat flag if lzo/zstd compression is set · 1a63c198
      Misono Tomohiro 提交于
      Incompat flag of LZO/ZSTD compression should be set at:
      
       1. mount time (-o compress/compress-force)
       2. when defrag is done
       3. when property is set
      
      Currently 3. is missing and this commit adds this.
      
      This could lead to a filesystem that uses ZSTD but is not marked as
      such. If a kernel without a ZSTD support encounteres a ZSTD compressed
      extent, it will handle that but this could be confusing to the user.
      
      Typically the filesystem is mounted with the ZSTD option, but the
      discrepancy can arise when a filesystem is never mounted with ZSTD and
      then the property on some file is set (and some new extents are
      written). A simple mount with -o compress=zstd will fix that up on an
      unpatched kernel.
      
      Same goes for LZO, but this has been around for a very long time
      (2.6.37) so it's unlikely that a pre-LZO kernel would be used.
      
      Fixes: 5c1aab1d ("btrfs: Add zstd support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NTomohiro Misono <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add user visible impact ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a63c198
    • F
      Btrfs: fix duplicate extents after fsync of file with prealloc extents · 31d11b83
      Filipe Manana 提交于
      In commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
      after fsync log replay"), on fsync,  we started to always log all prealloc
      extents beyond an inode's i_size in order to avoid losing them after a
      power failure. However under some cases this can lead to the log replay
      code to create duplicate extent items, with different lengths, in the
      extent tree. That happens because, as of that commit, we can now log
      extent items based on extent maps that are not on the "modified" list
      of extent maps of the inode's extent map tree. Logging extent items based
      on extent maps is used during the fast fsync path to save time and for
      this to work reliably it requires that the extent maps are not merged
      with other adjacent extent maps - having the extent maps in the list
      of modified extents gives such guarantee.
      
      Consider the following example, captured during a long run of fsstress,
      which illustrates this problem.
      
      We have inode 271, in the filesystem tree (root 5), for which all of the
      following operations and discussion apply to.
      
      A buffered write starts at offset 312391 with a length of 933471 bytes
      (end offset at 1245862). At this point we have, for this inode, the
      following extent maps with the their field values:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
            block_len 376832, orig_block_len 376832
      em C, start 417792, orig_start 417792, len 782336, block_start
            18446744073709551613, block_len 0, orig_block_len 0
      em D, start 1200128, orig_start 1200128, len 835584, block_start
            1106776064, block_len 835584, orig_block_len 835584
      em E, start 2035712, orig_start 2035712, len 245760, block_start
            1107611648, block_len 245760, orig_block_len 245760
      
      Extent map A corresponds to a hole and extent maps D and E correspond to
      preallocated extents.
      
      Extent map D ends where extent map E begins (1106776064 + 835584 =
      1107611648), but these extent maps were not merged because they are in
      the inode's list of modified extent maps.
      
      An fsync against this inode is made, which triggers the fast path
      (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
      of the data previously written using buffered IO, and when the respective
      ordered extent finishes, btrfs_drop_extents() is called against the
      (aligned) range 311296..1249279. This causes a split of extent map D at
      btrfs_drop_extent_cache(), replacing extent map D with a new extent map
      D', also added to the list of modified extents,  with the following
      values:
      
      em D', start 1249280, orig_start of 1200128,
             block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
             orig_block_len 835584,
             block_len 786432 (835584 - (1249280 - 1200128))
      
      Then, during the fast fsync, btrfs_log_changed_extents() is called and
      extent maps D' and E are removed from the list of modified extents. The
      flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged
      clear_em_logging() is called on each of them, and that makes extent map E
      to be merged with extent map D' (try_merge_map()), resulting in D' being
      deleted and E adjusted to:
      
      em E, start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192,
            orig_block_len 245760
      
      A direct IO write at offset 1847296 and length of 360448 bytes (end offset
      at 2207744) starts, and at that moment the following extent maps exist for
      our inode:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192, orig_block_len 245760
      
      The dio write results in drop_extent_cache() being called twice. The first
      time for a range that starts at offset 1847296 and ends at offset 2035711
      (length of 188416), which results in a double split of extent map E,
      replacing it with two new extent maps:
      
      em F, start 1249280, orig_start 1200128, block_start 1106825216,
            block_len 598016, orig_block_len 598016
      em G, start 2035712, orig_start 1200128, block_start 1107611648,
            block_len 245760, orig_block_len 1032192
      
      It also creates a new extent map that represents a part of the requested
      IO (through create_io_em()):
      
      em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
      
      The second call to drop_extent_cache() has a range with a start offset of
      2035712 and end offset of 2207743 (length of 172032). This leads to
      replacing extent map G with a new extent map I with the following values:
      
      em I, start 2207744, orig_start 1200128, block_start 1107783680,
            block_len 73728, orig_block_len 1032192
      
      It also creates a new extent map that represents the second part of the
      requested IO (through create_io_em()):
      
      em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
      
      The dio write set the inode's i_size to 2207744 bytes.
      
      After the dio write the inode has the following extent maps:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em F, start 1249280, orig_start 1200128, len 598016,
            block_start 1106825216, block_len 598016, orig_block_len 598016
      em H, start 1847296, orig_start 1200128, len 188416,
            block_start 1107423232, block_len 188416, orig_block_len 835584
      em J, start 2035712, orig_start 2035712, len 172032,
            block_start 1107611648, block_len 172032, orig_block_len 245760
      em I, start 2207744, orig_start 1200128, len 73728,
            block_start 1107783680, block_len 73728, orig_block_len 1032192
      
      Now do some change to the file, like adding a xattr for example and then
      fsync it again. This triggers a fast fsync path, and as of commit
      471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync
      log replay"), we use the extent map I to log a file extent item because
      it's a prealloc extent and it starts at an offset matching the inode's
      i_size. However when we log it, we create a file extent item with a value
      for the disk byte location that is wrong, as can be seen from the
      following output of "btrfs inspect-internal dump-tree":
      
       item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
           generation 22 type 2 (prealloc)
           prealloc data disk byte 1106776064 nr 1032192
           prealloc data offset 1007616 nr 73728
      
      Here the disk byte value corresponds to calculation based on some fields
      from the extent map I:
      
        1106776064 = block_start (1107783680) - 1007616 (extent_offset)
        extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
      
      The disk byte value of 1106776064 clashes with disk byte values of the
      file extent items at offsets 1249280 and 1847296 in the fs tree:
      
              item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1106776064 nr 835584
                      prealloc data offset 49152 nr 598016
              item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1106776064 nr 835584
                      extent data offset 647168 nr 188416 ram 835584
                      extent compression 0 (none)
              item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1107611648 nr 245760
                      extent data offset 0 nr 172032 ram 245760
                      extent compression 0 (none)
              item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1107611648 nr 245760
                      prealloc data offset 172032 nr 73728
      
      Instead of the disk byte value of 1106776064, the value of 1107611648
      should have been logged. Also the data offset value should have been
      172032 and not 1007616.
      After a log replay we end up getting two extent items in the extent tree
      with different lengths, one of 835584, which is correct and existed
      before the log replay, and another one of 1032192 which is wrong and is
      based on the logged file extent item:
      
       item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
          refs 2 gen 15 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 2
       item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
          refs 1 gen 22 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 1
      
      Obviously this leads to many problems and a filesystem check reports many
      errors:
      
       (...)
       checking extents
       Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
       extent item 1106776064 has multiple extent items
       ref mismatch on [1106776064 835584] extent item 2, found 3
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
       Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
       Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
       backpointer mismatch on [1106776064 835584]
       checking free space cache
       block group 1103101952 has wrong amount of free space
       failed to load free space cache for block group 1103101952
       checking fs roots
       (...)
      
      So fix this by logging the prealloc extents beyond the inode's i_size
      based on searches in the subvolume tree instead of the extent maps.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      31d11b83
  5. 14 5月, 2018 2 次提交
    • F
      Btrfs: fix xattr loss after power failure · 9a8fca62
      Filipe Manana 提交于
      If a file has xattrs, we fsync it, to ensure we clear the flags
      BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING from its
      inode, the current transaction commits and then we fsync it (without
      either of those bits being set in its inode), we end up not logging
      all its xattrs. This results in deleting all xattrs when replying the
      log after a power failure.
      
      Trivial reproducer
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/foobar
        $ setfattr -n user.xa -v qwerty /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
        $ xfs_io -c "fsync" /mnt/foobar
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ getfattr --absolute-names --dump /mnt/foobar
        <empty output>
        $
      
      So fix this by making sure all xattrs are logged if we log a file's inode
      item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
      BTRFS_INODE_COPY_EVERYTHING were set in the inode.
      
      Fixes: 36283bf7 ("Btrfs: fix fsync xattr loss in the fast fsync path")
      Cc: <stable@vger.kernel.org> # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9a8fca62
    • R
      Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting · 6f2f0b39
      Robbie Ko 提交于
      [BUG]
      btrfs incremental send BUG happens when creating a snapshot of snapshot
      that is being used by send.
      
      [REASON]
      The problem can happen if while we are doing a send one of the snapshots
      used (parent or send) is snapshotted, because snapshoting implies COWing
      the root of the source subvolume/snapshot.
      
      1. When doing an incremental send, the send process will get the commit
         roots from the parent and send snapshots, and add references to them
         through extent_buffer_get().
      
      2. When a snapshot/subvolume is snapshotted, its root node is COWed
         (transaction.c:create_pending_snapshot()).
      
      3. COWing releases the space used by the node immediately, through:
      
         __btrfs_cow_block()
         --btrfs_free_tree_block()
         ----btrfs_add_free_space(bytenr of node)
      
      4. Because send doesn't hold a transaction open, it's possible that
         the transaction used to create the snapshot commits, switches the
         commit root and the old space used by the previous root node gets
         assigned to some other node allocation. Allocation of a new node will
         use the existing extent buffer found in memory, which we previously
         got a reference through extent_buffer_get(), and allow the extent
         buffer's content (pages) to be modified:
      
         btrfs_alloc_tree_block
         --btrfs_reserve_extent
         ----find_free_extent (get bytenr of old node)
         --btrfs_init_new_buffer (use bytenr of old node)
         ----btrfs_find_create_tree_block
         ------alloc_extent_buffer
         --------find_extent_buffer (get old node)
      
      5. So send can access invalid memory content and have unpredictable
         behaviour.
      
      [FIX]
      So we fix the problem by copying the commit roots of the send and
      parent snapshots and use those copies.
      
      CallTrace looks like this:
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.c:1861!
       invalid opcode: 0000 [#1] SMP
       CPU: 6 PID: 24235 Comm: btrfs Tainted: P           O 3.10.105 #23721
       ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
       RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs]
       RSP: 0018:ffff88041b723b68  EFLAGS: 00010246
       RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
       RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
       RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
       R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
       R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
       FS:  00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Stack:
       ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
       ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
       ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
       Call Trace:
       [<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
       [<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs]
       [<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
       [<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs]
       [<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
       [<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
       [<ffffffff81111133>] ? handle_pte_fault+0x373/0x990
       [<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20
       [<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0
       [<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0
       [<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500
       [<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90
       [<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990
       [<ffffffff81034f83>] ? do_fork+0x113/0x3b0
       [<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c
       [<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0
       [<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b
       ---[ end trace 29576629ee80b2e1 ]---
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 3.6+
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6f2f0b39
  6. 12 5月, 2018 1 次提交
    • A
      do d_instantiate/unlock_new_inode combinations safely · 1e2e547a
      Al Viro 提交于
      For anything NFS-exported we do _not_ want to unlock new inode
      before it has grown an alias; original set of fixes got the
      ordering right, but missed the nasty complication in case of
      lockdep being enabled - unlock_new_inode() does
      	lockdep_annotate_inode_mutex_key(inode)
      which can only be done before anyone gets a chance to touch
      ->i_mutex.  Unfortunately, flipping the order and doing
      unlock_new_inode() before d_instantiate() opens a window when
      mkdir can race with open-by-fhandle on a guessed fhandle, leading
      to multiple aliases for a directory inode and all the breakage
      that follows from that.
      
      	Correct solution: a new primitive (d_instantiate_new())
      combining these two in the right order - lockdep annotate, then
      d_instantiate(), then the rest of unlock_new_inode().  All
      combinations of d_instantiate() with unlock_new_inode() should
      be converted to that.
      
      Cc: stable@kernel.org	# 2.6.29 and later
      Tested-by: NMike Marshall <hubcap@omnibond.com>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1e2e547a
  7. 02 5月, 2018 2 次提交
    • F
      Btrfs: send, fix missing truncate for inode with prealloc extent past eof · a6aa10c7
      Filipe Manana 提交于
      An incremental send operation can miss a truncate operation when an inode
      has an increased size in the send snapshot and a prealloc extent beyond
      its size.
      
      Consider the following scenario where a necessary truncate operation is
      missing in the incremental send stream:
      
      1) In the parent snapshot an inode has a size of 1282957 bytes and it has
         no prealloc extents beyond its size;
      
      2) In the the send snapshot it has a size of 5738496 bytes and has a new
         extent at offsets 1884160 (length of 106496 bytes) and a prealloc
         extent beyond eof at offset 6729728 (and a length of 339968 bytes);
      
      3) When processing the prealloc extent, at offset 6729728, we end up at
         send.c:send_write_or_clone() and set the @len variable to a value of
         18446744073708560384 because @offset plus the original @len value is
         larger then the inode's size (6729728 + 339968 > 5738496). We then
         call send_extent_data(), with that @offset and @len, which in turn
         calls send_write(), and then the later calls fill_read_buf(). Because
         the offset passed to fill_read_buf() is greater then inode's i_size,
         this function returns 0 immediately, which makes send_write() and
         send_extent_data() do nothing and return immediately as well. When
         we get back to send.c:send_write_or_clone() we adjust the value
         of sctx->cur_inode_next_write_offset to @offset plus @len, which
         corresponds to 6729728 + 18446744073708560384 = 5738496, which is
         precisely the the size of the inode in the send snapshot;
      
      4) Later when at send.c:finish_inode_if_needed() we determine that
         we don't need to issue a truncate operation because the value of
         sctx->cur_inode_next_write_offset corresponds to the inode's new
         size, 5738496 bytes. This is wrong because the last write operation
         that was issued started at offset 1884160 with a length of 106496
         bytes, so the correct value for sctx->cur_inode_next_write_offset
         should be 1990656 (1884160 + 106496), so that a truncate operation
         with a value of 5738496 bytes would have been sent to insert a
         trailing hole at the destination.
      
      So fix the issue by making send.c:send_write_or_clone() not attempt
      to send write or clone operations for extents that start beyond the
      inode's size, since such attempts do nothing but waste time by
      calling helper functions and allocating path structures, and send
      currently has no fallocate command in order to create prealloc extents
      at the destination (either beyond a file's eof or not).
      
      The issue was found running the test btrfs/007 from fstests using a seed
      value of 1524346151 for fsstress.
      Reported-by: NGu, Jinxiang <gujx@cn.fujitsu.com>
      Fixes: ffa7c429 ("Btrfs: send, do not issue unnecessary truncate operations")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a6aa10c7
    • E
      btrfs: Take trans lock before access running trans in check_delayed_ref · 998ac6d2
      ethanwu 提交于
      In preivous patch:
      Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
      We avoid starting btrfs transaction and get this information from
      fs_info->running_transaction directly.
      
      When accessing running_transaction in check_delayed_ref, there's a
      chance that current transaction will be freed by commit transaction
      after the NULL pointer check of running_transaction is passed.
      
      After looking all the other places using fs_info->running_transaction,
      they are either protected by trans_lock or holding the transactions.
      
      Fix this by using trans_lock and increasing the use_count.
      
      Fixes: e4c3b2dc ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Nethanwu <ethanwu@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      998ac6d2
  8. 26 4月, 2018 1 次提交
    • Q
      btrfs: Fix wrong first_key parameter in replace_path · 17515f1b
      Qu Wenruo 提交于
      Commit 581c1760 ("btrfs: Validate child tree block's level and first
      key") introduced new @first_key parameter for read_tree_block(), however
      caller in replace_path() is parasing wrong key to read_tree_block().
      
      It should use parameter @first_key other than @key.
      
      Normally it won't expose problem as @key is normally initialzied to the
      same value of @first_key we expect.
      However in relocation recovery case, @key can be set to (0, 0, 0), and
      since no valid key in relocation tree can be (0, 0, 0), it will cause
      read_tree_block() to return -EUCLEAN and interrupt relocation recovery.
      
      Fix it by setting @first_key correctly.
      
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      17515f1b
  9. 21 4月, 2018 2 次提交
    • Q
      btrfs: print-tree: debugging output enhancement · c0872323
      Qu Wenruo 提交于
      This patch enhances the following things:
      
      - tree block header
        * add generation and owner output for node and leaf
      - node pointer generation output
      - allow btrfs_print_tree() to not follow nodes
        * just like btrfs-progs
      
      Please note that, although function btrfs_print_tree() is not called by
      anyone right now, it's still a pretty useful function to debug kernel.
      So that function is still kept for later use.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c0872323
    • N
      btrfs: Fix race condition between delayed refs and blockgroup removal · 5e388e95
      Nikolay Borisov 提交于
      When the delayed refs for a head are all run, eventually
      cleanup_ref_head is called which (in case of deletion) obtains a
      reference for the relevant btrfs_space_info struct by querying the bg
      for the range. This is problematic because when the last extent of a
      bg is deleted a race window emerges between removal of that bg and the
      subsequent invocation of cleanup_ref_head. This can result in cache being null
      and either a null pointer dereference or assertion failure.
      
      	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
      	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
      	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
      	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
      	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
      	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
      	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
      	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
      	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
      	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
      	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
      	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
      	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
      	 evict+0xc6/0x190
      	 do_unlinkat+0x19c/0x300
      	 do_syscall_64+0x74/0x140
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      	RIP: 0033:0x7fbf589c57a7
      
      To fix this, introduce a new flag "is_system" to head_ref structs,
      which is populated at insertion time. This allows to decouple the
      querying for the spaceinfo from querying the possibly deleted bg.
      
      Fixes: d7eae340 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
      CC: stable@vger.kernel.org # 4.14+
      Suggested-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5e388e95