1. 29 May, 2018 30 commits
  2. 28 May, 2018 3 commits
  3. 24 May, 2018 1 commit
    • Btrfs: fix error handling in btrfs_truncate() · d5014738
      Omar Sandoval committed
      Jun Wu at Facebook reported that an internal service was seeing a return
      value of 1 from ftruncate() on Btrfs in some cases. This is coming from
      the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
      
      btrfs_truncate() uses two variables for error handling, ret and err.
      When btrfs_truncate_inode_items() returns non-zero, we set err to the
      return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
      only set err if ret is an error (i.e., negative).
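
The ret/err distinction described above can be illustrated with a small userspace sketch. This is not the kernel code: fake_truncate_inode_items() and the two truncate_status_*() helpers are hypothetical stand-ins, and the value 1 for NEED_TRUNCATE_BLOCK is assumed from the return value reported above.

```c
#include <errno.h>

/* Assumed value: the report above saw 1 leak out of ftruncate(). */
#define NEED_TRUNCATE_BLOCK 1

/* Hypothetical stand-in for btrfs_truncate_inode_items(). */
static int fake_truncate_inode_items(int outcome)
{
        return outcome; /* 0, NEED_TRUNCATE_BLOCK, or a negative errno */
}

/* Buggy pattern: any non-zero ret is copied into err. */
static int truncate_status_buggy(int outcome)
{
        int err = 0;
        int ret = fake_truncate_inode_items(outcome);

        if (ret)
                err = ret;
        return err;
}

/* Fixed pattern: only negative values are treated as errors. */
static int truncate_status_fixed(int outcome)
{
        int err = 0;
        int ret = fake_truncate_inode_items(outcome);

        if (ret < 0)
                err = ret;
        return err;
}
```

With the buggy pattern, NEED_TRUNCATE_BLOCK is handed back to the caller as if it were an error; with the fixed pattern it is dropped, while a negative errno such as -ENOSPC still propagates.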
      
      To reproduce the issue: mount a filesystem with -o compress-force=zstd
      and the following program will encounter a return value of 1 from
      ftruncate():
      
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      int main(void) {
              char buf[256] = { 0 };
              int ret;
              int fd;
      
              fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
              if (fd == -1) {
                      perror("open");
                      return EXIT_FAILURE;
              }
      
              if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                      perror("write");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              if (fsync(fd) == -1) {
                      perror("fsync");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              ret = ftruncate(fd, 128);
              if (ret) {
                      printf("ftruncate() returned %d\n", ret);
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              close(fd);
              return EXIT_SUCCESS;
      }
      
      Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
      CC: stable@vger.kernel.org # 4.15+
      Reported-by: Jun Wu <quark@fb.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 17 May, 2018 6 commits
    • btrfs: fix crash when trying to resume balance without the resume flag · 02ee654d
      Anand Jain committed
      We set the BTRFS_BALANCE_RESUME flag only in btrfs_recover_balance(),
      which isn't called during a remount. So when resuming a paused
      balance we hit this bug:
      
       kernel: kernel BUG at fs/btrfs/volumes.c:3890!
       ::
       kernel:  balance_kthread+0x51/0x60 [btrfs]
       kernel:  kthread+0x111/0x130
       ::
       kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
      
      Reproducer:
        On a mounted filesystem:
      
        btrfs balance start --full-balance /btrfs
        btrfs balance pause /btrfs
        mount -o remount,ro /dev/sdb /btrfs
        mount -o remount,rw /dev/sdb /btrfs
      
      To fix this, set the BTRFS_BALANCE_RESUME flag in
      btrfs_resume_balance_async().
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Fix delalloc inodes invalidation during transaction abort · fe816d0f
      Nikolay Borisov committed
      When a transaction is aborted, btrfs_cleanup_transaction is called to
      clean up all the various in-flight bits and pieces which might be
      active. One of those is delalloc inodes - inodes which have dirty
      pages which haven't been persisted yet. Currently the process of
      freeing such delalloc inodes in exceptional circumstances such as
      transaction abort boils down to calling btrfs_invalidate_inodes, whose
      sole job is to invalidate the dentries for all inodes related to a
      root. This is in fact wrong and insufficient since such delalloc inodes
      will likely have pending pages or ordered-extents and will be linked to
      the sb->s_inode_list. This means that unmounting a btrfs instance with
      an aborted transaction could potentially leave inodes and their pages
      visible to the system long after their superblock has been freed. This
      in turn leads to a "use-after-free" situation once page shrink is
      triggered. This situation can be simulated by running generic/019,
      which causes such inodes to be left hanging, followed by
      generic/176, which causes memory pressure and page eviction, which in
      turn leads to touching the freed super block instance. This situation
      is additionally detected by the unmount code of VFS with the following
      message:
      
      "VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice day..."
      
      Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      in free_fs_root for the same reason.
      
      This patch aims to rectify the situation by doing the following:

      1. Change btrfs_destroy_delalloc_inodes so that it calls
      invalidate_inode_pages2 for every inode on the delalloc list. This
      ensures that all the pages of the inode are released. This function
      boils down to calling btrfs_releasepage. During testing I observed
      inodes on the delalloc list with an i_count of 0, so this
      necessitates using igrab to be sure we are working on a non-freed inode.

      2. Since calling btrfs_releasepage might queue delayed iputs, move the
      call to btrfs_cleanup_transaction in btrfs_error_commit_super so that it
      happens before run_delayed_iputs is called for the last time. This is
      necessary to ensure that delayed iputs are run.

      Note: this patch is tagged for 4.14 stable. The fix applies to older
      versions too, but needs to be backported manually due to conflicts.
      
      CC: stable@vger.kernel.org # 4.14.x: 2b877331: btrfs: Split btrfs_del_delalloc_inode into 2 functions
      CC: stable@vger.kernel.org # 4.14.x
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment to igrab ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Split btrfs_del_delalloc_inode into 2 functions · 2b877331
      Nikolay Borisov committed
      This is in preparation for fixing delalloc inode leakage on transaction
      abort. Also export the new function.

      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix reading stale metadata blocks after degraded raid1 mounts · 02a3307a
      Liu Bo committed
      If a btree block, a.k.a. an extent buffer, is not available in the
      extent buffer cache, it is read from disk instead, i.e.
      
      btrfs_search_slot()
        read_block_for_search()  # hold parent and its lock, go to read child
          btrfs_release_path()
          read_tree_block()  # read child
      
      Unfortunately, the parent lock is released before the child is read, so
      commit 5bdd3536 ("Btrfs: Fix block generation verification race")
      used 0 as the parent transid when reading the child block. This forces
      read_tree_block() not to check whether the parent transid differs from
      the generation id of the child that it reads from disk.
      
      A simple PoC is included in btrfs/124,
      
      0. A two-disk raid1 btrfs.

      1. Right after mkfs.btrfs, block A is allocated to be the device tree's
         root.

      2. Mount this filesystem and put it in use. After a while, the device
         tree's root gets COWed, but block A hasn't been
         allocated/overwritten yet.

      3. Umount it and reload the btrfs module to remove both disks from the
         global @fs_devices list.

      4. mount -odegraded dev1 and write some data, so that block A is now
         allocated to be a leaf in the checksum tree. Note that only dev1 has
         the latest metadata of this filesystem.

      5. Umount it and mount it again normally (with both disks). Since raid1
         can pick a disk by the writer task's pid, if btrfs_search_slot()
         needs to read block A, dev2, which does NOT have the latest
         metadata, might be read for block A, and we get a stale block A.

      6. As the parent transid is not checked, block A is marked as uptodate
         and put into the extent buffer cache, so future searches won't
         bother to read the disk again. They will make changes to this stale
         block, dirty it, and flush it to disk.
      
      To avoid the problem, parent transid needs to be passed to
      read_tree_block().
      
      In order to get a valid parent transid, we need to hold the parent's
      lock until the child has been read.
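
The effect of passing 0 as the parent transid can be sketched in userspace C. This is an illustrative model only: struct fake_block and verify_parent_transid_sketch() are hypothetical names, not the kernel's read_tree_block() path, and the generation numbers used with it are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a tree block as read from one raid1 copy. */
struct fake_block {
        uint64_t generation; /* transid stamped in the block header */
};

/*
 * Returns true if the block is accepted as up to date.
 * A parent_transid of 0 means "do not check", which is how a stale
 * copy can slip through in the scenario described above.
 */
static bool verify_parent_transid_sketch(const struct fake_block *blk,
                                         uint64_t parent_transid)
{
        if (parent_transid == 0)
                return true;
        return blk->generation == parent_transid;
}
```

In this model a stale block is accepted whenever the caller passes 0, and rejected once the caller supplies the generation it expects from the parent.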
      
      This patch needs to be slightly adapted for stable kernels: the
      &first_key parameter added to read_tree_block() is from 4.16+
      (581c1760). The fix is to replace 0 by 'gen'.
      
      Fixes: 5bdd3536 ("Btrfs: Fix block generation verification race")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: property: Set incompat flag if lzo/zstd compression is set · 1a63c198
      Misono Tomohiro committed
      The incompat flag for LZO/ZSTD compression should be set:

       1. at mount time (-o compress/compress-force)
       2. when defrag is done
       3. when the property is set

      Currently case 3 is missing; this commit adds it.

      This could lead to a filesystem that uses ZSTD but is not marked as
      such. If a kernel without ZSTD support encounters a ZSTD compressed
      extent, it will handle that but this could be confusing to the user.
      
      Typically the filesystem is mounted with the ZSTD option, but the
      discrepancy can arise when a filesystem is never mounted with ZSTD and
      then the property on some file is set (and some new extents are
      written). A simple mount with -o compress=zstd will fix that up on an
      unpatched kernel.
      
      Same goes for LZO, but this has been around for a very long time
      (2.6.37) so it's unlikely that a pre-LZO kernel would be used.
      
      Fixes: 5c1aab1d ("btrfs: Add zstd support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add user visible impact ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix duplicate extents after fsync of file with prealloc extents · 31d11b83
      Filipe Manana committed
      In commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
      after fsync log replay"), on fsync, we started to always log all prealloc
      extents beyond an inode's i_size in order to avoid losing them after a
      power failure. However, in some cases this can lead the log replay
      code to create duplicate extent items, with different lengths, in the
      extent tree. That happens because, as of that commit, we can now log
      extent items based on extent maps that are not on the "modified" list
      of extent maps of the inode's extent map tree. Logging extent items based
      on extent maps is used during the fast fsync path to save time, and for
      this to work reliably it requires that the extent maps are not merged
      with other adjacent extent maps - having the extent maps in the list
      of modified extents gives such a guarantee.
      
      Consider the following example, captured during a long run of fsstress,
      which illustrates this problem.
      
      We have inode 271, in the filesystem tree (root 5), to which all of the
      following operations and discussion apply.

      A buffered write starts at offset 312391 with a length of 933471 bytes
      (end offset at 1245862). At this point we have, for this inode, the
      following extent maps with their field values:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
            block_len 376832, orig_block_len 376832
      em C, start 417792, orig_start 417792, len 782336, block_start
            18446744073709551613, block_len 0, orig_block_len 0
      em D, start 1200128, orig_start 1200128, len 835584, block_start
            1106776064, block_len 835584, orig_block_len 835584
      em E, start 2035712, orig_start 2035712, len 245760, block_start
            1107611648, block_len 245760, orig_block_len 245760
      
      Extent map A corresponds to a hole and extent maps D and E correspond to
      preallocated extents.
      
      Extent map D ends where extent map E begins (1106776064 + 835584 =
      1107611648), but these extent maps were not merged because they are in
      the inode's list of modified extent maps.
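
The adjacency noted above (D ends where E begins, both in the file and on disk) can be checked with a few lines of C. This is a minimal sketch with hypothetical names; it models only adjacency, whereas the merge decision described above also depends on other state, such as membership in the list of modified extent maps.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of extent map fields needed for the check. */
struct em_range_sketch {
        uint64_t start;       /* file offset */
        uint64_t len;         /* length in the file */
        uint64_t block_start; /* disk byte offset */
        uint64_t block_len;   /* length on disk */
};

/* True if next begins exactly where prev ends, in the file and on disk. */
static bool adjacent_sketch(const struct em_range_sketch *prev,
                            const struct em_range_sketch *next)
{
        return prev->start + prev->len == next->start &&
               prev->block_start + prev->block_len == next->block_start;
}
```

Feeding in the values listed for em D and em E satisfies both conditions, while em B followed by em D does not, since B ends at file offset 417792.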
      
      An fsync against this inode is made, which triggers the fast path
      (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
      of the data previously written using buffered IO, and when the respective
      ordered extent finishes, btrfs_drop_extents() is called against the
      (aligned) range 311296..1249279. This causes a split of extent map D at
      btrfs_drop_extent_cache(), replacing extent map D with a new extent map
      D', also added to the list of modified extents, with the following
      values:
      
      em D', start 1249280, orig_start of 1200128,
             block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
             orig_block_len 835584,
             block_len 786432 (835584 - (1249280 - 1200128))
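
The values for D' follow from trimming the front of D at the end of the dropped range. A small sketch of that arithmetic, assuming an uncompressed extent where file offsets and disk offsets advance together; the struct and function names are hypothetical and model only the fields listed above:

```c
#include <stdint.h>

/* Hypothetical subset of extent map fields used in this example. */
struct em_sketch {
        uint64_t start;
        uint64_t len;
        uint64_t block_start;
        uint64_t block_len;
};

/* Keep only the part of em that begins at new_start (new_start > em.start). */
static struct em_sketch split_tail_sketch(struct em_sketch em,
                                          uint64_t new_start)
{
        uint64_t diff = new_start - em.start; /* bytes trimmed from the front */
        struct em_sketch tail = {
                .start = new_start,
                .len = em.len - diff,
                .block_start = em.block_start + diff,
                .block_len = em.block_len - diff,
        };
        return tail;
}
```

Applied to em D with new_start 1249280, the trimmed amount is 49152 bytes, which gives the block_start and block_len shown for D'.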
      
      Then, during the fast fsync, btrfs_log_changed_extents() is called and
      extent maps D' and E are removed from the list of modified extents. The
      flag EXTENT_FLAG_LOGGING is also set on them. After the extents are
      logged, clear_em_logging() is called on each of them, and that causes
      extent map E to be merged with extent map D' (try_merge_map()),
      resulting in D' being deleted and E adjusted to:
      
      em E, start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192,
            orig_block_len 245760
      
      A direct IO write at offset 1847296 and length of 360448 bytes (end offset
      at 2207744) starts, and at that moment the following extent maps exist for
      our inode:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192, orig_block_len 245760
      
      The dio write results in drop_extent_cache() being called twice. The first
      time for a range that starts at offset 1847296 and ends at offset 2035711
      (length of 188416), which results in a double split of extent map E,
      replacing it with two new extent maps:
      
      em F, start 1249280, orig_start 1200128, block_start 1106825216,
            block_len 598016, orig_block_len 598016
      em G, start 2035712, orig_start 1200128, block_start 1107611648,
            block_len 245760, orig_block_len 1032192
      
      It also creates a new extent map that represents a part of the requested
      IO (through create_io_em()):
      
      em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
      
      The second call to drop_extent_cache() has a range with a start offset of
      2035712 and end offset of 2207743 (length of 172032). This leads to
      replacing extent map G with a new extent map I with the following values:
      
      em I, start 2207744, orig_start 1200128, block_start 1107783680,
            block_len 73728, orig_block_len 1032192
      
      It also creates a new extent map that represents the second part of the
      requested IO (through create_io_em()):
      
      em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
      
      The dio write set the inode's i_size to 2207744 bytes.
      
      After the dio write the inode has the following extent maps:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em F, start 1249280, orig_start 1200128, len 598016,
            block_start 1106825216, block_len 598016, orig_block_len 598016
      em H, start 1847296, orig_start 1200128, len 188416,
            block_start 1107423232, block_len 188416, orig_block_len 835584
      em J, start 2035712, orig_start 2035712, len 172032,
            block_start 1107611648, block_len 172032, orig_block_len 245760
      em I, start 2207744, orig_start 1200128, len 73728,
            block_start 1107783680, block_len 73728, orig_block_len 1032192
      
      Now make some change to the file, such as adding an xattr, and then
      fsync it again. This triggers the fast fsync path, and as of commit
      471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync
      log replay"), we use the extent map I to log a file extent item because
      it's a prealloc extent and it starts at an offset matching the inode's
      i_size. However, when we log it, we create a file extent item with a
      wrong value for the disk byte location, as can be seen from the
      following output of "btrfs inspect-internal dump-tree":
      
       item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
           generation 22 type 2 (prealloc)
           prealloc data disk byte 1106776064 nr 1032192
           prealloc data offset 1007616 nr 73728
      
      Here the disk byte value corresponds to a calculation based on some
      fields from the extent map I:
      
        1106776064 = block_start (1107783680) - 1007616 (extent_offset)
        extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
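
The two formulas above can be written out as a short C sketch so the numbers are easy to re-check. The function names are hypothetical; only the arithmetic is taken from the text.

```c
#include <stdint.h>

/* extent_offset as used in the calculation above: start - orig_start. */
static uint64_t extent_offset_sketch(uint64_t start, uint64_t orig_start)
{
        return start - orig_start;
}

/* Disk byte of the logged file extent item: block_start - extent_offset. */
static uint64_t disk_bytenr_sketch(uint64_t block_start, uint64_t start,
                                   uint64_t orig_start)
{
        return block_start - extent_offset_sketch(start, orig_start);
}
```

With em I's fields (block_start 1107783680, start 2207744, orig_start 1200128) this yields an extent offset of 1007616 and a disk byte of 1106776064. For comparison, substituting the orig_start that em E had before the merge (2035712) yields an extent offset of 172032 and a disk byte of 1107611648.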
      
      The disk byte value of 1106776064 clashes with disk byte values of the
      file extent items at offsets 1249280 and 1847296 in the fs tree:
      
              item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1106776064 nr 835584
                      prealloc data offset 49152 nr 598016
              item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1106776064 nr 835584
                      extent data offset 647168 nr 188416 ram 835584
                      extent compression 0 (none)
              item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1107611648 nr 245760
                      extent data offset 0 nr 172032 ram 245760
                      extent compression 0 (none)
              item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1107611648 nr 245760
                      prealloc data offset 172032 nr 73728
      
      Instead of the disk byte value of 1106776064, the value of 1107611648
      should have been logged. Also the data offset value should have been
      172032 and not 1007616.

      After a log replay we end up getting two extent items in the extent tree
      with different lengths, one of 835584, which is correct and existed
      before the log replay, and another one of 1032192 which is wrong and is
      based on the logged file extent item:
      
       item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
          refs 2 gen 15 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 2
       item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
          refs 1 gen 22 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 1
      
      Obviously this leads to many problems and a filesystem check reports many
      errors:
      
       (...)
       checking extents
       Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
       extent item 1106776064 has multiple extent items
       ref mismatch on [1106776064 835584] extent item 2, found 3
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
       Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
       Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
       backpointer mismatch on [1106776064 835584]
       checking free space cache
       block group 1103101952 has wrong amount of free space
       failed to load free space cache for block group 1103101952
       checking fs roots
       (...)
      
      So fix this by logging the prealloc extents beyond the inode's i_size
      based on searches in the subvolume tree instead of the extent maps.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>