1. 29 5月, 2018 32 次提交
  2. 28 5月, 2018 3 次提交
  3. 24 5月, 2018 1 次提交
    • O
      Btrfs: fix error handling in btrfs_truncate() · d5014738
      Omar Sandoval 提交于
      Jun Wu at Facebook reported that an internal service was seeing a return
      value of 1 from ftruncate() on Btrfs in some cases. This is coming from
      the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
      
      btrfs_truncate() uses two variables for error handling, ret and err.
      When btrfs_truncate_inode_items() returns non-zero, we set err to the
      return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
      only set err if ret is an error (i.e., negative).
      
      To reproduce the issue: mount a filesystem with -o compress-force=zstd
      and the following program will encounter return value of 1 from
      ftruncate:
      
      int main(void) {
              char buf[256] = { 0 };
              int ret;
              int fd;
      
              fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
              if (fd == -1) {
                      perror("open");
                      return EXIT_FAILURE;
              }
      
              if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                      perror("write");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              if (fsync(fd) == -1) {
                      perror("fsync");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              ret = ftruncate(fd, 128);
              if (ret) {
                      printf("ftruncate() returned %d\n", ret);
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              close(fd);
              return EXIT_SUCCESS;
      }
      
      Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
      CC: stable@vger.kernel.org # 4.15+
      Reported-by: NJun Wu <quark@fb.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d5014738
  4. 17 5月, 2018 4 次提交
    • A
      btrfs: fix crash when trying to resume balance without the resume flag · 02ee654d
      Anand Jain 提交于
      We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance()
      only, which isn't called during the remount. So when resuming from
      the paused balance we hit the bug:
      
       kernel: kernel BUG at fs/btrfs/volumes.c:3890!
       ::
       kernel:  balance_kthread+0x51/0x60 [btrfs]
       kernel:  kthread+0x111/0x130
       ::
       kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
      
      Reproducer:
        On a mounted filesystem:
      
        btrfs balance start --full-balance /btrfs
        btrfs balance pause /btrfs
        mount -o remount,ro /dev/sdb /btrfs
        mount -o remount,rw /dev/sdb /btrfs
      
      To fix this set the BTRFS_BALANCE_RESUME flag in
      btrfs_resume_balance_async().
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      02ee654d
    • N
      btrfs: Fix delalloc inodes invalidation during transaction abort · fe816d0f
      Nikolay Borisov 提交于
      When a transaction is aborted btrfs_cleanup_transaction is called to
      cleanup all the various in-flight bits and pieces which migth be
      active. One of those is delalloc inodes - inodes which have dirty
      pages which haven't been persisted yet. Currently the process of
      freeing such delalloc inodes in exceptional circumstances such as
      transaction abort boiled down to calling btrfs_invalidate_inodes whose
      sole job is to invalidate the dentries for all inodes related to a
      root. This is in fact wrong and insufficient since such delalloc inodes
      will likely have pending pages or ordered-extents and will be linked to
      the sb->s_inode_list. This means that unmounting a btrfs instance with
      an aborted transaction could potentially lead inodes/their pages
      visible to the system long after their superblock has been freed. This
      in turn leads to a "use-after-free" situation once page shrink is
      triggered. This situation could be simulated by running generic/019
      which would cause such inodes to be left hanging, followed by
      generic/176 which causes memory pressure and page eviction which lead
      to touching the freed super block instance. This situation is
      additionally detected by the unmount code of VFS with the following
      message:
      
      "VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice day..."
      
      Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      in free_fs_root for the same reason.
      
      This patch aims to rectify the sitaution by doing the following:
      
      1. Change btrfs_destroy_delalloc_inodes so that it calls
      invalidate_inode_pages2 for every inode on the delalloc list, this
      ensures that all the pages of the inode are released. This function
      boils down to calling btrfs_releasepage. During test I observed cases
      where inodes on the delalloc list were having an i_count of 0, so this
      necessitates using igrab to be sure we are working on a non-freed inode.
      
      2. Since calling btrfs_releasepage might queue delayed iputs move the
      call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
      calling run_delayed_iputs for the last time. This is necessary to ensure
      that delayed iputs are run.
      
      Note: this patch is tagged for 4.14 stable but the fix applies to older
      versions too but needs to be backported manually due to conflicts.
      
      CC: stable@vger.kernel.org # 4.14.x: 2b877331: btrfs: Split btrfs_del_delalloc_inode into 2 functions
      CC: stable@vger.kernel.org # 4.14.x
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add comment to igrab ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fe816d0f
    • N
      btrfs: Split btrfs_del_delalloc_inode into 2 functions · 2b877331
      Nikolay Borisov 提交于
      This is in preparation of fixing delalloc inodes leakage on transaction
      abort. Also export the new function.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2b877331
    • L
      btrfs: fix reading stale metadata blocks after degraded raid1 mounts · 02a3307a
      Liu Bo 提交于
      If a btree block, aka. extent buffer, is not available in the extent
      buffer cache, it'll be read out from the disk instead, i.e.
      
      btrfs_search_slot()
        read_block_for_search()  # hold parent and its lock, go to read child
          btrfs_release_path()
          read_tree_block()  # read child
      
      Unfortunately, the parent lock got released before reading child, so
      commit 5bdd3536 ("Btrfs: Fix block generation verification race") had
      used 0 as parent transid to read the child block.  It forces
      read_tree_block() not to check if parent transid is different with the
      generation id of the child that it reads out from disk.
      
      A simple PoC is included in btrfs/124,
      
      0. A two-disk raid1 btrfs,
      
      1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
      
      2. Mount this filesystem and put it in use, after a while, device tree's
         root got COW but block A hasn't been allocated/overwritten yet.
      
      3. Umount it and reload the btrfs module to remove both disks from the
         global @fs_devices list.
      
      4. mount -odegraded dev1 and write some data, so now block A is allocated
         to be a leaf in checksum tree.  Note that only dev1 has the latest
         metadata of this filesystem.
      
      5. Umount it and mount it again normally (with both disks), since raid1
         can pick up one disk by the writer task's pid, if btrfs_search_slot()
         needs to read block A, dev2 which does NOT have the latest metadata
         might be read for block A, then we got a stale block A.
      
      6. As parent transid is not checked, block A is marked as uptodate and
         put into the extent buffer cache, so the future search won't bother
         to read disk again, which means it'll make changes on this stale
         one and make it dirty and flush it onto disk.
      
      To avoid the problem, parent transid needs to be passed to
      read_tree_block().
      
      In order to get a valid parent transid, we need to hold the parent's
      lock until finishing reading child.
      
      This patch needs to be slightly adapted for stable kernels, the
      &first_key parameter added to read_tree_block() is from 4.16+
      (581c1760). The fix is to replace 0 by 'gen'.
      
      Fixes: 5bdd3536 ("Btrfs: Fix block generation verification race")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      02a3307a