1. 14 11月, 2018 5 次提交
    • J
      btrfs: fix error handling in btrfs_dev_replace_start · 18bdce0e
      Jeff Mahoney 提交于
      commit 5c06147128fbbdf7a84232c5f0d808f53153defe upstream.
      
      When we fail to start a transaction in btrfs_dev_replace_start, we leave
      dev_replace->replace_start set to STARTED but clear ->srcdev and
      ->tgtdev.  Later, that can result in an Oops in
      btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED
      implies that ->srcdev is valid.
      
      Also fix error handling when the state is already STARTED or SUSPENDED
      while starting.  That, too, will clear ->srcdev and ->tgtdev even though
      it doesn't own them.  This should be an impossible case to hit since we
      should be protected by the BTRFS_FS_EXCL_OP bit being set.  Let's add an
      ASSERT there while we're at it.
      
      Fixes: e93c89c1 (Btrfs: add new sources for device replace code)
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18bdce0e
    • J
      btrfs: fix error handling in free_log_tree · cdecd48a
      Jeff Mahoney 提交于
      commit 374b0e2d6ba5da7fd1cadb3247731ff27d011f6f upstream.
      
      When we hit an I/O error in free_log_tree->walk_log_tree during file system
      shutdown we can crash due to there not being a valid transaction handle.
      
      Use btrfs_handle_fs_error when there's no transaction handle to use.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
        IP: free_log_tree+0xd2/0x140 [btrfs]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        Modules linked in: <modules>
        CPU: 2 PID: 23544 Comm: umount Tainted: G        W        4.12.14-kvmsmall #9 SLE15 (unreleased)
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        task: ffff96bfd3478880 task.stack: ffffa7cf40d78000
        RIP: 0010:free_log_tree+0xd2/0x140 [btrfs]
        RSP: 0018:ffffa7cf40d7bd10 EFLAGS: 00010282
        RAX: 00000000fffffffb RBX: 00000000fffffffb RCX: 0000000000000002
        RDX: 0000000000000000 RSI: ffff96c02f07d4c8 RDI: 0000000000000282
        RBP: ffff96c013cf1000 R08: ffff96c02f07d4c8 R09: ffff96c02f07d4d0
        R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
        R13: ffff96c005e800c0 R14: ffffa7cf40d7bdb8 R15: 0000000000000000
        FS:  00007f17856bcfc0(0000) GS:ffff96c03f600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000060 CR3: 0000000045ed6002 CR4: 00000000003606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? wait_for_writer+0xb0/0xb0 [btrfs]
         btrfs_free_log+0x17/0x30 [btrfs]
         btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs]
         btrfs_free_fs_roots+0xc0/0x130 [btrfs]
         ? wait_for_completion+0xf2/0x100
         close_ctree+0xea/0x2e0 [btrfs]
         ? kthread_stop+0x161/0x260
         generic_shutdown_super+0x6c/0x120
         kill_anon_super+0xe/0x20
         btrfs_kill_super+0x13/0x100 [btrfs]
         deactivate_locked_super+0x3f/0x70
         cleanup_mnt+0x3b/0x70
         task_work_run+0x78/0x90
         exit_to_usermode_loop+0x77/0xa6
         do_syscall_64+0x1c5/0x1e0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
        RIP: 0033:0x7f1784f90827
        RSP: 002b:00007ffdeeb03118 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000556a60c62970 RCX: 00007f1784f90827
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556a60c62b50
        RBP: 0000000000000000 R08: 0000000000000005 R09: 00000000ffffffff
        R10: 0000556a60c63900 R11: 0000000000000246 R12: 0000556a60c62b50
        R13: 00007f17854a81c4 R14: 0000000000000000 R15: 0000000000000000
        RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: ffffa7cf40d7bd10
        CR2: 0000000000000060
      
      Fixes: 681ae509 ("Btrfs: cleanup reserved space when freeing tree log on error")
      CC: <stable@vger.kernel.org> # v3.13
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdecd48a
    • Q
      btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock · d4f56c44
      Qu Wenruo 提交于
      commit b72c3aba09a53fc7c1824250d71180ca154517a7 upstream.
      
      [BUG]
      For certain crafted image, whose csum root leaf has missing backref, if
      we try to trigger write with data csum, it could cause deadlock with the
      following kernel WARN_ON():
      
        WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
        CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
        Workqueue: btrfs-endio-write btrfs_endio_write_helper
        RIP: 0010:btrfs_tree_lock+0x3e2/0x400
        Call Trace:
         btrfs_alloc_tree_block+0x39f/0x770
         __btrfs_cow_block+0x285/0x9e0
         btrfs_cow_block+0x191/0x2e0
         btrfs_search_slot+0x492/0x1160
         btrfs_lookup_csum+0xec/0x280
         btrfs_csum_file_blocks+0x2be/0xa60
         add_pending_csums+0xaf/0xf0
         btrfs_finish_ordered_io+0x74b/0xc90
         finish_ordered_fn+0x15/0x20
         normal_work_helper+0xf6/0x500
         btrfs_endio_write_helper+0x12/0x20
         process_one_work+0x302/0x770
         worker_thread+0x81/0x6d0
         kthread+0x180/0x1d0
         ret_from_fork+0x35/0x40
      
      [CAUSE]
      That crafted image has missing backref for csum tree root leaf.  And
      when we try to allocate new tree block, since there is no
      EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
      and use it.
      
      The extent tree of the image looks like:
      
        Normal image                      |       This fuzzed image
        ----------------------------------+--------------------------------
        BG 29360128                       | BG 29360128
         One empty slot                   |  One empty slot
        29364224: backref to UUID tree    | 29364224: backref to UUID tree
         Two empty slots                  |  Two empty slots
        29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
        29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
        ...                               | ...
      
      Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
      alloc tree block, it's an valid slot for btrfs.
      
      And for finish_ordered_write, when we need to insert csum, we try to CoW
      csum tree root.
      
      By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
      BG_OFFSET + 12K is already used by tree block COW for other trees, the
      next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
      tree.
      
      But due to the bad type, btrfs can recognize it and still consider it as
      an empty slot, and will try to use it for csum tree CoW.
      
      Then in the following call trace, we will try to lock the new tree
      block, which turns out to be the old csum tree root which is already
      locked:
      
      btrfs_search_slot() called on csum tree root, which is at 29376512
      |- btrfs_cow_block()
         |- btrfs_set_lock_block()
         |  |- Now locks tree block 29376512 (old csum tree root)
         |- __btrfs_cow_block()
            |- btrfs_alloc_tree_block()
               |- btrfs_reserve_extent()
                  | Now it returns tree block 29376512, which extent tree
                  | shows its empty slot, but it's already hold by csum tree
                  |- btrfs_init_new_buffer()
                     |- btrfs_tree_lock()
                        | Triggers WARN_ON(eb->lock_owner == current->pid)
                        |- wait_event()
                           Wait lock owner to release the lock, but it's
                           locked by ourself, so it will deadlock
      
      [FIX]
      This patch will do the lock_owner and current->pid check at
      btrfs_init_new_buffer().
      So above deadlock can be avoided.
      
      Since such problem can only happen in crafted image, we will still
      trigger kernel warning for later aborted transaction, but with a little
      more meaningful warning message.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405Reported-by: NXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4f56c44
    • Q
      btrfs: Handle owner mismatch gracefully when walking up tree · 4f2a4e02
      Qu Wenruo 提交于
      commit 65c6e82becec33731f48786e5a30f98662c86b16 upstream.
      
      [BUG]
      When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
      when trying to recover balance:
      
        kernel BUG at fs/btrfs/extent-tree.c:8956!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
        RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
        RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
        Call Trace:
         walk_up_tree+0x172/0x1f0 [btrfs]
         btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
         merge_reloc_roots+0xe1/0x1d0 [btrfs]
         btrfs_recover_relocation+0x3ea/0x420 [btrfs]
         open_ctree+0x1af3/0x1dd0 [btrfs]
         btrfs_mount_root+0x66b/0x740 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         btrfs_mount+0x16d/0x890 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         do_mount+0x1fd/0xda0
         ksys_mount+0xba/0xd0
         __x64_sys_mount+0x21/0x30
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      Extent tree corruption.  In this particular case, reloc tree root's
      owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
      corrupted and we failed the owner check in walk_up_tree().
      
      [FIX]
      It's pretty hard to take care of every extent tree corruption, but at
      least we can remove such BUG_ON() and exit more gracefully.
      
      And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
      the same root (which is obviously invalid), we needs to make
      __del_reloc_root() more robust to detect such invalid sharing to avoid
      possible NULL dereference as root->node can be NULL in this case.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411Reported-by: NXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f2a4e02
    • Q
      btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled · c25c6665
      Qu Wenruo 提交于
      commit 3628b4ca64f24a4ec55055597d0cb1c814729f8b upstream.
      
      Some qgroup trace events like btrfs_qgroup_release_data() and
      btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
      not enabled.
      
      This is caused by the lack of qgroup status check before calling some
      qgroup functions.  Thankfully the functions can handle quota disabled
      case well and just do nothing for qgroup disabled case.
      
      This patch will do earlier check before triggering related trace events.
      
      And for enabled <-> disabled race case:
      
      1) For enabled->disabled case
         Disable will wipe out all qgroups data including reservation and
         excl/rfer. Even if we leak some reservation or numbers, it will
         still be cleared, so nothing will go wrong.
      
      2) For disabled -> enabled case
         Current btrfs_qgroup_release_data() will use extent_io tree to ensure
         we won't underflow reservation. And for delayed_ref we use
         head->qgroup_reserved to record the reserved space, so in that case
         head->qgroup_reserved should be 0 and we won't underflow.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: NChris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c25c6665
  2. 24 8月, 2018 1 次提交
  3. 23 8月, 2018 5 次提交
    • D
      btrfs: use after free in btrfs_quota_enable · b9b8a41a
      Dan Carpenter 提交于
      The issue here is that btrfs_commit_transaction() frees "trans" on both
      the error and the success path.  So the problem would be if
      btrfs_commit_transaction() succeeds, and then qgroup_rescan_init()
      fails.  That means that "ret" is non-zero and "trans" is non-NULL and it
      leads to a use after free inside the btrfs_end_transaction() macro.
      
      Fixes: 340f1aa2 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b9b8a41a
    • A
      btrfs: btrfs_shrink_device should call commit transaction at the end · 801660b0
      Anand Jain 提交于
      Test case btrfs/164 reports use-after-free:
      
      [ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
      ..
      [ 6712.195423]  btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
      [ 6712.201424]  btrfs_commit_transaction+0x57d/0xa90 [btrfs]
      [ 6712.206999]  btrfs_rm_device+0x627/0x850 [btrfs]
      [ 6712.211800]  btrfs_ioctl+0x2b03/0x3120 [btrfs]
      
      Reason for this is that btrfs_shrink_device adds the resized device to
      the fs_devices::resized_devices after it has called the last commit
      transaction.
      
      So the list fs_devices::resized_devices is not empty when
      btrfs_shrink_device returns.  Now the parent function
      btrfs_rm_device calls:
      
              btrfs_close_bdev(device);
              call_rcu(&device->rcu, free_device_rcu);
      
      and then does the transactio ncommit. It goes through the
      fs_devices::resized_devices in btrfs_update_commit_device_size and
      leads to use-after-free.
      
      Fix this by making sure btrfs_shrink_device calls the last needed
      btrfs_commit_transaction before the return. This is consistent with what
      the grow counterpart does and this makes sure the on-disk state is
      persistent when the function returns.
      Reported-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Tested-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      801660b0
    • L
      btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata · a5b7f429
      Lu Fengqi 提交于
      After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
      again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we
      can't properly free the num_bytes of the previous qgroup_reserve. Use a
      separate variable to store the num_bytes of the qgroup_reserve.
      
      Delete the comment for the qgroup_reserved that does not exist and add a
      comment about use_global_rsv.
      
      Fixes: c4c129db ("btrfs: drop unused parameter qgroup_reserved")
      CC: stable@vger.kernel.org # 4.18+
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a5b7f429
    • F
      Btrfs: fix data corruption when deduplicating between different files · de02b9f6
      Filipe Manana 提交于
      If we deduplicate extents between two different files we can end up
      corrupting data if the source range ends at the size of the source file,
      the source file's size is not aligned to the filesystem's block size
      and the destination range does not go past the size of the destination
      file size.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
        # The first byte with a value of 0xae starts at an offset (2518890)
        # which is not a multiple of the sector size.
        $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo
      
        # Confirm the file content is full of bytes with values 0x6b and 0xae.
        $ od -t x1 /mnt/foo
        0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
        *
        11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
        11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
        *
        11777540 ae ae ae ae ae ae ae ae
        11777550
      
        # Create a second file with a length not aligned to the sector size,
        # whose bytes all have the value 0x6b, so that its extent(s) can be
        # deduplicated with the first file.
        $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar
      
        # Now deduplicate the entire second file into a range of the first file
        # that also has all bytes with the value 0x6b. The destination range's
        # end offset must not be aligned to the sector size and must be less
        # then the offset of the first byte with the value 0xae (byte at offset
        # 2518890).
        $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo
      
        # The bytes in the range starting at offset 2515659 (end of the
        # deduplication range) and ending at offset 2519040 (start offset
        # rounded up to the block size) must all have the value 0xae (and not
        # replaced with 0x00 values). In other words, we should have exactly
        # the same data we had before we asked for deduplication.
        $ od -t x1 /mnt/foo
        0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
        *
        11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
        11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
        *
        11777540 ae ae ae ae ae ae ae ae
        11777550
      
        # Unmount the filesystem and mount it again. This guarantees any file
        # data in the page cache is dropped.
        $ umount /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ od -t x1 /mnt/foo
        0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
        *
        11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
        11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
        *
        11777540 ae ae ae ae ae ae ae ae
        11777550
      
        # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
        # value of 0xae, data corruption happened due to the deduplication
        # operation.
      
      So fix this by rounding down, to the sector size, the length used for the
      deduplication when the following conditions are met:
      
        1) Source file's range ends at its i_size;
        2) Source file's i_size is not aligned to the sector size;
        3) Destination range does not cross the i_size of the destination file.
      
      Fixes: e1d227a4 ("btrfs: Handle unaligned length in extent_same")
      CC: stable@vger.kernel.org # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de02b9f6
    • F
      Btrfs: sync log after logging new name · d4682ba0
      Filipe Manana 提交于
      When we add a new name for an inode which was logged in the current
      transaction, we update the inode in the log so that its new name and
      ancestors are added to the log. However when we do this we do not persist
      the log, so the changes remain in memory only, and as a consequence, any
      ancestors that were created in the current transaction are updated such
      that future calls to btrfs_inode_in_log() return true. This leads to a
      subsequent fsync against such new ancestor directories returning
      immediately, without persisting the log, therefore after a power failure
      the new ancestor directories do not exist, despite fsync being called
      against them explicitly.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ mkdir /mnt/A/C
        $ touch /mnt/B/foo
        $ xfs_io -c "fsync" /mnt/B/foo
        $ ln /mnt/B/foo /mnt/A/C/foo
        $ xfs_io -c "fsync" /mnt/A
        <power failure>
      
      After the power failure, directory "A" does not exist, despite the explicit
      fsync on it.
      
      Instead of fixing this by changing the behaviour of the explicit fsync on
      directory "A" to persist the log instead of doing nothing, make the logging
      of the new file name (which happens when creating a hard link or renaming)
      persist the log. This approach not only is simpler, not requiring addition
      of new fields to the inode in memory structure, but also gives us the same
      behaviour as ext4, xfs and f2fs (possibly other filesystems too).
      
      A test case for fstests follows soon.
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      Reported-by: NVijay Chidambaram <vvijay03@gmail.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d4682ba0
  4. 18 8月, 2018 2 次提交
    • J
      btrfs: readpages() should submit IO as read-ahead · 5e9d3982
      Jens Axboe 提交于
      a_ops->readpages() is only ever used for read-ahead.  Ensure that we
      pass this information down to the block layer.
      
      Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dkSigned-off-by: NJens Axboe <axboe@kernel.dk>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <clm@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e9d3982
    • R
      Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space · 8ecebf4d
      Robbie Ko 提交于
      Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced
      nocow writes to fallback to COW, during writeback, when a snapshot is
      created. This resulted in writes made before creating the snapshot to
      unexpectedly fail with ENOSPC during writeback when success (0) was
      returned to user space through the write system call.
      
      The steps leading to this problem are:
      
      1. When it's not possible to allocate data space for a write, the
         buffered write path checks if a NOCOW write is possible.  If it is,
         it will not reserve space and success (0) is returned to user space.
      
      2. Then when a snapshot is created, the root's will_be_snapshotted
         atomic is incremented and writeback is triggered for all inode's that
         belong to the root being snapshotted. Incrementing that atomic forces
         all previous writes to fallback to COW during writeback (running
         delalloc).
      
      3. This results in the writeback for the inodes to fail and therefore
         setting the ENOSPC error in their mappings, so that a subsequent
         fsync on them will report the error to user space. So it's not a
         completely silent data loss (since fsync will report ENOSPC) but it's
         a very unexpected and undesirable behaviour, because if a clean
         shutdown/unmount of the filesystem happens without previous calls to
         fsync, it is expected to have the data present in the files after
         mounting the filesystem again.
      
      So fix this by adding a new atomic named snapshot_force_cow to the
      root structure which prevents this behaviour and works the following way:
      
      1. It is incremented when we start to create a snapshot after triggering
         writeback and before waiting for writeback to finish.
      
      2. This new atomic is now what is used by writeback (running delalloc)
         to decide whether we need to fallback to COW or not. Because we
         incremented this new atomic after triggering writeback in the
         snapshot creation ioctl, we ensure that all buffered writes that
         happened before snapshot creation will succeed and not fallback to
         COW (which would make them fail with ENOSPC).
      
      3. The existing atomic, will_be_snapshotted, is kept because it is used
         to force new buffered writes, that start after we started
         snapshotting, to reserve data space even when NOCOW is possible.
         This makes these writes fail early with ENOSPC when there's no
         available space to allocate, preventing the unexpected behaviour of
         writeback later failing with ENOSPC due to a fallback to COW mode.
      
      Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting")
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ecebf4d
  5. 06 8月, 2018 27 次提交