1. 27 Jul 2020, 33 commits
  2. 22 Jul 2020, 1 commit
    • Q
      btrfs: qgroup: fix data leak caused by race between writeback and truncate · fa91e4aa
      By Qu Wenruo
      [BUG]
      When running tests like generic/013 on test device with btrfs quota
      enabled, it can normally lead to data leak, detected at unmount time:
      
        BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
        ------------[ cut here ]------------
        WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
        RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace caf08beafeca2392 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
      
      [CAUSE]
      In the failing test case, the offending operations are:
      2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
      2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
      
      The following sequence of events could happen after the writev():
      	CPU1 (writeback)		|		CPU2 (truncate)
      -----------------------------------------------------------------
      btrfs_writepages()			|
      |- extent_write_cache_pages()		|
         |- Got page for 1003520		|
         |  1003520 is Dirty, no writeback	|
         |  So (!clear_page_dirty_for_io())   |
         |  gets called for it		|
         |- Now page 1003520 is Clean.	|
         |					| btrfs_setattr()
         |					| |- btrfs_setsize()
         |					|    |- truncate_setsize()
         |					|       New i_size is 18388
         |- __extent_writepage()		|
         |  |- page_offset() > i_size		|
            |- btrfs_invalidatepage()		|
      	 |- Page is clean, so no qgroup |
      	    callback executed
      
      This means the qgroup reserved data space is not properly released in
      btrfs_invalidatepage(), as the page is Clean.
      
      [FIX]
      Instead of checking the dirty bit of a page, call
      btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
      
      Since qgroup reserved space is bound entirely to the QGROUP_RESERVED bit
      of the io_tree, and not to page status, this will not cause double
      freeing.
      
      Fixes: 0b34c261 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fa91e4aa
  3. 09 Jul 2020, 1 commit
    • J
      btrfs: fix double put of block group with nocow · 230ed397
      By Josef Bacik
      While debugging a patch that I wrote I was hitting use-after-free panics
      when accessing block groups on unmount.  This turned out to be because
      in the nocow case if we bail out of doing the nocow for whatever reason
      we need to call btrfs_dec_nocow_writers() if we called the inc.  This
      puts our block group, but a few error cases do
      
      if (nocow) {
          btrfs_dec_nocow_writers();
          goto error;
      }
      
      Unfortunately, the error label is
      
      error:
      	if (nocow)
      		btrfs_dec_nocow_writers();
      
      so we get a double put on our block group.  Fix this by dropping the
      calls to btrfs_dec_nocow_writers() at the error sites, since the error
      label now handles it.
      
      Fixes: 762bf098 ("btrfs: improve error handling in run_delalloc_nocow")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      230ed397
  4. 17 Jun 2020, 3 commits
    • F
      btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof · 4b194628
      By Filipe Manana
      If we attempt to write to a prealloc extent located beyond eof using a
      RWF_NOWAIT write, we always fail with -EAGAIN.
      
      We do actually check if we have an allocated extent for the write at
      the start of btrfs_file_write_iter() through a call to check_can_nocow(),
      but later when we go into the actual direct IO write path we simply
      return -EAGAIN if the write starts at or beyond EOF.
      
      Trivial to reproduce:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/foo
        $ chattr +C /mnt/foo
      
        $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec)
      
        $ xfs_io -c "falloc -k 64K 1M" /mnt/foo
      
        $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo
        pwrite: Resource temporarily unavailable
      
      On xfs and ext4 the write succeeds, as expected.
      
      Fix this by removing the wrong check at btrfs_direct_IO().
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4b194628
    • F
      btrfs: fix bytes_may_use underflow when running balance and scrub in parallel · 6bd335b4
      By Filipe Manana
      When balance and scrub are running in parallel it is possible to end up
      with an underflow of the bytes_may_use counter of the data space_info
      object, which triggers a warning like the following:
      
         [134243.793196] BTRFS info (device sdc): relocating block group 1104150528 flags data
         [134243.806891] ------------[ cut here ]------------
         [134243.807561] WARNING: CPU: 1 PID: 26884 at fs/btrfs/space-info.h:125 btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
         [134243.808819] Modules linked in: btrfs blake2b_generic xor (...)
         [134243.815779] CPU: 1 PID: 26884 Comm: kworker/u8:8 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
         [134243.816944] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
         [134243.818389] Workqueue: writeback wb_workfn (flush-btrfs-108483)
         [134243.819186] RIP: 0010:btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
         [134243.819963] Code: 0b f2 85 (...)
         [134243.822271] RSP: 0018:ffffa4160aae7510 EFLAGS: 00010287
         [134243.822929] RAX: 000000000000c000 RBX: ffff96159a8c1000 RCX: 0000000000000000
         [134243.823816] RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff96158067a810
         [134243.824742] RBP: ffff96158067a800 R08: 0000000000000001 R09: 0000000000000000
         [134243.825636] R10: ffff961501432a40 R11: 0000000000000000 R12: 000000000000c000
         [134243.826532] R13: 0000000000000001 R14: ffffffffffff4000 R15: ffff96158067a810
         [134243.827432] FS:  0000000000000000(0000) GS:ffff9615baa00000(0000) knlGS:0000000000000000
         [134243.828451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [134243.829184] CR2: 000055bd7e414000 CR3: 00000001077be004 CR4: 00000000003606e0
         [134243.830083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [134243.830975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [134243.831867] Call Trace:
         [134243.832211]  find_free_extent+0x4a0/0x16c0 [btrfs]
         [134243.832846]  btrfs_reserve_extent+0x91/0x180 [btrfs]
         [134243.833487]  cow_file_range+0x12d/0x490 [btrfs]
         [134243.834080]  fallback_to_cow+0x82/0x1b0 [btrfs]
         [134243.834689]  ? release_extent_buffer+0x121/0x170 [btrfs]
         [134243.835370]  run_delalloc_nocow+0x33f/0xa30 [btrfs]
         [134243.836032]  btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
         [134243.836725]  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
         [134243.837450]  writepage_delalloc+0xe8/0x150 [btrfs]
         [134243.838059]  __extent_writepage+0xe8/0x4c0 [btrfs]
         [134243.838674]  extent_write_cache_pages+0x237/0x530 [btrfs]
         [134243.839364]  extent_writepages+0x44/0xa0 [btrfs]
         [134243.839946]  do_writepages+0x23/0x80
         [134243.840401]  __writeback_single_inode+0x59/0x700
         [134243.841006]  writeback_sb_inodes+0x267/0x5f0
         [134243.841548]  __writeback_inodes_wb+0x87/0xe0
         [134243.842091]  wb_writeback+0x382/0x590
         [134243.842574]  ? wb_workfn+0x4a2/0x6c0
         [134243.843030]  wb_workfn+0x4a2/0x6c0
         [134243.843468]  process_one_work+0x26d/0x6a0
         [134243.843978]  worker_thread+0x4f/0x3e0
         [134243.844452]  ? process_one_work+0x6a0/0x6a0
         [134243.844981]  kthread+0x103/0x140
         [134243.845400]  ? kthread_create_worker_on_cpu+0x70/0x70
         [134243.846030]  ret_from_fork+0x3a/0x50
         [134243.846494] irq event stamp: 0
         [134243.846892] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
         [134243.847682] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134243.848687] softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134243.849913] softirqs last disabled at (0): [<0000000000000000>] 0x0
         [134243.850698] ---[ end trace bd7c03622e0b0a96 ]---
         [134243.851335] ------------[ cut here ]------------
      
      When relocating a data block group, for each extent allocated in the
      block group we preallocate another extent with the same size for the
      data relocation inode (we do it at prealloc_file_extent_cluster()).
      We reserve space by calling btrfs_check_data_free_space(), which ends
      up incrementing the data space_info's bytes_may_use counter, and
      then call btrfs_prealloc_file_range() to allocate the extent, which
      always decrements the bytes_may_use counter by the same amount.
      
      The expectation is that writeback of the data relocation inode always
      follows a NOCOW path, by writing into the preallocated extents. However,
      when starting writeback we might end up falling back into the COW path,
      because the block group that contains the preallocated extent was turned
      into RO mode by a scrub running in parallel. The COW path then calls the
      extent allocator which ends up calling btrfs_add_reserved_bytes(), and
      this function decrements the bytes_may_use counter of the data space_info
      object by an amount corresponding to the size of the allocated extent,
      even though we haven't previously incremented it. When the counter's
      current value is smaller than the size of the allocated extent, we reset
      the counter to 0 and emit a warning; otherwise we just decrement it and
      slowly corrupt this counter, which is crucial for space reservation. The
      end result can be granting reserved space to tasks when there isn't really
      enough free space, with the tasks then failing in critical places where
      error handling consists of a transaction abort or hitting a BUG_ON().
      
      Fix this by making sure that if we fallback to the COW path for a data
      relocation inode, we increment the bytes_may_use counter of the data
      space_info object. The COW path will then decrement it at
      btrfs_add_reserved_bytes() on success or through its error handling part
      by a call to extent_clear_unlock_delalloc() (which ends up calling
      btrfs_clear_delalloc_extent() that does the decrement operation) in case
      of an error.
      
      Test case btrfs/061 from fstests could sporadically trigger this.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6bd335b4
    • F
      btrfs: fix data block group relocation failure due to concurrent scrub · 432cd2a1
      By Filipe Manana
      When running relocation of a data block group while scrub is running in
      parallel, it is possible that the relocation will fail and abort the
      current transaction with an -EINVAL error:
      
         [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents
         [134243.999871] ------------[ cut here ]------------
         [134244.000741] BTRFS: Transaction aborted (error -22)
         [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs]
         [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
         [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
         [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
         [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs]
         [134244.017151] Code: 48 c7 c7 (...)
         [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286
         [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000
         [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001
         [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001
         [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08
         [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000
         [134244.028024] FS:  00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000
         [134244.029491] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0
         [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [134244.034484] Call Trace:
         [134244.034984]  btrfs_cow_block+0x12b/0x2b0 [btrfs]
         [134244.035859]  do_relocation+0x30b/0x790 [btrfs]
         [134244.036681]  ? do_raw_spin_unlock+0x49/0xc0
         [134244.037460]  ? _raw_spin_unlock+0x29/0x40
         [134244.038235]  relocate_tree_blocks+0x37b/0x730 [btrfs]
         [134244.039245]  relocate_block_group+0x388/0x770 [btrfs]
         [134244.040228]  btrfs_relocate_block_group+0x161/0x2e0 [btrfs]
         [134244.041323]  btrfs_relocate_chunk+0x36/0x110 [btrfs]
         [134244.041345]  btrfs_balance+0xc06/0x1860 [btrfs]
         [134244.043382]  ? btrfs_ioctl_balance+0x27c/0x310 [btrfs]
         [134244.045586]  btrfs_ioctl_balance+0x1ed/0x310 [btrfs]
         [134244.045611]  btrfs_ioctl+0x1880/0x3760 [btrfs]
         [134244.049043]  ? do_raw_spin_unlock+0x49/0xc0
         [134244.049838]  ? _raw_spin_unlock+0x29/0x40
         [134244.050587]  ? __handle_mm_fault+0x11b3/0x14b0
         [134244.051417]  ? ksys_ioctl+0x92/0xb0
         [134244.052070]  ksys_ioctl+0x92/0xb0
         [134244.052701]  ? trace_hardirqs_off_thunk+0x1a/0x1c
         [134244.053511]  __x64_sys_ioctl+0x16/0x20
         [134244.054206]  do_syscall_64+0x5c/0x280
         [134244.054891]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         [134244.055819] RIP: 0033:0x7f29b51c9dd7
         [134244.056491] Code: 00 00 00 (...)
         [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
         [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7
         [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003
         [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000
         [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a
         [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0
         [134244.067626] irq event stamp: 0
         [134244.068202] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
         [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134244.070909] softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0
         [134244.073432] ---[ end trace bd7c03622e0b0a99 ]---
      
      The -EINVAL error comes from the following chain of function calls:
      
        __btrfs_cow_block() <-- aborts the transaction
          btrfs_reloc_cow_block()
            replace_file_extents()
              get_new_location() <-- returns -EINVAL
      
      When relocating a data block group, for each allocated extent of the block
      group, we preallocate another extent (at prealloc_file_extent_cluster()),
      associated with the data relocation inode, and then dirty all its pages.
      These preallocated extents have, and must have, the same size that extents
      from the data block group being relocated have.
      
      Later, before we start the relocation stage that updates pointers (the
      bytenr field of file extent items) to point to the new extents, we trigger
      writeback for the data relocation inode. The expectation is that writeback
      will write the pages to the previously preallocated extents, that it
      follows the NOCOW path. That is generally the case, however, if a scrub
      is running it may have turned the block group that contains those extents
      into RO mode, in which case writeback falls back to the COW path.
      
      However in the COW path instead of allocating exactly one extent with the
      expected size, the allocator may end up allocating several smaller extents
      due to free space fragmentation - because we tell it at cow_file_range()
      that the minimum allocation size can match the filesystem's sector size.
      This later breaks the relocation's expectation that an extent associated
      to a file extent item in the data relocation inode has the same size as
      the respective extent pointed to by a file extent item in another tree - in
      this case the extent the relocation inode points to is smaller, causing
      relocation.c:get_new_location() to return -EINVAL.
      
      For example, if we are relocating a data block group X that has a logical
      address of X and the block group has an extent allocated at the logical
      address X + 128KiB with a size of 64KiB:
      
      1) At prealloc_file_extent_cluster() we allocate an extent for the data
         relocation inode with a size of 64KiB and associate it to the file
         offset 128KiB (X + 128KiB - X) of the data relocation inode. This
         preallocated extent was allocated at block group Z;
      
      2) A scrub running in parallel turns block group Z into RO mode and
         starts scrubbing its extents;
      
      3) Relocation triggers writeback for the data relocation inode;
      
      4) When running delalloc (btrfs_run_delalloc_range()), we try first the
         NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC
         set in its flags. However, because block group Z is in RO mode, the
         NOCOW path (run_delalloc_nocow()) falls back into the COW path, by
         calling cow_file_range();
      
      5) At cow_file_range(), in the first iteration of the while loop we call
         btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum
         allocation size of 4KiB (fs_info->sectorsize). Due to free space
         fragmentation, btrfs_reserve_extent() ends up allocating two extents
         of 32KiB each, each one on a different iteration of that while loop;
      
      6) Writeback of the data relocation inode completes;
      
      7) Relocation proceeds and ends up at relocation.c:replace_file_extents(),
         with a leaf which has a file extent item that points to the data extent
         from block group X, that has a logical address (bytenr) of X + 128KiB
         and a size of 64KiB. Then it calls get_new_location(), which does a
         lookup in the data relocation tree for a file extent item starting at
         offset 128KiB (X + 128KiB - X) and belonging to the data relocation
         inode. It finds a corresponding file extent item, however that item
         points to an extent that has a size of 32KiB, which doesn't match the
         expected size of 64KiB, resulting in -EINVAL being returned from this
         function and propagated up to __btrfs_cow_block(), which aborts the
         current transaction.
      
      To fix this, make sure that at cow_file_range(), when we call the
      allocator, we pass it a minimum allocation size corresponding to the
      desired extent size if the inode belongs to the data relocation tree, and
      otherwise pass it the filesystem's sector size as the minimum allocation
      size.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      432cd2a1
  5. 14 Jun 2020, 1 commit
    • D
      Revert "btrfs: switch to iomap_dio_rw() for dio" · 55e20bd1
      By David Sterba
      This reverts commit a43a67a2.
      
      This patch reverts the main part of switching the direct io
      implementation to the iomap infrastructure. There's a problem with page
      invalidation that couldn't be solved as a regression fix in this
      development cycle.
      
      The problem occurs when buffered and direct io are mixed, and the ranges
      overlap. Although this is not recommended, filesystems implement
      measures or fallbacks to make it somehow work. In this case, fallback to
      buffered IO would be an option for btrfs (this already happens when
      direct io is done on compressed data), but the change would be needed in
      the iomap code, bringing new semantics to other filesystems.
      
      Another problem arises when, again, buffered and direct ios are mixed:
      invalidation fails, then -EIO is set on the mapping and fsync will fail,
      though there's no real error.
      
      There have been discussions about how to fix that, but the revert seems
      to be the least intrusive option.
      
      Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
      Signed-off-by: David Sterba <dsterba@suse.com>
      55e20bd1
  6. 10 Jun 2020, 1 commit