1. 27 7月, 2020 7 次提交
    • N
      btrfs: make btrfs_reloc_clone_csums take btrfs_inode · 7bfa9535
      Nikolay Borisov 提交于
      It really wants btrfs_inode and not a vfs inode.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7bfa9535
    • N
      btrfs: make btrfs_lookup_ordered_extent take btrfs_inode · c3504372
      Nikolay Borisov 提交于
      It doesn't use the generic vfs inode for anything use btrfs_inode
      directly.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c3504372
    • N
      btrfs: make get_extent_allocation_hint take btrfs_inode · 43c69849
      Nikolay Borisov 提交于
      It doesn't use the vfs inode for anything, can just as easily take
      btrfs_inode.  Follow up patches will convert callers as well.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      43c69849
    • Q
      btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak · 7dbeaad0
      Qu Wenruo 提交于
      [BUG]
      The following simple workload from fsstress can lead to qgroup reserved
      data space leak:
        0/0: creat f0 x:0 0 0
        0/0: creat add id=0,parent=-1
        0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
        0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
        0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
      
      This would cause btrfs qgroup to leak 20480 bytes for data reserved
      space.  If btrfs qgroup limit is enabled, such leak can lead to
      unexpected early EDQUOT and unusable space.
      
      [CAUSE]
      When doing direct IO, kernel will try to writeback existing buffered
      page cache, then invalidate them:
        generic_file_direct_write()
        |- filemap_write_and_wait_range();
        |- invalidate_inode_pages2_range();
      
      However for btrfs, the bi_end_io hook doesn't finish all its heavy work
      right after bio ends.  In fact, it delays its work further:
      
        submit_extent_page(end_io_func=end_bio_extent_writepage);
        end_bio_extent_writepage()
        |- btrfs_writepage_endio_finish_ordered()
           |- btrfs_init_work(finish_ordered_fn);
      
        <<< Work queue execution >>>
        finish_ordered_fn()
        |- btrfs_finish_ordered_io();
           |- Clear qgroup bits
      
      This means, when filemap_write_and_wait_range() returns,
      btrfs_finish_ordered_io() is not guaranteed to be executed, thus the
      qgroup bits for related range are not cleared.
      
      Now into how the leak happens, this will only focus on the overlapping
      part of buffered and direct IO part.
      
      1. After buffered write
         The inode had the following range with QGROUP_RESERVED bit:
         	596		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      2. Writeback part for range [596K, 616K)
         Write back finished, but btrfs_finish_ordered_io() not get called
         yet.
         So we still have:
         	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      3. Pages for range [596K, 616K) get released
         This will clear all qgroup bits, but don't update the reserved data
         space.
         So we have:
         	596K		616K
      	|		|
         Qgroup reserved data space: 20K
         That number doesn't match the qgroup bit range anymore.
      
      4. Dio prepare space for range [596K, 700K)
         Qgroup reserved data space for that range, we got:
         	596K		616K			700K
      	|///////////////|///////////////////////|
         Qgroup reserved data space: 20K + 104K = 124K
      
      5. btrfs_finish_ordered_range() gets executed for range [596K, 616K)
         Qgroup free reserved space for that range, we got:
         	596K		616K			700K
      	|		|///////////////////////|
         We need to free that range of reserved space.
         Qgroup reserved data space: 124K - 20K = 104K
      
      6. btrfs_finish_ordered_range() gets executed for range [596K, 700K)
         However qgroup bit for range [596K, 616K) is already cleared in
         previous step, so we only free 84K for qgroup reserved space.
         	596K		616K			700K
      	|		|			|
         We need to free that range of reserved space.
         Qgroup reserved data space: 104K - 84K = 20K
      
         Now there is no way to release that 20K unless disabling qgroup or
         unmounting the fs.
      
      [FIX]
      This patch will change the timing of btrfs_qgroup_release/free_data()
      call.  Here it uses buffered COW write as an example.
      
      	The new timing			|	The old timing
      ----------------------------------------+---------------------------------------
       btrfs_buffered_write()			| btrfs_buffered_write()
       |- btrfs_qgroup_reserve_data() 	| |- btrfs_qgroup_reserve_data()
      					|
       btrfs_run_delalloc_range()		| btrfs_run_delalloc_range()
       |- btrfs_add_ordered_extent()  	|
          |- btrfs_qgroup_release_data()	|
             The reserved is passed into	|
             btrfs_ordered_extent structure	|
      					|
       btrfs_finish_ordered_io()		| btrfs_finish_ordered_io()
       |- The reserved space is passed to 	| |- btrfs_qgroup_release_data()
          btrfs_qgroup_record			|    The resereved space is passed
      					|    to btrfs_qgroup_recrod
      					|
       btrfs_qgroup_account_extents()		| btrfs_qgroup_account_extents()
       |- btrfs_qgroup_free_refroot()		| |- btrfs_qgroup_free_refroot()
      
      The point of such change is to ensure, when ordered extents are
      submitted, the qgroup reserved space is already released, to keep the
      timing aligned with file_write_and_wait_range().
      
      So that qgroup data reserved space is all bound to btrfs_ordered_extent
      and solve the timing mismatch.
      
      Fixes: f695fdce ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
      Suggested-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7dbeaad0
    • Q
      btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent() · 9729f10a
      Qu Wenruo 提交于
      This is to prepare for the incoming timing change of qgroup reserved
      data space and ordered extent.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9729f10a
    • Q
      btrfs: inode: refactor the parameters of insert_reserved_file_extent() · 203f44c5
      Qu Wenruo 提交于
      Function insert_reserved_file_extent() takes a long list of parameters,
      which are all for btrfs_file_extent_item, even including two reserved
      members, encryption and other_encoding.
      
      This makes the parameter list unnecessary long for a function which only
      gets called twice.
      
      This patch will refactor the parameter list, by using
      btrfs_file_extent_item as parameter directly to hugely reduce the number
      of parameters.
      
      Also, since there are only two callers, one in btrfs_finish_ordered_io()
      which inserts file extent for ordered extent, and one
      __btrfs_prealloc_file_range().
      
      These two call sites have completely different context, where ordered
      extent can be compressed, but will always be regular extent, while the
      preallocated one is never going to be compressed and always has PREALLOC
      type.
      
      So use two small wrapper for these two different call sites to improve
      readability.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      203f44c5
    • F
      btrfs: remove the start argument from btrfs_free_reserved_data_space_noquota() · 46d4dac8
      Filipe Manana 提交于
      The start argument for btrfs_free_reserved_data_space_noquota() is only
      used to make sure the amount of bytes we decrement from the bytes_may_use
      counter of the data space_info object is aligned to the filesystem's
      sector size. It serves no other purpose.
      
      All its current callers always pass a length argument that is already
      aligned to the sector size, so we can make the start argument go away.
      In fact its presence makes it impossible to use it in a context where we
      just want to free a number of bytes for a range for which either we do
      not know its start offset or for freeing multiple ranges at once (which
      are not contiguous).
      
      This change is preparatory work for a patch (third patch in this series)
      that makes relocation of data block groups that are not full reserve less
      data space.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      46d4dac8
  2. 22 7月, 2020 1 次提交
    • Q
      btrfs: qgroup: fix data leak caused by race between writeback and truncate · fa91e4aa
      Qu Wenruo 提交于
      [BUG]
      When running tests like generic/013 on test device with btrfs quota
      enabled, it can normally lead to data leak, detected at unmount time:
      
        BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
        ------------[ cut here ]------------
        WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
        RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace caf08beafeca2392 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
      
      [CAUSE]
      In the offending case, the offending operations are:
      2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
      2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0
      
      The following sequence of events could happen after the writev():
      	CPU1 (writeback)		|		CPU2 (truncate)
      -----------------------------------------------------------------
      btrfs_writepages()			|
      |- extent_write_cache_pages()		|
         |- Got page for 1003520		|
         |  1003520 is Dirty, no writeback	|
         |  So (!clear_page_dirty_for_io())   |
         |  gets called for it		|
         |- Now page 1003520 is Clean.	|
         |					| btrfs_setattr()
         |					| |- btrfs_setsize()
         |					|    |- truncate_setsize()
         |					|       New i_size is 18388
         |- __extent_writepage()		|
         |  |- page_offset() > i_size		|
            |- btrfs_invalidatepage()		|
      	 |- Page is clean, so no qgroup |
      	    callback executed
      
      This means, the qgroup reserved data space is not properly released in
      btrfs_invalidatepage() as the page is Clean.
      
      [FIX]
      Instead of checking the dirty bit of a page, call
      btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().
      
      As qgroup rsv are completely bound to the QGROUP_RESERVED bit of
      io_tree, not bound to page status, thus we won't cause double freeing
      anyway.
      
      Fixes: 0b34c261 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fa91e4aa
  3. 09 7月, 2020 1 次提交
    • J
      btrfs: fix double put of block group with nocow · 230ed397
      Josef Bacik 提交于
      While debugging a patch that I wrote I was hitting use-after-free panics
      when accessing block groups on unmount.  This turned out to be because
      in the nocow case if we bail out of doing the nocow for whatever reason
      we need to call btrfs_dec_nocow_writers() if we called the inc.  This
      puts our block group, but a few error cases does
      
      if (nocow) {
          btrfs_dec_nocow_writers();
          goto error;
      }
      
      unfortunately, error is
      
      error:
      	if (nocow)
      		btrfs_dec_nocow_writers();
      
      so we get a double put on our block group.  Fix this by dropping the
      error cases calling of btrfs_dec_nocow_writers(), as it's handled at the
      error label now.
      
      Fixes: 762bf098 ("btrfs: improve error handling in run_delalloc_nocow")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      230ed397
  4. 17 6月, 2020 3 次提交
    • F
      btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof · 4b194628
      Filipe Manana 提交于
      If we attempt to write to prealloc extent located after eof using a
      RWF_NOWAIT write, we always fail with -EAGAIN.
      
      We do actually check if we have an allocated extent for the write at
      the start of btrfs_file_write_iter() through a call to check_can_nocow(),
      but later when we go into the actual direct IO write path we simply
      return -EAGAIN if the write starts at or beyond EOF.
      
      Trivial to reproduce:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ touch /mnt/foo
        $ chattr +C /mnt/foo
      
        $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec)
      
        $ xfs_io -c "falloc -k 64K 1M" /mnt/foo
      
        $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo
        pwrite: Resource temporarily unavailable
      
      On xfs and ext4 the write succeeds, as expected.
      
      Fix this by removing the wrong check at btrfs_direct_IO().
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4b194628
    • F
      btrfs: fix bytes_may_use underflow when running balance and scrub in parallel · 6bd335b4
      Filipe Manana 提交于
      When balance and scrub are running in parallel it is possible to end up
      with an underflow of the bytes_may_use counter of the data space_info
      object, which triggers a warning like the following:
      
         [134243.793196] BTRFS info (device sdc): relocating block group 1104150528 flags data
         [134243.806891] ------------[ cut here ]------------
         [134243.807561] WARNING: CPU: 1 PID: 26884 at fs/btrfs/space-info.h:125 btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
         [134243.808819] Modules linked in: btrfs blake2b_generic xor (...)
         [134243.815779] CPU: 1 PID: 26884 Comm: kworker/u8:8 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
         [134243.816944] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
         [134243.818389] Workqueue: writeback wb_workfn (flush-btrfs-108483)
         [134243.819186] RIP: 0010:btrfs_add_reserved_bytes+0x1da/0x280 [btrfs]
         [134243.819963] Code: 0b f2 85 (...)
         [134243.822271] RSP: 0018:ffffa4160aae7510 EFLAGS: 00010287
         [134243.822929] RAX: 000000000000c000 RBX: ffff96159a8c1000 RCX: 0000000000000000
         [134243.823816] RDX: 0000000000008000 RSI: 0000000000000000 RDI: ffff96158067a810
         [134243.824742] RBP: ffff96158067a800 R08: 0000000000000001 R09: 0000000000000000
         [134243.825636] R10: ffff961501432a40 R11: 0000000000000000 R12: 000000000000c000
         [134243.826532] R13: 0000000000000001 R14: ffffffffffff4000 R15: ffff96158067a810
         [134243.827432] FS:  0000000000000000(0000) GS:ffff9615baa00000(0000) knlGS:0000000000000000
         [134243.828451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [134243.829184] CR2: 000055bd7e414000 CR3: 00000001077be004 CR4: 00000000003606e0
         [134243.830083] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [134243.830975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [134243.831867] Call Trace:
         [134243.832211]  find_free_extent+0x4a0/0x16c0 [btrfs]
         [134243.832846]  btrfs_reserve_extent+0x91/0x180 [btrfs]
         [134243.833487]  cow_file_range+0x12d/0x490 [btrfs]
         [134243.834080]  fallback_to_cow+0x82/0x1b0 [btrfs]
         [134243.834689]  ? release_extent_buffer+0x121/0x170 [btrfs]
         [134243.835370]  run_delalloc_nocow+0x33f/0xa30 [btrfs]
         [134243.836032]  btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
         [134243.836725]  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
         [134243.837450]  writepage_delalloc+0xe8/0x150 [btrfs]
         [134243.838059]  __extent_writepage+0xe8/0x4c0 [btrfs]
         [134243.838674]  extent_write_cache_pages+0x237/0x530 [btrfs]
         [134243.839364]  extent_writepages+0x44/0xa0 [btrfs]
         [134243.839946]  do_writepages+0x23/0x80
         [134243.840401]  __writeback_single_inode+0x59/0x700
         [134243.841006]  writeback_sb_inodes+0x267/0x5f0
         [134243.841548]  __writeback_inodes_wb+0x87/0xe0
         [134243.842091]  wb_writeback+0x382/0x590
         [134243.842574]  ? wb_workfn+0x4a2/0x6c0
         [134243.843030]  wb_workfn+0x4a2/0x6c0
         [134243.843468]  process_one_work+0x26d/0x6a0
         [134243.843978]  worker_thread+0x4f/0x3e0
         [134243.844452]  ? process_one_work+0x6a0/0x6a0
         [134243.844981]  kthread+0x103/0x140
         [134243.845400]  ? kthread_create_worker_on_cpu+0x70/0x70
         [134243.846030]  ret_from_fork+0x3a/0x50
         [134243.846494] irq event stamp: 0
         [134243.846892] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
         [134243.847682] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134243.848687] softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134243.849913] softirqs last disabled at (0): [<0000000000000000>] 0x0
         [134243.850698] ---[ end trace bd7c03622e0b0a96 ]---
         [134243.851335] ------------[ cut here ]------------
      
      When relocating a data block group, for each extent allocated in the
      block group we preallocate another extent with the same size for the
      data relocation inode (we do it at prealloc_file_extent_cluster()).
      We reserve space by calling btrfs_check_data_free_space(), which ends
      up incrementing the data space_info's bytes_may_use counter, and
      then call btrfs_prealloc_file_range() to allocate the extent, which
      always decrements the bytes_may_use counter by the same amount.
      
      The expectation is that writeback of the data relocation inode always
      follows a NOCOW path, by writing into the preallocated extents. However,
      when starting writeback we might end up falling back into the COW path,
      because the block group that contains the preallocated extent was turned
      into RO mode by a scrub running in parallel. The COW path then calls the
      extent allocator which ends up calling btrfs_add_reserved_bytes(), and
      this function decrements the bytes_may_use counter of the data space_info
      object by an amount corresponding to the size of the allocated extent,
      despite we haven't previously incremented it. When the counter currently
      has a value smaller then the allocated extent we reset the counter to 0
      and emit a warning, otherwise we just decrement it and slowly mess up
      with this counter which is crucial for space reservation, the end result
      can be granting reserved space to tasks when there isn't really enough
      free space, and having the tasks fail later in critical places where
      error handling consists of a transaction abort or hitting a BUG_ON().
      
      Fix this by making sure that if we fallback to the COW path for a data
      relocation inode, we increment the bytes_may_use counter of the data
      space_info object. The COW path will then decrement it at
      btrfs_add_reserved_bytes() on success or through its error handling part
      by a call to extent_clear_unlock_delalloc() (which ends up calling
      btrfs_clear_delalloc_extent() that does the decrement operation) in case
      of an error.
      
      Test case btrfs/061 from fstests could sporadically trigger this.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6bd335b4
    • F
      btrfs: fix data block group relocation failure due to concurrent scrub · 432cd2a1
      Filipe Manana 提交于
      When running relocation of a data block group while scrub is running in
      parallel, it is possible that the relocation will fail and abort the
      current transaction with an -EINVAL error:
      
         [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents
         [134243.999871] ------------[ cut here ]------------
         [134244.000741] BTRFS: Transaction aborted (error -22)
         [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs]
         [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
         [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
         [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
         [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs]
         [134244.017151] Code: 48 c7 c7 (...)
         [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286
         [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000
         [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001
         [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001
         [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08
         [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000
         [134244.028024] FS:  00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000
         [134244.029491] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0
         [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [134244.034484] Call Trace:
         [134244.034984]  btrfs_cow_block+0x12b/0x2b0 [btrfs]
         [134244.035859]  do_relocation+0x30b/0x790 [btrfs]
         [134244.036681]  ? do_raw_spin_unlock+0x49/0xc0
         [134244.037460]  ? _raw_spin_unlock+0x29/0x40
         [134244.038235]  relocate_tree_blocks+0x37b/0x730 [btrfs]
         [134244.039245]  relocate_block_group+0x388/0x770 [btrfs]
         [134244.040228]  btrfs_relocate_block_group+0x161/0x2e0 [btrfs]
         [134244.041323]  btrfs_relocate_chunk+0x36/0x110 [btrfs]
         [134244.041345]  btrfs_balance+0xc06/0x1860 [btrfs]
         [134244.043382]  ? btrfs_ioctl_balance+0x27c/0x310 [btrfs]
         [134244.045586]  btrfs_ioctl_balance+0x1ed/0x310 [btrfs]
         [134244.045611]  btrfs_ioctl+0x1880/0x3760 [btrfs]
         [134244.049043]  ? do_raw_spin_unlock+0x49/0xc0
         [134244.049838]  ? _raw_spin_unlock+0x29/0x40
         [134244.050587]  ? __handle_mm_fault+0x11b3/0x14b0
         [134244.051417]  ? ksys_ioctl+0x92/0xb0
         [134244.052070]  ksys_ioctl+0x92/0xb0
         [134244.052701]  ? trace_hardirqs_off_thunk+0x1a/0x1c
         [134244.053511]  __x64_sys_ioctl+0x16/0x20
         [134244.054206]  do_syscall_64+0x5c/0x280
         [134244.054891]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         [134244.055819] RIP: 0033:0x7f29b51c9dd7
         [134244.056491] Code: 00 00 00 (...)
         [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
         [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7
         [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003
         [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000
         [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a
         [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0
         [134244.067626] irq event stamp: 0
         [134244.068202] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
         [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134244.070909] softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
         [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0
         [134244.073432] ---[ end trace bd7c03622e0b0a99 ]---
      
      The -EINVAL error comes from the following chain of function calls:
      
        __btrfs_cow_block() <-- aborts the transaction
          btrfs_reloc_cow_block()
            replace_file_extents()
              get_new_location() <-- returns -EINVAL
      
      When relocating a data block group, for each allocated extent of the block
      group, we preallocate another extent (at prealloc_file_extent_cluster()),
      associated with the data relocation inode, and then dirty all its pages.
      These preallocated extents have, and must have, the same size that extents
      from the data block group being relocated have.
      
      Later before we start the relocation stage that updates pointers (bytenr
      field of file extent items) to point to the the new extents, we trigger
      writeback for the data relocation inode. The expectation is that writeback
      will write the pages to the previously preallocated extents, that it
      follows the NOCOW path. That is generally the case, however, if a scrub
      is running it may have turned the block group that contains those extents
      into RO mode, in which case writeback falls back to the COW path.
      
      However in the COW path instead of allocating exactly one extent with the
      expected size, the allocator may end up allocating several smaller extents
      due to free space fragmentation - because we tell it at cow_file_range()
      that the minimum allocation size can match the filesystem's sector size.
      This later breaks the relocation's expectation that an extent associated
      to a file extent item in the data relocation inode has the same size as
      the respective extent pointed by a file extent item in another tree - in
      this case the extent to which the relocation inode poins to is smaller,
      causing relocation.c:get_new_location() to return -EINVAL.
      
      For example, if we are relocating a data block group X that has a logical
      address of X and the block group has an extent allocated at the logical
      address X + 128KiB with a size of 64KiB:
      
      1) At prealloc_file_extent_cluster() we allocate an extent for the data
         relocation inode with a size of 64KiB and associate it to the file
         offset 128KiB (X + 128KiB - X) of the data relocation inode. This
         preallocated extent was allocated at block group Z;
      
      2) A scrub running in parallel turns block group Z into RO mode and
         starts scrubing its extents;
      
      3) Relocation triggers writeback for the data relocation inode;
      
      4) When running delalloc (btrfs_run_delalloc_range()), we try first the
         NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC
         set in its flags. However, because block group Z is in RO mode, the
         NOCOW path (run_delalloc_nocow()) falls back into the COW path, by
         calling cow_file_range();
      
      5) At cow_file_range(), in the first iteration of the while loop we call
         btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum
         allocation size of 4KiB (fs_info->sectorsize). Due to free space
         fragmentation, btrfs_reserve_extent() ends up allocating two extents
         of 32KiB each, each one on a different iteration of that while loop;
      
      6) Writeback of the data relocation inode completes;
      
      7) Relocation proceeds and ends up at relocation.c:replace_file_extents(),
         with a leaf which has a file extent item that points to the data extent
         from block group X, that has a logical address (bytenr) of X + 128KiB
         and a size of 64KiB. Then it calls get_new_location(), which does a
         lookup in the data relocation tree for a file extent item starting at
         offset 128KiB (X + 128KiB - X) and belonging to the data relocation
         inode. It finds a corresponding file extent item, however that item
         points to an extent that has a size of 32KiB, which doesn't match the
         expected size of 64KiB, resuling in -EINVAL being returned from this
         function and propagated up to __btrfs_cow_block(), which aborts the
         current transaction.
      
      To fix this make sure that at cow_file_range() when we call the allocator
      we pass it a minimum allocation size corresponding the desired extent size
      if the inode belongs to the data relocation tree, otherwise pass it the
      filesystem's sector size as the minimum allocation size.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      432cd2a1
  5. 14 6月, 2020 1 次提交
    • D
      Revert "btrfs: switch to iomap_dio_rw() for dio" · 55e20bd1
      David Sterba 提交于
      This reverts commit a43a67a2.
      
      This patch reverts the main part of switching direct io implementation
      to iomap infrastructure. There's a problem in invalidate page that
      couldn't be solved as regression in this development cycle.
      
      The problem occurs when buffered and direct io are mixed, and the ranges
      overlap. Although this is not recommended, filesystems implement
      measures or fallbacks to make it somehow work. In this case, fallback to
      buffered IO would be an option for btrfs (this already happens when
      direct io is done on compressed data), but the change would be needed in
      the iomap code, bringing new semantics to other filesystems.
      
      Another problem arises when again the buffered and direct ios are mixed,
      invalidation fails, then -EIO is set on the mapping and fsync will fail,
      though there's no real error.
      
      There have been discussions how to fix that, but revert seems to be the
      least intrusive option.
      
      Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/Signed-off-by: NDavid Sterba <dsterba@suse.com>
      55e20bd1
  6. 10 6月, 2020 2 次提交
  7. 04 6月, 2020 2 次提交
  8. 03 6月, 2020 2 次提交
  9. 28 5月, 2020 6 次提交
    • F
      btrfs: fix space_info bytes_may_use underflow during space cache writeout · 2166e5ed
      Filipe Manana 提交于
      We always preallocate a data extent for writing a free space cache, which
      causes writeback to always try the nocow path first, since the free space
      inode has the prealloc bit set in its flags.
      
      However if the block group that contains the data extent for the space
      cache has been turned to RO mode due to a running scrub or balance for
      example, we have to fallback to the cow path. In that case once a new data
      extent is allocated we end up calling btrfs_add_reserved_bytes(), which
      decrements the counter named bytes_may_use from the data space_info object
      with the expection that this counter was previously incremented with the
      same amount (the size of the data extent).
      
      However when we started writeout of the space cache at cache_save_setup(),
      we incremented the value of the bytes_may_use counter through a call to
      btrfs_check_data_free_space() and then decremented it through a call to
      btrfs_prealloc_file_range_trans() immediately after. So when starting the
      writeback if we fallback to cow mode we have to increment the counter
      bytes_may_use of the data space_info again to compensate for the extent
      allocation done by the cow path.
      
      When this issue happens we are incorrectly decrementing the bytes_may_use
      counter and when its current value is smaller then the amount we try to
      subtract we end up with the following warning:
      
       ------------[ cut here ]------------
       WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
       CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       Workqueue: writeback wb_workfn (flush-btrfs-1591)
       RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Code: ff ff 48 (...)
       RSP: 0000:ffffa41608f13660 EFLAGS: 00010287
       RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000
       RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410
       RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000
       R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400
       R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400
       FS:  0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        find_free_extent+0x4a0/0x16c0 [btrfs]
        btrfs_reserve_extent+0x91/0x180 [btrfs]
        cow_file_range+0x12d/0x490 [btrfs]
        btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs]
        ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        writepage_delalloc+0xe8/0x150 [btrfs]
        __extent_writepage+0xe8/0x4c0 [btrfs]
        extent_write_cache_pages+0x237/0x530 [btrfs]
        extent_writepages+0x44/0xa0 [btrfs]
        do_writepages+0x23/0x80
        __writeback_single_inode+0x59/0x700
        writeback_sb_inodes+0x267/0x5f0
        __writeback_inodes_wb+0x87/0xe0
        wb_writeback+0x382/0x590
        ? wb_workfn+0x4a2/0x6c0
        wb_workfn+0x4a2/0x6c0
        process_one_work+0x26d/0x6a0
        worker_thread+0x4f/0x3e0
        ? process_one_work+0x6a0/0x6a0
        kthread+0x103/0x140
        ? kthread_create_worker_on_cpu+0x70/0x70
        ret_from_fork+0x3a/0x50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace bd7c03622e0b0a52 ]---
       ------------[ cut here ]------------
      
      So fix this by incrementing the bytes_may_use counter of the data
      space_info when we fallback to the cow path. If the cow path is successful
      the counter is decremented after extent allocation (by
      btrfs_add_reserved_bytes()), if it fails it ends up being decremented as
      well when clearing the delalloc range (extent_clear_unlock_delalloc()).
      
      This could be triggered sporadically by the test case btrfs/061 from
      fstests.
      
      Fixes: 82d5902d ("Btrfs: Support reading/writing on disk free ino cache")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2166e5ed
    • F
      btrfs: fix space_info bytes_may_use underflow after nocow buffered write · 467dc47e
      Filipe Manana 提交于
      When doing a buffered write we always try to reserve data space for it,
      even when the file has the NOCOW bit set or the write falls into a file
      range covered by a prealloc extent. This is done both because it is
      expensive to check if we can do a nocow write (checking if an extent is
      shared through reflinks or if there's a hole in the range for example),
      and because when writeback starts we might actually need to fallback to
      COW mode (for example the block group containing the target extents was
      turned into RO mode due to a scrub or balance).
      
      When we are unable to reserve data space we check if we can do a nocow
      write, and if we can, we proceed with dirtying the pages and setting up
      the range for delalloc. In this case the bytes_may_use counter of the
      data space_info object is not incremented, unlike in the case where we
      are able to reserve data space (done through btrfs_check_data_free_space()
      which calls btrfs_alloc_data_chunk_ondemand()).
      
      Later when running delalloc we attempt to start writeback in nocow mode
      but we might revert back to cow mode, for example because in the meanwhile
      a block group was turned into RO mode by a scrub or relocation. The cow
      path after successfully allocating an extent ends up calling
      btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
      the data space_info object to have been incremented before - but we did
      not do it when the buffered write started, since there was not enough
      available data space. So btrfs_add_reserved_bytes() ends up decrementing
      the bytes_may_use counter anyway, and when the counter's current value
      is smaller then the size of the allocated extent we get a stack trace
      like the following:
      
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
       CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       Workqueue: writeback wb_workfn (flush-btrfs-1754)
       RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Code: ff ff 48 (...)
       RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
       RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
       RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
       RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
       R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
       R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
       FS:  0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        find_free_extent+0x4a0/0x16c0 [btrfs]
        btrfs_reserve_extent+0x91/0x180 [btrfs]
        cow_file_range+0x12d/0x490 [btrfs]
        run_delalloc_nocow+0x341/0xa40 [btrfs]
        btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
        ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        writepage_delalloc+0xe8/0x150 [btrfs]
        __extent_writepage+0xe8/0x4c0 [btrfs]
        extent_write_cache_pages+0x237/0x530 [btrfs]
        ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
        extent_writepages+0x44/0xa0 [btrfs]
        do_writepages+0x23/0x80
        __writeback_single_inode+0x59/0x700
        writeback_sb_inodes+0x267/0x5f0
        __writeback_inodes_wb+0x87/0xe0
        wb_writeback+0x382/0x590
        ? wb_workfn+0x4a2/0x6c0
        wb_workfn+0x4a2/0x6c0
        process_one_work+0x26d/0x6a0
        worker_thread+0x4f/0x3e0
        ? process_one_work+0x6a0/0x6a0
        kthread+0x103/0x140
        ? kthread_create_worker_on_cpu+0x70/0x70
        ret_from_fork+0x3a/0x50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace f9f6ef8ec4cd8ec9 ]---
      
      So to fix this, when falling back into cow mode check if space was not
      reserved, by testing for the bit EXTENT_NORESERVE in the respective file
      range, and if not, increment the bytes_may_use counter for the data
      space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
      that if the cow path fails it decrements the bytes_may_use counter when
      clearing the delalloc range (through the btrfs_clear_delalloc_extent()
      callback).
      
      Fixes: 7ee9e440 ("Btrfs: check if we can nocow if we don't have data space")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      467dc47e
    • F
      btrfs: fix wrong file range cleanup after an error filling dealloc range · e2c8e92d
      Filipe Manana 提交于
      If an error happens while running dellaloc in COW mode for a range, we can
      end up calling extent_clear_unlock_delalloc() for a range that goes beyond
      our range's end offset by 1 byte, which affects 1 extra page. This results
      in clearing bits and doing page operations (such as a page unlock) outside
      our target range.
      
      Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
      offset, instead of an exclusive end offset, at cow_file_range().
      
      Fixes: a315e68f ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e2c8e92d
    • C
      btrfs: split btrfs_direct_IO to read and write part · d8f3e735
      Christoph Hellwig 提交于
      The read and write versions don't have anything in common except for the
      call to iomap_dio_rw.  So split this function, and merge each half into
      its only caller.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d8f3e735
    • G
      btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK · 5f008163
      Goldwyn Rodrigues 提交于
      Since we now perform direct reads using i_rwsem, we can remove this
      inode flag used to co-ordinate unlocked reads.
      
      The truncate call takes i_rwsem. This means it is correctly synchronized
      with concurrent direct reads.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <jth@kernel.org>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5f008163
    • G
      btrfs: switch to iomap_dio_rw() for dio · a43a67a2
      Goldwyn Rodrigues 提交于
      Switch from __blockdev_direct_IO() to iomap_dio_rw().
      Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
      as iomap_begin() for iomap direct I/O functions. This function
      allocates and locks all the blocks required for the I/O.
      btrfs_submit_direct() is used as the submit_io() hook for direct I/O
      ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore so set it to noop.
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
       - for writes, call __endio_write_update_ordered()
       - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a43a67a2
  10. 25 5月, 2020 15 次提交