1. 06 May 2021, 1 commit
    • btrfs: use memzero_page() instead of open coded kmap pattern · d048b9c2
      Authored by Ira Weiny
      There are many places where kmap/memset/kunmap patterns occur.
      
      Use the newly lifted memzero_page() to eliminate direct uses of kmap and
      leverage the new core function's use of kmap_local_page().
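
      As a minimal sketch (illustrative, not a hunk from the patch), the
      transformation looks like this:

        /* Before: open coded mapping just to zero a range */
        char *kaddr = kmap(page);
        memset(kaddr + offset, 0, len);
        kunmap(page);

        /* After: one helper, which uses kmap_local_page() internally */
        memzero_page(page, offset, len);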
      
      The development of this patch was aided by the following coccinelle
      script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap/memset/kunmap pattern and replace with memset*page calls
      //
      // NOTE: Offsets and other expressions may be more complex than what the script
      // will automatically generate.  Therefore a catchall rule is provided to find
      // the pattern which then must be evaluated by hand.
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      // Comments:
      // Options:
      
      //
      // Then the memset pattern
      //
      @ memset_rule1 @
      expression page, V, L, Off;
      identifier ptr;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      -memset(ptr, 0, L);
      +memzero_page(page, 0, L);
      |
      -memset(ptr + Off, 0, L);
      +memzero_page(page, Off, L);
      |
      -memset(ptr, V, L);
      +memset_page(page, V, 0, L);
      |
      -memset(ptr + Off, V, L);
      +memset_page(page, V, Off, L);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memset_rule1
      @
      identifier memset_rule1.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      //
      // Catch all
      //
      @ memset_rule2 @
      expression page;
      identifier ptr;
      expression GenTo, GenSize, GenValue;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      //
      // Some call sites have complex expressions within the memset/memcpy.
      // The following are catch-alls which need to be evaluated by hand.
      //
      -memset(GenTo, 0, GenSize);
      +memzero_pageExtra(page, GenTo, GenSize);
      |
      -memset(GenTo, GenValue, GenSize);
      +memset_pageExtra(page, GenValue, GenTo, GenSize);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memset_rule2
      @
      identifier memset_rule2.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      // </smpl>
      
      Link: https://lkml.kernel.org/r/20210309212137.2610186-4-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 29 Apr 2021, 1 commit
    • btrfs: fix deadlock when cloning inline extents and using qgroups · f9baa501
      Authored by Filipe Manana
      There are a few exceptional cases where cloning an inline extent needs to
      copy the inline extent data into a page of the destination inode.
      
      When this happens, we end up starting a transaction while having a dirty
      page for the destination inode and while having the range locked in the
      destination's inode iotree too. Because when reserving metadata space
      for a transaction we may need to flush existing delalloc in case there is
      not enough free space, we have a mechanism in place to prevent a deadlock,
      which was introduced in commit 3d45f221 ("btrfs: fix deadlock when
      cloning inline extent and low on free metadata space").
      
      However when using qgroups, a transaction also reserves metadata qgroup
      space, which can also result in flushing delalloc in case there is not
      enough available space at the moment. When this happens we deadlock, since
      flushing delalloc requires locking the file range in the inode's iotree
      and the range was already locked at the very beginning of the clone
      operation, before attempting to start the transaction.
      
      When this issue happens, stack traces like the following are reported:
      
        [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
        [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
        [72747.556271] Call Trace:
        [72747.556273]  __schedule+0x296/0x760
        [72747.556277]  schedule+0x3c/0xa0
        [72747.556279]  io_schedule+0x12/0x40
        [72747.556284]  __lock_page+0x13c/0x280
        [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
        [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
        [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
        [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
        [72747.556362]  ? update_group_capacity+0x25/0x210
        [72747.556366]  ? cpumask_next_and+0x1a/0x20
        [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.556394]  do_writepages+0x41/0xd0
        [72747.556398]  __writeback_single_inode+0x39/0x2a0
        [72747.556403]  writeback_sb_inodes+0x1ea/0x440
        [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
        [72747.556410]  wb_writeback+0x235/0x2b0
        [72747.556414]  ? get_nr_inodes+0x35/0x50
        [72747.556417]  wb_workfn+0x354/0x490
        [72747.556420]  ? newidle_balance+0x2c5/0x3e0
        [72747.556424]  process_one_work+0x1aa/0x340
        [72747.556426]  worker_thread+0x30/0x390
        [72747.556429]  ? create_worker+0x1a0/0x1a0
        [72747.556432]  kthread+0x116/0x130
        [72747.556435]  ? kthread_park+0x80/0x80
        [72747.556438]  ret_from_fork+0x1f/0x30
      
        [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [72747.566961] Call Trace:
        [72747.566964]  __schedule+0x296/0x760
        [72747.566968]  ? finish_wait+0x80/0x80
        [72747.566970]  schedule+0x3c/0xa0
        [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
        [72747.566999]  ? finish_wait+0x80/0x80
        [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
        [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
        [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
        [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
        [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
        [72747.567106]  ? update_load_avg+0x7e/0x5f0
        [72747.567109]  ? enqueue_entity+0xf4/0x6f0
        [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
        [72747.567140]  do_writepages+0x41/0xd0
        [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
        [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
        [72747.567200]  process_one_work+0x1aa/0x340
        [72747.567202]  worker_thread+0x30/0x390
        [72747.567205]  ? create_worker+0x1a0/0x1a0
        [72747.567208]  kthread+0x116/0x130
        [72747.567211]  ? kthread_park+0x80/0x80
        [72747.567214]  ret_from_fork+0x1f/0x30
      
        [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
        [72747.569689] Call Trace:
        [72747.569691]  __schedule+0x296/0x760
        [72747.569694]  schedule+0x3c/0xa0
        [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569725]  ? finish_wait+0x80/0x80
        [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
        [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
        [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
        [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
        [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
        [72747.569836]  ? __kmalloc+0x16a/0x2c0
        [72747.569839]  do_iter_readv_writev+0x160/0x1c0
        [72747.569843]  do_iter_write+0x80/0x1b0
        [72747.569847]  vfs_writev+0x84/0x140
        [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
        [72747.569873]  do_writev+0x65/0x100
        [72747.569876]  do_syscall_64+0x33/0x40
        [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
        [72747.569903] Call Trace:
        [72747.569906]  __schedule+0x296/0x760
        [72747.569909]  schedule+0x3c/0xa0
        [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569940]  ? finish_wait+0x80/0x80
        [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
        [72747.569989]  start_transaction+0x279/0x580 [btrfs]
        [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
        [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
        [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
        [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
        [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
        [72747.570126]  do_clone_file_range+0xed/0x200
        [72747.570131]  vfs_clone_file_range+0x37/0x110
        [72747.570134]  ioctl_file_clone+0x7d/0xb0
        [72747.570137]  do_vfs_ioctl+0x138/0x630
        [72747.570140]  __x64_sys_ioctl+0x62/0xc0
        [72747.570143]  do_syscall_64+0x33/0x40
        [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      So fix this by skipping the delalloc flush during qgroup metadata
      reservation for any inode flagged with BTRFS_INODE_NO_DELALLOC_FLUSH,
      meaning it is currently under one of the special cases of cloning an
      inline extent.
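
      Conceptually, the flush path now bails out early; a sketch (the flag
      comes from commit 3d45f221 mentioned above, the surrounding code is
      illustrative):

        /* Skip flushing delalloc for an inode doing an inline extent
         * clone: it already holds the file range locked and has a
         * dirty page for the destination. */
        if (test_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags))
                return 0;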
      
      The special cases for cloning inline extents were added in kernel 5.7
      by commit 05a5a762 ("Btrfs: implement full reflink support for
      inline extents"), while having qgroup metadata space reservation flushing
      delalloc when low on space was added in kernel 5.9 by commit
      c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get
      -EDQUOT"). So use a "Fixes:" tag for the latter commit to ease stable
      kernel backports.
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 19 Apr 2021, 14 commits
  4. 12 Apr 2021, 1 commit
  5. 19 Mar 2021, 1 commit
  6. 18 Mar 2021, 1 commit
    • btrfs: zoned: remove outdated WARN_ON in direct IO · f3da882e
      Authored by Johannes Thumshirn
      In btrfs_submit_direct() there's a WARN_ON_ONCE() that will trigger if
      we're submitting a DIO write on a zoned filesystem but are not using
      REQ_OP_ZONE_APPEND to submit the IO to the block device.
      
      This is a left over from a previous version where btrfs_dio_iomap_begin()
      didn't use btrfs_use_zone_append() to check for sequential write only
      zones.
      
      It is an oversight from the development phase. In v11 (I think) I added
      08f45559 ("btrfs: zoned: cache if block group is on a
      sequential zone") and forgot to remove the WARN_ON_ONCE() for
      544d24f9 ("btrfs: zoned: enable zone append writing for direct IO").
      
      When developing auto relocation I got hit by the WARN, as block groups
      were relocated to a conventional zone and the dio code called
      btrfs_use_zone_append(), introduced by 08f45559, to check if it can
      use zone append (i.e. if it's a sequential zone) or not, and set the
      appropriate flags for iomap.
      
      I had never hit it in testing before, as I was relying on emulation to
      test the conventional zones code, but this one case wasn't exercised,
      because under emulation fs_info->max_zone_append_size is 0 and the WARN
      doesn't trigger either.
      
      Fixes: 544d24f9 ("btrfs: zoned: enable zone append writing for direct IO")
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 17 Mar 2021, 1 commit
  8. 15 Mar 2021, 2 commits
    • btrfs: fix qgroup data rsv leak caused by falloc failure · a3ee79bd
      Authored by Qu Wenruo
      [BUG]
      When running fsstress with only a falloc workload and a very low qgroup
      limit set, we can get a qgroup data rsv leak at unmount time.
      
       BTRFS warning (device dm-0): qgroup 0/5 has unreleased space, type 0 rsv 20480
       BTRFS error (device dm-0): qgroup reserved space leaked
      
      The minimal reproducer looks like:
      
        #!/bin/bash
        dev=/dev/test/test
        mnt="/mnt/btrfs"
        fsstress=~/xfstests-dev/ltp/fsstress
        runtime=8
      
        workload()
        {
                umount $dev &> /dev/null
                umount $mnt &> /dev/null
                mkfs.btrfs -f $dev > /dev/null
                mount $dev $mnt
      
                btrfs quota en $mnt
                btrfs quota rescan -w $mnt
                btrfs qgroup limit 16m 0/5 $mnt
      
                $fsstress -w -z -f creat=10 -f fallocate=10 -p 2 -n 100 \
        		-d $mnt -v > /tmp/fsstress
      
                umount $mnt
                if dmesg | grep leak ; then
      		echo "!!! FAILED !!!"
        		exit 1
                fi
        }
      
        for (( i=0; i < $runtime; i++)); do
                echo "=== $i/$runtime==="
                workload
        done
      
      Normally it would fail before round 4.
      
      [CAUSE]
      In function insert_prealloc_file_extent(), we first call
      btrfs_qgroup_release_data() to know how many bytes are reserved for
      qgroup data rsv.
      
      Then use that @qgroup_released number to continue our work.
      
      But after we call btrfs_qgroup_release_data(), we should either queue
      @qgroup_released to the delayed refs or free it manually in the error
      path.

      Unfortunately, we lack the error handling to free the released bytes,
      leaking the qgroup data rsv.
      
      None of the error handling outside the function will help at all, as we
      have already released the range: the EXTENT_QGROUP_RESERVED bit in the
      inode io tree is already cleared, thus any btrfs_qgroup_free_data()
      call won't free any data rsv.
      
      [FIX]
      Add a free_qgroup label to manually free the released qgroup data rsv
      in the error path.
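
      A sketch of the resulting error path (shape illustrative, helper per
      the btrfs qgroup API):

        ret = insert_reserved_file_extent(trans, inode, file_offset,
                                          &stack_fi, true, qgroup_released);
        if (ret < 0)
                goto free_qgroup;
        /* ... */
        free_qgroup:
                /* EXTENT_QGROUP_RESERVED is already cleared, so the rsv
                 * must be returned to the qgroup manually. */
                btrfs_qgroup_free_refroot(root->fs_info,
                                          root->root_key.objectid,
                                          qgroup_released,
                                          BTRFS_QGROUP_RSV_DATA);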
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Reported-by: David Sterba <dsterba@suse.cz>
      Fixes: 9729f10a ("btrfs: inode: move qgroup reserved space release to the callers of insert_reserved_file_extent()")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: track qgroup released data in own variable in insert_prealloc_file_extent · fbf48bb0
      Authored by Qu Wenruo
      There is a piece of weird code in insert_prealloc_file_extent(), which
      looks like:
      
      	ret = btrfs_qgroup_release_data(inode, file_offset, len);
      	if (ret < 0)
      		return ERR_PTR(ret);
      	if (trans) {
      		ret = insert_reserved_file_extent(trans, inode,
      						  file_offset, &stack_fi,
      						  true, ret);
      	...
      	}
      	extent_info.is_new_extent = true;
      	extent_info.qgroup_reserved = ret;
      	...
      
      Note how the variable @ret is abused here: if anyone adds code just
      after the btrfs_qgroup_release_data() call, it's super easy to
      overwrite @ret and cause tons of qgroup related bugs.
      
      Fix such abuse by introducing a new variable, @qgroup_released, so that
      we won't reuse the existing variable @ret.
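
      The fixed shape, per the description above (sketch):

        qgroup_released = btrfs_qgroup_release_data(inode, file_offset, len);
        if (qgroup_released < 0)
                return ERR_PTR(qgroup_released);
        if (trans) {
                ret = insert_reserved_file_extent(trans, inode, file_offset,
                                                  &stack_fi, true,
                                                  qgroup_released);
                /* ... */
        }
        extent_info.qgroup_reserved = qgroup_released;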
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 03 Mar 2021, 1 commit
    • btrfs: don't flush from btrfs_delayed_inode_reserve_metadata · 4d14c5cd
      Authored by Nikolay Borisov
      Calling btrfs_qgroup_reserve_meta_prealloc from
      btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
      while holding a transaction and delayed node locks. This is deadlock
      prone. In the past multiple commits:
      
       * ae5e070e ("btrfs: qgroup: don't try to wait flushing if we're
      already holding a transaction")
      
       * 6f23277a ("btrfs: qgroup: don't commit transaction when we already
       hold the handle")
      
      tried to solve various aspects of this, but it was always a
      whack-a-mole game. Unfortunately those 2 fixes don't solve a deadlock
      scenario involving btrfs_delayed_node::mutex. Namely, one thread
      can call btrfs_dirty_inode as a result of reading a file and modifying
      its atime:
      
        PID: 6963   TASK: ffff8c7f3f94c000  CPU: 2   COMMAND: "test"
        #0  __schedule at ffffffffa529e07d
        #1  schedule at ffffffffa529e4ff
        #2  schedule_timeout at ffffffffa52a1bdd
        #3  wait_for_completion at ffffffffa529eeea             <-- sleeps with delayed node mutex held
        #4  start_delalloc_inodes at ffffffffc0380db5
        #5  btrfs_start_delalloc_snapshot at ffffffffc0393836
        #6  try_flush_qgroup at ffffffffc03f04b2
        #7  __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6     <-- tries to reserve space and starts delalloc inodes.
        #8  btrfs_delayed_update_inode at ffffffffc03e31aa      <-- acquires delayed node mutex
        #9  btrfs_update_inode at ffffffffc0385ba8
       #10  btrfs_dirty_inode at ffffffffc038627b               <-- TRANSACTION OPENED
       #11  touch_atime at ffffffffa4cf0000
       #12  generic_file_read_iter at ffffffffa4c1f123
       #13  new_sync_read at ffffffffa4ccdc8a
       #14  vfs_read at ffffffffa4cd0849
       #15  ksys_read at ffffffffa4cd0bd1
       #16  do_syscall_64 at ffffffffa4a052eb
       #17  entry_SYSCALL_64_after_hwframe at ffffffffa540008c
      
      This will cause an asynchronous work to flush the delalloc inodes to
      happen which can try to acquire the same delayed_node mutex:
      
        PID: 455    TASK: ffff8c8085fa4000  CPU: 5   COMMAND: "kworker/u16:30"
        #0  __schedule at ffffffffa529e07d
        #1  schedule at ffffffffa529e4ff
        #2  schedule_preempt_disabled at ffffffffa529e80a
        #3  __mutex_lock at ffffffffa529fdcb                    <-- goes to sleep, never wakes up.
        #4  btrfs_delayed_update_inode at ffffffffc03e3143      <-- tries to acquire the mutex
        #5  btrfs_update_inode at ffffffffc0385ba8              <-- this is the same inode that pid 6963 is holding
        #6  cow_file_range_inline.constprop.78 at ffffffffc0386be7
        #7  cow_file_range at ffffffffc03879c1
        #8  btrfs_run_delalloc_range at ffffffffc038894c
        #9  writepage_delalloc at ffffffffc03a3c8f
       #10  __extent_writepage at ffffffffc03a4c01
       #11  extent_write_cache_pages at ffffffffc03a500b
       #12  extent_writepages at ffffffffc03a6de2
       #13  do_writepages at ffffffffa4c277eb
       #14  __filemap_fdatawrite_range at ffffffffa4c1e5bb
       #15  btrfs_run_delalloc_work at ffffffffc0380987         <-- starts running delayed nodes
       #16  normal_work_helper at ffffffffc03b706c
       #17  process_one_work at ffffffffa4aba4e4
       #18  worker_thread at ffffffffa4aba6fd
       #19  kthread at ffffffffa4ac0a3d
       #20  ret_from_fork at ffffffffa54001ff
      
      To fully address those cases the complete fix is to never issue any
      flushing while holding the transaction or the delayed node lock. This
      patch achieves it by calling qgroup_reserve_meta directly which will
      either succeed without flushing or will fail and return -EDQUOT. In the
      latter case that return value is going to be propagated to
      btrfs_dirty_inode, which will fall back to starting a new transaction.
      That's fine, as the majority of the time we expect the inode to have
      the BTRFS_DELAYED_NODE_INODE_DIRTY flag set, which results in directly
      copying the in-memory state.
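
      A sketch of the non-flushing reservation (the exact signature of
      qgroup_reserve_meta may differ):

        /* Reserve qgroup meta space without any flushing; on failure
         * -EDQUOT propagates to btrfs_dirty_inode(), which falls back
         * to starting a new transaction. */
        ret = qgroup_reserve_meta(root, num_bytes,
                                  BTRFS_QGROUP_RSV_META_PREALLOC, true);
        if (ret < 0)
                return ret;    /* typically -EDQUOT */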
      
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 23 Feb 2021, 3 commits
    • btrfs: fix race between swap file activation and snapshot creation · dd0734f2
      Authored by Filipe Manana
      When creating a snapshot we check if the current number of swap files, in
      the root, is non-zero, and if it is, we error out and warn that we can not
      create the snapshot because there are active swap files.
      
      However this is racy, because when a task starts activation of a swap
      file, another task might have already started snapshot creation and
      seen the counter for the number of swap files as zero. This means
      that after the swap file is activated we may end up with a snapshot of
      the same root successfully created, and therefore when the first write
      to the swap file happens it has to fall back into COW mode, which
      should never happen for active swap files.
      
      Basically what can happen is:
      
      1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
         There it sees that root->nr_swapfiles has a value of 0 so it continues;
      
      2) Task B enters btrfs_swap_activate(). It is not aware that another task
         started snapshot creation but it did not finish yet. It increments
         root->nr_swapfiles from 0 to 1;
      
      3) Task B checks that the file meets all requirements to be an active
         swap file - it has NOCOW set, there are no snapshots for the inode's
         root at the moment, no file holes, no reflinked extents, etc;
      
      4) Task B returns success and now the file is an active swap file;
      
      5) Task A commits the transaction to create the snapshot and finishes.
         The swap file's extents are now shared between the original root and
         the snapshot;
      
      6) A write into an extent of the swap file is attempted - there is a
         snapshot of the file's root, so we fall back to COW mode and therefore
         the physical location of the extent changes on disk.
      
      So fix this by taking the snapshot lock during swap file activation before
      locking the extent range, as that is the order in which we lock these
      during buffered writes.
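
      A sketch of the resulting lock ordering in btrfs_swap_activate()
      (helper names follow the nocow write path; error code illustrative):

        /* Take the snapshot lock first, matching the order used by
         * buffered writes, then lock the extent range. */
        if (!btrfs_drew_try_write_lock(&root->snapshot_lock))
                return -EINVAL;    /* snapshot creation in progress */
        lock_extent_bits(io_tree, 0, isize - 1, &cached_state);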
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix race between writes to swap files and scrub · 195a49ea
      Authored by Filipe Manana
      When we activate a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents from being changed by operations such as balance and device
      replace/resize/remove. We also call can_nocow_extent() there, which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents being
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group RO before it processes it
      and sets it back to RW mode after processing it. That means an attempt
      to write into a swap file extent while scrub is processing the respective
      block group will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups that have extents used
      by active swap files can not be turned into RO mode, therefore making it
      impossible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned to RO due to the existence of extents
      used by swap files, it proceeds to the next block group and logs a warning
      message that mentions the block group was skipped due to active swap
      files - this is the same approach we currently use for balance.
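
      Conceptually, the RO transition then checks a per-block-group counter
      of swap file extents (field name and error code illustrative):

        /* In inc_block_group_ro(): refuse to mark the block group
         * read-only while any active swap file extent lives in it. */
        if (bg->swap_extents) {
                ret = -ETXTBSY;
                goto out;
        }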
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid checking for RO block group twice during nocow writeback · 20903032
      Authored by Filipe Manana
      During the nocow writeback path, we currently iterate the rbtree of block
      groups twice: once for checking if the target block group is RO with the
      call to btrfs_extent_readonly(), and once again for getting a nocow
      reference on the block group with a call to btrfs_inc_nocow_writers().

      Since btrfs_inc_nocow_writers() already returns false when the target
      block group is RO, remove the call to btrfs_extent_readonly(). Not only
      do we avoid searching the block group rbtree twice, it also helps reduce
      contention on the lock that protects it (especially since it is a spin
      lock and not a read-write lock). That may make a noticeable difference
      on very large filesystems, with thousands of allocated block groups.
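
      The nocow path then relies on a single rbtree lookup; a minimal sketch
      (label name illustrative):

        /* btrfs_inc_nocow_writers() already returns false when the
         * block group is read-only, so no separate
         * btrfs_extent_readonly() call is needed before it. */
        if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
                goto out_check;    /* fall back to COW */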
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 09 Feb 2021, 14 commits
    • btrfs: zoned: wait for existing extents before truncating · 24c0a722
      Authored by Naohiro Aota
      When truncating a file, file buffers which have already been allocated
      but not yet written may be truncated. Truncating these buffers could
      cause breakage of a sequential write pattern in a block group if the
      truncated blocks are for example followed by blocks allocated to another
      file. To avoid this problem, always wait for write out of all unwritten
      buffers before proceeding with the truncate execution.
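
      For example, a hedged sketch of the wait (the exact call site and
      range alignment in the patch may differ):

        /* Wait for writeback of allocated-but-unwritten buffers from
         * the new size onwards before truncating. */
        ret = btrfs_wait_ordered_range(inode, newsize, (u64)-1);
        if (ret)
                return ret;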
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce dedicated data write path for zoned filesystems · 42c01100
      Authored by Naohiro Aota
      If more than one IO is issued for one file extent, these IOs can be
      written to separate regions on a device. Since we cannot map one file
      extent to such separate areas on a zoned filesystem, we need to follow
      the "one IO == one ordered extent" rule.

      The normal buffered, uncompressed and not pre-allocated write path (used
      by cow_file_range()) sometimes does not follow this rule. It can write
      just a part of an ordered extent when a specific region to write is
      given, e.g. when it's called from fdatasync().
      
      Introduce a dedicated (uncompressed buffered) data write path for zoned
      filesystems, that will COW the region and write it at once.
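
      A sketch of the dispatch (run_delalloc_zoned() is the helper this
      patch introduces; the argument list is illustrative):

        /* In btrfs_run_delalloc_range(): zoned filesystems take the
         * dedicated path that COWs and writes the region at once. */
        if (btrfs_is_zoned(fs_info))
                return run_delalloc_zoned(inode, locked_page, start, end,
                                          page_started, nr_written);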
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: enable zone append writing for direct IO · 544d24f9
      Authored by Naohiro Aota
      As with buffered IO, enable zone append writing for direct IO when it
      is used on a zoned block device.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: use ZONE_APPEND write for zoned mode · d8e3fb10
      Authored by Naohiro Aota
      Enable zone append writing for zoned mode. When using zone append, a
      bio is issued to the start of a target zone and the device decides to
      place it inside the zone. Upon completion the device reports the actual
      written position back to the host.
      
      Three parts are necessary to enable zone append mode. First, modify the
      bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
      bi_sector to point to the beginning of the zone.
      
      Second, record the returned physical address (and disk/partno) to the
      ordered extent in end_bio_extent_writepage() after the bio has been
      completed. We cannot resolve the physical address to the logical address
      because we can neither take locks nor allocate a buffer in this end_bio
      context. So, we need to record the physical address to resolve it later
      in btrfs_finish_ordered_io().
      
      And finally, rewrite the logical addresses of the extent mapping and
      checksum data according to the physical address using btrfs_rmap_block.
      If the returned address matches the originally allocated address, we can
      skip this rewriting process.
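
      A hedged sketch of the first two parts (variable names illustrative):

        /* Switch the op and aim the bio at the zone start; the device
         * picks the final location and reports it on completion. */
        bio->bi_opf = REQ_OP_ZONE_APPEND | (bio->bi_opf & ~REQ_OP_MASK);
        bio->bi_iter.bi_sector = zone_start_sector;

        /* In the completion path, record where the data landed so that
         * btrfs_finish_ordered_io() can rewrite logical addresses. */
        ordered->physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;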
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: check if bio spans across an ordered extent · cacb2cea
      Authored by Johannes Thumshirn
      To ensure that an ordered extent maps to a contiguous region on disk, we
      need to maintain a "one bio == one ordered extent" rule.

      Ensure that a bio under construction does not span more than one
      ordered extent.
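
      Conceptually (all names illustrative, not the exact patch):

        /* Before adding a page to the bio under construction, start a
         * new bio if the page would cross the ordered extent's end. */
        if (bio_end_offset + PAGE_SIZE >
            ordered->file_offset + ordered->num_bytes)
                submit_bio_and_allocate_new_one();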
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: split ordered extent when bio is sent · d22002fd
      Authored by Naohiro Aota
      For a zone append write, the device decides the location the data is being
      written to. Therefore we cannot ensure that two bios are written
      consecutively on the device. In order to ensure that an ordered extent
      maps to a contiguous region on disk, we need to maintain a "one bio ==
      one ordered extent" rule.
      
      Implement splitting of an ordered extent and extent map on bio submission
      to adhere to the rule.
      
      extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
      corresponding ordered extent so that the ordered extent's region fits into
      one bio and the corresponding device limits.
      
      Several sanity checks need to be done in extract_ordered_extent() e.g.
      
      - We cannot split an already end_bio'd ordered extent because we cannot
        divide ordered->bytes_left for the split ones
      - We do not expect a compressed ordered extent
      - We should not have a checksum list because we omit splitting the
        list. Since the function is called before btrfs_wq_submit_bio() or
        btrfs_csum_one_bio(), this should always be ensured.
      
      We also need to split an extent map by creating a new one. If not,
      unpin_extent_cache() complains about the difference between the start of
      the extent map and the file's logical offset.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing · cfe94440
      Authored by Naohiro Aota
      Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
      devices.
      
      Let btrfs_end_bio() and btrfs_op be aware of it, by mapping
      REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
      bio_op().
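
      The mapping is essentially the following (shape illustrative):

        static inline enum btrfs_map_op btrfs_op(struct bio *bio)
        {
                switch (bio_op(bio)) {
                case REQ_OP_DISCARD:
                        return BTRFS_MAP_DISCARD;
                case REQ_OP_WRITE:
                case REQ_OP_ZONE_APPEND:
                        return BTRFS_MAP_WRITE;
                default:
                        return BTRFS_MAP_READ;
                }
        }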
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_subpage for data inodes · 32443de3
      Authored by Qu Wenruo
      To support subpage sector size, data also needs extra info to track
      which sectors in a page are uptodate/dirty/...

      This patch will make pages for data inodes get the btrfs_subpage
      structure attached, and have it detached when the page is freed.
      
      This patch also slightly changes the timing when
      set_page_extent_mapped() is called to make sure:
      
      - We have page->mapping set
        page->mapping->host is used to grab btrfs_fs_info, thus we can only
        call this function after the page is mapped to an inode.

        One call site attaches pages to an inode manually, thus we have to
        modify the timing of set_page_extent_mapped() a bit.
      
      - As soon as possible, before other operations
        Since memory allocation can fail, we have to do extra error handling.
        Calling set_page_extent_mapped() as soon as possible can simplify the
        error handling for several call sites (see the sketch below).
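
      A minimal sketch of the call pattern (error handling illustrative):

        /* set_page_extent_mapped() now attaches btrfs_subpage and can
         * fail, so call sites must check its return value early. */
        ret = set_page_extent_mapped(page);
        if (ret < 0) {
                unlock_page(page);
                goto out;
        }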
      
      The idea is pretty much the same as iomap_page, but with more bitmaps
      for btrfs specific cases.
      
      Currently the plan is to switch to iomap if iomap can provide sector
      aligned write back (write back only the dirty sectors, not the full
      page; data balance requires this feature).

      So we will stick to the btrfs specific bitmap for now.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK · 6869b0a8
      Authored by Qu Wenruo
      PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
      __process_pages_contig() to let the function know to clear the page
      dirty bit and then set page writeback.
      
      However page writeback and dirty bits are conflicting (at least for
      the sector size == PAGE_SIZE case), which means these two always have
      to be updated together.
      
      This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
      PAGE_START_WRITEBACK.
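
      A sketch of the merged handling in __process_pages_contig():

        /* One flag now drives both transitions, which must always
         * happen together. */
        if (page_ops & PAGE_START_WRITEBACK) {
                clear_page_dirty_for_io(page);
                set_page_writeback(page);
        }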
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: let callers of btrfs_get_io_geometry pass the em · 42034313
      Authored by Michal Rostecki
      Before this change, the btrfs_get_io_geometry() function was calling
      btrfs_get_chunk_map() to get the extent mapping, necessary for
      calculating the I/O geometry. It was using that extent mapping only
      internally and freeing the pointer after its execution.
      
      That resulted in __btrfs_map_block() effectively calling
      btrfs_get_chunk_map() twice: it called btrfs_get_io_geometry()
      first and then called btrfs_get_chunk_map() directly to get the extent
      mapping used by the rest of the function.
      
      Change that to passing the extent mapping to the btrfs_get_io_geometry()
      function as an argument.
      
      This could improve performance in some cases.  For very large
      filesystems, i.e. several thousands of allocated chunks, not only does
      this avoid searching the rbtree twice, saving time, it may also help
      reduce contention on the lock that protects the tree - think of
      writeback starting for multiple inodes, other tasks allocating or
      removing chunks, and anything else that requires access to the rbtree.
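
      A sketch of the resulting call pattern in __btrfs_map_block()
      (signatures illustrative):

        /* One chunk map lookup, reused for both the geometry
         * calculation and the mapping itself. */
        em = btrfs_get_chunk_map(fs_info, logical, *length);
        if (IS_ERR(em))
                return PTR_ERR(em);
        ret = btrfs_get_io_geometry(fs_info, em, op, logical, *length, &geom);
        /* ... map the block using the same em ... */
        free_extent_map(em);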
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Michal Rostecki <mrostecki@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add Filipe's analysis ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidatepage · 951c80f8
      Authored by Qu Wenruo
      Commit dbfdb6d1 ("Btrfs: Search for all ordered extents that could
      span across a page") made btrfs_invalidatepage() search all ordered
      extents.
      
      The offending code looks like this:
      
        again:
      	  start = page_start;
      	  ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
      	  if (ordered) {
      		  end = min(page_end,
      			    ordered->file_offset + ordered->num_bytes - 1);
      
      		  /* Do the cleanup */
      
      		  start = end + 1;
      		  if (start < page_end)
      			  goto again;
      	  }
      
      The behavior is indeed necessary for the incoming subpage support, but
      when it iterates through all the ordered extents, it also resets the
      search range @start.
      
      This means, for the following case, we can double account an ordered
      extent, causing its bytes_left to underflow:
      
      	Page offset
      	0		16K		32K
      	|<--- OE 1  --->|<--- OE 2 ---->|
      
      As the first iteration will find ordered extent (OE) 1, which doesn't
      cover the full page, after the cleanup code we need to retry. But the
      again label will reset @start to page_start, so we get OE 1 again,
      which causes double accounting of OE 1 and makes its bytes_left
      underflow.
      
      This problem can only happen in the subpage case; for the regular
      sectorsize == PAGE_SIZE case, we will always find an OE that ends at or
      after the page end, thus there is no way to trigger the problem.
      
      Fix this by moving the again label to after start = page_start.  There
      will be a more comprehensive rework to convert the open coded loop to a
      proper while loop for subpage support.
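
      A sketch of the corrected flow:

        start = page_start;
        again:
      	  ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
      	  if (ordered) {
      		  /* Do the cleanup */

      		  start = end + 1;
      		  if (start < page_end)
      			  goto again;	/* @start keeps its advanced value */
      	  }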
      
      Fixes: dbfdb6d1 ("Btrfs: Search for all ordered extents that could span across a page")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove wrong comment for can_nocow_extent() · 2965194b
      Authored by Filipe Manana
      The comment for can_nocow_extent() says that the function will flush
      ordered extents, however that never happens and was never true before the
      comment was added in commit e4ecaf90 ("btrfs: add comments for
      btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for
      the function btrfs_check_can_nocow(), which after that commit was renamed
      to check_can_nocow(). So just remove that part of the comment.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix description format of fs_info of btrfs_wait_on_delayed_iputs · 2639631d
      Authored by Nikolay Borisov
      Fixes fs/btrfs/inode.c:3101: warning: Function parameter or member 'fs_info' not described in 'btrfs_wait_on_delayed_iputs'
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rework the order of btrfs_ordered_extent::flags · 3c198fe0
      Authored by Qu Wenruo
      [BUG]
      There is a long-standing bug in the last parameter of
      btrfs_add_ordered_extent(), dating back to commit 771ed689 ("Btrfs:
      Optimize compressed writeback and reads") from 2008.
      
      In that ancient commit btrfs_add_ordered_extent() expects the @type
      parameter to be one of the following:
      
      - BTRFS_ORDERED_REGULAR
      - BTRFS_ORDERED_NOCOW
      - BTRFS_ORDERED_PREALLOC
      - BTRFS_ORDERED_COMPRESSED
      
      But we pass 0 in cow_file_range(), which means BTRFS_ORDERED_IO_DONE.
      
      Ironically, an extra check in __btrfs_add_ordered_extent() won't set
      the bit if we see (type == IO_DONE || type == IO_COMPLETE), which
      avoids any obvious bug.

      But this still leads to regular COW ordered extents having no bit to
      indicate their type in various trace events, rendering the REGULAR bit
      useless.
      
      [FIX]
      Change the following aspects to avoid such problem:
      
      - Reorder btrfs_ordered_extent::flags
        Now the type bits go first (REGULAR/NOCOW/PREALLOC/COMPRESSED), then
        the DIRECT bit, finally extra status bits like IO_DONE/COMPLETE/IOERR.
      
      - Add extra ASSERT() for btrfs_add_ordered_extent_*()
      
      - Remove @type parameter for btrfs_add_ordered_extent_compress()
        As the only valid @type here is BTRFS_ORDERED_COMPRESSED.
      
      - Remove the unnecessary special check for IO_DONE/COMPLETE in
        __btrfs_add_ordered_extent()
        This was just to make the code work; with the extra ASSERT(), only a
        limited set of values can be passed in (see the sketch below).
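
      The resulting bit order looks like this (sketch; names from the list
      above, remaining status bits elided):

        enum {
                /* Type bits first */
                BTRFS_ORDERED_REGULAR,
                BTRFS_ORDERED_NOCOW,
                BTRFS_ORDERED_PREALLOC,
                BTRFS_ORDERED_COMPRESSED,
                /* Then the DIRECT bit */
                BTRFS_ORDERED_DIRECT,
                /* Then extra status bits */
                BTRFS_ORDERED_IO_DONE,
                BTRFS_ORDERED_COMPLETE,
                BTRFS_ORDERED_IOERR,
                /* ... */
        };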
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>