1. 16 May 2022, 2 commits
    • btrfs: remove useless dio wait call when doing fallocate zero range · 831e1ee6
      Filipe Manana authored
      When starting a fallocate zero range operation, before getting the first
      extent map for the range, we make a call to inode_dio_wait().
      
      This logic was needed in the past because direct IO writes within the
      i_size boundary did not take the inode's VFS lock. This was because that
      lock used to be a mutex, then some years ago it was switched to a rw
      semaphore (by commit 9902af79 ("parallel lookups: actual switch to
      rwsem")), and then btrfs was changed to take the VFS inode's lock in
      shared mode for writes that don't cross the i_size boundary (done in
      commit e9adabb9 ("btrfs: use shared lock for direct writes within
      EOF")). The lockless direct IO writes could result in a race with the
      zero range operation, resulting in the latter getting a stale extent
      map for the range.
      
      So remove this no longer needed call to inode_dio_wait(), as fallocate
      takes the inode's VFS lock in exclusive mode and direct IO writes within
      i_size take that same lock in shared mode.
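
      A minimal sketch of the VFS-level locking that now makes the wait
      unnecessary (illustrative only, not the exact btrfs helpers):

        /* fallocate: exclusive inode lock, no direct IO write can be running. */
        inode_lock(inode);
        /* ... zero range / prealloc work, extent maps are stable here ... */
        inode_unlock(inode);

        /* direct IO write within i_size: the same lock, taken shared. */
        inode_lock_shared(inode);
        /* ... submit the direct IO write ... */
        inode_unlock_shared(inode);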
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only reserve the needed data space amount during fallocate · 47e1d1c7
      Filipe Manana authored
      During a plain fallocate, we always start by reserving an amount of data
      space that matches the length of the range passed to fallocate. When we
      already have extents allocated in that range, we may end up trying to
      reserve a lot more data space than we need, which can result in several
      undesired behaviours:
      
      1) We fail with -ENOSPC. For example the passed range has a length
         of 1G, but there's only one hole with a size of 1M in that range;
      
      2) We temporarily reserve excessive data space that could be used by
         other operations happening concurrently;
      
      3) By reserving much more data space than we need, we can end up
         doing expensive things like triggering delalloc for other inodes,
         waiting for the ordered extents to complete, triggering transaction
         commits, allocating new block groups, etc.
      
      Example:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f -b 1g $DEV
        mount $DEV $MNT
      
        # Create a file with a size of 600M and two holes, one at [200M, 201M[
        # and another at [401M, 402M[
        xfs_io -f -c "pwrite -S 0xab 0 200M" \
                  -c "pwrite -S 0xcd 201M 200M" \
                  -c "pwrite -S 0xef 402M 198M" \
                  $MNT/foobar
      
        # Now call fallocate against the whole file range, see if it fails
        # with -ENOSPC or not - it shouldn't since we only need to allocate
        # 2M of data space.
        xfs_io -c "falloc 0 600M" $MNT/foobar
      
        umount $MNT
      
        $ ./test.sh
        (...)
        wrote 209715200/209715200 bytes at offset 0
        200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
        wrote 209715200/209715200 bytes at offset 210763776
        200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
        wrote 207618048/207618048 bytes at offset 421527552
        198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
        fallocate: No space left on device
        $
      
      So fix this by not allocating an amount of data space that matches the
      length of the range passed to fallocate. Instead allocate an amount of
      data space that corresponds to the sum of the sizes of each hole found
      in the range. This reservation now happens after we have locked the file
      range, which is safe since we know at this point there's no delalloc
      in the range because we've taken the inode's VFS lock in exclusive mode,
      we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
      delalloc and waited for all ordered extents in the range to complete.
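
      A simplified sketch of the new reservation logic (condensed; error
      handling and the actual preallocation of each hole are omitted):

        struct extent_map *em;
        u64 data_space_needed = 0;
        u64 cur_offset = start;
        u64 range_len;
        int ret = 0;

        /* Walk the locked range and only account the holes. */
        while (cur_offset < end) {
                em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, cur_offset,
                                      end - cur_offset);
                range_len = min(extent_map_end(em), end) - cur_offset;
                if (em->block_start == EXTENT_MAP_HOLE)
                        data_space_needed += range_len;
                free_extent_map(em);
                cur_offset += range_len;
        }

        /* Reserve only what the holes actually need, not the whole range. */
        if (data_space_needed)
                ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode),
                                                      data_space_needed);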
      
      This type of failure actually seems to happen in practice with systemd,
      and we had at least one report about this in a very long thread which
      is referenced by the Link tag below.
      
      Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 25 March 2022, 1 commit
  3. 14 March 2022, 4 commits
    • btrfs: reset last_reflink_trans after fsyncing inode · 23e3337f
      Filipe Manana authored
      When an inode has a last_reflink_trans matching the current transaction,
      we have to take special care when logging its checksums in order to
      avoid getting checksum items with overlapping ranges in a log tree,
      which could result in missing checksums after log replay (more on that
      in the changelogs of commit 40e046ac ("Btrfs: fix missing data
      checksums after replaying a log tree") and commit e289f03e ("btrfs:
      fix corrupt log due to concurrent fsync of inodes with shared extents")).
      We also need to make sure a full fsync will copy all old file extent
      items it finds in modified leaves, because they might have been copied
      from some other inode.
      
      However once we fsync an inode, we don't need to keep paying the price of
      that extra special care in future fsyncs done in the same transaction,
      unless the inode is used for another reflink operation or the full sync
      flag is set on it (truncate, failure to allocate extent maps for holes,
      and other exceptional and infrequent cases).
      
      So after we fsync an inode, reset its last_reflink_trans to zero. In case
      another reflink happens, we continue to update the last_reflink_trans of
      the inode, just as before. Also set last_reflink_trans to the generation
      of the last transaction that modified the inode whenever we need to set
      the full sync flag on the inode, just like when we need to load an inode
      from disk after eviction.
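
      A minimal sketch of where the reset would happen, assuming it is done
      once the inode has been successfully logged (illustrative placement):

        /*
         * The fsync just copied everything the reflink could have shared into
         * the log tree, so the extra care is not needed again until the inode
         * is used in another reflink operation.
         */
        inode->last_reflink_trans = 0;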
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add BTRFS_IOC_ENCODED_WRITE · 7c0c7269
      Omar Sandoval authored
      The implementation resembles direct I/O: we have to flush any ordered
      extents, invalidate the page cache, and do the io tree/delalloc/extent
      map/ordered extent dance. From there, we can reuse the compression code
      with a minor modification to distinguish the write from writeback. This
      also creates inline extents when possible.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: support different disk extent size for delalloc · 28c9b1e7
      Omar Sandoval authored
      Currently, we always reserve the same extent size in the file and extent
      size on disk for delalloc because the former is the worst case for the
      latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
      the extent on disk, which may be less than or greater than (for
      bookends) the size in the file. Add a disk_num_bytes parameter to
      btrfs_delalloc_reserve_metadata() so that we can reserve the correct
      amount of csum bytes. No functional change.
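
      A rough sketch of the changed signature and how the new argument is used
      (approximated, not copied from the tree):

        int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode,
                                            u64 num_bytes, u64 disk_num_bytes);

        /*
         * Inside, checksum space is sized from the on-disk extent size, since
         * csum items cover disk bytes, not bytes in the file:
         */
        csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes);

        /* Regular delalloc passes the same value twice (no functional change): */
        btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len, len);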
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove constraint on number of visited leaves when replacing extents · 7ecb4c31
      Filipe Manana authored
      At btrfs_drop_extents(), we try to replace a range of file extent items
      with a new file extent in a single btree search, to avoid the need to do
      a search for deletion, followed by a path release and followed by yet
      another search for insertion.
      
      When I originally added that optimization, in commit 1acae57b
      ("Btrfs: faster file extent item replace operations"), I left a constraint
      to do the fast replace only if we visited a single leaf. That was because
      in the most common case we find all file extent items that need to be
      deleted (or trimmed) in a single leaf, however it can work for other
      common cases like when we need to delete a few file extent items located
      at the end of a leaf and a few more located at the beginning of the next
      leaf. The key for the new file extent item is greater than the key of
      any deleted or trimmed file extent item from previous leaves, so we are
      fine to use the last leaf that we found as long as we are holding a
      write lock on it - even if the new key ends up at slot 0, as if that's
      the case, the btree search has obtained a write lock on any upper nodes
      that need to have a key pointer updated.
      
      So remove the constraint that limits the optimization to the case where
      we visited only a single leaf.
      
      This change is part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test results in its
      changelog.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 24 February 2022, 2 commits
    • btrfs: reduce extent threshold for autodefrag · 558732df
      Qu Wenruo authored
      There is a big gap between inode_should_defrag() and the autodefrag
      extent size threshold.  inode_should_defrag() uses a flexible
      @small_write value: for compressed extents it is 16K, and for
      non-compressed extents it is 64K.
      
      However, the autodefrag extent size threshold is always fixed to the
      default value (256K).
      
      This means, the following write sequence will trigger autodefrag to
      defrag ranges which didn't trigger autodefrag:
      
        pwrite 0 8k
        sync
        pwrite 8k 128K
        sync
      
      The latter 128K write will also be considered a defrag target (if
      other conditions are met), while only the 8K write really triggered
      autodefrag.
      
      Such behavior can cause extra IO for autodefrag.
      
      Close the gap by copying the @small_write value into inode_defrag, so
      that later autodefrag can use the same @small_write value which
      triggered it.
      
      With the existing transid value, this allows autodefrag to really scan
      the ranges which triggered autodefrag.
      
      Although this behavior change mostly reduces the extent_thresh value
      for autodefrag, I believe we should eventually allow users to specify
      the autodefrag extent threshold through mount options, but that's
      another problem to consider in the future.
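
      A small sketch of the idea (signatures and member placement
      approximated):

        struct inode_defrag {
                /* ... existing members: rb_node, ino, root, transid ... */

                /* extent size threshold copied from @small_write */
                u32 extent_thresh;
        };

        /* inode_should_defrag() side: remember the threshold that queued us. */
        btrfs_add_inode_defrag(NULL, BTRFS_I(inode), small_write);

        /* autodefrag side: defrag with that same threshold. */
        range.extent_thresh = defrag->extent_thresh;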
      
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: autodefrag: only scan one inode once · 26fbac25
      Qu Wenruo authored
      Although we have btrfs_requeue_inode_defrag(), for autodefrag we are
      still just exhausting all inode_defrag items in the tree.
      
      This means it doesn't make much difference to requeue an inode_defrag,
      other than scanning the inode from the beginning to its end.
      
      Change the behaviour to always scan from offset 0 of an inode, and till
      the end.
      
      By this we get the following benefits:
      
      - Straight-forward code
      
      - No more re-queue related check
      
      - Fewer members in inode_defrag
      
      We still keep the same btrfs_get_fs_root() and btrfs_iget() checks for
      each loop, and add an extra should_auto_defrag() check per loop.
      
      Note: the patch needs to be backported and is intentionally written
      to minimize the diff size, code will be cleaned up later.
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 09 November 2021, 1 commit
    • btrfs: fix deadlock due to page faults during direct IO reads and writes · 51bd9563
      Filipe Manana authored
      If we do a direct IO read or write when the buffer given by the user is
      memory mapped to the file range we are going to do IO, we end up
      in a deadlock. This is triggered by the new test case generic/647 from
      fstests.
      
      For a direct IO read we get a trace like this:
      
        [967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
        [967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
        [967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
        [967.875992] Call Trace:
        [967.875999]  __schedule+0x3ca/0xe10
        [967.876015]  schedule+0x43/0xe0
        [967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
        [967.876109]  ? do_wait_intr_irq+0xb0/0xb0
        [967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
        [967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
        [967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
        [967.876214]  extent_readahead+0x32d/0x530 [btrfs]
        [967.876253]  ? lru_cache_add+0x104/0x220
        [967.876255]  ? kvm_sched_clock_read+0x14/0x40
        [967.876258]  ? sched_clock_cpu+0xd/0x110
        [967.876263]  ? lock_release+0x155/0x4a0
        [967.876271]  read_pages+0x86/0x270
        [967.876274]  ? lru_cache_add+0x125/0x220
        [967.876281]  page_cache_ra_unbounded+0x1a3/0x220
        [967.876291]  filemap_fault+0x626/0xa20
        [967.876303]  __do_fault+0x36/0xf0
        [967.876308]  __handle_mm_fault+0x83f/0x15f0
        [967.876322]  handle_mm_fault+0x9e/0x260
        [967.876327]  __get_user_pages+0x204/0x620
        [967.876332]  ? get_user_pages_unlocked+0x69/0x340
        [967.876340]  get_user_pages_unlocked+0xd3/0x340
        [967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
        [967.876366]  iov_iter_get_pages+0x8d/0x3a0
        [967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
        [967.876379]  ? lock_release+0x155/0x4a0
        [967.876387]  iomap_dio_bio_actor+0x232/0x410
        [967.876396]  iomap_apply+0x12a/0x4a0
        [967.876398]  ? iomap_dio_rw+0x30/0x30
        [967.876414]  __iomap_dio_rw+0x29f/0x5e0
        [967.876415]  ? iomap_dio_rw+0x30/0x30
        [967.876420]  ? lock_acquired+0xf3/0x420
        [967.876429]  iomap_dio_rw+0xa/0x30
        [967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
        [967.876460]  new_sync_read+0x118/0x1a0
        [967.876472]  vfs_read+0x128/0x1b0
        [967.876477]  __x64_sys_pread64+0x90/0xc0
        [967.876483]  do_syscall_64+0x3b/0xc0
        [967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [967.876490] RIP: 0033:0x7fb6f2c038d6
        [967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
        [967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
        [967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
        [967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
        [967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
        [967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
      
      This happens because at btrfs_dio_iomap_begin() we lock the extent range
      and return with it locked - we only unlock in the endio callback, at
      end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
      iomap has called the btrfs_dio_iomap_begin() callback, it triggers the page
      faults that result in reading the pages, through the readahead callback
      btrfs_readahead(), and there we end up attempting to lock the same
      extent range again (or a subrange of what we locked before), resulting in
      the deadlock.
      
      For a direct IO write, the scenario is a bit different, and it results in
      a trace like this:
      
        [1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
        [1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
        [1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
        [1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
        [1330.351906] Call Trace:
        [1330.351913]  __schedule+0x3ca/0xe10
        [1330.351930]  schedule+0x43/0xe0
        [1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
        [1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
        [1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
        [1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
        [1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
        [1330.352133]  ? lru_cache_add+0x104/0x220
        [1330.352135]  ? kvm_sched_clock_read+0x14/0x40
        [1330.352138]  ? sched_clock_cpu+0xd/0x110
        [1330.352143]  ? lock_release+0x155/0x4a0
        [1330.352151]  read_pages+0x86/0x270
        [1330.352155]  ? lru_cache_add+0x125/0x220
        [1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
        [1330.352172]  filemap_fault+0x626/0xa20
        [1330.352176]  ? filemap_map_pages+0x18b/0x660
        [1330.352184]  __do_fault+0x36/0xf0
        [1330.352189]  __handle_mm_fault+0x1253/0x15f0
        [1330.352203]  handle_mm_fault+0x9e/0x260
        [1330.352208]  __get_user_pages+0x204/0x620
        [1330.352212]  ? get_user_pages_unlocked+0x69/0x340
        [1330.352220]  get_user_pages_unlocked+0xd3/0x340
        [1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
        [1330.352246]  iov_iter_get_pages+0x8d/0x3a0
        [1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
        [1330.352259]  ? lock_release+0x155/0x4a0
        [1330.352266]  iomap_dio_bio_actor+0x232/0x410
        [1330.352275]  iomap_apply+0x12a/0x4a0
        [1330.352278]  ? iomap_dio_rw+0x30/0x30
        [1330.352292]  __iomap_dio_rw+0x29f/0x5e0
        [1330.352294]  ? iomap_dio_rw+0x30/0x30
        [1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
        [1330.352339]  new_sync_write+0x11f/0x1b0
        [1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
        [1330.352354]  vfs_write+0x292/0x3c0
        [1330.352359]  __x64_sys_pwrite64+0x90/0xc0
        [1330.352365]  do_syscall_64+0x3b/0xc0
        [1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [1330.352372] RIP: 0033:0x7f4b0a580986
        [1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
        [1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
        [1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
        [1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
        [1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
        [1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      
      Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
      range unlocked, but later when the page faults are triggered and we try
      to read the extents, we end up at btrfs_lock_and_flush_ordered_range() where
      we find the ordered extent for our write, created by the iomap callback
      btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
      deadlock since we can't complete the ordered extent without reading the
      pages (the iomap code only submits the bio after the pages are faulted
      in).
      
      Fix this by setting the nofault attribute of the given iov_iter and retry
      the direct IO read/write if we get an -EFAULT error returned from iomap.
      For reads, also disable page faults completely, this is because when we
      read from a hole or a prealloc extent, we can still trigger page faults
      due to the call to iov_iter_zero() done by iomap - at the moment, it is
      oblivious to the value of the ->nofault attribute of an iov_iter.
      We also need to keep track of the number of bytes written or read, and
      pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
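
      A condensed sketch of the resulting read-side retry loop (simplified from
      the description above; the progress/no-progress bookkeeping is omitted):

        ssize_t read = 0;
        ssize_t ret;

        again:
        /* Page faults here would recurse into our own read path, forbid them. */
        pagefault_disable();
        to->nofault = true;
        ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
                           IOMAP_DIO_PARTIAL, read);
        to->nofault = false;
        pagefault_enable();

        if (ret > 0)
                read = ret;   /* cumulative: includes the done_before bytes */

        if (iov_iter_count(to) > 0 && (ret == -EFAULT || ret > 0)) {
                /* Fault the user pages in ourselves, then resume the request. */
                fault_in_iov_iter_writeable(to, iov_iter_count(to));
                goto again;
        }
        return ret < 0 ? ret : read;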
      
      This depends on the iov_iter and iomap changes introduced in commit
      c03098d4 ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of
      git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2").
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 27 October 2021, 5 commits
    • btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref · f42c5da6
      Nikolay Borisov authored
      In order to make 'real_root' used only in ref-verify it's required to
      have the necessary context to perform the same checks that this member
      is used for. So add 'mod_root' which will contain the root on behalf of
      which a delayed ref was created and a 'skip_group' parameter which
      will contain callsite-specific override of skip_qgroup.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add a BTRFS_FS_ERROR helper · 84961539
      Josef Bacik authored
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
      and ERROR to see if things have gone wrong.  Add a helper to check
      BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
      use the helper.
      
      The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up
      resources during umount after trans is aborted") but is not actually
      specific.
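
      The helper is a one-liner; a sketch of its shape (assumed to match the
      description above):

        /* True once the fs hit a fatal error, e.g. an aborted transaction. */
        #define BTRFS_FS_ERROR(fs_info)                                        \
                (unlikely(test_bit(BTRFS_FS_STATE_ERROR, &(fs_info)->fs_state)))

        /* Call sites then become: */
        if (BTRFS_FS_ERROR(fs_info))
                return -EROFS;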
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: add bitmap for PageChecked flag · e4f94347
      Qu Wenruo authored
      Although in btrfs we have very limited usage of the PageChecked flag, it
      is still a page flag that is not yet subpage compatible.
      
      Fix it by introducing btrfs_subpage::checked_offset to do the conversion.
      
      For most call sites, especially for free-space cache, COW fixup and
      btrfs_invalidatepage(), they all work in full page mode anyway.
      
      For other call sites, they work in subpage compatible mode.
      
      Some call sites need extra modification:
      
      - btrfs_drop_pages()
        Needs extra parameter to get the real range we need to clear checked
        flag.
      
        Also since btrfs_drop_pages() will accept pages beyond the dirtied
        range, update btrfs_subpage_clamp_range() to handle such case
        by setting @len to 0 if the page is beyond target range.
      
      - btrfs_invalidatepage()
        We need to call subpage helper before calling __btrfs_releasepage(),
        or it will trigger ASSERT() as page->private will be cleared.
      
      - btrfs_verify_data_csum()
        In theory we don't need the io_bio->csum check anymore, but it won't
        hurt.  Just change the comment.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unexport setup_items_for_insert() · f0641656
      Filipe Manana authored
      Since setup_items_for_insert() is not used anymore outside of ctree.c,
      make it static and remove its prototype from ctree.h. This also requires
      moving the definition of setup_item_for_insert() from ctree.h to ctree.c
      and move down btrfs_duplicate_item() so that it's defined after
      setup_items_for_insert().
      
      Further, since setup_item_for_insert() is used outside ctree.c, rename it
      to btrfs_setup_item_for_insert().
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 2/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: loop only once over data sizes array when inserting an item batch · b7ef5f3a
      Filipe Manana authored
      When inserting a batch of items into a btree, we end up looping over the
      data sizes array 3 times:
      
      1) Once in the caller of btrfs_insert_empty_items(), when it populates the
         array with the data sizes for each item;
      
      2) Once at btrfs_insert_empty_items() to sum the elements of the data
         sizes array and compute the total data size;
      
      3) And then once again at setup_items_for_insert(), where we do exactly
         the same as what we do at btrfs_insert_empty_items(), to compute the
         total data size.
      
      That is not bad for small arrays, but when the arrays have hundreds of
      elements, the time spent on looping is not negligible. For example when
      doing batch inserts of delayed items for dir index items or when logging
      a directory, it's common to have 200 to 260 dir index items in a single
      batch when using a leaf size of 16K and using file names between 8 and 12
      characters. For a 64K leaf size, multiply that by 4. Taking into account
      that during directory logging or when flushing delayed dir index items we
      can have many of those large batches, the time spent on the looping adds
      up quickly.
      
      It's also more important to avoid it at setup_items_for_insert(), since
      we are holding a write lock on a leaf and, in some cases, on upper nodes
      of the btree, which causes us to block other tasks that want to access
      the leaf and nodes for longer than necessary.
      
      So change the code so that setup_items_for_insert() and
      btrfs_insert_empty_items() no longer compute the total data size, and
      instead rely on the caller to supply it. This makes us loop over the
      array only once, where we can both populate the data size array and
      compute the total data size, taking advantage of spatial and temporal
      locality. To make this more manageable, use a structure to contain
      all the relevant details for a batch of items (keys array, data sizes
      array, total data size, number of items), and use it as an argument
      for btrfs_insert_empty_items() and setup_items_for_insert().
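
      A sketch of such a batch descriptor, as described above (member names
      approximated):

        struct btrfs_item_batch {
                const struct btrfs_key *keys;   /* keys for all items in the batch */
                const u32 *data_sizes;          /* data size of each item */
                u32 total_data_size;            /* sum of data_sizes[], precomputed */
                int nr;                         /* number of items in the batch */
        };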
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 1/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 24 October 2021, 1 commit
    • iomap: Add done_before argument to iomap_dio_rw · 4fdccaa0
      Andreas Gruenbacher authored
      Add a done_before argument to iomap_dio_rw that indicates how much of
      the request has already been transferred.  When the request succeeds, we
      report that done_before additional bytes were transferred.  This is
      useful for finishing a request asynchronously when part of the request
      has already been completed synchronously.
      
      We'll use that to allow iomap_dio_rw to be used with page faults
      disabled: when a page fault occurs while submitting a request, we
      synchronously complete the part of the request that has already been
      submitted.  The caller can then take care of the page fault and call
      iomap_dio_rw again for the rest of the request, passing in the number of
      bytes already transferred.
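
      A condensed sketch of how a caller is expected to use it (names generic,
      assuming the retry logic described above):

        /* First attempt: nothing completed yet. */
        ret = iomap_dio_rw(iocb, iter, ops, dops, IOMAP_DIO_PARTIAL, 0);

        /* ... a page fault interrupted the request after 'done' bytes ... */

        /* Retry for the rest; on success the return value includes 'done'. */
        ret = iomap_dio_rw(iocb, iter, ops, dops, IOMAP_DIO_PARTIAL, done);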
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  8. 18 October 2021, 1 commit
    • iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable · a6294593
      Andreas Gruenbacher authored
      Turn iov_iter_fault_in_readable into a function that returns the number
      of bytes not faulted in, similar to copy_to_user, instead of returning a
      non-zero value when any of the requested pages couldn't be faulted in.
      This supports the existing users that require all pages to be faulted in
      as well as new users that are happy if any pages can be faulted in.
      
      Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
      sure this change doesn't silently break things.
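
      A sketch of how the two kinds of users look after the change (the return
      value is the number of bytes not faulted in):

        /* User that requires the whole buffer to be faulted in. */
        if (fault_in_iov_iter_readable(i, bytes) != 0)
                return -EFAULT;

        /* User that is happy with partial progress. */
        size_t not_faulted = fault_in_iov_iter_readable(i, bytes);
        if (not_faulted == bytes)
                return -EFAULT;          /* nothing could be faulted in */
        bytes -= not_faulted;            /* work with what we got */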
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  9. 08 October 2021, 2 commits
    • btrfs: fix abort logic in btrfs_replace_file_extents · 4afb912f
      Josef Bacik authored
      Error injection testing uncovered a case where we'd end up with a
      corrupt file system with a missing extent in the middle of a file.  This
      occurs because the if statement to decide if we should abort is wrong.
      
      The only way we would abort in this case is if we got a ret !=
      -EOPNOTSUPP and we were called from the file clone code.  However the
      prealloc code uses this path too.  Instead we need to abort if there is
      an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only
      if we came from the clone file code.
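
      A sketch of the corrected decision, using a hypothetical from_clone flag
      to stand in for however the clone caller is identified:

        if (ret) {
                /*
                 * -EOPNOTSUPP coming from the clone path is the only error we
                 * tolerate without aborting; any other error (including this
                 * one on the prealloc path) must abort the transaction.
                 */
                if (!(ret == -EOPNOTSUPP && from_clone))
                        btrfs_abort_transaction(trans, ret);
                break;
        }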
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update refs for any root except tree log roots · d175209b
      Josef Bacik authored
      I hit a stuck relocation on btrfs/061 during my overnight testing.  This
      turned out to be because we had left over extent entries in our extent
      root for a data reloc inode that no longer existed.  This happened
      because in btrfs_drop_extents() we only update refs if we have SHAREABLE
      set or we are the tree_root.  This regression was introduced by
      aeb935a4 ("btrfs: don't set SHAREABLE flag for data reloc tree")
      where we stopped setting SHAREABLE for the data reloc tree.
      
      The problem here is we actually do want to update extent references for
      data extents in the data reloc tree, in fact we only don't want to
      update extent references if the file extents are in the log tree.
      Update this check to only skip updating references in the case of the
      log tree.
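
      The check boils down to a single condition in btrfs_drop_extents(),
      sketched from the description above:

        /* Before: only SHAREABLE roots and the tree root updated extent refs. */
        update_refs = (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) ||
                       root == fs_info->tree_root);

        /* After: every root except a log tree updates extent refs. */
        update_refs = (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID);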
      
      This is relatively rare, because you have to be running scrub at the
      same time, which is what btrfs/061 does.  The data reloc inode has its
      extents pre-allocated, and then we copy the extent into the
      pre-allocated chunks.  We theoretically should never be calling
      btrfs_drop_extents() on a data reloc inode.  The exception of course is
      with scrub, if our pre-allocated extent falls inside of the block group
      we are scrubbing, then the block group will be marked read only and we
      will be forced to cow that extent.  This means we will call
      btrfs_drop_extents() on that range when we COW that file extent.
      
      This isn't really problematic if we do this, the data reloc inode
      requires that our extent lengths match exactly with the extent we are
      copying, thankfully we validate the extent is correct with
      get_new_location(), so if we happen to COW only part of the extent we
      won't link it in when we do the relocation, so we are safe from any
      other shenanigans that arise because of this interaction with scrub.
      
      Fixes: aeb935a4 ("btrfs: don't set SHAREABLE flag for data reloc tree")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 23 August 2021, 3 commits
    • btrfs: initial fsverity support · 14605409
      Boris Burkov authored
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verity pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then rollback to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: fix a potential use-after-free in writeback helper · 7c11d0ae
      Qu Wenruo authored
      [BUG]
      There is a possible use-after-free bug when running generic/095.
      
       BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
       Faulting instruction address: 0xc000000000283654
       c000000000283078 do_raw_spin_unlock+0x88/0x230
       c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
       c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
       c0000000009e0458 end_bio_extent_writepage+0x158/0x270
       c000000000b6fd14 bio_endio+0x254/0x270
       c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
       c000000000b6fd14 bio_endio+0x254/0x270
       c000000000b781fc blk_update_request+0x46c/0x670
       c000000000b8b394 blk_mq_end_request+0x34/0x1d0
       c000000000d82d1c lo_complete_rq+0x11c/0x140
       c000000000b880a4 blk_complete_reqs+0x84/0xb0
       c0000000012b2ca4 __do_softirq+0x334/0x680
       c0000000001dd878 irq_exit+0x148/0x1d0
       c000000000016f4c do_IRQ+0x20c/0x240
       c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
      
      [CAUSE]
      There is a very small race window like the following in generic/095.
      
      	Thread 1		|		Thread 2
      --------------------------------+------------------------------------
        end_bio_extent_writepage()	| btrfs_releasepage()
        |- spin_lock_irqsave()	| |
        |- end_page_writeback()	| |
        |				| |- if (PageWriteback() ||...)
        |				| |- clear_page_extent_mapped()
        |				|    |- kfree(subpage);
        |- spin_unlock_irqrestore().
      
      The race can also happen between writeback and btrfs_invalidatepage(),
      although that would be much harder as btrfs_invalidatepage() has much
      more work to do before the clear_page_extent_mapped() call.
      
      [FIX]
      Here we "wait" for the subpage spinlock to be released before we detach
      the subpage structure.
      So this patch will introduce a new function, wait_subpage_spinlock(), to
      do the "wait" by acquiring the spinlock and releasing it.
      
      Since the caller has ensured the page is neither dirty nor under
      writeback, and the page is already locked, the only way to hold the
      subpage spinlock is from an endio function.
      Thus we only need to acquire the spinlock to wait for any existing
      holder.
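
      The "wait" is just a lock/unlock pair on the subpage spinlock; a sketch
      of the helper (simplified, the subpage-enabled checks are omitted):

        static void wait_subpage_spinlock(struct page *page)
        {
                struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;

                /*
                 * Taking the lock forces us to wait until any endio handler
                 * that still holds it has released it; only then is it safe
                 * to detach and free the subpage structure.
                 */
                spin_lock_irq(&subpage->lock);
                spin_unlock_irq(&subpage->lock);
        }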
      Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage() · e0467866
      Qu Wenruo authored
      [BUG]
      When running generic/095, there is a high chance to crash with subpage
      data RW support:
      
       assertion failed: PagePrivate(page) && page->private
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.h:3403!
       Internal error: Oops - BUG: 0 [#1] SMP
       CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
       Hardware name: Khadas VIM3 (DT)
       Call trace:
        assertfail.constprop.0+0x28/0x2c [btrfs]
        btrfs_subpage_assert+0x80/0xa0 [btrfs]
        btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
        btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
        btrfs_dirty_pages+0x160/0x270 [btrfs]
        btrfs_buffered_write+0x444/0x630 [btrfs]
        btrfs_direct_write+0x1cc/0x2d0 [btrfs]
        btrfs_file_write_iter+0xc0/0x160 [btrfs]
        new_sync_write+0xe8/0x180
        vfs_write+0x1b4/0x210
        ksys_pwrite64+0x7c/0xc0
        __arm64_sys_pwrite64+0x24/0x30
        el0_svc_common.constprop.0+0x70/0x140
        do_el0_svc+0x28/0x90
        el0_svc+0x2c/0x54
        el0_sync_handler+0x1a8/0x1ac
        el0_sync+0x170/0x180
       Code: f0000160 913be042 913c4000 955444bc (d4210000)
       ---[ end trace 3fdd39f4cccedd68 ]---
      
      [CAUSE]
      Although prepare_pages() calls find_or_create_page(), which returns the
      page locked, in later prepare_uptodate_page() calls we may call
      btrfs_readpage(), which will unlock the page before it returns.

      This leaves a window where btrfs_releasepage() can sneak in and release
      the page, clearing page->private and causing the above ASSERT().
      
      [FIX]
      In prepare_uptodate_page(), we should not only check page->mapping, but
      also PagePrivate() to ensure we are still holding the correct page which
      has proper fs context setup.
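
      A sketch of the strengthened check after the read (simplified):

        if (!PageUptodate(page)) {
                ret = btrfs_readpage(NULL, page);       /* may unlock the page */
                if (ret)
                        return ret;
                lock_page(page);
                /*
                 * The page may have been released and re-instantiated while it
                 * was unlocked; page->mapping alone is not enough, we must also
                 * require page->private to still be attached.
                 */
                if (page->mapping != inode->i_mapping || !PagePrivate(page)) {
                        unlock_page(page);
                        return -EAGAIN;   /* caller retries with a fresh page */
                }
        }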
      Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 21 June 2021, 4 commits
    • btrfs: eliminate insert label in add_falloc_range · 77d25534
      Nikolay Borisov authored
      By way of inverting the list_empty conditional the insert label can be
      eliminated, making the function's flow entirely linear.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() · 0528476b
      Qu Wenruo authored
      [BUG]
      With current subpage RW support, the following script can hang the fs
      with 64K page size.
      
       # mkfs.btrfs -f -s 4k $dev
       # mount $dev -o nospace_cache $mnt
       # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt
      
      The kernel will do an infinite loop in btrfs_punch_hole_lock_range().
      
      [CAUSE]
      In btrfs_punch_hole_lock_range() we:
      
      - Truncate page cache range
      - Lock extent io tree
      - Wait for any ordered extents in the range.
      
      We only exit the loop when we meet all the following conditions:
      
      - No ordered extent in the lock range
      - No page is in the lock range
      
      The latter condition has a pitfall: it only works for the sector size ==
      PAGE_SIZE case.

      It can't handle the following subpage case:
      
        0       32K     64K     96K     128K
        |       |///////||//////|       ||
      
      lockstart=32K
      lockend=96K - 1
      
      In this case, although the range crosses 2 pages,
      truncate_pagecache_range() will invalidate no page at all, but only zero
      the [32K, 96K) range of the two pages.
      
      Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
      will never meet the loop exit condition.
      
      [FIX]
      Fix the problem by doing page alignment for the lock range.
      
      Function filemap_range_has_page() has already handled the lend < lstart
      case, so we only need to round up @lockstart and round down @lockend for
      truncate_pagecache_range().

      This modification should not change anything for the sector size ==
      PAGE_SIZE case, as in that case our range is already page aligned.
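
      A condensed sketch of the resulting loop-exit check, with the range
      aligned inward to whole pages (details simplified):

        u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
        u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;

        /*
         * Only fully covered pages are dropped by truncate_pagecache_range();
         * partially covered ones are merely zeroed, so check just the page
         * aligned part of the range.  filemap_range_has_page() already copes
         * with page_lockend < page_lockstart.
         */
        if (!filemap_range_has_page(inode->i_mapping, page_lockstart,
                                    page_lockend))
                break;   /* no more pages in the locked range */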
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_dirty_pages() to be subpage compatible · f02a85d2
      Qu Wenruo authored
      Since the extent io tree operations in btrfs_dirty_pages() are already
      subpage compatible, we only need to make the page status update use
      subpage helpers.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use list_last_entry in add_falloc_range · ec87b42f
      Nikolay Borisov authored
      Instead of calling list_entry with head->prev simply call
      list_last_entry which makes it obvious which member of the list is
      being referred to. This allows removing the extra 'prev' pointer.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 10 June 2021, 1 commit
  13. 04 June 2021, 1 commit
  14. 29 April 2021, 1 commit
    • btrfs: fix race leading to unpersisted data and metadata on fsync · 626e9f41
      Filipe Manana authored
      When doing a fast fsync on a file, there is a race which can result in the
      fsync returning success to user space without logging the inode and without
      durably persisting new data.
      
      The following example shows one possible scenario for this:
      
         $ mkfs.btrfs -f /dev/sdc
         $ mount /dev/sdc /mnt
      
         $ touch /mnt/bar
         $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz
      
         # Now we have:
         # file bar == inode 257
         # file baz == inode 258
      
         $ mv /mnt/baz /mnt/foo
      
         # Now we have:
         # file bar == inode 257
         # file foo == inode 258
      
         $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
      
         # fsync bar before foo, it is important to trigger the race.
         $ xfs_io -c "fsync" /mnt/bar
         $ xfs_io -c "fsync" /mnt/foo
      
         # After this:
         # inode 257, file bar, is empty
         # inode 258, file foo, has 1M filled with 0xcd
      
         <power failure>
      
         # Replay the log:
         $ mount /dev/sdc /mnt
      
         # After this point file foo should have 1M filled with 0xcd and not 0xab
      
      The following steps explain how the race happens:
      
      1) Before the first fsync of inode 258, when it has the "baz" name, its
         ->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
         The inode also has the full sync flag set;
      
      2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
         the generation of the current transaction, and set ->last_log_commit
         to 0, which is the current value of ->last_sub_trans (done at
         btrfs_log_inode()).
      
         The full sync flag is cleared from the inode during the fsync.
      
         The log sub transaction that was committed had an ID of 0 and when we
         synced the log, at btrfs_sync_log(), we incremented root->log_transid
         from 0 to 1;
      
      3) During the rename:
      
         We update inode 258, through btrfs_update_inode(), and that causes its
         ->last_sub_trans to be set to 1 (the current log transaction ID), and
         ->last_log_commit remains with a value of 0.
      
         After updating inode 258, because we have previously logged the inode
         in the previous fsync, we log again the inode through the call to
         btrfs_log_new_name(). This results in updating the inode's
         ->last_log_commit from 0 to 1 (the current value of its
         ->last_sub_trans).
      
         The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
         the next log transaction;
      
      4) Then a buffered write against inode 258 is made. This leaves the value
         of ->last_sub_trans as 1 (the ID of the current log transaction, stored
         at root->log_transid);
      
      5) Then an fsync against inode 257 (or any other inode other than 258),
         happens. This results in committing the log transaction with ID 1,
         which results in updating root->last_log_commit to 1 and bumping
         root->log_transid from 1 to 2;
      
      6) Then an fsync against inode 258 starts. We flush delalloc and wait only
         for writeback to complete, since the full sync flag is not set in the
         inode's runtime flags - we do not wait for ordered extents to complete.
      
         Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
         ordered extent completes. The call returns true:
      
           static inline bool btrfs_inode_in_log(...)
           {
               bool ret = false;
      
               spin_lock(&inode->lock);
               if (inode->logged_trans == generation &&
                   inode->last_sub_trans <= inode->last_log_commit &&
                   inode->last_sub_trans <= inode->root->last_log_commit)
                       ret = true;
               spin_unlock(&inode->lock);
               return ret;
           }
      
         generation has a value of 6 (fs_info->generation), ->logged_trans also
         has a value of 6 (set when we logged the inode during the first fsync
         and when logging it during the rename), ->last_sub_trans has a value
         of 1, set during the rename (step 3), ->last_log_commit also has a
         value of 1 (set in step 3) and root->last_log_commit has a value of 1,
         which was set in step 5 when fsyncing inode 257.
      
         As a consequence we don't log the inode or any new extents, and do not
         sync the log, resulting in data loss if a power failure happens
         after the fsync and before the current transaction commits.
         Also, because we do not log the inode, after a power failure the mtime
         and ctime of the inode do not match those we had before.
      
         When the ordered extent completes before we call btrfs_inode_in_log(),
         then the call returns false and we log the inode and sync the log,
         since at the end of ordered extent completion we update the inode and
         set ->last_sub_trans to 2 (the value of root->log_transid) and
         ->last_log_commit to 1.
      
      This problem is found after removing the check for the emptiness of the
      inode's list of modified extents in the recent commit 209ecbb8
      ("btrfs: remove stale comment and logic from btrfs_inode_in_log()"),
      added in the 5.13 merge window. However checking the emptiness of the
      list is not really the way to solve this problem, and was never intended
      to, because while that solves the problem for COW writes, the problem
      persists for NOCOW writes because in that case the list is always empty.
      
      In the case of NOCOW writes, even though we wait for the writeback to
      complete before returning from btrfs_sync_file(), we end up not logging
      the inode, which has a new mtime/ctime, and because we don't sync the log,
      we never issue disk barriers (send REQ_PREFLUSH to the device) since that
      only happens when we sync the log (when we write super blocks at
      btrfs_sync_log()). So effectively, for a NOCOW case, when we return from
      btrfs_sync_file() to user space, we are not guaranteeing that the data is
      durably persisted on disk.
      
      Also, while the example above uses a rename exchange to show how the
      problem happens, it is not the only way to trigger it. An alternative
      could be adding a new hard link to inode 258, since that also results
      in calling btrfs_log_new_name() and updating the inode in the log.
      An example reproducer using the addition of a hard link instead of a
      rename operation:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ touch /mnt/bar
        $ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/foo
      
        $ ln /mnt/foo /mnt/foo_link
        $ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
      
        $ xfs_io -c "fsync" /mnt/bar
        $ xfs_io -c "fsync" /mnt/foo
      
        <power failure>
      
        # Replay the log:
        $ mount /dev/sdc /mnt
      
        # After this point file foo often has 1M filled with 0xab and not 0xcd
      
      The reasons leading to the final fsync of file foo, inode 258, not
      persisting the new data are the same as for the previous example with
      a rename operation.
      
      So fix by never skipping logging and log syncing when there are still any
      ordered extents in flight. To avoid making the conditional if statement
      that checks if logging an inode is needed harder to read, place all the
      logic into a helper function with separate if statements to make it more
      manageable and easier to read.
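
      A sketch of the shape of such a helper (the exact conditions are in the
      patch; this only illustrates the structure):

        static bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
        {
                struct btrfs_inode *inode = BTRFS_I(ctx->inode);
                struct btrfs_fs_info *fs_info = inode->root->fs_info;

                /* Already fully in the log and no ordered extents in flight. */
                if (btrfs_inode_in_log(inode, fs_info->generation) &&
                    list_empty(&ctx->ordered_extents))
                        return true;

                /*
                 * Nothing changed since the last committed transaction and no
                 * ordered extents are pending either.
                 */
                if (inode->last_trans <= fs_info->last_trans_committed &&
                    list_empty(&ctx->ordered_extents))
                        return true;

                return false;
        }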
      
      A test case for fstests will follow soon.
      
      For NOCOW writes, the problem existed before commit b5e6c3e1
      ("btrfs: always wait on ordered extents at fsync time"), introduced in
      kernel 4.19, then it went away with that commit since we started to always
      wait for ordered extent completion before logging.
      
      The problem came back again once the fast fsync path was changed again to
      avoid waiting for ordered extent completion, in commit 48778179
      ("btrfs: make fast fsyncs wait only for writeback"), added in kernel 5.10.
      
      However, for COW writes, the race only happens after the recent
      commit 209ecbb8 ("btrfs: remove stale comment and logic from
      btrfs_inode_in_log()"), introduced in the 5.13 merge window. For NOCOW
      writes, the bug existed before that commit. So tag 5.10+ as the release
      for stable backports.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  15. 19 April 2021, 8 commits
    • btrfs: fix a potential hole punching failure · 3227788c
      BingJing Chang authored
      In commit d7781546 ("btrfs: Avoid trucating page or punching hole
      in a already existed hole."), existing holes can be skipped by calling
      find_first_non_hole() to adjust start and len. However, if the given len
      is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
      not be set to zero because (em->start + em->len) is less than
      (start + len). Then the ret will be 1 but len will not be set to 0.
      The propagated non-zero ret will result in fallocate failure.
      
      In the while-loop of btrfs_replace_file_extents(), len is not updated
      every time before it calls find_first_non_hole(). That is, after
      btrfs_drop_extents() successfully drops the last non-hole file extent,
      it may fail with ENOSPC when attempting to drop a file extent item
      representing a hole. The problem can then happen: after it calls
      find_first_non_hole(), cur_offset will be adjusted to be larger
      than or equal to end. However, since the len is not set to zero, the
      break-loop condition (ret && !len) will not be met. After it leaves the
      while-loop, fallocate will return 1, which is an unexpected return
      value.
      
      We're not able to construct a reproducible way to let
      btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
      file extent but with remaining holes left. However, it's quite easy to
      fix. We just need to update and check the len every time before we call
      find_first_non_hole(). To make the while loop more readable, we also
      pull the variable updates to the bottom of the loop like this:
        while (cur_offset < end) {
      	  ...
      	  // update cur_offset & len
      	  // advance cur_offset & len in hole-punching case if needed
        }
      Reported-by: Robbie Ko <robbieko@synology.com>
      Fixes: d7781546 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update outdated comment at btrfs_replace_file_extents() · e2b84217
      Filipe Manana authored
      There is a comment at btrfs_replace_file_extents() that mentions that we
      set the full sync flag on an inode when cloning into a file with a size
      greater than or equal to 16MiB, through try_release_extent_mapping() when
      we truncate the page cache after replacing file extents during a clone
      operation.
      
      That is not true anymore since commit 5e548b32 ("btrfs: do not set
      the full sync flag on the inode during page release"), so update the
      comment to remove that part and rephrase it slightly to make it clearer
      why the full sync flag is set at btrfs_replace_file_extents().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e2b84217
    • F
      btrfs: fix race between marking inode needs to be logged and log syncing · bc0939fc
      Authored by Filipe Manana
      We have a race between marking that an inode needs to be logged, either
      at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and
      btrfs_sync_log() running. The following steps describe how the race happens.
      
      1) We are at transaction N;
      
      2) Inode I was previously fsynced in the current transaction so it has:
      
          inode->logged_trans set to N;
      
      3) The inode's root currently has:
      
         root->log_transid set to 1
         root->last_log_commit set to 0
      
         Which means only one log transaction was committed so far, log
         transaction 0. When a log tree is created we set ->log_transid and
         ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
      
      4) One more range of pages is dirtied in inode I;
      
      5) Some task A starts an fsync against some other inode J (same root), and
         so it joins log transaction 1.
      
         Before task A calls btrfs_sync_log()...
      
      6) Task B starts an fsync against inode I, which currently has the full
         sync flag set, so it starts delalloc and waits for the ordered extent
         to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
      
      7) During ordered extent completion we have btrfs_update_inode() called
         against inode I, which in turn calls btrfs_set_inode_last_trans(),
         which does the following:
      
           spin_lock(&inode->lock);
           inode->last_trans = trans->transaction->transid;
           inode->last_sub_trans = inode->root->log_transid;
           inode->last_log_commit = inode->root->last_log_commit;
           spin_unlock(&inode->lock);
      
         So ->last_trans is set to N and ->last_sub_trans set to 1.
         But before setting ->last_log_commit...
      
      8) Task A is at btrfs_sync_log():
      
         - it increments root->log_transid to 2
         - starts writeback for all log tree extent buffers
         - waits for the writeback to complete
         - writes the super blocks
         - updates root->last_log_commit to 1
      
         There are a lot of slow steps between updating root->log_transid and
         root->last_log_commit;
      
      9) The task doing the ordered extent completion, currently at
         btrfs_set_inode_last_trans(), then finally runs:
      
           inode->last_log_commit = inode->root->last_log_commit;
           spin_unlock(&inode->lock);
      
         Which results in inode->last_log_commit being set to 1.
         The ordered extent completes;
      
      10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
          true because we have all the following conditions met:
      
          inode->logged_trans == N which matches fs_info->generation &&
          inode->last_sub_trans (1) <= inode->last_log_commit (1) &&
          inode->last_sub_trans (1) <= root->last_log_commit (1) &&
          list inode->extent_tree.modified_extents is empty
      
          And as a consequence we return without logging the inode, so the
          existing logged version of the inode does not point to the extent
          that was written after the previous fsync.
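
      For reference, below is a standalone model of the check from step 10,
      using stand-in types and the field names quoted above (the real
      btrfs_inode_in_log() lives in the kernel and differs in detail).
      Plugging in the values from this race shows why the fsync is skipped:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct mock_root  { uint64_t last_log_commit; };
        struct mock_inode {
            uint64_t logged_trans;
            uint64_t last_sub_trans;
            uint64_t last_log_commit;
            bool modified_extents_empty; /* stands in for the list check */
            struct mock_root *root;
        };

        static bool inode_in_log(const struct mock_inode *inode, uint64_t gen)
        {
            return inode->logged_trans == gen &&
                   inode->last_sub_trans <= inode->last_log_commit &&
                   inode->last_sub_trans <= inode->root->last_log_commit &&
                   inode->modified_extents_empty;
        }

        int main(void)
        {
            /* Values from the race above: N = 100, stale last_log_commit = 1. */
            struct mock_root root = { .last_log_commit = 1 };
            struct mock_inode inode = {
                .logged_trans = 100, .last_sub_trans = 1,
                .last_log_commit = 1, .modified_extents_empty = true,
                .root = &root,
            };

            /* Prints 1: the inode looks logged, so the new extent is skipped. */
            printf("%d\n", inode_in_log(&inode, 100));
            return 0;
        }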
      
      It should be impossible in practice for one task to be able to make so
      much progress in btrfs_sync_log() while another task is at
      btrfs_set_inode_last_trans() right after it reads root->log_transid and
      before it reads root->last_log_commit. Even if kernel preemption is
      enabled, we know the task at btrfs_set_inode_last_trans() cannot be
      preempted because it is holding the inode's spinlock.
      
      However there is another place where we do the same without holding the
      spinlock, which is in the memory mapped write path at:
      
        vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
        {
           (...)
           BTRFS_I(inode)->last_trans = fs_info->generation;
           BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
           BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
           (...)
      
      So with preemption happening after setting ->last_sub_trans and before
      setting ->last_log_commit, it is less of a stretch to have another task
      make enough progress at btrfs_sync_log() such that the task doing the memory
      mapped write ends up with ->last_sub_trans and ->last_log_commit set to
      the same value. It is still a big stretch to get there, as the task doing
      btrfs_sync_log() has to start writeback, wait for its completion and write
      the super blocks.
      
      So fix this in two different ways:
      
      1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
         value of ->last_sub_trans minus 1;
      
      2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
         like we do for buffered and direct writes at btrfs_file_write_iter(),
         which is all we need to make sure multiple writes and fsyncs to an
         inode in the same transaction never result in an fsync missing that
         the inode changed and needs to be logged. Turn this into a helper
         function and use it both at btrfs_page_mkwrite() and at
         btrfs_file_write_iter() - this also fixes the problem that at
         btrfs_page_mkwrite() we were setting those fields without the
         protection of the inode's spinlock.
      
      This is an extremely unlikely race in practice.
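
      A rough sketch of what the two fixes amount to, using simplified
      stand-in types (the real setters live in the btrfs headers, hold the
      inode's spinlock around the updates, and the helper name used here is
      only an approximation):

        #include <stdint.h>
        #include <stdio.h>

        struct mock_root  { uint64_t log_transid; };
        struct mock_inode {
            uint64_t last_trans;
            uint64_t last_sub_trans;
            uint64_t last_log_commit;
            struct mock_root *root;
        };

        /* Fix 1: derive last_log_commit from last_sub_trans instead of
         * reading root->last_log_commit, so a concurrent btrfs_sync_log()
         * can no longer make the inode look as if it was already logged. */
        static void set_inode_last_trans(struct mock_inode *inode, uint64_t transid)
        {
            /* spin_lock(&inode->lock) is held here in the real code */
            inode->last_trans = transid;
            inode->last_sub_trans = inode->root->log_transid;
            inode->last_log_commit = inode->last_sub_trans - 1;
        }

        /* Fix 2: a helper for btrfs_page_mkwrite() and btrfs_file_write_iter()
         * that only bumps last_sub_trans (again under the inode's spinlock in
         * the real code). */
        static void set_inode_last_sub_trans(struct mock_inode *inode)
        {
            inode->last_sub_trans = inode->root->log_transid;
        }

        int main(void)
        {
            struct mock_root root = { .log_transid = 2 };
            struct mock_inode inode = { .root = &root };

            set_inode_last_trans(&inode, 100);
            set_inode_last_sub_trans(&inode);
            printf("last_sub_trans=%llu last_log_commit=%llu\n",
                   (unsigned long long)inode.last_sub_trans,
                   (unsigned long long)inode.last_log_commit);
            return 0;
        }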
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc0939fc
    • F
      btrfs: fix race between memory mapped writes and fsync · 885f46d8
      Authored by Filipe Manana
      When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
      any new delalloc that might have been created before taking the lock and
      then wait either for the ordered extents to complete or just for the
      writeback to complete (depending on whether the full sync flag is set or
      not). We then start logging the inode and assume that while we are doing it
      no one else is touching the inode's file extent items (or adding new ones).
      
      That is generally true because all operations that modify an inode acquire
      the inode's lock first, including buffered and direct IO writes. However,
      there is one exception: memory mapped writes, which do not and cannot
      acquire the inode's lock.
      
      This can cause two types of issues: ending up logging file extent items
      with overlapping ranges, which is detected by the tree checker and will
      result in aborting the transaction when starting writeback for a log
      tree's extent buffers, or a silent corruption where we log a version of
      the file that never existed.
      
      Scenario 1 - logging overlapping extents
      
      The following steps explain how we can end up with file extents items with
      overlapping ranges in a log tree due to a race between a fsync and memory
      mapped writes:
      
      1) Task A starts an fsync on inode X, which has the full sync runtime flag
         set. First it starts by flushing all delalloc for the inode;
      
      2) Task A then locks the inode and flushes any other delalloc that might
         have been created after the previous flush and waits for all ordered
         extents to complete;
      
      3) In the inode's root we have the following leaf:
      
         Leaf N, generation == current transaction id:
      
         ---------------------------------------------------------
         | (...)  [ file extent item, offset 640K, length 128K ] |
         ---------------------------------------------------------
      
         The last file extent item in leaf N covers the file range from 640K to
         768K;
      
      4) Task B does a memory mapped write for the page corresponding to the
         file range from 764K to 768K;
      
      5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
         btrfs_search_forward() to search for leafs modified in the current
         transaction that contain items for the inode. It finds leaf N and copies
         all the inode items from that leaf into the log tree.
      
         Now the log tree has a copy of the last file extent item from leaf N.
      
         At the end of the while loop at copy_inode_items_to_log(), we have the
         minimum key set to:
      
         min_key.objectid = <inode X number>
         min_key.type = BTRFS_EXTENT_DATA_KEY
         min_key.offset = 640K
      
         Then we increment the key's offset by 1 so that the next call to
         btrfs_search_forward() leaves us at the first key greater than the key
         we just processed.
      
         But before btrfs_search_forward() is called again...
      
      6) Delalloc for the page at offset 764K, dirtied by task B, is started.
         It can be started for several reasons:
      
           - The async reclaim task is attempting to satisfy metadata or data
             reservation requests, and it has reached a point where it decided
             to flush delalloc;
           - Due to memory pressure the VMM triggers writeback of dirty pages;
           - The system call sync_file_range(2) is called from user space.
      
      7) When the respective ordered extent completes, it trims the length of
         the existing file extent item for file offset 640K from 128K to 124K,
         and a new file extent item is added with a key offset of 764K and a
         length of 4K;
      
      8) Task A calls btrfs_search_forward(), which returns us a path pointing
         to the leaf (which can be leaf N or some other leaf) containing the new
         file extent item for file offset 764K.
      
         We end up copying this item to the log tree, which overlaps with the
         last copied file extent item, which covers the file range from 640K to
         768K.
      
         When writeback is triggered for the log tree's extent buffers, the
         issue will be detected by the tree checker, which will dump a trace
         and an error message on dmesg/syslog. If the writeback is triggered
         when syncing the log, which it typically is, then we also end up
         aborting the current transaction.
      
      This is the same type of problem fixed in 0c713cba ("Btrfs: fix race
      between ranged fsync and writeback of adjacent ranges").
      
      Scenario 2 - logging a version of the file that never existed
      
      This scenario only happens when using the NO_HOLES feature and results in
      a silent corruption, in the sense that it is not detectable by 'btrfs check'
      or the tree checker:
      
      1) We have an inode I with a size of 1M and two file extent items, one
         covering an extent with disk_bytenr == X for the file range [0, 512K)
         and another one covering another extent with disk_bytenr == Y for the
         file range [512K, 1M);
      
      2) A hole is punched for the file range [512K, 1M);
      
      3) Task A starts an fsync of inode I, which has the full sync runtime flag
         set. It starts by flushing all existing delalloc, locks the inode (VFS
         lock), starts any new delalloc that might have been created before
         taking the lock and waits for all ordered extents to complete;
      
      4) Some other task does a memory mapped write for the page corresponding to
         the file range [640K, 644K) for example;
      
      5) Task A then logs all items of the inode with the call to
         copy_inode_items_to_log();
      
      6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
         be started for several reasons:
      
           - The async reclaim task is attempting to satisfy metadata or data
             reservation requests, and it has reached a point where it decided
             to flush delalloc;
           - Due to memory pressure the VMM triggers writeback of dirty pages;
           - The system call sync_file_range(2) is called from user space.
      
      7) The ordered extent for the range [640K, 644K) completes and a file
         extent item for that range is added to the subvolume tree, pointing
         to a 4K extent with a disk_bytenr == Z;
      
      8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
         the subvolume tree. It finds two implicit holes:
      
         - one for the file range [512K, 640K)
         - one for the file range [644K, 1M)
      
         As a result we end up neither logging a hole for the range [640K, 644K)
         nor logging the file extent item with a disk_bytenr == Z.
         This means that if we have a power failure and replay the log tree we
         end up getting the following file extent layout:
      
         [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Y ]    [  hole  ]
         0             512K  512K      640K  640K           644K  644K     1M
      
         Which does not correspond to any layout the file ever had before
         the power failure. The only two valid layouts would be:
      
         [ disk_bytenr X ]    [   hole   ]
         0             512K  512K        1M
      
         and
      
         [ disk_bytenr X ]    [   hole   ]    [ disk_bytenr Z ]    [  hole  ]
         0             512K  512K      640K  640K           644K  644K     1M
      
      This can be fixed by serializing memory mapped writes with fsync, and there
      are two ways to do it:
      
      1) Make an fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
         in the inode's io tree. This prevents the race but also blocks any reads
         during the duration of the fsync, which has a negative impact on many
         common workloads;
      
      2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
         semaphore was recently added by Josef's patch set:
      
         btrfs: add a i_mmap_lock to our inode
         btrfs: cleanup inode_lock/inode_unlock uses
         btrfs: exclude mmaps while doing remap
         btrfs: exclude mmap from happening during all fallocate operations
      
         and is used to solve races between memory mapped writes and
         clone/dedupe/fallocate. This also makes us have the same behaviour we
         have regarding other writes (buffered and direct IO) and fsync - block
         them while the inode logging is in progress.
      
      This change uses the second approach due to the performance impact of the
      first one.
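
      A minimal userspace model of the chosen approach, assuming (as in
      Josef's series listed above) that the page fault path takes i_mmap_lock
      in shared mode while fsync takes it exclusively; pthread rwlocks stand
      in for the kernel rw semaphore:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_rwlock_t i_mmap_lock = PTHREAD_RWLOCK_INITIALIZER;

        static void page_mkwrite(void)
        {
            pthread_rwlock_rdlock(&i_mmap_lock); /* faulting tasks, shared */
            /* ... dirty the page ... */
            pthread_rwlock_unlock(&i_mmap_lock);
        }

        static void fsync_log_inode(void)
        {
            pthread_rwlock_wrlock(&i_mmap_lock); /* excludes concurrent mkwrite */
            /* ... flush delalloc, wait for ordered extents, log the inode ... */
            pthread_rwlock_unlock(&i_mmap_lock);
        }

        int main(void)
        {
            page_mkwrite();
            fsync_log_inode(); /* no mmap write can slip in while logging */
            puts("ok");
            return 0;
        }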
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      885f46d8
    • J
      btrfs: exclude mmap from happening during all fallocate operations · 8d9b4a16
      Authored by Josef Bacik
      There's a small window where a deadlock can happen between fallocate and
      mmap.  This is described in detail by Filipe:
      
      """
      When doing a fallocate operation we lock the inode, flush delalloc within
      the target range, wait for any ordered extents to complete and then lock
      the file range. Before we lock the range and after we flush delalloc,
      there is a time window where another task can come in and do a memory
      mapped write for a page within the fallocate range.
      
      This means that after fallocate locks the range, there can be a dirty page
      in the range. More often than not, this does not cause any problem.
      The exception is when we are low on available metadata space, because an
      fallocate operation needs to start a transaction while holding the file
      range locked, either through btrfs_prealloc_file_range() or through the
      call to btrfs_fallocate_update_isize(). If that's the case, we can end up
      in a deadlock. The following list of steps explains how that happens:
      
      1) A fallocate operation starts, locks the inode, flushes delalloc in the
         range and waits for ordered extents in the range to complete;
      
      2) Before the fallocate task locks the file range, another task does a
         memory mapped write for a page in the fallocate target range. This is
         possible since memory mapped writes do not (and can not) lock the
         inode;
      
      3) The fallocate task locks the file range. At this point there is one
         dirty page in the range (due to the memory mapped write);
      
      4) When the fallocate task attempts to start a transaction, it blocks when
         attempting to reserve metadata space, since we are low on available
         metadata space. Before blocking (wait on its reservation ticket), it
         starts the async reclaim task (if not running already);
      
      5) The async reclaim task is not able to release space through any other
         means, so it decides to flush delalloc for inodes with dirty pages.
         It finds that the inode used in the fallocate operation has a dirty
         page and therefore queues a job (fs_info->flush_workers workqueue) to
         flush delalloc for that inode and waits on that job to complete;
      
      6) The flush job blocks when attempting to lock the file range because
         it is currently locked by the fallocate task;
      
      7) The fallocate task keeps waiting for its metadata reservation, waiting
         for a wakeup on its reservation ticket. The async reclaim task is
         waiting on the flush job, which in turn is waiting for locking the file
         range that is currently locked by the fallocate task. So unless some
         other task is able to release enough metadata space, for example an
         ordered extent for some other inode completes, we end up in a deadlock
         between all these tasks.
      
      When this happens stack traces like the following show up in dmesg/syslog:
      
       INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
       Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        schedule+0x45/0xe0
        lock_extent_bits+0x1e6/0x2d0 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_invalidatepage+0x32c/0x390 [btrfs]
        ? __mod_memcg_state+0x8e/0x160
        __extent_writepage+0x2d4/0x400 [btrfs]
        extent_write_cache_pages+0x2b2/0x500 [btrfs]
        ? lock_release+0x20e/0x4c0
        ? trace_hardirqs_on+0x1b/0xf0
        extent_writepages+0x43/0x90 [btrfs]
        ? lock_acquire+0x1a3/0x490
        do_writepages+0x43/0xe0
        ? __filemap_fdatawrite_range+0xa4/0x100
        __filemap_fdatawrite_range+0xc5/0x100
        btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        btrfs_work_helper+0xf1/0x600 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x50/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
       INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
       Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? kvm_clock_read+0x14/0x30
        ? wait_for_completion+0x81/0x110
        schedule+0x45/0xe0
        schedule_timeout+0x30c/0x580
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        ? lock_acquire+0x1a3/0x490
        ? try_to_wake_up+0x7a/0xa20
        ? lock_release+0x20e/0x4c0
        ? lock_acquired+0x199/0x490
        ? wait_for_completion+0x81/0x110
        wait_for_completion+0xab/0x110
        start_delalloc_inodes+0x2af/0x390 [btrfs]
        btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        flush_space+0x24f/0x660 [btrfs]
        btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x20f/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
      (...)
      several tasks waiting for the inode lock held by the fallocate task below
      (...)
       RIP: 0033:0x7f61efe73fff
       Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
       RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
       RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
       RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
       RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
       R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
       R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
       task:fdm-stress        state:D stack:    0 pid:2508243 ppid:2508153 flags:0x00000000
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        schedule+0x45/0xe0
        __reserve_bytes+0x4a4/0xb10 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        start_transaction+0x2d1/0x760 [btrfs]
        btrfs_replace_file_extents+0x120/0x930 [btrfs]
        ? btrfs_fallocate+0xdcf/0x1260 [btrfs]
        btrfs_fallocate+0xdfb/0x1260 [btrfs]
        ? filename_lookup+0xf1/0x180
        vfs_fallocate+0x14f/0x440
        ioctl_preallocate+0x92/0xc0
        do_vfs_ioctl+0x66b/0x750
        ? __do_sys_newfstat+0x53/0x60
        __x64_sys_ioctl+0x62/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      """
      
      Fix this by disallowing mmaps from happening while we're doing any of
      the fallocate operations on this inode.
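
      Below is a self-contained outline of the resulting fallocate ordering;
      the function names are illustrative stubs, not the real btrfs helpers,
      and only the ordering mirrors the fix described above:

        #include <stdio.h>

        static void lock_inode(void)           { puts("inode lock (exclusive)"); }
        static void lock_mmap_writes_out(void) { puts("mmap lock: no page fault can dirty the range now"); }
        static void flush_and_wait(void)       { puts("flush delalloc, wait for ordered extents"); }
        static void lock_file_range(void)      { puts("lock file range"); }
        static void alloc_in_transaction(void) { puts("start transaction, preallocate extents"); }
        static void unlock_everything(void)    { puts("unlock range, mmap lock, inode lock"); }

        int main(void)
        {
            lock_inode();
            lock_mmap_writes_out(); /* the new step added by this change */
            flush_and_wait();
            lock_file_range();      /* no dirty page can exist in the range */
            alloc_in_transaction(); /* blocking on metadata space is now safe */
            unlock_everything();
            return 0;
        }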
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d9b4a16
    • J
      btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers · 64708539
      Authored by Josef Bacik
      In a few places we intermix btrfs_inode_lock with an inode_unlock, and in
      some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.
      
      None of these places are using this incorrectly, but as we adjust some
      of these callers it would be nice to keep everything consistent, so
      convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.
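
      For context, a rough userspace model of what such flag-based helpers
      look like; the flag names mirror the kernel's BTRFS_ILOCK_SHARED and
      BTRFS_ILOCK_MMAP, but the struct and lock primitives are stand-ins,
      not the actual btrfs implementation:

        #include <pthread.h>

        #define ILOCK_SHARED (1U << 0)
        #define ILOCK_MMAP   (1U << 1)

        struct mock_inode {
            pthread_rwlock_t vfs_lock;    /* stands in for the VFS inode rwsem */
            pthread_rwlock_t i_mmap_lock; /* stands in for btrfs' i_mmap_lock */
        };

        static void mock_inode_lock(struct mock_inode *inode, unsigned int flags)
        {
            if (flags & ILOCK_SHARED)
                pthread_rwlock_rdlock(&inode->vfs_lock);
            else
                pthread_rwlock_wrlock(&inode->vfs_lock);
            if (flags & ILOCK_MMAP)
                pthread_rwlock_wrlock(&inode->i_mmap_lock);
        }

        static void mock_inode_unlock(struct mock_inode *inode, unsigned int flags)
        {
            if (flags & ILOCK_MMAP)
                pthread_rwlock_unlock(&inode->i_mmap_lock);
            pthread_rwlock_unlock(&inode->vfs_lock);
        }

        int main(void)
        {
            struct mock_inode inode;

            pthread_rwlock_init(&inode.vfs_lock, NULL);
            pthread_rwlock_init(&inode.i_mmap_lock, NULL);
            mock_inode_lock(&inode, ILOCK_MMAP);  /* e.g. fallocate or fsync */
            mock_inode_unlock(&inode, ILOCK_MMAP);
            return 0;
        }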
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      64708539
  16. 02 March 2021, 1 commit
  17. 25 February 2021, 1 commit
  18. 09 February 2021, 1 commit
    • N
      btrfs: zoned: use ZONE_APPEND write for zoned mode · d8e3fb10
      Authored by Naohiro Aota
      Enable zone append writing for zoned mode. When using zone append, a
      bio is issued to the start of a target zone and the device decides to
      place it inside the zone. Upon completion the device reports the actual
      written position back to the host.
      
      Three parts are necessary to enable zone append mode. First, modify the
      bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
      bi_sector to point to the beginning of the zone.
      
      Second, record the returned physical address (and disk/partno) to the
      ordered extent in end_bio_extent_writepage() after the bio has been
      completed. We cannot resolve the physical address to the logical address
      because we can neither take locks nor allocate a buffer in this end_bio
      context. So, we need to record the physical address to resolve it later
      in btrfs_finish_ordered_io().
      
      And finally, rewrite the logical addresses of the extent mapping and
      checksum data according to the physical address using btrfs_rmap_block.
      If the returned address matches the originally allocated address, we can
      skip this rewriting process.
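
      Below is a standalone model of that flow; the structures and the
      'device' are mocks, and only the sequence (issue at the zone start,
      record the reported position on completion, rewrite addresses only if
      the device wrote elsewhere) mirrors the description above:

        #include <stdio.h>
        #include <stdint.h>

        struct mock_bio {
            int      zone_append;  /* stands in for REQ_OP_ZONE_APPEND */
            uint64_t sector;       /* submit: zone start; complete: actual position */
        };

        static uint64_t zone_start = 4096, zone_write_pointer = 4096 + 512;

        /* The 'device' places the data at its write pointer and reports it back. */
        static void mock_submit(struct mock_bio *bio)
        {
            bio->sector = zone_write_pointer;
            zone_write_pointer += 8;
        }

        int main(void)
        {
            uint64_t allocated = zone_start;  /* address the allocator handed out */
            struct mock_bio bio = { .zone_append = 1, .sector = zone_start };
            uint64_t physical;

            mock_submit(&bio);
            /* Step 2: record the returned position in the 'ordered extent'. */
            physical = bio.sector;
            /* Step 3: rewrite extent map / csum logical addresses if needed. */
            if (physical != allocated)
                printf("rewrite addresses to physical %llu\n",
                       (unsigned long long)physical);
            else
                printf("written at the allocated address, nothing to rewrite\n");
            return 0;
        }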
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d8e3fb10