1. 25 Mar 2020, 1 commit
  2. 24 Mar 2020, 12 commits
    • btrfs: kill the subvol_srcu · c75e8394
      By Josef Bacik
      Now that we have proper root ref counting everywhere we can kill the
      subvol_srcu.
      
      * removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes
      
      * the refcount_t used for the references checks for accidental 0->1
        in cases where the root lifetime would not be properly protected
      
      * there's a leak detector for roots to catch unfreed roots at umount
        time
      
      * SRCU served us well over the years but it was not a proper
        synchronization mechanism for some cases
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      c75e8394
    • btrfs: make ranged full fsyncs more efficient · 0a8068a3
      By Filipe Manana
      Commit 0c713cba ("Btrfs: fix race between ranged fsync and writeback
      of adjacent ranges") fixed a bug where we could end up with file extent
      items in a log tree that represent file ranges that overlap due to a race
      between the hole detection of a ranged full fsync and writeback for a
      different file range.
      
      The problem was solved by forcing any ranged full fsync to become a
      non-ranged full fsync - setting the range start to 0 and the end offset to
      LLONG_MAX. This was a simple solution because the code that detected and
      marked holes was very complex, it used to be done at copy_items() and
      implied several searches on the fs/subvolume tree. The drawback of that
      solution was that we started to flush delalloc for the entire file and
      wait for all the ordered extents to complete for ranged full fsyncs
      (including ordered extents covering ranges completely outside the given
      range). Fortunately ranged full fsyncs are not the most common case
      (hopefully for most workloads).
      
      However, a later fix for detecting and marking holes was made by
      commit 0e56315c ("Btrfs: fix missing hole after hole punching and
      fsync when using NO_HOLES") and it greatly simplified the detection
      of holes: copy_items() no longer does it, and it is now done in a
      much simpler way at btrfs_log_holes().
      
      This now makes it possible for the code that detects holes to operate
      only on the initial range instead of the whole file, while also
      avoiding the need to flush delalloc for the entire file and wait for
      ordered extents that cover ranges that don't overlap the given range.
      
      Special care must also be taken to skip file extent items that fall
      entirely outside the fsync range when copying inode items from the
      fs/subvolume tree into the log tree - this avoids races with ordered
      extent completion for extents falling outside the fsync range, which could
      cause us to end up with file extent items in the log tree that have
      overlapping ranges - for example if the fsync range is [1Mb, 2Mb], when
      we copy inode items we could copy an extent item for the range [0, 512K],
      then release the search path and before moving to the next leaf, an
      ordered extent for a range of [256Kb, 512Kb] completes - this would
      cause us to copy the new extent item for range [256Kb, 512Kb] into the
      log tree after we have copied one for the range [0, 512Kb] - the extents
      overlap, resulting in a corruption.
      
      So this change just does these steps:
      
      1) When the NO_HOLES feature is enabled it leaves the initial range
         intact - no longer sets it to [0, LLONG_MAX] when the full sync bit
         is set in the inode. If NO_HOLES is not enabled, always set the range
         to the full file, just as before this change, to avoid missing
         file extent items representing holes after replaying the log (for
         both full and fast fsyncs);
      
      2) Make the hole detection code operate only on the fsync range;
      
      3) Make the code that copies items from the fs/subvolume tree skip
         copying file extent items that cover a range completely outside
         the range of the fsync.
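      The skipping rule in step 3 can be sketched as a pure predicate. A minimal sketch, assuming end-exclusive item ranges and an inclusive fsync range; the helper name is illustrative, not the kernel's:

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether a file extent item covering [item_start, item_end)
 * must be copied for an fsync of [log_start, log_end]. Items that fall
 * entirely outside the fsync range are skipped, so a racing ordered
 * extent completion cannot produce overlapping items in the log tree. */
static bool extent_item_needs_copy(uint64_t item_start, uint64_t item_end,
                                   uint64_t log_start, uint64_t log_end)
{
	/* completely before or completely after the fsync range */
	if (item_end <= log_start || item_start > log_end)
		return false;
	return true;
}
```

With the commit's example of an fsync range [1Mb, 2Mb], an item for [0, 512Kb) is skipped, so the racing item for [256Kb, 512Kb] can never land next to it in the log.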
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0a8068a3
    • btrfs: fix missing file extent item for hole after ranged fsync · 95418ed1
      By Filipe Manana
      When doing a fast fsync for a range that starts at an offset greater than
      zero, we can end up with a log that, when replayed, causes the
      respective inode to miss a file extent item representing a hole if we
      are not using the NO_HOLES feature. This is because for fast fsyncs
      we don't log any extents
      that cover a range different from the one requested in the fsync.
      
      Example scenario to trigger it:
      
        $ mkfs.btrfs -O ^no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create a file with a single 256K extent and fsync it to clear the
        # full sync bit in the inode - we want the msync below to trigger a
        # fast fsync.
        $ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo
      
        # Force a transaction commit and wipe out the log tree.
        $ sync
      
        # Dirty 768K of data, increasing the file size to 1Mb, and flush only
        # the range from 256K to 512K without updating the log tree
        # (sync_file_range() does not trigger fsync, it only starts writeback
        # and waits for it to finish).
      
        $ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo
        $ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo
      
        # Now dirty the range from 768K to 1M again and sync that range.
        $ xfs_io -c "mmap -w 768K 256K"        \
                 -c "mwrite -S 0xef 768K 256K" \
                 -c "msync -s 768K 256K"       \
                 -c "munmap"                   \
                 /mnt/foo
      
        <power fail>
      
        # Mount to replay the log.
        $ mount /dev/sdd /mnt
        $ umount /mnt
      
        $ btrfs check /dev/sdd
        Opening filesystem to check...
        Checking filesystem on /dev/sdd
        UUID: 482fb574-b288-478e-a190-a9c44a78fca6
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 5 inode 257 errors 100, file extent discount
        Found file extent holes:
             start: 262144, len: 524288
        ERROR: errors found in fs roots
        found 720896 bytes used, error(s) found
        total csum bytes: 512
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 123514
        file data blocks allocated: 589824
          referenced 589824
      
      Fix this issue by setting the range to full (0 to LLONG_MAX) when the
      NO_HOLES feature is not enabled. This results in extra work being done
      but it gives the guarantee we don't end up with missing holes after
      replaying the log.
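      The fix boils down to a range adjustment like the following sketch (the helper name is hypothetical; it only mirrors the behavior described above):

```c
#include <limits.h>
#include <stdbool.h>

/* Without the NO_HOLES feature, widen a ranged fsync to the whole
 * file so hole file extent items are never missed on log replay.
 * With NO_HOLES enabled, the requested range is kept intact. */
static void adjust_fsync_range(bool no_holes_enabled,
                               long long *start, long long *end)
{
	if (!no_holes_enabled) {
		*start = 0;
		*end = LLONG_MAX;
	}
}
```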
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      95418ed1
    • Btrfs: move all reflink implementation code into its own file · 6a177381
      By Filipe Manana
      The reflink code is quite large and has been living in ioctl.c since ever.
      It has grown over the years after many bug fixes and improvements, and
      since I'm planning on making some further improvements on it, it's time
      to get it better organized by moving into its own file, reflink.c
      (similar to what xfs does for example).
      
      This change only moves the code out of ioctl.c into the new file, it
      doesn't do any other change.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6a177381
    • btrfs: convert snapshot/nocow exclusion to drew lock · dcc3eb96
      By Nikolay Borisov
      This patch removes all haphazard code implementing nocow writers
      exclusion from pending snapshot creation and switches to using the drew
      lock to ensure this invariant still holds.
      
      'Readers' are snapshot creators from create_snapshot and 'writers' are
      nocow writers from buffered write path or btrfs_setsize. This locking
      scheme allows for multiple snapshots to happen while any nocow
      writers are blocked, since writes to the page cache in the nocow path
      would make snapshots inconsistent.
      
      For performance reasons we'd like the ability to run multiple
      concurrent snapshots, so the lock also favors readers in this case.
      And when there are no pending snapshots (which will be the majority
      of cases) we rely on the percpu writers counter to avoid cacheline
      contention.
      
      The main gain from using the drew lock is it's now a lot easier to
      reason about the guarantees of the locking scheme and whether there is
      some silent breakage lurking.
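      Assuming the semantics described above, a minimal userspace sketch of such a drew ("double reader-writer exclusion") lock could look like this. Any number of readers OR any number of writers may hold it, but never both at once; the kernel implementation avoids this single mutex with percpu counters, and all names here are illustrative:

```c
#include <pthread.h>

struct drew_lock {
	pthread_mutex_t mtx;
	pthread_cond_t cond;
	int readers;   /* snapshot creators, in btrfs' usage */
	int writers;   /* nocow writers */
};

static void drew_lock_init(struct drew_lock *dl)
{
	pthread_mutex_init(&dl->mtx, NULL);
	pthread_cond_init(&dl->cond, NULL);
	dl->readers = dl->writers = 0;
}

static void drew_read_lock(struct drew_lock *dl)
{
	pthread_mutex_lock(&dl->mtx);
	while (dl->writers > 0)            /* wait out nocow writers */
		pthread_cond_wait(&dl->cond, &dl->mtx);
	dl->readers++;                     /* readers share the lock */
	pthread_mutex_unlock(&dl->mtx);
}

static void drew_read_unlock(struct drew_lock *dl)
{
	pthread_mutex_lock(&dl->mtx);
	if (--dl->readers == 0)
		pthread_cond_broadcast(&dl->cond);
	pthread_mutex_unlock(&dl->mtx);
}

static void drew_write_lock(struct drew_lock *dl)
{
	pthread_mutex_lock(&dl->mtx);
	while (dl->readers > 0)            /* wait out snapshot creators */
		pthread_cond_wait(&dl->cond, &dl->mtx);
	dl->writers++;                     /* writers also share the lock */
	pthread_mutex_unlock(&dl->mtx);
}

static void drew_write_unlock(struct drew_lock *dl)
{
	pthread_mutex_lock(&dl->mtx);
	if (--dl->writers == 0)
		pthread_cond_broadcast(&dl->cond);
	pthread_mutex_unlock(&dl->mtx);
}
```

The point of the sketch is the symmetry: unlike an rwsem, the "writer" side is also shared, which is exactly what lets multiple nocow writers (or multiple snapshots) proceed concurrently.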
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dcc3eb96
    • btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range · b272ae22
      By David Sterba
      The tree pointer can be safely read from the inode so we can drop the
      redundant argument from btrfs_lock_and_flush_ordered_range.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b272ae22
    • btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root · 00246528
      By Josef Bacik
      We are now using these for all roots, so rename them to
      btrfs_put_root() and btrfs_grab_root().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      00246528
    • btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root · bc44d7c4
      By Josef Bacik
      Now that all callers of btrfs_get_fs_root are subsequently calling
      btrfs_grab_fs_root and handling dropping the ref when they are done
      appropriately, go ahead and push btrfs_grab_fs_root up into
      btrfs_get_fs_root.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc44d7c4
    • btrfs: hold a ref on the root in __btrfs_run_defrag_inode · 02162a02
      By Josef Bacik
      We are looking up an arbitrary inode, we need to hold a ref on the root
      while we're doing this.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      02162a02
    • btrfs: open code btrfs_read_fs_root_no_name · 3619c94f
      By Josef Bacik
      All this does is call btrfs_get_fs_root() with check_ref == true.  Just
      use btrfs_get_fs_root() so we don't have a bunch of different helpers
      that do the same thing.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3619c94f
    • btrfs: replace all uses of btrfs_ordered_update_i_size · d923afe9
      By Josef Bacik
      Now that we have a safe way to update the i_size, replace all uses of
      btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d923afe9
    • btrfs: use the file extent tree infrastructure · 9ddc959e
      By Josef Bacik
      We want to use this everywhere we modify the file extent items
      permanently.  These include:
      
        1) Inserting new file extents for writes and prealloc extents.
        2) Truncating inode items.
        3) btrfs_cont_expand().
        4) Insert inline extents.
        5) Insert new extents from log replay.
        6) Insert a new extent for clone, as it could be past i_size.
        7) Hole punching
      
      For hole punching in particular it might seem unnecessary because
      anybody extending the file would use btrfs_cont_expand, however there
      is a corner case that can still give us trouble.  Start with an empty
      file and
      
      fallocate KEEP_SIZE 1M-2M
      
      We now have a 0 length file, and a hole file extent from 0-1M, and a
      prealloc extent from 1M-2M.  Now
      
      punch 1M-1.5M
      
      Because this is past i_size we have
      
      [HOLE EXTENT][ NOTHING ][PREALLOC]
      [0        1M][1M   1.5M][1.5M  2M]
      
      with an i_size of 0.  Now if we pwrite 0-1.5M we'll increase our i_size
      to 1.5M, but our disk_i_size is still 0 until the ordered extent
      completes.
      
      However if we now immediately truncate 2M on the file we'll just call
      btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M.  If we
      commit the transaction here and crash we'll expose the gap.
      
      To fix this we need to clear the file extent mapping for the range
      that we punched but didn't insert a corresponding file extent for.
      This means the truncate will only get a disk_i_size of 1M if we
      crash before the ordered I/O finishes.
      
      I've written an xfstest to reproduce the problem and validate this fix.
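      The invariant at work can be illustrated with a small userspace sketch (illustrative names, not the kernel's code): disk_i_size may only advance to the end of the gap-free run of file extent items starting at offset 0.

```c
struct fe_range {
	unsigned long long start, end;  /* a file extent item's range */
};

/* Walk the (sorted) file extent items and stop at the first gap with
 * no item; disk_i_size must not advance past that point, nor past the
 * in-memory i_size. */
static unsigned long long safe_disk_i_size(const struct fe_range *items,
                                           int nr, unsigned long long i_size)
{
	unsigned long long covered = 0;
	int i;

	for (i = 0; i < nr; i++) {
		if (items[i].start != covered)
			break;            /* punched hole with no extent item */
		covered = items[i].end;
	}
	return covered < i_size ? covered : i_size;
}
```

Plugging in the scenario above (a hole extent item for [0, 1M), nothing for the punched [1M, 1.5M), prealloc at [1.5M, 2M)) yields 1M, matching the behavior the fix enforces.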
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9ddc959e
  3. 20 Jan 2020, 2 commits
  4. 13 Dec 2019, 1 commit
    • Btrfs: fix cloning range with a hole when using the NO_HOLES feature · fcb97058
      By Filipe Manana
      When using the NO_HOLES feature if we clone a range that contains a hole
      and a temporary ENOSPC happens while dropping extents from the target
      inode's range, we can end up failing and aborting the transaction with
      -EEXIST or with a corrupt file extent item, that has a length greater
      than it should and overlaps with other extents. For example when cloning
      the following range from inode A to inode B:
      
        Inode A:
      
          extent A1                                          extent A2
        [ ----------- ]  [ hole, implicit, 4MB length ]  [ ------------- ]
        0            1MB                                 5MB            6MB
      
        Range to clone: [1MB, 6MB)
      
        Inode B:
      
          extent B1       extent B2        extent B3         extent B4
        [ ---------- ]  [ --------- ]    [ ---------- ]    [ ---------- ]
        0           1MB 1MB        2MB   2MB        5MB    5MB         6MB
      
        Target range: [1MB, 6MB) (same as source, to make it easier to explain)
      
      The following can happen:
      
      1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
      
      2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
         set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
      
      3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
         1MB;
      
      4) We then attempt to insert a file extent item at inode B with a file
         offset of 5MB, which is the value of clone_info->file_offset. This
         fails with error -EEXIST because there's already an extent at that
         offset (extent B4);
      
      5) We abort the current transaction with -EEXIST and return that error
         to user space as well.
      
      Another example, for extent corruption:
      
        Inode A:
      
          extent A1                                           extent A2
        [ ----------- ]   [ hole, implicit, 10MB length ]  [ ------------- ]
        0            1MB                                  11MB            12MB
      
        Inode B:
      
          extent B1         extent B2
        [ ----------- ]   [ --------- ]    [ ----------------------------- ]
        0            1MB 1MB         5MB  5MB                             12MB
      
        Target range: [1MB, 12MB) (same as source, to make it easier to explain)
      
      1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
      
      2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
         set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
      
      3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
         4MB;
      
      4) We then insert a file extent item at inode B with a file offset of 11MB
         which is the value of clone_info->file_offset, and a length of 4MB (the
         value of 'clone_len'). So we get 2 extent items with ranges that
         overlap and an extent length of 4MB, larger than extent A2 from
         inode A (1MB length);
      
      5) After that we end the transaction, balance the btree dirty pages and
         then start another or join the previous transaction. It might happen
         that the transaction which inserted the incorrect extent was committed
         by another task so we end up with extent corruption if a power failure
         happens.
      
      So fix this by making sure we attempt to insert the extent to clone at
      the destination inode only if we are past dropping the sub-range that
      corresponds to a hole.
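      The condition the fix introduces can be sketched as a predicate (the helper name is illustrative; the kernel checks the equivalent of this inline):

```c
#include <stdbool.h>
#include <stdint.h>

/* After a partial __btrfs_drop_extents() that stopped at 'drop_end',
 * only insert the cloned extent (destined for 'extent_offset') once
 * dropping has progressed past the hole, i.e. past the extent's
 * destination offset. Otherwise we would insert at an offset that is
 * still occupied (-EEXIST) or with a bogus, overlapping length. */
static bool can_insert_clone_extent(uint64_t drop_end,
                                    uint64_t extent_offset)
{
	return drop_end > extent_offset;
}
```

In the first example above, drop_end is 2MB while clone_info->file_offset is 5MB, so no insertion is attempted until the remaining sub-range has been dropped.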
      
      Fixes: 690a5dbf ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fcb97058
  5. 19 Nov 2019, 1 commit
  6. 18 Nov 2019, 6 commits
    • btrfs: Return offset from find_desired_extent · bc80230e
      By Nikolay Borisov
      Instead of using an input pointer parameter as the return value and
      having int as the return type of find_desired_extent, rework the
      function to directly return the found offset. With that, the 'ret'
      variable in btrfs_llseek_file can be removed. An additional
      (subjective) benefit is that btrfs' llseek function now resembles
      those of the other major filesystems.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc80230e
    • btrfs: Simplify btrfs_file_llseek · 2034f3b4
      By Nikolay Borisov
      Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
      returning from generic_file_llseek. This makes the 'out' label
      redundant. Finally, directly return the value from vfs_setpos. No
      semantic changes.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2034f3b4
    • btrfs: Speed up btrfs_file_llseek · d79b7c26
      By Nikolay Borisov
      Modifying the file position is done on a per-file basis. This renders
      holding the inode lock for writing useless and makes the performance
      of concurrent llseeks abysmal.
      
      Fix this by holding the inode lock for read. This provides protection
      against concurrent truncates, and find_desired_extent already
      includes proper extent locking for the range, which ensures proper
      locking against concurrent writes. SEEK_CUR and SEEK_END can be done
      locklessly.
      
      The former is synchronized by the file::f_lock spinlock. SEEK_END is
      not synchronized but is atomic, which is OK since there is no
      guarantee that SEEK_END will always be at the end of the file in the
      face of tail modifications.
      
      This change brings ~82% performance improvement when doing a lot of
      parallel fseeks. The workload essentially does:
      
          for (d = 0; d < num_seek_read; d++) {
                  fseek(f, 256 * d % 16777216, SEEK_SET);  /* offset %= 16777216 */
                  fread(buffer, 64, 1, f);
          }
      
      Without patch:
      
      num workprocesses = 16
      num fseek/fread = 8000000
      step = 256
      fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      
      real	0m41.412s
      user	0m28.777s
      sys	2m16.510s
      
      With patch:
      
      num workprocesses = 16
      num fseek/fread = 8000000
      step = 256
      fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      
      real	0m11.479s
      user	0m27.629s
      sys	0m21.040s
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d79b7c26
    • Btrfs: fix negative subv_writers counter and data space leak after buffered write · a0e248bb
      By Filipe Manana
      When doing a buffered write it's possible to leave the root's
      subv_writers counter, used for synchronization between buffered nocow
      writers and snapshotting, with a negative value. This happens in an
      exceptional case like the following:
      
      1) We fail to allocate data space for the write, since there's not
         enough available data space nor enough unallocated space for allocating
         a new data block group;
      
      2) Because of that failure, we try to go to NOCOW mode, which succeeds
         and therefore we set the local variable 'only_release_metadata' to true
         and set the root's subv_writers counter to 1 through the call to
         btrfs_start_write_no_snapshotting() made by check_can_nocow();
      
      3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
         to happen but not impossible;
      
      4) No pages are copied because btrfs_copy_from_user() returned zero;
      
      5) We call btrfs_end_write_no_snapshotting() which decrements the root's
         subv_writers counter to 0;
      
      6) We don't set 'only_release_metadata' back to 'false' because we do
         it only if 'copied', the value returned by btrfs_copy_from_user(), is
         greater than zero;
      
      7) On the next iteration of the while loop, which processes the same
         page range, we are now able to allocate data space for the write (we
         got enough data space released in the meanwhile);
      
      8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
         now there isn't enough free metadata space, or in some other place
         further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
         btrfs_dirty_pages()), we break out of the while loop with
         'only_release_metadata' having a value of 'true';
      
      9) Because 'only_release_metadata' is 'true' we end up decrementing the
         root's subv_writers counter to -1 (through a call to
         btrfs_end_write_no_snapshotting()), and we also end up not releasing the
         data space previously reserved through btrfs_check_data_free_space().
         As a consequence the mechanism for synchronizing NOCOW buffered writes
         with snapshotting gets broken.
      
      Fix this by always setting 'only_release_metadata' to false at the start
      of each iteration.
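      A tiny userspace simulation (illustrative only, not kernel code) shows how resetting the flag at the top of each iteration keeps the counter balanced in the scenario above:

```c
#include <stdbool.h>

/* Simulate the subv_writers imbalance: if only_release_metadata is not
 * reset at the top of each loop iteration, a zero-byte copy in a nocow
 * iteration followed by a failure in a later COW iteration decrements
 * the counter twice. Returns the final counter value. */
static int simulate_buffered_write(bool reset_flag_each_iteration)
{
	int subv_writers = 0;
	bool only_release_metadata = false;

	/* Iteration 1: data reservation fails, fall back to nocow. */
	only_release_metadata = true;
	subv_writers++;          /* btrfs_start_write_no_snapshotting() */
	/* btrfs_copy_from_user() returns 0, so undo the nocow hold;
	 * the flag is only cleared when copied > 0, so it stays true. */
	subv_writers--;          /* btrfs_end_write_no_snapshotting() */

	/* Iteration 2: this time data space is reserved (COW path). */
	if (reset_flag_each_iteration)
		only_release_metadata = false;   /* the fix */
	/* A later step fails and we break out of the loop; the cleanup
	 * path trusts the flag: */
	if (only_release_metadata)
		subv_writers--;  /* bogus second decrement with the bug */

	return subv_writers;     /* 0 when fixed, -1 with the bug */
}
```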
      
      Fixes: 8257b2dc ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
      Fixes: 7ee9e440 ("Btrfs: check if we can nocow if we don't have data space")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0e248bb
    • btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
      By David Sterba
      The parameter is now always set to NULL and can be dropped. The last
      user was get_default_root, but that got reworked in 05dbe683 ("Btrfs:
      unify subvol= and subvolid= mounting") and the parameter became
      unused.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c66e0d4
    • btrfs: simplify inode locking for RWF_NOWAIT · 9cf35f67
      By Goldwyn Rodrigues
      This is similar to 942491c9 ("xfs: fix AIM7 regression"). Apparently
      our current rwsem code doesn't like doing the trylock, then lock for
      real scheme. This causes extra contention on the lock and can be
      measured, e.g., by the AIM7 benchmark. So change our read/write methods to
      just do the trylock for the RWF_NOWAIT case.
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      9cf35f67
  7. 18 Oct 2019, 1 commit
    • Btrfs: check for the full sync flag while holding the inode lock during fsync · ba0b084a
      By Filipe Manana
      We were checking the full fsync flag in the inode before locking the
      inode, which is racy, since at that time it might not be set but
      after we acquire the inode lock some other task may have set it. One
      case where this can happen is on a system low on memory, where some
      concurrent task failed to allocate an extent map and therefore set
      the full sync flag on the inode, to force the next fsync to work in
      full mode.
      
      A consequence of missing the full fsync flag set is hitting the problems
      fixed by commit 0c713cba ("Btrfs: fix race between ranged fsync and
      writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
      tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
      of weird inconsistencies after replaying a log due to file extents items
      representing ranges that overlap.
      
      So just move the check such that it's done after locking the inode and
      before starting writeback again.
      
      Fixes: 0c713cba ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ba0b084a
  8. 16 Oct 2019, 1 commit
    • btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      By Qu Wenruo
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However there is inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is that the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even if it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means that if btrfs_delalloc_release_extents() determines to
      free some space, that space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc()
      rather than btrfs_qgroup_convert_reserved_meta().
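      The per-page accounting above can be sketched as a small function (illustrative helper, assuming the whole 64K write results in a single outstanding extent, as described):

```c
/* For an n-page buffered write that forms one outstanding extent:
 * every page reserves a full per-extent rsv up front, and every page
 * after the first releases the unused reservation again. The running
 * total therefore stays at exactly one per-extent rsv. */
static long long total_meta_rsv(int pages, long long per_extent_rsv)
{
	long long total = 0;
	int p;

	for (p = 0; p < pages; p++) {
		long long reserve = per_extent_rsv;  /* reserve first */
		/* only the first page creates a new outstanding extent */
		long long free_back = (p == 0) ? 0 : per_extent_rsv;

		total += reserve - free_back;
	}
	return total;
}
```

The key point for the fix: the space released on pages 2..16 is space the write will never use again, so for qgroup it must be freed (free_meta_prealloc), not converted to PERTRANS.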
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which was already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause a false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is slow,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8702ba93
  9. 02 Oct 2019, 1 commit
    • Btrfs: fix memory leak due to concurrent append writes with fiemap · c67d970f
      By Filipe Manana
      When we have a buffered write that starts at an offset greater than
      or equal to the file's size happening concurrently with a full ranged
      fiemap, we can end up leaking an extent state structure.
      
      Suppose we have a file with a size of 1Mb, and before the buffered write
      and fiemap are performed, it has a single extent state in its io tree
      representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
      
      The following sequence diagram shows how the memory leak happens if a
      fiemap and a buffered write, starting at offset 1Mb and with a length
      of 4Kb, are performed concurrently.
      
                CPU 1                                                  CPU 2
      
        extent_fiemap()
          --> it's a full ranged fiemap
              range from 0 to LLONG_MAX - 1
              (9223372036854775807)
      
          --> locks range in the inode's
              io tree
            --> after this we have 2 extent
                states in the io tree:
                --> 1 for range [0, 1Mb[ with
                    the bits EXTENT_LOCKED and
                    EXTENT_DELALLOC_BITS set
                --> 1 for the range
                    [1Mb, LLONG_MAX[ with
                    the EXTENT_LOCKED bit set
      
                                                        --> start buffered write at offset
                                                            1Mb with a length of 4Kb
      
                                                        btrfs_file_write_iter()
      
                                                          btrfs_buffered_write()
                                                            --> cached_state is NULL
      
                                                            lock_and_cleanup_extent_if_need()
                                                              --> returns 0 and does not lock
                                                                  range because it starts
                                                                  at current i_size / eof
      
                                                            --> cached_state remains NULL
      
                                                            btrfs_dirty_pages()
                                                              btrfs_set_extent_delalloc()
                                                                (...)
                                                                __set_extent_bit()
      
                                                                  --> splits extent state for range
                                                                      [1Mb, LLONG_MAX[ and now we
                                                                      have 2 extent states:
      
                                                                      --> one for the range
                                                                          [1Mb, 1Mb + 4Kb[ with
                                                                          EXTENT_LOCKED set
                                                                      --> another one for the range
                                                                          [1Mb + 4Kb, LLONG_MAX[ with
                                                                          EXTENT_LOCKED set as well
      
                                                                  --> sets EXTENT_DELALLOC on the
                                                                      extent state for the range
                                                                      [1Mb, 1Mb + 4Kb[
                                                                  --> caches extent state
                                                                      [1Mb, 1Mb + 4Kb[ into
                                                                      @cached_state because it has
                                                                      the bit EXTENT_LOCKED set
      
                                                          --> btrfs_buffered_write() ends up
                                                              with a non-NULL cached_state and
                                                              never calls anything to release its
                                                              reference on it, resulting in a
                                                              memory leak
      
      Fix this by calling free_extent_state() on cached_state if the range was
      not locked by lock_and_cleanup_extent_if_need().
      
      The same issue can happen if anything else other than fiemap locks a range
      that covers eof and beyond.
      
      This could be triggered, sporadically, by test case generic/561 from the
      fstests suite, which makes duperemove run concurrently with fsstress, and
      duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
      the leak is reported in dmesg/syslog when removing the btrfs module with
      a message like the following:
      
        [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
      
      Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
      
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c67d970f
  10. 09 Sep 2019, 7 commits
  11. 04 Jul 2019, 1 commit
  12. 02 Jul 2019, 2 commits
  13. 01 Jul 2019, 2 commits
  14. 16 May 2019, 1 commit
    • F
      Btrfs: fix race between ranged fsync and writeback of adjacent ranges · 0c713cba
      Committed by Filipe Manana
      When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
      inode) that happens to be ranged, which happens during a msync() or writes
      for files opened with O_SYNC for example, we can end up with a corrupt log,
      due to different file extent items representing ranges that overlap with
      each other, or hit some assertion failures.
      
      When doing a ranged fsync we only flush delalloc and wait for ordered
      extents within that range. If, while we are logging items from our inode,
      ordered extents for adjacent ranges complete, we end up in a race that can
      make us insert file extent items that overlap with others we logged
      previously, or trigger the assertion failures.
      
      For example, if tree-log.c:copy_items() receives a leaf that has the
      following file extents items, all with a length of 4K and therefore there
      is an implicit hole in the range 68K to 72K - 1:
      
        (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
      
      It copies them to the log tree. However, because it needs to detect
      implicit holes, it may release the path in order to look at the previous
      leaf, and then later search the tree again for the first file extent item
      key, with the goal of locking that leaf again (it might have changed due
      to concurrent changes to other inodes).
      
      However, when it locks the leaf containing the first key again, the key
      corresponding to the extent at offset 72K may not be there anymore, since
      there is an ordered extent for that range that is finishing (that is,
      somewhere in the middle of btrfs_finish_ordered_io()) which has just
      removed the file extent item but has not yet replaced it with a new one.
      The part of copy_items() that does hole detection will therefore decide
      that there is a hole in the range from 68K to 76K - 1, and insert a file
      extent item to represent that hole, with a key offset of 68K. After that
      we have a log tree with 2 different extent items that have overlapping
      ranges:
      
       1) The file extent item copied before copy_items() released the path,
          which has a key offset of 72K and a length of 4K, representing the
          file range 72K to 76K - 1.
      
       2) And a file extent item representing a hole that has a key offset of
          68K and a length of 8K, representing the range 68K to 76K - 1. This
          item was inserted after releasing the path, and overlaps with the
          extent item inserted before.
      
      The overlapping extent items can cause all sorts of unpredictable and
      incorrect behaviour, either when replayed or if a fast (non full) fsync
      happens later, which can trigger a BUG_ON() when calling
      btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
      trace like the following:
      
        [61666.783269] ------------[ cut here ]------------
        [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
        [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
        (...)
        [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
        [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
        [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
        [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
        [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
        [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
        [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
        [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
        [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
        [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
        [61666.786253] Call Trace:
        [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
        [61666.786253]  ? time_hardirqs_on+0x9/0x14
        [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
        [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
        [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
        [61666.786253]  ? lock_acquire+0x131/0x1c5
        [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
        [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
        [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
        [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
        [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
        [61666.786253]  vfs_fsync_range+0x8c/0x9e
        [61666.786253]  SyS_msync+0x13c/0x1c9
        [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad
      
      A sample of a corrupt log tree leaf with overlapping extents I got from
      running btrfs/072:
      
            item 14 key (295 108 200704) itemoff 2599 itemsize 53
                    extent data disk bytenr 0 nr 0
                    extent data offset 0 nr 458752 ram 458752
            item 15 key (295 108 659456) itemoff 2546 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 606208 nr 163840 ram 770048
            item 16 key (295 108 663552) itemoff 2493 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 610304 nr 155648 ram 770048
            item 17 key (295 108 819200) itemoff 2440 itemsize 53
                    extent data disk bytenr 4334788608 nr 4096
                    extent data offset 0 nr 4096 ram 4096
      
      The file extent item at offset 659456 (item 15) ends at offset 823296
      (659456 + 163840) while the next file extent item (item 16) starts at
      offset 663552.
      
      Another problem that the race can trigger is a failure in the assertions
      at tree-log.c:copy_items(), which expect that the first file extent item
      key we found before releasing the path still exists after the path is
      released, and that the last key we found before releasing the path
      exists after it as well:
      
        $ cat -n fs/btrfs/tree-log.c
        4080          if (need_find_last_extent) {
        4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
        4082                  btrfs_release_path(src_path);
        4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
        4084                                  src_path, 0, 0);
        4085                  if (ret < 0)
        4086                          return ret;
        4087                  ASSERT(ret == 0);
        (...)
        4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
        4104                          ret = btrfs_next_leaf(inode->root, src_path);
        4105                          if (ret < 0)
        4106                                  return ret;
        4107                          ASSERT(ret == 0);
        4108                          src = src_path->nodes[0];
        4109                          i = 0;
        4110                          need_find_last_extent = true;
        4111                  }
        (...)
      
      The second assertion implicitly expects that the last key before the path
      release still exists, because the surrounding while loop only stops after
      we have found that key. When this assertion fails it produces a stack like
      this:
      
        [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
        [139590.037406] ------------[ cut here ]------------
        [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
        [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
        (...)
        [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
        (...)
        [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
        [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
        [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
        [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
        [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
        [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
        [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
        [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
        [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [139590.044250] Call Trace:
        [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
        [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
        [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
        [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
        [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
        [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
        [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
        [139590.047592]  __vfs_write+0x129/0x1c0
        [139590.047932]  vfs_write+0xc2/0x1b0
        [139590.048270]  ksys_write+0x55/0xc0
        [139590.048608]  do_syscall_64+0x60/0x1b0
        [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [139590.049287] RIP: 0033:0x7f2efc4be190
        (...)
        [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
        [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
        [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
        [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
        [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
        (...)
        [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
      
      So fix this race between a full ranged fsync and writeback of adjacent
      ranges by flushing all delalloc and waiting for all ordered extents to
      complete before logging the inode. This is the simplest way to solve the
      problem because currently the full fsync path does not deal with ranges
      at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
      to look at adjacent ranges for hole detection. For use cases of ranged
      fsyncs this can make a few fsyncs slower, but on the other hand it can
      allow subsequent fsyncs to other ranges to do less work, or no work at
      all. A full fsync is rare anyway and happens only once after
      loading/creating an inode and once after less common operations such as a
      shrinking truncate.
      
      This issue has existed for a long time and was often triggered by
      generic/127, because it does mmap'ed writes and msync (which triggers a
      ranged fsync). Adding support to the tree checker for detecting
      overlapping extents (next patch in the series) and triggering a WARN()
      when such cases are found, and then calling btrfs_check_leaf_full() at
      the end of btrfs_insert_file_extent(), made the issue much easier to
      detect. Running btrfs/072 with that change to the tree checker and making
      fsstress open files always with O_SYNC made it much easier to trigger the
      issue (as triggering it with generic/127 is very rare).
      
      CC: stable@vger.kernel.org # 3.16+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0c713cba
  15. 04 May 2019, 1 commit