1. 10 Jun, 2020 - 1 commit
  2. 28 May, 2020 - 2 commits
    • C
      btrfs: split btrfs_direct_IO to read and write part · d8f3e735
      Committed by Christoph Hellwig
      The read and write versions don't have anything in common except for the
      call to iomap_dio_rw.  So split this function, and merge each half into
      its only caller.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d8f3e735
    • G
      btrfs: switch to iomap_dio_rw() for dio · a43a67a2
      Committed by Goldwyn Rodrigues
      Switch from __blockdev_direct_IO() to iomap_dio_rw().
      Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
      as iomap_begin() for iomap direct I/O functions. This function
      allocates and locks all the blocks required for the I/O.
      btrfs_submit_direct() is used as the submit_io() hook for direct I/O
      ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore so set it to noop.
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
       - for writes, call __endio_write_update_ordered()
       - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data for
      which to call __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a43a67a2
  3. 25 May, 2020 - 3 commits
    • D
      btrfs: simplify iget helpers · 0202e83f
      Committed by David Sterba
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root and thus the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only the inode number, renaming it to 'ino'
      instead of 'objectid'. This allows removing the temporary key variables,
      saving some stack space.
      Signed-off-by: David Sterba <dsterba@suse.com>
      0202e83f
    • D
      btrfs: simplify root lookup by id · 56e9357a
      Committed by David Sterba
      The main function to look up a root by its id, btrfs_get_fs_root, takes
      the whole key while only using the objectid. The value of offset is
      preset to (u64)-1 but not actually used until btrfs_find_root, which
      does the actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: David Sterba <dsterba@suse.com>
      56e9357a
    • Q
      btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE · 92a7cc42
      Committed by Qu Wenruo
      The name BTRFS_ROOT_REF_COWS is not very clear about its meaning.
      
      In fact, that bit can only be set on these trees:
      
      - Subvolume roots
      - Data reloc root
      - Reloc roots for above roots
      
      All other trees won't get this bit set.  So just by the result, it is
      obvious that roots with this bit set can have tree blocks shared with
      other trees, either by snapshots or by reloc roots (a special snapshot
      created by relocation).
      
      This patch renames BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
      make it easier to understand, and updates all comments mentioning
      "reference counted" to follow the rename.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      92a7cc42
  4. 09 Apr, 2020 - 1 commit
    • F
      btrfs: make full fsyncs always operate on the entire file again · 7af59743
      Committed by Filipe Manana
      This is a revert of commit 0a8068a3 ("btrfs: make ranged full
      fsyncs more efficient"), with updated comment in btrfs_sync_file.
      
      Commit 0a8068a3 ("btrfs: make ranged full fsyncs more efficient")
      made full fsyncs operate on the given range only, as it assumed this was
      safe when using the NO_HOLES feature, since hole detection had been
      simplified some time ago and was no longer a source of races with
      ordered extent completion of adjacent file ranges.
      
      However it's still not safe to have a full fsync only operate on the given
      range, because extent maps for new extents might not be present in memory
      due to inode eviction or extent cloning. Consider the following example:
      
      1) We are currently at transaction N;
      
      2) We write to the file range [0, 1MiB);
      
      3) Writeback finishes for the whole range and ordered extents complete,
         while we are still at transaction N;
      
      4) The inode is evicted;
      
      5) We open the file for writing, causing the inode to be loaded to
         memory again, which sets the 'full sync' bit on its flags. At this
         point the inode's list of modified extent maps is empty (figuring
         out which extents were created in the current transaction and were
         not yet logged by an fsync is expensive, that's why we set the
         'full sync' bit when loading an inode);
      
      6) We write to the file range [512KiB, 768KiB);
      
      7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB).
         This correctly flushes this range and logs its extent into the log
         tree. When the writeback started an extent map for range [512KiB, 768KiB)
         was added to the inode's list of modified extents, and when the fsync()
         finishes logging it removes that extent map from the list of modified
         extent maps. This fsync also clears the 'full sync' bit;
      
      8) We do a regular fsync() (full ranged). This fsync() ends up doing
         nothing because the inode's list of modified extents is empty and
         no other changes happened since the previous ranged fsync(), so
         it just returns success (0) and we end up never logging extents for
         the file ranges [0, 512KiB) and [768KiB, 1MiB).
      
      Another scenario where this can happen is if we replace steps 2 to 4 with
      cloning from another file into our test file, as that sets the 'full sync'
      bit in our inode's flags and does not populate its list of modified extent
      maps.
      
      This was causing test case generic/457 to fail sporadically when using the
      NO_HOLES feature, as it exercised this later case where the inode has the
      'full sync' bit set and has no extent maps in memory to represent the new
      extents due to extent cloning.
      
      Fix this by reverting commit 0a8068a3 ("btrfs: make ranged full fsyncs
      more efficient") since there is no easy way to work around it.
      
      Fixes: 0a8068a3 ("btrfs: make ranged full fsyncs more efficient")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7af59743
  5. 25 Mar, 2020 - 1 commit
  6. 24 Mar, 2020 - 12 commits
    • J
      btrfs: kill the subvol_srcu · c75e8394
      Committed by Josef Bacik
      Now that we have proper root ref counting everywhere we can kill the
      subvol_srcu.
      
      * removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes
      
      * the refcount_t used for the references checks for accidental 0->1
        in cases where the root lifetime would not be properly protected
      
      * there's a leak detector for roots to catch unfreed roots at umount
        time
      
      * SRCU served us well over the years but it was not a proper
        synchronization mechanism for some cases
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      c75e8394
    • F
      btrfs: make ranged full fsyncs more efficient · 0a8068a3
      Committed by Filipe Manana
      Commit 0c713cba ("Btrfs: fix race between ranged fsync and writeback
      of adjacent ranges") fixed a bug where we could end up with file extent
      items in a log tree that represent file ranges that overlap due to a race
      between the hole detection of a ranged full fsync and writeback for a
      different file range.
      
      The problem was solved by forcing any ranged full fsync to become a
      non-ranged full fsync - setting the range start to 0 and the end offset to
      LLONG_MAX. This was a simple solution because the code that detected and
      marked holes was very complex, it used to be done at copy_items() and
      implied several searches on the fs/subvolume tree. The drawback of that
      solution was that we started to flush delalloc for the entire file and
      wait for all the ordered extents to complete for ranged full fsyncs
      (including ordered extents covering ranges completely outside the given
      range). Fortunately, ranged full fsyncs are not the most common case
      (hopefully for most workloads).
      
      However, a later fix for detecting and marking holes was made by commit
      0e56315c ("Btrfs: fix missing hole after hole punching and fsync
      when using NO_HOLES") and it simplified hole detection a lot; now
      copy_items() no longer does it and we do it in a much simpler way at
      btrfs_log_holes().
      
      This now makes it possible for the hole detection code to operate only
      on the initial range, with no need to operate on the whole file, while
      also avoiding the need to flush delalloc for the entire file and wait
      for ordered extents that cover ranges that don't overlap the given
      range.
      
      Special care must also be taken to skip file extent items that fall
      entirely outside the fsync range when copying inode items from the
      fs/subvolume tree into the log tree - this is to avoid races with ordered
      extent completion for extents falling outside the fsync range, which could
      cause us to end up with file extent items in the log tree that have
      overlapping ranges - for example if the fsync range is [1Mb, 2Mb], when
      we copy inode items we could copy an extent item for the range [0, 512K],
      then release the search path and before moving to the next leaf, an
      ordered extent for a range of [256Kb, 512Kb] completes - this would
      cause us to copy the new extent item for range [256Kb, 512Kb] into the
      log tree after we have copied one for the range [0, 512Kb] - the extents
      overlap, resulting in a corruption.
      
      So this change just does these steps:
      
      1) When the NO_HOLES feature is enabled it leaves the initial range
         intact - no longer sets it to [0, LLONG_MAX] when the full sync bit
         is set in the inode. If NO_HOLES is not enabled, always set the range
         to the full file, just like before this change, to avoid missing file
         extent items representing holes after replaying the log (for both
         full and fast fsyncs);
      
      2) Make the hole detection code operate only on the fsync range;
      
      3) Make the code that copies items from the fs/subvolume tree skip
         copying file extent items that cover a range completely outside the
         range of the fsync.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0a8068a3
    • F
      btrfs: fix missing file extent item for hole after ranged fsync · 95418ed1
      Committed by Filipe Manana
      When doing a fast fsync for a range that starts at an offset greater
      than zero, we can end up with a log that, when replayed, causes the
      respective inode to miss a file extent item representing a hole if we
      are not using the NO_HOLES feature. This is because for fast fsyncs we
      don't log any extents that cover a range different from the one
      requested in the fsync.
      
      Example scenario to trigger it:
      
        $ mkfs.btrfs -O ^no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create a file with a single 256K extent and fsync it to clear the
        # full sync bit in the inode - we want the msync below to trigger a
        # fast fsync.
        $ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo
      
        # Force a transaction commit and wipe out the log tree.
        $ sync
      
        # Dirty 768K of data, increasing the file size to 1Mb, and flush only
        # the range from 256K to 512K without updating the log tree
        # (sync_file_range() does not trigger fsync, it only starts writeback
        # and waits for it to finish).
      
        $ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo
        $ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo
      
        # Now dirty the range from 768K to 1M again and sync that range.
        $ xfs_io -c "mmap -w 768K 256K"        \
                 -c "mwrite -S 0xef 768K 256K" \
                 -c "msync -s 768K 256K"       \
                 -c "munmap"                   \
                 /mnt/foo
      
        <power fail>
      
        # Mount to replay the log.
        $ mount /dev/sdd /mnt
        $ umount /mnt
      
        $ btrfs check /dev/sdd
        Opening filesystem to check...
        Checking filesystem on /dev/sdd
        UUID: 482fb574-b288-478e-a190-a9c44a78fca6
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 5 inode 257 errors 100, file extent discount
        Found file extent holes:
             start: 262144, len: 524288
        ERROR: errors found in fs roots
        found 720896 bytes used, error(s) found
        total csum bytes: 512
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 123514
        file data blocks allocated: 589824
          referenced 589824
      
      Fix this issue by setting the range to full (0 to LLONG_MAX) when the
      NO_HOLES feature is not enabled. This results in extra work being done
      but it gives the guarantee we don't end up with missing holes after
      replaying the log.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      95418ed1
    • F
      Btrfs: move all reflink implementation code into its own file · 6a177381
      Committed by Filipe Manana
      The reflink code is quite large and has lived in ioctl.c from the
      beginning. It has grown over the years after many bug fixes and
      improvements, and since I'm planning on making some further improvements
      to it, it's time to get it better organized by moving it into its own
      file, reflink.c (similar to what xfs does, for example).
      
      This change only moves the code out of ioctl.c into the new file, it
      doesn't do any other change.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6a177381
    • N
      btrfs: convert snapshot/nocow exclusion to drew lock · dcc3eb96
      Committed by Nikolay Borisov
      This patch removes all haphazard code implementing nocow writers
      exclusion from pending snapshot creation and switches to using the drew
      lock to ensure this invariant still holds.
      
      'Readers' are snapshot creators from create_snapshot and 'writers' are
      nocow writers from buffered write path or btrfs_setsize. This locking
      scheme allows for multiple snapshots to happen while any nocow writers
      are blocked, since writes to page cache in the nocow path will make
      snapshots inconsistent.
      
      So for performance reasons we'd like to have the ability to run multiple
      concurrent snapshots, and the lock also favors readers in this case. In
      case there aren't pending snapshots (which will be the majority of
      cases) we rely on the percpu writers counter to avoid cacheline
      contention.
      
      The main gain from using the drew lock is it's now a lot easier to
      reason about the guarantees of the locking scheme and whether there is
      some silent breakage lurking.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dcc3eb96
    • D
      btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range · b272ae22
      Committed by David Sterba
      The tree pointer can be safely read from the inode so we can drop the
      redundant argument from btrfs_lock_and_flush_ordered_range.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b272ae22
    • J
      btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root · 00246528
      Committed by Josef Bacik
      We are now using these for all roots, so rename them to btrfs_put_root()
      and btrfs_grab_root().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      00246528
    • J
      btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root · bc44d7c4
      Committed by Josef Bacik
      Now that all callers of btrfs_get_fs_root are subsequently calling
      btrfs_grab_fs_root and handling dropping the ref when they are done
      appropriately, go ahead and push btrfs_grab_fs_root up into
      btrfs_get_fs_root.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc44d7c4
    • J
      btrfs: hold a ref on the root in __btrfs_run_defrag_inode · 02162a02
      Committed by Josef Bacik
      We are looking up an arbitrary inode, so we need to hold a ref on the
      root while we're doing this.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      02162a02
    • J
      btrfs: open code btrfs_read_fs_root_no_name · 3619c94f
      Committed by Josef Bacik
      All this does is call btrfs_get_fs_root() with check_ref == true.  Just
      use btrfs_get_fs_root() so we don't have a bunch of different helpers
      that do the same thing.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3619c94f
    • J
      btrfs: replace all uses of btrfs_ordered_update_i_size · d923afe9
      Committed by Josef Bacik
      Now that we have a safe way to update the i_size, replace all uses of
      btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d923afe9
    • J
      btrfs: use the file extent tree infrastructure · 9ddc959e
      Committed by Josef Bacik
      We want to use this everywhere we modify the file extent items
      permanently.  These include:
      
        1) Inserting new file extents for writes and prealloc extents.
        2) Truncating inode items.
        3) btrfs_cont_expand().
        4) Insert inline extents.
        5) Insert new extents from log replay.
        6) Insert a new extent for clone, as it could be past i_size.
        7) Hole punching
      
      For hole punching in particular it might seem it's not necessary,
      because anybody extending the file would use btrfs_cont_expand; however
      there is a corner case that can still give us trouble.  Start with an
      empty file and
      
      fallocate KEEP_SIZE 1M-2M
      
      We now have a 0 length file, and a hole file extent from 0-1M, and a
      prealloc extent from 1M-2M.  Now
      
      punch 1M-1.5M
      
      Because this is past i_size we have
      
      [HOLE EXTENT][ NOTHING ][PREALLOC]
      [0        1M][1M   1.5M][1.5M  2M]
      
      with an i_size of 0.  Now if we pwrite 0-1.5M we'll increase our i_size
      to 1.5M, but our disk_i_size is still 0 until the ordered extent
      completes.
      
      However if we now immediately truncate 2M on the file we'll just call
      btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M.  If we
      commit the transaction here and crash we'll expose the gap.
      
      To fix this we need to clear the file extent mapping for the range that
      we punched but didn't insert a corresponding file extent for.  This will
      mean the truncate will only get a disk_i_size set to 1M if we crash
      before the ordered I/O finishes.
      
      I've written an xfstest to reproduce the problem and validate this fix.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9ddc959e
  7. 20 Jan, 2020 - 2 commits
  8. 13 Dec, 2019 - 1 commit
    • F
      Btrfs: fix cloning range with a hole when using the NO_HOLES feature · fcb97058
      Committed by Filipe Manana
      When using the NO_HOLES feature if we clone a range that contains a hole
      and a temporary ENOSPC happens while dropping extents from the target
      inode's range, we can end up failing and aborting the transaction with
      -EEXIST or with a corrupt file extent item, that has a length greater
      than it should and overlaps with other extents. For example when cloning
      the following range from inode A to inode B:
      
        Inode A:
      
          extent A1                                          extent A2
        [ ----------- ]  [ hole, implicit, 4MB length ]  [ ------------- ]
        0            1MB                                 5MB            6MB
      
        Range to clone: [1MB, 6MB)
      
        Inode B:
      
          extent B1       extent B2        extent B3         extent B4
        [ ---------- ]  [ --------- ]    [ ---------- ]    [ ---------- ]
        0           1MB 1MB        2MB   2MB        5MB    5MB         6MB
      
        Target range: [1MB, 6MB) (same as source, to make it easier to explain)
      
      The following can happen:
      
      1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
      
      2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
         set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
      
      3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
         1MB;
      
      4) We then attempt to insert a file extent item at inode B with a file
         offset of 5MB, which is the value of clone_info->file_offset. This
         fails with error -EEXIST because there's already an extent at that
         offset (extent B4);
      
      5) We abort the current transaction with -EEXIST and return that error
         to user space as well.
      
      Another example, for extent corruption:
      
        Inode A:
      
          extent A1                                           extent A2
        [ ----------- ]   [ hole, implicit, 10MB length ]  [ ------------- ]
        0            1MB                                  11MB            12MB
      
        Inode B:
      
          extent B1         extent B2
        [ ----------- ]   [ --------- ]    [ ----------------------------- ]
        0            1MB 1MB         5MB  5MB                             12MB
      
        Target range: [1MB, 12MB) (same as source, to make it easier to explain)
      
      1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
      
      2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
         set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
      
      3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
         4MB;
      
      4) We then insert a file extent item at inode B with a file offset of
         11MB, which is the value of clone_info->file_offset, and a length of
         4MB (the value of 'clone_len'). So we get 2 extent items with ranges
         that overlap and an extent length of 4MB, larger than the extent A2
         from inode A (1MB length);
      
      5) After that we end the transaction, balance the btree dirty pages and
         then start another or join the previous transaction. It might happen
         that the transaction which inserted the incorrect extent was committed
         by another task so we end up with extent corruption if a power failure
         happens.
      
      So fix this by making sure we attempt to insert the extent to clone at
      the destination inode only if we are past dropping the sub-range that
      corresponds to a hole.
      
      Fixes: 690a5dbf ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fcb97058
  9. 19 Nov, 2019 - 1 commit
  10. 18 Nov, 2019 - 6 commits
    • N
      btrfs: Return offset from find_desired_extent · bc80230e
      Committed by Nikolay Borisov
      Instead of using an input pointer parameter as the return value and
      having an int as the return type of find_desired_extent, rework the
      function to directly return the found offset. Doing that, the 'ret'
      variable in btrfs_llseek_file can be removed. An additional (subjective)
      benefit is that btrfs' llseek function now resembles those of the other
      major filesystems.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc80230e
    • N
      btrfs: Simplify btrfs_file_llseek · 2034f3b4
      Committed by Nikolay Borisov
      Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
      returning from generic_file_llseek. This makes the 'out' label
      redundant.  Finally, directly return the value from vfs_setpos. No
      semantic changes.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2034f3b4
    • N
      btrfs: Speed up btrfs_file_llseek · d79b7c26
      Committed by Nikolay Borisov
      Modifying the file position is done on a per-file basis. This renders
      holding the inode lock for writing useless and makes the performance of
      concurrent llseek's abysmal.
      
      Fix this by holding the inode for read. This provides protection against
      concurrent truncates and find_desired_extent already includes proper
      extent locking for the range which ensures proper locking against
      concurrent writes. SEEK_CUR and SEEK_END can be done locklessly.
      
      The former is synchronized by the file::f_lock spinlock. SEEK_END is not
      synchronized but is atomic; that's OK since there is no guarantee that
      SEEK_END will always be at the end of the file in the face of tail
      modifications.
      
      This change brings ~82% performance improvement when doing a lot of
      parallel fseeks. The workload essentially does:
      
          for (d=0; d<num_seek_read; d++)
            {
      	/* offset %= 16777216; */
      	fseek (f, 256 * d % 16777216, SEEK_SET);
      	fread (buffer, 64, 1, f);
            }
      
      Without patch:
      
      num workprocesses = 16
      num fseek/fread = 8000000
      step = 256
      fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      
      real	0m41.412s
      user	0m28.777s
      sys	2m16.510s
      
      With patch:
      
      num workprocesses = 16
      num fseek/fread = 8000000
      step = 256
      fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
      
      real	0m11.479s
      user	0m27.629s
      sys	0m21.040s
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d79b7c26
    • F
      Btrfs: fix negative subv_writers counter and data space leak after buffered write · a0e248bb
      Committed by Filipe Manana
      When doing a buffered write it's possible to leave the root's
      subv_writers counter, used for synchronization between buffered nocow
      writers and snapshotting, unbalanced (it can end up negative). This
      happens in an exceptional case like the following:
      
      1) We fail to allocate data space for the write, since there's not
         enough available data space nor enough unallocated space for allocating
         a new data block group;
      
      2) Because of that failure, we try to go to NOCOW mode, which succeeds
         and therefore we set the local variable 'only_release_metadata' to true
         and set the root's sub_writers counter to 1 through the call to
         btrfs_start_write_no_snapshotting() made by check_can_nocow();
      
      3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
         to happen but not impossible;
      
      4) No pages are copied because btrfs_copy_from_user() returned zero;
      
      5) We call btrfs_end_write_no_snapshotting() which decrements the root's
         subv_writers counter to 0;
      
      6) We don't set 'only_release_metadata' back to 'false' because we do
         it only if 'copied', the value returned by btrfs_copy_from_user(), is
         greater than zero;
      
      7) On the next iteration of the while loop, which processes the same
         page range, we are now able to allocate data space for the write (we
         got enough data space released in the meanwhile);
      
      8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
         now there isn't enough free metadata space, or in some other place
         further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
         btrfs_dirty_pages()), we break out of the while loop with
         'only_release_metadata' having a value of 'true';
      
      9) Because 'only_release_metadata' is 'true' we end up decrementing the
         root's subv_writers counter to -1 (through a call to
         btrfs_end_write_no_snapshotting()), and we also end up not releasing the
         data space previously reserved through btrfs_check_data_free_space().
         As a consequence the mechanism for synchronizing NOCOW buffered writes
         with snapshotting gets broken.
      
      Fix this by always setting 'only_release_metadata' to false at the start
      of each iteration.
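      The nine steps above can be condensed into a small user-space model.
      This is purely illustrative (the counter and helpers are stand-ins for
      root->subv_writers and the btrfs_{start,end}_write_no_snapshotting()
      calls, not kernel code) and shows why the flag must be reset on every
      pass of the loop:

```c
#include <assert.h>
#include <stdbool.h>

static int subv_writers;                       /* models root->subv_writers */

static void start_write_no_snapshotting(void) { subv_writers++; }
static void end_write_no_snapshotting(void)   { subv_writers--; }

/*
 * Very reduced model of the btrfs_buffered_write() loop.  nocow[i] says
 * whether iteration i fell back to NOCOW mode, copied[i] models the
 * btrfs_copy_from_user() return value, and fail_late[i] models a failure
 * after data space was reserved (e.g. metadata reservation failing).
 */
static int write_loop(int iters, const bool nocow[], const long copied[],
                      const bool fail_late[], bool apply_fix)
{
    bool only_release_metadata = false;

    for (int i = 0; i < iters; i++) {
        if (apply_fix)
            only_release_metadata = false;     /* the fix: reset every pass */

        if (nocow[i]) {                        /* steps 1-2 of the changelog */
            only_release_metadata = true;
            start_write_no_snapshotting();
        }

        if (copied[i] == 0) {                  /* steps 3-5: nothing copied */
            if (only_release_metadata)
                end_write_no_snapshotting();
            continue;                          /* retry the same page range; */
        }                                      /* step 6: flag NOT cleared   */

        if (fail_late[i]) {                    /* step 8: break out on error */
            if (only_release_metadata)
                end_write_no_snapshotting();   /* step 9: extra decrement    */
            return -1;
        }

        if (only_release_metadata) {
            end_write_no_snapshotting();       /* balanced NOCOW teardown */
            only_release_metadata = false;
        }
    }
    return 0;
}
```

      Replaying the changelog's sequence (NOCOW with zero bytes copied, then
      a late failure on the retry) leaves the counter at -1 without the fix
      and at 0 with it.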
      
      Fixes: 8257b2dc ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
      Fixes: 7ee9e440 ("Btrfs: check if we can nocow if we don't have data space")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0e248bb
    • D
      btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
      David Sterba authored
      The parameter is now always set to NULL and could be dropped. The last
      user was get_default_root but that got reworked in 05dbe683 ("Btrfs:
      unify subvol= and subvolid= mounting") and the parameter became unused.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c66e0d4
    • G
      btrfs: simplify inode locking for RWF_NOWAIT · 9cf35f67
      Goldwyn Rodrigues authored
      This is similar to 942491c9 ("xfs: fix AIM7 regression"). Apparently
      our current rwsem code doesn't like doing a trylock followed by a lock
      for real. This causes extra contention on the lock and can be
      measured, e.g., by the AIM7 benchmark.  So change our read/write
      methods to just do the trylock for the RWF_NOWAIT case.
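      The resulting locking pattern can be sketched as follows. This is a
      user-space model with a plain flag standing in for the inode rwsem
      (not the kernel API), so the blocking path just acquires instead of
      sleeping:

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the inode rwsem: nonzero while a writer holds it. */
static int inode_lock_held;

static int inode_trylock_model(void)
{
    if (inode_lock_held)
        return 0;              /* contended: trylock fails */
    inode_lock_held = 1;
    return 1;
}

static void inode_lock_model(void)
{
    /* a real rwsem would sleep on contention; the model just acquires */
    inode_lock_held = 1;
}

static void inode_unlock_model(void)
{
    inode_lock_held = 0;
}

/*
 * After the change: the RWF_NOWAIT path only ever trylocks and bails
 * out with -EAGAIN on contention, and the regular path goes straight
 * to the blocking lock.  There is no "trylock, then lock for real"
 * sequence left to inflate contention.
 */
static int write_lock_model(bool nowait)
{
    if (nowait) {
        if (!inode_trylock_model())
            return -EAGAIN;
        return 0;
    }
    inode_lock_model();
    return 0;
}
```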
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      9cf35f67
  11. 18 Oct, 2019 1 commit
    • F
      Btrfs: check for the full sync flag while holding the inode lock during fsync · ba0b084a
      Filipe Manana authored
      We were checking for the full fsync flag in the inode before locking the
      inode, which is racy, since at that time it might not be set, but some
      other task might set it after we acquire the inode lock. One case where
      this can happen is on a system low on memory and some concurrent task
      failed to allocate an extent map and therefore set the full sync flag on
      the inode, to force the next fsync to work in full mode.
      
      A consequence of missing the full fsync flag set is hitting the problems
      fixed by commit 0c713cba ("Btrfs: fix race between ranged fsync and
      writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
      tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
      of weird inconsistencies after replaying a log due to file extents items
      representing ranges that overlap.
      
      So just move the check such that it's done after locking the inode and
      before starting writeback again.
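      The ordering fix can be modeled in a few lines. The struct, flag, and
      helpers below are illustrative stand-ins (the real flag is
      BTRFS_INODE_NEEDS_FULL_SYNC in the inode's runtime flags), and the
      racer callback stands for the concurrent task that sets it:

```c
#include <stdbool.h>

/* Toy model of the fsync race. */
struct inode_model {
    bool full_sync;   /* stand-in for BTRFS_INODE_NEEDS_FULL_SYNC */
};

typedef void (*racer_fn)(struct inode_model *);

static void inode_lock_model(struct inode_model *in, racer_fn racer)
{
    racer(in);        /* the concurrent task wins the race to the lock */
}

/* Old, racy order: sample the flag, then lock.  A setter that runs in
 * the window between the two is missed and fsync wrongly works in
 * ranged mode. */
static bool fsync_sees_full_sync_racy(struct inode_model *in, racer_fn racer)
{
    bool full = in->full_sync;     /* read before the lock: may be stale */
    inode_lock_model(in, racer);
    return full;
}

/* Fixed order: lock first, then read the flag under the lock. */
static bool fsync_sees_full_sync_fixed(struct inode_model *in, racer_fn racer)
{
    inode_lock_model(in, racer);
    return in->full_sync;          /* read under the lock: sees the setter */
}
```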
      
      Fixes: 0c713cba ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ba0b084a
  12. 16 Oct, 2019 1 commit
    • Q
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Qu Wenruo authored
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However, there is an inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even if it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means, if btrfs_delalloc_release_extents() determines to free some
      space, then that space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is slow,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
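      The rule can be shown with a toy ledger. All helper names and amounts
      below are illustrative stand-ins mirroring the changelog, not the
      kernel API:

```c
/* Toy ledger for the two qgroup metadata reserve types. */
static long prealloc_bytes;   /* PREALLOC: reserved before joining a transaction */
static long pertrans_bytes;   /* PERTRANS: owned by a running transaction */

static void qgroup_reserve_meta_prealloc(long n) { prealloc_bytes += n; }
static void qgroup_free_meta_prealloc(long n)    { prealloc_bytes -= n; }

/* The wrong call for an unused reservation: the bytes migrate to
 * PERTRANS and only come back when a transaction commits, which is how
 * premature EDQUOT could appear. */
static void qgroup_convert_reserved_meta(long n)
{
    prealloc_bytes -= n;
    pertrans_bytes += n;
}

/* After the fix: the release path has no @qgroup_free parameter any
 * more, and space it decides to release is always freed immediately. */
static void delalloc_release_extents(long n)
{
    qgroup_free_meta_prealloc(n);
}
```

      Running the 64K-write example through this ledger: converting an
      unused reservation leaves stale PERTRANS bytes behind, while always
      freeing returns the ledger to zero.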
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8702ba93
  13. 02 Oct, 2019 1 commit
    • F
      Btrfs: fix memory leak due to concurrent append writes with fiemap · c67d970f
      Filipe Manana authored
      When a buffered write that starts at an offset greater than or equal
      to the file's size happens concurrently with a full ranged fiemap, we
      can end up leaking an extent state structure.
      
      Suppose we have a file with a size of 1Mb, and before the buffered write
      and fiemap are performed, it has a single extent state in its io tree
      representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
      
      The following sequence diagram shows how the memory leak happens if a
      fiemap and a buffered write, starting at offset 1Mb and with a length
      of 4Kb, are performed concurrently.
      
                CPU 1                                                  CPU 2
      
        extent_fiemap()
          --> it's a full ranged fiemap
              range from 0 to LLONG_MAX - 1
              (9223372036854775807)
      
          --> locks range in the inode's
              io tree
            --> after this we have 2 extent
                states in the io tree:
                --> 1 for range [0, 1Mb[ with
                    the bits EXTENT_LOCKED and
                    EXTENT_DELALLOC_BITS set
                --> 1 for the range
                    [1Mb, LLONG_MAX[ with
                    the EXTENT_LOCKED bit set
      
                                                        --> start buffered write at offset
                                                            1Mb with a length of 4Kb
      
                                                        btrfs_file_write_iter()
      
                                                          btrfs_buffered_write()
                                                            --> cached_state is NULL
      
                                                            lock_and_cleanup_extent_if_need()
                                                              --> returns 0 and does not lock
                                                                  range because it starts
                                                                  at current i_size / eof
      
                                                            --> cached_state remains NULL
      
                                                            btrfs_dirty_pages()
                                                              btrfs_set_extent_delalloc()
                                                                (...)
                                                                __set_extent_bit()
      
                                                                  --> splits extent state for range
                                                                      [1Mb, LLONG_MAX[ and now we
                                                                      have 2 extent states:
      
                                                                      --> one for the range
                                                                          [1Mb, 1Mb + 4Kb[ with
                                                                          EXTENT_LOCKED set
                                                                      --> another one for the range
                                                                          [1Mb + 4Kb, LLONG_MAX[ with
                                                                          EXTENT_LOCKED set as well
      
                                                                  --> sets EXTENT_DELALLOC on the
                                                                      extent state for the range
                                                                      [1Mb, 1Mb + 4Kb[
                                                                  --> caches extent state
                                                                      [1Mb, 1Mb + 4Kb[ into
                                                                      @cached_state because it has
                                                                      the bit EXTENT_LOCKED set
      
                                                          --> btrfs_buffered_write() ends up
                                                              with a non-NULL cached_state and
                                                              never calls anything to release its
                                                              reference on it, resulting in a
                                                              memory leak
      
      Fix this by calling free_extent_state() on cached_state if the range was
      not locked by lock_and_cleanup_extent_if_need().
      
      The same issue can happen if anything else other than fiemap locks a range
      that covers eof and beyond.
      
      This could be triggered, sporadically, by test case generic/561 from the
      fstests suite, which makes duperemove run concurrently with fsstress, and
      duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
      the leak is reported in dmesg/syslog when removing the btrfs module with
      a message like the following:
      
        [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
      
      Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
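      A reduced model of the fixed path (stand-in types and helpers, not
      kernel code; the counter mimics the debug-build leak check quoted
      above):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the "BTRFS: state leak" debug counter checked on
 * module unload with CONFIG_BTRFS_DEBUG. */
static int live_extent_states;

struct extent_state { int refs; };

static struct extent_state *alloc_extent_state(void)
{
    struct extent_state *s = calloc(1, sizeof(*s));
    s->refs = 1;
    live_extent_states++;
    return s;
}

static void free_extent_state(struct extent_state *s)
{
    if (s && --s->refs == 0) {
        free(s);
        live_extent_states--;
    }
}

/* Reduced tail of the write path: when the write starts at/after EOF
 * the range is not locked, but btrfs_set_extent_delalloc() may still
 * cache the EXTENT_LOCKED state created by a concurrent fiemap. */
static void buffered_write_tail(bool extents_locked, bool apply_fix)
{
    /* models the reference taken when the state is cached */
    struct extent_state *cached_state = alloc_extent_state();

    if (extents_locked) {
        free_extent_state(cached_state);   /* dropped on unlock, as before */
    } else if (apply_fix) {
        free_extent_state(cached_state);   /* the fix: drop it anyway */
    }
    /* unfixed + unlocked: the reference is simply forgotten -> leak */
}
```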
      
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c67d970f
  14. 09 Sep, 2019 7 commits