1. 14 Feb, 2023 1 commit
    • btrfs: sysfs: update fs features directory asynchronously · b7625f46
      Qu Wenruo committed
      [BUG]
      Since the introduction of per-fs feature sysfs interface
      (/sys/fs/btrfs/<UUID>/features/), the content of that directory is never
      updated.
      
      Thus for the following case, that directory will not show the new
      features like RAID56:
      
        # mkfs.btrfs -f $dev1 $dev2 $dev3
        # mount $dev1 $mnt
        # btrfs balance start -f -mconvert=raid5 $mnt
        # ls /sys/fs/btrfs/$uuid/features/
        extended_iref  free_space_tree  no_holes  skinny_metadata
      
      After unmount and mount, we get the correct features:
      
        # umount $mnt
        # mount $dev1 $mnt
        # ls /sys/fs/btrfs/$uuid/features/
        extended_iref  free_space_tree  no_holes  raid56 skinny_metadata
      
      [CAUSE]
      Because we never really try to update the content of per-fs features/
      directory.
      
      We had an attempt to update the features directory dynamically in commit
      14e46e04 ("btrfs: synchronize incompat feature bits with sysfs
      files"), but unfortunately it got reverted in commit e410e34f
      ("Revert "btrfs: synchronize incompat feature bits with sysfs files"").
      The problem in the original patch is that, in the context of
      btrfs_create_chunk(), we can not afford to update the sysfs group.
      
      The exported but never utilized function, btrfs_sysfs_feature_update(),
      is the leftover of that attempt.  Even if we use sysfs_update_group(),
      new files will need extra memory allocation, and we have no way to
      make the sysfs update use GFP_NOFS.
      
      [FIX]
      This patch will address the old problem by doing asynchronous sysfs
      update in the cleaner thread.
      
      This involves the following changes:
      
      - Make __btrfs_(set|clear)_fs_(incompat|compat_ro) helpers set the
        BTRFS_FS_FEATURE_CHANGED flag when needed
      
      - Update btrfs_sysfs_feature_update() to use sysfs_update_group()
        And drop unnecessary arguments.
      
      - Call btrfs_sysfs_feature_update() in cleaner_kthread
        If we have the BTRFS_FS_FEATURE_CHANGED flag set.
      
      - Wake up cleaner_kthread in btrfs_commit_transaction if we have
        BTRFS_FS_FEATURE_CHANGED flag
      
      With this, all the previously dangerous call sites like
      btrfs_create_chunk() need no new changes, as the above helpers will
      have already set the BTRFS_FS_FEATURE_CHANGED flag.
      
      The real work happens in cleaner_kthread, so we pay the cost of
      delaying the update to the sysfs directory. The delay should be
      small enough that the end user can not notice it, though it might
      grow if the cleaner thread is busy removing subvolumes or running
      defrag.
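
The deferred-update pattern described above can be sketched in userspace Python. This is only a toy model under stated assumptions, not the kernel code: the class `FsInfo` and its methods are hypothetical names, the boolean `feature_changed` stands in for BTRFS_FS_FEATURE_CHANGED, and a plain set stands in for the contents of the features/ directory.

```python
import threading


class FsInfo:
    """Toy stand-in for a mounted filesystem (all names are hypothetical)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.incompat = {"extended_iref", "skinny_metadata"}
        self.feature_changed = False  # models BTRFS_FS_FEATURE_CHANGED
        self.sysfs_features = set(self.incompat)  # models features/ contents

    def set_incompat(self, name):
        """Cheap path: record the feature and raise the flag, nothing else."""
        with self.lock:
            if name not in self.incompat:
                self.incompat.add(name)
                self.feature_changed = True

    def cleaner_pass(self):
        """Slow path, run later from a context where the update is safe."""
        with self.lock:
            if not self.feature_changed:
                return False
            self.feature_changed = False
            snapshot = set(self.incompat)
        self.sysfs_features = snapshot  # models the sysfs group update
        return True


fs = FsInfo()
fs.set_incompat("raid56")              # e.g. from the chunk allocation path
stale = "raid56" in fs.sysfs_features  # False: directory not refreshed yet
fs.cleaner_pass()                      # later, the cleaner catches up
fresh = "raid56" in fs.sysfs_features  # True
```

The point of the split is that the caller in a restricted context only flips a flag, while the expensive update runs where allocation is allowed.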
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 06 Dec, 2022 6 commits
  3. 23 Nov, 2022 1 commit
  4. 26 Sep, 2022 6 commits
    • btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction() · 011b46c3
      Qu Wenruo committed
      Btrfs qgroup has a long history of bringing performance penalty in
      btrfs_commit_transaction().
      
      Although we tried our best to mitigate such impact, there is still an
      unsolved call site, btrfs_drop_snapshot().
      
      This function will find the highest shared tree block and modify its
      extent ownership to do a subvolume/snapshot dropping.
      
      Such change will affect the whole subtree, and cause tons of qgroup
      dirty extents and stall btrfs_commit_transaction().
      
      To avoid such problem, here we introduce a new sysfs interface,
      /sys/fs/btrfs/<uuid>/qgroups/drop_subtree_threshold, to determine
      whether and at which level we should skip qgroup accounting for subtree
      dropping.
      
      The default value is BTRFS_MAX_LEVEL, thus every subtree drop will go
      through qgroup accounting, to ensure qgroup numbers are kept as
      consistent as possible.
      
      For performance sensitive cases, users can change the value to
      something more reasonable like 3, so that dropping any subtree at or
      above level 3 marks qgroup inconsistent and skips the accounting.
      
      The cost is obvious: the qgroup numbers are no longer consistent, but at
      least performance is more reasonable, and users have the control.
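
The threshold rule can be illustrated with a small Python sketch. It is a simplified model, not the kernel implementation: the function name `trace_subtree` is hypothetical, and BTRFS_MAX_LEVEL is assumed to be 8 here, with tree levels running from 0 to 7, so the default never skips accounting.

```python
BTRFS_MAX_LEVEL = 8  # assumed value for this sketch; levels run 0..7


def trace_subtree(level, threshold=BTRFS_MAX_LEVEL):
    """Return (accounted, inconsistent) for dropping a subtree at `level`."""
    if level >= threshold:
        # Too high: skip the expensive scan and mark qgroup inconsistent.
        return (False, True)
    # Low enough: do the full qgroup accounting, numbers stay consistent.
    return (True, False)


default_case = trace_subtree(level=7)             # (True, False)
tuned_high = trace_subtree(level=3, threshold=3)  # (False, True)
tuned_low = trace_subtree(level=2, threshold=3)   # (True, False)
```

With the default threshold every drop is accounted; lowering it trades consistent qgroup numbers for shorter commits.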
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: introduce global qgroup attribute group · ed2e35d8
      Qu Wenruo committed
      Although we already have info kobject for each qgroup, we don't have
      global qgroup info attributes to show things like enabled or
      inconsistent status flags.
      
      Add this qgroups attribute group, and the first member is qgroup_flags,
      which is a read-only attribute to show human readable qgroup flags.
      
      The path is:
        /sys/fs/btrfs/<uuid>/qgroups/enabled
        /sys/fs/btrfs/<uuid>/qgroups/inconsistent
      
      The output is simple, just 1 or 0.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the unnecessary result variables · bd64f622
      zhang songyi committed
      Return the sysfs_emit() and iterate_object_props() directly instead of
      using unnecessary variables.
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2 · 1c56ab99
      Qu Wenruo committed
      The problem of long mount time caused by block group item search is
      already known for some time, and the solution of block group tree has
      been proposed.
      
      There is really no need to bind this feature to extent tree v2, just
      introduce a compat RO flag, BLOCK_GROUP_TREE, to correctly solve the
      problem.
      
      All the code handling block group root is already in the upstream
      kernel, thus this patch really only needs to introduce the new compat RO
      flag.
      
      This patch introduces one extra artificial limitation on block group
      tree feature, that free space cache v2 and no-holes feature must be
      enabled to use this new compat RO feature.
      
      This artificial requirement is mostly to reduce the test combinations,
      and can be a guideline for future features, to mostly rely on the latest
      default features.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: show discard stats and tunables in non-debug build · fb731430
      David Sterba committed
      When discard=async was introduced there were also sysfs knobs and stats
      for debugging and tuning, hidden under CONFIG_BTRFS_DEBUG. The defaults
      have been set and so far seem to satisfy all users on a range of
      workloads. There are not only tunables (like iops or kbps) but also
      stats tracking the amount of discardable bytes, and these should be
      available whenever async discard is on (and not otherwise).
      
      The stats are moved from the per-fs debug directory, so they are now under
        /sys/fs/btrfs/FSID/discard
      
      - discard_bitmap_bytes - amount of discarded bytes from data tracked as
                               bitmaps
      - discard_extent_bytes - ditto but as extents
      - discard_bytes_saved -
      - discardable_bytes - amount of bytes that can be discarded
      - discardable_extents - number of extents to be discarded
      - iops_limit - tunable limit of number of discard IOs to be issued
      - kbps_limit - tunable limit of kilobytes per second issued as discard IO
      - max_discard_size - tunable limit for size of one IO discard request
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: use sysfs_streq for string matching · 7f298f22
      David Sterba committed
      We have our own string matching helper that duplicates what sysfs_streq
      does, with the slight difference that it skips initial whitespace. So far
      this is used for the drive allocation policy. Initial whitespace in
      written sysfs values should be discouraged, and we should use the
      standard helper.
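
To make the behavioral difference concrete, here is a Python model of the comparison rule, written under the assumption that sysfs_streq treats end-of-string and a single newline followed by end-of-string as equivalent terminations. The function mirrors that rule only for illustration, and the value "policy_a" is a hypothetical placeholder, not a real policy name.

```python
def sysfs_streq(s1: str, s2: str) -> bool:
    """Model: equal if the strings match up to an optional final newline."""
    i = 0
    # Walk the common prefix of both strings.
    while i < len(s1) and i < len(s2) and s1[i] == s2[i]:
        i += 1
    rest1, rest2 = s1[i:], s2[i:]
    if rest1 == rest2:  # both exhausted at the same point
        return True
    # One side ended, the other has exactly one trailing newline left.
    return (rest1 == "" and rest2 == "\n") or (rest1 == "\n" and rest2 == "")


with_newline = sysfs_streq("policy_a\n", "policy_a")  # True, as from `echo`
leading_space = sysfs_streq(" policy_a", "policy_a")  # False, no skipping
```

Under this rule a value written with `echo` still matches, while a value with leading whitespace no longer does.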
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 25 Jul, 2022 7 commits
  6. 16 May, 2022 3 commits
  7. 06 May, 2022 1 commit
  8. 14 Mar, 2022 2 commits
  9. 07 Jan, 2022 1 commit
  10. 27 Oct, 2021 1 commit
  11. 23 Aug, 2021 4 commits
    • btrfs: sysfs: document structures and their associated files · e7849e33
      Anand Jain committed
      The sysfs file has grown big. It takes some time to locate the correct
      struct attribute to add new files. Create a table and map the struct
      attribute to its sysfs path.
      
      Also, fix the comment about the debug sysfs path, and add the comments
      to the attributes instead of the attribute group, where sysfs file names
      are defined.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: allow disabling of zone auto reclaim · 77233c2d
      Johannes Thumshirn committed
      Automatically reclaiming dirty zones might not always be desired for all
      workloads, especially as there are currently still some rough edges with
      the relocation code on zoned filesystems.
      
      Allow disabling zone auto reclaim on a per filesystem basis by writing 0
      as the threshold value.
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: initial fsverity support · 14605409
      Boris Burkov committed
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verify pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then roll back to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow read-write for 4K sectorsize on 64K page size systems · 95ea0486
      Qu Wenruo committed
      Since now we support data and metadata read-write for subpage, remove
      the RO requirement for subpage mount.
      
      There are some extra limitations though:
      
      - For now, subpage RW mount is still considered experimental
        Thus that mount warning will still be there.
      
      - No compression support
        There are still quite a few hard coded PAGE_SIZE uses, and quite a few
        call sites use extent_clear_unlock_delalloc() to unlock locked_page.
        This will screw up subpage helpers.
      
        Now for subpage RW mount, no matter what mount option or inode attr is
        set, no writes will be compressed.  Reading compressed data has no
        problem, though.
      
      - No defrag for subpage case
        The defrag support for subpage case will come in later patches, which
        will also rework the defrag workflow.
      
      - No inline extent will be created
        This is mostly due to the fact that filemap_fdatawrite_range() will
        trigger more writes than the range specified.
        In fallocate calls, this behavior can make us write back data which
        could be inlined, before we enlarge the i_size.
      
        This is a very special corner case, and even current btrfs check won't
        report error on such inline extent + regular extent.
        But considering how much effort has been put to prevent such inline +
        regular, I'd prefer to cut off inline extent completely until we have
        a good solution.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 22 Jun, 2021 2 commits
  13. 21 Jun, 2021 2 commits
    • btrfs: sysfs: fix format string for some discard stats · 8c5ec995
      David Sterba committed
      The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the
      format should be %llu, though the actual values would hardly ever
      overflow to negative values.
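
The effect of a mismatched conversion can be reproduced outside the kernel. The Python sketch below assumes the old format used a signed 64-bit conversion; it reinterprets an unsigned 64-bit counter as signed to show where the printed value would turn negative. The helper name `as_signed_64` is hypothetical.

```python
import struct


def as_signed_64(value: int) -> int:
    """Reinterpret the bits of an unsigned 64-bit value as signed 64-bit."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


small = as_signed_64(4096)  # unchanged: 4096
huge = as_signed_64(2**63)  # wraps to the most negative signed value
```

Only counters at or above 2**63 bytes are affected, which is why the commit calls the overflow unlikely in practice.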
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: per-device bandwidth control · eb3b5053
      David Sterba committed
      Add sysfs interface to limit io during scrub. We relied on the ionice
      interface to do that, eg. the idle class kept the system usable while
      scrub was running. This changed when mq-deadline became widespread, as
      it does not implement the scheduling classes. That was a CFQ thing that
      got deleted. We've got numerous complaints from users about degraded
      performance.
      
      Currently only BFQ supports that but it's not a common scheduler and we
      can't ask everybody to switch to it.
      
      Alternatively the cgroup io limiting can be used but that is also a
      non-trivial setup (v2 required, the controller must be enabled on the
      system). This can still be used if desired.
      
      Other ideas that have been explored: piggy-back on ionice (that is set
      per-process and is accessible) and interpret the class and classdata as
      bandwidth limits, but this does not have enough flexibility as there are
      only 8 allowed and we'd have to map fixed limits to each value. Also
      adjusting the value would need to look up the process that currently runs
      scrub on the given device, and the value is not sticky so would have to
      be adjusted each time scrub runs.
      
      Running out of options, sysfs does not look that bad:
      
      - it's accessible from scripts, or udev rules
      - the name is similar to what MD-RAID has
        (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
      - the value is sticky at least for filesystem mount time
      - adjusting the value has immediate effect
      - sysfs is available in constrained environments (eg. system rescue)
      - the limit also applies to device replace
      
      Sysfs:
      
      - raw value is in bytes
      - values written to the file accept suffixes like K, M
      - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
      - 0 means use default priority of IO
      
      The scheduler is a simple deadline one and the accuracy is up to nearest
      128K.
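
As an illustration of the value format, the Python sketch below converts a written limit with an optional size suffix into raw bytes. It is an assumption-laden model, not the kernel parser: the suffixes are taken to be binary multiples, "G" is included beyond the K and M named above, and the function name `parse_limit` is hypothetical.

```python
# Assumed multipliers; the commit message only names K and M explicitly.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_limit(text: str) -> int:
    """Convert a string such as '100M' into a raw byte value."""
    text = text.strip()  # drop the newline that `echo` appends
    if text and text[-1].upper() in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1].upper()]
    return int(text)


limit = parse_limit("100M\n")  # 104857600 bytes
unlimited = parse_limit("0")   # 0: use the default IO priority
```

A script could compute the value this way before writing it to the per-device scrub_speed_max file.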
      Signed-off-by: David Sterba <dsterba@suse.com>
  14. 21 Apr, 2021 1 commit
    • btrfs: zoned: automatically reclaim zones · 18bb8bbf
      Johannes Thumshirn committed
      When a file gets deleted on a zoned file system, the space freed is not
      returned back into the block group's free space, but is migrated to
      zone_unusable.
      
      As this zone_unusable space is behind the current write pointer it is not
      possible to use it for new allocations. In the current implementation a
      zone is reset once all of the block group's space is accounted as zone
      unusable.
      
      This behaviour can lead to premature ENOSPC errors on a busy file system.
      
      Instead of only reclaiming the zone once it is completely unusable,
      kick off a reclaim job once the amount of unusable bytes exceeds a user
      configurable threshold between 51% and 100%. It can be set per mounted
      filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
      by default.
      
      Similar to reclaiming unused block groups, these dirty block groups are
      added to a to_reclaim list and then on a transaction commit, the reclaim
      process is triggered but after we deleted unused block groups, which will
      free space for the relocation process.
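
The reclaim trigger reduces to a percentage check, sketched below in Python. This is a simplified model: `should_reclaim` is a hypothetical name, the strict comparison follows the word "exceeds" above, and the exact boundary handling in the kernel is not asserted here.

```python
MIB = 1024 * 1024


def should_reclaim(zone_unusable: int, length: int,
                   threshold_pct: int = 75) -> bool:
    """True once unusable bytes exceed threshold_pct percent of the group."""
    # Integer arithmetic avoids floating point rounding on large sizes.
    return zone_unusable * 100 > length * threshold_pct


mostly_dead = should_reclaim(200 * MIB, 256 * MIB)  # about 78%: True
half_dead = should_reclaim(128 * MIB, 256 * MIB)    # 50%: False
```

Raising the threshold towards 100 approaches the old behaviour of waiting until the whole block group is unusable.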
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  15. 19 Apr, 2021 1 commit
  16. 09 Feb, 2021 1 commit
    • btrfs: zoned: track unusable bytes for zones · 169e0da9
      Naohiro Aota committed
      In a zoned filesystem a once written then freed region is not usable
      until the underlying zone has been reset. So we need to distinguish such
      unusable space from usable free space.
      
      Therefore we need to introduce the "zone_unusable" field to the block
      group structure, and "bytes_zone_unusable" to the space_info structure
      to track the unusable space.
      
      Pinned bytes are always reclaimed to the unusable space. But when an
      allocated region is returned before being used, e.g. because the block
      group becomes read-only between allocation time and reservation time,
      we can safely return the region to the block group. For this situation,
      this commit introduces "btrfs_add_free_space_unused". This behaves the
      same as btrfs_add_free_space() on regular filesystems. On zoned
      filesystems, it rewinds the allocation offset.
      
      Because the read-only bytes track free but unusable bytes when the block
      group is read-only, we need to migrate the zone_unusable bytes to
      read-only bytes when a block group is marked read-only.
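
The two ways of returning space can be contrasted with a toy Python model. Everything here is a hypothetical sketch: the class `ZonedBlockGroup` and its methods are invented for illustration, it assumes the unused region is the most recent allocation, and it leaves out the read-only migration.

```python
class ZonedBlockGroup:
    """Toy sequential-write block group; all names are hypothetical."""

    def __init__(self, length: int):
        self.length = length
        self.alloc_offset = 0   # next allocation starts here
        self.zone_unusable = 0  # written then freed, needs a zone reset

    def free_space(self) -> int:
        """Only the space past the allocation offset can be allocated."""
        return self.length - self.alloc_offset

    def allocate(self, size: int) -> int:
        if size > self.free_space():
            raise ValueError("not enough space past the allocation offset")
        start = self.alloc_offset
        self.alloc_offset += size
        return start

    def free_written(self, size: int) -> None:
        """Freed after being written: the space becomes zone_unusable."""
        self.zone_unusable += size

    def free_unused(self, size: int) -> None:
        """Returned before any write: rewind the allocation offset."""
        self.alloc_offset -= size


bg = ZonedBlockGroup(100)
bg.allocate(30)
bg.free_written(30)            # 30 bytes now unusable until a reset
after_write = bg.free_space()  # 70: the freed bytes did not come back
bg.allocate(20)
bg.free_unused(20)             # never written, so it can be handed back
after_unused = bg.free_space() # 70 again
```

The model shows why the two paths must be distinguished: only the unused return gives space back without a zone reset.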
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>