1. 26 September 2022, 33 commits
    • btrfs: scrub: move logical/physical/dev/mirror_num from scrub_sector to scrub_block · 8686c40e
      Committed by Qu Wenruo
      Currently we store the following members in scrub_sector:
      
      - logical
      - physical
      - physical_for_dev_replace
      - dev
      - mirror_num
      
      However the current scrub code ensures that a scrub_block never
      crosses a stripe boundary; this is guaranteed by the entry functions
      (scrub_simple_mirror, scrub_simple_stripe).
      
      This makes it possible to move those members into scrub_block rather
      than keeping a copy in every scrub_sector.
      
      This should save quite some memory, as a scrub_block can hold as many
      as 64 sectors; even for metadata it is 16 sectors by default.
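      
      A rough before/after sketch of the change, based only on the member
      list above (all other fields omitted, this is not the full struct):
      
      	/* Before: one copy of these members per sector. */
      	struct scrub_sector {
      		struct scrub_block *sblock;
      		u64 logical;
      		u64 physical;
      		u64 physical_for_dev_replace;
      		struct btrfs_device *dev;
      		int mirror_num;
      	};
      
      	/* After: stored once in the containing scrub_block. */
      	struct scrub_block {
      		u64 logical;
      		u64 physical;
      		u64 physical_for_dev_replace;
      		struct btrfs_device *dev;
      		int mirror_num;
      	};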
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove scrub_sector::page and use scrub_block::pages instead · eb2fad30
      Committed by Qu Wenruo
      Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases,
      it will allocate one page for each scrub_sector, which can cause extra
      unnecessary memory usage.
      
      Utilize scrub_block::pages[] instead of allocating a page for each
      scrub_sector; this allows us to integrate larger extents while using
      less memory.
      
      For example, if our page size is 64K, sectorsize is 4K, and we get a
      32K sized extent, we will allocate only one page for the scrub_block,
      and all 8 scrub_sectors will point into that page.
      
      To do that properly, here we introduce several small helpers:
      
      - scrub_page_get_logical()
        Get the logical bytenr of a page.
        We store the logical bytenr of the page range in page::private.
        But on 32-bit systems (void *) is not large enough to contain a
        u64, so in that case we need to allocate extra memory for it.
      
        On 64-bit systems we can use page::private directly.
      
      - scrub_block_get_logical()
        Just get the logical bytenr of the first page.
      
      - scrub_sector_get_page()
        Return the page which the scrub_sector points to.
      
      - scrub_sector_get_page_offset()
        Return the offset inside the page which the scrub_sector points to.
      
      - scrub_sector_get_kaddr()
        Return the address which the scrub_sector points to; just a wrapper
        combining scrub_sector_get_page() and scrub_sector_get_page_offset().
      
      - bio_add_scrub_sector()
      
      Please note that, even with this patch, we're still allocating one page
      for one sector for data extents.
      
      This is because in scrub_extent() we split the data extent using
      sectorsize.
      
      Reducing the memory usage further will need extra work to make scrub
      behave like the regular data read path and only use the needed
      sector(s).
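      
      A minimal sketch of the page helpers described above, assuming each
      scrub_sector records its byte offset inside the block (the exact
      member names may differ):
      
      	static struct page *scrub_sector_get_page(struct scrub_sector *ssector)
      	{
      		struct scrub_block *sblock = ssector->sblock;
      
      		/* Find the backing page inside scrub_block::pages[]. */
      		return sblock->pages[ssector->offset >> PAGE_SHIFT];
      	}
      
      	static unsigned int scrub_sector_get_page_offset(struct scrub_sector *ssector)
      	{
      		return offset_in_page(ssector->offset);
      	}
      
      	static void *scrub_sector_get_kaddr(struct scrub_sector *ssector)
      	{
      		/* These pages are never highmem, so page_address() is fine. */
      		return page_address(scrub_sector_get_page(ssector)) +
      		       scrub_sector_get_page_offset(ssector);
      	}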
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce scrub_block::pages for more efficient memory usage for subpage · f3e01e0e
      Committed by Qu Wenruo
      [BACKGROUND]
      Currently for scrub we allocate one page per sector; this is fine for
      the PAGE_SIZE == sectorsize case, but can waste extra memory for
      subpage support.
      
      [CODE CHANGE]
      Make scrub_block contain all the pages, so if we're scrubbing an extent
      sized 64K, and our page size is also 64K, we only need to allocate one
      page.
      
      [LIFESPAN CHANGE]
      Since scrub_sector no longer holds a page of its own but uses
      scrub_block::pages[] instead, we have to ensure scrub_block lives
      long enough for the write bio; the lifespan for the read bio is
      already long enough.
      
      Now scrub_block will only be released after the write bio has finished.
      
      [COMING NEXT]
      This patch only adds scrub_block::pages[] for this purpose;
      scrub_sector is still using the old scrub_sector::page.
      
      The switch will happen in the next patch.
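      
      A sketch of the new member, with assumed array bounds (the real names
      and limits may differ):
      
      	struct scrub_block {
      		/*
      		 * One page can now back multiple sectors when
      		 * PAGE_SIZE > sectorsize, instead of one page per sector.
      		 */
      		struct page *pages[SCRUB_MAX_PAGES];
      		struct scrub_sector *sectors[SCRUB_MAX_SECTORS_PER_BLOCK];
      		int sector_count;
      		refcount_t refs;
      		/* Other members unchanged. */
      	};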
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: factor out allocation and initialization of scrub_sector into helper · 5dd3d8e4
      Committed by Qu Wenruo
      The allocation and initialization are shared by 3 call sites, and
      we're going to change the initialization of some members in the
      upcoming patches.
      
      So factor out the allocation and initialization of scrub_sector into
      a helper, alloc_scrub_sector(), which will do the following work (see
      the sketch after this list):
      
      - Allocate the memory for scrub_sector
      
      - Allocate a page for scrub_sector::page
      
      - Initialize scrub_sector::refs to 1
      
      - Attach the allocated scrub_sector to scrub_block
        The attachment is bidirectional: scrub_block::sectors[] is updated,
        and scrub_sector::sblock is set as well.
      
      - Update scrub_block::sector_count and do extra sanity check on it
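      
      A hedged sketch of the helper (the real signature and error handling
      may differ):
      
      	static struct scrub_sector *alloc_scrub_sector(struct scrub_block *sblock,
      						       gfp_t gfp)
      	{
      		struct scrub_sector *ssector;
      
      		ssector = kzalloc(sizeof(*ssector), gfp);
      		if (!ssector)
      			return NULL;
      
      		/* At this point in the series each sector still owns a page. */
      		ssector->page = alloc_page(gfp);
      		if (!ssector->page) {
      			kfree(ssector);
      			return NULL;
      		}
      		atomic_set(&ssector->refs, 1);
      
      		/* Bidirectional attachment to the scrub_block. */
      		ssector->sblock = sblock;
      		ASSERT(sblock->sector_count < SCRUB_MAX_SECTORS_PER_BLOCK);
      		sblock->sectors[sblock->sector_count] = ssector;
      		sblock->sector_count++;
      
      		return ssector;
      	}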
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: factor out initialization of scrub_block into helper · 15b88f6d
      Committed by Qu Wenruo
      Although there are only two callers, we are going to add new members
      to scrub_block in the upcoming patches.  Factoring out the
      initialization code will make later expansion easier.
      
      One thing to note is that even though scrub_handle_errored_block()
      doesn't utilize scrub_block::refs, we still use alloc_scrub_block()
      to initialize sblock->refs, allowing us to use scrub_block_put() for
      cleanup.
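      
      A minimal sketch of the factored-out helper (the arguments are an
      assumption; later patches in the series extend it):
      
      	static struct scrub_block *alloc_scrub_block(struct scrub_ctx *sctx,
      						     gfp_t gfp)
      	{
      		struct scrub_block *sblock;
      
      		sblock = kzalloc(sizeof(*sblock), gfp);
      		if (!sblock)
      			return NULL;
      
      		/* Always refcounted, so scrub_block_put() works for cleanup. */
      		refcount_set(&sblock->refs, 1);
      		sblock->sctx = sctx;
      		return sblock;
      	}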
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: use pointer array to replace sblocks_for_recheck · 1dfa5005
      Committed by Qu Wenruo
      In function scrub_handle_errored_block(), we use the
      @sblocks_for_recheck pointer to hold one scrub_block for each mirror,
      and use kcalloc() to allocate the array.
      
      But a single pointer to an array is hard to read, since member access
      is done by pointer arithmetic rather than [].
      
      Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS]; this
      will slightly increase the stack memory usage.
      
      Since scrub_handle_errored_block() is not called iteratively, this
      extra cost is completely acceptable.
      
      While at it, also set sblock->refs and use scrub_block_put() to clean
      them up, as later patches will add extra members to scrub_block that
      need scrub_block_put() to release them.
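      
      An illustrative shape of the change (not the literal diff):
      
      	/* Before: one kcalloc'ed chunk, indexed by pointer arithmetic. */
      	struct scrub_block *sblocks_for_recheck;
      
      	sblocks_for_recheck = kcalloc(BTRFS_MAX_MIRRORS,
      				      sizeof(*sblocks_for_recheck), GFP_KERNEL);
      	sblock_other = sblocks_for_recheck + mirror_index;
      
      	/* After: an on-stack array of pointers to refcounted blocks. */
      	struct scrub_block *sblocks_for_recheck[BTRFS_MAX_MIRRORS] = { NULL };
      
      	sblocks_for_recheck[mirror_index] = alloc_scrub_block(sctx, GFP_KERNEL);
      	sblock_other = sblocks_for_recheck[mirror_index];
      	/* On cleanup: */
      	scrub_block_put(sblocks_for_recheck[mirror_index]);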
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: add support for fs-verity · 38622010
      Committed by Boris Burkov
      Preserve the fs-verity status of a btrfs file across send/recv.
      
      There is no facility for installing the Merkle tree contents directly on
      the receiving filesystem, so we package up the parameters used to enable
      verity found in the verity descriptor. This gives the receive side
      enough information to properly enable verity again. Note that this means
      that receive will have to re-compute the whole Merkle tree, similar to
      how compression worked before encoded_write.
      
      Since the file becomes read-only after verity is enabled, it is
      important that verity is added to the send stream after any file writes.
      Therefore, when we process a verity item, merely note that it happened,
      then actually create the command in the send stream during
      'finish_inode_if_needed'.
      
      This also creates V3 of the send stream format, without any format
      changes besides adding the new commands and attributes.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use atomic_try_cmpxchg in free_extent_buffer · e5677f05
      Committed by Uros Bizjak
      Use `atomic_try_cmpxchg(ptr, &old, new)` instead of
      `atomic_cmpxchg(ptr, old, new) == old` in free_extent_buffer. This
      has two benefits:
      
      - The x86 cmpxchg instruction returns success in the ZF flag, so this
        change saves a compare after cmpxchg, as well as a related move
        instruction in front of the cmpxchg.
      
      - atomic_try_cmpxchg implicitly assigns the current *ptr value to
        *old when cmpxchg fails, enabling further code simplifications.
      
      This patch has no functional change.
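      
      A simplified illustration of the pattern (not the exact
      free_extent_buffer code):
      
      	int old = atomic_read(&eb->refs);
      
      	/* Old pattern: compare the returned value, re-read on failure. */
      	while (atomic_cmpxchg(&eb->refs, old, old - 1) != old)
      		old = atomic_read(&eb->refs);
      
      	/* New pattern: on failure, 'old' is refreshed automatically. */
      	old = atomic_read(&eb->refs);
      	while (!atomic_try_cmpxchg(&eb->refs, &old, old - 1))
      		;	/* just retry with the updated 'old' */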
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove impossible sanity checks · fc65bb53
      Committed by Qu Wenruo
      There are several sanity checks that can no longer trigger inside
      btrfs_scrub_dev().
      
      Since we have a mount-time check against the super block
      nodesize/sectorsize, and our fixed macros are hardcoded to handle
      even the worst combination, those sanity checks are no longer needed
      and can be easily removed.
      
      But this patch still keeps some ASSERT()s as a safety net, in case
      future feature changes make those impossible combinations possible
      again.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: delete btrfs_wait_space_cache_v1_finished · 527c490f
      Committed by Josef Bacik
      We used to use this in a few spots, but now it is only used inside
      block-group.c, so remove the helper and just open code it where it
      was used.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove lock protection for BLOCK_GROUP_FLAG_RELOCATING_REPAIR · 588a4868
      Committed by Josef Bacik
      Previously, when this modified the bit field, we had to protect it
      with bg->lock.  Now that we're using bit helpers we can stop taking
      bg->lock.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove BLOCK_GROUP_FLAG_HAS_CACHING_CTL · 7b9c293b
      Committed by Josef Bacik
      This is used mostly to determine if we need to look at the caching
      ctl list and clean up any references to this block group.  However we
      never clear this flag, specifically because we need to know whether
      we still have a caching ctl to remove for this block group.  This is
      in the remove block group path, which isn't a fast path, so the
      optimization doesn't really matter; simplify this logic and remove
      the flag.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify block group traversal in btrfs_put_block_group_cache · 50c31eaa
      Committed by Josef Bacik
      We're breaking out and re-searching for the next block group while
      evicting any of the block group cache inodes.  This is not needed, the
      block groups aren't disappearing here, we can simply loop through the
      block groups like normal and iput any inode that we find.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY · 9283b9e0
      Committed by Josef Bacik
      We use this during device replace for zoned devices.  We were simply
      taking the lock because the flag lived in a bit field and we needed
      the lock to be safe against other modifications of the bit field.
      With the bit helpers we no longer require that locking.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert block group bit field to use bit helpers · 3349b57f
      Committed by Josef Bacik
      We use a bit field in btrfs_block_group for different flags.  However
      this is awkward because we have to hold block_group->lock for any
      modification of these fields, and it makes the code clunky for a few
      of the flags.  Convert them to a proper flags setup so we can utilize
      the bit helpers.
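      
      A sketch of the direction of this conversion (the runtime_flags name
      follows this series; details are illustrative):
      
      	/* Before: a bool bit field, every modification under bg->lock. */
      	spin_lock(&bg->lock);
      	bg->to_copy = 1;
      	spin_unlock(&bg->lock);
      
      	/* After: one unsigned long plus atomic bit helpers, no lock. */
      	set_bit(BLOCK_GROUP_FLAG_TO_COPY, &bg->runtime_flags);
      
      	if (test_and_clear_bit(BLOCK_GROUP_FLAG_TO_COPY, &bg->runtime_flags))
      		finish_block_group_copy(bg);	/* hypothetical callee */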
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_info · 723de71d
      Committed by Josef Bacik
      We previously had the pattern of
      
      	btrfs_update_space_info(all, the, bg, fields, &space_info);
      	link_block_group(bg);
      	bg->space_info = space_info;
      
      Now that we're passing the bg into btrfs_add_bg_to_space_info we can do
      the linking in that function, transforming this to simply
      
      	btrfs_add_bg_to_space_info(fs_info, bg);
      
      and put the link_block_group() and bg->space_info assignment directly in
      btrfs_add_bg_to_space_info.
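      
      A hedged sketch of what the combined helper can look like (counter
      updates abbreviated):
      
      	void btrfs_add_bg_to_space_info(struct btrfs_fs_info *info,
      					struct btrfs_block_group *bg)
      	{
      		struct btrfs_space_info *space_info;
      
      		space_info = btrfs_find_space_info(info, bg->flags);
      
      		/* Fold the bg counters into the space_info totals. */
      		spin_lock(&space_info->lock);
      		space_info->total_bytes += bg->length;
      		space_info->bytes_used += bg->used;
      		/* ... more counters in the real patch ... */
      		spin_unlock(&space_info->lock);
      
      		/* Do the linking here instead of at every call site. */
      		bg->space_info = space_info;
      		link_block_group(bg);
      	}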
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify arguments of btrfs_update_space_info and rename · 9d4b0a12
      Committed by Josef Bacik
      This function has grown a bunch of new arguments, and it just boils down
      to passing in all the block group fields as arguments.  Simplify this by
      passing in the block group itself and updating the space_info fields
      based on the block group fields directly.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_fs_closing for background bg work · 2f12741f
      Committed by Josef Bacik
      For both unused bg deletion and async balance work we'll happily run if
      the fs is closing.  However I want to move these to their own worker
      thread, and they can be long running jobs, so add a check to see if
      we're closing and simply bail.
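      
      The check itself is tiny; roughly, at the top of each background
      worker:
      
      	/* Long-running background work, bail if the fs is unmounting. */
      	if (btrfs_fs_closing(fs_info))
      		return;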
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() · d1f68ba0
      Committed by Omar Sandoval
      btrfs_insert_file_extent() is only ever used to insert holes, so rename
      it and remove the redundant parameters.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: use sysfs_streq for string matching · 7f298f22
      Committed by David Sterba
      We have our own string matching helper that duplicates what
      sysfs_streq does, with the slight difference that it skips leading
      whitespace.  So far this is only used for the drive allocation
      policy.  Leading whitespace in written sysfs values should rather be
      discouraged, so switch to the standard helper.
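      
      sysfs_streq() compares like strcmp() but also treats a single
      trailing newline as the end of the string, so a store handler can
      match written values directly; a sketch (the "pid" policy value is
      just an example):
      
      	/* Matches both "pid" and "pid\n" as written by echo. */
      	if (sysfs_streq(buf, "pid"))
      		fs_devices->read_policy = BTRFS_READ_POLICY_PID;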
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: try to fix super block errors · f9eab5f0
      Committed by Qu Wenruo
      [BUG]
      The following script shows that, although scrub can detect super block
      errors, it never tries to fix it:
      
      	mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
      	xfs_io -c "pwrite 67108864 4k" $dev2
      
      	mount $dev1 $mnt
      	btrfs scrub start -B $dev2
      	btrfs scrub start -Br $dev2
      	umount $mnt
      
      The first scrub reports the super error correctly:
      
        scrub done for f3289218-abd3-41ac-a630-202f766c0859
        Scrub started:    Tue Aug  2 14:44:11 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   1.26GiB
        Rate:             0.00B/s
        Error summary:    super=1
          Corrected:      0
          Uncorrectable:  0
          Unverified:     0
      
      But the second read-only scrub still reports the same super error:
      
        Scrub started:    Tue Aug  2 14:44:11 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   1.26GiB
        Rate:             0.00B/s
        Error summary:    super=1
          Corrected:      0
          Uncorrectable:  0
          Unverified:     0
      
      [CAUSE]
      The comment already claims that super block errors can be easily
      fixed by committing a transaction:
      
      	/*
      	 * If we find an error in a super block, we just report it.
      	 * They will get written with the next transaction commit
      	 * anyway
      	 */
      
      But the truth is, this assumption does not always hold, and since
      scrub should try to repair every error it finds (except in read-only
      scrub), we should actively commit a transaction to fix this.
      
      [FIX]
      Just commit a transaction if we found any super block errors, after
      everything else is done.
      
      We cannot do this right after scrub_supers(), as
      btrfs_commit_transaction() will try to pause and wait for the running
      scrub, so we cannot call it with scrub_lock held.
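      
      A sketch of the fix as described (error handling trimmed; member
      names follow the scrub context but are not verified here):
      
      	/* After everything else is done and scrub_lock is released: */
      	if (sctx->stat.super_errors && !sctx->readonly) {
      		struct btrfs_trans_handle *trans;
      
      		trans = btrfs_start_transaction(fs_info->tree_root, 0);
      		if (!IS_ERR(trans))
      			ret = btrfs_commit_transaction(trans);
      	}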
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: properly report super block errors in system log · e69bf81c
      Committed by Qu Wenruo
      [PROBLEM]
      
      Unlike data/metadata corruption, if scrub detected some error in the
      super block, the only error message is from the updated device status:
      
        BTRFS info (device dm-1): scrub: started on devid 2
        BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
      
      This is not helpful at all.
      
      [CAUSE]
      Unlike data/metadata error reporting, there is no visible report in
      kernel dmesg to report supper block errors.
      
      In fact, the return value of scrub_checksum_super() is intentionally
      ignored, thus scrub_handle_errored_block() will never be called for
      super blocks.
      
      [FIX]
      Make super block errors output an error message; now the full dmesg
      looks like this:
      
        BTRFS info (device dm-1): scrub: started on devid 2
        BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
        BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
        BTRFS info (device dm-1): scrub: started on devid 2
      
      This fix involves:
      
      - Move the super_errors reporting to scrub_handle_errored_block()
        This allows the device status message to show up after the super
        block error message.  But we no longer distinguish super block
        corruption from generation mismatch; both are now counted as
        corruption.
      
      - Properly check the return value from scrub_checksum_super()
      - Add extra super block error reporting for scrub_print_warning().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix alignment of VMA for memory mapped files on THP · b0c58223
      Committed by Alexander Zhu
      With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for
      read-only mmapped files, such as shared libraries. However, the kernel
      makes no attempt to actually align those mappings on 2MB boundaries,
      which makes it impossible to use those THPs most of the time. This issue
      applies to general file mapping THP as well as existing setups using
      CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using
      thp_get_unmapped_area for the unmapped_area function in btrfs, which
      is what ext2, ext4, fuse, and xfs all use.
      
      Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix
      alignment of VMA for memory mapped files on THP") as btrfs does not support
      DAX. However, commit 1854bc6e ("mm/readahead: Align file mappings
      for non-DAX") removed the DAX requirement. We should now be able to call
      thp_get_unmapped_area() for btrfs.
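      
      The change boils down to wiring up the existing helper in btrfs'
      file_operations; a sketch:
      
      	const struct file_operations btrfs_file_operations = {
      		/* ... existing callbacks unchanged ... */
      		.mmap			= btrfs_file_mmap,
      		.get_unmapped_area	= thp_get_unmapped_area,
      	};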
      
      The problem can be seen in /proc/PID/smaps where THPeligible is set to 0
      on mappings to eligible shared object files as shown below.
      
      Before this patch:
      
        7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856
        /usr/lib64/libcrypto.so.1.1.1k
        Size:               2768 kB
        THPeligible:    0
        VmFlags: rd ex mr mw me
      
      With this patch the library is mapped at a 2MB aligned address:
      
        7fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856
        /usr/lib64/libcrypto.so.1.1.1k
        Size:               2768 kB
        THPeligible:    1
        VmFlags: rd ex mr mw me
      
      This fixes the alignment of VMAs for any mmap of a file that has the
      rd and ex permissions and size >= 2MB.  The VMA alignment and
      THPeligible field for anonymous memory are handled separately and are
      thus not affected by this change.
      
      CC: stable@vger.kernel.org # 5.18+
      Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add lockdep annotations for the ordered extents wait event · 5f4403e1
      Committed by Ioannis Angelakopoulos
      This wait event is very similar to the pending ordered wait event in the
      sense that it occurs in a different context than the condition signaling
      for the event. The signaling occurs in btrfs_remove_ordered_extent()
      while the wait event is implemented in btrfs_start_ordered_extent() in
      fs/btrfs/ordered-data.c.
      
      However, in this case a thread must not acquire the lockdep map for the
      ordered extents wait event when the ordered extent is related to a free
      space inode. That is because lockdep creates dependencies between locks
      acquired both in execution paths related to normal inodes and paths
      related to free space inodes, thus leading to false positives.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: change the lockdep class of free space inode's invalidate_lock · 9d7464c8
      Committed by Ioannis Angelakopoulos
      Reinitialize the class of the lockdep map for struct inode's
      mapping->invalidate_lock in load_free_space_cache() function in
      fs/btrfs/free-space-cache.c. This will prevent lockdep from producing
      false positives related to execution paths that make use of free space
      inodes and paths that make use of normal inodes.
      
      Specifically, with this change lockdep will create separate lock
      dependencies that include the invalidate_lock, in the case that free
      space inodes are used and in the case that normal inodes are used.
      
      The lockdep class for this lock was first initialized in
      inode_init_always() in fs/inode.c.
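      
      A sketch of such re-keying, assuming the usual static lock class key
      pattern:
      
      	static struct lock_class_key btrfs_free_space_inode_key;
      
      	/* In load_free_space_cache(), after getting the cache inode: */
      	lockdep_set_class(&inode->i_mapping->invalidate_lock,
      			  &btrfs_free_space_inode_key);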
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add lockdep annotations for pending_ordered wait event · 8b53779e
      Committed by Ioannis Angelakopoulos
      In contrast to the num_writers and num_extwriters wait events, the
      condition for the pending ordered wait event is signaled in a different
      context from the wait event itself. The condition signaling occurs in
      btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait
      event is implemented in btrfs_commit_transaction() in
      fs/btrfs/transaction.c.
      
      Thus the thread signaling the condition has to acquire the lockdep map
      as a reader at the start of btrfs_remove_ordered_extent() and release it
      after it has signaled the condition. In this case some dependencies
      might be left out due to the placement of the annotation, but it is
      better than no annotation at all.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add lockdep annotations for transaction states wait events · 3e738c53
      Committed by Ioannis Angelakopoulos
      Add lockdep annotations for the transaction states that have wait
      events:
      
        1) TRANS_STATE_COMMIT_START
        2) TRANS_STATE_UNBLOCKED
        3) TRANS_STATE_SUPER_COMMITTED
        4) TRANS_STATE_COMPLETED
      
      The new macros introduced here to annotate the transaction states wait
      events have the same effect as the generic lockdep annotation macros.
      
      With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START
      the transaction thread has to acquire the lockdep maps for the
      transaction states as reader after the lockdep map for num_writers is
      released so that lockdep does not complain.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add lockdep annotations for num_extwriters wait event · 5a9ba670
      Committed by Ioannis Angelakopoulos
      Similarly to the num_writers wait event in fs/btrfs/transaction.c add a
      lockdep annotation for the num_extwriters wait event.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add lockdep annotations for num_writers wait event · e1489b4f
      Committed by Ioannis Angelakopoulos
      Annotate the num_writers wait event in fs/btrfs/transaction.c with
      lockdep in order to catch deadlocks involving this wait event.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add macros for annotating wait events with lockdep · ab9a323f
      Committed by Ioannis Angelakopoulos
      Introduce four macros that are used to annotate wait events in btrfs
      code with lockdep:
      
        1) btrfs_lockdep_init_map
        2) btrfs_lockdep_acquire
        3) btrfs_lockdep_release
        4) btrfs_might_wait_for_event
      
      The btrfs_lockdep_init_map macro is used to initialize a lockdep map.
      
      The btrfs_lockdep_<acquire,release> macros are used by threads to take
      the lockdep map as readers (shared lock) and release it, respectively.
      
      The btrfs_might_wait_for_event macro is used by threads to take the
      lockdep map as writers (exclusive lock) and release it.
      
      In general, the lockdep annotation for wait events work as follows:
      
      The condition for a wait event can be modified and signaled at the same
      time by multiple threads. These threads hold the lockdep map as readers
      when they enter a context in which blocking would prevent signaling the
      condition. Frequently, this occurs when a thread violates a condition
      (lockdep map acquire), before restoring it and signaling it at a later
      point (lockdep map release).
      
      The threads that block on the wait event take the lockdep map as writers
      (exclusive lock). These threads have to block until all the threads that
      hold the lockdep map as readers signal the condition for the wait event
      and release the lockdep map.
      
      The lockdep annotation is used to warn about potential deadlock scenarios
      that involve the threads that modify and signal the wait event condition
      and threads that block on the wait event. A simple example is illustrated
      below:
      
      Without lockdep:
      
      TA                                        TB
      cond = false
                                                lock(A)
                                                wait_event(w, cond)
                                                unlock(A)
      lock(A)
      cond = true
      signal(w)
      unlock(A)
      
      With lockdep:
      
      TA                                        TB
      rwsem_acquire_read(lockdep_map)
      cond = false
                                                lock(A)
                                                rwsem_acquire(lockdep_map)
                                                rwsem_release(lockdep_map)
                                                wait_event(w, cond)
                                                unlock(A)
      lock(A)
      cond = true
      signal(w)
      unlock(A)
      rwsem_release(lockdep_map)
      
      In the second case, with the lockdep annotation, lockdep would warn about
      an ABBA deadlock, while the first case would just deadlock at some point.
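      
      Mapped to code, the annotated pattern looks roughly like this (macro
      names follow this series, call sites simplified):
      
      	/* Thread that may temporarily violate the condition (TA): */
      	btrfs_lockdep_acquire(fs_info, btrfs_trans_num_writers);
      	/* ... work during which blocking would stall the waiter ... */
      	btrfs_lockdep_release(fs_info, btrfs_trans_num_writers);
      	cond_wake_up(&cur_trans->writer_wait);
      
      	/* Thread that blocks on the wait event (TB): */
      	btrfs_might_wait_for_event(fs_info, btrfs_trans_num_writers);
      	wait_event(cur_trans->writer_wait,
      		   atomic_read(&cur_trans->num_writers) == 1);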
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: dump extra info if one free space cache has more bitmaps than it should · 62cd9d44
      Committed by Qu Wenruo
      There is an internal report on hitting the following ASSERT() in
      recalculate_thresholds():
      
       	ASSERT(ctl->total_bitmaps <= max_bitmaps);
      
      Above @max_bitmaps is calculated using the following variables:
      
      - bytes_per_bg
        8 * 4096 * 4096 (128M) for x86_64/x86.
      
      - block_group->length
        The length of the block group.
      
      @max_bitmaps is the rounded up value of block_group->length / 128M.
      
      Normally one free space cache should not have more bitmaps than the
      above value, but when it happens the ASSERT() can be triggered if
      CONFIG_BTRFS_ASSERT is enabled.
      
      However the ASSERT() itself doesn't provide enough info to tell what
      is going wrong: is the bg too small, so that it only allows one
      bitmap, or is something else wrong?
      
      So although I haven't found extra reports or a crash dump to do
      further investigation, add the extra info to make debugging easier.
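      
      For reference, a sketch of the computation using the variables above
      (the real code may use different rounding helpers):
      
      	u64 bytes_per_bg = BITS_PER_BYTE * PAGE_SIZE * (u64)ctl->unit;
      	/* 8 * 4096 * 4096 = 128M on x86_64/x86 with 4K sectorsize. */
      	u64 max_bitmaps = div64_u64(block_group->length + bytes_per_bg - 1,
      				    bytes_per_bg);
      
      	ASSERT(ctl->total_bitmaps <= max_bitmaps);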
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Linux 6.0-rc7 · f76349cf
      Committed by Linus Torvalds
    • Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 5e049663
      Committed by Linus Torvalds
      Pull ext4 fixes from Ted Ts'o:
       "Regression and bug fixes:
      
         - Performance regression fix from 5.18 on a Raspberry Pi
      
         - Fix extent parsing bug which triggers a BUG_ON when a (corrupted)
           extent tree has a non-root node with zero entries.
      
         - Fix a livelock where in the right (wrong) circumstances a large
           number of nfsd threads can try to write to a nearly full file
           system, and retry for hours(!)"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: limit the number of retries after discarding preallocations blocks
        ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0
        ext4: use buckets for cr 1 block scan instead of rbtree
        ext4: use locality group preallocation for small closed files
        ext4: make directory inode spreading reflect flexbg size
        ext4: avoid unnecessary spreading of allocations among groups
        ext4: make mballoc try target group first even with mb_optimize_scan
  2. 25 September 2022, 6 commits
  3. 24 September 2022, 1 commit