1. 22 Jun, 2021 12 commits
    • btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Filipe Manana committed
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also have now one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      Also it adds a spinlock used exclusively for the exclusivity between
      send and relocation, as before fs_info->balance_mutex was used, which
      would make an attempt to run send block while waiting for balance to
      finish, which can take a lot of time on large filesystems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: shorten integrity checker extent data mount option · cbeaae4f
      David Sterba committed
      Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and
      calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the
      mount option name.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch mount option bits to enums and use wider type · ccd9395b
      David Sterba committed
      Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are
      recorded in the debugging information for convenience).
      
      There are two more things done but separating them would not make much
      sense as it's touching the same lines:
      
      - Renumber shifts 18..31 to 17..30 to get rid of the hole in the
        sequence.
      
      - Use 1UL as the value that gets shifted because we're approaching the
        32bit limit and due to integer promotions the value of (1 << 31)
        becomes 0xffffffff80000000 when cast to unsigned long (eg. the option
        manipulating helpers).
      
        This is not causing any problems yet as the operations are in-memory
        and masking the 31st bit works, we don't have more than 31 bits so the
        ill effects of not masking higher bits don't happen. But once we have
        more, the problems will emerge.
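      The promotion problem can be sketched in user space as below. This is
      an illustration only: the helper names are hypothetical, the option
      word is assumed to be 64 bits wide, and INT32_MIN stands in for the
      bit pattern of (1 << 31) because shifting into the sign bit of a
      signed int is not well defined in C.

```c
#include <stdint.h>

/* Hypothetical stand-in for a mask built as (1 << 31): a 32-bit signed
 * value with only the sign bit set, widened to 64 bits. */
uint64_t mask_from_signed_bit31(void)
{
	int32_t bit = INT32_MIN;	/* same bit pattern as 0x80000000 */

	return (uint64_t)bit;		/* sign extension: 0xffffffff80000000 */
}

/* Mask built from an unsigned 64-bit one, as 1UL << 31 does on LP64. */
uint64_t mask_from_unsigned_bit31(void)
{
	return (uint64_t)1 << 31;	/* 0x0000000080000000 */
}

/* Clear the bits of mask in opts, like an option-clearing helper would. */
uint64_t clear_opt(uint64_t opts, uint64_t mask)
{
	return opts & ~mask;
}
```

      With the sign-extended mask, clearing bit 31 also wipes every bit
      from 32 upwards, which is harmless only as long as no option uses
      those bits.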
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: props: change how empty value is interpreted · 5548c8c6
      David Sterba committed
      Based on user feedback and actual problems with compression property,
      there's no support to unset any compression options, or to force no
      compression flag.
      
      Note: This has changed recently in e2fsprogs 1.46.2, 'chattr +m'
      (setting NOCOMPRESS).
      
      In btrfs properties, the empty value should really mean reset to
      defaults, for all properties in general. Right now there's only the
      compression one, so this change should not cause too many problems.
      
      Old behaviour:
      
        $ lsattr file
        ---------------------- file
        # the NOCOMPRESS bit is set
        $ btrfs prop set file compression ''
        $ lsattr file
        ---------------------m file
      
      This is equivalent to 'btrfs prop set file compression no' in current
      btrfs-progs as the 'no' or 'none' values are translated to an empty
      string.
      
      This is where the new behaviour is different: empty string drops the
      compression flag (-c) and nocompress (-m):
      
        $ lsattr file
        ---------------------- file
        # No change
        $ btrfs prop set file compression ''
        $ lsattr file
        ---------------------- file
        $ btrfs prop set file compression lzo
        $ lsattr file
        --------c------------- file
        $ btrfs prop get file compression
        compression=lzo
        $ btrfs prop set file compression ''
        # Reset to the initial state
        $ lsattr file
        ---------------------- file
        # Set NOCOMPRESS bit
        $ btrfs prop set file compression no
        $ lsattr file
        ---------------------m file
      
      This obviously brings problems with backward compatibility, so this
      patch should not be backported without making sure the updated
      btrfs-progs are also used and that scripts have been updated to use the
      new semantics.
      
      Summary:
      
      - old kernel:
        no, none, "" - set NOCOMPRESS bit
      - new kernel:
        no, none - set NOCOMPRESS bit
        "" - drop all compression flags, ie. COMPRESS and NOCOMPRESS
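      The new-kernel semantics in the summary can be sketched as a small
      flag update. The flag names and the function are hypothetical
      stand-ins, not the kernel's identifiers, and the sketch assumes any
      value other than "", "no" or "none" names a valid algorithm (no
      validation is done).

```c
#include <string.h>

/* Hypothetical flag bits standing in for the COMPRESS/NOCOMPRESS flags. */
enum { FLAG_COMPRESS = 1, FLAG_NOCOMPRESS = 2 };

/* Return the new flags after setting the compression property to value. */
unsigned int apply_compression_prop(unsigned int flags, const char *value)
{
	/* Every setting starts by dropping both flags. */
	flags &= ~(unsigned int)(FLAG_COMPRESS | FLAG_NOCOMPRESS);

	if (value[0] == '\0')
		return flags;			/* "" resets to defaults */
	if (strcmp(value, "no") == 0 || strcmp(value, "none") == 0)
		return flags | FLAG_NOCOMPRESS;	/* explicit no-compression */

	/* Assumption: anything else is an algorithm name such as "lzo". */
	return flags | FLAG_COMPRESS;
}
```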
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: compression: don't try to compress if we don't have enough pages · f2165627
      David Sterba committed
      The early check if we should attempt compression does not take into
      account the number of input pages. It can happen that there's only one
      page, eg. a tail page after some ranges of the BTRFS_MAX_UNCOMPRESSED
      have been processed, or an isolated page that won't be converted to an
      inline extent.
      
      The single page would be compressed but a later check would drop it
      again because the result size must be at least one block shorter than
      the input. That can never work with just one page.
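      A minimal sketch of why a single page cannot pass the later check,
      assuming the block size equals the page size (4096 bytes in the test
      values). Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Later check: keep the compressed result only if it is at least one
 * block shorter than the input. */
bool compression_saves_a_block(size_t in_len, size_t out_len, size_t blocksize)
{
	return out_len + blocksize <= in_len;
}

/* Early check sketched from the description: a single page can never
 * save a whole block, so do not even try. */
bool worth_trying_compression(size_t nr_pages)
{
	return nr_pages > 1;
}
```

      With in_len equal to one 4096-byte block, any non-empty output
      fails the first check, so the early check avoids the wasted work.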
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix unbalanced unlock in qgroup_account_snapshot() · 44365827
      Naohiro Aota committed
      qgroup_account_snapshot() is trying to unlock the not taken
      tree_log_mutex in an error path. Since ret != 0 in this case, we can
      just return from here.
      
      Fixes: 2a4d84c1 ("btrfs: move delayed ref flushing for qgroup into qgroup helper")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: export dev stats in devinfo directory · da658b57
      David Sterba committed
      The device stats can be read by ioctl, wrapped by command 'btrfs device
      stats'. Provide another source where to read the information in
      /sys/fs/btrfs/FSID/devinfo/DEVID/error_stats . The format is a list of
      'key value' pairs one per line, which is common in other stat files.
      The names are the same as used in other device stat outputs.
      
      The stats are all in one file as it's the snapshot of all available
      stats. The 'one value per file' format is not very suitable here. The
      stats should be valid right after the stats item is read from disk,
      shortly after initializing the device.
      
      In case the stats are not yet valid, print just 'invalid' as the file
      contents.
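      A reader of such a file could look like the sketch below. It only
      relies on the 'key value' per line format and the 'invalid' marker
      described above; the function name is hypothetical and the key names
      used with it are placeholders, not the real stat names.

```c
#include <stdio.h>
#include <string.h>

/* Look up key in text made of "key value" lines. Returns 0 and stores
 * the value on success, -1 if the text is "invalid" or the key is absent. */
int stat_lookup(const char *text, const char *key, unsigned long long *value)
{
	char name[64];
	unsigned long long v;
	const char *line = text;

	if (strncmp(text, "invalid", 7) == 0)
		return -1;

	while (line && *line) {
		if (sscanf(line, "%63s %llu", name, &v) == 2 &&
		    strcmp(name, key) == 0) {
			*value = v;
			return 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```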
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typos in comments · 1a9fd417
      David Sterba committed
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove a stale comment for btrfs_decompress_bio() · c86bdc9b
      Qu Wenruo committed
      Since commit 8140dc30 ("btrfs: btrfs_decompress_bio() could accept
      compressed_bio instead"), btrfs_decompress_bio() accepts
      "struct compressed_bio" rather than an open-coded parameter list.
      
      Thus the comment for the parameter list is no longer needed.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: use list_move_tail instead of list_del/list_add_tail · bb930007
      Baokun Li committed
      Use list_move_tail() instead of list_del() + list_add_tail() as it's
      doing the same thing and allows further cleanups.  Open code
      name_cache_used() as there is only one user.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: disable build on platforms having page size 256K · b05fbcc3
      Christophe Leroy committed
      With a config having PAGE_SIZE set to 256K, BTRFS build fails
      with the following message
      
        include/linux/compiler_types.h:326:38: error: call to
        '__compiletime_assert_791' declared with attribute error:
        BUILD_BUG_ON failed: (BTRFS_MAX_COMPRESSED % PAGE_SIZE) != 0
      
      BTRFS_MAX_COMPRESSED being 128K, BTRFS cannot support platforms with
      256K pages at the time being.
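      The failing condition is plain modular arithmetic, sketched here with
      a hypothetical helper and the 128K value quoted above for
      BTRFS_MAX_COMPRESSED.

```c
/* 128K, the value the commit message gives for BTRFS_MAX_COMPRESSED. */
#define SKETCH_MAX_COMPRESSED (128UL * 1024)

/* Returns nonzero when the build-time check would pass for page_size. */
int max_compressed_is_page_multiple(unsigned long page_size)
{
	return (SKETCH_MAX_COMPRESSED % page_size) == 0;
}
```

      For 4K and 64K pages the remainder is 0; for 256K pages it is 128K,
      which is what trips BUILD_BUG_ON.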
      
      There are two platforms that can select 256K pages:
       - hexagon
       - powerpc
      
      Disable BTRFS when 256K page size is selected. Supporting this would
      require changes to the subpage mode that's currently being developed.
      Given that 256K is many times larger than page sizes commonly used and
      for what the algorithms and structures have been tuned, it's out of
      scope and disabling build is a reasonable option.
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: fix invalid path for unlink operations after parent orphanization · d8ac76cd
      Filipe Manana committed
      During an incremental send operation, when processing the new references
      for the current inode, we might send an unlink operation for another inode
      that has a conflicting path and has more than one hard link. However this
      path was computed and cached before we processed previous new references
      for the current inode. We may have orphanized a directory of that path
      while processing a previous new reference, in which case the path will
      be invalid and cause the receiver process to fail.
      
      The following reproducer triggers the problem and explains how/why it
      happens in its comments:
      
        $ cat test-send-unlink.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        # Create our test files and directory. Inode 259 (file3) has two hard
        # links.
        touch $MNT/file1
        touch $MNT/file2
        touch $MNT/file3
      
        mkdir $MNT/A
        ln $MNT/file3 $MNT/A/hard_link
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- file1                          (ino 257)
        # |----- file2                          (ino 258)
        # |----- file3                          (ino 259)
        # |----- A/                             (ino 260)
        #        |---- hard_link                (ino 259)
        #
      
        # Now create the base snapshot, which is going to be the parent snapshot
        # for a later incremental send.
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Move inode 257 into directory inode 260. This results in computing the
        # path for inode 260 as "/A" and caching it.
        mv $MNT/file1 $MNT/A/file1
      
        # Move inode 258 (file2) into directory inode 260, with a name of
        # "hard_link", moving first inode 259 away since it currently has that
        # location and name.
        mv $MNT/A/hard_link $MNT/tmp
        mv $MNT/file2 $MNT/A/hard_link
      
        # Now rename inode 260 to something else (B for example) and then create
        # a hard link for inode 258 that has the old name and location of inode
        # 260 ("/A").
        mv $MNT/A $MNT/B
        ln $MNT/B/hard_link $MNT/A
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- tmp                            (ino 259)
        # |----- file3                          (ino 259)
        # |----- B/                             (ino 260)
        # |      |---- file1                    (ino 257)
        # |      |---- hard_link                (ino 258)
        # |
        # |----- A                              (ino 258)
      
        # Create another snapshot of our subvolume and use it for an incremental
        # send.
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to
        # apply both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        # First add the first snapshot to the new filesystem by applying the
        # first send stream.
        btrfs receive -f /tmp/snap1.send $MNT
      
        # The incremental receive operation below used to fail with the
        # following error:
        #
        #    ERROR: unlink A/hard_link failed: No such file or directory
        #
        # This is because when send is processing inode 257, it generates the
        # path for inode 260 as "/A", since that inode is its parent in the send
        # snapshot, and caches that path.
        #
        # Later when processing inode 258, it first processes its new reference
        # that has the path of "/A", which results in orphanizing inode 260
        # because there is a path collision. This results in issuing a rename
        # operation from "/A" to "/o260-6-0".
        #
        # Finally when processing the new reference "B/hard_link" for inode 258,
        # it notices that it collides with inode 259 (not yet processed, because
        # it has a higher inode number), since that inode has the name
        # "hard_link" under the directory inode 260. It also checks that inode
        # 259 has two hardlinks, so it decides to issue a unlink operation for
        # the name "hard_link" for inode 259. However the path passed to the
        # unlink operation is "/A/hard_link", which is incorrect since currently
        # "/A" does not exist, due to the orphanization of inode 260 mentioned
        # before. The path is incorrect because it was computed and cached
        # before the orphanization. This causes the receiver to fail with
        # the above error.
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, it fails like this:
      
        $ ./test-send-unlink.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: unlink A/hard_link failed: No such file or directory
      
      Fix this by recomputing a path before issuing an unlink operation when
      processing the new references for the current inode if we previously
      have orphanized a directory.
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 21 Jun, 2021 28 commits
    • btrfs: inline wait_current_trans_commit_start in its caller · ae5d29d4
      David Sterba committed
      Function wait_current_trans_commit_start is now fairly trivial so it can
      be inlined in its only caller.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sink wait_for_unblock parameter to async commit · 32cc4f87
      David Sterba committed
      There's only one caller left btrfs_ioctl_start_sync that passes 0, so we
      can remove the switch in btrfs_commit_transaction_async.
      
      A cleanup 9babda9f ("btrfs: Remove async_transid from
      btrfs_mksubvol/create_subvol/create_snapshot") removed calls that passed
      1, so this is a followup.
      
      As this removes last call of wait_current_trans_commit_start_and_unblock,
      remove the function as well.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove total_data_size variable in btrfs_batch_insert_items() · bfaa324e
      Nathan Chancellor committed
      clang warns:
      
        fs/btrfs/delayed-inode.c:684:6: warning: variable 'total_data_size' set
        but not used [-Wunused-but-set-variable]
      	  int total_data_size = 0, total_size = 0;
      	      ^
        1 warning generated.
      
      This variable's value has been unused since commit fc0d82e1 ("btrfs:
      sink total_data parameter in setup_items_for_insert"). Eliminate it.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/1391
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: eliminate insert label in add_falloc_range · 77d25534
      Nikolay Borisov committed
      By way of inverting the list_empty conditional the insert label can be
      eliminated, making the function's flow entirely linear.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: fix a rare race between metadata endio and eb freeing · 3d078efa
      Qu Wenruo committed
      [BUG]
      There is a very rare ASSERT() triggering during full fstests run for
      subpage rw support.
      
      No other reproducer so far.
      
      The ASSERT() gets triggered for metadata read in
      btrfs_page_set_uptodate() inside end_page_read().
      
      [CAUSE]
      There is still a small race window for metadata only, the race could
      happen like this:
      
                      T1                  |              T2
      ------------------------------------+-----------------------------
      end_bio_extent_readpage()           |
      |- btrfs_validate_metadata_buffer() |
      |  |- free_extent_buffer()          |
      |     Still have 2 refs             |
      |- end_page_read()                  |
         |- if (unlikely(PagePrivate())   |
         |  The page still has Private    |
         |                                | free_extent_buffer()
         |                                | |  Only one ref 1, will be
         |                                | |  released
         |                                | |- detach_extent_buffer_page()
         |                                |    |- btrfs_detach_subpage()
         |- btrfs_page_set_uptodate()     |
            The page no longer has Private|
            >>> ASSERT() triggered <<<    |
      
      This race window is super small, thus pretty hard to hit, even with so
      many runs of fstests.
      
      But the race window is still there, we have to go another way to solve
      it other than relying on random PagePrivate() check.
      
      Data path is not affected, as it will lock the page before reading,
      while unlocking the page after the last read has finished, thus no race
      window.
      
      [FIX]
      This patch will fix the bug by repurposing btrfs_subpage::readers.
      
      Now btrfs_subpage::readers will be a member shared by both metadata and
      data.
      
      For metadata path, we don't do the page unlock as metadata only relies
      on extent locking.
      
      At the same time, teach page_range_has_eb() to take
      btrfs_subpage::readers into consideration.
      
      So that even if the last eb of a page gets freed, page::private won't be
      detached as long as there still are pending end_page_read() calls.
      
      By this we eliminate the race window. This will slightly increase the
      metadata memory usage, as the page may not be released as frequently as
      usual.  But it should not be a big deal.
      
      The code got introduced in ("btrfs: submit read time repair only for
      each corrupted sector"), but the fix is in a separate patch to keep the
      problem description and the crash is rare so it should not hurt
      bisectability.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't clear page extent mapped if we're not invalidating the full page · bcd77455
      Qu Wenruo committed
      [BUG]
      With current btrfs subpage rw support, the following script can lead to
      fs hang:
      
        $ mkfs.btrfs -f -s 4k $dev
        $ mount $dev -o nospace_cache $mnt
        $ fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt
      
      The fs will hang at btrfs_start_ordered_extent().
      
      [CAUSE]
      In the above test case, btrfs_invalidatepage() will be called with the
      following parameters:
      
        offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000
      
      Since @offset is 0, btrfs_invalidatepage() will try to invalidate the full
      page, and finally call clear_page_extent_mapped() which will detach
      subpage structure from the page.
      
      And since the page no longer has subpage structure, the subpage dirty
      bitmap will be cleared, preventing the dirty range from being written
      back, thus no way to wake up the ordered extent.
      
      [FIX]
      Follow other filesystems and only invalidate the page if the range
      covers the full page.
      
      There are cases like truncate_setsize() which can call
      btrfs_invalidatepage() with offset == 0 and length != 0 for the last
      page of an inode.
      
      Although the old code will still try to invalidate the full page, we are
      still safe to just wait for ordered extent to finish.
      So it shouldn't cause extra problems.
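      The decision described in the fix can be sketched as a single
      predicate. The function name is hypothetical, and the 64K page size
      used with the values from the bug report is an assumption (the report
      only states a 4K sector size).

```c
#include <stdbool.h>

/* True only when [offset, offset + length) is exactly one whole page. */
bool range_covers_full_page(unsigned int offset, unsigned int length,
			    unsigned int page_size)
{
	return offset == 0 && length == page_size;
}
```

      With offset = 0 and length = 53248 on a 65536-byte page the range is
      partial, so the subpage structure must stay attached.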
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range() · 0528476b
      Qu Wenruo committed
      [BUG]
      With current subpage RW support, the following script can hang the fs
      with 64K page size.
      
       # mkfs.btrfs -f -s 4k $dev
       # mount $dev -o nospace_cache $mnt
       # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt
      
      The kernel will do an infinite loop in btrfs_punch_hole_lock_range().
      
      [CAUSE]
      In btrfs_punch_hole_lock_range() we:
      
      - Truncate page cache range
      - Lock extent io tree
      - Wait any ordered extents in the range.
      
      We exit the loop only once all the following conditions are met:
      
      - No ordered extent in the lock range
      - No page is in the lock range
      
      The latter condition has a pitfall: it only works for the sector size ==
      PAGE_SIZE case.
      
      It can't handle the following subpage case:
      
        0       32K     64K     96K     128K
        |       |///////||//////|       ||
      
      lockstart=32K
      lockend=96K - 1
      
      In this case, although the range crosses 2 pages,
      truncate_pagecache_range() will invalidate no page at all, but only zero
      the [32K, 96K) range of the two pages.
      
      Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
      will never meet the loop exit condition.
      
      [FIX]
      Fix the problem by doing page alignment for the lock range.
      
      Function filemap_range_has_page() has already handled lend < lstart
      case, we only need to round up @lockstart, and round_down @lockend for
      truncate_pagecache_range().
      
      This modification should not change anything for the sector size ==
      PAGE_SIZE case, as in that case our range is already page aligned.
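      The alignment step can be sketched as follows. The helper names are
      hypothetical, the end of the aligned range is kept exclusive to avoid
      underflow, and a power-of-two page size is assumed.

```c
#include <stdint.h>

/* Align to a power-of-two boundary a. */
uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

uint64_t align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/* Shrink the inclusive range [lockstart, lockend] to whole pages.
 * Stores the result as [*pstart, *pend_excl) and returns 1 if it
 * contains at least one full page, 0 if it is empty. */
int page_aligned_range(uint64_t lockstart, uint64_t lockend,
		       uint64_t page_size,
		       uint64_t *pstart, uint64_t *pend_excl)
{
	*pstart = align_up(lockstart, page_size);
	*pend_excl = align_down(lockend + 1, page_size);
	return *pend_excl > *pstart;
}
```

      For lockstart = 32K and lockend = 96K - 1 with 64K pages, both
      bounds land on 64K, so the aligned range is empty and the page check
      no longer blocks the loop exit.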
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reflink: make copy_inline_to_page() to be subpage compatible · 3115deb3
      Qu Wenruo committed
      The modifications are:
      
      - Page copy destination
        For subpage case, one page can contain multiple sectors, thus we can
        no longer expect the memcpy_to_page()/btrfs_decompress() to copy
        data into page offset 0.
        The correct offset is offset_in_page(file_offset) now, which should
        handle both regular sectorsize and subpage cases well.
      
      - Page status update
        Now we need to use subpage helper to handle the page status update.
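      The destination offset computation amounts to masking the file offset
      with the page size. The sketch below uses a hypothetical user-space
      helper in place of the kernel's offset_in_page() and assumes a
      power-of-two page size.

```c
#include <stdint.h>

/* Offset of file_offset inside its page, for a power-of-two page_size. */
uint64_t offset_in_page_sketch(uint64_t file_offset, uint64_t page_size)
{
	return file_offset & (page_size - 1);
}
```

      A file offset of 68K lands at offset 4K in a 64K page, but at offset
      0 in a 4K page, which is why copying to page offset 0 only worked for
      the regular sector size case.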
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_page_mkwrite() to be subpage compatible · 2d8ec40e
      Qu Wenruo committed
      Only set_page_dirty() and SetPageUptodate() are not subpage compatible.
      Convert them to subpage helpers, so that __extent_writepage_io() can
      submit page content correctly.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_truncate_block() to be subpage compatible · 6c9ac8be
      Qu Wenruo committed
      btrfs_truncate_block() itself is already mostly subpage compatible, the
      only missing part is the page dirtying code.
      
      Currently if we have a sector that needs to be truncated, we set the
      sector aligned range delalloc, then set the full page dirty.
      
      The problem is that the current subpage code requires the subpage dirty
      bit to be set, or __extent_writepage_io() won't submit the bio, so the
      ordered extent never finishes.
      
      So this patch will make btrfs_truncate_block() to call
      btrfs_page_set_dirty() helper to replace set_page_dirty() to fix the
      problem.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make __extent_writepage_io() only submit dirty range for subpage · c5ef5c6c
      Qu Wenruo committed
      __extent_writepage_io() function originally just iterates through all
      the extent maps of a page, and submits any regular extents.
      
      This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we
      need to submit the only sector contained in the page.
      
      But for subpage case, one dirty page can contain several clean sectors
      with at least one dirty sector.
      
      If __extent_writepage_io() still submits all regular extent maps, it can
      submit data which is already written to disk.
      And since such already written data won't have corresponding ordered
      extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().
      
      Change the behavior of __extent_writepage_io() by finding the first
      dirty byte in the page, and only submit the dirty range rather than the
      full extent.
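      Finding the dirty range can be sketched as a scan over a per-sector
      dirty bitmap. This is not the kernel's implementation: the function
      name is hypothetical and the bitmap is assumed to hold 16 sectors per
      page (64K page, 4K sectors).

```c
#include <stdint.h>

/* Find the first run of set bits in a 16-bit per-sector dirty bitmap,
 * searching from bit from. Returns 1 and stores the run as
 * [*start, *start + *len) in sectors, or 0 if nothing dirty remains. */
int find_next_dirty_run(uint16_t bitmap, unsigned int from,
			unsigned int *start, unsigned int *len)
{
	unsigned int i = from;

	/* Skip clean sectors. */
	while (i < 16 && !(bitmap & (1u << i)))
		i++;
	if (i >= 16)
		return 0;

	*start = i;
	/* Extend the run over consecutive dirty sectors. */
	while (i < 16 && (bitmap & (1u << i)))
		i++;
	*len = i - *start;
	return 1;
}
```

      Only the sectors in the returned run would be submitted, so clean
      sectors that share the page are left alone.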
      
      Since we're also here, also modify the following calls to be subpage
      compatible:
      
      - SetPageError()
      - end_page_writeback()
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_set_range_writeback() subpage compatible · d2a91064
      Qu Wenruo committed
      Function btrfs_set_range_writeback() currently just sets the page
      writeback unconditionally.
      
      Change it to call the subpage helper so that we can handle both cases
      well.
      
      Since the subpage helper needs btrfs_fs_info, also change the parameter
      to accept btrfs_inode.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig() · 4750af3b
      Qu Wenruo committed
      In cow_file_range(), after we have succeeded creating an inline extent,
      we unlock the page with extent_clear_unlock_delalloc() by passing
      locked_page == NULL.
      
      For sectorsize == PAGE_SIZE case, this is just making the page lock and
      unlock harder to grab.
      
      But for incoming subpage case, it can be a big problem.
      
      For incoming subpage case, page locking has two entry points:
      
      - __process_pages_contig()
        In that case, we know exactly the range we want to lock (which only
        requires sector alignment).
        To handle the subpage requirement, we introduce btrfs_subpage::writers
        to page::private, and will update it in __process_pages_contig().
      
      - Other directly lock/unlock_page() call sites
        Those won't touch btrfs_subpage::writers at all.
      
      This means, page locked by __process_pages_contig() can only be unlocked
      by __process_pages_contig().
      Thankfully we already have the existing infrastructure in the form of
      @locked_page in various call sites.
      
      Unfortunately, extent_clear_unlock_delalloc() in cow_file_range() after
      creating an inline extent is the exception.
      It intentionally calls extent_clear_unlock_delalloc() with locked_page ==
      NULL, to also unlock current page (and clear its dirty/writeback bits).
      
      To co-operate with incoming subpage modifications, and make the page
      lock/unlock pair easier to understand, this patch will still call
      extent_clear_unlock_delalloc() with locked_page, and only unlock the
      page in __extent_writepage().
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4750af3b
    • Q
      btrfs: update locked page dirty/writeback/error bits in __process_pages_contig · a33a8e9a
      Qu Wenruo committed
      When __process_pages_contig() gets called for
      extent_clear_unlock_delalloc(), if we hit the locked page, only Private2
      bit is updated, but dirty/writeback/error bits are all skipped.
      
      There are several call sites that call extent_clear_unlock_delalloc()
      with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:
      
      - cow_file_range()
      - run_delalloc_nocow()
      - cow_file_range_async()
        All for their error handling branches.
      
      For those call sites, since we skip the locked page for
      dirty/error/writeback bit update, the locked page will still have its
      subpage dirty bit remaining.
      
      Normally the call site that locked the page is responsible for handling
      the locked page, but it won't hurt if we also do the update.
      
      This is especially true since there are already other call sites doing
      the same thing by manually passing NULL as locked_page.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a33a8e9a
    • Q
      btrfs: make page Ordered bit to be subpage compatible · b945a463
      Qu Wenruo committed
      This involves the following modification:
      
      - Ordered extent creation
        This is done in process_one_page(), now PAGE_SET_ORDERED will call
        subpage helper to do the work.
      
      - endio functions
        This is done in btrfs_mark_ordered_io_finished().
      
      - btrfs_invalidatepage()
      
      - btrfs_cleanup_ordered_extents()
        Use the subpage page helper, and add an extra branch to exit if the
        locked page has covered the full range.
      
      Now the usage of page Ordered flag for ordered extent accounting is fully
      subpage compatible.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b945a463
    • Q
      btrfs: introduce helpers for subpage ordered status · 6f17400b
      Qu Wenruo committed
      This patch introduces the following functions to handle btrfs subpage
      ordered (Private2) status:
      
      - btrfs_subpage_set_ordered()
      - btrfs_subpage_clear_ordered()
      - btrfs_subpage_test_ordered()
        These helpers can only be called when the range is ensured to be
        inside the page.
      
      - btrfs_page_set_ordered()
      - btrfs_page_clear_ordered()
      - btrfs_page_test_ordered()
        These helpers can handle both regular sector size and subpage without
        problem.
      
      These functions are here to coordinate btrfs_invalidatepage() with
      btrfs_writepage_endio_finish_ordered(), to make sure only one of those
      functions can finish the ordered extent.
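      
      As a rough illustration of the per-sector bookkeeping such helpers
      imply, here is a user-space sketch. It is not the kernel code: the
      ord_model_* names, the 16-bit bitmap and the 4K sector / 64K page
      sizes are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define ORD_MODEL_SECTORSIZE 4096u	/* assumed sector size */

/* Hypothetical stand-in for the per-page subpage state. */
struct ord_model_page {
	uint16_t ordered_bitmap;	/* one bit per sector of a 64K page */
	bool page_ordered;		/* page-level Ordered flag */
};

/* Bits covering [offset, offset + len) inside the page; both sector aligned. */
static uint16_t ord_model_mask(uint32_t offset, uint32_t len)
{
	uint32_t first = offset / ORD_MODEL_SECTORSIZE;
	uint32_t nbits = len / ORD_MODEL_SECTORSIZE;

	return (uint16_t)(((1u << nbits) - 1u) << first);
}

static void ord_model_set(struct ord_model_page *p, uint32_t offset, uint32_t len)
{
	p->ordered_bitmap |= ord_model_mask(offset, len);
	p->page_ordered = true;
}

static void ord_model_clear(struct ord_model_page *p, uint32_t offset, uint32_t len)
{
	p->ordered_bitmap &= (uint16_t)~ord_model_mask(offset, len);
	/* The page-level flag only goes away with the last ordered sector. */
	if (p->ordered_bitmap == 0)
		p->page_ordered = false;
}

/* True only if every sector in the range is still marked ordered. */
static bool ord_model_test(const struct ord_model_page *p, uint32_t offset, uint32_t len)
{
	uint16_t mask = ord_model_mask(offset, len);

	return (p->ordered_bitmap & mask) == mask;
}
```

      In this model, whichever path tests and clears a sector first is the
      one that gets to do the ordered extent accounting for it.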
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6f17400b
    • Q
      btrfs: make process_one_page() to handle subpage locking · 1e1de387
      Qu Wenruo committed
      Introduce a new data inode specific subpage member, writers, to record
      how many sectors are under page lock for delalloc writing.
      
      This member acts pretty much the same as readers, except it's only for
      delalloc writes.
      
      This is important for delalloc code to track which page can really be
      freed, as we have cases like run_delalloc_nocow() where we may be
      processing a nocow range inside a page, but need to exit halfway to
      do COW.
      In that case, we need a way to determine if we can really unlock a full
      page.
      
      With the new btrfs_subpage::writers, there is a new requirement:
      - Page locked by process_one_page() must be unlocked by
        process_one_page()
        There are still tons of call sites that manually lock and unlock a
        page without updating btrfs_subpage::writers.
        So if we lock a page through process_one_page() then it must be
        unlocked by process_one_page() to keep btrfs_subpage::writers
        consistent.
      
        This will be handled in next patch.
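      
      The counting rule can be pictured with a small user-space model. This
      is only a sketch under assumptions: struct wr_model_subpage and the
      wr_model_* functions are hypothetical names, and C11 atomics stand in
      for whatever the kernel actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the subpage structure attached to a page. */
struct wr_model_subpage {
	atomic_uint writers;	/* sectors currently locked for delalloc */
};

/* Called when nr_sectors of the page get locked for delalloc writing. */
static void wr_model_start_writer(struct wr_model_subpage *sp, unsigned int nr_sectors)
{
	atomic_fetch_add(&sp->writers, nr_sectors);
}

/*
 * Called when nr_sectors are done. Returns true when the caller dropped
 * the last writer, meaning the whole page may now be unlocked.
 */
static bool wr_model_end_and_test_writer(struct wr_model_subpage *sp, unsigned int nr_sectors)
{
	unsigned int old = atomic_fetch_sub(&sp->writers, nr_sectors);

	assert(old >= nr_sectors);	/* an unbalanced unlock would underflow */
	return old == nr_sectors;
}
```

      A page locked outside of this pair never bumps the counter, which is
      why the lock and unlock sides have to go through the same path.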
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1e1de387
    • Q
      btrfs: make end_bio_extent_writepage() to be subpage compatible · 9047e317
      Qu Wenruo committed
      Now in end_bio_extent_writepage(), the only subpage incompatible code is
      the end_page_writeback().
      
      Just call the subpage helpers.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9047e317
    • Q
      btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status · e38992be
      Qu Wenruo committed
      For __process_pages_contig() and process_one_page(), to handle subpage
      we only need to pass bytenr in and call subpage helpers to handle
      dirty/error/writeback status.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e38992be
    • Q
      btrfs: make btrfs_dirty_pages() to be subpage compatible · f02a85d2
      Qu Wenruo committed
      Since the extent io tree operations in btrfs_dirty_pages() are already
      subpage compatible, we only need to make the page status update to use
      subpage helpers.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f02a85d2
    • Q
      btrfs: only require sector size alignment for end_bio_extent_writepage() · 321a02db
      Qu Wenruo committed
      Just like read page, for subpage support we only require sector size
      alignment.
      
      So change the error message condition to only require sector alignment.
      
      This should not affect existing code, as for regular sectorsize ==
      PAGE_SIZE case, we are still requiring page alignment.
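      
      A minimal sketch of such an alignment check, written as stand-alone
      C. The macro and function names are made up for illustration; the
      bv_offset/bv_len parameters are assumed to mirror the fields of a bio
      vector.

```c
#include <stdbool.h>
#include <stdint.h>

/* Valid for power-of-two alignments, which sector sizes are. */
#define ALIGN_MODEL_IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/*
 * Returns true when both the offset inside the page and the length of
 * the written range are multiples of sectorsize.
 */
static bool align_model_bvec_ok(uint32_t bv_offset, uint32_t bv_len, uint32_t sectorsize)
{
	return ALIGN_MODEL_IS_ALIGNED(bv_offset, sectorsize) &&
	       ALIGN_MODEL_IS_ALIGNED(bv_len, sectorsize);
}
```

      With sectorsize equal to the page size, the same check degenerates
      into the old page alignment requirement.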
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      321a02db
    • Q
      btrfs: provide btrfs_page_clamp_*() helpers · 60e2d255
      Qu Wenruo committed
      In the coming subpage RW support, there are a lot of page status update
      calls which need to be converted to the subpage compatible version, which
      needs @start and @len.
      
      Some call sites already have such @start/@len and are already in
      page range, like various endio functions.
      
      But there are also call sites which need to clamp the range for the
      subpage case, like btrfs_dirty_pages() and __process_pages_contig().
      
      Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the
      clamp for subpage version.
      
      Although in theory all existing btrfs_page_*() calls could be converted
      to use btrfs_page_clamp_*() directly, that would make us do
      unnecessary clamp operations.
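      
      The clamp itself is simple arithmetic. A user-space sketch, assuming
      a 64K page and using hypothetical names (clamp_model_range and
      CLAMP_MODEL_PAGE_SIZE are not kernel identifiers):

```c
#include <assert.h>
#include <stdint.h>

#define CLAMP_MODEL_PAGE_SIZE 65536u	/* assumed 64K page for illustration */

/*
 * Clamp [*start, *start + *len) to the page that begins at page_start.
 * The caller must pass a range that overlaps the page.
 */
static void clamp_model_range(uint64_t page_start, uint64_t *start, uint32_t *len)
{
	uint64_t orig_start = *start;
	uint64_t orig_end = orig_start + *len;
	uint64_t page_end = page_start + CLAMP_MODEL_PAGE_SIZE;

	assert(orig_start < page_end && orig_end > page_start);

	*start = orig_start > page_start ? orig_start : page_start;
	*len = (uint32_t)((orig_end < page_end ? orig_end : page_end) - *start);
}
```

      Call sites whose range is already known to sit inside the page can
      skip this step, which is the reason for keeping two helper families.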
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      60e2d255
    • Q
      btrfs: refactor page status update into process_one_page() · ed8f13bf
      Qu Wenruo committed
      In __process_pages_contig() we update page status according to page_ops.
      
      That update process is a bunch of 'if' branches, which lie inside
      two loops, this makes it pretty hard to expand for later subpage
      operations.
      
      So this patch will extract these operations into their own function,
      process_one_page().
      
      Also since we're refactoring __process_pages_contig(), also move the new
      helper and __process_pages_contig() before the first caller of them, to
      remove the forward declaration.
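      
      The shape of the refactor, sketched in user-space C. The flag values
      and the ops_model_* names are assumptions for illustration; only the
      idea of moving the per-page 'if' branches out of the loop comes from
      the description above.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real page_ops bits may differ. */
enum {
	OPS_MODEL_CLEAR_DIRTY   = 1 << 0,
	OPS_MODEL_SET_WRITEBACK = 1 << 1,
	OPS_MODEL_END_WRITEBACK = 1 << 2,
	OPS_MODEL_SET_ORDERED   = 1 << 3,
};

struct ops_model_page {
	bool dirty;
	bool writeback;
	bool ordered;
};

/* All status updates for a single page live in one helper. */
static void ops_model_process_one_page(struct ops_model_page *page, unsigned long page_ops)
{
	if (page_ops & OPS_MODEL_SET_ORDERED)
		page->ordered = true;
	if (page_ops & OPS_MODEL_CLEAR_DIRTY)
		page->dirty = false;
	if (page_ops & OPS_MODEL_SET_WRITEBACK)
		page->writeback = true;
	if (page_ops & OPS_MODEL_END_WRITEBACK)
		page->writeback = false;
}

/* The loop body shrinks to a single call. */
static void ops_model_process_pages(struct ops_model_page *pages, int nr, unsigned long page_ops)
{
	int i;

	for (i = 0; i < nr; i++)
		ops_model_process_one_page(&pages[i], page_ops);
}
```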
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed8f13bf
    • Q
      btrfs: pass bytenr directly to __process_pages_contig() · 98af9ab1
      Qu Wenruo committed
      As a preparation for incoming subpage support, we need bytenr passed to
      __process_pages_contig() directly, not the current page index.
      
      So change the parameter and all callers to pass bytenr in.
      
      With the modification, here we need to replace the old @index_ret with
      @processed_end for __process_pages_contig(), but this brings a small
      problem.
      
      Normally we follow the inclusive return value, meaning @processed_end
      should be the last byte we processed.
      
      If parameter @start is 0, and we failed to lock any page, then we would
      return @processed_end as -1, causing more problems for
      __unlock_for_delalloc().
      
      So here for @processed_end, we use two different return value patterns.
      If we have locked any page, @processed_end will be the last byte of
      the last locked page.
      Otherwise it will be @start.
      
      This change will impact lock_delalloc_pages(), so it needs to check
      @processed_end to only unlock the range if we have locked any.
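      
      The two return patterns can be modeled in user space as below. This
      is a sketch only: lock_model_pages() and lock_model_needs_unlock()
      are hypothetical, 4K pages are assumed, and the range is assumed to
      span more than one byte (end > start) as sector-aligned ranges do.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_MODEL_PAGE_SHIFT 12	/* assumed 4K pages */

/*
 * Pretend to lock every page covering [start, end] (end is inclusive).
 * busy_index is the page index that cannot be locked, UINT64_MAX for none.
 */
static int lock_model_pages(uint64_t start, uint64_t end, uint64_t busy_index,
			    uint64_t *processed_end)
{
	uint64_t index;

	*processed_end = start;	/* "nothing locked yet" marker */
	for (index = start >> LOCK_MODEL_PAGE_SHIFT;
	     index <= end >> LOCK_MODEL_PAGE_SHIFT; index++) {
		uint64_t page_last = ((index + 1) << LOCK_MODEL_PAGE_SHIFT) - 1;

		if (index == busy_index)
			return -EAGAIN;
		/* Inclusive end of what has been locked so far. */
		*processed_end = page_last < end ? page_last : end;
	}
	return 0;
}

/* Caller side: only unlock when at least one page was locked. */
static bool lock_model_needs_unlock(uint64_t start, uint64_t processed_end)
{
	return processed_end > start;
}
```

      Returning @start for the "nothing locked" case avoids the underflow
      to -1 that an inclusive end would produce when @start is 0.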
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      98af9ab1
    • Q
      btrfs: fix hang when run_delalloc_range() failed · 968f2566
      Qu Wenruo committed
      [BUG]
      When running subpage preparation patches on x86, btrfs/125 will hang
      forever with one ordered extent never finished.
      
      [CAUSE]
      The test case btrfs/125 itself will always fail as the fix is never merged.
      
      When the test fails at balance, btrfs needs to clean up the ordered
      extent in btrfs_cleanup_ordered_extents() for the data reloc inode.
      
      The problem is in the order in which we clean up the page Ordered bit.
      
      Currently it works like:
      
        btrfs_cleanup_ordered_extents()
        |- find_get_page();
        |- btrfs_page_clear_ordered(page);
        |  Now the page doesn't have Ordered bit anymore.
        |  !!! This also includes the first (locked) page !!!
        |
        |- offset += PAGE_SIZE
        |  This is to skip the first page
        |- __endio_write_update_ordered()
           |- btrfs_mark_ordered_io_finished(NULL)
              Except the first page, all ordered extents are finished.
      
      Then the locked page is cleaned up in __extent_writepage():
      
        __extent_writepage()
        |- If (PageError(page))
        |- end_extent_writepage()
           |- btrfs_mark_ordered_io_finished(page)
              |- if (btrfs_test_page_ordered(page))
              |-  !!! The page gets skipped !!!
                  The ordered extent is not decreased as the page doesn't
                  have ordered bit anymore.
      
      This leaves the ordered extent with bytes_left == sectorsize, so it
      never finishes.
      
      [FIX]
      The fix is to ensure we never clear page Ordered bit without running the
      ordered extent accounting.
      
      Here we choose to skip the locked page in
      btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can
      properly finish the ordered extent.
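      
      The accounting problem can be reproduced in a toy user-space model.
      Everything here is hypothetical (the hang_model_* names, four 4K
      pages, page 0 playing the locked page) and simplified: it only shows
      why clearing the Ordered bit without accounting strands bytes_left.

```c
#include <stdbool.h>
#include <stdint.h>

#define HANG_MODEL_NR_PAGES 4
#define HANG_MODEL_PAGE_SIZE 4096u

struct hang_model_extent {
	uint64_t bytes_left;
	bool page_ordered[HANG_MODEL_NR_PAGES];
};

static void hang_model_init(struct hang_model_extent *oe)
{
	int i;

	oe->bytes_left = (uint64_t)HANG_MODEL_NR_PAGES * HANG_MODEL_PAGE_SIZE;
	for (i = 0; i < HANG_MODEL_NR_PAGES; i++)
		oe->page_ordered[i] = true;
}

/* Accounting only happens for pages that still carry the Ordered bit. */
static void hang_model_mark_io_finished(struct hang_model_extent *oe, int page)
{
	if (!oe->page_ordered[page])
		return;
	oe->page_ordered[page] = false;
	oe->bytes_left -= HANG_MODEL_PAGE_SIZE;
}

/*
 * Error cleanup for the pages after the locked one (page 0).
 * skip_locked == false models the old behavior, which also cleared the
 * Ordered bit of the locked page without accounting for it.
 */
static void hang_model_cleanup(struct hang_model_extent *oe, bool skip_locked)
{
	int i;

	if (!skip_locked)
		oe->page_ordered[0] = false;
	for (i = 1; i < HANG_MODEL_NR_PAGES; i++)
		hang_model_mark_io_finished(oe, i);
}
```

      Running the cleanup and then finishing page 0 leaves one page worth
      of bytes_left in the old mode and zero in the fixed mode.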
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      968f2566
    • Q
      btrfs: rename PagePrivate2 to PageOrdered inside btrfs · f57ad937
      Qu Wenruo committed
      Inside btrfs we use Private2 page status to indicate we have an ordered
      extent with pending IO for the sector.
      
      But the page status name, Private2, tells us nothing about the bit
      itself, so this patch renames it to Ordered.
      It also adds a comment about the bit, so a reader who is still
      uncertain about the page Ordered status can find the explanation
      easily.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f57ad937
    • Q
      btrfs: refactor btrfs_invalidatepage() for subpage support · 3b835840
      Qu Wenruo committed
      This patch will refactor btrfs_invalidatepage() for the incoming subpage
      support.
      
      The involved modifications are:
      
      - Use while() loop instead of "goto again;"
      - Use single variable to determine whether to delete extent states
        Each branch will also have comments why we can or cannot delete the
        extent states
      - Do qgroup free and extent states deletion per-loop
        Current code can only work for PAGE_SIZE == sectorsize case.
      
      This refactor also makes it clear what we do for different sectors:
      
      - Sectors without ordered extent
        We're completely safe to remove all extent states for the sector(s)
      
      - Sectors with ordered extent, but no Private2 bit
        This means the endio has already been executed, we can't remove all
        extent states for the sector(s).
      
      - Sectors with ordered extent, still has Private2 bit
        This means we need to decrease the ordered extent accounting.
        And then it comes to two different variants:
      
        * We have finished and removed the ordered extent
          Then it's the same as "sectors without ordered extent"
        * We didn't finish the ordered extent
          We can remove some extent states, but not all.
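      
      The three cases boil down to a small decision function. A sketch with
      a hypothetical name and parameters, answering only the question "may
      all extent states for the sector(s) be removed?":

```c
#include <stdbool.h>

/*
 * has_ordered_extent:      an ordered extent covers the sector(s)
 * has_ordered_bit:         the sector(s) still carry the Private2 bit
 * ordered_extent_finished: decreasing the accounting finished and
 *                          removed the ordered extent
 */
static bool inval_model_can_delete_all_states(bool has_ordered_extent,
					      bool has_ordered_bit,
					      bool ordered_extent_finished)
{
	if (!has_ordered_extent)
		return true;	/* nothing else references the states */
	if (!has_ordered_bit)
		return false;	/* endio already ran for these sectors */
	/* Same as "no ordered extent" only if we finished it ourselves. */
	return ordered_extent_finished;
}
```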
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3b835840
    • Q
      btrfs: introduce btrfs_lookup_first_ordered_range() · c095f333
      Qu Wenruo committed
      Although we already have btrfs_lookup_first_ordered_extent() and
      btrfs_lookup_ordered_extent(), they all have their own limitations:
      
      - btrfs_lookup_ordered_extent() can't do extra range check
      
        It's only designed to lookup any ordered extent before certain bytenr.
      
      - btrfs_lookup_first_ordered_extent() may not return the first ordered
        extent in the range
      
        It doesn't ensure the first ordered extent is returned.
        The existing callers are only interested in exhausting all ordered
        extents in a range, the order is not important.
      
      For the incoming btrfs_invalidatepage() refactoring, we need a way to
      properly iterate over all ordered extents of a range in bytenr order.
      
      So this patch will introduce a new function,
      btrfs_lookup_first_ordered_range(), to do ordered extent lookup with
      bytenr order awareness and an extra range check.
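      
      The intended semantics can be sketched over a sorted array. This is
      an assumption-laden model, not the kernel function: the lookup_model_*
      names are hypothetical and a linear scan stands in for the real
      search structure.

```c
#include <stddef.h>
#include <stdint.h>

struct lookup_model_extent {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/*
 * entries[] is sorted by file_offset and non-overlapping. Return the
 * entry with the lowest file_offset that intersects
 * [file_offset, file_offset + len), or NULL if there is none.
 */
static const struct lookup_model_extent *
lookup_model_first_in_range(const struct lookup_model_extent *entries,
			    size_t nr, uint64_t file_offset, uint64_t len)
{
	uint64_t end = file_offset + len;	/* exclusive */
	size_t i;

	for (i = 0; i < nr; i++) {
		const struct lookup_model_extent *oe = &entries[i];

		if (oe->file_offset >= end)
			break;	/* sorted: nothing later can intersect */
		if (oe->file_offset + oe->num_bytes > file_offset)
			return oe;
	}
	return NULL;
}
```

      A caller can iterate a range in order by restarting the lookup just
      past the end of the extent it got back.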
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c095f333