1. 10 Dec, 2020 15 commits
  2. 08 Dec, 2020 25 commits
    • btrfs: introduce ZONED feature flag · 7b3d5a90
      Naohiro Aota committed
      This patch introduces the ZONED incompat flag. The flag indicates that
      the volume management will satisfy the constraints imposed by
      host-managed zoned block devices (aligned chunk allocation, append-only
      updates, zone reset once a zone is filled).
      
      Zoned support will land incrementally because it requires enhancing some
      core infrastructure such as super block writes, tree-log and RAID
      support, so the feature will appear in sysfs only on debug builds. It
      will be enabled once the support is feature complete and applications
      can reliably check whether zoned support is present.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
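The ZONED commit above relies on incompat flag semantics: an implementation that does not know a given incompat bit must refuse to use the volume. A minimal userspace sketch of that rule follows. All names and bit positions here are hypothetical and chosen only for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this example. */
#define EXAMPLE_INCOMPAT_RAID   (1ULL << 0)
#define EXAMPLE_INCOMPAT_ZONED  (1ULL << 1)

/* Incompat bits that this hypothetical implementation can handle. */
#define EXAMPLE_INCOMPAT_SUPPORTED  (EXAMPLE_INCOMPAT_RAID)

/*
 * Return true if the volume may be used: every incompat bit set on disk
 * must also be present in the supported mask.
 */
bool example_can_mount(uint64_t disk_flags, uint64_t supported)
{
	return (disk_flags & ~supported) == 0;
}

void example_can_mount_demo(void)
{
	/* A known feature is accepted. */
	assert(example_can_mount(EXAMPLE_INCOMPAT_RAID,
				 EXAMPLE_INCOMPAT_SUPPORTED));
	/* An unknown incompat bit (ZONED here) is rejected. */
	assert(!example_can_mount(EXAMPLE_INCOMPAT_ZONED,
				  EXAMPLE_INCOMPAT_SUPPORTED));
}
```

Adding the ZONED bit to the supported mask is what "enabling" the feature would mean in this model.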
    • btrfs: return bool from btrfs_should_end_transaction · a2633b6a
      Nikolay Borisov committed
      Results in slightly smaller code.
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-11 (-11)
      Function                                     old     new   delta
      btrfs_should_end_transaction                  96      85     -11
      Total: Before=20070, After=20059, chg -0.05%
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • 8a8f4dea
    • btrfs: remove err variable from do_relocation · 8df01fdd
      Nikolay Borisov committed
      It simply gets assigned to 'ret' in case of errors. The flow of the
      while loop is not changed by this commit since the few call sites
      that 'goto next' will simply break from the loop.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: eliminate err variable from merge_reloc_root · c6a592f2
      Nikolay Borisov committed
      In most cases when an error is returned from a function 'ret' is simply
      assigned to 'err'. There is only one case where walk_up_reloc_tree can
      return a positive value - in this case the code breaks from the loop and
      ret is going to get its return value from btrfs_cow_block - either 0 or
      negative. This retains the old logic of how 'err' used to be set at
      this call site.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove err variable from btrfs_delete_subvolume · ee0d904f
      Nikolay Borisov committed
      Use only a single 'ret' to control whether we should abort the
      transaction or not. That's fine, because if we abort a transaction then
      btrfs_end_transaction will return the same value as passed to
      btrfs_abort_transaction. No semantic changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unlock path before checking if extent is shared during nocow writeback · c65ca98f
      Filipe Manana committed
      When we are attempting to start writeback for an existing extent in NOCOW
      mode, at run_delalloc_nocow(), we must check if the extent is shared, and
      if it is, fall back to a COW write. However, we do this check while still
      holding a read lock on the leaf that contains the file extent item, and
      that check, the call to btrfs_cross_ref_exist(), can take some time
      because:
      
      1) It needs to do a search on the extent tree, which takes some time,
         especially if delayed references are being run at the moment, as we
         can block when trying to lock btree nodes that are currently write
         locked;
      
      2) It needs to check the delayed references for any existing reference
         for our data extent. This requires acquiring the delayed references'
         spinlock and may block on the mutex of a delayed reference head if
         there is a delayed reference for our data extent. In the worst case
         it makes us release the path on the extent tree and retry the whole
         process (going back to step 1).
      
      There are other operations we do while holding the leaf locked that can
      take significant time as well (especially all together):
      
      * btrfs_extent_readonly() - to check if the block group containing the
        extent is currently in RO mode. This requires taking a spinlock and
        searching for the block group in a rbtree that can be big on large
        filesystems;
      
      * csum_exist_in_range() - to search if there are any checksums in the
        csum tree for the extent. Like before, this can take some time if we are
        in a filesystem that has both COW and NOCOW files, in which case the
        csum tree is not empty;
      
      * btrfs_inc_nocow_writers() - increment the number of nocow writers in the
        block group that contains the data extent. Needs to acquire a spinlock
        and search for the block group in a rbtree that can be big on large
        filesystems.
      
      So just unlock the leaf (release the path) before doing all those checks,
      since we do not need it anymore. In case we cannot do a NOCOW write for
      the extent, due to any of those checks failing, and the writeback range
      goes beyond that extent's length, we will do another btree search for the
      next file extent item.
      
      The following script that calls dbench was used to measure the impact of
      this change on a VM with 8 CPUs, 16GB of RAM, using a raw NVMe device
      directly (no intermediary filesystem on the host) and using a non-debug
      kernel (default configuration on Debian):
      
        $ cat test-dbench.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-m single -d single"
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 300 64
      
        umount $MNT
      
      Before this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9326331     0.317   399.957
       Close        6851198     0.002     6.402
       Rename        394894     2.621   402.819
       Unlink       1883131     0.931   398.082
       Deltree          256    19.160   303.580
       Mkdir            128     0.003     0.016
       Qpathinfo    8452314     0.068   116.133
       Qfileinfo    1481921     0.001     5.081
       Qfsinfo      1549963     0.002     4.444
       Sfileinfo     759679     0.084    17.079
       Find         3268168     0.396   118.196
       WriteX       4653310     0.056   110.993
       ReadX        14618818     0.005    23.314
       LockX          30364     0.003     0.497
       UnlockX        30364     0.002     1.720
       Flush         653619    16.954   569.299
      
      Throughput 966.651 MB/sec  64 clients  64 procs  max_latency=569.377 ms
      
      After this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9710433     0.302   232.449
       Close        7132948     0.002    11.496
       Rename        411144     2.452   131.805
       Unlink       1960961     0.893   230.383
       Deltree          256    14.858   198.646
       Mkdir            128     0.002     0.005
       Qpathinfo    8800890     0.066   111.588
       Qfileinfo    1542556     0.001     3.852
       Qfsinfo      1613835     0.002     5.483
       Sfileinfo     790871     0.081    19.492
       Find         3402743     0.386   120.185
       WriteX       4842918     0.054   179.312
       ReadX        15220407     0.005    32.435
       LockX          31612     0.003     1.533
       UnlockX        31612     0.002     1.047
       Flush         680567    16.320   463.323
      
      Throughput 1016.59 MB/sec  64 clients  64 procs  max_latency=463.327 ms
      
      +5.0% throughput, -20.5% max latency
      
      Also, the following test using fio was run:
      
        $ cat test-fio.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 4 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo
        echo "Using fio config:"
        echo
        cat /tmp/fio-job.ini
        echo
        echo "mount options: $MOUNT_OPTIONS"
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo "Creating nodatacow files before fio runs..."
        for ((i = 0; i < $NUM_JOBS; i++)); do
            xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0"
        done
        sync
      
        fio /tmp/fio-job.ini
        umount $MNT
      
      Before this change:
      
      $ ./test-fio.sh 16 512M 2 4K
      (...)
      WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec
      
      After this change:
      
      $ ./test-fio.sh 16 512M 2 4K
      (...)
      WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec
      
      +9.7% throughput, -9.8% runtime
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
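The summary percentages in the commit above (+5.0% / -20.5% for dbench, +9.7% / -9.8% for fio) are not relative to the "before" value. They match a change taken relative to the mean of the two measurements. That formula is an assumption inferred from the numbers, not something the commit states. A small sketch that reproduces the figures under that assumption (the function names are made up for this example):

```c
#include <assert.h>

/*
 * Percentage change of 'after' relative to the mean of both values.
 * Assumption: this is how the summary lines were derived.
 */
double sym_pct_change(double before, double after)
{
	return (after - before) / ((before + after) / 2.0) * 100.0;
}

/* True when two percentages are within 0.06 of each other. */
int pct_near(double a, double b)
{
	double d = a - b;

	return d > -0.06 && d < 0.06;
}

void sym_pct_change_demo(void)
{
	/* dbench throughput, MB/sec */
	assert(pct_near(sym_pct_change(966.651, 1016.59), 5.0));
	/* dbench max latency, ms */
	assert(pct_near(sym_pct_change(569.377, 463.327), -20.5));
	/* fio bandwidth, MiB/s */
	assert(pct_near(sym_pct_change(28.3, 31.2), 9.7));
	/* fio runtime, msec */
	assert(pct_near(sym_pct_change(289800.0, 262845.0), -9.8));
}
```

Measured against the "before" value instead, the dbench figures would come out as roughly +5.2% throughput and -18.6% max latency.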
    • btrfs: tree-checker: annotate all error branches as unlikely · c7c01a4a
      David Sterba committed
      The tree checker is called many times as it verifies metadata at
      read/write time. The checks follow a simple pattern:
      
        if (error_condition) {
                report_error();
                return -EUCLEAN;
        }
      
      All the error reporting functions are annotated as __cold, which is
      supposed to hint the compiler to move the statement block out of the hot
      path. This does not seem to happen that often.
      
      As the error condition is expected to be false almost always, we can
      annotate it with 'unlikely' as this satisfies one of the few use cases
      for the annotation. The expected outcome is a stronger hint to the
      compiler to reorder the checks into the following layout:
      
        test
        jump to exit
        test
        jump to exit
        ...
      
      which can be observed in the asm of e.g. check_dir_item,
      btrfs_check_chunk_valid, check_root_item or check_leaf.
      
      There's a measurable run time improvement reported by Josef: the testing
      workload went from 655 MiB/s to 677 MiB/s, which is about +3%.
      
      There should be no functional changes, but some of the conditions have
      been rewritten to be more readable, and some lines are longer than 80
      characters for the sake of readability.
      Signed-off-by: David Sterba <dsterba@suse.com>
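The annotation pattern described in the tree-checker commit can be tried in userspace. The sketch below defines a stand-in for the kernel's unlikely() on top of the GCC/Clang builtin __builtin_expect. The check function, its name and its parameters are hypothetical; only the shape (test, report, return -EUCLEAN) follows the commit text.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace stand-in for the kernel macro; needs GCC or Clang. */
#define unlikely(x) __builtin_expect(!!(x), 0)

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, defined here only for portability */
#endif

/* Hypothetical check in the tree-checker style: reject an oversized item. */
int example_check_item_size(uint32_t item_size, uint32_t max_size)
{
	if (unlikely(item_size > max_size)) {
		/* A real checker would report the error here. */
		return -EUCLEAN;
	}
	return 0;
}

void example_check_item_size_demo(void)
{
	assert(example_check_item_size(16, 16) == 0);
	assert(example_check_item_size(17, 16) == -EUCLEAN);
}
```

The hint changes only code layout and branch prediction assumptions, not the result of the check.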
    • btrfs: remove stub device info from messages when we have no fs_info · a0f6d924
      David Sterba committed
      With a NULL fs_info the helpers will print something like
      
      	BTRFS error (device <unknown>): ...
      
      This can happen in contexts where fs_info is not available at all or
      it's potentially unsafe due to object lifetime. The <unknown> stub does
      not bring much information and, together with the prefix, makes the
      message unnecessarily long.
      
      Remove it for the NULL fs_info case, so the message becomes:
      
      	BTRFS error: ...
      
      Callers can add the device information to the message itself if needed.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use detach_page_private() in alloc_extent_buffer() · fb22e9c4
      Qu Wenruo committed
      In alloc_extent_buffer(), after we get a page from the btree inode, we
      check if that page has a private pointer attached.
      
      If attached, we check if the existing extent buffer has proper refs.
      If not (the eb is being freed), we will detach that private eb pointer.
      
      The point here is, we are detaching that eb pointer by calling:
      - ClearPagePrivate()
      - put_page()
      
      The put_page() here is especially confusing, as it's decreasing the ref
      from attach_page_private().  Without knowing that, it looks like the
      put_page() is for the find_or_create_page() call, confusing the reader.
      
      Since we're always modifying page private with attach_page_private() and
      detach_page_private(), the only open-coded detach_page_private() here is
      really confusing.
      
      Fix it by calling detach_page_private().
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
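The reference counting that made the open-coded sequence confusing can be modelled in userspace. In the sketch below, struct mock_page and both helpers are simplified stand-ins written for this example, not the kernel's struct page or its real helpers. They only mirror the pairing described above: attaching takes a page reference, detaching clears the private pointer and drops that same reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace model of a page; NOT the kernel's struct page. */
struct mock_page {
	int refcount;
	bool has_private;
	void *private_data;
};

/* Attach private data and take one extra reference on the page. */
void mock_attach_page_private(struct mock_page *page, void *data)
{
	page->refcount++;
	page->private_data = data;
	page->has_private = true;
}

/* Undo the attach: clear the pointer, drop the reference, return old data. */
void *mock_detach_page_private(struct mock_page *page)
{
	void *data = page->private_data;

	if (!page->has_private)
		return NULL;
	page->has_private = false;
	page->private_data = NULL;
	page->refcount--;
	return data;
}

void mock_page_private_demo(void)
{
	/* refcount 1 models the reference from looking up the page */
	struct mock_page page = { .refcount = 1 };
	int eb = 0;	/* stands in for an extent buffer */

	mock_attach_page_private(&page, &eb);
	assert(page.refcount == 2);
	assert(mock_detach_page_private(&page) == &eb);
	assert(page.refcount == 1);	/* lookup reference is untouched */
}
```

With the pair kept symmetric, it is clear that the dropped reference belongs to the attach and not to the page lookup.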
    • btrfs: use nodesize to determine if we need readahead in btrfs_lookup_bio_sums · 35478d05
      Qu Wenruo committed
      In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
      start readahead in the csum tree.
      
      However, the threshold is a magic number, (PAGE_SIZE * 8), from the
      initial btrfs merge.
      
      The meaning of the value is pretty hard to guess, especially when the
      magic number is from the times when 4K sectorsize was the default
      and only CRC32C was supported.
      
      For the most common btrfs setup, CRC32C csum and 4K sectorsize,
      it means a read of just 32K would kick readahead, while the csum itself
      is only 32 bytes in size.
      
      Now let's be more reasonable by taking both csum size and node size into
      consideration.
      
      If the csum size for the bio is larger than one leaf, then we kick the
      readahead.  This means for current default btrfs, the threshold will be
      16M.
      
      This change should not change performance observably, thus this is
      mostly a readability enhancement.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
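The two numbers quoted in the readahead commit (32 bytes of csum for a 32K read, and a 16M threshold with defaults) can be checked with a few lines of arithmetic. The sketch below assumes a 4-byte CRC32C checksum, a 4K sectorsize and a 16K nodesize, and it ignores leaf header and item overhead. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of checksum needed to cover 'bytes' of data. */
uint64_t example_csum_bytes(uint64_t bytes, uint32_t sectorsize,
			    uint32_t csum_size)
{
	return bytes / sectorsize * csum_size;
}

/*
 * Data size whose checksums fill one node, ignoring header overhead.
 * Reads larger than this would trigger readahead in this model.
 */
uint64_t example_readahead_threshold(uint32_t nodesize, uint32_t sectorsize,
				     uint32_t csum_size)
{
	return (uint64_t)(nodesize / csum_size) * sectorsize;
}

void example_readahead_demo(void)
{
	/* Old threshold: PAGE_SIZE * 8 = 32K with 4K pages -> 32 bytes. */
	assert(example_csum_bytes(8 * 4096, 4096, 4) == 32);
	/* New threshold with 16K nodes: 4096 csums * 4K = 16M. */
	assert(example_readahead_threshold(16384, 4096, 4) ==
	       16ULL * 1024 * 1024);
}
```

A larger checksum lowers the threshold in this model, for example to 2M with a 32-byte checksum and the same node and sector sizes.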
    • btrfs: only clear EXTENT_LOCK bit in extent_invalidatepage · 829ddec9
      Qu Wenruo committed
      extent_invalidatepage() will try to clear all possible bits since it's
      calling clear_extent_bit() with delete == 1.
      
      This is currently fine, since the btree io tree only utilizes the
      EXTENT_LOCKED bit.  But this could be a problem for later subpage
      support, which will utilize extra io tree bits to represent additional
      info.
      
      This patch will just convert that clear_extent_bit() to
      unlock_extent_cached().
      
      For current code, since only the EXTENT_LOCKED bit is utilized, this
      doesn't change the behavior, but provides a much cleaner basis for
      incoming subpage support.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused parameter phy_offset from btrfs_validate_metadata_buffer · 8e1dc982
      Qu Wenruo committed
      Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector.
      @phy_offset is mostly for data io to lookup the csum in btrfs_io_bio.
      
      But for metadata, it's completely useless as metadata stores its own
      csum in its header, so we can remove it.
      
      Note: parameters @start and @end are not utilized at all for the
      current sectorsize == PAGE_SIZE case, as we can grab the eb directly
      from the page.
      
      But those two parameters are very important for later subpage support,
      thus @start/@end are not touched here.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove the anonymous structure from scrub_page · 2c363954
      Qu Wenruo committed
      That anonymous structure serves no special purpose, so just replace it
      with regular members.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use fixed width int type for extent_state::state · f97e27e9
      Qu Wenruo committed
      Currently the type is unsigned int which could change its width
      depending on the architecture. We need up to 32 bits so make it
      explicit.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce helper to handle page status update in end_bio_extent_readpage() · e09caaf9
      Qu Wenruo committed
      Introduce a new helper to handle the page status update in
      end_bio_extent_readpage(). This will be used later for subpage support,
      where the page status update can be more complex than now.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add structure to keep track of extent range in end_bio_extent_readpage · 94e8c95c
      Qu Wenruo committed
      In end_bio_extent_readpage() we had a strange dance around
      extent_start/extent_len.
      
      All that hides behind the strange dance is a call to
      endio_readpage_release_extent() on each bvec range.
      
      Here is an example to explain the original workflow:
      
        Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
      
        end_bio_extent_readpage() entered
        |- extent_start = 0;
        |- extent_len = 0;
        |- bio_for_each_segment_all() {
        |  |- /* Got the 1st bvec */
        |  |- start = SZ_1M;
        |  |- end = SZ_1M + SZ_4K - 1;
        |  |- uptodate = 1;
        |  |- if (extent_len == 0) {
        |  |  |- extent_start = start; /* SZ_1M */
        |  |  |- extent_len = end + 1 - start; /* SZ_4K */
        |  |  }
        |  |
        |  |- /* Got the 2nd bvec */
        |  |- start = SZ_1M + SZ_4K;
        |  |- end = SZ_1M + SZ_8K - 1;
        |  |- uptodate = 1;
        |  |- if (extent_start + extent_len == start) {
        |  |  |- extent_len += end + 1 - start; /* SZ_8K */
        |  |  }
        |  } /* All bio vec iterated */
        |
        |- if (extent_len) {
           |- endio_readpage_release_extent(tree, extent_start, extent_len,
                                            uptodate);
              /* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
      
      As the above flow shows, the existing code in end_bio_extent_readpage()
      accumulates extent_start/extent_len, and when the contiguous range
      stops, calls endio_readpage_release_extent() for the range.
      
      However, the current behavior leaves a few things unconsidered:
      
      - The inode can change
        For a bio, its pages don't need to have contiguous page_offset.
        This means even pages from different inodes can be packed into one
        bio.
      
      - A bvec can cross a page boundary
        There is a feature called multi-page bvec, where bvec->bv_len can go
        beyond the bvec->bv_page boundary.
      
      - Poor readability
      
      This patch addresses these problems:
      
      - Introduce a proper structure, processed_extent, to record the
        processed extent range
      
      - Integrate the inode/start/end/uptodate check into
        endio_readpage_release_extent()
      
      - Add more comments on each step.
        This should greatly improve the readability, now in
        end_bio_extent_readpage() there are only two
        endio_readpage_release_extent() calls.
      
      - Add inode check for contiguity
        Now we also ensure the inode is the same one before checking if the
        range is contiguous.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
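The accumulation rule described in the commit above (extend the recorded range only when the inode, the uptodate status and contiguity all match, otherwise release and start over) can be sketched as a small userspace model. Only the name processed_extent and the inode/start/end/uptodate fields come from the commit text. The struct layout, the release counter and the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model; the 'releases' counter exists only for this demo. */
struct example_processed_extent {
	uint64_t inode;		/* inode number, 0 means nothing recorded */
	uint64_t start;
	uint64_t end;		/* inclusive */
	bool uptodate;
	unsigned int releases;	/* how many ranges were released so far */
};

/* Stand-in for releasing (unlocking) the recorded range. */
void example_release(struct example_processed_extent *p)
{
	if (p->inode)
		p->releases++;
	p->inode = 0;
}

/*
 * Merge [start, end] into the record when it belongs to the same inode,
 * has the same uptodate status and starts right after the recorded end.
 * Otherwise release the recorded range and start a new one.
 */
void example_process(struct example_processed_extent *p, uint64_t inode,
		     uint64_t start, uint64_t end, bool uptodate)
{
	if (p->inode == inode && p->uptodate == uptodate &&
	    p->end + 1 == start) {
		p->end = end;
		return;
	}
	example_release(p);
	p->inode = inode;
	p->start = start;
	p->end = end;
	p->uptodate = uptodate;
}

void example_process_demo(void)
{
	struct example_processed_extent p = { 0 };
	const uint64_t m = 1024 * 1024, k4 = 4096;

	/* Two contiguous 4K ranges of inode 257 are merged. */
	example_process(&p, 257, m, m + k4 - 1, true);
	example_process(&p, 257, m + k4, m + 2 * k4 - 1, true);
	assert(p.releases == 0 && p.start == m && p.end == m + 2 * k4 - 1);

	/* A contiguous offset in a different inode must not be merged. */
	example_process(&p, 258, m + 2 * k4, m + 3 * k4 - 1, true);
	assert(p.releases == 1);

	example_release(&p);	/* final release after the loop */
	assert(p.releases == 2);
}
```

The one call inside example_process() and the final call after the loop correspond to the two release calls the commit mentions.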
    • btrfs: tests: remove invalid extent-io test · b1d51f67
      Qu Wenruo committed
      In extent-io-test, there are two invalid tests:
      
      - Invalid nodesize for test_eb_bitmaps()
        Instead of the sectorsize and nodesize combination passed in, we're
        always using a hand-crafted nodesize, e.g.:
      
      	len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE)
      		? sectorsize * 4 : sectorsize;
      
        In the above case, if we have a 32K page size, then we will get a
        length of 128K, which is beyond the max node size, and obviously invalid.
      
        The common page sizes go up to 64K, so we haven't hit that.
      
      - Invalid extent buffer bytenr
        For 64K page size, the only combination we're going to test is
        sectorsize = nodesize = 64K.
        However, in that case we will try to test an eb whose bytenr is not
        sectorsize aligned:
      
      	/* Do it over again with an extent buffer which isn't page-aligned. */
      	eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len);
      
        Sector alignment is a hard requirement for any sector size.
        The only exception is the superblock. Anything else should follow
        sector size alignment.
      
        This is definitely an invalid test case.
      
      This patch will fix both problems by:
      
      - Honor the sectorsize/nodesize combination
        Now we won't bother to hand-craft the length and use it as nodesize.
      
      - Use sectorsize as the 2nd run extent buffer start
        This would test the case where extent buffer is aligned to sectorsize
        but not always aligned to nodesize.
      
      Please note that later subpage related cleanup will reduce
      extent_buffer::pages[] to exactly what we need, so operations on a
      sector unaligned extent buffer would cause problems.
      
      Since only extent_io self tests utilize this, this patch is required for
      all later cleanup/refactoring.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
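The first problem in the commit above is easy to verify by evaluating the hand-crafted length expression for a few sector sizes. The sketch below copies the expression from the commit text. The macro name is made up for this example, and its value assumes a 64K maximum metadata block size.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed maximum node size of 64K; the name is local to this example. */
#define EXAMPLE_MAX_METADATA_BLOCKSIZE 65536u

/* The hand-crafted length from the removed test code. */
uint32_t example_handcrafted_len(uint32_t sectorsize)
{
	return (sectorsize < EXAMPLE_MAX_METADATA_BLOCKSIZE)
		? sectorsize * 4 : sectorsize;
}

void example_handcrafted_len_demo(void)
{
	/* 4K and 64K sectors give valid node sizes (16K and 64K). */
	assert(example_handcrafted_len(4096) == 16384);
	assert(example_handcrafted_len(65536) == 65536);
	/* 32K sectors give 128K, which exceeds the 64K maximum. */
	assert(example_handcrafted_len(32768) == 131072);
	assert(example_handcrafted_len(32768) > EXAMPLE_MAX_METADATA_BLOCKSIZE);
}
```

Of the power-of-two sizes from 4K to 64K, only 32K ends up above the limit.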
    • btrfs: sysfs: remove unneeded semicolon · 445d8ab5
      Tom Rix committed
      A semicolon is not needed after a switch statement.
      Signed-off-by: Tom Rix <trix@redhat.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify return values in setup_nodes_for_search · 95b982de
      Nikolay Borisov committed
      The function is needlessly convoluted. Fix that by:
      
      * removing the redundant sret variable definition in both if arms
      
      * replacing the again/done labels with direct return statements, as the
        function is short enough and doesn't do anything special upon exit
      
      * removing the BUG_ON on split_node returning a positive number - it
        can't happen as split_node returns either 0 or a negative error code.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove useless return value statement in split_node · d5286a92
      Nikolay Borisov committed
      At the point when we set 'ret = 0' it's guaranteed that the function is
      going to return 0 so directly return 0. No functional changes.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary attempt to drop extent maps after adding inline extent · f30bed83
      Filipe Manana committed
      At inode.c:cow_file_range_inline(), after we insert the inline extent
      in the fs/subvolume btree, we call btrfs_drop_extent_cache() to drop
      all extent maps in the file range. However, that is not necessary because
      we have already done it in the call to btrfs_drop_extents(), which calls
      btrfs_drop_extent_cache() for us. Since at this point we have the file
      range locked in the inode's iotree (we are in the writeback path), we know
      no other task can come in and read stale file extent items or find none
      and therefore create either stale extent maps or an extent map that
      represents a hole.
      
      So just remove that unnecessary call to btrfs_drop_extent_cache(), as it's
      doing nothing and only wasting time. This call has been around since 2008,
      introduced in commit c8b97818 ("Btrfs: Add zlib compression support"),
      but even back then it seems it was not necessary, since we had the range
      locked in the inode's iotree and the call to btrfs_drop_extents() already
      used to always call btrfs_drop_extent_cache().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop incrementing log batch when joining log transaction · bc5b5b1e
      Filipe Manana committed
      When joining a log transaction we acquire the root's log mutex, then
      increment the root's log batch and log writers counters while holding
      the mutex. However we don't need to increment the log batch there,
      because we are holding the mutex and incremented the log writers counter
      as well, so any other task trying to sync log will wait for the current
      task to finish its logging and still achieve the desired log batching.
      
      Since the log batch counter is an atomic counter and is incremented twice
      at the very beginning of the fsync callback (btrfs_sync_file()), once
      before flushing delalloc and once again after waiting for writeback to
      complete, eliminating its increment when joining the log transaction
      may provide some performance gains in case we have multiple concurrent
      tasks doing fsyncs against different files in the same subvolume, as it
      reduces contention on the atomic (locking the cacheline and bouncing it).
      
      When testing fio with 32 jobs, on an 8-core VM, doing fsyncs against
      different files of the same subvolume, on top of a zram device, I could
      consistently see gains (higher throughput) between 1% and 2%, which is a
      very low value and possibly hard to observe with a real device (I
      couldn't observe consistent gains with my low/mid end NVMe device).
      So this change is mostly motivated by simplifying the logic, as updating
      the log batch counter is only relevant when an fsync starts and while not
      holding the root's log mutex.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: skip unnecessary searches for xattrs when logging an inode · f2f121ab
      Filipe Manana committed
      Every time we log an inode we look up xattrs in the fs/subvol tree and,
      if we have any, log them into the log tree. However, it is very common to
      have inodes without any xattrs, so doing the search wastes time, but more
      importantly it adds contention on the fs/subvol tree locks, either making
      the logging code block and wait for tree locks or making other concurrent
      operations block and wait.
      
      The most typical use cases where xattrs are used are when capabilities or
      ACLs are defined for an inode, or when SELinux is enabled.
      
      This change makes the logging code detect when an inode does not have
      xattrs and skip the xattrs search the next time the inode is logged,
      unless the inode is evicted and loaded again or an xattr is added to the
      inode. This skips the search for xattrs on inodes that never have
      xattrs and are fsynced with some frequency.
      
      The following script that calls dbench was used to measure the impact of
      this change on a VM with 8 CPUs, 16GB of RAM, using a raw NVMe device
      directly (no intermediary filesystem on the host) and using a non-debug
      kernel (default configuration on Debian distributions):
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
      
        mkfs.btrfs -f -m single -d single $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 200 40
      
        umount $MNT
      
      The results before this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5761605     0.172   312.057
       Close        4232452     0.002    10.927
       Rename        243937     1.406   277.344
       Unlink       1163456     0.631   298.402
       Deltree          160    11.581   221.107
       Mkdir             80     0.003     0.005
       Qpathinfo    5221410     0.065   122.309
       Qfileinfo     915432     0.001     3.333
       Qfsinfo       957555     0.003     3.992
       Sfileinfo     469244     0.023    20.494
       Find         2018865     0.448   123.659
       WriteX       2874851     0.049   118.529
       ReadX        9030579     0.004    21.654
       LockX          18754     0.003     4.423
       UnlockX        18754     0.002     0.331
       Flush         403792    10.944   359.494
      
      Throughput 908.444 MB/sec  40 clients  40 procs  max_latency=359.500 ms
      
      The results after this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    6442521     0.159   230.693
       Close        4732357     0.002    10.972
       Rename        272809     1.293   227.398
       Unlink       1301059     0.563   218.500
       Deltree          160     7.796    54.887
       Mkdir             80     0.008     0.478
       Qpathinfo    5839452     0.047   124.330
       Qfileinfo    1023199     0.001     4.996
       Qfsinfo      1070760     0.003     5.709
       Sfileinfo     524790     0.033    21.765
       Find         2257658     0.314   125.611
       WriteX       3211520     0.040   232.135
       ReadX        10098969     0.004    25.340
       LockX          20974     0.003     1.569
       UnlockX        20974     0.002     3.475
       Flush         451553    10.287   331.037
      
      Throughput 1011.77 MB/sec  40 clients  40 procs  max_latency=331.045 ms
      
      +10.8% throughput, -8.2% max latency
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
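The caching idea in the xattr commit above can be illustrated with a tiny userspace model: remember that a search found no xattrs, skip later searches, and forget that result when an xattr is added. The struct, its fields and the function names below are hypothetical stand-ins for this example. The model also leaves out inode eviction, which would reset the cached result too.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical in-memory inode; not the kernel's structure. */
struct example_inode {
	bool no_xattrs;	/* cached "last search found nothing" result */
	int nr_xattrs;	/* stands in for xattr items in the tree */
	int searches;	/* counts the expensive tree searches */
};

/* Log the inode, searching for xattrs only when that may find some. */
void example_log_inode(struct example_inode *inode)
{
	if (inode->no_xattrs)
		return;		/* skip the search entirely */
	inode->searches++;	/* models the fs/subvol tree lookup */
	if (inode->nr_xattrs == 0)
		inode->no_xattrs = true;
}

/* Adding an xattr invalidates the cached result. */
void example_set_xattr(struct example_inode *inode)
{
	inode->nr_xattrs++;
	inode->no_xattrs = false;
}

void example_log_inode_demo(void)
{
	struct example_inode inode = { 0 };

	/* Three logs of an inode without xattrs cost one search. */
	example_log_inode(&inode);
	example_log_inode(&inode);
	example_log_inode(&inode);
	assert(inode.searches == 1);

	/* After an xattr is added, every log searches again. */
	example_set_xattr(&inode);
	example_log_inode(&inode);
	example_log_inode(&inode);
	assert(inode.searches == 3);
}
```

The saving applies to inodes that never gain xattrs, which the commit describes as the common case.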
    • btrfs: merge __set_extent_bit and set_extent_bit · 1cab5e72
      Nikolay Borisov committed
      There are only 2 direct calls to set_extent_bit outside of extent-io -
      in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are
      thin wrappers around __set_extent_bit. This adds unnecessary indirection
      and just makes it more annoying when looking at the various extent bit
      manipulation functions.  This patch renames __set_extent_bit to
      set_extent_bit effectively removing a level of indirection. No
      functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat and remove __must_check ]
      Signed-off-by: David Sterba <dsterba@suse.com>