1. 14 3月, 2022 2 次提交
    • F
      btrfs: avoid unnecessary COW of leaves when deleting items from a leaf · 7c4063d1
      Filipe Manana 提交于
      When we delete items from a leaf, if we end up with more than two thirds
      of unused leaf space, we try to delete the leaf by moving all its items
      into its left and right neighbour leaves. Sometimes that is not possible
      because there is not enough free space in the left and right leaves, and
      in that case we end up not deleting our leaf.
      
      The way we are doing this is not ideal and can be improved in the
      following ways:
      
      1) When we call push_leaf_left(), we pass a value of 1 byte to the data
         size parameter of push_leaf_left(). This is not realistic value because
         no item can have a size less than 25 bytes, which is the size of struct
         btrfs_item. This means that means that if the left leaf has not enough
         free space to push any item, we end up COWing it even if we end up not
         changing its content at all.
      
         COWing that leaf means allocating a new metadata extent, marking it
         dirty and doing more IO when committing a transaction or when syncing a
         log tree. For a log tree case, it's particularly more important to
         avoid the useless COW operation, as more IO can imply a higher latency
         for an fsync operation.
      
         So instead of passing 1 as the minimum data size for push_leaf_left(),
         pass the size of the first item in our leaf, as we don't want to COW
         the left leaf if we can't at least push the first item of our leaf;
      
      2) When we call push_leaf_right(), we also pass a value of 1 byte as the
         data size parameter of push_leaf_right(). Like the previous case, it
         will also result in COWing the right leaf even if we are not able to
         move any items into it, since there can't be any item with a size
         smaller than 25 bytes (the size of struct btrfs_item).
      
         So instead of passing 1 as the minimum data size to push_leaf_right(),
         pass a size that corresponds to the sum of the size of all the
         remaining items in our leaf. We are not interested in moving less than
         that, because if we do, we are not able to delete our leaf and we have
         COWed the right leaf for nothing. Plus, moving only some of the items
         of our leaf, it means an even less balanced tree.
      
         Just like the previous case, we want to avoid the useless COW of the
         right leaf, this way we don't have to spend time allocating one new
         metadata extent, and doing more IO when committing a transaction or
         syncing a log tree. For the log tree case it's specially more important
         because more IO can result in a higher latency for a fsync operation.
      
      So adjust the minimum data size passed to push_leaf_left() and
      push_leaf_right() as mentioned above.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      Not being able to delete a leaf that became less than 1/3 full after
      deleting items from it is actually common. For example, for the fio test
      mentioned in the changelog of patch 6/6, we are only able to delete a
      leaf at btrfs_del_items() about 5.3% of the time, due to its left and
      right neighbour leaves not having enough free space to push all the
      remaining items into them.
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7c4063d1
    • F
      btrfs: remove unnecessary leaf free space checks when pushing items · b4e098a9
      Filipe Manana 提交于
      When trying to push items from a leaf into its left and right neighbours,
      we lock the left or right leaf, check if it has the required minimum free
      space, COW the leaf and then check again if it has the minimum required
      free space. This second check is pointless:
      
      1) Most and foremost because it's not needed. We have a write lock on the
         leaf and on its parent node, so no one can come in and change either
         the pre-COW or post-COW version of the leaf for the whole duration of
         the push_leaf_left() and push_leaf_right() calls;
      
      2) The call to btrfs_leaf_free_space() is not trivial, it has a fair
         amount of arithmetic operations and access to fields in the leaf's
         header and items, so it's not very cheap.
      
      So remove the duplicated free space checks.
      
      This change if part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b4e098a9
  2. 07 1月, 2022 9 次提交
    • N
      btrfs: refactor unlock_up · c1227996
      Nikolay Borisov 提交于
      The purpose of this function is to unlock all nodes in a btrfs path
      which are above 'lowest_unlock' and whose slot used is different than 0.
      As such it used slightly awkward structure of 'if' as well as somewhat
      cryptic "no_skip" control variable which denotes whether we should
      check the current level of skipability or no.
      
      This patch does the following (cosmetic) refactorings:
      
      * Renames 'no_skip' to 'check_skip' and makes it a boolean. This
        variable controls whether we are below the lowest_unlock/skip_level
        levels.
      
      * Consolidates the 2 conditions which warrant checking whether the
        current level should be skipped under 1 common if (check_skip) branch,
        this increase indentation level but is not critical.
      
      * Consolidates the 'skip_level < i && i >= lowest_unlock' and
        'i >= lowest_unlock && i > skip_level' condition into a common branch
        since those are identical.
      
      * Eliminates the local extent_buffer variable as in this case it doesn't
        bring anything to function readability.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c1227996
    • F
      btrfs: remove stale comment about locking at btrfs_search_slot() · 727e6060
      Filipe Manana 提交于
      The comment refers to the old extent buffer locking code, where we used to
      have custom locks that had blocking and spinning behaviour modes. That is
      not the case anymore, since we have transitioned to rw semaphores, so the
      comment does not offer any value anymore. Remove it.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      727e6060
    • F
      btrfs: remove BUG_ON() after splitting leaf · bb8e9a60
      Filipe Manana 提交于
      After calling split_leaf() we BUG_ON() if the returned value is greater
      than zero. However split_leaf() only returns 0, in case of success, or a
      negative value in case of an error.
      
      The reason for the BUG_ON() is that if we ever get a positive return
      value from split_leaf(), we can not simply propagate it to the callers
      of btrfs_search_slot(), as that would be interpreted as "key not found"
      and not as an error. That means it could result in callers ending up
      causing some potential silent corruption.
      
      So change the BUG_ON() to an ASSERT(), and in case assertions are
      disabled, produce a warning and set the return value to an error, to make
      it not possible to get into a silent corruption and having the error not
      noticed.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bb8e9a60
    • F
      btrfs: move leaf search logic out of btrfs_search_slot() · 109324cf
      Filipe Manana 提交于
      There's quite a significant amount of code for doing the key search for a
      leaf at btrfs_search_slot(), with a couple labels and gotos in it, plus
      btrfs_search_slot() is already big enough.
      
      So move the logic that does the key search on a leaf into a new helper
      function. This makes it better organized, removing the need for the labels
      and the gotos, as well as reducing the indentation level and the size of
      btrfs_search_slot().
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      109324cf
    • F
      btrfs: remove useless condition check before splitting leaf · e5e1c174
      Filipe Manana 提交于
      When inserting a key, we check if the write_lock_level is less than 1,
      and if so we set it to 1, release the path and retry the tree traversal.
      
      However that is unnecessary, because when ins_len is greater than 0, we
      know that write_lock_level can never be less than 1.
      
      The logic to retry is also buggy, because in case ins_len was decremented,
      due to an exact key match and the search is not meant for item extension
      (path->search_for_extension is 0), we retry without incrementing ins_len,
      which would make the next retry decrement it again by the same amount.
      
      So remove the check for write_lock_level being less than 1 and add an
      assertion to assert it's always >= 1.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e5e1c174
    • F
      btrfs: try to unlock parent nodes earlier when inserting a key · e2e58d0f
      Filipe Manana 提交于
      When inserting a new key, we release the write lock on the leaf's parent
      only after doing the binary search on the leaf. This is because if the
      key ends up at slot 0, we will have to update the key at slot 0 of the
      parent node. The same reasoning applies to any other upper level nodes
      when their slot is 0. We also need to keep the parent locked in case the
      leaf does not have enough free space to insert the new key/item, because
      in that case we will split the leaf and we will need to add a new key to
      the parent due to a new leaf resulting from the split operation.
      
      However if the leaf has enough space for the new key and the key does not
      end up at slot 0 of the leaf we could release our write lock on the parent
      before doing the binary search on the leaf to figure out the destination
      slot. That leads to reducing the amount of time other tasks are blocked
      waiting to lock the parent, therefore increasing parallelism when there
      are other tasks that are trying to access other leaves accessible through
      the same parent. This also applies to other upper nodes besides the
      immediate parent, when their slot is 0, since we keep locks on them until
      we figure out if the leaf slot is slot 0 or not.
      
      In fact, having the key ending at up slot 0 when is rare. Typically it
      only happens when the key is less than or equals to the smallest, the
      "left most", key of the entire btree, during a split attempt when we try
      to push to the right sibling leaf or when the caller just wants to update
      the item of an existing key. It's also very common that a leaf has enough
      space to insert a new key, since after a split we move about half of the
      keys from one into the new leaf.
      
      So unlock the parent, and any other upper level nodes, when during a key
      insertion we notice the key is greater then the first key in the leaf and
      the leaf has enough free space. After unlocking the upper level nodes, do
      the binary search using a low boundary of slot 1 and not slot 0, to figure
      out the slot where the key will be inserted (or where the key already is
      in case it exists and the caller wants to modify its item data).
      This extra comparison, with the first key, is cheap and the key is very
      likely already in a cache line because it immediately follows the header
      of the extent buffer and we have recently read the level field of the
      header (which in fact is the last field of the header).
      
      The following fs_mark test was run on a non-debug kernel (debian's default
      kernel config), with a 12 cores intel CPU, and using a NVMe device:
      
        $ cat run-fsmark.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        FILES=100000
        THREADS=$(nproc --all)
        FILE_SIZE=0
      
        echo "performance" | \
      	tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before this change:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           0      1200000            0     165273.6          5958381
           0      2400000            0     190938.3          6284477
           0      3600000            0     181429.1          6044059
           0      4800000            0     173979.2          6223418
           0      6000000            0     139288.0          6384560
           0      7200000            0     163000.4          6520083
           1      8400000            0      57799.2          5388544
           1      9600000            0      66461.6          5552969
           2     10800000            0      49593.5          5163675
           2     12000000            0      57672.1          4889398
      
      After this change:
      
      FSUse%        Count         Size    Files/sec            App Overhead
           0      1200000            0     167987.3 (+1.6%)         6272730
           0      2400000            0     198563.9 (+4.0%)         6048847
           0      3600000            0     197436.6 (+8.8%)         6163637
           0      4800000            0     202880.7 (+16.6%)        6371771
           1      6000000            0     167275.9 (+20.1%)        6556733
           1      7200000            0     204051.2 (+25.2%)        6817091
           1      8400000            0      69622.8 (+20.5%)        5525675
           1      9600000            0      69384.5 (+4.4%)         5700723
           1     10800000            0      61454.1 (+23.9%)        5363754
           3     12000000            0      61908.7 (+7.3%)         5370196
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e2e58d0f
    • F
      btrfs: allow generic_bin_search() to take low boundary as an argument · fb81212c
      Filipe Manana 提交于
      Right now generic_bin_search() always uses a low boundary slot of 0, but
      in the next patch we'll want to often skip slot 0 when searching for a
      key. So make generic_bin_search() have the low boundary slot specified
      as an argument, and move the check for the extent buffer level from
      btrfs_bin_search() to generic_bin_search() to avoid adding another
      wrapper around generic_bin_search().
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fb81212c
    • J
      btrfs: check the root node for uptodate before returning it · 120de408
      Josef Bacik 提交于
      Now that we clear the extent buffer uptodate if we fail to write it out
      we need to check to see if our root node is uptodate before we search
      down it.  Otherwise we could return stale data (or potentially corrupt
      data that was caught by the write verification step) and think that the
      path is OK to search down.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      120de408
    • F
      btrfs: make send work with concurrent block group relocation · d96b3424
      Filipe Manana 提交于
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing some bad stream. This is
      because while send is using an extent (specially metadata) or about to
      read a metadata extent and expecting it belongs to a specific parent
      node, relocation can run, the transaction used for the relocation is
      committed and the extent gets reallocated while send is still using the
      extent, so it ends up with a different content than expected. This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystem we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it or send can end up
      delaying the block group relocation for too long. In the future we might
      also get the automatic block group relocation for non zoned filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root semaphore,
         the leaf is cloned and placed in the search path (struct btrfs_path);
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove of update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         release the semaphore and reschedule if there is contention, so as to
         avoid causing any significant delays to transaction commits.
      
      This leaves some room for optimizations for send to have less path
      releases and re searching the trees when there's relocation running, but
      for now it's kept simple as it performs quite well (on very large trees
      with resulting send streams in the order of a few hundred gigabytes).
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d96b3424
  3. 03 1月, 2022 5 次提交
  4. 16 12月, 2021 1 次提交
    • F
      btrfs: fix invalid delayed ref after subvolume creation failure · 7a163608
      Filipe Manana 提交于
      When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
      insert the new root's root item into the root tree, we are freeing the
      metadata extent we reserved for the new root to prevent a metadata
      extent leak, as we don't abort the transaction at that point (since
      there is nothing at that point that is irreversible).
      
      However we allocated the metadata extent for the new root which we are
      creating for the new subvolume, so its delayed reference refers to the
      ID of this new root. But when we free the metadata extent we pass the
      root of the subvolume where the new subvolume is located to
      btrfs_free_tree_block() - this is incorrect because this will generate
      a delayed reference that refers to the ID of the parent subvolume's root,
      and not to ID of the new root.
      
      This results in a failure when running delayed references that leads to
      a transaction abort and a trace like the following:
      
      [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
      [3868.739857] Code: 68 0f 85 e6 fb ff (...)
      [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
      [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
      [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
      [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
      [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
      [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
      [3868.750725] FS:  00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
      [3868.752275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
      [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [3868.757803] Call Trace:
      [3868.758281]  <TASK>
      [3868.758655]  ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
      [3868.759827]  __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
      [3868.761047]  btrfs_run_delayed_refs+0x86/0x210 [btrfs]
      [3868.762069]  ? lock_acquired+0x19f/0x420
      [3868.762829]  btrfs_commit_transaction+0x69/0xb20 [btrfs]
      [3868.763860]  ? _raw_spin_unlock+0x29/0x40
      [3868.764614]  ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
      [3868.765870]  create_subvol+0x1d8/0x9a0 [btrfs]
      [3868.766766]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
      [3868.767669]  ? preempt_count_add+0x49/0xa0
      [3868.768444]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
      [3868.769639]  ? _copy_from_user+0x66/0xa0
      [3868.770391]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [3868.771495]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
      [3868.772364]  ? __slab_free+0x10a/0x360
      [3868.773198]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.774121]  ? lock_release+0x223/0x4a0
      [3868.774863]  ? lock_acquired+0x19f/0x420
      [3868.775634]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.776530]  ? trace_hardirqs_on+0x1b/0xe0
      [3868.777373]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
      [3868.778280]  ? kmem_cache_free+0x321/0x3c0
      [3868.779011]  ? __x64_sys_ioctl+0x83/0xb0
      [3868.779718]  __x64_sys_ioctl+0x83/0xb0
      [3868.780387]  do_syscall_64+0x3b/0xc0
      [3868.781059]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [3868.781953] RIP: 0033:0x7f281c59e957
      [3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
      [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
      [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
      [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
      [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
      [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
      [3868.793765]  </TASK>
      [3868.794037] irq event stamp: 0
      [3868.794548] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.797086] softirqs last  enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [3868.799284] ---[ end trace be24c7002fe27747 ]---
      [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
      [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
      [3868.802056]  item 0 key (237436928 169 0) itemoff 16250 itemsize 33
      [3868.802863]          extent refs 1 gen 1265 flags 2
      [3868.803447]          ref#0: tree block backref root 1610
      (...)
      [3869.064354]  item 114 key (241008640 169 0) itemoff 12488 itemsize 33
      [3869.065421]          extent refs 1 gen 1268 flags 2
      [3869.066115]          ref#0: tree block backref root 1689
      (...)
      [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622  owner 0 offset 0
      [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
      [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry
      
      Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
      This requires changing the root argument of btrfs_free_tree_block() from
      struct btrfs_root * to a u64, since at this point during the subvolume
      creation we have not yet created the struct btrfs_root for the new
      subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
      else from a struct btrfs_root.
      
      This was triggered by test case generic/475 from fstests.
      
      Fixes: 67addf29 ("btrfs: fix metadata extent leak after failure to create subvolume")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7a163608
  5. 27 10月, 2021 3 次提交
    • F
      btrfs: unexport setup_items_for_insert() · f0641656
      Filipe Manana 提交于
      Since setup_items_for_insert() is not used anymore outside of ctree.c,
      make it static and remove its prototype from ctree.h. This also requires
      to move the definition of setup_item_for_insert() from ctree.h to ctree.c
      and move down btrfs_duplicate_item() so that it's defined after
      setup_items_for_insert().
      
      Further, since setup_item_for_insert() is used outside ctree.c, rename it
      to btrfs_setup_item_for_insert().
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 2/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f0641656
    • F
      btrfs: loop only once over data sizes array when inserting an item batch · b7ef5f3a
      Filipe Manana 提交于
      When inserting a batch of items into a btree, we end up looping over the
      data sizes array 3 times:
      
      1) Once in the caller of btrfs_insert_empty_items(), when it populates the
         array with the data sizes for each item;
      
      2) Once at btrfs_insert_empty_items() to sum the elements of the data
         sizes array and compute the total data size;
      
      3) And then once again at setup_items_for_insert(), where we do exactly
         the same as what we do at btrfs_insert_empty_items(), to compute the
         total data size.
      
      That is not bad for small arrays, but when the arrays have hundreds of
      elements, the time spent on looping is not negligible. For example when
      doing batch inserts of delayed items for dir index items or when logging
      a directory, it's common to have 200 to 260 dir index items in a single
      batch when using a leaf size of 16K and using file names between 8 and 12
      characters. For a 64K leaf size, multiply that by 4. Taking into account
      that during directory logging or when flushing delayed dir index items we
      can have many of those large batches, the time spent on the looping adds
      up quickly.
      
      It's also more important to avoid it at setup_items_for_insert(), since
      we are holding a write lock on a leaf and, in some cases, on upper nodes
      of the btree, which causes us to block other tasks that want to access
      the leaf and nodes for longer than necessary.
      
      So change the code so that setup_items_for_insert() and
      btrfs_insert_empty_items() no longer compute the total data size, and
      instead rely on the caller to supply it. This makes us loop over the
      array only once, where we can both populate the data size array and
      compute the total data size, taking advantage of spatial and temporal
      locality. To make this more manageable, use a structure to contain
      all the relevant details for a batch of items (keys array, data sizes
      array, total data size, number of items), and use it as an argument
      for btrfs_insert_empty_items() and setup_items_for_insert().
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 1/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b7ef5f3a
    • F
      btrfs: assert that extent buffers are write locked instead of only locked · 49d0c642
      Filipe Manana 提交于
      We currently use lockdep_assert_held() at btrfs_assert_tree_locked(), and
      that checks that we hold a lock either in read mode or write mode.
      
      However in all contexts we use btrfs_assert_tree_locked(), we actually
      want to check if we are holding a write lock on the extent buffer's rw
      semaphore - it would be a bug if in any of those contexts we were holding
      a read lock instead.
      
      So change btrfs_assert_tree_locked() to use lockdep_assert_held_write()
      instead and, to make it more explicit, rename btrfs_assert_tree_locked()
      to btrfs_assert_tree_write_locked(), so that it's clear we want to check
      we are holding a write lock.
      
      For now there are no contexts where we want to assert that we must have
      a read lock, but in case that is needed in the future, we can add a new
      helper function that just calls out lockdep_assert_held_read().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49d0c642
  6. 18 10月, 2021 1 次提交
  7. 23 8月, 2021 4 次提交
    • M
      btrfs: introduce btrfs_search_backwards function · 0ff40a91
      Marcos Paulo de Souza 提交于
      It's a common practice to start a search using offset (u64)-1, which is
      the u64 maximum value, meaning that we want the search_slot function to
      be set in the last item with the same objectid and type.
      
      Once we are in this position, it's a matter to start a search backwards
      by calling btrfs_previous_item, which will check if we'll need to go to
      a previous leaf and other necessary checks, only to be sure that we are
      in last offset of the same object and type.
      
      The new btrfs_search_backwards function does the all these steps when
      necessary, and can be used to avoid code duplication.
      Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0ff40a91
    • D
      btrfs: make btrfs_next_leaf static inline · 809d6902
      David Sterba 提交于
      btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
      to header to avoid the function call overhead.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      809d6902
    • F
      btrfs: continue readahead of siblings even if target node is in memory · 069a2e37
      Filipe Manana 提交于
      At reada_for_search(), when attempting to readahead a node or leaf's
      siblings, we skip the readahead of the siblings if the node/leaf is
      already in memory. That is probably fine for the READA_FORWARD and
      READA_BACK readahead types, as they are used on contexts where we
      end up reading some consecutive leaves, but usually not the whole btree.
      
      However for a READA_FORWARD_ALWAYS mode, currently only used for full
      send operations, it does not make sense to skip the readahead if the
      target node or leaf is already loaded in memory, since we know the caller
      is visiting every node and leaf of the btree in ascending order.
      
      So change the behaviour to not skip the readahead when the target node is
      already in memory and the readahead mode is READA_FORWARD_ALWAYS.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk, with 32GiB of RAM and
      using a non-debug kernel config (Debian's default config).
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        file_count=2000000
        add_files $file_count 0 4
      
        echo
        echo "Creating snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The duration of the full send operations, in seconds, were the following:
      
      Before this change:  85 seconds
      After this change:   76 seconds (-11.2%)
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      069a2e37
    • M
      btrfs: remove max argument from generic_bin_search · 67d5e289
      Marcos Paulo de Souza 提交于
      Both callers use btrfs_header_nritems to feed the max argument.  Remove
      the argument and let generic_bin_search call it itself.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      67d5e289
  8. 07 7月, 2021 1 次提交
    • F
      btrfs: rework chunk allocation to avoid exhaustion of the system chunk array · 79bd3712
      Filipe Manana 提交于
      Commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
      due to concurrent allocations") fixed a problem that resulted in
      exhausting the system chunk array in the superblock when there are many
      tasks allocating chunks in parallel. Basically too many tasks enter the
      first phase of chunk allocation without previous tasks having finished
      their second phase of allocation, resulting in too many system chunks
      being allocated. That was originally observed when running the fallocate
      tests of stress-ng on a PowerPC machine, using a node size of 64K.
      
      However that commit also introduced a deadlock where a task in phase 1 of
      the chunk allocation waited for another task that had allocated a system
      chunk to finish its phase 2, but that other task was waiting on an extent
      buffer lock held by the first task, therefore resulting in both tasks not
      making any progress. That change was later reverted by a patch with the
      subject "btrfs: fix deadlock with concurrent chunk allocations involving
      system chunks", since there is no simple and short solution to address it
      and the deadlock is relatively easy to trigger on zoned filesystems, while
      the system chunk array exhaustion is not so common.
      
      This change reworks the chunk allocation to avoid the system chunk array
      exhaustion. It accomplishes that by making the first phase of chunk
      allocation do the updates of the device items in the chunk btree and the
      insertion of the new chunk item in the chunk btree. This is done while
      under the protection of the chunk mutex (fs_info->chunk_mutex), in the
      same critical section that checks for available system space, allocates
      a new system chunk if needed and reserves system chunk space. This way
      we do not have chunk space reserved until the second phase completes.
      
      The same logic is applied to chunk removal as well, since it keeps
      reserved system space long after it is done updating the chunk btree.
      
      For direct allocation of system chunks, the previous behaviour remains,
      because otherwise we would deadlock on extent buffers of the chunk btree.
      Changes to the chunk btree are by large done by chunk allocation and chunk
      removal, which first reserve chunk system space and then later do changes
      to the chunk btree. The other remaining cases are uncommon and correspond
      to adding a device, removing a device and resizing a device. All these
      other cases do not pre-reserve system space, they modify the chunk btree
      right away, so they don't hold reserved space for a long period like chunk
      allocation and chunk removal do.
      
      The diff of this change is huge, but more than half of it is just addition
      of comments describing both how things work regarding chunk allocation and
      removal, including both the new behavior and the parts of the old behavior
      that did not change.
      
      CC: stable@vger.kernel.org # 5.12+
      Tested-by: NShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79bd3712
  9. 21 6月, 2021 1 次提交
    • J
      btrfs: always abort the transaction if we abort a trans handle · 5963ffca
      Josef Bacik 提交于
      While stress testing our error handling I noticed that sometimes we
      would still commit the transaction even though we had aborted the
      transaction.
      
      Currently we track if a trans handle has dirtied any metadata, and if it
      hasn't we mark the filesystem as having an error (so no new transactions
      can be started), but we will allow the current transaction to complete
      as we do not mark the transaction itself as having been aborted.
      
      This sounds good in theory, but we were not properly tracking IO errors
      in btrfs_finish_ordered_io, and thus committing the transaction with
      bogus free space data.  This isn't necessarily a problem per-se with the
      free space cache, as the other guards in place would have kept us from
      accepting the free space cache as valid, but highlights a real world
      case where we had a bug and could have corrupted the filesystem because
      of it.
      
      This "skip abort on empty trans handle" is nice in theory, but assumes
      we have perfect error handling everywhere, which we clearly do not.
      Also we do not allow further transactions to be started, so all this
      does is save the last transaction that was happening, which doesn't
      necessarily gain us anything other than the potential for real
      corruption.
      
      Remove this particular bit of code, if we decide we need to abort the
      transaction then abort the current one and keep us from doing real harm
      to the file system, regardless of whether this specific trans handle
      dirtied anything or not.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5963ffca
  10. 19 4月, 2021 3 次提交
    • F
      btrfs: improve btree readahead for full send operations · ace75066
      Filipe Manana 提交于
      Currently a full send operation uses the standard btree readahead when
      iterating over the subvolume/snapshot btree, which despite bringing good
      performance benefits, it could be improved in a few aspects for use cases
      such as full send operations, which are guaranteed to visit every node
      and leaf of a btree, in ascending and sequential order. The limitations
      of that standard btree readahead implementation are the following:
      
      1) It only triggers readahead for leaves that are physically close
         to the leaf being read, within a 64K range;
      
      2) It only triggers readahead for the next or previous leaves if the
         leaf being read is not currently in memory;
      
      3) It never triggers readahead for nodes.
      
      So add a new readahead mode that addresses all these points and use it
      for full send operations.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The durations of the full send operation in seconds were the following:
      
      Before this change:  217 seconds
      After this change:   205 seconds (-5.7%)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ace75066
    • F
      btrfs: use booleans where appropriate for the tree mod log functions · 406808ab
      Filipe Manana 提交于
      Several functions of the tree modification log use integers as booleans,
      so change them to use booleans instead, making their use more clear.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      406808ab
    • F
      btrfs: move the tree mod log code into its own file · f3a84ccd
      Filipe Manana 提交于
      The tree modification log, which records modifications done to btrees, is
      quite large and currently spread all over ctree.c, which is a huge file
      already.
      
      To make things better organized, move all that code into its own separate
      source and header files. Functions and definitions that are used outside
      of the module (mostly by ctree.c) are renamed so that they start with a
      "btrfs_" prefix. Everything else remains unchanged.
      
      This makes it easier to go over the tree modification log code every
      time I need to go read it to fix a bug.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f3a84ccd
  11. 17 3月, 2021 1 次提交
    • F
      btrfs: fix race when cloning extent buffer during rewind of an old root · dbcc7d57
      Filipe Manana 提交于
      While resolving backreferences, as part of a logical ino ioctl call or
      fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
      operations of a root, triggering a stack trace like the following:
      
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.c:1210!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G        W         5.11.0-2d11c0084b02-misc-next+ #89
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
        Code: 05 48 8d 74 10 (...)
        RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
        RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
        RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
        RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
        R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
        R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
        FS:  00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
        Call Trace:
         btrfs_search_old_slot+0x265/0x10d0
         ? lock_acquired+0xbb/0x600
         ? btrfs_search_slot+0x1090/0x1090
         ? free_extent_buffer.part.61+0xd7/0x140
         ? free_extent_buffer+0x13/0x20
         resolve_indirect_refs+0x3e9/0xfc0
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? add_prelim_ref.part.11+0x150/0x150
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? do_raw_spin_unlock+0xa8/0x140
         ? rb_insert_color+0x30/0x360
         ? prelim_ref_insert+0x12d/0x430
         find_parent_nodes+0x5c3/0x1830
         ? resolve_indirect_refs+0xfc0/0xfc0
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x160/0x210
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? poison_range+0x38/0x40
         ? unpoison_range+0x14/0x40
         ? trace_hardirqs_on+0x55/0x120
         btrfs_find_all_roots_safe+0x142/0x1e0
         ? find_parent_nodes+0x1830/0x1830
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         iterate_extent_inodes+0x20e/0x580
         ? tree_backref_for_extent+0x230/0x230
         ? lock_downgrade+0x3d0/0x3d0
         ? read_extent_buffer+0xdd/0x110
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? _raw_spin_unlock+0x22/0x30
         ? __kasan_check_write+0x14/0x20
         iterate_inodes_from_logical+0x129/0x170
         ? iterate_inodes_from_logical+0x129/0x170
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         ? iterate_extent_inodes+0x580/0x580
         ? __vmalloc_node+0x92/0xb0
         ? init_data_container+0x34/0xb0
         ? init_data_container+0x34/0xb0
         ? kvmalloc_node+0x60/0x80
         btrfs_ioctl_logical_to_ino+0x158/0x230
         btrfs_ioctl+0x205e/0x4040
         ? __might_sleep+0x71/0xe0
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? getrusage+0x4b6/0x9c0
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __might_fault+0x64/0xd0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __task_pid_nr_ns+0xd3/0x250
         ? lock_acquire+0xc7/0x510
         ? __fget_files+0x160/0x230
         ? __fget_light+0xf2/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fd1976e2427
        Code: 00 00 90 48 8b 05 (...)
        RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
        RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
        RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
        R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
        R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
        Modules linked in:
        ---[ end trace ec8931a1c36e57be ]---
      
        (gdb) l *(__tree_mod_log_rewind+0x3b1)
        0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
        1205                     * the modification. as we're going backwards, we do the
        1206                     * opposite of each operation here.
        1207                     */
        1208                    switch (tm->op) {
        1209                    case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
        1210                            BUG_ON(tm->slot < n);
        1211                            fallthrough;
        1212                    case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
        1213                    case MOD_LOG_KEY_REMOVE:
        1214                            btrfs_set_node_key(eb, &tm->key, tm->slot);
      
      Here's what happens to hit that BUG_ON():
      
      1) We have one tree mod log user (through fiemap or the logical ino ioctl),
         with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
      
      2) Another task is at ctree.c:balance_level() and we have eb X currently as
         the root of the tree, and we promote its single child, eb Y, as the new
         root.
      
         Then, at ctree.c:balance_level(), we call:
      
            tree_mod_log_insert_root(eb X, eb Y, 1);
      
      3) At tree_mod_log_insert_root() we create tree mod log elements for each
         slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
         with a ->logical pointing to ebX->start. These are placed in an array
         named tm_list.
         Lets assume there are N elements (N pointers in eb X);
      
      4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
         element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
         ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
         to the level of eb X and ->generation set to the generation of eb X;
      
      5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
         tm_list as argument. After that, tree_mod_log_free_eb() calls
         __tree_mod_log_insert() for each member of tm_list in reverse order,
         from highest slot in eb X, slot N - 1, to slot 0 of eb X;
      
      6) __tree_mod_log_insert() sets the sequence number of each given tree mod
         log operation - it increments fs_info->tree_mod_seq and sets
         fs_info->tree_mod_seq as the sequence number of the given tree mod log
         operation.
      
         This means that for the tm_list created at tree_mod_log_insert_root(),
         the element corresponding to slot 0 of eb X has the highest sequence
         number (1 + N), and the element corresponding to the last slot has the
         lowest sequence number (2);
      
      7) Then, after inserting tm_list's elements into the tree mod log rbtree,
         the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
         sequence number, which is N + 2;
      
      8) Back to ctree.c:balance_level(), we free eb X by calling
         btrfs_free_tree_block() on it. Because eb X was created in the current
         transaction, has no other references and writeback did not happen for
         it, we add it back to the free space cache/tree;
      
      9) Later some other task T allocates the metadata extent from eb X, since
         it is marked as free space in the space cache/tree, and uses it as a
         node for some other btree;
      
      10) The tree mod log user task calls btrfs_search_old_slot(), which calls
          get_old_root(), and finally that calls __tree_mod_log_oldest_root()
          with time_seq == 1 and eb_root == eb Y;
      
      11) First iteration of the while loop finds the tree mod log element with
          sequence number N + 2, for the logical address of eb Y and of type
          MOD_LOG_ROOT_REPLACE;
      
      12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
          of the loop, and set root_logical to point to tm->old_root.logical
          which corresponds to the logical address of eb X;
      
      13) On the next iteration of the while loop, the call to
          tree_mod_log_search_oldest() returns the smallest tree mod log element
          for the logical address of eb X, which has a sequence number of 2, an
          operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
          the old slot N - 1 of eb X (eb X had N items in it before being freed);
      
      14) We then break out of the while loop and return the tree mod log operation
          of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
          eb X, to get_old_root();
      
      15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
          and set "logical" to the logical address of eb X, which was the old
          root. We then call tree_mod_log_search() passing it the logical
          address of eb X and time_seq == 1;
      
      16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
          which results in adding a tree mod log operation of type
          MOD_LOG_KEY_ADD to the tree mod log - this is done at
          ctree.c:insert_ptr() - but after adding the tree mod log operation
          and before updating the number of items in eb X from 0 to 1...
      
      17) The task at get_old_root() calls tree_mod_log_search() and gets the
          tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
          Then it enters the following if branch:
      
          if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
             (...)
          } (...)
      
          Calls read_tree_block() for eb X, which gets a reference on eb X but
          does not lock it - task T has it locked.
          Then it clones eb X while it has nritems set to 0 in its header, before
          task T sets nritems to 1 in eb X's header. From hereupon we use the
          clone of eb X which no other task has access to;
      
      18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
          mod log operation we just got from tree_mod_log_search() in the
          previous step and the cloned version of eb X;
      
      19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
          of items set in eb X's clone, which is 0. Then we enter the while loop,
          and in its first iteration we process the MOD_LOG_KEY_ADD operation,
          which just decrements "n" from 0 to (u32)-1, since "n" is declared with
          a type of u32. At the end of this iteration we call rb_next() to find the
          next tree mod log operation for eb X, that gives us the mod log operation
          of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
          number of N + 1 (steps 3 to 6);
      
      20) Then we go back to the top of the while loop and trigger the following
          BUG_ON():
      
              (...)
              switch (tm->op) {
              case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                       BUG_ON(tm->slot < n);
                       fallthrough;
              (...)
      
          Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
      
      Fix this by taking a read lock on the extent buffer before cloning it at
      ctree.c:get_old_root(). This should be done regardless of the extent
      buffer having been freed and reused, as a concurrent task might be
      modifying it (while holding a write lock on it).
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
      Fixes: 834328a8 ("Btrfs: tree mod log's old roots could still be part of the tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dbcc7d57
  12. 09 2月, 2021 3 次提交
  13. 18 12月, 2020 1 次提交
    • E
      btrfs: correctly calculate item size used when item key collision happens · 9a664971
      ethanwu 提交于
      Item key collision is allowed for some item types, like dir item and
      inode refs, but the overall item size is limited by the nodesize.
      
      item size(ins_len) passed from btrfs_insert_empty_items to
      btrfs_search_slot already contains size of btrfs_item.
      
      When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
      The check incorrectly reports that split leaf is required, because
      it treats the space required by the newly inserted item as
      btrfs_item + item data. But in item key collision case, only item data
      is actually needed, the newly inserted item could merge into the existing
      one. No new btrfs_item will be inserted.
      
      And split_leaf return EOVERFLOW from following code:
      
        if (extend && data_size + btrfs_item_size_nr(l, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
            return -EOVERFLOW;
      
      In most cases, when callers receive EOVERFLOW, they either return
      this error or handle in different ways. For example, in normal dir item
      creation the userspace will get errno EOVERFLOW; in inode ref case
      INODE_EXTREF is used instead.
      
      However, this is not the case for rename. To avoid the unrecoverable
      situation in rename, btrfs_check_dir_item_collision is called in
      early phase of rename. In this function, when item key collision is
      detected leaf space is checked:
      
        data_size = sizeof(*di) + name_len;
        if (data_size + btrfs_item_size_nr(leaf, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
      
      the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
      refers to existing item size, the condition here correctly calculates
      the needed size for collision case rather than the wrong case above.
      
      The consequence of inconsistent condition check between
      btrfs_check_dir_item_collision and btrfs_search_slot when item key
      collision happens is that we might pass check here but fail
      later at btrfs_search_slot. Rename fails and volume is forced readonly
      
        [436149.586170] ------------[ cut here ]------------
        [436149.586173] BTRFS: Transaction aborted (error -75)
        [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G      D           4.18.0-rc5+ #1
        [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
        [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
        [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
        [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
        [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
        [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
        [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
        [436149.586260] FS:  00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
        [436149.586261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
        [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [436149.586296] Call Trace:
        [436149.586311]  vfs_rename+0x383/0x920
        [436149.586313]  ? vfs_rename+0x383/0x920
        [436149.586315]  do_renameat2+0x4ca/0x590
        [436149.586317]  __x64_sys_rename+0x20/0x30
        [436149.586324]  do_syscall_64+0x5a/0x120
        [436149.586330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [436149.586332] RIP: 0033:0x7fa3133b1d37
        [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
        [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
        [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
        [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
      
      Thanks to Hans van Kranenburg for information about crc32 hash collision
      tools, I was able to reproduce the dir item collision with following
      python script.
      https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
      it under a btrfs volume will trigger the abort transaction.  It simply
      creates files and rename them to forged names that leads to
      hash collision.
      
      There are two ways to fix this. One is to simply revert the patch
      878f2d2c ("Btrfs: fix max dir item size calculation") to make the
      condition consistent although that patch is correct about the size.
      
      The other way is to handle the leaf space check correctly when
      collision happens. I prefer the second one since it correct leaf
      space check in collision case. This fix will not account
      sizeof(struct btrfs_item) when the item already exists.
      There are two places where ins_len doesn't contain
      sizeof(struct btrfs_item), however.
      
        1. extent-tree.c: lookup_inline_extent_backref
        2. file-item.c: btrfs_csum_file_blocks
      
      to make the logic of btrfs_search_slot more clear, we add a flag
      search_for_extension in btrfs_path.
      
      This flag indicates that ins_len passed to btrfs_search_slot doesn't
      contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
      will use the actual size needed to calculate the required leaf space.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: Nethanwu <ethanwu@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9a664971
  14. 10 12月, 2020 1 次提交
    • Q
      btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors · 884b07d0
      Qu Wenruo 提交于
      To support sectorsize < PAGE_SIZE case, we need to take extra care of
      extent buffer accessors.
      
      Since sectorsize is smaller than PAGE_SIZE, one page can contain
      multiple tree blocks, we must use eb->start to determine the real offset
      to read/write for extent buffer accessors.
      
      This patch introduces two helpers to do this:
      
      - get_eb_page_index()
        This is to calculate the index to access extent_buffer::pages.
        It's just a simple wrapper around "start >> PAGE_SHIFT".
      
        For sectorsize == PAGE_SIZE case, nothing is changed.
        For sectorsize < PAGE_SIZE case, we always get index as 0, and
        the existing page shift also works.
      
      - get_eb_offset_in_page()
        This is to calculate the offset to access extent_buffer::pages.
        This needs to take extent_buffer::start into consideration.
      
        For sectorsize == PAGE_SIZE case, extent_buffer::start is always
        aligned to PAGE_SIZE, thus adding extent_buffer::start to
        offset_in_page() won't change the result.
        For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
        us the correct offset to access.
      
      This patch will touch the following parts to cover all extent buffer
      accessors:
      
      - BTRFS_SETGET_HEADER_FUNCS()
      - read_extent_buffer()
      - read_extent_buffer_to_user()
      - memcmp_extent_buffer()
      - write_extent_buffer_chunk_tree_uuid()
      - write_extent_buffer_fsid()
      - write_extent_buffer()
      - memzero_extent_buffer()
      - copy_extent_buffer_full()
      - copy_extent_buffer()
      - memcpy_extent_buffer()
      - memmove_extent_buffer()
      - btrfs_get_token_##bits()
      - btrfs_get_##bits()
      - btrfs_set_token_##bits()
      - btrfs_set_##bits()
      - generic_bin_search()
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      884b07d0
  15. 08 12月, 2020 4 次提交