1. 22 June 2021, 5 commits
    • btrfs: send: fix crash when memory allocations trigger reclaim · 35b22c19
      By Filipe Manana
      When doing a send we don't expect the task to ever start a transaction
      after the initial check that verifies if commit roots match the regular
      roots. This is because after that we set current->journal_info with a
      stub (special value) that signals we are in send context, so that we take
      a read lock on an extent buffer when reading it from disk and verifying
      it is valid (its generation matches the generation stored in the parent).
      This stub was introduced in 2014 by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO") in order to fix a concurrency issue
      between send and balance.
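
      In code terms the stub is just a marker value stored in the task,
      roughly like the following sketch (the wrapper function is hypothetical;
      only the marker assignment pattern is the point here):

        #define BTRFS_SEND_TRANS_STUB  ((void *)1)   /* non-dereferenceable marker */

        static long do_send(struct send_ctx *sctx)   /* hypothetical wrapper */
        {
                long ret;

                /* Signal that this task is in send context. */
                current->journal_info = BTRFS_SEND_TRANS_STUB;
                ret = send_subvol(sctx);              /* walk commit roots, emit the stream */
                current->journal_info = NULL;
                return ret;
        }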
      
      However there is one particular exception where we end up needing to start
      a transaction and when this happens it results in a crash with a stack
      trace like the following:
      
      [60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
      [60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
      [60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
      [60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
      [60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
      [60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
      [60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
      [60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
      [60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
      [60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
      [60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
      [60015.902408] kernel: FS:  00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
      [60015.902410] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
      [60015.902414] kernel: Call Trace:
      [60015.902419] kernel:  kfence_unprotect+0x13/0x30
      [60015.902423] kernel:  page_fault_oops+0x89/0x270
      [60015.902427] kernel:  ? search_module_extables+0xf/0x40
      [60015.902431] kernel:  ? search_bpf_extables+0x57/0x70
      [60015.902435] kernel:  kernelmode_fixup_or_oops+0xd6/0xf0
      [60015.902437] kernel:  __bad_area_nosemaphore+0x142/0x180
      [60015.902440] kernel:  exc_page_fault+0x67/0x150
      [60015.902445] kernel:  asm_exc_page_fault+0x1e/0x30
      [60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
      [60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
      [60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
      [60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
      [60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
      [60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
      [60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
      [60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
      [60015.902468] kernel:  btrfs_commit_inode_delayed_inode+0x5d/0x120
      [60015.902473] kernel:  btrfs_evict_inode+0x2c5/0x3f0
      [60015.902476] kernel:  evict+0xd1/0x180
      [60015.902480] kernel:  inode_lru_isolate+0xe7/0x180
      [60015.902483] kernel:  __list_lru_walk_one+0x77/0x150
      [60015.902487] kernel:  ? iput+0x1a0/0x1a0
      [60015.902489] kernel:  ? iput+0x1a0/0x1a0
      [60015.902491] kernel:  list_lru_walk_one+0x47/0x70
      [60015.902495] kernel:  prune_icache_sb+0x39/0x50
      [60015.902497] kernel:  super_cache_scan+0x161/0x1f0
      [60015.902501] kernel:  do_shrink_slab+0x142/0x240
      [60015.902505] kernel:  shrink_slab+0x164/0x280
      [60015.902509] kernel:  shrink_node+0x2c8/0x6e0
      [60015.902512] kernel:  do_try_to_free_pages+0xcb/0x4b0
      [60015.902514] kernel:  try_to_free_pages+0xda/0x190
      [60015.902516] kernel:  __alloc_pages_slowpath.constprop.0+0x373/0xcc0
      [60015.902521] kernel:  ? __memcg_kmem_charge_page+0xc2/0x1e0
      [60015.902525] kernel:  __alloc_pages_nodemask+0x30a/0x340
      [60015.902528] kernel:  pipe_write+0x30b/0x5c0
      [60015.902531] kernel:  ? set_next_entity+0xad/0x1e0
      [60015.902534] kernel:  ? switch_mm_irqs_off+0x58/0x440
      [60015.902538] kernel:  __kernel_write+0x13a/0x2b0
      [60015.902541] kernel:  kernel_write+0x73/0x150
      [60015.902543] kernel:  send_cmd+0x7b/0xd0
      [60015.902545] kernel:  send_extent_data+0x5a3/0x6b0
      [60015.902549] kernel:  process_extent+0x19b/0xed0
      [60015.902551] kernel:  btrfs_ioctl_send+0x1434/0x17e0
      [60015.902554] kernel:  ? _btrfs_ioctl_send+0xe1/0x100
      [60015.902557] kernel:  _btrfs_ioctl_send+0xbf/0x100
      [60015.902559] kernel:  ? enqueue_entity+0x18c/0x7b0
      [60015.902562] kernel:  btrfs_ioctl+0x185f/0x2f80
      [60015.902564] kernel:  ? psi_task_change+0x84/0xc0
      [60015.902569] kernel:  ? _flat_send_IPI_mask+0x21/0x40
      [60015.902572] kernel:  ? check_preempt_curr+0x2f/0x70
      [60015.902576] kernel:  ? selinux_file_ioctl+0x137/0x1e0
      [60015.902579] kernel:  ? expand_files+0x1cb/0x1d0
      [60015.902582] kernel:  ? __x64_sys_ioctl+0x82/0xb0
      [60015.902585] kernel:  __x64_sys_ioctl+0x82/0xb0
      [60015.902588] kernel:  do_syscall_64+0x33/0x40
      [60015.902591] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
      [60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
      [60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
      [60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
      [60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
      [60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
      [60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
      [60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---
      
      This happens because when writing to the pipe, by calling kernel_write(),
      we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
      gfp flags, which allow reclaim to happen if there is memory pressure. This
      allocation happens at fs/pipe.c:pipe_write().
      
      If the reclaim is triggered, inode eviction can be triggered and that in
      turn can result in starting a transaction if the inode has a link count
      of 0. The transaction start happens early on during eviction, when we call
      btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
      there is currently an open file descriptor for an inode with a link count
      of 0 and the reclaim task gets a reference on the inode before that
      descriptor is closed, in which case the reclaim task ends up doing the
      final iput that triggers the inode eviction.
      
      When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
      the following assertion at transaction.c:start_transaction():
      
          /* Send isn't supposed to start transactions. */
          ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
      
      And when assertions are not enabled, it triggers a crash since after that
      assertion we cast current->journal_info into a transaction handle pointer
      and then dereference it:
      
         if (current->journal_info) {
             WARN_ON(type & TRANS_EXTWRITERS);
             h = current->journal_info;
             refcount_inc(&h->use_count);
             (...)
      
      Which obviously results in a crash due to an invalid memory access.
      
      The same type of issue can happen during other memory allocations we
      do directly in the send code with kmalloc (and friends) as they use
      GFP_KERNEL and therefore may trigger reclaim too, which started to
      happen since 2016 after commit e780b0d1 ("btrfs: send: use
      GFP_KERNEL everywhere").
      
      The issue could be solved by setting up a NOFS context for the entire
      send operation so that reclaim could not be triggered when allocating
      memory or pages through kernel_write(). However that is not very friendly
      and we can in fact get rid of the send stub because:
      
      1) The stub was introduced way back in 2014 by commit a26e8c9f
         ("Btrfs: don't clear uptodate if the eb is under IO") to solve an
         issue exclusive to when send and balance are running in parallel,
         however there were other problems between balance and send and we
         no longer allow balance and send to run concurrently since commit
         9e967495 ("Btrfs: prevent send failures and crashes due to
         concurrent relocation"). More generically the issues are between
         send and relocation, and that last commit eliminated only the
         possibility of having send and balance run concurrently, but
         shrinking a device can also trigger relocation, and on zoned
         filesystems relocation of partially used block groups is triggered
         automatically as well. The previous patch, with the subject:

         "btrfs: ensure relocation never runs while we have send operations running"

         addresses all the remaining cases that can trigger relocation.
      
      2) We can actually allow starting and even committing transactions while
         in a send context if needed because send is not holding any locks that
         would block the start or the commit of a transaction.
      
      So get rid of all the logic added by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO"). We can now always call
      clear_extent_buffer_uptodate() at verify_parent_transid() since send is
      the only case that uses commit roots without having a transaction open or
      without holding the commit_root_sem.
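
      For reference, the rejected NOFS-context alternative mentioned above
      would have looked roughly like this sketch (wrapped around the whole
      send operation; this is not what the patch does):

        unsigned int nofs_flags;

        /* Make allocations in this task implicitly drop __GFP_FS, so that
         * direct reclaim cannot recurse back into the filesystem. */
        nofs_flags = memalloc_nofs_save();
        ret = send_subvol(sctx);        /* the whole send operation */
        memalloc_nofs_restore(nofs_flags);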
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      35b22c19
    • btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      By Filipe Manana
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also have now one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      Also add a spinlock used exclusively for the mutual exclusion between
      send and relocation; before this, fs_info->balance_mutex was used, which
      would make an attempt to run send block waiting for balance to finish,
      which can take a lot of time on large filesystems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1cea5cf0
    • btrfs: fix typos in comments · 1a9fd417
      By David Sterba
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a9fd417
    • btrfs: send: use list_move_tail instead of list_del/list_add_tail · bb930007
      By Baokun Li
      Use list_move_tail() instead of list_del() + list_add_tail() as it's
      doing the same thing and allows further cleanups.  Open code
      name_cache_used() as there is only one user.
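
      The transformation has this shape (generic example against the name
      cache list; not the exact send.c hunk):

        /* Before: remove the entry, then append it at the tail. */
        list_del(&nce->list);
        list_add_tail(&nce->list, &sctx->name_cache_list);

        /* After: a single call that performs the same list surgery. */
        list_move_tail(&nce->list, &sctx->name_cache_list);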
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bb930007
    • btrfs: send: fix invalid path for unlink operations after parent orphanization · d8ac76cd
      By Filipe Manana
      During an incremental send operation, when processing the new references
      for the current inode, we might send an unlink operation for another inode
      that has a conflicting path and has more than one hard link. However this
      path was computed and cached before we processed previous new references
      for the current inode. We may have orphanized a directory of that path
      while processing a previous new reference, in which case the path will
      be invalid and cause the receiver process to fail.
      
      The following reproducer triggers the problem and explains how/why it
      happens in its comments:
      
        $ cat test-send-unlink.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        # Create our test files and directory. Inode 259 (file3) has two hard
        # links.
        touch $MNT/file1
        touch $MNT/file2
        touch $MNT/file3
      
        mkdir $MNT/A
        ln $MNT/file3 $MNT/A/hard_link
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- file1                          (ino 257)
        # |----- file2                          (ino 258)
        # |----- file3                          (ino 259)
        # |----- A/                             (ino 260)
        #        |---- hard_link                (ino 259)
        #
      
        # Now create the base snapshot, which is going to be the parent snapshot
        # for a later incremental send.
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Move inode 257 into directory inode 260. This results in computing the
        # path for inode 260 as "/A" and caching it.
        mv $MNT/file1 $MNT/A/file1
      
        # Move inode 258 (file2) into directory inode 260, with a name of
        # "hard_link", moving first inode 259 away since it currently has that
        # location and name.
        mv $MNT/A/hard_link $MNT/tmp
        mv $MNT/file2 $MNT/A/hard_link
      
        # Now rename inode 260 to something else (B for example) and then create
        # a hard link for inode 258 that has the old name and location of inode
        # 260 ("/A").
        mv $MNT/A $MNT/B
        ln $MNT/B/hard_link $MNT/A
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- tmp                            (ino 259)
        # |----- file3                          (ino 259)
        # |----- B/                             (ino 260)
        # |      |---- file1                    (ino 257)
        # |      |---- hard_link                (ino 258)
        # |
        # |----- A                              (ino 258)
      
        # Create another snapshot of our subvolume and use it for an incremental
        # send.
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to
        # apply both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        # First add the first snapshot to the new filesystem by applying the
        # first send stream.
        btrfs receive -f /tmp/snap1.send $MNT
      
        # The incremental receive operation below used to fail with the
        # following error:
        #
        #    ERROR: unlink A/hard_link failed: No such file or directory
        #
        # This is because when send is processing inode 257, it generates the
        # path for inode 260 as "/A", since that inode is its parent in the send
        # snapshot, and caches that path.
        #
        # Later when processing inode 258, it first processes its new reference
        # that has the path of "/A", which results in orphanizing inode 260
        # because there is a path collision. This results in issuing a rename
        # operation from "/A" to "/o260-6-0".
        #
        # Finally when processing the new reference "B/hard_link" for inode 258,
        # it notices that it collides with inode 259 (not yet processed, because
        # it has a higher inode number), since that inode has the name
        # "hard_link" under the directory inode 260. It also checks that inode
        # 259 has two hard links, so it decides to issue an unlink operation for
        # the name "hard_link" for inode 259. However the path passed to the
        # unlink operation is "/A/hard_link", which is incorrect since currently
        # "/A" does not exist, due to the orphanization of inode 260 mentioned
        # before. The path is incorrect because it was computed and cached
        # before the orphanization. This makes the receiver fail with the
        # above error.
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, it fails like this:
      
        $ ./test-send-unlink.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: unlink A/hard_link failed: No such file or directory
      
      Fix this by recomputing the path before issuing an unlink operation when
      processing the new references for the current inode, if we have
      previously orphanized a directory.
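
      In rough pseudo-C the fix amounts to the following (helper names are
      illustrative, not the exact send.c functions):

        /* The conflicting ref's path was computed and cached earlier. */
        if (orphanized_dir) {
                /* A directory in that path may have been renamed to its
                 * orphan name meanwhile, so recompute the path now. */
                ret = regenerate_path(sctx, conflict_ino, conflict_gen, path);  /* hypothetical */
                if (ret < 0)
                        goto out;
        }
        ret = send_unlink(sctx, path);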
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d8ac76cd
  2. 29 April 2021, 1 commit
    • btrfs: fix deadlock when cloning inline extents and using qgroups · f9baa501
      By Filipe Manana
      There are a few exceptional cases where cloning an inline extent needs to
      copy the inline extent data into a page of the destination inode.
      
      When this happens, we end up starting a transaction while having a dirty
      page for the destination inode and while having the range locked in the
      destination's inode iotree too. Because when reserving metadata space
      for a transaction we may need to flush existing delalloc in case there is
      not enough free space, we have a mechanism in place to prevent a deadlock,
      which was introduced in commit 3d45f221 ("btrfs: fix deadlock when
      cloning inline extent and low on free metadata space").
      
      However when using qgroups, a transaction also reserves metadata qgroup
      space, which can also result in flushing delalloc in case there is not
      enough available space at the moment. When this happens we deadlock, since
      flushing delalloc requires locking the file range in the inode's iotree
      and the range was already locked at the very beginning of the clone
      operation, before attempting to start the transaction.
      
      When this issue happens, stack traces like the following are reported:
      
        [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
        [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
        [72747.556271] Call Trace:
        [72747.556273]  __schedule+0x296/0x760
        [72747.556277]  schedule+0x3c/0xa0
        [72747.556279]  io_schedule+0x12/0x40
        [72747.556284]  __lock_page+0x13c/0x280
        [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
        [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
        [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
        [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
        [72747.556362]  ? update_group_capacity+0x25/0x210
        [72747.556366]  ? cpumask_next_and+0x1a/0x20
        [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.556394]  do_writepages+0x41/0xd0
        [72747.556398]  __writeback_single_inode+0x39/0x2a0
        [72747.556403]  writeback_sb_inodes+0x1ea/0x440
        [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
        [72747.556410]  wb_writeback+0x235/0x2b0
        [72747.556414]  ? get_nr_inodes+0x35/0x50
        [72747.556417]  wb_workfn+0x354/0x490
        [72747.556420]  ? newidle_balance+0x2c5/0x3e0
        [72747.556424]  process_one_work+0x1aa/0x340
        [72747.556426]  worker_thread+0x30/0x390
        [72747.556429]  ? create_worker+0x1a0/0x1a0
        [72747.556432]  kthread+0x116/0x130
        [72747.556435]  ? kthread_park+0x80/0x80
        [72747.556438]  ret_from_fork+0x1f/0x30
      
        [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [72747.566961] Call Trace:
        [72747.566964]  __schedule+0x296/0x760
        [72747.566968]  ? finish_wait+0x80/0x80
        [72747.566970]  schedule+0x3c/0xa0
        [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
        [72747.566999]  ? finish_wait+0x80/0x80
        [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
        [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
        [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
        [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
        [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
        [72747.567106]  ? update_load_avg+0x7e/0x5f0
        [72747.567109]  ? enqueue_entity+0xf4/0x6f0
        [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
        [72747.567140]  do_writepages+0x41/0xd0
        [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
        [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
        [72747.567200]  process_one_work+0x1aa/0x340
        [72747.567202]  worker_thread+0x30/0x390
        [72747.567205]  ? create_worker+0x1a0/0x1a0
        [72747.567208]  kthread+0x116/0x130
        [72747.567211]  ? kthread_park+0x80/0x80
        [72747.567214]  ret_from_fork+0x1f/0x30
      
        [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
        [72747.569689] Call Trace:
        [72747.569691]  __schedule+0x296/0x760
        [72747.569694]  schedule+0x3c/0xa0
        [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569725]  ? finish_wait+0x80/0x80
        [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
        [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
        [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
        [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
        [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
        [72747.569836]  ? __kmalloc+0x16a/0x2c0
        [72747.569839]  do_iter_readv_writev+0x160/0x1c0
        [72747.569843]  do_iter_write+0x80/0x1b0
        [72747.569847]  vfs_writev+0x84/0x140
        [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
        [72747.569873]  do_writev+0x65/0x100
        [72747.569876]  do_syscall_64+0x33/0x40
        [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
        [72747.569903] Call Trace:
        [72747.569906]  __schedule+0x296/0x760
        [72747.569909]  schedule+0x3c/0xa0
        [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569940]  ? finish_wait+0x80/0x80
        [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
        [72747.569989]  start_transaction+0x279/0x580 [btrfs]
        [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
        [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
        [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
        [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
        [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
        [72747.570126]  do_clone_file_range+0xed/0x200
        [72747.570131]  vfs_clone_file_range+0x37/0x110
        [72747.570134]  ioctl_file_clone+0x7d/0xb0
        [72747.570137]  do_vfs_ioctl+0x138/0x630
        [72747.570140]  __x64_sys_ioctl+0x62/0xc0
        [72747.570143]  do_syscall_64+0x33/0x40
        [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      So fix this by making the delalloc flush done during qgroup metadata
      reservation skip any inode flagged with BTRFS_INODE_NO_DELALLOC_FLUSH,
      meaning it is currently under such a special case of cloning an inline
      extent.
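
      Conceptually the fix is a check of this shape in the loop that flushes
      delalloc on behalf of the qgroup metadata reservation (a sketch; the
      actual change threads a flag through the existing flushing helpers):

        /* While iterating the inodes that have delalloc to flush: */
        if (test_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags))
                continue;   /* inode is in the middle of an inline extent clone */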
      
      The special cases for cloning inline extents were added in kernel 5.7
      by commit 05a5a762 ("Btrfs: implement full reflink support for
      inline extents"), while having qgroup metadata space reservation flushing
      delalloc when low on space was added in kernel 5.9 by commit
      c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get
      -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
      kernel backports.
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f9baa501
  3. 19 April 2021, 3 commits
    • btrfs: improve btree readahead for full send operations · ace75066
      By Filipe Manana
      Currently a full send operation uses the standard btree readahead when
      iterating over the subvolume/snapshot btree. While that brings good
      performance benefits, it could be improved in a few aspects for use
      cases such as full send operations, which are guaranteed to visit every
      node and leaf of a btree, in ascending and sequential order. The
      limitations of that standard btree readahead implementation are the
      following:
      
      1) It only triggers readahead for leaves that are physically close
         to the leaf being read, within a 64K range;
      
      2) It only triggers readahead for the next or previous leaves if the
         leaf being read is not currently in memory;
      
      3) It never triggers readahead for nodes.
      
      So add a new readahead mode that addresses all these points and use it
      for full send operations.
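
      On the btree search path the new mode boils down to roughly the
      following sketch (the READA_FORWARD_ALWAYS name for the new mode is an
      assumption here):

        struct btrfs_path *path;

        path = btrfs_alloc_path();
        if (!path)
                return -ENOMEM;
        /* A full send visits every node and leaf in order, so always read
         * ahead, not only on cache misses within a small physical range. */
        path->reada = READA_FORWARD_ALWAYS;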
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The durations of the full send operation in seconds were the following:
      
      Before this change:  217 seconds
      After this change:   205 seconds (-5.7%)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ace75066
    • btrfs: add btree read ahead for incremental send operations · 2ce73c63
      By Filipe Manana
      Currently we do not do btree read ahead when doing an incremental send,
      however we know that we will read and process any node or leaf in the
      send root that has a generation greater than the generation of the parent
      root. So triggering read ahead for such nodes and leaves is beneficial
      for an incremental send.

      This change does that: it triggers read ahead of any node or leaf in the
      send root that has a generation greater than the generation of the
      parent root. As for the parent root, no readahead is triggered because
      knowing in advance which nodes/leaves are going to be read is not so
      straightforward and there's often a large time window between visiting
      nodes or leaves of the parent root. So I opted to leave out the parent
      root, since triggering read ahead for its nodes/leaves did not seem to
      make a significant difference.
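
      Illustratively, the gating looks something like this when walking down
      a node of the send tree (the readahead helper name is hypothetical; the
      accessors are the usual extent buffer helpers):

        /* parent_root_gen: generation of the parent snapshot's root. */
        for (slot = 0; slot < btrfs_header_nritems(parent); slot++) {
                if (btrfs_node_ptr_generation(parent, slot) <= parent_root_gen)
                        continue;   /* shared with the parent snapshot, skip */
                readahead_node_child(parent, slot);   /* hypothetical helper */
        }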
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing incremental send..."
        start=$(date +%s)
        btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
        end=$(date +%s)
        echo
        echo "Incremental send took $((end - start)) seconds"
      
        umount $MNT
      
      Before this change, incremental send duration:
      
        with $initial_file_count == 200000:  51 seconds
        with $initial_file_count == 500000: 168 seconds
      
      After this change, incremental send duration:
      
        with $initial_file_count == 200000:   39 seconds (-26.7%)
        with $initial_file_count == 500000:  125 seconds (-29.4%)
      
      For $initial_file_count == 200000 there are 62600 nodes and leaves in the
      btree of the first snapshot, and 77759 nodes and leaves in the btree of
      the second snapshot. The root nodes were at level 2.
      
      While for $initial_file_count == 500000 there are 152476 nodes and leaves
      in the btree of the first snapshot, and 190511 nodes and leaves in the
      btree of the second snapshot. The root nodes were at level 2 as well.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2ce73c63
    • btrfs: add btree read ahead for full send operations · 19358b15
      By Filipe Manana
      When doing a full send we know that we are going to be reading every node
      and leaf of the send root, so we benefit from enabling read ahead for the
      btree.
      
      This change enables read ahead for full send operations only; incremental
      sends will have read ahead enabled in a different way by a separate patch.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing incremental send..."
        start=$(date +%s)
        btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
        end=$(date +%s)
        echo
        echo "Incremental send took $((end - start)) seconds"
      
        umount $MNT
      
      Before this change, full send duration:
      
        with $initial_file_count == 200000:  165 seconds
        with $initial_file_count == 500000:  407 seconds
      
      After this change, full send duration:
      
        with $initial_file_count == 200000:  149 seconds (-10.2%)
        with $initial_file_count == 500000:  353 seconds (-14.2%)
      
      For $initial_file_count == 200000 there are 62600 nodes and leaves in the
      btree of the first snapshot, while for $initial_file_count == 500000 there
      are 152476 nodes and leaves. The roots were at level 2.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      19358b15
  4. 26 February 2021, 1 commit
    • btrfs: use memcpy_[to|from]_page() and kmap_local_page() · 3590ec58
      By Ira Weiny
      There are many places where the pattern kmap/memcpy/kunmap occurs.
      
      This pattern was lifted to the core common functions
      memcpy_[to|from]_page().
      
      Use these new functions to reduce the code, eliminate direct uses of
      kmap, and leverage the new core functions' use of kmap_local_page().

      Also, there is one place where a kmap/memcpy is followed by an
      optional memset.  There we leave the kmap open coded to avoid remapping
      the page, but use kmap_local_page() directly.
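
      As an example of the transformation (a generic sketch, not one of the
      actual btrfs call sites):

        /* Before: open-coded map, copy, unmap. */
        char *kaddr = kmap(page);
        memcpy(kaddr + offset, src, len);
        kunmap(page);

        /* After: a single helper, which internally uses kmap_local_page(). */
        memcpy_to_page(page, offset, src, len);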
      
      Development of this patch was aided by the coccinelle script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
      //
      // NOTE: Offsets and other expressions may be more complex than what the script
      // will automatically generate.  Therefore a catchall rule is provided to find
      // the pattern which then must be evaluated by hand.
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      // Comments:
      // Options:
      
      //
      // simple memcpy version
      //
      @ memcpy_rule1 @
      expression page, T, F, B, Off;
      identifier ptr;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      -memcpy(ptr + Off, F, B);
      +memcpy_to_page(page, Off, F, B);
      |
      -memcpy(ptr, F, B);
      +memcpy_to_page(page, 0, F, B);
      |
      -memcpy(T, ptr + Off, B);
      +memcpy_from_page(T, page, Off, B);
      |
      -memcpy(T, ptr, B);
      +memcpy_from_page(T, page, 0, B);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memcpy_rule1
      @
      identifier memcpy_rule1.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      //
      // Some callers kmap without a temp pointer
      //
      @ memcpy_rule2 @
      expression page, T, Off, F, B;
      @@
      
      <+...
      (
      -memcpy(kmap(page) + Off, F, B);
      +memcpy_to_page(page, Off, F, B);
      |
      -memcpy(kmap(page), F, B);
      +memcpy_to_page(page, 0, F, B);
      |
      -memcpy(T, kmap(page) + Off, B);
      +memcpy_from_page(T, page, Off, B);
      |
      -memcpy(T, kmap(page), B);
      +memcpy_from_page(T, page, 0, B);
      )
      ...+>
      -kunmap(page);
      // No need for the ptr variable removal
      
      //
      // Catch all
      //
      @ memcpy_rule3 @
      expression page;
      expression GenTo, GenFrom, GenSize;
      identifier ptr;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      //
      // Some call sites have complex expressions within the memcpy
      // match a catch all to be evaluated by hand.
      //
      -memcpy(GenTo, GenFrom, GenSize);
      +memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize);
      +memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memcpy_rule3
      @
      identifier memcpy_rule3.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      // </smpl>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3590ec58
  5. 09 February 2021, 2 commits
  6. 12 January 2021, 1 commit
    • btrfs: send: fix invalid clone operations when cloning from the same file and root · 518837e6
      By Filipe Manana
      When an incremental send finds an extent that is shared, it checks which
      file extent items in the range refer to that extent, and for those it
      emits clone operations, while for others it emits regular write operations
      to avoid corruption at the destination (as described and fixed by commit
      d906d49f ("Btrfs: send, fix file corruption due to incorrect cloning
      operations")).
      
      However, when the root we are cloning from is the send root, we are
      cloning from the inode currently being processed, and the source file
      range has several extent items that partially point to the desired
      extent, with an offset smaller than the offset in the file extent item
      for the range we want to clone into, the algorithm can issue a clone
      operation that starts at the current eof of the file being processed on
      the receiver side, in which case the receiver will fail, with EINVAL,
      when attempting to execute the clone operation.
      
      Example reproducer:
      
        $ cat test-send-clone.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        # Create our test file with a single and large extent (1M) and with
        # different content for different file ranges that will be reflinked
        # later.
        xfs_io -f \
               -c "pwrite -S 0xab 0 128K" \
               -c "pwrite -S 0xcd 128K 128K" \
               -c "pwrite -S 0xef 256K 256K" \
               -c "pwrite -S 0x1a 512K 512K" \
               $MNT/foobar
      
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Now do a series of changes to our file such that we end up with
        # different parts of the extent reflinked into different file offsets
        # and we overwrite a large part of the extent too, so no file extent
        # items refer to that part that was overwritten. This used to confuse
        # the algorithm used by the kernel to figure out which file ranges to
        # clone, making it attempt to clone from a source range starting at
        # the current eof of the file, resulting in the receiver failing since
        # it is an invalid clone operation.
        #
        xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
               -c "reflink $MNT/foobar 0K 512K 256K" \
               -c "reflink $MNT/foobar 512K 128K 256K" \
               -c "pwrite -S 0x73 384K 640K" \
               $MNT/foobar
      
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        echo -e "\nFile digest in the original filesystem:"
        md5sum $MNT/snap2/foobar
      
        # Now unmount the filesystem, create a new one, mount it and try to
        # apply both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        btrfs receive -f /tmp/snap1.send $MNT
        btrfs receive -f /tmp/snap2.send $MNT
      
        # Must match what we got in the original filesystem of course.
        echo -e "\nFile digest in the new filesystem:"
        md5sum $MNT/snap2/foobar
      
        umount $MNT
      
      When running the reproducer, the incremental send operation fails due to
      an invalid clone operation:
      
        $ ./test-send-clone.sh
        wrote 131072/131072 bytes at offset 0
        128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
        wrote 131072/131072 bytes at offset 131072
        128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
        wrote 262144/262144 bytes at offset 262144
        256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
        wrote 524288/524288 bytes at offset 524288
        512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        linked 983040/983040 bytes at offset 1048576
        960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
        linked 262144/262144 bytes at offset 524288
        256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
        linked 262144/262144 bytes at offset 131072
        256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
        wrote 655360/655360 bytes at offset 393216
        640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
      
        File digest in the original filesystem:
        9c13c61cb0b9f5abf45344375cb04dfa  /mnt/sdi/snap2/foobar
        At subvol snap1
        At snapshot snap2
        ERROR: failed to clone extents to foobar: Invalid argument
      
        File digest in the new filesystem:
        132f0396da8f48d2e667196bff882cfc  /mnt/sdi/snap2/foobar
      
      The clone operation is invalid because its source range starts at the
      current eof of the file in the receiver, causing the receiver to get
      an EINVAL error from the clone operation when attempting it.
      
      For the example above, what happens is the following:
      
      1) When processing the extent at file offset 1M, the algorithm checks that
         the extent is shared and can be (fully or partially) found at file
         offset 0.
      
         At this point the file has a size (and eof) of 1M at the receiver;
      
      2) It finds that our extent item at file offset 1M has a data offset of
         64K and, since the file extent item at file offset 0 has a data offset
         of 0, it issues a clone operation, from the same file and root, that
         has a source range offset of 64K, destination offset of 1M and a length
         of 64K, since the extent item at file offset 0 refers only to the first
         128K of the shared extent.
      
         After this clone operation, the file size (and eof) at the receiver is
         increased from 1M to 1088K (1M + 64K);
      
      3) Now there's still 896K (960K - 64K) of data left to clone or write, so
         it checks for the next file extent item, which starts at file offset
         128K. This file extent item has a data offset of 0 and a length of
         256K, so a clone operation with a source range offset of 256K, a
         destination offset of 1088K (1M + 64K) and length of 128K is issued.
      
         After this operation the file size (and eof) at the receiver increases
         from 1088K to 1216K (1088K + 128K);
      
      4) Now there's still 768K (896K - 128K) of data left to clone or write, so
         it checks for the next file extent item, located at file offset 384K.
         This file extent item points to a different extent, not the one we want
         to clone, with a length of 640K. So we issue a write operation into the
         file range 1216K (1088K + 128K, end of the last clone operation), with
         a length of 640K and with data matching what we can find for that
         range in the send root.
      
         After this operation, the file size (and eof) at the receiver increases
         from 1216K to 1856K (1216K + 640K);
      
      5) Now there's still 128K (768K - 640K) of data left to clone or write, so
         we look into the file extent item, which is for file offset 1M and it
         points to the extent we want to clone, with a data offset of 64K and a
         length of 960K.
      
         However this matches the file offset we started with, the start of the
         range to clone into. So we can't for sure find any file extent item
         from here onwards with the rest of the data we want to clone, yet we
         proceed and since the file extent item points to the shared extent,
         with a data offset of 64K, we issue a clone operation with a source
         range starting at file offset 1856K, which matches the file extent
         item's offset, 1M, plus the amount of data cloned and written so far,
         which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
         operation is invalid since the source range offset matches the current
         eof of the file in the receiver. We should have stopped looking for
         extents to clone at this point and instead fall back to a write
         operation, which would simply contain the data in the file range from
         1856K to 1856K + 128K.
      
      So fix this by stopping the loop that looks for file ranges to clone at
      clone_range() when we reach the current eof of the file being processed,
      if we are cloning from the same file and using the send root as the clone
      root. This ensures any data not yet cloned will be sent to the receiver
      through a write operation.
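
      The stop condition is roughly the following check inside the
      clone_range() loop (field names follow the description above and may
      differ slightly from the actual send.c code):

        if (clone_root->root == sctx->send_root &&
            clone_root->ino == sctx->cur_ino &&
            clone_root->offset >= sctx->cur_inode_next_write_offset)
                break;   /* source would start at the receiver's current eof */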
      
      A test case for fstests will follow soon.
      Reported-by: Massimo B. <massimo.b@gmx.net>
      Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
      Fixes: 11f2069c ("Btrfs: send, allow clone operations within the same file")
      CC: stable@vger.kernel.org # 5.5+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      518837e6
  7. 18 December 2020, 1 commit
    • btrfs: send: fix wrong file path when there is an inode with a pending rmdir · 0b3f407e
      By Filipe Manana
      When doing an incremental send, if we have a new inode that happens to
      have the same number that an old directory inode had in the base snapshot
      and that old directory has a pending rmdir operation, we end up computing
      a wrong path for the new inode, causing the receiver to fail.
      
      Example reproducer:
      
        $ cat test-send-rmdir.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        mkdir $MNT/dir
        touch $MNT/dir/file1
        touch $MNT/dir/file2
        touch $MNT/dir/file3
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- dir/                           (ino 257)
        #         |----- file1                  (ino 258)
        #         |----- file2                  (ino 259)
        #         |----- file3                  (ino 260)
        #
      
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Now remove our directory and all its files.
        rm -fr $MNT/dir
      
        # Unmount the filesystem and mount it again. This is to ensure that
        # the next inode that is created ends up with the same inode number
        # that our directory "dir" had, 257, which is the first free "objectid"
        # available after mounting again the filesystem.
        umount $MNT
        mount $DEV $MNT
      
        # Now create a new file (it could be a directory as well).
        touch $MNT/newfile
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- newfile                        (ino 257)
        #
      
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to apply
        # both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        btrfs receive -f /tmp/snap1.send $MNT
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, the receive operation for the incremental stream
      fails:
      
        $ ./test-send-rmdir.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: chown o257-9-0 failed: No such file or directory
      
      So fix this by tracking directories that have a pending rmdir by inode
      number and generation number, instead of only inode number.
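
      A minimal sketch of the idea, assuming the pending-rmdir bookkeeping
      lives in a small per-directory structure (names approximate, not a
      verbatim excerpt):

        /*
         * Hedged sketch: a pending rmdir entry is now matched on both the
         * inode number and its generation, so a new inode that merely reuses
         * the number of the deleted directory does not hit a stale entry.
         */
        struct orphan_dir_info {
                struct rb_node node;
                u64 ino;
                u64 gen;        /* generation, added by this fix */
        };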
      
      A test case for fstests follows soon.
      Reported-by: NMassimo B. <massimo.b@gmx.net>
      Tested-by: NMassimo B. <massimo.b@gmx.net>
      Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0b3f407e
  8. 08 December 2020, 1 commit
  9. 07 October 2020, 9 commits
    • F
      btrfs: send, recompute reference path after orphanization of a directory · 9c2b4e03
      Committed by Filipe Manana
      During an incremental send, when an inode has multiple new references we
      might end up emitting rename operations for orphanizations that have a
      source path that is no longer valid due to a previous orphanization of
      some directory inode. This causes the receiver to fail since it tries
      to rename a path that does not exist.
      
      Example reproducer:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/f1
        touch /mnt/sdi/f2
        mkdir /mnt/sdi/d1
        mkdir /mnt/sdi/d1/d2
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- f1                   (ino 257)
        # |----- f2                   (ino 258)
        # |----- d1/                  (ino 259)
        #        |----- d2/           (ino 260)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now do a series of changes such that:
        #
        # *) inode 258 has one new hardlink and the previous name changed
        #
        # *) both names conflict with the old names of two other inodes:
        #
        #    1) the new name "d1" conflicts with the old name of inode 259,
        #       under directory inode 256 (root)
        #
        #    2) the new name "d2" conflicts with the old name of inode 260
        #       under directory inode 259
        #
        # *) inodes 259 and 260 now have the old names of inode 258
        #
        # *) inode 257 is now located under inode 260 - an inode with a number
        #    smaller than the inode (258) for which we created a second hard
        #    link and swapped its names with inodes 259 and 260
        #
        ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
        mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1
      
        # Swap d1 and f2.
        mv /mnt/sdi/d1 /mnt/sdi/tmp
        mv /mnt/sdi/f2 /mnt/sdi/d1
        mv /mnt/sdi/tmp /mnt/sdi/f2
      
        # Swap d2 and f2_link
        mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
        mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
        mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link
      
        # Filesystem now looks like:
        #
        # .                                (ino 256)
        # |----- d1                        (ino 258)
        # |----- f2/                       (ino 259)
        #        |----- f2_link/           (ino 260)
        #        |       |----- f1         (ino 257)
        #        |
        #        |----- d2                 (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When executed the receive of the incremental stream fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory
      
      This happens because:
      
      1) When processing inode 257 we end up computing the name for inode 259
         because it is an ancestor in the send snapshot, and at that point it
         still has its old name, "d1", from the parent snapshot because inode
         259 was not yet processed. We then cache that name, which is valid
         until we start processing inode 259 (or set the progress to 260 after
         processing its references);
      
      2) Later we start processing inode 258 and collecting all its new
         references into the list sctx->new_refs. The first reference in the
         list happens to be the reference for name "d1" while the reference for
         name "d2" is next (the last element of the list).
         We compute the full path "d1/d2" for this second reference and store
         it in the reference (its ->full_path member). The path used for the
         new parent directory was "d1" and not "f2" because inode 259, the
         new parent, was not yet processed;
      
      3) When we start processing the new references at process_recorded_refs()
         we start with the first reference in the list, for the new name "d1".
         Because there is a conflicting inode that was not yet processed, which
         is directory inode 259, we orphanize it, renaming it from "d1" to
         "o259-6-0";
      
      4) Then we start processing the new reference for name "d2", and we
         realize it conflicts with the reference of inode 260 in the parent
         snapshot. So we issue an orphanization operation for inode 260 by
         emitting a rename operation with a destination path of "o260-6-0"
         and a source path of "d1/d2" - this source path is the value we
         stored in the reference earlier at step 2), corresponding to the
         ->full_path member of the reference, however that path is no longer
         valid due to the orphanization of the directory inode 259 in step 3).
         This makes the receiver fail since the path does not exist; it
         should have been "o259-6-0/d2".
      
      Fix this by recomputing the full path of a reference before emitting an
      orphanization if we previously orphanized any directory, since that
      directory could be a parent in the new path. This is a rare scenario, so
      we keep it simple and do not check whether that previously orphanized
      directory is in fact an ancestor of the inode we are trying to orphanize.
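
      A hedged sketch of the fix (the helper and flag names are assumptions
      used for illustration, not the exact kernel code):

        /* 'orphanized_dir' is assumed to be set once any conflicting
         * directory has been orphanized while processing this inode. */
        if (orphanized_dir) {
                ret = refresh_ref_path(sctx, cur);  /* recompute ->full_path */
                if (ret < 0)
                        goto out;
        }
        ret = orphanize_inode(sctx, ow_inode, ow_gen, cur->full_path);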
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c2b4e03
    • F
      btrfs: send, orphanize first all conflicting inodes when processing references · 98272bb7
      Committed by Filipe Manana
      When doing an incremental send it is possible that when processing the new
      references for an inode we end up issuing rename or link operations that
      have an invalid path, which contains the orphanized name of a directory
      before we actually orphanized it, causing the receiver to fail.
      
      The following reproducer triggers such scenario:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        mkfs.btrfs -f /dev/sdi >/dev/null
        mount /dev/sdi /mnt/sdi
      
        touch /mnt/sdi/a
        touch /mnt/sdi/b
        mkdir /mnt/sdi/testdir
        # We want "a" to have a lower inode number than "testdir" (257 vs 259).
        mv /mnt/sdi/a /mnt/sdi/testdir/a
      
        # Filesystem looks like:
        #
        # .                           (ino 256)
        # |----- testdir/             (ino 259)
        # |          |----- a         (ino 257)
        # |
        # |----- b                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
        btrfs send -f /tmp/snap1.send /mnt/sdi/snap1
      
        # Now rename 259 to "testdir_2", then change the name of 257 to
        # "testdir" and make it a direct descendant of the root inode (256).
        # Also create a new link for inode 257 with the old name of inode 258.
        # By swapping the names and locations of several inodes we create a
        # nasty dependency chain of rename and link operations.
        mv /mnt/sdi/testdir/a /mnt/sdi/a2
        touch /mnt/sdi/testdir/a
        mv /mnt/sdi/b /mnt/sdi/b2
        ln /mnt/sdi/a2 /mnt/sdi/b
        mv /mnt/sdi/testdir /mnt/sdi/testdir_2
        mv /mnt/sdi/a2 /mnt/sdi/testdir
      
        # Filesystem now looks like:
        #
        # .                            (ino 256)
        # |----- testdir_2/            (ino 259)
        # |          |----- a          (ino 260)
        # |
        # |----- testdir               (ino 257)
        # |----- b                     (ino 257)
        # |----- b2                    (ino 258)
      
        btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
        btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2
      
        mkfs.btrfs -f /dev/sdj >/dev/null
        mount /dev/sdj /mnt/sdj
      
        btrfs receive -f /tmp/snap1.send /mnt/sdj
        btrfs receive -f /tmp/snap2.send /mnt/sdj
      
        umount /mnt/sdi
        umount /mnt/sdj
      
      When running the reproducer, the receive of the incremental send stream
      fails:
      
        $ ./reproducer.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: link b -> o259-6-0/a failed: No such file or directory
      
      The problem happens because of the following:
      
      1) Before we start iterating the list of new references for inode 257,
         we generate its current path and store it at @valid_path, done at
         the very beginning of process_recorded_refs(). The generated path
         is "o259-6-0/a", containing the orphanized name for inode 259;
      
      2) Then we iterate over the list of new references, which has the
         references "b" and "testdir" in that specific order;
      
      3) We process reference "b" first, because it is in the list before
         reference "testdir". We then issue a link operation to create
         the new reference "b" using a target path corresponding to the
         content at @valid_path, which corresponds to "o259-6-0/a".
         However we haven't yet orphanized inode 259; its name is still
         "testdir" and not "o259-6-0". The orphanization of 259 has not
         happened yet because we only process the reference named "testdir"
         for inode 257 in the next iteration of the loop that goes over the
         list of new references.
      
      Fix the issue by adding a preliminary iteration over all the new
      references at process_recorded_refs(). This iteration is responsible
      only for orphanizing other inodes that have an old reference conflicting
      with one of the new references of the inode we are currently processing.
      The rename and link operations are now emitted in a second iteration
      over the new references, as sketched below.
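
      A hedged sketch of the two-pass structure (the conflict-check and
      emission helper names are hypothetical):

        /* Pass 1: only orphanize inodes whose old name conflicts with one of
         * the new references, so paths generated later are valid. */
        list_for_each_entry(cur, &sctx->new_refs, list) {
                if (ref_conflicts_with_unprocessed_inode(sctx, cur))
                        orphanize_conflicting_inode(sctx, cur);
        }
        /* Pass 2: now emit the link and rename operations as before. */
        list_for_each_entry(cur, &sctx->new_refs, list)
                emit_link_or_rename(sctx, cur);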
      
      A test case for fstests will follow soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98272bb7
    • D
      btrfs: send: use helpers for unaligned access to header members · e2f896b3
      Committed by David Sterba
      The header is mapped onto the send buffer, so its members may be
      unaligned. Use the unaligned access helpers instead of assigning through
      the pointers directly. This has worked so far, but using the helpers
      makes the intent clear.
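
      For illustration, a hedged sketch of the pattern (header field names
      assumed from struct btrfs_cmd_header; not the exact diff):

        #include <asm/unaligned.h>

        struct btrfs_cmd_header *hdr = (struct btrfs_cmd_header *)sctx->send_buf;

        /* Store through the helpers: the buffer offset may not be aligned. */
        put_unaligned_le32(sctx->send_size - sizeof(*hdr), &hdr->len);
        put_unaligned_le16(cmd, &hdr->cmd);
        put_unaligned_le32(crc, &hdr->crc);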
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e2f896b3
    • D
      btrfs: use kvcalloc for allocation in btrfs_ioctl_send() · bae12df9
      Committed by Denis Efremov
      Replace the kvzalloc() call with kvcalloc(), which also checks the size
      calculation for overflow internally. There's a standalone overflow check
      in the function so we can return an error for an invalid parameter
      combination.  Use the array_size() helper to compute the memory size for
      clone_sources_tmp.
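
      A hedged sketch of the resulting allocation pattern (variable names
      follow the ones mentioned above):

        /* kvcalloc() checks the count * size multiplication for overflow. */
        clone_sources_tmp = kvcalloc(arg->clone_sources_count,
                                     sizeof(*arg->clone_sources),
                                     GFP_KERNEL);
        if (!clone_sources_tmp) {
                ret = -ENOMEM;
                goto out;
        }

        /* array_size() saturates instead of wrapping on overflow. */
        alloc_size = array_size(sizeof(*arg->clone_sources),
                                arg->clone_sources_count);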
      
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bae12df9
    • D
      btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send() · 8eb2fd00
      Committed by Denis Efremov
      btrfs_ioctl_send() used an open-coded kvzalloc implementation earlier.
      The code was accidentally replaced with a kzalloc() call [1]. Restore
      the original code by using kvzalloc() to allocate sctx->clone_roots.
      
      [1] https://patchwork.kernel.org/patch/9757891/#20529627
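
      A hedged sketch of the restored allocation (the extra slot for the send
      root is an assumption made for illustration):

        sctx->clone_roots = kvzalloc(array_size(sizeof(*sctx->clone_roots),
                                                arg->clone_sources_count + 1),
                                     GFP_KERNEL);
        if (!sctx->clone_roots) {
                ret = -ENOMEM;
                goto out;
        }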
      
      Fixes: 818e010b ("btrfs: replace opencoded kvzalloc with the helper")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8eb2fd00
    • O
      btrfs: send: use btrfs_file_extent_end() in send_write_or_clone() · c9a949af
      Committed by Omar Sandoval
      send_write_or_clone() basically has an open-coded copy of
      btrfs_file_extent_end() except that it (incorrectly) aligns to PAGE_SIZE
      instead of sectorsize. Fix and simplify the code by using
      btrfs_file_extent_end().
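
      As a rough, hedged illustration of what btrfs_file_extent_end() computes
      (only the inline case needs rounding, and it rounds to the sector size
      rather than PAGE_SIZE):

        /* Hedged sketch of the helper's logic, not a verbatim copy. */
        u64 end;

        if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_INLINE)
                end = ALIGN(key.offset + btrfs_file_extent_ram_bytes(leaf, fi),
                            fs_info->sectorsize);
        else
                end = key.offset + btrfs_file_extent_num_bytes(leaf, fi);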
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c9a949af
    • O
      btrfs: send: avoid copying file data · 8c7d9fe0
      Committed by Omar Sandoval
      send_write() currently copies from the page cache to sctx->read_buf, and
      then from sctx->read_buf to sctx->send_buf. Similarly, send_hole()
      zeroes sctx->read_buf and then copies from sctx->read_buf to
      sctx->send_buf. However, if we write the TLV header manually, we can
      copy to sctx->send_buf directly and get rid of sctx->read_buf.
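
      A hedged sketch of the idea (attribute and struct names assumed from the
      send stream format; not the exact patch):

        /* Write the TLV header for the data attribute in place, then read or
         * zero the payload directly into send_buf, skipping read_buf. */
        struct btrfs_tlv_header *hdr;

        hdr = (struct btrfs_tlv_header *)(sctx->send_buf + sctx->send_size);
        put_unaligned_le16(BTRFS_SEND_A_DATA, &hdr->tlv_type);
        put_unaligned_le16(len, &hdr->tlv_len);
        sctx->send_size += sizeof(*hdr);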
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8c7d9fe0
    • O
      btrfs: send: get rid of i_size logic in send_write() · a9b2e0de
      Committed by Omar Sandoval
      send_write()/fill_read_buf() have some logic for avoiding reading past
      i_size. However, everywhere that we call
      send_write()/send_extent_data(), we've already clamped the length down
      to i_size. Get rid of the i_size handling, which simplifies the next
      change.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a9b2e0de
    • D
      btrfs: send: remove indirect callback parameter for changed_cb · 1b51d6fc
      Committed by David Sterba
      There's a custom callback passed to btrfs_compare_trees which happens to
      be named exactly the same as the existing function implementing it. This
      is confusing and the indirection is not necessary for our needs. The
      compiler is clever enough to call it directly so there's effectively no
      change.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1b51d6fc
  10. 25 May 2020, 3 commits
    • D
      btrfs: simplify iget helpers · 0202e83f
      Committed by David Sterba
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root and thus the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only the inode number, renaming the
      parameter to 'ino' instead of 'objectid'. This allows removing the
      temporary key variables, saving some stack space.
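
      A hedged sketch of the simplified prototype (assumed from the
      description above, not a verbatim header excerpt):

        /* Only the inode number is passed now; the full key used to be. */
        struct inode *btrfs_iget(struct super_block *s, u64 ino,
                                 struct btrfs_root *root);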
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0202e83f
    • D
      btrfs: simplify root lookup by id · 56e9357a
      Committed by David Sterba
      The main function to look up a root by its id, btrfs_get_fs_root, takes
      the whole key while only using the objectid. The value of offset is
      preset to (u64)-1 but not actually used until btrfs_find_root, which
      does the actual search.
      
      Switch btrfs_get_fs_root to use only the objectid and remove all local
      variables that existed just for the lookup. The actual key for the
      search is set up in btrfs_get_fs_root, reusing another key variable.
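
      Correspondingly, a hedged sketch of a lookup call with the reduced
      arguments (the boolean is assumed to be the existing check_ref flag):

        root = btrfs_get_fs_root(fs_info, BTRFS_FS_TREE_OBJECTID, true);
        if (IS_ERR(root))
                return PTR_ERR(root);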
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      56e9357a
    • M
      btrfs: send: emit file capabilities after chown · 89efda52
      Committed by Marcos Paulo de Souza
      Whenever a chown is executed, all capabilities of the file being touched
      are lost.  When doing an incremental send with a file with capabilities,
      there is a situation where the capability can be lost on the receiving
      side. The sequence of actions below shows the problem:
      
        $ mount /dev/sda fs1
        $ mount /dev/sdb fs2
      
        $ touch fs1/foo.bar
        $ setcap cap_sys_nice+ep fs1/foo.bar
        $ btrfs subvolume snapshot -r fs1 fs1/snap_init
        $ btrfs send fs1/snap_init | btrfs receive fs2
      
        $ chgrp adm fs1/foo.bar
        $ setcap cap_sys_nice+ep fs1/foo.bar
      
        $ btrfs subvolume snapshot -r fs1 fs1/snap_complete
        $ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
      
        $ btrfs send fs1/snap_complete | btrfs receive fs2
        $ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
      
      At this point, only a chown was emitted by "btrfs send" since only the
      group was changed. This causes the cap_sys_nice capability to be dropped
      from fs2/snap_incremental/foo.bar.
      
      To fix that, only emit capabilities after the chown is emitted. The
      current code first checks for xattrs that are new/changed, emits them,
      and later emits the chown. Now, __process_new_xattr skips capabilities,
      letting only finish_inode_if_needed emit them, if they exist, for the
      inode being processed.
      
      This behavior was being worked around on the "btrfs receive" side by
      caching the capability and only applying it after the chown. Now, xattrs
      are only emitted _after_ the chown, making that workaround unnecessary.
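
      A hedged sketch of the new ordering in finish_inode_if_needed() (the
      send_capabilities() helper name and the uid/gid variables are
      assumptions for illustration):

        /* Emit the chown first ... */
        ret = send_chown(sctx, sctx->cur_ino, sctx->cur_inode_gen,
                         left_uid, left_gid);
        if (ret < 0)
                goto out;

        /* ... then re-emit the security.capability xattr, if the inode has one. */
        ret = send_capabilities(sctx);
        if (ret < 0)
                goto out;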
      
      Link: https://github.com/kdave/btrfs-progs/issues/202
      CC: stable@vger.kernel.org # 4.4+
      Suggested-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      89efda52
  11. 10 May 2020, 1 commit
  12. 24 March 2020, 6 commits
  13. 31 January 2020, 1 commit
    • F
      Btrfs: send, fix emission of invalid clone operations within the same file · 9722b101
      Committed by Filipe Manana
      When doing an incremental send and a file has extents shared with itself
      at different file offsets, it's possible for send to emit clone operations
      that will fail at the destination because the source range goes beyond the
      file's current size. This happens when the file size has increased in the
      send snapshot, there is a hole between the shared extents and both shared
      extents are at file offsets which are greater than the file's size in
      the parent snapshot.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt/sdb
      
        $ xfs_io -f -c "pwrite -S 0xf1 0 64K" /mnt/sdb/foobar
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/base
        $ btrfs send -f /tmp/1.snap /mnt/sdb/base
      
        # Create a 320K extent at file offset 512K.
        $ xfs_io -c "pwrite -S 0xab 512K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0xcd 576K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0xef 640K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0x64 704K 64K" /mnt/sdb/foobar
        $ xfs_io -c "pwrite -S 0x73 768K 64K" /mnt/sdb/foobar
      
        # Clone part of that 320K extent into a lower file offset (192K).
        # This file offset is greater than the file's size in the parent
        # snapshot (64K). Also the clone range is a bit behind the offset of
        # the 320K extent so that we leave a hole between the shared extents.
        $ xfs_io -c "reflink /mnt/sdb/foobar 448K 192K 192K" /mnt/sdb/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdb /mnt/sdb/incr
        $ btrfs send -p /mnt/sdb/base -f /tmp/2.snap /mnt/sdb/incr
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ btrfs receive -f /tmp/1.snap /mnt/sdc
        $ btrfs receive -f /tmp/2.snap /mnt/sdc
        ERROR: failed to clone extents to foobar: Invalid argument
      
      The problem is that after processing the extent at file offset 256K, which
      refers to the first 128K of the 320K extent created by the buffered write
      operations, we have 'cur_inode_next_write_offset' set to 384K, which
      corresponds to the end offset of the partially shared extent (256K + 128K)
      and to the current file size in the receiver. Then when we process the
      extent at offset 512K, we do extent backreference iteration to figure out
      if we can clone the extent from some other inode or from the same inode,
      and we consider the extent at offset 256K of the same inode as a valid
      source for a clone operation, which is not correct because at that point
      the current file size in the receiver is 384K, which corresponds to the
      end of the last processed extent (at file offset 256K), so using a clone
      source range from 256K to 256K + 320K is invalid because that goes past
      the current size of the file (384K) - this makes the receiver get an
      -EINVAL error when attempting the clone operation.
      
      So fix this by excluding clone sources that have a range that goes beyond
      the current file size in the receiver when iterating extent backreferences.
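
      A hedged sketch of the exclusion inside the backref iteration callback
      (variable names are approximations; cur_inode_next_write_offset is the
      field mentioned above):

        /* A backref in the file being sent is only a usable clone source if
         * its range ends at or before the receiver's current file size. */
        if (ino == sctx->cur_ino &&
            offset + ext_len > sctx->cur_inode_next_write_offset)
                return 0;       /* skip it and keep iterating backrefs */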
      
      A test case for fstests follows soon.
      
      Fixes: 11f2069c ("Btrfs: send, allow clone operations within the same file")
      CC: stable@vger.kernel.org # 5.5+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9722b101
  14. 13 December 2019, 1 commit
    • A
      btrfs: send: remove WARN_ON for readonly mount · fbd54297
      Committed by Anand Jain
      We log a warning if root::orphan_cleanup_state is not set to
      ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
      mounted read-only we skip the orphan item cleanup during the lookup and
      root::orphan_cleanup_state remains at the init state 0 instead of
      ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
      warning shown below.
      
        WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);
      
      WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
      ::
      RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
      ::
      Call Trace:
      ::
      _btrfs_ioctl_send+0x7b/0x110 [btrfs]
      btrfs_ioctl+0x150a/0x2b00 [btrfs]
      ::
      do_vfs_ioctl+0xa9/0x620
      ? __fget+0xac/0xe0
      ksys_ioctl+0x60/0x90
      __x64_sys_ioctl+0x16/0x20
      do_syscall_64+0x49/0x130
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reproducer:
        mkfs.btrfs -fq /dev/sdb
        mount /dev/sdb /btrfs
        btrfs subvolume create /btrfs/sv1
        btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
        umount /btrfs
        mount -o ro /dev/sdb /btrfs
        btrfs send /btrfs/ss1 -f /tmp/f
      
      The warning exists because having orphan inodes could confuse send and
      cause it to fail or produce incorrect streams.  The two cases that would
      cause such send failures, which are already fixed, are:
      
      1) Inodes that were unlinked - these are orphanized and remain with a
         link count of 0. These caused send operations to fail because it
         expected to always find at least one path for an inode. However this
         is no longer a problem since send is now able to deal with such
         inodes since commit 46b2f459 ("Btrfs: fix send failure when root
         has deleted files still open") and treats them as having been
         completely removed (the state after an orphan cleanup is performed).
      
      2) Inodes that were in the process of being truncated. These resulted in
         send not knowing about the truncation and potentially issuing write
         operations full of zeroes for the range from the new file size to the
         old file size. This is no longer a problem because we no longer
         create orphan items for truncation since commit f7e9e8fc ("Btrfs:
         stop creating orphan items for truncate").
      
      As such, before these commits the WARN_ON here provided a clue in case
      something went wrong. Instead of warning based on the
      root::orphan_cleanup_state value, it could have been more accurate to
      check whether there were actually any orphan items, and then warn only
      if any exist, but that would be more expensive to check. Since
      orphanized inodes no longer cause problems for send, just remove the
      warning.
      Reported-by: NChristoph Anton Mitterer <calestyo@scientia.net>
      Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
      CC: stable@vger.kernel.org # 4.19+
      Suggested-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbd54297
  15. 19 November 2019, 2 commits
    • F
      Btrfs: send, skip backreference walking for extents with many references · fd0ddbe2
      Committed by Filipe Manana
      Backreference walking, which is used by send to figure out whether it
      can issue clone operations instead of write operations, can be very slow
      and use too much memory when extents have many references. This change
      simply skips backreference walking when an extent has more than 64
      references, in which case we fall back to a write operation instead of a
      clone operation. This limit is conservative and in practice I observed
      no significant slowdown with up to 100 references and still low memory
      usage up to that limit.
      
      This is a temporary workaround until there are speedups in the backref
      walking code, and as such it does not attempt to add extra interfaces or
      knobs to tweak the threshold.
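
      A hedged sketch of the threshold check (the macro name is an assumption;
      the value 64 matches the text above):

        #define SEND_MAX_EXTENT_REFS    64

        /* Too many references: give up on finding a clone source so the
         * caller falls back to a plain write operation. */
        if (refs > SEND_MAX_EXTENT_REFS)
                return -ENOENT;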
      Reported-by: NAtemu <atemu.main@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd0ddbe2
    • F
      Btrfs: send, allow clone operations within the same file · 11f2069c
      Committed by Filipe Manana
      For send we currently skip clone operations when the source and
      destination files are the same. This is so because clone didn't support
      this case in its early days, but support for it was added back in May
      2013 by commit a96fbc72 ("Btrfs: allow file data clone within a
      file"). This change adds support for it.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdd /mnt/sdd
      
        $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
        $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar
      
        $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap
      
        $ mkfs.btrfs -f /dev/sde
        $ mount /dev/sde /mnt/sde
      
        $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde
      
      Without this change file foobar at the destination has a single 128Kb
      extent:
      
        $ filefrag -v /mnt/sde/snap/foobar
        Filesystem type is: 9123683e
        File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..      31:          0..        31:     32:             last,unknown_loc,delalloc,eof
        /mnt/sde/snap/foobar: 1 extent found
      
      With this we get a single 64Kb extent that is shared at file offsets 0
      and 64K, just like in the source filesystem:
      
        $ filefrag -v /mnt/sde/snap/foobar
        Filesystem type is: 9123683e
        File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..      15:       3328..      3343:     16:             shared
           1:       16..      31:       3328..      3343:     16:       3344: last,shared,eof
        /mnt/sde/snap/foobar: 2 extents found
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11f2069c
  16. 18 November 2019, 1 commit
  17. 08 October 2019, 1 commit
    • A
      btrfs: silence maybe-uninitialized warning in clone_range · 431d3988
      Committed by Austin Kim
      GCC throws the following warning message:
      
      ‘clone_src_i_size’ may be used uninitialized in this function
      [-Wmaybe-uninitialized]
       #define IS_ALIGNED(x, a)  (((x) & ((typeof(x))(a) - 1)) == 0)
                             ^
      fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
       u64 clone_src_i_size;
         ^
      The clone_src_i_size variable is only passed by reference in a call to
      get_inode_info().
      
      Silence the warning by initializing clone_src_i_size to 0.
      
      Note that the warning is a false positive and reported by older versions
      of GCC (e.g. 7.x) but not e.g. 9.x. As there have been numerous reports
      of it, the patch is applied. Setting clone_src_i_size to 0 does not
      otherwise make sense and would not do anything in case the code changes
      in the future.
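
      The change itself amounts to initializing the variable, roughly:

        u64 clone_src_i_size = 0;   /* silence the GCC 7 false positive */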
      Signed-off-by: NAustin Kim <austindh.kim@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      431d3988